Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
SpecValidator detects defective task descriptions in LLM code generation via lightweight parameter-efficient classifier.
Green Shielding: user-centric evaluation framework measuring LLM sensitivity to benign input variations via CUE criteria.
Persona Collapse in multi-agent LLM simulations: agents converge to homogeneous behavior despite distinct profiles; framework measures Coverage, Uniformity, Complexity.
SciCrafter: Minecraft benchmark evaluating agents' discovery-to-application loop via parameterized redstone circuit tasks.
Contextual Linear Activation Steering (CLAS) dynamically adapts steering strength per token, outperforming fixed-strength methods across 11 benchmarks.
Noise-Based Spectral Embedding (NBSE): physics-informed feature selection via Nishimori temperature without greedy search.
ProHist-Bench: LLM historical reasoning benchmark grounded in Chinese Imperial Examination system, evaluating evidentiary reasoning over lexical knowledge.
Informational Viability Principle for autonomous AI agent governance: runtime monitoring and restriction via unobserved risk bounds without code changes.
Benchmarks pathology foundation models on breast cancer survival prediction from histopathology images using standardized evaluation.
BMW case study: adapts LLMs to generate multi-file domain-specific language code for Xtext-based DSL with Java/TypeScript output.
Reddit discussion on peer review feedback distinguishing research papers from technical reports; meta-commentary on publication standards.
...that too while the task was in between - Claude is disappointing me these days... For context: I'm just reprompting something that was already generated - it was simply to edit the content
Luce DFlash: open-source speculative decoding for Qwen3.6-27B achieves 1.98x throughput on RTX 3090 via GGUF port.
Mathematical formulation of convolutional neural networks via presheaves/copresheaves on topological spaces to analyze architectural limitations.
Measures sycophancy (agreement bias) in LLMs deployed for agentic financial tasks under user contradiction.
Evaluates whether LLMs track Turkish evidential morphology conditioned on source trustworthiness in cloze tasks.
Proposes belief-space MPC for control of linear systems with bilinear observations where control affects state estimation quality.
Introduces DySIB method to learn low-dimensional state representations from high-dimensional time-series via information bottleneck principle.
Proposes agent-native research artifacts to eliminate storytelling and engineering taxes in scientific papers for AI agent consumption.
AgentWard: defense-in-depth lifecycle security architecture for autonomous AI agents spanning initialization through execution.
OpenAI and Microsoft's partnership-turned-situationship just got even less committed. And a clause about artificial general intelligence, which has for years dictated the future of their deal, has officially been dropped. On Monday morning, Microsoft announced a handful of big changes to its long-standing OpenAI deal. Microsoft will remain OpenAI's "primary cloud partner, and OpenAI products will ship first on Azure, unless Microsoft cannot and chooses not to support the necessary capabilities." But OpenAI can "now serve all its products to customers across any cloud provider." That lets Open...
DepthKV: layer-dependent KV cache pruning reduces memory for long-context LLM inference by pruning low-attention tokens per layer.
This story originally appeared in The Algorithm, our weekly newsletter on AI. In February, I picked up a flyer at an anti-AI march in London. I can’t say for sure whether or not its writers meant to riff on South Park’s underpants gnomes. But…
K-MetBench: expert-level multimodal benchmark for Korean weather forecasting LLMs, exposing gaps in visual reasoning and domain knowledge across 55 models.
Skye's new AI app attracted investors before it even launched — a sign of interest in a more AI-aware iPhone.
Functional Task Networks: neocortex-inspired parameter-isolation method for continual learning using self-organizing binary masks over expert networks.
Production case study: integrating small LMs (Gemma 4E2B, Qwen3 0.6B) into Android app reveals memory, latency, and quantization challenges for on-device deployment.
Computational screening of 5M photoactive PARP1 inhibitor candidates using ML and atomistic simulation for light-activated cancer drugs.
User reports GBNF grammar optimizations for Qwen 3.6 35B and 27B models improving coding task performance in llama.cpp.
Senator Josh Hawley questions Helen Toner (former OpenAI) on AI labor displacement and existential risk, signaling congressional scrutiny of frontier AI development.