The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.


Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

Context compression aims to shorten long context inputs with minimal information loss for LLM inference acceleration. While existing methods have shown promise, they typically rely on complex compression modules or compression-specific training, leaving the intrinsic capabilities of LLMs underexplored. In contrast, this work reveals that a thinking model itself can naturally compress long contexts by organizing task-relevant information. We thus derive Thinking as Compression (TaC), a new compression paradigm that treats thinking itself as compressed context. Without relying on specific dedic...

Guoxin Ma·1 month ago

r/LocalLLaMA· COMMUNITY

SWE-rebench Leaderboard (March, April and May 2026): GPT-5.5, Opus 4.7, Cursor (Composer 2.5), Kimi K2.6 and More

Hi all, Sorry for going missing — we’ve been collecting a larger, higher-quality set of more complex tasks. We’re excited to share a major leaderboard update covering the past three months. We’ve updated the **SWE-rebench leaderboard** with **110 fresh Python tasks** from GitHub PRs created in **March, April, and part of May**. The setup follows the standard SWE-bench format: models read real PR issues, edit code, run tests, and must make the full test suite pass. This time, instead of our usual monthly updates with a smaller number of tasks, we collected a larger batch so we could evalua...

u/CuriousPlatypus1881·1 month ago·41 pts / 26 comm

r/MachineLearning· COMMUNITY

AI-generated CUDA kernels silently break training and inference [R]

Last month NVIDIA released [SOL-ExecBench](https://research.nvidia.com/benchmarks/sol-execbench), a new benchmark of 235 production CUDA kernels lifted from DeepSeek, Qwen, Gemma, and Kimi. We took several top-ranked AI-generated submissions and tried using them in production workloads. Many of them broke, sometimes in surprising ways. One of those kernels is the fused embedding-gradient + RMSNorm backward pass, which runs at the end of every transformer training step. We took the fastest submission on the benchmark for it, and dropped it into the training loop of a small transformer. The ke...

u/laginimaineb·1 month ago·71 pts / 9 comm

arXiv (cs.AI/CL/LG)· ACADEMIA

Stage-wise Distortion-Perception Traversal in Zero-shot Inverse Problems with Diffusion Models

The distortion-perception (D-P) tradeoff is a fundamental phenomenon of Bayesian inverse problems, which characterizes the inherent tension between distortion performance and perceptual quality. Enabling flexible traversal of the D-P tradeoff at inference time is crucial for practical applications. Despite the recent success of diffusion models in zero-shot inverse problem solving, efficient and principled strategies for D-P traversal in diffusion-based inverse algorithms remain inadequately characterized. In this paper, we propose a stage-wise framework for realizing D-P traversal using a si...

Jiawei Zhang·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

Large language models (LLMs) are increasingly used for the automatic evaluation of generated text, yet most prior work focuses on English. Despite the growing demand for multilingual evaluation, extending LLM-based evaluators to multilingual settings remains challenging, particularly for low-resource languages and scenarios where in-domain data is scarce. This work explores several strategies for developing multilingual LLMs-as-a-judge, considering whether in-domain data is available for fine-tuning or not. We systematically analyze English, Spanish, and Basque, representing high-, mid-, and ...

Irune Zubiaga·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Beyond Binary Moral Judgment: Modeling Ethical Pluralism in AI

Critical decision-making in socially consequential spaces is increasingly involving AI systems at varying capacities. Yet, despite the ubiquity of autonomous systems, most approaches to handling autonomous moral decision-making resort to scalar or binary judgments. These methods are insufficient for acceptable moral reasoning, as they provide little explanation, leaving out imperative contextual and theoretical information that must be included to support accountability. For this, we propose a framework to model moral reasoning as a distribution over normative ethical theories or ethical plur...

Aisha Aijaz·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Understanding Generalization and Forgetting in In-Context Continual Learning

In-context learning (ICL) derives its power from enabling Large Language Models to adapt to new tasks via prompt-based reasoning alone, entirely bypassing the need for parameter updates. Existing theories primarily study ICL in single-task settings, while real-world prompts often contain sequences of heterogeneous tasks, leaving a gap in understanding whether Large Language Models implicitly perform continual learning during inference. To bridge this gap, we propose the first theoretical framework for in-context continual learning, modeling how a pretrained Transformer processes multiple sequ...

Guangyu Li·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Expressive Power of Floating-Point Neural Networks with Arbitrary Reduction Orders and Inexact Activation Implementations

Most existing expressivity theories for neural networks assume exact real arithmetic, whereas practical neural networks are executed under finite-precision floating-point arithmetic with implementation-dependent execution semantics. Recent works have begun studying the expressive power of floating-point neural networks, but existing results are limited to highly restricted activation functions and idealized assumptions such as fixed left-to-right reduction orders and correctly rounded activation implementations. In this work, we study the expressive power of floating-point neural networks und...

Yeachan Park·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

A Fresh Look at Lamarckian Evolution and the Baldwin Effect

Baldwinian and Lamarckian evolution have existed for a long time in evolutionary algorithms (EAs) without ever dominating the academic literature or practical applications. In this work, we use modern empirical and theoretical methods to revisit Lamarckian and Baldwinian evolution and rigorously compare them with the generic Darwinian evolution. On the empirical side, we run a comprehensive suite of experiments on graphs from six different datasets from the recent GraphBench benchmark on Maximum Independent Set and Maximum Cut problems. Our results show that Baldwinian and Lamarckian evolutio...

Inès Benito·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested on template-generated variants of GSM8K problems, concluding that the models lack genuine reasoning capabilities. We argue that this conclusion rests on shaky statistical ground. Re-evaluating 20 open-weight models using Generalised Linear Mixed Models with per-question random effects, we find that only half exhibit statistically significant performance changes under the original prompt format. Moreover, we identify a previously unacknowledged factor: the...

Dominika Agnieszka Długosz·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two paradigms remain difficult to combine. Directly applying single-agent reinforcement learning to multi-turn multi-agent systems faces the following dilemmas: i) sparse rewards, role-level free-riding, and excessive training overhead; ii) agents only imitate to collaborate; iii) a fixed collaboration protocol falls into an oscillating local optimum. We introduce TRACER, a turn-level reinforcement framework for cooperative multi-LLM reasoning. TRACER separates collaborative...

Chusen Li·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Deep Learning Strain Estimation: Is Physics-Based Simulation the Solution?

Speckle tracking echocardiography (STE) is the clinical standard for myocardial strain estimation. Despite good performance on global strain (GLS), its accuracy for regional strain remains limited, even though this biomarker is highly relevant for early diagnosis and the characterization of subtle abnormalities from clinical data. Deep learning is a promising alternative, but its development is constrained by the lack of reliable motion references. Existing solutions rely either on STE-derived labels or on simulations generated by physics-based models, but these synthetic sequences still hav...

Thierry Judge·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Misalignment Between Backpropagation and the Hierarchy of Brain Responses to Images

Backpropagation is the core learning mechanism underlying deep learning. However, whether and how this algorithm is implemented in the brain remains highly debated. In particular, while forward activations of pretrained models reliably map onto the cortical hierarchy of visual processing, it is unknown whether backpropagated gradients exhibit a similar correspondence. Here, we address this question using functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) recordings of human brain responses to natural images. For this, we extend standard encoding analyses of forward ...

Joséphine Raugel·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Latent-Conditioned Parameterized Quantum Circuits as Universal Approximators for Distributions over Quantum States

Many applications in quantum simulation, quantum chemistry, and quantum machine learning require not a single quantum state but an ensemble of states characterizing the heterogeneity of a target system. Preparing such ensembles state-by-state is prohibitive in both variational and fault-tolerant settings, motivating a generative-modeling approach. We introduce latent-conditioned parameterized quantum circuits (LPQCs), a hybrid quantum-classical framework in which classical neural networks map a latent variable sampled from a prior distribution to the parameters of a parameterized quantum circ...

Quoc Hoan Tran·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

History-aware adaptive reduced-order models via incremental singular value decomposition

Reduced-order models (ROMs) can accelerate high-dimensional dynamical simulations, but their accuracy often deteriorates when online dynamics leave the regime represented by offline training data. We develop a projection-based adaptive ROM framework based on incremental singular value decomposition (iSVD), in which occasional full-order operator evaluations provide correction snapshots for online basis updates. The intrusive ROMs considered here are fully parameterized by the basis, so each update naturally propagates to reduced operators and hyper-reduction machinery. Through its evolving si...

Amirpasha Hedayat·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora

Existing benchmarks have laid the foundation for travel planning agents by establishing API-centric paradigms. However, as the capabilities of Autonomous Agents continue to advance, their evaluation must evolve beyond simple tool execution toward handling the inherent complexities of the open web. Current benchmarks bypass core cognitive hurdles: they fail to account for information noise, ignore multi-source factual contradictions, and overlook the necessity of grounding visual perception into logical planning. We introduce VeriTrip, a verifiable benchmark designed to meet the increasing dem...

Yuting Xu·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

AI in the Workplace: The Impact of AI on Perceived Job Decency and Meaningfulness

The proliferation of Artificial Intelligence (AI) in workplaces is transforming how we work. While existing research on human-AI collaboration at work often prioritizes performance, less is known about their experiential outcomes. Through interviews with 24 employees across Information Technology (IT), service-based, and healthcare sectors, this paper examines AI's impact on job satisfaction via perceptions of job decency and meaningfulness, now and in the future. Our results reveal that the anticipated impact of AI on overall job satisfaction varies with the occupational domain, with differi...

Kuntal Ghosh·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Optimal ridge regularization revisited

We consider $L^2$-regularized linear (ridge) regression over a finite data sample $X$ with bounded covariance and linear prediction targets $y$ with additive isotropic noise of finite variance. We present an iterative procedure to compute the optimal regularization strength numerically from the generative parameters in the fixed-$X$ setting and prove its convergence at limited noise levels. Our experimental evaluation over synthetic data shows that the proposed procedure combined with sample-based parameter estimates attains near-optimal random-$X$ generalization across a wide range of sample...

Jack Timmermans·1 month ago
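For readers who want the baseline behind the ridge item above: a minimal sketch of ridge regression in its simplest setting, one feature and no intercept, where the estimate has the closed form w = Σxy / (Σx² + λ). This is not the paper's iterative procedure, which the abstract does not spell out. The function names, the unscaled penalty λ·w², and the grid search on held-out data are all assumptions made for illustration.

```python
def ridge_fit_1d(xs, ys, lam):
    """Closed-form ridge estimate for one feature, no intercept.

    Minimizes sum((y - w*x)**2) + lam * w**2, which gives
    w = sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def validation_mse(w, xs, ys):
    """Mean squared error of the prediction w*x on held-out data."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def pick_lambda(train, valid, grid):
    """Generic baseline (an assumption, not the paper's method):
    choose the lambda from `grid` with the lowest held-out error."""
    train_x, train_y = train
    valid_x, valid_y = valid
    return min(
        grid,
        key=lambda lam: validation_mse(
            ridge_fit_1d(train_x, train_y, lam), valid_x, valid_y
        ),
    )
```

With `lam=0` the estimate reduces to ordinary least squares; larger values shrink `w` toward zero.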

arXiv (cs.AI/CL/LG)· ACADEMIA

DREAM-R: Multimodal Speculative Reasoning with RL-Based Refined Drafting, Precise Verification, and Fully Parallel Execution

Speculative reasoning has recently been proposed as a means to accelerate reasoning-intensive generation in large multimodal models, but its effectiveness is often constrained by misalignment between speculative drafts and target-verified reasoning. In this work, we introduce DREAM-R, a framework that substantially improves the performance of speculative reasoning. At its core, DREAM-R employs Speculative Alignment Policy Optimization (SAPO), a reinforcement-learning objective that trains draft models to generate reasoning steps that are both faithful to target trajectories and concise. We fu...

Yunhai Hu·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Optimal Data Acquisition for Reinforcement Learning: A Large Deviations Perspective

Data acquisition efficiency is a central challenge in deploying reinforcement learning in business and healthcare operations, where interactions are costly, slow, and often involve humans in the loop. This paper develops a unified large deviations framework for data acquisition in infinite-horizon reinforcement learning. We introduce the exponential decay rate of the policy-selection error probability as a principled efficiency metric and derive a variational characterization of this rate via large deviations theory for Markov chains, yielding a nested optimization problem. Based on this char...

Mingjie Hu·1 month ago

TechCrunch AI· PRESS

AI coding startup Cognition raises $1B at $25B pre-money valuation

Cognition says it has reached $492 million in annualized revenue run rate and more than doubled its valuation in eight months.

Julie Bort·1 month ago

r/LocalLLaMA· COMMUNITY

KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche

Here's my article with **38 quant pairs** thoroughly benchmarked in KLD with **3 different Qwen 3.6 27B configs**: Q5\_K\_S + 64k context, IQ4\_XS + 64k context, IQ4\_XS + 128k context. This allows us to track not only how cache quantization affects precision in a vacuum, but also how it interacts with noise from the model itself. All benchmarks were done using my [BeeLlama.cpp](https://github.com/Anbeeld/beellama.cpp) fork, which makes it possible to include a number of quant types that are not present in mainline llama.cpp: vanilla TurboQuant, TCQ 3-bit/2-bit, and q6\_0. [https://anbeeld.com/articl...

u/Anbeeld·1 month ago·44 pts / 42 comm
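For context on the KLD figures in the post above: a minimal sketch of how a mean KL divergence between two runs could be computed, assuming "KLD" means the KL divergence between next-token distributions from a reference run and a run with a quantized KV cache. The post's exact measurement setup is not given in this excerpt, so the function names and the input format (one list of logits per token position) are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0
    )


def mean_kld(ref_logits, test_logits):
    """Average per-position KL divergence between two runs.

    ref_logits / test_logits: one list of logits per token position
    (hypothetical input format, for illustration only).
    """
    if not ref_logits or len(ref_logits) != len(test_logits):
        raise ValueError("runs must be non-empty and the same length")
    total = sum(
        kl_divergence(softmax(r), softmax(t))
        for r, t in zip(ref_logits, test_logits)
    )
    return total / len(ref_logits)
```

A value of 0 means the quantized run reproduces the reference distributions exactly; larger values indicate more drift.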

The Verge AI· PRESS

AI tried to bury this politician — now people have actually heard of him

Photo: NY-12 congressional candidate Alex Bores speaks during a campaign event. (Bloomberg via Getty Images)

By the time that the Democratic primary for New York's 12th congressional district wraps up in June, Anthropic and OpenAI will have spent millions on their battle over the political future of AI: who gets to regulate it, or who will be punished for trying to regulate it. But the real winner of their feud may be the guy they're currently fighting over: a once-obscure New York state assemblyman, who they've Streisand-effected into becoming the poster child for AI safety regulation. Ever since l...

Tina Nguyen·1 month ago

r/LocalLLaMA· COMMUNITY

Why are the AI Companies spreading F.U.D. about AI?

A couple of recent videos I have watched: [Billionaires Are Funding 'Anti AI' Content](https://www.youtube.com/watch?v=mzlu4FSXBNw) [AI Manufactured Doubt](https://www.youtube.com/watch?v=2SjgP8o-1LQ) (long but interesting take) **My tin foil hat take**: AI Companies understand that offline LLM hosting is becoming more viable for both individuals and companies. They are spreading the "AI is dangerous" message to get government regulators to pass laws to keep the people "safe" from the unbridled power of tokens and weights. They will use their lobbying with the FUD as ammunition to pass ...

u/supracode·1 month ago·43 pts / 52 comm

The Verge AI· PRESS

Robinhood will let your AI agent trade stocks and make (or lose) lots of money

Robinhood is opening its trading platform to AI agents. In an announcement on Wednesday, Robinhood says traders can now create a separate account for an AI agent and add a specific amount of money, allowing the agent to buy and sell stocks across the market. The company pitches the feature as a way for traders to automate investment decisions, such as having an agent monitor specific industries and make trades, or rebalancing an existing portfolio. But it comes with a big warning from Robinhood: Agentic trading involves significant risk, including the possible loss of your entire investment. ...

Emma Roth·1 month ago

TechCrunch AI· PRESS

ElevenLabs’s new music generation model can switch genres mid-track

ElevenLabs' new model will let users regenerate a section of a song without affecting the rest of the track.

Ivan Mehta·1 month ago

r/singularity· COMMUNITY

Astribot launches the T1, their wheeled humanoid robot with two pairs of grippers that can do a bit of everything

\*This is a capability demo, likely teleoperated for marketing.

u/Distinct-Question-16·1 month ago·104 pts / 53 comm

r/LocalLLaMA· COMMUNITY

I ran 8 open-weight models as agents in a persistent MMO for 10 days. Here's the 93k event dataset and some things that I learned

Howdy everyone! Quick disclosure: I work on this - it's a project my studio created called the Null Epoch. I wasn't really happy with testing my agents with the usual static benchmarks and I wanted to learn more about how models and agents handle long-horizon planning, resource contention, and adversarial pressure over days or weeks in a more dynamic situation. I also have a particular fondness for the MUDs and text based RPGs I grew up on (really dating myself here), so the whole MMO and the open source SDK/TUI are kind of modeled after that experience. It functions as a persistent stress t...

u/bopcrane·1 month ago·45 pts / 17 comm

r/singularity· COMMUNITY

A research group appears to have made a significant step towards programmable atomically precise manufacturing AKA Drexlerian nanotechnology

Link: arxiv.org

u/Buck-Nasty·1 month ago·200 pts / 20 comm

TechCrunch AI· PRESS

SOND, a sleep tech startup from Bose’s former head of sleep, exits stealth with $7M

SOND, a startup led by Bose’s former head of sleep products, emerged from stealth with $7M in funding for its AI-powered sleep earbuds.

Sarah Perez·1 month ago

