The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.

Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

As foundation models advance and agent scaffolding becomes increasingly sophisticated, agents have demonstrated remarkable proficiency in complex, long-horizon coding tasks and even autonomous experiment execution. Despite their evolution from research assistants into autonomous research agents, these systems still exhibit significant limitations in field sensitivity, research ethics, and nuanced scientific judgment. Consequently, frontier agents remain unable to fully replace human researchers. To bridge this gap, we conceptualize the AARR (Act As a Real Researcher) benchmark series. Unlike ...

Jiayu Wang · 19 days ago

Time series Foundation Models based on Physics-Informed Synthetic Histories for Cold-Start Photovoltaic Forecasting

PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams

2026.23: Power Shifts

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

This is your laptop… on AI

Sycophantic Praise: Evaluating Excessive Praise in Language Models

Re-imagining ISO 26262 in the Age of Autonomous Vehicles: Enhancing Controllability through Transferability and Predictability

The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Discovering Multiscale Deep Formulas in Complex Systems via Neural-Guided Lambda Calculus

The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs

Video-Based Prediction of In-Flight Particle Characteristics in Atmospheric Plasma Spraying

Sparsely gated tiny linear experts

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling

The Proxy Benders Decomposition

M³Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions

Generative Modeling of Discrete Latent Structures via Dynamic Policy Gradients

Automatic, Debiased, and Invariant Counterfactual Generation under General Interventions

The Fitbit Air is a good wearable weighed down by a chatty AI "coach"

Online Pandora's Box for Contextual LLM Cascading

New York lawmakers pass one-year ban on new data centers

Making the Most of Limited Data: Score-Aware Training for Text-to-Music Generation

Unified Geometry-Guided ML-FTLE for Tracking Transient Chaos from Scalar Time Series

RhinoVLA Technical Report

Covariance Shrinkage via Stochastic Interpolation

Impact of Synthetic Lesional MR Images in Automated Focal Cortical Dysplasia Detection in Low-Data Scenarios

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests