The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.

Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recalling a function name: models must coordinate signatures, module paths, input-output contracts, semantics, and executable usage patterns. Existing novel-API benchmarks are typically static, rely on coarse pass/fail metrics, or use synthetic APIs that may not reflect real library evolution. We introduce NovelAPIBench, a fully automated dynamic benchmark that, for any base model and target library, discovers novel APIs, extracts decomposed knowledge bundles, ...

Jinnuo Liu · 23 days ago

Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

Towards Non-Monotonic Entailment in Propositional Defeasible Standpoint Logic

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability

Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs

Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency

VidMsg: A Benchmark for Implicit Message Inference in Short Videos

AnchorMoE: Interpretable Time Series Classification via Anchor-Routed MoE

TSQAgent: Rating Time Series Data Quality via Dedicated Agentic Reasoning

Building Reliable Long-Form Generation via Hallucination Rejection Sampling

TurtleAI: Benchmarking Multimodal Models for Visual Programming in Turtle Graphics

Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

Physics-Guided Policy Optimization with Self-Distillation

Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

A 3D Isovist World Model -- Revealing a City's Unseen Geometry and Its Emergent Cross-City Signature

Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

Beyond the Literal: Decomposing Pragmatic Intent in Multimodal Meme Understanding

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery

DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair

Set-Preserving Calibration from Conformal P-Values to E-Values

PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models

Training a Predictive Coding Network on ImageNet using Equilibrium Propagation

AutoTail-BSFGM: Class-Balance-Aware Fine-Tuning for Chinese Scholarly Text Classification

Few-Shot Prediction for Pulsar Noise with Long Short-Term Memory Network

Gemini Spark is the most impressive and terrifying AI experience I’ve had yet

When Attention Collapses: Stage-Aware Visual Token Pruning from Structure to Semantics

Learned Non-Maximum Suppression for 3D Object Detection

ZeroDrift raises $10 million to protect AI models from themselves