Stripe introduces Link, a digital wallet that autonomous AI agents can use, too
Link lets users connect cards, banks, and subscriptions, then authorize AI agents to spend securely via approval flows.
STEF enables schema-agnostic evaluation of text-to-SQL agents in production without ground-truth queries, addressing real-world deployment gaps.
Study finds persona prompting in multimodal LLMs produces stable but limited behavioral variation in urban sentiment judgment tasks.
CARE methodology systematizes LLM agent engineering in scientific domains via three-party collaboration between SMEs, developers, and helper agents.
NVIDIA CUDA Tile (cuTile) is a tile-based programming model that enables developers to write GPU kernels in terms of tile-level operations—loads, stores, and matrix multiply-accumulate—rather than manually coordinating threads, warps, and shared memory. cuTile.jl brings the same tile-based approach to the dynamic programming language Julia. Users can write custom GPU kernels without dropping…
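To illustrate the tile-level decomposition the story describes (this is not cuTile.jl's actual API, just a NumPy sketch of the idea: a kernel expressed as tile loads, tile multiply-accumulates, and tile stores rather than per-thread index math):

```python
import numpy as np

TILE = 4  # illustrative tile edge length

def tiled_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Matrix multiply written as tile loads, multiply-accumulates, and stores."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % TILE == 0 and n % TILE == 0 and k % TILE == 0
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            acc = np.zeros((TILE, TILE), dtype=a.dtype)  # per-tile accumulator
            for p in range(0, k, TILE):
                # "load" two tiles and do a tile-level multiply-accumulate
                acc += a[i:i+TILE, p:p+TILE] @ b[p:p+TILE, j:j+TILE]
            c[i:i+TILE, j:j+TILE] = acc  # "store" the result tile
    return c
```

On a GPU, each (i, j) tile would map to a thread block; the point of the tile model is that the programmer writes the inner logic at this granularity and the compiler handles thread/warp coordination.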
Architectural pattern language for vision language agents balances latency/non-determinism of VLMs against real-time enterprise control requirements.
Comparative evaluation of three LLM agent paradigms (domain-specific, computer-use, coding) on scientific visualization tasks across 15 benchmarks.
D3-Gym dataset: 565 verifiable tasks from real scientific repositories for evaluating LLM agents on data-driven discovery.
Language model agents design mechanical linkages via symbolic lifting: LLMs explore topologies while numerical optimizers tune parameters, validated on six motion targets.
Survey of RL+GUI agents for long-horizon automation in visual interfaces; proposes framework toward autonomous digital inhabitants with safe exploration.
Schema-grounded external memory for agents outperforms text-retrieval approaches by enabling exact fact tracking, state updates, and structured queries.
Survey formalizing graph-based world models for agents, decomposing environments into entity nodes and edges to improve robustness vs. flat-tensor approaches.
On-demand persona-based agent generation framework enabling dynamic multi-agent workflow customization without hard-coded architectures.
Lightweight clinical agent architecture using integrated state dynamics to surface pre-escalation risk signals in LLM clinical deployment.
KellyBench: long-horizon sequential decision benchmark using 2023-24 Premier League sports betting; evaluates agents on non-stationary open-ended optimization.
Working on large codebases with Claude Code, we kept running into the same issue: when Claude looks for relevant code, it falls back to grep, reading full files, or launching multiple subagents. This burns through tokens, and often misses the relevant code. There are some existing solutions (that we also benchmarked against), but they all had issues (too slow, needs API keys, quality not good enough, etc). We built [Semble](https://github.com/MinishLab/semble) to fix this. It's a local MCP server that gives Claude Code high quality code search: instead of reading files to find what's relevan...
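The core idea behind replacing grep with semantic search can be sketched in a few lines. This is not Semble's implementation (which uses a local MCP server and proper embeddings); it is a minimal bag-of-words stand-in showing why similarity ranking finds relevant code that exact-match grep misses:

```python
import math
import re
from collections import Counter

def _vec(text: str) -> Counter:
    """Crude token-count vector over identifiers and words."""
    return Counter(re.findall(r"[a-zA-Z_]\w*", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query: str, snippets: dict[str, str], k: int = 3) -> list[str]:
    """Rank code snippets by similarity to the query instead of grepping files."""
    q = _vec(query)
    ranked = sorted(snippets, key=lambda name: _cosine(q, _vec(snippets[name])),
                    reverse=True)
    return ranked[:k]
```

A real system swaps `_vec` for learned embeddings so that, say, "auth check" still matches `verify_credentials` with no token overlap.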
tl;dr: your skill in AI is a measure of your **quality** and **scale**. Use **success criteria** and **subagents** intentionally to get excellent results. Use skills and .md docs when you find repeating patterns in your daily work, not before. --- **Quality** comes from telling the agent what outcome you want, and the **success criteria** that you will use to measure a “good” outcome. This helps avoid Claude's tendency to rush completion. Note this is specifically *not* telling it what to *do*, but instead what to *achieve*. If you come from the old world, you might remember terms like ...
I can have multiple, dense legal documents on my screen, each 40, 60, or 100+ pages, with the Claude Word add-in agents syncing, pushing and pulling information between them, pinging each other, and providing helpful context so that I can draft all three or four in parallel or ensure that an entire package is consistent. I can have a lengthy spreadsheet workbook open containing 10 worksheets and the information is analyzed and pulled in by the agents when needed. I am absolutely blown away at how well this is implemented and the improvement in quality, consistency and efficiency. It ...
ClawGym framework for systematic development of file/tool-based agents, includes 13.5K synthesized tasks for agent training and evaluation.
Operating and maintaining (O&M) large-scale online engine systems (search, recommendation, advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. While LLM-based agents are a natural fit for these tasks, the deployment bottleneck is not reasoning capability but orchestration: selecting, for each operational event, the relevant data (metrics, logs, change events) and the applicable operational knowledge (handbook rules and practitioner experience). Feeding all signals indiscriminately causes dilution and hallucination, while manually cura...
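The orchestration problem the abstract names — picking the relevant signals and handbook knowledge per event rather than feeding everything — can be sketched as a routing table. All names here (`OpsEvent`, `ROUTES`, `build_context`) are hypothetical, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class OpsEvent:
    kind: str     # e.g. "latency_alert", "release", "error_spike"
    service: str

# Hypothetical routing table: which data sources and handbook sections an
# agent should pull for each event kind, instead of ingesting all signals.
ROUTES: dict[str, dict[str, list[str]]] = {
    "latency_alert": {"data": ["metrics", "change_events"],
                      "knowledge": ["latency_runbook"]},
    "release":       {"data": ["change_events", "logs"],
                      "knowledge": ["release_checklist"]},
    "error_spike":   {"data": ["logs", "metrics"],
                      "knowledge": ["incident_playbook"]},
}

def build_context(event: OpsEvent) -> dict[str, list[str]]:
    """Select only the signals relevant to this event before prompting the agent."""
    route = ROUTES.get(event.kind, {"data": ["metrics"], "knowledge": []})
    return {
        "data": [f"{src}:{event.service}" for src in route["data"]],
        "knowledge": list(route["knowledge"]),
    }
```

Scoping the context this way is what keeps the agent's prompt free of the dilution and hallucination the paper attributes to indiscriminate signal feeding.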
FutureWorld: live environment for training LLM-based agents on real-world outcome prediction and continual learning.
Mistral AI launches Mistral Medium 3.5 with remote coding agents in Vibe and Work mode in Le Chat for complex tasks.
We visited Scout AI's training ground where it's working on AI agents that give individual soldiers control of fleets of autonomous vehicles.
DV-World benchmark evaluates data visualization agents across 260 real-world tasks spanning spreadsheet manipulation, chart creation, and dashboard repair.
Sam Altman and Matt Garman discuss OpenAI-AWS partnership on Bedrock Managed Agents; Stratechery covers OpenAI-Microsoft deal implications.
ADEMA architecture enables long-horizon LLM-agent tasks via explicit knowledge-state bookkeeping, dual-evaluator governance, and checkpoint-resumable persistence.
Agora-Opt combines decentralized multi-agent debate with memory-augmented LLMs for automated optimization modeling from natural-language requirements.
SkillSynth automates terminal task synthesis via skill graphs to improve trajectory diversity for training command-line execution agents.
Salesforce production inference architecture for compound AI systems supporting heterogeneous model composition, agents, and retrieval at scale.