Stripe introduces Link, a digital wallet that autonomous AI agents can use, too
Link lets users connect cards, banks, and subscriptions, then authorize AI agents to spend securely via approval flows.
STEF enables schema-agnostic evaluation of text-to-SQL agents in production without ground-truth queries, addressing real-world deployment gaps.
Study finds persona prompting in multimodal LLMs produces stable but limited behavioral variation in urban sentiment judgment tasks.
CARE methodology systematizes LLM agent engineering in scientific domains via three-party collaboration between SMEs, developers, and helper agents.
NVIDIA CUDA Tile (cuTile) is a tile-based programming model that enables developers to write GPU kernels in terms of tile-level operations—loads, stores, and matrix multiply-accumulate—rather than manually coordinating threads, warps, and shared memory. cuTile.jl brings the same tile-based approach to the dynamic programming language Julia. Users can write custom GPU kernels without dropping…
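To illustrate the tile-level decomposition the story describes (this is not cuTile.jl's actual API, just a NumPy sketch of the idea: a kernel expressed as tile loads, tile multiply-accumulates, and tile stores rather than per-thread index math):

```python
import numpy as np

TILE = 4  # illustrative tile edge length

def tiled_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Matrix multiply written as tile loads, multiply-accumulates, and stores."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % TILE == 0 and n % TILE == 0 and k % TILE == 0
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            acc = np.zeros((TILE, TILE), dtype=a.dtype)  # per-tile accumulator
            for p in range(0, k, TILE):
                # "load" two tiles and do a tile-level multiply-accumulate
                acc += a[i:i+TILE, p:p+TILE] @ b[p:p+TILE, j:j+TILE]
            c[i:i+TILE, j:j+TILE] = acc  # "store" the result tile
    return c
```

On a GPU, each (i, j) tile would map to a thread block; the point of the tile model is that the programmer writes the inner logic at this granularity and the compiler handles thread/warp coordination.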
Architectural pattern language for vision language agents balances latency/non-determinism of VLMs against real-time enterprise control requirements.
Comparative evaluation of three LLM agent paradigms (domain-specific, computer-use, coding) on scientific visualization tasks across 15 benchmarks.
D3-Gym dataset: 565 verifiable tasks from real scientific repositories for evaluating LLM agents on data-driven discovery.
Language model agents design mechanical linkages via symbolic lifting: LLMs explore topologies while numerical optimizers tune parameters, validated on six motion targets.
Survey of RL+GUI agents for long-horizon automation in visual interfaces; proposes framework toward autonomous digital inhabitants with safe exploration.
Schema-grounded external memory for agents outperforms text-retrieval approaches by enabling exact fact tracking, state updates, and structured queries.
Survey formalizing graph-based world models for agents, decomposing environments into entity nodes and edges to improve robustness vs. flat-tensor approaches.
On-demand persona-based agent generation framework enabling dynamic multi-agent workflow customization without hard-coded architectures.
Lightweight clinical agent architecture using integrated state dynamics to surface pre-escalation risk signals in LLM clinical deployment.
KellyBench: long-horizon sequential decision benchmark using 2023-24 Premier League sports betting; evaluates agents on non-stationary open-ended optimization.
Working on large codebases with Claude Code, we kept running into the same issue: when Claude looks for relevant code, it falls back to grep, reading full files, or launching multiple subagents. This burns through tokens, and often misses the relevant code. There are some existing solutions (that we also benchmarked against), but they all had issues (too slow, needs API keys, quality not good enough, etc). We built [Semble](https://github.com/MinishLab/semble) to fix this. It's a local MCP server that gives Claude Code high quality code search: instead of reading files to find what's relevan...
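The core idea behind replacing grep with semantic search can be sketched in a few lines. This is not Semble's implementation (which uses a local MCP server and proper embeddings); it is a minimal bag-of-words stand-in showing why similarity ranking finds relevant code that exact-match grep misses:

```python
import math
import re
from collections import Counter

def _vec(text: str) -> Counter:
    """Crude token-count vector over identifiers and words."""
    return Counter(re.findall(r"[a-zA-Z_]\w*", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query: str, snippets: dict[str, str], k: int = 3) -> list[str]:
    """Rank code snippets by similarity to the query instead of grepping files."""
    q = _vec(query)
    ranked = sorted(snippets, key=lambda name: _cosine(q, _vec(snippets[name])),
                    reverse=True)
    return ranked[:k]
```

A real system swaps `_vec` for learned embeddings so that, say, "auth check" still matches `verify_credentials` with no token overlap.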
tl;dr: your skill in AI is a measure of your **quality** and **scale**. Use **success criteria** and **subagents** intentionally to get excellent results. Use skills and .md docs when you find repeating patterns in your daily work, not before. --- **Quality** comes from telling the agent what outcome you want, and the **success criteria** that you will use to measure a “good” outcome. This helps avoid Claude's tendency to rush completion. Note this is specifically *not* telling it what to *do*, but instead what to *achieve*. If you come from the old world, you might remember terms like ...
I can have multiple, dense legal documents on my screen, each 40, 60, or 100+ pages, with the Claude Word add-in agents syncing, pushing and pulling information between them, pinging each other, and providing helpful context so that I can draft all three or four in parallel or ensure that an entire package is consistent. I can have a lengthy spreadsheet workbook open containing 10 worksheets and the information is analyzed and pulled in by the agents when needed. I am absolutely blown away at how well this is implemented and the improvement in quality, consistency and efficiency. It ...
ClawGym framework for systematic development of file/tool-based agents, includes 13.5K synthesized tasks for agent training and evaluation.
Operating and maintaining (O&M) large-scale online engine systems (search, recommendation, advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. While LLM-based agents are a natural fit for these tasks, the deployment bottleneck is not reasoning capability but orchestration: selecting, for each operational event, the relevant data (metrics, logs, change events) and the applicable operational knowledge (handbook rules and practitioner experience). Feeding all signals indiscriminately causes dilution and hallucination, while manually cura...
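The orchestration problem the abstract names — picking the relevant signals and handbook knowledge per event rather than feeding everything — can be sketched as a routing table. All names here (`OpsEvent`, `ROUTES`, `build_context`) are hypothetical, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class OpsEvent:
    kind: str     # e.g. "latency_alert", "release", "error_spike"
    service: str

# Hypothetical routing table: which data sources and handbook sections an
# agent should pull for each event kind, instead of ingesting all signals.
ROUTES: dict[str, dict[str, list[str]]] = {
    "latency_alert": {"data": ["metrics", "change_events"],
                      "knowledge": ["latency_runbook"]},
    "release":       {"data": ["change_events", "logs"],
                      "knowledge": ["release_checklist"]},
    "error_spike":   {"data": ["logs", "metrics"],
                      "knowledge": ["incident_playbook"]},
}

def build_context(event: OpsEvent) -> dict[str, list[str]]:
    """Select only the signals relevant to this event before prompting the agent."""
    route = ROUTES.get(event.kind, {"data": ["metrics"], "knowledge": []})
    return {
        "data": [f"{src}:{event.service}" for src in route["data"]],
        "knowledge": list(route["knowledge"]),
    }
```

Scoping the context this way is what keeps the agent's prompt free of the dilution and hallucination the paper attributes to indiscriminate signal feeding.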
FutureWorld: live environment for training LLM-based agents on real-world outcome prediction and continual learning.
Mistral AI launches Mistral Medium 3.5 with remote coding agents in Vibe and Work mode in Le Chat for complex tasks.
We visited Scout AI's training ground where it's working on AI agents that give individual soldiers control of fleets of autonomous vehicles.
DV-World benchmark evaluates data visualization agents across 260 real-world tasks spanning spreadsheet manipulation, chart creation, and dashboard repair.
Sam Altman and Matt Garman discuss OpenAI-AWS partnership on Bedrock Managed Agents; Stratechery covers OpenAI-Microsoft deal implications.
ADEMA architecture enables long-horizon LLM-agent tasks via explicit knowledge-state bookkeeping, dual-evaluator governance, and checkpoint-resumable persistence.
Agora-Opt combines decentralized multi-agent debate with memory-augmented LLMs for automated optimization modeling from natural-language requirements.
SkillSynth automates terminal task synthesis via skill graphs to improve trajectory diversity for training command-line execution agents.
Salesforce production inference architecture for compound AI systems supporting heterogeneous model composition, agents, and retrieval at scale.