The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.

New Benchmarking Shows Limited Generalization Power of TCR Antigenic Epitope Prediction Models

Accurate computational prediction of T cell receptor (TCR) antigen specificity would transform the study of T cell biology and enable scalable immune engineering, yet existing models lack sufficient sensitivity and specificity for broad applications. A major limitation is the absence of rigorously defined, unseen benchmark datasets that allow unbiased evaluation of model performance and generalizability. Here, we describe two complementary classes of datasets that meet this criterion and argue that they provide both a robust framework for model assessment and a foundation for next-generation ...

Yiming Liao·22 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents

Large language model (LLM)-based agents increasingly solve complex tasks by interacting with external tools, retrieval systems, memory modules, environments, and other agents. These capabilities expand agent autonomy, but also make agent behavior harder to verify, debug, and audit. Final-answer accuracy alone cannot explain how an output was produced, which evidence supported each claim, whether tool calls were justified, how memory influenced later decisions, or where execution failures originated. Evidence tracing and execution provenance address this gap by modeling how retrieved evidence,...

Yiqi Wang·22 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

DeliChess: A Multi-party Dialogue Dataset for Deliberation in Chess Puzzle Solving

Multi-party dialogue is a critical setting for studying collaborative reasoning and decision-making, yet existing datasets rarely focus on structured, in-depth complex reasoning tasks. We introduce DeliChess, a novel dataset of group deliberation dialogues in which participants collaboratively solve multiple-choice chess puzzles. Each group first completes the puzzle individually, then engages in a multi-party discussion before submitting a revised collective answer. The dataset includes 107 dialogues with full transcripts, pre- and post-discussion choices, and metadata on puzzle difficulty a...

Xiaochen Zhu·22 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

AlphaQ: Calibration-Free Bit Allocation for Mixture-of-Experts Quantization

Mixture-of-Experts (MoE) architectures scale model capacity through sparse expert activation, but their deployment remains memory-bound because all expert weights must reside in memory. Mixed-precision quantization can substantially reduce this footprint by assigning different bit-widths to different experts. Existing approaches, however, typically rely on calibration data to estimate expert importance and determine bit allocation. For frontier MoE LLMs, the original training data, and hence the true training distribution, is proprietary and inaccessible. As a result, calibration sets are ine...

Wanqi Yang·22 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game

LLMs can appear cautious in risk decision-making tasks, yet cautious-looking outputs do not necessarily indicate alignment with human decision-making mechanisms. We investigate this distinction using the St. Petersburg game as a controlled testbed, a classical paradox in which the expected payoff is infinite, yet humans typically report low, finite willingness to pay. We evaluate 28 LLMs with a structured prompt suite that includes the original game; controlled decision variants that perturb truncation, repeated play, numeric endowment, and occupational identity; a human-perspective prompt th...

Chensong Huang·22 days ago

TechCrunch AI· PRESS

These two founders left Goldman and Meta to build voice AI for markets everyone else overlooked

The startup's own stack for Africa and Middle East is now handling more than 17,000 calls per day.

Ivan Mehta·22 days ago

TechCrunch AI· PRESS

Publishers will be able to opt out of AI Search, thanks to new regulation

U.K. regulators are requiring Google to offer a tool that lets website publishers opt out of generative AI search features. The option will be tested in the U.K., then rolled out globally.

Sarah Perez·22 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

SAID: Accelerating Diffusion-Based Language Models via Scaffold-Aware Iterative Decoding

Diffusion large language models (DLLMs) enable non-autoregressive generation by iteratively denoising corrupted token sequences with bidirectional context. Despite their ability to update multiple positions in parallel, inference remains costly due to the many denoising steps required for high-quality generation. We propose SAID, a Scaffold-Aware Iterative Decoding framework that accelerates DLLMs by reallocating computation across tokens. SAID first spends denoising computation on scaffold tokens to establish the coarse semantic structure, and then completes predictable detail tokens with fe...

Na Li·22 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Be Fair! Can Machine Learning Engineering Agents Adhere to Fairness Constraints?

Machine learning engineering (MLE) agents promise to automate end-to-end ML pipeline development from raw data and natural language instructions, potentially making ML accessible to non-technical domain experts. However, in sensitive and regulated domains, this abstraction creates a responsibility gap: end-users may lack visibility into design choices that affect correctness, robustness, fairness, and regulatory compliance. We argue that existing benchmarks are insufficient to assess whether MLE agents can be safely applied in such settings. We propose desiderata for a responsibility-centered...

Anna Richter·22 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance

We envision a proactive multi-modal assistant system which gives users real-time step-by-step guidance on a procedural task, autonomously deciding when to interrupt and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Pl...

Kaustav Kundu·22 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

From Prompt to Process: a Process Taxonomy and Comparative Assessment of Frameworks Supporting AI Software Development Agents

AI tools for programming are no longer just autocomplete or chat assistants: they organize themselves as development frameworks, with process, roles, artifacts and verification. Recent surveys map agents and LLMs for software engineering, but a study centered on the operational frameworks that turn these capabilities into process is missing. We ran a directed search of primary sources, with a functional inclusion criterion and traction measurement, and selected six frameworks: GitHub Spec Kit, OpenSpec, BMAD Method, Get Shit Done (GSD), Spec Kitty and Reversa. Each attacks AI development thro...

Sanderson Oliveira de Macedo·22 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

SemBlock: Semantic Boundary Dynamic Blocks for Diffusion LLMs

Diffusion language models (DLMs) generate text through iterative denoising, and blockwise decoding improves their practicality by committing tokens in local blocks. However, existing blockwise methods typically rely on fixed block sizes or delimiter-based runtime signals, which do not necessarily align with semantic boundaries. In this paper, we propose SemBlock, a semantic-boundary-driven dynamic block decoding framework for diffusion LLMs. SemBlock formulates dynamic block construction as semantic boundary prediction and trains lightweight predictors on frozen LLaDA hidden states. To provid...

Xinrui Song·22 days ago

The Verge AI· PRESS

Microsoft and OpenAI broke up — now they’re ready to fight

At Microsoft's annual Build conference on Tuesday, the company announced a slew of new or expanded AI initiatives, including a super app, in-house reasoning models, a cybersecurity tool, and OpenClaw-esque AI agents. All this news added up to a clear message: Microsoft is positioned to be one of the biggest players in AI, and it's finally acting like it. For years, Microsoft's AI business leaned hard on its early and exclusive partnership with OpenAI. But the drama-filled marriage slowly devolved into a situationship, and the pair effectively separated in late April (though Microsoft is still...

Hayden Field·22 days ago

TechCrunch AI· PRESS

Meta’s AI agent for WhatsApp Business is now available globally

WhatsApp will charge businesses for using its AI agent based on token usage

Ivan Mehta·22 days ago

Ars Technica AI· PRESS

Inside Meta's attempts to play catch-up with AI

Doubts linger over whether Meta can close the gap with rivals.

Hannah Murphy, Financial Times·22 days ago

OpenAI· FRONTIER

Introducing new capabilities to GPT-Rosalind

GPT-Rosalind advances life sciences research with enhanced biological reasoning, medicinal chemistry expertise, genomics analysis, and experimental workflow capabilities.

OpenAI·22 days ago

TechCrunch AI· PRESS

Coralogix raises $200M on bet that someone needs to watch the AI agents

The Series F round values Coralogix at $1.6 billion and comes less than a year after its previous raise.

Jagmeet Singh·22 days ago

Anthropic· FRONTIER

Introducing the Services Track and Partner Hub of the Claude Partner Network

Anthropic·22 days ago

Google AI (Gemma)· FRONTIER

5 ways Google Search can level up your thrift and vintage shopping

Uncover second-hand scores with AI tools in Google Search and Shopping.

Megan Stoner·22 days ago

Hugging Face· INFRA

Direct Preference Optimization Beyond Chatbots

Hugging Face·22 days ago

Simon Willison· ANALYST

Uber Caps Usage of AI Tools Like Claude Code to Manage Costs

I wrote the other day about Uber blowing its 2026 AI budget in four months, and how that wasn't particularly surprising given they would have set that budget in 2025, before anyone could have predicted how popular token-burning coding agents were about to become. Natalie Lung for Bloomberg: The rideshare giant is limiting all employees to $1,500 in monthly token spending per AI coding tool, an Uber spokesperson said in response to a Bloomberg News inquiry. That means spending on one tool doesn’t have a bearing on the budget for anot...

Simon Willison·22 days ago

OpenAI· FRONTIER

How Wasmer used Codex to build a Node.js runtime for the edge

See how Wasmer used Codex with GPT-5.5 to build a Node.js runtime for the edge, accelerating development 10x to 20x and shipping in weeks instead of months.

OpenAI·22 days ago

Anthropic· FRONTIER

What we learned mapping a year’s worth of AI-enabled cyber threats

As AI transforms the nature of and methods behind cyberattacks, how well do the techniques and frameworks used by the security community hold up? In a new report, we seek to answer that question.

Anthropic·22 days ago

Stratechery· ANALYST

The Nvidia AI PC, Project Solara, Microsoft AI

The Nvidia AI PC feels like a relic of another AI era; Microsoft's vision for devices at Build was much more compelling.

Ben Thompson·22 days ago

OpenAI· FRONTIER

OpenAI public policy agenda

OpenAI outlines its public policy agenda for AI, including safety, youth protection, workforce transition, and global standards to ensure AI benefits society.

OpenAI·22 days ago

OpenAI· FRONTIER

A blueprint for democratic governance of frontier AI

OpenAI outlines a blueprint for U.S. governance of frontier AI, proposing a federal framework for safety, resilience, and national security.

OpenAI·22 days ago

The Verge AI· PRESS

AI has a water problem. Google thinks it has a fix

In the face of widespread backlash to the AI data center buildout throughout the US, Google is touting its efforts to minimize the environmental impact by actually increasing water for local communities. The company laid out five commitments around water use in a new blog post published Wednesday, including a goal to replenish more water than it uses at its data centers by 2030. Google also said it will invest in local water infrastructure, identify alternative water sources to power its facilities, and be transparent about its water use overall. "We're just one of dozens of players in the sp...

Lauren Feiner·22 days ago

The Verge AI· PRESS

Google must let publishers opt out of AI Search features, rules UK

Online publishers are getting more control over whether their websites appear in Google's AI Search features, thanks to a UK regulatory ruling. The new conduct rule imposed by the Competition and Markets Authority (CMA) requires Google to let website owners keep their content out of features like AI Overviews, and prevent it from being used for the "fine-tuning" of Google's AI models. "In a world first, publishers will now have effective tools to prevent their content being used to power AI features in search, such as AI Overviews," the CMA announced. "This will put publishers, like news orga...

Jess Weatherbed·22 days ago

Latent Space· ANALYST

[AINews] Microsoft Build: MAI-Thinking-1 and MAI Family models

Microsoft Build recap, and new MAI model technical details

Latent Space·23 days ago

Cohere· FRONTIER

Why more businesses choose private deployments of AI

Private deployment offers more peace of mind from data security risks. Learn how to tackle the complexities to launch successfully.

Cohere·23 days ago
