The Archive


SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents

Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them, and load the full skill corpus before inference. We propose SkillCAT, a training-free framework that separates this process into three stages. Contrastive Causal Extraction (CCE) samples multiple trajectories for each task and compares same-task success/failure pairs to identify evidence that explains outcome differences. Assessment-Augmented Evolution (AAE) replays ...

Kunfeng Chen·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

Reinforcement Learning with Verifiable Rewards (RLVR) is a central technique for improving long-horizon reasoning in Large Language Models (LLMs). However, existing RLVR methods often encourage unnecessarily long reasoning rollouts, which can degrade reasoning coherence and exhaust the available context budget. Existing approaches to long-context organization often depend on external mechanisms to organize rollouts, rather than enabling the model to manage its own reasoning trajectory. To address this limitation, we propose ReSum, a novel RLVR framework that enables LLMs to compress and organ...

Xucong Wang·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection

Contextual anomaly detection aims to identify abnormal behavior conditional on context variables, but practical deployments often face highly imbalanced context distributions where rare regimes can be critical information. Under such frequency bias, context-conditioned models can produce unstable decisions and excessive false alarms in rare contexts. We propose Rarity-Gated Feature-wise Linear Modulation (RGFiLM), a rarity-aware conditioning module that combines feature-wise modulation (i.e., context-conditioned scaling and shifting of hidden features) with a gate controlled by a data-driven ...

Yongmin Kim·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue

The original Turing Test asks a human judge to distinguish a machine from a person through dialogue. Three quarters of a century later, conversational systems pass this test in casual settings; the interesting epistemological question has shifted. We argue that the relevant modern variant asks not whether a dialogue partner is artificial, but whether it can be trusted. We present RogueAI, an interactive webapp that operationalizes this revisited test as a one-on-two interrogation game: a human player questions two indistinguishable Large Language Model agents, knowing that exactly one of them...

Sara Candussio·12 days ago

Anthropic· FRONTIER

Introducing Claude Corps

We’re launching Claude Corps, a national fellowship program for people early in their careers who are passionate about extending the benefits of AI to communities across America.

Anthropic·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Physics-Guided Spatiotemporal Learning for Coastal Wave Peak Period Estimation from Video

Wave parameters in the nearshore are crucial for coastal engineering, shoreline protection, marine hazard assessment, and coastal management for climate resilience. Traditional monitoring systems like buoys and radar platforms offer accurate monitoring but can have high installation and maintenance expenses and limited spatial coverage. Passive ocean monitoring using video has been achieved by leveraging deep learning; however, many methods are not physically interpretable, feasible, and validated for oceanography. In this work, a Physics-Guided Deep Spatiotemporal Learning Framework for direc...

Abubakar Hamisu Kamagata·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Quantizing Time-Series Models As Dynamical Systems: Trajectory-Based Quantization Sensitivity Score

We introduce the Trajectory-based Quantization Sensitivity Score (TQS), a metric that reframes post-training quantization (PTQ) through the lens of dynamical-systems stability. By modeling the network's rollout as a discrete-time dynamical system, TQS characterizes how quantization-induced errors propagate and amplify over the rollout horizon. Unlike conventional PTQ methods, where sensitivity analysis is often coupled to the quantization procedure, TQS enables a priori sensitivity estimation decoupled from quantizer selection and bit-width assignment. This separation allows for quantization ...

Mariya Pavlova·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Mining Architectural Quality Under Agentic AI Adoption: A Causal Study of Java Repositories

AI coding tools are now used by a majority of developers, and agentic use of these tools has popularized the practice colloquially called "vibe coding". Yet causal evidence on their effect on software architecture is scarce. Prior causal work has measured code-level outcomes (complexity, static analysis warnings); whether such degradation propagates to architecture-level outcomes remains unknown. We mine 151 open-source Java repositories, 74 with detectable agentic AI adoption (identified via configuration files and Co-Authored-By commit trailers) and 77 propensity-matched controls, across a ...

Oliver Aleksander Larsen·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Simultaneous Latent Budget Trees for Stratified Classification

In the era of Explainable Artificial Intelligence, there is a renewed focus on single trees for their ease of interpretation. This paper introduces Simultaneous Latent Budget Trees, a probabilistic machine learning framework for classification trees in the presence of a stratification factor such as a temporal, spatial, or demographic variable, acting as a control variable or potential confounder. Standard tree growth procedures are not designed to optimize a conditional split rule. A model-based split rule is proposed in which child nodes are interpreted as latent components of a simultaneou...

Cristian Buoncompagni·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address the first, comprehensive ablations reveal two key findings: (1) frame-level causal temporal attenti...

Guozhen Zhang·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

Contrastively trained vision-language models like CLIP have made remarkable progress in learning joint image-text representations, but still face challenges in compositional understanding. They often exhibit a "bag-of-words" behavior--struggling to capture the object relations, attribute-object bindings, and word order dependencies. This limitation arises not only from the reliance on global, single-vector representations for optimization, but also from the insufficient exploitation and modeling of the rich compositional information inherently present in paired image-text data. In this work,...

Wei Li·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers

In modern machine learning, parallelization of training is an important strategy for increasing scale. Asynchronous stochastic gradient descent (ASGD) maximizes the utilization of available hardware by avoiding waiting for slow workers. However, with constant step sizes, the convergence of ASGD is nonetheless affected negatively by slow workers due to large delays in updates. At the same time, it has been empirically observed in asynchronous training of deep learning models that gradient clipping "stabilizes" training. In this work, we provide a theoretical justification for this behav...

Samuel Erickson·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Once-for-All: Scalable Simultaneous Forecasting via Equilibrium State Estimation

We introduce Equilibrium State Estimation (ESE), a novel paradigm for simultaneous prediction, where multiple interacting systems require separate yet coordinated forecasts. Such scenarios often arise in real-world settings such as economics and healthcare modeling. Unlike existing approaches that predict one system at a time, ESE forecasts all systems in a single pass. It first estimates the equilibrium state across systems, then generates holistic forecasts based on the difference between the current state and the estimated equilibrium. Extensive experiments on synthetic and real-world data...

Beinan Xu·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space

As AI systems are deployed in high-stakes ethical contexts such as healthcare triage, autonomous vehicle control, and employment screening, formal methods for evaluating their robustness against adversarial manipulation of ethical reasoning remain underdeveloped. This paper introduces the Ethical Robustness Testing System (ERTS), a closed-pipeline framework that: (1) encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space (ECS) grounded in established ethical theory; (2) applies 17 semantic perturbation functions subject to 6 validity constraint classes including a novel sema...

Pratyush Chaudhari·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

ProtoX-AD: Self-Explainable Time Series Anomaly Detection and Characterization

Recent advances in time series anomaly detection (TSAD) have highlighted the effectiveness of self-supervised classification-based approaches. These methods apply transformations to normal training samples, training a classifier to recognize transformation-specific patterns that help identify anomalies through increased classification errors. Despite their strong performance, a significant challenge is their lack of explainability, as they provide limited insight into the characteristics of flagged anomalies. To address this limitation, we propose ProtoX-AD, a prototype-based self-explainable...

Aitor Sánchez-Ferrera·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization

Weight-space geometry plays a central role in neural network optimization, yet manifold constraints are often applied uniformly across all weight matrices. In this work, we ask whether different transformer modules prefer different manifold geometries. We study Manifold Muon for GPT-2 pretraining and compare layer-wise assignments of Stiefel and DGram constraints across attention and MLP blocks. Our results show a clear asymmetry: constraining attention layers with Stiefel geometry while assigning DGram geometry to MLP layers gives the best performance among the tested configurations, whereas...

Kirato Yoshihara·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

TimeLens: On-Device Artifact Recognition with Retrieval-Augmented Question Answering for the Grand Egyptian Museum

TimeLens is an AI-powered bilingual mobile guide for the Grand Egyptian Museum (GEM). Pointing a phone at an exhibit, a visitor sees the artifact recognized in real time and can ask follow-up questions answered in English or Arabic. The work addresses three problems specific to in-gallery deployment: fine-grained visual similarity among 51 catalogued artifacts (many near-identical Ramesside statues), the gap between curated training data and handheld camera conditions, and the risk of an AI guide stating unsupported historical facts. Two engineering contributions are reported. First, an on-de...

Rawan Hesham·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification

Recent approaches combining Large Language Models (LLMs) with retrieval-augmented reasoning have shown promise for automated fact verification. To process complex claims, these verification pipelines typically execute multi-stage workflows that coordinate tightly coupled modules, including claim decomposition, evidence gathering, and verdict prediction. However, existing methods optimize individual stages in isolation or rely on fixed heuristics, which limits adaptive coordination among stages and can lead to suboptimal outcomes. In this work, we propose ProFact, an agentic reinforcement lear...

Rongxin Yang·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Extracting Governing Equations from Latent Dynamics via Multi-View Contrastive Learning

Identifying latent dynamical systems from noisy, high-dimensional measurements is a central problem at the intersection of representation learning, system identification, and scientific discovery. We present DYSCO, a multi-view temporal contrastive learning algorithm that jointly recovers latent trajectories and the governing dynamics from such observations, by leveraging multiple independent noisy views of the same underlying process to disentangle signal from noise. By parameterizing the dynamics in a structured functional basis, our framework further enables symbolic recovery of the govern...

Paolo Muratore·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

MOSAIC: Modality-Specific Adaptation for Incremental Continual Learning in Parkinson's Disease Gait Assessment

Gait-based Parkinson's disease assessment increasingly relies on heterogeneous sensors, but clinical systems rarely collect all modalities simultaneously. New sensors may arrive through device upgrades, protocol changes, or multi-center deployment, while historical patient data are often unavailable because of privacy and storage constraints. This modality-incremental setting faces three challenges: unreliable cross-modal distillation, modality-specific statistical shifts, and reduced plasticity after preservation. We propose MOSAIC, a compact continual learning framework. First, we identify ...

Minlin Zeng·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Humor Style Drives Laughter, Topic Shapes Acceptability: Evaluating Bilingual Personal and Political Robot-Delivered AI Jokes

Humor plays a central role in human social relationships, and recent advances in computational humor create new opportunities for integrating humor into human-robot interaction (HRI). While large language models (LLMs) can generate diverse forms of humor, it remains unclear how humor style, joke content, and language preference shape perceptions of robot-delivered humor in group settings. In this exploratory study, we employed a mixed factorial design in which participants evaluated AI-generated jokes delivered by a robot in a university classroom. We examined the effects of humor type (Affil...

Anna-Maria Velentza·12 days ago

The Verge AI· PRESS

Anthropic apologizes for invisible Claude Fable guardrails

Anthropic has apologized for stealthily throttling its new AI model, Claude Fable 5, with hidden guardrails that undermine both researchers and rivals using it to develop competing systems. The company says it is reversing course and will be more transparent about when the restrictions kick in, even if that means Fable refuses more queries. Fable is the first widely available model in Anthropic's Mythos class of AI systems, a group the company has spent months warning are too dangerous for public release. Anthropic says it has addressed some of those risks by launching Fable with safeguards t...

Robert Hart·12 days ago

MIT Tech Review· PRESS

Google DeepMind is worried about what happens when millions of agents start to interact

Google DeepMind is funding research into the potential dangers of millions of different AI agents interacting with each other online. According to Rohin Shah, who directs the company’s AGI safety and alignment research, the mass-market arrival of agents that can carry out tasks without human oversight and follow instructions given to them by other agents creates…

Will Douglas Heaven·12 days ago

Stratechery· ANALYST

An Interview with Ben Bajarin About Apple, AI, and Compute

An interview with Ben Bajarin about WWDC and the status of the AI compute industry.

Ben Thompson·12 days ago

The Verge AI· PRESS

Deezer launches an AI music detector for other streaming services

Deezer will now scan your playlists on other streaming platforms to detect AI-generated music. Deezer was the first of the big streaming services to start labeling AI-generated music. It even offered its tech to other platforms, but it doesn't seem like it had many buyers. Qobuz launched its own detection tech, while Apple and Spotify have opted for a voluntary tagging system. "No other company has followed our lead yet, so we decided to make it possible for everyone to check if their playlists include synthetic music, no matter which streaming platform they use," Deezer CEO Alexis Lanternier...

Terrence O’Brien·12 days ago

Simon Willison· ANALYST

asyncinject 0.7

Release: asyncinject 0.7. I built this utility library to support an asyncio dependency injection pattern a few years ago. I was using it with Datasette and Claude Fable 5 spotted some bugs in the dependency which it then fixed for me. It's a very proactive model! Tags: async, projects, python, claude-mythos

Simon Willison·12 days ago

TechCrunch AI· PRESS

Opendoor’s India exit is fueling a bigger conversation about AI and outsourcing

The decision comes as India emerges as the world’s largest GCC market.

Jagmeet Singh·12 days ago

TechCrunch AI· PRESS

Anthropic’s Dario Amodei has just one direct report

If you doubted his genius, doubt no more.

Connie Loizos·12 days ago

Simon Willison· ANALYST

Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude

Big scoop for Maxwell Zeff at Wired: “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic said in a statement to WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.” There's been a huge outcry about Anthropic's policy, tucked away in their system card, that Claude Fable/Mythos would identify "requests targeting frontier LLM development" and "limit effectiveness" without notifying the user. It's very good news that they're droppin...

Simon Willison·12 days ago

Latent Space· ANALYST

[AINews] Open Models, Model Labs vs Agent Labs, and What's Untrainable — Sarah Guo

a quiet day lets us reflect on a great essay

Latent Space·12 days ago
