Vol. I · No. 25THU, MAY 14, 2026
Archive

The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.

If Bible characters had Instagram

Speculative social media parody post unrelated to AI technology, models, research, or industry developments.

··

Alignment has a Fantasia Problem

Paper identifies 'Fantasia interactions' where users engage AI systems with incomplete goals, proposing realignment of alignment research beyond prompt-intent matching.

·

A group of users leaked Anthropic's AI model Mythos by reportedly guessing where it was located

The AI model that Anthropic billed as too dangerous to release has reportedly been accessed by an unauthorized third party, and the incident raises concerns about the future of cybersecurity. The Mythos model was reportedly accessed by a handful of users in a private Discord chat on the day it was announced publicly, Bloomberg reported. Earlier this month, the group was able to access the program in part because one of the members of the group is a third party contractor for Anthropic, according to Bloomberg. Using this access, the group was able to guess where the model was located based ...

··

Unusable after token usage nerfs

Moving somewhere else next billing cycle. Two hours of coding on Max and I'm capped. Whatever you changed, change it back or say something. Silent nerfs to a paid product are a bad look.

··

Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards

LLM leaderboards are widely used to compare models and guide deployment decisions. However, leaderboard rankings are shaped by evaluation priorities set by benchmark designers, rather than by the diverse goals and constraints of actual users and organizations. A single aggregate score often obscures how models behave across different prompt types and compositions. In this work, we conduct an in-depth analysis of the dataset used in the LMArena (formerly Chatbot Arena) benchmark and investigate this evaluation challenge by designing an interactive visualization interface as a design probe. Our...

·

Misinformation Span Detection in Videos via Audio Transcripts

Online misinformation is one of the most challenging issues lately, yielding severe consequences, including political polarization, attacks on democracy, and public health risks. Misinformation manifests in any platform with a large user base, including online social networks and messaging apps. It permeates all media and content forms, including images, text, audio, and video. Distinctly, video-based misinformation represents a multifaceted challenge for fact-checkers, given the ease with which individuals can record and upload videos on various video-sharing platforms. Previous research eff...

·

AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions ground...

·

PrismaDV: Automated Task-Aware Data Unit Test Generation

Data is a central resource for modern enterprises, and data validation is essential for ensuring the reliability of downstream applications. However, existing automated data unit testing frameworks are largely task-agnostic: they validate datasets without considering the semantics and requirements of the code that consumes the data. We present PrismaDV, a compound AI system that analyzes downstream task code together with dataset profiles to identify data access patterns, infer implicit data assumptions, and generate task-aware executable data unit tests. To further adapt the data unit tests ...

·

Thinking with Reasoning Skills: Fewer Tokens, More Accuracy

Reasoning LLMs often spend substantial tokens on long intermediate reasoning traces (e.g., chain-of-thought) when solving new problems. We propose to summarize and store reusable reasoning skills distilled from extensive deliberation and trial-and-error exploration, and to retrieve these skills at inference time to guide future reasoning. Unlike the prevailing \emph{reasoning from scratch} paradigm, our approach first recalls relevant skills for each query, helping the model avoid redundant detours and focus on effective solution paths. We evaluate our method on coding and mathematical reason...

·
30 stories