The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.

Temporal Simultaneity Predicts Annotation Quality in Sentiment Corpora

Annotation quality is difficult to sustain when campaigns span weeks or months with small annotator pools. We present a Setswana sentiment dataset of 3,565 tweets annotated by three native-speaker annotators across eight batches and examine why inter-annotator agreement (IAA) declines over time. Despite an aggregate Randolph's free-marginal Kappa of $κ= 0.76$, "excellent," per-batch $κ$ falls by more than 32 points across the annotation task. Through six targeted analyses, we find that (i) label confusion concentrates on the negative/neutral boundary, (ii) two annotators show run-length drift...

Idris Abdulmumin·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Explainable Comparison of Feature-Based and Deep Learning Models for TROPOMI Methane Plume Screening

Continuous and global detection of large methane emissions is a crucial step for global warming mitigation. Satellite observations, such as from S5P/TROPOMI, combined with plume detection algorithms, can play a key role in this effort. However, not all TROPOMI plume detections that look like methane emission plumes are the result of actual emissions. A significant part of the plume-like features in the data are retrieval artifacts. Such artifacts could be the result of variations in elevation or albedo gradients, high concentrations of aerosols, coastal lines, water bodies, etc. Previous work...

Solomiia Kurchaba·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

The Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System

In modern RAG pipelines, query augmentation methods such as HyDE and query expansion are applied to every query, resulting in substantial LLM inference costs and increased end-to-end latency. The empirical justification for this overhead in real production traffic remains largely unexplored. We present a case study of the Danish National Encyclopedia, evaluating five retrieval workflows over 20,000 query-workflow pairs from production traffic and synthetic conditions. In this system, synthetic queries suggest that LLM augmentation is needed for over 90% of queries to achieve high retrieval co...

Zafar Hussain·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Nonlinear Data Integration via Kernel Methods for Data Collaboration Analysis

Collaborative analysis of decentralized confidential datasets is important, but direct sharing of original datasets is often restricted by privacy and institutional constraints. Data collaboration (DC) analysis transforms each dataset into privacy-preserving intermediate representations via party-specific obfuscation functions and integrates them into common collaboration representations using an anchor dataset. However, many existing DC analysis methods rely on linear transformations for data obfuscation and integration, which may increase reconstruction risk. Although nonlinear dimensionali...

Yamato Suetake·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation

We adapt Microsoft's QuantumKatas -- a well-established quantum computing curriculum -- from Q# to Qiskit, the most widely-adopted quantum computing framework, and package it with an evaluation framework for systematic LLM assessment. The resulting benchmark comprises 350 tasks across 26 categories, spanning fundamental gates through advanced algorithms (Grover's, Simon's, Deutsch-Jozsa), error correction, key distribution, and quantum games. Each task includes a natural language prompt, canonical solution, and deterministic test verification via classical circuit simulation. By building on t...

Juan Cruz-Benito·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments

Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents often exhibit notable degradation when deployed in real-world settings, where environments are inherently stochastic and imperfect. We argue that this discrepancy arises from a fundamental mismatch between idealized training settings and real-world interaction dynamics, where current paradigms rely on carefully curated task instructions and stable, well-controlled en...

Yuxin Chen·1 month ago

NVIDIA Dev Blog· INFRA

Run Key Genomics and Protein Folding Workloads Faster with NVIDIA RTX PRO 4500 Blackwell

Precision medicine depends on two fundamental capabilities: understanding disease at the genomic level and identifying treatments at the molecular level. NVIDIA’s contributions to precision medicine extend far beyond accelerated computing, delivering a full-stack platform that translates hardware and software advancements directly into healthcare outcomes. Sequencing the human genome…

Alejandro Chacon·1 month ago

TechCrunch AI· PRESS

This startup is betting India’s gig economy can train the world’s robots

Human Archive, a startup founded by Berkeley and Stanford researchers, is paying gig workers in India to wear camera-equipped caps and sensor devices to collect the real-world physical training data that AI and robotics labs are racing to acquire.

Ivan Mehta·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

TWIST: Closed-Loop token Synchronization for Application-Aware Wireless Digital Twins

Wireless digital twins require repeated synchronization between a time-evolving physical scene and its digital counterpart under limited and time-varying communication resources. For perception-centric twins, pixel-domain transmission or uniformly protected bitstreams can be mismatched to the semantic state consumed by twin-side applications. This paper proposes TWIST, a closed-loop token synchronization framework for application-aware wireless digital twins. TWIST represents each physical observation as a token and synchronizes this state over a wireless link, rather than optimizing visual r...

Sige Liu·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

GraphReview: Scientific Paper Evaluation via LLM-Based Graph Message Passing

Scientific paper evaluation often involves not only assessing a manuscript itself, but also relating it to contemporaneous research and prior literature. However, existing LLM-based methods typically model these signals separately and lack a unified mechanism for propagating review evidence across papers. We propose $\textbf{GraphReview}$, a graph-based LLM framework that formulates paper evaluation as review-signal message passing over a semantic paper graph. The graph jointly captures intrinsic quality, synchronic links among contemporaneous papers, and diachronic links to prior work. LLMs ...

Pujun Zheng·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Generative Animations: A Multi-Model Pipeline for Prompt-Driven Motion Synthesis

Animation elevates digital documents into immersive experiences, yet creating custom motion paths remains cumbersome, requiring designers to manually select presets, plot Bézier points, and configure timing properties. We introduce Generative Animations, a system that transforms natural language prompts into production-ready animations. By chaining Large Language Models (LLMs) for semantic parsing with the Segment Anything Model (SAM) for visual grounding, our pipeline automatically generates motion paths that respect scene geometry, handle depth-based occlusions, and honor 3D perspective tra...

Mannat Khurana·1 month ago

r/ClaudeAI· COMMUNITY

That is load-bearing.

I know this topic is discussed here a lot but I SWEAR TO FUCKING GOD if I read another "That is real" OR "That is not nothing" OR "That is not X but Y" I am going to have a fucking aneurysm. Yes I have specifically forbidden it from telling me these phrases, yes I have specifically updated the memory and spec to BAN these phrases yet they slip through and I swear sometimes it is so insanely creative in its reasoning for how to get around these constraints but it just kills the immersion(?) so hard when it falls back on these god damn tropes. I use Claude (Max) for absolutely everything, ...

u/Lilbugger826·1 month ago·28 pts / 21 comm

r/ClaudeAI· COMMUNITY

Company gave us all unlimited Claude Code Sonnet 4.6 — and now posts a weekly leaderboard of who burns the most tokens. Any tips to top it?

u/sailing67·1 month ago·56 pts / 77 comm

arXiv (cs.AI/CL/LG)· ACADEMIA

EpiCurveBench: Evaluating VLMs on Epidemic Curve Digitization

Chart-to-data extraction with vision-language models (VLMs) is increasingly evaluated on benchmarks that show diminishing headroom (frontier VLMs exceed 89% on ChartQA) and with metrics that treat extracted points as unordered key-value pairs, ignoring the temporal structure of time series and penalizing small alignment shifts as catastrophic failures. We address both gaps with EpiCurveBench, a benchmark of 1,000 real-world epidemic curve images curated from diverse public-health sources, and EpiCurveSimilarity (ECS), an evaluation metric that aligns predicted and ground-truth series via dyna...

Thomas Berkane·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Not All Tokens Matter Equally: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation

Distilling demonstration effects into hidden-space interventions offers a lightweight alternative to full finetuning. However, existing multimodal variants are mostly evaluated on short-form tasks, where outputs end after a few tokens. Extending these methods to long-form generation exposes a fundamental yet underexamined limitation: token-level distillation implicitly treats all output tokens as equally informative, but long-form outputs are dominated by high-frequency template and grammatical tokens, while the tokens that actually determine output quality are sparsely distributed. In medica...

Ning Wu·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Learning When to Think While Listening in Large Audio-Language Models

Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this setting, reasoning quality and responsiveness are tightly coupled: delaying reasoning until the speech endpoint can improve answer quality but moves deliberation into user-visible response delay, while answering too early risks committing before decisive evidence arrives. We introduce a learnable wait-think-answer control formulation for LALMs. Motivated by the incremental nature of human conversation, the controller decides under partial audio evidence when...

Zhiyuan Song·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Beyond Binary: Speech Representations Across the Cognitive Score Hierarchy

This study examines the relationship between speech representations and the hierarchical structure of cognitive assessment in mild cognitive impairment. Utilizing 5,754 German neuropsychological assessment recordings, we evaluate six cognitive tasks across three score levels: task, domain, and global levels. We compare hand-crafted acoustic features with self-supervised learning (SSL) embeddings. Results show that although SSL representations generally outperform hand-crafted features at lower levels, this trend reverses for MCI classification. Furthermore, task-specific constraints influence...

Serli Kopar·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

MAIGO: Mitigating Lost-in-Conversation with History-Cleaned On-Policy Self-Distillation

Large language models often solve tasks from a fully specified prompt but degrade when the same requirements unfold over multiple turns, known as the lost-in-conversation (LiC) gap. We trace part of this degradation to self-contamination: intermediate assistant replies enter later context and carry early deviations forward. Motivated by this mechanism, we propose MAIGO, an on-policy self-distillation method that reduces this contamination using history-cleaned references from the model's own policy. For middle turns, MAIGO removes prior assistant replies while preserving the user-visible shar...

Haoyu Zheng·1 month ago

Simon Willison· ANALYST

Microsoft Copilot Cowork Exfiltrates Files

The biggest challenge in designing agentic systems continues to be preventing them from enabling attackers to exfiltrate data. In this case Microsoft Copilot Cowork (yes, that's a real product name) was allowing agents to send emails to the user's own inbox without approval... but those messages were then displayed in a way that could leak data to an attacker via rendered images: Because these messages can contain external images that trigger network requests to external websites, data can be exfiltrated when a user opens a compromised message sent ...

Simon Willison·1 month ago

r/Anthropic· COMMUNITY

Opus 4.7 Often Assumes a Military Audience

This is especially prevalent in Claude Desktop and less obvious in Claude Code.
* Often leads long assets with a BLUF (bottom line up front), specifically military-adjacent terminology.
* Has slipped into language where it refers to civilians several times, as though it means someone other than us
* References to things like deep-dive, vision-intent framing, force multipliers, and reframes are more subtle because they do cross over but their prevalence is unmistakable.
This isn't a complaint, it's more a note that, "hey guys, we see your training material in our outputs" and its interestin...

u/isarmstrong·1 month ago·13 pts / 10 comm

r/ClaudeAI· COMMUNITY

The end. What have I done

It seems to be working so far but I think I should have done this in GitHub

u/TheMeltingSnowman72·1 month ago·42 pts / 13 comm

r/LocalLLaMA· COMMUNITY

OpenMOSS-Team/MOSS-TTS-v1.5 · Hugging Face

# MOSS-TTS-v1.5 **MOSS-TTS-v1.5** is continued from [MOSS-TTS 1.0](https://huggingface.co/OpenMOSS-Team/MOSS-TTS). It preserves the main 1.0 capabilities, including zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyin/IPA pronunciation control, multilingual synthesis, and code-switching. For the full 1.0 feature walkthrough, input schema, decoding hyperparameters, and evaluation tables, please refer to the [MOSS-TTS 1.0 README](https://huggingface.co/OpenMOSS-Team/MOSS-TTS). Compared with MOSS-TTS 1.0, v1.5 focuses on the following improvements: * **St...

u/pmttyji·1 month ago·40 pts / 12 comm

Simon Willison· ANALYST

Quoting Paul Graham

A lot of the emails I get from founders are now written in a hard-hitting journalistic style. I know they're written by AI, because no founder ever wrote this way before. And once you realize something is written by AI, it's hard not to ignore it. I have never knowingly finished reading an email signed by a human but written by AI. It feels like being lied to, and who would stand for that? [ ... ] It makes me think less of the author. It means they can't write well unaided (or feel they can't), and that they're trying to trick me. It's not impressive to use AI to write stuff for you; any teen...

Simon Willison·1 month ago

TechCrunch AI· PRESS

Universal Music Group and TikTok renew agreement to combat unauthorized AI music

For years, UMG has pushed platforms, streaming services, and AI companies to implement stricter content moderation policies

Lauren Forristal·1 month ago

MIT Tech Review· PRESS

Rethinking organizational design in the age of agentic AI

Amid rapidly growing adoption of enterprise-level AI agents, there’s a disconnect emerging between ambition and execution. Although 85% of organizations say they want to be agentic within the next three years, 76% say their current operations and infrastructure can’t support that change. They cite a lack of readiness across people, processes, and workflows. The sticky…

MIT Technology Review Insights·1 month ago

The Verge AI· PRESS

Sundar Pichai on AI, the future of search, and what’s happening to the web

Today, I’m talking with Google and Alphabet CEO Sundar Pichai, in a conversation we recorded just after the Google I/O developer conference. This is the fifth year Sundar and I have sat down after I/O, and it’s become one of my favorite Decoder traditions. There’s always a lot of news at I/O, and this year was no exception — Google has powerful new Gemini models, it’s putting AI agents in everything, and it’s making huge changes to Search on both the web and YouTube that will once again reshape the information ecosystem. That’s a lot to talk about, and Sundar and I got into all of it. But I a...

Nilay Patel·1 month ago

r/ClaudeAI· COMMUNITY

AI quietly turned HTML into a real alternative to PowerPoint and Word for client-facing docs. The blockers that made it impractical a year ago are falling one by one.

A year ago, generating a polished document as HTML instead of a PPT or a Word file was a fun idea with too many practical problems. Lately I've noticed every one of those blockers either gone or close to gone, and I've quietly stopped reaching for Office on a bunch of deliverables. Curious if others are seeing the same. **The blockers, and where they stand now:** **Design**. The old objection was "AI HTML looks generic and amateur." That's basically solved if you give the model a design skill or a style guideline once. You get consistent, on-brand output that looks more like a designed page...

u/Hairy-Fisherman8008·1 month ago·20 pts / 29 comm

r/LocalLLaMA· COMMUNITY

Okay 27B made me a believer

I previously hated on this model, but I have just been impressed by it, and I understand the hype now. I have been working on an HTML5 game console and I decided to see if Qwen3.6 27B can handle making some quick games in it to showcase functionality (save games, console API handling for stat tracking and heartbeat management, metadata for the game, etc.). I gave it 3 files, explaining how the API works, the gamepad controls, and a TypeScript shader for it to apply. Then I just gave it a very simple prompt "make a breakout game for this console, in the working directory are reference files on...

u/Forward_Jackfruit813·1 month ago·48 pts / 57 comm

r/OpenAI· COMMUNITY

Jony Ive designed a new Ferrari. Or at least tried to. Give me one reason why Ferrari is paying Ive that much when AI comes up with better designs.

u/encony·1 month ago·112 pts / 73 comm

r/singularity· COMMUNITY

EngineAI shared a view of its Shenzhen Intelligent Manufacturing Base, claiming an output of one humanoid robot every 15 minutes - that's 35,000 humanoid robots per year, the highest production rate publicly claimed by a Chinese humanoid robotics company

This is the highest output rate announced. Besides this, EngineAI has an additional 10K/year line planned in Zhengzhou. That is more than what Leju Robotics, AgiBot, Unitree Robotics, and others have claimed per year for their humanoid robots. So China is likely ready to output 100K humanoid robots per year.

u/Distinct-Question-16·1 month ago·117 pts / 56 comm
