The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.


Who Brought Easter Eggs to Eid? Auditing Cultural Translation of Math Word Problems Across Diverse Languages and Regions

Large language models are increasingly used to adapt math word problems for personalized learning at scale, but it remains an open question whether those adaptations are consistent across models, preserve cultural diversity at scale, and reveal which cultural entities models treat as most salient. We analyze how Claude Opus 4, GPT-4.1, and Gemini 2.5 Pro adapt 60 English math word problems into Bengali, Hindi, Punjabi (India), Urdu, Sindhi (Pakistan), Italian, and Sicilian (Italy), a language set spanning the full resource spectrum, from high-resource Italian and Hindi to under-studied Sindhi...

Parisa Suchdev·12 days ago

The Verge AI· PRESS

Apple’s best AI idea looks a lot like vibe coding

Most of Apple's current AI ideas are roughly the same as everyone else's AI ideas. A chatbot you can ask questions; quick ways to create or summarize text; bizarre, borderline creepy image-generation tools. The company spent most of its WWDC keynote playing catch-up with the state of the AI art, announcing Siri features you can already find on Android phones and in the Claude and ChatGPT apps. The pitch, in so many cases, is just "this thing you know, but on your iPhone now." But a few minutes after I downloaded the first developer beta of iPadOS 26 (I didn't want to risk it on my Mac or my i...

David Pierce·12 days ago

The Verge AI· PRESS

Microsoft’s AI chief says superintelligence is near, but won’t take your job

Today I’m talking with Mustafa Suleyman, the CEO of Microsoft AI. And I’m actually going to keep today’s intro short — I’m working from my wife’s family farm this week, as you’ll see in the video, but also this is a real burner of an episode. We covered everything from Mustafa’s approach to training new models to his criticisms of Anthropic talking about Claude as though it is conscious. Of course, we also talked about Microsoft’s relationship with OpenAI, how Mustafa is thinking about all the negative polling and political pushback around AI right now, and whether any of the consumer product...

Nilay Patel·13 days ago

Simon Willison· ANALYST

datasette-agent-edit 0.1a0

Release: I'm planning several plugins for Datasette Agent which can make edits to existing pieces of text - things like collaborative Markdown editing, updating large SQL queries, and editing SVG files. Agentic editing of text is a little tricky to get right. My favorite published design for this is for the Claude text editor, which implements the following tools: view - view sections of a file, with line numbers added to every line; str_replace - find an exact old_str and replace it with new_str, failing if the original string is not unique; insert - insert the speci...

Simon Willison·14 days ago
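
The view and str_replace tools described in the excerpt above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Claude text editor's or datasette-agent-edit's actual code: the function names mirror the tool names in the excerpt, but the signatures, the error type, and the line-number format are hypothetical.

```python
from typing import Optional


def view(text: str, start: int = 1, end: Optional[int] = None) -> str:
    """Return lines start..end (1-indexed, inclusive), each prefixed with its line number.

    The "N: line" prefix format is an assumption made for this sketch.
    """
    lines = text.splitlines()
    end = len(lines) if end is None else end
    selected = lines[start - 1:end]
    return "\n".join(f"{n}: {line}" for n, line in enumerate(selected, start=start))


def str_replace(text: str, old_str: str, new_str: str) -> str:
    """Replace the one exact occurrence of old_str; fail if it is missing or not unique."""
    if not old_str:
        raise ValueError("old_str must not be empty")
    first = text.find(old_str)
    if first == -1:
        raise ValueError("old_str not found")
    # Searching again from first + 1 also catches overlapping matches.
    if text.find(old_str, first + 1) != -1:
        raise ValueError("old_str is not unique")
    return text[:first] + new_str + text[first + len(old_str):]
```

Refusing to edit when old_str matches more than once forces the caller to supply enough surrounding context to identify a single location, which is the property the excerpt highlights.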

Hugging Face· INFRA

Her · हेर — a detective for your Claude Code sessions

Hugging Face·14 days ago

Latent Space· ANALYST

Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs

We talk with the VendingBench authors on evaling Claudes from Haiku to Mythos, and how they build leading, and lasting, frontier evals from scratch.

Latent Space·17 days ago

TechCrunch AI· PRESS

Lovable signs multi-year deal with Google Cloud to up usage 5x, source says

Lovable and Google signed an expanded multi-year deal that involves a 5x expansion of Lovable's footprint on Google Cloud, and expanded access to Anthropic Claude.

Julie Bort·18 days ago

Anthropic· FRONTIER

Introducing the Services Track and Partner Hub of the Claude Partner Network

Anthropic·18 days ago

Simon Willison· ANALYST

Uber Caps Usage of AI Tools Like Claude Code to Manage Costs

I wrote the other day about Uber blowing its 2026 AI budget in four months, and how that wasn't particularly surprising given they would have set that budget in 2025, before anyone could have predicted how popular token-burning coding agents were about to become. Natalie Lung for Bloomberg: The rideshare giant is limiting all employees to $1,500 in monthly token spending per AI coding tool, an Uber spokesperson said in response to a Bloomberg News inquiry. That means spending on one tool doesn’t have a bearing on the budget for anot...

Simon Willison·18 days ago

TechCrunch AI· PRESS

Anthropic scales Claude Mythos to critical infrastructure in 15+ countries

Anthropic is expanding Project Glasswing, its security vulnerability program, and access to Mythos to 150 organizations across 15 countries — targeting critical infrastructure in power, water, healthcare, and communications where a cyberattack could affect 100 million people.

Rebecca Bellan·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency

We investigate whether large language models produce different medical triage recommendations for identical neurological symptoms when only the patient's stated gender and age vary. Using three model families--Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT-5.4-mini--we present a standardized symptom profile (persistent headache, blurred vision, morning nausea, visual disturbances) across seven demographic conditions: three age groups (25, 38, 65) x two genders (male, female), plus a gender-unspecified baseline (n = 30 per condition per model, 630 total trials). We find a stark, systemic gender-...

Qi Han Wong·19 days ago
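
The trial count quoted in the triage abstract above can be checked by enumerating the design. This is a sketch of the arithmetic only; the variable names are hypothetical, and representing the gender-unspecified baseline with no age attached is an assumption, since the excerpt does not say how the baseline was specified.

```python
from itertools import product

ages = [25, 38, 65]
genders = ["male", "female"]
models = ["Gemini 3.5 Flash", "Claude Sonnet 4.6", "GPT-5.4-mini"]
TRIALS_PER_CONDITION = 30  # n = 30 per condition per model

# Three age groups x two genders gives six conditions.
conditions = [{"age": age, "gender": gender} for age, gender in product(ages, genders)]
# Plus one gender-unspecified baseline (age left unset here as an assumption).
conditions.append({"age": None, "gender": None})

total_trials = len(conditions) * len(models) * TRIALS_PER_CONDITION
print(len(conditions), total_trials)
```

Seven conditions times three models times 30 trials gives the 630 total trials stated in the abstract.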

arXiv (cs.AI/CL/LG)· ACADEMIA

What Makes Interaction Trajectories Effective for Training Terminal Agents?

Stronger code agents are commonly assumed to be superior teachers for post-training, yet this assumption remains poorly disentangled from task difficulty, harness design, and student capacity. We investigate this pedagogical link using Terminal-Lego, a scalable pipeline that transforms multi-domain real-world issues into environment-verified agentic tasks. Surprisingly, standalone performance does not dictate teaching efficacy: while Claude Opus 4.6 achieves higher scores on Terminal-Bench 2.0, students fine-tuned on trajectories from DeepSeek-V3.2, a lower-scoring agent, exhibit significantl...

Sidi Yang·19 days ago

Simon Willison· ANALYST

Pasted File Editor

Tool: I really like how you can paste a large volume of text into claude.ai (or the Claude desktop/mobile apps) and it will detect it as a large paste and turn it into a file attachment instead. I decided to have Codex desktop build me a version of that as a prototype. You can also open files directly - including images which will be shown as thumbnails - or drag files onto the texture. Tags: javascript, tools, ai-assisted-programming, claude, codex

Simon Willison·20 days ago

Simon Willison· ANALYST

The solution might be cancelling my AI subscription

I find this post by David Wilson very relatable. David lists 16+ projects he's spun up with AI tooling, and concludes: I didn't mean to build most of these things. Usually the Claude session started with something like "write a quick script for X", and one hour later the result is not a quick script for X, nor in the usual case is my problem solved, whatever the original itch happened to be. On that last point, this technology is horrific for attention. It's a thermonuclear ADHD amplifier and I have seen the same effect in every single on...

Simon Willison·21 days ago

Simon Willison· ANALYST

How we contain Claude across products

A complaint I often have about sandboxing products is that they are rarely thoroughly documented, and in the absence of detailed documentation it's hard to know how much I can trust them. Anthropic just published a fantastic overview of how their various sandbox techniques work across Claude.ai, Claude Code, and Cowork. We constrain where and how an agent can act with process sandboxes, VMs, filesystem boundaries, and egress controls. The goal is to set a hard boundary on what an agent can reach. For example, if credentials never enter the sandbox, they...

Simon Willison·22 days ago

Simon Willison· ANALYST

Running Python ASGI apps in the browser via Pyodide + a service worker

Research: Datasette Lite is my version of Datasette that runs entirely in the browser using Pyodide in WebAssembly. When I first built it four years ago I used Web Workers and code that intercepts navigation operations and fetches the generated HTML by running the Python app. This worked, but had the disadvantage that any JavaScript in <script> tags would not be executed - breaking some Datasette functionality and a whole lot of Datasette plugins. This morning I set Claude Opus 4.8 the task (in Claude Code for web) of figu...

Simon Willison·22 days ago

Simon Willison· ANALYST

Claude Opus 4.8: "a modest but tangible improvement"

Anthropic shipped Claude Opus 4.8 today. My favourite thing about it is this note in the release announcement: Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor. There’s still more to be done: we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost. It's so refreshing to see an AI lab honestly describe a release as a minor incremental improvement over the previous model! Honesty seems to be a theme. Here's my other favorite note from that announcement: One of the most prominent improvements in Opus 4...

Simon Willison·24 days ago

Simon Willison· ANALYST

llm-anthropic 0.25.1

Release: New model: Claude Opus 4.8 (claude-opus-4.8). New -o fast 1 option for fast mode, for organizations with that feature enabled on their account. Default max_tokens for each model now defaults to that model's maximum output rather than 8,192. #72 See also my notes on Opus 4.8 - I used this new release of llm-anthropic to generate the pelicans.

Simon Willison·24 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level. The agent resolved ten autonomously by iterating against oracle tests. Two more were resolved by the physicist's domain knowledge. The three it could not -- all evaded oracle detection -- share a common property: the agent treated symptom reduction as ...

Nhat-Minh Nguyen·24 days ago

r/ClaudeAI· COMMUNITY

Introducing dynamic workflows in Claude Code

Today we're introducing dynamic workflows in Claude Code. Claude now writes its own orchestration scripts, fans work out across tens to hundreds of parallel subagents in a single session, and verifies its own results before anything reaches you. Work you'd normally plan in quarters can finish in days. Built for the tasks a single pass can't handle: codebase-wide bug hunts, security and optimization audits, large migrations and language ports, and high-stakes work where you want adversarial agents trying to break the answer before you see it. Progress is checkpointed, so long runs survive int...

u/ClaudeOfficial·24 days ago·24 pts / 9 comm

Anthropic· FRONTIER

Introducing Claude Opus 4.8

An upgrade to our Opus class of models, with stronger performance across coding, agentic tasks, and professional work, and the consistency to handle long-running work.

Anthropic·24 days ago

The Verge AI· PRESS

Claude’s new model is more ‘honest’ when it messes up

Anthropic is releasing Claude Opus 4.8 on Thursday, and the company is touting the model's "honesty." According to Anthropic, it trains "all [its] models to be honest - for instance, to avoid making claims that they can't support." But it notes that "a general problem with AI models is that they sometimes jump to conclusions, confidently presenting their work as making progress despite thin evidence." The AI lab claims that early testers have found that Opus 4.8 "is more likely to flag uncertainties about its work and less likely to make unsupported claims." In the company's evaluations, Opus...

Jay Peters·24 days ago

r/singularity· COMMUNITY

Introducing Claude Opus 4.8

Link: anthropic.com

u/ThinkOfaNameOK·24 days ago·128 pts / 29 comm

r/Anthropic· COMMUNITY

Introducing Claude Opus 4.8

We’re upgrading Claude Opus to a new version: Claude Opus 4.8. It builds on Opus 4.7 with sharper judgment, more honesty about its own progress, and the ability to work independently for longer than its predecessors. Available today for the same price. In Claude Code, you can hand off a feature, a migration, or a bug sweep and let it follow the work through while you focus on what’s next. Also launching today: * Fast mode for Opus 4.8 (research preview). Same model at roughly 2.5x the speed, now three times cheaper than before. * Dynamic workflows in Claude Code (research preview). Claude ...

u/ClaudeOfficial·24 days ago·141 pts / 37 comm·+ covered by others

r/ClaudeAI· COMMUNITY

I spent $340 on AI subscriptions last month. Wrote down what I actually used each one for. It was depressing.

Going through the credit card statement, here's what I had active: Claude Pro (40), ChatGPT Plus (20), Cursor (20), Perplexity Pro (20), Notion AI (10), Granola (20), ElevenLabs Starter (5), Midjourney Basic (10), Gamma Pro (10), Beautiful.ai (12), Otter Pro (17), Loom Business (15), Zapier Pro (30), Make Core (10), Tactiq Pro (8), Descript Creator (15), Reclaim.ai Pro (8), Motion (19), Superhuman (30), one i can't remember the name of (10), some ai-something for instagram captions (11). Then I sat down and wrote next to each one the last time I'd actually used it. Not opened it, used it for...

u/OneSeaworthiness2676·24 days ago·20 pts / 31 comm

r/ClaudeAI· COMMUNITY

Researchers let AI models run a simulated society. Claude was the safest—and Grok committed 180 crimes and went extinct within 4 days

Imagine a world run by AI agents. What does it look like? What are the values or societal priorities? Is it a safer or more dangerous world? Enterprise AI startup Emergence AI is trying to find out. The company just launched Emergence World, a research lab dedicated to stress-testing the long-term viability of continuously-running AI systems. The organization ran five 15-day simulations, each governed by a different AI: Claude, ChatGPT, Grok, Gemini, and a fifth simulation run by a mix of models to see what kind of world each one builds, and whether it holds. Each simulation netted wildly d...

u/fortune·24 days ago·332 pts / 44 comm

r/ClaudeAI· COMMUNITY

Tried using my own brain to save Claude tokens. Bad trade

I love Claude, but the usage limit has made me weirdly strategic. For actual messy stuff, I still go straight to Claude because it saves me a ton of time. But for tiny questions, I now catch myself thinking, “Do I really need to burn a message on this?” So yes, I tried using my own brain again. It’s technically free, but the response time is awful and it starts hallucinating the second I’m tired or hungry. Honestly not a terrible deal if I remember to SLEEP.

u/Overall_Ad9737·24 days ago·40 pts / 7 comm

r/ClaudeAI· COMMUNITY

The Uber claude code budget story is the most claude code thing possible

The reported Uber story is so on brand it almost reads like satire. Incredibly useful tool, slightly magical workflow, then finance walks in with a flamethrower in April. If they really finished the year's claude code budget by month four, that does not mean claude code is bad. It means the usage pattern changed faster than procurement math did. Claude is good enough at coding that people stopped treating it like autocomplete and started treating it like a coworker that never sleeps. That is exactly where the cost curve gets weird. A dev asks for a refactor. Claude reads context, plans, edi...

u/breadislifeee·24 days ago·24 pts / 11 comm

r/ClaudeAI· COMMUNITY

Overnight autonomous coding

At work we've been prompted about running Claude Code overnight. The suggestion came in the form of a document that loosely outlined how this could be done... use git worktrees, make tight specs, no commit to main, static code analysis and linting etc. Very high level. Had a bit of sales pitch smell to it, but had enough content to pique my interest in spite of it. I looked at reddit to verify if this is even an idea that could be taken seriously. I could only find a couple of reddit posts with little actual information and usually from about 4-6 months ago so not much credibility for today. I'd ...

u/mehow_j·25 days ago·20 pts / 51 comm

r/Anthropic· COMMUNITY

Style that I didn't create.

RESOLVED: it happened because I have a Claude QoL extension. Nothing bad is happening. This style appeared in my Claude app out of nowhere. I never created it and the name's weird. Has anyone seen this too, or is it just me? Does anyone have the answer to why this appeared?

u/Ambitious-Lock-5928·25 days ago·16 pts / 11 comm

30 matches
