The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.

Are Large Language Models Economically Viable for Industry Deployment?

EDGE-EVAL framework assesses LLM deployment viability via energy, latency, cost constraints beyond accuracy metrics.

Abdullah Mohammad·2 months ago

Evaluation-driven Scaling for Scientific Discovery

Framework for scaling evaluation-driven discovery loops with LLMs using verifiers and simulators for hypothesis refinement.

Haotian Ye·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Improvements to the post-processing of weather forecasts using machine learning and feature selection

LightGBM post-processing improves weather forecasts for precipitation, temperature, wind via feature selection on JMA data.

Kazuma Iwase·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

FedSEA: Achieving Benefit of Parallelization in Federated Online Learning

FedSEA framework extends federated online learning with stochastically extended adversary for privacy-preserving decentralized learning.

Harekrushna Sahu·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

When Active Learning Falls Short: An Empirical Study on Chemical Reaction Extraction

Empirical study shows active learning limitations for chemical reaction extraction despite uncertainty/diversity strategies.

Simin Yu·2 months ago

r/ClaudeAI· COMMUNITY

Claude: complicated task let's do it tomorrow!

Empty post with no content.

u/Nemo1985·2 months ago·36 pts / 25 comm

arXiv (cs.AI/CL/LG)· ACADEMIA

Evaluating LLM-Driven Summarisation of Parliamentary Debates with Computational Argumentation

Framework evaluates LLM-generated parliamentary debate summaries using computational argumentation for faithfulness assessment.

Eoghan Cunningham·2 months ago

r/LocalLLaMA· COMMUNITY

Unpopular opinion: OpenClaw and all its clones are almost useless tools for those who know what they're doing. It's kind of impressive for someone who has never used a CLI, Claude Code, Codex, etc. Nor used any workflow tool like 8n8 or make.

Critique arguing that agentic tools like OpenClaw offer little to experienced developers, though they may impress newcomers who have never used a CLI or a workflow tool.

u/pacmanpill·2 months ago·480 pts / 195 comm

The Verge AI· PRESS

Yelp is making its AI chatbot way more useful

Yelp is giving its chatbot assistant a major upgrade, turning the platform into something closer to a digital concierge with a suite of new features designed for "getting things done." The move, one of several AI-focused updates in recent months, is part of a broader industry push to make AI more relevant and practically useful to consumers while turning huge troves of user-generated data into a competitive edge. In a press release, Yelp says the Yelp Assistant chatbot will be at "the center of the app experience," where it can answer questions, make recommendations, and even handle bookings ...

Robert Hart·2 months ago

r/ClaudeAI· COMMUNITY

New fear unlocked: Claude can run Bash tool with dangerouslyDisableSandbox when it wishes to do so

In Claude Code's Auto mode, tools run without per-call approval; the post reports that Claude can invoke the Bash tool with dangerouslyDisableSandbox on its own initiative.

u/somerussianbear·2 months ago·148 pts / 66 comm

arXiv (cs.AI/CL/LG)· ACADEMIA

PLaMo 2.1-VL Technical Report

PLaMo 2.1-VL: lightweight 2B/8B VLM for edge deployment with Japanese support, visual grounding, factory/infrastructure applications.

Tommi Kerola·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Concept Inconsistency in Dermoscopic Concept Bottleneck Models: A Rough-Set Analysis of the Derm7pt Dataset

Rough set analysis reveals concept-level inconsistencies in Derm7pt dermoscopy dataset limit concept bottleneck model accuracy.

Gonzalo Nápoles·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models

RDP LoRA uses geometry-driven trajectory analysis to identify optimal LoRA adapter placement in LLM fine-tuning without training.

Yusuf Çelebi·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

On the Conditioning Consistency Gap in Conditional Neural Processes

Conditioning consistency gap quantifies KL divergence between Conditional Neural Process predictions when context points are added.

Robin Young·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Co-Refine: AI-Powered Tool Supporting Qualitative Analysis

Co-Refine provides real-time AI feedback on coding drift in qualitative research without disrupting analyst workflow.

Athikash Jeyaganthan·2 months ago

Hugging Face· INFRA

QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard

Hugging Face·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Large Language Models Exhibit Normative Conformity

LLMs exhibit normative conformity in multi-agent discussions distinct from informational conformity, mechanistically analyzed via new tasks.

Mikako Bito·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

HalluAudio: first large-scale benchmark for hallucination detection across speech, environmental sound, and music in audio-language models.

Feiyu Zhao·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

Small Language Models (<10B params) deployed via agent paradigms (tool use, multi-agent) match larger models with reduced compute.

Xinlin Wang·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text

IndiaFinBench: first public benchmark for LLM evaluation on Indian financial regulatory text (SEBI, RBI documents).

Rajveer Singh Pall·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Debiased neural operators for estimating functionals

DOPE: semiparametric estimator for scalar functionals from neural operator solution trajectories with debiasing.

Konstantin Hess·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

TEMPO: Scaling Test-time Training for Large Reasoning Models

TEMPO scales test-time training for LLMs via periodic critic recalibration to avoid reward drift and diversity collapse.

Qingyang Zhang·2 months ago

Stratechery· ANALYST

Tim Cook’s Impeccable Timing

Tim Cook's departure as Apple CEO analyzed through lens of timing and business cycles; not AI-specific.

Ben Thompson·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs

LocQA benchmark (2,156 locale-ambiguous questions in 12 languages) exposes inter/intra-lingual biases in multilingual LLMs on facts, laws, and measurements.

Guy Mor-Lan·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

VB-Score evaluation framework for medical QA systems assesses accuracy and health equity risks beyond semantic similarity matching.

Abu Noman Md Sakib·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Explicit Trait Inference for Multi-Agent Coordination

Explicit Trait Inference (ETI) method uses warmth/competence dimensions to improve LLM-based multi-agent coordination, reducing payoff loss by 45-77%.

Suhaib Abdurahman·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

HarDBench benchmark evaluates LLM robustness against draft-based co-authoring jailbreak attacks in collaborative writing scenarios.

Euntae Kim·2 months ago

r/LocalLLaMA· COMMUNITY

Surprising screenshot - Most token usage is non-coders (openrouter ranking)

Screenshot of OpenRouter rankings suggests most token usage comes from non-coding apps rather than coding tools; Hermes adoption noted.

u/superloser48·2 months ago·184 pts / 109 comm

r/ClaudeAI· COMMUNITY

Why the huge divergence in lovers and haters of Claude Opus 4.7?

Contradictory user reports on Claude Opus 4.7: some praise instruction adherence and comprehension; others cite regression in context and creativity.

u/entheosoul·2 months ago·66 pts / 93 comm

arXiv (cs.AI/CL/LG)· ACADEMIA

CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

CulturALL benchmark assesses LLMs on grounded multilingual/multicultural tasks via human-AI collaborative framework across real-world scenarios.

Peiqin Lin·2 months ago

30 stories