Most people seem obsessed with token generation speed, but isn’t prefill the real bottleneck? Am I missing something?
A Reddit discussion arguing that prefill latency is underemphasized relative to token generation speed in local LLM benchmarking and optimization.
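The intuition behind the question can be sketched with a simple latency model: total response time splits into prefill (processing the prompt, which sets time-to-first-token) and decode (generating output tokens one at a time). The throughput numbers below are illustrative assumptions, not measured benchmarks; actual values vary widely with hardware and model.

```python
# Hypothetical latency model: prefill vs. decode time for one request.
# prefill_tps and decode_tps are assumed throughputs, not real benchmarks.

def request_latency(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float = 500.0,
                    decode_tps: float = 30.0) -> tuple[float, float]:
    """Return (prefill_seconds, decode_seconds)."""
    prefill = prompt_tokens / prefill_tps   # time to first token (TTFT)
    decode = output_tokens / decode_tps     # token generation time
    return prefill, decode

# Short prompt: decode dominates, so tokens/sec feels like the bottleneck.
p, d = request_latency(prompt_tokens=200, output_tokens=300)
print(f"short prompt: prefill {p:.1f}s, decode {d:.1f}s")  # prefill 0.4s, decode 10.0s

# Long-context prompt (e.g. RAG or pasting a large file): prefill dominates.
p, d = request_latency(prompt_tokens=32000, output_tokens=300)
print(f"long prompt:  prefill {p:.1f}s, decode {d:.1f}s")  # prefill 64.0s, decode 10.0s
```

Under these assumed numbers, a 32k-token prompt spends far longer in prefill than in generation, which is why long-context local use can feel slow even when the reported tokens/sec is good.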