How to run a local coding agent with Gemma 4 and Pi | Patrick Loeber
Tutorial: running a local coding agent with Gemma 4 and Pi using llama.cpp for on-device inference.
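As a rough sketch of the setup the tutorial covers: serve a local instruction-tuned GGUF model with llama.cpp's llama-server, then point the coding agent at its OpenAI-compatible chat endpoint. The model filename, port, and prompt below are placeholder assumptions for illustration, not details taken from the tutorial.

```python
# Minimal sketch, assuming llama-server is already running with a local GGUF model, e.g.:
#   llama-server -m gemma-4-instruct-Q4_K_M.gguf -c 8192 --port 8080
# The model filename and port are placeholders, not values from the tutorial.
import requests

LLAMA_SERVER = "http://localhost:8080/v1/chat/completions"  # llama-server's OpenAI-compatible endpoint

def ask_local_model(prompt: str) -> str:
    """Send a single coding request to the locally served model and return its reply."""
    payload = {
        "messages": [
            {"role": "system", "content": "You are a concise coding assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.2,
        "max_tokens": 512,
    }
    resp = requests.post(LLAMA_SERVER, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask_local_model("Write a Python function that reverses a linked list."))
```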
Reddit user seeks advice on setting up local coding agents like Claude Code with open-weight models via llama.cpp.
llama.cpp adds experimental DeepSeek v4 Flash support with aggressive 2-bit quantization, achieving 17 tokens/sec on an M3 Max (128GB RAM required).
Llama.cpp benchmarks on Windows 11 vs Lubuntu 26.04 with RTX 5080 show significant OS-level performance variance in local inference.
User demonstrates PaddleOCR-VL-1.5 multimodal inference via llama.cpp server for end-to-end document digitization with layout and table handling.
ik_llama.cpp maintainer seeks volunteers to develop Vulkan backend support for CPU/GPU inference optimization.
llama.cpp and ik_llama.cpp now support FP4 inference with different formats: NVFP4 (Nvidia E4M3) and MXFP4 (MX standard) across varying hardware backends.
Developer documents practical setup for running Qwen 3.6 35B on M2 MacBook Pro 32GB via llama.cpp, with performance notes and optimization tips.
Reddit post celebrating current state of local LLM deployment without specific technical claims or data.
Reddit discussion asking about TurboQuant KV cache optimization implementation in llama.cpp.
Qwen 3.6 35B-A3B MoE model achieves 250+ tok/s on AMD Radeon 780M iGPU via llama.cpp Vulkan.
Simon Willison ports LlamaIndex's LiteParse PDF text extraction tool to run in-browser, using spatial parsing and Tesseract OCR without ML models.
Technical walkthrough: Qwen 3.6 27B achieves 85 TPS, 125K context on single RTX 3090 using llama.cpp.
Performance test: Qwen 3.6 27B with speculative decoding achieves 25.53 tokens/sec with 2x speedup on local hardware.
User questions absence of consumer inference chips ($200 devices running Llama 3 locally) despite industry investment.
llama.cpp Vulkan and SYCL benchmarks comparing Nvidia RTX 3090 vs Intel Arc Pro B70 on prompt processing and token generation.
User demonstrates Qwen3.6-27B running locally via llama-server with 200k context on dual RTX 3090s, delivering coding performance at lower cost than Claude.
User demonstrates that llama.cpp auto-fit enables 57 t/s on Qwen3.6 Q8 with 256k context, despite the weights exceeding 32GB of VRAM.
Plasma 1.0: 235M-param LLaMA-style model trained from scratch on single RTX 5080 GPU.
Open WebUI Desktop released with local llama.cpp support and remote server connectivity options.
Commentary comparing llama.cpp infrastructure dominance to Linux in LLM ecosystem.
Community discussion on why OSS AI tools prioritize Ollama over llama.cpp despite engineering parity.
Six Llama-3.1-8B variants fine-tuned on Christian, Islamic, Jewish, Hindu, Buddhist texts reveal systematic differences in ethical reasoning patterns.
Cross-linguistic study of politeness effects on 5 LLMs (Gemini-Pro, GPT-4o Mini, Claude 3 Sonnet, DeepSeek-Chat, Llama 3) via 22,500 English/Hindi/Spanish prompts.
Benchmark compares token pruning compression across Qwen3, Gemma-3, Llama-3, Aya for Korean-centric NLP with English-Korean vocabulary optimization.