llama.cpp pull request #22493 by AesSedai adds Mimo v2.5 model support (ggml-org/llama.cpp).
Xiaomi releases Mimo v2.5, a 310B sparse MoE multimodal model with 1M token context supporting text, image, video, and audio.
Qwen3.6-27B with Multi-Token Prediction achieves 2.5x throughput via Unsloth quantization and llama.cpp integration.
Qwen 3.6 27B achieves 2.5x inference speedup via MTP speculative decoding in llama.cpp; 262k context on 48GB with fixed chat templates.
Qwen 27B achieves 54 t/s on V100 GPU with MTP optimization in llama.cpp, nearly 2x baseline speed for code review and tool use tasks.
Cyera reports critical unauthenticated memory leak vulnerability in Ollama enabling unauthorized data access.
User reports successful MTP speculative decoding on AMD Strix Halo (AI Max 395) with llama.cpp achieving 60-80 tok/s on Qwen 3.6B GGUF.
Meta is facing a class action lawsuit filed by five major book publishers and one author over claims the company "engaged in one of the most massive infringements of copyrighted materials in history" when training its Llama AI models, as reported earlier by The New York Times. In their suit, Macmillan, McGraw-Hill, Elsevier, Hachette, Cengage, and author Scott Turow allege that Meta "repeatedly copied" their books and journal articles without permission. The lawsuit accuses Meta of knowingly ripping copyrighted work from "notorious pirate sites," such as LibGen, Anna's Archive, Sci-Hub, Sci-M...
Unexpected email to wake up to, but I am here for it! Model-agnostic tools are the way! This is huge!
MTP format support coming to llama.cpp; DeepSeek-V3, Qwen3.5, GLM-4.5, and other models compatible once native MTP weights are available.
FastDMS achieves 6.4× KV-cache compression on Llama 3.2 1B via learned token eviction, matching vLLM performance with lower memory overhead.
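The item gives only the compression figure, not FastDMS's learned eviction policy. As a rough illustration of score-based KV-cache token eviction (a generic heuristic, not the paper's method), here is a minimal numpy sketch; 1/6.4 ≈ 0.156 is the fraction of entries kept at a 6.4× compression target:

```python
import numpy as np

def evict_kv_tokens(attn_scores: np.ndarray, keep_ratio: float = 0.156):
    """Toy KV-cache eviction: keep the tokens that received the most
    cumulative attention, drop the rest.

    attn_scores: (num_queries, num_keys) attention weights from recent steps.
    keep_ratio:  fraction of KV entries to retain (1/6.4 ~= 0.156).
    Returns indices of KV entries to keep, in original order.
    """
    importance = attn_scores.sum(axis=0)        # cumulative attention per key
    n_keep = max(1, int(len(importance) * keep_ratio))
    keep = np.argsort(importance)[-n_keep:]     # highest-scoring keys
    return np.sort(keep)

# Example: 64 cached tokens, random attention weights for illustration.
rng = np.random.default_rng(0)
print(evict_kv_tokens(rng.random((8, 64))))
```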
llama.cpp adds beta MTP (Multi-Token Prediction) support, starting with Qwen3.5, closing performance gap with vLLM on token generation.
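The item doesn't show the new MTP flags. For comparison, MTP drafts tokens with the model's own prediction head, whereas llama-server's existing speculative decoding path uses a separate small draft model; a minimal launch sketch of that existing path (model paths are placeholders):

```python
import subprocess

# Launch llama-server with classic draft-model speculative decoding.
# MTP would replace the separate draft model with the target model's own
# prediction head; these are the pre-existing draft-model flags.
subprocess.run([
    "llama-server",
    "-m", "Qwen3.5-35B-A3B-Q4_K_M.gguf",   # target model (placeholder path)
    "-md", "Qwen3.5-0.5B-Q8_0.gguf",        # small draft model (placeholder)
    "--draft-max", "16",                     # max tokens drafted per step
    "--draft-min", "4",                      # min draft tokens before speculating
    "-ngl", "99",                            # offload all layers to GPU
    "-c", "32768",                           # context size
    "--port", "8080",
])
```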
Quantized Llama 405B and DeepSeek models now achieve 20-100 tokens/sec on consumer hardware, up from 1 token/sec two years ago.
Developer reports local Qwen 27B setup with llama-server now competitive with Claude Code and Cursor for coding tasks, a switch driven by cloud provider cost increases.
Community tool adds phrase-filtering capability to llama.cpp inference engine via GitHub script.
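The linked script's exact approach (resampling vs. post-filtering) isn't specified in the item; as a simplified stand-in, here is a sketch of stripping banned phrases from a streamed completion, buffering just enough text to catch phrases that span chunk boundaries:

```python
BANNED = ["as an AI language model", "delve into"]

def filter_stream(chunks, banned=BANNED):
    """Yield streamed text with banned phrases removed.

    Keeps a small tail buffer so a phrase split across two chunks
    is still caught. Illustrative only, not the linked script.
    """
    tail = max(len(p) for p in banned) - 1
    buf = ""
    for chunk in chunks:
        buf += chunk
        for phrase in banned:
            buf = buf.replace(phrase, "")
        if len(buf) > tail:
            yield buf[:-tail]
            buf = buf[-tail:]
    yield buf

print("".join(filter_stream(["I will delve", " into the code."])))
```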
User shares Qwen3.6-27B quantized setup with RTX 5090 and llama.cpp configuration parameters.
r/LocalLLaMA moderators report positive community response to new rules reducing spam after one week.
PFlash: speculative prefill technique achieves 10x speedup on 128K context with quantized 27B models on RTX 3090, open-source C++/CUDA implementation.
gemma-4-31B-it-DFlash open-weights model released on Hugging Face, pending llama.cpp integration.
User demonstrates DFlash speculative decoding in llama.cpp with Qwen3.5-35B-A3B on RTX 2080 SUPER 8GB, achieving inference on VRAM-constrained hardware.
llama-swap adds matrix grouping feature for multi-model orchestration and intelligent VRAM swap scheduling.
Developer built local PDF-to-audiobook app using Kokoro 82M TTS, Qwen, and llama.cpp with Tauri 2.0 on M1 Mac.
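The app's code isn't shown; a minimal sketch of the PDF-side of such a pipeline, using the real pypdf library for extraction with a hypothetical stand-in for the Kokoro 82M synthesis call (whose API the item doesn't give):

```python
from pypdf import PdfReader  # real library; TTS hook below is hypothetical

def pdf_to_chunks(path: str, max_chars: int = 2000):
    """Extract PDF text and split it into TTS-sized chunks at sentence ends."""
    text = " ".join(page.extract_text() or "" for page in PdfReader(path).pages)
    chunk, out = "", []
    for sentence in text.replace("\n", " ").split(". "):
        if chunk and len(chunk) + len(sentence) > max_chars:
            out.append(chunk)
            chunk = ""
        chunk += sentence + ". "
    if chunk:
        out.append(chunk)
    return out

def synthesize(text: str) -> bytes:
    # Hypothetical hook: the real app uses Kokoro 82M TTS here.
    raise NotImplementedError("replace with the Kokoro 82M synthesis call")

for i, chunk in enumerate(pdf_to_chunks("book.pdf")):
    audio = synthesize(chunk)  # one audio segment per chunk
```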
PS5 running Linux via exploit proposed as potential hardware for local LLM inference with llama.cpp.
ik_llama.cpp adds Qwen3.5 MTP support with a 50% throughput gain (18-20 → 30 tok/s) via pipeline parallelism on the 27B model.
llama.cpp adds native NVFP4 quantization support for Blackwell GPUs with benchmark results on RTX 5090.
llama.cpp merged SM120 native NVFP4 quantization support; community released GGUFs for Gemma-4-31B and Nemotron-Cascade models.
Lemonade OmniRouter unifies local AI inference across text, image, audio, and vision modalities via single OpenAI-compatible endpoint using llama.cpp, sd.cpp, and Whisper.
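Any OpenAI-compatible endpoint can be queried the same way; a minimal sketch, where the URL, port, and model alias are assumptions for illustration rather than OmniRouter's documented defaults:

```python
import requests

# Query an OpenAI-compatible endpoint. URL, port, and model alias are
# placeholders; adjust to however the router is actually configured.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "local-text",  # routed model alias (hypothetical)
        "messages": [{"role": "user",
                      "content": "Summarize llama.cpp in one line."}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```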
Qwen3.6-27B IQ4_XS quantization bloat analysis; reverting llama.cpp commit reduces VRAM from 15.1GB to 14.7GB with 110k context.
Reddit discussion analyzing tensions within r/LocalLLaMA community between open-weights advocates and commercial interests.
Reddit user seeks recommendations for large open-weight models to run locally on 56GB VRAM using llama.cpp.
User reports GBNF grammar optimizations for Qwen 3.6 35B and 27B models improving coding task performance in llama.cpp.
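The post's grammars aren't shown; as a small illustration of the mechanism, llama-server's native /completion endpoint accepts a GBNF grammar that constrains sampling, here forcing a one-field JSON verdict (prompt and grammar are examples, not the user's):

```python
import requests

# GBNF grammar forcing output of the form {"verdict": "pass"} or
# {"verdict": "fail"} -- illustrative, not the grammars from the post.
GRAMMAR = r'''
root ::= "{" ws "\"verdict\"" ws ":" ws verdict ws "}"
verdict ::= "\"pass\"" | "\"fail\""
ws ::= [ \t\n]*
'''

resp = requests.post(
    "http://localhost:8080/completion",  # llama-server native endpoint
    json={
        "prompt": "Does this function compile? int f() { return 1; }\nAnswer:",
        "n_predict": 32,
        "grammar": GRAMMAR,
    },
    timeout=120,
)
print(resp.json()["content"])
```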