feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp
Xiaomi releases Mimo v2.5, a 310B sparse MoE multimodal model with 1M token context supporting text, image, video, and audio.
Every story matching this topic across titles and summaries, newest first.
Qwen3.6-27B with Multi-Token Prediction achieves 2.5x throughput via Unsloth quantization and llama.cpp integration.
Qwen 3.6 27B achieves 2.5x inference speedup via MTP speculative decoding in llama.cpp; 262k context on 48GB with fixed chat templates.
Qwen 27B achieves 54 t/s on V100 GPU with MTP optimization in llama.cpp, nearly 2x baseline speed for code review and tool use tasks.
Cyera reports critical unauthenticated memory leak vulnerability in Ollama enabling unauthorized data access.
User reports successful MTP speculative decoding on AMD Strix Halo (AI Max 395) with llama.cpp, achieving 60-80 tok/s on a Qwen 3.6B GGUF.
Meta is facing a class action lawsuit filed by five major book publishers and one author over claims the company "engaged in one of the most massive infringements of copyrighted materials in history" when training its Llama AI models, as reported earlier by The New York Times. In their suit, Macmillan, McGraw-Hill, Elsevier, Hachette, Cengage, and author Scott Turow allege that Meta "repeatedly copied" their books and journal articles without permission. The lawsuit accuses Meta of knowingly ripping copyrighted work from "notorious pirate sites," such as LibGen, Anna's Archive, Sci-Hub, Sci-M...
Unexpected email to wake up to but I am here for it! Model agnostic tools are the way! This is huge!
MTP format support coming to llama.cpp; DeepSeekv3, Qwen3.5, GLM4.5, and other models compatible pending native weights.
FastDMS achieves 6.4× KV-cache compression on Llama 3.2 1B via learned token eviction, matching vLLM performance with lower memory overhead.
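The FastDMS eviction policy itself isn't detailed here; as rough background on the general idea (score cached tokens, keep the top-scoring ones, drop the rest), here is a minimal NumPy sketch. The function name and the attention-based scoring rule are illustrative assumptions, not the learned FastDMS method.

```python
import numpy as np

def evict_kv_cache(keys, values, attn_weights, budget):
    """Toy KV-cache eviction: keep the `budget` cached tokens that received the
    most cumulative attention, drop the rest. Illustrative only; FastDMS uses a
    learned eviction policy, which this sketch does not reproduce.

    keys, values: (seq_len, head_dim) cached tensors for one attention head
    attn_weights: (num_queries, seq_len) attention weights from recent steps
    """
    scores = attn_weights.sum(axis=0)             # cumulative attention per cached token
    keep = np.sort(np.argsort(scores)[-budget:])  # top-`budget` tokens, kept in original order
    return keys[keep], values[keep]

# Example: compress a 16-token cache down to 4 entries
rng = np.random.default_rng(0)
k, v = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
w = rng.random(size=(4, 16))
k_small, v_small = evict_kv_cache(k, v, w, budget=4)
print(k_small.shape)  # (4, 8)
```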
llama.cpp adds beta MTP (Multi-Token Prediction) support, starting with Qwen3.5, closing performance gap with vLLM on token generation.
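As background on why MTP-style drafting narrows the generation gap: cheap draft tokens are proposed ahead of time and the main model verifies them, accepting the longest agreeing prefix so several tokens can be emitted per main-model step. A minimal, model-agnostic Python sketch of that accept/reject rule follows; the `draft` and `verify` callables are placeholders, not the llama.cpp API.

```python
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft: Callable[[List[int], int], List[int]],
                     verify: Callable[[List[int]], int],
                     n_draft: int = 4) -> List[int]:
    """One speculative / multi-token-prediction style decoding step.

    draft(prefix, n) cheaply proposes n candidate tokens; verify(prefix)
    returns the main model's next token. Drafted tokens are accepted while
    the main model agrees; on the first disagreement the main model's token
    is kept instead. (A real implementation verifies all drafted positions
    in a single batched forward pass; this loop only shows the accept rule.)
    """
    accepted: List[int] = []
    for tok in draft(prefix, n_draft):
        target = verify(prefix + accepted)
        if target == tok:
            accepted.append(tok)        # draft confirmed: an (almost) free token
        else:
            accepted.append(target)     # mismatch: fall back to the main model
            break
    else:
        accepted.append(verify(prefix + accepted))  # bonus token after full acceptance
    return accepted

# Toy demo with a periodic "language": both models continue the pattern 0,1,2,0,1,2,...
draft_fn  = lambda p, n: [(p[-1] + i + 1) % 3 for i in range(n)]
verify_fn = lambda p: (p[-1] + 1) % 3
print(speculative_step([0, 1, 2], draft_fn, verify_fn))  # -> [0, 1, 2, 0, 1]
```

With a high acceptance rate, each step emits several tokens for roughly one main-model pass, which is where the reported ~2x speedups come from.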
Quantized Llama 405B and DeepSeek models now achieve 20-100 tokens/sec on consumer hardware, up from 1 token/sec two years ago.
Developer reports local Qwen 27B setup with llama-server now competitive with Claude Code and Cursor for coding tasks, driven by cloud provider cost increases.
Community tool adds phrase-filtering capability to llama.cpp inference engine via GitHub script.
User shares Qwen3.6-27B quantized setup with RTX 5090 and llama.cpp configuration parameters.
r/LocalLLaMA moderators report positive community response to new rules reducing spam after one week.
PFlash: speculative prefill technique achieves 10x speedup on 128K context with quantized 27B models on RTX 3090, open-source C++/CUDA implementation.
gemma-4-31B-it-DFlash open-weights model released on Hugging Face, pending llama.cpp integration.
User demonstrates DFlash speculative decoding in llama.cpp with Qwen3.5-35B-A3B on RTX 2080 SUPER 8GB, achieving inference on VRAM-constrained hardware.
llama-swap adds matrix grouping feature for multi-model orchestration and intelligent VRAM swap scheduling.
Developer built local PDF-to-audiobook app using Kokoro 82M TTS, Qwen, and llama.cpp with Tauri 2.0 on M1 Mac.
PS5 Linux exploit proposed as potential hardware for local LLM inference via llama.cpp.
ik_llama.cpp adds Qwen3.5 MTP support with 50% throughput gain (18-20 → 30 tok/s) via pipeline parallelism on 27B model.
llama.cpp adds native NVFP4 quantization support for Blackwell GPUs with benchmark results on RTX 5090.
llama.cpp merged SM120 native NVFP4 quantization support; community released GGUFs for Gemma-4-31B and Nemotron-Cascade models.
Lemonade OmniRouter unifies local AI inference across text, image, audio, and vision modalities via single OpenAI-compatible endpoint using llama.cpp, sd.cpp, and Whisper.
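Because such routers expose the standard OpenAI-compatible API, existing clients can target them by switching the base URL. A minimal sketch with the official `openai` Python package follows; the port and model name are assumptions, not OmniRouter defaults.

```python
from openai import OpenAI

# Point the standard client at a local OpenAI-compatible server
# (e.g. llama.cpp's llama-server, or a router in front of several backends).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local-model",  # assumed name; use whatever the local server advertises
    messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
)
print(resp.choices[0].message.content)
```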
Qwen3.6-27B IQ4_XS quantization bloat analysis; reverting llama.cpp commit reduces VRAM from 15.1GB to 14.7GB with 110k context.
Reddit discussion analyzing tensions within r/LocalLLaMA community between open-weights advocates and commercial interests.
Reddit user seeks recommendations for large open-weight models to run locally on 56GB VRAM using llama.cpp.
User reports GBNF grammar optimizations for Qwen 3.6 35B and 27B models improving coding task performance in llama.cpp.
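For context on what a GBNF tweak looks like in practice: llama.cpp's server accepts a `grammar` field on its `/completion` endpoint that constrains sampling to the grammar. A minimal sketch follows; the tiny grammar and the local port are illustrative, not the optimizations the post describes.

```python
import requests

# A tiny GBNF grammar forcing the model to answer with "yes" or "no".
GRAMMAR = r'''
root ::= "yes" | "no"
'''

resp = requests.post(
    "http://localhost:8080/completion",   # llama-server's default local port is assumed
    json={
        "prompt": "Is llama.cpp written in C++? Answer yes or no: ",
        "grammar": GRAMMAR,               # constrains token sampling to the grammar
        "n_predict": 4,
    },
)
print(resp.json()["content"])
```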
Tutorial: running a local coding agent with Gemma 4 and Pi using llama.cpp for on-device inference.
Reddit user seeks advice on setting up local coding agents like Claude Code with open-weight models via llama.cpp.
llama.cpp adds experimental DeepSeek v4 Flash support with aggressive 2-bit quantization, achieving 17 tokens/sec on M3 Max with 128GB RAM requirement.
Llama.cpp benchmarks on Windows 11 vs Lubuntu 26.04 with RTX 5080 show significant OS-level performance variance in local inference.
User demonstrates PaddleOCR-VL-1.5 multimodal inference via llama.cpp server for end-to-end document digitization with layout and table handling.
ik_llama.cpp maintainer seeks volunteers to develop Vulkan backend support for CPU/GPU inference optimization.
llama.cpp and ik_llama.cpp now support FP4 inference with different formats: NVFP4 (E2M1 values with Nvidia's FP8 E4M3 block scales) and MXFP4 (the OCP MX standard with power-of-two block scales) across varying hardware backends.
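As background on what the two FP4 flavors share, here is a toy NumPy sketch of block-wise 4-bit quantization: one scale per block, values snapped to the E2M1 grid. The function name, block size, and scale handling are illustrative assumptions, not the actual ggml kernels.

```python
import numpy as np

# Representable magnitudes of a 4-bit E2M1 float (the element format of both NVFP4 and MXFP4).
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_block(x, block_size=32, power_of_two_scale=True):
    """Toy block-wise FP4 quantization: one scale per block, elements snapped to
    the nearest E2M1 value. power_of_two_scale=True mimics MXFP4-style E8M0
    scales; False keeps a full-precision scale as a rough stand-in for NVFP4's
    FP8 scales (which also use smaller 16-element blocks). Illustrative only."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for start in range(0, x.size, block_size):
        block = x[start:start + block_size]
        scale = np.max(np.abs(block)) / FP4_LEVELS[-1] or 1.0
        if power_of_two_scale:
            scale = 2.0 ** np.ceil(np.log2(scale))
        scaled = np.abs(block) / scale
        idx = np.abs(scaled[:, None] - FP4_LEVELS[None, :]).argmin(axis=1)
        out[start:start + block_size] = np.sign(block) * FP4_LEVELS[idx] * scale
    return out

w = np.random.default_rng(1).normal(size=64)
print(np.max(np.abs(w - quantize_fp4_block(w))))  # worst-case quantization error
```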
Developer documents practical setup for running Qwen 3.6 35B on M2 MacBook Pro 32GB via llama.cpp, with performance notes and optimization tips.
Reddit post celebrating current state of local LLM deployment without specific technical claims or data.
Reddit discussion asking about TurboQuant KV cache optimization implementation in llama.cpp.
Qwen 3.6 35B-A3B MoE model achieves 250+ tok/s on AMD Radeon 780M iGPU via llama.cpp Vulkan.
Simon Willison ports LlamaIndex's LiteParse PDF text extraction tool to run in-browser, using spatial parsing and Tesseract OCR without ML models.
Technical walkthrough: Qwen 3.6 27B achieves 85 TPS, 125K context on single RTX 3090 using llama.cpp.
Performance test: Qwen 3.6 27B with speculative decoding achieves 25.53 tokens/sec with 2x speedup on local hardware.
User questions absence of consumer inference chips ($200 devices running Llama 3 locally) despite industry investment.
llama.cpp Vulkan and SYCL benchmarks comparing Nvidia RTX 3090 vs Intel Arc Pro B70 on prompt processing and token generation.
User demonstrates Qwen3.6-27B running locally via llama-server with 200k context on dual RTX 3090, achieving coding performance cheaper than Claude.
User demonstrates llama.cpp auto-fit enables 57 t/s on Qwen3.6 Q8 256k context despite weights exceeding 32GB VRAM.
Plasma 1.0: 235M-param LLaMA-style model trained from scratch on single RTX 5080 GPU.
Open WebUI Desktop released with local llama.cpp support and remote server connectivity options.
Commentary comparing llama.cpp infrastructure dominance to Linux in LLM ecosystem.
Community discussion on why OSS AI tools prioritize Ollama over llama.cpp despite engineering parity.
Six Llama-3.1-8B variants fine-tuned on Christian, Islamic, Jewish, Hindu, and Buddhist texts reveal systematic differences in ethical reasoning patterns.
Cross-linguistic study of politeness effects on 5 LLMs (Gemini-Pro, GPT-4o Mini, Claude 3 Sonnet, DeepSeek-Chat, Llama 3) via 22,500 English/Hindi/Spanish prompts.
Benchmark compares token pruning compression across Qwen3, Gemma-3, Llama-3, Aya for Korean-centric NLP with English-Korean vocabulary optimization.