feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp
Xiaomi releases Mimo v2.5, a 310B sparse MoE multimodal model with 1M token context supporting text, image, video, and audio.
Every story matching this topic across titles and summaries, newest first.
Qwen3.6-27B with Multi-Token Prediction achieves 2.5x throughput via Unsloth quantization and llama.cpp integration.
Qwen 3.6 27B achieves 2.5x inference speedup via MTP speculative decoding in llama.cpp; 262k context on 48GB with fixed chat templates.
Qwen 27B achieves 54 t/s on V100 GPU with MTP optimization in llama.cpp, nearly 2x baseline speed for code review and tool use tasks.
Cyera reports critical unauthenticated memory leak vulnerability in Ollama enabling unauthorized data access.
User reports successful MTP speculative decoding on AMD Strix Halo (AI Max 395) with llama.cpp, achieving 60-80 tok/s on a Qwen 3.6B GGUF.
Meta is facing a class action lawsuit filed by five major book publishers and one author over claims the company "engaged in one of the most massive infringements of copyrighted materials in history" when training its Llama AI models, as reported earlier by The New York Times. In their suit, Macmillan, McGraw-Hill, Elsevier, Hachette, Cengage, and author Scott Turow allege that Meta "repeatedly copied" their books and journal articles without permission. The lawsuit accuses Meta of knowingly ripping copyrighted work from "notorious pirate sites," such as LibGen, Anna's Archive, Sci-Hub, Sci-M...
Unexpected email to wake up to but I am here for it! Model agnostic tools are the way! This is huge!
MTP format support coming to llama.cpp; DeepSeekv3, Qwen3.5, GLM4.5, and other models compatible pending native weights.
FastDMS achieves 6.4× KV-cache compression on Llama 3.2 1B via learned token eviction, matching vLLM performance with lower memory overhead.
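The FastDMS eviction policy itself isn't detailed here; as rough background on the general idea (score cached tokens, keep the top-scoring ones, drop the rest), here is a minimal NumPy sketch. The function name and the attention-based scoring rule are illustrative assumptions, not the learned FastDMS method.

```python
import numpy as np

def evict_kv_cache(keys, values, attn_weights, budget):
    """Toy KV-cache eviction: keep the `budget` cached tokens that received the
    most cumulative attention, drop the rest. Illustrative only; FastDMS uses a
    learned eviction policy, which this sketch does not reproduce.

    keys, values: (seq_len, head_dim) cached tensors for one attention head
    attn_weights: (num_queries, seq_len) attention weights from recent steps
    """
    scores = attn_weights.sum(axis=0)             # cumulative attention per cached token
    keep = np.sort(np.argsort(scores)[-budget:])  # top-`budget` tokens, kept in original order
    return keys[keep], values[keep]

# Example: compress a 16-token cache down to 4 entries
rng = np.random.default_rng(0)
k, v = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
w = rng.random(size=(4, 16))
k_small, v_small = evict_kv_cache(k, v, w, budget=4)
print(k_small.shape)  # (4, 8)
```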
llama.cpp adds beta MTP (Multi-Token Prediction) support, starting with Qwen3.5, closing performance gap with vLLM on token generation.
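As background on why MTP-style drafting narrows the generation gap: cheap draft tokens are proposed ahead of time and the main model verifies them, accepting the longest agreeing prefix so several tokens can be emitted per main-model step. A minimal, model-agnostic Python sketch of that accept/reject rule follows; the `draft` and `verify` callables are placeholders, not the llama.cpp API.

```python
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft: Callable[[List[int], int], List[int]],
                     verify: Callable[[List[int]], int],
                     n_draft: int = 4) -> List[int]:
    """One speculative / multi-token-prediction style decoding step.

    draft(prefix, n) cheaply proposes n candidate tokens; verify(prefix)
    returns the main model's next token. Drafted tokens are accepted while
    the main model agrees; on the first disagreement the main model's token
    is kept instead. (A real implementation verifies all drafted positions
    in a single batched forward pass; this loop only shows the accept rule.)
    """
    accepted: List[int] = []
    for tok in draft(prefix, n_draft):
        target = verify(prefix + accepted)
        if target == tok:
            accepted.append(tok)        # draft confirmed: an (almost) free token
        else:
            accepted.append(target)     # mismatch: fall back to the main model
            break
    else:
        accepted.append(verify(prefix + accepted))  # bonus token after full acceptance
    return accepted

# Toy demo with a periodic "language": both models continue the pattern 0,1,2,0,1,2,...
draft_fn  = lambda p, n: [(p[-1] + i + 1) % 3 for i in range(n)]
verify_fn = lambda p: (p[-1] + 1) % 3
print(speculative_step([0, 1, 2], draft_fn, verify_fn))  # -> [0, 1, 2, 0, 1]
```

With a high acceptance rate, each step emits several tokens for roughly one main-model pass, which is where the reported ~2x speedups come from.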
Quantized Llama 405B and DeepSeek models now achieve 20-100 tokens/sec on consumer hardware, up from 1 token/sec two years ago.
Developer reports local Qwen 27B setup with llama-server now competitive with Claude Code and Cursor for coding tasks, driven by cloud provider cost increases.
Community tool adds phrase-filtering capability to llama.cpp inference engine via GitHub script.
User shares Qwen3.6-27B quantized setup with RTX 5090 and llama.cpp configuration parameters.
r/LocalLLaMA moderators report positive community response to new rules reducing spam after one week.
PFlash: speculative prefill technique achieves 10x speedup on 128K context with quantized 27B models on RTX 3090, open-source C++/CUDA implementation.
gemma-4-31B-it-DFlash open-weights model released on Hugging Face, pending llama.cpp integration.
User demonstrates DFlash speculative decoding in llama.cpp with Qwen3.5-35B-A3B on RTX 2080 SUPER 8GB, achieving inference on VRAM-constrained hardware.
llama-swap adds matrix grouping feature for multi-model orchestration and intelligent VRAM swap scheduling.
Developer built local PDF-to-audiobook app using Kokoro 82M TTS, Qwen, and llama.cpp with Tauri 2.0 on M1 Mac.
PS5 Linux exploit proposed as potential hardware for local LLM inference via llama.cpp.
ik_llama.cpp adds Qwen3.5 MTP support with 50% throughput gain (18-20 → 30 tok/s) via pipeline parallelism on 27B model.
llama.cpp adds native NVFP4 quantization support for Blackwell GPUs with benchmark results on RTX 5090.
llama.cpp merged SM120 native NVFP4 quantization support; community released GGUFs for Gemma-4-31B and Nemotron-Cascade models.
Lemonade OmniRouter unifies local AI inference across text, image, audio, and vision modalities via single OpenAI-compatible endpoint using llama.cpp, sd.cpp, and Whisper.
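Because such routers expose the standard OpenAI-compatible API, existing clients can target them by switching the base URL. A minimal sketch with the official `openai` Python package follows; the port and model name are assumptions, not OmniRouter defaults.

```python
from openai import OpenAI

# Point the standard client at a local OpenAI-compatible server
# (e.g. llama.cpp's llama-server, or a router in front of several backends).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local-model",  # assumed name; use whatever the local server advertises
    messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
)
print(resp.choices[0].message.content)
```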
Qwen3.6-27B IQ4_XS quantization bloat analysis; reverting llama.cpp commit reduces VRAM from 15.1GB to 14.7GB with 110k context.
Reddit discussion analyzing tensions within r/LocalLLaMA community between open-weights advocates and commercial interests.
Reddit user seeks recommendations for large open-weight models to run locally on 56GB VRAM using llama.cpp.
User reports GBNF grammar optimizations for Qwen 3.6 35B and 27B models improving coding task performance in llama.cpp.
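For context on what a GBNF tweak looks like in practice: llama.cpp's server accepts a `grammar` field on its `/completion` endpoint that constrains sampling to the grammar. A minimal sketch follows; the tiny grammar and the local port are illustrative, not the optimizations the post describes.

```python
import requests

# A tiny GBNF grammar forcing the model to answer with "yes" or "no".
GRAMMAR = r'''
root ::= "yes" | "no"
'''

resp = requests.post(
    "http://localhost:8080/completion",   # llama-server's default local port is assumed
    json={
        "prompt": "Is llama.cpp written in C++? Answer yes or no: ",
        "grammar": GRAMMAR,               # constrains token sampling to the grammar
        "n_predict": 4,
    },
)
print(resp.json()["content"])
```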
Tutorial: running a local coding agent with Gemma 4 and Pi using llama.cpp for on-device inference.
Reddit user seeks advice on setting up local coding agents like Claude Code with open-weight models via llama.cpp.
llama.cpp adds experimental DeepSeek v4 Flash support with aggressive 2-bit quantization, achieving 17 tokens/sec on M3 Max with 128GB RAM requirement.
Llama.cpp benchmarks on Windows 11 vs Lubuntu 26.04 with RTX 5080 show significant OS-level performance variance in local inference.
User demonstrates PaddleOCR-VL-1.5 multimodal inference via llama.cpp server for end-to-end document digitization with layout and table handling.
ik_llama.cpp maintainer seeks volunteers to develop Vulkan backend support for CPU/GPU inference optimization.
llama.cpp and ik_llama.cpp now support FP4 inference with different formats: NVFP4 (E2M1 values with Nvidia's FP8 E4M3 block scales) and MXFP4 (the OCP MX standard with power-of-two block scales) across varying hardware backends.
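As background on what the two FP4 flavors share, here is a toy NumPy sketch of block-wise 4-bit quantization: one scale per block, values snapped to the E2M1 grid. The function name, block size, and scale handling are illustrative assumptions, not the actual ggml kernels.

```python
import numpy as np

# Representable magnitudes of a 4-bit E2M1 float (the element format of both NVFP4 and MXFP4).
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_block(x, block_size=32, power_of_two_scale=True):
    """Toy block-wise FP4 quantization: one scale per block, elements snapped to
    the nearest E2M1 value. power_of_two_scale=True mimics MXFP4-style E8M0
    scales; False keeps a full-precision scale as a rough stand-in for NVFP4's
    FP8 scales (which also use smaller 16-element blocks). Illustrative only."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for start in range(0, x.size, block_size):
        block = x[start:start + block_size]
        scale = np.max(np.abs(block)) / FP4_LEVELS[-1] or 1.0
        if power_of_two_scale:
            scale = 2.0 ** np.ceil(np.log2(scale))
        scaled = np.abs(block) / scale
        idx = np.abs(scaled[:, None] - FP4_LEVELS[None, :]).argmin(axis=1)
        out[start:start + block_size] = np.sign(block) * FP4_LEVELS[idx] * scale
    return out

w = np.random.default_rng(1).normal(size=64)
print(np.max(np.abs(w - quantize_fp4_block(w))))  # worst-case quantization error
```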
Developer documents practical setup for running Qwen 3.6 35B on M2 MacBook Pro 32GB via llama.cpp, with performance notes and optimization tips.
Reddit post celebrating current state of local LLM deployment without specific technical claims or data.
Reddit discussion asking about TurboQuant KV cache optimization implementation in llama.cpp.
Qwen 3.6 35B-A3B MoE model achieves 250+ tok/s on AMD Radeon 780M iGPU via llama.cpp Vulkan.
Simon Willison ports LlamaIndex's LiteParse PDF text extraction tool to run in-browser, using spatial parsing and Tesseract OCR without ML models.
Technical walkthrough: Qwen 3.6 27B achieves 85 TPS, 125K context on single RTX 3090 using llama.cpp.
Performance test: Qwen 3.6 27B with speculative decoding achieves 25.53 tokens/sec with 2x speedup on local hardware.
User questions absence of consumer inference chips ($200 devices running Llama 3 locally) despite industry investment.
llama.cpp Vulkan and SYCL benchmarks comparing Nvidia RTX 3090 vs Intel Arc Pro B70 on prompt processing and token generation.
User demonstrates Qwen3.6-27B running locally via llama-server with 200k context on dual RTX 3090, achieving coding performance cheaper than Claude.
User demonstrates llama.cpp auto-fit enables 57 t/s on Qwen3.6 Q8 256k context despite weights exceeding 32GB VRAM.
Plasma 1.0: 235M-param LLaMA-style model trained from scratch on single RTX 5080 GPU.
Open WebUI Desktop released with local llama.cpp support and remote server connectivity options.
Commentary comparing llama.cpp infrastructure dominance to Linux in LLM ecosystem.
Community discussion on why OSS AI tools prioritize Ollama over llama.cpp despite engineering parity.
Six Llama-3.1-8B variants fine-tuned on Christian, Islamic, Jewish, Hindu, and Buddhist texts reveal systematic differences in ethical reasoning patterns.
Cross-linguistic study of politeness effects on 5 LLMs (Gemini-Pro, GPT-4o Mini, Claude 3 Sonnet, DeepSeek-Chat, Llama 3) via 22,500 English/Hindi/Spanish prompts.
Benchmark compares token pruning compression across Qwen3, Gemma-3, Llama-3, Aya for Korean-centric NLP with English-Korean vocabulary optimization.