llama.cpp pull request #22493 by AesSedai adds Mimo v2.5 model support (ggml-org/llama.cpp).
Xiaomi releases Mimo v2.5, a 310B sparse MoE multimodal model with 1M token context supporting text, image, video, and audio.
Qwen3.6-27B with Multi-Token Prediction achieves 2.5x throughput via Unsloth quantization and llama.cpp integration.
Qwen 3.6 27B achieves 2.5x inference speedup via MTP speculative decoding in llama.cpp; 262k context on 48GB with fixed chat templates.
Qwen 27B achieves 54 t/s on V100 GPU with MTP optimization in llama.cpp, nearly 2x baseline speed for code review and tool use tasks.
Cyera reports critical unauthenticated memory leak vulnerability in Ollama enabling unauthorized data access.
User reports successful MTP speculative decoding on AMD Strix Halo (AI Max 395) with llama.cpp achieving 60-80 tok/s on Qwen 3.6B GGUF.
Meta is facing a class action lawsuit filed by five major book publishers and one author over claims the company "engaged in one of the most massive infringements of copyrighted materials in history" when training its Llama AI models, as reported earlier by The New York Times. In their suit, Macmillan, McGraw-Hill, Elsevier, Hachette, Cengage, and author Scott Turow allege that Meta "repeatedly copied" their books and journal articles without permission. The lawsuit accuses Meta of knowingly ripping copyrighted work from "notorious pirate sites," such as LibGen, Anna's Archive, Sci-Hub, Sci-M...
Unexpected email to wake up to, but I am here for it! Model-agnostic tools are the way! This is huge!
MTP format support coming to llama.cpp; DeepSeek-V3, Qwen3.5, GLM-4.5, and other models compatible once native MTP weights are available.
FastDMS achieves 6.4× KV-cache compression on Llama 3.2 1B via learned token eviction, matching vLLM performance with lower memory overhead.
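The item gives only the compression figure, not FastDMS's learned eviction policy. As a rough illustration of score-based KV-cache token eviction (a generic heuristic, not the paper's method), here is a minimal numpy sketch; 1/6.4 ≈ 0.156 is the fraction of entries kept at a 6.4× compression target:

```python
import numpy as np

def evict_kv_tokens(attn_scores: np.ndarray, keep_ratio: float = 0.156):
    """Toy KV-cache eviction: keep the tokens that received the most
    cumulative attention, drop the rest.

    attn_scores: (num_queries, num_keys) attention weights from recent steps.
    keep_ratio:  fraction of KV entries to retain (1/6.4 ~= 0.156).
    Returns indices of KV entries to keep, in original order.
    """
    importance = attn_scores.sum(axis=0)        # cumulative attention per key
    n_keep = max(1, int(len(importance) * keep_ratio))
    keep = np.argsort(importance)[-n_keep:]     # highest-scoring keys
    return np.sort(keep)

# Example: 64 cached tokens, random attention weights for illustration.
rng = np.random.default_rng(0)
print(evict_kv_tokens(rng.random((8, 64))))
```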
llama.cpp adds beta MTP (Multi-Token Prediction) support, starting with Qwen3.5, closing performance gap with vLLM on token generation.
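The item doesn't show the new MTP flags. For comparison, MTP drafts tokens with the model's own prediction head, whereas llama-server's existing speculative decoding path uses a separate small draft model; a minimal launch sketch of that existing path (model paths are placeholders):

```python
import subprocess

# Launch llama-server with classic draft-model speculative decoding.
# MTP would replace the separate draft model with the target model's own
# prediction head; these are the pre-existing draft-model flags.
subprocess.run([
    "llama-server",
    "-m", "Qwen3.5-35B-A3B-Q4_K_M.gguf",   # target model (placeholder path)
    "-md", "Qwen3.5-0.5B-Q8_0.gguf",        # small draft model (placeholder)
    "--draft-max", "16",                     # max tokens drafted per step
    "--draft-min", "4",                      # min draft tokens before speculating
    "-ngl", "99",                            # offload all layers to GPU
    "-c", "32768",                           # context size
    "--port", "8080",
])
```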
Quantized Llama 405B and DeepSeek models now achieve 20-100 tokens/sec on consumer hardware, up from 1 token/sec two years ago.
Developer reports local Qwen 27B setup with llama-server now competitive with Claude Code and Cursor for coding tasks, a switch driven by cloud provider cost increases.
Community tool adds phrase-filtering capability to llama.cpp inference engine via GitHub script.
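The linked script's exact approach (resampling vs. post-filtering) isn't specified in the item; as a simplified stand-in, here is a sketch of stripping banned phrases from a streamed completion, buffering just enough text to catch phrases that span chunk boundaries:

```python
BANNED = ["as an AI language model", "delve into"]

def filter_stream(chunks, banned=BANNED):
    """Yield streamed text with banned phrases removed.

    Keeps a small tail buffer so a phrase split across two chunks
    is still caught. Illustrative only, not the linked script.
    """
    tail = max(len(p) for p in banned) - 1
    buf = ""
    for chunk in chunks:
        buf += chunk
        for phrase in banned:
            buf = buf.replace(phrase, "")
        if len(buf) > tail:
            yield buf[:-tail]
            buf = buf[-tail:]
    yield buf

print("".join(filter_stream(["I will delve", " into the code."])))
```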
User shares Qwen3.6-27B quantized setup with RTX 5090 and llama.cpp configuration parameters.
r/LocalLLaMA moderators report positive community response to new rules reducing spam after one week.
PFlash: speculative prefill technique achieves 10x speedup on 128K context with quantized 27B models on RTX 3090, open-source C++/CUDA implementation.
gemma-4-31B-it-DFlash open-weights model released on Hugging Face, pending llama.cpp integration.
User demonstrates DFlash speculative decoding in llama.cpp with Qwen3.5-35B-A3B on RTX 2080 SUPER 8GB, achieving inference on VRAM-constrained hardware.
llama-swap adds matrix grouping feature for multi-model orchestration and intelligent VRAM swap scheduling.
Developer built local PDF-to-audiobook app using Kokoro 82M TTS, Qwen, and llama.cpp with Tauri 2.0 on M1 Mac.
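The app's code isn't shown; a minimal sketch of the PDF-side of such a pipeline, using the real pypdf library for extraction with a hypothetical stand-in for the Kokoro 82M synthesis call (whose API the item doesn't give):

```python
from pypdf import PdfReader  # real library; TTS hook below is hypothetical

def pdf_to_chunks(path: str, max_chars: int = 2000):
    """Extract PDF text and split it into TTS-sized chunks at sentence ends."""
    text = " ".join(page.extract_text() or "" for page in PdfReader(path).pages)
    chunk, out = "", []
    for sentence in text.replace("\n", " ").split(". "):
        if chunk and len(chunk) + len(sentence) > max_chars:
            out.append(chunk)
            chunk = ""
        chunk += sentence + ". "
    if chunk:
        out.append(chunk)
    return out

def synthesize(text: str) -> bytes:
    # Hypothetical hook: the real app uses Kokoro 82M TTS here.
    raise NotImplementedError("replace with the Kokoro 82M synthesis call")

for i, chunk in enumerate(pdf_to_chunks("book.pdf")):
    audio = synthesize(chunk)  # one audio segment per chunk
```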
PS5 running Linux via exploit proposed as potential hardware for local LLM inference with llama.cpp.
ik_llama.cpp adds Qwen3.5 MTP support with a 50% throughput gain (18-20 → 30 tok/s) via pipeline parallelism on the 27B model.
llama.cpp adds native NVFP4 quantization support for Blackwell GPUs with benchmark results on RTX 5090.
llama.cpp merged SM120 native NVFP4 quantization support; community released GGUFs for Gemma-4-31B and Nemotron-Cascade models.
Lemonade OmniRouter unifies local AI inference across text, image, audio, and vision modalities via single OpenAI-compatible endpoint using llama.cpp, sd.cpp, and Whisper.
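Any OpenAI-compatible endpoint can be queried the same way; a minimal sketch, where the URL, port, and model alias are assumptions for illustration rather than OmniRouter's documented defaults:

```python
import requests

# Query an OpenAI-compatible endpoint. URL, port, and model alias are
# placeholders; adjust to however the router is actually configured.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "local-text",  # routed model alias (hypothetical)
        "messages": [{"role": "user",
                      "content": "Summarize llama.cpp in one line."}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```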
Qwen3.6-27B IQ4_XS quantization bloat analysis; reverting llama.cpp commit reduces VRAM from 15.1GB to 14.7GB with 110k context.
Reddit discussion analyzing tensions within r/LocalLLaMA community between open-weights advocates and commercial interests.
Reddit user seeks recommendations for large open-weight models to run locally on 56GB VRAM using llama.cpp.
User reports GBNF grammar optimizations for Qwen 3.6 35B and 27B models improving coding task performance in llama.cpp.
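The post's grammars aren't shown; as a small illustration of the mechanism, llama-server's native /completion endpoint accepts a GBNF grammar that constrains sampling, here forcing a one-field JSON verdict (prompt and grammar are examples, not the user's):

```python
import requests

# GBNF grammar forcing output of the form {"verdict": "pass"} or
# {"verdict": "fail"} -- illustrative, not the grammars from the post.
GRAMMAR = r'''
root ::= "{" ws "\"verdict\"" ws ":" ws verdict ws "}"
verdict ::= "\"pass\"" | "\"fail\""
ws ::= [ \t\n]*
'''

resp = requests.post(
    "http://localhost:8080/completion",  # llama-server native endpoint
    json={
        "prompt": "Does this function compile? int f() { return 1; }\nAnswer:",
        "n_predict": 32,
        "grammar": GRAMMAR,
    },
    timeout=120,
)
print(resp.json()["content"])
```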