Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR
Qwen3.6-27B with Multi-Token Prediction achieves 2.5x throughput via Unsloth quantization and llama.cpp integration.
Every story tagged with this topic, ordered by date.
Qwen 3.6 27B achieves 2.5x inference speedup via MTP speculative decoding in llama.cpp; 262k context on 48GB with fixed chat templates.
Empirical quantization degradation analysis for Qwen 3.6 27B across 8 compression levels via chess state-tracking task.
Community discussion comparing Gemma 4 and Qwen 3.6 model suitability across coding, benchmarks, and agentic workloads.
User reports successful MTP speculative decoding on AMD Strix Halo (AI Max 395) with llama.cpp achieving 60-80 tok/s on Qwen 3.6B GGUF.
Benchmark comparison shows Gemma 4 31B trades per-token inference speed for token efficiency vs Qwen 3.6/5 27B; Qwen optimizes for benchmark metrics, Gemma for end-to-end throughput.
Google releases Gemma 4 multi-token prediction drafters in 4 quantized sizes for local deployment.
Reddit post on using Qwen3.6 with pi.dev harness and agent tooling for local coding and admin tasks.
Heretic 1.3 adds reproducibility, integrated benchmarking, reduced VRAM, and broader model support for model decensoring.
Community survey of local deep research tools as of May 2026, highlighting GPT Researcher and Local Deep Research as active open-source projects.
User reports running Gemma 26B efficiently on CPU-only hardware (i5-8500, 32GB RAM) without GPU acceleration.
Community member merges Qwen3.6 chat template fixes from froggeric and allanchan339 using Claude Opus.
MTP format support coming to llama.cpp; DeepSeek V3, Qwen3.5, GLM-4.5, and other models compatible pending native weights.
Anonymous text-to-image model 'Peanut' ranks #8 on Artificial Analysis arena; open-weights release promised to compete with FLUX and Qwen models.
vLLM merged TurboQuant quantization support for Qwen 3.5+, enabling 4-bit/3-bit KV-cache inference via new command-line flags.
IBM released Granite 4.1 (3B/8B/30B, Apache 2.0); Unsloth published 21 quantized GGUF variants; Willison benchmarked quality across model sizes on SVG generation.
White House exploring pre-release vetting requirements for AI models, raising policy questions for open-weights distribution.
Knowledge distillation from LLMs to compact open-source models for cross-language code clone detection without black-box inference costs.
APEX MoE quantization strategy expanded to 30+ models with new I-Nano compression tier, enabling efficient local inference.
Community demo comparing Talkie-1930 (13B retro LM) and Gemma 4 31B in side-by-side chat on Opper.ai platform.
LLMSearchIndex: open-source Python library for local, offline web search with 200M indexed pages, enabling RAG without paid APIs.
llama.cpp adds beta MTP (Multi-Token Prediction) support, starting with Qwen3.5, closing performance gap with vLLM on token generation.
TinyMozart v2 85M, an unconditional MIDI piano generation model, released with improvements for chord and length control.
GGUF quantizations of Google Gemma 4 updated with corrected chat template for local inference.
User reports high API costs for Claude Opus and GPT-5.5 on Cursor, predicts open-source models will displace proprietary tools by end of 2026.
Researcher demonstrates iterative refinement loop using small auxiliary transformer to improve 1.7B model code generation; scaling to 9B for HumanEval validation.
User reports successfully running Qwen3.6-35B on 6GB VRAM laptop at 23 t/s throughput with quantization techniques.
Developer deployed Gemma 4 E2B (2.4GB) on 8GB Android phone for structured JSON parsing and voice-to-task conversion with usable accuracy.
Quantized Llama 405B and DeepSeek models now achieve 20-100 tokens/sec on consumer hardware, up from 1 token/sec two years ago.
Community member releases Assistant_Pepe_32B, a Qwen3-32B finetune designed to reduce sycophancy through negativity bias.
Community appreciation post nominating researchers and companies who released open-weights models, from Transformer authors to recent open-source contributors.
Hummingbird+ FPGA platform achieves 18 tok/s on Qwen3-30B-A3B with $150 target production cost, targeting edge LLM inference.
Developer reports local Qwen 27B setup with llama-server now competitive with Claude Code and Cursor for coding tasks, driven by cloud provider cost increases.
Community discussion on whether open-source models' historical 6-12 month lag behind frontier systems persists after December 2025 agentic capability jump.
Developer discusses building a local Solidity LM with chain-of-thought and tool-calling; seeks alternatives to SOTA models for smart contract security and vulnerability analysis.
Community tool adds phrase-filtering capability to llama.cpp inference engine via GitHub script.
LDR framework with Qwen3.6-27B agentic search achieves 95.7% SimpleQA accuracy on single RTX 3090.
LH-Tech-AI releases Flare-TTS 28M, a 28M-parameter text-to-speech model trained on LJSpeech in 24 hours on single A6000 GPU.
Qwen3.6-27B Windows native vLLM launcher achieves 72 tok/s on RTX 3090 with portable installation, no WSL.
Community inquiry about Qwen roadmap for 9B, 122B, 397B model variants in 3.6 series.
Unsloth and Mistral fixed YaRN parsing bug in Mistral Medium 3.5 inference; updated GGUFs released with mscale_all_dim correction.
Unsloth fixes broken GGUF quantizations of Mistral Medium 3.5 128B, resolving long-context degradation issues.
Build American AI, a nonprofit backed by OpenAI and a16z executives, funds influencer campaign promoting U.S. AI while framing Chinese AI as threat.
User shares Qwen3.6-27B quantized setup with RTX 5090 and llama.cpp configuration parameters.
r/LocalLLaMA moderators report positive community response to new rules reducing spam after one week.
User reports Qwen-3.6-27B-q8_k_xl outperforms Gemma 4 for local development tasks on RTX 6000 Pro.
PFlash: speculative prefill technique achieves 10x speedup on 128K context with quantized 27B models on RTX 3090, open-source C++/CUDA implementation.
Intel releases AutoRound, a low-bit quantization algorithm optimized for CPU/XPU/CUDA with vLLM and Transformers compatibility.
Xiaomi's MiMo-V2.5-Pro and Kimi K2.6 dominate custom social deduction game benchmark, outperforming other open-weights models.
gemma-4-31B-it-DFlash open-weights model released on Hugging Face, pending llama.cpp integration.
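Many of the stories above hinge on speculative decoding via multi-token prediction: a cheap drafter proposes several tokens ahead, and the main model verifies them, keeping the longest agreeing prefix plus one corrected token. A minimal toy sketch of that accept/verify loop (both predictor functions below are hypothetical stand-ins for illustration, not llama.cpp or vLLM APIs; real implementations verify the whole draft in a single batched forward pass):

```python
def draft_model(ctx: list[int]) -> int:
    # Cheap drafter: a toy deterministic rule standing in for an MTP head.
    return (ctx[-1] * 3 + 1) % 97

def target_model(ctx: list[int]) -> int:
    # Main model: agrees with the drafter except when the next token
    # would be a multiple of 5, where it emits a different token.
    nxt = (ctx[-1] * 3 + 1) % 97
    return nxt + 1 if nxt % 5 == 0 else nxt

def speculative_step(ctx: list[int], k: int = 4) -> list[int]:
    """Draft k tokens ahead, then verify them in order.

    Returns the accepted tokens: the longest prefix where drafter and
    target agree, plus the target's own token at the first mismatch
    (or a bonus token if the whole draft is accepted)."""
    draft = []
    for _ in range(k):
        draft.append(draft_model(ctx + draft))

    accepted: list[int] = []
    for tok in draft:
        expect = target_model(ctx + accepted)
        if tok == expect:
            accepted.append(tok)       # drafter matched: token comes "free"
        else:
            accepted.append(expect)    # mismatch: take the target's token, stop
            break
    else:
        # Entire draft accepted: the verify pass also yields one bonus token.
        accepted.append(target_model(ctx + accepted))
    return accepted
```

Each step emits between 1 and k+1 tokens per verification pass, which is where the 2x-plus throughput figures reported above come from when the drafter's acceptance rate is high.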