Vol. I · No. 18 · THU, MAY 7, 2026

r/LocalLLaMA

Reddit · COMMUNITY

Last updated May 7, 2026, 3:30 PM

Get faster Qwen 3.6 27B

User achieves 50 tokens/sec with Qwen 3.6 27B on RTX 3090 using MTP speculative decoding at 100k context.
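The speedup comes from speculative decoding: a small drafter proposes several tokens ahead, and the large model verifies them, keeping only the prefix it agrees with, so output is identical to decoding with the large model alone. A toy Python sketch of that accept/reject loop (the next-token functions here are stand-ins, not the actual MTP drafter or Qwen):

```python
def speculative_decode(target, draft, prompt, k=4, max_new=8):
    """Toy speculative decoding loop.

    `draft` proposes k tokens greedily; `target` verifies them and
    keeps the longest agreeing prefix, then emits one token of its
    own. Output always matches greedy decoding with `target` alone.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # Drafter proposes k tokens ahead.
        ctx = list(seq)
        proposal = []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies: accept the longest agreeing prefix.
        for t in proposal:
            if target(seq) == t:
                seq.append(t)
            else:
                break
        # One correction/continuation token from the target.
        seq.append(target(seq))
    return seq[len(prompt):len(prompt) + max_new]
```

When the drafter agrees with the target, each verification pass yields up to k+1 tokens for one "big model" step, which is where the tokens/sec gain comes from; a bad drafter only costs wasted proposals, never wrong output.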

··

Follow-up: Trying to make NVIDIA GPUs plug-and-play on Macs. Found hidden RDMA symbols Apple doesn't want you to see — zero-copy GPU memory sharing might already work.

**TL;DR:** My last post about testing TinyGPU attracted some interest. This is the follow-up. The Blackwell card is detected and the driver loads, but NVIDIA's GSP firmware fails to boot through TB5 (known issue, I'm working with tinygrad on it). While debugging that, I went down a rabbit hole and discovered that Apple's RDMA subsystem accepts Metal GPU buffers for zero-copy network transfers — something nobody has documented. I also found hidden `ibv_reg_dmabuf_mr` symbols in Apple's libibverbs that suggest GPUDirect RDMA might be possible on macOS without any kernel modification. Here's eve...
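The symbol hunt described in the post is easy to reproduce on any platform: load libibverbs dynamically and probe for `ibv_reg_dmabuf_mr` (the rdma-core entry point for registering a dma-buf as an RDMA memory region). A minimal probe sketch, assuming only that a libibverbs shared library may or may not be present — it checks for the export, it does not perform any registration:

```python
import ctypes
import ctypes.util

def probe_dmabuf_mr():
    """Return the ibv_reg_dmabuf_mr symbol if the system's libibverbs
    exports it, else None. Pure symbol check; no RDMA calls are made."""
    path = ctypes.util.find_library("ibverbs")
    if path is None:
        return None  # no libibverbs on this system
    lib = ctypes.CDLL(path)
    return getattr(lib, "ibv_reg_dmabuf_mr", None)
```

On Linux with a recent rdma-core this returns a callable; the post's finding is that the same symbol shows up in Apple's libibverbs, which is what suggests dma-buf-style GPU memory registration might work without kernel changes.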

··

What do you use Gemma 4 for?

Community discussion comparing Gemma 4 and Qwen 3.6 model suitability across coding, benchmarks, and agentic workloads.

··

Why run local? Count the money

User quantifies cost savings from running local Qwen-397B with Hermes agent vs. API pricing: 200M tokens in 5 days ≈ $250 saved at API rates.
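A back-of-envelope check on those numbers: $250 for 200M tokens implies a blended API rate of about $1.25 per million tokens (the rate is inferred from the post's figures, not stated in it):

```python
def api_cost_usd(tokens, price_per_mtok):
    """Cost of `tokens` tokens at an API rate billed per million tokens."""
    return tokens / 1e6 * price_per_mtok

# Figures from the post: 200M tokens over 5 days. The $1.25/Mtok
# blended rate is back-solved from the ~$250 claim, not quoted.
saved = api_cost_usd(200_000_000, 1.25)
```

At that volume even small per-token rates add up quickly, which is the post's point: local inference amortizes hardware cost in days for agentic workloads.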

··

Gemma 4 MTP released

Google releases Gemma 4 multi-token prediction drafters in 4 quantized sizes for local deployment.

··
50 stories