How to run a local coding agent with Gemma 4 and Pi | Patrick Loeber
Tutorial: running a local coding agent with Gemma 4 and Pi using llama.cpp for on-device inference.
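As a rough sketch of the setup the tutorial covers: serve a local instruction-tuned GGUF model with llama.cpp's llama-server, then point the coding agent at its OpenAI-compatible chat endpoint. The model filename, port, and prompt below are placeholder assumptions for illustration, not details taken from the tutorial.

```python
# Minimal sketch, assuming llama-server is already running with a local GGUF model, e.g.:
#   llama-server -m gemma-4-instruct-Q4_K_M.gguf -c 8192 --port 8080
# The model filename and port are placeholders, not values from the tutorial.
import requests

LLAMA_SERVER = "http://localhost:8080/v1/chat/completions"  # llama-server's OpenAI-compatible endpoint

def ask_local_model(prompt: str) -> str:
    """Send a single coding request to the locally served model and return its reply."""
    payload = {
        "messages": [
            {"role": "system", "content": "You are a concise coding assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.2,
        "max_tokens": 512,
    }
    resp = requests.post(LLAMA_SERVER, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask_local_model("Write a Python function that reverses a linked list."))
```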
Reddit user seeks advice on setting up local coding agents like Claude Code with open-weight models via llama.cpp.
llama.cpp adds experimental DeepSeek v4 Flash support with aggressive 2-bit quantization, achieving 17 tokens/sec on an M3 Max (128GB RAM required).
Llama.cpp benchmarks on Windows 11 vs Lubuntu 26.04 with RTX 5080 show significant OS-level performance variance in local inference.
User demonstrates PaddleOCR-VL-1.5 multimodal inference via llama.cpp server for end-to-end document digitization with layout and table handling.
ik_llama.cpp maintainer seeks volunteers to develop Vulkan backend support for CPU/GPU inference optimization.
llama.cpp and ik_llama.cpp now support FP4 inference with different formats: NVFP4 (Nvidia E4M3) and MXFP4 (MX standard) across varying hardware backends.
Developer documents practical setup for running Qwen 3.6 35B on M2 MacBook Pro 32GB via llama.cpp, with performance notes and optimization tips.
Reddit post celebrating current state of local LLM deployment without specific technical claims or data.
Reddit discussion asking about TurboQuant KV cache optimization implementation in llama.cpp.
Qwen 3.6 35B-A3B MoE model achieves 250+ tok/s on AMD Radeon 780M iGPU via llama.cpp Vulkan.
Simon Willison ports LlamaIndex's LiteParse PDF text extraction tool to run in-browser, using spatial parsing and Tesseract OCR without ML models.
Technical walkthrough: Qwen 3.6 27B achieves 85 TPS, 125K context on single RTX 3090 using llama.cpp.
Performance test: Qwen 3.6 27B with speculative decoding achieves 25.53 tokens/sec with 2x speedup on local hardware.
User questions absence of consumer inference chips ($200 devices running Llama 3 locally) despite industry investment.
llama.cpp Vulkan and SYCL benchmarks comparing Nvidia RTX 3090 vs Intel Arc Pro B70 on prompt processing and token generation.
User demonstrates Qwen3.6-27B running locally via llama-server with 200k context on dual RTX 3090s, delivering coding performance at lower cost than Claude.
User demonstrates that llama.cpp auto-fit enables 57 t/s on Qwen3.6 Q8 with 256k context, despite the weights exceeding 32GB of VRAM.
Plasma 1.0: 235M-param LLaMA-style model trained from scratch on single RTX 5080 GPU.
Open WebUI Desktop released with local llama.cpp support and remote server connectivity options.
Commentary comparing llama.cpp infrastructure dominance to Linux in LLM ecosystem.
Community discussion on why OSS AI tools prioritize Ollama over llama.cpp despite engineering parity.
Six Llama-3.1-8B variants fine-tuned on Christian, Islamic, Jewish, Hindu, Buddhist texts reveal systematic differences in ethical reasoning patterns.
Cross-linguistic study of politeness effects on 5 LLMs (Gemini-Pro, GPT-4o Mini, Claude 3 Sonnet, DeepSeek-Chat, Llama 3) via 22,500 English/Hindi/Spanish prompts.
Benchmark compares token pruning compression across Qwen3, Gemma-3, Llama-3, Aya for Korean-centric NLP with English-Korean vocabulary optimization.