What it means that Elon just rented out all his GPUs to Anthropic
Reddit speculation that Elon/xAI rented GPUs to Anthropic, interpreted as signal of competitive pressure and capacity constraints.
Every story tagged with this topic, ordered by date.
Anthropic secures partnership with SpaceX for 300MW+ compute at Colossus 1, adding 220k+ NVIDIA GPUs within one month.
Anthropic partners with SpaceX for compute capacity; removes Claude Code peak-hour limits and raises API rate limits for Opus.
Analysis of the 100 most popular hardware configurations for local LLM inference on Hugging Face reveals deployment patterns and infrastructure preferences.
Anthropic raises Claude usage limits and partners with SpaceX for compute infrastructure to expand capacity.
User demonstrates Qwen3.6 27B running 200k context on single RTX 5090 with NVFP4 quantization in vLLM, sharing exact configuration and parameters.
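A minimal sketch of what such a setup might look like through vLLM's Python API. The checkpoint name is hypothetical, and vLLM detects the NVFP4 scheme from a pre-quantized checkpoint's config, so no explicit quantization flag is passed here:

```python
# Serving a long-context model from a pre-quantized NVFP4 checkpoint.
# Model path is hypothetical; adjust memory utilization for your card.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.6-27B-NVFP4",   # hypothetical pre-quantized checkpoint
    max_model_len=200_000,             # 200k-token context window
    gpu_memory_utilization=0.95,       # leave a little headroom on the 5090
)

params = SamplingParams(temperature=0.7, max_tokens=512)
out = llm.generate(["Summarize this document: ..."], params)
print(out[0].outputs[0].text)
```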
Qwen3.6-27B with Multi-Token Prediction achieves 2.5x throughput via Unsloth quantization and llama.cpp integration.
Apple discontinues high-memory Mac Studio configurations (256GB, 512GB), limiting local LLM inference options to 96GB max.
Qwen 3.6 27B achieves 2.5x inference speedup via MTP speculative decoding in llama.cpp; 262k context on 48GB with fixed chat templates.
Qwen 27B achieves 54 t/s on V100 GPU with MTP optimization in llama.cpp, nearly 2x baseline speed for code review and tool use tasks.
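For readers unfamiliar with why MTP roughly doubles generation speed, here is an illustrative draft-and-verify loop (not llama.cpp's actual code): the MTP heads propose several tokens cheaply, and the main model accepts the longest prefix it agrees with, so each step can emit more than one token.

```python
# Toy greedy-verification speculative step. `draft_next_k` stands in for
# the MTP heads, `target_argmax` for the main model's prediction. Real
# engines verify all k positions in a single batched forward pass.
from typing import Callable, List

def speculative_step(
    ctx: List[int],
    draft_next_k: Callable[[List[int], int], List[int]],
    target_argmax: Callable[[List[int]], int],
    k: int = 4,
) -> List[int]:
    proposal = draft_next_k(ctx, k)
    accepted: List[int] = []
    for tok in proposal:
        expected = target_argmax(ctx + accepted)
        if expected == tok:
            accepted.append(tok)         # target agrees: keep the free token
        else:
            accepted.append(expected)    # first disagreement: correct and stop
            return accepted
    # All k draft tokens accepted; target contributes one bonus token,
    # so each step emits between 1 and k+1 tokens instead of exactly 1.
    accepted.append(target_argmax(ctx + accepted))
    return accepted
```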
Cyera reports critical unauthenticated memory leak vulnerability in Ollama enabling unauthorized data access.
User reports successful MTP speculative decoding on AMD Strix Halo (AI Max 395) with llama.cpp achieving 60-80 tok/s on Qwen 3.6B GGUF.
User quantifies cost savings from running local Qwen-397B with Hermes agent vs. API pricing: 200M tokens in 5 days ≈ $250 saved at API rates.
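A quick sanity check on those numbers (electricity and hardware amortization are not netted out):

```python
# Back-of-the-envelope check: ~$250 saved on 200M tokens over 5 days
# implies a blended API rate of about $1.25 per million tokens.
tokens = 200_000_000
savings_usd = 250
per_million = savings_usd / (tokens / 1_000_000)
print(f"implied blended rate: ${per_million:.2f}/M tokens")   # $1.25/M
print(f"daily usage: {tokens / 5 / 1e6:.0f}M tokens/day")     # 40M tokens/day
```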
Production AI deployment reveals hidden cost scaling: token usage doubled after adding retrieval context, pushing teams from GPT-4o toward cheaper alternatives.
Transformer architecture innovation enables selective early layer access via learned mixing coefficients for memory-efficient low-level feature recovery.
Google demonstrates 3X LLM inference speedup on TPUs using diffusion-style speculative decoding technique.
QKVShare framework for quantized KV-cache handoff between multi-agent LLMs on edge devices; token-level mixed-precision allocation reduces memory vs. full-precision transfer.
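A toy illustration of the token-level mixed-precision idea, assuming a simple top-fraction importance policy; QKVShare's actual allocator and wire format are not reproduced here:

```python
# Spend 8 bits on "important" tokens and 4 bits on the rest before
# handing the KV cache to another agent. Purely illustrative.
import numpy as np

def quantize_kv(kv: np.ndarray, importance: np.ndarray, hi_frac: float = 0.25):
    """kv: (tokens, dim) float32; importance: per-token scores."""
    n_hi = max(1, int(len(importance) * hi_frac))
    hi = set(np.argsort(importance)[-n_hi:].tolist())   # tokens keeping 8 bits
    packed = []
    for t, row in enumerate(kv):
        bits = 8 if t in hi else 4
        levels = 2 ** (bits - 1) - 1                    # 127 or 7
        scale = float(np.abs(row).max()) / levels or 1.0
        q = np.clip(np.round(row / scale), -levels, levels).astype(np.int8)
        packed.append((q, scale, bits))                 # int4 left unpacked for clarity
    return packed

def dequantize_kv(packed):
    return np.stack([q.astype(np.float32) * scale for q, scale, _ in packed])
```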
DMGD proposes training-free dataset distillation using diffusion models with semantic-distribution matching guidance.
MEAZO: memory-efficient adaptive zeroth-order optimizer for LLM fine-tuning, outperforms ZO-Adam with scalar-only tracking.
Distributionally robust continual learning method for CLIP models using dynamic per-class loss reweighting with small memory buffers.
SOAR: real-time joint optimization of order allocation and robot scheduling for robotic mobile fulfillment warehouse systems.
Community survey of local deep research tools as of May 2026, highlighting GPT Researcher and Local Deep Research as active open-source projects.
User reports running Gemma 26B efficiently on CPU-only hardware (i5-8500, 32GB RAM).
Community member merges Qwen3.6 chat template fixes from froggeric and allanchan339 using Claude Opus.
Dual RTX 3090 setup draws ~760W under LLM inference, 90W idle; practical hardware benchmark for on-premises deployment.
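For context, a rough energy-cost estimate at those figures (duty cycle and electricity price are assumptions):

```python
# 760W under load vs 90W idle; US-average ~$0.15/kWh assumed.
load_w, idle_w, price = 760, 90, 0.15
hours_load, hours_idle = 8, 16                      # hypothetical duty cycle
daily_kwh = (load_w * hours_load + idle_w * hours_idle) / 1000
print(f"{daily_kwh:.2f} kWh/day ≈ ${daily_kwh * price:.2f}/day")  # ~7.52 kWh, ~$1.13
```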
Stratechery analysis: Amazon lagged in AI training but is positioned for inference dominance through sustained infrastructure investment.
OpenAI releases MRC (Multipath Reliable Connection), an OCP networking protocol for resilience and performance in large-scale AI training clusters.
Anthropic released Claude for Creative Work with nine MCP-native connectors including Blender, enabling persistent in-app context and direct action execution.
vibevoice.cpp: C++ ggml port of Microsoft VibeVoice enables TTS and long-form ASR with diarization on CPU/CUDA/Metal/Vulkan without Python.
MTP format support coming to llama.cpp; DeepSeekv3, Qwen3.5, GLM4.5, and other models compatible pending native weights.
Qwen 27B FP8 achieves 80 TPS with 200k token BF16 KV cache on RTX 5000 PRO 48GB, reducing quantization artifacts vs. 24GB quantized baselines.
MTPLX inference engine achieves 2.24× throughput speedup for MTP-equipped models on Apple Silicon M5 Max via rejection sampling.
vLLM merged TurboQuant quantization support for Qwen 3.5+, enabling 4-bit/3-bit KV-cache inference via new command-line flags.
FastDMS achieves 6.4× KV-cache compression on Llama 3.2 1B via learned token eviction, matching vLLM performance with lower memory overhead.
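The eviction idea in miniature, with accumulated attention mass standing in for FastDMS's learned scorer:

```python
# Rank cached tokens by an importance score and keep only the top slice.
import numpy as np

def evict_tokens(kv: np.ndarray, attn_mass: np.ndarray, ratio: float = 6.4):
    """Keep roughly 1/ratio of the cache, preserving original token order."""
    keep = max(1, int(len(attn_mass) / ratio))
    idx = np.sort(np.argsort(attn_mass)[-keep:])   # top-`keep` tokens, in order
    return kv[idx], idx
```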
SpecKV adapts speculative decoding's speculation length dynamically based on target model compression, improving LLM inference throughput.
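An illustrative controller for that kind of adaptation; SpecKV's actual signal (tied to target-side compression) is more involved than this acceptance-rate heuristic:

```python
# Grow the draft window while acceptance stays high, shrink it when the
# target model keeps rejecting the drafted tokens.
class SpecLengthController:
    def __init__(self, k: int = 4, k_min: int = 1, k_max: int = 8):
        self.k, self.k_min, self.k_max = k, k_min, k_max

    def update(self, accepted: int) -> int:
        if accepted == self.k:                      # full acceptance: be greedier
            self.k = min(self.k + 1, self.k_max)
        elif accepted < self.k // 2:                # mostly rejected: back off
            self.k = max(self.k - 1, self.k_min)
        return self.k
```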
Pattern-based AI-assisted methodology for rapid sensor-driven application development using Pegasus workflows on FABRIC testbed.
JACTUS unifies parameter-efficient fine-tuning and model compression into single joint optimization framework.
APEX MoE quantization strategy expanded to 30+ models with new I-Nano compression tier, enabling efficient local inference.
Google Gemini API adds webhook support for asynchronous job notifications, reducing polling overhead for long-running requests.
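The pattern this replaces is a polling loop; a generic receiver might look like the following (payload field names are hypothetical, consult the Gemini API docs for the real schema):

```python
# Generic webhook receiver for async job notifications.
from flask import Flask, request

app = Flask(__name__)

@app.post("/gemini/jobs")
def on_job_update():
    event = request.get_json(force=True)
    # Hypothetical fields; this callback replaces polling the job endpoint.
    if event.get("state") == "SUCCEEDED":
        handle_result(event.get("job_id"), event.get("result_uri"))
    return "", 204

def handle_result(job_id, result_uri):
    print(f"job {job_id} finished: {result_uri}")

if __name__ == "__main__":
    app.run(port=8080)
```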
Bayesian optimization with dimensionality reduction for tuning Hyperledger Fabric blockchain configuration parameters via black-box benchmarking.
LLMSearchIndex: open-source Python library for local, offline web search with 200M indexed pages, enabling RAG without paid APIs.
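The story does not show the library's API, so every call in the sketch below is an assumption about how a local index could slot into a RAG prompt:

```python
# Hypothetical usage sketch -- LLMSearchIndex's real API is not shown in
# the story; import, methods, and fields are all assumptions.
from llmsearchindex import SearchIndex   # hypothetical import

index = SearchIndex.load("~/data/llmsearchindex")       # 200M-page local index
hits = index.search("KV-cache quantization", top_k=5)   # offline, no paid API
context = "\n\n".join(h.text for h in hits)
prompt = f"Answer using only this context:\n{context}\n\nQ: ..."
```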
llama.cpp adds beta MTP (Multi-Token Prediction) support, starting with Qwen3.5, closing performance gap with vLLM on token generation.
AMD Ryzen AI Max+ 495 APU leaked with 192GB memory, enabling larger local model inference on consumer hardware.
GGUF quantizations of Google Gemma 4 updated with corrected chat template for local inference.
OpenAI rebuilt WebRTC stack for real-time voice AI with low-latency conversational turn-taking at global scale.
Reddit discussion on hosting costs for Qwen 3.6 35B model until local hardware upgrades become available.
User reports successfully running Qwen3.6-35B on a 6GB-VRAM laptop at 23 t/s with quantization techniques.
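One common way to reach that kind of result is partial GPU offload; a sketch with llama-cpp-python, where the model file and layer count are illustrative:

```python
# Offload only as many layers as 6GB of VRAM allows; the rest run on CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.6-35b-q3_k_m.gguf",  # hypothetical low-bit quant
    n_gpu_layers=12,                        # tune upward until VRAM is nearly full
    n_ctx=8192,
    n_threads=8,                            # keep the CPU layers fed
)
out = llm("Explain KV-cache quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```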
AMD Strix Halo refresh rumored to feature 192GB+ VRAM, enabling larger MoE model inference on consumer hardware.
Quantized Llama 405B and DeepSeek models now achieve 20-100 tokens/sec on consumer hardware, up from 1 token/sec two years ago.