
DeepSeek and the Open-Source Inference Revolution: Why Hardware Matters More Than Ever

NovaCore Team · February 21, 2026

DeepSeek has changed the economics of AI. Their V3 base model — a 671B-parameter Mixture of Experts model that competes with GPT-4 class systems — was trained for approximately $5.6 million in compute. That's 55 days on 2,048 H800 GPUs, plus about $294K for R1's reinforcement learning phase.

When frontier-quality models cost single-digit millions to train and are released as open weights, the competitive landscape shifts. The question is no longer "can we access a good enough model?" — it's "can we serve it efficiently at scale?"

The architecture behind the efficiency

DeepSeek's approach is worth understanding because it influences hardware requirements:

  • Mixture of Experts (MoE): 671B total parameters, but only 37B activated per token. This dramatically reduces compute per inference call while maintaining model quality.
  • Multi-head Latent Attention (MLA): Compresses the KV cache, reducing memory requirements for long-context inference.
  • DeepSeek Sparse Attention (DSA): Fine-grained sparse attention that improves long-context training and inference efficiency while maintaining output quality.

These architectural choices are designed to maximize throughput per GPU dollar — exactly the metric that matters for serving workloads.
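
To make the MoE point concrete, here is a minimal sketch of top-k expert routing: a router scores every expert, but only the k highest-scoring experts run for each token, so per-token compute scales with k rather than the total expert count. The expert count and k below are illustrative, not DeepSeek's exact configuration.

```python
import math
import random

def top_k_route(logits, k):
    # Pick the k highest-scoring experts and softmax-normalize their gate weights.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

random.seed(0)
router_logits = [random.gauss(0, 1) for _ in range(256)]  # scores for 256 experts
active = top_k_route(router_logits, k=8)                  # only 8 experts fire
print(f"{len(active)} of {len(router_logits)} experts active for this token")
```

The same idea is why 671B total parameters can cost only 37B parameters' worth of compute per token: the unrouted experts simply never run.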

What this means for hardware

Running open-source models on your own infrastructure, rather than calling frontier providers' APIs, is increasingly a real choice. The economics now favor self-hosting for teams with sustained inference volume.

But self-hosting MoE models at scale requires specific hardware characteristics:

Memory capacity matters. DeepSeek R1's full 671B parameters need to live somewhere. With Blackwell Ultra's 288GB HBM3e per GPU, fewer GPUs are needed for the full model, reducing inter-GPU communication overhead.
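
A back-of-envelope sizing makes the point. Assuming FP8 weights (one byte per parameter) and a rough 20% overhead for KV cache and framework buffers — both illustrative assumptions, not a deployment spec — the full model fits in a handful of 288GB GPUs:

```python
import math

TOTAL_PARAMS = 671e9     # DeepSeek R1 total parameter count
BYTES_PER_PARAM = 1.0    # FP8 weights (assumption)
HBM_PER_GPU_GB = 288     # Blackwell Ultra (B300) HBM3e capacity
OVERHEAD = 1.2           # rough headroom for KV cache and buffers (assumption)

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
gpus_needed = math.ceil(weights_gb * OVERHEAD / HBM_PER_GPU_GB)
print(f"{weights_gb:.0f} GB of weights -> ~{gpus_needed} GPUs at {HBM_PER_GPU_GB} GB each")
```

Fewer GPUs holding the model means fewer hops across the interconnect for every forward pass.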

Memory bandwidth matters. Inference on large models is memory-bandwidth bound, not compute bound. The B300's 8TB/s bandwidth directly translates to higher tokens per second.
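
A simple roofline shows why. During decode, each generated token must stream the active weights from HBM at least once, so bandwidth divided by bytes-per-token gives a single-stream upper bound. Numbers below are illustrative (FP8 weights, active parameters only, ignoring KV cache reads):

```python
ACTIVE_PARAMS = 37e9      # MoE: ~37B parameters activated per token
BYTES_PER_PARAM = 1.0     # FP8 weights (assumption)
HBM_BANDWIDTH = 8e12      # B300: ~8 TB/s

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM
tokens_per_sec = HBM_BANDWIDTH / bytes_per_token
print(f"~{tokens_per_sec:.0f} tokens/s per GPU (single-stream upper bound)")
```

Real serving stacks batch many requests to amortize each weight read, but the bound illustrates the direct coupling between bandwidth and tokens per second.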

Interconnect matters. Expert routing in MoE models creates communication patterns that benefit from NVLink's GPU-to-GPU bandwidth. Running on isolated GPUs without high-speed interconnect creates bottlenecks at the expert dispatch layer.
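
The traffic adds up quickly. With experts sharded across GPUs, each token's hidden state is dispatched to the GPUs hosting its routed experts and the expert outputs are combined back — an all-to-all on every MoE layer. A rough, illustrative estimate (using DeepSeek-V3's published hidden size; batch size and routing fan-out are assumptions):

```python
HIDDEN_DIM = 7168        # DeepSeek-V3 hidden size
BYTES_PER_ELEM = 2       # BF16 activations (assumption)
EXPERTS_PER_TOKEN = 8    # routed experts per token
BATCH_TOKENS = 4096      # illustrative batch size

# dispatch + combine: the hidden state crosses the interconnect out and back
bytes_per_token = HIDDEN_DIM * BYTES_PER_ELEM * EXPERTS_PER_TOKEN * 2
total_gb = bytes_per_token * BATCH_TOKENS / 1e9
print(f"~{total_gb:.2f} GB of all-to-all traffic per {BATCH_TOKENS}-token batch")
```

Nearly a gigabyte per batch, per MoE layer — tolerable over NVLink, painful over slower links.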

The Blackwell Ultra advantage

NVIDIA's own benchmarks show the GB300 NVL72 delivering approximately 1,000 tokens per second per user on DeepSeek R1-671B — roughly a 10x improvement over Hopper-generation hardware. At the rack level, an NVL72 achieves 30x the inference performance of a comparable Hopper configuration.

For teams evaluating self-hosted inference, these numbers change the cost-per-token calculation dramatically. At 10x the throughput on equivalent rack space, the amortized infrastructure cost per token drops by an order of magnitude.
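
The arithmetic is straightforward. With rack cost held constant and throughput scaled 10x, cost per token falls 10x — the dollar figures below are purely hypothetical placeholders to show the shape of the calculation:

```python
RACK_COST_PER_HOUR = 300.0   # hypothetical amortized $/hour for a rack
hopper_tps = 3_000           # hypothetical rack-level tokens/sec
blackwell_tps = hopper_tps * 10

def cost_per_million_tokens(tps):
    # amortized rack cost spread over tokens produced in an hour
    return RACK_COST_PER_HOUR / (tps * 3600) * 1e6

hopper_cost = cost_per_million_tokens(hopper_tps)
blackwell_cost = cost_per_million_tokens(blackwell_tps)
print(f"${hopper_cost:.2f} vs ${blackwell_cost:.2f} per million tokens")
```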

The new stack

The inference stack for open-source models is maturing rapidly. Multiple frameworks now support optimized DeepSeek serving:

  • SGLang — full support for FP8 and BF16 inference
  • TensorRT-LLM — BF16 inference with INT4/8 quantization
  • vLLM — FP8 and BF16 modes
  • LMDeploy — efficient FP8 and BF16 inference

Combined with the right hardware, these frameworks make it practical to serve frontier-quality models at API-provider latencies on your own infrastructure.
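
As an illustration, two of the frameworks above expose one-line launch commands. The model names and flags shown are typical examples; check each project's documentation for current options and hardware requirements:

```shell
# vLLM: OpenAI-compatible server, tensor-parallel across 8 GPUs
vllm serve deepseek-ai/DeepSeek-R1 --tensor-parallel-size 8

# SGLang: serve DeepSeek-V3 with tensor parallelism
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
    --tp 8 --trust-remote-code
```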

Our perspective

We built NovaCore for exactly this moment. As model weights become commoditized and openly available, the differentiator becomes infrastructure — the GPUs, interconnect, and operational expertise to serve them efficiently.

Whether you're evaluating self-hosted DeepSeek inference or planning a training run for your own models, talk to our team about Blackwell configurations optimized for your workload.