NVIDIA Blackwell Ultra: What the GB300 NVL72 Means for AI Infrastructure
NVIDIA shipped the B300 in January 2026, and the benchmarks are in. Blackwell Ultra represents a meaningful step forward from the already-impressive B200 — particularly for teams running long-context inference and large-scale training jobs.
Here's what the numbers actually mean for AI infrastructure decisions.
The specs
The B300 GPU features a dual-reticle design with 208 billion transistors and 160 Streaming Multiprocessors across two dies:
- 288GB HBM3e memory (up from 192GB on B200)
- 8TB/s memory bandwidth
- 15 PetaFLOPS dense FP4 compute
- 1.5x the FP4 compute of the standard Blackwell B200
- 1,400W TDP
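Those spec-sheet numbers translate into throughput via simple arithmetic. As a rough sketch (our simplification, not an NVIDIA figure): autoregressive decode is typically memory-bandwidth bound, so an upper bound on single-stream tokens per second is bandwidth divided by the bytes read per token. The model size below is a hypothetical example.

```python
# Back-of-envelope decode-throughput roofline from the B300 spec sheet.
# Simplifying assumption: each generated token streams all weights from
# HBM once; KV-cache traffic, batching, and kernel efficiency are ignored,
# so this is an optimistic upper bound, not a benchmark prediction.

HBM_BANDWIDTH_TBS = 8.0   # B300: 8 TB/s HBM3e
HBM_CAPACITY_GB = 288     # B300: 288 GB

def decode_tokens_per_s(params_b: float, bytes_per_param: float) -> float:
    """Upper bound on batch-1 decode throughput for one GPU."""
    weight_bytes = params_b * 1e9 * bytes_per_param
    return HBM_BANDWIDTH_TBS * 1e12 / weight_bytes

# Hypothetical 70B-parameter model quantized to FP4 (0.5 bytes/param):
print(round(decode_tokens_per_s(70, 0.5)))  # → 229 tokens/s, batch size 1
```

The same arithmetic shows why FP4 matters beyond raw FLOPS: halving bytes per parameter doubles this bandwidth-bound ceiling.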
Real-world inference performance
The most striking benchmark comes from DeepSeek R1-671B inference. Blackwell Ultra delivers approximately 1,000 tokens per second on this model, compared to Hopper's 100 tokens per second — a 10x increase in throughput.
For long-context workloads specifically, testing by LMSYS shows the GB300 delivering a 1.4x to 1.5x throughput advantage over the GB200, along with a 1.58x latency improvement in long-context inference scenarios.
Multi-Token Prediction (MTP) pushes this further, delivering a 1.87x improvement in user-perceived speed through speculative decoding.
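The mechanism behind that speedup is easy to sketch. In speculative decoding, a cheap drafter proposes k tokens and the full model verifies them in one pass; the standard analysis gives the expected tokens accepted per pass as (1 − α^(k+1)) / (1 − α) for a per-token acceptance rate α. The acceptance rate below is illustrative, not a derivation of NVIDIA's measured 1.87x.

```python
# Why multi-token prediction raises user-perceived speed: each expensive
# target-model pass can emit more than one token. Standard expected-value
# formula from speculative-decoding analyses; alpha and k are illustrative.

def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass,
    with draft length k and per-token acceptance rate alpha:
    (1 - alpha**(k+1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# A 2-token draft accepted 60% of the time per token nearly doubles
# tokens emitted per pass:
print(round(expected_tokens_per_step(0.6, 2), 2))  # → 1.96
```

If verification passes dominate latency, tokens per pass translates directly into user-perceived speedup, which is why a modest drafter can yield a near-2x gain.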
What this means at scale
NVIDIA's comparisons put a GB300 NVL72 rack at roughly 50x the AI-factory output of a comparable Hopper deployment — a figure that combines about 10x higher per-user throughput with about 5x more throughput per megawatt. If those numbers hold up in production, they change the economics of large-scale inference significantly.
For teams running inference-heavy workloads — serving large models to millions of users — the cost-per-token improvement is substantial. For training, the 288GB per-GPU memory reduces the need for model parallelism techniques that add communication overhead.
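The memory point is worth quantifying. A quick sketch (our arithmetic, with assumed FP4 weights at 0.5 bytes per parameter and a flat 20% overhead for KV cache and runtime buffers — real deployments vary widely) shows how capacity alone shrinks the model-parallel fan-out:

```python
# Minimum GPUs needed just to hold a model's weights, assuming FP4
# quantization (0.5 bytes/param) and a flat 20% overhead for KV cache,
# activations, and runtime buffers. Illustrative, not a sizing guide.

import math

def min_gpus(params_b: float, gpu_mem_gb: float,
             bytes_per_param: float = 0.5, overhead: float = 1.2) -> int:
    weight_gb = params_b * 1e9 * bytes_per_param / 1e9
    return math.ceil(weight_gb * overhead / gpu_mem_gb)

# A 671B-parameter model (DeepSeek R1 scale) in FP4:
print(min_gpus(671, 288))  # B300 (288 GB) → 2 GPUs
print(min_gpus(671, 192))  # B200 (192 GB) → 3 GPUs
```

Fewer shards means fewer all-reduce and all-gather hops per layer, which is where the reduced communication overhead comes from.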
The agentic AI angle
NVIDIA is explicitly positioning Blackwell Ultra for agentic AI applications — workloads where models need to process long contexts, maintain state across extended interactions, and reason through multi-step problems. The architecture's focus on long-context performance and latency reduction directly serves this use case.
Our take
We're tracking Blackwell Ultra closely and evaluating GB300 NVL72 configurations for our infrastructure roadmap. The inference performance improvements are particularly relevant for our customers running large language model serving workloads.
If you're planning infrastructure for Blackwell Ultra workloads, talk to our team about timelines and configurations.