
WarpSpeed on Blackwell: A Second Day

After a second day of search, WarpSpeed beats the optimised PyTorch baseline on 99% of SOL-ExecBench's Blackwell kernels

Published on June 29, 2026


Last month we gave WarpSpeed one day to run on NVIDIA's SOL-ExecBench: 235 of the hardest CUDA kernels in production AI, drawn from models like DeepSeek, Qwen, and Gemma and scored on a Blackwell B200.

With one day of compute, WarpSpeed beat the optimised PyTorch baselines on 90% of them, running 2.24× faster on average.

This time, we gave it two days. With a second day of search, WarpSpeed beats the baseline on 99% of the problems and runs 3.14× faster on average.[1]

Average speedup over the optimised Torch baselines, per problem set: Cursor, WarpSpeed after one day, and WarpSpeed after two days.

With two days of search, WarpSpeed improved on every problem set. It not only ranks #1 on the benchmark, it also comes close to the effective speed of light on many of the kernels it produced.

Per-problem SOL score for WarpSpeed after one day vs after two. Problems are sorted by their two-day score and averaged in pairs. A problem's SOL score measures how much of the gap between the optimised Torch baseline (0.5) and the hardware's speed of light (1.0, the fastest the B200 could physically run the workload) the kernel closes; higher is faster, and a score below 0.5 means the kernel is slower than the baseline. Note that on many of the tasks a SOL score of 1.0 is not actually attainable, due to physical limitations of the hardware.
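To make the scale concrete, here is a minimal sketch of one scoring function that matches the anchor points described above: 0.5 at the baseline runtime, 1.0 at the speed-of-light runtime, and below 0.5 for anything slower than the baseline. The exact formula is an assumption on our part for illustration and may differ from the benchmark's own definition; the function name `sol_score` and the runtimes used are hypothetical.

```python
def sol_score(t_kernel: float, t_baseline: float, t_sol: float) -> float:
    """Map a kernel runtime to a score in (0, 1].

    ASSUMED formula (illustrative only): it reproduces the anchor points
    given in the caption, but is not confirmed as the benchmark's definition.
    All three arguments are runtimes in the same unit.
    """
    if not 0 < t_sol < t_baseline:
        raise ValueError("expected 0 < t_sol < t_baseline")
    if t_kernel < t_sol:
        raise ValueError("a kernel cannot run faster than the speed-of-light bound")
    # Time spent above the bound, relative to the baseline's time above the bound.
    relative_excess = (t_kernel - t_sol) / (t_baseline - t_sol)
    return 1.0 / (1.0 + relative_excess)


if __name__ == "__main__":
    # Hypothetical runtimes in microseconds: baseline 10, speed-of-light bound 4.
    for t in (16.0, 10.0, 7.0, 4.0):
        print(f"runtime {t:5.1f} us -> score {sol_score(t, 10.0, 4.0):.3f}")
```

Under this assumed form, a kernel that matches the baseline scores 0.5, one that reaches the bound scores 1.0, and the score falls towards 0 as the kernel gets slower.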

Measured against the optimised PyTorch baselines, nearly every kernel WarpSpeed produced is faster, often by a wide margin.

Share of problems beating the optimised Torch baselines, per problem set: Cursor, WarpSpeed after one day, and WarpSpeed after two days.

For the hardest AI workloads on Blackwell, WarpSpeed doesn't just beat Torch — it approaches the physical limits of the underlying hardware.

[1] WarpSpeed beats the baseline on 233 of 235 problems. Average speedup is the geometric mean of per-problem speedup ratios.
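For readers who want to check the arithmetic behind the footnote, the sketch below computes a geometric mean of speedup ratios and the share of problems won. The helper names and the sample speedups are hypothetical; only the counts 233 and 235 come from the footnote.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive speedup ratios (baseline time / kernel time)."""
    ratios = list(ratios)
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("need at least one ratio, all strictly positive")
    # Averaging in log space avoids overflow from multiplying many ratios.
    return math.exp(math.fsum(math.log(r) for r in ratios) / len(ratios))


def win_share(wins: int, total: int) -> float:
    """Fraction of problems on which the kernel beats the baseline."""
    return wins / total


if __name__ == "__main__":
    # Hypothetical example: one kernel 2x faster, one 2x slower.
    sample = [2.0, 0.5]
    print("geometric mean:", geometric_mean(sample))
    print("arithmetic mean:", sum(sample) / len(sample))
    # Counts taken from the footnote.
    print(f"share of problems won: {win_share(233, 235):.1%}")
```

The geometric mean is the usual choice for ratios because a 2× speedup and a 2× slowdown cancel out to roughly 1×, whereas the arithmetic mean of the same pair would report a net gain.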