
FormulaOne: A Benchmark for Deep Algorithmic Reasoning

A reasoning benchmark at the level of real-world, mathematically complex problems, stumping all frontier AI models.

Published on March 1, 2026



Frontier AI models have recently demonstrated strong performance on mathematical and algorithmic benchmarks, including earning gold medals in olympiads and attaining top-percentile ratings in competitive programming contests. How well do such benchmarks capture the true depth of algorithmic reasoning as it arises in real-world research problems?

We believe that existing benchmarks fail to capture the deep reasoning skills required for complex, research-level algorithmic problems. To address this gap, we introduce FormulaOne.

FormulaOne consists of 220 novel dynamic programming problems over graphs. The problems are organised into three categories, ranging from moderate difficulty all the way up to research-level.

Category   Size   Description
Shallow    100    A set of "easier" problems.
Deeper     100    A set of challenging problems.
Deepest     20    A set of highly challenging problems.

The latter category is extremely demanding, requiring the resolution of many points of uncertainty and involving an array of reasoning steps: topological and geometric insight, knowledge of mathematical domains such as extremal graph theory and logic, combinatorial considerations, precise implementation, and more.

Despite their impressive performance on existing benchmarks, no current model solves even a single Deepest-tier problem.

An “Infinite Well” of Problems

Examples of FormulaOne problems

While the problems are often natural to state, their solutions are far from obvious. The solvability of this vast class of problems is guaranteed by an algorithmic meta-theorem due to Courcelle, which broadly states:

“For every sufficiently tree-like graph, any problem definable in an expressive formal logic — Monadic Second-Order (MSO) logic — can be solved by a dynamic programming algorithm that operates in time linear in the order of the graph.”
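
To make "definable in MSO" concrete, here is a small illustrative formula of our own (not one of the benchmark's problems): 3-colourability is expressible in MSO because the logic permits quantification over sets of vertices, not just individual vertices.

```latex
% 3-colourability as an MSO sentence: there exist vertex sets R, G, B
% covering every vertex, such that no edge has both endpoints in the
% same set. E(u, v) denotes adjacency.
\exists R \,\exists G \,\exists B \;\Bigl[
    \forall v \,\bigl( R(v) \lor G(v) \lor B(v) \bigr)
    \;\land\;
    \forall u \,\forall v \,\Bigl( E(u,v) \rightarrow
        \neg\bigl( R(u) \land R(v) \bigr) \land
        \neg\bigl( G(u) \land G(v) \bigr) \land
        \neg\bigl( B(u) \land B(v) \bigr) \Bigr)
\Bigr]
```

By Courcelle's theorem, any sentence of this shape immediately yields a linear-time algorithm on graphs of bounded treewidth, even though 3-colourability is NP-hard on general graphs.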

The key is to use a structure known as a tree decomposition, which organises the graph's vertices into a series of overlapping sets, or “bags”, that are themselves arranged in a tree.

An algorithm can then traverse this tree of bags, solving the problem piece by piece using dynamic programming. This process involves designing a “state” that summarises all necessary information about the partial solution within a bag, and then defining how this state transforms as vertices are introduced, forgotten, or bags are merged.
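
To make the introduce/forget/merge mechanics concrete, here is a minimal sketch in Python of this style of dynamic programming, applied to maximum independent set, a far simpler problem than anything in FormulaOne. The function names and the hand-built decomposition are our own illustration, not the benchmark's API.

```python
# A minimal sketch (our illustration, not the benchmark's API) of dynamic
# programming over a nice tree decomposition, solving MAXIMUM INDEPENDENT
# SET. The DP state maps each independent subset S of the current bag to
# the best solution size, over all processed vertices, that agrees with S.

def introduce(table, v, adj):
    """Vertex v enters the bag: extend every state, with v either excluded
    or, if none of its neighbours is chosen, included."""
    new = {}
    for S, best in table.items():
        new[S] = max(new.get(S, -1), best)            # exclude v
        if not (S & adj[v]):                          # no chosen neighbour of v
            Sv = frozenset(S | {v})
            new[Sv] = max(new.get(Sv, -1), best + 1)  # include v
    return new

def forget(table, v):
    """Vertex v leaves the bag: project it out, keeping the best value."""
    new = {}
    for S, best in table.items():
        key = S - {v}
        new[key] = max(new.get(key, -1), best)
    return new

def merge(t1, t2):
    """Two children with identical bags are merged: chosen bag vertices
    are counted in both tables, so subtract the overlap once."""
    return {S: t1[S] + t2[S] - len(S) for S in t1.keys() & t2.keys()}

# Path graph a-b-c, with a hand-built decomposition traversed leaf-to-root:
# {} -> introduce a -> introduce b -> forget a -> introduce c.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
table = {frozenset(): 0}                              # empty leaf bag
table = introduce(table, "a", adj)
table = introduce(table, "b", adj)
table = forget(table, "a")
table = introduce(table, "c", adj)
print(max(table.values()))                            # 2, i.e. {a, c}
```

Here the state is simply the set of chosen bag vertices; the hard FormulaOne problems demand far richer state representations, and that is precisely where the difficulty lies.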

The deceptive simplicity of the problem statements belies the extraordinary difficulty of discovering the correct dynamic programming solution. This process is riddled with subtle combinatorial and logical pitfalls, demanding a profound understanding of the problem's underlying structure. For a detailed walkthrough of the fifteen interdependent reasoning steps required to solve a single hard problem, Maximal-Cluster-Graph, see the appendix of our paper.

Evaluation

All models were evaluated using their highest available reasoning settings and with the maximum context length permitted. To give models the best possible chance of success, we provide a generous few-shot prompt that covers a broad array of the ideas and techniques involved in solving these problems.

Each submitted solution is subjected to a rigorous, automated test suite that measures three key aspects of its validity (a sketch of such a harness follows the list):

  • Correctness: The output of the submitted algorithm must be correct on all graphs.
  • Consistency: The solution must produce the same output for a given graph, regardless of the specific tree decomposition.
  • Efficiency: The solution must be genuinely fixed-parameter linear, i.e. run in time linear in the size of the graph for any fixed treewidth.
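
To illustrate how these criteria might reduce to concrete checks, here is a hypothetical harness sketch; `solve`, `brute_force`, `random_graph`, and `random_decompositions` are stand-in names of our own, not the utilities in our released evaluation environment.

```python
import time

def validate(solve, brute_force, random_graph, random_decompositions):
    """Hypothetical harness mirroring the three criteria above. All four
    arguments are stand-in callables, not the benchmark's real utilities;
    solve(graph, decomposition) is the submitted algorithm."""
    # 1. Correctness: agree with an exhaustive reference on small graphs.
    for _ in range(100):
        g = random_graph(n=8, treewidth=3)
        d = random_decompositions(g, count=1)[0]
        assert solve(g, d) == brute_force(g)

    # 2. Consistency: same answer whichever valid decomposition is supplied.
    g = random_graph(n=12, treewidth=3)
    assert len({solve(g, d) for d in random_decompositions(g, count=5)}) == 1

    # 3. Efficiency: at a fixed treewidth, runtime should grow roughly
    #    linearly with graph size (a crude proxy for FPT-linearity).
    times = []
    for n in (10_000, 20_000, 40_000):
        g = random_graph(n=n, treewidth=5)
        d = random_decompositions(g, count=1)[0]
        start = time.perf_counter()
        solve(g, d)
        times.append(time.perf_counter() - start)
    assert times[-1] / times[0] < 8   # 4x the input should cost well under 16x
```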

To support research and encourage community contributions, the FormulaOne-Shallow ("warmup") dataset is released as a public resource for training and fine-tuning models. The complete test suite for all 100 'Shallow' problems is available, alongside a standalone evaluation environment, in our GitHub repository.

To maintain the integrity of the core benchmark, only a minimal subset of tests is released for the Deeper and Deepest tiers. Solutions submitted to our benchmark are judged against a withheld, comprehensive test suite.

Model Accuracy

On the FormulaOne-Shallow problems, frontier models perform reasonably well. This confirms that they have a foundational capability for these types of algorithmic tasks; in other words, the tasks are squarely in-distribution.

However, as the reasoning depth increases in the Deeper tier, and solutions require the discovery and integration of novel and more complex state representations, model performance drops off sharply.

Performance of frontier models on the FormulaOne dataset.

This trend culminates in the Deepest tier, where the difficulty is characteristic of exploratory research problems. On this set of 20 problems, no current frontier model solves even a single one. This result starkly illustrates the gap that remains between high performance on existing benchmarks and the deep algorithmic reasoning required for truly complex problems.


View more at FormulaOne-Leaderboard

@misc{beniamini2025formulaonemeasuringdepthalgorithmic,
  title   = {FormulaOne: Measuring the Depth of Algorithmic
             Reasoning Beyond Competitive Programming},
  author  = {Gal Beniamini and Yuval Dor and Alon Vinnikov
             and Shir Granot Peled and Or Weinstein
             and Or Sharir and Noam Wies
             and Tomer Nussbaum and Nadav Schweiger
             and Ido Ben Shaul and Tomer Zekharya
             and Yoav Levine and Shai Shalev-Shwartz
             and Amnon Shashua},
  year    = {2025},
  eprint  = {2507.13337},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url     = {https://arxiv.org/abs/2507.13337}
}