Measuring AI Model Progress

Document Links

technology-knowledge-ai-measuring-progress.xlsx — HLE leaderboard data (Jan 2025–May 2026), GPQA Diamond leaderboard data, saturated-benchmark reference table, METR Time Horizon data, chart source data

AI capability continues to improve at a remarkable pace as frontier models race toward artificial general intelligence (AGI). But how close are we really to AI that can replace or substantially augment human knowledge workers?

Humanity's Last Exam HLE line chart by quarter and company, Q1 2025 through Q2 2026, showing frontier score rising from 7% to mid-40s

Humanity’s Last Exam (HLE) — Reasoning Models, Top Score per Company per Quarter | Source: Artificial Analysis HLE leaderboard, reasoning-models filter (snapshot May 24, 2026)

Frontier reasoning models scored just over 7% on Humanity’s Last Exam (“HLE”)[1] in January 2025 and reached the mid-40s by May 2026. Even so, the top score still sits just below 45%, meaning the best AI in the world still answers more than half of the questions on Humanity’s Last Exam incorrectly.

Saturation[2]: The Recurring Benchmark Pattern

GPQA Diamond line chart by quarter and company, Q1 2025 through Q2 2026, showing scores clustering at 88-94 percent near saturation

GPQA Diamond — Reasoning Models, Top Score per Company per Quarter | Source: Artificial Analysis GPQA Diamond leaderboard, reasoning-models filter (snapshot May 24, 2026)

GPQA Diamond, a 198-question graduate-level science benchmark introduced in late 2023 to test whether AI could match PhD-level domain experts, is approaching saturation. Frontier reasoning models now cluster in the 88–94% range, close to the ceiling that label noise and question ambiguity impose on any meaningful score. This is the recurring pattern in AI benchmarking: each new test is designed to be hard at release, frontier models match or exceed it within one to four years, and researchers are forced to design the next ceiling. HLE was designed as that next ceiling.
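One way to see why an 88–94% cluster on a 198-question test is hard to interpret is to look at sampling error alone. The sketch below is illustrative rather than a description of how any leaderboard scores models: it treats each question as an independent pass/fail trial (a simplifying assumption), uses the normal approximation to the binomial, and all function names are hypothetical. It ignores label noise and question ambiguity, which would only widen the uncertainty.

```python
import math

def standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy score on n_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

def margin_95(accuracy: float, n_questions: int) -> float:
    """Approximate 95% margin of error (normal approximation)."""
    return 1.96 * standard_error(accuracy, n_questions)

def distinguishable(acc_a: float, acc_b: float, n_questions: int) -> bool:
    """True if two scores differ by more than the 95% margin on their difference.

    Assumption: the two scores are treated as independent samples, which is a
    rough simplification because both models answer the same questions.
    """
    se_diff = math.sqrt(standard_error(acc_a, n_questions) ** 2
                        + standard_error(acc_b, n_questions) ** 2)
    return abs(acc_a - acc_b) > 1.96 * se_diff

if __name__ == "__main__":
    # GPQA Diamond: 198 questions, frontier scores around 90%.
    print(f"GPQA Diamond margin at 90%: +/- {margin_95(0.90, 198):.1%}")
    # HLE: 2,500 questions, frontier score around 45%.
    print(f"HLE margin at 45%: +/- {margin_95(0.45, 2500):.1%}")
    print("90% vs 93% on 198 questions distinguishable?",
          distinguishable(0.90, 0.93, 198))
```

Under these assumptions the margin on a 198-question test is roughly four percentage points at 90% accuracy, so a three-point gap between two frontier models falls within noise. A 2,500-question benchmark such as HLE has a margin of about two points at its current frontier score.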

Benchmark	Released	Saturation Reached	Years to Saturation
GLUE	Apr 2018	mid-2019	~1 year
SuperGLUE	May 2019	mid-2021	~2 years
MMLU	Sep 2020	Sep 2024	~4 years
HumanEval (coding)	Jul 2021	early 2025	~3.5 years
GSM8K (math)	Oct 2021	mid-2024	~2.5 years
MMLU-Pro	Jun 2024	Nov 2025	~1.5 years
GPQA Diamond	Nov 2023	Q1 2026 (approaching)	~2.5 years
HLE (current frontier)	Jan 2025	not yet (frontier ~45%)	—

Each benchmark was designed to be difficult at release; frontier scores reached saturation within one to four years, forcing researchers to design the next ceiling. “Saturation” refers to frontier scores clustering inside the benchmark’s label-noise and ambiguity ceiling, where differences between models are no longer statistically reliable; it does not mean 100% accuracy. Source: AI research literature; CRE42 compilation from benchmark release papers and Artificial Analysis leaderboard data.

METR Time Horizon: Measuring Agentic (Autonomous) AI Capabilities

METR Time Horizon p50 horizontal bar chart, GPT-2 through Claude Mythos Preview, showing exponential growth from 3 seconds to 17.4 hours

METR Time Horizon — 50% Success Rate (p50): How long a human-expert task an AI can complete autonomously half the time | Source: METR benchmark_results v1.1 dataset (Apr 6, 2026 release)

The METR Time Horizon chart[3] illustrates the uncertainty surrounding AI and its potential to augment or replace white-collar workers. On one hand, AI is advancing at an extraordinary rate, with the measured time horizon doubling every few months and showing the potential for exponential growth into the future. On the other hand, the scoring mechanism of the test itself points to AI’s most fundamental problems: intermittent reliability and hallucinations. The test measures the “task-completion time horizon,” defined by METR as “the task duration (measured by human expert completion time) at which an AI agent is predicted to succeed with a given level of reliability [in this case 50%].”[5] While going from less than five hours to 17+ hours in five months (Dec 2025 to Apr 2026) is extremely impressive, any human with a “success” rate of 50% (or even 80%) would not last long at a real-life company.[4]

Anyone who uses AI for complex, iterative analysis can attest to the frequency of hallucinations and to the models’ apparent preference for making up facts and figures over asking clarifying questions. The models continue to improve in this respect but still have a very long way to go before humans can be taken “out of the loop” for most professional processes.
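The growth figures in this section can be turned into an implied doubling time with a line of arithmetic. The sketch below assumes smooth exponential growth between two measurements, and its function names are hypothetical. The five-hour starting value is only the upper bound the text gives ("less than five hours"), so the result is an upper bound on the doubling time for that window.

```python
import math

def doubling_time_months(start_hours: float, end_hours: float,
                         months_elapsed: float) -> float:
    """Months per doubling, assuming exponential growth between two points."""
    doublings = math.log2(end_hours / start_hours)
    return months_elapsed / doublings

def projected_hours(start_hours: float, months_elapsed: float,
                    doubling_months: float) -> float:
    """Time horizon after months_elapsed at a constant doubling time."""
    return start_hours * 2.0 ** (months_elapsed / doubling_months)

if __name__ == "__main__":
    # Figures from the text: under 5 hours to 17.4 hours in five months.
    recent = doubling_time_months(5.0, 17.4, 5.0)
    print(f"Implied doubling time, Dec 2025 to Apr 2026: {recent:.1f} months")
    # What a 4.2-month doubling time (the trend since 2023, see note 3)
    # would have predicted from the same 5-hour starting point.
    trend = projected_hours(5.0, 5.0, 4.2)
    print(f"Trend-line projection after five months: {trend:.1f} hours")
```

On these inputs the most recent jump implies a doubling time of under three months, faster than the roughly 4.2-month trend. A single pair of measurements is noisy, so this should not be read as a new trend.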

Measured at the more demanding 80% success threshold (p80), the same models typically reach time horizons four to six times shorter than at p50:

METR Time Horizon p80 horizontal bar chart, GPT-2 through Claude Mythos Preview, showing growth from 1 second to 3.1 hours

METR Time Horizon — 80% Success Rate (p80): Same task universe, higher reliability threshold | Source: METR benchmark_results v1.1 dataset (Apr 6, 2026 release)
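To make the relationship between the p50 and p80 horizons concrete, the sketch below assumes the fitted success curve is logistic in the logarithm of task duration. That functional form, the function names, and the idea of recovering the slope from just two published horizons are illustrative assumptions, not METR's actual fitting code. The inputs are the frontier figures from the two charts: 17.4 hours at p50 and 3.1 hours at p80.

```python
import math

def slope_from_horizons(h50_hours: float, h80_hours: float) -> float:
    """Logistic slope (per doubling of task length) implied by two horizons.

    At the p80 horizon the log-odds of success are ln(0.8 / 0.2) = ln(4).
    """
    return math.log(4.0) / math.log2(h50_hours / h80_hours)

def success_probability(task_hours: float, h50_hours: float,
                        slope: float) -> float:
    """Assumed logistic curve: 0.5 at the p50 horizon, lower for longer tasks."""
    x = slope * (math.log2(h50_hours) - math.log2(task_hours))
    return 1.0 / (1.0 + math.exp(-x))

def horizon_at(reliability: float, h50_hours: float, slope: float) -> float:
    """Task duration (hours) at which predicted success equals reliability."""
    log_odds = math.log(reliability / (1.0 - reliability))
    return h50_hours * 2.0 ** (-log_odds / slope)

if __name__ == "__main__":
    slope = slope_from_horizons(17.4, 3.1)
    for hours in (1.0, 3.1, 8.0, 17.4, 40.0):
        p = success_probability(hours, 17.4, slope)
        print(f"{hours:5.1f}-hour task: predicted success {p:.0%}")
    # Extrapolation beyond the published thresholds; treat with caution.
    minutes = horizon_at(0.99, 17.4, slope) * 60.0
    print(f"Extrapolated 99% horizon: {minutes:.1f} minutes")
```

Under this assumed curve shape, extrapolating to a 99% reliability threshold shrinks the frontier horizon from hours to a few minutes, which is the reliability gap this section describes. The extrapolation lies well outside the published thresholds and is only as good as the logistic assumption.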

What to Watch for in 2026 and Beyond

AI has demonstrated an ability to work through complex problems with novel strategies. DeepMind’s AlphaGo defeated world Go champion Lee Sedol in 2016, AlphaFold revolutionized protein folding prediction (earning members of the AlphaFold team a share of the 2024 Nobel Prize in Chemistry), and most recently (May 2026) an internal OpenAI reasoning model offered a new solution to a well-known and previously misunderstood eighty-year-old geometry problem.[6] AI is exceptional at well-defined technical tasks like writing code, summarizing documents, and generating first-draft analyses, but most professional work involves iterative judgment, social negotiation, and an understanding of physical-world and organizational context.

The AI revolution will likely move faster than the previous three industrial revolutions but will still share a few common characteristics: (1) new technology adoption will be an iterative process as people and organizations learn to understand, use, and incorporate AI into their lives and workflows; (2) jobs will be augmented, replaced, and created in an ongoing process that affects different industries on different timelines and in different ways; and (3) the majority of today’s predictions and projections related to AI adoption and job displacement will look ridiculous in ten years. The third point is close to certain, because the range of current forecasts is so wide that most of them must turn out to be wrong.

Sources to Track AI Model Progress in 2026:

Source	Report / Series	Frequency	Notes
Artificial Analysis	Benchmark leaderboards (HLE, GPQA, etc.)	Continuous	Comprehensive cross-benchmark leaderboard with reasoning-models filter; primary source for HLE and GPQA Diamond scores
METR	Time Horizons benchmark	Per model release	Measures how long an AI agent can autonomously complete human-expert tasks at p50 / p80 success rates
Anthropic Economic Index	AI usage by occupation	Quarterly	Actual usage data from millions of Claude conversations mapped to BLS occupational codes
Stanford HAI	AI Index Annual Report	Annual	Authoritative cross-domain compilation: benchmarks, investment, hardware, policy, and labor-market indicators
Stanford Digital Economy Lab	AI labor-market research	Ongoing	ADP payroll-data studies of AI's impact on employment by occupation and demographic
McKinsey	Global Survey on the State of AI	Annual	Enterprise AI adoption rates, use-case patterns, and productivity gains across industries and geographies

AI & Computation Timeline

Long-Term Computation Milestones (pre-2022)

1956 — Dartmouth Conference / Birth of AI as a Field. John McCarthy coins “artificial intelligence” at Dartmouth College. The initial vision: machines that reason, learn, and solve problems like humans. Decades of alternating progress and “AI winters” follow, with each wave producing useful but narrow applications (chess engines, expert systems, early neural networks) before funding and enthusiasm recede.

1997–2011 — Deep Blue to Watson: Narrow AI Proves Itself. IBM’s Deep Blue defeats world chess champion Garry Kasparov (1997); IBM Watson wins Jeopardy! (2011). These milestones demonstrate that machines can outperform humans in specific, well-defined tasks — but each system is purpose-built, expensive, and unable to generalize.

2012–2016 — Deep Learning Breakthrough and the GPU Revolution. AlexNet wins the ImageNet competition (2012), proving deep neural networks can dramatically outperform traditional methods in image recognition. NVIDIA GPUs become the hardware of choice for training neural networks. Google DeepMind’s AlphaGo defeats world Go champion Lee Sedol (2016). Deep learning begins penetrating commercial applications.

2017–2021 — Transformers, GPT, and the Foundation Model Era. Google publishes “Attention Is All You Need” (2017), introducing the transformer architecture. OpenAI releases GPT-2 (2019) and GPT-3 (2020), demonstrating that very large language models can generate coherent text and answer questions across domains.

The AI Acceleration (2022–Present)

Nov 2022 — ChatGPT Launch. OpenAI releases ChatGPT, reaching an estimated 100 million users in two months, at the time the fastest consumer technology adoption on record. For the first time, general-purpose AI is accessible to non-technical users.

2023 — Enterprise AI Adoption Begins in Earnest. GPT-4 launches (Mar 2023) with significantly improved reasoning. Microsoft integrates AI into Office (Copilot). McKinsey reports 33% of organizations using generative AI in at least one business function.

2024 — Scaling and Investment Surge. Hyperscaler capex surges. McKinsey’s survey shows generative AI adoption rising from 33% to 71% of organizations in one year. Stanford HAI AI Index shows exponential improvement across benchmarks.

2025–2026 — Reasoning Models and the Saturation Race. Frontier labs ship dedicated reasoning models. HumanEval and MMLU-Pro saturate; GPQA Diamond approaches saturation in Q1 2026. HLE becomes the new frontier benchmark; frontier scores rise from just over 7% to the mid-40s in roughly sixteen months. METR Time Horizon p50 crosses 17 hours.

The Four Industrial Revolutions

Revolution	Era	Key Driver	Main Impact
First (1.0)	Late 1700s	Steam & Water	Mechanization: shifted work from hand-crafted cottage industries to factories with steam-powered machines.
Second (2.0)	Late 1800s	Electricity & Oil	Mass production: assembly lines, the lightbulb, the internal combustion engine.
Third (3.0)	Late 1900s	Electronics	The digital revolution: computers, the internet, programmable logic controllers automating production.
Fourth (4.0)	2020s–	Artificial Intelligence	Knowledge-work automation and augmentation; physical AI; the data-center buildout. Still in early innings.

[1] Humanity’s Last Exam, Center for AI Safety. A multidisciplinary 2,500-question benchmark covering mathematics, sciences, humanities, and engineering at expert level; released January 2025 with the explicit goal of resisting saturation longer than predecessors.

[2] “Saturation” in this context means frontier scores cluster inside the label-noise ceiling of the benchmark, making it statistically unreliable for distinguishing top models. It is not equivalent to 100% accuracy. See benchmark-specific source papers and the Saturated Benchmarks tab in the linked workbook.

[3] METR (Model Evaluation & Threat Research) times skilled human professionals on real software, ML, and cybersecurity tasks, then tests AI models on the same tasks. The p50 (50% success rate) and p80 (80% success rate) horizons identify the human task duration at which a model’s fitted success curve crosses the respective reliability threshold. Doubling time since 2023: approximately 4.2 months.

[4] The current p50 frontier of 17.4 hours was set by Claude Mythos Preview (early), an Anthropic model evaluated by METR on April 6, 2026. METR’s own methodology page notes that “measurements above 16 hours are unreliable with our current task suite,” meaning the measurement infrastructure itself is at its practical limit at this point on the curve. Successor models and an updated task suite are expected; near-term frontier movement is therefore as much a function of METR’s ability to expand its test set as of model capability.

[5] METR (Model Evaluation & Threat Research), “Measuring AI Ability to Complete Long Tasks.” Definition quoted from metr.org/time-horizons.

[6] The unit-distance problem asks for the maximum number of point pairs at distance exactly 1 that can be formed by N points in the plane. Erdős’s 1946 lower bound, based on square-grid constructions, was widely believed to be essentially optimal. OpenAI’s internal reasoning model produced an infinite family of constructions yielding a polynomial improvement, drawing on algebraic number theory in a way human researchers had not pursued. The proof was externally verified, and a companion paper was authored by human mathematicians. Since January 2026, AI has reportedly contributed to solving fifteen previously open Erdős problems.

Sources

[1] Artificial Analysis HLE leaderboard. artificialanalysis.ai

[2] Artificial Analysis GPQA Diamond leaderboard. artificialanalysis.ai

[3] Center for AI Safety, Humanity’s Last Exam. cais.org/hle

[4] Rein, Hou, et al. “GPQA: A Graduate-Level Google-Proof Q&A Benchmark.” arXiv:2311.12022 (Nov 2023).

[5] Wang, Singh, et al. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.” ICLR 2019.

[6] Hendrycks, Burns, et al. “Measuring Massive Multitask Language Understanding” (MMLU). ICLR 2021.

[7] Wang, Ma, et al. “MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark.” 2024.

[8] Chen et al. “Evaluating Large Language Models Trained on Code” (HumanEval). arXiv:2107.03374 (2021).

[9] Cobbe et al. “Training Verifiers to Solve Math Word Problems” (GSM8K). arXiv:2110.14168 (2021).

[10] METR, “Measuring AI Ability to Complete Long Tasks” (Mar 2025); METR benchmark_results v1.1 dataset (Apr 2026). metr.org/time-horizons

[11] OpenAI, “An OpenAI Model Has Disproved a Central Conjecture in Discrete Geometry” (May 2026). openai.com

[12] The Nobel Prize in Chemistry 2024, awarded to Demis Hassabis, John Jumper, and David Baker. nobelprize.org

[13] Silver et al., “Mastering the game of Go with deep neural networks and tree search” (AlphaGo). Nature 529 (Jan 2016).

[14] Jumper et al., “Highly accurate protein structure prediction with AlphaFold.” Nature 596 (Aug 2021); Varadi et al., “AlphaFold Protein Structure Database in 2024” (214M+ structures).