The Benchmark War: How We Score the Magnificent Seven on AI Model Performance

Not all benchmarks are created equal. Here is exactly how SEVENAI measures model performance — and why absolute scores matter far less than the direction of travel.

By SEVENAI Editorial  ·  May 17, 2026  ·  11 min read  ·  Methodology

30%

Dimension 1 of 5 · Highest weighted

Model Benchmarks

Performance on MMLU, HumanEval, MATH, and frontier evals. Scored on absolute performance and week-over-week improvement. The single largest component of the SEVENAI Momentum Index.

The most important number in the AI race is not a stock price, a revenue figure, or a headcount. It is a benchmark score: a percentage on a standardised evaluation that tells you, with more precision than any earnings call, whether a company's AI models are getting better or falling behind. Model benchmarks are the scoreboard of the AI race, and they account for 30% of the SEVENAI Momentum Index, the largest single dimension we track.

But benchmarks are also the most misunderstood and most manipulated numbers in the AI industry. Companies cherry-pick the evaluations where their models perform best. They train on benchmark data, inflating scores without improving real-world capability. They announce "state of the art" results on metrics that nobody in the industry actually uses. Understanding which benchmarks matter, why they matter, and how to read them honestly is one of the most important analytical skills for anyone following the AI race.

This post explains exactly how SEVENAI approaches model benchmark scoring — which evaluations we use, how we weight them, and how we translate raw benchmark numbers into a component of the weekly momentum score for each of the Magnificent Seven.

"A company that scores 85% on MMLU and improves by 2 points week-over-week is more interesting to us than a company that scores 91% and is flat. The race rewards momentum, not position."

— SEVENAI Methodology Notes, May 2026

The four benchmarks we track — and why

The AI industry produces dozens of benchmarks. We track four. The selection is not arbitrary — these four evaluations were chosen because they are hard to game, widely respected across the research community, and measure capabilities that translate directly to real-world commercial value.

Benchmark	What it measures	Weight	Why it matters commercially
MMLU (Massive Multitask Language Understanding)	Knowledge breadth across 57 academic and professional subjects, including law, medicine, finance, history, and science	35%	Enterprise AI buyers need models that can handle diverse professional domains. MMLU is the closest proxy for general enterprise readiness.
HumanEval (OpenAI coding evaluation)	Ability to write correct, functional Python code from natural language descriptions, checked against unit tests	30%	Software development is the highest-value AI use case in enterprise. Coding capability is the primary purchase criterion for the largest segment of AI buyers.
MATH (competition mathematics)	Multi-step mathematical reasoning from competition problems, which requires genuine logical chaining rather than pattern matching	20%	Mathematical reasoning is the best proxy for complex analytical tasks. A model that excels at MATH is a model that can handle financial analysis, scientific reasoning, and structured problem-solving.
Frontier Evals (GPQA, ARC-AGI, AIME)	Expert-level reasoning tasks designed to resist saturation, with problems that even PhD-level humans find genuinely difficult	15%	Standard benchmarks saturate as models improve. Frontier evals reveal the true capability ceiling and indicate which companies are approaching genuine reasoning, not sophisticated pattern matching.
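
One way to read the Weight column is as a plain weighted average across the four evaluations. The sketch below is illustrative only: the function name `composite_benchmark_score` and the sample scores are assumptions made for this example, not published SEVENAI code or data. Only the four weights come from the table.

```python
# Weights taken from the table above; everything else is illustrative.
BENCHMARK_WEIGHTS = {
    "MMLU": 0.35,
    "HumanEval": 0.30,
    "MATH": 0.20,
    "Frontier": 0.15,
}


def composite_benchmark_score(scores: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores, each on a 0-100 scale."""
    missing = set(BENCHMARK_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing benchmark scores: {sorted(missing)}")
    return sum(BENCHMARK_WEIGHTS[name] * scores[name] for name in BENCHMARK_WEIGHTS)


# Hypothetical scores for a made-up model (not real results).
example = {"MMLU": 90.0, "HumanEval": 80.0, "MATH": 70.0, "Frontier": 40.0}
print(round(composite_benchmark_score(example), 2))
```

With these made-up inputs the composite works out to 31.5 + 24 + 14 + 6 = 75.5. The point of the example is that a weak frontier-eval result drags the total down less than a weak MMLU result would.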

Absolute performance vs week-over-week improvement

The most important methodological decision in our benchmark scoring is the dual weighting of absolute performance and improvement trajectory. We score each company on both dimensions, weighted equally within the benchmark component.

Here is the intuition: a company with a model scoring 91% on MMLU that has been flat for three months is less interesting, from a competitive momentum perspective, than a company scoring 84% that has improved by 3 points in the same period. The first company has a better model today. The second company has more momentum — and in a race, momentum predicts the future better than current position.

This dual weighting is particularly important for understanding companies like Meta and Amazon, whose publicly released models may trail frontier performance but whose improvement trajectories are steep. A 3-point MMLU improvement in a single week is a significant signal about the pace of a company's research and engineering progress, regardless of where the absolute score sits.
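
The equal weighting can be pictured as a 50/50 blend of two sub-scores. This is a minimal sketch under stated assumptions: both sub-scores are taken to be already normalised to the range 0 to 1, the function name `blended_score` is hypothetical, and the improvement values below are invented purely to show the effect.

```python
def blended_score(absolute: float, improvement: float) -> float:
    """Equal-weight blend of two sub-scores, each assumed to lie in [0, 1]."""
    for name, value in (("absolute", absolute), ("improvement", improvement)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return 0.5 * absolute + 0.5 * improvement


# Hypothetical inputs: the improvement sub-scores are invented for illustration.
flat_leader = blended_score(absolute=0.91, improvement=0.50)
fast_improver = blended_score(absolute=0.84, improvement=0.80)
print(flat_leader, fast_improver)
```

Under these assumed inputs the faster improver ends up ahead despite the lower absolute score, which is the behaviour the dual weighting is meant to produce.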

How we calculate the improvement score

The momentum formula

Each week, we record the best publicly available benchmark score for each company's primary model across our four evaluations. The week-over-week change is converted to a standardised score using a rolling 12-week baseline. A company that improves faster than its own historical average scores higher, even if its absolute performance is lower than a competitor's. This rewards genuine research progress over static market position.
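
A common way to standardise a change against a rolling baseline is a z-score, and the sketch below uses that as a stand-in. It is an assumption, not the exact formula: the text above specifies only a rolling 12-week baseline, and the function name `improvement_z_score` and the sample series are hypothetical.

```python
from statistics import mean, stdev


def improvement_z_score(weekly_scores: list[float], window: int = 12) -> float:
    """Standardise the latest week-over-week change against the company's
    own changes over the preceding `window` weeks (a z-score)."""
    if len(weekly_scores) < window + 2:
        raise ValueError(f"need at least {window + 2} weekly scores")
    # Week-over-week changes, oldest first.
    changes = [b - a for a, b in zip(weekly_scores, weekly_scores[1:])]
    latest = changes[-1]
    baseline = changes[-(window + 1):-1]  # the `window` changes before the latest
    spread = stdev(baseline)
    if spread == 0:
        # Assumption: with no variation in the baseline the z-score is
        # undefined, so return a neutral 0.0 as a placeholder.
        return 0.0
    return (latest - mean(baseline)) / spread


# Hypothetical series: twelve weeks alternating +0 and +1, then a +2 jump.
history = [80, 80, 81, 81, 82, 82, 83, 83, 84, 84, 85, 85, 86, 88]
print(round(improvement_z_score(history), 2))
```

Because the baseline is the company's own history, a +2 week counts for a lot here only because this made-up company usually gains about half a point a week.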

Why each of the seven scores differently

The seven companies in our index have fundamentally different relationships with public model benchmarks — and understanding those differences is essential to interpreting the scores correctly.

Nvidia — scored differently from the others

Nvidia does not build foundation models in the traditional sense. Its NeMo framework and Megatron research models exist but are not its primary competitive product. For Nvidia, the model benchmark dimension of our index measures something different: the benchmark performance of models trained on Nvidia hardware vs competing hardware. When Blackwell-trained models consistently outperform TPU-trained models on HumanEval, that is a positive Nvidia benchmark signal. This indirect measurement is the correct one for an infrastructure company.

Microsoft — the OpenAI proxy problem

Microsoft's AI model performance is inseparable from OpenAI's. GPT-4o, o3, and whatever OpenAI releases next are Microsoft's benchmark story, because they are the models that power Copilot — Microsoft's primary AI product. We score Microsoft's benchmark dimension based on OpenAI model performance, with a discount applied for the dependency risk. A company whose benchmark scores depend on a partner's research output is less insulated than a company whose scores reflect internal capability.
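
The dependency discount can be thought of as a simple multiplicative haircut on the partner-derived score. The sketch below is illustrative: the size of the discount is not stated in this post, so `HYPOTHETICAL_DISCOUNT` is a placeholder, and the function name and input score are likewise assumptions.

```python
def apply_dependency_discount(partner_score: float, discount: float) -> float:
    """Scale a partner-derived score down by a fractional discount in [0, 1)."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return partner_score * (1.0 - discount)


# Placeholder value: the post does not state the real discount.
HYPOTHETICAL_DISCOUNT = 0.10

# Hypothetical partner-derived score on the 30-point scale.
print(round(apply_dependency_discount(28.0, HYPOTHETICAL_DISCOUNT), 2))
```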

Alphabet — the dual-model complexity

Alphabet's model work sits in two streams: Gemini, the commercial model family developed by Google DeepMind, and the wider body of research models that emerge from Google's research organisation (the former Google Brain team was merged into Google DeepMind in 2023). We score on Gemini performance for commercial benchmark purposes, while tracking research outputs as a leading indicator of future Gemini capability. The gap between Google's research output and its commercial model performance is one of the most watched metrics in our methodology.

Meta — the open-source advantage

Meta's benchmark situation is uniquely transparent. Because Llama models are publicly released, their benchmark scores can be independently checked by the research community within hours of launch, which leaves little room for cherry-picking or selective disclosure. Meta's benchmark scores are the most trustworthy in our index, and Llama 5's recent HumanEval results, which exceed GPT-4o on coding tasks, have driven the largest single-week benchmark score improvement we have recorded.

Amazon — the Bedrock inference problem

Amazon does not have a primary frontier model of its own. Its Nova model family exists but sits well below the capability frontier. For Amazon, we track benchmark performance of the models it hosts on Bedrock — weighted by estimated usage share — rather than its own model outputs. This is the honest measurement of Amazon's AI model capability as experienced by its customers.
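
Weighting by estimated usage share amounts to a weighted average over the hosted models. The following is a minimal sketch: the function name `usage_weighted_score`, the scores, and the usage shares are all hypothetical and do not describe real Bedrock usage.

```python
def usage_weighted_score(hosted: list[tuple[float, float]]) -> float:
    """Average benchmark score across hosted models, weighted by usage share.

    `hosted` holds (benchmark_score, estimated_usage_share) pairs. Shares are
    renormalised, so they do not need to sum to exactly 1.
    """
    total_share = sum(share for _, share in hosted)
    if total_share <= 0:
        raise ValueError("usage shares must sum to a positive number")
    return sum(score * share for score, share in hosted) / total_share


# Hypothetical hosted models: (score on a 0-100 scale, estimated usage share).
hosted_models = [(90.0, 0.5), (80.0, 0.3), (70.0, 0.2)]
print(round(usage_weighted_score(hosted_models), 2))
```

With these invented figures the result is 45 + 24 + 14 = 83, so the most heavily used model dominates the score even when weaker models are also on offer.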

Tesla — the physical world exception

Tesla's AI models do not appear on MMLU or HumanEval leaderboards. They are evaluated on entirely different metrics: miles per disengagement in FSD, object recognition accuracy in physical environments, task completion rates in robotic manipulation. For Tesla, we track these physical-world performance metrics and translate them into a comparable score using a normalisation methodology we will publish separately. Tesla's benchmark component is genuinely different from the other six — which is appropriate, because Tesla is competing in a genuinely different category of AI.

Apple — the privacy constraint

Apple publishes less benchmark data than any other company in our index. Its on-device models are evaluated internally and results are disclosed selectively. We supplement Apple's limited public disclosures with independent evaluations conducted by the research community on Apple Intelligence outputs. The 3–4 billion parameter constraint imposed by on-device deployment creates a structural benchmark ceiling that no amount of engineering can fully overcome against cloud-based competitors.

The benchmark gaming problem — and how we handle it

Every company in our index has an incentive to inflate benchmark scores. Training on benchmark data, selecting favourable evaluation conditions, and announcing scores on metrics that nobody else tracks are all common practices. Our methodology addresses this through three safeguards: we weight independent replications over company-announced scores, we track benchmark consistency across multiple evaluations rather than relying on any single result, and we apply a credibility discount to scores that cannot be independently verified within 30 days of announcement.
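
Two of the three safeguards can be sketched in a few lines. This is a simplification under stated assumptions: it prefers an independent replication outright rather than blending it with the announced score, it leaves out the cross-evaluation consistency check, and the `CREDIBILITY_DISCOUNT` value is a placeholder because the size of the discount is not given here. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

VERIFICATION_WINDOW_DAYS = 30   # from the text above
CREDIBILITY_DISCOUNT = 0.10     # hypothetical placeholder value


@dataclass
class ReportedScore:
    announced: float                     # company-announced score
    announced_on: date                   # date of the announcement
    replicated: Optional[float] = None   # independent replication, if any


def effective_score(report: ReportedScore, today: date) -> float:
    """Pick the score to use, preferring independent replications."""
    if report.replicated is not None:
        return report.replicated
    age_days = (today - report.announced_on).days
    if age_days > VERIFICATION_WINDOW_DAYS:
        # Unverified past the window: apply the credibility discount.
        return report.announced * (1.0 - CREDIBILITY_DISCOUNT)
    return report.announced


# Hypothetical report announced 40 days before the scoring date.
stale = ReportedScore(announced=90.0, announced_on=date(2026, 4, 7))
print(round(effective_score(stale, today=date(2026, 5, 17)), 2))
```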

Current benchmark scores — May 17, 2026

Here is where each company stands on the model benchmark dimension of the SEVENAI Momentum Index this week. This component contributes 30 points maximum to a company's overall score out of 100.

Model benchmark component scores, week of May 17, 2026 (max 30 pts)

Company	Ticker	Score	Week-over-week change
Nvidia	NVDA	29.1	▲ +0.4
Microsoft	MSFT	27.0	▲ +0.6
Alphabet	GOOGL	25.8	0.0

The bottom line on benchmarks

Model benchmarks are the most objective data available in a race that is otherwise difficult to measure. They are imperfect — gameable, selective, and often disconnected from real-world performance in ways that matter. But they are the best available signal of whether a company's AI research engine is accelerating or stalling, and at 30% of our total index, they are the single most important dimension we track.

The current benchmark picture tells a clear story: Nvidia and Microsoft lead, Meta is surging, Alphabet is flat but capable of a step-change on Gemini Ultra's release, and Apple is in structural decline relative to its competitors. That picture will change. Our job is to track it every week, score it honestly, and tell you what it means for the race.

Next week: we publish the methodology for the second dimension of the SEVENAI index — AI Capital Expenditure, which accounts for 25% of the total score and is the best leading indicator of competitive position six to twelve months from now.
