Not all benchmarks are created equal. Here is exactly how SEVENAI measures model performance — and why absolute scores matter far less than the direction of travel.
Model Benchmarks
Performance on MMLU, HumanEval, MATH, and frontier evals. Scored on absolute performance and week-over-week improvement. The single largest component of the SEVENAI Momentum Index.
The most important number in the AI race is not a stock price, a revenue figure, or a headcount. It is a benchmark score — a single percentage point on a standardised evaluation that tells you, with more precision than any earnings call, whether a company's AI models are getting better or falling behind. Model benchmarks are the scoreboard of the AI race, and they account for 30% of the SEVENAI Momentum Index — the largest single dimension we track.
But benchmarks are also the most misunderstood and most manipulated numbers in the AI industry. Companies cherry-pick the evaluations where their models perform best. They train on benchmark data, inflating scores without improving real-world capability. They announce "state of the art" results on metrics that nobody in the industry actually uses. Understanding which benchmarks matter, why they matter, and how to read them honestly is one of the most important analytical skills for anyone following the AI race.
This post explains exactly how SEVENAI approaches model benchmark scoring — which evaluations we use, how we weight them, and how we translate raw benchmark numbers into a component of the weekly momentum score for each of the Magnificent Seven.
"A company that scores 85% on MMLU and improves by 2 points week-over-week is more interesting to us than a company that scores 91% and is flat. The race rewards momentum, not position."
SEVENAI Methodology Notes, May 2026
The four benchmarks we track, and why
The AI industry produces dozens of benchmarks. We track four. The selection is not arbitrary — these four evaluations were chosen because they are hard to game, widely respected across the research community, and measure capabilities that translate directly to real-world commercial value.
| Benchmark | What it measures | Weight | Why it matters commercially |
|---|---|---|---|
| MMLU (Massive Multitask Language Understanding) | Knowledge breadth across 57 academic and professional subjects, including law, medicine, finance, history, and science | 35% | Enterprise AI buyers need models that can handle diverse professional domains. MMLU is the closest proxy for general enterprise readiness. |
| HumanEval (OpenAI coding evaluation) | Ability to write correct, functional Python code from natural-language function descriptions, checked against unit tests | 30% | Software development is the highest-value AI use case in enterprise. Coding capability is the primary purchase criterion for the largest segment of AI buyers. |
| MATH (competition mathematics) | Multi-step mathematical reasoning on competition problems, which requires genuine logical chaining rather than pattern matching | 20% | Mathematical reasoning is the best proxy for complex analytical tasks. A model that excels at MATH is a model that can handle financial analysis, scientific reasoning, and structured problem-solving. |
| Frontier evals (GPQA, ARC-AGI, AIME) | Hard reasoning tasks designed to resist saturation, including questions that even PhD-level experts find genuinely difficult | 15% | Standard benchmarks saturate as models improve. Frontier evals reveal the true capability ceiling and indicate which companies are approaching genuine reasoning, not sophisticated pattern matching. |
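The weights in the table can be read as a simple weighted average. The sketch below is illustrative only: the weights come from the table, but the function name, the 0-100 score scale, and the assumption that the three frontier evals have already been collapsed into one figure are ours, not a published SEVENAI implementation.

```python
# Weights are taken from the table above; everything else is illustrative.
BENCHMARK_WEIGHTS = {
    "mmlu": 0.35,
    "humaneval": 0.30,
    "math": 0.20,
    "frontier": 0.15,  # assumes GPQA, ARC-AGI and AIME are pre-combined
}


def weighted_benchmark_score(scores: dict[str, float]) -> float:
    """Combine per-benchmark scores (0-100) into one weighted figure (0-100)."""
    missing = BENCHMARK_WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing benchmark scores: {sorted(missing)}")
    return sum(BENCHMARK_WEIGHTS[name] * scores[name] for name in BENCHMARK_WEIGHTS)


# Hypothetical model, not a real result.
example = {"mmlu": 90.0, "humaneval": 80.0, "math": 70.0, "frontier": 40.0}
print(round(weighted_benchmark_score(example), 2))  # 75.5
```

Because MMLU and HumanEval together carry 65% of the weight, a weak frontier-eval result moves the combined figure far less than a weak coding result would.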
Absolute performance vs week-over-week improvement
The most important methodological decision in our benchmark scoring is the dual weighting of absolute performance and improvement trajectory. We score each company on both dimensions, weighted equally within the benchmark component.
Here is the intuition: a company with a model scoring 91% on MMLU that has been flat for three months is less interesting, from a competitive momentum perspective, than a company scoring 84% that has improved by 3 points in the same period. The first company has a better model today. The second company has more momentum — and in a race, momentum predicts the future better than current position.
This dual weighting is particularly important for understanding companies like Meta and Amazon, whose publicly released models may trail frontier performance but whose improvement trajectories are steep. A 3-point MMLU improvement in a single week is a significant signal about the pace of a company's research and engineering progress, regardless of where the absolute score sits.
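To make the equal weighting concrete, here is a minimal sketch of how an absolute sub-score and a momentum sub-score might be blended and scaled to the 30% that benchmarks carry in the index. It assumes both sub-scores have already been normalised to a 0-1 range; that normalisation, the function name, and the example momentum values are hypothetical.

```python
MAX_COMPONENT_POINTS = 30.0  # benchmarks account for 30% of the index


def benchmark_component(absolute: float, momentum: float) -> float:
    """Blend two sub-scores (each assumed to be in [0, 1]) with equal weight."""
    for name, value in (("absolute", absolute), ("momentum", momentum)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return MAX_COMPONENT_POINTS * (0.5 * absolute + 0.5 * momentum)


# Hypothetical inputs: a flat leader versus a fast-improving challenger.
leader_flat = benchmark_component(absolute=0.91, momentum=0.30)
challenger = benchmark_component(absolute=0.84, momentum=0.80)
print(round(leader_flat, 2), round(challenger, 2))
```

With these made-up inputs the challenger ends up ahead despite the lower absolute score, which is the behaviour the dual weighting is meant to produce.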
The momentum formula
Each week, we record the best publicly available benchmark score for each company's primary model across our four evaluations. The week-over-week change is converted to a standardised score using a rolling 12-week baseline drawn from that company's own history. A company that improves faster than its own historical average scores higher on the momentum half of the component, even if its absolute performance is lower than a competitor's. This rewards genuine research progress over static market position.
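One plausible reading of "standardised score using a rolling 12-week baseline" is a z-score of the latest weekly change against the company's previous twelve weekly changes. The sketch below follows that reading. The z-score choice, the `MIN_SPREAD` floor, and the sample history are assumptions made for illustration, not the published formula.

```python
from statistics import mean, stdev

BASELINE_WEEKS = 12  # rolling window named in the methodology
MIN_SPREAD = 0.1     # hypothetical floor so a flat baseline cannot divide by zero


def momentum_z_score(weekly_scores: list[float]) -> float:
    """Standardise the latest week-over-week change against the company's
    own previous 12 weekly changes."""
    if len(weekly_scores) < BASELINE_WEEKS + 2:
        raise ValueError(f"need at least {BASELINE_WEEKS + 2} weekly scores")
    # Consecutive differences: one change per pair of adjacent weeks.
    changes = [later - earlier for earlier, later in zip(weekly_scores, weekly_scores[1:])]
    latest = changes[-1]
    baseline = changes[-(BASELINE_WEEKS + 1):-1]  # the 12 changes before the latest
    spread = max(stdev(baseline), MIN_SPREAD)
    return (latest - mean(baseline)) / spread


# Hypothetical MMLU history: slow drift, then a 2-point jump in the final week.
history = [80.0, 80.2, 80.1, 80.4, 80.5, 80.5, 80.8, 80.9,
           81.0, 81.2, 81.1, 81.4, 81.5, 83.5]
print(momentum_z_score(history) > 3)  # the jump is far outside the usual range
```

Standardising against a company's own history means a steady improver is not rewarded again each week for its normal pace; only an acceleration relative to its own baseline raises the score.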
Why each of the seven scores differently
The seven companies in our index have fundamentally different relationships with public model benchmarks — and understanding those differences is essential to interpreting the scores correctly.
Nvidia — scored differently from the others
Nvidia does not build foundation models in the traditional sense. Its NeMo framework and Megatron research models exist but are not its primary competitive product. For Nvidia, the model benchmark dimension of our index measures something different: the benchmark performance of models trained on Nvidia hardware vs competing hardware. When Blackwell-trained models consistently outperform TPU-trained models on HumanEval, that is a positive Nvidia benchmark signal. This indirect measurement is the correct one for an infrastructure company.
Microsoft — the OpenAI proxy problem
Microsoft's AI model performance is inseparable from OpenAI's. GPT-4o, o3, and whatever OpenAI releases next are Microsoft's benchmark story, because they are the models that power Copilot — Microsoft's primary AI product. We score Microsoft's benchmark dimension based on OpenAI model performance, with a discount applied for the dependency risk. A company whose benchmark scores depend on a partner's research output is less insulated than a company whose scores reflect internal capability.
Alphabet — the dual-model complexity
Alphabet runs two major model lines: Gemini, developed by Google DeepMind, and the research models that emerge from Google's wider research programme (the former Google Brain team was merged into Google DeepMind in 2023). We score on Gemini performance for commercial benchmark purposes, while tracking research outputs as a leading indicator of future Gemini capability. The gap between Google's research output and its commercial model performance is one of the most watched metrics in our methodology.
Meta — the open-source advantage
Meta's benchmark situation is uniquely transparent. Because Llama models are publicly released, anyone in the research community can rerun the evaluations, and independent results typically appear within hours of launch. That leaves little room for cherry-picking or selective disclosure. Meta's benchmark scores are the most trustworthy in our index, and Llama 5's recent HumanEval results, which exceed GPT-4o on coding tasks, have driven the largest single-week benchmark score improvement we have recorded.
Amazon — the Bedrock inference problem
Amazon does not have a primary frontier model of its own. Its Nova model family exists but sits well below the capability frontier. For Amazon, we track benchmark performance of the models it hosts on Bedrock — weighted by estimated usage share — rather than its own model outputs. This is the honest measurement of Amazon's AI model capability as experienced by its customers.
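A usage-share weighting of this kind can be sketched as a weighted mean over hosted models. The function name, the input shape, and the example scores and shares below are hypothetical; how usage share is actually estimated is not described in this post.

```python
def usage_weighted_score(hosted: list[tuple[float, float]]) -> float:
    """Average benchmark scores over hosted models, weighted by usage share.

    hosted: (benchmark_score, estimated_usage_share) pairs. Shares are
    normalised here, so they do not need to sum to exactly 1.
    """
    total_share = sum(share for _, share in hosted)
    if total_share <= 0:
        raise ValueError("usage shares must sum to a positive number")
    return sum(score * share for score, share in hosted) / total_share


# Hypothetical hosted models: (score, estimated share of usage).
bedrock_models = [(88.0, 0.5), (82.0, 0.3), (70.0, 0.2)]
print(round(usage_weighted_score(bedrock_models), 2))  # 82.6
```

Dividing by the total share keeps the result stable when the usage estimates are incomplete or do not add up to 100%.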
Tesla — the physical world exception
Tesla's AI models do not appear on MMLU or HumanEval leaderboards. They are evaluated on entirely different metrics: miles per disengagement in FSD, object recognition accuracy in physical environments, task completion rates in robotic manipulation. For Tesla, we track these physical-world performance metrics and translate them into a comparable score using a normalisation methodology we will publish separately. Tesla's benchmark component is genuinely different from the other six — which is appropriate, because Tesla is competing in a genuinely different category of AI.
Apple — the privacy constraint
Apple publishes less benchmark data than any other company in our index. Its on-device models are evaluated internally and results are disclosed selectively. We supplement Apple's limited public disclosures with independent evaluations conducted by the research community on Apple Intelligence outputs. The 3–4 billion parameter constraint imposed by on-device deployment creates a structural benchmark ceiling that no amount of engineering can fully overcome against cloud-based competitors.
The benchmark gaming problem — and how we handle it
Every company in our index has an incentive to inflate benchmark scores. Training on benchmark data, selecting favourable evaluation conditions, and announcing scores on metrics that nobody else tracks are all common practices. Our methodology addresses this through three safeguards: we weight independent replications over company-announced scores, we track benchmark consistency across multiple evaluations rather than relying on any single result, and we apply a credibility discount to scores that cannot be independently verified within 30 days of announcement.
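The first and third safeguards can be sketched as a small scoring rule: prefer independent replications when they exist, and discount an announced score that is still unverified after 30 days. Only the 30-day window comes from the text above. The 0.7 replication weight, the 0.9 discount factor, and the `Claim` structure are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from datetime import date
from statistics import mean

VERIFICATION_WINDOW_DAYS = 30  # from the methodology
REPLICATION_WEIGHT = 0.7       # hypothetical: share given to independent results
UNVERIFIED_DISCOUNT = 0.9      # hypothetical: multiplier for stale, unverified claims


@dataclass
class Claim:
    announced_score: float
    announced_on: date
    replications: list[float] = field(default_factory=list)  # independent results


def credible_score(claim: Claim, today: date) -> float:
    """Return the score to use after applying the credibility rules."""
    if claim.replications:
        # Independent replications outweigh the company's own number.
        replicated = mean(claim.replications)
        return (REPLICATION_WEIGHT * replicated
                + (1 - REPLICATION_WEIGHT) * claim.announced_score)
    age_days = (today - claim.announced_on).days
    if age_days > VERIFICATION_WINDOW_DAYS:
        # Still unverified after the window: apply the discount.
        return UNVERIFIED_DISCOUNT * claim.announced_score
    return claim.announced_score  # recent claim, taken at face value for now


today = date(2026, 5, 17)
print(credible_score(Claim(90.0, date(2026, 4, 1), [86.0, 88.0]), today))
print(credible_score(Claim(90.0, date(2026, 4, 1)), today))
print(credible_score(Claim(90.0, date(2026, 5, 10)), today))
```

The second safeguard, consistency across multiple evaluations, is not modelled here; it would need per-benchmark results rather than a single score.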
Current benchmark scores — May 17, 2026
Here is where each company stands on the model benchmark dimension of the SEVENAI Momentum Index this week. This component contributes 30 points maximum to a company's overall score out of 100.
- 01. Meta's Llama 5 independent replications. The company-announced HumanEval scores are exceptional. We are waiting on independent verification from three academic labs before finalising the benchmark uplift in our index. If the scores hold, Meta's benchmark component could surpass Microsoft's within two weeks, the first time an open-source model has led the index.
- 02. Google's Gemini 2.5 Ultra release. Delayed from its originally announced Q2 window, Gemini Ultra's benchmark scores on MATH and frontier evals will determine whether Alphabet closes the gap with Microsoft or widens it. A strong result would be the most significant positive score movement for any company since we launched the index.
- 03. Apple's WWDC model announcements. The annual developer conference is the one moment each year where Apple discloses meaningful model capability data. If Apple Intelligence's next generation demonstrates a path beyond the 4B parameter ceiling, the benchmark component score could recover materially. If not, Apple risks falling to a score that reflects genuine capability stagnation rather than strategic choice.
- 04. Microsoft's o4 benchmark disclosure. OpenAI's next reasoning model is expected in Q3. If its frontier eval performance marks a step-change improvement over o3, Microsoft's benchmark component will see its largest single-week gain since we began tracking. The dependency on OpenAI's research timeline is Microsoft's primary benchmark risk.
The bottom line on benchmarks
Model benchmarks are the most objective data available in a race that is otherwise difficult to measure. They are imperfect — gameable, selective, and often disconnected from real-world performance in ways that matter. But they are the best available signal of whether a company's AI research engine is accelerating or stalling, and at 30% of our total index, they are the single most important dimension we track.
The current benchmark picture tells a clear story: Nvidia and Microsoft lead, Meta is surging, Alphabet is flat but capable of a step-change on Gemini Ultra's release, and Apple is in structural decline relative to its competitors. That picture will change. Our job is to track it every week, score it honestly, and tell you what it means for the race.
Next week: we publish the methodology for the second dimension of the SEVENAI index — AI Capital Expenditure, which accounts for 25% of the total score and is the best leading indicator of competitive position six to twelve months from now.