
The Benchmark War: How We Score the Magnificent Seven on AI Model Performance


Not all benchmarks are created equal. Here is exactly how SEVENAI measures model performance — and why absolute scores matter far less than the direction of travel.

By SEVENAI Editorial  ·  May 17, 2026  ·  11 min read  ·  Methodology
30% of the index  ·  Dimension 1 of 5  ·  Highest weighted

Model Benchmarks

Performance on MMLU, HumanEval, MATH, and frontier evals. Scored on absolute performance and week-over-week improvement. The single largest component of the SEVENAI Momentum Index.

The most important number in the AI race is not a stock price, a revenue figure, or a headcount. It is a benchmark score — a single percentage point on a standardised evaluation that tells you, with more precision than any earnings call, whether a company's AI models are getting better or falling behind. Model benchmarks are the scoreboard of the AI race, and they account for 30% of the SEVENAI Momentum Index — the largest single dimension we track.

But benchmarks are also the most misunderstood and most manipulated numbers in the AI industry. Companies cherry-pick the evaluations where their models perform best. They train on benchmark data, inflating scores without improving real-world capability. They announce "state of the art" results on metrics that nobody in the industry actually uses. Understanding which benchmarks matter, why they matter, and how to read them honestly is one of the most important analytical skills for anyone following the AI race.

This post explains exactly how SEVENAI approaches model benchmark scoring — which evaluations we use, how we weight them, and how we translate raw benchmark numbers into a component of the weekly momentum score for each of the Magnificent Seven.

"A company that scores 85% on MMLU and improves by 2 points week-over-week is more interesting to us than a company that scores 91% and is flat. The race rewards momentum, not position."

— SEVENAI Methodology Notes, May 2026

The four benchmarks we track — and why

The AI industry produces dozens of benchmarks. We track four. The selection is not arbitrary — these four evaluations were chosen because they are hard to game, widely respected across the research community, and measure capabilities that translate directly to real-world commercial value.

| Benchmark | What it measures | Weight | Why it matters commercially |
| --- | --- | --- | --- |
| MMLU (Massive Multitask Language Understanding) | Knowledge breadth across 57 academic and professional subjects, including law, medicine, finance, history and science | 35% | Enterprise AI buyers need models that can handle diverse professional domains. MMLU is the closest proxy for general enterprise readiness. |
| HumanEval (OpenAI coding evaluation) | Ability to write correct, functional Python code from natural-language function descriptions, checked against unit tests | 30% | Software development is the highest-value AI use case in enterprise. Coding capability is the primary purchase criterion for the largest segment of AI buyers. |
| MATH (competition mathematics) | Multi-step mathematical reasoning on competition problems, which requires genuine logical chaining rather than pattern matching | 20% | Mathematical reasoning is the best proxy for complex analytical tasks. A model that excels at MATH is a model that can handle financial analysis, scientific reasoning, and structured problem-solving. |
| Frontier Evals (GPQA, ARC-AGI, AIME) | Expert-level reasoning tasks designed to resist saturation: problems that even PhD-level humans find genuinely difficult | 15% | Standard benchmarks saturate as models improve. Frontier evals reveal the true capability ceiling and indicate which companies are approaching genuine reasoning, not sophisticated pattern matching. |
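
To make the weighting concrete, here is a minimal sketch of how four per-benchmark scores could be folded into one figure using the weights from the table. The function name and the example scores are illustrative assumptions, not real results for any company.

```python
# Weights are taken from the table above; everything else is illustrative.
BENCHMARK_WEIGHTS = {
    "MMLU": 0.35,
    "HumanEval": 0.30,
    "MATH": 0.20,
    "Frontier": 0.15,
}


def weighted_benchmark_score(scores: dict[str, float]) -> float:
    """Combine per-benchmark scores (0-100) into one weighted figure (0-100)."""
    missing = set(BENCHMARK_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing benchmark scores: {sorted(missing)}")
    return sum(BENCHMARK_WEIGHTS[name] * scores[name] for name in BENCHMARK_WEIGHTS)


# Made-up scores for a hypothetical model, purely to show the arithmetic.
example_scores = {"MMLU": 90.0, "HumanEval": 80.0, "MATH": 70.0, "Frontier": 40.0}
combined = weighted_benchmark_score(example_scores)
print(round(combined, 2))  # 31.5 + 24.0 + 14.0 + 6.0 = 75.5
```

Because MMLU and HumanEval together carry 65% of the weight, a weak frontier-eval result moves the combined figure far less than a comparable slip on either of those two.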

Absolute performance vs week-over-week improvement

The most important methodological decision in our benchmark scoring is the dual weighting of absolute performance and improvement trajectory. We score each company on both dimensions, weighted equally within the benchmark component.

Here is the intuition: a company with a model scoring 91% on MMLU that has been flat for three months is less interesting, from a competitive momentum perspective, than a company scoring 84% that has improved by 3 points in the same period. The first company has a better model today. The second company has more momentum — and in a race, momentum predicts the future better than current position.

This dual weighting is particularly important for understanding companies like Meta and Amazon, whose publicly released models may trail frontier performance but whose improvement trajectories are steep. A 3-point MMLU improvement in a single week is a significant signal about the pace of a company's research and engineering progress, regardless of where the absolute score sits.
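
The equal weighting can be pictured with a small sketch. It assumes both sub-scores have already been normalised to a 0 to 1 scale, and the normalised improvement values used for the two companies are hypothetical placeholders chosen only to illustrate the intuition above.

```python
def blended_component(absolute: float, improvement: float) -> float:
    """Blend two sub-scores, each already normalised to 0..1, with equal weight."""
    for name, value in (("absolute", absolute), ("improvement", improvement)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 0.5 * absolute + 0.5 * improvement


# Hypothetical inputs: 0.91 and 0.84 mirror the MMLU scores in the example,
# while the improvement values (0.50 = flat, 0.90 = rising fast) are assumed.
leader_flat = blended_component(absolute=0.91, improvement=0.50)
challenger_rising = blended_component(absolute=0.84, improvement=0.90)

print(round(leader_flat, 3), round(challenger_rising, 3))
```

Under these assumed inputs the rising challenger ends up with the higher blended score despite the lower absolute result, which is the behaviour the equal weighting is meant to produce.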

How we calculate the improvement score

The momentum formula

Each week, we record the best publicly available benchmark score for each company's primary model across our four evaluations. The week-over-week change is converted to a standardised score using a rolling 12-week baseline. A company that improves faster than its own historical average scores higher, even if its absolute performance is lower than a competitor. This rewards genuine research progress over static market position.
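
One plausible reading of "standardised against a rolling 12-week baseline" is a z-score of the latest weekly change against the company's own previous twelve weekly changes. The sketch below works under that assumption; the exact formula is not spelled out here, and the `min_spread` floor is a hypothetical guard added so that a perfectly flat history does not cause a division by zero.

```python
from statistics import mean, pstdev


def improvement_z_score(
    weekly_scores: list[float], window: int = 12, min_spread: float = 0.1
) -> float:
    """Standardise the latest week-over-week change against the company's
    own previous `window` weekly changes (an assumed interpretation)."""
    changes = [later - earlier for earlier, later in zip(weekly_scores, weekly_scores[1:])]
    if len(changes) < window + 1:
        raise ValueError(f"need at least {window + 2} weekly scores, got {len(weekly_scores)}")
    latest = changes[-1]
    baseline = changes[-(window + 1):-1]  # the `window` changes before the latest one
    spread = max(pstdev(baseline), min_spread)  # floor avoids dividing by zero
    return (latest - mean(baseline)) / spread


# Made-up history: gains alternate between 0.0 and 0.4 points for 12 weeks,
# then the latest week jumps by a full point.
history = [80.0, 80.0, 80.4, 80.4, 80.8, 80.8, 81.2,
           81.2, 81.6, 81.6, 82.0, 82.0, 82.4, 83.4]
z = improvement_z_score(history)
print(round(z, 2))  # about 4.0: the jump is roughly four spreads above the usual gain
```

Because the baseline is the company's own history, the same one-point gain would score lower for a company that routinely gains a point a week.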

Why each of the seven scores differently

The seven companies in our index have fundamentally different relationships with public model benchmarks — and understanding those differences is essential to interpreting the scores correctly.

Nvidia — scored differently from the others

Nvidia does not build foundation models in the traditional sense. Its NeMo framework and Megatron research models exist but are not its primary competitive product. For Nvidia, the model benchmark dimension of our index measures something different: the benchmark performance of models trained on Nvidia hardware vs competing hardware. When Blackwell-trained models consistently outperform TPU-trained models on HumanEval, that is a positive Nvidia benchmark signal. This indirect measurement is the correct one for an infrastructure company.

Microsoft — the OpenAI proxy problem

Microsoft's AI model performance is inseparable from OpenAI's. GPT-4o, o3, and whatever OpenAI releases next are Microsoft's benchmark story, because they are the models that power Copilot — Microsoft's primary AI product. We score Microsoft's benchmark dimension based on OpenAI model performance, with a discount applied for the dependency risk. A company whose benchmark scores depend on a partner's research output is less insulated than a company whose scores reflect internal capability.

Alphabet — the dual-model complexity

Alphabet runs two tracks that matter here: Gemini, its commercial model family developed by Google DeepMind, and the wider research output of the former Google Brain teams, which were merged into Google DeepMind in 2023. We score on Gemini performance for commercial benchmark purposes, while tracking that research output as a leading indicator of future Gemini capability. The gap between Google's research output and its commercial model performance is one of the most watched metrics in our methodology.

Meta — the open-source advantage

Meta's benchmark situation is uniquely transparent. Because Llama models are publicly released, anyone can rerun the evaluations, and the research community typically does so within hours of launch. That leaves far less room for cherry-picking or selective disclosure than with closed models. Once replicated, Meta's benchmark scores are the most trustworthy in our index, and Llama 5's recent HumanEval results, which exceed GPT-4o on coding tasks, have driven the largest single-week benchmark score improvement we have recorded.

Amazon — the Bedrock inference problem

Amazon does not have a primary frontier model of its own. Its Nova model family exists but sits well below the capability frontier. For Amazon, we track benchmark performance of the models it hosts on Bedrock — weighted by estimated usage share — rather than its own model outputs. This is the honest measurement of Amazon's AI model capability as experienced by its customers.
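
A usage-weighted score of this kind is a weighted average. The sketch below assumes each hosted model comes with a benchmark score and an estimated usage share; the model names, scores and shares are placeholders, not Bedrock data.

```python
def usage_weighted_score(hosted: list[tuple[str, float, float]]) -> float:
    """Average benchmark scores weighted by estimated usage share.

    Each entry is (model_name, benchmark_score, estimated_usage_share).
    Shares are renormalised, so they do not need to sum to exactly 1.
    """
    total_share = sum(share for _, _, share in hosted)
    if total_share <= 0:
        raise ValueError("usage shares must sum to a positive number")
    return sum(score * share for _, score, share in hosted) / total_share


# Hypothetical hosted models with made-up scores and usage shares.
hosted_models = [
    ("model_a", 90.0, 0.5),
    ("model_b", 80.0, 0.3),
    ("model_c", 70.0, 0.2),
]
print(round(usage_weighted_score(hosted_models), 2))  # 45 + 24 + 14 = 83.0
```

Weighting by usage means a strong model that few customers call adds little, while the most heavily used model dominates the result.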

Tesla — the physical world exception

Tesla's AI models do not appear on MMLU or HumanEval leaderboards. They are evaluated on entirely different metrics: miles per disengagement in FSD, object recognition accuracy in physical environments, task completion rates in robotic manipulation. For Tesla, we track these physical-world performance metrics and translate them into a comparable score using a normalisation methodology we will publish separately. Tesla's benchmark component is genuinely different from the other six — which is appropriate, because Tesla is competing in a genuinely different category of AI.

Apple — the privacy constraint

Apple publishes less benchmark data than any other company in our index. Its on-device models are evaluated internally and results are disclosed selectively. We supplement Apple's limited public disclosures with independent evaluations conducted by the research community on Apple Intelligence outputs. The 3–4 billion parameter constraint imposed by on-device deployment creates a structural benchmark ceiling that no amount of engineering can fully overcome against cloud-based competitors.

The benchmark gaming problem — and how we handle it

Every company in our index has an incentive to inflate benchmark scores. Training on benchmark data, selecting favourable evaluation conditions, and announcing scores on metrics that nobody else tracks are all common practices. Our methodology addresses this through three safeguards: we weight independent replications over company-announced scores, we track benchmark consistency across multiple evaluations rather than relying on any single result, and we apply a credibility discount to scores that cannot be independently verified within 30 days of announcement.
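
Two of these safeguards can be sketched in a few lines: preferring a replicated score over the announced one, and discounting a claim that is still unverified after 30 days. Only the 30-day window comes from the text. The `replication_weight` and `unverified_discount` values, along with every name in the code, are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class BenchmarkClaim:
    announced_score: float
    announced_on: date
    replicated_score: Optional[float] = None  # None until independently verified


def credible_score(
    claim: BenchmarkClaim,
    today: date,
    replication_weight: float = 0.8,   # assumed: how much replications dominate
    unverified_discount: float = 0.10,  # assumed: haircut after the grace period
    grace_days: int = 30,               # the 30-day window mentioned above
) -> float:
    """Return the score to use after applying the two safeguards."""
    if claim.replicated_score is not None:
        # Independent replication carries most of the weight.
        return (replication_weight * claim.replicated_score
                + (1 - replication_weight) * claim.announced_score)
    if (today - claim.announced_on).days > grace_days:
        # Still unverified after the grace period: apply the discount.
        return claim.announced_score * (1 - unverified_discount)
    return claim.announced_score


today = date(2026, 5, 17)
replicated = BenchmarkClaim(90.0, date(2026, 4, 1), replicated_score=85.0)
stale = BenchmarkClaim(90.0, date(2026, 4, 2))   # 45 days old, unverified
fresh = BenchmarkClaim(90.0, date(2026, 5, 7))   # 10 days old, unverified
print(credible_score(replicated, today), credible_score(stale, today), credible_score(fresh, today))
```

With these placeholder values, the same announced 90 counts as roughly 86 when a replication came in lower, 81 when nobody has verified it after 45 days, and the full 90 while it is still inside the grace period.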

Current benchmark scores — May 17, 2026

Here is where each company stands on the model benchmark dimension of the SEVENAI Momentum Index this week. This component contributes 30 points maximum to a company's overall score out of 100.

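Model benchmark component scores, week of May 17, 2026 (max 30 pts)

| Company | Ticker | Score | Weekly change |
| --- | --- | --- | --- |
| Nvidia | NVDA | 29.1 | ▲ +0.4 |
| Microsoft | MSFT | 27.0 | ▲ +0.6 |
| Alphabet | GOOGL | 25.8 | 0.0 (flat) |
| Meta | META | 27.0 | ▲ +1.8 |
| Amazon | AMZN | 20.4 | 0.0 (flat) |
| Tesla | TSLA | 21.6 | ▲ +0.3 |
| Apple | AAPL | 15.6 | ▼ −0.6 |
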
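
For readers who want to work with these numbers, the short sketch below expresses each score as a share of the 30-point maximum and backs out last week's score from the reported change. The figures are copied from the table; the function and field names are illustrative.

```python
MAX_POINTS = 30.0

# (score this week, change versus last week), copied from the table above.
COMPONENT_SCORES = {
    "NVDA": (29.1, 0.4),
    "MSFT": (27.0, 0.6),
    "GOOGL": (25.8, 0.0),
    "META": (27.0, 1.8),
    "AMZN": (20.4, 0.0),
    "TSLA": (21.6, 0.3),
    "AAPL": (15.6, -0.6),
}


def summarise(scores: dict[str, tuple[float, float]]) -> list[dict]:
    """Add share-of-maximum and implied previous-week score, ranked by share."""
    rows = [
        {
            "ticker": ticker,
            "share_of_max": round(100 * current / MAX_POINTS, 1),
            "previous": round(current - change, 1),
        }
        for ticker, (current, change) in scores.items()
    ]
    # sorted() is stable, so tied companies keep their order from the table.
    return sorted(rows, key=lambda row: row["share_of_max"], reverse=True)


summary = summarise(COMPONENT_SCORES)
for row in summary:
    print(f'{row["ticker"]:<6}{row["share_of_max"]:>6}%  (last week: {row["previous"]})')
```

Read this way, Nvidia holds 97% of the available points and Apple 52%, and Meta's implied score a week ago was 25.2, below Alphabet's 25.8.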
What to watch
  • 01. Meta's Llama 5 independent replications. The company-announced HumanEval scores are exceptional. We are waiting on independent verification from three academic labs before finalising the benchmark uplift in our index. If the scores hold, Meta's benchmark component could surpass Microsoft's within two weeks, which would be the first time an open-source model has led the index.
  • 02. Google's Gemini 2.5 Ultra release. Delayed from its originally announced Q2 window, Gemini Ultra's benchmark scores on MATH and frontier evals will determine whether Alphabet closes the gap with Microsoft or widens it. A strong result would be the most significant positive score movement for any company since we launched the index.
  • 03. Apple's WWDC model announcements. The annual developer conference is the one moment each year where Apple discloses meaningful model capability data. If Apple Intelligence's next generation demonstrates a path beyond the 4B parameter ceiling, the benchmark component score could recover materially. If not, Apple risks falling to a score that reflects genuine capability stagnation rather than strategic choice.
  • 04. Microsoft's o4 benchmark disclosure. OpenAI's next reasoning model is expected in Q3. If its frontier eval performance marks a step-change improvement over o3, Microsoft's benchmark component will see its largest single-week gain since we began tracking. The dependency on OpenAI's research timeline is Microsoft's primary benchmark risk.

The bottom line on benchmarks

Model benchmarks are the most objective data available in a race that is otherwise difficult to measure. They are imperfect — gameable, selective, and often disconnected from real-world performance in ways that matter. But they are the best available signal of whether a company's AI research engine is accelerating or stalling, and at 30% of our total index, they are the single most important dimension we track.

The current benchmark picture tells a clear story: Nvidia leads, Microsoft sits second, a surging Meta has drawn level with Microsoft at 27.0, Alphabet is flat but capable of a step-change on Gemini Ultra's release, and Apple is in structural decline relative to its competitors. That picture will change. Our job is to track it every week, score it honestly, and tell you what it means for the race.

Next week: we publish the methodology for the second dimension of the SEVENAI index — AI Capital Expenditure, which accounts for 25% of the total score and is the best leading indicator of competitive position six to twelve months from now.
