TL;DR
- Key Finding: A Google Research study accepted at AAAI-26 found that standard AI benchmarks use too few human raters, making model comparisons statistically unreliable.
- Core Problem: Majority voting across three to five raters erases meaningful human disagreement, hiding information that could change which AI model ranks higher.
- Recommended Fix: More than ten raters per example are needed, and budget allocation strategy matters more than total budget size.
- New Tool: The team released VET, an open-source simulator that helps benchmark designers optimize rater budgets before committing annotation resources.
- Broader Context: The study adds to growing scrutiny of AI evaluation methods, following prior critiques of crowdsourced benchmark platforms in 2025.
AI benchmarks that compare language models typically collect three to five human ratings per test example and pick a winner by majority vote. A Google Research study published in late March 2026 shows that this approach systematically erases the very disagreement it should be measuring, undermining the reliability of evaluations that inform product decisions across the industry.
Accepted at the annual AAAI Conference on Artificial Intelligence, the study titled “Forest vs Tree” by researchers from Google Research and Rochester Institute of Technology demonstrates that the standard three-to-five rater count is statistically insufficient for reliable AI model comparisons. How annotation budgets are split between breadth (more test examples) and depth (more raters per example) matters as much as the budget itself, and many benchmarks get that split wrong. For statistically reliable results that capture the range of human opinion, more than ten raters per example are generally needed.
How Budget Allocation Changes Results
At its core, the study exposes a trade-off between two legitimate evaluation goals. For accuracy-based metrics that rely on majority vote agreement, a wide approach performs strongest: many test examples with few raters each. For distribution-aware metrics like total variation, which capture the full spread of human opinion, the opposite holds: fewer test examples but meaningfully more raters per item.
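The distinction between the two metric families is concrete. Total variation distance scores a model against the full spread of rater opinion rather than a single majority label. A minimal sketch of the idea, using made-up numbers rather than anything from the study:

```python
# Total variation distance between two categorical distributions:
# TV(p, q) = 0.5 * sum(|p_i - q_i|). It is 0 when the distributions
# match exactly and 1 when they share no probability mass.

def total_variation(p, q):
    assert abs(sum(p) - 1) < 1e-9 and abs(sum(q) - 1) < 1e-9
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Human raters split 3-2 on "toxic" vs "not toxic" -> (0.6, 0.4).
humans = (0.6, 0.4)

# A model that outputs a hard "toxic" label agrees with the majority
# vote (accuracy credits it fully) but misses the disagreement.
hard_model = (1.0, 0.0)

# A model whose confidence mirrors the rater split scores better
# under a distribution-aware metric.
soft_model = (0.6, 0.4)

print(total_variation(humans, hard_model))  # 0.4
print(total_variation(humans, soft_model))  # 0.0
```

Estimating the human side of this comparison is exactly what requires depth: a handful of raters gives only a coarse, noisy picture of the true distribution.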
Current benchmarks overwhelmingly follow the first approach, casting a wide net across test examples while collecting only a thin layer of human judgment for each one. Two items can receive the same majority-vote label yet have very different response distributions underneath. One might reflect near-unanimous agreement among raters; another might reflect a narrow 3-2 split.
By collapsing both into a single label, majority voting treats confident consensus and razor-thin margins as identical, hiding information that could change which AI model appears to perform better. For safety-sensitive applications, those lost signals represent the gap between a model that handles easy cases well and one that fails on the hard ones.
To illustrate the problem, the study authors point to toxicity detection:
“Both comments get the same ‘Toxic’ label by majority vote, even though evaluators in the second case disagree significantly. Standard benchmarks ignore this difference entirely.”
Google Research study authors
Consider a content moderation benchmark where five raters evaluate whether a social media post is offensive. If all five agree a post is harmful, that consensus carries different implications for model training than a 3-2 split where cultural background or personal experience drives the disagreement. Under majority voting, both cases produce the same binary label, and a model trained on that label learns nothing about the contested nature of the second example.
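The collapse described above is easy to reproduce. In this sketch (hypothetical ratings, not data from the study), a unanimous 5-0 item and a contested 3-2 item reduce to the same binary label under majority vote, even though their underlying distributions differ sharply:

```python
from collections import Counter

def majority_label(ratings):
    """Most common label across raters (ties broken arbitrarily)."""
    return Counter(ratings).most_common(1)[0][0]

def label_distribution(ratings):
    """Fraction of raters choosing each label."""
    counts = Counter(ratings)
    return {label: n / len(ratings) for label, n in counts.items()}

unanimous = ["toxic"] * 5                # 5-0: confident consensus
contested = ["toxic"] * 3 + ["ok"] * 2   # 3-2: genuine disagreement

# Majority vote: both items collapse to the same label.
print(majority_label(unanimous), majority_label(contested))  # toxic toxic

# The distributions underneath tell a different story.
print(label_distribution(unanimous))   # {'toxic': 1.0}
print(label_distribution(contested))   # {'toxic': 0.6, 'ok': 0.4}
```

Any metric computed only from `majority_label` cannot distinguish the two items, which is precisely the information loss the study quantifies.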
A poorly chosen split can produce unreliable conclusions even with a much larger annotation budget. Across their experiments, the researchers tested scales from 100 to 50,000 items and crowd sizes from 1 to 500 raters per item. Four real datasets anchored the analysis: Toxicity (107,620 comments, 17,280 raters), DICES (350 conversations, 123 raters), D3code (4,554 items, 4,309 raters from 21 countries), and Jobs (2,000 tweets, 5 raters each).
Distribution-aware metrics needed the smallest overall budget to produce reliable results, suggesting that capturing human disagreement is not inherently more expensive, just differently allocated. Reliable results can often be achieved with around 1,000 total annotations, but only if the budget is split correctly between test examples and raters. In one experiment, a poorly allocated budget of 10,000 annotations produced less reliable rankings than a well-allocated budget of just 1,000, underscoring that strategy matters more than scale.
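Why depth matters for distribution-aware metrics can be sketched with a toy simulation; this is an illustration of the statistical effect, not a reproduction of the study's experiments. Each item gets a true probability that a rater marks it positive, a fixed budget is split into items times raters, and the error of the empirical per-item rating distribution is measured against the truth:

```python
import random
import statistics

random.seed(0)

def simulate_tv_error(n_items, raters_per_item, trials=200):
    """Average error of the estimated per-item rating distribution.

    Each item has a true P(rater says 'toxic') drawn uniformly; we
    sample `raters_per_item` binary ratings and compare the empirical
    positive rate to the truth (absolute error, averaged over items).
    """
    errors = []
    for _ in range(trials):
        total = 0.0
        for _ in range(n_items):
            p_true = random.random()
            votes = sum(random.random() < p_true
                        for _ in range(raters_per_item))
            total += abs(votes / raters_per_item - p_true)
        errors.append(total / n_items)
    return statistics.mean(errors)

budget = 1000  # total annotations, split as items * raters
for n_items, raters in [(200, 5), (100, 10), (50, 20)]:
    err = simulate_tv_error(n_items, raters)
    print(f"{n_items} items x {raters} raters -> mean error {err:.3f}")
```

Under the same 1,000-annotation budget, the deeper splits recover each item's rating distribution more accurately, which is the direction of the study's finding that distribution-aware metrics reward depth over breadth.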
The study’s “forest vs. tree” metaphor captures this dynamic: accuracy-based evaluation resembles looking at a forest from above, prioritizing coverage over detail, while distribution-aware evaluation resembles examining individual trees closely, gaining richer information at the cost of surveying fewer. Neither approach is universally superior; the right choice depends entirely on what a benchmark is trying to measure. Benchmark designers who choose the wrong strategy risk drawing conclusions that would not replicate if the evaluation were run again with different raters.
An Open-Source Tool for Benchmark Design
To help benchmark designers navigate this trade-off, the team built an open-source simulator called Variance Estimation Toolkit (VET) that replicates human rating patterns using real datasets. Calibrated against five real datasets covering toxicity detection, chatbot safety, and cross-cultural offensiveness assessment, VET lets researchers test thousands of combinations across different total budgets and rater counts. Running simulations before committing annotation resources enables benchmark designers to determine the optimal split for their specific evaluation goal without wasting time and money on trial-and-error campaigns.
Rather than prescribing a single formula, VET accounts for the fact that different evaluation metrics respond differently to changes in breadth and depth. A benchmark measuring simple accuracy may need 500 items rated by three people each, while a benchmark measuring opinion diversity on the same data could require 50 items rated by 20 people each. VET replaces guesswork with data-driven allocation decisions grounded in empirical rater behavior.
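VET's actual interface is not reproduced here, but the kind of search such a simulator performs can be sketched conceptually: enumerate the (items, raters) splits that fit a budget and score each with a caller-supplied reliability estimate. The `toy_reliability` function below is invented purely for illustration:

```python
def budget_splits(budget, min_raters=1, max_raters=500):
    """All (n_items, raters_per_item) pairs that exactly spend the budget."""
    return [(budget // r, r)
            for r in range(min_raters, max_raters + 1)
            if budget % r == 0 and budget // r > 0]

def best_split(budget, reliability):
    """Pick the split maximizing a caller-supplied reliability score."""
    return max(budget_splits(budget), key=lambda s: reliability(*s))

# Made-up reliability score that (like distribution-aware metrics in
# the study) rewards depth once a minimum breadth is met.
def toy_reliability(n_items, raters):
    if n_items < 20:
        return 0.0          # too few items: results can't generalize
    return 1 - 1 / raters   # more raters -> tighter per-item estimates

print(best_split(1000, toy_reliability))  # (20, 50)
```

In a real tool the reliability score would come from simulations calibrated on actual rater behavior, as VET's are, rather than from a hand-written heuristic.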
Google Research scientists Flip Korn and Chris Welty led the study alongside PhD student Deepak Pandita and Prof. Christopher Homan from Rochester Institute of Technology. Korn and Welty described the goal as achieving “truly reliable results that reflect human nuance.” VET’s source code is publicly available on GitHub, and the full paper is accessible on arXiv.
A Growing Chorus of Benchmark Criticism
The study arrives amid broader skepticism about AI evaluation methods. In April 2025, academics and AI ethics specialists challenged crowdsourced benchmark platforms like LMArena, questioning their fairness and methodology. Around the same time, LMArena launched as a standalone company spun out from the original Chatbot Arena project, reflecting both growing commercial interest in benchmarking and increased scrutiny of its methods. Separately, a July 2025 study found deep flaws in benchmark design that could sharply inflate performance estimates.
Where those critiques questioned what benchmarks measure, this Google Research study shifts focus to the human judgment layer underneath, questioning whether benchmarks collect enough human input to measure anything reliably. Combined, the findings paint a picture of an evaluation ecosystem where neither the metrics nor the human data feeding them have kept pace with the rapid advancement of the models they are meant to assess.
When human evaluations decide which AI model comes out on top, the stakes extend beyond academic rankings. Companies use benchmark results to market products, investors rely on them to evaluate startups, and regulators are beginning to reference them in policy discussions about AI safety and capability thresholds.
With AI systems increasingly deployed in high-stakes applications, from content moderation to medical triage, the gap between how benchmarks score models and how humans perceive their outputs in practice carries real consequences. A model that appears to handle toxic content well under a majority-vote benchmark might perform poorly on edge cases where raters genuinely disagree about what counts as harmful. Deploying such a model in multilingual communities, where cultural norms around offensiveness vary widely, could produce inconsistent enforcement that the benchmark did not predict.
Adopting the study’s findings means recognizing that the typical three-to-five rater count per example is often insufficient for reproducible model comparisons. Until benchmark designers account for the full spectrum of human judgment rather than collapsing it into a single vote, the evaluations shaping which AI models reach the market will remain fundamentally unreliable. Open-source tools like VET offer a practical path forward, but the AI evaluation community must first acknowledge that its current methods are systematically incomplete.

