AI Benchmarks Use Too Few Raters to Be Reliable


TL;DR

  • Key Finding: A Google Research study accepted at AAAI-26 found that standard AI benchmarks use too few human raters, making model comparisons statistically unreliable.
  • Core Problem: Majority voting across three to five raters erases meaningful human disagreement, hiding information that could change which AI model ranks higher.
  • Recommended Fix: More than ten raters per example are generally needed, and how an annotation budget is allocated can matter as much as its total size.
  • New Tool: The team released VET, an open-source simulator that helps benchmark designers optimize rater budgets before committing annotation resources.
  • Broader Context: The study adds to growing scrutiny of AI evaluation methods, following prior critiques of crowdsourced benchmark platforms in 2025.

AI benchmarks that compare language models typically collect three to five human ratings per test example and pick a winner by majority vote. A Google Research study published in late March 2026 found that this approach systematically erases the very disagreement it should be measuring, undermining the reliability of evaluations that inform product decisions across the industry.

The study, titled “Forest vs Tree” and accepted at the annual AAAI Conference on Artificial Intelligence, was conducted by researchers from Google Research and the Rochester Institute of Technology. It demonstrates that the standard three-to-five rater count is statistically insufficient for reliable AI model comparisons. How an annotation budget is split between breadth (more test examples) and depth (more raters per example) matters as much as the budget itself, and many benchmarks get that split wrong. For statistically reliable results that capture the range of human opinion, more than ten raters per example are generally needed.

How Budget Allocation Changes Results

At its core, the study exposes a trade-off between two legitimate evaluation goals. For accuracy-based metrics that rely on majority-vote agreement, a wide allocation performs best: many test examples with few raters each. For distribution-aware metrics such as total variation distance, which capture the full spread of human opinion, the opposite holds: fewer test examples but meaningfully more raters per item.
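The depth side of this trade-off can be illustrated with a toy Monte Carlo sketch. This is my own construction, not the paper's VET simulator or its actual generative model: it assumes binary ratings and a uniform range of true per-item preference rates, then compares how well wide versus deep allocations of the same budget recover each item's rating distribution.

```python
import random

def mean_tv_error(n_items, raters_per_item, rng):
    """Mean total-variation gap between the empirical and true per-item
    rating distributions. For binary ratings, TV reduces to |p_hat - p|."""
    total = 0.0
    for _ in range(n_items):
        # Hypothetical: true share of raters preferring model A on this item.
        p = rng.uniform(0.3, 0.9)
        votes = sum(rng.random() < p for _ in range(raters_per_item))
        p_hat = votes / raters_per_item
        total += abs(p_hat - p)
    return total / n_items

rng = random.Random(0)
budget = 3000                                # total ratings we can afford
wide = mean_tv_error(budget // 3, 3, rng)    # 1000 items x 3 raters
deep = mean_tv_error(budget // 15, 15, rng)  # 200 items x 15 raters
print(f"wide (3 raters/item):  mean TV error ~ {wide:.3f}")
print(f"deep (15 raters/item): mean TV error ~ {deep:.3f}")
```

Under this setup the deep allocation consistently yields a smaller distribution-estimation error per item, mirroring the study's point that depth, not breadth, is what distribution-aware metrics need.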

Current benchmarks overwhelmingly follow the first model, casting a wide net across test examples while collecting only a thin layer of human judgment for each one. Two items can receive the same majority-vote label yet have very different response distributions underneath. One might reflect near-unanimous agreement among raters; another might reflect a narrow 3-2 split.

By collapsing both into a single label, majority voting treats confident consensus and razor-thin margins as identical, hiding information that could change which AI model appears to perform better. For safety-sensitive applications, those lost signals represent the gap between a model that handles easy cases well and one that fails on the hard ones.
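A minimal sketch of that collapse, using hypothetical ratings rather than data from the study: two items with very different rater distributions receive the identical majority-vote label, while a distribution-aware metric such as total variation distance keeps them apart.

```python
from collections import Counter

def majority_label(ratings):
    """Most common rating wins; ties are broken arbitrarily."""
    return Counter(ratings).most_common(1)[0][0]

def distribution(ratings):
    """Empirical rating distribution as {label: fraction}."""
    n = len(ratings)
    return {label: count / n for label, count in Counter(ratings).items()}

def total_variation(p, q):
    """Total variation distance between two rating distributions."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)

# Item A: near-unanimous agreement. Item B: a narrow 3-2 split.
item_a = ["good", "good", "good", "good", "good"]
item_b = ["good", "good", "good", "bad", "bad"]

print(majority_label(item_a), majority_label(item_b))  # both "good"
print(total_variation(distribution(item_a), distribution(item_b)))  # 0.4
```

Majority voting maps both items to the same label, while the total variation distance of 0.4 between their rating distributions records exactly the disagreement the vote discards.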