December 19, 2025

Randomness in AI Benchmarks: What Makes an Eval Trustworthy?

Zairah Mustahsan

Staff Data Scientist

[Figure: Variance components and ICC convergence for GPT-5 on the FRAMES benchmark, by trials per question and number of questions.]

How do you know if your favorite chatbot, research agent, or AI tool is really better—or if it just happened to perform well on that particular run? Because AI models sample from probability distributions and interact with external systems, the same agent can produce different answers across runs. You.com researchers Zairah Mustahsan, Abel Lim, Megna Anand, Saahil Jain, and Bryan McCann uncover why a single number can’t tell the full story, and what we should do about it.

Our research has been accepted at two AAAI 2026 workshops. Check out the full paper and open-source code on GitHub.

The Problem: AI Agents Are Unpredictable

Modern AI isn’t just about text generation. Today’s large language models (LLMs) are increasingly acting as agents—autonomously searching the web, solving multi-step problems, and using tools to plan and reason. Evaluating these agentic systems, to ensure they’re completing tasks as designed, is now a cornerstone of AI research and deployment.

But here’s the catch: most benchmarks only report a single accuracy or success rate from one run. This “leaderboard” approach hides a dirty secret—AI agents can be wildly inconsistent from one run to the next. What looks like progress may simply be a lucky streak.

Who cares? Well, if an AI sub-agent is unreliable, your entire system becomes brittle, unpredictable, and potentially unsafe. Imagine choosing an agent for a financial system, a self-driving car, or a medical application, only for it to fail half the time on the same task.

ICC: A Fresh Lens on AI Reliability

To bring transparency and rigor to agent evaluation, a group of researchers at You.com proposed a simple but powerful idea: don’t just report accuracy. Instead, report how consistent the agent is—using a metric called the Intraclass Correlation Coefficient (ICC).

What is ICC, and Why Does It Matter?

The concept of ICC originally comes from medicine and psychology, where it’s used to measure the reliability of tests and raters.

According to the paper “Intraclass correlation – A discussion and demonstration of basic features,” available through the National Library of Medicine, “The intra-class correlation coefficient (ICC) is a number, usually found to have a value between 0 and 1…It refers to correlations within a class of data (for example correlations within repeated measurements of weight), rather than to correlations between two different classes of data (for example the correlation between weight and length).”

In AI, ICC answers the question: “If I run this agent on the same question multiple times, do I get the same result?” High ICC means the agent is predictable and stable. Low ICC means its performance is mostly random noise.

Think of two agents, both with 73% accuracy. One has an ICC of 0.66 (very consistent), the other just 0.30 (inconsistent). Which would you trust in a critical application? The first one. ICC is the tiebreaker that reveals whether an AI’s performance is real or just a roll of the dice.
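To make this concrete, here is a minimal sketch of how ICC can be estimated from repeated runs, using the classic one-way random-effects ICC(1) formula on a questions × trials matrix of binary correctness scores. The helper name icc_1 and the toy data are illustrative assumptions, not the paper’s exact implementation.

```python
import numpy as np

def icc_1(scores: np.ndarray) -> float:
    """One-way random-effects ICC(1) for a (questions x trials) score matrix.

    Each row is one question; each column is one repeated trial of the agent.
    Scores can be binary (0/1 correct) or continuous.
    """
    n, k = scores.shape                       # n questions, k trials per question
    row_means = scores.mean(axis=1)
    grand_mean = scores.mean()

    # Between-question and within-question mean squares (one-way ANOVA).
    ms_between = k * np.sum((row_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((scores - row_means[:, None]) ** 2) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Toy example: 5 questions, 8 trials each, with per-question success rates so
# that some real between-question signal exists.
rng = np.random.default_rng(0)
p = np.array([0.9, 0.8, 0.5, 0.2, 0.1])
scores = rng.binomial(1, p[:, None], size=(5, 8)).astype(float)
print(f"ICC(1) = {icc_1(scores):.2f}")
```

Intuitively, ICC(1) is high when most of the variation in scores comes from differences between questions rather than from run-to-run noise on the same question.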

How Stochastic Are Today’s AI Agents? A Deep Dive

To determine the stochasticity—the quality of being random—of AI agents, the You.com research team ran extensive experiments on two popular benchmarks:

  • GAIA: Measures agentic reasoning and tool use, with tasks of varying complexity.
  • FRAMES: Tests an agent’s ability to retrieve and reason with factual information from multiple documents.

For each task, they didn’t just run the agent once—they ran it up to 64 times per question and measured how much the results varied.
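As a rough illustration of that setup, the sketch below shows how repeated trials might be collected into a score matrix; run_agent, grade, and questions are hypothetical stand-ins for whatever agent, grader, and benchmark you are evaluating, not code from the paper.

```python
import numpy as np

def collect_trials(questions, run_agent, grade, n_trials=64):
    """Run an agent n_trials times on every question and return a
    (questions x trials) matrix of binary correctness scores."""
    scores = np.zeros((len(questions), n_trials))
    for i, q in enumerate(questions):
        for t in range(n_trials):
            answer = run_agent(q["prompt"])      # stochastic: may differ per trial
            scores[i, t] = float(grade(answer, q["gold"]))
    return scores

# Per-question variance shows which questions the agent answers inconsistently:
# scores = collect_trials(questions, run_agent, grade, n_trials=64)
# per_question_variance = scores.var(axis=1)
```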

Key Takeaways:

  • Agentic evaluations are inherently stochastic. Randomness comes from the model itself, external APIs, and even the task setup.
  • ICC varies dramatically by task and agent. On hard, open-ended reasoning tasks (GAIA Level 3), older models like GPT-4o had ICC as low as 0.30 (meaning most “progress” was just noise). Newer models like GPT-5 improved both accuracy and ICC, showing real, reliable gains.
  • In information retrieval tasks (FRAMES), ICC was much higher (up to 0.71), meaning results were more stable and meaningful.
  • ICC converges quickly. You usually only need 8-16 trials per question for easy tasks, and up to 32 for the hardest ones, to get a trustworthy reliability estimate (see the sketch below).
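As a rough way to sanity-check that last point on your own data, you can recompute ICC on growing subsets of trials and watch the estimate settle. The snippet below builds on the icc_1 helper from the earlier sketch and uses synthetic scores; the specific trial counts simply mirror the 8/16/32 guidance above.

```python
import numpy as np

# Synthetic (questions x 64 trials) matrix with per-question difficulty,
# standing in for real agent runs; `icc_1` is the helper defined earlier.
rng = np.random.default_rng(7)
p_correct = rng.beta(2, 2, size=120)                 # one success rate per question
scores = rng.binomial(1, p_correct[:, None], size=(120, 64)).astype(float)

for k in (4, 8, 16, 32, 64):
    estimate = icc_1(scores[:, :k])                  # ICC from the first k trials
    print(f"{k:>2} trials/question -> ICC = {estimate:.2f}")
```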

Why Does Stochasticity Matter for Developers and Users?

Single-run leaderboards can be misleading, as they don’t reveal whether an apparent jump in accuracy is genuine or simply the result of chance—something you can only determine by knowing the ICC. 

And, because system reliability hinges not just on accuracy but also on consistency, swapping in a new agent with higher accuracy but low ICC can actually make your downstream application much less stable. 

The good news is that ICC is actionable. You can proactively boost an agent’s reliability, and therefore its ICC, through better prompting, improved tool design, or smarter system integration, even if the accuracy itself doesn’t change.

Practical Guidelines: How Should the AI Community Evolve?

Our research team recommends that the AI community always report accuracy alongside ICC and within-task variance, ensuring that results are both transparent and meaningful. 

To make evaluation practices more standardized and accessible, they also suggest adopting “Evaluation Cards”—much like nutrition labels for AI—that clearly document all key evaluation details, including the task, agent, number of trials, metrics used, and any known limitations. 
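The paper’s exact card format isn’t reproduced here, but as a sketch, an Evaluation Card could be as simple as a small structured record; all field names and values below are illustrative placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class EvaluationCard:
    """Illustrative sketch of an 'Evaluation Card'; the schema is an
    assumption for this post, not the one defined in the paper."""
    benchmark: str
    agent: str
    n_questions: int
    trials_per_question: int
    metrics: dict                      # e.g. accuracy, ICC, within-task variance
    known_limitations: list = field(default_factory=list)

card = EvaluationCard(
    benchmark="example-benchmark",
    agent="example-agent-v1",
    n_questions=500,                   # placeholder values throughout
    trials_per_question=16,
    metrics={"accuracy": 0.73, "icc": 0.66, "within_task_variance": 0.04},
    known_limitations=["English-only", "binary (right/wrong) grading"],
)
print(card)
```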

Finally, experiments should be designed with scientific rigor rather than as mere leaderboard competitions. This requires running multiple trials, applying proper statistical tests such as McNemar’s test for paired results, and being fully transparent about uncertainty throughout the evaluation process.
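For the paired-comparison step, a minimal sketch using McNemar’s test (via statsmodels) on synthetic per-question outcomes might look like this; the data and agent labels are placeholders.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Per-question outcomes (1 = correct, 0 = wrong) for two agents on the SAME
# questions -- synthetic placeholder data for illustration.
rng = np.random.default_rng(1)
agent_a = rng.integers(0, 2, size=200)
agent_b = rng.integers(0, 2, size=200)

# McNemar's test uses the 2x2 table of agreements/disagreements; only the
# discordant cells (exactly one agent correct) drive the test.
table = [
    [np.sum((agent_a == 1) & (agent_b == 1)), np.sum((agent_a == 1) & (agent_b == 0))],
    [np.sum((agent_a == 0) & (agent_b == 1)), np.sum((agent_a == 0) & (agent_b == 0))],
]

result = mcnemar(table, exact=True)   # exact binomial test on discordant pairs
print(f"p-value = {result.pvalue:.3f}")
```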

Forging the Path to More Trustworthy AI Agents

This research is a call to action: If you care about AI reliability, you need to look beyond single-run scores. ICC gives us the missing piece—visibility into agent consistency and the true meaning of AI performance gains. 

By making evaluation stability visible and standardizing how we report results, we can transform agentic AI from an opaque leaderboard race to a trustworthy, scientific discipline. That’s good for developers, users, and everyone relying on AI in real-world systems.

Because we’re looking at AI through a scientific lens, it’s also important to recognize the study’s limitations. The analysis focused on English-language benchmarks and a specific set of agent architectures (mainly OpenAI, Claude, Gemini, and select open-source models), so results may not generalize to other languages or agent types. 

Additionally, the ICC approach assumes binary scoring (right/wrong), which means tasks that allow for partial credit or more nuanced grading require further methodological development. 

More research is needed to extend these findings to non-English benchmarks, partial-credit tasks, and emerging agent architectures. By acknowledging these boundaries, the AI community can better chart the path toward robust, reliable evaluation practices that serve a truly global and diverse set of users.
