---
title: How to Evaluate You.com Search API
description: "A practical guide to benchmarking You.com's Search API: methodology, configurations, datasets, and real performance tradeoffs."
og:title: How to Evaluate You.com Search API | Benchmarking Guide
og:description: "A practical guide to benchmarking You.com's Search API: methodology, datasets (SimpleQA, FRAMES, FreshQA), latency profiling, and a production readiness checklist."
---

A practical guide to benchmarking You.com's Search API: methodology, datasets, and real performance tradeoffs.

<Info>
  New to the Search API? Start with the [Search API Overview](/docs/search/overview) for a full parameter reference and feature walkthrough, then come back here when you're ready to run a structured evaluation.
</Info>

## Why This Guide Exists

Most developer docs treat evaluation like checking boxes. This guide treats it like shipping production code: you need real benchmarks, honest tradeoffs, and configurations that actually work.

We'll cover:

* **Retrieval Quality** — Does it actually find what you need?
* **Latency** — Fast enough for your users?
* **Freshness** — Can it handle "what happened today?"
* **Cost** — What's your burn rate per query?
* **Agent Performance** — Does it work in multi-step reasoning workflows?

***

<Info>
  **Want help running your eval?** Our team can design and run custom benchmarks for your use case. [Talk to us](https://you.com/evaluations-as-a-service)
</Info>

## The Golden Rule: Start Simple, Stay Fair

**TL;DR: Use default settings. Don't over-engineer your first eval.**

Most failed evaluations have one thing in common: people add too many parameters too early.

### Recommended Starting Point

```python
from youdotcom import You

with You("api_key") as you:
    result = you.search.unified(
        query=query,
        count=10
    )

# That's it. No date filters. No domain whitelists. Just search.
```

### When to Add Complexity

Add parameters ONLY when:

1. Your evaluation explicitly tests that feature (e.g., freshness requires the `freshness` parameter)
2. You've already run baseline evals and know what you're optimizing for
3. The parameter reflects actual production usage, not hypothetical edge cases

**Anti-pattern**: "Let me add every possible parameter to make this perfect"

**Better approach**: "Let me run this with defaults, measure performance, then iterate"

***

For a full reference of available parameters and their defaults, see the [Search API Overview](/docs/search/overview#all-optional-parameters-you-can-control).

***

## Latency: Compare Apples to Apples

**Critical insight: Never compare APIs with wildly different latency profiles.**

A 200ms API and a 3000ms API serve different use cases. Comparing them is like comparing a bicycle to a freight train.

### Latency Buckets

| Latency Class             | Use Cases                                       | Compare Within                     |
| ------------------------- | ----------------------------------------------- | ---------------------------------- |
| **Ultra-fast (\< 200ms)** | Autocomplete, real-time voice agents            | Other sub-200ms systems            |
| **Fast (200-800ms)**      | Chatbots, user-facing QA                        | Similar mid-latency APIs           |
| **Deep (>1000ms)**        | Research, multi-hop reasoning, batch processing | Other comprehensive search systems |
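To place a system in a bucket before comparing, measure percentiles over a batch of timed queries rather than eyeballing single requests. A minimal stdlib sketch — the bucket boundaries mirror the table above, and `percentile` and `latency_class` are illustrative helpers, not SDK functions:

```python
# Classify a system by its p50 latency over a batch of timed queries.
# Latencies are in milliseconds; buckets follow the table above.

def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[index]

def latency_class(latencies_ms):
    p50 = percentile(latencies_ms, 50)
    if p50 < 200:
        return "ultra-fast"
    if p50 <= 800:
        return "fast"
    return "deep"
```

Report p95 alongside p50 — two systems with the same median can have very different tails.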

### Fair Comparison Framework

```python
# Good: Comparing within the same latency class
compare_systems([
    "You.com (350ms p50)",
    "Competitor A (380ms p50)",
    "Competitor B (290ms p50)"
])

# Bad: Comparing across latency classes
compare_systems([
    "You.com (350ms p50)",
    "Deep Research API (2800ms p50)"  # Meaningless comparison
])
```

***

## Evaluation Workflow: 4 Steps That Actually Work

<Steps toc={true}>
  ### Define What You're Testing

  Don't start with "let's evaluate everything." Start with:

  * What capability matters? (speed? accuracy? freshness?)
  * What latency can you tolerate?
  * Single-step retrieval or multi-step reasoning?

  **Example scope**: "We need 90%+ accuracy on customer support questions with \< 500ms latency"

  ### Pick Your Dataset

  | Dataset            | Tests                    | Notes                                                  |
  | ------------------ | ------------------------ | ------------------------------------------------------ |
  | SimpleQA           | Fast factual QA          | Good baseline                                          |
  | FRAMES             | Multi-step reasoning     | Agentic workflows                                      |
  | FreshQA            | Time-sensitive queries   | Use with `freshness` param                             |
  | Custom (your data) | Domain-specific accuracy | Start [here](https://you.com/evaluations-as-a-service) |

  **Pro tip**: Start with public benchmarks, but your production queries are the real test.

  **Need help building a custom dataset?** [We can help](https://you.com/evaluations-as-a-service)

  ### Run Your Eval

  ```python
  from youdotcom import You
  import time

  # `load_dataset`, `llm`, and `evaluate_answer` are your own helpers:
  # a dataset loader, a synthesis model client, and a grading function.
  def run_eval(dataset_path, config):
      results = []

      with You("api_key") as you:
          for item in load_dataset(dataset_path):
              query = item['question']
              expected = item['answer']

              # Step 1: Retrieve
              start = time.time()
              search_results = you.search.unified(
                  query=query,
                  count=config.get('count', 10),
                  livecrawl=config.get('livecrawl'),
                  freshness=config.get('freshness')
              )
              latency = (time.time() - start) * 1000

              # Step 2: Synthesize answer using your LLM
              snippets = [r.snippets[0] for r in search_results.results.web if r.snippets]
              context = "\n".join(snippets)
              answer = llm.generate(
                  f"Answer using only this context:\n{context}\n\n"
                  f"Question: {query}\nAnswer:"
              )

              # Step 3: Grade
              grade = evaluate_answer(expected, answer)

              results.append({
                  'correct': grade == 'correct',
                  'latency_ms': latency
              })

      # Calculate metrics
      accuracy = sum(r['correct'] for r in results) / len(results)
      p50_latency = sorted([r['latency_ms'] for r in results])[len(results)//2]

      return {
          'accuracy': f"{accuracy:.1%}",
          'p50_latency': f"{p50_latency:.0f}ms"
      }
  ```

  ### Analyze & Iterate

  Look at:

  * **Accuracy vs latency tradeoff** - Can you get 95% accuracy at 300ms?
  * **Failure modes** - Which queries fail? Is there a pattern?
  * **Cost** - What's your \$/1000 queries?

  Then iterate:

  * Add `livecrawl` if snippets aren't giving enough context
  * Add `freshness` if failures are due to stale content
  * Compare against competitors in the same latency class
</Steps>
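The failure-mode check in the iterate step can start as a simple grouping pass over the queries that graded incorrect. A stdlib sketch that buckets failures by a rough query-type heuristic — the categories and keywords here are illustrative assumptions, not an official taxonomy:

```python
from collections import Counter

def failure_patterns(failures):
    """Bucket failed queries by a rough type heuristic to spot patterns."""
    def classify(query):
        q = query.lower()
        if any(w in q for w in ("today", "latest", "current", "now")):
            return "time-sensitive"
        if any(w in q for w in ("why", "how", "compare")):
            return "multi-step"
        return "factual"

    return Counter(classify(f["question"]) for f in failures)
```

If "time-sensitive" dominates, that points at the `freshness` parameter; a pile of "multi-step" failures suggests testing in an agentic loop instead of single-shot retrieval.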

***

## Tool Calling for Agents

When evaluating You.com in agentic workflows, keep the tool definition minimal.

<Info>
  **Open-source evaluation framework**: Check out [Agentic Web Search Playoffs](https://github.com/youdotcom-oss/web-search-agent-evals) for a ready-to-use benchmark comparing web search providers in agentic contexts.
</Info>

```python
search_tool = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web using You.com. Returns relevant snippets and URLs.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query"
                }
            },
            "required": ["query"]
        }
    }
}
```

**Note**: Don't expose `freshness`, `livecrawl`, or other parameters to the agent unless necessary. Let the agent focus on formulating good queries.

### Implementation

```python
import json

from youdotcom import You

def handle_tool_call(tool_call):
    query = tool_call.arguments["query"]

    with You("api_key") as you:
        results = you.search.unified(query=query, count=10)

    # Format for agent consumption
    formatted = []
    for r in results.results.web[:5]:
        formatted.append({
            "title": r.title,
            "snippet": r.snippets[0] if r.snippets else r.description,
            "url": r.url
        })

    return json.dumps(formatted)
```

***

## Common Mistakes to Avoid

### 1. Over-Filtering Too Early

**Don't:**

```python
result = you.search.unified(
    query=query,
    freshness="week",
    country="US",
    language="EN",
    safesearch="strict"
)
```

**Do:**

```python
result = you.search.unified(query=query, count=10)  # Start simple
```

### 2. Ignoring Your Actual Queries

**Don't just run:** Public benchmarks

**Also run:** Your actual user queries from production logs
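One way to act on this: sample your production logs into an eval file. A sketch assuming JSON-lines logs with a `query` field — `build_dataset` is a hypothetical helper, and the expected answers still need to be filled in by hand:

```python
import json
import random

def build_dataset(log_lines, sample_size=500, seed=42):
    """Sample production queries into eval items; answers are filled in by hand."""
    queries = [json.loads(line)["query"] for line in log_lines]
    random.seed(seed)  # fixed seed so the sample is reproducible
    sample = random.sample(queries, min(sample_size, len(queries)))
    return [{"question": q, "answer": None} for q in sample]
```

The fixed seed matters: you want the same sample when you re-run the eval after a config change.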

### 3. Not Measuring What Users Care About

**Don't only measure:** Technical accuracy

**Also measure:** Click-through rate, task completion, reformulation rate
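Reformulation rate — how often a user immediately re-queries in the same session — can be approximated from logs. A sketch using a crude same-session proxy; the `(session_id, query)` event shape is an assumption about your logging, not a You.com API structure:

```python
def reformulation_rate(events):
    """Crude proxy: share of queries that follow another query in the same session.

    `events` is a time-ordered list of (session_id, query) tuples.
    """
    if not events:
        return 0.0
    reformulations = sum(
        1
        for (prev_session, _), (cur_session, _) in zip(events, events[1:])
        if prev_session == cur_session
    )
    return reformulations / len(events)
```

A rising reformulation rate after a config change is a signal that users aren't finding what they need on the first try, even if offline accuracy looks flat.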

### 4. Testing in Isolation

**Don't test:** Search API alone

**Test:** Full workflow (search -> synthesis -> grading) with your actual LLM and prompts

***

## Debugging Performance Issues

### If Accuracy is Low (\< 85%)

1. Are you requesting enough results? Try `count=15`

2. Enable livecrawl for full page content:

   ```python
   results = you.search.unified(
       query=query,
       count=15,
       livecrawl="web",
       livecrawl_formats="markdown"
   )
   ```

3. Is your synthesis prompt good? Test with GPT-4

4. Is your grading fair? Manually review a sample
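For that manual review, draw a reproducible sample rather than eyeballing the top of the results file. A tiny sketch — `review_sample` is a hypothetical helper:

```python
import random

def review_sample(results, k=25, seed=0):
    """Draw a reproducible sample of graded results for manual inspection."""
    random.seed(seed)  # fixed seed: the same sample on every re-run
    return random.sample(results, min(k, len(results)))
```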

### If Results are Stale

```python
# Force fresh results
results = you.search.unified(
    query=query,
    count=10,
    freshness="day"  # or "week", "month"
)
```

***

<Info>
  **Still stuck?** Our team has run hundreds of search evals. [Get hands-on help](https://you.com/evaluations-as-a-service)
</Info>

## Production Checklist

### 1. Run Comparative Benchmarks

```python
configs = [
    {'count': 5},
    {'count': 10},
    {'count': 10, 'livecrawl': 'web', 'livecrawl_formats': 'markdown'}
]

for config in configs:
    results = run_eval('your_dataset.json', config)
    print(f"{config}: {results}")
```

### 2. Set Up Monitoring

```python
# What to log per request
log_entry = {
    'query': query,
    'latency_ms': latency,
    'num_results_returned': len(results.results.web),
    'used_livecrawl': bool(livecrawl),
    'freshness_filter': freshness,
    'timestamp': time.time()  # requires `import time`
}
```

### 3. Document Everything

```markdown
## Search Evaluation - 2025-01-26

**Dataset**: 500 customer support queries
**Config**: count=10, livecrawl=web
**Results**:
- Accuracy: 91.2%
- P50 Latency: 445ms
- P95 Latency: 892ms

**Decision**: Ship with livecrawl enabled - improves synthesis quality
```

***

## Getting Help

* **[Evaluations as a Service](https://you.com/evaluations-as-a-service)** - Custom benchmarks designed and run by our team
* **[Agentic Web Search Playoffs](https://github.com/youdotcom-oss/web-search-agent-evals)** - Open-source benchmark for comparing web search in agentic workflows
* [API Documentation](https://docs.you.com/)
* [Discord Community](https://discord.gg/you)
* Email: [developers@you.com](mailto:developers@you.com)

***

**Remember**: The best evaluation is the one you actually run. Start simple, measure what matters, and iterate.