June 24, 2026

The AI Token Cost Problem Is a Design Flaw

Anmol Jawandha

Staff AI Engineer

TLDR: Uber burned through its entire 2026 AI coding budget in four months — then its COO admitted he couldn't link rising token consumption to shipped features. This post goes over the different sources of token waste and suggests how to design token-efficient systems that ultimately move business outcomes.

The Uber Problem Is Your Problem Too

Uber burned through its entire 2026 AI coding budget in the first four months of the year, then capped every employee at $1,500 per month per tool. The scale explains how. Uber has about 5,000 engineers, and 84–95% of them used these tools each month. Before the caps, individual engineers were running up $500 to $2,000 a month in tokens. The CTO reportedly spent $1,200 in a single two-hour demo.

The numbers are striking, but the revealing part is a quote from COO Andrew Macdonald. Asked whether all that token usage was actually improving Uber's products, he said: "That link is not there yet, right?" Uber was tracking volume, not value.

That's the real problem: most token spend is unattributed, and most token waste is baked into the architecture. So even though Gartner expects inference costs to drop more than 90% by 2030 as hardware improves and specialized chips come online, cheaper tokens won't fix the waste.

Why Agentic Systems Burn Tokens

Before getting into the fixes, it helps to understand why agentic token consumption is structurally different from chat-based consumption.

In a stateless chat interaction, each request is independent. In an agentic loop, every subsequent turn re-sends the full conversation trajectory — tool calls, results, prior reasoning — as context. This means a bloated tool output is a tax you pay on every turn that follows. 
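To make that tax concrete, here is a back-of-the-envelope sketch. It assumes a simplified loop with no prompt caching and no context truncation, where each turn re-sends everything accumulated so far; the token counts are made-up illustrative numbers, not measurements.

```python
def cumulative_input_tokens(system_tokens: int, added_per_turn: list[int]) -> int:
    """Total input tokens billed when every turn re-sends all prior context."""
    total = 0
    context = system_tokens
    for added in added_per_turn:
        total += context   # this turn re-sends everything accumulated so far
        context += added   # this turn's tool output and reasoning join the context
    return total


# Ten turns, each adding 500 tokens of tool output and reasoning.
lean = cumulative_input_tokens(1_000, [500] * 10)

# Same run, except the third turn returns one bloated 8,000-token payload.
bloated = cumulative_input_tokens(1_000, [500, 500, 8_000] + [500] * 7)

print(lean, bloated, bloated - lean)
```

Under these assumptions, the single oversized payload adds 7,500 tokens once but is re-sent on each of the seven turns that follow, so the gap between the two runs is seven times the size of the bloat.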

Beyond the context re-sending that the agent loop itself requires, these are the other common sources of context bloat in practice:

  • Uncompressed payloads: returning full HTML, raw JSON blobs, or complete file contents when only specific excerpts are needed
  • Token-heavy noise: full URLs instead of compact link identifiers, verbose tool descriptions, repeated boilerplate in schemas
  • Over-engineered tool interfaces: complex multi-parameter schemas that force the model to spend tokens reasoning about which argument to use and, more importantly, cost the agent valuable iterations
  • Context rot: when a long-running agent loses track of a fact it established earlier, re-researches it, and spends additional tokens recovering from a context management failure that didn't have to happen

Three Engineering Principles That Reduce Token Spend

Principle 1: Simple Abstractions

The model should spend its budget on reasoning about your problem, not on navigating your tool interface. Every parameter you add to a tool schema is a decision surface for the model, and decision surfaces cost tokens — both in prompt construction and in the model's reasoning about how to call the tool correctly.

The design target is tools that are minimal and semantically familiar. Create a small set of well-named primitives with narrow inputs and predictable outputs. Push complexity down into your API layer rather than surfacing it as model-facing parameters.
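One way to picture this is a search tool whose model-facing schema has a single parameter, while the knobs live in the API layer. The sketch below is illustrative only: `BackendQuery`, its fields and defaults, and the `search` tool name are hypothetical, and the schema dict simply follows the common JSON Schema style for tool definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendQuery:
    """Full set of knobs a hypothetical search backend accepts."""
    query: str
    region: str = "us"
    freshness_days: int = 30
    max_results: int = 5
    safe_mode: bool = True


# Model-facing schema: one required string, nothing else to reason about.
SEARCH_TOOL_SCHEMA = {
    "name": "search",
    "description": "Search the web and return short excerpts.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def build_backend_query(tool_args: dict) -> BackendQuery:
    """Translate the narrow model-facing call into the full backend request."""
    query = tool_args["query"].strip()
    if not query:
        raise ValueError("query must be non-empty")
    # Region, freshness, and result count are decided here, not by the model.
    return BackendQuery(query=query)


request = build_backend_query({"query": "  agent token budgets "})
print(request)
```

The model only ever sees `query`. If a default needs to change, it changes in one place in the API layer instead of in every prompt.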

Principle 2: Dense Payloads

Your tool shouldn't return everything it could return; it should return only what's relevant to the current task.

In practice, this means a few things:

  • Right-size your extractors by document type. A structured data source (JSON API, database row) should return a compact, field-selective response. A web page should return a cleaned excerpt, not the full DOM. A PDF should return the relevant section, not the full text, even with headers, footers, and navigation stripped.
  • Encode links compactly. Instead of returning full URLs in tool outputs—which are typically 60–150 characters of noise—assign short identifiers and maintain a citation map in your system prompt or a dedicated context slot. Something like [src:3] instead of https://example.com/very/long/path/to/document.html saves tokens on every subsequent turn that references that source.
  • Weave citation markers into the content itself. If you're building a research or retrieval pipeline, grounding claims in-line with compact markers keeps the context coherent without requiring the model to re-fetch attribution from somewhere else in the trajectory.
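The compact link encoding described above can be sketched in a few lines. This is a minimal illustration, not a production URL parser: the regex is deliberately simple, and the `compact_links` helper and the shape of the citation map are assumptions made for the example.

```python
import re

# Deliberately simple URL matcher; real tool output may need something stricter.
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


def compact_links(text: str, citation_map: dict[str, str]) -> str:
    """Replace each URL with a short [src:N] marker, reusing ids for repeat URLs.

    citation_map is updated in place and maps marker -> original URL.
    """
    ids_by_url = {url: marker for marker, url in citation_map.items()}

    def replace(match: re.Match) -> str:
        raw = match.group(0)
        url = raw.rstrip(".,;:")      # keep sentence punctuation out of the URL
        trailing = raw[len(url):]
        if url not in ids_by_url:
            marker = f"src:{len(ids_by_url) + 1}"
            ids_by_url[url] = marker
            citation_map[marker] = url
        return f"[{ids_by_url[url]}]{trailing}"

    return URL_PATTERN.sub(replace, text)


citations: dict[str, str] = {}
compacted = compact_links(
    "See https://example.com/very/long/path/to/document.html and "
    "https://example.com/other. Again: "
    "https://example.com/very/long/path/to/document.html",
    citations,
)
print(compacted)
print(citations)
```

The citation map is kept once, outside the tool output, so the long URLs never ride along in the trajectory on later turns.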

Principle 3: Budget as a First-Class Signal

Agentic systems often treat token budget as an infrastructure concern—something the billing system handles after the fact. A better design makes the agent budget-aware from turn zero.

This means passing the remaining token budget (or a proxy like turn count) as an explicit field in the system prompt or as a special context variable the model can read. 
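As a rough sketch of what that field might look like, the helper below renders a one-line budget status that could be appended to the system prompt or injected before each turn. The format, the thresholds, and the guidance wording are all assumptions for illustration; tune them against your own agent's behavior.

```python
def budget_status_line(tokens_used: int, token_budget: int, turns_left: int) -> str:
    """Render the remaining budget as a short line the model can read."""
    remaining = max(token_budget - tokens_used, 0)
    fraction = remaining / token_budget if token_budget > 0 else 0.0
    # Thresholds are arbitrary illustrative choices, not recommendations.
    if fraction > 0.5:
        guidance = "explore freely"
    elif fraction > 0.2:
        guidance = "prioritize the most promising lead"
    else:
        guidance = "stop gathering and write the final answer"
    return (
        f"[budget] tokens_remaining={remaining} ({fraction:.0%}) "
        f"turns_remaining={turns_left} guidance: {guidance}"
    )


early = budget_status_line(tokens_used=30_000, token_budget=100_000, turns_left=8)
late = budget_status_line(tokens_used=90_000, token_budget=100_000, turns_left=1)
print(early)
print(late)
```

Keeping the signal to one short line matters: a budget notice that itself consumes hundreds of tokens per turn would defeat the purpose.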

Other Strategies That Don't Require Architecture Changes

  • Model routing/tiering: Not every task needs a frontier model. Classification, summarization, format normalization, and structured extraction all perform well on smaller, faster, cheaper models. Routing simple agentic sub-tasks to a small model while reserving the large model for complex reasoning steps can cut costs dramatically without measurable quality loss on most benchmarks.
  • Prompt caching: If your system prompt is large and stable — detailed instructions, tool schemas, reference documents — most providers offer caching at the prompt level. Cached tokens are typically billed at a fraction of the cost of fresh input tokens.
  • Semantic caching: At the application layer, if a question or sub-task is semantically identical to one recently answered, returning the cached result skips the inference call entirely. This requires a similarity search layer, but for high-throughput applications where queries cluster, the savings are real, and at large scale the layer is generally worth the investment.
  • Batch processing: For non-interactive workloads — nightly analysis jobs, bulk document processing, evaluation runs — batch API endpoints are typically cheaper than synchronous inference. The latency trade-off is usually irrelevant for offline tasks.
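Of the strategies above, model routing is the simplest to sketch. The version below routes on a task-type label that the calling code already knows. The task labels and the model names `small-model` and `frontier-model` are placeholders, not real model identifiers, and a real router would likely need a fallback path for when the small model's output fails validation.

```python
# Task types the list above names as good fits for smaller models.
SMALL_MODEL_TASKS = {
    "classification",
    "summarization",
    "format_normalization",
    "structured_extraction",
}


def pick_model(
    task_type: str,
    small_model: str = "small-model",        # placeholder name
    large_model: str = "frontier-model",     # placeholder name
) -> str:
    """Send known-simple sub-tasks to the small model, everything else to the large one."""
    return small_model if task_type in SMALL_MODEL_TASKS else large_model


print(pick_model("summarization"))
print(pick_model("multi_step_planning"))
```

Defaulting unknown task types to the large model is a deliberate choice here: a misrouted hard task usually costs more in retries than the tokens saved.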

Token Cost Is Not the Only Metric

The number that matters is cost per task — or better, if you can measure it, cost per business outcome. A task is whatever the agent was actually for: a passing test suite, a merged PR, a resolved support ticket, a clean data extraction. Pick the unit that maps to the work, then measure how many tokens it took to get there. 

Cost per task is the easy one to instrument because it's narrow, close to the tools, and it's the number engineers can move week to week — tokens per resolved ticket, per generated test, per document summarized, per successful tool call. 
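A minimal sketch of that instrumentation, assuming each model call is logged as an event tagged with the task it served. The event fields (`task_type`, `task_id`, `tokens`, `succeeded`) are a hypothetical logging shape, not any particular provider's format.

```python
from collections import defaultdict


def tokens_per_completed_task(events: list[dict]) -> dict[str, float]:
    """Average tokens spent per successfully completed task, by task type.

    Tokens from failed attempts are still counted, because they were still paid for.
    Task types with no completed tasks are omitted rather than divided by zero.
    """
    tokens: dict[str, int] = defaultdict(int)
    completed: dict[str, set] = defaultdict(set)
    for event in events:
        tokens[event["task_type"]] += event["tokens"]
        if event["succeeded"]:
            completed[event["task_type"]].add(event["task_id"])
    return {
        task_type: tokens[task_type] / len(completed[task_type])
        for task_type in tokens
        if completed[task_type]
    }


events = [
    {"task_type": "ticket", "task_id": "T1", "tokens": 4_000, "succeeded": False},
    {"task_type": "ticket", "task_id": "T1", "tokens": 6_000, "succeeded": True},
    {"task_type": "ticket", "task_id": "T2", "tokens": 2_000, "succeeded": True},
    {"task_type": "test", "task_id": "G1", "tokens": 3_000, "succeeded": False},
]
report = tokens_per_completed_task(events)
print(report)
```

The important design choice is the denominator: dividing by completed tasks rather than by calls means retries and dead ends show up as a higher cost per task instead of disappearing into the average.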

Cost per business outcome is the harder (and more valuable) one to measure. Support deflection rate, time-to-merge, weekly impressions for a marketing team, site reliability for an infra team, and revenue per engineer are all hard to associate with the upstream tokens consumed. The task metrics are your control surface, but the business metrics are what you're accountable for.

Once you measure outcomes, the finding is almost always two-sided: AI is creating real value, and token spend could be cut sharply without touching any of it. 
