June 24, 2026

The AI Token Cost Problem Is a Design Flaw

Anmol Jawandha

Staff AI Engineer

TLDR: Uber burned through its entire 2026 AI coding budget in four months, and its COO then admitted he couldn't link rising token consumption to shipped features. This post covers the main sources of token waste and how to design token-efficient systems that move business outcomes.

The Uber Problem Is Your Problem Too

Uber burned through its entire 2026 AI coding budget in the first four months of the year, then capped every employee at $1,500 per month per tool. The scale explains how. Uber has about 5,000 engineers, and 84–95% of them used these tools each month. Before the caps, individual engineers were running up $500 to $2,000 a month in tokens. The CTO reportedly spent $1,200 in a single two-hour demo.

The numbers are striking, but the revealing part is a quote from COO Andrew Macdonald. Asked whether all that token usage was actually improving Uber's products, he said: "That link is not there yet, right?" Uber was tracking volume, not value.

That's the real problem: most token spend is unattributed, and most token waste is baked into the architecture. So even though Gartner expects inference costs to drop more than 90% by 2030 as hardware improves and specialized chips come online, cheaper tokens won't fix the waste.

Why Agentic Systems Burn Tokens

Before getting into the fixes, it helps to understand why agentic token consumption is structurally different from chat-based consumption.

In a stateless chat interaction, each request is independent. In an agentic loop, every subsequent turn re-sends the full conversation trajectory — tool calls, results, prior reasoning — as context. This means a bloated tool output is a tax you pay on every turn that follows. 
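To make the "tax on every turn" concrete, here is a small arithmetic sketch. The token counts are made-up illustrative numbers, and the model assumes every turn re-sends the full trajectory at full price (no prompt caching discount).

```python
def cumulative_input_tokens(base_context: int, per_turn_added: list[int]) -> int:
    """Total input tokens billed across a loop that re-sends the full trajectory.

    base_context: tokens in the system prompt and initial task.
    per_turn_added: tokens each turn appends to the trajectory
                    (tool result plus the model's own output).
    """
    total = 0
    context = base_context
    for added in per_turn_added:
        total += context   # this turn's request carries everything so far
        context += added   # the turn's output joins the trajectory
    return total


# Ten turns, each adding a lean 500-token tool result.
lean = cumulative_input_tokens(2_000, [500] * 10)

# Same run, but the first tool call returns a 10,000-token raw page.
bloated = cumulative_input_tokens(2_000, [10_000] + [500] * 9)

print(lean, bloated, bloated - lean)
```

Under these assumptions, the 9,500 extra tokens from one bloated result are paid again on each of the nine turns that follow it.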

Beyond the trajectory re-sending that the architecture itself requires, context bloat in practice comes from a few other sources:

  • Uncompressed payloads: returning full HTML, raw JSON blobs, or complete file contents when only specific excerpts are needed
  • Token-heavy noise: full URLs instead of compact link identifiers, verbose tool descriptions, repeated boilerplate in schemas
  • Over-engineered tool interfaces: complex multi-parameter schemas that force the model to spend tokens reasoning about which argument to use and, more importantly, cost the agent valuable iterations
  • Context rot: when a long-running agent loses track of a fact it established earlier, re-researches it, and spends additional tokens recovering from a context management failure that didn't have to happen

Three Engineering Principles That Reduce Token Spend

Principle 1: Simple Abstractions

The model should spend its budget on reasoning about your problem, not on navigating your tool interface. Every parameter you add to a tool schema is a decision surface for the model, and decision surfaces cost tokens — both in prompt construction and in the model's reasoning about how to call the tool correctly.

The design target is tools that are minimal and semantically familiar. Create a small set of well-named primitives with narrow inputs and predictable outputs. Push complexity down into your API layer rather than surfacing it as model-facing parameters.
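As a minimal sketch of pushing complexity down into the API layer: the model sees a one-parameter tool, and a wrapper fills in everything else. The tool name, the backend field names, and the default values are all hypothetical, and the schema uses a generic JSON-Schema-style shape whose exact field names vary by provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchDefaults:
    """Knobs kept in the API layer, invisible to the model (hypothetical values)."""
    max_results: int = 5
    language: str = "en"
    safe_search: bool = True


# Model-facing schema: one required string, nothing to deliberate over.
SEARCH_TOOL_SCHEMA = {
    "name": "search",
    "description": "Search the web and return short excerpts.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def build_backend_request(query: str, defaults: SearchDefaults = SearchDefaults()) -> dict:
    """Expand the model's single argument into the full backend request."""
    return {
        "q": query.strip(),
        "max_results": defaults.max_results,
        "language": defaults.language,
        "safe_search": defaults.safe_search,
    }
```

The design choice is that tuning `max_results` or `language` becomes a deployment decision rather than something the model reasons about on every call.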

Principle 2: Dense Payloads

Your tool shouldn't return everything it could return; it should return only what's relevant to the current task.

In practice, this means a few things:

  • Right-size your extractors by document type. For a structured data source (JSON API, database row), return a compact, field-selective response. For a web page, return a cleaned excerpt, not the full DOM. For a PDF, return the relevant section, not the full text; stripping headers, footers, and navigation is not enough if the entire body is still returned.
  • Encode links compactly. Instead of returning full URLs in tool outputs—which are typically 60–150 characters of noise—assign short identifiers and maintain a citation map in your system prompt or a dedicated context slot. Something like [src:3] instead of https://example.com/very/long/path/to/document.html saves tokens on every subsequent turn that references that source.
  • Weave citation markers into the content itself. If you're building a research or retrieval pipeline, grounding claims in-line with compact markers keeps the context coherent without requiring the model to re-fetch attribution from somewhere else in the trajectory.
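The link-compaction idea above can be sketched as a post-processing step on tool output. This is an illustrative helper, not a prescribed implementation: the function name, the regular expression, and the shape of the citation map are assumptions.

```python
import re

# Deliberately simple URL matcher; real pages may need a stricter pattern.
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


def compact_links(text: str, citations: dict[str, str]) -> str:
    """Replace each URL in text with a short [src:N] marker.

    citations maps marker id -> URL and is updated in place, so the same URL
    keeps the same id across tool calls.
    """
    url_to_id = {url: sid for sid, url in citations.items()}

    def replace(match: re.Match) -> str:
        raw = match.group(0)
        url = raw.rstrip(".,;:")       # keep sentence punctuation out of the URL
        trailing = raw[len(url):]
        if url not in url_to_id:
            sid = str(len(citations) + 1)
            citations[sid] = url
            url_to_id[url] = sid
        return f"[src:{url_to_id[url]}]{trailing}"

    return URL_PATTERN.sub(replace, text)


citations: dict[str, str] = {}
compacted = compact_links(
    "See https://example.com/very/long/path/to/document.html and "
    "https://example.com/other. Again: "
    "https://example.com/very/long/path/to/document.html",
    citations,
)
print(compacted)
```

The `citations` map is what you would render once into the system prompt or a dedicated context slot, so markers can be resolved back to URLs when the final answer is produced.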

Principle 3: Budget as a First-Class Signal

Agentic systems often treat token budget as an infrastructure concern—something the billing system handles after the fact. A better design makes the agent budget-aware from turn zero.

This means passing the remaining token budget (or a proxy like turn count) as an explicit field in the system prompt or as a special context variable the model can read, so it can weigh further exploration against finishing within budget.
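One way to surface the budget is a short status line regenerated each turn. The sketch below is an assumption-heavy illustration: the `[budget]` line format, the 50% and 20% thresholds, and the guidance wording are arbitrary choices, not recommendations from any provider.

```python
def budget_notice(tokens_used: int, token_budget: int, turn: int, max_turns: int) -> str:
    """Render remaining budget as a short line the model can read each turn."""
    remaining = max(token_budget - tokens_used, 0)
    turns_left = max(max_turns - turn, 0)
    fraction_left = remaining / token_budget if token_budget else 0.0

    # Thresholds are illustrative; tune them against your own task traces.
    if fraction_left < 0.2 or turns_left <= 2:
        guidance = "Budget is nearly exhausted: stop exploring and give your best answer."
    elif fraction_left < 0.5:
        guidance = "Budget is limited: prefer targeted tool calls."
    else:
        guidance = "Budget is healthy."

    return f"[budget] tokens_remaining={remaining} turns_remaining={turns_left}. {guidance}"


print(budget_notice(tokens_used=60_000, token_budget=100_000, turn=5, max_turns=20))
```

Appending this line to the latest user or tool message, rather than rewriting the system prompt, keeps a stable prompt prefix intact, which matters if you also rely on prompt caching.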

Other Strategies That Don't Require Architecture Changes

  • Model routing/tiering: Not every task needs a frontier model. Classification, summarization, format normalization, and structured extraction all perform well on smaller, faster, cheaper models. Routing simple agentic sub-tasks to a small model while reserving the large model for complex reasoning steps can cut costs substantially, often with little measurable quality loss.
  • Prompt caching: If your system prompt is large and stable — detailed instructions, tool schemas, reference documents — most providers offer caching at the prompt level. Cached tokens are typically billed at a fraction of the cost of fresh input tokens.
  • Semantic caching: At the application layer, if a question or sub-task is semantically equivalent to one answered recently, returning the cached result skips the inference call entirely. This requires a similarity search layer, so it is generally worth the investment only at large scale: in high-throughput applications where queries cluster, the savings are real.
  • Batch processing: For non-interactive workloads — nightly analysis jobs, bulk document processing, evaluation runs — batch API endpoints are typically cheaper than synchronous inference. The latency trade-off is usually irrelevant for offline tasks.
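To illustrate the semantic caching strategy from the list above, here is a toy sketch. It uses a bag-of-words vector as a stand-in for a real embedding model and a linear scan instead of a vector index; both are simplifying assumptions, as are the class name and the 0.9 threshold.

```python
import math
from collections import Counter
from typing import Callable, Optional


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: lowercase bag of words."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def get(self, query: str) -> Optional[str]:
        q = toy_embed(query)
        # Linear scan; a production system would use a vector index here.
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((toy_embed(query), answer))


def answer(query: str, cache: SemanticCache, call_model: Callable[[str], str]) -> str:
    """Return a cached answer when one is close enough, else call the model."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    result = call_model(query)
    cache.put(query, result)
    return result
```

The threshold is the main risk: set it too low and users get answers to a different question, so it is worth validating against real query pairs before trusting the cache.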

Token Cost Is Not the Only Metric

The number that matters is cost per task — or better, if you can measure it, cost per business outcome. A task is whatever the agent was actually for: a passing test suite, a merged PR, a resolved support ticket, a clean data extraction. Pick the unit that maps to the work, then measure how many tokens it took to get there. 

Cost per task is the easier one to instrument because it's narrow, close to the tools, and something engineers can move week to week: tokens per resolved ticket, per generated test, per document summarized, per successful tool call.
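A minimal sketch of that instrumentation, assuming you already log token counts and a success flag per agent run. The record fields, task names, and per-million-token prices below are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskRun:
    task_type: str       # e.g. "resolve_ticket"
    input_tokens: int
    output_tokens: int
    succeeded: bool


def cost_per_successful_task(
    runs: list[TaskRun],
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> dict[str, float]:
    """Return {task_type: total spend / number of successful runs}.

    Failed runs still count toward spend, so retries and dead ends raise the
    cost of each success instead of disappearing from the metric.
    """
    spend: dict[str, float] = defaultdict(float)
    successes: dict[str, int] = defaultdict(int)
    for run in runs:
        spend[run.task_type] += (
            run.input_tokens * input_price_per_mtok
            + run.output_tokens * output_price_per_mtok
        ) / 1_000_000
        successes[run.task_type] += int(run.succeeded)
    return {
        task: (spend[task] / successes[task] if successes[task] else float("inf"))
        for task in spend
    }


runs = [
    TaskRun("resolve_ticket", 400_000, 20_000, True),
    TaskRun("resolve_ticket", 600_000, 30_000, False),
    TaskRun("generate_test", 100_000, 5_000, False),
]
# Hypothetical prices in dollars per million tokens.
report = cost_per_successful_task(runs, input_price_per_mtok=2.0, output_price_per_mtok=10.0)
print(report)
```

Dividing by successes rather than attempts is the deliberate choice here: a task type that never succeeds shows up as infinitely expensive instead of looking cheap.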

Cost per business outcome is the harder (and more valuable) one to measure. Support deflection rate, time-to-merge, weekly impressions for a marketing team, site reliability for an infra team, and revenue per engineer are all hard to tie back to the upstream tokens consumed. The task metrics are your control surface, but the business metrics are what you're accountable for.

Once you measure outcomes, the finding is almost always two-sided: AI is creating real value, and token spend could be cut sharply without touching any of it. 
