March 2, 2026

Effective AI Skills Are Like Seeds

Edward Irby

Senior Software Engineer

TL;DR: Early AI coding skills relied on detailed instructions, but this approach proved brittle and hard to verify in real-world scenarios. The shift to a "seed" model—providing working reference code and meaningful tests—enabled robust, adaptable integrations. Skills are now verified via real API calls, automated CI, and regression detection.

When we first built AI coding skills for Claude Code, we thought about them the way most teams do: as detailed instruction manuals. Write precise steps, enumerate every option, cover every edge case. The more thorough the skill, the better the agent would perform.

We were wrong—or at least, incomplete.

A thorough skill that tells an agent exactly what to do is brittle in a specific way: it's hard to verify whether the agent actually did it correctly. You can read the skill. You can read the generated code. But you can't easily tell, from the outside, whether the integration actually works against real APIs with real data. You're trusting the agent's judgment without any way to check it.

Then two "aha moments" changed how we think about this.

Two "Aha Moments"

The first came from our own work. We had already used @plaited/agent-eval-harness—an open-source tool I created for capturing agent trajectories and scoring them—to evaluate web search agents. It occurred to us: what if we used the same harness to evaluate the skills themselves? Not "did the agent follow the instructions," but "did the thing the agent built actually work?"

The second came from Zairah Mustahsan, Staff Product Data Scientist, who shared a post by Shane Butler that crystallized the idea in a single phrase:

"You don't send someone the code. You send them the seed. They plant it in their own soil, answer 7 questions about their world, and a completely different organism grows."

Shane had collapsed a 50,000-line AI data analyst system into a 1,400-line Markdown file—a genome. Drop it into an empty repo, answer a few questions about your data and context, walk away. Hours later, a fully customized AI analyst has grown itself from scratch, adapted to your environment.

That's the mental model we needed. A skill isn't an instruction manual. It's a seed.

What Makes a Skill a Seed

A seed has two properties that an instruction manual doesn't:

  1. It contains everything needed to grow—not just what to do, but what correct growth looks like.
  2. It can be verified—you can tell whether the seed sprouted correctly by examining what grew.

For AI coding skills, this translates directly:

Instruction Manual                         | Seed
-------------------------------------------|-----------------------------------------------------
Tells agent what code to write             | Shows agent canonical reference code
Describes what correct output looks like   | Includes tests that verify output against real APIs
Evaluated by reading                       | Evaluated by running
Static                                     | Verifiable

The key addition is the assets/ directory. Instead of describing what correct code looks like in prose, we ship the actual working code. The agent reads the assets, understands the pattern, and generates a working integration in the developer's specific context. Then tests prove it sprouted correctly.

The Anatomy of a Seed Skill

Here's what our integration skills look like after the refactor:

Shell
skills/ydc-ai-sdk-integration/
├── SKILL.md                  ← Instructions + test templates
└── assets/
    ├── path-a-generate.ts    ← Canonical Path A integration
    ├── path-b-stream.ts      ← Canonical Path B integration
    └── integration.spec.ts   ← Test file agents must replicate

The SKILL.md body links directly to the asset files so the agent can read them:

Markdown
## Reference Assets

- [assets/path-a-generate.ts](assets/path-a-generate.ts) — basic integration
- [assets/path-b-stream.ts](assets/path-b-stream.ts) — streaming integration
- [assets/integration.spec.ts](assets/integration.spec.ts) — test structure

And the "Generate Integration Tests" section in each skill now tells the agent explicitly: write tests that call real APIs, use keyword assertions on the output, and use "Search the web for..." to force tool invocation rather than letting the model answer from memory.

The Test Assertion Problem

This last point—test assertion quality—turned out to matter more than we expected.

Our original tests looked like this:

TypeScript
const result = await prompt.send('What is TypeScript?')
expect(result.content.length).toBeGreaterThan(50)

This test passes even if the MCP tool was never called. A language model can answer "What is TypeScript?" from training data. The test asserts on content length, not on whether a real web search happened.

The fix has two parts:

1. Force tool invocation with an explicit instruction:

TypeScript
const result = await prompt.send(
  'Search the web for the three branches of the US government'
)

The phrase "Search the web for..." makes tool use an instruction, not an inference. The model can't easily answer by ignoring the instruction.

2. Assert on semantic content, not length:

TypeScript
const text = result.content.toLowerCase()
expect(text).toContain('legislative')
expect(text).toContain('executive')
expect(text).toContain('judicial')

These keywords will appear in any real response about the U.S. government. A hallucinated or tool-skipped response is far less likely to hit all three. The test now has genuine signal.
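Both fixes generalize into a small reusable assertion. The sketch below is ours for illustration—`expectKeywords` is an invented helper, not a function from the skill's assets:

```typescript
// Sketch: a reusable semantic assertion. `expectKeywords` is our
// invented helper, not part of the skill's assets.
function expectKeywords(content: string, keywords: string[]): void {
  const text = content.toLowerCase()
  const missing = keywords.filter((k) => !text.includes(k.toLowerCase()))
  if (missing.length > 0) {
    throw new Error(`Response missing expected keywords: ${missing.join(', ')}`)
  }
}

// In a test body this replaces the three separate toContain calls:
// expectKeywords(result.content, ['legislative', 'executive', 'judicial'])
```

The helper fails loudly with the full list of missing keywords, which makes tool-skipped responses easy to diagnose from test output alone.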

The Eval Harness

With seed skills and meaningful tests, we could build an eval loop. The harness works like this:

Plain Text
prompts.jsonl → Claude Code agent → generated code + tests
                        ↓
              bun test / uv run pytest
                        ↓
         LLM judge (Haiku) → score 0.0–1.0

Each entry in prompts.jsonl is two turns:

  1. "Using the [skill], create a basic integration and write tests that prove it calls the real API."
  2. "Extend with You.com MCP server and update tests to prove MCP works with a live query."

That's it. No "add streaming," no "handle errors," no "support custom env vars." Two turns, two verifiable outcomes.
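Concretely, an entry might take a shape like the following. The field names (`id`, `turns`) are assumptions for illustration, not the harness's documented schema:

```typescript
// Hypothetical shape of one prompts.jsonl entry. Field names are
// illustrative assumptions, not @plaited/agent-eval-harness's schema.
type PromptEntry = {
  id: string
  turns: [string, string] // exactly two turns, as described above
}

const entry: PromptEntry = {
  id: 'ydc-ai-sdk-integration',
  turns: [
    'Using the ydc-ai-sdk-integration skill, create a basic integration and write tests that prove it calls the real API.',
    'Extend with the You.com MCP server and update tests to prove MCP works with a live query.',
  ],
}
```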

The eval question is simple: did the seed sprout a working integration with tests that prove it works?

The grader runs the tests against real APIs (using actual ANTHROPIC_API_KEY and YDC_API_KEY), then sends the test output to Haiku for scoring.

One lesson we learned the hard way: the test results are ground truth, not the LLM judge. Haiku hallucinated that @youdotcom-oss/teams-anthropic was a fabricated package — despite tests passing with 3+ second network timings that proved real API calls happened. We fixed the judge prompt to explicitly anchor on test evidence:

"If tests passed (exit code 0), the code WORKS with real packages and real endpoints. Do not second-guess whether packages exist or endpoints are real; the test output proves they do."

The lesson: LLM judges are useful for qualitative assessment, but they should never be able to override empirical test results.

The judge's fallback behavior taught a related lesson. When Haiku's API call throws an error mid-eval—transient network issue, rate limit, whatever—the grader originally returned a score of 0.5 rather than surfacing the test result it already had. A 0.5 falls below the 0.65 pass threshold, so a job with all tests passing would report as a failure.

The fix was simple: when the LLM judge fails, trust the test exit code.
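In code, the fix might look like this sketch; `gradeWithFallback` and its signature are our invention for illustration, not the harness's actual API:

```typescript
// Sketch of the judge-fallback fix. Names and types are illustrative
// assumptions, not @plaited/agent-eval-harness's real API.
type Grade = { score: number; source: 'judge' | 'tests' }

function gradeWithFallback(testExitCode: number, judgeScore: number | null): Grade {
  // Judge errored mid-eval (judgeScore === null): trust the empirical
  // test result instead of returning a misleading neutral 0.5.
  if (judgeScore === null) {
    return { score: testExitCode === 0 ? 1.0 : 0.0, source: 'tests' }
  }
  return { score: judgeScore, source: 'judge' }
}
```

With this shape, a run whose tests exit 0 can never be dragged below the 0.65 pass threshold by a transient judge failure.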

The judge prompt now scores on two dimensions rather than one:

  • The first is the same as before: did the integration work? (Test exit code is ground truth.)
  • The second is new: are the tests meaningful?

The judge reads the test source from generatedFiles and looks for keyword assertions over toBeDefined(), "Search the web for..." queries that force tool invocation, and coverage of both the basic integration and the MCP extension. A score of 0.92–1.0 requires both dimensions; tests that pass with only length checks land in the 0.85–0.91 band. This creates useful differentiation—the eval can flag a skill whose tests pass but whose assertions are weak enough that you can't trust them.
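The banding logic reads roughly like this; the function is our sketch of the rubric, not the actual judge prompt:

```typescript
// Illustrative sketch of the two-dimension rubric described above.
// Function and field names are ours, not the judge prompt's wording.
type Band = { min: number; max: number }

function scoreBand(testsPassed: boolean, assertionsMeaningful: boolean): Band | null {
  if (!testsPassed) return null // test exit code is ground truth; no band applies
  return assertionsMeaningful
    ? { min: 0.92, max: 1.0 }  // works AND tests are meaningful
    : { min: 0.85, max: 0.91 } // works, but assertions are weak (length checks)
}
```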

Results

After the refactor, all seven integration skills passed with scores ranging from 0.92–0.96:

Skill                                            Score   Tests
teams-anthropic-integration                      0.92    4/4
ydc-ai-sdk-integration                           0.95    2/2
ydc-claude-agent-sdk-integration (Python)        0.93    2/2
ydc-claude-agent-sdk-integration (TypeScript)    0.94    2/2
ydc-crewai-mcp-integration                       0.93    4/4
ydc-openai-agent-sdk-integration (Python)        0.95    2/2
ydc-openai-agent-sdk-integration (TypeScript)    0.96    2/2

Before the refactor: 5/7 passing with mixed scores, and two skills at 0.35 due to a missing pyproject.toml (Python skills need one for uv run pytest to find the test environment). The seed pattern caught this: the asset now includes a pyproject.toml template, so agents always generate one.

Continuous Integration (CI): Skills That Degrade Get Caught

The last piece is automation. Skills rot. APIs change, packages update, new SDK versions break patterns. We added a GitHub Actions workflow that:

  • On PR: detects which skills changed, runs only those evals, and blocks merge if any score falls below 0.65
  • On push to main: same detection and scoring
  • Weekly schedule: runs all skills, opens a GitHub issue with eval-failure label if anything regressed
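The per-PR change detection can be sketched as a pure function over `git diff --name-only` output; the `skills/<skill-name>/...` layout is assumed from the directory tree earlier in the post:

```typescript
// Sketch: map changed file paths to the set of skills whose evals
// should run. Assumes the skills/<skill-name>/... layout shown above.
function changedSkills(changedFiles: string[]): string[] {
  const names = changedFiles
    .filter((path) => path.startsWith('skills/'))
    .map((path) => path.split('/')[1])
  return [...new Set(names)].sort()
}
```

In a workflow, the input would come from diffing the PR branch against main; files outside skills/ are ignored, so documentation-only PRs skip the eval step entirely.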

The weekly run is the safety net. If an upstream package changes its API and breaks a skill, we find out on Monday morning rather than when a developer tries to use it.

The CI run for our first PR demonstrated this immediately.

Three skills failed: both OpenAI Agents SDK skills and the crewAI skill. The root cause in all three cases was the same missing ingredient: OPENAI_API_KEY wasn't passed to the eval step's env block. The OpenAI skills failed with a clean assertion error. The crewAI skill failed less obviously—without OPENAI_API_KEY, crewAI couldn't initialize its default LLM and errored before the MCP connection was even attempted. The fix was a one-line addition to the workflow. CI surfaced the gap within minutes of the PR opening.

Applying This to Your Own Skills

If you're building skills for your engineering org—whether for Claude Code, Cursor, or another AI coding tool—the seed pattern applies broadly:

1. Ship working reference code in assets/

Don't describe what correct code looks like. Show it. The agent reads the asset, not your description of the asset.

2. Write tests that call real services

Mocks test that your mock returns what you told it to return. Real API calls test that the integration actually works. Use keyword assertions, not length checks.

3. Force tool use explicitly in test queries

"Search the web for X" is not the same as "What is X?". The former makes tool invocation part of the instruction; the latter lets the model skip the tool entirely.

4. Evaluate outcomes, not instructions

The eval question is: did it work? Not: did the agent follow the instructions correctly? Tests that call real APIs are the ground truth. LLM judges are useful for qualitative color, not for overriding empirical evidence.

5. Automate regression detection

Skills that work today will break next month. A weekly CI run costs little and catches regressions before your users do.

Open Source

The eval harness, @plaited/agent-eval-harness, is open source. The skill patterns from this post are visible in our agent-skills repository. We hope other teams building skills for their organizations find both useful.

The seed is the unit of distribution. What grows from it depends on the soil—the developer's context, their codebase, their APIs. Your job is to make the seed viable. The tests tell you whether it is.

Edward Irby is a Developer Experience engineer at You.com. The @plaited/agent-eval-harness is part of his open-source Plaited project.
