AI 101
March 2, 2026

Effective AI Skills Are Like Seeds

Edward Irby

Senior Software Engineer

TLDR: Early AI coding skills relied on detailed instructions, but this approach proved brittle and hard to verify in real-world scenarios. The shift to a "seed" model—providing working reference code and meaningful tests—enabled robust, adaptable integrations. Skills are now verified via real API calls, automated CI, and regression detection.

When we first built AI coding skills for Claude Code, we thought about them the way most teams do: as detailed instruction manuals. Write precise steps, enumerate every option, cover every edge case. The more thorough the skill, the better the agent would perform.

We were wrong—or at least, incomplete.

A thorough skill that tells an agent exactly what to do is brittle in a specific way: it's hard to verify whether the agent actually did it correctly. You can read the skill. You can read the generated code. But you can't easily tell, from the outside, whether the integration actually works against real APIs with real data. You're trusting the agent's judgment without any way to check it.

Then two "aha moments" changed how we think about this.

Two "Aha Moments"

The first came from our own work. We had already used @plaited/agent-eval-harness—an open-source tool I created for capturing agent trajectories and scoring them—to evaluate web search agents. It occurred to us: what if we used the same harness to evaluate the skills themselves? Not "did the agent follow the instructions," but "did the thing the agent built actually work?"

The second came from Zairah Mustahsan, Staff Product Data Scientist, who shared a post by Shane Butler that crystallized the idea in a single phrase:

"You don't send someone the code. You send them the seed. They plant it in their own soil, answer 7 questions about their world, and a completely different organism grows."

Shane had collapsed a 50,000-line AI data analyst system into a 1,400-line Markdown file—a genome. Drop it into an empty repo, answer a few questions about your data and context, walk away. Hours later, a fully customized AI analyst has grown itself from scratch, adapted to your environment.

That's the mental model we needed. A skill isn't an instruction manual. It's a seed.

What Makes a Skill a Seed

A seed has two properties that an instruction manual doesn't:

  1. It contains everything needed to grow—not just what to do, but what correct growth looks like.
  2. It can be verified—you can tell whether the seed sprouted correctly by examining what grew.

For AI coding skills, this translates directly:

| Instruction Manual | Seed |
| --- | --- |
| Tells agent what code to write | Shows agent canonical reference code |
| Describes what correct output looks like | Includes tests that verify output against real APIs |
| Evaluated by reading | Evaluated by running |
| Static | Verifiable |

The key addition is the assets/ directory. Instead of describing what correct code looks like in prose, we ship the actual working code. The agent reads the assets, understands the pattern, and generates a working integration in the developer's specific context. Then tests prove it sprouted correctly.

The Anatomy of a Seed Skill

Here's what our integration skills look like after the refactor:

```
skills/ydc-ai-sdk-integration/
├── SKILL.md                 ← Instructions + test templates
└── assets/
    ├── path-a-generate.ts   ← Canonical Path A integration
    ├── path-b-stream.ts     ← Canonical Path B integration
    └── integration.spec.ts  ← Test file agents must replicate
```

The SKILL.md body links directly to the asset files so the agent can read them:

```markdown
## Reference Assets

- [assets/path-a-generate.ts](assets/path-a-generate.ts) — basic integration
- [assets/path-b-stream.ts](assets/path-b-stream.ts) — streaming integration
- [assets/integration.spec.ts](assets/integration.spec.ts) — test structure
```

And the "Generate Integration Tests" section in each skill now tells the agent explicitly: write tests that call real APIs, use keyword assertions on the output, and use "Search the web for..." to force tool invocation rather than letting the model answer from memory.

The Test Assertion Problem

This last point—test assertion quality—turned out to matter more than we expected.

Our original tests looked like this:

```typescript
const result = await prompt.send('What is TypeScript?')
expect(result.content.length).toBeGreaterThan(50)
```

This test passes even if the MCP tool was never called. A language model can answer "What is TypeScript?" from training data. The test asserts on content length, not on whether a real web search happened.

The fix has two parts:

1. Force tool invocation with an explicit instruction:

```typescript
const result = await prompt.send(
  'Search the web for the three branches of the US government'
)
```

The phrase "Search the web for..." makes tool use an instruction, not an inference. The model can't easily answer by ignoring the instruction.

2. Assert on semantic content, not length:

```typescript
const text = result.content.toLowerCase()
expect(text).toContain('legislative')
expect(text).toContain('executive')
expect(text).toContain('judicial')
```

These keywords will appear in any real response about the U.S. government. A hallucinated or tool-skipped response is far less likely to hit all three. The test now has genuine signal.
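To make the difference between the two assertion styles concrete, here is a small self-contained sketch. The `hasAllKeywords` helper is hypothetical (it is not part of the harness or the skills), but it captures why keyword checks carry signal that length checks don't:

```typescript
// Sketch: a keyword-assertion helper (hypothetical; not part of the harness).
// Unlike a length check, it fails for answers that never state the expected
// facts -- e.g. a hallucinated or tool-skipped response.
function hasAllKeywords(text: string, keywords: string[]): boolean {
  const lower = text.toLowerCase();
  return keywords.every((k) => lower.includes(k.toLowerCase()));
}

const branches = ["legislative", "executive", "judicial"];

// A real search result about the US government mentions all three branches.
const realResponse =
  "The US government has three branches: the legislative branch (Congress), " +
  "the executive branch (the President), and the judicial branch (the courts).";

// A vague or hallucinated answer can easily clear a 50-character bar.
const vagueResponse =
  "The US government is divided into several branches that balance power.";

console.log(vagueResponse.length > 50);               // true: length check passes anyway
console.log(hasAllKeywords(vagueResponse, branches)); // false: keyword check catches it
console.log(hasAllKeywords(realResponse, branches));  // true
```

The same idea is what the `toContain` assertions above express inside the test runner.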

The Eval Harness

With seed skills and meaningful tests, we could build an eval loop. The harness works like this:

```
prompts.jsonl → Claude Code agent → generated code + tests
                                            ↓
                                bun test / uv run pytest
                                            ↓
                          LLM judge (Haiku) → score 0.0–1.0
```

Each entry in prompts.jsonl is two turns:

  1. "Using the [skill], create a basic integration and write tests that prove it calls the real API."
  2. "Extend with You.com MCP server and update tests to prove MCP works with a live query."

That's it. No "add streaming," no "handle errors," no "support custom env vars." Two turns, two verifiable outcomes.
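For illustration, a single entry might look like this. The field names are my assumption, not the harness's actual schema:

```jsonl
{"skill": "ydc-ai-sdk-integration", "turns": ["Using the ydc-ai-sdk-integration skill, create a basic integration and write tests that prove it calls the real API.", "Extend with the You.com MCP server and update tests to prove MCP works with a live query."]}
```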

The eval question is simple: did the seed sprout a working integration with tests that prove it works?

The grader runs the tests against real APIs (using actual ANTHROPIC_API_KEY and YDC_API_KEY), then sends the test output to Haiku for scoring.

One lesson we learned the hard way: the test results are ground truth, not the LLM judge. Haiku hallucinated that @youdotcom-oss/teams-anthropic was a fabricated package — despite tests passing with 3+ second network timings that proved real API calls happened. We fixed the judge prompt to explicitly anchor on test evidence:

"If tests passed (exit code 0), the code WORKS with real packages and real endpoints. Do not second-guess whether packages exist or endpoints are real; the test output proves they do."

The lesson: LLM judges are useful for qualitative assessment, but they should never be able to override empirical test results.

We learned a second lesson the hard way from the judge's fallback behavior. When Haiku's API call throws an error mid-eval—transient network issue, rate limit, whatever—the grader originally returned a score of 0.5 rather than surfacing the test result it already had. A 0.5 score falls below the 0.65 pass threshold, so a job with all tests passing would report as a failure.

The fix was simple: when the LLM judge fails, trust the test exit code.
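That fallback can be sketched in a few lines. This is an illustration, not the harness's actual code: the function names are invented, the real judge call is async, and returning exactly the 0.65 pass threshold on fallback is an assumption:

```typescript
// Sketch of the grading fallback (names invented for illustration).
// The test exit code is ground truth; the LLM judge only refines the score.
function grade(testExitCode: number, judge: () => number): number {
  try {
    return judge();
  } catch {
    // Judge call failed (network issue, rate limit): trust the exit code
    // instead of returning a below-threshold placeholder like 0.5.
    return testExitCode === 0 ? 0.65 : 0.0;
  }
}

const flakyJudge = () => {
  throw new Error("rate limited");
};

console.log(grade(0, flakyJudge)); // 0.65 -- passing tests still pass the eval
console.log(grade(1, flakyJudge)); // 0 -- failing tests still fail
console.log(grade(0, () => 0.94)); // 0.94 -- judge score used when available
```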

The judge prompt now scores on two dimensions rather than one:

  • The first is the same as before: did the integration work? (Test exit code is ground truth.)
  • The second is new: are the tests meaningful?

The judge reads the test source from generatedFiles and looks for keyword assertions over toBeDefined(), "Search the web for..." queries that force tool invocation, and coverage of both the basic integration and the MCP extension. A score of 0.92–1.0 requires both dimensions; tests that pass with only length checks land in the 0.85–0.91 band. This creates useful differentiation—the eval can flag a skill whose tests pass but whose assertions are weak enough that you can't trust them.
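The two-dimensional rubric can be summarized as a band lookup. This is a simplification for illustration (the real judge produces a single 0.0–1.0 score from a prompt, and the sub-0.65 failure band here is my assumption):

```typescript
// Sketch: the score bands described above, as a lookup.
// Returns the [min, max] band for a given pair of judgments.
function scoreBand(
  integrationWorks: boolean, // dimension 1: test exit code is ground truth
  testsMeaningful: boolean,  // dimension 2: keyword assertions, forced tool use
): [number, number] {
  if (!integrationWorks) return [0.0, 0.64]; // below the 0.65 pass threshold
  if (!testsMeaningful) return [0.85, 0.91]; // passes, but weak assertions
  return [0.92, 1.0];                        // both dimensions satisfied
}

console.log(scoreBand(true, false)); // length-check-only tests land mid-band
```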

Results

After the refactor, all seven integration skills passed with scores ranging from 0.92–0.96:

| Skill | Score | Tests |
| --- | --- | --- |
| teams-anthropic-integration | 0.92 | 4/4 |
| ydc-ai-sdk-integration | 0.95 | 2/2 |
| ydc-claude-agent-sdk-integration (Python) | 0.93 | 2/2 |
| ydc-claude-agent-sdk-integration (TypeScript) | 0.94 | 2/2 |
| ydc-crewai-mcp-integration | 0.93 | 4/4 |
| ydc-openai-agent-sdk-integration (Python) | 0.95 | 2/2 |
| ydc-openai-agent-sdk-integration (TypeScript) | 0.96 | 2/2 |

Before: 5/7 passing at mixed scores, with two skills scoring 0.35 due to a missing pyproject.toml—Python skills need this for uv run pytest to find the test environment. The seed pattern caught this. The asset now includes a pyproject.toml template so agents always generate one.
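For reference, a minimal version of that template might look like this (a sketch; the package name and version pins in the actual asset are not shown in this post):

```toml
[project]
name = "skill-integration-tests"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = ["pytest"]
```

With pytest declared as a dependency, `uv run pytest` can resolve the test environment from the project directory.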

Continuous Integration (CI): Skills That Degrade Get Caught

The last piece is automation. Skills rot. APIs change, packages update, new SDK versions break patterns. We added a GitHub Actions workflow that:

  • On PR: detects which skills changed, runs only those evals, blocks merge if any score below 0.65
  • On push to main: same detection and scoring
  • Weekly schedule: runs all skills, opens a GitHub issue with eval-failure label if anything regressed
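The triggers for a workflow like this might be sketched as follows. The paths, script name, and cron schedule are assumptions for illustration, not our actual workflow file:

```yaml
# Sketch of the workflow triggers (names and paths are assumptions).
on:
  pull_request:
    paths: ["skills/**"]
  push:
    branches: [main]
  schedule:
    - cron: "0 6 * * 1" # Monday mornings: the weekly full run

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run skill evals
        run: bun run eval --threshold 0.65
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          YDC_API_KEY: ${{ secrets.YDC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} # the key CI initially missed
```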

The weekly run is the safety net. If an upstream package changes its API and breaks a skill, we find out on Monday morning rather than when a developer tries to use it.

The CI run for our first PR demonstrated this immediately.

Three skills failed: both OpenAI Agents SDK skills, and the crewAI skill. The root cause in all three cases was the same missing ingredient—OPENAI_API_KEY wasn't passed to the eval step env block. The OpenAI skills failed with a clean assertion error. The crewAI skill failed less obviously—without OPENAI_API_KEY, crewAI couldn't initialize its default LLM and errored before the MCP connection was even attempted. The fix was a one-line addition to the workflow. CI surfaced the gap within minutes of the PR opening.

Applying This to Your Own Skills

If you're building skills for your engineering org—whether for Claude Code, Cursor, or another AI coding tool—the seed pattern applies broadly:

1. Ship working reference code in assets/

Don't describe what correct code looks like. Show it. The agent reads the asset, not your description of the asset.

2. Write tests that call real services

Mocks test that your mock returns what you told it to return. Real API calls test that the integration actually works. Use keyword assertions, not length checks.

3. Force tool use explicitly in test queries

"Search the web for X" is not the same as "What is X?". The former makes tool invocation part of the instruction; the latter lets the model skip the tool entirely.

4. Evaluate outcomes, not instructions

The eval question is: did it work? Not: did the agent follow the instructions correctly? Tests that call real APIs are the ground truth. LLM judges are useful for qualitative color, not for overriding empirical evidence.

5. Automate regression detection

Skills that work today will break next month. A weekly CI run costs little and catches regressions before your users do.

Open Source

The eval harness, @plaited/agent-eval-harness, is open source. The skill patterns from this post are visible in our agent-skills repository. We hope other teams building skills for their organizations find both useful.

The seed is the unit of distribution. What grows from it depends on the soil—the developer's context, their codebase, their APIs. Your job is to make the seed viable. The tests tell you whether it is.

Edward Irby is a Developer Experience engineer at You.com. The @plaited/agent-eval-harness is part of his open-source Plaited project.
