March 2, 2026

Effective AI Skills Are Like Seeds

Edward Irby

Senior Software Engineer

TL;DR: Early AI coding skills relied on detailed instructions, but this approach proved brittle and hard to verify in real-world scenarios. The shift to a "seed" model—providing working reference code and meaningful tests—enabled robust, adaptable integrations. Skills are now verified via real API calls, automated CI, and regression detection.

When we first built AI coding skills for Claude Code, we thought about them the way most teams do: as detailed instruction manuals. Write precise steps, enumerate every option, cover every edge case. The more thorough the skill, the better the agent would perform.

We were wrong—or at least, incomplete.

A thorough skill that tells an agent exactly what to do is brittle in a specific way: it's hard to verify whether the agent actually did it correctly. You can read the skill. You can read the generated code. But you can't easily tell, from the outside, whether the integration actually works against real APIs with real data. You're trusting the agent's judgment without any way to check it.

Then two "aha moments" changed how we think about this.

Two "Aha Moments"

The first came from our own work. We had already used @plaited/agent-eval-harness—an open-source tool I created for capturing agent trajectories and scoring them—to evaluate web search agents. It occurred to us: what if we used the same harness to evaluate the skills themselves? Not "did the agent follow the instructions," but "did the thing the agent built actually work?"

The second came from Zairah Mustahsan, Staff Product Data Scientist, who shared a post by Shane Butler that crystallized the idea in a single phrase:

"You don't send someone the code. You send them the seed. They plant it in their own soil, answer 7 questions about their world, and a completely different organism grows."

Shane had collapsed a 50,000-line AI data analyst system into a 1,400-line Markdown file—a genome. Drop it into an empty repo, answer a few questions about your data and context, walk away. Hours later, a fully customized AI analyst has grown itself from scratch, adapted to your environment.

That's the mental model we needed. A skill isn't an instruction manual. It's a seed.

What Makes a Skill a Seed

A seed has two properties that an instruction manual doesn't:

  1. It contains everything needed to grow—not just what to do, but what correct growth looks like.
  2. It can be verified—you can tell whether the seed sprouted correctly by examining what grew.

For AI coding skills, this translates directly:

Instruction Manual                         | Seed
-------------------------------------------|-----------------------------------------------------
Tells agent what code to write             | Shows agent canonical reference code
Describes what correct output looks like   | Includes tests that verify output against real APIs
Evaluated by reading                       | Evaluated by running
Static                                     | Verifiable

The key addition is the assets/ directory. Instead of describing what correct code looks like in prose, we ship the actual working code. The agent reads the assets, understands the pattern, and generates a working integration in the developer's specific context. Then tests prove it sprouted correctly.

The Anatomy of a Seed Skill

Here's what our integration skills look like after the refactor:

Shell
skills/ydc-ai-sdk-integration/
├── SKILL.md                  ← Instructions + test templates
└── assets/
    ├── path-a-generate.ts    ← Canonical Path A integration
    ├── path-b-stream.ts      ← Canonical Path B integration
    └── integration.spec.ts   ← Test file agents must replicate

The SKILL.md body links directly to the asset files so the agent can read them:

Markdown
## Reference Assets

- [assets/path-a-generate.ts](assets/path-a-generate.ts) — basic integration
- [assets/path-b-stream.ts](assets/path-b-stream.ts) — streaming integration
- [assets/integration.spec.ts](assets/integration.spec.ts) — test structure

And the "Generate Integration Tests" section in each skill now tells the agent explicitly: write tests that call real APIs, use keyword assertions on the output, and use "Search the web for..." to force tool invocation rather than letting the model answer from memory.

The Test Assertion Problem

This last point—test assertion quality—turned out to matter more than we expected.

Our original tests looked like this:

TypeScript
const result = await prompt.send('What is TypeScript?')
expect(result.content.length).toBeGreaterThan(50)

This test passes even if the MCP tool was never called. A language model can answer "What is TypeScript?" from training data. The test asserts on content length, not on whether a real web search happened.

The fix has two parts:

1. Force tool invocation with an explicit instruction:

TypeScript
const result = await prompt.send(
  'Search the web for the three branches of the US government'
)

The phrase "Search the web for..." makes tool use an instruction, not an inference. The model can't easily answer by ignoring the instruction.

2. Assert on semantic content, not length:

TypeScript
const text = result.content.toLowerCase()
expect(text).toContain('legislative')
expect(text).toContain('executive')
expect(text).toContain('judicial')

These keywords will appear in any real response about the U.S. government. A hallucinated or tool-skipped response is far less likely to hit all three. The test now has genuine signal.
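Both fixes generalize into a small reusable assertion. The sketch below is ours for illustration—`expectKeywords` is an invented helper, not a function from the skill's assets:

```typescript
// Sketch: a reusable semantic assertion. `expectKeywords` is our
// invented helper, not part of the skill's assets.
function expectKeywords(content: string, keywords: string[]): void {
  const text = content.toLowerCase()
  const missing = keywords.filter((k) => !text.includes(k.toLowerCase()))
  if (missing.length > 0) {
    throw new Error(`Response missing expected keywords: ${missing.join(', ')}`)
  }
}

// In a test body this replaces the three separate toContain calls:
// expectKeywords(result.content, ['legislative', 'executive', 'judicial'])
```

The helper fails loudly with the full list of missing keywords, which makes tool-skipped responses easy to diagnose from test output alone.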

The Eval Harness

With seed skills and meaningful tests, we could build an eval loop. The harness works like this:

Plain Text
prompts.jsonl → Claude Code agent → generated code + tests
                        ↓
              bun test / uv run pytest
                        ↓
         LLM judge (Haiku) → score 0.0–1.0

Each entry in prompts.jsonl is two turns:

  1. "Using the [skill], create a basic integration and write tests that prove it calls the real API."
  2. "Extend with You.com MCP server and update tests to prove MCP works with a live query."

That's it. No "add streaming," no "handle errors," no "support custom env vars." Two turns, two verifiable outcomes.
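Concretely, an entry might take a shape like the following. The field names (`id`, `turns`) are assumptions for illustration, not the harness's documented schema:

```typescript
// Hypothetical shape of one prompts.jsonl entry. Field names are
// illustrative assumptions, not @plaited/agent-eval-harness's schema.
type PromptEntry = {
  id: string
  turns: [string, string] // exactly two turns, as described above
}

const entry: PromptEntry = {
  id: 'ydc-ai-sdk-integration',
  turns: [
    'Using the ydc-ai-sdk-integration skill, create a basic integration and write tests that prove it calls the real API.',
    'Extend with the You.com MCP server and update tests to prove MCP works with a live query.',
  ],
}
```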

The eval question is simple: did the seed sprout a working integration with tests that prove it works?

The grader runs the tests against real APIs (using actual ANTHROPIC_API_KEY and YDC_API_KEY), then sends the test output to Haiku for scoring.

One lesson we learned the hard way: the test results are ground truth, not the LLM judge. Haiku hallucinated that @youdotcom-oss/teams-anthropic was a fabricated package — despite tests passing with 3+ second network timings that proved real API calls happened. We fixed the judge prompt to explicitly anchor on test evidence:

"If tests passed (exit code 0), the code WORKS with real packages and real endpoints. Do not second-guess whether packages exist or endpoints are real; the test output proves they do."

The lesson: LLM judges are useful for qualitative assessment, but they should never be able to override empirical test results.

The judge's fallback behavior taught a related lesson. When Haiku's API call throws an error mid-eval—transient network issue, rate limit, whatever—the grader originally returned a score of 0.5 rather than surfacing the test result it already had. A 0.5 falls below the 0.65 pass threshold, so a job with all tests passing would report as a failure.

The fix was simple: when the LLM judge fails, trust the test exit code.
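In code, the fix might look like this sketch; `gradeWithFallback` and its signature are our invention for illustration, not the harness's actual API:

```typescript
// Sketch of the judge-fallback fix. Names and types are illustrative
// assumptions, not @plaited/agent-eval-harness's real API.
type Grade = { score: number; source: 'judge' | 'tests' }

function gradeWithFallback(testExitCode: number, judgeScore: number | null): Grade {
  // Judge errored mid-eval (judgeScore === null): trust the empirical
  // test result instead of returning a misleading neutral 0.5.
  if (judgeScore === null) {
    return { score: testExitCode === 0 ? 1.0 : 0.0, source: 'tests' }
  }
  return { score: judgeScore, source: 'judge' }
}
```

With this shape, a run whose tests exit 0 can never be dragged below the 0.65 pass threshold by a transient judge failure.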

The judge prompt now scores on two dimensions rather than one:

  • The first is the same as before: did the integration work? (Test exit code is ground truth.)
  • The second is new: are the tests meaningful?

The judge reads the test source from generatedFiles and looks for keyword assertions over toBeDefined(), "Search the web for..." queries that force tool invocation, and coverage of both the basic integration and the MCP extension. A score of 0.92–1.0 requires both dimensions; tests that pass with only length checks land in the 0.85–0.91 band. This creates useful differentiation—the eval can flag a skill whose tests pass but whose assertions are weak enough that you can't trust them.
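The banding logic reads roughly like this; the function is our sketch of the rubric, not the actual judge prompt:

```typescript
// Illustrative sketch of the two-dimension rubric described above.
// Function and field names are ours, not the judge prompt's wording.
type Band = { min: number; max: number }

function scoreBand(testsPassed: boolean, assertionsMeaningful: boolean): Band | null {
  if (!testsPassed) return null // test exit code is ground truth; no band applies
  return assertionsMeaningful
    ? { min: 0.92, max: 1.0 }  // works AND tests are meaningful
    : { min: 0.85, max: 0.91 } // works, but assertions are weak (length checks)
}
```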

Results

After the refactor, all seven integration skills passed with scores ranging from 0.92–0.96:

Skill                                            Score   Tests
teams-anthropic-integration                      0.92    4/4
ydc-ai-sdk-integration                           0.95    2/2
ydc-claude-agent-sdk-integration (Python)        0.93    2/2
ydc-claude-agent-sdk-integration (TypeScript)    0.94    2/2
ydc-crewai-mcp-integration                       0.93    4/4
ydc-openai-agent-sdk-integration (Python)        0.95    2/2
ydc-openai-agent-sdk-integration (TypeScript)    0.96    2/2

Before the refactor: 5/7 passing with mixed scores, and two skills at 0.35 due to a missing pyproject.toml (Python skills need one for uv run pytest to find the test environment). The seed pattern caught this: the asset now includes a pyproject.toml template, so agents always generate one.

Continuous Integration (CI): Skills That Degrade Get Caught

The last piece is automation. Skills rot. APIs change, packages update, new SDK versions break patterns. We added a GitHub Actions workflow that:

  • On PR: detects which skills changed, runs only those evals, and blocks merge if any score falls below 0.65
  • On push to main: same detection and scoring
  • Weekly schedule: runs all skills, opens a GitHub issue with eval-failure label if anything regressed
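The per-PR change detection can be sketched as a pure function over `git diff --name-only` output; the `skills/<skill-name>/...` layout is assumed from the directory tree earlier in the post:

```typescript
// Sketch: map changed file paths to the set of skills whose evals
// should run. Assumes the skills/<skill-name>/... layout shown above.
function changedSkills(changedFiles: string[]): string[] {
  const names = changedFiles
    .filter((path) => path.startsWith('skills/'))
    .map((path) => path.split('/')[1])
  return [...new Set(names)].sort()
}
```

In a workflow, the input would come from diffing the PR branch against main; files outside skills/ are ignored, so documentation-only PRs skip the eval step entirely.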

The weekly run is the safety net. If an upstream package changes its API and breaks a skill, we find out on Monday morning rather than when a developer tries to use it.

The CI run for our first PR demonstrated this immediately.

Three skills failed: both OpenAI Agents SDK skills and the crewAI skill. The root cause in all three cases was the same missing ingredient: OPENAI_API_KEY wasn't passed to the eval step's env block. The OpenAI skills failed with a clean assertion error. The crewAI skill failed less obviously—without OPENAI_API_KEY, crewAI couldn't initialize its default LLM and errored before the MCP connection was even attempted. The fix was a one-line addition to the workflow. CI surfaced the gap within minutes of the PR opening.

Applying This to Your Own Skills

If you're building skills for your engineering org—whether for Claude Code, Cursor, or another AI coding tool—the seed pattern applies broadly:

1. Ship working reference code in assets/

Don't describe what correct code looks like. Show it. The agent reads the asset, not your description of the asset.

2. Write tests that call real services

Mocks test that your mock returns what you told it to return. Real API calls test that the integration actually works. Use keyword assertions, not length checks.

3. Force tool use explicitly in test queries

"Search the web for X" is not the same as "What is X?". The former makes tool invocation part of the instruction; the latter lets the model skip the tool entirely.

4. Evaluate outcomes, not instructions

The eval question is: did it work? Not: did the agent follow the instructions correctly? Tests that call real APIs are the ground truth. LLM judges are useful for qualitative color, not for overriding empirical evidence.

5. Automate regression detection

Skills that work today will break next month. A weekly CI run costs little and catches regressions before your users do.

Open Source

The eval harness, @plaited/agent-eval-harness, is open source. The skill patterns from this post are visible in our agent-skills repository. We hope other teams building skills for their organizations find both useful.

The seed is the unit of distribution. What grows from it depends on the soil—the developer's context, their codebase, their APIs. Your job is to make the seed viable. The tests tell you whether it is.

Edward Irby is a Developer Experience engineer at You.com. The @plaited/agent-eval-harness is part of his open-source Plaited project.
