The eval system matters most if you are customizing or extending Brain OS skills. If you use the skills as-is, you don’t need to configure anything.
Layer 1: Smart diff check (automatic)
The smart diff check is a PostToolUse hook that fires every time a SKILL.md file is edited. It diffs the new version against the git baseline and warns if critical content was removed: section headers, script references, numeric thresholds, and key terms.
What it checks on every SKILL.md edit:
- Word count: warns if content dropped more than 20%, blocks if more than 40%
- Script references: flags any `.py` file references that disappeared
- Section headers: flags any `##` headings that were removed
- Key terms: checks for important concepts like `notebooklm`, `information barriers`, `autoresearch`
- Numeric thresholds: checks for values like `≥ 95` or `100%` that guard quality requirements
Exit codes:
- `0`: all checks passed
- `2`: critical removals detected (blocks the edit)
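The checks listed above can be sketched in Python roughly as follows. This is an illustrative approximation, not the actual hook script: the function names, regular expressions, and message strings are hypothetical, and it assumes that any removed script reference, header, or key term counts as a critical removal. The numeric-threshold check is left out because the source does not say how those values are matched.

```python
import re

WARN_DROP = 0.20   # warn above a 20% word-count drop (threshold from the list above)
BLOCK_DROP = 0.40  # block above a 40% word-count drop (threshold from the list above)
KEY_TERMS = ("notebooklm", "information barriers", "autoresearch")


def removed_items(pattern: str, old: str, new: str) -> list[str]:
    """Return matches of `pattern` present in `old` but absent from `new`."""
    old_items = set(re.findall(pattern, old, re.MULTILINE))
    new_items = set(re.findall(pattern, new, re.MULTILINE))
    return sorted(old_items - new_items)


def check_skill_edit(old: str, new: str) -> tuple[int, list[str]]:
    """Compare baseline and edited SKILL.md text; return (exit_code, messages)."""
    messages = []
    blocked = False

    # Word count: relative drop from the baseline.
    old_words = len(old.split())
    new_words = len(new.split())
    drop = (old_words - new_words) / old_words if old_words else 0.0
    if drop > BLOCK_DROP:
        blocked = True
        messages.append(f"BLOCK: word count dropped {drop:.0%}")
    elif drop > WARN_DROP:
        messages.append(f"WARN: word count dropped {drop:.0%}")

    # ASSUMPTION: every removal below is treated as critical (exit code 2).
    for script in removed_items(r"[\w./-]+\.py\b", old, new):
        blocked = True
        messages.append(f"removed script reference: {script}")
    for header in removed_items(r"^##\s+.+$", old, new):
        blocked = True
        messages.append(f"removed section header: {header}")
    for term in KEY_TERMS:
        if term in old.lower() and term not in new.lower():
            blocked = True
            messages.append(f"removed key term: {term}")

    return (2 if blocked else 0), messages
```

In a real hook, the baseline text could be read with `git show HEAD:<path>` and the returned code passed to `sys.exit`, so that `2` blocks the edit.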
Setup
Add the following to `~/.claude/settings.json`:
Layer 2: Eval checks (on demand)
Pipeline skills include an `evals/evals.json` file following the Agent Skills standard. Each eval defines a prompt, expected output, and a list of verifiable assertions that the skill must satisfy.
Run evals for a specific skill:
Output format
The evals.json schema
Each skill’s eval file lives at `evals/evals.json` inside the skill directory. The schema:
| Field | Description |
|---|---|
| `id` | Unique integer identifier for the eval |
| `prompt` | The task or question being tested |
| `expected_output` | A human-readable description of what a passing result looks like |
| `expectations` | Array of specific, verifiable assertions checked against SKILL.md |
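To make the four fields concrete, here is a minimal sketch of a validator for a single eval entry. The field names come from the table above; the `validate_eval` helper, the string types for `prompt` and `expected_output`, and the sample values are hypothetical and not part of Brain OS or the Agent Skills specification.

```python
from typing import Any

# Field names from the schema table; the types other than `id` are assumptions.
REQUIRED_FIELDS = {
    "id": int,
    "prompt": str,
    "expected_output": str,
    "expectations": list,
}


def validate_eval(entry: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one eval entry (empty if it looks valid)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
            continue
        value = entry[field]
        # bool is a subclass of int in Python, so reject it explicitly for `id`.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"wrong type for {field}: expected {expected_type.__name__}")
    if isinstance(entry.get("expectations"), list) and not entry["expectations"]:
        problems.append("expectations is empty")
    return problems


# Hypothetical sample entry, for illustration only.
sample_eval = {
    "id": 1,
    "prompt": "Summarize the research notes for this week.",
    "expected_output": "A summary that follows the steps described in SKILL.md.",
    "expectations": ["SKILL.md still contains a '## Workflow' section"],
}
```

Calling `validate_eval(sample_eval)` returns an empty list, while an entry that lacks `prompt` yields a `missing field: prompt` message.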
Adding evals to your own skills
Create an `evals/` directory inside your skill directory, then add an `evals.json` file using the schema above. Each eval should test a critical aspect of the skill that would break if accidentally removed.
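The two steps above (create the directory, then write the file) could be scripted along these lines. This is a sketch under assumptions: `scaffold_evals` is a hypothetical helper, and wrapping the entries in a top-level `"evals"` list is a guess at the file layout, so check the Agent Skills specification before relying on it.

```python
import json
from pathlib import Path


def scaffold_evals(skill_dir: Path, evals: list[dict]) -> Path:
    """Create <skill_dir>/evals/evals.json and return its path."""
    evals_dir = skill_dir / "evals"
    evals_dir.mkdir(parents=True, exist_ok=True)
    path = evals_dir / "evals.json"
    if path.exists():
        # Refuse to overwrite evals that someone has already written.
        raise FileExistsError(f"{path} already exists")
    # ASSUMPTION: the file is a JSON object with a top-level "evals" list.
    payload = {"evals": evals}
    path.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
    return path
```

Each dictionary passed in `evals` would carry the `id`, `prompt`, `expected_output`, and `expectations` fields described in the schema.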
Creating and improving skills
/grill-me — stress-test your skill design
Run `/grill-me` before building a new skill. It stress-tests your design by challenging assumptions and surfacing edge cases before you write any SKILL.md content.
/skill-creator — build, evaluate, and iterate
Run `/skill-creator` to create a new skill end-to-end: design, write SKILL.md, generate evals, and iterate based on eval results, all in one workflow.
Agent Skills specification
The eval format follows the open Agent Skills specification. Refer to this if you want to understand the full schema or integrate Brain OS evals with other Agent Skills-compatible tooling.