
Measure pass@k.

pass@k is the probability that at least one of k attempts on a task succeeds, so it shows whether a model can complete the task within a retry budget of k. It is the first difficulty signal to check when deciding whether a workflow package is useful.

Definitions

pass@1

First-attempt success. Useful for current deployment reliability.

pass@3

Short recovery budget. Useful for agent loops with limited retry behavior.

pass@5

Exploration budget. Useful for finding recoverable failure modes.
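
The definitions above do not say how pass@k is computed from raw attempts. One common choice is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n attempts were sampled for a task and c of them passed. The sketch below assumes that estimator; the function names pass_at_k and mean_pass_at_k are illustrative and not part of any package described here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n sampled attempts, c of which passed.

    Assumption: uses the unbiased estimator 1 - C(n-c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures: every size-k subset contains at least one pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-task pass@k over (n, c) pairs, one pair per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, a task with 3 passes in 10 attempts gives pass_at_k(10, 3, 1) of about 0.3 and pass_at_k(10, 3, 5) of 1 - 21/252, about 0.917. Sampling more attempts than k (n > k) reduces the variance of the estimate.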

Difficulty signal

difficulty signal:
  pass@1 high, pass@5 high: too easy for training signal
  pass@1 low, pass@5 medium: recoverable and useful
  pass@1 low, pass@5 low: likely too sparse or brittle
  pass@1 varies by model: useful for model-specific failure analysis
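
The rules above can be sketched as a small classifier. The source does not define numeric cutoffs for "high", "medium", or "low", so the thresholds below (0.8, 0.2, and a 0.3 cross-model spread) are placeholder assumptions to tune per task set, and the function names are hypothetical.

```python
def classify_difficulty(pass_at_1: float, pass_at_5: float,
                        high: float = 0.8, low: float = 0.2) -> str:
    """Map (pass@1, pass@5) to one of the difficulty labels.

    Assumption: 'high' means >= high, 'low' means <= low, and
    'medium' means strictly between the two cutoffs.
    """
    if pass_at_1 >= high and pass_at_5 >= high:
        return "too easy for training signal"
    if pass_at_1 <= low and pass_at_5 <= low:
        return "likely too sparse or brittle"
    if pass_at_1 <= low and low < pass_at_5 < high:
        return "recoverable and useful"
    # Combinations the rules do not cover are left for manual review.
    return "unclassified"


def varies_by_model(pass_at_1_by_model: dict[str, float],
                    min_spread: float = 0.3) -> bool:
    """Flag a task whose pass@1 differs across models by at least min_spread."""
    if len(pass_at_1_by_model) < 2:
        return False
    values = pass_at_1_by_model.values()
    return max(values) - min(values) >= min_spread
```

Returning "unclassified" for uncovered combinations keeps the sketch from forcing a label the rules do not assign.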

Reporting

Report pass@k by model, task set, environment version, and grader version. Do not pool runs collected before and after a change to the UI, the reset state, or the verifier behavior; record a new environment or grader version first, then report each version separately.
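
One way to enforce this is to key every aggregate on all four reporting fields, so runs from different versions can never land in the same bucket. This is a minimal sketch: the record field names (model, task_set, environment_version, grader_version, passed) are assumptions for illustration, not the actual run record contract.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportKey:
    """The four dimensions a pass@k number is reported under."""
    model: str
    task_set: str
    environment_version: str
    grader_version: str


def group_runs(runs: list[dict]) -> dict[ReportKey, list[bool]]:
    """Bucket first-attempt outcomes by the full reporting key."""
    groups: dict[ReportKey, list[bool]] = defaultdict(list)
    for run in runs:
        key = ReportKey(
            run["model"],
            run["task_set"],
            run["environment_version"],
            run["grader_version"],
        )
        groups[key].append(bool(run["passed"]))
    return dict(groups)


def pass_at_1_report(runs: list[dict]) -> dict[ReportKey, float]:
    """Compute pass@1 per key; versions are never averaged together."""
    return {
        key: sum(outcomes) / len(outcomes)
        for key, outcomes in group_runs(runs).items()
    }
```

Because the key is a frozen dataclass, a run recorded under a new environment or grader version automatically produces a separate row in the report.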

Next: verifier audits

Trust the score before comparing models.

Model run reference

Read the run record contract.