Measure pass@k.
pass@k measures how often a model completes a task at least once within a budget of k attempts. It is the first difficulty signal to check when deciding whether a workflow package is useful.
Definitions
pass@1
First-attempt success. Useful for current deployment reliability.
pass@3
Short recovery budget. Useful for agent loops with limited retry behavior.
pass@5
Exploration budget. Useful for finding recoverable failure modes.
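The definitions above do not say how pass@k is computed from recorded runs. One common choice, shown here as a minimal sketch and not necessarily the method this workflow uses, is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of attempts recorded for a task and c is the number that passed. The function name pass_at_k and its parameters are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n recorded attempts, c of which passed.

    Assumption: attempts are independent samples from the same model,
    environment version, and grader version.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random set of k
    # attempts contains no pass. math.comb returns 0 when n - c < k,
    # so the estimate is 1.0 whenever fewer than k attempts failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 attempts on one task, 2 of them passed.
for k in (1, 3, 5):
    print(f"pass@{k} = {pass_at_k(10, 2, k):.3f}")
# pass@1 = 0.200
# pass@3 = 0.533
# pass@5 = 0.778
```

To get a task-set figure, average the per-task estimates. Recording more attempts than k (n > k) gives a lower-variance estimate than counting only the first k attempts.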
Difficulty signal
Read pass@1 and pass@5 together:
pass@1 high, pass@5 high: too easy to provide a training signal
pass@1 low, pass@5 medium: recoverable and useful
pass@1 low, pass@5 low: likely too sparse or brittle
pass@1 varies by model: useful for model-specific failure analysis
Reporting
Report pass@k by model, task set, environment version, and grader version. Do not mix runs across changed UI, changed reset state, or changed verifier behavior without recording a new version.
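The reporting rule and the difficulty table can be sketched as a small record plus a labelling function. Everything named here is hypothetical: the ReportKey fields mirror the four reporting dimensions, the HIGH and LOW cutoffs are placeholders because the text gives no numeric thresholds, and the example values are made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical cutoffs: the text says "high", "medium", and "low"
# without giving numbers. Tune these for your own task sets.
HIGH = 0.8
LOW = 0.2


@dataclass(frozen=True)
class ReportKey:
    """The four dimensions every pass@k figure is reported under."""
    model: str
    task_set: str
    env_version: str
    grader_version: str


def difficulty_label(pass_at_1: float, pass_at_5: float) -> str:
    """Map a (pass@1, pass@5) pair onto the difficulty table."""
    if pass_at_1 >= HIGH and pass_at_5 >= HIGH:
        return "too easy"
    if pass_at_1 <= LOW and pass_at_5 <= LOW:
        return "sparse or brittle"
    if pass_at_1 <= LOW and LOW < pass_at_5 < HIGH:
        return "recoverable"
    # Combinations the table does not cover are left unlabelled.
    return "unclassified"


# Made-up values. A changed environment gets its own key, so the two
# runs are never averaged together.
reports = {
    ReportKey("model-a", "example-tasks", "env-v1", "grader-v1"): (0.10, 0.45),
    ReportKey("model-a", "example-tasks", "env-v2", "grader-v1"): (0.85, 0.95),
}
for key, (p1, p5) in reports.items():
    print(key.env_version, difficulty_label(p1, p5))
```

Using a frozen dataclass as the dictionary key means a run recorded under a new environment or grader version cannot silently overwrite or merge with an older one.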