Model runs
A run is one model attempt against one task in one environment version. Runs make the package useful for evaluation, active testing, training comparisons, and published evidence.
Run record
{
  "run_id": "run-kca-2026-06-19-001",
  "model": "claude-computer-use",
  "environment": "korean-commerce-admin",
  "task": "product-price-fix-001",
  "score": 0.72,
  "reward": 0.70,
  "verdict": "final state correct, process violation recorded",
  "trace": ["click", "click", "type", "save"]
}
pass@k
pass@k measures whether a model succeeds at least once within k attempts at a task. UseDesktop records pass@1, pass@3, and pass@5 so a task set can show whether it is too easy, too sparse, or useful for hill-climbing.
pass@1
First-attempt success. Good for measuring current deployment reliability.
pass@3
Short retry budget. Useful for agent loops with limited recovery.
pass@5
Exploration budget. Helps reveal recoverable failure modes.
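The three budgets above can be computed from the same set of attempts. The following is a minimal sketch, not UseDesktop's implementation: it assumes each task has an ordered list of per-attempt outcomes and counts a task as passed at k if any of its first k attempts succeeded. The function name `pass_at_k`, the hypothetical task names, and the outcome values are all illustrative assumptions.

```python
from typing import Dict, List


def pass_at_k(attempts_by_task: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks with at least one success among the first k attempts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts_by_task:
        raise ValueError("need at least one task")
    solved = sum(
        1 for attempts in attempts_by_task.values() if any(attempts[:k])
    )
    return solved / len(attempts_by_task)


# Illustrative outcomes only (True = attempt succeeded); not real results.
attempts = {
    "product-price-fix-001": [False, True, True, False, True],
    "hypothetical-task-b": [True, True, True, True, True],
    "hypothetical-task-c": [False, False, False, False, False],
    "hypothetical-task-d": [False, False, False, True, False],
}

rates = {k: pass_at_k(attempts, k) for k in (1, 3, 5)}
print(rates)  # {1: 0.25, 3: 0.5, 5: 0.75}
```

In this made-up data, the gap between pass@1 and pass@5 comes from tasks that fail first and succeed on a later attempt, which is the recoverable-failure signal described above. A task that stays at zero across all three budgets contributes nothing to that gap.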
Failure evidence
The point is not only to publish a score. A useful run includes a failure-mode taxonomy, trace snippets, wrong-target evidence, and grader reasoning, so that another team can decide whether the workflow package is worth training on.
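One way to keep this evidence attached to the score is to store it on the run record itself. The sketch below is an assumption about how that could look, not the package's actual schema: the `RunRecord` fields mirror the sample run record, while `FailureEvidence` and all of its field names and values are hypothetical placeholders.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class FailureEvidence:
    # All field names here are hypothetical, chosen to match the prose above.
    failure_mode: str            # label from a failure-mode taxonomy
    trace_snippet: List[str]     # the steps around the failure
    wrong_target: Optional[str]  # element acted on by mistake, if any
    grader_reasoning: str        # why the grader reached its verdict


@dataclass
class RunRecord:
    run_id: str
    model: str
    environment: str
    task: str
    score: float
    reward: float
    verdict: str
    trace: List[str]
    failure: Optional[FailureEvidence] = None  # None for a clean pass


record = RunRecord(
    run_id="run-kca-2026-06-19-001",
    model="claude-computer-use",
    environment="korean-commerce-admin",
    task="product-price-fix-001",
    score=0.72,
    reward=0.70,
    verdict="final state correct, process violation recorded",
    trace=["click", "click", "type", "save"],
    failure=FailureEvidence(
        failure_mode="process_violation",   # placeholder taxonomy label
        trace_snippet=["click", "click"],   # placeholder excerpt
        wrong_target=None,                  # no wrong-target evidence assumed
        grader_reasoning="placeholder: grader explanation goes here",
    ),
)

# asdict() recurses into nested dataclasses, so the record serializes whole.
payload = asdict(record)
print(payload["failure"]["failure_mode"])  # process_violation
```

Keeping the evidence optional lets clean passes and failed or flagged runs share one record shape.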