concept

Model runs.

A run is one model attempt at one task in one environment version. Recorded runs are what make a workflow package usable for evaluation, active testing, training comparisons, and published evidence.

Run record

{
  "run_id": "run-kca-2026-06-19-001",
  "model": "claude-computer-use",
  "environment": "korean-commerce-admin",
  "task": "product-price-fix-001",
  "score": 0.72,
  "reward": 0.70,
  "verdict": "final state correct, process violation recorded",
  "trace": ["click", "click", "type", "save"]
}
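To make the record shape concrete, here is a minimal Python sketch that loads a record like the one above and checks its fields. The `RunRecord` class and `parse_run_record` function are hypothetical names for illustration, not part of any UseDesktop API, and the assumption that `score` and `reward` fall in the range 0 to 1 is inferred from the sample values only.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical container mirroring the fields of the sample run record."""
    run_id: str
    model: str
    environment: str
    task: str
    score: float
    reward: float
    verdict: str
    trace: list[str]


def parse_run_record(raw: str) -> RunRecord:
    """Parse a JSON run record and check the basic shape of its fields."""
    data = json.loads(raw)
    # Raises TypeError if a key is missing or an unexpected key is present.
    record = RunRecord(**data)
    # Assumption: score and reward are normalized to the range [0, 1].
    for name in ("score", "reward"):
        value = getattr(record, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} out of range: {value}")
    if not all(isinstance(step, str) for step in record.trace):
        raise ValueError("trace must be a list of action names")
    return record


EXAMPLE = """
{
  "run_id": "run-kca-2026-06-19-001",
  "model": "claude-computer-use",
  "environment": "korean-commerce-admin",
  "task": "product-price-fix-001",
  "score": 0.72,
  "reward": 0.70,
  "verdict": "final state correct, process violation recorded",
  "trace": ["click", "click", "type", "save"]
}
"""

record = parse_run_record(EXAMPLE)
print(record.run_id, record.score, len(record.trace))
```

Keeping the record immutable (`frozen=True`) is a design choice for this sketch: a run is evidence about a past attempt, so downstream code should read it rather than edit it.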

pass@k

pass@k measures whether a model succeeds at least once within k attempts at a task. UseDesktop records pass@1, pass@3, and pass@5 so a task set can show whether it is too easy, too sparse in successes, or useful for hill-climbing.

pass@1

First-attempt success. Good for measuring current deployment reliability.

pass@3

Short retry budget. Useful for agent loops with limited recovery.

pass@5

Exploration budget. Helps reveal recoverable failure modes.
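The three budgets above can be estimated from the same pool of recorded attempts. The sketch below uses a widely used combinatorial estimator, 1 - C(n - c, k) / C(n, k), where n is the number of recorded attempts on a task and c is how many succeeded. This is an assumption for illustration: the text does not say how UseDesktop computes pass@k, and `pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n recorded attempts, c of which succeeded.

    Returns the probability that a random draw of k attempts out of the
    n recorded ones contains at least one success.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any draw of k includes a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task with 5 recorded attempts and 1 success.
for k in (1, 3, 5):
    print(f"pass@{k} = {pass_at_k(n=5, c=1, k=k):.2f}")
```

With one success in five attempts, the estimate rises from 0.20 at k=1 to 0.60 at k=3 and 1.00 at k=5. A task whose estimate stays near zero even at k=5 offers little signal to learn from, while one already near 1.0 at k=1 leaves little room to improve.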

Failure evidence

The point is not only to publish a score. A useful run includes a failure-mode taxonomy, trace snippets, wrong-target evidence, and grader reasoning, so that another team can decide whether the workflow package is worth training on.
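One way to keep that evidence together is a small structure attached to each failed run. The sketch below is illustrative only: `FailureEvidence`, its field names, the `missing_evidence` helper, and the sample values are all hypothetical and do not describe a UseDesktop schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureEvidence:
    """Hypothetical bundle of the evidence types named above."""
    failure_mode: str             # label from a failure-mode taxonomy
    trace_snippet: list[str]      # the steps around the failure
    wrong_target: Optional[str]   # what the model acted on instead, if anything
    grader_reasoning: str         # why the grader reached its verdict


def missing_evidence(evidence: FailureEvidence) -> list[str]:
    """Return the names of required evidence fields that are empty."""
    required = {
        "failure_mode": evidence.failure_mode,
        "trace_snippet": evidence.trace_snippet,
        "grader_reasoning": evidence.grader_reasoning,
    }
    # wrong_target is optional: not every failure involves a wrong target.
    return [name for name, value in required.items() if not value]


# Illustrative values only, not real run data.
evidence = FailureEvidence(
    failure_mode="process_violation",
    trace_snippet=["click", "type", "save"],
    wrong_target=None,
    grader_reasoning="",
)
print(missing_evidence(evidence))
```

A check like this lets a publisher flag runs that report a score without the reasoning another team would need to interpret it.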