concept

Model runs.

A run is one model attempt against one task in one environment version. Runs make the package useful for evaluation, active testing, training comparisons, and published evidence.

Run record

{
  "run_id": "run-kca-2026-06-19-001",
  "model": "claude-computer-use",
  "environment": "korean-commerce-admin",
  "task": "product-price-fix-001",
  "score": 0.72,
  "reward": 0.70,
  "verdict": "final state correct, process violation recorded",
  "trace": ["click", "click", "type", "save"]
}
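As a rough illustration, a record like the one above could be loaded and sanity-checked before it is compared with other runs. This is a minimal sketch, not the package's own schema or loader: the `RunRecord` class and `parse_run_record` function are hypothetical names, and the field types are inferred only from the sample record.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    # Field names mirror the sample record; the types are assumptions.
    run_id: str
    model: str
    environment: str
    task: str
    score: float
    reward: float
    verdict: str
    trace: list[str]


def parse_run_record(raw: str) -> RunRecord:
    """Parse one JSON run record and check basic field types."""
    data = json.loads(raw)
    # Raises TypeError if a key is missing or an unexpected key is present.
    record = RunRecord(**data)
    for name in ("score", "reward"):
        value = getattr(record, name)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} must be a number, got {value!r}")
    if not isinstance(record.trace, list) or not all(
        isinstance(step, str) for step in record.trace
    ):
        raise ValueError("trace must be a list of action names")
    return record


raw = """
{
  "run_id": "run-kca-2026-06-19-001",
  "model": "claude-computer-use",
  "environment": "korean-commerce-admin",
  "task": "product-price-fix-001",
  "score": 0.72,
  "reward": 0.70,
  "verdict": "final state correct, process violation recorded",
  "trace": ["click", "click", "type", "save"]
}
"""
record = parse_run_record(raw)
print(record.task, record.score, len(record.trace))
```

Rejecting unknown keys is a deliberate choice in this sketch: it surfaces schema drift between environment versions early, at the cost of being stricter than a real loader might need to be.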

pass@k

pass@k measures whether a model succeeds at least once within k attempts at a task. UseDesktop records pass@1, pass@3, and pass@5 so a task set can show whether it is too easy, too sparse in successes, or useful for hill-climbing.

pass@1

First-attempt success. Good for measuring current deployment reliability.

pass@3

Short retry budget. Useful for agent loops with limited recovery.

pass@5

Exploration budget. Helps reveal recoverable failure modes.
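The three budgets above can be computed from the same set of attempts. The sketch below assumes each task has an ordered list of attempt outcomes and counts a task as solved at k if any of its first k attempts passed. The function name `pass_at_k`, the input shape, and the outcome data are hypothetical, chosen only for illustration; how UseDesktop itself aggregates attempts is not stated here.

```python
def pass_at_k(attempts_by_task: dict[str, list[bool]], k: int) -> float:
    """Fraction of tasks with at least one success in the first k attempts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts_by_task:
        raise ValueError("need at least one task")
    solved = sum(
        1 for outcomes in attempts_by_task.values() if any(outcomes[:k])
    )
    return solved / len(attempts_by_task)


# Hypothetical outcomes, in attempt order; True means the attempt passed.
attempts = {
    "product-price-fix-001": [False, True, True, False, True],
    "hypothetical-task-b": [True, True, True, True, True],
    "hypothetical-task-c": [False, False, False, False, True],
    "hypothetical-task-d": [False, False, False, False, False],
}

for k in (1, 3, 5):
    print(f"pass@{k} = {pass_at_k(attempts, k):.2f}")
# pass@1 = 0.25
# pass@3 = 0.50
# pass@5 = 0.75
```

Read this way, a task set whose pass@1 is already close to 1.0 is probably too easy, one whose pass@5 stays close to 0.0 offers little signal, and a visible gap between pass@1 and pass@5 suggests failures that a retry can recover from.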

Failure evidence

The point is not only to publish a score. A useful run includes a failure-mode taxonomy, trace snippets, wrong-target evidence, and grader reasoning, so that another team can decide whether the workflow package is worth training on.
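One way to act on this is a completeness check that runs before publishing. The sketch below is an assumption throughout: the text names the kinds of evidence but not any field names, so the keys in `EVIDENCE_FIELDS`, the `missing_evidence` helper, and the sample values are all hypothetical.

```python
# Hypothetical keys, one per kind of evidence named in the text.
EVIDENCE_FIELDS = (
    "failure_mode",
    "trace_snippets",
    "wrong_target_evidence",
    "grader_reasoning",
)


def missing_evidence(run: dict) -> list[str]:
    """Return the evidence fields that are absent or empty in a run record."""
    return [name for name in EVIDENCE_FIELDS if not run.get(name)]


run = {
    "run_id": "run-kca-2026-06-19-001",
    "verdict": "final state correct, process violation recorded",
    "failure_mode": "process_violation",  # hypothetical taxonomy label
    "grader_reasoning": "",  # present but empty, so it counts as missing
}
print(missing_evidence(run))
# ['trace_snippets', 'wrong_target_evidence', 'grader_reasoning']
```

Treating an empty value as missing keeps a placeholder field from passing the check without any actual evidence behind it.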

Publish an eval package: The workflow for moving from local authoring to public evidence.

Open model results: Compare public model runs.