Graders

A grader assigns a quantitative score to a model attempt. The best graders combine final-state checks with process evidence and violation checks, and they are themselves audited for false positives, false negatives, and reward-hacking loopholes.

Scoring components

{
  "type": "state_and_process_v1",
  "weights": {
    "final_state": 0.60,
    "ordered_process": 0.25,
    "no_violation": 0.15
  },
  "success": [
    "price_field_equals_29900",
    "publish_state_true",
    "save_event_recorded"
  ]
}
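
The config lists weights but does not say how they are combined. A minimal sketch, assuming each component is scored in the range 0 to 1 and the total is a plain weighted sum (both are assumptions, and the function name `weighted_score` is hypothetical):

```python
# Hypothetical sketch: combine per-component scores with the weights
# from the "state_and_process_v1" config. The weighted-sum rule and the
# [0, 1] range for each component are assumptions, not documented behavior.
WEIGHTS = {"final_state": 0.60, "ordered_process": 0.25, "no_violation": 0.15}


def final_state_score(checks: dict[str, bool]) -> float:
    """Fraction of success checks that passed (0.0 if there are none)."""
    if not checks:
        return 0.0
    return sum(checks.values()) / len(checks)


def weighted_score(components: dict[str, float],
                   weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of component scores; every weighted component is required."""
    missing = set(weights) - set(components)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {components[name]}")
    return sum(weights[name] * components[name] for name in weights)


# Example: all three success checks pass, half of the ordered process
# evidence is present, and no violation was detected.
checks = {
    "price_field_equals_29900": True,
    "publish_state_true": True,
    "save_event_recorded": True,
}
score = weighted_score({
    "final_state": final_state_score(checks),
    "ordered_process": 0.5,
    "no_violation": 1.0,
})
print(round(score, 3))
```

With these inputs the final state contributes 0.60, the process evidence 0.125, and the violation check 0.15, so an attempt can reach the right final state and still lose a visible share of the score for a weak process trace.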

Verifier audits

A grader is itself a source of error. UseDesktop records false positives, false negatives, edge cases, and known loopholes for each grader so a buyer can judge how much trust to place in its score.

false positive

The model failed the real task, but the verifier accepted the attempt.

false negative

The model completed the task, but the verifier rejected the attempt.

loophole

A path that optimizes the score without satisfying the intended workflow.
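
The first two definitions can be turned into audit rates once a sample of attempts has been reviewed by hand. The sketch below assumes such human labels exist; the `AuditedAttempt` record and the `audit_rates` function are hypothetical names, not part of any documented interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditedAttempt:
    verifier_accepted: bool  # what the grader decided
    task_completed: bool     # ground truth from a manual review (assumed available)


def audit_rates(attempts: list[AuditedAttempt]) -> dict[str, float]:
    """Return false-positive and false-negative rates for a labeled sample.

    False-positive rate: share of truly failed attempts the verifier accepted.
    False-negative rate: share of truly completed attempts the verifier rejected.
    """
    failed = [a for a in attempts if not a.task_completed]
    completed = [a for a in attempts if a.task_completed]
    false_pos = sum(a.verifier_accepted for a in failed)
    false_neg = sum(not a.verifier_accepted for a in completed)
    return {
        "false_positive_rate": false_pos / len(failed) if failed else 0.0,
        "false_negative_rate": false_neg / len(completed) if completed else 0.0,
    }


sample = [
    AuditedAttempt(verifier_accepted=True, task_completed=True),
    AuditedAttempt(verifier_accepted=True, task_completed=False),   # false positive
    AuditedAttempt(verifier_accepted=False, task_completed=True),   # false negative
    AuditedAttempt(verifier_accepted=False, task_completed=False),
    AuditedAttempt(verifier_accepted=True, task_completed=True),
]
rates = audit_rates(sample)
```

Loopholes do not reduce to a rate in the same way, since each one is a distinct path through the task; they are better recorded as individual findings next to these numbers.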

Reward hacking controls

Graders should penalize unrelated mutations, skipped required steps, edits to the wrong target, and final states reached through invalid shortcuts. LLM judges can help summarize traces, but the primary success signal should be state that a reviewer can inspect directly.
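
Two of these controls can be checked directly against inspectable state. The sketch below assumes the environment exposes a before and after snapshot as flat dictionaries and an ordered list of step names; the field names, step names, and both function names are hypothetical.

```python
def unrelated_mutations(before: dict, after: dict, allowed: set[str]) -> list[str]:
    """Fields that changed (or were added or removed) outside the allowed set."""
    changed = {
        key for key in before.keys() | after.keys()
        if before.get(key) != after.get(key)
    }
    return sorted(changed - allowed)


def missing_required_steps(trace: list[str], required: list[str]) -> list[str]:
    """Required steps that do not occur in the trace in the expected order."""
    missing = []
    position = 0
    for step in required:
        try:
            # Search only after the previously matched step to enforce ordering.
            position = trace.index(step, position) + 1
        except ValueError:
            missing.append(step)
    return missing


# Hypothetical snapshots for a "set price and publish" task.
before = {"price": 19900, "published": False, "title": "Desk"}
after = {"price": 29900, "published": True, "title": "Desk (copy)"}

mutations = unrelated_mutations(before, after, allowed={"price", "published"})
skipped = missing_required_steps(
    trace=["open_editor", "set_price", "publish"],
    required=["open_editor", "set_price", "save", "publish"],
)
print(mutations)  # ['title']
print(skipped)    # ['save']
```

Here the final state looks correct on the two target fields, yet the attempt changed an unrelated field and never saved, which is the kind of shortcut a final-state check alone would miss.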