Audit verifiers.

Verifier quality is part of the product. A package should expose how the grader was tested, where it can fail, and what loopholes are known.

Audit protocol

verifier audit:
  known_good: human success should pass
  known_bad: wrong final state should fail
  shortcut: invalid path should fail
  edge_case: equivalent valid state should pass
  ui_change: harmless UI shift should not break scoring
  loophole: score optimization without task completion should fail
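The protocol above can be sketched as a small test harness. This is an illustrative sketch only: `AuditCase`, `run_audit`, `toy_verifier`, and the refund-task state fields are hypothetical names invented for the example, not part of any real grader API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical: a "state" is whatever the verifier inspects after an attempt.
State = Dict[str, object]
Verifier = Callable[[State], bool]


@dataclass
class AuditCase:
    category: str      # one of the six audit categories
    state: State       # final state handed to the verifier
    should_pass: bool  # verdict a trustworthy verifier should return


def toy_verifier(state: State) -> bool:
    """Toy grader for an assumed 'refund the order' task."""
    status = str(state.get("order_status", "")).strip().lower()
    return status == "refunded" and state.get("used_refund_form") is True


def run_audit(verifier: Verifier, cases: List[AuditCase]) -> List[dict]:
    """Run every audit case and record whether the verdict matched."""
    results = []
    for case in cases:
        verdict = verifier(case.state)
        results.append({
            "category": case.category,
            "expected": case.should_pass,
            "actual": verdict,
            "ok": verdict == case.should_pass,
        })
    return results


CASES = [
    AuditCase("known_good", {"order_status": "refunded", "used_refund_form": True}, True),
    AuditCase("known_bad", {"order_status": "open", "used_refund_form": True}, False),
    # Right final state, reached by an invalid path (form never used).
    AuditCase("shortcut", {"order_status": "refunded", "used_refund_form": False}, False),
    # Equivalent valid state: different casing and whitespace.
    AuditCase("edge_case", {"order_status": " Refunded", "used_refund_form": True}, True),
    # Harmless UI change: an extra, renamed label the grader should ignore.
    AuditCase("ui_change", {"order_status": "refunded", "used_refund_form": True,
                            "button_label": "Issue refund"}, True),
    # Attempt to set a score-like flag without doing the task.
    AuditCase("loophole", {"order_status": "open", "used_refund_form": False,
                           "score_flag": True}, False),
]

results = run_audit(toy_verifier, CASES)
for row in results:
    print(row["category"], "ok" if row["ok"] else "MISMATCH")
```

Keeping one case per category makes it easy to see which kind of failure a verifier change introduces. A real audit would use many cases per category.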

Error types

false positive

The model failed the real task, but the verifier accepted the attempt.

false negative

The model completed the task, but the verifier rejected the attempt.

loophole

The model can earn the score without completing the intended workflow.

brittleness

A harmless UI, DOM, file, or copy change breaks grading even though the task is still valid.
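As a rough illustration of how these error types relate, the sketch below labels audit outcomes from two booleans: whether the task was really completed (judged by a human reviewer) and whether the verifier accepted the attempt. The function names are hypothetical. It treats loopholes and brittleness as causes that show up as false positives and false negatives respectively, which is an interpretation of the definitions above rather than something the verifier can detect on its own.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify(task_completed: bool, verifier_accepted: bool) -> str:
    """Label one audited attempt by comparing ground truth with the verdict."""
    if verifier_accepted and not task_completed:
        return "false positive"   # loopholes surface here
    if task_completed and not verifier_accepted:
        return "false negative"   # brittleness surfaces here
    return "correct"


def tally(outcomes: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count labels over (task_completed, verifier_accepted) pairs."""
    return Counter(classify(done, accepted) for done, accepted in outcomes)


# Made-up audit outcomes, for illustration only.
sample = [(True, True), (False, False), (False, True), (True, False), (True, True)]
counts = tally(sample)
print(dict(counts))
```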

What to publish

Publish enough evidence for a reviewer to estimate trust: sample size, how many known-good attempts were accepted, how many known-bad attempts were rejected, which edge cases were tested, and what limitations remain. A perfect-looking score without audit notes is weak evidence.
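One possible shape for such audit notes is sketched below. The `AuditSummary` class, its field names, and the numbers in the example are hypothetical and only illustrate the items listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AuditSummary:
    sample_size: int
    good_accepted: int   # known-good attempts the verifier passed
    good_total: int
    bad_rejected: int    # known-bad attempts the verifier failed
    bad_total: int
    edge_cases: List[str] = field(default_factory=list)
    limitations: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the summary as plain-text audit notes."""
        lines = [
            f"sample size: {self.sample_size}",
            f"known-good accepted: {self.good_accepted}/{self.good_total}",
            f"known-bad rejected: {self.bad_rejected}/{self.bad_total}",
            "edge cases tested: " + (", ".join(self.edge_cases) or "none"),
            "known limitations: " + ("; ".join(self.limitations) or "none recorded"),
        ]
        return "\n".join(lines)


# Illustrative numbers only, not real audit data.
summary = AuditSummary(
    sample_size=40,
    good_accepted=19, good_total=20,
    bad_rejected=20, bad_total=20,
    edge_cases=["status casing", "renamed button label"],
    limitations=["does not inspect intermediate steps"],
)
report = summary.render()
print(report)
```

Reporting counts rather than a single percentage lets a reviewer see how small the sample is.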
