build

Write a grader.

The grader is the trust boundary. If it accepts shortcuts or rejects real successes, the package teaches the wrong behavior.

Grader shape

grader:
  final_state_checks: required state or artifact
  process_checks: required events or observations
  violation_checks: forbidden mutations or shortcuts
  known_good: human completion accepted
  known_bad: shortcut or wrong state rejected
  limitations: edge cases and loopholes
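
One way to picture the shape above is as a small container of check functions. The following is a minimal sketch, not a prescribed implementation: the `Grader` class, the `State` dictionary, and the example keys (`title`, `events`, `unrelated_changes`) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical snapshot of the environment after an attempt.
State = Dict[str, Any]
Check = Callable[[State], bool]


@dataclass
class Grader:
    final_state_checks: List[Check] = field(default_factory=list)
    process_checks: List[Check] = field(default_factory=list)
    # A violation check returns True when a forbidden action occurred.
    violation_checks: List[Check] = field(default_factory=list)

    def grade(self, state: State) -> bool:
        """Pass only if every required check holds and no violation fires."""
        required = self.final_state_checks + self.process_checks
        if not all(check(state) for check in required):
            return False
        return not any(check(state) for check in self.violation_checks)


# Illustrative task: rename a document and save it without touching anything else.
grader = Grader(
    final_state_checks=[lambda s: s.get("title") == "Q3 report"],
    process_checks=[lambda s: "save" in s.get("events", [])],
    violation_checks=[lambda s: bool(s.get("unrelated_changes"))],
)
```

Keeping the three check lists separate makes it easier to report why an attempt failed, which helps when documenting limitations.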

Checks

Final state

Prefer checks on inspectable state: app state, file state, DOM state, database state, or exported artifact content.
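
As an illustration of checking exported artifact content, the sketch below reads a JSON file and compares selected fields. It assumes the task exports JSON with a top-level object; the function name `check_exported_artifact` and the field names in the usage lines are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict


def check_exported_artifact(path: Path, expected: Dict[str, Any]) -> bool:
    """Return True only if the file exists, parses, and matches every expected field."""
    if not path.is_file():
        return False
    try:
        actual = json.loads(path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        # An unreadable or malformed artifact counts as a failure, not an error.
        return False
    if not isinstance(actual, dict):
        return False
    # Require the key to be present so a missing field never matches None.
    return all(key in actual and actual[key] == value
               for key, value in expected.items())
```

A call such as `check_exported_artifact(Path("export.json"), {"status": "paid"})` inspects what was actually written rather than what the attempt claims to have done.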

Process evidence

Add required events only when the process matters for correctness or for preventing reward hacking.
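
When order matters, a process check can require that certain events appear in sequence in the recorded log. This is a minimal sketch assuming events are logged as plain strings; the function name `events_in_order` and the event names in the usage line are hypothetical.

```python
from typing import Iterable, Sequence


def events_in_order(log: Iterable[str], required: Sequence[str]) -> bool:
    """True if `required` occurs in `log` as an ordered subsequence.

    Other events may appear in between; only relative order is enforced.
    """
    remaining = iter(log)
    # Each `any` consumes the shared iterator up to the next match,
    # so later requirements can only match later log entries.
    return all(any(event == wanted for event in remaining) for wanted in required)
```

For example, `events_in_order(["open", "edit", "save"], ["edit", "save"])` holds, while a log that saves before editing does not satisfy the same requirement.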

Violations

Reject wrong-target edits, skipped saves, unrelated mutations, and states reached by invalid shortcuts.
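
Unrelated mutations can often be caught by diffing a snapshot taken before the attempt against one taken after. The sketch below assumes state can be flattened into a dictionary of comparable values; `unrelated_mutations` and the keys in the usage line are hypothetical.

```python
from typing import Any, Dict, Set

_MISSING = object()  # sentinel so added or removed keys count as changes


def unrelated_mutations(
    before: Dict[str, Any], after: Dict[str, Any], allowed: Set[str]
) -> Set[str]:
    """Return the keys that changed but were not in the allowed set."""
    changed = {
        key
        for key in before.keys() | after.keys()
        if before.get(key, _MISSING) != after.get(key, _MISSING)
    }
    return changed - allowed
```

A violation check would then reject any attempt where `unrelated_mutations(before, after, {"invoice_42.status"})` is non-empty, which covers wrong-target edits as well as stray side effects.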

Audit before running models

Run the grader on at least one known-good attempt and one known-bad attempt before any model runs. It must accept the known-good attempt and reject the known-bad one. A grader that cannot pass this audit should not be used for pass@k or training decisions.
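
The audit can be automated as a small harness that runs before any model run. This is a sketch under the assumption that the grader is exposed as a function returning True for a pass; `audit_grader` and the toy attempts in the usage lines are hypothetical.

```python
from typing import Any, Callable, List, Sequence


def audit_grader(
    grade: Callable[[Any], bool],
    known_good: Sequence[Any],
    known_bad: Sequence[Any],
) -> List[str]:
    """Return a list of audit failures; an empty list means the audit passed."""
    problems: List[str] = []
    if not known_good or not known_bad:
        problems.append("need at least one known-good and one known-bad attempt")
    for i, attempt in enumerate(known_good):
        if not grade(attempt):
            problems.append(f"known_good[{i}] was rejected")
    for i, attempt in enumerate(known_bad):
        if grade(attempt):
            problems.append(f"known_bad[{i}] was accepted")
    return problems


# Toy usage: an attempt passes only if it was saved.
toy_grade = lambda attempt: attempt.get("saved") is True
audit_result = audit_grader(toy_grade, [{"saved": True}], [{"saved": False}])
```

Returning the list of failures instead of a single boolean makes it clear whether the grader is too strict, too lenient, or both.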