Build your first package
Start with one narrow workflow. The goal is not volume. The goal is a package that another person can inspect and decide whether it is worth training on.
Before you start
Pick one real workflow
Use a task a person actually performs, not a synthetic click sequence.
Know the final state
The grader needs something inspectable: app state, file state, DOM state, or artifact output.
Keep the scope small
One package can start with one task and one grader if the evidence is clean.
Record provenance
Track source, version, privacy treatment, and whether it overlaps with public benchmarks.
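As an illustration of the provenance item above, the four tracked fields can be captured in one small record that travels with the package. This is a minimal sketch, not a Workbench format: the class name, field names, and sample values are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance record; field names are illustrative only."""
    source: str                      # where the workflow came from
    version: str                     # version of the app or data it was recorded against
    privacy_treatment: str           # what was redacted or anonymized
    overlaps_public_benchmarks: bool # True if the task also appears in a public benchmark


# Sample values are invented for the example.
record = Provenance(
    source="internal expense-report workflow",
    version="2024-05-01",
    privacy_treatment="names and amounts redacted",
    overlaps_public_benchmarks=False,
)

# Serialize with stable key order so diffs between package versions stay readable.
provenance_json = json.dumps(asdict(record), indent=2, sort_keys=True)
print(provenance_json)
```

Keeping the benchmark-overlap flag as an explicit boolean makes it hard to leave unanswered, which is the point of recording it up front.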
Path
Record the workflow in Workbench: screenshots, actions, intermediate states, artifacts, and outcome.
Remove noisy events and keep the actions that explain how the work was completed.
Write the prompt, reset seed, constraints, allowed observations, and expected final state.
Score final state first, then add process checks and violation checks where they matter.
Run one known-good attempt, one known-bad attempt, and at least one model attempt.
Package the manifest, traces, screenshots, grader output, audit notes, and provenance.
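The scoring and audit steps above can be sketched as a grader that checks final state first, then process and violation checks, and is itself audited against a known-good and a known-bad attempt. Everything here is hypothetical: the `Attempt` type, the `grade` function, and the action names are assumptions for illustration, not part of any Workbench API.

```python
from dataclasses import dataclass, field


@dataclass
class Attempt:
    """Hypothetical record of one run: the end state plus the actions taken."""
    final_state: dict
    actions: list = field(default_factory=list)


def grade(attempt, expected_state, required_actions=(), forbidden_actions=()):
    """Score final state first, then process checks, then violation checks."""
    # Final-state check: every expected key must match exactly.
    final_ok = all(attempt.final_state.get(k) == v for k, v in expected_state.items())
    # Process check: every required action must appear in the trace.
    process_ok = all(a in attempt.actions for a in required_actions)
    # Violation check: collect any forbidden action that was used.
    violations = [a for a in attempt.actions if a in forbidden_actions]
    return {
        "passed": final_ok and process_ok and not violations,
        "final_state_ok": final_ok,
        "process_ok": process_ok,
        "violations": violations,
    }


expected = {"invoice_status": "submitted", "total": 120}

# Known-good: reaches the expected state through the intended path.
known_good = Attempt(dict(expected), ["open_form", "fill_total", "submit"])
# Known-bad: reaches the same state through a shortcut the task forbids.
known_bad = Attempt(dict(expected), ["edit_database_directly"])

rules = {"required_actions": ["submit"], "forbidden_actions": ["edit_database_directly"]}
good_result = grade(known_good, expected, **rules)
bad_result = grade(known_bad, expected, **rules)

# The grader is only trustworthy if it separates the two audit attempts.
audit_ok = good_result["passed"] and not bad_result["passed"]
print(audit_ok)
```

The known-bad attempt deliberately ends in the correct final state, so a grader that checked final state alone would pass it. That is the case the process and violation checks exist to catch.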
Target output
first package:
source workflow: one real completion path
environment: resettable seed or workflow twin
task: prompt + constraints + expected state
grader: final-state check + known-good/known-bad audit
run: at least one model attempt with trace and score
export: manifest + artifacts + provenance
Stop after one package if the grader cannot distinguish a real success from a shortcut. Fix the evidence path before collecting more trajectories.
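One way to make the target output checkable is to treat its six entries as required manifest sections and refuse to export while any is missing or the grader audit has not passed. The sketch below assumes a plain dictionary manifest; the section keys mirror the target output, but the nested field names, file paths, and values are invented for the example.

```python
import json

# Section names follow the target output; the snake_case keys are an assumption.
REQUIRED_SECTIONS = ("source_workflow", "environment", "task", "grader", "run", "export")


def missing_sections(manifest):
    """Return required sections that are absent or empty."""
    return [name for name in REQUIRED_SECTIONS if not manifest.get(name)]


# Hypothetical manifest for a single-task package.
manifest = {
    "source_workflow": {"description": "submit one expense report"},
    "environment": {"kind": "resettable seed", "seed": 7},
    "task": {
        "prompt": "Submit the March expense report.",
        "constraints": ["use the web form only"],
        "expected_state": {"status": "submitted"},
    },
    "grader": {"final_state_check": True, "known_good_passed": True, "known_bad_failed": True},
    "run": {"model_attempts": 1, "trace": "traces/attempt-001.json", "score": 1.0},
    "export": {"artifacts": ["screenshots/"], "provenance": "provenance.json"},
}

missing = missing_sections(manifest)
# Ready only if every section is present and the grader separated good from bad.
ready = (
    not missing
    and manifest["grader"]["known_good_passed"]
    and manifest["grader"]["known_bad_failed"]
)
manifest_json = json.dumps(manifest, indent=2)
print("ready to export" if ready else f"blocked, missing: {missing}")
```

Gating export on the audit flags encodes the stop rule directly: a package whose grader failed the known-good or known-bad check does not ship.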