Run a model.
A model run is one attempt against one task in one environment version. Runs make the package useful for active testing, eval reports, and training comparisons.
Run record
run record:
model: provider, adapter, checkpoint, or endpoint
environment_version: resettable package version
task: prompt + seed
attempt: attempt index (of k), trace, screenshots, artifacts
score: grader output and verdict
failure_mode: where and why the attempt failed
Runner boundary
Local
Use local runs for smoke tests, grader debugging, and trace inspection.
RunPod or AWS
Use remote runners for repeatable pass@k sweeps, hosted models, and larger model comparisons.
Customer infrastructure
Use customer-side runners when data, credentials, or app state cannot leave their boundary.
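The three runner choices above can be read as a simple priority order. The sketch below is illustrative only: `Runner`, `RunPlan`, `choose_runner`, and every field name are hypothetical and not part of the package. Treating "more than one attempt per task" as the trigger for a remote sweep is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Runner(Enum):
    LOCAL = "local"
    REMOTE = "remote"      # for example RunPod or AWS
    CUSTOMER = "customer"  # customer-side infrastructure


@dataclass(frozen=True)
class RunPlan:
    # Hypothetical planning inputs; these names are assumptions for illustration.
    attempts_per_task: int = 1                  # k in a pass@k sweep
    uses_hosted_model: bool = False
    data_must_stay_with_customer: bool = False  # data, credentials, or app state


def choose_runner(plan: RunPlan) -> Runner:
    """Pick a runner following the boundary rules described above."""
    # A hard data boundary overrides every other consideration.
    if plan.data_must_stay_with_customer:
        return Runner.CUSTOMER
    # Sweeps and hosted models go to remote runners (the > 1 threshold is assumed).
    if plan.attempts_per_task > 1 or plan.uses_hosted_model:
        return Runner.REMOTE
    # Everything else: smoke tests, grader debugging, trace inspection.
    return Runner.LOCAL
```

Checking the customer boundary first reflects that it is a constraint rather than a preference: a sweep that touches customer data still has to run on the customer side.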
Minimum useful run set
For early packages, collect one human-good run, one known-bad run, and model attempts from at least one baseline model and one candidate model. The useful output is the trace evidence behind each score, not the score alone.
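A minimal sketch of how that run set could be checked, assuming a flat record shaped like the run record fields listed earlier. `RunRecord`, its field names, the `role` label, and `missing_from_run_set` are all hypothetical, not the package's actual schema. A run counts toward the set only if it carries a trace.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunRecord:
    # Field names mirror the run record outline; the exact schema is an assumption.
    model: str                         # provider, adapter, checkpoint, or endpoint
    environment_version: str
    task_prompt: str
    task_seed: int
    attempt_index: int
    trace_path: Optional[str] = None   # where the trace evidence is stored
    score: Optional[float] = None
    verdict: Optional[str] = None
    failure_mode: Optional[str] = None
    role: str = "candidate"            # hypothetical label, see REQUIRED_ROLES


REQUIRED_ROLES = ("human_good", "known_bad", "baseline", "candidate")


def missing_from_run_set(records: List[RunRecord]) -> List[str]:
    """Return the required roles that have no run with trace evidence."""
    covered = {r.role for r in records if r.trace_path}
    return [role for role in REQUIRED_ROLES if role not in covered]
```

An empty list from `missing_from_run_set` means every role in the minimum set has at least one traced run; a scored run without a trace does not count.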