Use the Evals site.
The Evals site is the public evidence layer. It shows what the environment is, what task the model receives, how the grader scores attempts, and how tested models perform.
What to open
Open environment reports to inspect source workflow, reset behavior, action space, grader contract, and known failure modes.
Open task pages to see the prompt, situated environment, expected outcome, scoring checks, and pass@k summary.
Open the models page to compare CUA adapters and model families by pass@1, pass@5, average score, and status.
Open run pages to inspect rollout traces, score, reward, verdict, and failure evidence for individual model attempts.
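As a reading aid for the pass@1 and pass@5 columns: this page does not say how the site computes pass@k, so the sketch below assumes the commonly used unbiased estimator, where n attempts were recorded for a task and c of them passed. The function name and arguments are illustrative, not part of the Evals site.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled attempts passes.

    n: attempts recorded for the task
    c: attempts that passed
    k: sample size being asked about (for example 1 or 5)

    Assumption: the site may compute pass@k differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the k-sized samples made only of failed attempts;
    # it is 0 when fewer than k attempts failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 10 attempts, 3 passed.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

Under this assumption, pass@1 equals the plain pass rate c / n, while pass@5 rewards a model that succeeds at least occasionally, which is why the two columns can rank models differently.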
How to read a task page
Read the task as a three-part contract: prompt, environment, grader. The prompt is what the model sees. The environment is where the model is situated. The grader is what turns the attempt into score and evidence.
Prompt
The instruction the model has to complete.
Env
The app state, seed, observations, and allowed actions.
Grader
The checks, score weights, success condition, and audit story.
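One way to picture the three-part contract is as a small data structure. The sketch below is hypothetical: the class names, fields, and the weighted-sum scoring rule are assumptions made for illustration and do not describe the site's actual grader.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    name: str
    weight: float
    passed: Callable[[dict], bool]  # inspects the final app state


@dataclass
class Task:
    prompt: str            # the instruction the model has to complete
    env: dict              # hypothetical: seed and initial app state
    checks: list[Check]    # the grader's scoring checks
    pass_threshold: float = 1.0  # assumed success condition


def grade(task: Task, final_state: dict) -> dict:
    """Turn one attempt's final state into a score, a verdict, and evidence."""
    results = {c.name: c.passed(final_state) for c in task.checks}
    total = sum(c.weight for c in task.checks)
    earned = sum(c.weight for c in task.checks if results[c.name])
    score = earned / total if total else 0.0
    return {
        "score": score,
        "passed": score >= task.pass_threshold,
        "evidence": results,  # per-check outcomes, useful for audit
    }


if __name__ == "__main__":
    # Made-up task used only to exercise the sketch.
    task = Task(
        prompt="Mark the invoice as paid",
        env={"seed": 7, "state": {"invoice_status": "open"}},
        checks=[
            Check("status_paid", 0.75, lambda s: s.get("invoice_status") == "paid"),
            Check("no_extra_invoices", 0.25, lambda s: s.get("invoice_count") == 1),
        ],
    )
    print(grade(task, {"invoice_status": "paid", "invoice_count": 2}))
```

Keeping the per-check results next to the score mirrors the point above: the grader produces evidence as well as a number, so a failed attempt can be traced to the specific check it missed.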
Share eval evidence
Share the most specific URL that carries the evidence you are pointing to. Use an environment page for a package-level story, a task page for prompt/env/grader review, a run page for failure evidence, and the models page for comparison.