Use the Evals site.

The Evals site is the public evidence layer. It shows what the environment is, what task the model receives, how the grader scores attempts, and how tested models perform.

What to open

Environments

Open environment reports to inspect the source workflow, reset behavior, action space, grader contract, and known failure modes.

Tasks

Open task pages to see the prompt, the situated environment, the expected outcome, the scoring checks, and the pass@k summary.

Models

Compare CUA adapters and model families by pass@1, pass@5, average score, and status.

Runs

Inspect rollout traces, score, reward, verdict, and failure evidence for individual model attempts.
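
The pass@1 and pass@5 figures on the task and models pages summarize how often a model succeeds within k attempts. This page does not state how the site computes them, so the following is only a minimal sketch of one widely used approach: the unbiased pass@k estimator, assuming n recorded attempts per task of which c passed. The function name and the sample numbers are illustrative, not taken from the site.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled attempts passes.

    n: total attempts recorded for the task (assumed input)
    c: how many of those attempts passed
    k: attempt budget, e.g. 1 for pass@1 or 5 for pass@5
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the k-subsets made only of failed attempts;
    # it is 0 when fewer than k attempts failed, giving pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical task: 10 attempts, 3 of them passed.
    print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")
    print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")
```

Under this estimator pass@1 equals the plain pass rate c / n, while pass@5 rises quickly once a model passes even occasionally, which is why the two numbers are worth reading side by side.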

How to read a task page

Read the task as a three-part contract: prompt, environment, and grader. The prompt is what the model sees. The environment is where the model is situated. The grader turns the attempt into a score and the evidence behind it.

Prompt

The instruction the model has to complete.

Environment

The app state, seed, observations, and allowed actions.

Grader

The checks, score weights, success condition, and audit story.
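
To make the grader part of the contract concrete, here is a minimal sketch of how weighted checks could combine into a score, a verdict, and failure evidence. It assumes the score is the weight-normalized share of passed checks and that the success condition is a score threshold. The `Check` class, the `grade` function, the check names, and the threshold are all hypothetical and do not describe the site's actual grader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """One grader check (hypothetical shape)."""
    name: str
    weight: float
    passed: bool


def grade(checks: list[Check], pass_threshold: float = 1.0) -> dict:
    """Combine weighted checks into a score, a verdict, and failure evidence.

    pass_threshold is an assumed success condition: the attempt passes
    when the normalized score reaches it.
    """
    total_weight = sum(c.weight for c in checks)
    if total_weight <= 0:
        raise ValueError("checks must carry positive total weight")
    # Score is the weight of passed checks divided by the total weight.
    score = sum(c.weight for c in checks if c.passed) / total_weight
    return {
        "score": score,
        "verdict": "pass" if score >= pass_threshold else "fail",
        # Names of failed checks serve as the evidence to review.
        "failed_checks": [c.name for c in checks if not c.passed],
    }


if __name__ == "__main__":
    # Illustrative checks for an imagined form-filling task.
    result = grade([
        Check("form_submitted", weight=2.0, passed=True),
        Check("correct_recipient", weight=1.0, passed=True),
        Check("file_saved", weight=1.0, passed=False),
    ])
    print(result)
```

Reading a task page with this shape in mind helps separate two questions: which checks carry the most weight, and which failed checks explain a given verdict.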

Share eval evidence

Share the most specific URL that supports your point. Use an environment page for a package-level story, a task page for prompt, environment, and grader review, a run page for failure evidence, and the models page for comparison.