Use the Evals site.

The Evals site is the public evidence layer. It shows what the environment is, what task the model receives, how the grader scores attempts, and how tested models perform.

What to open

Environments

Open environment reports to inspect source workflow, reset behavior, action space, grader contract, and known failure modes.

Tasks

Open task pages to see the prompt, situated environment, expected outcome, scoring checks, and pass@k summary.

Models

Compare CUA adapters and model families by pass@1, pass@5, average score, and status.
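This page does not state how the site computes pass@k. One common choice is the unbiased estimator over n recorded attempts with c passing attempts, sketched below for illustration. The function name `pass_at_k` and the sample counts are assumptions, not values taken from the site.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n recorded attempts, c of which passed.

    Returns the probability that at least one of k attempts, drawn
    without replacement from the n recorded attempts, is a pass.
    This is an assumed definition; the site may compute it differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k attempts failed,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 10 attempts, 3 passed.
print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")
print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")
```

With a single attempt per task, pass@1 under this definition reduces to the plain pass rate c / n.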

Runs

Inspect rollout traces, score, reward, verdict, and failure evidence for individual model attempts.

How to read a task page

Read the task as a three-part contract: prompt, environment, and grader. The prompt is what the model sees. The environment is where the model is situated. The grader turns the attempt into a score and the evidence behind it.

Prompt

The instruction the model has to complete.

Env

The app state, seed, observations, and allowed actions.

Grader

The checks, score weights, success condition, and audit story.
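The three-part contract above can be pictured as a small data model. The sketch below is illustrative only: every class name, field name, and the threshold-based success condition are assumptions, not the site's actual schema or grading rule.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Env:
    """Where the model is situated (all field names are hypothetical)."""
    app_state: str
    seed: int
    observations: tuple[str, ...]
    allowed_actions: tuple[str, ...]


@dataclass(frozen=True)
class Check:
    """One grader check and its score weight."""
    name: str
    weight: float


@dataclass(frozen=True)
class Grader:
    checks: tuple[Check, ...]
    pass_threshold: float  # assumed success condition: score >= threshold

    def grade(self, results: dict[str, bool]) -> tuple[float, bool, list[str]]:
        """Return (score, passed, names of failed checks).

        A check missing from `results` counts as failed.
        """
        total = sum(c.weight for c in self.checks)
        earned = sum(c.weight for c in self.checks if results.get(c.name, False))
        score = earned / total if total else 0.0
        failed = [c.name for c in self.checks if not results.get(c.name, False)]
        return score, score >= self.pass_threshold, failed


@dataclass(frozen=True)
class Task:
    prompt: str  # the instruction the model has to complete
    env: Env
    grader: Grader


# Hypothetical task used only to exercise the sketch.
task = Task(
    prompt="Archive the oldest invoice.",
    env=Env(
        app_state="billing-demo",
        seed=7,
        observations=("screenshot",),
        allowed_actions=("click", "type", "scroll"),
    ),
    grader=Grader(
        checks=(
            Check("invoice_archived", 5),
            Check("correct_invoice", 3),
            Check("no_extra_edits", 2),
        ),
        pass_threshold=0.75,
    ),
)

score, passed, failed = task.grader.grade(
    {"invoice_archived": True, "correct_invoice": True, "no_extra_edits": False}
)
print(score, passed, failed)
```

Returning the failed check names alongside the score mirrors the idea that a grader produces evidence as well as a number.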

Share eval evidence

Share the most specific URL. Use an environment page for a package-level story, a task page for prompt/env/grader review, a run page for failure evidence, and the models page for comparison.

Open public tasks: Browse task prompts and grader summaries.

Publish eval package: Prepare a package for public evidence.
