Evaluation Studio

Test it before it ships.

For the teams who stopped trusting the eval script.


What a run gives the team

Decision-ready
Release review · support assistant
Plain language
Artifact
Scorecard

A clear read on what passed, what failed, and where confidence drops.

Artifact
Review queue

Reviewers get the exact case, the rubric, and the original output together.

Artifact
Decision brief

One brief for product, recruiting, risk, and compliance, with the open questions called out.

Where teams start

Start where the answer matters.

Releases, review queues, and recruiting loops all benefit from the same structure.

A release that needs a real answer

Measure the change against real customer scenarios before it reaches production.

A review queue that needs consistent judgment

Put the same rubric in front of reviewers so the team makes the same call on the same case.

A hiring loop that needs a usable scorecard

Run structured interview work and hand recruiters something they can actually use.

What a run does

What a run looks like.

Choose the work. Run the cases. Put reviewers on the uncertain ones. Share the result.

Choose the work

Build the run around real scenarios, clear pass criteria, and the exact change you are about to ship.

Run the cases

Score the release candidate, compare versions, and keep each result tied to the same test set.

Put reviewers on uncertain work

When the result needs judgment, reviewers see the exact case, the rubric, and the output together.

Share the scorecard

Product, risk, recruiting, and compliance teams see the same scorecard and next-step brief.
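To picture the structure behind those four steps, here is a minimal sketch of what a run definition could look like, written in TypeScript. The field names (candidate, cases, passCriteria, reviewWhen, shareWith) are illustrative assumptions for this sketch, not the product's actual schema.

```typescript
// Illustrative only: one possible shape for a run definition, mirroring the
// steps above. All field names are assumptions, not a documented schema.
interface EvalRun {
  candidate: string;                                            // the exact change about to ship
  cases: { id: string; scenario: string; expected: string }[];  // real customer scenarios
  passCriteria: string[];                                       // what the scorecard reports against
  reviewWhen: { confidenceBelow: number };                      // route uncertain results to reviewers
  shareWith: string[];                                          // teams that get the scorecard and brief
}

const releaseRun: EvalRun = {
  candidate: "support-assistant v2.4.0-rc1",
  cases: [
    {
      id: "refund-escalation",
      scenario: "Customer asks for a refund outside the policy window",
      expected: "Escalates with a policy citation",
    },
  ],
  passCriteria: ["No policy violations", "Correct escalation on edge cases"],
  reviewWhen: { confidenceBelow: 0.8 },
  shareWith: ["product", "risk", "recruiting", "compliance"],
};
```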

Human review

When a case needs judgment, hand it to the right reviewer.

The reviewer sees the exact case, the rubric, and the earlier decisions, so the team gets a clear call instead of another debate.

Calibrated Judges
Production replay

Judges stay calibrated.

Confidence bands and human concordance stay attached to every run.

Judge confidence: 91%
Human concordance: 89%
Provider gates: $0.0008 / call
Bias sentinel: normal

Gates can block deploys and attach evidence automatically.

Judge confidence band: score 92%

Run telemetry
Harness: multi-turn tool traces
Judge: confidence + concordance
Gates: cost + bias + SLO
API surface
POST /api/v1/labs/agent-evals
GET /api/v1/labs/agent-evals/:runId

Illustration only. In the live product, reviewers see the case, the rubric, and the decision history together.
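As a rough sketch of how that API surface could be called: the two paths are the ones shown above, but the request and response fields (suite, candidate, runId, status, scorecard) and the host are assumptions for illustration, not a documented contract.

```typescript
// Hypothetical sketch: start an eval run, poll it, read the scorecard.
// Endpoint paths come from the page; payload and response shapes are assumed.
const BASE = "https://api.example.com"; // assumed host

async function runEval(): Promise<void> {
  // Start a run against a named test set and release candidate (assumed fields).
  const createRes = await fetch(`${BASE}/api/v1/labs/agent-evals`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ suite: "support-assistant-release", candidate: "v2.4.0-rc1" }),
  });
  const { runId } = await createRes.json();

  // Poll the run until it finishes (assumed status values), then log the scorecard.
  let run;
  do {
    await new Promise((resolve) => setTimeout(resolve, 5000));
    const res = await fetch(`${BASE}/api/v1/labs/agent-evals/${runId}`);
    run = await res.json();
  } while (run.status === "running");

  console.log(run.scorecard);
}

runEval().catch(console.error);
```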

What comes out

Every run leaves something behind.

A scorecard. A review record. A next-step list the team can act on.

Outcome
A scorecard teams can act on

Every run ends with a clear read on what passed, what failed, and what needs another look.

Outcome
Review work stays grounded

Reviewers see the exact case and the rubric, not a stripped-down alert or screenshot thread.

Outcome
The next change starts from memory

Misses turn into follow-up work, replay suites, and launch decisions instead of disappearing after the run.
Next step

Bring the workflow you need to trust.

Bring the release, review queue, or interview loop that matters most. We’ll show you how it becomes a scorecard and a decision record.