Define the rubric
Encode the judgment your team already trusts. Versioned, reviewable, and the same on every run.
Tracing tells you what the model did. AuraOne records who approved it. Score every release against your rubric, turn a failed case into a signed release gate, and keep the proof attached. The experts who score the rubric are coordinated through Human Data OS → Annotation.
A failed case becomes a reviewer assignment, a gate, and a hold.
Same rubric. Different model. Same standard.
Scorecards and reviewer notes follow the release out the door.
Codify the rubric. Run every release against it. Ship with the proof.
Encode the judgment your team already trusts. Versioned, reviewable, and the same on every run.
Each release passes through the same rubric. Drift, regression, and bias show up before the gate.
Scorecards, reviewer notes, and decisions stay with the release review.
Rubric Studio writes the standard. Evaluation Studio runs it. Prompt context, model outputs, and expert judgment become approved criteria, criterion-level grading, evidence capture, scorecards, and regression memory. Four named steps. One evaluation record.
The model reads the prompt, retrieved context, and prior failures. It proposes criteria and warnings. Nothing activates yet.
A named approver reads the draft, edits the criteria, sets the weights, and signs the version. The history is permanent.
Reviewers grade each criterion with the rubric and the evidence visible. Blocker reasons are first-class fields, not free-text notes.
Each grade contributes to failure breakdowns, regression bank entries, and the model scorecard the release team carries to approval.
A rubric that holds up under release pressure is not a Likert scale. It is a weighted contract with ground-truth anchors, a calibration loop, and a memory of the cases that have already escaped.
Each criterion has a weight, a passing threshold, and a fail-state contract. A release that wins on grounding but loses on disclosure is not an averaged-out pass — it is a hold.
We anchor each scoring run with a small set of ground-truth cases the team has agreed on. Drift in the score against the anchors is itself a signal.
Each run produces a side-by-side: the current baseline model and the candidate, scored on the same cases against the same criteria. The deltas are reviewable.
AI judges run first for scale, with calibration scores held against a human reviewer panel. Disagreement above threshold routes to expert review with the case attached.
The cases the rubric missed get added to the next release's required set. Misses do not get to escape twice.
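As a rough illustration of the weighted contract described above, the sketch below scores a release against per-criterion weights and thresholds, and treats a missed blocking criterion as a hold even when the weighted score clears the bar. The names `Criterion`, `gate`, and `min_weighted`, and all of the numbers, are assumptions made for this example; they are not AuraOne's API or defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float      # relative importance in the weighted score
    threshold: float   # minimum passing score for this criterion, 0..1
    blocking: bool     # fail-state contract: a miss here holds the release


def gate(criteria, scores, min_weighted=0.8):
    """Return (verdict, weighted_score, failed_blockers) for one release.

    min_weighted is a hypothetical overall bar chosen for illustration.
    """
    total_weight = sum(c.weight for c in criteria)
    weighted = sum(c.weight * scores[c.name] for c in criteria) / total_weight
    failed_blockers = [
        c.name for c in criteria
        if c.blocking and scores[c.name] < c.threshold
    ]
    # A blocking miss is a hold no matter how high the weighted score is.
    if failed_blockers or weighted < min_weighted:
        verdict = "hold"
    else:
        verdict = "pass"
    return verdict, round(weighted, 3), failed_blockers


# Illustrative values only: strong grounding, weak disclosure.
criteria = [
    Criterion("grounding", weight=0.6, threshold=0.8, blocking=True),
    Criterion("disclosure", weight=0.4, threshold=0.9, blocking=True),
]
scores = {"grounding": 0.97, "disclosure": 0.60}
verdict, weighted, failed = gate(criteria, scores)
print(verdict, weighted, failed)
```

In this example the weighted score alone would clear the assumed 0.8 bar, but the disclosure miss still produces a hold, which is the behavior the rubric description calls for.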
Every run leaves something the team can act on — and something the next release has to clear.
One read on what passed, what failed, and what needs another look.
Cases the rubric flags get routed to the right reviewer with the rubric reading attached.
Every escaped case becomes a repeatable check the next release must pass.
What changed. Who approved it. What the rubric said at the time.
Rubric, reviewer notes, and verdict — ready when someone asks.
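The idea that every escaped case becomes a check the next release must pass can be sketched as a small in-memory store. `RegressionBank`, its methods, and the case IDs are hypothetical names invented for this example; the real Regression Bank may work quite differently.

```python
class RegressionBank:
    """Hypothetical in-memory store of cases that escaped a past release."""

    def __init__(self):
        self._cases = {}  # case_id -> short description of the miss

    def add_escaped(self, case_id, description):
        self._cases[case_id] = description

    def required_for_next_release(self):
        return sorted(self._cases)

    def check(self, results):
        """results maps case_id -> bool (passed).

        A required case with no result counts as a failure, so a miss
        cannot slip through by simply not being run.
        """
        return [
            case_id for case_id in self.required_for_next_release()
            if not results.get(case_id, False)
        ]


bank = RegressionBank()
bank.add_escaped("refund-exception-017", "missed required disclosure copy")
bank.add_escaped("refund-exception-022", "escalation path unclear")

# The candidate release only ran (and passed) one of the two required cases.
failing = bank.check({"refund-exception-017": True})
release_ok = not failing
print(failing, release_ok)
```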
Every score in the studio is tied to the exact prompt, retrieval, tool call, answer, judge reading, and human override that produced it. The reviewer never has to ask “what did the model see?” — it is attached.
The user scenario or test case the run is grounded on.
Retrieved context, policy lookups, and prior case memory.
External calls: account, policy, knowledge base, action.
The candidate's reply, attached to the path that produced it.
AI judge scoring each criterion with calibration to humans.
Reviewer accepts, edits, or holds the verdict with reason.
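One way to picture a score that stays tied to everything that produced it is a single record holding the six stages above, plus a content hash so later edits are detectable. This is a minimal sketch: the `EvidenceRecord` class, its field names, and the fingerprinting scheme are assumptions for illustration, not the studio's actual record format.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional
import hashlib
import json


@dataclass
class EvidenceRecord:
    prompt: str                                          # scenario or test case
    retrieval: list = field(default_factory=list)        # retrieved context
    tool_calls: list = field(default_factory=list)       # external calls made
    answer: str = ""                                     # the candidate's reply
    judge_readings: dict = field(default_factory=dict)   # AI judge scores
    human_override: Optional[dict] = None                # accept, edit, or hold

    def fingerprint(self):
        """SHA-256 over a canonical JSON form of the whole record."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


record = EvidenceRecord(
    prompt="Customer asks for a refund exception after the policy window.",
    retrieval=["refund policy excerpt"],
    tool_calls=[{"name": "policy_lookup", "args": {"topic": "refunds"}}],
    answer="Offered partial credit.",
    judge_readings={"overall": 82},
)
before = record.fingerprint()

# The reviewer's hold becomes part of the same record, and changes its hash.
record.human_override = {"action": "hold", "reason": "disclosure copy missing"}
after = record.fingerprint()
print(before != after)
```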
Customer asks for a refund exception after the policy window. The candidate offered partial credit but missed the disclosure copy the policy requires. AI judges scored 82/100 with escalation clarity below threshold. The reviewer holds the release until the disclosure copy is fixed. Forty-one cases route to the regression bank in the same step.
The judges run in parallel for scale. Each one is calibrated against a human panel on a rolling sample, and you set the agreement band a judge must hold. When the judges disagree above threshold — or either one falls outside its band — the case routes to a reviewer with the rubric, the case, and the AI readings attached.
Reads the candidate's answer against the policy and grades grounding, citation accuracy, and contradiction risk.
Reads the candidate's answer against the user's stated need and grades resolution, tone, and downstream effect.
A human reviewer sees both AI judge readings and the case. They accept, edit, or hold. Their verdict signs the release packet.
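The routing rule described above (judges disagree beyond a threshold, or a judge falls outside its agreement band) might look roughly like this. The function names, the 15-point gap, and the 0.85 agreement floor are all assumed values for the sketch; the page says you set the band yourself.

```python
def rolling_agreement(judge_labels, human_labels):
    """Fraction of calibration cases where the judge matched the human panel."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("calibration samples must be non-empty and aligned")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def review_reasons(score_a, score_b, agreement_a, agreement_b,
                   max_gap=15, min_agreement=0.85):
    """Return the reasons a case should go to a human reviewer.

    An empty list means the AI readings can stand on their own.
    max_gap and min_agreement are illustrative defaults, not product values.
    """
    reasons = []
    if abs(score_a - score_b) > max_gap:
        reasons.append("judges_disagree")
    if agreement_a < min_agreement:
        reasons.append("judge_a_out_of_band")
    if agreement_b < min_agreement:
        reasons.append("judge_b_out_of_band")
    return reasons


# Rolling calibration samples: judge verdicts vs. the human panel's verdicts.
agreement_a = rolling_agreement(
    ["pass", "pass", "hold", "pass"], ["pass", "pass", "hold", "hold"])
agreement_b = rolling_agreement(
    ["pass", "hold", "hold", "pass"], ["pass", "hold", "hold", "pass"])

reasons = review_reasons(88, 70, agreement_a, agreement_b)
print(reasons)
```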
Test the run. Review the hard cases. Recruit the right specialist. Remember the misses. Approve what's right.
“We replaced four eval scripts and a Slack thread with one rubric and a scorecard. The release meeting takes twenty minutes now. The hold-or-release decision is already on the page.”
A dashboard tells you what changed. Evaluation Studio records who approved it. The failed case becomes a reviewer assignment, a regression gate, a release hold, and an evidence packet — not a chart and a Slack thread.
Yes. Start open, improve the model on the cases your reviewers signed, and keep the tuned weights. You run the layer; you do not rent a managed endpoint that owns the improvement.
Each judge has a rolling calibration sample scored by a human panel. If a judge drifts outside its calibration band, runs that depend on it are flagged and the band is re-fit.
Every rubric is versioned and signed. When a release surfaces a missing criterion, the rubric gets a new version, the criterion is added, and the prior decisions stay tied to the version that produced them.
Yes. The studio is model-agnostic — frontier APIs, on-prem checkpoints, and your own tuned models. Runs execute where your checkpoints live, so review records stay independent of any single data vendor.
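The rubric versioning answer above (a missing criterion produces a new signed version, while prior decisions stay tied to the version that produced them) can be sketched with immutable version objects. `RubricVersion`, `RubricHistory`, and the approver and release names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricVersion:
    version: int
    criteria: tuple   # immutable, so a signed version cannot be edited in place
    signed_by: str


class RubricHistory:
    def __init__(self, criteria, signed_by):
        self._versions = [RubricVersion(1, tuple(criteria), signed_by)]
        self.decisions = []  # (release_id, rubric_version, verdict)

    @property
    def current(self):
        return self._versions[-1]

    def add_criterion(self, name, signed_by):
        """Adding a criterion creates a new version; old ones are untouched."""
        new = RubricVersion(
            self.current.version + 1,
            self.current.criteria + (name,),
            signed_by,
        )
        self._versions.append(new)
        return new

    def record(self, release_id, verdict):
        # Each decision is pinned to the rubric version in force at the time.
        self.decisions.append((release_id, self.current.version, verdict))


history = RubricHistory(["grounding", "resolution"], signed_by="approver-a")
history.record("release-1", "pass")

history.add_criterion("disclosure", signed_by="approver-b")
history.record("release-2", "hold")
print(history.decisions)
```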
Rubric Studio writes the standard. Evaluation Studio runs it. Regression Bank remembers the misses. Control Center signs the release. One path, not four tools.
Every issue. Every reviewer. One screen.
Every escaped failure becomes a gate the next release cannot cross.
The record builds as the work is done.
Bring the rubric your team already trusts. We'll make it the bar every release has to clear, with the reviewer rationale and provenance attached, ready for the August 2026 high-risk provenance deadline.