The AI Interviewer Is the New Funnel - But Who Audits the Interviewer?
AI interviews are becoming the front door to the specialist economy.
That makes sense. Frontier labs need more specialists than human recruiters can screen one call at a time. They need domain experts, code reviewers, lawyers, clinicians, analysts, translators, and operators. The funnel has to scale.
Mercor has described Monty, its AI interviewer, as running large daily interview volume across many job categories. The technical achievement is real.
The governance question is larger: who audits the interviewer?
Scale is not the hard part forever
The first problem is keeping the interview running. Low latency. Good turn-taking. Reliable speech recognition. Resilient infrastructure. Useful personalization.
That is hard engineering.
But once the system works, the next problem becomes more important. Is the interview measuring the right capability? Is the scoring calibrated? Does the interview overfit to polished speakers? Does it underrate candidates who are strong on task and weaker in conversation? Does it drift when prompts change? Does performance in the interview predict performance in the actual annotation, review, or red-team workflow?
If the answers are unknown, the AI interview is not a qualification system. It is a fast filter.
Fast filters can scale mistakes.
What an audited AI interview needs
An audited interview needs a record across five stages.
First, the role brief. What work is this person actually being evaluated for?
Second, the rubric. What skills, judgment patterns, and domain knowledge are being tested?
Third, the conversation trace. What did the interviewer ask, what did the candidate answer, and where did the system probe deeper?
Fourth, the score and rationale. Why did the candidate pass, fail, or route to a human reviewer?
Fifth, the downstream outcome. Did the candidate perform well once assigned to real tasks?
The fifth stage is where most systems are weak. An interview can feel good and still fail to predict performance. The only way to know is to connect the interview record to the workforce record.
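A minimal sketch of what that connected record could look like as a data structure, written in Python. The field names are illustrative assumptions, not anyone's actual schema; the point is that all five stages live on one record, including the downstream outcome that only arrives after the candidate has done real work.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class InterviewAuditRecord:
    # Stage 1: the role brief the candidate is actually being evaluated for
    role_brief: str
    # Stage 2: the skills, judgment patterns, and domain knowledge being tested
    rubric: list[str]
    # Stage 3: the conversation trace, including where the system probed deeper
    transcript: list[dict]
    # Stage 4: the score, the rationale, and whether a human reviewer was pulled in
    score: float
    rationale: str
    routed_to_human: bool
    # Stage 5: downstream outcome, filled in only after the candidate works real tasks
    downstream_task_quality: float | None = None
```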
Why this matters for AI labs
AI labs are not hiring specialists for general employment. They are assigning them to work that changes model behavior.
A weak reviewer can pollute a preference dataset. A miscalibrated domain expert can create false confidence. A biased screening process can remove the very specialists who catch edge cases. A rubric that does not match downstream work can fill the roster with people who interview well and review poorly.
That is why AI interviews have to be part of a Human Data OS, not a standalone hiring widget.
Cleo can source and rank candidates. AI Interviews can qualify them. Workforce can track calibration and task performance after hire. Annotation can attach reviewer identity and quality to the work. Regression Bank can show whether reviewers helped catch failures that mattered.
The interview becomes the beginning of the record, not the end of the funnel.
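A sketch of what "beginning of the record" can mean in practice: one candidate identifier threads the interview result into the workforce and annotation records, so downstream quality traces back to the interview that admitted the person. The dictionaries and field names below are illustrative stand-ins, not the actual schemas or APIs of the products named above.

```python
def build_specialist_record(candidate_id: str,
                            interview: dict,
                            workforce: dict,
                            annotations: list[dict]) -> dict:
    # Stitch one specialist's history together on a single candidate_id,
    # so an interview score can later be audited against real output.
    return {
        "candidate_id": candidate_id,
        "interview_score": interview["score"],                  # qualification stage
        "calibration": workforce.get("calibration"),            # tracked after hire
        "task_quality": [a["quality"] for a in annotations],    # reviewer-attributed work
        "caught_regressions": sum(a.get("caught_regression", 0) for a in annotations),
    }
```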
What to do this quarter
If you run AI interviews, audit three things.
First, score stability. Run similar or repeated candidate profiles through the interview and check whether the scores agree closely enough to trust. If near-identical profiles get meaningfully different scores, the signal is noise. The sketch below covers this check together with the third.
Second, reviewer override. Track cases where human hiring leads disagree with the AI interview and force those disagreements into a rubric review.
Third, downstream validity. Compare interview scores to actual task quality after the candidate is live. If the correlation is weak, the interview is measuring the wrong thing.
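A minimal sketch of the first and third checks, assuming interview scores and post-hire quality metrics can be exported as plain Python values. The function names and inputs are illustrative, and scipy is used only for the rank correlation.

```python
from statistics import pstdev
from scipy.stats import spearmanr

def score_stability(runs_by_profile: dict[str, list[float]]) -> dict[str, float]:
    # Spread of interview scores when near-identical candidate profiles are
    # re-run through the interviewer. Large spreads mean the score is noise.
    return {profile: pstdev(scores) for profile, scores in runs_by_profile.items()}

def downstream_validity(interview_scores: list[float], task_quality: list[float]) -> float:
    # Rank correlation between interview score and post-hire task quality,
    # aligned by candidate. A weak correlation means the interview is
    # measuring the wrong thing.
    rho, _p_value = spearmanr(interview_scores, task_quality)
    return float(rho)
```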
The AI interviewer is here to stay. It should be. The specialist economy needs scale.
But the interviewer cannot be the only judge. It has to be part of a governed system that proves whether the interview worked.