Quality assurance review
Reviewers grade past AI replies on accuracy, voice, and helpfulness. Their notes feed the learned library and tune the AI.
QA review is post-hoc grading of AI replies. Reviewers pull a sample of past conversations, score them on a few dimensions, and leave notes. The notes feed the Learned Q&A library and tighten AI calibration over time.
It is the closest thing in Ochre to "training the AI". You are not changing the model. You are correcting your team's learned answers.
Why QA review matters
Three reasons:
- Catches drift. Voice slips, the brain goes stale, edge cases multiply. QA finds them.
- Builds trust. A team that reviews 50 AI replies a week knows when to trust auto-send.
- Makes the AI better. Reviewer notes feed the library, which feeds future drafts.
Without QA, AI quality is a blind spot. With QA, it is a tracked metric.
What gets reviewed
You decide. Common queues:
- Heavily edited drafts. Replies where the agent rewrote a lot before sending. Surfaces voice and accuracy issues.
- Auto-sent replies. Replies the AI sent without review. Worth spot-checking even when CSAT is good.
- Low-CSAT conversations. Any conversation rated 3 or below.
- Random sample. A weekly random sample for general health.
Set the queues from the QA card on the AI overview page (/ai).
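If it helps to think about the queues structurally, here is a minimal sketch of how they might be modeled, assuming a hypothetical config shape. Ochre's real queues are set from the UI, not from code.

```typescript
// Hypothetical queue definitions -- illustrative only. Ochre's real
// queues are configured from the QA card at /ai, not in code.
type QaQueue =
  | { kind: "heavily-edited"; minEditRatio: number } // fraction of draft rewritten
  | { kind: "auto-sent" }
  | { kind: "low-csat"; maxRating: number }          // e.g. 3 or below
  | { kind: "random-sample"; perWeek: number };

const queues: QaQueue[] = [
  { kind: "heavily-edited", minEditRatio: 0.4 }, // assumed threshold
  { kind: "low-csat", maxRating: 3 },
  { kind: "random-sample", perWeek: 50 },
];
```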
How review works
A reviewer opens the QA tab, picks a queue, and gets one conversation at a time. For each AI reply, they rate:
- Accuracy. Is the factual content correct?
- Voice. Does it sound like our team?
- Helpfulness. Did it actually answer the question?
- Length. Was it the right length?
Each on a 1-5 scale. Optional notes.
Reviewers can also:
- Mark "promote to library". Sends the Q&A pair to the Learned Q&A library's candidate queue.
- Mark "fix in brain". Flags an article in The brain: KB graph for editing.
- Mark "ignore". Skip this one.
Review takes about 30 seconds per reply once you are warmed up.
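Under the hood, a single review boils down to a small record. Here is a sketch of what one might look like; the field names and shape are hypothetical, not Ochre's actual schema.

```typescript
// Hypothetical record for one reviewed AI reply. Field names are
// illustrative, not Ochre's actual schema.
type Rating = 1 | 2 | 3 | 4 | 5;

interface QaReview {
  replyId: string;
  reviewerId: string;
  scores: {
    accuracy: Rating;    // is the factual content correct?
    voice: Rating;       // does it sound like our team?
    helpfulness: Rating; // did it actually answer the question?
    length: Rating;      // was it the right length?
  };
  notes?: string;
  action?: "promote-to-library" | "fix-in-brain" | "ignore";
}
```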
Who reviews
Owners, Admins, and Agents marked as QA reviewers. Light agents can also be QA reviewers if you give them the role explicitly.
A small team can run QA effectively. One reviewer doing 30 minutes a day is enough for most workspaces.
What the data does
Reviewer ratings feed three things:
- The QA dashboard. Median scores by reviewer, by topic, by channel, over time.
- The learned library. High-rated replies get nominated for promotion.
- AI calibration. Patterns across reviews adjust how the AI weights similar future tickets.
The third is gentle, not aggressive. We do not retrain models. We adjust retrieval weights and confidence calibration based on signal density.
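To make "gentle" concrete, here is one plausible shape for that adjustment: nudge a source's retrieval weight in proportion to its average review score, and only once enough reviews have accumulated. This is a sketch of the idea, not Ochre's implementation, and every threshold in it is an assumption.

```typescript
// Sketch of a gentle calibration update -- the thresholds and learning
// rate are assumptions, not Ochre's actual values.
function adjustRetrievalWeight(
  currentWeight: number,
  avgScore: number,    // mean review score (1-5) for replies using this source
  reviewCount: number, // signal density: how many reviews back that average
  minReviews = 20,     // below this, do nothing -- too little signal
  learningRate = 0.05,
): number {
  if (reviewCount < minReviews) return currentWeight;
  const signal = (avgScore - 3) / 2;         // map 1..5 onto -1..+1
  const next = currentWeight * (1 + learningRate * signal);
  return Math.min(Math.max(next, 0.1), 2.0); // clamp to a sane range
}
```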
QA cadence
Most teams review weekly:
- Monday: 30 minutes through the heavily-edited queue.
- Friday: 30 minutes through the random sample.
Adjust based on volume. A workspace handling 100,000 conversations a month needs a heavier review cadence than one handling 1,000.
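For a rule of thumb on how many replies to review, square-root scaling keeps effort sane as volume grows. The heuristic below is an illustration, not an Ochre rule.

```typescript
// A rough heuristic for weekly QA sample size -- an illustration,
// not an Ochre rule. Square-root scaling keeps review effort sane.
function weeklySampleSize(conversationsPerMonth: number): number {
  const weeklyVolume = conversationsPerMonth / 4.33; // avg weeks per month
  return Math.max(20, Math.round(2 * Math.sqrt(weeklyVolume)));
}
// weeklySampleSize(1_000)   -> ~30 replies a week
// weeklySampleSize(100_000) -> ~304 replies a week
```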
Disagreement and escalation
When two reviewers rate the same reply differently, the system flags it. Those are useful artifacts. Disagreement usually points to an unclear policy ("should we be cheerful or terse on apology emails?").
Resolve disagreements at the team level. The flag closes when both reviewers update their scores or one defers.
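The flagging rule itself is simple in principle. Here is one way it could work, assuming a 2-point divergence on any dimension triggers the flag; the threshold is an assumption, not documented behavior.

```typescript
// Illustrative disagreement check -- the 2-point threshold is an
// assumption, not Ochre's documented behavior.
type Scores = {
  accuracy: number;
  voice: number;
  helpfulness: number;
  length: number;
};

function isDisagreement(a: Scores, b: Scores, threshold = 2): boolean {
  // Flag when the two reviewers diverge on any single dimension.
  return (Object.keys(a) as (keyof Scores)[]).some(
    (dim) => Math.abs(a[dim] - b[dim]) >= threshold,
  );
}
```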
QA and CSAT together
QA is internal review. CSAT is customer feedback. They correlate but are not the same.
A reply can score 5/5 on QA and 1/5 on CSAT (the customer was already unhappy when they wrote in). And vice versa: a sloppy reply that got the right answer fast can score high on CSAT and low on QA.
Use both. CSAT is the leading indicator of customer outcome. QA is the leading indicator of AI drift.
Reports
The QA dashboard surfaces:
- Average accuracy, voice, helpfulness, and length scores over time.
- Topics with the lowest scores (where to focus).
- Reviewers' calibration with each other.
- Replies marked for library promotion.
- Replies marked for brain fix.
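On reviewer calibration: one simple way to measure it is the mean absolute score difference on replies both reviewers rated. The sketch below shows the shape of that metric; it is not necessarily the one the dashboard uses.

```typescript
// Mean absolute score difference between two reviewers on the same
// replies. 0 = perfect agreement. A sketch, not necessarily the
// dashboard's exact metric.
function calibrationGap(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("need paired, non-empty score lists");
  }
  const total = a.reduce((sum, score, i) => sum + Math.abs(score - b[i]), 0);
  return total / a.length;
}
// calibrationGap([5, 4, 3], [4, 4, 5]) -> 1  (drifting apart)
```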
What QA review is not
- Not a customer feedback tool. Use CSAT.
- Not a coaching tool for human agents. Use the Agent leaderboard for that.
- Not retraining. We do not fine-tune models. The signal feeds calibration and the library.
Recommended setup
- QA queue: heavily-edited drafts.
- Reviewers: 1 to 3 people on a rotating schedule.
- Cadence: 30 minutes, twice a week.
- Review dimensions: all four (accuracy, voice, helpfulness, length).
- Promote library candidates aggressively.
- Re-check QA dashboard monthly to spot trends.