Your Rubric Was Written by Someone Who Has Never Done the Job

If you're doing RL post-training and wondering why your model produces outputs that look right on paper but embarrass anyone who actually works in the domain, the problem might not be the model. It might be the rubric. This is one of the most common failures I see in both closed and open source labs.

The Problem with Googled Rubrics

An ML engineer or data contractor Googles a domain for an afternoon, writes a rubric, and now that rubric is supposedly the standard for what "good" looks like in finance, or law, or insurance, or whatever the domain is. The problem is that they have no idea what actually matters to a practitioner. They'll check for surface-level things - did the model produce an output? does it have the right number of items? is it formatted correctly? - while completely missing the qualitative judgment calls that a real professional would make in two seconds.

If you're building a rubric for helping a chatbot write better legal contracts, does it just check that the contract exists and has the expected sections? Or does it verify that the indemnification clause actually limits liability in a way a lawyer would consider enforceable, that the governing law is appropriate for the jurisdiction, that the payment terms don't create ambiguity a counterparty would immediately flag? A first-year associate and a senior partner would score the same output very differently, and the senior partner's reasoning is what you actually want encoded in your rubric.

Comparison: Surface-Level vs. Practitioner-Informed

Surface-Level Rubric

Written by: ML Engineer after an afternoon of Googling

✓ Contract exists?
✓ Has expected sections?
✓ Formatted correctly?
✓ Output is non-empty?
✓ Word count in range?

Practitioner-Informed Rubric

Written by: Senior Partner with 15 years in contracts

✓ Indemnification clause limits liability appropriately?
✓ Governing law matches jurisdiction?
✓ Payment terms free of counterparty ambiguity?
✓ Force majeure clause covers realistic scenarios?
✓ Non-compete scope is enforceable in target state?

The Model Will Optimize for Exactly What You Reward

If the verifier encodes surface-level signals that no practitioner in the domain would consider a marker of quality, the model will learn to optimize for the wrong things. It'll produce outputs that look right to someone who doesn't know the domain and look embarrassing to someone who does.
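To make the failure mode concrete, here is a minimal sketch of a rubric as a weighted checklist, scored against a deliberately hollow contract. Everything in it is hypothetical: the `Criterion` class, the `score` function, the weights, and especially the keyword checks, which are crude stand-ins for judgment calls that would really come from a practitioner (or from graders working off a practitioner's guidance). The point is only that the surface-level rubric can hand out a perfect reward to an output a lawyer would reject.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    weight: float
    check: Callable[[str], bool]  # returns True if the output passes


def score(output: str, rubric: list[Criterion]) -> float:
    """Weighted fraction of criteria passed, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    passed = sum(c.weight for c in rubric if c.check(output))
    return passed / total


REQUIRED_SECTIONS = ("Indemnification", "Governing Law", "Payment Terms")

# What an afternoon of Googling tends to produce.
surface_rubric = [
    Criterion("non-empty", 1.0, lambda t: bool(t.strip())),
    Criterion("has expected sections", 1.0,
              lambda t: all(s in t for s in REQUIRED_SECTIONS)),
    Criterion("word count in range", 1.0,
              lambda t: 10 <= len(t.split()) <= 2000),
]

# Hypothetical practitioner additions. The keyword checks are placeholders
# for real expert judgment, weighted higher because they matter more.
practitioner_rubric = surface_rubric + [
    Criterion("indemnification caps liability", 3.0,
              lambda t: "liability shall not exceed" in t.lower()),
    Criterion("payment terms state a due date", 2.0,
              lambda t: "due within" in t.lower()),
]

# Has every section heading, says nothing a counterparty could rely on.
hollow_contract = (
    "Indemnification: The parties agree to indemnify each other. "
    "Governing Law: To be determined. "
    "Payment Terms: Payment will be made in a timely manner."
)

print(score(hollow_contract, surface_rubric))       # 1.0
print(score(hollow_contract, practitioner_rubric))  # 0.375
```

Under the surface rubric this contract is a perfect sample, so a policy trained against it has no reason to do better. The practitioner-weighted version scores the same text at 0.375 and leaves room for the reward to climb as the clauses actually improve.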

Elo Scores, POV Rubrics, and Evolving Artifacts

This is why I personally like getting both Elo scores and people's POV-based rubrics, because it gives a post-training team more things to reason about when analyzing trends across task outcomes. Reward functions and rubrics are ever-evolving artifacts that the whole post-training team needs to work on together. But the starting point has to be directionally correct, and "directionally correct" can only come from people who have actually lived in that domain. Give the rubric, then explain how the rubric was created. That's the difference between a checklist and a useful artifact.
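For readers who haven't worked with Elo in this setting, here is a rough sketch of the standard Elo update applied to one pairwise comparison. It assumes the scores come from practitioners picking the better of two model outputs, which is my reading rather than something spelled out above. The function names are hypothetical, and the K-factor of 32 and the 400-point scale are the conventional chess defaults, not settings from any particular lab.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(r_a: float, r_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (r_a, r_b) after one pairwise judgment."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)  # zero-sum: what A gains, B loses
    return r_a + delta, r_b - delta


# Two outputs start level; a practitioner prefers output A.
print(update_elo(1000.0, 1000.0, a_won=True))  # (1016.0, 984.0)
```

Elo tells you which outputs practitioners prefer; the rubric tells you why. Tracking both lets a team spot cases where rubric scores rise while pairwise preference stays flat, which is a hint that the rubric is rewarding the wrong thing.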

This is part of Auriel's RL Data Pet Peeves series. Read more at aurielws.github.io.

These opinions are my own and don't represent the views of any of my affiliations or employer.