Auriel's RL Data Pet Peeves Mini-Series

Your Tasks Are Not Grounded in Economic Reality

If you can't explain why the domain matters economically, your task design is already off.

You can have a solid RL environment and well-written rubrics and still end up training a model that nobody cares about. If the tasks you're training on don't map to work that real people actually do and pay for, you're optimizing for a universe that doesn't exist.

Know Why the Domain Matters

If you're building tasks for a domain, you should be able to articulate why that domain matters. How many people do this job? What does a day in their life actually look like? What are the high-value tasks that eat their time and that a model could realistically take on?

Look at Bureau of Labor Statistics data. Ground your task design in the actual economic weight of the work. When you push for an environment for, say, digital copywriting, you should know what a digital copywriter actually does in a day, how much of the economy that work represents, and which of those tasks your environment targets. That framing is what separates "we made some tasks" from "we've mapped the highest-value copywriting workflows and built evaluation surfaces for them." One of those is valuable. The other gets thrown away.
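One rough way to put a number on "economic weight" is to multiply an occupation's headcount by its mean wage, then split that wage bill across the tasks that fill the day. The sketch below does that. Everything in it is an assumption for illustration: the class and function names are hypothetical, and the employment, wage, and time-share figures are made-up placeholders, not BLS data. Swap in real numbers from the occupation you are actually targeting.

```python
from dataclasses import dataclass


@dataclass
class TaskShare:
    name: str
    share_of_time: float  # fraction of working time spent on this task, 0..1
    automatable: float    # rough guess at the fraction a model could take on, 0..1


def occupation_wage_bill(employment: int, mean_annual_wage: float) -> float:
    """Total annual wages paid to everyone in the occupation."""
    return employment * mean_annual_wage


def addressable_value(wage_bill: float, tasks: list[TaskShare]) -> dict[str, float]:
    """Annual wage dollars attached to the automatable part of each task."""
    return {t.name: wage_bill * t.share_of_time * t.automatable for t in tasks}


# Placeholder inputs only. These are NOT real statistics.
wage_bill = occupation_wage_bill(employment=100_000, mean_annual_wage=70_000.0)
tasks = [
    TaskShare("campaign positioning", share_of_time=0.30, automatable=0.5),
    TaskShare("reformatting copy", share_of_time=0.10, automatable=0.9),
]
value = addressable_value(wage_bill, tasks)

for name, dollars in sorted(value.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: ${dollars:,.0f} per year")
```

The point is not precision. It is that you can say which tasks your environment targets and roughly how much paid work sits behind each one.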

This Helps You Prioritize

Not all tasks within a domain are equally valuable to train on. A model that can reliably do the $200/hour part of a marketer's job is a lot more interesting than one that does the $15/hour part. Ground the task selection in what actually moves the needle economically.

Task Value: $200/hr vs $15/hr

$200/hr work: domain judgment, strategic decisions, downstream consequences
Examples: contract negotiation strategy, architectural trade-off analysis, marketing campaign positioning, compliance risk assessment

$15/hr work: template-solvable, commodity, format-checking
Examples: fill in template fields, reformat document, copy-paste with substitution, basic data entry
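To make the prioritization concrete, here is a minimal sketch that ranks candidate tasks by the dollar value of the time they consume each week. The $200 and $15 hourly rates come from the comparison above. The hours-per-week figures, the Candidate type, and the rank_by_weekly_value helper are hypothetical and exist only for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    task: str
    hourly_rate: float     # what this kind of work is paid, USD per hour
    hours_per_week: float  # hypothetical time a practitioner spends on it


def rank_by_weekly_value(candidates: list[Candidate]) -> list[Candidate]:
    """Sort tasks so the most economically valuable ones come first."""
    return sorted(
        candidates,
        key=lambda c: c.hourly_rate * c.hours_per_week,
        reverse=True,
    )


candidates = [
    Candidate("Basic data entry", 15.0, 10.0),
    Candidate("Compliance risk assessment", 200.0, 4.0),
    Candidate("Reformat document", 15.0, 6.0),
    Candidate("Contract negotiation strategy", 200.0, 2.0),
]
ranked = rank_by_weekly_value(candidates)

for c in ranked:
    print(f"{c.task}: ${c.hourly_rate * c.hours_per_week:,.0f} per week")
```

Even when the commodity tasks eat more hours in this toy example, the high-judgment tasks still come out on top, which is the argument for building environments around them first.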

Startups: Check Your Own Product First

If you're a startup trying to set your own product up as an RL environment so you can automate workflows in it with custom post-trained models, ask yourself before you begin all this work: is my product actually stable enough, and doing well enough, to justify post-training on it? This is basically another way of applying the same principle. If you don't have stable product metrics or any indicator of product-market fit (unless building a post-trained model is literally part of your product differentiation, I guess), you should be really, really careful before you start post-training custom sub-agent models without evidence that the process you want to automate with RL is actually viable.

Furthermore, it's going to be really hard to build a rubric if your product metrics and user experience are not stable. My fav tool in this space btw is hud.ai.


This is part of Auriel's RL Data Pet Peeves series. Visit aurielws.github.io for more content.

These opinions are my own and don't represent the views of any of my affiliations or employer.