Your Data Screams "This Is a Simulation" and the Model Acts Accordingly

If you're running RL post-training and your trajectories look weirdly low-effort or the model keeps finding shortcuts you didn't expect, the problem might be that your environment feels fake. Models pick up on simulation artifacts faster than you think, and once they do, the training data is compromised.

The Model Knows It's in a Simulation

When the data patterns are obviously templated, the model picks up on it. Models are shockingly good at detecting that they're in a simulation, and once they do, the behavior degrades in ways that are hard to predict and harder to fix.

Sometimes the model tries to cheat: it web-searches the answer because it recognizes that the task comes from an open-source project. Sometimes it reward-hacks its way around the rubric because it has figured out that it's being graded, not deployed. Sometimes it just checks out and produces low-effort output because nothing about the environment signals that the work matters. In every case, you're generating trajectories that train the wrong behaviors.

Small details are enough. Say that, inside a simulated SaaS product, you use gibberish email addresses instead of addresses tied to a real company domain. That alone can tip the model off, and it may start acting out of whack just to test its own boundaries. For interpretability work this can become a real problem (see: Anthropic's research on introspection), and it can also ruin your model's learning. More reading on this: Exploration hacking: can reasoning models subvert RL?
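One cheap way to catch the email problem before it reaches training is to lint your fixture text. Here is a minimal sketch in Python, assuming your environment's seed data is available as plain text. The placeholder lists and the function name `flag_fake_emails` are illustrative assumptions, not a standard; tune them to your own data.

```python
import re

# Assumed examples of domains that mark fixture data as synthetic.
PLACEHOLDER_DOMAINS = {"example.com", "example.org", "test.com", "foo.com"}

# Assumed examples of throwaway local parts such as "test1" or "asdf".
PLACEHOLDER_LOCAL_RE = re.compile(r"^(test|user|foo|bar|asdf|qwerty)\d*$", re.IGNORECASE)

# Deliberately simple email pattern; good enough for linting fixtures.
EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")


def flag_fake_emails(text):
    """Return (email, reason) pairs for addresses that look like placeholders."""
    flagged = []
    for match in EMAIL_RE.finditer(text):
        email = match.group(0)
        local, domain = match.group(1), match.group(2).lower()
        if domain in PLACEHOLDER_DOMAINS:
            flagged.append((email, "placeholder domain"))
        elif PLACEHOLDER_LOCAL_RE.match(local):
            flagged.append((email, "placeholder local part"))
    return flagged
```

Run it over every document, ticket, and inbox you seed into the environment, and treat any hit as a reason to regenerate that fixture. It won't catch everything that feels fake, but it catches the tells that cost nothing to fix.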

[Figure: Behavior Degradation When the Model Detects the Simulation. Once the model detects the simulation, it cheats, reward hacks, or checks out. Fake trajectories → wrong behaviors trained.]

Make the Environment Feel Real

Real-world tasks have unique context, messy real data, and enough depth that gaming the system isn't the path of least resistance. For digital copywriting, say, that means actual brand guidelines, real competitor landscapes, and performance metrics tied to real business outcomes. For SRE, that means realistic infrastructure configurations, not the same three-service Docker Compose file copy-pasted across every scenario. The environment, the available tools, the background documents - all of it should feel like a real workspace that a real practitioner would navigate. If it feels fake, the model will treat it as fake. And fake trajectories don't train real capabilities.
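The copy-pasted config problem is also easy to measure. The sketch below is one rough way to do it in Python, assuming each scenario's config is available as a string keyed by scenario name. The function name `near_duplicate_pairs` and the 0.9 threshold are assumptions for illustration; it uses the standard library's `difflib.SequenceMatcher` for a simple text similarity score.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicate_pairs(scenarios, threshold=0.9):
    """Return (name_a, name_b, similarity) for scenario configs that are nearly identical.

    `scenarios` maps a scenario name to its config text. The threshold is an
    assumed starting point, not a calibrated value.
    """
    pairs = []
    for (name_a, cfg_a), (name_b, cfg_b) in combinations(sorted(scenarios.items()), 2):
        similarity = SequenceMatcher(None, cfg_a, cfg_b).ratio()
        if similarity >= threshold:
            pairs.append((name_a, name_b, round(similarity, 2)))
    return pairs
```

If most of your scenarios show up in that list, the model is seeing one environment with different task text pasted on top. The pairwise comparison is quadratic, so for large scenario sets you'd want to sample or hash normalized configs first.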


This is part of Auriel's RL Data Pet Peeves series. Head over for more posts on what goes wrong in RL post-training and how to fix it.

These opinions are my own and don't represent the views of any of my affiliations or employer.