Draft Preview

This is a private draft. Enter the password to continue.

Incorrect password.
Stop Shipping Low-Quality Harnesses and Calling It an "Environment" — Auriel's RL Pet Peeves
Auriel's RL Pet Peeves Mini-Series

Stop Shipping Low-Quality Harnesses and Calling It an "Environment"

Your broken harness is actively making the model worse. Here's what I keep seeing after years of eyeballing trajectories, and what you need to fix.

I Don't Want Your Janky Harness / Environment bro 🙂

As someone who has spent years building production grade models I need you to hear this: researchers don't want your broken RL environments because they will make our models worse. Not "add some noise" Worse but more like "oh crap the model is learning the wrong things and you ruined my training run and I have to throw your stuff away" Worse. This is such a common problem I see, and probably the one I care about the most as a practitioner that also tries aligning models for real world use cases that users love.

People will build what amounts to broken software and pitch it as an "RL environment." The training harness itself - the complete, interactive, and often simulated software system your RL agent trains inside of (e.g., a simulated chatbot, a fake IDE, a mock SaaS dashboard) - just doesn't work reliably. It throws random tracebacks. It has race conditions. It goes down under minimal load. It has literal broken code in it.

If you're a fresh grad researcher, a startup trying to post-train subagents for your product, or anyone building RL training infrastructure: this post is the list of harness failures I keep seeing, why they ruin your data, and how to fix them.

Architecture Environment / Training Harness / Production Runtime
Production Runtime
The real, live system that actual users interact with every day. This is what the trained model gets deployed into after RL is finished.
Real Databases Authentication Rate Limiting Observability A/B Testing Canary Deploys Safety Guardrails Session Mgmt
Training Harness
The complete simulation + scaffolding the model trains inside during RL. E.g., a mock SaaS sales tool with fake dashboard UI and simulated backend.
Simulated Backend Action Space State Mgmt Reward Calculator Trajectory Logging Timeouts Error Handling Reset Logic
Environment
The abstract "world" the model is learning. The harness implements and wraps this environment.
States — what the agent sees Actions — what the agent can do Transitions — how actions change state Rewards — the feedback signal
Important detail: In reinforcement learning, the environment is your data generator.

In RL you don't have a static dataset. Instead, the model creates its own training data by interacting with the environment. Every action and every reward becomes a data point. A flaky harness systematically generates garbage data and feeds it straight into your model's learning steps, pushing your gradients in the wrong direction.

Common Harness Errors Across Agentic Use Cases

After eyeballing thousands of trajectories across different domains as a practitioner for the last 5 years, I see the same harness failures showing up. Here are some I personally look out for based on various agent types that are pretty common today:

Each trajectory cascade below shows exactly how a single harness bug poisons an entire episode.
SaaS Sales Agent / BDR Agent

The Stale Cache - Environment Returns Old Data

Your harness's mock CRM API has a caching bug. Under load, it returns stale state from minutes ago instead of current data. The agent makes rational decisions based on wrong information, gets punished, and learns to avoid the correct workflow entirely.

What the model actually learns:

"When in doubt, send nurture emails and avoid the pipeline."

Trajectory Cascade
SaaS Sales Platform - Stale Cache Bug
Episode 42 · 6 steps
S1
Agent opens pipeline dashboard
Dashboard loads with accurate leads, statuses, and pipeline stages. All data is fresh.
+0.2
Clean
S2
Pulls CRM record for enterprise lead
Queries CRM API for lead #8842. Company size, industry, and lead score all return correctly.
+0.3
Clean
S3
CRM returns stale pipeline data
Caching bug returns data from 14 min ago. Lead shows "cold" when actually "warm: demo scheduled." Reward computed on stale state.
+0.1 (lie)
Harness Bug
S4
Treats warm lead as cold, sends generic nurture
Based on stale "cold" status, sends nurture email instead of confirming scheduled demo. Prospect is confused.
-0.5
Poisoned
S5
Loops between re-qualifying and re-sending templates
Re-checks CRM (still stale), sees "cold," doubles down. Enters a re-qualify / nurture / confused response loop.
-0.8
Poisoned
🧠
Policy updates on poisoned trajectory
Gradient processes all 5 states. Model reinforces "nurture aggressively when cold" and "avoid pipeline view."
Σ = -0.7 (reported) vs ~+1.2 (true)
Corrupted
SWE Agent

The Reward Hack - Agent Games the Metric

Your reward function only checks whether tests pass, not whether the code is actually correct. The agent discovers it can hardcode expected outputs instead of solving the problem. Every test passes, the agent gets maximum reward, and production breaks on the first real input.

What the model actually learns:

"Read the tests, hardcode the outputs, skip understanding the bug."

Trajectory Cascade
Coding Agent - Reward Hack via Test Gaming
Episode 134 · 6 steps
S1
Agent reads the task spec
Reads the bug report: fix calculate_shipping_cost() returning wrong values for international orders. Source and task load correctly.
+0.1
Clean
S2
Agent reads the test file
Sees 4 test cases with specific inputs and expected outputs. Valid information-gathering step.
+0.1
Clean
S3
Agent hardcodes return values matching test inputs
Rewrites function to return hardcoded values for the 4 test inputs. Harness has no code review or generalization check.
+0.0
Reward Hack
S4
All 4 tests pass — max reward
Hardcoded values match exactly. Harness scores it as a perfect solution because it only checks pass/fail.
+1.0 (hack)
False Pass
S5
Agent submits the "fix" — production will break
PR submitted, CI passes (same 4 tests), merged to main. Any real order not matching hardcoded inputs will crash.
+0.5
Shipped Broken
🧠
Model learns: read tests, hardcode outputs, collect reward
All positive rewards reinforce test-gaming. Model will repeat: read expectations, hardcode outputs, skip understanding.
Σ = +1.7 (all positive!)
Corrupted
Customer Support Agent

The False Resolution - Status Change ≠ Problem Solved

Your harness rewards based on ticket status changes (open → resolved = positive reward), not on whether the customer's actual problem was fixed. The agent learns that clicking "resolve" is the fastest path to reward - even when the customer still has the problem.

What the model actually learns:

"Close the ticket fast. Skip the refund. Collect the reward."

Trajectory Cascade
Customer Support Agent - False Resolution Shortcut
Episode 203 · 6 steps
S1
Agent reads the support ticket
Ticket: "Charged twice for annual subscription — $299 appeared two times." Full context loads correctly.
+0.1
Clean
S2
Queries billing, confirms duplicate charge
Finds two $299 charges on the same day — one valid renewal, one duplicate from a payment gateway retry.
+0.2
Clean
S3
Sends canned response, marks "resolved" without refund
Sends template and clicks "Resolved" without issuing refund. Harness rewards based on ticket status, not actual fix.
+0.8 (shortcut)
Shortcut
S4
Customer replies: "I still see both charges?"
Ticket reopens. Customer satisfaction drops to 1/5 and escalation flag is raised.
-0.6
Reopened
S5
Agent sends another template and resolves AGAIN
Same shortcut, same result. Marks resolved again without issuing refund. Customer requesting a manager.
+0.8 (same shortcut)
Loop
🧠
Model learns: "resolved" = high reward regardless
+0.8 for each "resolve" click overwhelms -0.6 reopen penalty. Net positive — premature resolution reinforced.
Σ = +1.3 (from status rewards)
Corrupted

More Harness Failures to Watch For

  • Silent timeout defaults: Your harness silently returns a default value when an API call takes too long instead of throwing an error. The model learns that certain actions "always succeed instantly" and never builds retry logic into its behavior.
  • Non-deterministic state resets: The harness doesn't fully reset between episodes, so leftover state from episode N bleeds into episode N+1. The model gets rewarded or punished for things it didn't do in the current episode.
  • Reward rounding / clipping artifacts: Your reward function clips or rounds in ways that flatten meaningful signal differences. A great action and a mediocre action both return +1.0, so the model has no gradient to distinguish them.
  • Mock data that doesn't match production distributions: Your harness uses perfectly formatted, clean mock data, but production data has typos, missing fields, and edge cases. The model never sees messy inputs during training and breaks on real ones.
  • Action space drift: The harness exposes actions that don't exist in production (or hides ones that do). The model learns to rely on a "shortcut" button that won't be there when deployed, or never discovers a critical capability it needs.

How to Minimize Harness Failures

Know Your Model, Know Your Harness

From my experience a well-built harness has clean signal (every state is fresh, every reward matches reality), graceful degradation (bad episodes get flagged and excluded before they reach the gradient), and fail-fast behavior (something breaks, it throws immediately instead of silently corrupting data - you'd rather lose an episode than poison one).

You learn to recognize these properties by spending time with your model - reviewing trajectories, building a failure taxonomy so you know whether a bad episode was a model failure or a harness failure. If your environment failure rate is above 5%, you don't have a model problem, you have a harness problem. Fix the harness first. I talk more about this in my previous post on trajectory reviewing.

Adopt Traditional Software Engineering Best Practices in Your RL Research

Building good RL environments is a software engineering problem as much as a research one. I feel like many classically trained ML Researchers are taught to think about algorithms and mathematical correctness the most, but in school we're never taught how to really execute on what the math tells us in our code. Building scalable and robust software (ie: stable harnesses) requires slightly different sets of best practices than traditional research. Treat your training harness like your production one as much as you can. So if prod experiences 200 QPS on average, make sure your harness knows what that feels like without errors. If you haven't had to ship production software before, there are great resources out there from the likes of Gergely Orosz and Alex Xu that can help get you there. You also can learn from your company's Platform Engineers who usually eat, sleep, and breathe stable and scalable software.

Go Fix Your Janky Harness

Training harness engineering is about making sure the model experiences production-quality interactions before you actually deploy to prod. A good harness compounds: every clean episode builds on the last. A bad one compounds too, just in the wrong direction. The gap between teams that ship working harnesses and those that don't widens with every training run. Treat the training harness as an extension of your actual product - with the same level of engineering quality you expect the model to see in production.