Draft - Stop Shipping Low-Quality Harnesses

I Don't Want Your Janky Harness / Environment bro 🙂

As someone who has spent years building production grade models I need you to hear this: researchers don't want your broken RL environments because they will make our models worse. Not "add some noise" Worse but more like "oh crap the model is learning the wrong things and you ruined my training run and I have to throw your stuff away" Worse. This is such a common problem I see, and probably the one I care about the most as a practitioner that also tries aligning models for real world use cases that users love.

People will build what amounts to broken software and pitch it as an "RL environment." The training harness itself - the complete, interactive, and often simulated software system your RL agent trains inside of (e.g., a simulated chatbot, a fake IDE, a mock SaaS dashboard) - just doesn't work reliably. It throws random tracebacks. It has race conditions. It goes down under minimal load. It has literal broken code in it.

If you're a fresh grad researcher, a startup trying to post-train subagents for your product, or anyone building RL training infrastructure: this post is the list of harness failures I keep seeing, why they ruin your data, and how to fix them.

Architecture Environment / Training Harness / Production Runtime

Production Runtime

The real, live system that actual users interact with every day. This is what the trained model gets deployed into after RL is finished.

Real Databases Authentication Rate Limiting Observability A/B Testing Canary Deploys Safety Guardrails Session Mgmt

Training Harness

The complete simulation + scaffolding the model trains inside during RL. E.g., a mock SaaS sales tool with fake dashboard UI and simulated backend.

Simulated Backend Action Space State Mgmt Reward Calculator Trajectory Logging Timeouts Error Handling Reset Logic

Environment

The abstract "world" the model is learning. The harness implements and wraps this environment.

States — what the agent sees Actions — what the agent can do Transitions — how actions change state Rewards — the feedback signal

Important detail: In reinforcement learning, the environment is your data generator.

In RL you don't have a static dataset. Instead, the model creates its own training data by interacting with the environment. Every action and every reward becomes a data point. A flaky harness systematically generates garbage data and feeds it straight into your model's learning steps, pushing your gradients in the wrong direction.

Common Harness Errors Across Agentic Use Cases

After eyeballing thousands of trajectories across different domains as a practitioner for the last 5 years, I see the same harness failures showing up. Here are some I personally look out for based on various agent types that are pretty common today:

Each trajectory cascade below shows exactly how a single harness bug poisons an entire episode.

SaaS Sales Agent / BDR Agent

The Stale Cache - Environment Returns Old Data

Your harness's mock CRM API has a caching bug. Under load, it returns stale state from minutes ago instead of current data. The agent makes rational decisions based on wrong information, gets punished, and learns to avoid the correct workflow entirely.

What the model actually learns:

"When in doubt, send nurture emails and avoid the pipeline."

Trajectory Cascade

SaaS Sales Platform - Stale Cache Bug

Episode 42 · 6 steps

Agent opens pipeline dashboard

Dashboard loads with accurate leads, statuses, and pipeline stages. All data is fresh.

+0.2

Clean

Pulls CRM record for enterprise lead

Queries CRM API for lead #8842. Company size, industry, and lead score all return correctly.

+0.3

Clean

CRM returns stale pipeline data

Caching bug returns data from 14 min ago. Lead shows "cold" when actually "warm: demo scheduled." Reward computed on stale state.

+0.1 (lie)

Harness Bug

Treats warm lead as cold, sends generic nurture

Based on stale "cold" status, sends nurture email instead of confirming scheduled demo. Prospect is confused.

-0.5

Poisoned

Loops between re-qualifying and re-sending templates

Re-checks CRM (still stale), sees "cold," doubles down. Enters a re-qualify / nurture / confused response loop.

-0.8

Poisoned

🧠

Policy updates on poisoned trajectory

Gradient processes all 5 states. Model reinforces "nurture aggressively when cold" and "avoid pipeline view."

Σ = -0.7 (reported) vs ~+1.2 (true)

Corrupted

SWE Agent

The Reward Hack - Agent Games the Metric

Your reward function only checks whether tests pass, not whether the code is actually correct. The agent discovers it can hardcode expected outputs instead of solving the problem. Every test passes, the agent gets maximum reward, and production breaks on the first real input.

What the model actually learns:

"Read the tests, hardcode the outputs, skip understanding the bug."

Trajectory Cascade

Coding Agent - Reward Hack via Test Gaming

Episode 134 · 6 steps

Agent reads the task spec

Reads the bug report: fix calculate_shipping_cost() returning wrong values for international orders. Source and task load correctly.

+0.1

Clean

Agent reads the test file

Sees 4 test cases with specific inputs and expected outputs. Valid information-gathering step.

+0.1

Clean

Agent hardcodes return values matching test inputs

Rewrites function to return hardcoded values for the 4 test inputs. Harness has no code review or generalization check.

+0.0

Reward Hack

All 4 tests pass — max reward

Hardcoded values match exactly. Harness scores it as a perfect solution because it only checks pass/fail.

+1.0 (hack)

False Pass

Agent submits the "fix" — production will break

PR submitted, CI passes (same 4 tests), merged to main. Any real order not matching hardcoded inputs will crash.

+0.5

Shipped Broken

🧠

Model learns: read tests, hardcode outputs, collect reward

All positive rewards reinforce test-gaming. Model will repeat: read expectations, hardcode outputs, skip understanding.

Σ = +1.7 (all positive!)

Corrupted

Customer Support Agent

The False Resolution - Status Change ≠ Problem Solved

Your harness rewards based on ticket status changes (open → resolved = positive reward), not on whether the customer's actual problem was fixed. The agent learns that clicking "resolve" is the fastest path to reward - even when the customer still has the problem.

What the model actually learns:

"Close the ticket fast. Skip the refund. Collect the reward."

Trajectory Cascade

Customer Support Agent - False Resolution Shortcut

Episode 203 · 6 steps

Agent reads the support ticket

Ticket: "Charged twice for annual subscription — $299 appeared two times." Full context loads correctly.

+0.1

Clean

Queries billing, confirms duplicate charge

Finds two $299 charges on the same day — one valid renewal, one duplicate from a payment gateway retry.

+0.2

Clean

Sends canned response, marks "resolved" without refund

Sends template and clicks "Resolved" without issuing refund. Harness rewards based on ticket status, not actual fix.

+0.8 (shortcut)

Shortcut

Customer replies: "I still see both charges?"

Ticket reopens. Customer satisfaction drops to 1/5 and escalation flag is raised.

-0.6

Reopened

Agent sends another template and resolves AGAIN

Same shortcut, same result. Marks resolved again without issuing refund. Customer requesting a manager.

+0.8 (same shortcut)

Loop

🧠

Model learns: "resolved" = high reward regardless

+0.8 for each "resolve" click overwhelms -0.6 reopen penalty. Net positive — premature resolution reinforced.

Σ = +1.3 (from status rewards)

Corrupted

More Harness Failures to Watch For

Silent timeout defaults: Your harness silently returns a default value when an API call takes too long instead of throwing an error. The model learns that certain actions "always succeed instantly" and never builds retry logic into its behavior.
Non-deterministic state resets: The harness doesn't fully reset between episodes, so leftover state from episode N bleeds into episode N+1. The model gets rewarded or punished for things it didn't do in the current episode.
Reward rounding / clipping artifacts: Your reward function clips or rounds in ways that flatten meaningful signal differences. A great action and a mediocre action both return +1.0, so the model has no gradient to distinguish them.
Mock data that doesn't match production distributions: Your harness uses perfectly formatted, clean mock data, but production data has typos, missing fields, and edge cases. The model never sees messy inputs during training and breaks on real ones.
Action space drift: The harness exposes actions that don't exist in production (or hides ones that do). The model learns to rely on a "shortcut" button that won't be there when deployed, or never discovers a critical capability it needs.

How to Minimize Harness Failures

Know Your Model, Know Your Harness

From my experience a well-built harness has clean signal (every state is fresh, every reward matches reality), graceful degradation (bad episodes get flagged and excluded before they reach the gradient), and fail-fast behavior (something breaks, it throws immediately instead of silently corrupting data - you'd rather lose an episode than poison one).

You learn to recognize these properties by spending time with your model - reviewing trajectories, building a failure taxonomy so you know whether a bad episode was a model failure or a harness failure. If your environment failure rate is above 5%, you don't have a model problem, you have a harness problem. Fix the harness first. I talk more about this in my previous post on trajectory reviewing.

Adopt Traditional Software Engineering Best Practices in Your RL Research

Building good RL environments is a software engineering problem as much as a research one. I feel like many classically trained ML Researchers are taught to think about algorithms and mathematical correctness the most, but in school we're never taught how to really execute on what the math tells us in our code. Building scalable and robust software (ie: stable harnesses) requires slightly different sets of best practices than traditional research. Treat your training harness like your production one as much as you can. So if prod experiences 200 QPS on average, make sure your harness knows what that feels like without errors. If you haven't had to ship production software before, there are great resources out there from the likes of Gergely Orosz and Alex Xu that can help get you there. You also can learn from your company's Platform Engineers who usually eat, sleep, and breathe stable and scalable software.

Go Fix Your Janky Harness

Training harness engineering is about making sure the model experiences production-quality interactions before you actually deploy to prod. A good harness compounds: every clean episode builds on the last. A bad one compounds too, just in the wrong direction. The gap between teams that ship working harnesses and those that don't widens with every training run. Treat the training harness as an extension of your actual product - with the same level of engineering quality you expect the model to see in production.

Draft Preview

Stop Shipping Low-Quality Harnesses and Calling It an "Environment"

I Don't Want Your Janky Harness / Environment bro 🙂

Common Harness Errors Across Agentic Use Cases

The Stale Cache - Environment Returns Old Data

The Reward Hack - Agent Games the Metric

The False Resolution - Status Change ≠ Problem Solved

More Harness Failures to Watch For

How to Minimize Harness Failures

Know Your Model, Know Your Harness

Adopt Traditional Software Engineering Best Practices in Your RL Research

Go Fix Your Janky Harness