RL Glossary — Auriel's RL Pet Peeves
Auriel's RL Pet Peeves Mini-Series

Glossary: Key Concepts for RL Newcomers

New to RL, or pivoting into post-training for autoregressive models from another ML specialty or from school? This section grounds you in the core vocabulary used throughout the post. Each definition is written to bridge from engineering concepts you may already know.
Model

At its core, a model is a mathematical function that takes inputs and produces predictions. If you've ever written y = mx + b in a math class, that's the simplest possible model. In machine learning, models are the same idea taken to an extreme: instead of two parameters (m and b), a modern neural network might have billions of parameters, and the computer adjusts all of them during training to minimize how wrong the model's predictions are. The result is a system that can recognize images, generate text, or make decisions — all by learning patterns from data.
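
To make the "adjust the parameters to be less wrong" idea concrete, here is a minimal sketch that fits the two parameters of y = mx + b with plain gradient descent. The toy data, the learning rate, and the `fit_line` helper are all assumptions made up for illustration; real training uses far more parameters and a framework that computes gradients for you.

```python
# Toy data generated from the line y = 2x + 1 (an assumed example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = m*x + b by gradient descent on mean squared error."""
    m, b = 0.0, 0.0  # the model's two parameters, starting from a bad guess
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every data point under the current parameters.
        errors = [(m * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to m and b.
        grad_m = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Nudge each parameter in the direction that reduces the error.
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b


m, b = fit_line(xs, ys)
print(f"m = {m:.3f}, b = {b:.3f}")
```

On this toy data the fitted values should land very close to the m = 2 and b = 1 used to generate it.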

Reinforcement Learning (RL)

Reinforcement learning is a way to train a model by letting it learn through trial and error, rather than showing it a pre-labeled answer key.

Analogy: think about how you learned to ride a bike. Nobody handed you a textbook of "correct" body positions for each millisecond. Instead, you got on, wobbled, fell, adjusted, and eventually figured it out. You received a "reward signal" in the form of staying upright and moving forward, and a "penalty" in the form of hitting the pavement.

RL works the same way. An agent (the model) takes actions inside an environment (the world it interacts with), and receives numerical rewards that tell it how well it's doing. Over many rounds of trial and error, the agent's policy — its learned strategy for which action to take in any given situation — improves to maximize the total reward it accumulates.
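
The act, get reward, improve loop can be sketched in a few lines. Everything here is hypothetical and deliberately tiny: `CoinFlipEnvironment` is a made-up two-lever environment with no state to observe, and `EpsilonGreedyAgent` is one simple policy among many, chosen only because it fits on a page.

```python
import random


class CoinFlipEnvironment:
    """Hypothetical environment: two levers with different payout odds."""

    def __init__(self, payout_probabilities=(0.2, 0.8), seed=0):
        self.payout_probabilities = payout_probabilities
        self.rng = random.Random(seed)

    def step(self, action):
        # Reward is 1.0 with the chosen lever's payout probability, else 0.0.
        paid = self.rng.random() < self.payout_probabilities[action]
        return 1.0 if paid else 0.0


class EpsilonGreedyAgent:
    """Policy: usually pick the lever with the best average reward so far."""

    def __init__(self, num_actions=2, epsilon=0.1, seed=1):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = [0] * num_actions
        self.value_estimates = [0.0] * num_actions

    def act(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # explore
        # Exploit: choose the action with the highest estimated reward.
        return max(range(len(self.counts)), key=lambda a: self.value_estimates[a])

    def learn(self, action, reward):
        # Incremental running average of the rewards seen for this action.
        self.counts[action] += 1
        step_size = 1.0 / self.counts[action]
        self.value_estimates[action] += step_size * (reward - self.value_estimates[action])


env = CoinFlipEnvironment()
agent = EpsilonGreedyAgent()
total_reward = 0.0
for _ in range(2000):
    action = agent.act()         # agent acts
    reward = env.step(action)    # environment responds with a reward
    agent.learn(action, reward)  # policy improves from its own experience
    total_reward += reward

print(agent.value_estimates, total_reward)
```

Note that nobody tells the agent which lever is better. It only ever sees the rewards its own actions produce, which is the point of the bike analogy.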

This stands in contrast to supervised learning, the paradigm most people who have been in ML a while are probably more familiar with — think classifying tumors from labeled X-ray datasets, predicting house prices from a CSV of historical features, or flagging spam from a pre-collected corpus of emails. In all of these cases, you give the model a dataset of input-output pairs and it learns to map inputs to the correct outputs. In supervised learning, the data is static — you collect it once and train on it. In RL, the agent generates its own data by interacting with the environment, which is why environment quality is so critical and why we're here today.

The Environment Is Your Data Generator

You probably remember the one golden rule from traditional ML classes: "Garbage in, garbage out." If you feed a model inaccurate or low-quality data, no amount of algorithmic sophistication can save it. In RL you don't have a static dataset. Instead, the agent creates its own training data on the fly by interacting with the environment: every single action the agent takes, and every reward it receives in return, becomes a data point that shapes the model's future behavior. This means the quality of your environment directly determines the quality of your training data. A flaky harness doesn't just introduce noise; it actively generates garbage data and feeds it straight into your model's learning steps.

Data Flow: Supervised Learning vs. Reinforcement Learning

Supervised Learning: static data in, predictions out

  1. Collect Dataset: Images, text, labels (fixed).
  2. Train Model: Minimize loss on static pairs.
  3. Deploy & Predict: One direction. Data never changes.

Reinforcement Learning: the environment generates your data

  1. Agent Acts: Takes action in environment.
  2. Environment Responds: Returns state + reward.
  3. Policy Updates: Model learns from its own experience.

🔄 Continuous loop. Every interaction = new training data. Break the environment = break the data.

Important detail: In reinforcement learning, the environment is your data generator.
Harness

A harness (or, more precisely, a training/evaluation harness) is the interactive system that the RL agent trains in and is evaluated against. It is a simulation or emulation of the real product experience that the agent is supposed to learn to operate in.

Example: if you're building an RL agent that's supposed to learn to use a customer support tool, the harness is a working replica of that tool. It's a piece of running software that the agent can interact with in real time — clicking buttons, typing into fields, reading responses — just like a real user would. The harness is responsible for three things:

  1. Defining the current state of the world: What does the agent "see" right now? (e.g., the current page of the dashboard, the contents of a text field, an API response it just received.)
  2. Defining what actions are valid: What can the agent do from here? (e.g., click this button, type in this field, submit this form, call this API endpoint.)
  3. Returning a reward after a state transition: After the agent takes an action, how good was it? (e.g., +1 for successfully resolving a ticket, -0.1 for each unnecessary step, -1 for crashing the workflow.)

In formal RL terms, the harness implements the components of a Markov Decision Process (MDP) — the State space, the valid Actions, the Transition dynamics (how actions change the state), and the Reward function.
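
As a rough sketch of how those three responsibilities and the MDP pieces might map onto code, here is a toy harness for a one-ticket support workflow. The class name `ToyTicketHarness`, the state and action names, and the choice to treat an invalid action as "crashing the workflow" are all assumptions for illustration; the reward values mirror the examples in the list above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTicketHarness:
    """Hypothetical harness for a one-ticket support workflow."""

    state: str = "ticket_open"
    done: bool = False
    # Transition dynamics: (state, action) -> next state.
    transitions: dict = field(default_factory=lambda: {
        ("ticket_open", "read_ticket"): "ticket_read",
        ("ticket_open", "open_settings"): "ticket_open",
        ("ticket_read", "send_reply"): "ticket_resolved",
        ("ticket_read", "open_settings"): "ticket_read",
    })

    def observe(self):
        """1. The current state of the world, as the agent sees it."""
        return self.state

    def valid_actions(self):
        """2. The actions that are valid from the current state."""
        return sorted(a for (s, a) in self.transitions if s == self.state)

    def step(self, action):
        """3. Apply the action, then return (next_state, reward, done)."""
        if self.done:
            raise RuntimeError("episode is over; create a new harness")
        if action not in self.valid_actions():
            # Assumption: an invalid action counts as crashing the workflow.
            self.done = True
            return self.state, -1.0, True
        previous_state = self.state
        self.state = self.transitions[(previous_state, action)]
        if self.state == "ticket_resolved":
            self.done = True
            return self.state, 1.0, True  # successfully resolved the ticket
        # A step that changes nothing is treated as unnecessary.
        reward = -0.1 if self.state == previous_state else 0.0
        return self.state, reward, False


harness = ToyTicketHarness()
print(harness.observe(), harness.valid_actions())
print(harness.step("open_settings"))
print(harness.step("read_ticket"))
print(harness.step("send_reply"))
```

A production harness would wrap running software rather than a lookup table, but the contract is the same: expose state, constrain actions, and score each transition.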

If you've ever used a testing framework like Selenium or Playwright to write end-to-end tests for a web app, you already have a decent mental model. The harness is like a programmable staging environment that your agent interacts with instead of your QA team. The key difference is that in RL, this environment needs to be robust enough to handle not just the happy path, but thousands of creative, adversarial, and outright bizarre interaction patterns — because an RL agent will explore all of them.

Parametric Knowledge

When we say a model has "learned something," what does that actually mean physically? It means the model's parameters (those billions of numbers mentioned earlier) have been adjusted so that the right patterns are encoded in them. This is the model's true knowledge and understanding — we call this parametric knowledge.

The goal of RL training is to get the model to acquire generalizable parametric knowledge — genuine understanding of how to solve the task that holds up even in new situations. We don't want the model to just memorize a specific sequence of clicks that happened to work one time. We want it to understand the underlying task deeply enough to generalize. When the environment is broken, the model can't build this real understanding. Instead, it overfits to the harness's specific flaws, learning brittle surface-level tricks to cope with the broken environment. This is a form of parametric knowledge, but it's useless — it won't transfer to the real task.

Trajectory

A trajectory is a single complete episode of an agent interacting with the environment — the full sequence of states, actions, and rewards from start to finish. Think of it like a recorded session or a log of everything the agent did during one attempt at the task. When researchers "review trajectories," they're watching replays of the agent's behavior to diagnose what went right or wrong.
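
In code, a trajectory is often just an ordered log of (state, action, reward) records. The sketch below assumes hypothetical `Step` and `Trajectory` classes and made-up state and action names; real training stacks usually store richer records (observations, token log-probabilities, timestamps).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    state: str     # what the agent saw
    action: str    # what the agent did
    reward: float  # what the environment returned


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)

    def record(self, state, action, reward):
        """Append one interaction to the episode log."""
        self.steps.append(Step(state, action, reward))

    def total_reward(self):
        """Sum of all rewards collected during the episode."""
        return sum(step.reward for step in self.steps)


# One hypothetical episode, logged from start to finish.
trajectory = Trajectory()
trajectory.record("ticket_open", "open_settings", -0.1)
trajectory.record("ticket_open", "read_ticket", 0.0)
trajectory.record("ticket_read", "send_reply", 1.0)

# "Reviewing a trajectory" amounts to replaying this log step by step.
for index, step in enumerate(trajectory.steps):
    print(index, step.state, step.action, step.reward)
print("total reward:", trajectory.total_reward())
```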

Gradient Noise

Variance in policy updates. Even when your harness is solid, gradients are estimated from a small set of samples. Limited sampling makes updates high variance by default, which is normal in RL. Some update noise is unavoidable before any harness bugs enter the picture.
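
To see this effect in isolation, the sketch below estimates a policy gradient from batches of different sizes in an environment with no randomness at all. It assumes a score-function (REINFORCE-style) estimator and a two-action policy with a single parameter `theta`; this is one common estimator picked for illustration, not necessarily the one any given training stack uses.

```python
import math
import random
import statistics


def sampled_policy_gradient(theta, batch_size, rng):
    """Score-function gradient estimate of expected reward from one batch."""
    p_action_1 = 1.0 / (1.0 + math.exp(-theta))  # policy: P(action = 1)
    estimates = []
    for _ in range(batch_size):
        action = 1 if rng.random() < p_action_1 else 0
        # Deterministic reward: the environment itself adds no noise.
        reward = 1.0 if action == 1 else 0.0
        # d/dtheta of log pi(action) for a Bernoulli policy with logit theta.
        grad_log_prob = action - p_action_1
        estimates.append(reward * grad_log_prob)
    return sum(estimates) / batch_size


rng = random.Random(0)
theta = 0.0
true_gradient = 0.25  # p * (1 - p) * (1.0 - 0.0) evaluated at theta = 0

spread_by_batch_size = {}
for batch_size in (4, 64, 1024):
    # Repeat the same estimate many times to see how much it jumps around.
    repeats = [sampled_policy_gradient(theta, batch_size, rng) for _ in range(200)]
    spread_by_batch_size[batch_size] = statistics.stdev(repeats)
    print(batch_size, round(statistics.mean(repeats), 3),
          round(spread_by_batch_size[batch_size], 3))
```

The estimates center on the same true gradient in every case, but the spread across repeats shrinks as the batch grows. The only source of that spread is which actions the policy happened to sample.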

Environment Noise

Stochastic rewards and random transitions. Real environments often have some randomness baked in. A robot's sensor readings are slightly noisy; a user's behavior isn't perfectly predictable from prod post-training data. In most post-training teams across major labs, people spend substantial time making sure the setup and pipelines can handle this kind of natural stochasticity.
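
Here is a small sketch of what that natural stochasticity can look like in code: a hypothetical 1-D robot whose moves occasionally slip (a random transition) and whose sensor adds Gaussian noise to each reading. The function names, the slip probability, and the noise level are assumptions; seeding the random number generator is shown as one common way to keep such randomness reproducible.

```python
import random


def noisy_step(position, action, rng, slip_probability=0.1, sensor_noise=0.05):
    """Hypothetical 1-D robot: move left/right, with slips and a noisy sensor."""
    intended_move = 1 if action == "right" else -1
    # Random transition: occasionally the move does not happen.
    if rng.random() < slip_probability:
        intended_move = 0
    new_position = position + intended_move
    # Noisy observation: the reading is the true position plus Gaussian noise.
    sensor_reading = new_position + rng.gauss(0.0, sensor_noise)
    return new_position, sensor_reading


def run_episode(seed, num_steps=20):
    """Always try to move right; return the final position and all readings."""
    rng = random.Random(seed)
    position = 0
    readings = []
    for _ in range(num_steps):
        position, reading = noisy_step(position, "right", rng)
        readings.append(reading)
    return position, readings


final_a, readings_a = run_episode(seed=0)
final_b, readings_b = run_episode(seed=0)  # same seed: identical episode
final_c, readings_c = run_episode(seed=1)  # different seed: different noise
print(final_a, final_b, final_c)
```

The same action sequence produces different readings under different seeds, yet any single seed replays exactly.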

I had Claude generate this glossary but I reviewed it myself by hand to be sure I stand by the outputs lol. Hope it helps.