Saturday, Jan 31, 2026

From Empirical Risk Minimization to REINFORCE

In this blog post I’ll attempt to clearly link Empirical Risk Minimization, Monte Carlo estimation, Score Gradient Estimation, and REINFORCE from first principles. The goal is to present them under a single view of the learning problem, estimation and inductive inference, which is the business everyone in AI is involved with. To start, let me set the stage with Empirical Risk Minimization.

Empirical Risk Minimization

Vapnik in his book The Nature of Statistical Learning Theory [1] defined the learning problem as:

a problem of finding a desired dependence using a limited number of observations.

This dependence is defined under three components:

  1. A generator $G$ of random vectors $x \in \mathbb{R}^n$, drawn independently from a fixed but unknown probability distribution function $F(x)$.
  2. A supervisor $S$ who returns an output value $y$ to every input vector $x$, according to a conditional distribution function $F(y|x)$, also fixed but unknown.
  3. A learning machine $LM$ capable of implementing a set of functions $f(x, \theta)$, $\theta \in \Theta$, where $\Theta$ is a set of parameters.

This desired dependence is found by choosing from the given set of functions $f(x, \theta)$, $\theta \in \Theta$, the one that best approximates the supervisor’s response given a finite sample $\{(x_i, y_i)\}_{i=1}^{l} \sim F(x,y) = F(x)F(y|x)$.

This learning problem can be approached via Risk Minimization in which the goal is to measure a loss or discrepancy $L(y, f(x, \theta))$ between $y$ and $f(x, \theta)$ and find the $\theta$ that minimizes the expected discrepancy. Note that we are minimizing the expected risk functional $R(f)$ since both $x, y$ are random variables!

$$R(f) = \int L(y, f(x, \theta)) \ dF(x, y) = \underset{(x,y) \sim F(x,y)}{\mathbb{E}}[L(y, f(x, \theta))]$$

Unfortunately we can’t minimize this function as $F(x, y)$ is unknown, otherwise it wouldn’t be a learning problem. Instead we minimize the Empirical Risk which is defined as:

$$R_{emp}(f) = \frac{1}{l} \sum_{i=1}^{l} L(y_i, f(x_i))$$

and the minimization of this empirical risk functional is solved by

$$f^* = \underset{f \in \mathcal{H}}{\arg\min}\; R_{emp}(f) = \underset{f \in \mathcal{H}}{\arg\min}\; \frac{1}{l} \sum_{i=1}^{l} L(y_i, f(x_i))$$

where $\mathcal{H} = \{f(x, \theta) : \theta \in \Theta\}$ is the hypothesis space, the set of all functions that the learning machine can implement by varying the parameter $\theta$ over the parameter space $\Theta$.
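To make this concrete, here is a minimal Python sketch. The linear hypothesis class, squared loss, and toy data are my own choices for illustration, not anything prescribed by the theory; it simply evaluates $R_{emp}$ for two members of the hypothesis space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sample {(x_i, y_i)}_{i=1}^{l} standing in for draws from the unknown F(x, y).
l, n = 100, 3
X = rng.normal(size=(l, n))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=l)

def f(X, theta):
    """A linear hypothesis f(x, theta) = theta^T x."""
    return X @ theta

def empirical_risk(theta, X, y):
    """R_emp(f) = (1/l) * sum_i L(y_i, f(x_i)) with squared loss."""
    return np.mean((y - f(X, theta)) ** 2)

# Two members of H = {f(., theta) : theta in Theta}; ERM picks whichever scores lower.
print(empirical_risk(np.zeros(n), X, y))
print(empirical_risk(np.array([1.0, -2.0, 0.5]), X, y))
```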

For ERM to be justified, the empirical risk must converge to the true risk uniformly over all $f \in \mathcal{H}$ simultaneously (stated below), a property that depends on the capacity, i.e. the complexity, of the hypothesis space.

$$\sup_{f \in \mathcal{H}} |R_{emp}(f) - R(f)| \xrightarrow{P} 0$$

Under the above condition, we have guarantees that empirical risk minimization also drives down the true risk functional $R(f)$. This, however, doesn’t mean that Monte Carlo estimation is out of the picture!

Monte Carlo Estimation

Now the problem becomes how to find the function $f^*$ that minimizes the empirical risk functional. Searching the entire hypothesis space $\mathcal{H}$ is intractable. One way to explore it is via stochastic approximation [1], which in practice takes the familiar forms of stochastic, batch, and minibatch gradient descent. In general, the process looks like:

$$\theta_{t+1} = \theta_t - \alpha_t \nabla_\theta R(f_t), \quad t=1,2,3,\ldots,T$$

Unfortunately, we can’t calculate this quantity directly since $F(x,y)$ is unknown, so we approximate it via Monte Carlo estimation (justified by the law of large numbers) using a batch of $B$ samples from our training set, effectively computing an estimator for the gradient of the risk.

$$\nabla_\theta R(f_t) = \nabla_\theta \underset{(x,y) \sim F(x,y)}{\mathbb{E}}[L(y, f(x, \theta_t))] = \underset{(x,y) \sim F(x,y)}{\mathbb{E}}[\nabla_\theta L(y, f(x, \theta_t))]$$

$$\approx \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta L(y_i, f(x_i, \theta_t)) = \nabla_\theta R_{emp}(f_t) = \widehat{\nabla_\theta R(f_t)}$$

Thus, the weight update becomes:

$$\theta_{t+1} = \theta_t - \alpha_t \nabla_\theta R_{emp}(f_t) = \theta_t - \alpha_t \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta L(y_i, f(x_i, \theta_t))$$

Each gradient step moves us from one hypothesis to another in the hypothesis space:

$$f_1 \xrightarrow{\nabla_\theta R} f_2 \xrightarrow{\nabla_\theta R} \cdots \xrightarrow{\nabla_\theta R} f_T$$

or equivalently in parameter space:

$$\theta_1 \to \theta_2 \to \cdots \to \theta_T$$

This stochastic approximation process has convergence guarantees under the Robbins-Monro conditions [2]. With this technique we move through the hypothesis space in search of $f^*$.
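Here is a hedged sketch of that update loop, continuing the toy linear-model-with-squared-loss setup from before. The batch size, step-size schedule, and data are again made up for illustration; the $1/(t+10)$ schedule is just one choice whose sum diverges while its sum of squares converges, as the Robbins-Monro conditions require.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: linear model, squared loss, standing in for samples from F(x, y).
l, n = 1000, 3
X = rng.normal(size=(l, n))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=l)

theta = np.zeros(n)              # theta_1
B, T = 32, 500                   # minibatch size and number of update steps

for t in range(1, T + 1):
    alpha_t = 1.0 / (t + 10)     # sum(alpha_t) diverges, sum(alpha_t^2) converges
    idx = rng.choice(l, size=B, replace=False)
    Xb, yb = X[idx], y[idx]
    # Monte Carlo estimate of grad R: (1/B) * sum_i grad_theta L(y_i, f(x_i, theta_t))
    grad_hat = (2.0 / B) * Xb.T @ (Xb @ theta - yb)
    theta = theta - alpha_t * grad_hat

print(theta)   # should end up close to [1.0, -2.0, 0.5]
```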

Score Gradient Estimation

One of the key facts used to introduce the Monte Carlo estimator of $\nabla_\theta R(f_t)$ is that the gradient of the expectation is the expectation of the gradient. This works above because the distribution $F(x, y)$ we take the expectation over does not depend on $\theta$, but it isn’t true everywhere.

Imagine you have a function $Q(s)$ that you want to maximize or minimize, and that it depends on $s \sim p(s; \theta)$, a sample from a distribution parameterized by, say, a neural network with parameters $\theta$. Since the function depends on a random variable, we maximize its expected value:

$$J(\theta) = \underset{s \sim p(s; \theta)}{\mathbb{E}}[Q(s)] = \int Q(s)\, p(s; \theta)\, ds$$

If we want to maximize this objective by exploring the hypothesis space $\mathcal{H}$ via stochastic approximation, as above (again because the space is too large to search exhaustively), we need an estimator of the gradient of the expectation in order to perform parameter updates like:

$$\theta_{t+1} = \theta_t + \alpha_t \nabla_\theta J(\theta_t), \quad t=1,2,3,\ldots,T$$

So let’s see what happens when we take the gradient of $J(\theta)$:

$$\nabla_\theta J(\theta) = \nabla_\theta \underset{s \sim p(s; \theta)}{\mathbb{E}}[Q(s)] = \nabla_\theta \int Q(s) p(s; \theta) ds = \int Q(s) \nabla_\theta p(s; \theta) ds$$

We can move the gradient inside the integral since it’s a linear operator (under the usual regularity conditions that allow swapping differentiation and integration). However, the resulting integrand is weighted by $\nabla_\theta p(s; \theta)$, which is not a probability density, so we can no longer read the integral as an expectation we could sample from. To fix this we use the log-derivative trick.

We start by multiplying and dividing by $p(s; \theta)$ inside the integral:

$$\int Q(s) \nabla_\theta p(s; \theta) ds = \int Q(s) p(s; \theta) \frac{\nabla_\theta p(s; \theta)}{p(s; \theta)} ds$$

Notice that $\frac{\nabla_\theta p(s; \theta)}{p(s; \theta)} = \nabla_\theta \log p(s; \theta)$ by the chain rule. So:

$$= \int Q(s) p(s; \theta) \nabla_\theta \log p(s; \theta) ds$$

$$= \underset{s \sim p(s; \theta)}{\mathbb{E}}[Q(s) \nabla_\theta \log p(s; \theta)]$$

$$\approx \frac{1}{B} \sum_{i=1}^{B} Q(s_i) \nabla_\theta \log p(s_i; \theta) = \widehat{\nabla_\theta J(\theta)}$$

This is how we arrive at the gradient estimator $\widehat{\nabla_\theta J(\theta)}$. It is called the Score Gradient Estimator because the gradient of the log probability, $\nabla_\theta \log p(s; \theta)$, is called the score (or informant). Armed with this estimator, we can plug it into the stochastic parameter update process and optimize away:

$$\theta_{t+1} = \theta_t + \alpha_t \widehat{\nabla_\theta J(\theta_t)} = \theta_t + \alpha_t \frac{1}{B} \sum_{i=1}^{B} Q(s_i) \nabla_\theta \log p(s_i; \theta_t)$$
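As a toy illustration (my own example, not from the references): take $p(s; \theta) = \mathcal{N}(s; \theta, \sigma^2)$, whose score with respect to the mean is $(s - \theta)/\sigma^2$, and a black-box $Q(s) = -(s-3)^2$ that we can evaluate but not differentiate through the sampling step. The score gradient estimator lets us do gradient ascent on $J(\theta)$ anyway:

```python
import numpy as np

rng = np.random.default_rng(0)

def Q(s):
    """A black-box objective: we can evaluate it, but we don't differentiate through s."""
    return -(s - 3.0) ** 2

theta, sigma = 0.0, 1.0   # p(s; theta) = N(theta, sigma^2)
B, alpha = 256, 0.05      # Monte Carlo batch size and step size

for t in range(200):
    s = rng.normal(loc=theta, scale=sigma, size=B)   # s_i ~ p(s; theta)
    score = (s - theta) / sigma**2                   # grad_theta log N(s; theta, sigma^2)
    grad_hat = np.mean(Q(s) * score)                 # score gradient estimator of grad J
    theta = theta + alpha * grad_hat                 # stochastic gradient ascent on J

print(theta)   # drifts toward 3, the maximizer of E[Q(s)]
```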

REINFORCE

Under the original definition of REINFORCE from Ronald J. Williams [3] we have that REINFORCE algorithms:

…apply in general to any learner whose input-output mappings consists of a parameterized input-controlled distribution function from which outputs are randomly generated, and the corresponding algorithms modify the learner’s distribution function on the basis of performance feedback. Because of the gradient approach used here, the only restriction on the potential applicability of these results is that certain obvious differentiability conditions must be met…

In this setting, an agent interacts with an environment by taking actions and receiving rewards. The policy $\pi(a | s; \theta)$ is a probability distribution over actions $a$ given a state $s$, parameterized by $\theta$ (e.g., a neural network). At each timestep $t$, the agent samples an action $a_t \sim \pi(a | s_t; \theta)$ and receives a reward $r_{t+1}$. A whole sequence of these interactions is called a trajectory:

$$\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_T)$$

The mapping from the Score Gradient Estimator to REINFORCE is:

| Score Gradient Estimator | REINFORCE |
| --- | --- |
| $p(s; \theta)$ | $\pi(a \mid s; \theta)$ (policy) |
| $s$ | $a$ (action) |
| $Q(s)$ | $G_t$ (return) |

The return $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$ is the discounted sum of future rewards starting from time $t$.
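Since every quantity in the REINFORCE update hinges on these returns, here is a small helper (my own, purely illustrative) that computes all the $G_t$ of a finite episode with a single reverse pass, using the recursion $G_t = r_{t+1} + \gamma G_{t+1}$; `rewards[t]` stores $r_{t+1}$, the reward received after taking $a_t$.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = sum_k gamma^k r_{t+k+1}, computed backwards via G_t = r_{t+1} + gamma * G_{t+1}."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))   # [2.62, 1.8, 2.0]
```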

The objective then becomes to maximize the expected return:

$$J(\theta) = \underset{\tau \sim \pi(\tau; \theta)}{\mathbb{E}}[G_0] = \underset{\tau \sim \pi(\tau; \theta)}{\mathbb{E}}\left[\sum_{t=0}^{T-1} \gamma^t r_{t+1}\right]$$

Applying the Score Gradient Estimator, the gradient of the objective becomes:

$$\nabla_\theta J(\theta) = \underset{\tau \sim \pi(\tau; \theta)}{\mathbb{E}}\left[\sum_{t=0}^{T-1} G_t \nabla_\theta \log \pi(a_t | s_t; \theta)\right]$$

Note that the score gradient trick is needed here because the expectation is taken with respect to trajectories generated by the policy $\pi(a | s; \theta)$, which itself depends on the parameters $\theta$ we are differentiating with respect to:

$$\nabla_\theta J(\theta) = \nabla_\theta \int G_0(\tau)\, \pi(\tau; \theta)\, d\tau = \int G_0(\tau)\, \nabla_\theta \pi(\tau; \theta)\, d\tau$$

The gradient lands on $\pi(\tau; \theta)$, so we cannot directly express this as an expectation. Using the log-derivative trick allows us to rewrite it as an expectation we can estimate from.

The Monte Carlo approximation using a single sampled trajectory then gives us the REINFORCE gradient estimate:

$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} G_t \nabla_\theta \log \pi(a_t | s_t; \theta) = \widehat{\nabla_\theta J(\theta)}$$

Finally, the parameter update rule, applied at every timestep $t$ of the sampled trajectory:

$$\theta_{t+1} = \theta_t + \alpha_t G_t \nabla_\theta \log \pi(a_t | s_t; \theta_t)$$

and we’ve arrived back at the Score Gradient Estimator!

In summary, REINFORCE samples a trajectory by taking actions $a_t \sim \pi(a | s_t; \theta)$, computes the returns, and updates the policy parameters in the direction that increases the probability of the actions that led to higher returns. In other words, the score $\nabla_\theta \log \pi(a_t | s_t; \theta)$ tells us how to change $\theta$ to make action $a_t$ more likely, and weighting it by $G_t$ means that actions leading to higher returns are reinforced more strongly.
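To close the loop, here is a minimal end-to-end sketch of that procedure with a tabular softmax policy. Everything about it (the toy environment, the parity-based reward, the horizon, the step size) is invented for illustration; it is only meant to show the three steps: sample a trajectory, compute the returns, update along $G_t \nabla_\theta \log \pi(a_t | s_t; \theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, horizon, gamma = 5, 2, 10, 0.99
theta = np.zeros((n_states, n_actions))   # tabular softmax policy parameters
alpha = 0.1

def policy(s, theta):
    """pi(a | s; theta): softmax over the row of theta for state s."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def step(s, a):
    """Toy dynamics: reward 1 if the action matches the state's parity, then move on."""
    reward = 1.0 if a == s % 2 else 0.0
    return (s + 1) % n_states, reward

for episode in range(2000):
    # 1. Sample a trajectory tau = (s_0, a_0, r_1, ..., s_T) under the current policy.
    s, states, actions, rewards = 0, [], [], []
    for t in range(horizon):
        a = rng.choice(n_actions, p=policy(s, theta))
        s_next, r = step(s, a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next

    # 2. Compute the returns G_t with a reverse pass.
    G, running = np.zeros(horizon), 0.0
    for t in reversed(range(horizon)):
        running = rewards[t] + gamma * running
        G[t] = running

    # 3. Update theta in the direction G_t * grad_theta log pi(a_t | s_t; theta).
    for t in range(horizon):
        p = policy(states[t], theta)
        grad_log_pi = -p
        grad_log_pi[actions[t]] += 1.0    # one_hot(a_t) - pi(. | s_t) for a softmax policy
        theta[states[t]] += alpha * G[t] * grad_log_pi

# The policy should now put most of its mass on a = s % 2 in every state.
print([policy(s, theta).round(2) for s in range(n_states)])
```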

In the end

…all of these things are tightly related in the sense that each one of them helps us navigate the intractable land of the hypothesis space $\mathcal{H}$ we have always dreamt about.

References

[1] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Second Edition. Springer, 2000.

[2] Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.

[3] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8:229-256, 1992.