The effect of data shuffling in mini-batch training

Often, when we train a neural network with mini-batches, we shuffle the training set before every epoch. It is considered good practice, but why? Do we actually need to do it? I'll try to answer these questions in this blog post.

[Figure: data shuffling]

When we train a neural network, we try to minimize an objective function, which is

$$J(\theta) = \frac{1}{N}\sum_{i=1}^{N}L(\theta, x_i) \qquad (1)$$

where $N$ is the number of samples in the training set and $x_i$ is the $i$-th training sample. This objective is a function of the model's parameters $\theta$ and is parameterized by the whole training set. Computing it is only practical when the training set is small: as the dataset grows, the time needed for a single update can become prohibitive. This is where mini-batch gradient descent comes to the rescue. Mini-batch gradient descent updates the model more frequently than batch gradient descent, which speeds up training.
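To make the difference in update frequency concrete, here is a minimal NumPy sketch (my own illustration, not from any particular library). `grad_loss`, `params`, and the learning rate `lr` are hypothetical placeholders for the per-sample gradient, the model parameters, and the step size; with batch size $M$, one pass over the data produces roughly $N/M$ updates instead of one.

```python
import numpy as np

def full_batch_step(params, X, grad_loss, lr=0.1):
    # One update per pass over the data: the gradient is averaged
    # over all N samples before a single parameter update.
    grads = np.mean([grad_loss(params, x) for x in X], axis=0)
    return params - lr * grads

def mini_batch_epoch(params, X, grad_loss, batch_size=32, lr=0.1):
    # N / batch_size updates per pass over the data: each mini-batch
    # triggers its own parameter update, so updates happen far more often.
    for start in range(0, len(X), batch_size):
        batch = X[start:start + batch_size]
        grads = np.mean([grad_loss(params, x) for x in batch], axis=0)
        params = params - lr * grads
    return params
```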

But mini-batch training has its own shortcomings. When we use mini-batch training, we are not minimizing the objective function defined in Eq. 1. We actually minimize a sequence of approximate functions

$$\hat{J}_1(\theta) = \frac{1}{M}\sum_{i=1}^{M}L(\theta, x_i)$$

$$\hat{J}_2(\theta) = \frac{1}{M}\sum_{i=M+1}^{2M}L(\theta, x_i)$$

$$\hat{J}_3(\theta) = \frac{1}{M}\sum_{i=2M+1}^{3M}L(\theta, x_i)$$

$$\vdots$$

where $M$ is the batch size. Note that these are functions of $\theta$, the model's parameters, not functions of $x_i$. Each of these functions is an estimate of $J(\theta)$.
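To make this concrete, here is a small sketch (my own illustration; the per-sample loss `L` is a hypothetical placeholder) that builds the sequence $\hat{J}_1, \hat{J}_2, \dots$ from consecutive slices of the training set. Each resulting function takes only $\theta$; the samples it averages over are baked in.

```python
import numpy as np

def make_batch_objectives(X, L, batch_size):
    """Return the approximate objectives J_hat_1, J_hat_2, ...
    Each one is a function of theta alone; the mini-batch it averages
    over is fixed to the j-th consecutive slice of the training set."""
    objectives = []
    for start in range(0, len(X), batch_size):
        batch = X[start:start + batch_size]
        objectives.append(lambda theta, batch=batch:
                          np.mean([L(theta, x) for x in batch]))
    return objectives
```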

What if we do not shuffle the data?

If the training set is not shuffled after each epoch, then in every epoch $\hat{J}_1(\theta)$ sees the same samples (the same set of $x_i$), so each $\hat{J}_j$ is a fixed, deterministic function. We can show that, in this case, they are all biased estimates of the true $J(\theta)$. Take $\hat{J}_1(\theta)$ as an example; its expectation is

$$E_{x_i}[\hat{J}_1(\theta)] = E_{x_i}\left [ \frac{1}{M}\sum_{i=1}^{M}L(\theta, x_i) \right ] \qquad (2)$$

If we assume that the $x_i$'s are i.i.d., which is often the case, then we have

$$E_{x_i}[\hat{J}_1(\theta)] = \frac{1}{M}\sum_{i=1}^{M} E_{x_i}[L(\theta, x_i)] \qquad (3)$$

Since the $x_i$'s are fixed, Eq. 3 can be rewritten as

$$E_{x_i}[\hat{J}_1(\theta)] = \frac{1}{M}\sum_{i=1}^{M} L(\theta, x_i) \neq \frac{1}{N}\sum_{i=1}^{N}L(\theta, x_i)$$

This makes $\hat{J}_1(\theta)$ a biased estimate of $J(\theta)$, which is not what we want. We want unbiased estimates of $J(\theta)$. So how can we achieve that?
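A quick numerical check (a toy example of my own, not a real dataset): with an ordered, unshuffled training set, the first batch's loss sits at the same value every epoch, and that value differs from $J(\theta)$. The squared per-sample loss and the scalar "samples" below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 1000 scalar "samples", sorted so the unshuffled order is not random.
N, M = 1000, 32
X = np.sort(rng.normal(size=N))
L = lambda theta, x: (theta - x) ** 2   # hypothetical per-sample loss
theta = 0.5

J = np.mean(L(theta, X))            # the true objective, Eq. 1
J_hat_1 = np.mean(L(theta, X[:M]))  # the first mini-batch's objective

print(f"J(theta)       = {J:.4f}")
print(f"J_hat_1(theta) = {J_hat_1:.4f}  # same value every epoch => biased")
```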

The effect of data shuffling

Now suppose that instead of using a fixed set of $x_i$ for each $\hat{J}_j$, we draw the samples uniformly from the training set at every iteration (every batch). This time, starting from Eq. 3, the expectation of $\hat{J}_j$ is

$$E_{x_i}[\hat{J}_j(\theta)] = \frac{1}{M}\sum_{i=1}^{M} E_{x_i}[L(\theta, x_i)] = \frac{1}{M}\sum_{i=1}^{M} \frac{1}{N} \sum_{k = 1}^{N} L(\theta, x_k) = \frac{1}{M} M \frac{1}{N} \sum_{k = 1}^{N} L(\theta, x_k) = J(\theta) \qquad (4)$$

From Eq. 4 we can see that when the $x_i$ are drawn uniformly from the training set, $\hat{J}_j$ becomes an unbiased estimate of $J$. Shuffling the data is not exactly the same as drawing $x_i$ uniformly (shuffling samples without replacement), but it is close and thus has essentially the same effect.
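The sketch below (again my own illustration, with the same kind of made-up toy data and squared loss as above) contrasts the two schemes: per-epoch shuffling, i.e. permuting the data and then slicing consecutive batches, versus drawing each batch uniformly with replacement. Averaged over many repetitions, both batch losses center on $J(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy setup as above: sorted scalar data and a made-up squared loss.
N, M = 1000, 32
X = np.sort(rng.normal(size=N))
L = lambda theta, x: (theta - x) ** 2
theta = 0.5
J = np.mean(L(theta, X))

def first_shuffled_batch(X, M, rng):
    # What "shuffle before every epoch" does: permute the data once,
    # then take consecutive slices (sampling without replacement).
    perm = rng.permutation(len(X))
    return X[perm[:M]]

# Average the first batch's loss over many shuffled "epochs" ...
shuffled_avg = np.mean([np.mean(L(theta, first_shuffled_batch(X, M, rng)))
                        for _ in range(5000)])
# ... and over many uniform draws with replacement.
uniform_avg = np.mean([np.mean(L(theta, rng.choice(X, M)))
                       for _ in range(5000)])

print(f"J(theta)     = {J:.4f}")
print(f"shuffled avg = {shuffled_avg:.4f}  # centered on J")
print(f"uniform avg  = {uniform_avg:.4f}   # centered on J")
```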

TL;DR: To answer the questions above, shuffling the data in mini-batch training produces good gradient updates because the mini-batch objectives become unbiased estimates of the true objective function. If we don't shuffle the data, the model may still converge, but it will likely take longer to train and may end up with higher training and test error, since the approximate objective functions are biased.
