NN-Z2H Lesson 4: Building makemore part 3 - Activations & Gradients, BatchNorm

dive into the internals of MLPs, scrutinize the statistics of the forward pass activations, backward pass gradients, understand the health of your deep network, introduce batch normalization
til
python
andrej karpathy
nn-z2h
neural networks
Author
Published

November 26, 2024

Modified

November 29, 2024

This is not original content!

These are my study notes and code, following along with Andrej Karpathy's "Neural Networks: Zero to Hero" series.

We want to stay a bit longer with MLPs to build a more concrete intuition for the activations in the neural net and the gradients flowing backwards. It is also good to learn some of the development history of these architectures: models such as Recurrent Neural Networks (RNNs) are very expressive, but they are not easily optimizable with the gradient-based techniques we have so far. Let's get started!

Part 1: intro

starter code

Show the code
import torch
import torch.nn.functional as F
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Show the code
import pandas as pd

url = "https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt"
words = pd.read_csv(url, header=None).iloc[:, 0].tolist()
words[:8]
['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']
Show the code
len(words)
32033
Show the code
# build the vocabulary of characters and mapping to/from integer
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i: s for s, i in stoi.items()}
vocab_size = len(itos)
print(itos)
print(vocab_size)
{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}
27
Show the code
block_size = 3
# build the dataset
def build_dataset(words):
    X, Y = [], []

    for w in words:
        context = [0] * block_size
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]

    X = torch.tensor(X)
    Y = torch.tensor(Y)
    print(X.shape, Y.shape)
    return X, Y

import random
random.seed(42)
random.shuffle(words)
n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))

Xtr, Ytr = build_dataset(words[:n1])       # 80%
Xdev, Ydev = build_dataset(words[n1:n2])   # 10%
Xte, Yte = build_dataset(words[n2:])       # 10%
torch.Size([182625, 3]) torch.Size([182625])
torch.Size([22655, 3]) torch.Size([22655])
torch.Size([22866, 3]) torch.Size([22866])
Show the code
# MLP revisited
n_emb = 10 # no of dimensions of the embedding space.
n_hidden = 200 # size of the hidden - tanh layer

# Lookup table - 10 dimensional space
g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((vocab_size, n_emb),                  generator=g)

# Layer 1 - tanh - 200 neurons
W1 = torch.randn((block_size * n_emb, n_hidden),      generator=g)
b1 = torch.randn(n_hidden,                            generator=g)

# Layer 2 - softmax
W2 = torch.randn((n_hidden, vocab_size),              generator=g)
b2 = torch.randn(vocab_size,                          generator=g)

# All params
parameters = [C, W1, b1, W2, b2]
print("No of params: ", sum(p.nelement() for p in parameters))

# Pre-training
for p in parameters:
    p.requires_grad = True
No of params:  11897
Show the code
# Optimization
max_steps = 50_000 #200_000
batch_size = 32

# Stats holders
lossi = []

# Training on Xtr, Ytr
for i in range(max_steps):

    # minibatch construct      
    ix = torch.randint(0, Xtr.shape[0], (batch_size,)) 
    Xb, Yb = Xtr[ix], Ytr[ix] # batch X, Y

    # forward pass:
    emb = C[Xb] # embed the characters into vectors   
    emb_cat = emb.view(emb.shape[0], -1) # concatenate the vectors
    h_pre_act = emb_cat @ W1 + b1 # hidden layer pre-activation
    h = torch.tanh(h_pre_act) # hidden layer
    logits = h @ W2 + b2 # output layer
    loss = F.cross_entropy(logits, Yb) # loss function

    # backward pass:
    for p in parameters:
        p.grad = None
    loss.backward()

    # update
    lr = 0.1 if i <= max_steps / 2 else 0.01 # step learning rate decay
    for p in parameters:
        p.data += - lr * p.grad

    # track stats
    if i % 10000 == 0: # print once every while
      print(f'{i:7d}/{max_steps:7d}: {loss.item():.4f}')
    lossi.append(loss.log10().item())
      0/  50000: 22.5552
  10000/  50000: 2.3148
  20000/  50000: 2.1559
  30000/  50000: 2.3941
  40000/  50000: 2.2245
Show the code
plt.plot(lossi)

Show the code
@torch.no_grad() # disables gradient tracking
def split_loss(split: str):
  x, y = {
    'train': (Xtr, Ytr),
    'val': (Xdev, Ydev),
    'test': (Xte, Yte)
  }[split]
  emb = C[x] # (N, block_size, n_emb)
  emb_cat = emb.view(emb.shape[0], -1) # concatenate into (N, block_size * n_emb)
  h = torch.tanh(emb_cat @ W1 + b1) # (N, n_hidden)
  logits = h @ W2 + b2 # (N, vocab_size)
  loss = F.cross_entropy(logits, y) # loss function
  print(split, loss.item())

split_loss('train')
split_loss('val')
train 2.22798752784729
val 2.250197410583496
Show the code
# sample from the model
g = torch.Generator().manual_seed(2147483647 + 10)

for _ in range(20):
    
    out = []
    context = [0] * block_size # initialize with all ...
    while True:
      # forward pass the neural net
      emb = C[torch.tensor([context])] # (1,block_size,n_embd)
      h = torch.tanh(emb.view(1, -1) @ W1 + b1)
      logits = h @ W2 + b2
      probs = F.softmax(logits, dim=1)
      # sample from the distribution
      ix = torch.multinomial(probs, num_samples=1, generator=g).item()
      # shift the context window and track the samples
      context = context[1:] + [ix]
      out.append(ix)
      # if we sample the special '.' token, break
      if ix == 0:
        break
    
    print(''.join(itos[i] for i in out)) # decode and print the generated word
moraagmyaz.
seel.
npyn.
alarethastendrarg.
aderedieliighlynnelle.
elieananaraelyn.
malara.
noshabergihimies.
kindreelle.
jeberorius.
kynd.
riyah.
faeha.
kaysh.
samyah.
hil.
salynnsti.
zakel.
juren.
cresti.

Okay, so our network has multiple things wrong at initialization; let's list them below. The final code is presented at the end of part 1, with # 👈 marking lines that were added or modified. The code cell below simply re-initializes the network's parameters back to their starting state (in my notebook, everything is rendered linearly!).

Show the code
n_emb = 10 # no of dimensions of the embedding space.
n_hidden = 200 # size of the hidden - tanh layer
# Lookup table - 10 dimensional space
g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((vocab_size, n_emb),                  generator=g)
# Layer 1 - tanh - 200 neurons
W1 = torch.randn((block_size * n_emb, n_hidden),      generator=g)
b1 = torch.randn(n_hidden,                            generator=g)
# Layer 2 - softmax
W2 = torch.randn((n_hidden, vocab_size),              generator=g)
b2 = torch.randn(vocab_size,                          generator=g)
# All params
parameters = [C, W1, b1, W2, b2]
# Pre-training
for p in parameters:
    p.requires_grad = True
# Optimization
max_steps = 50_000 #200_000
batch_size = 32
# Training on Xtr, Ytr
for i in range(max_steps):
    # minibatch construct      
    ix = torch.randint(0, Xtr.shape[0], (batch_size,)) 
    Xb, Yb = Xtr[ix], Ytr[ix] # batch X, Y
    # forward pass:
    emb = C[Xb] # embed the characters into vectors   
    emb_cat = emb.view(emb.shape[0], -1) # concatenate the vectors
    h_pre_act = emb_cat @ W1 + b1 # hidden layer pre-activation
    h = torch.tanh(h_pre_act) # hidden layer
    logits = h @ W2 + b2 # output layer
    loss = F.cross_entropy(logits, Yb) # loss function

    break

fixing the initial loss

At step 0 the loss was about 27, and only after a few thousand training steps did it decrease to around 2: it is extremely high at the beginning. In practice we should initialize the network so that it already reflects the expectation we have before training, namely that it has no idea which of the 27 characters follows the 3-character context.

Show the code
loss.item()
24.27707862854004

In this case, without any training yet, we expect all 27 characters to be equally likely (probability 1/27, i.e. a uniform distribution), so the loss, the negative log likelihood, should be:

Show the code
- torch.tensor(1 / 27.0).log()
tensor(3.2958)

3.29 is far lower than 27, so we say the network is confidently wrong at initialization. Andrej demonstrated this with another simple 5-element logits tensor, showing that confident random guesses blow the loss up, while equal logits give the benign uniform loss.
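A rough reconstruction of that demo (my own sketch with 5 logits, not the exact tensor from the lecture):

# equal logits -> uniform probabilities; large random logits -> confident random guesses
logits_eq  = torch.zeros(1, 5)
logits_big = torch.randn(1, 5) * 10
target = torch.tensor([2])
print(F.cross_entropy(logits_eq, target))   # -log(1/5) ~ 1.61, the benign "uniform" loss
print(F.cross_entropy(logits_big, target))  # typically much larger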

At initialization we therefore want the logits to be close to zero, i.e. roughly equal to each other (but not exactly zero, for reasons shown later), so we add a 0.01 multiplier to W2 and a 0 multiplier to b2. The initial loss then comes out around 3.3.

Show the code
# Layer 2 - softmax
W2 = torch.randn((n_hidden, vocab_size),              generator=g) * 0.01
b2 = torch.randn(vocab_size,                          generator=g) * 0

Now re-train the model and notice that lossi no longer looks like a hockey stick! Moreover, the final loss on the train and dev sets is better!

fixing the saturated tanh

The logits are now okay; the next problem is h, the activations of the hidden layer. It is hard to see, but in the output of the code cell below far too many values in this tensor sit at 1 and -1.

Show the code
h
tensor([[-0.9999, -0.7462, -0.9995,  ...,  0.6402, -1.0000, -0.9974],
        [-0.9983, -1.0000, -1.0000,  ...,  1.0000, -1.0000, -1.0000],
        [-0.9793, -0.9999, -1.0000,  ...,  0.7836, -0.7058,  0.2913],
        ...,
        [-1.0000, -0.9995, -0.9891,  ...,  0.9995, -0.8793,  0.9375],
        [-0.9997, -0.9674, -1.0000,  ..., -0.8636, -0.0804,  0.7250],
        [ 0.9988,  1.0000,  0.9998,  ..., -1.0000,  0.9901,  0.9985]],
       grad_fn=<TanhBackward0>)

Recall that tanh is an activation function that squashes arbitrary numbers into the range [-1, 1]. Let's visualize the distribution of h.

Show the code
plt.hist(h.view(-1).tolist(), 50); # the ";" suppresses printing the returned data in the cell output

Most values pile up at the extremes -1 and 1. Turning to h_pre_act, we see a broad, fat-tailed distribution spanning roughly -15 to 15.

Show the code
plt.hist(h_pre_act.view(-1).tolist(), 50);

Looking back at how we implemented tanh in micrograd (mathematically the same as PyTorch's), the backward pass multiplies the incoming gradient by (1 - t**2), where t is the local tanh output. When t is near -1 or 1, this factor is close to 0 and we are killing the gradient: backpropagation effectively stops at this tanh unit.

Show the code
...
    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
        out = Value(t, (self, ), 'tanh')

        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        return out
...

When these gradients become zero, the gradients of all preceding nodes vanish as well. This is the saturated tanh problem, and it leads to dead neurons: units that are always off, and because their gradient is zero they can never be turned back on. The same happens with other activations such as sigmoid and ReLU (less so with Leaky ReLU or ELU). The network stops learning!
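A minimal illustration (my addition): tanh's local gradient is (1 - t**2), so a saturated pre-activation passes back almost no gradient.

x = torch.tensor([0.1, 8.0], requires_grad=True)  # healthy vs saturated pre-activation
t = torch.tanh(x)
t.sum().backward()
print(t)       # ~[0.0997, 1.0000]
print(x.grad)  # ~[0.9900, 0.0000]: the saturated unit is effectively dead for this input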

As with the logits, we want h_pre_act to be closer to zero, so we add multipliers to W1 and b1:

Show the code
# Layer 1 - tanh - 300 neurons
W1 = torch.randn((block_size * n_emb, n_hidden),      generator=g)  * 0.2
b1 = torch.randn(n_hidden,                            generator=g) * 0.01 # keep a little bit of entropy
# It's okay to initialize b1 to zero, but AK found empirically that a small value helps the optimization

We can now see a much less peaked distribution of h:

tanh

pre-activation tanh

calculating the init scale: “Kaiming init”

Now let's look at that number 0.2: in practice nobody sets it by hand. The example below shows how the Gaussian statistics of y differ from those of x after multiplying by W.

The question is how to scale W so that the output preserves the unit Gaussian distribution of x. It turns out W should be divided by the square root of the "fan in", in this case 10**0.5.

Show the code
x = torch.randn(1000, 10)
W = torch.randn(10, 200)
y = x @ W

W1 = torch.randn(10, 200) / 10**0.5
y1 = x @ W1
print(x.mean(), x.std())
print(y.mean(), y.std())
print(y1.mean(), y1.std())
plt.figure(figsize=(20,5))
plt.subplot(131).set_title("Input X")
plt.hist(x.view(-1).tolist(), 50, density=True);
plt.subplot(132).set_title("Initial output y, expanded by W")
plt.hist(y.view(-1).tolist(), 50, density=True);
plt.subplot(133).set_title("y1, preserve the X's Gaussian Dist")
plt.hist(y1.view(-1).tolist(), 50, density=True);
tensor(-0.0039) tensor(1.0023)
tensor(-0.0044) tensor(3.1220)
tensor(-0.0002) tensor(1.0091)

Please investigate more here:

  1. Kaiming et al. paper: https://arxiv.org/abs/1502.01852
  2. Implementation in Pytorch: https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.kaiming_normal_

The Kaiming paper recommends an additional gain multiplier based on the nonlinearity/activation function (here); for tanh it is 5/3. We end up modifying the initialization of W1 to:

Show the code
W1 = torch.randn((block_size * n_emb, n_hidden),      generator=g)  * (5/3) / ((block_size * n_emb)**0.5) # * 0.2

In this case the scale works out to roughly 0.3. Re-training improves the loss only marginally (our earlier hand-picked 0.2 was already very close), but now this hyper-constant is derived rather than hard-coded.
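As a cross-check (my addition, not in the lecture), PyTorch ships the same numbers: torch.nn.init.calculate_gain('tanh') returns 5/3, and kaiming_normal_ applies gain / sqrt(fan_in). Note that PyTorch stores Linear weights as (fan_out, fan_in), the transpose of this notebook's convention.

print(torch.nn.init.calculate_gain('tanh'))               # 1.6667 = 5/3
w = torch.empty(n_hidden, block_size * n_emb)              # PyTorch layout: (fan_out, fan_in)
torch.nn.init.kaiming_normal_(w, nonlinearity='tanh')      # fills w with std = gain / sqrt(fan_in)
print(w.std().item(), (5/3) / (block_size * n_emb)**0.5)   # both ~0.30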

batch normalization

As discussed before, we don't want h_pre_act to be way too small (then tanh does nothing interesting) or too large (then tanh saturates); we want it to roughly follow a standardized Gaussian distribution (i.e. mean 0, std 1).

We've arranged this at initialization, so why not just normalize the hidden states to be unit Gaussian during training as well? In batch normalization this is achieved in 4 steps, demonstrated on our case:

Show the code
# 1. mini-batch mean
hpa_mean = h_pre_act.mean(0, keepdim=True)
# 2. mini-batch variance / standard deviation
hpa_std = h_pre_act.std(0, keepdim=True)
# 3. normalize
h_pre_act = (h_pre_act - hpa_mean) / hpa_std
# 4. scale and shift
# multiply by a "gain" then "shift" it with a bias
bngain = torch.ones((1, n_hidden))
bnbias = torch.zeros((1, n_hidden))
h_pre_act = bngain * h_pre_act + bnbias

We modify our code accordingly and re-run it; this time the model does not actually improve much, because this is a very simple and shallow network. We also notice the training loop is slower than before, since there is more computation per step. Batch normalization also comes with an unexpected side effect: the forward and backward pass of any single input now depends on the other examples in its minibatch, not just on itself (because of the mean()/std()). Surprisingly, this turns out to be a good thing and acts as a regularizer.
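A tiny illustration of that coupling (my sketch, using the same normalization as above): the same example normalizes differently depending on which other examples share its minibatch.

xa = torch.randn(1, n_hidden)                       # one fixed example
batch1 = torch.cat([xa, torch.randn(31, n_hidden)]) # two different minibatches containing it
batch2 = torch.cat([xa, torch.randn(31, n_hidden)])
norm = lambda b: (b - b.mean(0, keepdim=True)) / b.std(0, keepdim=True)
print(torch.allclose(norm(batch1)[0], norm(batch2)[0]))  # False: its output depends on its batch-mates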

There are also normalization layers that do not couple examples in a batch, such as Layer Normalization, Instance Normalization, and Group Normalization.
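For contrast, a hedged sketch of the layer-norm idea: each example is normalized over its own features (dim 1), so nothing depends on the rest of the batch.

x = torch.randn(32, n_hidden)
x_ln = (x - x.mean(1, keepdim=True)) / torch.sqrt(x.var(1, keepdim=True) + 1e-5)
print(x_ln[0].mean().item(), x_ln[0].std().item())  # ~0 and ~1, computed per example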

One other thing to consider: at deployment/test time we don't want to use batch-norm statistics computed from a single minibatch. Instead we want to use the mean and standard deviation estimated over the whole training set:

Show the code
# calibrate the batch norm after training

with torch.no_grad():
    # pass the training set through
    emb = C[Xtr]
    embcat = emb.view(-1, emb.shape[1] * emb.shape[2])
    hpreact = embcat @ W1 + b1
    # measure the mean/std over the entire training set
    bnmean = hpreact.mean(0, keepdim=True)
    bnstd = hpreact.std(0, keepdim=True)

Alternatively, we can maintain a running mean and standard deviation during training, as implemented below, which gives close estimates. Two remaining notes on batch norm:

  1. Division by zero: an \(\epsilon\) value is added to the variance to avoid it. We do not include it here as it is unlikely to happen with our example;
  2. The bias b1 is subtracted out by the batch-norm mean, so b1.grad ends up all zeros: b1 no longer affects any downstream calculation. Thus, when a layer is followed by batch norm, its bias should be removed; bnbias is now in charge of shifting the distribution (see the sketch after this list).
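A quick sketch of note 2 (my addition; it assumes W1 and b1 from the earlier cells are still in scope): any bias added before batch norm is cancelled by the mean subtraction, which is why b1.grad ends up at zero.

with torch.no_grad():
    xb = torch.randn(32, block_size * n_emb)
    norm = lambda a: (a - a.mean(0, keepdim=True)) / a.std(0, keepdim=True)
    print(torch.allclose(norm(xb @ W1 + b1), norm(xb @ W1), atol=1e-5))  # True: b1 has no effect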

real example: resnet50 walkthrough

The code AK presented here: https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py#L108

The architecture of the ResNet-50 model.

summary of the lecture

Understanding the activations (non-linearities) and the gradients is crucial when training deep/large neural networks. In part 1 we observed several issues and came up with solutions:

  1. The network being confidently wrong at init leads to a hockey-stick loss curve: add small multipliers to the output layer's weights and biases;
  2. A fat-tailed pre-activation distribution saturates the tanh: Kaiming init;
  3. Normalization of the hidden states: introduction of batch norm.

Our final code in part 1 (un-fold to see), # 👈 indicates a change:

Show the code
block_size = 3

# MLP revisited
n_emb = 10 # no of dimensions of the embedding space.
n_hidden = 200 # size of the hidden - tanh layer

# Lookup table - 10 dimensional space
g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((vocab_size, n_emb),                  generator=g)

# Layer 1 - tanh - 200 neurons
W1 = torch.randn((block_size * n_emb, n_hidden),      generator=g) * (5/3) / ((block_size * n_emb)**0.5) # * 0.2       # 👈
# b1 = torch.randn(n_hidden,                            generator=g) * 0.01       # 👈

# Layer 2 - softmax
W2 = torch.randn((n_hidden, vocab_size),              generator=g) * 0.01       # 👈
b2 = torch.randn(vocab_size,                          generator=g) * 0          # 👈

# Batch Normalization gain and bias
bngain = torch.ones((1, n_hidden))                                              # 👈
bnbias = torch.zeros((1, n_hidden))                                             # 👈

# Add running mean/std
bnmean_running = torch.zeros((1, n_hidden))                             # 👈
bnstd_running = torch.ones((1, n_hidden))                               # 👈

# All params (deleted b1)
parameters = [C, W1, W2, b2, bngain, bnbias]                                # 👈
print("No of params: ", sum(p.nelement() for p in parameters))

# Pre-training
for p in parameters:
    p.requires_grad = True

# Optimization
max_steps = 50_000 #200_000
batch_size = 32

# Stats holders
lossi = []

# Training on Xtr, Ytr
for i in range(max_steps):

    # minibatch construct      
    ix = torch.randint(0, Xtr.shape[0], (batch_size,)) 
    Xb, Yb = Xtr[ix], Ytr[ix] # batch X, Y

    # forward pass:
    emb = C[Xb] # embed the characters into vectors   
    emb_cat = emb.view(emb.shape[0], -1) # concatenate the vectors
    # Linear layer
    h_pre_act = emb_cat @ W1 # + b1 # hidden layer pre-activation                               # 👈
    # BatchNorm layer
    bnmeani = h_pre_act.mean(0, keepdim=True)                                                   # 👈
    bnstdi = h_pre_act.std(0, keepdim=True)                                                     # 👈
    h_pre_act = bngain * ((h_pre_act - bnmeani) / bnstdi) + bnbias                              # 👈
    # Updating running mean and std (buffers updated without gradient tracking)
    with torch.no_grad():                                                                       # 👈
        bnmean_running = 0.999 * bnmean_running + 0.001 * bnmeani                               # 👈
        bnstd_running = 0.999 * bnstd_running + 0.001 * bnstdi                                  # 👈
    # Non-linearity
    h = torch.tanh(h_pre_act) # hidden layer
    logits = h @ W2 + b2 # output layer
    loss = F.cross_entropy(logits, Yb) # loss function

    # backward pass:
    for p in parameters:
        p.grad = None
    loss.backward()

    # update
    lr = 0.1 if i <= max_steps / 2 else 0.01 # step learning rate decay
    for p in parameters:
        p.data += - lr * p.grad

    # track stats
    if i % 10000 == 0: # print once every while
      print(f'{i:7d}/{max_steps:7d}: {loss.item():.4f}')
    lossi.append(loss.log10().item())
No of params:  12097
      0/  50000: 3.3045
  10000/  50000: 2.2005
  20000/  50000: 2.1628
  30000/  50000: 2.0014
  40000/  50000: 2.1175
Show the code
plt.plot(lossi)

Show the code
@torch.no_grad() # disables gradient tracking
def split_loss(split: str):
  x, y = {
    'train': (Xtr, Ytr),
    'val': (Xdev, Ydev),
    'test': (Xte, Yte)
  }[split]
  emb = C[x]
  emb_cat = emb.view(emb.shape[0], -1) 
  h_pre_act = emb_cat @ W1 # + b1 (b1 was removed from the final network; bnbias handles the shift)                     # 👈
  # h_pre_act = bngain * ((h_pre_act - h_pre_act.mean(0, keepdim=True)) / h_pre_act.std(0, keepdim=True)) + bnbias      # 👈
  # h_pre_act = bngain * ((h_pre_act - bnmean) / bnstd) + bnbias                                                        # 👈
  h_pre_act = bngain * ((h_pre_act - bnmean_running) / bnstd_running) + bnbias                                          # 👈
  h = torch.tanh(h_pre_act) 
  logits = h @ W2 + b2
  loss = F.cross_entropy(logits, y)
  print(split, loss.item())

split_loss('train')
split_loss('val')
train 2.4766831398010254
val 2.490326404571533

loss logs

The numbers are approximate; I don't know why my ThinkPad E14 gives different results when running the code multiple times 😂.
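My guess (not verified): the part-1 training loop draws minibatch indices from PyTorch's global RNG, since torch.randint is called without the seeded generator; passing it, as the part-2 loop already does, should make runs repeatable.

ix = torch.randint(0, Xtr.shape[0], (batch_size,), generator=g)  # reproducible minibatch sampling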

Loss logs

| Step | What we did | Loss we got (cumulative) |
|------|-------------|--------------------------|
| 1 | original | train 2.1169614791870117 / val 2.1623435020446777 |
| 2 | fixed softmax confidently wrong | train 2.0666463375091553 / val 2.1468191146850586 |
| 3 | fixed tanh layer too saturated at init | train 2.033477544784546 / val 2.115907907485962 |
| 4 | used semi-principled "kaiming init" instead of hacky init | train 2.038902997970581 / val 2.1138899326324463 |
| 5 | added batch norm layer | train 2.0662825107574463 / val 2.1201331615448 |

Part 2: PyTorch-ifying the code, and training a deeper network

Below is the PyTorch-ified code by Andrej, with some comments added by me:

Show the code
# Let's train a deeper network
# The classes we create here are the same API as nn.Module in PyTorch

class Linear:
    """
    Simplifying Pytorch Linear Layer: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear
    """
    def __init__(self, fan_in, fan_out, bias=True):
        self.weight = torch.randn((fan_in, fan_out), generator=g) / fan_in**0.5
        self.bias = torch.zeros(fan_out) if bias else None

    def __call__(self, x):
        self.out = x @ self.weight
        if self.bias is not None:
            self.out += self.bias
        return self.out

    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])


class BatchNorm1d:
    """
    Simplifying Pytorch BatchNorm1D: https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html
    """ 
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True # switches between training mode (batch stats) and evaluation mode (running mean/var)
        # parameters (trained with backprop)
        self.gamma = torch.ones(dim) # gain
        self.beta = torch.zeros(dim) # bias
        # buffers (trained with a running 'momentum update')
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)

    def __call__(self, x):
        # calculate the forward pass
        if self.training:
            xmean = x.mean(0, keepdim=True) # batch mean
            xvar = x.var(0, keepdim=True) # batch variance, follow the paper exactly
        else:
            xmean = self.running_mean
            xvar = self.running_var
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
        self.out = self.gamma * xhat + self.beta # kept as an attribute so we can inspect/visualize it later; PyTorch does not retain this
        # update the buffers
        if self.training:
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]

class Tanh:
    """
    Just calculate the Tanh, just PyTorch: https://pytorch.org/docs/stable/generated/torch.nn.Tanh.html
    """
    def __call__(self, x):
        self.out = torch.tanh(x)
        return self.out
    def parameters(self):
        return []

n_embd = 10 # the dimensionality of the character embedding vectors
n_hidden = 100 # the number of neurons in the hidden layer of the MLP
g = torch.Generator().manual_seed(2147483647) # for reproducibility

C = torch.randn((vocab_size, n_embd),            generator=g)

layers = [
    Linear(n_embd * block_size, n_hidden), Tanh(),
    Linear(           n_hidden, n_hidden), Tanh(),
    Linear(           n_hidden, n_hidden), Tanh(),
    Linear(           n_hidden, n_hidden), Tanh(),
    Linear(           n_hidden, n_hidden), Tanh(),
    Linear(           n_hidden, vocab_size),
]

with torch.no_grad():
    # last layer: make less confident
    layers[-1].weight *= 0.1
    # all other layers: apply gain
    for layer in layers[:-1]:
        if isinstance(layer, Linear):
            layer.weight *= 5/3

parameters = [C] + [p for layer in layers for p in layer.parameters()]
print(sum(p.nelement() for p in parameters)) # number of parameters in total
for p in parameters:
    p.requires_grad = True
46497
Show the code
# same optimization as last time
max_steps = 200000
batch_size = 32
lossi = []
ud = []

for i in range(max_steps):
  
    # minibatch construct
    ix = torch.randint(0, Xtr.shape[0], (batch_size,), generator=g)
    Xb, Yb = Xtr[ix], Ytr[ix] # batch X,Y

    # forward pass
    emb = C[Xb] # embed the characters into vectors
    x = emb.view(emb.shape[0], -1) # concatenate the vectors
    for layer in layers:
        x = layer(x)
    loss = F.cross_entropy(x, Yb) # loss function

    # backward pass
    for layer in layers:
        layer.out.retain_grad() # AFTER_DEBUG: would take out retain_graph
    for p in parameters:
        p.grad = None
    loss.backward()

    # update
    lr = 0.1 if i < 150000 else 0.01 # step learning rate decay
    for p in parameters:
        p.data += -lr * p.grad

    # track stats
    if i % 10000 == 0: # print every once in a while
        print(f'{i:7d}/{max_steps:7d}: {loss.item():.4f}')
    lossi.append(loss.log10().item())
    with torch.no_grad():
        ud.append([((lr*p.grad).std() / p.data.std()).log10().item() for p in parameters])

    break
    # if i >= 1000:
    #     break # AFTER_DEBUG: would take out obviously to run full optimization
      0/ 200000: 3.2962

viz #1: forward pass activations statistics

Show the code
# visualize histograms
plt.figure(figsize=(11, 3)) # width and height of the plot
legends = []
for i, layer in enumerate(layers[:-1]): # note: exclude the output layer
  if isinstance(layer, Tanh):
    t = layer.out
    print('layer %d (%10s): mean %+.2f, std %.2f, saturated: %.2f%%' % (i, layer.__class__.__name__, t.mean(), t.std(), (t.abs() > 0.97).float().mean()*100))
    hy, hx = torch.histogram(t, density=True)
    plt.plot(hx[:-1].detach(), hy.detach())
    legends.append(f'layer {i} ({layer.__class__.__name__}')
plt.legend(legends);
plt.title('activation distribution')
layer 1 (      Tanh): mean -0.02, std 0.75, saturated: 20.25%
layer 3 (      Tanh): mean -0.00, std 0.69, saturated: 8.38%
layer 5 (      Tanh): mean +0.00, std 0.67, saturated: 6.62%
layer 7 (      Tanh): mean -0.01, std 0.66, saturated: 5.47%
layer 9 (      Tanh): mean -0.02, std 0.66, saturated: 6.12%
Text(0.5, 1.0, 'activation distribution')

If we set the gain to 1, the std shrinks layer by layer and the saturation drops towards zero: the first layer looks decent, but the subsequent activations keep shrinking because tanh is a squashing function.

layer 1 (      Tanh): mean -0.02, std 0.62, saturated: 3.50%
layer 3 (      Tanh): mean -0.00, std 0.48, saturated: 0.03%
layer 5 (      Tanh): mean +0.00, std 0.41, saturated: 0.06%
layer 7 (      Tanh): mean +0.00, std 0.35, saturated: 0.00%
layer 9 (      Tanh): mean -0.02, std 0.32, saturated: 0.00%
Text(0.5, 1.0, 'activation distribution')

If the gain is 1

But if we set the gain far too high, say 3, the saturation becomes excessive.

layer 1 (      Tanh): mean -0.03, std 0.85, saturated: 47.66%
layer 3 (      Tanh): mean +0.00, std 0.84, saturated: 40.47%
layer 5 (      Tanh): mean -0.01, std 0.84, saturated: 42.38%
layer 7 (      Tanh): mean -0.01, std 0.84, saturated: 42.00%
layer 9 (      Tanh): mean -0.03, std 0.84, saturated: 42.41%
Text(0.5, 1.0, 'activation distribution')

If the gain is 3

So 5/3 is a nice compromise, balancing the std and the saturation.

Why 5/3?

A comment under his video explains why 5/3 is recommended: it comes from the average of \([\tanh(x)]^2\) where \(x\) follows a unit Gaussian:

\(\int_{-\infty}^{\infty} \frac{[\tanh(x)]^2 \exp(-\frac{x^2}{2})}{\sqrt{2\pi}} \, dx \approx 0.39\)

The square root of this value is how much tanh squeezes the std of the incoming variable: 0.39 ** 0.5 ~= 0.63 ~= 3/5, hence 5/3 (the reciprocal) is just an approximation of the exact gain.
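A quick numeric check of that comment (my addition):

x = torch.randn(1_000_000)
m = (torch.tanh(x) ** 2).mean()
print(m.item())                     # ~0.39: E[tanh(x)^2] under a unit Gaussian
print(m.sqrt().item())              # ~0.63: how much tanh shrinks the std
print(1 / m.sqrt().item(), 5 / 3)   # ~1.59 vs the 5/3 gain used in practice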

viz #2: backward pass gradient statistics

Similarly, we can do the same thing with the gradients. With the gain set to 5/3, the gradient distributions look roughly the same across layers. Without a proper gain, the gradients would shrink towards zero layer by layer (the distributions become more and more peaked); the gain helps keep them spread out.

Show the code
# visualize histograms
plt.figure(figsize=(11, 3)) # width and height of the plot
legends = []
for i, layer in enumerate(layers[:-1]): # note: exclude the output layer
  if isinstance(layer, Tanh):
    t = layer.out.grad
    print('layer %d (%10s): mean %+f, std %e' % (i, layer.__class__.__name__, t.mean(), t.std()))
    hy, hx = torch.histogram(t, density=True)
    plt.plot(hx[:-1].detach(), hy.detach())
    legends.append(f'layer {i} ({layer.__class__.__name__}')
plt.legend(legends);
plt.title('gradient distribution')
layer 1 (      Tanh): mean +0.000010, std 4.205588e-04
layer 3 (      Tanh): mean -0.000003, std 3.991179e-04
layer 5 (      Tanh): mean +0.000003, std 3.743020e-04
layer 7 (      Tanh): mean +0.000015, std 3.290473e-04
layer 9 (      Tanh): mean -0.000014, std 3.054035e-04
Text(0.5, 1.0, 'gradient distribution')

the fully linear case of no non-linearity

Now imagine we remove the tanh from all layers; the recommended gain for a purely linear layer is 1.

Show the code
layers = [
  Linear(n_embd * block_size, n_hidden), #Tanh(),
  Linear(           n_hidden, n_hidden), #Tanh(),
  Linear(           n_hidden, n_hidden), #Tanh(),
  Linear(           n_hidden, n_hidden), #Tanh(),
  Linear(           n_hidden, n_hidden), #Tanh(),
  Linear(           n_hidden, vocab_size),
]

But then you end up with a purely linear network: no matter how many Linear layers you stack, they compose into a single linear function \(y = xA^T + b\), which greatly limits the capacity of the net.
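A two-line check of that collapse (my sketch):

A, B, x = torch.randn(30, 100), torch.randn(100, 27), torch.randn(8, 30)
print(torch.allclose((x @ A) @ B, x @ (A @ B), atol=1e-4))  # True: a single matrix A @ B is equivalent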

viz #3: parameter activation and gradient statistics

We can also visualize the distribution of the parameters, below only the weights for simplicity (ignoring gamma, beta, etc.). We look at the mean, the std, and the grad:data ratio (how large the gradient is relative to the data it will update).

The problem with the last layer shows up in the output below: its weight gradients are roughly 10 times larger than those of the earlier layers, and its grad:data ratio is far too high.

We can run the first 1000 training steps and this is slightly reduced, but since we are using plain SGD rather than a modern optimizer like Adam, it is still problematic.

Show the code
# visualize histograms
plt.figure(figsize=(11, 3)) # width and height of the plot
legends = []
for i,p in enumerate(parameters):
  t = p.grad
  if p.ndim == 2:
    print('weight %10s | mean %+f | std %e | grad:data ratio %e' % (tuple(p.shape), t.mean(), t.std(), t.std() / p.std()))
    hy, hx = torch.histogram(t, density=True)
    plt.plot(hx[:-1].detach(), hy.detach())
    legends.append(f'{i} {tuple(p.shape)}')
plt.legend(legends)
plt.title('weights gradient distribution');
weight   (27, 10) | mean -0.000031 | std 1.365078e-03 | grad:data ratio 1.364090e-03
weight  (30, 100) | mean -0.000049 | std 1.207430e-03 | grad:data ratio 3.871660e-03
weight (100, 100) | mean +0.000016 | std 1.096730e-03 | grad:data ratio 6.601988e-03
weight (100, 100) | mean -0.000010 | std 9.893572e-04 | grad:data ratio 5.893091e-03
weight (100, 100) | mean -0.000011 | std 8.623432e-04 | grad:data ratio 5.158123e-03
weight (100, 100) | mean -0.000004 | std 7.388576e-04 | grad:data ratio 4.415211e-03
weight  (100, 27) | mean -0.000000 | std 2.364824e-02 | grad:data ratio 2.328203e+00

viz #4: update data ratio over time

The grad:data ratio above is ultimately not that informative (it captures only one point in time); what matters is the actual amount by which we change the data in these tensors over time. AK introduces a tracking list ud (update-to-data), which records, for each parameter, the ratio of the std of the update (lr * grad) to the std of the data (log10 of it, for a nicer plot), computed under torch.no_grad().

Show the code
plt.figure(figsize=(11, 3))
legends = []
for i,p in enumerate(parameters):
  if p.ndim == 2:
    plt.plot([ud[j][i] for j in range(len(ud))])
    legends.append('param %d' % i)
plt.plot([0, len(ud)], [-3, -3], 'k') # these ratios should be ~1e-3, indicate on plot
plt.legend(legends);

Below is the visualization from data collected after 1000 training loops:

Viz 4 1000

Recall what we did to the last layer to avoid over-confidence; that is why the pink line looks different from the others. In general the learning process looks good; if we change the learning rate to 0.0001, the chart looks much worse (the update ratios fall far below the -3 guide line).

Below are viz 1 after 1000 training loops:

Viz 1 1000

and viz 2:

Viz 2 1000

and viz 3:

Viz 3 1000

Pretty decent till now. Let’s bring the BatchNorm back.

bringing back batchnorm, looking at the visualizations

We re-define the layers; in the torch.no_grad() block we now scale down the last layer's batch-norm gamma instead of its weight. With batch norm in place, the manual fan-in normalization and the 5/3 gain are no longer needed either:

Show the code
layers = [
  Linear(n_embd * block_size, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, vocab_size, bias=False), BatchNorm1d(vocab_size),
]

summary of the lecture for real this time

  1. Introduction of Batch Normalization, the first of the modern innovations for stabilizing the training of deep neural nets;
  2. PyTorch-ifying the code;
  3. Introduction of some diagnostic tools we can use to verify that the network is in a good state dynamically.

What he did not try to improve here is the network's loss; he suspects it is now bottlenecked not by the optimization but by the context length.

Training a neural network is like balancing a pencil on a finger.

Final network architecture and training:

Show the code
# BatchNorm1D and Tanh are the same
class Linear:
    """
    Simplifying Pytorch Linear Layer: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear
    """
    def __init__(self, fan_in, fan_out, bias=True):
        self.weight = torch.randn((fan_in, fan_out), generator=g) # / fan_in**0.5
        self.bias = torch.zeros(fan_out) if bias else None

    def __call__(self, x):
        self.out = x @ self.weight
        if self.bias is not None:
            self.out += self.bias
        return self.out

    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])

n_embd = 10 # the dimensionality of the character embedding vectors
n_hidden = 100 # the number of neurons in the hidden layer of the MLP
g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((vocab_size, n_embd),            generator=g)
layers = [
  Linear(n_embd * block_size, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, vocab_size, bias=False), BatchNorm1d(vocab_size),
]

with torch.no_grad():
    # last layer: make less confident
    layers[-1].gamma *= 0.1
    # all other layers: apply gain
    for layer in layers[:-1]:
        if isinstance(layer, Linear):
            layer.weight *= 1.0 #5/3

parameters = [C] + [p for layer in layers for p in layer.parameters()]
print(sum(p.nelement() for p in parameters)) # number of parameters in total
for p in parameters:
    p.requires_grad = True

# same optimization as last time
max_steps = 200000
batch_size = 32
lossi = []
ud = []

for i in range(max_steps):
  
    # minibatch construct
    ix = torch.randint(0, Xtr.shape[0], (batch_size,), generator=g)
    Xb, Yb = Xtr[ix], Ytr[ix] # batch X,Y

    # forward pass
    emb = C[Xb] # embed the characters into vectors
    x = emb.view(emb.shape[0], -1) # concatenate the vectors
    for layer in layers:
        x = layer(x)
    loss = F.cross_entropy(x, Yb) # loss function

    # backward pass
    for layer in layers:
        layer.out.retain_grad() # AFTER_DEBUG: would take out retain_graph
    for p in parameters:
        p.grad = None
    loss.backward()

    # update
    lr = 0.1 if i < 150000 else 0.01 # step learning rate decay
    for p in parameters:
        p.data += -lr * p.grad

    # track stats
    if i % 10000 == 0: # print every once in a while
        print(f'{i:7d}/{max_steps:7d}: {loss.item():.4f}')
    lossi.append(loss.log10().item())
    with torch.no_grad():
        ud.append([((lr*p.grad).std() / p.data.std()).log10().item() for p in parameters])

    # break
    # if i >= 1000:
    #     break # AFTER_DEBUG: would take out obviously to run full optimization
47024
      0/ 200000: 3.2870
  10000/ 200000: 2.4521
  20000/ 200000: 2.0847
  30000/ 200000: 2.1838
  40000/ 200000: 2.1515
  50000/ 200000: 2.2246
  60000/ 200000: 1.9450
  70000/ 200000: 2.2514
  80000/ 200000: 2.4420
  90000/ 200000: 2.0624
 100000/ 200000: 2.5850
 110000/ 200000: 2.3225
 120000/ 200000: 2.2004
 130000/ 200000: 2.0352
 140000/ 200000: 1.8516
 150000/ 200000: 2.0424
 160000/ 200000: 2.2229
 170000/ 200000: 2.0384
 180000/ 200000: 2.2274
 190000/ 200000: 2.0901

Final visualization:

Viz 1:

Viz 1 final

Viz 2:

Viz 2 final

Viz 3:

Viz 3 final

Viz 4:

Viz 4 final

The final loss on train/val:

Show the code
@torch.no_grad() # this decorator disables gradient tracking
def split_loss(split):
  x,y = {
    'train': (Xtr, Ytr),
    'val': (Xdev, Ydev),
    'test': (Xte, Yte),
  }[split]
  emb = C[x] # (N, block_size, n_embd)
  x = emb.view(emb.shape[0], -1) # concat into (N, block_size * n_embd)
  for layer in layers:
    x = layer(x)
  loss = F.cross_entropy(x, y)
  print(split, loss.item())

# put layers into eval mode
for layer in layers:
  layer.training = False
split_loss('train')
split_loss('val')
train 2.103635549545288
val 2.1365904808044434

Sample from the model:

Show the code
# sample from the model
g = torch.Generator().manual_seed(2147483647 + 10)

for _ in range(20):
    
    out = []
    context = [0] * block_size # initialize with all ...
    while True:
      # forward pass the neural net
      emb = C[torch.tensor([context])] # (1,block_size,n_embd)
      x = emb.view(emb.shape[0], -1) # concatenate the vectors
      for layer in layers:
        x = layer(x)
      logits = x
      probs = F.softmax(logits, dim=1)
      # sample from the distribution
      ix = torch.multinomial(probs, num_samples=1, generator=g).item()
      # shift the context window and track the samples
      context = context[1:] + [ix]
      out.append(ix)
      # if we sample the special '.' token, break
      if ix == 0:
        break
    
    print(''.join(itos[i] for i in out)) # decode and print the generated word
mona.
mayannielle.
dhryah.
rethan.
ejdraeg.
adelynnelin.
shi.
jen.
edelisson.
arleigh.
malaia.
nosadbergiaghiel.
kinde.
jennex.
terofius.
kaven.
jamyleyeh.
yuma.
myston.
azhil.

Happy learning!

Exercises:

  • E01: I did not get around to seeing what happens when you initialize all weights and biases to zero. Try this and train the neural net. You might think either that 1) the network trains just fine or 2) the network doesn’t train at all, but actually it is 3) the network trains but only partially, and achieves a pretty bad final performance. Inspect the gradients and activations to figure out what is happening and why the network is only partially training, and what part is being trained exactly.
  • E02: BatchNorm, unlike other normalization layers like LayerNorm/GroupNorm etc. has the big advantage that after training, the batchnorm gamma/beta can be “folded into” the weights of the preceeding Linear layers, effectively erasing the need to forward it at test time. Set up a small 3-layer MLP with batchnorms, train the network, then “fold” the batchnorm gamma/beta into the preceeding Linear layer’s W,b by creating a new W2, b2 and erasing the batch norm. Verify that this gives the same forward pass during inference. i.e. we see that the batchnorm is there just for stabilizing the training, and can be thrown out after training is done! pretty cool.
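For E02, a hedged sketch of the folding algebra (my own, assuming a Linear layer followed by BatchNorm1d with stored running statistics; fold_bn_into_linear is a hypothetical helper name, not from the lecture):

import torch

def fold_bn_into_linear(W, b, gamma, beta, running_mean, running_var, eps=1e-5):
    # y = gamma * ((x @ W + b) - mean) / sqrt(var + eps) + beta  ==  x @ W_fold + b_fold
    scale = gamma / torch.sqrt(running_var + eps)   # per-output-feature scale
    W_fold = W * scale                              # broadcasts over the fan_out columns of (fan_in, fan_out)
    b_fold = (b - running_mean) * scale + beta
    return W_fold, b_fold

# quick check with random shapes (fan_in=30, fan_out=200)
W, b = torch.randn(30, 200), torch.randn(200)
gamma, beta = torch.randn(200), torch.randn(200)
mu, var = torch.randn(200), torch.rand(200) + 0.5
x = torch.randn(8, 30)
ref = gamma * ((x @ W + b - mu) / torch.sqrt(var + 1e-5)) + beta
Wf, bf = fold_bn_into_linear(W, b, gamma, beta, mu, var)
print(torch.allclose(ref, x @ Wf + bf, atol=1e-5))  # True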

resources:

  1. others learning from AK, like me: https://bedirtapkan.com/posts/blog_posts/karpathy_3_makemore_activations/; https://skeptric.com/index.html#category=makemore - a replication (?) with more OOP on another dataset;
  2. some good papers recommended by Andrej:
  3. Notebook: https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part3_bn.ipynb