NN-Z2H Lesson 3: Building makemore part 2 - MLP

implement a multilayer perceptron (MLP) character-level language model, introduce model training, learning rate tuning, hyperparameters, evaluation, train/dev/test splits, under/overfitting, etc.
til
python
andrej karpathy
nn-z2h
neural networks
Author
Published

November 20, 2024

Modified

November 20, 2024

This is not original content!

These are my study notes and code, following along with Andrej Karpathy’s “Neural Networks: Zero to Hero” series.

In the previous lecture, we built a simple bigram character-level language model using two different approaches: (1) counting, and (2) a one-layer neural network. They produced the same result (both poor, since the context is only one character), but the neural network approach offers more flexibility, so we can make the model more complex to get better performance.

In this lecture we are going to implement the (now roughly 20-year-old) neural probabilistic language model by Bengio et al. (2003).

PART 1: intro to MLP

Bengio et al. 2003 (MLP language model) paper walkthrough

Summary

Problem Statement:

  • Traditional n-gram language models suffer from the curse of dimensionality: they can’t effectively generalize to word sequences not seen in training data;
  • The core issue is treating words as atomic units with no inherent similarity to each other;
  • For example, if we’ve seen “dog is eating” in training but never “cat is eating”, n-gram models can’t leverage the similarity between “dog” and “cat”;
  • This leads to poor probability estimates for rare or unseen word sequences.

Solution:

  • Learn a distributed representation (embedding) for each word in a continuous vector space where similar words are close to each other;
  • Use a neural network architecture with:
    • Input layer: concatenated embeddings of n-1 previous words;
    • Hidden layer: dense neural network with tanh activation;
    • Output layer: softmax over entire vocabulary to predict next word probability.

The model simultaneously learns:

  • Word feature vectors (embeddings) that capture semantic/syntactic word similarities;
  • Neural network parameters that combine these features to estimate probability distributions.

Key advantages:

  • Words with similar meanings get similar feature vectors, enabling better generalization;
  • The probability function is smooth with respect to word embeddings, so similar words yield similar predictions;
  • Can generalize to unseen sequences by leveraging learned word similarities.

Methodology:

  • Traditional Problem:

    • In n-gram models, each word sequence of length n is a separate parameter;
    • For vocabulary size \(|V|\), need \(|V|^n\) parameters;
    • Most sequences never appear in training, leading to poor generalization;
  • Solution via Distributed Representation:

    • Each word is mapped to a dense vector in \(R^m\) (typically \(m = 50\)–\(100\));
    • Similar words get similar vectors through training;
    • The probability function is smooth w.r.t. these vectors;
    • Key benefit: if “dog” and “cat” have similar vectors, the model can generalize from “dog is eating” to “cat is eating”;
    • The number of parameters reduces to \(O(|V| \times m + m \times h + h \times |V|)\), where \(h\) is the hidden layer size;
    • This is much smaller than \(|V|^n\) and allows better generalization (a quick numeric check follows this list);
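
As a quick concrete check (my own arithmetic, using the shapes of the character-level model we build later in this post: \(|V| = 27\), \(m = 2\), a context of \(n-1 = 3\) characters, and \(h = 100\) hidden units), the parameter count works out to the 3,481 parameters we will see below:

V, m, ctx, h = 27, 2, 3, 100      # vocabulary size, embedding dim, context length, hidden size

n_params = (
    V * m              # lookup table C
    + ctx * m * h + h  # hidden layer weights W1 and bias b1
    + h * V + V        # output layer weights W2 and bias b2
)
print(n_params)        # 3481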

Neural architecture:

Input Layer:

  • Takes \(n-1\) previous words (context window);
  • Each word \(i\) is mapped to a vector \(C(i) \in R^m\) via a lookup table;
  • Concatenates these vectors: \(x = [C(w_{t-n+1}), \dots, C(w_{t-1})]\);
  • \(x\) has dimension \((n-1) \times m\);

Hidden Layer:

  • Dense layer with tanh activation;
  • Computation: \(h = tanh(d + Hx)\);
  • \(H\) is weight matrix, \(d\) is bias vector;
  • Maps concatenated context to hidden representation;

Output Layer:

  • Computes probability distribution over all words;
  • \(y = b + Wx + Uh\);
  • Softmax activation: \(P(w_t \mid \text{context}) = \exp(y_{w_t}) / \sum_j \exp(y_j)\);
  • \(W\) provides “shortcut” connections from input to output;
  • Direct connection helps learn simpler patterns;

Training:

  • Maximizes log-likelihood of training data;
  • Uses stochastic gradient descent;
  • Learns both word vectors \(C(i)\) and neural network parameters \((H, d, W, U, b)\);
  • Word vectors capture similarities as they help predict similar contexts;
  • Can initialize word vectors randomly or with pretrained vectors.

Neural language model proposed by Bengio et al. (2003). \(C(i)\) is the \(i\)-th word embedding.
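
To make the architecture concrete, below is a minimal sketch (my own, with illustrative sizes, not the paper’s) of the forward pass described above, including the direct input-to-output connection \(W\) that the simpler character-level model later in this post omits:

import torch
import torch.nn.functional as F

V, m, n_ctx, h = 27, 2, 3, 100        # illustrative sizes

C = torch.randn(V, m)                  # word feature vectors C(i)
H = torch.randn(n_ctx * m, h)          # hidden weights (stored transposed so we can write x @ H)
d = torch.randn(h)                     # hidden bias
U = torch.randn(h, V)                  # hidden-to-output weights
W = torch.randn(n_ctx * m, V)          # direct input-to-output ("shortcut") weights
b = torch.randn(V)                     # output bias

idx = torch.randint(0, V, (4, n_ctx))  # a dummy batch of 4 contexts of word indices
x = C[idx].view(4, -1)                 # concatenated embeddings, shape (4, n_ctx * m)
hid = torch.tanh(d + x @ H)            # h = tanh(d + Hx)
y = b + x @ W + hid @ U                # y = b + Wx + Uh
probs = F.softmax(y, dim=1)            # P(w_t | context), each row sums to 1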

(re-)building our training dataset

Loading library, reading data, building dictionary:

Show the code
import torch
import torch.nn.functional as F
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Show the code
import pandas as pd

url = "https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt"
words = pd.read_csv(url, header=None).iloc[:, 0].tolist()
words[:8]
['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']
Show the code
len(words)
32033
Show the code
# build the vocabulary of characters and mapping to/from integer
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i, s in enumerate(chars)}
stoi['.'] = 0

itos = {i: s for s, i in stoi.items()}
itos
{1: 'a',
 2: 'b',
 3: 'c',
 4: 'd',
 5: 'e',
 6: 'f',
 7: 'g',
 8: 'h',
 9: 'i',
 10: 'j',
 11: 'k',
 12: 'l',
 13: 'm',
 14: 'n',
 15: 'o',
 16: 'p',
 17: 'q',
 18: 'r',
 19: 's',
 20: 't',
 21: 'u',
 22: 'v',
 23: 'w',
 24: 'x',
 25: 'y',
 26: 'z',
 0: '.'}

Building the dataset:

Show the code
block_size = 3 # the context length: how many characters do we take to predict the next one?
X, Y = [], []

for w in words[:5]:
    print(w)
    context = [0] * block_size # 0 so context will be padded by '.'
    for ch in w + '.':
        ix = stoi[ch]
        X.append(context)
        Y.append(ix)
        print(''.join(itos[i] for i in context), '----->', itos[ix] )
        context = context[1:] + [ix] # rolling to the next one

X = torch.tensor(X)
Y = torch.tensor(Y)
emma
... -----> e
..e -----> m
.em -----> m
emm -----> a
mma -----> .
olivia
... -----> o
..o -----> l
.ol -----> i
oli -----> v
liv -----> i
ivi -----> a
via -----> .
ava
... -----> a
..a -----> v
.av -----> a
ava -----> .
isabella
... -----> i
..i -----> s
.is -----> a
isa -----> b
sab -----> e
abe -----> l
bel -----> l
ell -----> a
lla -----> .
sophia
... -----> s
..s -----> o
.so -----> p
sop -----> h
oph -----> i
phi -----> a
hia -----> .
Show the code
X.shape, X.dtype, Y.shape, Y.dtype
(torch.Size([32, 3]), torch.int64, torch.Size([32]), torch.int64)

implementing the embedding lookup table

In the paper they cram 17k words into a space with as few as 30 dimensions; for our data, we just cram the 27 characters into a 2D space.

Show the code
C = torch.randn((27, 2))

We can access elements of a torch.Tensor in several ways:

Show the code
C[5] # the index can be an integer, a list [5, 6, 7], or a torch.tensor([5, 6, 7])
# > tensor([1.0825, 0.2010])

# or

F.one_hot(torch.tensor(5), num_classes=27).float() @ C
# produces the identical result; remember torch.tensor() infers long dtype (int64), so we need to cast the one-hot to float
tensor([ 0.3055, -0.4069])

…but in this lecture, indexing with C[5] is sufficient. We can even index with a tensor of more than one dimension:

Show the code
print(C[X].shape)
print(X[13, 2]) # the element at row 13, column 2 of X is the integer 1
print(C[X][13,2]) # will be the embedding of that element
print(C[1]) # so C[X][13,2] = C[1]
torch.Size([32, 3, 2])
tensor(1)
tensor([0.0748, 0.8711])
tensor([0.0748, 0.8711])

PyTorch indexing makes this embedding lookup effortless:

Show the code
emb = C[X]
emb.shape
torch.Size([32, 3, 2])

We’ve completed the first layer: the context fed through the lookup table!

implementing the hidden layer + internals of torch.Tensor: storage, views

Show the code
# input of the tanh layer will be 6 (3 context characters x 2 embedding dimensions)
# and the number of neurons is up to us - let's set it to 100
W1 = torch.randn((6, 100))
b1 = torch.randn(100)

Now we need to do something like emb @ W1 + b1, but emb.shape is [32, 3, 2] and W1.shape is [6, 100]. We need to somehow concatenate/transform:

Show the code
# emb[:, 0, :] is the embedding of the first character of each context, shape is [32, 2]
# concatenate 3 of them along the 2nd dimension (index 1) -> so we set dim = 1
torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], dim=1).shape
torch.Size([32, 6])

However, this code does not adapt when we change the block size. We can use torch.unbind():

Show the code
# this is good!
torch.cat(torch.unbind(emb, 1), 1).shape
# new memory for storage is created, so it is not efficient
torch.Size([32, 6])

This works, but there is a better and more efficient way to do it, because:

  • every torch.Tensor has a .storage(), which is a one-dimensional vector holding its elements;
  • when we call .view(), we only instruct PyTorch how that underlying vector is interpreted;
  • no memory is changed/copied/moved/created: the storage stays identical (a quick check follows below).

Read more: http://blog.ezyang.com/2019/05/pytorch-internals/
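
A tiny check (my own, not from the lecture) that .view() reuses the same storage instead of copying:

import torch

a = torch.arange(6)                   # storage: [0, 1, 2, 3, 4, 5]
b = a.view(2, 3)                      # same storage, read as a 2x3 matrix
print(a.data_ptr() == b.data_ptr())   # True - no new memory was allocated
b[0, 0] = 99
print(a)                              # tensor([99,  1,  2,  3,  4,  5]) - a sees the change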

So this hidden layer can be declared:

Show the code
# instead of hard-coding 32 we can write emb.shape[0], or -1 (PyTorch infers whatever fits)
h = emb.view(-1, 6) @ W1 + b1
h.shape
torch.Size([32, 100])

Notice that in the final operation, b1 will be broadcasted.
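
A quick illustration of that broadcast (using the emb, W1, b1 defined above): b1 of shape (100,) is aligned from the right against (32, 100) and effectively copied across the 32 rows:

print((emb.view(-1, 6) @ W1).shape)          # torch.Size([32, 100])
print(b1.shape)                              # torch.Size([100])
# (32, 100) + (100,): b1 is treated as (1, 100) and broadcast over the batch dimension
print((emb.view(-1, 6) @ W1 + b1).shape)     # torch.Size([32, 100])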

implementing the output layer

Show the code
W2 = torch.randn((100, 27))
b2 = torch.randn(27)

In deep learning, people use the term logits for the raw outputs, which range from negative infinity to positive infinity.

Show the code
logits = h @ W2 + b2
Show the code
logits.shape
torch.Size([32, 27])

Now we need to exponentiate the logits and normalize to get probabilities.

Show the code
counts = logits.exp()
Show the code
probs = counts / counts.sum(1, keepdims=True)
Show the code
probs.shape
torch.Size([32, 27])

Every row of probs sums to 1.

Show the code
probs[0].sum()
tensor(1.)

And these are the probabilities the current (untrained) network assigns to each ground-truth label in Y:

Show the code
probs[torch.arange(32), Y]
tensor([2.0660e-21, 3.6184e-15, 1.5675e-11, 1.2796e-08, 4.4312e-02, 2.4003e-12,
        3.5811e-13, 1.3876e-18, 3.3465e-14, 2.5158e-22, 8.3230e-35, 3.4999e-08,
        7.5305e-10, 6.8868e-24, 1.8081e-28, 8.6262e-08, 4.0514e-18, 4.2847e-19,
        4.9013e-15, 1.0952e-10, 8.4563e-11, 2.1141e-26, 4.4209e-22, 6.9570e-30,
        3.9779e-10, 3.2419e-13, 2.2802e-07, 6.5380e-23, 3.0035e-37, 0.0000e+00,
        0.0000e+00, 3.2723e-26])

The result is not good, as we have not trained the network yet!

implementing the negative log likelihood loss

We define the negative log likelihood as:

Show the code
loss = - probs[torch.arange(32), Y].log().mean()
loss
tensor(inf)

summary of the full network

Dataset:

Show the code
X.shape, Y.shape
(torch.Size([32, 3]), torch.Size([32]))

Neural network layers:

Show the code
g = torch.Generator().manual_seed(2147483647)
C = torch.randn((27, 2), generator=g)
W1 = torch.randn((6, 100), generator=g)
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100, 27), generator=g)
b2 = torch.randn(27, generator=g)

parameters = [C, W1, b1, W2, b2]

Size of the network:

Show the code
sum(p.nelement() for p in parameters)
3481

Constructing forward pass:

Show the code
emb = C[X] # (32, 3, 2)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
logits = h @ W2 + b2 # (32, 27)
counts = logits.exp()
probs = counts / counts.sum(1, keepdims=True)
loss = - probs[torch.arange(32), Y].log().mean()
loss
tensor(17.7697)

PART 2: intro to many basics of machine learning

introducing F.cross_entropy and why

We re-define loss:

Show the code
loss = F.cross_entropy(logits, Y)
loss
tensor(17.7697)

Why?

  • PyTorch creates an intermediate tensor for every assignment (counts, probs), which costs more memory;
  • The backward pass is more efficient, because PyTorch can simplify the fused expression analytically instead of backpropagating through each intermediate step;
  • Cross entropy is numerically much better behaved: when we exponentiate a large positive logit we get inf, so PyTorch’s cross entropy internally subtracts the maximum logit first, which does not change the resulting probabilities (see the sketch below).
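
A small sketch of that last point (my own example values): subtracting the max logit before exponentiating leaves the probabilities unchanged but avoids overflow:

import torch
import torch.nn.functional as F

logits_demo = torch.tensor([-5.0, 0.0, 100.0])

naive = logits_demo.exp() / logits_demo.exp().sum()   # exp(100) overflows float32 to inf -> nan
shifted = logits_demo - logits_demo.max()             # shifting does not change the softmax result
stable = shifted.exp() / shifted.exp().sum()

print(naive)                                          # tensor([0., 0., nan]) - broken
print(stable)                                         # ~tensor([0., 0., 1.]) - well behaved
print(F.cross_entropy(logits_demo.view(1, -1), torch.tensor([2])))  # ~0, no overflow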

implementing the training loop, overfitting one batch

So the forward pass, backward pass, and update loop will be implemented as below:

Show the code
for p in parameters:
    p.requires_grad = True
Show the code
for _ in range(10):
    # forward pass:
    emb = C[X] # (32, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
    logits = h @ W2 + b2 # (32, 27)
    loss = F.cross_entropy(logits, Y)
    print(loss.item())
    # backward pass:
    for p in parameters:
        p.grad = None
    loss.backward()
    # update
    for p in parameters:
        p.data += -0.1 * p.grad

print(loss.item())
17.76971435546875
13.656400680541992
11.298768997192383
9.452457427978516
7.984262466430664
6.891321182250977
6.100014686584473
5.452036380767822
4.898152828216553
4.4146647453308105
4.4146647453308105

We are fitting 32 examples with a neural net of 3,481 parameters, so it is very easy to overfit. We get a low final loss, but it will never reach 0, because the same input can map to different outputs; for example, the starting context ... is followed by a different first character in each of the 5 names (see the quick check below).

Show the code
logits.max(1)
torch.return_types.max(
values=tensor([10.7865, 12.2558, 17.3982, 13.2739, 10.6965, 10.7865,  9.5145,  9.0495,
        14.0280, 11.8378,  9.9038, 15.4187, 10.7865, 10.1476,  9.8372, 11.7660,
        10.7865, 10.0029,  9.2940,  9.6824, 11.4241,  9.4885,  8.1164,  9.5176,
        12.6383, 10.7865, 10.6021, 11.0822,  6.3617, 17.3157, 12.4544,  8.1669],
       grad_fn=<MaxBackward0>),
indices=tensor([ 1,  8,  9,  0, 15,  1, 17,  2,  9,  9,  2,  0,  1, 15,  1,  0,  1, 19,
         1,  1, 16, 10, 26,  9,  0,  1, 15, 16,  3,  9, 19,  1]))

training on the full dataset, minibatches

We can now apply our code to the whole dataset; unfold the code block below to see the full code.

Show the code
block_size = 3
X, Y = [], []

# Dataset
for w in words:
    # print(w)
    context = [0] * block_size
    for ch in w + '.':
        ix = stoi[ch]
        X.append(context)
        Y.append(ix)
        # print(''.join(itos[i] for i in context), '----->', itos[ix] )
        context = context[1:] + [ix] # rolling to the next one

# Input and ground truth
X = torch.tensor(X)
Y = torch.tensor(Y)
print("Data size", X.shape, Y.shape)

# Lookup table
g = torch.Generator().manual_seed(2147483647)
C = torch.randn((27, 2), generator=g)
emb = C[X] # (32, 3, 2)

# Layer 1 - tanh
W1 = torch.randn((6, 100), generator=g)
b1 = torch.randn(100, generator=g)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)

# Layer 2 - softmax
W2 = torch.randn((100, 27), generator=g)
b2 = torch.randn(27, generator=g)
logits = h @ W2 + b2 # (32, 27)
loss = F.cross_entropy(logits, Y)

# All params
parameters = [C, W1, b1, W2, b2]
print("No of params: ", sum(p.nelement() for p in parameters))

# Pre-training
for p in parameters:
    p.requires_grad = True
Data size torch.Size([228146, 3]) torch.Size([228146])
No of params:  3481

We notice that each iteration of the training loop takes quite a long time. In practice, we perform the forward/backward pass and parameter update on a small minibatch of the dataset. The minibatch construction is added/modified on the lines of code marked with #👈.

Read more: https://nttuan8.com/bai-10-cac-ky-thuat-co-ban-trong-deep-learning/

Show the code
# Training
for _ in range(10000):
    # minibatch construct                                           #👈
    ix = torch.randint(0, X.shape[0], (32,))                        #👈

    # forward pass:
    emb = C[X[ix]] # (32, 3, 2)                                     #👈
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
    logits = h @ W2 + b2 # (32, 27)
    loss = F.cross_entropy(logits, Y[ix])                           #👈
    if _ >= 9990: print(f"___after running {_} time: ", loss.item())
    # backward pass:
    for p in parameters:
        p.grad = None
    loss.backward()
    # update
    for p in parameters:
        p.data += -0.1 * p.grad

print("final minibatch loss: ", loss.item())
___after running 9990 time:  2.4769816398620605
___after running 9991 time:  2.719482660293579
___after running 9992 time:  2.5886943340301514
___after running 9993 time:  2.2563774585723877
___after running 9994 time:  2.584904432296753
___after running 9995 time:  2.8459784984588623
___after running 9996 time:  2.4365286827087402
___after running 9997 time:  2.2122902870178223
___after running 9998 time:  2.440680742263794
___after running 9999 time:  2.183750867843628
final minibatch loss:  2.183750867843628

The loss decreases much, much faster, even though the gradient computed on a minibatch does not point in exactly the same direction as the gradient on the whole dataset. It is a good enough approximation. Notice that the loss on a minibatch is not the loss on the whole dataset.

Show the code
emb = C[X] # (32, 3, 2)                                    
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
logits = h @ W2 + b2 # (32, 27)
loss = F.cross_entropy(logits, Y)   
loss.item()
2.492901086807251

We achieved a loss of about 2.18 on the final minibatch and about 2.49 on the whole dataset.

finding a good initial learning rate

Before continuing the optimization, let’s focus on how much we update the parameters from the gradient in p.data += -0.1 * p.grad. We do not know whether we are stepping too little or too much.

We can create 1,000 candidate learning rates, spaced exponentially between 0.001 and 1, use a different one at each training step, and see which range offers the most stable convergence.

Show the code
lre = torch.linspace(-3, 0, 1000)
lrs = 10**lre

Reset the code:

Show the code
block_size = 3
X, Y = [], []

# Dataset
for w in words:
    # print(w)
    context = [0] * block_size
    for ch in w + '.':
        ix = stoi[ch]
        X.append(context)
        Y.append(ix)
        # print(''.join(itos[i] for i in context), '----->', itos[ix] )
        context = context[1:] + [ix] # rolling to the next one

# Input and ground truth
X = torch.tensor(X)
Y = torch.tensor(Y)
print("Data size", X.shape, Y.shape)

# Lookup table
g = torch.Generator().manual_seed(2147483647)
C = torch.randn((27, 2), generator=g)
emb = C[X] # (32, 3, 2)

# Layer 1 - tanh
W1 = torch.randn((6, 100), generator=g)
b1 = torch.randn(100, generator=g)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)

# Layer 2 - softmax
W2 = torch.randn((100, 27), generator=g)
b2 = torch.randn(27, generator=g)
logits = h @ W2 + b2 # (32, 27)
loss = F.cross_entropy(logits, Y)

# All params
parameters = [C, W1, b1, W2, b2]
print("No of params: ", sum(p.nelement() for p in parameters))

# Pre-training
for p in parameters:
    p.requires_grad = True
Data size torch.Size([228146, 3]) torch.Size([228146])
No of params:  3481

Training and tracking stats:

Show the code
lri = []
lossi = []

for i in range(1000):
    # minibatch construct                                           
    ix = torch.randint(0, X.shape[0], (32,))                        
    # forward pass:
    emb = C[X[ix]] # (32, 3, 2)                                    
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
    logits = h @ W2 + b2 # (32, 27)
    loss = F.cross_entropy(logits, Y[ix])                           
    # backward pass:
    for p in parameters:
        p.grad = None
    loss.backward()
    # update
    lr = lrs[i]
    for p in parameters:
        p.data += - lr * p.grad

    # track stats
    lri.append(lre[i])
    lossi.append(loss.item())

loss.item()
8.216608047485352

Plotting the loss against the learning-rate exponent, we see that a good exponent turns out to be around -1. \(10^{-1}\) is 0.1, so our initial guess seems good.

Show the code
plt.plot(lri, lossi)

splitting up the dataset into train/val/test splits and why

Now we could keep lengthening the training loop to continue decreasing the loss, and try techniques like decaying the learning rate to 0.001 after 20k-30k steps of training at 0.1.

But we risk overfitting if we keep training, or increase the size of the network, just to achieve a lower loss. The model would simply memorize our training set verbatim, so sampling from it would give back the same things that are in the dataset, and the loss on any other dataset could be very high.

So the industry standard is to split the dataset into 3 pieces: (1) training set; (2) dev/validation set; and (3) test set, roughly 80% - 10% - 10% respectively.

  1. Training split: train the parameters;
  2. Dev/validation split: tune the hyperparameters (size of hidden layer, size of embedding, strength of regularization, etc.);
  3. Test split: evaluate the performance of the model at the end; we use it only a very few times, otherwise we start learning from it and end up overfitting to it as well.

We are going to implement this train/dev/test splits:

Show the code
# build the dataset
def build_dataset(words):
    block_size = 3
    X, Y = [], []

    for w in words:
        context = [0] * block_size
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]

    X = torch.tensor(X)
    Y = torch.tensor(Y)
    print(X.shape, Y.shape)
    return X, Y

import random
random.seed(42)
random.shuffle(words)
n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))

Xtr, Ytr = build_dataset(words[:n1])
Xdev, Ydev = build_dataset(words[n1:n2])
Xte, Yte = build_dataset(words[n2:])
torch.Size([182625, 3]) torch.Size([182625])
torch.Size([22655, 3]) torch.Size([22655])
torch.Size([22866, 3]) torch.Size([22866])

Now we’re ready to train on the splits of the dataset, but let’s hold on while we are talking about overfitting. As discussed, overfitting also comes from using a complex model (too many parameters) on a small dataset.

Our dataset has roughly 228k examples, while the network has only 3,481 parameters. So we are still underfitting; let’s continue to grow our neural network.

Two things to consider here:

  • the size of the tanh hidden layer; and
  • the number of dimensions of the embedding space.

visualizing the loss, character embeddings

First we want to see:

  • how the loss decreases over a 200k-step training loop with the current network settings, with the learning rate decayed to 0.01 after the first 100k steps; and
  • how the current character embeddings capture the similarity between characters in the (2D) space.

Training on the Xtr, Ytr:

Show the code
# Lookup table
g = torch.Generator().manual_seed(2147483647)
C = torch.randn((27, 2), generator=g)

# Layer 1 - tanh
W1 = torch.randn((6, 100), generator=g)
b1 = torch.randn(100, generator=g)

# Layer 2 - softmax
W2 = torch.randn((100, 27), generator=g)
b2 = torch.randn(27, generator=g)

# All params
parameters = [C, W1, b1, W2, b2]
print("No of params: ", sum(p.nelement() for p in parameters))

# Pre-training
for p in parameters:
    p.requires_grad = True

# Stats holders
lossi = []
stepi = []

# Training on Xtr, Ytr
for i in range(200_000):
    # minibatch construct                                           
    ix = torch.randint(0, Xtr.shape[0], (32,))                       #👈
    # forward pass:
    emb = C[Xtr[ix]] # (32, 3, 2)                                    #👈
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
    logits = h @ W2 + b2 # (32, 27)
    loss = F.cross_entropy(logits, Ytr[ix])                          #👈
    # backward pass:
    for p in parameters:
        p.grad = None
    loss.backward()
    # update
    lr = 0.1 if i <= 100_000 else 0.01                                #👈
    for p in parameters:
        p.data += - lr * p.grad

    # track stats
    lossi.append(loss.item())
    stepi.append(i)

print("Loss on minibatch: ", loss.item())
No of params:  3481
Loss on minibatch:  2.1069579124450684

Loss on whole training dataset:

Show the code
emb = C[Xtr] # (32, 3, 2)                                    
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
logits = h @ W2 + b2 # (32, 27)
loss = F.cross_entropy(logits, Ytr)   
loss.item()
2.259035110473633

Loss on the dev/validation dataset: it is not much different from the training loss because the model is still underfitting, so it still generalizes:

Show the code
emb = C[Xdev] # (32, 3, 2)                                    
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
logits = h @ W2 + b2 # (32, 27)
loss = F.cross_entropy(logits, Ydev)   
loss.item()
2.257272958755493

Visualizing the loss, we can see it oscillating significantly because the batch size is still small (32); a smoothed view of the curve is sketched right after the plot.

Show the code
plt.plot(stepi, lossi)
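
To see the trend through that noise, one option (my own addition, not from the lecture) is to average the minibatch losses over chunks of steps:

import torch
import matplotlib.pyplot as plt

smoothed = torch.tensor(lossi).view(-1, 1000).mean(dim=1)   # mean loss per 1,000-step chunk
plt.plot(smoothed.tolist())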

Visualizing the character embeddings, we can see the model clusters similar characters, e.g. the vowels a, e, i, o, u.

Show the code
plt.figure(figsize=(8,8))
plt.scatter(C[:,0].data, C[:, 1].data, s=200)
for i in range(C.shape[0]):
    plt.text(C[i,0].item(),C[i,1].item(), itos[i], ha="center", va="center", color="white")
plt.grid('minor')

experiment: larger hidden layer, larger embedding size

Now we can experiment with a larger hidden layer (300 neurons) and a larger embedding size (10 dimensions). Below is the whole code:

Show the code
# hyper-parameters
block_size = 3 # number of context characters used to predict the next one
no_chars = 27 # number of possible characters, including '.'
emb_size = 10 # number of dimensions of the embedding space
hidden_size = 300 # size of the hidden (tanh) layer
batch_size = 32 # minibatch size for training: 2, 4, 8, 16, 32, 64, etc.

# build the dataset
def build_dataset(words):

    X, Y = [], []

    for w in words:
        context = [0] * block_size
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]

    X = torch.tensor(X)
    Y = torch.tensor(Y)
    print(X.shape, Y.shape)
    return X, Y

import random
random.seed(42)
random.shuffle(words)
n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))

# 80 - 10 - 10 splits
Xtr, Ytr = build_dataset(words[:n1])
Xdev, Ydev = build_dataset(words[n1:n2])
Xte, Yte = build_dataset(words[n2:])

# Lookup table - 10 dimensional space
g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((no_chars, emb_size), generator=g)

# Layer 1 - tanh - 300 neurons
W1 = torch.randn((block_size * emb_size, hidden_size), generator=g)
b1 = torch.randn(hidden_size, generator=g)

# Layer 2 - softmax
W2 = torch.randn((hidden_size, no_chars), generator=g)
b2 = torch.randn(no_chars, generator=g)

# All params
parameters = [C, W1, b1, W2, b2]
print("No of params: ", sum(p.nelement() for p in parameters))

# Pre-training
for p in parameters:
    p.requires_grad = True

# Stats holders
lossi = []
stepi = []

# Training on Xtr, Ytr
for i in range(200_000):
    # minibatch construct                                           
    ix = torch.randint(0, Xtr.shape[0], (batch_size,))                      
    # forward pass:
    emb = C[Xtr[ix]]                                                
    h = torch.tanh(emb.view(-1, block_size * emb_size) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Ytr[ix]) 
    # backward pass:
    for p in parameters:
        p.grad = None
    loss.backward()
    # update
    lr = 0.1 if i <= 100_000 else 0.01 
    for p in parameters:
        p.data += - lr * p.grad

    # track stats
    lossi.append(loss.item())
    stepi.append(i)

print("Loss on minibatch: ", loss.item())
torch.Size([182580, 3]) torch.Size([182580])
torch.Size([22767, 3]) torch.Size([22767])
torch.Size([22799, 3]) torch.Size([22799])
No of params:  17697
Loss on minibatch:  2.0859084129333496
Show the code
emb = C[Xtr]                                 
h = torch.tanh(emb.view(-1, block_size * emb_size) @ W1 + b1)
logits = h @ W2 + b2
loss = F.cross_entropy(logits, Ytr)   
print("Loss on whole training set: ", loss.item())

emb = C[Xdev]                                
h = torch.tanh(emb.view(-1, block_size * emb_size) @ W1 + b1)
logits = h @ W2 + b2
loss = F.cross_entropy(logits, Ydev)   
print("Loss on dev/validation set: ", loss.item())

emb = C[Xte]                                
h = torch.tanh(emb.view(-1, block_size * emb_size) @ W1 + b1)
logits = h @ W2 + b2
loss = F.cross_entropy(logits, Yte)   
print("Loss on test set: ", loss.item())
Loss on whole training set:  2.117095947265625
Loss on dev/validation set:  2.1767637729644775
Loss on test set:  2.1748299598693848

summary of our final code, conclusion

Show the code
plt.plot(stepi, lossi)

Show the code
plt.figure(figsize=(8,8))
plt.scatter(C[:,0].data, C[:, 1].data, s=200)
for i in range(C.shape[0]):
    plt.text(C[i,0].item(),C[i,1].item(), itos[i], ha="center", va="center", color="white")
plt.grid('minor')

We can see that the losses on the validation set and test set are quite similar, since we have not tried many scenarios to calibrate/tune the hyperparameters. Both sets are therefore equally "surprising" to the model trained on Xtr.

We still have room for improvement!

sampling from the model

But our network can now generate more name-like names!

Show the code
g = torch.Generator().manual_seed(2147483647 + 10)

for _ in range(20):
    
    out = []
    context = [0] * block_size # initialize with all ...
    while True:
      emb = C[torch.tensor([context])] # (1,block_size,d)
      h = torch.tanh(emb.view(1, -1) @ W1 + b1)
      logits = h @ W2 + b2
      probs = F.softmax(logits, dim=1)
      ix = torch.multinomial(probs, num_samples=1, generator=g).item()
      context = context[1:] + [ix]
      out.append(ix)
      if ix == 0:
        break
    
    print(''.join(itos[i] for i in out))
eria.
kayanniee.
mad.
rylle.
evers.
endra.
kalie.
kaillie.
shivonna.
keisenna.
araelyzion.
kalin.
shubergenghies.
kindrendy.
pan.
puon.
ubertedir.
yarleyel.
yule.
myshelda.

google collab (new!!) notebook advertisement

Colab link: https://colab.research.google.com/drive/1YIfmkftLrz6MPTOO9Vwqrop2Q5llHIGK?usp=sharing

Thanks Andrej!

resources

  1. A Neural Probabilistic Language Model, Bengio et al. (2003)
  2. Video lecture
  3. Notebook
  4. makemore on Github
  5. torch.Tensor() documentation