Long time no LLMs; over the past months, job searching and personal matters distracted me from learning AI. Now I am continuing my favorite AI series, Neural Networks: Zero to Hero by Andrej Karpathy.
I will be learning to build a GPT from scratch. These are my notes, and I hope I'll survive :)
1 intro: ChatGPT, Transformers, nanoGPT, Shakespeare
- ChatGPT (GPT stands for Generative Pre-trained Transformer) is a probabilistic system: for any prompt it can give us many different answers. It models a sequence of words (or tokens) and tries to predict the next one to complete the prompt we give it;
- Transformers, proposed by Vaswani et al. in the landmark 2017 paper “Attention is All You Need”, is the architecture that does all the heavy lifting under the hood of ChatGPT;
- nanoGPT is “the simplest, fastest repository for training/finetuning medium-sized GPTs”, written as a side project by Andrej. We are going to follow its code structure in this lecture, but will not rebuild the whole 124M-parameter GPT-2;
- The tiny Shakespeare dataset will be the text data that we'll be working on.
2 baseline language modeling, code setup
2.1 reading and exploring the data
The tinyshakespeare dataset contains 1,115,394 characters; here is what the first 1,000 of them look like:
First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.
First Citizen:
You are all resolved rather to die than to famish?
All:
Resolved. resolved.
First Citizen:
First, you know Caius Marcius is chief enemy to the people.
All:
We know't, we know't.
First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?
All:
No more talking on't; let it be done: away, away!
Second Citizen:
One word, good citizens.
First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.
It contains 65 unique characters; in sorted order the first two are the newline and the space.
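A minimal sketch of how the raw text might be loaded and the character set derived (assuming the dataset has been downloaded and saved as input.txt; variable names are just illustrative):
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# the vocabulary is simply every unique character that appears in the text, sorted
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)  # 65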
This list of characters is our vocabulary. As in previous lectures, we create a mapping from characters to integer indices, stoi, and the inverse mapping, itos.
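A minimal sketch of those two mappings, assuming chars is the sorted character list from above:
stoi = { ch:i for i,ch in enumerate(chars) }  # character -> integer index
itos = { i:ch for i,ch in enumerate(chars) }  # integer index -> character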
2.2 tokenization, train/val split
The two functions encode and decode let us move from strings (for human reading) to numbers (for machine reading) and back.
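A sketch of what these two helpers might look like, built on the stoi and itos mappings above:
encode = lambda s: [stoi[c] for c in s]           # string -> list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # list of integers -> string
print(encode("hii there"))
print(decode(encode("hii there")))  # round-trips back to "hii there"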
This is a very simple tokenization/detokenization scheme; there are more complex and effective ones out in the industry, e.g., SentencePiece by Google and tiktoken by OpenAI, which implement more sophisticated paradigms like BPE (byte pair encoding).
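For a quick comparison (we will stick with our simple character-level codec in this lecture), here is roughly how a BPE tokenizer such as tiktoken would be used, assuming the tiktoken package is installed; the exact token ids depend on the chosen encoding:
import tiktoken
enc = tiktoken.get_encoding("gpt2")  # the BPE vocabulary used by GPT-2 (~50k tokens)
ids = enc.encode("hii there")        # a handful of sub-word tokens instead of one id per character
print(ids)
print(enc.decode(ids))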
Then, using our simple encode function, we can convert the whole dataset to a PyTorch tensor:
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100]) # the first 100 characters we looked at earlier will look like this to the GPT
torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44,
53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63,
1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1,
57, 54, 43, 39, 49, 8, 0, 0, 13, 50, 50, 10, 0, 31, 54, 43, 39, 49,
6, 1, 57, 54, 43, 39, 49, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47,
58, 47, 64, 43, 52, 10, 0, 37, 53, 59])
Finally, as the very first pre-processing step, we hold out the last 10% of our dataset as the validation/development split. We don't want a model that simply memorizes and regurgitates Shakespeare verbatim; we want it to be generative/creative, producing new Shakespeare-like text, and the held-out split lets us measure how much we are overfitting.
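A minimal sketch of that split, keeping the first 90% for training and the last 10% for validation:
n = int(0.9 * len(data))   # first 90% for training
train_data = data[:n]
val_data = data[n:]        # held-out last 10% to check for overfitting/memorization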
2.3 data loader, batches of chunks of data
Now the idea of our character-level language model is to predict the next character(s) given a sequence of preceding characters up to a certain maximum length, not the whole preceding part of the dataset. As in the previous lectures, we can set this length (block_size, or context_length) to 8.
The second constraint we want to set up is the batch size: how many independent chunks we feed to the transformer in parallel. Processing many chunks at once keeps the hardware busy and makes training efficient. You can picture the context and target like this:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y
xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)
print('----')
for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")
inputs:
torch.Size([4, 8])
tensor([[24, 43, 58, 5, 57, 1, 46, 43],
[44, 53, 56, 1, 58, 46, 39, 58],
[52, 58, 1, 58, 46, 39, 58, 1],
[25, 17, 27, 10, 0, 21, 1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58, 5, 57, 1, 46, 43, 39],
[53, 56, 1, 58, 46, 39, 58, 1],
[58, 1, 58, 46, 39, 58, 1, 46],
[17, 27, 10, 0, 21, 1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53, 56, 1, 58, 46] the target: 39
when input is [44, 53, 56, 1, 58, 46, 39] the target: 58
when input is [44, 53, 56, 1, 58, 46, 39, 58] the target: 1
when input is [52] the target: 58
when input is [52, 58] the target: 1
when input is [52, 58, 1] the target: 58
when input is [52, 58, 1, 58] the target: 46
when input is [52, 58, 1, 58, 46] the target: 39
when input is [52, 58, 1, 58, 46, 39] the target: 58
when input is [52, 58, 1, 58, 46, 39, 58] the target: 1
when input is [52, 58, 1, 58, 46, 39, 58, 1] the target: 46
when input is [25] the target: 17
when input is [25, 17] the target: 27
when input is [25, 17, 27] the target: 10
when input is [25, 17, 27, 10] the target: 0
when input is [25, 17, 27, 10, 0] the target: 21
when input is [25, 17, 27, 10, 0, 21] the target: 1
when input is [25, 17, 27, 10, 0, 21, 1] the target: 54
when input is [25, 17, 27, 10, 0, 21, 1, 54] the target: 39
2.4 simplest baseline: bigram language model, loss, generation
We’ll start with the simplest thing in the realm of NLP: the bigram language model, where the next character is predicted from just the single previous character.
Below we construct a subclass of nn.Module that implements a bigram language model over a vocab_size = 65 character space. The entire state of the model lives in token_embedding_table, which encodes what it thinks the next character should be given the current one. We compute a loss based on cross_entropy to quantify the quality of the predictions, and also implement a generate function to sample from the model.
#| code-fold: true
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        # a 65x65 2D tensor: row i holds the logits for the character that follows character i

    def forward(self, idx, targets=None):
        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)
        # (Batch = 4, Time = 8, Channel = 65)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C) # stretch out over the Time dimension
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
            # https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html
            # C = number of classes
            # B or N = Batch size
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))
torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)
SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp
wnYWmnxKWWev-tDqXErVKLgJ
The samples are garbage, since our model is in a completely random state right now. The loss is 4.88, which is roughly what we'd expect when every character in the vocab_size space is equally likely (probability 1/65, i.e., an expected loss of -ln(1/65) ≈ 4.17). We are going to train the model!
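As a quick sanity check, the loss a model would get by assigning uniform probability 1/65 to every character is the negative log of that probability:
import math
print(-math.log(1/65))  # ≈ 4.17; our 4.88 is in the same ballpark, the random init just isn't perfectly uniform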
2.5 training the bigram model
2.6 port our code to a script
3 Building the “self-attention”
3.1 version 1: averaging past context with for loops, the weakest form of aggregation
3.2 the trick in self-attention: matrix multiply as weighted aggregation
3.3 version 2: using matrix multiply
3.4 version 3: adding softmax
3.5 minor code cleanup
3.6 positional encoding
3.7 THE CRUX OF THE VIDEO: version 4: self-attention
3.8 note 1: attention as communication
3.9 note 2: attention has no notion of space, operates over sets
3.10 note 3: there is no communication across batch dimension
3.11 note 4: encoder blocks vs. decoder blocks
3.12 note 5: attention vs. self-attention vs. cross-attention
3.13 note 6: “scaled” self-attention. why divide by sqrt(head_size)
4 Building the Transformer
4.1 inserting a single self-attention block to our network
4.2 multi-headed self-attention
4.3 feedforward layers of transformer block
4.4 residual connections
4.5 layernorm (and its relationship to our previous batchnorm)
4.6 scaling up the model! creating a few variables. adding dropout
5 Notes on Transformer
5.1 encoder vs. decoder vs. both (?) Transformers
5.2 super quick walkthrough of nanoGPT, batched multi-headed self-attention
5.3 back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF
5.4 conclusions
6 resources
- Colab notebook: https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing;
- nanoGPT: https://github.com/karpathy/nanoGPT;
- TinyShakespeare dataset: https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt;