A transformer for generating text in Julia, trained on Shakespeare’s plays. This model can be used as a Generative Pre-trained Transformer (GPT) with further work. This post was inspired by Andrej Karpathy’s Zero to Hero course.
See also a previous post: Transformers from first principles in Julia.
The transformer architecture was introduced by Google AI in their famous Attention is all you need (2017) paper. Transformers have dominated the natural language processing (NLP) landscape since then. Nearly all state-of-the-art NLP models today are transformer models. Most have an architecture remarkably similar to the original and differ only in training regimes, datasets and sizes.
In 2018 OpenAI released a paper titled Improving Language Understanding by Generative Pre-Training. This led to the development of their first Generative Pre-trained Transformer (GPT) model. As of 2024 they have released four versions of GPT, with the latest reported to have over 1.8 trillion parameters. The interactive version of the model, ChatGPT, has gained widespread fame for its human-like responses.
The goal of this post is to create a generative transformer following OpenAI’s methodology for their first GPT-1 paper. It will be a vanilla transformer without many of the additions that have been proposed in this fast-paced field. The model will be trained on Shakespeare’s plays and will be able to generate text that looks and sounds like Shakespeare. This model can then be used as the pre-trained foundation for further supervised tasks.
The goal is to create a model which implements the architecture in the GPT paper:
TransformerGenerator(
Embedding(71 => 32), # 2_272 parameters
Embedding(64 => 32), # 2_048 parameters
Dropout(0.1),
TransformerBlock(
MultiHeadAttention(num_heads=4, head_size=8, 32=>32)(
denseQ = Dense(32 => 32; bias=false), # 1_024 parameters
denseK = Dense(32 => 32; bias=false), # 1_024 parameters
denseV = Dense(32 => 32; bias=false), # 1_024 parameters
denseO = Dense(32 => 32), # 1_056 parameters
),
Dropout(0.1),
LayerNorm(32), # 64 parameters
Dense(32 => 128, relu), # 4_224 parameters
Dense(128 => 32), # 4_128 parameters
Dropout(0.1),
LayerNorm(32), # 64 parameters
),
..., # 2x more TransformerBlocks
Dense(32 => 71), # 2_343 parameters
mask = Bool[1 1 … 1 1; 0 1 … 1 1; … ; 0 0 … 1 1; 0 0 … 0 1], # 1_156 parameters
) # Total: 43 trainable arrays, 44_487 parameters,
# plus 1 non-trainable, 1_156 parameters, summarysize 178.219 KiB.
It will map tokens to indices and will operate on those:
It will return a $ V \times n $ matrix, where $V$ is the vocabulary size and $n$ is the length of the input vector (8 in this example). Each column represents logits for each token. These will then be normalised to values between 0 and 1 using the softmax function. The model will be trained so that each value represents the probability of the next most likely token based on all the tokens before, up to a fixed context length $n$. As a whole the matrix represents the probabilities associated with shifting the input one value to the right.
As an example, during training the input will be “LYSANDER” and the reference “YSANDER\n”. The model will output a probability matrix and after sampling the result will be something like “YSANDR\nH”. This is then compared to the reference to improve the output.
The model computes all the probabilities for all $n$ characters in parallel through the same set of matrix operations, which makes this very efficient during training. We will effectively compare $n$ different predictions for one sample. However at inference time we are only interested in the last ($n$th) character, because we already have the first $n$ characters. Therefore we will discard the first $n-1$ predictions. (They would have already been used internally in the model.)
This is an inherent inefficiency in the transformer model architecture.
Generation will repeat inference many times, each time adding the last generated token to the context and generating a new token. The result is something like:
CLATIO. No, Goe, him buchieds is, hand I was, To queer thee that of till moxselat by twish are. BENET. Are warrain Astier, the Cowlles, bourse and nope, Merfore myen our to of them coun-mothared man, Here is Mafter my thath and herop, and in in have low’t so, veriege a the can eeset thy inscestle marriom. ADY. Thus him stome To so an streeward. Here cas, which id renuderser what thou bee of as the hightseleh-to. CHAESS. With he mand, th’ fouthos. I purcot Lay, You. GATHENT. Who, to hath fres
This was generated by a tiny 42,400 parameter model with a perplexity of 6.3, down from a random sampling perplexity of 71 for 71 characters.
In May 2022 I wrote a blog post on transformers from first principles in Julia. It developed a transformer for a classification task, namely predicting stars for Amazon Reviews. That post was lacking however in that it did not create a decoder transformer. This post is dedicated to that task. I’ve written this as a stand-alone from the original even though much of the code is the same. I refer back to the original post for some explanations. Please see the Design Considerations section which is not repeated here.
This post was inspired by Andrej Karpathy’s Zero to Hero course. I highly recommend it. It covers many ideas like backpropagation, normalisation and embeddings that are assumed knowledge in this post. In particular, this post emulates lesson 7 except the language and framework used are Julia and Flux.jl, not Python and PyTorch. The source code can be accessed at Karpathy’s famed nanoGPT repository.
My own repositories with the code in this blog post can be accessed at TransformersLite.jl and TransformersLite-examples. I will not detail any “pretty” printing function here - please see the repository for those.
This is not meant to be a full scale Julia solution. For that, please see the Transformers.jl package. It has better optimizations, APIs for HuggingFace and more.
The Complete Works of William Shakespeare by William Shakespeare has no copyright attached and can be downloaded legally from Project Gutenberg.
Here is a line to download it with cURL:
curl https://www.gutenberg.org/cache/epub/100/pg100.txt > project_gutenberg_shakespeare.txt
A typical passage from the text looks like:
LYSANDER. How now, my love? Why is your cheek so pale? How chance the roses there do fade so fast? HERMIA. Belike for want of rain, which I could well Beteem them from the tempest of my eyes. LYSANDER. Ay me! For aught that I could ever read, Could ever hear by tale or history, The course of true love never did run smooth. But either it was different in blood—
This is what we want the transformer to learn and the vast majority of the text follows this format. However some pieces do not. These include the Project Gutenberg introduction and conclusion, the table of contents, the sonnets, the preambles - these list the acts and scenes in each play - and so on. Those should all be removed.
Optionally, the small number of non-ASCII characters (œ, Æ, æ, …) can be removed. I also removed the “&” symbol and changed the archaic usage of “&c.” to “etc.”.
I’ve made a script which does all this work, prepare_shakespeare.jl. It reduces the file size from 5.4 MB to 4.8 MB.
We can load the text in Julia with:
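For example (the file name here is an assumption for whatever prepare_shakespeare.jl writes out; adjust as needed):

filepath = "shakespeare_prepared.txt" # assumed output name of prepare_shakespeare.jl
text = read(filepath, String)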
Some basic statistics:
The prepared dataset contains 182,027 lines spanning approximately 38,409 passages, 921,816 words and 4,963,197 characters.
Most passages are very short - fewer than 100 characters. The longest is Richard’s monologue in “The Third Part of King Henry the Sixth”, which consists of 3,047 characters.
Lines have an average of 26.27 characters with the longest being 77 characters in length.
After the data preparation there are 71 unique characters in the text: \n !(),-.:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz—‘’“”
There are approximately 30,040 unique words in the dataset. Of these, approximately 80% appear less than 10 times and 96.5% less than 100 times. The most frequent word is “the” with 23,467 occurrences.
To start, make a package in the Julia REPL:
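For example, from the package manager mode (the package name follows the repositories mentioned above):

julia> ]  # enter package mode
(@v1.10) pkg> generate TransformersLite
(@v1.10) pkg> dev ./TransformersLite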
The purpose of making a package is that we can now use the super helpful Revise package, which will dynamically update most changes during development without errors:
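For example, loading Revise before the package so that edits are tracked:

using Revise
using TransformersLite # changes to the package source are now picked up automatically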
The following packages need to be loaded/added for this tutorial:
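The exact dependency list lives in the repository’s Project.toml; a minimal set covering what is used in this post is roughly:

using Pkg
Pkg.add(["Flux", "NNlib", "Plots", "TSne"])

using Flux, NNlib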
The model will predict probabilities for each token in a given vocabulary. There is a choice as to what constitutes a token. One extreme is one token for each word in the dataset. Here there are far too many unique words, so this would explode the parameter count while providing too few training samples per token. The other extreme is character level tokens. This compresses the learning space too much to get fully realistic outputs, but otherwise it works surprisingly well. In between is sub-word tokenization such as Byte Pair Encoding. This allows configurable vocabulary lengths. See my TokenizersLite.jl package, the BytePairEncoding.jl package or Karpathy’s latest video.
Here we will follow Karpathy’s approach and use character level tokens. The model will learn to predict each word character by character.
First get all the characters:
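One way to do this (a sketch):

vocabulary = sort(unique(collect(text)))
length(vocabulary) # 71 after the preparation step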
Karpathy uses two dictionaries to convert between characters and indices: char_to_int and int_to_char. I’m going to wrap these in a slightly more complex IndexTokenizer struct introduced in my first post. It holds a vector of the vocabulary (equivalent to int_to_char) and a lookup for reversing this (equivalent to char_to_int). Additionally, it has an unknown symbol in case any of the characters are not in the vocabulary. The constructor is as follows:
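A sketch of the struct and constructor, with field names assumed here; see TransformersLite.jl for the actual definition:

struct IndexTokenizer{T}
    vocabulary::Vector{T}
    lookup::Dict{T, Int}
    unksym::T
    unkidx::Int
end

function IndexTokenizer(vocabulary::Vector{T}, unksym::T) where T
    if unksym in vocabulary
        unkidx = findfirst(isequal(unksym), vocabulary)
    else
        vocabulary = vcat(unksym, vocabulary) # prepend the unknown symbol
        unkidx = 1
    end
    lookup = Dict(term => idx for (idx, term) in enumerate(vocabulary))
    IndexTokenizer(vocabulary, lookup, unksym, unkidx)
end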
For encoding we look up the character in the dictionary, returning the index of the unknown symbol by default:
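A sketch, using the fields assumed above:

encode(tokenizer::IndexTokenizer{T}, x::T) where T = get(tokenizer.lookup, x, tokenizer.unkidx)
encode(tokenizer::IndexTokenizer{T}, seq::AbstractVector{T}) where T = map(c -> encode(tokenizer, c), seq)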
We can add a method to do multiple dispatch on the IndexTokenizer type itself, which turns the struct into a function:
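With the encode sketch above this is a one-liner:

(tokenizer::IndexTokenizer)(x) = encode(tokenizer, x)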
Encoding example:
Decoding goes the other way:
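A corresponding sketch for decode:

decode(tokenizer::IndexTokenizer, idx::Integer) = tokenizer.vocabulary[idx]
decode(tokenizer::IndexTokenizer, seq::AbstractVector{<:Integer}) = map(i -> decode(tokenizer, i), seq)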
An example:
Each token is transformed into a vector of floating point numbers. This vector represents some sort of meaning in a large, abstract vector space, where vectors that are closer to each other are more similar. (There is plenty of literature on this subject.)
Flux.jl comes with an embedding layer which can be used directly:
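For example, for a vocabulary of 71 characters and an embedding dimension of 32:

embedding = Flux.Embedding(71 => 32)
embedding([5, 23, 5]) # 32×3 Matrix{Float32}: one column per token index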
Here is the source code:
This struct stores a weight, by default with the smaller datatype of Float32 rather than the usual Julia default of Float64. This saves on space without reducing accuracy. (Float16, Float8 and even as low as Float4 are all used in machine learning models.)
On the forward pass each index is used to retrieve the associated column vector from the matrix.
However instead of using m.weight[:, x] the function uses NNlib.gather(m.weight, x). This is because gather comes with an rrule defined for it (source):
The rrule is a reverse (backwards) rule that encodes the derivative for backpropagation. It is what makes the magic of automatic differentiation work. The function gather does not have a formal derivative, but scatter is its opposite and is what we need to apply when we backpropagate the loss:
At the end of backpropagation we need to distribute the error matrix amongst the original word embeddings. This is what scatter does. Note that one column is used twice (a repeated index), so we have two error columns directed towards it. The rrule applies + as the reducing function; that is, the two errors are added together and then added to the word embedding.
Scatter can be inefficient. If we do a small experiment and call scatter we will see it results in a large matrix of mostly zeros:
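For example, scattering a hypothetical error matrix for 3 tokens back onto a 71-token vocabulary:

using NNlib
errors = randn(Float32, 32, 3) # gradients for 3 embedded tokens
indices = [5, 23, 5]           # token 5 appears twice
dW = NNlib.scatter(+, errors, indices; dstsize=(32, 71))
size(dW)                                      # (32, 71)
count(c -> any(!iszero, c), eachcol(dW))      # only 2 of the 71 columns are non-zero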
The matrix operations used in the transformer are parallel operations. This speeds up computation and is a major reason why they are so popular. However this is an issue: they do not take order into account. We can shuffle the columns in the embedding matrix and it will not affect the output.
To counteract this, the authors of the Attention is all you need (2017) paper suggested adding a second embedding to the first, where the indices are the positions in the sequence.1
We can use an Embedding matrix as before, except with a different input:
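A sketch for a context length of 16, reusing the token embedding from the example above:

position_embedding = Flux.Embedding(16 => 32) # one vector per position, up to the context length
indices = [5, 23, 5, 9, 2, 7, 40, 11]         # 8 token indices
x = embedding(indices) .+ position_embedding(collect(1:length(indices))) # 32×8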
Transformers are an active area of research and many position encodings have been proposed. A notable alternative is rotary position embeddings, where the key and query in the attention step are multiplied by a rotation matrix instead of adding an encoding once at the start.
This embedding matrix restricts the context size. In the example the embedding matrix is 32×16, so a maximum of 16 tokens can be passed to the model at a time. To overcome this a sliding window must be implemented, and the model will completely “forget” any character outside of the window.
Ideally we would create an embedding matrix as large as possible so that the bottleneck is the training data, not the model. However attention, which will be discussed in the next section, scales with $n^2$ for a context length $n$. This is a significant performance penalty for a larger context size.
Attention is the main mechanism at the heart of the transformer. Theoretically it is a weighting of every token towards every other token, including itself. It is asymmetrical and so forms a full $n \times n$ matrix. For example consider word level tokens for the sentence “The elephant in the room”. The tokens “The”, “in”, “the” and “room” might all rate “elephant” the highest, but “elephant” will probably only rate “room” highly.
The attention equation is: \(A = V\text{softmax}\left(\frac{1}{\sqrt{d_h}}K^T Q\right) \label{eq:attention} \tag{3.6.1}\)
where $\text{softmax}$ is given by:
\[\text{softmax}(z, i) = \frac{e^{z_i}}{\sum_r^V e^{z_r}} \label{eq:softmax} \tag{3.6.2}\]

The attention calculation scales with $\mathcal{O}(n^2d_h)$ where $n$ is the input token length and $d_h$ is the head dimension, also known as the hidden dimension.
Given the $n^2$ scaling of attention much effort has gone into altering this step. This includes sparse attention layers, factorisation/kernels for linear attention and down-sampling. A detailed survey can be found at Efficient Transformers: A Survey (2020). All these alternatives are faster than attention but come at the expense of accuracy.
Another line of research is to improve the computation itself. Flash attention (2022) improves computational efficiency on a single GPU, while Ring attention (2023) aims to distribute the work efficiently across multiple devices.
Here the key $K$, query $Q$ and value $V$ are derived from the input matrix $X$ using weights:
\[\begin{align} K &= W_K X \\ Q &= W_Q X \\ V &= W_V X \end{align}\]

Each weight $W$ has a size $d_h \times d_\text{emb}$ and the input matrix has a size $d_\text{emb} \times n$, where $d_\text{emb}$ is the embedding dimension. Each of these matrices therefore has a size $d_h \times n$.
From this we can show that the first matrix product is a weighted dot product of every vector with every other vector in the input matrix, resulting in an $n \times n$ matrix:

\[K^T Q = (W_KX)^T(W_QX) = X^T W_K^T W_Q X\]

This is then followed by scaling ($1/\sqrt{d_h}$) and normalisation ($\text{softmax}$). Lastly this matrix is used as a weight for $V$. The output is $d_h \times n$.
There is a flaw in this architecture. The attention is computed across all tokens at once. This means that past tokens will be given access to future tokens. However the training objective is to predict future tokens. Therefore only the $n$th token, whose next token is missing, will be trained fairly.
To overcome this the authors of the Attention (2017) paper suggested masking the matrix before the softmax with $-\infty$ at each illegal connection, so that $\exp(-\infty)=0$ which effectively removes their influence.
The masked matrix will look like:
\[\begin{bmatrix} s_{11} & s_{12} & s_{13} &... & s_{1n} \\ -\infty & s_{22} & s_{23} &... & s_{2n} \\ -\infty & -\infty & s_{33} &... & s_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -\infty & -\infty & -\infty & ... & s_{nn} \end{bmatrix}\]

Firstly a mask is made where all valid connections have a true and all illegal connections have a false.
Here is the code from NNlib.jl:
We don't have to allocate memory to create a mask. The causal mask is defined by $j \geq i$ for all indices $i$, $j$, so we can write it as a function, as long as we can also write an equivalent rrule for it. See NeuralAttentionlib.jl for such an implementation.
The mask will be applied through ifelse, where trues maintain their value but falses are replaced with some large negative number.
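A sketch of this masking step (the choice of -1e9 as the large negative number is an assumption):

function apply_mask(scores::AbstractArray{T}, mask) where T
    neginf = T(-1e9) # exp of this underflows to zero in the softmax
    ifelse.(mask, scores, neginf)
end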
Usage:
Backpropagation:
As an experiment, set the mask to nothing during training. It should be possible to get very low training losses (below 0.5), corresponding to very low perplexities (less than 2), with very small models, but without a corresponding increase in generation quality.
The Attention (2017) paper suggested a further enhancement on attention where the input matrix is divided amongst $H$ heads. This results in a $\tfrac{d_\text{emb}}{H} \times n \times H$ array. Furthermore, working with batches adds an extra dimension: $d_h \times n \times H \times B$.
We could work with these arrays as Vector{<:Matrix{T}} and Vector{<:Vector{<:Matrix{T}}} respectively, but it is more efficient to work with them as Array{T, 3} and Array{T, 4}, because then we can use optimised array functions.
My first post goes into more detail about multiplication with higher order arrays.2 It compares vanilla versions with optimised versions. Here I will present the optimised version only.
Batch multiplication is defined as:
\[C_{ijk} = \sum_r A_{irk} B_{rjk}\]

An optimised version is available through the NNlib.jl library, a dependency of Flux.jl:
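For example:

using NNlib
A = randn(Float32, 2, 3, 5)
B = randn(Float32, 3, 4, 5)
C = NNlib.batched_mul(A, B) # 2×4×5: an ordinary matrix product for each of the 5 slices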
The 4D batched multiplication is defined as:
\[C_{ijkl} = \sum_r A_{irkl} B_{rjkl}\]

We can calculate this array with the same batched_mul by reshaping any 4D $m\times n \times p \times q$ arrays into 3D $m\times n \times pq$ arrays, doing the multiplication, and reshaping back.
This is exactly what the implementation does behind the scenes:
The Flux Dense layer does something similar.
Flux.jl now comes with a Flux.MultiHeadAttention layer. However, for continuity with my first post, I will present my own MultiHeadAttention layer, except now with masking. It is very similar to the code in Flux.jl and NNlib.jl. The differences are in design choices for the inputs, and Flux.jl’s implementations are slightly more generic. First define a struct to hold all the dense layers and a parameter for $H$ called nhead:
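A sketch of the struct, with field names following the model printout in the introduction (the actual code is in TransformersLite.jl):

struct MultiHeadAttention{Q<:Dense, K<:Dense, V<:Dense, O<:Dense}
    nhead::Int
    denseQ::Q
    denseK::K
    denseV::V
    denseO::O
end

Flux.@layer MultiHeadAttention # use Flux.@functor on older Flux versions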
The model is defined by 4 values: the number of heads $H$, the input dimension $d_\text{in}$, the output dimension $d_\text{out}$ and the head dimension $d_h$. The default for $d_h$ is $d_\text{in}/H$.
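A possible constructor following that description:

function MultiHeadAttention(nhead::Int, dim_in::Int, dim_out::Int; dim_head::Int=div(dim_in, nhead))
    MultiHeadAttention(
        nhead,
        Dense(dim_in => dim_head * nhead; bias=false),
        Dense(dim_in => dim_head * nhead; bias=false),
        Dense(dim_in => dim_head * nhead; bias=false),
        Dense(dim_head * nhead => dim_out),
    )
end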
Now for the forward pass. In general there are three input matrices with the names key, query and value. Later we will pass the same value x for all of them. From these we can calculate $Q$, $K$ and $V$ and pass them to the multi_head_scaled_dot_attention function:
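A sketch of the forward pass under the assumptions above:

function (mha::MultiHeadAttention)(query::A3, key::A3, value::A3; mask=nothing) where {T, A3 <: AbstractArray{T, 3}}
    Q = mha.denseQ(query) # (dh*H) × n × B
    K = mha.denseK(key)
    V = mha.denseV(value)
    A, scores = multi_head_scaled_dot_attention(mha.nhead, Q, K, V; mask=mask)
    mha.denseO(A), scores
end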
This layer returns the scores as well, like Flux.jl’s MultiHeadAttention layer. These are useful for inspecting the model.
The multi_head_scaled_dot_attention function first splits the $Q$, $K$ and $V$ matrices from $d_m \times N \times B$ to $d_h \times N \times H \times B$ (done in two steps: a reshape followed by a permutation of dimensions). It then calculates the scaled dot attention for each head, combines the results and returns them:
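A sketch of the whole function, consistent with the struct above:

function multi_head_scaled_dot_attention(nhead::Int, Q::A3, K::A3, V::A3; mask=nothing) where {T, A3 <: AbstractArray{T, 3}}
    dh = div(size(Q, 1), nhead)
    # split: dm × N × B  →  dh × nhead × N × B  →  dh × N × nhead × B
    Q = permutedims(reshape(Q, dh, nhead, size(Q, 2), :), (1, 3, 2, 4))
    K = permutedims(reshape(K, dh, nhead, size(K, 2), :), (1, 3, 2, 4))
    V = permutedims(reshape(V, dh, nhead, size(V, 2), :), (1, 3, 2, 4))
    A, scores = scaled_dot_attention(Q, K, V; mask=mask)
    # combine: reverse the split
    A = permutedims(A, (1, 3, 2, 4))
    reshape(A, dh * nhead, size(A, 3), :), scores
end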
The scaled dot attention is defined by default for 3D arrays. $Q$ is of size $d_h \times d_q \times H$ while $K$ and $V$ are both of size $d_h \times d_{kv} \times H$. Usually $n=d_q=d_{kv}$.
As explained above, we need to reshape 4D arrays into 3D arrays, apply the usual scaled dot attention and then reshape back:
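A sketch of both methods (the 3D per-head attention and a 4D wrapper that merges the head and batch dimensions):

# 3D case: one attention computation per head
function scaled_dot_attention(query::A3, key::A3, value::A3; mask=nothing) where {T, A3 <: AbstractArray{T, 3}}
    dh = size(query, 1)
    scores = batched_mul(batched_transpose(key), query) ./ convert(T, sqrt(dh)) # n × n × H
    if !isnothing(mask)
        scores = ifelse.(mask, scores, T(-1e9)) # mask out illegal connections
    end
    scores = softmax(scores; dims=1)
    batched_mul(value, scores), scores # dh × n × H
end

# 4D case: merge the head and batch dimensions, apply the 3D version, then split again
function scaled_dot_attention(query::A4, key::A4, value::A4; mask=nothing) where {T, A4 <: AbstractArray{T, 4}}
    dh, n, H, B = size(query)
    Q = reshape(query, dh, n, H * B)
    K = reshape(key, size(key, 1), size(key, 2), H * B)
    V = reshape(value, size(value, 1), size(value, 2), H * B)
    A, scores = scaled_dot_attention(Q, K, V; mask=mask)
    reshape(A, :, n, H, B), reshape(scores, size(scores, 1), size(scores, 2), H, B)
end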
Model:
Forward pass:
Backpropagation:
The other components we need for the transformer block are Layer Norm, Feed Forward (two consecutive dense layers) and dropout. We can use the Flux.jl implementations for these.
This means we can now create a transformer block:
This whole block can be defined with only 5 parameters:
In code:
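A sketch, with the layer order following the printout at the start (field names are assumptions):

struct TransformerBlock{MHA<:MultiHeadAttention, L1<:LayerNorm, D1<:Dense, D2<:Dense, L2<:LayerNorm}
    multihead_attention::MHA
    dropout_attention::Dropout
    norm_attention::L1
    dense_expand::D1
    dense_contract::D2
    dropout_dense::Dropout
    norm_dense::L2
end

Flux.@layer TransformerBlock

function TransformerBlock(nhead::Int, dim_model::Int, dim_hidden::Int; pdrop::Float64=0.1)
    TransformerBlock(
        MultiHeadAttention(nhead, dim_model, dim_model),
        Dropout(pdrop),
        LayerNorm(dim_model),
        Dense(dim_model => dim_hidden, relu),
        Dense(dim_hidden => dim_model),
        Dropout(pdrop),
        LayerNorm(dim_model),
    )
end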
There are skip connections in the forward pass:3
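A sketch of the forward pass (post-norm, as in the original architecture); the attention scores are dropped here, as discussed in the footnote:

function (block::TransformerBlock)(x::AbstractArray; mask=nothing)
    a, _ = block.multihead_attention(x, x, x; mask=mask)     # drop the scores
    h = block.norm_attention(x + block.dropout_attention(a)) # skip connection
    ff = block.dense_contract(block.dense_expand(h))
    block.norm_dense(h + block.dropout_dense(ff))            # skip connection
end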
Model:
Forward pass:
Backpropagation:
We will create a struct to hold the generator.
By default the forward pass will use the model’s mask; otherwise the user can pass a mask to it:
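A sketch following the printout at the start (field names assumed):

struct TransformerGenerator{E1<:Embedding, E2<:Embedding, B<:Tuple, D<:Dense, M<:AbstractMatrix{Bool}}
    token_embedding::E1
    position_embedding::E2
    dropout::Dropout
    blocks::B
    head::D
    mask::M
end

Flux.@layer TransformerGenerator trainable=(token_embedding, position_embedding, blocks, head) # the mask is not trainable

function (model::TransformerGenerator)(x::AbstractArray{<:Integer}; mask=model.mask)
    n = size(x, 1)
    h = model.token_embedding(x) .+ model.position_embedding(collect(1:n)) # dm × n (× B)
    h = model.dropout(h)
    for block in model.blocks
        h = block(h; mask=mask[1:n, 1:n])
    end
    model.head(h) # V × n (× B)
end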
Create a model:
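For example, matching the dimensions in the printout at the start (using the sketches above; the actual constructor in the repository differs):

context_size = 64
mask = [j >= i for i in 1:context_size, j in 1:context_size] # causal mask, true in the upper triangle
model = TransformerGenerator(
    Embedding(71 => 32),  # token embedding
    Embedding(64 => 32),  # position embedding
    Dropout(0.1),
    Tuple(TransformerBlock(4, 32, 128) for _ in 1:3),
    Dense(32 => 71),
    mask,
)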
We can test it with a random vector of indices:
Or a random batch:
Let’s now generate text with the model.
The model has a fixed context length. To generate text longer than this fixed length we will implement a sliding window. This window will take the last $n$ tokens (rows) of the current context for each column (sample) in the batch:
The transformer generates a $V\times N \times B$ matrix. We will only take the logits for the last token per iteration, resulting in a $V\times B$ matrix. These logits will be converted to probabilities via the softmax function $\ref{eq:softmax}$.
We have a choice of how to sample these probabilities. The greedy approach is to always take the token with the maximum probability. A better approach is to randomly sample based on the probabilities. That way a token with a high probability is more likely to be chosen, but it is not guaranteed. This gives us some diversity in the results. We then add this to the context and repeat.
The full function is:
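A sketch of such a function, reusing helper names from the sketches above (call Flux.testmode!(model) first so that dropout is disabled):

function generate(model::TransformerGenerator, context::Matrix{Int}, num_tokens::Int;
                  context_size::Int=size(model.mask, 1))
    for _ in 1:num_tokens
        window = context[max(1, end - context_size + 1):end, :] # sliding window of the last n tokens
        logits = model(window)                                  # V × n × B
        probs = softmax(logits[:, end, :]; dims=1)              # V × B: keep only the last token
        next = [sample_token(probs[:, b]) for b in 1:size(probs, 2)]
        context = vcat(context, permutedims(next))              # append one new row of token indices
    end
    context
end

# sample an index with probability proportional to p (a hypothetical helper)
sample_token(p::AbstractVector) = something(findfirst(>=(rand()), cumsum(p)), length(p))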
Testing it out:
Decode the output using the tokenizer from section 3.2:
The output:
A[RH N)pEy.QEgs?YbgnRsz-ZRDdUXvU Pzwzzxukvv_P;goxe(G;C;I RIgB ‘E[xIqZ-J;gK—wwEUTZYtUg:tEhl-kZ;s:x.ggt
This is nonsense. The model does no better than drawing each character randomly. We need to train the model to get something sensible out of it.
The model will be trained on segments of the text which match the context length $n$. For a text of length $L$ there are $L-n+1$ characters we can select to be the first character of the segments, excluding the last $n-1$ characters. For the Shakespeare text, this results in approximately 4.9 million different segments.
There is however plenty of overlap so we don’t have to train on all of them. We can instead randomly sample segments from the text. Characters at any point in the text will have a probability of appearing of $p\approx n/L$ (the ends are less likely). For many steps $s$ this binomial distribution can be approximated with a normal distribution with a mean $sp\approx sn/L$ and standard deviation $\sqrt{sp(1-p)}\approx \sqrt{sn/L}$. For example, for 4.9 million characters, a context length of 64 and 100,000 steps, each character at each point will appear 1.31±1.14 times.
The other important task is to create the reference text that the model will be trained to generate, which is simply the input text shifted by one. (This reduces the number of valid segments by 1.)
The function is as follows:
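A sketch of such a function (the version in the repository differs in details):

function get_batch(data::Vector{Int}, context_size::Int, batch_size::Int)
    starts = rand(1:(length(data) - context_size), batch_size)
    X = zeros(Int, context_size, batch_size)
    Y = zeros(Int, context_size, batch_size)
    for (b, s) in enumerate(starts)
        X[:, b] = data[s:(s + context_size - 1)]
        Y[:, b] = data[(s + 1):(s + context_size)] # the input shifted one to the right
    end
    X, Y
end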
Usage:
The outputs look like:
X Y
1 70 66 | 9 60 3
9 60 3 | 26 4 32
26 4 32 | 1 17 35
1 17 35 | 68 54 70
Lastly, it can be convenient to wrap this functionality in a struct similar to Flux.jl’s DataLoader. For an example of this, please see the BatchGenerator object in my generate_batches.jl file.
What is our goal?
We want the probability of the true next character to be the highest.
The model returns a $V \times n \times B$ array of logits. We also have an $n \times B$ reference array of the true next characters ($Y$). The first step is to convert the logits to probabilities - a range of values from 0 to 1 summing to 1 - with the softmax equation $\ref{eq:softmax}$. We can then pick out the probabilities of the true next characters by converting the reference array to a one hot matrix and multiplying:
All the non-zero values are the probabilities of interest.
Since these values are small numbers the convention is to instead use the cross entropy, so $-Y\log(P)$ rather than $YP$. This maps the values from the range $(0, 1)$ to the range $(0, \infty)$. We then reduce it to a single value by taking the mean. This is known as the cross entropy loss:
\[\begin{align} l(y, p) &= -\frac{1}{N}\sum^{N}_i y_i \log(p_i) \\ &= -\frac{1}{N}\sum^{N}_i y_i \log\left(\frac{e^{z_i}}{\sum e^z}\right) \\ &= -\frac{1}{N}\sum^{N}_i y_i \left(z_i - \log\left(\sum e^z\right)\right) \tag{4.2.1} \label{eq:cross_entropy} \end{align}\]

where $N=nB$.
As a baseline, imagine a model which predicts characters uniformly randomly. All probabilities will be $1/V$ and hence the loss will reduce to $-\log(1/V)$. For $V=71$ the expected loss is therefore 4.26. A trained model should achieve a value closer to 0.
Flux.jl comes with Flux.logitcrossentropy, which implements equation $\ref{eq:cross_entropy}$:
In a single function:
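A sketch of such a function, using one-hot targets (the name and exact form in the repository may differ):

function full_loss(model, X::AbstractMatrix{<:Integer}, Y::AbstractMatrix{<:Integer})
    logits = model(X)                                # V × n × B
    targets = Flux.onehotbatch(Y, 1:size(logits, 1)) # V × n × B one-hot array
    Flux.logitcrossentropy(logits, targets)          # mean over all n*B predictions
end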
I’ve called it the full loss to indicate that it is over all $nB$ token predictions and not only the last ($B$) tokens.
Another common measure of the ability of the model is perplexity, which is the inverse of the average probability for each character. It is defined as:
\[e^{l(y, p)} = \prod_i^N p_i^{-y_i/N} = 1 \div \left(\prod_i^N p_i \right)^{1/N} \tag{4.3} \label{eq:perplexity}\]

where $l(y, p)$ is the cross entropy loss.
The perplexity for random sampling with $p_i=1/V$ is simply $V$. In other words, randomly sampling from 71 characters gives a 1 in 71 chance for each character. A trained model should achieve a value closer to 1 in 1, because the context and known distributions allow the model to select characters with greater than random chance.
Like other types of averages, perplexity does not describe the shape of the distribution and outliers can have an outsized effect on it.
We can estimate it using many samples, say 1,000 steps with a batch size of 32:
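For example, reusing the hypothetical get_batch and full_loss sketches from earlier:

function estimate_perplexity(model, data::Vector{Int}; nsteps::Int=1000, context_size::Int=64, batch_size::Int=32)
    total = 0.0
    for _ in 1:nsteps
        X, Y = get_batch(data, context_size, batch_size)
        total += full_loss(model, X, Y)
    end
    exp(total / nsteps) # perplexity is e to the mean cross entropy
end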
It is always good practice to split the data into train, validation and test splits. For simplicity, we’ll only use a train and validation split. We’ll put the first 95% of data in the train split and the remainder in the validation split.4
We can now set up a training loop:
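A minimal sketch using Flux’s explicit-gradient API; the hyperparameters and helper names are illustrative:

using Flux

opt_state = Flux.setup(Adam(1e-3), model)
num_steps, context_size, batch_size = 25_000, 64, 32
for step in 1:num_steps
    X, Y = get_batch(train_data, context_size, batch_size)
    loss, grads = Flux.withgradient(m -> full_loss(m, X, Y), model)
    Flux.update!(opt_state, model, grads[1])
    if step % 100 == 0
        println("step $step: train loss = $loss")
    end
end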
This works well enough, but will require many more steps to train. I recommend at least 10 epochs, where one epoch is defined as $0.95L/(nB)$ steps. (Based on the logic in Batch Generation each character at each position in the text should appear approximately once per epoch.) For $L=4.9\times10^6$, $n=64$ and $B=32$, this is 2,300 steps per epoch.
Please see my training.jl file for a train! function which, among other things, keeps a Dictionary that saves the values of each metric for each epoch.

For the most part the model we have created is a black box. There are however various techniques to inspect it. For example, cosine similarities, which were showcased in the Position Encoding section.
Another popular technique is to visually examine the embeddings after dimension reduction. For example our model has a dimension of 32, and we can reduce this to 2 dimensions and then create a 2D scatter plot. The popular techniques to do this are PCA (Principal Component Analysis) and t-SNE (t-distributed Stochastic Neighbor Embedding). t-SNE starts with PCA and iterates to give better looking results.
Here is an implementation of t-SNE with Julia:
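A sketch using the TSne.jl package; the embedding field name is an assumption carried over from the earlier sketches:

using TSne, Plots

W = model.token_embedding.weight                          # 32 × 71 embedding matrix
points = tsne(permutedims(Float64.(W)), 2, 0, 1000, 20.0) # reduce each 32-dim vector to 2 dimensions
scatter(points[:, 1], points[:, 2];
    series_annotations=Plots.text.(string.(vocabulary), 10),
    label="",
)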
where the vocabulary is the list of characters held by the tokenizer (indexer.vocabulary).
The output:
Note that t-SNE is stochastic and each run will give different results.
For the embedding matrix we can see that the model groups all the vowels (a, e, i, o, u) and their capital forms together. It also tends to group the lowercase form and uppercase form together e.g. ‘g’ and ‘G’. The head meanwhile has 3 distinct groups: capital letters, punctuation and lower case letters. It also groups the vowels together.
Perhaps with further training more meaning would be encoded into these vectors.
We can pass an input to the model and visually inspect the attention scores. To do this we need to alter the attention functions to return the score as well (including reshaping it as needed). At the top level - the forward pass of the model - these scores should be saved in a vector. Then we can plot them:
using Plots
text = """LYSANDER.
How now, my love? Why is your cheek so pale?
How chance the roses there do fade so fast?"""
tokens = reshape(indexer(collect(text)), :, 1);
X = tokens[1:context_size, :];
X_text = decode(indexer, X[:, 1]);
Y, scores = predict_with_scores(model, X, mask=model.mask); # modified forward pass
s = scores[3][:, :, 3, 1]
s = ifelse.(model.mask, s, NaN)
heatmap(s,
xticks=(1:context_size, X_text),
yticks=(1:context_size, X_text),
yrotation=90,
aspectratio=:equal,
xlims=(0.5, context_size+0.5),
size=(500, 500),
)
The attention matrices are very sparse. Most tokens only place emphasis on the four or fewer tokens directly before them. This suggests we could have used a much smaller context length, for example 16, and indeed that does work.
Ideally the model should be learning long range relationships and it is worrying that it is not.
That said, the model does confidently predict that the next letter is an “e” at the end of “chance”:
using Plots
probs_next = softmax(Y[:, end, 1])
v = length(indexer.vocabulary)
bar(probs_next,
xticks=(1:v, indexer.vocabulary),
xlims=(1, v),
label="",
ylabel="probabilities",
xlabel="tokens"
)
Perhaps with more training the model would give better results.
Thank you for following this tutorial. I hope you now have a working transformer and have much better insight into how they work.
The cosine similarity is calculated as $W^TW/ m^T m $ where $m_{1j}=\sqrt{\sum_i W_{ij}^2}$ for each column $j$ in $W$. In code:
using LinearAlgebra
function cosine_similarity(W::AbstractMatrix)
sim = transpose(W) * W
magnitudes = sqrt.(diag(sim))
for i in 1:size(sim, 1)
for j in 1:size(sim, 2)
sim[i, j] /= magnitudes[i] * magnitudes[j]
end
end
sim
end
In general multiplication is not defined for higher order arrays. But there is a set of multidimensional algebraic objects called tensors where it is. Confusingly, Google named their machine learning framework TensorFlow and calls higher order arrays tensors. So one should differentiate between machine learning tensors and geometric tensors. They are not the same. To give a simple explanation: one can think of geometric tensors as higher order arrays with severe constraints on their entries and operations because they represent geometric objects. These constraints make it harder - not easier - to code higher order arrays as geometric tensors. ↩
The design decision is to purposely drop the attention scores in the TransformerBlock’s forward pass. This is to simplify the code and to not place a bias on the attention. In a typical block the MultiHeadAttention layer will make up 1/3rd of the parameters while the dense layers will make up 2/3rds, so the dense layers are potentially more important. To return the scores it is enough to edit the forward pass for the block and model, or to create two new functions entirely. ↩
A smarter strategy is to randomly sample passages throughout the text until the desired proportions are reached. ↩