Lior Sinai

DeepSeek’s Multi-Head Latent Attention

2025-02-22T00:00:00+00:00

A deep dive into DeepSeek’s Multi-Head Latent Attention, including the mathematics and implementation details. The layer is recreated in Julia using Flux.jl.

See also previous posts on transformers:

All code available at github.com/LiorSinai/TransformersLite.jl/tree/feature/mla.

1 Introduction

In January 2025, DeepSeek unveiled their new DeepSeek-V3 and DeepSeek R1 models. They took the world by storm. Users were impressed with their abilities, on top of DeepSeek’s claims that they were up to 50× more efficient to train and run than their competitors.

They also released multiple papers (DeepSeek-V2, DeepSeek-V3, DeepSeek-R1) with an impressive array of new techniques across the whole machine learning pipeline, from high level theory to intricate implementation details. Most of it built on existing ideas in innovative ways. They include:

Theory
- Multi-Head Latent Attention (MLA): compress vectors during attention, which reduces the cache size during inference.
- DeepSeekMoE: segmented and isolated mixture of experts.
- Multi-token prediction.
- Reinforcement learning with Group Relative Policy Optimization but without supervised data.
- Improved chain-of-thought reasoning.
Implementation
- DualPipe: accelerate training by overlapping forward and backward computation-communication phases.
- Exponential moving average on the CPU.
- Mixed precision floating point numbers during training. FP8 and FP32 are used.
- Low-precision storage and communication.

The aim of this post is to explore the first of these ideas in depth, namely Multi-Head Latent Attention (MLA). It is actually a combination of 3 ideas:

- Attention and multi-head attention from Attention is all you need (2017).
- KV Caching.
- Low-Rank Adaptation matrices (LoRA) from LoRA: Low-Rank Adaptation of Large Language Models (2021).

The idea behind MLA is simple: it compresses the input matrix into a single matrix for caching during inference, as opposed to standard KV caching which caches two intermediate matrices. This is a novel innovation by DeepSeek. However, MLA has side effects which DeepSeek barely address. The compression also results in a performance boost but most likely comes with a qualitative penalty. That these are not discussed is a notable omission in their paper.

DeepSeek also adds two unwieldy enhancements to MLA. While they only speak well of these, they complicate the mathematics and require specialised optimised code to see performance gains:

- Weight absorption.
- Decoupled rotary position embeddings (RoPE), a modification of RoFormer: Enhanced Transformer with Rotary Position Embedding (2021).

The original source code is written in Python with the PyTorch machine learning framework. However, my favourite language is Julia and continuing with my previous posts on transformers, all the code here is written in Julia using the Flux.jl machine learning framework.

Julia uses column major format whereas Python uses row major format. In Julia sample vectors are columns while in Python they are rows. Equations between the two formats will look backwards to each other. They need to be transposed and definitions also need to be transposed. E.g. $K^TQ \rightarrow (K_c^TQ_c)^T=Q_c^TK_c= Q_r K_r^T$

The mathematics therefore follows Julia’s column major format and not Python’s row major format. In this format each column represents a single sample: the first dimension is the embedding dimension, the second dimension is the sequence length and all remaining dimensions represent batches.
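As a quick sanity check of the transposition rule, here is a minimal sketch in plain Julia. The variable names (`Kc`, `Qc`, `Kr`, `Qr`) and the sizes are only for illustration and are not part of TransformersLite.jl.

```julia
# Column major (Julia): each column is a sample, so the scores are K' * Q.
Kc, Qc = randn(4, 3), randn(4, 3)
# Row major (Python): each row is a sample, so the scores are Q * K'.
Kr, Qr = permutedims(Kc), permutedims(Qc)

S_col = Kc' * Qc   # 3×3, entry (i, j) pairs key i with query j
S_row = Qr * Kr'   # 3×3, entry (i, j) pairs query i with key j
S_col' ≈ S_row     # the two conventions differ only by a transpose
```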

2 KV Caching

2.1 Theory

There is an inefficiency in using transformers for generation. In classification use cases, the model only needs to calculate attention over the input sentence once (per layer) before making its prediction. But in text generation, it needs to recalculate it over the entire sentence. One analogy I read is that this is like having to reread the entire sentence so far in order to produce the next word and then rereading again to produce the next word, and so on.

In mathematical terms, this is computing the attention between the first token and itself, between the first two tokens, between the first 3 tokens and so on until it’s between all the tokens, and repeating this whole process each time for every new token. This was the case for the generator I made based on Andrej Karpathy’s Zero to Hero course.

It would be better to have a “memory” of what has been generated so far. This is where the KV cache comes in, for Key-Value cache. We can store the previous keys and values and use them to calculate the new attention value. This will give the exact same value as recalculating the entire attention.

Here is a mathematical proof. For a visual interpretation, please see João Lages’ excellent article.

The attention equation is

\[A = V\text{softmax}\left(\frac{1}{\sqrt{d_h}}K^T Q\right) \label{eq:attention} \tag{2.1.1}\]

where

\[\begin{align} K &= W^K X \\ Q &= W^Q X \\ V &= W^V X \end{align} \label{eq:KQV} \tag{2.1.2}\]

where $X \in \mathbb{R}^{d\times n \times B}$ and $W^K, W^Q, W^V \in \mathbb{R}^{d_hH \times d}$.

Focusing on one head in one batch with dimension $d_h$, the input matrix $X$ is split into the first $n-1$ columns and the $n$th column. The first multiplication can then be written as (dimensional analysis on right):¹

\[\begin{align} S &= K^T Q & & \\ &= \begin{bmatrix} K_{1:n-1} & K_n \end{bmatrix}^T \begin{bmatrix} Q_{1:n-1} & Q_n \end{bmatrix} &; &\begin{bmatrix}d_h \times (n-1) & d_h \times 1 \end{bmatrix} ^T \begin{bmatrix} d_h \times (n-1) & d_h \times 1\end{bmatrix}\\ &= \begin{bmatrix} K_{1:n-1}^T \\ K_n^T \end{bmatrix} \begin{bmatrix} Q_{1:n-1} & Q_n \end{bmatrix} &; &\begin{bmatrix}(n-1) \times d_h \\ 1 \times d_h \end{bmatrix} \begin{bmatrix} d_h \times (n-1) & d_h \times 1\end{bmatrix}\\ &= \begin{bmatrix} K_{1:n-1}^T Q_{1:n-1} & K_{1:n-1}^T Q_n \\ K_n^T Q_{1:n-1} & K_n^T Q_n \end{bmatrix} &; &\begin{bmatrix}(n-1)\times(n-1) & (n-1)\times 1 \\ 1 \times (n-1) & 1\times 1 \end{bmatrix} \end{align} \label{eq:Kcache} \tag{2.1.3}\]

Looking at the final line, the first $(n-1)$ columns of the query can be safely dropped without affecting the $n$th column. (We would do this anyway in generation.) The $n$th column only depends on $K_{1:n-1}$, $K_n$ and $Q_n$. Of these, $K_{1:n-1}$ will come from the cache and the other two will be calculated from $X_n$.

It is important to note that dropping the first $(n-1)$ columns is also valid because almost all other layers in a transformer are independent of position. For example for the dense layers $Y=WX$, permuting the columns of $X$ will result in a corresponding permutation of the columns of $Y$. E.g. if $X$ has two columns and $[Y_1 Y_2] = W[X_1 X_2]$ then $[Y_2 Y_1] = W[X_2 X_1]$. The exception is the position embedding layers, which will require a new parameter to be passed through the whole transformer to indicate the position.

Similarly for the next multiplication we have ($Z=\text{softmax}(S)$):

\[\begin{align} A &= V Z \\ &= \begin{bmatrix} V_{1:n-1} & V_n \end{bmatrix} \begin{bmatrix} Z_{1:n-1,1:n-1} & Z_{1:n-1,n} \\ Z_{n,1:n-1} & Z_{n,n} \end{bmatrix} \\ &= \begin{bmatrix} V_{1:n-1} Z_{1:n-1,1:n-1} + V_{n} Z_{n,1:n-1} & V_{1:n-1} Z_{1:n-1,n} + V_n Z_{n, n} \end{bmatrix} \end{align} \label{eq:Vcache} \tag{2.1.4}\]

However, we said we are dropping the first $(n-1)$ columns. Without them only the $n$th column is calculated. Hence we have:

\[\begin{align} A_n &= \begin{bmatrix} V_{1:n-1} & V_n \end{bmatrix} \begin{bmatrix} Z_{1:n-1,n} \\ Z_{n,n} \end{bmatrix} \\ &= V_{1:n-1} Z_{1:n-1,n} + V_n Z_{n, n} \end{align} \tag{2.1.5}\]

which depends on $V_{1:n-1}$, which will come from the cache, and $V_n$, which will be calculated from $X_n$.

There will be two caches each with size $d_h H \times N \times B$ for $H$ heads, a maximum sequence length of $N$ and a maximum batch size of $B$. The total cache size can grow very large for large transformers with many multi-head attention layers. The primary aim of MLA is to reduce the size of this cache. That will be covered in the next section.
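The proof can also be checked numerically. The following is a minimal single-head, single-batch sketch in plain Julia, independent of Flux.jl and TransformersLite.jl. The helper names `softmax_cols` and `attention` and the small dimensions are assumptions made for this illustration only. No causal mask is applied because the mask never hides anything in the last column.

```julia
using Random

# Column-wise softmax: each column of S is normalised independently.
function softmax_cols(S::AbstractMatrix)
    E = exp.(S .- maximum(S; dims=1))
    E ./ sum(E; dims=1)
end

# Single-head attention in column major form: A = V * softmax(K'Q / sqrt(dh)).
attention(Q, K, V) = V * softmax_cols((K' * Q) ./ sqrt(size(K, 1)))

Random.seed!(1)
d, dh, n = 8, 4, 6
WK, WQ, WV = randn(dh, d), randn(dh, d), randn(dh, d)
X = randn(d, n)

# Full recomputation over all n tokens.
A_full = attention(WQ * X, WK * X, WV * X)

# Cached version: keys and values for tokens 1:n-1 are reused,
# only the nth key, value and query are calculated from x_n.
cache_k = WK * X[:, 1:n-1]
cache_v = WV * X[:, 1:n-1]
x_n = X[:, n:n]
K = hcat(cache_k, WK * x_n)
V = hcat(cache_v, WV * x_n)
A_n = attention(WQ * x_n, K, V)

isapprox(A_full[:, end], A_n[:, 1]) # the nth column matches the cached result
```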

2.2 Code

Building on my code in TransformersLite.jl, it is straightforward to create a new MultiHeadAttentionKVCache layer with two caches:

struct MultiHeadAttentionKVCache{
    Q<:Dense, K<:Dense, V<:Dense, O<:Dense, C<:Array{T, 3}  where T
    }
    nhead::Int
    denseQ::Q
    denseK::K
    denseV::V
    denseO::O
    cache_k::C
    cache_v::C
end

Flux.@layer MultiHeadAttentionKVCache trainable=(denseQ, denseK, denseV, denseO)

The forward pass calculates the current q, k and v values from the input and gets the rest from the cache. It then continues without any additional modifications from the original code:

function (mha::MultiHeadAttentionKVCache)(
    query::A3, key::A3, value::A3
    ; start_pos::Int=1, use_cache::Bool=true, kwargs...
    ) where {T, A3 <: AbstractArray{T, 3}}
    q = mha.denseQ(query) # size(q) == (dh*nhead, dq, B)
    k = mha.denseK(key)
    v = mha.denseV(value) # size(k) == size(v) == (dh*nhead, dkv, B)
    if use_cache
        dim, seq_length, batch_dim = size(query)
        end_pos = start_pos + seq_length - 1
        mha.cache_k[:, start_pos:end_pos, 1:batch_dim] = k
        mha.cache_v[:, start_pos:end_pos, 1:batch_dim] = v
        K = mha.cache_k[:, 1:end_pos, 1:batch_dim]
        V = mha.cache_v[:, 1:end_pos, 1:batch_dim]
    else
        K = k
        V = v
    end
    A, scores = multi_head_scaled_dot_attention(mha.nhead, q, K, V; kwargs...)
    mha.denseO(A), scores
end

Here is a small example of it in action. (For the full code, see test/MultiHeadAttention.jl.)

Create the layer and inputs:

using TransformersLite
using TransformersLite: MultiHeadAttention, MultiHeadAttentionKVCache
using TransformersLite: make_causal_mask, clone_add_kv_cache
nhead, dim_model, dim_out = 4, 32, 13
mha = MultiHeadAttention(nhead, dim_model, dim_out) 
mha = clone_add_kv_cache(mha, 64, 8)
X = randn(Float32, 32, 10, 5)

Fill the cache:

mask = make_causal_mask(ones(10, 10))
A, scores = mha(X, X, X; mask=mask, start_pos=1, use_cache=true)
size(A) # (13, 10, 5)
size(scores) # (10, 10, 4, 5)

Use the cache with a new vector:

x = randn(Float32, 32, 1, 5)
mask = repeat([true], inner=(11, 1))
Ax, scoresx = mha(x, x, x; mask=mask, start_pos=11, use_cache=true)
size(Ax) # (13, 1, 5)
size(scoresx) # (11, 1, 4, 5)

Compare without the cache:

Xx = cat(X, x, dims=2)
mask = make_causal_mask(ones(11, 11))
AXx, scoresXx = mha(Xx, Xx, Xx; mask=mask, start_pos=1, use_cache=false)
isapprox(AXx[:, end, :], Ax[:, end, :]) # true

3 Multi-Head Latent Attention

3.1 C cache

We’ve seen that the KV cache has size $2d_h H \times N \times B$ elements per multi-head attention layer. (Each element is 1-4 bytes depending on whether FP8, FP16 or FP32 is used.) The aim of MLA is to reduce this, specifically to $d_c \times N \times B$ elements per multi-head attention layer. Therefore we will choose $d_c < 2d_h H$.

Different KV caching techniques.

DeepSeek’s innovation is to introduce a weight matrix $W^{DKV} \in \mathbb{R}^{d_c\times d}$ to compress the input $X \in \mathbb{R}^{d\times n}$ to a lower rank matrix $C^{KV} \in \mathbb{R}^{d_c \times n}$. This $C^{KV}$ matrix is then stored in the cache. Then two other weight matrices $W^{UK}$ and $W^{UV} \in \mathbb{R}^{d_h H\times d_c}$ uncompress the same $C^{KV}$ matrix to the key $K$ and value $V$ respectively. The above figure shows this visually.

\[\begin{align} c^{KV}_n &= W^{DKV} x_n \\ K &= W^{UK} C^{KV}_{1:n} \\ V &= W^{UV} C^{KV}_{1:n} \end{align} \tag{3.1.1} \label{eq:mla}\]

The KV cache is now replaced with a $C^{KV}$ cache of size $d_c \times N \times B$. DeepSeek theorises that this compression also results in a regularization effect that improves performance. This is supported by other LoRA research. However, the lossy compression might instead adversely affect quality. DeepSeek provides no evidence towards either claim.

The compression also results in a significant performance boost which DeepSeek strangely does not mention in their paper. Note that in MLA there are three matrix multiplications to perform to create $K$ and $V$ instead of two matrix multiplications in MHA. However, the three multiplications comprise fewer scalar operations:²

\[\begin{align} \frac{\text{# MLA ops}}{\text{# MHA ops}} &= \frac{2(2d_h H + d)d_c nB}{2(2d_h H)d n B} \\ &= \frac{2\frac{d_h H}{d} + 1}{2 \tfrac{d_h H}{d_c}} \\ &= \frac{3}{2r} \end{align} \tag{3.1.2} \label{eq:mla_ops}\]

with the standard $d = d_h H$ and a compression ratio $r=\tfrac{d_h H}{d_c}$. This requires $r > 1.5$ for performance gains. The only performance penalty is the memory required for the $n$th $c^{KV}_n$ vector before it is transferred to the cache, which is $d_c \times 1 \times B$.

DeepSeek-V3 uses $d_h = 128$, $H=128$ and $d_c=4 d_h = 512$ which means it has a compression ratio of $32$ and, under the $d = d_h H$ assumption, a roughly 21× speed up in creating $K$ and $V$!
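The arithmetic can be reproduced with a few lines of Julia. This is only a sketch: `matmul_ops` and `kv_ops_ratio` are hypothetical helper names, the operation count uses $2d$ per output element (dropping the $-1$ from the footnote, as the equation above does), and $d = d_h H$ is assumed in both examples.

```julia
# Scalar operations for a naive (m × k) * (k × p) matrix product.
matmul_ops(m, k, p) = 2 * m * k * p

# Ratio of operations needed to create K and V with MLA versus MHA.
function kv_ops_ratio(; d, dh, H, dc, n=1, B=1)
    mha = 2 * matmul_ops(dh * H, d, n * B)          # K and V directly from X
    mla = matmul_ops(dc, d, n * B) +                # C^{KV} from X
          2 * matmul_ops(dh * H, dc, n * B)         # K and V from C^{KV}
    mla / mha
end

# Small layer with r = 4: the ratio is 3/(2r) = 0.375.
ratio_small = kv_ops_ratio(d=512, dh=64, H=8, dc=128)

# Dimensions quoted for DeepSeek-V3 with d = dh*H assumed: r = 32, ratio = 3/64.
ratio_v3 = kv_ops_ratio(d=128 * 128, dh=128, H=128, dc=512)
speed_up = 1 / ratio_v3                             # about 21.3

# Cache elements per token per layer: d_c versus 2*d_h*H.
cache_ratio = 512 / (2 * 128 * 128)                 # 1/64
```

Note that the cache shrinks by a factor of $2r$ while the operation count only shrinks by $\tfrac{2}{3}r$, because the down-projection $W^{DKV}$ is an extra multiplication.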

To reduce the activation memory during training, DeepSeek also applies the same strategy to the query:

\[\begin{align} C^{Q} &= W^{DQ} X \\ Q &= W^{UQ} C^{Q} \end{align} \tag{3.1.3} \label{eq:cq}\]

In total, five matrix multiplications are needed to create $Q$, $K$ and $V$ instead of three. The overall ratio of scalar operations is $\tfrac{5}{3r}$, which requires $r>1.67$ for performance gains.

DeepSeek gives further enhancements to this which will be described shortly. They also apply layer normalisation to $C^Q$ and $C^{KV}$ which I will ignore in this article. For now, let’s see this basic version of MLA in action.

3.2 Code

First create a struct similar to the MultiHeadAttentionKVCache layer.

struct MultiHeadLatentAttention{D1<:Dense, D2<:Dense, A<:AbstractArray{T, 3} where T} 
    nhead::Int
    denseDQ::D1
    denseUQ::D1
    denseDKV::D1
    denseUK::D1
    denseUV::D1
    denseO::D2
    cache_ckv::A
end

Flux.@layer MultiHeadLatentAttention trainable=(denseDQ, denseUQ, denseDKV, denseUK, denseUV, denseO)

Here is a convenience constructor to construct it from the various input dimensions:

function MultiHeadLatentAttention(;
    nhead::Int, dim_in::Int, dim_head::Int, dim_lora, dim_out::Int,
    max_seq_length::Int, max_batch_size::Int
    )
    denseDQ = Dense(dim_in => dim_lora; bias=false)
    denseUQ = Dense(dim_lora => dim_head * nhead; bias=false)
    denseDKV = Dense(dim_in => dim_lora; bias=false)
    denseUK = Dense(dim_lora => dim_head*nhead; bias=false)
    denseUV = Dense(dim_lora => dim_head*nhead; bias=false)
    denseO = Dense(dim_head*nhead => dim_out; bias=false)
    cache_ckv = Array{Float32, 3}(undef, dim_lora, max_seq_length, max_batch_size)
    MultiHeadLatentAttention(
        nhead,
        denseDQ, denseUQ,
        denseDKV, denseUK, denseUV,
        denseO,
        cache_ckv
    )
end

The forward pass is:

function (mla::MultiHeadLatentAttention)(query::A3, key::A3
    ; start_pos::Int=1, use_cache::Bool=true, mask::Union{Nothing, M}=nothing
    ) where {T, A3 <: AbstractArray{T, 3}, M <: AbstractArray{Bool}}
    dm, seq_length, batch_dim = size(key)
    cq = mla.denseDQ(query) # size(cq) == (dc, dq, B)
    ckv = mla.denseDKV(key) # size(ckv) == (dc, dkv, B)
    if use_cache
        end_pos = start_pos + seq_length - 1
        mla.cache_ckv[:, start_pos:end_pos, 1:batch_dim] = ckv
        ckv = mla.cache_ckv[:, 1:end_pos, 1:batch_dim]
    end
    K = mla.denseUK(ckv) # size(k) == (dh*nhead, dkv, B)
    V = mla.denseUV(ckv) # size(v) == (dh*nhead, dkv, B)
    Q = mla.denseUQ(cq)  # size(q) == (dh*nhead, dq, B)
    A, scores = multi_head_scaled_dot_attention(mla.nhead, Q, K, V; mask=mask)
    A = mla.denseO(A)
    A, scores
end

Create a test layer with compression ratio $r=\tfrac{d_hH}{d_c}=4$:

nhead, dim_head, dim_lora, dim_out = 8, 64, 128, 8*64
dim_model = nhead * dim_head
N, max_seq_length, batch_dim = 20, 32, 8
mla = MultiHeadLatentAttention(
    nhead=nhead, dim_in=dim_model, dim_head=div(dim_model, nhead),
    dim_lora=dim_lora, dim_out=dim_out,
    max_seq_length=max_seq_length, max_batch_size=batch_dim
    )
X0 = randn(Float32, dim_model, N, batch_dim)

Fill the cache:

mask = make_causal_mask(ones(N, N));
A, scores = mla(X0, X0; mask=mask, use_cache=true); 
size(A) # (512, 20, 8)
size(scores) # (20, 20, 8, 8)

Use the cache with a new vector:

x = randn(Float32, dim_model, 1, batch_dim)
mask = repeat([true], inner=(N + 1, 1))
Ax, scoresx = mla(x, x; mask=mask, start_pos=N+1, use_cache=true)
size(Ax) # (512, 1, 8)
size(scoresx) # (21, 1, 8, 8)

3.3 Absorption

DeepSeek suggests a way to further decrease the computational cost by absorbing weight matrices into each other. To quote DeepSeek directly:

In addition, during inference, since $W^{UK}$ can be absorbed into $W^{Q}$ , and $W^{UV}$ can be absorbed into ${W^O}$, we even do not need to compute keys and values out for attention.

What they mean is that the weight matrices can be multiplied to produce a single weight matrix. This can only be done during inference because during training they need to be kept separate so that the gradients flow properly backwards through each matrix.

This technique can be used independently of MLA.

To show why this works, rewrite the attention equation $\ref{eq:attention}$ as follows:

\[\begin{align} S &= K^T{Q} \\ &= (W^{UK}C^{KV})^T (W^{UQ}C^Q) \\ &= (C^{KV})^T (W^{UK})^T W^{UQ} C^{Q} \\ &= (C^{KV})^T W^{KQ} C^{Q} \quad ; W^{KQ}=(W^{UK})^T W^{UQ} \end{align} \label{eq:absorbWKQ} \tag{3.3.1}\]

This looks straightforward but there are further complications with the dimensions. Here is a dimensional analysis of the above equation following the two rules of batch matrix multiplication:

- The inner matrix dimensions must match. That is, the second dimension of the first matrix must match the first dimension of the second.
- All the batch dimensions (dimensions 3 and greater) must be equal.

\[\begin{align} & (d_c \times n \times B)^T (d_h H \times d_c)^T (d_h H \times d_c) (d_c \times n \times B) \\ &= (n \times d_c \times B) (d_c \times d_h \times H) (d_h \times d_c \times H) (d_c \times n \times B) \\ &= (n \times d_c \times B) (d_c \times d_c \times H) (d_c \times n \times B) \\ &= (n \times d_c \times 1 \times B) (d_c \times d_c \times H \times 1) (d_c \times n \times 1 \times B) \\ &= n \times n \times H \times B \end{align} \label{eq:absorbWQ_dimension} \tag{3.3.2}\]

where

Line 2 reshapes the weight matrices from $d_h H \times d_c$ to $d_h \times d_c \times H$. This is necessary because the non-linear softmax function must be applied independently over each head dimension.
Line 3 shows that $W^{KQ} \in \mathbb{R}^{d_c \times d_c \times H}$.
Line 4 adds extra broadcast dimensions to make the batch dimensions match.
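To make the per-head absorption concrete, here is a minimal numerical check for a single batch in plain Julia. It is a sketch under my own assumptions: heads are taken as consecutive blocks of $d_h$ rows, and the names `head`, `S_naive` and `S_absorb` are only for illustration.

```julia
dh, H, dc, n = 4, 3, 5, 6
W_UK, W_UQ = randn(dh * H, dc), randn(dh * H, dc)
C_KV, C_Q = randn(dc, n), randn(dc, n)

# Rows (h-1)*dh+1 : h*dh of a (dh*H × ...) matrix belong to head h.
head(M, h) = M[(h - 1) * dh + 1 : h * dh, :]

# Naive: uncompress to K and Q first, then multiply per head.
K, Q = W_UK * C_KV, W_UQ * C_Q
S_naive = [head(K, h)' * head(Q, h) for h in 1:H]

# Absorbed: one (dc × dc) matrix per head, applied directly to the compressed inputs.
W_KQ = [head(W_UK, h)' * head(W_UQ, h) for h in 1:H]
S_absorb = [C_KV' * W_KQ[h] * C_Q for h in 1:H]

all(isapprox(S_naive[h], S_absorb[h]) for h in 1:H)
```

Each `W_KQ[h]` is $d_c \times d_c$, so stacking them along a third dimension gives the $d_c \times d_c \times H$ array from the dimensional analysis.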

Broadcasting is a technique where the smaller array is replicated along all dimensions of size 1 to match the size of the larger array. For broadcasted batched multiplication this only needs to be done for the 3rd and higher dimensions. Pseudo-code for this is:

Broadcasted batched multiplication
inputs: $A \in \mathbb{R}^{I\times R \times L_A \times K_A}$, $B \in \mathbb{R} ^{R \times J \times L_B \times K_B}$
for $l$ in $1:\max(L_A, L_B)$
$\quad$ $l_A \leftarrow$ 1 if $L_A=1$ else $l$
$\quad$ $l_B \leftarrow$ 1 if $L_B=1$ else $l$
$\quad$ for $k$ in $1:\max(K_A, K_B)$
$\quad\quad$ $k_A \leftarrow$ 1 if $K_A=1$ else $k$
$\quad\quad$ $k_B \leftarrow$ 1 if $K_B=1$ else $k$
$\quad\quad$ $C_{l,k} = A_{l_A,k_A} B_{l_B, k_B}$
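A direct translation of this pseudo-code for 4D arrays could look like the following. It is a minimal CPU-only sketch written for this explanation: the name `naive_broadcasted_batched_mul` is hypothetical, and it multiplies whole matrix slices rather than indexing scalars.

```julia
# Naive broadcasted batched multiplication for 4D arrays.
# Batch dimensions (3 and 4) of size 1 are virtually replicated.
function naive_broadcasted_batched_mul(A::AbstractArray{T, 4}, B::AbstractArray{T, 4}) where T
    nI, R, LA, KA = size(A)
    R2, nJ, LB, KB = size(B)
    @assert R == R2 "inner matrix dimensions must match"
    @assert LA == LB || LA == 1 || LB == 1 "3rd dimensions must match or be 1"
    @assert KA == KB || KA == 1 || KB == 1 "4th dimensions must match or be 1"
    L, K = max(LA, LB), max(KA, KB)
    C = zeros(T, nI, nJ, L, K)
    for k in 1:K, l in 1:L
        lA = LA == 1 ? 1 : l
        lB = LB == 1 ? 1 : l
        kA = KA == 1 ? 1 : k
        kB = KB == 1 ? 1 : k
        C[:, :, l, k] = A[:, :, lA, kA] * B[:, :, lB, kB]
    end
    C
end

W = randn(Float32, 3, 3, 4, 1)   # e.g. (dc, dc, H, 1)
c = randn(Float32, 3, 5, 1, 2)   # e.g. (dc, n, 1, B)
S = naive_broadcasted_batched_mul(W, c)
size(S) # (3, 5, 4, 2)
```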

This same absorption technique can be applied to the value and output matrices:

\[\begin{align} Y &= W^O V Z \\ &= W^O (W^{UV} C^{KV} Z) \\ &= W^{OV} C^{KV} Z \quad ; W^{OV}=W^O W^{UV} \end{align} \label{eq:absorbOV} \tag{3.3.3}\]

The dimensional analysis here is similar:

\[\begin{align} & (d_o \times d_h H) (d_h H \times d_c) (d_c \times n \times B) (n \times n \times H \times B) \\ &= (d_o \times d_h \times H) (d_h \times d_c \times H) (d_c \times n \times 1 \times B) (n \times n \times H \times B)\\ &= (d_o \times d_h \times H) (d_h \times d_c \times H) (d_c \times n \times H \times B) \\ &= (d_o \times d_c \times H) (d_c \times n \times H \times B) \\ &= (d_o \times d_c H) (d_c H \times n \times B) \\ &= d_o \times n \times B \end{align} \label{eq:absorbOV_dimension} \tag{3.3.4}\]

This shows that the $C^{KV} Z$ multiplication is a broadcasted batched multiplication. However, whereas $W^{KQ} \in \mathbb{R}^{d_c \times d_c \times H}$ is a 3D array, $W^{OV} \in \mathbb{R}^{d_o \times d_c H}$ is a typical 2D matrix. Therefore the usual matrix multiplication can be applied by reshaping the $C^{KV} Z$ result from a 3D $d_c H \times n \times B$ array to a $d_c H \times nB$ matrix.

3.4 Broadcasted batched multiplication

I have written an implementation in Julia directly based on the pseudo code. It can be seen here: broadcasted_batched_mul.jl. However, it is inefficient and uses scalar indexing which is extremely slow on a GPU.

My solution instead is to physically replicate the broadcasted dimensions. This is of course inefficient compared to virtual replication but it makes the function viable on a GPU. The downside is it can be up to 4× slower than the naive code on a CPU.

using Flux: batched_mul
function broadcasted_batched_mul(x::AbstractArray{T, N}, y::AbstractArray{T, N}) where {T, N}
    batch_dims_x = Tuple(size(x, idx) == 1 ? size(y, idx) : 1 for idx in 3:N)
    dims_x = (1, 1, batch_dims_x...)
    batch_dims_y = Tuple(size(y, idx) == 1 ? size(x, idx) : 1 for idx in 3:N)
    dims_y = (1, 1, batch_dims_y...)
    xb = repeat(x; outer=dims_x)
    yb = repeat(y; outer=dims_y)
    batched_mul(xb, yb)
end

The DeepSeek source code meanwhile uses torch.einsum and carries out the multiplications right to left instead of creating a new matrix.³ Here is the relevant code. As far as I know, this has the same drawbacks with scalar indexing with none of the advantages of absorption as described in their paper.

q = self.wq_b(self.q_norm(self.wq_a(x)))
q = q.view(bsz, seqlen, self.n_local_heads, self.qk_head_dim)
kv = self.wkv_a(x)
wkv_b = self.wkv_b.weight 
wkv_b = wkv_b.view(self.n_local_heads, -1, self.kv_lora_rank)
q = torch.einsum("bshd,hdc->bshc", q, wkv_b)
self.kv_cache[:bsz, start_pos:end_pos] = self.kv_norm(kv)
scores = torch.einsum("bshc,btc->bsht", q, self.kv_cache[:bsz, :end_pos])

Presumably fast MLA implementations for GPU kernels would make use of virtual replication. (It would be great if someone can clarify what FlashMLA does.)

3.5 Code

I will detail some of the code here. For the full code, see MultiHeadLatentAttentionV2.jl.

The first step is to create the $W^{KQ}$ and $W^{OV}$ matrices.

First, $W^{KQ}=(W^{UK})^T W^{UQ}$ while reshaping from $d_h H \times d_c$ to $d_h \times d_c \times H$.

function _absorb_WUK_WUQ(nhead::Int, W_UK::AbstractMatrix, W_UQ::AbstractMatrix)
    dh = div(size(W_UK, 1), nhead)
    dim_lora = size(W_UK, 2)
    W_UQ = permutedims(reshape(W_UQ, dh, nhead, dim_lora), (1, 3, 2)) 
    W_UK = permutedims(reshape(W_UK, dh, nhead, dim_lora), (1, 3, 2))
    W_UKT = permutedims(W_UK, (2, 1, 3)) # (dh, dc, nhead)^T => (dc, dh, nhead)
    batched_mul(W_UKT, W_UQ) 
end
W_KQ = _absorb_WUK_WUQ(nhead, denseUK.weight, denseUQ.weight)

Then $W^{OV}=W^O W^{UV}$ while preserving the head dimension through reshaping:

function _absorb_WO_WUV(nhead::Int, W_O::AbstractMatrix, W_UV::AbstractMatrix)
    dh = div(size(W_UV, 1), nhead)
    dim_lora = size(W_UV, 2)
    dout = size(W_O, 1)
    W_UVh = permutedims(reshape(W_UV, dh, nhead, dim_lora), (1, 3, 2)) # (dh*nhead, dc) => (dh, dc, nhead)
    W_Oh = reshape(W_O, dout, dh, nhead) # (dout, dh*nhead) => (dout, dh, nhead)
    W_OVh = batched_mul(W_Oh, W_UVh) # (dout, dh, nhead) * (dh, dc, nhead)
    reshape(W_OVh, dout, dim_lora*nhead) # (dout, dc, nhead) => (dout, dc*nhead)
end
W_OV = _absorb_WO_WUV(nhead, denseO.weight, denseUV.weight)

The forward pass is the same as in the original code until the end of caching:

function mla_absorb(
    mla::MultiHeadLatentAttention, query::A3, key::A3
    ; start_pos::Int=1, use_cache::Bool=true, mask::Union{Nothing, M}=nothing
    ) where {T, A3 <: AbstractArray{T, 3}, M <: AbstractArray{Bool}}
    dm, seq_length, batch_dim = size(key)
    dh = div(dm, mla.nhead)
    cq = mla.norm_cq(mla.denseDQ(query))  # size(cq) == (dc, dq, B)
    ckv = mla.norm_ckv(mla.denseDKV(key)) # size(ckv) == (dc, dkv, B)
    if use_cache
        end_pos = start_pos + seq_length - 1
        mla.cache_ckv[:, start_pos:end_pos, 1:batch_dim] = ckv
        ckv = mla.cache_ckv[:, 1:end_pos, 1:batch_dim]
    end

Then add the broadcast dimensions:

    ckv_ = Flux.unsqueeze(ckv, dims=3) # (dc, dkv, B) => (dc, dkv, 1, B)
    keyT = permutedims(ckv_, (2, 1, 3, 4)) # (dc, dkv, 1, B) => (dkv, dc, 1, B)
    cq_ = Flux.unsqueeze(cq, dims=3) # (dc, dq, B) => (dc, dq, 1, B)
    W_KQ = Flux.unsqueeze(mla.W_KQ, dims=4) # (dc, dc, nhead) => (dc, dc, nhead, 1)

Then apply the equations as before except using broadcasted_batched_mul instead of batched_mul:

    atten_base = broadcasted_batched_mul(keyT, broadcasted_batched_mul(W_KQ, cq_))
    atten = one(T)/convert(T, sqrt(dh)) .* (atten_base)
    atten = apply_mask(atten, mask)
    scores = softmax(atten; dims=1)
    A = broadcasted_batched_mul(ckv_, scores) # (dc, dq, nhead, B)
    # (dc, dq, nhead, B) => (dc*nhead, dq, B)
    A = permutedims(A, [1, 3, 2, 4])
    A = reshape(A, :, size(A, 3), size(A, 4))
    mla.denseOV(A), scores 
end

Test that these give the same result:

X0 = randn(Float32, dim_model, N, batch_dim)
mask = make_causal_mask(ones(N, N));
A_naive, scores_naive = mla_naive(mla, X0, X0; mask=mask, use_cache=true);
A_absorb, scores_absorb = mla_absorb(mla, X0, X0; mask=mask, use_cache=true);
isapprox(A_absorb, A_naive) # true

3.6 Decoupled RoPE

The last enhancement DeepSeek adds is RoPE. One issue with RoPE however is that it breaks the absorption property described above. To prove this, RoPE can be represented as a series of matrix multiplications on each column in an input matrix $X$:

\[\text{RoPE}(X) = \begin{bmatrix} R_1 X_1 & R_2 X_2 & ... & R_n X_n \end{bmatrix} \label{eq:RoPE} \tag{3.4.1}\]

Applying this to the scores equation:

\[\begin{align} S &= \text{RoPE}(K)^T\text{RoPE}(Q) \\ &= \text{RoPE}(W^{UK}C^{KV})^T \text{RoPE}(W^{UQ}C^Q) \\ &= \begin{bmatrix} R_1 W^{UK} C^{KV}_1 & ... & R_n W^{UK} C^{KV}_n \end{bmatrix}^T \\ &\phantom{=x} \begin{bmatrix} R_1 W^{UQ} C^Q_1 & ... & R_n W^{UQ} C^Q_n \end{bmatrix} \\ \implies S_{ij} &= (C^{KV}_{i})^T (W^{UK})^T R_i^T R_j W^{UQ} C^{Q}_j \end{align} \label{eq:absorb_RoPE} \tag{3.4.2}\]

which shows that the rotation matrices will appear right in the middle of the product.
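A small sketch with a single 2D rotation illustrates why this blocks absorption. The helper names `rot` and `R` and the angle `θ` are assumptions for this example. The product $R_i^T R_j$ depends on the positions $i$ and $j$ (through their difference), so there is no single position-independent matrix that could be folded into $W^{KQ}$.

```julia
# 2D rotation matrix for an angle θ.
rot(θ) = [cos(θ) -sin(θ); sin(θ) cos(θ)]

θ = 0.3
R(i) = rot(i * θ)          # rotation applied to the token at position i

M_25 = R(2)' * R(5)        # middle term for key position 2, query position 5
M_33 = R(3)' * R(3)        # middle term for key position 3, query position 3

M_25 ≈ rot(3θ)             # true: only the relative position j - i matters
M_25 ≈ M_33                # false: the term changes with the positions
```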

DeepSeek’s solution is to concatenate another matrix to the bottom of the key $K$ and query $Q$ respectively and only apply RoPE to these matrices. Furthermore, because the new matrix $K^R$ will also need to be cached, they share it across all heads. To put it another way, $K^R$ will be broadcasted across the head dimension during multiplication. So for each head $h$:

\[\begin{align} K_h &= \begin{bmatrix} W^{UK}_h C^{KV} \\ \text{RoPE}(W^{KR} X) \end{bmatrix} \\ Q_h &= \begin{bmatrix} W^{UQ}_h C^{Q} \\ \text{RoPE}(W^{QR}_h C^{Q}) \end{bmatrix} \end{align} \label{eq:MLA_RoPE} \tag{3.4.3}\]

where $W^{KR} \in \mathbb{R}^{d_R \times d}$ and $W^{QR} \in \mathbb{R}^{d_R H \times d_c}$. This means that $K,Q \in \mathbb{R}^{(d_h + d_R) \times n \times H \times B}$. The cache will consist of both $C^{KV}$ and $K^R$ for a total size of $(d_c + d_R) \times N \times B$ elements per layer.

Very conveniently, this results in an addition between the original and embedded scores:

\[\begin{align} S_h &= K^T Q \\ &= \begin{bmatrix} (K^0)^T & (K^{R})^T \end{bmatrix} \begin{bmatrix} Q^0 \\ Q^{R} \end{bmatrix} \\ &= (K^0)^T Q^0 + (K^{R})^T Q^{R} \end{align} \label{eq:MLA_RoPE_scores_} \tag{3.4.4}\]

which means that these results can be calculated separately. Note that $S_h$ must now be scaled by $1/\sqrt{d_h + d_R}$ instead of $1/\sqrt{d_h}$.
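The additive split is easy to confirm numerically. This is a minimal single-head sketch with random stand-ins for the four blocks; the names `K0`, `Q0`, `KR` and `QR` mirror the symbols in the equation and the sizes are arbitrary.

```julia
dh, dr, n = 4, 2, 5
K0, Q0 = randn(dh, n), randn(dh, n)   # uncompressed key and query blocks
KR, QR = randn(dr, n), randn(dr, n)   # stand-ins for the RoPE blocks

# Concatenate along the first (feature) dimension, as in the naive method.
K, Q = vcat(K0, KR), vcat(Q0, QR)

S_concat = K' * Q
S_split = K0' * Q0 + KR' * QR
S_concat ≈ S_split                    # the two score matrices agree
```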

3.7 Code

I will detail some of the code here. For the full code, see MultiHeadLatentAttentionV2.jl.

Here is a Julia implementation of RoPE:

struct RoPE{T}
    base::Int
    dim::Int
    seq_length::Int
    freqs_complex::Matrix{Complex{T}}
end

RoPE(dim::Int, max_seq_length::Int; base::Int=10_000) = RoPE(Float32, dim, max_seq_length; base=base)

function RoPE(T::DataType, dim::Int, max_seq_length::Int; base::Int=10_000)
    @assert dim % 2 == 0 "Require even dim"
    θ = 1 ./ (base .^ ((0:2:(dim - 2)) / dim))
    angles = θ * transpose(0:(max_seq_length-1))
    freqs = map(x -> reverse(sincos(x)), angles)
    freqs_complex = map(cs -> Complex(cs...), freqs)
    RoPE{T}(base, dim, max_seq_length, freqs_complex)
end

The forward pass can be calculated with matrices, but the RoPE authors gave a more efficient implementation with complex numbers:

(r::RoPE)(x::AbstractArray) = apply_rope(x, r.freqs_complex[:, 1:size(x, 2)])
(r::RoPE)(x::AbstractArray, indices) = apply_rope(x, r.freqs_complex[:, indices])

function apply_rope(x::AbstractArray{T}, freqs_complex::AbstractMatrix{<:Complex{T}}) where T
    x_complex = reinterpret(Complex{T}, x)
    rx_complex = freqs_complex .* x_complex
    T.(reinterpret(T, rx_complex))
end
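To see why the complex multiplication is equivalent to applying rotation matrices, here is a small standalone check on a single 4-element vector. It does not use the `RoPE` struct above; the angles and the helper `rot` are assumptions for illustration. Each adjacent pair of entries is treated as the real and imaginary parts of one complex number.

```julia
x = Float32[1, 2, 3, 4]                       # two (real, imaginary) pairs
angles = Float32[0.5, 0.1]                    # one rotation angle per pair
freqs = Complex.(cos.(angles), sin.(angles))  # cos(θ) + i sin(θ)

# Complex route: reinterpret pairs as complex numbers and multiply.
x_complex = reinterpret(ComplexF32, x)
y = collect(reinterpret(Float32, freqs .* x_complex))

# Matrix route: rotate each pair with a 2×2 rotation matrix.
rot(θ) = [cos(θ) -sin(θ); sin(θ) cos(θ)]
y_ref = vcat(rot(angles[1]) * x[1:2], rot(angles[2]) * x[3:4])

y ≈ y_ref
```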

Then add the embedding, denseQR and denseKR layers to the MultiHeadLatentAttention struct.

The embeddings are applied as follows:

function _apply_embeddings(mla::MultiHeadLatentAttention, key::A3, cq::A3, idx::UnitRange{Int}) where {T, A3 <: AbstractArray{T, 3}}
    dim_lora, dq, batch_dim = size(cq)
    kr = mla.denseKR(key)
    qr = mla.denseQR(cq)
    kr = mla.embedding(kr, idx) # size(kr) == (dr, dkv, B)
    qr = permutedims(reshape(qr, :, mla.nhead, dq, batch_dim), (1, 3, 2, 4)) # (dr*nhead, dq, B) => (dr, dq, nhead, B)
    qr = mla.embedding(qr, idx)
    kr, qr
end
kr, qr = _apply_embeddings(mla, key, cq, start_pos:end_pos)

Note that embedding is done per head, hence the reshaping of qr.

For the naive method, concatenate along the head dimension. This requires reshaping for qr and repeating kr.

    Q, K = _cat_decoupled_embedding(mla.nhead, Q, qr, K, kr)

where:

function _cat_decoupled_embedding(
    nhead::Int, Qin::A3, Qr::A4, Kin::A3, kr::A3
    ) where {T, A3 <: AbstractArray{T, 3}, A4 <: AbstractArray{T, 4}}
    dhq, dq, B = size(Qin)
    dhk, dkv, B = size(Kin)
    Q = reshape(
        cat(reshape(Qin, :, nhead, dq, B), permutedims(Qr, (1, 3, 2, 4)), dims=1),
        : , dq, B)
    Kr = repeat(Flux.unsqueeze(kr, dims=2), outer=(1, nhead, 1, 1)) # (dr, dkv, B) => (dr, nhead, dkv, B)
    K = reshape(
        cat(reshape(Kin, :, nhead, dkv, B), reshape(Kr, :, nhead, dkv, B), dims=1),
        :, dkv, B)
    Q, K
end

Then continue as before.

With absorption, broadcast batched multiply kr and qr and add to the original attention:

    krT = Flux.unsqueeze(permutedims(kr, (2, 1, 3)), dims=3) # (dr, dkv, B) => (dkv, dr, 1, B)
    atten_base = broadcasted_batched_mul(keyT, broadcasted_batched_mul(W_KQ, cq_))
    atten_embed = broadcasted_batched_mul(krT, qr)
    atten = one(T)/convert(T, sqrt(dh + dr)) .* (atten_base + atten_embed)

4 Conclusion

Overall, I think MLA is a smart and useful idea. However, after having explored it in depth, I am more critical of their enhancements.

The basic premise of Multi-Head Latent Attention is simple. It compresses the input matrix so that a single smaller $C^{KV}$ matrix can be stored instead of the key $K$ and value $V$ matrices. It then uncompresses this matrix into $K$ and $V$ with two additional weight matrices. This also results in a significant performance increase which scales with the compression ratio $\frac{d_h H}{d_c}$ by a factor of $\tfrac{2}{3}$ - so the compression needs to be greater than a modest 1.5 for gains to be realised - and requires no further modifications to existing MHA code. However, it is unclear what the qualitative effects of the compression are, and it is strange that DeepSeek did not discuss the performance benefits.

To this DeepSeek adds weight absorption and decoupled RoPE. We have seen this complicates the mathematics and requires careful dimensional analysis. True performance gains only come with an optimised broadcasted_batched_mul function. Their own open source code does not even have such optimisations. Personally, I see no benefit to this and would recommend the naive method with RoPE applied normally. That is, apply each of the weight matrices to their inputs individually and then apply RoPE to the entire $K$ and $Q$ matrices.

While I am impressed with the ingenuity behind MLA, DeepSeek’s omissions coupled with extra, unwieldy enhancements make me more skeptical of their methodology. If I examine their other techniques, I will do so with more caution.

Indices for dimensions are shown when they are relevant and left out when they’re not. For example, the row index (along the embedding dimension) is generally ignored so $X_j$ is the $j$th column/token of $X$. But then later for the scores I’ll use $X_{i,j}$ because both indices represent a position in the token sequence. But then later I use $X_h$ to indicate the $h$th head of $X$ which is along the 3rd dimension. Please forgive me for this and other abuses of matrix notation in this post. ↩
This is the naive matrix multiplication algorithm. For sizes $n \times d$ and $d \times m$, for each of the $nm$ output elements there are $d$ multiplications and $d-1$ additions (no addition for the first element), so there are $nm(2d-1)$ operations in total. ↩
In general multiplication is not defined for higher order arrays. But there is a set of multidimensional algebraic objects called tensors where it is, and Einstein notation was designed for this use case. Confusingly, Google named their machine learning framework TensorFlow and calls higher order arrays tensors. So one should differentiate between machine learning tensors and geometric tensors. They are not the same. To give a simple explanation: one can think of geometric tensors as higher order arrays with severe constraints on their entries and operations because they represent geometric objects. These constraints make it harder - not easier - to code higher order arrays as geometric tensors. ↩

Notes on the Martinez-Rueda Polygon Clipping algorithm

2025-01-11T00:00:00+00:00

The Martinez-Rueda algorithm computes boolean operations between polygons. It can be used for polygon intersections (polygon clipping), unions, differences and XORs. I recently implemented it by following a comprehensive guide at https://sean.fun/a/polygon-clipping-pt2/. However, it was slightly lacking in some complex scenarios, mainly resulting from the strict ordering required by the Bentley-Ottmann line intersection algorithm. This post explains my minor modifications to address this crucial part of the algorithm.

1 Introduction

Boolean operations between a spiral and a star, computed with the Martinez-Rueda algorithm.

I recently updated my PolygonAlgorithms.jl package to use the Martinez-Rueda algorithm for boolean operations between polygons. I had originally implemented a version of the Weiler-Atherton algorithm, explained in detail in an earlier blog post. However, that algorithm can only calculate intersections between polygons, whereas Martinez-Rueda simultaneously calculates the intersection as well as unions, differences and XORs between the polygons. See the above example and the table below for a brief comparison between the algorithms.

	Martinez-Rueda	Weiler-Atherton
Operation	Segment level. Compare fill annotations.	Point level. Walk along loops.
Polygon types	Convex, concave, self-intersecting, holes.	Convex, concave. Can be extended to holes.
Time complexity	$\mathcal{O}((n+m+k)\log(n+m))$	$\mathcal{O}(nm)$
Return types	Segments and regions.	Points, segments and regions.

The Martinez-Rueda algorithm is more versatile because it fundamentally operates at a segment level, whereas Weiler-Atherton operates at a point level, and so it has a “bigger picture” view of the polygons. A disadvantage of the Martinez-Rueda algorithm is that it is more sensitive to numerical inaccuracies - for reasons that will be described shortly - such as a line that is almost vertical or tiny regions of intersection. In practice I found their runtimes similar, with Martinez-Rueda running faster in some situations and slower in others. For the spiral-star example, it is about 1.5× slower.

The original paper can be found here, but I followed the guide at https://sean.fun/a/polygon-clipping-pt2/.

Fill annotations in the Martinez-Rueda algorithm.

The core idea behind the Martinez-Rueda algorithm is to calculate fill annotations for each segment for each polygon: is this segment filled above and below by this polygon, and is it filled above and below by the other polygon? Once these are known, it is easy to select the relevant segments to the given operation, and to link them up again into polygons.

Line sweep and line stack in the Martinez-Rueda algorithm. Source: sean.fun/a/polygon-clipping-pt2.

The genius of the Martinez-Rueda algorithm is to extend the Bentley-Ottmann algorithm for segment intersections to do this. It does a vertical line sweep from left to right, bottom to top. At any given moment, we can imagine having a stack of all the lines that intersect the vertical line, ordered from top to bottom. According to the Bentley-Ottmann algorithm, to find intersections through a segment, we only need to check for intersections with the segments immediately above and immediately below it in the stack. At the same time, we can propagate the fill annotations from the segment below, or empty space if nothing is below it. Hence, finding the exact segments that are above and below a segment is paramount to this algorithm, and even slight mistakes can cause errors that propagate to other segments.

This is the gist of the algorithm. The practicalities of handling the event queue and many edge cases such as handling coincident lines and tricky annotation situations are described in the article. From now, I will focus only on the is_above? algorithm to determine if a segment is above another segment. I spent a long time debugging the whole Martinez-Rueda algorithm against a variety of test cases, and I always seemed to land back at this is_above? function. Getting this function right solved most of my problems.

2 Is above?

For reference, the function is called statusCompare in the article.

The goal of this algorithm is to sort lines by height. This will then give a sweep status like:

Lines sorted by height.

Given line segments $AB$ and $CD$, it is tempting to sort them only by the starting point coordinate:

\[y_A \geq y_C \tag{2.1} \label{eq:greater}\]

This will work in most cases. However already in the figure we can see an example where it does not. Line 2’s starting point is below line 3’s, but it makes more sense to consider line 2 as “above” line 3.

A better definition for “above” is needed. Instead, we will consider one segment above another if its starting point is above its projection on the other line:

That is, $y_p \leq y_A$, where $y_p$ is:

\[\begin{align} y_p &= \frac{y_D - y_C}{x_D - x_C}(x_A - x_C) + y_C \\ \implies & 0 \leq (y_A - y_C)(x_D - x_C) - (y_D - y_C)(x_A - x_C) \; ; x_D \neq x_C \end{align} \tag{2.2} \label{eq:projection}\]

However, one problem is that this equation is not symmetrical. (This is the case in the original statusCompare.) In the figure above, both projections are below the other line. Hence is_above will return false for both segments. Yet one must be above the other. Therefore to maintain symmetry, the function will always only consider the right segment. If it is the segment of interest, we check if its starting point is above its projection on the left line. Otherwise, if we are checking if the left segment is above, we check if the right segment’s starting point is below its projection on the left segment.

There are two other special cases. The first is if the starting point is colinear or coincident with the other line:

In this case, the endpoint is used instead.

The second and final case is a vertical line:

As implied by equation $\ref{eq:projection}$, if the line is vertical the projection equation is indeterminate. In fact, if the line were slightly sloped towards the left or towards the right, the answer would differ. Here instead we will simply compare y-values. That is, fall back to $\ref{eq:greater}$. (The original statusCompare did not account for this case.)

No fallback

If there is no fallback, then when $x_C=x_D$ equation $\ref{eq:projection}$ becomes: $$ 0 \geq (y_D - y_C)(x_A - x_C) $$ In the algorithm vertical events are always constructed from bottom to top, so $y_D > y_C$ and this becomes a test of whether or not $A$ is to the left of the vertical $CD$ segment.

Hence the is_above algorithm is:

Is segment AB above CD?
inputs: $AB$, $CD$
if colinear($A, C, D$)
$\quad$ return point_above_line($B, CD$)
if $x_C < x_A$
$\quad$ return point_above_line($A, CD$)
else
$\quad$ return not point_above_line($C, AB$)

where point_above_line is:

point_above_line
inputs: $P$, $CD$
if $x_C = x_D$
$\quad$ return $y_P \geq \text{min}(y_C, y_D)$
return $ (y_P - y_C)(x_D - x_C) - (y_D - y_C)(x_P - x_C) \geq 0$

This final algorithm is simple, but absolutely crucial for the algorithm.
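The pseudocode above can be sketched in Julia as follows. This is only an illustration, not the PolygonAlgorithms.jl implementation: points are assumed to be `(x, y)` tuples, segments are `(start, stop)` tuples ordered left to right, and the names `cross_z` and `is_colinear` as well as the tolerance `atol` are hypothetical.

```julia
# Assumption: a point is an (x, y) tuple and a segment is a (start, stop) tuple
# with start to the left of stop (or below it for vertical segments).

# Left-hand side of the inequality in the projection equation for point p and line cd.
cross_z(p, c, d) = (p[2] - c[2]) * (d[1] - c[1]) - (d[2] - c[2]) * (p[1] - c[1])

function point_above_line(p, seg)
    c, d = seg
    if c[1] == d[1] # vertical line: fall back to comparing y-values
        return p[2] >= min(c[2], d[2])
    end
    cross_z(p, c, d) >= 0
end

# Hypothetical helper: p lies on the infinite line through c and d, up to a tolerance.
is_colinear(p, c, d; atol=1e-10) = abs(cross_z(p, c, d)) <= atol

function is_above(ab, cd)
    a, b = ab
    c, d = cd
    if is_colinear(a, c, d)
        return point_above_line(b, cd) # use the endpoint instead
    end
    if c[1] < a[1]
        return point_above_line(a, cd) # ab is the right segment
    end
    !point_above_line(c, ab) # cd is the right segment
end

lower = ((0.0, 0.0), (4.0, 0.0))
upper = ((1.0, 1.0), (3.0, 2.0))
is_above(upper, lower) # true
is_above(lower, upper) # false
```

Both calls only ever test the starting point of the right segment (`upper`), which is what keeps the two answers consistent with each other.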

3 Compare events

For reference, the function is called eventCompare in the article.

It sorts segment events from left to right, bottom to top. There are two events per segment: a start event and an end event. An example ordering is:

The algorithm is:

1. If the points are not the same, the smaller event is to the left, or the lower one if they are on a vertical line.
2. If the other points are also the same, this event is not smaller. (Equal segments.)
3. If the one is a start event and the other an end event, the end event is considered smaller. (Common points.)
4. The smaller event is below the other one according to not is_above, unless the segment of interest is vertical, in which case the smaller event is “not above” if it is to the right. (Common start/end points.)

For example, in the picture:

Event 1 is smaller than event 2 by step 4: lower event is to right of a vertical segment.
Event 2 is smaller than event 5 by step 1: they are on the same segment, but event 2 is defined by the lower start point.
Event 3 is smaller than event 4 by step 4: common start point but segment 3 is lower than segment 4.
Event 5 is smaller than event 6 by step 3: same point but event 5 is an end event, while event 6 is a start event.

And so on.

4 Conclusion

This was a short post to address minor issues and some improvements to two parts of the Martinez-Rueda implementation from https://sean.fun/a/polygon-clipping-pt2/. Otherwise that article did a very good job at explaining this algorithm and I highly recommend it.

MicroGrad.jl: Part 5 MLP

2024-08-19T00:00:00+00:00

A series on automatic differentiation in Julia. Part 5 shows how the MicroGrad.jl code can be used for a machine learning framework like Flux.jl. The working example is a multi-layer perceptron trained on the moons dataset.

This is part of a series. The other articles are:

All source code can be found at MicroGrad.jl.

1 Introduction

A 2×6×2 multi-layer perceptron

The previous four sections have developed a minimal automatic differentiation package. The aim of this part is to demonstrate how it can be used as the backbone for a machine learning framework like Flux.jl. In this post we will create a multi-layer perceptron also known as a fully connected neural network. This is an extremely popular and powerful machine learning model. New code will be needed for the forward pass and for some extra rrules. Otherwise, the rest is handled by code from the previous parts.

2 Moons dataset

The moons dataset is a toy dataset for testing and visualising classification algorithms. While clearly distinct, the curved nature of the two classes requires a non-linear algorithm to discern them. This was the dataset chosen by Karpathy to demonstrate his micrograd package, and so it will be used here too.

This dataset can be reconstructed in Julia as follows, based on the Scikit-Learn function:

using Random

function make_moons(rng::AbstractRNG, n_samples::Int=100; noise::Union{Nothing, AbstractFloat}=nothing)
    n_moons = floor(Int, n_samples / 2)
    t_min = 0.0
    t_max = π
    t_inner = rand(rng, n_moons) * (t_max - t_min) .+ t_min
    t_outer = rand(rng, n_moons) * (t_max - t_min) .+ t_min
    outer_circ_x = cos.(t_outer)
    outer_circ_y = sin.(t_outer)
    inner_circ_x = 1 .- cos.(t_inner)
    inner_circ_y = 1 .- sin.(t_inner) .- 0.5

    data = [outer_circ_x outer_circ_y; inner_circ_x inner_circ_y]
    z = permutedims(data, (2, 1))
    if !isnothing(noise)
        z += noise * randn(rng, size(z)) # draw the noise from the supplied rng
    end
    z
end

make_moons(n_samples::Int=100; options...) = make_moons(Random.default_rng(), n_samples; options...)

Creating the moons and labels:

n = 100
X = make_moons(2n; noise=0.1) # 2×200 Matrix 
y = vcat(fill(1, n)..., fill(2, n)...) # 200-element Vector{Int64}

3 Layers

3.1 ReLU

The Rectified Linear Unit (ReLU) is a common activation function in machine learning. It is defined as follows:

\[\text{relu}(x)=\begin{cases} x, & \text{if $x> 0$} \\ 0, & \text{otherwise} \end{cases}\]

This can be realised as a broadcast of the max function:

relu(x::AbstractArray) = max.(0, x)

The derivative is:

\[\frac{\partial \text{relu}}{\partial x}=\begin{cases} 1, & \text{if $x> 0$} \\ 0, & \text{otherwise} \end{cases}\]

In code:

function rrule(::typeof(relu), x::AbstractArray)
    relu_back(Δ) = (nothing, ifelse.(x .> 0, Δ, 0))
    relu(x), relu_back
end
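As a quick check, the rule can be exercised on a small vector. The snippet repeats the two definitions above so that it runs on its own; the input values are arbitrary.

```julia
relu(x::AbstractArray) = max.(0, x)

function rrule(::typeof(relu), x::AbstractArray)
    relu_back(Δ) = (nothing, ifelse.(x .> 0, Δ, 0))
    relu(x), relu_back
end

x = [-1.0, 0.0, 2.0]
y, back = rrule(relu, x) # y = [0.0, 0.0, 2.0]
grads = back(ones(3))    # (nothing, [0, 0, 1.0])
```

Only the strictly positive entry passes the incoming gradient through. The entry at exactly zero gets a gradient of zero, matching the "otherwise" branch of the derivative.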

3.2 Dense layer

The fully connected layer equation is:

\[Y_{ij} = a\left(\sum_k W_{ik}X_{kj} + b_{i} \right)\]
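To make the index notation concrete, here is a small sketch that evaluates the layer element by element, with the bias $b_i$ added once per output element, and compares it with the equivalent matrix expression. The weights, inputs and bias are arbitrary illustrative values.

```julia
W = [1.0 -2.0; 0.5 1.0; -1.0 3.0]       # 3×2 weights
X = [1.0 2.0 3.0 4.0; 0.5 -1.0 0.0 2.0] # 2×4 inputs (one column per sample)
b = [0.1, -0.2, 0.3]                    # one bias per output row
a(x) = max(0, x)                        # activation (ReLU)

# Element-wise: sum over k, then add b[i], then apply the activation.
Y_sum = [a(sum(W[i, k] * X[k, j] for k in 1:2) + b[i]) for i in 1:3, j in 1:4]

# Matrix form: the bias is broadcast across the columns.
Y_mat = a.(W * X .+ b)

Y_sum ≈ Y_mat # true
```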

This is the code from Flux.jl to create this fully connected layer (source):

using Random
struct Dense{M<:AbstractMatrix, B<:AbstractMatrix, F}
    weight::M
    bias::B
    activation::F
end

function (a::Dense)(x::AbstractVecOrMat)
    a.activation(a.weight * x .+ a.bias)
end

Dense((in, out)::Pair; activation=relu) = Dense(glorot_uniform(in, out), zeros(out, 1), activation)

function glorot_uniform(rng::AbstractRNG, fan_in::Int, fan_out::Int)
    scale = sqrt(24 / (fan_in + fan_out))  # 0.5 * sqrt(24) = sqrt(1/4 * 24) = sqrt(6)
    (rand(rng, fan_out, fan_in) .- 0.5) .* scale
end

glorot_uniform(fan_in::Int, fan_out::Int) = glorot_uniform(Random.default_rng(), fan_in, fan_out)
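The comment in `glorot_uniform` says that the samples are uniform on $\pm\sqrt{6/(\text{fan\_in} + \text{fan\_out})}$. A small sketch to confirm the bounds follows. The function is repeated so that the snippet is self-contained, and the seed is an arbitrary choice.

```julia
using Random

function glorot_uniform(rng::AbstractRNG, fan_in::Int, fan_out::Int)
    scale = sqrt(24 / (fan_in + fan_out))
    (rand(rng, fan_out, fan_in) .- 0.5) .* scale
end

fan_in, fan_out = 2, 3
W = glorot_uniform(MersenneTwister(1), fan_in, fan_out) # 3×2 Matrix{Float64}
limit = sqrt(6 / (fan_in + fan_out))
all(abs.(W) .<= limit) # true
```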

Also add a method to parameters:

parameters(a::Dense) = (;weight=a.weight, bias=a.bias)

Create and test:

X = rand(2, 4)
layer = Dense(2 => 3; activation=relu)
layer(X) # 3×4 Matrix{Float64}

3.3 Reverse broadcast

Inspect the IR @code_ir layer(X):

1: (%1, %2)
  %3 = Base.getproperty(%1, :activation)
  %4 = Main.:+
  %5 = Base.getproperty(%1, :weight)
  %6 = %5 * %2
  %7 = Base.getproperty(%1, :bias)
  %8 = Base.broadcasted(%4, %6, %7)
  %9 = Base.materialize(%8)
  %10 = (%3)(%9)
  return %10

From part 1 and part 4 we have rrules for getproperty (getfield), matrix multiplication (*) and for the activation (relu). We still need rrules for broadcasted and materialize.

Creating rules for broadcasting in general is complex¹, so instead create a specific rule for the broadcast invoked here:

function rrule(::typeof(Broadcast.broadcasted), ::typeof(+), A::AbstractVecOrMat{<:Real}, B::AbstractVecOrMat{<:Real})
    broadcast_back(Δ) = (nothing, nothing, unbroadcast(A, Δ), unbroadcast(B, Δ))
    broadcast(+, A, B), broadcast_back
end

function unbroadcast(x::AbstractArray, x̄)
    if length(x) == length(x̄)
        x̄
    else
        dims = ntuple(d -> size(x, d) == 1 ? d : ndims(x̄)+1, ndims(x̄))
        dx = sum(x̄; dims = dims)
        check_dims(size(x), size(dx))
        dx
    end
end

function check_dims(size_x, size_dx) # see ChainRulesCore.ProjectTo
    for (i, d) in enumerate(size_x)
        dd = i <= length(size_dx) ? size_dx[i] : 1 # broadcasted dim
        if d != dd 
            throw(DimensionMismatch("variable with size(x) == $size_x cannot have a gradient with size(dx) == $size_dx"))
        end
    end
end

Testing:

X = rand(2, 4)
b = rand(2)
Z, back = rrule(Base.broadcasted, +, X, b) # (2×4 Matrix{Float64}, broadcast_back)
back(ones(2, 4)) # (nothing, nothing, ones(2, 4), [4.0; 4.0;;])

The definition for Base.Broadcast.materialize is:

@inline materialize(bc::Broadcasted) = copy(instantiate(bc))
materialize(x) = x

Hence we need rrules for copy and instantiate (source):

function rrule(::typeof(copy), bc::Broadcast.Broadcasted)
    uncopy(Δ) = (nothing, Δ)
    return copy(bc), uncopy
end

function rrule(::typeof(Broadcast.instantiate), bc::Broadcast.Broadcasted)
    uninstantiate(Δ) = (nothing, Δ)
    return Broadcast.instantiate(bc), uninstantiate
end

Now the pullback for the Dense layer works:

Y, back = pullback(layer, X) # (3×4 Matrix, Pullback)
back(ones(3, 4)) # ((;weight=...,bias=...,activation=nothing), 2×4 Matrix)
Y, back = pullback(m->m(X), layer) # (3×4 Matrix, Pullback)
back(ones(3, 4)) # (nothing, (;weight=...,bias=...,activation=nothing))

3.4 Chain

Here is the Flux code to create a generic chain (source):

struct Chain{T<:Tuple}
    layers::T
end
  
Chain(xs...) = Chain(xs)

(c::Chain)(x) = _apply_chain(c.layers, x)

@generated function _apply_chain(layers::Tuple{Vararg{Any,N}}, x) where {N}
  symbols = vcat(:x, [gensym() for _ in 1:N])
  calls = [:($(symbols[i+1]) = layers[$i]($(symbols[i]))) for i in 1:N]
  Expr(:block, calls...)
end
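The generated function unrolls the chain into one assignment per layer. Semantically this is a left fold over the layers, which can be sketched without code generation as below. The name `apply_chain_fold` is hypothetical and this version is for illustration only; it is not the Flux code.

```julia
# Feed x through each layer in turn: layers[N](...layers[2](layers[1](x))).
apply_chain_fold(layers::Tuple, x) = foldl((acc, layer) -> layer(acc), layers; init=x)

z = apply_chain_fold((cos, sin), 0.9) # sin(cos(0.9)) ≈ 0.5823
```

The unrolled form in `_apply_chain` gives the automatic differentiation code a flat sequence of calls to work with, whereas this fold hides the sequence inside `foldl`.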

Add a method to parameters:

parameters(c::Chain) = (;layers = map(parameters, c.layers))

We will need an rrule for getindex:

world = Base.get_world_counter()
pr1 = _generate_pullback(world, typeof(_apply_chain), Tuple{typeof(cos), typeof(sin)}, Float64)

It is as follows (source):

function rrule(::typeof(getindex), x::T, i::Integer) where {T<:Tuple}
    function getindex_back_1(Δy)
        dx = ntuple(j -> j == i ? Δy : nothing, length(x))
        return (nothing, (dx...,), nothing)
    end
    return x[i], getindex_back_1
end

Test (compare the results in part 1):

model = Chain(cos, sin)
model(0.9) # 0.5823
z, back = pullback(model, 0.9)
back(1.0) # ((layers=(nothing, nothing),), -0.6368)

Test a multi-layer perceptron:

model = Chain(
    Dense(2 => 16, activation=relu),
    Dense(16 => 16, activation=relu),
    Dense(16=>2, activation=relu)
)
model(X) # 2×4 Matrix
Z, back = pullback(m->m(X), model)  # (2×4 Matrix, Pullback)
back(ones(2, 4)) # (nothing, (layers=((weight=...), (weight=...), (weight=...))))

4 Loss

4.1 Cross entropy

The output of the machine learning model will be a probability $p_j$ for a sample $j$ being in a certain class. This will be compared to a probability for a known label $y_j$, which is either 1 if that sample is in the class or 0 if it is not. An obvious value to maximise is their product:

\[y_j p_j \tag{4.1}\]

with range $[0, 1]$.

However most machine learning optimisation algorithms aim to minimise a loss. So instead $p_j$ is scaled as $-\log(p_j)$, so that the loss ranges from $[0, \infty)$ with the goal to minimise it at 0. This is called the cross entropy loss:

\[L(p_j, y_j) = -y_j \log(p_j) \tag{4.2} \label{eq:cross_entropy}\]
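A short numerical illustration, where `cross_entropy` is a hypothetical helper that evaluates the equation above for a single sample:

```julia
cross_entropy(p, y) = -y * log(p)

confident = cross_entropy(0.9, 1) # ≈ 0.105: high probability for the true class, small loss
wrong = cross_entropy(0.1, 1)     # ≈ 2.303: low probability for the true class, large loss
```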

4.2 Logit cross entropy

The outputs of the neural network are not probabilities but instead a vector of logits containing $N$ real values for $N$ classes. By convention these logits are scaled to a probability distribution using the softmax function:

\[s(x)_i = \frac{e^{x_i}}{\sum_{r=1}^{N} e^{x_r}} \tag{4.3} \label{eq:softmax}\]
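A minimal sketch of the softmax over the first dimension is given below. Subtracting the column maximum before exponentiating is a common numerical stability trick that is assumed here; it is not part of the equation above and does not change the result.

```julia
function softmax(x::AbstractArray)
    expx = exp.(x .- maximum(x, dims=1)) # shift so the largest exponent is 0
    expx ./ sum(expx, dims=1)
end

p = softmax([1.0, 2.0, 3.0]) # ≈ [0.090, 0.245, 0.665]
sum(p) # ≈ 1.0
```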

Combining equations $\ref{eq:cross_entropy}$ and $\ref{eq:softmax}$ and taking a mean across samples gives the mean logit cross entropy loss:

\[\begin{align} L(x, y) &= -\frac{1}{n}\sum_{j=1}^n \sum_{i=1}^N y_{ij} z_{ij} \\ &= -\frac{1}{n}\sum_{j=1}^n \sum_{i=1}^N y_{ij} \left(x_{ij} - \log\left(\sum_{r=1}^{N} e^{x_{rj}}\right) \right) \end{align} \tag{4.4} \label{eq:logit_cross_entropy}\]

where $z_{ij}$ is the output of the logsoftmax function. Assuming that $y_{ij}$ is 1 for exactly one value of $i$ and 0 otherwise, this can be simplified to:

\[\begin{align} L(x, y) = -\frac{1}{n}\sum_{j=1}^n \left(x_{j} - \log\left(\sum_{r=1}^{N} e^{x_{rj}}\right) \right) \end{align} \tag{4.5} \label{eq:logit_cross_entropy_2}\]

In Julia this can be implemented as follows (source):

using StatsBase
logsoftmax(x::AbstractArray) = x .- log.(sum(exp.(x), dims=1))
function logit_cross_entropy(x::AbstractVecOrMat, y::AbstractVecOrMat)
    mean(-sum(y .* logsoftmax(x), dims=1))
end
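As a sanity check, if all the logits are equal then every class gets probability $1/N$ and the loss is $\log(N)$. The sketch below repeats the two functions and uses `Statistics` from the standard library for `mean` instead of StatsBase. The labels are an arbitrary one hot matrix.

```julia
using Statistics

logsoftmax(x::AbstractArray) = x .- log.(sum(exp.(x), dims=1))
function logit_cross_entropy(x::AbstractVecOrMat, y::AbstractVecOrMat)
    mean(-sum(y .* logsoftmax(x), dims=1))
end

Y = [1.0 1.0 0.0 0.0; 0.0 0.0 1.0 1.0] # one hot labels for 4 samples
loss_uniform = logit_cross_entropy(zeros(2, 4), Y) # log(2) ≈ 0.693
```

An untrained two-class model that outputs near-identical logits should therefore start with a loss close to 0.69.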

According to the multivariable chain rule, the derivative with respect to one logit $x_{ij}$ in the vector for sample $j$ is (gradients come from the main case $k=i$ as well as the sum in the softmax for $k\neq i$):

\[\begin{align} \frac{\partial L}{\partial x_{ij}} &= \sum_{k=1}^N \frac{\partial L}{\partial z_{kj}} \frac{\partial z_{kj}}{\partial x_{ij}} \\ &= \sum_{k=1}^N \left( -\frac{y_{kj}\Delta}{n} \frac{\partial}{\partial x_{ij}}\left(x_{kj} - \log\left(\sum_{r=1}^{N} e^{x_{rj}}\right) \right) \right) \\ &= \sum_{k=1}^N \left(-\frac{y_{kj} \Delta}{n} \left(\delta_{ki} - \frac{e^{x_{ij}}}{\sum_{r=1}^{N} e^{x_{rj}}} \right) \right) \\ &= -\frac{\Delta}{n} \left(y_{ij} - s(x_j)_{i} \sum_{k=1}^N y_{kj}\right) \end{align} \tag{4.6} \label{eq:back_logitcrossentropy}\]

where $\delta_{ki}$ is the Kronecker delta. Assuming that $y_{kj}$ is 1 for one value of $k$ and 0 otherwise, this simplifies to:

\[\begin{align} \frac{\partial L}{\partial x_{ij}} &= -\frac{\Delta}{n}(y_{ij} - s(x_j)_{i}) \end{align} \tag{4.7} \label{eq:back_logitcrossentropy_2}\]
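This simplified gradient can be checked numerically. The sketch below compares it (with $\Delta=1$) against central finite differences of the loss. The inputs and the step size `h` are arbitrary choices, and `loss` is a local copy of the logit cross entropy function so that the snippet runs on its own.

```julia
using Statistics

logsoftmax(x) = x .- log.(sum(exp.(x), dims=1))
loss(x, y) = mean(-sum(y .* logsoftmax(x), dims=1))

X = [0.2 -1.0 0.5; 1.5 0.3 -0.7] # logits: 2 classes × 3 samples
Y = [1.0 0.0 0.0; 0.0 1.0 1.0]   # one hot labels
n = size(X, 2)

# Analytic gradient: -(y - softmax(x)) / n
analytic = -(Y .- exp.(logsoftmax(X))) ./ n

# Central finite differences, one logit at a time.
numeric = similar(X)
h = 1e-6
for idx in eachindex(X)
    Xp = copy(X); Xp[idx] += h
    Xm = copy(X); Xm[idx] -= h
    numeric[idx] = (loss(Xp, Y) - loss(Xm, Y)) / (2h)
end

maximum(abs.(analytic .- numeric)) < 1e-6 # true
```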

In Julia this can be implemented as follows (source):

function rrule(::typeof(logsoftmax), x::AbstractArray)
    expx = exp.(x)
    Σ = sum(expx, dims=1)
    function logsoftmax_back(Δ)
        (nothing, Δ .- sum(Δ; dims=1) .* expx ./ Σ)
    end
    x .- log.(Σ), logsoftmax_back
end

function rrule(::typeof(logit_cross_entropy),  x::AbstractVecOrMat, y::AbstractVecOrMat)
    ls, logsoftmax_back = rrule(logsoftmax, x)
    function logit_cross_entropy_back(Δ)
        size_ls = size(ls)
        n = length(size_ls) > 1 ? prod(size(ls)[2:end]) : 1
        ∂x = logsoftmax_back(-y * Δ/n)[2]
        ∂y = -Δ/n .* ls
        return nothing, ∂x , ∂y
    end
    mean(-sum(y .* ls, dims = 1)), logit_cross_entropy_back
end

Testing:

y1, y2 = rand(4), rand(4)
l, back = pullback(logit_cross_entropy, y1, y2) # (2.69, logit_cross_entropy_back)
back(1.0) # (nothing, [0.4,...], [1.37,...] )
X = rand(2, 4) 
Y = [1.0 1.0 0.0 0.0 ; 0.0 0.0 1.0 1.0] # one hot encoded
l, back = pullback(logit_cross_entropy, X, Y)
back(1.0) # (nothing, 2×4 Matrix, 2×4 Matrix)

5 Train and Evaluate

5.1 Train

Create the moons data and labels:

n = 100
X = make_moons(2n; noise=0.1) # 2×200 Matrix 
y = vcat(fill(1, n)..., fill(2, n)...) # 200-element Vector{Int64}

Convert the labels to a one hot representation:

function onehot(y::AbstractVector, labels)
    num_classes = maximum(labels)
    Y = zeros(num_classes, length(y))
    for (j, label) in enumerate(y)
        Y[label, j] += 1
    end
    Y
end
Y = onehot(y, 1:2)
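On a small label vector the result is easier to see. The function is repeated here so that the example runs on its own, and the variable name `Y_small` is only for illustration.

```julia
function onehot(y::AbstractVector, labels)
    num_classes = maximum(labels)
    Y = zeros(num_classes, length(y))
    for (j, label) in enumerate(y)
        Y[label, j] += 1
    end
    Y
end

Y_small = onehot([1, 2, 2, 1], 1:2)
# 2×4 Matrix{Float64}:
#  1.0  0.0  0.0  1.0
#  0.0  1.0  1.0  0.0
```

Each column holds a single 1 in the row of its class, which is the assumption behind the simplified loss and gradient equations in section 4.2.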

Create the model:

model = Chain(
    Dense(2 => 16, activation=relu),
    Dense(16 => 16, activation=relu),
    Dense(16=>2, activation=relu)
)

Test the loss function:

l, back = pullback(m->logit_cross_entropy(m(X), Y), model); # (0.69, Pullback{...}(...))
back(1.0) # (nothing, layers=((weight=...),(weight=...),(weight=...),))

Use the exact same gradient_descent! function from part 4:

history = gradient_descent!(
    model, logit_cross_entropy, X, Y
    ; learning_rate=0.9, max_iters=200
)

5.2 Evaluate

Plot the history:

Training history

Calculate accuracy:

Y_pred = model(X)
y_pred = vec(map(idx -> idx[1], argmax(Y_pred, dims=1)))
mean(y_pred .== y) # 100%

Plot decision boundary:

using Plots
xmin, xmax = extrema(X[1, :])
ymin, ymax = extrema(X[2, :])
h = 0.01
xrange = (xmin-0.1):h:(xmax+0.1)
yrange = (ymin-0.1):h:(ymax+0.1)

x_grid = xrange' .* ones(length(yrange))
y_grid = ones(length(xrange))' .* yrange
Z = similar(x_grid)
for idx in eachindex(x_grid)
    logits = model([x_grid[idx], y_grid[idx]])
    Z[idx] = softmax(logits)[1]
end
canvas = heatmap(xrange, yrange, Z, size=(800, 500))

Plot points over the boundary:

scatter!(
    X[1, :], X[2, :], color=y, label="", aspectratio=:equal,
    xlims = xlims(canvas),
    ylims = ylims(canvas),
)

The result:

The probability boundaries of a multi-layer perceptron trained on the moons dataset.

6 Conclusion

That was a long and difficult journey. I hope you understand how automatic differentiation with Zygote.jl works now!

The Zygote.jl code for broadcast has this gem of a comment:

There's a saying that debugging code is about twice as hard as writing it in the first place. So if you're as clever as you can be when writing code, how will you ever debug it?

AD faces a similar dilemma: if you write code that's as clever as the compiler can handle, how will you ever differentiate it? Differentiating makes clever code that bit more complex and the compiler gives up, usually resulting in 100x worse performance.

Base's broadcasting is very cleverly written, and this makes differentiating it... somewhat tricky.

↩

MicroGrad.jl: Part 4 Extensions

2024-08-17T00:00:00+00:00

A series on automatic differentiation in Julia. Part 4 extends part 3 to handle maps, getfield and anonymous functions. It creates a generic gradient descent and uses this to fit a polynomial.

This is part of a series. The other articles are:

All source code can be found at MicroGrad.jl.

1 Introduction

By the end of part 3 we had code that could automatically differentiate many functions as long as we had rrules and there was no control flow.

However, the code failed for the polynomial model:

struct Polynomial{V<:AbstractVector}
    weights::V
end
(m::Polynomial)(x) = evalpoly(x, m.weights)
(m::Polynomial)(x::AbstractVector) = map(m, x)
model = Polynomial([3.0, 2.0, -3.0, 1.0])
x = [1.0, 2.0, 3.0, 4.0]
pullback(model, x) # ERROR: No method found for Tuple{typeof(fieldtype) ....}

Calling @code_ir model(x), we can see that code is lowered as follows:

1: (%1, %2)
  %3 = Main.map(%1, %2)
  return %3

And further that model(1.0) is lowered to:

1: (%1, %2)
  %3 = Base.getproperty(%1, :weights)
  %4 = Main.evalpoly(%2, %3)
  return %4

We could have also defined the map using an anonymous function:

(m::Polynomial)(x::AbstractVector) = map(x->evalpoly(x, m.weights), x)

In which case it would have been lowered to:

1: (%1, %2)
  %3 = Main.:(var"#43#44")
  %4 = Core.typeof(%1)
  %5 = Core.apply_type(%3, %4)
  %6 = %new(%5, %1)
  %7 = Main.map(%6, %2)
  return %7

The calls to Core.typeof and Core.apply_type are in the list of ignored functions. However we need to handle map, getproperty and %new. These sorts of functions do not have formal mathematical derivatives and so they do not have rrules in ChainRules.jl. Instead, Zygote.jl handles these functions with its own custom pullbacks. Zygote also replaces some low level functions like new, getproperty and getindex entirely with custom code.

2 Extending pullback

2.1 map

The pullback for map is fairly complex. What will be presented here is a simplified version. It might also help to look at the less generic code in the example in part 1.

Consider the following code:

f(x) = sin(x)
x = [0.1, 0.2, 0.5]
map(f, x)

The pullback for map should return 3 values: $\text{s̄elf}$ for map, $\bar{f}$ for the function f and $\bar{x}$ for each value in x.

The code will start by getting pullbacks for each value in x:

ys_and_backs = map((xs...) -> pullback(f, xs...), x) # ((0.099, Pullback), (0.198, Pullback), (0.479, Pullback))

This list is in a “zipped” format: there are $n$ entries of $(y_i, \mathcal{B}_i)$ for an array length $n$. This will be unzipped into two lists each of length $n$: $(y_1,…,y_n), (\mathcal{B}_1,…,\mathcal{B}_n)$:

Δ = ones(length(x))
ys = map(first, ys_and_backs) # (0.099, 0.198, 0.479)
∂f_and_∂x_zipped = map(((_, pb), δ) -> pb(δ), ys_and_backs, Δ) # ((nothing, 0.995), (nothing, 0.980), (nothing, 0.877))

The gradients list of $n$ entries

\[((\text{s̄elf}_1, \bar{x}_{11}, ..., \bar{x}_{k1}), ...,(\text{s̄elf}_n, \bar{x}_{1n}, ..., \bar{x}_{kn}))\]

needs to be further unzipped into $k+1$ lists for $\text{s̄elf}$ and $k$ arguments:

\[(\text{s̄elf}_1,...,\text{s̄elf}_{n}), (\bar{x}_{11},...,\bar{x}_{1n}), ... (\bar{x}_{k1},...,\bar{x}_{kn})\]

This is done with an unzip function which generalises first to any index i (source):

struct StaticGetter{i} end
(::StaticGetter{i})(v) where {i} = v[i]
(::StaticGetter{i})(::Nothing) where {i} = nothing

function _unzip(tuples, ::Val{N}) where {N}
  getters = ntuple(n -> StaticGetter{n}(), N)
  map(g -> map(g, tuples), getters)
end

function unzip(tuples)
  N = length(first(tuples))
  _unzip(tuples, Val(N))
end

The result:

∂f_and_∂x = unzip(∂f_and_∂x_zipped) # [nothing, nothing, nothing], [0.995, 0.98, 0.877]

As a final step, all the gradients for the function are accumulated into one value:

∂f = reduce(accum, ∂f_and_∂x[1]) # nothing

Putting all this code in a single function (source):

function pullback(::typeof(map), f::F, args::Vararg{Any, N}) where {F, N}
    ys_and_backs = map((xs...) -> pullback(f, xs...), args...)
    ys = map(first, ys_and_backs)
    function map_pullback(Δ)
      # technically should apply f in reverse and reverse back afterwards in case f is stateful
      ∂f_and_∂x_zipped = map(((_, pb), δ) -> pb(δ), ys_and_backs, Δ)
      ∂f_and_∂x = unzip(∂f_and_∂x_zipped) 
      ∂f = reduce(accum, ∂f_and_∂x[1])
      ∂args = ∂f_and_∂x[2:end]
      return (nothing, ∂f, ∂args...)
    end
    ys, map_pullback
end

Testing:

x = [0.1, 0.2, 0.5]
z, back = pullback(map, sin, x) 
back(ones(length(x))) # (nothing, nothing, [0.995, 0.98, 0.877])

And also:

f(a,b)=a/(a+b*b)
z, back = pullback(map, f, [2.0, 4.0], [3.0, 5.0]) 
back([1.0, 1.0]) # (nothing, nothing, [0.074, 0.029], [-0.099, -0.047])

2.2 Instrument

Zygote.jl modifies some of the source code before creating the primal and reverse passes. Here is a simplified version of this instrument function which only replaces new and getfield (source):

function instrument(ir::IR)
    pr = Pipe(ir)
    for (v, st) in pr
        ex = st.expr
        if isexpr(ex, :new)
            pr[v] = xcall(Main, :__new__, ex.args...)
        elseif is_literal_getfield(ex)
            pr[v] = xcall(Main, :literal_getfield, ex.args[2], Val(unwrapquote(ex.args[3])))
        end
    end
    finish(pr)
end

iscall(x, m::Module, n::Symbol) = isexpr(x, :call) && x.args[1] == GlobalRef(m, n)
unwrapquote(x) = x
unwrapquote(x::QuoteNode) = x.value

is_literal_getfield(ex) =
  (iscall(ex, Core, :getfield) || iscall(ex, Base, :getfield)) &&
  ex.args[3] isa Union{QuoteNode,Integer}

Modify the existing _generate_pullback_via_decomposition and _generate_callable_pullback functions to call it:

function _generate_pullback_via_decomposition(T, world)
    m = meta(T; world=world)
    isnothing(m) && return nothing
    ir = IR(m)
    length(blocks(ir)) == 1 || error("control flow is not supported")
    ir = instrument(ir) # new
    pr, calls = primal(ir, T)
    m, pr, calls
end

function _generate_callable_pullback(j::Type{<:Pullback{S, T}}, world, Δ) where {S, T}
    m = meta(S; world=world)
    ir = IR(m)
    isnothing(ir) && return :(error("Non-differentiable function ", repr(args[1])))
    length(blocks(ir)) == 1 || error("control flow is not supported")
    ir = instrument(ir) # new
    back = reverse_differentiate(ir)
    back = slots!(inlineable!(back))
    ci = build_codeinfo_(back)
    ci.slotnames = [Symbol("#self#"), :Δ]
    ci
end

Now we need to define literal_getfield and __new__ and their pullbacks.

2.3 getfield

Calls to getproperty default to getfield, where a field is declared in a struct’s declaration. The getfield function is substituted with literal_getfield (source):

literal_getfield(x, ::Val{f}) where f = getfield(x, f)

The pullback will return a NamedTuple for each field, where the gradient is Δ for the relevant field and nothing for the others (source):

@generated nt_nothing(x) = Expr(:tuple, [:($f=nothing) for f in fieldnames(x)]...)
@generated pair(::Val{k}, v, _=nothing) where k = :($k = v,)

function pullback(::typeof(literal_getfield), x, ::Val{f}) where f
  val = getfield(x, f)
  function literal_getfield_back(Δ)
    if isimmutable(x)
      dx = (; nt_nothing(x)..., pair(Val(f), Δ)...)
      (nothing, dx, nothing)
    else
      error("mutable structs not supported")
    end
  end
  val, literal_getfield_back
end

pullback(::typeof(getfield), x, field_name::Symbol) = pullback(literal_getfield, x, Val(field_name))

For example:

struct Foo
    a
    b
    c
end
foo = Foo(1.0, 'a', "hello")
z, back = pullback(getfield, foo, :b) # ('a', literal_getfield_back)
back(1.0) # (nothing, (a = nothing, b = 1.0, c = nothing), nothing)

And for the polynomial model:

z, back = pullback(model, 1.0)
back(2.3) # ((weights = [2.3, 2.3, 2.3, 2.3],), -2.3)

For the first time we have a value $\text{s̄elf}$, which is the named tuple for the fields.

2.4 new

The code now works with:

(m::Polynomial)(x::AbstractVector) = map(m, x)

It returns $\text{s̄elf}$ and $\bar{x}$:

model = Polynomial([3.0, 2.0, -3.0, 1.0])
x = [1.0, 2.0, 3.0, 4.0]
z, back = pullback(model, x)
back(ones(4)) # ((weights = [4.0, 10.0, 30.0, 100.0],), [-1.0, 2.0, 11.0, 26.0])

However with an anonymous function:

(m::Polynomial)(x::AbstractVector) = map(x->evalpoly(x, m.weights), x)

nothing is returned for $\text{s̄elf}$:

z, back = pullback(model, x)
back(ones(4)) # (nothing, [-1.0, 2.0, 11.0, 26.0])

If we inspect the primal(ir), we see that it’s because no pullbacks and hence no gradients are recorded against variable %1 (self):

1: (%1, %2)
  %3 = Main.:(var"#74#75")
  %4 = Core.typeof(%1)
  %5 = Core.apply_type(%3, %4)
  %6 = %new(%5, %1)
  %7 = Main.pullback(Main.map, %6, %2)
  %8 = Base.getindex(%7, 1)
  %9 = Base.getindex(%7, 2)
  %10 = Base.tuple(%9)
  %11 = (Pullback{Any})(%10)
  %12 = Base.tuple(%8, %11)
  return %12

The solution is to swap %new with a call to a custom function __new__ with a pullback. This function is as follows (source):

macro __splatnew__(T, args)
  esc(Expr(:splatnew, T, args))
end

@inline __new__(T, args...) = @__splatnew__(T, args)

And the pullback is (source):

using Base: RefValue
struct Jnew{T,G}
  g::G
end

Jnew{T}(g) where T = Jnew{T,typeof(g)}(g)

function pullback(::typeof(__new__), T, args...)
  x = __new__(T, args...)
  g = !ismutabletype(T) || fieldcount(T) == 0 ? nothing : grad_mut(x)
  x, Jnew{T,typeof(g)}(g)
end

@generated function (back::Jnew{T,G})(Δ::Union{NamedTuple,Nothing,RefValue}) where {T,G}
  !ismutabletype(T) && Δ == Nothing && return :nothing
  Δ = G == Nothing ? :Δ :
      Δ <: RefValue ? :(back.g[]) :
      :(accum(back.g[], Δ))
  quote
    x̄ = $Δ
    $(G == Nothing || :(back.g[] = nt_nothing($Δ)))
    (nothing, nothing, $(map(f -> :(x̄.$f), fieldnames(T))...))
  end
end

Now if we try the following (after redefining @generated function pullback and function (methodinstance::Pullback)) we should get the same results:

z, back = pullback(model, x)
back(ones(4)) # ((weights = [4.0, 10.0, 30.0, 100.0],), [-1.0, 2.0, 11.0, 26.0])

3 Gradient Descent revisited

3.1 Generic Gradient Descent

Now that we have an automatic differentiation engine, it is possible to create a much more generic gradient descent function than in part 1:

function gradient_descent!(
    model,
    loss,
    X::AbstractVecOrMat,
    Y::AbstractVecOrMat
    ; learning_rate::AbstractFloat=0.1,
    max_iters::Integer=100
    )
    losses = Float64[]
    for i in 1:max_iters
        loss_iter, back = pullback(model) do m
            result = m(X)
            loss(result, Y)
        end 
        Δf, Δm = back(1.0)
        update_params!(parameters(model), Δm; learning_rate=learning_rate)
        push!(losses, loss_iter)  
    end
    losses
end

Note that pullback(m->f(m), model) is directly equivalent to pullback(model) do m; f(m) end.
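As a quick illustration of the do-block syntax in plain Julia, here is a minimal sketch that does not depend on pullback. The helper apply_twice is hypothetical and exists only for this example. A do block simply passes an anonymous function as the first argument:

```julia
# Hypothetical helper: takes a function as its first argument, like pullback does.
apply_twice(f, x) = f(f(x))

# Explicit anonymous function as the first argument.
a = apply_twice(x -> 2x + 1, 3)

# Same call written with a do block; the block becomes the first argument.
b = apply_twice(3) do x
    2x + 1
end

a == b # true
```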

The update_params! function is defined as follows:

function update_params!(params::NamedTuple, grads::NamedTuple; options...)
    for key in keys(params)
        update_params!(params[key], grads[key]; options...)
    end
end

function update_params!(params::Tuple, grads::Tuple; options...)
    for (p, g) in zip(params, grads)
        update_params!(p, g; options...)
    end
end

function update_params!(params, grads; learning_rate::AbstractFloat=0.1)
    params .-= learning_rate .* grads # must broadcast to edit elements and not copies!
end
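The comment about broadcasting matters because a plain assignment only rebinds the local name. The following sketch shows the difference on a small array. The function names step_rebind and step_inplace! are hypothetical and are not part of MicroGrad.jl:

```julia
# Rebinding: `p` becomes a new local array, so the caller's array is unchanged.
function step_rebind(p, g; learning_rate=0.1)
    p = p - learning_rate * g
    return p
end

# Broadcast assignment: writes into the caller's array element by element.
function step_inplace!(p, g; learning_rate=0.1)
    p .-= learning_rate .* g
    return p
end

weights = [1.0, 2.0]
grads = [1.0, 2.0]

step_rebind(weights, grads)
after_rebind = copy(weights)   # still [1.0, 2.0]

step_inplace!(weights, grads)
weights                        # now approximately [0.9, 1.8]
```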

The parameters function is defined per model. (Flux uses the generic Functors.jl library to accomplish something similar.)

3.2 Polynomial curve fitting revisited

Let’s create the exact same data set from part 1:

using StatsBase
target_weights = [15.0, -2.1, 13.9, 1.5]
noise_factor = 0.2
xs = (rand(100) .- 0.5) .* 10
ys = map(x -> evalpoly(x, target_weights), xs)
scale_factor = mean(abs.(ys))
ys .+= randn(length(ys)) * scale_factor * noise_factor

The Polynomial model is defined in the introduction. We also need a custom method for parameters:

parameters(m::Polynomial) = (;weights=m.weights)

Define the model:

model = Polynomial(rand(4))

Some sanity checks:

x = [1.0, 2.0, 3.0]
z, back = pullback(model, x) # ([1.68, 7.21, 21.2], Pullback) 
back([1.0, 1.0, 1.0]) # ((weights = [3.0, 6.0, 14.0, 36.0],), [-1.0, 2.0, 11.0])
z, back = pullback(m->m(x), model) 
back([1.0, 1.0, 1.0]) # (nothing, (weights = [3.0, 6.0, 14.0, 36.0],))
y = [2.0, 4.0, 8.0]
z, back = pullback(m->mse(m(x), y), model) 
back(1.0) # (nothing, (weights = [10.7, 30.5, 87.6, 254.6],))

Train the model:

history = gradient_descent!(model, mse, xs, ys; learning_rate=1e-5, max_iters=2000)

This works just as well as before.

4 Conclusion

We now have a fully working AD package. It has some limitations: it cannot handle control flow or keyword arguments. However, it already works on a wide variety of code. All that might be needed is new rrule definitions. The next and final part of this series is a demonstration of exactly that.

MicroGrad.jl: Part 3 Automation with IRTools

2024-08-10T00:00:00+00:00

A series on automatic differentiation in Julia. Part 3 uses metaprogramming based on IRTools.jl to generate a modified (primal) forward pass and to reverse differentiate it into a backward pass. This is a more robust approach than the expression based approach in Part 2.

This is part of a series. The other articles are:

All source code can be found at MicroGrad.jl. The code here is based on the example at IRTools.jl.

1 Introduction

Part 1 introduced the rrule for implementing chain rules and Part 2 defined a @generated pullback function for inspecting and decomposing complex code. The goal here is to replicate the results of Part 2 except in a more robust manner using the IRTools.jl package.

Metaprogramming is a powerful tool, but it introduces complexity that can make code more difficult to understand. It can easily introduce critical bugs that can crash a program. Care should be taken when using it.

For example, from part 1 there are rrules for +, * and /. The goal is then to automatically differentiate the following:

\[f(a, b) = \frac{a}{a + b^2}\]

like so:

f(a, b) = a / (a + b*b)
z, back = pullback(f, 2.0, 3.0) # (0.1818, ∂(f))
back(1.0) # (nothing, 0.0744, -0.099)

where pullback is a @generated function that inspects the Intermediate Representation (IR) code for f:

using IRTools
ir = @code_ir f(2, 3)
#= 1: (%1, %2, %3)
  %4 = %3 * %3
  %5 = %2 + %4
  %6 = %2 / %5
  return %6
=#

This is an advanced use of the Julia programming language. You should be comfortable with the language before reading this post. At the very least, the Julia documentation page on metaprogramming is required for this post and will be considered assumed knowledge, especially the sections on “Expressions and evaluation”, “Code Generation” and “Generated Functions”. I also suggest going through the IRTools.jl documentation first.

This post can be read independently of Part 2 and will repeat parts of it. However it is advised to read Part 2 first because it is easier to understand than this post.

2 Differentiating Wengert Lists

The Zygote.jl automatic differentiation (AD) package is a realisation of the paper Don’t Unroll Adjoint: Differentiating SSA-Form Programs (2019) by Michael J Innes.
The paper works with Wengert lists, also known as tapes, and a generalisation of them called Static Single Assignment (SSA) form. The aim here is to develop a minimal AD package, so this series only focuses on the sections on Wengert lists. A consequence is that the code will not be able to handle any non-linear logic in Julia, for example any control flow like if, while or for blocks.

The paper uses the same example as the introduction:

\[f(a, b) = \frac{a}{a + b^2} \tag{2.1} \label{eq:f}\]

This can be broken down into smaller steps where each intermediate variable is saved. This is known as a Wengert list, or tape, or (backpropagation) graph:

\[\begin{align} y_1 &= b \times b \\ y_2 &= a + y_1 \\ y_3 &= a / y_2 \end{align} \tag{2.2} \label{eq:f_wengert}\]

To differentiate this, all function calls are wrapped with a differentiation function $\mathcal{J}$ which returns both the output $y$ and a pullback function $\mathcal{B}$. This is called the primal form:

\[\begin{align} y_1, \mathcal{B}_1 &\leftarrow \mathcal{J}(\times, b, b) \\ y_2, \mathcal{B}_2 &\leftarrow \mathcal{J}(+, a, y_1) \\ y_3, \mathcal{B}_3 &\leftarrow \mathcal{J}(/, a, y_2) \end{align} \tag{2.3} \label{eq:primal}\]

The pullback function $\mathcal{B}$ for a function $y(x)$ takes as input the gradient of a scalar $l$ (typically a loss function) with respect to the output $y$ and returns the gradient with respect to the input $x$. This gradient $\frac{\partial l}{\partial x}$ is written as $\bar{x}$.

\[\begin{align} \bar{x} &= \frac{\partial l}{\partial x} = \frac{\partial l}{\partial y} \frac{\partial y}{\partial x} \end{align} \tag{2.4} \label{eq:bar_x}\]

so the pullback can be written in mathematical notation as:

\[\begin{align} \bar{x} &\leftarrow \mathcal{B}(\bar{y}) = \bar{y} \frac{\partial y}{\partial x}\\ \text{or} \quad \bar{x} &\leftarrow \mathcal{B}(\bar{y}) = J^{\dagger}\bar{y} \end{align} \tag{2.5} \label{eq:pullback}\]

where $\bar{y}=\frac{\partial l}{\partial y}$ and $J=\frac{\partial y}{\partial x}$ is the Jacobian (gradient) for arrays.
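To make the array case concrete, here is a minimal sketch for a linear map $y = Ax$, whose Jacobian is $J = A$. The name linear_with_pullback is hypothetical and only serves this example:

```julia
# A linear map y = A*x has Jacobian J = A, so its pullback is ȳ -> A' * ȳ.
# `linear_with_pullback` is a hypothetical name used only in this sketch.
function linear_with_pullback(A, x)
    y = A * x
    back(ȳ) = A' * ȳ   # x̄ = J†ȳ
    return y, back
end

A = [1.0 2.0; 3.0 4.0; 5.0 6.0]   # 3×2, so x has 2 entries and y has 3
x = [1.0, -1.0]
y, back = linear_with_pullback(A, x)
x̄ = back(ones(3))                 # gradient of l = sum(y) with respect to x
```

Here $\bar{y}$ is a vector of ones because $l$ is the sum of the entries of $y$, so $\bar{x}$ is simply the vector of column sums of $A$.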

The various partial gradients are calculated by reversing the list. Each pullback function $\mathcal{B}_i$ takes as input the gradient $\bar{y}_i$ of its own output, which is produced by the pullbacks that come before it in the reversed list. The very first input is an existing gradient $\Delta$, which is usually set to 1:

\[\begin{align} \text{s̄elf}_3, \bar{a}_{3,1}, \bar{y}_2 &\leftarrow \mathcal{B}_3(\Delta) \\ \text{s̄elf}_2, \bar{a}_{2,1}, \bar{y}_1 &\leftarrow \mathcal{B}_2(\bar{y}_2) \\ \text{s̄elf}_1, \bar{b}_{1,1}, \bar{b}_{1,2} &\leftarrow \mathcal{B}_1(\bar{y}_1) \end{align} \tag{2.6} \label{eq:reverse}\]

The final step is to accumulate the gradients for variables which are used multiple times:

\[\begin{align} \bar{a} &\leftarrow \bar{a}_{3,1} + \bar{a}_{2,1} \\ \bar{b} &\leftarrow \bar{b}_{1,1} + \bar{b}_{1,2} \\ \end{align} \tag{2.7} \label{eq:accumulate}\]
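The primal, reverse and accumulate steps can be written out by hand for this example. The sketch below is not the generated code from later sections; it is a manual version with hypothetical helper names (mul_J, add_J, div_J), and it leaves out the s̄elf entries to keep it short:

```julia
# Hand-written "J" for the three primitives: each returns the output and a pullback.
# The pullbacks return one gradient per argument (no s̄elf entry, to keep it short).
mul_J(x, y) = (x * y, Δ -> (Δ * y, Δ * x))
add_J(x, y) = (x + y, Δ -> (Δ, Δ))
div_J(x, y) = (x / y, Δ -> (Δ / y, -Δ * x / y^2))

function f_with_pullback(a, b)
    # Primal (forward) pass: equation 2.3
    y1, B1 = mul_J(b, b)
    y2, B2 = add_J(a, y1)
    y3, B3 = div_J(a, y2)
    function back(Δ)
        # Reverse pass: equation 2.6
        da3, dy2 = B3(Δ)
        da2, dy1 = B2(dy2)
        db1, db2 = B1(dy1)
        # Accumulate: equation 2.7
        return (da3 + da2, db1 + db2)
    end
    return y3, back
end

z, back = f_with_pullback(2.0, 3.0)
ā, b̄ = back(1.0) # ā = 9/121 and b̄ = -12/121, up to floating point error
```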

This end result is equivalent to rolling everything up into one function using the multivariable chain rule:

\[\begin{align} \bar{a} &= \frac{\partial l}{\partial a} = \mathcal{B}_{3,a}(\Delta) + \mathcal{B}_{2,a}(\bar{y}_2) \\ &= \frac{\partial l}{\partial y_3} \frac{\partial y_3}{\partial a} + \frac{\partial l}{\partial y_2} \frac{\partial y_2}{\partial a} \\ &= \Delta \cdot \frac{\partial }{\partial a} \left( \frac{a}{y_2}\right) + \left(\frac{\partial l}{\partial y_3}\frac{\partial y_3}{\partial y_2} \right)\frac{\partial}{\partial a}(a + y_1) \\ &= \Delta \frac{1}{y_2} + \left(\Delta \frac{-a}{y_2^2} \right) (1+0) \\ &= \Delta \frac{b^2}{(a+b^2)^2} \\ \bar{b} &= \frac{\partial l}{\partial b} = 2 \mathcal{B}_{1,b}(\bar{y}_1) \\ &= 2\frac{\partial l}{\partial y_1} \frac{\partial y_1}{\partial b} \\ &= 2 \left(\frac{\partial l}{\partial y_3}\frac{\partial y_3}{\partial y_2}\frac{\partial y_2}{\partial y_1} \right) \frac{\partial y_1}{\partial b} \\ &= 2 \left(\Delta \cdot \frac{\partial}{\partial y_2}\left(\frac{a}{y_2}\right) \cdot \frac{\partial}{\partial y_1}(a + y_1) \right)\frac{\partial}{\partial b'}(b'\times b) \\ &= 2\left(\Delta \left(-\frac{a}{y_2^2}\right)(0+1)\right)b \\ &= -\frac{2ab\Delta}{(a+b^2)^2} \end{align} \tag{2.8} \label{eq:rollup}\]
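As an independent sanity check of the closed forms in the last lines of each derivation, they can be compared against central finite differences. This is only a numerical sketch; the names grad_a and grad_b and the step size h are choices made for this example:

```julia
f(a, b) = a / (a + b * b)

# Closed-form gradients from the final lines of the derivation, with Δ = 1.
grad_a(a, b) = b^2 / (a + b^2)^2
grad_b(a, b) = -2 * a * b / (a + b^2)^2

# Central finite differences as an independent check (h is an arbitrary small step).
h = 1e-6
a, b = 2.0, 3.0
fd_a = (f(a + h, b) - f(a - h, b)) / (2h)
fd_b = (f(a, b + h) - f(a, b - h)) / (2h)

isapprox(fd_a, grad_a(a, b); atol=1e-8) # true
isapprox(fd_b, grad_b(a, b); atol=1e-8) # true
```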

3 Pullback

3.1 Definition

The goal is to generate code which automatically implements the equations of section 2.

The pullback function that is implemented here is equivalent to the internal Zygote._pullback function, which returns all partial gradients including for $\frac{\partial l}{\partial \text{self}}$. Zygote.pullback is a thin wrapper around Zygote._pullback which discards that first gradient.

To start, define a pullback function (source):

function pullback end

This will be turned into a generated function.

Julia changed the behaviour of generated functions in version 1.10. Before 1.10, they always had access to the world age counter. This is a single number that is incremented every time a method is defined, and helps optimise compilations. However, from version 1.10 onwards, calling Base.get_world_counter() inside a generated function will only return typemax(UInt). This is to prevent reflection - code inspection - in generated functions.¹ However the code here relies on reflection. Thankfully, there is a hack that Zygote.jl uses to access the world age in pullback. Because of this, the definition of pullback is different based on the version, but both will forward to a common internal _generate_pullback function.
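The world age counter can be observed directly outside of a generated function. In this small sketch, world_age_demo is a hypothetical function that is defined only to trigger a new world age:

```julia
# The world age is a global counter that increases when methods are defined.
world_before = Base.get_world_counter()

# Hypothetical function, defined only to trigger a new world age.
@eval world_age_demo() = 1

world_after = Base.get_world_counter()
world_after > world_before # true
```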

Generated functions should only be defined after all other functions. That is, at the bottom of the file or after all functions have been defined in the REPL. Otherwise they will not be able to access those functions or only old versions of those functions. These functions are defined here at the top only for explanatory purposes.

# Julia versions before 1.10: a plain generated function can still use reflection.
@generated function pullback(f, args...)
    _generate_pullback(nothing, f, args...)
end

# Julia 1.10 and later: the generator receives the world age explicitly.
function _pullback_generator(world::UInt, source, self, f, args)
    ret = _generate_pullback(world, f, args...)
    ret isa Core.CodeInfo && return ret
    stub = Core.GeneratedFunctionStub(identity, Core.svec(:methodinstance, :f, :args), Core.svec())
    stub(world, source, ret)
end

@eval function pullback(f, args...)
    $(Expr(:meta, :generated, _pullback_generator))
    $(Expr(:meta, :generated_only))
end

3.2 ChainRules

The first goal of _generate_pullback will be to forward the function and its arguments to a matching rrule if it exists. For now it will throw an error if it cannot find one.

function _generate_pullback(world, f, args...)
    T = Tuple{f, args...}
    if (has_chain_rrule(T, world))
        return :(rrule(f, args...))
    end
    :(error("No rrule found for ", repr($T)))
end

In part 1 the most generic method of rrule was defined for an Any first argument, so if the compiler dispatches to this method it means no specific rrule was found.²

using IRTools: meta
function has_chain_rrule(T, world)
    Tr = Tuple{typeof(rrule), T.parameters...}
    meta_T = meta(Tr; world=world)
    if isnothing(meta_T)
        return false
    end
    method_ = meta_T.method
    sig = method_.sig
    !(sig isa DataType) || (sig.parameters[2] !== Any)
end
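The same check can be tried without IRTools by using which from Base, which returns the method that dispatch would select. The sketch below uses hypothetical stand-ins (demo_rrule, has_specific_rule and g) rather than the real rrule, so it only illustrates the idea of inspecting the method signature:

```julia
# Hypothetical stand-ins for rrule: a generic fallback and one specific rule.
demo_rrule(f::Any, args...) = nothing
demo_rrule(::typeof(+), x::Number, y::Number) = (x + y, Δ -> (nothing, Δ, Δ))

# `which` returns the method that dispatch would select for these argument types.
function has_specific_rule(f_type, arg_types...)
    method = which(demo_rrule, Tuple{f_type, arg_types...})
    sig = method.sig
    # The fallback is the only method whose first argument is typed `Any`.
    !(sig isa DataType) || sig.parameters[2] !== Any
end

g(a, b) = a / (a + b * b)
has_specific_rule(typeof(+), Float64, Float64) # true
has_specific_rule(typeof(g), Float64, Float64) # false
```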

Let’s test all this code from bottom to top for a function with an rrule and one without: + and f(a,b)=a/(a+b*b). As a reminder, generated functions only have access to the types of their arguments, so to test _generate_pullback and all the functions under it, we can only work with types.
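The following minimal sketch shows what "only the types" means in practice. The function describe_argument is hypothetical and unrelated to pullback:

```julia
@generated function describe_argument(x)
    # Inside the generator, `x` is bound to the argument's type, not its value.
    description = "called with a $(x)"
    return :($description)
end

describe_argument(1.0)     # "called with a Float64"
describe_argument("hello") # "called with a String"
```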

Firstly, for + acting on floats (redefine @generated pullback if necessary):

world = Base.get_world_counter()
T = Tuple{typeof(+), Float64, Float64}
has_chain_rrule(T, world) # true
_generate_pullback(world, typeof(+), Float64, Float64) # :(rrule(f, args...))
pullback(+, 1.0, 2.0) # (3.0, var"#add_back#5"())

Now for f, also acting on floats:

world = Base.get_world_counter()
T = Tuple{typeof(f), Float64, Float64}
has_chain_rrule(T, world) # false
_generate_pullback(world, typeof(f), Float64, Float64) # :(error(...))
pullback(f, 1.0, 2.0) # ERROR: No rrule found for ...

The more interesting task is to inspect f and apply the equations of section 2 to fully differentiate with respect to all input parameters.

3.3 IR

Source: Julia Docs eval

The first step is to create a Wengert list for f in Intermediate Representation (IR) form. Julia already does this as part of the compilation process. IRTools.jl mimics this internal IR form with its own custom IR struct. It can be generated as follows:

using IRTools: IR, meta
T = Tuple{typeof(f), Float64, Float64}
m = meta(T; world=Base.get_world_counter())
ir = IR(m)
#=
1: (%1, %2, %3)
  %4 = %3 * %3
  %5 = %2 + %4
  %6 = %2 / %5
  return %6
=#

The returned object corresponds exactly to equation $\ref{eq:f_wengert}$.

Using this knowledge, we can now create a new function _generate_pullback_via_decomposition which will be called if no rrule exists. It uses the IR to create the primal (equation $\ref{eq:primal}$) (source).

using IRTools: meta, IR, blocks
function _generate_pullback_via_decomposition(T, world)
    m = meta(T; world=world)
    isnothing(m) && return nothing
    ir = IR(m)
    length(blocks(ir)) == 1 || error("control flow is not supported")
    pr, calls = primal(ir, T)
    m, pr, calls
end

3.4 Primal

The goal here is to create an IR for equation $\ref{eq:primal}$. This is what it will look like:

1: (%1, %2, %3)
  %4 = Main.pullback(Main.:*, %3, %3)
  %5 = Base.getindex(%4, 1)
  %6 = Base.getindex(%4, 2)
  %7 = Main.pullback(Main.:+, %2, %5)
  %8 = Base.getindex(%7, 1)
  %9 = Base.getindex(%7, 2)
  %10 = Main.pullback(Main.:/, %2, %8)
  %11 = Base.getindex(%10, 1)
  %12 = Base.getindex(%10, 2)
  %13 = Base.tuple(%6, %9, %12)
  %14 = (Pullback{Tuple{typeof(f), Float64, Float64}})(%13)
  %15 = Base.tuple(%11, %14)
  return %15

Although harder to read, this code represents the same code as the expressions in part 2.

The primal function first wraps the existing IR with Pipe to make inserts more efficient. It defines two arrays to store information (source):

using IRTools: block, isexpr, finish, Pipe, Variable, return!, returnvalue, stmt, xcall
function primal(ir::IR, T=Any)
    pr = IRTools.Pipe(ir)
    calls = []
    pullbacks = []

The calls array stores the subset of variables that require a pullback. Because the IR is a dictionary - ir[Variable(i)] returns statement i - this creates a direct link to the statement called. These will be used to generate the reverse code (equation $\ref{eq:reverse}$) in the next section.

Next, iterate over each statement in the IR. If a statement is a :call expression and is not on a special ignored list, replace it with three statements: first a call to pullback, then two calls to getindex to extract the output variable v and the back function J from the returned tuple t:

    for (v, st) in pr
        ex = st.expr
        if isexpr(ex, :call) && !ignored(ex)
            t = insert!(pr, v, stmt(xcall(Main, :pullback, ex.args...), line = st.line))
            pr[v] = xcall(Base, :getindex, t, 1)
            J = push!(pr, xcall(:getindex, t, 2))
            push!(calls, v)
            push!(pullbacks, J)
        end
    end

After working through all the statements, a final statement is added which returns a tuple with the output of the function and a Pullback struct which stores all the pullbacks. In the last step the pipe is converted back into an IR.

    pb = Expr(:call, Pullback{T}, xcall(:tuple, pullbacks...))
    return!(pr, xcall(:tuple, returnvalue(block(ir, 1)), pb))
    finish(pr), calls
end

This code requires a definition for the Pullback struct as well as the ignored function.

There are no closures in lowered Julia code, so instead Zygote.jl stores the pullbacks in a generic struct:

struct Pullback{S,T}
    data::T
end
Pullback{S}(data) where S = Pullback{S,typeof(data)}(data)

In the next section this struct will be turned into a callable struct. That is, for back=Pullback{S}(data), we will create a generated function that dispatches on itself: (j::Pullback)(Δ) so that we can call back(Δ). This back has all the information to generate the reverse pass independently of the forward pass: the method can be retrieved using meta(S) and the relevant data and input parameters from back.data.
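For readers unfamiliar with callable structs, here is a minimal sketch in plain Julia. The Scaler struct is hypothetical; it only shows how defining a method on the struct type itself makes every instance callable:

```julia
# Hypothetical struct, used only to illustrate a callable object.
struct Scaler{T}
    factor::T
end

# Dispatch on the struct itself: this makes every Scaler instance callable.
(s::Scaler)(Δ) = s.factor * Δ

scale = Scaler(3.0)
scale(2.0) # 6.0
```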

Here is the ignored functions list (source):

function ignored(ex::Expr)
    f = ex.args[1]
    ignored_f(f)
end

ignored_f(f) = f in (
    GlobalRef(Base, :not_int),
    GlobalRef(Core.Intrinsics, :not_int),
    GlobalRef(Core, :(===)),
    GlobalRef(Core, :apply_type),
    GlobalRef(Core, :typeof),
    GlobalRef(Core, :throw),
    GlobalRef(Base, :kwerr),
    GlobalRef(Core, :kwfunc),
    GlobalRef(Core, :isdefined)
)

Running this code:

world = Base.get_world_counter()
T = Tuple{typeof(f), Float64, Float64}
m, pr, calls = _generate_pullback_via_decomposition(T, world)

gives, in pr, the IR shown at the start of this section.

3.5 Convert

To evaluate the IR it needs to be converted into a CodeInfo struct. Zygote.jl uses IRTools.Inner.update! to modify the existing struct in meta_T.code. To me, it makes more sense to construct a new code info block directly from the IR using a slightly modified version of IRTools.Inner.build_codeinfo:

using IRTools: arguments
using IRTools.Inner: dummy_m, update!
function build_codeinfo_(ir::IR)
    ir = copy(ir)
    ci = Base.uncompressed_ir(dummy_m)
    ci.inlineable = true
    for arg in arguments(ir)
        @static if VERSION >= v"1.10.0-DEV.870"
            isnothing(ci.slottypes) && (ci.slottypes = Any[])
            push!(ci.slottypes, Type)
        end
        push!(ci.slotnames, Symbol(""))
        push!(ci.slotflags, 0)
    end
    #argument!(ir, at = 1) # argument for #self# might already exist
    update!(ci, ir)
end

This can now be used in _generate_pullback:

using IRTools: argument!, varargs!, pis!, slots!
function _generate_pullback(world, f, args...)
    T = Tuple{f, args...}
    if (has_chain_rrule(T, world))
        return :(rrule(f, args...))
    end    
    g = _generate_pullback_via_decomposition(T, world)
    if isnothing(g)
        return :(error("No method found for ", repr($T), " in world ", $world))
    end
    m, pr, backs = g
    pr = varargs!(m, pr, 1) # add getfield for each index in args, offset by 1 for f
    pr = slots!(pis!(pr))
    argument!(pr, at = 1) # add #self#
    ci = build_codeinfo_(pr)
    ci.slotnames = [Symbol("#self#"), :f, :args]
    ci
end

Testing (you should redefine the @generated pullback function first):

world = Base.get_world_counter()
pr = _generate_pullback(world, typeof(f), Float64, Float64) # CodeInfo(...)
z, back = pullback(f, 1.0, 2.0) # (0.2,Pullback{...})

3.6 Reverse

The goal is now to turn Pullback into a callable struct so that we can call back(1.0) to evaluate equations $\ref{eq:reverse}$ and $\ref{eq:accumulate}$. With typeof(back) and back.data we have all the information to do this independently of the forward pass. The result will be the IR below.

There are unused variables here which could be removed, e.g. %8 (s̄elf). The code here does not make such optimisations, to keep things simple.

1: (%1, %2)
  %3 = Base.getfield(%1, :data)
  %4 = Base.getindex(%3, 1)
  %5 = Base.getindex(%3, 2)
  %6 = Base.getindex(%3, 3)
  %7 = (%6)(%2)
  %8 = Base.getindex(%7, 1)
  %9 = Base.getindex(%7, 2)
  %10 = Base.getindex(%7, 3)
  %11 = (%5)(%10)
  %12 = Base.getindex(%11, 1)
  %13 = Base.getindex(%11, 2)
  %14 = Base.getindex(%11, 3)
  %15 = (%4)(%14)
  %16 = Base.getindex(%15, 1)
  %17 = Base.getindex(%15, 2)
  %18 = Base.getindex(%15, 3)
  %19 = Main.accum(%9, %13)
  %20 = Main.accum(%17, %18)
  %21 = Base.tuple(nothing, %19, %20)
  return %21

Although harder to read, this code represents the same code as the expressions in part 2.

As with the forward pass, an internal function _generate_callable_pullback will do most of the work:

using IRTools: blocks, meta, slots!, inlineable!
function _generate_callable_pullback(j::Type{<:Pullback{S, T}}, world, Δ) where {S, T}
    m = meta(S; world=world)
    isnothing(m) && return :(error("Non-differentiable function ", $(repr(S)))) # no method found
    ir = IR(m)
    length(blocks(ir)) == 1 || error("control flow is not supported")
    back = reverse_differentiate(ir)
    back = slots!(inlineable!(back))
    ci = build_codeinfo_(back)
    ci.slotnames = [Symbol("#self#"), :Δ]
    ci
end

The reverse_differentiate function is a simplified version of Zygote.adjoint and Zygote.reverse_stacks!.

To start, a dictionary is created to store the gradients. It maps variable names (symbols) to an array of gradients. It is not accessed directly (e.g. grads[x]) but rather through the closure functions grad and grad! which automatically handle the arrays. The first gradient stored is %2=Δ associated with the final return value of the forward pass. (xaccum will be defined shortly.)

using IRTools: argument!, arguments, isexpr, returnvalue, xcall, return!
function reverse_differentiate(forw::IR)
    grads = Dict()
    grad!(x, x̄) = push!(get!(grads, x, []), x̄)
    grad(x) = xaccum(get(grads, x, [])...)
    ir = empty(forw)
    self = argument!(ir, at = 1, insert=false)
    grad!(returnvalue(block(forw, 1)), IRTools.argument!(ir))

The first statement retrieves the data field in the struct.

    data = push!(ir, xcall(:getfield, self, QuoteNode(:data)))

Next the code retrieves all the calls with pullbacks from the primal and loops over them, calling the pullbacks one by one. For each call it also loops over the input arguments and unpacks them one by one. Each variable’s gradient is added to grads and may be used later in the loop.

    pr, calls = primal(forw)
    pullbacks = Dict(calls[i] => push!(ir, xcall(:getindex, data, i)) for i = 1:length(calls))
    for v in reverse(keys(forw))
        ex = forw[v].expr
        if isexpr(ex, :call) && !ignored(ex)
            Δs = push!(ir, Expr(:call, pullbacks[v], grad(v)))
            for (i, x) in enumerate(ex.args)
                grad!(x, push!(ir, xcall(:getindex, Δs, i)))
            end
        end
    end

Finally, the last call retrieves all the necessary gradients for the input arguments and returns the IR:

    return!(ir, xcall(:tuple, [grad(x) for x in arguments(forw)]...))
end

This code calls an xaccum function. It is as follows:

xaccum() = nothing
xaccum(x) = x
xaccum(xs...) = xcall(Main, :accum, xs...)

The xaccum function calls an internal accum function if it acts on multiple inputs. At its simplest, accum is the same as sum. However it also handles nothing inputs, Tuples and NamedTuples (source).

accum(x, y) = x === nothing ? y : y === nothing ? x : x + y
accum(x::Tuple, ys::Tuple...) = map(accum, x, ys...)
accum(x, y, zs...) = accum(accum(x, y), zs...)
@generated function accum(x::NamedTuple, y::NamedTuple)
    # assumes that y has no keys apart from those also in x
    fieldnames(y) ⊆ fieldnames(x) || throw(ArgumentError("$y keys must be a subset of $x keys"))
    grad(field) = field in fieldnames(y) ? :(y.$field) : :nothing
    Expr(:tuple, [:($f=accum(x.$f, $(grad(f)))) for f in fieldnames(x)]...)
end

Examples:

accum(1, 2, nothing, 3) # 6
accum((1, 2), (3, 4)) # (4, 6)
accum((;a=3, b=2), (;a=1)) # (a = 4, b = 2)

Finally, dispatch on the Pullback struct to turn it into a callable struct:

# Julia versions before 1.10: a plain generated function can still use reflection.
@generated function (methodinstance::Pullback)(Δ)
    _generate_callable_pullback(methodinstance, nothing, Δ)
end

# Julia 1.10 and later: the generator receives the world age explicitly.
function _callable_pullback_generator(world::UInt, source, self, Δ)
    ret = _generate_callable_pullback(self, world, Δ)
    ret isa Core.CodeInfo && return ret
    stub = Core.GeneratedFunctionStub(identity, Core.svec(:methodinstance, :Δ), Core.svec()) # names must match symbols in _generate_callable_pullback
    stub(world, source, ret)
end

@eval function (j::Pullback)(Δ)
    $(Expr(:meta, :generated, _callable_pullback_generator))
    $(Expr(:meta, :generated_only))
end

Testing:

f(a,b)=a/(a+b*b)
z, back = pullback(f, 2.0, 3.0) # (0.1818, Pullback{...})
_generate_callable_pullback(typeof(back), nothing, Float64) # CodeInfo for IR at start
back(1.0) # (nothing, 0.0744, -0.0991)

The results should match equation $\ref{eq:rollup}$:

a, b = 2.0, 3.0
ā = abs2(b)/abs2(a+abs2(b)) # 0.0744
b̄ = -2*a*b/abs2(a+abs2(b))  # -0.0991

4 Conclusion

This code works well enough for this simple case. It also works for the trigonometry example from part 1:

f(x) = sin(cos(x))
z, back = pullback(f, 0.9) # (0.5823, Pullback{...})
back(1.0) # (nothing, -0.6368)

However it will fail for the polynomial model:

struct Polynomial{V<:AbstractVector}
    weights::V
end
(m::Polynomial)(x) = evalpoly(x, m.weights)
(m::Polynomial)(x::AbstractVector) = map(m, x)
model = Polynomial([3.0, 2.0, -3.0, 1.0])
x = [1.0, 2.0, 3.0, 4.0]
pullback(model, x) # ERROR: No method found for Tuple{typeof(fieldtype) ....}

The error is raised five levels down:

pr1 = _generate_pullback(world, Polynomial, Vector{Float64})
pr2 = _generate_pullback(world, typeof(map), Polynomial, Vector{Float64})
pr3 = _generate_pullback(world, typeof(Base.Generator), Polynomial, Vector{Float64})
TT = Type{Base.Generator{Vector{Float64}, Polynomial{Vector{Float64}}}} # %9
pr4 = _generate_pullback(world, TT, Polynomial, Vector{Float64})
pr5 = _generate_pullback(world, typeof(Core.fieldtype), TT, 1) # error

This can be fixed by explicitly defining a pullback for map. This and other extensions will be the goal of part 4.

¹ Presumably the reason the Julia team tried to prevent reflection in generated functions is that it interferes with the compiler’s ability to properly predict, trigger and/or optimise compilations.
² Zygote.jl has more complex rules which also consider other fallbacks, keyword arguments and a possible opt out through a no_rrule.

MicroGrad.jl: Part 2 Automation with expressions

2024-08-03T00:00:00+00:00

A series on automatic differentiation in Julia. Part 2 uses metaprogramming to generate a modified (primal) forward pass and to reverse differentiate it into a backward pass. This post uses an expression based approach which can be brittle. Part 3 develops a more robust approach for the same code using IRTools.jl.

This is part of a series. The other articles are:

All source code can be found at MicroGrad.jl. The code here is inspired by the example at IRTools.jl.

1 Introduction

Part 1 introduced the rrule for implementing chain rules. The challenge now is to automate it. This will be done through metaprogramming and generated functions.

For example, from part 1 there are rrules for +, * and /. The goal is then to automatically differentiate the following:

\[f(a, b) = \frac{a}{a + b^2}\]

like so:

f(a, b) = a / (a + b*b)
z, back = pullback(f, 2.0, 3.0) # (0.1818, ∂(f))
back(1.0) # (nothing, 0.0744, -0.099)

where pullback is a @generated function that inspects the lowered code for f:

ci = @code_lowered f(2, 3)
#= CodeInfo(
1 ─ %1 = b * b
│   %2 = a + %1
│   %3 = a / %2
└──      return %3
)
=#
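The lowered code can also be inspected programmatically with code_lowered from Base. The sketch below only counts the call statements; the exact statements can differ between Julia versions, so treat the layout of ci.code as an assumption rather than a guarantee:

```julia
f(a, b) = a / (a + b * b)

# code_lowered returns a vector of CodeInfo objects, one per matching method.
ci = first(code_lowered(f, (Int, Int)))

# Each entry of ci.code is one statement of the lowered function.
calls = filter(st -> Meta.isexpr(st, :call), ci.code)
length(calls) # number of call statements; one is expected per operation in f
```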

2 Differentiating Wengert Lists

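The Zygote.jl automatic differentiation (AD) package is a realisation of the paper Don’t Unroll Adjoint: Differentiating SSA-Form Programs (2019) by Michael J Innes. The paper uses the same example as the introduction:

\[f(a, b) = \frac{a}{a + b^2} \tag{2.1} \label{eq:f}\]

This can be broken down into smaller steps where each intermediate variable is saved. This is known as a Wengert list, or tape, or (backpropagation) graph:

\[\begin{align} y_1 &= b \times b \\ y_2 &= a + y_1 \\ y_3 &= a / y_2 \end{align} \tag{2.2} \label{eq:f_wengert}\]

To differentiate this, all function calls are wrapped with a differentiation function $\mathcal{J}$ which returns both the output $y$ and a pullback function $\mathcal{B}$. This is called the primal form:

\[\begin{align} y_1, \mathcal{B}_1 &\leftarrow \mathcal{J}(\times, b, b) \\ y_2, \mathcal{B}_2 &\leftarrow \mathcal{J}(+, a, y_1) \\ y_3, \mathcal{B}_3 &\leftarrow \mathcal{J}(/, a, y_2) \end{align} \tag{2.3} \label{eq:primal}\]

The pullback function $\mathcal{B}$ for a function $y(x)$ takes as input the gradient of a scalar $l$ (typically a loss function) with respect to the output $y$ and returns the gradient with respect to the input $x$. This gradient $\frac{\partial l}{\partial x}$ is written as $\bar{x}$.

\[\begin{align} \bar{x} &= \frac{\partial l}{\partial x} = \frac{\partial l}{\partial y} \frac{\partial y}{\partial x} \end{align} \tag{2.4} \label{eq:bar_x}\]

so the pullback can be written in mathematical notation as:

\[\begin{align} \bar{x} &\leftarrow \mathcal{B}(\bar{y}) = \bar{y} \frac{\partial y}{\partial x}\\ \text{or} \quad \bar{x} &\leftarrow \mathcal{B}(\bar{y}) = J^{\dagger}\bar{y} \end{align} \tag{2.5} \label{eq:pullback}\]

where $\bar{y}=\frac{\partial l}{\partial y}$ and $J=\frac{\partial y}{\partial x}$ is the Jacobian (gradient) for arrays.

The various partial gradients are calculated by reversing the list. Each pullback function $\mathcal{B}_i$ takes as input the gradient $\bar{y}_i$ of its own output, which is produced by the pullbacks that come before it in the reversed list. The very first input is an existing gradient $\Delta$, which is usually set to 1:

\[\begin{align} \text{s̄elf}_3, \bar{a}_{3,1}, \bar{y}_2 &\leftarrow \mathcal{B}_3(\Delta) \\ \text{s̄elf}_2, \bar{a}_{2,1}, \bar{y}_1 &\leftarrow \mathcal{B}_2(\bar{y}_2) \\ \text{s̄elf}_1, \bar{b}_{1,1}, \bar{b}_{1,2} &\leftarrow \mathcal{B}_1(\bar{y}_1) \end{align} \tag{2.6} \label{eq:reverse}\]

The final step is to accumulate the gradients for variables which are used multiple times:

\[\begin{align} \bar{a} &\leftarrow \bar{a}_{3,1} + \bar{a}_{2,1} \\ \bar{b} &\leftarrow \bar{b}_{1,1} + \bar{b}_{1,2} \\ \end{align} \tag{2.7} \label{eq:accumulate}\]

This end result is equivalent to rolling everything up into one function using the multivariable chain rule:

\[\begin{align} \bar{a} &= \frac{\partial l}{\partial a} = \mathcal{B}_{3,a}(\Delta) + \mathcal{B}_{2,a}(\bar{y}_2) \\ &= \frac{\partial l}{\partial y_3} \frac{\partial y_3}{\partial a} + \frac{\partial l}{\partial y_2} \frac{\partial y_2}{\partial a} \\ &= \Delta \cdot \frac{\partial }{\partial a} \left( \frac{a}{y_2}\right) + \left(\frac{\partial l}{\partial y_3}\frac{\partial y_3}{\partial y_2} \right)\frac{\partial}{\partial a}(a + y_1) \\ &= \Delta \frac{1}{y_2} + \left(\Delta \frac{-a}{y_2^2} \right) (1+0) \\ &= \Delta \frac{b^2}{(a+b^2)^2} \\ \bar{b} &= \frac{\partial l}{\partial b} = 2 \mathcal{B}_{1,b}(\bar{y}_1) \\ &= 2\frac{\partial l}{\partial y_1} \frac{\partial y_1}{\partial b} \\ &= 2 \left(\frac{\partial l}{\partial y_3}\frac{\partial y_3}{\partial y_2}\frac{\partial y_2}{\partial y_1} \right) \frac{\partial y_1}{\partial b} \\ &= 2 \left(\Delta \cdot \frac{\partial}{\partial y_2}\left(\frac{a}{y_2}\right) \cdot \frac{\partial}{\partial y_1}(a + y_1) \right)\frac{\partial}{\partial b'}(b'\times b) \\ &= 2\left(\Delta \left(-\frac{a}{y_2^2}\right)(0+1)\right)b \\ &= -\frac{2ab\Delta}{(a+b^2)^2} \end{align} \tag{2.8} \label{eq:rollup}\]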

3 Pullback

3.1 Definition

The goal is to generate code which automatically implements the equations of section 2.

To start, define a pullback function (source):

function pullback end

This will be turned into a generated function.

@generated function pullback(f, args...)
        _generate_pullback(nothing, f, args...)
end

function _pullback_generator(world::UInt, source, self, f, args)
        ret = _generate_pullback(world, f, args...)
        ret isa Core.CodeInfo && return ret
        stub = Core.GeneratedFunctionStub(identity, Core.svec(:methodinstance, :f, :args), Core.svec())
        stub(world, source, ret)
end

@eval function pullback(f, args...)
        $(Expr(:meta, :generated, _pullback_generator))
        $(Expr(:meta, :generated_only))
end
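The key property of a generated function is that its generator runs on the types of the arguments, not their values. A minimal sketch of this behaviour follows; `describe` is a hypothetical example and is unrelated to the pullback defined above.

```julia
# `describe` is a hypothetical example, unrelated to MicroGrad.jl's pullback.
@generated function describe(f, args...)
    # Here `f` is a type such as typeof(+) and `args` is a tuple of types.
    msg = "f is $(f), args are $(args)"
    return :($msg) # the generated body simply returns this string
end

msg = describe(+, 1.0, 2.0)
```

This is why the generator can build a signature such as Tuple{f, args...} directly from its arguments.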

3.2 ChainRules

The first goal of _generate_pullback will be to forward the function and its arguments to a matching rrule if it exists. For now it will throw an error if it cannot find one.

function _generate_pullback(world, f, args...)
    T = Tuple{f, args...}
    if (has_chain_rrule(T, world))
        return :(rrule(f, args...))
    end
    :(error("No rrule found for ", repr($T)))
end

In part 1 the most generic method of rrule was defined for an Any first argument, so if the compiler dispatches to this method it means no specific rrule was found.²

function has_chain_rrule(T, world)
    Tr = Tuple{typeof(rrule), T.parameters...}
    meta_T = meta(Tr; world=world)
    if isnothing(meta_T)
        return false
    end
    type_signature, sps, method_ = meta_T
    method_.sig.parameters[2] !== Any
end
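The idea behind this check can be tried at the REPL with the public `which` function instead of internal reflection. The sketch below uses hypothetical names (`toy_rrule`, `is_specific`, `g`) so that it does not clash with the definitions in this post.

```julia
# Hypothetical names, mirroring the fallback and one specific rule from part 1.
toy_rrule(::Any, ::Vararg{Any}) = nothing
toy_rrule(::typeof(+), x::Real, y::Real) = (x + y, Δ -> (nothing, Δ, Δ))

g(a, b) = a / (a + b * b) # no specific rule is defined for g

# `which` returns the method that dispatch selects for the given argument types.
m_plus = which(toy_rrule, Tuple{typeof(+), Float64, Float64})
m_g = which(toy_rrule, Tuple{typeof(g), Float64, Float64})

# The fallback is identified by `Any` in the function position of its signature.
is_specific(m::Method) = m.sig.parameters[2] !== Any

is_specific(m_plus) # true
is_specific(m_g)    # false
```

Unlike this sketch, a generated function needs the world age to be passed along explicitly, which is what the meta function handles.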

The meta function uses the internal reflection function Base._methods_by_ftype to get all the methods for a specific type. (This same function is used by methods.) The most specific method is assumed to be the last one (source):

function meta(T; world=Base.get_world_counter())
    if isnothing(world)
        world = Base.get_world_counter() # in generated function post v1.10 this will return typemax(UInt)
    end
    min_world = Ref{UInt}(typemin(UInt))
    max_world = Ref{UInt}(typemax(UInt))
    has_ambig = Ptr{Int32}(C_NULL)  # don't care about ambiguous results
    _methods = Base._methods_by_ftype(T, #=mt=# nothing, #=lim=# -1,
        world, #=ambig=# false,
        min_world, max_world, has_ambig)
    _methods === nothing && return nothing
    _methods isa Bool && return nothing
    length(_methods) == 0 && return nothing
    last(_methods)
end

Firstly, for + acting on floats:

world = Base.get_world_counter()
T = Tuple{typeof(+), Float64, Float64}
Tr = Tuple{typeof(rrule), T.parameters...}
meta(Tr; world=world) # Core.MethodMatch(...), svec(), rrule(::typeof(+), x::Number, y::Number)
has_chain_rrule(T, world) # true
_generate_pullback(world, typeof(+), Float64, Float64) # :(rrule(f, args...))
pullback(+, 1.0, 2.0) # (3.0, var"#add_back#5"())

Now for f, also acting on floats:

world = Base.get_world_counter()
T = Tuple{typeof(f), Float64, Float64}
Tr = Tuple{typeof(rrule), T.parameters...}
meta(Tr; world=world) # Core.MethodMatch(...), svec(), rrule(::Any, ...)
has_chain_rrule(T, world) # false
_generate_pullback(world, typeof(f), Float64, Float64) # :(error(...))
pullback(f, 1.0, 2.0) # ERROR: No rrule found ...

The more interesting task is to inspect f and apply the equations of section 2 to fully differentiate with respect to all input parameters.

3.3 AST

Source: Julia Docs eval

The first step is to create a Wengert list for f. This is trivial because Julia already does this as part of the compilation process. As the first step of lowering code, the compiler will create an Abstract Syntax Tree (AST) which in the absence of control flow is the same as a Wengert list.

Julia exposes the @code_lowered macro to easily access the Intermediate Representation (IR), which is in Static Single Assignment (SSA) form. This is one step lower than the AST, although in many cases the two are the same. Part 3 works with this form instead of the AST.

This AST can be retrieved by calling Base.uncompressed_ast on the method we have found above:

T = Tuple{typeof(f), Float64, Float64}
type_signature, sps, method_ = meta(T)
ci = Base.uncompressed_ast(method_)
#=
CodeInfo(
    @ REPL[1]:1 within `f`
1 ─ %1 = b * b
│   %2 = a + %1
│   %3 = a / %2
└──      return %3
)
=#

The returned object is a CodeInfo struct and it corresponds exactly to $\ref{eq:f_wengert}$.
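To see what the next sections have to work with, the statements of a CodeInfo can be listed along with their types. This sketch uses the exported `code_lowered` function rather than the meta function from this post, purely so that it is self-contained.

```julia
f(a, b) = a / (a + b * b)

# `code_lowered` returns one CodeInfo per matching method; there is only one here.
ci = first(code_lowered(f, (Float64, Float64)))

# Each entry of `ci.code` is one lowered statement.
for (i, ex) in enumerate(ci.code)
    println("%", i, " :: ", typeof(ex), "  ", ex)
end
```

The last statement should be a Core.ReturnNode. The exact form of the other statements can vary between Julia versions.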

Using this knowledge, we can now create a new function _generate_pullback_via_decomposition which will be called if no rrule exists. It uses the CodeInfo block to create the primal (equation $\ref{eq:primal}$) (source).

function _generate_pullback_via_decomposition(T, world)
    m = meta(T; world=world)
    isnothing(m) && return :(error("No method found for ", repr($T), " in world ", $world))
    type_signature, sps, method_ = m
    ci = Base.uncompressed_ast(method_)
    pr, calls = primal(ci, T)
end

3.4 Primal

The goal here is to create an expression for equation $\ref{eq:primal}$. This is what it will look like:

quote
    (y1, back1) = pullback(Main.:*, _3, _3)
    (y2, back2) = pullback(Main.:+, _2, %1)
    (y3, back3) = pullback(Main.:/, _2, %2)
    Base.tuple(%3, (Pullback{Tuple{typeof(f), Float64, Float64}})(Base.tuple(back1, back2, back3)))
end

Note that this expression cannot be executed because it still has slot numbers which correspond to input arguments (_X), and SSA values which correspond to intermediate values (e.g. %X). This will be fixed in the Sanitise section.

The first step for the primal function is to define three arrays to store information (source):

function primal(ci::Core.CodeInfo, T=Any)
    tape = []
    calls = []
    pullbacks = []

The tape array stores the new expressions which will be part of the final expression. The calls array stores the subset of expressions that require a pullback. This will be used to generate the reverse code (equation $\ref{eq:reverse}$) in the next section. Lastly, pullbacks stores all the pullbacks.

Next, iterate over each line in the CodeInfo instance. Each output variable will be called y$i. Then the line’s expression type is inspected. This minimal code cannot handle control flow or the creation of new objects, so errors will be explicitly thrown if those cases are encountered. (Please refer to the Lowered form section in the Julia documentation.)

    for (i, ex) in enumerate(ci.code)
      vy = Symbol("y$i")
      if ex isa Core.ReturnNode
          break
      elseif (typeof(ex) in [Core.GotoNode, Core.GotoIfNot, Core.SlotNumber])
          error("$(typeof(ex)) is not supported")

If the expression is of type Expr and it makes a call, and it is not in a specialised ignore list (to be defined shortly), then the new expression can be created and the three arrays updated. Otherwise, leave as is.

There are possible silent errors, including logic errors, with the else statement here. For example, it will not properly handle any :new expression statements. This is one of the inherent complexities with this metaprogramming/multiple dispatch approach.

      elseif (ex isa Expr) && (ex.head == :call)  && !ignored(ex)
              vb = Symbol("back$i")
              new_ex = :(($vy, $vb) = pullback($(ex.args...)))
              push!(tape, new_ex)
              push!(calls, (;SSA_value=vy, expr=ex))
              push!(pullbacks, vb)
      else # keep as is
              push!(tape, :($vy = $ex))
      end
    end
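The expression built in this branch can be examined in isolation. The sketch below uses a hand-made call expression with plain symbols as a stand-in for one lowered statement; a real statement would contain GlobalRef, slot and SSA values instead.

```julia
# Hypothetical stand-in for one lowered statement, `b * b`, using plain symbols.
ex = Expr(:call, :*, :b, :b)

vy, vb = Symbol("y1"), Symbol("back1")
# Splat the call's function and arguments into a pullback call.
new_ex = :(($vy, $vb) = pullback($(ex.args...)))
```

Here new_ex is the expression (y1, back1) = pullback(*, b, b), which is the shape of each line in the primal block.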

After working through all the lines, a final expression is added which returns a tuple with the final output of the function and a Pullback struct which stores all the pullbacks. Everything is then grouped into a single :block expression:

    pb = Expr(:call, Pullback{T}, xcall(:tuple, pullbacks...))
    push!(tape, xcall(:tuple, returnvalue(ci), pb))
    pr = Expr(:block, tape...)
    pr, calls
end

This code requires definitions for the Pullback struct as well as the following functions: ignored, xcall and returnvalue.

There are no closures in lowered Julia code, so instead Zygote.jl stores the pullbacks in a generic struct:

struct Pullback{S,T}
    data::T
end
Pullback{S}(data) where S = Pullback{S,typeof(data)}(data)
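Because a generated function only sees the types of its arguments, the first type parameter S is what later lets the callable Pullback find the forward method again. A minimal sketch of how the signature travels in the type follows; `signature` and `g` are hypothetical names used only for this illustration.

```julia
struct Pullback{S,T}
    data::T
end
Pullback{S}(data) where S = Pullback{S,typeof(data)}(data)

g(a, b) = a + b

# The data would normally be a tuple of pullbacks; `identity` is a placeholder.
pb = Pullback{Tuple{typeof(g),Float64,Float64}}((identity, identity))

# Hypothetical helper: recover the signature from the type alone.
signature(::Pullback{S}) where S = S
signature(pb) # Tuple{typeof(g), Float64, Float64}
```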

Here is the ignored functions list (source):

function ignored(ex::Expr)
    f = ex.args[1]
    ignored_f(f)
end

ignored_f(f) = f in (
    GlobalRef(Base, :not_int),
    GlobalRef(Core.Intrinsics, :not_int),
    GlobalRef(Core, :(===)),
    GlobalRef(Core, :apply_type),
    GlobalRef(Core, :typeof),
    GlobalRef(Core, :throw),
    GlobalRef(Base, :kwerr),
    GlobalRef(Core, :kwfunc),
    GlobalRef(Core, :isdefined)
)

xcall and returnvalue are convenience functions from IRTools:

xcall(mod::Module, f::Symbol, args...) = Expr(:call, GlobalRef(mod, f), args...)
xcall(f::Symbol, args...) = xcall(Base, f, args...)
xcall(f, args...) = Expr(:call, f, args...)

function returnvalue(ci::Core.CodeInfo)
    for expr in ci.code
        if expr isa Core.ReturnNode
            return expr.val
        end
    end
end
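A quick illustration of what xcall produces, repeated here with its definitions so that the snippet runs on its own:

```julia
xcall(mod::Module, f::Symbol, args...) = Expr(:call, GlobalRef(mod, f), args...)
xcall(f::Symbol, args...) = xcall(Base, f, args...)
xcall(f, args...) = Expr(:call, f, args...)

# A bare symbol is resolved against Base, so this is a call to Base.tuple.
ex = xcall(:tuple, 1, 2)
result = eval(ex) # (1, 2)
```

Using a GlobalRef means the generated code does not depend on which names happen to be in scope where the expression is evaluated.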

Running this code:

world = Base.get_world_counter()
T = Tuple{typeof(f), Float64, Float64}
pr, calls = _generate_pullback_via_decomposition(T, world)

gives the expression shown at the start of this section.

3.5 Sanitise

To evaluate the expression we need to remove all slot values and SSA values.

For the slot values (_X), the first parameter in T will always be the function f, and the remainder are from args. Therefore the first slot needs to be replaced with the symbol :f, and the remainder with Base.getindex(args, idx) where idx is offset by 1. Here are two recursive functions to accomplish this:

function replace_slot!(ex::Expr, idx::Int, f::Symbol)
    for (i, v) in enumerate(ex.args)
        if v isa Expr
            replace_slot!(v, idx, f)
        elseif v isa Core.SlotNumber && v.id == idx
            ex.args[i] = :($f) 
        end
    end
    ex
end

function varargs!(ex::Expr, offset::Int=1)
    for (i, v) in enumerate(ex.args)
        if v isa Expr
            varargs!(v, offset) # pass the offset on to nested expressions
        elseif v isa Core.SlotNumber
            ex.args[i] = :(Base.getindex(args, $(v.id - offset))) 
        end
    end
    ex
end
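The slot replacement can be tested on a hand-built nested expression. This is a sketch under assumptions: `slots_to_args!` is a hypothetical copy of the idea above, and the expression is built manually rather than taken from a CodeInfo.

```julia
# Hypothetical helper mirroring the slot replacement described above.
function slots_to_args!(ex::Expr, offset::Int=1)
    for (i, v) in enumerate(ex.args)
        if v isa Expr
            slots_to_args!(v, offset) # recurse into nested calls
        elseif v isa Core.SlotNumber
            ex.args[i] = :(Base.getindex(args, $(v.id - offset)))
        end
    end
    ex
end

# Stand-in for a + b * b, where slot 2 is a and slot 3 is b.
inner = Expr(:call, :*, Core.SlotNumber(3), Core.SlotNumber(3))
outer = Expr(:call, :+, Core.SlotNumber(2), inner)
slots_to_args!(outer)

args = (2.0, 3.0)
value = eval(outer) # 11.0
```

After the replacement the expression only refers to args, so it can be evaluated wherever a variable with that name exists.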

The SSA values (%id) need to be replaced by the y$id symbol:

function replace_SSA!(ex::Expr)
    for (i, v) in enumerate(ex.args)
        if v isa Expr
            replace_SSA!(v)
        elseif v isa Core.SSAValue
            ex.args[i] = Symbol("y$(v.id)") 
        end
    end
    ex
end

Running this code on pr:

replace_slot!(pr, 1, :f)
varargs!(pr)
replace_SSA!(pr)

Results in:

quote
    (y1, back1) = pullback(Main.:*, Base.getindex(args, 2), Base.getindex(args, 2))
    (y2, back2) = pullback(Main.:+, Base.getindex(args, 1), y1)
    (y3, back3) = pullback(Main.:/, Base.getindex(args, 1), y2)
    Base.tuple(y3, (Pullback{Tuple{typeof(f), Float64, Float64}})(Base.tuple(back1, back2, back3)))
end

We can now complete _generate_pullback to also call the decomposition code:

function _generate_pullback(world, f, args...)
    T = Tuple{f, args...}
    if (has_chain_rrule(T, world))
        return :(rrule(f, args...))
    end    
    pr, backs = _generate_pullback_via_decomposition(T, world)
    replace_slot!(pr, 1, :f)
    varargs!(pr)
    replace_SSA!(pr)
    pr
end

Testing (you should redefine the @generated pullback function first):

world = Base.get_world_counter()
pr = _generate_pullback(world, typeof(f), Float64, Float64) # same as above
z, back = pullback(f, 1.0, 2.0) # (0.2,Pullback{...})

3.6 Reverse

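The goal here is to create an expression for equation $\ref{eq:reverse}$. For f it will look like the block below. It contains unused variables which could be removed, e.g. x̄3_1 (s̄elf); the code here does not perform such optimisations, to keep things simple.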

quote
    data = Base.getfield(methodinstance, :data)
    back3 = Base.getindex(data, 3)
    Δs = back3(Δ)
    x̄3_1 = Base.getindex(Δs, 1)
    x̄3_2 = Base.getindex(Δs, 2)
    x̄3_3 = Base.getindex(Δs, 3)
    back2 = Base.getindex(data, 2)
    Δs = back2(x̄3_3)
    x̄2_1 = Base.getindex(Δs, 1)
    x̄2_2 = Base.getindex(Δs, 2)
    x̄2_3 = Base.getindex(Δs, 3)
    back1 = Base.getindex(data, 1)
    Δs = back1(x̄2_3)
    x̄1_1 = Base.getindex(Δs, 1)
    x̄1_2 = Base.getindex(Δs, 2)
    x̄1_3 = Base.getindex(Δs, 3)
    Base.tuple(nothing, Main.accum(x̄3_2, x̄2_2), Main.accum(x̄1_2, x̄1_3))
end
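Written out by hand for f, the primal and reverse expressions amount to the following. The `*_rule` functions are hypothetical stand-ins for the rrules from part 1, and plain ASCII names replace the x̄i_j variables.

```julia
# Hypothetical stand-ins for the rrules of *, + and / from part 1.
mul_rule(x, y) = (x * y, Δ -> (nothing, y * Δ, x * Δ))
add_rule(x, y) = (x + y, Δ -> (nothing, Δ, Δ))
div_rule(x, y) = (x / y, Δ -> (nothing, Δ / y, -x / y^2 * Δ))

# Forward pass: run each step and keep its pullback.
function primal_by_hand(a, b)
    y1, back1 = mul_rule(b, b)
    y2, back2 = add_rule(a, y1)
    y3, back3 = div_rule(a, y2)
    y3, (back1, back2, back3)
end

# Reverse pass: call the pullbacks in reverse order and accumulate.
function reverse_by_hand(backs, Δ)
    back1, back2, back3 = backs
    _, da3, dy2 = back3(Δ)
    _, da2, dy1 = back2(dy2)
    _, db1, db2 = back1(dy1)
    (nothing, da3 + da2, db1 + db2)
end

z, backs = primal_by_hand(2.0, 3.0)
grads = reverse_by_hand(backs, 1.0)
```

The generated code does the same thing, except that the pullbacks are fetched from the data field of the Pullback struct.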

As with the forward pass, an internal function _generate_callable_pullback will do most of the work. It uses the meta function defined above to get the CodeInfo struct based on the input types:

function _generate_callable_pullback(j::Type{<:Pullback{S, T}}, world, Δ) where {S, T}
    m = meta(S; world=world)
    isnothing(m) && return :(error("No method found for ", repr($S), " in world ", $world))
    type_signature, sps, method_ = m
    ci = Base.uncompressed_ast(method_)
    back = reverse_differentiate(ci, :methodinstance, :Δ)
    back
end

The reverse_differentiate function is a simplified version of Zygote.adjoint and Zygote.reverse_stacks!.

To start, a dictionary is created to store the gradients. It maps variable names (symbols) to an array of gradients. It is not accessed directly (e.g. grads[x]) but rather through the closure functions grad and grad! which automatically handle the arrays. The first gradient stored is Δ associated with the final return value of the forward pass. (_var_name and xaccum will be defined shortly.)

function reverse_differentiate(forw::Core.CodeInfo, self, Δ)
    grads = Dict()
    grad!(x, x̄) = push!(get!(grads, x, []), x̄)
    grad(x) = xaccum(get(grads, x, [])...)
    grad!(_var_name(returnvalue(forw)), Δ) # _var_name maps to variable names in calls

The tape for the expression block is started by retrieving the data field in the struct.

    tape = Expr[]
    push!(tape, :(data=$(xcall(:getfield, self, QuoteNode(:data)))))

    pr, calls = primal(forw)
    i = length(calls)
    for (v, ex) in reverse(calls)
        vb = Symbol("back$i")
        push!(tape, :($vb = Base.getindex(data, $i)))
        g = grad(v)
        push!(tape, :(Δs = $vb($g)))
        for (j, x) in enumerate(ex.args)
            xbar = Symbol("x̄$(i)_$(j)")
            get_xbar = :($xbar=$(xcall(:getindex, :Δs, j)))
            push!(tape, get_xbar)
            grad!(_var_name(x), xbar)
        end
        i -= 1
    end

Finally, the last call retrieves all the necessary gradients for the input arguments and returns a single quote block.

    push!(tape, xcall(:tuple, [grad(x) for x in arguments(forw)]...))
    Expr(:block, tape...)
end

This code required the following functions: xaccum, _var_name and arguments. They are as follows:

xaccum() = nothing
xaccum(x) = x
xaccum(xs...) = xcall(Main, :accum, xs...)
_var_name(x::Core.SlotNumber) = x.id == 1 ? Symbol("#self") : Symbol("args$(x.id)")
_var_name(x::Core.SSAValue)  = Symbol("y$(x.id)")
_var_name(x) = x
arguments(forw::Core.CodeInfo) = [Symbol("#self"), [Symbol("args$i") for i in 2:length(forw.slotnames)]...]

accum(x, y) = x === nothing ? y : y === nothing ? x : x + y
accum(x::Tuple, ys::Tuple...) = map(accum, x, ys...)
accum(x, y, zs...) = accum(accum(x, y), zs...)
@generated function accum(x::NamedTuple, y::NamedTuple)
    # assumes that y has no keys apart from those also in x
    fieldnames(y) ⊆ fieldnames(x) || throw(ArgumentError("$y keys must be a subset of $x keys"))
    grad(field) = field in fieldnames(y) ? :(y.$field) : :nothing
    Expr(:tuple, [:($f=accum(x.$f, $(grad(f)))) for f in fieldnames(x)]...)
end

Examples:

accum(1, 2, nothing, 3) # 6
accum((1, 2), (3, 4)) # (4, 6)
accum((;a=3, b=2), (;a=1)) # (a = 4, b = 2)
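The gradient bookkeeping inside reverse_differentiate can also be tried in isolation. The sketch below copies the grad! and grad closures to global scope and uses ASCII symbols as stand-ins for names such as x̄3_2.

```julia
# Copies of the helpers from the text, so that the snippet is self-contained.
xcall(mod::Module, f::Symbol, args...) = Expr(:call, GlobalRef(mod, f), args...)
xaccum() = nothing
xaccum(x) = x
xaccum(xs...) = xcall(Main, :accum, xs...)

grads = Dict()
grad!(x, xbar) = push!(get!(grads, x, []), xbar)
grad(x) = xaccum(get(grads, x, [])...)

# ASCII symbols stand in for the generated gradient names.
grad!(:args2, :dx3_2)
grad!(:args2, :dx2_2)
grad!(:y2, :dx3_3)

grad(:y2)    # one gradient: returned as is
grad(:args2) # two gradients: an expression calling accum
grad(:args3) # none recorded: nothing
```

Note that grad returns symbols and expressions, not numbers. The actual accumulation only happens when the generated code runs.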

Finally, dispatch on the Pullback struct to turn it into a callable struct:

The argument names methodinstance and Δ must match the symbols in the call to reverse_differentiate in _generate_callable_pullback. Otherwise the expression will be unable to find those variables.

@generated function (methodinstance::Pullback)(Δ)
    _generate_callable_pullback(methodinstance, nothing, Δ)
end

function _callable_pullback_generator(world::UInt, source, self, Δ)
    ret = _generate_callable_pullback(self, world, Δ)
    ret isa Core.CodeInfo && return ret
    stub = Core.GeneratedFunctionStub(identity, Core.svec(:methodinstance, :Δ), Core.svec()) # names must match symbols in _generate_callable_pullback
    stub(world, source, ret)
end

@eval function (j::Pullback)(Δ)
    $(Expr(:meta, :generated, _callable_pullback_generator))
    $(Expr(:meta, :generated_only))
end

Testing:

f(a,b)=a/(a+b*b)
z, back = pullback(f, 2.0, 3.0) # (0.1818, Pullback{...})
_generate_callable_pullback(typeof(back), nothing, Float64) # expression at start
back(1.0) # (nothing, 0.0744, -0.0992)

The results should match equation $\ref{eq:rollup}$:

a, b = 2.0, 3.0
ā = abs2(b)/abs2(a+abs2(b)) # 0.0744
b̄ = -2*a*b/abs2(a+abs2(b))  # -0.0992

4 Conclusion

This code works well enough for this simple case. It also works for the trigonometry example from part 1:

f(x) = sin(cos(x))
z, back = pullback(f, 0.9) # (0.5823, Pullback{...})
back(1.0) # (nothing, -0.6368)

However it will fail for the polynomial model:

struct Polynomial{V<:AbstractVector}
    weights::V
end
(m::Polynomial)(x) = evalpoly(x, m.weights)
(m::Polynomial)(x::AbstractVector) = map(m, x)
model = Polynomial([3.0, 2.0, -3.0, 1.0])
x = [1.0, 2.0, 3.0, 4.0]
pullback(model, x) # ERROR: syntax: invalid syntax (static_parameter 1)

The error is raised three levels down:

pr1 = _generate_pullback(world, Polynomial, Vector{Float64})
pr2 = _generate_pullback(world, typeof(map), Polynomial, Vector{Float64})
pr3 = _generate_pullback(world, typeof(Base.Generator), Polynomial, Vector{Float64})

This can be fixed by explicitly writing a pullback for map.

However, rather than fixing it here, I first want to rewrite the code using IRTools. The code written here is brittle and difficult to debug. Instead of writing expressions, it would be better to directly create a CodeInfo struct which always contains valid code. Julia does not allow us to do that, but working with an IR object which can be readily converted is the next best thing. This will be the goal of part 3.

Presumably the reason the Julia team tried to prevent reflection in generated functions is that it interferes with the compiler’s ability to properly predict, trigger and/or optimise compilations. ↩
Zygote.jl has more complex rules which also consider other fallbacks, keyword arguments and a possible opt-out through a no_rrule. ↩

MicroGrad.jl: Part 1 ChainRules

2024-07-27T00:00:00+00:00

A series on automatic differentiation in Julia. Part 1 provides an overview and defines explicit chain rules.

This is part of a series. The other articles are:

All source code can be found at MicroGrad.jl.

1 Introduction

A major convenience of modern machine learning frameworks is automatic differentiation (AD). Training a machine learning model typically consists of two steps: a forward pass and a backward pass. The forward pass takes an input sample and calculates the result. Examples include a label in a classifier model or a word or image in a generative model. In the backward pass, the result is compared to a ground truth sample and the error is backpropagated throughout the model, from the final layers through to the start. Backpropagation is driven by gradients which are calculated with the differentiation rules of Calculus.

With modern machine learning frameworks, such as PyTorch or Flux.jl, only the forward pass needs to be defined and they will automatically generate the backward pass. This (1) makes them easier to use and (2) enforces consistency between the forward pass and backward pass.

The probability boundaries of a multi-layer perceptron trained on the moons dataset with MicroGrad.jl.

Andrej Karpathy made an excellent video where he built a minimal automatic differentiation module called Micrograd in Python. This is the first video in his Zero to Hero series. He later uses it to train a multi-layer perceptron model. I highly recommend it for anyone who wants to understand backpropagation.

The aim of this series is to create a minimal automatic differentiation package in Julia. It is based on Zygote.jl and works very differently to the Python AD packages. The latter are based on objects with their own custom implementations of mathematical operations that calculate both the forward and backward passes. All operations are only done with these objects.¹ Zygote.jl is instead based on the principle that Julia is a functional programming language. It utilises Julia’s multiple dispatch feature and its comprehensive metaprogramming abilities to generate new code for the backward pass. Barring some limitations, it can be used to differentiate all existing functions as well as any custom code.

Zygote’s approach is complex and pushes the boundaries of Julia’s metaprogramming. It can sometimes be buggy. However its promise is true automatic differentiation of any forward pass code without further work on the coder’s part.

For the final code, see my MicroGrad.jl repository. It is versatile but has several limitations: it has less code coverage than Zygote.jl and it cannot handle control flow or keyword arguments.

There are almost no comprehensive tutorials on AD in Julia and so this series aims to cover that gap. A good understanding of Julia and of Calculus is required.

2 Julia AD Ecosystem

The Julia automatic differentiation ecosystem is centered around three packages: Flux.jl, ChainRules.jl and Zygote.jl.

Flux.jl is a machine learning framework. It uses either ChainRules.jl or Zygote.jl to differentiate code.
Zygote.jl implements automatic differentiation through metaprogramming.
- The core functionality is defined in the minimal ZygoteRules.jl package.
- The main functions it exposes are gradient, withgradient and pullback. The pullback function is a light wrapper around _pullback which does most of the heavy lifting.
- The goal of _pullback is to dispatch a function, its arguments and its keyword arguments to a ChainRule.rrule. If it cannot, it will inspect the code, decompose it into smaller steps, and follow the rules of differentiation to dispatch each of those to _pullback to recursively find an rrule. If this recursive process does not find a valid rule it will raise an error.
ChainRules.jl defines forward rules and reverse rules.
- The core functionality is defined in the minimal ChainRulesCore.jl package.
- The main functions it exposes are frule and rrule. This series deals only with backpropagation, so it will only concentrate on rrule.

Also important is IRTools.jl, an extended metaprogramming package for working with an intermediate representation (IR) between raw Julia code and lowered code. MicroGrad.jl in particular is based on the example code at IRTools.jl, aligned with Zygote.jl functions and names.

As an example, consider the function $f(x) = \sin(\cos(x))$. Using the chain rule of Calculus, it is differentiated as:

\[\begin{align} \frac{df}{dx} &= \frac{df}{dh}\frac{dh}{dx} \quad ; h(x)=\cos(x)\\ &= \frac{d}{dh}\sin(h)\frac{d}{dx}\cos(x) \\ &= \cos(h)(-\sin(x)) \\ &= -\cos(\cos(x))\sin(x) \end{align}\]

Zygote.withgradient, exposed as Flux.withgradient, can be used to calculate this:

using Flux
f(x) = sin(cos(x))
y, grad = Flux.withgradient(f, 0.9) # 0.5823, (-0.6368,)
grad[1] == -cos(cos(0.9))*sin(0.9) # true

More commonly we differentiate with respect to the model, not the data:

y, grad = Flux.withgradient(m->m(0.9), f) # 0.5823, (nothing,)

This is more useful for a model with parameters. For example a dense, fully connected layer:

model = Dense(3=>1)
x = rand(Float32, 3, 10)
y, grad = Flux.withgradient(m->sum(m(x)), model) # 1.5056f0, ((weight=[4.9142 6.235 5.3379],bias=Fill(10.0f0,1),σ=nothing),)

The aim of the rest of the series is to recreate this functionality. This first part will focus solely on ChainRules.jl and recreating the rrule function. Part 2 will focus on recreating the Zygote._pullback function. Part 3 repeats part 2 in a more robust manner. Part 4 extends part 3’s solution to handle maps, anonymous functions and structs. Finally part 5 shows how this AD code can be used by a machine learning framework.

3 ChainRules

3.1 Definition

ChainRules.jl’s rrule returns the output of the forward pass $y(x)$ and a function $\mathcal{B}$ which calculates the backward pass. $\mathcal{B}$ takes as input $\Delta = \frac{\partial l}{\partial y}$, the gradient of some scalar $l$ with respect to the output variable $y$, and returns the tuple $\left(\frac{\partial l}{\partial \text{self}}, \frac{\partial l}{\partial x_1}, \dots, \frac{\partial l}{\partial x_n}\right)$, which holds the gradient of $l$ with respect to each of the input variables $x_i$. (The extra gradient $\frac{\partial l}{\partial \text{self}}$ is needed for internal fields and closures. See the Dense layer example above.) According to the chain rule of Calculus, each gradient is calculated as:

\[\mathcal{B_i}\left(\frac{\partial l}{\partial y}\right) = \frac{\partial l}{\partial x_i} = \frac{\partial l}{\partial y} \frac{\partial y}{\partial x_i}\]

As a starting point $\frac{\partial l}{\partial y}=1$ is used to evaluate only $\frac{\partial y}{\partial x}$.

If $x$ and $y$ are vectors, then the gradient $J=\frac{\partial y}{\partial x}$ is a Jacobian:

\[J = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \dots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \dots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}\]

To maintain the correct order, we need to use the conjugate transpose (adjoint) of the Jacobian. So each gradient is calculated as:

\[\mathcal{B_i}(\Delta) = J_i^{\dagger} \Delta\]

Note the Jacobian does not need to be explicitly calculated; only the product needs to be. This can be useful when coding the rrule for matrix functions. See the section on the chain rule for matrix multiplication later.
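As a small illustration, consider the elementwise square $y_i = x_i^2$, whose Jacobian is diagonal with entries $2x_i$. The example function is chosen here for illustration only; the sketch compares the explicit product with the shortcut that never forms $J$.

```julia
using LinearAlgebra

x = [1.0, 2.0, 3.0]
Δ = [0.5, -1.0, 2.0]

# Explicit route: build the Jacobian of y = x.^2 and multiply by its adjoint.
J = Diagonal(2 .* x)
explicit = J' * Δ

# Implicit route: the same product without ever forming J.
implicit = 2 .* x .* Δ # [1.0, -4.0, 12.0]
```

For an input of length $n$ the explicit Jacobian has $n^2$ entries, whereas the implicit product only needs $n$ multiplications.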

To start, define a default fallback for rrule that returns nothing for any function with any number of arguments (source):

rrule(::Any, ::Vararg{Any}) = nothing

An rrule can now be defined for any function. For it to be really useful rrule must cover a large set of functions. Thankfully ChainRules.jl provides us with that. However in this post I’ll only work through a limited set of examples.

3.2 Arithmetic

The derivatives of the sum of two variables are:

\[\frac{\partial}{\partial x}(x+y) = 1 + 0; \frac{\partial}{\partial y}(x+y) = 0 + 1\]

Proof ⇩

$$ \begin{align} \Delta f_x &= (x+\Delta x+ y) - (x+y) \\ \therefore \lim_{\Delta x \to 0}\frac{\Delta f_x}{\Delta x} &=\frac{\partial f}{\partial x}= 1 \\ \therefore \lim_{\Delta y \to 0}\frac{\Delta f_y}{\Delta y} &=\frac{\partial f}{\partial y}= 1 \end{align} $$

There are no internal fields so $\frac{\partial l}{\partial \text{self}}$ is nothing. $\mathcal{B}$ can be returned as an anonymous function, but giving it the name add_back helps with debugging (source).

function rrule(::typeof(+), x::Real, y::Real)
    add_back(Δ) = (nothing, true * Δ, true * Δ) # ∂self, ∂x, ∂y
    x + y, add_back # also (Δ) -> (nothing, true * Δ, true * Δ)
end

Usage:

z, back = rrule(+, 1, 2) # (3, var"#add_back#"())
back(1.2) # (nothing, 1.2, 1.2)

Subtraction is almost identical:

function rrule(::typeof(-), x::Real, y::Real)
    minus_back(Δ) = (nothing, true * Δ, -1 * Δ) # ∂self, ∂x, ∂y
    x - y, minus_back
end

With multiplication, the incoming gradient is multiplied by the other variable:

\[\frac{\partial}{\partial x}(xy) = y; \frac{\partial}{\partial y}(xy) = x\]

Proof ⇩

$$ \begin{align} \Delta f_x &= (x+\Delta x)y - xy \\ \therefore \lim_{\Delta x \to 0}\frac{\Delta f_x}{\Delta x} &=\frac{\partial f}{\partial x}= y \\ \therefore \lim_{\Delta y \to 0}\frac{\Delta f_y}{\Delta y} &=\frac{\partial f}{\partial y}= x \end{align} $$

In code (source):

function rrule(::typeof(*), x::Real, y::Real)
    times_back(Δ) = (nothing, y * Δ, x * Δ) # ∂self, ∂x, ∂y
    x * y, times_back
end

Note that Julia will create a closure around the incoming x and y variables for times_back. A closure is a function that stores the values of variables from its parent scope (it closes over the variables). In other words, x and y become constants in the times_back scope, so times_back will always “remember” the values that rrule was called with.

Example:

z, back = rrule(*, 2, 3) # (6, var"#times_back#4"{Int64, Int64}(2, 3))
back.x # 2
back.y # 3
back(1.2) # (nothing, 3.6, 2.4)

Every call to rrule with * will return a different back instance based on the input arguments.

Division is slightly different in that the derivatives look different for $x$ and $y$:

\[\frac{\partial}{\partial x}\frac{x}{y} = \frac{1}{y}; \frac{\partial}{\partial y}\frac{x}{y}= -\frac{x}{y^2}\]

Proof ⇩

$$ \begin{align} \Delta f_x &= \frac{x+\Delta x}{y} - \frac{x}{y} \\ \therefore \lim_{\Delta x \to 0}\frac{\Delta f_x}{\Delta x} &=\frac{\partial f}{\partial x}= \frac{1}{y} \\ \Delta f_y &= \frac{x}{y+\Delta y} - \frac{x}{y} \\ &= \frac{xy}{y(y+\Delta y)} - \frac{x(y+\Delta y)}{y(y+\Delta y)} \\ &= -\frac{x \Delta y}{y(y+\Delta y)} \\ \therefore \lim_{\Delta y \to 0}\frac{\Delta f_y}{\Delta y} &=\frac{\partial f}{\partial y} = -\frac{x}{y^2} \end{align} $$

Here we can calculate an internal variable Ω to close over, and use it for the $\frac{\partial}{\partial y}$ derivative (source):

function rrule(::typeof(/), x::Real, y::Real)
    Ω = x / y
    divide_back(Δ) = (nothing, 1 / y * Δ, -Ω/y * Δ) # ∂self, ∂x, ∂y
    Ω, divide_back
end

Example:

z, back = rrule(/, 2, 3) # (0.6667, var"#divide_back#5"{Int64, Float64}(3, 0.6667))
back.Ω # 0.6667
back.y # 3
back.x # ERROR
back(1.2) # (nothing, 0.4, -0.2667)

3.3 Trigonometry

The derivatives of $\sin$ and $\cos$ are:

\[\begin{align} \frac{\partial}{\partial x} \sin(x) &= \cos(x) \\ \frac{\partial}{\partial x} \cos(x) &= -\sin(x) \end{align}\]

Each rule needs both $\sin$ and $\cos$: one for the forward pass and the other for the backward pass. We can therefore use sincos to calculate both simultaneously, which is more efficient than calculating each on its own. This shows the advantage of calculating the forward pass and backward pass at the same time (source):

function rrule(::typeof(sin), x::Number)
    s, c = sincos(x)
    sin_back(Δ) = (nothing, Δ * c) # ∂self, ∂x
    s, sin_back
end

function rrule(::typeof(cos), x::Number)
    s, c = sincos(x)
    cos_back(Δ) = (nothing, -Δ * s) # ∂self, ∂x
    c, cos_back
end

Let’s now revisit the example from earlier, $f(x) = \sin(\cos(x))$. We have the forward pass:

\[\begin{align} y_1 &= \cos(x) \\ y_2 &= \sin(y_1)\\ \end{align}\]

And the backwards pass:

\[\begin{align} \frac{\partial y_2}{\partial y_1} &= (1.0) \frac{\partial}{\partial y_1} \sin(y_1) \\ &= \cos(y_1) \\ \frac{\partial y_2}{\partial x} &= \frac{\partial y_2}{\partial y_1} \frac{\partial}{\partial x} \cos(x) \\ &= -\cos(y_1) \sin(x) \end{align}\]

In code:

x = 0.9
y1, back1 = rrule(cos, x) # (0.6216, cos_back)
y2, back2 = rrule(sin, y1) # (0.5823, sin_back)
grad_sin, grad_y1 = back2(1.0) # (nothing, 0.8129)
grad_cos, grad_x = back1(grad_y1) # (nothing, -0.6368)
grad_x == -cos(cos(x))*sin(x) # true

3.4 Polynomials

The next section will showcase an example of polynomial curve fitting. This requires an rrule for the evalpoly function.

For a general polynomial:

\[y = a_0 + a_1x + a_2x^2 + ... + a_n x^n\]

The derivatives are:

\[\begin{align} \frac{\partial y}{\partial x} &= 0 + a_1 + 2a_2x^1 + ... + n a_n x^{n-1} \\ \frac{\partial y}{\partial a_i} &= 0 + ... + x^{i} + ... + 0 \end{align}\]

For the most efficient implementation, the powers of $x$ can be calculated for both the forward and backwards pass at the same time. For simplicity, I’m not going to do that (source):

function rrule(::typeof(evalpoly), x, coeffs::AbstractVector)
    y = evalpoly(x, coeffs)
    function evalpoly_back(Δ)
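        xpow = one(x)    # x^(i-1)
        xprev = zero(x)  # x^(i-2); only used from the linear term onwards
        dp = similar(coeffs, typeof(xpow * Δ))
        dx = zero(x)
        for i in eachindex(coeffs)
            dp[i] = Δ * xpow
            # d/dx of coeffs[i] * x^(i-1); avoids dividing by x, which fails at x == 0
            dx += (i-1) * coeffs[i] * xprev * Δ
            xprev = xpow
            xpow *= x
        end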
        return nothing, dx, dp
    end
    y, evalpoly_back
end

Usage:

y, back = rrule(evalpoly, 1.2, [2.0, 0.0, 3.0, 4.0]) # 13.232, evalpoly_back
back(1.0) # (nothing, 24.48, [1.0, 1.2, 1.44, 1.728])
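These numbers can be checked against the derivative formulas directly. The sketch below builds the coefficients of the derivative polynomial by hand; the names `dcoeffs`, `dydx` and `dyda` are chosen here for illustration.

```julia
coeffs = [2.0, 0.0, 3.0, 4.0]
x = 1.2

# Coefficients of dy/dx: a_1, 2a_2, 3a_3 (indices are shifted by one in Julia).
dcoeffs = [(i - 1) * coeffs[i] for i in 2:length(coeffs)]
dydx = evalpoly(x, dcoeffs)

# dy/da_i is the matching power of x: 1, x, x^2, x^3.
dyda = [x^(i - 1) for i in eachindex(coeffs)]
```

Both results should match the second and third entries returned by back(1.0).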

3.5 Matrix multiplication

For some scalar loss function $l$, we can calculate a derivative $\Delta=\frac{\partial l}{\partial Y}$ against some matrix $Y$. Then for $Y=AB$, the partial derivatives are:

\[\begin{align} \frac{\partial l}{\partial A} &= \frac{\partial Y}{\partial A} \frac{\partial l}{\partial Y} \\ &= \Delta B^T \\ \frac{\partial l}{\partial B} &= \frac{\partial Y}{\partial B} \frac{\partial l}{\partial Y} \\ &= A^T \Delta \end{align}\]

Note that the Jacobians $\frac{\partial Y}{\partial A}$ and $\frac{\partial Y}{\partial B}$ are not explicitly calculated here; only the product is. (These Jacobians would have many zeros because each output element depends only on a small subset of the input elements.)

The most common use case in machine learning is $Y=WX$, where $W$ is a set of weights and $X$ is the data. Machine learning algorithms only alter the weights, not the data. Hence only $\frac{\partial l}{\partial W}$ is required. This means computation is wasted on $\frac{\partial l}{\partial X}$. For large matrices, this can be significant. To avoid this ChainRules.jl uses the ChainRulesCore.@thunk macro to wrap code in a ChainRulesCore.Thunk struct. This struct defers computation until it is used. If it is not used, the computation is not run.

In code (source):

function rrule(::typeof(*), A::AbstractVecOrMat{<:Real}, B::AbstractVecOrMat{<:Real})
    function times_back(Δ)
        dA = Δ * B'
        dB = A' * Δ
        return (nothing, dA, dB)
    end
    A * B, times_back
end

Test:

A, B = rand(2, 4), rand(4, 3)
C, back = rrule(*, A, B) # (2×3 Matrix{Float64}, times_back)
back(ones(2, 3)) # (nothing, 2×4 Matrix, 4×3 Matrix)
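The deferred computation idea can be sketched with plain closures. This is only a stand-in for illustration: the real ChainRulesCore.Thunk struct does more, and `times_pullback_lazy` is a hypothetical name.

```julia
# A closure stands in for a thunk: nothing is computed until it is called.
function times_pullback_lazy(A::AbstractMatrix, B::AbstractMatrix)
    Y = A * B
    function times_back(Δ)
        dA = () -> Δ * B'   # deferred
        dB = () -> A' * Δ   # deferred
        return dA, dB
    end
    Y, times_back
end

W, X = rand(2, 4), rand(4, 3)
Y, back = times_pullback_lazy(W, X)
dW_lazy, dX_lazy = back(ones(2, 3))
dW = dW_lazy()              # only the weight gradient is ever computed
size(dW)                    # (2, 4)
```

Because `dX_lazy` is never called, the product `W' * Δ` is never evaluated.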

3.6 MSE

The mean square error (MSE) is a common loss function in machine learning. It will be used shortly for polynomial curve fitting. It is:

\[MSE(\hat{y}, y) = \frac{1}{n}\sum^n_{i=1} (\hat{y}_i - y_i)^2\]

with derivatives:

\[\begin{align} \frac{\partial MSE}{\partial \hat{y}_i} &= \frac{1}{n}(0 + ... + 2(\hat{y}_i - y_i) + ... + 0) \\ &= \frac{2(\hat{y}_i - y_i)}{n} \\ \frac{\partial MSE}{\partial y_i} &= \frac{1}{n}(0 + ... - 2(\hat{y}_i - y_i) + ... + 0) \\ &= -\frac{2(\hat{y}_i - y_i)}{n} \end{align}\]

In code it is:

using StatsBase
mse(ŷ::AbstractVecOrMat, y::AbstractVecOrMat) = mean(abs2.(ŷ - y))

Flux.jl does not define an rrule for its mse because it can be decomposed into functions which already have an rrule (-, broadcast, abs2 and mean). However since we don’t have rrules for these parts and have not yet automated decomposition, it is simplest to create an rrule for the entire function:

function rrule(::typeof(mse), ŷ::AbstractVecOrMat, y::AbstractVecOrMat)
    Ω = mse(ŷ, y)
    function mse_back(Δ)
        c = 2 * (ŷ - y) / length(y) * Δ
        return nothing, c, -c # ∂self, ∂ŷ, ∂y
    end
    Ω, mse_back
end

The mse can also be applied per individual data point and summed up separately. This form is not common but will be useful for explanatory purposes in the polynomial curve fitting section:

mse(ŷ::Number, y::Number, n::Int) = abs2(ŷ - y)/n
function rrule(::typeof(mse), ŷ::Number, y::Number, n::Int)
    Ω = mse(ŷ, y, n)
    function mse_back(Δ)
        c = 2 * (ŷ - y) / n * Δ
        return nothing, c, -c, -Ω/n * Δ # ∂self, ∂ŷ, ∂y, ∂n
    end
    Ω, mse_back
end
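A quick way to see that the two forms agree is to sum the per-point values and compare with the vector version. The sketch below uses the hypothetical names `mse_vec` and `mse_point` and made-up data so that it runs on its own.

```julia
using Statistics

mse_vec(ŷ, y) = mean(abs2.(ŷ .- y))           # whole-vector form
mse_point(ŷ, y, n) = abs2(ŷ - y) / n          # per-point form

ŷ = [1.0, 2.5, 4.0]
y = [1.5, 2.0, 3.0]
n = length(y)

total = sum(mse_point(a, b, n) for (a, b) in zip(ŷ, y))
total ≈ mse_vec(ŷ, y)
```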

4 Gradient Descent

4.1 Polynomial curve fitting

Gradient descent is a great algorithm to illustrate the usefulness of the code developed so far. The toy example of fitting a polynomial to data will be used. This is a useful example because (1) we can start with a target curve and so have ground truth values to compare and (2) this problem can be solved analytically without gradients.

Here is code to create the above data:

using StatsBase
target_weights = [15.0, -2.1, 13.9, 1.5]
noise_factor = 0.2
xs = (rand(100) .- 0.5) .* 10
ys = map(x -> evalpoly(x, target_weights), xs)
scale_factor = mean(abs.(ys))
ys .+= randn(length(ys)) * scale_factor * noise_factor

Analytical least squares fitting of polynomials ⇩

For a polynomial of order $p$, if there are exactly $n=p+1$ training samples (including for the constant $a_0$) then there are exactly $n$ equations for $n$ unknowns ($a_0$,...,$a_p$) and this can be solved as an ordinary linear system: $$ \begin{align} &a_0 + a_1 x_1 + a_2x_1^2 + ... + a_p x_1^p = y_1 \\ &\vdots \\ &a_0 + a_1 x_n + a_2x_n^2 + ... + a_p x_n^p = y_n \\ &\Rightarrow \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^p \end{bmatrix} \begin{bmatrix} a_0 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \\ &\Rightarrow XA=Y \\ &\Rightarrow A = X^{-1}Y \end{align} $$ where $X^{-1}$ usually exists because $X$ is a square matrix (it is invertible whenever the $x_i$ are distinct).

However usually $n > p + 1$ and thus $X^{-1}$ will not exist. In that case the pseudoinverse $X^+$, also called the Moore-Penrose inverse, can be used instead: $$ \begin{align} X^{+} &= (X^T X)^{-1} X^T \\ \Rightarrow A &= X^{+}Y \end{align} $$ It can be proven that this solution for $A$ minimises the least squared error.

Here is this solution in code:

using LinearAlgebra
function solve_poly_linear(order::Int, xs::AbstractVector, ys::AbstractVector)
    n = length(xs)
    X = zeros(n, order + 1)
    for (i, x) in enumerate(xs)
        xpow = 1
        for j in 1:(size(X, 2))
            X[i, j] = xpow
            xpow *= x
        end
    end
    pinv(X) * ys
end
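As an alternative sketch, Julia's backslash operator solves the same least squares problem for a tall matrix without forming the pseudoinverse explicitly. The noiseless data and the variable names here are made up for illustration.

```julia
using LinearAlgebra

xs_demo = collect(-2.0:0.5:2.0)
target = [15.0, -2.1, 13.9, 1.5]
ys_demo = map(x -> evalpoly(x, target), xs_demo)

X = [x^j for x in xs_demo, j in 0:3]   # Vandermonde matrix, one row per sample
fitted = X \ ys_demo                    # least squares solution

fitted ≈ target
```

With noiseless data the target coefficients are recovered up to rounding error.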

Here is a simple version of gradient descent:

Gradient descent
while (criteria is not met) do:
$\quad$ $\Delta = 0$
$\quad$ for sample, label in train_set do:
$\quad\quad$ $\Delta \leftarrow \Delta + \frac{\partial}{\partial\theta_j}L$($m_{\theta_j}$(sample), label)
$\quad$ $\theta_{j+1}$ $\leftarrow \theta_j - \alpha \Delta$

where $m_\theta$ is the model with parameters $\theta$ and $L$ is the loss function.
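Before the polynomial case, the update rule can be seen on a one-parameter toy problem. This is a minimal sketch under assumed values: the model $y = \theta x$, the data, the learning rate `α` and the name `fit_slope` are all made up for illustration.

```julia
# Toy problem: fit y = θ*x with a squared error loss.
function fit_slope(xs, ys; α=0.05, iters=200)
    θ = 0.0
    n = length(xs)
    for _ in 1:iters
        Δ = 0.0
        for (x, y) in zip(xs, ys)
            Δ += 2 * (θ * x - y) * x / n   # ∂/∂θ of (θx - y)^2 / n
        end
        θ -= α * Δ                          # gradient step
    end
    θ
end

θ = fit_slope([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The data lies exactly on $y = 2x$, so `θ` converges to 2.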

The following Julia implementation applies the algorithm specifically to polynomials. The stopping condition is a maximum number of iterations, so the while loop has been replaced with a for loop. The code also saves the loss so that the training progress can be analysed.

function gradient_descent_poly!(
    coeffs::AbstractVector,
    xs::AbstractVector,
    ys::AbstractVector
    ; learning_rate::AbstractFloat=0.1,
    max_iters::Integer=100
    )
    history = Float64[]
    n = length(xs)
    p = length(coeffs)
    for i in 1:max_iters
        loss_iter = 0.0
        Δcoeffs = zeros(p)
        for (x, y) in zip(xs, ys)
            # forward
            ŷ, back_poly = rrule(evalpoly, x, coeffs)
            loss_x, back_loss = rrule(mse, ŷ, y, n)
            # reverse
            Δloss, Δŷ, Δy, Δn = back_loss(1.0)    
            Δevalpoly, Δx, Δcoeffs_x = back_poly(Δŷ)
            # accumulate
            loss_iter += loss_x
            Δcoeffs += Δcoeffs_x
        end
        # update
        coeffs .-= learning_rate .* Δcoeffs
        # history
        push!(history, loss_iter)
    end
    history
end

Calling the code:

coeffs = rand(4)
history = gradient_descent_poly!(coeffs, xs, ys; learning_rate=1e-5, max_iters=2000)

Plotting the history:

Comparing losses on the train set:

ys_est =  map(x -> evalpoly(x, coeffs), xs)
mse(ys_est, ys)

Method	Loss	Coefficients
Target	416.62	(15.0, -2.1, 13.9, 1.5)
Analytical	391.64	(15.34, -3.24, 13.84, 1.46)
Gradient Descent	498.50	(1.37, 0.54, 14.51, 1.26)

And finally, comparing the curves:

x_model = -5:0.01:5
ys_model =  map(x -> evalpoly(x, coeffs), x_model)

4.2 Revisited with map

It is possible to replace the inner loop over the training data with map.

function gradient_descent_poly!(
    coeffs::AbstractVector,
    xs::AbstractVector,
    ys::AbstractVector
    ; learning_rate::AbstractFloat=0.1,
    max_iters::Integer=100
    )
    history = Float64[]
    for i in 1:max_iters
        # forward
        ys_and_backs = map(x->rrule(evalpoly, x, coeffs), xs)
        ŷ = map(first, ys_and_backs)
        loss_iter, back_loss = rrule(mse, ŷ, ys)
        # reverse
        Δmse, Δŷ, Δy = back_loss(1.0)
        ∂f_and_∂x_zipped = map(((_, pb), δ) -> pb(δ), ys_and_backs, Δŷ)
        Δcoeffs_unzipped = map(Δ->Δ[3], ∂f_and_∂x_zipped) # Δ[i] = (Δevalpoly, Δx, Δcoeffs)
        Δcoeffs = reduce(+, Δcoeffs_unzipped)
        # update
        coeffs .-= learning_rate .* Δcoeffs
        # history
        push!(history, loss_iter)
    end
    history
end

This code is slightly more complex than the previous version. The behaviour and performance are practically identical. However, it is one step closer to being more generic.

In machine learning, models usually execute on multiple inputs at once. We could make a polynomial model that does that:

struct Polynomial{V<:AbstractVector}
    weights::V
end
(m::Polynomial)(x) = evalpoly(x, m.weights)
(m::Polynomial)(x::AbstractVector) = map(m, x)

The goal then is to get gradients for the model’s weights directly:

model = Polynomial(coeffs)
zs, back = pullback(m -> m(xs), model)

In the next sections we will write code that will inspect the model function call, recognise that it calls map, and call a pullback for map.² This in turn will call the pullback for evalpoly, which will pass the arguments to the rrule defined above.

5 Conclusion

The next two sections will develop the pullback function. It will inspect and decompose code with the goal of passing arguments to rrule and accumulating gradients via the chain rule.

Part 2 will introduce metaprogramming in Julia and generate expressions for the backpropagation code. However the code is unstable and prone to errors - it is recursive metaprogramming - so part 3 will introduce more robust code making use of the IRTools.jl package. This code really pushes Julia’s metaprogramming to its limits.

It is possible to jump straight to part 3 if desired.

For example, Micrograd defines a Value class that has a custom definition for __add__. This custom definition then calculates the forward pass and prepares the backwards pass. The same is true of the Tensor objects in PyTorch. ↩
It is a design choice to use pullback and not rrule for map. Both rrule and pullback have the same outputs. However rrule is intended for stand alone gradients, whereas pullback will potentially involve recursive calls to itself. ↩

Covering all birthdays

2024-07-09T00:00:00+00:00

Quantifying how likely each birthday is present (covered) in some large group of people.

1 Introduction

I recently got nerd sniped by a fascinating post on Hacker News titled Every day is an Owl’s Birthday! by SeniorMars. It explored the problem of estimating if there was at least one student at a university for every birthday. Put another way, it explored the following question:

Given $n$ people, what is the probability that all $N$ birthdays are covered? That is, given $n$ people, what is the probability that there is at least 1 person for each birthday?

As well as the related question:

What is the expected number of people required to have at least 1 person for each birthday? That is, how many people do you need to approach and ask what their birthday is before you see all birthdays?

For the latter, the minimum number of people is obviously $n=N=365$. However you would have to be very lucky to get this outcome. On the other extreme, one can imagine an incredibly unlucky case where 1000 people are approached and all are born on May 5th, and hence you would be no closer to your goal than when you started. You might end up approaching hundreds of thousands of people. But between these two extremes, how many people do we expect to approach on average? The first question then seeks to quantify how lucky you are with the number you get.

I want to present the results in the original post in my own way. It took me a few reads to understand those explanations, and I think I can improve on them here. However I will leave out extra material from the original including mathematical proofs, accounting for leap years and accounting for non-uniform birthday distributions.

This problem is different to the birthday paradox, which tries to determine how likely duplicate birthdays are in a group of people, and which comes up with the surprising answer that it is very likely for even a small number. I have explored this problem in an earlier blog post. The key differentiator is that the birthday paradox deals with $n < N$ (fewer people than birthdays), whereas this problem deals with $n > N$ (more people than birthdays), where duplicates cause the extra complexity.

2 The Coupon Collector Problem

We will start with the second problem because it is simpler to solve. It is:

What is the expected number of people required to have at least 1 person for each birthday?

This problem is identical to the Coupon Collector’s Problem:

Given $N$ coupons, how many coupons do you expect you need to draw with replacement before having drawn each coupon at least once?

with $N=365$ birthdays.

I’ll first simulate it and present results, and then match the numbers to theory.

2.1 Simulation

To run a Monte Carlo simulation, for each trial create a vector seen of size 365 and set it to all false. Then loop, and on each iteration generate 1 random birthday k and set seen[k] to true. Exit when all of seen is true. Repeat this for some large number $T$ of trials.

Here is an implementation in Julia:

using ProgressMeter

function coupon_collecting_simulation(ncoupons::Int, ntrials::Int)
    counts = Vector{Int}(undef, ntrials)
    @showprogress for i in eachindex(counts)
        counts[i] = run_collection_trial(ncoupons)
    end
    counts
end

function run_collection_trial(ncoupons::Int)
    seen = fill(false, ncoupons)
    coupons = 0
    while !all(seen)
        coupons += 1
        k = rand(1:ncoupons)
        seen[k] = true
    end
    coupons
end

This code can be run with coupon_collecting_simulation(365, 10_000).

Here are the results:

Monte Carlo frequency of stopping counts for coupon collecting.

The stopping counts range from 1,349 to 5,832. The average is 2364.84 with a standard deviation of 466.68. So on average we need 6.5$\times$ as many people as the number of birthdays to see all of them.

2.2 Theory

To calculate this number theoretically, it helps to answer the following easier questions.

How many people on average do we need to ask to collect a new birthday,

At the start?
After collecting 50 unique birthdays?
After collecting 265 unique birthdays?
At the end, after collecting 364 unique birthdays?

Answers:

One. The first person will give us our first birthday.
There are 315 remaining birthdays and so $\frac{315}{365}=\frac{1}{1.16}=86.3\%$ of birthdays will be new. This means 1 in every 1.16 people will give us a new birthday, so we need to ask 1.16 people on average to get a new birthday.
There are 100 remaining birthdays and so $\frac{100}{365}=\frac{1}{3.65}=27.4\%$ of birthdays will be new. This means 1 in every 3.65 people will give us a new birthday, so we need to ask 3.65 people on average to get a new birthday.
At the end only $\frac{1}{365}$ of birthdays will be new. This means 1 in 365 people will give us a new birthday, so we need to ask a full 365 people to get a new birthday.
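The four answers follow the same pattern, which can be sketched in one line. The function name `expected_asks` is hypothetical.

```julia
# Expected number of people to ask for one new birthday,
# after `collected` unique birthdays have already been seen.
expected_asks(collected, N=365) = N / (N - collected)

asks = [expected_asks(c) for c in (0, 50, 265, 364)]
```

This gives 1, roughly 1.16, 3.65 and 365 respectively.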

From this it follows that the formula for the expected number of people $n$ is the sum of all the different scenarios:

\[\begin{align} n &= \sum_{i=1}^N \frac{1}{p_i} \\ &= \sum_{i=1}^N \frac{1}{(N-i+1)/N} \\ &= N\sum_{k=1}^N \frac{1}{k} \quad ; k=N-i+1 \end{align}\]

Setting $N=365$, we get 2364.64 people, which is extremely close to our simulated value of 2364.84.

The sum $\sum_{k=1}^N \frac{1}{k}$ is the $N$th harmonic number and is approximated by $\ln N + \gamma$, where $\gamma\approx 0.5772156649$ is the Euler-Mascheroni constant. This shows that the sum is unbounded in $N$. That is, the more coupons there are, the more people need to be asked.
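Both the exact sum and the logarithmic approximation are short to compute. This is a small sketch with assumed variable names; the constant comes from Julia's `Base.MathConstants`.

```julia
N = 365
expected = N * sum(1 / k for k in 1:N)          # exact harmonic sum

euler_gamma = Base.MathConstants.eulergamma
approximation = N * (log(N) + euler_gamma)      # ln N + γ approximation

expected, approximation
```

The approximation lands within one person of the exact sum.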

3 Covering Birthdays

Now to solve the first problem. It is:

Given $n$ people, what is the probability that all $N$ birthdays are covered? That is, given $n$ people, what is the probability that there is at least 1 person for each birthday?

Based on the previous answer, we expect the probability to be very low below $n=2364$, and very high above it.

For the theory part you’ll need a good understanding of counting methods and how the binomial coefficient $n \choose k$ (read as “n choose k”) is used in combinatorics. The main calculation is with the Inclusion-Exclusion Principle formula, which I’ll introduce gently.

Because the numbers get very large very quickly, I’ll also work with the simpler case of covering 4 seasons with 5 people: spring 🌱, summer ☀️, autumn 🍂 and winter ❄️.

3.1 Simulation

For this Monte Carlo simulation more work needs to be done per data point. Take a fixed $n$ and then generate $n$ random birthdays a large number of $T$ times. Each time check if all the birthdays are covered or not and add this to a count $c$. (The simplest way to do this is check if the length of the hashed set is 365.) After all the trials estimate the probability as $c/T$. Then repeat this for several different $n$’s.

Here is an implementation in Julia:

using ProgressMeter

function covering_simulation(ndays::Int, ntrials::Int, population_sizes::Vector{Int})
    ratio_covered = zeros(length(population_sizes))
    for (idx, pop_size) in enumerate(population_sizes)
        covered = 0
        progress = Progress(ntrials; desc="Population size: $pop_size ")
        for i in 1:ntrials
            next!(progress)
            is_covered = covering_trial(ndays, pop_size)
            if is_covered
                covered += 1
            end
        end
        ratio_covered[idx] = covered / ntrials
    end
    ratio_covered
end

function covering_trial(ndays::Int, n::Int)
    birthdays = rand(1:ndays, n)
    length(Set(birthdays)) == ndays
end

It can be run with:

population_sizes = [365, 500, 1000, 1500, 2000, 2364, 2500, 3000, 4000, 5000]
ratio_covered = covering_simulation(365, 10_000, population_sizes)

Here are the results:

Monte Carlo simulation of ratio of birthdays covered per $n$

The probability is almost zero below 1500, but rises rapidly afterwards and by 4000 is almost one. At the expected value of $n=2364$, the ratio covered is 0.5739.

3.2 Theory

Counting configurations

One way to estimate the probability is to count all the different configurations of birthdays.

For the season problem, this is straightforward. With 5 people and 4 seasons, exactly one season must be shared by a pair of people: ${5 \choose 2} = 10$ pairs can share a season, then there are 4 seasons that can be assigned to the pair, then 3 remaining seasons to the next person, then 2 to the next person, and finally the last person must take the last remaining season. This is out of $4^5$ possible configurations:

\[\begin{align} P(🌱\cup ☀️\cup 🍂 \cup ❄️ ) &= \frac{ {5 \choose 2} 4!}{4^5} \\ &= 0.234375 \end{align}\]

This is just under 1/4th.
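The season problem is small enough to check by brute force. The sketch below enumerates all $4^5 = 1024$ assignments and counts those that cover every season. The variable names are assumptions for illustration.

```julia
nseasons, npeople = 4, 5

# every possible assignment of a season (1 to 4) to each of the 5 people
configs = Iterators.product(ntuple(_ -> 1:nseasons, npeople)...)

ncovered = count(c -> length(Set(c)) == nseasons, configs)
probability = ncovered / nseasons^npeople
```

The count is 240, matching ${5 \choose 2} 4!$.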

For the birthday problem, this is much more difficult. There are many, many different configurations which all need to be summed together. For example, one such configuration between $n=2364$ people is 180 birthdays each shared 6 times (1080 people), another 180 birthdays each shared 7 times (1260 people), 4 shared 5 times (20 people), and 1 shared 4 times (4 people). This is out of $365^n$ configurations:

\[\begin{align} P\left(X \right) &= \left[{2364 \choose 1080}{1284 \choose 1260}{24 \choose 20}{4 \choose 4} \right] \cdot \\ &\phantom{=} \quad \left[ {365 \choose 180 } {185 \choose 180 } {5 \choose 4 } {1 \choose 1 }\right] / 365^{2364} \\ &= \frac{2364!}{1080! 1260! 20! 4!} \frac{365!}{ (180!)^2 4! 1!} / 365^{2364} \\ &= 8.4\times 10^{-5179} \end{align}\]

This probability is absolutely tiny. Worse, there are an extremely large number of configurations like this, all with extremely small probabilities. Adding them up is complex and might have numerical issues.

Thankfully, there is a simpler way.

Counting missing birthdays

All probabilities sum to 1. From this, the probability that at least one person has each birthday is 1 minus the scenarios where birthdays are missing.

As a start, assume mutual exclusivity between missing birthdays. That is, there is no overlap between missing a birthday. This is clearly false: a group of people can have multiple missing birthdays. However, it makes the calculations simple.

Counting trees for each $S \setminus x$ (S exclude x) season.

For the season problem, there are 4 possible ways we can exclude 1 of 4 seasons, and then there are $3^5$ possibilities for all of the five people. The probability is thus:

\[\begin{align} P(🌱\cup ☀️\cup 🍂 \cup ❄️ )&= 1 - P(\bar{🌱}\cup \bar{☀️}\cup \bar{🍂} \cup \bar{❄️} )\\ &= 1 - \frac{4 \cdot 3^5}{4^5} \\ &= 0.0508 \end{align}\]

This is much smaller than the original value of 0.234. The mutual exclusivity assumption clearly does not hold here. (This will be corrected shortly.)

For the birthdays, there are 365 possible ways we can exclude 1 of 365 birthdays, and then there are $364^n$ possibilities for the birthdays for $n$ people:

\[\begin{align} P\left(\bigcup\limits_{i=1}^{365} B_i \right) &= 1 - P\left(\bigcup\limits_{i=1}^{365} \bar{B}_i \right) \\ &= 1 - \frac{365 \cdot 364^{2364} }{365^{2364} } \\ &= 0.4432 \end{align}\]

This is much closer to the target value (77% of the simulated value). This is because with 2364 people it is somewhat likely that only 1 of the 365 birthdays is missing.

Overlap between counting trees.

To correct these values, we need to account for overlaps in the counting trees. For the season problem, we can exclude both winter ❄️ and autumn 🍂 by only choosing spring 🌱 or summer ☀️ in either the $S\setminus ❄️$ tree or the $S\setminus 🍂$ tree. Since in both we have a choice of 2 seasons for each of the 5 people, there are $2^5=32$ overlapping branches. In total there are ${4 \choose 2} = 6$ sets of overlapping branches:

Between $S\setminus ❄️$ and $S\setminus 🌱$.
Between $S\setminus ❄️$ and $S\setminus ☀️$.
Between $S\setminus ❄️$ and $S\setminus 🍂$.
Between $S\setminus 🌱$ and $S\setminus ☀️$.
Between $S\setminus 🌱$ and $S\setminus 🍂$.
Between $S\setminus ☀️$ and $S\setminus 🍂$.

Each branch has been counted twice, so we need to minus one version to correct it:

\[\begin{align} P(🌱\cup ☀️\cup 🍂 \cup ❄️ )&= 1 - P(\bar{🌱}\cup \bar{☀️}\cup \bar{🍂} \cup \bar{❄️} )\\ &= 1 - \left[ \frac{4 \cdot 3^5}{4^5} - \frac{ {4 \choose 2} \cdot 2^5}{4^5}\right] \\ &= 0.23828125 \end{align}\]

Much closer to our original answer of 0.234375!

Similarly, for the birthdays:

\[\begin{align} P\left(\bigcup\limits_{i=1}^{365} B_i \right) &= 1 - P\left(\bigcup\limits_{i=1}^{365} \bar{B}_i\right) \\ &= 1 - \left[ \frac{365 \cdot 364^{2364} }{365^{2364} } - \frac{ {365 \choose 2} \cdot 363^{2364} }{365^{2364} }\right]\\ &= 0.5955 \end{align}\]

This is slightly over the simulated value of 0.5739.

For the next correction, it is helpful to draw a Venn diagram:

For the seasons, we initially double count the overlaps between 2 circles, but then correct this by subtracting each one once. But this means that the middle, which is initially counted 3 times, is subtracted 3 times. So we need to add it back once. There are $ {4 \choose 3 } = 4 $ overlaps we need to add back:

Between $S\setminus ❄️$, $S\setminus 🌱$ and $S\setminus ☀️$.
Between $S\setminus ❄️$, $S\setminus 🌱$ and $S\setminus 🍂$.
Between $S\setminus ❄️$, $S\setminus ☀️$ and $S\setminus 🍂$.
Between $S\setminus ☀️$, $S\setminus 🌱$ and $S\setminus 🍂$.

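Each triple overlap leaves only one season to choose from, so it has $1^5$ branches. Adding these back gives:

\[\begin{align} P(🌱\cup ☀️\cup 🍂 \cup ❄️ )&= 1 - P(\bar{🌱}\cup \bar{☀️}\cup \bar{🍂} \cup \bar{❄️} )\\ &= 1 - \left[ \frac{4 \cdot 3^5}{4^5} - \frac{ {4 \choose 2} \cdot 2^5}{4^5} + \frac{ {4 \choose 3} \cdot 1^5}{4^5}\right] \\ &= 0.234375 \end{align}\]

This is the exact same value as with counting the configurations.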

For the birthdays:

\[\begin{align} P\left(\bigcup\limits_{i=1}^{365} B_i \right) &= 1 - P\left(\bigcup\limits_{i=1}^{365} \bar{B}_i \right) \\ &= 1 - \left[ \frac{365 \cdot 364^{2364} }{365^{2364} } - \frac{ {365 \choose 2} \cdot 363^{2364} }{365^{2364}} \right. \\ &\phantom{=} \left. + \frac{ {365 \choose 3} \cdot 362^{2364} }{365^{2364}} \right] \\ &= 0.5681 \end{align}\]

This is now slightly under the simulated value of 0.5739.

For the seasons, we are done. For the birthdays, we can continue this pattern of over-correcting/under-correcting under the Inclusion-Exclusion Principle:

Inclusion-Exclusion Principle

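$$ \begin{align} P\left( \bigcup\limits_{i=1}^{n} A_i \right) &= \sum_{i=1}^{n} |A_i| - \sum_{1\leq i<j\leq n} |A_i \cap A_j| + \sum_{1\leq i<j<k\leq n} |A_i \cap A_j \cap A_k| - \cdots + (-1)^{n+1} |A_1 \cap \cdots \cap A_n| \\ &= \sum_{k=1}^{n} (-1)^{k+1} \left( \sum_{1\leq i_1 < \cdots < i_k \leq n} |A_{i_1} \cap \cdots \cap A_{i_k}| \right) \end{align} $$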

For the birthday problem, each $A_i$ is the exclusion of one birthday (e.g. $A_5$ is January 5th missing), and groups of intersections $\sum \vert A_{i_1} \cap … \cap A_{i_k} \vert$ are calculated as the number of different combinations $365 \choose k $ of shared missing birthdays multiplied by the probability $\left(\frac{365-k}{365}\right)^n$.

The formula is then:

\[\begin{align} P\left(\bigcup\limits_{i=1}^{365} B_i\right) &= 1 - P\left(\bigcup\limits_{i=1}^{365} \bar{B}_i\right) \\ &= 1 - \frac{1}{365^n}\sum_{k=1}^{365} (-1)^{(k+1)} { 365 \choose k} (365 - k)^n \end{align}\]

For $n=2364$, we get an answer of 0.5712. The simulated value of 0.5739 was close.
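The formula can be evaluated directly with arbitrary precision integers. This is a minimal sketch, assuming BigInt arithmetic is acceptable for speed; `prob_all_covered` is a hypothetical name. Exact integers are used because the terms are enormous and alternate in sign, which would cause heavy cancellation in floating point.

```julia
# Probability that n people cover all N birthdays, via inclusion-exclusion.
function prob_all_covered(n::Integer, N::Integer=365)
    # number of configurations with at least one missing birthday
    missing_count = sum((-1)^(k + 1) * binomial(big(N), k) * big(N - k)^n for k in 1:N)
    Float64(1 - missing_count / big(N)^n)
end

p_seasons = prob_all_covered(5, 4)        # 0.234375
p_birthdays = prob_all_covered(2364, 365)
```

The season result matches the value from counting configurations, and the birthday result should match the theoretical value quoted above.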

We can now construct a theoretical graph and compare it to the graph from the simulation:

The graphs match very well.

4 Conclusion

The answer to the question, what is the probability that all birthdays ($N=365$) are present in a group of $n$ people is:

Very low for fewer than 1000 people ($<3N$).
About 50% for 2300 people ($\approx 6.3N$).
Very high for 3000 people ($\approx 8N$) and almost certain for 4000 and above ($>10N$).

More generally, the Inclusion-Exclusion Principle can be used to calculate exact probabilities for this and similar problems.

This was an interesting problem, but I’m not sure if there is a practical use to it.

Generative transformer from first principles in Julia

2024-03-23T00:00:00+00:00

A transformer for generating text in Julia, trained on Shakespeare’s plays. This model can be used as a Generative Pre-trained Transformer (GPT) with further work. This post was inspired by Andrej Karpathy’s Zero to Hero course.

Update 2 February 2025: update to Flux 0.16.

See also a previous post: Transformers from first principles in Julia. And a later post: DeepSeek’s Multi-Head Latent Attention.

All code available at github.com/LiorSinai/TransformersLite.jl.

1 Introduction

The transformer architecture was introduced by Google AI in their famous Attention is all you need (2017) paper. Transformers have dominated the natural language processing (NLP) landscape since then. Nearly all of the state of the art NLP models today are transformer models. Most of them have an incredibly similar architecture to the original and differ only in training regimes, datasets and sizes.

Transformers have continued to grow in size.

In 2018 OpenAI released a paper titled Improving Language Understanding by Generative Pre-Training. This led to the development of their first Generative Pre-trained Transformer (GPT) model. As of 2024 they have released four versions of GPT, with the latest requiring over 1.8 trillion parameters. The interactive version of the model, ChatGPT, has gained widespread fame for its human like responses.

GPT Transformer architecture (left) and fine tuning tasks (right). Source: GPT1 paper (2018)

The goal of this post is to create a generative transformer following OpenAI’s methodology for their first GPT-1 paper. It will be a vanilla transformer without many of the additions that have been proposed in this fast paced field. The model will be trained on Shakespeare plays and will be able to generate text that looks and sounds like Shakespeare. This model can then be used as the pre-trained foundation for further supervised tasks.

Outcome

The goal is to create a model which implements the architecture in the GPT paper:

TransformerGenerator(
  Embedding(72 => 32),                  # 2_304 parameters
  Embedding(64 => 32),                  # 2_048 parameters
  Dropout(0.1),
  TransformerBlock(
    MultiHeadAttention(
      nhead=4,
      denseQ = Dense(32 => 32; bias=false),  # 1_024 parameters
      denseK = Dense(32 => 32; bias=false),  # 1_024 parameters
      denseV = Dense(32 => 32; bias=false),  # 1_024 parameters
      denseO = Dense(32 => 32),         # 1_056 parameters
    ),
    LayerNorm(32),                      # 64 parameters
    Dense(32 => 128, relu),             # 4_224 parameters
    Dense(128 => 32),                   # 4_128 parameters
    LayerNorm(32),                      # 64 parameters
    Dropout(0.1),
  ),
  ..., # 2x more TransformerBlocks
  Dense(32 => 72),                      # 2_376 parameters
  mask = 64×64 Matrix{Bool},
)        # Total: 43 trainable arrays, 44_552 parameters,
          # plus 1 non-trainable, 4_096 parameters, summarysize 180.641 KiB.

It will map tokens to indices and will operate on those:

mask = make_causal_mask(ones(8, 8))
indices = indexer(collect("LYSANDER")) # [23, 36, 30, 12, 25, 15, 16, 29]
model(indices; mask=mask)

It will return a $ V \times n $ matrix, where $V$ is the vocabulary size and $n$ is the length of the input vector (8 in this example). Each column represents logits for each token. These will then be normalised to values between 0 and 1 using the softmax function. The model will be trained so that each value represents the probability of the next most likely token based on all the tokens before, up to a fixed context length $n$. As a whole the matrix represents the probabilities associated with shifting the input one value to the right.
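The normalisation step can be sketched as a column-wise softmax. This is a minimal stand-alone version for illustration only: `softmax_columns` is a hypothetical name and the random logits stand in for real model output. In practice the `softmax` function that Flux provides would be used.

```julia
function softmax_columns(logits::AbstractMatrix)
    shifted = logits .- maximum(logits; dims=1)  # subtract column max for numerical stability
    exps = exp.(shifted)
    exps ./ sum(exps; dims=1)                    # normalise each column
end

logits = randn(72, 8)               # V × n placeholder logits
probs = softmax_columns(logits)
column_sums = sum(probs; dims=1)    # each column sums to 1, up to rounding
```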

As an example, during training the input will be “LYSANDER” and the reference “YSANDER\n”. The model will output a probability matrix and after sampling the result will be something like “YSANDR\nH”. This is then compared to the reference to improve the output.
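The input and reference pair is just the same text offset by one character, which can be sketched as follows. Indexing a string by position like this is only safe because the text is ASCII.

```julia
text = "LYSANDER\n"
input = text[1:end-1]       # "LYSANDER"
reference = text[2:end]     # "YSANDER\n"

# each input character is paired with the character the model should predict next
pairs = collect(zip(input, reference))
```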

The model computes all the probabilities for all $n$ characters in parallel through the same set of matrix operations, which makes this very efficient during training. We will effectively compare $n$ different predictions for one sample. However at inference time we are only interested in the last ($n$th) character, because we already have the first $n$ characters. Therefore we will discard the first $n-1$ predictions. (They would have already been used internally in the model.)

This is an inherent inefficiency in the transformer model architecture. (KV Caching can be used to overcome it. See João Lages’ visual explanation or a later blog post.)

Generation will repeat inference many times, each time adding the last generated token to the context and generating a new token. The result is something like:

CLATIO.
No, Goe, him buchieds is, hand I was,
To queer thee that of till moxselat by twish are.

BENET.
Are warrain Astier, the Cowlles,
bourse and nope, Merfore myen our to of them coun-mothared man,
Here is
Mafter my thath and herop, and in in have low’t so, veriege a the can eeset thy
inscestle marriom.

ADY.
Thus him stome
To so an streeward. Here cas, which id renuderser what thou bee of as the hightseleh-to.

CHAESS.
With he mand, th’ fouthos. I purcot Lay,
You.

GATHENT.
Who, to hath fres

This was generated by a tiny 42,400 parameter model with a perplexity of 6.3, down from a random sampling perplexity of 71 for 71 characters.
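Perplexity is the exponential of the average negative log-probability assigned to the correct tokens. The sketch below shows why random sampling over 71 characters gives a perplexity of 71. The function name and the sample probabilities are assumptions for illustration.

```julia
# probs: the probability the model assigned to each correct next token
perplexity(probs) = exp(-sum(log, probs) / length(probs))

uniform = fill(1 / 71, 1000)    # a model that guesses uniformly over 71 characters
perplexity(uniform)             # approximately 71
```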

Background

In May 2022 I wrote a blog post on transformers from first principles in Julia. It developed a transformer for a classification task, namely predicting stars for Amazon Reviews. That post was lacking however in that it did not create a decoder transformer. This post is dedicated to that task. I’ve written this as a stand-alone from the original even though much of the code is the same. I refer back to the original post for some explanations. Please see the Design Considerations section which is not repeated here.

This post was inspired by Andrej Karpathy’s Zero to Hero course. I highly recommend it. It covers many ideas like backpropagation, normalisation and embeddings that are assumed knowledge in this post. In particular, this post emulates lesson 7 except the language and framework used are Julia and Flux.jl, not Python and PyTorch. The source code can be accessed at Karpathy’s famed nanoGPT repository.

My own repositories with the code in this blog post can be accessed at TransformersLite.jl and TransformersLite-examples. I will not detail any “pretty” printing function here - please see the repository for those.

This is not meant to be a full scale Julia solution. For that, please see the Transformers.jl package. It has better optimizations, APIs for HuggingFace and more.

2 Data

2.1 Download

The Complete Works of William Shakespeare by William Shakespeare has no copyright attached and can be downloaded legally from Project Gutenberg.

Here is a line to download it with cURL:

curl https://www.gutenberg.org/cache/epub/100/pg100.txt > project_gutenberg_shakespeare.txt

2.2 Preparation

A typical passage from the text looks like:

LYSANDER.
How now, my love? Why is your cheek so pale?
How chance the roses there do fade so fast?

HERMIA.
Belike for want of rain, which I could well
Beteem them from the tempest of my eyes.

LYSANDER.
Ay me! For aught that I could ever read,
Could ever hear by tale or history,
The course of true love never did run smooth.
But either it was different in blood—

This is what we want the transformer to learn and the vast majority of the text follows this format. However some pieces do not. These include the Project Gutenberg introduction and conclusion, the table of contents, the sonnets, the preambles - these list the acts and scenes in each play - and so on. Those should all be removed.

Optionally, the small number of non-ASCII characters (œ, Æ, æ, …) can be removed. I also removed the “&” symbol and changed the archaic usage of “&c.” to “etc.”.

I’ve made a script which does all this work, prepare_shakespeare.jl. It reduces the file size from 5.4 MB to 4.8 MB.

2.3 Exploration

We can load the text in Julia with:

text = open(filepath) do file
    read(file, String)
end

Some basic statistics:

count('\n', text)   # 182,027 lines
count("\n\n", text) # 38,409 passages
count(r"\w+", text) # 921,816 words
length(text)        # 4,963,197 characters

The prepared dataset contains 182,027 lines spanning over approximately 38,409 passages, 921,816 words and 4,963,197 characters.

Most passages are very short - less than 100 characters. The longest is Richard’s monologue in “The Third Part of King Henry the Sixth” which consists of 3047 characters.

Lines have an average of 26.27 characters with the longest being 77 characters in length.

Frequencies of characters in the Complete Works of Shakespeare

After the data preparation there are 71 unique characters in the text: \n !(),-.:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz—‘’“”

There are approximately 30,040 unique words in the dataset. Of these, approximately 80% appear less than 10 times and 96.5% less than 100 times. The most frequent word is “the” with 23,467 occurrences.
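Word statistics like these can be reproduced with a few lines of plain Julia. The following is a minimal sketch, not necessarily how the figures above were obtained: the helper name word_frequencies and the decision to lowercase words are assumptions made for illustration.

```julia
# Hypothetical helper: count how often each word occurs.
# Lowercasing is an assumption; the counts quoted above may treat case differently.
function word_frequencies(text::AbstractString)
    counts = Dict{String, Int}()
    for m in eachmatch(r"\w+", text)
        word = lowercase(m.match)
        counts[word] = get(counts, word, 0) + 1
    end
    counts
end

sample = "The quality of mercy is not strained. The rain droppeth."
counts = word_frequencies(sample)
ranked = sort(collect(counts); by=last, rev=true)  # most frequent first
top_word, top_count = ranked[1]                    # ("the", 2)
rare_fraction = count(<(2), values(counts)) / length(counts) # share of words seen once
```

Running the same function on the full text and changing the threshold in rare_fraction gives the share of words that appear less than 10 or less than 100 times.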

3 Model

3.1 Project Setup

To start, make a package in the Julia REPL:


julia> cd("path\\to\\project")

julia> ] # enter package mode

(@v1.x) pkg> generate TransformersLite # make a directory structure

(@v1.x) pkg> dev "path\\to\\project\\TransformersLite"

The purpose of making a package is that we can now use the super helpful Revise package, which will dynamically update most changes during development without errors:

julia> using Revise
julia> using TransformersLite

The following packages need to be loaded/added for this tutorial:

julia> using Flux, LinearAlgebra, NNlib, ProgressMeter, Random, StatsBase

3.2 Tokenization

The model will predict probabilities for each token in a given vocabulary. There is a choice as to what constitutes a token. One extreme is one token for each word in the dataset. Here there are far too many unique words so it will explode the parameter count while providing too few training samples per token. The other extreme is character level tokens. This compresses the learning space too much to get fully realistic outputs, but otherwise it works surprisingly well. In between is sub-word tokenization such as Byte Pair Encoding. This allows configurable vocabulary lengths. See my TokenizersLite.jl package, the BytePairEncoding.jl package or Karpathy’s latest video.

Here we will follow Karpathy’s approach and use character level tokens. The model will learn to predict each word character by character.

First get all the characters:

characters = sort(collect(Set(text)))

Karpathy uses two dictionaries to convert between characters and indices: char_to_int and int_to_char. I’m going to wrap these in a slightly more complex IndexTokenizer struct introduced in my first post. It holds a vector of the vocabulary (equivalent to int_to_char) and a lookup for reversing this (equivalent to char_to_int). Additionally, it has an unknown symbol if any of the characters are not in the vocabulary.

The constructor is as follows:

struct IndexTokenizer{T}
    vocabulary::Vector{T}
    lookup::Dict{T, Int}
    unksym::T
    unkidx::Int
    function IndexTokenizer(vocab::Vector{T}, unksym::T) where T
        if !(unksym ∈ vocab)
            pushfirst!(vocab, unksym)
            unkidx = 1
        else
            unkidx = findfirst(isequal(unksym), vocab)
        end
        lookup = Dict(x => idx for (idx, x) in enumerate(vocab))
        new{T}(vocab, lookup, unksym, unkidx)
    end
end

Base.length(tokenizer::IndexTokenizer) = length(tokenizer.vocabulary)

function Base.show(io::IO, tokenizer::IndexTokenizer) 
    T = eltype(tokenizer.vocabulary)
    print(io, "IndexTokenizer{$(T)}(length(vocabulary)=$(length(tokenizer)), unksym=$(tokenizer.unksym))")
end

For encoding we look up the character in the dictionary, returning the index of the unknown symbol by default:

function encode(tokenizer::IndexTokenizer{T}, x::T) where T
    get(tokenizer.lookup, x, tokenizer.unkidx)
end

function encode(tokenizer::IndexTokenizer{T}, seq::AbstractVector{T}) where T
    map(x->encode(tokenizer, x), seq)
end

We can add a method to do multiple dispatch on the type IndexTokenizer itself which turns the struct into a function:

(tokenizer::IndexTokenizer)(x) = encode(tokenizer, x)

Encoding example:

push!(characters, 'Ø') # unknown symbol
vocab_size = length(characters) # 72
indexer = IndexTokenizer(characters, 'Ø')
tokens = indexer(collect("How now, my love?")) # [19, 55, 63, 2, 54, ..., 62, 45, 11]

Decoding goes the other way:

# Julia arrays are 1-indexed, so valid indices start at 1
decode(tokenizer::IndexTokenizer{T}, x::Int) where T = 
    1 <= x <= length(tokenizer) ? tokenizer.vocabulary[x] : tokenizer.unksym

function decode(tokenizer::IndexTokenizer{T}, seq::Vector{Int}) where T
    map(x->decode(tokenizer, x), seq)
end

An example:

join(decode(indexer, [23, 36, 30, 12, 25, 15, 16, 29])) # LYSANDER
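For comparison, the two-dictionary approach mentioned at the start of this section can be sketched in a few lines. This is only an illustration of the idea: the dictionary names follow the text, while the helper names chars_to_ints and ints_to_chars are made up here and are not part of Karpathy's code or TransformersLite.jl.

```julia
# Sketch of the two-dictionary approach on a tiny sample text.
sample = "How now, my love?"
sample_chars = sort(collect(Set(sample)))
int_to_char = Dict(i => c for (i, c) in enumerate(sample_chars))
char_to_int = Dict(c => i for (i, c) in enumerate(sample_chars))

# Hypothetical helpers, named differently from encode/decode above to avoid clashes.
chars_to_ints(s) = [char_to_int[c] for c in s]     # throws a KeyError for unseen characters
ints_to_chars(v) = join(int_to_char[i] for i in v)

tokens = chars_to_ints("my love")
roundtrip = ints_to_chars(tokens)                  # "my love"
```

Unlike the IndexTokenizer, this version has no fallback for characters outside the vocabulary.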

3.3 Embeddings

Each token is transformed into a vector of floating point numbers. This vector represents some sort of meaning in a large, abstract vector space, where vectors that are closer to each other are more similar. (There is plenty of literature on this subject.)

Flux.jl comes with an embedding layer which can be used directly:

embedding = Flux.Embedding(72 => 32)
x = rand(1:72, 10) # [40, 49, 55, 65, 27, 50, 35, 69, 40, 29]
embedding(x) # 32×10 Matrix{Float32}

Here is the source code:

struct Embedding{W<:AbstractMatrix}
  weight::W
end

Flux.@layer Embedding

Embedding((in, out)::Pair{<:Integer, <:Integer}; init = randn32) = Embedding(init(out, in))

(m::Embedding)(x::Integer) = m.weight[:, x]
(m::Embedding)(x::AbstractVector) = NNlib.gather(m.weight, x)
(m::Embedding)(x::AbstractArray) = reshape(m(vec(x)), :, size(x)...)

function Base.show(io::IO, m::Embedding)
  print(io, "Embedding(", size(m.weight, 2), " => ", size(m.weight, 1), ")")
end

This struct stores a weight, by default the smaller datatype of Float32 rather than the usual Julia default of Float64. This halves the storage, usually without a meaningful loss of accuracy. (Float16, Float8 and as low as Float4 are all used in machine learning models.)

On the forward pass each index is used to retrieve the associated column vector from the matrix. However instead of using m.weight[:, x] the function uses NNlib.gather(m.weight, x). This is because gather comes with an rrule defined for it (source):

∇gather_src(Δ, src_size, idx) = scatter!(+, fill!(similar(Δ, eltype(Δ), src_size), 0), Δ, idx)

The rrule is a reverse (backwards) rule that encodes the derivative for backpropagation. It is what makes the magic of automatic differentiation work.

The function gather does not have a formal derivative, but scatter is the opposite of it and is what we need to apply when we backpropagate the loss:

At the end of backpropagation we need to distribute the error matrix amongst the original word embeddings. This is what scatter does. Note that we use the red column twice, so we have two error columns directed towards it. The rrule applies + as the reducing function; that is, the two errors are added together and then to the word embedding.
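The accumulation can be written out by hand with plain indexing. This is a minimal sketch of the idea only; it uses a loop in place of NNlib's gather and scatter, and the small matrices are made up for illustration.

```julia
# Forward pass: gather columns 2, 4 and 2 again from a small weight matrix.
W = reshape(collect(1.0:12.0), 3, 4)   # 3×4 stand-in for the embedding weights
idx = [2, 4, 2]
gathered = W[:, idx]                   # 3×3, column 2 of W appears twice

# Backward pass: scatter an upstream gradient Δ back onto the columns of W,
# adding the contributions of repeated indices.
Δ = ones(3, 3)
dW = zeros(size(W))
for (j, i) in enumerate(idx)
    dW[:, i] .+= Δ[:, j]
end
dW[:, 2]   # [2.0, 2.0, 2.0]: two error columns were added together
dW[:, 4]   # [1.0, 1.0, 1.0]
dW[:, 1]   # [0.0, 0.0, 0.0]: unused columns receive no gradient
```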

Scatter can be inefficient. If we do a small experiment and call scatter we will see it results in a large matrix of mostly zeros:

NNlib.scatter(+, rand(8, 4), [1, 5, 11, 1]; dstsize=(8, 15))
8×15 Matrix{Float64}:
 1.62703   0.0  0.0  0.0  0.495725  0.0  0.0  0.0  0.0  0.0  0.237452     0.0  0.0  0.0  0.0
 0.979735  0.0  0.0  0.0  0.984499  0.0  0.0  0.0  0.0  0.0  0.145738     0.0  0.0  0.0  0.0
 0.892948  0.0  0.0  0.0  0.76959   0.0  0.0  0.0  0.0  0.0  0.714658     0.0  0.0  0.0  0.0
 1.45113   0.0  0.0  0.0  0.883492  0.0  0.0  0.0  0.0  0.0  0.52775      0.0  0.0  0.0  0.0
 0.702824  0.0  0.0  0.0  0.965256  0.0  0.0  0.0  0.0  0.0  0.0966964    0.0  0.0  0.0  0.0
 1.16978   0.0  0.0  0.0  0.568429  0.0  0.0  0.0  0.0  0.0  0.000161501  0.0  0.0  0.0  0.0
 1.80566   0.0  0.0  0.0  0.271676  0.0  0.0  0.0  0.0  0.0  0.430018     0.0  0.0  0.0  0.0
 1.16445   0.0  0.0  0.0  0.911601  0.0  0.0  0.0  0.0  0.0  0.786343     0.0  0.0  0.0  0.0

3.4 Position encoding

The matrix operations used in the transformer are parallel operations. This speeds up computation and is a major reason why they are so popular. However this creates an issue: they do not take order into account. If we shuffle the columns in the embedding matrix, the output columns are shuffled in the same way but are otherwise unchanged.
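A small experiment illustrates this order-blindness. It is a sketch under assumptions: toy_attention is a hypothetical, unmasked, single-head version of the attention step described in section 3.5, and softmax_cols is a hand-written column softmax so that the snippet needs no packages beyond the standard library.

```julia
using Random

# Column-wise softmax written out by hand to keep the sketch dependency-free.
function softmax_cols(z)
    e = exp.(z .- maximum(z; dims=1))
    e ./ sum(e; dims=1)
end

# Hypothetical unmasked single-head attention on a d×n input.
function toy_attention(X, WK, WQ, WV)
    K, Q, V = WK * X, WQ * X, WV * X
    V * softmax_cols((K' * Q) ./ sqrt(size(K, 1)))
end

rng = MersenneTwister(1)
X = randn(rng, 8, 5)                       # 5 tokens with 8-dimensional embeddings
WK, WQ, WV = randn(rng, 4, 8), randn(rng, 4, 8), randn(rng, 4, 8)

perm = [3, 1, 5, 2, 4]                     # shuffle the token order
A = toy_attention(X, WK, WQ, WV)
A_shuffled = toy_attention(X[:, perm], WK, WQ, WV)
A_shuffled ≈ A[:, perm]                    # true: the output is only reordered
```

Each token gets exactly the same output vector regardless of where it sits in the sequence, which is why position information has to be added explicitly.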

Cosine similarities of different position encodings. The learned embedding is from a model made using the code in this blog post.

To counteract this, the authors of the Attention is all you need (2017) paper suggested adding a second embedding to the first where the indices are the positions in the sequence.¹

We can use an Embedding matrix as before, except with a different input:

position_encoding = Embedding(16 => 32)
x = rand(32, 10) # the output of the first embedding layer
indices = 1:size(x, 2) # 1:10
position_encoding(indices) # 32×10 Matrix{Float32}

Other position encodings

Transformers are an active area of research and many position encodings have been proposed.

Sinusoidal Position Encodings: The original paper gave equations to calculate a fixed embedding matrix. For an explanation and implementation see my first post.
Relative Position Embeddings (RPE) (2018): add embeddings in the attention step where each entry depends on the relative position $r=j_k-i_q$ of a key and a query. The Music Transformer (2018) paper greatly improved computation of this matrix. For a helpful video see Relative Self-Attention Explained.
Rotary Position Embeddings (RoPE) (2023): encode absolute position with a fixed rotation matrix, which handles longer sequences better and encodes relative positions better than sinusoidal embeddings. A downside is every key and query in the attention step needs to be multiplied by this rotation matrix instead of adding an encoding once at the start.
Position Interpolation (2023): extending embedding matrices by linearly down-scaling the input position indices to match the original context window size. For example, each index in a 128 context window can be down-scaled by a factor of 2 to match a 64 length encoding matrix. Half indices like 42.5 are a linear combination of the indices before and after (so 42 and 43). Some fine tuning is required for best results. This can be combined with RoPE.
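The linear combination used for fractional indices in Position Interpolation can be sketched as follows. This is only a toy illustration on a learned embedding matrix: the helper interpolate_position and the 1-based mapping 1 + (i - 1)/scale are assumptions made here, not code from the paper.

```julia
# Hypothetical helper: look up a fractional position in a d×n embedding matrix E
# by linearly blending the two neighbouring columns.
function interpolate_position(E::AbstractMatrix, pos::Real)
    lo = clamp(floor(Int, pos), 1, size(E, 2))
    hi = min(lo + 1, size(E, 2))
    w = pos - lo                     # weight of the upper neighbour
    (1 - w) .* E[:, lo] .+ w .* E[:, hi]
end

E = Float64[1 2 3 4; 10 20 30 40]    # toy 2×4 matrix, i.e. 4 trained positions
scale = 2                            # stretch 4 positions to cover an 8 token window
positions = [1 + (i - 1) / scale for i in 1:8]   # 1.0, 1.5, 2.0, ..., 4.5
stretched = reduce(hcat, [interpolate_position(E, p) for p in positions]) # 2×8
stretched[:, 2]                      # [1.5, 15.0], halfway between columns 1 and 2
```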

This embedding matrix restricts the context size. In the example the embedding matrix is 32×16, so a maximum of 16 tokens can be passed to the model at a time. To overcome this a sliding window must be implemented and the model will completely “forget” any character outside of the window.

Ideally we would create an embedding matrix as large as possible so that the bottleneck is the training data, not the model. However attention, which will be discussed in the next section, scales with $n^2$ for a context length $n$. This is a significant performance penalty for a larger context size.

3.5 Attention

Source: Attention paper (2017)

Definition
Masking
Batched multiplication
MultiHeadAttention layer
Multi-Head Attention
Scaled Dot Attention
Full example

3.5.1 Definition

Attention is the main mechanism at the heart of the transformer. Theoretically it is a weighting of every token towards every other token, including itself. It is asymmetrical and so forms a full $n \times n$ matrix. For example consider word level tokens for the sentence “The elephant in the room”. The tokens “The”, “in”, “the” and “room” might all rate “elephant” the highest, but “elephant” will probably only rate “room” highly.

Julia uses column major format whereas Python uses row major format. In Julia word vectors are columns while in Python they are rows. Equations in the two formats therefore look backwards relative to each other: both the equations and the definitions need to be transposed. E.g. with subscripts $c$ for column major and $r$ for row major, $K^TQ \rightarrow (K_c^TQ_c)^T=Q_c^TK_c= Q_r K_r^T$

The attention equation is: $A = V\text{softmax}\left(\frac{1}{\sqrt{d_h}}K^T Q\right) \label{eq:attention} \tag{3.6.1}$

where $\text{softmax}$ is given by:

\[\text{softmax}(z, i) = \frac{e^{z_i}}{\sum_r^V e^{z_r}} \label{eq:softmax} \tag{3.6.2}\]

Its calculation scales with $\mathcal{O}(n^2d_h)$ where $n$ is the input token length and $d_h$ is the head dimension, also known as the hidden dimension.

Efficient self-attention

Given the $n^2$ scaling of attention much effort has gone into altering this step. This includes sparse attention layers, factorisation/kernels for linear attention and down-sampling. A detailed survey can be found at Efficient Transformers: A Survey (2020). All these alternatives are faster than attention but come at the expense of accuracy.

Another line of research is to improve the computation. This includes Flash attention (2022), which improves computational efficiency on a single GPU, while Ring attention (2023) aims to distribute the work efficiently across multiple devices.

Here the key $K$, query $Q$ and value $V$ are derived from the input matrix $X$ using weights:

\[\begin{align} K = W_K X \\ Q = W_Q X \\ V = W_V X \end{align}\]

Each weight $W$ has a size $d_h \times d_\text{emb}$ and the input matrix has a size $d_\text{emb} \times n$ where $d_\text{emb}$ is the embedding dimension. Each of these matrices therefore has a size $d_h \times n$.

From this we can show that the first matrix product is a weighted dot product of every vector to every other vector in the input matrix, resulting in a $n \times n$ matrix:

\[K^T Q = (W_KX)^T(W_QX) = X^T W_K^T W_Q X\]

This is then followed by scaling ($1/\sqrt{d_h}$) and normalisation ($\text{softmax}$). Lastly this matrix is used as a weight for $V$. The output is $d_h \times n$.
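The shapes and the expansion of $K^TQ$ can be checked numerically for a single head. This is a minimal sketch with random matrices; softmax_cols is a hand-written stand-in for the softmax function and the dimensions are arbitrary choices for illustration.

```julia
using Random

# Column-wise softmax written out by hand (a stand-in for a library softmax).
function softmax_cols(z)
    e = exp.(z .- maximum(z; dims=1))
    e ./ sum(e; dims=1)
end

rng = MersenneTwister(42)
d_emb, d_h, n = 8, 4, 5
X  = randn(rng, d_emb, n)
WK = randn(rng, d_h, d_emb)
WQ = randn(rng, d_h, d_emb)
WV = randn(rng, d_h, d_emb)

K, Q, V = WK * X, WQ * X, WV * X          # each d_h×n
S = K' * Q                                 # n×n weighted dot products
S ≈ X' * WK' * WQ * X                      # true, matches the expansion above
weights = softmax_cols(S ./ sqrt(d_h))     # each column sums to 1
A = V * weights                            # d_h×n output
```

Because the softmax is taken over the first dimension, each column of weights holds the attention that one query pays to all the keys.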

3.5.2 Masking

There is a flaw in this architecture. The attention is computed across all tokens at once. This means that past tokens will be given access to future tokens. However the training objective is to predict future tokens. Therefore only the $n$th token, whose next token is missing, will be trained fairly.

To overcome this the authors of the Attention (2017) paper suggested masking the matrix before the softmax with $-\infty$ at each illegal connection, so that $\exp(-\infty)=0$ which effectively removes their influence.

The masked matrix will look like:

\[\begin{bmatrix} s_{11} & s_{12} & s_{13} &... & s_{1n} \\ -\infty & s_{22} & s_{23} &... & s_{2n} \\ -\infty & -\infty & s_{33} &... & s_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -\infty & -\infty & -\infty & ... & s_{nn} \end{bmatrix}\]

Firstly a mask is made where all valid connections have a true and all illegal connections have a false. Here is the code from NNlib.jl:

using LinearAlgebra
function make_causal_mask(x::AbstractArray; dims::Int=2)
  len = size(x, dims)
  mask = triu(trues_like(x, (len, len)))
  mask
end

trues_like(x::AbstractArray, sz=size(x)) = fill!(similar(x, Bool, sz), true)

Dataless masks

We don't have to allocate memory to create a mask. The causal mask is defined by $j \geq i$ for all indices $i$, $j$. We can write this as a function as long as we can also write an equivalent rrule for it as well. See NeuralAttentionlib.jl for such an implementation.
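As a rough sketch of the idea, the rule $j \geq i$ can be evaluated on the fly inside a broadcast. The helper apply_causal_rule is hypothetical and is not the NeuralAttentionlib.jl implementation; it also says nothing about gradients, which a real implementation has to handle.

```julia
# Hypothetical helper: apply the causal rule j >= i without storing an n×n mask.
# Broadcasting fuses the comparison and the ifelse into a single loop.
function apply_causal_rule(logits::AbstractMatrix{T}) where T
    rows = axes(logits, 1)
    cols = reshape(axes(logits, 2), 1, :)   # lazy 1×n row of column indices
    ifelse.(rows .<= cols, logits, typemin(T))
end

x = ones(Float32, 3, 3)
y = apply_causal_rule(x)
y[1, 3]   # 1.0f0, column 3 may use row 1
y[3, 1]   # -Inf32, column 1 may not use row 3
```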

The mask will be applied through ifelse, where trues maintain their value but the falses are replaced with some large negative number.

apply_mask(logits, mask::Nothing) = logits

function apply_mask(logits, mask)
    neginf = typemin(eltype(logits))
    ifelse.(mask, logits, neginf)
end

Usage:

mask = make_causal_mask(ones(5, 5))
x = randn(Float32, 5, 5)
apply_mask(x, mask) # 5×5 Matrix{Float32}:

Backpropagation:

using Flux: pullback
y, back = pullback(apply_mask, x, mask);
grads = back(randn(size(y)...))
grads[1] # zero where -inf
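To see what the mask does to the final scores, here is a minimal sketch that applies a column softmax after masking. It assumes equal logits everywhere so that the result is easy to read, and softmax_cols is a hand-written stand-in for the softmax function.

```julia
using LinearAlgebra

# Column-wise softmax written out by hand (a stand-in for a library softmax).
function softmax_cols(z)
    e = exp.(z .- maximum(z; dims=1))
    e ./ sum(e; dims=1)
end

n = 4
mask = triu(trues(n, n))                          # valid where column j >= row i
logits = zeros(Float32, n, n)                     # equal scores everywhere
masked = ifelse.(mask, logits, typemin(Float32))  # -Inf at illegal connections
scores = softmax_cols(masked)
scores[:, 2]   # [0.5, 0.5, 0.0, 0.0]: token 2 attends to tokens 1 and 2 only
```

Every column still sums to 1, but the weight is shared only amongst the current and past tokens.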

As an experiment, set the mask to nothing during training. It should be possible to get very low training losses (below 0.5) corresponding to very low perplexities (less than 2) with very small models but without a corresponding increase in generation quality.

3.5.3 Batched multiplication

The Attention (2017) paper suggested a further enhancement on attention where the input matrix is divided amongst $H$ heads. This results in a $\tfrac{d_\text{emb}}{H} \times n \times H$ array. Furthermore, working with batches adds an extra dimension: $d_h \times n \times H \times B$.

We could work with these arrays as Vector{<:Matrix{T}} and Vector{<:Vector{<:Matrix{T}}} respectively, but it is more efficient to work with them as Array{T, 3} and Array{T, 4} because then we can work with optimised array functions.

My first post goes into more detail about multiplication with higher order arrays.² It compares vanilla versions with optimised versions. Here I will present the optimised version only.

Batch multiplication is defined as:

\[C_{ijk} = \sum_r A_{irk} B_{rjk}\]

An optimised version is available through the NNlib.jl library, a dependency of Flux.jl:

using NNlib
A = rand(6, 8, 4);
B = rand(8, 5, 4);
NNlib.batched_mul(A, B) # 6×5×4 Array{Float64, 3}

The 4D batched multiplication is defined as:

\[C_{ijkl} = \sum_r A_{irkl} B_{rjkl}\]

We can calculate this array with the same batched_mul by reshaping any 4D $m\times n \times p \times q$ arrays into 3D $m\times n \times pq$ arrays, do the multiplication, and reshape back. This is exactly what the implementation does behind the scenes:

using NNlib
A = rand(6, 8, 4, 3);
B = rand(8, 5, 4, 3);
NNlib.batched_mul(A, B) # 6×5×4×3 Array{Float64, 4}

The Flux Dense layer does something similar.
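For reference, the definitions above can be written as a plain loop together with the reshape trick. This is an unoptimised sketch for checking understanding, with the made-up name batched_mul_loop; it is not the NNlib.jl implementation.

```julia
# 3D case: one ordinary matrix product per slice along the third dimension.
function batched_mul_loop(A::AbstractArray{T, 3}, B::AbstractArray{T, 3}) where T
    C = zeros(T, size(A, 1), size(B, 2), size(A, 3))
    for k in axes(A, 3)
        C[:, :, k] = A[:, :, k] * B[:, :, k]
    end
    C
end

# 4D case: merge the last two dimensions, multiply, then split them again.
function batched_mul_loop(A::AbstractArray{T, 4}, B::AbstractArray{T, 4}) where T
    A3 = reshape(A, size(A, 1), size(A, 2), :)
    B3 = reshape(B, size(B, 1), size(B, 2), :)
    C3 = batched_mul_loop(A3, B3)
    reshape(C3, size(C3, 1), size(C3, 2), size(A, 3), size(A, 4))
end

A = rand(6, 8, 4, 3)
B = rand(8, 5, 4, 3)
C = batched_mul_loop(A, B)                     # 6×5×4×3
C[:, :, 2, 3] ≈ A[:, :, 2, 3] * B[:, :, 2, 3]  # true
```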

3.5.4 MultiHeadAttention layer

Flux.jl now comes with a Flux.MultiHeadAttention layer. However for continuity with my first post, I will present my own MultiheadAttention layer except now with masking. It is very similar to the code in Flux.jl and NNlib.jl. The differences lie in design choices for the inputs, and Flux.jl’s implementation is slightly more generic.

First define a struct to hold all the dense layers and a parameter for $H$ called nhead:

struct MultiheadAttention{Q<:Dense, K<:Dense, V<:Dense, O<:Dense}
    nhead::Int
    denseQ::Q
    denseK::K
    denseV::V
    denseO::O
end

#= tell Flux which parameters are trainable =#
Flux.@layer MultiheadAttention trainable=(denseQ, denseK, denseV, denseO)

The model is defined by 4 values: the number of heads $H$, the input dimension $d_\text{in}$, the output dimension $d_\text{out}$ and the head dimension $d_h$. The default for $d_h$ is $d_\text{in}/H$.

function MultiheadAttention(
    nhead::Int, dim_in::Int, dim_head::Int, dim_out::Int
    )
    MultiheadAttention(
        nhead,
        Dense(dim_in, dim_head*nhead; bias=false),
        Dense(dim_in, dim_head*nhead; bias=false),
        Dense(dim_in, dim_head*nhead; bias=false),
        Dense(dim_head*nhead, dim_out),
    )
end

function MultiheadAttention(
    nhead::Int, dim_in::Int, dim_out::Int
    )
    if dim_in % nhead != 0 
        error("input dimension=$dim_in is not divisible by number of heads=$nhead")
    end
    MultiheadAttention(nhead, dim_in, div(dim_in, nhead), dim_out)
end

Now for the forward pass. In general there are three input matrices with the names of key, query and value. Later we will pass the same value x for all of them. From these we can calculate $Q$, $K$ and $V$ and pass them to the multi_head_scaled_dot_attention function:

function (mha::MultiheadAttention)(query::A3, key::A3, value::A3
    ; kwargs...) where {T, A3 <: AbstractArray{T, 3}}
    Q = mha.denseQ(query)
    K = mha.denseK(key)
    V = mha.denseV(value)
    A, scores = multi_head_scaled_dot_attention(mha.nhead, Q, K, V; kwargs...)
    mha.denseO(A), scores
end

This layer returns the scores as well, like Flux.jl’s MultiHeadAttention layer. These are useful for inspecting the model.

3.5.5 Multi-Head Attention

The multi_head_scaled_dot_attention begins as follows:

function multi_head_scaled_dot_attention(nhead::Int, Q::A3, K::A3, V::A3
    ; kwargs...) where {T, A3 <: AbstractArray{T, 3}}
    qs = size(Q)
    ks = size(K)
    vs = size(V)
    dm = size(Q, 1)
    dh = div(dm, nhead)

The $Q$, $K$ and $V$ matrices need to be split from $d_m \times N \times B$ to $d_h \times N \times H \times B$. This is done in two steps:

$(d_h \times H)\times N \times B$ (break $d_m$ into $d_h$ and $H$)
$d_h \times N \times H \times B$ (swap the 2nd and 3rd dimensions)

    Q = permutedims(reshape(Q, dh, nhead, qs[2], qs[3]), [1, 3, 2, 4])
    K = permutedims(reshape(K, dh, nhead, ks[2], ks[3]), [1, 3, 2, 4])
    V = permutedims(reshape(V, dh, nhead, vs[2], vs[3]), [1, 3, 2, 4])

Then we calculate the scaled dot attention for each head, combine results and return it:

    A, scores = scaled_dot_attention(Q, K, V; kwargs...)
    A = permutedims(A, [1, 3, 2, 4])
    A = reshape(A, dm, size(A, 3), size(A, 4))
    A, scores
end
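The effect of the reshape and permutedims steps can be verified on a small array. This sketch uses made-up dimensions and checks that head $h$ is simply a contiguous block of $d_h$ rows of the original array, and that the inverse steps restore it.

```julia
dm, N, B, nhead = 8, 3, 2, 4
dh = div(dm, nhead)
Q = reshape(collect(1.0:(dm * N * B)), dm, N, B)             # dm×N×B

# Split: (dh×H)×N×B, then swap the 2nd and 3rd dimensions.
Qh = permutedims(reshape(Q, dh, nhead, N, B), [1, 3, 2, 4])  # dh×N×H×B

h = 2
rows = ((h - 1) * dh + 1):(h * dh)                           # rows 3:4 for head 2
Qh[:, :, h, :] == Q[rows, :, :]                              # true

# Combine: swap back and merge dh and H into dm again.
Q_restored = reshape(permutedims(Qh, [1, 3, 2, 4]), dm, N, B)
Q_restored == Q                                              # true
```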

3.5.6 Scaled Dot Attention

The scaled dot attention is defined by default for 3D arrays. $Q$ is of size $d_h \times d_q \times H$ while $K$ and $V$ are both of size $d_h \times d_{kv} \times H$. Usually $n=d_q=d_{kv}$.

function scaled_dot_attention(
    query::A3, key::A3, value::A3
    ; mask::Union{Nothing, M}=nothing
    ) where {T, A3 <: AbstractArray{T, 3}, M <: AbstractArray{Bool}}
    dh = size(query, 1)
    keyT = permutedims(key, (2, 1, 3)) # (dkv, dh, nhead)
    atten = one(T)/convert(T, sqrt(dh)) .* batched_mul(keyT, query) # (dkv, dh, nhead)*(dh, dq, nhead) => (dkv, dq, nhead)
    atten = apply_mask(atten, mask) # (dkv, dq, nhead)
    scores = softmax(atten; dims=1) # (dkv, dq, nhead)
    batched_mul(value, scores), scores # (dh, dkv, nhead)*(dkv, dq, nhead) => (dh, dq, nhead)
end

As explained above, we need to reshape 4D arrays into 3D arrays, apply the usual scaled dot attention and then reshape back:

function scaled_dot_attention(query::A4, key::A4, value::A4
    ; kwargs...) where {T, A4 <: AbstractArray{T, 4}}
    batch_size = size(query)[3:end]
    Q, K, V = map(x -> reshape(x, size(x, 1), size(x, 2), :), (query, key, value))
    A, scores = scaled_dot_attention(Q, K, V; kwargs...)
    A = reshape(A, (size(A, 1), size(A, 2), batch_size...))
    scores = reshape(scores, (size(scores, 1), size(scores, 2), batch_size...))
    A, scores
end

3.5.7 Full example

Model:

mha = MultiheadAttention(4, 32, 32)
Flux._big_show(stdout, mha)
#=
MultiheadAttention(
  4,
  Dense(32 => 32; bias=false),          # 1_024 parameters
  Dense(32 => 32; bias=false),          # 1_024 parameters
  Dense(32 => 32; bias=false),          # 1_024 parameters
  Dense(32 => 32),                      # 1_056 parameters
)                   # Total: 5 arrays, 4_128 parameters, 16.422 KiB.
=#

Forward pass:

x = randn(Float32, 32, 20, 2) # d×n×B
mask = make_causal_mask(ones(32, 20))
y, scores = mha(x, x, x; mask=mask) # 32×20×2 Array{Float32, 3}, 20×20×4×2 Array{Float32, 4}

Backpropagation:

using Flux
loss = sum # dummy loss function
grads = Flux.gradient(m -> loss(m(x, x, x; mask=mask)[1]), mha)
keys(grads[1]) # (:nhead, :denseQ, :denseK, :denseV, :denseO)

3.6 Transformer Blocks

The other components we need for the transformer block are Layer Norm, Feed Forward (two consecutive dense layers) and dropout. We can use the Flux.jl implementations for these.

Source: GPT1 paper (2018)

This means we can now create a transformer block:

struct TransformerBlock{
    MHA<:MultiheadAttention,
    N1<:LayerNorm,
    D1<:Dense,
    D2<:Dense,
    N2<:LayerNorm,
    DO<:Dropout}
    multihead_attention::MHA
    norm_attention::N1
    dense1::D1
    dense2::D2
    norm_feedforward::N2
    dropout::DO
end

Flux.@layer TransformerBlock # make whole layer trainable

This whole block can be defined with only 5 parameters:

The number of heads $H$.
The dimension $d$.
The hidden dimension for the feed-forward network. The convention is $4d$.
The activation function.
A drop out probability.

In code:

TransformerBlock(
    nhead::Int,
    dim_model::Int,
    dim_hidden::Int;
    act=relu,
    pdrop::Float64=0.1,
    ) = TransformerBlock(
    MultiheadAttention(nhead, dim_model, dim_model),
    LayerNorm(dim_model),
    Dense(dim_model, dim_hidden, act),
    Dense(dim_hidden, dim_model),
    LayerNorm(dim_model),
    Dropout(pdrop),
)

There are skip connections in the forward pass:³

function (t::TransformerBlock)(x::A; mask::M=nothing) where {
    A<:AbstractArray, M<:Union{Nothing, AbstractArray{Bool}}}
    h, scores = t.multihead_attention(x, x, x; mask=mask) # (dm, N, B)
    h = t.dropout(h) 
    h = x + h
    h = t.norm_attention(h)     # (dm, N, B)
    hff = t.dense1(h)           # (dim_hidden, N, B)
    hff = t.dense2(hff)         # (dm, N, B)
    hff = t.dropout(hff)
    h = h + hff
    h = t.norm_feedforward(h)   # (dm, N, B)
    h
end

Model:

block = TransformerBlock(4, 32, 32*4) 
Flux._big_show(stdout, block)
#=
TransformerBlock(
  ...
)  # Total: 13 arrays, 12_608 parameters, 50.234 KiB.
=#

Forward pass:

x = randn(Float32, 32, 20, 2) # d×n×B
mask = make_causal_mask(ones(32, 20)) # 20×20 Matrix{Bool}
y = block(x; mask=mask) # 32×20×2 Array{Float32, 3}

Backpropagation:

loss = sum # dummy loss function
grads = Flux.gradient(m -> loss(m(x; mask=mask)), block)
keys(grads[1]) # (:multihead_attention, :norm_attention, :dense1, :dense2, :norm_feedforward, :dropout)

3.7 Generator

Modified from GPT1 paper (2018)

We will create a struct to hold the generator.

struct TransformerGenerator{
    E<:Flux.Embedding, 
    PE<:Flux.Embedding, 
    DO<:Dropout, 
    TB<:Vector{<:TransformerBlock}, 
    D<:Dense,
    M<:Union{Nothing, AbstractMatrix{Bool}},
    } 
    embedding::E
    position_encoding::PE
    dropout::DO
    blocks::TB
    head::D
    mask::M # optional buffer
end

Flux.@layer TransformerGenerator trainable=(embedding, position_encoding, blocks, dropout, head)

By default the forward pass will use the model’s mask, else the user can pass a mask to it:

function (t::TransformerGenerator)(x::A; mask::M=t.mask) where {
    A<:AbstractArray, M<:Union{Nothing, AbstractMatrix{Bool}}}
    x = t.embedding(x)              # (dm, N, B)
    N = size(x, 2)
    x = x .+ t.position_encoding(1:N) # (dm, N, B)
    x = t.dropout(x)                # (dm, N, B)
    for block in t.blocks
        x = block(x; mask=mask)     # (dm, N, B)
    end
    x = t.head(x)                   # (vocab_size, N, B)
    x
end

Create a model:

context_size = 64
dim = 32
nheads = 4
vocab_size = 71
mask = make_causal_mask(ones(context_size, context_size))
model = TransformerGenerator(
    Embedding(vocab_size => dim),
    Embedding(context_size => dim),
    Dropout(0.1),
    TransformerBlock[
        TransformerBlock(4, dim, dim * 4; pdrop=0.1),
        TransformerBlock(4, dim, dim * 4; pdrop=0.1),
        TransformerBlock(4, dim, dim * 4; pdrop=0.1),
    ],
    Dense(dim, vocab_size),
    copy(mask)
)
Flux._big_show(stdout, model)
#=
TransformerGenerator(
  ...
)         # Total: 43 trainable arrays, 44_487 parameters,
          # plus 1 non-trainable, 4_096 parameters, summarysize 180.410 KiB.
=#
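The parameter totals can be double checked with some simple arithmetic. The helper functions below are hypothetical and only mirror the layer definitions above: three bias-free dense layers and one dense layer with bias for attention, two dense layers for the feed-forward network, and a scale and a bias vector for each layer norm.

```julia
# Hypothetical helpers that redo the parameter arithmetic by hand.
dense_params(nin, nout; bias=true) = nin * nout + (bias ? nout : 0)
layernorm_params(d) = 2 * d   # scale and bias

function block_params(d, d_hidden)
    attention = 3 * dense_params(d, d; bias=false) + dense_params(d, d)
    feedforward = dense_params(d, d_hidden) + dense_params(d_hidden, d)
    attention + feedforward + 2 * layernorm_params(d)
end

function generator_params(vocab_size, context_size, d, d_hidden, nblocks)
    embeddings = vocab_size * d + context_size * d
    embeddings + nblocks * block_params(d, d_hidden) + dense_params(d, vocab_size)
end

block_params(32, 128)                 # 12_608
generator_params(71, 64, 32, 128, 3)  # 44_487
```

The 4,096 non-trainable parameters are the entries of the 64×64 mask.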

We can test it with a random vector of indices:

x = reshape(rand(1:vocab_size, 34), :, 1) # make it a batch of 1
mask = make_causal_mask(ones(dim, length(x)))
y = model(x; mask=mask) # 71×34×1 Array{Float32, 3}

Or a random batch:

X = rand(1:vocab_size, 34, 10)
Y = model(X; mask=mask) # 71×34×10

3.8 Generation

Let’s now generate text with the model.

The model has a fixed context length. To generate text longer than this fixed length we will implement a sliding window. This window will take the last $n$ tokens (rows) of the current context for each column (sample) in the batch:

function tail(A::AbstractMatrix, n::Int)
    n = min(n, size(A, 1))
    A[(end - n + 1):end, :]
end

The transformer generates a $V\times N \times B$ array. We will only take the logits for the last token per iteration, resulting in a $V\times B$ matrix. These logits will be converted to probabilities via the softmax function $\ref{eq:softmax}$.

We have a choice of how to sample these probabilities. The greedy approach is to always take the token with the maximum probability. A better approach is to randomly sample based on the probabilities. That way a token with a high probability is more likely to be chosen, but it is not guaranteed. This gives us some diversity in the results. We then add this to the context and repeat.
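The difference between the two sampling strategies can be sketched for a single column of logits. The helper sample_index below is a hand-rolled stand-in for weighted sampling, written with a cumulative sum so that the snippet only needs the standard library; the logits are made up.

```julia
using Random

softmax_vec(z) = (e = exp.(z .- maximum(z)); e ./ sum(e))

# Hypothetical helper: draw an index with probability proportional to probs.
function sample_index(rng::AbstractRNG, probs::AbstractVector)
    r = rand(rng)
    c = 0.0
    for (i, p) in enumerate(probs)
        c += p
        r <= c && return i
    end
    length(probs)   # guard against rounding error in the cumulative sum
end

logits = [2.0, 1.0, 0.1]
probs = softmax_vec(logits)
greedy = argmax(probs)               # always 1

rng = MersenneTwister(0)
draws = [sample_index(rng, probs) for _ in 1:10_000]
freq = [count(==(i), draws) / length(draws) for i in 1:3] # close to probs
```

Greedy decoding returns the first token every time, whereas the random draws also pick the second and third tokens in proportion to their probabilities.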

The full function is:

using Random, StatsBase
function generate(
    rng::AbstractRNG, model::TransformerGenerator, context::AbstractMatrix{T}
    ; context_size::Int, max_tokens::Int=100,
    ) where T
    for i in 1:max_tokens
        context_crop = tail(context, context_size)
        n = size(context_crop, 1)
        mask = isnothing(model.mask) ? nothing : view(model.mask, 1:n, 1:n)
        logits = model(context_crop; mask=mask) |> cpu # (vocab_size, n, B)
        logits = logits[:, end, :] # (vocab_size, B) 
        context_next = multinomial_sampling(rng, logits)
        context = cat(context, context_next; dims=1) 
    end
    context
end

function generate(model::TransformerGenerator, context::AbstractMatrix; kwargs...)
    generate(Random.default_rng(), model, context; kwargs...)
end

function multinomial_sampling(rng::AbstractRNG, logits::AbstractMatrix)
    probs = softmax(logits; dims=1)
    tokens = [sample(rng, Weights(p)) for p in eachcol(probs)]
    tokens
end

Testing it out:

context = reshape([1], 1, 1) # start with the new line symbol
out = generate(model, context; context_size=64) # 101×1 Matrix{Int64}

Decode the output using the tokenizer from section 3.2:

decoded_text = join(decode(indexer, out[:, 1]))
print(decoded_text)

The output:


A[RH N)pEy.QEgs?YbgnRsz-ZRDdUXvU Pzwzzxukvv_P;goxe(G;C;I
RIgB ‘E[xIqZ-J;gK—wwEUTZYtUg:tEhl-kZ;s:x.ggt

This is nonsense. The model does no better than drawing each character randomly. We need to train the model to get something sensible out of it.

4 Training

4.1 Train/validation split

It is always good practice to split the data into train, validation and test splits. For simplicity, we’ll only use a train and validation split. We’ll put the first 95% of data in the train split and the remainder in the validation split.⁴

tokens = indexer(collect(text))
n_train = floor(Int, 0.95 * length(tokens)) # size of the train split
train_data = tokens[1:n_train]
val_data = tokens[(n_train + 1):end]

4.2 Batch generation

The model will be trained on segments of the text which match the context length $n$. For a text of length $L$ there are $L-n+1$ characters we can select to be the first character of the segments, excluding the last $n-1$ characters. For the Shakespeare text, this results in approximately 4.9 million different segments.

There is however plenty of overlap, so we don’t have to train on all of them. We can instead randomly sample segments from the text. A character at any given point in the text appears in a sampled segment with probability $p\approx n/L$ (the ends are less likely). Over many steps $s$ the number of appearances follows a binomial distribution, which can be approximated with a normal distribution with a mean $sp\approx sn/L$ and standard deviation $\sqrt{sp(1-p)}\approx \sqrt{sn/L}$. For example, for 4.9 million characters, a context length of 64 and 100,000 steps, each character at each point will appear 1.31±1.14 times.
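The numbers in this example follow directly from the formulas. A quick check, assuming one sampled segment per step and the rounded text length of 4.9 million characters:

```julia
L = 4.9e6      # approximate number of characters in the text
n = 64         # context length
s = 100_000    # number of sampled segments

p = n / L                            # chance that a given position is in one segment
mean_count = s * p                   # ≈ 1.31
std_count = sqrt(s * p * (1 - p))    # ≈ 1.14
```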

The other important task is to create the reference text that the model will be trained to generate, which is simply the input text shifted by one. (This reduces the number of valid segments by 1.)

The function is as follows:

using Random
function get_shifted_batch(rng::AbstractRNG, data::AbstractVector, context_size::Int, batch_size::Int)
    indices = rand(rng, 1:(length(data)-context_size), batch_size)
    X = similar(data, context_size, batch_size)
    Y = similar(data, context_size, batch_size)
    for (j, idx) in enumerate(indices)
        X[:, j] = data[idx:(idx + context_size - 1)]
        Y[:, j] = data[(idx + 1):(idx + context_size)]
    end
    X, Y
end

get_shifted_batch(data::AbstractVector, context_size::Int, batch_size::Int) = 
    get_shifted_batch(Random.default_rng(), data, context_size, batch_size)

Usage:

text = rand(1:72, 1000) # pretend we've already indexed it
rng = MersenneTwister(2)
X, Y = get_shifted_batch(rng, text, 4, 3)

The outputs look like:

    X              Y
 70  66  |   9  60   3
 60   3  |  26   4  32
 4  32   |   1  17  35
 17  35  |  68  54  70

Lastly, it can be convenient to wrap this functionality in a struct similar to Flux.jl’s DataLoader. For an example of this, please see the BatchGenerator object in my generate_batches.jl file.

4.3 Loss

What is our goal?

We want the probability of the true next character to be the highest.

The model returns a $V \times n \times B$ array of logits. We have an $n \times B$ reference array of the true next characters ($Y$). The first step is to convert the logits to probabilities - a range of values from 0 to 1 summing to 1 - with the softmax equation $\ref{eq:softmax}$. We can then pick out the probabilities of the true next characters by converting the reference array to a one hot array and multiplying element-wise:

Z = model(X, mask=mask) # V×n×B
probs = softmax(Z, dims=1)
Y_onehot = Flux.onehotbatch(Y, 1:vocab_size) # V×n×B
Y_onehot .* probs # V×n×B

All the non-zero values are the probabilities of interest.

Since these values are small numbers the convention is to instead use the cross entropy, so $-Y\log(P)$ rather than $YP$. This maps the values from the range $(0, 1)$ to the range $(0, \infty)$. We then reduce it to a single value by taking the mean. This is known as the cross entropy loss:

\[\begin{align} l(y, p) &= -\frac{1}{N}\sum^{N}_i y_i \log(p_i) \\ &= -\frac{1}{N}\sum^{N}_i y_i \log\left(\frac{e^{z_i}}{\sum e^z}\right) \\ &= -\frac{1}{N}\sum^{N}_i y_i \left(z_i - \log\left(\sum e^z\right)\right) \tag{4.2.1} \label{eq:cross_entropy} \end{align}\]

where $N=nB$.

As a baseline, imagine a model which predicts characters uniformly randomly. All probabilities will be $1/V$ and hence the loss will reduce to $-\log(1/V)$. For $V=71$ the expected loss is therefore 4.26. A trained model should achieve a value closer to 0.
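The baseline can be reproduced without Flux. The sketch below implements the last line of the cross entropy equation for a single prediction in plain Julia; `logsumexp` and `cross_entropy` are illustrative helper names, not functions from the post's code base:

```julia
# Numerically stable log(sum(exp(z))) over a vector of logits.
logsumexp(z) = maximum(z) + log(sum(exp.(z .- maximum(z))))

# Cross entropy for one prediction: -(z_true - log(sum(exp(z)))).
cross_entropy(z::AbstractVector, true_index::Int) = -(z[true_index] - logsumexp(z))

V = 71                  # vocabulary size used in the text
z_uniform = zeros(V)    # equal logits give uniform probabilities of 1/V
baseline = cross_entropy(z_uniform, 1)
println(round(baseline, digits=2)) # 4.26
```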

Flux.jl comes with Flux.logitcrossentropy that will implement equation $\ref{eq:cross_entropy}$:

l1 = Flux.logitcrossentropy(Z, Y_onehot) # Float32
l2 = -sum(Y_onehot .* log.(probs)) / (n * B) # Float32
l1 ≈ l2 # true

In a single function:

function full_loss(Ŷ::AbstractArray{T, 3}, Y::AbstractMatrix{Int}) where T
    vocab_size = size(Ŷ, 1) 
    Y_onehot = Flux.onehotbatch(Y, 1:vocab_size)
    Flux.logitcrossentropy(Ŷ, Y_onehot)
end

I’ve called it the full loss to indicate that it is over all $nB$ token predictions and not only the last ($B$) tokens.

4.4 Perplexity

Another common measure of the ability of the model is perplexity, which is the inverse of the geometric mean of the probabilities assigned to the true characters. It is defined as:

\[e^{l(y, p)} = \prod_i^N p_i^{-y_i/N} = 1 \div \left(\prod_i^N p_i \right)^{1/N} \tag{4.3} \label{eq:perplexity}\]

where $l(y, p)$ is the cross entropy loss.

The perplexity for random sampling with $p_i=1/V$ is simply $V$. In other words, the perplexity for randomly sampling 72 characters is a 1 in 72 chance for each character. A trained model should achieve a value closer to 1 in 1, because the context and known distributions allow the model to select characters with greater than random chance.

Like other types of averages, perplexity does not describe the shape of the distribution and outliers can have an outsized effect on it.
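A small numerical check of these properties, using made-up probabilities purely for illustration (`perplexity_of` is a hypothetical helper):

```julia
# Perplexity from the probabilities assigned to the true characters.
perplexity_of(p) = exp(-sum(log.(p)) / length(p))

# Hypothetical probabilities, chosen only for illustration.
p_true = [0.5, 0.25, 0.125, 0.5]
perplexity = perplexity_of(p_true)
inverse_geomean = 1 / prod(p_true)^(1 / length(p_true))
perplexity ≈ inverse_geomean # true

# Uniform guessing over V characters gives a perplexity of V.
V = 72
perplexity_of(fill(1 / V, 10)) ≈ V # true

# A single badly predicted character raises the perplexity sharply.
p_outlier = [0.5, 0.25, 0.125, 1e-6]
perplexity_of(p_outlier) > 10 * perplexity # true
```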

We can estimate it over many samples, say 1000 steps with batches of size 32:

using ProgressMeter
batch_size = 32
num_steps = 1000
mean_loss = 0.0f0
@showprogress for step in 1:num_steps
    X, Y = get_shifted_batch(tokens, context_size, batch_size)
    mean_loss += full_loss(model(X), Y)
end
mean_loss /= num_steps
perplexity = exp(mean_loss)

Running this with an untrained model gave me a mean loss of 4.518 and perplexity of 91.7, which is even worse than the theoretical values for uniform random guessing.

4.5 Training loop

We can now set up a training loop. It will use gradient descent with an Adam optimizer, which adapts the learning rates during the process:

using Flux, ProgressMeter
batch_size = 32
opt_state = Flux.setup(Flux.Adam(0.01), model) 
@showprogress for step in 1:1_000
    X, Y = get_shifted_batch(train_data, context_size, batch_size)
    batch_loss, grads = Flux.withgradient(model) do m
        full_loss(m(X), Y)
    end
    Flux.update!(opt_state, model, grads[1])
end

This works well enough, but will require many more steps to train. I recommend at least 10 epochs, where one epoch is defined as $0.95L/(nB)$ steps. Then based on the logic in Batch Generation each character at each position in the text should appear approximately once per epoch. For $L=4.9\times10^6$, $n=64$ and $B=32$, this is 2,300 steps per epoch.
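The steps per epoch follow directly from the formula. A quick check, assuming the rounded values quoted above:

```julia
L = 4.9e6               # characters in the full text (rounded)
n = 64                  # context length
B = 32                  # batch size
train_fraction = 0.95   # share of the text in the train split

steps_per_epoch = train_fraction * L / (n * B)
println(round(Int, steps_per_epoch)) # 2273, roughly 2,300
```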

Please see my training.jl file for a train! function which also does the following:

Displays a running total of the latest batch loss and the mean batch loss.
Calculates the total loss and accuracy at the end of each epoch.
Returns a history Dictionary which saves these values for each epoch for each metric.

5 Evaluation

5.1 Qualitative

After the model has been properly trained, we can test how well it generates text:

context = reshape([1], 1, 1) # start with the new line symbol
out = generate(model, context; context_size=64, max_tokens=300)
decoded_text = join(decode(indexer, out[:, 1]))

Enter which at con to pratele-timen, man,
Nus maxchant newall the strainans, spauks wring-all likell come bein?

PAGLERANIA.
I not all sakompty hanet are the our
parry adme is waith
On shalt, full, in mety infor to thee I pater: let fathing
you, do taks you mail was the mascain
Am.
Him fore our waka

The output is not fully cohesive and is not proper English. However, there are many true English words and the made-up words follow general English patterns. The general structure of the output matches the input structure. Characters (Actors) are introduced in all capitals.

We could also use a prompt, for example a famous line from the input:

tokens = indexer(collect("To be, or not to be, that is the question:\n"));
context = reshape(tokens, :, 1)
out = generate(model, context; context_size=64, max_tokens=300)
decoded_text = join(decode(indexer, out[:, 1]))

To be, or not to be, that is the question:
Of them I conful usall but as dull will henow
I wold you stay shaked marce, I witth all mine Ren, to siven,
Thoumbeines.

GUERD.
Swetr with they bloctain now tires’d do stord’s my leed.

NAWIPAR.
And then tillf’s broky! house stoop lord
you lay’d beater of Ettion say.

DUKEnge what by to and King an

It does not reproduce the actual line in the text (“Whether ’tis nobler in the mind to suffer”). However, as before, the output is not entirely nonsense and at least has the correct structure.

5.2 Quantitative

Calculate the mean loss and perplexity again:

mean_loss = 0.0f0
@showprogress for step in 1:1000
    X, Y = get_shifted_batch(val_data, 64, 32)
    mean_loss += full_loss(model(X), Y)
end
mean_loss /= 1000
perplexity = exp(mean_loss)

The mean loss is 1.853 and the perplexity is 6.379. This is a significant improvement on both the untrained model (4.518 and 91.7) and the uniform random baseline for 72 characters (4.277 and 72.0).

6 Inspection

6.1 Embeddings

For the most part the model we have created is a black box. There are however various techniques to inspect the model. For example, cosine similarities, which were showcased in the Position Encoding section.

Another popular technique is to visually examine the embeddings after dimension reduction. For example our model has a dimension of 32, and we can reduce this to 2 dimensions and then create a 2D scatter plot. The popular techniques to do this are PCA (Principal Component Analysis) and t-SNE (t-distributed Stochastic Neighbor Embedding). t-SNE starts with PCA and iterates to give better looking results.
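For comparison, PCA on its own needs only the standard library. The following is a minimal sketch via the singular value decomposition; the random matrix `W` is a stand-in for the embedding matrix and the 32×71 shape is an assumption for illustration:

```julia
using LinearAlgebra, Statistics

# Stand-in for the embedding matrix: 32 dimensions × 71 tokens (random, illustration only).
W = randn(32, 71)

# Centre each dimension (row) so that PCA measures variance around the mean.
W_centred = W .- mean(W, dims=2)

# The left singular vectors are the principal directions, ordered by singular value.
F = svd(W_centred)
components = F.U[:, 1:2]                   # 32×2

# Project every token embedding onto the first two principal directions.
Y_pca = transpose(components) * W_centred  # 2×71
size(Y_pca) # (2, 71)
```

The two rows of `Y_pca` can then be plotted against each other in the same way as the t-SNE output.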

Here is an implementation of t-SNE with Julia:

using TSne, Plots
W = model.embedding.weight # or transpose(model.head.weight)
reduce_dims, max_iter, perplexit = 0, 1000, 20.0
Y = tsne(transpose(W), 2, reduce_dims, max_iter, perplexit);
scatter(Y[:,1], Y[:,2], series_annotations=vocabulary, 
    markeralpha=0.0,
    label="",
    aspectratio=:equal
)

where the vocabulary is:

vocabulary = string.(indexer.vocabulary)
vocabulary[1] = string(Int(indexer.vocabulary[1])) #\n => 10
vocabulary[2] = string(Int(indexer.vocabulary[2])) #' '=> 32

The output:

t-SNE embeddings for the embedding matrix (left) and head matrix (right). New line is 10 and space is 32.

Note that t-SNE is stochastic and each run will give different results.

For the embedding matrix we can see that the model groups all the vowels (a, e, i, o, u) and their capital forms together. It also tends to group the lowercase form and uppercase form together e.g. ‘g’ and ‘G’. The head meanwhile has 3 distinct groups: capital letters, punctuation and lower case letters. It also groups the vowels together.

Perhaps with further training more meaning would be encoded into these vectors.

6.2 Attention scores

We can pass an input to the model and visually inspect the attention scores. To do this we need to alter the transformer functions to return the score as well (including reshaping it as needed). At the top level - the forward pass of the model - these scores should be saved in a vector. Then we can plot them:

Code for the scores plot:

using Plots
text = """LYSANDER.
How now, my love? Why is your cheek so pale?
How chance the roses there do fade so fast?"""
tokens = reshape(indexer(collect(text)), :, 1);
X = tokens[1:context_size, :];
X_text = decode(indexer, X[:, 1]);
Y, scores = predict_with_scores(model, X, mask=model.mask); # modified forward pass
s = scores[3][:, :, 3, 1]
s = ifelse.(model.mask, s, NaN)
heatmap(s,
    xticks=(1:context_size, X_text),
    yticks=(1:context_size, X_text),
    yrotation=90,
    aspectratio=:equal,
    xlims=(0.5, context_size + 0.5),
    size=(500, 500),
)

Attention scores for block 3, head 3.

The attention matrices are very sparse. Most tokens only place emphasis on the four or fewer tokens directly before them. This suggests we could have used a much smaller context length, for example 16, and indeed that does work.

Ideally the model should be learning long range relationships and it is worrying that it is not.

That said, the model does confidently predict that after “How chanc” is an “e”:

Code for the probability plot:

using Plots
probs_next = softmax(Y[:, end, 1])
v = length(indexer.vocabulary)
bar(probs_next,
    xticks=(1:v, indexer.vocabulary),
    xlims=(1, v),
    label="",
    ylabel="probabilities",
    xlabel="tokens"
)

Probabilities for the next token for the last token in the sequence.

Perhaps with more training the model would give better results.

Conclusion

Thank you for following this tutorial. I hope you now have a working transformer and have much better insight into how they work.

The cosine similarity is calculated as $W^TW/ m^T m $ where $m_{1j}=\sqrt{\sum_i W_{ij}^2}$ for each column $j$ in $W$. In code:

using LinearAlgebra
function cosine_similarity(W::AbstractMatrix)
    sim = transpose(W) * W
    magnitudes = sqrt.(diag(sim))
    for i in 1:size(sim, 1)
        for j in 1:size(sim, 2)
            sim[i, j] /= magnitudes[i] * magnitudes[j]
        end
    end
    sim
end

↩

In general multiplication is not defined for higher order arrays. But there is a set of multidimensional algebraic objects called tensors where it is. Confusingly, Google named their machine learning framework TensorFlow and calls higher order arrays tensors. So one should differentiate between machine learning tensors and geometric tensors. They are not the same. To give a simple explanation: one can think of geometric tensors as higher order arrays with severe constraints on their entries and operations because they represent geometric objects. These constraints make it harder - not easier - to code higher order arrays as geometric tensors. ↩
The design decision is to purposely drop the attention scores in the TransformerBlock’s forward pass. This is to simplify the code and to not place a bias on the attention. In a typical block the MultiheadAttention layer will make up 1/3rd of parameters while the dense layers will make up 2/3rds, so the dense layers are potentially more important. To return the scores you can edit the forward pass for the block and model, or create two new functions entirely. ↩
A smarter strategy is to randomly sample passages throughout the text until the desired proportions are reached. ↩

Radix Tree in Julia

2024-03-21T00:00:00+00:00

A radix tree in Julia, built following Test Driven Development (TDD).

1 Introduction

I recently discovered radix trees, also known as compressed tries. They are a specialised, space-optimised data structure for storing and searching through strings. They can be used for text suggestions in search engines and for predictive text. They are used in databases for storing IP addresses and for the inverted index of search engines.¹

Source: en.wikipedia.org/wiki/Radix_tree.

The above figure shows an example of a radix tree. Each edge stores part of a string. The full string can be recovered by combining all the edges of the parents of a given node. Searching through the tree is $\mathcal{O}(\log_r(n))$ where $r$ is called the radix of the tree and $n$ is the total number of items stored in the tree.

This post describes how to build one in Julia. I’ll be following Test Driven Development (TDD) for part of the process.

As always, the full code can be viewed at my Github repository at github.com/LiorSinai/RadixTree.jl.

I’d like to note upfront that radix trees are not always the best solution for text search. In particular, binary search through a sorted linear list is $\mathcal{O}(\log_2(n))$ and is much simpler. In Julia the inbuilt searchsortedfirst function does this:

idx = searchsortedfirst(sorted_words, key)

So this is partly an academic exercise.
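As a point of comparison, here is a sketch of prefix search over a sorted list using only searchsortedfirst. The helper `words_with_prefix` and the word list are hypothetical and are not part of RadixTree.jl:

```julia
# Return all words in a sorted list that start with `prefix`.
function words_with_prefix(sorted_words::Vector{String}, prefix::AbstractString)
    idx = searchsortedfirst(sorted_words, prefix) # binary search: first word >= prefix
    matches = String[]
    # Words sharing a prefix are contiguous in a sorted list, so scan until the prefix stops matching.
    while idx <= length(sorted_words) && startswith(sorted_words[idx], prefix)
        push!(matches, sorted_words[idx])
        idx += 1
    end
    matches
end

sorted_words = sort(["tree", "team", "treaty", "test", "treat", "toast"])
words_with_prefix(sorted_words, "trea") # ["treat", "treaty"]
```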

2 Implementation

Project setup (optional)

To start, make a package in the Julia REPL:


julia> cd("path\\to\\project")
julia> ] # enter package mode
(@v1.x) pkg> generate RadixTree # make a directory structure
(@v1.x) pkg> dev "path\\to\\project\\RadixTree"

The purpose of making a package is that we can now use the super helpful Revise package, which will dynamically update most changes during development without errors:

julia> using Revise
julia> using RadixTree

RadixTreeNode

My goal is to create a simple radix tree where each node stores a string. In this way the tree functions as a type of array.² The struct looks like:

mutable struct RadixTreeNode{T<:AbstractString}
    data::T
    is_label::Bool
    children::Vector{<:RadixTreeNode}
end

RadixTreeNode(data::T="", label::Bool=false) where T = 
    RadixTreeNode{T}(data, label, RadixTreeNode{T}[])

In Julia an immutable struct is usually preferable because the compiler can more easily optimise code for it. However here we will often need to change the data field during inserts, and so require a mutable struct.

The whole tree will be accessed through the first node, which is called the root:

root = RadixTreeNode() # RadixTreeNode{String}("", false, RadixTreeNode{String}[])

If we store children in the root then the default printing will print them too:

root = RadixTreeNode{String}("", false, [RadixTreeNode("a", true), RadixTreeNode("b", true)])
#= RadixTreeNode{String}("", false, RadixTreeNode{String}[RadixTreeNode{String}("a", true, RadixTreeNode{String}[]), RadixTreeNode{String}("b", true, RadixTreeNode{String}[])]) =#

This will get out of hand for a large tree, as it will print the entire tree. To avoid this, we can create a custom printing function which will only print the data for the immediate children of a node:

children_data(node::RadixTreeNode) = [child.data for child in node.children]

function Base.show(io::IO, node::RadixTreeNode)
    print(io, typeof(node))
    print(io, "(data=", node.data)
    print(io, ", is_label=", node.is_label)
    print(io, ", children=", children_data(node))
    print(io, ")")
end

Now if we print(root) we get:

#= RadixTreeNode{String}(data=, is_label=false, children=["a", "b"]) =#

We can create other helper functions for the RadixTreeNode:

Base.eltype(node::RadixTreeNode{T}) where T = T
children(node::RadixTreeNode) = node.children
is_leaf(node::RadixTreeNode) = isempty(node.children)

Search

We can use a very basic example to create and test a search function. See the tree below:

We can construct it directly as:

root = RadixTreeNode{String}(
    "", false, [ 
            RadixTreeNode{String}("te", false, 
            [
                RadixTreeNode("am"), RadixTreeNode("st")
            ]
        )
    ]
)

The goal of the search algorithm is to return the deepest node in the tree that matches the given key. We would also like to know how many letters are matched. We can make the following two tests:

using Test
node, num_found = get(root, "hello")
@test node == root && num_found == 0
node, num_found = get(root, "team")
@test node == root.children[1].children[1] && num_found == 4

The algorithm on Wikipedia is as follows:

1. Check if any child has a matching prefix with the key.
2. Chop off the matching prefix (keep the suffix) of the key and set the node to the child.
3. Repeat steps 1-2. Stop when:
- There is no matching prefix.
- Or the node is a leaf (has no children).
- Or all the letters are matched.

Here is the full algorithm in code:

function Base.get(root::RadixTreeNode, key::AbstractString)
    node = root
    num_found = 0
    suffix = key
    while !(isnothing(node)) && !(is_leaf(node)) && (num_found < length(key))
        child = search_children(node, suffix)
        if isnothing(child)
            break
        end
        node = child
        num_found += length(node.data)
        suffix = get_suffix(suffix, length(node.data))
    end
    node, num_found
end

function get_suffix(s::AbstractString, head::Int)
    if isempty(s)
        return s
    end
    s[nextind(s, firstindex(s), head):end]
end

function search_children(node::RadixTreeNode, key::AbstractString)
    for child in node.children
        if startswith(key, child.data)
            return child
        end
    end
end

This passes both tests.

Some comments:

These functions are fully compatible with unicode strings. See this tutorial for more information.
The get_suffix function may also be implemented using chop(s; head=head, tail=0) which returns SubString instead of String. Working directly with strings seems to reduce memory allocations.
The search_children function can be made faster with binary search. But in practice the child arrays tend to be small so this is not essential.
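The first two comments can be illustrated with a short check on a string containing a multi-byte character. This sketch repeats get_suffix so that it runs on its own; the example word is arbitrary:

```julia
function get_suffix(s::AbstractString, head::Int)
    isempty(s) && return s
    s[nextind(s, firstindex(s), head):end]
end

s = "naïve"                      # 'ï' takes two bytes in UTF-8
a = get_suffix(s, 3)             # "ve" as a String
b = chop(s; head=3, tail=0)      # "ve" as a SubString{String}
a == b                           # true
(typeof(a), typeof(b))           # (String, SubString{String})
```

Both versions count characters rather than bytes, so they agree on the result and differ only in the return type.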

A question is, what will get(root, "tea") return? Technically “tea” is in the tree, split up as “te” and “am”. However this function is purposely limited to only full matching prefixes and not partial matches. Hence the “te” node will be returned with a match length of 2.

Insert

The Wikipedia page has a fairly complex insert example. I’m instead going to work through four simple examples, extending the insert! function each time to make the tests pass. By the end the function will be able to handle all scenarios.

1 Insert in order

For efficient search we want the children inserted in order. Our test is:

root = RadixTreeNode()
insert!(root, "t")
insert!(root, "z")
insert!(root, "a")
@test root.children[1].data == "a"
@test root.children[2].data == "t"
@test root.children[3].data == "z"

For a given key we first need to find which node to insert it at (get) then we can use searchsortedfirst to find which index to put it in:

function Base.insert!(root::RadixTreeNode{T}, key::AbstractString) where T
    node, match_length = get(root, key)
    new_node = RadixTreeNode(key, true)
    idx = searchsortedfirst(node.children, new_node; lt=(n1, n2)->n1.data < n2.data)
    insert!(node.children, idx, new_node)
end

And all our tests pass.

2 Extend

If we add strings which share prefixes with existing nodes, then we only want to extend by the suffix. Our test is:

root = RadixTreeNode("")
insert!(root, "s")
insert!(root, "slow")
insert!(root, "slowly")
insert!(root, "slower")
@test root.children[1].data == "s"
@test root.children[1].children[1].data == "low"
@test root.children[1].children[1].children[1].data == "er"
@test root.children[1].children[1].children[2].data == "ly"

The new code is:

function Base.insert!(root::RadixTreeNode{T}, key::AbstractString) where T
    node, match_length = get(root, key)
    suffix = get_suffix(key, match_length) # new
    new_node = RadixTreeNode(T(suffix), true) # edit
    idx = searchsortedfirst(node.children, new_node; lt=(n1, n2)->n1.data < n2.data)
    insert!(node.children, idx, new_node)
end

3 Split

If we add a string which shares only part of an existing node's data as a prefix, then we have to split that node.

Our test is:

root = RadixTreeNode("")
insert!(root, "test")
insert!(root, "team")
@test root.children[1].data == "te"
@test root.children[1].children[1].data == "am"
@test root.children[1].children[2].data == "st"

Unlike before with get, we now go the extra step of checking whether any child partially overlaps with the remaining suffix. This requires checking all prefixes up to the suffix length $s$ for all children $c$, so this is inherently an $\mathcal{O}(cs)$ operation. If a child does overlap, we will split! that child into two and then add the rest of the suffix as a new child. The split child will then have only two children - the suffix of the old data and this new suffix - so determining the order is straightforward.

function Base.insert!(root::RadixTreeNode{T}, key::AbstractString) where T
    node, match_length = get(root, key)
    suffix = get_suffix(key, match_length)
    child, overlap = search_children_with_overlap(node, suffix) # new
    if isnothing(child) # new
        new_node = RadixTreeNode(T(suffix), true)
        idx = searchsortedfirst(node.children, new_node; lt=(n1, n2)->n1.data < n2.data)
        insert!(node.children, idx, new_node)
    else # new
        node = child # new
        split!(node, overlap) # new
        new_suffix = get_suffix(suffix, overlap) # new
        new_node = RadixTreeNode(T(new_suffix), true) # new
        idx = new_node.data < node.children[1].data ? 1 : 2 # new
        insert!(node.children, idx, new_node) # new
    end # new
end

function search_children_with_overlap(node::RadixTreeNode, key::AbstractString)
    for len_prefix in length(key):-1:1
        for child in node.children
            data = first(child.data, len_prefix)
            if startswith(key, data)
                return child, min(len_prefix, length(data))
            end
        end
    end
    nothing, 0
end

function split!(node::RadixTreeNode{T}, i::Int) where T
    suffix = get_suffix(node.data, i)
    new_node = RadixTreeNode{T}(T(suffix), node.is_label, node.children)
    node.data = first(node.data, i)
    node.children = [new_node]
    node.is_label = false
    node
end

4 Split with no add

There are two extra scenarios we have to account for. The first is if the word is already in the tree, in which case we should ignore it. The second is if we add a word that is fully a prefix of another word, then we shouldn’t add a new node after splitting.

Our test is:

root = RadixTreeNode()
insert!(root, "team")
insert!(root, "team") # ignore
insert!(root, "tea")
@test root.children[1].data == "tea"
@test root.children[1].children[1].data == "m"

This requires extra checks:

function Base.insert!(root::RadixTreeNode{T}, key::AbstractString) where T
    node, match_length = get(root, key)
    if match_length == length(key) # new
        node.is_label = true # new
        return # new
    end  # new
    suffix = get_suffix(key, match_length)
    child, overlap = search_children_with_overlap(node, suffix)
    if isnothing(child)
        new_node = RadixTreeNode(T(suffix), true)
        idx = searchsortedfirst(node.children, new_node; lt=(n1, n2)->n1.data < n2.data)
        insert!(node.children, idx, new_node)
    else
        node = child
        split!(node, overlap)
        if (overlap) < length(suffix) # new
            new_suffix = get_suffix(suffix, overlap)
            new_node = RadixTreeNode(T(new_suffix), true)
            idx = new_node.data < node.children[1].data ? 1 : 2
            insert!(node.children, idx, new_node)
        else # new
            node.is_label = true # new
            node # new
        end # new
    end
end

Print tree

We can now make fairly complex trees. To prove this it will be helpful to print the entire tree.

The tree will be printed by visiting a node and printing its data, then moving on to each of its children and doing the same one by one. This is known as a pre-order traversal.

Each time we go down a level we will increase the indent for easy reading.

print_tree(io::IO, root::RadixTreeNode; options...) = print_tree_preorder(io, root; options...)
print_tree(root::RadixTreeNode; options...) = print_tree(stdout, root; options...)

function print_tree_preorder(io::IO, node::RadixTreeNode, level_indent=""
    ; indent::AbstractString="--", use_data_as_separator::Bool=false
    )
    println(io, level_indent * node.data)
    separator = use_data_as_separator ? node.data : "|"
    next_level = level_indent * separator * indent
    for child in node.children
        print_tree_preorder(io, child, next_level
        ; indent=indent, use_data_as_separator=use_data_as_separator
        )
    end
end

A basic example:

root = RadixTreeNode("")
insert!(root, "t")
insert!(root, "ten")
insert!(root, "team")
insert!(root, "tea")
print_tree(root)

The output:

|--t
|--|--e
|--|--|--a
|--|--|--|--m
|--|--|--n

Here is a fairly complex example from Wikipedia:

In code:

root = RadixTreeNode("")
for key in ["romane", "romanus", "romulus", "rubens", "ruber", "rubicon", "rubicundus"]
    insert!(root, key)
end
print_tree(root)

The output:

|--r
|--|--om
|--|--|--an
|--|--|--|--e
|--|--|--|--us
|--|--|--ulus
|--|--ub
|--|--|--e
|--|--|--|--ns
|--|--|--|--r
|--|--|--ic
|--|--|--|--on
|--|--|--|--undus

Height

An important statistic of the tree is its height. This is the maximum number of nodes a search must traverse to find a key. This height can be attained via a recursive function:

function get_height(node::RadixTreeNode, height::Int=0)
    if is_leaf(node)
        return height
    end
    next_height = height + 1
    for child in node.children
        height = max(height, get_height(child, next_height))
    end
    height
end

For the Romane tree above this returns a height of 4.

Iteration

The last useful feature I want to add is an iterator, also known as a generator in other languages. The utility of an iterator is to return one data point at a time. This reduces memory usage as opposed to returning the entire dataset.

Julia is a functional language and as such making an iterator requires more thought than some other languages. In Python for example it is easy to implement one with the yield keyword. In Julia, the onus is on the programmer to manage the state of the iterator. At first I found it challenging to make one for a tree but Henrique Becker’s answer in this Discourse forum gave me clarity.

Once again, the default is a pre-order traversal:

According to the documentation on interfaces, the following code

for item in iter   
    # body
end

is translated into:

next = iterate(iter)
while next !== nothing
    (item, state) = next
    # body
    next = iterate(iter, state)
end

The iterator will be a PreOrderTraversal object which will step through all nodes of the tree. We want to only return labels so we can stop the iteration when it reaches a label. The item will be made up of a tuple: the data and a boolean for is_label.

function Base.iterate(root::RadixTreeNode, state=nothing)
    iter = PreOrderTraversal(root)
    next = isnothing(state) ? iterate(iter) : iterate(iter, state)
    while next !== nothing
        ((data, is_label), state) = next
        if is_label
            return (data, state)
        end
        next = iterate(iter, state)
    end
end

Base.IteratorSize(::RadixTreeNode) = Base.SizeUnknown()

This shifts the problem to making an iterator for the PreOrderTraversal. Firstly, this object is just a wrapper around the node:

struct PreOrderTraversal{R<:RadixTreeNode}
    root::R
end

The hardest part is, what is the state? It is all the information about the node’s parents and its parents and so on, so that we can backtrack when we need to do so. For example at step 5 in the figure, we are at “test” which is the first child (“est”) of the second child (“t”) of the root. This is nothing more than a list of tuples of (node, idx, word). We can implement this as a stack. If idx ≤ length(node.children), then increment idx up by one, otherwise pop from the stack and backtrack. In full:³

Base.IteratorSize(::PreOrderTraversal) = Base.SizeUnknown() 

Base.iterate(iter::PreOrderTraversal) = ((iter.root.data, iter.root.is_label), [(iter.root, 1, iter.root.data)])

function Base.iterate(iter::PreOrderTraversal, stack_::Vector{Tuple{RadixTreeNode{T}, Int, T}}) where T
    if isempty(stack_)
        return nothing
    end
    node, idx, word = last(stack_)
    if idx <= length(node.children)
        return _increment_stack!(stack_)
    else # backtrack
        pop!(stack_)
        while !(isempty(stack_))
            node, idx, word = last(stack_)
            if idx <= length(node.children)
                return _increment_stack!(stack_)
            end
            pop!(stack_)
        end
    end
    nothing
end

function _increment_stack!(stack_::Vector{<:Tuple})
    node, idx, word = last(stack_)
    stack_[end] = (node, idx + 1, word)
    child = node.children[idx]
    new_word = word * child.data
    push!(stack_, (child, 1, new_word))
    (new_word, child.is_label), stack_ 
end

Testing it out:

root = RadixTreeNode()
for key in ["toast", "toaster", "toasting", "test", "slow", "slower", "slowly"]
    insert!(root, key)
end
for item in PreOrderTraversal(root)
    print(item, ", ")
end
#= ("", false), ("slow", true), ("slower", true), ("slowly", true), ("t", false), ("test", true), ("toast", true), ("toaster", true), ("toasting", true) =#
for item in root
    print(item, ", ")
end
#= slow, slower, slowly, test, toast, toaster, toasting, =#

3 Worked example

Here is a list of 10,000 words compiled by MIT: www.mit.edu/~ecprice/wordlist.10000.⁴

After downloading the list we can load and insert it into a tree:

tree = RadixTreeNode()
filepath = "mit_words.txt"
open(filepath, "r") do f
    for line in eachline(f)
        insert!(tree, line)
    end
end

Some basic statistics:

get_height(tree) # 11
Base.summarysize(tree) # 978170 = 0.93 MB

Print the tree to a file:

open("tree.txt", "w") do f
    print_tree(f, tree; use_data_as_separator=true)
end

All words that start with “trea”:

node, matched = get(tree, "trea")
prefix = first("trea", matched)
suffix = get_suffix("trea", matched)
for child in node.children
    if startswith(child.data, suffix)
        for data in child
            print(prefix * data, ", ")
        end
    end
end
#= treasure, treasurer, treasures, treasury, treat, treated, treating, treatment, treatments, treaty, =#

4 Conclusion

Thank you for following along. I hope you found this useful.

¹ For an example of a radix tree used for an inverted index, see this post from Algolia. As far as inverted indexes go, however, Lucene is the industry standard with the most optimised implementation. Its complicated inverted index is based on skip lists and finite state transducers. Lucene forms the basis of the popular ElasticSearch search engine.
² Another option is to make the tree a kind of dictionary by using the string at each node as a key and storing another value. This is the design choice made by DataStructures.jl in their Trie data structure. The values we could store are the term frequency of the word or a list of documents where that word occurs (inverted index).
³ I’ve called the variable stack_ instead of stack because a function already exists with that name.
⁴ Warning: there are profanities in this list. Also there are at least two mistakes: “trembl” and “documentcreatetextnode”.
