A radix tree in Julia, built following Test Driven Development (TDD).
Table of Contents
1 Introduction
I recently discovered radix trees, also known as compressed tries. They are a specialised, space-optimised data structure for storing and searching through strings.
They can be used for text suggestions in search engines and for predictive text.
They are used in databases for storing IP addresses and for the inverted index of search engines.1
The above figure shows an example of a radix tree. Each edge stores part of a string.
The full string can be recovered by combining all the edges of the parents of a given node.
Searching through the tree is $\mathcal{O}(\log_r(n))$ where $r$ is called the radix of the tree and $n$ is the total number of items stored in the tree.
This post describes how to build one in Julia.
I’ll be following Test Driven Development (TDD) for part of the process.
I’d like to note upfront that radix trees are not always the best solution for text search.
In particular, binary search through a sorted linear list is $\mathcal{O}(\log_2(n))$ and is much simpler.
In Julia the inbuilt searchsortedfirst function does this:
So this is partly an academic exercise.
2 Implementation
Project setup (optional)
To start, make a package in the Julia REPL:
The purpose of making a package is that we can now use the super helpful Revise package,
which will dynamically update most changes during development without errors:
RadixTreeNode
My goal is to create a simple radix tree where each node stores a string.
In this way the tree functions as a type of array.2
The struct looks like:
In Julia an immutable struct is usually preferable because the compiler can more easily optimise code for it.
However here we will often need to change the data field during inserts, and so require a mutable struct.
The whole tree will be accessed through the first node, which is called the root:
If we store children in the root then the default printing will print them too:
This will get out of hand for a large tree, as it will print the entire tree.
To avoid this, we can create a custom printing function which will only print the data for the immediate children of a node:
Now if we print(root) we get:
We can create other helper functions for the RadixTreeNode:
Search
We can use a very basic example to create and test a search function. See the tree below:
We can construct it directly as:
The goal of the search algorithm is to return the deepest node in the tree that matches the given key.
We would also like to know how many letters are matched.
We can make the following two tests:
Check if any child has a matching prefix with the key.
Chop off the matching prefix (keep the suffix) of the key and set the node to the child.
Repeat steps 1-2. Stop when:
There is no matching prefix.
Or the node is a leaf (has no children).
Or all the letters are matched.
Here is the full algorithm in code:
This passes both tests.
Some comments:
These functions are fully compatible with unicode strings. See this tutorial for more information.
The get_suffix function may also be implemented using chop(s; head=head, tail=0) which returns SubString instead of String. Working directly with strings seems to reduce memory allocations.
The search_children function can be made faster with binary search. But in practice the child arrays tend to be small so this is not essential.
A question is, what will get(root, "tea") return?
Technically “tea” is in the tree, split up as “te” and “am”.
However this function is purposely limited to only full matching prefixes and not partial matches.
Hence the “te” node will be returned with a match length of 2.
Insert
The Wikipedia page has a fairly complex insert example.
I’m instead going to work through four simple examples, extending the insert! function each time to make the tests pass.
By the end the function will be able to handle all scenarios.
1 Insert in order
For efficient search we want the children inserted in order. Our test is:
For a given key we first need to find which node to insert it at (get) then we can use searchsortedfirst to find which index to put it in:
And all our tests pass.
2 Extend
If we add strings which share prefixes with existing nodes, then we only want to extend by the suffix. Our test is:
The new code is:
3 Split
If we add a string which shares a prefix with an existing node, then we have to split that node.
Our test is:
Unlike before with get, we now will go the extra step of checking if any child overlaps with the remaining suffix.
This requires checking all prefixes up to the suffix length $s$ for all children $c$, so this is inherently an $\mathcal{O}(cs)$ operation.
If it does, we will split! that child into two and then add the suffix as a new child.
The child will only have two children - the suffix of the old data and this new suffix - so determining the order is straightforward.
4 Split with no add
There are two extra scenarios we have to account for.
The first is if the word is already in the tree, in which case we should ignore it.
The second is if we add a word that is fully a prefix of another word, then we shouldn’t add a new node after splitting.
Our test is:
This requires extra checks:
Print tree
We can now make fairly complex trees.
To prove this it will be helpful to print the entire tree.
The tree will be printed by visiting a node and printing its data, then moving on to each of its children and doing the same one by one.
This is known as a pre-order traversal.
Each time we go up a level we will increase the indent for easy reading.
An important statistic of the tree is its height. This is the maximum number of nodes it must traverse to find a key.
This height can be attained via a recursive function:
For the Romane tree above this returns a height of 4.
Iteration
The last useful feature I want to add is an iterator, also known as a generator in other languages.
The utility of an iterator is to return one data point at a time. This reduces memory usage as opposed to returning the entire dataset.
Julia is a functional language and as such making an iterator requires more thought than some other languages.
In Python for example it is easy to implement one with the yield keyword.
In Julia, the onus is on the programmer to manage the state of the iterator.
At first I found it challenging to make one for a tree but Henrique Becker’s answer in this Discourse forum gave me clarity.
Once again, the default is a pre-order traversal:
According to the documentation on interfaces, the following code
is translated into:
The iterator will be a PreOrderTraversal object which will step through all nodes of the tree.
We want to only return labels so we can stop the iteration when it reaches a label.
The item will be made up of a tuple: the data and a boolean for is_label.
This shifts the problem to making an iterator for the PreOrderTraversal.
Firstly, this object is just a wrapper around the node:
The hardest part is, what is the state?
It is all the information about the node’s parents and its parents and so on, so that we can backtrack when we need to do so.
For example at step 5 in the figure, we are at “test” which is the first child (“est”) of the second child (“t”) of the root.
This is nothing more than a list of tuples of (node, idx, word).
We can implement this as a stack.
If idx ≤ length(node.children), then increment idx up by one, otherwise pop from the stack and backtrack.
In full:3
After downloading the list we can load and insert it into a tree:
Some basic statistics:
Print the tree to a file:
All words that start with “trea”:
4 Conclusion
Thank you for following along. I hope you found this useful.
For an example of a radix tree used for an inverted index, see this post from Algolia. Although as far as inverted indexes go, Lucene is the industry standard with the most optimised implementation. Its complicated inverted index is based on skip lists and finite state tranducers. Lucene forms the basis of the popular ElasticSearch search engine. ↩
Another option is to make the tree a kind of dictionary by using the string at each node as a key and storing another value. This is the design choice made by DataStructures.jl in their Trie data structure. The values we could store are the term frequency of the word or a list of documents where that word occurs (inverted index). ↩
I’ve called the variable stack_ instead of stack because a function already exists with that name. ↩
Warning: there are profanities in this list. Also there are at least two mistakes: “trembl” and “documentcreatetextnode”. ↩