On convolutional neural networks, overly large models and the importance of understanding your data.
This post is part of a series. The other articles are:
All code is available online at my repository: github.com/LiorSinai/SudokuReader.jl.
When I initially envisioned this blog post, I thought I wouldn’t have much to talk about with regards to machine learning. Digit recognition is the entry level problem of machine learning. It is incorporated into almost every beginner course on the topic. My assumption was you as the reader have done this before and I would only need to present the code. The model would be copied off the first tutorial I found on the internet and would be trained on the MNIST dataset. But I found MNIST ill-suited to the task at hand and the models subpar. In particular:
I’ll now delve further into each of these issues.
The Modified National Institute of Standards and Technology database (MNIST) is a large set of 70,000 labelled handwritten digits. Here is a sample of the digits:^{1}^{,}^{2}
I trained my model and then discovered very quickly that despite a 99% test accuracy, it was not the best dataset for a model that will mostly be aimed at computer printed digits. To understand why, let us look at a different dataset, the Char74K dataset. It is based on computer fonts:
Here is a general comparison of the mean image of each digit for each dataset:
The computer 1s tend to have a hook that handwritten 1s don’t have. The handwritten 4s are open at the top whereas computer 4s are closed. The circle in handwritten 6s tends to be smaller than the circle in computer 6s. All these small differences make a big impact.
As an experiment, I trained the LeNet5 model (detailed below) on the MNIST dataset and evaluated it on the Char74k dataset. While it had a 98.7% test accuracy for MNIST data, this only translated to a 81.3% accuracy on the whole Char74k dataset. So these datasets are strongly correlated but that is not enough for the required task. Here is the confusion matrix for the Char74k dataset:
As expected, 7, 1 and 2 are often confused. So was 6s with 5s and - to a less a extent - 6s with 0s. This sort of confusion also appeared with the few Sudoku images I tried. Training the model on the Char74k data removed this problem.
Going the other way, a model trained on the Char74k dataset to a 99.1% test accuracy only achieved a 54.1% accuracy on the MNIST data. I found this acceptable because the target is computer fonts not handwritten digits. Its overall performance with the Sudoku images was much better.
Another strategy is to train on both: 10,000 Char74k figures and 10,000 MNIST figures. This model has a test accuracy of 98.1%. For the separate datasets (train &test) it is 99.4% on the Char74k data and 98.8% on the MNIST data. A flaw of this model is that 42 times it confused 2s for 7s. (Out of 4000 2s and 7s that is acceptable.) Otherwise the second largest off-diagonal value in the confusion matrix was 12. Overall, this model seemed to perform slighlty worse on the Sudoku images.
I eventually used the LeNet5 model originally published in 1998. Its architecture is as follows:
It is a convolutional neural network. If you’re unfamiliar with this type of neural network, please see this article for a guide on how they work.
In Julia code, LeNet5 is built as follows:^{3}
This was not the first model I encountered. Instead that was huge a 1.2 million parameter model on a blog post. It worked, but it is unnecesary large and slow. It is built as follows:
This model is similar to LeNet5 but it has a few key differences. Firstly, it does 2× the amount of convolutions per Conv
step. Secondly, and very importantly, it only down samples once with a MaxPool
.
Down sampling again by a factor 2 in both directions would have reduced the total pixel size by 4, which in turn reduces
the parameter space by approximately 4. So 225,034 parameters instead of over a million.
I then investigated a few other models. Here is a summary of my findings for models trained only on the Char74k dataset:
name | train accuracy (%) | test accuracy (%) | inference time (ms) | file size (kB) | parameters | |
---|---|---|---|---|---|---|
1 | cnn_mastery | 99.99 | 98.77 | 0.1547 | 2119.68 | 542230 |
2 | LeNet5 | 99.51 | 98.67 | 0.0806 | 183.0 | 44426 |
3 | cnn_medium | 99.37 | 98.67 | 0.1355 | 79.4 | 18378 |
4 | cnn_huge | 99.26 | 98.13 | 0.8164 | 4689.92 | 1199882 |
5 | cnn_small | 98.62 | 97.74 | 0.0576 | 27.7 | 5142 |
6 | nn_fc | 98.43 | 96.75 | 0.0023 | 103.0 | 25450 |
Th cnn_huge model comes in 4th place despite its large amount of parameters. It is 10 times slower and 25.6 times larger than LeNet5. You may notice it still only has an inference time of 0.8ms and a file size of 4.6MB. It took 10 minutes to train as opposed to 2 minutes for LeNet5. These numbers are still small, so what is the fuss? I just greatly dislike increasing complexity and reducing performance for no reason.
The fastest model is nn_fc. It only has fully connected layers with no convolutions. Here is its architecture:
This model is prone to errors because it does not take the structure of the image to account. So very weirdly, you can swap rows and columns and not affect the prediction. Clearly this limits the overall accuracy of the model.
LeNet5 is a good balance between slow convolutions with few parameters and large but fast fully connected layers. It reduces the feature space by a factor of 3 from 784 pixels to 256 kernel pixels before classification. I still think it is too large - the smallest model in the table is 1/8th its size. But for 183kB against 28kB, I think it’s a reasonable tradeoff for a 1% increase in accuracy.
I followed the tutorials of Nigel Adams and Clark Fitzgerald. Julia is a young language and is still short on resources and tutorials for Flux. I am therefore glad these two ventured into the unknown early with their articles.
I used the digits from the Char74k dataset, specifically Sample001-Sample010 in EnglishFnt.tgz.
It is useful to convert the Char74k dataset to the MNIST format so the same model can be used for both datasets.
This script will do that (be sure to have the correct inpath
for your data):
Now we can look into the training. Here are all the imports we will need:
I also wrote a few helper functions at ml_utils.jl. For example, here is the function to load the data:
An important caveat is that the data should be of type Float32
. This does noticeably increase the speed of training the model.
Next split the data into train, validation and test sets.
Here is the split_data
function:
It’s best to fix the seed at an arbitrary number so that the data is always “randomly” sorted the same way. This makes comparisons between training runs consistent.
Now we need to get the data into the required format for Flux.
The sample data needs to be in the form height×width×channels×batch_size.
To prevent using extra space, we can use Flux.DataLoader
which lazily batches the data. This means it doesn’t create a copy of the data (unlike split_data
) when allocating a batch for a small gradient descent step.
The test data and validation data however are all evaluated at once, and are much smaller, so I eagerly batch them with Flux.batch
. This does create duplicates.
The labels need to be one hot matrices for the loss function:
Now we can load the LeNet5 model from before:
Flux assumes you know the input and output sizes for the dense layers. Therefore we need to know the output size of the last convolution layer. This can be caluated with the following formula: $$ \left\lfloor \frac{i+2p-k}{s} \right\rfloor + 1 $$ Where $i$ is the input size, $p$ is the pad, $k$ is the kernel (filter) size and $s$ is the stride length. For a very thorough explanation, please see this paper: A guide to convolution arithmetic for deep learning.
In Julia code:
Apply this repeatedly to get the final output dimension size. For example, for LeNet5:We then need to define an accuracy function and a loss function:
The Flux.logitcrossentropy
calculates the following:
The logits are the direct outputs of the model, which can be any real number. The softmax function $\sigma$ converts them to values between to 0 and 1 which sum to 1. That is, it converts them to a probability distribution. ADAM is a well known and stable gradient descent algorithm that works well without much fine tuning.
With these function in place, we can simply call Flux.train!(loss, params(model), train_data, opt)
and wait for our model to train. But I wanted more. I wanted to create a history of the change in accuracy during training and return it. I wanted to have a print out each time a batch was completed. I wanted to save after every epoch, where an epoch is one full run through the entire training set.
So I copied the definition for Flux.train!
and edited it as follows:
Here is an example of the training history for LeNet5 trained on the Char74k dataset:
Finally we calculate the final test accuracy on the small test data we had left behind:
This was long section with unexpected detours. I hope you enjoyed it. We can now go to the final section where we can see the results of our model: part_5.
In the MNIST data, the 7s and 9sx are lowered with respect to the top of the image and the 6s are slightly raised. I presume this is because when this data was processed the centroid was used as the centre of image. I believe a better choice would have been to use the centre of the bounding box. This is what the digit extraction does in part 3. ↩
One can argue that you don’t need to train with zero because there should never be a zero in Sudoku. For the user experience I’d argue it is worth the marginal effort. If the user does have a zero in their image, for whatever reason, it will make the model appear very incompotent if it cannot classify zeros. But this is your personal preference. ↩
Flatten is exported by both DataFrames and Flux and therefore needs to be qualified. ↩