Wrapping up the Sudoku OCR reader series.
This post is part of a series. The other articles are:
All code is available online at my repository: github.com/LiorSinai/SudokuReader.jl.
Thank you for following along until now. This final part is split into the following sections:
First the required imports:
Now that we have all the pieces assembled, we can pass the outputs from one part as the input to the next:
`read_digits` uses a function called `prediction`. It provides a wrapper around the output of the model, which is a vector of logits.
The softmax probability is a useful proxy for how confident the model is in its prediction. On the training data, the confidence for correct predictions is 100%.
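To make the idea concrete, here is a minimal sketch of such a wrapper. The names `softmax_probs` and `predict_digit`, the `threshold` default and the assumption that the logits are ordered for the digits 0 to 9 are all illustrative, and are not necessarily what the repository uses.

```julia
# Numerically stable softmax: subtracting the maximum does not change the result.
function softmax_probs(logits::AbstractVector{<:Real})
    exps = exp.(logits .- maximum(logits))
    return exps ./ sum(exps)
end

# Hypothetical wrapper around the logits.
# Returns (digit, probability), or (0, 0.0) if the confidence is below the threshold.
# Assumes logits[1] corresponds to the digit 0, logits[2] to the digit 1, and so on.
function predict_digit(logits::AbstractVector{<:Real}; threshold::Real=0.5)
    probabilities = softmax_probs(logits)
    p, idx = findmax(probabilities)
    return p >= threshold ? (idx - 1, p) : (0, 0.0)
end

logits = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
digit, p = predict_digit(logits)  # digit is 3, with a probability of about 0.94
```

Returning a probability of zero when the threshold is not met matches the convention used for the `probabilities` matrix.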
The output of `read_digits` is three 9×9 matrices: `grid`, `centres` and `probabilities`.
The `grid` holds the predicted numbers, the `centres` holds the co-ordinates of the centres of the bounding boxes in the warped image, and the `probabilities` holds the maximum softmax probability for each cell. A probability is zero if no prediction was made for that cell.
Drawing text over the original numbers is easy if we use Plots.jl. We will need the `perspective_transform` function from part 3 to unwarp the centres back to their positions in the original image.
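As a reminder of the underlying operation, a point is unwarped by applying the inverse of the 3×3 homography matrix in homogeneous co-ordinates. The sketch below assumes points are stored as `(x, y)` tuples; `apply_homography` and `unwarp_point` are hypothetical names and not the signature of `perspective_transform` from part 3.

```julia
using LinearAlgebra

# Apply a 3×3 homography matrix M to a point (x, y) using homogeneous co-ordinates.
function apply_homography(M::AbstractMatrix{<:Real}, point)
    x, y = point
    u, v, w = M * [x, y, 1.0]
    return (u / w, v / w)  # divide by w to return to Cartesian co-ordinates
end

# Unwarping uses the inverse of the matrix that produced the warped image.
unwarp_point(M::AbstractMatrix{<:Real}, point) = apply_homography(inv(M), point)

# Example with an arbitrary projective matrix (not taken from a real image).
M = [1.0 0.0 0.0; 0.0 1.0 0.0; 0.001 0.0 1.0]
warped = apply_homography(M, (100.0, 50.0))
original = unwarp_point(M, warped)  # approximately (100.0, 50.0)
```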
Here is the result:
There are two things we can do which greatly improve the presentation: project a grid onto the original image, and align the centres of the numbers.
First a very basic function for making lines which form a grid:
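A minimal sketch of such a function is given here. The name `make_grid_lines`, the `ncells` keyword and the representation of a line as a pair of `(x, y)` end points are assumptions made for illustration.

```julia
# Hypothetical helper: end points of the lines of a grid drawn over a
# height × width image. Each line is a pair of (x, y) points.
function make_grid_lines(height::Real, width::Real; ncells::Int=9)
    lines = Vector{NTuple{2, Tuple{Float64, Float64}}}()
    for k in 0:ncells
        x = k * width / ncells
        y = k * height / ncells
        push!(lines, ((x, 0.0), (x, Float64(height))))  # vertical line
        push!(lines, ((0.0, y), (Float64(width), y)))   # horizontal line
    end
    return lines
end

lines = make_grid_lines(270, 270)  # 10 vertical and 10 horizontal lines
```

Because each line is stored only by its two end points, projecting it onto another image only requires transforming those two points.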
Then here is a loop for projecting those lines onto the original image:
Next the `align_centres` function. We can use the mean of the co-ordinates of the numbers above and below a point to get its $x$ value, and similarly for numbers to the left and right of it for the $y$ value:
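The averaging can be sketched as follows. This version assumes `centres[i, j]` is an `(x, y)` tuple for row `i` and column `j`, and that `detected` is a Boolean mask such as `probabilities .> 0`. It also includes the point itself in each mean, which is a simplification. The name `align_centres_sketch` is hypothetical.

```julia
using Statistics: mean

# Each detected centre takes the mean x of the detected centres in its column
# and the mean y of the detected centres in its row. Other cells are unchanged.
function align_centres_sketch(centres::AbstractMatrix, detected::AbstractMatrix{Bool})
    aligned = Matrix{Tuple{Float64, Float64}}(undef, size(centres))
    for j in axes(centres, 2), i in axes(centres, 1)
        x, y = centres[i, j]
        if detected[i, j]
            # x: mean over detected digits in the same column (above and below)
            x = mean(c[1] for c in centres[detected[:, j], j])
            # y: mean over detected digits in the same row (left and right)
            y = mean(c[2] for c in centres[i, detected[i, :]])
        end
        aligned[i, j] = (x, y)
    end
    return aligned
end

# Column-major fill: [1,1], [2,1], [1,2], [2,2]
centres = reshape([(10.0, 10.0), (12.0, 20.0), (21.0, 11.0), (19.0, 19.0)], 2, 2)
aligned = align_centres_sketch(centres, trues(2, 2))
# aligned[1, 1] == (11.0, 10.5)
```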
Applying these two functions makes the result look much more professional:
The final step is to pass the grid into a Sudoku solver, get those numbers back, and project them on to the grid. But I’ll stop here 🙂.
This application used several algorithms, some rather complex, to do a task that humans consider trivial. This is not to downplay the effort. The task is a complex one, and we only consider it trivial because our brains have exquisitely adapted to it.
We’ve used several algorithms along the way. It is worth taking stock of all of them and all the parameters that are needed. Some of these parameters are fixed, whether set explicitly or implied. For example, the blurring is done the same in the horizontal and vertical directions and so one parameter is fixed. Others are free and may require hand tuning. Here is a table with an overview of all fixed and free parameters:^{1}
| Step | Algorithm | Fixed parameters | Free parameters |
|---|---|---|---|
| preprocess | imresize | 0 | 1 |
| | Gaussian Blur | 2 | 2 |
| | AdaptiveThreshold | 1 | 2 |
| detect_grid | find_contours | 0 | 0 |
| extract_digits | warp | 8 | 0 |
| | read_digits | 1 | 1 |
| | detect_in_centre | 0 | 2 |
| | extract_digit (label_components) | 0 | 0 |
| prediction | pad_image | 1 | 1 |
| | model (LeNet5) | 16 | 44426 |
| | threshold | 0 | 1 |
For the image processing algorithms there are 9 free parameters. Some are subsets of more diverse algorithms. Others are more bespoke and are optimised specifically for one use case.
For machine learning, there are 44,426 free parameters. Compared to the hand-crafted image processing algorithms, the model is more general and can be repurposed (retrained) for other tasks such as recognising letters of the alphabet.
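The figure of 44,426 can be reproduced with some arithmetic. The sketch below assumes a standard LeNet5 layout: a 28×28×1 input, two 5×5 convolutions without padding (6 and 16 filters), each followed by 2×2 max pooling, and dense layers of 120, 84 and 10 units. These layer sizes are an assumption that happens to give the same total, and the helper names are hypothetical.

```julia
# Parameter counts per layer type: weights plus one bias per output.
conv_params(k1, k2, n_in, n_out) = k1 * k2 * n_in * n_out + n_out
dense_params(n_in, n_out) = n_in * n_out + n_out

# Output size along one dimension after a convolution or pooling window.
out_size(n, k; stride=1, pad=0) = (n + 2 * pad - k) ÷ stride + 1

function lenet5_param_count()
    n = 28                             # assumed input height and width
    total = conv_params(5, 5, 1, 6)    # 156
    n = out_size(n, 5)                 # 24
    n = out_size(n, 2; stride=2)       # 12
    total += conv_params(5, 5, 6, 16)  # 2416
    n = out_size(n, 5)                 # 8
    n = out_size(n, 2; stride=2)       # 4
    flat = n * n * 16                  # 256 inputs to the first dense layer
    total += dense_params(flat, 120) + dense_params(120, 84) + dense_params(84, 10)
    return total
end

lenet5_param_count()  # 44426
```

The max pooling layers add no free parameters, which is why only the convolution and dense layers contribute to the total.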
As with everything, you do not need to understand these algorithms in depth. But you do need sufficient knowledge of each to be able to integrate and fine-tune them.
I hope you enjoyed this series and have a working Sudoku OCR reader yourself now.
^{1} The 16 fixed parameters for LeNet5 are: $k_1$, $k_2$, $s$, $p$, $n_{out}$ for each convolution layer (5×2); $k_1$, $k_2$ for each max pool layer (2×2); and $n_{out}$ for each hidden dense layer (2×1). This count excludes other hyper-parameters such as training parameters, the number of layers and the choice of activation function.