Wrapping up the Sudoku OCR reader series.
Update 5 August 2023: code refactoring.
This post is part of a series. The other articles are:
All code is available online at my repository: github.com/LiorSinai/SudokuReader-Julia.
Thank you for following along until now. This final part is split into the following sections:
First the required imports:
Now that we have all the pieces assembled, we can pass the outputs from one part as the input to the next:
The `extract_digits_from_grid` function requires a predictor which returns a tuple of `(label, confidence)` for each extracted digit image. Here is one constructed by closing it over the model:
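As a minimal sketch of such a closure, assume the model maps an image to ten raw scores (logits), one per digit 0 to 9. The names `make_predictor` and `dummy_model` are illustrative and not necessarily those used in the repository, and the softmax is written out by hand so the snippet needs no packages:

```julia
# Numerically stable softmax over a vector of raw scores.
function softmax(logits::AbstractVector{<:Real})
    exps = exp.(logits .- maximum(logits))
    return exps ./ sum(exps)
end

# Returns a closure over `model`.
# Assumption: `model(image)` yields 10 logits, ordered as the digits 0 to 9.
function make_predictor(model)
    function predictor(image)
        probabilities = softmax(vec(model(image)))
        confidence, idx = findmax(probabilities)
        label = idx - 1  # Julia indices start at 1, digit labels at 0
        return label, confidence
    end
    return predictor
end

# Stand-in model for illustration only: it always favours the digit 7.
dummy_model(image) = [i == 8 ? 5.0f0 : 0.0f0 for i in 1:10]

predictor = make_predictor(dummy_model)
label, confidence = predictor(zeros(Float32, 28, 28))
```

The closure keeps the model out of the function signature, so `extract_digits_from_grid` only needs to know that it can call `predictor(image)`.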
The output of `extract_digits_from_grid` is three 9×9 matrices: `grid`, `centres` and `confidences`.
The `grid` has the labels, the `centres` has the co-ordinates of the centres of the bounding boxes in the warped image, and the `confidences` the probability of the estimated labels. The latter are zero if no prediction was made.
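Because a zero confidence marks a cell with no prediction, the detected digits can be picked out with a simple mask. The matrices below are hypothetical stand-ins with the shapes described above, and storing each centre as an `(x, y)` tuple is an assumption:

```julia
# Hypothetical stand-ins for the three outputs.
grid = zeros(Int, 9, 9)
centres = fill((0.0, 0.0), 9, 9)
confidences = zeros(Float32, 9, 9)

# Pretend a 5 was read in row 1, column 3 with 98% confidence.
grid[1, 3] = 5
centres[1, 3] = (125.0, 25.0)
confidences[1, 3] = 0.98f0

detected = confidences .> 0     # true where a prediction was made
positions = findall(detected)   # one CartesianIndex per detected cell
labels = grid[detected]         # the labels of the detected cells
```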
Drawing text over the original numbers is easy if we use Plots.jl. We will need the `perspective_transform` function from part 3 to unwarp the centres back to their positions in the original image.
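The unwarping step can be sketched as follows. This does not reproduce `perspective_transform` from part 3. It assumes instead a 3×3 homography `H` that maps original image co-ordinates to warped ones, so its inverse maps a centre back. The helper `apply_homography` and the matrix values are made up for illustration:

```julia
using LinearAlgebra

# Applies a 3×3 homography to a 2D point using homogeneous co-ordinates.
function apply_homography(H::AbstractMatrix, point)
    x, y = point
    u, v, w = H * [x, y, 1.0]
    return (u / w, v / w)  # divide by w to return to 2D
end

# Hypothetical homography (original -> warped) with mild perspective terms.
H = [ 1.2    0.1    5.0;
     -0.05   1.1    8.0;
      0.0005 0.0002 1.0]

centre_warped = (225.0, 75.0)
centre_original = apply_homography(inv(H), centre_warped)  # unwarp
```

With Plots.jl, each unwarped centre could then be passed to `annotate!` along with the label as text.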
Here is the result:
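There are two things we can do which greatly improve the presentation: project grid lines onto the puzzle, and align the centres of the numbers along their rows and columns.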
First a very basic function for making lines which form a grid:
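A minimal sketch of such a function, assuming the grid covers a `width` × `height` rectangle starting at the origin and that each line is stored as a pair of end points. The name `make_grid_lines` and its signature are illustrative:

```julia
# Builds the line segments of an n×n grid over a width × height rectangle.
# Each line is a pair of end points ((x1, y1), (x2, y2)).
function make_grid_lines(width::Real, height::Real; n::Int=9)
    xs = range(0, width, length=n + 1)
    ys = range(0, height, length=n + 1)
    vertical = [((x, 0.0), (x, float(height))) for x in xs]
    horizontal = [((0.0, y), (float(width), y)) for y in ys]
    return vcat(vertical, horizontal)
end

# A 9×9 Sudoku grid needs 10 vertical and 10 horizontal lines.
grid_lines = make_grid_lines(450, 450)
```

Storing only the end points is enough, because a perspective transform maps straight lines to straight lines.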
Then here is a loop for projecting those lines onto the original image:
Next the `align_centres` function. We can use the mean of the co-ordinates of the numbers above and below a point to get its $x$ value, and similarly for numbers to the left and right of it for the $y$ value:
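One way to sketch this idea is shown below. The signature is an assumption: `centres` is a matrix of `(x, y)` tuples indexed by `[row, column]`, and `detected` marks the cells that contain a number. In this version a detected cell is included in its own column and row means, and a cell with no detected neighbours keeps its original co-ordinate:

```julia
using Statistics

function align_centres(centres::AbstractMatrix, detected::AbstractMatrix{Bool})
    aligned = copy(centres)
    for i in axes(centres, 1), j in axes(centres, 2)
        # x values of the detected numbers in the same column (above and below)
        col_xs = [centres[r, j][1] for r in axes(centres, 1) if detected[r, j]]
        # y values of the detected numbers in the same row (left and right)
        row_ys = [centres[i, c][2] for c in axes(centres, 2) if detected[i, c]]
        x = isempty(col_xs) ? centres[i, j][1] : mean(col_xs)
        y = isempty(row_ys) ? centres[i, j][2] : mean(row_ys)
        aligned[i, j] = (x, y)
    end
    return aligned
end

# Small 3×3 example: the middle row is shifted 3 pixels to the right.
centres = [(50.0 * j + (i == 2 ? 3.0 : 0.0), 50.0 * i) for i in 1:3, j in 1:3]
detected = trues(3, 3)
aligned = align_centres(centres, detected)
```

After alignment every number in a column shares one $x$ value and every number in a row shares one $y$ value, which is what makes the overlaid text look tidy.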
Applying these two functions makes the result look much more professional:
The final step is to pass the grid into a Sudoku solver, get those numbers back, and project them on to the grid. But I’ll stop here 🙂.
This application used several algorithms, some rather complex, to do a task that humans consider trivial. This is not to downplay the effort. The task is a complex one, and we only consider it trivial because our brains have exquisitely adapted to it.
We’ve used several algorithms along the way. It is worth taking stock of all of them and all the parameters that are needed. Some of these parameters are fixed, whether set explicitly or implied. For example, the blurring is the same in the horizontal and vertical directions, so one parameter is fixed. Others are free and may require hand tuning. Here is a table with an overview of all fixed and free parameters:¹
| Step | Algorithm | Fixed parameters | Free parameters |
|---|---|---|---|
| preprocess | imresize | 0 | 1 |
| preprocess | Gaussian Blur | 2 | 2 |
| preprocess | AdaptiveThreshold | 1 | 2 |
| detect grid | find_contours | 0 | 0 |
| detect grid | fit_rectangle | 0 | 0 |
| detect grid | fit_quad | 0 | 0 |
| extract digits | four_point_transform | 8 | 0 |
| extract digits | extract_digits_from_grid | 1 | 1 |
| extract digits | detect_in_centre | 0 | 2 |
| extract digits | label_components | 0 | 0 |
| extract digits | extract_component_in_square | 0 | 0 |
| predictor | pad_image | 1 | 1 |
| predictor | model (LeNet5) | 16 | 44426 |
| predictor | threshold | 0 | 1 |
For the image processing algorithms there are 9 free parameters. Some of these algorithms are restricted cases of more general ones. Others are more bespoke and are optimised specifically for one use case.
For machine learning, there are 44,426 free parameters. Compared to the hand-crafted image processing algorithms, the model is more general and can be repurposed (retrained) for other tasks such as recognising letters of the alphabet.
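The figure of 44,426 can be checked by hand. The sketch below assumes the standard LeNet5 layout on a 28×28 greyscale input: two 5×5 convolutions with 6 and 16 output channels, stride 1 and no padding, each followed by 2×2 max pooling, then dense layers of 120, 84 and 10 units. The helper names are illustrative:

```julia
# Convolution layer: one k1×k2 kernel per (input, output) channel pair,
# plus one bias per output channel.
conv_params(k1, k2, n_in, n_out) = k1 * k2 * n_in * n_out + n_out

# Dense layer: weight matrix plus one bias per output.
dense_params(n_in, n_out) = n_in * n_out + n_out

# Output size of a convolution or pooling window along one dimension.
output_size(n, k; stride=1, pad=0) = (n + 2pad - k) ÷ stride + 1

n = 28                            # assumed input size
n = output_size(n, 5)             # conv 1: 28 -> 24
n = output_size(n, 2; stride=2)   # pool 1: 24 -> 12
n = output_size(n, 5)             # conv 2: 12 -> 8
n = output_size(n, 2; stride=2)   # pool 2: 8 -> 4
flattened = n * n * 16            # 4 × 4 × 16 = 256

layer_counts = [
    conv_params(5, 5, 1, 6),       # 156
    conv_params(5, 5, 6, 16),      # 2416
    dense_params(flattened, 120),  # 30840
    dense_params(120, 84),         # 10164
    dense_params(84, 10),          # 850
]
total = sum(layer_counts)          # 44426
```

Most of the parameters sit in the first dense layer, which is typical for small convolutional networks.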
As with everything, one does not need to understand these algorithms in depth. But you do need sufficient knowledge of each in order to be able to integrate and fine tune them.
I hope you enjoyed this series and have a working Sudoku OCR reader yourself now.
¹ The 16 fixed parameters for LeNet5 are: $k_1$, $k_2$, $s$, $p$, $n_{out}$ for each convolution layer (5×2); $k_1$, $k_2$ for each max pool layer (2×2) and $n_{out}$ for the hidden dense layers (2×1). This count excludes other hyper-parameters such as training parameters, number of layers, number of choices for activation function etc.