Identifying and extracting numbers for the Sudoku OCR Reader.
Update 5 August 2023: code refactoring.
This post is part of a series. The other articles are:
All code is available online at my repository: github.com/LiorSinai/SudokuReader-Julia.
From part 2 we have a quadrilateral which represents our grid. We could mask our image to exclude pixels outside the quadrilateral and crop it to the rectangle borders. For example:
But we can do one better: we can straighten the whole grid. While the grid extraction from the previous section might have been done with machine learning, I don’t think this straightening can be.
PyImageSearch has a great Python tutorial on how to do this. I’ve recreated the code and their explanations here in Julia. It was slightly harder than using OpenCV in Python because I had to write more of the functions myself.
Firstly our points should be in a consistent order. This is so that we map the top-left point of the quadrilateral to the top-left point of the rectangle and so on. As a sanity check, we can do the following:
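The exact check depends on the chosen convention. Here is a minimal sketch, assuming the corners are (x, y) tuples ordered top-left, top-right, bottom-right, bottom-left (the function name check_corner_order is illustrative, not the repository's):

```julia
# Illustrative sanity check: with corners ordered top-left, top-right, bottom-right,
# bottom-left, the top-left corner has the smallest x + y and the bottom-right the largest.
function check_corner_order(corners)
    sums = [x + y for (x, y) in corners]
    @assert argmin(sums) == 1 "expected the first corner to be top-left"
    @assert argmax(sums) == 3 "expected the third corner to be bottom-right"
    corners
end
```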
Next we calculate a four-point transformation matrix that will transform our four quadrilateral points to four rectangular points. We can use fit_rectangle from part 2 to get our destination rectangle.
We can use warp from ImageTransformations.jl to apply the matrix to our image.
We need to return the inverse matrix because it is needed for projecting text back onto the grid, which we will do in part 5. The function get_perspective_matrix is detailed in the more detail block below.
I’ve built the perspective_transform function with the function composition notation of CoordinateTransformations.jl as follows:
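This is not the repository's exact code, but a minimal sketch of the idea, assuming M is the 3×3 homography matrix returned by get_perspective_matrix, might look like this:

```julia
using CoordinateTransformations, ImageTransformations, StaticArrays

function perspective_transform(M::AbstractMatrix)
    # lift (x, y) to homogeneous (x, y, 1), multiply by M, then divide by the last coordinate
    push1 = v -> SVector(v[1], v[2], 1.0)
    PerspectiveMap() ∘ LinearMap(SMatrix{3, 3}(M)) ∘ push1
end

# warp maps output pixels back to input pixels, hence the inverse matrix invM:
# warped = warp(image, perspective_transform(invM), (1:height, 1:width))
```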
It is mathematically equivalent to:
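As a sketch (again illustrative, not the original listing), the explicit version multiplies out the matrix and divides by the homogeneous coordinate $w$ directly:

```julia
using StaticArrays

function perspective_transform_direct(M::AbstractMatrix)
    function (point)
        x, y = point[1], point[2]
        # w is the homogeneous coordinate; dividing by it gives the warped point (u, v)
        w = M[3, 1] * x + M[3, 2] * y + M[3, 3]
        SVector(
            (M[1, 1] * x + M[1, 2] * y + M[1, 3]) / w,
            (M[2, 1] * x + M[2, 2] * y + M[2, 3]) / w,
        )
    end
end
```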
For some reason the function composition version is about 2× faster. I am not sure if this is because the packages are designed to work well together or because of deeper compiler optimisations.
Here is the result of the warping:
More detail on perspective transformations ⇩
OpenCV provides this calculation through getPerspectiveTransform. I had to write the Julia version of this myself.
To explain it properly, I would need to explain pinhole cameras, projection matrices, rotation matrices and more. Here is a source which does that: estimating a homography matrix.
For now, all I am going to say is that the calculations reduce to multiplication of every pixel with a special matrix called a homography matrix:
$$
\begin{bmatrix}
u' \\
v' \\
w \\
\end{bmatrix}
=
\begin{bmatrix}
c_{11} & c_{12} & c_{13} \\
c_{21} & c_{22} & c_{23}\\
c_{31} & c_{32} & 1 \\
\end{bmatrix}
\begin{bmatrix}
x \\
y \\
1 \\
\end{bmatrix}
$$
We then calculate the warped points as:
$$
u = \frac{u'}{w} = \frac{c_{11}x+c_{12}y+c_{13}}{c_{31}x+c_{32}y + 1}, \;
v = \frac{v'}{w} = \frac{c_{21}x+c_{22}y+c_{23}}{c_{31}x+c_{32}y + 1}
$$
We have 8 unknowns in $c_{ij}$ and 8 equations from 4 pairs of $(x, y)$ to $(u, v)$.
We can rearrange these 8 equations into a single linear algebra problem:
$$
\begin{bmatrix}
u_1 \\
u_2 \\
u_3 \\
u_4 \\
v_1 \\
v_2 \\
v_3 \\
v_4
\end{bmatrix}
=
\begin{bmatrix}
x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1u_1 & -y_1u_1\\
x_2 & y_2 & 1 & 0 & 0 & 0 & -x_2u_2 & -y_2u_2\\
x_3 & y_3 & 1 & 0 & 0 & 0 & -x_3u_3 & -y_3u_3\\
x_4 & y_4 & 1 & 0 & 0 & 0 & -x_4u_4 & -y_4u_4\\
0 & 0 & 0 & x_1 & y_1 & 1 & -x_1v_1 & -y_1v_1\\
0 & 0 & 0 & x_2 & y_2 & 1 & -x_2v_2 & -y_2v_2\\
0 & 0 & 0 & x_3 & y_3 & 1 & -x_3v_3 & -y_3v_3\\
0 & 0 & 0 & x_4 & y_4 & 1 & -x_4v_4 & -y_4v_4
\end{bmatrix}
\begin{bmatrix}
c_{11} \\
c_{12} \\
c_{13} \\
c_{21} \\
c_{22} \\
c_{23} \\
c_{31} \\
c_{32}
\end{bmatrix}
$$
Here is the code:
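A sketch of what this solve can look like, assuming source and destination are each four (x, y) points (the repository version may differ in details):

```julia
# Solve the 8×8 system above for the homography coefficients c11..c32.
function get_perspective_matrix(source, destination)
    A = zeros(8, 8)
    b = zeros(8)
    for i in 1:4
        x, y = source[i]
        u, v = destination[i]
        A[i, :]     = [x, y, 1, 0, 0, 0, -x * u, -y * u]   # row for u_i
        A[i + 4, :] = [0, 0, 0, x, y, 1, -x * v, -y * v]   # row for v_i
        b[i], b[i + 4] = u, v
    end
    c = A \ b   # solve the linear system
    [c[1] c[2] c[3]; c[4] c[5] c[6]; c[7] c[8] 1.0]
end
```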
Now that we have our matrix, we have a problem: our matrix won't map a set of discrete pixels to another set of discrete pixels. Thankfully the package ImageTransformations.jl handles this for us.
It uses the backwards algorithm: instead of warping our pixels to the rectangle, it warps pixels back from the rectangle to the quadrilateral and then interpolates a colour from the nearby pixels it lands on. This is why the perspective_transform function uses the inverse matrix invM.
We can see the effect of the linear interpolation by performing multiple warps. This is what the image looks like after warping to and back 50 times:
If we change perspective in reality, we gain information. But by virtually changing perspective we lose information. For a single warp, however, this loss of information is small and not serious.

Now that the grid is a rectangle we can divide it into 9×9 cells - regions of interest - and check whether a digit is present in each cell. I’ve added some padding so that there is overlap between adjacent cells. This provides a margin of error for the straightening. I’ve separated the check for a digit into two questions:
1. Is there anything in the centre of the cell?
2. If so, where exactly is the digit?
Answering the second question allows us to extract the digit, which we can then send off to our machine learning algorithm. That is explained in the next section. For now, here is the digit extraction loop in full:
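A sketch of its overall shape is below (the complete version, with the author's exact helper names, is in the repository linked above). Here detect_in_centre and get_bounding_box stand in for the helpers detailed in the next two sections, and predictor is the model from part 5.

```julia
height, width = size(grid)
step_h, step_w = height / 9, width / 9
pad = 0.2   # fractional padding so that neighbouring cells overlap slightly
guesses = zeros(Int, 9, 9)
for i in 1:9, j in 1:9
    # cell boundaries, padded and clamped to the image
    top    = max(1, floor(Int, (i - 1 - pad) * step_h) + 1)
    bottom = min(height, ceil(Int, (i + pad) * step_h))
    left   = max(1, floor(Int, (j - 1 - pad) * step_w) + 1)
    right  = min(width, ceil(Int, (j + pad) * step_w))
    roi = grid[top:bottom, left:right]
    if detect_in_centre(roi)                 # question 1: is anything there?
        rows, cols = get_bounding_box(roi)   # question 2: where is the digit?
        digit = roi[rows, cols]
        guesses[i, j] = predictor(digit)
    end
end
```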
The next two sections detail the centre component detection and bounding box algorithms.
The predictor is detailed in part 5.
Here is a simple algorithm to detect whether or not there is a component in the centre:
This is what it looks like:
Here is the code:
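A sketch of such a check, assuming the region of interest roi is already binarized (the function and keyword names here are illustrative):

```julia
function detect_in_centre(roi::AbstractMatrix; threshold::Real=0.1, radius_fraction::Real=0.25)
    height, width = size(roi)
    centre = (height / 2, width / 2)
    radius = radius_fraction * min(height, width)
    # boolean mask of the pixels inside a circle at the centre of the cell
    in_circle = [(i - centre[1])^2 + (j - centre[2])^2 <= radius^2
                 for i in 1:height, j in 1:width]
    # fraction of foreground pixels inside the circle
    ratio = count(>(0), roi[in_circle]) / count(in_circle)
    ratio > threshold
end
```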
The ratio for the digits lies between 15% and 50%, so I’ve set the threshold at a conservative 10%. The assumption is that the machine learning model will be able to distinguish between random artifacts and true digits; we’re just removing the most obviously empty cells.
Lastly, here is code to make a circle with discrete pixels:
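For instance, something along these lines (an assumed form, returning the indices of the pixels inside the circle):

```julia
function make_circle(centre::Tuple{Int, Int}, radius::Real)
    ci, cj = centre
    r = ceil(Int, radius)
    indices = CartesianIndex{2}[]
    for i in (ci - r):(ci + r), j in (cj - r):(cj + r)
        # keep the pixel if its distance from the centre is within the radius
        if (i - ci)^2 + (j - cj)^2 <= radius^2
            push!(indices, CartesianIndex(i, j))
        end
    end
    indices
end
```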
Once we are sure the region of interest is not empty, we can extract the digit.
We could use the same contour algorithm as in part 2.
However, we want all of the digit’s pixels, not just the border.
A better approach is the concept of connected components, where each connected component is made up of non-zero pixels that are neighbours of each other.
The simplest of these algorithms is a flood-fill approach: find one pixel, then determine its neighbours and those neighbours’ neighbours, and so on. I’ve used a more complex and efficient algorithm provided by ImageMorphology.jl called label_components.
I must confess, I don’t fully understand it. You may inspect the source code here.
This function returns a matrix where each component’s pixels have a unique index, starting from 1.
For example, here is a label map for the whole image:
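As a small illustration of the output format, where background pixels stay 0:

```julia
using ImageMorphology

A = Bool[
    1 1 0 0 0
    1 1 0 0 1
    0 0 0 1 1
]
labels = label_components(A)
#  1  1  0  0  0
#  1  1  0  0  2
#  0  0  0  2  2
```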
Doing this for each small region of interest allows us to extract the digits with a tight bounding box while removing all other components. Here is the full code:¹
This function uses a struct which stores information about each connected component:
It is calculated from a labels matrix as follows:
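A compact sketch of these pieces, from the struct through to cropping the digit; the field and function names here are illustrative rather than the repository's exact ones:

```julia
using ImageMorphology

# Illustrative struct: label, bounding box and size of one connected component.
struct ConnectedComponent
    label::Int
    bounding_box::Tuple{UnitRange{Int}, UnitRange{Int}}  # (rows, cols)
    area::Int
end

# Build a ConnectedComponent for each label in a labels matrix.
function connected_components(labels::AbstractMatrix{<:Integer})
    components = ConnectedComponent[]
    for label in 1:maximum(labels)
        pixels = findall(==(label), labels)
        isempty(pixels) && continue
        rows = extrema(idx[1] for idx in pixels)
        cols = extrema(idx[2] for idx in pixels)
        push!(components, ConnectedComponent(
            label,
            (rows[1]:rows[2], cols[1]:cols[2]),
            length(pixels),
        ))
    end
    components
end

# Keep the component whose bounding box centre is closest to the centre of the cell
# and crop to its bounding box, discarding everything else.
function extract_digit(roi::AbstractMatrix)
    labels = label_components(roi .> 0)
    components = connected_components(labels)
    isempty(components) && return roi
    centre = size(roi) ./ 2
    mid(r) = (first(r) + last(r)) / 2
    distance(c) = sum(abs2, (mid(c.bounding_box[1]), mid(c.bounding_box[2])) .- centre)
    best = argmin(distance, components)
    rows, cols = best.bounding_box
    (roi .* (labels .== best.label))[rows, cols]
end
```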
Now that we have individual digits we can pass them to our machine learning function. This is explained next in part 4.
We need to binarize again to remove distortions from warping. Originally I used the AdaptiveThreshold method, which applies different thresholds throughout the image. This is needed for a large, complex image. Here we have much smaller images, so a more straightforward algorithm called Otsu’s method can be used. This automatically calculates and applies a single threshold for the whole image. This threshold minimises the intra-class variance - it attempts to set a threshold such that the pixels within each class (background and foreground) have values that are as similar as possible.
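If ImageBinarization.jl is the package used for this step (it implements both AdaptiveThreshold and Otsu), applying Otsu’s method to a region of interest looks like this:

```julia
using ImageBinarization

# Otsu's method picks a single global threshold automatically.
binarized = binarize(roi, Otsu())
```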