Sudoku OCR reader in Julia: part 3

Posted on 10 August, 2021

Identifying and extracting numbers for the Sudoku OCR Reader.

This post is part of a series. The other articles are:

All code is available online at my repository: github.com/LiorSinai/SudokuReader.jl.

Part 3 - digit extraction

Straightening the image.

From part 2 we have a quadrilateral which represents our grid. We could mask our image to exlcude pixels outside the quadrilateral and crop it to the rectangle borders. For example:

But we can do one better: we can straighten the whole grid. While the grid extraction from the previous section might have been done with machine learning, I don’t think this straightening can be.

PyImageSearch has a great Python tutorial on how to do this. I’ve recreated the code and their explanations here in Julia. It was slightly harder than using OpenCV in Python because I had to write more of the functions myself.

Firstly our points should be in a consistent order. This is so that we map the top-left point of the quadrilateral to the top-left point of the rectangle and so on. As a sanity check, we can do the following:

Next we calculate a four point transformation matrix that will transform our four quadrilateral points to four rectangular points. We can use fit_rectangle from part_2 to get our destination rectangle. We can use warp from ImageTransformations.jl to apply the matrix to our image.

We need to return the inverse matrix because it is needed for projecting text back on to the grid, which we will do in part 5. The function get_perspective_matrix is detailed in the more detail block below.

I’ve built the perspective_transform function with the function composition notation of CoordinateTransformations.jl as follows:

It is mathematically equivalent to:

For some reason the function composition way is about 2× faster. I am not sure if this is because the packages work nicely together or for deeper reasons with the compiler.

Here is the result of the warping:

OpenCV comes with a fuction called getPerspectiveTransform. I had to write the Julia version of this myself. To explain it properly, I would need to explain pinhole cameras, projection matrices, rotation matrices and more. Here is a source which does that: estimating a homography matrix. For now, all I am going to say is that the calculations reduce to multiplication of every pixel with a special matrix called a homography matrix: $$\begin{bmatrix} u' \\ v' \\ w \\ \end{bmatrix} = \begin{bmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23}\\ c_{31} & c_{32} & 1 \\ \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \\ \end{bmatrix}$$ We then calculate the warped points as: $$u = \frac{u'}{w} = \frac{c_{11}x+c_{12}y+c_{13}}{c_{31}+c_{32} + 1}, \; v = \frac{v'}{w} = \frac{c_{21}x+c_{22}y+c_{23}}{c_{31}+c_{32} + 1}$$ We have 8 unknowns in $c_{ij}$ and 8 equations from 4 pairs of $(x, y)$ to $(u, v)$. We can rearrange these 8 equations into a single linear algebra problem: $$\begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \\ v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix} = \begin{bmatrix} x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1u_1 & -y_1x_1\\ x_2 & y_2 & 1 & 0 & 0 & 0 & -x_2u_2 & -y_2x_2\\ x_3 & y_3 & 1 & 0 & 0 & 0 & -x_3u_3 & -y_3x_3\\ x_4 & y_4 & 1 & 0 & 0 & 0 & -x_3u_3 & -y_3x_3\\ 0 & 0 & 0 & x_1 & y_1 & 1 & -x_1v_1 & -y_1v_1\\ 0 & 0 & 0 & x_2 & y_2 & 1 & -x_2v_2 & -y_2v_2\\ 0 & 0 & 0 & x_3 & y_3 & 1 & -x_3v_3 & -y_3v_3\\ 0 & 0 & 0 & x_4 & y_4 & 1 & -x_4u_4 & -y_4v_4 \end{bmatrix} \begin{bmatrix} c_{11} \\ c_{12} \\ c_{13} \\ c_{21} \\ c_{22} \\ c_{23} \\ c_{31} \\ c_{32} \end{bmatrix}$$ Here is the code:

Now that we have our matrix, we have a problem: our matrix won't map a set of discrete pixels to another set of discrete pixels. Thankfully the package ImageTransformations.jl handles this for us. It uses the backwards algorithm: instead of warping our pixels to the rectangle, it warps pixels back from the rectangle to the quadrilateral and then interpolates a colour from the nearby pixels it lands in. This is why the perspective_transform function uses the inverse matrix invM.

We can see the effect of the linear interpolation by performing multiple warps. This is what the image looks like after warping to and back 50 times: If we change perspective in reality, we gain information. But by virtually changing perspective we lose information. For a single warp however this loss in information is small and not serious.

Digit detection and extraction

Now that the grid is a rectangle we can divide it into 9×9 cells - regions of interest - and check whether a digit is present in each cell. I’ve added some padding so that there is overlap between each cells. This provides a margin of error for the straightening. I’ve separated the check for a digit into two questions:

1. Is there a component in the centre of the region of interest?
2. If yes, what are the bounds and centre of this component?

Answering the second questions allows us to extract the digit which we can then send off to our machine learning algorithm. That is explained in the next section. For now, here is the digit extraction loop in full:

The next two sections detail the centre component detection and bounding box algorithms. Prediction is detailed in part 5.

Centre component detection

Here is a simple algorithm to detect whether or not there is a component in the centre:

1. Draw a white circle in the middle of a blank image the same size as the region of interest.
2. Element-wise multiply the circle kernel with the region of interest.
3. If the ratio of non-zero pixels to the circle area is greater than some threshold return true, else false.

This is what it looks like:

Here is the code:

The ratio for the digits lies between 15%-50% so I’ve set the threshold at a conservative 10%. The assumption is that the machine learning model will be able to distinguish between random artefacts and true digits; we’re just removing the most obviously empty cells.

Lastly, here is code to make a circle with discrete pixels:

Component extraction

Once we are sure the region of interest is not empty, we can extract the digit. We could use the same contour algorithm as in part 2. However we want all the digit’s pixels not just the border. A better approach is the concept of connected components, where each connected component is made up of non-zero pixels that are neighbours of each other. The simplest of these algorithms is a floodfill approach: find one pixel, then determine its neighbours and those neighbours’ neighbours and so on. I’ve used a more complex and efficient algorithm provided by ImageMorphology.jl called label_components. I must confess, I don’t fully understand it. You may inspect the source code here. This function returns a matrix where each component’s pixels have a unique index, starting from 1. For example, here is a label map for the whole image: