# Regex matching with nonograms and finite state machines

Posted on 29 October, 2020

I recently solved a particular kind of puzzle, nonograms, using finite state machines for regex matching. This is a very efficient way to do regex matching and in fact formed the basis of the first regex matchers. But finite state machines have since been replaced with more versatile but slower backtracking algorithms.

## Introduction

I recently got hooked on a new type of puzzle, nonograms. Wikipedia has a great entry explaining them here. Here is a small example:

Each puzzle consists of a grid with numbers along the rows and columns. These numbers indicate connected sets of shaded cells along that row/column, with an unknown amount of unshaded cells in between. For example, the clue (4, 2) along row 2 means that in this row there are an unknown number of white cells, followed by 4 shaded cells, followed by one or more white cells, followed by 2 shaded cells, and then another unknown number of white cells. The aim is to shade the cells so that all clues read correctly. Puzzles have the added bonus that they are often designed to result in a picture. For example, here is the unique solution to the above puzzle:

I wrote solvers for these puzzles in both C++ and Python. Of course, such a problem is already well investigated. See Jan Wolter’s survey of online solvers. My solvers are actually rather slow compared to some of those. It takes more than a minute for some puzzles, where as Jan claims his own solver took a second. See also Rosetta code’s collection of brute force solvers in many different languages. These solvers are slow but will always solve the puzzle.

This problem shares many characteristics with Sudoku, which I covered in a previous blog post. Like Sudoku, the puzzle involves a grid and the general problem is NP complete. Also like Sudoku, many puzzles are designed with the intention of having a unique solution, but there is no guarantee for this. Lastly, the most efficient way to solve it is through constraint propagation, but if you get stuck guessing is the simplest way forward.

Unlike my Sudoku post, I am not going to fully describe the solvers here. Instead, I would like to concentrate on one particular aspect of the solver, pattern matching. I think this is the most valuable part of the solver, because this technique is transferable to many other situations.

## How to solve nonograms

A blank nonogram puzzle can be daunting. Thankfully there is a simple technique to get going. For every row/column, construct the left-most/top-most solution. Then slide it across all the way to the right/bottom. The cells that always remain part of the same sequence (whether black or white) can be shaded in.

For example, for the above nonogram:

Row 6 is the row of interest. Rows 5 and 7 show the left-most and right-most solutions respectively (and not solutions for row 5 or 7). The shaded cells in row 6 are the cells that always overlap in row 5 and row 7. These can be kept black, and all other cells should be left unknown.

This process is then completed for every row. Next, it is done for each column, except this time the solution is constrained by the cells already shaded in each row. This process can then be repeated on each row with the new constraints. And so on. For many puzzles, the entire puzzle can be solved with this one technique alone.

Before continuing, I would like to showcase examples where this fails. The first is this aeroplane puzzle:

In the left image, the left-right matcher algorithm has got stuck. But this puzzle can still be solved with other logic. Notably, two white blocks in the middle are always white regardless of which sequence the single blacks in the corresponding rows belong too. This unlocks the rest of the puzzle. It is not obvious - in fact in the source video, the guy resorts to guessing.

The next two puzzles are more challenging. Both consist of many small black runs with plenty of white space:

At first logic line solving will not get you anywhere with these puzzles. But a few guesses will unlock them. A caveat though with “Domino Logic III” - a string of bad guesses can lead down a seemingly never ending path of guesses.

## Regex and nondeterministic finite state automata

The trickiest part of the left-right algorithm is finding the left-most match.1 For an empty puzzle, this is trivial. However as more cells are shaded and the puzzle gets larger, this gets harder. See for example the above half finished puzzle (final solution). The main difficulty comes from deciding whether to place a white or not. This binary choice leads to $\mathcal{O}(2^n)$ complexity. I originally implemented a depth-first search (DFS) with backtracking, but for large puzzles (100x100) this was too slow. This was especially crippling on puzzles which required guessing, which adds another layer of backtracking.

I tried coding other matching algorithms, but they always seemed to turn into DFS. It is easy to write heuristics for simple cases, but the general case eluded me. So I looked around for more efficient matching algorithms, and came across this often cited blog post by Russ Cox. It describes Ken Thompson’s $\mathcal{O}(n^2)$ matching algorithm for regex2 using nondeterministic finite state automata (NFA).

The nonogram problem here can easily be rephrased as a regex problem. This is done as follows:

• Encode in binary:
• black cell: $01_2 = 1$
• white cell: $10_2 = 2$
• either: $11_2 = 3$
• Write each nonogram pattern (a, b, …, z) as the following regex: ([23]*)([13]){a}([23]+)([13]){b}([23]+)...([13]){z}([23]*).

Regex matchers themselves are usually DFS algorithms. For the most part this is more practical than finite state automata. DFS allows more complex regular expressions (Cox’s article has a full list in his post). However the exponential nature of backtracking can cause them to fail, sometimes in spectacular fashion.

An alternative is to represent the character expression as a nondeterministic finite state automation. Let’s break that name down:

• automation: a sort of machine/program. I find the term archaic but it does come from a time before computers.
• finite state: there is a countable number of states. This is opposed to infinite or continuous states, such as measurements with gradients in nature, or states which depend on such a large combination of elements that they are for practical purposes uncountable.
• nondeterministic: given a state, it cannot always be determined what the previous state was. Actually, that is a minor problem here because we do want to know how many times we have passed through each state, not just the current state. But we can store this information outside of the NFA.

The NFA is composed of states S which transition in one of the following ways:

We can use these components to build up a full state automation for our regex pattern:

We start at the start state S, and as we read characters, transition to the right. We only transition if the next character matches the next state. If the next character doesn’t match any transition, then we halt the matching process.

If we were to backtrack on a halt, then this algorithm would still be an $\mathcal{O}(2^n)$ process. It is made into an $O(n^2)$ algorithm by doing the following:

1. At each split with multiple valid transitions, simultaneously do all.
2. Only keep track of current active states (and not the history).

Only doing the first would result in breadth-first search (BFS), which still has $O(2^n)$ complexity. The second step however means that the number of active states per character is a function of the pattern length. That is $O(n)$. So since we process $n$ characters,the whole algorithm is $O(n^2)$.

The minor difficulty is that we do want to keep track of part of the history - that is, the current matching vectors - as well as the current state. But we can keep one match per state without increasing the complexity, and since we only want the left-most match, we can just keep the left-most match per state.

## Code

This is my C++ code for the NFA matcher. For the full code, see my Github repository.

Firstly, three types of objects are defined in the header: Match, State and NonDeterministicFiniteAutomation:3

The first step is to compile the pattern to a finite state machine. This is tricky because of potential nested expressions and also because of skipping states (e.g. with ? or *). The Thompson construction first converts the pattern to postfix notation. It then uses a stack to keep track of the start and end states. For my implementation here, I’ve stored all the states in the vector states, and the stack holds the indices of each state in that vector. These indices are referred to as “state IDs” in the code. Then depending on the input symbol, a different action is taken.

After compiling, we can run our simulation. In practice, the active states are kept in a hash table, with the key being the state ID, and the entry being the matching vector so far. The unique key property of the hash table guarantees that our current states are all unique. A duplicate state is either overwritten or not entered into the table.

We cannot advance each state simultaneously, but this is very closely achieved by advancing each state one step at a time, one after the other. A second hash table is used to make sure that the new transitions are all independent. That is, an old state is not overwritten because it was a new state for another state that should have advanced simultaneously with it, but happened to advance one step earlier.

## Conclusion

String matching with NFAs is a useful technique that I’m sure will come in handy in many different situations. It is understandable why more versatile DFS replaced them as the default algorithm for regex. The simplicity of NFAs is also their downfall - complex regex cannot be modelled with them. But certainly for cases like this which don’t have complex matching criteria, it is the superior option.

I’m still curious if there are faster ways to solve nonograms or to find a left-most match. Jan Wolter’s code for one was certainly faster than mine, and I am not sure why.

1. The right-most match can be found with the same algorithm by either iterating through the indices backwards, or passing a reversed version of the pattern and line to the left-most matcher.

2. Regex is short for regular expression. It is a standardised sequence of characters that define a search pattern in a (text) string. Most modern program languages have an inbuilt regex implementation. For more detail, see the Wikipedia article or an online regex tester like regex101

3. To make the code more readable, I’ve presented it as if I called using namespace std. But according to best practice, I did not do this in the actual code.