Friday, August 6, 2010

Pattern Matching - Boyer's Algorithm

In 1977 Robert Boyer and L. Moore published a new exact pattern matching algorithm that had superior logic to existing versions, it performed the character comparison in reverse order to the pattern being searched for and had a method that did not require the full pattern to be searched if a mismatch was found. Legend has it that the original algorithm was coded in PDP-10 assembler.

Computer hardware has changed very considerably since that time yet the fundamental logic is still sound in a different world under very different conditions. Modern personal computers now have the capacity to work on very large sources and often have enough memory to handle those sources without requiring any form of tiling scheme and this capacity very well suits the design of the Boyer Moore exact pattern matching algorithm.

Usage would include tasks like recursively searching files for virus patterns, searching databases for keys or data, text and word processing and any other task that requires handling large amounts of data at very high speed.

The versions presented here are coded in 32 bit Microsoft assembler (MASM) and include a main version that is reasonably close to the original design and two other variant versions that use parts of the original design. The coding is designed for both search speed and mismatch recovery speed in a practical sense that has been determined by direct benchmarking. While all three achieve "sublinearity", the code design is aimed at the delivery of performance, not character count theory based on assumptions related to older ANSI C code.

The code is provided as both source code in three assembler modules and a Microsoft format library that can be used by both MASM and Visual C/C++. To use the modules in VC++, the correct prototypes must be written. The parameters are 32 bit unsigned integers.

How does it work ?

The following example is set up to search for the pattern within the source.

Source "This is a test of the Boyer Moore algorithm."
Pattern "algorithm"

Constructing the Table

A 256 member table is constructed that is initially filled with the length of the pattern which in this case is 9 characters. The 256 members represent the full range of characters in the ASCII character set. A second pass is then made on the table that places a descending count from the original length of the pattern in the ASCII table for each character that occurs.

algorithm <- pattern
87654321 <- shift values for each character

The table constructed in this manner allows the algorithm to determine in one access if the character being compared is within the search pattern or not. The first character compared is the end character of the pattern "m" to the corresponding position in the source.
|
Source "This is a test of the Boyer Moore algorithm."
Pattern "algorithm"
|
The character being compared is "a" which is within the characters that are in the pattern. Character "a" has a shift of 8 so the pattern is shifted 8 characters right.

|
Source "This is a test of the Boyer Moore algorithm."
Pattern "algorithm"
|

The GOOD SUFFIX shift

The shift that was just performed has normally been called the GOOD SUFFIX shift. The next character being compared is "f" which is not within the patern and this requires a different strategy to handle the shift. The logic of the Boyer Moore design is that if a character is compared that is not within the characters that are in the pattern, no match can be found by comparing any further characters at this position so the pattern can be shifted completely past the mismatching character.

|
Source "This is a test of the Boyer Moore algorithm."
Pattern "algorithm"
|

The BAD CHARACTER shift

This shift is usually called the BAD CHARACTER shift and it is calculated in a different manner. The table for the BAD CHARACTER shift occurs in this form.

Pattern "algorithm"
123456789

Some of the older implementation have constructed a second table but there is a far more efficient way to do it. The loop that does the reverse comparisons must have a loop counter and this loop counter produces the descending shift value that can be used to perform the BAD CHARACTER shift. The above shift that mismatched on the character "f" occurred on the first comparison after the shift so the shift past the mismatching character is 9.

The following character compared in the source is "e" which is also not in the table and it produces a BAD CHARACTER shift on the first position of the comparison so the shift is also 9.

|
Source "This is a test of the Boyer Moore algorithm."
Pattern "algorithm"
|

The next mismatch is "a" which occurs in the search pattern. This lines up with the first character of the pattern being searched for. The shift value in the table for the character "a" is 8 which will produce the match being searched for.

|
Source "This is a test of the Boyer Moore algorithm."
Pattern "algorithm"
|

The reverse comparison loop will compare all of the characters in the pattern to the position in the source and will produce a match at the end.

The additional heuristic

The is the need to produce an addition heuristic to handle the occurrence of repeated sequences of characters which is common in both plain text and binary files. What will happen if there is a repeated sequence of characters that are within the characters used in the search pattern is that the value in the table for the GOOD SUFFIX shift will produce a shift that goes past the correct match.

|
xxxxBooooxxxx
Boooo
|

When the GOOD SUFFIX shift table is constructed from the characters in a pattern that has repeated sequences, the repeat character values are overwritten at the same position in the table so you do not have an incremental decrementing of the values for each character position.

Boooo
4111

If the GOOD SUFFIX shift is applied with these values in the table, the pattern will be shifted past the correct match.

|
xxxxBooooxxxx
Boooo
|

The additional heuristic calculates the number of comparisons made in the current position and subtracts that from the GOOD SUFFIX shift. This will produce the correct match in most instances but the subtraction can also produce 0 so to ensure that a shift is applied in the worst case, if the calculated value is less than 1, a minimum shift of one is performed.

|
xxxxBooooxxxx
Boooo
|

Two comparisons have been made at this position so the GOOD SUFFIX shift of 4 for "B" has 2 subtracted from it to produce a shift of 2 characters.

|
xxxxBooooxxxx
Boooo
|

--------------------------------------------------------------------------------

The two variations

The two variations supplied with the complete Boyer Moore algorithm are based on the two different shifts that the original uses, the SBM.ASM version uses the GOOD SUFFIX shift from the table without the BAD CHARACTER shift, the BMH.ASM uses the BAD CHARACTER shift and simply increments the location if the character occurs within the table.

Both will produce reasonable performance because they have less overhead than the original. Their performance in loop speed must be offset against the lack of the extra logic that is used in the original algorithm.

The SBM.ASM version is generally faster on older machines and some AMD machines because the shorter pipeline with a lower penalty for register stalls better suits the code design.

The BMH.ASM version is faster on a late model Intel machine because it better suits the longer pipeline of later Intel processors and is less prone to branch prediction buffer penalties. This version will show much slower speeds if the character range is limited to characters that are within the search pattern as it only does a single increment in that situation.

The original algorithm BM.ASM averages better across x86 processors and is less vulnerable in worst case situations. The range of variation between all three versions is about 10%. The original version is within a few percent of the fastest tests on both types of machines and this reflects its dependence on logic rather than just fast loop code.

No comments: