Friday, August 6, 2010

String Search Algorithm

Major categories and characteristics of exact string pattern matching algorithms have been reviewed by Hume and Sunday16 and Pirklbauer,17 among others.

A deep and comprehensive treatment of the conceptual frameworks, together with theoretical estimates of time and space complexity, is available,18 especially for the behavior of these algorithms in the worst case or with randomly generated string targets and patterns.

However, there is a lack of data about the average behavior of these algorithms when they are used with real text corpora whose symbols are highly non-normally distributed, such as the medical sublanguage. When neither preprocessing nor indexing is feasible for the target, and only limited preprocessing in space and time is affordable for the pattern, the pattern-matching problem can be roughly divided into two steps: comparison, and sliding to the next comparison position.

Both steps can be optimized. The comparison step can often be spared by the use of a hashing table.21 Hashing functions provide an alternative way to compare strings: essentially, the technique is based on the comparison of string hash values instead of direct character comparison, and it usually requires the text to be preprocessed. Hashing is a powerful and simple technique for accessing information: each key is mathematically converted into a number, which is then used as an index into a table. In typical applications, it is common for two different strings to map to the same index, requiring some strategy for collision resolution.

With a perfect hashing function, a string could be identified in exactly one probe, without worrying about collisions. A simple hash function performs calculations using the binary codes of the characters in a key. The record's position in the table is calculated from the key value alone, not from the number of elements in the table; this is why hash-table inserts, deletes, and searches are rated as constant-time operations.
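
To make this concrete, here is a minimal Python sketch of such a function; the multiplier 31 and the use of character codes via ord are illustrative choices of mine, not details taken from the text:

    def simple_hash(key: str, table_size: int) -> int:
        # Map a string key to a table index from its character codes.
        h = 0
        for ch in key:
            h = (h * 31 + ord(ch)) % table_size  # 31 is a common small multiplier
        return h

Two different keys may still map to the same index, which is precisely the collision case discussed next.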

However, perfect hashing functions can be found only when the set of possibilities is completely known in advance, so that each unique entry in the text has exactly one entry in the hash table. In case of a collision, that is, when the hashing value is the same for two different substrings, a formal character-by-character comparison is performed; if this comparison succeeds, a match is found. In our measurements, such algorithms behave poorly for medical text, with high costs due to the time spent in the hashing function.

It must be emphasized, however, that algorithms using hashing tables perform spectacularly when the target can be preprocessed, and they can easily be extended to multidimensional pattern-matching problems, which are common in bioinformatics and imaging.

The slide step can be optimized by the use of a skip table that allows a jump of more than one character in case of failure.20 To understand the use of heuristic skip tables, consider that the target is examined through a window. This window delimits a region of the target and usually has the length of the pattern. The window slides along the target from one side to the other and is periodically shifted according to rules specific to each algorithm.

When the window is at a given position on the target, the algorithm checks whether the pattern occurs there by comparing symbols in the window with the corresponding aligned symbols of the pattern. If a whole match occurs, the position is reported. During this scan operation, the algorithm acquires information from the target that is used to determine the length of the next shift of the window. The operation is repeated until the end of the window reaches or goes beyond the end of the target. This is an effective way to optimize the time cost of the scan operation, and it can be used in right-to-left or left-to-right search engines.

The so-called naive or brute-force algorithm is the most intuitive approach to the string pattern-matching problem. It simply attempts to match the pattern in the target at successive positions from left to right. If a failure occurs, it shifts the comparison window one character to the right, until the end of the target is reached. In the worst case, this method runs in O(mn) time, but in practice the expected overall performance is near O(n + m).

Despite the poor theoretical performance of this algorithm, our measurements show that the naive algorithm is one of the fastest methods when the pattern is a short sequence of characters.

The Knuth-Morris-Pratt method19 derives an algorithm from two-way deterministic automata that runs, in the worst case, in O(m + n) time for random sequences. The basic idea behind this algorithm is to avoid backtracking in the target string in the event of a mismatch, by taking advantage of the information given by the type of mismatch. The target is processed sequentially from left to right. When a substring match attempt fails, the previous symbols, which are known, are used to determine how far the pattern can be shifted to the right for the next match attempt.

Boyer and Moore20 published a much faster algorithm in 1977.

The speed of this algorithm is achieved by disregarding portions of the target that cannot, in any case, participate in a successful match.

The Naive Algorithm

The naive algorithm (Figure 2) is the simplest and most often used algorithm. It uses a linear, sequential character-based comparison at every position of the text between y0 and yn-m, whether or not an occurrence of the pattern x starts at the current position. If the first element of the pattern, x0, matches, each element of the pattern is successively tested against the text until either a failure occurs or the last position is reached with success. After each unsuccessful attempt, the pattern is shifted exactly one position to the right, and the procedure is repeated until the end of the target is reached.
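
Figure 2 is not reproduced in this post, so the following minimal Python sketch stands in for it; the function name and 0-based indexing are my own choices:

    def naive_search(pattern: str, target: str) -> int:
        # Return the index of the first occurrence of pattern in target, or -1.
        n, m = len(target), len(pattern)
        for i in range(n - m + 1):              # every candidate window
            j = 0
            while j < m and target[i + j] == pattern[j]:
                j += 1
            if j == m:                          # the whole pattern matched
                return i
        return -1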

The naive search algorithm has several advantages. It needs no preprocessing of any kind on the pattern and requires only a fixed amount of extra memory space.

The Karp-Rabin Algorithm

In theory, hashing functions provide a simple way to avoid the quadratic number of symbol comparisons in most practical situations, and they run in constant time under reasonable probabilistic assumptions. The hashing technique was introduced by Harrison and later fully analyzed by Karp and Rabin. The Karp-Rabin algorithm (Figure 3) is a typical string-pattern-matching algorithm that uses the hashing technique.21 Instead of checking at each position of the target whether the pattern occurs there, it checks only whether the portion of the target aligned with the pattern has a hashing value equal to that of the pattern. For the hashing function to be helpful for the string-matching problem, great attention must be given to its design: it must be efficiently computable, highly discriminating for strings, and computable in an incremental manner in order to decrease the cost of processing. Incremental hashing functions allow the new hashing value for a window of the target to be computed step by step, without the whole window having to be recomputed. The time performance of this algorithm is, in the unlikely worst case, O(mn), with an expected time performance of O(m + n).
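
Figure 3 is likewise missing here; the sketch below follows the usual formulation of the algorithm with an incremental (rolling) hash. The base and modulus values are illustrative assumptions:

    def karp_rabin(pattern: str, target: str, base: int = 256, q: int = 2**31 - 1) -> int:
        # Compare rolling hash values; verify candidates character by character.
        n, m = len(target), len(pattern)
        if m == 0 or m > n:
            return -1 if m else 0
        lead = pow(base, m - 1, q)              # weight of the window's first character
        hp = ht = 0
        for i in range(m):                      # hash of pattern and of first window
            hp = (hp * base + ord(pattern[i])) % q
            ht = (ht * base + ord(target[i])) % q
        for i in range(n - m + 1):
            if hp == ht and target[i:i + m] == pattern:   # verify on a hash hit
                return i
            if i < n - m:                       # roll the window one character right
                ht = ((ht - ord(target[i]) * lead) * base + ord(target[i + m])) % q
        return -1

The update of ht is the incremental step: the leading character is removed and the next one appended in constant time, so the whole window is never rehashed.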

Several different ways to compute the hashing function have been published. The original Karp-Rabin algorithm has been implemented in our system because it is easily adaptable to different alphabet sizes. The comparative measurements we performed showed that most of the time is spent in the incremental hashing phase. Therefore, even though the Karp-Rabin algorithm needs fewer symbol comparisons than other algorithms to locate a pattern in a large target text, the cost of computing the hashing function outweighs the advantage of performing fewer symbol comparisons, at least for common medical language.

The Knuth-Morris-Pratt Algorithm

The first published linear-time string-matching algorithm was from Morris and Pratt, and it was improved by Knuth et al.22 and others.23 The Knuth-Morris-Pratt algorithm (Figure 4) is the classical algorithm that efficiently implements the left-to-right scan strategy. The search behaves like a recognition process by an automaton: a character of the target is compared with a character of the pattern. The basic idea behind the algorithm is to avoid backtracking in the target string in the event of a mismatch. This is achieved by the use of a failure function. When a substring match attempt fails, the previous character sequence is known (the suffix), and this fact can be exploited to determine how far to shift the pattern to the right for the next match attempt. Basically, if a partial match is found such that target[i - j + 1 .. i - 1] = pattern[1 .. j - 1] and target[i] ≠ pattern[j], then matching is resumed by comparing target[i] and pattern[f(j - 1) + 1] if j > 1, where f is the failure function. If j = 1, the comparison continues with target[i + 1] and pattern[1]. An auxiliary step table (heuristic skip table) containing this optimal shift information (the failure function) is computed from the pattern before the string is searched.
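
In place of the missing Figure 4, here is a standard Python sketch of the method. It uses the common 0-indexed form of the failure function, where failure[j] is the length of the longest proper prefix of pattern[0..j] that is also a suffix of it:

    def kmp_search(pattern: str, target: str) -> int:
        m = len(pattern)
        if m == 0:
            return 0
        failure = [0] * m                       # precomputed from the pattern alone
        k = 0
        for j in range(1, m):
            while k > 0 and pattern[j] != pattern[k]:
                k = failure[k - 1]
            if pattern[j] == pattern[k]:
                k += 1
            failure[j] = k
        k = 0                                   # pattern characters matched so far
        for i, ch in enumerate(target):
            while k > 0 and ch != pattern[k]:   # on mismatch, shift the pattern;
                k = failure[k - 1]              # never move backwards in the target
            if ch == pattern[k]:
                k += 1
            if k == m:
                return i - m + 1                # full match ending at position i
        return -1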

Two examples of the failure function follow:

• Pattern “abcabcacab”: If we are currently determining whether there is a match beginning at target[i], then if target[i] is not the character “a,” we can proceed by comparing target[i + 1] and “a.” Similarly, if target[i] = “a” and target[i + 1] ≠ “b,” then we may continue by comparing target[i + 1] and “a.” If target[i] = “a” and target[i + 1] = “b” and target[i + 2] ≠ “c,” then we have the situation:

    target:  . . . a b ? ? ? . . .
    pattern:       a b c a b c a c a b

where ? means that the character is not yet known.

• Pattern “abcabcacab”: The first ? in the target represents target[i + 2], and it is different from the character “c.” At this point, it is known that the search might be continued for a match by comparing the first character of the pattern with target[i + 2]. However, there is no need to compare this character of the pattern with target[i + 1], since we already know that target[i + 1] is the same as the second character of the pattern, “b,” and therefore does not match the first character of the pattern, “a.” Thus, by knowing the characters in the pattern and the position in the pattern where a mismatch occurs with a character in the target, we can determine where in the pattern to continue the search for a match without moving backwards in the target. (A small computation of the full failure table for this pattern follows below.)
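
For concreteness, the failure table for the example pattern can be computed with the same loop as in the KMP sketch above:

    pattern = "abcabcacab"
    failure = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = failure[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        failure[j] = k
    print(failure)   # [0, 0, 0, 1, 2, 3, 4, 0, 1, 2]
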
The worst-case behavior occurs for normally distributed string patterns, but even then the algorithm runs in O(m + n) time. The method performs well for large patterns made of repetitive short substrings, a situation rarely met in medical texts. Thus, in practice, the KMP algorithm is not likely to be significantly faster than the other approaches.

The Boyer-Moore-Horspool Algorithm

The Boyer-Moore algorithm (Figure 5) is considered the most efficient string-matching algorithm for natural language. In this method, the pattern is searched for in the target from left to right, but at each trial position the symbol comparisons are performed from right to left, so as to minimize the number of comparisons in case of an unsuccessful match while maximizing the pattern shift for the next trial. In case of a mismatch, the BM algorithm combines two different shift functions to optimize the number of characters that can be skipped: the bad-character shift (or occurrence heuristic) and the good-suffix shift (or match heuristic). The good-suffix shift is similar to the method used in the KMP algorithm; the bad-character shift is based on the symbol causing the mismatch: for every symbol of the pattern, the length of a safe shift is stored, corresponding to the distance between the last occurrence of that symbol in the pattern and the length of the pattern. The most interesting property of this technique is that the longer the pattern, the longer the potential shifts. The final pattern shift is determined by taking the larger of the values from the two precomputed auxiliary tables. It is important to note that both tables can be precomputed and kept; their time and space complexities depend only on the pattern.

However, although there is a theoretical advantage in using the BM algorithm, many computational steps in this algorithm are costly in terms of processor instructions, and the time cost of these steps was not shown to be amortized by the economy of character comparisons. This is particularly true of the function that computes the sizes of the shifts in the skip tables. Horspool proposed a simplified form of the BM algorithm that uses only a single auxiliary skip table, indexed by the mismatching text symbol.24 Baeza-Yates showed that this Boyer-Moore-Horspool algorithm (BMH) is the best in terms of average-case performance for nearly all pattern lengths and alphabet sizes (remember that the French alphabet is somewhat larger than the English alphabet).25 Indeed, among all published algorithms we tested, the BMH algorithm showed by far the best performance in time.

The BMH algorithm will be studied more deeply, as it appears to be the best-choice algorithm in most cases. Because of its low space complexity, we recommend the use of the BMH algorithm in any case of exact string pattern matching, whatever the size of the target. Despite its apparent conceptual complexity, the BMH algorithm is relatively simple to implement. Moreover, it can easily be extended to handle multiple matches by generating an event when a match occurs instead of leaving the algorithm, as in the following example:
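
The corresponding figure is not reproduced in this post, so the following Python sketch stands in for it; it is the usual textbook formulation of BMH and returns at the first match (a multiple-match version would report the position and keep scanning instead of returning):

    def bmh_search(pattern: str, target: str) -> int:
        n, m = len(target), len(pattern)
        if m == 0 or m > n:
            return -1 if m else 0
        # Skip table: distance from the last occurrence of each symbol
        # (final position excluded) to the end of the pattern; a real
        # implementation would use a 256-entry array indexed by byte value.
        skip = {pattern[j]: m - 1 - j for j in range(m - 1)}
        i = m - 1                               # last character of the current window
        while i < n:
            j, k = m - 1, i
            while j >= 0 and target[k] == pattern[j]:   # compare right to left
                j -= 1
                k -= 1
            if j < 0:
                return k + 1                    # full match starting at k + 1
            i += skip.get(target[i], m)         # shift by the window's last symbol
        return -1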

This means that the distance from any character in the pattern to the last position in the pattern is known, and that any character not present in the pattern will produce a shift equal to the length of the pattern. The table contains an entry for every symbol in the alphabet; on usual hardware platforms, it consists of the 256 entries found in an 8-bit ASCII table.

The BMH-2 algorithm is a slight variant of the BMH algorithm that implements a new comparison method. In the BM and BMH algorithms, the comparison loop is entered if the last character of the pattern matches the current target position. In the BMH-2 algorithm, this loop is entered only if both the last and the middle characters of the pattern match their respective positions in the target. This is a more conservative control of the right-to-left comparison loop. For long medical words, because of the frequent use of common suffixes and prefixes, this preliminary comparison generally allows several character comparisons to be skipped and optimizes the left-to-right shift. Although this improvement decreases the number of character comparisons by 80 percent in the best cases, as described under Measures, the overall time complexity is often similar to that of the BMH algorithm. Given the increased time cost, performing the middle comparison becomes cost-effective only for long patterns, usually longer than 20 characters.
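
No code for BMH-2 is given in the post, so the sketch below is my reading of the description above: the full right-to-left comparison is attempted only when both the last and the middle characters of the window match the pattern:

    def bmh2_search(pattern: str, target: str) -> int:
        n, m = len(target), len(pattern)
        if m == 0 or m > n:
            return -1 if m else 0
        skip = {pattern[j]: m - 1 - j for j in range(m - 1)}
        mid = m // 2                            # index of the middle character
        i = m - 1
        while i < n:
            start = i - m + 1
            # Conservative guard: last AND middle characters must match first.
            if target[i] == pattern[m - 1] and target[start + mid] == pattern[mid]:
                if target[start:i + 1] == pattern:
                    return start
            i += skip.get(target[i], m)
        return -1
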
Space Complexity of the Tested Algorithms

For all algorithms, the space used for the pattern and the target is the same. The extra space required by each algorithm can be read directly from the added variables in the code: 4 bytes for the naive algorithm, 12 bytes for the Karp-Rabin algorithm, 516 bytes for the Knuth-Morris-Pratt algorithm, and 518 bytes for the Boyer-Moore-Horspool algorithm. These values were computed with 32-bit integers and are constant.