Boyer-Moore String Search

Shared 12/6/2025•71 views

/ 1

Cheatsheet Content

Boyer-Moore String Search Algorithm The Boyer-Moore algorithm is an efficient string searching algorithm. It works by pre-processing the pattern to be searched for (the "needle") and then using this information to skip sections of the text (the "haystack") where a match is impossible. It compares the pattern with the text from right to left. 1. Key Concepts Text (Haystack): The larger string in which we are searching. Pattern (Needle): The smaller string we are trying to find within the text. Shift: The amount by which we move the pattern to the right along the text. Boyer-Moore aims to maximize this shift. Bad Character Rule: The primary heuristic used in this implementation to determine the shift. 2. Algorithm Steps (Simplified) Pre-processing the Pattern: Create a "bad character" table. This table tells us, for each character, its last occurrence in the pattern. Searching: Align the pattern with the beginning of the text. Compare the pattern with the text from right to left. If a mismatch occurs, use the bad character rule to calculate the shift. If a match is found, record the position and then shift the pattern to search for further occurrences. 3. Bad Character Rule Explained When a mismatch occurs at position $j$ (counting from the right of the pattern, $m-1$ being the rightmost character) between $P[j]$ and $T[s+j]$ (where $s$ is the current shift of the pattern): Let the mismatched text character be $C = T[s+j]$. Look up $C$ in the bad character table. Let its last occurrence in the pattern be at index $k$ (from the left, $0$-indexed). The shift is calculated as $j - k$. If $C$ does not appear in the pattern at all, or if $k > j$ (meaning $C$ appears to the right of the mismatch point in the pattern), the shift is $j - (-1) = j+1$ (or just $1$ if $j=0$). A common simplification is to shift by at least $1$ to ensure progress. This rule tries to align the last occurrence of the mismatched text character $C$ in the pattern with the $C$ in the text, allowing for a potentially large shift. 4. Code Breakdown 4.1. Initialization char text[200], pat[100]; int bad[256]; // Array to store last occurrence of each character // ... (input for text and pat) int n = strlen(text); // Length of text int m = strlen(pat); // Length of pattern `text`, `pat`: Character arrays to hold the input text and pattern. `bad[256]`: This is our "bad character" table. It's an array of integers, indexed by ASCII values (0-255). It will store the index of the *last occurrence* of each character in the pattern. `n`, `m`: Store the lengths of the text and pattern for convenience. 4.2. Pre-processing: Building the Bad Character Table for (int i = 0; i First loop: Sets all entries in the `bad` array to -1. This means that, by default, if a character is encountered in the text that is not in the pattern, its "last occurrence" in the pattern is considered -1. Second loop: Iterates through the pattern. For each character `pat[i]`, it updates `bad[pat[i]]` to `i`. If a character appears multiple times, this loop ensures that `bad[char]` will store the *last* (rightmost) index of that character in the pattern. Example: `pat = "ABCABA"` `bad['A']` initially 0, then 3, then 5 `bad['B']` initially 1, then 4 `bad['C']` initially 2 4.3. Searching Loop int s = 0; // Current shift of the pattern relative to the text while (s = 0 && pat[j] == text[s + j]) j--; // Move left if characters match if (j `s`: This variable represents the current starting index in the `text` where the pattern is aligned. It's the "shift" value. `while (s `j = m - 1`: Comparison starts from the rightmost character of the pattern. `while (j >= 0 && pat[j] == text[s + j])`: This inner loop compares characters from `pat[m-1]` down to `pat[0]` against the corresponding characters in the text. `s + j`: This is the index in the `text` corresponding to `pat[j]`. `if (j This condition is true if the inner loop completed without finding any mismatches, meaning the entire pattern `pat` matched a substring of `text` starting at index `s`. `printf("Pattern found at index %d\n", s);`: Prints the starting index of the match. `s += (s + m `else`: A mismatch occurred at `pat[j]` and `text[s+j]`. `int shift = j - bad[(unsigned char)text[s + j]];`: This is the core of the bad character rule application. `text[s + j]` is the mismatched character in the text. `bad[(unsigned char)text[s + j]]` gives us `k`, the last index of this character in the pattern. The shift is calculated as `j - k`. Example: `text = "EXAMPLE"`, `pat = "PLE"` `s=0`, `pat` aligned with `TEX`. Mismatch at `j=2` (rightmost). `pat[2] = 'E'`, `text[2] = 'X'`. `bad['X']` is -1 (assuming 'X' not in "PLE"). `shift = 2 - (-1) = 3`. Pattern shifts by 3. Example: `text = "SEARCHEXAMPLE"`, `pat = "EXAMPLE"` `s=6`, `pat` aligned with `EXAMPLE`. Mismatch at `j=0` (leftmost). `pat[0] = 'E'`, `text[6] = 'S'`. `bad['S']` is -1. `shift = 0 - (-1) = 1`. Pattern shifts by 1. `if (shift `s += shift;`: The pattern is shifted by the calculated amount. 5. Why Boyer-Moore is Efficient Large Shifts: By comparing from right to left and using the bad character rule, Boyer-Moore can often make very large shifts, skipping many characters in the text. This is especially effective when the pattern is long and common characters appear in the text but not in the pattern. Optimized for Mismatches: The algorithm is designed to quickly identify mismatches and make intelligent shifts, rather than just shifting one position at a time (like naive search) or always shifting by pattern length (like KMP in some cases). 6. Limitations/Considerations The provided code only implements the "Bad Character Rule" heuristic. A full Boyer-Moore implementation also includes a "Good Suffix Rule," which can provide even larger shifts, especially for patterns with repeating sub-patterns. Combining both rules (taking the maximum of the two calculated shifts) makes the algorithm even more powerful. The `gets()` function is unsafe and can lead to buffer overflows. In modern C programming, `fgets()` is preferred for reading string input.

Boyer-Moore String Search

Related Cheatsheets

Mass & Energy Balance Cheatsheet

Kohlberg

Material Properties Comparison

Oscillation SHM

Create Your Own AI Cheatsheet