String Algorithms

String Algorithms

Overview

String pattern matching algorithms efficiently find specific patterns in text. Covers various techniques from brute force O(nm) to KMP/Z-algorithm O(n+m).


Table of Contents

  1. Brute Force Matching
  2. KMP Algorithm
  3. Rabin-Karp Algorithm
  4. Z-Algorithm
  5. String Hashing
  6. Application Problems
  7. Practice Problems

1. Brute Force Matching

1.1 Basic Idea

Text:    A B C D A B C E A B C D
Pattern: A B C D

Position 0: ABCD = ABCD āœ“ (found!)
Position 1: BCDA ≠ ABCD āœ—
Position 2: CDAB ≠ ABCD āœ—
...
Position 8: ABCD = ABCD āœ“ (found!)

1.2 Implementation

def brute_force(text, pattern):
    """
    Brute force pattern matching
    Time: O(n * m), Space: O(1)
    n = len(text), m = len(pattern)
    """
    n, m = len(text), len(pattern)
    result = []

    for i in range(n - m + 1):
        match = True
        for j in range(m):
            if text[i + j] != pattern[j]:
                match = False
                break
        if match:
            result.append(i)

    return result

# Example
text = "ABCDABCEABCD"
pattern = "ABCD"
print(brute_force(text, pattern))  # [0, 8]
// C++
vector<int> bruteForce(const string& text, const string& pattern) {
    int n = text.length(), m = pattern.length();
    vector<int> result;

    for (int i = 0; i <= n - m; i++) {
        bool match = true;
        for (int j = 0; j < m; j++) {
            if (text[i + j] != pattern[j]) {
                match = false;
                break;
            }
        }
        if (match) result.push_back(i);
    }

    return result;
}

2. KMP Algorithm

2.1 Core Idea

KMP (Knuth-Morris-Pratt):
- On mismatch, use information from already matched portion
- Failure function determines skip position
- Time: O(n + m)

Failure function (Ļ€ array):
- π[i] = longest proper prefix of pattern[0..i] that is also suffix

Example: pattern = "ABCABD"
Index:   0  1  2  3  4  5
Char:    A  B  C  A  B  D
Ļ€[i]:    0  0  0  1  2  0

Ļ€[4] = 2 → In "ABCAB", "AB" is both prefix and suffix

2.2 Building Failure Function

pattern = "ABAAB"

i=0: "A"      → Ļ€[0] = 0 (by definition)
i=1: "AB"     → no prefix=suffix → Ļ€[1] = 0
i=2: "ABA"    → "A" matches → Ļ€[2] = 1
i=3: "ABAA"   → "A" matches → Ļ€[3] = 1
i=4: "ABAAB"  → "AB" matches → Ļ€[4] = 2

Ļ€ = [0, 0, 1, 1, 2]
def compute_failure(pattern):
    """Compute failure function - O(m)"""
    m = len(pattern)
    pi = [0] * m  # π[0] = 0
    j = 0  # Current matched length

    for i in range(1, m):
        # On mismatch, move j to pi[j-1]
        while j > 0 and pattern[i] != pattern[j]:
            j = pi[j - 1]

        # If match, increment j
        if pattern[i] == pattern[j]:
            j += 1
            pi[i] = j

    return pi

# Example
print(compute_failure("ABAAB"))  # [0, 0, 1, 1, 2]
print(compute_failure("ABCABD"))  # [0, 0, 0, 1, 2, 0]

2.3 KMP Matching

def kmp_search(text, pattern):
    """
    KMP pattern matching
    Time: O(n + m), Space: O(m)
    """
    if not pattern:
        return []

    n, m = len(text), len(pattern)
    pi = compute_failure(pattern)
    result = []
    j = 0  # Current position in pattern

    for i in range(n):
        # On mismatch, move j using failure function
        while j > 0 and text[i] != pattern[j]:
            j = pi[j - 1]

        # If match, increment j
        if text[i] == pattern[j]:
            if j == m - 1:
                # Complete match!
                result.append(i - m + 1)
                j = pi[j]  # Move for next match
            else:
                j += 1

    return result


# Example
text = "ABABDABACDABABCABAB"
pattern = "ABABCABAB"
print(kmp_search(text, pattern))  # [10]

text = "AAAAAA"
pattern = "AA"
print(kmp_search(text, pattern))  # [0, 1, 2, 3, 4]

2.4 C++ Implementation

#include <vector>
#include <string>
using namespace std;

vector<int> computeFailure(const string& pattern) {
    int m = pattern.length();
    vector<int> pi(m, 0);
    int j = 0;

    for (int i = 1; i < m; i++) {
        while (j > 0 && pattern[i] != pattern[j]) {
            j = pi[j - 1];
        }
        if (pattern[i] == pattern[j]) {
            pi[i] = ++j;
        }
    }

    return pi;
}

vector<int> kmpSearch(const string& text, const string& pattern) {
    int n = text.length(), m = pattern.length();
    vector<int> pi = computeFailure(pattern);
    vector<int> result;
    int j = 0;

    for (int i = 0; i < n; i++) {
        while (j > 0 && text[i] != pattern[j]) {
            j = pi[j - 1];
        }
        if (text[i] == pattern[j]) {
            if (j == m - 1) {
                result.push_back(i - m + 1);
                j = pi[j];
            } else {
                j++;
            }
        }
    }

    return result;
}

2.5 KMP Visualization

Text:    A B A B D A B A C D A B A B C A B A B
Pattern: A B A B C A B A B
Ļ€:       0 0 1 2 0 1 2 3 4

Matching process:
i=0: A=A āœ“ j=1
i=1: B=B āœ“ j=2
i=2: A=A āœ“ j=3
i=3: B=B āœ“ j=4
i=4: D≠C āœ— j=Ļ€[3]=2, D≠A āœ— j=Ļ€[1]=0, D≠A āœ— j=0
i=5: A=A āœ“ j=1
...
i=10: A=A āœ“ j=1
i=11: B=B āœ“ j=2
...
i=18: B=B āœ“ j=9 → Complete match! (position 10)

3. Rabin-Karp Algorithm

3.1 Core Idea

Rabin-Karp:
- Compare strings using hash function
- Rolling hash for O(1) next hash computation
- Average O(n + m), Worst O(nm) (hash collision)

Rolling hash:
hash("ABC") = A*d² + B*d + C
hash("BCD") = (hash("ABC") - A*d²) * d + D

d = base (usually 31 or 256)

3.2 Implementation

def rabin_karp(text, pattern, d=256, q=101):
    """
    Rabin-Karp pattern matching
    d: base (number of characters)
    q: modulus (large prime)
    Time: Average O(n + m), Worst O(nm)
    """
    n, m = len(text), len(pattern)
    if m > n:
        return []

    result = []
    h = pow(d, m - 1, q)  # d^(m-1) mod q

    # Compute initial hash values
    p_hash = 0  # Pattern hash
    t_hash = 0  # Text window hash

    for i in range(m):
        p_hash = (d * p_hash + ord(pattern[i])) % q
        t_hash = (d * t_hash + ord(text[i])) % q

    # Sliding window
    for i in range(n - m + 1):
        # If hashes match, compare actual strings
        if p_hash == t_hash:
            if text[i:i + m] == pattern:
                result.append(i)

        # Compute next window hash (rolling)
        if i < n - m:
            t_hash = (d * (t_hash - ord(text[i]) * h) + ord(text[i + m])) % q
            if t_hash < 0:
                t_hash += q

    return result


# Example
text = "ABABDABACDABABCABAB"
pattern = "ABAB"
print(rabin_karp(text, pattern))  # [0, 10, 15]
def rabin_karp_multiple(text, patterns, d=256, q=101):
    """Search multiple patterns simultaneously"""
    n = len(text)
    result = {p: [] for p in patterns}

    # Group patterns by length
    by_length = {}
    for p in patterns:
        m = len(p)
        if m not in by_length:
            by_length[m] = {}
        # Compute pattern hash
        p_hash = 0
        for c in p:
            p_hash = (d * p_hash + ord(c)) % q
        if p_hash not in by_length[m]:
            by_length[m][p_hash] = []
        by_length[m][p_hash].append(p)

    # Search for each length
    for m, hash_to_patterns in by_length.items():
        if m > n:
            continue

        h = pow(d, m - 1, q)
        t_hash = 0

        for i in range(m):
            t_hash = (d * t_hash + ord(text[i])) % q

        for i in range(n - m + 1):
            if t_hash in hash_to_patterns:
                for p in hash_to_patterns[t_hash]:
                    if text[i:i + m] == p:
                        result[p].append(i)

            if i < n - m:
                t_hash = (d * (t_hash - ord(text[i]) * h) + ord(text[i + m])) % q
                if t_hash < 0:
                    t_hash += q

    return result

4. Z-Algorithm

4.1 Core Idea

Z array:
- Z[i] = length of longest common prefix of s[i:] and s

Example: s = "aabxaab"
Index:   0  1  2  3  4  5  6
Char:    a  a  b  x  a  a  b
Z[i]:    -  1  0  0  3  1  0

Z[1] = 1: Common prefix of "abxaab" and "aabxaab" = "a" (length 1)
Z[4] = 3: Common prefix of "aab" and "aabxaab" = "aab" (length 3)

Pattern matching:
- Construct s = pattern + "$" + text
- Z[i] == len(pattern) means match

4.2 Computing Z Array

def z_function(s):
    """
    Compute Z array
    Time: O(n)
    """
    n = len(s)
    z = [0] * n
    z[0] = n  # By definition, full string

    l, r = 0, 0  # Z-box left and right boundaries

    for i in range(1, n):
        if i < r:
            # Inside Z-box: use previous information
            z[i] = min(r - i, z[i - l])

        # Try to extend
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1

        # Update Z-box
        if i + z[i] > r:
            l, r = i, i + z[i]

    return z


# Example
print(z_function("aabxaab"))  # [7, 1, 0, 0, 3, 1, 0]
print(z_function("aaaaa"))    # [5, 4, 3, 2, 1]

4.3 Z-Algorithm Pattern Matching

def z_search(text, pattern):
    """
    Pattern matching using Z-algorithm
    Time: O(n + m)
    """
    concat = pattern + "$" + text
    z = z_function(concat)
    m = len(pattern)

    result = []
    for i in range(m + 1, len(concat)):
        if z[i] == m:
            result.append(i - m - 1)

    return result


# Example
text = "ABABDABACDABABCABAB"
pattern = "ABAB"
print(z_search(text, pattern))  # [0, 10, 15]

4.4 C++ Implementation

#include <vector>
#include <string>
using namespace std;

vector<int> zFunction(const string& s) {
    int n = s.length();
    vector<int> z(n, 0);
    z[0] = n;
    int l = 0, r = 0;

    for (int i = 1; i < n; i++) {
        if (i < r) {
            z[i] = min(r - i, z[i - l]);
        }
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) {
            z[i]++;
        }
        if (i + z[i] > r) {
            l = i;
            r = i + z[i];
        }
    }

    return z;
}

vector<int> zSearch(const string& text, const string& pattern) {
    string concat = pattern + "$" + text;
    vector<int> z = zFunction(concat);
    int m = pattern.length();
    vector<int> result;

    for (int i = m + 1; i < concat.length(); i++) {
        if (z[i] == m) {
            result.push_back(i - m - 1);
        }
    }

    return result;
}

5. String Hashing

5.1 Polynomial Hash

def polynomial_hash(s, base=31, mod=10**9 + 9):
    """
    Polynomial rolling hash
    hash(s) = s[0]*base^(n-1) + s[1]*base^(n-2) + ... + s[n-1]
    """
    h = 0
    for c in s:
        h = (h * base + ord(c) - ord('a') + 1) % mod
    return h

5.2 Prefix Hash (Range Hash)

class StringHash:
    """
    String hash (O(1) range hash queries)
    """
    def __init__(self, s, base=31, mod=10**9 + 9):
        self.base = base
        self.mod = mod
        self.n = len(s)

        # Prefix hash
        self.prefix = [0] * (self.n + 1)
        # Powers of base
        self.power = [1] * (self.n + 1)

        for i in range(self.n):
            self.prefix[i + 1] = (self.prefix[i] * base + ord(s[i]) - ord('a') + 1) % mod
            self.power[i + 1] = (self.power[i] * base) % mod

    def get_hash(self, l, r):
        """Hash of s[l:r+1] (0-indexed)"""
        h = (self.prefix[r + 1] - self.prefix[l] * self.power[r - l + 1]) % self.mod
        return (h + self.mod) % self.mod


# Usage example
s = "abcabc"
sh = StringHash(s)

print(sh.get_hash(0, 2))  # Hash of "abc"
print(sh.get_hash(3, 5))  # Hash of "abc" (should be same)
print(sh.get_hash(0, 2) == sh.get_hash(3, 5))  # True

5.3 Double Hash (Collision Prevention)

class DoubleHash:
    """Two hashes to minimize collision probability"""
    def __init__(self, s):
        self.h1 = StringHash(s, base=31, mod=10**9 + 7)
        self.h2 = StringHash(s, base=37, mod=10**9 + 9)

    def get_hash(self, l, r):
        return (self.h1.get_hash(l, r), self.h2.get_hash(l, r))

6. Application Problems

6.1 Longest Palindromic Substring

def longest_palindrome_substring(s):
    """
    Expand around center method
    Time: O(n²)
    """
    def expand(l, r):
        while l >= 0 and r < len(s) and s[l] == s[r]:
            l -= 1
            r += 1
        return s[l + 1:r]

    result = ""
    for i in range(len(s)):
        # Odd length palindrome
        odd = expand(i, i)
        if len(odd) > len(result):
            result = odd

        # Even length palindrome
        even = expand(i, i + 1)
        if len(even) > len(result):
            result = even

    return result


# Example
print(longest_palindrome_substring("babad"))  # "bab" or "aba"

6.2 Repeated Substring Pattern (KMP Application)

def repeated_substring_pattern(s):
    """
    Check if string is made of repeated pattern
    Example: "abab" → True (ab repeated 2 times)
    """
    n = len(s)
    pi = compute_failure(s)

    # Check last failure function value
    length = pi[n - 1]

    # Pattern unit length
    pattern_length = n - length

    # Check if n is multiple of pattern_length and actually repeats
    return length > 0 and n % pattern_length == 0


# Example
print(repeated_substring_pattern("abab"))   # True
print(repeated_substring_pattern("abcab"))  # False
def longest_common_substring(s1, s2):
    """
    Binary search + rolling hash
    Time: O((n+m) log(min(n,m)))
    """
    def get_hashes(s, length, base=31, mod=10**9 + 9):
        """Hashes of all substrings of given length"""
        if length > len(s):
            return set()

        hashes = set()
        h = 0
        power = pow(base, length - 1, mod)

        for i in range(length):
            h = (h * base + ord(s[i])) % mod

        hashes.add(h)

        for i in range(length, len(s)):
            h = (h - ord(s[i - length]) * power) % mod
            h = (h * base + ord(s[i])) % mod
            hashes.add(h)

        return hashes

    def check(length):
        """Check if common substring of given length exists"""
        h1 = get_hashes(s1, length)
        h2 = get_hashes(s2, length)
        return len(h1 & h2) > 0

    left, right = 0, min(len(s1), len(s2))
    result = 0

    while left <= right:
        mid = (left + right) // 2
        if check(mid):
            result = mid
            left = mid + 1
        else:
            right = mid - 1

    return result


# Example
print(longest_common_substring("abcdxyz", "xyzabcd"))  # 4 ("abcd" or "xyz")

6.4 Find Anagrams

def find_anagrams(s, p):
    """
    Find start indices of p's anagrams in s
    Sliding window + hashmap
    """
    from collections import Counter

    result = []
    p_count = Counter(p)
    s_count = Counter()
    m = len(p)

    for i, c in enumerate(s):
        s_count[c] += 1

        # Maintain window size
        if i >= m:
            left = s[i - m]
            s_count[left] -= 1
            if s_count[left] == 0:
                del s_count[left]

        if s_count == p_count:
            result.append(i - m + 1)

    return result


# Example
print(find_anagrams("cbaebabacd", "abc"))  # [0, 6]

6.5 String Compression

def compress_string(s):
    """
    Compress consecutive same characters
    "aabcccccaaa" → "a2b1c5a3"
    """
    if not s:
        return s

    result = []
    count = 1

    for i in range(1, len(s)):
        if s[i] == s[i - 1]:
            count += 1
        else:
            result.append(s[i - 1] + str(count))
            count = 1

    result.append(s[-1] + str(count))

    compressed = "".join(result)
    return compressed if len(compressed) < len(s) else s

7. Practice Problems

Difficulty Problem Platform Algorithm
⭐⭐ Find BOJ KMP
⭐⭐ Implement strStr() LeetCode KMP
⭐⭐ Repeated String Match LeetCode KMP/Rabin-Karp
⭐⭐⭐ Substring BOJ KMP
⭐⭐⭐ Longest Happy Prefix LeetCode KMP failure
⭐⭐⭐ Shortest Palindrome LeetCode KMP
⭐⭐⭐⭐ Advertisement BOJ KMP

Algorithm Comparison

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│ Algorithm    │ Time        │ Space       │ Characteristics│
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│ Brute Force  │ O(nm)       │ O(1)        │ Simple, short  │
│ KMP          │ O(n+m)      │ O(m)        │ Exact, general │
│ Rabin-Karp   │ O(n+m) avg  │ O(1)        │ Multi-pattern  │
│ Z-Algorithm  │ O(n+m)      │ O(n+m)      │ Simple impl    │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

n = text length, m = pattern length

Next Steps


References

to navigate between lessons