Introduction
Pattern matching is fundamental in computer science, from DNA sequencing to search engines. While most developers know about KMP or Rabin-Karp algorithms, suffix arrays offer a powerful alternative that combines simplicity with exceptional performance.
Suffix arrays provide O(m log n) pattern matching after O(n log n) preprocessing, making them ideal for applications requiring multiple queries on the same text.
What Are Suffix Arrays?
A suffix array is a sorted array of all suffixes of a string. For string S of length n, it contains integers representing starting positions of suffixes in lexicographical order.
Example
String: "banana"
Suffixes:
0: banana
1: anana
2: nana
3: ana
4: na
5: a
Suffix Array: [5, 3, 1, 0, 4, 2]
(sorted: a, ana, anana, banana, na, nana)
Implementation
Basic Suffix Array Construction
def build_suffix_array(s):
"""Build suffix array using sorting approach"""
n = len(s)
suffixes = [(s[i:], i) for i in range(n)]
suffixes.sort()
return [suffix[1] for suffix in suffixes]
# Optimized O(n log²n) version
def build_suffix_array_optimized(s):
n = len(s)
s += chr(0) # Add sentinel
# Initial ranking
rank = [ord(c) for c in s]
sa = list(range(n + 1))
k = 1
while k <= n:
# Sort by (rank[i], rank[i+k])
sa.sort(key=lambda i: (rank[i], rank[i + k] if i + k <= n else -1))
# Update ranks
new_rank = [0] * (n + 1)
for i in range(1, n + 1):
new_rank[sa[i]] = new_rank[sa[i-1]]
if (rank[sa[i]], rank[sa[i] + k] if sa[i] + k <= n else -1) != \
(rank[sa[i-1]], rank[sa[i-1] + k] if sa[i-1] + k <= n else -1):
new_rank[sa[i]] += 1
rank = new_rank
k *= 2
return sa[1:] # Remove sentinel
Pattern Matching with Suffix Arrays
def binary_search_pattern(text, suffix_array, pattern):
"""Find all occurrences of pattern using suffix array"""
n = len(text)
m = len(pattern)
# Binary search for leftmost occurrence
left, right = 0, n
while left < right:
mid = (left + right) // 2
suffix = text[suffix_array[mid]:suffix_array[mid] + m]
if suffix < pattern:
left = mid + 1
else:
right = mid
start = left
# Binary search for rightmost occurrence
left, right = 0, n
while left < right:
mid = (left + right) // 2
suffix = text[suffix_array[mid]:suffix_array[mid] + m]
if suffix <= pattern:
left = mid + 1
else:
right = mid
end = left
# Return all positions
return [suffix_array[i] for i in range(start, end)]
# Usage example
text = "banana"
sa = build_suffix_array_optimized(text)
positions = binary_search_pattern(text, sa, "ana")
print(f"Pattern found at positions: {positions}") # [1, 3]
Complexity Analysis
| Operation | Time Complexity | Space Complexity |
| ----------- | ---------------- | ------------------ |
| Construction | O(n log²n) | O(n) |
| Pattern Search | O(m log n) | O(1) |
| Multiple Queries | O(m log n) per query | O(n) total |
Why This Matters: - Preprocessing once, query many times efficiently - Better than KMP for multiple pattern searches - Excellent for bioinformatics applications
Real-World Applications
1. Bioinformatics - DNA Sequence Analysis
def find_dna_patterns(genome, patterns):
"""Find multiple DNA patterns in genome sequence"""
sa = build_suffix_array_optimized(genome)
results = {}
for pattern in patterns:
positions = binary_search_pattern(genome, sa, pattern)
results[pattern] = positions
return results
# Example: Finding restriction enzyme sites
genome = "ATCGATCGATCGAATTCGATCG"
enzymes = ["GAATTC", "ATCG"]
sites = find_dna_patterns(genome, enzymes)
2. Text Processing - Document Search
class DocumentSearchEngine:
def __init__(self, documents):
self.text = " ".join(documents)
self.suffix_array = build_suffix_array_optimized(self.text)
self.doc_boundaries = self._build_boundaries(documents)
def search(self, query):
positions = binary_search_pattern(self.text, self.suffix_array, query)
return self._map_to_documents(positions)
Advanced Optimizations
LCP Array Enhancement
def build_lcp_array(text, suffix_array):
"""Build Longest Common Prefix array"""
n = len(text)
rank = [0] * n
lcp = [0] * (n - 1)
# Build rank array
for i in range(n):
rank[suffix_array[i]] = i
# Kasai algorithm for LCP
h = 0
for i in range(n):
if rank[i] > 0:
j = suffix_array[rank[i] - 1]
while i + h < n and j + h < n and text[i + h] == text[j + h]:
h += 1
lcp[rank[i] - 1] = h
if h > 0:
h -= 1
return lcp
Common Mistakes & Pitfalls
❌ Mistake 1: Incorrect Boundary Handling
# Wrong - doesn't handle string boundaries
def wrong_search(text, sa, pattern):
for i in range(len(sa)):
if text[sa[i]:].startswith(pattern): # May go out of bounds
return sa[i]
# Correct - proper boundary checking
def correct_search(text, sa, pattern):
for i in range(len(sa)):
if sa[i] + len(pattern) <= len(text):
if text[sa[i]:sa[i] + len(pattern)] == pattern:
return sa[i]
❌ Mistake 2: Inefficient Construction
# Inefficient O(n²log n)
def slow_suffix_array(s):
return sorted(range(len(s)), key=lambda i: s[i:])
# Better O(n log²n) with proper ranking
# Use the optimized version shown above
Performance Comparison
| Algorithm | Preprocessing | Search | Multiple Queries |
| ----------- | -------------- | -------- | ------------------ |
| Naive | O(1) | O(nm) | O(qnm) |
| KMP | O(m) | O(n+m) | O(q(n+m)) |
| Suffix Array | O(n log²n) | O(m log n) | O(qm log n) |
When to Use Suffix Arrays: - Multiple pattern searches on same text - Text preprocessing is acceptable - Memory usage is not critical constraint
Conclusion
Suffix arrays provide an elegant solution for pattern matching problems, especially when dealing with multiple queries. Their combination of reasonable construction time and fast query performance makes them invaluable for bioinformatics, text processing, and competitive programming.
Next Steps: - Practice with string matching problems - Implement suffix trees for even faster queries - Explore applications in data compression
Try These Problems: - Pattern Matching Challenge - DNA Sequence Analysis - Text Search Engine
--- *Master advanced algorithms with our comprehensive coding problem collection. Start solving today!*