Introduction
Given a string, sort the suffixes of that string in ascending order. The resulting sorted list is called a suffix array. For example, for s = "banana"
the suffixes are:0. banana
1. anana
2. nana
3. ana
4. na
5. a
After sorting them we get:5. a
3. ana
1. anana
0. banana
4. na
2. nana
Therefore, sa = [5, 3, 1, 0, 4, 2]
is the suffix array for s
.
The naive solution would be to simply sort the strings using quick sort which takes O(nlogn)
, and since each string comparison takes O(n)
, overall the complexity would be O(n2logn)
.
There are better algorithms to construct a suffix array and they typically run in O(n)
or O(nlogn)
. If you are interested in an O(nlogn)
approach, read this post by cp-algorithms or the paper by Manber and Myers. Also in theory, suffix sorting can be done using Suffix Trees in O(n)
. However, due to the large constant and high memory overhead, in practice suffix arrays are often preferred. In addition, suffix arrays can also be constructed in linear time (ex: 1 & 2). However, a good O(nlogn)
implementation of suffix arrays such as qsufsort is usually sufficient and in practice faster than most linear time implementations.
In the next section, we look at a relatively simple algorithm that creates the suffix array for a given string in O(nlog2n)
using dynamic programming.
Dynamic Programming
The idea of this algorithm is based on efficient suffix comparisons. Suppose cmp(i, j, k)
compares the suffixes starting at i
and j
using only the first k
characters of them. Now, in order to calculate cmp
we look at the comparison of the first half (using the first k/2
characters). If they are different, we know the answer. If they are equal, then we look at the comparison of the second half. psudo-code:# k is either 1 or an even number
cmp(i, j, k):
# base case: compare using the first characters
if k == 1:
return s[i] - s[j]
# recursion formula
first_half = cmp(i, j, k/2)
if first_half == 0:
return first_half
else:
return cmp(i + k/2, j + k/2, k/2)
One problem with this recursion formula is that we need to store the comparison between all possible pairs of suffixes at each (k
) to use it in the next round (2 * k
), which would make the algorithm slow. The trick to solve this efficiently is to sort them using the comparison function in O(nlogn)
, and then for each i < j
, using the sorted array sa
, we know the suffix starting at sa[i]
is less than or equal to the suffix starting at sa[j]
. In order to distinguish between <
and =
, we run a bucketization that simply walks over the sorted array and finds chunks of equal suffixes. Later, we use the bucket array to compute cmp(i, j, k / 2) = bucket[i] - bucket[j]
.
For example, far s = "banana"
and k = 2
, sorting all suffixes using only their first 2 letters would be:
[a-]
[an] ana
[an] a
[ba] nana
[na] na
[na]
So, since a- < an = an < ba < na = na
, the bucket array would be [0, 1, 1, 2, 3, 3]
, and to compare two suffixes we can simply subtract their buckets!
Implementation
Here is my implementation of the O(nlog2n)
algorithm in python:# compare suffixes starting at idx1 and idx2, using only the first k characters
def compare_suffixes(idx1, idx2, k, s, b):
if k == 1:
return ord(s[idx1]) - ord(s[idx2])
if b[idx1] != b[idx2]: # first half
return b[idx1] - b[idx2]
else:
k2 = k / 2
b1 = b[idx1 + k2] if idx1 + k2 < len(s) else -1
b2 = b[idx2 + k2] if idx2 + k2 < len(s) else -1
return b1 - b2
def bucketize(sa, k, s, prev_buckets):
b = [None] * len(s)
bucket = 0
for i in range(len(sa)):
if i > 0 and compare_suffixes(sa[i], sa[i-1], k, s, prev_buckets) != 0:
bucket += 1
b[sa[i]] = bucket
return b
def suffix_array(s):
n = len(s)
sa = range(n)
buckets = None
k = 1
while k <= n:
sa.sort(cmp=lambda i, j: compare_suffixes(i, j, k, s, buckets))
buckets = bucketize(sa, k, s, buckets)
k *= 2
return sa
Running it:>>> suffix_array('banana')
[5, 3, 1, 0, 4, 2]
Source
I learned this algorithm from TopCoder Forums.