Introduction
Given a string, sort the suffixes of that string in ascending order. The resulting sorted list is called a suffix array. For example, for s = "banana"
the suffixes are:
After sorting them we get:
Therefore, sa = [5, 3, 1, 0, 4, 2]
is the suffix array for s
.
The naive solution would be to simply sort the strings using quick sort which takes O(nlogn)
, and since each string comparison takes O(n)
, overall the complexity would be O(n2logn)
.
There are better algorithms to construct a suffix array and they typically run in O(n)
or O(nlogn)
. If you are interested in an O(nlogn)
approach, read this post by cp-algorithms or the paper by Manber and Myers. Also in theory, suffix sorting can be done using Suffix Trees in O(n)
. However, due to the large constant and high memory overhead, in practice suffix arrays are often preferred. In addition, suffix arrays can also be constructed in linear time (ex: 1 & 2). However, a good O(nlogn)
implementation of suffix arrays such as qsufsort is usually sufficient and in practice faster than most linear time implementations.
In the next section, we look at a relatively simple algorithm that creates the suffix array for a given string in O(nlog2n)
using dynamic programming.
Dynamic Programming
The idea of this algorithm is based on efficient suffix comparisons. Suppose cmp(i, j, k)
compares the suffixes starting at i
and j
using only the first k
characters of them. Now, in order to calculate cmp
we look at the comparison of the first half (using the first k/2
characters). If they are different, we know the answer. If they are equal, then we look at the comparison of the second half. psudo-code:
One problem with this recursion formula is that we need to store the comparison between all possible pairs of suffixes at each (k
) to use it in the next round (2 * k
), which would make the algorithm slow. The trick to solve this efficiently is to sort them using the comparison function in O(nlogn)
, and then for each i < j
, using the sorted array sa
, we know the suffix starting at sa[i]
is less than or equal to the suffix starting at sa[j]
. In order to distinguish between <
and =
, we run a bucketization that simply walks over the sorted array and finds chunks of equal suffixes. Later, we use the bucket array to compute cmp(i, j, k / 2) = bucket[i] - bucket[j]
.
For example, far s = "banana"
and k = 2
, sorting all suffixes using only their first 2 letters would be:
[a-]
[an] ana
[an] a
[ba] nana
[na] na
[na]
So, since a- < an = an < ba < na = na
, the bucket array would be [0, 1, 1, 2, 3, 3]
, and to compare two suffixes we can simply subtract their buckets!
Implementation
Here is my implementation of the O(nlog2n)
algorithm in python:
Running it:
Source
I learned this algorithm from TopCoder Forums.