An implementation of a suffix tree.
> ghci suffixTree.hs
λ graph . SuffixTree.construct $ "cacao"
(or whatever other string you want instead of "cacao
")
- Haskell
- GraphViz (
brew install graphviz
) - Haskell GraphViz (
cabal install graphviz
) - Zora (
cabal install zora
)
-
DC3 implementation uses regular sorting instead of radix sort, so it's
O(n log n)
instead of linear. -
Found a bug? E-mail me at
[email protected]
to let me know and I'll get right on it :)
###Terms
-
T$
: The input string (of length n) -
Suffix array: an array of the suffixes of
T$
, stored in sorted order. -
Suffix tree: A Patricia trie of
T$
where each leaf is labeled with the index where the corresponding suffix starts inT$
.-
O(n)
space, but large constant (15
) -
Usually just an array of start indices of suffixes
-
-
Patricia trie (radix trie): A trie where nodes with only one child are merged with their parents.
-
LCP: Longest common prefix of two strings
###Algorithm
(all steps are O(n)
):
- Build suffix array
S
- Build LCP array
L
of adjacent elems inS
- Build suffix tree from suffix array and LCP array
####Step 1: Build suffix array S
(DC3)
-
Recursively get the sorted order of all suffixes starting at positions that aren't multiples of three.
-
Construct a new string based on suffixes starting at positions in
T1:
andT2
.-
Begin by computing
T$[1:]
andT$[2:]
and padding each with'$'
until the lengths are multiples of three, then strcat -
Treat each block of three characters as its own character.
-
Can determine the relative ordering of those characters by radix sort.
-
Replace each block of three characters with its index.
-
Recursively compute the suffix array of that string.
-
-
Compute the suffix array of that string, recursively.
-
Use the resulting suffix array to deduce the orderings of the suffixes.
-
-
Using this information, sort the suffixes at positions that are at multiples of three (call them
T0
).-
For each position in
T0
, form a pair of the letter at that position and the index of the suffix right after it (which is inT1
). These pairs are effectively strings drawn from an alphabet of size\Sigma + n
. -
Radix sort them.
-
-
3-way merge the sorted lists of suffixes together into the overall suffix array.
- if two compared letters at indices are same, compare letters after them in string
####Step 2. Build LCP array L
of adjacent elems in S
-
pos[i]
: "what's thei
^th lexicographically ordered suffix (== what position does it start at?)?" -
rank[i]
: "what's the lexicographic order (rank) of the suffix starting ati
?" -
i
represents the starting index of a substring-
=>
rank[i]
is the lexicographic order (rank) of suffix starting ati
-
=> each it. of the forloop fills
height
at the index of the next substring, e.g.-
1st it. finds LCP for
"nonsense$"
and whatever suffix is lexicographically before that -
2nd it. finds LCP for
"onsense$"
and whatever suffix is lexicographically before that -
3rd it. ...
"nsense$"
...
-
-
-
k
represents the starting index of the suffix we're comparing against. Simply put,pos[rank[i]]
is justi
, the index of the current suffix (one of the two in the comparison); to get the other, since we want to compare with an adjacent suffix in the suffix array (pos
), we get precisely that suffix (it's adjacent inpos
):pos[rank[i] - 1]
(remember thatpos[j]
for anyj
is the starting index of thej
th lexicographically ordered suffix) -
if h > 0: h -= 1
only makes sense in context of comparings[i+h]
tos[k+h]
. So, why+h
? OK, this is pretty cool. So basically the idea is that ifh
> 0 then that means that last iteration we compared some suffixS
with the suffix lexicographically neighbouring it (this implementation uses the one previous to it inpos
, but you could do the one after it if you wanted) (let it beT
) and found some overlap. Because of the way the forloop is defined (forwards overs
/rank
), the next current suffix (let it beS'
) is justS
with the first char lopped off, and, because of the lexicographic ordering of the suffix array (pos
), the one to which we compare it (pos[rank[i] - 1]
,T'
) is justT
with the first char lopped off, hence, the LCP length betweenS'
andT'
is just the LCP length betweenS
andT
minus one. What's more, that's why this runs in linear time -- even if the string was the same character repeated a whole bunch of times, we'd still be caching the previous iteration's overlap amount inh
.
Here's an example of the interaction between pos
and height
during the construction of height
:
height:
====================
j^* | height[j]
--------------------
| LCP |
| length^** | suffix (pos)
(0) | ----------------------------
| | $ (8)
1 | 0 <-|-----------------
| | e$ (7)
2 | 1 <-|-----------------
| | ense$ (4)
3 | 0 <-|-----------------
| | nonsense$ (0)
4 | 1 <-|-----------------
| | nse$ (5)
5 | 3 <-|-----------------
| | nsense$ (2)
6 | 0 <-|-----------------
| | onsense$ (1)
7 | 0 <-|-----------------
| | se$ (6)
8 | 2 <-|-----------------
| | sense$ (3)
* in forloop, this will be rank[i]
** (LCP length)[q] corresponds to LCP for suffix[q] and [q+1]
####Step 3. Build suffix tree from suffix array and LCP array
-
Construct a Cartesian tree from the LCP array, fusing together nodes with the same values if one becomes a parent of the other.
-
Run a DFS over the tree and add missing children in the order in which they appear in the suffix array.
-
Assign labels to the edges based on the LCP values.
-
Keith Schwarz (explanation of the DC3 and suffix tree construction from suffix array and LCP values largely taken from CS166)
-
T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proc. CPM, volume 2089 of LNCS, pages 181–192. Springer, 2001.
-
J. Kärkkäinen and P. Sanders. Simple linear work suffix array construction. In ICALP, 2003, pp. 943–955.
-
Harold Carr for examples using GraphViz in Haskell: (http://haroldcarr.com/posts/2014-02-28-using-graphviz-via-haskell.html)