Farach’s Algorithm

  • Farach’s Algorithm 1 aims to construct a Suffix Tree for integer alphabets (i.e., an alphabet that scales in relative to the size of the string so that a string contains potentially many characters).

  • An integer alphabet is one that is obtained by first sorting each character in the alphabet by rank and replacing them with their rank.

  • The algorithm proceeds as follows:

    • *Construct the odd tree . *

      • For , form pairs and radix sort them in linear time, remove any duplicates and compute . is computable in linear time.

      • Then, recursively compute , the suffix tree of . Any odd suffix is a suffix . Any internal node of with length becomes an internal node of with length . Call the new tree — the non-compacted trie containing all odd suffixes.

      • For all internal nodes , take ’s children whose edge start with each character, and introduce a new node between and these children.

        • If , in the process of doing this, ends up with no children because all of ’s children start with the same symbol, we simply delete .
      • can be built for an -string in time

    • Construct the even tree from the odd trie.

      • The idea is to use a lexicographic traversal of the leaves in the compacted odd trie as well as the length of the longest common prefix of adjacent leaves.

      • The even suffixes consist of a single character followed by an odd suffix. We can get the lexicographic ordering of the leaves of in time.

      • The longest common prefix can be calculated as follows for leaves .

      • Given the odd tree , can be constructed in time.

    • Merge the odd and even trees.

      • We use an oracle to report that two edges are “equal” if they share the same first character.

      • If the oracle reports two edges as equal we break the longer of the two edges and merge its prefix with the shorter of the two edges. This produces tree .

      • Let be a node. It is odd if it occurs in and even otherwise. be the length of calculated as the sum of the edge lengths from the path from the root to and be descendants such that is their least common ancestor .

        (Farach 4.1): is properly merged in if

        We can compute in linear time. In fact (Farach 4.2) = the depth of the tree.

      • If then we merged too far, we have the following procedure.

        Let be a border node if . Border nodes can be found in linear time.

        For each , let be the subtree below and the be the tree below .

        Create a new node and make it a child of , the parent of in the merged tree

        Set

        Hang and off of . Order the children of lexicographically.

  • 2 gives an implementation that uses -recursion. Essentially, instead of dividing into even and odd, divide by extracting — the tree of all suffixes in the positions, and then reconstruct the remaining suffix trees.

    This algorithm is time and space algorithm.

Links

Footnotes

  1. Farach (1997) Optimal Suffix Construction with Large Alphabets

  2. Karkkainen and Sanders (2003). Simple Linear Work Suffix Array Construction