Lyndon Word

Alternative Algorithms for Lyndon
Factorization
Sukhpal Singh Ghuman, Emanuele Giaquinta,
and Jorma Tarhio
Aalto University
Finland
Lyndon Word
Given two strings w and w′, w′ is a rotation of w if w=uv
and w′=vu, for some strings u and v.
A string is a Lyndon word if it is lexicographically
(alphabetically) smaller than all its proper rotations.
Lyndon Word
w=ab, w′=ba where u=a, v=b.
w is lexicographically smaller than its rotation w′ .
w is a Lyndon word.
Examples of Lyndon words
Lyndon words
• a
• ab
• aabab
Non-Lyndon words
• ba
• abaac
• abcaac
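The definition can be checked directly by comparing a word with each of its proper rotations. The following is a minimal sketch; the function name is_lyndon is chosen here for illustration, and the quadratic-time rotation test is meant only to mirror the definition, not to be efficient.

```python
def is_lyndon(w: str) -> bool:
    """Return True if w is strictly smaller than every proper rotation of w."""
    if not w:
        return False  # the empty string is not considered a Lyndon word
    # w[i:] + w[:i] is the rotation vu for the split w = uv with |u| = i
    return all(w < w[i:] + w[:i] for i in range(1, len(w)))


for word in ["a", "ab", "aabab", "ba", "abaac", "abcaac"]:
    print(word, is_lyndon(word))
```

Python compares strings by code point, which agrees with alphabetical order for the lowercase examples used here.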
Lyndon factorization
A word w can be factorized as w = w1 w2 … wm
such that each factor wi is a Lyndon word.
Every string has a unique factorization into Lyndon words
in which the sequence of factors is nonincreasing with respect to lexicographical order.
The Lyndon factorization has importance in a recent
method for sorting the suffixes of a text.
Examples of Lyndon factorization
abcaabcaaabcaaaabc -> abc aabc aaabc aaaabc
aacaacaacaad -> aacaacaacaad
abacabab -> abac ab ab
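The examples above can be checked mechanically: the factors must concatenate to the original word, each must be a Lyndon word, and the sequence must be nonincreasing. The sketch below assumes the standard equivalent characterization that a word is Lyndon exactly when it is strictly smaller than each of its proper non-empty suffixes; the function names are illustrative only.

```python
def is_lyndon(w: str) -> bool:
    """Lyndon test via suffixes: w must be smaller than every proper suffix."""
    return bool(w) and all(w < w[i:] for i in range(1, len(w)))


def is_lyndon_factorization(w: str, factors: list[str]) -> bool:
    """Check the three conditions that define the Lyndon factorization of w."""
    return (
        "".join(factors) == w                                   # factors spell w
        and all(is_lyndon(f) for f in factors)                  # each is Lyndon
        and all(a >= b for a, b in zip(factors, factors[1:]))   # nonincreasing
    )


print(is_lyndon_factorization("abacabab", ["abac", "ab", "ab"]))
```

Because the factorization is unique, a candidate that passes all three checks is the Lyndon factorization of the word.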
Duval’s algorithm
To factorize a word w, the algorithm computes the
longest prefix w1 of w = w1w′ that is a Lyndon word
and then recursively restarts the process from w′.
Non-empty prefixes of Lyndon words are all of the form
(uv)^k u.
Duval’s algorithm computes the factorization using a
left-to-right parsing.
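As an illustration, here is a sketch of the commonly used iterative formulation of Duval's algorithm. The variable names i, j and k follow the usual textbook presentation rather than anything shown on these slides: i marks the start of the unfactorized suffix, j is the next symbol to read, and j - k is the length of the current Lyndon period.

```python
def duval(s: str) -> list[str]:
    """Lyndon factorization of s by a single left-to-right scan."""
    n = len(s)
    i = 0
    factors = []
    while i < n:
        j, k = i + 1, i
        # Extend while s[i:j+1] is still a prefix of a Lyndon word.
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i        # the whole prefix s[i:j+1] is a new Lyndon word
            else:
                k += 1       # the current period repeats
            j += 1
        # Output every complete repetition of the period of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


print(duval("aabaabaaac"))  # ['aab', 'aab', 'aaac']
```

Each symbol is compared a bounded number of times, which gives the linear running time usually quoted for the algorithm.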
Computing Lyndon factorization for
T=aabaabaaac
For the string T = aabaabaaac, the parsed prefix P = T[1..i]
is a prefix of a Lyndon word and equals (uv)^k u for some
strings u, v and an integer k.
Then there are two cases, depending on the next
symbol to be read.
Computing Lyndon factorization for
T=aabaabaaac
For i = 3 we have P = aab, with u = empty string, v = aab
and k = 1.
The next symbol to read is 'a' and aaba is still a prefix of
a Lyndon word. The next iteration then starts with P =
aaba.
Computing Lyndon factorization for
T=aabaabaaac
For i = 6, P = aabaab, written as (uv)^k u with u = empty string,
v = aab and k = 2.
The next symbol to read is 'a'; after reading 'aaa', it
is found that aabaabaaa is not a prefix of a Lyndon word.
The output is aab twice, and the next iteration starts on
the suffix aaac of T with P = a.
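The decompositions used in this walkthrough can be reproduced with a small helper. This is a sketch under the assumption that uv is taken to be the block of length equal to the smallest period of P; the name decompose_prefix is hypothetical, and the brute-force period search is for clarity only.

```python
def decompose_prefix(p: str) -> tuple[str, str, int]:
    """Return (u, v, k) with p == (u + v) * k + u and |uv| the smallest period."""
    if not p:
        raise ValueError("p must be non-empty")
    n = len(p)
    # Smallest q such that p[i] == p[i + q] wherever both positions exist.
    period = next(q for q in range(1, n + 1)
                  if all(p[i] == p[i + q] for i in range(n - q)))
    k, r = divmod(n, period)
    u = p[:r]            # trailing partial copy of the period
    v = p[r:period]      # remainder of the period, so that u + v == p[:period]
    return u, v, k


for prefix in ["aab", "aaba", "aabaab"]:
    print(prefix, decompose_prefix(prefix))
```

For P = aab and P = aabaab this yields an empty u with v = aab, and k = 1 and k = 2 respectively, matching the values stated in the walkthrough.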
Variations of Duval’s algorithm.
The first variation is the LF-skip algorithm.
The second variation is for strings compressed with run-length encoding.
LF skip algorithm
The algorithm is able to skip a significant portion of the
characters of the string if it contains runs of the smallest
character.
Let w be a word over an alphabet Σ with a factorization
CFL(w) = w1,w2,...,wm .
LF skip algorithm
Let c be the smallest symbol in Σ.
Suppose there exist k ≥ 2 and i ≥ 1 such that c^k is a prefix of wi.
If the last symbol of w is not c, then, c is a prefix of each
of wi, wi+1, . . . , wm.
This property is used to devise an algorithm for Lyndon
factorization that skips symbols.
LF skip algorithm
Let us consider the alphabet {a,b,c,…}. Let us assume
that the last character is not a.
Let wi start with aaad. We know that the prefix of wi+1
belongs to the set P = {aaaa,aaab,aaac,aaad}.
We search for occurrences of P with an algorithm (e.g.
SBNDM) sublinear on average in order to skip
characters.
aaadxxxxxxxxxxxaaac
---^---^--^^+++
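The pattern set and the search step can be sketched as follows. This is only an illustration of the idea: Python's str.find is used as a plain stand-in for SBNDM (it does not provide the sublinear average-case skipping), the helper names are hypothetical, and the sample text is a made-up string modeled on the example above.

```python
def candidate_patterns(run: str, follower: str, alphabet: str) -> list[str]:
    """Patterns run + s for every alphabet symbol s not greater than follower."""
    return [run + s for s in sorted(alphabet) if s <= follower]


def next_candidate(text: str, start: int, patterns: list[str]) -> int:
    """Smallest index >= start at which some pattern occurs, or -1 if none."""
    hits = [h for h in (text.find(p, start) for p in patterns) if h != -1]
    return min(hits, default=-1)


text = "aaad" + "x" * 11 + "aaac"          # assumed sample text
patterns = candidate_patterns("aaa", "d", "abcdx")
print(patterns)                             # ['aaaa', 'aaab', 'aaac', 'aaad']
print(next_candidate(text, 1, patterns))    # 15
```

In the actual algorithm the gain comes from the multi-pattern matcher examining only some of the characters between two candidate positions, which this stand-in does not attempt to reproduce.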
Run Length Encoding
Run-length encoding (RLE) is a very simple form
of data compression in which runs of symbols are stored
as a single data value.
Given string: aaaaaabbbccaaabbbccbbbbbaaa
RLE: a6b3c2a3b3c2b5a3
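A minimal sketch of the encoding, using itertools.groupby from the Python standard library. The function names and the list-of-pairs representation are choices made for this illustration; the compact text form concatenates each symbol with its run length as in the example above.

```python
from itertools import groupby


def rle_encode(s: str) -> list[tuple[str, int]]:
    """Encode s as a list of (symbol, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_decode(runs: list[tuple[str, int]]) -> str:
    """Expand (symbol, run length) pairs back into the original string."""
    return "".join(ch * n for ch, n in runs)


def rle_format(runs: list[tuple[str, int]]) -> str:
    """Render runs in the compact form symbol followed by count."""
    return "".join(f"{ch}{n}" for ch, n in runs)


runs = [("a", 6), ("b", 3), ("c", 2), ("a", 3), ("b", 3), ("c", 2), ("b", 5), ("a", 3)]
s = rle_decode(runs)
print(rle_format(rle_encode(s)))  # a6b3c2a3b3c2b5a3
```

The compact text form is unambiguous only when the symbols themselves are not digits, which is why the pair representation is used internally here.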
Lyndon factorization of RLE string
The second variation is for strings compressed with run-length encoding.
It is preferable when strings are stored in RLE.
Lyndon factorization of RLE string
The algorithm is based on Duval’s original algorithm and
on a combinatorial property relating the Lyndon
factorization of a string to its RLE.
A run of length t in the RLE is either contained in one
factor of the Lyndon factorization, or it corresponds to t
unit-length factors.
Computing Lyndon factorization from
RLE for T=aabaabaaac
For the string T = aabaabaaac, the parsed prefix P = T[1..i]
is a prefix of a Lyndon word and equals (uv)^k u for some
strings u, v and an integer k.
The RLE algorithm works similarly, except that runs are
read instead of symbols.
Computing Lyndon factorization from
RLE for T=aabaabaaac
For i = 3, P = aab. The next run to be read is 'aa' and
aabaa is still a prefix of a Lyndon word. The next
iteration then starts with P = aabaa.
For i = 6, P = aabaab. The next run to be read is 'aaa'
and aabaabaaa is not a prefix of a Lyndon word.
The next iteration starts on the suffix aaac of T with P = aaa.
Complexity
Given a run-length encoded string R of length ρ,
the algorithm computes the Lyndon factorization of R in
O(ρ) time.
It is preferable to Duval’s algorithm in the cases in which
the strings are stored or maintained in run-length
encoding.
Experimental results
The LF-skip algorithm and Duval’s algorithm were compared on
various texts.
LF-skip gave a significant speed-up over Duval’s
algorithm.
The following table shows the speed-ups for random texts of
5 MB with various alphabet sizes.
Speed-up of LF-skip
Conclusion
Two variations of Duval’s algorithm for computing the
Lyndon factorization of a string are presented.
The first algorithm is designed to skip a significant
portion of the characters.
Experimental results show that the algorithm is
considerably faster than Duval’s original algorithm.
The second algorithm is for strings compressed with
run-length encoding and computes the Lyndon
factorization of a run-length encoded string of length ρ in
O(ρ) time.
THANK YOU