Lyndon Word
Alternative Algorithms for Lyndon Factorization
Sukhpal Singh Ghuman, Emanuele Giaquinta, and Jorma Tarhio
Aalto University, Finland

Lyndon word
Given two strings w and w′, w′ is a rotation of w if w = uv and w′ = vu for some strings u and v. A string is a Lyndon word if it is lexicographically (alphabetically) smaller than all of its proper rotations.

Example: w = ab and w′ = ba, where u = a and v = b. Since w is lexicographically smaller than its rotation w′, w is a Lyndon word.

Examples of Lyndon words
Lyndon words:
• a
• ab
• aabab
Non-Lyndon words:
• ba
• abaac
• abcaac

Lyndon factorization
A word w can be factorized as w = w_0 w_1 w_2 … w_{m-1} such that each factor is a Lyndon word. Every string has a unique factorization into Lyndon words in which the sequence of factors is nonincreasing with respect to lexicographical order. The Lyndon factorization plays an important role in a recent method for sorting the suffixes of a text.

Examples of Lyndon factorization
• abcaabcaaabcaaaabc -> abc aabc aaabc aaaabc
• aacaacaacaad -> aacaacaacaad
• abacabab -> abac ab ab

Duval’s algorithm
To factorize a word w, the algorithm computes the longest prefix w_1 of w = w_1 w′ that is a Lyndon word and then recursively restarts the process from w′. Non-empty prefixes of Lyndon words are all of the form (uv)^k u. Duval’s algorithm computes the factorization with a single left-to-right parse.

Computing Lyndon factorization for T=aabaabaaac
For the string T = aabaabaaac, the parsed prefix P = T[1..i], which is a prefix of a Lyndon word, is equal to (uv)^k u for some strings u and v and an integer k. There are then two cases, depending on the next symbol to be read.

For i = 3, P = aab, with u the empty string, v = aab and k = 1. The next symbol to read is 'a', and aaba is still a prefix of a Lyndon word. The next iteration then starts with P = aaba.

For i = 6, P = aabaab, which is (uv)^k u with u the empty string, v = aab and k = 2.
The next symbol to read is 'a', and after reading 'aaa' it turns out that aabaabaaa is not a prefix of a Lyndon word. The output is two copies of aab, and the next iteration starts on the suffix aaac of T with P = a.

Variations of Duval’s algorithm
• The first variation is the LF-skip algorithm.
• The second variation is for strings compressed with run-length encoding.

LF-skip algorithm
The algorithm is able to skip a significant portion of the characters of the string if the string contains runs of the smallest character.

Let w be a word over an alphabet Σ with factorization CFL(w) = w_1, w_2, ..., w_m, and let c be the smallest symbol in Σ. Suppose there exist k ≥ 2 and i ≥ 1 such that c^k is a prefix of w_i. If the last symbol of w is not c, then c^k is a prefix of each of w_i, w_{i+1}, ..., w_m. This property is used to devise an algorithm for Lyndon factorization that skips symbols.

Example: consider the alphabet {a, b, c, …} and assume that the last character of the string is not a. Let w_i start with aaad. Then the prefix of length 4 of w_{i+1} belongs to the set P = {aaaa, aaab, aaac, aaad}. We search for occurrences of the strings in P with a pattern-matching algorithm that is sublinear on average (e.g. SBNDM), which allows characters to be skipped.

aaadxxxxxxxxxxxaaac
---^---^--^^+++

Run-length encoding
Run-length encoding (RLE) is a very simple form of data compression in which a run of identical symbols is stored as a single symbol together with its count.
Given string: aaaaaabbbccaaabbbccbbbbbaaa
RLE: a6b3c2a3b3c2b5a3

Lyndon factorization of RLE string
The second variation is for strings compressed with run-length encoding. It is preferable when the strings are already stored in RLE form.

The algorithm is based on Duval’s original algorithm and on a combinatorial property relating the Lyndon factorization of a string to its RLE: a run of length t in the RLE is either contained in one factor of the Lyndon factorization, or it corresponds to t unit-length factors.
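To make the definitions and the left-to-right parsing from the earlier slides concrete, here is a minimal Python sketch. It is the standard textbook formulation of Duval’s algorithm, not the LF-skip or RLE-based variation, and the function names (is_lyndon, duval, rle_encode) and the index variable cmp are illustrative choices that do not appear in the slides.

```python
from itertools import groupby


def is_lyndon(w):
    """True if w is strictly smaller than all of its proper rotations."""
    return len(w) > 0 and all(w < w[i:] + w[:i] for i in range(1, len(w)))


def duval(s):
    """Lyndon factorization of s by a single left-to-right scan."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        # s[i:j] is the parsed prefix (uv)^k u; j - cmp is the length of uv.
        j, cmp = i + 1, i
        while j < n and s[cmp] <= s[j]:
            # A larger symbol turns the whole prefix into one Lyndon word
            # (reset cmp); an equal symbol extends the current repetition.
            cmp = i if s[cmp] < s[j] else cmp + 1
            j += 1
        # s[j] is smaller, or the input ended: output the copies of uv.
        while i <= cmp:
            factors.append(s[i:i + j - cmp])
            i += j - cmp
    return factors


def rle_encode(s):
    """Run-length encoding as a list of (symbol, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


print(duval("aabaabaaac"))   # ['aab', 'aab', 'aaac']
print(duval("abacabab"))     # ['abac', 'ab', 'ab']
print(rle_encode("aaaaaabbbccaaabbbccbbbbbaaa"))
```

On T = aabaabaaac the sketch outputs aab twice and then restarts on the suffix aaac, matching the worked example. The RLE example string encodes to the pairs (a,6), (b,3), (c,2), (a,3), (b,3), (c,2), (b,5), (a,3).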
Computing Lyndon factorization from RLE for T=aabaabaaac
For the string T = aabaabaaac, the parsed prefix P = T[1..i], which is a prefix of a Lyndon word, is equal to (uv)^k u for some strings u and v and an integer k. The RLE algorithm works similarly, except that whole runs are read instead of single symbols.

For i = 3, P = aab. The next run to be read is 'aa', and aabaa is still a prefix of a Lyndon word. The next iteration then starts with P = aabaa.

For i = 6, P = aabaab. The next run to be read is 'aaa', and aabaabaaa is not a prefix of a Lyndon word. The next iteration starts on the suffix aaac of T with P = aaa.

Complexity
Given a run-length encoded string R of length ρ, the algorithm computes the Lyndon factorization of R in O(ρ) time. It is preferable to Duval’s algorithm in cases where the strings are stored or maintained in run-length encoding.

Experimental results
The LF-skip algorithm and Duval’s algorithm were run on various texts. LF-skip gave a significant speed-up over Duval’s algorithm. The following table shows the speed-ups for random texts of 5 MB with various alphabet sizes.

[Table: Speed-up of LF-skip]

Conclusion
Two variations of Duval’s algorithm for computing the Lyndon factorization of a string are presented. The first algorithm is designed to skip a significant portion of the characters. Experimental results show that it is considerably faster than Duval’s original algorithm. The second algorithm is for strings compressed with run-length encoding and computes the Lyndon factorization of a run-length encoded string of length ρ in O(ρ) time.

THANK YOU