词条 | LCP array | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
释义 |
In computer science, the longest common prefix array (LCP array) is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array. For example, if A := [aab, ab, abaab, b, baab] is a suffix array, the longest common prefix between A[1] = aab and A[2] = ab is a which has length 1, so H[2] = 1 in the LCP array H. Likewise, the LCP of A[2] = ab and A[3] = abaab is ab, so H[3] = 2. Augmenting the suffix array with the LCP array allows one to efficiently simulate top-down and bottom-up traversals of the suffix tree,{{sfn|Kasai|Lee|Arimura|Arikawa|2001}}{{sfn|Abouelhoda|Kurtz|Ohlebusch|2004}} speeds up pattern matching on the suffix array{{sfn|Manber|Myers|1993}} and is a prerequisite for compressed suffix trees.{{sfn|Ohlebusch|Fischer|Gog|2010}} HistoryThe LCP array was introduced in 1993, by Udi Manber and Gene Myers alongside the suffix array in order to improve the running time of their string search algorithm.{{sfn|Manber|Myers|1993}} Gene Myers later became the vice president of Informatics Research at Celera Genomics, and Udi Manber the vice president of engineering at Google. DefinitionLet be the suffix array of the string and let denote the length of the longest common prefix between two strings and . Let further denote the substring of ranging from to . Then the LCP array is an integer array of size such that is undefined and for every . Thus stores the length of longest common prefix of the lexicographically 'th smallest suffix and its predecessor in the suffix array. ExampleConsider the string :
and its corresponding sorted suffix array :
Complete suffix array with sorted suffixes itself:
Then the LCP array is constructed by comparing lexicographically consecutive suffixes to determine their longest common prefix:
So, for example, is the length of the longest common prefix shared by the suffixes and . Note that , since there is no lexicographically smaller suffix. Difference between suffix array and LCP arraySuffix array: Represents the lexicographic rank of each suffix of an array. LCP array: Contains the maximum length prefix match between two consecutive suffixes, after they are sorted lexicographically. LCP array usage in finding the number of occurrences of a pattern{{Cleanup|reason=this section is a straight up copy of a [https://stackoverflow.com/a/11374737 StackOverflow answer] so it has the form of a reply to a question.|date=June 2016}}In order to find the number of occurrences of a given string P (length m) in a text T (length N),
The issue with using standard binary search (without the LCP information) is that in each of the O(log N) comparisons needed to be made, we compare P to the current entry of the suffix array, which means a full string comparison of up to m characters. So the complexity is O(m*log N). The LCP-LR array helps improve this to O(m+log N), in the following way: At any point during the binary search algorithm, we consider, as usual, a range (L,...,R) of the suffix array and its central point M, and decide whether we continue our search in the left sub-range (L,...,M) or in the right sub-range (M,...,R). In order to make the decision, we compare P to the string at M. If P is identical to M, search is complete. But if not, we have already compared the first k characters of P and then decided whether P is lexicographically smaller or larger than M. Let's assume the outcome is that P is larger than M. So, in the next step, we consider (M,...,R) and a new central point M' in the middle: | we know: lcp(P,M)==k The trick now is that LCP-LR is precomputed such that an O(1)-lookup tells us the longest common prefix of M and M', lcp(M,M'). We already know (from the previous step) that M itself has a prefix of k characters in common with P: lcp(P,M)=k. Now there are three possibilities:
The overall effect is that no character of P is compared to any character of the text more than once. The total number of character comparisons is bounded by m, so the total complexity is indeed O(m+log N). We still need to precompute LCP-LR so it is able to tell us in O(1) time the lcp between any two entries of the suffix array. We know the standard LCP array gives us the lcp of consecutive entries only, i.e. lcp(i-1,i) for any i. However, M and M' in the description above are not necessarily consecutive entries. The key to this is to realize that only certain ranges (L,...,R) will ever occur during the binary search: It always starts with (0,...,N) and divides that at the center, and then continues either left or right and divide that half again and so forth. Another way of looking at it is : every entry of the suffix array occurs as central point of exactly one possible range during binary search. So there are exactly N distinct ranges (L...M...R) that can possibly play a role during binary search, and it suffices to precompute lcp(L,M) and lcp(M,R) for those N possible ranges. So that is 2*N distinct precomputed values, hence LCP-LR is O(N) in size. Moreover, there is a straightforward recursive algorithm to compute the 2*N values of LCP-LR in O(N) time from the standard LCP array. To sum up:
Efficient construction algorithmsLCP array construction algorithms can be divided into two different categories: algorithms that compute the LCP array as a byproduct to the suffix array and algorithms that use an already constructed suffix array in order to compute the LCP values. {{harvtxt|Manber|Myers|1993}} provide an algorithm to compute the LCP array alongside the suffix array in time. {{harvtxt|Kärkkäinen|Sanders|2003}} show that it is also possible to modify their time algorithm such that it computes the LCP array as well. {{harvtxt|Kasai|Lee|Arimura|Arikawa|2001}} present the first time algorithm (FLAAP) that computes the LCP array given the text and the suffix array.Assuming that each text symbol takes one byte and each entry of the suffix or LCP array takes 4 bytes, the major drawback of their algorithm is a large space occupancy of bytes, while the original output (text, suffix array, LCP array) only occupies bytes. Therefore, {{harvtxt|Manzini|2004}} created a refined version of the algorithm of {{harvtxt|Kasai|Lee|Arimura|Arikawa|2001}} (lcp9) and reduced the space occupancy to bytes. {{harvtxt|Kärkkäinen|Manzini|Puglisi|2009}} provide another refinement of Kasai's algorithm (-algorithm) that improves the running time. Rather than the actual LCP array, this algorithm builds the permuted LCP (PLCP) array, in which the values appear in text order rather than lexicographical order. {{harvtxt|Gog|Ohlebusch|2011}} provide two algorithms that although being theoretically slow () were faster than the above-mentioned algorithms in practice.{{As of|2012}}, the currently fastest linear-time LCP array construction algorithm is due to {{harvtxt|Fischer|2011}}, which in turn is based on one of the fastest suffix array construction algorithms by {{harvtxt|Nong|Zhang|Chan|2009}}.ApplicationsAs noted by {{harvtxt|Abouelhoda|Kurtz|Ohlebusch|2004}} several string processing problems can be solved by the following kinds of tree traversals:
Deciding if a pattern of length is a substring of a string of length takes time if only the suffix array is used. By additionally using the LCP information, this bound can be improved to time.{{sfn|Manber|Myers|1993}} {{harvtxt|Abouelhoda|Kurtz|Ohlebusch|2004}} show how to improve this running time even further to achieve optimal time. Thus, using suffix array and LCP array information, the decision query can be answered as fast as using the suffix tree. The LCP array is also an essential part of compressed suffix trees which provide full suffix tree functionality like suffix links and lowest common ancestor queries.{{sfn|Sadakane|2007}}{{sfn|Fischer|Mäkinen|Navarro|2009}} Furthermore, it can be used together with the suffix array to compute the Lempel-Ziv LZ77 factorization in time.{{sfn|Abouelhoda|Kurtz|Ohlebusch|2004}}{{sfn|Crochemore|Ilie|2008}}{{sfn|Crochemore|Ilie|Smyth|2008}}{{sfn|Chen|Puglisi|Smyth|2008}} The longest repeated substring problem for a string of length can be solved in time using both the suffix array and the LCP array. It is sufficient to perform a linear scan through the LCP array in order to find its maximum value and the corresponding index where is stored. The longest substring that occurs at least twice is then given by . The remainder of this section explains two applications of the LCP array in more detail: How the suffix array and the LCP array of a string can be used to construct the corresponding suffix tree and how it is possible to answer LCP queries for arbitrary suffixes using range minimum queries on the LCP array. Suffix tree constructionGiven the suffix array and the LCP array of a string of length , its suffix tree can be constructed in time based on the following idea: Start with the partial suffix tree for the lexicographically smallest suffix and repeatedly insert the other suffixes in the order given by the suffix array. Let be the partial suffix tree for . Further let be the length of the concatenation of all path labels from the root of to node . Start with , the tree consisting only of the root. To insert into , walk up the rightmost path beginning at the recently inserted leaf to the root, until the deepest node with is reached. We need to distinguish two cases:
A simple amortization argument shows that the running time of this algorithm is bounded by : The nodes that are traversed in step by walking up the rightmost path of (apart from the last node ) are removed from the rightmost path, when is added to the tree as a new leaf. These nodes will never be traversed again for all subsequent steps . Therefore, at most nodes will be traversed in total. LCP queries for arbitrary suffixesThe LCP array only contains the length of the longest common prefix of every pair of consecutive suffixes in the suffix array . However, with the help of the inverse suffix array (, i.e. the suffix that starts at position in is stored in position in ) and constant-time range minimum queries on , it is possible to determine the length of the longest common prefix of arbitrary suffixes in time. Because of the lexicographic order of the suffix array, every common prefix of the suffixes and has to be a common prefix of all suffixes between 's position in the suffix array and 's position in the suffix array . Therefore, the length of the longest prefix that is shared by all of these suffixes is the minimum value in the interval . This value can be found in constant time if is preprocessed for range minimum queries. Thus given a string of length and two arbitrary positions in the string with , the length of the longest common prefix of the suffixes and can be computed as follows: . NotesReferences
}}
| title = Simple linear work suffix array construction | url = http://dl.acm.org/citation.cfm?id=1759210.1759301 | year = 2003 | conference = Proceedings of the 30th international conference on Automata, languages and programming | pages = 943–955 | last1 = Kärkkäinen | first1 = Juha | last2 = Sanders | first2 = Peter | author2-link = Peter Sanders (computer scientist) | accessdate = 2012-08-28 }}
| title = Fast and Lightweight LCP-Array Construction Algorithms | url = http://www.siam.org/proceedings/alenex/2011/alx11_03_gogs.pdf | year = 2011 | conference = Proceedings of the Workshop on Algorithm Engineering and Experiments, ALENEX 2011 | pages = 25–34 | last1 = Gog | first1 = Simon | last2 = Ohlebusch | first2 = Enno | accessdate = 2012-08-28 }}
| last2 = Heun | first2 = Volker | title = A New Succinct Representation of RMQ-Information and Improvements in the Enhanced Suffix Array | conference = Combinatorics, Algorithms, Probabilistic and Experimental Methodologies | series = Lecture Notes in Computer Science | volume = 4614 | pages = 459 | year = 2007 | isbn = 978-3-540-74449-8 }}
External links{{Commonscat}}
3 : Arrays|Substring indices|String data structures |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
随便看 |
|
开放百科全书收录14589846条英语、德语、日语等多语种百科知识,基本涵盖了大多数领域的百科知识,是一部内容自由、开放的电子版国际百科全书。