Trie


{{about|a tree data structure|the French commune|Trie-sur-Baïse}}
{{lead rewrite|"example shown" prose belongs in the article body|date=June 2017}}

In computer science, a trie, also called digital tree, radix tree or prefix tree, is a kind of search tree—an ordered tree data structure used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Keys tend to be associated with leaves, though some inner nodes may correspond to keys of interest; hence, keys are not necessarily associated with every node. For a space-optimized variant of the prefix tree, see compact prefix tree.

In the example shown, keys are listed in the nodes and values below them. Each complete English word has an arbitrary integer value associated with it. A trie can be seen as a tree-shaped deterministic finite automaton. Each finite language is generated by a trie automaton, and each trie can be compressed into a deterministic acyclic finite state automaton.

Though tries are usually keyed by character strings,{{citation needed lead|date=June 2017}} they need not be. The same algorithms can be adapted to ordered sequences of any underlying type, e.g. permutations of digits or shapes. In particular, a bitwise trie is keyed on the individual bits making up any fixed-length binary datum, such as an integer or memory address.

History and etymology

Tries were first described by René de la Briandais in 1959.[1][2]{{rp|336}} The term trie was coined two years later by Edward Fredkin, who pronounces it {{IPAc-en|ˈ|t|r|iː}} (as "tree"), after the middle syllable of retrieval.[3][4] However, other authors pronounce it {{IPAc-en|ˈ|t|r|aɪ}} (as "try"), in an attempt to distinguish it verbally from "tree".[3][4]

Applications

As a replacement for other data structures

As discussed below, a trie has a number of advantages over binary search trees.[5] A trie can also be used to replace a hash table, over which it has the following advantages:

  • Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of a search string), than in an imperfect hash table. An imperfect hash table can have key collisions, where the hash function maps different keys to the same position in the table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically it is O(1), with O(m) time spent evaluating the hash.{{cn|date=June 2017}}
  • There are no collisions of different keys in a trie.
  • Buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.
  • There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
  • A trie can provide an alphabetical ordering of the entries by key.

Tries do have some drawbacks as well:

  • Trie lookup can be slower in some cases than hash tables, especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random-access time is high compared to main memory.[6]
  • Some keys, such as floating point numbers, can lead to long chains and prefixes that are not particularly meaningful. Nevertheless, a bitwise trie can handle standard IEEE single and double format floating point numbers.{{cn|date=June 2017}}
  • Some tries can require more space than a hash table, as memory may be allocated for each character in the search string, rather than a single chunk of memory for the whole entry, as in most hash tables.

Dictionary representation

A common application of a trie is storing a predictive text or autocomplete dictionary, such as found on a mobile telephone. Such applications take advantage of a trie's ability to quickly search for, insert, and delete entries; however, if storing dictionary words is all that is required (i.e., storage of information auxiliary to each word is not required), a minimal deterministic acyclic finite state automaton (DAFSA) would use less space than a trie. This is because a DAFSA can compress identical branches from the trie which correspond to the same suffixes (or parts) of different words being stored.
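
As an illustration, the following is a minimal, self-contained autocomplete sketch; the names build and complete are illustrative, not part of any standard API, and nodes are plain dictionaries with a "$" key marking the end of a stored word:

from typing import Dict, List

def build(words: List[str]) -> Dict:
    root: Dict = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node["$"] = True  # terminal marker: a complete word ends here
    return root

def complete(root: Dict, prefix: str) -> List[str]:
    # Walk down to the node for the prefix, then collect its whole subtree.
    node = root
    for char in prefix:
        if char not in node:
            return []
        node = node[char]
    results: List[str] = []
    def collect(subtree: Dict, path: str) -> None:
        if "$" in subtree:
            results.append(prefix + path)
        for char, child in subtree.items():
            if char != "$":
                collect(child, path + char)
    collect(node, "")
    return results

# complete(build(["ten", "tea", "teach", "to"]), "te")
# returns ['ten', 'tea', 'teach']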

Tries are also well suited for implementing approximate matching algorithms,[7] including those used in spell checking and hyphenation[4] software.

Term indexing

A discrimination tree term index stores its information in a trie data structure.[8]

Algorithms

The trie is a tree of nodes which supports Find and Insert operations. Find returns the value for a key string, and Insert inserts a string (the key) and a value into the trie. Both Insert and Find run in {{math|O(n)}} time, where n is the length of the key.

A simple Node class can be used to represent nodes in the trie:

from typing import Any, Dict

class Node:
    def __init__(self):
        # Note: using a dictionary for children (as in this implementation)
        # does not by itself allow the lexicographic sorting mentioned in the
        # next section (Sorting), because a dictionary keeps keys in insertion
        # order rather than sorted order.
        self.children: Dict[str, "Node"] = {}  # mapping from character to Node
        self.value: Any = None

Here children is a dictionary mapping characters to a node's child nodes, and a "terminal" node is one which represents a complete string, indicated by a non-None value.
A trie's value can be looked up as follows:

def find(node: Node, key: str) -> Any:
    for char in key:
        if char in node.children:
            node = node.children[char]
        else:
            return None
    return node.value

Insertion proceeds by walking the trie according to the string to be inserted, then appending new nodes for the suffix of the string that is not contained in the trie:

def insert(node: Node, key: str, value: Any) -> None:
    for char in key:
        if char not in node.children:
            node.children[char] = Node()
        node = node.children[char]
    node.value = value
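
A brief usage example of the functions above:

trie = Node()
insert(trie, "tea", 3)
insert(trie, "ten", 12)
find(trie, "tea")  # returns 3
find(trie, "te")   # returns None: "te" is a prefix but was never inserted as a key
find(trie, "toe")  # returns None: no such path in the trie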

Sorting

Lexicographic sorting of a set of keys can be accomplished by building a trie from them and traversing it in pre-order, outputting a key at each terminal node (which are exactly the leaves when no stored key is a prefix of another). This algorithm is a form of radix sort.{{cn|date=January 2018}}
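
The traversal can be sketched as follows, using the Node class from the previous section; note that sorted() supplies the lexicographic ordering of children that a plain dictionary does not guarantee:

from typing import Iterator

def sorted_keys(node: Node, prefix: str = "") -> Iterator[str]:
    # Pre-order traversal: emit a key at every terminal node
    # (not only at leaves, since inner nodes may hold values too).
    if node.value is not None:
        yield prefix
    for char in sorted(node.children):
        yield from sorted_keys(node.children[char], prefix + char)

# list(sorted_keys(trie)) yields the stored keys in lexicographic order.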

A trie forms the fundamental data structure of Burstsort, which as of 2007 was the fastest known string sorting algorithm;[9] faster string sorting algorithms have since been developed.[10]

Full text search

A special kind of trie, called a suffix tree, can be used to index all suffixes in a text in order to carry out fast full text searches.
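
As a rough, illustrative sketch only (not a real suffix tree, which shares compressed edges and can be built in linear time, e.g. with Ukkonen's algorithm), a naive suffix trie can be built by inserting every suffix with the Node class above, at a cost of O(n²) space:

def build_suffix_trie(text: str) -> Node:
    root = Node()
    for i in range(len(text)):
        insert(root, text[i:], i)  # value = starting position of the suffix
    return root

def contains(root: Node, pattern: str) -> bool:
    # Every substring of the text is a prefix of some suffix,
    # so a substring search is just a walk from the root.
    node = root
    for char in pattern:
        if char not in node.children:
            return False
        node = node.children[char]
    return True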

Implementation strategies

There are several ways to represent tries, corresponding to different trade-offs between memory use and speed of the operations. The basic form is that of a linked set of nodes, where each node contains an array of child pointers, one for each symbol in the alphabet (so for the English alphabet, one would store 26 child pointers and for the alphabet of bytes, 256 pointers). This is simple but wasteful in terms of memory: using the alphabet of bytes (size 256) and four-byte pointers, each node requires a kilobyte of storage, and when there is little overlap in the strings' prefixes, the number of required nodes is roughly the combined length of the stored strings.{{r|brass}}{{rp|341}} Put another way, the nodes near the bottom of the tree tend to have few children and there are many of them, so the structure wastes space storing null pointers.[15]

The storage problem can be alleviated by an implementation technique called alphabet reduction, whereby the original strings are reinterpreted as longer strings over a smaller alphabet. E.g., a string of {{mvar|n}} bytes can alternatively be regarded as a string of {{math|2n}} four-bit units and stored in a trie with sixteen pointers per node. Lookups need to visit twice as many nodes in the worst case, but the storage requirements go down by a factor of eight.{{r|brass}}{{rp|347–352}}
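
A sketch of this reduction, with an illustrative helper that rewrites a byte string as four-bit units:

def to_nibbles(key: bytes) -> bytes:
    # Each byte becomes two symbols from a 16-symbol alphabet,
    # so a trie node needs 16 child slots instead of 256.
    out = bytearray()
    for b in key:
        out.append(b >> 4)    # high nibble
        out.append(b & 0x0F)  # low nibble
    return bytes(out)

# to_nibbles(b"Hi") == bytes([0x4, 0x8, 0x6, 0x9]), since "H" is 0x48 and "i" is 0x69.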

An alternative implementation represents a node as a triple {{mono|(symbol, child, next)}} and links the children of a node together as a singly linked list: {{mono|child}} points to the node's first child, {{mono|next}} to the parent node's next child.[11][12] The set of children can also be represented as a binary search tree; one instance of this idea is the ternary search tree developed by Bentley and Sedgewick.{{r|brass}}{{rp|353}}
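
A minimal sketch of the (symbol, child, next) representation (the class and function names are illustrative):

from typing import Optional

class SiblingNode:
    def __init__(self, symbol: str):
        self.symbol = symbol
        self.child: Optional["SiblingNode"] = None  # first child
        self.next: Optional["SiblingNode"] = None   # next sibling
        self.value = None

def find_child(node: SiblingNode, symbol: str) -> Optional[SiblingNode]:
    # A linear scan of the sibling list replaces the array index;
    # slower per step, but no space is spent on absent children.
    child = node.child
    while child is not None and child.symbol != symbol:
        child = child.next
    return child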

Another alternative, which avoids the array of 256 pointers (ASCII) suggested above, is to store the child alphabet as a bitmap of 256 bits representing the ASCII alphabet, dramatically reducing the size of the nodes.[13]
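
A sketch of such a bitmap node, using a Python integer as the 256-bit bitmap and a popcount-style rank to index a packed child list (the class name is illustrative; the same trick is used in hash array mapped tries):

from typing import List, Optional

class BitmapNode:
    def __init__(self):
        self.bitmap = 0  # bit i is set iff a child exists for byte value i
        self.children: List["BitmapNode"] = []  # packed: one entry per set bit
        self.value = None

    def get(self, byte: int) -> Optional["BitmapNode"]:
        bit = 1 << byte
        if not self.bitmap & bit:
            return None
        # The number of set bits below `byte` (a popcount) is the child's index.
        return self.children[bin(self.bitmap & (bit - 1)).count("1")]

    def set(self, byte: int, child: "BitmapNode") -> None:
        bit = 1 << byte
        index = bin(self.bitmap & (bit - 1)).count("1")
        if self.bitmap & bit:
            self.children[index] = child
        else:
            self.children.insert(index, child)
            self.bitmap |= bit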

Bitwise tries

{{unreferenced section|date=February 2015}}

Bitwise tries are much the same as a normal character-based trie except that individual bits are used to traverse what effectively becomes a form of binary tree. Generally, implementations use a special CPU instruction to very quickly find the first set bit in a fixed length key (e.g., GCC's __builtin_clz() intrinsic). This value is then used to index a 32- or 64-entry table which points to the first item in the bitwise trie with that number of leading zero bits. The search then proceeds by testing each subsequent bit in the key and choosing child[0] or child[1] appropriately until the item is found.
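
The bit-walking part can be sketched in Python on fixed-width integer keys (the names below are illustrative; Python has no clz instruction, but int.bit_length() plays the same role):

WIDTH = 32  # fixed key width in bits

class BitNode:
    def __init__(self):
        self.child = [None, None]  # child[0] and child[1]
        self.value = None

def bitwise_insert(root: BitNode, key: int, value) -> None:
    node = root
    for i in range(WIDTH - 1, -1, -1):  # most significant bit first
        bit = (key >> i) & 1
        if node.child[bit] is None:
            node.child[bit] = BitNode()
        node = node.child[bit]
    node.value = value

# WIDTH - key.bit_length() is the number of leading zero bits, i.e. what a
# hardware count-leading-zeros instruction would return; as described above,
# implementations use it to index the table of entry points and skip the zero prefix.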

Although this process might sound slow, it is very cache-local and highly parallelizable due to the lack of register dependencies, and therefore in fact has excellent performance on modern out-of-order execution CPUs. A red-black tree, for example, performs much better on paper, but is highly cache-unfriendly and causes multiple pipeline and TLB stalls on modern CPUs, which makes that algorithm bound by memory latency rather than CPU speed. In comparison, a bitwise trie rarely accesses memory, and when it does, it does so only to read, thus avoiding SMP cache coherency overhead. Hence, it is increasingly becoming the algorithm of choice for code that performs many rapid insertions and deletions, such as memory allocators (e.g., recent versions of Doug Lea's well-known allocator (dlmalloc) and its descendants).

Compressing tries

Compressing the trie and merging the common branches can sometimes yield large performance gains. This works best under the following conditions:

  • The trie is mostly static (key insertions to or deletions from a pre-filled trie are disabled).{{cn|date=June 2017}}
  • Only lookups are needed.
  • The trie nodes are not keyed by node-specific data, or the nodes' data are common.[14]
  • The total set of stored keys is very sparse within their representation space.{{cn|date=June 2017}}

For example, it may be used to represent sparse bitsets, i.e., subsets of a much larger, fixed enumerable set. In such a case, the trie is keyed by the bit element position within the full set. The key is created from the string of bits needed to encode the integral position of each element. Such tries have a very degenerate form with many missing branches. After detecting the repetition of common patterns or filling the unused gaps, the unique leaf nodes (bit strings) can be stored and compressed easily, reducing the overall size of the trie.
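
A sketch of this encoding on top of the Node class from the Algorithms section (the names are illustrative, and a practical implementation would also compress the degenerate chains this produces):

UNIVERSE_BITS = 16  # fixed universe of 2**16 possible element positions

def add_member(root: Node, position: int) -> None:
    # Key the trie by the bit string encoding the element's position.
    insert(root, format(position, f"0{UNIVERSE_BITS}b"), True)

def is_member(root: Node, position: int) -> bool:
    return find(root, format(position, f"0{UNIVERSE_BITS}b")) is True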

Such compression is also used in the implementation of the various fast lookup tables for retrieving Unicode character properties. These could include case-mapping tables (e.g. for the Greek letter pi, from Π to π), or lookup tables normalizing the combination of base and combining characters (like the a-umlaut in German, ä, or the dalet-patah-dagesh-ole in Biblical Hebrew, {{Hebrew|דַּ֫}}). For such applications, the representation is similar to transforming a very large, unidimensional, sparse table (e.g. Unicode code points) into a multidimensional matrix of their combinations, and then using the coordinates in the hyper-matrix as the string key of an uncompressed trie to represent the resulting character. The compression will then consist of detecting and merging the common columns within the hyper-matrix to compress the last dimension in the key. For example, to avoid storing the full, multibyte Unicode code point of each element forming a matrix column, the groupings of similar code points can be exploited. Each dimension of the hyper-matrix stores the start position of the next dimension, so that only the offset (typically a single byte) need be stored. The resulting vector is itself compressible when it is also sparse, so each dimension (associated to a layer level in the trie) can be compressed separately.

Some implementations do support such data compression within dynamic sparse tries and allow insertions and deletions in compressed tries. However, this usually has a significant cost when compressed segments need to be split or merged. Some tradeoff has to be made between data compression and update speed. A typical strategy is to limit the range of global lookups for comparing the common branches in the sparse trie.{{cn|date=June 2017}}

The result of such compression may look similar to trying to transform the trie into a directed acyclic graph (DAG), because the reverse transform from a DAG to a trie is obvious and always possible. However, the shape of the DAG is determined by the form of the key chosen to index the nodes, in turn constraining the compression possible.

Another compression strategy is to "unravel" the data structure into a single byte array.[15] This approach eliminates the need for node pointers, substantially reducing the memory requirements. This in turn permits memory mapping and the use of virtual memory to efficiently load the data from disk.

One more approach is to "pack" the trie.[16] Liang describes a space-efficient implementation of a sparse packed trie applied to automatic hyphenation, in which the descendants of each node may be interleaved in memory.

External memory tries

Several trie variants are suitable for maintaining sets of strings in external memory, including suffix trees. A combination of trie and B-tree, called the B-trie, has also been suggested for this task; compared to suffix trees, B-tries are limited in the operations they support, but they are also more compact and perform update operations faster.[17]

See also

{{div col|colwidth=22em}}
  • Suffix tree
  • Radix tree
  • Directed acyclic word graph (aka DAWG)
  • Acyclic deterministic finite automata
  • Hash trie
  • Deterministic finite automata
  • Judy array
  • Search algorithm
  • Extendible hashing
  • Hash array mapped trie
  • Prefix hash tree
  • Burstsort
  • Luleå algorithm
  • Huffman coding
  • Ctrie
  • HAT-trie
{{div col end}}

References

1. ^{{cite conference |first=René |last=de la Briandais |year=1959 |title=File searching using variable length keys |conference=Proc. Western J. Computer Conf. |pages=295–298}} Cited by Brass.
2. ^{{cite book |first=Peter |last=Brass |title=Advanced Data Structures |year=2008 |publisher=Cambridge University Press}}
3. ^{{cite web|url=https://xlinux.nist.gov/dads/HTML/trie.html|title=trie|first=Paul E.|last=Black|date=2009-11-16|work=Dictionary of Algorithms and Data Structures|publisher=National Institute of Standards and Technology|archiveurl=https://www.webcitation.org/5pqUULy24|archivedate=2010-05-19}}
4. ^{{cite book|last=Knuth|first=Donald|authorlink=Donald Knuth|title=The Art of Computer Programming Volume 3: Sorting and Searching|edition=2nd|year=1997|publisher=Addison-Wesley|isbn=0-201-89685-0|page=492|chapter=6.3: Digital Searching}}
5. ^{{cite journal |last1=Bentley|first1=Jon|last2=Sedgewick|first2=Robert|authorlink2=Robert Sedgewick (computer scientist)| title=Ternary Search Trees| journal=Dr. Dobb's Journal|date=1998-04-01|publisher=Dr Dobb's|url=http://www.ddj.com/windows/184410528|archiveurl=https://web.archive.org/web/20080623071352/http://www.ddj.com/windows/184410528 |archivedate=2008-06-23}}
6. ^{{cite journal | author=Edward Fredkin| authorlink=Edward Fredkin| title=Trie Memory| journal=Communications of the ACM| year=1960| volume=3| issue=9| pages=490–499| doi=10.1145/367390.367400 }}
7. ^{{Cite journal | last1 = Aho | first1 = Alfred V. | last2 = Corasick | first2 = Margaret J. | date = Jun 1975 | title = Efficient String Matching: An Aid to Bibliographic Search | journal = Communications of the ACM | volume = 18 | issue = 6 | pages = 333–340 | publisher = | jstor = | url = https://pdfs.semanticscholar.org/3547/ac839d02f6efe3f6f76a8289738a22528442.pdf | format = | accessdate = | doi=10.1145/360825.360855}}
8. ^John W. Wheeler; Guarionex Jordan. "An Empirical Study of Term Indexing in the Darwin Implementation of the Model Evolution Calculus". 2004. p. 5.
9. ^{{Cite web|url=http://www.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf|format=PDF|title=Cache-Efficient String Sorting Using Copying|accessdate=2008-11-15}}
10. ^{{Cite journal|title=Engineering Radix Sort for Strings.|doi=10.1007/978-3-540-89097-3_3 | journal=Lecture Notes in Computer Science | pages=3–14}}
11. ^{{cite web |first=Lloyd |last=Allison |title=Tries |url=http://www.allisons.org/ll/AlgDS/Tree/Trie/ |accessdate=18 February 2014}}
12. ^{{cite web |website=Data Structures, Algorithms, & Applications in Java |title=Tries |first=Sartaj |last=Sahni |publisher=University of Florida |url=https://www.cise.ufl.edu/~sahni/dsaaj/enrich/c16/tries.htm |accessdate=18 February 2014}}
13. ^{{cite book|last1=Bellekens|first1=Xavier|title=A Highly-Efficient Memory-Compression Scheme for GPU-Accelerated Intrusion Detection Systems|date=2014|publisher=ACM|location=Glasgow, Scotland, UK|isbn=978-1-4503-3033-6|pages=302:302--302:309|url=http://doi.acm.org/10.1145/2659651.2659723|accessdate=21 October 2015}}
14. ^{{cite journal|url = http://www.pg.gda.pl/~jandac/daciuk98.ps.gz|title = Incremental Construction of Minimal Acyclic Finite-State Automata|year = 2000|author1=Jan Daciuk |author2=Stoyan Mihov |author3=Bruce W. Watson |author4=Richard E. Watson |journal = Computational Linguistics |publisher = Association for Computational Linguistics|pages = 3|doi = 10.1162/089120100561601|archiveurl = http://www.mitpressjournals.org/doi/abs/10.1162/089120100561601|archivedate= 2006-03-13|quote = This paper presents a method for direct building of minimal acyclic finite states automaton which recognizes a given finite list of words in lexicographical order. Our approach is to construct a minimal automaton in a single phase by adding new strings one by one and minimizing the resulting automaton on-the-fly|accessdate = 2009-05-28|volume = 26|arxiv= cs/0007009}}
15. ^{{Cite web|url=http://www.aclweb.org/anthology/W/W09/W09-1505.pdf|format=PDF|title = Tightly packed tries: how to fit large models into memory, and make them load fast, too|year = 2009|author1=Ulrich Germann |author2=Eric Joanis |author3=Samuel Larkin |work = ACL Workshops: Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing|publisher = Association for Computational Linguistics|pages = 31–39|quote = We present Tightly Packed Tries (TPTs), a compact implementation of read-only, compressed trie structures with fast on-demand paging and short load times. We demonstrate the benefits of TPTs for storing n-gram back-off language models and phrase tables for statistical machine translation. Encoded as TPTs, these databases require less space than flat text file representations of the same data compressed with the gzip utility. At the same time, they can be mapped into memory quickly and be searched directly in time linear in the length of the key, without the need to decompress the entire file. The overhead for local decompression during search is marginal.}}
16. ^{{cite thesis|degree=Doctor of Philosophy|title=Word Hy-phen-a-tion By Com-put-er|url=http://www.tug.org/docs/liang/liang-thesis.pdf|author=Franklin Mark Liang|year=1983|publisher=Stanford University|accessdate=2010-03-28|archiveurl=https://www.webcitation.org/5pqOfzlIA|archivedate=2010-05-19}}
17. ^{{cite journal |title=B-tries for Disk-based String Management |first1=Nikolas |last1=Askitis |first2=Justin |last2=Zobel |year=2008 |issn=1066-8888 |pages=1–26 |url=http://people.eng.unimelb.edu.au/jzobel/fulltext/vldbj09.pdf |journal=VLDB Journal}}

External links

{{Commons category}}{{wiktionary}}
  • [https://xlinux.nist.gov/dads/HTML/trie.html NIST's Dictionary of Algorithms and Data Structures: Trie]
{{CS-Trees}}{{Data structures}}{{Strings}}

Categories: Trees (data structures) | Finite automata | Articles with example Python code | Articles with example Haskell code
