
Flajolet–Martin algorithm

  1. The algorithm

  2. Improving accuracy

  3. See also

  4. References

  5. Additional sources


The Flajolet–Martin algorithm is an algorithm for approximating the number of distinct elements in a stream in a single pass, using space logarithmic in the maximal number of possible distinct elements in the stream (the count-distinct problem). The algorithm was introduced by Philippe Flajolet and G. Nigel Martin in their 1985 article "Probabilistic Counting Algorithms for Data Base Applications".[1] It was later refined in "LogLog counting of large cardinalities" by Marianne Durand and Philippe Flajolet,[2] and in "HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm" by Philippe Flajolet et al.[3]

In their 2010 article "An optimal algorithm for the distinct elements problem",[4] Daniel M. Kane, Jelani Nelson and David P. Woodruff give an improved algorithm, which uses nearly optimal space and has optimal O(1) update and reporting times.

The algorithm

Assume that we are given a hash function $\mathrm{hash}(x)$ that maps input $x$ to integers in the range $[0, 2^L - 1]$, and where the outputs are sufficiently uniformly distributed. Note that the set of integers from 0 to $2^L - 1$ corresponds to the set of binary strings of length $L$. For any non-negative integer $y$, define $\mathrm{bit}(y, k)$ to be the $k$-th bit in the binary representation of $y$, such that:

$$y = \sum_{k \geq 0} \mathrm{bit}(y, k) \, 2^k$$

We then define a function $\rho(y)$ that outputs the position of the least-significant set bit in the binary representation of $y$:

$$\rho(y) = \min_{k \geq 0} \{\, k : \mathrm{bit}(y, k) \neq 0 \,\}$$

where $\rho(0) = L$. Note that with the above definition we are using 0-indexing for the positions. For example, $\rho(13) = \rho(1101_2) = 0$, since the least significant bit is a 1 (0th position), and $\rho(8) = \rho(1000_2) = 3$, since the least significant set bit is at position 3. At this point, note that under the assumption that the output of our hash function is uniformly distributed, the probability of observing a hash output ending with $2^k$ (a one, followed by $k$ zeroes) is $2^{-(k+1)}$, since this corresponds to flipping $k$ heads and then a tail with a fair coin.
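As an illustration, the position of the least-significant set bit can be computed with a standard bit trick. The following is a minimal Python sketch; the function name `rho` and the default width `L = 32` are assumptions made for this example, not part of the original article.

```python
def rho(y: int, L: int = 32) -> int:
    """0-indexed position of the least-significant set bit of y.

    Returns L when y == 0, following the convention used in the text.
    The default width L = 32 is an assumption for this sketch.
    """
    if y == 0:
        return L
    # y & -y keeps only the lowest set bit; bit_length() - 1 is its index.
    return (y & -y).bit_length() - 1


print(rho(13))  # 13 is 1101 in binary -> 0
print(rho(8))   # 8 is 1000 in binary  -> 3
```

Counting over all 10-bit values shows the geometric pattern described above: exactly 128 of the 1024 values have their lowest set bit at position 2, a fraction of 1/8.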

Now the Flajolet–Martin algorithm for estimating the cardinality of a multiset $M$ is as follows:

  1. Initialize a bit-vector BITMAP to be of length $L$ and contain all 0s.
  2. For each element $x$ in $M$:
    1. Calculate the index $i = \rho(\mathrm{hash}(x))$.
    2. Set $\mathrm{BITMAP}[i] = 1$.
  3. Let $R$ denote the smallest index $i$ such that $\mathrm{BITMAP}[i] = 0$.
  4. Estimate the cardinality of $M$ as $2^R / \phi$, where $\phi \approx 0.77351$.
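
The steps above can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the hash based on SHA-256, the `seed` parameter, the width `L = 32`, and the names `hash_l`, `rho`, and `fm_estimate` are all assumptions introduced here. The correction factor 0.77351 is the constant given in the original article.

```python
import hashlib

L = 32          # assumed hash width in bits
PHI = 0.77351   # correction factor from the original article


def hash_l(x, seed: int = 0) -> int:
    """Hypothetical L-bit hash: leading bits of SHA-256 over seed and repr(x)."""
    digest = hashlib.sha256(f"{seed}:{x!r}".encode()).digest()
    return int.from_bytes(digest[: L // 8], "big")


def rho(y: int) -> int:
    """0-indexed position of the least-significant set bit; L if y == 0."""
    return L if y == 0 else (y & -y).bit_length() - 1


def fm_estimate(stream, seed: int = 0) -> float:
    """Single-pass cardinality estimate using one bitmap of length L."""
    bitmap = [0] * L
    for x in stream:
        i = rho(hash_l(x, seed))
        # rho returns L for an all-zero hash, which falls outside the
        # bitmap; this sketch simply ignores that (very rare) case.
        if i < L:
            bitmap[i] = 1
    # R is the smallest index whose bit is still 0 (L if every bit is set).
    R = next((i for i, b in enumerate(bitmap) if b == 0), L)
    return 2 ** R / PHI


print(fm_estimate(range(1000)))
```

Because duplicates hash to the same value, repeating elements in the stream does not change the bitmap, so the estimate depends only on the set of distinct elements. A single run can only return values of the form 2^R / 0.77351, which makes individual estimates coarse.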

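The idea is that if $n$ is the number of distinct elements in the multiset $M$, then $\mathrm{BITMAP}[0]$ is accessed approximately $n/2$ times, $\mathrm{BITMAP}[1]$ is accessed approximately $n/4$ times and so on. Consequently, if $i \gg \log_2 n$, then $\mathrm{BITMAP}[i]$ is almost certainly 0, and if $i \ll \log_2 n$, then $\mathrm{BITMAP}[i]$ is almost certainly 1. If $i \approx \log_2 n$, then $\mathrm{BITMAP}[i]$ can be expected to be either 1 or 0.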
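
This intuition can be made slightly more precise. Write $n$ for the number of distinct elements and $i$ for a bit index, and assume (as an idealization) that the $n$ distinct elements hash independently and uniformly. Each distinct element then sets bit $i$ with probability $2^{-(i+1)}$, so the bit remains 0 with probability

$$\Pr\big[\mathrm{BITMAP}[i] = 0\big] = \left(1 - 2^{-(i+1)}\right)^{n} \approx \exp\!\left(-\frac{n}{2^{i+1}}\right).$$

For example, with $n = 1024$ (so $\log_2 n = 10$), the approximation gives about $e^{-16} \approx 1.1 \times 10^{-7}$ for $i = 5$, about $e^{-1} \approx 0.37$ for $i = 9$, and about $e^{-1/32} \approx 0.97$ for $i = 14$.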

The correction factor $\phi \approx 0.77351$ is derived in the original article.[1]

Improving accuracy

A problem with the Flajolet–Martin algorithm in the above form is that the results vary significantly. A common solution is to run the algorithm multiple times with different hash functions and combine the results of the different runs. One idea is to take the mean of the results from all hash functions to obtain a single estimate of the cardinality. The problem with this is that averaging is very susceptible to outliers (which are likely here). A different idea is to use the median, which is less prone to being influenced by outliers. The problem with this is that the results can only take the form $2^R / \phi$, where $R$ is an integer. A common solution is to combine both the mean and the median: create $k \cdot l$ hash functions and split them into $k$ distinct groups (each of size $l$). Within each group use the median to aggregate the $l$ results, and finally take the mean of the $k$ group estimates as the final estimate.
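
The combination step can be sketched as follows. The function name `combine` and the parameter names `k` (number of groups) and `l` (group size) are chosen for illustration, and the input values in the example are made up to show the effect of a single outlier; they are not measurements.

```python
from statistics import mean, median


def combine(estimates, k: int, l: int):
    """Median within each of k groups of size l, then the mean of the medians.

    This follows the order described in the text; `estimates` holds one
    value per hash function, k * l values in total.
    """
    if len(estimates) != k * l:
        raise ValueError("expected exactly k * l estimates")
    groups = [estimates[g * l:(g + 1) * l] for g in range(k)]
    return mean([median(group) for group in groups])


# Hypothetical per-hash-function results with one large outlier (1024).
runs = [2, 4, 4, 4, 8, 1024]
print(combine(runs, k=2, l=3))  # group medians are 4 and 8
print(mean(runs))               # the plain mean is pulled up by the outlier
```

In this example the group medians are 4 and 8, so the combined estimate is 6, whereas the plain mean of all six values exceeds 174. Taking the mean of the medians also yields values that are not restricted to the form 2^R / 0.77351.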

The 2007 HyperLogLog algorithm splits the multiset into subsets and estimates their cardinalities, then it uses the harmonic mean to combine them into an estimate for the original cardinality.[3]

See also

  • Streaming algorithm
  • HyperLogLog

References

1. ^ Flajolet, Philippe; Martin, G. Nigel (1985). "Probabilistic counting algorithms for data base applications". Journal of Computer and System Sciences. 31 (2): 182–209. doi:10.1016/0022-0000(85)90041-8. http://algo.inria.fr/flajolet/Publications/FlMa85.pdf. Retrieved 2016-12-11.
2. ^ Durand, Marianne; Flajolet, Philippe (2003). "Loglog Counting of Large Cardinalities". Algorithms - ESA 2003. Lecture Notes in Computer Science. Vol. 2832. p. 605. doi:10.1007/978-3-540-39658-1_55. ISBN 978-3-540-20064-2. http://algo.inria.fr/flajolet/Publications/DuFl03-LNCS.pdf. Retrieved 2016-12-11.
3. ^ Flajolet, Philippe; Fusy, Éric; Gandouet, Olivier; Meunier, Frédéric (2007). "Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm". Discrete Mathematics and Theoretical Computer Science proceedings. Nancy, France. AH: 127–146. CiteSeerX 10.1.1.76.4286. http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf. Retrieved 2016-12-11.
4. ^ Kane, Daniel M.; Nelson, Jelani; Woodruff, David P. (2010). "An optimal algorithm for the distinct elements problem". Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems of data - PODS '10. p. 41. doi:10.1145/1807085.1807094. ISBN 978-1-4503-0033-9. http://researcher.watson.ibm.com/researcher/files/us-dpwoodru/knw11.pdf. Retrieved 2016-12-11.

Additional sources

  • Rajaraman, Anand; Ullman, Jeffrey David (2011-10-27). Mining of Massive Datasets. Cambridge University Press. p. 119. ISBN 9781139505345. https://books.google.com/books?id=OefRhZyYOb0C&pg=PA119. Retrieved 2014-11-09.