Entry | Local outlier factor |
Definition |
In anomaly detection, the local outlier factor (LOF) is an algorithm proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000 for finding anomalous data points by measuring the local deviation of a given data point with respect to its neighbours.[1] LOF shares some concepts with DBSCAN and OPTICS, such as "core distance" and "reachability distance", which are used for local density estimation.[2]

Basic idea

The local outlier factor is based on a concept of local density, where locality is given by the k nearest neighbors, whose distance is used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, as well as points that have a substantially lower density than their neighbors. These points are considered to be outliers.

The local density is estimated by the typical distance at which a point can be "reached" from its neighbors. The definition of "reachability distance" used in LOF is an additional measure to produce more stable results within clusters.

Formal

Let $k\text{-distance}(A)$ be the distance of the object $A$ to its k-th nearest neighbor. Note that the set of the k nearest neighbors includes all objects at this distance, which in the case of a "tie" can be more than k objects. We denote the set of k nearest neighbors as $N_k(A)$.

This distance is used to define what is called the reachability distance:

$$\text{reachability-distance}_k(A, B) = \max\{k\text{-distance}(B),\ d(A, B)\}$$

In words, the reachability distance of an object $A$ from $B$ is the true distance $d(A, B)$ between the two objects, but at least the $k\text{-distance}$ of $B$. Objects that belong to the k nearest neighbors of $B$ (the "core" of $B$, see DBSCAN cluster analysis) are considered to be equally distant. The reason for this distance is to get more stable results. Note that this is not a distance in the mathematical sense, since it is not symmetric.
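As a rough illustration of the definitions above, the following Python sketch computes the k-distance, the neighbor set with ties, and the reachability distance. It assumes Euclidean distance on small coordinate tuples; the function names and the toy data are hypothetical, chosen only for this example.

```python
import math

def k_distance_and_neighbors(points, a, k):
    """Return (k-distance(a), N_k(a)) for the point with index a.

    N_k(a) holds every other point whose distance to a is at most the
    k-distance, so it can contain more than k indices when there are ties.
    """
    ranked = sorted(
        (math.dist(points[a], points[b]), b)
        for b in range(len(points)) if b != a
    )
    k_dist = ranked[k - 1][0]
    neighbors = [b for d, b in ranked if d <= k_dist]
    return k_dist, neighbors

def reachability_distance(points, a, b, k):
    """Reachability distance of a from b: max(k-distance(b), d(a, b))."""
    k_dist_b, _ = k_distance_and_neighbors(points, b, k)
    return max(k_dist_b, math.dist(points[a], points[b]))

# Hypothetical toy data: three nearby points and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]

# Points 1 and 2 are both at distance 1 from point 0 (a tie).
tie_k_dist, tie_neighbors = k_distance_and_neighbors(points, 0, k=1)

# The same pair of points, taken in both directions.
reach_0_from_1 = reachability_distance(points, 0, 1, k=2)
reach_1_from_0 = reachability_distance(points, 1, 0, k=2)
```

With this toy data, k = 1 already yields two neighbors for the first point because of the tie, and the last two values differ, which shows that the reachability distance is not symmetric.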
(It is a common mistake[10] to always use $k\text{-distance}(A)$ instead of the reachability distance; this yields a slightly different method, referred to as Simplified-LOF.[10])

The local reachability density of an object $A$ is defined by

$$\text{lrd}_k(A) := 1 \Big/ \left( \frac{\sum_{B \in N_k(A)} \text{reachability-distance}_k(A, B)}{|N_k(A)|} \right)$$

which is the inverse of the average reachability distance of the object $A$ from its neighbors. Note that it is not the average reachability of the neighbors from $A$ (which by definition would be $k\text{-distance}(A)$), but the distance at which $A$ can be "reached" from its neighbors. With duplicate points, this value can become infinite.

The local reachability densities are then compared with those of the neighbors using

$$\text{LOF}_k(A) := \frac{\sum_{B \in N_k(A)} \frac{\text{lrd}_k(B)}{\text{lrd}_k(A)}}{|N_k(A)|} = \frac{\sum_{B \in N_k(A)} \text{lrd}_k(B)}{|N_k(A)| \cdot \text{lrd}_k(A)}$$

which is the average local reachability density of the neighbors divided by the object's own local reachability density. A value of approximately 1 indicates that the object is comparable to its neighbors (and thus not an outlier). A value below 1 indicates a denser region (which would be an inlier), while values significantly larger than 1 indicate outliers.

LOF(k) ~ 1 means similar density as neighbors,
LOF(k) < 1 means higher density than neighbors (inlier),
LOF(k) > 1 means lower density than neighbors (outlier).

Advantages

Due to the local approach, LOF is able to identify outliers in a data set that would not be outliers in another area of the data set. For example, a point at a "small" distance to a very dense cluster is an outlier, while a point within a sparse cluster might exhibit similar distances to its neighbors.

While the geometric intuition of LOF is only applicable to low-dimensional vector spaces, the algorithm can be applied in any context where a dissimilarity function can be defined.
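The definitions of the local reachability density and of LOF can be sketched in plain Python as follows. This is a minimal, unoptimized sketch rather than a reference implementation: the function name `lof_scores`, the toy data, and the convention of treating the ratio of two infinite densities as 1 are assumptions made for this example. It works from a precomputed dissimilarity matrix, so another dissimilarity function could be substituted for the Euclidean distance used here.

```python
import math

def lof_scores(dist, k):
    """Compute LOF_k for every object from a full dissimilarity matrix."""
    n = len(dist)
    k_dist, neighbors = [], []
    for a in range(n):
        ranked = sorted((dist[a][b], b) for b in range(n) if b != a)
        kd = ranked[k - 1][0]
        k_dist.append(kd)
        # Keep every object at or below the k-distance (ties included).
        neighbors.append([b for d, b in ranked if d <= kd])

    def reach(a, b):
        # Reachability distance of a from b.
        return max(k_dist[b], dist[a][b])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for a in range(n):
        total = sum(reach(a, b) for b in neighbors[a])
        # Duplicate points can give a zero sum, hence an infinite density.
        lrd.append(math.inf if total == 0 else len(neighbors[a]) / total)

    scores = []
    for a in range(n):
        ratios = []
        for b in neighbors[a]:
            if math.isinf(lrd[a]) and math.isinf(lrd[b]):
                ratios.append(1.0)  # assumption: inf / inf is treated as 1
            else:
                ratios.append(lrd[b] / lrd[a])
        scores.append(sum(ratios) / len(ratios))
    return scores

# Hypothetical toy data: a dense 3x3 grid, a sparse 3x3 grid, and one point
# at a "small" distance (0.5) from the dense grid.
dense = [(0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]
sparse = [(10.0 + 2.0 * i, 10.0 + 2.0 * j) for i in range(3) for j in range(3)]
points = dense + sparse + [(0.7, 0.1)]

dist = [[math.dist(p, q) for q in points] for p in points]
scores = lof_scores(dist, k=3)

outlier_score = scores[-1]
max_sparse_score = max(scores[9:18])
```

In this toy data the isolated point lies 0.5 away from its nearest neighbors, which is closer than the spacing of 2.0 inside the sparse cluster. It should still receive a clearly larger LOF than any point of the sparse cluster, because its neighbors sit in a much denser region.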
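LOF has experimentally been shown to work very well in numerous setups, often outperforming its competitors, for example in network intrusion detection[3] and on processed classification benchmark data.[4]

The LOF family of methods can be easily generalized and then applied to various other problems, such as detecting outliers in geographic data, video streams or authorship networks.[10]

Disadvantages and Extensions

The resulting values are quotients and hard to interpret. A value of 1 or less indicates a clear inlier, but there is no clear rule for when a point is an outlier. In one data set, a value of 1.1 may already indicate an outlier, while in another data set and parameterization (with strong local fluctuations) a value of 2 could still be an inlier. These differences can also occur within a single data set due to the locality of the method.

There exist extensions of LOF that try to improve on it in these respects. Feature bagging for outlier detection runs LOF on multiple projections of the data and combines the results;[5] other ensemble variants are surveyed in [6]. Local Outlier Probabilities (LoOP) is derived from LOF and scales the resulting values to the range [0, 1].[7] Interpreting and Unifying Outlier Scores proposes a normalization of outlier scores to the interval [0, 1].[8] On Evaluation of Outlier Rankings and Outlier Scores proposes measures of similarity and diversity between methods for building outlier detection ensembles.[9]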
References

1. Breunig, M. M.; Kriegel, H.-P.; Ng, R. T.; Sander, J. (2000). "LOF: Identifying Density-based Local Outliers". Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. SIGMOD. pp. 93–104. doi:10.1145/335191.335388. ISBN 1-58113-217-4. http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf
2. Breunig, M. M.; Kriegel, H.-P.; Ng, R. T.; Sander, J. R. (1999). "OPTICS-OF: Identifying Local Outliers". Principles of Data Mining and Knowledge Discovery. Lecture Notes in Computer Science. Vol. 1704. p. 262. doi:10.1007/978-3-540-48247-5_28. ISBN 978-3-540-66490-1. http://www.dbs.ifi.lmu.de/Publikationen/Papers/PKDD99-Outlier.pdf
3. Lazarevic, A.; Ozgur, A.; Ertoz, L.; Srivastava, J.; Kumar, V. (2003). "A comparative study of anomaly detection schemes in network intrusion detection". Proc. 3rd SIAM International Conference on Data Mining. pp. 25–36. http://www.siam.org/proceedings/datamining/2003/dm03_03LazarevicA.pdf
4. Campos, Guilherme O.; Zimek, Arthur; Sander, Jörg; Campello, Ricardo J. G. B.; Micenková, Barbora; Schubert, Erich; Assent, Ira; Houle, Michael E. (2016). "On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study". Data Mining and Knowledge Discovery. ISSN 1384-5810. doi:10.1007/s10618-015-0444-8.
5. Lazarevic, A.; Kumar, V. (2005). "Feature bagging for outlier detection". Proc. 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. pp. 157–166. doi:10.1145/1081870.1081891.
6. Zimek, A.; Campello, R. J. G. B.; Sander, J. R. (2014). "Ensembles for unsupervised outlier detection". ACM SIGKDD Explorations Newsletter. Vol. 15. p. 11. doi:10.1145/2594473.2594476.
7. Kriegel, H.-P.; Kröger, P.; Schubert, E.; Zimek, A. (2009). "LoOP: Local Outlier Probabilities". Proceedings of the 18th ACM conference on Information and knowledge management. CIKM '09. pp. 1649–1652. doi:10.1145/1645953.1646195. ISBN 978-1-60558-512-3. http://www.dbs.ifi.lmu.de/Publikationen/Papers/LoOP1649.pdf
8. Kriegel, H. P.; Kröger, P.; Schubert, E.; Zimek, A. (2011). "Interpreting and Unifying Outlier Scores". Proceedings of the 2011 SIAM International Conference on Data Mining. pp. 13–24. doi:10.1137/1.9781611972818.2. ISBN 978-0-89871-992-5. http://epubs.siam.org/doi/pdf/10.1137/1.9781611972818.2
9. Schubert, E.; Wojdanowski, R.; Zimek, A.; Kriegel, H. P. (2012). "On Evaluation of Outlier Rankings and Outlier Scores". Proceedings of the 2012 SIAM International Conference on Data Mining. pp. 1047–1058. doi:10.1137/1.9781611972825.90. ISBN 978-1-61197-232-0. CiteSeerX 10.1.1.300.7205. http://epubs.siam.org/doi/pdf/10.1137/1.9781611972825.90
10. Schubert, E.; Zimek, A.; Kriegel, H.-P. (2012). "Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection". Data Mining and Knowledge Discovery. doi:10.1007/s10618-012-0300-z.

Categories: Statistical outliers | Data mining | Machine learning algorithms |