Published by Moris Lucas. Modified over 9 years ago.
1
RDF: A Density-based Outlier Detection Method Using Vertical Data Representation Dongmei Ren, Baoying Wang, William Perrizo North Dakota State University, U.S.A
2
Introduction

Related Work
- Breunig et al. [6] first proposed a density-based approach to mining outliers over datasets with different densities.
- Papadimitriou & Kitagawa [7] introduced the local correlation integral (LOCI).
- Limitation: not efficient.

Contributions of this paper
1. A relative density factor (RDF). RDF expresses the same amount of information as LOF (local outlier factor) [6] and MDEF (multi-granularity deviation factor) [7], but RDF is easier to compute.
2. An RDF-based outlier detection method. It efficiently prunes the data points that are deep in clusters and detects outliers only within the remaining small subset of the data.
3. A vertical data representation in P-trees. P-trees further improve the efficiency of the method.
3
Definitions

Definition 1: Disk Neighborhood, DiskNbr(x, r)
Given a point x and radius r, the disk neighborhood of x is defined as the set
$\mathrm{DiskNbr}(x, r) = \{\, x' \in X \mid d(x - x') \le r \,\}$,
where d(x - x') is the distance between x and x'.

[Figure: direct and indirect neighbors of x]

Definition 2: Density of DiskNbr(x, r), Dens(x, r), where dim is the number of dimensions.
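A minimal Python sketch of Definitions 1 and 2. The density formula itself did not survive in the slide text, so the form used here (neighbor count divided by r^dim) is an assumption suggested by the mention of dim. Euclidean distance and the names `disk_nbr` and `density` are also illustrative choices, not taken from the slides.

```python
import math

def disk_nbr(points, x, r):
    """Points within distance r of x (Euclidean distance is an assumption)."""
    return [p for p in points if math.dist(p, x) <= r]

def density(points, x, r):
    """ASSUMED form of Dens(x, r): |DiskNbr(x, r)| / r**dim."""
    dim = len(x)
    return len(disk_nbr(points, x, r)) / r ** dim

# Small 2-D example: three points lie within radius 1 of the origin.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 1.0), (3.0, 3.0)]
nbrs = disk_nbr(points, (0.0, 0.0), 1.0)
dens = density(points, (0.0, 0.0), 1.0)
```

Note that x counts as its own neighbor whenever it belongs to the dataset, since its distance to itself is 0.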
4
Definitions (Continued)

Definition 3: Relative Density Factor (RDF) of point x with radius r, RDF(x, r)
RDF is used to measure outlierness. Outliers are points with high RDF values.

Special case: RDF between DiskNbr(x, r) and {DiskNbr(x, 2r) - DiskNbr(x, r)}

[Figure: direct and indirect disk neighborhoods of x]
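The special case can be sketched in Python as follows. The RDF formula is not present in the slide text, so this sketch assumes RDF is the density of the ring {DiskNbr(x, 2r) - DiskNbr(x, r)} divided by the density of DiskNbr(x, r), with each density taken as a point count over an r^dim-style volume term. That reading is consistent with "outliers are points with high RDF values", but it is an assumption, as is the name `rdf_ring`.

```python
import math

def count_within(points, x, radius):
    """Number of points within Euclidean distance `radius` of x."""
    return sum(1 for p in points if math.dist(p, x) <= radius)

def rdf_ring(points, x, r):
    """ASSUMED RDF: ring density over inner-disk density."""
    dim = len(x)
    inner = count_within(points, x, r)
    ring = count_within(points, x, 2 * r) - inner
    if inner == 0:
        return math.inf  # x is not in the dataset and has no direct neighbors
    inner_density = inner / r ** dim
    ring_density = ring / ((2 * r) ** dim - r ** dim)
    return ring_density / inner_density

# x = (0, 0) is alone inside radius 1, with six points in the ring (1, 2].
points = [(0.0, 0.0), (1.5, 0.0), (0.0, 1.5), (-1.5, 0.0),
          (0.0, -1.5), (1.2, 1.0), (-1.2, -1.0)]
value = rdf_ring(points, (0.0, 0.0), 1.0)
```

Here the surrounding ring is denser than the immediate neighborhood of x, so the ratio exceeds 1, which is the situation the slides associate with outliers.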
5
The Proposed Outlier Detection Method

Given a dataset X, the proposed outlier detection method consists of two procedures: Find Outliers and Prune Non-outliers.

Our method efficiently prunes non-outliers (points deep in clusters), then finds outliers over the remaining small subset of the data, which consists of points on cluster boundaries and real outliers.
6
Finding Outliers

Three possible distributions with regard to RDF:
(a) 1/(1+ε) ≤ RDF ≤ (1+ε): prune all neighbors and call the “Pruning Non-outliers” procedure.
(b) RDF < 1/(1+ε): prune all direct neighbors of x and calculate RDF for each indirect neighbor.
(c) RDF > (1+ε): x is an outlier; prune the indirect neighbors of x.
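The three-way decision can be sketched as a small Python function. The assignment of thresholds to cases is partly an assumption: the slide text only preserves the band for case (a), so case (b) is taken as RDF below the band and case (c) as RDF above it, matching the statement that outliers have high RDF. The function name and the returned labels are hypothetical.

```python
def classify_rdf(rdf, eps):
    """Map an RDF value to one of the three cases (labels are illustrative)."""
    low, high = 1 / (1 + eps), 1 + eps
    if rdf > high:
        return "outlier"            # case (c): prune indirect neighbors of x
    if rdf < low:
        return "check_indirect"     # case (b): prune direct neighbors, test indirect ones
    return "prune_all"              # case (a): prune all neighbors, keep expanding

# With eps = 0.8 the band is roughly [0.556, 1.8].
labels = [classify_rdf(v, 0.8) for v in (0.3, 1.0, 2.0)]
```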
7
Finding Outliers Using P-Trees

P-tree based direct neighbors, $PDN_x^r$
For point x, let $X = (x_1, x_2, \ldots, x_n)$, or in bit form $X = (x_{1,m-1}, \ldots, x_{1,0}), (x_{2,m-1}, \ldots, x_{2,0}), \ldots, (x_{n,m-1}, \ldots, x_{n,0})$, where $x_{i,j}$ is the j-th bit value in the i-th attribute.
For the i-th attribute, $PDN_{x_i}^r = P_{x' > x_i - r} \text{ AND } P_{x' \le x_i + r}$.
For multiple attributes, $|\mathrm{DiskNbr}(x, r)| = rc(PDN_x^r)$, where rc denotes the root count of a P-tree.

P-tree based indirect neighbors, $PIN_x^r$
$PIN_x^r = \left( \mathrm{OR}_{q \in \mathrm{Nbr}(x, r)} \, PDN_q^r \right) \text{ AND } (PDN_x^r)'$, where the prime denotes the complement.

Pruning is done by ANDing P-trees, based on the three distributions above:
(a), (c): $PU = PU \text{ AND } (PDN_x^r)' \text{ AND } (PIN_x^r)'$
(b): $PU = PU \text{ AND } (PDN_x^r)'$
where PU is a P-tree representing the unprocessed data points.
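The following Python sketch illustrates the vertical idea only. It is not a P-tree implementation: real P-trees are compressed tree structures over bit slices, whereas this sketch uses plain Python integers as uncompressed bit vectors with one bit per data point. The names `range_mask`, `pdn`, `root_count` and `prune` are hypothetical, and pruning by ANDing with the complement of the neighbor mask is an assumption about the intended operation.

```python
def range_mask(column, lo, hi):
    """Bit j is set when lo < column[j] <= hi (stand-in for P_{x'>lo} AND P_{x'<=hi})."""
    mask = 0
    for j, v in enumerate(column):
        if lo < v <= hi:
            mask |= 1 << j
    return mask

def pdn(columns, x, r):
    """AND the per-attribute range masks to get the direct-neighbor mask of x."""
    n = len(columns[0])
    mask = (1 << n) - 1          # start with every point selected
    for col, xi in zip(columns, x):
        mask &= range_mask(col, xi - r, xi + r)
    return mask

def root_count(mask):
    """Number of set bits, the analogue of a P-tree root count."""
    return bin(mask).count("1")

def prune(pu, mask):
    """Remove the points in `mask` from the unprocessed set PU (AND with complement)."""
    return pu & ~mask

# Vertical layout: one list per attribute, four data points in total.
columns = [[1, 2, 2, 9], [1, 1, 3, 9]]
neighbors = pdn(columns, (2, 2), 2)
remaining = prune(0b1111, neighbors)
```

Because each attribute is tested against its own interval, the selected region is a box rather than a Euclidean disk, which follows the per-attribute formula on the slide.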
8
Pruning Non-outliers

The pruning is a neighborhood expanding process. It calculates RDF between {DiskNbr(x, 2kr) - DiskNbr(x, kr)} and DiskNbr(x, kr), where k is an integer, and prunes based on the value of RDF:
- 1/(1+ε) ≤ RDF ≤ (1+ε) (density stays constant): continue expanding the neighborhood by doubling the radius.
- RDF < 1/(1+ε) (significant decrease of density): stop expanding, prune DiskNbr(x, kr), and call the “Finding Outliers” process.
- RDF > (1+ε) (significant increase of density): stop expanding and call “Pruning Non-outliers”.
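A rough Python sketch of the expansion loop, under the same assumptions as before: RDF is taken as ring density over inner-disk density, densities are counts over r^dim-style volume terms, and distances are Euclidean. The function name, the returned labels and the `max_doublings` safety cap are hypothetical additions; the sketch only reports why expansion stopped and does not perform the pruning itself.

```python
import math

def count_within(points, x, radius):
    """Number of points within Euclidean distance `radius` of x."""
    return sum(1 for p in points if math.dist(p, x) <= radius)

def expand_neighborhood(points, x, r, eps, max_doublings=32):
    """Double k while density stays constant; x is assumed to be in `points`."""
    dim = len(x)
    low, high = 1 / (1 + eps), 1 + eps
    k = 1
    for _ in range(max_doublings):
        inner = count_within(points, x, k * r)
        ring = count_within(points, x, 2 * k * r) - inner
        inner_density = inner / (k * r) ** dim
        ring_density = ring / ((2 * k * r) ** dim - (k * r) ** dim)
        rdf = ring_density / inner_density
        if rdf < low:
            return "decrease", k   # stop: prune DiskNbr(x, k*r)
        if rdf > high:
            return "increase", k   # stop: density rises sharply
        k *= 2                     # density roughly constant: double the radius
    return "constant", k

# Evenly spaced 1-D points from -8 to 8; expansion runs out of data at k = 8.
points = [(float(i),) for i in range(-8, 9)]
result = expand_neighborhood(points, (0.0,), 1.0, 0.8)
```

In this example the density stays inside the band for k = 1, 2 and 4, and the ring becomes empty at k = 8, which registers as a significant decrease.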
9
Pruning Non-outliers Using P-Trees

We define ξ-neighbors as the neighbors with ξ bits of dissimilarity with x, e.g. ξ = 1, 2, ..., 8 if x is an 8-bit value.
For point x, let $X = (x_1, x_2, \ldots, x_n)$, or in bit form $X = (x_{1,m}, \ldots, x_{1,0}), (x_{2,m}, \ldots, x_{2,0}), \ldots, (x_{n,m}, \ldots, x_{n,0})$, where $x_{i,j}$ is the j-th bit value in the i-th attribute.
For the i-th attribute, the ξ-neighbors of x are calculated by:
[Equation: ξ-neighbors of x for the i-th attribute]
The pruning is accomplished by $PU = PU \text{ AND } P'_{X\xi}$, where $P'_{X\xi}$ is the complement of $P_{X\xi}$, where:
[Equation: definition of $P_{X\xi}$]
10
RDF-based Outlier Detection Process

Algorithm: RDF-based Outlier Detection using P-Trees
Input: dataset X, radius r, distribution parameter ε.
Output: an outlier set Ols.

// PU: unprocessed points represented by P-Trees
// |PU|: number of points in PU
// PO: outliers
// Build up P-Trees for dataset X
PU ← createP-Trees(X);
i ← 1;
WHILE |PU| > 0 DO
    x ← PU.first;    // pick an arbitrary point x
    PO ← FindOutliers(x, r, ε);
    i ← i + 1
ENDWHILE
11
“Find Outliers” and “Prune Non-Outliers” Procedures
12
Experimental Study

Data set: NHL data set (1996)
Compared with LOF and aLOCI:
- LOF: Local Outlier Factor method
- aLOCI: approximate Local Correlation Integral method

[Figures: run time comparison; scalability comparison]

Starting from a data size of 16,384, the proposed method outperforms both in terms of scalability and speed.
13
References

1. V. Barnett, T. Lewis, “Outliers in Statistical Data”, John Wiley & Sons.
2. Knorr, Edwin M. and Raymond T. Ng. A Unified Notion of Outliers: Properties and Computation. 3rd International Conference on Knowledge Discovery and Data Mining Proceedings, 1997, pp. 219-222.
3. Knorr, Edwin M. and Raymond T. Ng. Algorithms for Mining Distance-Based Outliers in Large Datasets. Very Large Data Bases Conference Proceedings, 1998, pp. 24-27.
4. Knorr, Edwin M. and Raymond T. Ng. Finding Intensional Knowledge of Distance-Based Outliers. Very Large Data Bases Conference Proceedings, 1999, pp. 211-222.
5. Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim, “Efficient algorithms for mining outliers from large datasets”, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, ISSN 0163-5808.
6. Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jörg Sander, “LOF: Identifying Density-based Local Outliers”, Proc. ACM SIGMOD 2000 Int. Conf. on Management of Data, Dallas, TX, 2000.
7. Spiros Papadimitriou, Hiroyuki Kitagawa, Phillip B. Gibbons, Christos Faloutsos, LOCI: Fast Outlier Detection Using the Local Correlation Integral, 19th International Conference on Data Engineering, March 5-8, 2003, Bangalore, India.
8. A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Comp. Surveys, 31(3):264-323, 1999.
9. Arning, Andreas, Rakesh Agrawal, and Prabhakar Raghavan. A Linear Method for Deviation Detection in Large Databases. 2nd International Conference on Knowledge Discovery and Data Mining Proceedings, 1996, pp. 164-169.
10. S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-Driven Exploration of OLAP Data Cubes. EDBT'98.
11. Q. Ding, M. Khan, A. Roy, and W. Perrizo, The P-tree algebra. Proceedings of the ACM SAC, Symposium on Applied Computing, 2002.
12. W. Perrizo, “Peano Count Tree Technology,” Technical Report NDSU-CSOR-TR-01-1, 2001.
13. M. Khan, Q. Ding and W. Perrizo, “k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees”, Proc. of PAKDD 2002, Springer-Verlag LNAI 2776, 2002.
14. Wang, B., Pan, F., Cui, Y., and Perrizo, W., Efficient Quantitative Frequent Pattern Mining Using Predicate Trees, CAINE 2003.
15. Pan, F., Wang, B., Zhang, Y., Ren, D., Hu, X. and Perrizo, W., Efficient Density Clustering for Spatial Data, PKDD 2003.
14
Thank you!
15
Determination of Parameters

Determination of r
- Breunig et al. show that choosing MinPts = 10-30 works well in general [6] (MinPts-neighborhood).
- Choosing MinPts = 20, we compute the average radius of the 20-neighborhood, r_average. In our algorithm, r = r_average = 0.5.

Determination of ε
- The selection of ε is a tradeoff between accuracy and speed: the larger ε is, the faster the algorithm works; the smaller ε is, the more accurate the results are.
- We chose ε = 0.8 experimentally and obtained the same result (the same outliers) as Breunig's method, but much faster. The results shown in the experimental study are based on ε = 0.8.
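One way to read "the average radius of the 20-neighborhood" is the mean, over all points, of the distance to the k-th nearest other point. The Python sketch below follows that reading, which is an assumption; the function name is hypothetical, distances are Euclidean, and the brute-force search is only meant for small inputs.

```python
import math

def average_k_distance(points, k):
    """Mean distance from each point to its k-th nearest other point (needs k < len(points))."""
    total = 0.0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        total += dists[k - 1]
    return total / len(points)

# Four evenly spaced 1-D points; the slides use k = 20 on the real data.
points = [(0.0,), (1.0,), (2.0,), (3.0,)]
r_average = average_k_distance(points, 2)
```

For the toy data the two end points have a 2nd-nearest distance of 2 and the two middle points have 1, so the average is 1.5.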