RDF: A Density-based Outlier Detection Method Using Vertical Data Representation Dongmei Ren, Baoying Wang, William Perrizo North Dakota State University, U.S.A.



Introduction

Related Work
- Breunig et al. [6] first proposed a density-based approach to mining outliers over datasets with different densities.
- Papadimitriou and Kitagawa [7] introduced the local correlation integral (LOCI), which is not efficient.

Contributions of this paper
1. A relative density factor (RDF): RDF expresses the same amount of information as LOF (local outlier factor) [6] and MDEF (multi-granularity deviation factor) [7], but is easier to compute.
2. An RDF-based outlier detection method: it efficiently prunes the data points that lie deep inside clusters and detects outliers only within the remaining small subset of the data.
3. A vertical data representation in P-trees: P-trees further improve the efficiency of the method.

Definitions

Definition 1: Disk Neighborhood --- DiskNbr(x, r)
Given a point x and a radius r, the disk neighborhood of x is defined as the set DiskNbr(x, r) = {x′ ∈ X | d(x, x′) ≤ r}, where d(x, x′) is the distance between x and x′. Neighbors of x are split into direct neighbors (inside DiskNbr(x, r)) and indirect neighbors (inside DiskNbr(x, 2r) but outside DiskNbr(x, r)).

Definition 2: Density of DiskNbr(x, r) --- Dens(x, r)
Dens(x, r) = |DiskNbr(x, r)| / r^dim, where dim is the number of dimensions.
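Definitions 1 and 2 can be sketched as a naive O(n) scan (illustrative function names; the paper computes these vertically with P-trees, not by scanning):

```python
import math

def disk_nbr(X, x, r):
    """DiskNbr(x, r): all points of X within distance r of x."""
    return [p for p in X if math.dist(x, p) <= r]

def dens(X, x, r):
    """Dens(x, r): neighbor count divided by r^dim."""
    dim = len(x)
    return len(disk_nbr(X, x, r)) / (r ** dim)
```
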

Definitions (Continued)

Definition 3: Relative Density Factor (RDF) of point x with radius r --- RDF(x, r)
RDF is used to measure outlierness: outliers are points with high RDF values.

Special case: RDF between the ring {DiskNbr(x, 2r) − DiskNbr(x, r)} and DiskNbr(x, r), i.e. the ratio of the densities of the two regions.
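The RDF formula itself is an image in the original slide. Reading it from the special case above and from the pruning slide (which compares the ring {DiskNbr(x, 2kr) − DiskNbr(x, kr)} against DiskNbr(x, kr)), a plausible sketch is the ring-to-disk density ratio; treat this as a reconstruction, not the paper's exact formula:

```python
import math

def rdf(X, x, r):
    """Sketch of RDF(x, r): density of the ring DiskNbr(x, 2r) - DiskNbr(x, r)
    over the density of DiskNbr(x, r), with densities as in Definition 2
    (count over a volume proportional to r^dim)."""
    dim = len(x)
    inner = [p for p in X if math.dist(x, p) <= r]
    outer = [p for p in X if math.dist(x, p) <= 2 * r]
    inner_dens = len(inner) / r ** dim
    ring_dens = (len(outer) - len(inner)) / ((2 * r) ** dim - r ** dim)
    return ring_dens / inner_dens
```

A point sitting in a sparse spot surrounded by a denser ring gets RDF well above 1, matching the slide's "outliers are points with high RDF values."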

The Proposed Outlier Detection Method

Given a dataset X, the method alternates between two procedures: "Find Outliers" and "Prune Non-outliers". It first prunes non-outliers (points deep inside clusters) efficiently, then finds outliers over the remaining small subset of the data, which consists of points on cluster boundaries and genuine outliers.

Finding Outliers

Three possible distributions with regard to RDF:
(a) 1/(1+ε) ≤ RDF ≤ (1+ε): prune all neighbors of x and call the "Pruning Non-outliers" procedure;
(b) RDF < 1/(1+ε): prune all direct neighbors of x and calculate RDF for each indirect neighbor;
(c) RDF > (1+ε): x is an outlier; prune the indirect neighbors of x.
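The three-way split can be written as a small dispatcher (the case labels mirror the slide; this is an illustration, not the paper's code):

```python
def classify_rdf(rdf_value, eps):
    """Map an RDF value to the three cases of the 'Finding Outliers' slide."""
    if 1 / (1 + eps) <= rdf_value <= 1 + eps:
        return "a"  # density roughly constant: prune all neighbors
    if rdf_value < 1 / (1 + eps):
        return "b"  # density drops: prune direct neighbors, re-check indirect ones
    return "c"      # RDF > 1+eps: x is an outlier, prune its indirect neighbors
```
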

Finding Outliers using P-Trees

P-tree based direct neighbors --- PDN_x^r
For a point x, let X = (x_1, x_2, …, x_n), or in bit form X = (x_{1,m-1} … x_{1,0}), (x_{2,m-1} … x_{2,0}), …, (x_{n,m-1} … x_{n,0}), where x_{i,j} is the j-th bit value of the i-th attribute. For the i-th attribute,
PDN_{x_i}^r = P_{x′ > x_i − r} AND P_{x′ ≤ x_i + r}.
Over all attributes, |DiskNbr(x, r)| = rc(PDN_x^r), where rc is the root count.

P-tree based indirect neighbors --- PIN_x^r
PIN_x^r = (OR_{q ∈ DiskNbr(x, r)} PDN_q^r) AND (PDN_x^r)′, where (·)′ denotes the complement P-tree.

Pruning is done by P-tree ANDing according to the three distributions above:
(a), (c): PU = PU AND (PDN_x^r)′ AND (PIN_x^r)′
(b): PU = PU AND (PDN_x^r)′
where PU is a P-tree representing the unprocessed data points.
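A minimal stand-in for the per-attribute interval predicate, using plain Python integers as uncompressed bit vectors in place of compressed P-trees (illustrative only; real P-trees gain their speed from compression and hierarchical root counts):

```python
def interval_bitmap(column, lo, hi):
    """Bit j is set when lo < column[j] <= hi,
    mirroring P_{x' > x_i - r} AND P_{x' <= x_i + r}."""
    bm = 0
    for j, v in enumerate(column):
        if lo < v <= hi:
            bm |= 1 << j
    return bm

def direct_nbr_bitmap(columns, x, r):
    """PDN_x^r: AND of the interval bitmaps over all attributes."""
    bm = interval_bitmap(columns[0], x[0] - r, x[0] + r)
    for i in range(1, len(columns)):
        bm &= interval_bitmap(columns[i], x[i] - r, x[i] + r)
    return bm

def rc(bm):
    """Root count: number of set bits, i.e. |DiskNbr(x, r)|."""
    return bin(bm).count("1")
```
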

Pruning Non-outliers

The pruning is a neighborhood-expanding process: it calculates RDF between {DiskNbr(x, 2kr) − DiskNbr(x, kr)} and DiskNbr(x, kr), where k is an integer, and prunes based on the value of RDF:
- 1/(1+ε) ≤ RDF ≤ (1+ε) (density stays constant): continue expanding the neighborhood by doubling the radius;
- RDF < 1/(1+ε) (significant decrease of density): stop expanding, prune DiskNbr(x, kr), and call the "Finding Outliers" procedure;
- RDF > (1+ε) (significant increase of density): stop expanding and call "Pruning Non-outliers" again.
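The expansion loop above can be sketched as follows, with the ring-to-disk density ratio inlined per the reading of Definitions 2 and 3 (a naive scan, not the P-tree version; the stop reasons and the cap on doublings are illustrative):

```python
import math

def expand_neighborhood(X, x, r, eps, max_doublings=16):
    """Double the radius while the ring-to-disk density ratio stays
    within [1/(1+eps), 1+eps]; report why expansion stopped."""
    dim = len(x)
    k = 1
    for _ in range(max_doublings):
        inner = sum(1 for p in X if math.dist(x, p) <= k * r)
        outer = sum(1 for p in X if math.dist(x, p) <= 2 * k * r)
        ring_dens = (outer - inner) / ((2 * k * r) ** dim - (k * r) ** dim)
        v = ring_dens / (inner / (k * r) ** dim)
        if v < 1 / (1 + eps):
            return ("prune_disk", k)   # density drops: prune DiskNbr(x, kr)
        if v > 1 + eps:
            return ("recurse", k)      # density jumps: prune non-outliers again
        k *= 2                         # density constant: keep expanding
    return ("exhausted", k)
```
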

Pruning Non-outliers Using P-Trees

We define ξ-neighbors: the neighbors within ξ bits of dissimilarity from x (e.g. ξ = 1 if x is an 8-bit value). For a point x, let X = (x_1, x_2, …, x_n), or in bit form X = (x_{1,m} … x_{1,0}), (x_{2,m} … x_{2,0}), …, (x_{n,m} … x_{n,0}), where x_{i,j} is the j-th bit value of the i-th attribute. For the i-th attribute, the ξ-neighbors of x are computed by bitwise P-tree operations on the attribute's bit slices. The pruning is accomplished by PU = PU AND (P_X^ξ)′, where (P_X^ξ)′ is the complement of P_X^ξ, the P-tree of the ξ-neighborhood of x.
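The ξ-neighbor formula itself is an image in the original deck. One plausible single-attribute reading, values that agree with x on all but the ξ low-order bits, can be sketched as follows (a speculative reconstruction, not the paper's exact definition):

```python
def xi_neighbors(column, x, xi, bits=8):
    """Indices of values agreeing with x on the (bits - xi) high-order
    bits, i.e. within xi low-order bits of dissimilarity from x."""
    mask = ((1 << bits) - 1) & ~((1 << xi) - 1)  # keep high-order bits only
    return [j for j, v in enumerate(column) if (v & mask) == (x & mask)]
```
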

RDF-based Outlier Detection Process

Algorithm: RDF-based Outlier Detection using P-Trees
Input: dataset X, radius r, distribution parameter ε.
Output: an outlier set Ols.
// PU — unprocessed points represented by P-Trees; |PU| — number of points in PU
// PO — outliers
// Build up P-Trees for dataset X
PU ← createP-Trees(X);
i ← 1;
WHILE |PU| > 0 DO
    x ← PU.first;              // pick an arbitrary point x
    PO ← FindOutliers(x, r, ε);
    i ← i + 1
ENDWHILE

"Find Outliers" and "Prune Non-Outliers" Procedures

Experimental Study

- Data: NHL data set (1996).
- Compared with LOF (Local Outlier Factor method) and aLOCI (approximate Local Correlation Integral method).
- Run-time and scalability comparison: starting from 16,384 points, our method outperforms both in terms of scalability and speed.

References
1. V. Barnett, T. Lewis, "Outliers in Statistical Data", John Wiley & Sons.
2. E. M. Knorr and R. T. Ng, "A Unified Notion of Outliers: Properties and Computation", Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD), 1997.
3. E. M. Knorr and R. T. Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets", Proc. VLDB, 1998.
4. E. M. Knorr and R. T. Ng, "Finding Intentional Knowledge of Distance-Based Outliers", Proc. VLDB, 1999.
5. S. Ramaswamy, R. Rastogi, K. Shim, "Efficient Algorithms for Mining Outliers from Large Data Sets", Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000.
6. M. M. Breunig, H.-P. Kriegel, R. T. Ng, J. Sander, "LOF: Identifying Density-based Local Outliers", Proc. ACM SIGMOD Int. Conf. on Management of Data, Dallas, TX, 2000.
7. S. Papadimitriou, H. Kitagawa, P. B. Gibbons, C. Faloutsos, "LOCI: Fast Outlier Detection Using the Local Correlation Integral", Proc. 19th Int. Conf. on Data Engineering (ICDE), Bangalore, India, March 2003.
8. A. K. Jain, M. N. Murty, P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, 31(3), 1999.
9. A. Arning, R. Agrawal, P. Raghavan, "A Linear Method for Deviation Detection in Large Databases", Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD), 1996.
10. S. Sarawagi, R. Agrawal, N. Megiddo, "Discovery-Driven Exploration of OLAP Data Cubes", Proc. EDBT, 1998.
11. Q. Ding, M. Khan, A. Roy, W. Perrizo, "The P-tree Algebra", Proc. ACM Symposium on Applied Computing (SAC), 2002.
12. W. Perrizo, "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.
13. M. Khan, Q. Ding, W. Perrizo, "k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees", Proc. PAKDD, Springer-Verlag LNAI 2776, 2002.
14. B. Wang, F. Pan, Y. Cui, W. Perrizo, "Efficient Quantitative Frequent Pattern Mining Using Predicate Trees", CAINE.
15. F. Pan, B. Wang, Y. Zhang, D. Ren, X. Hu, W. Perrizo, "Efficient Density Clustering for Spatial Data", Proc. PKDD, 2003.

Thank you!

Determination of Parameters

Determination of r: Breunig et al. [6] show that a moderate MinPts value works well in general (the MinPts-neighborhood). Choosing MinPts = 20, we compute the average radius of the 20-neighborhood, r_average. In our algorithm, r = r_average = 0.5.

Determination of ε: the selection of ε is a tradeoff between accuracy and speed. The larger ε is, the faster the algorithm works; the smaller ε is, the more accurate the results are. We chose ε = 0.8 experimentally and obtained the same result (the same outliers) as Breunig's method, but much faster. The results shown in the experimental section are based on ε = 0.8.
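The r_average computation described above can be sketched naively (an O(n²) scan; illustrative function name, not the P-tree implementation):

```python
import math

def avg_knn_radius(X, k=20):
    """Average distance to the k-th nearest neighbor over all points,
    used as r = r_average on this slide."""
    total = 0.0
    for i, x in enumerate(X):
        dists = sorted(math.dist(x, X[j]) for j in range(len(X)) if j != i)
        total += dists[k - 1]
    return total / len(X)
```
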