RDF: A Density-based Outlier Detection Method Using Vertical Data Representation Dongmei Ren, Baoying Wang, William Perrizo North Dakota State University, U.S.A

Introduction  Related Work Breunig et al. [6] proposed a density-based approach to mining outliers over datasets with different densities. Papadimitriou & Kiragawa [7] introduce local correlation integral (LOCI).  Contributions in this paper 1. a relative density factor (RDF)  RDF expresses density information similar to LOF (local outlier factor)[6] and MDEF (multi-granularity deviation factor)[7]  RDF is easier to compute using vertical data 2. RDF-based outlier detection method  efficiently prunes the data points which are deep in clusters  detects outliers within the remaining small subset of the data; 3. vertical data representation is used (Predicate-trees = P-trees)

Definitions

[Figure: a point x with its direct disk neighborhood and the indirect neighborhoods centered at its neighbors]

Definition 1: Disk Neighborhood --- DiskNbr(x,r). Given a point x and radius r, the disk neighborhood of x is defined as the set DiskNbr(x,r) = {x' ∈ X | d(x,x') ≤ r}, where d(x,x') is the distance between x and x'. An indirect disk neighborhood of x is a disk neighborhood centered at any point in DiskNbr(x,r).

Definition 2: Density of DiskNbr(x,r) --- Density(x,r) = |DiskNbr(x,r)| / r^dim, where dim is the number of dimensions.
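
The two definitions translate directly into code. A minimal sketch in Python with NumPy, assuming Euclidean distance and a points-as-rows array; the function names are illustrative, not from the paper:

```python
import numpy as np

def disk_nbr(X, x, r):
    """Indices of points within distance r of x (Definition 1)."""
    return np.where(np.linalg.norm(X - x, axis=1) <= r)[0]

def density(X, x, r):
    """Density(x, r) = |DiskNbr(x, r)| / r^dim (Definition 2)."""
    return len(disk_nbr(X, x, r)) / r ** X.shape[1]
```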

Definitions (Continued)

Definition 3: Relative Density Factor --- RDF(x,r), for a point x and radius r: the ratio of the density of the expanded-neighborhood ring {DiskNbr(x,2r) − DiskNbr(x,r)} to the density of DiskNbr(x,r).

RDF(x,r) close to 1 means x is a deep-cluster point. RDF(x,r) close to 0 means x is a borderline cluster point. RDF(x,r) much larger than 1 means x is an outlier.
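
A hedged sketch of RDF, reusing disk_nbr from the previous slide's example. The ring-volume normalization (2r)^dim − r^dim is an assumption about the lost formula, chosen to make ring and disk densities comparable:

```python
def rdf(X, x, r):
    """RDF(x, r): density of the ring DiskNbr(x, 2r) - DiskNbr(x, r)
    over the density of DiskNbr(x, r)."""
    dim = X.shape[1]
    disk = len(disk_nbr(X, x, r))              # x itself is included, so disk >= 1
    ring = len(disk_nbr(X, x, 2 * r)) - disk
    ring_density = ring / ((2 * r) ** dim - r ** dim)
    disk_density = disk / r ** dim
    return ring_density / disk_density
```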

The Outlier Detection Method

Given a dataset X, a radius r and a threshold ε, let R be the RemainingSet of points yet to consider (initially R = X) and let O be the set of outliers identified so far (initially empty).
1. Pick any x ∈ R.
2. Decide whether DiskNbr(x,r) are outliers or not, by:
   If 1+ε < RDF(x,r): O := O ∪ DiskNbr(x,r) (outliers) and R := R − DiskNbr(x,r).
   If RDF(x,r) < 1/(1+ε): R := R − DiskNbr(x,r) (borderline cluster points).
   If 1/(1+ε) ≤ RDF(x,r) ≤ 1+ε (deep cluster points): before updating R, double r while 1/(1+ε) < RDF(x, 2^n·r) < 1+ε holds, then increment r while 1/(1+ε) < RDF(x, 2^n·r + m) < 1+ε holds, then R := R − DiskNbr(x, 2^n·r + m).
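
A sketch of this three-way decision, reusing the rdf and disk_nbr helpers above. The deep-cluster case keeps doubling r while the density stays roughly constant; the finer increment-by-m refinement is elided, and max_doublings is an added safety bound:

```python
def classify_neighborhood(X, x, r, eps, max_doublings=10):
    """Decide whether DiskNbr(x, r) holds outliers, borderline points,
    or deep-cluster points (expanding r in the deep-cluster case)."""
    f = rdf(X, x, r)
    if f > 1 + eps:                        # ring much denser: x sits in a valley
        return "outliers", disk_nbr(X, x, r)
    if f < 1 / (1 + eps):                  # ring much sparser: borderline points
        return "borderline", disk_nbr(X, x, r)
    # Deep-cluster case: double r while RDF stays within the band.
    for _ in range(max_doublings):
        if not (1 / (1 + eps) < rdf(X, x, 2 * r) < 1 + eps):
            break
        r *= 2
    return "deep", disk_nbr(X, x, r)
```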

The Outlier Detection Method

Finding Outliers

(a) 1/(1+ε) ≤ RDF ≤ (1+ε): deep within clusters.
(b) RDF < 1/(1+ε): hill-tops and boundaries.
(c) RDF > (1+ε): valleys (outliers).

[Figure: density profile around a point x, showing cluster interiors, hill-tops/boundaries, and outlier valleys]

Finding Outliers using Predicate-Trees

P-tree based direct neighbors --- PDN_{x,r}: PDN_{x,r} = P_{x' > x−r} AND P_{x' ≤ x+r}; |DiskNbr(x,r)| = rc(PDN_{x,r}), where rc is the root count.

P-tree based indirect neighbors --- PIN_{x,r}: PIN_{x,r} = (OR_{q ∈ DiskNbr(x,r)} PDN_{q,r}) AND PDN'_{x,r}.

Pruning is done by P-tree ANDing based on the three distributions above: for cases (a) and (c), PU = PU AND PDN'_{x,r} AND PIN'_{x,r}; for case (b), PU = PU AND PDN'_{x,r}, where PU is a P-tree representing the unprocessed data points.
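
For illustration, these predicates can be emulated with flat Boolean masks over one attribute column (uncompressed, so none of the P-tree compression benefits apply; pdn and pin are illustrative names):

```python
import numpy as np

def pdn(col, x, r):
    """Direct-neighbor mask PDN_{x,r}: x - r < x' <= x + r.
    Its root count rc(PDN_{x,r}) is simply mask.sum()."""
    return (col > x - r) & (col <= x + r)

def pin(col, x, r):
    """Indirect-neighbor mask PIN_{x,r}: union of PDN_{q,r} over the
    direct neighbors q, AND the complement of PDN_{x,r}."""
    direct = pdn(col, x, r)
    union = np.zeros_like(direct)
    for q in col[direct]:                  # q ranges over DiskNbr(x, r)
        union |= pdn(col, q, r)
    return union & ~direct
```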

Pruning Non-outliers

1/(1+ε) ≤ RDF ≤ (1+ε) (density stays constant): continue expanding the neighborhood by doubling the radius.
RDF < 1/(1+ε) (significant decrease of density): stop expanding, prune DiskNbr(x,kr), and call the "Finding Outliers" process.
RDF > (1+ε) (significant increase of density): stop expanding and call "Pruning Non-outliers".

The pruning is a neighborhood-expanding process. It calculates the RDF between {DiskNbr(x,2kr) − DiskNbr(x,kr)} and DiskNbr(x,kr) and prunes based on the value of the RDF, where k is an integer.

Pruning Non-outliers Using P-Trees

We define ξ-neighbors: the neighbors with ξ bits of dissimilarity from x (e.g., ξ = 1 if x is an 8-bit value). For a point x, let x = (x_1, x_2, ..., x_n), with each attribute written in bits as x_i = (x_{i,m}, ..., x_{i,0}), where x_{i,j} is the j-th bit value of the i-th attribute. For the i-th attribute, the ξ-neighbors of x are calculated bitwise from these bit slices.

The pruning is accomplished by: PU = PU AND P'_{Xξ}, where P'_{Xξ} is the complement set of P_{Xξ}.
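
The exact bitwise formula was given as a figure, but one plausible reading (an assumption, not the paper's definition) is that ξ-neighbors agree with x on all but the ξ low-order bits of an attribute:

```python
import numpy as np

def xi_neighbors(col, x, xi, bits=8):
    """ASSUMED semantics: mask of values agreeing with x on the high
    (bits - xi) bit positions of this attribute."""
    keep = ((1 << bits) - 1) ^ ((1 << xi) - 1)   # e.g. bits=8, xi=1 -> 0b11111110
    return (col & keep) == (x & keep)

# Pruning step PU = PU AND P'_{X xi}:
#   pu &= ~xi_neighbors(col, x, xi)
```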

RDF-based Outlier Detection Process

Algorithm: RDF-based outlier detection using P-trees
Input: dataset X, radius r, distribution parameter ε.
Output: an outlier set Ols.
// PU --- unprocessed points represented by P-trees; |PU| --- number of points in PU
// PO --- outliers
// Build up P-trees for dataset X
PU ← createP-Trees(X);
i ← 1;
WHILE |PU| > 0 DO
  x ← PU.first;  // pick an arbitrary point x
  PO ← FindOutliers(x, r, ε);
  i ← i + 1
ENDWHILE
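
A hedged sketch of this driver loop, with a Boolean vector standing in for the PU P-tree and the classify_neighborhood sketch above standing in for FindOutliers:

```python
import numpy as np

def rdf_outlier_detection(X, r, eps):
    """Main loop: pick an arbitrary unprocessed point, classify its
    neighborhood, and prune every point handled in that step."""
    unprocessed = np.ones(len(X), dtype=bool)   # stands in for PU
    outliers = []                               # stands in for PO / Ols
    while unprocessed.any():
        i = int(np.argmax(unprocessed))         # first unprocessed point
        label, nbrs = classify_neighborhood(X, X[i], r, eps)
        if label == "outliers":
            outliers.extend(int(n) for n in nbrs)
        unprocessed[nbrs] = False               # PU = PU AND (pruned)'
        unprocessed[i] = False
    return outliers
```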

“Find Outliers” and “Prune Non-Outliers” Procedures

Experimental Study

NHL data set (1996). Compared with LOF and aLOCI (LOF: local outlier factor method; aLOCI: approximate local correlation integral method) on run time and scalability. Starting from 16,384 data points, our method outperforms both in terms of scalability and speed.

References
1. V. Barnett, T. Lewis, "Outliers in Statistical Data", John Wiley & Sons.
2. Knorr, Edwin M. and Raymond T. Ng, "A Unified Notion of Outliers: Properties and Computation", 3rd International Conference on Knowledge Discovery and Data Mining Proceedings, 1997.
3. Knorr, Edwin M. and Raymond T. Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets", Very Large Data Bases Conference Proceedings, 1998.
4. Knorr, Edwin M. and Raymond T. Ng, "Finding Intentional Knowledge of Distance-Based Outliers", Very Large Data Bases Conference Proceedings, 1999.
5. Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim, "Efficient Algorithms for Mining Outliers from Large Data Sets", Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000.
6. Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jörg Sander, "LOF: Identifying Density-Based Local Outliers", Proc. ACM SIGMOD 2000 Int. Conf. on Management of Data, Dallas, TX, 2000.
7. Spiros Papadimitriou, Hiroyuki Kitagawa, Phillip B. Gibbons, Christos Faloutsos, "LOCI: Fast Outlier Detection Using the Local Correlation Integral", 19th International Conference on Data Engineering, March 2003, Bangalore, India.
8. A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, 31(3):264-323, 1999.
9. Arning, Andreas, Rakesh Agrawal, and Prabhakar Raghavan, "A Linear Method for Deviation Detection in Large Databases", 2nd International Conference on Knowledge Discovery and Data Mining Proceedings, 1996.
10. S. Sarawagi, R. Agrawal, and N. Megiddo, "Discovery-Driven Exploration of OLAP Data Cubes", EDBT 1998.
11. Q. Ding, M. Khan, A. Roy, and W. Perrizo, "The P-tree Algebra", Proceedings of the ACM Symposium on Applied Computing (SAC), 2002.
12. W. Perrizo, "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.
13. M. Khan, Q. Ding and W. Perrizo, "k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees", Proc. of PAKDD 2002, Springer-Verlag LNAI 2776.
14. Wang, B., Pan, F., Cui, Y., and Perrizo, W., "Efficient Quantitative Frequent Pattern Mining Using Predicate Trees", CAINE.
15. Pan, F., Wang, B., Zhang, Y., Ren, D., Hu, X. and Perrizo, W., "Efficient Density Clustering for Spatial Data", PKDD 2003.

Thank you!

Determination of Parameters

Determination of r: Breunig et al. show a choice of MinPts that works well in general [6] (the MinPts-neighborhood). Choosing MinPts = 20, we compute the average radius of the 20-neighborhood, r_average. In our algorithm, r = r_average = 0.5.

Determination of ε: the selection of ε is a tradeoff between accuracy and speed. The larger ε is, the faster the algorithm works; the smaller ε is, the more accurate the results are. We chose ε = 0.8 experimentally and obtained the same result (same outliers) as Breunig's, but much faster. The results shown in the experimental part are based on ε = 0.8.

Vertical Data Structures History

In the 1980s, vertical data structures were proposed for record-based workloads: the Decomposition Storage Model (DSM, Copeland et al), the Attribute Transposed File (ATF), the Bit Transposed File (BTF, Wang et al), Viper, and the Band Sequential format (BSQ) for remotely sensed imagery. The DSM and BTF initiatives have disappeared. Why? (next slide)

Vertical auxiliary and system structures: Domain & Request Vectors (DVA/ROLL/ROCC, Perrizo, Shi, et al) are vertical system structures for query optimization and synchronization. Bit-Mapped Indexes (BMIs, very popular in data warehouses); all indexes are really vertical auxiliary structures. BMIs use bit maps (a positional approach to identifying records); other indexes use RID lists (a keyword or value approach).

Predicate tree technology: vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice into a basic Ptree, and horizontally AND the basic Ptrees to answer queries. (Current practice: structure data into horizontal records and process vertically, i.e., scans.)

E.g., a relation R(A1, A2, A3, A4) of horizontally structured records is scanned vertically; each attribute value (base 10) is projected into its bit slices (base 2): R11, R12, R13, R21, ..., R43.

Top-down construction of the 1-dimensional Ptree representation of R11, denoted P11, is built by recording the truth of the universal predicate "pure 1" in a tree recursively on halves (1/2^1 subsets), until purity is achieved: Whole slice pure-1? false → 0. Left half pure-1? false → 0. Right half pure-1? false → 0. ... A branch ends as soon as it is pure (pure-0 or pure-1).
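
A minimal sketch of this top-down construction on a 1-dimensional bit slice, in plain Python; pure halves collapse to 0/1 leaves, mixed halves become two-child nodes:

```python
def build_ptree(bits):
    """Top-down basic P-tree of a bit slice: 1/0 leaves for pure halves,
    otherwise recurse on the two halves."""
    if all(b == 1 for b in bits):
        return 1                      # pure-1 leaf
    if all(b == 0 for b in bits):
        return 0                      # pure-0 leaf
    half = len(bits) // 2
    return [build_ptree(bits[:half]), build_ptree(bits[half:])]

print(build_ptree([0, 0, 0, 1, 1, 1, 1, 1]))   # -> [[0, [0, 1]], 1]
```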

To count occurrences of the tuple (7,0,1,4) = (111, 000, 001, 100) in binary, AND the basic Ptrees for the 1-bits and their complements for the 0-bits:
P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43.
During the AND, a single pure-0 operand node makes an entire branch 0. In this example the result tree has its only 1-bit at the level with purity factor 2^1, so the 1-count (the number of occurrences of (7,0,1,4) in R) is 1 × 2^1 = 2.
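
The same count can be reproduced on uncompressed bit slices (NumPy, a toy 3-bit relation; real P-trees run these ANDs on compressed trees instead):

```python
import numpy as np

R = np.array([[7, 0, 1, 4],
              [5, 2, 7, 4],
              [7, 0, 1, 4]])               # toy relation R(A1, A2, A3, A4)

def bit_slice(col, j):
    """Basic P-tree P_ij as a flat bit vector: the j-th bit of attribute i."""
    return (col >> j) & 1

target = (7, 0, 1, 4)                      # = (111, 000, 001, 100) in binary
mask = np.ones(len(R), dtype=int)
for i, v in enumerate(target):
    for j in range(3):                     # 3 bit positions per attribute
        s = bit_slice(R[:, i], j)
        mask &= s if (v >> j) & 1 else 1 - s   # P_ij or its complement P'_ij
print(mask.sum())                          # root count = 2 occurrences
```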

Top-down construction of basic P-trees is best for understanding, but bottom-up is much more efficient. Bottom-up construction of the 1-dimensional P-tree P11 is done using an in-order tree traversal and the collapsing of pure siblings, as the bit slice R11 is scanned.

2-Dimensional P-trees: the natural choice for, e.g., image files. For images, any ordering of pixels will work (raster, diagonalized, Peano, Hilbert, Jordan), but the space-filling "Peano" ordering has advantages for fast processing, yet compresses well in the presence of spatial continuity. For an image bit-file (e.g., the high-order bit of the red band of an image file) in spatial raster order, top-down construction of its 2-dimensional P-tree records the truth of the universal predicate "pure 1" in a fanout=4 tree recursively on quarters, until purity is achieved.
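
The "Peano" ordering used here is the recursive-quadrant (Z-order/Morton) curve: interleave the bits of a pixel's row and column coordinates, most significant bits first. A small sketch (an emulation, not code from the paper):

```python
def peano_order(row, col, bits=3):
    """Z-order (Morton) position of pixel (row, col) in a 2^bits x 2^bits
    image: interleave row and column bits, MSB first."""
    z = 0
    for j in reversed(range(bits)):
        z = (z << 2) | (((row >> j) & 1) << 1) | ((col >> j) & 1)
    return z

print(peano_order(7, 1))   # -> 43 (0b101011): pixel (7,1) of an 8x8 image
```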

From here on we will take 4 bit positions at a time, for efficiency. Bottom-up construction of the 2-dimensional P-tree is done using an in-order traversal of a fanout=4 tree (log4(64) = 3 levels of fanout) and the collapsing of pure siblings.

Some aspects of 2-D P-trees:
Fan-out = 2^dimension = 2^2 = 4. Tree levels (going down): 3, 2, 1, 0, with purity factors 4^3, 4^2, 4^1, 4^0 respectively.
Node ID (NID): e.g., node (7, 1) = (111, 001).
ROOT-COUNT = Σ (level-sum × level-purity-factor), e.g., Root Count = 7 × 4^0 + 4 × 4^1 + 2 × 4^2 = 7 + 16 + 32 = 55.

3-Dimensional Ptrees

Ptree dimension is a user parameter and can be chosen to fit the data; the default is 1-D Ptrees (recursive halving); images → 2-D Ptrees (recursive quartering); 3-D solids → 3-D Ptrees (recursive eighth-ing). Or the dimension can be chosen based on other considerations (to optimize compression, increase processing speed, ...).

Logical operations on Ptrees (used to get counts of any pattern): Ptree AND is faster than bit-by-bit AND since any pure-0 operand node means the result node is pure-0, e.g., only quadrant 2 needs to be loaded to AND Ptree1, Ptree2, etc. The more operands there are in the AND, the greater the benefit of this shortcut (more pure-0 nodes).

Using logical operators on the basic P-trees (predicate = universal predicate "purely 1-bits"), one can construct, for any domain: constant-P-trees (predicate: "value = const"), range-P-trees (predicate: "value ∈ range"), and interval-P-trees (predicate: "value ∈ interval"). In fact, there is a domain P-tree for every predicate defined on it. ANDing domain-predicate P-trees yields tuple-predicate P-trees, e.g., rectangle-P-trees (predicate: "tuple ∈ rectangle"). The next slide shows some of these constructions.
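
A sketch of the pure-0 shortcut on the nested-list P-trees from the construction sketch earlier (0/1 are pure leaves, two-element lists are mixed nodes):

```python
def ptree_and(a, b):
    """AND two P-trees with the pure-0 shortcut: a pure-0 operand makes
    the result pure-0 without descending into the other operand."""
    if a == 0 or b == 0:
        return 0                      # shortcut: skip the other subtree entirely
    if a == 1:
        return b                      # pure-1 is the AND identity
    if b == 1:
        return a
    left = ptree_and(a[0], b[0])
    right = ptree_and(a[1], b[1])
    return 0 if left == 0 and right == 0 else [left, right]
```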

Basic, Value and Tuple Ptrees

Basic Ptrees for a 7-column, 8-bit table (target attribute, target bit position): e.g., P11, P12, ..., P18, P21, ..., P28, ..., P71, ..., P78.

Value Ptrees (predicate: quad is purely the target value in the target attribute): e.g., P1,5 = P1,101 = P11 AND P'12 AND P13.

Tuple Ptrees (predicate: quad is purely the target tuple): e.g., P(1,2,3) = P(001,010,111) = P1,001 AND P2,010 AND P3,111.

Rectangle Ptrees (predicate: quad is purely in the target rectangle, a product of intervals): e.g., P([1,3], -, [0,2]) = (P1,1 OR P1,2 OR P1,3) AND (P3,0 OR P3,1 OR P3,2).

Horizontal Processing of Vertical Structures for Record-based Workloads

For record-based workloads (where the result is a set of records), changing the horizontal record structure and then having to reconstruct it may introduce too much post-processing. For data mining workloads, the result is often a bit (yes/no, true/false) or another unstructured result, where there is no reconstructive post-processing.

But even for some standard SQL queries, vertical data may be faster (evaluating when this is true would be an excellent research project). For example, the SQL query:
SELECT COUNT(*) FROM purchases WHERE price ≥ $4,000 AND 1000 ≥ sales ≥ 500.
The answer is the root count of the P-tree resulting from ANDing the price-interval-P-tree, P_price∈[4000,∞), and the sales-interval-P-tree, P_sales∈[500,1000].
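
The same query on flat NumPy columns, with toy data standing in for the purchases table; the interval P-trees become Boolean masks and the root count becomes a sum:

```python
import numpy as np

price = np.array([3500, 4200, 8000, 4100, 900])
sales = np.array([ 600,  450,  750, 1000, 520])

p_price = price >= 4000                        # P_{price in [4000, inf)}
p_sales = (sales >= 500) & (sales <= 1000)     # P_{sales in [500, 1000]}
print((p_price & p_sales).sum())               # root count = COUNT(*) -> 2
```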

Architecture for the DataMIME™ System (DataMIME™ = data mining, no noise) (PDMS = P-tree Data Mining System)

Internet → DII (Data Integration Interface) with the Data Integration Language (DIL) → Data Repository: a lossless, compressed, distributed, vertically-structured database → DMI (Data Mining Interface) with the Ptree (Predicates) Query Language (PQL).

Generalized Raster and Peano Sorting: generalizes to any table with numeric attributes (not just images). Raster sorting: attributes 1st, bit position 2nd. Peano sorting: bit position 1st, attributes 2nd.

[Figure: an unsorted relation, shown in decimal and binary, under each sort order]

Generalized Peano Sorting: KNN speed improvement, using 5 UCI Machine Learning Repository data sets (adult, spam, mushroom, function, crop).

[Chart: run time in seconds per data set for unsorted, generalized raster, and generalized Peano orderings]