1 RDF: A Density-based Outlier Detection Method Using Vertical Data Representation Dongmei Ren, Baoying Wang, William Perrizo North Dakota State University, U.S.A

2 Introduction  Related Work: Breunig et al. [6] proposed a density-based approach to mining outliers over datasets with different densities; Papadimitriou & Kitagawa [7] introduced the local correlation integral (LOCI).  Contributions of this paper: 1. A relative density factor (RDF): RDF expresses density information similar to LOF (local outlier factor) [6] and MDEF (multi-granularity deviation factor) [7], but is easier to compute using vertical data. 2. An RDF-based outlier detection method: it efficiently prunes the data points that lie deep in clusters, then detects outliers within the remaining small subset of the data. 3. Vertical data representation (Predicate-trees = P-trees).

3 Definitions Definition 1: Disk Neighborhood --- DiskNbr(x,r)  Given a point x and a radius r, the disk neighborhood of x is defined as the set DiskNbr(x,r) = {x′ ∈ X | d(x,x′) ≤ r}, where d(x,x′) is the distance between x and x′.  An indirect disk neighborhood of x is a disk neighborhood centered at any point in DiskNbr(x,r); its members are the indirect neighbors of x. Definition 2: Density of DiskNbr(x,r) --- Density(x,r) = |DiskNbr(x,r)| / r^dim, where dim is the number of dimensions.

4 Definitions (Continued) Definition 3: Relative Density Factor for a point x and radius r --- RDF(x,r): the ratio of the density of the expanded-neighborhood ring {DiskNbr(x,2r) − DiskNbr(x,r)} to the density of DiskNbr(x,r) (cf. slide 9):
RDF(x,r) = Density({DiskNbr(x,2r) − DiskNbr(x,r)}) / Density(DiskNbr(x,r))
RDF(x,r) close to 1 means x is a deep-cluster point; RDF(x,r) close to 0 means x is a borderline cluster point; RDF(x,r) much larger than 1 means x is an outlier.
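To make Definitions 2 and 3 concrete, here is a minimal brute-force sketch in Python/NumPy. It assumes RDF is the ring-to-disk density ratio described above (and used in the expansion step of slide 9); the helper names are ours, and the paper computes these counts via P-trees rather than distance scans.

    import numpy as np

    def disk_nbr(X, x, r):
        # Indices of points within distance r of x (brute force, in place of P-trees).
        return np.where(np.linalg.norm(X - x, axis=1) <= r)[0]

    def rdf(X, x, r):
        # Density(x,r) = |DiskNbr(x,r)| / r^dim (Definition 2); the ring
        # {DiskNbr(x,2r) - DiskNbr(x,r)} gets the analogous (2r)^dim - r^dim divisor.
        dim = X.shape[1]
        inner = len(disk_nbr(X, x, r))            # always >= 1 (x itself)
        outer = len(disk_nbr(X, x, 2 * r))
        ring_density = (outer - inner) / ((2 * r) ** dim - r ** dim)
        return ring_density / (inner / r ** dim)

    def classify(rdf_val, eps):
        # The three cases of slides 5 and 7.
        if rdf_val > 1 + eps:
            return "outlier"
        if rdf_val < 1 / (1 + eps):
            return "borderline"
        return "deep-cluster"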

5 The Outlier Detection Method Given a dataset X, a radius r, and a threshold ε, let R be the RemainingSet of points yet to consider (initially R = X) and let O be the set of outliers identified so far (initially empty). 1. Pick any x ∈ R. 2. Decide whether DiskNbr(x,r) are outliers or not, by: If RDF(x,r) > 1+ε, then O := O ∪ DiskNbr(x,r) (outliers) and R := R − DiskNbr(x,r). If RDF(x,r) < 1/(1+ε), then R := R − DiskNbr(x,r) (borderline cluster points). If 1/(1+ε) ≤ RDF(x,r) ≤ 1+ε (deep cluster points): before updating R, double the radius while 1/(1+ε) < RDF(x,2^n·r) < 1+ε holds, then increment it while 1/(1+ε) < RDF(x,2^n·r+m) < 1+ε holds; finally R := R − DiskNbr(x,2^n·r+m). (A sketch of this loop appears below.)
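A minimal sketch of the loop, reusing disk_nbr() and rdf() from the sketch above; plain Python sets stand in for the P-tree set operations, and the fine-grained "increment r by m" refinement is omitted.

    def detect_outliers(X, r0, eps):
        remaining = set(range(len(X)))            # R, the RemainingSet
        outliers = set()                          # O
        while remaining:
            i = next(iter(remaining))             # pick any x in R
            r = r0
            val = rdf(X, X[i], r)
            if val > 1 + eps:                     # valley: neighborhood is outliers
                nbrs = set(disk_nbr(X, X[i], r)) & remaining
                outliers |= nbrs
                remaining -= nbrs
            elif val < 1 / (1 + eps):             # borderline: prune, not outliers
                remaining -= set(disk_nbr(X, X[i], r))
            else:                                 # deep in a cluster: expand, then prune
                while 1 / (1 + eps) < rdf(X, X[i], 2 * r) < 1 + eps:
                    r *= 2
                remaining -= set(disk_nbr(X, X[i], r))
        return outliers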

6 The Outlier Detection Method

7 Finding Outliers Three cases, by the value of RDF (pictured as the density landscape around x: cluster interiors, hill-tops/boundaries, and valleys): (a) 1/(1+ε) ≤ RDF ≤ (1+ε): deep within clusters; (b) RDF < 1/(1+ε): hill-tops and boundaries; (c) RDF > (1+ε): valleys (outliers).

8 Finding Outliers using Predicate-Trees  P-tree based direct neighbors --- PDN_{x,r} = P_{x′ > x−r} AND P_{x′ ≤ x+r}; then |DiskNbr(x,r)| = rc(PDN_{x,r}), the root count.  P-tree based indirect neighbors --- PIN_{x,r} = (OR_{q ∈ DiskNbr(x,r)} PDN_{q,r}) AND PDN′_{x,r}  Pruning is done by P-tree ANDing, based on the three distributions (a)-(c) above: (a),(c): PU = PU AND PDN′_{x,r} AND PIN′_{x,r}; (b): PU = PU AND PDN′_{x,r}; where PU is a P-tree representing the unprocessed data points.
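A sketch of these predicates on one numeric attribute, using plain Python integers as flat, uncompressed bitmaps in place of P-trees; the function names are ours.

    def pdn(values, x, r):
        # Direct-neighbor mask: bit i is set iff x - r <= values[i] <= x + r,
        # i.e., the AND of the two range predicates defining PDN_{x,r}.
        mask = 0
        for i, v in enumerate(values):
            if x - r <= v <= x + r:
                mask |= 1 << i
        return mask

    def rc(mask):
        # Root count = number of 1-bits.
        return bin(mask).count("1")

    def pin(values, x, r):
        # Indirect neighbors: OR of the neighbors' PDNs, ANDed with PDN'_{x,r}.
        direct = pdn(values, x, r)
        union = 0
        for i, q in enumerate(values):
            if direct >> i & 1:
                union |= pdn(values, q, r)
        full = (1 << len(values)) - 1
        return union & ~direct & full

    # Pruning, cases (a)/(c) and (b), on an unprocessed-points mask pu:
    # pu &= ~pdn(values, x, r) & ~pin(values, x, r) & full    # (a),(c)
    # pu &= ~pdn(values, x, r) & full                         # (b)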

9 Pruning Non-outliers The pruning is a neighborhood-expanding process: it computes the RDF between {DiskNbr(x,2kr) − DiskNbr(x,kr)} and DiskNbr(x,kr), where k is an integer, and prunes based on the value of RDF:  1/(1+ε) ≤ RDF ≤ (1+ε) (density stays constant): continue expanding the neighborhood by doubling the radius;  RDF < 1/(1+ε) (significant decrease in density): stop expanding, prune DiskNbr(x,kr), and call the "Finding Outliers" process;  RDF > (1+ε) (significant increase in density): stop expanding and call "Pruning Non-outliers".

10 Pruning Non-outliers Using P-Trees  We define ξ-neighbors: the neighbors with ξ bits of dissimilarity from x, e.g. ξ = 1, 2, ..., 8 if x is an 8-bit value.  For a point x, let X = (x_1, x_2, ..., x_n), with bit representation X = ((x_{1,m}, ..., x_{1,0}), (x_{2,m}, ..., x_{2,0}), ..., (x_{n,m}, ..., x_{n,0})), where x_{i,j} is the j-th bit of the i-th attribute. For the i-th attribute, the ξ-neighbors of x are computed from these bit slices.  The pruning is accomplished by: PU = PU AND P′_{Xξ}, where P′_{Xξ} is the complement of P_{Xξ}, the P-tree of the ξ-neighborhood of X.
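A sketch of this pruning under a stated assumption: we read "ξ bits of dissimilarity" as agreement with x on all but the lowest ξ bits (a HOBBit-style neighborhood); both that reading and the flat bitmaps are ours, not taken verbatim from the paper.

    def xi_neighbors_mask(values, x, xi):
        # Bitmap of xi-neighbors: values matching x on all but the low xi bits
        # (assumed interpretation of "xi bits of dissimilarity").
        mask = 0
        for i, v in enumerate(values):
            if (v >> xi) == (x >> xi):
                mask |= 1 << i
        return mask

    def prune(pu, px_xi, n_points):
        # PU := PU AND P'_{X,xi} -- drop the xi-neighborhood from the unprocessed set.
        full = (1 << n_points) - 1
        return pu & ~px_xi & full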

11 RDF-based Outlier Detection Process
Algorithm: RDF-based outlier detection using P-trees
Input: dataset X, radius r, distribution parameter ε.
Output: an outlier set Ols.
// PU — unprocessed points, represented by P-trees; |PU| — number of points in PU
// PO — outliers
// Build up P-trees for dataset X
PU ← createP-Trees(X);
i ← 1;
WHILE |PU| > 0 DO
  x ← PU.first;  // pick an arbitrary point x
  PO ← FindOutliers(x, r, ε);
  i ← i + 1;
ENDWHILE

12 "Find Outliers" and "Prune Non-Outliers" Procedures

13 Experimental Study  NHL data set (1996)  Compared with LOF and aLOCI (LOF: the local outlier factor method; aLOCI: the approximate local correlation integral method)  Run-time comparison  Scalability comparison Starting from 16,384 points, our method outperforms both in terms of scalability and speed.

14 References 1. V. Barnett and T. Lewis, Outliers in Statistical Data, John Wiley & Sons. 2. E. M. Knorr and R. T. Ng, "A Unified Notion of Outliers: Properties and Computation", Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining, 1997, pp. 219-222. 3. E. M. Knorr and R. T. Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets", Proc. VLDB, 1998. 4. E. M. Knorr and R. T. Ng, "Finding Intensional Knowledge of Distance-Based Outliers", Proc. VLDB, 1999, pp. 211-222. 5. S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient Algorithms for Mining Outliers from Large Data Sets", Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000. 6. M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying Density-Based Local Outliers", Proc. ACM SIGMOD Int. Conf. on Management of Data, Dallas, TX, 2000. 7. S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos, "LOCI: Fast Outlier Detection Using the Local Correlation Integral", Proc. 19th Int. Conf. on Data Engineering (ICDE), Bangalore, India, 2003. 8. A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, 31(3):264-323, 1999. 9. A. Arning, R. Agrawal, and P. Raghavan, "A Linear Method for Deviation Detection in Large Databases", Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996, pp. 164-169. 10. S. Sarawagi, R. Agrawal, and N. Megiddo, "Discovery-Driven Exploration of OLAP Data Cubes", EDBT 1998. 11. Q. Ding, M. Khan, A. Roy, and W. Perrizo, "The P-tree Algebra", Proc. ACM Symposium on Applied Computing (SAC), 2002. 12. W. Perrizo, "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001. 13. M. Khan, Q. Ding, and W. Perrizo, "k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees", Proc. PAKDD 2002, Springer-Verlag LNAI 2776, 2002. 14. B. Wang, F. Pan, Y. Cui, and W. Perrizo, "Efficient Quantitative Frequent Pattern Mining Using Predicate Trees", CAINE 2003. 15. F. Pan, B. Wang, Y. Zhang, D. Ren, X. Hu, and W. Perrizo, "Efficient Density Clustering for Spatial Data", PKDD 2003.

15 Thank you!

16 Determination of Parameters  Determination of r: Breunig et al. [6] show that choosing MinPts = 10-30 works well in general (the MinPts-neighborhood). Choosing MinPts = 20, we compute the average radius of the 20-neighborhood, r_average. In our algorithm, r = r_average = 0.5.  Determination of ε: selecting ε is a tradeoff between accuracy and speed: the larger ε is, the faster the algorithm runs; the smaller ε is, the more accurate the results are. We chose ε = 0.8 experimentally and obtained the same result (the same outliers) as Breunig's, but much faster. The results shown in the experimental section are based on ε = 0.8.

17 Vertical Data Structures History  In the 1980s, vertical data structures were proposed for record-based workloads: the Decomposition Storage Model (DSM, Copeland et al.), the Attribute Transposed File (ATF), the Bit Transposed File (BTF, Wang et al.), Viper, and the Band Sequential format (BSQ) for remotely sensed imagery. The DSM and BTF initiatives have disappeared. Why? (next slide)  Vertical auxiliary and system structures: Domain & Request Vectors (DVA/ROLL/ROCC; Perrizo, Shi, et al.), vertical system structures used for query optimization and synchronization; Bit-Mapped Indexes (BMIs, very popular in data warehouses). All indexes are really vertical auxiliary structures: BMIs use bit maps (a positional approach to identifying records), while other indexes use RID lists (a keyword or value approach).

18 Predicate-tree technology: vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice into a basic P-tree; processing then ANDs these basic P-trees horizontally. (Current practice structures data into horizontal records and processes them vertically, by scans.)
Example relation R(A1, A2, A3, A4), with each tuple in base 10 and base 2: (2,7,6,1) = (010,111,110,001); (6,7,6,0) = (011,111,110,000); (2,7,5,1) = (010,111,101,001); (2,7,5,7) = (010,111,101,111); (5,2,1,4) = (101,010,001,100); (2,2,1,5) = (010,010,001,101); (7,0,1,4) = (111,000,001,100); (7,0,1,4) = (111,000,001,100).
Projecting each bit position of each attribute gives twelve bit slices R11, R12, R13, R21, ..., R43, where Rij is the j-th bit slice of attribute Ai; e.g., R11 (the high-order bit of A1) = 0 1 0 0 1 0 1 1.
Top-down construction of the 1-dimensional P-tree representation of R11, denoted P11, records the truth of the universal predicate "pure 1" in a tree, recursively on halves, until purity is achieved:
1. Is the whole slice pure-1? false → root is 0.
2. Is the left half (0100) pure-1? false → 0.
3. Is the right half (1011) pure-1? false → 0.
4. Is the left half of the right half (10) pure-1? false → 0.
5. Is the right half of the right half (11) pure-1? true → 1, and that branch ends.
6. Is the left half of the left of the right (1) pure-1? true → 1.
7. Is the right half of the left of the right (0) pure-1? false → 0; but it is pure (pure-0), so this branch ends too.
Descending into the left half the same way completes P11, and compressing the other bit slices yields the basic P-trees P12, ..., P43.
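A minimal Python sketch of this top-down construction: pure halves collapse to leaf bits 0 or 1, and impure nodes become (0, left, right) triples.

    def ptree(bits):
        # Record the universal "pure-1" predicate recursively on halves.
        if all(bits):
            return 1                  # pure-1 leaf
        if not any(bits):
            return 0                  # pure-0 leaf: branch ends here too
        half = len(bits) // 2
        return (0, ptree(bits[:half]), ptree(bits[half:]))

    # The high-order bit slice of A1 from the example relation:
    r11 = [0, 1, 0, 0, 1, 0, 1, 1]
    print(ptree(r11))     # (0, (0, (0, 0, 1), 0), (0, (0, 1, 0), 1))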

19 To count occurrences of the tuple (7,0,1,4) = (111,000,001,100), AND the basic P-trees matching its bit pattern, using the complement P′ wherever the pattern bit is 0:
P11 ∧ P12 ∧ P13 ∧ P′21 ∧ P′22 ∧ P′23 ∧ P′31 ∧ P′32 ∧ P33 ∧ P41 ∧ P′42 ∧ P′43
During the AND, a 0 node in any operand makes the corresponding result branch 0 (e.g., one 0 wipes out the entire left branch), so whole subtrees are skipped. In this example the result has its only 1-bit at the 2^1 level, so the 1-count (root count) = 1 × 2^1 = 2: the tuple (7,0,1,4) occurs twice in R.
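The same count can be reproduced with flat bitmaps standing in for the compressed P-trees; complementing a slice plays the role of P′.

    def count_tuple(R, target, bits=3):
        # AND the bit-slice masks (or their complements, for 0-bits of the target);
        # the popcount of the result is the root count.
        n = len(R)
        full = (1 << n) - 1
        result = full
        for a in range(len(target)):                  # each attribute
            for j in range(bits - 1, -1, -1):         # each bit position
                slice_mask = 0
                for i, row in enumerate(R):
                    if row[a] >> j & 1:
                        slice_mask |= 1 << i
                want = target[a] >> j & 1
                result &= slice_mask if want else ~slice_mask & full
        return bin(result).count("1")

    R = [(2,7,6,1), (6,7,6,0), (2,7,5,1), (2,7,5,7),
         (5,2,1,4), (2,2,1,5), (7,0,1,4), (7,0,1,4)]
    print(count_tuple(R, (7, 0, 1, 4)))   # 2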

20 Top-down construction of basic P-trees is best for understanding, but bottom-up construction is much more efficient. Bottom-up construction of the 1-dimensional P-tree P11 is done using an in-order tree traversal and the collapsing of pure siblings, working over the same bit slices R11, ..., R43 as above.

21 2-Dimensional P-trees: the natural choice for, e.g., image files. For images, any ordering of pixels will work (raster, diagonalized, Peano, Hilbert, Jordan), but the space-filling "Peano" ordering has advantages for fast processing, and it still compresses well in the presence of spatial continuity. For an image bit-file (e.g., the high-order bit of the red band of an image file): 1111110011111000111111001111111011110000111100001111000001110000, arranged as an 8×8 grid in spatial raster order. Top-down construction of its 2-dimensional P-tree records the truth of the universal predicate "pure 1" in a fanout = 4 tree, recursively on quarters, until purity is achieved.
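A sketch of the fanout-4 construction on that bit-file; the NW, NE, SW, SE child order used here is one conventional quadrant order, not necessarily the exact Peano order of the slides.

    def ptree2d(grid):
        # "Pure-1" predicate recursively on quarters (fanout = 4).
        flat = [b for row in grid for b in row]
        if all(flat):
            return 1
        if not any(flat):
            return 0
        h = len(grid) // 2
        quads = [[row[:h] for row in grid[:h]],   # NW
                 [row[h:] for row in grid[:h]],   # NE
                 [row[:h] for row in grid[h:]],   # SW
                 [row[h:] for row in grid[h:]]]   # SE
        return (0, [ptree2d(q) for q in quads])

    bits = "1111110011111000111111001111111011110000111100001111000001110000"
    grid = [[int(bits[8 * i + j]) for j in range(8)] for i in range(8)]
    tree = ptree2d(grid)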

22 Bottom-up construction of the 2-dimensional P-tree is done using an in-order traversal of a fanout = 4 tree (levels 3 down to 0, as on the next slide) and the collapsing of pure siblings. From here on we take 4 bit positions at a time, for efficiency.

23 Some aspects of 2-D P-trees: Fan-out = 2^dimension = 2^2 = 4. Tree levels (going down): 3, 2, 1, 0, with purity factors 4^3, 4^2, 4^1, 4^0 respectively. Node ID (NID) example: 2.2.3. A pixel address such as (7,1) = (111,001) interleaves to the quadrant address 10.10.11. ROOT COUNT = sum over levels of (level 1-count × level purity factor); here root count = 7×4^0 + 4×4^1 + 2×4^2 = 55.

24 3-Dimensional Ptrees

25 The P-tree dimension is a user parameter and can be chosen to fit the data; the default is 1-D P-trees (recursive halving); images → 2-D P-trees (recursive quartering); 3-D solids → 3-D P-trees (recursive eighth-ing). The dimension can also be chosen based on other considerations (to optimize compression, increase processing speed, ...). Logical operations on P-trees are used to get counts of any pattern. P-tree AND is faster than bit-by-bit AND since any pure-0 operand node makes the result node pure-0; e.g., only quadrant 2 needs to be loaded to AND Ptree1, Ptree2, etc. The more operands in the AND, the greater the benefit of this shortcut (more pure-0 nodes). Using logical operators on the basic P-trees (predicate = the universal predicate "purely 1-bits"), one can construct, for any domain: constant-P-trees (predicate: "value = const"), range-P-trees (predicate: "value ∈ range"), and interval-P-trees (predicate: "value ∈ interval"). In fact there is a domain P-tree for every predicate defined on the domain. ANDing domain-predicate P-trees gives tuple-predicate P-trees, e.g., a rectangle-P-tree (predicate: "tuple ∈ rectangle"). The next slide shows some of these constructions.

26 Basic, Value and Tuple P-trees Basic P-trees, for a 7-column, 8-bit table: P11, P12, ..., P18, P21, ..., P28, ..., P71, ..., P78 (indexed by target attribute and target bit position). Value P-trees (predicate: quadrant is purely the target value in the target attribute), e.g., P1,5 = P1,101 = P11 AND P′12 AND P13. Tuple P-trees (predicate: quadrant is purely the target tuple), e.g., P(1,2,3) = P(001,010,111) = P1,001 AND P2,010 AND P3,111. Rectangle P-trees (predicate: quadrant is purely in the target rectangle, a product of intervals), e.g., P([1,3], , [0,2]) = (P1,1 OR P1,2 OR P1,3) AND (P3,0 OR P3,1 OR P3,2).
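A sketch of building value and range P-trees from basic bit slices, again with flat bitmaps; the OR-over-all-values range construction is deliberately naive, for clarity.

    def basic_slices(column, bits):
        # One bitmap per bit position (high-order first): the basic P-trees P_{A,j}.
        return [sum((v >> j & 1) << i for i, v in enumerate(column))
                for j in range(bits - 1, -1, -1)]

    def value_ptree(slices, value, bits, n):
        # P_{A,v}: AND each slice or its complement, following v's bit pattern.
        full = (1 << n) - 1
        result = full
        for j, s in enumerate(slices):
            bit = value >> (bits - 1 - j) & 1
            result &= s if bit else ~s & full
        return result

    def range_ptree(slices, lo, hi, bits, n):
        # Range P-tree as the OR of the value P-trees for every value in [lo, hi].
        result = 0
        for v in range(lo, hi + 1):
            result |= value_ptree(slices, v, bits, n)
        return result

A rectangle P-tree is then just the AND of one range_ptree() per constrained attribute.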

27 Horizontal Processing of Vertical Structures for Record-based Workloads  For record-based workloads (where the result is a set of records), converting to the vertical structure and then having to reconstruct horizontal records may introduce too much post-processing.  For data mining workloads, the result is often a single bit (Yes/No, True/False) or another unstructured result, so there is no reconstructive post-processing.

28 But even for some standard SQL queries, vertical data may be faster (evaluating when this is true would be an excellent research project).  For example, the SQL query: SELECT COUNT(*) FROM purchases WHERE price ≥ $4,000.00 AND 500 ≤ sales ≤ 1000.  The answer is the root count of the P-tree resulting from ANDing the price-interval-P-tree P_price ∈ [4000,∞) and the sales-interval-P-tree P_sales ∈ [500,1000].
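Reusing basic_slices() and range_ptree() from the sketch above, the query becomes two range masks and an AND; the table values and the 13-bit width are illustrative assumptions, not data from the paper.

    # Hypothetical 'purchases' columns:
    prices = [3500, 4200, 5100, 3900, 4800]
    sales  = [ 450,  600, 1200,  800,  750]
    n, bits = len(prices), 13                  # 13 bits covers values up to 8191

    price_slices = basic_slices(prices, bits)
    sales_slices = basic_slices(sales, bits)

    mask = (range_ptree(price_slices, 4000, (1 << bits) - 1, bits, n) &
            range_ptree(sales_slices, 500, 1000, bits, n))
    print(bin(mask).count("1"))                # root count = 2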

29 Architecture for the DataMIME™ System (DataMIME™ = data mining, NO NOISE; PDMS = P-tree Data Mining System) From the architecture diagram: your data reaches the DII (Data Integration Interface) over the Internet via the Data Integration Language (DIL) and is stored in the Data Repository, a lossless, compressed, distributed, vertically-structured database; your data mining then goes through the DMI (Data Mining Interface) using the P-tree (Predicates) Query Language (PQL).

30 Generalized Raster and Peano Sorting: generalizes to any table with numeric attributes (not just images). Raster sorting orders by attributes first, bit position second; Peano sorting orders by bit position first, attributes second. (The slide shows an unsorted relation in decimal and binary under each ordering.)

31 Generalized Peano Sorting: KNN speed improvement, using 5 UCI Machine Learning Repository data sets (adult, spam, mushroom, function, crop). (The chart shows run time in seconds, 0-120, for unsorted, generalized raster, and generalized Peano orderings.)

