Vertical Set Square Distance: A Fast and Scalable Technique to Compute Total Variation in Large Datasets
Taufik Abidin, Amal Perera, Masum Serazi, William Perrizo


Vertical Set Square Distance: A Fast and Scalable Technique to Compute Total Variation in Large Datasets
Taufik Abidin, Amal Perera, Masum Serazi, William Perrizo
20th International Conference on Computers and Their Applications (CATA 2005), March 16-18, 2005, New Orleans, Louisiana

Outline
- Introduction
- Motivation
- Contribution
- P-tree Vertical Data Structure
- Vertical Set Square Distance (VSSD)
- Experimental Results
- Conclusion and Future Work

Introduction
- Many data mining tasks require determining the closeness (distance) of a point to another point, or to a set of points (average distance).
- For example:
- Distance-based clustering (e.g., k-means): distance determines the cluster in which a point should be placed.
- Nearest-neighbor or case-based classification: an unclassified point is assigned the majority class label of its nearest neighbor points, where "nearest" means smallest distance.

Introduction (cont.)
- One way to measure the distance from a point to a set of points is to examine the actual total separation of the set of points from the point in question.
- However, if the examination is done point by point and the set is very large, scanning the entire space makes the approach costly, slow, and non-scalable.
- In this paper, we introduce a new technique, the Vertical Set Square Distance (VSSD), that scalably, quickly, and accurately computes the total length of separation (total variation) of a set of points about a point.

Motivation
- Because of the explosive growth of data stored in digital form, data mining efforts have focused on techniques that scale to datasets of very large cardinality.
- Most existing approaches for measuring the closeness of points are slow and computationally intensive, often relying on sampling to remain usable.
- P-tree technology, a vertical data structure, is an appropriate structure for addressing this curse of cardinality.

Our Contributions
- We introduce a new vertical technique that scalably, quickly, and accurately computes the total length of separation of a set of points about a fixed point.
- We present empirical results on real and synthetic datasets, giving a one-to-one comparison of scalability and speed between our new method, which uses the vertical approach (the P-tree vertical data structure), and a method that uses the horizontal (record-based) approach.

P-Tree Vertical Data Structure
- The P-tree vertical data representation consists of a set of structures that essentially represent the data column-by-column rather than row-by-row (as in relational data).
- P-trees are typically constructed from an existing relational table by decomposing each attribute into separate bit vectors (e.g., one for each bit position of a numeric attribute, or one bitmap for each category of a categorical attribute).
- For raw data coming from a sensor platform, construction is done directly, without creating a horizontal relational table first. Note that many sensor platforms (e.g., RSI sensors) produce raw vertical data.

The Construction of P-trees
1. Vertically project each attribute.
2. Vertically project each bit position.
3. Compress each bit slice into a P-tree.
The logical operations are AND (∧), OR (∨), and complement ('). The root count operation counts the 1-bits in the P-tree produced by these logical operations.
[Figure: a relation R(A1, A2, A3, A4) is decomposed into bit slices R11 … R43, each compressed into a basic P-tree P11 … P43; basic P-trees are combined with AND.]
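Steps 1-3 and the root count can be sketched in Python. This is a minimal illustration with hypothetical helper names: plain Python integers stand in for compressed P-trees (one integer bit per row), so the compression step is omitted, but the logical operations and counts come out the same.

```python
def bit_slices(column, bit_width):
    """Vertically project a numeric column into one bit vector per bit position.

    Slice j holds, for every row i, bit j of column[i]; row i is stored at
    integer bit position i, so a bitwise AND of two slices aligns rows.
    """
    slices = []
    for j in range(bit_width):
        bv = 0
        for i, value in enumerate(column):
            if (value >> j) & 1:
                bv |= 1 << i
        slices.append(bv)
    return slices  # slices[j] = bit vector for bit position j

def root_count(bit_vector):
    """Root count: the number of 1-bits in a (possibly ANDed) bit vector."""
    return bin(bit_vector).count("1")

# Example: one attribute of a 4-row relation, 3-bit values.
col = [5, 1, 7, 0]            # binary: 101, 001, 111, 000
P = bit_slices(col, 3)
# Rows whose bit 0 and bit 2 are both set (the values 5 and 7):
rc = root_count(P[0] & P[2])  # rc == 2
```

A real P-tree stores each slice as a compressed quadrant tree rather than a flat bit vector, but the AND/OR/complement algebra and the resulting root counts are identical.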

Binary Representation
Let x be a value of numeric attribute Ai of relation R(A1, A2, …, Ad). Written as a b-bit binary number,
x = x_{i,b-1}·2^(b-1) + … + x_{i,1}·2 + x_{i,0} = Σ_{j=0..b-1} x_{i,j}·2^j
The first subscript of x is the index of the attribute to which x belongs; the second subscript indicates the bit position. The summation on the right-hand side equals the value of x as a decimal number.

Vertical Set Square Distance (VSSD)
Let X be a set of vectors in relation R(A1, A2, …, Ad), let PX be the P-tree class mask of X, let x ∈ X, and let a be a target vector (also in d-dimensional space). The vectors x and a can each be written bit-by-bit in binary, as above. The total variation of the set X about a can then be computed vertically using the Vertical Set Square Distance (VSSD), defined as the sum over x ∈ X of (x − a) ∘ (x − a).
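The defining equations on this slide were images that did not survive transcription. The following is a reconstruction in LaTeX, consistent with the bit notation above and with the term structure (T1, T2, T3) used on the next slides; rc denotes root count and P_{i,j} the basic P-tree for bit j of attribute i. It should be read as a best-effort reconstruction rather than the authors' exact typography.

```latex
\mathrm{VSSD}(X,a) \;=\; \sum_{x \in X} (x-a)\circ(x-a) \;=\; T_1 + T_2 + T_3,
\quad\text{where}
\]
\[
T_1 = \sum_{i=1}^{d}\sum_{j=0}^{b-1}\sum_{k=0}^{b-1} 2^{\,j+k}\,
      rc\!\left(P_X \wedge P_{i,j} \wedge P_{i,k}\right),
\]
\[
T_2 = -2\sum_{i=1}^{d} a_i \sum_{j=0}^{b-1} 2^{\,j}\,
      rc\!\left(P_X \wedge P_{i,j}\right),
\qquad
T_3 = rc\!\left(P_X\right)\sum_{i=1}^{d} a_i^{2}.
```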

Vertical Set Square Distance (cont.)
Terms T1, T2, and T3 can be computed separately; their sum is the total variation (the sum of the squared lengths of separation of the vectors connecting the points of X to a).

Vertical Set Square Distance (cont.)
Alternatively, T1 can be written with the diagonal terms (j = k) expressed separately, noting also that x_{i,j}² = x_{i,j} since they are bits.
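The rewritten form of T1 was likewise an image; reconstructing it from the facts stated on the slide (separate the diagonal j = k terms, and use x_{i,j}² = x_{i,j} so the diagonal root counts need only two operands):

```latex
T_1 = \sum_{i=1}^{d}\Biggl[\,\sum_{j=0}^{b-1} 2^{\,2j}\,
      rc\!\left(P_X \wedge P_{i,j}\right)
      \;+\; \sum_{j=0}^{b-1}\sum_{\substack{k=0 \\ k\neq j}}^{b-1} 2^{\,j+k}\,
      rc\!\left(P_X \wedge P_{i,j} \wedge P_{i,k}\right)\Biggr].
```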

Vertical Set Square Distance (cont.)
And T3 is simply rc(PX) · Σ_{i=1..d} a_i², since the root count of PX is the number of vectors in X.

Vertical Set Square Distance (cont.)
The root count operations are clearly independent of the target vector a. They include the root count of a single P-tree operand, PX; the root counts involving two P-tree operands; and the root counts involving three P-tree operands, which appear in T1, T2, and T3.

20 th International Conference on CATA 2005 Vertical Set Square Dist. (Cont)  The independency allows us to pre-compute the root counts once in advance, during the construction of the P-tree, and can use them repeatedly as needed, regardless of the number of target vectors a.  This amortizes the cost of P-tree ANDing for high value datasets, e.g. cancer analysis, RSI, etc.

Horizontal Set Square Distance (HSSD), the comparison method
Let X be a set of vectors in R(A1, A2, …, Ad), let x = (x1, x2, …, xd) be a vector belonging to X, and let a = (a1, a2, …, ad) be a target vector. The horizontal set square distance (HSSD) is defined as
HSSD(X, a) = Σ_{x∈X} Σ_{i=1..d} (x_i − a_i)²
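For contrast, a minimal sketch of the horizontal method (the function name is illustrative): every query rescans the full record set, so cost grows linearly with |X| per target vector.

```python
def hssd(rows, a):
    """Horizontal Set Square Distance: scan every record x in X and
    accumulate the squared Euclidean distance |x - a|^2."""
    total = 0
    for x in rows:
        total += sum((xi - ai) ** 2 for xi, ai in zip(x, a))
    return total

# The same four 2-dimensional rows as row records rather than bit slices.
rows = [(5, 2), (1, 3), (7, 1), (0, 2)]
result = hssd(rows, (2, 1))
```

On identical data, this horizontal scan and the count-based vertical computation yield the same total variation; the difference is purely in cost per query.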

Experimental Results
- The experiments were conducted on both real and synthetic datasets.
- The goals were to compare the execution time (speed) and scalability of our algorithm, which employs a vertical approach (a vertical data structure with horizontal bitwise AND operations), against a horizontal approach (a horizontal data structure with vertical scan operations).
- The performance of both algorithms was observed on machines with different specifications, including an SGI Altix machine.

Experimental Results (cont.)
Machines used in the experiments:
- AMD 1GB: AMD Athlon K7 1.4GHz, 1GB RAM
- P4 2GB: Intel P4 2.4GHz, 2GB RAM
- SGI Altix: SGI Altix CC-NUMA, 12-processor shared memory (12 x 4GB RAM)

Datasets
- A set of aerial photographs from the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, ND.
- The original image is 1024x1024 pixels (cardinality 1,048,576) and contains 3 bands (RGB) plus synchronized data for soil moisture, soil nitrate, and crop yield (6 attributes in total).

Datasets (cont.)
- The datasets contain 4 yield classes:
- Low Yield (0 < intensity <= 63)
- Medium Low Yield (63 < intensity <= 127)
- Medium High Yield (127 < intensity <= 191)
- High Yield (191 < intensity <= 255)
- Five synthetic datasets were created by super-sampling:
- 2,097,152 rows
- 4,194,304 rows (2048x2048 pixels)
- 8,388,608 rows
- 16,777,216 rows (4096x4096 pixels)
- 25,160,256 rows (5016x5016 pixels)

Timing and Scalability Results: Observation 1
- The first performance evaluation was done on the P4 2GB RAM machine using the synthetic datasets of 4.1 and 8.3 million tuples.
- We used 100 arbitrary unclassified target vectors, resulting in 400 classification computations of predicted yield.
- Datasets larger than 8.3 million rows could not be run on this machine because of out-of-memory problems when running the Horizontal Set Square Distance (HSSD).

Results
[Table: average time (seconds) to compute total variation for each test case, VSSD vs. HSSD, on the datasets of 4,194,304 and 8,388,608 rows.]

Timing and Scalability Results: Observation 1 (cont.)
[Table: time (seconds) for root-count pre-computation and P-tree loading (VSSD) vs. horizontal dataset loading (HSSD), for datasets of 1,048,576 to 8,388,608 rows.]
This table compares the time required to load the vertical data structure into memory and perform the one-time root count operations for VSSD against the time required to load the horizontal records into memory for HSSD.

Timing and Scalability Results: Observation 2
This table shows the timing and scalability of the algorithms when executed on different machines.
[Table: average running time (seconds) to compute total variation for each test case, HSSD on AMD 1GB, P4 2GB, and SGI Altix 12x4GB vs. VSSD on AMD 1GB, for datasets of 1,048,576 to 25,160,256 rows; HSSD could not complete on the largest datasets.]

Timing and Scalability Results: Observation 2 (cont.)
[Chart: running time versus dataset cardinality for VSSD and HSSD.]

Conclusion
- The Vertical Set Square Distance (VSSD) is a fast and accurate way to compute total variation for classification, clustering, and rule mining, and it scales well to very large datasets compared with the traditional horizontal approach (HSSD).
- The complexity of VSSD is O(d · b²), where d is the number of dimensions and b is the maximum bit-width of the attributes; it depends only on the width of the dataset, not its cardinality.
- VSSD is very fast because the root counts of the P-tree operands are independent of the target vector a, which allows the counts to be pre-computed once, during the construction of the P-trees.

Future Work
- A comprehensive study of the Vertical Set Square Distance (VSSD) across data mining tasks, e.g., classification, clustering, and outlier detection.
- For classification, using VSSD in the voting phase has already been shown to greatly accelerate class assignment, since the class votes can be computed entirely in one computation, without visiting each individual point as the horizontal approach must.

Thank You…