Vertical K Median Clustering

Presentation transcript:

Vertical K Median Clustering
Amal Perera, William Perrizo {amal.perera, william.perrizo}@ndsu.edu
Dept. of CS, North Dakota State University
CATA 2006 – Seattle, Washington

Outline
Introduction
Background
Our Approach
Results
Conclusions

Introduction
Clustering: automated identification of groups of objects based on similarity.
Application areas include data mining, search engine indexing, pattern recognition, image processing, trend analysis, and many others.
Clustering algorithms: partition, hierarchical, density, and grid based.
Major problem: scalability with respect to data set size.
We propose: a partition-based Vertical K Median Clustering algorithm.

Background
Many clustering algorithms work well on small data sets. Current approaches for large data sets include:
Sampling, e.g. CLARA (chooses one representative sample) and CLARANS (selects a randomized sample for each iteration).
Preserving summary statistics, e.g. BIRCH (a tree structure that records sufficient statistics for the data set).
These techniques typically require input parameters that demand prior knowledge of the data, and they may lead to suboptimal solutions.

Background
Partition clustering (k): the n objects in the original data set are broken into k partitions, iteratively, each iteration producing an improved k-clustering with respect to some optimality criterion.
Computational steps:
Find a representative for each cluster component.
Assign every other object to the cluster of its best representative.
Calculate the error (repeat if the error is still too high).

Our Approach
Scalability is addressed because:
it is a partition-based approach;
it uses a vertical data structure (the P-tree);
the computation is efficient: it selects the partition representative with a simple directed search across bit slices rather than down rows, assigns membership using bit slices with geometric reasoning, and computes the error using position-based manipulation of bit slices.
Solution quality is improved or maintained while speed and scalability increase.
It uses a median rather than a mean as the cluster representative.

P-tree* Vertical Data Structure
Predicate trees (P-trees) are lossless, compressed, and data-mining-ready, and have been used successfully in KNN classification, association rule mining (ARM), Bayesian classification, etc.
A basic P-tree represents one attribute bit slice, reorganized into a tree structure by recursive subdivision, recording for each subdivision the truth value of the purity predicate.
Each level of the tree contains truth bits that represent pure subtrees.
Construction continues recursively down each tree path until a pure subdivision is reached.
* Predicate Tree (P-tree) technology is patented by North Dakota State University (William Perrizo, primary inventor of record); patent number 6,941,303, issued September 6, 2005.

P-tree construction example
A file R(A1..An) contains horizontal structures (horizontal records) that are normally processed vertically (vertical scans). With P-trees we instead vertically partition the file, compress each vertical bit slice into a basic P-tree, and then horizontally process these basic P-trees with one multi-operand logical AND.
Example relation R(A1 A2 A3 A4), eight tuples with 3-bit attributes:
010 111 110 001
011 111 110 000
010 110 101 001
010 111 101 111
101 010 001 100
010 010 001 101
111 000 001 100
111 000 001 100
Vertical partitioning gives the bit slices R11, R12, R13, ..., R41, R42, R43 (one per attribute and bit position). One-dimensional P-trees are built by recording the truth of the predicate "pure 1" recursively on halves until purity is reached. For P11 (the high-order bit of A1):
1. The whole file is not pure1: 0
2. The 1st half is not pure1: 0 (but it is pure, pure0, so this branch ends)
3. The 2nd half is not pure1: 0
4. The 1st half of the 2nd half is not pure1: 0
5. The 2nd half of the 2nd half is pure1: 1
6. The 1st half of the 1st quarter of the 2nd half is pure1: 1
7. The 2nd half of that quarter is not pure1: 0
E.g., to count the 111 000 001 100 tuples, AND the corresponding basic P-trees and complements, P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43, and count the truth bits of the result: the count is 2.
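To make the counting step concrete, here is a minimal sketch in plain Python (not the patented, compressed P-tree implementation; the slices are kept as flat bit vectors packed into integers, and all names are mine) that vertically partitions the example relation above and counts the 111 000 001 100 tuples with one multi-operand AND:

# Sketch only: uncompressed bit-vector "P-trees" for the example relation.
R = [  # the 8 tuples of R(A1 A2 A3 A4), 3 bits per attribute, from the slide
    (0b010, 0b111, 0b110, 0b001),
    (0b011, 0b111, 0b110, 0b000),
    (0b010, 0b110, 0b101, 0b001),
    (0b010, 0b111, 0b101, 0b111),
    (0b101, 0b010, 0b001, 0b100),
    (0b010, 0b010, 0b001, 0b101),
    (0b111, 0b000, 0b001, 0b100),
    (0b111, 0b000, 0b001, 0b100),
]
BITS = 3
N = len(R)

def bit_slice(attr, j):
    # Vertical bit slice: bit j (j = 0 is the MSB) of attribute attr,
    # one bit per tuple, packed into a Python int.
    s = 0
    for row, tup in enumerate(R):
        if (tup[attr] >> (BITS - 1 - j)) & 1:
            s |= 1 << row
    return s

# Basic "P-trees", one per attribute and bit position.
P = [[bit_slice(a, j) for j in range(BITS)] for a in range(4)]

def count_pattern(pattern):
    # Count tuples equal to `pattern` (one 3-bit value per attribute) by
    # ANDing, for every bit, the matching slice or its complement.
    mask = (1 << N) - 1
    for a, value in enumerate(pattern):
        for j in range(BITS):
            want = (value >> (BITS - 1 - j)) & 1
            mask &= P[a][j] if want else ~P[a][j] & ((1 << N) - 1)
    return bin(mask).count("1")

print(count_pattern((0b111, 0b000, 0b001, 0b100)))   # -> 2, as on the slide

A real P-tree would compress each slice into the recursive pure-1 tree described above, so the AND and the count operate on compressed tree nodes rather than raw bits.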

Centroid: Partition Representative
Mean vs. median: the median is usually considered the better estimator; for example, it handles outliers much better.
Finding the true median vector is an NP-hard problem.
Existing median (medoid) based solutions:
PAM: exhaustive search.
CLARA: chooses one representative sample.
CLARANS: selects a randomized sample for each iteration.

Vector of Medians
Vector of medians (Hayford, 1902): the vector formed by taking the median value of each individual dimension.
With a traditional horizontal approach, the computational costs of the mean and the vector of medians are N and 3N respectively:
Mean: N scans.
Median: 3N scans (requires a partial sort).

[1] Median with P-trees
Starting from the most significant bit, repeatedly AND the appropriate bit slice (or its complement) until the least significant bit is reached, building the median pattern one bit at a time. At each bit position, the counts of 1s and 0s among the remaining candidate values (OneCnt vs. ZeroCnt) determine the corresponding bit of the median; in the slide's example the comparisons 4 < 5, then 6 > 3, then 4 < 5 yield the median pattern 010.
Scalability? E.g., if the cardinality is 2^32 = 4,294,967,296, rather than scanning 4 billion records we AND log2(2^32) = 32 P-trees.
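As a rough illustration of this directed search (again with uncompressed bit-vector slices as Python integers and helper names of my own, not the paper's implementation), the following sketch builds the median of one attribute from MSB to LSB using only ANDs and counts:

def popcount(x):
    return bin(x).count("1")

def vertical_median(slices, n):
    # slices[j] = bit vector (int) of bit j of this attribute, j = 0 is the MSB.
    # Returns the element of rank ceil(n/2) in sorted order (the median).
    full = (1 << n) - 1
    mask = full                  # rows still consistent with the median pattern
    rank = (n + 1) // 2          # 1-based rank of the median
    median = 0
    for Pj in slices:
        ones = popcount(mask & Pj)
        zeros = popcount(mask) - ones
        median <<= 1
        if rank <= zeros:        # median lies among the rows with this bit = 0
            mask &= ~Pj & full
        else:                    # median lies among the rows with this bit = 1
            median |= 1
            rank -= zeros
            mask &= Pj
    return median

# Example: values 2,3,2,2,5,2,7,7 (attribute A1 of the relation two slides back)
values = [2, 3, 2, 2, 5, 2, 7, 7]
slices = []
for j in (2, 1, 0):              # MSB first
    s = 0
    for row, v in enumerate(values):
        if (v >> j) & 1:
            s |= 1 << row
    slices.append(s)
print(vertical_median(slices, len(values)))   # -> 2 (the 4th smallest of 8)

The vector of medians is obtained by running this once per attribute, one AND plus count per bit slice instead of a scan or partial sort.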

[2] Bulk Membership Assignment (not 1-by-1)
Find the perpendicular bisector boundaries between centroids (the vectors of attribute medians, easily computed as on the previous slide) and assign membership to all the points within those boundaries at once. The assignment is done using ANDs and ORs of the respective bit slices, without a scan; a code sketch of the interval masks involved follows this slide.
Data points inside the resulting rectangular boxes can be assigned to the respective cluster in bulk: it can be proved that any point within such a box is closer to its cluster center than to any other cluster center. Initial iterations have fewer bulk assignments than later iterations, when the algorithm is edging toward the actual cluster centers.
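One hedged way to realize such bulk assignment with bit slices (my own helper names; a real implementation would run these operations on compressed P-trees) is to build, for each attribute, the mask of points whose value falls in a given interval, and then AND the per-attribute masks to obtain every point inside the hyper-rectangle at once:

def popcount(x):
    return bin(x).count("1")

def mask_le(slices, c, n, bits):
    # Bit mask of rows whose value is <= c, from MSB-first bit slices.
    full = (1 << n) - 1
    less, equal = 0, full
    for j, Pj in enumerate(slices):
        cbit = (c >> (bits - 1 - j)) & 1
        if cbit:
            less |= equal & ~Pj & full   # prefix equal, this bit 0 -> strictly less
            equal &= Pj
        else:
            equal &= ~Pj & full          # a 1 here would make the value greater
    return less | equal

def mask_ge(slices, c, n, bits):
    # Bit mask of rows whose value is >= c.
    full = (1 << n) - 1
    greater, equal = 0, full
    for j, Pj in enumerate(slices):
        cbit = (c >> (bits - 1 - j)) & 1
        if cbit:
            equal &= Pj
        else:
            greater |= equal & Pj        # prefix equal, this bit 1 -> strictly greater
            equal &= ~Pj & full
    return greater | equal

def box_mask(P, lo, hi, n, bits):
    # AND the per-attribute interval masks: every point inside the box [lo, hi].
    mask = (1 << n) - 1
    for a, slices in enumerate(P):
        mask &= mask_ge(slices, lo[a], n, bits) & mask_le(slices, hi[a], n, bits)
    return mask

# Tiny check on one 3-bit attribute with values 2,3,2,2,5,2,7,7:
values = [2, 3, 2, 2, 5, 2, 7, 7]
slices = [sum(((v >> j) & 1) << r for r, v in enumerate(values)) for j in (2, 1, 0)]
m = mask_le(slices, 3, len(values), 3) & mask_ge(slices, 2, len(values), 3)
print(popcount(m))   # -> 5 points with value in [2, 3]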

Reverse HOBBit membership
The previous step may leave data points without cluster membership (e.g., the red points around centroids C1, C2, C3 in the slide's figure). We can either do a scan for those remaining points or use this quick, fuzzy assignment: starting from the higher-order bits, zoom into each HOBBit rectangle that contains a centroid and assign all the points inside that rectangle to the centroid. Stop before the total number of assigned points becomes smaller than the number of points still available; the best level is chosen before the hyper-rectangular boxes get too small and leave too many data points out.
This may lead to multiple assignments; the motivation is efficiency over accuracy, and it is not a perfect membership assignment. T is the total number of points assigned at the current level: compared with the actual number of points still to be assigned, a T that is too small means we have gone too far, while a T that is too large means many multiple assignments. Initial observations show that success depends on the data set. Most of the experimental results were obtained using the reverse HOBBit approach with a scan for the last iteration. The approach is faster than a scan but loses some accuracy; its use is suggested when scanning for the remaining data points is not feasible for scalability reasons (i.e., very large data sets).
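A sketch of this reverse-HOBBit idea under the same assumptions as the previous sketches (uncompressed bit-vector slices, hypothetical helper names, not the paper's implementation): deepen the shared high-order-bit prefix around each centroid one bit at a time and stop before the covered points fall below the number still unassigned.

def popcount(x):
    return bin(x).count("1")

def prefix_mask(P, centroid, depth, n):
    # Mask of points that agree with `centroid` on the first `depth`
    # (highest-order) bits of every attribute.  P[a][j] is the j-th
    # (MSB-first) bit slice of attribute a, packed into an int.
    full = (1 << n) - 1
    mask = full
    for a, slices in enumerate(P):
        bits = len(slices)
        for j in range(min(depth, bits)):
            cbit = (centroid[a] >> (bits - 1 - j)) & 1
            mask &= slices[j] if cbit else ~slices[j] & full
    return mask

def reverse_hobbit_assign(P, centroids, unassigned, n, bits):
    # Deepen the shared prefix one bit at a time; stop before the number of
    # covered points drops below the number of points still unassigned and
    # keep the previous ("best") level.  The returned masks may overlap:
    # this is a quick, fuzzy assignment that trades accuracy for speed.
    best = [prefix_mask(P, c, 1, n) & unassigned for c in centroids]
    for depth in range(2, bits + 1):
        masks = [prefix_mask(P, c, depth, n) & unassigned for c in centroids]
        union = 0
        for m in masks:
            union |= m
        if popcount(union) < popcount(unassigned):
            break                      # gone too far: too few points covered
        best = masks
    return best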

[3] Efficient Error Computation
Error = sum of squared distances from the centroid (a) to the points (x) in the cluster, where:
Pi,j : P-tree for the jth bit of the ith attribute
COUNT(P) : count of the number of truth bits in P
PX : P-tree (mask) for the cluster subset X
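The formula itself was an image on the original slide; a reconstruction consistent with these definitions, obtained by expanding (x_i - a_i)^2 bit by bit so that every term becomes the count of an ANDed P-tree, would be:

\sum_{x \in X} \lVert x - a \rVert^{2}
  = \sum_{i} \Bigl[ \sum_{j}\sum_{k} 2^{\,j+k}\,\mathrm{COUNT}\bigl(P_X \wedge P_{i,j} \wedge P_{i,k}\bigr)
  \;-\; 2\,a_i \sum_{j} 2^{\,j}\,\mathrm{COUNT}\bigl(P_X \wedge P_{i,j}\bigr)
  \;+\; a_i^{2}\,\mathrm{COUNT}\bigl(P_X\bigr) \Bigr]

where i ranges over the attributes, j and k over bit positions (2^j being the positional weight of bit j), and a_i is the i-th component of the centroid; every term is a count over ANDed bit slices, so no per-point scan is needed.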

Algorithm
Input: DataSet, K, Threshold
Output: K clusters
Initialize K clusters for DataSet
Repeat
  Assign membership using hyper-rectangle pruning
  Assign membership for points outside the boundaries with reverse HOBBit OR a DB scan
  Find Error = sum of Sq.Dist(SetCi, Centroidi) for all i
  Find new centroid = vector of medians
Until (QualityGain < Threshold OR Iteration > MaxIteration)
Quality gain is the difference between successive error computations (the rate of improvement).
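For readers who want the control flow spelled out, here is a small, runnable toy version of the same loop in plain horizontal Python. It is mine, not the paper's implementation: its per-point scans for membership and error are exactly the steps the vertical P-tree primitives on the previous slides replace.

import random

def k_median(points, k, threshold=1e-6, max_iterations=100, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)
    prev_error = float("inf")
    for _ in range(max_iterations):
        # Membership: nearest centroid (the vertical version prunes in bulk).
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Error: sum of squared distances to the current centroids.
        error = sum(min(sum((pi - ci) ** 2 for pi, ci in zip(p, c))
                        for c in centroids) for p in points)
        # New centroid: vector of per-dimension medians.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sorted(col)[len(col) // 2]
                                     for col in zip(*cl))
        if prev_error - error < threshold:     # improvement too small: stop
            break
        prev_error = error
    return centroids, clusters

# Example: two obvious groups in 2-D.
pts = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (9, 9)]
print(k_median(pts, 2)[0])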

Experimental Results
Objective: quality and scalability.
Datasets:
Synthetic data - quality
Iris Plant data - quality
KDD-99 Network Intrusion data - quality
Remotely Sensed Image data - scalability
Quality is measured with the F-measure, comparing each produced cluster against the original clusters; F = 1 indicates perfect clustering, and the F-measure is used for quality in all the experimental results.
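The measure itself was an image on the slide; assuming the standard clustering F-measure over original clusters C_i and produced clusters B_j (an assumption consistent with the labels on the slide), it reads:

F \;=\; \sum_{i} \frac{|C_i|}{n}\,\max_{j}\, \frac{2\,P(i,j)\,R(i,j)}{P(i,j)+R(i,j)},
\qquad P(i,j)=\frac{|C_i \cap B_j|}{|B_j|},
\quad  R(i,j)=\frac{|C_i \cap B_j|}{|C_i|}

with n the total number of points; F = 1 only when every original cluster is reproduced exactly.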

Results: Iterations
Synthetic data, executed until F-measure = 1; iteration counts per approach and dataset: VKClust 6 and 4, PAM more than 300.
Three separate two-dimensional synthetically generated datasets (200 data points each) were used to test the clustering capability; the slide's pictures show the spatial locations of the data, and the numbers are iteration counts. There is a clear advantage, with respect to the number of iterations required to arrive at the same solution, for the vertical median approach compared to PAM.

Results: Quality and Iterations
IRIS data, 3 classes (F-measure and iterations):
VKClust  F = 0.85, 5 iterations
KMeans   F = 0.80, 8 iterations
PAM      F = 0.86, >300 iterations
Few iterations mean quick convergence; a high F-measure indicates better cluster quality. PAM can be expected to give the best clustering of the medoid-based approaches for a given dataset, so we are comparing our results against the best available partition algorithm with respect to quality. The slide shows that comparable quality can be expected from our k-median approach at a very low number of iterations. NOTE: Quality(PAM) > Quality(CLARANS) > Quality(CLARA).

This slide illustrates the possibility of bulk assignment with perpendicular bisector boundaries: it shows the clusters and the bisector boundaries for one attribute, together with the final cluster assignments produced by our approach. Setosa is clearly identified without any false positives, but for the other two clusters the boundary cases include a few false positives, which is why the F-measure is below 1.0.

Results: Quality
UCI network intrusion data for 2, 4, and 6 classes (F-measure):
VKClust  0.91  0.86  0.77
KMeans   0.75  0.72
PAM      0.85
This is a network intrusion dataset; the task is to identify the type of network intrusion from properties of a TCP/IP dump. The objective here is to show the feasibility of the clustering algorithm on real-life datasets, and this slide shows only the quality. Though PAM shows somewhat better results, it is about 60 times more expensive than K-means or our k-median approach.

Results: Unit Performance
Time in seconds per unit operation (1M RSI data points, k = 4) on a P4 2.4 GHz with 4 GB RAM:
Unit                     Horizontal   Vertical
Median                   1.17+        0.35
Root mean sqrd error     0.63*        0.44
Find membership          2.48         0.19
Total (each iteration)   3.55*        0.98
This is a breakdown of the computational cost of the proposed algorithm, comparing a horizontal approach with the suggested vertical approach. The vertical approach is about 4 times faster overall, and the speed gain comes from the computation of the median and the membership assignment.
NOTE: the 0.63 seconds for error computation in the horizontal approach is not added to its total, to be fair to the horizontal approach, because that computation can be done while scanning the dataset for membership.
* Root mean squared error calculation overlaps with Find Membership.
+ Best C++ implementation, std::nth_element() from the standard template library.

Results: Scalability
The slide's chart, based on an RSI dataset, shows a clear advantage for the vertical method over a horizontal method for the two key steps of a typical k-median algorithm.

Conclusions
Vertical, bit-slice-based computation of the median is computationally less expensive than the best horizontal approach.
Hyper-rectangular queries can be used to make bulk cluster membership assignments.
Position-based manipulation and accumulation of vertical bit slices can be used to compute the squared error for an entire cluster without scanning the DB for individual data points.
Completely vertical k-median clustering is a scalable technique that can produce high-quality clusters at a lower cost.