
Vertical Set Square Distance Based Clustering without Prior Knowledge of K. Amal Perera, Taufik Abidin, Masum Serazi, Dept. of CS, North Dakota State University. George Hamer, Dept. of CS, South Dakota State University. William Perrizo, Dept. of CS, North Dakota State University.

Outline Introduction Background Approach Results Conclusions

Introduction Clustering: automated identification of groups of objects based on similarity. Two major problems in clustering: scalability, and the need for input parameters. We propose: Vertical Set Squared Distance based clustering.

Background Clustering algorithms work well on small datasets. Current approaches for large data sets: Sampling, e.g. CLARA (chooses a representative sample) and CLARANS (selects a randomized sample for each iteration). Preserving summary statistics, e.g. BIRCH (a tree structure that records sufficient statistics for the data set). These techniques still require input parameters and may lead to suboptimal solutions.

Background (Cont.) Current clustering algorithms require input parameters, e.g. DENCLUE: grid cell size; DBSCAN: neighborhood radius and minimum points in the core; K-Means / K-Medoids: K. Results are sensitive to these input parameters.

Background (Cont.) Some approaches toward parameter-less clustering: OPTICS computes an augmented cluster ordering, at a cost of O(n log n). G-Means exploits Gaussian properties in the data, if they exist. ACE maps the search space to a weighted grid, but depends on a heuristic search.

Our Approach Scalability is addressed with a partition-based approach, a vertical data structure (the P-tree), and efficient computation of the Set Squared Distance (VSSD) over the entire data set (an influence function). The need for the parameter K is addressed by observing the differences in the influence for each data point.

Influence Functions Influence: describes the impact of a data point within its neighborhood. Examples of influence functions are sketched below.
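The transcript drops the examples here; as a sketch, the influence functions commonly used in density-based clustering (e.g. in DENCLUE), with d(x, a) the Euclidean distance and sigma a width parameter (notation assumed here, not taken from the slide), are:

```latex
% Sketch of common influence functions; d(x,a) is Euclidean distance,
% \sigma a width parameter (both assumed notation, not from the slide).
f_{\mathrm{square}}(x,a) =
  \begin{cases} 1 & d(x,a) \le \sigma \\ 0 & \text{otherwise} \end{cases}
\qquad
f_{\mathrm{Gauss}}(x,a) = e^{-\,d(x,a)^2/(2\sigma^2)}
\qquad
f_{\mathrm{parabolic}}(x,a) = d(x,a)^2
```

The parabolic form is the one this talk builds on: summed over a set X it is exactly the set squared distance computed on the following slides.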

P-tree Vertical Data Structure Predicate-trees (P-trees) are lossless, compressed, and data-mining-ready, and have been used successfully in KNN, ARM, Bayesian classification, SVM, etc. A basic P-tree represents one attribute bit reorganized into a tree structure, built by recursively subdividing while recording the truth value of a purity predicate for each division. Each level of the tree contains truth bits that represent pure sub-trees, and construction continues recursively down each path until a pure subdivision is reached.

Example: a file R(A1, A2, A3, A4) contains horizontal structures (a set of horizontal records) that are scanned vertically. R is first vertically partitioned into bit slices R11, R12, R13, ..., R41, R42, R43 (one slice per bit position of each attribute), and each bit slice is compressed into a basic P-tree P11, P12, ..., P43. A 1-dimensional P-tree is built by recording the truth of the predicate "pure 1" recursively on halves until purity is reached. For P11 the construction runs: 1. the whole file is not pure-1 (0); 2. the 1st half is not pure-1 (0); 3. the 2nd half is not pure-1 (0); 4. the 1st half of the 2nd half is not pure-1 (0); 5. the 2nd half of the 2nd half is pure-1 (1); 6.-7. the halves of step 4's subdivision are recorded in turn, and a branch ends as soon as it is pure (a pure-0 half simply ends with a 0). To count the tuples with a given value combination, e.g. (7, 0, 1, 4) = bit pattern 111 000 001 100, the basic P-trees are processed horizontally with one multi-operand logical AND: P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43. The root count of the resulting P-tree, accumulated level by level, is the number of matching tuples.
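As a concrete sketch of the construction just described, here is a minimal 1-dimensional basic P-tree in Python. The names (PTree, root_count) are illustrative, not from the P-tree implementation, and a real system ANDs the compressed trees directly rather than keeping raw bits:

```python
# Minimal sketch of a 1-D basic P-tree: recursively halve a bit slice,
# recording the truth of the "pure 1" predicate; pure branches end.
class PTree:
    def __init__(self, bits):
        self.pure1 = all(bits)        # truth bit: is this division pure-1?
        self.count = sum(bits)        # root count contribution (1-bits below)
        self.children = []
        if not self.pure1 and self.count > 0 and len(bits) > 1:
            mid = len(bits) // 2      # mixed division: recurse on halves
            self.children = [PTree(bits[:mid]), PTree(bits[mid:])]
        # pure-0 (count == 0) and pure-1 branches end here

def root_count(p):
    """rc(P): the number of 1-bits the tree represents."""
    return p.count

# One attribute bit slice, scanned vertically:
p11 = PTree([0, 0, 0, 0, 0, 1, 1, 1])
print(root_count(p11))  # -> 3
```

The multi-operand AND from the slide above operates on trees like these level by level, so tuple counts come from compressed structures instead of record scans.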

Efficient Computation of Influence Use the Vertical Set Squared Distance to compute the parabolic influence:

f(a, X) = \sum_{x \in X} (x - a) \circ (x - a) = T_1 - 2T_2 + T_3

where

T_1 = \sum_i \sum_j \sum_k 2^{\,j+k}\, rc(P_X \wedge P_{i,j} \wedge P_{i,k})
T_2 = \sum_i a_i \sum_j 2^{\,j}\, rc(P_X \wedge P_{i,j})
T_3 = rc(P_X) \sum_i a_i^2

P_{i,j}: P-tree for the jth bit of the ith attribute. rc(P): root count of a P-tree (number of truth bits). P_X: P-tree (mask) for the subset X. The count operations above are independent of a, so they can be precomputed and reused; for different P_X the counts can be computed quickly using P-trees.
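A runnable sketch of this computation, emulating root counts with masked popcounts over raw bit slices (a real implementation obtains rc() by ANDing compressed P-trees, and precomputes the a-independent counts):

```python
import numpy as np

def vssd(a, X_mask, bit_slices):
    """Sketch of the Vertical Set Squared Distance f(a, X) = T1 - 2*T2 + T3.

    bit_slices[i][j] is the j-th bit slice (LSB first) of attribute i, a 0/1
    array over all tuples; X_mask is the 0/1 mask for the subset X. rc() is
    emulated with a masked popcount here; with P-trees these counts come from
    multi-operand ANDs and can be precomputed and reused across points a."""
    rc = lambda v: int(v.sum())                       # root count stand-in
    T1 = T2 = 0
    for i, slices in enumerate(bit_slices):
        for j, pj in enumerate(slices):
            T2 += a[i] * (2 ** j) * rc(X_mask & pj)   # linear term
            for k, pk in enumerate(slices):
                T1 += 2 ** (j + k) * rc(X_mask & pj & pk)
    T3 = rc(X_mask) * sum(ai * ai for ai in a)
    return T1 - 2 * T2 + T3

# Demo on tiny data: 2 attributes, 3-bit values, 4 tuples.
data = np.array([[3, 5], [1, 2], [7, 6], [4, 4]])
bit_slices = [[(data[:, i] >> j) & 1 for j in range(3)] for i in range(2)]
X_mask = np.array([1, 1, 0, 1])                       # subset X = rows 0, 1, 3
a = [2, 3]
print(vssd(a, X_mask, bit_slices))                    # -> 12
print(sum(((row - a) ** 2).sum() for row in data[X_mask == 1]))  # same, horizontally
```

The cross-check at the end confirms the vertical formula matches the plain horizontal sum of squared distances over X.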

Algorithm
1. Compute the set squared distance for all points against the entire data set.
2. Sort the set squared distances.
3. Find the difference between consecutive values i and i+1 (the gap).
4. Compute mean(gap) and stdev(gap).
5. Identify gaps > mean + 3 * stdev (large gaps).
6. Break the ordering into clusters using the large gaps as partition boundaries (a sketch of steps 1 to 6 follows this list).
7. Compute the set squared distance for every tuple a against each cluster C_i.
8. Re-assign each tuple to the cluster with minimum SetSqdDist(a, C_i).
9. Iterate until the maximum iteration count is reached, the cluster sizes stop changing, or oscillation exceeds mean + 3 * stdev.
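Steps 1 to 6, the part that discovers K, can be sketched as follows; initial_partition is an illustrative name, and the VSSD values would come from the P-tree computation above:

```python
import numpy as np

def initial_partition(vssd_values):
    """Order points by their set squared distance to the whole data set, find
    unusually large gaps (> mean + 3*stdev), and cut the ordering there to
    form the initial clusters; the number of pieces is the discovered K."""
    order = np.argsort(vssd_values)               # step 2: sort by VSSD
    sorted_vals = np.asarray(vssd_values)[order]
    gaps = np.diff(sorted_vals)                   # step 3: consecutive gaps
    cut = gaps.mean() + 3 * gaps.std()            # steps 4-5: gap threshold
    boundaries = np.where(gaps > cut)[0] + 1      # step 6: partition points
    return [order[s] for s in np.split(np.arange(len(order)), boundaries)]

# Two well-separated groups of VSSD values: the gap recovers K = 2.
vals = [1.0 + 0.1 * i for i in range(10)] + [50.0 + 0.1 * i for i in range(10)]
print(initial_partition(vals))  # -> [indices 0..9, indices 10..19]
```

Steps 7 to 9 then reuse the same VSSD machinery per cluster, reassigning each tuple to its nearest cluster until the stopping condition holds.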

Cluster separation with VSSD: points ordered by VSSD, with a cluster boundary wherever the gap > µ + 3σ (figure).

Experimental Results Objective: quality and scalability. Datasets: synthetic data (quality), Iris Plant Data (quality), KDD-99 Network Intrusion Data (quality), remotely sensed image data (scalability). Quality is measured with the F-measure between each original cluster K_i and each found cluster C_j:

F(i,j) = \frac{2\,P(i,j)\,R(i,j)}{P(i,j) + R(i,j)}, \quad
P(i,j) = \frac{n_{ij}}{|C_j|}, \quad
R(i,j) = \frac{n_{ij}}{|K_i|}

where n_{ij} is the number of points shared by K_i and C_j. F = 1 for a perfect clustering.
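A minimal sketch of this score, assuming the standard weighted best-match form of the clustering F-measure (the exact weighting used in the talk is not preserved in the transcript):

```python
# Clustering F-measure sketch: each original cluster is matched to the found
# cluster with the best F(i,j), weighted by the original cluster's size.
def f_measure(original, found):
    n = sum(len(c) for c in original)
    total = 0.0
    for orig in original:
        best = 0.0
        for fnd in found:
            n_ij = len(set(orig) & set(fnd))      # shared points
            if n_ij == 0:
                continue
            p, r = n_ij / len(fnd), n_ij / len(orig)
            best = max(best, 2 * p * r / (p + r))
        total += len(orig) / n * best
    return total  # 1.0 for a perfect clustering

print(f_measure([[1, 2], [3, 4]], [[1, 2], [3, 4]]))  # -> 1.0
```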

Results: Cluster Quality Synthetic data, executed until F-measure = 1: VSSD required 226 DB scans, versus 8814 for K-Means even when given k.

Results: Cluster Quality Iris Data (3 known clusters): table comparing iterations and F-measure for VSSD against K-Means with K = 3, 4, and 5. KDD-99 Network Intrusion Data: table comparing iterations and F-measure for VSSD against K-Means with 6, 4, and 2 clusters.

Results: Scalability RSI (remotely sensed image) data with six 8-bit attributes per data point (figure).

Conclusions An ordering based on the set squared distance can be used to partition a data set into natural clusters (finding K). The set squared distance can be computed efficiently using P-trees. Vertical Set Squared Distance clustering is a scalable technique that produces high-quality clusters without the need for user parameters.