
Fast Algorithms for Projected Clustering
CHAN Siu Lung, Daniel; CHAN Wai Kin, Ken; CHOW Chin Hung, Victor; KOON Ping Yin, Bob

Clustering in high dimensions Most known clustering algorithms group the data based on distances between points. Problem: the data may be close in a few dimensions but not in all dimensions, and such cluster structure is missed.

Example [figure: a data set plotted on axes X, Y, Z and its projection onto the X-Y plane]

Another way to solve this problem Find the dimensions that are closely correlated for all the data and find clusters in those dimensions. Problem: it is sometimes not possible to find such a set of closely correlated dimensions.

Example [figure: a data set plotted on axes X, Y, Z]

Cross sections for the example [figure: the Z-X and X-Y projections of the data set]

PROCLUS The paper addresses the problem above; the method is called PROCLUS (PROjected CLUStering).

Objective of PROCLUS Define an algorithm that finds the clusters together with the dimensions associated with each cluster. It also needs to separate out the outliers (points that do not cluster well) from the clusters.

Input and Output for PROCLUS
Input:
–The set of data points
–The number of clusters, denoted by k
–The average number of dimensions per cluster, denoted by L
Output:
–The clusters found, and the dimensions associated with each cluster

PROCLUS The three phases of PROCLUS:
–Initialization Phase
–Iterative Phase
–Refinement Phase

Initialization Phase Choose a sample set of data points at random. From the sample, choose a set of points that is likely to contain the medoids of the clusters.

Medoids The medoid of a cluster is the data point nearest to the center of the cluster.

Initialization Phase (overview)
All data points → random data sample (chosen at random, size A × k) → the set of candidate points expected to contain the medoids, denoted M (chosen by the greedy algorithm, size B × k) → the medoids found (chosen in the Iterative Phase, size k)

Greedy Algorithm Avoid choosing several candidate medoids from the same cluster. The way to do this is to choose a set of points that are as far apart as possible. Start from a random point.

Greedy Algorithm – example
Pairwise distances:
     A  B  C  D  E
  A  0  1  3  6  7
  B  1  0  2  4  5
  C  3  2  0  5  2
  D  6  4  5  0  1
  E  7  5  2  1  0
A is randomly chosen first: Set = {A}.
Minimum distance from each point to the points in the set: A = 0, B = 1, C = 3, D = 6, E = 7.
E is farthest from the set, so choose E: Set = {A, E}.
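A minimal sketch of this greedy farthest-point selection in Python (the function and variable names are my own, not from the paper); it assumes a precomputed pairwise distance matrix and repeatedly adds the point whose minimum distance to the already-chosen set is largest.

import numpy as np

def greedy_select(dist, num_points, start=None, seed=0):
    # dist: (n, n) symmetric matrix of pairwise distances.
    # Starts from `start` (or a random point), then repeatedly adds the
    # point farthest from the current set, so the candidates tend to
    # come from different clusters.
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    chosen = [start if start is not None else int(rng.integers(n))]
    min_dist = dist[chosen[0]].copy()   # distance of every point to the chosen set
    for _ in range(num_points - 1):
        nxt = int(np.argmax(min_dist))  # farthest point from the current set
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return chosen

# The worked example above: start from A (index 0); E (index 4) is picked next.
D = np.array([[0, 1, 3, 6, 7],
              [1, 0, 2, 4, 5],
              [3, 2, 0, 5, 2],
              [6, 4, 5, 0, 1],
              [7, 5, 2, 1, 0]], dtype=float)
print(greedy_select(D, 2, start=0))   # -> [0, 4], i.e. {A, E}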

Iterative Phase From the Initialization Phase we obtained a set of data points, denoted M, that should contain the medoids. In this phase we find the best medoids from M: randomly pick a current set of k medoids, M_current, and replace "bad" medoids with other points from M when necessary.

Iterative Phase For the current medoids, the following is done:
–Find the dimensions related to each medoid
–Assign data points to the medoids
–Evaluate the clusters formed
–Find the bad medoid and try the result of replacing it
The procedure is repeated until a satisfactory result is obtained.

Iterative Phase – Find Dimensions For each medoid m_i, let δ_i be the distance to the nearest other medoid. All data points within distance δ_i of m_i are assigned to m_i. [figure: medoids A, B, C with a sphere of radius δ around one of them]

Iterative Phase – Find Dimensions For the points assigned to medoid m_i, calculate the average distance X_{i,j} from those points to the medoid along each dimension j.

Iterative Phase – Find Dimensions Calculate the mean Y_i and the standard deviation σ_i of the X_{i,j} over the dimensions j. Calculate Z_{i,j} = (X_{i,j} − Y_i) / σ_i. Choose the k × L most negative Z_{i,j}, with at least 2 dimensions chosen for each medoid.

Iterative Phase – Find Dimensions Example: suppose k = 3 and L = 3, so k × L = 9 dimensions are chosen in total, yielding a dimension set D_1, D_2, D_3 for each of the three medoids. [figure: the chosen dimensions for each medoid]
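A sketch of this dimension-selection step (my own naming; the two-stage selection below, 2 dimensions per medoid first and then a global fill of the most negative remaining Z values, is a simplification of the paper's greedy selection):

import numpy as np

def find_dimensions(X, l):
    # X: (k, d) array, X[i, j] = average distance (along dimension j) of the
    #    points in medoid i's locality to medoid i.
    # l: average number of dimensions per cluster, so k*l dimensions are
    #    chosen in total. Returns one dimension list per medoid.
    k, d = X.shape
    Y = X.mean(axis=1, keepdims=True)               # mean over dimensions
    sigma = X.std(axis=1, ddof=1, keepdims=True)    # std dev over dimensions
    Z = (X - Y) / sigma                             # standardized scores

    dims = [[] for _ in range(k)]
    # At least 2 dimensions per medoid: take each medoid's 2 most negative Z.
    for i in range(k):
        for j in np.argsort(Z[i])[:2]:
            dims[i].append(int(j))
    # Fill the remaining k*l - 2k picks with the globally most negative Z values.
    remaining = k * l - 2 * k
    order = np.dstack(np.unravel_index(np.argsort(Z, axis=None), Z.shape))[0]
    for i, j in order:
        if remaining <= 0:
            break
        if int(j) not in dims[i]:
            dims[i].append(int(j))
            remaining -= 1
    return dims

# Toy example: 3 medoids, 5 dimensions, l = 3.
X = np.array([[1.0, 5.0, 5.0, 1.2, 5.0],
              [5.0, 1.0, 5.0, 5.0, 1.1],
              [5.0, 5.0, 1.0, 1.1, 5.0]])
print(find_dimensions(X, 3))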

Iterative Phase – Assign Points Assign each data point to the medoid m_i for which its Manhattan segmental distance, computed over the dimension set D_i, is minimum.

Manhattan Segmental Distance The Manhattan segmental distance is defined relative to a set of dimensions. The Manhattan segmental distance between the points x_1 and x_2 for the dimension set D is defined as:
d_D(x_1, x_2) = ( Σ_{j ∈ D} |x_{1,j} − x_{2,j}| ) / |D|

Example for Manhattan Segmental Distance [figure: points x_1 and x_2 in the X-Y-Z space, with a and b their separations along X and Y] Manhattan segmental distance for the dimension set (X, Y) = (a + b) / 2
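A minimal helper for this distance and for the assignment step that uses it (names are mine, not the paper's):

import numpy as np

def manhattan_segmental(x1, x2, dims):
    # Average absolute difference between x1 and x2 over the dimensions in dims.
    dims = np.asarray(dims)
    return np.abs(x1[dims] - x2[dims]).sum() / len(dims)

def assign_points(points, medoids, dims_per_medoid):
    # Assign each point to the medoid with the smallest segmental distance,
    # measured over that medoid's own dimension set.
    labels = []
    for p in points:
        d = [manhattan_segmental(p, m, dims)
             for m, dims in zip(medoids, dims_per_medoid)]
        labels.append(int(np.argmin(d)))
    return np.array(labels)

# The figure's example: two points separated by a along X and b along Y.
x1 = np.array([0.0, 0.0, 0.0])
x2 = np.array([3.0, 1.0, 5.0])               # a = 3, b = 1 (Z is ignored)
print(manhattan_segmental(x1, x2, [0, 1]))   # -> (3 + 1) / 2 = 2.0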

Iterative Phase – Evaluate Clusters For each data point in cluster i, find the average distance Y_{i,j} to the cluster centroid along each dimension j, where j is one of the dimensions chosen for that cluster. Then calculate:
w_i = ( Σ_{j ∈ D_i} Y_{i,j} ) / |D_i|   and   E = ( Σ_i |C_i| · w_i ) / N
where |C_i| is the number of points in cluster i and N is the total number of points.

Iterative Phase – Evaluate Clusters This value is used to evaluate the clustering: the smaller the value, the better the clusters. Compare the value obtained when a bad medoid is replaced, and keep the replacement if the value is better. The bad medoid is the medoid whose cluster has the fewest points.
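A sketch of this evaluation value (my own naming; the size-weighted combination of the per-cluster averages is my reading of the description above, so treat the exact weighting as an assumption):

import numpy as np

def evaluate_clusters(points, labels, dims_per_cluster):
    # Smaller is better: size-weighted average, over clusters, of the mean
    # per-dimension spread around each cluster's centroid, restricted to the
    # cluster's own dimension set.
    total, n = 0.0, 0
    for i, dims in enumerate(dims_per_cluster):
        members = points[labels == i]
        if len(members) == 0:
            continue
        centroid = members.mean(axis=0)
        # Y[j]: average distance to the centroid along each chosen dimension j
        Y = np.abs(members[:, dims] - centroid[dims]).mean(axis=0)
        total += len(members) * Y.mean()      # |C_i| * w_i
        n += len(members)
    return total / n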

Refinement Phase Redo the process of the Iterative Phase once, using the data points as distributed by the resulting clusters rather than the localities defined by distance from the medoids. This improves the quality of the result. The Iterative Phase does not handle outliers; they are handled now.

Refinement Phase – Handle Outliers For each medoid m_i with dimension set D_i, find the smallest Manhattan segmental distance Δ_i from m_i to any of the other medoids, with respect to the set of dimensions D_i.

Refinement Phase-Handle Outliers  i is the sphere of influence of the medoid m i A data point is an outlier if it is not under any spheres of influence.

Result of PROCLUS – result accuracy

Input (actual clusters):
  Cluster   Dimensions                  Points
  A         …, 4, 7, 9, 14, 16, 17      …
  B         …, 4, 7, 12, 13, 14, 17     …
  C         …, 6, 11, 13, 14, 17, 19    …
  D         …, 7, 9, 13, 14, 16, 17     …
  E         …, 4, 9, 12, 14, 16, 17     …
  Outliers  –                           5000

Found (PROCLUS results):
  Cluster   Dimensions                  Points
  1         …, 6, 11, 13, 14, 17, 19    …
  2         …, 4, 7, 9, 14, 16, 17      …
  3         …, 4, 7, 12, 13, 14, 17     …
  4         …, 7, 9, 13, 14, 16, 17     …
  5         …, 4, 9, 12, 14, 16, 17     …
  Outliers  –                           2396

Each dimension set found by PROCLUS matches the dimension set of one of the actual input clusters.