
Given k, k-means clustering is implemented in 4 steps, assuming the clustering criterion is to maximize intra-cluster similarity and minimize inter-cluster similarity. A heuristic is used (the method isn't truly optimal):

1. Partition the points to be clustered into k subsets (or pick k initial mean points and create the initial k subsets by putting each other point in the cluster with the closest mean).
2. Compute the mean of each cluster of the current partition. Assign each non-mean point to the cluster with the most similar (closest) mean.
3. Go back to Step 2 (compute the means of the new clusters).
4. Stop when the new set of means doesn't change [much] (or use some other stopping condition?).

The cost is O(tkn), where n = # of objects, k = # of clusters, and t = # of iterations. Normally k, t << n, so this is close to O(n); a minimal code sketch of the loop is given below.

Suggestion: use pTrees. It should be possible to calculate all distances between the sample and the centroids at once (and to parallelize that?) using the L1 distance (other distances? other correlations?). It may also pay off to process multiple test points at a time, or all non-mean points at one time, so that there is just one pass across the pTrees, and to parallelize this. In almost all cases it is possible to re-compute the centroids in one pTree pass (see Yue Cui's thesis in the departmental or University library). Also, with multilevel pTrees, it should be possible to tell when one can exit early using only the top level, the top 2 levels, the top 3 levels, etc. Does something like this work for k-means image pixel classification too? (There are many YouTube videos describing k-means image classification.) Does k-means work for Netflix data as well, to predict r(U,M)? E.g., in user-vote.C, for every V in supM, use k-means to vote according to the strongest class (largest? largest with the vote weighted according to the count gap with the others? no loops?).
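As a concrete reference for the four steps above, here is a minimal NumPy sketch of the plain (horizontal-data) k-means loop. It is a sketch only: the names (kmeans, X, k, max_iters, tol) are illustrative rather than from the slides, and it uses squared Euclidean distance instead of the L1 distance suggested for the pTree variant.

import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial means (k distinct rows of X).
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):                       # Step 3: repeat from Step 2
        # Step 2: assign every point to the cluster with the closest mean.
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute the mean of each cluster of the current partition
        # (keep the old mean if a cluster happens to be empty).
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        # Step 4: stop when the set of means doesn't change (much).
        if np.abs(new_means - means).max() < tol:
            means = new_means
            break
        means = new_means
    return means, labels

With k, t << n this runs in O(tkn) time, matching the cost estimate above.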

R(A1, A2, A3, A4): Example of HDkM (Horizontal Data k-Means) showing looping and Partial Distance early exit. Pick initial means m1 and m2; loop over rows first, then over columns. Calculate the first distance in full; in the second distance calculation, as soon as the accumulated distance exceeds the minimum distance so far (always the first distance), exit the column loop. In this example the first three rows have d2(x, m1) = 91, 90, 99 versus d2(x, m2) = 17, 2, 37, so all three go to C2. For the remaining rows, d2(x, m1) = 20 and 30, and the full column loop for m2 is not needed: (2-3)^2 = 1 plus (7-2)^2 = 25 already exceeds 20, so that row goes to C1; (2-2)^2 = 0, (7-2)^2 = 25 and (6-1)^2 = 25 accumulate to 50 > 30, so that row also goes to C1, with an early exit of the column loop. This initial re-clustering yields C1 and C2 and the new means m21 and m22.
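A minimal sketch of the Partial Distance early exit used above, assuming points and means are NumPy vectors (the helper name assign_with_partial_distance is illustrative, not from the slides): the distance to the first mean is computed in full, and every later distance is accumulated dimension by dimension, abandoning the column loop as soon as the running total exceeds the best distance found so far.

import numpy as np

def assign_with_partial_distance(x, means):
    # First distance is computed in full; it is the minimum so far.
    best_j = 0
    best_d2 = float(((x - means[0]) ** 2).sum())
    for j in range(1, len(means)):
        acc = 0.0
        for a in range(len(x)):                  # column loop
            acc += (x[a] - means[j][a]) ** 2
            if acc >= best_d2:                   # partial distance already too large
                break                            # early exit of the column loop
        else:                                    # column loop completed: closer mean found
            best_j, best_d2 = j, acc
    return best_j, best_d2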

Second epoch: R(A1, A2, A3, A4). Re-cluster C1 and C2 against the new means m21 and m22, computing d2(x, m21) (values 79, 80, 65, 67, 4, 9, 6, 6 across the rows) and d2(x, m22) for every row. The resulting clustering is the same as the previous one, so the process has completely converged and we are done.
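A minimal sketch of the stopping test used in this second epoch, assuming the cluster assignments of each epoch are kept as NumPy integer arrays (the helper name has_converged is illustrative): the process has converged when the new assignment is identical to the previous epoch's assignment.

import numpy as np

def has_converged(prev_labels, new_labels):
    # Converged when the clustering is the same as in the previous epoch.
    return prev_labels is not None and np.array_equal(prev_labels, new_labels)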

One pTree calculation per epoch? This should be able to be done for all the samples at once, which would make k-means clustering fast (and k-means image classification, which is what TreeMiner is doing for DoD).

Horizontal-Data k-Means (HDkM) is approximately O(n), n = # of points to be clustered, so for very large datasets (e.g., billions or trillions of points) it is far too slow to handle all the data coming at us. Plan:
1. Use EIN theory to create k pTree masks (i.e., partially cluster the data into k clusters), one for each set of points closest in all dimensions to a cluster representer. Then finish the other points with HDkM (see the example above).
2. Try the Partial Distance variation to improve speed (further reduce the leftover points to be clustered by HDkM).
3. Fully cluster using pTrees?
4. Parallelize 1-3 above.

Dr. Fei Pan's theorem (pg 39): Let A be the j-th column of a data table, let m+1 be the number of bits in A, and let P_{j,m}, ..., P_{j,0} be the basic pTrees of A. For any constant c = (b_m ... b_0)_2,

  P_{A>c} = P_{j,m} o_m ( P_{j,m-1} o_{m-1} ( ... ( P_{j,k+1} o_{k+1} P_{j,k} ) ... ) ),

where
1. o_i is the AND (∧) operation if b_i = 1 and the OR (∨) operation if b_i = 0,
2. k is the rightmost bit position with bit value 0, and
3. the operators are right-binding.

By De Morgan's laws (the complement of an AND is the OR of the complements, and the complement of an OR is the AND of the complements),

  P_{A≤c} = (P_{A>c})' = ( P_m o_m ( P_{m-1} o_{m-1} ... ( P_{k+1} o_{k+1} P_k ) ... ) )'
          = P'_m o'_m ( P_{m-1} o_{m-1} ... ( P_{k+1} o_{k+1} P_k ) ... )'
          = P'_m o'_m ( P'_{m-1} o'_{m-1} ... ( P'_{k+1} o'_{k+1} P'_k ) ... ),

where o'_i denotes the complementary operation (AND becomes OR and OR becomes AND).

Apply this to k-means classification of images (e.g., land use); see the YouTube videos on k-means image classification. A code sketch of the P_{A>c} formula is given below.
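To make the theorem concrete, here is a small sketch of the P_{A>c} formula, assuming each basic pTree is held uncompressed as a NumPy boolean array with the most significant slice first; the function name ptree_greater_than and this flat representation are illustrative assumptions, not the actual (compressed, multilevel) pTree structure.

import numpy as np

def ptree_greater_than(bit_slices, c):
    """bit_slices: list of m+1 boolean arrays P_m ... P_0, most significant first.
       Returns the mask P_{A>c} for the constant c = (b_m ... b_0)_2."""
    m = len(bit_slices) - 1
    if all((c >> i) & 1 for i in range(m + 1)):
        # c is all 1s within m+1 bits, so no value in the column exceeds it.
        return np.zeros_like(bit_slices[0], dtype=bool)
    # k = rightmost bit position of c holding a 0.
    k = next(i for i in range(m + 1) if not (c >> i) & 1)
    result = bit_slices[m - k]                    # innermost term is P_k
    for i in range(k + 1, m + 1):                 # fold outward (right-binding)
        P_i = bit_slices[m - i]
        # AND where the bit of c is 1, OR where it is 0.
        result = (P_i & result) if (c >> i) & 1 else (P_i | result)
    return result

# P_{A<=c} is then simply the complement, per the De Morgan derivation above:
# mask_le = ~ptree_greater_than(bit_slices, c)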