EE3J2 Data Mining, Lecture 13: K-Means Clustering. Martin Russell.

Slide 1: EE3J2 Data Mining, Lecture 13: K-Means Clustering. Martin Russell

Slide 2: Objectives
- To explain the need for K-means clustering
- To understand the K-means clustering algorithm
- To understand the relationships between:
  - Clustering and statistical modelling using GMMs
  - K-means clustering and E-M estimation for GMMs

Slide 3: Clustering so far
- Agglomerative clustering
  - Begin by assuming that every data point is a separate centroid
  - Combine the closest centroids until the desired number of clusters is reached
  - See agglom.c on the course website
- Divisive clustering
  - Begin by assuming that there is just one centroid/cluster
  - Split clusters until the desired number of clusters is reached

Slide 4: Optimality
- Neither agglomerative clustering nor divisive clustering is optimal
- In other words, the set of centroids they give is not guaranteed to minimise distortion

Slide 5: Optimality continued
- For example:
  - In agglomerative clustering, a dense cluster of data points will be combined into a single centroid
  - But to minimise distortion, we need several centroids in a region where there are many data points
  - A single 'outlier' may get its own cluster
- Agglomerative clustering provides a useful starting point, but further refinement is needed

Slide 6: 12 centroids [figure]

Slide 7: K-means clustering
- Suppose that we have decided how many centroids we need; denote this number by K
- Suppose that we have an initial estimate of suitable positions for our K centroids
- K-means clustering is an iterative procedure for moving these centroids to reduce distortion

Slide 8: K-means clustering - notation
- Suppose there are T data points, denoted by Y = { y_1, y_2, ..., y_T }
- Suppose that the initial K clusters are denoted by C^0 = { c^0_1, c^0_2, ..., c^0_K }
- One iteration of K-means clustering will produce a new set of clusters C^1 = { c^1_1, c^1_2, ..., c^1_K } such that Dist(C^1) <= Dist(C^0)

Slide 9: K-means clustering (1)
- For each data point y_t, let c_i(t) be the closest centroid
- In other words: d(y_t, c_i(t)) = min_m d(y_t, c_m)
- Now, for each centroid c^0_k, define Y^0_k = { y_t : i(t) = k }
- In other words, Y^0_k is the set of data points which are closer to c^0_k than to any other centroid

Slide 10: K-means clustering (2)
- Now define a new k-th centroid c^1_k by: c^1_k = (1 / |Y^0_k|) * sum of y_t over y_t in Y^0_k, where |Y^0_k| is the number of samples in Y^0_k
- In other words, c^1_k is the average value of the samples which were closest to c^0_k

Slide 11: K-means clustering (3)
- Now repeat the same process starting with the new centroids C^1 to create a new set of centroids C^2, and so on until the process converges
- Each new set of centroids has distortion no greater than the previous set's (strictly smaller until convergence)

Slide 12: Initialisation
- An outstanding problem is how to choose the initial cluster set C^0
- Possibilities include:
  - Choose C^0 randomly
  - Choose C^0 using agglomerative clustering
  - Choose C^0 using divisive clustering
- The choice of C^0 can be important:
  - K-means clustering is a "hill-climbing" algorithm
  - It only finds a local minimum of the distortion function
  - Which local minimum it finds is determined by C^0

Slide 13: Local optimality [figure: distortion Dist(C) plotted against the cluster set, showing the sequence C^0, C^1, ..., C^n descending from Dist(C^0) into a local minimum, with the global minimum elsewhere. N.B.: the cluster-set space is drawn as 1-dimensional for simplicity; in reality it is a very high-dimensional space]

Slide 14: Example [figure]

Slide 15: Example - distortion [figure]

Slide 16: C programs on website
- agglom.c: agglomerative clustering
- Usage: agglom dataFile centFile numCent
- Runs agglomerative clustering on the data in dataFile until the number of centroids is numCent, then writes the centroid (x, y) coordinates to centFile

Slide 17: C programs on website
- k-means.c: K-means clustering
- Usage: k-means dataFile centFile opFile
- Runs 10 iterations of k-means clustering on the data in dataFile, starting with the centroids in centFile
- After each iteration, writes the distortion and the new centroids to opFile
- sa1.txt: the data file used in the demonstrations

Slide 18: Relationship with GMMs
- The set of centroids in clustering corresponds to the set of means in a GMM
- Measuring distances using the Euclidean distance in clustering corresponds to assuming that the GMM variances are all equal to 1
- K-means clustering corresponds to the mean-estimation part of the E-M algorithm, but:
  - In K-means, samples are allocated 100% to the closest centroid
  - In E-M, samples are shared between GMM components according to posterior probabilities

Slide 19: K-means clustering - example [figure: data points and the initial centroids, Centroids (0)]

Slide 20: First iteration of k-means [table: for each data point x(n), the distances d(x(n), c(1)), d(x(n), c(2)), d(x(n), c(3)), the closest centroid, and column totals; the numerical entries did not survive transcription. Distortion(0) = 15.52]

Slide 21: First iteration of k-means [table: as on slide 20, extended with the (x, y) sums used to re-estimate each centroid; the numerical entries did not survive transcription. Distortion(0) = 15.52]

Slide 22: First iteration of k-means [figure: data with Centroids (0) and Centroids (1)]

Slide 23: Second iteration of k-means [table: distances to centroids and closest centroid for each data point, computed from Centroids (1); the numerical entries did not survive transcription]

Slide 24: Second iteration of k-means [figure: data with Centroids (1) and Centroids (2)]

Slide 25: [figure]

Slide 26: Summary
- The need for K-means clustering
- The K-means clustering algorithm
- Example of K-means clustering