Slide 1 EE3J2 Data Mining Lecture 11: K-Means Clustering Martin Russell

Slide 2 EE3J2 Data Mining Objectives
• To explain the need for K-means clustering
• To understand the K-means clustering algorithm

Slide 3 EE3J2 Data Mining Clustering so far
• Agglomerative clustering
– Begin by assuming that every data point is a separate centroid
– Combine closest centroids until the desired number of clusters is reached
– See agglom.c on the course website
• Divisive clustering
– Begin by assuming that there is just one centroid/cluster
– Split clusters until the desired number of clusters is reached

Slide 4 EE3J2 Data Mining Optimality
• Neither agglomerative clustering nor divisive clustering is optimal
• In other words, the set of centroids which they give is not guaranteed to minimise distortion
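The slides rely on the distortion measure introduced earlier in the module. For reference, a standard definition (an assumption here, chosen to match the notation on the later slides, with $d$ the underlying distance measure) is:

$$\mathrm{Dist}(C) \;=\; \sum_{t=1}^{T} \min_{1 \le m \le K} d(y_t, c_m)$$

With squared Euclidean distance for $d$, a k-means iteration cannot increase this quantity.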

Slide 5 EE3J2 Data Mining Optimality continued
• For example:
– In agglomerative clustering, a dense cluster of data points will be combined into a single centroid
– But to minimise distortion, several centroids are needed in a region where there are many data points
– A single ‘outlier’ may get its own cluster
• Agglomerative clustering provides a useful starting point, but further refinement is needed

Slide 6 EE3J2 Data Mining 12 centroids [figure only]

Slide 7 EE3J2 Data Mining K-means Clustering
• Suppose that we have decided how many centroids we need - denote this number by K
• Suppose that we have an initial estimate of suitable positions for our K centroids
• K-means clustering is an iterative procedure for moving these centroids to reduce distortion

Slide 8 EE3J2 Data Mining K-means clustering - notation
• Suppose there are T data points, denoted by $y_1, y_2, \ldots, y_T$
• Suppose that the initial K clusters are denoted by $C^0 = \{c^0_1, c^0_2, \ldots, c^0_K\}$
• One iteration of K-means clustering will produce a new set of clusters $C^1 = \{c^1_1, c^1_2, \ldots, c^1_K\}$ such that $\mathrm{Dist}(C^1) \le \mathrm{Dist}(C^0)$

Slide 9 EE3J2 Data Mining K-means clustering (1)
• For each data point $y_t$, let $c_{i(t)}$ be the closest centroid
• In other words: $d(y_t, c_{i(t)}) = \min_m d(y_t, c_m)$
• Now, for each centroid $c^0_k$, define $Y^0_k = \{\, y_t : i(t) = k \,\}$
• In other words, $Y^0_k$ is the set of data points which are closer to $c^0_k$ than to any other cluster
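As a concrete illustration, here is a minimal C sketch of this assignment step. All names (sq_dist, assign_points) and the data layout (2-D points in parallel x/y arrays, squared Euclidean distance) are hypothetical assumptions for the sketch, not the code from the course's agglom.c or k-means.c:

/* Assignment step sketch: for each data point y_t, record the index
 * i(t) of the closest centroid. Hypothetical names; points and
 * centroids are 2-D, stored in parallel x/y arrays. */
static double sq_dist(double px, double py, double qx, double qy)
{
    double dx = px - qx, dy = py - qy;
    return dx * dx + dy * dy;
}

void assign_points(int T, const double x[], const double y[],
                   int K, const double cx[], const double cy[],
                   int closest[])               /* closest[t] = i(t) */
{
    for (int t = 0; t < T; t++) {
        int best = 0;
        double bestD = sq_dist(x[t], y[t], cx[0], cy[0]);
        for (int m = 1; m < K; m++) {
            double d = sq_dist(x[t], y[t], cx[m], cy[m]);
            if (d < bestD) { bestD = d; best = m; }
        }
        closest[t] = best;        /* y_t is assigned to the set Y_best */
    }
}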

Slide 10 EE3J2 Data Mining K-means clustering (2)
• Now define a new k-th centroid $c^1_k$ by:

$$c^1_k = \frac{1}{|Y^0_k|} \sum_{y_t \in Y^0_k} y_t$$

where $|Y^0_k|$ is the number of samples in $Y^0_k$
• In other words, $c^1_k$ is the average value of the samples which were closest to $c^0_k$
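Continuing the same hypothetical sketch, the update step moves each centroid to the mean of the points assigned to it; a centroid whose $Y^0_k$ is empty is left unchanged here, which is one common convention:

/* Update step sketch: replace each centroid c_k with the mean of the
 * points in Y_k. If |Y_k| = 0 the centroid is left where it is. */
void update_centroids(int T, const double x[], const double y[],
                      const int closest[],
                      int K, double cx[], double cy[])
{
    for (int k = 0; k < K; k++) {
        double sx = 0.0, sy = 0.0;
        int n = 0;                               /* n = |Y_k| */
        for (int t = 0; t < T; t++) {
            if (closest[t] == k) { sx += x[t]; sy += y[t]; n++; }
        }
        if (n > 0) {
            cx[k] = sx / n;                      /* c_k = average of Y_k */
            cy[k] = sy / n;
        }
    }
}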

Slide 11 EE3J2 Data Mining K-means clustering (3)
• Now repeat the same process, starting with the new centroids $C^1 = \{c^1_1, \ldots, c^1_K\}$, to create a new set of centroids $C^2 = \{c^2_1, \ldots, c^2_K\}$ … and so on until the process converges
• Each new set of centroids has smaller distortion than the previous set
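Putting the two sketches together, one possible top-level loop repeats the assignment and update steps until the distortion stops decreasing (k-means.c on the website instead runs a fixed 10 iterations; the stopping rule below is an illustrative assumption):

#include <float.h>

/* Total distortion: sum over all points of the squared distance to the
 * closest centroid. Uses sq_dist from the earlier sketch. */
double distortion(int T, const double x[], const double y[],
                  int K, const double cx[], const double cy[])
{
    double total = 0.0;
    for (int t = 0; t < T; t++) {
        double best = sq_dist(x[t], y[t], cx[0], cy[0]);
        for (int m = 1; m < K; m++) {
            double d = sq_dist(x[t], y[t], cx[m], cy[m]);
            if (d < best) best = d;
        }
        total += best;
    }
    return total;
}

void k_means(int T, const double x[], const double y[],
             int K, double cx[], double cy[], int closest[],
             int maxIter)
{
    double prev = DBL_MAX;
    for (int iter = 0; iter < maxIter; iter++) {
        assign_points(T, x, y, K, cx, cy, closest);      /* step 1 */
        update_centroids(T, x, y, closest, K, cx, cy);   /* step 2 */
        double d = distortion(T, x, y, K, cx, cy);
        if (d >= prev) break;       /* converged: distortion no longer falls */
        prev = d;
    }
}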

Slide 12 EE3J2 Data Mining Initialisation
• An outstanding problem is to choose the initial cluster set $C^0$
• Possibilities include:
– Choose $C^0$ randomly (a sketch of this follows below)
– Choose $C^0$ using agglomerative clustering
– Choose $C^0$ using divisive clustering
• Choice of $C^0$ can be important
– K-means clustering is a “hill-climbing” algorithm
– It only finds a local minimum of the distortion function
– This local minimum is determined by $C^0$
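As a sketch of the first possibility, a random initialisation can simply take K randomly chosen data points as the starting centroids (hypothetical helper; the chosen points are not guaranteed distinct here):

#include <stdlib.h>

/* Random initialisation sketch: use K randomly chosen data points as
 * the initial centroids C^0 (repeated picks are possible here). */
void init_random(int T, const double x[], const double y[],
                 int K, double cx[], double cy[])
{
    for (int k = 0; k < K; k++) {
        int t = rand() % T;
        cx[k] = x[t];
        cy[k] = y[t];
    }
}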

Slide 13 EE3J2 Data Mining Local optimality
[Figure: distortion plotted against the cluster set, showing the starting value Dist(C^0), the sequence C^0, C^1, …, C^n descending into a local minimum, and a separate, lower global minimum]
N.B.: I’ve drawn the cluster set space as 1-dimensional for simplicity. In reality it is a very high-dimensional space.

Slide 14 EE3J2 Data Mining Example [figure only]

Slide 15 EE3J2 Data Mining Example - distortion [figure only]

Slide 16 EE3J2 Data Mining C programs on website
• agglom.c
– Agglomerative clustering
– Usage: agglom dataFile centFile numCent
– Runs agglomerative clustering on the data in dataFile until the number of centroids is numCent. Writes the centroid (x,y) coordinates to centFile

Slide 17 EE3J2 Data Mining C programs on website
• k-means.c
– K-means clustering
– Usage: k-means dataFile centFile opFile
– Runs 10 iterations of k-means clustering on the data in dataFile, starting with the centroids in centFile
– After each iteration, writes the distortion and the new centroids to opFile
• sa1.txt
– This is the data file that I have used in the demonstrations

Slide 18 EE3J2 Data Mining Summary
• The need for k-means clustering
• The k-means clustering algorithm
• Example of k-means clustering