K-Means Seminar Social Media Mining University UC3M Date May 2017


K-Means
Seminar: Social Media Mining, University UC3M, May 2017
Lecturer: Carlos Castillo http://chato.cl/
Sources:
Mohammed J. Zaki and Wagner Meira, Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, May 2014. Example 13.1. [download]
Evimaria Terzi: Data Mining course at Boston University http://www.cs.bu.edu/~evimaria/cs565-13.html

The k-means problem
consider a set X = {x1, ..., xn} of n points in R^d
assume that the number of clusters k is given
problem: find k points c1, ..., ck (called centers or means) so that the cost
cost(c1, ..., ck) = Σ_{x∈X} min_{j=1..k} ||x − cj||²
is minimized
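As a concrete reading of this objective, here is a minimal sketch of the cost function (the data and centers below are illustrative values, not from the slides):

```python
import numpy as np

def kmeans_cost(X, centers):
    """Sum over all points of the squared distance to the nearest center."""
    # dists[i, j] = squared Euclidean distance from point i to center j
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).sum()

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
centers = np.array([[0.5, 0.0], [10.0, 0.0]])
print(kmeans_cost(X, centers))  # 0.25 + 0.25 + 0.0 = 0.5
```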

The k-means problem
k=1 and k=n are easy special cases (why?)
an NP-hard problem if the dimension of the data is at least 2 (d ≥ 2)
in practice, a simple iterative algorithm works quite well

The k-means algorithm
voted among the top-10 algorithms in data mining
one way of solving the k-means problem

K-means algorithm

The k-means algorithm
1. randomly (or with another method) pick k cluster centers {c1, ..., ck}
2. for each j, set the cluster Xj to be the set of points in X that are closest to center cj
3. for each j, let cj be the center of cluster Xj (the mean of the vectors in Xj)
4. repeat (go to step 2) until convergence
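The steps above (Lloyd's algorithm) can be sketched as follows; this is a minimal illustration, not an optimized implementation:

```python
import numpy as np

def kmeans(X, k, init_centers, max_iter=100):
    """Plain k-means: init_centers is a (k, d) array of starting centers."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(max_iter):
        # Step 2: assign each point to its closest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its cluster
        # (an empty cluster keeps its old center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # Step 4: stop at convergence
            break
        centers = new_centers
    return centers, labels

X = np.array([[0.0], [2.0], [4.0], [20.0], [22.0], [24.0]])
centers, labels = kmeans(X, 2, [[2.0], [4.0]])
print(centers.ravel(), labels)  # [ 2. 22.] [0 0 0 1 1 1]
```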

Sample execution

1-dimensional clustering exercise
For the data in the figure:
run k-means with k=2 and initial centroids u1=2, u2=4 (verify: the final centroids are 18 units apart)
try with k=3 and initialization 2, 3, 30
http://www.dataminingbook.info/pmwiki.php/Main/BookDownload Exercise 13.1
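The figure's data is not reproduced in this transcript, but the mechanics of the exercise can be tried on hypothetical 1-D values (the points below are illustrative, not Exercise 13.1's actual data):

```python
points = [0.0, 2.0, 4.0, 20.0, 22.0, 24.0]   # hypothetical 1-D data
c = [2.0, 4.0]                               # initial centroids u1=2, u2=4

while True:
    # assign each point to the nearest centroid (ties go to the first)
    clusters = [[], []]
    for x in points:
        clusters[0 if abs(x - c[0]) <= abs(x - c[1]) else 1].append(x)
    # recompute each centroid as its cluster mean
    # (assumes no cluster ever becomes empty, which holds for this data)
    new_c = [sum(cl) / len(cl) for cl in clusters]
    if new_c == c:
        break
    c = new_c

print(c)  # [2.0, 22.0]
```

With these illustrative values the final centroids end up 20 units apart; the exercise's actual data should give centroids 18 units apart.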

Limitations of k-means
clusters of different size
clusters of different density
clusters of non-globular shape
sensitive to initialization

Limitations of k-means: different sizes

Limitations of k-means: different density

Limitations of k-means: non-spherical shapes

Effects of bad initialization

k-means algorithm
finds a local optimum
often converges quickly, but not always
the choice of initial points can have a large influence on the result
tends to find spherical clusters
outliers can cause a problem
different densities may cause a problem

Advanced: k-means initialization

Initialization
random initialization
random, but repeat many times and take the best solution: helps, but the solution can still be bad
pick points that are distant from each other: k-means++, with provable guarantees

k-means++
David Arthur and Sergei Vassilvitskii
k-means++: The advantages of careful seeding, SODA 2007

k-means algorithm: random initialization

k-means algorithm: initialization with furthest-first traversal
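Furthest-first traversal can be sketched as follows: pick one center at random, then repeatedly add the point furthest from all centers chosen so far (a minimal illustration):

```python
import numpy as np

def furthest_first(X, k, seed=0):
    """Furthest-first traversal initialization for k-means."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # first center: a random point
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen center
        d = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2)
        centers.append(X[np.argmax(d.min(axis=1))])  # take the furthest point
    return np.array(centers)

# two well-separated groups: the two chosen centers land in different groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers = furthest_first(X, 2)
```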

but... sensitive to outliers

Here random may work well

k-means++ algorithm
interpolate between the two methods:
let D(x) be the distance between x and the nearest center selected so far
choose the next center with probability proportional to (D(x))^a = D^a(x)
a = 0: random initialization
a = ∞: furthest-first traversal
a = 2: k-means++

k-means++ algorithm
initialization phase:
choose the first center uniformly at random
choose each next center with probability proportional to D²(x)
iteration phase:
iterate as in the k-means algorithm until convergence
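The seeding phase can be sketched as follows (a minimal illustration of D²-weighted sampling, not the paper's reference implementation):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """k-means++ seeding: first center uniform at random, then each next
    center sampled with probability proportional to D(x)^2."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # D^2(x): squared distance to the nearest already-chosen center
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2) \
                 .sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers = kmeanspp_init(X, 2)
```

After seeding, the centers are fed to the ordinary k-means iteration. Points far from all chosen centers have large D²(x) and are therefore likely, but not certain, to be picked next; this randomness is what keeps a single outlier from always being selected.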

k-means++ initialization

k-means++ result

k-means++ provable guarantee
the approximation guarantee (an expected O(log k) approximation) comes just from the first iteration (the initialization)
subsequent iterations can only improve the cost

Lesson learned
no reason to use k-means and not k-means++
k-means++:
easy to implement
provable guarantee
works well in practice

k-means-- (a variant of k-means that handles outliers): Algorithm 4.1 in [Chawla & Gionis, SDM 2013]