K-means++ and K-means Parallel (K-means||). Jun Wang

Review of K-means. Simple and fast: choose k centers randomly, assign each point to its nearest center, update the centers, and repeat until convergence.
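A minimal sketch of this loop (Lloyd's algorithm) in Python with NumPy; the function name, the optional init parameter, and the fixed iteration count are my own choices, not from the slides:

import numpy as np

def kmeans(X, k, n_iter=100, init=None, rng=np.random.default_rng(0)):
    """Plain K-means (Lloyd's algorithm). X is an (n, d) array of points."""
    if init is None:
        # Random seeding: pick k distinct data points as the initial centers.
        init = X[rng.choice(len(X), size=k, replace=False)]
    centers = np.array(init, dtype=float)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels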

K-means++. You are probably already using it: it is the default seeding in libraries such as scikit-learn. The idea is to spend some extra time choosing the k initial centers (seeding) in order to save time during clustering.

K-means++ algorithm. Seeding: choose the first center from X uniformly at random; then, k-1 times, sample one new center from X with probability p (defined below) and update the center matrix. Clustering: run standard K-means from these centers.

d_i^2 = min_j ||x_i - c_j||^2, the squared Euclidean distance between x_i and its nearest already-chosen center c_j.

How to choose the k centers

Choose a point from X randomly

Calculate all d_i^2

Calculate P_i: let D = d_1^2 + d_2^2 + ... + d_n^2 and P_i = d_i^2 / D, so that sum_i P_i = 1. Points farther away from the red point (the current center) have a better chance of being chosen. For example, if three points have d^2 = 1, 4, and 5, then D = 10 and the selection probabilities are 0.1, 0.4, and 0.5.

Pick the next center: point x_i is chosen with probability P_i.

Repeat until k centers are found: update the center matrix, recalculate the d_i^2, recalculate the P_i.
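In code, the whole seeding loop might look like this minimal NumPy sketch (the function name is mine; this follows the textbook k-means++ procedure rather than code from the slides):

import numpy as np

def kmeans_pp_seed(X, k, rng=np.random.default_rng(0)):
    """K-means++ seeding: sample each new center with P_i = d_i^2 / D."""
    n = len(X)
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    for _ in range(k - 1):
        # d_i^2: squared distance from each point to its nearest chosen center.
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2), axis=1)
        p = d2 / d2.sum()           # P_i = d_i^2 / D
        centers.append(X[rng.choice(n, p=p)])
    return np.array(centers)

With the kmeans sketch above, the full pipeline would be kmeans(X, k, init=kmeans_pp_seed(X, k)).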

K-means|| algorithm. Seeding: choose a small subset C from X (oversampling), assign a weight to each point in C, and cluster the weighted C to get the k centers. Clustering: run standard K-means from these centers.

Choosing the subset C from X. Let D = sum of squared distances = d_1^2 + d_2^2 + ... + d_n^2, and let L be an oversampling factor f(k), e.g. 0.2k or 1.5k. Then, for ln(D) rounds: pick each point of X independently with a Bernoulli trial, P(chosen) = L * d_i^2 / D, and update C.
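A sketch of this oversampling loop in NumPy (names are mine; I cap the per-point probability at 1, as the original K-means|| paper by Bahmani et al. does, which the slide's formula leaves implicit):

import numpy as np

def kmeans_parallel_oversample(X, L, rng=np.random.default_rng(0)):
    """K-means|| oversampling step: build the candidate set C."""
    n = len(X)
    C = [X[rng.integers(n)]]                    # start from one random point
    d2 = ((X - C[0]) ** 2).sum(axis=1)          # d_i^2 w.r.t. the current C
    for _ in range(max(1, int(np.log(d2.sum())))):  # about ln(D) rounds
        D = d2.sum()
        if D == 0:                              # every point is already a center
            break
        # One independent Bernoulli trial per point: P(chosen) = min(1, L*d_i^2/D).
        chosen = rng.random(n) < np.minimum(1.0, L * d2 / D)
        C.extend(X[chosen])
        # Recompute d_i^2 against the enlarged C.
        d2 = np.min(((X[:, None, :] - np.array(C)[None, :, :]) ** 2).sum(axis=2), axis=1)
    return np.array(C)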

How many points end up in C?

There are ln(D) iterations. In each iteration the expected number of points picked is 1*P_1 + 1*P_2 + ... + 1*P_n = sum_i L * d_i^2 / D = L. So C is expected to hold about ln(D) * L points in total.

Cluster the subset C. In the figure, the red points are the points selected into subset C.

Clustering the sample C (weighting): take a point A in C, calculate the distances from A to the other points in C, and find the smallest one; in this case it is d_c1.

Then calculate the distances between point A and all points in X, obtaining d_x_i for each point x_i.

Compare each d_x_i to d_c1 and let W_A = the number of points with d_x_i < d_c1, i.e. the number of points of X that lie closer to A than A's nearest neighbour in C. Doing this for every point of C gives the weight matrix W. Finally, cluster the weighted C into k clusters to get the k centers.
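The sketch below uses the common variant of this weighting (each member of C is weighted by how many points of X have it as their nearest member of C, which is what the slides' counting procedure approximates) and then reduces C to k centers with a weighted k-means++-style pass; all names are mine:

import numpy as np

def weight_and_reduce(X, C, k, rng=np.random.default_rng(0)):
    """Weight each member of C by how many points of X are nearest to it,
    then reduce C to k centers with a weighted k-means++-style pass."""
    # Nearest member of C for every point of X.
    nearest = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    w = np.bincount(nearest, minlength=len(C)).astype(float)  # the weights W
    # Weighted seeding on C: pick with probability proportional to w_i * d_i^2.
    centers = [C[rng.choice(len(C), p=w / w.sum())]]
    for _ in range(k - 1):
        d2 = np.min(((C[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2), axis=1)
        p = w * d2
        centers.append(C[rng.choice(len(C), p=p / p.sum())])
    return np.array(centers)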

Difference among the three methods. Seeding: K-means chooses k centers randomly; K-means++ chooses k centers proportionally to d_i^2; K-means|| chooses a subset C and gets the k centers from C. Clustering: the same standard K-means loop in all three.

Hypothesis. Seeding: K-means chooses k centers randomly (fast); K-means++ chooses k centers proportionally (slow); K-means|| chooses a subset C and gets k centers from C (slower). Clustering: K-means is slow, K-means++ is fast, K-means|| is faster.

Testing the hypothesis on four data sets: toy data one (very small), Cloud data (small), Spam data (moderate), toy data two (very large).

Simple data set: N=100, d=2, k=2, 100 iterations.
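A hypothetical driver for this toy experiment, timing random seeding against k-means++ seeding using the kmeans and kmeans_pp_seed sketches defined earlier (the data generation and timing scaffolding are my own, not the slides'):

import time
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian blobs: N=100 points, d=2, k=2, as on the slide.
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(+2.0, 1.0, size=(50, 2))])

t0 = time.perf_counter()
kmeans(X, k=2)                                  # random seeding
t1 = time.perf_counter()
kmeans(X, k=2, init=kmeans_pp_seed(X, k=2))     # k-means++ seeding
t2 = time.perf_counter()
print(f"random: {t1 - t0:.4f}s, k-means++: {t2 - t1:.4f}s")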

Execution time

The Cloud data set consists of 1024 points in 10 dimensions; k=6.

Execution time (in seconds)

Total scatter

The Spambase data set represents features available to a spam detection system; it consists of 4601 points in 58 dimensions; k=10.

Execution time

Total scatter

Complex data set: N=20,000, d=40, k=40.

Execution time

Clustering result plotted with the true labels

Clustering result plotted with the computed labels

Summary. Small data: K-means is fast, K-means++ very fast, K-means|| slow. Moderate to large data: K-means is slow, K-means++ very slow, K-means|| fast.

Selecting L: the choice does not matter much when the data set is small; it should be tried on large data sets.

Questions