Clustering Hongfei Yan School of EECS, Peking University 7/8/2009 Adapted from Aaron Kimball's slides.




Google News
They didn't pick all 3,400,217 related articles by hand…
Or Amazon.com
Or Netflix…

Other less glamorous things…
Hospital Records
Scientific Imaging
–Related genes, related stars, related sequences
Market Research
–Segmenting markets, product positioning
Social Network Analysis
Data Mining
Image Segmentation…

The Distance Measure
How the similarity of two elements in a set is determined, e.g.:
–Euclidean Distance
–Manhattan Distance
–Inner Product Space
–Maximum Norm
–Or any metric you define over the space…
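As a concrete illustration (not part of the original slides), the first, second, and fourth of these measures can be sketched in a few lines of Python:

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of the absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def max_norm(a, b):
    # Maximum (Chebyshev) norm: the largest single-coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))
```

Any function satisfying the metric axioms can be substituted without changing the clustering algorithms that follow.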

Types of Algorithms
Hierarchical Clustering vs. Partitional Clustering

Hierarchical Clustering
Builds or breaks up a hierarchy of clusters: agglomerative methods build it bottom-up by merging clusters, divisive methods break it top-down by splitting them.

Partitional Clustering
Partitions the set into all clusters simultaneously.

K-Means Clustering
A simple partitional clustering algorithm:
–Choose the number of clusters, k
–Choose k points to be the initial cluster centers
–Then…

K-Means Clustering
iterate {
  Compute the distance from every point to each of the k centers
  Assign each point to its nearest center
  Compute the mean of the points assigned to each center
  Replace each center with the mean of its assigned points
}
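The loop above can be sketched as a minimal (single-machine, illustrative) Python function; the squared Euclidean distance used here is one choice among the metrics listed earlier:

```python
def kmeans(points, centers, iterations=10):
    # Each pass: assign every point to its nearest center, then
    # replace each center with the mean of its assigned points.
    for _ in range(iterations):
        clusters = {i: [] for i in range(len(centers))}
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Keep a center unchanged if no points were assigned to it.
        centers = [tuple(sum(c) / len(c) for c in zip(*pts)) if pts else centers[i]
                   for i, pts in clusters.items()]
    return centers
```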

But!
The complexity is pretty high:
–k · n · O(distance metric) · num(iterations)
Moreover, it can be necessary to send tons of data to each Mapper node. Depending on your bandwidth and available memory, this could be impossible.

Furthermore
There are three big ways a data set can be large:
–There are a large number of elements in the set.
–Each element can have many features.
–There can be many clusters to discover.
Conclusion – the clustering problem can be huge, even when you distribute it.

Canopy Clustering
A preliminary step to help parallelize computation.
Clusters the data into overlapping canopies using a super-cheap distance metric.
–Efficient
–Accurate

Canopy Clustering
while there are unmarked points {
  pick a point which is not strongly marked; call it a canopy center
  mark all points within some threshold of it as in its canopy
  strongly mark all points within some tighter threshold
}
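A minimal Python sketch of this loop (an illustration, not the slides' implementation), with the loose threshold `t1` and the tight "strongly marked" threshold `t2`, where t1 > t2:

```python
def canopy_cluster(points, t1, t2, cheap_dist):
    # t1 (loose): points within it join the canopy.
    # t2 (tight): points within it are strongly marked and
    # can never seed a new canopy.
    canopies = []
    remaining = list(points)
    while remaining:
        center = remaining.pop(0)          # an unmarked point becomes a canopy center
        canopy = [center]
        still_unmarked = []
        for p in remaining:
            d = cheap_dist(center, p)
            if d < t1:
                canopy.append(p)           # in this canopy (may be in others too)
            if d >= t2:
                still_unmarked.append(p)   # not strongly marked: may seed later
        remaining = still_unmarked
        canopies.append((center, canopy))
    return canopies
```

Because a point outside `t2` but inside `t1` stays available as a future center, canopies overlap.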

After the canopy clustering…
Resume hierarchical or partitional clustering as usual.
Treat objects in separate canopies as being at infinite distance.
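The "infinite distance" trick can be wrapped around any expensive metric; a hypothetical sketch (the mapping of point to canopy sets is assumed to come from the canopy step):

```python
def canopy_distance(a, b, canopies_of, dist):
    # canopies_of maps each point to the set of canopy ids it belongs to.
    # If a and b share no canopy, report infinity, so the expensive
    # metric is never computed for that pair.
    if canopies_of[a] & canopies_of[b]:
        return dist(a, b)
    return float('inf')
```

This is where the savings come from: the expensive metric only runs inside canopies.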

MapReduce Implementation
Problem – efficiently partition a large data set (say… movies with user ratings!) into a fixed number of clusters using Canopy Clustering, K-Means Clustering, and a Euclidean distance measure.

The Distance Metric
–The Canopy Metric ($ – cheap)
–The K-Means Metric ($$$ – expensive)

Steps!
–Get data into a form you can use (MR)
–Pick canopy centers (MR)
–Assign data points to canopies (MR)
–Pick k-means cluster centers
–K-means algorithm (MR) – iterate!
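One k-means iteration maps naturally onto MapReduce. A minimal Python sketch (an illustration of the idea, not the course's Hadoop code): the mapper emits (nearest-center index, point), the shuffle groups by index, and the reducer averages each group into a new center.

```python
from collections import defaultdict

def kmeans_map(point, centers):
    # Mapper: emit the index of the nearest center as the key.
    nearest = min(range(len(centers)),
                  key=lambda i: sum((a - b) ** 2
                                    for a, b in zip(point, centers[i])))
    return nearest, point

def kmeans_reduce(index, points):
    # Reducer: the new center for this index is the mean of its points.
    return index, tuple(sum(c) / len(c) for c in zip(*points))

def kmeans_iteration(points, centers):
    # Simulate the shuffle phase: group mapper output by key, then reduce.
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centers)
        groups[key].append(value)
    return [kmeans_reduce(k, pts)[1] for k, pts in sorted(groups.items())]
```

The driver reruns this iteration, feeding each round's output centers back in, until the centers stop moving.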

Data Massage This isn’t interesting, but it has to be done.

Selecting Canopy Centers

Assigning Points to Canopies

K-Means Map

Elbow Criterion
Choose a number of clusters such that adding another cluster doesn't add interesting information.
A rule of thumb for deciding how many clusters to use.
The initial assignment of cluster seeds has a bearing on final model performance.
It is often necessary to run clustering several times to get the best result.
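In practice the elbow is usually found by computing the within-cluster sum of squares (WCSS) for several values of k and looking for where the curve bends. A minimal sketch of that quantity (assumed here, not defined on the slide):

```python
def wcss(points, centers):
    # Within-cluster sum of squares: each point's squared distance to its
    # nearest center, summed over all points. Plot this against k and
    # pick the k where the improvement levels off.
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
               for p in points)
```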

Clustering Conclusions
–Clustering is slick
–And it can be done super efficiently
–And in lots of different ways

Homework
–Lab 4 – Clustering the Netflix movie data
–Hw4 – Read IIR Chapter 16 (flat clustering)