Big Data Infrastructure Week 9: Data Mining (4/4) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States.

Similar presentations
Mixture Models and the EM Algorithm

Qiang Yang Adapted from Tan et al. and Han et al.
Clustering Clustering of data is a method by which large sets of data is grouped into clusters of smaller sets of similar data. The example below demonstrates.
Segmentation and Fitting Using Probabilistic Methods
K-means clustering Hongning Wang
Introduction to Bioinformatics
ETHEM ALPAYDIN © The MIT Press, Lecture Slides for.
First introduced in 1977 Lots of mathematical derivation Problem : given a set of data (data is incomplete or having missing values). Goal : assume the.
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
1 Text Clustering. 2 Clustering Partition unlabeled examples into disjoint subsets of clusters, such that: –Examples within a cluster are very similar.
Clustering.
Part 3 Vector Quantization and Mixture Density Model CSE717, SPRING 2008 CUBS, Univ at Buffalo.
ECE 5984: Introduction to Machine Learning
INTRODUCTION TO Machine Learning ETHEM ALPAYDIN © The MIT Press, Lecture Slides for.
Gaussian Mixture Models and Expectation Maximization.
Clustering Unsupervised learning Generating “classes”
CSC 4510 – Machine Learning Dr. Mary-Angela Papalaskari Department of Computing Sciences Villanova University Course website:
Unsupervised Learning Reading: Chapter 8 from Introduction to Data Mining by Tan, Steinbach, and Kumar, pp , , (
Lecture 19: More EM Machine Learning April 15, 2010.
START OF DAY 8 Reading: Chap. 14. Midterm Go over questions General issues only Specific issues: visit with me Regrading may make your grade go up OR.
Text Clustering.
Clustering Supervised vs. Unsupervised Learning Examples of clustering in Web IR Characteristics of clustering Clustering algorithms Cluster Labeling 1.
tch?v=Y6ljFaKRTrI Fireflies.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
Clustering Algorithms k-means Hierarchic Agglomerative Clustering (HAC) …. BIRCH Association Rule Hypergraph Partitioning (ARHP) Categorical clustering.
CHAPTER 7: Clustering Eick: K-Means and EM (modified Alpaydin transparencies and new transparencies added) Last updated: February 25, 2014.
MACHINE LEARNING 8. Clustering. Motivation Based on E ALPAYDIN 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2  Classification problem:
Lecture 17 Gaussian Mixture Models and Expectation Maximization
Clustering Gene Expression Data BMI/CS 576 Colin Dewey Fall 2010.
Clustering.
Clustering Algorithms Presented by Michael Smaili CS 157B Spring
HMM - Part 2 The EM algorithm Continuous density HMM.
Lecture 6 Spring 2010 Dr. Jianjun Hu CSCE883 Machine Learning.
INTRODUCTION TO MACHINE LEARNING 3RD EDITION ETHEM ALPAYDIN © The MIT Press, Lecture.
Big Data Infrastructure Jimmy Lin University of Maryland Monday, March 9, 2015 Session 6: MapReduce – Data Mining This work is licensed under a Creative.
1 Clustering: K-Means Machine Learning , Fall 2014 Bhavana Dalvi Mishra PhD student LTI, CMU Slides are based on materials from Prof. Eric Xing,
Mehdi Ghayoumi MSB rm 132 Ofc hr: Thur, a Machine Learning.
CS 8751 ML & KDDData Clustering1 Clustering Unsupervised learning Generating “classes” Distance/similarity measures Agglomerative methods Divisive methods.
Lecture 2: Statistical learning primer for biologists
Radial Basis Function ANN, an alternative to back propagation, uses clustering of examples in the training set.
Clustering Instructor: Max Welling ICS 178 Machine Learning & Data Mining.
Machine Learning Queens College Lecture 7: Clustering.
Flat clustering approaches
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
Clustering Hongfei Yan School of EECS, Peking University 7/8/2009 Refer to Aaron Kimball’s slides.
1 CS 391L: Machine Learning Clustering Raymond J. Mooney University of Texas at Austin.
1 Machine Learning Lecture 9: Clustering Moshe Koppel Slides adapted from Raymond J. Mooney.
Data-Intensive Computing with MapReduce Jimmy Lin University of Maryland Thursday, March 14, 2013 Session 8: Sequence Labeling This work is licensed under.
Data-Intensive Computing with MapReduce Jimmy Lin University of Maryland Thursday, March 7, 2013 Session 7: Clustering and Classification This work is.
Big Data Infrastructure Week 8: Data Mining (1/4) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States.
Gaussian Mixture Model classification of Multi-Color Fluorescence In Situ Hybridization (M-FISH) Images Amin Fazel 2006 Department of Computer Science.
Data Mining and Text Mining. The Standard Data Mining process.
Unsupervised Learning Part 2. Topics How to determine the K in K-means? Hierarchical clustering Soft clustering with Gaussian mixture models Expectation-Maximization.
Hierarchical Clustering & Topic Models
Big Data Infrastructure
Big Data Infrastructure
Big Data Infrastructure
Lecture 2-2 Data Exploration: Understanding Data
Machine Learning Lecture 9: Clustering
Latent Variables, Mixture Models and EM
Probabilistic Models with Latent Variables
Information Organization: Clustering
Unsupervised Learning II: Soft Clustering with Gaussian Mixture Models
Data-Intensive Distributed Computing
Clustering Techniques
Text Categorization Berlin Chen 2003 Reference:
Clustering Techniques
EM Algorithm and its Applications
Presentation transcript:

Big Data Infrastructure, Week 9: Data Mining (4/4). This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License; see the license for details. CS 489/698 Big Data Infrastructure (Winter 2016), Jimmy Lin, David R. Cheriton School of Computer Science, University of Waterloo, March 10, 2016. These slides are available on the course website.

What's the Problem? Arrange items into clusters: high similarity (low distance) between items in the same cluster, low similarity (high distance) between items in different clusters. Cluster labeling is a separate problem.

Compare/Contrast. Finding similar items: the focus is on individual items. Clustering: the focus is on groups of items, and the relationship between items in a cluster is of interest.

Evaluation? Classification, finding similar items, clustering.

Source: Wikipedia (Star cluster)

Clustering: specify a distance metric (Jaccard, Euclidean, cosine, etc.), compute a representation (shingling, tf.idf, etc.), then apply a clustering algorithm.

Distances Source:

Distance Metrics. 1. Non-negativity: d(x, y) ≥ 0. 2. Identity: d(x, y) = 0 if and only if x = y. 3. Symmetry: d(x, y) = d(y, x). 4. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).

Distance: Jaccard. Given two sets A and B, the Jaccard similarity is J(A, B) = |A ∩ B| / |A ∪ B|, and the Jaccard distance is 1 - J(A, B).
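
A minimal Python sketch of Jaccard similarity and the corresponding distance over two shingle sets (the function names are illustrative, not from the slides):

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Jaccard similarity: |A intersect B| / |A union B| (0.0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance is one minus the similarity."""
    return 1.0 - jaccard_similarity(a, b)

# Example: two small shingle sets sharing two of four distinct shingles
print(jaccard_distance({"big", "data", "infra"}, {"big", "data", "mining"}))  # 0.5
```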

Distance: Hamming. Given two bit vectors of equal length, the Hamming distance is the number of positions in which they differ.
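
A similarly small sketch of Hamming distance over two equal-length bit vectors (names are illustrative):

```python
def hamming_distance(u: list, v: list) -> int:
    """Number of positions at which two equal-length bit vectors differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance requires vectors of equal length")
    return sum(1 for a, b in zip(u, v) if a != b)

print(hamming_distance([1, 0, 1, 1], [1, 1, 1, 0]))  # 2
```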

Distance: Norms. Given two vectors x and y, the L_r-norm distance is d(x, y) = (Σ_i |x_i - y_i|^r)^(1/r). Euclidean distance is the L_2-norm (r = 2); Manhattan distance is the L_1-norm (r = 1).
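
The general L_r-norm distance sketched as a single Python function; r = 2 recovers Euclidean and r = 1 Manhattan (names are illustrative):

```python
def lr_distance(x, y, r: float = 2.0) -> float:
    """L_r-norm distance: (sum_i |x_i - y_i|^r)^(1/r).
    r = 2 gives Euclidean distance, r = 1 gives Manhattan distance."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

p, q = [0.0, 0.0], [3.0, 4.0]
print(lr_distance(p, q, r=2))  # 5.0 (Euclidean)
print(lr_distance(p, q, r=1))  # 7.0 (Manhattan)
```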

Distance: Cosine. Given two vectors x and y, the idea is to measure the angle between them: cosine similarity is (x · y) / (||x|| ||y||), and cosine distance is 1 minus that. Advantages over others? It depends only on direction, not on vector length, which helps when comparing documents of very different sizes.
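
A small sketch of cosine distance; the second example hints at the advantage above, since scaling a vector does not change the result (names are illustrative):

```python
import math

def cosine_distance(x, y) -> float:
    """1 - cosine similarity; depends only on the angle between the vectors,
    not on their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
print(cosine_distance([1.0, 1.0], [2.0, 2.0]))  # ~0.0 (same direction, different length)
```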

Representations

Representations: Text. Unigrams (i.e., words), or shingles = n-grams, at the word level or at the character level. Feature weights: boolean, tf.idf, BM25, ...
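
A small illustrative sketch of character-level shingling and raw term-frequency weighting; tf.idf or BM25 would rescale these counts using collection-level statistics (helper names are my own):

```python
from collections import Counter

def char_shingles(text: str, k: int = 4) -> set:
    """Character-level k-shingles (n-grams) of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def term_frequencies(tokens: list) -> Counter:
    """Raw term frequencies; idf or BM25 weighting would rescale these
    using document frequencies from the whole collection."""
    return Counter(tokens)

print(char_shingles("clustering", k=4))
print(term_frequencies("big data big clusters".split()))
```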

Representations: Beyond Text. For recommender systems: items as features for users, users as features for items. For graphs: adjacency lists as features for vertices. With log data: behaviors (clicks) as features.

General Clustering Approaches: hierarchical, k-means, Gaussian mixture models.

Hierarchical Agglomerative Clustering. Start with each document in its own cluster. Until there is only one cluster: find the two clusters c_i and c_j that are most similar, and replace c_i and c_j with a single cluster c_i ∪ c_j. The history of merges forms the hierarchy.

HAC in Action (figure: a dendrogram built over items A through H).

Cluster Merging. Which two clusters do we merge? What's the similarity between two clusters? Single link: similarity of the two most similar members. Complete link: similarity of the two least similar members. Group average: average similarity between members.

Link Functions. Single link uses the maximum similarity over pairs: sim(c_i, c_j) = max over x ∈ c_i, y ∈ c_j of sim(x, y); it can result in "straggly" (long and thin) clusters due to a chaining effect. Complete link uses the minimum similarity over pairs: sim(c_i, c_j) = min over x ∈ c_i, y ∈ c_j of sim(x, y); it makes tighter, more spherical clusters. Both are sketched below.
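
A naive O(n^3) sketch of HAC following the procedure above, with the link function as a parameter; the similarity function and names are illustrative, not the course's reference code:

```python
import itertools

def hac(items, sim, linkage="single"):
    """Naive hierarchical agglomerative clustering sketch.
    `sim` is a pairwise similarity function over items; `linkage` selects how
    cluster-to-cluster similarity is derived from the pairwise similarities."""
    clusters = [[x] for x in items]   # start: each item in its own cluster
    merges = []                       # the history of merges forms the hierarchy
    agg = {"single": max, "complete": min,
           "average": lambda s: sum(s) / len(s)}[linkage]

    while len(clusters) > 1:
        # find the most similar pair of clusters under the chosen linkage
        i, j = max(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: agg([sim(a, b)
                                       for a in clusters[ij[0]]
                                       for b in clusters[ij[1]]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Example on one-dimensional points, using negative distance as similarity
print(hac([1.0, 1.1, 5.0, 5.2], sim=lambda a, b: -abs(a - b), linkage="single"))
```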

MapReduce Implementation What’s the inherent challenge?

K-Means Algorithm. Let d be the distance between documents, and define the centroid of a cluster c to be μ(c) = (1/|c|) Σ_{x ∈ c} x. Select k random instances {s_1, s_2, ..., s_k} as seeds. Until clusters converge: assign each instance x_i to the cluster c_j such that d(x_i, s_j) is minimal, then update the seeds to the centroid of each cluster, i.e., for each cluster c_j, set s_j = μ(c_j).
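
A compact sketch of the algorithm above in plain Python, using random seeding and squared Euclidean distance (a toy, not an efficient implementation):

```python
import random

def kmeans(points, k, iters=100):
    """Plain k-means sketch: seed with k random points, then alternate between
    assigning points to the nearest centroid and recomputing each centroid
    as the mean of its cluster."""
    dim = len(points[0])
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
            clusters[j].append(x)
        new_centroids = [
            tuple(sum(x[d] for x in c) / len(c) for d in range(dim)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:   # converged: assignments no longer change
            break
        centroids = new_centroids
    return centroids

print(kmeans([(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (8.1, 7.9)], k=2))
```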

K-Means Clustering Example (figure): pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters... converged!

Basic MapReduce Implementation (Just a clever way to keep track of denominator)
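
The slide's figure is not reproduced in this transcript, but in the common formulation each mapper emits (nearest centroid id, (point, 1)) and each reducer sums the points and the counts, so the trailing 1s are exactly the "clever way to keep track of the denominator". A sketch that simulates one such iteration locally (names are my own):

```python
from collections import defaultdict

def mapper(point, centroids):
    """Emit (nearest centroid id, (point, 1)); the trailing 1 accumulates
    the denominator needed to turn a summed cluster back into a mean."""
    nearest = min(range(len(centroids)),
                  key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])))
    yield nearest, (point, 1)

def reducer(cluster_id, values):
    """Sum the points and counts for one cluster; the new centroid is sum / count."""
    dim = len(values[0][0])
    total, count = [0.0] * dim, 0
    for point, c in values:
        count += c
        for d in range(dim):
            total[d] += point[d]
    return cluster_id, tuple(t / count for t in total)

# Simulate one map/shuffle/reduce iteration over a toy dataset
points = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (8.1, 7.9)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
grouped = defaultdict(list)
for p in points:
    for key, value in mapper(p, centroids):
        grouped[key].append(value)
print([reducer(cid, vals) for cid, vals in grouped.items()])
```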

MapReduce Implementation w/ IMC (in-mapper combining). What about Spark?
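
For the "What about Spark?" question, a hypothetical PySpark sketch of one k-means pass is below; unlike chained MapReduce jobs, the cached RDD keeps the points in memory across iterations. This assumes PySpark and NumPy are available and is not the course's reference implementation:

```python
# Hypothetical PySpark sketch of iterative k-means (not the course's reference code).
from pyspark import SparkContext
import numpy as np

sc = SparkContext(appName="kmeans-sketch")
points = sc.parallelize([np.array(p) for p in
                         [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (8.1, 7.9)]]).cache()
centroids = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]

for _ in range(10):
    b = sc.broadcast(centroids)   # ship the current centroids to the workers
    new_centroids = (points
        .map(lambda x: (min(range(len(b.value)),
                            key=lambda j: float(np.sum((x - b.value[j]) ** 2))),
                        (x, 1)))                              # (nearest centroid, (point, 1))
        .reduceByKey(lambda a, c: (a[0] + c[0], a[1] + c[1]))  # sum points and counts
        .mapValues(lambda s: s[0] / s[1])                      # new centroid = sum / count
        .collectAsMap())
    centroids = [new_centroids.get(j, centroids[j]) for j in range(len(centroids))]

print(centroids)
```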

Implementation Notes. Standard setup of iterative MapReduce algorithms: a driver program sets up the MapReduce job, waits for completion, checks for convergence, and repeats if necessary. Must be able to keep the cluster centroids in memory: with large k and large feature spaces, this is potentially an issue, and the memory requirements of the centroids grow over time! Variant: k-medoids.
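
A sketch of that driver pattern; run_mapreduce_iteration is a hypothetical stand-in for submitting one job and collecting the new centroids:

```python
import math

def run_iterative_job(old_centroids, run_mapreduce_iteration, epsilon=1e-4, max_iters=20):
    """Driver pattern from the slide: submit a job, wait, check for convergence,
    repeat. `run_mapreduce_iteration` is a hypothetical helper that runs one
    map/reduce pass and returns the updated centroids."""
    for i in range(max_iters):
        new_centroids = run_mapreduce_iteration(old_centroids)  # blocks until the job finishes
        shift = max(math.dist(a, b) for a, b in zip(old_centroids, new_centroids))
        if shift < epsilon:        # converged: centroids barely moved
            return new_centroids, i + 1
        old_centroids = new_centroids
    return old_centroids, max_iters

# Toy usage with a stand-in for the real MapReduce job
fake_iteration = lambda cs: [(x * 0.5, y * 0.5) for x, y in cs]
print(run_iterative_job([(4.0, 4.0), (8.0, 8.0)], fake_iteration))
```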

Clustering w/ Gaussian Mixture Models. Model data as a mixture of Gaussians; given data, recover the model parameters. Source: Wikipedia (Cluster analysis)

Gaussian Distributions. Univariate Gaussian (i.e., Normal): p(x) = (1 / √(2πσ²)) exp(-(x - μ)² / (2σ²)); a random variable with such a distribution we write as X ~ N(μ, σ²). Multivariate Gaussian: p(x) = (2π)^(-d/2) |Σ|^(-1/2) exp(-(1/2)(x - μ)ᵀ Σ⁻¹ (x - μ)); a vector-valued random variable with such a distribution we write as X ~ N(μ, Σ).

Univariate Gaussian Source: Wikipedia (Normal Distribution)

Multivariate Gaussians Source: Lecture notes by Chuong B. Do (IIT Delhi)

Gaussian Mixture Models. Model parameters: the number of components K, the "mixing" weight vector π, and, for each Gaussian, a mean and a covariance matrix. Varying constraints on covariance matrices: spherical vs. diagonal vs. full, tied vs. untied. The generative story? (yes, that's a technical term)
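
In standard notation (a reconstruction, since the slide's formulas are not in the transcript), the mixture density these parameters define is:

```latex
p(\mathbf{x}) \;=\; \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```

The generative story: draw a component z ~ Categorical(π_1, ..., π_K), then draw x ~ N(μ_z, Σ_z).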

Learning for Simple Univariate Case. Problem setup: given the number of components K and points x_1, ..., x_N, learn the parameters (mixing weights, means, variances). Model selection criterion: maximize the likelihood of the data. Introduce indicator variables z_i recording which component generated each point. Likelihood of the data: written out below.
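
Written out for the univariate case in standard notation (a reconstruction; the slide's formulas did not survive the transcript):

```latex
% Observed-data likelihood (what we actually get to maximize):
p(x_1, \dots, x_N \mid \theta) \;=\; \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)

% Complete-data likelihood (if the indicators z_i were observed):
p(x, z \mid \theta) \;=\; \prod_{i=1}^{N} \pi_{z_i} \, \mathcal{N}(x_i \mid \mu_{z_i}, \sigma_{z_i}^2)
```

The sum inside the product is what makes direct maximization awkward; knowing the z_i would remove it, which motivates EM on the next slide.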

EM to the Rescue! We're faced with that likelihood, in which the z's are summed out; it'd be a lot easier if we knew the z's! Expectation Maximization: guess the model parameters, then iterate. E-step: compute the posterior distribution over the latent (hidden) variables given the model parameters. M-step: update the model parameters using the posterior distribution computed in the E-step. Iterate until convergence.

EM for Univariate GMMs. Initialize π, μ, σ². Iterate: E-step: compute the expectation of the z variables, γ_{ik} = π_k N(x_i; μ_k, σ_k²) / Σ_j π_j N(x_i; μ_j, σ_j²). M-step: compute new model parameters, with N_k = Σ_i γ_{ik}: π_k = N_k / N, μ_k = (1/N_k) Σ_i γ_{ik} x_i, and σ_k² = (1/N_k) Σ_i γ_{ik} (x_i - μ_k)².
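
A minimal Python sketch of these updates for a univariate mixture, with random initialization (a toy, not a numerically robust implementation):

```python
import math, random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_univariate_gmm(xs, k, iters=50):
    """EM for a univariate Gaussian mixture: the E-step computes responsibilities
    (the expectations of the z indicators), the M-step re-estimates weights,
    means, and variances from those responsibilities."""
    n = len(xs)
    pi = [1.0 / k] * k
    mu = random.sample(xs, k)
    var = [1.0] * k          # one of many possible initializations
    for _ in range(iters):
        # E-step: gamma[i][j] = P(z_i = j | x_i, current parameters)
        gamma = []
        for x in xs:
            w = [pi[j] * normal_pdf(x, mu[j], var[j]) for j in range(k)]
            z = sum(w)
            gamma.append([wj / z for wj in w])
        # M-step: weighted re-estimation of the parameters
        for j in range(k):
            nj = sum(gamma[i][j] for i in range(n))
            pi[j] = nj / n
            mu[j] = sum(gamma[i][j] * xs[i] for i in range(n)) / nj
            var[j] = sum(gamma[i][j] * (xs[i] - mu[j]) ** 2 for i in range(n)) / nj + 1e-6
    return pi, mu, var

data = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(5, 1) for _ in range(200)]
print(em_univariate_gmm(data, k=2))
```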

MapReduce Implementation (figure: the indicator expectations z_{i,k}, one row per point x_i and one column per mixture component k, computed in the map phase and aggregated into new parameters in the reduce phase). What about Spark?
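
One way the E-step and M-step might be split across mappers and reducers, simulated locally: each mapper computes responsibilities for its points and emits per-component sufficient statistics, and each reducer turns the aggregated statistics into new parameters (structure and names are my own, not the slide's code):

```python
import math
from collections import defaultdict

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_mapper(x, params):
    """E-step on one point: emit per-component sufficient statistics
    (responsibility, responsibility * x, responsibility * x^2)."""
    pi, mu, var = params
    w = [pi[j] * normal_pdf(x, mu[j], var[j]) for j in range(len(pi))]
    z = sum(w)
    for j, wj in enumerate(w):
        g = wj / z
        yield j, (g, g * x, g * x * x)

def em_reducer(j, stats, n):
    """M-step for one component: aggregate the statistics into new parameters."""
    s0 = sum(s[0] for s in stats)
    s1 = sum(s[1] for s in stats)
    s2 = sum(s[2] for s in stats)
    mu = s1 / s0
    return j, (s0 / n, mu, s2 / s0 - mu * mu)   # (pi_j, mu_j, var_j)

# One EM iteration simulated as map + shuffle + reduce
xs = [0.1, -0.2, 0.3, 4.9, 5.1, 5.3]
params = ([0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
grouped = defaultdict(list)
for x in xs:
    for key, value in em_mapper(x, params):
        grouped[key].append(value)
print([em_reducer(j, stats, len(xs)) for j, stats in grouped.items()])
```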

K-Means vs. GMMs. Map: k-means computes the distance of points to centroids; a GMM computes the expectation of the z indicator variables (the E-step). Reduce: k-means recomputes the new centroids; a GMM updates the values of the model parameters (the M-step).

Source: Wikipedia (k-means clustering)

Source: Wikipedia (Japanese rock garden) Questions?