Data-Intensive Distributed Computing


Data-Intensive Distributed Computing CS 451/651 (Fall 2018) Part 6: Data Mining (4/4) November 6, 2018 Jimmy Lin, David R. Cheriton School of Computer Science, University of Waterloo. These slides are available at http://lintool.github.io/bigdata-2018f/. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License; see http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Structure of the Course (course map: Analyzing Text, Analyzing Graphs, Analyzing Relational Data, Data Mining, all built on "core" framework features and algorithm design)

Theme: Similarity How similar are two items? How "close" are two items? Equivalent formulations: large distance = low similarity. Lots of applications!
Problem: find similar items. Offline variant: extract all similar pairs of objects from a large collection. Online variant: is this object similar to something I've seen before? (Last time!)
Problem: arrange similar items into clusters. Offline variant: entire static collection available at once. Online variant: objects incrementally available. (Today!)

Clustering Criteria How to form clusters? High similarity (low distance) between items in the same cluster Low similarity (high distance) between items in different clusters Cluster labeling is a separate (difficult) problem!

Supervised Machine Learning (diagram: training data is fed to a machine learning algorithm to produce a model, which is then applied to new examples at testing/deployment time)

Unsupervised Machine Learning If supervised learning is function induction… what’s unsupervised learning? Learning something about the inherent structure of the data What’s it good for?

Applications of Clustering Clustering images to summarize search results Clustering customers to infer viewing habits Clustering biological sequences to understand evolution Clustering sensor logs for outlier detection

Evaluation How do we know how well we’re doing? Classification Nearest neighbor search Clustering Inherent challenges of unsupervised techniques!

Clustering Source: Wikipedia (Star cluster)

Clustering Specify distance metric (Jaccard, Euclidean, cosine, etc.) Compute representation (shingling, tf.idf, etc.) Apply clustering algorithm

Distance Metrics Source: www.flickr.com/photos/thiagoalmeida/250190676/

Distance Metrics Non-negativity: d(x, y) ≥ 0. Identity: d(x, y) = 0 if and only if x = y. Symmetry: d(x, y) = d(y, x). Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).

Distance: Jaccard Given two sets A, B. Jaccard similarity: J(A, B) = |A ∩ B| / |A ∪ B|. Jaccard distance: 1 − J(A, B).

Distance: Norms Given points x, y ∈ R^n. Euclidean distance (L2-norm): d(x, y) = ( Σ_i (x_i − y_i)² )^(1/2). Manhattan distance (L1-norm): d(x, y) = Σ_i |x_i − y_i|. Lr-norm: d(x, y) = ( Σ_i |x_i − y_i|^r )^(1/r).

Distance: Cosine Given two vectors x, y. Idea: measure the angle between the vectors. Cosine similarity: cos(θ) = (x · y) / (‖x‖ ‖y‖). Thus: distance can be taken as 1 − cos(θ), or the angle θ itself. Advantages over others?
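
As a concrete illustration of these measures (my sketch, not part of the original slides), the Scala below assumes sets of string features for Jaccard and dense Array[Double] vectors for the norm- and cosine-based measures:

object Distances {
  // Jaccard similarity over sets of features
  def jaccardSim(a: Set[String], b: Set[String]): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else (a intersect b).size.toDouble / (a union b).size

  // L2 (Euclidean) and L1 (Manhattan) distances over dense vectors
  def euclidean(x: Array[Double], y: Array[Double]): Double =
    math.sqrt(x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum)

  def manhattan(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => math.abs(a - b) }.sum

  // Cosine similarity; a distance can be taken as 1 - cosineSim(x, y)
  def cosineSim(x: Array[Double], y: Array[Double]): Double = {
    val dot = x.zip(y).map { case (a, b) => a * b }.sum
    val norm = (v: Array[Double]) => math.sqrt(v.map(a => a * a).sum)
    dot / (norm(x) * norm(y))
  }
}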

Representations

Representations (Text) Unigrams (i.e., words); shingles = n-grams, at the word level or at the character level. Feature weights: boolean, tf.idf, BM25, …
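
As an illustration (not from the slides), shingling at the character and word level might look like the following; the function names and default sizes are hypothetical choices:

// Hypothetical sketch: k-character shingles and word-level n-grams as set-valued features.
def charShingles(text: String, k: Int = 4): Set[String] =
  text.toLowerCase.sliding(k).toSet          // every window of k characters becomes a feature

def wordShingles(text: String, n: Int = 2): Set[String] =
  text.toLowerCase.split("\\s+").toSeq.sliding(n).map(_.mkString(" ")).toSet

// Example: charShingles("clustering", 4) contains "clus", "lust", "uste", ...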

Representations (Beyond Text) For recommender systems: items as features for users, users as features for items. For graphs: adjacency lists as features for vertices. For log data: behaviors (clicks) as features.

Clustering Algorithms Agglomerative (bottom-up) Divisive (top-down) K-Means Gaussian Mixture Models

Hierarchical Agglomerative Clustering Start with each object in its own cluster. Until there is only one cluster: find the two clusters ci and cj that are most similar, and replace ci and cj with a single cluster ci ∪ cj. The history of merges forms the hierarchy.
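
A minimal in-memory sketch of this procedure (my own illustration, not the course code): brute-force O(n³), assuming distinct items, with the pairwise similarity function and the linkage function supplied by the caller (concrete link functions are sketched after the "Link Functions" slide below).

def hac[T](items: Seq[T],
           sim: (T, T) => Double,
           linkage: Seq[Double] => Double): List[(Set[T], Set[T])] = {
  var clusters: List[Set[T]] = items.map(Set(_)).toList
  var merges: List[(Set[T], Set[T])] = Nil
  while (clusters.size > 1) {
    // find the pair of clusters with the highest cluster-to-cluster similarity
    val pairs = for (ci <- clusters; cj <- clusters if ci != cj) yield (ci, cj)
    val (a, b) = pairs.maxBy { case (ci, cj) =>
      linkage(for (x <- ci.toSeq; y <- cj.toSeq) yield sim(x, y))
    }
    merges = (a, b) :: merges                                 // record the merge (the hierarchy)
    clusters = (a union b) :: clusters.filterNot(c => c == a || c == b)
  }
  merges.reverse
}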

HAC in Action Step 1: {1}, {2}, {3}, {4}, {5}, {6}, {7} Source: Slides by Ryan Tibshirani

Dendrogram Source: Slides by Ryan Tibshirani

Cluster Merging Which two clusters do we merge? What’s the similarity between two clusters? Single Linkage: similarity of two most similar members Complete Linkage: similarity of two least similar members Average Linkage: average similarity between members

Single Linkage Uses maximum similarity (min distance) of pairs. Source: Slides by Ryan Tibshirani

Complete Linkage Uses minimum similarity (max distance) of pairs. Source: Slides by Ryan Tibshirani

Average Linkage Uses average similarity of all pairs. Source: Slides by Ryan Tibshirani

Link Functions
Single linkage: uses maximum similarity (min distance) of pairs. Weakness: "straggly" (long and thin) clusters due to a chaining effect; clusters may not be compact.
Complete linkage: uses minimum similarity (max distance) of pairs. Weakness: crowding effect – points can end up closer to points in other clusters than to points in their own cluster; clusters may not be far apart.
Average linkage: uses the average similarity of all pairs. Tries to strike a balance – compact and far apart. Weakness: similarity is more difficult to interpret.
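
Expressed against the hac sketch above (again my illustration), each link function just reduces the bag of pairwise similarities between two clusters to a single score:

val singleLinkage:   Seq[Double] => Double = sims => sims.max             // most similar pair
val completeLinkage: Seq[Double] => Double = sims => sims.min             // least similar pair
val averageLinkage:  Seq[Double] => Double = sims => sims.sum / sims.size // mean over all pairs

// e.g., hac(items, Distances.cosineSim, averageLinkage)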

MapReduce Implementation What's the inherent challenge? Each merge depends on the result of the previous one, so HAC parallelizes poorly; in practice it is most practical as an in-memory final step.

Clustering Algorithms Agglomerative (bottom-up) Divisive (top-down) K-Means Gaussian Mixture Models

K-Means Algorithm Select k random instances {s1, s2,… sk} as initial centroids Iterate: Assign each instance to closest centroid Update centroids based on assigned instances

K-Means Clustering Example Pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!

Basic MapReduce Implementation

class Mapper {
  var clusters: Clusters = _
  def setup() = {
    clusters = loadClusters()
  }
  def map(id: Int, vector: Vector) = {
    // emit each vector keyed by its nearest centroid
    context.write(clusters.findNearest(vector), vector)
  }
}

class Reducer {
  def reduce(clusterId: Int, values: Iterable[Vector]) = {
    var sum = Vector.zeros()
    var cnt = 0
    for (vector <- values) {
      sum += vector
      cnt += 1
    }
    emit(clusterId, sum / cnt)  // new centroid = mean of assigned vectors
  }
}

Basic MapReduce Implementation Conceptually, what’s happening? Given current cluster assignment, assign each vector to closest cluster Group by cluster Compute updated clusters What’s the cluster update? Computing the mean! Remember IMC and other optimizations? What about Spark?
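
To make "What about Spark?" concrete, here is a hedged sketch of the same k-means loop as a self-contained Spark (Scala) job. The input path, the number of clusters, the fixed iteration count, and the helper functions (add, scale, sqDist, nearest) are assumptions for illustration, not the course's reference implementation. The key difference from Hadoop MapReduce is that the points RDD is cached in memory and reused across iterations rather than re-read from disk each time.

import org.apache.spark.sql.SparkSession

object KMeansSketch {
  type Vec = Array[Double]

  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }
  def scale(a: Vec, s: Double): Vec = a.map(_ * s)
  def sqDist(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  def nearest(centroids: Array[Vec], p: Vec): Int =
    centroids.indices.minBy(i => sqDist(centroids(i), p))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()
    val sc = spark.sparkContext

    // one point per line, space-separated coordinates (hypothetical input format)
    val points = sc.textFile("points.txt")
      .map(_.split("\\s+").map(_.toDouble)).cache()

    var centroids = points.takeSample(withReplacement = false, num = 3)
    for (_ <- 1 to 10) {                       // fixed number of iterations, for brevity
      val bc = sc.broadcast(centroids)
      centroids = points
        .map(p => (nearest(bc.value, p), (p, 1L)))                           // assign to nearest centroid
        .reduceByKey { case ((s1, n1), (s2, n2)) => (add(s1, s2), n1 + n2) } // partial sums per cluster
        .mapValues { case (sum, n) => scale(sum, 1.0 / n) }                  // new centroid = mean
        .collect().sortBy(_._1).map(_._2)
    }
    spark.stop()
  }
}

Note that reduceByKey aggregates partial sums on the map side before shuffling, which plays the same role as the in-mapper combining (IMC) optimization mentioned above.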

Implementation Notes Standard setup of iterative MapReduce algorithms: the driver program sets up a MapReduce job, waits for completion, checks for convergence, and repeats if necessary. Must be able to keep cluster centroids in memory. With large k and large feature spaces, potentially an issue. Memory requirements of centroids grow over time! Variant: k-medoids. How do you select initial seeds? How do you select k?
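
As an illustration of the driver's convergence check (an assumption, not prescribed by the slides), one common choice is to stop once no centroid moves more than a small epsilon between iterations:

// Hypothetical convergence test for the driver loop, reusing the sqDist helper
// from the Spark sketch above: stop when every centroid has moved less than eps.
def converged(oldC: Array[Array[Double]], newC: Array[Array[Double]], eps: Double = 1e-6): Boolean =
  oldC.zip(newC).forall { case (a, b) => KMeansSketch.sqDist(a, b) < eps }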

Clustering w/ Gaussian Mixture Models Model data as a mixture of Gaussians Given data, recover model parameters What’s with models? Source: Wikipedia (Cluster analysis)

Gaussian Distributions Univariate Gaussian (i.e., Normal): p(x) = 1 / (σ √(2π)) · exp( −(x − μ)² / (2σ²) ). A random variable with such a distribution is written as X ~ N(μ, σ²). Multivariate Gaussian: p(x) = (2π)^(−d/2) |Σ|^(−1/2) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) ). A random variable with such a distribution is written as X ~ N(μ, Σ).

Univariate Gaussian Source: Wikipedia (Normal Distribution)

Multivariate Gaussians Source: Lecture notes by Chuong B. Do (IIT Delhi)

Gaussian Mixture Models Model Parameters Number of components: K. "Mixing" weight vector: π = (π1, …, πK), with πk ≥ 0 and Σk πk = 1. For each Gaussian, mean and covariance matrix: μk, Σk. The generative story? (yes, that's a technical term) Problem: Given the data, recover the model parameters. Varying constraints on covariance matrices: spherical vs. diagonal vs. full; tied vs. untied.
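
The "generative story" can be read directly as sampling code. This sketch is mine (univariate for simplicity): pick a component according to the mixing weights, then draw the point from that component's Gaussian.

import scala.util.Random

// Draw one point from a univariate GMM: z ~ Categorical(pi), then x ~ Normal(mu(z), sigma(z)^2).
def sampleGMM(pi: Array[Double], mu: Array[Double], sigma: Array[Double], rng: Random): Double = {
  val u = rng.nextDouble()
  val cum = pi.scanLeft(0.0)(_ + _).drop(1)                 // cumulative mixing weights
  val z = cum.indexWhere(_ >= u) match {
    case -1 => pi.length - 1                                // guard against floating-point round-off
    case i  => i
  }
  mu(z) + sigma(z) * rng.nextGaussian()                     // sample from the chosen component
}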

Learning for Simple Univariate Case Problem setup: Given number of components: K. Given points: x1, …, xN. Learn parameters: θ = {πk, μk, σk²} for k = 1, …, K. Model selection criterion: maximize likelihood of data. Introduce indicator variables: z_{n,k} = 1 if xn was generated by component k, 0 otherwise. Likelihood of the data: p(x1, …, xN | θ) = Π_n Σ_k πk N(xn; μk, σk²).

EM to the Rescue! We're faced with this: a likelihood with a sum inside the product, and no closed-form maximum. It'd be a lot easier if we knew the z's! Expectation Maximization: Guess the model parameters. E-step: compute the posterior distribution over the latent (hidden) variables given the model parameters. M-step: update the model parameters using the posterior distribution computed in the E-step. Iterate until convergence.

EM for Univariate GMMs Initialize: πk, μk, σk² for k = 1, …, K. Iterate: E-step: compute expectation of z variables: γ_{n,k} = πk N(xn; μk, σk²) / Σ_j πj N(xn; μj, σj²). M-step: compute new model parameters: Nk = Σ_n γ_{n,k}; πk = Nk / N; μk = (1/Nk) Σ_n γ_{n,k} xn; σk² = (1/Nk) Σ_n γ_{n,k} (xn − μk)².
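
A compact sketch of these updates in code (my illustration, not the course implementation); gaussianPdf and the parameter arrays are defined inline, and one call to emStep performs a single E-step plus M-step.

// One EM iteration for a univariate GMM with K components over points x.
def gaussianPdf(x: Double, mu: Double, sigma2: Double): Double =
  math.exp(-(x - mu) * (x - mu) / (2 * sigma2)) / math.sqrt(2 * math.Pi * sigma2)

def emStep(x: Array[Double], pi: Array[Double], mu: Array[Double], sigma2: Array[Double])
    : (Array[Double], Array[Double], Array[Double]) = {
  val K = pi.length
  val N = x.length
  // E-step: responsibilities gamma(n)(k) = P(z_n = k | x_n, current parameters)
  val gamma = x.map { xn =>
    val unnorm = Array.tabulate(K)(k => pi(k) * gaussianPdf(xn, mu(k), sigma2(k)))
    val total = unnorm.sum
    unnorm.map(_ / total)
  }
  // M-step: re-estimate parameters from the soft assignments
  val nk = Array.tabulate(K)(k => gamma.map(_(k)).sum)
  val newPi = nk.map(_ / N)
  val newMu = Array.tabulate(K)(k => x.indices.map(n => gamma(n)(k) * x(n)).sum / nk(k))
  val newSigma2 = Array.tabulate(K)(k =>
    x.indices.map(n => gamma(n)(k) * (x(n) - newMu(k)) * (x(n) - newMu(k))).sum / nk(k))
  (newPi, newMu, newSigma2)
}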

MapReduce Implementation (diagram: for each point xn, mappers compute the responsibilities zn,1 … zn,K over the K components; reducers aggregate them to re-estimate the model parameters) What about Spark?

K-Means vs. GMMs
          K-Means                                      GMM
Map       Compute distance of points to centroids      E-step: compute expectation of z indicator variables
Reduce    Recompute new centroids                      M-step: update values of model parameters

Source: Wikipedia (k-means clustering)

Source: Wikipedia (Japanese rock garden)