
Data-Intensive Computing with MapReduce Jimmy Lin University of Maryland Thursday, March 7, 2013 Session 7: Clustering and Classification This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Today’s Agenda Personalized PageRank Clustering Classification

Clustering Source: Wikipedia (Star cluster)

Problem Setup Arrange items into clusters High similarity between objects in the same cluster Low similarity between objects in different clusters Cluster labeling is a separate problem

Applications Exploratory analysis of large collections of objects Image segmentation Recommender systems Cluster hypothesis in information retrieval Computational biology and bioinformatics Pre-processing for many other algorithms

Three Approaches Hierarchical K-Means Gaussian Mixture Models

Hierarchical Agglomerative Clustering Start with each document in its own cluster. Until there is only one cluster: find the two clusters c_i and c_j that are most similar, then replace c_i and c_j with a single cluster c_i ∪ c_j. The history of merges forms the hierarchy.
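A minimal single-machine sketch of this procedure in Python, using single-link similarity between clusters; the O(n^3) pairwise search and the sim function are illustrative assumptions, not an optimized implementation:

import itertools

def single_link(c1, c2, sim):
    # Similarity of the two *most* similar members across the clusters.
    return max(sim(a, b) for a in c1 for b in c2)

def hac(items, sim):
    clusters = [[x] for x in items]          # start: each item in its own cluster
    merges = []                              # history of merges = the hierarchy
    while len(clusters) > 1:
        # find the most similar pair of clusters
        i, j = max(
            itertools.combinations(range(len(clusters)), 2),
            key=lambda p: single_link(clusters[p[0]], clusters[p[1]], sim),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges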

HAC in Action [figure: dendrogram built bottom-up over items A through H]

Cluster Merging Which two clusters do we merge? What’s the similarity between two clusters? Single Link: similarity of two most similar members Complete Link: similarity of two least similar members Group Average: average similarity between members

Link Functions Single link: uses the maximum similarity over pairs; can result in “straggly” (long and thin) clusters due to the chaining effect. Complete link: uses the minimum similarity over pairs; makes tighter, more spherical clusters.

MapReduce Implementation What’s the inherent challenge? Each merge depends on the outcome of all previous merges, so the algorithm is inherently sequential. One possible approach: iteratively use LSH (locality-sensitive hashing) to group together similar items; when the dataset is small enough, run HAC in memory on a single machine. Observation: structure at the leaves is not very important.

K-Means Algorithm Let d be the distance between documents, and define the centroid of a cluster c to be μ(c), the mean of its members. Select k random instances {s_1, s_2, …, s_k} as seeds. Until clusters converge: assign each instance x_i to the cluster c_j such that d(x_i, s_j) is minimal, then update the seeds to the centroid of each cluster, i.e., for each cluster c_j, set s_j = μ(c_j).
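A minimal single-machine sketch of the algorithm in Python (squared Euclidean distance; random seeding and the exact-equality convergence test are simplifying assumptions):

import random

def kmeans(points, k, iters=100):
    # points: list of equal-length tuples of floats
    seeds = random.sample(points, k)                     # select k random instances as seeds
    for _ in range(iters):
        # assignment step: each point goes to the cluster with the nearest seed
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(x, seeds[j])))
            clusters[j].append(x)
        # update step: move each seed to the centroid of its cluster
        new_seeds = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else seeds[j]
            for j, c in enumerate(clusters)
        ]
        if new_seeds == seeds:                           # converged
            break
        seeds = new_seeds
    return seeds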

K-Means Clustering Example [figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!]

Basic MapReduce Implementation (Just a clever way to keep track of denominator)
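A sketch of what the basic MapReduce implementation might look like, written as plain Python functions in a Hadoop Streaming style rather than against any real framework API; emitting a (vector sum, count) pair per cluster is the “clever way to keep track of the denominator”:

# Mapper: assign each point to its nearest centroid; emit (cluster_id, (point, 1)).
def mapper(point, centroids):
    j = min(range(len(centroids)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])))
    yield j, (point, 1)

# Reducer: sum the vectors and counts for one cluster, then divide to get the new centroid.
def reducer(cluster_id, values):
    total, count = None, 0
    for point, c in values:
        total = point if total is None else tuple(a + b for a, b in zip(total, point))
        count += c
    yield cluster_id, tuple(a / count for a in total)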

MapReduce Implementation w/ IMC
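A sketch of the in-mapper combining (IMC) variant under the same assumptions: the mapper accumulates partial (sum, count) pairs per cluster in memory and emits them only once per input split, so far less data is shuffled; the reducer is unchanged.

# In-mapper combining: aggregate locally, emit once per cluster per mapper.
def mapper_imc(points, centroids):
    partial = {}                                  # cluster_id -> (vector sum, count)
    for point in points:
        j = min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])))
        s, c = partial.get(j, (tuple(0.0 for _ in point), 0))
        partial[j] = (tuple(a + b for a, b in zip(s, point)), c + 1)
    for j, (s, c) in partial.items():             # emitted in the mapper's "close" hook
        yield j, (s, c)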

Implementation Notes Standard setup of iterative MapReduce algorithms: the driver program sets up the MapReduce job, waits for completion, checks for convergence, and repeats if necessary. Must be able to keep the cluster centroids in memory; with large k and large feature spaces, this is potentially an issue. Memory requirements of the centroids grow over time (averaging sparse vectors yields dense centroids)! Variant: k-medoids.

Clustering w/ Gaussian Mixture Models Model data as a mixture of Gaussians Given data, recover model parameters Source: Wikipedia (Cluster analysis)

Gaussian Distributions Univariate Gaussian (i.e., Normal): we write a random variable with such a distribution as X ~ N(μ, σ²). Multivariate Gaussian: we write a vector-valued random variable with such a distribution as X ~ N(μ, Σ).
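For reference, the densities behind this slide (a reconstruction; the slide itself showed them as images):

Univariate: p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \qquad X \sim \mathcal{N}(\mu, \sigma^2)

Multivariate: p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right), \qquad \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)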

Univariate Gaussian Source: Wikipedia (Normal Distribution)

Multivariate Gaussians Source: Lecture notes by Chuong B. Do (IIT Delhi)

Gaussian Mixture Models Model parameters: the number of components K, the “mixing” weight vector π = (π_1, …, π_K), and, for each Gaussian, a mean μ_k and covariance matrix Σ_k. Varying constraints on the covariance matrices: spherical vs. diagonal vs. full, tied vs. untied.

Learning for Simple Univariate Case Problem setup: given the number of components K and points x_1, …, x_N, learn the parameters π_k, μ_k, σ_k². Model selection criterion: maximize the likelihood of the data. Introduce indicator variables z_{i,k} marking which component generated each point, and write down the likelihood of the data.
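Written out, the likelihood being maximized (a standard reconstruction; the slide showed it as an image):

p(x_1, \dots, x_N \mid \theta) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)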

EM to the Rescue! We’re faced with this: It’d be a lot easier if we knew the z’s! Expectation Maximization Guess the model parameters E-step: Compute posterior distribution over latent (hidden) variables given the model parameters M-step: Update model parameters using posterior distribution computed in the E-step Iterate until convergence

EM for Univariate GMMs Initialize the model parameters, then iterate: E-step: compute the expectation of the z variables. M-step: compute new model parameters.
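The standard updates for the univariate case, written out as a reference (a reconstruction; γ_{ik} denotes the expected value of z_{i,k}):

E-step: \gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}

M-step: N_k = \sum_i \gamma_{ik}, \quad \pi_k = \frac{N_k}{N}, \quad \mu_k = \frac{1}{N_k} \sum_i \gamma_{ik} x_i, \quad \sigma_k^2 = \frac{1}{N_k} \sum_i \gamma_{ik} (x_i - \mu_k)^2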

MapReduce Implementation [figure: mappers compute the responsibilities z_{i,k} for each point x_i (E-step); reducers aggregate them to re-estimate the model parameters (M-step)]

K-Means vs. GMMs
Map step: K-Means computes the distance of points to centroids; GMM performs the E-step (compute the expectation of the z indicator variables).
Reduce step: K-Means recomputes the new centroids; GMM performs the M-step (update the values of the model parameters).

Summary Hierarchical clustering Difficult to implement in MapReduce K-Means Straightforward implementation in MapReduce Gaussian Mixture Models Implementation conceptually similar to k-means, more “bookkeeping”

Source: Wikipedia (Sorting) Classification

Supervised Machine Learning The generic problem of function induction given sample instances of input and output. Classification: the output is drawn from a finite set of discrete labels. Regression: the output is a continuous value. Focus here on supervised classification; it suffices to illustrate large-scale machine learning. This is not meant to be an exhaustive treatment of machine learning!

Applications Spam detection Content (e.g., movie) classification POS tagging Friendship recommendation Document ranking Many, many more!

Supervised Binary Classification Restrict the output label to be binary Yes/No 1/0 Binary classifiers form a primitive building block for multi-class problems One vs. rest classifier ensembles Classifier cascades

Limits of Supervised Classification? Why is this a big data problem? Isn’t gathering labels a serious bottleneck? Solution: user behavior logs Learning to rank Computational advertising Link recommendation The virtuous cycle of data-driven products

The Task Given a dataset of (sparse) feature vectors x_i with labels y_i, induce a function f such that a loss function is minimized. Typically, consider functions of a parametric form, and find the model parameters θ that minimize the total loss over the training data.
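In symbols, a reconstruction of the objective this slide refers to (assuming training data \{(x_i, y_i)\}_{i=1}^{n}, loss \ell, and model parameters \theta):

\theta^{*} = \arg\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big( f(x_i; \theta), y_i \big)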

Key insight: machine learning as an optimization problem! (closed form solutions generally not possible)

Gradient Descent: Preliminaries Rewrite the learning problem as minimizing a loss function of the parameters. Compute the gradient, which “points” in the fastest-increasing “direction”. So, at any point, moving against the gradient decreases the loss (with caveats, e.g., this holds only locally).

Gradient Descent: Iterative Update Start at an arbitrary point and iteratively update the parameters by stepping against the gradient. Lots of details: figuring out the step size, getting stuck in local minima, convergence rate, …

Gradient Descent Repeat until convergence:
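The update inside the loop, in standard notation (a reconstruction; γ^{(t)} is the step size at iteration t and \ell the per-instance loss):

\theta^{(t+1)} = \theta^{(t)} - \gamma^{(t)} \, \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta} \, \ell\big( f(x_i; \theta^{(t)}), y_i \big)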

Intuition behind the math… [figure: new weights = old weights minus an update based on the gradient]

Gradient Descent Source: Wikipedia (Hills)

Lots More Details… Gradient descent is a “first-order” optimization technique; often, convergence is slow. Conjugate techniques accelerate convergence. Newton and quasi-Newton methods: the intuition comes from the Taylor expansion; they require the Hessian (the square matrix of second-order partial derivatives), which is impractical to fully compute.

Source: Wikipedia (Hammer) Logistic Regression

Logistic Regression: Preliminaries Given training instances with feature vectors x and binary labels y, let’s define the model: a linear function of the features, w · x, set equal to the log odds ln [ p(y=1|x) / p(y=0|x) ]. Interpretation: each weight in w says how much the corresponding feature shifts the log odds of the positive class.

Relation to the Logistic Function After some algebra: The logistic function:
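For reference, the forms the algebra arrives at (a reconstruction; the slide showed them as images):

p(y = 1 \mid x; w) = \frac{e^{w \cdot x}}{1 + e^{w \cdot x}}, \qquad \text{the logistic function: } f(z) = \frac{1}{1 + e^{-z}}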

Training an LR Classifier Maximize the conditional likelihood of the labels given the features. Define the objective in terms of the conditional log likelihood; we know the form of p(y|x; w) from the previous slide, so substituting it into the objective gives an expression we can differentiate.

LR Classifier Update Rule Take the derivative of the conditional log likelihood with respect to the weights, plug it into the general form of the gradient update rule, and obtain the final update rule.
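As a reference, the standard forms these steps arrive at, reconstructed here for labels y_i in {0, 1} (the slide showed them as images):

\frac{\partial L(w)}{\partial w} = \sum_{i} x_i \big( y_i - p(y_i = 1 \mid x_i; w) \big), \qquad w^{(t+1)} = w^{(t)} + \gamma^{(t)} \sum_{i} x_i \big( y_i - p(y_i = 1 \mid x_i; w^{(t)}) \big)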

Lots more details… Regularization Different loss functions … Want more details? Take a real machine-learning course!

MapReduce Implementation [figure: mappers compute partial gradients over their shards of the data; a single reducer sums them and updates the model; the driver iterates until convergence]
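A sketch of this pattern as plain Python functions (illustrative names, no real framework API; the gradient is the logistic-regression gradient from the update rule above, for labels in {0, 1}):

import math

# Mapper: compute a partial gradient over the instances in this shard.
def gradient_mapper(shard, w):
    partial = [0.0] * len(w)
    for x, y in shard:                              # x: feature vector, y: 0/1 label
        p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        for d in range(len(w)):
            partial[d] += x[d] * (y - p)
    yield "gradient", partial

# Single reducer: sum the partial gradients and take one gradient step.
def gradient_reducer(_, partials, w, step=0.1):
    total = [0.0] * len(w)
    for g in partials:
        total = [a + b for a, b in zip(total, g)]
    yield [wi + step * gi for wi, gi in zip(w, total)]   # new model; the driver iterates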

Shortcomings Hadoop is bad at iterative algorithms High job startup costs Awkward to retain state across iterations High sensitivity to skew Iteration speed bounded by slowest task Potentially poor cluster utilization Must shuffle all data to a single reducer Some possible tradeoffs Number of iterations vs. complexity of computation per iteration E.g., L-BFGS: faster convergence, but more to compute

Gradient Descent Source: Wikipedia (Hills)

Stochastic Gradient Descent Source: Wikipedia (Water Slide)

Batch vs. Online Gradient Descent: “batch” learning, update the model after considering all training instances. Stochastic Gradient Descent (SGD): “online” learning, update the model after considering each (randomly selected) training instance. In practice… just as good!

Practical Notes Most common implementation: Randomly shuffle training instances Stream instances through learner Single vs. multi-pass approaches “Mini-batching” as a middle ground between batch and stochastic gradient descent We’ve solved the iteration problem! What about the single reducer problem?
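A minimal SGD sketch for logistic regression that follows the “shuffle, then stream” recipe (illustrative; labels in {0, 1}, constant step size, no regularization):

import math
import random

def sgd_logistic(instances, dims, step=0.1, passes=1):
    w = [0.0] * dims
    for _ in range(passes):
        random.shuffle(instances)                 # randomly shuffle training instances
        for x, y in instances:                    # stream instances through the learner
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            w = [wi + step * xi * (y - p) for wi, xi in zip(w, x)]   # update on one instance
    return w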

Source: Wikipedia (Orchestra) Ensembles

Ensemble Learning Learn multiple models and combine their results to make a prediction. Why does it work? If errors are uncorrelated, multiple classifiers being wrong is less likely; this reduces the variance component of the error. A variety of different techniques: majority voting, simple weighted voting, model averaging, …
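Simple weighted voting as a tiny sketch (the models and weights are assumptions of the example; each model returns a label in {-1, +1}):

# Predict +1/-1 by the sign of the weighted sum of individual predictions.
def weighted_vote(models, weights, x):
    score = sum(a * m(x) for a, m in zip(weights, models))
    return 1 if score >= 0 else -1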

Practical Notes Common implementation: Train classifiers on different input partitions of the data Embarrassingly parallel! Contrast with bagging Contrast with boosting

MapReduce Implementation [figure: each mapper trains a model on its own partition of the training data]

MapReduce Implementation: Details Shuffle/resort the training instances before learning. Two possible implementations: mappers write the model out as “side data”, or mappers emit the model as intermediate output.

Sentiment Analysis Case Study Binary polarity classification: {positive, negative} sentiment Independently interesting task Illustrates end-to-end flow Use the “emoticon trick” to gather data Data Test: 500k positive/500k negative tweets from 9/1/2011 Training: {1m, 10m, 100m} instances from before (50/50 split) Features: Sliding window byte-4grams Models: Logistic regression with SGD (L2 regularization) Ensembles of various sizes (simple weighted voting) Lin and Kolcz, SIGMOD 2012

[figure: accuracy of a single classifier vs. ensembles trained on 10m and 100m examples] Ensembles with 10m examples are better than a single classifier trained on 100m, “for free”! Diminishing returns…

Takeaway Lesson Big data “recipe” for problem solving Simple technique Simple features Lots of data Usually works very well!

Today’s Agenda Personalized PageRank Clustering Classification

Source: Wikipedia (Japanese rock garden) Questions?