Presentation is loading. Please wait.

Presentation is loading. Please wait.

MapReduce for Machine Learning on Multicore

Similar presentations


Presentation on theme: "MapReduce for Machine Learning on Multicore"— Presentation transcript:

1 MapReduce for Machine Learning on Multicore
Cheng-Tao Chu and Sang Kyun Kim et al. Stanford University, Stanford CA Presented by Inna Rytsareva

2 Map-Reduce for Machine Learning on Multicore
Outline What is MapReduce? Problem Description and Formalization Statistical Query Model and Summation Form Architecture (inspired by MapReduce) Adopted ML Algorithms Experiments Future of MapReduce for Machine Learning Discussion Map-Reduce for Machine Learning on Multicore

3 Motivation Problem: lots of data
Example: 20+ billion web pages x 20KB = 400+ terabytes One computer can read MB/sec from disk ~four months to read the web ~1,000 hard drives just to store the web Even more to do something with the data

4 Motivation Solution: spread the work over many machines
Same problem with 1000 machines, < 3 hours programming work communication and coordination recovering from machine failure status reporting debugging optimization locality repeat for every problem you want to solve

5 Cluster Computing Many racks of computers, thousands of machines per cluster Limited bisection bandwidth between racks

6 MapReduce A simple programming model that applies to many large-scale computing problems Hide messy details in MapReduce runtime library: automatic parallelization load balancing network and disk transfer optimization handling of machine failures robustness

7 Programming model Input & Output: each a set of key/value pairs
Programmer specifies two functions: map() reduce() map (in_key, in_value) -> list(out_key, intermediate_value) Processes input key/value pair Produces set of intermediate pairs reduce (out_key, list(intermediate_value)) -> list(out_value) Combines all intermediate values for a particular key Produces a set of merged output values (usually just one)

8 Example Page 1: the weather is good Page 2: today is good
Page 3: good weather is good.

9 Map Output Worker 1: Worker 2: Worker 3:
(the 1), (weather 1), (is 1), (good 1). Worker 2: (today 1), (is 1), (good 1). Worker 3: (good 1), (weather 1), (is 1), (good 1).

10 Reduce Input Worker 1: Worker 2: Worker 3: Worker 4: Worker 5: (the 1)
(is 1), (is 1), (is 1) Worker 3: (weather 1), (weather 1) Worker 4: (today 1) Worker 5: (good 1), (good 1), (good 1), (good 1)

11 Reduce Output Worker 1: Worker 2: Worker 3: Worker 4: Worker 5:
(the 1) Worker 2: (is 3) Worker 3: (weather 2) Worker 4: (today 1) Worker 5: (good 4)

12

13 Fault tolerance On worker failure: Master failure:
Detect failure via periodic heartbeats (ping) Re-execute completed and in-progress map tasks Re-execute in progress reduce tasks Task completion committed through master Master failure: Could handle, but don't yet (master failure unlikely)

14 MapReduce Transparencies
Parallelization Fault-tolerance Locality optimization Load balancing

15 Suitable for your task if
Have a cluster Working with large dataset Working with independent data (or assumed) Can be cast into map and reduce

16 References Original paper J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Communications of the ACM, vol. 51, 2008, pp ( On wikipedia ( Hadoop – MapReduce in Java ( Starfish - MapReduce in Ruby (

17 Map-Reduce for Machine Learning on Multicore
Outline What is MapReduce? Problem Description and Formalization Statistical Query Model and Summation Form Architecture (inspired by MapReduce) Adopted ML Algorithms Experiments Future of MapReduce for Machine Learning Discussion Map-Reduce for Machine Learning on Multicore

18 Motivations Industry-wide shift to multicore
No good framework for parallelize ML algorithms Goal: develop a general and exact technique for parallel programming of a large class of ML algorithms for multicore processors

19 Idea Statistical Query Model Summation Form MapReduce

20 Map-Reduce for Machine Learning on Multicore
Outline What is MapReduce? Problem Description and Formalization Statistical Query Model and Summation Form Architecture (inspired by MapReduce) Adopted ML Algorithms Experiments Future of MapReduce for Machine Learning Discussion Map-Reduce for Machine Learning on Multicore

21 Valiant Model [Valiant’84]
x is the input y is a function of x that we want to learn In Valiant model, the learning algorithm uses randomly drawn examples <x, y> to learn the target function

22 Statistical Query Model [Kearns’98]
A restriction on Valiant model A learning algorithm uses some aggregates over the examples, not the individual examples Given a function f(x,y) over instances (data points x and labels y), a statistical oracle will return an estimate of the expectation of f(x,y) Any model that computes gradients or sufficient statistics over f(x,y) fits this model Typically this is achieved by summing over the data.

23 Summation Form Aggregate over the data:
Divide the data set into pieces Compute aggregates on each cores Combine all results at the end

24 Example: Linear Regression
Model: Goal: Solution: Given m examples: (x1, y1), (x2, y2), …, (xm, ym) We write a matrix X with x1, …, xm as rows, and row vector Y=(y1, y2, …ym). Then the solution is Parallel computation:

25 Map-Reduce for Machine Learning on Multicore
Outline What is MapReduce? Problem Description and Formalization Statistical Query Model and Summation Form Architecture (inspired by MapReduce) Adopted ML Algorithms Experiments Future of MapReduce for Machine Learning Discussion Map-Reduce for Machine Learning on Multicore

26 Lighter Weight MapReduce for Multicore

27 Map-Reduce for Machine Learning on Multicore
Outline What is MapReduce? Problem Description and Formalization Statistical Query Model and Summation Form Architecture (inspired by MapReduce) Adopted ML Algorithms Experiments Future of MapReduce for Machine Learning Conclusion and Discussion Map-Reduce for Machine Learning on Multicore

28 Locally Weighted Linear Regression (LWLR)
Solve: When wi == 1, this is least squares. Mappers: one sets compute subgroups of A, the other set compute subgroups b Two reducers for computing A and b Finally compute the solution

29 Naïve Bayes (NB) Goal: estimate P(xj=k|y=1) and P(xj=k|y=0) and P(y)
Computation: count the occurrence of (xj=k, y=1) and (xj=k, y=0), count the occurrence of (y=1) and (y=0) Mappers: count a subgroup of training samples Reducer: aggregate the intermediate counts, and calculate the final result

30 Gaussian Discriminative Analysis (GDA)
Goal: classification of x into classes of y assuming each class is a Gaussian Mixture model with different means but same covariance. Computation: Mappers: compute for a subgroup of training samples Reducer: aggregate intermediate results

31 K-means Computing the Euclidean distance between sample vectors and centroids Recalculating the centroids Divide the computation to subgroups to be handled by map-reduce

32 Neural Network (NN) Back-propagation, 3-layer network
Input, middle, 2 output nodes Goal: compute the weights in the NN by back propagation Mapper: propagate its set of training data through the network, and propagate errors to calculate the partial gradient for weights Reducer: sums the partial gradients and does a batch gradient descent to update the weights

33 Principal Components Analysis (PCA)
Compute the principle eigenvectors of the covariance matrix Clearly, we can compute the summation form using map-reduce Express the mean vector as a sum

34 Other Algorithms Logistic Regression Independent Component Analysis
Support Vector Machine Expectation Maximization (EM)

35 Time Complexity Basically: Linear speed up with increasing number of cores

36 Map-Reduce for Machine Learning on Multicore
Outline What is MapReduce? Problem Description and Formalization Statistical Query Model and Summation Form Architecture (inspired by MapReduce) Adopted ML Algorithms Experiments Future of MapReduce for Machine Learning Discussion Map-Reduce for Machine Learning on Multicore

37 Setup Compare map-reduce version and sequential version 10 data sets
Machines: Dual-processor Pentium-III 700MHz, 1GB RAM 16-way Sun Enterprise 6000

38 Dual-Processor SpeedUps

39 SpeedUp for 2-16 processors
Bold – average Error Bars – max/min Dashed - variance

40 Multicore Simulator Results
Multicore simulator over the sensor dataset Better results – reported for NN & LR NN 16 cores 15.5x 32 cores 29x 64 cores 54x LR 16 cores 15x 32 cores 29.5x 64 cores 53x Could be because of less communication cost

41 Conclusion Parallelize summation forms
NO change in the underlying algorithm NO approximation Use map-reduce on a single machine

42 Map-Reduce for Machine Learning on Multicore
Outline What is MapReduce? Problem Description and Formalization Statistical Query Model and Summation Form Architecture (inspired by MapReduce) Adopted ML Algorithms Experiments Future of MapReduce for Machine Learning Discussion Map-Reduce for Machine Learning on Multicore

43 Apache Mahout An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License Why Mahout? Community Documentation and Examples Scalability the Apache License Not-specific research-oriented

44 Focus: Scalable Goal: Be as fast and efficient as the possible given the intrinsic design of the algorithm Some algorithms won’t scale to massive machine clusters Others fit logically on a MapReduce framework like Apache Hadoop Still others will need other distributed programming models Most Mahout implementations are MapReduce enabled Work in Progress

45 Sampling of Who uses Mahout?

46 Focus: Machine Learning
Applications Examples Genetic Freq. Pattern Mining Classification Clustering Recommenders Utilities Lucene/Vectorizer Math Vectors/Matrices/SVD Collections (primitives) Apache Hadoop

47 Resources “Mahout in Action” “Introducing Apache Mahout”
Owen, Anil, Dunning and Friedman “Introducing Apache Mahout” “Taming Text” by Ingersoll, Morton, Farris “Programming Collective Intelligence” by Toby Segaran “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank “Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer

48 Map-Reduce for Machine Learning on Multicore
Outline What is MapReduce? Problem Description and Formalization Statistical Query Model and Summation Form Architecture (inspired by MapReduce) Adopted ML Algorithms Experiments Future of MapReduce for Machine Learning Discussion Map-Reduce for Machine Learning on Multicore

49 Discussion What are other alternatives to MapReduce?
What to do if “summation form” is not applicable? Does the dataset quality effect implementation and performance of parallel machine learning algorithms? Multicore processors… future? Predicting Structural and Functional Sites in Proteins by Searching for Maximum-Weight Cliques


Download ppt "MapReduce for Machine Learning on Multicore"

Similar presentations


Ads by Google