MapReduce for Machine Learning on Multicore

Presentation transcript:

MapReduce for Machine Learning on Multicore
Cheng-Tao Chu, Sang Kyun Kim, et al., Stanford University, Stanford, CA
Presented by Inna Rytsareva

Outline
What is MapReduce?
Problem Description and Formalization
Statistical Query Model and Summation Form
Architecture (inspired by MapReduce)
Adopted ML Algorithms
Experiments
Future of MapReduce for Machine Learning
Discussion

Motivation
Problem: lots of data
Example: 20+ billion web pages x 20KB = 400+ terabytes
One computer can read 30-35 MB/sec from disk: ~four months just to read the web
~1,000 hard drives just to store the web
Even more is needed to do something with the data

Motivation
Solution: spread the work over many machines
The same problem can be solved with 1,000 machines in under 3 hours
But this brings new burdens: programming work, communication and coordination, recovering from machine failure, status reporting, debugging, optimization, locality
And all of it repeats for every problem you want to solve

Cluster Computing
Many racks of computers, thousands of machines per cluster
Limited bisection bandwidth between racks

MapReduce
A simple programming model that applies to many large-scale computing problems
Hides the messy details in the MapReduce runtime library: automatic parallelization, load balancing, network and disk transfer optimization, handling of machine failures, robustness

Programming Model
Input and output: each a set of key/value pairs
The programmer specifies two functions:
map(in_key, in_value) -> list(out_key, intermediate_value)
  Processes an input key/value pair
  Produces a set of intermediate pairs
reduce(out_key, list(intermediate_value)) -> list(out_value)
  Combines all intermediate values for a particular key
  Produces a set of merged output values (usually just one)

Example
Page 1: the weather is good
Page 2: today is good
Page 3: good weather is good

Map Output
Worker 1: (the 1), (weather 1), (is 1), (good 1)
Worker 2: (today 1), (is 1), (good 1)
Worker 3: (good 1), (weather 1), (is 1), (good 1)

Reduce Input
Worker 1: (the 1)
Worker 2: (is 1), (is 1), (is 1)
Worker 3: (weather 1), (weather 1)
Worker 4: (today 1)
Worker 5: (good 1), (good 1), (good 1), (good 1)

Reduce Output
Worker 1: (the 1)
Worker 2: (is 3)
Worker 3: (weather 2)
Worker 4: (today 1)
Worker 5: (good 4)
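The whole pipeline fits in a few lines. Below is a minimal, single-process Python sketch of this word-count example (the function names and the in-memory grouping are illustrative; a real MapReduce runtime distributes these steps across workers):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """map(in_key, in_value) -> list of (out_key, intermediate_value)."""
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    """reduce(out_key, list of intermediate_value) -> list of out_value."""
    return [sum(counts)]

pages = {
    "page1": "the weather is good",
    "page2": "today is good",
    "page3": "good weather is good",
}

# Map phase: every page is processed independently (one worker per page).
intermediate = defaultdict(list)
for doc_id, text in pages.items():
    for word, count in map_fn(doc_id, text):
        intermediate[word].append(count)

# The shuffle is implicit in the grouping above; reduce combines per key.
for word in sorted(intermediate):
    print(word, reduce_fn(word, intermediate[word])[0])
# good 4, is 3, the 1, today 1, weather 2
```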


Fault Tolerance
On worker failure:
  Detect failure via periodic heartbeats (pings)
  Re-execute completed and in-progress map tasks
  Re-execute in-progress reduce tasks
  Task completion is committed through the master
On master failure:
  Could be handled, but isn't yet (master failure is unlikely)

MapReduce Transparencies
Parallelization
Fault tolerance
Locality optimization
Load balancing

MapReduce is suitable for your task if:
You have a cluster
You are working with a large dataset
You are working with independent data (or data assumed independent)
The computation can be cast into map and reduce steps

References
Original paper: J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, 2008, pp. 107-113 (http://labs.google.com/papers/mapreduce.html)
Wikipedia: http://en.wikipedia.org/wiki/MapReduce
Hadoop, MapReduce in Java: http://lucene.apache.org/hadoop/
Starfish, MapReduce in Ruby: http://rufy.com/starfish/

[Outline slide repeats; next section: Problem Description and Formalization]

Motivations
Industry-wide shift to multicore
No good framework for parallelizing ML algorithms
Goal: develop a general and exact technique for parallel programming of a large class of ML algorithms on multicore processors

Idea
Statistical Query Model → Summation Form → MapReduce

[Outline slide repeats; next section: Statistical Query Model and Summation Form]

Valiant Model [Valiant'84]
x is the input; y is a function of x that we want to learn
In the Valiant model, the learning algorithm uses randomly drawn examples <x, y> to learn the target function

Statistical Query Model [Kearns'98]
A restriction of the Valiant model
The learning algorithm uses aggregates over the examples, not the individual examples
Given a function f(x, y) over instances (data points x and labels y), a statistical oracle returns an estimate of the expectation of f(x, y)
Any algorithm that computes gradients or sufficient statistics over f(x, y) fits this model
Typically this is achieved by summing over the data

Summation Form
Aggregate over the data:
Divide the data set into pieces
Compute the aggregate on each core
Combine all results at the end
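A minimal sketch of that divide/compute/combine pattern in Python, assuming a placeholder statistic f(x, y) (in the paper this would be a gradient or sufficient statistic of the learning algorithm):

```python
from multiprocessing import Pool

def f(x, y):
    # Placeholder statistic; stands in for a gradient or sufficient statistic.
    return x * y

def partial_sum(chunk):
    # Each core aggregates over its own piece of the data.
    return sum(f(x, y) for x, y in chunk)

if __name__ == "__main__":
    data = [(i, i % 3) for i in range(1_000_000)]
    n_cores = 4
    chunks = [data[i::n_cores] for i in range(n_cores)]
    with Pool(n_cores) as pool:
        partials = pool.map(partial_sum, chunks)
    total = sum(partials)  # combine all results at the end
```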

Example: Linear Regression
Model: y ≈ θᵀx
Goal: find θ that minimizes Σᵢ (θᵀxᵢ − yᵢ)²
Solution: given m examples (x1, y1), (x2, y2), …, (xm, ym), write the matrix X with x1, …, xm as rows and the column vector Y = (y1, y2, …, ym)ᵀ; then θ* = (XᵀX)⁻¹XᵀY
Parallel computation: A = XᵀX = Σᵢ xᵢxᵢᵀ and b = XᵀY = Σᵢ xᵢyᵢ are sums over the examples, so each core computes partial sums over its piece of the data, and the reducer adds the partials and solves Aθ = b
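As a sketch (NumPy; the function names are illustrative): each mapper emits the partial A and b for its chunk, and the reducer adds them and solves the normal equations.

```python
import numpy as np

def map_stats(X_chunk, y_chunk):
    # Each mapper emits its partial sufficient statistics.
    A = X_chunk.T @ X_chunk        # sum of x_i x_i^T over the chunk
    b = X_chunk.T @ y_chunk        # sum of x_i y_i over the chunk
    return A, b

def reduce_stats(partials):
    # The reducer just adds the partial matrices/vectors.
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A, b)   # theta* = (X^T X)^{-1} X^T Y

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.01 * rng.normal(size=10_000)

chunks = zip(np.array_split(X, 4), np.array_split(y, 4))
theta = reduce_stats([map_stats(Xc, yc) for Xc, yc in chunks])
```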

[Outline slide repeats; next section: Architecture (inspired by MapReduce)]

Lighter-Weight MapReduce for Multicore
[Architecture figure not preserved in the transcript: the algorithm supplies map and reduce functions; the engine splits the data across cores, runs the mappers in parallel, and a reducer aggregates their partial results]
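A minimal sketch of what such a lighter-weight, in-process engine might look like (an assumption about the shape of the design, not the authors' code): map-reduce run across cores of one machine rather than across a cluster.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import reduce

def run_mapreduce(data, map_fn, reduce_fn, n_workers=4):
    """Run one map-reduce pass in-process: split the data across workers,
    map each chunk to a partial result, then fold the partials together."""
    # map_fn must be a module-level function so it can be pickled
    # for the worker processes.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_fn, chunks))
    return reduce(reduce_fn, partials)
```

Each algorithm below then only has to supply its own map_fn (partial statistics over a chunk) and reduce_fn (combine two partials); the engine stays the same.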

[Outline slide repeats; next section: Adopted ML Algorithms]

Locally Weighted Linear Regression (LWLR)
Solve Aθ = b, where A = Σᵢ wᵢ(xᵢxᵢᵀ) and b = Σᵢ wᵢ(xᵢyᵢ)
When all wᵢ = 1, this reduces to ordinary least squares
One set of mappers computes partial sums of A; the other set computes partial sums of b
Two reducers aggregate A and b
Finally, solve Aθ = b for the parameters
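A sketch of the two mapper groups (a weighted variant of the linear-regression statistics above; names illustrative):

```python
import numpy as np

def map_A(X_chunk, w_chunk):
    # First mapper group: partial sum of w_i * x_i x_i^T over the chunk.
    return (w_chunk[:, None] * X_chunk).T @ X_chunk

def map_b(X_chunk, y_chunk, w_chunk):
    # Second mapper group: partial sum of w_i * x_i y_i over the chunk.
    return X_chunk.T @ (w_chunk * y_chunk)

# Two reducers add the partial A's and b's; then theta = np.linalg.solve(A, b).
```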

Naïve Bayes (NB)
Goal: estimate P(xj = k | y = 1), P(xj = k | y = 0), and P(y)
Computation: count the occurrences of (xj = k, y = 1) and (xj = k, y = 0), and the occurrences of y = 1 and y = 0
Mappers: each counts over a subgroup of the training samples
Reducer: aggregates the intermediate counts and calculates the final probabilities
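A sketch of the counting mapper and reducer (Python Counter; the tuple keys are an illustrative encoding):

```python
from collections import Counter

def mapper(samples):
    # samples: list of (x, y) with a discrete feature vector x and label y.
    counts = Counter()
    for x, y in samples:
        counts[("y", y)] += 1                # occurrences of each label
        for j, xj in enumerate(x):
            counts[(j, xj, y)] += 1          # occurrences of (x_j = k, y)
    return counts

def reducer(partial_counts):
    total = Counter()
    for c in partial_counts:
        total.update(c)                      # aggregate intermediate counts
    return total                             # then normalize into P(x_j=k|y), P(y)

# chunks = split(training_set); counts = reducer(mapper(c) for c in chunks)
```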

Gaussian Discriminant Analysis (GDA)
Goal: classify x into classes of y, assuming each class is Gaussian with its own mean but a shared covariance
Computation: the class priors, class means, and shared covariance are all sums over the training samples
Mappers: compute the partial sums for a subgroup of training samples
Reducer: aggregates the intermediate results
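A sketch of the mapper's partial sufficient statistics (NumPy, illustrative): class counts, per-class sums, and the pooled second moment, from which the reducer can recover the means and the shared covariance.

```python
import numpy as np

def map_gda(X_chunk, y_chunk, n_classes):
    # Partial sufficient statistics over one chunk: all plain sums.
    d = X_chunk.shape[1]
    counts = np.zeros(n_classes)
    sums = np.zeros((n_classes, d))
    second = X_chunk.T @ X_chunk            # sum of x x^T over the chunk
    for x, y in zip(X_chunk, y_chunk):
        counts[y] += 1
        sums[y] += x
    return counts, sums, second

# Reducer: add the partials, then mu_c = sums[c] / counts[c] and the shared
# covariance Sigma = (second - sum_c counts[c] * outer(mu_c, mu_c)) / m.
```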

K-means
Compute the Euclidean distance between sample vectors and centroids
Recalculate the centroids
Divide the computation into subgroups handled by map-reduce
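A sketch of one K-means iteration in this style (NumPy, illustrative names): mappers assign their samples to the nearest centroid and emit partial sums and counts; the reducer adds them and recomputes the centroids.

```python
import numpy as np

def map_assign(X_chunk, centroids):
    # Assign each sample to its nearest centroid; emit partial sums/counts.
    d2 = ((X_chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    k, dim = centroids.shape
    sums = np.zeros((k, dim))
    counts = np.zeros(k)
    for x, label in zip(X_chunk, labels):
        sums[label] += x
        counts[label] += 1
    return sums, counts

def reduce_centroids(partials):
    # Add the partial sums/counts and recompute the centroids.
    sums = sum(p[0] for p in partials)
    counts = sum(p[1] for p in partials)
    return sums / np.maximum(counts, 1)[:, None]   # guard empty clusters
```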

Neural Network (NN)
Back-propagation on a three-layer network: input layer, hidden layer, and two output nodes
Goal: compute the weights of the NN by back-propagation
Mapper: propagates its set of training data through the network and back-propagates the errors to calculate a partial gradient for the weights
Reducer: sums the partial gradients and performs a batch gradient-descent update of the weights
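A sketch of the gradient mapper and reducer, assuming a hypothetical net object that exposes a list of weight matrices and a per-sample backprop(x, y) returning the matching per-sample gradients:

```python
import numpy as np

def map_gradient(net, X_chunk, y_chunk):
    # Propagate the chunk's samples through the network and accumulate
    # the back-propagated gradients into one partial gradient per layer.
    grads = [np.zeros_like(W) for W in net.weights]
    for x, y in zip(X_chunk, y_chunk):
        for g, g_sample in zip(grads, net.backprop(x, y)):  # hypothetical API
            g += g_sample
    return grads

def reduce_gradient(net, partials, lr=0.01):
    # Sum the partial gradients and take one batch gradient-descent step.
    for i, W in enumerate(net.weights):
        W -= lr * sum(p[i] for p in partials)
```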

Principal Component Analysis (PCA)
Compute the principal eigenvectors of the covariance matrix Σ = (1/m) Σᵢ xᵢxᵢᵀ − μμᵀ
Both the second-moment term and the mean vector μ = (1/m) Σᵢ xᵢ are sums over the data, so the summation form applies directly and can be computed with map-reduce
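A sketch of the moment sums (NumPy, illustrative): mappers emit partial first and second moments; the reducer forms the covariance and extracts its eigenvectors.

```python
import numpy as np

def map_moments(X_chunk):
    # Partial first and second moments for one chunk.
    return X_chunk.sum(axis=0), X_chunk.T @ X_chunk, len(X_chunk)

def reduce_covariance(partials):
    s = sum(p[0] for p in partials)     # sum of x_i
    S = sum(p[1] for p in partials)     # sum of x_i x_i^T
    m = sum(p[2] for p in partials)
    mu = s / m
    cov = S / m - np.outer(mu, mu)      # (1/m) sum x x^T - mu mu^T
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, ::-1]             # columns by decreasing eigenvalue
```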

Other Algorithms
Logistic Regression
Independent Component Analysis
Support Vector Machine
Expectation-Maximization (EM)

Time Complexity
Basically: linear speedup with an increasing number of cores

[Outline slide repeats; next section: Experiments]

Setup
Compare the map-reduce version against the sequential version on 10 data sets
Machines:
Dual-processor Pentium-III 700 MHz, 1 GB RAM
16-way Sun Enterprise 6000

Dual-Processor Speedups [speedup chart not preserved in the transcript]

Speedup for 2-16 Processors [chart not preserved in the transcript]
Legend: bold = average, error bars = max/min, dashed = variance

Multicore Simulator Results
Multicore simulator over the sensor dataset; the best results were reported for NN and LR:
NN: 16 cores 15.5x, 32 cores 29x, 64 cores 54x
LR: 16 cores 15x, 32 cores 29.5x, 64 cores 53x
Possibly because of lower communication costs

Conclusion
Parallelize the summation forms:
NO change in the underlying algorithm
NO approximation
Use map-reduce on a single machine

[Outline slide repeats; next section: Future of MapReduce for Machine Learning]

Apache Mahout
An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License (http://mahout.apache.org)
Why Mahout? Community; documentation and examples; scalability; the Apache License; not purely research-oriented
(What is a "mahout"? http://dictionary.reference.com/browse/mahout)

Focus: Scalable
Goal: be as fast and efficient as possible given the intrinsic design of each algorithm
Some algorithms won't scale to massive machine clusters
Others fit logically on a MapReduce framework like Apache Hadoop
Still others will need other distributed programming models
Most Mahout implementations are MapReduce-enabled; this is a work in progress

A sampling of who uses Mahout: https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

Focus: Machine Learning
[Diagram of the Mahout stack: application examples on top; algorithm areas (genetic algorithms, frequent pattern mining, classification, clustering, recommenders); utilities (Lucene vectorizer, math with vectors/matrices/SVD, primitive collections); Apache Hadoop underneath]
http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

Resources
"Mahout in Action" by Owen, Anil, Dunning, and Friedman (http://awe.sm/5FyNe)
"Introducing Apache Mahout" (http://www.ibm.com/developerworks/java/library/j-mahout/)
"Taming Text" by Ingersoll, Morton, and Farris
"Programming Collective Intelligence" by Toby Segaran
"Data Mining: Practical Machine Learning Tools and Techniques" by Ian H. Witten and Eibe Frank
"Data-Intensive Text Processing with MapReduce" by Jimmy Lin and Chris Dyer

[Outline slide repeats; next section: Discussion]

Discussion
What are the alternatives to MapReduce?
What to do if the "summation form" is not applicable?
Does dataset quality affect the implementation and performance of parallel machine learning algorithms?
What is the future of multicore processors?