Map-Reduce for Machine Learning on Multicore C. Chu, S.K. Kim, Y. Lin, Y.Y. Yu, G. Bradski, A.Y. Ng, K. Olukotun (NIPS 2006) Shimin Chen Big Data Reading Group

Motivations Industry-wide shift to multicore. No good framework for parallelizing ML algorithms. Goal: develop a general and exact technique for parallel programming of a large class of ML algorithms for multicore processors.

Idea Statistical Query Model Summation Form Map-Reduce

Outline Introduction Statistical Query Model and Summation Form Architecture (inspired by Map-Reduce) Adopted ML Algorithms Experiments Conclusion

Valiant Model [Valiant’84] x is the input; y is a function of x that we want to learn. In the Valiant model, the learning algorithm uses randomly drawn examples to learn the target function.

Statistical Query Model [Kearns’98] A restriction of the Valiant model: the learning algorithm uses aggregates over the examples, not the individual examples. More precisely, the learning algorithm interacts with a statistical query oracle: the algorithm asks about a function f(x,y), and the oracle returns an estimate of the expectation of f(x,y) over the data.

Summation Form Aggregate over the data: divide the data set into pieces, compute the aggregate on each core, and combine all the results at the end.
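A minimal sketch of the summation form, assuming plain Python; in the paper each chunk would be processed on a separate core, but the decomposition of the aggregate is the same. The function names and the generic aggregate f are illustrative, not from the paper's code.

```python
# A summation-form aggregate: sum_i f(x_i, y_i), computed as partial sums over
# chunks (the "map" phase) that are then combined (the "reduce" phase).
def summation_form(chunks, f):
    partials = [sum(f(x, y) for x, y in chunk) for chunk in chunks]  # map phase
    return sum(partials)                                             # reduce phase

# Example: count positive labels across all chunks.
# total_positives = summation_form(chunks, lambda x, y: 1 if y == 1 else 0)
```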

Example: Linear Regression using Least Squares. Model: y ≈ θᵀx. Goal: minimize Σᵢ (θᵀxᵢ − yᵢ)². Solution: given m examples (x1, y1), (x2, y2), …, (xm, ym), write a matrix X with x1, …, xm as rows and a column vector Y = (y1, y2, …, ym)ᵀ; then the solution is θ* = (XᵀX)⁻¹XᵀY. Parallel computation: XᵀX = Σᵢ xᵢxᵢᵀ and XᵀY = Σᵢ xᵢyᵢ are summation forms, so cut the data into m/num_processors pieces and sum the partial results.
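A hedged sketch of this parallel least-squares computation, assuming Python with numpy and the standard multiprocessing module; the chunking scheme, function names, and worker count are illustrative rather than the paper's implementation.

```python
import numpy as np
from multiprocessing import Pool

def normal_eq_pieces(chunk):
    """Mapper: partial A = Xc^T Xc and b = Xc^T yc for one chunk of the data."""
    Xc, yc = chunk
    return Xc.T @ Xc, Xc.T @ yc

def parallel_least_squares(X, y, num_workers=4):
    chunks = list(zip(np.array_split(X, num_workers),
                      np.array_split(y, num_workers)))
    with Pool(num_workers) as pool:
        parts = pool.map(normal_eq_pieces, chunks)   # map phase, one chunk per core
    A = sum(p[0] for p in parts)                     # reduce: A = X^T X
    b = sum(p[1] for p in parts)                     # reduce: b = X^T Y
    return np.linalg.solve(A, b)                     # theta* = A^{-1} b
```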

Outline Introduction Statistical Query Model and Summation Form Architecture (inspired by Map-Reduce) Adopted ML Algorithms Experiments Conclusion

Lighter Weight Map-Reduce for Multicore

Outline Introduction Statistical Query Model and Summation Form Architecture (inspired by Map-Reduce) Adopted ML Algorithms Experiments Conclusion

Locally Weighted Linear Regression (LWLR) Mappers: one set computes A = Σᵢ wᵢ xᵢ xᵢᵀ, the other set computes b = Σᵢ wᵢ xᵢ yᵢ. Two reducers, one for A and one for b. Finally compute the solution by solving Aθ = b. When all wᵢ = 1, this reduces to ordinary least squares.
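A hedged numpy sketch of the two summation forms above; the function names are illustrative, and the loop over chunks stands in for the two sets of mappers and the two reducers.

```python
import numpy as np

def lwlr_map_A(Xc, yc, wc):
    """Partial A = sum_i w_i x_i x_i^T over one chunk."""
    return (wc[:, None] * Xc).T @ Xc

def lwlr_map_b(Xc, yc, wc):
    """Partial b = sum_i w_i x_i y_i over one chunk."""
    return Xc.T @ (wc * yc)

def lwlr_solve(chunks):
    """Reducers sum the partial A's and b's; then solve A theta = b."""
    A = sum(lwlr_map_A(Xc, yc, wc) for Xc, yc, wc in chunks)
    b = sum(lwlr_map_b(Xc, yc, wc) for Xc, yc, wc in chunks)
    return np.linalg.solve(A, b)
```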

Naïve Bayes (NB) Goal: estimate P(xj=k|y=1) and P(xj=k|y=0). Computation: count the occurrences of (xj=k, y=1) and (xj=k, y=0), count the occurrences of (y=1) and (y=0), then compute the ratios. Mappers: each counts over a subgroup of training samples. Reducer: aggregates the intermediate counts and calculates the final result.
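A hedged sketch of the counting mapper and aggregating reducer, assuming discrete feature values and plain Python; the function names and data layout (chunks of (x, y) pairs) are illustrative.

```python
from collections import Counter

def nb_map(chunk):
    """Mapper: count (feature j = value k, label y) pairs and label totals in one chunk."""
    joint, labels = Counter(), Counter()
    for x, y in chunk:
        labels[y] += 1
        for j, k in enumerate(x):
            joint[(j, k, y)] += 1
    return joint, labels

def nb_reduce(partials):
    """Reducer: aggregate the partial counts, then divide to get P(x_j = k | y)."""
    joint, labels = Counter(), Counter()
    for j_part, l_part in partials:
        joint.update(j_part)
        labels.update(l_part)
    return {key: count / labels[key[2]] for key, count in joint.items()}
```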

Gaussian Discriminative Analysis (GDA) Goal: classify x into classes of y, assuming each class-conditional distribution p(x|y) is a Gaussian with its own mean but a shared covariance. Computation: the class priors, per-class means, and shared covariance are all summation forms. Mappers: compute the partial sums (class counts, per-class sums of x, sums of x xᵀ) for a subset of training samples. Reducer: aggregates the intermediate results into the final parameters.
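A hedged numpy sketch of GDA's sufficient statistics as summation forms: the mapper emits per-chunk class counts, per-class sums, and the chunk's Σ x xᵀ; the reducer assembles priors, means, and the shared covariance. Names and the integer label encoding are illustrative.

```python
import numpy as np

def gda_map(Xc, yc, num_classes):
    """Mapper: class counts, per-class sums of x, and sum of x x^T for one chunk."""
    d = Xc.shape[1]
    counts = np.zeros(num_classes)
    sums = np.zeros((num_classes, d))
    for x, y in zip(Xc, yc):
        counts[y] += 1
        sums[y] += x
    return counts, sums, Xc.T @ Xc

def gda_reduce(partials):
    """Reducer: combine partial statistics into priors, means, and shared covariance."""
    counts = sum(p[0] for p in partials)
    sums = sum(p[1] for p in partials)
    outer = sum(p[2] for p in partials)
    m = counts.sum()
    priors = counts / m                       # phi_c = count(c) / m
    means = sums / counts[:, None]            # mu_c
    # Sigma = (1/m) * (sum_i x_i x_i^T - sum_c n_c mu_c mu_c^T)
    sigma = (outer - (counts[:, None] * means).T @ means) / m
    return priors, means, sigma
```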

K-means Computing the Euclidean distance between sample vectors and centroids, and recalculating the centroids: divide the computation into subgroups to be handled by map-reduce.
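A hedged sketch of one k-means iteration in this style: each mapper assigns its chunk of points to the nearest centroid and emits per-cluster partial sums and counts, and the reducer recomputes the centroids. Function names are illustrative.

```python
import numpy as np

def kmeans_map(Xc, centroids):
    """Mapper: assign points to the nearest centroid; emit per-cluster sums and counts."""
    k, d = centroids.shape
    dists = ((Xc[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for j in range(k):
        members = Xc[assign == j]
        sums[j] = members.sum(axis=0)
        counts[j] = len(members)
    return sums, counts

def kmeans_reduce(partials):
    """Reducer: combine partial sums and counts, then recompute the centroids."""
    sums = sum(p[0] for p in partials)
    counts = sum(p[1] for p in partials)
    return sums / np.maximum(counts, 1)[:, None]   # guard against empty clusters
```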

Expectation Maximization (EM) E-step: computes probabilities (or expected counts) per training example. M-step: combines these values to update the parameters. Both steps can be parallelized using map-reduce.
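A hedged sketch of one EM iteration for a mixture of spherical unit-variance Gaussians (a simplification for brevity): the E-step runs in the mappers, the M-step in the reducer. The restricted model and all names are illustrative, not the paper's formulation.

```python
import numpy as np

def em_map(Xc, means, weights):
    """E-step on one chunk: responsibilities, then partial sufficient statistics."""
    sq = ((Xc[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    resp = weights[None, :] * np.exp(-0.5 * sq)      # unnormalized responsibilities
    resp /= resp.sum(axis=1, keepdims=True)
    return resp.sum(axis=0), resp.T @ Xc             # per-component mass, weighted sums

def em_reduce(partials, m):
    """M-step: combine partial statistics; update the means and mixing weights."""
    mass = sum(p[0] for p in partials)
    weighted_sums = sum(p[1] for p in partials)
    return weighted_sums / mass[:, None], mass / m   # new means, new mixing weights
```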

Neural Network (NN) Back-propagation on a 3-layer network: input layer, middle (hidden) layer, and 2 output nodes. Goal: compute the weights of the NN by back-propagation. Mapper: propagates its set of training data through the network and back-propagates the errors to calculate a partial gradient for the weights. Reducer: sums the partial gradients and does a batch gradient descent step to update the weights.
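A hedged numpy sketch of this scheme: each mapper does a forward and backward pass over its chunk and returns partial weight gradients, and the reducer sums them and applies one batch gradient-descent update. The sigmoid activations, squared-error loss, learning rate, and omission of bias terms are illustrative simplifications.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_map(Xc, Yc, W1, W2):
    """Mapper: forward pass, then back-propagate errors to get partial gradients."""
    H = sigmoid(Xc @ W1)            # hidden-layer activations
    O = sigmoid(H @ W2)             # output activations (2 output nodes)
    dO = (O - Yc) * O * (1 - O)     # output-layer error (squared-error loss)
    dH = (dO @ W2.T) * H * (1 - H)  # back-propagated hidden-layer error
    return Xc.T @ dH, H.T @ dO      # partial gradients for W1 and W2

def nn_reduce(partials, W1, W2, lr=0.1):
    """Reducer: sum partial gradients and take one batch gradient-descent step."""
    g1 = sum(p[0] for p in partials)
    g2 = sum(p[1] for p in partials)
    return W1 - lr * g1, W2 - lr * g2
```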

Principal Components Analysis (PCA) Compute the principal eigenvectors of the covariance matrix. The covariance matrix is built from Σᵢ xᵢxᵢᵀ and Σᵢ xᵢ, both summation forms, so we can compute it using map-reduce.
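A hedged sketch of that decomposition: mappers emit per-chunk outer-product sums, vector sums, and counts; the reducer assembles the covariance matrix and takes its top eigenvectors. Names and the use of numpy's eigh are illustrative.

```python
import numpy as np

def pca_map(Xc):
    """Mapper: partial sum of x x^T, partial sum of x, and example count for one chunk."""
    return Xc.T @ Xc, Xc.sum(axis=0), len(Xc)

def pca_reduce(partials, num_components):
    """Reducer: build the covariance matrix and extract its principal eigenvectors."""
    outer = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    m = sum(p[2] for p in partials)
    mean = total / m
    cov = outer / m - np.outer(mean, mean)     # (1/m) sum_i x_i x_i^T - mu mu^T
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    return eigvecs[:, -num_components:]        # top principal eigenvectors
```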

Other Algorithms Logistic Regression Independent Component Analysis Support Vector Machine

Time Complexity

Outline Introduction Statistical Query Model and Summation Form Architecture (inspired by Map-Reduce) Adopted ML Algorithms Experiments Conclusion

Setup Compare the map-reduce version and the sequential version on 10 data sets. Machines: dual-processor Pentium-III 700MHz with 1GB RAM, and a 16-way Sun Enterprise 6000 (these are SMPs, not multicore).

Dual-Processor SpeedUps

2-16 processor speedups More data in the paper

Multicore Simulator Results The paper has a paragraph on this; basically, it says the results are better than on the multiprocessor machines, possibly because of lower communication cost.

Conclusion Parallelize summation forms Use map-reduce on a single machine