PETUUM: A New Platform for Distributed Machine Learning on Big Data

PETUUM: A New Platform for Distributed Machine Learning on Big Data
Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, Yaoliang Yu

What they think… Machine learning is becoming a primary mechanism for extracting information from data, so ML methods need to scale beyond a single machine. Flickr, Instagram, and Facebook are anecdotally known to possess tens of billions of images. It is highly inefficient to process such big data sequentially, in a batch or stochastic fashion, with a typical iterative ML algorithm.

Despite the rapid development of many new ML models and algorithms aimed at scalable applications, adoption of these technologies remains generally unseen in the wider data mining, NLP, vision, and other application communities. The difficult migration from an academic implementation (small desktop PCs, small lab clusters) to a big, less predictable platform (a cloud or a corporate cluster) prevents ML models and algorithms from being widely applied.

Why build a new framework…? Find a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial-scale problems, using Big Models (up to hundreds of billions of parameters) on Big Data (up to terabytes or petabytes). Modern parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm, or even specialized graph-based execution that relies on graph representations of ML programs. But it remains difficult to find a universal platform applicable to a wide range of ML programs at scale.

Problems with other platforms
Hadoop: the simplicity of its MapReduce abstraction makes it difficult to exploit ML properties such as error tolerance, and its performance on many ML programs has been surpassed by alternatives.
Spark: does not offer fine-grained scheduling of computation and communication for fast and correct execution of advanced ML algorithms.
GraphLab and Pregel: efficiently partition graph-based models, but ML programs such as topic modeling and regression either do not admit obvious graph representations, or a graph representation may not be the most efficient choice.
In summary, existing systems each manifest a unique tradeoff among efficiency, correctness, programmability, and generality.

What they say…

Petuum in a nutshell… A distributed machine learning framework. It aims to provide a generic algorithmic and systems interface to large-scale machine learning, and it takes care of the difficult systems "plumbing work" and algorithmic acceleration. It simplifies the distributed implementation of ML programs, allowing users to focus on perfecting the model and on Big Data analytics. It runs efficiently at scale on research clusters and cloud compute services such as Amazon EC2 and Google GCE.

Their philosophy… Most ML programs are defined by an explicit objective function over data (e.g., a likelihood), and the goal is to attain optimality of this function in the space defined by the model parameters and other intermediate variables. Operational objectives such as fault tolerance and strong consistency are often treated as absolutely necessary, but an ML program's true goal is fast, efficient convergence to an optimal solution. Petuum is therefore built on an ML-centric, optimization-theoretic principle, as opposed to various operational objectives.

So how they built it…
Formalized ML algorithms as iterative-convergent programs:
stochastic gradient descent,
MCMC for determining point estimates in latent variable models,
coordinate descent,
variational methods for graphical models,
proximal optimization for structured sparsity problems, and others.
Identified the properties shared across all of these algorithms. The key lies in recognizing a clear dichotomy between DATA and MODEL. This inspired a bimodal approach to parallelism: data-parallel and model-parallel distribution and execution of a big ML program over a cluster of machines.

Data-parallel and model-parallel approach This approach exploits the unique statistical nature of ML algorithms, mainly three properties (a toy illustration of the first follows below):
Error tolerance: iterative-convergent algorithms are robust against limited errors in intermediate calculations.
Dynamic structural dependency: the changing correlation strengths between model parameters are critical to efficient parallelization.
Non-uniform convergence: the number of steps required for a parameter to converge can be highly skewed across parameters.
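As a toy illustration of error tolerance, the small self-contained Python experiment below, which is not from the paper, perturbs every gradient-descent update on a least-squares problem with small random error; the run still converges close to the true parameters, which is the intuition that lets Petuum relax strict synchronization (e.g., tolerate slightly stale parameter reads).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
true_A = rng.normal(size=10)
y = X @ true_A + 0.01 * rng.normal(size=1000)

def run(error_scale, steps=300, lr=0.1):
    """Gradient descent whose intermediate updates carry limited random error."""
    A = np.zeros(10)
    for _ in range(steps):
        grad = X.T @ (X @ A - y) / len(y)
        noise = error_scale * rng.normal(size=10)   # limited error in the intermediate calculation
        A -= lr * (grad + noise)
    return np.linalg.norm(A - true_A)

print("exact updates    :", run(error_scale=0.0))
print("perturbed updates:", run(error_scale=0.05))  # still ends up close to true_A
```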

Parallelization Strategies

Principled formulation of data and model parallelism Iterative-convergent ML algorithm: given data $\mathcal{D}$ and a loss function $\mathcal{L}$ (i.e., a fitness function such as a likelihood), a typical ML problem can be grounded as executing the following update equation iteratively, until the model state (i.e., parameters and/or latent variables) $A$ reaches some stopping criterion:

$$A^{(t)} = F\left(A^{(t-1)},\; \Delta_{\mathcal{L}}\left(A^{(t-1)}, \mathcal{D}\right)\right)$$

where $(t)$ denotes the iteration. The update function $\Delta_{\mathcal{L}}()$ (which improves the loss) performs computation on the data $\mathcal{D}$ and the model state $A$, and outputs intermediate results to be aggregated by $F()$.
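To make the template concrete, here is a minimal, self-contained Python sketch (not Petuum code) that instantiates the iterative-convergent update with plain gradient descent on a toy least-squares problem: delta() plays the role of $\Delta_{\mathcal{L}}()$ and F() the role of $F()$; all names and data are illustrative.

```python
import numpy as np

def delta(A, D, lr=0.1):
    """Update function Delta_L(): a gradient step direction for a least-squares loss."""
    X, y = D
    grad = X.T @ (X @ A - y) / len(y)
    return -lr * grad           # intermediate result, aggregated by F()

def F(A_prev, update):
    """Aggregation function F(): here it simply applies the additive update."""
    return A_prev + update

# Toy data: y = X @ true_A + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
true_A = rng.normal(size=10)
y = X @ true_A + 0.01 * rng.normal(size=1000)

A = np.zeros(10)                # model state A^(0)
for t in range(1, 201):         # iterate until a stopping criterion (here: a fixed count)
    A = F(A, delta(A, (X, y)))  # A^(t) = F(A^(t-1), Delta_L(A^(t-1), D))

print("parameter error:", np.linalg.norm(A - true_A))
```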

Data parallelism, in which the data is divided across machines

The data $\mathcal{D}$ is partitioned and assigned to computational workers. Assumption: the update function $\Delta()$ can be applied to each data partition independently, yielding the equation

$$A^{(t)} = F\left(A^{(t-1)},\; \sum_{p=1}^{P} \Delta\left(A^{(t-1)}, \mathcal{D}_p\right)\right)$$

The $\Delta()$ outputs are aggregated via summation, which is crucial because CPUs can produce updates much faster than they can be transmitted over the network. Each parallel worker contributes "equally".
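Below is a hedged extension of the same toy sketch to the data-parallel form: the data is split into P shards, each shard independently produces a $\Delta()$ output, and the outputs are aggregated by summation exactly as in the equation above (the workers are simulated serially; this is not Petuum's runtime).

```python
import numpy as np

P = 4  # number of parallel workers (simulated serially here)

def delta(A, D_p, n_total, lr=0.1):
    """Per-worker update on data shard D_p, scaled so that summing over all
    shards reproduces the full-data gradient step."""
    X, y = D_p
    return -lr * (X.T @ (X @ A - y)) / n_total

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
true_A = rng.normal(size=10)
y = X @ true_A + 0.01 * rng.normal(size=1000)

# Partition the data D into P shards, one per worker
shards = list(zip(np.array_split(X, P), np.array_split(y, P)))

A = np.zeros(10)
for t in range(1, 201):
    # Each worker applies Delta() to its own shard independently...
    updates = [delta(A, D_p, n_total=len(y)) for D_p in shards]
    # ...and the outputs are aggregated via summation:
    # A^(t) = F(A^(t-1), sum_p Delta(A^(t-1), D_p))
    A = A + sum(updates)

print("parameter error:", np.linalg.norm(A - true_A))
```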

Model parallelism, in which the ML model is divided across machines

The model $A$ is partitioned and assigned to workers. Unlike data parallelism, each update function $\Delta()$ also takes a scheduling function $S_p^{(t-1)}()$, which restricts $\Delta()$ to operate on a subset of the model parameters $A$:

$$A^{(t)} = F\left(A^{(t-1)},\; \left\{\Delta\left(A^{(t-1)},\, S_p^{(t-1)}\left(A^{(t-1)}\right)\right)\right\}_{p=1}^{P}\right)$$

Also unlike data parallelism, the model parameters $A_j$ are not independent of each other. Hence, the definition of model parallelism includes a global scheduling mechanism that selects carefully chosen parameters for parallel updating.
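And a matching toy sketch of the model-parallel form: a schedule() function, standing in for $S_p^{(t-1)}()$, assigns each simulated worker a disjoint block of parameter indices, each worker's $\Delta()$ touches only its block, and the per-block updates are merged into the new model state. The round-robin block split is purely illustrative; Petuum's actual scheduler selects parameters using dependency and convergence information.

```python
import numpy as np

P = 2   # number of workers (simulated serially)
d = 10  # model dimension

def schedule(t, p):
    """Scheduling function S_p^(t-1)(): choose which parameter indices worker p
    may update this iteration. Here: a fixed round-robin block split."""
    return np.array_split(np.arange(d), P)[p]

def delta(A, D, idx, lr=0.1):
    """Update restricted to the scheduled subset idx of the model parameters."""
    X, y = D
    grad = X[:, idx].T @ (X @ A - y) / len(y)
    return idx, -lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, d))
true_A = rng.normal(size=d)
y = X @ true_A + 0.01 * rng.normal(size=1000)
D = (X, y)

A = np.zeros(d)
for t in range(1, 301):
    # Each worker computes an update only for its scheduled block of A...
    block_updates = [delta(A, D, schedule(t, p)) for p in range(P)]
    # ...and F() merges the disjoint per-block updates into the new model state
    for idx, upd in block_updates:
        A[idx] += upd

print("parameter error:", np.linalg.norm(A - true_A))
```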

Petuum System Design

Petuum Programming Interface

Petuum Program Structure
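The paper describes a Petuum program in terms of three user-defined routines, roughly schedule() (choose which parameters to work on), push() (compute partial updates on local data), and pull() (aggregate updates into the shared parameter server). The skeleton below is a hedged, Python-style sketch of that structure only; the real system is a C++ framework, and every class, method name, and signature here is illustrative, not the actual Petuum API.

```python
# Hedged sketch of the schedule/push/pull program structure described for Petuum.

class ToyParamServer:
    """Stand-in for Petuum's shared parameter server (which also provides
    bounded-staleness consistency; omitted here)."""
    def __init__(self, dim):
        self.table = [1.0] * dim

    def get(self, idx):
        return [self.table[i] for i in idx]

    def inc(self, idx, deltas):
        for i, d in zip(idx, deltas):
            self.table[i] += d


class MyMLApp:
    """A user-defined ML program in the schedule/push/pull style."""
    def __init__(self, shards, ps):
        self.shards, self.ps = shards, ps

    def schedule(self, t):
        """Choose which model parameters to update this iteration."""
        return list(range(len(self.ps.table)))   # trivially: all of them

    def push(self, worker_id, t, param_idx):
        """Compute partial updates on this worker's data shard for the scheduled
        parameters (placeholder arithmetic stands in for a real gradient)."""
        A = self.ps.get(param_idx)
        return [-0.01 * a for a in A]            # placeholder deltas

    def pull(self, t, param_idx, all_deltas):
        """Aggregate the workers' updates into the shared parameter server."""
        for deltas in all_deltas:
            self.ps.inc(param_idx, deltas)


# Driver loop: in Petuum this orchestration runs distributed across machines.
ps = ToyParamServer(dim=10)
app = MyMLApp(shards=[None, None], ps=ps)
for t in range(3):
    idx = app.schedule(t)
    updates = [app.push(w, t, idx) for w in range(len(app.shards))]
    app.pull(t, idx, updates)
print("final parameters:", ps.table)
```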

Performance Petuum improves the performance of ML applications by making every update or iteration more effective, without compromising update speed. More effective updates mean a faster ML completion time.

Petuum topic model (LDA) Settings: 4.5GB dataset (8.2m docs, 737m tokens, 141k vocab, 1000 topics), 50 machines (800 cores), Petuum v1.0 vs YahooLDA, program completion = reached -5.8e9 log-likelihood

Petuum sparse logistic regression Settings: 29GB dataset (10m features, 50k samples), 8 machines (512 cores), Petuum v0.93 vs Shotgun Lasso, program completion = reached 0.5 loss function

Petuum multi-class logistic regression Settings: 20GB dataset (253k samples, 21k features, 5.4b nonzeros, 1000 classes), 4 machines (256 cores), Petuum v1.0 vs Synchronous parameter server, program completion = reached 0.0168 loss function

Some speed highlights
Logistic regression: learn a 10m-dimensional model from 30GB of sparse data in 20 minutes, on 8 machines with 16 cores each.
LDA topic model: learn 1k topics on 8m documents (140k unique words) in 17 minutes, on 25 machines with 16 cores each.
Matrix factorization (collaborative filtering): train on a 480k-by-20k matrix with rank 40 in 2 minutes, on 25 machines with 16 cores each.
Convolutional neural network built on Caffe: train AlexNet (60m parameters) in under 24 hours, on 8 machines with a Tesla K20 GPU each.
MedLDA supervised topic model: learn 1k topics on 1.1m documents (20 labels) in 85 minutes, on 20 machines with 12 cores each.
Multiclass logistic regression: train on the MNIST dataset (19GB, 8m samples, 784 features) in 6 minutes, on 8 machines with 16 cores each.

What it doesn't do Petuum is primarily about allowing ML practitioners to implement and experiment with new data-parallel and model-parallel ML algorithms on small-to-medium clusters. It lacks features that are necessary for clusters with ≥1000 machines, such as automatic recovery from machine failure. The experiments focused on clusters with 10-100 machines.

Thoughts Highly efficient and fast for its target users. Good library with 10+ algorithms. It admits to lacking features for clusters with ≥1000 machines, yet no experiment comes even near 500 machines; the largest uses only 50. Petuum is specifically designed for algorithms such as optimization and sampling algorithms, so it is not a silver bullet for all Big Data problems.

Questions?

Thank you Nitin Saroha