Carnegie Mellon. Machine Learning in the Cloud. Yucheng Low, Aapo Kyrola, Danny Bickson, Joey Gonzalez, Carlos Guestrin, Joe Hellerstein, David O’Hallaron

Machine Learning in the Real World: 24 hours of video uploaded to YouTube every minute, 13 million Wikipedia pages, 500 million Facebook users, 3.6 billion Flickr photos.

Exponential Parallelism: [Chart: processor speed (GHz) vs. release date. Sequential performance, once exponentially increasing, is now roughly constant, while parallel performance keeps increasing exponentially.]

Parallelism is Difficult: Wide array of different parallel architectures (GPUs, multicore, clusters, clouds, supercomputers), with different challenges for each architecture. High-level abstractions make things easier.

MapReduce – Map Phase: Embarrassingly parallel, independent computation on each CPU (CPU 1 … CPU 4). No communication needed.

MapReduce – Reduce Phase: Fold/aggregation of the map results (e.g., on CPU 1 and CPU 2).

MapReduce and ML: Excellent for large data-parallel tasks! Cross validation, feature extraction, and computing sufficient statistics all fit the MapReduce model. But machine learning spans a spectrum from data-parallel to complex parallel structure. Is there more to machine learning?
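
To make the data-parallel end of that spectrum concrete, here is a minimal sketch of computing sufficient statistics in the map/reduce style (the toy partitions and names such as map_partition are my own, not part of any MapReduce API):

    // Illustrative only: sufficient statistics (sum, sum of squares, count)
    // computed per partition ("map") and then folded together ("reduce").
    #include <iostream>
    #include <vector>

    struct Stats {                       // sufficient statistics for a Gaussian
      double sum = 0, sum_sq = 0;
      long count = 0;
    };

    // "Map": compute partial statistics for one data partition.
    Stats map_partition(const std::vector<double>& part) {
      Stats s;
      for (double x : part) { s.sum += x; s.sum_sq += x * x; ++s.count; }
      return s;
    }

    // "Reduce": fold two partial results into one.
    Stats reduce(Stats a, const Stats& b) {
      a.sum += b.sum; a.sum_sq += b.sum_sq; a.count += b.count;
      return a;
    }

    int main() {
      std::vector<std::vector<double>> partitions = {{1, 2, 3}, {4, 5}, {6, 7, 8, 9}};
      Stats total;
      for (const auto& p : partitions)            // each map task is independent,
        total = reduce(total, map_partition(p));  // so they could run on any CPU
      std::cout << "mean = " << total.sum / total.count << "\n";
    }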

Iterative Algorithms? We can implement iterative algorithms in MapReduce: each iteration runs a round of map tasks over the data, followed by a barrier. A single slow processor holds up the whole iteration at the barrier.

Iterative MapReduce: The system is not optimized for iteration. Every iteration pays a startup penalty and a disk penalty for re-reading and re-writing the data.

Iterative MapReduce: Often only a subset of the data needs computation in each (multi-phase) iteration, yet the system still processes everything and waits at the barrier.

MapReduce and ML (revisited): Excellent for large data-parallel tasks such as cross validation, feature extraction, and computing sufficient statistics. But is there more to machine learning than the data-parallel end of the spectrum?

Structured Problems: Interdependent computation is not map-reducible. Example problem: Will I be successful in research? Success depends on the success of others, so we may not be able to safely update neighboring nodes in parallel (e.g., Gibbs sampling).

Space of Problems: Sparse computation dependencies (the problem can be decomposed into local "computation kernels") and asynchronous iterative computation (repeated iterations over those local kernel computations).

Parallel Computing and ML: Not all algorithms are efficiently data parallel. Data-parallel tasks (cross validation, feature extraction, computing sufficient statistics) fit MapReduce; structured, iterative parallel tasks (belief propagation, SVMs and kernel methods, deep belief networks, neural networks, tensor factorization, learning graphical models, lasso, sampling) are where GraphLab fits.

Common Properties: 1) Sparse local computations and 2) iterative updates, as in expectation maximization, optimization, sampling, and belief propagation.

GraphLab Goals: Designed for ML needs (expresses data dependencies, iterative). Simplifies the design of parallel programs: abstracts away hardware issues and addresses multiple hardware architectures (multicore, distributed, GPU, and others).

GraphLab Goals: [Chart: model complexity vs. data size. Data-parallel tools today cover simple models on large data; the goal is complex models on large data.]

GraphLab Goals: [Same chart: GraphLab targets complex models on large data.]

Carnegie Mellon GraphLab A Domain-Specific Abstraction for Machine Learning

Everything on a Graph: A graph with data associated with every vertex and edge.

Update Functions: operations applied on a vertex that transform the data in the scope of that vertex.

Update Functions: An update function can schedule the computation of any other update function; scheduled computation is guaranteed to execute eventually. Schedulers include FIFO scheduling, prioritized scheduling, randomized, etc.
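
To make the scheduling idea concrete, here is a minimal single-process sketch (Scheduler, UpdateFn, and the FIFO queue are my own names, not the GraphLab API) in which an update applied to one vertex can schedule further vertices, and the engine runs until the queue drains:

    #include <deque>
    #include <functional>
    #include <iostream>
    #include <vector>

    struct Scheduler {                        // FIFO scheduling of vertex tasks
      std::deque<int> queue;
      void schedule(int vertex) { queue.push_back(vertex); }
    };

    using UpdateFn = std::function<void(int /*vertex*/, Scheduler&)>;

    int main() {
      std::vector<int> touch_count(4, 0);

      // Toy update function: touch this vertex, and reschedule the next vertex
      // until every vertex has been touched twice.
      UpdateFn update = [&](int v, Scheduler& sched) {
        ++touch_count[v];
        int next = (v + 1) % (int)touch_count.size();
        if (touch_count[next] < 2) sched.schedule(next);
      };

      Scheduler sched;
      sched.schedule(0);                          // initial task
      while (!sched.queue.empty()) {              // engine loop: everything that
        int v = sched.queue.front();              // was scheduled eventually runs
        sched.queue.pop_front();
        update(v, sched);
      }
      for (int c : touch_count) std::cout << c << " ";   // prints: 2 2 2 2
      std::cout << "\n";
    }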

Example: PageRank. Graph = the Web. Update function: multiply adjacent PageRank values by the edge weights and sum them to get the current vertex's PageRank. "Prioritized" PageRank computation: skip converged vertices.
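
As a concrete, self-contained illustration of this update rule plus the skip-converged-vertices idea (the three-page graph, damping factor, and tolerance are invented for the example; this is not GraphLab code):

    #include <cmath>
    #include <deque>
    #include <iostream>
    #include <vector>

    struct InEdge { int src; double weight; };    // weighted link into a vertex

    int main() {
      // Tiny 3-page web: incoming edges per vertex; weights sum to 1 per source.
      std::vector<std::vector<InEdge>> in = {
          {{1, 0.5}, {2, 1.0}}, {{0, 1.0}}, {{1, 0.5}}};
      std::vector<std::vector<int>> out = {{1}, {0, 2}, {0}};
      std::vector<double> rank(3, 1.0);
      const double damping = 0.85, tol = 1e-6;

      std::deque<int> queue = {0, 1, 2};          // vertices scheduled for update
      std::vector<bool> queued(3, true);
      while (!queue.empty()) {
        int v = queue.front(); queue.pop_front(); queued[v] = false;
        // Update rule from the slide: weighted sum of adjacent PageRank values.
        double sum = 0;
        for (const InEdge& e : in[v]) sum += e.weight * rank[e.src];
        double new_rank = (1 - damping) + damping * sum;
        double change = std::fabs(new_rank - rank[v]);
        rank[v] = new_rank;
        if (change > tol)                         // skip converged vertices:
          for (int u : out[v])                    // only reschedule neighbors
            if (!queued[u]) { queue.push_back(u); queued[u] = true; }
      }
      for (double r : rank) std::cout << r << " ";
      std::cout << "\n";
    }

Because each update reads its neighbors' latest values and only reschedules them when its own value changes noticeably, the work queue naturally empties once the ranks have converged.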

Example: K-Means Clustering. A (fully connected) bipartite graph between data vertices and cluster vertices. Cluster update: compute the average of the data connected by a "marked" edge. Data update: pick the closest cluster, mark that edge, and unmark the remaining edges.
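
A compact, self-contained sketch of this bipartite formulation on 1-D data (the data values, number of clusters, and names are my own; this is not GraphLab code). The marked array plays the role of the marked edges:

    #include <cmath>
    #include <iostream>
    #include <limits>
    #include <vector>

    int main() {
      std::vector<double> data = {1.0, 1.2, 0.8, 8.0, 8.3, 7.9};
      std::vector<double> clusters = {0.0, 5.0};        // cluster vertices
      std::vector<int> marked(data.size(), -1);         // which edge is "marked"

      for (int iter = 0; iter < 10; ++iter) {
        // Data update: each data vertex marks the edge to its closest cluster.
        for (size_t i = 0; i < data.size(); ++i) {
          double best = std::numeric_limits<double>::max();
          for (size_t c = 0; c < clusters.size(); ++c) {
            double d = std::fabs(data[i] - clusters[c]);
            if (d < best) { best = d; marked[i] = (int)c; }
          }
        }
        // Cluster update: each cluster vertex averages the data on marked edges.
        for (size_t c = 0; c < clusters.size(); ++c) {
          double sum = 0; int n = 0;
          for (size_t i = 0; i < data.size(); ++i)
            if (marked[i] == (int)c) { sum += data[i]; ++n; }
          if (n > 0) clusters[c] = sum / n;
        }
      }
      for (double c : clusters) std::cout << "center: " << c << "\n";
    }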

Example: MRF Sampling. Graph = the MRF. Update function: read the samples on adjacent vertices and the edge potentials, then compute a new sample for the current vertex.
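
For instance, a single Gibbs-style update for one vertex of a tiny binary pairwise MRF might look like the sketch below (the three-vertex chain and the coupling value are invented; this is not GraphLab code):

    #include <cmath>
    #include <iostream>
    #include <random>
    #include <vector>

    int main() {
      // 3-vertex chain 0-1-2 with current samples and a coupling strength.
      std::vector<int> sample = {0, 1, 0};
      std::vector<std::vector<int>> neighbors = {{1}, {0, 2}, {1}};
      double coupling = 0.8;   // edge potential: prefer equal neighboring states

      std::mt19937 rng(42);
      std::uniform_real_distribution<double> unif(0.0, 1.0);

      int v = 1;  // vertex being updated
      // Unnormalized log-probability of each state given the neighbors' samples.
      double logp[2] = {0.0, 0.0};
      for (int s = 0; s < 2; ++s)
        for (int u : neighbors[v])
          logp[s] += (s == sample[u]) ? coupling : -coupling;
      double p1 = 1.0 / (1.0 + std::exp(logp[0] - logp[1]));  // P(state = 1)
      sample[v] = (unif(rng) < p1) ? 1 : 0;

      std::cout << "new sample for vertex " << v << ": " << sample[v] << "\n";
    }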

Not Message Passing! The graph is a data structure; update functions perform parallel modifications to that data structure.

Safety: What if adjacent update functions occur simultaneously?

Importance of Consistency: Permit races? "Best-effort" computation? Is ML resilient to soft optimization? True for some algorithms, not true for many; it may work empirically on some datasets and fail on others.

Importance of Consistency: Many algorithms require strict consistency, or perform significantly better under strict consistency (e.g., alternating least squares).

Importance of Consistency: A fast ML algorithm development cycle (build, test, debug, tweak model) needs the framework to behave predictably and consistently and to avoid problems caused by non-determinism. Otherwise: is the execution wrong, or is the model wrong?

Sequential Consistency: GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of the update functions that produces the same result. [Diagram: a parallel schedule on CPU 1 and CPU 2 and an equivalent sequential schedule on CPU 1, over time.]

Sequential Consistency: This is the primary property of GraphLab and a formalization of the intuitive concept of a "correct program": computation does not read outdated data from the past, and computation does not read results of computation that occurs in the future.

Full Consistency Guaranteed safety for all update functions

Full Consistency: Parallel updates are only allowed on vertices at least two vertices apart, which reduces opportunities for parallelism.

Obtaining More Parallelism: Not all update functions will modify the entire scope! Belief propagation only uses edge data; Gibbs sampling only needs to read adjacent vertices.

Edge Consistency

Obtaining More Parallelism: "Map" operations, e.g., feature extraction on vertex data, touch only the vertex itself.

Vertex Consistency

Global Information What if we need global information? Sum of all the vertices? Algorithm Parameters? Sufficient Statistics?

Shared Variables: Global aggregation through the Sync operation, a global parallel reduction over the graph data. Synced variables are recomputed at defined intervals, and sync computation is sequentially consistent, which permits correct interleaving of syncs and updates. Examples: Sync: sum of vertex values; Sync: log-likelihood.
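
A minimal sketch of the "sum of vertex values" sync as a fold over the graph data (names are my own, not the GraphLab Sync API; the real engine runs the reduction in parallel and interleaves it with updates):

    #include <iostream>
    #include <numeric>
    #include <vector>

    struct VertexData { double value; };

    // fold: combine one vertex into the running aggregate
    double fold(double acc, const VertexData& v) { return acc + v.value; }

    int main() {
      std::vector<VertexData> graph = {{1.0}, {2.5}, {0.5}, {4.0}};

      // "Sync: sum of vertex values" as a reduction over the graph data.
      double sum = std::accumulate(graph.begin(), graph.end(), 0.0, fold);
      std::cout << "synced sum = " << sum << "\n";
    }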

Sequential Consistency: GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of the update functions and syncs that produces the same result.

Carnegie Mellon GraphLab in the Cloud

Moving towards the cloud… Purchasing and maintaining computers is very expensive, and most computing resources are seldom used (only for deadlines…). In the cloud, you can buy time, access hundreds or thousands of processors, and pay only for the resources you need.

Distributed GraphLab Implementation: A mixed multi-threaded / distributed implementation (each machine runs only one instance). Requires all data to be in memory; moves computation to the data. MPI for management plus TCP/IP for communication, with an asynchronous C++ RPC layer. Ran on 64 EC2 HPC nodes = 512 processors.

[Architecture diagram: each machine runs, on top of the underlying network, an RPC controller, a distributed graph, distributed locks, an execution engine with execution threads, and shared data in a cache-coherent distributed K-V store.]

Carnegie Mellon GraphLab RPC

Write distributed programs easily: asynchronous communication, multithreaded support, fast, scalable, easy to use (every machine runs the same binary).

Carnegie Mellon. I ♥ C++

Features: Easy RPC capabilities. One-way calls:
    rpc.remote_call([target machine ID], printf, "%s %d %d %d\n", "hello world", 1, 2, 3);
Requests (calls with a return value):
    vec = rpc.remote_request([target machine ID], sort_vector, vec);
where the remote function is, for example:
    std::vector<int>& sort_vector(std::vector<int>& v) {
      std::sort(v.begin(), v.end());
      return v;
    }

Features: Object instance context and MPI-like primitives: dc.barrier(), dc.gather(...), dc.send_to([target machine], [arbitrary object]), dc.recv_from([source machine], [arbitrary object ref]). [Layer diagram: K-V Object, RPC Controller, MPI-Like Safety.]
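
For comparison, the same kind of coordination written against plain MPI looks like the sketch below (a hedged example, independent of GraphLab RPC; compile with mpicxx and run with, e.g., mpirun -np 2). Note that MPI sends raw typed buffers rather than arbitrary objects:

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      MPI_Barrier(MPI_COMM_WORLD);               // analogue of dc.barrier()

      int value = rank * 10;
      if (rank == 0 && size > 1) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      // ~ dc.send_to
      } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                             // ~ dc.recv_from
        std::printf("rank 1 received %d\n", value);
      }

      MPI_Finalize();
      return 0;
    }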

Request Latency: [Benchmark chart; ping RTT = 90 µs shown for reference.]

One-Way Call Rate: [Benchmark chart; 1 Gbps physical peak shown for reference.]

Serialization Performance: 100,000 one-way calls, each carrying a vector of 10 × {"hello", 3.14, 100}. [Benchmark chart.]

Distributed Computing Challenges: Q1: How do we efficiently distribute the state (across a potentially varying number of machines)? Q2: How do we ensure sequential consistency? Keeping in mind: limited bandwidth, high latency, and performance.

Carnegie Mellon Distributed Graph

Two-stage Partitioning: 1) Initial overpartitioning of the graph, 2) generate the atom graph, 3) repartition as needed.

Ghosting: Ghost vertices are copies of neighboring vertices that live on remote machines. Ghost vertices/edges act as a cache for remote data; coherency is maintained using versioning, which decreases bandwidth utilization.
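
A small sketch of the version-based coherency idea: the ghost copy pulls the owner's data only when its cached version is stale (names are my own; this is not the GraphLab implementation):

    #include <cstdint>
    #include <iostream>

    struct VertexData { double value = 0; };

    struct Owned {                 // authoritative copy on the owning machine
      VertexData data;
      uint64_t version = 0;
      void write(double v) { data.value = v; ++version; }
    };

    struct Ghost {                 // cached copy on a remote machine
      VertexData data;
      uint64_t version = 0;
      // Refresh only if our cached version is stale; saves bandwidth otherwise.
      bool refresh(const Owned& owner) {
        if (version == owner.version) return false;  // cache hit, no transfer
        data = owner.data;                           // simulated network fetch
        version = owner.version;
        return true;
      }
    };

    int main() {
      Owned owner; Ghost ghost;
      owner.write(3.14);
      std::cout << "transferred: " << ghost.refresh(owner) << "\n";  // 1 (stale)
      std::cout << "transferred: " << ghost.refresh(owner) << "\n";  // 0 (current)
    }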

Carnegie Mellon Distributed Engine

Sequential consistency can be guaranteed through distributed locking, a direct analogue to the shared-memory implementation. To improve performance, the user provides some "expert knowledge" about the properties of the update function.

Full Consistency: The user says the update function modifies all data in scope. Limited opportunities for parallelism: acquire a write-lock on all vertices in scope.

Edge Consistency: The user says the update function only reads from adjacent vertices. More opportunities for parallelism: acquire a write-lock on the center vertex and read-locks on adjacent vertices.

Vertex Consistency: The user says the update function touches neither edges nor adjacent vertices. Maximum opportunities for parallelism: acquire a write-lock on the current vertex only.
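
As a shared-memory analogue of the edge-consistency rule, the sketch below write-locks the center vertex and read-locks its neighbors, acquiring the locks in a canonical (sorted) order so that overlapping updates cannot deadlock (a sketch under my own names, not the distributed lock implementation):

    #include <algorithm>
    #include <iostream>
    #include <shared_mutex>
    #include <vector>

    std::vector<std::shared_mutex> locks(5);   // one lock per vertex

    void edge_consistent_update(int center, std::vector<int> neighbors) {
      // Build the set of vertices to lock and sort it for a canonical order.
      std::vector<int> order = neighbors;
      order.push_back(center);
      std::sort(order.begin(), order.end());

      for (int v : order) {
        if (v == center) locks[v].lock();          // write-lock the center
        else             locks[v].lock_shared();   // read-lock each neighbor
      }

      // ... run the update function on vertex `center` here ...
      std::cout << "updated vertex " << center << "\n";

      for (int v : order) {
        if (v == center) locks[v].unlock();
        else             locks[v].unlock_shared();
      }
    }

    int main() { edge_consistent_update(2, {1, 3}); }

Acquiring locks in one global order is a standard way to avoid deadlock in this kind of scheme.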

Performance Enhancements: Latency hiding: "pipelining" of many more update function calls than CPUs (about a 1K-deep pipeline), which hides the latency of lock acquisition and cache synchronization. Lock strength reduction: a trick by which the number of locks can be decreased while still providing the same guarantees.
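
The pipelining idea can be sketched as follows: keep many update calls in flight waiting for their locks and data, and execute whichever become ready instead of stalling on one request at a time (the acquisition latency here is simulated with a counter; names are my own, not the engine's):

    #include <deque>
    #include <iostream>
    #include <random>

    struct PendingUpdate {
      int vertex;
      int remaining_latency;            // simulated time until locks are granted
      bool ready() const { return remaining_latency <= 0; }
    };

    int main() {
      std::mt19937 rng(1);
      std::uniform_int_distribution<int> lat(1, 5);

      std::deque<PendingUpdate> pipeline;
      for (int v = 0; v < 16; ++v)                    // far more than #CPUs
        pipeline.push_back({v, lat(rng)});

      int executed = 0;
      while (!pipeline.empty()) {
        for (auto& p : pipeline) --p.remaining_latency;   // latency elapses
        for (auto it = pipeline.begin(); it != pipeline.end();) {
          if (it->ready()) {                          // locks granted: run update
            std::cout << "update vertex " << it->vertex << "\n";
            ++executed;
            it = pipeline.erase(it);
          } else ++it;
        }
      }
      std::cout << executed << " updates executed\n";
    }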

Video Cosegmentation: find segments that "mean the same" across frames. Model: 10.5 million nodes, 31 million edges. Gaussian EM clustering + BP on a 3D grid.

Speedups

Video Segmentation

Chromatic Distributed Engine: Locking overhead is too high in high-degree models; can we satisfy sequential consistency in a simpler way? Observation: scheduling using vertex colorings can be used to automatically satisfy consistency.

Example: Edge consistency via a (distance-1) vertex coloring. Update functions can be executed on all vertices of the same color in parallel.

Example: Full consistency via a (distance-2) vertex coloring. Update functions can be executed on all vertices of the same color in parallel.

Example: Vertex consistency via a (distance-0) vertex coloring (all vertices may share one color). Update functions can be executed on all vertices of the same color in parallel.

Chromatic Distributed Engine: [Timeline: each machine executes tasks on all of its vertices of color 0, followed by data synchronization completion + barrier; then all vertices of color 1, followed by data synchronization completion + barrier; and so on.]
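
A single-process sketch of the chromatic execution pattern (tiny 4-cycle, toy update, my own names; the real engine runs each color across machines in parallel, with a barrier and data synchronization between colors):

    #include <iostream>
    #include <vector>

    int main() {
      // Tiny 4-cycle 0-1-2-3-0 with a valid 2-coloring.
      std::vector<std::vector<int>> adj = {{1, 3}, {0, 2}, {1, 3}, {0, 2}};
      std::vector<int> color = {0, 1, 0, 1};
      std::vector<double> value = {1, 2, 3, 4};
      int num_colors = 2;

      for (int iter = 0; iter < 3; ++iter) {
        for (int c = 0; c < num_colors; ++c) {
          // Same-color vertices share no edges, so under edge consistency these
          // updates could all run in parallel without locks.
          for (int v = 0; v < (int)adj.size(); ++v) {
            if (color[v] != c) continue;
            double sum = 0;
            for (int u : adj[v]) sum += value[u];
            value[v] = 0.5 * value[v] + 0.5 * sum / adj[v].size();  // toy update
          }
          // Barrier + data synchronization would go here in the distributed engine.
        }
      }
      for (double x : value) std::cout << x << " ";
      std::cout << "\n";
    }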

Experiments: Netflix collaborative filtering with alternating least squares matrix factorization. Model: 0.5 million nodes, 99 million edges. [Diagram: bipartite graph of Netflix users and movies; latent dimension d.]

Netflix Speedup: [Chart: speedup for increasing size d of the matrix factorization.]

Netflix

Experiments: Named entity recognition (part of Tom Mitchell's NELL project) using the CoEM algorithm on a web crawl. Model: 2 million nodes, 200 million edges. The graph is rather dense; a small number of vertices connect to almost all the vertices.

Named Entity Recognition (CoEM)

Named Entity Recognition (CoEM): bandwidth bound.

Named Entity Recognition (CoEM)

Future Work: Distributed GraphLab fault tolerance (enabling spot instances, hence cheaper); graphs using an off-memory store (disk/SSD); GraphLab as a database; self-optimized partitioning; fast data-to-graph construction primitives; GPU GraphLab? Supercomputer GraphLab?

Carnegie Mellon. Is GraphLab the Answer to (Life, the Universe, and Everything)? Probably not.

Carnegie Mellon. graphlab.ml.cmu.edu: the GraphLab parallel/distributed implementation, LGPL (highly probable switch to MPL in a few weeks). bickson.blogspot.com (the "Danny Bickson Marketing Agency"): very fast matrix factorization implementations, other examples, installation, comparisons, etc.

Carnegie Mellon. Questions? (Applications shown: Bayesian tensor factorization, Gibbs sampling, dynamic block Gibbs sampling, matrix factorization, lasso, SVM, belief propagation, PageRank, CoEM, SVD, and many others…)

Video Cosegmentation. Naïve idea: treat patches independently and use Gaussian EM clustering (on image features). E step: predict the membership of each patch given the cluster centers. M step: compute cluster centers given the memberships of each patch. This does not take relationships among patches into account!

Video Cosegmentation. Better idea: connect the patches using an MRF, with edge potentials set so that adjacent (spatially and temporally) patches prefer to be of the same cluster. Gaussian EM clustering with a twist: E step: build unary potentials for each patch using the cluster centers, then predict the membership of each patch using BP. M step: compute cluster centers given the memberships of each patch. [D. Batra, et al. iCoseg: Interactive co-segmentation with intelligent scribble guidance. CVPR 2010.]

Distributed Memory Programming APIs (MPI, Global Arrays, GASNet, ARMCI, etc.) …do not make it easy: synchronous computation; insufficient primitives for multi-threaded use; not exactly easy to use; fine only if all your data is an n-D array; direct remote pointer access with severe limitations depending on the system architecture.