1
Carnegie Mellon: Machine Learning in the Cloud. Yucheng Low, Aapo Kyrola, Danny Bickson, Joey Gonzalez, Carlos Guestrin, Joe Hellerstein, David O’Hallaron
2
Machine Learning in the Real World: 24 hours of video uploaded to YouTube every minute, 13 million Wikipedia pages, 500 million Facebook users, 3.6 billion Flickr photos.
3
Exponential Parallelism. [Chart: processor speed (GHz) vs. release date; sequential performance, once exponentially increasing, is now roughly constant, while parallel performance continues to increase exponentially.]
4
Parallelism is Difficult. Wide array of different parallel architectures (GPUs, multicore, clusters, clouds, supercomputers), with different challenges for each architecture. High-level abstractions make things easier.
5
MapReduce – Map Phase: embarrassingly parallel, independent computation with no communication needed. [Diagram: CPUs 1–4 each computing a value independently.]
6
MapReduce – Map Phase (continued): each CPU picks up further independent work; still no communication needed.
7
MapReduce – Map Phase (continued): a third batch of independent computations; still no communication needed.
8
MapReduce – Reduce Phase: fold/aggregation over the mapped values. [Diagram: CPUs 1–2 aggregating the map outputs.]
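The map/fold pattern above can be sketched in a few lines of plain C++ (an illustration with made-up values, not Hadoop or the MapReduce API): std::transform plays the role of the map phase and std::accumulate the role of the reduce/fold phase.

    // Minimal sketch of the map/fold pattern from the slides, in plain C++.
    // The input values are hypothetical; no communication is needed during map.
    #include <algorithm>
    #include <cmath>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<double> inputs = {12.9, 42.3, 21.3, 25.8};   // one record per "CPU"
        std::vector<double> mapped(inputs.size());

        // Map phase: each element is processed independently.
        std::transform(inputs.begin(), inputs.end(), mapped.begin(),
                       [](double x) { return std::sqrt(x); });

        // Reduce phase: fold/aggregate the mapped values.
        double total = std::accumulate(mapped.begin(), mapped.end(), 0.0);
        std::cout << "aggregate = " << total << "\n";
        return 0;
    }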
9
MapReduce and ML: excellent for large data-parallel tasks (cross validation, feature extraction, computing sufficient statistics). But is there more to machine learning than the data-parallel end of the spectrum, i.e., problems with complex parallel structure?
10
Iterative Algorithms? We can implement iterative algorithms in MapReduce. [Diagram: repeated Data → CPU phases with a barrier after each iteration; a single slow processor holds up every barrier.]
11
Iterative MapReduce: the system is not optimized for iteration. [Diagram: each iteration pays a startup penalty and a disk penalty.]
12
Iterative MapReduce: only a subset of the data needs computation in each (multi-phase) iteration. [Diagram: repeated Data → CPU phases separated by barriers.]
13
MapReduce and ML: excellent for large data-parallel tasks (cross validation, feature extraction, computing sufficient statistics). But is there more to machine learning than the data-parallel end of the spectrum, i.e., problems with complex parallel structure?
14
Structured Problems: interdependent computation is not map-reducible. Example problem: will I be successful in research? Success depends on the success of others, so we may not be able to safely update neighboring nodes in parallel [e.g., Gibbs sampling].
15
Space of Problems. Asynchronous iterative computation: repeated iterations over local kernel computations. Sparse computation dependencies: the problem can be decomposed into local "computation kernels".
16
Parallel Computing and ML: not all algorithms are efficiently data-parallel. Data-parallel (MapReduce): cross validation, feature extraction, computing sufficient statistics. Structured iterative parallel (GraphLab): belief propagation, SVM and kernel methods, deep belief networks, neural networks, tensor factorization, learning graphical models, Lasso, sampling.
17
Common Properties: 1) sparse local computations, and 2) iterative updates. Examples: expectation maximization, optimization, sampling, belief propagation.
18
GraphLab Goals. Designed for ML needs: express data dependencies, support iteration. Simplifies the design of parallel programs: abstracts away hardware issues and addresses multiple hardware architectures (multicore, distributed, GPU, and others).
19
GraphLab Goals. [Chart: model complexity vs. data size; data-parallel frameworks cover simple models on large data today ("Now"), while the goal is complex models on large data.]
20
GraphLab Goals. [Chart: same axes; GraphLab targets the complex-models, large-data region.]
21
GraphLab: A Domain-Specific Abstraction for Machine Learning
22
Everything on a Graph: a graph with data associated with every vertex and edge.
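As a rough illustration of this data model (assumed type and field names, not the GraphLab API), a graph that carries user data on every vertex and edge might look like the following in plain C++:

    // Minimal sketch of a graph with user-defined data on every vertex and edge.
    #include <cstdio>
    #include <vector>

    struct VertexData { double value; };    // hypothetical per-vertex payload
    struct EdgeData   { double weight; };   // hypothetical per-edge payload

    struct Edge { int source, target; EdgeData data; };

    struct DataGraph {
        std::vector<VertexData> vertices;         // vertex id -> its data
        std::vector<Edge> edges;                  // edge id -> endpoints + its data
        std::vector<std::vector<int>> out_edges;  // vertex id -> ids of outgoing edges

        int add_vertex(VertexData d) {
            vertices.push_back(d);
            out_edges.emplace_back();
            return static_cast<int>(vertices.size()) - 1;
        }
        void add_edge(int src, int dst, EdgeData d) {
            edges.push_back({src, dst, d});
            out_edges[src].push_back(static_cast<int>(edges.size()) - 1);
        }
    };

    int main() {
        DataGraph g;
        int a = g.add_vertex({1.0}), b = g.add_vertex({2.0}), c = g.add_vertex({3.0});
        g.add_edge(a, b, {0.5});
        g.add_edge(b, c, {0.25});
        std::printf("%zu vertices, %zu edges\n", g.vertices.size(), g.edges.size());
        return 0;
    }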
23
Update Functions: operations applied on a vertex that transform the data in the scope of that vertex.
24
Update Functions: an update function can schedule the computation of any other update function, and scheduled computation is guaranteed to execute eventually. Scheduling policies include FIFO, prioritized, randomized, etc.
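A minimal sketch of the update-function-plus-scheduler idea, in plain standard C++ with assumed names (not the GraphLab API): an update runs on one vertex's scope and may schedule further updates, and everything in the FIFO queue eventually executes.

    // Sketch: an update touches its vertex's data and may schedule other updates.
    #include <cstdio>
    #include <deque>
    #include <vector>

    struct Scheduler {
        std::deque<int> fifo;                 // FIFO scheduling; priority/random are also possible
        void schedule(int v) { fifo.push_back(v); }
    };

    // Hypothetical update function: transforms vertex v, then asks neighbors to run later.
    void update(int v, std::vector<double>& vertex_data,
                const std::vector<std::vector<int>>& neighbors, Scheduler& sched) {
        vertex_data[v] *= 0.5;                            // transform data in the scope of v
        for (int u : neighbors[v])
            if (vertex_data[u] > 1.0) sched.schedule(u);  // schedule other update functions
    }

    int main() {
        std::vector<double> data = {4.0, 2.0, 8.0};
        std::vector<std::vector<int>> nbrs = {{1, 2}, {0}, {0}};
        Scheduler sched;
        sched.schedule(0);
        while (!sched.fifo.empty()) {                     // scheduled work eventually executes
            int v = sched.fifo.front(); sched.fifo.pop_front();
            update(v, data, nbrs, sched);
        }
        std::printf("final: %.2f %.2f %.2f\n", data[0], data[1], data[2]);
        return 0;
    }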
25
Example: PageRank. Graph = the WWW. Update function: multiply the adjacent vertices' PageRank values by the edge weights and sum them to get the current vertex's PageRank. "Prioritized" PageRank computation? Skip converged vertices.
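A minimal standard-C++ sketch of this PageRank update on a tiny hypothetical web graph (an illustration of the slide's update rule, not the GraphLab implementation); vertices whose rank has converged are skipped on later sweeps.

    // Sketch: recompute each vertex's rank from weighted in-neighbor ranks; skip converged ones.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct InEdge { int src; double weight; };  // normalized edge weight from src

    int main() {
        // Tiny hypothetical web graph: in_edges[v] lists (source, weight) pairs.
        std::vector<std::vector<InEdge>> in_edges = {
            {{1, 0.5}, {2, 1.0}}, {{0, 1.0}}, {{1, 0.5}}
        };
        std::vector<double> rank(3, 1.0);
        const double damping = 0.85, tol = 1e-6;

        bool changed = true;
        while (changed) {
            changed = false;
            for (size_t v = 0; v < rank.size(); ++v) {
                double sum = 0.0;
                for (const InEdge& e : in_edges[v]) sum += e.weight * rank[e.src];
                double new_rank = (1.0 - damping) + damping * sum;
                if (std::fabs(new_rank - rank[v]) > tol) {   // skip converged vertices
                    rank[v] = new_rank;
                    changed = true;
                }
            }
        }
        for (double r : rank) std::printf("%.4f\n", r);
        return 0;
    }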
26
Example: K-Means Clustering on a (fully connected) bipartite graph between data vertices and cluster vertices. Update functions: the cluster update computes the average of the data connected on a "marked" edge; the data update picks the closest cluster, marks that edge, and unmarks the remaining edges.
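A minimal sketch of the two update functions on this bipartite graph, written as plain C++ over 1-D data for brevity (an illustration of the logic, not the GraphLab version):

    // Sketch: data update = mark edge to the closest cluster; cluster update = average marked data.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> points   = {1.0, 1.2, 5.0, 5.3, 9.0};  // 1-D data for brevity
        std::vector<double> centers  = {0.0, 4.0, 10.0};
        std::vector<int>    assigned(points.size(), -1);            // the "marked" edge per point

        for (int iter = 0; iter < 10; ++iter) {
            // Data update: pick the closest cluster, mark that edge, unmark the rest.
            for (size_t i = 0; i < points.size(); ++i) {
                int best = 0;
                for (size_t c = 1; c < centers.size(); ++c)
                    if (std::fabs(points[i] - centers[c]) < std::fabs(points[i] - centers[best]))
                        best = static_cast<int>(c);
                assigned[i] = best;
            }
            // Cluster update: average of the data connected on a marked edge.
            for (size_t c = 0; c < centers.size(); ++c) {
                double sum = 0.0; int count = 0;
                for (size_t i = 0; i < points.size(); ++i)
                    if (assigned[i] == static_cast<int>(c)) { sum += points[i]; ++count; }
                if (count > 0) centers[c] = sum / count;
            }
        }
        for (size_t c = 0; c < centers.size(); ++c)
            std::printf("center %zu = %.2f\n", c, centers[c]);
        return 0;
    }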
27
Example: MRF Sampling. Graph = the MRF. Update function: read the samples on adjacent vertices, read the edge potentials, and compute a new sample for the current vertex.
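A minimal sketch of this update as a Gibbs sweep over an assumed Ising-style chain MRF (plain C++, not the authors' model or implementation):

    // Sketch: each vertex reads neighbor samples and edge potentials, then resamples itself.
    #include <cmath>
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
        const int n = 5;
        std::vector<int> sample(n, 1);                       // spins in {-1, +1}
        std::vector<double> edge_potential(n - 1, 0.8);      // coupling on each chain edge
        std::mt19937 rng(42);
        std::uniform_real_distribution<double> unif(0.0, 1.0);

        for (int sweep = 0; sweep < 100; ++sweep) {
            for (int v = 0; v < n; ++v) {
                // Read adjacent samples and edge potentials.
                double field = 0.0;
                if (v > 0)     field += edge_potential[v - 1] * sample[v - 1];
                if (v < n - 1) field += edge_potential[v]     * sample[v + 1];
                // Conditional P(x_v = +1 | neighbors) for an Ising model.
                double p_plus = 1.0 / (1.0 + std::exp(-2.0 * field));
                sample[v] = (unif(rng) < p_plus) ? 1 : -1;   // compute new sample for v
            }
        }
        for (int v = 0; v < n; ++v) std::printf("%+d ", sample[v]);
        std::printf("\n");
        return 0;
    }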
28
Not Message Passing! Graph is a data-structure. Update Functions perform parallel modifications to the data-structure.
29
Safety: what if adjacent update functions occur simultaneously?
30
Safety: what if adjacent update functions occur simultaneously?
31
Importance of Consistency. Permit races? "Best-effort" computation? Is ML resilient to soft optimization? True for some algorithms, but not true for many: an algorithm may work empirically on some datasets and fail on others.
32
Importance of Consistency. Many algorithms require strict consistency, or perform significantly better under strict consistency. Example: alternating least squares.
33
Importance of Consistency. A fast ML algorithm development cycle (build, test, debug, tweak model) requires the framework to behave predictably and consistently and to avoid problems caused by non-determinism; otherwise, is the execution wrong, or is the model wrong?
34
Sequential Consistency. GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of the update functions that produces the same result. [Diagram: CPU 1 and CPU 2 executing in parallel vs. CPU 1 executing sequentially, over time.]
35
Sequential Consistency. GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of the update functions that produces the same result. This is the primary property of GraphLab and a formalization of the intuitive concept of a "correct program": computation does not read outdated data from the past, and computation does not read results of computation that occurs in the future.
36
Full Consistency Guaranteed safety for all update functions
37
Full Consistency: parallel updates are only allowed on vertices at least two vertices apart, which reduces the opportunities for parallelism.
38
Obtaining More Parallelism: not all update functions will modify the entire scope! Belief propagation only uses edge data; Gibbs sampling only needs to read adjacent vertices.
39
Edge Consistency
40
Obtaining More Parallelism: "map" operations, e.g., feature extraction on vertex data.
41
Vertex Consistency
42
Global Information What if we need global information? Sum of all the vertices? Algorithm Parameters? Sufficient Statistics?
43
Shared Variables: global aggregation through the Sync operation, a global parallel reduction over the graph data. Synced variables are recomputed at defined intervals, and Sync computation is sequentially consistent, which permits correct interleaving of Syncs and Updates. Examples: sum of vertex values, log-likelihood.
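A minimal sketch of the Sync idea in plain C++ with assumed names (not the GraphLab Sync API): a global reduction over the vertex data, recomputed at a fixed interval while updates run.

    // Sketch: a shared variable recomputed as a fold over all vertex data every few updates.
    #include <cstdio>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<double> vertex_data = {1.0, 2.0, 3.0, 4.0};
        const int sync_interval = 2;          // recompute the shared variable every 2 updates
        double sum_of_vertices = 0.0;         // the synced shared variable

        for (int step = 0; step < 8; ++step) {
            int v = step % static_cast<int>(vertex_data.size());
            vertex_data[v] += 0.1;            // some update function touching one vertex

            if (step % sync_interval == 0) {  // Sync: a reduction over the graph data
                sum_of_vertices = std::accumulate(vertex_data.begin(), vertex_data.end(), 0.0);
                std::printf("step %d: sum of vertex values = %.2f\n", step, sum_of_vertices);
            }
        }
        return 0;
    }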
44
Sequential Consistency. GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of the update functions and Syncs that produces the same result.
45
GraphLab in the Cloud
46
Moving towards the cloud… Purchasing and maintaining computers is very expensive, and most computing resources are seldom used (only near deadlines). In the cloud you buy time: access hundreds or thousands of processors and pay only for the resources you need.
47
Distributed GL Implementation. Mixed multi-threaded / distributed implementation (each machine runs only one instance). Requires all data to be in memory; moves computation to the data. MPI for management + TCP/IP for communication, with an asynchronous C++ RPC layer. Ran on 64 EC2 HPC nodes = 512 processors.
48
[Architecture diagram: each machine runs the same stack on top of the underlying network: RPC controller, distributed graph, distributed locks, execution engine with execution threads, and a cache-coherent distributed K-V store for shared data.]
49
GraphLab RPC
50
Write distributed programs easily: asynchronous communication, multithreaded support, fast, scalable, and easy to use (every machine runs the same binary).
51
I ♥ C++
52
Features. Easy RPC capabilities.
One-way calls:
    rpc.remote_call([target_machine ID], printf, "%s %d %d %d\n", "hello world", 1, 2, 3);
Requests (call with return value):
    vec = rpc.remote_request([target_machine ID], sort_vector, vec);
    std::vector<int>& sort_vector(std::vector<int>& v) { std::sort(v.begin(), v.end()); return v; }
53
Features. Object instance context; MPI-like primitives: dc.barrier(), dc.gather(...), dc.send_to([target machine], [arbitrary object]), dc.recv_from([source machine], [arbitrary object ref]). [Diagram: K-V object and RPC controller layers; MPI-like safety.]
54
Request Latency. [Plot: request round-trip latency; ping RTT = 90 µs for reference.]
55
One-Way Call Rate. [Plot: one-way call rate; 1 Gbps physical peak for reference.]
56
Serialization Performance. [Benchmark: 100,000 one-way calls, each sending a vector of 10 × {"hello", 3.14, 100}.]
57
Distributed Computing Challenges. Q1: How do we efficiently distribute the state, given a potentially varying number of machines? Q2: How do we ensure sequential consistency? Keeping in mind: limited bandwidth, high latency, and performance.
58
Distributed Graph
59
Two-stage Partitioning Initial Overpartitioning of the Graph
60
Two-stage Partitioning Initial Overpartitioning of the Graph Generate Atom Graph
61
Two-stage Partitioning Initial Overpartitioning of the Graph Generate Atom Graph
62
Two-stage Partitioning Initial Overpartitioning of the Graph Generate Atom Graph Repartition as needed
63
Two-stage Partitioning Initial Overpartitioning of the Graph Generate Atom Graph Repartition as needed
64
Ghosting. Ghost vertices are copies of neighboring vertices that reside on remote machines; they act as a cache for remote data and decrease bandwidth utilization. Coherency is maintained using versioning.
65
Distributed Engine
66
Sequential consistency can be guaranteed through distributed locking, a direct analogue to the shared-memory implementation. To improve performance, the user provides some "expert knowledge" about the properties of the update function.
67
Full Consistency User says: update function modifies all data in scope. Limited opportunities for parallelism. Acquire write-lock on all vertices.
68
Edge Consistency User: update function only reads from adjacent vertices. More opportunities for parallelism. Acquire write-lock on center vertex, read-lock on adjacent.
69
Vertex Consistency User: update function does not touch edges nor adjacent vertices Maximum opportunities for parallelism. Acquire write-lock on current vertex.
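A minimal sketch of how the three consistency models map onto per-vertex read/write locks, as described on the preceding slides (an illustration in plain C++, not the distributed implementation; in practice locks would also be acquired in a canonical order to avoid deadlock):

    // Sketch: which locks each consistency model acquires on a vertex's scope.
    #include <cstdio>
    #include <vector>

    enum class Consistency { Vertex, Edge, Full };

    void print_lock_plan(Consistency model, int center, const std::vector<int>& neighbors) {
        std::printf("center %d: write-lock\n", center);      // all models write-lock the center
        for (int u : neighbors) {
            if (model == Consistency::Vertex)     std::printf("neighbor %d: no lock\n", u);
            else if (model == Consistency::Edge)  std::printf("neighbor %d: read-lock\n", u);
            else                                  std::printf("neighbor %d: write-lock\n", u);
        }
    }

    int main() {
        std::vector<int> neighbors = {1, 2, 3};
        print_lock_plan(Consistency::Vertex, 0, neighbors);   // maximum parallelism
        print_lock_plan(Consistency::Edge,   0, neighbors);   // more parallelism
        print_lock_plan(Consistency::Full,   0, neighbors);   // limited parallelism
        return 0;
    }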
70
Performance Enhancements. Latency hiding: "pipelining" far more update-function calls than there are CPUs (a pipeline about 1K deep) hides the latency of lock acquisition and cache synchronization. Lock strength reduction: a trick by which the number of locks can be decreased while still providing the same guarantees.
71
Video Cosegmentation: segments across frames that mean the same thing. Model: 10.5 million nodes, 31 million edges. Gaussian EM clustering + BP on a 3D grid.
72
Speedups
73
Video Segmentation
75
Chromatic Distributed Engine. Locking overhead is too high in high-degree models; can we satisfy sequential consistency in a simpler way? Observation: scheduling using vertex colorings can be used to automatically satisfy consistency.
76
Example: Edge Consistency. With a (distance-1) vertex coloring, update functions can be executed on all vertices of the same color in parallel.
77
Example: Full Consistency. With a (distance-2) vertex coloring, update functions can be executed on all vertices of the same color in parallel.
78
Example: Vertex Consistency. With a (distance-0) vertex coloring, update functions can be executed on all vertices of the same color in parallel.
79
Chromatic Distributed Engine. Over time: execute tasks on all vertices of color 0 (on every machine), then data synchronization and a completion barrier; execute tasks on all vertices of color 1, then data synchronization and a completion barrier; and so on.
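A minimal sketch of the chromatic schedule in plain C++ (a toy 2-colored cycle with an assumed update rule, not the distributed engine): all vertices of one color are updated before moving to the next color, with the data synchronization and barrier happening between colors.

    // Sketch: run updates color by color; same-color vertices are never adjacent,
    // so they could safely be updated in parallel.
    #include <cstdio>
    #include <vector>

    int main() {
        // Hypothetical 4-vertex cycle, 2-colorable: colors[v] in {0, 1}.
        std::vector<int> colors = {0, 1, 0, 1};
        std::vector<double> data = {1.0, 2.0, 3.0, 4.0};
        std::vector<std::vector<int>> neighbors = {{1, 3}, {0, 2}, {1, 3}, {0, 2}};
        const int num_colors = 2;

        for (int iter = 0; iter < 3; ++iter) {
            for (int c = 0; c < num_colors; ++c) {
                // Execute tasks on all vertices of color c (none of them are adjacent).
                for (size_t v = 0; v < data.size(); ++v) {
                    if (colors[v] != c) continue;
                    double sum = 0.0;
                    for (int u : neighbors[v]) sum += data[u];
                    data[v] = 0.5 * data[v] + 0.5 * (sum / neighbors[v].size());
                }
                // Data synchronization + completion barrier would happen here in the
                // distributed setting, before the next color starts.
            }
        }
        for (double x : data) std::printf("%.3f ", x);
        std::printf("\n");
        return 0;
    }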
80
Experiments: Netflix Collaborative Filtering. Alternating least squares matrix factorization. Model: 0.5 million nodes, 99 million edges. [Diagram: bipartite graph of Netflix users and movies, latent dimension d.]
81
Netflix Speedup. [Plot: speedup as the size d of the matrix factorization increases.]
82
Netflix
84
Experiments: Named Entity Recognition (part of Tom Mitchell’s NELL project). CoEM algorithm on a web crawl. Model: 2 million nodes, 200 million edges. The graph is rather dense: a small number of vertices connect to almost all the vertices.
85
Named Entity Recognition (CoEM)
86
Named Entity Recognition (CoEM): bandwidth bound.
87
Named Entity Recognition (CoEM)
88
Future Work. Distributed GraphLab: fault tolerance, spot instances (cheaper), graphs stored off-memory (disk/SSD), GraphLab as a database, self-optimized partitioning, fast data-graph construction primitives. GPU GraphLab? Supercomputer GraphLab?
89
Is GraphLab the Answer to Life, the Universe, and Everything? Probably not.
90
graphlab.ml.cmu.edu: GraphLab parallel/distributed implementation, LGPL (highly probable switch to MPL in a few weeks). bickson.blogspot.com (Danny Bickson): very fast matrix factorization implementations, other examples, installation, comparisons, etc.
91
Questions? (Example applications: Bayesian tensor factorization, Gibbs sampling, dynamic block Gibbs sampling, matrix factorization, Lasso, SVM, belief propagation, PageRank, CoEM, SVD, and many others…)
92
Video Cosegmentation. Naïve idea: treat patches independently and use Gaussian EM clustering on image features. E step: predict the membership of each patch given the cluster centers. M step: compute the cluster centers given the memberships of each patch. This does not take relationships among patches into account!
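A minimal sketch of this naïve EM loop in plain C++ (1-D features, two clusters, fixed unit variance; an illustration only). The next slide's improvement replaces the independent E step with belief propagation over an MRF that couples adjacent patches.

    // Sketch: independent Gaussian EM over patch features (the "naive idea").
    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> patch_feature = {0.9, 1.1, 1.0, 4.8, 5.2, 5.0};
        std::vector<double> center = {0.0, 6.0};                 // two cluster centers
        std::vector<std::vector<double>> membership(patch_feature.size(),
                                                    std::vector<double>(2, 0.5));

        for (int iter = 0; iter < 20; ++iter) {
            // E step: predict the membership of each patch given the cluster centers.
            for (size_t i = 0; i < patch_feature.size(); ++i) {
                double w0 = std::exp(-0.5 * std::pow(patch_feature[i] - center[0], 2));
                double w1 = std::exp(-0.5 * std::pow(patch_feature[i] - center[1], 2));
                membership[i][0] = w0 / (w0 + w1);
                membership[i][1] = w1 / (w0 + w1);
            }
            // M step: recompute the cluster centers from the (soft) memberships.
            for (int c = 0; c < 2; ++c) {
                double num = 0.0, den = 0.0;
                for (size_t i = 0; i < patch_feature.size(); ++i) {
                    num += membership[i][c] * patch_feature[i];
                    den += membership[i][c];
                }
                if (den > 0.0) center[c] = num / den;
            }
        }
        std::printf("centers: %.2f %.2f\n", center[0], center[1]);
        return 0;
    }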
93
Video Cosegmentation. Better idea: connect the patches using an MRF, setting edge potentials so that adjacent (spatially and temporally) patches prefer to be in the same cluster. Gaussian EM clustering with a twist: in the E step, build unary potentials for each patch from the cluster centers and predict the membership of each patch using BP; in the M step, compute the cluster centers given the memberships of each patch. D. Batra, et al. iCoseg: Interactive co-segmentation with intelligent scribble guidance. CVPR 2010.
94
Distributed Memory Programming APIs: MPI, Global Arrays, GASNet, ARMCI, etc. They do not make things easy: synchronous computation, insufficient primitives for multi-threaded use, and not exactly easy to use in general. Global Arrays helps only if all your data is an n-D array, and direct remote pointer access has severe limitations depending on the system architecture.