1
Carnegie Mellon: Machine Learning in the Cloud. Yucheng Low, Aapo Kyrola, Danny Bickson, Joey Gonzalez, Carlos Guestrin, Joe Hellerstein, David O’Hallaron
2
Machine Learning in the Real World: 24 hours of video uploaded to YouTube every minute, 13 million Wikipedia pages, 500 million Facebook users, 3.6 billion Flickr photos.
3
Exponential Parallelism. [Chart: processor speed (GHz) vs. release date; sequential performance, once exponentially increasing, is now roughly constant, while parallel performance continues to increase exponentially.]
4
Parallelism is Difficult. Wide array of different parallel architectures (GPUs, multicore, clusters, clouds, supercomputers), with different challenges for each architecture. High-level abstractions make things easier.
5
MapReduce – Map Phase: embarrassingly parallel, independent computation with no communication needed. [Diagram: CPUs 1–4 each computing a value independently.]
6
MapReduce – Map Phase (continued): each CPU picks up further independent work; still no communication needed.
7
MapReduce – Map Phase (continued): a third batch of independent computations; still no communication needed.
8
MapReduce – Reduce Phase: fold/aggregation over the mapped values. [Diagram: CPUs 1–2 aggregating the map outputs.]
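The map/fold pattern above can be sketched in a few lines of plain C++ (an illustration with made-up values, not Hadoop or the MapReduce API): std::transform plays the role of the map phase and std::accumulate the role of the reduce/fold phase.

    // Minimal sketch of the map/fold pattern from the slides, in plain C++.
    // The input values are hypothetical; no communication is needed during map.
    #include <algorithm>
    #include <cmath>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<double> inputs = {12.9, 42.3, 21.3, 25.8};   // one record per "CPU"
        std::vector<double> mapped(inputs.size());

        // Map phase: each element is processed independently.
        std::transform(inputs.begin(), inputs.end(), mapped.begin(),
                       [](double x) { return std::sqrt(x); });

        // Reduce phase: fold/aggregate the mapped values.
        double total = std::accumulate(mapped.begin(), mapped.end(), 0.0);
        std::cout << "aggregate = " << total << "\n";
        return 0;
    }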
9
MapReduce and ML: excellent for large data-parallel tasks (cross validation, feature extraction, computing sufficient statistics). But is there more to machine learning than the data-parallel end of the spectrum, i.e., problems with complex parallel structure?
10
Iterative Algorithms? We can implement iterative algorithms in MapReduce. [Diagram: repeated Data → CPU phases with a barrier after each iteration; a single slow processor holds up every barrier.]
11
Iterative MapReduce: the system is not optimized for iteration. [Diagram: each iteration pays a startup penalty and a disk penalty.]
12
Iterative MapReduce: only a subset of the data needs computation in each (multi-phase) iteration. [Diagram: repeated Data → CPU phases separated by barriers.]
13
MapReduce and ML: excellent for large data-parallel tasks (cross validation, feature extraction, computing sufficient statistics). But is there more to machine learning than the data-parallel end of the spectrum, i.e., problems with complex parallel structure?
14
Structured Problems: interdependent computation is not map-reducible. Example problem: will I be successful in research? Success depends on the success of others, so we may not be able to safely update neighboring nodes in parallel [e.g., Gibbs sampling].
15
Space of Problems. Asynchronous iterative computation: repeated iterations over local kernel computations. Sparse computation dependencies: the problem can be decomposed into local "computation kernels".
16
Parallel Computing and ML: not all algorithms are efficiently data-parallel. Data-parallel (MapReduce): cross validation, feature extraction, computing sufficient statistics. Structured iterative parallel (GraphLab): belief propagation, SVM and kernel methods, deep belief networks, neural networks, tensor factorization, learning graphical models, Lasso, sampling.
17
Common Properties: 1) sparse local computations, and 2) iterative updates. Examples: expectation maximization, optimization, sampling, belief propagation.
18
GraphLab Goals. Designed for ML needs: express data dependencies, support iteration. Simplifies the design of parallel programs: abstracts away hardware issues and addresses multiple hardware architectures (multicore, distributed, GPU, and others).
19
GraphLab Goals. [Chart: model complexity vs. data size; data-parallel frameworks cover simple models on large data today ("Now"), while the goal is complex models on large data.]
20
GraphLab Goals. [Chart: same axes; GraphLab targets the complex-models, large-data region.]
21
GraphLab: A Domain-Specific Abstraction for Machine Learning
22
Everything on a Graph: a graph with data associated with every vertex and edge.
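As a rough illustration of this data model (assumed type and field names, not the GraphLab API), a graph that carries user data on every vertex and edge might look like the following in plain C++:

    // Minimal sketch of a graph with user-defined data on every vertex and edge.
    #include <cstdio>
    #include <vector>

    struct VertexData { double value; };    // hypothetical per-vertex payload
    struct EdgeData   { double weight; };   // hypothetical per-edge payload

    struct Edge { int source, target; EdgeData data; };

    struct DataGraph {
        std::vector<VertexData> vertices;         // vertex id -> its data
        std::vector<Edge> edges;                  // edge id -> endpoints + its data
        std::vector<std::vector<int>> out_edges;  // vertex id -> ids of outgoing edges

        int add_vertex(VertexData d) {
            vertices.push_back(d);
            out_edges.emplace_back();
            return static_cast<int>(vertices.size()) - 1;
        }
        void add_edge(int src, int dst, EdgeData d) {
            edges.push_back({src, dst, d});
            out_edges[src].push_back(static_cast<int>(edges.size()) - 1);
        }
    };

    int main() {
        DataGraph g;
        int a = g.add_vertex({1.0}), b = g.add_vertex({2.0}), c = g.add_vertex({3.0});
        g.add_edge(a, b, {0.5});
        g.add_edge(b, c, {0.25});
        std::printf("%zu vertices, %zu edges\n", g.vertices.size(), g.edges.size());
        return 0;
    }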
23
Update Functions: operations applied on a vertex that transform the data in the scope of that vertex.
24
Update Functions: an update function can schedule the computation of any other update function, and scheduled computation is guaranteed to execute eventually. Scheduling policies include FIFO, prioritized, randomized, etc.
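A minimal sketch of the update-function-plus-scheduler idea, in plain standard C++ with assumed names (not the GraphLab API): an update runs on one vertex's scope and may schedule further updates, and everything in the FIFO queue eventually executes.

    // Sketch: an update touches its vertex's data and may schedule other updates.
    #include <cstdio>
    #include <deque>
    #include <vector>

    struct Scheduler {
        std::deque<int> fifo;                 // FIFO scheduling; priority/random are also possible
        void schedule(int v) { fifo.push_back(v); }
    };

    // Hypothetical update function: transforms vertex v, then asks neighbors to run later.
    void update(int v, std::vector<double>& vertex_data,
                const std::vector<std::vector<int>>& neighbors, Scheduler& sched) {
        vertex_data[v] *= 0.5;                            // transform data in the scope of v
        for (int u : neighbors[v])
            if (vertex_data[u] > 1.0) sched.schedule(u);  // schedule other update functions
    }

    int main() {
        std::vector<double> data = {4.0, 2.0, 8.0};
        std::vector<std::vector<int>> nbrs = {{1, 2}, {0}, {0}};
        Scheduler sched;
        sched.schedule(0);
        while (!sched.fifo.empty()) {                     // scheduled work eventually executes
            int v = sched.fifo.front(); sched.fifo.pop_front();
            update(v, data, nbrs, sched);
        }
        std::printf("final: %.2f %.2f %.2f\n", data[0], data[1], data[2]);
        return 0;
    }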
25
Example: PageRank. Graph = the WWW. Update function: multiply the adjacent vertices' PageRank values by the edge weights and sum them to get the current vertex's PageRank. "Prioritized" PageRank computation? Skip converged vertices.
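A minimal standard-C++ sketch of this PageRank update on a tiny hypothetical web graph (an illustration of the slide's update rule, not the GraphLab implementation); vertices whose rank has converged are skipped on later sweeps.

    // Sketch: recompute each vertex's rank from weighted in-neighbor ranks; skip converged ones.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct InEdge { int src; double weight; };  // normalized edge weight from src

    int main() {
        // Tiny hypothetical web graph: in_edges[v] lists (source, weight) pairs.
        std::vector<std::vector<InEdge>> in_edges = {
            {{1, 0.5}, {2, 1.0}}, {{0, 1.0}}, {{1, 0.5}}
        };
        std::vector<double> rank(3, 1.0);
        const double damping = 0.85, tol = 1e-6;

        bool changed = true;
        while (changed) {
            changed = false;
            for (size_t v = 0; v < rank.size(); ++v) {
                double sum = 0.0;
                for (const InEdge& e : in_edges[v]) sum += e.weight * rank[e.src];
                double new_rank = (1.0 - damping) + damping * sum;
                if (std::fabs(new_rank - rank[v]) > tol) {   // skip converged vertices
                    rank[v] = new_rank;
                    changed = true;
                }
            }
        }
        for (double r : rank) std::printf("%.4f\n", r);
        return 0;
    }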
26
Example: K-Means Clustering on a (fully connected) bipartite graph between data vertices and cluster vertices. Update functions: the cluster update computes the average of the data connected on a "marked" edge; the data update picks the closest cluster, marks that edge, and unmarks the remaining edges.
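A minimal sketch of the two update functions on this bipartite graph, written as plain C++ over 1-D data for brevity (an illustration of the logic, not the GraphLab version):

    // Sketch: data update = mark edge to the closest cluster; cluster update = average marked data.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> points   = {1.0, 1.2, 5.0, 5.3, 9.0};  // 1-D data for brevity
        std::vector<double> centers  = {0.0, 4.0, 10.0};
        std::vector<int>    assigned(points.size(), -1);            // the "marked" edge per point

        for (int iter = 0; iter < 10; ++iter) {
            // Data update: pick the closest cluster, mark that edge, unmark the rest.
            for (size_t i = 0; i < points.size(); ++i) {
                int best = 0;
                for (size_t c = 1; c < centers.size(); ++c)
                    if (std::fabs(points[i] - centers[c]) < std::fabs(points[i] - centers[best]))
                        best = static_cast<int>(c);
                assigned[i] = best;
            }
            // Cluster update: average of the data connected on a marked edge.
            for (size_t c = 0; c < centers.size(); ++c) {
                double sum = 0.0; int count = 0;
                for (size_t i = 0; i < points.size(); ++i)
                    if (assigned[i] == static_cast<int>(c)) { sum += points[i]; ++count; }
                if (count > 0) centers[c] = sum / count;
            }
        }
        for (size_t c = 0; c < centers.size(); ++c)
            std::printf("center %zu = %.2f\n", c, centers[c]);
        return 0;
    }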
27
Example: MRF Sampling. Graph = the MRF. Update function: read the samples on adjacent vertices, read the edge potentials, and compute a new sample for the current vertex.
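A minimal sketch of this update as a Gibbs sweep over an assumed Ising-style chain MRF (plain C++, not the authors' model or implementation):

    // Sketch: each vertex reads neighbor samples and edge potentials, then resamples itself.
    #include <cmath>
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
        const int n = 5;
        std::vector<int> sample(n, 1);                       // spins in {-1, +1}
        std::vector<double> edge_potential(n - 1, 0.8);      // coupling on each chain edge
        std::mt19937 rng(42);
        std::uniform_real_distribution<double> unif(0.0, 1.0);

        for (int sweep = 0; sweep < 100; ++sweep) {
            for (int v = 0; v < n; ++v) {
                // Read adjacent samples and edge potentials.
                double field = 0.0;
                if (v > 0)     field += edge_potential[v - 1] * sample[v - 1];
                if (v < n - 1) field += edge_potential[v]     * sample[v + 1];
                // Conditional P(x_v = +1 | neighbors) for an Ising model.
                double p_plus = 1.0 / (1.0 + std::exp(-2.0 * field));
                sample[v] = (unif(rng) < p_plus) ? 1 : -1;   // compute new sample for v
            }
        }
        for (int v = 0; v < n; ++v) std::printf("%+d ", sample[v]);
        std::printf("\n");
        return 0;
    }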
28
Not Message Passing! Graph is a data-structure. Update Functions perform parallel modifications to the data-structure.
29
Safety: what if adjacent update functions occur simultaneously?
30
Safety: what if adjacent update functions occur simultaneously?
31
Importance of Consistency. Permit races? "Best-effort" computation? Is ML resilient to soft optimization? True for some algorithms, but not true for many: an algorithm may work empirically on some datasets and fail on others.
32
Importance of Consistency. Many algorithms require strict consistency, or perform significantly better under strict consistency. Example: alternating least squares.
33
Importance of Consistency. A fast ML algorithm development cycle (build, test, debug, tweak model) requires the framework to behave predictably and consistently and to avoid problems caused by non-determinism; otherwise, is the execution wrong, or is the model wrong?
34
Sequential Consistency. GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of the update functions that produces the same result. [Diagram: CPU 1 and CPU 2 executing in parallel vs. CPU 1 executing sequentially, over time.]
35
Sequential Consistency. GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of the update functions that produces the same result. This is the primary property of GraphLab and a formalization of the intuitive concept of a "correct program": computation does not read outdated data from the past, and computation does not read results of computation that occurs in the future.
36
Full Consistency Guaranteed safety for all update functions
37
Full Consistency: parallel updates are only allowed on vertices at least two vertices apart, which reduces the opportunities for parallelism.
38
Obtaining More Parallelism: not all update functions will modify the entire scope! Belief propagation only uses edge data; Gibbs sampling only needs to read adjacent vertices.
39
Edge Consistency
40
Obtaining More Parallelism: "map" operations, e.g., feature extraction on vertex data.
41
Vertex Consistency
42
Global Information What if we need global information? Sum of all the vertices? Algorithm Parameters? Sufficient Statistics?
43
Shared Variables: global aggregation through the Sync operation, a global parallel reduction over the graph data. Synced variables are recomputed at defined intervals, and Sync computation is sequentially consistent, which permits correct interleaving of Syncs and Updates. Examples: sum of vertex values, log-likelihood.
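A minimal sketch of the Sync idea in plain C++ with assumed names (not the GraphLab Sync API): a global reduction over the vertex data, recomputed at a fixed interval while updates run.

    // Sketch: a shared variable recomputed as a fold over all vertex data every few updates.
    #include <cstdio>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<double> vertex_data = {1.0, 2.0, 3.0, 4.0};
        const int sync_interval = 2;          // recompute the shared variable every 2 updates
        double sum_of_vertices = 0.0;         // the synced shared variable

        for (int step = 0; step < 8; ++step) {
            int v = step % static_cast<int>(vertex_data.size());
            vertex_data[v] += 0.1;            // some update function touching one vertex

            if (step % sync_interval == 0) {  // Sync: a reduction over the graph data
                sum_of_vertices = std::accumulate(vertex_data.begin(), vertex_data.end(), 0.0);
                std::printf("step %d: sum of vertex values = %.2f\n", step, sum_of_vertices);
            }
        }
        return 0;
    }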
44
Sequential Consistency. GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of the update functions and Syncs that produces the same result.
45
GraphLab in the Cloud
46
Moving towards the cloud… Purchasing and maintaining computers is very expensive, and most computing resources are seldom used (only near deadlines). In the cloud you buy time: access hundreds or thousands of processors and pay only for the resources you need.
47
Distributed GL Implementation. Mixed multi-threaded / distributed implementation (each machine runs only one instance). Requires all data to be in memory; moves computation to the data. MPI for management + TCP/IP for communication, with an asynchronous C++ RPC layer. Ran on 64 EC2 HPC nodes = 512 processors.
48
[Architecture diagram: each machine runs the same stack on top of the underlying network: RPC controller, distributed graph, distributed locks, execution engine with execution threads, and a cache-coherent distributed K-V store for shared data.]
49
GraphLab RPC
50
Write distributed programs easily: asynchronous communication, multithreaded support, fast, scalable, and easy to use (every machine runs the same binary).
51
I ♥ C++
52
Features. Easy RPC capabilities.
One-way calls:
    rpc.remote_call([target_machine ID], printf, "%s %d %d %d\n", "hello world", 1, 2, 3);
Requests (call with return value):
    vec = rpc.remote_request([target_machine ID], sort_vector, vec);
    std::vector<int>& sort_vector(std::vector<int>& v) { std::sort(v.begin(), v.end()); return v; }
53
Features. Object instance context; MPI-like primitives: dc.barrier(), dc.gather(...), dc.send_to([target machine], [arbitrary object]), dc.recv_from([source machine], [arbitrary object ref]). [Diagram: K-V object and RPC controller layers; MPI-like safety.]
54
Request Latency. [Plot: request round-trip latency; ping RTT = 90 µs for reference.]
55
One-Way Call Rate. [Plot: one-way call rate; 1 Gbps physical peak for reference.]
56
Serialization Performance. [Benchmark: 100,000 one-way calls, each sending a vector of 10 × {"hello", 3.14, 100}.]
57
Distributed Computing Challenges. Q1: How do we efficiently distribute the state, given a potentially varying number of machines? Q2: How do we ensure sequential consistency? Keeping in mind: limited bandwidth, high latency, and performance.
58
Distributed Graph
59
Two-stage Partitioning Initial Overpartitioning of the Graph
60
Two-stage Partitioning Initial Overpartitioning of the Graph Generate Atom Graph
61
Two-stage Partitioning Initial Overpartitioning of the Graph Generate Atom Graph
62
Two-stage Partitioning Initial Overpartitioning of the Graph Generate Atom Graph Repartition as needed
63
Two-stage Partitioning Initial Overpartitioning of the Graph Generate Atom Graph Repartition as needed
64
Ghosting. Ghost vertices are copies of neighboring vertices that reside on remote machines; they act as a cache for remote data and decrease bandwidth utilization. Coherency is maintained using versioning.
65
Distributed Engine
66
Sequential consistency can be guaranteed through distributed locking, a direct analogue to the shared-memory implementation. To improve performance, the user provides some "expert knowledge" about the properties of the update function.
67
Full Consistency User says: update function modifies all data in scope. Limited opportunities for parallelism. Acquire write-lock on all vertices.
68
Edge Consistency User: update function only reads from adjacent vertices. More opportunities for parallelism. Acquire write-lock on center vertex, read-lock on adjacent.
69
Vertex Consistency User: update function does not touch edges nor adjacent vertices Maximum opportunities for parallelism. Acquire write-lock on current vertex.
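A minimal sketch of how the three consistency models map onto per-vertex read/write locks, as described on the preceding slides (an illustration in plain C++, not the distributed implementation; in practice locks would also be acquired in a canonical order to avoid deadlock):

    // Sketch: which locks each consistency model acquires on a vertex's scope.
    #include <cstdio>
    #include <vector>

    enum class Consistency { Vertex, Edge, Full };

    void print_lock_plan(Consistency model, int center, const std::vector<int>& neighbors) {
        std::printf("center %d: write-lock\n", center);      // all models write-lock the center
        for (int u : neighbors) {
            if (model == Consistency::Vertex)     std::printf("neighbor %d: no lock\n", u);
            else if (model == Consistency::Edge)  std::printf("neighbor %d: read-lock\n", u);
            else                                  std::printf("neighbor %d: write-lock\n", u);
        }
    }

    int main() {
        std::vector<int> neighbors = {1, 2, 3};
        print_lock_plan(Consistency::Vertex, 0, neighbors);   // maximum parallelism
        print_lock_plan(Consistency::Edge,   0, neighbors);   // more parallelism
        print_lock_plan(Consistency::Full,   0, neighbors);   // limited parallelism
        return 0;
    }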
70
Performance Enhancements. Latency hiding: "pipelining" far more update-function calls than there are CPUs (a pipeline about 1K deep) hides the latency of lock acquisition and cache synchronization. Lock strength reduction: a trick by which the number of locks can be decreased while still providing the same guarantees.
71
Video Cosegmentation: segments across frames that mean the same thing. Model: 10.5 million nodes, 31 million edges. Gaussian EM clustering + BP on a 3D grid.
72
Speedups
73
Video Segmentation
75
Chromatic Distributed Engine. Locking overhead is too high in high-degree models; can we satisfy sequential consistency in a simpler way? Observation: scheduling using vertex colorings can be used to automatically satisfy consistency.
76
Example: Edge Consistency. With a (distance-1) vertex coloring, update functions can be executed on all vertices of the same color in parallel.
77
Example: Full Consistency. With a (distance-2) vertex coloring, update functions can be executed on all vertices of the same color in parallel.
78
Example: Vertex Consistency. With a (distance-0) vertex coloring, update functions can be executed on all vertices of the same color in parallel.
79
Chromatic Distributed Engine. Over time: execute tasks on all vertices of color 0 (on every machine), then data synchronization and a completion barrier; execute tasks on all vertices of color 1, then data synchronization and a completion barrier; and so on.
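A minimal sketch of the chromatic schedule in plain C++ (a toy 2-colored cycle with an assumed update rule, not the distributed engine): all vertices of one color are updated before moving to the next color, with the data synchronization and barrier happening between colors.

    // Sketch: run updates color by color; same-color vertices are never adjacent,
    // so they could safely be updated in parallel.
    #include <cstdio>
    #include <vector>

    int main() {
        // Hypothetical 4-vertex cycle, 2-colorable: colors[v] in {0, 1}.
        std::vector<int> colors = {0, 1, 0, 1};
        std::vector<double> data = {1.0, 2.0, 3.0, 4.0};
        std::vector<std::vector<int>> neighbors = {{1, 3}, {0, 2}, {1, 3}, {0, 2}};
        const int num_colors = 2;

        for (int iter = 0; iter < 3; ++iter) {
            for (int c = 0; c < num_colors; ++c) {
                // Execute tasks on all vertices of color c (none of them are adjacent).
                for (size_t v = 0; v < data.size(); ++v) {
                    if (colors[v] != c) continue;
                    double sum = 0.0;
                    for (int u : neighbors[v]) sum += data[u];
                    data[v] = 0.5 * data[v] + 0.5 * (sum / neighbors[v].size());
                }
                // Data synchronization + completion barrier would happen here in the
                // distributed setting, before the next color starts.
            }
        }
        for (double x : data) std::printf("%.3f ", x);
        std::printf("\n");
        return 0;
    }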
80
Experiments: Netflix Collaborative Filtering. Alternating least squares matrix factorization. Model: 0.5 million nodes, 99 million edges. [Diagram: bipartite graph of Netflix users and movies, latent dimension d.]
81
Netflix Speedup. [Plot: speedup as the size d of the matrix factorization increases.]
82
Netflix
84
Experiments: Named Entity Recognition (part of Tom Mitchell’s NELL project). CoEM algorithm on a web crawl. Model: 2 million nodes, 200 million edges. The graph is rather dense: a small number of vertices connect to almost all the vertices.
85
Named Entity Recognition (CoEM)
86
Named Entity Recognition (CoEM): bandwidth bound.
87
Named Entity Recognition (CoEM)
88
Future Work. Distributed GraphLab: fault tolerance, spot instances (cheaper), graphs stored off-memory (disk/SSD), GraphLab as a database, self-optimized partitioning, fast data-graph construction primitives. GPU GraphLab? Supercomputer GraphLab?
89
Is GraphLab the Answer to Life, the Universe, and Everything? Probably not.
90
graphlab.ml.cmu.edu: GraphLab parallel/distributed implementation, LGPL (highly probable switch to MPL in a few weeks). bickson.blogspot.com (Danny Bickson): very fast matrix factorization implementations, other examples, installation, comparisons, etc.
91
Questions? (Example applications: Bayesian tensor factorization, Gibbs sampling, dynamic block Gibbs sampling, matrix factorization, Lasso, SVM, belief propagation, PageRank, CoEM, SVD, and many others…)
92
Video Cosegmentation. Naïve idea: treat patches independently and use Gaussian EM clustering on image features. E step: predict the membership of each patch given the cluster centers. M step: compute the cluster centers given the memberships of each patch. This does not take relationships among patches into account!
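A minimal sketch of this naïve EM loop in plain C++ (1-D features, two clusters, fixed unit variance; an illustration only). The next slide's improvement replaces the independent E step with belief propagation over an MRF that couples adjacent patches.

    // Sketch: independent Gaussian EM over patch features (the "naive idea").
    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> patch_feature = {0.9, 1.1, 1.0, 4.8, 5.2, 5.0};
        std::vector<double> center = {0.0, 6.0};                 // two cluster centers
        std::vector<std::vector<double>> membership(patch_feature.size(),
                                                    std::vector<double>(2, 0.5));

        for (int iter = 0; iter < 20; ++iter) {
            // E step: predict the membership of each patch given the cluster centers.
            for (size_t i = 0; i < patch_feature.size(); ++i) {
                double w0 = std::exp(-0.5 * std::pow(patch_feature[i] - center[0], 2));
                double w1 = std::exp(-0.5 * std::pow(patch_feature[i] - center[1], 2));
                membership[i][0] = w0 / (w0 + w1);
                membership[i][1] = w1 / (w0 + w1);
            }
            // M step: recompute the cluster centers from the (soft) memberships.
            for (int c = 0; c < 2; ++c) {
                double num = 0.0, den = 0.0;
                for (size_t i = 0; i < patch_feature.size(); ++i) {
                    num += membership[i][c] * patch_feature[i];
                    den += membership[i][c];
                }
                if (den > 0.0) center[c] = num / den;
            }
        }
        std::printf("centers: %.2f %.2f\n", center[0], center[1]);
        return 0;
    }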
93
Video Cosegmentation. Better idea: connect the patches using an MRF, setting edge potentials so that adjacent (spatially and temporally) patches prefer to be in the same cluster. Gaussian EM clustering with a twist: in the E step, build unary potentials for each patch from the cluster centers and predict the membership of each patch using BP; in the M step, compute the cluster centers given the memberships of each patch. D. Batra, et al. iCoseg: Interactive co-segmentation with intelligent scribble guidance. CVPR 2010.
94
Distributed Memory Programming APIs: MPI, Global Arrays, GASNet, ARMCI, etc. They do not make things easy: synchronous computation, insufficient primitives for multi-threaded use, and not exactly easy to use in general. Global Arrays helps only if all your data is an n-D array, and direct remote pointer access has severe limitations depending on the system architecture.