A Distributed Framework for Machine Learning and Data Mining in the Cloud BY YUCHENG LOW, JOSEPH GONZALEZ, AAPO KYROLA, DANNY BICKSON, CARLOS GUESTRIN
Why do we need GraphLab? Machine Learning and Data Mining (MLDM) problems increasingly need systems that can execute MLDM algorithms in parallel on large clusters. Implementing MLDM algorithms in parallel on current systems like Hadoop and MPI can be both prohibitively complex and costly. The MLDM community needs a high-level abstraction to handle the complexities of graph and network algorithms.
Introduction A BRIEF REVIEW OF KEY CONCEPTS
Sequential vs Parallel Processing In sequential processing, threads are run on a single node in the order that they are requested. In parallel processing, the independent threads are divided among the available nodes. [Figure: threads assigned to nodes]
Data Parallelism Data parallelism focuses on distributing the data across different parallel computing nodes. It is achieved when each processor performs the same task on different pieces of distributed data. [Figure: a data set split across Node 1, Node 2, Node 3]
Data Parallelism Data set: Numbers 1-9, Task Set: ‘Is Prime’
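The 'Is Prime' example can be sketched in a few lines of Python. This is only an illustration: the three "nodes" are simulated with worker threads, and the chunking of the numbers 1-9 into three pieces is an assumption, not something the slides specify.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(n: int) -> bool:
    """Return True if n is a prime number."""
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))


# Assumed split of the data set 1-9 into one chunk per simulated node.
chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]


def run_chunk(chunk):
    # Every worker performs the SAME task on its OWN piece of the data.
    return {n: is_prime(n) for n in chunk}


with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(run_chunk, chunks))

primes = sorted(n for part in partial_results for n, flag in part.items() if flag)
print(primes)  # [2, 3, 5, 7]
```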
Task Parallelism Task parallelism focuses on distributing execution processes across different parallel computing nodes. It is achieved when each processor executes a different process on the same or different data. [Figure: a task set split across Node 1, Node 2, Node 3]
Task Parallelism Data set: Numbers 1-9, Task Set: ‘Is Prime’, ‘Is Even’, ‘Is Odd’ [Figure: each node runs one of ‘Is Prime’, ‘Is Even’, ‘Is Odd’]
Graph Parallelism Graph parallelism focuses on distributing vertices from a sparse graph G = {V,E} across different parallel computing nodes. It is achieved when each processor executes a vertex program Q(v) which can interact with neighboring instances Q(u), where (u,v) is in E. [Figure: a graph split across Node 1, Node 2, Node 3]
Graph Parallelism G = {V,E} [Figure: vertices v1, v2, v3] Data set: Numbers 1-9, Task Set: ‘Is Prime’, ‘Is Even’, ‘Is Odd’
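A matching sketch for task parallelism, again with simulated nodes: each worker thread runs a different task over the same data set. The function and variable names are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(n: int) -> bool:
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))


def is_even(n: int) -> bool:
    return n % 2 == 0


def is_odd(n: int) -> bool:
    return n % 2 == 1


data = list(range(1, 10))  # the shared data set: numbers 1-9
tasks = {"is_prime": is_prime, "is_even": is_even, "is_odd": is_odd}


def run_task(name):
    # Each worker runs a DIFFERENT task over the SAME data.
    return name, [n for n in data if tasks[name](n)]


with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(run_task, tasks))

print(results["is_prime"])  # [2, 3, 5, 7]
```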
MLDM Algorithm Properties
Graph Structured Computation DEPENDENCY GRAPH Many of the recent advances in MLDM have focused on modeling the dependencies between data. By modeling dependencies, we are able to extract more signal from noisy data.
Asynchronous Iterative Computation Synchronous systems update all parameters simultaneously (in parallel) using parameter values from the previous time step as input. Asynchronous systems update parameters using the most recent parameter values as input. Many MLDM algorithms benefit from asynchronous systems.
Dynamic Computation Static computation requires the algorithm to update all vertices equally often. This wastes time recomputing vertices that have effectively converged. Dynamic computation allows the algorithm to potentially save time by only recomputing vertices whose neighbors have recently updated.
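The difference can be illustrated on a toy problem that is not from the slides: five values on a chain, with the end points fixed at 0 and 1 and each interior value repeatedly replaced by the average of its two neighbors. The synchronous sweep reads only values from the previous sweep (Jacobi style); the "asynchronous" sweep is simulated sequentially and reads the most recent values (Gauss-Seidel style). It is a sketch of the idea, not of a distributed engine.

```python
def sweep_synchronous(x):
    """Update every interior value from the PREVIOUS sweep's values."""
    old = list(x)
    for i in range(1, len(x) - 1):
        x[i] = 0.5 * (old[i - 1] + old[i + 1])
    return max(abs(a - b) for a, b in zip(x, old))


def sweep_asynchronous(x):
    """Update in place, so later updates see the most recent values."""
    biggest_change = 0.0
    for i in range(1, len(x) - 1):
        new = 0.5 * (x[i - 1] + x[i + 1])
        biggest_change = max(biggest_change, abs(new - x[i]))
        x[i] = new
    return biggest_change


def sweeps_until_converged(sweep, tol=1e-8):
    x = [0.0, 0.0, 0.0, 0.0, 1.0]  # end points stay fixed
    count = 0
    while True:
        count += 1
        if sweep(x) < tol:
            return count, x


sync_sweeps, sync_x = sweeps_until_converged(sweep_synchronous)
async_sweeps, async_x = sweeps_until_converged(sweep_asynchronous)
print(sync_sweeps, async_sweeps)
```

On this problem both variants reach the same fixed point (0, 0.25, 0.5, 0.75, 1), and the variant that uses the most recent values needs fewer sweeps.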
Serializability For every parallel execution, there exists a sequential execution of update functions which produces the same result. [Figure: a parallel execution on CPU 1 and CPU 2 and an equivalent sequential execution on a single CPU, plotted over time]
Serializability Serializability ensures that all parallel executions have an equivalent sequential execution, which eliminates race conditions. A race condition is a programming fault that can produce undetermined program states and behaviors. Many MLDM algorithms converge faster if serializability is ensured. Some, like dynamic ALS (alternating least squares), require serializability for correctness and/or stability.
Distributed GraphLab Abstraction
PageRank Algorithm
Data Graph The GraphLab abstraction stores the program state as a directed graph called the data graph, G = (V, E, D), where D is the user-defined data. Data is broadly defined as model parameters, algorithm state, and statistical data. [Figure: graph-based data representation]
Data Graph – PageRank Example A graph, G = {V,E}, with arbitrary data associated with each vertex and edge. Graph: web graph. Vertex data: stores R(v), the current PageRank estimate. Edge data: stores w_{u,v}, the directed weight of the link.
Update Functions An update function is a stateless procedure that modifies the data within the scope of a vertex and schedules the future execution of update functions on other vertices. A GraphLab update function takes a vertex v and its scope S_v and returns the new versions of the data in the scope as well as a set of vertices T: Update: f(v, S_v) -> (S_v, T)
Update Function: PageRank The update function for PageRank computes a weighted sum of the current ranks of neighboring vertices and assigns it as the rank of the current vertex. The algorithm is adaptive: neighbors are scheduled for update only if the value of the current vertex changes by more than a predefined threshold. [Figure: the current vertex and its scope]
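A minimal Python sketch of such an update function follows. It is not the GraphLab API: the dictionary-based scope, the damping term alpha, the threshold value, and the choice to schedule only out-neighbors are all assumptions made for illustration.

```python
def pagerank_update(v, rank, in_weights, out_neighbors, alpha=0.15, threshold=1e-4):
    """Recompute R(v) from in-neighbors; return the set of vertices to schedule.

    rank[u]          -> current PageRank estimate R(u) (vertex data)
    in_weights[v][u] -> weight w_{u,v} of the link u -> v (edge data)
    alpha, threshold -> assumed parameters, not taken from the slides
    """
    n = len(rank)
    old = rank[v]
    # Weighted sum of the current ranks of the vertices linking to v.
    rank[v] = alpha / n + (1 - alpha) * sum(
        w * rank[u] for u, w in in_weights[v].items()
    )
    # Adaptive scheduling: only wake neighbors if v changed noticeably.
    if abs(rank[v] - old) > threshold:
        return set(out_neighbors[v])
    return set()


# Toy web graph: a -> b, a -> c, b -> c, c -> a, with w_{u,v} = 1 / outdegree(u).
in_weights = {"a": {"c": 1.0}, "b": {"a": 0.5}, "c": {"a": 0.5, "b": 1.0}}
out_neighbors = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {v: 1.0 / 3 for v in in_weights}

scheduled = pagerank_update("c", rank, in_weights, out_neighbors)
print(round(rank["c"], 3), scheduled)  # 0.475 {'a'}
```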
The GraphLab Execution Model The GraphLab execution model enables efficient distribution by relaxing the execution-ordering requirements of the shared-memory GraphLab and allowing the GraphLab runtime engine to determine the best order in which to run vertices. It eliminates messages and isolates the user-defined algorithm from the movement of the data, allowing the system to choose when and how to move the program state.
GraphLab Execution Model
  Input: Data Graph G = (V, E, D)
  Input: Initial vertex set T = {v_1, v_2, …}
  while T is not Empty do
    v <- RemoveNext(T)
    (T', S_v) <- f(v, S_v)
    T <- T ∪ T'
  Output: Modified Data Graph G = (V, E, D')
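The loop above can be sketched sequentially in Python. The FIFO queue stands in for RemoveNext (the real engine is free to pick any order), and the minimum-label update used to exercise the loop is a hypothetical stand-in for a user-defined update function.

```python
from collections import deque


def run_execution_model(initial_vertices, update, data):
    """Sequential sketch of the loop: pop v, run f(v, S_v), merge T' into T."""
    queue = deque(initial_vertices)      # the scheduled set T
    queued = set(queue)
    while queue:
        v = queue.popleft()              # v <- RemoveNext(T)
        queued.discard(v)
        new_tasks = update(v, data)      # (T', S_v) <- f(v, S_v)
        for u in new_tasks:              # T <- T ∪ T'
            if u not in queued:
                queued.add(u)
                queue.append(u)


def min_label_update(v, data):
    """Hypothetical update: adopt the smallest label seen in the scope of v."""
    neighbors, label = data["neighbors"], data["label"]
    best = min([label[v]] + [label[u] for u in neighbors[v]])
    if best < label[v]:
        label[v] = best
        return set(neighbors[v])         # changed, so reschedule the neighbors
    return set()                         # converged locally, schedule nothing


data = {
    "neighbors": {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]},
    "label": {v: v for v in range(5)},
}
run_execution_model(range(5), min_label_update, data)
print(data["label"])  # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}
```

Because vertices are rescheduled only when a neighbor changes, the loop also illustrates dynamic computation: it stops as soon as T is empty instead of sweeping all vertices a fixed number of times.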
Ensuring Serializability GraphLab ensures a serializable execution by stipulating that for every parallel execution, there exists a sequential execution of update functions which produces the same result. GraphLab provides several consistency models which allow the runtime to optimize the parallel execution while maintaining serializability. The greater the consistency, the lower the parallelism. The three models are full consistency, edge consistency, and vertex consistency.
Full Consistency The full consistency model ensures that the scopes of concurrently executing update functions do not overlap. The update function has complete read-write access to its entire scope. This limits the potential parallelism since concurrently executing update functions must be at least two vertices apart.
Edge Consistency The edge consistency model ensures each update function has exclusive read-write access to its vertex and adjacent edges, but read-only access to adjacent vertices. This increases parallelism by allowing update functions with slightly overlapping scopes to safely run in parallel.
Vertex Consistency The vertex consistency model only provides write access to the central vertex data. This allows all update functions to be run in parallel, providing maximum parallelism. However, this is the least consistent model available.
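One way to make the trade-off concrete is to ask, for each model, whether two update functions may run at the same time. The helper below is a hypothetical sketch based only on the descriptions above; it is not part of GraphLab.

```python
def can_run_concurrently(u, v, neighbors, model):
    """Return True if updates on u and v may overlap under the given model.

    neighbors maps each vertex to the set of its adjacent vertices.
    """
    if u == v:
        return False
    if model == "vertex":
        # Only the central vertex is written, so any two distinct vertices are fine.
        return True
    adjacent = v in neighbors[u]
    if model == "edge":
        # Adjacent vertices would both write the shared edge and each other's reads.
        return not adjacent
    if model == "full":
        # Scopes must not overlap at all: not adjacent and no common neighbor.
        return not adjacent and not (neighbors[u] & neighbors[v])
    raise ValueError(f"unknown consistency model: {model}")


# Chain graph a - b - c - d
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
for pair in [("a", "b"), ("a", "c"), ("a", "d")]:
    print(pair, [m for m in ("vertex", "edge", "full")
                 if can_run_concurrently(*pair, chain, m)])
```

On the chain, the adjacent pair (a, b) may run together only under vertex consistency, (a, c) under vertex and edge consistency, and (a, d) under all three.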
Global Values Many MLDM algorithms require the maintenance of global statistics describing data stored in the data graph. GraphLab defines global values as values which are read by update functions and written with sync operations.
Sync Operation The sync operation is an associative commutative sum which is defined over all parts of the graph. This supports tasks like normalization that are common in MLDM algorithms. The sync operation runs continuously in the background to maintain updated estimates of the global value. Ensuring serializability of the sync operation is costly and requires synchronization and halting all computation.
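A sketch of why associativity and commutativity matter: partial results can be computed per partition and merged in any order. The function name `sync` and its parameters (`map_fn`, `combine`, `finalize`, `zero`) are illustrative assumptions rather than the GraphLab interface.

```python
from functools import reduce


def sync(partitions, map_fn, combine, finalize, zero):
    """Compute a global value from per-vertex data spread over partitions."""
    # Each partition folds its own vertices into a partial result...
    partials = [
        reduce(combine, (map_fn(vertex) for vertex in part), zero)
        for part in partitions
    ]
    # ...and the partial results are merged; order is irrelevant for an
    # associative, commutative combine such as addition.
    return finalize(reduce(combine, partials, zero))


# Vertex values stored on three hypothetical machines.
partitions = [[1.0, 2.0], [3.0], [4.0]]

total = sync(partitions, lambda x: x, lambda a, b: a + b, lambda t: t, 0.0)
normalized = [[x / total for x in part] for part in partitions]
print(total)       # 10.0
print(normalized)  # [[0.1, 0.2], [0.3], [0.4]]
```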
Distributed GraphLab Design
Distributed Data Graph A graph is distributed into k parts, where k is much greater than the number of machines. Each part, called an atom, is stored as a separate file on a distributed file system (DFS). A meta-graph is used to store the connectivity structure and file locations of the k atoms.
Distributed Data Graph - Ghost Vertices Each atom also stores information regarding ghosts: the set of vertices and edges adjacent to the partition boundary. Ghost vertices maintain adjacency structure and replicate remote data. They are used as caches for their true counterparts across the network, and coherence is managed with version control.
Distributed GraphLab Engines The Distributed GraphLab engine emulates the GraphLab execution model of the shared-memory abstraction. It is responsible for: executing update functions; executing sync operations; maintaining the set of scheduled vertices T; ensuring serializability with respect to the appropriate consistency model. Two engine types: 1) Chromatic Engine 2) Distributed Locking Engine
Chromatic Engine The Chromatic Engine uses vertex coloring to satisfy the edge consistency model by executing synchronously all vertices of the same color in the vertex set T before proceeding to the next color. Vertex consistency is satisfied by assigning all vertices the same color. Full consistency is satisfied by ensuring that no vertex shares the same color as any of its distance-two neighbors. The Chromatic Engine has low overhead, but does not support vertex prioritization. [Figure: edge consistency model using the Chromatic Engine]
Distributed Locking Engine The Distributed Locking Engine uses the technique of mutual exclusion by associating a readers-writer lock with each vertex. Vertex consistency is achieved by acquiring a write lock on the central vertex of each requested scope. Edge consistency is achieved by acquiring a write lock on the central vertex and read locks on adjacent vertices. Full consistency is achieved by acquiring write locks on the central vertex and all adjacent vertices. [Figure: write lock on the central vertex and read locks on the rest of its scope]
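Given an assignment of vertices to atoms, the ghosts of each atom can be derived from the edges that cross the partition boundary. The helper below is a hypothetical sketch of that bookkeeping; it tracks only which vertices are ghosts, not the cached data or version numbers.

```python
def ghost_vertices(edges, owner):
    """Map each atom to the set of remote vertices it must keep as ghosts.

    edges is a list of (u, v) pairs; owner[v] is the atom that stores v.
    """
    ghosts = {atom: set() for atom in set(owner.values())}
    for u, v in edges:
        if owner[u] != owner[v]:
            # The edge crosses the boundary: each side caches the other end.
            ghosts[owner[u]].add(v)
            ghosts[owner[v]].add(u)
    return ghosts


edges = [(1, 2), (2, 3), (3, 4)]
owner = {1: 0, 2: 0, 3: 1, 4: 1}  # atom 0 holds {1, 2}, atom 1 holds {3, 4}
print(ghost_vertices(edges, owner))  # {0: {3}, 1: {2}}
```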
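The idea can be sketched with a greedy coloring followed by color-by-color execution. Greedy coloring is an assumption here (the slides do not say how colors are obtained), and the color steps are only computed, not run in parallel.

```python
def greedy_coloring(neighbors):
    """Give each vertex the smallest color not used by an already-colored neighbor."""
    color = {}
    for v in sorted(neighbors):
        used = {color[u] for u in neighbors[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


def color_steps(color, scheduled):
    """Group the scheduled vertices by color; each group can run in parallel."""
    steps = {}
    for v in scheduled:
        steps.setdefault(color[v], []).append(v)
    return [sorted(steps[c]) for c in sorted(steps)]


# A 4-cycle: 0 - 1 - 2 - 3 - 0
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
color = greedy_coloring(cycle)
print(color)                         # {0: 0, 1: 1, 2: 0, 3: 1}
print(color_steps(color, cycle))     # [[0, 2], [1, 3]]
```

No two vertices in the same step are adjacent, so running a step in parallel respects edge consistency.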
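The lock sets described above can be written down as a small planning function. The function is hypothetical, and returning the locks in sorted vertex order (a common way to avoid deadlock) is an assumption, not something stated on the slide.

```python
def lock_plan(v, neighbors, model):
    """Return the (vertex, lock type) pairs needed to run an update on v."""
    if model == "vertex":
        plan = {v: "write"}
    elif model == "edge":
        plan = {u: "read" for u in neighbors[v]}
        plan[v] = "write"
    elif model == "full":
        plan = {u: "write" for u in neighbors[v]}
        plan[v] = "write"
    else:
        raise ValueError(f"unknown consistency model: {model}")
    # Assumed canonical order: acquire locks by ascending vertex id.
    return sorted(plan.items())


neighbors = {2: {1, 3}}
print(lock_plan(2, neighbors, "vertex"))  # [(2, 'write')]
print(lock_plan(2, neighbors, "edge"))    # [(1, 'read'), (2, 'write'), (3, 'read')]
print(lock_plan(2, neighbors, "full"))    # [(1, 'write'), (2, 'write'), (3, 'write')]
```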
Pipelined Locking and Prefetching Each machine maintains a pipeline of locks that have been requested but not yet fulfilled. The pipelining system uses callbacks instead of readers-writer locks, since the latter would halt the pipeline. Lock acquisition requests provide a pointer to a callback, which is called once the request is fulfilled. Pipelining reduces latency by synchronizing locked data immediately after each local lock.
Pipelined Locking Engine Thread Loop
  while not done do
    if Pipeline Has Ready Vertex v then
      Execute (T', S_v) = f(v, S_v)
      // update scheduler on each machine
      For each machine p, Send {s ∈ T' : owner(s) = p}
      Release locks and push changes to S_v in background
    else
      Wait on the Pipeline
Fault Tolerance GraphLab uses a distributed checkpoint system called Snapshot Update to introduce fault tolerance. Snapshot Update can be deployed synchronously or asynchronously. Asynchronous snapshots are more efficient and can guarantee a consistent snapshot under the following conditions: 1) Edge consistency is used on all update functions. 2) Schedule completes before the scope is unlocked. 3) Snapshot Update is prioritized over other updates.
Snapshot Update on vertex v
  if v was already snapshotted then
    Quit
  Save D_v // Save current vertex
  foreach u ∈ N[v] do // Loop over neighbors
    if u was not snapshotted then
      Save data on edge D_{u↔v}
      Schedule u for a Snapshot Update
  Mark v as snapshotted
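A sequential Python sketch of the Snapshot Update follows. It only mirrors the control flow of the pseudocode: the scheduler is a plain queue, "saving" means copying into dictionaries, and none of the locking or prioritization conditions are modeled.

```python
from collections import deque


def snapshot(start, neighbors, vertex_data, edge_data):
    """Run Snapshot Update from `start`; return the saved vertex and edge data."""
    saved_vertices, saved_edges = {}, {}
    snapshotted = set()
    queue = deque([start])                      # vertices scheduled for a snapshot
    while queue:
        v = queue.popleft()
        if v in snapshotted:                    # already snapshotted: quit
            continue
        saved_vertices[v] = vertex_data[v]      # save current vertex
        for u in neighbors[v]:                  # loop over neighbors
            if u not in snapshotted:
                key = frozenset((u, v))
                saved_edges[key] = edge_data[key]   # save data on the edge
                queue.append(u)                 # schedule u for a Snapshot Update
        snapshotted.add(v)                      # mark v as snapshotted
    return saved_vertices, saved_edges


# Triangle a-b-c with a tail c-d.
neighbors = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
vertex_data = {"a": 1, "b": 2, "c": 3, "d": 4}
edge_data = {frozenset(e): "".join(e) for e in [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]}

saved_vertices, saved_edges = snapshot("a", neighbors, vertex_data, edge_data)
print(sorted(saved_vertices), len(saved_edges))  # ['a', 'b', 'c', 'd'] 4
```

The sketch reaches only the vertices connected to the starting vertex; a graph with several components would need one starting vertex per component.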
System Design Overview [Figure: Initialization phase: raw graph data on the distributed file system passes through a (MapReduce) graph builder (parsing + partitioning, atom collection, index construction), which writes atom files and an atom index back to the distributed file system. GraphLab execution phase: the atom index and atom files are read from the distributed file system onto the cluster (monitoring + atom placement), where the GL engines communicate through RPC over TCP.]
Locking Engine Design Overview [Figure: the GraphLab engine (update function execution threads, scope prefetch (locks + data) pipeline, scheduler) on top of the distributed graph (remote graph cache, distributed locks, local graph storage), both using asynchronous RPC communication over TCP.]
Applications
Netflix Movie Recommendation The Netflix movie recommendation task uses collaborative filtering to predict the movie ratings for each user based on the ratings of similar users. The alternating least squares (ALS) algorithm is often used and can be represented using the GraphLab abstraction. The sparse matrix R defines a bipartite graph connecting each user with the movies that they rated. Vertices are users and movies, and edges contain the ratings for a user-movie pair. [Figure: bipartite graph connecting users 1, 2 with movies a, b, c, d]
Netflix Comparisons The GraphLab implementation was compared against Hadoop and MPI using between 4 and 64 machines. GraphLab is substantially faster than Hadoop. It also slightly outperformed the optimized MPI implementation. [Figure: runtime of Hadoop, MPI, and GraphLab vs. number of machines]
Netflix Scaling with Intensity Plotted is the speedup achieved for varying values of dimensionality, d, and the corresponding number of cycles required per update. Extrapolating to obtain the theoretically optimal runtime, the estimated overhead of Distributed GraphLab at 64 machines is 12x for dimensionality 5 and 4.9x for dimensionality 100. This overhead includes graph loading and communication and gives a measurable objective for future optimizations. [Figure: speedup vs. number of machines for the ideal case, d=100, and d=20]
Video Co-segmentation (CoSeg) Video co-segmentation automatically identifies and clusters spatio-temporal segments of video that share similar texture and color characteristics. Frames of high-resolution video are processed by coarsening each frame to a regular grid of rectangular super-pixels. The CoSeg algorithm predicts the best label (e.g. sky, building, grass, pavement, trees) for each super-pixel. [Figure: high-resolution image and its super-pixel version]
CoSeg Algorithm Implementation CoSeg uses a Gaussian Mixture Model in conjunction with Loopy Belief Propagation. Updates that are expected to change vertex values significantly are prioritized. Distributed GraphLab is the only distributed graph abstraction that allows the use of prioritized scheduling. CoSeg scales excellently due to having a very sparse graph and high computational intensity.
Named Entity Recognition (NER) Named Entity Recognition is the task of determining the type (e.g., Person, Place, or Thing) of a noun-phrase (e.g. Obama, Chicago, or Car) from its context (e.g. “President..”, “Lives near..”, or “bought a..”). The data graph is bipartite, with one set of vertices corresponding to the noun-phrases and the other corresponding to the contexts. There is an edge between a noun-phrase and a context if the noun-phrase occurs in the context. Example noun-phrases by type: Food: onion, garlic, noodles, blueberries. Religion: Catholic, Freemasonry, Marxism, Catholic Chr.
NER Comparisons The GraphLab implementation of NER achieved a 20-30x speedup over Hadoop and was comparable to the optimized MPI implementation. However, GraphLab scaled poorly, achieving only a 3x improvement using 16x more machines. This poor performance can be attributed to the large vertex data size, dense connectivity, and poor partitioning.
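As a rough illustration of ALS on the bipartite rating graph, the sketch below uses one latent factor per vertex (d = 1), which reduces each least-squares solve to a single division. The ratings, vertex names, and regularization parameter `lam` are made up for the example; a real implementation would solve a d x d system per vertex.

```python
def als_vertex_update(v, factor, ratings, lam=0.0):
    """Refit the factor of vertex v, holding its neighbors' factors fixed.

    ratings[v] maps each neighbor of v (movie or user) to the rating on that edge.
    """
    numerator = sum(r * factor[u] for u, r in ratings[v].items())
    denominator = sum(factor[u] ** 2 for u in ratings[v]) + lam
    factor[v] = numerator / denominator


# Hypothetical ratings that are exactly rank one: rating = user_scale * movie_scale.
ratings_by_user = {
    "u1": {"a": 1.0, "b": 2.0, "c": 3.0},
    "u2": {"a": 2.0, "b": 4.0, "c": 6.0},
}

# Build the symmetric adjacency of the bipartite graph (users and movies).
ratings = {u: dict(row) for u, row in ratings_by_user.items()}
for u, row in ratings_by_user.items():
    for movie, r in row.items():
        ratings.setdefault(movie, {})[u] = r

factor = {v: 1.0 for v in ratings}
for _ in range(5):
    for u in ("u1", "u2"):          # update user vertices, movies held fixed
        als_vertex_update(u, factor, ratings)
    for m in ("a", "b", "c"):       # update movie vertices, users held fixed
        als_vertex_update(m, factor, ratings)

predicted = factor["u2"] * factor["c"]
print(predicted)
```

Each update touches only a vertex and its adjacent edges and vertices, which is why ALS fits the update-function pattern.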
EC2 Cost Evaluation The price-runtime curves for GraphLab and Hadoop illustrate the monetary cost of deploying either system. The price-runtime curve demonstrates diminishing returns: the cost of attaining reduced runtimes increases faster than linearly. For the Netflix application, GraphLab is about two orders of magnitude more cost-effective than Hadoop.
Conclusion Distributed GraphLab extends the shared-memory GraphLab to the distributed setting by: refining the execution model; relaxing the schedule requirements; introducing a new distributed data graph; introducing new execution engines; introducing fault tolerance. Distributed GraphLab outperforms Hadoop by 20-60x and is competitive with tailored MPI implementations.
Asynchronous Iterative Update [Figure: dependency graph of shoppers’ preferences; local update of a shopper]