Machine Learning in the Cloud. Yucheng Low, Joey Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joe Hellerstein, David O’Hallaron
In ML we face BIG problems: 13 million Wikipedia pages; 500 million Facebook users; 24 hours of video uploaded to YouTube every minute; 3.6 billion Flickr photos.
Exponential Parallelism. [Figure: processor speed (GHz) versus release date; regions labeled "Exponentially Increasing Sequential Performance", "Constant Sequential Performance", and "Exponentially Increasing Parallel Performance".] A decade ago, processors experienced exponentially increasing sequential performance. Over the last 5 years or so, sequential performance has stagnated, and instead we are observing exponentially increasing parallel performance. We need to take advantage of parallelism to make use of the ever-increasing dataset sizes that are now available.
The Challenges of Parallelism. Wide array of different parallel architectures: GPUs, Multicore, Clusters, Mini Clouds, Clouds. New algorithm design challenges: race conditions and deadlocks; distributed state. New software implementation challenges: parallel debugging and profiling; hardware-specific APIs. Scaling any non-trivial algorithm requires the designer to reason about race conditions and deadlocks, as well as a variety of other systems issues. ML experts have to repeatedly address these same parallel design challenges. Therefore we typically make use of high-level abstractions to manage much of the complexity for us. An abstraction that has gained significant popularity lately is the MapReduce abstraction. We will quickly review it here.
Our Current Solution: Graduate students. ML experts repeatedly solve the same parallel design challenges: implement and debug a complex parallel system; tune for a specific parallel platform. A month later the conference paper contains: “We implemented ______ in parallel.” Avoid these problems by using high-level abstractions.
MapReduce – Map Phase. [Figure: four CPUs (CPU 1 to CPU 4), each independently computing one value from its own input.] Embarrassingly parallel: independent computation, no communication needed. MapReduce has 2 parts: a Map stage and a Reduce stage. The Map stage represents embarrassingly parallel computation. That is, each computation is independent and can be performed on a different machine without any communication.
MapReduce – Map Phase. [Figure: the four CPUs compute a second batch of values alongside the earlier results 12.9, 42.3, 21.3, 25.8.] Embarrassingly parallel: independent computation, no communication needed. For instance, we could use MapReduce to perform feature extraction on a large number of pictures. For instance, .. to compute an attractiveness score.
MapReduce – Map Phase. [Figure: the four CPUs compute a third batch of values alongside the results of the first two batches.] Embarrassingly parallel: independent computation, no communication needed.
MapReduce – Reduce Phase. [Figure: CPU 1 and CPU 2 each fold the mapped values into an aggregate.] Fold/Aggregation. The Reduce stage is essentially a “fold” or an aggregation operation over the results. This can be used, for instance, to compile summary statistics.
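The two phases described above can be sketched in a few lines of Python. This is only an illustration, not how any MapReduce system is implemented: the helper names `map_phase` and `reduce_phase`, the toy "pictures", and the mean-pixel "feature" are all assumptions made for the example.

```python
from functools import reduce

def map_phase(records, fn):
    """Apply fn to every record independently; no record needs any other."""
    return [fn(record) for record in records]

def reduce_phase(values, fold, initial):
    """Fold the mapped values into a single aggregate."""
    return reduce(fold, values, initial)

# Hypothetical feature extraction: score each picture by its mean pixel value.
pictures = [[1, 2, 3], [4, 4, 4], [10, 0, 2]]
scores = map_phase(pictures, lambda pixels: sum(pixels) / len(pixels))

# Hypothetical summary statistic: the sum of all scores.
total = reduce_phase(scores, lambda acc, score: acc + score, 0.0)
```

Because every call to `fn` is independent, the list comprehension in `map_phase` could be split across machines without communication; only `reduce_phase` needs to see all the results.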
MapReduce and ML. Excellent for large data-parallel tasks! Data-Parallel (MapReduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics. Complex Parallel Structure: ? Is there more to Machine Learning? Algorithms such as cross validation and feature extraction are data-parallel algorithms, which are easily map-reduceable. But there is a much larger world of algorithms which have complex parallel structure and do not fit well in the MapReduce framework. However, there are some useful properties common to a number of ML algorithms which we can exploit.
Iterative Algorithms? We can implement iterative algorithms in MapReduce. [Figure: in each iteration, CPU 1, CPU 2, and CPU 3 process blocks of data; labels: Iterations, Barrier, Slow Processor.]
MapAbuse: Iterative MapReduce. Only a subset of data needs computation. [Figure: in each iteration, CPU 1, CPU 2, and CPU 3 process blocks of data; labels: Iterations, Barrier.]
MapAbuse: Iterative MapReduce. System is not optimized for iteration. [Figure: in each iteration, CPU 1, CPU 2, and CPU 3 process blocks of data; labels: Iterations, Startup Penalty, Disk Penalty.]
Data-Parallel Algorithms can be Inefficient. [Figure: comparison of Optimized in Memory MapReduce BP and Asynchronous Splash BP.] Limitations of MapReduce can lead to inefficient parallel algorithms. But distributed Splash BP was built from scratch… An efficient, parallel implementation was painful, painful, painful to achieve.
What about structured Problems? Example Problem: Will I be successful in research? Success depends on the success of others. May not be able to safely update neighboring nodes [e.g., Gibbs Sampling]. Interdependent Computation: Not Map-Reducible. But what if we have a million images, all are related, and we want to reason about them together? Say, for instance, people who are attractive tend to make friends with attractive people. Computation becomes interdependent, and this is no longer easily MapReduceable.
Parallel Computing and ML. Not all algorithms are efficiently data parallel. Data-Parallel (MapReduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics. Structured Iterative Parallel (GraphLab): Tensor Factorization, Lasso, Kernel Methods, Belief Propagation, SVM, Learning Graphical Models, Sampling, Deep Belief Networks, Neural Networks.
Common Properties. 1) Sparse Data Dependencies: Sparse Primal SVM, Tensor/Matrix Factorization. 2) Local Computations: Sampling, Belief Propagation. 3) Iterative Updates: Expectation Maximization, Optimization (Operation A, Operation B, repeated). Many ML algorithms have relatively sparse data dependencies; that is to say, there are not too many relationships between the data elements. Computation is local. This means that computation can be broken down into pieces where each piece does not depend on a large amount of program state. Finally, many ML algorithms tend to be iterative in nature. That is, the same set of operations is done repeatedly.
Gibbs Sampling. 1) Sparse Data Dependencies. 2) Local Computations. 3) Iterative Updates. [Figure: graphical model over the variables X1 to X9.] For instance, Gibbs sampling is one simple example which satisfies the 3 properties. The data dependencies of Gibbs sampling can be sparse, depending on the underlying graphical model. The amount of state needed to sample a single variable is small: this is essentially governed by the size of the Markov blanket. For instance, to sample x6 you only need information about x3 and x9. Finally, Gibbs sampling is iterative in nature, as this procedure is repeated identically on all the vertices.
GraphLab is the Solution. Designed specifically for ML needs: express data dependencies; iterative. Simplifies the design of parallel programs: abstract away hardware issues; automatic data synchronization; addresses multiple hardware architectures (multicore, distributed, cloud computing; GPU implementation in progress). The GraphLab abstraction we built is designed to target these properties specifically: to make it easy to express data dependencies and to express iterative procedures. GraphLab simplifies the design of parallel programs by…. Note that the current implementation described here is multi-core, but the abstraction generalizes to the distributed setting. A distributed version is work in progress.
A New Framework for Parallel Machine Learning 4:20 max. I shall now present the GraphLab Framework.
GraphLab Model: Data Graph, Shared Data Table, Scheduling, Update Functions and Scopes. The GraphLab model is defined in 4 parts: the Data Graph, which is used to express sparse data dependencies in your computation, and the Shared Data Table, which is used to express global data as well as global computation. In addition, we also have the scheduler, which determines the order of computation, and the scope system, which provides thread safety and consistency.
Part 1: Data Graph. A graph with data associated with every vertex and edge. [Figure: graph over the variables X1 to X11. Vertex data, e.g. x3: sample value, C(X3): sample counts. Edge data, e.g. Φ(X6,X9): binary potential.] The data graph is a graph with data associated with every vertex and edge. Data can be any kind of data: parameters, vectors, matrices, images, and so on. For instance, in the case of Gibbs sampling, the data graph will be exactly the Markov random field. On each vertex we might store the current variable sample as well as a histogram of all the samples seen so far, while on the edges we will store the binary potential.
Update Functions. Update functions are operations applied on a vertex that transform data in the scope of the vertex. Gibbs Update: read samples on adjacent vertices; read edge potentials; compute new sample for current vertex. The data graph is modified via update functions. Update functions are operations which are applied on a vertex and transform the data in the scope of the vertex, where the scope of the vertex consists of the data on the vertex itself, adjacent vertices, and edges. For instance, in the case of Gibbs sampling, the sampling function just …. And this fits nicely within the framework of the update function.
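To make the data graph and the Gibbs update concrete, here is a minimal single-threaded Python sketch. It is not the GraphLab API: the `DataGraph` class, its field names, and the `gibbs_update` signature are hypothetical, the variables are assumed binary, and each edge potential is assumed to be a symmetric 2x2 table.

```python
import random

class DataGraph:
    """Hypothetical minimal data graph: data on every vertex and every edge."""
    def __init__(self):
        self.vertex_data = {}  # vertex -> {"sample": int, "counts": [n0, n1]}
        self.edge_data = {}    # frozenset({u, v}) -> 2x2 potential table
        self.neighbors = {}    # vertex -> set of adjacent vertices

    def add_vertex(self, v, sample=0):
        self.vertex_data[v] = {"sample": sample, "counts": [0, 0]}
        self.neighbors[v] = set()

    def add_edge(self, u, v, potential):
        self.edge_data[frozenset((u, v))] = potential
        self.neighbors[u].add(v)
        self.neighbors[v].add(u)

def gibbs_update(graph, v, rng):
    """Resample binary vertex v using only the data in its scope."""
    weights = [1.0, 1.0]
    for u in graph.neighbors[v]:
        phi = graph.edge_data[frozenset((u, v))]  # read edge potential
        x_u = graph.vertex_data[u]["sample"]      # read adjacent sample
        for x_v in (0, 1):
            weights[x_v] *= phi[x_v][x_u]
    p_one = weights[1] / (weights[0] + weights[1])
    new_sample = 1 if rng.random() < p_one else 0
    data = graph.vertex_data[v]
    data["sample"] = new_sample                   # write current sample
    data["counts"][new_sample] += 1               # update sample histogram
    return new_sample

# Three-vertex chain X3 - X6 - X9 with a potential that favors agreement.
rng = random.Random(0)
graph = DataGraph()
for name in ("X3", "X6", "X9"):
    graph.add_vertex(name, sample=1)
agree = [[2.0, 1.0], [1.0, 2.0]]
graph.add_edge("X3", "X6", agree)
graph.add_edge("X6", "X9", agree)
for _ in range(1000):
    gibbs_update(graph, "X6", rng)
counts = graph.vertex_data["X6"]["counts"]
```

With both neighbors fixed at 1, the update draws X6 = 1 with probability 4/5 under this potential, so the histogram in `counts` should lean heavily toward 1.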
Update Function Schedule. [Figure: data graph with vertices a to k, and a schedule of vertex tasks feeding CPU 1 and CPU 2.] These update functions are evaluated in parallel based on a schedule, which abstractly represents a sequence of tasks to be executed. For instance, here we have 2 CPUs and a data graph. Each processor then reads a vertex from the schedule… executes it.
Part 2: Update Function Schedule. [Figure: data graph with vertices a to k, and the remaining scheduled tasks feeding CPU 1 and CPU 2.] And this repeats in parallel until the scheduler runs out of tasks or some termination condition is reached.
Need for Dynamic Scheduling. [Figure labels: Converged; Slowly Converging; Focus Effort.] However, static schedules are insufficient. For instance, one part of the problem could be easy and could have converged in far fewer iterations than another part of the problem. Dynamic scheduling could allow us to “focus” computation on the difficult part of the problem.
Dynamic Schedule. [Figure: data graph with vertices a to k, and a schedule of vertex tasks feeding CPU 1 and CPU 2.] To see how a dynamic schedule might work, we can repeat the earlier scheduling example. As usual, each CPU reads an element off the scheduler and runs the update function, but the update function has the opportunity to insert new tasks back into the schedule.
Dynamic Schedule. Update functions can insert new tasks into the schedule. FIFO Queue: Wildfire BP [Selvatici et al.] (--scheduler=fifo). Priority Queue: Residual BP [Elidan et al.] (--scheduler=priority). Splash Schedule: Splash BP [Gonzalez et al.] (--scheduler=splash). Obtain different algorithms simply by changing a flag! A dynamic schedule is a scheduler where update functions can insert new tasks into the schedule. Many types of dynamic schedules are possible. Essentially, by changing the underlying data structure, we can obtain different kinds of schedules. For instance, if we were to do Belief Propagation…. We can obtain different BP-style algorithms just by changing a single flag!
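The idea that the schedule is "just a data structure" can be sketched as follows. This is a single-threaded illustration with hypothetical class and function names (`FifoScheduler`, `PriorityScheduler`, `run`); it is not GraphLab's scheduler interface, and the splash schedule is not covered.

```python
import heapq
from collections import deque

class FifoScheduler:
    """Tasks run in the order they were inserted; priorities are ignored."""
    def __init__(self):
        self._queue = deque()
    def add_task(self, vertex, priority=0.0):
        self._queue.append(vertex)
    def pop(self):
        return self._queue.popleft()
    def empty(self):
        return not self._queue

class PriorityScheduler:
    """The task with the highest priority runs first."""
    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker: equal priorities keep insertion order
    def add_task(self, vertex, priority=0.0):
        heapq.heappush(self._heap, (-priority, self._count, vertex))
        self._count += 1
    def pop(self):
        return heapq.heappop(self._heap)[2]
    def empty(self):
        return not self._heap

def run(scheduler, update_fn, initial_tasks):
    """Pop tasks until the schedule is empty; updates may insert new tasks."""
    for vertex, priority in initial_tasks:
        scheduler.add_task(vertex, priority)
    order = []
    while not scheduler.empty():
        vertex = scheduler.pop()
        order.append(vertex)
        update_fn(vertex, scheduler)
    return order

def make_update():
    """Hypothetical update: the first time 'a' runs, it schedules 'd'."""
    seen = set()
    def update(vertex, scheduler):
        if vertex == "a" and "a" not in seen:
            seen.add("a")
            scheduler.add_task("d", priority=5.0)
    return update

tasks = [("a", 1.0), ("b", 3.0), ("c", 2.0)]
fifo_order = run(FifoScheduler(), make_update(), tasks)
priority_order = run(PriorityScheduler(), make_update(), tasks)
```

The same update function and the same initial tasks produce different execution orders, which is the sense in which swapping the scheduler yields a different algorithm.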
Global Information. What if we need global information? Algorithm parameters? Sufficient statistics? Sum of all the vertices? However, all of this is restricted to just local computation. What if we need some global information? For instance, what if you want to specify algorithm parameters to update functions? …
Part 3: Shared Data Table (SDT). Global constant parameters, e.g. Constant: Temperature; Constant: Total # Samples. This brings us to the shared data table. The shared data table is a table where global constant parameters can be stored. This table is accessible, but read-only, to all update functions. For instance, for Gibbs, I could store parameters such as the total number of samples I would like to draw, or some temperature parameter. It can also do computation.
Sync Operation. Sync is a fold/reduce operation over the graph. Accumulate performs an aggregation over vertices. Apply makes a final modification to the accumulated data. Example: compute the average of all the vertices (Accumulate Function: Add; Apply Function: Divide by |V|). [Figure: Sync! sweeps over the vertex values, accumulating a running sum.] The Sync operation is similar to MapReduce’s “reduce” operation in that it performs a reduction over all the graph data. It can be associated with an entry in the shared data table, and it is defined by two user functions: an accumulate and an apply. The accumulate function performs an aggregation over vertices, while the apply function makes a final modification to the accumulated data. So, for instance, if I would like to compute the average of all the vertices, I would define an accumulate function which just adds, and an apply function which ….. Then, when I call sync on the associated entry in the shared data table, the add function will be called with an initial 0 value and the first vertex in the graph. This then repeats for all the vertices in the graph, accumulating the result as it goes along. When it is done, the apply function is called on the final value, and the result is stored back in the shared data table. This operation allows global information to be aggregated. More importantly, GraphLab allows this operation to be run at the same time as other update functions, allowing it to be used to perform background tasks, such as estimating the termination condition.
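The averaging example can be written out as a short Python sketch. The function name `sync`, its signature, the dictionary standing in for the shared data table, and the vertex values are assumptions made for illustration; this version also runs sequentially rather than in the background.

```python
def sync(vertex_values, accumulate, apply, initial):
    """Fold accumulate over all vertices, then apply a final modification."""
    acc = initial
    for value in vertex_values:
        acc = accumulate(acc, value)
    return apply(acc, len(vertex_values))

# Hypothetical vertex data and shared data table.
vertex_values = [1, 6, 1, 5, 2, 3]
shared_data_table = {}

# Accumulate: add. Apply: divide by |V|.
shared_data_table["average"] = sync(
    vertex_values,
    accumulate=lambda acc, value: acc + value,
    apply=lambda acc, num_vertices: acc / num_vertices,
    initial=0,
)
```

Other global quantities fit the same shape: a sum needs an identity `apply`, and a maximum needs `max` as the accumulate function.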
Shared Data Table (SDT). Global constant parameters; global computation (Sync Operation). In the context of the Gibbs example: Constant: Temperature; Constant: Total # Samples; Sync: Loglikelihood; Sync: Sample Statistics.
Safety and Consistency 10 min. As with all parallel programs, safety and consistency are important. What do I mean by safety and consistency? Let's see the following example.
Write-Write Race. If adjacent update functions write simultaneously: [Figure: histograms labeled “Left update writes”, “Right update writes”, and “Final Value”.] But if two adjacent vertices are updated simultaneously…. we could get a clash on the center edge. The result on the center edge could become inconsistent. [pause] For instance, the left vertex could write the blue histogram to the edge. The right… But due to the collision, it is possible for the final value to be very wrong.
Race Conditions + Deadlocks. Just one of the many possible races. Race-free code is extremely difficult to write. GraphLab design ensures race-free operation (through a user-tunable consistency mechanism). This is just one of the many possible races that could occur in your code. Race-free code is extremely difficult to write. The GraphLab design, however, can ensure race-free operation. GraphLab does this through the use of scope rules.
Part 4: Scope Rules. Full Consistency: guaranteed safety for all update functions. GraphLab allows you to pick from a few consistency models, of which the strongest is called the full consistency model. This works by ensuring that all update function scopes do not overlap.
Full Consistency. Parallel update only allowed two vertices apart. Reduced opportunities for parallelism. It has, however, reduced opportunities for parallelism: it only allows update functions two vertices apart to be run in parallel.
Obtaining More Parallelism. Not all update functions will modify the entire scope! Full Consistency, Edge Consistency. Belief Propagation: only uses edge data. Gibbs Sampling: only needs to read adjacent vertices. We can, however, relax this to try to obtain more parallelism, by taking advantage of the fact… not all update functions will modify the entire scope! So we have an edge consistency model, which only guarantees safe access to data on the current vertex and adjacent edges.
Edge Consistency. The edge consistency model is weaker, and comes with a corresponding increase in parallelism, as now update functions which are just one vertex apart can be run in parallel.
Obtaining More Parallelism. Full Consistency, Edge Consistency, Vertex Consistency. “Map” operations, e.g. feature extraction on vertex data. Finally, if the update function does not even need to access edge data (for instance, Map-style operations), vertex consistency is sufficient. The vertex consistency model is the weakest: it only guarantees safe access to the vertex data itself.
Vertex Consistency. Largest amount of parallelism available, since adjacent vertices can all be run in parallel.
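One way to see how the three models trade safety for parallelism is to write down which data items each model protects and test whether two updates conflict. The sketch below is a simplified model, not GraphLab's locking code: the function names are hypothetical, and treating adjacent vertices as read-only under edge consistency is an assumption chosen to match the statements above.

```python
def scope_sets(neighbors, v, model):
    """Return (write_set, read_set) of data items protected for an update on v."""
    own = {("vertex", v)}
    edges = {("edge", frozenset((v, u))) for u in neighbors[v]}
    adjacent = {("vertex", u) for u in neighbors[v]}
    if model == "full":
        # Exclusive access to the entire scope.
        return own | edges | adjacent, set()
    if model == "edge":
        # Exclusive access to the vertex and its edges; neighbors are only read.
        return own | edges, adjacent
    if model == "vertex":
        # Only the vertex's own data is protected.
        return own, set()
    raise ValueError(f"unknown consistency model: {model}")

def can_run_in_parallel(neighbors, a, b, model):
    """Two updates conflict if either writes something the other touches."""
    write_a, read_a = scope_sets(neighbors, a, model)
    write_b, read_b = scope_sets(neighbors, b, model)
    return not (write_a & (write_b | read_b)) and not (write_b & (write_a | read_a))

# Chain graph 1 - 2 - 3 - 4.
chain = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
full_gap_one = can_run_in_parallel(chain, 1, 3, "full")     # one vertex between
full_gap_two = can_run_in_parallel(chain, 1, 4, "full")     # two vertices between
edge_adjacent = can_run_in_parallel(chain, 1, 2, "edge")
edge_gap_one = can_run_in_parallel(chain, 1, 3, "edge")
vertex_adjacent = can_run_in_parallel(chain, 1, 2, "vertex")
```

On this chain, full consistency only allows vertices 1 and 4 to run together, edge consistency also allows 1 and 3, and vertex consistency allows even the adjacent pair 1 and 2.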
Thm: Sequential Consistency. GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of update functions which produces the same result. Key for proving correctness of parallel algorithms. [Figure: timeline of a parallel execution on CPU 1 and CPU 2, and a sequential execution on CPU 1.] With the right choice of model, …. … Suppose we have the following simple 3-vertex graph, and GraphLab, using 2 CPUs in parallel, performs a particular sequence of update functions. The execution is sequentially consistent if there exists a sequentialization of the update functions onto a single-processor sequence such that the final result is exactly the same as that of the parallel version.
GraphLab Model: Data Graph, Shared Data Table, Scheduling, Update Functions and Scopes. To summarize GraphLab, we have seen: the Data Graph, which expresses the sparse data dependencies in your computation; the Shared Data Table, which provides global information and global computation; and the scheduling and the update functions, which provide large-scale parallelism as well as consistency guarantees. Together… makes up GraphLab.
Multicore Experiments 14 min
Multicore Experiments. Shared memory implementation in C++ using Pthreads. Tested on a 16-processor machine: 4x quad-core AMD Opteron 8384, 64 GB RAM. Algorithms: Belief Propagation + Parameter Learning, Gibbs Sampling, CoEM, Lasso, Compressed Sensing, SVM, PageRank, Tensor Factorization. To demonstrate the GraphLab abstraction, we implemented a shared memory version in C++ using Pthreads and tested it on a 16-processor machine with 64 GB of RAM. We implemented a large number of algorithms, but I will only present results on the 4 on the left.
Graphical Model Learning. 3D retinal image denoising. Data Graph: 256x64x64 (1M) vertices. Update Function: Belief Propagation. Sync: Edge-potential; Acc: compute inference statistics; Apply: take a gradient step. The first experiment demonstrates use of the complete GraphLab pipeline. 3D retina images are acquired by measuring the density at each voxel with a laser beam. The resultant images are, however, quite noisy, and the aim is to use belief propagation to denoise the image. Parameter learning is used to estimate the edge potentials. …. The data graph is a pairwise Markov random field arranged as a 3D cube. That’s over a million variables. The update function used here is … We store the shared edge potentials in the shared data table and we use a sync operation to perform parameter learning. How do we do that? The accumulation phase is used to gather inference statistics on the graph, while the apply phase is used to take a gradient step.
Graphical Model Learning. [Figure: speedup plot; labels: Better, Optimal, Splash Schedule, Approx. Priority Schedule.] 15.5x speedup on 16 CPUs. I will now present speedup results from running the problem on 1 to 16 processors. We tested two schedulers on this problem, a priority queue scheduler and a splash scheduler, and we obtained nearly linear speedup, achieving 15x to 15.5x speedup on 16 processors. However, we can do better with GraphLab.
Graphical Model Learning. Standard parameter learning takes the gradient only after inference is complete. With GraphLab: take gradient steps while inference is running. [Figure: runtime. Iterated (inference, then gradient step): 2100 sec. Simultaneous (parallel inference + gradient step): 700 sec. 3x faster!] A typical parameter learning setup requires the gradient step to be computed only after inference is complete. However, since GraphLab allows you to perform Sync operations in the background while update functions are running, we can easily experiment with simultaneous parameter learning and inference: taking gradient steps while the inference algorithm is still running. And as we can see, by doing this, we can attain an additional 3x speedup over the typical algorithm.
Lasso. Data matrix, n x d; observations, n x 1; weights, d x 1. [Figure: graph with 5 features and 4 examples.] Shooting Algorithm [Coordinate Descent]: updates on weight vertices modify losses on observation vertices. Requires the Full Consistency Model. Financial prediction dataset from Kogan et al [2009].
Full Consistency. [Figure: speedup plot; labels: Better, Optimal, Sparse, Dense.] Now for some results. With the full consistency scope, we obtain a poor 2.1x speedup on 16 processors. But we do better on the sparser dataset, attaining a 4x speedup. This is expected, as the full consistency model has far fewer opportunities for parallelism on denser models.
Relaxing Consistency. [Figure: speedup plot; labels: Better, Optimal, Dense, Sparse.] Why does this work? (We may have an answer soon.) Now, however, here is an interesting observation we made. What if we relax the consistency to the weakest model, vertex consistency? To our surprise, it still converges to the same loss! But of course, now at a significant increase in performance, attaining about 9x speedup on 16 processors on the dense dataset, and a 5x speedup on the sparse dataset. Just as a comparison, here are the curves when the full consistency model is used. It is, however, still an open question why the system still converges in this setting.
CoEM (Rosie Jones, 2005). Named Entity Recognition Task: Is “Dog” an animal? Is “Catalina” a place? [Figure: bipartite graph with noun phrases (the dog, Australia, Catalina Island) and contexts (<X> ran quickly, travelled to <X>, <X> is pleasant).] Datasets: Small: 0.2M vertices, 20M edges. Large: 2M vertices, 200M edges. Hadoop: 95 cores, 7.5 hrs. Our 3rd experiment is CoEM, a named entity recognition task. We test the scalability of our GraphLab implementation. The aim of CoEM is to classify noun phrases. For instance…. The CoEM problem can be represented as a bipartite graph with noun phrases on the left and contexts on the right. An edge between a noun phrase and a context means that the NP was observed with that context in the corpus. For instance, the top edge means that the phrase “the dog ran quickly” was observed. Small dataset, large dataset. Graph. Just to show you how hard this dataset is, Hadoop took 7.5 hrs on 95 cores.
CoEM (Rosie Jones, 2005). [Figure: speedup plot; labels: Better, Optimal, Large, Small.] Hadoop: 95 cores, 7.5 hrs. GraphLab: 16 cores, 30 min. 15x faster! 6x fewer CPUs! We now show the performance of our implementation. On the small problem, we achieve a respectable 12x speedup on 16 processors. On the large problem, we are able to achieve nearly a perfect speedup. This is due to the large amount of work available. Now if you recall…. GraphLab however only used ….. So we used 6x fewer CPUs to get 15x faster performance.
GraphLab in the Cloud
Moving towards the cloud… Purchasing and maintaining computers is very expensive. Most computing resources are seldom used: only for deadlines… Buy time, access hundreds or thousands of processors. Only pay for needed resources.
Addressing cloud computing challenges (challenge: GraphLab solution). Distributed memory: optimized data partition. Limited bandwidth: smart caching, interleave computation/comm. High latency: push data, latency hiding mechanism.
GraphLab in the Cloud Experiments. Highly optimized implementation. Computer clusters, Amazon EC2: GraphLab automatically configures and distributes through EC2; very easy to use. Thoroughly evaluated in three case studies: coEM; probabilistic tensor factorization; video co-segmentation (inference & learning in a huge graphical model).
Experiment Setup. Tested on both Regular and HPC nodes, using up to 32 machines. Regular nodes: 8 cores per node, up to 128 cores. HPC nodes: 16 cores per node, up to 256 cores.
CoEM (Rosie Jones, 2005). [Figure: speedup plot; labels: Better, Optimal, Large, Small.] Hadoop: 95 cores, 7.5 hrs. GraphLab: 16 cores, 30 min. GraphLab in the Cloud: 32 EC2 machines, 80 secs. 0.3% of Hadoop time.
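The headline ratios on the CoEM slides follow directly from the reported runtimes and core counts. A short check in Python (the variable names are only for illustration):

```python
hadoop_seconds = 7.5 * 3600          # Hadoop: 7.5 hrs on 95 cores
multicore_seconds = 30 * 60          # GraphLab: 30 min on 16 cores
cloud_seconds = 80                   # GraphLab in the Cloud: 80 secs on 32 EC2 machines

multicore_speedup = hadoop_seconds / multicore_seconds   # 15x faster
core_ratio = 95 / 16                                     # about 6x fewer CPUs
cloud_percent = 100 * cloud_seconds / hadoop_seconds     # about 0.3% of Hadoop time
```

These compare wall-clock times across different systems and machines, so they describe end-to-end results rather than per-core efficiency.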
Video Cosegmentation. Segments mean the same. A lot of data! Complex solutions infeasible!
Video Cosegmentation. Naïve Idea: treat patches independently. Use Gaussian EM clustering (on image features). E step: predict membership of each patch given cluster centers. M step: compute cluster centers given memberships of each patch. Does not take relationships among patches into account!
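The naïve E step and M step can be sketched as below. This is a deliberately simplified stand-in for the clustering on the slide: it assumes one-dimensional features, unit-variance clusters with equal weights, and made-up feature values, and the function names are hypothetical.

```python
import math

def e_step(features, centers):
    """Soft membership of each patch given the cluster centers."""
    memberships = []
    for x in features:
        # Unnormalized Gaussian likelihood under each unit-variance cluster.
        weights = [math.exp(-0.5 * (x - c) ** 2) for c in centers]
        total = sum(weights)
        memberships.append([w / total for w in weights])
    return memberships

def m_step(features, memberships, num_clusters):
    """Cluster centers as membership-weighted means of the patch features."""
    centers = []
    for j in range(num_clusters):
        mass = sum(m[j] for m in memberships)
        weighted = sum(m[j] * x for m, x in zip(memberships, features))
        centers.append(weighted / mass)
    return centers

# Hypothetical 1-D patch features forming two well-separated groups.
features = [0.0, 0.2, -0.1, 5.0, 5.3, 4.9]
centers = [1.0, 4.0]
for _ in range(20):
    memberships = e_step(features, centers)
    centers = m_step(features, memberships, 2)
```

Each patch is handled on its own in `e_step`, which is exactly the limitation noted above: nothing ties the membership of one patch to that of its spatial or temporal neighbors.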
Video Cosegmentation. Better Idea: connect the patches using an MRF. Set edge potentials so that adjacent (spatially and temporally) patches prefer to be of the same cluster. Gaussian EM clustering with a twist: E step: make unary potentials for each patch using cluster centers; predict membership of each patch using BP. M step: compute cluster centers given memberships of each patch. D. Batra, et al. iCoseg: Interactive co-segmentation with intelligent scribble guidance. CVPR 2010.
Video Co-Segmentation. Discover “coherent” segment types across a video (extends Batra et al. ‘10). 1. Form super-voxels video. 2. EM & inference in Markov random field. Huge model: 23 million nodes, 390 million edges.
Cost-Time Tradeoff. [Figure: video co-segmentation results; annotations: “a few machines helps a lot”, “faster”, “diminishing returns”, “more machines, higher cost”.]
Bayesian Tensor Factorization. [Figure: Netflix ratings tensor with axes Users, Movies, and Time, with time factors; graph of User 1 to User 3 and Movie 1 to Movie 3 with edges labeled “At time 6”, “At time 2”, “At time 3”.] Vertices store user-factors and movie-factors. Edges store user<->movie ratings. Time-factors are stored as Shared Data.
Bayesian Tensor Factorization. [Figure: two plots, each with series for HPC Nodes and Regular Nodes; labels: Better, Optimal.] Netflix dataset: 480K users, 18K movies, 27 time periods, 100M ratings. Xiong et al (2010): 1 core, 2160s per iteration. Distributed GraphLab (HPC): 256 cores, 6.8s per iteration.
Parallel GraphLab 1.1. Multicore: available today. GraphLab in the Cloud: soon… Documentation… Code… Tutorials… We are open sourcing our reference implementation of Parallel GraphLab today. http://graphlab.ml.cmu.edu
GraphLab Release 1.1. Parallel PThread-based implementation. Matlab™ 2010b interface using EMLC. Java, Jython interface through JNI. Tutorials and demonstration code. [Diagram labels: Native C++ Interface; GraphLab Engine; Multicore; Distributed; GPU.]
C++, Java and Python. Native C++ interface: highest performance; access to complete GraphLab feature set; template-heavy code can be difficult for novice C++ programmers. Pure Java API for GraphLab: use plain Java objects for the whole graph; write update function in Java; full GraphLab C++ performance through Java Native Interface (JNI). Python API for GraphLab (via Jython): vertex and edge data can be any Python type; Python allows very concise update functions; works on top of Java interface for GraphLab.
Matlab Interface. Update functions are written in a Matlab subset (Embedded Matlab). Update functions are compiled to native C++ code which does not depend on Matlab. The generated MEX interface allows the resultant GraphLab program to interface with Matlab easily.
Matlab Interface
GraphLab Model: Data Graph, Shared Data Table, Scheduling, Update Functions and Scopes.
GraphLab. Parallel abstraction tailored to Machine Learning. Parallel framework compactly expresses data/computational dependencies and iterative computation. Achieves state-of-the-art parallel performance on a variety of problems. Easy to use: e.g., data partition, optimized communication, automatically configures & distributes over EC2,… GraphLab is an abstraction tailored specifically to the needs of a good number of ML algorithms. And most importantly, it is easy to use.
Future Work. Distributed GraphLab: robustness. GPU GraphLab: memory bus bottleneck, warp alignment. State-of-the-art performance for <Your Algorithm Here>. Future work includes Distributed GraphLab as well as GPU GraphLab. There are a large number of complex problems to solve along the way. And the hope is that if you use GraphLab, we should be able to scale your algorithm from the shared memory setting to the distributed setting with minimal additional work.
Parallel GraphLab 1.1. Multicore: available today. GraphLab in the Cloud: soon… Documentation… Code… Tutorials… We are open-sourcing our reference implementation of Parallel GraphLab today. http://graphlab.ml.cmu.edu
Matlab Interface: Gets the current vertex data.
Matlab Interface: Gets the edge data on a particular in-edge.
Matlab Interface: Computes the new belief by multiplying the incoming messages with the unary potential.
Matlab Interface: Updates the current vertex data.
Matlab Interface: For each in/out edge pair.
Matlab Interface: Computes the new outgoing message.
Matlab Interface: Sets the new outgoing edge data.
Matlab Interface: Computes the message residual.
Matlab Interface: If the residual is large, schedules the destination vertex.
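The steps annotated on the slides above can be sketched in Python roughly as follows. This is not the bpupdate.m code from the talk and not the GraphLab API: the names bp_update, run_bp, unary, neighbors and edge_potential are hypothetical, a single symmetric edge potential is assumed to be shared by all edges, and all messages are assumed to stay strictly positive so that an incoming message can be divided out of the belief.

```python
from collections import deque


def normalize(vec):
    """Scale a list of positive numbers so that it sums to 1."""
    total = sum(vec)
    return [v / total for v in vec]


def raw_belief(v, unary, neighbors, messages):
    """Unary potential of v times all incoming messages (unnormalized)."""
    belief = list(unary[v])
    for u in neighbors[v]:
        belief = [b * m for b, m in zip(belief, messages[(u, v)])]
    return belief


def bp_update(v, unary, neighbors, messages, edge_potential, schedule,
              threshold=1e-3):
    """One update on vertex v, following the annotated steps."""
    n_states = len(unary[v])
    # Get vertex data and in-edge data; compute the new belief.
    belief = raw_belief(v, unary, neighbors, messages)
    # For each in/out edge pair (u -> v, v -> u).
    for u in neighbors[v]:
        # Divide out the message that came from u (assumes it is positive).
        cavity = [b / m for b, m in zip(belief, messages[(u, v)])]
        # Compute the new outgoing message.
        new_msg = normalize([
            sum(edge_potential[xv][xu] * cavity[xv] for xv in range(n_states))
            for xu in range(len(unary[u]))
        ])
        # Compute the message residual, then set the outgoing edge data.
        residual = sum(abs(a - b) for a, b in zip(new_msg, messages[(v, u)]))
        messages[(v, u)] = new_msg
        # If the residual is large, schedule the destination vertex.
        if residual > threshold and u not in schedule:
            schedule.append(u)


def run_bp(unary, neighbors, edge_potential, max_updates=10000):
    """Run updates until the schedule is empty; return normalized beliefs."""
    messages = {
        (u, v): [1.0 / len(unary[v])] * len(unary[v])
        for v in neighbors for u in neighbors[v]
    }
    schedule = deque(neighbors)
    updates = 0
    while schedule and updates < max_updates:
        bp_update(schedule.popleft(), unary, neighbors, messages,
                  edge_potential, schedule)
        updates += 1
    return {v: normalize(raw_belief(v, unary, neighbors, messages))
            for v in neighbors}


# Tiny two-vertex MRF (hypothetical numbers).
unary = {0: [0.9, 0.1], 1: [0.5, 0.5]}
neighbors = {0: [1], 1: [0]}
edge_potential = [[2.0, 1.0], [1.0, 2.0]]
beliefs = run_bp(unary, neighbors, edge_potential)
```

On a tree such as this two-vertex example, the beliefs should match the exact marginals; the sketch runs sequentially and says nothing about the locking that a parallel engine would need.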
Matlab Interface (bpupdate.m). MRF specification. compile_update_function({'bpupdate'}, vdata_example, edata_example, …) [newvdata, newedata, newadj] = bp(vertexdata, edgedata, adjacency_matrix, scheduling_spec);
Matlab Interface (bpupdate.m). Resultant graph after running BP. compile_update_function({'bpupdate'}, vdata_example, edata_example, …) [newvdata, newedata, newadj] = bp(vertexdata, edgedata, adjacency_matrix, scheduling_spec);
Matlab to GraphLab Compiler Details. Graph datatype examples: Vdata.belief = [1,1]; Vdata.unary = [2,2]. Update functions: bpupdate.m. vdata.belief = [double(0)]; eml.varsize('vdata.belief', [1 Inf]); vdata.unary = [double(0)]; eml.varsize('vdata.unary', … Steps: type check restrictions and generate EML type descriptors; Matlab to C generation with EMLC; extensive use of C++ templates and the preprocessor to identify graph datatypes and to wrap the update functions; parse the output C code and generate converters between emxArray and mxArray; emxArray serialization and deserialization; generate binary, MEX and .m frontend code; Makefile generation.
Gibbs Sampling. Two methods for sequential consistency. Scopes: edge scope, graphlab(gibbs, edge, sweep); Scheduling: graph coloring, graphlab(gibbs, vertex, colored); (Figure: CPU 1, CPU 2, CPU 3; t0, t1, t2, t3.) Our next problem is Gibbs sampling, and here we will demonstrate two different methods …… Parallel Gibbs sampling requires that neighboring vertices are not sampled at the same time. If you recall, this is naturally expressed using the edge consistency model. This means that I can use any scheduler on this problem and I will have a correct Gibbs sampler. A different method of achieving the same constraint is to color the graph such that adjacent vertices have different colors, and to use a scheduler that only schedules the colors in sequence. This allows me to use the weakest model, vertex consistency, since the scheduler provides the guarantees for me.
Gibbs Sampling. Protein-protein interaction networks [Elidan et al. 2006]. Pairwise MRF: 14K vertices, 100K edges. 10x speedup. Scheduling reduces locking overhead. (Plot labels: Better, Optimal, Colored Schedule, Round robin schedule.) We demonstrate this on a protein-protein interaction network, which is a pairwise MRF……. Using the round-robin scheduler with the edge consistency model provides a good 8x speedup on 16 processors, while with the colored scheduler we attain nearly a 10x speedup due to the decreased locking overhead.
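A minimal sketch of the coloring idea described above, in plain Python. The talk does not say how the graph was colored; a simple greedy coloring is assumed here, and greedy_coloring and colored_schedule are hypothetical helper names, not GraphLab functions.

```python
def greedy_coloring(neighbors):
    """Assign each vertex the smallest color not used by a colored neighbor."""
    colors = {}
    for v in sorted(neighbors):
        used = {colors[u] for u in neighbors[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors


def colored_schedule(colors):
    """Group vertices by color; the groups are processed one after another."""
    groups = {}
    for v, c in colors.items():
        groups.setdefault(c, []).append(v)
    return [groups[c] for c in sorted(groups)]


# A 4-cycle: 0 - 1 - 2 - 3 - 0.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
colors = greedy_coloring(ring)
phases = colored_schedule(colors)  # [[0, 2], [1, 3]] for this ring
```

Vertices inside one phase share no edge, so they could be sampled in parallel without edge locks, which is the guarantee the colored scheduler is said to provide.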
Lasso. L1-regularized linear regression. Shooting algorithm (coordinate descent). Due to the properties of the update, full consistency is needed. The lasso task is to perform L1-regularized linear regression. The algorithm we used is the shooting algorithm, which is basically coordinate descent. This allows us to represent the problem as a bipartite graph in GraphLab by treating X as an adjacency matrix. Due to the properties of the shooting algorithm, the full consistency scope is necessary. We can see why this is the case with the example here.
Lasso. L1-regularized linear regression. Shooting algorithm (coordinate descent). Due to the properties of the update, full consistency is needed. The shooting algorithm allows me to descend on the purple vertices simultaneously, since they do not interact through any of the Y's.
Lasso. L1-regularized linear regression. Shooting algorithm (coordinate descent). Due to the properties of the update, full consistency is needed. But I cannot perform independent coordinate descent on \beta2 and \beta3 at the same time, since they interact through Y2 and Y3. Task: predict the 12-month volatility of a stock from the company's annual reports. Finance dataset from Kogan et al. [2009]. We derived a sparse dataset and a dense dataset from this data. With full consistency, as expected, performance is poor on the denser model.
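For reference, a sequential sketch of the shooting algorithm in plain Python. It assumes the objective 0.5 * ||y - X beta||^2 + lam * ||beta||_1, which the slides do not spell out, and the names shooting_lasso and share_observation are hypothetical. The second helper checks whether two coefficients touch a common observation, which is the interaction through the Y's described above.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam (the L1 proximal step)."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def shooting_lasso(X, y, lam, sweeps=100):
    """Cyclic coordinate descent on 0.5*||y - X b||^2 + lam*||b||_1."""
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    residual = list(y)  # equals y - X beta while beta is all zeros
    for _ in range(sweeps):
        for j in range(d):
            col = [X[i][j] for i in range(n)]
            z = sum(x * x for x in col)
            if z == 0.0:
                continue  # empty column: coefficient stays at zero
            # Correlation of column j with the residual that excludes beta[j].
            rho = sum(x * (r + x * beta[j]) for x, r in zip(col, residual))
            new_value = soft_threshold(rho, lam) / z
            delta = new_value - beta[j]
            residual = [r - x * delta for x, r in zip(col, residual)]
            beta[j] = new_value
    return beta


def share_observation(X, j, k):
    """True if coefficients j and k both touch some observation (row)."""
    return any(row[j] != 0 and row[k] != 0 for row in X)


X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.5]
beta = shooting_lasso(X, y, lam=1.0)  # [2.0, 0.0] for this toy problem
```

In this toy problem the two columns share no row, so the two coordinates could be updated at the same time; columns that do share a row are the case in which independent updates can interfere.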
Full Consistency. (Plot labels: Better, Optimal, Sparse, Dense.) Now for some results. With the full consistency scope, we obtain a poor 2.1x speedup on 16 processors on the dense dataset. We do better on the sparser dataset, attaining a 4x speedup. This is expected, as the full consistency model has far fewer opportunities for parallelism on the denser model.
Relaxing Consistency. Why does this work? (Open question.) (Plot labels: Better, Optimal, Dense, Sparse.) Now, however, here is an interesting observation we made. What if we relax the consistency to the weakest model, vertex consistency? To our surprise, it still converges to the same loss, and with a significant increase in performance: about a 9x speedup on 16 processors on the dense dataset, and a 5x speedup on the sparse dataset. Just as a comparison, here are the curves when the full consistency model is used. It is, however, still an open question why the system still converges in this setting.
Did you compare against ___ Lasso? We had trouble getting the standard Lasso implementations to run on the dataset, but we would love to try it out.
Comparing against Hadoop. Hadoop is the implementation currently available to Tom Mitchell's group, whom we are assisting; it is what they are using now. Demonstrate that Hadoop (though popular) is not necessarily the best framework for your problem.
DAG Abstraction Computation represented as a Directed Acyclic Graph. Vertices represent “programs”. A program starts when data is received on all incoming edges, and outputs data on its outgoing edges. - Introduce “vertex” function
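The firing rule on this slide can be sketched in a few lines of Python. This is a sequential toy, not any particular dataflow system: run_dag is a hypothetical name, each vertex program is assumed to send the same output along all of its outgoing edges, and at most one edge is assumed between any pair of vertices.

```python
from collections import deque


def run_dag(programs, edges):
    """Run each vertex program once all of its incoming edges carry data.

    programs: dict mapping vertex name -> function(list_of_inputs) -> output
    edges:    list of (source, destination) pairs forming a DAG
    """
    preds = {v: [] for v in programs}
    succs = {v: [] for v in programs}
    for src, dst in edges:
        preds[dst].append(src)
        succs[src].append(dst)

    received = {v: {} for v in programs}
    # Vertices with no incoming edges can start immediately.
    ready = deque(v for v in programs if not preds[v])
    outputs = {}
    while ready:
        v = ready.popleft()
        inputs = [received[v][p] for p in preds[v]]
        outputs[v] = programs[v](inputs)
        for dst in succs[v]:
            received[dst][v] = outputs[v]
            # A program starts only when data has arrived on all in-edges.
            if len(received[dst]) == len(preds[dst]):
                ready.append(dst)
    return outputs


programs = {
    "a": lambda inputs: 2,
    "b": lambda inputs: 3,
    "c": lambda inputs: inputs[0] * inputs[1],
}
outputs = run_dag(programs, [("a", "c"), ("b", "c")])  # outputs["c"] == 6
```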
Clocked Systolic Array Processors are arranged in a directed graph. All processors compute, then transmit outgoing messages in sync. - “Clocked” systolic array - “communication” centric parallelism can be hard for people used to thinking about a program as a single piece of state.
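A toy simulation of the clocked behaviour described on this slide, in Python. The function name run_systolic and the compute signature are assumptions made for illustration; the point is only the two phases per tick: every processor computes from the messages of the previous tick, then all messages are transmitted together.

```python
def run_systolic(initial, edges, compute, ticks):
    """Simulate a clocked systolic array for a fixed number of ticks.

    initial: dict mapping processor -> starting state
    edges:   list of (source, destination) pairs
    compute: function(state, incoming_messages) -> (new_state, outgoing_message)
    """
    state = dict(initial)
    inbox = {v: [] for v in state}
    for _ in range(ticks):
        # Phase 1: all processors compute, seeing only last tick's messages.
        outgoing = {}
        for v in state:
            state[v], outgoing[v] = compute(state[v], inbox[v])
        # Phase 2: all outgoing messages are transmitted in sync.
        inbox = {v: [] for v in state}
        for src, dst in edges:
            inbox[dst].append(outgoing[src])
    return state


def accumulate(value, messages):
    """Add the incoming messages to the local value and forward the result."""
    new_value = value + sum(messages)
    return new_value, new_value


# Chain 0 -> 1 -> 2 with a single unit of data at processor 0.
final = run_systolic({0: 1, 1: 0, 2: 0}, [(0, 1), (1, 2)], accumulate, ticks=3)
```

Because messages sent in one tick are only visible in the next, data needs one tick per hop to travel down the chain, which is the communication-centric view the slide mentions.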
Read-Write Race. An update function reads from in-edges and writes to out-edges. A read-write race occurs if it writes while an adjacent update function is reading.
Compressed Sensing. Represent an image as a sparse linear combination of basis functions. Minimize reconstruction error. Interior point method with Gaussian BP as the linear solver. 50% random wavelet projection. The compressed sensing task: an image is represented…… We use an interior point method in which GraphLab runs Gaussian BP as a linear solver in an inner loop. We ran it on the well-known Lena image on the left using 50% random wavelet projections. The reconstructed image is in the center. Even though GraphLab is only used as an inner loop of a larger algorithm, we still obtained excellent speedups in this setting.
Graphical Model Learning. 3D retinal image denoising. Data graph: 256x64x64. Update function: belief propagation. Shared data: edge potentials. Gradient step computed with Sync. (Plot labels: Optimal, Better, Better, Runtime, Inference, Gradient Step.) An example to demonstrate use of the complete GraphLab pipeline. 3D retina images are acquired by firing a laser beam into the eyeball and measuring the density at each voxel. The resulting images are, however, quite noisy, and the aim is to use belief propagation to denoise the image. Parameter learning is used to estimate the edge potentials. …..[desc alg]. On the left, we plot speedup against the number of processors for a variety of schedulers. We obtain nearly linear speedup: between 13x and 15x on 16 processors. A typical parameter learning setup requires the gradient step to be computed only after inference is complete. However, using GraphLab, we can easily experiment with simultaneous parameter learning and inference: taking gradient steps while the inference algorithm is still running. As we can see, computing the gradient steps frequently results in significantly faster convergence, but increases the deviation from the parameters learned the normal way. The deviation is quite small, though, and could be acceptable for a significant decrease in runtime.