Danny Bickson Parallel Machine Learning for Large-Scale Graphs The GraphLab Team: Joe Hellerstein Alex Smola Yucheng Low Joseph Gonzalez Aapo Kyrola Jay Gu Carlos Guestrin
Parallelism is Difficult. There is a wide array of different parallel architectures: GPUs, multicore, clusters, clouds, supercomputers, and each presents different challenges. High-level abstractions make things easier.
How will we design and implement parallel learning systems?
A popular answer: build learning algorithms on top of high-level parallel abstractions, such as Map-Reduce / Hadoop.
Map-Reduce for Data-Parallel ML: excellent for large data-parallel tasks! Data-Parallel (Map-Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics. Graph-Parallel: Belief Propagation, Label Propagation, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso.
Example of Graph Parallelism
PageRank Example. Iterate: R[i] = α + (1 − α) · Σ_{j ∈ in-neighbors(i)} R[j] / L[j], where α is the random reset probability and L[j] is the number of links on page j. [figure: example web graph with pages 1 through 6]
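For concreteness, here is a minimal sequential sketch of this iteration in plain C++. It is illustrative only: the tiny example graph, ALPHA, EPSILON, and the convergence test are assumptions added here, not part of the original slide.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  const double ALPHA = 0.15, EPSILON = 1e-6;
  // in_nbrs[i] lists the pages that link to page i; out_degree[j] is L[j].
  std::vector<std::vector<int>> in_nbrs = {{1, 2}, {0, 3}, {0, 1, 3}, {2}};
  std::vector<int> out_degree = {2, 2, 2, 2};
  std::vector<double> rank(in_nbrs.size(), 1.0);

  double max_change = 1.0;
  while (max_change > EPSILON) {          // iterate until ranks stop changing
    max_change = 0.0;
    for (std::size_t i = 0; i < rank.size(); ++i) {
      double sum = 0.0;
      for (int j : in_nbrs[i])            // sum incoming rank, scaled by L[j]
        sum += rank[j] / out_degree[j];
      double new_rank = ALPHA + (1 - ALPHA) * sum;
      max_change = std::max(max_change, std::fabs(new_rank - rank[i]));
      rank[i] = new_rank;
    }
  }
  for (std::size_t i = 0; i < rank.size(); ++i)
    std::printf("page %zu rank %.4f\n", i, rank[i]);
  return 0;
}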
Properties of Graph-Parallel Algorithms: a dependency graph, local updates (my rank depends on my friends' ranks), and iterative computation.
Addressing Graph-Parallel ML: we need alternatives to Map-Reduce. Data-Parallel (Map-Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics. Graph-Parallel (Map-Reduce? Pregel/Giraph?): Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso.
Pregel (Giraph): the Bulk Synchronous Parallel model alternates Compute and Communicate phases, separated by a global Barrier.
PageRank in Giraph (Pregel):
bsp_page_rank() {
  // Sum PageRank over incoming messages
  sum = 0
  forall (message in in_messages())
    sum = sum + message
  rank = ALPHA + (1 - ALPHA) * sum
  set_vertex_value(rank)
  // Send new messages to neighbors or terminate
  if (current_super_step() < MAX_STEPS) {
    nedges = num_out_edges()
    forall (neighbor in out_neighbors())
      send_message(rank / nedges)
  } else {
    vote_to_halt()
  }
}
Problem: bulk synchronous computation can be highly inefficient.
BSP Systems Problem: Curse of the Slow Job. [figure: each iteration ends at a barrier, so every CPU waits for the slowest one before the next iteration can begin]
The Need for a New Abstraction: if not Pregel, then what? Data-Parallel (Map-Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics. Graph-Parallel (Pregel/Giraph): Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso.
The GraphLab Solution: designed specifically for ML needs. It expresses data dependencies and iterative computation, and simplifies the design of parallel programs by abstracting away hardware issues and providing automatic data synchronization. It addresses multiple hardware architectures: multicore, distributed, and cloud computing, with a GPU implementation in progress.
What is GraphLab?
The GraphLab Framework: Graph-Based Data Representation, Update Functions (user computation), Scheduler, and Consistency Model.
Data Graph: a graph with arbitrary data (C++ objects) associated with each vertex and edge. Example graph: a social network. Vertex data: user profile text, current interest estimates. Edge data: similarity weights.
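A minimal sketch of what this looks like in code. The struct fields mirror the social-network example above, and the graph type and add_vertex/add_edge calls follow the style of GraphLab examples rather than quoting an exact API:

#include <graphlab.hpp>   // assumed GraphLab header
#include <string>
#include <vector>

// Arbitrary C++ objects can be attached to vertices and edges.
struct vertex_data {
  std::string profile_text;           // user profile text
  std::vector<double> interests;      // current interest estimates
};

struct edge_data {
  double similarity;                  // similarity weight between two users
};

// GraphLab-style graph type parameterized by the vertex and edge data types.
typedef graphlab::graph<vertex_data, edge_data> graph_type;

void build_graph(graph_type& graph) {
  vertex_data alice, bob;
  alice.profile_text = "likes hiking";
  bob.profile_text   = "likes climbing";
  graphlab::vertex_id_type a = graph.add_vertex(alice);
  graphlab::vertex_id_type b = graph.add_vertex(bob);
  edge_data e; e.similarity = 0.8;
  graph.add_edge(a, b, e);            // directed edge a -> b carrying edge data
}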
Update Functions: an update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.
pagerank(i, scope) {
  // Get neighborhood data (R[i], w_ij, R[j]) from the scope
  // Update the vertex data using the PageRank equation above
  // Reschedule neighbors if needed (dynamic computation)
  if R[i] changes then reschedule_neighbors_of(i)
}
PageRank in GraphLab:
GraphLab_pagerank(scope) {
  sum = 0
  forall (nbr in scope.in_neighbors())
    sum = sum + nbr.value() / nbr.num_out_edges()
  old_rank = scope.vertex_data()
  scope.center_value() = ALPHA + (1 - ALPHA) * sum
  double residual = abs(scope.center_value() - old_rank)
  if (residual > EPSILON)
    reschedule_out_neighbors()
}
Actual GraphLab2 Code! PageRank in GraphLab2:
struct pagerank : public iupdate_functor<graph, pagerank> {
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    double sum = 0;
    foreach (edge_type edge, context.in_edges())
      sum += 1.0 / context.num_out_edges(edge.source()) *
             context.vertex_data(edge.source()).rank;
    double old_rank = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * sum;
    double residual = fabs(vdata.rank - old_rank) / context.num_out_edges();
    if (residual > EPSILON)
      context.reschedule_out_neighbors(pagerank());
  }
};
The Scheduler: the scheduler determines the order in which vertices are updated. [figure: CPUs pull vertices (a, b, c, ...) from the scheduler queue and apply update functions; updates may push new vertices back onto the queue] The process repeats until the scheduler is empty.
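As a rough sketch of how the pieces fit together, the driver below follows the pattern used in GraphLab examples; the exact method names, scheduler string, and scope string are assumptions rather than a quoted API:

#include <graphlab.hpp>   // assumed GraphLab header

int main(int argc, char** argv) {
  // graph_type and pagerank refer to the types defined in the earlier sketches.
  graphlab::core<graph_type, pagerank> core;
  // ... load or build core.graph() here ...
  core.set_scheduler_type("fifo");    // order in which vertices are updated
  core.set_scope_type("edge");        // consistency model (discussed below)
  core.schedule_all(pagerank());      // put every vertex on the scheduler
  core.start();                       // run until the scheduler is empty
  return 0;
}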
The GraphLab Framework: Graph-Based Data Representation, Update Functions (user computation), Scheduler, and Consistency Model.
Ensuring Race-Free Code How much can computation overlap?
Need for Consistency? No consistency gives higher throughput (#updates/sec) but potentially slower convergence of the ML algorithm.
Inconsistent vs. consistent ALS (full Netflix data, 8 cores). [plot: the inconsistent run produces bad intermediate results on highly connected movies, while the consistent run converges smoothly]
Even Simple PageRank can be Dangerous:
GraphLab_pagerank(scope) {
  ref sum = scope.center_value   // sum aliases the shared vertex value
  sum = 0
  forall (neighbor in scope.in_neighbors())
    sum = sum + neighbor.value / neighbor.num_out_edges()
  sum = ALPHA + (1 - ALPHA) * sum
  ...
Inconsistent PageRank (8 cores). [plot: error over time relative to the point of convergence]
Even Simple PageRank can be Dangerous:
GraphLab_pagerank(scope) {
  ref sum = scope.center_value
  sum = 0
  forall (neighbor in scope.in_neighbors())
    sum = sum + neighbor.value / neighbor.num_out_edges()
  sum = ALPHA + (1 - ALPHA) * sum
  ...
Read-write race: CPU 1 reads a bad PageRank estimate while CPU 2 is still computing the value.
Race Condition Can Be Very Subtle.
Unstable version (accumulates directly into the shared vertex value):
GraphLab_pagerank(scope) {
  ref sum = scope.center_value
  sum = 0
  forall (neighbor in scope.in_neighbors())
    sum = sum + neighbor.value / neighbor.num_out_edges()
  sum = ALPHA + (1 - ALPHA) * sum
  ...
Stable version (accumulates into a local temporary, then writes once):
GraphLab_pagerank(scope) {
  sum = 0
  forall (neighbor in scope.in_neighbors())
    sum = sum + neighbor.value / neighbor.num_out_edges()
  sum = ALPHA + (1 - ALPHA) * sum
  scope.center_value = sum
  ...
This was actually encountered in user code.
GraphLab Ensures Sequential Consistency: for each parallel execution, there exists a sequential execution of update functions which produces the same result. [figure: a parallel schedule on CPU 1 and CPU 2 is equivalent to some sequential schedule on a single CPU]
Consistency Rules: [figure: update scopes and the data they may read or write] Guaranteed sequential consistency for all update functions.
Full Consistency: the update function gets read/write access to its entire scope, so the scopes of concurrently running updates may not overlap. [figure: non-overlapping full-consistency scopes]
Obtaining More Parallelism: edge consistency allows more updates to run concurrently than full consistency. [figure: full-consistency vs. edge-consistency scopes]
Edge Consistency: the update function gets write access to the center vertex and its adjacent edges, and only read access to adjacent vertices, so updates on non-adjacent vertices can run in parallel. [figure: CPU 1 and CPU 2 performing safe reads on shared neighbors]
The GraphLab Framework: Graph-Based Data Representation, Update Functions (user computation), Scheduler, and Consistency Model.
What algorithms are implemented in GraphLab?
Algorithms implemented in GraphLab: Alternating Least Squares, SVD, Splash Sampler, CoEM, Bayesian Tensor Factorization, Lasso, Belief Propagation, PageRank, LDA, SVM, Gibbs Sampling, Dynamic Block Gibbs Sampling, K-Means, Matrix Factorization, Linear Solvers, ...many others...
GraphLab Libraries.
Matrix factorization: SVD, PMF, BPTF, ALS, NMF, Sparse ALS, Weighted ALS, SVD++, time-SVD++, SGD.
Linear solvers: Jacobi, GaBP, Shotgun Lasso, sparse logistic regression, CG.
Clustering: K-means, Fuzzy K-means, LDA, K-core decomposition.
Inference: Discrete BP, NBP, Kernel BP.
Efficient Multicore Collaborative Filtering. LeBuSiShu team, 5th place in track1. Yao Wu, Qiang Yan, Qing Yang (Institute of Automation, Chinese Academy of Sciences); Danny Bickson, Yucheng Low (Machine Learning Dept, Carnegie Mellon University). ACM KDD CUP Workshop 2011.
ACM KDD CUP 2011. Task: predict music ratings. Two main challenges: data magnitude (260M ratings) and the taxonomy of the data.
Data taxonomy
Our approach: use an ensemble method, plus a custom SGD algorithm for handling the taxonomy.
Ensemble method: the individual solutions are merged using linear regression.
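In equation form, a standard way to merge the individual solutions with linear regression is to fit blending weights on a held-out validation set. This is our own illustration of the general technique; the exact setup used by the team may differ:

\[
\hat r_{ui} = \sum_{k=1}^{K} w_k\, \hat r^{(k)}_{ui},
\qquad
\mathbf{w} = \arg\min_{\mathbf{w}} \sum_{(u,i)\in\mathcal{V}} \Big( r_{ui} - \sum_{k=1}^{K} w_k\, \hat r^{(k)}_{ui} \Big)^2
= (P^\top P)^{-1} P^\top \mathbf{r},
\]

where \(\hat r^{(k)}_{ui}\) is the prediction of the k-th algorithm, \(P\) stacks these predictions over the validation set \(\mathcal{V}\), and \(\mathbf{r}\) is the vector of true ratings.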
Performance results. Blended validation RMSE: 19.90. [plot: RMSE (root mean squared error) of the different methods] Note that time-MFITR performs very well, second only to time-SVD++.
Classical Matrix Factorization: [figure: the sparse users-by-items rating matrix is approximated by user and item feature vectors of dimension d]
MFITR: our novel method for coping with the characteristics of the KDD data, namely the hierarchy of track, album, artist, and genre. [figure: the sparse rating matrix with item-specific features plus artist and album features combining into an "effective feature of an item" of dimension d] The predicted rating r_ui between user u and item i follows a linear prediction rule: mu is the model mean; b_i, b_u, b_a are item, user, and artist biases learned from the data; q_i, q_a, p_u are feature vectors learned from the data. In addition to the standard factorization into user and item feature vectors, we add an artist feature vector q_a.
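Putting the pieces above together, a plausible reconstruction of the prediction rule (our reading of the slide notes; the exact formula is in the KDD CUP workshop paper) is:

\[
\hat r_{ui} = \mu + b_u + b_i + b_a + p_u^\top \left( q_i + q_a \right),
\]

where \(q_i + q_a\) plays the role of the "effective feature of an item", combining item-specific and artist features.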
Penalty terms ensure that Artist/Album/Track features are "close". Intuitively, the features of an artist and the features of his/her album should be similar. How do we express this? We add penalty terms that pull the artist, album, and track features towards each other; the strength of each penalty depends on a "normalized rating similarity" (see the neighborhood model).
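As a hedged sketch of what such a penalty can look like (illustrative form only; the similarity weights s(.,.) and the regularization strength lambda are assumptions here):

\[
\lambda \sum_{\text{track } t} s\bigl(t, \mathrm{album}(t)\bigr)\, \bigl\| q_t - q_{\mathrm{album}(t)} \bigr\|^2
\;+\;
\lambda \sum_{\text{album } a} s\bigl(a, \mathrm{artist}(a)\bigr)\, \bigl\| q_a - q_{\mathrm{artist}(a)} \bigr\|^2 .
\]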
Fine Tuning Challenge: the dataset has around 260M observed ratings, and there are 12 different algorithms with a total of 53 tunable parameters. How do we train and cross-validate all these parameters? Use GraphLab!
16-Core Runtime. [plot: runtime of the different algorithms] While SGD is very fast, it has worse speedup relative to ALS.
Speedup plots. Alternating-least-squares style algorithms scale very well: once one subset of nodes (users or movies) is fixed, all the other nodes (movies/users) can be updated in parallel. SGD and SVD++ scale less well, since whenever two users have rated the same movie they need to update that movie's feature vector at the same time.
Who is using GraphLab?
Universities using GraphLab
Companies trying out GraphLab, startups using GraphLab. 2400+ unique downloads tracked (possibly many more from direct repository checkouts).
User community
Performance results
GraphLab vs. Pregel (BSP): Multicore PageRank (25M vertices, 355M edges). [plot: GraphLab vs. Pregel emulated via GraphLab] 51% of the vertices were updated only once.
CoEM (Rosie Jones, 2005): a Named Entity Recognition task. Is "Dog" an animal? Is "Catalina" a place? [figure: bipartite graph between noun phrases ("the dog", "Australia", "Catalina Island") and contexts ("<X> ran quickly", "travelled to <X>", "<X> is pleasant")] Vertices: 2 million; edges: 200 million. Hadoop: 95 cores, 7.5 hrs.
CoEM (Rosie Jones, 2005): Hadoop, 95 cores: 7.5 hrs. GraphLab, 16 cores: 30 min. 15x faster with 6x fewer CPUs! [plot: GraphLab CoEM speedup vs. number of cores, against the optimal line]
GraphLab in the Cloud
CoEM (Rosie Jones, 2005): Hadoop, 95 cores: 7.5 hrs. GraphLab, 16 cores: 30 min. GraphLab in the Cloud, 32 EC2 machines: 80 secs, i.e. 0.3% of the Hadoop time. [plot: speedup on the small and large problems]
Cost-Time Tradeoff (video co-segmentation results): a few machines help a lot (much faster), then diminishing returns set in: more machines, higher cost.
Netflix Collaborative Filtering: Alternating Least Squares matrix factorization. Model: 0.5 million nodes, 99 million edges (users by movies). [plots: speedup vs. ideal for latent dimension D=20 and D=100; runtime of Hadoop, MPI, and GraphLab as a function of D]
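For reference, the standard ALS matrix-factorization objective being solved here (generic form, not copied from the slide) is:

\[
\min_{W, X} \sum_{(u,v) \in \text{ratings}} \bigl( r_{uv} - w_u^\top x_v \bigr)^2
\;+\; \lambda \Bigl( \sum_u \|w_u\|^2 + \sum_v \|x_v\|^2 \Bigr),
\]

where \(w_u, x_v \in \mathbb{R}^D\) are the user and movie factor vectors and \(D\) is the latent dimension varied in the plots.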
Multicore Abstraction Comparison: Netflix Matrix Factorization. Dynamic computation, faster convergence.
The Cost of Hadoop
Fault Tolerance
Fault Tolerance: larger problems mean an increased chance of machine failure. GraphLab2 introduces two fault-tolerance (checkpointing) mechanisms: synchronous snapshots and Chandy-Lamport asynchronous snapshots.
Synchronous Snapshots: [figure: execution alternates over time between "Run GraphLab" phases and "Barrier + Snapshot" phases on all machines]
Curse of the slow machine: [plot: progress over time, no snapshot vs. synchronous snapshot]
Curse of the Slow Machine: [figure: one slow machine delays the "Barrier + Snapshot" phase, stalling all other machines]
Curse of the slow machine: [plot: progress over time, no snapshot vs. synchronous snapshot vs. delayed synchronous snapshot]
Asynchronous Snapshots: the Chandy-Lamport algorithm is implementable as a GraphLab update function! Requires edge consistency.
struct chandy_lamport {
  void operator()(icontext_type& context) {
    save(context.vertex_data());
    foreach (edge_type edge, context.in_edges()) {
      if (edge.source() was not marked as saved) {
        save(context.edge_data(edge));
        context.schedule(edge.source(), chandy_lamport());
      }
    }
    // ... repeat for context.out_edges()
    // mark context.vertex() as saved
  }
};
Snapshot Performance: [plot: progress over time, no snapshot vs. synchronous snapshot vs. asynchronous snapshot]
Snapshot with 15s fault injection (halt 1 out of 16 machines for 15s): [plot: no snapshot vs. synchronous snapshot vs. asynchronous snapshot]
New challenges
Natural Graphs Follow a Power Law. Yahoo! Web Graph: 1.4B vertices, 6.7B edges. The top 1% of vertices is adjacent to 53% of the edges! [plot: power-law degree distribution]
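The "power law" here refers to the degree distribution (standard definition, added for clarity):

\[
\mathbb{P}(\text{degree} = d) \propto d^{-\alpha},
\]

i.e. most vertices have small degree while a few vertices (popular pages) have extremely high degree.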
Problem: High-Degree Vertices. High-degree vertices limit parallelism: they touch a large amount of state, require heavy locking, and are processed sequentially.
High Communication in Distributed Updates: when a vertex Y's neighborhood spans Machine 1 and Machine 2, data from the neighbors is transmitted separately across the network. [figure]
High-Degree Vertices are Common: popular movies in the Netflix users-movies graph, "social" people (e.g. Obama), and common words and hyperparameters in the LDA docs-words graph. [figures: Netflix bipartite graph; LDA plate model with variables theta, z, w, alpha, beta]
Two Core Changes to the Abstraction. (1) Factorized update functors: monolithic updates are decomposed into Gather, Apply, and Scatter phases. (2) Delta update functors: monolithic updates become composable update "messages", so f1 and f2 compose into (f1 o f2).
PageRank in GraphLab2, annotated by phase:
struct pagerank : public iupdate_functor<graph, pagerank> {
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    // Parallel "sum" gather
    double sum = 0;
    foreach (edge_type edge, context.in_edges())
      sum += 1.0 / context.num_out_edges(edge.source()) *
             context.vertex_data(edge.source()).rank;
    // Atomic single-vertex apply
    double old_rank = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * sum;
    double residual = fabs(vdata.rank - old_rank) / context.num_out_edges();
    // Parallel scatter [reschedule]
    if (residual > EPSILON)
      context.reschedule_out_neighbors(pagerank());
  }
};
Decomposable Update Functors: locks are acquired only for the region within a scope (relaxed consistency). User-defined Gather(edge) -> Delta: computed on each in-edge and combined with a parallel sum, Delta_1 + Delta_2 + ... + Delta_n. User-defined Apply(vertex, Delta): applies the accumulated value to the center vertex. User-defined Scatter(edge): updates adjacent edges and vertices.
Factorized PageRank:
double gather(scope, edge) {
  return edge.source().value().rank / scope.num_out_edge(edge.source())
}
double merge(acc1, acc2) { return acc1 + acc2 }
void apply(scope, accum) {
  old_value = scope.center_value().rank
  scope.center_value().rank = ALPHA + (1 - ALPHA) * accum
  scope.center_value().residual = abs(scope.center_value().rank - old_value)
}
void scatter(scope, edge) {
  if (scope.center_value().residual > EPSILON)
    reschedule(edge.target())
}
Factorized Updates: Significant Decrease in Communication. Gather and scatter are split across machines: each machine computes a partial gather locally (F1, F2) and only the small combined result (F1 o F2) is transmitted over the network. [figure]
Factorized Consistency: neighboring vertices may be updated simultaneously while both are in the Gather phase. [figure: vertices A and B gathering concurrently]
Factorized Consistency Locking: a Gather on an edge cannot occur while an Apply is running on that edge's endpoint. Vertex B gathers on its other neighbors while A is performing its Apply. [figure]
Factorized PageRank in GraphLab2:
struct pagerank : public iupdate_functor<graph, pagerank> {
  double accum = 0, residual = 0;
  void gather(icontext_type& context, const edge_type& edge) {
    accum += 1.0 / context.num_out_edges(edge.source()) *
             context.vertex_data(edge.source()).rank;
  }
  void merge(const pagerank& other) { accum += other.accum; }
  void apply(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    double old_value = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * accum;
    residual = fabs(vdata.rank - old_value) / context.num_out_edges();
  }
  void scatter(icontext_type& context, const edge_type& edge) {
    if (residual > EPSILON)
      context.schedule(edge.target(), pagerank());
  }
};
Decomposable Loopy Belief Propagation: Gather accumulates the product of incoming messages; Apply updates the central belief; Scatter computes outgoing messages and schedules adjacent vertices.
Decomposable Alternating Least Squares (ALS): [figure: the Netflix users-by-movies rating matrix is approximated by user factors W and movie factors X] Update function: Gather sums the per-rating terms; Apply performs a matrix inversion and multiply.
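Concretely, the standard ALS user update that this Gather/Apply split computes (generic ALS formula, added for clarity) is:

\[
w_u \leftarrow \Bigl( \sum_{v \in \mathrm{rated}(u)} x_v x_v^\top + \lambda I \Bigr)^{-1} \sum_{v \in \mathrm{rated}(u)} r_{uv}\, x_v ,
\]

where Gather accumulates the sums \(\sum_v x_v x_v^\top\) and \(\sum_v r_{uv} x_v\) over the user's edges, and Apply performs the matrix inversion and multiply (and symmetrically for the movie factors \(x_v\)).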
Decomposable Functors fit many algorithms: Loopy Belief Propagation, Label Propagation, PageRank, and more. They address the earlier concerns: large state becomes distributed gather and scatter; heavy locking becomes fine-grained locking; sequential processing becomes parallel gather and scatter.
Comparison of Abstractions: Multicore PageRank (25M vertices, 355M edges). [plot: GraphLab1 vs. factorized updates]
Need for Vertex-Level Asynchrony: a single changed neighbor forces a costly re-gather over all of Y's neighbors. Idea: exploit the commutative-associative "sum".
Commutative-Associative Vertex-Level Asynchrony, leading to Delta Updates: because the gather is a commutative-associative sum, a neighbor that changes can simply send a delta, which is added to the old (cached) sum at Y instead of re-gathering from every neighbor. [figures: build-up from full re-gather to cached sum plus delta]
Delta Update. The program starts with: schedule_all(ALPHA).
void update(scope, delta) {
  scope.center_value() = scope.center_value() + delta
  if (abs(delta) > EPSILON) {
    out_delta = delta * (1 - ALPHA) / scope.num_out_edges()
    reschedule_out_neighbors(out_delta)
  }
}
double merge(delta1, delta2) { return delta1 + delta2 }
Scheduling Composes Updates: calling reschedule_out_neighbors(pagerank(3)) forces update-function composition: a neighbor with a pending pagerank(7) now has a pending pagerank(10), while a neighbor with nothing pending gets a pending pagerank(3). [figure]
Multicore Abstraction Comparison: Multicore PageRank (25M vertices, 355M edges). [plot]
Distributed Abstraction Comparison: Distributed PageRank (25M vertices, 355M edges). [plot: GraphLab1 vs. GraphLab2 with delta updates]
PageRank on the AltaVista Webgraph 2002: 1.4B vertices, 6.7B edges. Hadoop: 800 cores. Prototype GraphLab2: 431s on 512 cores (known inefficiencies; a 2x gain is possible).
Summary of GraphLab2. Decomposed update functions (Gather, Apply, Scatter) expose parallelism in high-degree vertices. Delta update functions expose asynchrony in high-degree vertices.
Lessons Learned.
Machine Learning: asynchronous execution is often much faster than synchronous; dynamic computation is often faster, but it can be difficult to define optimal thresholds; consistency can improve performance and is sometimes required for convergence, though there are cases where relaxed consistency is sufficient (further assumptions on the update functions are needed); high-degree vertices and natural graphs can limit parallelism.
System: distributed asynchronous systems are harder to build, but no distributed barriers means better scalability and performance; scaling up by an order of magnitude requires rethinking design assumptions, e.g. the distributed graph representation; there is science left to do!
Summary: an abstraction tailored to machine learning. It targets graph-parallel algorithms, naturally expresses data/computational dependencies and dynamic iterative computation, simplifies parallel algorithm design, automatically ensures data consistency, and achieves state-of-the-art parallel performance on a variety of problems.
Parallel GraphLab 1.1 (multicore) is available today; GraphLab2 (in the Cloud) is coming soon. Documentation, code, and tutorials: http://graphlab.org