Presentation transcript:

Distributed Graph-Parallel Computation on Natural Graphs
The Team: Joseph Gonzalez, Yucheng Low, Aapo Kyrola, Danny Bickson, Haijie Gu, Carlos Guestrin, Joe Hellerstein, Alex Smola

Big-Learning: how will we design and implement parallel learning systems?

The popular answer: Map-Reduce / Hadoop. Build learning algorithms on top of high-level parallel abstractions.

Map-Reduce for Data-Parallel ML. Map-Reduce is excellent for large data-parallel tasks:
– Data-Parallel (Map-Reduce): feature extraction, cross validation, computing sufficient statistics
– Graph-Parallel: graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), graph analysis (PageRank, triangle counting), collaborative filtering (tensor factorization)

Label Propagation (Social Arithmetic). Recurrence algorithm: my interests are a weighted combination of my own profile and my friends' interests; iterate until convergence. Parallelism: compute all Likes[i] in parallel.
Example: I trust 50% what I list on my profile (50% cameras, 50% biking), 40% what Sue Ann likes (80% cameras, 20% biking), and 10% what Carlos likes (30% cameras, 70% biking). Result: I like 60% cameras, 40% biking.
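The arithmetic can be checked directly; here is a small plain-Python version (the weights and interest profiles are copied from the slide, the topic names are just labels):

    # Social arithmetic: my interests are a weighted mix of my profile and
    # my friends' interests (numbers taken from the slide).
    profile = {"cameras": 0.5, "biking": 0.5}   # 50%: what I list on my profile
    sue_ann = {"cameras": 0.8, "biking": 0.2}   # 40%: what Sue Ann likes
    carlos  = {"cameras": 0.3, "biking": 0.7}   # 10%: what Carlos likes
    sources = [(0.5, profile), (0.4, sue_ann), (0.1, carlos)]

    likes = {topic: sum(w * src[topic] for w, src in sources) for topic in profile}
    print(likes)  # ~ {'cameras': 0.6, 'biking': 0.4} -> 60% cameras, 40% biking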

Properties of Graph-Parallel Algorithms: a dependency graph (my interests depend on my friends' interests), iterative computation through local updates, and parallelism by running local updates simultaneously.

Map-Reduce for Data-Parallel ML (revisited). Map-Reduce covers the data-parallel column (feature extraction, cross validation, computing sufficient statistics). For the graph-parallel column (graphical models, semi-supervised learning, data mining, collaborative filtering), Map-Reduce is a poor fit; what is needed is a Graph-Parallel Abstraction.

Graph-Parallel Abstractions. A vertex-program is associated with each vertex, and the graph constrains interaction along edges:
– Pregel: programs interact through messages
– GraphLab: programs can read each other's state

The Pregel Abstraction (bulk synchronous: compute, communicate, barrier).

    Pregel_LabelProp(i):
        # Read incoming messages
        msg_sum = sum(msg for msg in in_messages)
        # Compute the new interests
        Likes[i] = f(msg_sum)
        # Send messages to neighbors
        for j in neighbors(i):
            send_message(g(w_ij, Likes[i]), to=j)

The GraphLab Abstraction. Vertex-programs are executed asynchronously and directly read the neighboring vertex-programs' state. Activated vertex-programs are executed eventually and can read the new state of their neighbors.

    GraphLab_LabelProp(i, neighbor Likes):
        # Compute the sum over neighbors
        sum = 0
        for j in neighbors(i):
            sum += g(w_ij, Likes[j])
        # Update my interests
        Likes[i] = f(sum)
        # Activate neighbors if needed
        if Likes[i] changed:
            activate_neighbors()

GraphLab CoEM: Never-Ending Learner Project (CoEM).
– Hadoop: 95 cores, 7.5 hrs
– GraphLab: 16 cores, 30 min (15x faster with 6x fewer CPUs)
– Distributed GraphLab: 32 EC2 machines, 80 secs (0.3% of the Hadoop time)

The Cost of the Wrong Abstraction (runtime comparison, log scale).

Startups using GraphLab, companies experimenting with (or downloading) GraphLab, and academic projects exploring (or downloading) GraphLab.

Why do we need a new abstraction?

Natural Graphs [Image from WikiCommons]

Assumptions of Graph-Parallel Abstractions.
– Ideal structure: small neighborhoods (low-degree vertices), vertices have similar degree, easy to partition
– Natural graphs: large neighborhoods (high-degree vertices), power-law degree distribution, difficult to partition

Power-Law Structure. The top 1% of vertices are adjacent to 50% of the edges! On a log-log plot the degree distribution is a straight line with slope -α, α ≈ 2, and the tail is made up of high-degree vertices.
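In formula form (a standard way to state what that log-log slope means; not spelled out on the slide), the degree distribution satisfies

    P(\deg(v) = d) \;\propto\; d^{-\alpha}, \qquad \alpha \approx 2

so a small number of very high-degree vertices accounts for a disproportionate share of the edges.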

Challenges of High-Degree Vertices:
– Sequential vertex-programs touch a large fraction of the graph (GraphLab)
– They produce many messages (Pregel)
– Edge information is too large for a single machine
– Asynchronous consistency requires heavy locking (GraphLab)
– Synchronous consistency is prone to stragglers (Pregel)

Graph Partitioning. Graph-parallel abstractions rely on partitioning to:
– Minimize communication
– Balance computation and storage

Natural Graphs are Difficult to Partition. Natural graphs do not have low-cost balanced cuts [Leskovec et al. 08, Lang 04], and popular graph-partitioning tools (Metis, Chaco, …) perform poorly [Abou-Rjeili et al. 06]:
– Extremely slow and require substantial memory

Random Partitioning. Both GraphLab and Pregel proposed random (hashed) partitioning for natural graphs:
– 10 machines → 90% of edges cut
– 100 machines → 99% of edges cut!
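The percentages follow from a one-line argument implied by hashed placement (not spelled out on the slide): an edge stays uncut only when both of its endpoints hash to the same one of the p machines, which happens with probability 1/p, so

    \mathbb{E}[\text{fraction of edges cut}] \;=\; 1 - \frac{1}{p}
    \;\;\Longrightarrow\;\; 90\% \text{ for } p = 10, \quad 99\% \text{ for } p = 100.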

In Summary: GraphLab and Pregel are not well suited for natural graphs:
– Poor performance on high-degree vertices
– Low-quality partitioning

The GraphLab2 approach:
– Distribute a single vertex-program: move computation to data and parallelize high-degree vertices
– Vertex partitioning: a simple online heuristic to effectively partition large power-law graphs

Decompose Vertex-Programs into three user-defined phases over the vertex scope:
– Gather (Reduce): Gather(Y_u, edge, Y_v) → Σ, combined with a parallel sum (Σ1 + Σ2 → Σ3)
– Apply: Apply(Y, Σ) → Y', applying the accumulated value to the center vertex
– Scatter: Scatter(Y', edge, Y_v) updates adjacent edges and vertices

Writing a GraphLab2 Vertex-Program:

    LabelProp_GraphLab2(i):
        Gather(Likes[i], w_ij, Likes[j]):
            return g(w_ij, Likes[j])
        sum(a, b):
            return a + b
        Apply(Likes[i], Σ):
            Likes[i] = f(Σ)
        Scatter(Likes[i], w_ij, Likes[j]):
            if change in Likes[i] > ε then activate(j)
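To make the Gather / sum / Apply / Scatter contract concrete, here is a minimal single-machine Python sketch of one synchronous round of label propagation in this form. It only illustrates the execution pattern and is not the GraphLab2 C++ API; the dict-based graph format, the normalization used for f, and the threshold EPS are assumptions.

    EPS = 1e-3

    def gather(w_ij, likes_j):
        # Per-edge contribution g(w_ij, Likes[j]): a weighted interest vector.
        return {topic: w_ij * p for topic, p in likes_j.items()}

    def gsum(a, b):
        # Commutative, associative sum(a, b) used to combine gather results.
        return {t: a.get(t, 0.0) + b.get(t, 0.0) for t in set(a) | set(b)}

    def apply_(acc):
        # f(Sigma): normalize the accumulated mixture (an assumption).
        total = sum(acc.values()) or 1.0
        return {topic: p / total for topic, p in acc.items()}

    def gas_round(neighbors, weight, likes, active):
        # One synchronous Gather-Apply-Scatter super-step over the active set.
        new_likes = dict(likes)
        next_active = set()
        for i in active:
            acc = {}
            for j in neighbors[i]:                     # Gather + parallel sum
                acc = gsum(acc, gather(weight[(i, j)], likes[j]))
            new_likes[i] = apply_(acc)                 # Apply
            change = max((abs(new_likes[i].get(t, 0.0) - likes[i].get(t, 0.0))
                          for t in set(new_likes[i]) | set(likes[i])), default=0.0)
            if change > EPS:                           # Scatter: activate neighbors
                next_active.update(neighbors[i])
        likes.update(new_likes)
        return next_active

Running gas_round repeatedly until next_active is empty corresponds to the slide's "iterate until convergence".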

Distributed Execution of a Factorized Vertex-Program: each machine computes a partial gather (Σ1, Σ2) over its local edges, the partial sums are combined, and only O(1) data is transmitted over the network.

Cached Aggregation. Repeated calls to gather waste computation. Solution: cache the previous gather result (Σ) and update it incrementally with a delta (Δ) computed from a neighbor's old and new values, instead of recomputing the full sum over all neighbors.
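A sketch of what the cache plus delta update might look like, assuming the gather type supports + (and - for computing deltas, as formalized on the "Relation to Abelian Groups" slide at the end); the names are illustrative, not the GraphLab2 API:

    gather_cache = {}   # vertex id -> cached accumulator (Sigma)

    def post_delta(i, delta, combine):
        # A changed neighbor posts delta = g(w_ij, new) - g(w_ij, old);
        # the cached accumulator is patched in O(1) instead of re-gathered.
        if i in gather_cache:
            gather_cache[i] = combine(gather_cache[i], delta)

    def cached_gather(i, full_gather):
        # Fall back to a full gather only when no cached value exists.
        if i not in gather_cache:
            gather_cache[i] = full_gather(i)
        return gather_cache[i]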

Writing a GraphLab2 Vertex-Program with delta caching (reduces the runtime of PageRank by 50%!):

    LabelProp_GraphLab2(i):
        Gather(Likes[i], w_ij, Likes[j]):
            return g(w_ij, Likes[j])
        sum(a, b):
            return a + b
        Apply(Likes[i], Σ):
            Likes[i] = f(Σ)
        Scatter(Likes[i], w_ij, Likes[j]):
            if change in Likes[i] > ε then activate(j)
            post Δ_j = g(w_ij, Likes[i]_new) - g(w_ij, Likes[i]_old)

Execution Models Synchronous and Asynchronous

Synchronous Execution (similar to Pregel). For all active vertices: Gather, Apply, Scatter; activated vertices are run on the next iteration. Fully deterministic, but potentially slower convergence for some machine learning algorithms.

Asynchronous Execution (similar to GraphLab). Active vertices are processed asynchronously as resources become available. Non-deterministic; serial consistency can optionally be enabled.

Preventing Overlapping Computation: a new distributed mutual exclusion protocol ensures that adjacent vertex-programs, which share a conflict edge, do not run at the same time.

Multi-core Performance: multicore PageRank (25M vertices, 355M edges), comparing GraphLab, Pregel (simulated), GraphLab2 factorized, and GraphLab2 factorized + caching.

Vertex-Cuts for Partitioning. Percolation theory suggests that power-law graphs can be split by removing only a small set of vertices [Albert et al. 2000]. What does this suggest for graph partitioning?

The GraphLab2 Abstraction Permits a New Approach to Partitioning. Rather than cutting edges (which forces many edges to be synchronized between machines), we cut vertices (only a single vertex must be synchronized). Theorem: for any edge-cut we can directly construct a vertex-cut that requires strictly less communication and storage.

Constructing Vertex-Cuts. Goal: parallel graph partitioning on ingress. Three simple approaches are proposed:
– Random edge placement: edges are placed randomly by each machine
– Greedy edge placement with coordination: edges are placed using a shared objective
– Oblivious-greedy edge placement: edges are placed using a local objective

Random Vertex-Cuts. Assign edges randomly to machines and allow vertices to span machines.

Random Vertex-Cuts (continued). The expected number of machines spanned by a vertex grows with its degree (plot: spanned machines vs. degree of v).
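The expectation can be reconstructed with a balls-into-bins argument (the slide only shows the plot, so treat the formula as a reconstruction): each of the D[v] edges of vertex v is placed independently and uniformly over the p machines, and a given machine receives none of them with probability (1 - 1/p)^{D[v]}, so

    \mathbb{E}\bigl[\,|A(v)|\,\bigr] \;=\; p\left(1 - \left(1 - \frac{1}{p}\right)^{D[v]}\right)

where A(v) is the set of machines spanned by v.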

Random Vertex-Cuts (continued). Expected number of machines spanned by a vertex, plotted for power-law graphs with α = 1.65, 1.7, 1.8, and 2.

Greedy Vertex-Cuts by Derandomization. Place the next edge on the machine that minimizes the future expected cost, given the placement information for previously seen vertices:
– Greedy: edges are greedily placed using a shared placement history
– Oblivious: edges are greedily placed using a local placement history
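A simplified Python sketch of the greedy placement rule, assuming ties are broken by machine load (the actual case analysis in the system is somewhat more refined, so read this as illustrative):

    from collections import defaultdict

    def place_edge(u, v, machines_of, load, p):
        # Assign edge (u, v) to a machine, reusing machines its endpoints
        # already touch so that few new mirrors are created.
        a_u, a_v = machines_of[u], machines_of[v]
        if a_u & a_v:                     # endpoints already share a machine
            candidates = a_u & a_v
        elif a_u or a_v:                  # reuse a machine of either endpoint
            candidates = a_u | a_v
        else:                             # neither endpoint placed yet
            candidates = set(range(p))
        m = min(candidates, key=lambda k: load[k])   # respect edge balance
        machines_of[u].add(m)
        machines_of[v].add(m)
        load[m] += 1
        return m

    machines_of = defaultdict(set)   # A(v): machines spanned by vertex v
    load = defaultdict(int)          # edges assigned to each machine

Greedy keeps one shared machines_of/load table across all loading machines (coordination); Oblivious gives each loading machine its own local copy (no coordination).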

Greedy placement: machines share a common objective (requires communication).

Oblivious placement: each machine optimizes its own local objective (no coordination).

Partitioning Performance. Twitter graph: 41M vertices, 1.4B edges. Oblivious and Greedy placement balance partition quality (machines spanned) against load time (seconds).

32-Way Partitioning Quality (machines spanned per vertex).
    Graph         Vertices   Edges
    Twitter       41M        1.4B
    UK            133M       5.5B
    Amazon        0.7M       5.2M
    LiveJournal   5.4M       79M
    Hollywood     2.2M       229M
Oblivious: 2x improvement at +20% load-time. Greedy: 3x improvement at +100% load-time.

System Evaluation

Implementation:
– Implemented as a C++ API
– Asynchronous I/O over TCP/IP
– Fault tolerance is achieved by checkpointing
– Substantially simpler than the original GraphLab: the synchronous engine is < 600 lines of code
– Evaluated on 64 EC2 HPC cc1.4xlarge nodes

Comparison with GraphLab & Pregel. PageRank on synthetic power-law graphs with random edge and vertex cuts (plots of GraphLab2 runtime and communication; denser graphs toward the right).

Benefits of a good Partitioning Better partitioning has a significant impact on performance.

Performance: PageRank on the Twitter graph (41M vertices, 1.4B edges), comparing Random, Oblivious, and Greedy placement.

Matrix Factorization of the Wikipedia dataset (11M vertices, 315M edges), a bipartite docs-words graph. Consistency = lower throughput.

Matrix Factorization: consistency → faster convergence (serially consistent execution vs. fully asynchronous).

PageRank on the AltaVista Webgraph (1.4B vertices, 6.7B edges): Pegasus, 1320 s on 800 cores; GraphLab2, 76 s on 512 cores.

Conclusion. Graph-parallel abstractions are an emerging tool for large-scale machine learning. The challenges of natural graphs: power-law degree distributions and graphs that are difficult to partition. GraphLab2:
– Distributes single vertex-programs
– A new vertex-partitioning heuristic to rapidly place large power-law graphs
– Experimentally outperforms existing graph-parallel abstractions

Carnegie Mellon University Official release in July.

Pregel Message Combiners: a user-defined commutative, associative (+) message operation lets each machine pre-aggregate (e.g. sum) the messages destined for the same vertex before sending them.
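A small sketch of the semantics of a sum combiner (assumed types; the real Pregel combiner is a user-supplied class, so this only shows the effect):

    from collections import defaultdict

    def combine_outgoing(messages):
        # Pre-aggregate (target, value) messages on the sending machine so
        # that at most one message per target vertex crosses the network.
        combined = defaultdict(float)
        for target, value in messages:
            combined[target] += value   # user-defined commutative/associative (+)
        return list(combined.items())

    # Example: three messages to vertex 7 collapse into one.
    print(combine_outgoing([(7, 0.1), (7, 0.2), (9, 0.5), (7, 0.3)]))
    # -> roughly [(7, 0.6), (9, 0.5)]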

Costly on High Fan-Out: many identical messages are sent across the network to the same machine.

GraphLab Ghosts: neighbors' values are cached locally as ghost vertices and maintained by the system.

Reduces the Cost of High Fan-Out: a change to a high-degree vertex is communicated with a single message.

Increases the Cost of High Fan-In: changes to neighbors are synchronized individually and collected sequentially.

Comparison with GraphLab & Pregel: GraphLab2 on PageRank over synthetic power-law graphs, with separate power-law fan-in and fan-out panels (denser graphs toward the right).

Straggler Effect: PageRank on synthetic power-law graphs (power-law fan-in and fan-out panels, denser graphs toward the right), comparing GraphLab, Pregel (Piccolo), and GraphLab2.

Cached Gather for PageRank: beyond the initial accumulator computation time, caching reduces runtime by ~50%.

Vertex-Cuts. Edges are assigned to machines; vertices span machines, forming a mirror set.
– Cut objective: minimize the number of mirrors
– Balance constraint: no machine has too many edges
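Written out, one standard formulation consistent with the slide (a reconstruction: A(v) is the mirror set of v, M(e) the machine assigned to edge e, p the number of machines, and λ ≥ 1 a small imbalance factor):

    \min_{M}\; \frac{1}{|V|}\sum_{v \in V} |A(v)|
    \qquad \text{subject to} \qquad
    \max_{m}\, \bigl|\{\, e \in E : M(e) = m \,\}\bigr| \;<\; \lambda\,\frac{|E|}{p}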

Relation to Abelian Groups. We can define an incremental update when:
– Gather(U, V) → T
– T is an Abelian group: commutative, associative (+) with an inverse (-)
– The delta value is then Δ_v = Gather(U_new, V) - Gather(U_old, V)