Big Learning with Graph Computation. Joseph Gonzalez.

Big Data Already Happened: 48 hours of video uploaded every minute, 750 million Facebook users, 6 billion Flickr photos, 1 billion tweets per week.

How do we understand and use Big Data? Big Learning

Big Learning Today: Simple Models (e.g., regression). The philosophy of Big Data and simple models.
Pros:
– Easy to understand/predictable
– Easy to train in parallel
– Supports feature engineering
– Versatile: classification, ranking, density estimation

“Invariably, simple models and a lot of data trump more elaborate models based on less data.” Alon Halevy, Peter Norvig, and Fernando Pereira, Google

Why not build elaborate models with lots of data? It is difficult and computationally intensive.

Big Learning Today: Simple Models
Pros:
– Easy to understand/predictable
– Easy to train in parallel
– Supports feature engineering
– Versatile: classification, ranking, density estimation
Cons:
– Favors bias in the presence of Big Data
– Strong independence assumptions

Example: two shoppers, one interested in cameras and one in cooking, connected through a social network.

Big Data exposes the opportunity for structured machine learning.

Examples

Label Propagation (Social Arithmetic). Each user's interests are a weighted average of their own profile and their neighbors' interests. Example: I weight what I list on my profile 50%, what Sue Ann likes 40%, and what Carlos likes 10%. With my profile at 50% cameras / 50% biking, Sue Ann at 80% cameras / 20% biking, and Carlos at 30% cameras / 70% biking, I like 60% cameras, 40% biking. Recurrence: recompute Likes[i] from the weighted likes of i's neighbors. Algorithm: iterate until convergence. Parallelism: compute all Likes[i] in parallel.
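A minimal C++ sketch of one such sweep (illustrative only; the container layout, the Interests struct, and the weight arrays are assumptions, not the talk's code):

#include <cstddef>
#include <vector>

// One sweep of the label-propagation recurrence: each user's interest vector
// becomes a weighted average of their own profile and their neighbors' likes.
struct Interests { double cameras; double biking; };

void propagate_once(const std::vector<std::vector<int>>& neighbors,
                    const std::vector<std::vector<double>>& nbr_weight, // nbr_weight[i][k]: weight of i's k-th neighbor
                    const std::vector<double>& self_weight,             // e.g. 0.5 for "what I list on my profile"
                    const std::vector<Interests>& profile,
                    std::vector<Interests>& likes) {
  std::vector<Interests> next(likes.size());
  for (std::size_t i = 0; i < likes.size(); ++i) {   // every Likes[i] can be computed in parallel
    Interests acc = { self_weight[i] * profile[i].cameras,
                      self_weight[i] * profile[i].biking };
    for (std::size_t k = 0; k < neighbors[i].size(); ++k) {
      const Interests& nb = likes[neighbors[i][k]];
      acc.cameras += nbr_weight[i][k] * nb.cameras;
      acc.biking  += nbr_weight[i][k] * nb.biking;
    }
    next[i] = acc;
  }
  likes.swap(next);  // call propagate_once repeatedly until the likes stop changing
}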

PageRank (Centrality Measures). Iterate: R[i] = α + (1 − α) · Σ over pages j linking to i of R[j] / L[j], where α is the random reset probability and L[j] is the number of links on page j.

Matrix Factorization: Alternating Least Squares (ALS). Approximate the Netflix ratings matrix (Users × Movies) as the product of low-rank user factors (U) and movie factors (M), so that each observed rating r_ij ≈ u_i · m_j. The update function for user factor u_i solves a small least-squares fit to user i's observed ratings while the movie factors are held fixed (and symmetrically for each m_j).
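The per-user ALS step, written in its standard closed form (the regularization term λ is the usual addition and was not transcribed from the slide):

u_i \leftarrow \Big(\sum_{j \in \Omega_i} m_j m_j^{\top} + \lambda I\Big)^{-1} \sum_{j \in \Omega_i} r_{ij}\, m_j

where \Omega_i is the set of movies rated by user i; the movie factors m_j are updated symmetrically with the user factors held fixed.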

Other Examples
– Statistical inference in relational models: belief propagation, Gibbs sampling
– Network analysis: centrality measures, triangle counting
– Natural language processing: CoEM, topic modeling

Graph-Parallel Algorithms: a dependency graph plus iterative computation with local updates (my interests depend on my friends' interests).

What is the right tool for Graph-Parallel ML? Data-parallel tasks (feature extraction, cross validation, computing sufficient statistics) map well onto Map-Reduce. Graph-parallel tasks (belief propagation, label propagation, kernel methods, deep belief networks, neural networks, tensor factorization, PageRank, Lasso) need something else. Map-Reduce?

Why not use Map-Reduce for Graph Parallel algorithms?

Data Dependencies are Difficult. Map-Reduce is built around independent data records, so dependent data is difficult to express in MR:
– Substantial data transformations
– User-managed graph structure
– Costly data replication

Iterative Computation is Difficult. The system is not optimized for iteration: every pass moves the data between disk and the CPUs again, paying a startup penalty and a disk penalty on each iteration.

Map-Reduce for Data-Parallel ML: excellent for large data-parallel tasks (feature extraction, cross validation, computing sufficient statistics). For graph-parallel tasks (belief propagation, SVM, kernel methods, deep belief networks, neural networks, tensor factorization, PageRank, Lasso): Map-Reduce? MPI/Pthreads?

We could use... Threads, Locks, & Messages: "low level parallel primitives".

Threads, Locks, and Messages
ML experts (often graduate students) repeatedly solve the same parallel design challenges:
– Implement and debug a complex parallel system
– Tune for a specific parallel platform
– Six months later the conference paper contains: “We implemented ______ in parallel.”
The resulting code:
– is difficult to maintain
– is difficult to extend
– couples the learning model to the parallel implementation

Addressing Graph-Parallel ML: we need alternatives to Map-Reduce. Data-parallel tasks (feature extraction, cross validation, computing sufficient statistics) stay on Map-Reduce; for graph-parallel tasks (belief propagation, SVM, kernel methods, deep belief networks, neural networks, tensor factorization, PageRank, Lasso) the candidates are MPI/Pthreads and Pregel (BSP).

Pregel: Bulk Synchronous Parallel. Execution proceeds in supersteps: compute, communicate, then a barrier.

Open Source Implementations: Giraph, Golden Orb, and an asynchronous variant: GraphLab.

PageRank in Giraph (Pregel):

public void compute(Iterator<DoubleWritable> msgIterator) {
  // Sum PageRank over incoming messages
  double sum = 0;
  while (msgIterator.hasNext())
    sum += msgIterator.next().get();
  // The numeric constants were lost in the transcript; the standard
  // reset/damping values (matching the GraphLab version below) are shown.
  DoubleWritable vertexValue = new DoubleWritable(0.15 + 0.85 * sum);
  setVertexValue(vertexValue);
  if (getSuperstep() < getConf().getInt(MAX_STEPS, -1)) {
    long edges = getOutEdgeMap().size();
    sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges));
  } else {
    voteToHalt();
  }
}

Tradeoffs of the BSP Model
Pros:
– Graph parallel
– Relatively easy to implement and reason about
– Deterministic execution

Embarrassingly Parallel Phases: compute, communicate, barrier.

Tradeoffs of the BSP Model
Pros:
– Graph parallel
– Relatively easy to build
– Deterministic execution
Cons:
– Doesn't exploit the graph structure
– Can lead to inefficient systems

Curse of the Slow Job: within each barrier-synchronized iteration, every processor must wait at the barrier for the slowest job before the next iteration can begin.

Curse of the Slow Job (plot; per-job runtime assumed drawn from an exponential distribution with mean 1).
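Under that assumption the barrier waits for the slowest of the n jobs in the iteration, and for i.i.d. Exponential(1) runtimes the expected wait is the n-th harmonic number:

\mathbb{E}\big[\max(X_1,\dots,X_n)\big] = \sum_{k=1}^{n} \frac{1}{k} \approx \ln n + \gamma,

so the expected time per iteration grows with the number of parallel jobs even though the expected work per job stays fixed.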

Tradeoffs of the BSP Model
Pros:
– Graph parallel
– Relatively easy to build
– Deterministic execution
Cons:
– Doesn't exploit the graph structure
– Can lead to inefficient systems
– Can lead to inefficient computation

Example: Loopy Belief Propagation (Loopy BP). Iteratively estimate the “beliefs” about vertices:
– Read in messages
– Update the marginal estimate (belief)
– Send updated out messages
Repeat for all variables until convergence.
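For reference (standard pairwise-MRF notation, not transcribed from the slides), the message and belief updates being iterated are:

m_{i \to j}(x_j) \propto \sum_{x_i} \psi_i(x_i)\, \psi_{ij}(x_i, x_j) \prod_{k \in N(i)\setminus\{j\}} m_{k \to i}(x_i),
\qquad
b_i(x_i) \propto \psi_i(x_i) \prod_{k \in N(i)} m_{k \to i}(x_i).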

Bulk Synchronous Loopy BP. Often considered embarrassingly parallel:
– Associate a processor with each vertex
– Receive all messages
– Update all beliefs
– Send all messages
Proposed by Brunton et al. CRV’06, Mendiburu et al. GECC’07, Kang et al. LDMTA’10, and others.

Sequential Computational Structure

Hidden Sequential Structure

Hidden Sequential Structure. Running time = (time for a single parallel iteration) × (number of iterations); with evidence only at the ends of the chain, information must propagate across the whole chain, so many iterations are needed.

Optimal Sequential Algorithm. Running time on a chain of length n: bulk synchronous BP takes 2n²/p and requires p ≤ 2n; the sequential forward-backward algorithm takes 2n with p = 1, and running the forward and backward sweeps simultaneously takes n with p = 2. The gap between the bulk synchronous approach and forward-backward grows with n.

The Splash Operation. Generalize the optimal chain algorithm to arbitrary cyclic graphs:
1) Grow a BFS spanning tree with fixed size
2) Forward pass computing all messages at each vertex
3) Backward pass computing all messages at each vertex
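A rough C++ sketch of those three steps (illustrative pseudocode under assumed types, not the actual Splash BP implementation):

#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// One Splash rooted at `root`: grow a bounded BFS tree, then sweep it
// leaves-to-root and root-to-leaves, recomputing all messages at each vertex.
void splash(int root, std::size_t max_size,
            const std::vector<std::vector<int>>& adj,
            const std::function<void(int)>& update_vertex_messages) {
  // 1) Grow a BFS spanning tree of fixed size rooted at `root`.
  std::vector<int> order;
  std::vector<char> visited(adj.size(), 0);
  std::queue<int> frontier;
  frontier.push(root);
  visited[root] = 1;
  while (!frontier.empty() && order.size() < max_size) {
    int v = frontier.front(); frontier.pop();
    order.push_back(v);
    for (int u : adj[v])
      if (!visited[u]) { visited[u] = 1; frontier.push(u); }
  }
  // 2) Forward pass: recompute all messages at each vertex, leaves toward root.
  for (auto it = order.rbegin(); it != order.rend(); ++it)
    update_vertex_messages(*it);
  // 3) Backward pass: recompute all messages again, root back toward the leaves.
  for (int v : order) update_vertex_messages(v);
}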

BSP is Provably Inefficient. Limitations of the bulk synchronous model can lead to provably inefficient parallel algorithms: the runtime gap between bulk synchronous (Pregel) BP and asynchronous Splash BP grows with problem size.

Tradeoffs of the BSP Model
Pros:
– Graph parallel
– Relatively easy to build
– Deterministic execution
Cons:
– Doesn't exploit the graph structure
– Can lead to inefficient systems
– Can lead to inefficient computation
– Can lead to invalid computation

The Problem with Bulk Synchronous Gibbs Sampling. Adjacent variables cannot be sampled simultaneously. Example: two variables with a strong positive correlation at t = 0. Sequential execution preserves the strong positive correlation at t = 1, 2, 3, while synchronous parallel execution drives the samples into a strong negative correlation.

The Need for a New Abstraction: if not Pregel, then what? Data-parallel tasks (feature extraction, cross validation, computing sufficient statistics) have Map-Reduce; graph-parallel tasks (belief propagation, SVM, kernel methods, deep belief networks, neural networks, tensor factorization, PageRank, Lasso) have Pregel (Giraph), and...?

GraphLab Addresses the Limitations of the BSP Model
Use graph structure:
– Automatically manage the movement of data
Focus on asynchrony:
– Computation runs as resources become available
– Use the most recent information
Support adaptive/intelligent scheduling:
– Focus computation where it is needed
Preserve serializability:
– Provide the illusion of a sequential execution
– Eliminate “race conditions”

What is GraphLab? (Check out Version 2.)

The GraphLab Framework: graph-based data representation, update functions (user computation), scheduler, and consistency model.

Data Graph: a graph with arbitrary data (C++ objects) associated with each vertex and edge. Example (social network graph): vertex data holds the user profile text and current interests estimates; edge data holds similarity weights.
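For the social-network example, the per-vertex and per-edge objects could be as simple as the following sketch (illustrative type names, not GraphLab's actual API):

#include <string>
#include <vector>

// Vertex data: the user's profile text plus the current interests estimates.
struct vertex_data {
  std::string profile_text;
  std::vector<double> interest_estimates;
};

// Edge data: the similarity weight between the two connected users.
struct edge_data {
  double similarity;
};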

Implementing the Data Graph
All data and structure is stored in memory, supporting the fast random lookup needed for dynamic computation.
– Multicore setting: the challenge is fast lookup with low overhead; the solution is dense data structures.
– Distributed setting: the challenge is graph partitioning; the solutions are ParMETIS and random placement.

New Perspective on Partitioning. Natural graphs have poor edge separators, so classic graph partitioning tools (e.g., ParMetis, Zoltan) fail: cutting edges forces many edges to be synchronized across machines. Natural graphs have good vertex separators: cutting vertices means only a single vertex must be synchronized.

Update Functions: an update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

pagerank(i, scope) {
  // Get neighborhood data
  (R[i], W[i,j], R[j]) <- scope;
  // Update the vertex data
  R[i] <- alpha + (1 - alpha) * sum over in-neighbors j of ( W[i,j] * R[j] );
  // Reschedule neighbors if needed
  if R[i] changes then reschedule_neighbors_of(i);
}

PageRank in GraphLab (V2):

struct pagerank : public iupdate_functor<graph_type, pagerank> { // template arguments restored; angle brackets were stripped in the transcript
  void operator()(icontext_type& context) {
    double sum = 0;
    foreach (edge_type edge, context.in_edges())
      sum += 1/context.num_out_edges(edge.source()) *
             context.vertex_data(edge.source());
    double& rank = context.vertex_data();
    double old_rank = rank;
    rank = RESET_PROB + (1 - RESET_PROB) * sum;
    double residual = abs(rank - old_rank);
    if (residual > EPSILON)
      context.reschedule_out_neighbors(pagerank());
  }
};

PageRank Update Function (pseudocode):

GraphLab_pagerank(scope) {
  double sum = 0;
  forall (nbr in scope.in_neighbors())
    // Directly read neighbor values
    sum = sum + nbr.value() / nbr.num_out_edges();
  double old_rank = scope.vertex_data();
  scope.center_value() = ALPHA + (1 - ALPHA) * sum;
  double residual = abs(scope.center_value() - old_rank);
  // Dynamically schedule computation
  if (residual > EPSILON)
    reschedule_out_neighbors();
}

Dynamic Computation: focus effort on the slowly converging parts of the graph and skip the parts that have already converged.

The Scheduler. The scheduler determines the order in which vertices are updated: CPUs pull vertices from the scheduler and apply update functions, which may in turn add new vertices to the schedule. The process repeats until the scheduler is empty.

Choosing a Schedule. GraphLab provides several different schedulers:
– Round Robin (--scheduler=sweep): vertices are updated in a fixed order
– FIFO (--scheduler=fifo): vertices are updated in the order they are added
– Priority (--scheduler=priority): vertices are updated in priority order
The choice of schedule affects the correctness and parallel performance of the algorithm; obtain different algorithms by simply changing a flag.

The GraphLab Framework (recap): graph-based data representation, update functions (user computation), scheduler, and consistency model.

Ensuring Race-Free Execution How much can computation overlap?

GraphLab Ensures Sequential Consistency: for each parallel execution, there exists a sequential execution of update functions which produces the same result.

Consistency Rules: guaranteed sequential consistency for all update functions.

Full Consistency

Obtaining More Parallelism

Edge Consistency: the update function gets write access to the center vertex and its adjacent edges, and read access to adjacent vertices (safe read), so two CPUs can safely run on vertices that do not share an edge.

Consistency Through Scheduling. Edge consistency model: two vertices can be updated simultaneously if they do not share an edge. Graph coloring: two vertices can be assigned the same color if they do not share an edge. Execute each color as a synchronous parallel phase, with a barrier between phases.
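A minimal sketch of this coloring-based schedule (greedy coloring plus color-by-color phases; illustrative only, not GraphLab's chromatic engine):

#include <algorithm>
#include <functional>
#include <vector>

// Greedy coloring: adjacent vertices receive different colors, so all vertices
// of one color may be updated simultaneously under the edge consistency model.
std::vector<int> greedy_color(const std::vector<std::vector<int>>& adj) {
  std::vector<int> color(adj.size(), -1);
  for (int v = 0; v < (int)adj.size(); ++v) {
    std::vector<char> used(adj.size() + 1, 0);
    for (int u : adj[v]) if (color[u] >= 0) used[color[u]] = 1;
    int c = 0;
    while (used[c]) ++c;
    color[v] = c;
  }
  return color;
}

// One sweep: each color is a synchronous phase; a barrier separates phases.
void chromatic_sweep(const std::vector<std::vector<int>>& adj,
                     const std::function<void(int)>& update) {
  const std::vector<int> color = greedy_color(adj);
  const int num_colors = *std::max_element(color.begin(), color.end()) + 1;
  for (int c = 0; c < num_colors; ++c) {
    for (int v = 0; v < (int)adj.size(); ++v)
      if (color[v] == c) update(v);   // all same-colored vertices may run in parallel
    // ... a barrier between color phases would go here in a parallel runtime.
  }
}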

Consistency Through R/W Locks. Read/write locks: for full consistency, acquire write locks on the center vertex and all of its neighbors; for edge consistency, acquire a write lock on the center vertex and read locks on its neighbors. Locks are acquired in a canonical ordering to prevent deadlock.
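A sketch of the edge-consistency case using per-vertex std::shared_mutex and ascending-vertex-id ordering (assumptions for illustration; GraphLab's actual lock manager differs):

#include <algorithm>
#include <shared_mutex>
#include <vector>

// Edge consistency via R/W locks: write-lock the center vertex, read-lock its
// neighbors, always acquiring locks in a canonical (ascending id) order so
// that concurrent update functions cannot deadlock. Full consistency would
// instead take write locks on the neighbors as well.
void lock_edge_scope(int center, std::vector<int> scope /* neighbor ids (copied) */,
                     std::vector<std::shared_mutex>& vlock) {
  scope.push_back(center);
  std::sort(scope.begin(), scope.end());          // canonical lock ordering
  for (int v : scope) {
    if (v == center) vlock[v].lock();             // exclusive lock on the center
    else             vlock[v].lock_shared();      // shared lock on a neighbor
  }
}

void unlock_edge_scope(int center, const std::vector<int>& neighbors,
                       std::vector<std::shared_mutex>& vlock) {
  vlock[center].unlock();
  for (int v : neighbors) vlock[v].unlock_shared();
}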

Consistency Through R/W Locks. Multicore setting: pthread R/W locks. Distributed setting: distributed locking over the partitioned data graph, with a lock pipeline that prefetches locks and data so computation can proceed while locks and data are being requested.

The GraphLab Framework (recap): graph-based data representation, update functions (user computation), scheduler, and consistency model.

Implemented on GraphLab: Bayesian tensor factorization, dynamic block Gibbs sampling, matrix factorization (alternating least squares), Lasso, SVM, belief propagation, Splash sampler, PageRank, CoEM, K-Means, SVD, LDA, linear solvers, and many others.

Startups Using GraphLab. Companies are experimenting with (or downloading) GraphLab; academic projects are exploring (or downloading) GraphLab.

GraphLab vs. Pregel (BSP): PageRank on 25M vertices and 355M edges; with GraphLab's dynamic scheduling, 51% of the vertices were updated only once.

CoEM (Rosie Jones, 2005): a Named Entity Recognition task. Is “Dog” an animal? Is “Catalina” a place? The graph links noun phrases (the dog, Australia, Catalina Island) to the contexts they appear in (ran quickly, travelled to, is pleasant). Vertices: 2 million; edges: 200 million. Hadoop: 95 cores, 7.5 hrs.

CoEM (Rosie Jones, 2005): Hadoop takes 7.5 hrs on 95 cores; GraphLab takes 30 min on 16 cores. 15x faster with 6x fewer CPUs.

CoEM (Rosie Jones, 2005), GraphLab in the Cloud: 32 EC2 machines, 80 secs, i.e., 0.3% of the Hadoop time (Hadoop: 95 cores, 7.5 hrs; GraphLab multicore: 16 cores, 30 min).

The Cost of the Wrong Abstraction (note: the plot is on a log scale).

Tradeoffs of GraphLab
Pros:
– Separates the algorithm from the movement of data
– Permits dynamic asynchronous scheduling
– More expressive consistency model
– Faster and more efficient runtime performance
Cons:
– Non-deterministic execution
– Defining the “residual” can be tricky
– Substantially more complicated to implement

Scalability and Fault-Tolerance in Graph Computation

Scalability
MapReduce (data parallel):
– Map-heavy jobs scale very well
– Reduce places some pressure on the network
– Typically disk/computation bound
– Favors horizontal scaling (i.e., big clusters)
Pregel/GraphLab (graph parallel):
– Iterative communication can be network intensive
– Network latency/throughput become the bottleneck
– Favors vertical scaling (i.e., faster networks and stronger machines)

Cost-Time Tradeoff (video co-segmentation results): more machines mean higher cost but faster runtime; a few machines help a lot, then returns diminish.

Video Co-segmentation: segments mean the same. Model: 10.5 million nodes, 31 million edges. Gaussian EM clustering + BP on a 3D grid; a video version of [Batra].

Video Co-segmentation, Strong Scaling: GraphLab vs. ideal (plot).

Video Co-segmentation, Weak Scaling: GraphLab vs. ideal (plot).

Fault Tolerance

Rely on Checkpoints. Pregel (BSP): synchronous checkpoint construction at the barrier of the compute/communicate cycle. GraphLab: asynchronous checkpoint construction.

Checkpoint Interval Tradeoff:
– Short T_i: checkpoints become too costly
– Long T_i: failures become too costly, since more work must be re-computed after a machine failure
Here T_i is the checkpoint interval and T_s is the checkpoint length.

Optimal Checkpoint Intervals. Construct a first-order approximation: T_i ≈ sqrt(2 · T_c · T_mtbf), where T_i is the checkpoint interval, T_c is the length of a checkpoint, and T_mtbf is the mean time between failures. Example: 64 machines with a per-machine MTBF of 1 year give T_mtbf = 1 year / 64 ≈ 130 hours; with T_c = 4 minutes, T_i ≈ 4 hours.
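A quick check of the example numbers against that first-order formula (the sqrt form is reconstructed from the slide's symbols and reproduces its example values):

#include <cmath>
#include <cstdio>

// Optimal checkpoint interval, first-order approximation: T_i = sqrt(2 * T_c * T_mtbf).
// Example from the slide: 64 machines, per-machine MTBF of 1 year, T_c = 4 minutes.
int main() {
  const double t_mtbf = (365.0 * 24.0) / 64.0; // cluster MTBF in hours (~137 h; the slide rounds to ~130 h)
  const double t_c    = 4.0 / 60.0;            // checkpoint length in hours
  const double t_i    = std::sqrt(2.0 * t_c * t_mtbf);
  std::printf("T_mtbf = %.0f h, T_i = %.1f h\n", t_mtbf, t_i); // prints a T_i of roughly 4 hours
  return 0;
}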

Open Challenges

Dynamically Changing Graphs
Example: social networks (new users → new vertices; new friends → new edges).
How do you adaptively maintain computation?
– Trigger computation with changes in the graph
– Update “interest estimates” only where needed
– Exploit asynchrony
– Preserve consistency

Graph Partitioning
How can you quickly place a large data-graph in a distributed environment?
– Edge separators fail on large power-law graphs (social networks, recommender systems, NLP)
– Constructing vertex separators at scale: no large-scale tools exist
– How can you adapt the placement as the graph changes?

Graph Simplification for Computation Can you construct a “sub-graph” that can be used as a proxy for graph computation? See Paper: – Filtering: a method for solving graph problems in MapReduce.

Concluding BIG Ideas
– Modeling trend: from independent data to dependent data; extract more signal from noisy structured data
– Graphs model data dependencies, capturing locality and communication patterns
– Data-parallel tools are not well suited to graph-parallel problems
– Compared several graph-parallel tools: Pregel / BSP models are easy to build and deterministic but suffer from several key inefficiencies; GraphLab is fast, efficient, and expressive but introduces non-determinism
– Scaling and fault tolerance: network bottlenecks and optimal checkpoint intervals
– Open challenges remain, and there is enormous industrial interest