CSCI5570 Large Scale Data Processing Systems: Graph Processing Systems. James Cheng, CSE, CUHK. Slide acknowledgement: adapted from the slides by Yucheng Low.

Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. PVLDB 2012.

Big Data is Everywhere. 28 million Wikipedia pages, 6 billion Flickr photos, 900 million Facebook users, 72 hours of video uploaded to YouTube every minute; "…growing at 50 percent a year…"; "…data a new class of economic asset, like currency or gold." Big data is everywhere, but on its own it is useless; turning it into something actionable means identifying trends, creating models, and discovering patterns. That is machine learning at this scale: Big Learning.

How will we design and implement Big Learning systems?

Shift Towards Use of Parallelism in ML: GPUs, multicore, clusters, clouds, supercomputers. ML experts repeatedly solve the same parallel design challenges (race conditions, distributed state, communication), and the resulting code is very specialized: difficult to maintain, extend, and debug, and built to solve very specific tasks. We can avoid these problems by using high-level abstractions.

MapReduce – Map Phase. The map phase is independent, embarrassingly parallel computation, for example feature extraction from images: each CPU processes its own inputs, with no communication needed.

MapReduce – Reduce Phase. The reduce phase is a "fold" or aggregation operation over the map results, for example aggregating image features into statistics for attractive faces and statistics for ugly faces.
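To make the two phases concrete, here is a minimal, self-contained Python sketch of the pattern the slides describe; the toy "images", the per-record feature, and the A/U labels are invented for illustration and are not the lecture's actual data or code.

    from collections import defaultdict
    from functools import reduce

    # Toy records: (class label, pixel values). The labels stand in for the
    # attractive ("A") / ugly ("U") classes in the slides; the data is made up.
    images = [("A", [0.9, 0.4]), ("U", [0.2, 0.1]),
              ("A", [0.8, 0.6]), ("U", [0.3, 0.2])]

    # Map phase: embarrassingly parallel, per-record feature extraction.
    # Each record could be processed on any CPU with no communication.
    def map_fn(record):
        label, pixels = record
        return label, sum(pixels) / len(pixels)   # a single toy feature

    mapped = [map_fn(r) for r in images]

    # Reduce phase: a fold/aggregation over the mapped results, grouped by label.
    def reduce_fn(acc, kv):
        label, feature = kv
        total, count = acc[label]
        acc[label] = (total + feature, count + 1)
        return acc

    stats = reduce(reduce_fn, mapped, defaultdict(lambda: (0.0, 0)))
    means = {label: total / count for label, (total, count) in stats.items()}
    print(means)   # per-class mean feature, e.g. {'A': 0.675, 'U': 0.2}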

MapReduce for Data-Parallel ML. MapReduce is excellent for large data-parallel tasks: feature extraction, cross validation, computing sufficient statistics. But is there more to machine learning? Yes: graph-parallel tasks such as graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), collaborative filtering (tensor factorization), and graph analysis (PageRank, triangle counting).

Exploit Dependencies. A lot of the power of ML methods comes from exploiting dependencies between data!

Example: product recommendation. From a user's own data alone, predicting their interests is nearly impossible; but on a social network, the interests of close friends (hockey, scuba diving) let us infer the user's likely interests (say, underwater hockey).

Graphs are Everywhere. Social networks: infer users' interests from their friends. Collaborative filtering (Netflix): a sparse bipartite graph of users and movies built from movie rentals. Text analysis (Wikipedia): a bipartite graph of documents and words. Probabilistic analysis: graphical models relating random variables.

Properties of Computation on Graphs. (1) Dependency graph: the data are related to each other via a graph. (2) Local updates: computation has locality; the parameters at a vertex may depend only on its adjacent vertices (my interests depend on my friends' interests). (3) Iterative computation: a sequence of operations is repeated until convergence.

ML Tasks Beyond Data-Parallelism. The universe of ML tasks splits into data-parallel work (MapReduce: feature extraction, cross validation, computing sufficient statistics) and graph-parallel work (graphical models: Gibbs sampling, belief propagation, variational optimization; semi-supervised learning: label propagation, CoEM; collaborative filtering: tensor factorization; graph analysis: PageRank, triangle counting). GraphLab is designed to target the graph-parallel side.

Shared-Memory GraphLab (2010). Users of the shared-memory version implemented many algorithms: alternating least squares, SVD, the Splash sampler, CoEM, Bayesian tensor factorization, Lasso, belief propagation, LDA, SVM, PageRank, Gibbs sampling, linear solvers, matrix factorization, and many others.

But a single shared-memory machine has limited CPU power, limited memory, and limited scalability.

The Distributed Cloud. A practically unlimited amount of computation resources (up to funding limitations), but it brings the classical distributed-systems challenges: distributing state, data consistency, and fault tolerance.

The GraphLab Framework (overview): a graph-based data representation, update functions for user computation, and a consistency model.

Data Graph. The data graph is the core GraphLab data structure: an arbitrary graph with data associated with every vertex and edge. Example: a social network, where the vertex data holds the user profile and current interest estimates, and the edge data holds the relationship (friend, classmate, relative).
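As a concrete illustration, a minimal Python sketch of such a data graph follows; the class and field names are illustrative only and are not the actual GraphLab API.

    # A toy data graph: an arbitrary graph with user-defined data attached to
    # every vertex and every edge (names are illustrative, not GraphLab's API).
    class DataGraph:
        def __init__(self):
            self.vertex_data = {}     # vertex id -> arbitrary data
            self.edge_data = {}       # (u, v) -> arbitrary data
            self.neighbors = {}       # vertex id -> set of adjacent vertex ids

        def add_vertex(self, v, data):
            self.vertex_data[v] = data
            self.neighbors.setdefault(v, set())

        def add_edge(self, u, v, data):
            self.edge_data[(u, v)] = data
            self.neighbors.setdefault(u, set()).add(v)
            self.neighbors.setdefault(v, set()).add(u)

    # The social-network example from the slide.
    g = DataGraph()
    g.add_vertex("alice", {"profile": "...", "interests": {"hockey": 0.7}})
    g.add_vertex("bob",   {"profile": "...", "interests": {"scuba": 0.9}})
    g.add_edge("alice", "bob", {"relationship": "friend"})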

Distributed Graph. Partition the graph across multiple machines by cutting along edges; the cut edges span the gaps between partitions, turning what were local accesses into remote ones.

Distributed Graph. Each machine keeps "ghost" vertices: local replicas of remote neighbors that maintain the adjacency structure and replicate the remote data.

Distributed Graph. The graph can be cut efficiently using HPC graph-partitioning tools (ParMetis, Scotch, ...); when that is too difficult, a random cut is used instead, again with "ghost" vertices.

The GraphLab Framework: graph-based data representation, update functions (user computation), consistency model. Given this data graph, what can we do with it? Update functions.

Update Functions. An update function is a user-defined program applied to a vertex; it transforms the data in the scope of that vertex (the vertex itself, its adjacent edges, and its neighboring vertices). Update functions are applied asynchronously, in parallel, until convergence, and many schedulers are available to prioritize computation. For example:

    Pagerank(scope) {
      // Update the current vertex data (a weighted sum over the neighbors)
      // Reschedule neighbors if needed (dynamic computation: neighbors need
      // to be recomputed only if this vertex changed)
      if vertex.PageRank changes then reschedule_all_neighbors;
    }
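The following self-contained Python sketch shows the same idea end to end: an update function that recomputes a vertex and reschedules its neighbors only when its value changed noticeably. The dict-based graph and the constants are illustrative; this is not the real GraphLab API.

    from collections import deque

    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # toy out-link lists
    rank = {v: 1.0 for v in graph}
    DAMPING, TOL = 0.85, 1e-4

    def pagerank_update(v, schedule):
        old = rank[v]
        # Weighted sum over the in-neighbors (vertices that link to v).
        rank[v] = (1 - DAMPING) + DAMPING * sum(
            rank[u] / len(graph[u]) for u in graph if v in graph[u])
        # Dynamic computation: reschedule the out-neighbors only if v changed
        # enough that their values need to be recomputed.
        if abs(rank[v] - old) > TOL:
            for u in graph[v]:
                schedule.append(u)

    schedule = deque(graph)            # initially every vertex is scheduled
    while schedule:
        pagerank_update(schedule.popleft(), schedule)
    print(rank)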

Shared-Memory Dynamic Schedule. Dynamic asynchronous execution is implemented in shared memory as follows: each task is an update function applied to a vertex, the scheduler is a parallel data structure holding the scheduled vertices, and each CPU repeatedly pulls a task out and runs it; the process repeats until the scheduler is empty.
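A hedged sketch of that shared-memory loop, reusing the toy PageRank from above: the scheduler is a thread-safe queue of vertices, worker threads play the role of the CPUs, and a single coarse lock stands in for the consistency machinery introduced later.

    import threading, queue

    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    rank = {v: 1.0 for v in graph}
    state_lock = threading.Lock()         # coarse stand-in for the consistency model
    scheduler = queue.Queue()             # the parallel scheduler data structure
    for v in graph:
        scheduler.put(v)

    def worker():
        while True:
            try:
                v = scheduler.get(timeout=0.1)   # empty scheduler -> worker exits
            except queue.Empty:
                return
            with state_lock:
                old = rank[v]
                rank[v] = 0.15 + 0.85 * sum(
                    rank[u] / len(graph[u]) for u in graph if v in graph[u])
                changed = abs(rank[v] - old) > 1e-4
            if changed:                           # dynamic rescheduling
                for u in graph[v]:
                    scheduler.put(u)

    workers = [threading.Thread(target=worker) for _ in range(2)]   # two "CPUs"
    for t in workers: t.start()
    for t in workers: t.join()
    print(rank)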

Distributed Scheduling. Each machine maintains a schedule over the vertices it owns and does the same as in the shared-memory setting: pull a task out and run it. Distributed consensus is used to identify completion, i.e. when all schedulers are empty.

Ensuring Race-Free Code. In such an asynchronous, dynamic computation environment, where update functions access overlapping scopes, we must prevent race conditions. The key question: how much can computation overlap?

The GraphLab Framework: graph-based data representation, update functions (user computation), consistency model.

PageRank Revisited. Pagerank(scope) { … } Let us revisit this core piece of computation and see how it behaves if we allow it to race.

Racing PageRank. This was actually encountered in user code. Plotting the error to ground truth over time: if we run consistently, PageRank converges just fine; if we don't, zooming in shows that it fluctuates near convergence and never quite gets there. PageRank is provably convergent even under inconsistency, yet here it does not converge. Why?

Bugs. Pagerank(scope) { … } Take a look at the code again: can we see the problem? The vertex value fluctuates while neighbors read it, resulting in the propagation of bad, partially-computed values.

Bugs. Pagerank(scope) { … tmp … } The fix: store the intermediate result in a temporary and modify the vertex only once, at the end. This illustrates the first problem with giving up consistency: the programmer has to be constantly aware of it and carefully code around it.
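A hedged sketch of that pattern (the dict-based arguments are illustrative, not the user code from the talk): the racy version exposes a partially-computed rank to concurrent readers, while the fixed version accumulates into a local temporary and writes the vertex exactly once.

    # rank maps vertex -> PageRank; in_nbrs[v] lists vertices linking to v;
    # out_deg[u] is u's out-degree. All names here are illustrative.

    def racy_update(v, rank, in_nbrs, out_deg):
        rank[v] = 0.15                              # partial value is already visible
        for u in in_nbrs[v]:
            rank[v] += 0.85 * rank[u] / out_deg[u]  # neighbors may read mid-loop

    def buffered_update(v, rank, in_nbrs, out_deg):
        tmp = 0.15                                  # accumulate in a local temporary
        for u in in_nbrs[v]:
            tmp += 0.85 * rank[u] / out_deg[u]
        rank[v] = tmp                               # the vertex is written exactly once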

Throughput != Performance. It has been said that ML algorithms are robust to inconsistency: just ignore it and it will still work out. Allowing computation to race does give higher throughput (#updates/sec), but higher throughput does not mean finishing faster; no consistency can mean slower convergence of the ML algorithm.

Serializability. GraphLab defines race-free operation using the notion of serializability: for every parallel execution, there exists a sequential (single-CPU) execution of the update functions that produces the same result.

Serializability Example: Edge Consistency. Under edge consistency, an update function may write its own vertex and adjacent edges but only read its adjacent vertices, so the regions where concurrent updates overlap are only read; update functions one vertex apart can therefore be run in parallel. Stronger and weaker consistency levels are also available, and the user-tunable consistency level trades off parallelism against consistency (an expert-user feature; here we focus on edge consistency).

Distributed Consistency. Two solutions: (1) graph coloring, and (2) distributed locking.

Edge Consistency via Graph Coloring. Vertices of the same color are all at least one vertex apart; therefore, all vertices of the same color can be run in parallel!

Chromatic Distributed Engine. The engine is simple: every machine executes tasks on all of its vertices of color 0, followed by ghost synchronization and a completion barrier; then all vertices of color 1, another ghost synchronization and barrier; and so on through the colors.
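A minimal Python sketch of both ideas, assuming a simple greedy coloring (the slides do not fix a particular coloring algorithm, so that choice is an assumption): color the graph, group vertices by color, then process one color at a time with a barrier between colors.

    from collections import defaultdict

    adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}

    def greedy_color(adj):
        color = {}
        for v in sorted(adj, key=lambda v: -len(adj[v])):      # high degree first
            used = {color[u] for u in adj[v] if u in color}
            color[v] = next(c for c in range(len(adj)) if c not in used)
        return color

    by_color = defaultdict(list)
    for v, c in greedy_color(adj).items():
        by_color[c].append(v)

    # Chromatic engine skeleton: same-colored vertices are mutually non-adjacent,
    # so their edge-consistent updates could all run in parallel here.
    for c in sorted(by_color):
        print("color", c, "->", by_color[c])
        # ...ghost synchronization + completion barrier before the next color...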

Matrix Factorization: Netflix Collaborative Filtering. Alternating least squares factors the sparse user-movie rating matrix into rank-d user and movie factors. The model is a bipartite users-movies graph with 0.5 million nodes and 99 million edges, and being bipartite it is easy to color.
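For reference, a generic alternating-least-squares sketch on a toy ratings matrix; this is the textbook algorithm with invented data, not GraphLab's vertex-program formulation.

    import numpy as np

    np.random.seed(0)
    R = np.array([[5.0, 3.0, 0.0],      # toy user-movie ratings, 0 = unobserved
                  [4.0, 0.0, 1.0],
                  [0.0, 2.0, 5.0]])
    observed = R > 0
    d, lam = 2, 0.1                      # latent dimension and regularization
    U = np.random.rand(R.shape[0], d)    # user factors
    M = np.random.rand(R.shape[1], d)    # movie factors

    for _ in range(20):
        # Fix movies, solve a small least-squares problem per user; then swap.
        for i in range(R.shape[0]):
            Mi = M[observed[i]]
            U[i] = np.linalg.solve(Mi.T @ Mi + lam * np.eye(d),
                                   Mi.T @ R[i, observed[i]])
        for j in range(R.shape[1]):
            Uj = U[observed[:, j]]
            M[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(d),
                                   Uj.T @ R[observed[:, j], j])

    print(np.round(U @ M.T, 2))          # reconstructed ratings matrix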

Netflix Collaborative Filtering: scaling with the number of machines. GraphLab achieves good performance as d varies, roughly two orders of magnitude faster than Hadoop; an MPI implementation is shown for comparison.

The Cost of Hadoop. A less-discussed aspect: cost. Time is money, and the wrong abstraction costs money. Comparing runtime and dollar cost across EC2 cluster sizes (on logarithmic axes), Hadoop costs roughly 100x more and takes roughly 100x longer.

CoEM (Rosie Jones, 2005): a named entity recognition task. Answer questions such as "Is 'cat' an animal?" or "Is 'Istanbul' a place?" by constructing a bipartite graph from a corpus, with noun phrases (the cat, Australia, Istanbul) on one side and contexts (<X> ran quickly, travelled to <X>, <X> is pleasant) on the other: 2 million vertices and 200 million edges. Results: Hadoop, 95 cores, 7.5 hours; shared-memory GraphLab, 16 cores, 30 minutes; Distributed GraphLab, 32 EC2 nodes, 80 seconds, about 0.3% of the Hadoop time.

Problems with the chromatic engine. (1) It requires a graph coloring to be available: for complex graphs the user cannot provide one, and we cannot use the chromatic engine itself to compute one. (2) The frequent barriers make it extremely inefficient for highly dynamic computations where only a small number of vertices are active in each round.

Distributed Consistency. Thus we have a second engine implementation, built around distributed locking (solution 2) rather than graph coloring (solution 1).

Distributed Locking. Edge consistency can be guaranteed through locking: we associate a reader-writer (RW) lock with each vertex.

Consistency Through Locking. Acquire a write-lock on the center vertex and read-locks on the adjacent vertices, taking care to acquire the locks in a canonical ordering; a read-lock may be acquired more than once, and the other consistency levels are minor variations of this scheme.
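A hedged Python sketch of the idea: plain mutexes stand in for the per-vertex reader-writer locks of the slide, and the point being illustrated is the canonical (sorted) acquisition order, which prevents deadlock when neighboring scopes overlap.

    import threading

    adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
    vertex_lock = {v: threading.Lock() for v in adj}   # RW locks in real GraphLab

    def run_with_edge_consistency(center, update):
        # Write-lock the center, read-lock the neighbors; acquire in a canonical
        # (sorted) order so concurrent overlapping scopes cannot deadlock.
        scope = sorted({center} | adj[center])
        for v in scope:
            vertex_lock[v].acquire()
        try:
            update(center)
        finally:
            for v in reversed(scope):
                vertex_lock[v].release()

    run_with_edge_consistency("a", lambda v: print("updating", v, "consistently"))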

Consistency Through Locking. In the multicore setting this is easy: pthread RW-locks. In the distributed setting we need distributed locks, and the challenge is latency, which becomes the limiting factor. The solution: pipelining.

No Pipelining. Visualizing the timeline: acquiring a remote lock means sending a message and waiting for the reply, and only once all of a scope's locks are acquired can the machine run the update function and then release the scope. Scopes are processed one at a time, paying the full round-trip latency each time.

Pipelining / Latency Hiding. Hide the latency using pipelining: keep a large number (around 10K) of lock requests in flight simultaneously and run each update function as soon as its locks are ready; the latency of any individual acquisition is hidden behind the others.
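The effect is easy to simulate; in the hedged sketch below, a sleep stands in for the lock round-trip latency, and the only difference between the two loops is how many requests are in flight at once (all numbers are made up for illustration).

    import time
    from concurrent.futures import ThreadPoolExecutor, as_completed

    LATENCY = 0.05                        # pretend remote-lock round trip (seconds)

    def acquire_scope(v):
        time.sleep(LATENCY)               # "send a message and wait for the locks"
        return v

    def update_function(v):
        pass                              # the real work would happen here

    vertices = list(range(100))

    start = time.time()
    for v in vertices:                    # no pipelining: one request at a time
        update_function(acquire_scope(v))
    print("sequential:", round(time.time() - start, 2), "s")

    start = time.time()
    with ThreadPoolExecutor(max_workers=32) as pool:     # ~32 requests in flight
        futures = [pool.submit(acquire_scope, v) for v in vertices]
        for fut in as_completed(futures):
            update_function(fut.result())                # run as soon as locks arrive
    print("pipelined: ", round(time.time() - start, 2), "s")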

Latency Hiding. Hiding latency with request buffering pays off: for residual belief propagation on a small problem, a graph with 190K vertices and 560K edges on 4 machines, runtime drops from 472 s without pipelining to 10 s with pipelining, a 47x speedup.

Video Cosegmentation. To evaluate the locking engine: a probabilistic inference task over 1740 video frames, identifying segments that mean the same thing across frames. The model has 10.5 million nodes and 31 million edges.

Video Cosegmentation Speedups. The task is highly dynamic and stresses the locking engine, yet GraphLab's speedup stays close to ideal as the number of machines grows; we do pretty well.

The GraphLab Framework. This covers the core framework: how data is distributed (graph-based data representation), how computation is performed (update functions), and how consistency is maintained (consistency model). However, one question remains.

What if machines fail? How do we provide fault tolerance?

Checkpoint: 1. Stop the world. 2. Write state to disk.

Snapshot Performance. To evaluate the behavior, plot progress over time: without snapshots, progress is linear; with synchronous snapshots, progress stalls during each snapshot. Now introduce one slow machine into the system: because we have to stop the world, one slow machine slows everything down!

How can we do better? Take advantage of consistency.

Checkpointing: Fine-Grained Chandy-Lamport. Instead of snapshotting whole machines, think extremely fine-grained: we have a graph, so map the snapshot onto the graph. Did that granularity make the problem harder? No: given edge consistency, it is extremely easy in GraphLab; asynchronous checkpointing can be done entirely within GraphLab, easily implemented as an update function!
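A hedged, sequential sketch of the idea (GraphLab runs it asynchronously under edge consistency; the dict-based graph and snapshot layout here are illustrative): the snapshot update function saves its own vertex data, saves the edges to not-yet-snapshotted neighbors, and schedules those neighbors so the snapshot spreads across the graph.

    from collections import deque

    adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
    vertex_data = {"a": 1.0, "b": 2.0, "c": 3.0}
    edge_data = {frozenset(e): 0.5 for e in [("a", "b"), ("a", "c")]}

    snapshot = {"vertices": {}, "edges": {}}
    snapshotted = set()

    def snapshot_update(v, schedule):
        if v in snapshotted:
            return
        snapshot["vertices"][v] = vertex_data[v]          # save my own data
        for u in adj[v]:
            if u not in snapshotted:                      # save edges to un-snapshotted
                snapshot["edges"][frozenset((v, u))] = edge_data[frozenset((v, u))]
                schedule.append(u)                        # ...and propagate the snapshot
        snapshotted.add(v)

    schedule = deque(["a"])                               # initiate at any vertex
    while schedule:
        snapshot_update(schedule.popleft(), schedule)
    print(snapshot)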

Async. Snapshot Performance. Looking at the behavior of the asynchronous snapshot procedure: instead of a flat line during each snapshot we get only a small dip in progress, and no penalty is incurred by the slow machine!

Summary. Extended the GraphLab abstraction to distributed systems. Two different methods of achieving consistency: graph coloring, and distributed locking with pipelining. Efficient implementations, plus asynchronous fault tolerance with fine-grained Chandy-Lamport snapshots. The result: performance, efficiency, and scalability as a high-level abstraction, without sacrificing usability.