A New Parallel Framework for Machine Learning. Carnegie Mellon. Joseph Gonzalez, joint work with Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Guy Blelloch, Joe Hellerstein, David O’Hallaron, and Alex Smola.

In ML we face BIG problems: 48 hours of YouTube video uploaded every minute, 24 million Wikipedia pages, 750 million Facebook users, 6 billion Flickr photos.

Parallelism: Hope for the Future. Wide array of different parallel architectures: GPUs, multicore, clusters, mini clouds, clouds. New challenges for designing machine learning algorithms: race conditions and deadlocks; managing distributed model state. New challenges for implementing machine learning algorithms: parallel debugging and profiling; hardware-specific APIs.

How will we design and implement parallel learning systems?

We could use… Threads, Locks, & Messages: the “low level parallel primitives”.

Threads, Locks, and Messages. ML experts (graduate students) repeatedly solve the same parallel design challenges: implement and debug a complex parallel system, then tune it for a specific parallel platform. Two months later the conference paper contains: “We implemented ______ in parallel.” The resulting code is difficult to maintain, is difficult to extend, and couples the learning model to the parallel implementation.

...a better answer: build learning algorithms on top of high-level parallel abstractions such as Map-Reduce / Hadoop.

MapReduce – Map Phase. Embarrassingly parallel, independent computation on each CPU; no communication needed.

MapReduce – Map Phase. Example: each CPU independently extracts image features from its share of the images.

MapReduce – Reduce Phase. Example: the extracted image features are aggregated into attractive-face and ugly-face statistics, with each CPU reducing one group.
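
The pattern above, sketched as ordinary single-process C++ rather than actual Hadoop code (the Image type, the "brightness" feature, and the class labels are made-up stand-ins for the face-statistics example):

    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Toy stand-in types: each "image" carries a class label and some pixel values.
    struct Image { std::string label; std::vector<double> pixels; };

    // Map phase: embarrassingly parallel, per-image feature extraction, no communication.
    std::pair<std::string, double> map_image(const Image& img) {
        double sum = 0.0;
        for (double p : img.pixels) sum += p;
        double brightness = img.pixels.empty() ? 0.0 : sum / img.pixels.size();
        return {img.label, brightness};
    }

    int main() {
        std::vector<Image> images = {
            {"attractive", {0.9, 0.8, 0.7}},
            {"ugly",       {0.2, 0.1, 0.3}},
            {"attractive", {0.6, 0.5, 0.7}},
        };
        // Reduce phase: group the mapped (label, feature) pairs by key and aggregate
        // them into per-class statistics (here, the mean feature value).
        std::map<std::string, std::pair<double, int>> acc;  // label -> (sum, count)
        for (const Image& img : images) {
            std::pair<std::string, double> kv = map_image(img);
            acc[kv.first].first  += kv.second;
            acc[kv.first].second += 1;
        }
        for (const auto& entry : acc)
            std::cout << entry.first << " face statistics: "
                      << entry.second.first / entry.second.second << "\n";
    }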

Map-Reduce for Data-Parallel ML: excellent for large data-parallel tasks! Data-Parallel (Map Reduce): cross validation, feature extraction, computing sufficient statistics. Graph-Parallel: belief propagation, label propagation, kernel methods, deep belief networks, neural networks, tensor factorization, PageRank, lasso. Is there more to machine learning?

Concrete Example: Label Propagation

Label Propagation Algorithm. Social arithmetic: my interests are a weighted average of my own profile and my friends' interests, with weights 50% on what I list on my profile, 40% on what Sue Ann likes, and 10% on what Carlos likes. My profile: 50% cameras, 50% biking. Sue Ann: 80% cameras, 20% biking. Carlos: 30% cameras, 70% biking. Result, I like: 60% cameras, 40% biking. Recurrence algorithm: iterate until convergence. Parallelism: compute all Likes[i] in parallel.
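
The recurrence the slide describes, written out (one plausible formalization; the self-weight on a user's own profile is folded into the weights W):

\[
\text{Likes}[i] \;=\; \sum_{j \in \{i\}\,\cup\,\text{Friends}[i]} W_{ij}\,\text{Likes}[j], \qquad \sum_{j} W_{ij} = 1
\]

Plugging in the numbers above:

\[
\text{Likes}[\text{me}] = 0.5\,(0.5,\,0.5) + 0.4\,(0.8,\,0.2) + 0.1\,(0.3,\,0.7) = (0.6,\,0.4),
\]

i.e. 60% cameras and 40% biking, matching the result on the slide.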

Properties of Graph-Parallel Algorithms: a dependency graph, factored computation, and iterative computation (what I like depends on what my friends like).

Map-Reduce for Data-Parallel ML: excellent for large data-parallel tasks! Data-Parallel (Map Reduce): cross validation, feature extraction, computing sufficient statistics. Graph-Parallel: belief propagation, label propagation, kernel methods, deep belief networks, neural networks, tensor factorization, PageRank, lasso. Map Reduce for these too?

Why not use Map-Reduce for Graph Parallel Algorithms?

Data Dependencies. Map-Reduce does not efficiently express dependent data: it assumes independent data rows, so the user must code substantial data transformations, incurring costly data replication.

Iterative Algorithms. Map-Reduce does not efficiently express iterative algorithms: every iteration re-reads the data, runs across all CPUs, and ends at a barrier, so a single slow processor stalls each iteration.

MapAbuse: Iterative MapReduce. Only a subset of the data needs computation in each iteration, yet every iteration still processes all of it and synchronizes at a barrier.

MapAbuse: Iterative MapReduce. The system is not optimized for iteration: every iteration pays a startup penalty and a disk penalty to re-read and re-write the data.

Map-Reduce for Data-Parallel ML: excellent for large data-parallel tasks! Data-Parallel (Map Reduce): cross validation, feature extraction, computing sufficient statistics. Graph-Parallel: belief propagation, SVM, kernel methods, deep belief networks, neural networks, tensor factorization, PageRank, lasso. Map Reduce? Pregel (Giraph)?

Pregel (Giraph). Bulk Synchronous Parallel model: compute, communicate, barrier; repeat.

Problem with Bulk Synchronous. Example algorithm: if a neighbor is red, then turn red. Bulk synchronous computation evaluates the condition on all vertices in every phase: 4 phases, each with 9 computations, gives 36 computations. Asynchronous (wave-front) computation evaluates the condition only when a neighbor changes: 4 phases, each with 2 computations, gives 8 computations.

Problem: bulk synchronous computation can be highly inefficient. Example: Loopy Belief Propagation.

Loopy Belief Propagation (Loopy BP). Iteratively estimate the “beliefs” about vertices: read incoming messages, update the marginal estimate (belief), and send updated outgoing messages. Repeat for all variables until convergence.
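
For reference, the standard loopy BP updates behind those three steps (textbook sum-product form with pairwise potentials ψ; this is not GraphLab-specific notation):

\[
m_{i \to j}(x_j) \;\propto\; \sum_{x_i} \psi_{ij}(x_i, x_j)\,\psi_i(x_i) \prod_{k \in N(i)\setminus\{j\}} m_{k \to i}(x_i),
\qquad
b_i(x_i) \;\propto\; \psi_i(x_i) \prod_{k \in N(i)} m_{k \to i}(x_i)
\]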

Bulk Synchronous Loopy BP. Often considered embarrassingly parallel: associate a processor with each vertex, receive all messages, update all beliefs, send all messages. Proposed by Brunton et al. CRV’06, Mendiburu et al. GECC’07, Kang et al. LDMTA’10, and others.

Sequential Computational Structure

Hidden Sequential Structure

Hidden Sequential Structure. With evidence only at the ends of the chain, running time = (time for a single parallel iteration) × (number of iterations).

Optimal Sequential Algorithm. Running time on a chain of length n: bulk synchronous takes 2n²/p with p ≤ 2n processors; the forward-backward algorithm takes 2n with p = 1; the optimal parallel algorithm takes n with p = 2. There is a large gap between the bulk synchronous and optimal running times.
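
One way to see where these numbers come from, for a chain of n vertices (assuming every bulk synchronous iteration recomputes all 2n directed messages and that information must propagate across the whole chain):

\[
T_{\text{bulk-sync}} \;=\; \underbrace{n}_{\text{iterations}} \times \underbrace{\tfrac{2n}{p}}_{\text{messages per iteration, per processor}} \;=\; \frac{2n^2}{p} \quad (p \le 2n),
\qquad
T_{\text{fwd-bwd}} = 2n \;\; (p = 1),
\qquad
T_{\text{opt}} = n \;\; (p = 2)
\]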

The Splash Operation. Generalize the optimal chain algorithm to arbitrary cyclic graphs: 1) grow a BFS spanning tree with fixed size; 2) forward pass computing all messages at each vertex; 3) backward pass computing all messages at each vertex.
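
A schematic sketch of those three steps in C++ (illustrative only, not the Splash BP source; send_messages stands in for the per-vertex BP message computation):

    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <queue>
    #include <vector>

    // Grow a bounded BFS tree rooted at `root`, then sweep a message update over it
    // leaf-to-root and root-to-leaf.
    void splash(const std::vector<std::vector<int>>& adj, int root, std::size_t max_size,
                const std::function<void(int)>& send_messages) {
        // 1) Grow a BFS spanning tree of fixed (bounded) size.
        std::vector<int> tree;
        std::vector<char> visited(adj.size(), 0);
        std::queue<int> frontier;
        frontier.push(root); visited[root] = 1;
        while (!frontier.empty() && tree.size() < max_size) {
            int v = frontier.front(); frontier.pop();
            tree.push_back(v);
            for (int u : adj[v]) if (!visited[u]) { visited[u] = 1; frontier.push(u); }
        }
        // 2) Forward pass: update messages from the leaves in toward the root.
        for (auto it = tree.rbegin(); it != tree.rend(); ++it) send_messages(*it);
        // 3) Backward pass: update messages from the root back out to the leaves.
        for (int v : tree) send_messages(v);
    }

    int main() {
        std::vector<std::vector<int>> adj = {{1}, {0, 2}, {1, 3}, {2}};  // a 4-vertex chain
        splash(adj, 0, 3, [](int v) { std::cout << "update messages at vertex " << v << "\n"; });
    }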

Data-Parallel Algorithms can be Inefficient. The limitations of the Map-Reduce abstraction can lead to inefficient parallel algorithms. (Chart: optimized in-memory bulk synchronous BP vs. asynchronous Splash BP.)

The Need for a New Abstraction. Map-Reduce is not well suited for graph-parallelism. Data-Parallel (Map Reduce): cross validation, feature extraction, computing sufficient statistics. Graph-Parallel (Pregel / Giraph): belief propagation, SVM, kernel methods, deep belief networks, neural networks, tensor factorization, PageRank, lasso.

What is GraphLab?

The GraphLab Framework: Graph Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model.

Data Graph. A graph with arbitrary data (C++ objects) associated with each vertex and edge. Example, a social network graph: vertex data holds the user profile text and the current interest estimates; edge data holds similarity weights.

Implementing the Data Graph. Multicore setting (in memory): relatively straightforward API: vertex_data(vid) → data; edge_data(vid, vid) → data; neighbors(vid) → vid_list. Challenge: fast lookup with low overhead. Solution: dense data structures, fixed Vdata and Edata types, immutable graph structure. Cluster setting (in memory): partition the graph with ParMETIS or random cuts, and use cached ghosting of boundary vertices across nodes.
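
A minimal sketch of what the dense multicore representation could look like: CSR-style adjacency, fixed vertex/edge data types, immutable structure. The names are illustrative, not the GraphLab API:

    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    // Fixed vertex/edge data types for the social-network example allow dense storage.
    struct VertexData { std::string profile_text; std::vector<double> interests; };
    struct EdgeData   { double similarity; };

    // Dense data graph with an immutable CSR structure for fast, low-overhead lookup.
    class DataGraph {
    public:
        DataGraph(std::vector<VertexData> vdata,
                  std::vector<std::vector<std::pair<std::uint32_t, EdgeData>>> adj)
            : vdata_(std::move(vdata)) {
            offsets_.push_back(0);
            for (auto& nbrs : adj) {
                for (auto& [nbr, ed] : nbrs) { nbr_ids_.push_back(nbr); edata_.push_back(ed); }
                offsets_.push_back(static_cast<std::uint32_t>(nbr_ids_.size()));
            }
        }
        VertexData& vertex_data(std::uint32_t vid) { return vdata_[vid]; }
        // edge_data(u, v): linear scan of u's (typically short) neighbor list.
        EdgeData* edge_data(std::uint32_t u, std::uint32_t v) {
            for (std::uint32_t i = offsets_[u]; i < offsets_[u + 1]; ++i)
                if (nbr_ids_[i] == v) return &edata_[i];
            return nullptr;  // no such edge
        }
        // neighbors(vid): contiguous slice of neighbor ids.
        std::pair<const std::uint32_t*, const std::uint32_t*> neighbors(std::uint32_t vid) const {
            return {nbr_ids_.data() + offsets_[vid], nbr_ids_.data() + offsets_[vid + 1]};
        }
    private:
        std::vector<VertexData>    vdata_;
        std::vector<std::uint32_t> offsets_, nbr_ids_;  // CSR structure, immutable after build
        std::vector<EdgeData>      edata_;
    };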

The GraphLab Framework: Graph Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model.

Update Functions. An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex. For label propagation:

    label_prop(i, scope) {
      // Get neighborhood data
      (Likes[i], W[i,j], Likes[j]) ← scope;
      // Update the vertex data
      Likes[i] ← sum over neighbors j of W[i,j] * Likes[j];
      // Reschedule neighbors if needed
      if Likes[i] changes then
        reschedule_neighbors_of(i);
    }
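
A concrete, runnable rendering of this update function in plain C++ (not the GraphLab API; the graph is stored as adjacency lists with per-edge weights, and rescheduling is modeled with a simple work queue):

    #include <cmath>
    #include <deque>
    #include <iostream>
    #include <utility>
    #include <vector>

    // likes[i] is vertex i's interest vector; w[i] holds (neighbor, weight) pairs whose
    // weights sum to 1, with the self-weight on the profile included.
    using Interests = std::vector<double>;

    bool label_prop_update(int i,
                           std::vector<Interests>& likes,
                           const std::vector<std::vector<std::pair<int, double>>>& w,
                           double tol = 1e-3) {
        Interests updated(likes[i].size(), 0.0);
        for (auto [j, wij] : w[i])                     // gather neighborhood data from the scope
            for (std::size_t k = 0; k < updated.size(); ++k)
                updated[k] += wij * likes[j][k];
        double change = 0.0;                           // did the vertex data change?
        for (std::size_t k = 0; k < updated.size(); ++k)
            change += std::fabs(updated[k] - likes[i][k]);
        likes[i] = updated;
        return change > tol;                           // caller reschedules neighbors if true
    }

    int main() {
        std::vector<Interests> likes = {{0.5, 0.5}, {0.8, 0.2}, {0.3, 0.7}};  // me, Sue Ann, Carlos
        std::vector<std::vector<std::pair<int, double>>> w = {
            {{0, 0.5}, {1, 0.4}, {2, 0.1}}, {{1, 1.0}}, {{2, 1.0}}};
        std::deque<int> scheduler = {0};
        while (!scheduler.empty()) {
            int v = scheduler.front(); scheduler.pop_front();
            if (label_prop_update(v, likes, w))        // reschedule neighbors if vertex changed
                for (const auto& e : w[v]) if (e.first != v) scheduler.push_back(e.first);
        }
        std::cout << "Likes[me]: " << likes[0][0] << " cameras, " << likes[0][1] << " biking\n";
    }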

The GraphLab Framework: Graph Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model.

The Scheduler. The scheduler determines the order in which vertices are updated: CPUs repeatedly pull the next scheduled vertex and apply its update function, and the process repeats until the scheduler is empty.
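
A minimal sketch of this execution model (illustrative, not the GraphLab scheduler code): a thread-safe FIFO queue of vertex ids that worker threads drain until it is empty.

    #include <deque>
    #include <iostream>
    #include <mutex>
    #include <optional>
    #include <thread>
    #include <vector>

    class FifoScheduler {
    public:
        void push(int v) { std::lock_guard<std::mutex> g(m_); q_.push_back(v); }
        std::optional<int> pop() {
            std::lock_guard<std::mutex> g(m_);
            if (q_.empty()) return std::nullopt;
            int v = q_.front(); q_.pop_front(); return v;
        }
    private:
        std::mutex m_;
        std::deque<int> q_;
    };

    int main() {
        FifoScheduler sched;
        for (int v = 0; v < 8; ++v) sched.push(v);   // initially schedule every vertex

        auto update = [](int v) { (void)v; /* apply the user update function to v's scope */ };
        auto worker = [&] {                          // one "CPU"
            while (auto v = sched.pop()) update(*v); // repeat until the scheduler is empty
        };
        std::thread t1(worker), t2(worker);
        t1.join(); t2.join();
        std::cout << "scheduler empty, done\n";
    }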

Choosing a Schedule. GraphLab provides several different schedulers. Round robin: vertices are updated in a fixed order. FIFO: vertices are updated in the order they are added. Priority: vertices are updated in priority order. The choice of schedule affects the correctness and parallel performance of the algorithm. Obtain different algorithms by simply changing a flag: --scheduler=roundrobin, --scheduler=fifo, --scheduler=priority.

Implementing the Schedulers. Multicore setting: challenging! Fine-grained locking and atomic operations; FIFO/priority order is approximated via random placement into per-CPU queues and work stealing. Cluster setting: a multicore scheduler runs on each node and schedules only “local” vertices; update functions for remote vertices are exchanged between nodes.

The GraphLab Framework: Graph Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model.

Ensuring Race-Free Code How much can computation overlap?

Importance of Consistency. Many algorithms require strict consistency, or perform significantly better under strict consistency. Example: Alternating Least Squares.

Importance of Consistency. Machine learning algorithms require “model debugging”: build, test, debug, tweak the model, and repeat.

GraphLab Ensures Sequential Consistency. For each parallel execution, there exists a sequential execution of update functions which produces the same result.

Common Problem: Write-Write Race. Processors running adjacent update functions simultaneously modify shared data: both CPUs write, but only one value survives as the final value.

Consistency Rules. Guaranteed sequential consistency for all update functions.

Full Consistency

Obtaining More Parallelism

Edge Consistency. Under the edge consistency model, the scopes of two CPUs may overlap only at data that is being read, not written, so the overlapping access is a safe read and more update functions can run in parallel.

Consistency Through R/W Locks. Read/write locks acquired in a canonical lock ordering: full consistency takes write locks on the center vertex and all of its neighbors; edge consistency takes a write lock on the center vertex and read locks on its neighbors.

Consistency Through R/W Locks (continued). Multicore setting: pthread read/write locks. Distributed setting: distributed locking over the partitioned data graph; prefetch locks and data through a lock pipeline so computation can proceed while locks and data are being requested.
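
A sketch of the locking scheme (illustrative, not GraphLab source): one reader/writer lock per vertex, always acquired in canonical ascending-id order to avoid deadlock, with the edge consistency pattern of a write lock on the center vertex and read locks on its neighbors.

    #include <algorithm>
    #include <shared_mutex>
    #include <vector>

    // Assumes one std::shared_mutex per vertex and no self-edges in nbrs.
    void lock_scope_edge_consistency(int center,
                                     std::vector<int> nbrs,
                                     std::vector<std::shared_mutex>& locks) {
        nbrs.push_back(center);
        std::sort(nbrs.begin(), nbrs.end());        // canonical lock ordering prevents deadlock
        for (int v : nbrs) {
            if (v == center) locks[v].lock();        // write lock on the vertex being updated
            else             locks[v].lock_shared(); // read locks on adjacent vertices
        }
        // Full consistency would instead take locks[v].lock() on the neighbors as well.
    }

    void unlock_scope_edge_consistency(int center,
                                       const std::vector<int>& nbrs,
                                       std::vector<std::shared_mutex>& locks) {
        for (int v : nbrs) locks[v].unlock_shared();
        locks[center].unlock();
    }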

Consistency Through Scheduling. Edge consistency model: two vertices can be updated simultaneously if they do not share an edge. Graph coloring: two vertices can be assigned the same color if they do not share an edge. Execute all vertices of one color, then a barrier, then the next color: phase 1, barrier, phase 2, barrier, phase 3, and so on.
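
A small illustration of consistency through scheduling (again a sketch, not GraphLab's code): greedily color the graph, then execute one color class per phase with a barrier between phases, so no two vertices updated in the same phase share an edge.

    #include <algorithm>
    #include <iostream>
    #include <vector>

    // Greedy coloring: give each vertex the smallest color not used by an already-colored neighbor.
    std::vector<int> greedy_color(const std::vector<std::vector<int>>& adj) {
        std::vector<int> color(adj.size(), -1);
        for (std::size_t v = 0; v < adj.size(); ++v) {
            std::vector<char> used(adj.size(), 0);
            for (int u : adj[v]) if (color[u] >= 0) used[color[u]] = 1;
            int c = 0; while (used[c]) ++c;
            color[v] = c;
        }
        return color;
    }

    int main() {
        std::vector<std::vector<int>> adj = {{1, 2}, {0, 2}, {0, 1, 3}, {2}};
        std::vector<int> color = greedy_color(adj);
        int num_colors = 1 + *std::max_element(color.begin(), color.end());
        for (int phase = 0; phase < num_colors; ++phase) {   // one color class per phase
            for (std::size_t v = 0; v < adj.size(); ++v)
                if (color[v] == phase)
                    std::cout << "phase " << phase << ": update vertex " << v << "\n";
            // In a parallel run, a barrier between phases would go here.
        }
    }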

The GraphLab Framework: Graph Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model.

Algorithms Implemented: PageRank, Loopy Belief Propagation, Gibbs Sampling, CoEM, Graphical Model Parameter Learning, Probabilistic Matrix/Tensor Factorization, Alternating Least Squares, Lasso with Sparse Features, Support Vector Machines with Sparse Features, Label Propagation, …

Shared Memory Experiments. Shared memory setting: 16-core workstation.

Loopy Belief Propagation: 3D retinal image denoising. Data graph: 1 million vertices, 3 million edges. Update function: loopy BP update equation. Scheduler: approximate priority. Consistency model: edge consistency.

Loopy Belief Propagation results: SplashBP achieves a 15.5x speedup (speedup plot vs. the optimal linear speedup; higher is better).

CoEM (Rosie Jones, 2005): named entity recognition task. Is “Dog” an animal? Is “Catalina” a place? The graph connects noun phrases (“the dog”, “Australia”, “Catalina Island”) with the contexts in which they occur (“ran quickly”, “travelled to”, “is pleasant”). Vertices: 2 million; edges: 200 million. Hadoop: 95 cores, 7.5 hrs.

CoEM (Rosie Jones, 2005) results: Hadoop: 95 cores, 7.5 hrs; GraphLab: 16 cores, 30 min. 15x faster with 6x fewer CPUs! (Speedup plot vs. optimal.)

Distributed Experiments: Amazon EC2 high-performance nodes.

Video Cosegmentation: identify segments that mean the same thing. Model: 10.5 million nodes, 31 million edges; Gaussian EM clustering + BP on a 3D grid.

Video Coseg. Speedups

Prefetching Data & Locks

Matrix Factorization: Netflix collaborative filtering via alternating least squares matrix factorization. Model: 0.5 million nodes, 99 million edges (bipartite graph of Netflix users and movies, latent dimension d).
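
For reference, the standard ALS objective behind this experiment (textbook formulation; R is the sparse user-by-movie rating matrix, U and V the d-dimensional latent factor matrices):

\[
\min_{U,V} \sum_{(u,m)\,\in\,\text{ratings}} \bigl(R_{um} - U_u^{\top} V_m\bigr)^2 \;+\; \lambda\bigl(\lVert U\rVert_F^2 + \lVert V\rVert_F^2\bigr)
\]

ALS alternates between fixing V and solving a small ridge regression for each user vertex U_u, then fixing U and solving for each movie vertex V_m, which maps directly onto per-vertex update functions on the bipartite user-movie graph.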

Netflix Speedup: increasing the size of the matrix factorization.

The Cost of Hadoop

Summary. An abstraction tailored to machine learning that targets graph-parallel algorithms. It naturally expresses data/computational dependencies and dynamic iterative computation, simplifies parallel algorithm design, automatically ensures data consistency, and achieves state-of-the-art parallel performance on a variety of problems.

Check out GraphLab: documentation… code… tutorials… Questions & feedback welcome.

Current/Future Work: out-of-core storage; Hadoop/HDFS integration (graph construction, graph storage, launching GraphLab from Hadoop, fault tolerance through HDFS checkpoints); sub-scope parallelism to address the challenge of very high degree nodes; improved graph partitioning; support for dynamic graph structure.