Carnegie Mellon. Joseph Gonzalez. Joint work with Yucheng Low, Aapo Kyrola, Danny Bickson, Kanat Tangwongsan, Carlos Guestrin, Guy Blelloch, Joe Hellerstein, David O’Hallaron. A New Parallel Framework for Machine Learning

In ML we face BIG problems: 24 hours of video uploaded to YouTube every minute, 13 million Wikipedia pages, 750 million Facebook users, 3.6 billion Flickr photos.

Parallelism: Hope for the Future. Wide array of different parallel architectures: GPUs, multicore, clusters, mini clouds, clouds. New challenges for designing machine learning algorithms: race conditions and deadlocks, managing distributed model state. New challenges for implementing machine learning algorithms: parallel debugging and profiling, hardware-specific APIs.

Carnegie Mellon Core Question How will we design and implement parallel learning systems?

Carnegie Mellon. We could use… Threads, Locks, & Messages: build each new learning system using low-level parallel primitives.

Threads, Locks, and Messages. ML experts (graduate students) repeatedly solve the same parallel design challenges: implement and debug a complex parallel system, then tune it for a specific parallel platform. Two months later the conference paper contains: “We implemented ______ in parallel.” The resulting code is difficult to maintain, is difficult to extend, and couples the learning model to the parallel implementation.

Carnegie Mellon. A better answer: Map-Reduce / Hadoop. Build learning algorithms on top of high-level parallel abstractions.

MapReduce – Map Phase: embarrassingly parallel, independent computation across CPUs (CPU 1 through CPU 4); no communication needed.

MapReduce – Reduce Phase: fold/aggregation of the mappers’ outputs (CPU 1, CPU 2).

Map-Reduce for Data-Parallel ML: excellent for large data-parallel tasks such as cross validation, feature extraction, and computing sufficient statistics. Is there more to Machine Learning? Graph-parallel tasks: belief propagation, SVMs, kernel methods, deep belief networks, neural networks, tensor factorization, sampling, Lasso.

Carnegie Mellon Concrete Example Label Propagation

Label Propagation Algorithm. Social Arithmetic: my interests are a weighted combination of my own profile and my friends’ interests. Example (Me, Sue Ann, Carlos): 50% what I list on my profile (50% cameras, 50% biking), 40% what Sue Ann likes (80% cameras, 20% biking), 10% what Carlos likes (30% cameras, 70% biking), giving “I Like: 60% cameras, 40% biking”. Recurrence algorithm: iterate until convergence. Parallelism: compute all Likes[i] in parallel.
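Plugging the example numbers into the weighted combination makes the recurrence concrete: Likes[me](cameras) = 0.5 × 50% + 0.4 × 80% + 0.1 × 30% = 25% + 32% + 3% = 60%, and Likes[me](biking) = 0.5 × 50% + 0.4 × 20% + 0.1 × 70% = 40%.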

Properties of Graph-Parallel Algorithms: a dependency graph, iterative computation (what I like depends on what my friends like), and factored computation.

Map-Reduce for Data-Parallel ML: excellent for large data-parallel tasks such as cross validation, feature extraction, and computing sufficient statistics. But can Map-Reduce also handle the graph-parallel tasks: belief propagation, SVMs, kernel methods, deep belief networks, neural networks, tensor factorization, sampling, Lasso?

Carnegie Mellon Why not use Map-Reduce for Graph Parallel Algorithms?

Data Dependencies: Map-Reduce does not efficiently express dependent data; it assumes independent data rows, so the user must code substantial data transformations, leading to costly data replication.

Iterative Algorithms: Map-Reduce does not efficiently express iterative algorithms. Each iteration runs over all the data and ends in a barrier, so a single slow processor holds up every other CPU at that barrier.

MapAbuse: Iterative MapReduce. Only a subset of the data needs computation in each iteration, yet every Map-Reduce iteration still processes all of it before the barrier.

MapAbuse: Iterative MapReduce. The system is not optimized for iteration: every iteration pays a startup penalty and a disk penalty.

Synchronous vs. Asynchronous. Example algorithm: if a neighbor is red, turn red. Synchronous computation (Map-Reduce): evaluate the condition on all vertices in every phase; 4 phases, each with 9 computations, gives 36 computations. Asynchronous computation (wave-front): evaluate the condition only when a neighbor changes; 4 phases, each with 2 computations, gives 8 computations.

Data-Parallel Algorithms can be Inefficient. The limitations of the Map-Reduce abstraction can lead to inefficient parallel algorithms (chart comparing an optimized in-memory MapReduce BP with asynchronous Splash BP).

The Need for a New Abstraction: Map-Reduce is not well suited for graph-parallelism. Data-parallel tasks (cross validation, feature extraction, computing sufficient statistics) fit Map-Reduce; graph-parallel tasks (belief propagation, SVMs, kernel methods, deep belief networks, neural networks, tensor factorization, sampling, Lasso) do not.

Carnegie Mellon What is GraphLab?

The GraphLab Framework: graph-based data representation, update functions (user computation), scheduler, consistency model.

Data Graph: a graph with arbitrary data (C++ objects) associated with each vertex and edge. Example (social network): vertex data holds the user profile text and current interest estimates; edge data holds similarity weights.
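A minimal C++ sketch of this idea, using invented type names (VertexData, EdgeData, DataGraph) rather than GraphLab's actual graph class; the only point it illustrates is that arbitrary user-defined objects ride along on every vertex and edge.

#include <string>
#include <vector>

// Hypothetical vertex data for the social-network example:
// profile text plus the current interest estimates.
struct VertexData {
  std::string profile_text;
  std::vector<double> interest_estimates;   // e.g. {cameras, biking}
};

// Hypothetical edge data: a similarity weight between two users.
struct EdgeData {
  double similarity;
};

struct Edge { int target; EdgeData data; };

// A bare-bones data graph: adjacency lists whose entries carry EdgeData,
// and a parallel array of VertexData. GraphLab's real graph object does far
// more; this only shows "arbitrary data attached to vertices and edges".
struct DataGraph {
  std::vector<VertexData> vertices;
  std::vector<std::vector<Edge>> out_edges;

  int add_vertex(const VertexData& vd) {
    vertices.push_back(vd);
    out_edges.emplace_back();
    return static_cast<int>(vertices.size()) - 1;
  }
  void add_edge(int source, int target, const EdgeData& ed) {
    out_edges[source].push_back({target, ed});
  }
};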

Data Graph: Gibbs Sampling. The graph is a Markov Random Field (MRF): vertices correspond to random variables, edges correspond to dependencies in the MRF. Vertex data: node potential, current assignment, sequence of samples. Edge data: edge potential. Algorithm: for each variable, draw a new assignment given its neighbors’ assignments and the edge potentials.

Data Graph: Lasso. Data matrix X is n × d, weights are d × 1, observations are n × 1 (the figure shows 5 features and 4 examples). Shooting algorithm: sequentially optimize each weight holding all other weights fixed. Vertex data: weights and losses. Edge data: entries of X.
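For reference, a standard form of the shooting (coordinate-descent) update, stated here under the assumption that the objective is (1/2)·||y − Xw||² + λ·||w||₁; the slide does not show the exact scaling, so constants may differ from the implementation:

\[
w_j \;\leftarrow\; \frac{S\!\big(x_j^{\top}(y - Xw + w_j x_j),\; \lambda\big)}{x_j^{\top} x_j},
\qquad
S(\rho, \lambda) = \operatorname{sign}(\rho)\,\max(|\rho| - \lambda,\, 0).
\]

Each weight update reads the losses stored on its adjacent observation vertices, which is why the Lasso graph connects weight vertices to observation vertices through the nonzero entries of X.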

The GraphLab Framework: graph-based data representation, update functions (user computation), scheduler, consistency model.

label_prop(i, scope) {
  // Get neighborhood data
  (Likes[i], W_ij, Likes[j]) ← scope;
  // Update the vertex data
  Likes[i] ← Σ_j W_ij × Likes[j];
  // Reschedule neighbors if needed
  if Likes[i] changes then
    reschedule_neighbors_of(i);
}
Update Functions: an update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.
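A compilable C++ sketch of what the body of that update function might compute for label propagation. All type and function names here are invented for illustration (they are not the GraphLab API): the vertex update is the weighted average from the earlier recurrence, and the caller reschedules neighbors only if the estimate moved by more than a small tolerance.

#include <cmath>
#include <cstddef>
#include <vector>

struct VertexData { std::vector<double> likes; };   // interest estimates
struct InEdge     { const VertexData* neighbor; double weight; };

// Hypothetical label-propagation update: Likes[i] <- sum_j W_ij * Likes[j].
// Returns true if Likes[i] changed enough that neighbors should be rescheduled.
bool label_prop_update(VertexData& vertex, const std::vector<InEdge>& in_edges,
                       double tolerance = 1e-5) {
  std::vector<double> new_likes(vertex.likes.size(), 0.0);
  for (const InEdge& e : in_edges)
    for (std::size_t k = 0; k < new_likes.size(); ++k)
      new_likes[k] += e.weight * e.neighbor->likes[k];

  double change = 0.0;
  for (std::size_t k = 0; k < new_likes.size(); ++k)
    change += std::fabs(new_likes[k] - vertex.likes[k]);

  vertex.likes = new_likes;
  return change > tolerance;   // caller would reschedule_neighbors_of(i)
}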

The GraphLab Framework: graph-based data representation, update functions (user computation), scheduler, consistency model.

The Scheduler: the scheduler determines the order in which vertices are updated. CPUs (CPU 1, CPU 2) repeatedly pull the next scheduled vertex and apply the update function; the process repeats until the scheduler is empty.

Choosing a Schedule. GraphLab provides several different schedulers: Round Robin (vertices are updated in a fixed order), FIFO (vertices are updated in the order they are added), Priority (vertices are updated in priority order). The choice of schedule affects the correctness and parallel performance of the algorithm. Obtain different algorithms by simply changing a flag: --scheduler=roundrobin, --scheduler=fifo, --scheduler=priority.
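A rough C++ illustration of how the FIFO and priority schedules differ, using standard containers rather than GraphLab's actual scheduler classes: both hold pending vertex ids, but the priority queue always pops the highest-priority vertex first, which is what residual-driven dynamic schedules rely on.

#include <queue>
#include <utility>

// FIFO schedule: vertices are updated in the order they were added.
std::queue<int> fifo_schedule;

// Priority schedule: vertices are updated in priority order
// (e.g. priority = residual left by the last update).
using Task = std::pair<double, int>;                  // (priority, vertex id)
std::priority_queue<Task> priority_schedule;

int next_fifo() {
  int v = fifo_schedule.front();
  fifo_schedule.pop();
  return v;
}

int next_priority() {
  int v = priority_schedule.top().second;
  priority_schedule.pop();
  return v;
}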

Dynamic Computation: focus update effort on the slowly converging parts of the graph rather than on the parts that have already converged.

The GraphLab Framework: graph-based data representation, update functions (user computation), scheduler, consistency model.

GraphLab Ensures Sequential Consistency: for each parallel execution, there exists a sequential execution of update functions which produces the same result.

GraphLab Guarantee: for every parallel execution of a GraphLab program, there exists an equivalent sequential execution of update functions. This race-free behavior is provided by the consistency model.

Ensuring Race-Free Code How much can computation overlap?

Common Problem: Write-Write Race. Processors running adjacent update functions simultaneously modify shared data: CPU 1 writes, CPU 2 writes, and one of the writes is lost in the final value.

Nuances of Sequential Consistency. Data consistency depends on the update function: some algorithms are “robust” to data races, others are not. GraphLab solution: the user can choose from three consistency models (Full, Edge, Vertex), and GraphLab automatically enforces the user’s choice.

Consistency Rules: guaranteed sequential consistency for all update functions.

Full Consistency: only update functions at least two vertices apart may run in parallel, which reduces opportunities for parallelism.

Obtaining More Parallelism: not all update functions will modify the entire scope. Edge consistency is sufficient for a large number of algorithms, including label propagation.

Edge Consistency: update functions whose scopes overlap only on data they read can run in parallel; CPU 1 and CPU 2 perform a safe read of the shared vertex data.

Obtaining More Parallelism: “Map” operations, e.g. feature extraction on vertex data.

Vertex Consistency

Importance of Consistency: many algorithms require strict consistency, or perform significantly better under strict consistency (e.g., Alternating Least Squares).

The GraphLab Framework: graph-based data representation, update functions (user computation), scheduler, consistency model.

Anatomy of a GraphLab Program:
1) Define the C++ update function
2) Build the data graph using the C++ graph object
3) Set engine parameters: scheduler type and consistency model
4) Add initial vertices to the scheduler
5) Run the engine on the graph [blocking C++ call]
6) The final answer is stored in the graph
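To make the six steps concrete, here is a deliberately simplified, single-threaded C++ skeleton that mirrors them. DataGraph, Scope, and run_engine are inventions for this sketch, not GraphLab classes: define an update function, build the graph, seed a FIFO scheduler with initial vertices, run until the scheduler drains, and read the answer back out of the graph.

#include <functional>
#include <queue>
#include <vector>

struct VertexData { double value; };
struct Edge       { int target; double weight; };

struct DataGraph {
  std::vector<VertexData> vertices;
  std::vector<std::vector<Edge>> edges;   // out-edges per vertex
};

// Scope handed to the update function: the vertex id, the graph,
// and a hook for rescheduling other vertices.
struct Scope {
  int vertex;
  DataGraph* graph;
  std::function<void(int)> reschedule;
};

using UpdateFunction = std::function<void(Scope&)>;

// Steps 4-5: a FIFO "engine" that drains the scheduler (blocking call).
void run_engine(DataGraph& graph, UpdateFunction update,
                const std::vector<int>& initial_vertices) {
  std::queue<int> scheduler;
  for (int v : initial_vertices) scheduler.push(v);

  while (!scheduler.empty()) {               // repeat until scheduler is empty
    int v = scheduler.front();
    scheduler.pop();
    Scope scope{v, &graph, [&](int u) { scheduler.push(u); }};
    update(scope);                           // step 1: user update function
  }
}

int main() {
  // Step 2: build the data graph.
  DataGraph g;
  g.vertices = {{1.0}, {0.0}, {0.0}};
  g.edges = {{{1, 0.5}, {2, 0.5}}, {{2, 1.0}}, {}};

  // Step 1: a toy update function: push a weighted share of my value
  // to each out-neighbor and reschedule any neighbor that changed.
  UpdateFunction push_value = [](Scope& s) {
    for (const Edge& e : s.graph->edges[s.vertex]) {
      double delta = e.weight * s.graph->vertices[s.vertex].value;
      double before = s.graph->vertices[e.target].value;
      s.graph->vertices[e.target].value += delta;
      if (s.graph->vertices[e.target].value != before)
        s.reschedule(e.target);              // dynamic computation
    }
  };

  // Steps 4-5: seed the scheduler and run (blocking).
  run_engine(g, push_value, {0});

  // Step 6: the final answer is stored in the graph (g.vertices).
  return 0;
}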

Algorithms Implemented: PageRank, loopy belief propagation, Gibbs sampling, CoEM, graphical model parameter learning, probabilistic matrix/tensor factorization, alternating least squares, Lasso with sparse features, support vector machines with sparse features, label propagation, …

Carnegie Mellon Implementing the GraphLab API Multi-core & Cloud Settings

Multi-core Implementation. Implemented in C++ on top of Pthreads and GCC atomics. Consistency models are implemented using read-write locks on each vertex with canonically ordered lock acquisition (dining philosophers). Approximate schedulers: approximate FIFO/priority ordering to reduce locking overhead. Experimental Matlab/Java/Python support. Nearly complete implementation available under the Apache 2.0 License at graphlab.org.
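The “canonically ordered lock acquisition” mentioned above is the classic dining-philosophers fix: every thread acquires the per-vertex locks of a scope in one global order (here, ascending vertex id), so two threads can never hold locks in opposite orders and deadlock. A small sketch with plain std::mutex, not GraphLab's actual reader-writer lock implementation:

#include <algorithm>
#include <mutex>
#include <vector>

// One lock per vertex (size chosen arbitrarily for the sketch).
std::vector<std::mutex> vertex_locks(1000);

// Acquire the locks covering a vertex's scope (itself plus its neighbors)
// in ascending vertex-id order, the canonical order that prevents deadlock.
std::vector<std::unique_lock<std::mutex>>
lock_scope(int vertex, std::vector<int> neighbors) {
  neighbors.push_back(vertex);
  std::sort(neighbors.begin(), neighbors.end());
  neighbors.erase(std::unique(neighbors.begin(), neighbors.end()),
                  neighbors.end());

  std::vector<std::unique_lock<std::mutex>> held;
  held.reserve(neighbors.size());
  for (int v : neighbors)
    held.emplace_back(vertex_locks[v]);   // blocks until lock v is acquired
  return held;                            // locks release when 'held' is destroyed
}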

Distributed Cloud Implementation. Implemented in C++ on top of the multi-core implementation on each node, with custom RPC built on top of TCP/IP and MPI. The graph is partitioned over the cluster using either ParMETIS (high-performance partitioning heuristics) or random cuts (which seem to work well on natural graphs). Consistency models are enforced using either distributed read-write locks with pipelined acquisition or graph coloring with phased execution. No fault tolerance yet (we are working on a solution); still experimental.

Carnegie Mellon. Shared Memory Experiments: 16-core workstation.

Loopy Belief Propagation: 3D retinal image denoising. Data graph: 1 million vertices, 3 million edges. Update function: loopy BP update equation. Scheduler: approximate priority. Consistency model: edge consistency.

Loopy Belief Propagation results: SplashBP achieves a 15.5x speedup (speedup plot relative to optimal).

Gibbs Sampling: protein-protein interaction networks [Elidan et al. 2006]. Provably correct parallelization using edge consistency and a round-robin scheduler. Discrete MRF with 14K vertices and 100K edges (backbone, side-chain, and protein interactions).

Gibbs Sampling results: speedup of the chromatic Gibbs sampler relative to optimal (speedup plot).

CoEM (Rosie Jones, 2005): named entity recognition task. Is “Dog” an animal? Is “Catalina” a place? The graph links noun phrases (“the dog”, “Australia”, “Catalina Island”) to contexts (“ran quickly”, “travelled to”, “is pleasant”). Vertices: 2 million; edges: 200 million. Hadoop: 95 cores, 7.5 hrs.

CoEM (Rosie Jones, 2005) results: Hadoop, 95 cores, 7.5 hrs; GraphLab, 16 cores, 30 min. 15x faster with 6x fewer CPUs!

Lasso: Regularized Linear Model. Data matrix X is n × d, weights d × 1, observations n × 1 (5 features, 4 examples in the figure). Shooting algorithm [coordinate descent]: updates on weight vertices modify losses on observation vertices, which requires the full consistency model. Financial prediction dataset from Kogan et al. [2009].

Full Consistency results: speedup on dense and sparse datasets (plot relative to optimal).

Relaxing Consistency: speedup on dense and sparse datasets. Why does this work? (See the Shotgun ICML paper.)

Carnegie Mellon. Experiments: Amazon EC2 high-performance nodes.

Video Cosegmentation (segments that mean the same). Model: 10.5 million nodes, 31 million edges; Gaussian EM clustering + BP on a 3D grid.

Video Coseg. Speedups

Video Segmentation

Matrix Factorization: Netflix collaborative filtering via alternating least squares matrix factorization. Model: 0.5 million nodes, 99 million edges (Netflix users and movies, rank-d factors).

Netflix speedup as the size of the matrix factorization increases.

Netflix

Netflix: comparison to MPI across cluster nodes.

Summary: an abstraction tailored to machine learning that targets graph-parallel algorithms. It naturally expresses data and computational dependencies and dynamic iterative computation, simplifies parallel algorithm design, automatically ensures data consistency, and achieves state-of-the-art parallel performance on a variety of problems.

Current/Future Work: out-of-core storage; Hadoop/HDFS integration (graph construction, graph storage, launching GraphLab from Hadoop, fault tolerance through HDFS checkpoints); sub-scope parallelism to address the challenge of very high degree nodes; update functions -> update functors, allowing update functions to send state when rescheduling.

Carnegie Mellon. Check out GraphLab: documentation, code, tutorials. Questions & feedback welcome.

Outline of the GraphLab Model: Data Graph, Update Functions, Scheduling, Consistency Model, Shared Data Table.

Global Information: what if we need global information, such as a global loss estimate for termination assessment, algorithm parameters, or sufficient statistics? Answer: the Shared Data Table.

Shared Data Table (SDT): holds global constant parameters and global computations (the Sync operation), a reduce repeatedly computed in the background by “summing” over all vertices. Example entries: constants (required number of samples, temperature) and syncs (termination criterion, log-likelihood).
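A small illustration of what a Sync entry computes, with hypothetical names: summing per-vertex contributions into a global value such as the log-likelihood. It is kept deliberately simple compared to GraphLab's background, incrementally maintained Sync operation.

#include <numeric>
#include <vector>

struct VertexData { double log_likelihood_contribution; };

// A "sync" in miniature: fold an accumulator over every vertex.
// GraphLab recomputes this in the background; here it is a plain pass.
double sync_log_likelihood(const std::vector<VertexData>& vertices) {
  return std::accumulate(vertices.begin(), vertices.end(), 0.0,
                         [](double acc, const VertexData& v) {
                           return acc + v.log_likelihood_contribution;
                         });
}

// The shared data table would also hold constants such as the required
// number of samples or the temperature, read by every update function.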

Automatic Termination Checking: the user attaches predicates to the Shared Data Table, e.g. residualCheck(Shared Data Table), which returns true if the termination condition is satisfied. The termination functions are evaluated every time the Shared Data Table is updated.