Piccolo: Building fast distributed programs with partitioned tables
Russell Power, Jinyang Li (New York University)

Motivating Example: PageRank

for each node X in graph:
    for each edge X -> Z:
        next[Z] += curr[X]
Repeat until convergence

[Figure: input graph as adjacency lists (e.g. A -> B,C,D) and the Curr/Next rank tables across successive iterations. The rank state fits in memory!]
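To make the update rule concrete, here is a minimal single-machine Python sketch of the loop above. The toy graph, iteration count, and division by out-degree (as in the Piccolo code later in the deck) are illustrative assumptions.

# Single-machine sketch of the PageRank loop on this slide (illustrative only).
graph = {"A": ["B", "C", "D"], "B": ["E"], "C": ["D"], "D": ["A"], "E": ["A"]}

curr = {page: 1.0 / len(graph) for page in graph}   # initial ranks
for _ in range(50):
    next_rank = {page: 0.0 for page in graph}
    for page, out_links in graph.items():
        for target in out_links:
            next_rank[target] += curr[page] / len(out_links)
    curr = next_rank   # swap curr and next, as in the distributed versions below

print(curr)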

PageRank in MapReduce

[Figure: each iteration streams the graph (A -> B,C; B -> D) and the rank stream (A: 0.1, B: 0.2) through map/reduce stages via distributed storage, producing a new rank stream.]

Data flow models do not expose global state.

PageRank with MPI/RPC

[Figure: graph and rank state partitioned across workers (Graph A -> B,C with Ranks A: 0; Graph B -> D with Ranks B: 0; Graph C -> E,F with Ranks C: 0), backed by distributed storage.]

The user explicitly programs communication.

Piccolo's Goal: Distributed Shared State

[Figure: the graph and rank tables live as distributed in-memory state that kernels read and write directly, with distributed storage underneath.]

Piccolo's Goal: Distributed Shared State

[Figure: the same per-worker graph and rank partitions as in the MPI/RPC picture, but the Piccolo runtime handles communication between them.]

Ease of use vs. performance

Talk outline
- Motivation
- Piccolo's Programming Model
- Runtime Scheduling
- Evaluation

Programming Model

[Figure: kernels access the graph and rank tables through get/put, update, and iterate operations rather than raw read/write.]

Implemented as a library for C++ and Python.
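To illustrate the get/put/update/iterate interface named on this slide, a minimal in-process sketch follows; the class and method signatures are assumptions for illustration, not Piccolo's actual C++/Python API.

# Hypothetical in-process sketch of a key-value table with the operations
# named above. Illustrative only; not Piccolo's real API.
class Table:
    def __init__(self, accumulate=None):
        self._data = {}
        self._accumulate = accumulate   # e.g. lambda old, delta: old + delta

    def get(self, key):
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value

    def update(self, key, delta):
        # Concurrent writes are resolved through the accumulation function.
        if key in self._data and self._accumulate is not None:
            self._data[key] = self._accumulate(self._data[key], delta)
        else:
            self._data[key] = delta

    def iterate(self):
        return iter(self._data.items())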

Naïve PageRank with Piccolo

curr = Table(key=PageID, value=double)
next = Table(key=PageID, value=double)

# Jobs run by many machines
def pr_kernel(graph, curr, next):
    i = my_instance
    n = len(graph) / NUM_MACHINES
    for s in graph[(i-1)*n : i*n]:
        for t in s.out:
            next[t] += curr[s.id] / len(s.out)

# Run by a single controller, which launches jobs in parallel
def main():
    for i in range(50):
        launch_jobs(NUM_MACHINES, pr_kernel, graph, curr, next)
        swap(curr, next)
        next.clear()

Naïve PageRank is Slow

[Figure: each kernel issues remote get and put operations against rank partitions stored on other workers.]

PageRank: Exploiting Locality
- Control table partitioning
- Co-locate tables
- Co-locate execution with table

curr = Table(…, partitions=100, partition_by=site)
next = Table(…, partitions=100, partition_by=site)
group_tables(curr, next, graph)

def pr_kernel(graph, curr, next):
    for s in graph.get_iterator(my_instance):
        for t in s.out:
            next[t] += curr[s.id] / len(s.out)

def main():
    for i in range(50):
        launch_jobs(curr.num_partitions, pr_kernel, graph, curr, next,
                    locality=curr)
        swap(curr, next)
        next.clear()
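As a sketch of what a partition_by=site policy could mean in practice, the hypothetical helper below hashes a page's host so that all pages from one site fall into the same partition; the function name and hashing scheme are assumptions, not Piccolo's built-in behavior.

# Hypothetical key-to-partition function for a site-based policy: pages from
# the same site hash to the same partition, so a kernel that owns that
# partition sees mostly local links. Illustrative only.
from urllib.parse import urlparse
import zlib

def site_partition(page_url, num_partitions=100):
    site = urlparse(page_url).netloc          # e.g. "www.example.com"
    return zlib.crc32(site.encode()) % num_partitions

# All three pages map to the same partition:
print(site_partition("http://www.example.com/a"),
      site_partition("http://www.example.com/b"),
      site_partition("http://www.example.com/c"))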

Exploiting Locality

[Figure: with co-located graph and rank partitions, most gets and puts stay on the local worker; only cross-partition links require remote puts.]

Synchronization

[Figure: two workers concurrently put different values (a=0.3 and a=0.2) into the same rank entry.]

How to handle synchronization?

Synchronization Primitives
- Avoid write conflicts with accumulation functions: NewValue = Accum(OldValue, Update), e.g. sum, product, min, max (see the sketch below)
- Global barriers are sufficient
- Tables provide release consistency
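A minimal sketch of accumulation-based conflict resolution, NewValue = Accum(OldValue, Update); the helper names are illustrative.

# Because the accumulator is commutative and associative, the final value
# does not depend on the order in which workers' updates arrive.
def accum_sum(old, update):
    return old + update

def accum_min(old, update):
    return min(old, update)

def apply_updates(table, updates, accum):
    for key, delta in updates:
        table[key] = accum(table[key], delta) if key in table else delta
    return table

ranks = {}
apply_updates(ranks, [("a", 0.2), ("a", 0.3)], accum_sum)
print(ranks["a"])   # 0.5, regardless of which update arrived first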

PageRank: Efficient Synchronization

curr = Table(…, partition_by=site, accumulate=sum)   # accumulation via sum
next = Table(…, partition_by=site, accumulate=sum)
group_tables(curr, next, graph)

def pr_kernel(graph, curr, next):
    for s in graph.get_iterator(my_instance):
        for t in s.out:
            next.update(t, curr.get(s.id) / len(s.out))   # update invokes the accumulation function

def main():
    for i in range(50):
        handle = launch_jobs(curr.num_partitions, pr_kernel, graph, curr, next,
                             locality=curr)
        barrier(handle)   # explicitly wait between iterations
        swap(curr, next)
        next.clear()

Efficient Synchronization

[Figure: instead of raw puts, workers issue update(a, 0.2) and update(a, 0.3); the runtime computes the sum. Workers buffer updates locally, giving release consistency.]
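To make the buffering point concrete, here is a hypothetical worker-side update buffer that combines updates to the same key locally and ships them only when flushed, e.g. at a barrier. It sketches the idea, not Piccolo's implementation.

class UpdateBuffer:
    def __init__(self, accum):
        self.accum = accum
        self.pending = {}            # key -> combined delta

    def update(self, key, delta):
        if key in self.pending:
            self.pending[key] = self.accum(self.pending[key], delta)
        else:
            self.pending[key] = delta

    def flush(self, send):
        # One network message per key instead of one per update.
        for key, delta in self.pending.items():
            send(key, delta)
        self.pending.clear()

buf = UpdateBuffer(lambda old, new: old + new)
buf.update("a", 0.2)
buf.update("a", 0.3)
buf.flush(lambda k, d: print(k, d))   # prints: a 0.5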

Table Consistency

[Figure: concurrent puts (a=0.3, a=0.2) versus accumulated updates (update(a, 0.2), update(a, 0.3)) against the same table entry.]

PageRank with Checkpointing

curr = Table(…, partition_by=site, accumulate=sum)
next = Table(…, partition_by=site, accumulate=sum)
group_tables(curr, next)

def pr_kernel(graph, curr, next):
    for s in graph.get_iterator(my_instance):
        for t in s.out:
            next.update(t, curr.get(s.id) / len(s.out))

def main():
    curr, userdata = restore()                # restore previous computation
    last = userdata.get('iter', 0)
    for i in range(last, 50):
        handle = launch_jobs(curr.num_partitions, pr_kernel, graph, curr, next,
                             locality=curr)
        # User decides which tables to checkpoint and when
        cp_barrier(handle, tables=(next,), userdata={'iter': i})
        swap(curr, next)
        next.clear()

Recovery via Checkpointing

[Figure: table partitions are checkpointed to distributed storage and restored after a failure.]

The runtime uses the Chandy-Lamport snapshot protocol.
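For readers unfamiliar with Chandy-Lamport, the sketch below shows the marker rule in a highly simplified, single-threaded form; the class, message format, and channel handling are assumptions for illustration, not Piccolo's checkpoint code.

class SnapshotProcess:
    # 'state' stands in for a worker's table partition (a dict of key -> value).
    def __init__(self, name, state, in_channels):
        self.name = name
        self.state = state
        self.in_channels = set(in_channels)
        self.snapshot = None         # recorded local state
        self.channel_state = {}      # channel -> messages recorded in flight
        self.recording = set()       # channels still being recorded

    def receive_marker(self, channel, send_marker_downstream):
        if self.snapshot is None:
            # First marker: record local state, treat this channel as empty,
            # start recording all other incoming channels, forward markers.
            self.snapshot = dict(self.state)
            self.channel_state[channel] = []
            self.recording = self.in_channels - {channel}
            send_marker_downstream()
        else:
            # Later marker: this channel's in-flight recording is complete.
            self.recording.discard(channel)
            self.channel_state.setdefault(channel, [])

    def receive_message(self, channel, update):
        if self.snapshot is not None and channel in self.recording:
            # The message was in flight when the snapshot started.
            self.channel_state.setdefault(channel, []).append(update)
        self.state.update(update)    # normal processing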

Talk Outline
- Motivation
- Piccolo's Programming Model
- Runtime Scheduling
- Evaluation

Load Balancing

[Figure: the master coordinates work-stealing across workers, each holding partitions (P1..P6) and kernels (J1..J6). Before kernel J5's partition P6 can be moved, "Other workers are updating P6! Pause updates!"]
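The slide only names master-coordinated work-stealing and the need to pause updates before moving a partition; the sketch below is one hypothetical way a master could coordinate that, with all names, messages, and policies illustrative.

class Master:
    def __init__(self, pending):
        # pending: worker -> list of (kernel, partition) not yet finished
        self.pending = pending

    def steal(self, idle_worker, pause_updates, migrate):
        # Pick the busiest worker and hand one of its pending kernels
        # (plus the partition it works on) to the idle worker.
        donor = max(self.pending, key=lambda w: len(self.pending[w]))
        if donor == idle_worker or not self.pending[donor]:
            return None
        kernel, partition = self.pending[donor].pop()
        pause_updates(partition)      # other workers may still be updating it
        migrate(partition, donor, idle_worker)
        self.pending.setdefault(idle_worker, []).append((kernel, partition))
        return kernel

m = Master({"w1": [("J5", "P5"), ("J6", "P6")], "w2": []})
stolen = m.steal("w2",
                 pause_updates=lambda p: print("pause updates to", p),
                 migrate=lambda p, src, dst: print("move", p, src, "->", dst))
print(stolen)   # J6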

Talk Outline
- Motivation
- Piccolo's Programming Model
- System Design
- Evaluation

Piccolo is Fast

NYU cluster, 12 nodes, 64 cores; 100M-page graph.

[Chart: PageRank iteration time (seconds), Hadoop vs. Piccolo.]

Main Hadoop overheads:
- Sorting
- HDFS
- Serialization

Piccolo Scales Well

EC2 cluster; the input graph is scaled linearly with cluster size, up to a 1-billion-page graph.

[Chart: PageRank iteration time (seconds) vs. cluster size, compared against ideal scaling.]

Other Applications
- Iterative applications: n-body simulation, matrix multiply
- Asynchronous applications: distributed web crawler

No straightforward Hadoop implementation.

Related Work
- Data flow: MapReduce, Dryad
- Tuple spaces: Linda, JavaSpaces
- Distributed shared memory: CRL, TreadMarks, Munin, Ivy; UPC, Titanium

Conclusion
- Distributed shared table model
- User-specified policies provide for:
  - Effective use of locality
  - Efficient synchronization
  - Robust failure recovery

Gratuitous Cat Picture

I can haz kwestions? Try it out: piccolo.news.cs.nyu.edu

Naïve PageRank – Problems!

curr = Table(key=PageID, value=double)
next = Table(key=PageID, value=double)

def pr_kernel(graph, curr, next):
    i = my_instance
    n = len(graph) / NUM_MACHINES
    for s in graph[(i-1)*n : i*n]:
        for t in s.out:
            next[t] += curr[s.id] / len(s.out)   # lots of remote reads!

def main():
    for i in range(50):
        launch_kernel(NUM_MACHINES, pr_kernel, graph, curr, next)   # no synchronization!
        swap(curr, next)
        next.clear()

Failure Handling
- Users decide what and when to checkpoint
- When to checkpoint?
  - At global barriers
  - Periodically during execution
- What to checkpoint?
  - Table data (saved automatically by Piccolo)
  - User state (saved/restored via callback)
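A minimal sketch of the user-state save/restore callback idea, assuming a JSON file as the checkpoint target; the function names, path, and format are illustrative, not Piccolo's API.

# Table data is assumed to be saved by the runtime; the program supplies
# small callbacks for its own state (e.g. the current iteration number).
import json, os

CHECKPOINT = "/tmp/pagerank_userdata.json"   # illustrative path

def save_userdata(userdata):
    # Called when a checkpoint is written.
    with open(CHECKPOINT, "w") as f:
        json.dump(userdata, f)

def restore_userdata():
    # Called on restart; returns {} if no checkpoint exists.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {}

userdata = restore_userdata()
start_iter = userdata.get("iter", 0)
for i in range(start_iter, 50):
    # ... run one PageRank iteration ...
    save_userdata({"iter": i})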

System Overview

Workers: store table partitions, process table operations, run kernel threads.
Master: assigns partitions and kernels, runs the control thread.

Partition and Kernel Assignment

[Figure: partitions P1..P6 and kernels K1..K6 spread across workers under the master. Kernels are initially assigned by locality, co-locating each kernel with its partition.]

Kernel Execution

[Figure: kernel K5 running on one worker. Remote reads block execution; remote writes are accumulated locally.]

Asynchronous Web Crawler
- Simple to implement
- Crawls pages quickly, as they are discovered
- Saturates a 100 Mbps link using 32 workers
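One plausible way to structure such a crawler over a shared table, sketched below: URL crawl state advances monotonically via a max-style accumulator, so a URL discovered by many workers is still fetched only once. The state encoding and helper names are assumptions for illustration.

TO_FETCH, FETCHING, DONE = 0, 1, 2

def update(table, key, state):
    # max-accumulated update: a URL's crawl state only moves forward.
    table[key] = max(table.get(key, state), state)

def crawl_step(url_table, fetch, my_urls):
    # 'my_urls' is this worker's partition of the URL table.
    for url in my_urls:
        if url_table.get(url, TO_FETCH) != TO_FETCH:
            continue
        update(url_table, url, FETCHING)
        for link in fetch(url):              # download the page, extract links
            update(url_table, link, TO_FETCH)
        update(url_table, url, DONE)

urls = {"http://a.example/": TO_FETCH}
crawl_step(urls, fetch=lambda u: ["http://b.example/"], my_urls=["http://a.example/"])
print(urls)   # a.example is DONE, b.example is queued as TO_FETCH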

Exploiting Locality
- Explicitly divide tables into partitions
- Expose partitioning to users: users provide a key-to-partition mapping function
- Co-locate kernels with table partitions
- Co-locate partitions of different tables