FlumeJava: Easy, Efficient Data-Parallel Pipelines. Mosharaf Chowdhury.

Presentation transcript:

FlumeJava: Easy, Efficient Data-Parallel Pipelines
Mosharaf Chowdhury

Problem
Efficient data-parallel pipelines
– Chain of MapReduce programs
– Iterative jobs
– …
FlumeJava exposes a limited set of parallel operations on immutable parallel collections

Goals
Expressiveness
Abstractions
– Data representation
– Implementation strategy
Performance
– Lazy evaluation
– Dynamic optimization
Usability & deployability
– Implemented as a Java library
– Inspired by the failure of Lumberjack

FlumeJava Workflow
1. Write a Java program using the FlumeJava library
2. FlumeJava.run();
3. Optimize
4. Execute

    PCollection<String> words =
        lines.parallelDo(new DoFn<String, String>() {
          void process(String line, EmitFn<String> emitFn) {
            for (String word : splitIntoWords(line)) {
              emitFn.emit(word);
            }
          }
        }, collectionOf(strings()));
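The workflow above depends on deferred evaluation: calling parallelDo() only records work, and nothing executes until run(). The following is a minimal single-machine sketch of that idea, not the FlumeJava API. The names LazyPipelineSketch, Deferred, and isMaterialized are hypothetical, and splitIntoWords is assumed to split on whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical illustration of deferred evaluation; not the FlumeJava library.
final class LazyPipelineSketch {

    /** A deferred collection: it holds a recipe for its elements, not the elements. */
    static final class Deferred<T> {
        private final Supplier<List<T>> recipe;
        private List<T> result; // stays null until run() is called

        private Deferred(Supplier<List<T>> recipe) {
            this.recipe = recipe;
        }

        /** Wraps an in-memory list as the pipeline input. */
        static <T> Deferred<T> of(List<T> input) {
            List<T> copy = new ArrayList<>(input);
            return new Deferred<>(() -> copy);
        }

        /** Records the operation only; fn receives each element and an emit callback. */
        <R> Deferred<R> parallelDo(BiConsumer<T, Consumer<R>> fn) {
            return new Deferred<>(() -> {
                List<R> out = new ArrayList<>();
                for (T element : this.run()) {
                    fn.accept(element, out::add);
                }
                return out;
            });
        }

        boolean isMaterialized() {
            return result != null;
        }

        /** Forces evaluation of this collection and everything it depends on. */
        List<T> run() {
            if (result == null) {
                result = recipe.get();
            }
            return result;
        }
    }

    /** Assumed behaviour of the slide's splitIntoWords helper: split on whitespace. */
    static List<String> splitIntoWords(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Deferred<String> lines = Deferred.of(Arrays.asList("to be", "or not"));
        Deferred<String> words = lines.parallelDo((line, emit) -> {
            for (String word : splitIntoWords(line)) {
                emit.accept(word);
            }
        });
        System.out.println(words.isMaterialized()); // nothing has executed yet
        System.out.println(words.run());
    }
}
```

Because each Deferred only stores a recipe, a real system is free to inspect and rewrite the recorded operations before run() executes them, which is where the Optimize step fits.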

Core Abstractions
Parallel Collections
1. PCollection<T>
2. PTable<K, V>
Data-parallel Operations
Primitives
1. parallelDo()
2. groupByKey()
3. combineValues()
4. flatten()
Derived operations
1. count()
2. join()
3. top()
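A derived operation such as count() can be expressed with the primitives listed above: a parallelDo() that pairs each element with 1, a groupByKey(), and a combineValues() that sums each group. The sketch below shows that composition on ordinary in-memory Java collections. It is an illustration under that assumption, not the FlumeJava API, and the class name DerivedCountSketch and the helper pairWithOne are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical in-memory stand-ins for the primitives; not the FlumeJava library.
final class DerivedCountSketch {

    /** parallelDo stand-in: emit an (element, 1) pair for every input element. */
    static <T> List<Map.Entry<T, Integer>> pairWithOne(List<T> input) {
        List<Map.Entry<T, Integer>> pairs = new ArrayList<>();
        for (T element : input) {
            pairs.add(new AbstractMap.SimpleImmutableEntry<T, Integer>(element, 1));
        }
        return pairs;
    }

    /** groupByKey stand-in: collect all values that share a key. */
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> groups = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    /** combineValues stand-in: fold each (non-empty) group with an associative combiner. */
    static <K, V> Map<K, V> combineValues(Map<K, List<V>> groups, BinaryOperator<V> combiner) {
        Map<K, V> combined = new LinkedHashMap<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet()) {
            List<V> values = group.getValue();
            V accumulated = values.get(0);
            for (int i = 1; i < values.size(); i++) {
                accumulated = combiner.apply(accumulated, values.get(i));
            }
            combined.put(group.getKey(), accumulated);
        }
        return combined;
    }

    /** count() derived from the three stand-ins above. */
    static <T> Map<T, Integer> count(List<T> input) {
        return combineValues(groupByKey(pairWithOne(input)), Integer::sum);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("to", "be", "or", "not", "to", "be");
        System.out.println(count(words)); // prints {to=2, be=2, or=1, not=1}
    }
}
```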

MapShuffleCombineReduce (MSCR)
Transforms combinations of the four primitives into a single MapReduce
Generalizes MapReduce
– Multiple reducers/combiners
– Multiple outputs per reducer
– Pass-through outputs

Optimization
Optimizer Strategy
1. Sink flattens
2. Lift CombineValues
3. Insert fusion blocks
4. Fuse parallelDos
5. Fuse MSCRs
Optimizer Output
1. MSCR
2. Flatten
3. Operate
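Fusing parallelDos in a producer-consumer chain amounts to function composition: two passes with an intermediate collection become one pass. The sketch below illustrates this on in-memory lists, modelling each DoFn as a function from one element to a list of emitted elements. That modelling and the names ParallelDoFusionSketch, parallelDo, and fuse are assumptions made for illustration, not the optimizer's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of producer-consumer fusion; not the FlumeJava optimizer.
final class ParallelDoFusionSketch {

    /** One pass over the input: apply fn to each element and collect everything it emits. */
    static <A, B> List<B> parallelDo(List<A> input, Function<A, List<B>> fn) {
        List<B> out = new ArrayList<>();
        for (A element : input) {
            out.addAll(fn.apply(element));
        }
        return out;
    }

    /** Composes two element-wise functions so no intermediate collection is built. */
    static <A, B, C> Function<A, List<C>> fuse(Function<A, List<B>> first,
                                               Function<B, List<C>> second) {
        return a -> {
            List<C> out = new ArrayList<>();
            for (B intermediate : first.apply(a)) {
                out.addAll(second.apply(intermediate));
            }
            return out;
        };
    }

    public static void main(String[] args) {
        Function<String, List<String>> split = line -> Arrays.asList(line.split(" "));
        Function<String, List<String>> upper = w -> Collections.singletonList(w.toUpperCase());
        List<String> lines = Arrays.asList("to be", "or not");

        // Unfused: two passes, with the list of words materialized in between.
        List<String> twoPass = parallelDo(parallelDo(lines, split), upper);
        // Fused: a single pass using the composed function.
        List<String> onePass = parallelDo(lines, fuse(split, upper));

        System.out.println(twoPass.equals(onePass)); // prints true
    }
}
```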

Hit or Miss?
Sizable reduction in SLOC
– Except for Sawzall
5x reduction in the average number of stages
Faster than other approaches
– Except for hand-optimized MapReduce chains
319 users over a one-year period