MapReduce Online
Tyson Condie, UC Berkeley
Slides by Kaixiang Mo

Outline
– Background
– Motivation: blocking vs. pipelining
– Hadoop Online Prototype (HOP) model
– Pipelining within a job
– Online aggregation
– Continuous queries
– Conclusion

Background
MapReduce systems:
– Massive data parallelism, batch-oriented, high throughput
– Fault tolerance by materializing results to HDFS
However, this design does not serve:
– Stream processing: analyzing continuous streams of data
– Online aggregation: showing early results interactively

Motivation: Batch vs. online
Batch:
– Reduce begins only after all map tasks finish
– High throughput, high latency
Online:
– Stream processing is usually not fault tolerant
– Lower latency
Blocking does not fit online/streaming data:
– It produces final answers only
– It cannot handle infinite streams
Fault tolerance is important; how do we keep it?

MapReduce job: word count
Map step:
– Parse the input into words
– For each word, output <word, 1>
Reduce step:
– For each word, receive the list of its counts
– Sum the counts and output <word, total>
Combine step (optional):
– Pre-aggregate map output on the map side
– Same function as reduce
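For reference, here is a minimal version of this word-count job written against the standard Hadoop Java MapReduce API. The class names (WordCount, TokenizerMapper, IntSumReducer) are the usual textbook ones, not part of the original slides.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: parse the input into words and emit <word, 1> for each word.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: for each word, sum its list of counts and emit <word, total>.
  // The same class also serves as the optional combiner (map-side pre-aggregation).
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // optional combine step
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

In stock Hadoop this job blocks: reducers cannot start the reduce step until every map task has committed its output, which is exactly the behavior HOP relaxes.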

MapReduce steps (map side)
– The client submits the job; the master schedules mappers and reducers
– Each map task runs its map step, group (sort) step, optional combine step, and commit step
– The map task finishes

MapReduce steps (reduce side)
– The master tells the reducers where the map output is located
– Shuffle (pull data) step, group (full sort) step
  – Do these start too late?
– Reduce step
– The reduce task finishes; the job finishes

Hadoop Online Prototype (HOP)
Key change: pipelining between operators
– Data is pushed from mappers to reducers
– Data transfer runs concurrently with map/reduce computation
– Still fault tolerant
Benefits:
– Lower latency
– Higher utilization
– Smoother network traffic

Performance at a glance
In some cases, HOP can reduce job completion time by 25%.

Pipelining within a job
Naive design: pipeline each record as it is produced
– Prevents the mapper from grouping and combining its output
– Heavy network I/O load
– Mappers can flood and bury the reducers
Revised design: pipeline small sorted runs (spills)
– Task thread: applies the map/reduce function and buffers the output
– Spill thread: sorts & combines the buffer, then spills it to a file
– TaskTracker: serves spill files to the consumers (reducers)
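A toy sketch of the revised design in plain Java (these are illustrative classes, not HOP's actual code): the task thread buffers map output, while a separate spill thread sorts and combines each full buffer into a small sorted run that the TaskTracker could then serve to reducers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative model of "pipeline small sorted runs (spills)".
public class SpillPipeline {
  // A record emitted by the map function: (word, count).
  record Record(String key, int value) {}

  private final BlockingQueue<List<Record>> buffers = new ArrayBlockingQueue<>(4);
  private final List<Map<String, Integer>> spills = new ArrayList<>(); // stands in for spill files

  // Task thread: apply the map function and buffer its output.
  void mapAndBuffer(List<String> lines) throws InterruptedException {
    List<Record> buffer = new ArrayList<>();
    for (String line : lines) {
      for (String word : line.split("\\s+")) {
        buffer.add(new Record(word, 1));   // map output
        if (buffer.size() >= 1000) {       // buffer full: hand it to the spill thread
          buffers.put(buffer);
          buffer = new ArrayList<>();
        }
      }
    }
    if (!buffer.isEmpty()) buffers.put(buffer);
  }

  // Spill thread: sort & combine one buffer, then "spill" it as a sorted run.
  void spillLoop() throws InterruptedException {
    while (true) {
      List<Record> buffer = buffers.take();
      Map<String, Integer> run = new TreeMap<>();                         // sorted by key
      for (Record r : buffer) run.merge(r.key(), r.value(), Integer::sum); // combine
      synchronized (spills) { spills.add(run); }  // a TaskTracker would serve this to reducers
    }
  }
}
```

The point of spilling sorted, combined runs instead of raw records is that the mapper keeps doing useful group-and-combine work even while its output is being pipelined.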

Utilization balance control
Having mappers send early results moves computation (group & combine) from the mapper to the reducer.
– If reducers are fast: pipeline more aggressively, do less map-side sort & spill
– If reducers are slow: pipeline less aggressively, do more map-side sort & spill
Halt the pipeline when reducers are backed up or the combiner is effective.
Resume the pipeline by merging & combining the accumulated spill files.
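A minimal sketch of this feedback policy, assuming a simple integer "aggressiveness" knob and a signal for whether reducers are keeping up (both are my own illustrative names, not HOP's):

```java
// Illustrative control loop for how aggressively a mapper pipelines its spills.
public class PipelineController {
  private int aggressiveness = 5;  // 0 = buffer everything, 10 = push every spill immediately

  // Called periodically with a signal of whether reducers are keeping up.
  void adjust(boolean reducersKeepingUp) {
    if (reducersKeepingUp) {
      // Fast reducers: push earlier, do less map-side sort & combine.
      aggressiveness = Math.min(10, aggressiveness + 1);
    } else {
      // Slow reducers: hold data back, do more map-side sort & combine.
      aggressiveness = Math.max(0, aggressiveness - 1);
    }
  }

  // Halt pipelining once too many spills are backed up; resume later by
  // merging & combining the spill files accumulated in the meantime.
  boolean shouldPipeline(int unsentSpills) {
    return unsentSpills < aggressiveness;
  }
}
```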

Pipelined fault tolerance (PFT)
Simple PFT design (coarse):
– A reducer treats in-progress map output as tentative
– If the map task succeeds, accept its output
– If the map task dies, throw its output away
Revised PFT design (finer):
– Record mapper progress and recover from the latest checkpoint
– Correctness: reduce tasks ensure spill files are complete
– Map tasks recover from the latest checkpoint, so no redundant spill files are produced
The master is busier here:
– It must record progress for each map task
– It must record whether each piece of map output has been sent
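The coarse design amounts to some simple bookkeeping on the reducer side. A hypothetical sketch (class and method names are mine, not HOP's):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative reducer-side bookkeeping for the coarse PFT design:
// pipelined output from an in-progress map attempt stays tentative until
// that attempt commits; if the attempt dies, its output is discarded.
public class TentativeMapOutput {
  private final Map<String, List<byte[]>> tentative = new HashMap<>(); // mapAttemptId -> spills so far
  private final Set<String> committed = new HashSet<>();

  // Spill data arrives while the map task is still running: hold it as tentative.
  void onSpill(String mapAttemptId, byte[] spill) {
    tentative.computeIfAbsent(mapAttemptId, k -> new ArrayList<>()).add(spill);
  }

  // The map attempt succeeded: its output becomes authoritative.
  void onMapSuccess(String mapAttemptId) {
    committed.add(mapAttemptId);
  }

  // The map attempt died: throw its output away; a new attempt will resend.
  void onMapFailure(String mapAttemptId) {
    tentative.remove(mapAttemptId);
  }
}
```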

System fault tolerance
Mapper failure:
– A new mapper restarts from the latest checkpoint and its output is sent to the reducers
Reducer failure:
– All mappers resend their intermediate results; mappers therefore still keep intermediate results on local disk, but reducers do not have to block
Master failure:
– The system cannot survive

Online aggregation
– Periodically show a snapshot of the reducer's result over the map output received so far
– Report progress alongside each snapshot (the fraction of map output the reducer has seen)

Pipelining between jobs
Suppose we run job 1 and job 2, and job 2 needs job 1's result.
– Periodically snapshot the output of job 1 and pipeline it to job 2
Fault tolerance:
– Job 1 fails: recover as before
– Job 2 fails: restart the failed task
– Both fail: job 2 restarts from the latest snapshot

Continuous queries
Mapper: add a flush API; output is stored locally if the reducer is unavailable
Reducer: runs the reduce function periodically
– Triggered by wall-clock time, logical time, number of input rows, etc.
– The number of mappers and reducers is fixed
Fault tolerance:
– A mapper cannot retain an unbounded amount of output
– The reducer saves checkpoints to HDFS
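A toy sketch of the reducer side of a continuous query in plain Java (again illustrative, not the HOP API): pipelined input is buffered and the reduce function runs whenever a trigger fires, here either a wall-clock interval or a row-count threshold.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative continuous-query reducer: reduce runs periodically over the
// input that has arrived since the last run.
public class PeriodicReducer {
  private final List<Map.Entry<String, Integer>> pending = new ArrayList<>();
  private final long intervalMillis;   // wall-clock trigger
  private final int rowThreshold;      // #input-rows trigger
  private long lastRun = System.currentTimeMillis();

  PeriodicReducer(long intervalMillis, int rowThreshold) {
    this.intervalMillis = intervalMillis;
    this.rowThreshold = rowThreshold;
  }

  // Called for each record pipelined (or flushed) from the mappers.
  void onRecord(String key, int value) {
    pending.add(Map.entry(key, value));
    if (pending.size() >= rowThreshold
        || System.currentTimeMillis() - lastRun >= intervalMillis) {
      runReduce();
    }
  }

  // "Reduce step": aggregate the pending records and emit one result per key.
  // A real implementation would also checkpoint this state (e.g. to HDFS)
  // so a restarted reducer can resume, as the slide describes.
  private void runReduce() {
    Map<String, Integer> counts = new HashMap<>();
    for (Map.Entry<String, Integer> e : pending) {
      counts.merge(e.getKey(), e.getValue(), Integer::sum);
    }
    counts.forEach((k, v) -> System.out.println(k + "\t" + v));
    pending.clear();
    lastRun = System.currentTimeMillis();
  }
}
```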

Performance Evaluation

Impact of the number of reducers

When there are enough reducers, HOP is faster.
When there are not enough reducers, HOP is slower.
– HOP is not able to balance the workload between mappers and reducers

Small vs. large blocks
With large blocks, HOP is faster because reducers do not have to wait for an entire block to be produced.

Small vs. large blocks
With small blocks, HOP is still faster, but the advantage is smaller.

Discussion
– HOP improves Hadoop for real-time/stream processing and is most useful when few jobs run at once.
– The finer-grained progress tracking makes the master busier, which can hurt scalability.
– With many jobs, pipelining may increase computation and decrease throughput (busier network, more overhead on the master).