MapReduce Online. Veli Hasanov (50051030), Fatih University.

OUTLINE
1. Introduction
2. Background (Hadoop)
3. Pipelined MapReduce
4. Online Aggregation
5. Conclusion

1-Introduction MapReduce has emerged as a popular way to work with large clusters, through the Google MapReduce framework and the open-source Hadoop system. Both are data-centric: the programmer expresses a computation as transformations on data sets, while distributed execution, network communication, and fault tolerance are handled by the MapReduce framework.

1-Introduction Pipelining provides several important advantages to a MapReduce framework:
- Because reduce tasks receive map output while it is still being produced, they can generate and refine an approximation of their final answer during the course of execution; this is known as online aggregation.
- MapReduce jobs can run continuously, accepting new data as it arrives and analyzing it immediately. This allows MapReduce to be used in applications such as event monitoring and stream processing.
- Pipelining can reduce job completion times by up to 25% in some scenarios.

1-Introduction We present a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see "early returns" from a job while it is still running.

2-Hadoop Background 2.1 Programming Model
- A job consists of a map step, which turns input records into intermediate key/value pairs, and a reduce step, which merges all intermediate values that share a key.
- Optionally, the user can supply a combiner function (map-side pre-aggregation), which reduces the network traffic between the map and reduce steps.
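To make the model concrete, here is the classic word-count job written against Hadoop's old org.apache.hadoop.mapred API (the API generation of the paper's era); this is a standard illustrative example, not code from HOP itself. The reducer doubles as the combiner because summing counts is associative and commutative.

    // Classic word count: map emits (word, 1); combine/reduce sum the counts.
    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {
      // Map: emit (word, 1) for every token in the input line.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          StringTokenizer tok = new StringTokenizer(value.toString());
          while (tok.hasMoreTokens()) {
            word.set(tok.nextToken());
            out.collect(word, ONE);
          }
        }
      }

      // Reduce (also usable as a combiner): sum all counts for a word.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) sum += values.next().get();
          out.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);  // map-side pre-aggregation
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }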

2-Hadoop Background 2.2 Hadoop Architecture
- Hadoop consists of Hadoop MapReduce and the Hadoop Distributed File System (HDFS).
- HDFS is used to store both the input to the map step and the output of the reduce step; intermediate results are stored in each node's local file system.
- A Hadoop installation consists of a single master node and many worker nodes.
- The master node runs the JobTracker, which accepts jobs, divides them into tasks, and assigns the tasks to worker nodes.
- Each worker node runs a TaskTracker, which manages the execution of tasks; by default it has two map slots and two reduce slots.

2-Hadoop Background 2.3 Map Task Execution
- Each map task is assigned a portion of the input file, called a split. By default, a split contains a single HDFS block (64 MB), so the number of file blocks equals the number of map tasks.
- Execution of a map task is divided into two phases:
1. The map phase reads the task's split from HDFS, parses it into records, and applies the map function to each record.
2. The commit phase registers the final output with the TaskTracker, which informs the JobTracker that the task has finished.
- In the map phase, the map function emits records through an OutputCollector, which stores the map output in a format that is easy for reduce tasks to consume.
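The format is "easy to consume" partly because map output is partitioned, one partition per reduce task. The partitioning logic is essentially Hadoop's default hash partitioner, sketched here for illustration:

    // Sketch of Hadoop's default hash partitioning: each intermediate key
    // is assigned to one of numReduceTasks partitions, one per reduce task.
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class HashPartitionerSketch implements Partitioner<Text, IntWritable> {
      public void configure(JobConf job) {}

      public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }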

2-Hadoop Background 2.4 Reduce Task Execution is divided into three phases: shuffle (fetch the map output), sort (group records by key), and reduce (apply the reduce function to each key's values). The output of the reduce function is written to a temporary location on HDFS. After the task completes, its HDFS output file is atomically renamed from the temporary location to its final location. But what happens if there is a failure during map or reduce task execution?
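The atomic-rename commit can be sketched with the HDFS FileSystem API; this is a minimal illustration, and the path names are hypothetical:

    // Sketch of the commit step for a reduce task's output: write to a
    // temporary path, then atomically rename it to the final location,
    // so readers never observe partial output. Paths are hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CommitSketch {
      public static void commit(Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Path tmp = new Path("/out/_temporary/part-00000");  // written during reduce
        Path fin = new Path("/out/part-00000");             // visible after commit
        if (!fs.rename(tmp, fin)) {                         // atomic in HDFS
          throw new RuntimeException("commit failed for " + tmp);
        }
      }
    }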

2-Hadoop Background Map tasks write their output to local disk, so that output is available only after the map task has completed. Reduce tasks write their output to HDFS; once a job is finished, the next job's map tasks can be scheduled and will read their input from HDFS. Therefore, fault tolerance is simple: simply re-run tasks on failure, and no consumer ever sees partial operator output.

[Diagram slides] Dataflow in Hadoop: the client submits a job, and the JobTracker schedules its map and reduce tasks; map tasks read input file blocks from HDFS; reduce tasks pull map output from each mapper's local file system via HTTP GET; finally, reduce tasks write the final answer to HDFS.

3- Pipelined MapReduce 3.1 Pipelining Within a Job
Naïve pipelining: we modified Hadoop to send data directly from map to reduce tasks. The client submits a job, and the JobTracker assigns its map and reduce tasks to the available TaskTracker slots. Each map task pipelines the output of the map function over TCP sockets: as soon as a map output record is produced, the mapper determines which reduce-task partition the record belongs to and sends it there directly.
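A hypothetical sketch of that idea (an illustration of the mechanism, not HOP's actual code): the mapper keeps one open socket per reduce task, partitions each record the same way the hash partitioner would, and writes it straight to the matching socket.

    // Hypothetical sketch of naive pipelining: one open socket per reduce
    // task, and each map output record is sent as soon as it is produced.
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.net.Socket;

    public class NaivePipelineSketch {
      private final DataOutputStream[] reducers;  // one stream per reduce task

      public NaivePipelineSketch(Socket[] sockets) throws IOException {
        reducers = new DataOutputStream[sockets.length];
        for (int i = 0; i < sockets.length; i++) {
          reducers[i] = new DataOutputStream(sockets[i].getOutputStream());
        }
      }

      // Called for every record the map function emits.
      public void collect(String key, String value) throws IOException {
        int partition = (key.hashCode() & Integer.MAX_VALUE) % reducers.length;
        reducers[partition].writeUTF(key);
        reducers[partition].writeUTF(value);
      }
    }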

3- Pipelined MapReduce Refinements
Naïve pipelining suffers from several practical problems:
- Problem 1: There may not be enough free slots to schedule every task in a new job, and a large number of TCP connections is needed. Solution: the map task writes its output to disk as usual; once the reduce task is assigned a slot, it pulls the records from the map task. To limit the number of TCP connections, each reducer is configurable to pull data from only a certain number of mappers at once.
- Problem 2: The map function was invoked by the same thread that wrote output records to the pipeline sockets, so whenever the network blocked, the mapper was prevented from doing useful work. Solution: use a separate thread; the map thread stores its output in an in-memory buffer, and a second thread sends the buffered data to the connected reducers (see the sketch below).
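A minimal sketch of that decoupling, assuming a hypothetical sendToReducer method; it illustrates the threading pattern, not HOP's implementation:

    // Hypothetical sketch of decoupling map execution from network sends:
    // the map thread enqueues records into a bounded in-memory buffer, and
    // a dedicated sender thread drains the buffer to the pipeline sockets.
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class BufferedSenderSketch {
      private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(10_000);

      // Called by the map thread; blocks only when the buffer is full.
      public void emit(String record) throws InterruptedException {
        buffer.put(record);
      }

      // Runs on a separate thread so slow sockets never stall the map function.
      public void startSender() {
        Thread sender = new Thread(() -> {
          try {
            while (true) {
              sendToReducer(buffer.take());  // hypothetical network send
            }
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
        sender.setDaemon(true);
        sender.start();
      }

      private void sendToReducer(String record) {
        /* write the record to the appropriate pipeline socket */
      }
    }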

3- Pipelined MapReduce Granularity of Map Output
Another problem with the naïve design is that it eagerly sends each record as soon as it is produced, which prevents the use of map-side combiners. Instead of sending the buffer contents to reducers directly, we wait for the buffer to grow to a threshold size; the mapper then applies the combiner function and writes the buffer to disk using the spill file format.
When a map task generates a new spill file, it first queries the TaskTracker for the number of unsent spill files. If this number grows beyond a certain threshold, the mapper accumulates multiple spill files. Once the queue of unsent spill files exceeds the threshold, the map task merges and combines the accumulated spill files into a single file, and then registers its output with the TaskTracker.
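The control flow of this adaptive policy can be sketched as follows; the interface names and threshold value are hypothetical illustrations of the behavior described above, not HOP's API:

    // Hypothetical sketch of the adaptive spill policy: pipeline spills
    // while the reducers keep up; fall back to accumulate-merge-combine
    // when too many spills are waiting to be sent.
    public class SpillPolicySketch {
      private static final int UNSENT_THRESHOLD = 3;  // illustrative value

      interface TaskTrackerClient { int unsentSpillCount(); }
      interface SpillStore {
        void sendOldestSpill();            // pipeline a spill to reducers
        void mergeAndCombineAllSpills();   // merge spills, re-apply the combiner
        void registerOutputWithTaskTracker();
      }

      public static void onNewSpill(TaskTrackerClient tt, SpillStore spills) {
        if (tt.unsentSpillCount() <= UNSENT_THRESHOLD) {
          spills.sendOldestSpill();  // reducers are keeping up: keep pipelining
        } else {
          // Reducers are falling behind: accumulate, then merge + combine,
          // which increases the combiner's benefit at the cost of latency.
          spills.mergeAndCombineAllSpills();
          spills.registerOutputWithTaskTracker();
        }
      }
    }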

3- Pipelined MapReduce 3.2 Pipelining Between Jobs
In the traditional Hadoop architecture, the output of each job (j1, j2, ...) is written to HDFS. Furthermore, the JobTracker cannot schedule a consumer job until the producer job has completed, because scheduling a map task requires knowing the HDFS block locations of the map's input split. In our modified version of Hadoop, the reduce tasks of one job can optionally pipeline their output directly to the map tasks of the next job. We will also introduce "snapshot" outputs, which are published for online aggregation and continuous queries.

3- Pipelined MapReduce Fault Tolerance
Traditional fault tolerance algorithms for pipelined dataflow systems are complex. The HOP approach is to both write to disk and pipeline:
- Producers write data into an in-memory buffer.
- The in-memory buffer is periodically spilled to disk.
- Spills are also sent to consumers.
- Consumers treat pipelined data as "tentative" until the producer is known to have completed.
- Fault tolerance is achieved via task restart; tentative output is discarded.

4- Online Aggregation Although MapReduce was originally designed as a batch-oriented system, it is often used for interactive data analysis. For example, an interactive user would often prefer a "quick and dirty" approximation over a correct answer that takes much longer to compute. We describe how we extended our pipelined Hadoop implementation to support online aggregation within a single job (Section 4.1) and between multiple jobs (Section 4.2).

4- Online Aggregation 4.1 Single-Job Online Aggregation
In HOP, the data records produced by map tasks are sent to reduce tasks shortly after each record is generated. A snapshot is the output of a reduce task at a certain point in time. What matters is how accurate a snapshot is, i.e. how closely the snapshot coincides with the correct final answer; estimating this is a hard problem.

4- Online Aggregation Snapshots are computed periodically, as new data arrives at each reducer. The user may:
- specify how often snapshots should be computed;
- specify whether to include data from tentative (unfinished) map tasks.
If there are not enough free slots to allow all the reduce tasks in a job to be scheduled, snapshots will not be available for the reduce tasks that are still waiting to be executed. Within a single job, HOP periodically invokes the reduce function at each reduce task on the data available so far; between jobs, it periodically sends a "snapshot" to consumer jobs. A sketch of the snapshot trigger follows.
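One natural way to express "how often" is in terms of job progress, e.g. a snapshot each time another 10% of the input has been processed. The following hypothetical sketch illustrates such a trigger; the class, method names, and interval are assumptions for illustration, not HOP's API:

    // Hypothetical sketch of a progress-based snapshot trigger: the reduce
    // task re-runs the reduce function over the data received so far each
    // time job progress crosses another snapshot interval.
    public class SnapshotTriggerSketch {
      private final double interval;   // e.g. 0.10 => snapshot every 10% progress
      private double nextThreshold;

      public SnapshotTriggerSketch(double interval) {
        this.interval = interval;
        this.nextThreshold = interval;
      }

      // Called as new map output arrives; progress is in [0, 1].
      public boolean shouldSnapshot(double progress) {
        if (progress >= nextThreshold) {
          nextThreshold += interval;
          return true;  // caller applies reduce() to the data seen so far
        }
        return false;
      }
    }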

[Diagram] Single-job online aggregation: map tasks read input file blocks from HDFS and pipeline records to reduce tasks, which write snapshot answers to HDFS.

4- Online Aggregation 4.2 Multi-Job Online Aggregation
This is similar to single-job online aggregation, but the approximate answers are pipelined to the map tasks of the next job (j1, j2, ...). Unfortunately, the output of the reduce function is not monotonic, so a later snapshot cannot simply extend an earlier one; this is why co-scheduling a sequence of jobs is required. The consumer job computes an approximation from the snapshots it receives.

[Diagram] Multi-job online aggregation: Job 1 reducers pipeline snapshot output to Job 2 mappers; Job 2 writes its answer to HDFS.

4- Online Aggregation Fault Tolerance for Multi-Job Online Aggregation: assume a two-job sequence (j1, j2). We consider three cases:
1. A task in j1 fails: handled as discussed earlier for a single job.
2. A task in j2 fails: the system restarts the failed task.
3. Tasks in both jobs fail: a new task in j2 recovers the most recent snapshot from j1.
To handle failures in j1, each new snapshot that j2 receives replaces the previous one, so output that depended on the failed task is superseded.

Stream Processing MapReduce is often applied to streams of data that arrive continuously: click streams, network traffic, web crawl data, and so on. The traditional approach is to buffer the stream and process it in batches, which has two drawbacks: (1) poor latency, and (2) the analysis state must be reloaded for each batch. Instead, HOP can run MapReduce jobs continuously and analyze data as it arrives.

Monitoring In the paper's monitoring experiment, a thrashing host was detected very rapidly, notably faster than the 5-second TaskTracker-JobTracker heartbeat cycle that stock Hadoop uses to detect straggler tasks. We envision using such alerts for early detection of stragglers within a MapReduce job.

5- Conclusion HOP extends the applicability of the MapReduce model to pipelined behaviors, while preserving the simple programming model and fault tolerance of a full-featured MapReduce framework. Future topics include:
- scheduling;
- using MapReduce-style programming for even more interactive applications.