MapReduce: Data Distribution for Reduce

MapReduce: Data Distribution for Reduce
Heena Dave and Bhavy Bhut

Data Distribution in MapReduce
After the map workers emit key-value pairs, the pairs are grouped by key and partitioned into R chunks; each partition is assigned to one reduce worker.
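
In Hadoop this routing is done by a Partitioner: by default every key is sent to reducer hash(key) mod R. The sketch below mirrors the behavior of Hadoop's stock HashPartitioner (the class name is ours, used only for illustration):

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Hash-based routing as used by default in Hadoop MapReduce: every (key, value)
// pair emitted by the mappers is sent to reduce worker hash(key) mod R.
public class DefaultStylePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so the result is a valid partition index in [0, R).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```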

Problems with existing data distribution
Consider hash-based load balancing versus optimal load balancing for three reduce workers. The hash function assigns keys a and d to the first machine, b and e to the second, and only c to the third. As a result, the first machine carries a maximum load of 8 while the third carries a load of only 3. Compared with the optimal assignment, hash-based load balancing is clearly ineffective when the key distribution is skewed.
Reference: Yanfang Le, Jiangchuan Liu, Funda Ergün, Dan Wang, "Online Load Balancing for MapReduce with Skewed Data Input".
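
To make the skew concrete, the toy program below assigns made-up frequencies to the keys a through e (chosen only so the totals match the loads quoted above, 8 versus 3; they are not taken from the cited paper) and compares hash placement against a simple greedy least-loaded placement:

```java
import java.util.*;

// Toy illustration of load skew under hash partitioning, with hypothetical
// per-key frequencies; three reduce workers as in the slide's scenario.
public class SkewDemo {
    public static void main(String[] args) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        freq.put("a", 5); freq.put("b", 4); freq.put("c", 3);
        freq.put("d", 3); freq.put("e", 2);
        int r = 3;

        // Default-style placement: reducer index = hash(key) mod r.
        int[] hashLoad = new int[r];
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            hashLoad[(e.getKey().hashCode() & Integer.MAX_VALUE) % r] += e.getValue();
        }

        // Greedy baseline: give the next-heaviest key to the least-loaded reducer.
        int[] greedyLoad = new int[r];
        freq.values().stream().sorted(Comparator.reverseOrder()).forEach(c -> {
            int min = 0;
            for (int i = 1; i < r; i++) if (greedyLoad[i] < greedyLoad[min]) min = i;
            greedyLoad[min] += c;
        });

        System.out.println("hash partitioning: " + Arrays.toString(hashLoad));
        System.out.println("greedy balancing:  " + Arrays.toString(greedyLoad));
    }
}
```

With these made-up counts the hash rule yields per-reducer loads of [3, 8, 6], while the greedy baseline yields [5, 6, 6].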

Our strategy for data distribution
Distribute data evenly among the reducers. Sort the map keys so that mapper output for the same key is likely to be processed on the same reducer, unless that reducer would become overloaded. The output of a single mapper may be distributed among different reducers. A single final reducer combines the output of all the reducers.
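
A custom Hadoop Partitioner is the natural place to implement such a plan. The sketch below is one possible realization of this strategy, not the authors' code: it assumes a key-to-reducer table has been computed beforehand (for example, from sampled key frequencies packed in sorted-key order into R roughly equal-load buckets) and is passed through the job configuration under a property name invented for this sketch; keys missing from the table fall back to ordinary hashing.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical load-aware partitioner in the spirit of the strategy above:
// a precomputed key -> reducer table is read from the job configuration,
// so the same key normally lands on the same reducer while the table itself
// is built to keep the per-reducer load roughly even.
public class LoadAwarePartitioner<V> extends Partitioner<Text, V> implements Configurable {

    // Property name is an assumption of this sketch, not a Hadoop-defined key.
    public static final String ASSIGNMENT_PROP = "loadaware.partitioner.assignment";

    private final Map<String, Integer> assignment = new HashMap<>();
    private Configuration conf;

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        // Expected format: "keyA=0,keyB=2,keyC=1"
        for (String entry : conf.get(ASSIGNMENT_PROP, "").split(",")) {
            String[] kv = entry.split("=");
            if (kv.length == 2) {
                assignment.put(kv[0], Integer.parseInt(kv[1]));
            }
        }
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public int getPartition(Text key, V value, int numReduceTasks) {
        Integer target = assignment.get(key.toString());
        if (target != null) {
            return target % numReduceTasks;   // planned, load-balanced placement
        }
        // Key not in the plan: fall back to the usual hash rule.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Hadoop instantiates partitioners through ReflectionUtils, so a partitioner that implements Configurable receives the job configuration automatically; that is how the assignment table would reach every map task.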

System Requirements
Hardware
  Processor: 1.8 GHz or higher
  RAM: 8 GB or higher
  Network: 1 Gb onboard
Software
  Library: Apache Hadoop, used for distributed data storage, with YARN for job scheduling
  Java: Java SDK 6 or higher, as required by Hadoop
  IDE: Eclipse IDE for source code
  DBMS: not required

Algorithm
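
As one possible way to wire the strategy together in Hadoop (a sketch, not the algorithm exactly as presented on this slide), the driver below chains two jobs: the first spreads mapper output across R reducers using the LoadAwarePartitioner sketch above, and the second runs with a single reducer that combines the partial results. The mapper and reducer classes, paths, and hard-coded assignment string are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of the overall flow: job 1 spreads mapper output across R reducers
// via the LoadAwarePartitioner sketch; job 2 uses a single final reducer
// to combine the partial results.
public class BalancedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical key -> reducer plan; in practice it would come from sampling.
        conf.set(LoadAwarePartitioner.ASSIGNMENT_PROP, "a=0,b=1,c=2,d=1,e=2");

        Job distribute = Job.getInstance(conf, "balanced distribution");
        distribute.setJarByClass(BalancedJobDriver.class);
        // Application-specific mapper, partial reducer, and key/value classes go here.
        distribute.setPartitionerClass(LoadAwarePartitioner.class);
        distribute.setNumReduceTasks(3);  // R = 3 reduce workers in this sketch
        FileInputFormat.addInputPath(distribute, new Path(args[0]));
        FileOutputFormat.setOutputPath(distribute, new Path(args[1] + "/partial"));
        if (!distribute.waitForCompletion(true)) System.exit(1);

        Job combine = Job.getInstance(conf, "final combine");
        combine.setJarByClass(BalancedJobDriver.class);
        // Application-specific combining reducer goes here.
        combine.setNumReduceTasks(1);     // the single final reducer
        FileInputFormat.addInputPath(combine, new Path(args[1] + "/partial"));
        FileOutputFormat.setOutputPath(combine, new Path(args[1] + "/final"));
        System.exit(combine.waitForCompletion(true) ? 0 : 1);
    }
}
```

Setting numReduceTasks(1) on the second job is what realizes the single final combining reducer from the strategy; everything else about the two jobs is application-specific.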

Advantages over Traditional MapReduce
Effective utilization of reducers: because mapper output is evenly distributed, all reducers finish their tasks at almost the same time. No reducer has a disproportionate number of keys to process, and no reducer sits idle for a long time waiting for a busy reducer to finish its task.

Performance Comparison

Future Work
Implement the algorithm on a multi-node architecture and compare the results with Hadoop MapReduce and other MapReduce frameworks. Evaluate performance on distributed systems.