Hadoop System simulation with Mumak Fei Dong, Tianyu Feng, Hong Zhang Dec 8, 2010.

Slides:



Advertisements
Similar presentations
The map and reduce functions in MapReduce are easy to test in isolation, which is a consequence of their functional style. For known inputs, they produce.
Advertisements

MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
MapReduce Simplified Data Processing on Large Clusters
Big Data + SDN SDN Abstractions. The Story Thus Far Different types of traffic in clusters Background Traffic – Bulk transfers – Control messages Active.
MapReduce.
Transparent and Flexible Network Management for Big Data Processing in the Cloud Anupam Das Curtis Yu Cristian Lumezanu Yueping Zhang Vishal Singh Guofei.
Digital Library Service – An overview Introduction System Architecture Components and their functionalities Experimental Results.
SDN + Storage.
SkewReduce YongChul Kwon Magdalena Balazinska, Bill Howe, Jerome Rolia* University of Washington, *HP Labs Skew-Resistant Parallel Processing of Feature-Extracting.
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
EHarmony in Cloud Subtitle Brian Ko. eHarmony Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom.
Developing a MapReduce Application – packet dissection.
HadoopDB Inneke Ponet.  Introduction  Technologies for data analysis  HadoopDB  Desired properties  Layers of HadoopDB  HadoopDB Components.
UC Berkeley Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur *, Joydeep Sen Sarma *, Scott Shenker, Ion Stoica 1 RAD Lab, * Facebook Inc.
Cloud Computing Resource provisioning Keke Chen. Outline  For Web applications statistical Learning and automatic control for datacenters  For data.
Meeting Service Level Objectives of Pig Programs Zhuoyao Zhang, Ludmila Cherkasova, Abhishek Verma, Boon Thau Loo University of Pennsylvania Hewlett-Packard.
Piccolo – Paper Discussion Big Data Reading Group 9/20/2010.
 Need for a new processing platform (BigData)  Origin of Hadoop  What is Hadoop & what it is not ?  Hadoop architecture  Hadoop components (Common/HDFS/MapReduce)
Fault-tolerant Adaptive Divisible Load Scheduling Xuan Lin, Sumanth J. V. Acknowledge: a few slides of DLT are from Thomas Robertazzi ’ s presentation.
WSN Simulation Template for OMNeT++
A Hadoop MapReduce Performance Prediction Method
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Hadoop, Hadoop, Hadoop!!! Jerome Mitchell Indiana University.
Hadoop: The Definitive Guide Chap. 8 MapReduce Features
Hadoop & Cheetah. Key words Cluster  data center – Lots of machines thousands Node  a server in a data center – Commodity device fails very easily Slot.
GreenHadoop: Leveraging Green Energy in Data-Processing Frameworks Íñigo Goiri, Kien Le, Thu D. Nguyen, Jordi Guitart, Jordi Torres, and Ricardo Bianchini.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
Edge Based Cloud Computing as a Feasible Network Paradigm(1/27) Edge-Based Cloud Computing as a Feasible Network Paradigm Joe Elizondo and Sam Palmer.
Research on cloud computing application in the peer-to-peer based video-on-demand systems Speaker : 吳靖緯 MA0G rd International Workshop.
A Dynamic MapReduce Scheduler for Heterogeneous Workloads Chao Tian, Haojie Zhou, Yongqiang He,Li Zha 簡報人:碩資工一甲 董耀文.
Pepper: An Elastic Web Server Farm for Cloud based on Hadoop Author : S. Krishnan, J.-S. Counio Date : Speaker : Sian-Lin Hong IEEE International.
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
Location-aware MapReduce in Virtual Cloud 2011 IEEE computer society International Conference on Parallel Processing Yifeng Geng1,2, Shimin Chen3, YongWei.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
HAMS Technologies 1
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
f ACT s  Data intensive applications with Petabytes of data  Web pages billion web pages x 20KB = 400+ terabytes  One computer can read
HAMS Technologies 1
Whirlwind Tour of Hadoop Edward Capriolo Rev 2. Whirlwind tour of Hadoop Inspired by Google's GFS Clusters from systems Batch Processing High.
Page 1 Online Aggregation for Large MapReduce Jobs Niketan Pansare, Vinayak Borkar, Chris Jermaine, Tyson Condie VLDB 2011 IDS Fall Seminar
GreenSched: An Energy-Aware Hadoop Workflow Scheduler
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
CARDIO: Cost-Aware Replication for Data-Intensive workflOws Presented by Chen He.
MC 2 : Map Concurrency Characterization for MapReduce on the Cloud Mohammad Hammoud and Majd Sakr 1.
Apache Hadoop Daniel Lust, Anthony Taliercio. What is Apache Hadoop? Allows applications to utilize thousands of nodes while exchanging thousands of terabytes.
Matchmaking: A New MapReduce Scheduling Technique
MROrder: Flexible Job Ordering Optimization for Online MapReduce Workloads School of Computer Engineering Nanyang Technological University 30 th Aug 2013.
Youngil Kim Awalin Sopan Sonia Ng Zeng.  Introduction  Concept of the Project  System architecture  Implementation – HDFS  Implementation – System.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
A Two-phase Execution Engine of Reduce Tasks In Hadoop MapReduce XiaohongZhang*GuoweiWang* ZijingYang*YangDing School of Computer Science and Technology.
Youngil Kim Awalin Sopan Sonia Ng Zeng.  Introduction  System architecture  Implementation – HDFS  Implementation – System Analysis ◦ System Information.
Software-defined network(SDN)
1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.
BIG DATA/ Hadoop Interview Questions.
Interaction and Animation on Geolocalization Based Network Topology by Engin Arslan.
Big Data is a Big Deal!.
Curator: Self-Managing Storage for Enterprise Clusters
Chapter 10 Data Analytics for IoT
Hadoop MapReduce Framework
Calculation of stock volatility using Hadoop and map-reduce
Software Engineering Introduction to Apache Hadoop Map Reduce
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Group 15 Swathi Gurram Prajakta Purohit
Distributed Systems CS
Presentation transcript:

Hadoop System simulation with Mumak Fei Dong, Tianyu Feng, Hong Zhang Dec 8, 2010

Agenda Objective Comparison between MRPerf and Mumak Modifications to Mumak Results and discussion Conclusion

Objective Large scale distributed system has enormous amount of parameters. Running time of a user program depends non- linearly on these parameters. Predict the running time under various settings to help user choose the “optimal” setting. We start by varying the most basic parameter: cluster size.

MRPerf and Mumak MRPerf – Build upon a network simulator – Calculate the task running time and network delay from physical parameters – Implemented the Hadoop system in TCL – Flexible in simulation

MRPerf and Mumak Map slots per node Reduce slots per node Running Time 4 nodes double rack data center (Chunk Size = 64M) By MRPerf

MRPerf and Mumak 4 nodes (Chunk Size = 64M) By Mumak

MRPerf and Mumak Mumak – Inherit the JobTracker class from Hadoop and only defines the simulation interface – Use trace file to build the cluster topology / job story, then feed it into simulator – Can only reproduce previous finished experiment – Designed to verify/debug Hadoop system design – Only simulate the Map/Reduce tasks, no sort phase and shuffle phase

MRPerf and Mumak The approach taken by MRPerf is better – Take in parameters to estimate running time – Can make predictions MRPerf is simulating their implementation of Hadoop The design of Mumak is better – Inherit source code from Hadoop – Easy to understand and to extend We decide to take the good parts of MRPerf and then implement them in the framework of Mumak – Modify the Rumen log to change the parameters – Modify Mumak source code to add network simulator

Implementation Simulate a different cluster size – Hack the rumen log, change data replication factor/ locality – Modify the topology, add in / delete nodes, for example, from 2 slave nodes to 6 slave nodes. – The job tracker will assign the tasks to different nodes.

Implementation Simulate network delay – We defined a simple network simulator interface – Modified the source code of Mumak to add in the network delay – Actual the network delay can be ignored

Results and Discussion

Limitations and future work – Sort phase time not included – Only used single rack topology – Prediction is not always consistent for the same job with the same configuration

Conclusion Our objective is to predict the running time with different parameters We take the methods of MRPerf and implemented it on Mumak To have more flexible and accurate prediction, more modification to Mumak is needed – Independent from trace file – Solve the unstable problem

Questions?