Terasort Using SAGA-MapReduce Given by: Sharath Maddineni

Presentation transcript:

Terasort Using SAGA-MapReduce Given by: Sharath Maddineni CCT: Center for Computation & Technology

Why Terasort?
Sorting large datasets is a common step in scientific computations.
Google processes around 20 petabytes of data per day using MapReduce.
Sorting huge datasets of web pages, as Google may do, makes searching and retrieval faster.

Introduction
Sort Benchmark (http://sortbenchmark.org/)
Google won the 2010 competition; Yahoo won with Hadoop in 2009.
However, Google's sort is limited to the Google File System (GFS), and Yahoo's is tied to the Hadoop Distributed File System (HDFS).
SAGA-MapReduce is infrastructure independent.

SAGA MapReduce Execution Overview
The master is started from an executable linked against SAGA-MapReduce and creates an advert directory.
The master uses the InputFormat specified in the JobDescription to chunk the input data.
The master spawns workers on the host machines listed in the configuration file, using the SAGA Job API.
Each worker puts its status information into the advert directory and communicates with the master through this advert service.
Workers process the chunks assigned by the master using Map() and partition the data according to the partition function.
When all chunks have been mapped, the master moves to the reduce phase.
In the reduce phase, the master assigns sets of partitions to idle workers to be reduced.
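The steps above can be sketched in a single process. This is only an illustration of the chunk, map, partition, and reduce order described on the slide; it does not use the SAGA Job API or the advert service, and every name in it (chunk_input, run_job, word_map, word_partition, word_reduce) is hypothetical.

```python
from collections import defaultdict


def chunk_input(records, chunk_size):
    """Split the input into fixed-size chunks, as the master does before mapping."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def run_job(records, map_fn, partition_fn, reduce_fn, num_partitions, chunk_size):
    """Run map over every chunk, then reduce every partition (no real workers)."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]

    # Map phase: in SAGA-MapReduce each chunk would be assigned to a worker;
    # here the chunks are simply processed in a loop.
    for chunk in chunk_input(records, chunk_size):
        for record in chunk:
            for key, value in map_fn(record):
                partitions[partition_fn(key, num_partitions)][key].append(value)

    # Reduce phase: starts only after all chunks have been mapped.
    results = {}
    for partition in partitions:
        for key in sorted(partition):
            results[key] = reduce_fn(key, partition[key])
    return results


def word_map(line):
    """Emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]


def word_partition(key, num_partitions):
    """Deterministic partition function: sum of byte values modulo partition count."""
    return sum(key.encode("utf-8")) % num_partitions


def word_reduce(key, values):
    """Sum the counts collected for one key."""
    return sum(values)


counts = run_job(["a b a", "b c", "a"], word_map, word_partition, word_reduce,
                 num_partitions=2, chunk_size=2)
```

Word counting is used here only because it is the smallest job that exercises all three user-supplied functions; Terasort swaps in an identity-style map and a range-based partition function.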


Terasort
The Sort Benchmark provides a "gensort" program to generate data records.
Data format: each record is 100 bytes of ASCII values, made up of a 10-byte random key followed by the value in the remaining 90 bytes.
10^10 records of 100 bytes each make up one terabyte of data.
All records are sorted by the 10-byte key.
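A small sketch of the record layout and the size arithmetic may help. It assumes only what the slide states (100-byte ASCII records whose first 10 bytes are the key); make_record is a hypothetical stand-in for gensort and does not reproduce gensort's actual value layout.

```python
import random

RECORD_SIZE = 100  # bytes per record, from the slide
KEY_SIZE = 10      # leading bytes that form the sort key


def make_record(rng):
    """Build one 100-byte record: random printable 10-byte key plus filler value.

    Hypothetical stand-in for gensort; the filler is not gensort's real payload.
    """
    key = bytes(rng.randrange(32, 127) for _ in range(KEY_SIZE))
    value = b"x" * (RECORD_SIZE - KEY_SIZE)
    return key + value


def split_records(data):
    """Cut a byte string into consecutive 100-byte records."""
    if len(data) % RECORD_SIZE != 0:
        raise ValueError("input is not a whole number of 100-byte records")
    return [data[i:i + RECORD_SIZE] for i in range(0, len(data), RECORD_SIZE)]


def record_key(record):
    """Return the 10-byte key that determines the sort order."""
    return record[:KEY_SIZE]


rng = random.Random(0)
data = b"".join(make_record(rng) for _ in range(1000))
sorted_records = sorted(split_records(data), key=record_key)

# 10^10 records * 100 bytes = 10^12 bytes, i.e. one terabyte.
total_bytes = 10**10 * RECORD_SIZE
```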

Terasort SAGA Map-Reduce
Similar to SAGA-MapReduce, except that the partition list is generated before launching the master.
The partition list ensures that each key emitted in the map phase goes into the partition that covers its range.
This spreads the keys evenly across all partitions.
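The slide does not say how the partition list is produced. One common approach, assumed here, is to sample the keys, sort the sample, and take evenly spaced split keys; a binary search then maps each key to the partition covering its range. The names build_partition_list and partition_for are hypothetical.

```python
import bisect
import random


def build_partition_list(sample_keys, num_partitions):
    """Pick num_partitions - 1 split keys at even intervals of a sorted sample."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, split_keys):
    """Return the index of the partition whose key range contains key."""
    return bisect.bisect_right(split_keys, key)


rng = random.Random(1)
keys = [bytes(rng.randrange(32, 127) for _ in range(10)) for _ in range(5000)]

# The partition list is built before the job starts, from a sample of the keys.
split_keys = build_partition_list(rng.sample(keys, 500), 5)

buckets = [[] for _ in range(5)]
for key in keys:
    buckets[partition_for(key, split_keys)].append(key)

# Sorting each partition and concatenating them in order yields a global sort.
merged = [key for bucket in buckets for key in sorted(bucket)]
```

Because every key in partition i is smaller than every key in partition i + 1, the reduce outputs need no final merge step; how evenly the partitions fill depends on how representative the sample is.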


Distributed Workers for Terasort
The Cyder and Cyd01 machines are used as workers.
Prerequisites:
Passwordless SSH login from the master machine to the worker machines.
FUSE-mount the input and output data locations on each machine.

Results
Increasing input data size with a constant number of workers (3); both master and workers on Cyd01.
Operating system: Red Hat 5.5
Architecture: x86_64
Memory: 8 GB
CPU type: Dual-Core AMD Opteron
Compiler version: gcc 4.4.3; Boost version: 1.40
X-axis: data set size in MB
Y-axis: time to solution in seconds

Results cont…
Constant input file size (400 MB, 6 chunks, 5 partitions) with an increasing number of workers.
Operating system: Ubuntu 10.04
Architecture: x86_64 AMD
Memory: 63 GB
CPU type: 6-Core AMD Opteron
Compiler version: gcc 4.4.3; Boost version: 1.40
X-axis: number of workers
Y-axis: time to solution in seconds

Results cont…
Distributed workers (2 workers, 1 chunk of 10 MB, 5 partitions); Cyd01 and Cyder are used.
Case 1: master, workers, and data on the same machine
Case 2: remote master; data and workers on the same machine
Case 3: remote master; remote data for one worker and local data for the other
Case 4: remote master; remote data for all workers
X-axis: cases
Y-axis: time to solution in seconds

SAGA Map-Reduce Usability
Usable by anyone with some familiarity with C++ and SAGA and prior knowledge of MapReduce.
Sufficiently documented; however, some important details about mounting the input and output locations for distributed runs were missing.
Tested on RHEL 4 and 5 and Ubuntu 10.04, with SAGA 1.4.1 and 1.5 and Boost 1.40.

Future Work
Currently, SAGA-MapReduce only supports launching workers by forking on localhost or over SSH.
SAGA-BigJob can be used to launch the workers instead.
This helps in running MapReduce distributed over LONI machines.
However, mounting directories is a problem on LONI.

Thank You