MARISSA: MApReduce Implementation for Streaming Science Applications 作者 : Fadika, Z. ; Hartog, J. ; Govindaraju, M. ; Ramakrishnan, L. ; Gunter, D. ; Canon,

Slides:



Advertisements
Similar presentations
MapReduce Simplified Data Processing on Large Clusters
Advertisements

 Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware  Created by Doug Cutting and.
Based on the text by Jimmy Lin and Chris Dryer; and on the yahoo tutorial on mapreduce at index.html
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
MapReduce Online Veli Hasanov Fatih University.
EHarmony in Cloud Subtitle Brian Ko. eHarmony Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom.
SkewTune: Mitigating Skew in MapReduce Applications
Spark: Cluster Computing with Working Sets
Hadoop: The Definitive Guide Chap. 2 MapReduce
Piccolo – Paper Discussion Big Data Reading Group 9/20/2010.
Authors: Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox Publish: HPDC'10, June 20–25, 2010, Chicago, Illinois, USA ACM Speaker: Jia Bao Lin.
Distributed Computations
Distributed Computations MapReduce
CPS216: Advanced Database Systems (Data-intensive Computing Systems) How MapReduce Works (in Hadoop) Shivnath Babu.
Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
U.S. Department of the Interior U.S. Geological Survey David V. Hill, Information Dynamics, Contractor to USGS/EROS 12/08/2011 Satellite Image Processing.
A Brief Overview by Aditya Dutt March 18 th ’ Aditya Inc.
Applying Twister to Scientific Applications CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.
MapReduce.
By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe.
Research on cloud computing application in the peer-to-peer based video-on-demand systems Speaker : 吳靖緯 MA0G rd International Workshop.
資訊工程系智慧型系統實驗室 iLab 南台科技大學 1 Optimizing Cloud MapReduce for Processing Stream Data using Pipelining 出處 : 2011 UKSim 5th European Symposium on Computer Modeling.
THE HOG LANGUAGE A scripting MapReduce language. Jason Halpern Testing/Validation Samuel Messing Project Manager Benjamin Rapaport System Architect Kurry.
SOFTWARE SYSTEMS DEVELOPMENT MAP-REDUCE, Hadoop, HBase.
Süleyman Fatih GİRİŞ CONTENT 1. Introduction 2. Programming Model 2.1 Example 2.2 More Examples 3. Implementation 3.1 ExecutionOverview 3.2.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
HAMS Technologies 1
University of Minnesota Running MapReduce in Non-Traditional Environments Abhishek Chandra Associate Professor Department of Computer Science and Engineering.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
W HAT IS H ADOOP ? Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity.
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
f ACT s  Data intensive applications with Petabytes of data  Web pages billion web pages x 20KB = 400+ terabytes  One computer can read
Optimizing Cloud MapReduce for Processing Stream Data using Pipelining 作者 :Rutvik Karve , Devendra Dahiphale , Amit Chhajer 報告 : 饒展榕.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Optimizing Cloud MapReduce for Processing Stream Data using Pipelining 2011 UKSim 5th European Symposium on Computer Modeling and Simulation Speker : Hong-Ji.
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.
 Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu Santhosh Kumar Saminathan.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
HADOOP Carson Gallimore, Chris Zingraf, Jonathan Light.
MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
{ Tanya Chaturvedi MBA(ISM) Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Distributed File System. Outline Basic Concepts Current project Hadoop Distributed File System Future work Reference.
Next Generation of Apache Hadoop MapReduce Owen
Distributed Process Discovery From Large Event Logs Sergio Hernández de Mesa {
INTRODUCTION TO HADOOP. OUTLINE  What is Hadoop  The core of Hadoop  Structure of Hadoop Distributed File System  Structure of MapReduce Framework.
1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.
BIG DATA/ Hadoop Interview Questions.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
TensorFlow– A system for large-scale machine learning
Hadoop.
Introduction to Distributed Platforms
Introduction to MapReduce and Hadoop
Software Engineering Introduction to Apache Hadoop Map Reduce
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Ministry of Higher Education
Applying Twister to Scientific Applications
Hadoop Basics.
Hadoop Technopoints.
Map Reduce, Types, Formats and Features
Presentation transcript:

MARISSA: MApReduce Implementation for Streaming Science Applications 作者 : Fadika, Z. ; Hartog, J. ; Govindaraju, M. ; Ramakrishnan, L. ; Gunter, D. ; Canon, R. 報告者 : 巫佳哲 日期 : 2013/5/28

OUTLINE 1.INTRODUCTION 2.RUNNING SCIENTIFIC APPLICATIONS WITH MAPREDUCE 3.PERFORMANCE RESULTS 4.RELATED WORK 5.CONCLUSIONS 2

1.INTRODUCTION Evolving scientific instruments and the rapid sophistication of computing systems have resulted in large-scale scientific simulations and data analysis workflows As more and more scientific data is generated, our ability to effectively manage and process such data also needs to evolve 3

Programming model for processing large data sets (MapReduce) Characteristics of the model: – Relaxed synchronization constraints – Locality optimization – Fault-tolerance – Load balancing 4

Apache Hadoop, the most widely used MapReduce framework, provides these same advantages Hadoop native does not provide support for application source code written in languages other than Java 5

In this paper, we use the word streaming as in the context of Hadoop, which provides a mechanism to run non-Java applications from within the context of a Java-based MapReduce framework Hadoop MapReduce relies on the Hadoop Distributed File System (HDFS), a non POSIX compliant filesystem, for its data and cluster operations 6

We present in this paper, MARISSA (MApReduce Implementation for Streaming Science Applications), a MapReduce framework offering better performance and faster application turnaround time than Hadoop streaming, while capable of fully supporting a variety of POSIX compliant file systems 7

2.RUNNING SCIENTIFIC APPLICATIONS WITH MAPREDUCE Apache Hadoop Open-source MapReduce implementation in Java Easy scalability Built-in I/O management – Hadoop Distributed File System(HDFS) – Data distribution, management and replication Load balancing – Handles stragglers 8

Fault tolerance – Commodity hardware – Heartbeats – Speculative execution and data replication Hadoop Streaming – Create and run MapReduce jobs with any executable or script as the mapper and/or the reducer 9

Apache Hadoop Challenges – Streaming imposes a performance penalty – Lack of support for multiple executables – Lack of support for individual input – Lack of iterative support – Lack of support for user specified redundant tasks – HDFS is not POSIX compliant – Hadoop streaming requires STDIN/STDOUT 10

Hadoop Streaming requires STDIN/STDOUT – HDFS cannot be read or written to without the HDFS API calls – Scientific applications tend to work with files rather than STDIN/STDOUT Hadoop lacks of support for individual input files – Scientific applications tend to work with files rather than input lines Hadoop Streaming imposes a performance penalty – Experience shows poor processing performance 11

The ‘C” vs Java baseline experiments in the table: – “C” was 20% faster than Java for the same application – Streaming counterpart: At 1.6TB the “C” streaming version is about 19% slower than Hadoop native 12

This graph shows processing of 1.6TB of Wikipedia data Filtering to isolate embedded indices using 75 nodes in both cases Streaming performance as not a factor of I/O speeds In fact ‘C” streaming performs faster I/O than Hadoop native The performance losses occur in processing 13

MARISSA Architecture The shared file system is used for both input and output placement. Executables accessible by individual workers can be used for map or reduce functions 14

Architecture and Design of MARISSA MARISSA allows the use of the shared-disk file systems – The framework leverages the data visibility offered by the shared-disk file system – No need of extra layer like HDFS Providing input to applications through standard I/O and through file access – STDIN/STDOUT is not a must – Scheduling each node to spawn a process tied to the specified input file or dataset – Model permits redundant task execution Each worker can execute different binaries – Same binary can work with different arguments A node specific fault-tolerance mechanism, rather than an input specific one – Node failure is inherently decoupled from data availability – Re-execution simply consists of task re-assignment 15

3. PERFORMANCE RESULTS In this section we compare the performance of MARISSA and Hadoop streaming for two classes of scientific applications: – BLAST and K-means clustering 16

80-core Hadoop and MARISSA clusters K-means clustering on 0.1 Billion to 3.4 Billion 2D-records MARISSA here, given the CPU-intensive nature of the application performs up to 47% faster 17

K-means clustering 858 million 2D-records cluster sizes ranging from 20 to 320 cores MARISSA performs up to 46.8% faster than with 20 cores, and up to 47% with 400 cores 18

160-core Hadoop Streaming and MARISSA clusters Performing gene sequencing and similarity comparisons using BLAST Each framework processes from 10,240 to 327,680 genetic sequences, using 320 cores 19

MARISSA and Hadoop clusters running BLAST Performing similarity comparisons on 40,960 genetic sequences The number of cores used here varies from 10 to 160 cores 20

Hadoop Streaming and MARISSA faced with defaulting or dying nodes while running BLAST Processing genetic sequences Both clusters start with 20 nodes, and progressively, in separate runs lose 2, 4, and 6 nodes The Hadoop experiences a total failure beyond a 4 node loss This occurs as worker nodes also hold vital data for processing 21

Hadoop streaming and MARISSA slowdown given node losses while running BLAST Both frameworks start with 20 nodes and process genetic sequences Where comparable, Hadoop performs up to three times as slow, while MARISSA only slows down by a factor of 2 22

4. RELATED WORK MapReduce comparisons – Twister – LEMO-MR MapReduce applications in Science – CGL-MapReduce 23

5. CONCLUSIONS MARISSA allows for: – The ability to work with existing file systems – The ability to work with no STDIN/STDOUT applications – The ability to run different executables on each node of the cluster – The ability to run a same executables with different arguments – The ability to run different input datasets on different nodes. – The ability for all or a subset of the nodes to run duplicates of the same task with same input without need of data replication 24

END 謝謝 25