U N I V E R S I T Y  O F  S O U T H  F L O R I D A

The Hadoop Alternative
Larry Moore 1, Zach Fadika 2, Dr. Madhusudhan Govindaraju 2
1 Department of Computer Science and Engineering; 2 Binghamton University

ABSTRACT

Apache Hadoop has a longer runtime than necessary, and on smaller tasks it is especially sluggish. It generates excess network traffic, fails to fully utilize multi-core architectures, and needs a more time-efficient model for I/O tasks. A new implementation that partially models the structure of Hadoop but avoids these problems is possible, with a focus on speeding up tasks that run for minutes or hours rather than days or months. The new implementation excludes unnecessary fault tolerance and redundant data copies; it adds an asynchronous parallel design and more time-efficient use of memory, disk access, and network bandwidth. As the number of nodes increases, I/O-intensive grid computing tasks run in Θ(n log n) time with Hadoop, compared with O(log n) for the new model. As the input size increases, the runtime of both is Θ(n), but the alternative's slope is significantly smaller. Purely CPU-intensive tasks were shown to take twice as long as necessary on a cluster of quad-core machines.

HOW IT WORKS

The motivating goal of this research was simply to be able to parse XML files ranging from 25 megabytes to a few gigabytes. The most widely used framework for this type of task, Apache Hadoop, provides this ability with assurance that the job will almost always finish. Hadoop was designed for mission-critical applications, such as those at NASA, where a single run may continue for weeks or months and nodes will often fail, forcing the process to restart. Since modern computers are, for the most part, reliable enough to keep working for a couple of days without fault, we decided not to implement as much fault tolerance as Hadoop. Although Hadoop's design is successful for long-running jobs, it does not perform well for tasks that run for minutes, hours, or a few days. As a test, a simple task with a 7-byte input took just over a minute, and most of that time was spent in the map and reduce phases themselves. A closer look showed that very large packets were being sent regularly to the worker nodes, choking performance.

The tests shown in Figures 4, 7, and 8 took an input file containing one number (0, 1, …, 9) per line; the nodes translate each number into the corresponding word (zero, one, …, nine). This process requires a roughly even mix of I/O and CPU. The largest bottleneck for both the "Alternative" and Hadoop was network bandwidth: all servers used 1-gigabit connections, but large input sizes cost Hadoop much more than they cost the "Alternative". The tests shown were conducted on quad-core Intel Xeon 2.66 GHz machines.

Like Hadoop, the "Alternative" is a Java software framework that supports data-intensive distributed applications. A single large task is split and distributed; each fragment is processed and returned to the master node to be assembled into a single result file. Figures 1 and 2 show this.

Figure 1 – Distributed computing. A single task is broken into several smaller tasks.

Figure 2 – Two examples of how an XML file may be split for 3 and 5 nodes.

Rather than following Hadoop's linear execution, the program was designed to run asynchronously for time efficiency. This model substantially increased CPU and I/O utilization.
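The poster does not publish the framework's source code, but the pipeline it describes (split the input, map each fragment asynchronously, merge the partial results in order on the master) can be sketched in plain Java. The sketch below is a minimal single-process approximation using the number-to-word benchmark as the map step; the class name, method names, and the thread pool standing in for remote worker nodes are all assumptions, not the framework's actual API.

import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

// Minimal single-process sketch of the split / asynchronous-map / ordered-merge
// pipeline described above. All names here are illustrative; the poster does
// not show the framework's real code.
public class AlternativeSketch {

    // The benchmark map step: translate one number per line into its word.
    static final String[] WORDS = {
        "zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"
    };

    static String mapFragment(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            out.append(WORDS[Integer.parseInt(line.trim())]).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        int nodes = Integer.parseInt(args[1]);   // e.g. 3 or 5, as in Figure 2

        // Split the single large task into roughly equal fragments, one per node.
        int chunk = Math.max(1, (lines.size() + nodes - 1) / nodes);
        ExecutorService pool = Executors.newFixedThreadPool(nodes);
        List<Future<String>> parts = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += chunk) {
            List<String> fragment = lines.subList(i, Math.min(i + chunk, lines.size()));
            // Fragments run asynchronously; no fragment blocks another.
            parts.add(pool.submit(() -> mapFragment(fragment)));
        }

        // The master assembles the partial results into a single result file.
        StringBuilder result = new StringBuilder();
        for (Future<String> part : parts) {
            result.append(part.get());           // get() preserves fragment order
        }
        Files.write(Paths.get("result.txt"), result.toString().getBytes());
        pool.shutdown();
    }
}

In the framework itself, each fragment travels to a worker over its own connection rather than to a local thread; that multi-port design is what avoids the single-port blocking attributed to Hadoop in the results below.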
Initially, fault tolerance was kept to a bare minimum for the sake of speed, but we later realized the framework could retain fault tolerance and still be fast, without becoming as sluggish as Hadoop.

Figure 3 (left) – The initial implementation resulted in this structure; later, scheduling was added for flow control, along with additional fault tolerance for problems caused by bad nodes or unavailable ports, and special debugging tools.

RESULTS

Figure 4 – 5 nodes and various input sizes. Both frameworks run in Θ(n), but the "Alternative" grows with a significantly smaller slope.

Figure 7 – With a 25 MB input, Hadoop's runtime converges to about 40 seconds, while the "Alternative" converges to about 1.5 seconds.

Figure 8 – A 400 MB input and a varying number of nodes. Hadoop's runtime was Θ(n log n), compared with O(log n) for the "Alternative" model. As the number of nodes grows large, Hadoop begins to take even longer than it did with fewer nodes.

Some tests not shown here used 8-core machines (Intel Xeon 2.33 GHz), and their results were even more impressive than those shown. Since the "Alternative" used multiple ports to prevent blocking while Hadoop generated extra traffic over a single data-transfer port, the advantage is clear.

Figure 5 – Using a 25 MB input file with one number N per line, Fibonacci(N+25) is calculated recursively. Hadoop used about 40% of the CPU, compared with about 90% for the "Alternative", giving the "Alternative" up to a 2.5x speedup. The asynchronous design provided greater parallelization.
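For reference, the CPU-bound kernel that Figure 5 describes can be reconstructed as a naive recursive Fibonacci; the poster does not show the actual benchmark code, so the names below are hypothetical. The deliberately exponential recursion is what makes the task almost purely CPU-bound: each input number n triggers on the order of 1.6^(n+25) calls while touching almost no I/O.

// A plausible reconstruction of the CPU-bound kernel behind Figure 5;
// the poster does not show the actual benchmark code. For each input
// number n (0-9), a node computes Fibonacci(n + 25) with the naive
// exponential-time recursion, so almost no time is spent on I/O.
public class FibKernel {

    static long fib(int n) {
        if (n < 2) return n;            // base cases: fib(0) = 0, fib(1) = 1
        return fib(n - 1) + fib(n - 2); // deliberately naive: ~O(1.62^n) calls
    }

    public static void main(String[] args) {
        int n = Integer.parseInt(args[0]); // a number 0-9 from the input file
        System.out.println(fib(n + 25));   // e.g. n = 9 gives fib(34) = 5702887
    }
}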