Hadoop Clustering: Performance Testing on the Small Scale. Jonathan Pingilley, Garrison Vaughan, Calvin Sauerbier, Joshua Nester, Adam Albertson.

Hadoop – A Quick Look What is Hadoop?

A distributed computing framework for data-intensive distributed applications. Commonly deployed on large clusters of commercial-off-the-shelf hardware. Noted for reliability, speed, and failure/fault tolerance.

THE QUESTION: How do performance and reliability hold up on a small cluster?

Testing Overview
Three main tests:
– Speed and data loss
– Fault tolerance
– Node recovery
Hardware: repurposed Dell OptiPlex 270 and 280 units, chosen for compatibility reasons.

Test 1: Data-Loss Tolerance
The single simplest test in our procedure: run a word count on the cluster, delete all books from the DFS 1 minute in, and monitor the result.
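A minimal sketch of how such a run could be scripted, assuming the stock Hadoop example jar and made-up DFS paths (/books and /wc-out). The DRYRUN prefix only prints the commands, so the sketch is safe to try without a cluster; clear it to execute for real:

```shell
# Dry-run sketch of the data-loss test; jar name and DFS paths are assumptions.
DRYRUN=echo                       # set DRYRUN= (empty) on a real cluster
$DRYRUN hadoop jar "$HADOOP_HOME/hadoop-examples.jar" wordcount /books /wc-out &
JOB=$!
$DRYRUN sleep 60                  # let the word count get 1 minute underway
$DRYRUN hadoop fs -rmr /books     # delete all books from the DFS mid-job
wait "$JOB"                       # then watch whether the job still completes
```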

Test 2: Speed Baselines
Baseline test with only a single node:
– The exact Hadoop command is not usable on a single independent node, but a close shell equivalent was used to simulate similar results:
» cat *.txt | tr ' ' '\n' | sort | uniq -ic
Baseline with the cluster:
– Nearly identical to the single-node test, but run on the cluster as a whole, using 1-4 nodes.
Tests were run 3 times and averaged for consistency.
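The single-node pipeline can be exercised on any small sample; here is a sketch using a two-line file in place of the project's book corpus (a case-folded `sort -f` is added so that `uniq -ic`'s case-insensitive counting actually sees case variants as adjacent lines):

```shell
# Single-node word-count baseline, demonstrated on a tiny sample file
# instead of the book corpus used in the actual tests.
printf 'The cat sat\nthe cat ran\n' > sample.txt
tr ' ' '\n' < sample.txt | sort -f | uniq -ic | sort -rn
# counts: the=2 (case-insensitive), cat=2, sat=1, ran=1
```

On the cluster, the comparable job was presumably the stock Hadoop wordcount example (`hadoop jar hadoop-examples.jar wordcount <in> <out>`), which is what makes the two baselines roughly comparable.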

Test 3: Speed with Node Failure
Tests with 1 to 3 nodes removed mid-job, analyzing the time to complete the task. Each variation was run 3 times and averaged for time comparisons.

Test 4: Speed with Node Recovery
Tests with 1 to 3 nodes removed 1 minute in and reconnected 1 minute later, analyzing the time to complete the task. Each variation was run 3 times and averaged for time comparisons.

Test Parameters
All books were loaded onto the master node and the DFS. The default node timeout was lowered from 10 minutes to 30 seconds to allow for timely testing. Nodes were removed 1 minute into each run.
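The slides do not say which setting was changed; on Hadoop of this era the JobTracker declared a TaskTracker lost after `mapred.tasktracker.expiry.interval`, which defaults to 600000 ms (10 minutes), so a plausible override in mapred-site.xml would be:

```xml
<!-- mapred-site.xml (assumed): lower the 10-minute default so failed
     nodes are detected within 30 seconds during testing -->
<property>
  <name>mapred.tasktracker.expiry.interval</name>
  <value>30000</value>
</property>
```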

RESULTS You are required to maneuver straight down this trench…

Data Loss Tolerance Test – Group 1 Presentation.

Hadoop Speed Test
– Independent test: 22m 33s
– 1 node: 29m 50s w/ 22s deviation
– 2 nodes: 17m 32s w/ 18s deviation
– 3 nodes: 15m 6s w/ 16s deviation
– 4 nodes: 3m 54s w/ 6s deviation

Speed w/ Node Failure
– 1 node removed: 13m 57s w/ 17s deviation
– 2 nodes removed: 16m 5s w/ 25s deviation
– 3 nodes removed: 28m 19s w/ 19s deviation

Speed w/ Node Recovery
– 1 node removed and recovered: 5m 9s w/ 6s deviation (recovery: 1m 3s w/ 3s deviation)
– 2 nodes: 5m 27s w/ 8s deviation (recovery: 51s w/ 2s deviation)
– 3 nodes: 5m 31s w/ 6s deviation (recovery: 54s w/ 5s deviation)

CONCLUSION Is this the end?

Conclusion
Hadoop overhead is large on clusters of fewer than 4 nodes: roughly 24% overhead, with a performance degradation of up to 50%. Upon introduction of a 4th node, average per-node performance increases dramatically, by up to 144%, due to Hadoop's optimizations. These numbers were consistent across the tests performed, and the loss of nodes had minimal impact on total time to compute.
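One reading of the ~24% figure is that it follows directly from the measured times above: the independent run took 22m 33s (1353 s) and the single-node cluster run 29m 50s (1790 s). A quick sanity check:

```shell
# Overhead of the single-node cluster run relative to that run itself:
# (1790 - 1353) / 1790 * 100, consistent with the slide's "roughly 24%".
awk 'BEGIN { printf "overhead: %.1f%%\n", (1790 - 1353) / 1790 * 100 }'
```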

Conclusion, Part Deux
Recovery performance was outstanding: nodes were disconnected for 1 minute and, aside from a couple of seconds of resync and overhead, reintegrated without trouble.

The Final Word
Ultimately, Hadoop performed above and beyond expectations, proving to be a valid and relatively inexpensive way to manage large volumes of certain kinds of data when run above 4 nodes. Excellent recovery and performance, and relatively easy to use.