1
Hadoop Clustering: Performance Testing on the Small Scale
Jonathan Pingilley, Garrison Vaughan, Calvin Sauerbier, Joshua Nester, Adam Albertson
2
Hadoop – A Quick Look What is Hadoop?
3
A distributed computing framework for data-intensive distributed applications
Commonly used in large clusters of commercial-off-the-shelf hardware
Noted for reliability, speed, and failure/fault tolerance
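A concrete picture of the framework in action helps here. Below is a minimal sketch, assuming a Hadoop 1.x-era install; the examples jar name and the /books paths are assumptions, not taken from the slides. It loads a text corpus into the distributed file system and runs the bundled word-count example across the cluster:

    hadoop fs -mkdir /books                          # create an input directory on the DFS
    hadoop fs -put ~/corpus/*.txt /books             # distribute the books across the cluster
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar \
        wordcount /books /books-out                  # map/reduce word count over all nodes
    hadoop fs -cat /books-out/part-r-00000 | head    # inspect the first few counts (part file name may vary by version)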
4
THE QUESTION How do Hadoop's performance and reliability hold up on a small cluster?
5
Testing Overview
Three main tests
– Speed and data loss
– Fault tolerance
– Node recovery
Hardware
– Repurposed Dell OptiPlex 270 and 280 units, chosen for compatibility reasons
6
Test 1 Data Loss Tolerance
The single simplest test of our testing procedure
A word count run on the cluster, with all books deleted from the DFS 1 minute in and the result monitored (a sketch follows below)
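One way this run could be scripted, assuming the 1.x-era shell commands and the hypothetical /books paths from the earlier sketch:

    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount /books /books-out &
    sleep 60                # let the job run for 1 minute
    hadoop fs -rmr /books   # delete every book from the DFS mid-job (1.x syntax; -rm -r on newer versions)
    wait                    # then monitor whether and how the job finishes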
7
Test 2 Speed Baselines
Baseline test with only a single node
– The exact command is not usable on just a single node, but a close shell equivalent was used to simulate similar results:
» cat *.txt | tr ' ' '\n' | sort | uniq -ic
Baseline with the cluster
– Nearly identical to the single-node test, but run on the cluster as a whole, with 1-4 nodes
Tests run 3 times and averaged for consistency (a sketch of both baselines follows)
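A sketch of how both baselines might be timed, using the corrected pipeline above and the hypothetical paths from the earlier sketches; the repeat-3-and-average loop is per the slide, while file names are assumptions:

    # Single-node stand-in: count case-insensitive word frequencies locally
    time (cat *.txt | tr ' ' '\n' | sort | uniq -ic > local-counts.txt)

    # Cluster baseline: same word count on 1-4 nodes, repeated 3 times and averaged
    for run in 1 2 3; do
        hadoop fs -rmr /books-out 2>/dev/null    # clear the previous run's output
        time hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount /books /books-out
    done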
8
Test 3 Speed with Node Failure
Variable tests with 1 to 3 nodes removed mid-run, and the completed task analyzed (a sketch follows)
Each variation run 3 times and averaged for time comparisons
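The slides don't say how nodes were disconnected; one plausible way to script it, assuming 1.x daemon scripts, hypothetical worker hostnames, and HADOOP_HOME set on the workers (physically pulling the network cable would work equally well):

    sleep 60    # 1 minute into the running job
    ssh worker1 '$HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker; \
                 $HADOOP_HOME/bin/hadoop-daemon.sh stop datanode'
    # repeat for worker2 and worker3 in the 2- and 3-node variations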
9
Test 4 Speed with Node Recovery
Variable tests with 1 to 3 nodes removed 1 minute in, reconnected 1 minute later, and the completed task analyzed (a sketch follows)
Each variation run 3 times and averaged for time comparisons
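The recovery variant extends the failure sketch above, restarting the same daemons a minute later (same assumptions about hostnames and scripts):

    sleep 60
    ssh worker1 '$HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker'    # drop the node 1 minute in
    sleep 60
    ssh worker1 '$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker'   # bring it back 1 minute later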
10
Test Parameters
All books loaded onto the master node and the DFS
Default node timeout changed from 10 minutes to 30 seconds to allow for timely testing (a configuration sketch follows)
Node removal occurred 1 minute into each run
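The slides don't name the property that was changed, but the 10-minute default matches the TaskTracker expiry interval in Hadoop 1.x; a sketch of the change, treating the property name as an assumption:

    # In conf/mapred-site.xml on the master, inside <configuration>:
    #   <property>
    #     <name>mapred.tasktracker.expiry.interval</name>   <!-- assumed property; default 600000 ms = 10 min -->
    #     <value>30000</value>                               <!-- 30 seconds, for timely failure detection -->
    #   </property>
    # Then restart the MapReduce daemons so the change takes effect:
    $HADOOP_HOME/bin/stop-mapred.sh
    $HADOOP_HOME/bin/start-mapred.sh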
11
RESULTS You are required to maneuver straight down this trench…
12
Data Loss Tolerance Test – Group 1 Presentation
13
Hadoop Speed Test
Group 1 Presentation – independent test: 22m 33s
– 1 node: 29m 50s w/ 22s deviation
– 2 nodes: 17m 32s w/ 18s deviation
– 3 nodes: 15m 6s w/ 16s deviation
– 4 nodes: 3m 54s w/ 6s deviation
14
Speed w/ Node Failure
– 1 node removed: 13m 57s w/ 17s deviation
– 2 nodes: 16m 5s w/ 25s deviation
– 3 nodes: 28m 19s w/ 19s deviation
15
Speed w/ Node Recovery
– 1 node removed and recovered: 5m 9s w/ 6s deviation; recovery: 1m 3s w/ 3s deviation
– 2 nodes: 5m 27s w/ 8s deviation; recovery: 51s w/ 2s deviation
– 3 nodes: 5m 31s w/ 6s deviation; recovery: 54s w/ 5s deviation
16
CONCLUSION Is this the end?
17
Conclusion
Hadoop overhead is large on clusters of fewer than 4 nodes
– Roughly 24% overhead, with a performance degradation of 50%
Upon introduction of a 4th node, average per-node performance increases dramatically, by up to 144%, due to optimizations
These numbers were reflected in the tests performed, and the loss of nodes had minimal impact on total compute time
18
Conclusion, Part Deux
Recovery performance was outstanding – nodes were disconnected for 1 minute and, aside from a couple of seconds of resync and overhead, reintegrated without trouble.
19
The Final Word
Ultimately, Hadoop performed above and beyond expectations, proving to be a valid and relatively inexpensive way to manage large volumes of certain kinds of data when used at 4 or more nodes. Excellent recovery and performance, and relatively easy to use.