UNIVERSITY OF SOUTH FLORIDA

The Hadoop Alternative

Larry Moore (1), Zach Fadika (2), Dr. Madhusudhan Govindaraju (2)
(1) Department of Computer Science and Engineering; (2) Binghamton University

ABSTRACT

Apache Hadoop has a longer runtime than necessary. For smaller tasks, it is exceedingly slow. It generates excess network traffic, fails to fully utilize multi-core architectures, and needs a more time-efficient model for I/O tasks. A new implementation that partially models the structure of Hadoop but avoids these problems is possible, with a focus on speeding up tasks that run for minutes or hours rather than days or months. The new implementation excludes unnecessary fault tolerance and redundant data copies. It includes an asynchronous parallel design and makes more time-efficient use of memory, disk access, and network bandwidth. As the number of nodes increases, I/O-intensive grid computing tasks have a runtime of Θ(n log n) with Hadoop, compared to O(log n) with the new model. As the input size increases, the runtime of both is Θ(n), but the slope of the alternative is significantly smaller. With Hadoop, tasks that are purely CPU intensive were shown to take twice as long as required on a cluster of quad-core machines.

HOW IT WORKS

The original goal of this research was simply to parse XML files ranging from 25 megabytes to a few gigabytes. The most widely used framework for this type of task, Apache Hadoop, provides this ability, with assurance that the job will almost always finish. Hadoop was designed for mission-critical applications, such as use with NASA, where a single run may continue for weeks or months. Over such periods, nodes will often fail, which would otherwise force the process to restart. Since modern computers are, for the most part, reliable enough to keep working for a couple of days without a fault, we decided not to implement as much fault tolerance as Hadoop.
Although Hadoop's design works well for long-running jobs, it did not perform well for tasks that run for minutes, hours, or a few days. As a test, a simple task with a 7-byte input took just over a minute. Most of that time was spent during map reduction. A closer look showed that very large packets were regularly being sent to the node computers and choking performance.

The tests shown in Figures 4, 7, and 8 took an input file containing one number (0, 1, …, 9) per line, and the nodes translated each number into the respective word (zero, one, …, nine). This workload requires roughly equal amounts of I/O and CPU work. The largest bottleneck for both the “Alternative” and Hadoop was network bandwidth. All servers used 1 gigabit connections, but large input sizes cost Hadoop much more than they cost the “Alternative”. The tests shown were conducted on quad-core Intel Xeon 2.66 GHz machines.

Similar to Hadoop, the “Alternative” is a Java software framework that supports data-intensive distributed applications. A single large task is split and distributed. Each fragment is processed and returned to the master node to be assembled into a single resultant file. Figures 1 and 2 show this.

Figure 1 – Distributed computing. A single task is broken into several smaller tasks.

Figure 2 – Two examples of how an XML file may be split for 3 and 5 nodes.

Rather than following Hadoop’s linear execution, the program was designed to run asynchronously for time efficiency. This model substantially increased CPU and I/O utilization. Initially, fault tolerance was kept to a bare minimum for speed, but we later realized that the framework could include it and remain fast, without becoming as slow as Hadoop.

Figure 3 (left) – The structure of the initial implementation. Scheduling for flow control, additional fault tolerance for problems caused by bad nodes or unavailable ports, and special debugging tools were added later.
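The split, process, and assemble flow described above can be illustrated with a minimal Java sketch of the digit-to-word test. This is not the authors' implementation: the class and method names are hypothetical, threads in a local pool stand in for cluster nodes, and no networking, scheduling, or fault tolerance is modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: names are illustrative, not the framework's API.
final class DigitWordSketch {
    private static final String[] WORDS = {
        "zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"
    };

    // Split the input into exactly `parts` contiguous fragments of near-equal
    // size (some may be empty when there are fewer lines than parts).
    static List<List<String>> split(List<String> lines, int parts) {
        List<List<String>> fragments = new ArrayList<>();
        int n = lines.size();
        for (int i = 0; i < parts; i++) {
            int from = (int) ((long) n * i / parts);
            int to = (int) ((long) n * (i + 1) / parts);
            fragments.add(lines.subList(from, to));
        }
        return fragments;
    }

    // Per-fragment work: translate each single-digit line into its word.
    static List<String> translate(List<String> fragment) {
        List<String> out = new ArrayList<>(fragment.size());
        for (String line : fragment) {
            out.add(WORDS[Integer.parseInt(line.trim())]);
        }
        return out;
    }

    // Process all fragments asynchronously (assumption: one thread per "node"),
    // then assemble the partial results in their original order.
    static List<String> run(List<String> lines, int nodes) {
        ExecutorService pool = Executors.newFixedThreadPool(nodes);
        try {
            List<CompletableFuture<List<String>>> pending = new ArrayList<>();
            for (List<String> fragment : split(lines, nodes)) {
                pending.add(CompletableFuture.supplyAsync(() -> translate(fragment), pool));
            }
            List<String> result = new ArrayList<>(lines.size());
            for (CompletableFuture<List<String>> f : pending) {
                result.addAll(f.join()); // waits only for this fragment
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // prints [three, zero, nine, one, seven]
        System.out.println(run(List.of("3", "0", "9", "1", "7"), 3));
    }
}
```

All fragments are submitted before any result is awaited, so slow fragments do not block the start of others; joining the futures in submission order is what keeps the assembled output in input order.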
RESULTS

Figure 7 – With a 25 MB input, Hadoop converges to about 40 seconds, whereas the “Alternative” converges to about 1.5 seconds.

Figure 4 – 5 nodes and various input sizes. Both run in Θ(n).

Some tests that are not shown used 8-core machines (Intel Xeon 2.33 GHz), and the “Alternative” performed even better in those tests than in the ones shown here. Since the “Alternative” used multiple ports to prevent blocking, while Hadoop generated extra traffic and used a single port for data transfer, the “Alternative” had a clear advantage.

Figure 5 – Using a 25 MB input file with one number N per line, Fibonacci(N+25) is calculated recursively. Hadoop used about 40% of the CPU, compared to about 90% for the “Alternative”. As a result, the “Alternative” had at most a 2.5x speedup. The asynchronous design provided more parallelization.

Figure 8 – 400 MB input and a varying number of nodes. Hadoop’s runtime was Θ(n log n), compared to the “Alternative” model, which ran in O(log n). As the number of nodes increases greatly, Hadoop begins to take even longer than it did with fewer nodes.
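The CPU-bound workload of Figure 5 can be approximated with the following Java sketch, which computes Fibonacci(N+25) by naive recursion for each input digit N and spreads the work across a thread pool. It only illustrates how such a task can keep several cores busy; the class name, the `run` helper, and the choice of one worker per available core are assumptions, not details taken from the framework.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the Figure 5 workload; names are illustrative.
final class FibonacciLoadSketch {
    // Deliberately naive recursion, so each call is pure CPU work.
    static long fib(int n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }

    // Compute fib(N + 25) for every input line N, using `workers` threads.
    static List<Long> run(List<String> lines, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<CompletableFuture<Long>> pending = new ArrayList<>();
            for (String line : lines) {
                int n = Integer.parseInt(line.trim());
                pending.add(CompletableFuture.supplyAsync(() -> fib(n + 25), pool));
            }
            List<Long> results = new ArrayList<>(lines.size());
            for (CompletableFuture<Long> f : pending) {
                results.add(f.join()); // results stay in input order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Assumption: one worker per available core.
        int cores = Runtime.getRuntime().availableProcessors();
        // prints [75025, 121393]
        System.out.println(run(List.of("0", "1"), cores));
    }
}
```

Because each line is an independent task, CPU utilization in a sketch like this depends mainly on how many tasks are in flight at once relative to the number of cores.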