HadoopDB project
An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
Anssi Salohalla
Background
- The amount of data that must be stored for analysis is growing rapidly
- Analysis performance cannot be compromised despite the growth in data volume
- Efficient high-end proprietary machines are expensive
Parallel databases
- Shared-nothing MPP architecture: a collection of independent machines, each with its own local hard disk and main memory, connected by a high-speed network
- Machines are cheaper, lower-end commodity hardware
- Scale well only up to a point (tens of nodes)
- Good performance
- Poor fault tolerance
- Problems in heterogeneous environments (machines must be equal in performance)
- Good support for a flexible query interface
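To make the shared-nothing idea concrete, here is a minimal sketch of hash partitioning, one common way to spread rows across independent nodes so that each node stores and processes only its own share. The function names, the sample rows, and the choice of CRC32 as the hash are illustrative assumptions, not taken from any particular parallel database.

```python
import zlib
from collections import defaultdict


def node_for_key(key: str, num_nodes: int) -> int:
    """Pick the node that owns a key using a deterministic hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition_rows(rows, key_index, num_nodes):
    """Split rows into per-node lists; no data is shared between nodes."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for_key(str(row[key_index]), num_nodes)].append(row)
    return dict(nodes)


# Hypothetical sample data: (source IP, ad revenue)
rows = [("10.0.0.1", 5.0), ("10.0.0.2", 1.5), ("10.0.0.1", 2.5), ("10.0.0.3", 4.0)]
partitions = partition_rows(rows, key_index=0, num_nodes=3)
```

Because rows with the same key always land on the same node, a per-key aggregation can run locally on each node without moving data over the network.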
MapReduce systems
- Cheap
- Scale well to thousands of nodes
- Good support for heterogeneous environments
- Good fault tolerance
- Performance issues compared to parallel databases
- Generally no support for SQL (with exceptions, e.g. Hive)
What is HadoopDB
- A recent study at Yale University, Database Research Dep.
- A hybrid architecture of parallel databases and a MapReduce system
- The idea is to combine the best qualities of both technologies
- Multiple single-node databases are connected using Hadoop as the task coordinator and network communication layer
- Queries are distributed across the nodes by the MapReduce framework, but as much work as possible is done inside the database on each node
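The division of labor described above can be sketched as follows. This is a simplified illustration, not HadoopDB's actual code: `sqlite3` in-memory databases stand in for the single-node databases, a plain Python loop stands in for Hadoop's task coordination, and the table name `visits`, its columns, and the sample rows are all hypothetical.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create one in-memory single-node database holding one partition of the data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (source_ip TEXT, ad_revenue REAL)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn


def map_task(conn):
    """Map phase: push the partial aggregation down into the node's own database."""
    sql = "SELECT source_ip, SUM(ad_revenue) FROM visits GROUP BY source_ip"
    return conn.execute(sql).fetchall()


def reduce_task(partials):
    """Reduce phase: merge the per-node subtotals into global totals."""
    totals = defaultdict(float)
    for source_ip, subtotal in partials:
        totals[source_ip] += subtotal
    return dict(totals)


# Two hypothetical nodes, each holding its own partition
nodes = [
    make_node([("10.0.0.1", 5.0), ("10.0.0.2", 1.5)]),
    make_node([("10.0.0.1", 2.5), ("10.0.0.3", 4.0)]),
]
partials = [pair for conn in nodes for pair in map_task(conn)]
totals = reduce_task(partials)
```

Each node returns one already-aggregated row per key instead of its raw rows, so the coordinating layer only has to combine small partial results. That is the sense in which as much work as possible is done in the database.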
HadoopDB architecture
Reference: Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
Desired properties of HadoopDB
- Performance
- Fault tolerance
- Support for heterogeneous environments
- Flexible query interface
Benchmarked systems in the study
- Hadoop
- HadoopDB
- Vertica
- DBMS-X
Benchmark tasks
- Data loading
- Grep task
- Selection task
- Aggregation task
- Join task
- UDF aggregation task
- Fault tolerance and heterogeneous environment
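As an illustration of the simplest of these tasks, the sketch below expresses a grep-style search as a map function over (key, value) records. The record layout, the sample records, and the pattern are assumptions made for the example and do not reproduce the benchmark's actual data set.

```python
def grep_map(record, pattern):
    """Map function: emit the record only if the pattern occurs in its value field."""
    key, value = record
    if pattern in value:
        yield key, value


def run_grep(records, pattern):
    """Apply the map function to every record; the search itself needs no reduce logic."""
    return [hit for record in records for hit in grep_map(record, pattern)]


# Hypothetical records and search pattern
records = [("k1", "aaaXYZbbb"), ("k2", "no match here"), ("k3", "XYZ at start")]
matches = run_grep(records, "XYZ")
```

A task like this has to scan every record, so it mostly measures raw scan throughput rather than query optimization.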
Results 1/2
Reference: Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
Results 2/2
Reference: Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
Conclusions
- HadoopDB comes close to parallel databases in performance
- HadoopDB can operate in truly heterogeneous environments and has the fault tolerance of Hadoop
- Licensing costs are equal to those of Hadoop
- Better performance is expected in the future
Further reading
- HadoopDB Project. Web page: http://db.cs.yale.edu/hadoopdb/hadoopdb.html
- Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
- Hadoop Project. Hadoop Cluster Setup. Web page: http://hadoop.apache.org/core/docs/current/cluster_setup.html