Published by Harry Noah Gregory. Modified over 9 years ago.
1
PARALLEL DBMS VS MAP REDUCE
“MapReduce and Parallel DBMSs: Friends or Foes?” Michael Stonebraker, Daniel Abadi, David J. DeWitt et al.
2
MAP REDUCE AND PARALLEL DBMS ARE COMPLEMENTARY
In 2010, MapReduce (MR) was hailed as a revolutionary new platform for large-scale, massively parallel data access.
Some proponents claimed that the extreme scalability of MR would relegate relational database management systems (DBMSs) to the status of legacy technology.
It was later found that using MR systems for tasks best suited to DBMSs yields less than satisfactory results.
As such, MR complements DBMS technology rather than competing with it.
Parallel DBMSs first became available nearly two decades ago.
As robust, high-performance platforms, they provide a high-level programming environment that is inherently parallelizable.
Almost any parallel processing task can be written as either a set of database queries or a set of MR jobs.
3
THE SHARED-NOTHING ARCHITECTURE OF PARALLEL DBMS
The first parallel DBMSs used a shared-nothing architecture with horizontal partitioning of relational tables.
Horizontal partitioning is critical to obtaining scalable performance for SQL queries.
It leads to the partitioned execution of SQL operators such as selection, aggregation, and join.
4
HORIZONTAL PARTITIONING
The idea behind horizontal partitioning is to distribute the rows of a relational table across the nodes of a cluster so that they can be processed in parallel.
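As a rough illustration of horizontal partitioning, the sketch below splits a list of rows into per-node fragments in two common ways: round-robin and hashing on a key. The function names, the sample rows, and the three-node cluster are assumptions made for this example and do not come from the slides.

```python
def round_robin_partition(rows, num_nodes):
    """Send row i to node i % num_nodes, which keeps fragments evenly sized."""
    fragments = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        fragments[i % num_nodes].append(row)
    return fragments


def hash_partition(rows, num_nodes, key):
    """Send each row to the node chosen by hashing one of its attributes,
    so all rows with the same key value land on the same node."""
    fragments = [[] for _ in range(num_nodes)]
    for row in rows:
        fragments[hash(row[key]) % num_nodes].append(row)
    return fragments


# Hypothetical sample data: 10 sales rows spread over 4 customers.
rows = [{"custId": i % 4, "amount": 10.0 * i} for i in range(10)]

rr = round_robin_partition(rows, 3)
hp = hash_partition(rows, 3, "custId")

print([len(f) for f in rr])  # fragment sizes under round-robin
print([sorted({r["custId"] for r in f}) for f in hp])  # customers per node
```

Round-robin balances the load but scatters each customer's rows across nodes, while hash partitioning groups rows by key at the cost of possible skew.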
5
MAP REDUCE EXAMPLE IN PARALLEL DBMS
SELECT custId, SUM(amount) FROM Sales WHERE date BETWEEN '12/01/2009' AND '12/25/2009' GROUP BY custId
The Sales table is round-robin partitioned across the nodes in the cluster, so the rows for any one customer are spread over several nodes.
Each SELECT operator scans the fragment of the Sales table stored at its node.
Rows that satisfy the date predicate are passed to a SHUFFLE operator, which dynamically repartitions them by hashing on custId.
Because every node uses the same hash function, all rows for a given customer arrive at a single node, where they are aggregated to compute the final total for that customer.
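The SELECT, SHUFFLE, and aggregation steps described above can be simulated in a single process. This is only a sketch: the sample rows, the three-node cluster, and all function names are hypothetical, and a real parallel DBMS runs these operators concurrently on separate machines.

```python
from collections import defaultdict
from datetime import date

NUM_NODES = 3  # assumed cluster size for the illustration

# Hypothetical Sales rows; two of them fall outside the date range.
SALES = [
    {"custId": 1, "amount": 20.0, "date": date(2009, 12, 2)},
    {"custId": 2, "amount": 5.0, "date": date(2009, 12, 3)},
    {"custId": 1, "amount": 7.5, "date": date(2009, 12, 24)},
    {"custId": 3, "amount": 12.0, "date": date(2009, 11, 30)},
    {"custId": 2, "amount": 1.0, "date": date(2009, 12, 25)},
    {"custId": 3, "amount": 4.0, "date": date(2009, 12, 10)},
    {"custId": 1, "amount": 2.5, "date": date(2009, 12, 26)},
]


def select(fragment, start, end):
    """SELECT: scan the local fragment and keep rows inside the date range."""
    return [r for r in fragment if start <= r["date"] <= end]


def shuffle(selected_per_node, num_nodes):
    """SHUFFLE: route every surviving row to the node chosen by hashing custId."""
    inboxes = [[] for _ in range(num_nodes)]
    for rows in selected_per_node:
        for r in rows:
            inboxes[hash(r["custId"]) % num_nodes].append(r)
    return inboxes


def aggregate(rows):
    """Sum the amounts per customer among the rows received by one node."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["custId"]] += r["amount"]
    return dict(totals)


def run_query(sales, num_nodes, start, end):
    # Round-robin placement: row j is stored on node j % num_nodes.
    fragments = [sales[i::num_nodes] for i in range(num_nodes)]
    selected = [select(f, start, end) for f in fragments]
    inboxes = shuffle(selected, num_nodes)
    result = {}
    for rows in inboxes:
        # Each customer appears on exactly one node, so results never collide.
        result.update(aggregate(rows))
    return result


totals = run_query(SALES, NUM_NODES, date(2009, 12, 1), date(2009, 12, 25))
print(totals)
```

Because all nodes apply the same hash function, the per-node results can simply be combined without a second aggregation pass.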
6
MAP-REDUCE ADVANTAGES
MR is advantageous for ETL and read-once data sets.
A DBMS must parse and verify each datum in the tuples before loading; MR does not.
The distributed infrastructure used to implement MR is cheap.
The horizontal scalability of MR is better than that of parallel DBMSs.
MR is available as an open source project with detailed documentation.
There is no popular open source parallel DBMS; all the popular ones come from commercial vendors.
7
Comparison - Parallel DBMS over MapReduce
Experimental setup
Used the most popular implementations of MR and parallel DBMS.
Results presented are those achieved after best tuning.
Task name      Hadoop   DBMS-X   Vertica   Hadoop/DBMS-X   Hadoop/Vertica
MR Grep task   284s     194s     108s      1.5x            2.6x
Web log task   1146s    740s     268s      1.6x            4.3x
Join task      1158s    32s      55s       36.3x           21.0x
1. MR Grep task - Each system must scan through a data set of 100-byte records looking for a three-character pattern.
2. Web log task - Conventional SQL aggregation with a GROUP BY clause on a table of user visits in a Web server log.
3. Join task - Fairly complex join operation over two tables, requiring an additional aggregation and filtering operation.
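The two ratio columns can be recomputed from the reported running times as Hadoop time divided by DBMS time. The short sketch below does that; the dictionary layout and the function name are assumptions for illustration. Since the times shown are rounded to whole seconds, the recomputed ratios may differ from the reported ones in the last digit.

```python
# Running times in seconds, copied from the results table.
times = {
    "MR Grep task": {"Hadoop": 284, "DBMS-X": 194, "Vertica": 108},
    "Web log task": {"Hadoop": 1146, "DBMS-X": 740, "Vertica": 268},
    "Join task": {"Hadoop": 1158, "DBMS-X": 32, "Vertica": 55},
}


def slowdown(task, system):
    """How many times longer Hadoop took than the given system on a task."""
    return times[task]["Hadoop"] / times[task][system]


for task in times:
    print(
        f"{task}: Hadoop/DBMS-X = {slowdown(task, 'DBMS-X'):.2f}x, "
        f"Hadoop/Vertica = {slowdown(task, 'Vertica'):.2f}x"
    )
```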
8
Reasons why PDBMS outperforms MapReduce in the experiment
1. Repetitive record parsing - The default configuration of Hadoop stores data in the accompanying distributed file system (HDFS) in the same textual format in which the data was generated, so every Map and Reduce task must parse the string fields again at run time.
2. Compression - Enabling data compression in the DBMSs delivered a much more significant performance gain than in MR. The reason is unknown.
3. Pipelining - Writing data structures to disk gives Hadoop a convenient way to checkpoint the output of intermediate map jobs, thereby improving fault tolerance, but it adds significant performance overhead.
4. Scheduling - In a parallel DBMS, each node knows exactly what it must do and when it must do it according to the distributed query plan. In an MR system, each task is scheduled on processing nodes at run time, one storage block at a time.
5. Column-oriented storage - In a column store-based database (such as Vertica), the system reads only the attributes necessary for solving the user query, which saves I/O compared with row-oriented storage.
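The column-oriented storage point can be made concrete with a toy comparison of how many field values each layout touches to sum a single attribute. This is a sketch under stated assumptions: the table schema, the sample rows, and the "values read" counter are hypothetical stand-ins for disk I/O, not a model of how Vertica works internally.

```python
# Hypothetical four-attribute table stored row by row.
COLUMN_NAMES = ("date", "custId", "amount", "sourceIP")
ROWS = [
    ("2009-12-02", 1, 20.0, "10.0.0.1"),
    ("2009-12-03", 2, 5.0, "10.0.0.2"),
    ("2009-12-24", 1, 7.5, "10.0.0.3"),
]


def to_column_store(rows, names):
    """Re-lay the same data as one list per attribute."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def sum_amount_row_store(rows, names):
    """A row store fetches whole rows, so every attribute is read."""
    idx = names.index("amount")
    total, values_read = 0.0, 0
    for row in rows:
        values_read += len(row)
        total += row[idx]
    return total, values_read


def sum_amount_column_store(columns):
    """A column store reads only the attribute the query needs."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


COLUMNS = to_column_store(ROWS, COLUMN_NAMES)
row_total, row_reads = sum_amount_row_store(ROWS, COLUMN_NAMES)
col_total, col_reads = sum_amount_column_store(COLUMNS)
print(row_total, row_reads)
print(col_total, col_reads)
```

Both layouts produce the same total, but the row layout touches every attribute of every row while the column layout touches one value per row.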
9
Conclusion
MR has some good qualities: a good out-of-the-box experience, and it works on data where it already sits, whereas most database systems cannot deal with tables stored in the file system.
DBMSs have some good qualities: technologies and techniques for efficient parallel query execution, and the use of higher-level languages.
Parallel DBMSs excel at efficient querying of large data sets.
MR-style systems excel at complex analytics and ETL tasks.
Neither is good at what the other does well. Hence, the two technologies are complementary.
An ideal system would therefore be a “hybrid” system.
HadoopDB, Hive, Aster, Greenplum, Cloudera, and Vertica all have commercially available products or prototypes in this “hybrid” category.