A Comparison of Approaches to Large-Scale Data Analysis
Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker
SIGMOD
Summarized by Jaeseok Myung
Intelligent Database Systems Lab, School of Computer Science & Engineering
Seoul National University, Seoul, Korea
MapReduce vs. Parallel DBMS
Copyright 2009 by CEBT
Center for E-Business Technology
MapReduce
[Figure credited to 한재선, SearchDay2008]
Architectural Differences

                     Parallel DBMS                  MapReduce
Schema Support       Yes                            No
Indexing             Yes                            No
Programming Model    Stating what you want (SQL)    Presenting an algorithm (C/C++, Java, …)
Optimization         Yes                            No
Flexibility                                         Good
Fault Tolerance                                     Good
Benchmark Environment (1/2)
Systems
- Hadoop: the most popular open-source MR implementation
- DBMS-X: a parallel DBMS that stores data in a row-based format
- Vertica: a column-based parallel DBMS
- All three systems were deployed on a 100-node cluster
Analytical Tasks
- Data Loading
- Selection Task
- Aggregation Task
- Join Task
- UDF Aggregation Task
Benchmark Environment (2/2)
Dataset
- Documents: 600,000 unique documents for each node
- 155 million UserVisits records (20GB/node)
- 18 million Rankings records (1GB/node)
1. Data Loading
[Figure: data loading results, with labels for loading time and reorganization]
2. Selection Task
The selection task is a lightweight filter that finds the pageURLs in the Rankings table (1GB/node) with a pageRank above a user-defined threshold.
Query
    SELECT pageURL, pageRank FROM Rankings WHERE pageRank > x;
- x = 10, which yields approximately 36,000 records per data file on each node
- For MR, the same task is implemented in Java
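The slide notes that MR needs a hand-written program for the same filter. As a rough illustration only (the benchmark used Java on Hadoop; this is a plain Python sketch, and the record layout and the name `select_map` are assumptions made for the example), the selection amounts to a map-only pass with no reduce step:

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical record layout for illustration: (pageURL, pageRank).
Ranking = Tuple[str, int]


def select_map(records: Iterable[Ranking], threshold: int) -> Iterator[Ranking]:
    """Map-only filter: emit (pageURL, pageRank) when pageRank > threshold."""
    for page_url, page_rank in records:
        if page_rank > threshold:
            yield page_url, page_rank


# Tiny made-up input; the real Rankings table is 1GB per node.
rankings = [("a.com", 5), ("b.com", 12), ("c.com", 10), ("d.com", 48)]
selected = list(select_map(rankings, threshold=10))
print(selected)
```

Because each node writes its own output, a separate step is still needed if a single result file is wanted.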
2. Selection Task - Result
[Figure: selection task results, with labels for processing time and the time for combining the output into a single file (additional MR)]
3. Aggregation Task
The aggregation task calculates the total adRevenue generated for each sourceIP in the UserVisits table (20GB/node), grouped by the sourceIP column.
Query
    SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
- This task always produces 2.5 million records
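The GROUP BY query above corresponds to a classic map, shuffle, reduce pattern. The following is a minimal single-process sketch of that pattern, not the benchmark's Hadoop code; the function names and the two-field record layout are assumptions for illustration:

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record layout for illustration: (sourceIP, adRevenue).
Visit = Tuple[str, float]


def agg_map(visits: Iterable[Visit]) -> Iterator[Tuple[str, float]]:
    """Map: emit one (sourceIP, adRevenue) pair per UserVisits record."""
    for source_ip, ad_revenue in visits:
        yield source_ip, ad_revenue


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Shuffle: group values by key, as the MR framework would between phases."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def agg_reduce(source_ip: str, revenues: List[float]) -> Tuple[str, float]:
    """Reduce: sum the revenue for a single sourceIP."""
    return source_ip, sum(revenues)


# Tiny made-up input; the real UserVisits table is 20GB per node.
visits = [("1.1.1.1", 0.5), ("2.2.2.2", 1.25), ("1.1.1.1", 0.25)]
totals = dict(agg_reduce(ip, revs) for ip, revs in shuffle(agg_map(visits)).items())
print(totals)
```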
3. Aggregation Task - Result
4. Join Task
The join task consists of two sub-tasks that perform a complex calculation on two data sets.
- In the first part of the task, each system must find the sourceIP that generated the most revenue within a particular date range
- Once these intermediate records are generated, the system must then calculate the average pageRank of all the pages visited during this interval
Query
    SELECT INTO Temp sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue
    FROM Rankings AS R, UserVisits AS UV
    WHERE R.pageURL = UV.destURL
      AND UV.visitDate BETWEEN Date(' ') AND Date(' ')
    GROUP BY UV.sourceIP;

    SELECT sourceIP, totalRevenue, avgPageRank
    FROM Temp
    ORDER BY totalRevenue DESC LIMIT 1;
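To make the two phases of the join query concrete, here is a small in-memory sketch. It is an assumption-laden illustration rather than the benchmark's implementation: the function name `top_revenue_source`, the tuple layouts, and the January 2000 date range are all made up for the example, and the join is a simple dictionary lookup.

```python
from collections import defaultdict
from datetime import date


def top_revenue_source(rankings, user_visits, start, end):
    """Return (sourceIP, totalRevenue, avgPageRank) for the top-earning sourceIP.

    rankings:    dict mapping pageURL -> pageRank (hypothetical layout)
    user_visits: iterable of (sourceIP, destURL, visitDate, adRevenue)
    """
    # Phase 1: filter by date (inclusive, like BETWEEN), join on URL, group by sourceIP.
    acc = defaultdict(lambda: [0, 0.0, 0])  # [pageRank sum, revenue sum, row count]
    for source_ip, dest_url, visit_date, ad_revenue in user_visits:
        if start <= visit_date <= end and dest_url in rankings:
            entry = acc[source_ip]
            entry[0] += rankings[dest_url]
            entry[1] += ad_revenue
            entry[2] += 1
    temp = [(ip, revenue, rank_sum / n) for ip, (rank_sum, revenue, n) in acc.items()]

    # Phase 2: ORDER BY totalRevenue DESC LIMIT 1.
    return max(temp, key=lambda row: row[1]) if temp else None


rankings = {"x.com": 10, "y.com": 30, "z.com": 50}
user_visits = [
    ("1.1.1.1", "x.com", date(2000, 1, 16), 1.0),
    ("1.1.1.1", "y.com", date(2000, 1, 17), 2.0),
    ("2.2.2.2", "z.com", date(2000, 1, 18), 2.5),
    ("2.2.2.2", "z.com", date(2000, 2, 1), 9.0),  # outside the date range
]
best = top_revenue_source(rankings, user_visits, date(2000, 1, 1), date(2000, 1, 31))
print(best)
```

In an MR setting the same logic would typically be spread over several jobs, since the filter and join, the per-sourceIP aggregation, and the final top-1 selection each need their own pass over the data.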
4. Join Task - Result
5. UDF Aggregation Task
The final task is to compute the inlink count for each document in the dataset.
Query
    SELECT INTO Temp F(contents) FROM Documents;
- F: a user-defined function that parses the contents of each record in the Documents table and emits URLs into the database
- With this function F, we populate a temporary table with a list of URLs and can then execute a simple query to calculate the inlink count
    SELECT url, SUM(value) FROM Temp GROUP BY url;
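The role of F and the follow-up SUM can be sketched in a few lines. This is only an illustration: the regular expression, the names `extract_urls` and `inlink_counts`, and the choice to emit a value of 1 per URL occurrence are assumptions, not details given on the slide.

```python
import re
from collections import Counter

# Simplified URL pattern for illustration; real HTML parsing would be more careful.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(contents):
    """Stand-in for F: emit (url, 1) for every URL found in a document's contents."""
    for match in URL_PATTERN.finditer(contents):
        yield match.group(0), 1


def inlink_counts(documents):
    """Equivalent of SELECT url, SUM(value) FROM Temp GROUP BY url."""
    counts = Counter()
    for contents in documents:
        for url, value in extract_urls(contents):
            counts[url] += value
    return counts


documents = [
    'see <a href="http://a.com/x">x</a> and http://b.com/y',
    "http://a.com/x again, plus http://a.com/x",
]
counts = inlink_counts(documents)
print(counts)
```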
5. UDF Aggregation Task - Result
Conclusion
MapReduce < Parallel DBMS
HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin
VLDB
Summarized by Jaeseok Myung
Intelligent Database Systems Lab, School of Computer Science & Engineering
Seoul National University, Seoul, Korea
HadoopDB
The Basic Idea (An Architectural Hybrid of MR & DBMS)
- Use MR as the communication layer above multiple nodes running single-node DBMS instances
- Queries are expressed in SQL and translated into MR by extending existing tools; as much work as possible is pushed into the higher-performing single-node databases
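The idea of pushing work into per-node databases and using MR only to combine results can be sketched as follows. This is a toy model under stated assumptions, not HadoopDB's actual code: in-memory SQLite databases stand in for the single-node DBMS instances, and the helper names `make_node`, `map_on_node`, and `reduce_partials` are hypothetical.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create one in-memory SQLite database holding a partition of UserVisits."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE UserVisits (sourceIP TEXT, adRevenue REAL)")
    conn.executemany("INSERT INTO UserVisits VALUES (?, ?)", rows)
    return conn


def map_on_node(conn):
    """Map phase: push the aggregation into the local database, emit partial sums."""
    return conn.execute(
        "SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP"
    ).fetchall()


def reduce_partials(partials):
    """Reduce phase: merge the per-node partial sums into global totals."""
    totals = defaultdict(float)
    for source_ip, partial_sum in partials:
        totals[source_ip] += partial_sum
    return dict(totals)


# Two made-up partitions standing in for two cluster nodes.
nodes = [
    make_node([("1.1.1.1", 0.5), ("2.2.2.2", 1.0)]),
    make_node([("1.1.1.1", 0.25), ("3.3.3.3", 2.0)]),
]
partials = [row for conn in nodes for row in map_on_node(conn)]
totals = reduce_partials(partials)
print(totals)
```

The point of the design is visible even at this scale: each node's database does the scan and local GROUP BY, so only small partial results cross the MR layer.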
The Architecture of HadoopDB
HadoopDB – Join Task