
A Comparison of Approaches to Large-Scale Data Analysis
Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker
SIGMOD 2009
2009-10-09
Summarized by Jaeseok Myung
Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

Copyright © 2009 by CEBT, Center for E-Business Technology

MapReduce vs. Parallel DBMS

MapReduce
Source: 한재선 (Han Jae-sun), SearchDay2008, http://nexr.tistory.com

Architectural Differences
(O = supported, X = not supported)

                     Parallel DBMS                   MapReduce
Schema Support       O                               X
Indexing             O                               X
Programming Model    Stating what you want (SQL)     Presenting an algorithm (C/C++, Java, …)
Optimization         O                               X
Flexibility                                          Good
Fault Tolerance                                      Good

Benchmark Environment (1/2)
- Systems
  - Hadoop: the most popular open-source MR implementation
  - DBMS-X: a parallel DBMS that stores data in a row-based format
  - Vertica: a column-based parallel DBMS
- All three systems were deployed on a 100-node cluster
- Analytical tasks
  - Data Loading
  - Selection Task
  - Aggregation Task
  - Join Task
  - UDF Aggregation Task

Benchmark Environment (2/2)
- Dataset
  - Documents: 600,000 unique documents per node
  - UserVisits: 155 million records (20GB/node)
  - Rankings: 18 million records (1GB/node)

1. Data Loading
[Figure: data loading results. Annotations: loading time, reorganization]

2. Selection Task
- The selection task is a lightweight filter that finds the pageURLs in the Rankings table (1GB/node) with a pageRank above a user-defined threshold
- Query
    SELECT pageURL, pageRank FROM Rankings WHERE pageRank > x;
  - x = 10, which yields approximately 36,000 records per data file on each node
- For MR, the same task is implemented in Java
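The slide gives only the SQL side of this task. As a rough illustration of what the MR side involves, the filter can be sketched as a map-only job. This is a Python sketch, not the Java program used in the benchmark; the `|` field delimiter, the three-field record layout, and the name `select_map` are assumptions made for the example.

```python
from typing import Iterable, Iterator, Tuple


def select_map(lines: Iterable[str], threshold: int) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (pageURL, pageRank) for records above the threshold."""
    for line in lines:
        # Assumed record layout: pageURL|pageRank|avgDuration
        page_url, page_rank, _avg_duration = line.rstrip("\n").split("|")
        rank = int(page_rank)
        if rank > threshold:
            yield page_url, rank


# Tiny hypothetical input standing in for one node's Rankings file.
sample = ["a.com|12|30", "b.com|7|12", "c.com|45|8"]
selected = list(select_map(sample, threshold=10))
```

No reduce phase is needed for the filter itself, since each record is accepted or rejected independently of the others.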

2. Selection Task - Result
[Figure: selection task results. Annotations: processing time; time for combining the output into a single file (additional MR job)]

3. Aggregation Task
- The aggregation task calculates the total adRevenue generated for each sourceIP in the UserVisits table (20GB/node), grouped by the sourceIP column
- Query
    SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
  - This task always produces 2.5 million records
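For comparison with the GROUP BY query, the same aggregation can be sketched as a map, shuffle, and reduce sequence. This is a single-process Python illustration under stated assumptions: records are already parsed into (sourceIP, adRevenue) tuples, and the names `agg_map`, `shuffle`, and `agg_reduce` are hypothetical, not taken from the benchmark code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def agg_map(records: Iterable[Tuple[str, float]]) -> Iterator[Tuple[str, float]]:
    """Map phase: emit (sourceIP, adRevenue) for every visit record."""
    for source_ip, ad_revenue in records:
        yield source_ip, ad_revenue


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Stand-in for the framework's shuffle: group values by key."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def agg_reduce(source_ip: str, revenues: List[float]) -> Tuple[str, float]:
    """Reduce phase: sum the revenue of one sourceIP."""
    return source_ip, sum(revenues)


# Hypothetical sample of UserVisits records.
visits = [("1.1.1.1", 0.5), ("2.2.2.2", 1.0), ("1.1.1.1", 0.25)]
totals = dict(agg_reduce(ip, revs) for ip, revs in shuffle(agg_map(visits)).items())
```

In a real MR job the reduce function could also serve as a combiner, because summation is associative and commutative.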

3. Aggregation Task - Result

4. Join Task
- The join task consists of two sub-tasks that perform a complex calculation on two data sets
  - In the first part of the task, each system must find the sourceIP that generated the most revenue within a particular date range
  - Once these intermediate records are generated, the system must then calculate the average pageRank of all the pages visited during this interval
- Query
    SELECT INTO Temp sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue
    FROM Rankings AS R, UserVisits AS UV
    WHERE R.pageURL = UV.destURL
      AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
    GROUP BY UV.sourceIP;

    SELECT sourceIP, totalRevenue, avgPageRank
    FROM Temp
    ORDER BY totalRevenue DESC LIMIT 1;
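The logic of the two queries can be traced on a small in-memory example. The sketch below is plain Python, not an MR program: it filters visits by date, joins them to Rankings on the URL, aggregates per sourceIP, and keeps the row with the highest total revenue. The sample data, the tuple layout, and the function name `top_source_ip` are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample data: pageURL -> pageRank
rankings = {"a.com": 10, "b.com": 30}
# Hypothetical sample data: (sourceIP, destURL, visitDate, adRevenue)
user_visits = [
    ("1.1.1.1", "a.com", date(2000, 1, 16), 2.0),
    ("1.1.1.1", "b.com", date(2000, 1, 20), 1.0),
    ("2.2.2.2", "b.com", date(2000, 1, 18), 2.5),
    ("2.2.2.2", "a.com", date(2000, 2, 1), 9.0),  # outside the date range
]


def top_source_ip(rankings, user_visits, start, end):
    """Return (sourceIP, totalRevenue, avgPageRank) for the top-revenue sourceIP."""
    # Per sourceIP: [total revenue, sum of pageRank, number of joined visits]
    per_ip = defaultdict(lambda: [0.0, 0, 0])
    for source_ip, dest_url, visit_date, ad_revenue in user_visits:
        # BETWEEN is inclusive on both ends; the join keeps only known URLs.
        if start <= visit_date <= end and dest_url in rankings:
            acc = per_ip[source_ip]
            acc[0] += ad_revenue
            acc[1] += rankings[dest_url]
            acc[2] += 1
    rows = [(ip, rev, rank_sum / n) for ip, (rev, rank_sum, n) in per_ip.items()]
    # Equivalent of ORDER BY totalRevenue DESC LIMIT 1 (raises if no rows match).
    return max(rows, key=lambda row: row[1])


result = top_source_ip(rankings, user_visits, date(2000, 1, 15), date(2000, 1, 22))
```

With this sample, the visit dated 2000-02-01 is excluded by the date filter, so its large adRevenue does not affect the ranking.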

4. Join Task - Result

5. UDF Aggregation Task
- The final task is to compute the inlink count for each document in the dataset
- Query
    SELECT INTO Temp F(contents) FROM Documents;
  - F: a user-defined function that parses the contents of each record in the Documents table and emits URLs into the database
  - With this function F, we populate a temporary table with a list of URLs and can then execute a simple query to calculate the inlink count
    SELECT url, SUM(value) FROM Temp GROUP BY url;
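A small Python sketch can make the role of F concrete: extract URLs from each document's contents, emit one count per occurrence, and sum per URL. The regular expression and the names `extract_urls` and `inlink_counts` are assumptions for illustration; the slides do not specify how F recognizes URLs.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumed URL shape: http:// or https:// followed by non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(contents: str) -> List[str]:
    """Plays the role of F: return every URL found in one document."""
    return URL_PATTERN.findall(contents)


def inlink_counts(documents: Iterable[str]) -> Counter:
    """Equivalent of SELECT url, SUM(value) ... GROUP BY url with value = 1."""
    counts: Counter = Counter()
    for contents in documents:
        for url in extract_urls(contents):
            counts[url] += 1
    return counts


# Hypothetical sample documents.
docs = ["see http://a.com and http://b.com", "link to http://a.com"]
counts = inlink_counts(docs)
```

In MR terms, `extract_urls` corresponds to the map function and the per-URL sum to the reduce function.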

5. UDF Aggregation Task - Result

Conclusion
MapReduce < Parallel DBMS

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin
VLDB 2009
2009-10-09
Summarized by Jaeseok Myung
Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

HadoopDB
- The basic idea (an architectural hybrid of MR and DBMS)
  - Use MR as the communication layer above multiple nodes running single-node DBMS instances
- Queries are expressed in SQL and translated into MR by extending existing tools; as much work as possible is pushed into the higher-performing single-node databases
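The idea of pushing work into the per-node databases can be illustrated with a toy example. In the sketch below, each "node" is an in-memory SQLite database that runs the aggregation locally, and a merge step standing in for the MR layer combines the partial results. SQLite is only a convenient stand-in for a single-node DBMS, and the names `make_node`, `local_query`, and `merge` are hypothetical; this is not HadoopDB's actual interface.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create one 'node': an in-memory database holding a partition of UserVisits."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE UserVisits (sourceIP TEXT, adRevenue REAL)")
    conn.executemany("INSERT INTO UserVisits VALUES (?, ?)", rows)
    return conn


def local_query(conn):
    """Work pushed into the node's database: a local partial aggregation."""
    return conn.execute(
        "SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP"
    ).fetchall()


def merge(partials):
    """Stand-in for the MR layer: combine partial sums from all nodes."""
    totals = defaultdict(float)
    for rows in partials:
        for source_ip, revenue in rows:
            totals[source_ip] += revenue
    return dict(totals)


# Two hypothetical partitions of the data.
nodes = [
    make_node([("1.1.1.1", 1.0), ("2.2.2.2", 2.0)]),
    make_node([("1.1.1.1", 0.5)]),
]
totals = merge(local_query(node) for node in nodes)
```

Only the already-aggregated rows cross node boundaries here, which is the benefit the slide attributes to pushing work down.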

The Architecture of HadoopDB
