A Comparison of Approaches to Large-Scale Data Analysis Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. Dewitt, Samuel Madden, Michael.


A Comparison of Approaches to Large-Scale Data Analysis
Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker
SIGMOD
Summarized by Jaeseok Myung
Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

MapReduce vs. Parallel DBMS
Copyright 2009 by CEBT, Center for E-Business Technology

MapReduce
[Figure. Source: 한재선, SearchDay2008]

Architectural Differences

                     Parallel DBMS                  MapReduce
Schema Support       Yes                            No
Indexing             Yes                            No
Programming Model    Stating what you want (SQL)    Presenting an algorithm (C/C++, Java, …)
Optimization         Yes                            No
Flexibility                                         Good
Fault Tolerance                                     Good

Benchmark Environment (1/2)
- Systems
  - Hadoop: the most popular open-source MR implementation
  - DBMS-X: a parallel DBMS that stores data in a row-based format
  - Vertica: a column-based parallel DBMS
- All three systems were deployed on a 100-node cluster
- Analytical Tasks
  - Data Loading
  - Selection Task
  - Aggregation Task
  - Join Task
  - UDF Aggregation Task

Benchmark Environment (2/2)
- Dataset
  - Documents: 600,000 unique documents for each node
  - 155 million UserVisits records (20GB/node)
  - 18 million Rankings records (1GB/node)

1. Data Loading
[Figure: data loading results; chart annotations: loading time, reorganization]

2. Selection Task
- The selection task is a lightweight filter that finds the pageURLs in the Rankings table (1GB/node) with a pageRank above a user-defined threshold
- Query
    SELECT pageURL, pageRank
    FROM Rankings
    WHERE pageRank > x;
  - x = 10, which yields approximately 36,000 records per data file on each node
- For MR, the same task is implemented in Java
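As a rough illustration of what the MR side of this filter involves, the selection can be written as a map-only job. The sketch below is plain Python, not the Java program used in the benchmark; the function name `select_map`, the (pageURL, pageRank) tuple layout, and the sample rows are assumptions made for illustration.

```python
from typing import Iterable, Iterator, Tuple

Ranking = Tuple[str, int]  # (pageURL, pageRank); assumed record layout


def select_map(records: Iterable[Ranking], threshold: int) -> Iterator[Ranking]:
    """Map-only filter: emit every record whose pageRank exceeds the threshold."""
    for page_url, page_rank in records:
        if page_rank > threshold:
            yield page_url, page_rank


# Hypothetical sample rows standing in for one node's Rankings file.
sample_rankings = [("a.example/1", 3), ("a.example/2", 10), ("a.example/3", 42)]
selected = list(select_map(sample_rankings, threshold=10))
```

The comparison is strict (`>`), matching the SQL predicate, so the row with pageRank equal to the threshold is not emitted. No reduce step is needed for a pure filter.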

2. Selection Task - Result
[Figure: selection task results; chart annotations: processing time; time for combining the output into a single file (additional MR job)]

3. Aggregation Task
- The aggregation task calculates the total adRevenue generated for each sourceIP in the UserVisits table (20GB/node), grouped by the sourceIP column
- Query
    SELECT sourceIP, SUM(adRevenue)
    FROM UserVisits
    GROUP BY sourceIP;
  - This task always produces 2.5 million records
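The same GROUP BY can be pictured as a map, shuffle, and reduce pipeline. The following is a minimal single-process sketch in Python, assuming a (sourceIP, adRevenue) tuple layout; the helper names `map_visits`, `shuffle`, and `reduce_revenue` are hypothetical and do not come from the benchmark code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Visit = Tuple[str, float]  # (sourceIP, adRevenue); assumed record layout


def map_visits(records: Iterable[Visit]) -> Iterator[Visit]:
    """Map: emit (sourceIP, adRevenue) as a key/value pair for each visit."""
    for source_ip, ad_revenue in records:
        yield source_ip, ad_revenue


def shuffle(pairs: Iterable[Visit]) -> Dict[str, List[float]]:
    """Shuffle: group all values that share a key (done by the framework in MR)."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_revenue(source_ip: str, revenues: List[float]) -> Tuple[str, float]:
    """Reduce: sum the revenue values collected for one sourceIP."""
    return source_ip, sum(revenues)


# Hypothetical sample rows standing in for UserVisits records.
sample_visits = [("10.0.0.1", 1.5), ("10.0.0.2", 2.0), ("10.0.0.1", 0.5)]
grouped = shuffle(map_visits(sample_visits))
totals = dict(reduce_revenue(ip, revs) for ip, revs in grouped.items())
```

Here `totals` plays the role of the query result: one row per distinct sourceIP.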

3. Aggregation Task - Result

4. Join Task
- The join task consists of two sub-tasks that perform a complex calculation on two data sets
  - In the first part of the task, each system must find the sourceIP that generated the most revenue within a particular date range
  - Once these intermediate records are generated, the system must then calculate the average pageRank of all the pages visited during this interval
- Query
    SELECT INTO Temp sourceIP, AVG(pageRank) as avgPageRank, SUM(adRevenue) as totalRevenue
    FROM Rankings AS R, UserVisits AS UV
    WHERE R.pageURL = UV.destURL
      AND UV.visitDate BETWEEN Date(‘ ’) AND Date(‘ ’)
    GROUP BY UV.sourceIP;

    SELECT sourceIP, totalRevenue, avgPageRank
    FROM Temp
    ORDER BY totalRevenue DESC LIMIT 1;
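One way to picture how this query decomposes into MR-style steps is sketched below in Python: a join on the URL key, a per-sourceIP aggregation, and a final pick of the top row. This is an illustrative single-process sketch, not the benchmark's implementation; the function name `join_task`, the tuple layouts, the sample rows, and the date range are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def join_task(rankings, user_visits, start, end):
    """Return (sourceIP, totalRevenue, avgPageRank) for the top-revenue sourceIP."""
    # Step 1: group both inputs by the join key (URL), filtering visits by date.
    by_url = defaultdict(lambda: {"ranks": [], "visits": []})
    for page_url, page_rank in rankings:
        by_url[page_url]["ranks"].append(page_rank)
    for source_ip, dest_url, visit_date, ad_revenue in user_visits:
        if start <= visit_date <= end:  # inclusive, like SQL BETWEEN
            by_url[dest_url]["visits"].append((source_ip, ad_revenue))

    # Step 2: join within each URL group, then regroup the pairs by sourceIP.
    per_ip = defaultdict(list)
    for group in by_url.values():
        for page_rank in group["ranks"]:
            for source_ip, ad_revenue in group["visits"]:
                per_ip[source_ip].append((page_rank, ad_revenue))

    # Step 3: aggregate per sourceIP and keep the row with the highest revenue.
    temp = [
        (ip, sum(rev for _, rev in pairs), sum(rank for rank, _ in pairs) / len(pairs))
        for ip, pairs in per_ip.items()
    ]
    return max(temp, key=lambda row: row[1]) if temp else None


# Hypothetical sample data: (pageURL, pageRank) and
# (sourceIP, destURL, visitDate, adRevenue).
rankings = [("u1", 10), ("u2", 30)]
user_visits = [
    ("1.1.1.1", "u1", date(2024, 1, 2), 1.0),
    ("1.1.1.1", "u2", date(2024, 1, 3), 2.0),
    ("2.2.2.2", "u2", date(2024, 1, 4), 2.5),
    ("2.2.2.2", "u1", date(2024, 2, 1), 9.0),  # outside the date range
]
top = join_task(rankings, user_visits, date(2024, 1, 1), date(2024, 1, 7))
```

In this sample the visit worth 9.0 falls outside the range, so the first sourceIP wins with a total of 3.0 and an average pageRank of 20.0.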

4. Join Task - Result

5. UDF Aggregation Task
- The final task is to compute the inlink count for each document in the dataset
- Query
    SELECT INTO Temp F(contents) FROM Documents;
  - F: a user-defined function that parses the contents of each record in the Documents table and emits URLs into the database
  - With this function F, we populate a temporary table with a list of URLs and can then execute a simple query to calculate the inlink count
    SELECT url, SUM(value) FROM Temp GROUP BY url;
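The role of F followed by the GROUP BY can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expression is a deliberately simplistic stand-in for whatever parsing the real F performs, and the names `extract_urls` and `inlink_counts` and the sample documents are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List

# Simplistic URL matcher, assumed for illustration only.
URL_PATTERN = re.compile(r'https?://[^\s"<>]+')


def extract_urls(contents: str) -> List[str]:
    """Stand-in for F: parse one document's contents and emit the URLs found."""
    return URL_PATTERN.findall(contents)


def inlink_counts(documents: Iterable[str]) -> Counter:
    """Count how often each URL is emitted, i.e. SUM(value) with value = 1."""
    counts: Counter = Counter()
    for contents in documents:
        for url in extract_urls(contents):
            counts[url] += 1
    return counts


# Hypothetical sample documents.
sample_documents = [
    'see <a href="http://a.example/x">x</a> and http://b.example/y',
    "again http://a.example/x here",
]
counts = inlink_counts(sample_documents)
```

In MR terms, `extract_urls` corresponds to the map step and the counting to the reduce step.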

5. UDF Aggregation Task - Result

Conclusion
MapReduce < Parallel DBMS

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin
VLDB
Summarized by Jaeseok Myung
Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

HadoopDB
- The Basic Idea (An Architectural Hybrid of MR & DBMS)
  - Use MR as the communication layer above multiple nodes running single-node DBMS instances
- Queries are expressed in SQL, translated into MR by extending existing tools, and as much work as possible is pushed into the higher-performing single-node databases
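The idea of pushing work into per-node databases and using an MR-style layer only to combine results can be sketched as follows. This is a toy single-process illustration, not HadoopDB's code: in-memory SQLite databases stand in for the single-node DBMS instances, and the names `make_node`, `run_query`, and `LOCAL_SQL` as well as the sample rows are assumptions.

```python
import sqlite3
from collections import defaultdict

# SQL pushed down to every node; each node aggregates its own partition.
LOCAL_SQL = "SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP"


def make_node(rows):
    """Create one stand-in 'node': an in-memory database holding a partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE UserVisits (sourceIP TEXT, adRevenue REAL)")
    conn.executemany("INSERT INTO UserVisits VALUES (?, ?)", rows)
    return conn


def run_query(nodes):
    """Run the local SQL on each node, then merge the partial sums."""
    totals = defaultdict(float)
    for conn in nodes:  # map side: the database does the heavy lifting
        for source_ip, partial_sum in conn.execute(LOCAL_SQL):
            totals[source_ip] += partial_sum  # reduce side: combine partials
    return dict(totals)


# Hypothetical partitions of UserVisits spread over two nodes.
nodes = [
    make_node([("10.0.0.1", 1.0), ("10.0.0.2", 2.0)]),
    make_node([("10.0.0.1", 0.5)]),
]
result = run_query(nodes)
```

Only the small per-node partial results cross the "communication layer" here, which is the point of pushing the aggregation down.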

The Architecture of HadoopDB

HadoopDB - Join Task