A Comparison of Approaches to Large-Scale Data Analysis

Presentation transcript:

A Comparison of Approaches to Large-Scale Data Analysis By seven authors from five different institutions Presented by Zhiqin Chen

Why not use a parallel DBMS instead? Commercially available for over 20 years (e.g. Microsoft, Oracle…). Robust, high performance, and provides a high-level programming environment. You can write almost any parallel processing task as either a set of database queries or a set of MapReduce jobs.

Outline Comparison: architectural differences. Benchmark & results: 5 tasks, load time, query time. Conclusion: show where each system is the right choice.

Architectural Differences: Data Storage MapReduce: raw (in-situ) data. Parallel DBMS: standard relational tables; most tables are partitioned over the nodes.

Architectural Differences Schema: MR doesn’t require a schema, a DBMS does (write a custom parser vs. specify the “shape” of the data). Indexing and optimization: MR provides no built-in support.

Architectural Differences: Programming Model Codasyl vs. Relational. Codasyl (Conference on Data Systems Languages): presenting an algorithm for data access, “the assembly language of DBMS access”. Relational: stating what you want.

Architectural Differences: Expressiveness Flexibility vs. Simplicity. Almost all of the major DBMS products support user-defined functions (UDFs), though UDF support proved problematic in this study.

Architectural Differences: Fault Tolerance Data transfer strategy: pull vs. push. MR supports mid-query fault tolerance: output files of the Map phase are materialized locally, and pipelines of MR jobs write intermediate results to files. DBMSs typically don’t. This matters when the number of nodes gets large.

The benchmark and experiments

Hardware 100-node Linux cluster at U. Wisconsin “Shared nothing” Local disk and local memory Connected by LAN

Software Hadoop: publicly available open-source implementation of MapReduce. DBMS-X: parallel shared-nothing row store from a major vendor; partitioned, sorted, indexed and compressed where beneficial. Vertica: parallel shared-nothing column-oriented database; sorted, indexed and compressed where beneficial.

“DeWitt Clause”: license terms that forbid publishing benchmark results, which is why the row-store vendor is anonymized as DBMS-X.

Grep Used in the original MapReduce paper. Look for a 3-character pattern in the 90-byte field of 100-byte records with schema; the pattern appears in 0.01% of records. CREATE TABLE Data ( key VARCHAR(10) PRIMARY KEY, field VARCHAR(90) ); SELECT * FROM Data WHERE field LIKE '%XYZ%';
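
For comparison with the SQL above, here is a minimal sketch of the grep task's MapReduce logic, written as plain Python functions over an in-memory list rather than the actual Hadoop API; the record layout and the 'XYZ' pattern come from the slide, while the function names and sample data are illustrative.

def grep_map(key, field, pattern="XYZ"):
    # Map: emit the record unchanged if the 3-character pattern occurs in the field.
    if pattern in field:
        yield (key, field)

def grep_reduce(key, values):
    # Reduce: identity pass-through; grep needs no aggregation.
    for value in values:
        yield (key, value)

records = [("key1", "aaaXYZ" + "b" * 84), ("key2", "c" * 90)]
matches = [out for k, f in records for out in grep_map(k, f)]
print(matches)  # only key1 survives the map phase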

Load times – Grep (535MB/node) DBMS load times include optimization, compression, indexing… DBMS-X: load time increases proportionally with data size (sequential read). Hadoop: similar proportional increase, since loading just copies and replicates the data.

Load times – Grep (1TB/cluster) 10-40 GB/node

Query times - Grep (535MB/node) MR start-up cost (10-25 s) dominates in short-running queries; an additional MR job is needed to merge the results into a single file.

Query times - Grep (1TB/cluster) 10-40 GB/node

Analytical tasks Simple HTML document processing. Documents: 600,000 documents/node (~8 GB/node), randomly generated with unique URLs, each embedding random URLs to other documents. Rankings: ~1 GB/node. UserVisits: ~20 GB/node.

Analytical tasks: schema CREATE TABLE UserVisits ( sourceIP VARCHAR(16), destURL VARCHAR(100), visitDate DATE, adRevenue FLOAT, userAgent VARCHAR(64), countryCode VARCHAR(3), languageCode VARCHAR(6), searchWord VARCHAR(32), duration INT ); CREATE TABLE Documents ( url VARCHAR(100) PRIMARY KEY, contents TEXT ); CREATE TABLE Rankings ( pageURL VARCHAR(100) PRIMARY KEY, pageRank INT, avgDuration INT );

Load times – UserVisits (20GB/node)

Aggregation task Calculate the total adRevenue generated for each sourceIP in the UserVisits table, grouped by sourceIP. Nodes need to exchange intermediate data with one another in order to compute the final value. Produces ~2.5 million records (53 MB). SELECT sourceIP, SUM( adRevenue ) FROM UserVisits GROUP BY sourceIP;
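
A minimal sketch of the same aggregation expressed as MapReduce logic in plain Python, with a tiny in-memory "shuffle" standing in for Hadoop's sort/merge; the field order follows the UserVisits schema above, and all names and sample rows are illustrative.

from collections import defaultdict

def agg_map(uservisit_row):
    # Map: emit (sourceIP, adRevenue) for each UserVisits record.
    source_ip, dest_url, visit_date, ad_revenue = uservisit_row[:4]
    yield (source_ip, ad_revenue)

def agg_reduce(source_ip, revenues):
    # Reduce: sum the revenue for one sourceIP group.
    yield (source_ip, sum(revenues))

def run_job(rows, map_fn, reduce_fn):
    # Simulate the shuffle: group map output by key, then reduce each group.
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_fn(row):
            groups[key].append(value)
    return [out for key, values in groups.items() for out in reduce_fn(key, values)]

rows = [("1.2.3.4", "a.com", "2000-01-15", 0.5),
        ("1.2.3.4", "b.com", "2000-01-16", 1.0),
        ("5.6.7.8", "a.com", "2000-01-15", 2.0)]
print(run_job(rows, agg_map, agg_reduce))  # [('1.2.3.4', 1.5), ('5.6.7.8', 2.0)]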

Query times - Aggregation Runtime is dominated by scanning and communication cost. Vertica is fast (column store), and its runtime decreases as more nodes are added.

Aggregation task (variation) Calculate the total adRevenue generated for each sourceIP in the UserVisits table, grouped by the seven-character prefix of sourceIP, to measure the effect of reducing the total number of groups on query performance. Produces ~2,000 records (24 KB). SELECT SUBSTR( sourceIP, 1, 7 ), SUM( adRevenue ) FROM UserVisits GROUP BY SUBSTR( sourceIP, 1, 7 );
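
Under the same toy simulation as the previous sketch, this variation only changes the map key; the reduce function and driver are unchanged (names remain illustrative).

def agg_prefix_map(uservisit_row):
    # Map: key on the seven-character prefix of sourceIP instead of the full address,
    # which shrinks the number of groups the reducers must handle.
    source_ip, dest_url, visit_date, ad_revenue = uservisit_row[:4]
    yield (source_ip[:7], ad_revenue)

# Reusing run_job, rows and agg_reduce from the previous sketch:
# print(run_job(rows, agg_prefix_map, agg_reduce))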

Query times – Aggregation var. Runtime is dominated by scanning the entire dataset.

UDF task Compute the inlink count for each document in the dataset: first read each document and search for all URLs, then, for each unique URL, count the number of unique pages that reference it. MR is believed to be commonly used for this type of task (so it should perform well).

UDF task In SQL: a UDF F(contents) extracts the URLs, followed by an aggregation. Neither DBMS made this easy: Vertica didn’t support UDFs, so an external program was used to populate a temporary table; DBMS-X had buggy BLOB support, so the UDF read the documents from the file system. Hadoop makes such tasks extremely easy to write. SELECT INTO Temp F( contents ) FROM Documents; SELECT url, SUM( value ) FROM Temp GROUP BY url;
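
A minimal Python sketch of the inlink count as MapReduce logic, using a simple regex in place of the benchmark's URL extractor; the regex, names, and sample documents are illustrative, not the actual benchmark code.

import re
from collections import defaultdict

URL_RE = re.compile(r'href="(http://[^"]+)"')

def inlink_map(doc_url, contents):
    # Map: for each URL referenced by this document, emit (referenced_url, referring_doc).
    for target in set(URL_RE.findall(contents)):
        yield (target, doc_url)

def inlink_reduce(target_url, referring_docs):
    # Reduce: count the unique pages that reference this URL.
    yield (target_url, len(set(referring_docs)))

def run(documents):
    groups = defaultdict(list)
    for url, contents in documents:
        for key, value in inlink_map(url, contents):
            groups[key].append(value)
    return dict(out for key, values in groups.items() for out in inlink_reduce(key, values))

docs = [("http://a.com/p1", '<a href="http://b.com/p2">x</a>'),
        ("http://c.com/p3", '<a href="http://b.com/p2">y</a> <a href="http://a.com/p1">z</a>')]
print(run(docs))  # {'http://b.com/p2': 2, 'http://a.com/p1': 1}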

Query times - UDF Bar segments: ① query execution, ② UDF / loading the data into the table. MR needs an additional job to merge results into a single file, and that job's time grows with the amount of data to combine. DBMS-X is worse than Hadoop due to the UDF's interaction with the file system. Vertica has to parse the data outside the DBMS and write it to local disk before loading it into the DBMS.

Discussion System setup: parallel DBMSs are much more challenging than Hadoop to install and configure properly. DBMS-X was difficult to configure and required repeated assistance from the vendor to obtain a configuration that performed well; on occasion, the combination of manual and automatic changes resulted in a configuration that caused DBMS-X to refuse to boot the next time the system started. Task start-up: Hadoop has a “cold start” nature, whereas parallel DBMSs are started at OS boot time and are thus always “warm”.

Discussion Loading: Hadoop load times are faster, because loading is just copying (no indexing, no optimization). Querying: Hadoop query times are a lot slower; DBMS-X was 3.2 times faster than Hadoop, and Vertica was 2.3 times faster than DBMS-X. “MapReduce is a GO SLOW command for OLAP queries.” -- from a talk at Brown University (YouTube)

When to choose MapReduce? Load times – UserVisits (20GB/node) vs. Query times – Join (the two charts shown earlier).

When to choose MapReduce? MapReduce is designed for one-off processing tasks: where fast load times are important, there is no repeated access, and the data has no schema or structure and needs UDFs. There is no compelling reason to choose MR over a database for traditional database workloads.

Thank you. Q&A

Parallel DBMS query execution Filtering: performed in parallel on each node. Join: strategy depends on the size of the tables; a small table is replicated on all nodes and joined in parallel, while huge tables require re-hashing and redistribution. Aggregation: each node computes its own portion, followed by a final “roll-up”.
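
A minimal Python sketch of the aggregation strategy just described: each node computes a partial aggregate over its local partition, and a coordinator performs the final "roll-up". The node layout and data are illustrative.

from collections import Counter

def node_partial_sum(partition):
    # Each node aggregates only its local partition of (sourceIP, adRevenue) rows.
    partial = Counter()
    for source_ip, ad_revenue in partition:
        partial[source_ip] += ad_revenue
    return partial

def coordinator_rollup(partials):
    # Final "roll-up": merge the per-node partial aggregates into the global result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

partitions = [
    [("1.2.3.4", 0.5), ("5.6.7.8", 2.0)],  # rows stored on node 1
    [("1.2.3.4", 1.0)],                    # rows stored on node 2
]
print(coordinator_rollup(node_partial_sum(p) for p in partitions))
# {'1.2.3.4': 1.5, '5.6.7.8': 2.0}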

Hardware 100-node Linux cluster at U. Wisconsin. “Shared nothing”: local disk and local memory, connected by LAN. Can 100 nodes represent real-world systems? At 100 nodes we already see significant differences, and very few applications really need 1,000 nodes: eBay uses just 72 nodes, Fox Interactive Media uses 40 nodes.

Selection task A lightweight filter to find the pageURLs in the Rankings table with a pageRank above a user-defined threshold. ~36,000 records per data file on each node. SELECT pageURL, pageRank FROM Rankings WHERE pageRank > 10;
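
As a MapReduce analogue of the SQL above, here is a minimal Python sketch of the selection as a map-only filter over Rankings rows; the field order follows the Rankings schema, and the names and sample rows are illustrative.

def selection_map(ranking_row, threshold=10):
    # Map: emit (pageURL, pageRank) only for rows whose pageRank exceeds the threshold.
    page_url, page_rank, avg_duration = ranking_row
    if page_rank > threshold:
        yield (page_url, page_rank)

rows = [("http://a.com/p1", 15, 4), ("http://b.com/p2", 3, 9)]
print([out for row in rows for out in selection_map(row)])  # [('http://a.com/p1', 15)]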

Query times - Selection Vertica: cost is low but increases with the number of nodes; each node still executes the query in roughly the same time, but the system becomes flooded with control messages.

Join Task Consists of two sub-tasks that perform a complex calculation over two data sets. First part: find the sourceIP that generated the most revenue within a particular date range. Second part: calculate the average pageRank of the pages visited during this interval. Produces ~134,000 records.

Join Task SELECT INTO Temp sourceIP, AVG( pageRank ) as avgPageRank, SUM( adRevenue ) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date( '2000-01-15' ) AND Date( '2000-01-22' ) GROUP BY UV.sourceIP; SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1;

Join Task MapReduce does not provide a built-in join, so the task is implemented as 3 separate jobs executed one after another: (1) filter UserVisits by date range and join with Rankings; (2) compute total adRevenue and average pageRank per sourceIP; (3) pick the largest total adRevenue from the previous output.
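
A minimal Python sketch of the first of those three jobs: filtering UserVisits by the date range and doing a reduce-side join with Rankings on the URL. The later aggregation and top-1 steps are indicated only in comments, and all names and sample records are illustrative rather than the benchmark's actual code.

from collections import defaultdict
from datetime import date

def join_map(record):
    # Map: tag each record with its source table and key it by the join URL.
    if record[0] == "R":  # Rankings record: ("R", pageURL, pageRank)
        _, page_url, page_rank = record
        yield (page_url, ("R", page_rank))
    else:                 # UserVisits record: ("UV", sourceIP, destURL, visitDate, adRevenue)
        _, source_ip, dest_url, visit_date, ad_revenue = record
        if date(2000, 1, 15) <= visit_date <= date(2000, 1, 22):
            yield (dest_url, ("UV", source_ip, ad_revenue))

def join_reduce(url, tagged_values):
    # Reduce: pair every filtered visit of this URL with the page's rank (reduce-side join).
    ranks = [v[1] for v in tagged_values if v[0] == "R"]
    visits = [v[1:] for v in tagged_values if v[0] == "UV"]
    for page_rank in ranks:
        for source_ip, ad_revenue in visits:
            yield (source_ip, (page_rank, ad_revenue))

def run(records):
    groups = defaultdict(list)
    for record in records:
        for key, value in join_map(record):
            groups[key].append(value)
    return [out for key, values in groups.items() for out in join_reduce(key, values)]

# Job 2 would group this output by sourceIP to compute AVG(pageRank) and SUM(adRevenue);
# job 3 would scan job 2's output for the row with the largest total adRevenue.
print(run([("R", "http://a.com/p1", 7),
           ("UV", "1.2.3.4", "http://a.com/p1", date(2000, 1, 16), 0.5)]))
# [('1.2.3.4', (7, 0.5))]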

Query times - Join Complete scan (MR) vs. indexed and partitioned by the join key (DBMS, join performed locally). MR: ~600 s to read the data and ~300 s to parse it; CPU is the limiting factor.

Discussion Compression: parallel DBMSs allow optional compression, and Vertica’s execution engine operates directly on compressed data. Hadoop supports data compression, yet it did not improve performance.

Discussion User-level aspects: MR is easy to start with but hard to maintain, and MR lacks additional tools (for tuning, debugging, etc.).

Conclusion MapReduce advantages: easy to set up and easy to use, fault tolerance, fast load times, one-off processing. DBMS advantages: fast query times, supporting tools, repeated re-access.