
A Comparison of Approaches to Large-Scale Data Analysis Erik Paulson Computer Sciences Department

For later this afternoon: some citations for things I'll mention today (spring12.html)

Today's Class
–Quick overview of Big Data, parallel databases, and MapReduce
–Controversy, or what happens when a blog gets out of hand
–A comparison of approaches to large-scale data analysis

Lots of data
–Google processes 20 PB a day (2008)
–Wayback Machine has 3 PB + TB/month (3/2009)
–Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
–eBay has 6.5 PB of user data + 50 TB/day (5/2009)
–CERN's LHC will generate 15 PB a year
This slide courtesy of Professor Jimmy Lin of UMD, from "Data-Intensive Information Processing Applications"

Not Just Internet Companies
–A new "leg" of science? Experiment, Theory, Simulation, and "Data Exploration" or "in-ferro"?
–Ocean Observatories, Ecological Observatories (NEON)
–Sloan Digital Sky Survey, Large Synoptic Survey Telescope

More Big Data
–Netflix and Amazon recommendations
–Fraud detection
–Google auto-suggest, translation (also, which ad should you be shown)
–Full corpus studies in the Humanities
–Coming soon to campus: early academic intervention

It doesn't have to be "Big"
Megabytes are still interesting
–Moving to the web has made integration much easier
–Tools are better
–Making statistics sexy
More examples
–2012 campaigns
–Mashups
–Crisis response (Ushahidi)

Speed-up and Scale-up
Performance challenges:
–Run lots of transactions quickly
–Run a single large query quickly (our focus today)
"Speed-up": 2 times the hardware, same work, ½ the time to finish
"Scale-up": 2 times the hardware, 2 times the work, same time to finish
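Stated a bit more formally (the notation T(H, W) for the elapsed time of workload W on hardware H is mine, not the speaker's):

```latex
\text{speed-up}(N) = \frac{T(1\ \text{node},\, W)}{T(N\ \text{nodes},\, W)}, \qquad
\text{scale-up}(N) = \frac{T(1\ \text{node},\, W)}{T(N\ \text{nodes},\, N \cdot W)}
```

Linear speed-up means speed-up(N) = N (twice the hardware, half the time); linear scale-up means scale-up(N) = 1 (twice the hardware and twice the work, same time).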

How do we get data to the workers?
[Diagram: compute nodes pulling data over the network from shared NAS/SAN storage]
What's the problem here?
This slide courtesy of Professor Jimmy Lin of UMD, from "Data-Intensive Information Processing Applications"

Distributed File System
Don't move data to workers… move workers to the data!
–Store data on the local disks of nodes in the cluster
–Start up the workers on the node that has the data local
Why?
–Not enough RAM to hold all the data in memory
–Disk access is slow, but disk throughput is reasonable
A distributed file system is the answer
–GFS (Google File System) for Google's MapReduce
–HDFS (Hadoop Distributed File System) for Hadoop
This slide courtesy of Professor Jimmy Lin of UMD, from "Data-Intensive Information Processing Applications"

GFS: Assumptions
Commodity hardware over "exotic" hardware
–Scale "out", not "up"
High component failure rates
–Inexpensive commodity components fail all the time
"Modest" number of huge files
–Multi-gigabyte files are common, if not encouraged
Files are write-once, mostly appended to
–Perhaps concurrently
Large streaming reads over random access
–High sustained throughput over low latency
GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
This slide courtesy of Professor Jimmy Lin of UMD, from "Data-Intensive Information Processing Applications"

GFS: Design Decisions
Files stored as chunks
–Fixed size (64MB)
Reliability through replication
–Each chunk replicated across 3+ chunkservers
Single master to coordinate access, keep metadata
–Simple centralized management
No data caching
–Little benefit due to large datasets, streaming reads
Simplify the API
–Push some of the issues onto the client (e.g., data layout)
HDFS = GFS clone (same basic ideas)
This slide courtesy of Professor Jimmy Lin of UMD, from "Data-Intensive Information Processing Applications"
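A toy sketch of the bookkeeping these decisions imply, assuming only what the slide states (64MB chunks, 3-way replication). The hash-based replica choice below is purely illustrative and is not GFS's actual placement policy, which also weighs racks, disk utilization, and load:

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64MB chunks, as on the slide
REPLICATION = 3                # each chunk replicated across 3+ chunkservers

def chunk_index(byte_offset: int) -> int:
    """Which fixed-size chunk of a file a given byte offset falls into."""
    return byte_offset // CHUNK_SIZE

def pick_chunkservers(file_name: str, index: int, servers: list) -> list:
    """Illustrative only: hash the (file, chunk) handle onto the server list.
    Real GFS placement also considers racks, disk usage, and load."""
    handle = f"{file_name}:{index}".encode()
    start = int.from_bytes(hashlib.sha1(handle).digest()[:4], "big") % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(REPLICATION)]

servers = [f"chunkserver{i}" for i in range(10)]
print(chunk_index(200 * 1024 * 1024))                  # byte offset 200MB lands in chunk 3
print(pick_chunkservers("/logs/part-0", 3, servers))   # three replica locations
```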

From GFS to HDFS
Terminology differences:
–GFS master = Hadoop namenode
–GFS chunkservers = Hadoop datanodes
Functional differences:
–No file appends in HDFS (planned feature)
–HDFS performance is (likely) slower
For the most part, we'll use the Hadoop terminology…
This slide courtesy of Professor Jimmy Lin of UMD, from "Data-Intensive Information Processing Applications"

HDFS Architecture (adapted from Ghemawat et al., SOSP 2003)
[Diagram: an application uses the HDFS client, which sends (file name, block id) requests to the HDFS namenode and gets back (block id, block location); the namenode keeps the file namespace (e.g., /foo/bar → block 3df2) and exchanges instructions and state with the datanodes; the client then reads (block id, byte range) directly from an HDFS datanode, which returns block data stored on its local Linux file system]
This slide courtesy of Professor Jimmy Lin of UMD, from "Data-Intensive Information Processing Applications"

Horizontal Partitioning
[Diagram: a table of (DATE, AMOUNT) tuples is split row-wise across several servers; server image from fundraw.com]

Partitioning Options
–Round-robin: when a new tuple comes in, put it at the next node, wrapping around when needed
–Range partition: similar data goes to the same node (partition by date, ID range, etc.); not always clear how to pick boundaries!
–Hash partition: apply a hash function to attributes to decide the node; hashing on the join key means all joins are local
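A minimal sketch of the three strategies. The four-node cluster, the date boundaries, and the hash function are illustrative choices of mine, not anything prescribed by a particular system:

```python
import hashlib

NODES = 4          # illustrative cluster size
_next_node = 0

def round_robin_partition(_tuple):
    """Send each incoming tuple to the next node, wrapping around."""
    global _next_node
    node = _next_node % NODES
    _next_node += 1
    return node

def range_partition(order_date, boundaries=("2009-04-01", "2009-07-01", "2009-10-01")):
    """Similar dates land on the same node; choosing good boundaries is the hard part."""
    for node, upper in enumerate(boundaries):
        if order_date < upper:
            return node
    return len(boundaries)

def hash_partition(join_key):
    """Hash the join key so matching tuples land on the same node (local joins)."""
    digest = hashlib.md5(str(join_key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NODES

print(round_robin_partition(("2009-05-03", 15.00)))  # 0, then 1, 2, 3, 0, ...
print(range_partition("2009-05-03"))                 # node 1 (falls before 2009-07-01)
print(hash_partition(42))                            # deterministic node in 0..3
```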

Parallel Databases: Three Key Techniques
–Data partitioning between storage nodes
–Pipelining of tuples between operators
–Partitioned execution of relational operators across multiple processors
Need new operators: split (shuffle) and merge

Pipelining
SELECT S.sname FROM Reserves R JOIN Sailors S ON S.sid = R.sid WHERE R.bid = 100 AND S.rating > 5
[Plan diagram, Node 1: scan Reserves filtering bid = 100, scan Sailors filtering rating > 5, join on S.sid = R.sid, project sname]

Pipelining – with partitioning, splitting, and merging
SELECT S.sname FROM Reserves R JOIN Sailors S ON S.sid = R.sid WHERE R.bid = 100 AND S.rating > 5
[Plan diagram: the same pipeline runs on Node 1 and Node 2; each node scans its partitions of Reserves (bid = 100) and Sailors (rating > 5), split operators shuffle the filtered tuples by sid across the nodes, merge operators combine the incoming streams feeding each node's local join and sname projection, and a final merge combines the results from both nodes]

MapReduce Overview
–Massively parallel data processing
–Programming model vs. execution platform
–Programs consist of only two functions:
Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → (k3, list(v3))
–Often, you'd like k2 and k3 to be the same so you can apply reduce to intermediate results
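As a concrete illustration of the two-function model, here is the classic word-count example with a tiny in-memory driver standing in for the framework's shuffle; the driver is a sketch for exposition, not how Hadoop or Google's implementation actually works:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the document."""
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    """Reduce(k2, list(v2)) -> (k3, v3): here k3 == k2, so output can be reduced again."""
    return (word, sum(counts))

def run_mapreduce(records, mapper, reducer):
    """Toy single-process driver; the real framework does this grouping across a cluster."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    return [reducer(k2, values) for k2, values in sorted(groups.items())]

docs = [("d1", "big data big clusters"), ("d2", "big queries")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('big', 3), ('clusters', 1), ('data', 1), ('queries', 1)]
```

Because the reducer's output key equals its input key (k3 = k2), its output could be fed through reduce again, which is exactly the property the slide mentions.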

Putting everything together…
[Diagram: a namenode runs the namenode daemon, a job submission node runs the jobtracker, and each slave node runs a datanode daemon and a tasktracker on top of its local Linux file system]
This slide courtesy of Professor Jimmy Lin of UMD, from "Data-Intensive Information Processing Applications"

MapReduce Example
Input: two partitions of (CITY, AMOUNT) records
–Partition A: Los Angeles $19, San Fran $25, Verona $53, Houston $12, El Paso $45, Waunakee $99, Cupertino $15
–Partition B: Dallas $10, San Diego $25, Madison $53, Eau Claire $12, Austin $45, San Jose $99, San Diego $15
Map output (each map worker groups its partition's records by state key):
–Worker A: TEXAS {Houston $12, El Paso $45}; CAL {Los Angeles $19, San Fran $25, Cupertino $15}; WISC {Verona $53, Waunakee $99}
–Worker B: TEXAS {Dallas $10, Austin $45}; CAL {San Diego $25, San Jose $99, San Diego $15}; WISC {Madison $53, Eau Claire $12}
Reduce output (reduce workers sum the amounts per state): Texas $112, Cal $198, WISC $217

The Data Center Is The Computer
"I predict MapReduce will inspire new ways of thinking about the design and programming of large distributed systems. If MapReduce is the first instruction of the 'data center computer,' I can't wait to see the rest of the instruction set…"
–David Patterson, Communications of the ACM (January 2008)

kerfuffle |kərˈfəfəl| (noun): a commotion or fuss

Timeline, continued
–Either MapReduce or a parallel database management system (pDBMS) can be used to analyze large datasets, so it is appropriate to compare them
–Proposed a few thought experiments for simple benchmarks

Timeline, continued Better understand each system through a comparison

My SIGMOD and CACM Co-Authors
–Daniel Abadi (Yale)
–David DeWitt (Microsoft)
–Samuel Madden* (MIT)
–Andrew Pavlo* (Brown)
–Alexander Rasin (Brown)
–Michael Stonebraker (MIT)
* (Primary creators of these slides – Thanks!)

Shared Features
"Shared Nothing"
–Cluster fault tolerance
–Push plans to the local node
–Hash or other partitioning for parallelism
MapReduce ancestors
–Functional programming
–Systems community
pDBMS ancestors
–Many 80s research projects

Differences: Data
MapReduce operates on in-situ data, without requiring transformations or loading
Schemas:
–MapReduce doesn't require them, DBMSs do
–Easy to write simple MR programs
–No logical data independence
–Google is addressing this with Protocol Buffers; see "MapReduce: A Flexible Data Processing Tool" in the January 2010 CACM

Differences: Programming Model
Common to write a chain of MapReduce jobs to perform a task (a sketch of such a chain follows below)
–Analogous to multiple joins/subqueries in DBMSs
–No built-in optimizer in MapReduce to order or unnest steps
Indexing in MapReduce
–Possible, but up to the programmer
–No optimizer or statistics to select a good plan at runtime
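A sketch of the "chain of jobs" pattern, reusing the toy run_mapreduce driver from the word-count example above. The city-to-state table and the sales records are made up for illustration: job 1 totals revenue per city, job 2 regroups those totals by state to find each state's top city, and with no optimizer the programmer chooses and orders these steps by hand:

```python
# Reuses run_mapreduce(...) from the word-count sketch above.

# Job 1: total revenue per city.
def map_city(city, amount):
    return [(city, amount)]

def reduce_city(city, amounts):
    return (city, sum(amounts))

# Job 2: regroup job 1's output by state and keep the top-earning city per state.
STATE_OF = {"Houston": "TEXAS", "Dallas": "TEXAS", "Madison": "WISC", "Verona": "WISC"}

def map_state(city, total):
    return [(STATE_OF[city], (city, total))]

def reduce_state(state, city_totals):
    return (state, max(city_totals, key=lambda ct: ct[1]))

sales = [("Houston", 12.0), ("Dallas", 10.0), ("Houston", 30.0),
         ("Madison", 53.0), ("Verona", 53.0), ("Madison", 5.0)]
per_city = run_mapreduce(sales, map_city, reduce_city)        # first job
per_state = run_mapreduce(per_city, map_state, reduce_state)  # second job consumes its output
print(per_state)  # [('TEXAS', ('Houston', 42.0)), ('WISC', ('Madison', 58.0))]
```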

Differences: Intermediate Results
–MapReduce writes intermediate results to disk; pDBMSs usually push results to the next stage immediately
–MapReduce trades speed for mid-query fault tolerance; either system could make the other choice
–See "MapReduce Online" from Berkeley

MapReduce and Databases
–Understand loading and execution behaviors for common processing tasks
–Large-scale data access (>1TB): analytical query workloads, bulk loads, non-transactional
–Benchmark: tasks that either system should execute well

Details
–100-node Linux cluster at Wisconsin; compression enabled in all systems
–Hadoop: 0.19.0, Java 1.6
–DBMS-X: parallel shared-nothing row store from a major vendor; hash-partitioned, sorted, and indexed
–Vertica: parallel shared-nothing column-oriented database; sorted beneficially

Grep Task
–Find a 3-byte pattern in a 100-byte record; 1 match per 10,000 records
–Data set: 10-byte unique key, 90-byte value; 1TB spread across 25, 50, or 100 nodes; 10 billion records
–Original MR paper (Dean et al. 2004)
–Speedup experiment
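A hedged sketch of what the map side of this task could look like. The record layout follows the slide, but the pattern bytes and file handling are placeholders of mine, not details from the paper:

```python
RECORD_LEN = 100   # 100-byte records: 10-byte unique key + 90-byte value (per the slide)
KEY_LEN = 10

def grep_map(record: bytes, pattern: bytes = b"XYZ"):
    """Map function: emit (key, value) only when the 3-byte pattern appears in the value."""
    key, value = record[:KEY_LEN], record[KEY_LEN:]
    if pattern in value:
        yield key, value

def scan_split(path: str, pattern: bytes = b"XYZ"):
    """Run the map function over one node's input split; the task is map-only, no reduce."""
    with open(path, "rb") as f:
        while True:
            record = f.read(RECORD_LEN)
            if len(record) < RECORD_LEN:
                break
            yield from grep_map(record, pattern)
```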

Grep Task: Data Load Times
[Chart: load times in seconds for 1TB of data distributed over the nodes]

Grep Task: Query Time
[Chart: query times in seconds for 1TB of data distributed over the nodes]
–Near-linear speedup
–DBMSs helped by compression and fast parsing

Analysis Tasks
–Simple web processing schema
–Data set: 600k HTML documents (6GB/node), 155 million UserVisit records (20GB/node), 18 million Rankings records (1GB/node)
–Full details of the schema are in the paper

Aggregation Task
–Scaleup experiment
–Simple query to find adRevenue by IP prefix:
SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7)
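For contrast, a sketch of the same aggregation as a MapReduce job, reusing the toy run_mapreduce driver from earlier. The '|'-delimited layout and field positions of the UserVisits records are assumptions for illustration, not the benchmark's actual format:

```python
# Reuses run_mapreduce(...) from the word-count sketch above.
def map_visit(_offset, line):
    """Emit (7-character IP prefix, adRevenue) for one UserVisits record.
    Assumes sourceIP is field 0 and adRevenue is field 3 of a '|'-delimited line."""
    fields = line.split("|")
    return [(fields[0][:7], float(fields[3]))]

def reduce_prefix(ip_prefix, revenues):
    """Sum adRevenue per prefix: the GROUP BY / SUM of the SQL version."""
    return (ip_prefix, sum(revenues))

visits = [(0, "100.21.45.9|http://a.com|2009-01-01|0.50"),
          (1, "100.21.46.3|http://b.com|2009-01-02|1.25"),
          (2, "207.31.1.88|http://c.com|2009-01-02|2.00")]
print(run_mapreduce(visits, map_visit, reduce_prefix))
# [('100.21.', 1.75), ('207.31.', 2.0)]
```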

Aggregation Task: Query Time

Other Tasks
–Paper reports selection (w/index) and join tasks; pDBMSs outperform Hadoop on these
–Vertica does very well on some tasks thanks to its column-oriented nature
–Hadoop join code is non-trivial

UDF Task
–One iteration of simplified PageRank
–UDF support was disappointingly awful in the pDBMSs benchmarked: Vertica has no support, and DBMS-X's is buggy
–Other systems could be better

Discussion
–Hadoop was much easier to set up (though by the end of the CACM 2010 work we gave it as much tuning as the other systems)
–Hadoop load times can be faster: good for "few-off" processing and ETL
–Hadoop query times are slower: parsing, validating, and boxing data; execution method

Hadoop Task Execution
–Hadoop is slow to start executing programs: 10 seconds until the first Map starts; 25 seconds until all 100 nodes are executing
–7 buffer copies per record before reaching the Map function [1]
–Parallel DBMSs are always "warm"
[1] The Anatomy of Hadoop I/O Pipeline – August 27th

Repetitive Data Parsing
–Hadoop has to parse/cast values every time: SequenceFiles provide serialized key/value pairs, but multi-attribute values must still be handled by user code
–DBMSs parse records at load time, allowing for efficient storage and retrieval

Google's Response
Jeffrey Dean and Sanjay Ghemawat, "MapReduce: A Flexible Data Processing Tool," CACM'10
Key points:
–Flaws in the benchmark
–Fault tolerance in large clusters
–Data parsing
–MapReduce the model versus the implementation

Google's Response: Flaws
–MR can load and execute queries in the same time it takes DBMS-X just to load
–Alternatives to reading all of the input data: select files based on naming convention; use alternative storage (BigTable)
–Combining the final reduce output

Google's Response: Cluster Size
Largest known database installations:
–Greenplum – 96 nodes – 4.5 PB (eBay) [1]
–Teradata – 72 nodes – 2+ PB (eBay) [1]
Largest known MR installations:
–Hadoop – 3658 nodes – 1 PB (Yahoo) [2]
–Hive – 600+ nodes – 2.5 PB (Facebook) [3]
[1] eBay's two enormous data warehouses – April 30th
[2] Hadoop Sorts a Petabyte in Hours and a Terabyte in 62 Seconds – May 11th
[3] Hive - A Petabyte Scale Data Warehouse using Hadoop – June 10th

Google's Response: Functionality
MapReduce enables parallel computations not easily performed in a DBMS:
–Stitching satellite images for Google Earth
–Generating the inverted index for Google Search
–Processing road segments for Google Maps
Programming model vs. execution platform

Our CACM Takeaway – A Sweet Spot for ETL
"Read once" data sets:
–Read data from several different sources
–Parse and clean
–Perform complex transformations
–Decide what attribute data to store
–Load the information into a DBMS
Allows for quick-and-dirty data analysis

Big Data in the Cloud Age
For about minimum wage, you can have a 100-node cluster
–Preconfigured to run Hadoop jobs, no less!
People will use what's available
–The cheaper the better
The database community did (does) not have a cheap and ready-to-download answer for this environment

Take away
MapReduce goodness:
–Ease of use, immediate results
–Attractive fault tolerance
–Applicability to other domains
–Fast load times and in-situ data
Database goodness:
–Fast query times
–Schemas, indexing, transactions, declarative languages
–Supporting tools and enterprise features

Where are we today
Hadoop improvements:
–YARN: a more flexible execution environment
–Better data encoding options: ORCFile, Parquet
–Hive and Impala run hot
–System catalogs and query optimizers
DBMS improvements:
–More expressive syntax to run MapReduce
–External tables on distributed filesystems
–Multi-cluster aware; query planners may run MapReduce jobs

Pipelining
SELECT S.sname FROM Reserves R JOIN Sailors S ON S.sid = R.sid WHERE R.bid = 100 AND S.rating > 5
[Plan diagram, Node 1: scan Reserves filtering bid = 100, scan Sailors filtering rating > 5, join on S.sid = R.sid, project sname]

Extra slides

Pipelining – with partitioning, splitting, and merging
SELECT S.sname FROM Reserves R JOIN Sailors S ON S.sid = R.sid WHERE R.bid = 100 AND S.rating > 5
[Plan diagram: the same pipeline runs on Node 1 and Node 2; each node scans its partitions of Reserves (bid = 100) and Sailors (rating > 5), split operators shuffle the filtered tuples by sid across the nodes, merge operators combine the incoming streams feeding each node's local join and sname projection, and a final merge combines the results from both nodes]

Backup slide #1
[Chart: Aggregation Task query times on 50 nodes across successive implementation refinements prompted by community feedback – expanded schemas, compression, 64-bit version, new version, code tuning, JVM reuse]