MapReduce With a SQL-MapReduce focus by Curt A. Monash, Ph.D. President, Monash Research Editor, DBMS2

Curt Monash Analyst since 1981  Covered DBMS since the pre-relational days  Also analytics, search, etc. Publicly available research  Blogs, including DBMS2  User and vendor consulting

Agenda Introduction and truisms MapReduce overview MapReduce specifics SQL and MapReduce together

Monash’s First Law of Commercial Semantics Bad jargon drives out good For example: “Relational”, “Parallel”, “MapReduce”

Where to measure database technology Language interpretation and execution capabilities  Functionality  Speed Administrative capabilities How well it all works  Fit and finish  Reliability How much it all – really – costs You can do anything in 0s and 1s … but how much effort will it actually take?

What’s hard about parallelization* Getting the right data … … to the right nodes … … at the right time … … while dealing with errors … … and without overloading the network Otherwise, programming a grid is a lot like programming a single node. *in general -- not just for “database” technology

MPP DBMSs are good at parallelization … … under three assumptions, namely: You can express the job nicely in SQL …  … or whatever other automatically-parallel languages the DBMS offers You don’t really need query fault-tolerance …  … which is usually the case unless you have 1000s of nodes There’s enough benefit to storing the data in tables to justify the overhead

SQL commonly gets frustrating … … when you’re dealing with sequences of events or relationships, because: Self-joins are expensive Programming is hard when you’re not sure how long the sequence is For example:  Clickstreams  Financial data time series  Social network graph analysis
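
A minimal Python sketch of why sequence work is easier procedurally (hypothetical names and data, not from the talk): the gap between consecutive clicks per user is one pass over sorted events, whereas plain SQL without window functions typically needs an expensive self-join of the clickstream table against itself.

    from collections import defaultdict

    def gaps_per_user(events):
        """events: iterable of (user_id, timestamp) pairs; timestamps are numbers."""
        by_user = defaultdict(list)
        for user_id, ts in events:
            by_user[user_id].append(ts)
        gaps = {}
        for user_id, stamps in by_user.items():
            stamps.sort()
            gaps[user_id] = [b - a for a, b in zip(stamps, stamps[1:])]
        return gaps

    clicks = [("u1", 10), ("u1", 25), ("u2", 5), ("u1", 100), ("u2", 7)]
    print(gaps_per_user(clicks))   # {'u1': [15, 75], 'u2': [2]}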

The pure MapReduce alternative Lightweight approach to parallelization The only absolute requirement is a certain simple programming model …  … so simple that parallelization is “automatic” …  … and very friendly to procedural languages It doesn’t require a DBMS on the back end  No SQL required! Non-DBMS implementations commonly have query fault-tolerance But you have to take care of optimizing data redistribution yourself

MapReduce evolution Used under-the-covers for quite a while Named and popularized by Google Open-sourced in Hadoop Widely adopted by big web companies Integrated (at various levels) into MPP RDBMS Adopted for social network analysis Explored/investigated for data mining applications ???

M/R use cases -- large-scale ETL Text indexing  This is how Google introduced the MapReduce concept Time series disaggregation  Clickstream sessionization and analytics  Stock trade pattern identification Relationship graph traversal

M/R use cases – hardcore arithmetic Statistical routines Data “cooking”

The essence of MapReduce “Map” steps Data redistribution “Reduce” steps In strict alternation … … or not-so-strict
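
As a rough single-process illustration of that map / redistribute / reduce cycle (a sketch of the programming model only, not how Hadoop or any particular engine is implemented):

    from collections import defaultdict

    def map_reduce(records, map_fn, reduce_fn):
        """Toy MapReduce: map each record, redistribute by key, reduce each group."""
        mapped = []                      # "Map" step: record -> (key, value) pairs
        for record in records:
            mapped.extend(map_fn(record))
        groups = defaultdict(list)       # Data redistribution: group values by key
        for key, value in mapped:
            groups[key].append(value)
        return {key: reduce_fn(key, values)   # "Reduce" step: one call per key
                for key, values in groups.items()}

A real engine runs the Map and Reduce calls on many nodes in parallel and handles the redistribution over the network.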

“Map” step basics (reality) Input = anything  Set of data  Output of a previous Reduce step Output = anything, as long as there’s an obvious key

Map step basics (formality) Input = {<key, value> pairs} Output = {<key, value> pairs} Input and output key types don’t have to be the same “Embarrassingly parallel” based on key

Map step examples Word count  Input format = document/text string  Output format = <word, 1> pairs Text indexing  Input format = document/text string  Output format = <word, document ID> pairs Log parsing  Input format = log file  Output format = <record key, parsed record> pairs
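
Continuing the toy Python sketch from the “essence of MapReduce” slide, the first two Map steps above might look like this (an illustration of the model, not Hadoop’s actual API; the text-indexing input is assumed to carry its document ID):

    def wordcount_map(document):
        """Word count Map: document/text string -> (word, 1) pairs."""
        return [(word.lower(), 1) for word in document.split()]

    def index_map(record):
        """Text indexing Map: (doc_id, text) -> (word, doc_id) pairs."""
        doc_id, text = record
        return [(word.lower(), doc_id) for word in text.split()]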

Reduce step basics Input = {<key, value> pairs}, where all the keys are equal Output = {<key, value> pairs}, where the set commonly has cardinality = 1 Input and output key types don’t have to be the same Just like Map, “embarrassingly parallel” based on key

Reduce step examples Word count  Input format = <word, 1> pairs  Output format = <word, count> Text indexing  Input format = <word, document ID> pairs  Output format = <word, list of document IDs> Log parsing  E.g., input format = <record key, parsed record> pairs  E.g., output format = <record key, aggregated or enriched record>
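
The matching Reduce steps in the same toy sketch (again illustrative only):

    def wordcount_reduce(word, counts):
        """Word count Reduce: (word, [1, 1, ...]) -> total count."""
        return sum(counts)

    def index_reduce(word, doc_ids):
        """Text indexing Reduce: (word, [doc IDs]) -> sorted posting list."""
        return sorted(set(doc_ids))

    # Using map_reduce() from the earlier sketch:
    # map_reduce(["the cat sat", "the dog"], wordcount_map, wordcount_reduce)
    #   -> {'the': 2, 'cat': 1, 'sat': 1, 'dog': 1}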

More honoured in the breach than in the observance!

Sometimes the Reduce step is trivial MapReduce for data mining Partition on some key Calculate a single vector* for each whole partition Aggregate the vectors Hooray! *Algorithm-dependent
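
A hedged sketch of that pattern, treating each whole partition as one Map input and using per-partition (sum, count) vectors for an overall mean as a stand-in for whatever vector the algorithm really needs:

    def partial_stats_map(partition_rows):
        """Map over one whole partition: emit a single summary vector (sum, count).
        The column name "x" is hypothetical."""
        values = [row["x"] for row in partition_rows]
        return [("stats", (sum(values), len(values)))]

    def combine_stats_reduce(key, vectors):
        """Near-trivial Reduce: fold the per-partition vectors into an overall mean."""
        total = sum(s for s, _ in vectors)
        count = sum(c for _, c in vectors)
        return total / count if count else 0.0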

Sometimes Reduce doesn’t reduce Tick stream data “cooking” can increase its size by one to two orders of magnitude Sessionization might just add a column – SessionID – to records  Or is that a Map step masquerading as a Reduce?
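
For example, a sessionization pass over one user’s already-grouped clickstream might look like this (a sketch assuming the common “30 minutes of inactivity ends a session” rule, which is not from the talk); note that it emits exactly as many rows as it reads:

    def sessionize(user_events, gap_seconds=1800):
        """One user's events in, the same events out with a session_id column added.
        user_events: iterable of (timestamp, payload), already grouped per user."""
        session_id, last_ts, out = 0, None, []
        for ts, payload in sorted(user_events, key=lambda e: e[0]):
            if last_ts is not None and ts - last_ts > gap_seconds:
                session_id += 1
            out.append({"session_id": session_id, "timestamp": ts, "payload": payload})
            last_ts = ts
        return out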

Some reasons to integrate SQL and MapReduce JOINs were invented for a reason So was SQL:2003 It’s kind of traditional to keep data in an RDBMS

Some ways to integrate SQL and MapReduce A SQL layer built on a MapReduce engine  E.g., Facebook’s Hive over Hadoop  But building a DBMS-equivalent is hard MapReduce invoking SQL SQL invoking MapReduce  Aster’s SQL-MapReduce

To materialize or not to materialize? DBMS avoidance of intermediate materialization → much better performance Classic MapReduce intermediate materialization → query fault-tolerance How much does query fault-tolerance matter?  (Query duration) x (Node count) vs.  Node MTTF DBMS-style materialization strategies usually win
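
One way to make the “(Query duration) x (Node count) vs. Node MTTF” comparison concrete (illustrative numbers only, not from the talk):

    def expected_node_failures(query_hours, node_count, node_mttf_hours):
        """Rough expected number of node failures during one query run."""
        return query_hours * node_count / node_mttf_hours

    print(expected_node_failures(0.5, 100, 20_000))    # 0.0025 -- just rerun on failure
    print(expected_node_failures(10, 5_000, 20_000))   # 2.5    -- fault-tolerance pays off

With thousands of nodes or multi-hour queries the expected failure count approaches or exceeds one, which is when classic MapReduce-style materialization starts to earn its keep.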

Other reasons to put your data in a real database Query response time General performance Backup Security General administration SQL syntax General programmability and connectivity

Aspects of Aster’s approach to MapReduce Data stored in a database MapReduce execution managed by a DBMS Flexible MapReduce syntax MapReduce invoked via SQL

Further information Curt A. Monash, Ph.D. President, Monash Research Editor, DBMS2