1
MapReduce
With a SQL-MapReduce focus
Curt A. Monash, Ph.D.
President, Monash Research
Editor, DBMS2
contact@monash.com
http://www.monash.com
http://www.DBMS2.com
2
Curt Monash
Analyst since 1981
  Covered DBMS since the pre-relational days
  Also analytics, search, etc.
Publicly available research
  Blogs, including DBMS2 (http://www.DBMS2.com)
  Feed at http://www.monash.com/blogs.html
User and vendor consulting
3
Agenda
Introduction and truisms
MapReduce overview
MapReduce specifics
SQL and MapReduce together
4
Monash’s First Law of Commercial Semantics
Bad jargon drives out good
For example: “Relational”, “Parallel”, “MapReduce”
5
Where to measure database technology
Language interpretation and execution capabilities
  Functionality
  Speed
Administrative capabilities
How well it all works
  Fit and finish
  Reliability
How much it all – really – costs
You can do anything in 0s and 1s … but how much effort will it actually take?
6
What’s hard about parallelization*
Getting the right data …
… to the right nodes …
… at the right time …
… while dealing with errors …
… and without overloading the network
Otherwise, programming a grid is a lot like programming a single node.
*in general – not just for “database” technology
7
MPP DBMS are good at parallelization …
… under three assumptions, namely:
You can express the job nicely in SQL …
  … or whatever other automatically-parallel languages the DBMS offers
You don’t really need query fault-tolerance …
  … which is usually the case unless you have 1000s of nodes
There’s enough benefit to storing the data in tables to justify the overhead
8
SQL commonly gets frustrating …
… when you’re dealing with sequences of events or relationships, because:
Self-joins are expensive (see the sketch below)
Programming is hard when you’re not sure how long the sequence is
For example:
  Clickstreams
  Financial time series
  Social network graph analysis
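To make the self-join pain concrete, here is a minimal sketch (my illustration, with an invented clicks table, not an example from the deck): finding three-click paths per user already takes a three-way self-join, and it still only handles paths of exactly length three.

```python
import sqlite3

# Toy clickstream. Finding every 3-click path per user already takes a
# 3-way self-join, and paths of length N need an N-way join written by hand.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (userid INT, ts INT, page TEXT)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?, ?)",
    [(1, 1, "home"), (1, 2, "search"), (1, 3, "buy"),
     (2, 1, "home"), (2, 2, "help")],
)

paths = conn.execute("""
    SELECT a.userid, a.page, b.page, c.page
    FROM clicks a
    JOIN clicks b ON b.userid = a.userid AND b.ts = a.ts + 1
    JOIN clicks c ON c.userid = a.userid AND c.ts = a.ts + 2
""").fetchall()
print(paths)  # [(1, 'home', 'search', 'buy')]
```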
9
The pure MapReduce alternative
Lightweight approach to parallelization
The only absolute requirement is a certain simple programming model …
… so simple that parallelization is “automatic” …
… and very friendly to procedural languages
It doesn’t require a DBMS on the back end
  No SQL required!
Non-DBMS implementations commonly have query fault-tolerance
But you have to take care of optimizing data redistribution yourself
10
MapReduce evolution
Used under the covers for quite a while
Named and popularized by Google
Open-sourced in Hadoop
Widely adopted by big web companies
Integrated (at various levels) into MPP RDBMS
Adopted for social network analysis
Being explored for data mining applications
???
11
M/R use cases – large-scale ETL
Text indexing
  This is how Google introduced the MapReduce concept
Time series disaggregation
Clickstream sessionization and analytics
Stock trade pattern identification
Relationship graph traversal
12
M/R use cases – hardcore arithmetic
Statistical routines
Data “cooking”
13
The essence of MapReduce
“Map” steps
Data redistribution
“Reduce” steps
In strict alternation …
… or not-so-strict
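A minimal single-process sketch of that Map → redistribute → Reduce cycle (my illustration, not code from the deck), where the redistribution step is just an in-memory group-by-key; later slides’ word-count example reuses this helper:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Minimal single-process MapReduce: Map, redistribute by key, Reduce."""
    # "Map" step: each input record yields zero or more (key, value) pairs.
    mapped = (pair for record in records for pair in map_fn(record))

    # Data redistribution: group values by key -- the "shuffle", which on a
    # real cluster is the network-heavy step.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # "Reduce" step: each key's values are reduced independently, which is
    # what makes both steps embarrassingly parallel.
    return [pair
            for key, values in groups.items()
            for pair in reduce_fn(key, values)]
```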
14
“Map” step basics (reality)
Input = anything
  Set of data
  Output of previous Reduce step
Output = anything
There’s an obvious key
15
Map step basics (formality)
Input = {<key, value> pairs}
Output = {<key, value> pairs}
Input and output key types don’t have to be the same
“Embarrassingly parallel” based on key
16
Map step examples
Word count
  Input format = document/text string
  Output format = <word, 1> pairs
Text indexing
  Input format = document/text string
  Output format = <word, document ID> pairs
Log parsing
  Input format = log file
  Output format = <event type, parsed record> pairs
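A sketch of the word-count Map step (my illustration), emitting a <word, 1> pair per word:

```python
def wordcount_map(document):
    """Map for word count: document/text string -> <word, 1> pairs."""
    for word in document.lower().split():
        yield (word, 1)
```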
17
Reduce step basics
Input = {<key, value> pairs}, where all the keys are equal
Output = {<key, value> pairs}, where the set commonly has cardinality = 1
Input and output key types don’t have to be the same
Just like Map, “embarrassingly parallel” based on key
18
Reduce step examples
Word count
  Input format = <word, {1, 1, …}> pairs
  Output format = <word, count> pairs
Text indexing
  Input format = <word, {document IDs}> pairs
  Output format = <word, document ID list> pairs
Log parsing
  E.g., input format = <event type, {parsed records}> pairs
  E.g., output format = <event type, aggregate> pairs
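The matching word-count Reduce step (again my illustration); together with the run_mapreduce helper sketched earlier, the pipeline runs end to end:

```python
def wordcount_reduce(word, counts):
    """Reduce for word count: <word, {1, 1, ...}> -> one <word, count>."""
    yield (word, sum(counts))

docs = ["the quick brown fox", "the lazy dog"]
print(run_mapreduce(docs, wordcount_map, wordcount_reduce))
# [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]
```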
19
More honoured in the breach than in the observance!
20
Sometimes the Reduce step is trivial
MapReduce for data mining:
  Partition on some key
  Calculate a single vector* for each whole partition
  Aggregate the vectors
  Hooray!
*Algorithm-dependent
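For instance, here is a sketch (my example, with the algorithm-dependent vector assumed to be the (sum, count) sufficient statistics for a global mean): the per-partition Map does nearly all the work, and the Reduce is a one-line aggregation.

```python
def mean_map(partition):
    """Per-partition 'vector': the sufficient statistics (sum, count)."""
    yield ("stats", (sum(partition), len(partition)))

def mean_reduce(key, vectors):
    """Trivial Reduce: aggregate the vectors into a global mean."""
    total = sum(s for s, _ in vectors)
    count = sum(c for _, c in vectors)
    yield ("mean", total / count)

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0]]  # one list per partition/node
print(run_mapreduce(partitions, mean_map, mean_reduce))  # [('mean', 3.0)]
```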
21
Sometimes Reduce doesn’t reduce
Tick stream data “cooking” can increase its size by one to two orders of magnitude
Sessionization might just add a column – SessionID – to records
  Or is that a Map step masquerading as a Reduce?
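A sketch of sessionization along those lines (my illustration; the 30-minute timeout is an assumed convention, not from the deck): grouped by user, the Reduce emits exactly as many records as it receives, merely tagging each with a SessionID.

```python
def clicks_map(click):
    userid, ts = click
    yield (userid, ts)

def sessionize_reduce(userid, timestamps, timeout=1800):
    """A Reduce that emits one output record per input record: each
    click keeps its timestamp and merely gains a SessionID, with a new
    session starting after a gap longer than `timeout` seconds."""
    session, last = 0, None
    for ts in sorted(timestamps):
        if last is not None and ts - last > timeout:
            session += 1
        last = ts
        yield (userid, (ts, f"{userid}-{session}"))

clicks = [(1, 0), (1, 60), (1, 4000), (2, 10)]  # (userid, ts) pairs
print(run_mapreduce(clicks, clicks_map, sessionize_reduce))
# [(1, (0, '1-0')), (1, (60, '1-0')), (1, (4000, '1-1')), (2, (10, '2-0'))]
```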
22
Some reasons to integrate SQL and MapReduce
JOINs were invented for a reason
So was SQL:2003
It’s kind of traditional to keep data in an RDBMS
23
Some ways to integrate SQL and MapReduce
A SQL layer built on a MapReduce engine
  E.g., Facebook’s Hive over Hadoop
  But building a DBMS equivalent is hard
MapReduce invoking SQL
SQL invoking MapReduce
  E.g., Aster’s SQL/MR (illustrated below)
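To show the flavor of the last option, here is a hypothetical query in the general style of Aster’s SQL/MR, where a MapReduce function appears directly in the FROM clause; the sessionize function, its parameters, and the clicks table are illustrative assumptions, not documented syntax.

```python
# Hypothetical "SQL invoking MapReduce" in the general style of Aster's
# SQL/MR: a polymorphic MapReduce function sits right in the FROM clause
# and is parallelized per PARTITION BY key. The sessionize function, its
# parameters, and the clicks table are illustrative assumptions.
SESSIONIZE_QUERY = """
SELECT userid, ts, sessionid
FROM sessionize(
    ON clicks
    PARTITION BY userid
    ORDER BY ts
    TIMEOUT(1800)
)
"""

# conn would be a DB-API connection to the cluster (assumed, not shown):
# for row in conn.cursor().execute(SESSIONIZE_QUERY):
#     print(row)
```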
24
To materialize or not to materialize?
DBMS: avoiding intermediate materialization → much better performance
Classic MapReduce: intermediate materialization → query fault-tolerance
How much does query fault-tolerance matter?
  (Query duration) x (Node count) vs. node MTTF
DBMS-style materialization strategies usually win
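Putting rough numbers on that comparison (the cluster sizes, query durations, and three-year MTTF below are my assumptions, not the deck’s):

```python
def expected_failures(query_hours, node_count, mttf_hours):
    """Expected node failures during one query:
    (query duration) x (node count) / (node MTTF)."""
    return query_hours * node_count / mttf_hours

MTTF = 3 * 365 * 24  # assume a node fails every ~3 years on average

print(expected_failures(1, 100, MTTF))    # ~0.004 -- just restart on failure
print(expected_failures(10, 5000, MTTF))  # ~1.9 -- fault-tolerance pays off
```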
25
Other reasons to put your data in a real database
Query response time
General performance
Backup
Security
General administration
SQL syntax
General programmability and connectivity
26
Aspects of Aster’s approach to MapReduce
Data stored in a database
MapReduce execution managed by a DBMS
Flexible MapReduce syntax
MapReduce invoked via SQL
27
Further information
Curt A. Monash, Ph.D.
President, Monash Research
Editor, DBMS2
contact@monash.com
http://www.monash.com
http://www.DBMS2.com