MapReduce, with a SQL-MapReduce focus. By Curt A. Monash, Ph.D., President, Monash Research; Editor, DBMS2
Curt Monash: analyst since 1981; covered DBMS since the pre-relational days; also analytics, search, etc. Publicly available research: blogs, including DBMS2. User and vendor consulting.
Agenda Introduction and truisms MapReduce overview MapReduce specifics SQL and MapReduce together
Monash’s First Law of Commercial Semantics Bad jargon drives out good For example: “Relational”, “Parallel”, “MapReduce”
Where to measure database technology Language interpretation and execution capabilities Functionality Speed Administrative capabilities How well it all works Fit and finish Reliability How much it all – really – costs You can do anything in 0s and 1s … but how much effort will it actually take?
What’s hard about parallelization* Getting the right data … … to the right nodes … … at the right time … … while dealing with errors … … and without overloading the network Otherwise, programming a grid is a lot like programming a single node. *in general -- not just for “database” technology
MPP DBMS are good at parallelization … … under three assumptions, namely: You can express the job nicely in SQL … ... or whatever other automatically-parallel languages the DBMS offers You don’t really need query fault-tolerance … … which is usually the case unless you have 1000s of nodes There’s enough benefit to storing the data in tables to justify the overhead
SQL commonly gets frustrating … … when you’re dealing with sequences of events or relationships, because: Self-joins are expensive Programming is hard when you’re not sure how long the sequence is For example: Clickstreams Financial data time series Social network graph analysis
The pure MapReduce alternative Lightweight approach to parallelization The only absolute requirement is a certain simple programming model … … so simple that parallelization is “automatic” … … and very friendly to procedural languages It doesn’t require a DBMS on the back end No SQL required! Non-DBMS implementations commonly have query fault-tolerance But you have to take care of optimizing data redistribution yourself
MapReduce evolution Used under-the-covers for quite a while Named and popularized by Google Open-sourced in Hadoop Widely adopted by big web companies Integrated (at various levels) into MPP RDBMS Adopted for social network analysis Explored/investigated for data mining applications ???
M/R use cases -- large-scale ETL: text indexing (this is how Google introduced the MapReduce concept), time series disaggregation, clickstream sessionization and analytics, stock trade pattern identification, relationship graph traversal
M/R use cases – hardcore arithmetic Statistical routines Data “cooking”
The essence of MapReduce “Map” steps Data redistribution “Reduce” steps In strict alternation … … or not-so-strict
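To make that cycle concrete, here is a minimal in-memory sketch of the Map, redistribution, and Reduce phases. It is purely illustrative: run_mapreduce, map_fn, and reduce_fn are names made up for this example, not any particular framework's API.

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map: each input record yields zero or more (key, value) pairs
    mapped = []
    for record in records:
        mapped.extend(map_fn(record))
    # Redistribution ("shuffle"): group every value under its key -- on a real
    # cluster, this is the step that moves data across the network
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: collapse each key's value list, usually to a single output pair
    return [reduce_fn(key, values) for key, values in groups.items()]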
“Map” step basics (reality) Input = anything (a set of data, or the output of a previous Reduce step) Output = anything for which there’s an obvious key
Map step basics (formality) Input = {<key, value> pairs} Output = {<key, value> pairs} Input and output key types don’t have to be the same “Embarrassingly parallel” based on key
Map step examples Word count: input format = document/text string, output format = <word, 1> pairs Text indexing: input format = document/text string, output format = <word, document ID> pairs Log parsing: input format = log file, output format = <key, value> pairs extracted from each log record
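As a sketch of how such Map steps might be written (the function names are hypothetical; the <word, 1> output follows the classic word-count example):

# Word-count Map step: document text in, <word, 1> pairs out
def wordcount_map(document_text):
    return [(word.lower(), 1) for word in document_text.split()]

# Text-indexing Map step: emit <word, doc_id> for each distinct word
def index_map(doc_id, document_text):
    return [(word.lower(), doc_id) for word in set(document_text.split())]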
Reduce step basics Input = {<key, value> pairs}, where all the keys are equal Output = {<key, value> pairs}, where the set commonly has cardinality = 1 Input and output key types don’t have to be the same Just like Map, “embarrassingly parallel” based on key
Reduce step examples Word count: input format = <word, {1, 1, …}>, output format = <word, count> Text indexing: input format = <word, {document IDs}>, output format = <word, list of document IDs> Log parsing: input and output formats are <key, value> pairs that depend on what is being extracted from the log
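And, continuing the sketch, matching Reduce steps (hypothetical names; the word-count reducer simply sums the 1s the mapper emitted):

# Word-count Reduce step: <word, {1, 1, ...}> in, one <word, count> out
def wordcount_reduce(word, counts):
    return (word, sum(counts))

# Text-indexing Reduce step: <word, {doc IDs}> in, <word, sorted doc-ID list> out
def index_reduce(word, doc_ids):
    return (word, sorted(set(doc_ids)))

# Wired into the run_mapreduce sketch from earlier:
# run_mapreduce(["the cat sat", "the dog sat"], wordcount_map, wordcount_reduce)
# -> [("the", 2), ("cat", 1), ("sat", 2), ("dog", 1)]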
More honoured in the breach than in the observance!
Sometimes the Reduce step is trivial MapReduce for data mining Partition on some key Calculate a single vector* for each whole partition Aggregate the vectors Hooray! *Algorithm-dependent
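One hedged way the "single vector per partition" pattern could look, using per-partition row counts and column sums as a stand-in for whatever vector the algorithm actually needs:

# "Trivial Reduce" for data mining (sketch): the Map side emits one summary
# vector per partition; the Reduce side just adds the vectors together.
# Which vector to compute (here: row count plus per-column sums) is
# algorithm-dependent.
def partition_map(partition_key, rows):
    n_cols = len(rows[0])
    sums = [sum(row[i] for row in rows) for i in range(n_cols)]
    return [("ALL", [len(rows)] + sums)]     # one vector per partition

def aggregate_reduce(key, vectors):
    totals = [sum(parts) for parts in zip(*vectors)]
    return (key, totals)                     # global count/sums -> means, etc.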
Sometimes Reduce doesn’t reduce Tick stream data “cooking” can increase its size by one to two orders of magnitude Sessionization might just add a column – SessionID – to records Or is that a Map step masquerading as a Reduce?
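For instance, a sessionization Reduce might partition on user and merely append a SessionID, as in this sketch; the 30-minute inactivity gap and the field names are assumptions for illustration:

GAP_SECONDS = 30 * 60   # assumed inactivity threshold between sessions

# Sessionization as a Reduce step (sketch): all of one user's clicks arrive
# together; the output is the same rows with a session_id column added,
# so nothing actually gets "reduced".
def sessionize_reduce(user_id, clicks):
    clicks = sorted(clicks, key=lambda c: c["timestamp"])
    session, last_ts, out = 0, None, []
    for click in clicks:
        if last_ts is not None and click["timestamp"] - last_ts > GAP_SECONDS:
            session += 1
        out.append({**click, "session_id": f"{user_id}-{session}"})
        last_ts = click["timestamp"]
    return (user_id, out)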
Some reasons to integrate SQL and MapReduce JOINs were invented for a reason So was SQL:2003 It’s kind of traditional to keep data in an RDBMS
Some ways to integrate SQL and MapReduce A SQL layer built on a MapReduce engine E.g., Facebook’s Hive over Hadoop But building a DBMS-equivalent is hard MapReduce invoking SQL SQL invoking MapReduce Aster’s SQL M/R
To materialize or not to materialize? DBMS avoidance of intermediate materialization → much better performance Classic MapReduce intermediate materialization → query fault-tolerance How much does query fault-tolerance matter? (Query duration) x (Node count) vs. node MTTF DBMS-style materialization strategies usually win
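A back-of-the-envelope version of that comparison, with purely illustrative numbers (the MTTF figure is an assumption, not a measured value):

# Rough check of (query duration) x (node count) vs. node MTTF.
# All numbers are illustrative assumptions.
query_hours = 1.0                   # one long-running query
node_count = 100                    # nodes it runs across
node_mttf_hours = 3 * 365 * 24      # assume a node fails roughly every 3 years

expected_failures = query_hours * node_count / node_mttf_hours
print(round(expected_failures, 4))  # ~0.0038 -- a mid-query node failure is
                                    # rare enough that rerunning the occasional
                                    # failed query is usually cheaper than
                                    # paying for query fault-tolerance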
Other reasons to put your data in a real database Query response time General performance Backup Security General administration SQL syntax General programmability and connectivity
Aspects of Aster’s approach to MapReduce Data stored in a database MapReduce execution managed by a DBMS Flexible MapReduce syntax MapReduce invoked via SQL
Further information: Curt A. Monash, Ph.D., President, Monash Research; Editor, DBMS2