Incremental Recomputations in MapReduce
  Thomas Jörg, University of Kaiserslautern
Motivation MapReduce Program Base data Result data Bigtable / HBase
Motivation View Definition Base data Materialized view
Motivation Incremental MapReduce Program MapReduce Program Base data Result data Bigtable / HBase
Agenda
  Related Work
  Case study
  Incremental view maintenance
  Summary Delta Algorithm
  Conclusion and future work
Related Work
  Caching intermediate results: DryadInc, Incoop
  Incremental programming models: Google Percolator, Continuous bulk processing (CBP)
  References:
    L. Popa et al.: DryadInc: Reusing work in large-scale computations. HotCloud 2009
    P. Bhatotia et al.: Incoop: MapReduce for Incremental Computations. SoCC 2011
    D. Peng and F. Dabek: Large-scale Incremental Processing Using Distributed Transactions and Notifications. OSDI 2010
    D. Logothetis et al.: Stateful Bulk Processing for Incremental Analytics. SoCC 2010
Challenges
  Programming model: SQL / relational algebra vs. MapReduce
  Efficient access paths: No secondary indexes in HBase
  Support for transactions: Only single-row transactions in HBase
Case Study
  Word histograms
  Reverse web-link graphs
  Term-vectors per host
  Count of URL access frequency
  Inverted indexes
  Reference:
    J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004
Computing Reverse Web-Link Graphs
  [Figure: a collection of HTML documents]
Sample Web-Link Graph
  a.htm: <html> <a href="b.htm"> ...</a> </html>
  b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> </html>
Computing Reverse Web-Link Graphs
  Phases: Map, Shuffle, Reduce
  Input a.htm: <html> <a href="b.htm"> ...</a> </html>
  Input b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> </html>
  Map output for a.htm: (b.htm, a.htm), (b.htm, a.htm)
  Map output for b.htm: (a.htm, b.htm), (b.htm, b.htm)
  Reduce output: (b.htm, {a.htm, b.htm}), (a.htm, {b.htm})
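The three phases on this slide can be sketched in plain Python. This is a minimal single-process illustration, not the Hadoop job from the talk: the `PAGES` dictionary, the regular expression, and all function names are assumptions made for the example, and a.htm is assumed to contain two links to b.htm, matching the two (b.htm, a.htm) pairs shown above.

```python
import re
from collections import defaultdict

# Hypothetical in-memory stand-in for the crawled pages.
PAGES = {
    "a.htm": '<html> <a href="b.htm">x</a> <a href="b.htm">y</a> </html>',
    "b.htm": '<html> <a href="a.htm">x</a> <a href="b.htm">y</a> </html>',
}

HREF = re.compile(r'href="([^"]+)"')


def map_links(source, html):
    """Map: emit one (link target, link source) pair per outgoing link."""
    for target in HREF.findall(html):
        yield target, source


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_links(target, sources):
    """Reduce: collapse the sources of one target into a set."""
    return target, set(sources)


def reverse_link_graph(pages):
    pairs = [kv for src, html in pages.items() for kv in map_links(src, html)]
    return dict(reduce_links(t, s) for t, s in shuffle(pairs).items())


result = reverse_link_graph(PAGES)
```

Because the reducer builds a set, the duplicate (b.htm, a.htm) pair disappears from the result, so the output no longer records how many links support each entry.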
Summary Delta Algorithm
  View definition:
    CREATE VIEW Parts AS
    SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
    FROM Orders
    GROUP BY partID
  Summary delta:
    SELECT partID, SUM(revenue) AS revenue, SUM(tplcnt) AS tplcnt
    FROM (
      (SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
       FROM Orders_Insertions
       GROUP BY partID)
      UNION ALL
      (SELECT partID, -SUM(qty*price) AS revenue, -COUNT(*) AS tplcnt
       FROM Orders_Deletions
       GROUP BY partID)
    )
    GROUP BY partID
  References:
    I. S. Mumick et al.: Maintenance of Data Cubes and Summary Tables in a Warehouse. SIGMOD Conference 1997
    W. Labio et al.: Performance Issues in Incremental Warehouse Maintenance. VLDB 2000
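As a rough illustration of the summary delta query, the following Python sketch computes the signed aggregates over insertions and deletions and folds them into the view. The tuple layout `(partID, qty, price)`, the sample rows, and the function names are assumptions for this example. Removing a group once its `tplcnt` reaches zero is one possible installation rule, not something the slide spells out.

```python
from collections import defaultdict


def summarize(orders, sign=1):
    """GROUP BY partID: SUM(qty*price) and COUNT(*), negated when sign is -1."""
    out = defaultdict(lambda: [0, 0])
    for part_id, qty, price in orders:
        out[part_id][0] += sign * qty * price
        out[part_id][1] += sign
    return out


def summary_delta(insertions, deletions):
    """UNION ALL of both signed aggregates, grouped by partID again."""
    delta = summarize(insertions, +1)
    for part_id, (revenue, tplcnt) in summarize(deletions, -1).items():
        delta[part_id][0] += revenue
        delta[part_id][1] += tplcnt
    return dict(delta)


def install(view, delta):
    """Add the delta to the view; drop groups whose tuple count reaches zero."""
    new_view = {k: list(v) for k, v in view.items()}
    for part_id, (revenue, tplcnt) in delta.items():
        row = new_view.setdefault(part_id, [0, 0])
        row[0] += revenue
        row[1] += tplcnt
        if row[1] == 0:
            del new_view[part_id]
    return new_view


# Hypothetical sample data: (partID, qty, price)
orders = [("p1", 2, 10), ("p1", 1, 10), ("p2", 5, 3)]
insertions = [("p2", 1, 3), ("p3", 4, 1)]
deletions = [("p1", 1, 10), ("p2", 5, 3)]

view = dict(summarize(orders))
maintained = install(view, summary_delta(insertions, deletions))
recomputed = dict(summarize([("p1", 2, 10), ("p2", 1, 3), ("p3", 4, 1)]))
```

In this sample the maintained view and the view recomputed from the updated base data agree, which is the property the algorithm relies on.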
Achieving Self-Maintainability
  Phases: Map, Shuffle, Reduce
  Input a.htm: <html> <a href="b.htm"> ...</a> </html>
  Input b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> </html>
  Map output for a.htm: (b.htm, [a.htm, 1]), (b.htm, [a.htm, 1])
  Map output for b.htm: (a.htm, [b.htm, 1]), (b.htm, [b.htm, 1])
  Reduce output: (b.htm, {[a.htm, 2], [b.htm, 1]}), (a.htm, {[b.htm, 1]})
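A small Python sketch may help show why the counts matter. It assumes, as the count of 2 on this slide suggests, that a.htm links to b.htm twice; the pre-extracted link lists and all names are hypothetical. With a count per link source, removing one of the two links still leaves a.htm among the sources of b.htm, which a plain set could not tell from the view alone.

```python
from collections import Counter, defaultdict

# Hypothetical pre-extracted outgoing links per page.
LINKS = {
    "a.htm": ["b.htm", "b.htm"],
    "b.htm": ["a.htm", "b.htm"],
}


def map_links(source, targets):
    """Map: emit (target, [source, 1]) for every outgoing link."""
    for target in targets:
        yield target, (source, 1)


def reduce_counts(pairs):
    """Reduce: sum the counts per link source for each target."""
    view = defaultdict(Counter)
    for target, (source, n) in pairs:
        view[target][source] += n
    return {target: dict(counts) for target, counts in view.items()}


def sources(view, target):
    """Link sources that still have at least one link to the target."""
    return {s for s, n in view.get(target, {}).items() if n > 0}


pairs = [kv for src, targets in LINKS.items() for kv in map_links(src, targets)]
view = reduce_counts(pairs)

# Removing one of the two a.htm -> b.htm links only decrements the count.
view["b.htm"]["a.htm"] -= 1
```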
Sample Web-Link Graph
  a.htm (new version): <html> <a href="b.htm"> <a href="a.htm"> </html>
  a.htm (old version): <html> <a href="b.htm"> ...</a> </html>
  b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> </html>
Summary Delta Algorithm in MapReduce
  Phases: Map, Shuffle, Reduce
  Input a.htm (deleted): <html> <a href="b.htm"> ...</a> </html>
  Input a.htm (inserted): <html> <a href="b.htm"> ...</a> <a href="a.htm"> </html>
  Map output for a.htm (deleted): (b.htm, [a.htm, -1]), (b.htm, [a.htm, -1])
  Map output for a.htm (inserted): (b.htm, [a.htm, +1]), (a.htm, [a.htm, +1])
  Reduce output: (b.htm, {[a.htm, -1]}), (a.htm, {[a.htm, +1]})
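The delta computation on this slide can be sketched as follows, under the assumption that the deleted version of a.htm held two links to b.htm and the inserted version holds one link to b.htm and one to a.htm. The link lists and function names are made up for the example. The mapper tags each link with the sign of its delta set, and the reducer sums the signed counts and drops entries that cancel out.

```python
from collections import Counter, defaultdict


def map_delta(source, targets, sign):
    """Map: emit (target, [source, sign]); sign is -1 for deleted, +1 for inserted pages."""
    for target in targets:
        yield target, (source, sign)


def reduce_delta(pairs):
    """Reduce: sum signed counts per (target, source) and keep non-zero results."""
    sums = defaultdict(Counter)
    for target, (source, n) in pairs:
        sums[target][source] += n
    result = {}
    for target, counts in sums.items():
        nonzero = {s: n for s, n in counts.items() if n != 0}
        if nonzero:
            result[target] = nonzero
    return result


# Hypothetical link lists for the two versions of a.htm.
OLD_A = ["b.htm", "b.htm"]
NEW_A = ["b.htm", "a.htm"]

pairs = list(map_delta("a.htm", OLD_A, -1)) + list(map_delta("a.htm", NEW_A, +1))
delta = reduce_delta(pairs)
```

Only the two affected keys appear in the delta, so the untouched page b.htm is never read.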
Delta Installation Approaches
  Increment Installation: Base deltas → MapReduce → Materialized view
  Overwrite Installation: Base deltas + Materialized view → MapReduce → Materialized view
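One way to read the two approaches is sketched below for a numerical aggregate. This is an interpretation of the diagram, not the talk's implementation: increment installation is modelled as a per-key update of the stored view, overwrite installation as a job that takes the old view and the deltas as input and writes a fresh view. The dictionaries and function names are hypothetical.

```python
def increment_installation(view, deltas):
    """Update the stored view in place, one read-modify-write per affected key."""
    for key, change in deltas.items():
        view[key] = view.get(key, 0) + change
    return view


def overwrite_installation(view, deltas):
    """Aggregate the old view and the deltas together and emit a new view."""
    merged = {}
    for source in (view, deltas):
        for key, value in source.items():
            merged[key] = merged.get(key, 0) + value
    return merged


# Hypothetical word-count view and aggregated deltas.
old_view = {"map": 3, "reduce": 1}
deltas = {"map": -1, "delta": 2}

by_overwrite = overwrite_installation(old_view, deltas)
by_increment = increment_installation(dict(old_view), deltas)
```

Both variants end in the same view. They differ in cost: the increment variant touches only the keys present in the deltas, while the overwrite variant reads the whole view.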
Case Study – Lessons Learned
  Numerical aggregation: Word histogram, URL access frequency
  Set aggregation: Reverse web-link graph, Inverted index
  Multiset aggregation: Term-vector per host
General Solution
  Self-maintainable aggregates, computed in three steps:
    Translation
    Grouping
    Aggregation
  Aggregation function: commutative and associative binary function with inverse elements (Abelian group)
Case Study – Lessons Learned
  Numerical aggregation: Word histogram, URL access frequency
    Translation function: Translate web pages into (word, 1)
    Aggregation function: Abelian group (Natural numbers, +)
  Set aggregation: Reverse web-link graph, Inverted index
    Translation function: Translate web pages into (link target, link source)
    Aggregation function: Abelian group (Power-multiset of URLs, multiset union)
  Multiset aggregation: Term-vector per host
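The translation, grouping, and aggregation steps can be put into one generic Python sketch that is parameterised by a translation function and a group. Everything here is an assumption for illustration: the `AbelianGroup` record, the helper names, and the sample pages are hypothetical. Because deletions contribute inverse elements, the sketch uses signed integers and multisets with signed multiplicities as its two groups.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class AbelianGroup:
    identity: Any
    combine: Callable[[Any, Any], Any]  # commutative and associative
    inverse: Callable[[Any], Any]


def aggregate(records, translate, group, sign=+1):
    """Translate each record to (key, value) pairs, group by key, and aggregate."""
    out = {}
    for record in records:
        for key, value in translate(record):          # translation
            if sign < 0:
                value = group.inverse(value)          # deletions use inverses
            current = out.get(key, group.identity)    # grouping by key
            out[key] = group.combine(current, value)  # aggregation
    return out


def install(view, delta, group):
    """Combine view and delta per key; drop keys that fall back to the identity."""
    merged = dict(view)
    for key, value in delta.items():
        merged[key] = group.combine(merged.get(key, group.identity), value)
    return {k: v for k, v in merged.items() if v != group.identity}


def _multiset_combine(a, b):
    total = Counter(a)
    total.update(b)  # update() adds counts and keeps negative ones
    return Counter({k: n for k, n in total.items() if n != 0})


INTEGERS = AbelianGroup(0, lambda a, b: a + b, lambda a: -a)
SIGNED_MULTISETS = AbelianGroup(
    Counter(), _multiset_combine, lambda a: Counter({k: -n for k, n in a.items()})
)


def words(page):
    """Word histogram: translate a page into (word, 1) pairs."""
    return [(w, 1) for w in page.split()]


def links(page):
    """Reverse web-link graph: translate (source, targets) into (target, {source})."""
    source, targets = page
    return [(t, Counter({source: 1})) for t in targets]


# Word histogram: delete the page "b c", insert the page "c d".
hist = aggregate(["a b a", "b c"], words, INTEGERS)
hist = install(hist, aggregate(["b c"], words, INTEGERS, -1), INTEGERS)
hist = install(hist, aggregate(["c d"], words, INTEGERS, +1), INTEGERS)

# Reverse web-link graph: replace the old a.htm with a new version.
old_a, new_a = ("a.htm", ["b.htm", "b.htm"]), ("a.htm", ["b.htm", "a.htm"])
b_page = ("b.htm", ["a.htm", "b.htm"])
graph = aggregate([old_a, b_page], links, SIGNED_MULTISETS)
graph = install(graph, aggregate([old_a], links, SIGNED_MULTISETS, -1), SIGNED_MULTISETS)
graph = install(graph, aggregate([new_a], links, SIGNED_MULTISETS, +1), SIGNED_MULTISETS)
```

In both cases the maintained view matches a full recomputation over the updated pages, and only the aggregation group and the translation function change between the two use cases.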
Evaluation y-axis: Elapsed time [min] x-axis: Updates in base documents [%]
Conclusion & Future Work
  View Maintenance in MapReduce
    Case study
    Summary delta algorithm
    Self-maintainable aggregations
  Future Work
    Broader class of MapReduce programs
    High-level MapReduce languages, e.g. Jaql or Pig Latin