1
Incremental Recomputations in MapReduce
Thomas Jörg, University of Kaiserslautern
2
Motivation
[Diagram: Base data → MapReduce Program → Result data (Bigtable / HBase)]
3
Motivation
[Diagram: Base data → View Definition → Materialized view]
4
Motivation
[Diagram labels: incremental MapReduce Program, MapReduce Program, Base data, Result data, Bigtable / HBase]
5
Agenda
Related Work
Case study
Incremental view maintenance
Summary Delta Algorithm
Conclusion and future work
6
Related Work
Caching intermediate results: DryadInc, Incoop
Incremental programming models: Google Percolator, Continuous bulk processing (CBP)
L. Popa et al.: DryadInc: Reusing work in large-scale computations. HotCloud 2009
P. Bhatotia et al.: Incoop: MapReduce for Incremental Computations. SoCC 2011
D. Peng and F. Dabek: Large-scale Incremental Processing Using Distributed Transactions and Notifications. OSDI 2010
D. Logothetis et al.: Stateful Bulk Processing for Incremental Analytics. SoCC 2010
7
Challenges
Programming model: SQL / relational algebra vs. MapReduce
Efficient access paths: No secondary indexes in HBase
Support for transactions: Only single-row transactions in HBase
8
Case Study
Word histograms
Reverse web-link graphs
Term-vectors per host
Count of URL access frequency
Inverted indexes
J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004
9
Computing Reverse Web-Link Graphs
[Figure: a collection of web pages, each drawn as <html> ... </html>]
10
Sample Web-Link Graph
a.htm: <html> <a href="b.htm"> ...</a> </html>
b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> </html>
11
Computing Reverse Web-Link Graphs
Map:
  a.htm: <html> <a href="b.htm"> ...</a> </html>
    emits (b.htm, a.htm), (b.htm, a.htm)
  b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> </html>
    emits (a.htm, b.htm), (b.htm, b.htm)
Shuffle: pairs are grouped by key (the link target)
Reduce:
  b.htm, {a.htm, b.htm}
  a.htm, {b.htm}
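The map, shuffle, and reduce steps on this slide can be sketched in plain Python, without a MapReduce framework. This is a minimal illustration, not the talk's implementation: the function names (`map_links`, `shuffle`, `reduce_sources`) are hypothetical, and the sample page a.htm is assumed to link to b.htm twice, as the two identical map outputs on the slide suggest.

```python
from collections import defaultdict
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.targets.append(value)


def map_links(source, html):
    """Map: emit one (link target, link source) pair per link."""
    parser = LinkExtractor()
    parser.feed(html)
    return [(target, source) for target in parser.targets]


def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_sources(target, sources):
    """Reduce: collapse all sources of one target into a set."""
    return target, set(sources)


# Assumed sample input: a.htm links to b.htm twice.
pages = {
    "a.htm": '<html><a href="b.htm">x</a><a href="b.htm">y</a></html>',
    "b.htm": '<html><a href="a.htm">x</a><a href="b.htm">y</a></html>',
}
pairs = [p for name, html in pages.items() for p in map_links(name, html)]
reverse_graph = dict(reduce_sources(t, s) for t, s in shuffle(pairs).items())
```

Because the reducer builds a set, the duplicate link from a.htm to b.htm leaves no trace in the result.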
12
Summary Delta Algorithm
View definition:
CREATE VIEW Parts AS
SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
FROM Orders
GROUP BY partID

Summary delta:
SELECT partID, SUM(revenue) AS revenue, SUM(tplcnt) AS tplcnt
FROM (
  (SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
   FROM Orders_Insertions
   GROUP BY partID)
  UNION ALL
  (SELECT partID, -SUM(qty*price) AS revenue, -COUNT(*) AS tplcnt
   FROM Orders_Deletions
   GROUP BY partID)
)
GROUP BY partID

I. S. Mumick et al.: Maintenance of Data Cubes and Summary Tables in a Warehouse. SIGMOD Conference 1997
W. Labio et al.: Performance Issues in Incremental Warehouse Maintenance. VLDB 2000
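To see the summary delta at work, the following sketch runs both queries in an in-memory SQLite database and checks the refreshed view against a full recomputation. The table contents, the `materialize` helper, and the refresh loop are assumptions made for illustration; the view is held in a Python dict rather than a database table. SQLite does not accept parentheses around the branches of a UNION ALL, so they are omitted here.

```python
import sqlite3

VIEW_QUERY = """
SELECT partID, SUM(qty * price) AS revenue, COUNT(*) AS tplcnt
FROM Orders GROUP BY partID
"""

# Same delta query as on the slide, without parentheses around the branches.
DELTA_QUERY = """
SELECT partID, SUM(revenue) AS revenue, SUM(tplcnt) AS tplcnt
FROM (
    SELECT partID, SUM(qty * price) AS revenue, COUNT(*) AS tplcnt
    FROM Orders_Insertions GROUP BY partID
    UNION ALL
    SELECT partID, -SUM(qty * price) AS revenue, -COUNT(*) AS tplcnt
    FROM Orders_Deletions GROUP BY partID
)
GROUP BY partID
"""


def materialize(db):
    """Evaluate the view definition from scratch."""
    return {pid: [rev, cnt] for pid, rev, cnt in db.execute(VIEW_QUERY)}


db = sqlite3.connect(":memory:")
for table in ("Orders", "Orders_Insertions", "Orders_Deletions"):
    db.execute(f"CREATE TABLE {table} (partID INTEGER, qty INTEGER, price INTEGER)")

# Hypothetical base data and the initial materialized view.
db.executemany("INSERT INTO Orders VALUES (?, ?, ?)", [(1, 2, 10), (1, 1, 10), (2, 5, 3)])
view = materialize(db)

# Hypothetical change set: two new orders, one removed order.
db.executemany("INSERT INTO Orders_Insertions VALUES (?, ?, ?)", [(1, 4, 10), (3, 1, 7)])
db.executemany("INSERT INTO Orders_Deletions VALUES (?, ?, ?)", [(2, 5, 3)])

# Refresh: add each delta row to the view; drop groups whose tuple count hits zero.
for pid, rev, cnt in db.execute(DELTA_QUERY).fetchall():
    row = view.setdefault(pid, [0, 0])
    row[0] += rev
    row[1] += cnt
    if row[1] == 0:
        del view[pid]

# Cross-check: apply the changes to the base table and recompute the view.
db.execute("INSERT INTO Orders SELECT * FROM Orders_Insertions")
db.execute("DELETE FROM Orders WHERE partID = 2 AND qty = 5 AND price = 3")
recomputed = materialize(db)
```

The `tplcnt` column is what lets the refresh step notice that a group has become empty: part 2 disappears from the view once its only order is deleted.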
13
Computing Reverse Web-Link Graphs
Map:
  a.htm: <html> <a href="b.htm"> ...</a> </html>
    emits (b.htm, a.htm), (b.htm, a.htm)
  b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> </html>
    emits (a.htm, b.htm), (b.htm, b.htm)
Shuffle: pairs are grouped by key (the link target)
Reduce:
  b.htm, {a.htm, b.htm}
  a.htm, {b.htm}
14
Achieving Self-Maintainability
Map:
  a.htm: <html> <a href="b.htm"> ...</a> </html>
    emits (b.htm, [a.htm, 1]), (b.htm, [a.htm, 1])
  b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> </html>
    emits (a.htm, [b.htm, 1]), (b.htm, [b.htm, 1])
Shuffle: pairs are grouped by key (the link target)
Reduce:
  b.htm, {[a.htm, 2], [b.htm, 1]}
  a.htm, {[b.htm, 1]}
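A plain set cannot tell whether a.htm should stay in b.htm's entry after one of its links is removed, because a second link may remain. Carrying a count per source resolves this. The sketch below reproduces the counted variant on the slide; `map_counted` and `reduce_counted` are hypothetical names, and the link lists are assumed sample data in which a.htm links to b.htm twice.

```python
from collections import Counter, defaultdict


def map_counted(source, targets):
    """Map: emit (target, (source, 1)) for every link on the page."""
    return [(target, (source, 1)) for target in targets]


def reduce_counted(values):
    """Reduce: sum the counts per source, so two (a.htm, 1) become a.htm: 2."""
    counts = Counter()
    for source, n in values:
        counts[source] += n
    return dict(counts)


# Assumed sample data: outgoing links per page.
links = {"a.htm": ["b.htm", "b.htm"], "b.htm": ["a.htm", "b.htm"]}

# Shuffle: group the emitted values by link target.
groups = defaultdict(list)
for source, targets in links.items():
    for key, value in map_counted(source, targets):
        groups[key].append(value)

view = {target: reduce_counted(values) for target, values in groups.items()}
```

The plain reverse web-link graph is recovered by keeping only the keys of each inner dict.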
15
Sample Web-Link Graph
a.htm (new version): <html> <a href="b.htm"> <a href="a.htm"> </html>
a.htm (old version): <html> <a href="b.htm"> ...</a> </html>
b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> </html>
16
Summary Delta Algorithm in MapReduce
Map:
  a.htm (deleted): <html> <a href="b.htm"> ...</a> </html>
    emits (b.htm, [a.htm, -1]), (b.htm, [a.htm, -1])
  a.htm (inserted): <html> <a href="b.htm"> ...</a> <a href="a.htm"> </html>
    emits (b.htm, [a.htm, +1]), (a.htm, [a.htm, +1])
Shuffle: pairs are grouped by key (the link target)
Reduce:
  b.htm, {[a.htm, -1]}
  a.htm, {[a.htm, +1]}
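The delta computation on this slide treats an update as a deletion of the old page version followed by an insertion of the new one. A minimal Python sketch follows; the names `map_delta` and `reduce_delta` are hypothetical, and the old version of a.htm is assumed to contain two links to b.htm, matching the two -1 pairs on the slide.

```python
from collections import defaultdict


def map_delta(source, targets, sign):
    """Map: emit (target, (source, sign)); sign is -1 for deleted, +1 for inserted pages."""
    return [(target, (source, sign)) for target in targets]


def reduce_delta(values):
    """Reduce: sum the signed counts per source and drop sources that cancel out."""
    totals = defaultdict(int)
    for source, n in values:
        totals[source] += n
    return {source: n for source, n in totals.items() if n != 0}


# Assumed sample update of a.htm: one link to b.htm is replaced by a link to a.htm.
deleted = ("a.htm", ["b.htm", "b.htm"])   # old version
inserted = ("a.htm", ["b.htm", "a.htm"])  # new version

# Shuffle: group the signed pairs by link target.
groups = defaultdict(list)
for key, value in map_delta(*deleted, -1) + map_delta(*inserted, +1):
    groups[key].append(value)

view_delta = {target: reduce_delta(values) for target, values in groups.items()}
```

Only the changed page versions are read; b.htm is never touched, yet the delta describes exactly how the stored counts must change.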
17
Delta Installation Approaches
Increment Installation: Base deltas → MapReduce → Materialized view
Overwrite Installation: Base deltas + Materialized view → MapReduce → Materialized view
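One possible reading of the two approaches, sketched in Python under stated assumptions: with increment installation the job reads only the base deltas and adds its output to the stored view cell by cell, whereas with overwrite installation the job reads the deltas together with the old view and writes a complete replacement. The nested dicts stand in for the view table, and both function names are hypothetical.

```python
import copy


def increment_installation(view, view_delta):
    """Add each delta cell to the stored view in place."""
    for target, sources in view_delta.items():
        row = view.setdefault(target, {})
        for source, n in sources.items():
            row[source] = row.get(source, 0) + n
            if row[source] == 0:
                del row[source]      # no links left from this source
        if not row:
            del view[target]         # no incoming links left at all


def overwrite_installation(view, view_delta):
    """Combine the old view and the deltas into a complete new view."""
    merged = {}
    for part in (view, view_delta):
        for target, sources in part.items():
            row = merged.setdefault(target, {})
            for source, n in sources.items():
                row[source] = row.get(source, 0) + n
    new_view = {}
    for target, row in merged.items():
        kept = {source: n for source, n in row.items() if n != 0}
        if kept:
            new_view[target] = kept
    return new_view


# Assumed sample state: the counted view and a delta for an update of a.htm.
view = {"b.htm": {"a.htm": 2, "b.htm": 1}, "a.htm": {"b.htm": 1}}
view_delta = {"b.htm": {"a.htm": -1}, "a.htm": {"a.htm": 1}}

incremented = copy.deepcopy(view)
increment_installation(incremented, view_delta)
overwritten = overwrite_installation(view, view_delta)
```

Both paths end in the same view. They differ in how much data is read and written: the increment path touches only the changed cells, while the overwrite path rewrites every row.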
18
Case Study – Lessons Learned
Numerical aggregation: Word histogram, URL access frequency
Set aggregation: Reverse web-link graph, Inverted index
Multiset aggregation: Term-vector per host
19
General Solution
Self-maintainable aggregates
Computed in three steps: Translation, Grouping, Aggregation
Aggregation: commutative and associative binary function with inverse elements (Abelian group)
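The three steps can be written once and parameterized by a translation function and an Abelian group. The sketch below is an illustration under assumptions, not the talk's implementation: `AbelianGroup`, `aggregate`, and `maintain` are hypothetical names, and the example uses a word histogram over the integers under addition so that every count has an inverse.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class AbelianGroup:
    identity: Any
    combine: Callable[[Any, Any], Any]  # commutative and associative
    inverse: Callable[[Any], Any]


def aggregate(records, translate, group):
    """Translation, grouping, and aggregation over an iterable of records."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in translate(record):   # step 1: translation
            grouped[key].append(value)         # step 2: grouping
    result = {}
    for key, values in grouped.items():        # step 3: aggregation
        total = group.identity
        for value in values:
            total = group.combine(total, value)
        result[key] = total
    return result


def maintain(view, inserted, deleted, translate, group):
    """Fold insertions and inverted deletions into an existing view."""
    def translate_inverse(record):
        return [(key, group.inverse(value)) for key, value in translate(record)]

    new_view = dict(view)
    for delta in (aggregate(inserted, translate, group),
                  aggregate(deleted, translate_inverse, group)):
        for key, value in delta.items():
            new_view[key] = group.combine(new_view.get(key, group.identity), value)
    # Keys whose aggregate equals the identity carry no information.
    return {key: value for key, value in new_view.items() if value != group.identity}


def words(page):
    """Translation function for a word histogram: page -> (word, 1) pairs."""
    return [(word, 1) for word in page.split()]


integers = AbelianGroup(identity=0, combine=lambda a, b: a + b, inverse=lambda a: -a)

view = aggregate(["map reduce map", "reduce"], words, integers)
updated = maintain(view, inserted=["map shuffle"], deleted=["reduce"],
                   translate=words, group=integers)
recomputed = aggregate(["map reduce map", "map shuffle"], words, integers)
```

The inverse is what makes deletions expressible: a deleted record contributes the inverse of what it contributed when it was inserted, so the view can be maintained without rereading the unchanged base data.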
20
Case Study – Lessons Learned
Numerical aggregation: Word histogram, URL access frequency
  Translation function: Translate web pages into (word, 1)
  Aggregation function: Abelian group (Natural numbers, +)
Set aggregation: Reverse web-link graph, Inverted index
  Translation function: Translate web pages into (link target, link source)
  Aggregation function: Abelian group (Power-multiset of URLs, multiset union)
Multiset aggregation: Term-vector per host
21
Evaluation
[Figure: elapsed time [min] (y-axis) against updates in base documents [%] (x-axis)]
22
Conclusion & Future Work
View Maintenance in MapReduce: Case study, Summary delta algorithm, Self-maintainable aggregations
Future Work: Broader class of MapReduce programs; High-level MapReduce languages, e.g. Jaql or Pig Latin