Page 1 A Platform for Scalable One-pass Analytics using MapReduce Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy SIGMOD 2011 IDS Fall Seminar 2011.

A Platform for Scalable One-pass Analytics using MapReduce Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy SIGMOD 2011 IDS Fall Seminar 2011. 09. 30. Presented by Yang Byoung Ju

One-pass algorithm
▶ An algorithm that reads its input exactly once, without unbounded buffering
▶ Generally requires O(n) time and less than O(n) storage
▶ Example problems solvable by a one-pass algorithm
  - Find the K largest elements
  - Find the sum, mean, and variance of the elements of a list
  - Find the most or least frequent elements
▶ Example problems not solvable by a one-pass algorithm
  - Find the middle element of the list
  - Sort the list
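The solvable examples above can be illustrated with a minimal sketch. It assumes a numeric stream and uses Welford's running-variance update plus a bounded min-heap; the function name `one_pass_stats` and its parameters are hypothetical, chosen only for this illustration.

```python
import heapq
from typing import Iterable, List, Tuple


def one_pass_stats(stream: Iterable[float], k: int) -> Tuple[float, float, float, List[float]]:
    """Return (sum, mean, population variance, k largest), reading the stream once."""
    count = 0
    total = 0.0
    mean = 0.0
    m2 = 0.0                  # running sum of squared deviations (Welford's method)
    top_k: List[float] = []   # min-heap that never holds more than k elements

    for x in stream:
        count += 1
        total += x
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)

        if len(top_k) < k:
            heapq.heappush(top_k, x)
        elif top_k and x > top_k[0]:
            # x beats the smallest of the current k largest, so swap it in
            heapq.heapreplace(top_k, x)

    variance = m2 / count if count else 0.0
    return total, mean, variance, sorted(top_k, reverse=True)
```

The state kept between elements is a handful of numbers plus a heap of size K, so storage stays below O(n) however long the stream is. Sorting the whole list cannot be done this way, since every element would have to be retained.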

Introduction
▶ Real-time analytics using incremental one-pass processing requires the ability to collect and analyze enormous datasets efficiently
▶ However, MapReduce is not well suited to incremental one-pass analytics because it is designed for batch processing
▶ In addition, MapReduce's mechanism for parallel processing, which is based on a sort-merge technique, is subject to significant CPU and I/O bottlenecks
▶ This paper introduces a new platform that (1) reads input data only once, (2) performs incremental processing as more data is read, and (3) utilizes system resources efficiently to achieve high performance and scalability

MapReduce Review

Benchmarking results of Hadoop
▶ 'Click stream' sessionization
  Metric (Session.):
  - Input: 256 GB
  - Map output: 269 GB
  - Reduce spill: 370 GB
  - Reduce output: 256 GB
  - Running time: 4860 sec
Figure panels: (a) Hadoop: Task timeline; (b) Hadoop: CPU utilization; (c) Hadoop: CPU iowait

Benchmarking results of Hadoop
▶ The sorting step of sort-merge incurs high CPU cost
▶ The multi-pass merge in sort-merge is blocking and can incur high I/O cost given substantial intermediate data
▶ Using extra storage devices and alternative storage architectures does not eliminate blocking or the I/O bottleneck
▶ The Hadoop Online Prototype with pipelining does not eliminate blocking, the I/O bottleneck, or the CPU bottleneck

A new hash-based platform
▶ This paper proposes a new data analysis platform that transforms MapReduce computation into incremental one-pass processing
  - The MapReduce model: "Group data by key, then apply the reduce function to each group"
▶ The first mechanism replaces the widely used sort-merge implementation for partitioning and parallel processing with a purely hash-based framework, to minimize computational and I/O bottlenecks as well as blocking
▶ The second mechanism brings the benefits of fast in-memory processing by identifying popular keys

1. A basic Hash Technique (MR-Hash)
▶ MR-hash exactly matches the current MapReduce model, which collects all the values of the same key into a list and feeds the entire list to the reduce function
▶ Map side: avoids the CPU cost of sorting
▶ Reduce side: allows early answers to be returned from Bucket 1
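The idea of grouping by hash rather than by sort can be sketched as follows. This is an in-memory illustration only, not the paper's implementation: it omits the on-disk buckets entirely, and the function names `partition_map_output` and `hash_group_reduce` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Pair = Tuple[Hashable, object]


def partition_map_output(pairs: Iterable[Pair], num_reducers: int) -> List[List[Pair]]:
    """Map side: route each pair to a reducer by hashing its key, with no sorting."""
    partitions: List[List[Pair]] = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers].append((key, value))
    return partitions


def hash_group_reduce(pairs: Iterable[Pair],
                      reduce_fn: Callable[[Hashable, list], object]) -> Dict[Hashable, object]:
    """Reduce side: collect all values of a key into a list, then call reduce_fn on it."""
    groups: Dict[Hashable, list] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}
```

Usage: partition the map output, then run `hash_group_reduce(partition, lambda k, vs: sum(vs))` on each partition. Every pair for a given key lands in the same partition, so each reducer sees the complete value list for its keys without any sort step.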

2. An Incremental Hash Technique (INC-Hash)
▶ Designed for reduce functions that permit incremental processing (simple aggregates such as sum and count, sublinear-space algorithms)
▶ init( ) reduces the amount of data output from the mapper
▶ The reducer only needs to hold collapsed, compact state
▶ Query answers can be derived as soon as the relevant data is available
  - init( ): reduces a sequence of data items to a state
  - cb( ): reduces a sequence of states to a state
  - fn( ): produces a final answer from a state
Figure (Map to Reduce data flow for key a): inputs (a,3) (a,5) (a,4) (a,1) (a,2); init( ) yields the states (a,10-3) and (a,5-2); cb( ) yields (a,15-5); fn( ) yields (a,3)
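The three functions can be sketched for an average, which is one reading of the slide's example values: a state such as (a,10-3) appears to be a running sum of 10 over 3 items, and the final (a,3) the average 15/5. That interpretation, the way the inputs are grouped in the usage note, and the helper `inc_hash_reduce` are assumptions made for this illustration.

```python
from typing import Dict, Hashable, Iterable, Tuple

State = Tuple[float, int]  # (running sum, item count); assumed state layout


def init(values: Iterable[float]) -> State:
    """Reduce a sequence of data items to a state."""
    total, count = 0, 0
    for v in values:
        total += v
        count += 1
    return (total, count)


def cb(states: Iterable[State]) -> State:
    """Reduce a sequence of states to a single state."""
    total, count = 0, 0
    for s, c in states:
        total += s
        count += c
    return (total, count)


def fn(state: State) -> float:
    """Produce the final answer (here, the average) from a state."""
    total, count = state
    if count == 0:
        raise ValueError("cannot average an empty state")
    return total / count


def inc_hash_reduce(incoming: Iterable[Tuple[Hashable, State]]) -> Dict[Hashable, State]:
    """Keep one compact state per key in a hash table, folding in states as they arrive."""
    table: Dict[Hashable, State] = {}
    for key, state in incoming:
        table[key] = cb([table[key], state]) if key in table else state
    return table
```

For example, `init([3, 5, 2])` gives `(10, 3)` and `init([4, 1])` gives `(5, 2)`; `cb` merges them into `(15, 5)`, and `fn` returns `3.0`. The reducer never holds the raw value list, only one small state per key.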

3. A Dynamic Incremental Hash (DINC-Hash)
▶ Dynamically determines which keys should be processed in memory and which keys should be written to disk
▶ Greater I/O efficiency: hot keys are kept in memory
▶ Faster query answers: hot keys are usually the more important ones
Flowchart (each hash table slot j holds a counter c[j], a key k[j], and a state s[j]; initially, all c = 0):
  - For a new (k,s): if k exists in the hash table, increase its c and update its s
  - Otherwise, if c[j] = 0 for some j, set (c[j],k[j],s[j]) to (1,k,s)
  - Otherwise, write (k,s) to disk and decrement c[j] for all j
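The counter-based rule on this slide can be sketched as a small class. Several details are assumptions not stated on the slide: the class name `HotKeyTable`, the use of a Python list as a stand-in for disk, treating unused slots as slots with c = 0, and spilling an evicted key's old state so that no data is lost.

```python
from typing import Callable, Dict, Hashable, List, Tuple


class HotKeyTable:
    """Keeps up to `capacity` keys in memory; other (key, state) pairs are spilled."""

    def __init__(self, capacity: int, combine: Callable[[object, object], object]):
        self.capacity = capacity
        self.combine = combine                               # merges two states into one
        self.slots: Dict[Hashable, list] = {}                # key -> [counter, state]
        self.spilled: List[Tuple[Hashable, object]] = []     # stand-in for disk

    def add(self, key: Hashable, state: object) -> None:
        slot = self.slots.get(key)
        if slot is not None:
            # Key is already monitored: increase c and update s.
            slot[0] += 1
            slot[1] = self.combine(slot[1], state)
            return
        if len(self.slots) < self.capacity:
            # An unused slot counts as a slot with c = 0.
            self.slots[key] = [1, state]
            return
        victim = next((k for k, (c, _) in self.slots.items() if c == 0), None)
        if victim is not None:
            # Assumption: the evicted key's state goes to disk before the slot is reused.
            _, old_state = self.slots.pop(victim)
            self.spilled.append((victim, old_state))
            self.slots[key] = [1, state]
            return
        # No slot is free: write (k, s) to disk and decrement every counter.
        self.spilled.append((key, state))
        for entry in self.slots.values():
            entry[0] -= 1
```

With `capacity=2` and addition as `combine`, feeding the keys a, a, b, c, a, c (each with state 1) leaves a and c in memory and spills one c pair and b's state. Frequent keys keep their counters above zero and so tend to stay in memory, which is where the I/O saving comes from.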

Prototype Implementation
▶ Hash-based Map Output
  - MapOutputBuffer (manages the buffer, partitions data) is replaced
▶ Hash Thread
  - InMemFSMerge (in-memory and on-disk merge) is replaced with MR-Hash, INC-Hash, or DINC-Hash
  - Byte array-based memory manager

Performance Evaluation
▶ 236GB WorldCup click stream dataset
  - sessionization: split the clicks of each user into sessions
  - user click counting: count the number of clicks by each user
  - frequent user identification: find users who click at least 50 times
▶ 156GB GOV2 dataset
  - trigram counting: report trigrams that appear more than 1,000 times
▶ 11 nodes (1 head + 10 compute nodes)
  - CentOS 5.4, 2.83GHz Intel Xeon (quad) cores, 8GB RAM
  - JVM heap size: 1GB
  - Hadoop 0.20.1, default settings
  - map buffer: 140MB, reduce buffer: 500MB
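As an illustration of the sessionization workload, the sketch below shows what a per-user reduce function might look like. The slides do not say how a session boundary is defined, so the inactivity threshold `max_gap` (and its default of 300 seconds) is purely hypothetical, as is the function name.

```python
from typing import Iterable, List


def sessionize(timestamps: Iterable[int], max_gap: int = 300) -> List[List[int]]:
    """Split one user's click timestamps into sessions separated by gaps > max_gap."""
    sessions: List[List[int]] = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_gap:
            sessions[-1].append(t)   # close enough to the previous click: same session
        else:
            sessions.append([t])     # first click, or gap too large: start a new session
    return sessions
```

For example, `sessionize([100, 50, 1000, 1200, 60])` returns `[[50, 60, 100], [1000, 1200]]`. The output is about as large as the input, which fits the earlier benchmark where reduce output equals the input size.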

Performance Evaluation
▶ By supporting incremental processing, INC-Hash can provide early output and generates less spill data, which reduces the running time.
Figure panels: (a) Sessionization; (b) User click counting; (c) Frequent user identification
Sessionization results (1-pass SM / MR-hash / INC-hash):
  - Running time (s): 4424 / 3577 / 2258
  - Map CPU time (s): 936 / 566 / 571
  - Reduce CPU time (s): 1104 / 1033 / 565
  - Map output (GB): 245
  - Reduce spill (GB): 250 / 256 / 51

Conclusion
▶ The sort-merge implementation of MapReduce poses a fundamental barrier to incremental one-pass analytics
▶ This paper proposed a new data analysis platform that employs a purely hash-based framework, with various techniques to enable incremental processing and fast in-memory processing for frequent keys

Q & A Thank you
