
Page 1 A Platform for Scalable One-pass Analytics using MapReduce Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy SIGMOD 2011 IDS Fall Seminar 2011.


1 Page 1 A Platform for Scalable One-pass Analytics using MapReduce Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy SIGMOD 2011 IDS Fall Seminar 2011. 09. 30. Presented by Yang Byoung Ju

2 Page 2 One-pass algorithm
▶ An algorithm that reads its input exactly once, without unbounded buffering
▶ Generally requires O(n) time and less than O(n) storage
▶ Example problems solvable by a one-pass algorithm
  - Find the K largest elements
  - Find the sum, mean, and variance of the elements of a list
  - Find the most or least frequent elements
▶ Example problems not solvable by a one-pass algorithm
  - Find the middle element of the list
  - Sort the list
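As a rough illustration of the first two "solvable" examples above, here is a minimal Python sketch. The function names `top_k` and `running_stats` are hypothetical and do not come from the slides; the variance update is Welford's method, chosen here as one common way to stay within a single pass.

```python
import heapq
from typing import Iterable, List, Tuple


def top_k(stream: Iterable[float], k: int) -> List[float]:
    """Return the k largest items, reading the stream once with O(k) memory."""
    heap: List[float] = []  # min-heap holding the k largest items seen so far
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)  # drop the current smallest, keep x
    return sorted(heap, reverse=True)


def running_stats(stream: Iterable[float]) -> Tuple[float, float, float]:
    """Return (sum, mean, population variance) in one pass (Welford's update)."""
    n, total, mean, m2 = 0, 0.0, 0.0, 0.0
    for x in stream:
        n += 1
        total += x
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # accumulates squared deviations from the mean
    variance = m2 / n if n else 0.0
    return total, mean, variance
```

Both functions keep only constant or O(k) state, which is why they fit the "less than O(n) storage" requirement; sorting the whole list, by contrast, needs every element retained.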

3 Page 3 Introduction
▶ Real-time analytics using incremental one-pass processing requires the ability to collect and analyze enormous datasets efficiently
▶ However, MapReduce is not well suited to incremental one-pass analytics because it is designed for batch processing
▶ In addition, MapReduce's mechanism for parallel processing, which is based on a sort-merge technique, is subject to significant CPU and I/O bottlenecks
▶ This paper introduces a new platform that (1) reads input data only once, (2) performs incremental processing as more data is read, and (3) utilizes system resources efficiently to achieve high performance and scalability

4 Page 4 MapReduce Review

5 Page 5 Benchmarking results of Hadoop
▶ 'Click stream' sessionization
  Metric          Sessionization
  Input           256 GB
  Map output      269 GB
  Reduce spill    370 GB
  Reduce output   256 GB
  Running time    4860 sec
[Figure panels: (a) Hadoop: Task timeline; (b) Hadoop: CPU utilization; (c) Hadoop: CPU iowait]

6 Page 6 Benchmarking results of Hadoop
▶ The sorting step of sort-merge incurs high CPU cost
▶ The multi-pass merge in sort-merge is blocking and can incur high I/O cost given substantial intermediate data
▶ Using extra storage devices and alternative storage architectures does not eliminate blocking or the I/O bottleneck
▶ The Hadoop Online Prototype with pipelining does not eliminate blocking, the I/O bottleneck, or the CPU bottleneck

7 Page 7 A new hash-based platform
▶ This paper proposes a new data analysis platform that transforms MapReduce computation into incremental one-pass processing
  - The underlying computation: "Group data by key, then apply the reduce function to each group"
▶ The first mechanism replaces the widely used sort-merge implementation for partitioning and parallel processing with a purely hash-based framework to minimize computational and I/O bottlenecks as well as blocking
▶ The second mechanism brings the benefits of fast in-memory processing by identifying popular keys

8 Page 8 1. A basic Hash Technique (MR-Hash)
▶ MR-hash exactly matches the current MapReduce model, which collects all the values of the same key into a list and feeds the entire list to the reduce function
▶ Map side: avoids the CPU cost of sorting
▶ Reduce side: allows early answers to be returned from Bucket 1
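The reduce-side behaviour described above can be sketched roughly as follows. This is a simplified illustration, not the prototype's code: the name `mr_hash_reduce`, the `num_buckets` parameter, and the use of in-memory lists as a stand-in for on-disk bucket files are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Iterator, List, Tuple

Pair = Tuple[Hashable, int]


def mr_hash_reduce(
    pairs: Iterable[Pair],
    reduce_fn: Callable[[Hashable, List[int]], int],
    num_buckets: int = 4,
) -> Iterator[Tuple[Hashable, int]]:
    """Group values by key with hashing only (no sort), then apply reduce_fn."""
    in_memory: Dict[Hashable, List[int]] = defaultdict(list)  # "Bucket 1"
    # Assumption: plain lists stand in for the bucket files written to disk.
    spilled: List[List[Pair]] = [[] for _ in range(num_buckets - 1)]

    for key, value in pairs:
        b = hash(key) % num_buckets
        if b == 0:
            in_memory[key].append(value)  # grouped immediately in memory
        else:
            spilled[b - 1].append((key, value))  # deferred to a later pass

    # Early answers: the in-memory bucket is complete once the input ends.
    for key, values in in_memory.items():
        yield key, reduce_fn(key, values)

    # Remaining buckets are read back one at a time and grouped by hashing.
    for bucket in spilled:
        groups: Dict[Hashable, List[int]] = defaultdict(list)
        for key, value in bucket:
            groups[key].append(value)
        for key, values in groups.items():
            yield key, reduce_fn(key, values)
```

For instance, `dict(mr_hash_reduce(pairs, lambda k, vs: sum(vs)))` yields per-key sums; the order in which keys are emitted depends on which bucket each key hashes to.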

9 Page 9 2. An Incremental Hash Technique (INC-Hash)
▶ Designed for reduce functions that permit incremental processing (simple aggregates such as sum and count, and sublinear-space algorithms)
▶ init() reduces the amount of data output from the mapper
▶ The reducer only needs to hold a collapsed, compact state
▶ A query answer can be derived as soon as the relevant data is available
▶ The three functions
  - init(): reduces a sequence of data items to a state
  - cb(): reduces a sequence of states to a state
  - fn(): produces a final answer from a state
[Figure: on the Map side, init() turns the inputs (a,3) (a,5) (a,4) (a,1) (a,2) into the states (a,10-3) and (a,5-2); on the Reduce side, cb() combines them into (a,15-5) and fn() produces (a,3). The states appear to be sum-count pairs, so the final answer is the average.]
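The init()/cb()/fn() decomposition can be illustrated with a small Python sketch that computes an average using a (sum, count) state, mirroring the numbers on the slide. The helper `inc_hash_reduce` and the way the five input values are split across two mappers are assumptions for the example, not details given in the slides.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

State = Tuple[int, int]  # (sum, count)


def init(values: List[int]) -> State:
    """Map side: collapse a sequence of data items into one state."""
    return (sum(values), len(values))


def cb(states: List[State]) -> State:
    """Combine a sequence of states into one state."""
    return (sum(s for s, _ in states), sum(c for _, c in states))


def fn(state: State) -> float:
    """Produce the final answer (here, the average) from a state."""
    total, count = state
    return total / count


def inc_hash_reduce(states: Iterable[Tuple[Hashable, State]]) -> Dict[Hashable, State]:
    """Reduce side: keep one compact state per key in a hash table."""
    table: Dict[Hashable, State] = {}
    for key, state in states:
        # Fold each arriving state into the key's current state immediately.
        table[key] = cb([table[key], state]) if key in table else state
    return table
```

With two hypothetical mappers holding the values [3, 5, 2] and [4, 1] for key a, `init` gives (10, 3) and (5, 2), `cb` gives (15, 5), and `fn` gives 3.0. Because the table stores one state per key rather than a list of values, memory use depends on the number of distinct keys, not on the number of records.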

10 Page 10 3. A Dynamic Incremental Hash (DINC-Hash)
▶ Dynamically determines which keys should be processed in memory and which keys should be written to disk
▶ Greater I/O efficiency: hot keys are kept in memory
▶ Faster query answers: hot keys are usually the more important ones
▶ Flowchart: each hash-table slot j holds a counter c[j], a key k[j], and a state s[j]; initially, all c = 0
  - When a new (k,s) arrives and k exists in the hash table: increase c and update s
  - Otherwise, if c[j] = 0 for some j: (1,k,s) -> (c[j],k[j],s[j])
  - Otherwise: write (k,s) to disk and c[j]-- for all j
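The flowchart can be sketched in Python roughly as below. The class name `DincHash`, the `capacity` parameter, and the list used in place of a disk file are hypothetical. One further assumption is labeled in the code: when a slot with a zero counter is reused, its old key and state are spilled so that no data is lost, a step the slide does not spell out.

```python
from typing import Callable, Dict, Hashable, List, Tuple


class DincHash:
    """Keep states of (probably) hot keys in memory; spill the rest."""

    def __init__(self, capacity: int, cb: Callable[[int, int], int]) -> None:
        self.capacity = capacity  # number of in-memory slots
        self.cb = cb              # combines two states into one
        self.table: Dict[Hashable, List[int]] = {}  # key -> [counter, state]
        self.disk: List[Tuple[Hashable, int]] = []  # stand-in for a spill file

    def insert(self, key: Hashable, state: int) -> None:
        if key in self.table:
            entry = self.table[key]
            entry[0] += 1                         # increase c
            entry[1] = self.cb(entry[1], state)   # update s
            return
        if len(self.table) < self.capacity:
            self.table[key] = [1, state]          # an empty slot has c = 0
            return
        victim = next((k for k, (c, _) in self.table.items() if c == 0), None)
        if victim is not None:
            # Assumption: the evicted key's state is spilled, not discarded.
            _, old_state = self.table.pop(victim)
            self.disk.append((victim, old_state))
            self.table[key] = [1, state]          # (1,k,s) -> slot j
        else:
            self.disk.append((key, state))        # write (k,s) to disk
            for entry in self.table.values():
                entry[0] -= 1                     # c[j]-- for all j
```

Keys that keep arriving hold their counters above zero and stay in memory, while rarely seen keys are pushed to disk; the spilled pairs would still need a later pass to be merged into final answers.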

11 Page 11 Prototype Implementation
▶ Hash-based Map Output
  - MapOutputBuffer (manages the buffer and partitions data) is replaced
▶ Hash Thread
  - InMemFSMerge (in-memory on-disk merge) is replaced with MR-Hash, INC-Hash, or DINC-Hash
  - Byte array-based memory manager

12 Page 12 Performance Evaluation
▶ 236GB WorldCup click stream dataset
  - sessionization: split the clicks of each user into sessions
  - user click counting: count the number of clicks by each user
  - frequent user identification: find users who click at least 50 times
▶ 156GB GOV2 dataset
  - trigram counting: report trigrams that appear more than 1,000 times
▶ 11 nodes (1 head + 10 compute nodes)
  - CentOS 5.4, 2.83GHz Intel Xeon (quad) cores, 8GB RAM
  - JVM heap size: 1GB
  - Hadoop 0.20.1, default settings
  - map buffer: 140MB, reduce buffer: 500MB
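To make the click-stream workloads concrete, here is a minimal Python sketch of what each per-user computation might look like. The function names, the `(user, timestamp)` record layout, and the 300-second inactivity gap are assumptions for illustration; the slides give only the threshold of 50 clicks.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Click = Tuple[str, int]  # (user id, timestamp in seconds); hypothetical layout


def sessionize(timestamps: Iterable[int], gap: int = 300) -> List[List[int]]:
    """Split one user's clicks into sessions separated by more than `gap` seconds."""
    sessions: List[List[int]] = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)  # close enough to the previous click
        else:
            sessions.append([t])    # start a new session
    return sessions


def click_counts(clicks: Iterable[Click]) -> Counter:
    """Count the number of clicks made by each user."""
    return Counter(user for user, _ in clicks)


def frequent_users(clicks: Iterable[Click], threshold: int = 50) -> Set[str]:
    """Return the users with at least `threshold` clicks."""
    return {u for u, c in click_counts(clicks).items() if c >= threshold}
```

Click counting and frequent-user identification need only a counter per user, so they suit incremental processing well; sessionization has to retain the clicks themselves, which is consistent with it producing far more intermediate data.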

13 Page 13 Performance Evaluation
▶ By supporting incremental processing, INC-Hash can provide early output and generates less spill data, which reduces the running time.
[Figure panels: (a) Sessionization; (b) User click counting; (c) Frequent user identification]
  Sessionization         1-pass SM   MR-hash   INC-hash
  Running time (s)       4424        3577      2258
  Map CPU time (s)       936         566       571
  Reduce CPU time (s)    1104        1033      565
  Map output (GB)        245 (only one value listed)
  Reduce spill (GB)      250         256       51

14 Page 14 Conclusion
▶ The sort-merge implementation of MapReduce poses a fundamental barrier to incremental one-pass analytics
▶ This paper proposed a new data analysis platform that employs a purely hash-based framework, with various techniques to enable incremental processing and fast in-memory processing for frequent keys

15 Page 15 Q & A Thank you

