Published by Annice Allen. Modified over 9 years ago.
Page 1 A Platform for Scalable One-pass Analytics using MapReduce
Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy
SIGMOD 2011
IDS Fall Seminar, 2011. 09. 30.
Presented by Yang Byoung Ju
Page 2 One-pass algorithm
▶ An algorithm that reads its input exactly once, without unbounded buffering
▶ Generally requires O(n) time and less than O(n) storage
▶ Example problems solvable by a one-pass algorithm
  - Find the K largest elements
  - Find the sum, mean, and variance of the elements of a list
  - Find the most or least frequent elements
▶ Example problems not solvable by a one-pass algorithm
  - Find the middle element of the list
  - Sort the list
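The solvable examples above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name `one_pass_stats` is hypothetical, the variance uses Welford's running update, and the K largest elements are kept in a bounded min-heap. Each item is read once and the extra storage is O(K).

```python
import heapq
from typing import Iterable, List, Tuple


def one_pass_stats(stream: Iterable[float], k: int) -> Tuple[float, float, float, List[float]]:
    """Return (sum, mean, population variance, k largest) after a single pass."""
    count = 0
    total = 0.0
    mean = 0.0
    m2 = 0.0                 # running sum of squared deviations (Welford's update)
    top_k: List[float] = []  # min-heap that never holds more than k items

    for x in stream:
        count += 1
        total += x
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        if len(top_k) < k:
            heapq.heappush(top_k, x)
        elif top_k and x > top_k[0]:
            # x beats the smallest of the current top k, so swap it in.
            heapq.heapreplace(top_k, x)

    variance = m2 / count if count else 0.0
    return total, mean, variance, sorted(top_k, reverse=True)
```

Passing an iterator, for example `one_pass_stats(iter(data), 10)`, makes the single-pass property explicit, because an iterator cannot be rewound.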
Page 3 Introduction
▶ Real-time analytics using incremental one-pass processing requires the ability to collect and analyze enormous datasets efficiently
▶ However, MapReduce is not well suited to incremental one-pass analytics, since it is designed for batch processing
▶ Also, MapReduce's mechanism for parallel processing, based on a sort-merge technique, is subject to significant CPU and I/O bottlenecks
▶ This paper introduces a new platform which (1) reads input data only once, (2) performs incremental processing as more data is read, and (3) utilizes system resources efficiently to achieve high performance and scalability
Page 4 MapReduce Review
Page 5 Benchmarking results of Hadoop
▶ ‘Click stream’ sessionization

Metric           Sessionization
Input            256 GB
Map output       269 GB
Reduce spill     370 GB
Reduce output    256 GB
Running time     4860 sec

Figure panels: (a) Hadoop: task timeline; (b) Hadoop: CPU utilization; (c) Hadoop: CPU iowait
Page 6 Benchmarking results of Hadoop
▶ The sorting step of sort-merge incurs high CPU cost
▶ The multi-pass merge in sort-merge is blocking and can incur high I/O cost given substantial intermediate data
▶ Using extra storage devices and alternative storage architectures does not eliminate blocking or the I/O bottleneck
▶ The Hadoop Online Prototype with pipelining does not eliminate blocking, the I/O bottleneck, or the CPU bottleneck
Page 7 A new hash-based platform
▶ This paper proposes a new data analysis platform that transforms MapReduce computation into incremental one-pass processing
  - “Group data by key, then apply the reduce function to each group”
▶ The first mechanism replaces the widely used sort-merge implementation for partitioning and parallel processing with a purely hash-based framework, to minimize computational and I/O bottlenecks as well as blocking
▶ The second mechanism brings the benefits of fast in-memory processing by identifying popular keys
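The "group data by key, then apply the reduce function to each group" contract does not itself require sorted input. The following Python sketch is an illustration only, not the paper's implementation; the function names `reduce_by_sort` and `reduce_by_hash` are hypothetical. It shows that grouping through a hash table yields the same per-key result as grouping through a sort, while skipping the sorting step.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Pair = Tuple[Hashable, int]


def reduce_by_sort(pairs: Iterable[Pair], reduce_fn: Callable[[List[int]], int]) -> Dict[Hashable, int]:
    """Sort-based grouping: all input must arrive before the sort can finish."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        key: reduce_fn([value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    }


def reduce_by_hash(pairs: Iterable[Pair], reduce_fn: Callable[[List[int]], int]) -> Dict[Hashable, int]:
    """Hash-based grouping: each pair is placed in its group as it arrives."""
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce_fn(values) for key, values in groups.items()}
```

For example, `reduce_by_hash([("b", 1), ("a", 2), ("b", 3)], sum)` and the sort-based variant both return a sum of 4 for key "b" and 2 for key "a".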
Page 8 1. A basic Hash Technique (MR-Hash)
▶ MR-Hash exactly matches the current MapReduce model, which collects all the values of the same key into a list and feeds the entire list to the reduce function
▶ Map side: avoids the CPU cost of sorting
▶ Reduce side: allows early answers to be returned from Bucket 1
Page 9 2. An Incremental Hash Technique (INC-Hash)
▶ Designed for reduce functions that permit incremental processing (simple aggregates like sum and count, sublinear-space algorithms)
▶ init( ) reduces the amount of data output from the mapper
▶ The reducer only needs to hold a collapsed, compact state
▶ A query answer can be derived as soon as the relevant data is available
  - init( ) - reduces a sequence of data items to a state
  - cb( ) - reduces a sequence of states to a state
  - fn( ) - produces a final answer from a state
Diagram (Map side, then Reduce side): map outputs (a,3) (a,5) (a,4) (a,1) (a,2); init( ) yields the states (a,10-3) and (a,5-2); cb( ) combines them into (a,15-5); fn( ) produces (a,3)
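The three functions can be sketched in Python for an average, which is what the diagram's numbers appear to compute if each state is read as a sum-count pair (10-3 as sum 10 over 3 items). That reading, the split of the five values across two mappers, and the helper `inc_hash_reduce` are assumptions made for illustration, not details confirmed by the slides.

```python
from typing import Dict, Iterable, List, Tuple

State = Tuple[int, int]  # assumed state layout: (sum, count)


def init(values: List[int]) -> State:
    """Reduce a sequence of data items to a state."""
    return (sum(values), len(values))


def cb(states: List[State]) -> State:
    """Reduce a sequence of states to a single state."""
    return (sum(s for s, _ in states), sum(c for _, c in states))


def fn(state: State) -> float:
    """Produce the final answer (here, the average) from a state."""
    total, count = state
    return total / count


def inc_hash_reduce(incoming: Iterable[Tuple[str, State]]) -> Dict[str, State]:
    """Hypothetical reducer loop: keep one compact state per key in a hash table."""
    table: Dict[str, State] = {}
    for key, state in incoming:
        table[key] = cb([table[key], state]) if key in table else state
    return table
```

With `init([3, 5, 2])` and `init([4, 1])` sent to the reducer under key "a", the table holds `(15, 5)` and `fn` returns 3.0, matching the diagram's final (a,3).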
Page 10 3. A Dynamic Incremental Hash (DINC-Hash)
▶ Dynamically determines which keys should be processed in memory and which keys should be written to disk
▶ Greater I/O efficiency: hot keys are kept in memory
▶ Faster query answers: hot keys are usually the more important ones
Flowchart (initially, all c = 0), applied to each new (k,s):
  - k exists in the hash table? Yes: increase c and update s
  - No: is c[j] = 0 for some j? Yes: (1,k,s) -> (c[j],k[j],s[j])
  - No: write (k,s) to disk and c[j]-- for all j
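The flowchart can be sketched as a small Python class. The class name, the `combine` parameter, and the list standing in for the disk are hypothetical. The flowchart does not say what happens to the state already held in a slot whose counter is 0 when that slot is reused; the sketch assumes it is spilled to disk so that no data is lost.

```python
from typing import Callable, Dict, Hashable, List, Tuple


class DincHashSketch:
    """Counter-based choice of which keys stay in memory (illustrative only)."""

    def __init__(self, slots: int, combine: Callable[[int, int], int]) -> None:
        self.slots = slots      # number of in-memory slots j
        self.combine = combine  # merges two states of the same key
        self.table: Dict[Hashable, List[int]] = {}  # key -> [counter c, state s]
        self.disk: List[Tuple[Hashable, int]] = []  # stand-in for spilled (k, s)

    def add(self, key: Hashable, state: int) -> None:
        entry = self.table.get(key)
        if entry is not None:
            # k exists in the hash table: increase c, update s.
            entry[0] += 1
            entry[1] = self.combine(entry[1], state)
            return
        if len(self.table) < self.slots:
            # An unused slot still has c = 0, so the new key takes it.
            self.table[key] = [1, state]
            return
        victim = next((k for k, e in self.table.items() if e[0] == 0), None)
        if victim is not None:
            # c[j] = 0 for some j: reuse slot j for (1, k, s).
            # Assumption: the old state of slot j is written to disk.
            old = self.table.pop(victim)
            self.disk.append((victim, old[1]))
            self.table[key] = [1, state]
        else:
            # No free slot: write (k, s) to disk and decrement every counter.
            self.disk.append((key, state))
            for e in self.table.values():
                e[0] -= 1
```

Keys that arrive often keep their counters above zero and stay in memory, while rare keys are spilled, which is the behavior the bullets describe for hot keys.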
Page 11 Prototype Implementation
▶ Hash-based Map Output
  - MapOutputBuffer (manages the buffer, partitions data) is replaced
▶ Hash Thread
  - InMemFSMerge (in-memory and on-disk merge) is replaced with MR-Hash, INC-Hash, or DINC-Hash
  - Byte array-based memory manager
Page 12 Performance Evaluation
▶ 236GB WorldCup click stream dataset
  - sessionization: split the clicks of each user into sessions
  - user click counting: count the number of clicks by each user
  - frequent user identification: find users who click at least 50 times
▶ 156GB GOV2 dataset
  - trigram counting: report trigrams that appear more than 1,000 times
▶ 11 nodes (1 head + 10 compute nodes)
  - CentOS 5.4, 2.83GHz Intel Xeon (quad-core), 8GB RAM
  - JVM heap size: 1GB
  - Hadoop 0.20.1, default settings
  - map buffer: 140MB, reduce buffer: 500MB
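Two of the click stream workloads can be sketched in Python to make the task definitions concrete. The slides do not define how a session boundary is detected, so the inactivity `gap` parameter, the `(user, timestamp)` record layout, and both function names are assumptions; per-user timestamps are assumed to arrive in increasing order.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Click = Tuple[str, int]  # assumed record layout: (user id, timestamp)


def sessionize(clicks: Iterable[Click], gap: int) -> Dict[str, List[List[int]]]:
    """Split each user's clicks into sessions separated by more than `gap`."""
    sessions: Dict[str, List[List[int]]] = defaultdict(list)
    for user, ts in clicks:
        user_sessions = sessions[user]
        if user_sessions and ts - user_sessions[-1][-1] <= gap:
            user_sessions[-1].append(ts)  # continue the open session
        else:
            user_sessions.append([ts])    # start a new session
    return dict(sessions)


def frequent_users(clicks: Iterable[Click], threshold: int) -> Set[str]:
    """Return the users with at least `threshold` clicks."""
    counts: Dict[str, int] = defaultdict(int)
    for user, _ in clicks:
        counts[user] += 1
    return {user for user, n in counts.items() if n >= threshold}
```

Click counting needs only one integer per user, while sessionization must hold each user's clicks, which fits the larger intermediate data the slides report for sessionization.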
Page 13 Performance Evaluation
▶ By supporting incremental processing, INC-Hash can provide early output and generates less spill data, which reduces the running time.
Figure panels: (a) Sessionization; (b) User click counting; (c) Frequent user identification

Sessionization          1-pass SM   MR-hash   INC-hash
Running time (s)        4424        3577      2258
Map CPU time (s)        936         566       571
Reduce CPU time (s)     1104        1033      565
Map output (GB)         245 (one value for all three)
Reduce spill (GB)       250         256       51
Page 14 Conclusion
▶ The sort-merge implementation of MapReduce poses a fundamental barrier to incremental one-pass analytics
▶ This paper proposed a new data analysis platform that employs a purely hash-based framework, with various techniques to enable incremental processing and fast in-memory processing for frequent keys
Page 15 Q & A
Thank you