Page 1 A Platform for Scalable One-pass Analytics using MapReduce Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy SIGMOD 2011 IDS Fall Seminar Presented by Yang Byoung Ju
Page 2 One-pass algorithm ▶ An algorithm that reads its input exactly once, without unbounded buffering ▶ Generally requires O(n) time and less than O(n) storage ▶ Example problems solvable by a one-pass algorithm (see the sketch below)
- Find the K largest elements
- Find the sum, mean, and variance of the elements of the list
- Find the most or least frequent elements
▶ Example problems not solvable by a one-pass algorithm
- Find the median (middle element) of the list
- Sort the list
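As an illustration of the one-pass examples above (not from the original slides), a minimal Python sketch that computes the count, mean, and variance of a stream in a single pass using Welford's update:

```python
def one_pass_stats(stream):
    """Compute count, mean, and variance in one pass (Welford's algorithm)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count > 0 else float("nan")
    return count, mean, variance

print(one_pass_stats([3, 5, 4, 1, 2]))  # (5, 3.0, 2.0)
```

The stream is consumed exactly once and only constant state (count, mean, m2) is kept, which is what makes the computation one-pass.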
Page 3 Introduction ▶ Real-time analytics using incremental one-pass processing requires the ability to collect and analyze enormous datasets efficiently ▶ But MapReduce is not well suited for incremental one-pass analytics since it is designed for batch processing ▶ Also, the MapReduce mechanism for parallel processing, based on a sort-merge technique, is subject to significant CPU and I/O bottlenecks ▶ This paper introduces a new platform that (1) reads input data only once, (2) performs incremental processing as more data is read, and (3) utilizes system resources efficiently to achieve high performance and scalability
Page 4 MapReduce Review
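As a framework-free refresher (plain Python rather than the Hadoop API), the sketch below mimics the standard map, sort-merge shuffle, and reduce phases with a word-count example:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # map emits (key, value) pairs
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # reduce sees all values of one key at once
    yield (key, sum(values))

def mapreduce(records):
    intermediate = [kv for rec in records for kv in map_fn(rec)]
    intermediate.sort(key=itemgetter(0))                          # sort phase
    for key, group in groupby(intermediate, key=itemgetter(0)):   # merge/group phase
        yield from reduce_fn(key, (v for _, v in group))

print(dict(mapreduce(["a b a", "b c"])))  # {'a': 2, 'b': 2, 'c': 1}
```

The sort before grouping is exactly the sort-merge step whose CPU and I/O costs the following slides benchmark.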
Page 5 Benchmarking results of Hadoop ▶ 'Click stream' sessionization
- Input: 256 GB
- Map output: 269 GB
- Reduce spill: 370 GB
- Reduce output: 256 GB
- Running time: 4,860 sec
(Figures: (a) Hadoop task timeline, (b) Hadoop CPU utilization, (c) Hadoop CPU iowait)
Page 6 Benchmarking results of Hadoop ▶ The sorting step of sort-merge incurs a high CPU cost ▶ The multi-pass merge in sort-merge is blocking and can incur a high I/O cost given substantial intermediate data ▶ Using extra storage devices and alternative storage architectures does not eliminate blocking or the I/O bottleneck ▶ The Hadoop Online Prototype with pipelining does not eliminate blocking, the I/O bottleneck, or the CPU bottleneck
Page 7 A new hash-based platform ▶ This paper proposes a new data analysis platform that transforms MapReduce computation into incremental one-pass processing: "Group data by key, then apply the reduce function to each group" ▶ The first mechanism replaces the widely used sort-merge implementation for partitioning and parallel processing with a purely hash-based framework to minimize computational and I/O bottlenecks as well as blocking ▶ The second mechanism brings the benefits of fast in-memory processing by identifying popular keys
Page 8 1. A Basic Hash Technique (MR-Hash) ▶ MR-Hash exactly matches the current MapReduce model: it collects all the values of the same key into a list and feeds the entire list to the reduce function ▶ Map side – avoids the CPU cost of sorting ▶ Reduce side – allows early answers to be returned from the in-memory bucket (Bucket 1); see the sketch below
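A minimal in-memory sketch of the MR-Hash idea, assuming a fixed bucket count and ignoring the on-disk spilling of all but the resident bucket (names are illustrative, not the paper's code):

```python
from collections import defaultdict

NUM_BUCKETS = 4  # illustrative; in MR-Hash one bucket fits in memory, the rest spill to disk

def mr_hash_reduce(map_output, reduce_fn):
    """Group values of the same key via hashing (no sort), one bucket at a time."""
    buckets = [defaultdict(list) for _ in range(NUM_BUCKETS)]
    for key, value in map_output:
        buckets[hash(key) % NUM_BUCKETS][key].append(value)
    # Buckets are independent, so the first bucket's answers can be emitted
    # before the remaining buckets are processed.
    for bucket in buckets:
        for key, values in bucket.items():
            yield key, reduce_fn(values)

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
print(dict(mr_hash_reduce(pairs, sum)))  # {'a': 4, 'b': 2, 'c': 4}
```

The key point is that grouping by key needs only a hash table, not a sort, which removes the map-side sorting cost.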
Page 9 2. An Incremental Hash Technique (INC-Hash) ▶ Designed for reduce functions that permit incremental processing (simple aggregates like sum and count, sublinear-space algorithms) ▶ init() reduces the amount of data output from the mapper ▶ The reducer only needs to hold collapsed, compact state ▶ The query answer can be derived as soon as the relevant data is available
- init( ) reduces a sequence of data items to a state
- cb( ) reduces a sequence of states to a state
- fn( ) produces a final answer from a state
(Figure: five pairs (a,3) (a,5) (a,4) (a,1) (a,2) are collapsed by init( ) into partial states (a,10-3) and (a,5-2) on the map side, combined by cb( ) into (a,15-5) on the reduce side, and finalized by fn( ) to (a,3); a sketch follows)
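A sketch of the init( )/cb( )/fn( ) interface above, instantiated for a per-key average as in the figure (plain Python; the only assumption is that the reducer keeps one compact state per key in a hash table):

```python
# State for an average is (partial sum, count): compact and mergeable.

def init(value):
    return (value, 1)            # one item -> a state

def cb(s1, s2):
    return (s1[0] + s2[0], s1[1] + s2[1])   # combine two states

def fn(state):
    return state[0] / state[1]   # finalize: the average

def inc_hash(pairs):
    states = {}                  # one compact state per key, not a value list
    for key, value in pairs:
        s = init(value)
        states[key] = cb(states[key], s) if key in states else s
    return {k: fn(s) for k, s in states.items()}

print(inc_hash([("a", 3), ("a", 5), ("a", 4), ("a", 1), ("a", 2)]))  # {'a': 3.0}
```

Because only the collapsed state is kept, an answer for a key can be produced at any point by calling fn( ) on its current state.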
Page 10 3. A Dynamic Incremental Hash Technique (DINC-Hash) ▶ Dynamically determines which keys should be processed in memory and which keys should be written to disk ▶ Greater I/O efficiency – hot keys stay in memory ▶ Faster query answers – usually hot keys are more important ▶ Counter scheme (a sketch follows): initially all counters c = 0; for a new pair (k, s), if k exists in the hash table, increase its counter and update its state; otherwise, if c[j] = 0 for some slot j, replace that slot with (1, k, s); otherwise, write (k, s) to disk and decrement c[j] for all j
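A hypothetical Python sketch of the counter scheme described above (Frequent/Misra-Gries style slots); H, init, cb, and spill are illustrative names, and the real implementation manages serialized bytes and disk files rather than Python objects:

```python
H = 3  # number of in-memory slots (illustrative)

def dinc_hash(pairs, init, cb, spill):
    """Keep per-key states for hot keys in H in-memory slots; spill the rest."""
    slots = {}                              # key -> [counter, state]
    for key, value in pairs:
        if key in slots:                    # hot key: update its state in memory
            slots[key][0] += 1
            slots[key][1] = cb(slots[key][1], init(value))
        elif len(slots) < H:                # a free slot: admit the new key
            slots[key] = [1, init(value)]
        else:                               # no free slot: spill and decay counters
            spill(key, init(value))
            for k in list(slots):
                slots[k][0] -= 1
                if slots[k][0] == 0:        # slot reclaimed: flush its partial state
                    spill(k, slots.pop(k)[1])
    return {k: s for k, (c, s) in slots.items()}

spilled = []
hot = dinc_hash([("a", 1), ("b", 1), ("a", 1), ("c", 1), ("d", 1), ("a", 1)],
                init=lambda v: v, cb=lambda s, x: s + x,
                spill=lambda k, v: spilled.append((k, v)))
print(hot)      # {'a': 3}: the hot key's state stays in memory
print(spilled)  # [('d', 1), ('b', 1), ('c', 1)]: cold keys/states go to a later disk pass
```

Frequently occurring keys keep their counters above zero and are aggregated entirely in memory; cold keys are pushed to disk for a later pass.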
Page 11 Prototype Implementation ▶ Hash-based map output - MapOutputBuffer (manages the buffer, partitions data) is replaced ▶ Hash Thread - InMemFSMerge (in-memory/on-disk merge) is replaced with MR-Hash, INC-Hash, or DINC-Hash - Byte array-based memory manager (see the sketch below)
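A hypothetical sketch of what a byte array-based memory manager can look like: records are serialized into fixed-size pages carved out of one preallocated buffer, avoiding per-record object overhead (the page size and class names are illustrative, not the prototype's code):

```python
PAGE_SIZE = 4096  # illustrative page size

class ByteArrayPool:
    """Hand out fixed-size pages carved out of one preallocated bytearray."""
    def __init__(self, num_pages):
        self.buf = bytearray(num_pages * PAGE_SIZE)   # one big allocation up front
        self.free = list(range(num_pages))            # indices of free pages

    def alloc(self):
        if not self.free:
            raise MemoryError("pool exhausted: spill a bucket to disk")
        i = self.free.pop()
        return i, memoryview(self.buf)[i * PAGE_SIZE:(i + 1) * PAGE_SIZE]

    def release(self, i):
        self.free.append(i)

pool = ByteArrayPool(num_pages=2)
page_id, page = pool.alloc()
page[:5] = b"hello"          # write serialized record bytes into the page
pool.release(page_id)
```

Managing memory as raw byte pages keeps the hash buckets' footprint predictable and avoids garbage-collection pressure in the JVM.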
Page 12 Performance Evaluation ▶ 236 GB WorldCup click stream dataset
- sessionization: split the clicks of each user into sessions
- user click counting: count the number of clicks by each user
- frequent user identification: find users who clicked at least 50 times
▶ 156 GB GOV2 dataset
- trigram counting: report trigrams that appear more than 1,000 times
▶ 11 nodes (1 head + 10 compute nodes)
- CentOS 5.4, 2.83 GHz quad-core Intel Xeon, 8 GB RAM
- JVM heap size: 1 GB
- Hadoop with default settings
- map buffer: 140 MB, reduce buffer: 500 MB
Page 13 Performance Evaluation ▶ By supporting incremental processing, INC-Hash can provide early output and generates less spill data, which reduces the running time
(Figures: (a) Sessionization, (b) User click counting, (c) Frequent user identification)
(Table: sessionization results comparing 1-pass SM, MR-hash, and INC-hash on running time (s), map CPU time (s), reduce CPU time (s), map output (GB), and reduce spill (GB))
Page 14 Conclusion ▶ The sort-merge implementation for MapReduce poses a fundamental barrier to incremental one-pass analytics ▶ This paper proposed a new data analysis platform that employs a purely hash-based framework, with various techniques to enable incremental processing and fast in-memory processing for frequent keys
Page 15 Q & A Thank you