Page 1 A Platform for Scalable One-pass Analytics using MapReduce Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy SIGMOD 2011 IDS Fall Seminar 2011.

Presentation transcript:

Page 1 A Platform for Scalable One-pass Analytics using MapReduce Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy SIGMOD 2011 IDS Fall Seminar Presented by Yang Byoung Ju

Page 2 One-pass algorithm
▶ An algorithm that reads its input exactly once, without unbounded buffering
▶ Generally requires O(n) time and less than O(n) storage
▶ Example problems solvable by a one-pass algorithm
  - Find the K largest elements
  - Find the sum, mean, and variance of the elements of a list
  - Find the most or least frequent elements
▶ Example problems not solvable by a one-pass algorithm
  - Find the middle element of the list
  - Sort the list
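As a rough illustration of the solvable examples on this slide, the sketch below computes the sum, mean, variance, and K largest elements in a single pass with bounded state. The function name `one_pass_stats` and the use of Welford's running-variance update are illustrative choices, not something the slides prescribe.

```python
import heapq
import math


def one_pass_stats(stream, k):
    """Read `stream` exactly once; return (total, mean, variance, k largest)."""
    count = 0
    total = 0.0
    mean = 0.0
    m2 = 0.0      # running sum of squared deviations (Welford's update)
    top_k = []    # min-heap holding at most k elements, so storage is O(k)
    for x in stream:
        count += 1
        total += x
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        if len(top_k) < k:
            heapq.heappush(top_k, x)
        elif x > top_k[0]:
            # x beats the smallest of the current k largest: swap it in.
            heapq.heapreplace(top_k, x)
    variance = m2 / count if count else math.nan  # population variance
    return total, mean, variance, sorted(top_k, reverse=True)
```

The state kept between items is a handful of numbers plus a heap of size K, which is what makes this one-pass; finding the middle element or sorting would require holding the whole input.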

Page 3 Introduction
▶ Real-time analytics using incremental one-pass processing requires the ability to collect and analyze enormous datasets efficiently
▶ However, MapReduce is not well suited to incremental one-pass analytics because it is designed for batch processing
▶ In addition, MapReduce's mechanism for parallel processing, which is based on a sort-merge technique, is subject to significant CPU and I/O bottlenecks
▶ This paper introduces a new platform that (1) reads input data only once, (2) performs incremental processing as more data is read, and (3) utilizes system resources efficiently to achieve high performance and scalability

Page 4 MapReduce Review

Page 5 Benchmarking results of Hadoop
▶ ‘Click stream’ sessionization

Metric          Session.
Input           256GB
Map output      269GB
Reduce spill    370GB
Reduce output   256GB
Running time    4860 sec

Figure panels: (a) Hadoop: Task timeline (b) Hadoop: CPU utilization (c) Hadoop: CPU iowait

Page 6 Benchmarking results of Hadoop
▶ The sorting step of sort-merge incurs high CPU cost
▶ The multi-pass merge in sort-merge is blocking and can incur high I/O cost given substantial intermediate data
▶ Using extra storage devices and alternative storage architectures does not eliminate blocking or the I/O bottleneck
▶ The Hadoop Online Prototype with pipelining does not eliminate blocking, the I/O bottleneck, or the CPU bottleneck

Page 7 A new hash-based platform
▶ This paper proposes a new data analysis platform that transforms MapReduce computation into incremental one-pass processing
  "Group data by key, then apply the reduce function to each group"
▶ The first mechanism replaces the widely used sort-merge implementation for partitioning and parallel processing with a purely hash-based framework, to minimize computational and I/O bottlenecks as well as blocking
▶ The second mechanism brings the benefits of fast in-memory processing by identifying popular keys

Page 8 1. A basic Hash Technique (MR-Hash)
▶ MR-hash exactly matches the current MapReduce model, which collects all the values of the same key into a list and feeds the entire list to the reduce function
▶ Map side – avoids the CPU cost of sorting
▶ Reduce side – allows early answers to be returned from Bucket 1
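The idea on this slide can be sketched roughly as follows: hash-partition the (key, value) pairs into buckets, keep one bucket in memory so its groups can be reduced first, and group the remaining buckets afterwards. This is a simplified illustration, not the paper's implementation; the names `mr_hash` and `bucket_of`, the use of `crc32`, and the in-memory lists standing in for on-disk bucket files are all assumptions.

```python
import zlib
from collections import defaultdict


def bucket_of(key, num_buckets):
    """Stable hash partitioning (crc32 is an arbitrary illustrative choice)."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_buckets


def group_by_key(pairs):
    """Collect all values of the same key into a list, with no sorting."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def mr_hash(pairs, reduce_fn, num_buckets=4):
    """Yield (key, reduce_fn(key, values)) using hashing instead of sort-merge."""
    in_memory = defaultdict(list)                   # plays the role of Bucket 1
    spilled = [[] for _ in range(num_buckets - 1)]  # stand-ins for on-disk buckets
    for key, value in pairs:
        b = bucket_of(key, num_buckets)
        if b == 0:
            in_memory[key].append(value)
        else:
            spilled[b - 1].append((key, value))
    # Early answers: the in-memory bucket is complete once the input is consumed.
    for key, values in in_memory.items():
        yield key, reduce_fn(key, values)
    # Remaining buckets are read back and grouped one at a time.
    for bucket in spilled:
        for key, values in group_by_key(bucket).items():
            yield key, reduce_fn(key, values)
```

Because every key hashes to exactly one bucket, each group is complete within its bucket and the reduce function still receives the full value list per key.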

Page 9 2. An Incremental Hash Technique (INC-Hash)
▶ Designed for reduce functions that permit incremental processing (simple aggregates like sum and count, sublinear-space algorithms)
▶ init( ) reduces the amount of data output from the mapper
▶ The reducer only needs to hold a collapsed, compact state
▶ A query answer can be derived as soon as the relevant data is available
  - init( ) - reduces a sequence of data items to a state
  - cb( ) - reduces a sequence of states to a state
  - fn( ) - produces a final answer from a state
Figure (Map, Reduce): (a,3) (a,5) (a,4) (a,1) (a,2) -> init( ) -> (a,10-3) (a,5-2) -> cb( ) -> (a,15-5) -> fn( ) -> (a,3)
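A minimal sketch of the init( )/cb( )/fn( ) interface follows, using an average as the aggregate. Reading the slide's states such as (a,10-3) as "sum 10, count 3" is an assumption, as are the `inc_hash_reduce` helper and the tuple representation of a state.

```python
def init(values):
    """Reduce a sequence of data items to a state, here (sum, count)."""
    values = list(values)
    return (sum(values), len(values))


def cb(states):
    """Reduce a sequence of states to a single state."""
    total, count = 0, 0
    for s, c in states:
        total += s
        count += c
    return (total, count)


def fn(state):
    """Produce the final answer (the average) from a state."""
    total, count = state
    return total / count


def inc_hash_reduce(state_stream):
    """Reducer side: hold only one collapsed state per key in a hash table."""
    table = {}
    for key, state in state_stream:
        table[key] = cb([table[key], state]) if key in table else state
    return {key: fn(state) for key, state in table.items()}
```

With the slide's numbers, two mappers would emit `init([3, 5, 2])` and `init([4, 1])` for key `a`; the reducer combines them with `cb` and never stores the individual values.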

Page 10 3. A Dynamic Incremental Hash (DINC-Hash)
▶ Dynamically determines which keys should be processed in memory and which keys should be written to disk
▶ Greater I/O efficiency – hot keys are in memory
▶ Faster query answers – hot keys are usually more important
Flowchart, for each new (k,s) (initially, all c=0):
  - k exists in hashtable? Yes: increase c, update s
  - No: c[j]=0 for some j? Yes: (1,k,s) -> (c[j],k[j],s[j])
  - No: write (k,s) to disk and c[j]-- for all j
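The counter-based decision described on this slide can be sketched as below. The class name `DincHash`, the two-argument `cb` signature, and the list standing in for disk are hypothetical. One detail is an assumption the slide does not spell out: when a slot with counter 0 is reused, the state of the key it previously held is written to disk so that no data is lost.

```python
class DincHash:
    """Keep at most `num_slots` hot keys in memory; spill the rest."""

    def __init__(self, num_slots, cb):
        self.num_slots = num_slots
        self.cb = cb          # combines two states into one
        self.slots = {}       # key -> [counter c, state s]
        self.disk = []        # stand-in for (key, state) pairs written to disk

    def add(self, key, state):
        if key in self.slots:
            # k exists in hashtable: increase c, update s.
            entry = self.slots[key]
            entry[0] += 1
            entry[1] = self.cb(entry[1], state)
            return
        if len(self.slots) < self.num_slots:
            # An unused slot (counter still 0) takes (1, k, s).
            self.slots[key] = [1, state]
            return
        victim = next((k for k, e in self.slots.items() if e[0] == 0), None)
        if victim is not None:
            # Assumption: the evicted key's state is spilled before reuse.
            _, old_state = self.slots.pop(victim)
            self.disk.append((victim, old_state))
            self.slots[key] = [1, state]
        else:
            # No counter is 0: write (k, s) to disk and decrement every counter.
            self.disk.append((key, state))
            for entry in self.slots.values():
                entry[0] -= 1
```

A key that keeps arriving holds its counter above zero and stays in memory, while infrequent keys are gradually pushed out, which is the behaviour the slide attributes to hot keys.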

Page 11 Prototype Implementation
▶ Hash-based Map Output
  - MapOutputBuffer (manages the buffer, partitions data) is replaced
▶ Hash Thread
  - InMemFSMerge (in-memory on-disk merge) is replaced with MR-Hash, INC-Hash, or DINC-Hash
  - Byte array-based memory manager

Page 12 Performance Evaluation
▶ 236GB WorldCup click stream dataset
  - sessionization: split the clicks of each user into sessions
  - user click counting: count the number of clicks by each user
  - frequent user identification: find users who click at least 50 times
▶ 156GB GOV2 dataset
  - trigram counting: report trigrams that appear more than 1,000 times
▶ 11 nodes (1 head + 10 compute nodes)
  - CentOS 5.4, 2.83GHz Intel Xeon (quad-core), 8GB RAM
  - JVM heap size: 1GB
  - Hadoop, default settings
  - map buffer: 140MB, reduce buffer: 500MB

Page 13 Performance Evaluation
▶ By supporting incremental processing, INC-Hash can provide early output and generates less spill data, which reduces the running time.
Figure panels: (a) Sessionization (b) User click counting (c) Frequent user identification
Table (Sessionization): columns 1-pass SM, MR-hash, INC-hash; rows Running time (s), Map CPU time (s), Reduce CPU time (s), Map output (GB) 245, Reduce spill (GB)

Page 14 Conclusion
▶ The sort-merge implementation of MapReduce poses a fundamental barrier to incremental one-pass analytics
▶ This paper proposed a new data analysis platform that employs a purely hash-based framework, with various techniques to enable incremental processing and fast in-memory processing for frequent keys

Page 15 Q & A Thank you