Benchmarking MapReduce-Style Parallel Computing
Randal E. Bryant, Carnegie Mellon University

Programming with MapReduce
Background
- Developed at Google for aggregating web data
- Dean & Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI 2004
Strengths
- Easy way to write scalable parallel programs
- Powerful programming model that extends beyond web-search applications
- Runtime system automatically handles many of the challenges of parallel programming: scheduling, load balancing, fault tolerance

Overall Execution Model
General form
- Input: a large set of files
- Compute: aggregate information over them
- Output: files containing the aggregations
Example: word-count index
- Input: cached web pages, stored on a cluster of 1,000 machines, each with its own local disk
- Compute: an index of words with their occurrence counts
- Output: a file containing the count for each word

MapReduce Programming
Map
- Function generating keyword/value pairs from an input file
- E.g., a (word, count) pair for each word in a document
Reduce
- Function aggregating the values for a single keyword
- E.g., summing the word counts
[Diagram: map tasks M applied to inputs x_1 … x_n emit key/value pairs, which are grouped by key k_1 … k_r and fed to reduce tasks]
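To make the model concrete, here is a minimal word-count sketch in Python (the names map_fn and reduce_fn are illustrative, not part of any particular framework): the map function emits a (word, 1) pair for every word, and the reduce function sums the values collected for one word.

```python
def map_fn(filename, contents):
    # Map: emit a (keyword, value) pair for every word in the document.
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: aggregate all values emitted for a single keyword (sum the counts).
    yield word, sum(counts)
```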

MapReduce Implementation
(A somewhat naïve implementation)
Map
- Spawn a mapping task for each input file, executed on a processor local to that file
- Generate a file for each keyword/value pair
Shuffle
- Redistribute the files by hashing keywords: keyword K goes to processor P_h(K)
Reduce
- Spawn a reduce task for each keyword, on the processor P_h(K) to which that keyword hashes
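A single-machine simulation of this naïve execution model might look as follows (run_mapreduce is a hypothetical helper; a real runtime would distribute the map, shuffle, and reduce steps across machines and move data through files rather than in-memory lists):

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, num_reducers=4):
    # inputs: dict mapping file name -> file contents.
    # Map: one task per input file, emitting keyword/value pairs.
    pairs = []
    for filename, contents in inputs.items():
        pairs.extend(map_fn(filename, contents))

    # Shuffle: route each keyword K to partition h(K), here hash(K) mod R.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)

    # Reduce: one task per keyword, run over the partition the keyword hashed to.
    results = {}
    for partition in partitions:
        for key, values in partition.items():
            results.update(dict(reduce_fn(key, values)))
    return results

# Example: run_mapreduce({"page1": "the cat saw the dog"}, map_fn, reduce_fn)
# returns {"the": 2, "cat": 1, "saw": 1, "dog": 1}.
```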

Appealing Features
Ease of programming
- Programmer provides only two functions
- Expressed in terms of computation over data, not detailed execution on the system
Robustness
- Tolerant to failures of disks, processors, and the network
- Source files stored redundantly
- Runtime monitor detects and re-executes failed tasks
- Dynamic scheduling automatically adapts to resource limitations

Tolerating Failures
- Dean & Ghemawat, OSDI 2004: sorting 10^10 100-byte records (~1 TB) on roughly 1,800 machines
- Proactively restart delayed computations to achieve better performance and fault tolerance, as sketched below
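One way to picture that idea is speculative (backup) execution: if a task has not finished within some deadline, launch a second copy and take whichever copy finishes first. The sketch below only illustrates the scheduling idea, not the actual Google or Hadoop scheduler; in a real system the losing copy would also be killed rather than left running.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_with_backup(task, deadline_s, executor):
    # Start the primary copy of the task.
    primary = executor.submit(task)
    done, _ = wait([primary], timeout=deadline_s)
    if done:
        return primary.result()
    # The primary looks like a straggler: start a backup copy and
    # return the result of whichever copy completes first.
    backup = executor.submit(task)
    done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
    return done.pop().result()

# Example:
# with ThreadPoolExecutor(max_workers=4) as pool:
#     result = run_with_backup(lambda: sum(range(10**6)), 1.0, pool)
```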

Our Data-Driven World
- Science: databases from astronomy, genomics, natural language, seismic modeling, …
- Humanities: scanned books, historical documents, …
- Commerce: corporate sales, stock market transactions, census data, airline traffic, …
- Entertainment: Internet images, Hollywood movies, MP3 files, …
- Medicine: MRI and CT scans, patient records, …

"Big Data" Computing: Beyond Web Search
Application domains
- Rely on large, ever-changing data sets
- Collecting and maintaining the data is a major effort
Computational requirements
- Extract information from large volumes of raw data
Hypothesis
- MapReduce-style computation can be applied to many other application domains
Give it a try!
- Hadoop: open-source implementation of a parallel file system and MapReduce (see the example below)
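For example, the word count above can be run under Hadoop's Streaming interface, which lets the map and reduce steps be ordinary scripts that read standard input and write tab-separated keyword/value lines; Hadoop sorts the map output by key before the reducer sees it. The script below is a hypothetical sketch (the file name wordcount_streaming.py and the choice of a command-line argument to select map vs. reduce are this example's conventions, not Hadoop's):

```python
#!/usr/bin/env python3
import sys
from itertools import groupby

def run_map():
    # Mapper: emit one "word<TAB>1" line per word on standard output.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def run_reduce():
    # Reducer: input lines arrive sorted by key, so counts for the same
    # word are adjacent and can be summed in a single pass.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    run_map() if sys.argv[1:] == ["map"] else run_reduce()
```

It would typically be launched with Hadoop's streaming jar, along the lines of hadoop jar hadoop-streaming.jar -input <dir> -output <dir> -mapper "wordcount_streaming.py map" -reducer "wordcount_streaming.py reduce"; the exact jar path and options depend on the Hadoop installation.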

Q1: Workload Characteristics
Hardware
- Thousands of "nodes," each with processor(s), disk(s), and a network interface
- High-speed local network built from commodity technology, e.g., switched gigabit Ethernet
Data organization
- Distributed file system providing a uniform name space and redundant storage
Computation
- Each task executed as a separate process with file I/O
- Relies on the file system for data transfer

Q2: Hardware/Software Challenges
Performance issues
- Disk bandwidth limitations: roughly 3.6 hours to read the data from a 1 TB disk (about 77 MB/s sustained)
- Data transfer across the network
- Process and file I/O overhead
Runtime issues
- Detecting and mitigating the effects of failed components

Q3: Benchmarking Challenges
Generalizing results
- Beyond a specific data set and cluster configuration
- Performance depends on many different factors; can we predict how a program will scale?
Identifying bottlenecks
- The system has many interacting parts
Evaluating robustness
- Creating realistic failure modes

Q4: University Contributions
Currently, industry is ahead of universities
- Dealing with massive data sets
- Computing at very large scale
- Developing new programming/runtime approaches (Google, Yahoo!, Microsoft)
University role
- More open and systematic inquiry
- Apply these techniques to noncommercial problems
- Extend and improve the programming model and notations
- Expose students to emerging styles of computing

Background Information
- "Data-Intensive Supercomputing: The case for DISC," Tech Report CMU-CS