Benchmarking MapReduce-Style Parallel Computing
Randal E. Bryant
Carnegie Mellon University
http://www.cs.cmu.edu/~bryant
– 2 – Programming with MapReduce
Background
  Developed at Google for aggregating web data
  Dean & Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004
Strengths
  Easy way to write scalable parallel programs
  Powerful programming model, beyond web search applications
  Runtime system automatically handles many of the challenges of parallel programming: scheduling, load balancing, fault tolerance
– 3 – Overall Execution Model
General Form
  Input: large set of files
  Compute: aggregate information
  Output: files containing the aggregations
Example: Word Count Index
  Input: 10^10 cached web pages, stored on a cluster of 1000 machines, each with its own local disk
  Compute: index of words with occurrence counts
  Output: file containing the count for each word
– 4 – MapReduce Programming
Map
  Function generating keyword/value pairs from an input file
  E.g., a word/count pair for each word in a document
Reduce
  Function aggregating the values for a single keyword
  E.g., sum the word counts
[Figure: map tasks applied to inputs x1 … xn emit key-value pairs; reduce tasks aggregate the values for each key k1 … kr]
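The slide's word-count example can be sketched in a few lines of Python. This is a minimal single-machine illustration of the programming model, not Google's implementation; the function names `map_fn`, `reduce_fn`, and `mapreduce` are illustrative.

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit a (word, 1) key-value pair for every word in the document."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: aggregate all values emitted for a single keyword."""
    return (word, sum(counts))

def mapreduce(documents):
    groups = defaultdict(list)          # group values by key before reducing
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(mapreduce(["the cat", "the dog"]))  # {'the': 2, 'cat': 1, 'dog': 1}
```

The programmer supplies only `map_fn` and `reduce_fn`; everything else (grouping, distribution, fault tolerance) is the runtime's job, which is exactly the appeal described on the following slides.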
– 5 – MapReduce Implementation
(Somewhat naïve implementation)
Map
  Spawn a mapping task for each input file
  Execute on a processor local to the file
  Generate a file for each keyword/value pair
Shuffle
  Redistribute files by hashing keywords: K -> P_h(K)
Reduce
  Spawn a reduce task for each keyword
  Run on the processor P_h(K) to which the keyword hashes
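The shuffle step above can be sketched as hash partitioning. This is an illustrative toy (the processor count and function names are assumptions, and a real system shuffles files over the network, not in-memory lists); the key property it demonstrates is that every pair with the same keyword K lands on the same processor P_h(K).

```python
NUM_PROCESSORS = 4  # assumed cluster size for the example

def partition(key, num_partitions=NUM_PROCESSORS):
    """Map keyword K to processor index h(K) mod P."""
    return hash(key) % num_partitions

def shuffle(pairs, num_partitions=NUM_PROCESSORS):
    """Redistribute (key, value) pairs into per-processor buckets;
    all pairs sharing a key land in the same bucket, so a single
    reduce task sees every value for that key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[partition(key, num_partitions)].append((key, value))
    return buckets

pairs = [("the", 1), ("cat", 1), ("the", 1)]
buckets = shuffle(pairs)
# Both ("the", 1) pairs are routed to the same bucket.
```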
– 6 – Appealing Features
Ease of Programming
  Programmer provides only two functions
  Expressed in terms of computation over data, not detailed execution on a system
Robustness
  Tolerant to failures of disks, processors, and network
  Source files stored redundantly
  Runtime monitor detects and re-executes failed tasks
  Dynamic scheduling automatically adapts to resource limitations
– 7 – Tolerating Failures
Dean & Ghemawat, OSDI 2004
  Sorting 10^10 100-byte records with 1800 processors
  Proactively restart delayed computations to achieve better performance and fault tolerance
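The "proactively restart delayed computations" idea (speculative re-execution) can be sketched as: if a task has not finished within some deadline, launch a backup copy and take whichever finishes first. This is a simplified single-machine illustration, not the paper's mechanism; the timeout value and helper names are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_with_backup(task, pool, backup_after=1.0):
    """Run `task`; if it hasn't finished within `backup_after` seconds,
    launch a duplicate and return the first result to complete."""
    primary = pool.submit(task)
    done, _ = wait([primary], timeout=backup_after)
    if done:
        return primary.result()
    backup = pool.submit(task)  # straggler suspected: start a backup copy
    done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
    return next(iter(done)).result()

def sample_task():
    # Stand-in for a map or reduce task; a real straggler might be
    # stuck behind a failing disk or an overloaded machine.
    time.sleep(0.1)
    return "ok"

with ThreadPoolExecutor(max_workers=4) as pool:
    print(run_with_backup(sample_task, pool))  # -> ok
```

In the real system only tasks near the end of the job are duplicated, since backup copies consume resources that healthy tasks could otherwise use.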
– 8 – Our Data-Driven World
Science: databases from astronomy, genomics, natural languages, seismic modeling, …
Humanities: scanned books, historic documents, …
Commerce: corporate sales, stock market transactions, census, airline traffic, …
Entertainment: Internet images, Hollywood movies, MP3 files, …
Medicine: MRI & CT scans, patient records, …
– 9 – "Big Data" Computing: Beyond Web Search
Application Domains
  Rely on large, ever-changing data sets
  Collecting & maintaining data is a major effort
Computational Requirements
  Extract information from large volumes of raw data
Hypothesis
  MapReduce-style computation can be applied to many other application domains
Give It a Try!
  Hadoop: open-source implementation of a parallel file system & MapReduce
– 10 – Q1: Workload Characteristics
Hardware
  1000s of "nodes", each with processor(s), disk(s), and a network interface
  High-speed local network built from commodity technology, e.g., gigabit Ethernet with switches
Data Organization
  Distributed file system providing a uniform name space and redundant storage
Computation
  Each task executed as a separate process with file I/O
  Rely on the file system for data transfer
– 11 – Q2: Hardware/Software Challenges
Performance Issues
  Disk bandwidth limitations: ~3.6 hours to read the data from a 1 TB disk
  Data transfer across the network
  Process & file I/O overhead
Runtime Issues
  Detecting and mitigating the effects of failed components
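The 3.6-hour figure is a simple bandwidth calculation. A back-of-envelope check, assuming a commodity disk of the era sustains roughly 80 MB/s on a sequential scan (that rate is an assumption, not from the slide):

```python
# Back-of-envelope check of the slide's disk-bandwidth claim.
DISK_BYTES = 1e12    # 1 TB
BANDWIDTH = 80e6     # bytes/second; assumed sustained sequential read rate

hours = DISK_BYTES / BANDWIDTH / 3600
print(f"{hours:.1f} hours to scan 1 TB")  # ~3.5 hours, close to the slide's 3.6
```

The implication for the workload is that a full scan of each node's local disk is the dominant cost, which is why MapReduce runs map tasks on the processor that already holds the data rather than moving the data to the computation.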
– 12 – Q3: Benchmarking Challenges
Generalizing Results
  Beyond a specific data set & cluster configuration
  Performance depends on many different factors
  Can we predict how a program will scale?
Identifying Bottlenecks
  Many interacting parts to the system
Evaluating Robustness
  Creating realistic failure modes
– 13 – Q4: University Contributions
Currently: Industry Ahead of Universities
  Dealing with massive data sets
  Computing at very large scale
  Developing new programming/runtime approaches
  Google, Yahoo!, Microsoft
University Role
  More open and systematic inquiry
  Apply to noncommercial problems
  Extend and improve the programming model and notations
  Expose students to emerging styles of computing
– 14 – Background Information
"Data-Intensive Supercomputing: The Case for DISC"
  Tech report CMU-CS-07-128
  Available from http://www.cs.cmu.edu/~bryant