CS-4513 Distributed Computing Systems Hugh C. Lauer


1 CS-4513 Distributed Computing Systems Hugh C. Lauer
Map-Reduce
Assumptions:
- Graduate-level operating systems
- Making choices about operating systems
(Why a micro-century? …just about enough time for one concept.)

2 MapReduce
A programming model and implementation for processing very large data sets
- Many terabytes, on clusters of distributed computers
- Supports a broad variety of real-world tasks
- Foundation of Google’s applications

3 Why MapReduce?
- An important new model for distributed and parallel computing
- Fundamentally different from traditional models of parallelism: data parallelism, task parallelism, pipelined parallelism
- An abstraction to automate the mechanics of data handling and let the programmer concentrate on the semantics of the problem

4 Last Year in CS-4513
Divided the class into four teams; each team researched and taught one aspect:
- The abstraction itself and its algorithms
- Distributed MapReduce
- The class of problems that MapReduce can help solve
- The Google File System that supports MapReduce
Today’s material is drawn from those presentations.

5 Google Cluster
- 1000’s of PC-class systems: dual-processor x86, 4-8 GB RAM, commodity disks
- High-speed interconnect (Mb/sec)
- Distributed, replicated file system, optimized for GByte-size files: reading and appending
- Non-negligible failure rates

6 Typical Applications
- Search TBytes for words or phrases
- Create Page Rank among pages
- Conceptually simple, yet devilishly difficult to implement in a distributed environment

7 Basic Abstraction
- Partition the application into two functions, Map and Reduce, both written by the programmer
- Let the system partition execution among distributed platforms: scheduling, communication, synchronization, fault tolerance, reliability, etc.
- As of January 2008: 10,000 separate MapReduce programs developed within Google; 100,000 MapReduce jobs per day; 20 Petabytes of data processed per day

8 Map and Reduce
- Map – written by programmer: takes input key-value pairs; generates a set of intermediate key-value pairs
- System: organizes intermediate pairs by key
- Reduce – written by programmer: processes or merges all values for a given key; iterates through all keys

9 Example – Count Occurrences of Words in Collection of Documents
Pseudo-code:

  map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

Note: key is not used in this simple application.

10 Example – Count Occurrences of Words (continued)
Pseudo-code:

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));
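To make the two functions concrete, here is a minimal, self-contained C++ sketch that runs the same word-count logic on a single machine. It is an illustration, not Google's MapReduce library: the EmitIntermediate/Emit names follow the pseudo-code above, while the in-memory std::map "shuffle" and the tiny main() driver are assumptions made purely for demonstration.

  // word_count_sketch.cpp -- single-machine illustration of the word-count
  // Map/Reduce pair from the slides; not the distributed implementation.
  #include <iostream>
  #include <map>
  #include <sstream>
  #include <string>
  #include <utility>
  #include <vector>

  // Intermediate store: stands in for the cross-machine shuffle the real library does.
  static std::map<std::string, std::vector<std::string>> intermediate;

  // Called by Map to emit an intermediate (key, value) pair.
  void EmitIntermediate(const std::string& word, const std::string& count) {
    intermediate[word].push_back(count);
  }

  // Map: document name + contents -> (word, "1") for every word.
  void Map(const std::string& key, const std::string& value) {
    std::istringstream words(value);
    std::string w;
    while (words >> w) EmitIntermediate(w, "1");
  }

  // Reduce: one word plus all its emitted counts -> total count for that word.
  void Reduce(const std::string& key, const std::vector<std::string>& values) {
    int result = 0;
    for (const std::string& v : values) result += std::stoi(v);
    std::cout << key << "\t" << result << "\n";  // Emit(AsString(result)), plus the key for readability
  }

  int main() {
    // Toy "documents" standing in for the input splits.
    std::vector<std::pair<std::string, std::string>> docs = {
        {"doc1", "the quick brown fox"},
        {"doc2", "the lazy dog and the fox"}};
    for (const auto& d : docs) Map(d.first, d.second);                 // map phase
    for (const auto& kv : intermediate) Reduce(kv.first, kv.second);   // reduce phase
    return 0;
  }

Running it prints each distinct word with its total count; the distributed system described in the following slides performs exactly these steps, but spread across many machines.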

11 Example – Count Occurrences of Words (continued)
The MapReduce specification gives:
- Names of input and output files
- Tuning parameters
It is expressed as a C++ main() function, linked with the MapReduce library.

12 Full C++ Text of Word Frequency Application
Approximately 70 lines of C++ code.
Dean, J. and Ghemawat, S., “MapReduce: Simplified Data Processing on Large Clusters,” in Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI), San Francisco, CA, 2004.
Note: this paper is an earlier version of the CACM paper distributed to the class; it contains some details not included in the CACM paper.

13 Other Examples
- Distributed grep: key is the pattern to search for; values are the lines to search
- Count of URL access frequency: similar to word count; computed from web access logs
- Reverse web-link graph: obtain the list of sources for each URL target
- Large-scale indexing: the Google production search service
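As one concrete instance, here is a minimal single-machine C++ sketch of the reverse web-link graph example, under the same assumptions as the earlier word-count sketch (an illustration, not Google's code): Map emits a (target, source) pair for every link found on a source page, and Reduce lists all sources that point at each target.

  // reverse_links_sketch.cpp -- single-machine illustration of the
  // reverse web-link graph example; not the distributed implementation.
  #include <iostream>
  #include <map>
  #include <string>
  #include <vector>

  static std::map<std::string, std::vector<std::string>> intermediate;

  // Map: (source page, links found on that page) -> (target, source) pairs.
  void Map(const std::string& source, const std::vector<std::string>& targets) {
    for (const std::string& target : targets)
      intermediate[target].push_back(source);  // EmitIntermediate(target, source)
  }

  // Reduce: (target, list of sources) -> one line listing every page that links to target.
  void Reduce(const std::string& target, const std::vector<std::string>& sources) {
    std::cout << target << " <-";
    for (const std::string& s : sources) std::cout << " " << s;
    std::cout << "\n";
  }

  int main() {
    // Toy crawl: each call is one page and the pages it links to.
    Map("a.html", {"b.html", "c.html"});
    Map("b.html", {"c.html"});
    for (const auto& kv : intermediate) Reduce(kv.first, kv.second);
    return 0;
  }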

14 What It Does
Map: (k1, v1) → list(k2, v2)
Reduce: (k2, list(v2)) → list(v2)
MapReduce library:
- Converts input arguments to many (k1, v1) pairs; calls Map for each pair
- Reorganizes the intermediate lists from Map
- Calls Reduce for each intermediate key k2
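A hedged sketch of what the library side does with those types is below. The template is an illustrative stand-in for the driver loop, not the real library's interface: it feeds each (k1, v1) pair to Map, groups the emitted (k2, v2) pairs by k2, and hands each group to Reduce.

  // mapreduce_skeleton.cpp -- illustrative, single-process stand-in for the
  // library's driver loop; types mirror Map: (k1,v1) -> list(k2,v2) and
  // Reduce: (k2, list(v2)) -> list(v2).
  #include <functional>
  #include <iostream>
  #include <map>
  #include <sstream>
  #include <string>
  #include <utility>
  #include <vector>

  template <typename K1, typename V1, typename K2, typename V2>
  std::map<K2, std::vector<V2>> MapReduce(
      const std::vector<std::pair<K1, V1>>& input,
      std::function<std::vector<std::pair<K2, V2>>(const K1&, const V1&)> map_fn,
      std::function<std::vector<V2>(const K2&, const std::vector<V2>&)> reduce_fn) {
    // Shuffle: group all intermediate values by intermediate key k2.
    std::map<K2, std::vector<V2>> grouped;
    for (const auto& [k1, v1] : input)
      for (const auto& [k2, v2] : map_fn(k1, v1)) grouped[k2].push_back(v2);
    // Reduce each group; the output is list(v2) per key.
    std::map<K2, std::vector<V2>> output;
    for (const auto& [k2, values] : grouped) output[k2] = reduce_fn(k2, values);
    return output;
  }

  int main() {
    // Word count expressed against the skeleton.
    std::vector<std::pair<std::string, std::string>> docs = {
        {"doc1", "a b a"}, {"doc2", "b"}};
    auto result = MapReduce<std::string, std::string, std::string, int>(
        docs,
        [](const std::string&, const std::string& text) {
          std::vector<std::pair<std::string, int>> out;
          std::istringstream in(text);
          std::string w;
          while (in >> w) out.push_back({w, 1});
          return out;
        },
        [](const std::string&, const std::vector<int>& counts) {
          int sum = 0;
          for (int c : counts) sum += c;
          return std::vector<int>{sum};
        });
    for (const auto& [word, totals] : result)
      std::cout << word << "\t" << totals.front() << "\n";
    return 0;
  }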

15 Brute-Force Implementation
(Diagram: overall execution flow of a MapReduce job, stepped through on the following slides.)

16 Brute-Force Implementation
Step 0: Split the input files into pieces of 16-64 MBytes each.

17 Brute-Force Implementation
Step 1: Fork the user program
- Many distributed processes, scattered across the cluster
- One designated as the Master

18 Brute-Force Implementation
Step 2: Master
- Assigns worker tasks
- Manages results
- Monitors behavior & faults

19 Brute-Force Implementation
Step 3: Map workers
- Read input splits via GFS
- Parse key-value pairs
- Pass each pair to the Map function
- Buffer output in local memory

20 Brute-Force Implementation
Step 4: Intermediate files
- Written to local disk (via GFS)
- Master notified

21 Brute-Force Implementation
Step 5: Reduce workers
- Read intermediate data (streaming)
- Sort it by intermediate key

22 Brute-Force Implementation
Step 6: Call the Reduce function
- Once for each key, with its list of values
- Writes the output file
- Notifies the master

23 Result
One output file for each Reduce worker, which is then either:
- Combined by the application program, or
- Passed to another MapReduce call, or
- Passed to another distributed application

24 Questions? This presentation is stored at //

25 Distributed System Issues
- Fault tolerance
- Distributed file access
- Scalable performance

26 Managing Faults and Failures
- In a cluster of 1800 nodes, there will always be a handful of failures
- Question: with 1800 hard drives, each with a 100,000-hour MTBF, what is the mean time between drive failures across the cluster? (See the worked estimate below.)
- Some processors may be “slow” – called stragglers:
  - Intermittent memory or bus errors
  - Recoverable disk or network errors
  - Over-scheduling by the system
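A back-of-the-envelope answer, assuming independent failures: with 1,800 drives each rated at 100,000 hours MTBF, the cluster-wide mean time between drive failures is roughly 100,000 / 1,800 ≈ 56 hours. In other words, some drive in the cluster fails about every two to three days, which is why the system must treat failure as routine.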

27 Managing Faults and Failures (continued)
- The Master task periodically pings worker tasks
  - If no response, it starts a new worker task with the same responsibility
  - The new worker reads data from a different replica
- It also starts backup tasks for stragglers – just in case!
  - Whichever task finishes first “wins”; the other task(s) are shut down
- Performance penalty for backup tasks: a few percent loss in system resources, for an enormous improvement in response time

28 Questions?

29 Google File System Assumptions
- System failures are the norm
- The system stores mostly large (multi-gigabyte) files
- Expected read operations:
  - Large streaming accesses (> 1 MByte per access)
  - Few random accesses (a few KB out of someplace random)
- Expected write operations:
  - Long appending writes
  - Multiple clients appending concurrently
  - Updates in place to the middle of a file are extremely rare … and expensive
- Bandwidth trumps latency

30 Google File System (continued)
- One Master server per cluster
- Many chunk servers in each cluster
- Clients

31 Google File System (continued)
- Files are partitioned into 64-MByte chunks
- Each chunk is replicated across chunk servers
  - A chunk server stores its chunks as traditional Linux files on a node of the cluster
  - At least three replicas per chunk (on different servers)
  - Dynamic re-replication if a chunk server fails
- No caching of file data (not useful in streaming!)
(The sketch below illustrates the chunk-index arithmetic implied by this partitioning.)
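For a sense of how a client locates data under this scheme, here is a minimal sketch of the offset-to-chunk arithmetic implied by the 64-MByte partitioning above. It is an illustration only, not the real GFS client library: the client would send the (file name, chunk index) pair to the master, which returns the chunk handle and replica locations.

  // chunk_index_sketch.cpp -- illustrative arithmetic for mapping a file offset
  // to a GFS-style chunk index; not the actual GFS client code.
  #include <cstdint>
  #include <iostream>

  constexpr std::uint64_t kChunkSize = 64ULL * 1024 * 1024;  // 64 MBytes per chunk

  // A byte offset within a file falls in exactly one fixed-size chunk.
  std::uint64_t ChunkIndex(std::uint64_t byte_offset) {
    return byte_offset / kChunkSize;
  }

  int main() {
    std::uint64_t offset = 5ULL * 1024 * 1024 * 1024;  // a read at the 5-GByte mark
    std::cout << "offset " << offset << " falls in chunk " << ChunkIndex(offset)
              << " (the master then supplies that chunk's handle and replica locations)\n";
    return 0;
  }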

32 Google File System (continued)
- The Master maintains metadata & chunk info; it also performs garbage collection
- All data transactions are between clients and chunk servers
  - Transactions between a client and the Master are for control and info only
- Atomic transactions, replicated log
- The Master can be restarted on a different node as necessary; so can chunk servers

33 Reference
Ghemawat, Sanjay, Gobioff, Howard, and Leung, Shun-Tak, “The Google File System,” in Proceedings of the 2003 Symposium on Operating Systems Principles (SOSP), Bolton Landing (Lake George), NY, October 2003.

34 Additional Reference
Dean, Jeffrey, and Ghemawat, Sanjay, “MapReduce: Simplified Data Processing on Large Clusters,” Communications of the ACM, vol. 51, no. 1, January 2008.

35 Questions?

