CS-4513 Distributed Computing Systems
Hugh C. Lauer

MapReduce

Assumptions:
- Graduate-level operating systems
- Making choices about operating systems
MapReduce
- Programming model and implementation for processing very large data sets
  - Many terabytes, on clusters of distributed computers
- Supports a broad variety of real-world tasks
- Foundation of Google's applications
Why MapReduce?
- An important new model for distributed and parallel computing
- Fundamentally different from traditional models of parallelism:
  - Data parallelism
  - Task parallelism
  - Pipelined parallelism
- An abstraction that automates the mechanics of data handling and lets the programmer concentrate on the semantics of a problem
Last Year in CS-4513
- Divided the class into four teams, each to research and teach one aspect:
  - The abstraction itself and its algorithms
  - Distributed MapReduce
  - The class of problems that MapReduce can help solve
  - The Google File System that supports MapReduce
- Today's material is drawn from those presentations
Google Cluster 1000’s of PC-class systems High-speed interconnect Dual proc. x86, 4-8 GB RAM Commodity disks High-speed interconnect 100-1000 Mb/sec Distributed, replicated file system, optimized for GByte-size files Reading and appending Non-negligible failure rates CS-4513 D-term 2009 (Special Lecture) MapReduce
Typical Applications
- Search TBytes of data for words or phrases
- Create PageRank among pages
- Conceptually simple, but devilishly difficult to implement in a distributed environment
Basic Abstraction
- Partition the application into two functions, Map and Reduce, both written by the programmer
- Let the system partition execution among distributed platforms
  - Scheduling, communication, synchronization, fault tolerance, reliability, etc.
- As of January 2008:
  - 10,000 separate MapReduce programs developed within Google
  - 100,000 MapReduce jobs per day
  - 20 petabytes of data processed per day
Map and Reduce
- Map (written by the programmer):
  - Takes input key-value pairs
  - Generates a set of intermediate key-value pairs
- System:
  - Organizes intermediate pairs by key
- Reduce (written by the programmer):
  - Processes or merges all values for a given key
  - Iterates through all keys
Example – Count Occurrences of Words in a Collection of Documents

Pseudo-code:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

Note: key is not used in this simple application.
Example – Count Occurrences of Words (continued)

Pseudo-code:

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
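To make the pseudo-code above concrete, here is a minimal single-process C++ sketch. It is not Google's MapReduce library: EmitIntermediate, Emit, the grouping step, and the driver loop are toy in-memory stand-ins, and the toy Emit also records the key so the final table can be printed.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Toy in-memory stand-ins for the MapReduce library's plumbing.
static std::map<std::string, std::vector<std::string>> g_intermediate;
static std::vector<std::pair<std::string, std::string>> g_output;

// Called from Map: collect intermediate (key, value) pairs, grouped by key.
void EmitIntermediate(const std::string& key, const std::string& value) {
  g_intermediate[key].push_back(value);
}

// Called from Reduce: record a final result (the key is kept so it can be printed).
void Emit(const std::string& key, const std::string& value) {
  g_output.emplace_back(key, value);
}

// Map: (document name, document contents) -> (word, "1") pairs.
// As noted above, the document name is not used in this application.
void Map(const std::string& /*key*/, const std::string& value) {
  std::istringstream words(value);
  for (std::string w; words >> w;) EmitIntermediate(w, "1");
}

// Reduce: (word, list of counts) -> total count for that word.
void Reduce(const std::string& key, const std::vector<std::string>& values) {
  int result = 0;
  for (const std::string& v : values) result += std::stoi(v);
  Emit(key, std::to_string(result));
}

int main() {
  // Toy "input files": (document name, contents) pairs.
  const std::vector<std::pair<std::string, std::string>> documents = {
      {"doc1", "the quick brown fox"},
      {"doc2", "the lazy dog and the quick fox"}};

  for (const auto& doc : documents) Map(doc.first, doc.second);       // map phase
  for (const auto& kv : g_intermediate) Reduce(kv.first, kv.second);  // reduce phase

  for (const auto& kv : g_output) std::cout << kv.first << " " << kv.second << "\n";
  return 0;
}

Running it prints each word with its total count, e.g. "the 3".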
Example – Count Occurrences of Words (continued)
- MapReduce specification:
  - Names of input and output files
  - Tuning parameters
- Expressed as a C++ main() function
- Linked with the MapReduce library
Full C++ Text of the Word Frequency Application
- Approximately 70 lines of C++ code
- Dean, J., and Ghemawat, S., "MapReduce: Simplified Data Processing on Large Clusters," in Proceedings of Operating Systems Design and Implementation (OSDI), San Francisco, CA, 2004, pp. 137-150.
- Note: this paper is an earlier version of the CACM paper distributed by e-mail to the class; it contains some details not included in the CACM paper.
Other Examples
- Distributed grep
  - Key is the pattern to search for; values are the lines to search
- Count of URL access frequency
  - Similar to word count; input is web access logs
- Reverse web-link graph (a sketch of this one follows the list)
  - Obtain the list of sources for each URL target
- Large-scale indexing
  - The Google production search service
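Here is a sketch of the reverse web-link graph example, with made-up pages and a toy in-memory "shuffle" standing in for the framework: the map phase emits a (target, source) pair for every outgoing link on a source page, and the reduce phase concatenates all sources that link to a given target.

#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
  // Input: source page -> the pages it links to (made-up URLs).
  const std::map<std::string, std::vector<std::string>> pages = {
      {"a.com", {"b.com", "c.com"}},
      {"b.com", {"c.com"}},
      {"c.com", {"a.com"}}};

  // Map phase: emit a (target, source) pair for every outgoing link.
  // The std::map doubles as the grouping-by-key step.
  std::map<std::string, std::vector<std::string>> intermediate;
  for (const auto& page : pages)
    for (const auto& target : page.second)
      intermediate[target].push_back(page.first);

  // Reduce phase: for each target, output the list of pages that link to it.
  for (const auto& entry : intermediate) {
    std::cout << entry.first << " <-";
    for (const auto& source : entry.second) std::cout << " " << source;
    std::cout << "\n";
  }
  return 0;
}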
What It Does
- Map: (k1, v1) → list(k2, v2)
- Reduce: (k2, list(v2)) → list(v2)
- MapReduce library:
  - Converts input arguments to many (k1, v1) pairs; calls Map for each pair
  - Reorganizes the intermediate lists from Map
  - Calls Reduce for each intermediate key k2
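The two signatures above can also be written down as C++ types. The aliases below are purely illustrative (nothing like them appears in the paper), with lists rendered as std::vector and the word-count example shown as one instantiation.

#include <functional>
#include <iostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Map: (k1, v1) -> list(k2, v2), with lists rendered as std::vector.
template <typename K1, typename V1, typename K2, typename V2>
using MapFn =
    std::function<std::vector<std::pair<K2, V2>>(const K1&, const V1&)>;

// Reduce: (k2, list(v2)) -> list(v2).
template <typename K2, typename V2>
using ReduceFn = std::function<std::vector<V2>(const K2&, const std::vector<V2>&)>;

// Word-count instantiation: k1 = document name, v1 = contents,
// k2 = word, v2 = count (kept as a string, as in the pseudo-code).
using WordCountMap = MapFn<std::string, std::string, std::string, std::string>;
using WordCountReduce = ReduceFn<std::string, std::string>;

int main() {
  WordCountMap word_count_map = [](const std::string& /*doc*/,
                                   const std::string& contents) {
    std::vector<std::pair<std::string, std::string>> pairs;
    std::istringstream words(contents);
    for (std::string w; words >> w;) pairs.emplace_back(w, "1");
    return pairs;
  };
  for (const auto& kv : word_count_map("doc1", "to be or not to be"))
    std::cout << "(" << kv.first << ", " << kv.second << ")\n";
  return 0;
}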
Brute-Force Implementation
Brute-Force Implementation
- Step 0: split the input files into pieces of 16-64 MBytes each
(A single-process sketch of the whole pipeline follows Step 6 below.)
Brute-Force Implementation
- Step 1: fork the user program
  - Many distributed processes, scattered across the cluster
  - One designated as the Master
Brute-Force Implementation
- Step 2: the Master
  - Assigns worker tasks
  - Manages results
  - Monitors behavior and faults
Brute-Force Implementation
- Step 3: Map workers
  - Read input splits via GFS
  - Parse key-value pairs
  - Pass each pair to the Map function
  - Buffer output in local memory
Brute-Force Implementation
- Step 4: intermediate files
  - Written to local disk (via GFS)
  - Master notified
Brute-Force Implementation
- Step 5: Reduce workers
  - Read intermediate data (streaming)
  - Sort by intermediate key
Brute-Force Implementation
- Step 6: call the Reduce function
  - For each key, with its list of values
  - Write the output file
  - Notify the Master
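Pulling Steps 0-6 together, here is a single-process sketch of the data flow. Forking workers, the Master, and GFS are all elided; the input splits, the number of reduce tasks R, and the hash partitioning are toy stand-ins. Each map task's output is partitioned by hash(key) mod R, and each reduce task sorts its partition by key, groups the values, and produces one output "file".

#include <algorithm>
#include <functional>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, std::string>;

// Step 3: a map task parses its split and emits (word, "1") pairs.
std::vector<KV> MapTask(const std::string& split) {
  std::vector<KV> out;
  std::istringstream words(split);
  for (std::string w; words >> w;) out.emplace_back(w, "1");
  return out;
}

// Step 6: a reduce task sums the counts for one key.
std::string ReduceTask(const std::string& /*key*/,
                       const std::vector<std::string>& values) {
  int result = 0;
  for (const auto& v : values) result += std::stoi(v);
  return std::to_string(result);
}

int main() {
  const int R = 2;  // number of reduce tasks (and of output files)

  // Step 0: the "input splits" (in the real system, 16-64 MByte pieces in GFS).
  const std::vector<std::string> splits = {"the quick brown fox",
                                           "the lazy dog and the quick fox"};

  // Steps 3-4: run each map task and partition its output into R buckets.
  std::vector<std::vector<KV>> buckets(R);
  for (const auto& split : splits)
    for (const auto& kv : MapTask(split))
      buckets[std::hash<std::string>{}(kv.first) % R].push_back(kv);

  // Steps 5-6: each reduce task sorts its bucket by key, groups the values,
  // and writes one "output file" (here, standard output).
  for (int r = 0; r < R; ++r) {
    std::sort(buckets[r].begin(), buckets[r].end());
    std::map<std::string, std::vector<std::string>> grouped;
    for (const auto& kv : buckets[r]) grouped[kv.first].push_back(kv.second);
    std::cout << "--- output file " << r << " ---\n";
    for (const auto& g : grouped)
      std::cout << g.first << " " << ReduceTask(g.first, g.second) << "\n";
  }
  return 0;
}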
Result
- One output file for each Reduce worker
- Combined by the application program, or
- Passed to another MapReduce call, or to another distributed application
Questions? This presentation is stored at //www.cs.wpi.edu/~lauer/MapReduce--D-Term-09.ppt
Distributed System Issues
- Fault tolerance
- Distributed file access
- Scalable performance
Managing Faults and Failures
- In a cluster of 1800 nodes, there will always be a handful of failures
  - Question: with 1800 hard drives, each with a 100,000-hour MTBF, what is the mean time between drive failures somewhere in the cluster?
- Some processors may be "slow" (called stragglers), due to:
  - Intermittent memory or bus errors
  - Recoverable disk or network errors
  - Over-scheduling by the system
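A back-of-the-envelope answer to the question above, assuming independent drive failures: with each drive rated at 100,000 hours MTBF, the cluster as a whole sees a mean time between drive failures of roughly 100,000 / 1800 ≈ 56 hours, i.e., expect some drive in the cluster to fail every two to three days.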
Managing Faults and Failures (continued)
- The Master periodically pings worker tasks
  - If no response, it starts a new worker task with the same responsibility
  - The new worker reads its data from a different replica
- The Master also starts backup tasks for stragglers, just in case
  - Whichever task finishes first "wins"; the other task(s) are shut down
- Performance penalty for backup tasks:
  - A few percent loss in system resources
  - Enormous improvement in response time
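A toy numerical model of the backup-task idea above (not the real scheduler; the task durations are invented): the straggler's primary copy would take 12 minutes, but a backup copy launched once the other tasks finish completes sooner, so the job ends at whichever copy finishes first.

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
  // Hypothetical primary-copy finish times in minutes; the last task is the straggler.
  const std::vector<double> primary = {3.0, 3.2, 2.9, 3.1, 12.0};
  const double typical = 3.2;         // when the "normal" tasks have finished
  const double backup_runtime = 3.0;  // a backup started at `typical` needs this long

  // Without backups, the job waits for the slowest task.
  const double no_backups = *std::max_element(primary.begin(), primary.end());

  // With backups: any task still running at `typical` gets a backup copy,
  // and finishes at whichever copy completes first.
  double with_backups = 0.0;
  for (double t : primary) {
    const double finish = (t > typical) ? std::min(t, typical + backup_runtime) : t;
    with_backups = std::max(with_backups, finish);
  }

  std::cout << "job time without backup tasks: " << no_backups << " min\n";
  std::cout << "job time with backup tasks:    " << with_backups << " min\n";
  return 0;
}

With these made-up numbers, the job finishes in 6.2 minutes instead of 12: a few extra task-minutes of work buys roughly half the response time.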
Questions?
Google File System
- Assumptions:
  - System failures are the norm
  - The system stores mostly large (multi-gigabyte) files
- Expected read operations:
  - Large streaming accesses (> 1 MByte per access)
  - Few random accesses (a few KB from a random location)
- Expected write operations:
  - Long appending writes
  - Multiple clients appending concurrently
  - Updates in place in the middle of a file are extremely rare … and expensive
- Bandwidth trumps latency
Google File System (continued)
- One Master server per cluster
- Many chunk servers in each cluster
- Clients
Google File System (continued)
- Files are partitioned into 64-MByte chunks
- Each chunk is replicated across chunk servers
  - A chunk server stores its chunks as traditional Linux files on a node of the cluster
  - At least three replicas per chunk (on different servers)
  - Dynamic re-replication if a chunk server fails
- No caching of file data (not useful for streaming!)
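A small sketch of the chunk arithmetic implied by the 64-MByte chunk size; the file offset and the comment about the Master's reply are illustrative, the constant and the division are the point.

#include <cstdint>
#include <iostream>

int main() {
  constexpr std::uint64_t kChunkSize = 64ull * 1024 * 1024;  // 64 MBytes

  const std::uint64_t file_offset = 5ull * 1000 * 1000 * 1000;  // byte 5,000,000,000
  const std::uint64_t chunk_index = file_offset / kChunkSize;   // which chunk to ask for
  const std::uint64_t chunk_offset = file_offset % kChunkSize;  // offset within that chunk

  std::cout << "chunk index:  " << chunk_index << "\n"
            << "chunk offset: " << chunk_offset << "\n";
  // The client sends (file name, chunk index) to the Master, which replies with
  // the chunk handle and the locations of the (at least three) replicas.
  return 0;
}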
Google File System (continued)
- The Master maintains metadata and chunk information
  - Also handles garbage collection
- All data transactions are between clients and chunk servers
  - Transactions between a client and the Master are for control and info only
- Atomic transactions, replicated log
  - The Master can be restarted on a different node as necessary
  - So can chunk servers
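A toy illustration of the control/data split above: metadata lookups go to the "Master", while the bytes themselves come from a "chunk server". The maps, chunk handle, server names, and file contents here are invented stand-ins, not the real GFS data structures or protocol.

#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// What the Master knows about one chunk (illustrative fields only).
struct ChunkInfo {
  std::string handle;                 // globally unique chunk handle
  std::vector<std::string> replicas;  // chunk servers holding a replica
};

int main() {
  // Master metadata: (file name, chunk index) -> chunk handle and locations.
  const std::map<std::pair<std::string, int>, ChunkInfo> master = {
      {{"/logs/access.log", 0}, {"c-0001", {"cs12", "cs47", "cs98"}}}};

  // One chunk server: chunk handle -> chunk contents (stand-in for a Linux file).
  const std::map<std::string, std::string> chunk_server_cs12 = {
      {"c-0001", "GET /index.html 200 ..."}};

  // 1. Control: the client asks the Master where chunk 0 of the file lives.
  const ChunkInfo& info = master.at({"/logs/access.log", 0});
  std::cout << "replicas for " << info.handle << ":";
  for (const auto& r : info.replicas) std::cout << " " << r;
  std::cout << "\n";

  // 2. Data: the client reads the chunk from one replica, bypassing the Master.
  std::cout << "data from cs12: " << chunk_server_cs12.at(info.handle) << "\n";
  return 0;
}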
Reference
- Ghemawat, Sanjay, Gobioff, Howard, and Leung, Shun-Tak, "The Google File System," Proceedings of the 2003 Symposium on Operating Systems Principles, Bolton Landing (Lake George), NY, October 2003.
Additional Reference
- Dean, Jeffrey, and Ghemawat, Sanjay, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, January 2008, pp. 107-113.
Questions?