Google’s Map Reduce
Commodity Clusters
Web data sets can be very large
– Tens to hundreds of terabytes
Cannot mine on a single server
Standard architecture emerging:
– Cluster of commodity Linux nodes
– Gigabit Ethernet interconnect
How to organize computations on this architecture?
Cluster Architecture
[Diagram: racks of nodes, each node with its own CPU, memory, and disk, connected through switches.]
Each rack contains nodes
1 Gbps between any pair of nodes in a rack
2-10 Gbps backbone between racks
Map Reduce Map-reduce is a high-level programming system that allows database processes to be written simply. The user writes code for two functions, map and reduce. A master controller divides the input data into chunks, and assigns different processors to execute the map function on each chunk. Other processors, perhaps the same ones, are then assigned to perform the reduce function on pieces of the output from the map function.
Data Organization
Data is assumed stored in files.
– Typically, the files are very large compared with the files found in conventional systems.
– For example, one file might be all the tuples of a very large relation.
– Or the file might be a terabyte of "market baskets."
– Or the file might be the "transition matrix of the Web," which is a representation of the graph with all Web pages as nodes and hyperlinks as edges.
Files are divided into chunks, which might be complete cylinders of a disk, and are typically many megabytes.
The Map Function Input is a set of key-value records. Executed by one or more processes, located at any number of processors. – Each map process is given a chunk of the entire input data on which to work. Output is a list of key-value pairs. – The types of keys and values for the output of the map function need not be the same as the types of input keys and values. – The "keys" that are output from the map function are not true keys in the database sense. That is, there can be many pairs with the same key value.
Map Example
Constructing an Inverted Index
Input is a collection of documents. Final output (not the output of map) is, for each word, a list of the documents that contain that word at least once.
Map Function
Input is a set of (i, d) pairs
– i is a document ID
– d is the corresponding document.
The map function scans d and, for each word w it finds, emits the pair (w, i).
– Notice that in the output, the word is the key and the document ID is the associated value.
Output of map is a list of word-ID pairs.
– It is not necessary to catch duplicate words in the document; duplicates can be eliminated later, at the reduce phase.
– The intermediate result is the collection of all word-ID pairs created from all the documents in the input database.
Note. The output of a map-reduce algorithm is always a set of key-value pairs. Useful in some applications to compose two or more map-reduce operations.
The Reduce Function
The second user-defined function, reduce, is also executed by one or more processes, located at any number of processors.
Input is a key from the intermediate result, together with the list of all values that appear with this key in the intermediate result.
The reduce function itself combines the list of values associated with a given key k.
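The flow described so far (map over the input records, group the intermediate pairs by key, then reduce each group) can be sketched in a single process. This is only an illustration, not how a cluster implementation is organized: `run_map_reduce`, `map_initial`, and `reduce_sum` are hypothetical names, and there is no master, no chunking, and no parallelism here.

```python
from collections import defaultdict

def run_map_reduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the map / group / reduce flow (illustrative only)."""
    # Map phase: each input record yields zero or more (key, value) pairs.
    intermediate = defaultdict(list)
    for in_key, in_value in records:
        for out_key, out_value in map_fn(in_key, in_value):
            # Grouping step: collect every value emitted under the same key.
            intermediate[out_key].append(out_value)
    # Reduce phase: one call per distinct key, given the full list of its values.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

# Toy usage: total length of the words filed under each initial letter.
def map_initial(_, word):
    return [(word[0], len(word))]

def reduce_sum(_, lengths):
    return sum(lengths)

totals = run_map_reduce(enumerate(["map", "merge", "reduce"]), map_initial, reduce_sum)
print(totals)
```

Note that the output keys (initial letters) have a different type from the input keys (positions), and that several intermediate pairs share the key "m", as the map-function slide allows.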
Reduce Example
Constructing an Inverted Index
Input is a collection of documents. Final output (not the output of map) is, for each word, a list of the documents that contain that word at least once.
Reduce Function
The intermediate result consists of pairs of the form (w, [i_1, i_2, …, i_n]),
– where the i's are a list of document IDs, one for each occurrence of word w.
The reduce function takes a list of IDs, eliminates duplicates, and sorts the list of unique IDs.
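A minimal sketch of the inverted-index example in plain Python, with the grouping step done in memory. The function names, the sample documents, and the choice to lowercase and split on whitespace are assumptions made for illustration.

```python
from collections import defaultdict

def map_index(doc_id, text):
    # Emit (word, doc_id) for every word occurrence; duplicates are left in on purpose.
    # Lowercasing and whitespace splitting are simplifying assumptions.
    return [(word, doc_id) for word in text.lower().split()]

def reduce_index(word, doc_ids):
    # Eliminate duplicate IDs and sort the unique ones.
    return sorted(set(doc_ids))

def inverted_index(documents):
    # Group the intermediate (word, doc_id) pairs by word, then reduce each group.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, i in map_index(doc_id, text):
            grouped[word].append(i)
    return {word: reduce_index(word, ids) for word, ids in grouped.items()}

docs = {1: "the cat sat", 2: "the dog sat on the cat", 3: "a dog"}
index = inverted_index(docs)
print(index["the"])  # document 2 contains "the" twice but is listed once
print(index["dog"])
```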
Parallelism This organization of the computation makes excellent use of whatever parallelism is available. The map function works on a single document, so we could have as many processes and processors as there are documents in the database. The reduce function works on a single word, so we could have as many processes and processors as there are words in the database. Of course, it is unlikely that we would use so many processors in practice.
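The independence of the map tasks (one per document) and of the reduce tasks (one per word) can be illustrated with a worker pool. This is a sketch under loose assumptions: threads stand in for processors, `parallel_index` and its helpers are hypothetical names, and in CPython the global interpreter lock means threads will not actually speed up this CPU-bound work. The point is only that no task needs the result of another task in the same phase.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_doc(item):
    # One map task per document: emit (word, doc_id) pairs.
    doc_id, text = item
    return [(word, doc_id) for word in text.split()]

def reduce_word(item):
    # One reduce task per word: deduplicate and sort its document IDs.
    word, doc_ids = item
    return word, sorted(set(doc_ids))

def parallel_index(documents, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: documents are processed independently of one another.
        mapped = list(pool.map(map_doc, documents.items()))
        # Grouping happens between the phases (done sequentially in this sketch).
        grouped = defaultdict(list)
        for pairs in mapped:
            for word, doc_id in pairs:
                grouped[word].append(doc_id)
        # Reduce phase: words are processed independently of one another.
        reduced = list(pool.map(reduce_word, grouped.items()))
    return dict(reduced)

result = parallel_index({1: "a b", 2: "b c b", 3: "c"})
print(result)
```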
Another Example – Word Count
Construct a word count. For each word w that appears at least once in our database of documents, output the pair (w, c), where c is the number of times w appears among all the documents.
The map function
– Input is a document.
– Goes through the document, and each time it encounters another word w, it emits the pair (w, 1).
– Intermediate result is a list of pairs (w_1, 1), (w_2, 1), ….
The reduce function
– Input is a pair (w, [1, 1, ..., 1]), with a 1 for each occurrence of word w.
– Sums the 1's, producing the count.
– Output is word-count pairs (w, c).
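The word-count example can be sketched in plain Python without any framework. The names `map_count`, `reduce_count`, and `word_count` are illustrative, and splitting on whitespace is an assumption about what counts as a word.

```python
from collections import defaultdict

def map_count(document):
    # Emit (w, 1) each time a word is encountered.
    return [(w, 1) for w in document.split()]

def reduce_count(word, ones):
    # Sum the 1's to get the number of occurrences of the word.
    return sum(ones)

def word_count(documents):
    # Group the (w, 1) pairs by word, then reduce each group to a count.
    grouped = defaultdict(list)
    for doc in documents:
        for w, one in map_count(doc):
            grouped[w].append(one)
    return {w: reduce_count(w, ones) for w, ones in grouped.items()}

counts = word_count(["to be or not to be", "to do"])
print(counts["to"], counts["be"], counts["do"])
```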
What about Joins?
R(A, B) ⋈ S(B, C)
The map function
– Input is key-value pairs (X, t), where X is either R or S and t is a tuple of the relation named by X.
– Output is a single pair (b, (R, a)) or (b, (S, c)), depending on X:
  – b is the B-value of t.
  – a is the A-value of t (if X = R).
  – c is the C-value of t (if X = S).
The reduce function
– Input is a pair (b, [(R, a), (S, c), …]).
– Extracts all the A-values associated with R and all the C-values associated with S.
– These are paired in all possible ways, with the b in the middle, to form the tuples (a, b, c) of the result.
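The join can be sketched in plain Python as follows. The relations are represented as lists of tuples, and `map_join`, `reduce_join`, `natural_join`, and the sample data are assumptions for illustration. The result is sorted only to make the output deterministic.

```python
from collections import defaultdict

def map_join(relation_name, t):
    # Key every tuple by its B-value and tag the remaining value with its relation.
    if relation_name == "R":
        a, b = t
        return [(b, ("R", a))]
    b, c = t
    return [(b, ("S", c))]

def reduce_join(b, tagged_values):
    # Separate the A-values (from R) and the C-values (from S), then pair them
    # in all possible ways with b in the middle.
    a_values = [v for name, v in tagged_values if name == "R"]
    c_values = [v for name, v in tagged_values if name == "S"]
    return [(a, b, c) for a in a_values for c in c_values]

def natural_join(r_tuples, s_tuples):
    records = [("R", t) for t in r_tuples] + [("S", t) for t in s_tuples]
    grouped = defaultdict(list)
    for name, t in records:
        for key, value in map_join(name, t):
            grouped[key].append(value)
    result = []
    for b, tagged in grouped.items():
        result.extend(reduce_join(b, tagged))
    return sorted(result)

R = [(1, "x"), (2, "x"), (3, "y")]
S = [("x", 10), ("y", 20), ("y", 30), ("z", 40)]
joined = natural_join(R, S)
print(joined)
```

A B-value that occurs in only one relation (here "z") produces an empty pairing in reduce, so it contributes nothing to the result.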
Reading
– Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
– DISCO (Nokia): open source Erlang implementation of MapReduce with a Python interface
– Hadoop (Apache): open source implementation of MapReduce
Word Count in DISCO
    def fun_map(e, params):
        return [(w, 1) for w in e.split()]

    def fun_reduce(iter, out, params):
        stats = {}
        for word, count in iter:
            if word in stats:
                stats[word] += int(count)
            else:
                stats[word] = int(count)
        for word, total in stats.iteritems():
            out.add(word, total)
Word Count in DISCO
    import sys
    from disco import Disco, result_iterator

    master = sys.argv[1]
    print "Starting Disco job.."
    print "Go to %s to see status of the job." % master
    results = Disco(master).new_job(
        name = "wordcount",
        input = ["…"],
        map = fun_map,
        reduce = fun_reduce).wait()
    print "Job done. Results:"
    for word, frequency in result_iterator(results):
        print word, frequency
Word Count in DISCO
    mkdir bigtxt
    split -l 100000 bigfile.txt bigtxt/bigtxt-
After running these lines, the directory bigtxt contains many files, named bigtxt-aa, bigtxt-ab, etc., each of which contains 100,000 lines (except the last chunk, which might contain fewer).
Decision Trees
Key observation (RainForest):
– The best split for a node can be determined if we have the AVC-sets for the node (AVC stands for Attribute-Value, Class label).
– AVC-sets are typically small and probably fit in main memory.
– E.g., the AVC-set for an attribute "age" and class "car type" has a cardinality of not more than 100*5.
Remarks:
– AVC-sets aren't a compact representation of the dataset. We can't reconstruct the dataset from the AVC-sets.
– Although the AVC-sets can be small, the dataset can be very big!
Challenge:
– Computing the AVC-sets.
AVC-sets in Map Reduce The map function Input is (rid, tuple) pairs, Output is a list of ((a,v,c), 1) pairs, where – a is attribute name – v is attribute value – c is class label The reduce function Input is a pair ((a,v,c), [1, 1, …, 1]). Adds up the 1's.
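A minimal sketch of the AVC-set computation in plain Python. It assumes each tuple is a dict of attribute values with one designated class attribute, and it counts over all records (that is, for the root node of the tree). The names `map_avc`, `reduce_avc`, `avc_sets`, and the sample rows are illustrative only.

```python
from collections import defaultdict

def map_avc(rid, row, class_attr):
    # Emit ((attribute, value, class label), 1) for every non-class attribute of the tuple.
    label = row[class_attr]
    return [((a, v, label), 1) for a, v in row.items() if a != class_attr]

def reduce_avc(key, ones):
    # Add up the 1's: how many tuples have this attribute value and class label.
    return sum(ones)

def avc_sets(rows, class_attr):
    grouped = defaultdict(list)
    for rid, row in enumerate(rows):
        for key, one in map_avc(rid, row, class_attr):
            grouped[key].append(one)
    return {key: reduce_avc(key, ones) for key, ones in grouped.items()}

rows = [
    {"age": 25, "income": "low", "car": "sports"},
    {"age": 25, "income": "high", "car": "sports"},
    {"age": 40, "income": "high", "car": "family"},
]
avc = avc_sets(rows, "car")
print(avc[("age", 25, "sports")])
```

The result has one counter per distinct (attribute, value, class label) combination, so its size depends on the number of distinct values rather than on the number of tuples.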