ecs251 Operating Systems, Spring 2011: #6 Map and Reduce
Dr. S. Felix Wu, Computer Science Department, University of California, Davis
http://www.facebook.com/group.php?gid=29670204725
http://cyrus.cs.ucdavis.edu/~wu/ecs251
Programming Model
- Input: a set of key/value pairs; Output: a set of key/value pairs
- The MapReduce library asks the user to write two functions: Map and Reduce
- Map: takes an input key/value pair and produces a set of intermediate key/value pairs
- The MapReduce library groups together all intermediate values associated with the same intermediate key I
- Reduce: takes an intermediate key I and its set of values, and merges them into a smaller set of values
MapReduce: Example
- Counting the number of occurrences of each word in a large collection of documents
- Map: takes (doc name, doc contents) and emits (word, occurrence count) pairs
- Reduce: takes (word, list of counts) and emits the sum of all counts for that word
- Input and output types:
  map(k1, v1) -> list(k2, v2)
  reduce(k2, list(v2)) -> list(v2)
MapReduce: Execution (figure)
HDFS Architecture
- Client read path: (1) the client sends a filename to the NameNode; (2) the NameNode returns the block IDs and the DataNodes holding them; (3) the client reads the data directly from those DataNodes
- NameNode: maps a file to a file-id and a list of DataNodes; tracks cluster membership
- DataNode: maps a block-id to a physical location on disk
- SecondaryNameNode: periodic merge of the transaction log
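To make the read path concrete, here is a minimal client-side sketch using the Hadoop Java API (the file path is illustrative); the NameNode lookup and the DataNode reads all happen behind fs.open() and in.read():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up the cluster settings
        FileSystem fs = FileSystem.get(conf);       // client handle; talks to the NameNode
        // Steps 1-2: filename -> block IDs + DataNode locations (done internally)
        FSDataInputStream in = fs.open(new Path("/data/example.txt"));
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) > 0) {            // step 3: bytes stream from the DataNodes
            System.out.write(buf, 0, n);
        }
        in.close();
        fs.close();
    }
}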
Map and Reduce
- The ideas of Map and Reduce are 40+ years old
  - Present in all functional programming languages
  - See, e.g., APL, Lisp, and ML
- An alternate name for Map: Apply-All
- Higher-order functions
  - take function definitions as arguments, or
  - return a function as output
- Map and Reduce are higher-order functions.
GFS: Google File System
- "Failures" are the norm
- Multi-GB files are common
- Append rather than overwrite
  - Random writes are rare
- Can we relax the consistency?
Logically, a MapReduce job is specified by:
# an input reader
# a Map function
# a partition function
# a compare function
# a Reduce function
# an output writer
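Of these, the partition function is the piece users most often customize. A minimal sketch against Hadoop's old org.apache.hadoop.mapred API, mirroring the behavior of the default hash partitioner:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes each intermediate (key, value) pair to one of the R reduce tasks.
public class MyPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) { }  // no configuration needed here

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Hash the key, clear the sign bit, take it modulo the number of reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}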
Map: A Higher-Order Function
- F(x: int) returns r: int
- Let V be an array of integers.
- W = map(F, V)
  - W[i] = F(V[i]) for all i
  - i.e., apply F to every element of V
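The same idea sketched in Java with streams (F here is an example function that adds one):

import java.util.Arrays;
import java.util.function.IntUnaryOperator;

public class MapExample {
    public static void main(String[] args) {
        IntUnaryOperator f = x -> x + 1;             // F(x: int) returns x + 1
        int[] v = {1, 2, 3, 4, 5};
        int[] w = Arrays.stream(v).map(f).toArray(); // W[i] = F(V[i]) for all i
        System.out.println(Arrays.toString(w));      // prints [2, 3, 4, 5, 6]
    }
}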
Map Examples in Haskell
- map (+1) [1,2,3,4,5] == [2,3,4,5,6]
- map toLower "abcDEFG12!@#" == "abcdefg12!@#"
- map (`mod` 3) [1..10] == [1,2,0,1,2,0,1,2,0,1]
Word Count Example
- Read text files and count how often words occur
  - The input is text files
  - The output is a text file: each line contains a word, a tab, and its count
- Map: produce (word, count) pairs
- Reduce: for each word, sum up the counts (a Hadoop sketch follows the example below)
Example flow for the input "I am a tiger, you are also a tiger":
map:        (I,1) (am,1) (a,1) (tiger,1) (you,1) (are,1) (also,1) (a,1) (tiger,1)
sort/group: (a,[1,1]) (also,[1]) (am,[1]) (are,[1]) (I,[1]) (tiger,[1,1]) (you,[1])
reduce:     (a,2) (also,1) (am,1) (are,1) (I,1) (tiger,2) (you,1)
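A minimal version of this job against Hadoop's old org.apache.hadoop.mapred API, in the same style as the skeletons later in these slides; class names are illustrative:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {
    // Map: (line offset, line text) -> (word, 1) for each word in the line
    public static class WCMap extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    // Reduce: (word, [1,1,...]) -> (word, sum)
    public static class WCReduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) sum += values.next().get();
            output.collect(key, new IntWritable(sum));
        }
    }
}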
Grep Example
- Search input files for a given pattern
- Map: emits a line if the pattern is matched
- Reduce: copies results to output
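A sketch of the map side (the reduce can simply be the identity); the pattern is assumed to arrive through the job configuration under a hypothetical key:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Emits a line (with its file offset) whenever it contains the pattern.
public class GrepMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
    private String pattern;

    public void configure(JobConf job) {
        // "grep.pattern" is a hypothetical configuration key set by the driver.
        pattern = job.get("grep.pattern", "");
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        if (line.toString().contains(pattern))
            output.collect(line, offset);
    }
}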
Inverted Index Example
- Generate an inverted index of words from a given set of files
- Map: parses a document and emits (word, docId) pairs
- Reduce: takes all pairs for a given word, sorts the docId values, and emits a (word, list(docId)) pair
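A compact sketch in the same old-API style; here the document ID is assumed to be the name of the file the input split came from:

import java.io.IOException;
import java.util.Iterator;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class InvertedIndex {
    public static class IIMap extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // Use the source file's name as the document ID.
            String docId = ((FileSplit) reporter.getInputSplit()).getPath().getName();
            for (String word : value.toString().split("\\s+"))
                output.collect(new Text(word), new Text(docId));
        }
    }

    public static class IIReduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text word, Iterator<Text> docIds,
                           OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            TreeSet<String> sorted = new TreeSet<String>();  // sorts (and de-duplicates) docIds
            while (docIds.hasNext()) sorted.add(docIds.next().toString());
            output.collect(word, new Text(sorted.toString()));
        }
    }
}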
Execution on Clusters
1. Input files split (M splits)
2. Assign Master & Workers
3. Map tasks
4. Write intermediate data to disk (R regions)
5. Intermediate data read & sort
6. Reduce tasks
7. Return
Pair
- Map input: raw row data as a (key, value) pair
- Map output: a list of intermediate pairs, e.g. (key1, val), (key2, val), (key1, val), ...
- Reduce input: one selected key with all of its values, e.g. (key1, [val, ..., val])
- Reduce output: the reduced (key, value) pairs
Hadoop dataflow (figure): input in HDFS -> splits (split 0 ... split 4) -> map tasks -> sort/copy -> merge -> reduce tasks -> output in HDFS (part0, part1)
Class structure of a Hadoop program (Map function, Reduce function, and job configuration):

class MR {
  class Mapper ... { }    // Map function
  class Reducer ... { }   // Reduce function
  public static void main(String[] args) {
    JobConf conf = new JobConf(MR.class);   // job configuration
    conf.setMapperClass(Mapper.class);
    conf.setReducerClass(Reducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
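One detail the skeleton above leaves out: the job's output key/value types usually have to be declared as well. For example, assuming Text keys and IntWritable counts as in the word-count job:

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);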
class MyMap extends MapReduceBase
    implements Mapper<InputKey, InputValue, OutputKey, OutputValue> {
  // the four type parameters: input key, input value, output key, output value
  // global variables
  public void map(InputKey key, InputValue value,
                  OutputCollector<OutputKey, OutputValue> output,
                  Reporter reporter) throws IOException {
    // local variables and program
    output.collect(newKey, newValue);
  }
}
class MyRed extends MapReduceBase
    implements Reducer<InputKey, InputValue, OutputKey, OutputValue> {
  // the four type parameters: input key, input value, output key, output value
  // global variables
  public void reduce(InputKey key, Iterator<InputValue> values,
                     OutputCollector<OutputKey, OutputValue> output,
                     Reporter reporter) throws IOException {
    // local variables and program
    output.collect(newKey, newValue);
  }
}
Nutch
- Complete web search engine
  - Nutch = Crawler + Indexer/Searcher (Lucene) + GUI
    - + Plugins
    - + MapReduce & Distributed FS (Hadoop)
- Java based, open source, many customizable scripts available at http://lucene.apache.org/nutch/
- Features:
  - Customizable
  - Extensible (e.g., extend to Solr for enhanced portability)
Data Structures used by Nutch
- Web Database or WebDB
  - Mirrors the properties/structure of the web graph being crawled
- Segment
  - Intermediate index
  - Contains pages fetched in a single run
- Index
  - Final inverted index obtained by "merging" segments (Lucene)
WebDB
- Customized graph database
- Used by the Crawler only
- Persistent storage for "pages" & "links"
  - Page DB: indexed by URL and hash; contains content, outlinks, fetch information & score
  - Link DB: contains "source to target" links, anchor text
Crawling
- Cyclic process (sketched below):
  - the crawler generates a set of fetchlists from the WebDB
  - fetchers download the content from the Web
  - the crawler updates the WebDB with new links that were found
  - the crawler then generates a new set of fetchlists
  - repeat until you reach the "depth"
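A sketch of this loop in Java-flavored pseudocode; WebDB, Fetcher, FetchList, and Page are hypothetical stand-ins for the corresponding Nutch components, not real API names:

// Hypothetical sketch: generate/fetch/update until the requested depth.
for (int round = 0; round < depth; round++) {
    List<FetchList> fetchlists = webdb.generateFetchLists(); // from the WebDB
    List<Page> pages = fetcher.download(fetchlists);         // fetch content from the Web
    webdb.update(pages);                                     // record newly found links
}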
Indexing
- Iterate through all k page sets in parallel, constructing the inverted index
- Creates a "searchable document" of:
  - URL text
  - Content text
  - Incoming anchor text
- Other content types might have different document fields
  - E.g., email has sender/receiver
  - Any searchable field the end-user will want
- Uses the Lucene text indexer
Lucene
- Open source search project
  - http://lucene.apache.org
- Index & search local files
  - Download lucene-2.2.0.tar.gz from http://www.apache.org/dyn/closer.cgi/lucene/java/
  - Extract files
  - Build an index for a directory:
    java org.apache.lucene.demo.IndexFiles dir_path
  - Try search at the command line:
    java org.apache.lucene.demo.SearchFiles
Lucene's Open Architecture (figure):
- Crawling: FS Crawler / Larm gather content from the file system, WWW, or an IMAP server
- Parsing: PDF / HTML / DOC / TXT parsers turn raw files into Lucene Documents
- Indexing: an Analyzer (Stop, CN/DE, Standard) feeds the indexer, which builds the Index
- Searching: the searcher answers queries against the Index
Index structure: an Index contains Documents; each Document contains Fields, and each Field is a (name, value) pair.
Create an Analyzer
- WhitespaceAnalyzer
  - divides text at whitespace
- SimpleAnalyzer
  - divides text at non-letters
  - converts to lower case
- StopAnalyzer
  - SimpleAnalyzer, plus
  - removes stop words
- StandardAnalyzer
  - good for most European languages
  - removes stop words
  - converts to lower case
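A minimal index-then-search sketch against the Lucene 2.2 API referenced above (the index directory name and field contents are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class LuceneExample {
    public static void main(String[] args) throws Exception {
        // Index: the analyzer tokenizes field text; 'true' creates a fresh index.
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("contents", "Penn State Football ... football",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // Search: parse a query against the "contents" field and print hits.
        IndexSearcher searcher = new IndexSearcher("index");
        Query q = new QueryParser("contents", new StandardAnalyzer()).parse("football");
        Hits hits = searcher.search(q);
        for (int i = 0; i < hits.length(); i++)
            System.out.println(hits.doc(i).get("contents"));
        searcher.close();
    }
}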
Inverted Index (Inverted File)
Doc 1: Penn State Football ... football
Doc 2: Football players ... State

Posting Table:
id  word      doc    offset
1   football  Doc 1  3, 67
              Doc 2  1
2   penn      Doc 1  1
3   players   Doc 2  2
4   state     Doc 1  2
              Doc 2  13
Query path (figure): a Query first consults the Term Info Index (in memory, constant time), which directs constant-time random file accesses to the Term Dictionary, the Frequency File, and the Position File; Field Info is kept in memory.
Map/Reduce Cluster Implementation (figure: input files -> splits (split 0 ... split 4) -> M map tasks -> intermediate files -> R reduce tasks -> output files (Output 0, Output 1))
- Several map or reduce tasks can run on a single computer
- Each intermediate file is divided into R partitions by the partitioning function
- Each reduce task corresponds to one partition
Execution (figure)
Hadoop Usage at Facebook
- Data warehouse running Hive
- 600 machines, 4800 cores, 2.4 PB disk
- 3200 jobs per day
- 50+ engineers have used Hadoop
Facebook Data Pipeline (figure): Web Servers -> Scribe Servers -> Network Storage -> Hadoop Cluster -> Oracle RAC / MySQL; Analysts issue Hive Queries against the Hadoop cluster, which produces the Summaries
Facebook Job Types
- Production jobs: load data, compute statistics, detect spam, etc.
- Long experiments: machine learning, etc.
- Small ad-hoc queries: Hive jobs, sampling
GOAL: Provide fast response times for small jobs and guaranteed service levels for production jobs
Cloud Computing Scheduling
- FIFO, Fair-Sharing
- Job scheduling with "constraints"
  - Dependency
  - Priority-oriented
  - Soft deadline
Hive
- Developed at Facebook
- Used for the majority of Facebook jobs
- "Relational database" built on Hadoop
  - Maintains a list of table schemas
  - SQL-like query language (HQL)
  - Can call Hadoop Streaming scripts from HQL
  - Supports table partitioning, clustering, complex data types, some optimizations
Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Partitioning breaks the table into separate files for each (dt, country) pair, e.g.:
/hive/page_view/dt=2008-06-08,country=US
/hive/page_view/dt=2008-06-08,country=CA
Simple Query

Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

Hive reads only the March 2008 partitions instead of scanning the entire table.
Aggregation and Joins

Count users who visited each page, by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

Sample output: