ecs251 Operating Systems, Spring 2011: #6 Map and Reduce
Dr. S. Felix Wu, Computer Science Department, University of California, Davis
http://www.facebook.com/group.php?gid=29670204725
http://cyrus.cs.ucdavis.edu/~wu/ecs251
Programming Model
- Input: a set of key/value pairs; Output: a set of key/value pairs
- The MapReduce library asks the user to write two functions: Map and Reduce
- Map: takes an input key/value pair and produces a set of intermediate key/value pairs
- The MapReduce library groups together all intermediate values associated with the same intermediate key I
- Reduce: takes an intermediate key I and its set of values, and merges them into a smaller set of values
MapReduce: Example
- Counting the number of occurrences of each word in a large collection of documents
- Map: takes (doc name, doc contents) and emits (word, occurrence count) pairs
- Reduce: takes (word, list of counts) and emits the sum of all counts for that word
- Input and output types:
  map(k1, v1) -> list(k2, v2)
  reduce(k2, list(v2)) -> list(v2)
MapReduce: Execution (figure)
HDFS Architecture
- Client read path: (1) the client sends a filename to the NameNode; (2) the NameNode returns the block IDs and the DataNodes holding them; (3) the client reads the data directly from those DataNodes
- NameNode: maps a file to a file-id and a list of DataNodes; tracks cluster membership
- DataNode: maps a block-id to a physical location on disk
- SecondaryNameNode: periodic merge of the transaction log
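To make the read path concrete, here is a minimal client-side sketch using the Hadoop Java API (the file path is illustrative); the NameNode lookup and the DataNode reads all happen behind fs.open() and in.read():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up the cluster settings
        FileSystem fs = FileSystem.get(conf);       // client handle; talks to the NameNode
        // Steps 1-2: filename -> block IDs + DataNode locations (done internally)
        FSDataInputStream in = fs.open(new Path("/data/example.txt"));
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) > 0) {            // step 3: bytes stream from the DataNodes
            System.out.write(buf, 0, n);
        }
        in.close();
        fs.close();
    }
}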
Map and Reduce
- The ideas of Map and Reduce are 40+ years old
  - Present in all functional programming languages
  - See, e.g., APL, Lisp, and ML
- An alternate name for Map: Apply-All
- Higher-order functions
  - take function definitions as arguments, or
  - return a function as output
- Map and Reduce are higher-order functions.
GFS: Google File System
- "Failures" are the norm
- Multi-GB files are common
- Append rather than overwrite
  - Random writes are rare
- Can we relax the consistency?
Logically, a MapReduce job is specified by:
# an input reader
# a Map function
# a partition function
# a compare function
# a Reduce function
# an output writer
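Of these, the partition function is the piece users most often customize. A minimal sketch against Hadoop's old org.apache.hadoop.mapred API, mirroring the behavior of the default hash partitioner:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes each intermediate (key, value) pair to one of the R reduce tasks.
public class MyPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) { }  // no configuration needed here

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Hash the key, clear the sign bit, take it modulo the number of reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}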
Map: A Higher-Order Function
- F(x: int) returns r: int
- Let V be an array of integers.
- W = map(F, V)
  - W[i] = F(V[i]) for all i
  - i.e., apply F to every element of V
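The same idea sketched in Java with streams (F here is an example function that adds one):

import java.util.Arrays;
import java.util.function.IntUnaryOperator;

public class MapExample {
    public static void main(String[] args) {
        IntUnaryOperator f = x -> x + 1;             // F(x: int) returns x + 1
        int[] v = {1, 2, 3, 4, 5};
        int[] w = Arrays.stream(v).map(f).toArray(); // W[i] = F(V[i]) for all i
        System.out.println(Arrays.toString(w));      // prints [2, 3, 4, 5, 6]
    }
}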
Map Examples in Haskell
- map (+1) [1,2,3,4,5] == [2,3,4,5,6]
- map toLower "abcDEFG12!@#" == "abcdefg12!@#"
- map (`mod` 3) [1..10] == [1,2,0,1,2,0,1,2,0,1]
Word Count Example
- Read text files and count how often words occur
  - The input is text files
  - The output is a text file: each line contains a word, a tab, and its count
- Map: produce (word, count) pairs
- Reduce: for each word, sum up the counts (a Hadoop sketch follows the example below)
Example flow for the input "I am a tiger, you are also a tiger":
map:        (I,1) (am,1) (a,1) (tiger,1) (you,1) (are,1) (also,1) (a,1) (tiger,1)
sort/group: (a,[1,1]) (also,[1]) (am,[1]) (are,[1]) (I,[1]) (tiger,[1,1]) (you,[1])
reduce:     (a,2) (also,1) (am,1) (are,1) (I,1) (tiger,2) (you,1)
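A minimal version of this job against Hadoop's old org.apache.hadoop.mapred API, in the same style as the skeletons later in these slides; class names are illustrative:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {
    // Map: (line offset, line text) -> (word, 1) for each word in the line
    public static class WCMap extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    // Reduce: (word, [1,1,...]) -> (word, sum)
    public static class WCReduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) sum += values.next().get();
            output.collect(key, new IntWritable(sum));
        }
    }
}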
Grep Example
- Search input files for a given pattern
- Map: emits a line if the pattern is matched
- Reduce: copies results to output
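A sketch of the map side (the reduce can simply be the identity); the pattern is assumed to arrive through the job configuration under a hypothetical key:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Emits a line (with its file offset) whenever it contains the pattern.
public class GrepMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
    private String pattern;

    public void configure(JobConf job) {
        // "grep.pattern" is a hypothetical configuration key set by the driver.
        pattern = job.get("grep.pattern", "");
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        if (line.toString().contains(pattern))
            output.collect(line, offset);
    }
}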
Inverted Index Example
- Generate an inverted index of words from a given set of files
- Map: parses a document and emits (word, docId) pairs
- Reduce: takes all pairs for a given word, sorts the docId values, and emits a (word, list(docId)) pair
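A compact sketch in the same old-API style; here the document ID is assumed to be the name of the file the input split came from:

import java.io.IOException;
import java.util.Iterator;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class InvertedIndex {
    public static class IIMap extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // Use the source file's name as the document ID.
            String docId = ((FileSplit) reporter.getInputSplit()).getPath().getName();
            for (String word : value.toString().split("\\s+"))
                output.collect(new Text(word), new Text(docId));
        }
    }

    public static class IIReduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text word, Iterator<Text> docIds,
                           OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            TreeSet<String> sorted = new TreeSet<String>();  // sorts (and de-duplicates) docIds
            while (docIds.hasNext()) sorted.add(docIds.next().toString());
            output.collect(word, new Text(sorted.toString()));
        }
    }
}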
Execution on Clusters
1. Input files split (M splits)
2. Assign Master & Workers
3. Map tasks
4. Write intermediate data to disk (R regions)
5. Intermediate data read & sort
6. Reduce tasks
7. Return
Pair
- Map input: raw row data as a (key, value) pair
- Map output: a list of intermediate pairs, e.g. (key1, val), (key2, val), (key1, val), ...
- Reduce input: one selected key with all of its values, e.g. (key1, [val, ..., val])
- Reduce output: the reduced (key, value) pairs
Hadoop dataflow (figure): input in HDFS -> splits (split 0 ... split 4) -> map tasks -> sort/copy -> merge -> reduce tasks -> output in HDFS (part0, part1)
Class structure of a Hadoop program (Map function, Reduce function, and job configuration):

class MR {
  class Mapper ... { }    // Map function
  class Reducer ... { }   // Reduce function
  public static void main(String[] args) {
    JobConf conf = new JobConf(MR.class);   // job configuration
    conf.setMapperClass(Mapper.class);
    conf.setReducerClass(Reducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
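One detail the skeleton above leaves out: the job's output key/value types usually have to be declared as well. For example, assuming Text keys and IntWritable counts as in the word-count job:

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);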
class MyMap extends MapReduceBase
    implements Mapper<InputKey, InputValue, OutputKey, OutputValue> {
  // the four type parameters: input key, input value, output key, output value
  // global variables
  public void map(InputKey key, InputValue value,
                  OutputCollector<OutputKey, OutputValue> output,
                  Reporter reporter) throws IOException {
    // local variables and program
    output.collect(newKey, newValue);
  }
}
class MyRed extends MapReduceBase
    implements Reducer<InputKey, InputValue, OutputKey, OutputValue> {
  // the four type parameters: input key, input value, output key, output value
  // global variables
  public void reduce(InputKey key, Iterator<InputValue> values,
                     OutputCollector<OutputKey, OutputValue> output,
                     Reporter reporter) throws IOException {
    // local variables and program
    output.collect(newKey, newValue);
  }
}
Nutch
- Complete web search engine
  - Nutch = Crawler + Indexer/Searcher (Lucene) + GUI
    - + Plugins
    - + MapReduce & Distributed FS (Hadoop)
- Java based, open source, many customizable scripts available at http://lucene.apache.org/nutch/
- Features:
  - Customizable
  - Extensible (e.g., extend to Solr for enhanced portability)
Data Structures used by Nutch
- Web Database or WebDB
  - Mirrors the properties/structure of the web graph being crawled
- Segment
  - Intermediate index
  - Contains pages fetched in a single run
- Index
  - Final inverted index obtained by "merging" segments (Lucene)
WebDB
- Customized graph database
- Used by the Crawler only
- Persistent storage for "pages" & "links"
  - Page DB: indexed by URL and hash; contains content, outlinks, fetch information & score
  - Link DB: contains "source to target" links, anchor text
Crawling
- Cyclic process (sketched below):
  - the crawler generates a set of fetchlists from the WebDB
  - fetchers download the content from the Web
  - the crawler updates the WebDB with new links that were found
  - the crawler then generates a new set of fetchlists
  - repeat until you reach the "depth"
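A sketch of this loop in Java-flavored pseudocode; WebDB, Fetcher, FetchList, and Page are hypothetical stand-ins for the corresponding Nutch components, not real API names:

// Hypothetical sketch: generate/fetch/update until the requested depth.
for (int round = 0; round < depth; round++) {
    List<FetchList> fetchlists = webdb.generateFetchLists(); // from the WebDB
    List<Page> pages = fetcher.download(fetchlists);         // fetch content from the Web
    webdb.update(pages);                                     // record newly found links
}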
Indexing
- Iterate through all k page sets in parallel, constructing the inverted index
- Creates a "searchable document" of:
  - URL text
  - Content text
  - Incoming anchor text
- Other content types might have different document fields
  - E.g., email has sender/receiver
  - Any searchable field the end-user will want
- Uses the Lucene text indexer
Lucene
- Open source search project
  - http://lucene.apache.org
- Index & search local files
  - Download lucene-2.2.0.tar.gz from http://www.apache.org/dyn/closer.cgi/lucene/java/
  - Extract files
  - Build an index for a directory:
    java org.apache.lucene.demo.IndexFiles dir_path
  - Try search at the command line:
    java org.apache.lucene.demo.SearchFiles
Lucene's Open Architecture (figure):
- Crawling: FS Crawler / Larm gather content from the file system, WWW, or an IMAP server
- Parsing: PDF / HTML / DOC / TXT parsers turn raw files into Lucene Documents
- Indexing: an Analyzer (Stop, CN/DE, Standard) feeds the indexer, which builds the Index
- Searching: the searcher answers queries against the Index
Index structure: an Index contains Documents; each Document contains Fields, and each Field is a (name, value) pair.
Create an Analyzer
- WhitespaceAnalyzer
  - divides text at whitespace
- SimpleAnalyzer
  - divides text at non-letters
  - converts to lower case
- StopAnalyzer
  - SimpleAnalyzer, plus
  - removes stop words
- StandardAnalyzer
  - good for most European languages
  - removes stop words
  - converts to lower case
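A minimal index-then-search sketch against the Lucene 2.2 API referenced above (the index directory name and field contents are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class LuceneExample {
    public static void main(String[] args) throws Exception {
        // Index: the analyzer tokenizes field text; 'true' creates a fresh index.
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("contents", "Penn State Football ... football",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // Search: parse a query against the "contents" field and print hits.
        IndexSearcher searcher = new IndexSearcher("index");
        Query q = new QueryParser("contents", new StandardAnalyzer()).parse("football");
        Hits hits = searcher.search(q);
        for (int i = 0; i < hits.length(); i++)
            System.out.println(hits.doc(i).get("contents"));
        searcher.close();
    }
}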
Inverted Index (Inverted File)
Doc 1: Penn State Football ... football
Doc 2: Football players ... State

Posting Table:
id  word      doc    offset
1   football  Doc 1  3, 67
              Doc 2  1
2   penn      Doc 1  1
3   players   Doc 2  2
4   state     Doc 1  2
              Doc 2  13
Query path (figure): a Query first consults the Term Info Index (in memory, constant time), which directs constant-time random file accesses to the Term Dictionary, the Frequency File, and the Position File; Field Info is kept in memory.
Map/Reduce Cluster Implementation (figure: input files -> splits (split 0 ... split 4) -> M map tasks -> intermediate files -> R reduce tasks -> output files (Output 0, Output 1))
- Several map or reduce tasks can run on a single computer
- Each intermediate file is divided into R partitions by the partitioning function
- Each reduce task corresponds to one partition
Execution (figure)
Hadoop Usage at Facebook
- Data warehouse running Hive
- 600 machines, 4800 cores, 2.4 PB disk
- 3200 jobs per day
- 50+ engineers have used Hadoop
Facebook Data Pipeline (figure): Web Servers -> Scribe Servers -> Network Storage -> Hadoop Cluster -> Oracle RAC / MySQL; Analysts issue Hive Queries against the Hadoop cluster, which produces the Summaries
Facebook Job Types
- Production jobs: load data, compute statistics, detect spam, etc.
- Long experiments: machine learning, etc.
- Small ad-hoc queries: Hive jobs, sampling
GOAL: Provide fast response times for small jobs and guaranteed service levels for production jobs
Cloud Computing Scheduling
- FIFO, Fair-Sharing
- Job scheduling with "constraints"
  - Dependency
  - Priority-oriented
  - Soft deadline
Hive
- Developed at Facebook
- Used for the majority of Facebook jobs
- "Relational database" built on Hadoop
  - Maintains a list of table schemas
  - SQL-like query language (HQL)
  - Can call Hadoop Streaming scripts from HQL
  - Supports table partitioning, clustering, complex data types, some optimizations
Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Partitioning breaks the table into separate files for each (dt, country) pair, e.g.:
/hive/page_view/dt=2008-06-08,country=US
/hive/page_view/dt=2008-06-08,country=CA
Simple Query

Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

Hive reads only the March 2008 partitions instead of scanning the entire table.
Aggregation and Joins

Count users who visited each page, by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

Sample output: