Tutorial for MapReduce (Hadoop) & Large Scale Processing Le Zhao (LTI, SCS, CMU) Database Seminar & Large Scale Seminar 2010-Feb-15 Some slides adapted from IR course lectures by Jamie Callan © 2010, Le Zhao 1
Outline Why MapReduce (Hadoop) MapReduce basics The MapReduce way of thinking Manipulating large data © 2010, Le Zhao 2
Outline Why MapReduce (Hadoop) –Why go large scale –Compared to other parallel computing models –Hadoop related tools MapReduce basics The MapReduce way of thinking Manipulating large data © 2010, Le Zhao 3
Why NOT to do parallel computing Concerns: a parallel system needs to provide: –Data distribution –Computation distribution –Fault tolerance –Job scheduling © 2010, Le Zhao 4
Why MapReduce (Hadoop) Previous parallel computation models –1) scp + ssh »Manual everything –2) network cross-mounted disks + Condor/Torque »No data distribution; disk access is the bottleneck »Can only partition totally distributive computation »No fault tolerance »Prioritized job scheduling © 2010, Le Zhao 5
Hadoop Parallel batch computation –Data distribution »Hadoop Distributed File System (HDFS) »Like a Linux FS, but with automatic data replication –Computation distribution »Automatic; users only need to specify the number of input splits »Can distribute aggregation computations as well –Fault tolerance »Automatic recovery from failure »Speculative execution (a backup task) –Job scheduling »OK, but still relies on the politeness of users © 2010, Le Zhao 6
How you can use Hadoop Hadoop Streaming –Quick hacking – much like shell scripting »Uses STDIN & STDOUT to carry data »cat file | mapper | sort | reducer > output –Easy to reuse legacy code; works with any programming language Hadoop Java API –Build large systems »More data types »More control over Hadoop’s behavior »Easier debugging with Java’s error stacktrace display –NetBeans plugin for Hadoop provides easy programming »http://hadoopstudio.org/docs.html © 2010, Le Zhao 7
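A minimal sketch of a Streaming-style word-count job in Python. The mapper and reducer names and the tab-separated line format are illustrative (not Hadoop's API), but the shape matches the `cat file | mapper | sort | reducer` pipeline above: both stages read lines and write lines, and the sort between them stands in for the shuffle.

```python
def mapper(lines):
    # Map stage: emit one "word<TAB>1" line per token.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Reduce stage: sorted input groups identical keys together,
    # so we can sum each run of counts with one pass.
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"
```

Locally, `list(reducer(sorted(mapper(open("file")))))` reproduces the shell pipeline; on a cluster, Hadoop Streaming would run the two stages as separate processes with the shuffle in between.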
Outline Why MapReduce (Hadoop) MapReduce basics The MapReduce way of thinking Manipulating large data © 2010, Le Zhao 8
© 2009, Jamie Callan 9 Map and Reduce MapReduce is a new use of an old idea in Computer Science Map: Apply a function to every object in a list –Each object is independent »Order is unimportant »Maps can be done in parallel –The function produces a result Reduce: Combine the results to produce a final result You may have seen this in a Lisp or functional programming course
© 2010, Jamie Callan 10 MapReduce Input reader –Divide input into splits, assign each split to a Map processor Map –Apply the Map function to each record in the split –Each Map function returns a list of (key, value) pairs Shuffle/Partition and Sort –Shuffle distributes sorting & aggregation to many reducers –All records for key k are directed to the same reduce processor –Sort groups the same keys together, and prepares for aggregation Reduce –Apply the Reduce function to each key –The result of the Reduce function is a list of (key, value) pairs
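The four stages above can be simulated in a few lines of Python; `run_mapreduce` and its arguments are made-up names for illustration, not Hadoop's API. The in-memory sort plays the role of shuffle/partition and sort, bringing all values for one key to one reduce call.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    # Map: apply map_fn to every input record;
    # each call yields a list of (key, value) pairs.
    pairs = [kv for rec in records for kv in map_fn(rec)]
    # Shuffle/Sort: bring all values for the same key together.
    pairs.sort(key=itemgetter(0))
    # Reduce: apply reduce_fn to each (key, [values]) group.
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        out.extend(reduce_fn(key, [v for _, v in group]))
    return out
```

Word count, for instance, is `run_mapreduce(lines, lambda l: [(w, 1) for w in l.split()], lambda k, vs: [(k, sum(vs))])`.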
MapReduce in One Picture © 2010, Le Zhao 11 Tom White, Hadoop: The Definitive Guide
Outline Why MapReduce (Hadoop) MapReduce basics The MapReduce way of thinking –Two simple use cases –Two more advanced & useful MapReduce tricks –Two MapReduce applications Manipulating large data © 2010, Le Zhao 12
MapReduce Use Case (1) – Map Only Data distributive tasks – Map Only E.g. classify individual documents Map does everything –Input: (docno, doc_content), … –Output: (docno, [class, class, …]), … No reduce © 2010, Le Zhao 13
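A sketch of the map-only pattern, assuming a stand-in `classify` function (the rule shown is purely hypothetical). Each (docno, doc_content) record is handled independently, which is exactly why no reduce phase is needed.

```python
def classify(doc_content):
    # Hypothetical one-rule "classifier", for illustration only;
    # a real job would call an actual model here.
    return ["sports"] if "game" in doc_content else ["other"]

def map_only(records):
    # Map does everything: (docno, doc_content) -> (docno, [class, ...]).
    # Records are independent, so this parallelizes trivially.
    return [(docno, classify(content)) for docno, content in records]
```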
MapReduce Use Case (2) – Filtering and Accumulation Filtering & Accumulation – Map and Reduce E.g. Counting total enrollments of two given classes Map selects records and outputs initial counts –In: (Jamie, 11741), (Tom, 11493), … –Out: (11741, 1), (11493, 1), … Shuffle/Partition by class_id Sort –In: (11741, 1), (11493, 1), (11741, 1), … –Out: (11493, 1), …, (11741, 1), (11741, 1), … Reduce accumulates counts –In: (11493, [1, 1, …]), (11741, [1, 1, …]) –Sum and Output: (11493, 16), (11741, 35) © 2010, Le Zhao 14
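The filtering-and-accumulation flow above, collapsed into one in-memory function (the (student, class_id) records in the test are invented to match the slide's example). The filter is the map side's record selection; the `Counter` plays the reduce side's summation.

```python
from collections import Counter

def count_enrollments(records, wanted):
    # Map: select records whose class_id is wanted, emit (class_id, 1);
    # Reduce: accumulate the 1s per class_id.
    return dict(Counter(class_id for _, class_id in records
                        if class_id in wanted))
```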
MapReduce Use Case (3) – Database Join Problem: Massive lookups –Given two large lists: (URL, ID) and (URL, doc_content) pairs –Produce (ID, doc_content) Solution: Database join Input stream: both (URL, ID) and (URL, doc_content) lists –(http://del.icio.us/post, 0), (http://digg.com/submit, 1), … –(http://del.icio.us/post, html0), (http://digg.com/submit, html1), … Map simply passes input along, Shuffle and Sort on URL (group ID & doc_content for the same URL together) –Out: (http://del.icio.us/post, 0), (http://del.icio.us/post, html0), (http://digg.com/submit, html1), (http://digg.com/submit, 1), … Reduce outputs result stream of (ID, doc_content) pairs –In: (http://del.icio.us/post, [0, html0]), (http://digg.com/submit, [html1, 1]), … –Out: (0, html0), (1, html1), … © 2010, Le Zhao 15
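The reduce-side join above can be simulated as follows; the function name and the dict-based grouping are illustrative, standing in for the shuffle that routes both streams' records for the same URL to one reducer.

```python
from collections import defaultdict

def join_on_url(ids, docs):
    # "Shuffle": group both input streams by their shared key, the URL.
    by_url = defaultdict(dict)
    for url, doc_id in ids:
        by_url[url]["id"] = doc_id
    for url, content in docs:
        by_url[url]["content"] = content
    # "Reduce": emit (ID, doc_content) for every URL seen in both streams.
    return [(v["id"], v["content"]) for v in by_url.values() if len(v) == 2]
```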
MapReduce Use Case (4) – Secondary Sort Problem: Sorting on values E.g. Reverse graph edge directions & output in node order –Input: adjacency list of graph (3 nodes and 4 edges) In: (3, [1, 2]), (1, [2, 3]); Out: (1, [3]), (2, [1, 3]), (3, [1]) Note, the node_ids in the output values are also sorted. But Hadoop only sorts on keys! Solution: Secondary sort Map –In: (3, [1, 2]), (1, [2, 3]) –Intermediate: (1, [3]), (2, [3]), (2, [1]), (3, [1]) (reverse edge direction) –Out: ([1, 3], [3]), ([2, 3], [3]), ([2, 1], [1]), ([3, 1], [1]) –Copy node_ids from value to key. © 2010, Le Zhao 16
MapReduce Use Case (4) – Secondary Sort Secondary Sort (ctd.) Shuffle on Key.field1, and Sort on whole Key (both fields) –In: ([1, 3], [3]), ([2, 3], [3]), ([2, 1], [1]), ([3, 1], [1]) –Out: ([1, 3], [3]), ([2, 1], [1]), ([2, 3], [3]), ([3, 1], [1]) Grouping comparator –Merge according to part of the key (Key.field1) –Out: ([1, 3], [3]), ([2, 1], [1, 3]), ([3, 1], [1]) this will be the reducer’s input Reduce –Merge & output: (1, [3]), (2, [1, 3]), (3, [1]) © 2010, Le Zhao 17
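The whole secondary-sort trick, simulated in memory (the function name is made up). The source node is copied from the value into a composite (dest, src) key; sorting on the whole composite key orders the values, and grouping on the first field alone merges them back, which is what Hadoop's partitioner plus grouping comparator achieve.

```python
from itertools import groupby

def reverse_edges(adjacency):
    # Map: reverse each edge; the composite key (dst, src) carries the
    # value (src) inside the key so the framework can sort on it.
    pairs = [(dst, src) for src, dsts in adjacency for dst in dsts]
    # Sort on the whole composite key: by dst first, then by src.
    pairs.sort()
    # Grouping comparator: merge runs that share the first key field.
    return [(dst, [src for _, src in grp])
            for dst, grp in groupby(pairs, key=lambda kv: kv[0])]
```

On the slide's graph this yields (1, [3]), (2, [1, 3]), (3, [1]), with the values already sorted.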
Using MapReduce to Construct Indexes: Preliminaries Construction of binary inverted lists Input: documents: (docid, [term, term..]), (docid, [term,..]),.. Output: (term, [docid, docid, …]) –E.g., (apple, [1, 23, 49, 127, …]) Binary inverted lists fit on a slide more easily Everything also applies to frequency and positional inverted lists A document id is an internal document id, e.g., a unique integer Not an external document id such as a url MapReduce elements Combiner, Secondary Sort, complex keys, Sorting on keys’ fields © 2010, Jamie Callan 18
Using MapReduce to Construct Indexes: A Simple Approach A simple approach to creating binary inverted lists Each Map task is a document parser –Input: A stream of documents –Output: A stream of (term, docid) tuples »(long, 1) (ago, 1) (and, 1) … (once, 2) (upon, 2) … Shuffle sorts tuples by key and routes tuples to Reducers Reducers convert streams of keys into streams of inverted lists –Input:(long, 1) (long, 127) (long, 49) (long, 23) … –The reducer sorts the values for a key and builds an inverted list »Longest inverted list must fit in memory –Output: (long, [df:492, docids:1, 23, 49, 127, …]) © 2010, Jamie Callan 19
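The simple approach above, sketched in Python (names are illustrative). The mapper emits one (term, docid) tuple per token, the sort stands in for the shuffle routing all postings for a term to one reducer, and the reducer sorts docids into an inverted list, which is exactly the step that requires the longest list to fit in memory.

```python
from itertools import groupby
from operator import itemgetter

def build_index(docs):
    # Map: each document parser emits (term, docid) tuples.
    postings = [(term, docid) for docid, text in docs for term in text.split()]
    # Shuffle: sort by term so each term's postings are contiguous.
    postings.sort(key=itemgetter(0))
    # Reduce: sort the docids for each term and record the df.
    index = {}
    for term, grp in groupby(postings, key=itemgetter(0)):
        docids = sorted({d for _, d in grp})
        index[term] = {"df": len(docids), "docids": docids}
    return index
```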
Using MapReduce to Construct Indexes: A Simple Approach A more succinct representation of the previous algorithm Map: (docid1, content1) → (t1, docid1), (t2, docid1), … Shuffle by t; Sort by t: (t5, docid1), (t4, docid3), … → (t4, docid3), (t4, docid1), (t5, docid1), … Reduce: (t4, [docid3, docid1, …]) → (t, ilist) docid: a unique integer; t: a term, e.g., “apple”; ilist: a complete inverted list But: a) inefficient, b) docids are sorted in reducers, and c) assumes the ilist of a word fits in memory © 2010, Jamie Callan 20
Using MapReduce to Construct Indexes: Using Combine Map: (docid1, content1) → (t1, ilist1,1), (t2, ilist2,1), (t3, ilist3,1), … –Each output inverted list covers just one document Combine: sort by t, then (t1, [ilist1,2, ilist1,3, ilist1,1, …]) → (t1, ilist1,27) –Each output inverted list covers a sequence of documents Shuffle by t; Sort by t: (t4, ilist4,1), (t5, ilist5,3), … → (t4, ilist4,2), (t4, ilist4,4), (t4, ilist4,1), … Reduce: (t7, [ilist7,2, ilist7,1, ilist7,4, …]) → (t7, ilistfinal) ilisti,j: the j’th inverted list fragment for term i © 2010, Jamie Callan 21
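A minimal sketch of the combine step, assuming fragments are plain sorted docid lists (the function name is made up). Within one map task, single-document lists for the same term are merged into one fragment before the shuffle, so far less data crosses the network.

```python
from collections import defaultdict

def combine(map_output):
    # map_output: (term, [docid]) pairs, one tiny list per document.
    fragments = defaultdict(list)
    for term, ilist in map_output:
        fragments[term].extend(ilist)   # merge fragments for the same term
    # One fragment per term leaves this map task, docids kept sorted.
    return {t: sorted(ids) for t, ids in fragments.items()}
```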
© 2010, Jamie Callan 22 Using MapReduce to Construct Indexes [Diagram: Documents → Parser/Indexer tasks (Map/Combine processors) → Inverted list fragments → Shuffle/Sort → Merger tasks (Reduce processors) → Inverted lists, partitioned by term range: A-F, G-P, Q-Z]
Using MapReduce to Construct Partitioned Indexes Map: (docid1, content1) → ([p, t1], ilist1,1) Combine to sort and group values: ([p, t1], [ilist1,2, ilist1,3, ilist1,1, …]) → ([p, t1], ilist1,27) Shuffle by p; Sort values by [p, t] Reduce: ([p, t7], [ilist7,2, ilist7,1, ilist7,4, …]) → ([p, t7], ilistfinal) p: partition (shard) id © 2010, Jamie Callan 23
Using MapReduce to Construct Indexes: Secondary Sort So far, we have assumed that Reduce can sort values in memory … but what if there are too many to fit in memory? Map: (docid1, content1) → ([t1, fd1,1], ilist1,1) Combine to sort and group values Shuffle by t; Sort by [t, fd], then Group by t (Secondary Sort): ([t7, fd7,2], ilist7,2), ([t7, fd7,1], ilist7,1), … → (t7, [ilist7,1, ilist7,2, …]) Reduce: (t7, [ilist7,1, ilist7,2, …]) → (t7, ilistfinal) Values arrive in order, so Reduce can stream its output fdi,j is the first docid in ilisti,j © 2010, Jamie Callan 24
Using MapReduce to Construct Indexes: Putting it All Together Map: (docid1, content1) → ([p, t1, fd1,1], ilist1,1) Combine to sort and group values: ([p, t1, fd1,1], [ilist1,2, ilist1,3, ilist1,1, …]) → ([p, t1, fd1,27], ilist1,27) Shuffle by p; Secondary Sort by [(p, t), fd]: ([p, t7], [ilist7,2, ilist7,1, ilist7,4, …]) → ([p, t7], [ilist7,1, ilist7,2, ilist7,4, …]) Reduce: ([p, t7], [ilist7,1, ilist7,2, ilist7,4, …]) → ([p, t7], ilistfinal) © 2010, Jamie Callan 25
© 2010, Jamie Callan 26 Using MapReduce to Construct Indexes [Diagram: Documents → Parser/Indexer tasks (Map/Combine processors) → Inverted list fragments → Shuffle/Sort → Merger tasks (Reduce processors) → Inverted lists, one shard per reducer]
PageRank Calculation: Preliminaries One PageRank iteration: Input: –(id1, [score1(t), out11, out12, …]), (id2, [score2(t), out21, out22, …]), … Output: –(id1, [score1(t+1), out11, out12, …]), (id2, [score2(t+1), out21, out22, …]), … MapReduce elements Score distribution and accumulation Database join Side-effect files © 2010, Jamie Callan 27
PageRank: Score Distribution and Accumulation Map –In: (id1, [score1(t), out11, out12, …]), (id2, [score2(t), out21, out22, …]), … –Out: (out11, score1(t)/n1), (out12, score1(t)/n1), …, (out21, score2(t)/n2), … (ni = number of outlinks of node i) Shuffle & Sort by node_id –In: (id2, score1), (id1, score2), (id1, score1), … –Out: (id1, score1), (id1, score2), …, (id2, score1), … Reduce –In: (id1, [score1, score2, …]), (id2, [score1, …]), … –Out: (id1, score1(t+1)), (id2, score2(t+1)), … © 2010, Jamie Callan 28
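One score-distribution pass of the iteration above, simulated in memory (the function name and dict representation are illustrative; the damping factor and dangling-node handling from the later slides are omitted here for brevity). Each node's map output gives score/n to every outlink, and the reduce side sums the incoming shares per node.

```python
from collections import defaultdict

def pagerank_step(graph):
    # graph: {node_id: (score, [outlink_ids])}
    incoming = defaultdict(float)
    for node, (score, outs) in graph.items():
        for out in outs:
            incoming[out] += score / len(outs)  # Map: emit (out, score/n)
    return dict(incoming)                       # Reduce: sum shares per node
```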
PageRank: Database Join to associate outlinks with score Map –In & Out: (id1, score1(t+1)), (id2, score2(t+1)), …, (id1, [out11, out12, …]), (id2, [out21, out22, …]), … Shuffle & Sort by node_id –Out: (id1, score1(t+1)), (id1, [out11, out12, …]), (id2, [out21, out22, …]), (id2, score2(t+1)), … Reduce –In: (id1, [score1(t+1), out11, out12, …]), (id2, [out21, out22, …, score2(t+1)]), … –Out: (id1, [score1(t+1), out11, out12, …]), (id2, [score2(t+1), out21, out22, …]), … © 2010, Jamie Callan 29
PageRank: Side-Effect Files for dangling nodes Dangling nodes –Nodes with no outlinks (observed but not crawled URLs) –Their score has no outlet »needs to be distributed to all graph nodes evenly Map for dangling nodes: –In: …, (id3, [score3]), … –Out: …, ("*", 0.85×score3), … Reduce –In: …, ("*", [score1, score2, …]), … –Out: …, everything else, … –Output to side-effect file: ("*", score), fed to the Mapper of the next iteration © 2010, Jamie Callan 30
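A small sketch of the dangling-node accounting above (names are hypothetical; the 0.85 damping factor follows the slide). The reducer under key "*" sums the dangling scores, and the next iteration adds an even share to every one of the N nodes.

```python
def dangling_share(scores, outdegree, n_nodes):
    # Sum the damped score of every node with no outlinks; this is the
    # value written to the side-effect file under the key "*".
    mass = 0.85 * sum(s for node, s in scores.items()
                      if outdegree.get(node, 0) == 0)
    # Even share added to each of the n_nodes in the next iteration.
    return mass / n_nodes
```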
Outline Why MapReduce (Hadoop) MapReduce basics The MapReduce way of thinking Manipulating large data © 2010, Le Zhao 31
Manipulating Large Data Do everything in Hadoop (and HDFS) –Make sure every step is parallelized! –Any serial step breaks your design E.g. storing the URL list for a Web graph –Each node in the Web graph has an id –[URL1, URL2, …], using the line number as the id – a serial bottleneck –[(id1, URL1), (id2, URL2), …], explicit ids – fully parallel © 2010, Le Zhao 32
Hadoop based Tools For developing in Java, NetBeans plugin –http://www.hadoopstudio.org/docs.html Pig Latin, a SQL-like high-level data processing script language Hive, data warehouse with SQL queries Cascading, data processing Mahout, machine learning algorithms on Hadoop HBase, distributed data store as a large table More –http://hadoop.apache.org/ –http://en.wikipedia.org/wiki/Hadoop –Many other toolkits: Nutch, Cloud9, Ivory © 2010, Le Zhao 33
Get Your Hands Dirty Hadoop Virtual Machine –http://www.cloudera.com/developers/downloads/virtual-machine/ »This runs Hadoop 0.20 –An earlier Hadoop 0.18.0 version is here: http://code.google.com/edu/parallel/tools/hadoopvm/index.html Amazon EC2 Various other Hadoop clusters around The NetBeans plugin simulates Hadoop –The workflow view works on Windows –Local running & debugging works on MacOS and Linux –http://www.hadoopstudio.org/docs.html © 2010, Le Zhao 34
Conclusions Why large scale MapReduce advantages Hadoop uses Use cases –Map only: for totally distributive computation –Map+Reduce: for filtering & aggregation –Database join: for massive dictionary lookups –Secondary sort: for sorting on values –Inverted indexing: combiner, complex keys –PageRank: side effect files Large data © 2010, Jamie Callan 35
© 2010, Jamie Callan 36 For More Information
L. A. Barroso, J. Dean, and U. Hölzle. “Web search for a planet: The Google cluster architecture.” IEEE Micro, 2003.
J. Dean and S. Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150. 2004.
S. Ghemawat, H. Gobioff, and S.-T. Leung. “The Google File System.” Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP-03), pages 29-43. 2003.
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes. Morgan Kaufmann. 1999.
J. Zobel and A. Moffat. “Inverted files for text search engines.” ACM Computing Surveys, 38(2). 2006.
“Map/Reduce Tutorial.” http://hadoop.apache.org/common/docs/current/mapred_tutorial.html. Fetched January 21, 2010.
Tom White. Hadoop: The Definitive Guide. O’Reilly Media. June 5, 2009.
J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce. Book draft, February 7, 2010.