Tutorial for MapReduce (Hadoop) & Large Scale Processing

Le Zhao (LTI, SCS, CMU)
Database Seminar & Large Scale Seminar, 2010-Feb-15
Some slides adapted from IR course lectures by Jamie Callan
© 2010, Le Zhao

Outline
- Why MapReduce (Hadoop)
- MapReduce basics
- The MapReduce way of thinking
- Manipulating large data

Outline
- Why MapReduce (Hadoop)
  - Why go large scale
  - Compared to other parallel computing models
  - Hadoop related tools
- MapReduce basics
- The MapReduce way of thinking
- Manipulating large data
Could somebody give their own answer to why we should go large scale?

Why NOT to do parallel computing
Concerns: a parallel system needs to provide:
- Data distribution
- Computation distribution
- Fault tolerance
- Job scheduling

Why MapReduce (Hadoop)
Previous parallel computation models
1) scp + ssh
  - Manual everything
2) Network cross-mounted disks + condor/torque
  - No data distribution; disk access is the bottleneck
  - Can only partition totally distributive computation
  - No fault tolerance
  - Prioritized job scheduling

Hadoop
Parallel batch computation
- Data distribution
  - Hadoop Distributed File System (HDFS)
  - Like a Linux file system, but with automatic data replication
- Computation distribution
  - Automatic; the user only needs to specify #input_splits
  - Can distribute aggregation computations as well
- Fault tolerance
  - Automatic recovery from failure
  - Speculative execution (a backup task)
- Job scheduling
  - OK, but still relies on the politeness of users

How you can use Hadoop
- Hadoop Streaming
  - Quick hacking, much like shell scripting
  - Uses STDIN and STDOUT to carry data
  - cat file | mapper | sort | reducer > output
  - Easier to use legacy code; works with all programming languages
- Hadoop Java API
  - Build large systems
  - More data types
  - More control over Hadoop's behavior
  - Easier debugging with Java's error stack trace display
  - NetBeans plugin for Hadoop provides easy programming
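
The Streaming pipeline above can be tried locally before touching a cluster. The following is a minimal sketch, assuming a word-count task (the task, the function names, and the tab-separated line format are illustrative choices, not from the slides). In a real Streaming job the mapper and reducer would be separate scripts reading STDIN and writing STDOUT; here they are generators so the whole pipeline fits in one file.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of each run of identical keys (input must be sorted by key)."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_pipeline(lines):
    """Local stand-in for: cat file | mapper | sort | reducer"""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for out_line in run_pipeline(["a b a", "b a"]):
        print(out_line)
```

The reducer relies only on its input being sorted by key, which is what the `sort` stage (or Hadoop's shuffle) guarantees.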

Map and Reduce
MapReduce is a new use of an old idea in Computer Science
- Map: Apply a function to every object in a list
  - Each object is independent
    - Order is unimportant
    - Maps can be done in parallel
  - The function produces a result
- Reduce: Combine the results to produce a final result
You may have seen this in a Lisp or functional programming course
© 2009, Jamie Callan
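
For readers who have not seen the functional idiom, here is a small sketch using Python's built-in `map` and `functools.reduce`. The squaring function and the input list are arbitrary examples chosen for illustration.

```python
from functools import reduce


def square(x):
    """The per-object function: depends only on its own input."""
    return x * x


numbers = [1, 2, 3, 4]

# Map: apply the function to every element; elements are independent,
# so the calls could run in any order or in parallel.
mapped = list(map(square, numbers))

# Reduce: combine the mapped results into one final result.
total = reduce(lambda a, b: a + b, mapped, 0)
```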

MapReduce
- Input reader
  - Divide input into splits, assign each split to a Map processor
- Map
  - Apply the Map function to each record in the split
  - Each Map function returns a list of (key, value) pairs
- Shuffle/Partition and Sort
  - Shuffle distributes sorting & aggregation to many reducers
  - All records for key k are directed to the same reduce processor
  - Sort groups the same keys together, and prepares for aggregation
- Reduce
  - Apply the Reduce function to each key
  - The result of the Reduce function is a list of (key, value) pairs
© 2010, Jamie Callan
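
The phases above can be imitated in a few lines of single-process Python. This is only a sketch of the data flow, not how Hadoop is implemented: `run_mapreduce`, its parameters, and the hash-modulo partitioner are assumptions made for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn, num_reducers=2):
    # Map: each record produces a list of (key, value) pairs.
    mapped = [pair for record in records for pair in map_fn(record)]

    # Shuffle/Partition: all pairs with the same key go to the same reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped:
        partitions[hash(key) % num_reducers][key].append(value)

    # Sort + Reduce: each reducer processes its keys in sorted order
    # and returns a list of (key, value) pairs.
    output = []
    for partition in partitions:
        for key in sorted(partition):
            output.extend(reduce_fn(key, partition[key]))
    return output


if __name__ == "__main__":
    # Hypothetical job: count words by their length.
    result = run_mapreduce(
        ["a bb", "cc d"],
        map_fn=lambda rec: [(len(w), 1) for w in rec.split()],
        reduce_fn=lambda key, values: [(key, sum(values))],
    )
    print(sorted(result))
```

Output is ordered within each reducer but not across reducers, which mirrors the per-reducer output files of a real job.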

MapReduce in One Picture
[Figure from Tom White, Hadoop: The Definitive Guide]

MapReduce Use Case (1) – Map Only
Data distributive tasks: Map only
- E.g. classify individual documents
- Map does everything
  - Input: (docno, doc_content), …
  - Output: (docno, [class, class, …]), …
- No reduce

MapReduce Use Case (2) – Filtering and Accumulation
Filtering & Accumulation: Map and Reduce
E.g. counting total enrollments of two given classes
- Map selects records and outputs initial counts
  - In: (Jamie, 11741), (Tom, 11493), …
  - Out: (11741, 1), (11493, 1), …
- Shuffle/Partition by class_id
- Sort
  - In: (11741, 1), (11493, 1), (11741, 1), …
  - Out: (11493, 1), …, (11741, 1), (11741, 1), …
- Reduce accumulates counts
  - In: (11493, [1, 1, …]), (11741, [1, 1, …])
  - Sum and Output: (11493, 16), (11741, 35)
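
A small local sketch of this use case follows. The class ids 11741 and 11493 come from the slide; the extra students and the class id 15213 in the usage example are made up for illustration, and the sort-then-group step stands in for Hadoop's shuffle and sort.

```python
from itertools import groupby

TARGET_CLASSES = {11741, 11493}  # the two classes being counted


def map_enrollment(record):
    """Filter: emit (class_id, 1) only for records of the target classes."""
    student, class_id = record
    if class_id in TARGET_CLASSES:
        yield (class_id, 1)


def reduce_counts(class_id, counts):
    """Accumulate: sum the initial counts for one class."""
    return (class_id, sum(counts))


def count_enrollments(records):
    mapped = [pair for record in records for pair in map_enrollment(record)]
    mapped.sort(key=lambda kv: kv[0])  # stand-in for shuffle and sort by class_id
    return [
        reduce_counts(class_id, [count for _, count in group])
        for class_id, group in groupby(mapped, key=lambda kv: kv[0])
    ]


if __name__ == "__main__":
    records = [("Jamie", 11741), ("Tom", 11493), ("Ann", 11741), ("Bob", 15213)]
    print(count_enrollments(records))
```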

MapReduce Use Case (3) – Database Join
Problem: Massive lookups
- Given two large lists: (URL, ID) and (URL, doc_content) pairs
- Produce (ID, doc_content)
Solution: Database join
- Input stream: both (URL, ID) and (URL, doc_content) lists
  - (URL0, 0), (URL1, 1), …
  - (URL0, <html0>), (URL1, <html1>), …
- Map simply passes input along
- Shuffle and Sort on URL (group ID & doc_content for the same URL together)
  - Out: (URL0, 0), (URL0, <html0>), (URL1, <html1>), (URL1, 1), …
- Reduce outputs result stream of (ID, doc_content) pairs
  - In: (URL0, [0, html0]), (URL1, [html1, 1]), …
  - Out: (0, <html0>), (1, <html1>), …
- How to distinguish ID and doc_content?
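
One common answer to the closing question is to tag each value with its source list in the mapper. The sketch below assumes that approach; the tag names `"id"` and `"doc"`, the function names, and the sample URLs are hypothetical, and the dictionary grouping stands in for shuffle and sort on URL.

```python
from collections import defaultdict


def map_tag(record, source):
    """Pass the record along, tagging the value with the list it came from."""
    url, payload = record
    return (url, (source, payload))


def reduce_join(url, tagged_values):
    """Pair every ID with every doc_content seen for this URL."""
    ids = [payload for tag, payload in tagged_values if tag == "id"]
    docs = [payload for tag, payload in tagged_values if tag == "doc"]
    return [(i, d) for i in ids for d in docs]


def join(id_list, doc_list):
    groups = defaultdict(list)  # stand-in for shuffle and sort on URL
    for record in id_list:
        url, value = map_tag(record, "id")
        groups[url].append(value)
    for record in doc_list:
        url, value = map_tag(record, "doc")
        groups[url].append(value)

    output = []
    for url in sorted(groups):
        output.extend(reduce_join(url, groups[url]))
    return output


if __name__ == "__main__":
    ids = [("u0", 0), ("u1", 1)]
    docs = [("u1", "<html1>"), ("u0", "<html0>")]
    print(join(ids, docs))
```

Because values are tagged, the reducer does not depend on the order in which the ID and the document arrive.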

MapReduce Use Case (4) – Secondary Sort
Problem: Sorting on values
- E.g. reverse graph edge directions & output in node order
- Input: adjacency list of graph (3 nodes and 4 edges)
  - (3, [1, 2]), (1, [2, 3]) → (1, [3]), (2, [1, 3]), (3, [1])
- Note: the node_ids in the output values are also sorted. But Hadoop only sorts on keys!
Solution: Secondary sort
- Map
  - In: (3, [1, 2]), (1, [2, 3])
  - Intermediate: (1, [3]), (2, [3]), (2, [1]), (3, [1]) (reverse edge direction)
  - Out: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1])
  - Copy node_ids from value to key
- What a hack! It would be better if the sort could access values as well as keys.

MapReduce Use Case (4) – Secondary Sort
Secondary Sort (ctd.)
- Shuffle on Key.field1, and Sort on whole Key (both fields)
  - In: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1])
  - Out: (<1, 3>, [3]), (<2, 1>, [1]), (<2, 3>, [3]), (<3, 1>, [1])
- Grouping comparator
  - Merge according to part of the key
  - Out: (<1, 3>, [3]), (<2, 1>, [1, 3]), (<3, 1>, [1]); this will be the reducer's input
- Reduce
  - Merge & output: (1, [3]), (2, [1, 3]), (3, [1])
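
The edge-reversal example can be traced end to end with a short local sketch. It mimics the three pieces (composite key, sort on the whole key, grouping on the first field) with plain Python; in Hadoop these would be a custom partitioner, sort comparator, and grouping comparator. The function names are illustrative.

```python
from itertools import groupby


def map_reverse(node, out_links):
    """Reverse each edge and copy the value (source node) into the key."""
    for target in out_links:
        yield ((target, node), node)  # composite key <target, source>


def reverse_graph(adjacency):
    pairs = [p for node, outs in adjacency for p in map_reverse(node, outs)]

    # Sort on the whole composite key (both fields).
    pairs.sort(key=lambda kv: kv[0])

    # Grouping comparator: group on the first key field only, so the
    # reducer receives values that are already in sorted order.
    return [
        (target, [source for _, source in group])
        for target, group in groupby(pairs, key=lambda kv: kv[0][0])
    ]


if __name__ == "__main__":
    print(reverse_graph([(3, [1, 2]), (1, [2, 3])]))
```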

Using MapReduce to Construct Indexes: Preliminaries
Construction of binary inverted lists
- Input: documents: (docid, [term, term..]), (docid, [term, ..]), ..
- Output: (term, [docid, docid, …])
  - E.g., (apple, [1, 23, 49, 127, …])
- Binary inverted lists fit on a slide more easily
  - Everything also applies to frequency and positional inverted lists
- A document id is an internal document id, e.g., a unique integer
  - Not an external document id such as a URL
MapReduce elements
- Combiner, Secondary Sort, complex keys, Sorting on keys' fields

Using MapReduce to Construct Indexes: A Simple Approach
A simple approach to creating binary inverted lists
- Each Map task is a document parser
  - Input: A stream of documents
  - Output: A stream of (term, docid) tuples
    - (long, 1) (ago, 1) (and, 1) … (once, 2) (upon, 2) …
- Shuffle sorts tuples by key and routes tuples to Reducers
- Reducers convert streams of keys into streams of inverted lists
  - Input: (long, 1) (long, 127) (long, 49) (long, 23) …
  - The reducer sorts the values for a key and builds an inverted list
    - Longest inverted list must fit in memory
  - Output: (long, [df:492, docids:1, 23, 49, 127, …])

Using MapReduce to Construct Indexes: A Simple Approach
A more succinct representation of the previous algorithm
- Map: (docid1, content1) → (t1, docid1) (t2, docid1) …
- Shuffle by t
- Sort by t: (t5, docid1) (t4, docid3) … → (t4, docid3) (t4, docid1) (t5, docid1) …
- Reduce: (t4, [docid3 docid1 …]) → (t, ilist)
docid: a unique integer
t: a term, e.g., “apple”
ilist: a complete inverted list
But this is a) inefficient, b) sorts docids in the reducers, and c) assumes the ilist of a word fits in memory
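
As a rough illustration of the simple approach, the sketch below builds binary inverted lists in memory. Whitespace tokenization, lowercasing, and the dictionary output format are assumptions for the example; the in-memory `sorted` call in the reducer is exactly limitation (c) above.

```python
from collections import defaultdict


def map_doc(docid, content):
    """Document parser: emit one (term, docid) tuple per distinct term."""
    for term in set(content.lower().split()):
        yield (term, docid)


def reduce_term(term, docids):
    """Sort the docids in memory and build the inverted list for one term."""
    postings = sorted(docids)
    return (term, {"df": len(postings), "docids": postings})


def build_index(docs):
    grouped = defaultdict(list)  # stand-in for shuffle by term
    for docid, content in docs:
        for term, d in map_doc(docid, content):
            grouped[term].append(d)
    return dict(reduce_term(term, grouped[term]) for term in sorted(grouped))


if __name__ == "__main__":
    index = build_index([(2, "once upon a time"), (1, "long ago and once")])
    print(index["once"])
```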

Using MapReduce to Construct Indexes: Using Combine
- Map: (docid1, content1) → (t1, ilist_{1,1}) (t2, ilist_{2,1}) (t3, ilist_{3,1}) …
  - Each output inverted list covers just one document
- Combine
  - Sort by t
  - Combine: (t1, [ilist_{1,2} ilist_{1,3} ilist_{1,1} …]) → (t1, ilist_{1,27})
  - Each output inverted list covers a sequence of documents
- Shuffle by t
  - (t4, ilist_{4,1}) (t5, ilist_{5,3}) … → (t4, ilist_{4,2}) (t4, ilist_{4,4}) (t4, ilist_{4,1}) …
- Reduce: (t7, [ilist_{7,2}, ilist_{7,1}, ilist_{7,4}, …]) → (t7, ilist_final)
ilist_{i,j}: the j'th inverted list fragment for term i
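
The effect of the combiner can be seen in a small local sketch: each map task merges its own single-document lists into larger fragments before anything is shuffled. The function names and the representation of a fragment as a sorted list of docids are assumptions for illustration.

```python
from collections import defaultdict
from heapq import merge


def map_doc(docid, content):
    """Emit one single-document inverted list per distinct term."""
    for term in set(content.split()):
        yield (term, [docid])


def combine(pairs):
    """Run on one map task's output: merge its fragments per term locally."""
    local = defaultdict(list)
    for term, fragment in pairs:
        local[term].extend(fragment)
    return [(term, sorted(fragment)) for term, fragment in sorted(local.items())]


def reduce_term(term, fragments):
    """Merge the already-sorted fragments into the final inverted list."""
    return (term, list(merge(*fragments)))


def build_index(map_task_inputs):
    shuffled = defaultdict(list)  # stand-in for shuffle by term
    for docs in map_task_inputs:  # one list of documents per map task
        pairs = [p for docid, content in docs for p in map_doc(docid, content)]
        for term, fragment in combine(pairs):
            shuffled[term].append(fragment)
    return [reduce_term(term, shuffled[term]) for term in sorted(shuffled)]


if __name__ == "__main__":
    print(build_index([[(1, "a b"), (3, "a")], [(2, "a c")]]))
```

With the combiner, the shuffle moves one fragment per term per map task instead of one tuple per term per document.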

Using MapReduce to Construct Indexes
[Figure: Documents are parsed by Parser/Indexer processors (Map/Combine), which emit inverted list fragments; Shuffle/Sort routes the fragments by term range (A-F, G-P, …, Q-Z) to Merger processors (Reduce), which produce the inverted lists.]

Using MapReduce to Construct Partitioned Indexes
- Map: (docid1, content1) → ([p, t1], ilist_{1,1})
- Combine to sort and group values
  - ([p, t1], [ilist_{1,2} ilist_{1,3} ilist_{1,1} …]) → ([p, t1], ilist_{1,27})
- Shuffle by p
- Sort values by [p, t]
- Reduce: ([p, t7], [ilist_{7,2}, ilist_{7,1}, ilist_{7,4}, …]) → ([p, t7], ilist_final)
p: partition (shard) id

Using MapReduce to Construct Indexes: Secondary Sort
So far, we have assumed that Reduce can sort values in memory
…but what if there are too many to fit in memory?
- Map: (docid1, content1) → ([t1, fd_{1,1}], ilist_{1,1})
- Combine to sort and group values
- Shuffle by t
- Sort by [t, fd], then Group by t (Secondary Sort)
  - ([t7, fd_{7,2}], ilist_{7,2}), ([t7, fd_{7,1}], ilist_{7,1}) … → (t7, [ilist_{7,1}, ilist_{7,2}, …])
- Reduce: (t7, [ilist_{7,1}, ilist_{7,2}, …]) → (t7, ilist_final)
  - Values arrive in order, so Reduce can stream its output
fd_{i,j} is the first docid in ilist_{i,j}
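
The following sketch shows why ordered arrival lets the reducer stream. It assumes each fragment is a sorted list of docids and that fragments for a term cover non-overlapping docid ranges; `stream_index` and the sample data are hypothetical, and the `sorted` call stands in for the framework's sort on [t, fd].

```python
from itertools import groupby


def stream_index(fragments):
    """fragments: ((term, first_docid), ilist) pairs from Map/Combine.

    Yields (term, docid) postings one at a time, in final order,
    without ever sorting or holding a whole inverted list in the reducer.
    """
    # Stand-in for the framework: sort on the whole key [t, fd].
    ordered = sorted(fragments, key=lambda kv: kv[0])

    # Grouping on t only: each group is one reducer call.
    for term, group in groupby(ordered, key=lambda kv: kv[0][0]):
        for _, ilist in group:       # fragments arrive ordered by first docid
            for docid in ilist:      # so postings can be written as they come
                yield (term, docid)


if __name__ == "__main__":
    sample = [(("t7", 5), [5, 9]), (("t7", 1), [1, 3]), (("t4", 2), [2])]
    print(list(stream_index(sample)))
```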

Using MapReduce to Construct Indexes: Putting it All Together
- Map: (docid1, content1) → ([p, t1, fd_{1,1}], ilist_{1,1})
- Combine to sort and group values
  - ([p, t1, fd_{1,1}], [ilist_{1,2} ilist_{1,3} ilist_{1,1} …]) → ([p, t1, fd_{1,27}], ilist_{1,27})
- Shuffle by p
- Secondary Sort by [(p, t), fd]
  - ([p, t7], [ilist_{7,2}, ilist_{7,1}, ilist_{7,4}, …]) → ([p, t7], [ilist_{7,1}, ilist_{7,2}, ilist_{7,4}, …])
- Reduce: ([p, t7], [ilist_{7,1}, ilist_{7,2}, ilist_{7,4}, …]) → ([p, t7], ilist_final)

Using MapReduce to Construct Indexes
[Figure: Documents are parsed by Parser/Indexer processors (Map/Combine), which emit inverted list fragments; Shuffle/Sort routes the fragments by shard to Merger processors (Reduce), one per shard, which produce the inverted lists.]

PageRank Calculation: Preliminaries
One PageRank iteration:
- Input: (id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]) ..
- Output: (id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..]) ..
MapReduce elements
- Score distribution and accumulation
- Database join
- Side-effect files

PageRank: Score Distribution and Accumulation
- Map
  - In: (id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]) ..
  - Out: (out11, score1(t)/n1), (out12, score1(t)/n1) .., (out21, score2(t)/n2), ..
- Shuffle & Sort by node_id
  - In: (id2, score1), (id1, score2), (id1, score1), ..
  - Out: (id1, score1), (id1, score2), .., (id2, score1), ..
- Reduce
  - In: (id1, [score1, score2, ..]), (id2, [score1, ..]), ..
  - Out: (id1, score1(t+1)), (id2, score2(t+1)), ..
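
A local sketch of score distribution and accumulation follows. It implements only what this slide shows: each node splits its score evenly over its outlinks, and the reducer sums what each node receives. Any damping or teleport term is left out because the slide does not show one, and every node is assumed to have at least one outlink. Function names and the sample graph are illustrative.

```python
from collections import defaultdict


def map_distribute(node_id, score, out_links):
    """Send an equal share of this node's score to each outlink."""
    share = score / len(out_links)
    for target in out_links:
        yield (target, share)


def reduce_accumulate(node_id, shares):
    """Sum the shares received by one node."""
    return (node_id, sum(shares))


def distribute_and_accumulate(graph):
    """graph: list of (node_id, score, out_links) with non-empty out_links."""
    incoming = defaultdict(list)  # stand-in for shuffle and sort by node_id
    for node_id, score, out_links in graph:
        for target, share in map_distribute(node_id, score, out_links):
            incoming[target].append(share)
    return [reduce_accumulate(n, incoming[n]) for n in sorted(incoming)]


if __name__ == "__main__":
    graph = [(1, 1.0, [2, 3]), (2, 1.0, [3]), (3, 1.0, [1])]
    print(distribute_and_accumulate(graph))
```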

PageRank: Database Join to associate outlinks with score
- Map
  - In & Out: (id1, score1(t+1)), (id2, score2(t+1)), .., (id1, [out11, out12, ..]), (id2, [out21, out22, ..]) ..
- Shuffle & Sort by node_id
  - Out: (id1, score1(t+1)), (id1, [out11, out12, ..]), (id2, [out21, out22, ..]), (id2, score2(t+1)), ..
- Reduce
  - In: (id1, [score1(t+1), out11, out12, ..]), (id2, [out21, out22, .., score2(t+1)]), ..
  - Out: (id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..]) ..

PageRank: Side Effect Files for dangling nodes
Nodes with no outlinks (observed but not crawled URLs)
- Score has no outlet
  - Need to distribute it to all graph nodes evenly
- Map for dangling nodes:
  - In: .., (id3, [score3]), ..
  - Out: .., ("*", 0.85×score3), ..
- Reduce
  - In: .., ("*", [score1, score2, ..]), ..
  - Out: .., everything else, ..
  - Output to side-effect file: ("*", score), fed to the Mapper of the next iteration
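
The special "*" key can be sketched locally as below. The 0.85 factor and the "*" key come from the slide; everything else (function names, returning the side-effect value from the function instead of writing a file, the sample graph) is an assumption for illustration. How the next iteration's mapper spreads this value over the nodes is not specified on the slide and is not modeled here.

```python
from collections import defaultdict

DAMPING = 0.85      # factor shown on the slide
DANGLING_KEY = "*"  # special key collecting all dangling-node score


def map_node(node_id, score, out_links):
    if not out_links:
        # Dangling node: its score has no outlet, so send it to "*".
        yield (DANGLING_KEY, DAMPING * score)
    else:
        for target in out_links:
            yield (target, score / len(out_links))


def run_iteration(graph):
    """Return (regular_scores, side_effect_score) for one pass."""
    grouped = defaultdict(list)
    for node_id, score, out_links in graph:
        for key, value in map_node(node_id, score, out_links):
            grouped[key].append(value)

    # The "*" reducer output goes to the side-effect file, not the main output.
    side_effect = sum(grouped.pop(DANGLING_KEY, []))
    regular = {node: sum(values) for node, values in grouped.items()}
    return regular, side_effect


if __name__ == "__main__":
    print(run_iteration([(1, 1.0, [2, 3]), (2, 1.0, []), (3, 2.0, [])]))
```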

Manipulating Large Data
Do everything in Hadoop (and HDFS)
- Make sure every step is parallelized!
- Any serial step breaks your design
E.g. storing the URL list for a Web graph
- Each node in the Web graph has an id
- [URL1, URL2, …], using the line number as id: a bottleneck
- [(id1, URL1), (id2, URL2), …], explicit id

Hadoop based Tools
- For developing in Java: NetBeans plugin
- Pig Latin: a SQL-like high-level data processing script language
- Hive: data warehouse, SQL
- Cascading: data processing
- Mahout: machine learning algorithms on Hadoop
- HBase: distributed data store as a large table
- More
  - Many other toolkits: Nutch, Cloud9, Ivory

Get Your Hands Dirty
- Hadoop Virtual Machine
  - This runs Hadoop 0.20
  - An earlier Hadoop version is also available
- Amazon EC2
- Various other Hadoop clusters around
- The NetBeans plugin simulates Hadoop
  - The workflow view works on Windows
  - Local running & debugging works on MacOS and Linux

Conclusions
- Why large scale
- MapReduce advantages
- Hadoop uses
- Use cases
  - Map only: for totally distributive computation
  - Map+Reduce: for filtering & aggregation
  - Database join: for massive dictionary lookups
  - Secondary sort: for sorting on values
  - Inverted indexing: combiner, complex keys
  - PageRank: side effect files
- Large data

For More Information
- L. A. Barroso, J. Dean, and U. Hölzle. “Web search for a planet: The Google cluster architecture.” IEEE Micro, 2003.
- J. Dean and S. Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004).
- S. Ghemawat, H. Gobioff, and S.-T. Leung. “The Google File System.” Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP-03).
- I.H. Witten, A. Moffat, and T.C. Bell. Managing Gigabytes. Morgan Kaufmann.
- J. Zobel and A. Moffat. “Inverted files for text search engines.” ACM Computing Surveys, 38 (2).
- “Map/Reduce Tutorial”. Fetched January 21, 2010.
- Tom White. Hadoop: The Definitive Guide. O'Reilly Media. June 5, 2009.
- J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce, Book Draft. February 7, 2010.
