COSC6376 Cloud Computing
Lecture 3. MapReduce
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston
Outline
Motivation: Why MapReduce
Programming Model
Examples: Word Count, Reverse Web-Link Graph
Reading Assignment
Summary due next Tuesday in class
Motivation: What Is MapReduce?
Dealing with Lots of Data
Example: 20+ billion web pages x 20 KB = 400+ TB
~400 hard drives (1 TB each) just to store the web, and even more to do something with the data
One computer can read ~50 MB/sec from disk, so reading the web alone would take about three months
Solution: spread the work over many machines
Commodity Clusters
A standard architecture has emerged: a cluster of commodity Linux nodes with a gigabit Ethernet interconnect
How do we organize computations on this architecture while masking issues such as hardware failure?
Cluster Architecture
[diagram] Racks of nodes, each node with its own CPU, memory, and disk; nodes in a rack share a switch, with ~1 Gbps between any pair of nodes in a rack and a 2-10 Gbps backbone between racks
Motivation: Large-Scale Data Processing
Many tasks consist of processing lots of data to produce lots of other data
We want to use 1000s of CPUs, but without the hassle of managing them
MapReduce provides: user-defined functions, automatic parallelization and distribution, fault tolerance, I/O scheduling, and status and monitoring
Stable Storage
First-order problem: if nodes can fail, how can we store data persistently?
Answer: a distributed file system that provides a global file namespace (Google GFS, Hadoop HDFS, Kosmix KFS)
Typical usage pattern: huge files (100s of GB to TB); data is rarely updated in place; reads and appends are common
Distributed File System
Chunk servers: each file is split into contiguous chunks, typically 16-64 MB; each chunk is replicated (usually 2x or 3x), with replicas kept in different racks where possible
Master node (a.k.a. the NameNode in HDFS): stores metadata and might itself be replicated
Client library for file access: talks to the master to find the chunk servers, then connects directly to the chunk servers to access data
MapReduce Programming Model
History of MapReduce
Hadoop Ecosystem
What Is Map/Reduce?
A programming model taken from LISP (and other functional languages)
Many problems can be phrased this way, and are then easy to distribute across nodes: imagine 10,000 machines ready to help you compute anything you can cast as a MapReduce problem
This is the abstraction Google is famous for authoring; it hides a LOT of the difficulty of writing parallel code
The system takes care of load balancing, dead machines, etc., with nice retry/failure semantics
Basic Ideas
[diagram] Source data flows through map tasks Map1 ... Map m, which "index" it by emitting (key, value) pairs; pairs with the same key (key1 ... key n) are routed to the same reduce task (Reduce1 ... Reduce n)
Programming Concept
Map: perform a function on each value in a data set to create a new list of values
Example: square x = x * x; map square [1,2,3,4,5] returns [1,4,9,16,25]
Reduce: combine the values in a data set to create a single new value
Example: sum adds each element to a running total; reduce sum [1,2,3,4,5] returns 15 (the sum of the elements)
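The two primitives above can be tried directly in Python, which has built-in map and reduce (a minimal sketch of the functional idea, not the distributed framework):

```python
from functools import reduce

def square(x):
    return x * x

# map applies a function to every element, producing a new list
squares = list(map(square, [1, 2, 3, 4, 5]))   # [1, 4, 9, 16, 25]

# reduce folds all elements into a single value (here, their sum)
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5], 0)  # 15
```

MapReduce generalizes exactly these two steps, but runs them over key/value pairs spread across many machines.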
MapReduce Programming Model
Input & output: each a set of key/value pairs
The programmer specifies two functions:
map(in_key, in_value) -> list(out_key, intermediate_value): processes an input key/value pair and produces a set of intermediate pairs
reduce(out_key, list(intermediate_value)) -> list(out_value): combines all intermediate values for a particular key and produces a set of merged output values (usually just one)
Examples
Warm-up: Word Count
We have a large file of words, with many words in each line
Count the number of times each distinct word appears in the file(s)
Word Count Using MapReduce
map(key=line, value=contents):
  for each word w in value:
    emit_intermediate(w, 1)
reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each v in values:
    result += v
  emit(key, result)
Word Count, Illustrated
map(key=line, val=contents): for each word w in contents, emit (w, "1")
reduce(key=word, values=uniq_counts): sum all the "1"s in the values list and emit (word, sum)
Input: see bob run / see spot throw
Intermediate: (see, 1) (bob, 1) (run, 1) (see, 1) (spot, 1) (throw, 1)
Output: (bob, 1) (run, 1) (see, 2) (spot, 1) (throw, 1)
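The full pipeline, map, shuffle (group by key), reduce, can be simulated in-memory with a short Python sketch (function names here are mine, not from the slides; a real framework distributes each stage):

```python
from collections import defaultdict

def map_wordcount(line):
    # Map: emit (word, 1) for each word in the line
    return [(word, 1) for word in line.split()]

def reduce_wordcount(word, counts):
    # Reduce: sum all the counts emitted for one word
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: group every intermediate value by its key
    groups = defaultdict(list)
    for record in inputs:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce each key group to a final (key, value) result
    return dict(reduce_fn(key, values) for key, values in groups.items())

counts = run_mapreduce(["see bob run", "see spot throw"],
                       map_wordcount, reduce_wordcount)
# {'see': 2, 'bob': 1, 'run': 1, 'spot': 1, 'throw': 1}
```

This reproduces the illustrated example: "see" appears twice, every other word once.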
MapReduce WordCount Java code
Map Function: How It Works
Input data: tuples (e.g., lines in a text file)
Apply a user-defined function to process the data and output (key, value) tuples; the output keys are normally defined differently from the input keys
Under the hood: the data file is split and sent to different distributed map tasks (transparently to the user); each map's results are grouped by key and stored on its local Linux file system
Reduce Function: How It Works
Group the mappers' output (key, value) tuples by key
Apply a user-defined function to process each group of tuples
Output: typically, (key, aggregate)
Under the hood: each reduce task handles a number of keys and pulls the results for its assigned keys from the maps' outputs; each reduce generates one result file in GFS (or HDFS)
Summary of the Ideas
The mapper generates some kind of index for the original data; the reducer groups and aggregates based on that index
Flexibility: developers are free to generate all kinds of different indices from the original data, so many different types of jobs can be done on top of this simple framework
Example: Count URL Access Frequency
Work on a log of web page requests: (session ID, URL) records
Map: input is the log records; output (URL, 1) for each request
Reduce: input (URL, list of 1s); output (URL, count)
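A sketch of this job over a made-up request log (the log contents are invented for illustration):

```python
from collections import defaultdict

# Hypothetical log of (session ID, URL) request records
log = [("s1", "/index"), ("s2", "/about"), ("s1", "/index"), ("s3", "/index")]

# Map: emit (URL, 1) for each request record
intermediate = [(url, 1) for _, url in log]

# Reduce: sum the 1s collected for each URL
freq = defaultdict(int)
for url, one in intermediate:
    freq[url] += one
# {'/index': 3, '/about': 1}
```

Note this is structurally identical to word count, only the map's key changes.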
Example: Reverse Web-Link Graph
Each source page has links to target pages; find (target, list(sources))
Map: input (src URL, page content); output (tgt URL, src URL) for each target URL found in the page
Reduce: output (tgt URL, list(src URL))
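A sketch with a toy link graph (the page names are invented; real input would be raw page content rather than pre-extracted link lists):

```python
from collections import defaultdict

# Hypothetical pages, each already reduced to its list of outgoing links
pages = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
}

# Map: for each source page, emit (tgt, src) per outgoing link
emitted = [(tgt, src) for src, targets in pages.items() for tgt in targets]

# Reduce: collect the list of source pages for each target
reverse = defaultdict(list)
for tgt, src in emitted:
    reverse[tgt].append(src)
# reverse["c.html"] == ["a.html", "b.html"]
```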
Image Smoothing
To smooth an image, slide a mask over it and replace each pixel's value with the average of the pixel values under the mask
Image Smoothing
Map: input key = (x, y), input value = (R, G, B); emit 9 points: ((x-1, y-1), (R,G,B)), ((x, y-1), (R,G,B)), ((x+1, y-1), (R,G,B)), etc.
Reduce: input key = (x, y), input values = list of (R,G,B); compute the average R, G, B and emit key = (x, y), value = average (R,G,B)
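A minimal sketch of the 3x3 smoothing job, using a single grayscale intensity instead of the slide's (R, G, B) triple (each color channel would work the same way):

```python
from collections import defaultdict

def smooth(pixels):
    """3x3 mean filter as a MapReduce sketch.
    pixels: dict mapping (x, y) -> grayscale intensity."""
    # Map: each pixel emits its value to itself and its 8 neighbors
    contribs = defaultdict(list)
    for (x, y), value in pixels.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                contribs[(x + dx, y + dy)].append(value)
    # Reduce: average the values received at each position,
    # keeping only positions that exist in the original image
    return {p: sum(vs) / len(vs) for p, vs in contribs.items() if p in pixels}
```

For example, smoothing a 2x2 image where one corner is 8 and the rest are 0 gives every pixel the value 2.0, since each position receives all four contributions.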
PageRank
Example: PageRank
Graph representation
PageRank definition
MapReduce PageRank
Example: PageRank
An algorithm implemented by Google to rank any type of recursively linked "documents" using MapReduce
Initially developed at Stanford University by Google founders Larry Page and Sergey Brin in 1996; led to a functional prototype named Google in 1998; still provides the basis for all of Google's web search tools
The PageRank value for a page u depends on the PageRank values of the pages v in the set B_u (all pages linking to page u), each divided by the number L(v) of links from page v: PR(u) = sum over v in B_u of PR(v) / L(v)
Graph Representation
G = (V, E), typically represented as adjacency lists: each node is associated with its neighbors (via outgoing edges)
Example (4-node graph): 1: 2, 4 | 2: 1, 3, 4 | 3: 1 | 4: 1, 3
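The 4-node example maps directly onto a dictionary of adjacency lists (a sketch; any key/value store of node -> outlinks works the same way):

```python
# The 4-node example graph as an adjacency list (node -> outgoing neighbors)
G = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}

# Out-degree L(v) of each node, which PageRank divides by later
out_degree = {v: len(neighbors) for v, neighbors in G.items()}
```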
Steps
1. Parse documents (web pages) for links
2. Iteratively compute PageRank
3. Sort the documents by PageRank
Step 1: Parse URLs
Map: input: index.html, <html><body>Blah blah blah... <a href="2.html">A link</a>... </html>; output for each outlink: key "index.html", value "2.html"
Reduce: input: key "index.html", values "2.html", "3.html", ...; output value: "1.0 2.html 3.html ..." (an initial PageRank of 1.0 followed by the outlink list)
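The map step here might look like the following sketch, where a naive regex stands in for a real HTML parser (the helper name and regex are mine, not from the slides):

```python
import re

def parse_links(url, html):
    # Map step: emit (source URL, target URL) for each href in the page.
    # A naive regex stands in for a real HTML parser.
    return [(url, target) for target in re.findall(r'href="([^"]+)"', html)]

pairs = parse_links("index.html",
                    '<html><body>Blah... <a href="2.html">A link</a>'
                    '<a href="3.html">B</a></body></html>')
# [('index.html', '2.html'), ('index.html', '3.html')]
```

The reduce step then just concatenates these per-page targets and prepends the initial rank.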
Step 2: PageRank Iterations
Map: input: key "index.html", value "<pagerank> 1.html 2.html ..."; output for each outlink: key "1.html", value "index.html <pagerank> <number of outlinks>"
Reduce: input: key "1.html", values "index.html <pagerank> <number of outlinks>", "2.html 2.4 2", ...; output value: "<new pagerank> index.html 2.html ..." (the updated rank followed by the outlink list)
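One iteration of the simple (undamped) update PR(u) = sum over v in B_u of PR(v)/L(v) can be sketched as an in-memory map/reduce pair (the slides' string-valued records are replaced by dicts for brevity):

```python
from collections import defaultdict

def pagerank_iteration(graph, ranks):
    """graph: page -> list of outlinks; ranks: page -> current PageRank."""
    # Map: each page v sends ranks[v] / L(v) to every page it links to
    contribs = defaultdict(float)
    for v, outlinks in graph.items():
        for u in outlinks:
            contribs[u] += ranks[v] / len(outlinks)
    # Reduce: each page's new rank is the sum of the contributions it received
    return {u: contribs[u] for u in graph}

graph = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}
ranks = {u: 0.25 for u in graph}     # start uniform
ranks = pagerank_iteration(graph, ranks)
# the total rank mass stays 1.0 across iterations
```

Production PageRank adds a damping factor and handles dangling pages; both are omitted here since the slides' definition does not include them.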
Sort Documents by PageRank
Last step: sort all the documents by PageRank
Assume many gigabytes of document IDs, PageRanks, and outlinks; MapReduce sorts all outputs by key using a distributed merge sort (very fast and scalable)
Map: input: key "index.html", value "<pagerank> <outlinks>"; output: key "<pagerank>", value "index.html"