Thanks to Jimmy Lin slides


Graph Algorithms with MapReduce, Chapter 5

Topics: Introduction to graph algorithms and graph representations; the Single Source Shortest Path (SSSP) problem; refresher on Dijkstra’s algorithm; Breadth-First Search with MapReduce; PageRank.

What’s a graph? G = (V, E), where V represents the set of vertices (nodes) and E represents the set of edges (links). Both vertices and edges may contain additional information. Different types of graphs: directed vs. undirected edges; presence or absence of cycles. Graphs are everywhere: the hyperlink structure of the Web, the physical structure of computers on the Internet, the interstate highway system, social networks.

Some Graph Problems: Finding shortest paths (routing Internet traffic and UPS trucks). Finding minimum spanning trees (telco laying down fiber). Finding max flow (airline scheduling). Identifying “special” nodes and communities (breaking up terrorist cells, tracking the spread of avian flu). Bipartite matching (Monster.com, Match.com). And of course... PageRank.

Graphs and MapReduce: Graph algorithms typically involve performing computation at each node; processing node-specific data, edge-specific data, and link structure; and traversing the graph in some manner. Key questions: How do you represent graph data in MapReduce? How do you traverse a graph in MapReduce?

Representing Graphs: G = (V, E) on its own is a poor representation for computational purposes. Two common representations: adjacency matrix and adjacency list.

Adjacency Matrices: Represent a graph as an n x n square matrix M, where n = |V| and Mij = 1 means a link from node i to node j. [Figure: example graph on nodes 1 to 4 and its adjacency matrix.]

Adjacency Matrices: Critique. Advantages: naturally encapsulates iteration over nodes; rows correspond to outlinks and columns to inlinks. Disadvantages: lots of zeros for sparse matrices; lots of wasted space.

Adjacency Lists: Take adjacency matrices… and throw away all the zeros. For the example graph on nodes 1 to 4:
1: 2, 4
2: 1, 3, 4
3: 1
4: 1, 3

Adjacency Lists: Critique. Advantages: much more compact representation; easy to compute over outlinks; graph structure can be broken up and distributed. Disadvantages: much more difficult to compute over inlinks.
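The trade-off between the two representations can be sketched in a few lines of Python. This is only an illustration: the adjacency list is the one shown on the slide, while the helper names `to_matrix` and `inlinks` are made up for this example.

```python
# Adjacency list from the "Adjacency Lists" slide: node -> nodes it links to.
adjacency_list = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}

def to_matrix(adj):
    """Build the n x n matrix M where M[i][j] == 1 means a link from i to j."""
    nodes = sorted(adj)
    index = {node: k for k, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for src, targets in adj.items():
        for dst in targets:
            matrix[index[src]][index[dst]] = 1
    return matrix

def inlinks(adj):
    """Invert the list. Finding inlinks needs a pass over every node's outlinks."""
    result = {node: [] for node in adj}
    for src in sorted(adj):
        for dst in adj[src]:
            result[dst].append(src)
    return result

matrix = to_matrix(adjacency_list)    # rows are outlinks, columns are inlinks
incoming = inlinks(adjacency_list)
```

Reading a row of `matrix` gives a node's outlinks and reading a column gives its inlinks, whereas the list form needs the full inversion pass in `inlinks` to answer the second question.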

Single Source Shortest Path: Problem: find the shortest path from a source node to one or more target nodes. First, a refresher on the single-machine solution, Dijkstra’s algorithm: “Graph search algorithm that solves the single-source shortest path problem for a graph with nonnegative edge path costs, producing a shortest path tree” (Wikipedia).

Dijkstra’s Algorithm Example (example from CLR) [Figure: weighted directed graph with source n0 and nodes n1 to n4; the drawing is not reproduced here. Tentative distances shown on successive slides:]
Start: n1 = ∞, n2 = ∞, n3 = ∞, n4 = ∞
Step 1: n1 = 10, n2 = 5, n3 = ∞, n4 = ∞
Step 2: n1 = 8, n2 = 5, n3 = 14, n4 = 7
Step 3: n1 = 8, n2 = 5, n3 = 13, n4 = 7
Step 4: n1 = 8, n2 = 5, n3 = 9, n4 = 7
Final: n1 = 8, n2 = 5, n3 = 9, n4 = 7
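As a refresher, here is a minimal single-machine sketch of Dijkstra’s algorithm in Python using a binary heap. The transcript lost the drawing, so the placement of the edge weights below is an assumption based on the textbook figure this example appears to come from; the node names n0 to n4 are the ones on the slides.

```python
import heapq

# ASSUMPTION: edge placement reconstructed from memory of the textbook figure;
# only the weights themselves and the node names appear in the transcript.
graph = {
    "n0": {"n1": 10, "n2": 5},
    "n1": {"n2": 2, "n3": 1},
    "n2": {"n1": 3, "n3": 9, "n4": 2},
    "n3": {"n4": 4},
    "n4": {"n0": 7, "n3": 6},
}

def dijkstra(graph, source):
    """Return the shortest distance from source to every node."""
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)   # closest unsettled node
        if d > dist[node]:
            continue                     # stale heap entry, already improved
        for neighbor, weight in graph[node].items():
            candidate = d + weight
            if candidate < dist[neighbor]:
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist

distances = dijkstra(graph, "n0")
```

With this edge placement the final distances agree with the last slide of the example (n1 = 8, n2 = 5, n3 = 9, n4 = 7).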

Single Source Shortest Path: Problem: find the shortest path from a source node to one or more target nodes. Single processor machine: Dijkstra’s Algorithm. MapReduce: parallel Breadth-First Search (BFS). How to do it? First simplify the problem!!

Finding the Shortest Path: First, consider equal edge weights. The solution to the problem can be defined inductively. Here’s the intuition: DistanceTo(startNode) = 0. For all nodes n directly reachable from startNode, DistanceTo(n) = 1. For all nodes n reachable from some other set of nodes S, DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ S).

Finding the Shortest Path: This strategy advances the “known frontier” by one hop. Subsequent iterations include more reachable nodes as the frontier advances. Multiple iterations are needed to explore the entire graph.

Visualizing Parallel BFS [Figure: graph with each node labeled by the iteration, 1 through 4, in which the frontier reaches it.]

Termination: Does the algorithm ever terminate? Eventually, all nodes will be discovered and all edges will be considered (in a connected graph). When do we stop? When the distances at every node no longer change at the next frontier.
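The inductive definition and the stopping rule can be sketched on a single machine before moving to MapReduce. This is an illustrative Python sketch, not the MapReduce version; the function name `bfs_distances` is invented here, and the graph reuses the adjacency list from the earlier slide.

```python
def bfs_distances(adjacency, start_node):
    """Level-by-level BFS: each pass of the loop advances the frontier one hop."""
    distance_to = {start_node: 0}
    frontier = [start_node]
    hops = 0
    while frontier:                       # stop once no new node is discovered
        hops += 1
        next_frontier = []
        for m in frontier:
            for n in adjacency.get(m, []):
                if n not in distance_to:  # with equal weights, first visit is shortest
                    distance_to[n] = hops
                    next_frontier.append(n)
        frontier = next_frontier
    return distance_to

adjacency = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}
distances = bfs_distances(adjacency, 1)
```

Nodes 2 and 4 are found in the first pass and node 3 in the second; the third pass finds nothing new, which is the termination condition.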

Next Step to Solving: Next, no longer assume the distance along each edge is 1.

Weighted Edges: Now add positive weights to the edges. Simple change: the points-to list in the map task includes a weight w_p for each pointed-to node p; emit (p, D + w_p) instead of (p, D + 1) for each node p.

Dijkstra’s Algorithm Example (example from CLR) [Figure: the weighted example graph with source n0 and nodes n1 to n4, all tentative distances initially ∞.]

Multiple Iterations Needed: This MapReduce task advances the “known frontier” by one hop. Subsequent iterations include more reachable nodes as the frontier advances. Multiple iterations are needed to explore the entire graph. Each iteration is a MapReduce task: the final output of one iteration is the input to the next, so feed the output back into the same MapReduce task.

Assume d = 1

From Intuition to Algorithm: What info does the map task require? A map task receives (k, v): the key is node n; the value is D (distance from start) and points-to (the adjacency list of nodes reachable from n). What does the map task do? It computes distances: for each p ∈ points-to, emit (p, D + w_p). It also makes sure the current distance is carried into the reducer, by emitting the graph structure of node n, (n, struct), which contains the current shortest distance to node n.

From Intuition to Algorithm: What info does the reduce task require? The reduce task gathers the possible distances to a given p. What does the reduce task do? It selects the minimum one.

Algorithm Assume adjacency list has information about edges and distances!!

class Mapper
    method MAP(nid n, node N)
        d ← N.Distance
        Emit(nid n, N) // Pass along graph structure
        for all nodeid m ∈ N.AdjacencyList do
            Emit(nid m, d + w) // Emit distances to reachable nodes; w is the weight of edge (n, m)

class Reducer
    method REDUCE(nid m, [d1, d2, ...])
        dmin ← ∞
        M ← ∅
        for all d ∈ [d1, d2, ...] do
            if IsNode(d) then M ← d // Recover graph structure
            else if d < dmin then // Look for shorter distance
                dmin ← d
        if M.Distance > dmin then // Update shortest distance
            M.Distance ← dmin
            Increment counter for driver
        Emit(nid m, node M)

Map Algorithm: Line 2: N holds the node’s adjacency list and its current (shortest known) distance. Line 4: emits a (k, v) pair whose key is the current node and whose value is that node’s own information; there is only one such pair per node, because each node is assumed to be assigned to one mapper. Line 6: emits a different type of (k, v) pair, which carries only a distance to a neighbor, not an adjacency list. The shuffle then sends all (k, v) pairs with the same k to the same reducer.

Reduce Algorithm: Line 2: the input values are of different types. Line 5: determine what type each value is, i.e. whether it is the node structure carrying the adjacency list. Line 6: if v is not the node structure, then it is a distance; keep the shortest. As far as I can tell, only one value per key passes IsNode. Line 9: determine whether a new shortest distance was found. Line 10: update the current shortest distance and increment a counter that the driver uses to decide whether to stop.
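To make the mapper and reducer pseudocode concrete, here is a small in-memory Python simulation of one iteration. It is a sketch under stated assumptions, not Hadoop code: the shuffle is imitated with a dictionary, a node is a plain dict with hypothetical `distance` and `adjacency` fields, and the edge placement in the example graph is assumed (the transcript only preserves the weights and node names).

```python
from collections import defaultdict

INF = float("inf")

def map_node(nid, node):
    """Mapper: pass the node structure along, then emit one distance per outlink."""
    yield nid, node                                   # graph structure
    for neighbor, weight in node["adjacency"].items():
        yield neighbor, node["distance"] + weight     # candidate distance

def reduce_node(nid, values):
    """Reducer: recover the node structure and keep the smallest distance."""
    d_min, node = INF, None
    for value in values:
        if isinstance(value, dict):                   # plays the role of IsNode
            node = value
        elif value < d_min:
            d_min = value
    changed = d_min < node["distance"]
    return nid, dict(node, distance=min(node["distance"], d_min)), changed

def run_iteration(graph):
    """One simulated MapReduce job; returns the new graph and the change counter."""
    grouped = defaultdict(list)                       # stands in for the shuffle
    for nid, node in graph.items():
        for key, value in map_node(nid, node):
            grouped[key].append(value)
    new_graph, counter = {}, 0
    for nid, values in grouped.items():
        key, node, changed = reduce_node(nid, values)
        new_graph[key] = node
        counter += int(changed)
    return new_graph, counter

# ASSUMPTION: edge placement is reconstructed, not taken from the transcript.
edges = {"n0": {"n1": 10, "n2": 5}, "n1": {"n2": 2, "n3": 1},
         "n2": {"n1": 3, "n3": 9, "n4": 2}, "n3": {"n4": 4},
         "n4": {"n0": 7, "n3": 6}}
graph = {nid: {"distance": 0 if nid == "n0" else INF, "adjacency": adj}
         for nid, adj in edges.items()}
new_graph, counter = run_iteration(graph)
```

After this single iteration only the two direct neighbors of n0 have finite distances, so the counter reports two changes and the driver would schedule another job.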

Shortest path, one more thing: As described, this only finds the shortest distances, not the shortest paths. Do we have to use backpointers and retrace to find the shortest path? No: emit paths along with distances, so each node has its shortest path accessible at all times. Most paths are relatively short, so this uses little space.

Weighted edges: does this find the minimum? Suppose we discover node r, having found the shortest distance D to p, and the shortest distance found so far to r goes through p. There may be a path to r through q that is shorter but lies outside the current search frontier. This cannot happen when every edge has weight 1, since a path lying outside the search frontier would be longer. At each iteration we have found the shortest path within the frontier, and shorter paths are discovered as the frontier expands. With sufficient iterations, we eventually discover the shortest distance.

Dijkstra’s Algorithm Example (example from CLR) [Figure: the weighted example graph with source n0 and nodes n1 to n4, all tentative distances initially ∞.]

Termination: Does this ever terminate? Yes! Eventually, no better distances will be found. When the distances stay the same, we stop. Checking for termination must occur outside of MapReduce: a driver program submits the MR job for each iteration and checks whether the termination condition is met. Hadoop provides Counters, which the driver can read after the reducers finish to determine whether it is done. In shortest path, the reducers count each change to a minimum distance and pass the count to the driver.
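The driver loop can be sketched as follows. This is a plain-Python stand-in rather than Hadoop code: `run_job` is a hypothetical function that imitates one MapReduce job and returns the counter value, and the edge placement in the example graph is an assumption.

```python
INF = float("inf")

# ASSUMPTION: edge placement is reconstructed, not taken from the transcript.
EDGES = {"n0": {"n1": 10, "n2": 5}, "n1": {"n2": 2, "n3": 1},
         "n2": {"n1": 3, "n3": 9, "n4": 2}, "n3": {"n4": 4},
         "n4": {"n0": 7, "n3": 6}}

def run_job(dist, edges):
    """Stand-in for one MapReduce job: relax every edge using the old distances."""
    best = dict(dist)
    for node, d in dist.items():                 # "map": emit d + w per outlink
        for neighbor, weight in edges[node].items():
            best[neighbor] = min(best[neighbor], d + weight)   # "reduce": keep min
    counter = sum(1 for node in dist if best[node] < dist[node])
    return best, counter

def driver(edges, source, max_jobs=100):
    """Resubmit the job until the counter says no distance changed."""
    dist = {node: INF for node in edges}
    dist[source] = 0
    jobs = 0
    while jobs < max_jobs:
        dist, counter = run_job(dist, edges)
        jobs += 1
        if counter == 0:                          # termination checked outside the job
            break
    return dist, jobs

final_distances, jobs_run = driver(EDGES, "n0")
```

On this graph the driver submits four jobs: three that improve at least one distance and a final one whose counter is zero, which is how it learns that it can stop.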

Iterations: How many iterations are needed to compute the shortest distance to all nodes? With equal edge weights, as many as the diameter of the graph, i.e. the greatest distance between any pair of nodes. This is small for many real-world problems (six degrees of separation): for a global social network, about 6 MapReduce iterations.

Fig. 5.6 needs how many iterations for n1-n6? In the worst case, (#nodes - 1) iterations are needed.

Comparison to Dijkstra: Dijkstra’s algorithm is more efficient: at any step it only pursues edges from the minimum-cost path inside the frontier. MapReduce explores all paths in parallel: brute force, which wastes time. It is divide and conquer, but except at the search frontier it keeps repeating the same computations within the frontier. In effect: throw more hardware at the problem.

General Approach: MapReduce is adept at manipulating graphs. Store graphs as adjacency lists. Graph algorithms with MapReduce: each map task receives a node and its outlinks; the map task computes some function of the link structure and emits a value with the target as the key; the reduce task collects the values for each key (target node) and aggregates them. Iterate over multiple MapReduce cycles until some termination condition is met. Remember to “pass” the graph structure from one iteration to the next.

Another example: Random Walks Over the Web. Model: a user starts at a random Web page and randomly clicks on links, surfing from page to page (and may also teleport to a completely different page). How frequently will a page be encountered during this surfing? This is PageRank: a probability distribution over the nodes in a graph, representing the likelihood that a random walk over the graph will arrive at a particular node.

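PageRank: Defined. Given page n with in-bound links L(n): $P(n) = \alpha \frac{1}{|G|} + (1 - \alpha) \sum_{m \in L(n)} \frac{P(m)}{C(m)}$, where C(m) is the out-degree of m, P(m) is the PageRank of m, α is the probability of a random jump, and |G| is the total number of nodes in the graph. [Figure: pages m1 … mn linking to page n.]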

Computing PageRank: Properties of PageRank: it can be computed iteratively, and the effects at each iteration are local. Sketch of algorithm: start with seed P_i values; each page distributes its P_i “credit” to all pages it links to; each target page adds up the “credit” from multiple in-bound links to compute P_{i+1}; iterate until the values converge.

Computing PageRank What does map do? What does reduce do?

PageRank MapReduce, Fig. 5.7: Begins with 5 nodes splitting 1.0, i.e. 0.2 each. Each node must split its 0.2 among its outgoing links (map), then add up all incoming values (reduce). Each iteration is one MapReduce job.

PageRank in MapReduce: Map: distribute PageRank “credit” to link targets. Reduce: gather up PageRank “credit” from multiple sources to compute the new PageRank value. Iterate until convergence ...
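One simplified PageRank iteration can be simulated in memory as follows. This sketch ignores random jumps and dangling nodes, imitates the shuffle with a dictionary, and keeps the link structure in the driver for brevity instead of passing it through the mapper. The five-node graph is a hypothetical example chosen so that every node has at least one outlink; it is not claimed to be the graph of Fig. 5.7.

```python
from collections import defaultdict

# Hypothetical five-node graph: node -> pages it links to.
LINKS = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}

def pagerank_map(node, rank, outlinks):
    """Map: split this node's rank evenly among the pages it links to."""
    share = rank / len(outlinks)
    for target in outlinks:
        yield target, share

def pagerank_reduce(node, credits):
    """Reduce: the new rank is the sum of the credit received."""
    return node, sum(credits)

def pagerank_iteration(ranks, links):
    grouped = defaultdict(list)                   # stands in for the shuffle
    for node, outlinks in links.items():
        for target, credit in pagerank_map(node, ranks[node], outlinks):
            grouped[target].append(credit)
    return dict(pagerank_reduce(node, grouped[node]) for node in links)

ranks = {node: 1.0 / len(LINKS) for node in LINKS}   # 5 nodes, 0.2 each
ranks = pagerank_iteration(ranks, LINKS)
```

Because no node in this graph is dangling, the total mass stays at 1.0 after the iteration; n4, for example, collects 0.1 from n1 and 0.2 from n3.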

Convergence to end PageRank: Stop when there are few changes (within some tolerance for precision errors) or when a fixed number of iterations is reached. The driver checks for convergence. How many iterations are needed for PageRank to converge, e.g. with 322 M edges? Fewer than expected: 52 iterations.

Dangling nodes and random jumps: Must redistribute the mass lost at dangling nodes (nodes with no outgoing edges, so their mass is lost). 3 approaches to determine the missing mass: count dangling nodes and multiply by a constant; emit a special key and handle it with special logic; write it as side data and sum across all map tasks. Next, redistribute the missing mass m across all nodes and compute the final PageRank $p' = \alpha \frac{1}{|G|} + (1 - \alpha) \left( \frac{m}{|G|} + p \right)$, where α is the random jump probability. This needs 2 MapReduce jobs for one iteration: one to distribute mass across edges, the other to take care of the lost mass.
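The second job of each iteration can be sketched as a per-node update. The sketch assumes the update rule from Lin and Dyer's text, p' = α/|G| + (1 - α)(m/|G| + p), with m the missing mass and α the random jump probability; the function name and the sample numbers are invented for illustration.

```python
def redistribute(ranks, missing_mass, alpha):
    """Spread the missing mass evenly over all nodes, then mix in the random jump."""
    n = len(ranks)
    return {
        node: alpha / n + (1 - alpha) * (missing_mass / n + p)
        for node, p in ranks.items()
    }

# Hypothetical state after the first job: a dangling node held 0.2 of the mass
# and had nowhere to send it, so the ranks only sum to 0.8.
ranks_after_job_one = {"a": 0.5, "b": 0.3, "c": 0.0}
missing_mass = 1.0 - sum(ranks_after_job_one.values())
final_ranks = redistribute(ranks_after_job_one, missing_mass, alpha=0.15)
```

After the update the ranks again sum to 1, which is a quick sanity check that no mass was lost or invented.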

PageRank: Assumes honest users. Assumes no spider traps (an infinite chain of pages that all link to a single page to inflate its PageRank). PageRank is only one of thousands of features used in ranking web pages.

Issues with Graph processing: No global data structures can be used. Computation is local to each node, with results passed to neighbors. Over multiple iterations, the results converge on the global graph. The amount of intermediate data is on the order of the number of edges. Worst case? O(n^2) for a dense graph.

Issues with Graph processing Role of combiner?

PageRank in MapReduce: Map: distribute PageRank “credit” to link targets. Reduce: gather up PageRank “credit” from multiple sources to compute the new PageRank value. Iterate until convergence ...

Dijkstra’s Algorithm Example (example from CLR) [Figure: the weighted example graph with source n0 and nodes n1 to n4, all tentative distances initially ∞.]

Issues with Graph processing: Combiners are only useful if they can do partial aggregation, i.e. only if multiple nodes processed by an individual mapper point to the same nodes. Otherwise the combiner is not useful. Assume a mapper processes more than one node: how should nodes be assigned (the graph partitioned) so that the combiner is useful?
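For the shortest-path job, partial aggregation means taking a local minimum per target node before anything is sent over the network. The following Python sketch shows that idea only; `combine_min` and the sample pairs are hypothetical, and a real Hadoop combiner would be written against Hadoop's own API.

```python
def combine_min(pairs):
    """Keep only the smallest distance per target node among one mapper's output."""
    best = {}
    for key, distance in pairs:
        if key not in best or distance < best[key]:
            best[key] = distance
    return sorted(best.items())

# Hypothetical output of one mapper that processed two nodes pointing at n3.
mapper_output = [("n3", 11), ("n1", 8), ("n3", 14), ("n4", 7)]
combined = combine_min(mapper_output)
```

Here four pairs shrink to three because two of them target n3. If every pair had a distinct key, the combiner would pass everything through unchanged, which is why the partitioning question matters.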

Issues with Graph processing: It is desirable to partition the graph so that there are many intra-component links and few inter-component links. Consider a social network and possible partitioning heuristics. Order nodes by: Last name? Zip code? Language spoken? School? The aim is that people who are connected end up in the same partition.

Summary: The graph structure is represented with adjacency lists. Map over nodes and pass partial results to the nodes on the adjacency list; the partial results are aggregated for each node in the reducer. The graph structure is passed from mapper to reducer, with output in the same form as the input. The algorithms are iterative, under the control of a non-MapReduce driver that checks for termination at the end of each iteration.