Big Data Infrastructure

Presentation transcript:

CS 489/698 Big Data Infrastructure (Winter 2017) Week 5: Analyzing Graphs (1/2). January 31, 2017. Jimmy Lin, David R. Cheriton School of Computer Science, University of Waterloo. These slides are available at http://lintool.github.io/bigdata-2017w/ This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Structure of the Course Analyzing Text Analyzing Graphs Analyzing Relational Data Data Mining “Core” framework features and algorithm design

What’s a graph? G = (V, E), where V represents the set of vertices (nodes) and E represents the set of edges (links). Edges may be directed or undirected. Both vertices and edges may contain additional information. [Figure labels: a vertex (node) and its “incident” edges (links); outgoing (outbound) edges, or outlinks, give the out-degree; incoming (inbound) edges, or inlinks, give the in-degree.]

Examples of Graphs Hyperlink structure of the web Physical structure of computers on the Internet Interstate highway system Social networks We’re mostly interested in sparse graphs!

Source: Wikipedia (Königsberg)

Source: Wikipedia (Kaliningrad)

Some Graph Problems. Finding shortest paths: routing Internet traffic and UPS trucks. Finding minimum spanning trees: telco laying down fiber. Finding max flow: airline scheduling. Identifying “special” nodes and communities: halting the spread of avian flu. Bipartite matching: match.com. Web ranking: PageRank.

What makes graphs hard? Irregular structure: fun with data structures! Irregular data access patterns: fun with architectures! Iterations: fun with optimizations!

Graphs and MapReduce (and Spark) A large class of graph algorithms involve: Local computations at each node Propagating results: “traversing” the graph Key questions: How do you represent graph data in MapReduce (and Spark)? How do you traverse a graph in MapReduce (and Spark)?

Representing Graphs Adjacency matrices Adjacency lists Edge lists

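Adjacency Matrices Represent a graph as an n x n square matrix M, where n = |V| and Mij = 1 iff there is an edge from vertex i to vertex j. [Figure: a four-vertex example graph and its adjacency matrix, with rows and columns labeled 1 to 4]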
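The matrix values on this slide did not survive extraction. As a minimal sketch, the matrix can be rebuilt in Python from the directed edges listed on the Edge Lists slide of this deck. The function name `build_adjacency_matrix` is illustrative and not part of the original slides.

```python
# Directed edges of the four-vertex example graph (vertices numbered 1..4),
# copied from the "Edge Lists" slide of this deck.
EDGES = [(1, 2), (1, 4), (2, 1), (2, 3), (2, 4), (3, 1), (4, 1), (4, 3)]


def build_adjacency_matrix(n, edges):
    """Return an n x n matrix M with M[i][j] = 1 iff there is an edge (i+1) -> (j+1)."""
    matrix = [[0] * n for _ in range(n)]
    for src, dst in edges:
        # Shift the 1-based vertex ids used on the slides to 0-based list indices.
        matrix[src - 1][dst - 1] = 1
    return matrix


M = build_adjacency_matrix(4, EDGES)
for row in M:
    print(row)
```

Half of the sixteen cells are zero even in this tiny example; for a sparse graph with millions of vertices, almost every cell would be.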

Adjacency Matrices: Critique Advantages Amenable to mathematical manipulation Intuitive iteration over rows and columns Disadvantages Lots of wasted space (for sparse matrices) Easy to write, hard to compute

Adjacency Lists Take the adjacency matrix… and throw away all the zeros:
1: 2, 4
2: 1, 3, 4
3: 1
4: 1, 3
Wait, where have we seen this before?
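A short Python sketch of "throwing away the zeros": each row of the matrix keeps only the column numbers that hold a 1. The matrix literal is derived from the adjacency lists above, and `matrix_to_adjacency_lists` is a hypothetical helper name.

```python
def matrix_to_adjacency_lists(matrix):
    """Keep only the positions of the 1s in each row, using 1-based vertex ids."""
    return {
        i + 1: [j + 1 for j, bit in enumerate(row) if bit]
        for i, row in enumerate(matrix)
    }


# Adjacency matrix of the four-vertex example (row i, column j means edge i -> j).
MATRIX = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
]

ADJ = matrix_to_adjacency_lists(MATRIX)
for vertex, neighbors in ADJ.items():
    print(f"{vertex}: {', '.join(map(str, neighbors))}")
```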

Adjacency Lists: Critique Advantages Much more compact representation (compress!) Easy to compute over outlinks Disadvantages Difficult to compute over inlinks

Edge Lists Explicitly enumerate all edges: (1, 2) (1, 4) (2, 1) (2, 3) (2, 4) (3, 1) (4, 1) (4, 3)

Edge Lists: Critique Advantages: easily supports edge insertions. Disadvantages: wastes space.

Graph Partitioning Vertex Partitioning … Edge Partitioning … (A lot more detail later…)

Storing Undirected Graphs Standard Tricks 1. Store both edges: make sure your algorithm de-dups. 2. Store one edge, e.g., (x, y) such that x < y: make sure your algorithm handles the asymmetry.
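The second trick can be sketched in Python as follows. The helper names and the sample edges are hypothetical, and dropping self-loops is an assumption made here to keep the x < y invariant strict.

```python
def canonical_edges(edges):
    """Store each undirected edge once, as (x, y) with x < y.

    Assumption: self-loops (u == v) are dropped so that x < y always holds.
    """
    return sorted({(min(u, v), max(u, v)) for u, v in edges if u != v})


def neighbors(vertex, stored_edges):
    """Handle the asymmetry: the vertex may sit in either slot of a stored edge."""
    result = set()
    for x, y in stored_edges:
        if x == vertex:
            result.add(y)
        elif y == vertex:
            result.add(x)
    return sorted(result)


# Hypothetical input in which some edges appear in both directions.
RAW = [(1, 2), (2, 1), (3, 1), (2, 3), (3, 2)]
STORED = canonical_edges(RAW)
print(STORED)
print(neighbors(3, STORED))
```

A lookup that only scanned the first slot would report no neighbors for vertex 3 here, which is the asymmetry the slide warns about.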

Basic Graph Manipulations. Invert the graph: flatMap and regroup. Adjacency lists to edge lists: flatMap adjacency lists to emit tuples. Edge lists to adjacency lists: groupBy. Framework does all the heavy lifting!
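The three manipulations can be imitated in plain Python. This is a single-machine sketch, not Spark code: list comprehensions stand in for flatMap, a dictionary of lists stands in for groupBy, and all function names are illustrative.

```python
from collections import defaultdict

# Adjacency lists of the four-vertex example graph used earlier in the deck.
ADJ = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}


def to_edge_list(adj):
    """flatMap step: each (vertex, neighbors) record emits one (src, dst) tuple per neighbor."""
    return [(src, dst) for src, neighbors in adj.items() for dst in neighbors]


def to_adjacency_lists(edges):
    """groupBy step: collect destination vertices under their source vertex."""
    grouped = defaultdict(list)
    for src, dst in edges:
        grouped[src].append(dst)
    return {src: sorted(dsts) for src, dsts in grouped.items()}


def invert(adj):
    """flatMap every edge to its reverse, then regroup: outlinks become inlinks."""
    reversed_edges = [(dst, src) for src, dst in to_edge_list(adj)]
    return to_adjacency_lists(reversed_edges)


EDGES = to_edge_list(ADJ)
INLINKS = invert(ADJ)
print(EDGES)
print(INLINKS)
```

Inverting the graph is also the usual answer to the "difficult to compute over inlinks" drawback of adjacency lists: after one regrouping pass, inlinks are as easy to read as outlinks.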

Co-occurrence of characters in Les Misérables Source: http://bost.ocks.org/mike/miserables/

Co-occurrence of characters in Les Misérables How are visualizations like this generated? Limitations? Source: http://bost.ocks.org/mike/miserables/

What does the web look like? Analysis of a large webgraph from the common crawl: 3.5 billion pages, 129 billion links Meusel et al. Graph Structure in the Web — Revisited. WWW 2014.

Broder’s Bowtie (2000) – revisited

What does the web look like? Very roughly, a scale-free network. Fraction of nodes having k connections: P(k) ∼ k^(−γ) (i.e., the distribution follows a power law)

Power Laws are everywhere! Figure from: Newman, M. E. J. (2005) “Power laws, Pareto distributions and Zipf's law.” Contemporary Physics 46:323–351.

What about Facebook? Figure from: Seth A. Myers, Aneesh Sharma, Pankaj Gupta, and Jimmy Lin. Information Network or Social Network? The Structure of the Twitter Follow Graph. WWW 2014.

What does the web look like? Very roughly, a scale-free network Other Examples: Internet domain routers Co-author network Citation network Movie-Actor network Why?

Preferential Attachment (In this installment of “learn fancy terms for simple ideas”) Also: Matthew Effect. “For unto every one that hath shall be given, and he shall have abundance: but from him that hath not shall be taken even that which he hath.” (Matthew 25:29, King James Version)

BTW, how do we compute these graphs?

Count. Source: http://www.flickr.com/photos/guvnah/7861418602/
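"Count" can be read as two counting passes: first count edges per vertex to get degrees, then count vertices per degree to get the distribution. A minimal single-machine Python sketch, assuming out-degree is the quantity of interest and reusing the four-vertex example edges from earlier in the deck:

```python
from collections import Counter

# Directed edges of the four-vertex example graph from the Edge Lists slide.
EDGES = [(1, 2), (1, 4), (2, 1), (2, 3), (2, 4), (3, 1), (4, 1), (4, 3)]


def out_degree_distribution(edges):
    """Return {k: fraction of vertices whose out-degree is k}."""
    vertices = {v for edge in edges for v in edge}
    # First count: number of outgoing edges per vertex.
    out_degree = Counter(src for src, _ in edges)
    # Second count: number of vertices per degree (Counter gives 0 for sinks).
    histogram = Counter(out_degree[v] for v in vertices)
    total = len(vertices)
    return {k: count / total for k, count in sorted(histogram.items())}


DISTRIBUTION = out_degree_distribution(EDGES)
print(DISTRIBUTION)
```

Both passes are word-count-style aggregations, so on a real webgraph each would be a single count-by-key job.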

BTW, how do we extract the webgraph? The webgraph… is big?! A few tricks: integerize vertices (monotone minimal perfect hashing), sort URLs, integer compression. Webgraph from the Common Crawl: 3.5 billion pages, 129 billion links: 58 GB! Meusel et al. Graph Structure in the Web — Revisited. WWW 2014.
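One plausible reading of these tricks, sketched in Python under stated assumptions: a plain dictionary built from the sorted URLs stands in for the monotone minimal perfect hash (same order-preserving mapping, none of the space savings), and gap encoding followed by variable-byte encoding stands in for "integer compression". The slides do not say which codec was actually used, and all names and URLs below are hypothetical.

```python
def integerize(urls):
    """Sort the URLs and use each URL's rank as its integer id."""
    return {url: rank for rank, url in enumerate(sorted(set(urls)))}


def encode_varint(value):
    """Variable-byte encode a non-negative integer: 7 payload bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress_neighbors(neighbor_ids):
    """Gap-encode a sorted id list, then varint-encode each gap."""
    out = bytearray()
    previous = 0
    for node_id in sorted(neighbor_ids):
        out += encode_varint(node_id - previous)
        previous = node_id
    return bytes(out)


IDS = integerize(["http://example.org/", "http://example.com/b", "http://example.com/a"])
PACKED = compress_neighbors([1000, 1003, 1010])
print(IDS)
print(PACKED.hex(), len(PACKED), "bytes")
```

Sorting URLs tends to give pages on the same site neighboring ids, so the gaps between linked ids are often small and fit in a single byte each.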

Graphs and MapReduce (and Spark) A large class of graph algorithms involve: Local computations at each node Propagating results: “traversing” the graph Key questions: How do you represent graph data in MapReduce (and Spark)? How do you traverse a graph in MapReduce (and Spark)?

Single-Source Shortest Path Problem: find shortest path from a source node to one or more target nodes Shortest might also mean lowest weight or cost First, a refresher: Dijkstra’s Algorithm…

Dijkstra’s Algorithm Example [Figure: five-node weighted directed graph with edge weights 1, 10, 2, 3, 9, 4, 6, 5, 7, 2; distance labels ∞, ∞, ∞, ∞] Example from CLR

Dijkstra’s Algorithm Example [Figure: same graph; distance labels 10, ∞, 5, ∞] Example from CLR

Dijkstra’s Algorithm Example [Figure: same graph; distance labels 8, 14, 5, 7] Example from CLR

Dijkstra’s Algorithm Example [Figure: same graph; distance labels 8, 13, 5, 7] Example from CLR

Dijkstra’s Algorithm Example [Figure: same graph; distance labels 8, 9, 5, 7] Example from CLR

Dijkstra’s Algorithm Example [Figure: same graph; final distance labels 8, 9, 5, 7] Example from CLR
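As a refresher in code, here is a minimal heap-based Dijkstra in Python. The node names s, t, x, y, z and the edge directions are assumptions based on the textbook figure as commonly reproduced; the transcript preserves only the edge weights and the distance labels.

```python
import heapq

# Assumed reconstruction of the example graph: the weights match the ten
# values shown on the slides, but names and directions are not in the transcript.
GRAPH = {
    "s": {"t": 10, "y": 5},
    "t": {"x": 1, "y": 2},
    "x": {"z": 4},
    "y": {"t": 3, "x": 9, "z": 2},
    "z": {"x": 6, "s": 7},
}


def dijkstra(graph, source):
    """Return shortest distances from source; edge weights must be non-negative."""
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry: a shorter path was already settled
        for neighbor, weight in graph[node].items():
            candidate = d + weight
            if candidate < dist[neighbor]:
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


DIST = dijkstra(GRAPH, "s")
print(DIST)
```

Under this reconstruction, the estimates for t, x, y, z pass through (10, ∞, 5, ∞), (8, 14, 5, 7), (8, 13, 5, 7) and end at (8, 9, 5, 7), matching the labels on the example slides.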

Single-Source Shortest Path Problem: find shortest path from a source node to one or more target nodes Shortest might also mean lowest weight or cost Single processor machine: Dijkstra’s Algorithm MapReduce: parallel breadth-first search (BFS)

Finding the Shortest Path Consider the simple case of equal edge weights. The solution to the problem can be defined inductively. Define: b is reachable from a if b is on the adjacency list of a. DistanceTo(s) = 0. For all nodes p reachable from s, DistanceTo(p) = 1. For all nodes n reachable from some other set of nodes M, DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ M). [Figure: source s reaches nodes m1, m2, m3 at distances d1, d2, d3; each of them has an edge to n]

Source: Wikipedia (Wave)

Visualizing Parallel BFS

From Intuition to Algorithm Data representation: Key: node n. Value: d (distance from start), adjacency list. Initialization: for all nodes except for the start node, d = ∞. Mapper: ∀m ∈ adjacency list: emit (m, d + 1). Remember to also emit distance to yourself. Sort/Shuffle: groups distances by reachable nodes. Reducer: selects minimum distance path for each reachable node. Additional bookkeeping is needed to keep track of the actual path.

Multiple Iterations Needed Each MapReduce iteration advances the “frontier” by one hop Subsequent iterations include more reachable nodes as frontier expands Multiple iterations are needed to explore entire graph Preserving graph structure: Problem: Where did the adjacency list go? Solution: mapper emits (n, adjacency list) as well Ugh! This is ugly!

BFS Pseudo-Code
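The pseudo-code on this slide did not survive extraction. The following is not the author's pseudo-code but a single-machine Python sketch of the scheme described on the preceding slides: the mapper emits the node's own distance, its adjacency list, and d + 1 for every neighbor; the reducer keeps the minimum distance and restores the adjacency list. The driver loop, the "no distance changed" stopping test, and the sample graph are assumptions made for illustration.

```python
from collections import defaultdict

INF = float("inf")


def mapper(node, value):
    distance, adjacency = value
    yield node, ("graph", adjacency)  # preserve the graph structure
    yield node, ("dist", distance)    # also emit distance to yourself
    if distance != INF:               # unreached nodes have nothing to propagate
        for neighbor in adjacency:
            yield neighbor, ("dist", distance + 1)


def reducer(node, values):
    best, adjacency = INF, []
    for kind, payload in values:
        if kind == "graph":
            adjacency = payload
        else:
            best = min(best, payload)  # select the minimum distance
    return node, (best, adjacency)


def bfs_iteration(state):
    """One map, shuffle, reduce round over {node: (distance, adjacency list)}."""
    shuffled = defaultdict(list)
    for node, value in state.items():
        for key, emitted in mapper(node, value):
            shuffled[key].append(emitted)  # shuffle: group by destination node
    return dict(reducer(node, values) for node, values in shuffled.items())


def parallel_bfs(adjacency_lists, source):
    state = {n: (0 if n == source else INF, adj) for n, adj in adjacency_lists.items()}
    iterations = 0
    while True:
        new_state = bfs_iteration(state)
        iterations += 1
        if new_state == state:  # no distance changed: the frontier is exhausted
            return {n: d for n, (d, _) in state.items()}, iterations
        state = new_state


# Hypothetical graph: every node must appear as a key, even with no outlinks.
SAMPLE = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
DISTANCES, ITERATIONS = parallel_bfs(SAMPLE, source=1)
print(DISTANCES, ITERATIONS)
```

Each round advances the frontier by one hop, and one extra round is spent confirming that nothing changed.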

Stopping Criterion (equal edge weight) How many iterations are needed in parallel BFS? Convince yourself: when a node is first “discovered”, we’ve found the shortest path What does it have to do with six degrees of separation? Practicalities of MapReduce implementation…

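Implementation Practicalities [Figure: each iteration reads from HDFS, runs map and reduce, and writes back to HDFS; a “Convergence?” check decides whether to run another iteration]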

Comparison to Dijkstra Dijkstra’s algorithm is more efficient At each step, only pursues edges from minimum-cost path inside frontier MapReduce explores all paths in parallel Lots of “waste” Useful work is only done at the “frontier” Why can’t we do better using MapReduce?

Single Source: Weighted Edges Now add positive weights to the edges. Simple change: add weight w for each edge in the adjacency list. In the mapper, emit (m, d + w_p) instead of (m, d + 1) for each node m. That’s it?

Stopping Criterion (positive edge weight) How many iterations are needed in parallel BFS? Convince yourself: when a node is first “discovered”, we’ve found the shortest path Not true!

Additional Complexities [Figure: a search frontier around source s, with nodes n1 through n9 and p, q, r; the edge weights shown include 10 and 1]
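A small Python sketch of why first discovery is not enough once edges carry weights. The graph below is hypothetical (the figure's exact edges are not recoverable from the transcript): the direct edge from s to a costs 10, while a three-hop detour costs 3. For brevity the graph is kept in a shared dictionary instead of being passed through the shuffle.

```python
from collections import defaultdict

INF = float("inf")

# Hypothetical graph: the direct edge s -> a is expensive, the detour is cheap.
WEIGHTED = {
    "s": {"a": 10, "b": 1},
    "b": {"c": 1},
    "c": {"a": 1},
    "a": {},
}


def iteration(distances, graph):
    """One map, shuffle, reduce round over {node: distance} with weighted edges."""
    shuffled = defaultdict(list)
    for node, d in distances.items():
        shuffled[node].append(d)  # keep the node's current distance
        if d != INF:
            for neighbor, w in graph[node].items():
                shuffled[neighbor].append(d + w)  # emit d + w instead of d + 1
    return {node: min(candidates) for node, candidates in shuffled.items()}


def weighted_sssp(graph, source):
    distances = {node: (0 if node == source else INF) for node in graph}
    history = [distances]
    while True:
        updated = iteration(distances, graph)
        if updated == distances:  # stop only when no distance changes
            return distances, history
        distances = updated
        history.append(distances)


FINAL, HISTORY = weighted_sssp(WEIGHTED, "s")
print([snapshot["a"] for snapshot in HISTORY])
print(FINAL)
```

Node a is discovered in the first round at distance 10 and only drops to 3 two rounds later, so the loop has to continue until an entire round changes nothing.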

Stopping Criterion (positive edge weight) How many iterations are needed in parallel BFS? Practicalities of MapReduce implementation…

Questions? Source: Wikipedia (Japanese rock garden)