Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.

Slides:



Advertisements
Similar presentations
Lecture 5 Graph Theory. Graphs Graphs are the most useful model with computer science such as logical design, formal languages, communication network,
Advertisements

Midwestern State University Department of Computer Science Dr. Ranette Halverson CMPS 2433 – CHAPTER 4 GRAPHS 1.
2/14/13CMPS 3120 Computational Geometry1 CMPS 3120: Computational Geometry Spring 2013 Planar Subdivisions and Point Location Carola Wenk Based on: Computational.
1 Discrete Structures & Algorithms Graphs and Trees: III EECE 320.
Greed is good. (Some of the time)
1 EE5900 Advanced Embedded System For Smart Infrastructure Static Scheduling.
Some Graph Problems. LINIAL’S CONJECTURE Backgound: In a partially ordered set we have Dilworth’s Theorem; The largest size of an independent set (completely.
Peer-to-Peer and Social Networks Centrality measures.
3.3 Spanning Trees Tucker, Applied Combinatorics, Section 3.3, by Patti Bodkin and Tamsen Hunter.
Breadth-First Search Seminar – Networking Algorithms CS and EE Dept. Lulea University of Technology 27 Jan Mohammad Reza Akhavan.
Fast FAST By Noga Alon, Daniel Lokshtanov And Saket Saurabh Presentation by Gil Einziger.
1 Discrete Structures & Algorithms Graphs and Trees: II EECE 320.
Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.
3 -1 Chapter 3 The Greedy Method 3 -2 The greedy method Suppose that a problem can be solved by a sequence of decisions. The greedy method has that each.
1 Separator Theorems for Planar Graphs Presented by Shira Zucker.
Introduction Outline The Problem Domain Network Design Spanning Trees Steiner Trees Triangulation Technique Spanners Spanners Application Simple Greedy.
K-Coloring k-coloring: A k-coloring of a graph G is a labeling f: V(G)  S, where |S|=k. The labels are colors; the vertices of one color form a color.
Chapter 11 Graphs and Trees This handout: Terminology of Graphs Eulerian Cycles.
Graphs and Trees This handout: Terminology of Graphs Applications of Graphs.
K-Coloring k-coloring: A k-coloring of a graph G is a labeling f: V(G)  S, where |S|=k. The labels are colors; the vertices of one color form a color.
The Shortest Path Problem
Network Measures Social Media Mining. 2 Measures and Metrics 2 Social Media Mining Network Measures Klout.
Social Media Mining Graph Essentials.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
GRAPH Learning Outcomes Students should be able to:
Midwestern State University Minimum Spanning Trees Definition of MST Generic MST algorithm Kruskal's algorithm Prim's algorithm 1.
© The McGraw-Hill Companies, Inc., Chapter 3 The Greedy Method.
Data Structures Week 9 Towards Weighted BFS So, far we have measured d s (v) in terms of number of edges in the path from s to v. Equivalent to assuming.
Graphs Chapter 10.
Graph Theoretic Concepts. What is a graph? A set of vertices (or nodes) linked by edges Mathematically, we often write G = (V,E)  V: set of vertices,
Chapter 2 Graph Algorithms.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Mining Social Network Graphs Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata November 13, 17, 2014.
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Sets.
Algorithm Course Dr. Aref Rashad February Algorithms Course..... Dr. Aref Rashad Part: 5 Graph Algorithms.
Chinese postman problem
TCP Traffic and Congestion Control in ATM Networks
1 CS104 : Discrete Structures Chapter V Graph Theory.
Content Addressable Network CAN. The CAN is essentially a distributed Internet-scale hash table that maps file names to their location in the network.
Minimum Spanning Trees Prof. Sin-Min Lee Dept. of Computer Science, San Jose State University.
Minimal Spanning Tree Problems in What is a minimal spanning tree An MST is a tree (set of edges) that connects all nodes in a graph, using.
Graphs A ‘Graph’ is a diagram that shows how things are connected together. It makes no attempt to draw actual paths or routes and scale is generally inconsequential.
Trees Dr. Yasir Ali. A graph is called a tree if, and only if, it is circuit-free and connected. A graph is called a forest if, and only if, it is circuit-free.
Discrete Structures CISC 2315 FALL 2010 Graphs & Trees.
Trees Thm 2.1. (Cayley 1889) There are nn-2 different labeled trees
Chapter 9: Graphs.
Great Theoretical Ideas in Computer Science for Some.
Prof. Swarat Chaudhuri COMP 382: Reasoning about Algorithms Fall 2015.
Chapter 05 Introduction to Graph And Search Algorithms.
Iterative Improvement for Domain-Specific Problems Lecturer: Jing Liu Homepage:
Jeffrey D. Ullman Stanford University.  Different algorithms for the same problem can be parallelized to different degrees.  The same activity can (sometimes)
Graphs David Kauchak cs302 Spring Admin HW 12 and 13 (and likely 14) You can submit revised solutions to any problem you missed Also submit your.
Midwestern State University Minimum Spanning Trees Definition of MST Generic MST algorithm Kruskal's algorithm Prim's algorithm 1.
CSC317 1 At the same time: Breadth-first search tree: If node v is discovered after u then edge uv is added to the tree. We say that u is a predecessor.
1 Graphs Chapters 10.1 and 10.2 University of Maryland Chapters 10.1 and 10.2 Based on slides by Y. Peng University of Maryland.
Chapter Chapter Summary Graphs and Graph Models Graph Terminology and Special Types of Graphs Representing Graphs and Graph Isomorphism Connectivity.
Graphs and Social Networks
Graph Graphs and graph theory can be used to model:
Redraw these graphs so that none of the line intersect except at the vertices B C D E F G H.
Community detection in graphs
Network Science: A Short Introduction i3 Workshop
Peer-to-Peer and Social Networks
Spanning Trees.
Betweeness Counting Triangles Transitive Closure
Counting Triangles Transitive Closure
Why Social Graphs Are Different Communities Finding Triangles
3.5 Minimum Cuts in Undirected Graphs
5.4 T-joins and Postman Problems
Chapter 10 Graphs and Trees
GRAPHS.
Presentation transcript:

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman

1. Facebook “Friends” graph.  Undirected graph, over a billion nodes, hundreds of billions of edges. 2. Twitter Followers.  Directed graph, three hundred million nodes, hundreds of billions of arcs. 3. Many other examples: telephone calls, s, Wikipedia articles and editors, coauthorship, etc., etc. 2

 These are not random graphs.  Community structure: if there is an edge (A,B) and an edge (B,C), it is more likely there is an edge (A,C).  Simple problem: divide a social network into disjoint communities (sets of nodes with a relatively high density of edges).  Harder problem: find overlapping communities.  More realistic case. 3

 Used to divide a graph into reasonable communities.  Roughly: the betweenness of an edge E is the number of pairs of nodes (A,B) for which the edge lies on the shortest path between A and B.  More precisely: if there are several shortest paths between A and B, then E is credited with the fraction of those paths on which it appears.  Edges of high betweenness separate communities. 4

5 AD C E FG B Edge (B,D) has betweenness = 12, since it is on the shortest path from each of {A,B,C} to each of {D,E,F,G}. Edge (G,F) has betweenness = 1, since it is on no shortest path other than that for its endpoints.

1. Perform a breadth-first search from each node of the graph. 2. Label nodes top-down to count the number of shortest paths from the root to that node. 3. Label both nodes and edges bottom-up with the fraction of shortest paths from the root to nodes at or below. 4. The betweenness of an edge is half the sum of the labels of that edge, starting with each node as root.  Half to avoid double-counting each edge. 6

7 AD C E FG B A D C E F G B Label of root = 1 Label of other nodes = sum of labels of parents BFS starting at E

8 AD C E FG B A D C E F G B Leaves get label 1 11 Edges get their share of their children 3 Interior nodes get 1 plus the sum of the edges below Split of G’s label is according to the path counts (black labels) of its parents D and F.

9 AD C E FG B A D C E F G B Edge (E,D) has label 4.5. This edge is on all shortest paths from E to A, B, C, and D. It is also on half the shortest paths from E to G. But on none of the shortest paths from E to F.

10 AD C E FG B

11 AD C E FG B A sensible partition into communities

12 AD C E FG B Why are A and C closer than B? B is a “traitor” to the community, being connected to D outside the group.

 Why Care? 1.Density of triangles measures maturity of a community.  As communities age, their members tend to connect. 2.The algorithm is actually an example of a recent and powerful theory of optimal join computation. 13

 Let the undirected graph have N nodes and M edges.  N < M < N 2.  One approach: Consider all N-choose-3 sets of nodes, and see if there are edges connecting all 3.  An O(N 3 ) algorithm.  Another approach: consider all edges e and all nodes u and see if both ends of e have edges to u.  An O(MN) algorithm. 14

 To find a better algorithm, we need to use the concept of a heavy hitter – a node with degree at least  M.  Note: there can be no more than 2  M heavy hitters, or the sum of the degrees of all nodes exceeds 2M.  A heavy-hitter triangle is one whose three nodes are all heavy hitters. 15

 Consider all triples of heavy hitters and see if there are edges between each pair of the three.  Takes time O(M 1.5 ), since there is a limit of 2  M on the number of heavy hitters. 16

 At least one node is not a heavy hitter.  Consider each edge e.  If both ends are heavy hitters, ignore.  Otherwise, let end node u not be a heavy hitter.  For each of the at most  M nodes v connected to u, see whether v is connected to the other end of e.  Takes time O(M 1.5 ).  M edges, and at most  M work with each. 17

 Both parts take O(M 1.5 ) time and together find any triangle in the graph.  For any N and M, you can find a graph with N nodes, M edges, and  (M 1.5 ) triangles, so no algorithm can do significantly better.  Note that M 1.5 can never be greater than the running times of the two obvious algorithms with which we began: N 3 and MN. 18

 If there is an edge between nodes u and v, then u is a neighbor of v and vice-versa.  The neighborhood of node u at distance d is the set of all nodes v such that there is a path of length at most d from u to v.  Denoted n(u,d).  Notice that if there are N nodes in a graph, then n(u,N-1) = n(u,N) = n(u,N+1) = … = all nodes reachable from u. 19

20 AD C E FG B n(E,0) = {E}; n(E,1) = {D,E,F}; n(E,2) = {B,D,E,F,G}; n(E,3) = {A,B,C,D,E,F,G}.

 The sizes of neighborhoods of small distance measure the “influence” a person has in a social network.  Note it is the size of the neighborhood, not the exact members of the neighborhood that is important here. 21

 n(u,0) = {u} for every u.  n(u,d) is the union of n(v, d-1) taken over every neighbor v of u.  Not really feasible for large graphs, since the neighborhoods get large, and taking the union requires examining the neighborhood of each neighbor.  To eliminate duplicates. 22

 Remember the Flagolet-Martin algorithm for estimating the number of distinct elements in a stream?  The same idea lets you estimate the number of distinct elements in the union of several sets.  Pick several hash functions.  Let h be one of these hash functions.  For each node u and distance d compute the maximum “tail” length among all nodes in n(u,d), using hash function h. 23

 Remember: if R is the maximum tail length in a set of values, then 2 R is a good estimate of the number of distinct elements in the set.  Since n(u,d) is the union of all neighbors v of u of n(v,d-1), the maximum tail length of members of n(u,d) is the largest of 1.The tail length of h(u), and 2.The maximum tail length for all the members of n(v,d-1) for any neighbor v of u. 24

 Thus, we have a recurrence for the maximum tail length of any neighbor of any node u, using any given hash function h.  Repeat for some chosen number of hash functions.  Combine estimates to get an estimate of neighborhood sizes, as for the Flagolet-Martin algorithm. 25