Leveraging Big Data: Lecture 11. Instructors: Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo
Today's topics:
- Graph datasets
- Mining graphs: which properties we look at and why
- Challenges with massive graphs
- Some techniques/algorithms for very large graphs: Min-Hash sketches of reachability sets; All-Distances Sketches
Graph Datasets: represent relations between "things". Example figures: the bowtie structure of the Web (Broder et al.); dolphin interactions.
Graph Datasets:
- Hyperlinks (the Web)
- Social graphs (Facebook, Twitter, LinkedIn, …)
- Logs: phone call logs, messages
- Commerce transactions (Amazon purchases)
- Road networks
- Communication networks
- Protein interactions
- …
Properties:
- Directed/undirected
- Snapshot or with a time dimension (dynamic)
- One or more types of entities (people, pages, products)
- Metadata associated with nodes
Some graphs are really large: billions of edges for the Facebook and Twitter graphs.
Mining the link structure: node/network-level properties:
- Connected / strongly connected components
- Diameter (longest shortest s-t path)
- Effective diameter (90th percentile of the pairwise distances)
- Distance distribution (number of pairs within each distance)
- Degree distribution
- Clustering coefficient: the fraction of connected triples (length-2 paths) that are closed into triangles (a sketch follows below)
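As an illustration of the last item, a minimal sketch of the global clustering coefficient, assuming an undirected graph given as a dict from node to neighbor set (names are illustrative):

from itertools import combinations

def clustering_coefficient(adj):
    """Global clustering coefficient: the fraction of connected triples
    (length-2 paths a-v-b) that are closed into a triangle by an a-b edge.
    Each triangle closes three triples, one centered at each of its nodes."""
    closed = triples = 0
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0

A triangle gives 1.0; a path a-b-c gives 0.0.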
Diameter. Example figure: a graph with diameter 3.
Distance distribution of the example graph: distance 1: 5 pairs; distance 2: 5 pairs; distance 3: 1 pair.
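Computed exactly, the distribution requires a BFS from every node; a minimal sketch, with the same adjacency-dict assumption as above (for an undirected graph each unordered pair is counted twice):

from collections import Counter, deque

def distance_distribution(adj):
    """Number of ordered pairs (s, v) at each distance: BFS from every
    source, O(n(n+m)) total, feasible only for modest graphs."""
    counts = Counter()
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for d in dist.values():
            if d > 0:
                counts[d] += 1
    return counts

The diameter, effective diameter, and the distribution above can all be read off counts; this all-pairs cost is what the sketches later in the lecture avoid.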
Triangles. Example figures: an open triangle (one edge missing) vs. a closed triangle (all three edges present).
Triangles: social graphs have many more closed triangles than random graphs; "communities" have more closed triangles.
…Mining the link structure:
- Centrality: who are the most important nodes?
- Similarity of nodes (link prediction, targeted ads, friend/product recommendations, metadata completion)
- Communities: sets of nodes that are more tightly related to each other than to others
- "Cover": a set of nodes with good coverage (facility location, influence maximization)
Communities example (overlapping): Star Wars, Ninjago, Bad Guys, Good Guys (figure series). The last example is also a good "cover".
Centrality (example figure: "central guys"). Which are the most important nodes? The answer depends on what we want to capture:
- Degree (in/out): largest number of followers, friends. Easy to compute locally. Spammable.
- Eigenvector (PageRank): your importance/reputation depends recursively on that of your friends.
- Betweenness: your value as a "hub", being on a shortest path between many pairs.
- Closeness: centrally located, able to quickly reach/infect many nodes (a BFS sketch follows below).
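As a concrete example of the last measure, a minimal BFS-based closeness computation for unweighted graphs (one standard variant of the definition; names are illustrative):

from collections import deque

def closeness(adj, v):
    """Closeness centrality of v (one standard variant):
    (#nodes reached - 1) / (sum of BFS distances from v)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0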
Centrality: example figure of a node that is central with respect to all measures.
Computing on Very Large Graphs. Many applications, platforms, algorithms:
- Clusters (MapReduce, Hadoop) when applicable
- Giraph/Pregel: better for edge traversals
- (Semi-)streaming: one or more passes, keeping a small amount of (per-node) state
General algorithm design principles:
- Settle for approximations
- Keep total computation/communication/storage "linear" in the size of the data
- Parallelize (minimize chains of dependencies)
- Localize dependencies
Next: node sketches (this lecture and the next one):
- Min-Hash sketches of reachability sets
- All-Distances Sketches (ADS)
- Connectivity sketches (Guha/McGregor)
Sketching: compute a sketch for each node, efficiently; from the sketch(es), estimate properties that are "harder" to compute exactly.
Review (lecture 2): Min-Hash Sketches. Each element x is assigned a random hash value h(x) ~ U[0,1]. The k=1 Min-Hash sketch of a set S is min_{x in S} h(x); a bottom-k sketch keeps the k smallest hash values in S.
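A minimal version of the reviewed construction; uhash below is an illustrative stand-in (a deterministic hash into [0,1)) for the idealized random hash:

import hashlib

def uhash(x):
    """Illustrative stand-in for an idealized random hash into [0, 1)."""
    d = hashlib.blake2b(repr(x).encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") / 2**64

def bottom_k_sketch(items, k):
    """Bottom-k Min-Hash sketch of a set: the k smallest hash values
    (k=1 gives the plain Min-Hash sketch, the minimum hash in the set)."""
    return sorted({uhash(x) for x in items})[:k]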
Sketching Reachability Sets
Reachability Set of Size 4
Reachability Set of Size 13
Why sketch reachability sets? From reachability sketch(es) we can:
- Estimate the cardinality of the reachability set
- Get a sample of the reachable nodes
- Estimate relations between reachability sets (e.g., Jaccard similarity)
(Estimator sketches follow below.)
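Minimal sketches of the standard bottom-k estimators for the first and third items above, assuming sketches are sorted lists of hash values in [0,1) as in the review:

def estimate_cardinality(sk, k):
    """Bottom-k cardinality estimate (k-1)/t, with t the k-th smallest
    hash; a sketch with fewer than k values holds the whole set exactly."""
    return float(len(sk)) if len(sk) < k else (k - 1) / sk[k - 1]

def estimate_jaccard(sk_a, sk_b, k):
    """The k smallest hashes of A ∪ B are computable from the two bottom-k
    sketches; the fraction of them present in both sketches estimates
    the Jaccard similarity |A ∩ B| / |A ∪ B|."""
    union_bottom_k = sorted(set(sk_a) | set(sk_b))[:k]
    if not union_bottom_k:
        return 0.0
    both = set(sk_a) & set(sk_b)
    return sum(1 for h in union_bottom_k if h in both) / len(union_bottom_k)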
Min-Hash Sketches of All Reachability Sets (k=1). Example figure: each node's sketch is the minimum hash value over its reachability set, e.g., {0.23}, {0.06}, {0.12}, {0.23}.
Min-Hash Sketches of All Reachability Sets: bottom-2. Example figure: {0.06, 0.12}, {0.12, 0.23}, {0.23, 0.37}.
Next: computing Min-Hash sketches of all reachability sets efficiently. Algorithms/methods:
- Graph searches (say BFS); a code sketch follows the example below
- Dynamic programming / distributed (message passing)
Example figure (as before): the resulting k=1 sketches are {0.23}, {0.06}, {0.12}, {0.23}.
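A minimal sketch of the BFS method, in the spirit of Cohen 1997 (cited in the bibliography); the graph representation (node list, directed edge list) and names are illustrative:

import random
from collections import defaultdict, deque

def reachability_sketches_bfs(nodes, edges, k=1, seed=0):
    """Bottom-k Min-Hash sketches of all reachability sets (BFS method).
    Process nodes in increasing hash order; from each node u, BFS over
    *reversed* edges and append h(u) to the sketch of every node that can
    reach u.  Prune at nodes whose sketch already holds k values: such a
    node, and anything that reaches u only through it, already has k
    smaller hashes of reachable nodes."""
    rng = random.Random(seed)
    h = {v: rng.random() for v in nodes}
    rev = defaultdict(list)
    for a, b in edges:
        rev[b].append(a)
    sketch = {v: [] for v in nodes}
    for u in sorted(nodes, key=h.get):          # increasing hash value
        queue, seen = deque([u]), {u}
        while queue:
            v = queue.popleft()
            if len(sketch[v]) >= k:             # prune: sketch is full
                continue
            sketch[v].append(h[u])              # values arrive sorted
            for w in rev[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return h, sketch

Each node appends at most k values and scans its reverse adjacency list once per append, so the total number of edge traversals is at most km: far cheaper than n independent searches.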
Parallelizing BFS-based Min-Hash computation: how can we reduce dependencies? (Figure series.) The construction is applied recursively to each of the lower/higher sets, with super-nodes created in the recursion.
Next: Computing sketches using the BFS method for k>1
Min-Hash sketches of all Reachability Sets: bottom-2, via the BFS method. Example figure: partially filled sketches {0.06, }, {0.12, }, {0.23, }.
Dynamic programming (message passing), k=1 example (figure series). Initialize: each node's sketch holds its own hash value ({0.45}, {0.95}, {0.32}, {0.69}, {0.06}, {0.28}, {0.93}, {0.77}, {0.34}, {0.12}, {0.37}, {0.85}, {0.23}). Send to in-neighbors. Update: each node keeps the minimum of its own value and the values received. If updated, send to in-neighbors again; update. If updated, send to in-neighbors. Done: each node holds the minimum hash over its reachability set (e.g., {0.23}, {0.06}, {0.12}, {0.23} in the figure).
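One way the dynamic program just illustrated can look in code, for general bottom-k with synchronous rounds (a minimal sketch; the representation and names are illustrative):

import random
from collections import defaultdict

def reachability_sketches_dp(nodes, edges, k=1, seed=0):
    """Message-passing (dynamic programming) version: every node starts
    with its own hash (it reaches itself); whenever a node's bottom-k set
    improves, it sends the set to its in-neighbors, which merge it in:
    whatever a node reaches through an out-neighbor, it reaches too."""
    rng = random.Random(seed)
    h = {v: rng.random() for v in nodes}
    rev = defaultdict(list)
    for a, b in edges:
        rev[b].append(a)
    sketch = {v: {h[v]} for v in nodes}
    updated = set(nodes)                        # changed in the last round
    while updated:
        nxt = set()
        for v in updated:                       # v sends upstream
            for w in rev[v]:
                merged = set(sorted(sketch[w] | sketch[v])[:k])
                if merged != sketch[w]:
                    sketch[w] = merged
                    nxt.add(w)
        updated = nxt
    return h, {v: sorted(s) for v, s in sketch.items()}

Sketch values only decrease, so the process terminates; the next two slides bound the total work and the length of the dependency chains.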
Analysis of DP: edge traversals. A node's sketch only improves over time; in expectation, the minimum hash over a growing set changes O(log n) times, and each change is sent to the node's in-neighbors, so the expected total number of edge traversals is O(m log n) (and O(km log n) for bottom-k).
Analysis of DP: dependencies. The longest chain of dependencies is at most the largest shortest-path distance (the diameter of the graph): after round d, each node's sketch already accounts for every node within distance d of it.
Next: All-Distances Sketches (ADS). Often we care about distance, not only reachability: nodes that are closer to you, in distance or in Dijkstra (nearest-neighbor) rank, are more meaningful. We want a sketch that supports distance-based queries.
Applications of ADSs: estimate node/subset/network-level properties that are expensive to compute exactly, e.g., the distance distribution, effective diameter, and closeness centrality discussed earlier. A construction sketch for the unweighted case follows.
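A construction sketch extending the pruned BFS above; this follows the ADS literature in the bibliography, but the code and names here are illustrative assumptions, not the lecture's exact algorithm. A pair (d, h(u)) enters v's ADS iff fewer than k smaller-hash nodes are already known within distance d of v:

import random
from collections import defaultdict, deque

def all_distances_sketches(nodes, edges, k=1, seed=0):
    """ADS of every node in an unweighted directed graph.  Process nodes
    in increasing hash order; BFS backwards from each node u and insert
    (d(v, u), h(u)) into v's ADS unless v already holds k (necessarily
    smaller) hashes at distance <= d(v, u); prune where insertion fails."""
    rng = random.Random(seed)
    h = {v: rng.random() for v in nodes}
    rev = defaultdict(list)
    for a, b in edges:
        rev[b].append(a)
    ads = {v: [] for v in nodes}                # entries (distance, hash)
    for u in sorted(nodes, key=h.get):
        dist = {u: 0}
        queue = deque([u])
        while queue:
            v = queue.popleft()
            d = dist[v]
            if sum(1 for dd, _ in ads[v] if dd <= d) >= k:
                continue                        # prune: h(u) is not kept
            ads[v].append((d, h[u]))
            for w in rev[v]:
                if w not in dist:
                    dist[w] = d + 1
                    queue.append(w)
    return h, ads

Each ADS has expected size O(k log n); the (distance, hash) entries support the distance-based estimates above, e.g., neighborhood cardinalities at each distance and, from those, closeness centrality and the distance distribution.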
Bibliography. Recommended further reading on social network analysis:
- Book: "Networks, Crowds, and Markets: Reasoning About a Highly Connected World" by David Easley and Jon Kleinberg. s-book/
- Course/lectures by Lada Adamic:
Bibliography. Reachability Min-Hash sketches, All-Distances Sketches (lectures 11-12):
- E. Cohen, "Size-Estimation Framework with Applications to Transitive Closure and Reachability", JCSS 1997.
- E. Cohen and H. Kaplan, "Spatially-decaying aggregation over a network", JCSS 2007.
- E. Cohen and H. Kaplan, "Summarizing data using bottom-k sketches", PODC 2007.
- E. Cohen, "All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis", arXiv 2013.