Graphs (Part I) Shannon Quinn (with thanks to William Cohen of CMU and Jure Leskovec, Anand Rajaraman, and Jeff Ullman of Stanford University)

Why I’m talking about graphs Lots of large data is graphs – Facebook, Twitter, citation data, and other social networks – The web, the blogosphere, the semantic web, Freebase, Wikipedia, Twitter, and other information networks – Text corpora (like RCV1), large datasets with discrete feature values, and other bipartite networks: nodes = documents or words; links connect document → word or word → document – Computer networks, biological networks (proteins, ecosystems, brains, …), … – Heterogeneous networks with multiple types of nodes: people, groups, documents

Properties of Graphs Descriptive Statistics Number of connected components Diameter Degree distribution … Models of Growth/Formation Erdos-Renyi Preferential attachment Stochastic block models …. Let’s look at some examples of graphs … but first, why are these statistics important?

An important question How do you explore a dataset? – compute statistics (e.g., feature histograms, conditional feature histograms, correlation coefficients, …) – sample and inspect run a bunch of small-scale experiments How do you explore a graph? – compute statistics (degree distribution, …) – sample and inspect how do you sample?

Protein-Protein Interactions Can we identify functional modules? Nodes: Proteins Edges: Physical interactions (J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets)

Protein-Protein Interactions Functional modules Nodes: Proteins Edges: Physical interactions

Facebook Network Can we identify social communities? Nodes: Facebook Users Edges: Friendships

Facebook Network Social communities: High school, Summer internship, Stanford (Squash), Stanford (Basketball) Nodes: Facebook Users Edges: Friendships

Blogs

Citations

Blogs

How do you model graphs?

What are we trying to do? Normal curve: Data to fit Underlying process that “explains” the data Closed-form parametric form Only models a small part of the data

Graphs Set V of vertices/nodes v1,…, set E of edges (u,v),…. – Edges might be directed and/or weighted and/or labeled Degree of v is #edges touching v – Indegree and Outdegree for directed graphs Path is sequence of edges (v0,v1),(v1,v2),…. Geodesic path between u and v is shortest path connecting them – Diameter is max over u,v in V of {length of geodesic between u,v} – Effective diameter is 90th percentile of geodesic lengths – Mean diameter is mean geodesic length over connected pairs (Connected) component is subset of nodes that are all pairwise connected via paths Clique is subset of nodes that are all pairwise connected via edges Triangle is a clique of size three
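
The definitions on this slide can be sketched in a few lines of standard-library Python. This is a minimal illustration for small undirected, unweighted graphs, not an approach that scales to the graphs discussed in the lecture; the function names and the toy graph are made up for the example, and the 90th percentile uses the nearest-rank convention, which is an assumption.

```python
from collections import deque


def build_adjacency(nodes, edges):
    """Undirected adjacency sets from a node list and (u, v) edge pairs."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def bfs_distances(adj, source):
    """Geodesic (shortest-path) length from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def connected_components(adj):
    """List of node sets; nodes in one set are pairwise connected via paths."""
    seen, components = set(), []
    for v in adj:
        if v not in seen:
            comp = set(bfs_distances(adj, v))
            seen |= comp
            components.append(comp)
    return components


def geodesic_lengths(adj):
    """Sorted geodesic lengths, one per connected unordered pair of nodes."""
    lengths = []
    for u in adj:
        for w, d in bfs_distances(adj, u).items():
            if u < w:  # count each pair once; needs comparable node labels
                lengths.append(d)
    return sorted(lengths)


def diameter_stats(adj):
    """(diameter, effective diameter, mean diameter); needs at least one edge."""
    lengths = geodesic_lengths(adj)
    n = len(lengths)
    rank = (9 * n + 9) // 10  # ceil(0.9 * n): nearest-rank 90th percentile
    return lengths[-1], lengths[rank - 1], sum(lengths) / n


def count_triangles(adj):
    """Number of cliques of size three, each found once via u < v < w."""
    return sum(1 for u in adj for v in adj[u] if u < v
               for w in adj[u] & adj[v] if v < w)


# Toy graph: triangle 0-1-2 with a tail 2-3-4, plus an isolated node 5.
toy = build_adjacency(range(6), [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)])
degrees = {v: len(nbrs) for v, nbrs in toy.items()}
```

All-pairs BFS costs O(|V|·|E|), so on large graphs these statistics are usually estimated from a sample of source nodes instead.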

Graphs Some common properties of graphs: – Distribution of node degrees – Distribution of cliques (e.g., triangles) – Distribution of paths Diameter (max shortest-path) Effective diameter (90th percentile) Connected components – … Some types of graphs to consider: – Real graphs (social & otherwise) – Generated graphs: Erdos-Renyi “Bernoulli” or “Poisson” Watts-Strogatz “small world” graphs Barabasi-Albert “preferential attachment” …

Erdos-Renyi graphs Take n nodes, and connect each pair with probability p – Mean degree is z=p(n-1) – Mean number of neighbors at distance d from v is z^d – How large does d need to be so that z^d >= n? If z>1, d = log(n)/log(z) If z<1, you can’t do it – So: There tend to be either many small components (z<1) or one giant connected component (z>1) – Another intuition: If there are two large connected components, then with high probability a few random edges will link them up.
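
The mean-degree formula and the component behaviour on either side of z = 1 can be checked by simulation. The sketch below is an illustration only: the function names, the graph size of 800 nodes, and the seed are arbitrary choices, and the O(n²) pair loop is only practical for small n.

```python
import random
from collections import deque


def erdos_renyi(n, p, rng):
    """G(n, p): connect each of the n(n-1)/2 pairs independently with prob. p."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def mean_degree(adj):
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def largest_component_fraction(adj):
    """Fraction of all nodes that lie in the largest connected component."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best / len(adj)


def simulate(n, z, seed):
    """Build G(n, p) with p = z / (n - 1), so the expected mean degree is z."""
    rng = random.Random(seed)
    g = erdos_renyi(n, z / (n - 1), rng)
    return mean_degree(g), largest_component_fraction(g)


for z in (0.5, 3.0):
    deg, frac = simulate(800, z, seed=0)
    print(f"z={z}: mean degree {deg:.2f}, largest component {frac:.1%} of nodes")
```

With z below 1 the largest component should hold only a small share of the nodes, and with z well above 1 it should hold most of them.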

Erdos-Renyi graphs Take n nodes, and connect each pair with probability p – Mean degree is z=p(n-1) – Mean number of neighbors at distance d from v is z^d – How large does d need to be so that z^d >= n? If z>1, d = log(n)/log(z) If z<1, you can’t do it – So: If z>1, diameters tend to be small (relative to n)
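
The bound on d follows by taking logarithms. The step below treats the neighborhood growth as exactly z to the power d, which is only a rough approximation once the neighborhood becomes comparable in size to the whole graph.

```latex
\[
z^{d} \ge n
\;\Longleftrightarrow\;
d \,\log z \ge \log n
\;\Longleftrightarrow\;
d \ge \frac{\log n}{\log z}
\qquad (z > 1,\ \text{so } \log z > 0).
\]
```

For example, n = 10^6 nodes with mean degree z = 10 gives d = 6. When z < 1, z^d shrinks as d grows, so it never reaches n.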

Sociometry, Vol. 32, No. 4. (Dec., 1969), pp of 296 chains succeed, avg chain length is 6.2

Illustrations of the Small World Milgram’s experiment – Erdős numbers – Bacon numbers – LinkedIn – Privacy issues: the whole network is not visible to all

Erdos-Renyi graphs Take n nodes, and connect each pair with probability p – Mean degree is z=p(n-1) This is usually not a good model of degree distribution in natural networks

Degree distribution Plot cumulative degree – X axis is degree – Y axis is #nodes that have degree at least k Typically use a log-log scale – Straight lines are a power law; normal curve dives to zero at some point This defines a “scale” for the network – Left: trust network in epinions web site from Richardson & Domingos
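
The points of the cumulative degree plot can be computed directly from a list of node degrees. This is a minimal sketch with illustrative function names; the least-squares slope on log-log axes is only a rough straight-line diagnostic, not a rigorous power-law fit.

```python
import math
from collections import Counter


def cumulative_degree_counts(degrees):
    """Sorted (k, #nodes with degree >= k) pairs, one per observed degree k."""
    counts = Counter(degrees)
    pairs, at_least = [], 0
    for k in sorted(counts, reverse=True):
        at_least += counts[k]
        pairs.append((k, at_least))
    pairs.reverse()
    return pairs


def loglog_slope(pairs):
    """Least-squares slope of log10(count) against log10(k), skipping k = 0."""
    pts = [(math.log10(k), math.log10(c)) for k, c in pairs if k > 0 and c > 0]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    return sxy / sxx


example = cumulative_degree_counts([1, 1, 1, 2, 2, 3])
```

On log-log axes a roughly constant slope over several orders of magnitude suggests a power law, while a curve that bends sharply downward indicates a characteristic scale.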

Friendship network in Framingham Heart Study

Graphs Some common properties of graphs: – Distribution of node degrees: often scale-free – Distribution of cliques (e.g., triangles) – Distribution of paths Diameter (max shortest-path) Effective diameter (90th percentile) often small Connected components usually one giant CC – … Some types of graphs to consider: – Real graphs (social & otherwise) – Generated graphs: Erdos-Renyi “Bernoulli” or “Poisson” Watts-Strogatz “small world” graphs Barabasi-Albert “preferential attachment” generates scale-free graphs …

Barabasi-Albert Networks Science 286 (1999) Start from a small number of nodes, add a new node with m links Preferential Attachment: probability of these links to connect to existing nodes is proportional to the node’s degree ‘Rich gets richer’ This creates ‘hubs’: few nodes with very large degrees
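
A small generator makes the rich-get-richer rule concrete. The sketch below rests on two assumptions not stated on the slide: the starting graph is a clique on m + 1 nodes, and each new node links to m distinct existing nodes. Degree-proportional sampling is done by drawing uniformly from a list in which every node appears once per incident edge.

```python
import random


def barabasi_albert(n, m, rng):
    """Grow an n-node graph; each new node attaches to m existing nodes."""
    if not 1 <= m < n:
        raise ValueError("need 1 <= m < n")
    adj = {v: set() for v in range(n)}
    endpoints = []  # node v appears deg(v) times in this list

    # Assumed seed graph: a clique on the first m + 1 nodes.
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            adj[u].add(v)
            adj[v].add(u)
            endpoints += [u, v]

    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            # A uniform draw from endpoints is proportional to current degree.
            targets.add(rng.choice(endpoints))
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            endpoints += [new, t]
    return adj


graph = barabasi_albert(2000, 2, random.Random(0))
degree_list = sorted((len(nbrs) for nbrs in graph.values()), reverse=True)
print("largest degrees:", degree_list[:5], "mean:", sum(degree_list) / 2000)
```

The largest degrees should come out far above the mean of roughly 2m, which is the hub effect the slide describes.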

Preferential attachment (Barabasi-Albert) Random graph (Erdos Renyi)

Graphs Some common properties of graphs: – Distribution of node degrees: often scale-free – Distribution of cliques (e.g., triangles) – Distribution of paths Diameter (max shortest-path) Effective diameter (90th percentile) often small Connected components usually one giant CC – … Some types of graphs to consider: – Real graphs (social & otherwise) – Generated graphs: Erdos-Renyi “Bernoulli” or “Poisson” Watts-Strogatz “small world” graphs Barabasi-Albert “preferential attachment” generates scale-free graphs …

Homophily One definition: excess edges between similar nodes – E.g., assume nodes are male and female and Pr(male)=p, Pr(female)=q. – Is Pr(gender(u)≠ gender(v) | edge (u,v)) >= 2pq? Another def’n: excess edges between common neighbors of v
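
The first definition can be tested by comparing the observed share of cross-group edges with the 2pq baseline. In this sketch p and q are taken to be the node fractions of the two groups, as on the slide; the function names and the four-node example are illustrative. If similar nodes attract excess edges, the observed cross-group share falls below 2pq.

```python
from collections import Counter


def cross_edge_fraction(edges, label):
    """Observed Pr(label(u) != label(v) | edge (u, v))."""
    cross = sum(1 for u, v in edges if label[u] != label[v])
    return cross / len(edges)


def expected_cross_fraction(label):
    """Baseline 2pq for a two-valued labelling, p and q being node fractions."""
    counts = Counter(label.values())
    if len(counts) != 2:
        raise ValueError("expected exactly two label values")
    a, b = counts.values()
    n = a + b
    return 2 * (a / n) * (b / n)


# Toy example: two same-group edges and one cross-group edge.
label = {0: "m", 1: "m", 2: "f", 3: "f"}
edges = [(0, 1), (2, 3), (0, 2)]
observed = cross_edge_fraction(edges, label)
baseline = expected_cross_fraction(label)
print(f"observed {observed:.3f} vs baseline 2pq = {baseline:.3f}")
```

A stricter baseline would weight each group by its share of edge endpoints rather than its share of nodes, since high-degree groups contribute more edge ends.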

Homophily In a random Erdos-Renyi graph: In natural graphs two of your mutual friends might well be friends: Like you they are both in the same class (club, field of CS, …) You introduced them
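
One common way to quantify edges between the neighbors of a node is the local clustering coefficient: the fraction of neighbor pairs that are themselves linked. The slide does not name this measure, so using it here is an assumption. In an Erdos-Renyi graph any two neighbors are linked with probability p, like any other pair, so a value well above p points to this kind of homophily.

```python
def local_clustering(adj, v):
    """Fraction of pairs of v's neighbors that are directly connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention: undefined cases count as zero
    # Each edge between two neighbors is seen from both ends, hence // 2.
    links = sum(1 for u in nbrs for w in adj[u] if w in nbrs) // 2
    return links / (k * (k - 1) / 2)


def average_clustering(adj):
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Toy graph: triangle 0-1-2 with a pendant node 3 attached to node 2.
friends = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print({v: round(local_clustering(friends, v), 3) for v in friends})
```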

Watts-Strogatz model Start with a ring Connect each node to k nearest neighbors → homophily Add some random shortcuts from one point to another → small diameter Degree distribution not scale-free Generalizes to d dimensions
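
The construction on this slide can be sketched as a ring lattice plus added shortcuts. The code follows the slide's wording and adds shortcut edges, rather than rewiring existing ones as in the original Watts-Strogatz procedure; the function names, the sizes, and the number of shortcuts are arbitrary choices for the illustration.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Ring of n nodes, each linked to its k nearest neighbors (k/2 per side)."""
    if k % 2 or not 0 < k < n:
        raise ValueError("k must be even with 0 < k < n")
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for j in range(1, k // 2 + 1):
            w = (v + j) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj


def add_shortcuts(adj, count, rng):
    """Add `count` new edges between uniformly chosen unlinked node pairs."""
    nodes = sorted(adj)
    n = len(nodes)
    existing = sum(len(nbrs) for nbrs in adj.values()) // 2
    if count > n * (n - 1) // 2 - existing:
        raise ValueError("not enough unlinked pairs for that many shortcuts")
    added = 0
    while added < count:
        u, v = rng.sample(nodes, 2)
        if v not in adj[u]:
            adj[u].add(v)
            adj[v].add(u)
            added += 1
    return adj


def mean_distance(adj):
    """Mean geodesic length over ordered pairs of distinct connected nodes."""
    total = pairs = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


ring = ring_lattice(100, 4)
small_world = add_shortcuts(ring_lattice(100, 4), 10, random.Random(0))
print("ring:", mean_distance(ring), "with shortcuts:", mean_distance(small_world))
```

Even a handful of shortcuts shortens typical paths noticeably, while most of each node's links still go to its ring neighbors.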

An important question How do you explore a dataset? – compute statistics (e.g., feature histograms, conditional feature histograms, correlation coefficients, …) – sample and inspect run a bunch of small-scale experiments How do you explore a graph? – compute statistics (degree distribution, …) – sample and inspect how do you sample?

KDD 2006

Brief summary Define goals of sampling: – “scale-down” – find G’<G with similar statistics – “back in time”: for a growing G, find G’<G that is similar (statistically) to an earlier version of G Experiment on real graphs with plausible sampling methods, such as – RN – random nodes, sampled uniformly – … See how well they perform

Brief summary Experiment on real graphs with plausible sampling methods, such as – RN – random nodes, sampled uniformly – RPN – random nodes, sampled by PageRank – RDP – random nodes, sampled by in-degree – RE – random edges – RJ – run PageRank’s “random surfer” for n steps – RW – run RWR’s “random surfer” for n steps – FF – repeatedly pick r(i) neighbors of i to “burn”, and then recursively sample from them
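
Three of the listed samplers (RN, RE, and RW) are simple enough to sketch on an adjacency-set graph. This is an illustration under assumptions, not the procedure used in the KDD 2006 experiments: the restart probability of 0.15, the step cap, and all function names are choices made for the example, and RW here stops once a target number of distinct nodes has been visited.

```python
import random


def induced_subgraph(adj, keep):
    """Subgraph on the node set `keep`, retaining edges with both ends kept."""
    return {v: adj[v] & keep for v in keep}


def random_node_sample(adj, k, rng):
    """RN: pick k nodes uniformly and keep the edges among them."""
    keep = set(rng.sample(sorted(adj), k))
    return induced_subgraph(adj, keep)


def random_edge_sample(adj, k, rng):
    """RE: pick k edges uniformly and keep exactly those edges."""
    edges = sorted((u, v) for u in adj for v in adj[u] if u < v)
    sample = {}
    for u, v in rng.sample(edges, k):
        sample.setdefault(u, set()).add(v)
        sample.setdefault(v, set()).add(u)
    return sample


def random_walk_sample(adj, target_size, rng, restart=0.15, max_steps=None):
    """RW: walk from a random start, jumping back to it with prob. `restart`."""
    start = rng.choice(sorted(adj))
    current, visited, steps = start, {start}, 0
    if max_steps is None:
        max_steps = 100 * target_size  # assumed cap so the loop always ends
    while len(visited) < target_size and steps < max_steps:
        steps += 1
        if rng.random() < restart or not adj[current]:
            current = start
        else:
            current = rng.choice(sorted(adj[current]))
        visited.add(current)
    return induced_subgraph(adj, visited)
```

Each sampler returns a graph in the same adjacency-set form, so the same statistics can be computed on the sample and on the full graph and then compared.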

10% sample – pooled on five datasets

d-statistic measures agreement between distributions: D = max over x of |F(x) − F’(x)|, where F, F’ are CDFs; max over nine different statistics
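
For two empirical samples, such as the degree lists of the full graph and of a sampled graph, D can be computed from the two empirical CDFs. The sketch below evaluates both CDFs at every observed value, which is where the largest gap between two step functions must occur; the function name is illustrative.

```python
from bisect import bisect_right


def d_statistic(sample_a, sample_b):
    """D = max over x of |F(x) - F'(x)| for the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


print(d_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

D is 0 when the two samples have identical distributions and 1 when their ranges do not overlap at all.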

Graphs (Part II) To be continued…