Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine WEB GRAPHS.

Slides:



Advertisements
Similar presentations
The Structure of the Web Mark Levene (Follow the links to learn more!)
Advertisements

Complex Networks Advanced Computer Networks: Part1.
Scale Free Networks.
Analysis and Modeling of Social Networks Foudalis Ilias.
VL Netzwerke, WS 2007/08 Edda Klipp 1 Max Planck Institute Molecular Genetics Humboldt University Berlin Theoretical Biophysics Networks in Metabolism.
Information Networks Generative processes for Power Laws and Scale-Free networks Lecture 4.
Generative Models for the Web Graph José Rolim. Aim Reproduce emergent properties: –Distribution site size –Connectivity of the Web –Power law distriubutions.
Information Retrieval Lecture 8 Introduction to Information Retrieval (Manning et al. 2007) Chapter 19 For the MSc Computer Science Programme Dell Zhang.
Information Networks Small World Networks Lecture 5.
Advanced Topics in Data Mining Special focus: Social Networks.
CS 599: Social Media Analysis University of Southern California1 The Basics of Network Analysis Kristina Lerman University of Southern California.
Web Graph Characteristics Kira Radinsky All of the following slides are courtesy of Ronny Lempel (Yahoo!)
Hierarchy in networks Peter Náther, Mária Markošová, Boris Rudolf Vyjde : Physica A, dec
1 Evolution of Networks Notes from Lectures of J.Mendes CNR, Pisa, Italy, December 2007 Eva Jaho Advanced Networking Research Group National and Kapodistrian.
Topology Generation Suat Mercan. 2 Outline Motivation Topology Characterization Levels of Topology Modeling Techniques Types of Topology Generators.
Complex Networks Third Lecture TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AA TexPoint fonts used in EMF. Read the.
Network Models Social Media Mining. 2 Measures and Metrics 2 Social Media Mining Network Models Why should I use network models? In may 2011, Facebook.
Scale-free networks Péter Kómár Statistical physics seminar 07/10/2008.
Small-World Graphs for High Performance Networking Reem Alshahrani Kent State University.
Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.
WEB GRAPHS (Chap 3 of Baldi) Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering, National Cheng Kung University 2005/10/6.
Network Statistics Gesine Reinert. Yeast protein interactions.
CS 728 Lecture 4 It’s a Small World on the Web. Small World Networks It is a ‘small world’ after all –Billions of people on Earth, yet every pair separated.
Web as Graph – Empirical Studies The Structure and Dynamics of Networks.
Peer-to-Peer and Grid Computing Exercise Session 3 (TUD Student Use Only) ‏
Basic WWW Technologies 2.1 Web Documents. 2.2 Resource Identifiers: URI, URL, and URN. 2.3 Protocols. 2.4 Log Files. 2.5 Search Engines.
Decoding the Structure of the WWW : A Comparative Analysis of Web Crawls AUTHORS: M.Angeles Serrano Ana Maguitman Marian Boguna Santo Fortunato Alessandro.
Advanced Topics in Data Mining Special focus: Social Networks.
Graphs and Topology Yao Zhao. Background of Graph A graph is a pair G =(V,E) –Undirected graph and directed graph –Weighted graph and unweighted graph.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 7 May 14, 2006
Network analysis and applications Sushmita Roy BMI/CS 576 Dec 2 nd, 2014.
Summary from Previous Lecture Real networks: –AS-level N= 12709, M=27384 (Jan 02 data) route-views.oregon-ix.net, hhtp://abroude.ripe.net/ris/rawdata –
Computer Science 1 Web as a graph Anna Karpovsky.
Network Measures Social Media Mining. 2 Measures and Metrics 2 Social Media Mining Network Measures Klout.
Peer-to-Peer and Social Networks Random Graphs. Random graphs E RDÖS -R ENYI MODEL One of several models … Presents a theory of how social webs are formed.
Large-scale organization of metabolic networks Jeong et al. CS 466 Saurabh Sinha.
Information Networks Power Laws and Network Models Lecture 3.
(Social) Networks Analysis III Prof. Dr. Daning Hu Department of Informatics University of Zurich Oct 16th, 2012.
Topic 13 Network Models Credits: C. Faloutsos and J. Leskovec Tutorial
1 Applications of Relative Importance  Why is relative importance interesting? Web Social Networks Citation Graphs Biological Data  Graphs become too.
Network properties Slides are modified from Networks: Theory and Application by Lada Adamic.
COM1721: Freshman Honors Seminar A Random Walk Through Computing Lecture 2: Structure of the Web October 1, 2002.
Complex Networks First Lecture TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AA TexPoint fonts used in EMF. Read the.
School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2013 Figures are taken.
Social Network Analysis Prof. Dr. Daning Hu Department of Informatics University of Zurich Mar 5th, 2013.
Challenges and Opportunities Posed by Power Laws in Network Analysis Bruno Ribeiro UMass Amherst MURI REVIEW MEETING Berkeley, 26 th Oct 2011.
Mathematics of Networks (Cont)
3. SMALL WORLDS The Watts-Strogatz model. Watts-Strogatz, Nature 1998 Small world: the average shortest path length in a real network is small Six degrees.
Lecture 10: Network models CS 765: Complex Networks Slides are modified from Networks: Theory and Application by Lada Adamic.
The Structure of the Web. Getting to knowing the Web How big is the web and how do you measure it? How many people use the web? How many use search engines?
Most of contents are provided by the website Network Models TJTSD66: Advanced Topics in Social Media (Social.
Clusters Recognition from Large Small World Graph Igor Kanovsky, Lilach Prego Emek Yezreel College, Israel University of Haifa, Israel.
How Do “Real” Networks Look?
Urban Traffic Simulated From A Dual Perspective Hu Mao-Bin University of Science and Technology of China Hefei, P.R. China
Models of Web-Like Graphs: Integrated Approach
GRAPH AND LINK MINING 1. Graphs - Basics 2 Undirected Graphs Undirected Graph: The edges are undirected pairs – they can be traversed in any direction.
Cmpe 588- Modeling of Internet Emergence of Scale-Free Network with Chaotic Units Pulin Gong, Cees van Leeuwen by Oya Ünlü Instructor: Haluk Bingöl.
Lecture 23: Structure of Networks
How Do “Real” Networks Look?
Lecture 23: Structure of Networks
Milgram’s experiment really demonstrated two striking facts about large social networks: first, that short paths are there in abundance;
Network Science: A Short Introduction i3 Workshop
The Watts-Strogatz model
How Do “Real” Networks Look?
How Do “Real” Networks Look?
Peer-to-Peer and Social Networks Fall 2017
How Do “Real” Networks Look?
Lecture 23: Structure of Networks
Advanced Topics in Data Mining Special focus: Social Networks
Advanced Topics in Data Mining Special focus: Social Networks
Presentation transcript:

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine WEB GRAPHS

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 2 Internet/Web as Graphs Graph of the physical layer with routers, computers etc as nodes and physical connections as edges –It is limited –Does not capture the graphical connections associated with the information on the Internet Web Graph where nodes represent web pages and edges are associated with hyperlinks

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 3 Web Graph

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 4 Web Graph Considerations Edges can be directed or undirected Graph is highly dynamic –Nodes and edges are added/deleted often –Content of existing nodes is also subject to change –Pages and hyperlinks created on the fly Apart from primary connected component there are also smaller disconnected components

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 5 Why the Web Graph? Example of a large,dynamic and distributed graph Possibly similar to other complex graphs in social, biological and other systems Reflects how humans organize information (relevance, ranking) and their societies Efficient navigation algorithms Study behavior of users as they traverse the web graph (e-commerce)

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 6 Statistics of Interest Size and connectivity of the graph Number of connected components Distribution of pages per site Distribution of incoming and outgoing connections per site Average and maximal length of the shortest path between any two vertices (diameter)

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 7 Properties of Web Graphs Connectivity follows a power law distribution The graph is sparse –|E| = O(n) or atleast o(n 2 ) –Average number of hyperlinks per page roughly a constant A small world graph

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 8 Power Law Size Simple estimates suggest over a billion nodes Distribution of site sizes measured by the number of pages follow a power law distribution Observed over several orders of magnitude with an exponent  in the range

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 9 Power Law Connectivity Distribution of number of connections per node follows a power law distribution Study at Notre Dame University reported –  = 2.45 for outdegree distribution –  = 2.1 for indegree distribution Random graphs have Poisson distribution if p is large. –Decays exponentially fast to 0 as k increases towards its maximum value n-1

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 10 Power Law Distribution - Examples

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 11 Examples of networks with Power Law Distribution Internet at the router and interdomain level Citation network Collaboration network of actors Networks associated with metabolic pathways Networks formed by interacting genes and proteins Network of nervous system connection in C. elegans

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 12 Small World Networks It is a ‘small world’ –Millions of people. Yet, separated by “six degrees” of acquaintance relationships –Popularized by Milgram’s famous experiment Mathematically –Diameter of graph is small (log N) as compared to overall size 3. Property seems interesting given ‘sparse’ nature of graph but … This property is ‘natural’ in ‘pure’ random graphs

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 13 The small world of WWW Empirical study of Web-graph reveals small-world property –Average distance (d) in simulated web: d = log (n) e.g. n = 10 9, d ~= 19 –Graph generated using power-law model –Diameter properties inferred from sampling Calculation of max. diameter computationally demanding for large values of n

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 14 Implications for Web Logarithmic scaling of diameter makes future growth of web manageable –10-fold increase of web pages results in only 2 more additional ‘clicks’, but … –Users may not take shortest path, may use bookmarks or just get distracted on the way –Therefore search engines play a crucial role

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 15 Some theoretical considerations Classes of small-world networks –Scale-free: Power-law distribution of connectivity over entire range –Broad-scale: Power-law over “broad range” + abrupt cut-off –Single-scale: Connectivity distribution decays exponentially

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 16 Power Law of PageRank Assess importance of a page relative to a query and rank pages accordingly –Importance measured by indegree –Not reliable since it is entirely local PageRank – proportion of time a random surfer would spend on that page at steady state A random first order Markov surfer at each time step travels from one page to another

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 17 PageRank contd Page rank r(v) of page v is the steady state distribution obtained by solving the system of linear equations given by Where pa[v] = set of parent nodes Ch[u] = out degree

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 18 Examples Log Plot of PageRank Distribution of Brown Domain (*.brown.edu) G.Pandurangan, P.Raghavan,E.Upfal,”Using PageRank to characterize Webstructure”,COCOON 2002

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 19 Bow-tie Structure of Web A large scale study (Altavista crawls) reveals interesting properties of web –Study of 200 million nodes & 1.5 billion links –Small-world property not applicable to entire web Some parts unreachable Others have long paths –Power-law connectivity holds though Page indegree (  = 2.1), outdegree (  = 2.72)

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 20 Bow-tie Components Strongly Connected Component (SCC) –Core with small-world property Upstream (IN) –Core can’t reach IN Downstream (OUT) –OUT can’t reach core Disconnected (Tendrils)

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 21 Component Properties Each component is roughly same size –~50 million nodes Tendrils not connected to SCC – But reachable from IN and can reach OUT Tubes: directed paths IN->Tendrils->OUT Disconnected components –Maximal and average diameter is infinite

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 22 Empirical Numbers for Bow-tie Maximal minimal (?) diameter – 28 for SCC, 500 for entire graph Probability of a path between any 2 nodes –~1 quarter (0.24) Average length –16 (directed path exists), 7 (undirected) Shortest directed path between 2 nodes in SCC: links on average

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 23 Models for the Web Graph Stochastic models that can explain or atleast partially reproduce properties of the web graph – The model should follow the power law distribution properties –Represent the connectivity of the web –Maintain the small world property

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 24 Web Page Growth Empirical studies observe a power law distribution of site sizes –Size includes size of the Web, number of IP addresses, number of servers, average size of a page etc A Generative model is being proposed to account for this distribution

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 25 Component One of the Generative Model The first component of this model is that “ sites have short-term size fluctuations up or down that are proportional to the size of the site “ A site with 100,000 pages may gain or lose a few hundred pages in a day whereas the effect is rare for a site with only 100 pages

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 26 Component Two of the Generative Model There is an overall growth rate  so that the size S(t) satisfies S(t+1) =  (1+  t  )S(t) where -  t is the realization of a +-1 Bernoulli random variable at time t with probability b is the absolute rate of the daily fluctuations

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 27 Component Two of the Generative Model contd After T steps so that

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 28 Theoretical Considerations Assuming  t independent, by central limit theorem it is clear that for large values of T, log S(T) is normally distributed –The central limit theorem states that given a distribution with a mean μ and variance σ 2, the sampling distribution of the mean approaches a normal distribution with a mean (μ) and a variance σ 2 /N as N, the sample size, increases.

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 29 Theoretical Considerations contd Log S(T) can also be associated with a binomial distribution counting the number of time  t = +1 Hence S(T) has a log-normal distribution The probability density and cumulative distribution functions for the log normal distribution

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 30 Modified Model Can be modified to obey power law distribution Model is modified to include the following inorder to obey power law distribution –A wide distribution of growth rates across different sites and/or –The fact that sites have different ages

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 31 Capturing Power Law Property Inorder to capture Power Law property it is sufficient to consider that –Web sites are being continuously created –Web sites grow at a constant rate  during a growth period after which their size remains approximately constant –The periods of growth follow an exponential distribution This will give a relation = 0.8  between the rate of exponential distribution and  the growth rage when power law exponent  = 1.08

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 32 Lattice Perturbation (LP) Models Some Terms –“Organized Networks” (a.k.a Mafia) Each node has same degree k and neighborhoods are entirely local Probability of Edge (a,b) = 1 if dist (a,b) = 1 0 otherwise Note: We are talking about graphs that can be mapped to a Cartesian plane

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 33 Terms (Cont’d) Organized Networks –Are ‘cliquish’ (Subgraph that is fully connected) in local neighborhood –Probability of edges across neighborhoods is almost non existent (p=0 for fully organized) “Disorganized” Networks –‘Long-range’ edges exist –Completely Disorganized Fully Random (Erdos Model) : p=1

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 34 Semi-organized (SO) Networks Probability for long-range edge is between zero and one Clustered at local level (cliquish) But have long-range links as well Leads to networks that –Are locally cliquish –And have short path lengths

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 35 Creating SO Networks Step 1: –Take a regular network (e.g. lattice) Step 2: –Shake it up (perturbation) Step 2 in detail: –For each vertex, pick a local edge –‘Rewire’ the edge into a long-range edge with a probability (p) –p=0: organized, p=1: disorganized

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 36 Statistics of SO Networks Average Diameter (d): Average distance between two nodes Average Clique Fraction (c) –Given a vertex v, k(v): neighbors of v –Max edges among k(v) = k(k-1)/2 –Clique Fraction (c v ): (Edges present) / (Max) –Average clique fraction: average over all nodes –Measures: Degree to which “my friends are friends of each other”

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 37 Statistics (Cont’d) Statistics of common networks: nkdc Actors225, Power- grid 4, C.elegans Large k = large c? Small c = large d?

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 38 Other Properties For graph to be sparse but connected: –n >> k >> log(n) >>1 As p --> 0 (organized) –d ~= n/2k >>1, c ~= 3/4 –Highly clustered & d grows linearly with n As p --> 1 (disorganized) –d ~= log(n)/log(k), c ~= k/n << 1 –Poorly clustered & d grows logarithmically with n

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 39 Effect of ‘Shaking it up’ Small shake (p close to zero) –High cliquishness AND short path lengths Larger shake (p increased further from 0) –d drops rapidly (increased small world phenomena_ –c remains constant (transition to small world almost undetectable at local level) Effect of long-range link: –Addition: non-linear decrease of d –Removal: small linear decrease of c

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 40 LP and The Web LP has severe limitations –No concept of short or long links in Web A page in USA and another in Europe can be joined by one hyperlink –Edge rewiring doesn’t produce power-law connectivity! Degree distribution bounded & strongly concentrated around mean value Therefore, we need other models …