IMA Tutorial (part III): Generative and probabilistic models of data
IBM Almaden Research Center, San Jose, CA. May 5, 2003. © 2003 IBM Corporation



Probabilistic generative models

Observation: these distributions have the same form:
1. Fraction of laptops that fail catastrophically during tutorials, by city
2. Fraction of pairs of shoes that spontaneously de-sole during periods of stress, by city

Conclusion: the distribution arises because the same stochastic process is at work, and this process can be understood beyond the context of each example.

Models for Power Laws

- Power laws arise in many different areas of human endeavor, the "hallmark of human activity" (they also occur in nature)
- Can we find the underlying process (or processes?) that accounts for this prevalence? [Mitzenmacher 2002]

An Introduction to the Power Law

- Definition: a distribution is said to have a power law if Pr[X ≥ x] ~ c·x^(−α)
- Normally: 0 < α ≤ 2 (Var(X) is infinite)
- Sometimes: 0 < α ≤ 1 (the mean of X is infinite)

(Figure: an exponentially-decaying distribution versus a power law distribution)
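A minimal sketch (added, not from the slides) of how such a tail can be generated and checked empirically. The Pareto form, the α value, and the fitting range are illustrative assumptions:

```python
import numpy as np

# Inverse-transform sampling from a Pareto(k, alpha) distribution:
# Pr[X >= x] = (x/k)^(-alpha) for x >= k, so X = k * U^(-1/alpha).
alpha, k, n = 1.5, 1.0, 100_000
rng = np.random.default_rng(0)
samples = k * rng.uniform(size=n) ** (-1.0 / alpha)

# Empirical check: the CCDF should be roughly linear on log-log axes,
# with slope -alpha.
xs = np.logspace(0, 3, 20)
ccdf = np.array([(samples >= x).mean() for x in xs])
slope = np.polyfit(np.log(xs[:15]), np.log(ccdf[:15]), 1)[0]
print(f"fitted tail slope ~ {slope:.2f} (expect ~ {-alpha})")
```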

Early Observations: Pareto on Income

- [Pareto 1897] observed that the random variable X denoting the income of an individual has a power law distribution
- More strongly, he observed that Pr[X > x] = (x/k)^(−α)
- For the density function f, note that ln f(x) = (−α − 1)·ln(x) + c for a constant c
- Thus, in a plot of log(value) versus log(probability), general power laws display a linear tail, while Pareto distributions are linear everywhere.
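A short derivation (added for completeness, using only the definitions above) of why the Pareto density is exactly linear on log-log axes:

```latex
% From the Pareto CCDF to the log-log linear density:
\Pr[X > x] = \left(\tfrac{x}{k}\right)^{-\alpha}, \quad x \ge k
\quad\Longrightarrow\quad
f(x) = -\tfrac{d}{dx}\Pr[X > x] = \tfrac{\alpha}{k}\left(\tfrac{x}{k}\right)^{-\alpha-1}.
% Taking logs of both sides:
\ln f(x) = (-\alpha - 1)\ln x + \underbrace{\ln\!\big(\alpha k^{\alpha}\big)}_{c}
% so every point of the density, not just the tail, lies on a line of
% slope -(alpha + 1) in a log-log plot.
```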

Early Observations: Yule/Zipf

- [Yule 1926] observed (and explained) power laws in the context of the number of species within a genus
- [Zipf 1932] and [Estoup 1916] studied the relative frequency of words in natural language, beginning a cottage industry that continues to this day
- A "Yule-Zipf" distribution is typically characterized by rank rather than value: the ith most frequent word in English occurs with probability proportional to 1/i
- This characterization relies on a finite vocabulary

Early Observations: Lotka on Citations

- [Lotka 1925] presented the first occurrence of power laws in the context of graph theory, showing a power law for the in-degree of the citation graph

Ranks versus Values

Commonly encountered phrasings of the power law in the context of word counts:
1. Pr[word has count ≥ W] has some form
2. Number of words with count ≥ W has some form
3. The frequency of the word with rank r has some form

The first two forms are clearly identical. What about the third?

Equivalence of the rank versus value formulations

- Given: the number of words occurring t times ~ t^(−α)
- Approach:
  - Consider the single most frequent word, with count T
  - Characterize a word occurring t times in terms of T
  - Approximate the rank of words occurring t times by counting the number of words occurring at each more frequent count
- Conclusion: the rank-j word occurs ≈ (cj + d)^(−1/(α−1)) times, again a power law
- But... high ranks correspond to low values, so one must keep straight the "head" and the "tail" [Bookstein 1990, Adamic 1999]
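A compact derivation (added, under the slide's assumption that the number of words with count t is proportional to t^(−α), with α > 1) of the rank-frequency power law claimed above:

```latex
% Number of words with count exactly t:  f(t) \approx c\, t^{-\alpha}.
% Rank of a word with count t = number of words with count at least t:
r(t) \;\approx\; \sum_{s \ge t} c\, s^{-\alpha}
     \;\approx\; \int_{t}^{\infty} c\, s^{-\alpha}\, ds
     \;=\; \frac{c}{\alpha - 1}\, t^{-(\alpha - 1)}.
% Inverting to express the count of the rank-r word:
t(r) \;\approx\; \left(\frac{(\alpha - 1)\, r}{c}\right)^{-1/(\alpha - 1)}
% i.e. the count of the rank-j word decays as a power law in j,
% with exponent -1/(alpha - 1).
```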

Early modeling work

- The characterization of power laws is a limiting statement
- Early modeling work gave processes that produce the correct form of the tail in the limit
- Later work studied the rate of convergence of a process to its limiting distribution

A model of Simon

- Following Simon [1955], described in terms of word frequencies
- Consider a book being written. Initially, the book contains a single word, "the."
- At time t, the book contains t words. Simon's process generates the (t+1)st word based on the current book.

Constructing a book: snapshot at time t

"When in the course of human events, it becomes necessary..."

Current word frequencies:

Rank      Word           Count
1         "the"          1000
2         "of"           600
3         "from"         300
...       ...            ...
4,791     "necessary"    5
...       ...            ...
11,325    "neccesary"    1

Let f(i,t) be the number of words of count i at time t

The Generative Model

Assumptions:
1. There is a constant probability α that a neologism is introduced at any timestep
2. The probability of re-using a word of count i is proportional to i·f(i,t), that is, to the total number of occurrences of count-i words

Algorithm (see the sketch below):
- With probability α, a new word is introduced into the text
- With the remaining probability, a word with count i is chosen with probability proportional to i·f(i,t)
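A minimal simulation sketch of Simon's process (added; the vocabulary representation and the α value are illustrative choices, not from the slides). Re-using a word with probability proportional to its current count is implemented by sampling a uniformly random token of the text so far:

```python
import random
from collections import Counter

def simon_model(steps: int, alpha: float = 0.1, seed: int = 0) -> Counter:
    """Simulate Simon's text-generation process and return word counts."""
    rng = random.Random(seed)
    text = [0]              # word ids; the book starts with one word ("the")
    next_word = 1
    for _ in range(steps):
        if rng.random() < alpha:
            text.append(next_word)        # introduce a neologism
            next_word += 1
        else:
            # Picking a uniform token of the text re-uses word w with
            # probability count(w)/len(text), i.e. proportional to its count.
            text.append(rng.choice(text))
    return Counter(text)

counts = simon_model(200_000)
freqs = sorted(counts.values(), reverse=True)
print("top counts:", freqs[:5])                      # heavy head
print("words seen once:", sum(1 for c in freqs if c == 1))
```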

Constructing a book: snapshot at time t

Current word frequencies:

Rank      Word           Count
1         "the"          1000
2         "of"           600
3         "from"         300
...       ...            ...
4,791     "necessary"    5
...       ...            ...
11,325    "neccesary"    1

Let f(i,t) be the number of words of count i at time t

Pr["the"] = (1 − α) · 1000 / K
Pr["of"] = (1 − α) · 600 / K
Pr[some count-1 word] = (1 − α) · 1 · f(1,t) / K
where K = Σ_i i·f(i,t)

What's going on?

(Figure: the vocabulary viewed as balls in buckets. Each word in bucket i occurs i times in the current document; a word is a unique ball, occurring one or more times.)

What's going on?

(Figure, continued: with probability α, a new word is introduced into the text; a new word has count 1, so a new ball lands in bucket 1.)

What's going on?

(Figure, continued: with probability 1 − α, an existing word is reused. How many times do the words in a given bucket occur?)

What's going on?

(Figure, continued: buckets 2, 3, 4.) The size of bucket 3 at time t+1 depends only on the sizes of buckets 2 and 3 at time t. Must show: the fraction of balls in the 3rd bucket approaches some limiting value.

Models for power laws in the web graph

- Retelling the Simon model as "preferential attachment": [Barabási et al.], [Kumar et al.]
- Other models for the web graph: [Aiello, Chung, Lu], [Huberman et al.]

Why create such a model?

- Evaluate algorithms and heuristics
- Get insight into page creation
- Estimate hard-to-sample parameters
- Help understand web structure
- Cost modeling for query optimization
- Finding "surprises" requires understanding what is typical

Random graph models

                 G(n,p)    Web
indeg > 1000     ~none     many
K_{2,3}'s        ~none     many
4-cliques        ~none     many

Traditional random graphs [Bollobás 1985] are not like the web! Is there a better model?

Desiderata for a graph model

- Succinct description
- Insight into page creation
- No a priori set of "topics", but...
- ...topics should emerge naturally
- Reflect structural phenomena
- Dynamic page arrivals
- Should mirror the web's "rich get richer" property, and manifest link correlation

Page creation on the web

- Some page creators will link to other sites without regard to existing topics, but
- Most page creators will be drawn to pages covering existing topics they care about, and will link to pages within these topics

Model idea: new pages add links by "copying" them from existing pages

Generally, a full model would require separate processes for:

- Node creation
- Node deletion
- Edge creation
- Edge deletion

A specific model

- Nodes are created in a sequence of discrete time steps; e.g., at each time step, a new node is created with d (≥ 1) out-links
- Probabilistic copying:
  - links go to random nodes with probability α
  - copy d links from a random node with probability 1 − α

Example

A new node arrives. With probability α, it links to a uniformly-chosen page.

Example (continued)

With probability 1 − α, it decides to copy a link:
- It first chooses a page uniformly
- Then chooses a uniform out-edge from that page
- Then links to the destination of that edge ("copies" the edge)

Under copying, a page's rate of getting new in-links is proportional to its in-degree. A simulation sketch follows below.
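A minimal simulation of the copying process (added; d, α, the per-link coin, and the seed graph are illustrative choices). Copying a uniform out-edge of a uniform node is exactly what makes in-link arrival rates proportional to in-degree:

```python
import random
from collections import Counter

def copying_model(n: int, d: int = 3, alpha: float = 1 / 11, seed: int = 0):
    """Grow a directed graph by probabilistic copying; return out-link lists."""
    rng = random.Random(seed)
    # Seed graph: d+1 nodes, each linking to the d others.
    out = [[(i + j + 1) % (d + 1) for j in range(d)] for i in range(d + 1)]
    for v in range(d + 1, n):
        proto = rng.randrange(v)                   # prototype node to copy from
        links = []
        for j in range(d):
            if rng.random() < alpha:
                links.append(rng.randrange(v))     # uniform random destination
            else:
                links.append(out[proto][j])        # copy the prototype's j-th link
        out.append(links)
    return out

out = copying_model(100_000)
indeg = Counter(dst for links in out for dst in links)
print("largest in-degrees:", indeg.most_common(3))  # a few pages get huge in-degree
```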

Degree sequences in this model

Pr[page has k in-links] ≈ k^(−(2−α)/(1−α))

- Heavy-tailed, inverse-polynomial degree sequences
- Pages like Netscape and Yahoo exist
- Many cores, cliques, and other dense subgraphs
- α = 1/11 gives exponent 2.1, matching the web

[Kumar, Raghavan, Rajagopalan, Sivakumar, Tomkins, Upfal 2000]

Model extensions

- Component size distributions
- More complex copying
- Tighter lower-tail bounds
- More structural results

A model of Mandelbrot

- Key idea: generate frequencies of English words so as to maximize information transmitted per unit cost
- Approach:
  - Say word i occurs with probability p(i)
  - Set the transmission cost of word i to be log(i)
  - Average information per word: −Σ_i p(i)·log p(i)
  - Average cost per word: Σ_j p(j)·log(j)
  - Choose the probabilities p(i) to maximize information/cost
- Result: p(j) = c·j^(−α) [Mandelbrot 1953]

A derivation sketch follows below.
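A short optimization sketch (added; the Lagrange-multiplier argument is a standard reconstruction, not from the slides) of why maximizing information per unit cost yields a power law:

```latex
% Maximize information per unit cost over the distribution p:
\max_{p}\; \frac{I(p)}{C(p)}, \qquad
I(p) = -\sum_j p_j \ln p_j, \quad
C(p) = \sum_j p_j \ln j, \quad \sum_j p_j = 1.
% Stationarity, with multiplier \lambda for the normalization constraint:
\frac{\partial}{\partial p_j}\,\frac{I}{C}
  = \frac{(-\ln p_j - 1)\,C - I \ln j}{C^2} = \lambda
\;\Longrightarrow\;
\ln p_j = -\frac{I}{C}\,\ln j + \text{const}
% i.e. p_j = c\, j^{-\alpha}, with exponent \alpha = I/C at the optimum.
```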

Discussion of Mandelbrot's model

- Trade-off between communication cost (log j) and information (−log p(j))
- Are there other tradeoff-based models that drive similar properties?

Heuristically Optimized Trade-offs

- Goal: construction of trees (note: models generating trees with power law behavior were first proposed in [Yule 1926])
- Idea: new nodes must trade off connecting to nearby nodes against connecting to central nodes
- Model:
  - Points arrive uniformly at random within the unit square
  - A new point arrives and computes two measures for each candidate connection point j:
    - d(j): distance from the new node to existing node j ("nearness")
    - h(j): distance from node j to the root of the tree ("centrality")
  - The new destination is chosen to minimize α·d(j) + h(j)
- Result: for a wide variety of values of α, the distribution of degrees has a power law [Fabrikant, Koutsoupias, Papadimitriou 2002]

A simulation sketch follows below.
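A minimal simulation of this process (added; the α value, the tree size, and the use of hop count as the centrality measure h(j) are illustrative choices):

```python
import math
import random
from collections import Counter

def fkp_tree(n: int, alpha: float = 10.0, seed: int = 0):
    """Grow a tree over random points in the unit square, FKP-style."""
    rng = random.Random(seed)
    pts = [(rng.random(), rng.random()) for _ in range(n)]
    parent = [None] * n          # node 0 is the root
    hops = [0] * n               # hop distance to the root ("centrality")
    for v in range(1, n):
        # Connect to the existing node j minimizing alpha*d(j) + h(j).
        best = min(
            range(v),
            key=lambda j: alpha * math.dist(pts[v], pts[j]) + hops[j],
        )
        parent[v] = best
        hops[v] = hops[best] + 1
    return parent

parent = fkp_tree(5_000)
deg = Counter(parent[1:])                    # child counts per node
print("largest degrees:", deg.most_common(5))  # a few hubs dominate
```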

Monkeys on Typewriters

- Consider a generation model divorced from concerns of information and cost
- Model: a monkey types randomly, hitting the space bar with probability q and a uniformly-chosen character with the remaining probability
- Result: the rank-j word occurs with probability ≈ q·j^(log(1−q) − 1) = c·j^(−α) [Miller 1962]

A simulation sketch follows below.
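A quick simulation (added; the alphabet size, q, and sample length are illustrative) showing that random typing already produces a power-law-like rank-frequency curve:

```python
import random
from collections import Counter

def monkey_text(n_chars: int, q: float = 0.2, alphabet: str = "abcde", seed: int = 0):
    """Type n_chars keys: space with probability q, else a uniform letter."""
    rng = random.Random(seed)
    keys = [" " if rng.random() < q else rng.choice(alphabet) for _ in range(n_chars)]
    return "".join(keys).split()

words = monkey_text(2_000_000)
counts = Counter(words).most_common()
# Rank-frequency pairs: on log-log axes these fall near a straight line.
for rank in (1, 10, 100, 1000):
    if rank <= len(counts):
        print(rank, counts[rank - 1][1])
```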

Other Distributions

- "Power law" refers to a clean characterization of a particular property of a distribution's upper tail
- It is often used loosely to mean "heavy tailed," i.e., bounded away from an exponentially decaying distribution
- There are other forms of heavy-tailed distributions
- A commonly-occurring example: the lognormal distribution

Quick characterization of lognormal distributions

- Let X be a normally-distributed random variable
- Let Y = e^X; then Y is lognormal (equivalently, Y is lognormal when ln Y is normal)
- Common situation: multiplicative growth
- Concern: there is a growing sequence of papers, dating back several decades, questioning whether certain observed quantities are best described by power law or lognormal (or other) distributions

One final direction...

- The Central Limit Theorem tells us how sums of independent random variables behave in the limit
- Example: multiplicative growth X_j = X_0 · F_1 · F_2 ⋯ F_j, so ln X_j = ln X_0 + Σ_i ln F_i
- By the CLT, ln X_j is approximately normal, so X_j is well-approximated by a lognormal variable
- Thus, lognormal variables arise in situations of multiplicative growth; examples in biology, ecology, economics, ...
- Example: [Huberman et al.]: growth of web sites
- Similarly, the same result applies to products of lognormal variables
- Each of these generative models is evolutionary. What is the role of time?

A closing simulation sketch follows below.
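A closing illustration (added; the factor distribution and the sizes are arbitrary choices) of multiplicative growth producing an approximately lognormal outcome:

```python
import numpy as np

rng = np.random.default_rng(0)

# Multiplicative growth: X_j = X_0 * F_1 * ... * F_j with i.i.d. positive factors.
n_vars, n_factors = 50_000, 200
factors = rng.uniform(0.8, 1.25, size=(n_vars, n_factors))
x = factors.prod(axis=1)

# By the CLT applied to log X = sum(log F_i), log X should look normal.
logx = np.log(x)
skew = ((logx - logx.mean()) ** 3).mean() / logx.std() ** 3
print(f"log-scale skewness ~ {skew:.3f} (near 0 for a normal shape)")
```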