School of Information University of Michigan Zipf’s law & fat tails Plotting and fitting distributions Lecture 6 Instructor: Lada Adamic Reading: Lada.

Slides:



Advertisements
Similar presentations
Power laws, Pareto distribution and Zipf's law M. E. J. Newman Presented by: Abdulkareem Alali.
Advertisements

Analysis and Modeling of Social Networks Foudalis Ilias.
Statistics 100 Lecture Set 7. Chapters 13 and 14 in this lecture set Please read these, you are responsible for all material Will be doing chapters
Hydrologic Statistics
Information Networks Generative processes for Power Laws and Scale-Free networks Lecture 4.
Advanced Topics in Data Mining Special focus: Social Networks.
Lecture 10: Power Laws CS 790g: Complex Networks Slides are modified from Networks: Theory and Application by Lada Adamic.
Web Graph Characteristics Kira Radinsky All of the following slides are courtesy of Ronny Lempel (Yahoo!)
4. PREFERENTIAL ATTACHMENT The rich gets richer. Empirical evidences Many large networks are scale free The degree distribution has a power-law behavior.
Network Models Social Media Mining. 2 Measures and Metrics 2 Social Media Mining Network Models Why should I use network models? In may 2011, Facebook.
Model Fitting Jean-Yves Le Boudec 0. Contents 1 Virus Infection Data We would like to capture the growth of infected hosts (explanatory model) An exponential.
CS 728 Lecture 4 It’s a Small World on the Web. Small World Networks It is a ‘small world’ after all –Billions of people on Earth, yet every pair separated.
Web as Graph – Empirical Studies The Structure and Dynamics of Networks.
Peer-to-Peer and Grid Computing Exercise Session 3 (TUD Student Use Only) ‏
Statistics.
On Power-Law Relationships of the Internet Topology CSCI 780, Fall 2005.
Power Laws Otherwise known as any semi- straight line on a log-log plot.
IN350: Text properties, Zipf’s Law,and Heap’s Law. Judith A. Molka-Danielsen September 12, 2002 Notes are based on Chapter 6 of the Article Collection.
Analysis of Social Information Networks Thursday January 27 th, Lecture 3: Popularity-Power law 1.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 7 May 14, 2006
Doubling Dimension in Real-World Graphs Melitta Lorraine Geistdoerfer Andersen.
CS 124/LINGUIST 180 From Languages to Information Dan Jurafsky Stanford University Social Networks: Small Worlds, Weak Ties, and Power Laws Slides from.
Hydrologic Statistics
Peer-to-Peer and Social Networks Random Graphs. Random graphs E RDÖS -R ENYI MODEL One of several models … Presents a theory of how social webs are formed.
Physics 114: Lecture 15 Probability Tests & Linear Fitting Dale E. Gary NJIT Physics Department.
Information Networks Power Laws and Network Models Lecture 3.
Graphical Summary of Data Distribution Statistical View Point Histograms Skewness Kurtosis Other Descriptive Summary Measures Source:
Models and Algorithms for Complex Networks Power laws and generative processes.
Information Retrieval and Web Search Text properties (Note: some of the slides in this set have been adapted from the course taught by Prof. James Allan.
COLOR TEST COLOR TEST. Social Networks: Structure and Impact N ICOLE I MMORLICA, N ORTHWESTERN U.
1 Statistical Properties for Text Rong Jin. 2 Statistical Properties of Text  How is the frequency of different words distributed?  How fast does vocabulary.
Exploring Text: Zipf’s Law and Heaps’ Law. (a) (b) (a) Distribution of sorted word frequencies (Zipf’s law) (b) Distribution of size of the vocabulary.
School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2013 Figures are taken.
Graphing Data: Introduction to Basic Graphs Grade 8 M.Cacciotti.
GrowingKnowing.com © Frequency distribution Given a 1000 rows of data, most people cannot see any useful information, just rows and rows of data.
Lotkaian Informetrics and applications to social networks L. Egghe Chief Librarian Hasselt University Professor Antwerp University Editor-in-Chief “Journal.
School of Information University of Michigan Unless otherwise noted, the content of this course material is licensed under a Creative Commons Attribution.
Lecture 10: Network models CS 765: Complex Networks Slides are modified from Networks: Theory and Application by Lada Adamic.
CY1B2 Statistics1 (ii) Poisson distribution The Poisson distribution resembles the binomial distribution if the probability of an accident is very small.
Most of contents are provided by the website Network Models TJTSD66: Advanced Topics in Social Media (Social.
Correlation – Recap Correlation provides an estimate of how well change in ‘ x ’ causes change in ‘ y ’. The relationship has a magnitude (the r value)
Exploring Text: Zipf’s Law and Heaps’ Law. (a) (b) (a) Distribution of sorted word frequencies (Zipf’s law) (b) Distribution of size of the vocabulary.
How Do “Real” Networks Look?
Probability distributions
Power law distribution
Descriptive Statistics. Outline of Today’s Discussion 1.Central Tendency 2.Dispersion 3.Graphs 4.Excel Practice: Computing the S.D. 5.SPSS: Existing Files.
Statistical Properties of Text
CS 124/LINGUIST 180 From Languages to Information
Chapter 20 Statistical Considerations Lecture Slides The McGraw-Hill Companies © 2012.
Lecture 8: Measurement Errors 1. Objectives List some sources of measurement errors. Classify measurement errors into systematic and random errors. Study.
Section 6 – Ec1818 Jeremy Barofsky March 10 th and 11 th, 2010.
CS 124/LINGUIST 180 From Languages to Information Dan Jurafsky Stanford University Social Networks: Small Worlds, Weak Ties, and Power Laws Slides from.
Netlogo demo. Complexity and Networks Melanie Mitchell Portland State University and Santa Fe Institute.
Lecture 23: Structure of Networks
Data Transformation: Normalization
How Do “Real” Networks Look?
Lecture 23: Structure of Networks
CS224W: Social and Information Network Analysis
Lecture 11: Scale Free Networks
How Do “Real” Networks Look?
How Do “Real” Networks Look?
Social Network Analysis
Unless otherwise noted, the content of this course material is licensed under a Creative Commons Attribution 3.0 License.
Lecture 4: “Scale free” networks
Hydrologic Statistics
Peer-to-Peer and Social Networks Fall 2017
How Do “Real” Networks Look?
Lecture 23: Structure of Networks
Advanced Topics in Data Mining Special focus: Social Networks
Presentation transcript:

School of Information University of Michigan Zipf’s law & fat tails Plotting and fitting distributions Lecture 6 Instructor: Lada Adamic Reading: Lada Adamic, Zipf, Power-laws, and Pareto - a ranking tutorial, M. E. J. Newman, Power laws, Pareto distributions and Zipf's law, Contemporary Physics 46, (2005)

Outline Power law distributions Fitting Data sets for projects Next class: what kinds of processes generate power laws?

What is a heavy tailed-distribution? Right skew normal distribution (not heavy tailed) e.g. heights of human males: centered around 180cm (5’11’’) Zipf’s or power-law distribution (heavy tailed) e.g. city population sizes: NYC 8 million, but many, many small towns High ratio of max to min human heights tallest man: 272cm (8’11”), shortest man: (1’10”) ratio: 4.8 from the Guinness Book of world records city sizes NYC: pop. 8 million, Duffield, Virginia pop. 52, ratio: 150,000

Normal (also called Gaussian) distribution of human heights average value close to most typical distribution close to symmetric around average value

Power-law distribution linear scale log-log scale high skew (asymmetry) straight line on a log-log plot

Power laws are seemingly everywhere note: these are cumulative distributions, more about this in a bit… Moby Dick scientific papers AOL users visiting sites ‘97 bestsellers AT&T customers on 1 dayCalifornia

Yet more power laws MoonSolar flares wars ( ) richest individuals 2003US family names 1990US cities 2003

Power law distribution Straight line on a log-log plot Exponentiate both sides to get that p(x), the probability of observing an item of size ‘x’ is given by normalization constant (probabilities over all x must sum to 1) power law exponent 

Logarithmic axes powers of a number will be uniformly spaced =1, 2 1 =2, 2 2 =4, 2 3 =8, 2 4 =16, 2 5 =32, 2 6 =64,….

Fitting power-law distributions Most common and not very accurate method: Bin the different values of x and create a frequency histogram ln(x) ln(# of times x occurred) x can represent various quantities, the indegree of a node, the magnitude of an earthquake, the frequency of a word in text ln(x) is the natural logarithm of x, but any other base of the logarithm will give the same exponent of a because log 10 (x) = ln(x)/ln(10)

Example on an artificially generated data set Take 1 million random numbers from a distribution with  = 2.5 Can be generated using the so-called ‘transformation method’ Generate random numbers r on the unit interval 0≤r<1 then x = (1-r)  is a random power law distributed real number in the range 1 ≤ x < 

Linear scale plot of straight bin of the data How many times did the number 1 or 3843 or occur Power-law relationship not as apparent Only makes sense to look at smallest bins whole range first few bins

Log-log scale plot of straight binning of the data Same bins, but plotted on a log-log scale Noise in the tail: Here we have 0, 1 or 2 observations of values of x when x > 500 here we have tens of thousands of observations when x < 10 Actually don’t see all the zero values because log(0) = 

Log-log scale plot of straight binning of the data Fitting a straight line to it via least squares regression will give values of the exponent  that are too low fitted  true 

What goes wrong with straightforward binning Noise in the tail skews the regression result have many more bins here have few bins here

First solution: logarithmic binning bin data into exponentially wider bins: 1, 2, 4, 8, 16, 32, … normalize by the width of the bin evenly spaced datapoints less noise in the tail of the distribution disadvantage: binning smoothes out data but also loses information

Second solution: cumulative binning No loss of information No need to bin, has value at each observed value of x But now have cumulative distribution i.e. how many of the values of x are at least X The cumulative probability of a power law probability distribution is also power law but with an exponent  - 1

Fitting via regression to the cumulative distribution fitted exponent (2.43) much closer to actual (2.5)

Where to start fitting? some data exhibit a power law only in the tail after binning or taking the cumulative distribution you can fit to the tail so need to select an x min the value of x where you think the power-law starts certainly x min needs to be greater than 0, because x  is infinite at x = 0

Example: Distribution of citations to papers power law is evident only in the tail (x min > 100 citations) x min

Maximum likelihood fitting – best You have to be sure you have a power-law distribution (this will just give you an exponent but not a goodness of fit) x i are all your datapoints, and you have n of them for our data set we get  = – pretty close!

Some exponents for real world data x min exponent  frequency of use of words12.20 number of citations to papers number of hits on web sites12.40 copies of books sold in the US telephone calls received magnitude of earthquakes diameter of moon craters intensity of solar flares intensity of wars31.80 net worth of Americans$600m2.09 frequency of family names population of US cities

Many real world networks are power law exponent   in/out degree) film actors2.3 telephone call graph2.1 networks1.5/2.0 sexual contacts3.2 WWW2.3/2.7 internet2.5 peer-to-peer2.1 metabolic network2.2 protein interactions2.4

Hey, not everything is a power law number of sightings of 591 bird species in the North American Bird survey in cumulative distribution another examples: size of wildfires (in acres)

Not every network is power law distributed address books power grid Roget’s thesaurus company directors…

Example on a real data set: number of AOL visitors to different websites back in 1997 simple binning on a linear scale simple binning on a log-log scale

trying to fit directly… direct fit is too shallow:  = 1.17…

Binning the data logarithmically helps select exponentially wider bins 1, 2, 4, 8, 16, 32, ….

Or we can try fitting the cumulative distribution Shows perhaps 2 separate power-law regimes that were obscured by the exponential binning Power-law tail may be closer to 2.4

Another common distribution: power-law with an exponential cutoff p(x) ~ x -a e -k/  starts out as a power law ends up as an exponential but could also be a lognormal or double exponential…

Zipf &Pareto: what they have to do with power-laws Zipf George Kingsley Zipf, a Harvard linguistics professor, sought to determine the 'size' of the 3rd or 8th or 100th most common word. Size here denotes the frequency of use of the word in English text, and not the length of the word itself. Zipf's law states that the size of the r'th largest occurrence of the event is inversely proportional to its rank: y ~ r - , with  close to unity.

Zipf &Pareto: what they have to do with power-laws Pareto The Italian economist Vilfredo Pareto was interested in the distribution of income. Pareto’s law is expressed in terms of the cumulative distribution (the probability that a person earns X or more). P[X > x] ~ x -k Here we recognize k as just  -1, where  is the power-law exponent

So how do we go from Zipf to Pareto? The phrase "The r th largest city has n inhabitants" is equivalent to saying "r cities have n or more inhabitants". This is exactly the definition of the Pareto distribution, except the x and y axes are flipped. Whereas for Zipf, r is on the x-axis and n is on the y-axis, for Pareto, r is on the y-axis and n is on the x-axis. Simply inverting the axes, we get that if the rank exponent is , i.e. n ~ r  for Zipf, (n = income, r = rank of person with income n) then the Pareto exponent is 1/  so that r ~ n -1/  (n = income, r = number of people whose income is n or higher)

Zipf’s law & AOL site visits Deviation from Zipf’s law slightly too few websites with large numbers of visitors:

Zipf’s Law and city sizes (~1930) [2] Rank(k)CityPopulation (1990) Zips’s LawModified Zipf’s law: (Mandelbrot) 1Now York7,322,56410,000,0007,334,265 7Detroit1,027,9741,428,5711,214,261 13Baltimore736,014769,231747,693 19Washington DC606,900526,316558,258 25New Orleans496,938400,000452,656 31Kansas City434,829322,581384,308 37Virgina Beach393,089270,270336,015 49Toledo332,943204,082271,639 61Arlington261,721163,932230,205 73Baton Rouge219,531136,986201,033 85Hialeah188,008117,647179,243 97Bakersfield174,820103,270162,270 slide: Luciano Pietronero not any more

Exponents and averages In general, power law distributions do not have an average value if  < 2 (but the sample will!) This is because the average is given by (for integer values of k) The harmonic series diverges… Same holds for continuous values of k for a finite sample this will only go to the largest observed value

80/20 rule The fraction W of the wealth in the hands of the richest P of the the population is given by W = P  Example: US wealth:  = 2.1 richest 20% of the population holds 86% of the wealth

Generative processes for power-laws Many different processes can lead to power laws There is no one unique mechanism that explains it all Next class: Yule’s process and preferential attachment

What does it mean to be scale free? A power law looks the same no mater what scale we look at it on (2 to 50 or 200 to 5000) Only true of a power-law distribution! p(bx) = g(b) p(x) – shape of the distribution is unchanged except for a multiplicative constant p(bx) = (bx)  = b  x  log(x) log(p(x)) x →b*x

Data sets: patent networks Patent networks (very large, but can study subset) “small worlds” of patent co-inventorship connections between firms by movement of inventors patent interactions (“blocking”, “independent”, “complementary”, “substitute”) Prof. Gavin Clarkson has great access + expertise example of acyclic graph patent can only cite previous patents

Patent network

Data sets: wordnet Lexical database used in NLP relationships Synonymy Antonymy Hyponymy (sub-name) gives rise to hierarchy) Meronymy (part-name) WordNet distinguishes component parts, substantive Troponymy (hierarchy between verbs) Entailment

Physical internet Networks both at the ASP and router level are available over a period of time – well suited to longitudinal study interesting things to look at densification diameter robustness flow/load

web pages & blogs community structure: find connections between organizations & companies based on their linking patterns especially true for blogs ranking algorithms (links + content) relating links to content (explanation + prediction) easy to gather (for blogs, LiveJournal provides an API), for other webpages can write a simple crawler example: Prof. Mick McQuaid’s diversity study is based in part on course descriptions from universities’ websites

food webs Several datasets available (already in Pajek format) EcoNetwrk – a windows app. to analyze ecological flow networks Interesting to study: network robustness/changes

biological networks protein-protein interactions gene regulatory networks metabolic networks neural networks

Other networks transportation airline rail road networks Enron dataset is public groups & teams sports musicians & bands boards of directors co-authorship networks very readily available