1 Dynamic Models for File Sizes and Double Pareto Distributions Michael Mitzenmacher Harvard University.

Slides:



Advertisements
Similar presentations
1 New Directions for Power Law Research Michael Mitzenmacher Harvard University.
Advertisements

Algorithmic and Economic Aspects of Networks Nicole Immorlica.
Chapter 6 Sampling and Sampling Distributions
Analysis and Modeling of Social Networks Foudalis Ilias.
Week 5 - Models of Complex Networks I Dr. Anthony Bonato Ryerson University AM8002 Fall 2014.
© 2003 Prentice-Hall, Inc.Chap 5-1 Business Statistics: A First Course (3 rd Edition) Chapter 5 Probability Distributions.
Chapter 11- Confidence Intervals for Univariate Data Math 22 Introductory Statistics.
Information Networks Generative processes for Power Laws and Scale-Free networks Lecture 4.
A Brief History of Lognormal and Power Law Distributions and an Application to File Size Distributions Michael Mitzenmacher Harvard University.
1 A Random-Surfer Web-Graph Model (Joint work with Avrim Blum & Hubert Chan) Mugizi Rwebangira.
Power Law and Its Generative Models Bo Young Kim
Directional triadic closure and edge deletion mechanism induce asymmetry in directed edge properties.
Model Fitting Jean-Yves Le Boudec 0. Contents 1 Virus Infection Data We would like to capture the growth of infected hosts (explanatory model) An exponential.
Continuous Random Variables. For discrete random variables, we required that Y was limited to a finite (or countably infinite) set of values. Now, for.
CP Formal Models of Heavy-Tailed Behavior in Combinatorial Search Hubie Chen, Carla P. Gomes, and Bart Selman
Statistics & Modeling By Yan Gao. Terms of measured data Terms used in describing data –For example: “mean of a dataset” –An objectively measurable quantity.
CS Lecture 6 Generative Graph Models Part II.
Descriptive statistics Experiment  Data  Sample Statistics Experiment  Data  Sample Statistics Sample mean Sample mean Sample variance Sample variance.
1 A History of and New Directions for Power Law Research Michael Mitzenmacher Harvard University.
1 Pertemuan 06 Sebaran Normal dan Sampling Matakuliah: >K0614/ >FISIKA Tahun: >2006.
CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 7 May 14, 2006
Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 6-1 Chapter 6 The Normal Distribution and Other Continuous Distributions.
Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 6-1 Chapter 6 The Normal Distribution and Other Continuous Distributions.
1 A Random-Surfer Web-Graph Model Avrim Blum, Hubert Chan, Mugizi Rwebangira Carnegie Mellon University.
Chapter 6 Sampling and Sampling Distributions
1 Summarizing Performance Data Confidence Intervals Important Easy to Difficult Warning: some mathematical content.
Information Networks Power Laws and Network Models Lecture 3.
Topic 13 Network Models Credits: C. Faloutsos and J. Leskovec Tutorial
Distributions of Randomized Backtrack Search Key Properties: I Erratic behavior of mean II Distributions have “heavy tails”.
1 New Directions for Power Law Research Michael Mitzenmacher Harvard University (With thanks to David Parkes!)
McGraw-Hill/IrwinCopyright © 2009 by The McGraw-Hill Companies, Inc. All Rights Reserved. Chapter 7 Sampling Distributions.
McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved. Chapter 6 Sampling Distributions.
Probability theory 2 Tron Anders Moger September 13th 2006.
Further distributions
Traffic Modeling.
Statistical Physics and the “Problem of Firm Growth” Dongfeng Fu Advisor: H. E. Stanley K. Yamasaki, K. Matia, S. V. Buldyrev, DF Fu, F. Pammolli, K. Matia,
Models and Algorithms for Complex Networks Power laws and generative processes.
Theory of Probability Statistics for Business and Economics.
1 Statistical Distribution Fitting Dr. Jason Merrick.
ENGR 610 Applied Statistics Fall Week 3 Marshall University CITE Jack Smith.
More Continuous Distributions
1 Lecture 13: Other Distributions: Weibull, Lognormal, Beta; Probability Plots Devore, Ch. 4.5 – 4.6.
Basic Numerical Procedures Chapter 19 1 Options, Futures, and Other Derivatives, 7th Edition, Copyright © John C. Hull 2008.
McGraw-Hill/IrwinCopyright © 2009 by The McGraw-Hill Companies, Inc. All Rights Reserved. Chapter 7 Sampling Distributions.
Week11 Parameter, Statistic and Random Samples A parameter is a number that describes the population. It is a fixed number, but in practice we do not know.
Yongqin Gao, Greg Madey Computer Science & Engineering Department University of Notre Dame © Copyright 2002~2003 by Serendip Gao, all rights reserved.
Lecture 10: Network models CS 765: Complex Networks Slides are modified from Networks: Theory and Application by Lada Adamic.
1 Summarizing Performance Data Confidence Intervals Important Easy to Difficult Warning: some mathematical content.
Ka-fu Wong © 2003 Chap 6- 1 Dr. Ka-fu Wong ECON1003 Analysis of Economic Data.
Review of Probability. Important Topics 1 Random Variables and Probability Distributions 2 Expected Values, Mean, and Variance 3 Two Random Variables.
Sampling and estimation Petter Mostad
© 2002 Prentice-Hall, Inc.Chap 5-1 Statistics for Managers Using Microsoft Excel 3 rd Edition Chapter 5 The Normal Distribution and Sampling Distributions.
CLASSICAL NORMAL LINEAR REGRESSION MODEL (CNLRM )
Statistics Sampling Distributions and Point Estimation of Parameters Contents, figures, and exercises come from the textbook: Applied Statistics and Probability.
Lecture 3 Types of Probability Distributions Dr Peter Wheale.
Emergence of double scaling law in complex systems 报告人:韩定定 华东师范大学 上海应用物理研究所.
Chapter 6 Sampling and Sampling Distributions
1 New Directions for Power Law Research Michael Mitzenmacher Harvard University.
Random Walk for Similarity Testing in Complex Networks
Oliver Schulte Machine Learning 726
Chapter 4 Continuous Random Variables and Probability Distributions
Topics In Social Computing (67810)
Chapter 7: Sampling Distributions
CS224W: Social and Information Network Analysis
Modelling and Searching Networks Lecture 6 – PA models
Further Topics on Random Variables: Derived Distributions
Further Topics on Random Variables: Derived Distributions
Discrete Mathematics and its Applications Lecture 6 – PA models
Further Topics on Random Variables: Derived Distributions
Presentation transcript:

1 Dynamic Models for File Sizes and Double Pareto Distributions Michael Mitzenmacher Harvard University

2 Motivation Understanding file size distributions important for –Simulation tools: SURGE –Explaining network phenomena: power law for file sizes may explain self-similarity of network traffic. –Connections to other similar processes both in and out of computer science.

3 Controversy Recent work on file size distributions –Downey (2001): file sizes have lognormal distribution (model and empirical results). –Barford et al. (1999): file sizes have lognormal body and Pareto (power law) tail. (empirical) Wanted to settle discrepancy. Found rich (and insufficiently cited) history. –Other sciences have known about power laws a long time. –We should look to them before diving in. Helped lead to new file size model.

4 Power Law Distribution A power law distribution satisfies Pareto distribution –Log-complementary cumulative distribution function (ccdf) is exactly linear. Properties –Infinite mean/variance possible

5 Lognormal Distribution X is lognormally distributed if Y = ln X is normally distributed. Density function: Properties: –Finite mean/variance. –Skewed: mean > median > mode –Multiplicative: X 1 lognormal, X 2 lognormal implies X 1 X 2 lognormal.

6 Similarity Easily seen by looking at log-densities. Pareto has linear log-density. For large , lognormal has nearly linear log-density. Similarly, both have near linear log-ccdfs. –Log-ccdfs usually used for empirical, visual tests of power law behavior. Question: how to differentiate them empirically?

7 Lognormal vs. Power Law Question: Is this distribution lognormal or a power law? –Reasonable follow-up: Does it matter? Primarily in economics –Income distribution. –Stock prices. (Black-Scholes model.) But also papers in ecology, biology, astronomy, etc.

8 Generative Models: Lognormal Start with an organism of size X 0. At each time step, size changes by a random multiplicative factor. If F t is taken from a lognormal distribution, each X t is lognormal. If F t are independent, identically distributed then (by CLT) X t converges to lognormal distribution.

9 BUT! If there exists a lower bound: then X t converges to a power law distribution. (Champernowne, 1953) Lognormal model easily pushed to a power law model.

10 Example At each time interval, suppose size either increases by a factor of 2 with probability 1/3, or decreases by a factor of 1/2 with probability 2/3. –Limiting distribution is lognormal. –But if size has a lower bound, power law

11 Example continued After n steps distribution increases - decreases becomes normal (CLT). Limiting distribution:

12 History Power laws –Pareto : income distribution, 1897 –Zipf-Auerbach: city sizes, 1913/1940’s –Zipf-Estouf: word frequency, 1916/1940’s –Lotka: bibliometrics, 1926 –Mandelbrot: economics/information theory, 1950’s+ Lognormal –McAlister, Kapetyn: 1879, –Gibrat: multiplicative processes, 1930’s.

13 A Slight Aside Many things the computer science community think of as new with power law/ heavy tail distributions have long been known in statistical economics. Companion paper: –“A Brief History of Generative Models for Power Law and Lognormal Distributions.” The Web is amazing… –Yule’s 1924 paper “A Mathematical Theory of Evolution…” is on-line.

14 Double Pareto Distributions Consider continuous version of lognormal generative model. –At time t, log X t is normal with mean  t and variance  2 t Suppose observation time is randomly distributed. –Income model: observation time depends on age, generations in the country, etc.

15 Double Pareto Distributions Reed (2000,2001) analyzes case where time distributed exponentially. –Also Adamic, Huberman (1999). Simplest case: 

16 Double Pareto Behavior Double Pareto behavior, density –On log-log plot, density is two straight lines –Between lognormal (curved) and power law (one line) Can have lognormal shaped body, Pareto tail. –The ccdf has Pareto tail; linear on log-log plots. –But cdf is also linear on log-log plots.

17 Double Pareto File Sizes Reed used Double Pareto to explain income distribution –Appears to have lognormal body, Pareto tail. Double Pareto shape closely matches empirical file size distribution. –Appears to have lognormal body, Pareto tail. Is there a reasonable model for file sizes that yields a Double Pareto distribution?

18 Downey’s Ideas Most files derived from others by copying, editing, or filtering. Start with a single file. Each new file derived from old file. Like lognormal generative process. –Individual file sizes converge to lognormal.

19 Problems “Global” distribution not lognormal. –Mixture of lognormal distributions. Everything derived from single file. –Not realistic. –Large correlation: one big file near root affects everybody. Deletions not handled.

20 Recursive Forsest File Size Model Keep Downey’s basic process. At each time step, either –Completely new file generated (prob. p), with distribution F 1 or –New file is derived from old file (prob. 1 - p): Simplifying assumptions. –Distribution F 1 = F 2 = F is lognormal. –Old file chosen uniformly at random.

21 Recursive Forest Depth 0 = new files Depth 1 Depth 2

22 Depth Distribution Node depths have geometric distribution. –# Depth 0 nodes converge to pt; depth 1 nodes converge to p(1-p)t, etc. –So number of multiplicative steps is geometric. –Discrete analogue of exponential distribution of Reed’s model. Yields Double Pareto file size distribution. –File chosen uniformly at random has almost exponential number of time steps. –Lognormal body, heavy tail. –But no nice closed form.

23 Extension: Deletions Suppose files deleted uniformly at random with probability q. –New file generated with probability p. –New file derived with probability 1 - p - q. File depths still geometrically distributed. So still a Double Pareto file size distribution.

24 Extensions: Preferential Attachment Suppose new file derived from old file with preferential attachment. –Old file chosen with weight proportional to ax + b, where x = #current children. File depths still geometrically distributed. So still get a double Pareto distribution.

25 Extensions: Correlation Each tree in the forest is small. –Any multiplicative edge affects few files. Martingale argument shows that small correlations do not affect distribution. Large systems converge to Double Pareto distribution.

26 Extensions: Distributions Choice of distribution F 1, F 2 matter. But not dramatically. –Central limit theorem still applies. –General closed forms very difficult.

27 Previous Models Downey –Introduced simple derivation model. HOT [Zhu, Yu, Doyle, 2001] –Information theoretic model. –File sizes chosen by Web system designers to maximize information/unit cost to user. –Similar to early heavy tail work by Mandelbrot.

28 Conclusions Recursive Forest File Model –is simple, general. –is robust to changes (deletions, preferential attachement, etc.) –explains lognormal body / heavy tail phenomenon.

29 Future Directions Tools for characterizing double-Pareto and double-Pareto lognormal parameters. –Fine tune matches to empirical results. Find evidence supporting/contradicting the model. –File system histories, etc. Applications in other fields. –Explains Double Pareto distributions in generational settings.