1 New Directions for Power Law Research Michael Mitzenmacher Harvard University.

Slides:



Advertisements
Similar presentations
Fundamentals of Probability
Advertisements

© Jim Barritt 2005School of Biological Sciences, Victoria University, Wellington MSc Student Supervisors : Dr Stephen Hartley, Dr Marcus Frean Victoria.
Analysis of Computer Algorithms
Observational Learning in Random Networks Julian Lorenz, Martin Marciniszyn, Angelika Steger Institute of Theoretical Computer Science, ETH.
Introductory Mathematics & Statistics for Business
Subspace Embeddings for the L1 norm with Applications Christian Sohler David Woodruff TU Dortmund IBM Almaden.
Performance in Decentralized Filesharing Networks Theodore Hong Freenet Project.
STATISTICS Sampling and Sampling Distributions
STATISTICS HYPOTHESES TEST (III) Nonparametric Goodness-of-fit (GOF) tests Professor Ke-Sheng Cheng Department of Bioenvironmental Systems Engineering.
1 Introduction to Transportation Systems. 2 PART I: CONTEXT, CONCEPTS AND CHARACTERIZATI ON.
Sampling Research Questions
XP New Perspectives on Microsoft Office Word 2003 Tutorial 7 1 Microsoft Office Word 2003 Tutorial 7 – Collaborating With Others and Creating Web Pages.
Lecture 2 ANALYSIS OF VARIANCE: AN INTRODUCTION
CS1512 Foundations of Computing Science 2 Week 3 (CSD week 32) Probability © J R W Hunter, 2006, K van Deemter 2007.
1 Correlation and Simple Regression. 2 Introduction Interested in the relationships between variables. What will happen to one variable if another is.
Assumptions underlying regression analysis
Probability Distributions
Correctness of Gossip-Based Membership under Message Loss Maxim GurevichIdit Keidar Technion.
Chapter 7 Sampling and Sampling Distributions
1 Outline relationship among topics secrets LP with upper bounds by Simplex method basic feasible solution (BFS) by Simplex method for bounded variables.
Dr. Engr. Sami ur Rahman Data Analysis Lecture 6: SPSS.
Introduction Lesson 1 Microsoft Office 2010 and the Internet
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Modeling Grid Job Time Properties Lovro.
Configuration management
Fact-finding Techniques Transparencies
Information Systems Today: Managing in the Digital World
On Comparing Classifiers : Pitfalls to Avoid and Recommended Approach
Chapter 3 Learning to Use Regression Analysis Copyright © 2011 Pearson Addison-Wesley. All rights reserved. Slides by Niels-Hugo Blunch Washington and.
(This presentation may be used for instructional purposes)
Detection Chia-Hsin Cheng. Wireless Access Tech. Lab. CCU Wireless Access Tech. Lab. 2 Outlines Detection Theory Simple Binary Hypothesis Tests Bayes.
1 Generating Network Topologies That Obey Power LawsPalmer/Steffan Carnegie Mellon Generating Network Topologies That Obey Power Laws Christopher R. Palmer.
Shadow Prices vs. Vickrey Prices in Multipath Routing Parthasarathy Ramanujam, Zongpeng Li and Lisa Higham University of Calgary Presented by Ajay Gopinathan.
Contingency Tables Prepared by Yu-Fen Li.
Hash Tables.
5-1 Chapter 5 Theory & Problems of Probability & Statistics Murray R. Spiegel Sampling Theory.
1 Panel Data Analysis – Advantages and Challenges Cheng Hsiao.
Hypothesis Tests: Two Independent Samples
Scale Free Networks.
Domain 1: Planning and Preparation
Science as a Process Chapter 1 Section 2.
1 General Iteration Algorithms by Luyang Fu, Ph. D., State Auto Insurance Company Cheng-sheng Peter Wu, FCAS, ASA, MAAA, Deloitte Consulting LLP 2007 CAS.
1 On Optimal Reinsurance Arrangement Yisheng Bu Liberty Mutual Group.
25 seconds left…...
Introduction to Queuing Theory
Determining How Costs Behave
1 Random Sampling - Random Samples. 2 Why do we need Random Samples? Many business applications -We will have a random variable X such that the probability.
Chapter 12 Analyzing Semistructured Decision Support Systems Systems Analysis and Design Kendall and Kendall Fifth Edition.
VPN AND REMOTE ACCESS Mohammad S. Hasan 1 VPN and Remote Access.
IP, IST, José Bioucas, Probability The mathematical language to quantify uncertainty  Observation mechanism:  Priors:  Parameters Role in inverse.
Choosing an Order for Joins
Part 11: Random Walks and Approximations 11-1/28 Statistics and Data Analysis Professor William Greene Stern School of Business IOMS Department Department.
Copyright © 2010, 2007, 2004 Pearson Education, Inc. Chapter 14 From Randomness to Probability.
January Structure of the book Section 1 (Ch 1 – 10) Basic concepts and techniques Section 2 (Ch 11 – 15): Inference for quantitative outcomes Section.
4/4/2015Slide 1 SOLVING THE PROBLEM A one-sample t-test of a population mean requires that the variable be quantitative. A one-sample test of a population.
The Small World Phenomenon: An Algorithmic Perspective Speaker: Bradford Greening, Jr. Rutgers University – Camden.
Analysis and Modeling of Social Networks Foudalis Ilias.
Information Networks Generative processes for Power Laws and Scale-Free networks Lecture 4.
4. PREFERENTIAL ATTACHMENT The rich gets richer. Empirical evidences Many large networks are scale free The degree distribution has a power-law behavior.
A Brief History of Lognormal and Power Law Distributions and an Application to File Size Distributions Michael Mitzenmacher Harvard University.
1 A Random-Surfer Web-Graph Model (Joint work with Avrim Blum & Hubert Chan) Mugizi Rwebangira.
1 A History of and New Directions for Power Law Research Michael Mitzenmacher Harvard University.
1 Toward Validation and Control of Network Models Michael Mitzenmacher Harvard University.
CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.
1 Dynamic Models for File Sizes and Double Pareto Distributions Michael Mitzenmacher Harvard University.
1 New Directions for Power Law Research Michael Mitzenmacher Harvard University (With thanks to David Parkes!)
Models and Algorithms for Complex Networks Power laws and generative processes.
Data Analysis Econ 176, Fall Populations When we run an experiment, we are always measuring an outcome, x. We say that an outcome belongs to some.
Lecture 10: Network models CS 765: Complex Networks Slides are modified from Networks: Theory and Application by Lada Adamic.
Dynamic Network Analysis Case study of PageRank-based Rewiring Narjès Bellamine-BenSaoud Galen Wilkerson 2 nd Second Annual French Complex Systems Summer.
1 New Directions for Power Law Research Michael Mitzenmacher Harvard University.
Presentation transcript:

1 New Directions for Power Law Research Michael Mitzenmacher Harvard University

2 Internet Mathematics The Future of Power Law Research Articles Related to This Talk A Brief History of Generative Models for Power Law and Lognormal Distributions Dynamic Models for File Sizes and Double Pareto Distributions

3 Motivation: General Power laws (and/or scale-free networks) are now everywhere. –See the popular texts Linked by Barabasi or Six Degrees by Watts. –In computer science: file sizes, download times, Internet topology, Web graph, etc. –Other sciences: Economics, physics, ecology, linguistics, etc. What has been and what should be the research agenda?

4 My (Biased) View There are 5 stages of power law network research. 1)Observe: Gather data to demonstrate power law behavior in a system. 2)Interpret: Explain the importance of this observation in the system context. 3)Model: Propose an underlying model for the observed behavior of the system. 4)Validate: Find data to validate (and if necessary specialize or modify) the model. 5)Control: Design ways to control and modify the underlying behavior of the system based on the model.

5 My (Biased) View In networks, we have spent a lot of time observing and interpreting power laws. We are currently in the modeling stage. –Many, many possible models. –Ill talk about some of my favorites later on. We need to now put much more focus on validation and control. –And these are specific areas where computer science has much to contribute!

6 Models After observation, the natural step is to explain/model the behavior. Outcome: lots of modeling papers. –And many models rediscovered. Lots of history…

7 History In 1990s, the abundance of observed power laws in networks surprised the community. –Perhaps they shouldnt have… power laws appear frequently throughout the sciences. Pareto : income distribution, 1897 Zipf-Auerbach: city sizes, 1913/1940s Zipf-Estouf: word frequency, 1916/1940s Lotka: bibliometrics, 1926 Yule: species and genera, Mandelbrot: economics/information theory, 1950s+ Observation/interpretation were/are key to initial understanding. My claim: but now the mere existence of power laws should not be surprising, or necessarily even noteworthy. My (biased) opinion: The bar should now be very high for observation/interpretation.

8 Power Law Distribution A power law distribution satisfies Pareto distribution –Log-complementary cumulative distribution function (ccdf) is exactly linear. Properties –Infinite mean/variance possible

9 Lognormal Distribution X is lognormally distributed if Y = ln X is normally distributed. Density function: Properties: –Finite mean/variance. –Skewed: mean > median > mode –Multiplicative: X 1 lognormal, X 2 lognormal implies X 1 X 2 lognormal.

10 Similarity Easily seen by looking at log-densities. Pareto has linear log-density. For large, lognormal has nearly linear log- density. Similarly, both have near linear log-ccdfs. –Log-ccdfs usually used for empirical, visual tests of power law behavior. Question: how to differentiate them empirically?

11 Lognormal vs. Power Law Question: Is this distribution lognormal or a power law? –Reasonable follow-up: Does it matter? Primarily in economics –Income distribution. –Stock prices. (Black-Scholes model.) But also papers in ecology, biology, astronomy, etc.

12 Preferential Attachment Consider dynamic Web graph. –Pages join one at a time. –Each page has one outlink. Let X j (t) be the number of pages of degree j at time t. New page links: –With probability, link to a random page. –With probability (1- ), a link to a page chosen proportionally to indegree. (Copy a link.)

13 Preferential Attachment History This model (without the graphs) was derived in the 1950s by Herbert Simon. –… who won a Nobel Prize in economics for entirely different work. –His analysis was not for Web graphs, but for other preferential attachment problems.

14 Optimization Model: Power Law Mandelbrot experiment: design a language over a d- ary alphabet to optimize information per character. –Probability of jth most frequently used word is p j. –Length of jth most frequently used word is c j. Average information per word: Average characters per word: Optimization leads to power law.

15 Monkeys Typing Randomly Miller (psychologist, 1957) suggests following: monkeys type randomly at a keyboard. –Hit each of n characters with probability p. –Hit space bar with probability 1 - np > 0. –A word is sequence of characters separated by a space. Resulting distribution of word frequencies follows a power law. Conclusion: Mandelbrots optimization not required for languages to have power law

16 Generative Models: Lognormal Start with an organism of size X 0. At each time step, size changes by a random multiplicative factor. If F t is taken from a lognormal distribution, each X t is lognormal. If F t are independent, identically distributed then (by CLT) X t converges to lognormal distribution.

17 BUT! If there exists a lower bound: then X t converges to a power law distribution. (Champernowne, 1953) Lognormal model easily pushed to a power law model.

18 Double Pareto Distributions Consider continuous version of lognormal generative model. –At time t, log X t is normal with mean t and variance 2 t Suppose observation time is distributed exponentially. –E.g., When Web size doubles every year. Resulting distribution is Double Pareto. –Between lognormal and Pareto. –Linear tail on a log-log chart, but a lognormal body.

19 Lognormal vs. Double Pareto

20 And So Many More… New variations coming up all of the time. Question : What makes a new power law model sufficiently interesting to merit attention and/or publication? –Strong connection to an observed process. Many models claim this, but few demonstrate it convincingly. –Theory perspective: new mathematical insight or sophistication. My (biased) opinion: the bar should start being raised on model papers.

21 Validation: The Current Stage We now have so many models. It may be important to know the right model, to extrapolate and control future behavior. Given a proposed underlying model, we need tools to help us validate it. We appear to be entering the validation stage of research…. BUT the first steps have focused on invalidation rather than validation.

22 Examples : Invalidation Lakhina, Byers, Crovella, Xie –Show that observed power-law of Internet topology might be because of biases in traceroute sampling. Chen, Chang, Govindan, Jamin, Shenker, Willinger –Show that Internet topology has characteristics that do not match preferential-attachment graphs. –Suggest an alternative mechanism. But does this alternative match all characteristics, or are we still missing some?

23 My (Biased) View Invalidation is an important part of the process! BUT it is inherently different than validating a model. Validating seems much harder. Indeed, it is arguable what constitutes a validation. Question: what should it mean to say This model is consistent with observed data.

24 Time-Series/Trace Analysis Many models posit some sort of actions. –New pages linking to pages in the Web. –New routers joining the network. –New files appearing in a file system. A validation approach: gather traces and see if the traces suitably match the model. –Trace gathering can be a challenging systems problem. –Check model match requires using appropriate statistical techniques and tests. –May lead to new, improved, better justified models.

25 Sampling and Trace Analysis Often, cannot record all actions. –Internet is too big! Sampling –Global: snapshots of entire system at various times. –Local: record actions of sample agents in a system. Examples: –Snapshots of file systems: full systems vs. actions of individual users. –Router topology: Internet maps vs. changes at subset of routers. Question: how much/what kind of sampling is sufficient to validate a model appropriately? –Does this differ among models?

26 To Control In many systems, intervention can impact the outcome. –Maybe not for earthquakes, but for computer networks! –Typical setting: individual agents acting in their own best interest, giving a global power law. Agents can be given incentives to change behavior. General problem: given a good model, determine how to change system behavior to optimize a global performance function. –Distributed algorithmic mechanism design. –Mix of economics/game theory and computer science.

27 Possible Control Approaches Adding constraints: local or global –Example: total space in a file system. –Example: preferential attachment but links limited by an underlying metric. Add incentives or costs –Example: charges for exceeding soft disk quotas. –Example: payments for certain AS level connections. Limiting information –Impact decisions by not letting everyone have true view of the system.

28 Conclusion : My (Biased) View There are 5 stages of power law research. 1)Observe: Gather data to demonstrate power law behavior in a system. 2)Interpret: Explain the import of this observation in the system context. 3)Model: Propose an underlying model for the observed behavior of the system. 4)Validate: Find data to validate (and if necessary specialize or modify) the model. 5)Control: Design ways to control and modify the underlying behavior of the system based on the model. We need to focus on validation and control. –Lots of open research problems.

29 A Chance for Collaboration The observe/interpret stages of research are dominated by systems; modeling dominated by theory. –And need new insights, from statistics, control theory, economics!!! Validation and control require a strong theoretical foundation. –Need universal ideas and methods that span different types of systems. –Need understanding of underlying mathematical models. But also a large systems buy-in. –Getting/analyzing/understanding data. –Find avenues for real impact. Good area for future systems/theory/others collaboration and interaction.