1 Toward Validation and Control of Network Models Michael Mitzenmacher Harvard University
2 Articles Related to This Talk (Internet Mathematics) –The Future of Power Law Research –A Brief History of Generative Models for Power Law and Lognormal Distributions
3 Motivation: General Network Science and Engineering is emerging as its own (sub)field. –NSF: cross-cutting area starting this year. –Courses: Cornell (Easley/Kleinberg), U Penn (Kearns), many others. For undergrads, not just grads! –In popular culture: books like Linked by Barabasi or Six Degrees by Watts. –Other sciences: economics, biology, physics, ecology, linguistics, etc. What has been and what should be the research agenda?
4 My (Biased) View The 5 stages of networking research. 1)Observe: Gather data to demonstrate a behavior in a system. (Example: power law behavior.) 2)Interpret: Explain the importance of this observation in the system context. 3)Model: Propose an underlying model for the observed behavior of the system. 4)Validate: Find data to validate (and if necessary specialize or modify) the model. 5)Control: Design ways to control and modify the underlying behavior of the system based on the model.
5 My (Biased) View In networks, we have spent a lot of time observing and interpreting behaviors. We are currently very active in modeling. –Many, many possible models. –Perhaps easiest to write papers about. We need to now put much more focus on validation and control. –Have been moving in this direction. –And these are specific areas where computer science has much to contribute!
6 Models After observation, the natural step is to explain/model the behavior. Outcome: lots of modeling papers. –And many models rediscovered. Example : power laws Lots of history…
7 History In the 1990s, the abundance of observed power laws in networks surprised the community. –Perhaps it shouldn't have… power laws appear frequently throughout the sciences. Pareto: income distribution, 1897; Zipf-Auerbach: city sizes, 1913/1940s; Zipf-Estoup: word frequency, 1916/1940s; Lotka: bibliometrics, 1926; Yule: species and genera; Mandelbrot: economics/information theory, 1950s+. Observation/interpretation were/are key to initial understanding. My claim: the mere existence of power laws should no longer be surprising, or necessarily even noteworthy. My (biased) opinion: The bar should now be very high for observation/interpretation.
8 So Many Models… Preferential Attachment Optimization (HOT) Monkeys typing randomly (scaling) Multiplicative processes Kronecker graphs Forest fire model (densification)
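As one concrete instance of the models listed above, here is a minimal sketch of preferential attachment in its simplest form, where each new node adds a single link. The function name, the one-link-per-node rule, and the two-node starting graph are assumptions made for illustration; the talk does not fix a particular variant.

```python
import random
from collections import Counter

def preferential_attachment(n, seed=0):
    """Grow a graph to n nodes. Each new node links to one existing node,
    chosen with probability proportional to that node's current degree."""
    rng = random.Random(seed)
    # Start from a single edge between nodes 0 and 1 (an assumed seed graph).
    endpoints = [0, 1]            # each edge contributes both of its endpoints
    degree = Counter({0: 1, 1: 1})
    for new in range(2, n):
        # A uniform pick from `endpoints` selects a node with probability
        # proportional to its degree, since a node appears once per edge.
        target = rng.choice(endpoints)
        endpoints.extend([new, target])
        degree[new] += 1
        degree[target] += 1
    return degree

degree = preferential_attachment(10_000)
histogram = Counter(degree.values())   # degree -> number of nodes with it
```

Plotting `histogram` on log-log axes is the usual first look at the tail, though a straight-looking plot alone does not validate the model.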
9 What Makes a Good Model… New variations coming up all of the time. Question : What makes a new network model sufficiently interesting to merit attention and/or publication? –Strong connection to an observed process. Many models claim this, but few demonstrate it convincingly. –Theory perspective: significant new mathematical insight or sophistication. A matter of taste? My (biased) opinion: the bar should start being raised on model papers.
10 Validation: The Current Stage We now have so many models. It is important to know the right model, to extrapolate and control future behavior. Given a proposed underlying model, we need tools to help us validate it. We appear to be entering the validation stage of research…. BUT the first steps have focused on invalidation rather than validation.
11 Examples: Invalidation Lakhina, Byers, Crovella, Xie –Show that the observed power law in Internet topology might be an artifact of biases in traceroute sampling. Pedarsani, Figueiredo, Grossglauser –Show that densification may also arise from the sampling approach, and is not necessarily intrinsic to the network. Chen, Chang, Govindan, Jamin, Shenker, Willinger –Show that Internet topology has characteristics that do not match preferential-attachment graphs. –Suggest an alternative mechanism. But does this alternative match all characteristics, or are we still missing some?
12 My (Biased) View Invalidation is an important part of the process! BUT it is inherently different from validating a model. Validating seems much harder. Indeed, it is arguable what constitutes a validation. Question: what should it mean to say “This model is consistent with observed data”?
13 An Alternative View There is no “right model”. A model is the best until some other model comes along and proves better. –Greedy refinement via invalidation in model space. –Statistical techniques: compare likelihood ratios for various models. My (biased) opinion: this is one useful approach; but not the end of the question. –Need methods other than comparison for confirming validity of a model.
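The likelihood-ratio comparison mentioned above can be sketched as follows for two candidate distributions, a Pareto power law and a lognormal. This is a simplified illustration, not a full methodology: the function names are hypothetical, the Pareto cutoff `xmin` is assumed known, and the lognormal is fitted on all of x > 0 rather than truncated at `xmin`, which a careful comparison would correct for.

```python
import math
import random

def pareto_loglik(xs, xmin):
    """MLE fit of a Pareto density alpha * xmin**alpha / x**(alpha + 1)
    on x >= xmin. Returns (alpha, maximized log-likelihood)."""
    n = len(xs)
    alpha = n / sum(math.log(x / xmin) for x in xs)
    log_sum = sum(math.log(x) for x in xs)
    loglik = n * math.log(alpha) + n * alpha * math.log(xmin) - (alpha + 1) * log_sum
    return alpha, loglik

def lognormal_loglik(xs):
    """MLE fit of a lognormal. Returns (mu, sigma, maximized log-likelihood)."""
    n = len(xs)
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n
    loglik = -sum(logs) - 0.5 * n * math.log(2 * math.pi * var) - 0.5 * n
    return mu, math.sqrt(var), loglik

# Synthetic data drawn from a Pareto with alpha = 2.5 and xmin = 1.
rng = random.Random(1)
data = [rng.paretovariate(2.5) for _ in range(5000)]

alpha, ll_pareto = pareto_loglik(data, xmin=1.0)
mu, sigma, ll_lognormal = lognormal_loglik(data)
log_ratio = ll_pareto - ll_lognormal   # positive favors the Pareto fit
```

A positive `log_ratio` only says the Pareto fits better than the lognormal on this data; it says nothing about whether either model is adequate, which is the point of the slide above.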
14 Time-Series/Trace Analysis Many models posit some sort of actions. –New pages linking to pages in the Web. –New routers joining the network. –New files appearing in a file system. A validation approach: gather traces and see if the traces suitably match the model. –Trace gathering can be a challenging systems problem. –Checking whether the traces match the model requires appropriate statistical techniques and tests. –May lead to new, improved, better justified models.
15 Sampling and Trace Analysis Often, cannot record all actions. –Internet is too big! Sampling –Global: snapshots of entire system at various times. –Local: record actions of sample agents in a system. Examples: –Snapshots of file systems: full systems vs. actions of individual users. –Router topology: Internet maps vs. changes at subset of routers. Question: how much/what kind of sampling is sufficient to validate a model appropriately? –Does this differ among models?
16 To Control In many systems, intervention can impact the outcome. –Maybe not for earthquakes, but for computer networks! –Typical setting: individual agents acting in their own selfish interest. Agents can be given incentives to change behavior. General problem: given a good model, determine how to change system behavior to optimize a global performance function. –Distributed algorithmic mechanism design. –Mix of economics/game theory and computer science.
17 Possible Control Approaches Adding constraints: local or global –Example: total space in a file system. –Example: preferential attachment but links limited by an underlying metric. Add incentives or costs –Example: charges for exceeding soft disk quotas. –Example: payments for certain AS level connections. Limiting information –Impact decisions by not letting everyone have true view of the system.
18 My Related Work : Hash Algorithms On the Internet, we need a measurement and monitoring infrastructure, for validation and control. –Approximate is fine; speed is key. –Must be general, multi-purpose. –Must allow data aggregation. Solution : hash-based architecture. –Eventual goal: every router has a programmable “hash engine”.
19 Vision Three-pronged research data. Low: Efficient hardware implementations of relevant algorithms and data structures. Medium: New, improved data structures and algorithms for old and new applications. High: Distributed infrastructure supporting monitoring and measurement schemes.
20 The High-Level Pitch Lots of hash-based schemes being designed for approximate measurement/monitoring tasks. –But not built into the system to begin with. Want a flexible router architecture that allows: –New methods to be easily added. –Distributed cooperation using such schemes.
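To make "hash-based scheme for approximate measurement" concrete, here is a minimal sketch of one well-known example, a Count-Min sketch, chosen for illustration because sketches built with the same parameters can be merged by adding counters, which matches the data-aggregation requirement. The talk does not name this structure; the class name, the parameters, and the use of salted BLAKE2 as the per-row hash are all assumptions.

```python
import hashlib

class CountMinSketch:
    """Approximate per-key counters; estimates never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One hash function per row, derived by salting BLAKE2 with the row id.
        digest = hashlib.blake2b(
            key.encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.rows[row][self._index(row, key)] += count

    def estimate(self, key):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.rows[row][self._index(row, key)]
                   for row in range(self.depth))

    def merge(self, other):
        """Aggregate another sketch (e.g. from another router) into this one."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share width and depth")
        for mine, theirs in zip(self.rows, other.rows):
            for i, value in enumerate(theirs):
                mine[i] += value
```

Two monitors could each fill their own sketch and a collector could call `merge` to answer queries over the combined traffic, without ever seeing the raw packets.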
21 What We Need On-Chip Memory Hashing Computation Unit Off-Chip Memory CAM(s) Programming Language Memory Unit for Other Computation Communication + Control Control System Communication Architecture
22 Lots of Design Questions How much space for various memory levels? How to dynamically divide memory among competing applications? What hash functions should be included? Openness to new hash functions? What programming language and functionality? What communication infrastructure? Security? And so on…
23 Which Hash Functions? Theorists: –Want analyzable hash functions. –Dislike standard assumption of perfectly random hash functions. –Hard to prove things about actual performance. Practitioners –Want easy implementation, speed, small space. –Want simple analysis (back-of-the-envelope). –Will accept simulated results under right settings.
24 Why Do Weak Hash Functions Work So Well? In reality, assuming perfectly random hash functions seems to be the right thing to do. –Easier to analyze. –Real systems almost always work that way, even with weak hash functions! Can Theory explain strong performance of weak hash functions?
25 Recent Work A new explanation (joint work with Salil Vadhan): Choosing a hash function from a pairwise independent family is enough – if data has sufficient entropy. –Randomness of hash function and data “combine”. –Behavior matches truly random hash function with high probability. Techniques based on theory of randomness extraction. –Extensions of Leftover Hash Lemma.
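For readers unfamiliar with pairwise independent families, a standard textbook example is the affine family x -> (a*x + b) mod p with a and b drawn uniformly from {0, ..., p-1} for a prime p. The sketch below assumes integer keys smaller than p; the names `affine` and `make_hash` and the choice p = 2^61 - 1 are illustrative, not taken from the talk. Note that the final reduction mod m makes the output only approximately uniform over the m buckets.

```python
import random

MERSENNE_P = (1 << 61) - 1   # a prime, assumed larger than any key

def affine(a, b, x, p):
    """One member of the family: x -> (a*x + b) mod p."""
    return (a * x + b) % p

def make_hash(m, p=MERSENNE_P, rng=random):
    """Draw a random member of the family and map its output to m buckets."""
    a = rng.randrange(p)
    b = rng.randrange(p)

    def h(x):
        return affine(a, b, x, p) % m

    return h

# Pairwise independence over Z_p, checked exhaustively for a tiny prime:
# for two fixed distinct keys, every pair of outputs arises from exactly
# one choice of (a, b).
p_small = 5
outputs = [(affine(a, b, 1, p_small), affine(a, b, 3, p_small))
           for a in range(p_small) for b in range(p_small)]
each_pair_once = len(set(outputs)) == p_small * p_small
```

The result on the slide says that such a cheap family behaves like a truly random function provided the data itself contributes enough entropy; the code only shows how little randomness the hash function needs to carry.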
26 What Functionality? Hash tables should be a basic primitive. “Best” hash tables: cuckoo hashing. –Worst case constant lookup time. –Simple to build, design. How can we make them even better? –Move cuckoo hashing from theory to practice!
27 Cuckoo Hashing [Pagh,Rodler] Basic scheme: each element gets two possible locations. To insert x, check both locations for x. If one is empty, insert. If both are full, x kicks out an old element y. Then y moves to its other location. If that location is full, y kicks out z, and so on, until an empty slot is found.
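The insertion procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: a single array of slots, two locations per key derived from Python's built-in `hash` with random salts, and a cap (`max_kicks`) on the eviction chain. The class and parameter names are hypothetical. When the cap is reached, `insert` returns the element left without a slot so the caller can decide what to do, for example re-hash.

```python
import random

class CuckooTable:
    def __init__(self, size, max_kicks=100, seed=0):
        self.size = size
        self.max_kicks = max_kicks
        self.slots = [None] * size
        rng = random.Random(seed)
        self.salts = (rng.getrandbits(64), rng.getrandbits(64))

    def _locations(self, key):
        # Two candidate slots per key (they may coincide by chance).
        return [hash((salt, key)) % self.size for salt in self.salts]

    def lookup(self, key):
        # Worst-case constant time: only two slots are ever inspected.
        return any(self.slots[loc] == key for loc in self._locations(key))

    def insert(self, key):
        """Return None on success, or the element that ended up homeless."""
        if self.lookup(key):
            return None
        first, second = self._locations(key)
        for loc in (first, second):
            if self.slots[loc] is None:
                self.slots[loc] = key
                return None
        loc = first
        for _ in range(self.max_kicks):
            # Place the element in hand and pick up the one it kicks out.
            key, self.slots[loc] = self.slots[loc], key
            a, b = self._locations(key)
            loc = b if loc == a else a      # the evicted element's other slot
            if self.slots[loc] is None:
                self.slots[loc] = key
                return None
        return key                          # gave up: cycle or very long path
```

Bounding the number of kicks is a common practical stand-in for explicit cycle detection; it treats a long path and a cycle the same way.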
28-33 Cuckoo Hashing Examples [Figure sequence: a cuckoo hash table holding elements A, B, C, D, E; element F is inserted, then element G, with existing elements relocated to make room.]
34 Cuckoo Hashing Failures Bad case 1: inserted element runs into cycles. Bad case 2: inserted element has very long path before insertion completes. –Could be on a long cycle. Bad cases occur with small probability when load is sufficiently low, but that probability is not negligible. Theoretical solution: re-hash everything if a failure occurs. For 2 choices, load less than 50%, n elements gives failure rate of Θ(1/n); maximum insert time O(log n). –Better space utilization and rate for more choices, more elements per bucket.
35 Recent Work: A CAM-Stash Use a CAM (Content Addressable Memory) to stash away elements that would cause failure. –Joint with Kirsch/Wieder. Intuition: if failures were independent, probability that s elements cause failures goes to Θ(1/n^s). –Failures not independent, but nearly so. –A stash holding a constant number of elements greatly reduces failure probability. –Implemented as a CAM in hardware, or a cache line in hardware/software. Lookup requires also looking at stash.
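A minimal software sketch of the stash idea, assuming a small Python list stands in for the CAM and that an insertion which exhausts its kick budget parks the leftover element in the stash. The class name, `stash_size`, and `max_kicks` are illustrative choices, not values from the talk. `insert` returns the homeless element only when the stash is also full, which is the point at which a re-hash would be needed.

```python
import random

class StashedCuckooTable:
    def __init__(self, size, stash_size=4, max_kicks=100, seed=0):
        self.size = size
        self.stash_size = stash_size
        self.max_kicks = max_kicks
        self.slots = [None] * size
        self.stash = []                     # stands in for the CAM
        rng = random.Random(seed)
        self.salts = (rng.getrandbits(64), rng.getrandbits(64))

    def _locations(self, key):
        return [hash((salt, key)) % self.size for salt in self.salts]

    def lookup(self, key):
        # Two table probes plus a check of the constant-size stash.
        in_table = any(self.slots[loc] == key for loc in self._locations(key))
        return in_table or key in self.stash

    def insert(self, key):
        """Return None on success, or the homeless element if the stash is full."""
        if self.lookup(key):
            return None
        locations = self._locations(key)
        for loc in locations:
            if self.slots[loc] is None:
                self.slots[loc] = key
                return None
        loc = locations[0]
        for _ in range(self.max_kicks):
            key, self.slots[loc] = self.slots[loc], key   # evict occupant
            a, b = self._locations(key)
            loc = b if loc == a else a
            if self.slots[loc] is None:
                self.slots[loc] = key
                return None
        if len(self.stash) < self.stash_size:
            self.stash.append(key)          # would have been a failure
            return None
        return key                          # stash full too: re-hash needed
```

Because the stash holds only a constant number of elements, checking it adds constant work to each lookup, mirroring the parallel match a hardware CAM would perform.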
36 Modeling: Economic Principles Joint work with Corbo, Jain, Parkes. Exploration: what models make sense for AS connectivity? –Extending approach of Chang, Jamin, Mao, Willinger. –Entering nodes link according to business model, utility function. –Nodes revise their links based on new entrants. Like the forest fire model. Future considerations: how to validate such models.
37 Conclusion : My (Biased) View There are 5 stages of networking research. 1)Observe: Gather data to demonstrate power law behavior in a system. 2)Interpret: Explain the import of this observation in the system context. 3)Model: Propose an underlying model for the observed behavior of the system. 4)Validate: Find data to validate (and if necessary specialize or modify) the model. 5)Control: Design ways to control and modify the underlying behavior of the system based on the model. We need to focus on validation and control. –Lots of open research problems.
38 A Chance for Collaboration The observe/interpret stages of research are dominated by systems; modeling dominated by theory. –And need new insights, from statistics, control theory, economics!!! Validation and control require a strong theoretical foundation. –Need universal ideas and methods that span different types of systems. –Need understanding of underlying mathematical models. But also a large systems buy-in. –Getting/analyzing/understanding data. –Find avenues for real impact. Good area for future systems/theory/others collaboration and interaction.
39 More About Me Website: –Links to papers –Link to book –Link to blog: mybiasedcoin (mybiasedcoin.blogspot.com)