1 Toward Validation and Control of Network Models Michael Mitzenmacher Harvard University

2 Articles Related to This Talk (in Internet Mathematics):
–A Brief History of Generative Models for Power Law and Lognormal Distributions
–The Future of Power Law Research

3 Motivation: General
Network Science and Engineering is emerging as its own (sub)field.
–NSF: a cross-cutting area starting this year.
–Courses: Cornell (Easley/Kleinberg), U Penn (Kearns), and many others. For undergrads, not just grads!
–In popular culture: books like Linked by Barabási and Six Degrees by Watts.
–Other sciences: economics, biology, physics, ecology, linguistics, etc.
What has been, and what should be, the research agenda?

4 My (Biased) View
The 5 stages of networking research:
1) Observe: Gather data to demonstrate a behavior in a system. (Example: power law behavior.)
2) Interpret: Explain the importance of this observation in the system context.
3) Model: Propose an underlying model for the observed behavior of the system.
4) Validate: Find data to validate (and if necessary specialize or modify) the model.
5) Control: Design ways to control and modify the underlying behavior of the system based on the model.

5 My (Biased) View
In networks, we have spent a lot of time observing and interpreting behaviors. We are currently very active in modeling.
–Many, many possible models.
–Perhaps the easiest papers to write.
We now need to put much more focus on validation and control.
–We have been moving in this direction.
–And these are specific areas where computer science has much to contribute!

6 Models
After observation, the natural step is to explain/model the behavior.
Outcome: lots of modeling papers.
–And many models rediscovered.
Example: power laws. Lots of history…

7 History
In the 1990's, the abundance of observed power laws in networks surprised the community.
–Perhaps it shouldn't have… power laws appear frequently throughout the sciences:
Pareto: income distribution, 1897
Zipf-Auerbach: city sizes, 1913/1940's
Zipf-Estoup: word frequency, 1916/1940's
Lotka: bibliometrics, 1926
Yule: species and genera, 1925
Mandelbrot: economics/information theory, 1950's+
Observation/interpretation were (and are) key to initial understanding.
My claim: but now the mere existence of power laws should not be surprising, or necessarily even noteworthy.
My (biased) opinion: the bar should now be very high for observation/interpretation.

8 So Many Models…
–Preferential attachment (sketched below)
–Optimization (HOT)
–Monkeys typing randomly (scaling)
–Multiplicative processes
–Kronecker graphs
–Forest fire model (densification)
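As a concrete illustration of the first of these (a minimal sketch of my own, not code from the talk): in preferential attachment, each new node links to an existing node chosen with probability proportional to its current degree, and the resulting degree distribution follows a power law.

```python
import random
from collections import Counter

def preferential_attachment(n, seed=None):
    """Grow a graph one node at a time; each new node attaches to one
    existing node chosen with probability proportional to its degree."""
    rng = random.Random(seed)
    # Listing each endpoint once per incident edge makes a uniform draw
    # from this list equivalent to a degree-proportional draw.
    endpoints = [0, 1]                    # start with a single edge 0-1
    for new_node in range(2, n):
        target = rng.choice(endpoints)
        endpoints.extend([new_node, target])
    return Counter(endpoints)             # node -> degree

degrees = preferential_attachment(100_000, seed=42)
degree_counts = Counter(degrees.values())  # degree -> number of nodes
for d in sorted(degree_counts)[:8]:
    print(d, degree_counts[d])  # counts fall off roughly as d**(-3)
```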

9 What Makes a Good Model…
New variations are coming up all the time.
Question: What makes a new network model sufficiently interesting to merit attention and/or publication?
–A strong connection to an observed process. Many models claim this, but few demonstrate it convincingly.
–From the theory perspective: significant new mathematical insight or sophistication. A matter of taste?
My (biased) opinion: the bar should start being raised on model papers.

10 Validation: The Current Stage
We now have many models. It is important to know the right model in order to extrapolate and control future behavior.
Given a proposed underlying model, we need tools to help us validate it.
We appear to be entering the validation stage of research…
BUT the first steps have focused on invalidation rather than validation.

11 Examples: Invalidation
Lakhina, Byers, Crovella, Xie
–Show that the observed power law in Internet topology might be due to biases in traceroute sampling.
Pedarsani, Figueiredo, Grossglauser
–Show that densification may also arise from sampling approaches; it is not necessarily intrinsic to the network.
Chen, Chang, Govindan, Jamin, Shenker, Willinger
–Show that Internet topology has characteristics that do not match preferential-attachment graphs.
–Suggest an alternative mechanism. But does this alternative match all characteristics, or are we still missing some?

12 My (Biased) View
Invalidation is an important part of the process! BUT it is inherently different from validating a model.
Validating seems much harder. Indeed, it is arguable what constitutes a validation.
Question: what should it mean to say "This model is consistent with the observed data"?

13 An Alternative View
There is no "right model". A model is the best until some other model comes along and proves better.
–Greedy refinement via invalidation in model space.
–Statistical techniques: compare likelihood ratios for various models (sketched below).
My (biased) opinion: this is one useful approach, but not the end of the question.
–We need methods other than comparison for confirming the validity of a model.
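To make the likelihood-ratio bullet concrete, here is a minimal sketch (mine, not from the talk; the synthetic data and SciPy distributions are stand-ins) that fits two candidate models to the same data by maximum likelihood and compares their log-likelihoods:

```python
import numpy as np
from scipy import stats

# Stand-in data: heavy-tailed samples on [1, infinity).
data = stats.lomax.rvs(c=2.5, size=5000, random_state=1) + 1.0

# Fit two candidate models by maximum likelihood (location pinned at 0).
pareto_params = stats.pareto.fit(data, floc=0)
lognorm_params = stats.lognorm.fit(data, floc=0)

ll_pareto = np.sum(stats.pareto.logpdf(data, *pareto_params))
ll_lognorm = np.sum(stats.lognorm.logpdf(data, *lognorm_params))

# A positive ratio favors the power-law (Pareto) model; a careful
# comparison would also normalize and test significance (e.g., Vuong).
print("log-likelihood ratio:", ll_pareto - ll_lognorm)
```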

14 Time-Series/Trace Analysis
Many models posit some sort of actions:
–New pages linking to pages on the Web.
–New routers joining the network.
–New files appearing in a file system.
A validation approach: gather traces and see if the traces suitably match the model.
–Trace gathering can be a challenging systems problem.
–Checking the model match requires appropriate statistical techniques and tests (see the sketch below).
–May lead to new, improved, better justified models.
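One hedged sketch of the "check the match" step (mine; the Pareto model and sample sizes are placeholders): draw synthetic data from the proposed model and compare it against the trace with a two-sample test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
trace = rng.pareto(2.0, size=2000) + 1.0         # stand-in for measured trace data
model_sample = rng.pareto(2.0, size=2000) + 1.0  # data generated from the proposed model

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the
# trace is unlikely to have been produced by the model's distribution.
result = stats.ks_2samp(trace, model_sample)
print(f"KS statistic={result.statistic:.3f}, p-value={result.pvalue:.3f}")
```

The right test depends on the model; for a time series of actions one would compare dynamics (e.g., inter-arrival times or growth rates), not just a static distribution.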

15 Sampling and Trace Analysis
Often, we cannot record all actions.
–The Internet is too big!
Sampling:
–Global: snapshots of the entire system at various times.
–Local: record the actions of sample agents in the system.
Examples:
–Snapshots of file systems: full systems vs. actions of individual users.
–Router topology: Internet maps vs. changes at a subset of routers.
Question: how much and what kind of sampling is sufficient to validate a model appropriately?
–Does this differ among models?

16 To Control
In many systems, intervention can impact the outcome.
–Maybe not for earthquakes, but for computer networks!
–Typical setting: individual agents acting in their own selfish interest. Agents can be given incentives to change behavior.
General problem: given a good model, determine how to change system behavior to optimize a global performance function.
–Distributed algorithmic mechanism design.
–A mix of economics/game theory and computer science.

17 Possible Control Approaches
Adding constraints: local or global.
–Example: total space in a file system.
–Example: preferential attachment, but with links limited by an underlying metric.
Adding incentives or costs.
–Example: charges for exceeding soft disk quotas.
–Example: payments for certain AS-level connections.
Limiting information.
–Impact decisions by not letting everyone have a true view of the system.

18 My Related Work: Hash Algorithms
On the Internet, we need a measurement and monitoring infrastructure for validation and control.
–Approximate is fine; speed is key.
–Must be general and multi-purpose.
–Must allow data aggregation.
Solution: a hash-based architecture.
–Eventual goal: every router has a programmable "hash engine".

19 Vision
A three-pronged research agenda:
–Low: Efficient hardware implementations of relevant algorithms and data structures.
–Medium: New, improved data structures and algorithms for old and new applications.
–High: A distributed infrastructure supporting monitoring and measurement schemes.

20 The High-Level Pitch
Lots of hash-based schemes are being designed for approximate measurement/monitoring tasks.
–But they are not built into the system to begin with.
We want a flexible router architecture that allows:
–New methods to be easily added.
–Distributed cooperation using such schemes.

21 What We Need
[Architecture diagram: a hashing computation unit with on-chip memory, off-chip memory, CAM(s), a memory unit for other computation, a programming language, and communication + control links to the control system and communication architecture.]

22 Lots of Design Questions
How much space for the various memory levels?
How to dynamically divide memory among competing applications?
What hash functions should be included? Openness to new hash functions?
What programming language and functionality?
What communication infrastructure? Security?
And so on…

23 Which Hash Functions?
Theorists:
–Want analyzable hash functions.
–Dislike the standard assumption of perfectly random hash functions.
–Find it hard to prove things about actual performance.
Practitioners:
–Want easy implementation, speed, and small space.
–Want simple, back-of-the-envelope analysis.
–Will accept simulated results under the right settings.

24 Why Do Weak Hash Functions Work So Well?
In practice, assuming perfectly random hash functions seems to be the right thing to do.
–Easier to analyze.
–Real systems almost always behave that way, even with weak hash functions!
Can theory explain the strong performance of weak hash functions?

25 Recent Work
A new explanation (joint work with Salil Vadhan): choosing a hash function from a pairwise-independent family is enough – if the data has sufficient entropy.
–The randomness of the hash function and of the data "combine".
–Behavior matches a truly random hash function with high probability.
Techniques are based on the theory of randomness extraction.
–Extensions of the Leftover Hash Lemma.
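For concreteness, one classic pairwise-independent family (a sketch; this is the textbook construction, not necessarily the one used in the paper) is h_{a,b}(x) = ((a·x + b) mod p) mod m for a prime p and random a, b:

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed larger than any key we hash

def random_pairwise_hash(m, seed=None):
    """Return h(x) = ((a*x + b) % p) % m. For any two distinct keys,
    the pair of hash values is (nearly) uniform over bucket pairs."""
    rng = random.Random(seed)
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

h = random_pairwise_hash(1024, seed=7)
print(h(12345), h(67890))
```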

26 What Functionality?
Hash tables should be a basic primitive.
The "best" hash tables: cuckoo hashing.
–Worst-case constant lookup time.
–Simple to build and design.
How can we make them even better?
–Move cuckoo hashing from theory to practice!

27 Cuckoo Hashing [Pagh, Rodler]
Basic scheme: each element gets two possible locations.
To insert x, check both locations for x. If one is empty, insert.
If both are full, x kicks out an old element y. Then y moves to its other location. If that location is full, y kicks out z, and so on, until an empty slot is found.
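A minimal Python sketch of this insertion procedure (my illustration, not Pagh and Rodler's code; the toy hash functions and the MAX_KICKS bound are my choices, with the rehash fallback anticipating the failure cases below):

```python
import random

class CuckooHashTable:
    MAX_KICKS = 500  # give up and rehash after this many displacements

    def __init__(self, size=1024):
        self.size = size
        self.table = [None] * size
        self._reseed()

    def _reseed(self):
        # Toy hash functions: two independently salted positions per key.
        self.salts = (random.random(), random.random())

    def _slots(self, key):
        return [hash((salt, key)) % self.size for salt in self.salts]

    def lookup(self, key):
        # Worst-case constant time: only two slots to check.
        return any(self.table[i] == key for i in self._slots(key))

    def insert(self, key):
        for _ in range(self.MAX_KICKS):
            i, j = self._slots(key)
            if self.table[i] is None:
                self.table[i] = key
                return
            if self.table[j] is None:
                self.table[j] = key
                return
            # Both slots full: kick out the occupant of one of them and
            # carry the evicted key to its other location next iteration.
            victim = random.choice((i, j))
            key, self.table[victim] = self.table[victim], key
        self._rehash(key)  # likely a cycle: rebuild with new hash functions

    def _rehash(self, pending_key):
        old = [k for k in self.table if k is not None] + [pending_key]
        self.table = [None] * self.size
        self._reseed()
        for k in old:
            self.insert(k)
```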

28–33 Cuckoo Hashing Examples
[Sequence of diagrams: elements A–E sit in their slots; inserting F and then G triggers chains of displacements, with each kicked element moving to its other candidate location until an empty slot is found.]

34 Cuckoo Hashing Failures
Bad case 1: the inserted element runs into a cycle.
Bad case 2: the inserted element has a very long path before insertion completes.
–It could be on a long cycle.
Bad cases occur with small probability when the load is sufficiently low.
Theoretical solution: re-hash everything if a failure occurs.
For 2 choices and load less than 50%, n elements give a failure rate of Θ(1/n); the maximum insert time is O(log n).
–Better space utilization and failure rates with more choices or more elements per bucket.

35 Recent Work: A CAM-Stash
Use a CAM (Content Addressable Memory) to stash away elements that would cause a failure.
–Joint work with Kirsch and Wieder.
Intuition: if failures were independent, the probability that s elements cause failures would go to Θ(1/n^s).
–Failures are not independent, but nearly so.
–A stash holding a constant number of elements greatly reduces the failure probability.
–Implemented as a CAM in hardware, or a cache line in hardware/software.
–Lookup also requires checking the stash.
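A hedged sketch of the stash idea, layered on the CuckooHashTable sketch above (my illustration; in hardware the stash would be a CAM rather than a Python list, and STASH_SIZE = 4 is an arbitrary constant):

```python
class StashedCuckooHashTable(CuckooHashTable):
    """Cuckoo hashing with a small constant-size stash: a failed
    insertion parks its displaced key in the stash instead of
    triggering an immediate full rehash."""
    STASH_SIZE = 4

    def __init__(self, size=1024):
        super().__init__(size)
        self.stash = []

    def lookup(self, key):
        # Lookup now also scans the (constant-size) stash.
        return super().lookup(key) or key in self.stash

    def _rehash(self, pending_key):
        if len(self.stash) < self.STASH_SIZE:
            self.stash.append(pending_key)  # absorb the failure
        else:
            super()._rehash(pending_key)    # stash full: fall back to rehash
```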

36 Modeling: Economic Principles
Joint work with Corbo, Jain, and Parkes.
Exploration: what models make sense for AS connectivity?
–Extends the approach of Chang, Jamin, Mao, and Willinger.
–Entering nodes link according to a business model and utility function.
–Nodes revise their links based on new entrants. Like the forest fire model.
Future considerations: how to validate such models.

37 Conclusion: My (Biased) View
There are 5 stages of networking research:
1) Observe: Gather data to demonstrate power law behavior in a system.
2) Interpret: Explain the import of this observation in the system context.
3) Model: Propose an underlying model for the observed behavior of the system.
4) Validate: Find data to validate (and if necessary specialize or modify) the model.
5) Control: Design ways to control and modify the underlying behavior of the system based on the model.
We need to focus on validation and control.
–Lots of open research problems.

38 A Chance for Collaboration
The observe/interpret stages of research are dominated by systems; modeling is dominated by theory.
–And we need new insights from statistics, control theory, and economics!
Validation and control require a strong theoretical foundation.
–We need universal ideas and methods that span different types of systems.
–We need an understanding of the underlying mathematical models.
But also a large systems buy-in.
–Getting/analyzing/understanding data.
–Finding avenues for real impact.
A good area for future systems/theory/other collaboration and interaction.

39 More About Me
Website:
–Links to papers
–Link to book
–Link to blog: mybiasedcoin (mybiasedcoin.blogspot.com)