Published by Trevor Wilkerson. Modified over 9 years ago.
1
Bruno Ribeiro CS69000-DM1 Topics in Data Mining
2
Announcement Reminder
Reviews of next week’s papers due Friday 5pm (submission closes Sunday 11:59pm)
◦ Assignment on Blackboard
Deadline to select projects
◦ Sept 29
3
Today
Murai, F., Ribeiro, B., Towsley, D., & Wang, P. (2013). On Set Size Distribution Estimation and the Characterization of Large Networks via Sampling. JSAC 2013.
Veitch, D., & Tune, P. (2015). Optimal Skampling for the Flow Size Distribution. IEEE Transactions on Information Theory 2015.
4
Waiting Time Paradox
Why is your bus often full?
5
Set Size Estimation Problem
Each element is sampled with probability p
Sets with a larger number of elements are more likely to be observed
How much more likely are we to see the green set than the blue set?
[Figure: observed sets]
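The size bias on this slide can be made concrete with a minimal sketch. It assumes that each element is sampled independently with probability p and that a set is observed when at least one of its elements is sampled. The function name and the two example sizes (standing in for the green and blue sets) are illustrative assumptions, not values from the slides.

```python
def prob_set_observed(size, p):
    """Probability that at least one of `size` elements is sampled,
    assuming independent sampling of each element with probability p."""
    return 1.0 - (1.0 - p) ** size


# Hypothetical sizes for a "large" and a "small" set.
p = 0.1
large_size, small_size = 10, 1
ratio = prob_set_observed(large_size, p) / prob_set_observed(small_size, p)
print(f"large set is {ratio:.2f}x more likely to be observed than small set")
```

For small p the ratio approaches the ratio of the set sizes, which is why the observed sets over-represent large sets.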
6
Set Size Distribution Estimation
[Diagram: original data → random sampling → observed data → estimation → set size distribution]
7
Example Application
Do we see c_0?
8
Problem Formulation (corrected)
9
Application 1: Estimate Latent Characteristics
Underlying “true network” (e.g. phone calls), observed during window [0, T]
If edges arrive independently at random…
Estimate original average degree
◦ Knowing the sampling probability p
10
Application 2: TCP flow size estimation
Estimate the original flow size distribution from counts of the number of sampled packets
[Diagram: TCP flow packets → packet sampling → per-flow outcome: no packet sampled (flow not sampled), 1 packet sampled, …, all packets sampled]
[Diagram: original data → random sampling → observed data → estimation → set size distribution]
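A minimal sketch of the forward model behind this slide, assuming packets are sampled independently with probability p (binomial thinning) and that flows with no sampled packet are never seen. The function names and the list layout of `theta` are assumptions made for illustration.

```python
from math import comb


def sampled_count_distribution(theta, p):
    """theta[i-1] is the fraction of flows with i packets (i = 1..W).
    Returns d with d[j] = P(exactly j packets of a flow are sampled), j = 0..W."""
    max_size = len(theta)
    d = [0.0] * (max_size + 1)
    for i, theta_i in enumerate(theta, start=1):
        for j in range(i + 1):
            # Binomial probability of sampling j out of i packets.
            d[j] += theta_i * comb(i, j) * p**j * (1 - p) ** (i - j)
    return d


def observed_distribution(theta, p):
    """Distribution of sampled counts among flows that were seen at all
    (at least one sampled packet). Requires p > 0."""
    d = sampled_count_distribution(theta, p)
    seen = 1.0 - d[0]
    return [d_j / seen for d_j in d[1:]]


# Hypothetical example: every flow has exactly 2 packets, p = 0.5.
print(sampled_count_distribution([0.0, 1.0], 0.5))
print(observed_distribution([0.0, 1.0], 0.5))
```

Estimation runs this model in reverse: given the observed counts, recover `theta`.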
11
Maximum Likelihood Estimation in practice…
[Plot: accuracy of proposed estimator, sampling rate = 1/100, without proto. info. vs. with proto. info.; axis label n]
12
What I will show
Fisher information
Data processing inequality
“Debug” measurement methods
Lessons:
Feature engineering by trial & error is tricky and expensive
Analyze last step
◦ is there enough information to proceed to estimate?
◦ does a better summary function exist?
◦ where is information lost?
13
Estimating characteristics from sampling
Data processing inequality: “No processing can increase the amount of statistical information already contained in the data”
[Diagram: Nature (characteristic) → sampling → raw samples → summary → sample summary → Estimator; the data processing inequality applies along the chain]
14
“Debugging” the sampling design
Fisher information
◦ Amount of information the observations carry about the unknown characteristic
Cramér-Rao inequality
◦ Connects the Fisher information with the minimum Mean Squared Error (MSE) achievable by any unbiased estimator
[Diagram: Nature (characteristic θ) → sampling → raw samples → summary → sample summary → best estimator → quality of estimates? good: done; poor: back to the drawing board]
15
[The finding] that the amount of information extracted in the process of estimation could never exceed the quantity supplied by the data, combined with the practical fact that directly available processes of computation would extract almost always a very large fraction of the total available [information], shifted the moral balance. The weight of [the statistician’s] responsibility was thrown back on to the process by which the data had come into existence. […] what types of observational programs would yield the most information for a given expenditure in time, money and labor.
R. A. Fisher, 1947
16
Problem Formulation
17
Fisher Information
[equation] where [equation] or in matrix form [equation]
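For reference, a sketch of the standard Fisher information definition for discrete observations, written in assumed notation rather than the slide's own: θ = (θ_1, …, θ_W) is the set size distribution and d_j(θ) is the probability that an observation takes value j. The linear special case d(θ) = Bθ and the symbols B and D are likewise assumptions for illustration.

```latex
% Element-wise definition (discrete observation with pmf d_j(\theta)):
J_{ik}(\theta)
  = \mathbb{E}\!\left[
      \frac{\partial \ln d_j(\theta)}{\partial \theta_i}\,
      \frac{\partial \ln d_j(\theta)}{\partial \theta_k}
    \right]
  = \sum_j \frac{1}{d_j(\theta)}\,
      \frac{\partial d_j(\theta)}{\partial \theta_i}\,
      \frac{\partial d_j(\theta)}{\partial \theta_k}

% If d(\theta) = B\theta is linear in \theta, then
% \partial d_j / \partial \theta_i = B_{ji}, and in matrix form:
J(\theta) = B^{\top} D^{-1} B,
\qquad D = \operatorname{diag}\big(d_1(\theta), d_2(\theta), \dots\big)
```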
18
Cramér-Rao Lower Bound
Suppose we obtain unbiased estimates from observations
Mean squared error (covariance matrix)
Cramér-Rao bound: inverse Fisher information
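A small worked example of the bound, under a deliberately tiny model that is an assumption of this sketch and not the lecture's formulation: a set has 1 element with probability θ and 2 elements otherwise, each element is sampled independently with probability p, and sets with no sampled element are unobserved. The function names are hypothetical. The scalar Fisher information is computed as Σ_j (dq_j/dθ)² / q_j over the observed outcomes j ∈ {1, 2}.

```python
def fisher_information_two_sizes(theta, p):
    """Fisher information about theta carried by one observed set.
    Valid for 0 < theta < 1 and 0 < p <= 1."""
    # P(j elements sampled | set size), for j = 1, 2.
    given_size1 = {1: p, 2: 0.0}
    given_size2 = {1: 2 * p * (1 - p), 2: p * p}
    # Unconditional probability of each outcome and its derivative in theta.
    prob = {j: theta * given_size1[j] + (1 - theta) * given_size2[j] for j in (1, 2)}
    dprob = {j: given_size1[j] - given_size2[j] for j in (1, 2)}
    seen = prob[1] + prob[2]      # P(set is observed)
    dseen = dprob[1] + dprob[2]
    info = 0.0
    for j in (1, 2):
        q = prob[j] / seen        # outcome probability given the set was observed
        dq = (dprob[j] * seen - prob[j] * dseen) / (seen * seen)  # quotient rule
        info += dq * dq / q
    return info


def cramer_rao_bound(theta, p, n_observed):
    """Lower bound on the variance of any unbiased estimator of theta
    built from n_observed independent observed sets."""
    return 1.0 / (n_observed * fisher_information_two_sizes(theta, p))


for p in (1.0, 0.5, 0.1):
    print(p, fisher_information_two_sizes(0.5, p), cramer_rao_bound(0.5, p, 1000))
```

With p = 1 nothing is lost and the information equals 1/(θ(1−θ)); lowering p shrinks the information and raises the bound.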
19
Cramér-Rao Lower Bound
But we must consider the parameter constraint
[Plots: CRLB without constraint vs. CRLB with constraint]
20
Fisher Information with Priors
Fisher information with priors: total FI = FI original + FI of prior
21
Different Sampling Designs
Seeing [the data] as a stream of elements
FS = Flow Sampling: Sample sets with probability q
SH = Randomly sample first element with probability q’ but collect all future elements of the same set
DS = Dual Sampling: Sample first element with high probability. Sample following elements with low probability and use “sequence numbers” to obtain elements lost “in the middle”
PS = Packet Sampling: Sample elements with probability p
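Three of these designs can be sketched over a stream of flow identifiers. This is a hedged illustration: the function names, the use of `collections.Counter`, and the passed-in `rng` are assumptions, and SH is implemented in its usual sample-and-hold reading (each element of a not-yet-held set starts the hold with probability q’), which is one interpretation of the slide's wording. DS is omitted because it needs per-element sequence numbers.

```python
import random
from collections import Counter


def flow_sampling(stream, q, rng):
    """FS: decide once per set (flow), with probability q, to keep all its elements."""
    keep = {}
    counts = Counter()
    for flow in stream:
        if flow not in keep:
            keep[flow] = rng.random() < q
        if keep[flow]:
            counts[flow] += 1
    return counts


def packet_sampling(stream, p, rng):
    """PS: keep each element independently with probability p."""
    counts = Counter()
    for flow in stream:
        if rng.random() < p:
            counts[flow] += 1
    return counts


def sample_and_hold(stream, q_prime, rng):
    """SH: once an element of a set is sampled, collect all its future elements."""
    counts = Counter()
    for flow in stream:
        if flow in counts or rng.random() < q_prime:
            counts[flow] += 1
    return counts


# Hypothetical stream: flow 0 has 5 elements, flow 1 has 3, flow 2 has 1.
rng = random.Random(0)
stream = [0] * 5 + [1] * 3 + [2]
rng.shuffle(stream)
print(flow_sampling(stream, 0.5, rng))
print(packet_sampling(stream, 0.5, rng))
print(sample_and_hold(stream, 0.5, rng))
```

FS returns exact sizes for the sets it keeps, PS returns thinned counts for every set, and SH loses only the elements that precede the first sampled one.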
22
Results: Different Sampling Designs (Veitch & Tune ’14)
FS = Flow sampling
SH = Sample and hold
DS = Dual sampling
PS = Packet sampling
23
Today
Murai, F., Ribeiro, B., Towsley, D., & Wang, P. (2013). On Set Size Distribution Estimation and the Characterization of Large Networks via Sampling. JSAC 2013.
Veitch, D., & Tune, P. (2015). Optimal Skampling for the Flow Size Distribution. IEEE Transactions on Information Theory 2015.
24
Part 1: Random Sampling vs. Data Streaming
25
What if we decided to bypass sampling?
Fisher information of the sample summary?
26
Sketching
Sketch phase (router): a universal hash function maps each packet to a counter; two flows on the same counter: collision!!
Estimation phase (powerful back-end server): counters summary → flow size distribution estimate
Prevent collisions: keep unique packet ID (flow sampling)
Disambiguate
27
Eviction Sketch
Why?
◦ Fisher information analysis shows collided counter ≃ 0 information
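A toy illustration of why collisions matter, not the Eviction Sketch algorithm itself: a counter array that also remembers which flow first used each counter and flags counters touched by a second flow, so that collided counters (which carry roughly no information) can be dropped before estimation. SHA-256 stands in for the universal hash function, and all names here are assumptions of this sketch.

```python
import hashlib


def bucket(flow_id, num_counters):
    """Map a flow identifier to a counter index (SHA-256 as a stand-in hash)."""
    digest = hashlib.sha256(str(flow_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_counters


def sketch_phase(stream, num_counters):
    """Count packets per counter, remembering the first flow seen in each
    counter and flagging counters later hit by a different flow."""
    counters = [0] * num_counters
    owner = [None] * num_counters
    collided = [False] * num_counters
    for flow in stream:
        b = bucket(flow, num_counters)
        if owner[b] is None:
            owner[b] = flow
        elif owner[b] != flow:
            collided[b] = True
        counters[b] += 1
    return counters, owner, collided


def clean_flow_sizes(counters, owner, collided):
    """Keep only counters used by exactly one flow; their values are exact flow sizes."""
    return {
        owner[b]: counters[b]
        for b in range(len(counters))
        if owner[b] is not None and not collided[b]
    }


# Hypothetical stream of flow identifiers.
stream = ["a", "b", "a", "c", "a", "b"]
counters, owner, collided = sketch_phase(stream, num_counters=64)
print(clean_flow_sizes(counters, owner, collided))
```

The surviving counters are exact sizes of a subset of flows, which makes the output resemble flow sampling.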
29
Set Size Estimation Errors in Practice
p = 0.25
(a) N = 10,000 and (b) N = 50,000 sampled sets
(c) N ∈ {5, 10, 20, 50, 100} × 10^3 sampled sets
30
Set Size Estimation Errors in Practice II
p = 0.90
(a) N = 10,000 and (b) N = 50,000 sampled sets
(c) N ∈ {5, 10, 20, 50, 100} × 10^3 sampled sets
31
Scaling on max set size: Phase transition of estimation errors
[…] – observable set sizes
W – size of largest set
T_i(S) – estimate of θ_i
32
Infinite support & power laws
If the set size distribution is a power law with infinite support (W → ∞)
◦ if p < ½, any unbiased estimator is inaccurate: might as well output random estimates
◦ if p > ½, estimates can be accurate if enough samples are collected
33
Next Class
How to collect data!!