
Slide 1: Basic Ideas in Probability and Statistics for Experimenters, Part I: Qualitative Discussion
Shivkumar Kalyanaraman, Rensselaer Polytechnic Institute
shivkuma@ecse.rpi.edu, http://www.ecse.rpi.edu/Homepages/shivkuma
Based in part upon slides of Prof. Raj Jain (OSU) and Bill Foley (RPI).
"He uses statistics as a drunken man uses lamp-posts: for support rather than for illumination." (A. Lang)

Slide 2: Overview
- Why probability and statistics: the empirical design method
- Qualitative understanding of essential probability and statistics, especially the notion of inference and statistical significance
- Key distributions and why we care about them
- Reference: Chapters 12 and 13 of Jain, and http://mathworld.wolfram.com/topics/ProbabilityandStatistics.html

Slide 3: Theory (Model) vs. Experiment
- A model is only as good as the NEW predictions it can make (subject to "statistical confidence").
- Physics: theory and experiment have frequently changed roles as leaders and followers, e.g., Newton's theory of gravitation, quantum mechanics, Einstein's relativity; Einstein's "thought experiments" vs. the real-world experiments that validate a theory's predictions.
- Medicine: FDA clinical trials on new drugs: do they work? Is the "cure worse than the disease"?
- Networking: How does OSPF or TCP or BGP behave/perform? Operator or designer: what will happen if I change ...?

Slide 4: Why Probability? The Big Picture
- Humans like determinism; the real world is unfortunately random!
- => We CANNOT place ANY confidence in a single measurement or simulation result that contains potential randomness.
- However, we can make deterministic statements about measures or functions of the underlying randomness, with bounds on the degree of confidence.
- Functions of randomness: the probability of a random event or variable; averages (mean, median, mode); distribution functions (pdf, cdf), joint pdfs/cdfs, conditional probability, confidence intervals, ...
- Goal: build "probabilistic" models of reality under the constraint of a minimum number of experiments, i.e., infer a model that extracts maximum information.
- Statistics: how to infer models about reality (the "population") given a SMALL set of experimental results (a "sample").

Slide 5: Why Care About Statistics?
How do we make this empirical design process EFFICIENT? How do we avoid pitfalls in inference? The loop: measure, simulate, experiment <-> model, hypothesize, predict.

Slide 6: Why Care? Real-World Measurements: Internet Growth Estimates
An exponential growth model fits: growth of about 80% per year, sustained for at least ten years, before the Web even existed. The Internet is always changing; you do not have a lot of time to understand it.

Slide 7: Probability
- Think of probability as modeling an experiment, e.g., tossing a coin.
- The set of all possible outcomes is the sample space S.
- Classic "experiment": tossing a die: S = {1, 2, 3, 4, 5, 6}.
- Any subset A of S is an event: A = {the outcome is even} = {2, 4, 6}.

Slide 8: Probability of Events: Axioms
P is a probability (mass) function if it maps each event A to a real number P(A) such that:
(i) 0 <= P(A) <= 1
(ii) P(S) = 1
(iii) if A and B are mutually exclusive events, then P(A ∪ B) = P(A) + P(B)
(A small code sketch checking these on the die experiment follows.)
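
As a quick sanity check (not from the slides; the event names below are illustrative), a short Python sketch that verifies the axioms for the fair-die experiment and estimates P(A) by simulation:

```python
import random

S = [1, 2, 3, 4, 5, 6]          # sample space of a fair die
A = {2, 4, 6}                   # event: outcome is even
B = {1, 3, 5}                   # event: outcome is odd (mutually exclusive with A)

def prob(event):
    """Exact probability of an event under the uniform (fair-die) model."""
    return len(event & set(S)) / len(S)

# Axiom checks: 0 <= P(A) <= 1, P(S) = 1, additivity for disjoint events
assert 0 <= prob(A) <= 1
assert prob(set(S)) == 1
assert prob(A | B) == prob(A) + prob(B)

# A single toss tells us nothing, but the relative frequency over many
# tosses converges to the deterministic value P(A) = 0.5.
n = 100_000
hits = sum(random.choice(S) in A for _ in range(n))
print(f"P(A) exact = {prob(A):.3f}, estimated from {n} tosses = {hits / n:.3f}")
```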

Slide 9: Probability of Events (continued)
In fact, for any sequence of pairwise mutually exclusive events A1, A2, ..., we have P(A1 ∪ A2 ∪ ...) = P(A1) + P(A2) + ... (countable additivity).

Slide 10: Other Properties Can Be Derived
Several identities follow from the axioms, for example: P(not A) = 1 - P(A); P(empty set) = 0; P(A ∪ B) = P(A) + P(B) - P(A ∩ B); and A ⊆ B implies P(A) <= P(B). They are derived by breaking the sets up into mutually exclusive pieces and comparing to the fundamental axioms!

Slide 11: Recall: Why Care About Probability?
- We can be deterministic about measures or functions of the underlying randomness.
- Functions of randomness: the probability or odds of a random event or variable; the "risk" associated with an event, probability times the value of the random variable, or loosely the "expectation" of the event ("upside" vs. "downside" risks); measures of dispersion and skew (a.k.a. higher-order "moments").
- Even though the experiment has a RANDOM OUTCOME (e.g., 1 through 6, or heads/tails) or EVENTS (subsets of all outcomes), the probability function has a DETERMINISTIC value.
- If you forget everything else in this class, do not forget this!

Slide 12: Conditional Probability
P(A|B) = P(A ∩ B) / P(B) = the (conditional) probability that the outcome is in A given that we know the outcome is in B. Example: toss one die. What is the value of knowing that B occurred? How does it reduce uncertainty about A? How does it change P(A)? (See the sketch below.)
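
A small illustrative sketch (the slide does not name the events, so A and B below are assumptions): conditioning on B shrinks the sample space to B.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}                     # assumed event: the outcome is even
B = {4, 5, 6}                     # assumed event: the outcome is greater than 3

def P(event):
    return Fraction(len(event), len(S))

P_A_given_B = P(A & B) / P(B)     # P(A|B) = P(A ∩ B) / P(B)

print("P(A)   =", P(A))           # 1/2
print("P(A|B) =", P_A_given_B)    # 2/3: knowing that B occurred raises the probability of A
```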

Slide 13: Independence
- Events A and B are independent if P(AB) = P(A)P(B).
- Equivalently, P(A|B) = P(A) and P(B|A) = P(B).
- Example: a card is selected at random from an ordinary deck. A = the event that the card is an ace; B = the event that the card is a diamond. Then P(AB) = 1/52 = (4/52)(13/52) = P(A)P(B), so A and B are independent.

Slide 14: Random Variable as a Measurement
We cannot always give an exact description of a sample space, but we can still describe specific measurements on it, e.g.: the temperature change produced; the number of photons emitted in one millisecond; the time of arrival of a packet.

Slide 15: Random Variable as a Measurement (continued)
Thus a random variable can be thought of as a measurement on an experiment.

Slide 16: Histogram: Plotting Frequencies
[Figure: a histogram whose bars touch, with class lower boundaries (15, 25, 35, 45, 55) on the x-axis; the vertical axis can show counts (frequency), relative frequency, or percent.]
Class          Frequency
15 but < 25    3
25 but < 35    5
35 but < 45    2

Slide 17: Probability Distribution Function (pdf)
A.k.a. the frequency histogram, or the p.m.f. (for a discrete r.v.).

Slide 18: Probability Mass Function for a Random Variable
- The probability mass function (PMF) for a (discrete-valued) random variable X is P_X(x) = P(X = x).
- Note that P_X(x) >= 0 for every x.
- Also, for a (discrete-valued) random variable X, the PMF sums to 1 over all x.

Slide 19: PMF and CDF: Example
[Figure: an example PMF and the corresponding CDF.]

Slide 20: Cumulative Distribution Function
- The cumulative distribution function (CDF) for a random variable X is F_X(x) = P(X <= x).
- Note that F_X(x) is non-decreasing in x, i.e., x1 <= x2 implies F_X(x1) <= F_X(x2).
- Also, F_X(x) -> 0 as x -> -∞ and F_X(x) -> 1 as x -> +∞. (A small sketch tabulating a PMF and CDF follows.)
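
A minimal sketch (not from the slides) that tabulates the PMF and CDF of one concrete discrete random variable, the number of heads in three fair coin tosses:

```python
from itertools import product
from fractions import Fraction

# Sample space: the 2^3 equally likely outcomes of three fair coin tosses
outcomes = list(product("HT", repeat=3))

# X = number of heads; build the PMF P(X = x)
pmf = {}
for w in outcomes:
    x = w.count("H")
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, len(outcomes))

# CDF F(x) = P(X <= x) is the running sum of the PMF
cdf, running = {}, Fraction(0)
for x in sorted(pmf):
    running += pmf[x]
    cdf[x] = running

print("x  P(X=x)  F(x)")
for x in sorted(pmf):
    print(x, pmf[x], cdf[x])      # the PMF sums to 1; the CDF is non-decreasing and ends at 1
```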

Slide 21: Probability Density Functions (pdf)
The pdf emphasizes the main body of the distribution: frequencies, the various modes (peaks), variability, and skew.

Slide 22: Cumulative Distribution Function (CDF)
The CDF emphasizes skew, makes the median and quartiles easy to identify, and is useful for converting uniform random variables into random variables with other distributions. [Figure: a CDF with the median marked at F(x) = 0.5.]

Slide 23: Complementary CDFs (CCDF)
Useful for focusing on the "tails" of distributions: a straight line on a log-log plot suggests a "heavy" tail.

Slide 24: CCDF Plot on a Log-Log Scale
Real example: the distribution of HTTP connection sizes is heavy-tailed. [Figure: CCDF of HTTP connection sizes, with log10(x) on the x-axis (0 to 8) and log10(1 - F(x)) on the y-axis (0 down to -6); the decay is roughly linear.] (A sketch of this kind of plot follows.)
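
The HTTP trace behind the plot is not part of the transcript, so the sketch below generates synthetic Pareto (heavy-tailed) samples instead and prints a few points of the empirical CCDF on log10 axes; for a heavy tail these points fall roughly on a straight line.

```python
import math
import random

# Synthetic heavy-tailed data: Pareto(alpha) via inverse-transform sampling
alpha, n = 1.2, 100_000
samples = sorted(1.0 / (1.0 - random.random()) ** (1.0 / alpha) for _ in range(n))

# Empirical CCDF P(X > x) sampled at a few quantiles, printed on log-log axes
print(" log10(x)   log10(P(X > x))")
for q in (0.5, 0.9, 0.99, 0.999):
    x = samples[int(q * n)]
    print(f"{math.log10(x):9.3f}  {math.log10(1.0 - q):9.3f}")
# For a Pareto tail, log10(P(X > x)) ≈ -alpha * log10(x) + const, i.e. a straight line.
```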

Slide 25: Recall: Why Care About Probability?
(Recap of Slide 4: we cannot place confidence in a single random measurement, but we can be deterministic about functions of the underlying randomness: probabilities, averages (mean, median, mode), distribution functions (pdf, cdf), joint and conditional distributions, and confidence intervals.)

Slide 26: Numerical Data Properties
Central tendency (location), variation (dispersion), and shape.

Slide 27: Numerical Data Properties & Measures
- Central tendency: mean, median, mode
- Variation: range, interquartile range, variance, standard deviation
- Shape: skew

Slide 28: Indices of Central Tendency (Location)
- Goal: summarize (compress) an entire distribution with one "representative" or "central" number, e.g., the mean, median, or mode.
- This lets you "discuss" the distribution with just one number.
- E.g., what was the "average" TCP goodput in your simulation? What is the median HTTP transfer size on the web (or in a data set)?

Slide 29: Expectation (Mean) = Center of Gravity
[Figure: a distribution balancing on its mean.]

Slide 30: Expectation of a Random Variable
The expectation (average) of a (discrete-valued) random variable X is E(X) = sum over x of x * P(X = x). Three-coins example: if X is the number of heads in three fair tosses, E(X) = 0(1/8) + 1(3/8) + 2(3/8) + 3(1/8) = 1.5. (See the sketch below.)
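
Continuing the three-coin example (X = number of heads in three fair tosses), a sketch computing E(X) directly from the PMF:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product("HT", repeat=3))          # 8 equally likely outcomes
pmf = {}
for w in outcomes:
    x = w.count("H")                              # X = number of heads
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, len(outcomes))

EX = sum(x * p for x, p in pmf.items())           # E(X) = sum over x of x * P(X = x)
print("E(X) =", EX)                               # 3/2
```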

Slide 31: Median, Mode
- Median = F^(-1)(0.5), where F is the CDF; a.k.a. the 50th-percentile element, i.e., order the values and pick the middle one. Used when the distribution is skewed; considered a "robust" measure.
- Mode: the most frequent or highest-probability value. Multiple modes are possible; the mode need not be the "central" element; a mode may not exist (e.g., for a uniform distribution). Used with categorical variables.

Slide 32: [figure only]

Slide 33: Summary of Central Tendency Measures
Measure    Equation                       Description
Mean       sum(X_i) / n                   Balance point
Median     value at position (n + 1)/2    Middle value when ordered
Mode       none                           Most frequent value

Slide 34: Games with Statistics!
Employees cite low pay: most workers earn only $20,000 [the mode]. The president claims the average pay is $70,000 [the mean]. Who is correct? BOTH! The issue is a skewed income distribution (salaries of $400,000, $70,000, $50,000, $30,000, $20,000).

Slide 35: [figure only]

Slide 36: Indices/Measures of Spread/Dispersion: Why Care?
You can drown in a river of average depth six inches! Lesson: the measure of uncertainty or dispersion may matter more than the index of central tendency.

Slide 37: Real Example: Dispersion Matters!
- What is the fairness between TCP goodputs when we use different queuing policies?
- What is the confidence interval around your estimate of the mean file size?

Slide 38: Standard Deviation, Coefficient of Variation, SIQR
- Variance: the second moment around the mean: σ² = E((X - μ)²)
- Standard deviation = σ
- Coefficient of variation (C.o.V.) = σ/μ
- SIQR = semi-interquartile range (used with the median, the 50th percentile) = (75th percentile - 25th percentile) / 2
(A short sketch computing these follows.)
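
A short standard-library sketch (the data values are made up for illustration) computing the dispersion measures just listed:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9, 30]          # illustrative sample with one large outlier

mu    = statistics.mean(data)
var   = statistics.pvariance(data, mu)        # variance: E((X - mu)^2)
sigma = statistics.pstdev(data, mu)           # standard deviation
cov   = sigma / mu                            # coefficient of variation sigma/mu

q1, _, q3 = statistics.quantiles(data, n=4)   # 25th, 50th, 75th percentiles
siqr = (q3 - q1) / 2                          # semi-interquartile range

print(f"mean={mu:.2f} variance={var:.2f} stdev={sigma:.2f} CoV={cov:.2f} SIQR={siqr:.2f}")
```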

Slide 39: Real Example: CoV
Goal: compare TCP rate control vs. RED in terms of fairness, measured with the CoV of the flows' goodputs; a lower CoV is better.

Slide 40: Covariance and Correlation: Measures of Dependence
- Covariance: Cov(X_i, X_j) = E[(X_i - μ_i)(X_j - μ_j)]. For i = j, the covariance equals the variance!
- Independence => covariance = 0 (but not vice versa!).
- The correlation (coefficient) is a normalized (scale-free) form of covariance: ρ = Cov(X_i, X_j) / (σ_i σ_j). It lies between -1 and +1; zero means no correlation (uncorrelated).
- Note: uncorrelated DOES NOT mean independent! (The sketch below shows an example.)
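
A sketch (not from the slides) that computes covariance and correlation for made-up data and then shows the classic "uncorrelated but dependent" pair Y = X^2:

```python
import math
import random

def cov(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def corr(xs, ys):
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))

# X symmetric around 0 and Y = X^2: clearly dependent, yet (nearly) uncorrelated
xs = [random.uniform(-1, 1) for _ in range(100_000)]
ys = [x * x for x in xs]

print("corr(X, X^2)  =", round(corr(xs, ys), 3))                        # close to 0
print("corr(X, 2X+1) =", round(corr(xs, [2 * x + 1 for x in xs]), 3))   # 1.0: perfect linear dependence
```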

Slide 41: Real Example: Randomized TCP vs. TCP Reno
Goal: to reduce synchronization between flows (which kills performance), we inject randomness into TCP. Metric: the covariance/correlation coefficient between flows (close to 0 is good).

Slide 42: Recall: Why Care About Probability?
(Recap of Slide 4, as above.)

Slide 43: Continuous-Valued Random Variables
- So far we have focused on discrete(-valued) random variables, e.g., X(s) must be an integer.
- Examples of discrete random variables: the number of arrivals in one second, the number of attempts until success.
- A continuous-valued random variable takes on a range of real values, e.g., X(s) ranges from 0 to ∞ as s varies.
- Examples of continuous(-valued) random variables: the time at which a particular arrival occurs, the time between consecutive arrivals.

Slide 44: Continuous-Valued Random Variables (continued)
For a continuous random variable X, we can define its probability density function (pdf) as f_X(x) = dF_X(x)/dx. Note that since F_X(x) is non-decreasing in x, we have f_X(x) >= 0 for all x.

Slide 45: Properties of Continuous Random Variables
- From the fundamental theorem of calculus, F_X(x) = the integral of f_X(u) du from -∞ to x.
- In particular, the integral of f_X(x) over the whole real line is 1.
- More generally, P(a <= X <= b) = the integral of f_X(x) dx from a to b.

Slide 46: Expectation of a Continuous Random Variable
The expectation (average) of a continuous random variable X is E(X) = the integral of x f_X(x) dx from -∞ to ∞. Note that this is just the continuous equivalent of the discrete expectation E(X) = sum over x of x P(X = x).

Slide 47: Important (Discrete) Random Variable: Bernoulli
The simplest possible measurement on an experiment: success (X = 1) or failure (X = 0). Usual notation: P(X = 1) = p, P(X = 0) = 1 - p. E(X) = p.

Slide 48: Important (Discrete) Random Variable: Binomial
Let X = the number of successes in n independent Bernoulli experiments (or trials). Then P(X = 0) = (1 - p)^n, P(X = 1) = n p (1 - p)^(n - 1), P(X = 2) = C(n, 2) p^2 (1 - p)^(n - 2), and in general P(X = x) = C(n, x) p^x (1 - p)^(n - x). Binomial variables are useful for proportions (of successes/failures) over a small number of repeated experiments. For a larger number n, under certain conditions (p small), the Poisson distribution is used instead. (A sketch of the PMF follows.)
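
A minimal sketch of the binomial PMF just stated (the values of n and p are illustrative):

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p): C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 5, 0.1                              # illustrative values
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]
print([round(v, 4) for v in pmf])          # skewed toward 0 for small p
print("sum  =", round(sum(pmf), 10))       # the PMF sums to 1
print("mean =", round(sum(x * v for x, v in enumerate(pmf)), 6), "= n*p =", n * p)
```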

Slide 49: Binomial Distribution Characteristics
Mean: μ = np. Standard deviation: σ = sqrt(np(1 - p)). [Figure: binomial PMFs for n = 5, p = 0.1 (skewed) and n = 5, p = 0.5 (symmetric).]

Slide 50: Binomial Can Be Skewed or Normal
It depends upon p and n!

Slide 51: Example: Utility of Adaptive MSS in Lossy Networks
[Figures: (a) large MSS (small G); (b) small MSS (large G).]

Slide 52: How Much Granulation (G)?
Key factor: P(all units lost). When all the units in a block are lost, the HARQ will fail (leading to timeouts, etc.). This probability is non-trivial for N = 5; N >= 10 is good enough. Example: 3.175% of blocks are irrecoverably lost, i.e., all units lost and no feedback (e.g., a timeout). σ ~ O(sqrt(N)).

Slide 53: Binomials for Different Loss Rates, N = 20
10% PER, 30% PER, and 50% PER, with N = 20 in all cases, giving Npq = 1.8, 4.2, and 5 respectively. As Npq >> 1, the binomial is better approximated by a normal distribution, especially near the mean: symmetric, a sharp peak at the mean, and exponential-square (e^(-x^2)) decay of the tails (the pmf is concentrated near the mean).

Slide 54: Important Random Variable: Poisson
A Poisson random variable X is defined by its PMF (the limit of the binomial): P(X = x) = e^(-λ) λ^x / x!, for x = 0, 1, 2, ..., where λ > 0 is a constant. Exercise: show that the PMF sums to 1 and that E(X) = λ. Poisson random variables are good for counting frequency of occurrence: e.g., the number of customers that arrive at a bank in one hour, or the number of packets that arrive at a router in one second. (A sketch comparing it with the binomial follows.)
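
A sketch (parameter values are illustrative) checking the "limit of the binomial" remark: for large n and small p with λ = np held fixed, the binomial PMF approaches the Poisson PMF.

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)    # e^(-lambda) * lambda^x / x!

lam = 2.0
for n in (10, 100, 1000):                       # p shrinks as lambda / n
    p = lam / n
    diff = max(abs(binom_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(10))
    print(f"n={n:5d}  max |binomial - poisson| over x < 10: {diff:.5f}")
```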

Slide 55: Continuous Probability Density Function
1. A mathematical formula.
2. Shows all values x and their "frequencies" f(x); f(x) is NOT itself a probability.
3. Properties (area under the curve): the integral of f(x) over all x equals 1, and f(x) >= 0 for a <= x <= b.
[Figure: a density curve f(x) plotted over an interval from a to b.]

Slide 56: Important Continuous Random Variable: Exponential
- Used to represent time, e.g., the time until the next arrival.
- Has pdf f_X(x) = λ e^(-λx) for x >= 0, for some λ > 0.
- Properties: F_X(x) = 1 - e^(-λx) and E(X) = 1/λ (deriving the expectation needs integration by parts!).

Slide 57: Memoryless Property of the Exponential
An exponential random variable X has the property that "the future is independent of the past": the fact that it hasn't happened yet tells us nothing about how much longer it will take. In math terms, P(X > s + t | X > s) = P(X > t) for all s, t >= 0. (A simulation sketch follows.)
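
A simulation sketch (λ, s, and t are chosen arbitrarily) of the memoryless property just stated:

```python
import random

lam, s, t, n = 0.5, 2.0, 3.0, 200_000
samples = [random.expovariate(lam) for _ in range(n)]

p_gt_t = sum(x > t for x in samples) / n
survivors = [x for x in samples if x > s]
p_cond = sum(x > s + t for x in survivors) / len(survivors)

# Both estimates should be close to e^(-lam * t) ≈ 0.223
print(f"P(X > t)           = {p_gt_t:.3f}")
print(f"P(X > s+t | X > s) = {p_cond:.3f}")
```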

Slide 58: Recall: Why Care About Probability?
(Recap of Slide 4, as above.)

Slide 59: Important Random Variables: Normal
[Figure: the normal (Gaussian) bell curve.]

Slide 60: Normal Distribution: PDF & CDF
- PDF: f_X(x) = (1 / (σ sqrt(2π))) e^(-(x - μ)^2 / (2σ^2))
- With the transformation z = (x - μ)/σ (a.k.a. the unit normal deviate), the z-normal PDF is f_Z(z) = (1 / sqrt(2π)) e^(-z^2 / 2).
- Doubly exponential and symmetric; any normal can be "standardized" to have μ = 0, σ = 1. This simplifies things!

Slide 61: Nice Things About the Normal Distribution
1. Bell-shaped and symmetrical.
2. Mean, median, and mode are equal.
3. The "middle spread" (interquartile range) is 1.33σ.
4. The random variable is continuous and has infinite range.

Slide 62: Rapidly Dropping Tail Probability!
Why? The doubly exponential PDF (the e^(-z^2) term). A.k.a. "light-tailed" (not heavy-tailed). No skew or heavy tail means we don't have to worry about parameters beyond second order (mean, variance).

Slide 63: The Height & Spread of a Gaussian Can Vary!
[Figure: normal curves with different means and standard deviations.]

Slide 64: But It Can Be Specified with Just Two Parameters (μ and σ): N(μ, σ)
Parsimonious: only two parameters to estimate!

Slide 65: Normal Distribution Probability Tables
Probability is the area under the curve!

Slide 66: Why Standardize? Easy Table Lookup
Normal distributions differ by mean and standard deviation; each distribution would require its own table, and that's an infinite number of tables!

Slide 67: Standardize the Normal Distribution
One table! [Figure: a normal distribution mapped onto the standardized normal distribution via z = (x - μ)/σ.]

Slide 68: Obtaining the Probability
[Figure/table: a portion of the standardized normal probability table; e.g., the row for 0.1 and the column for .02 (z = 0.12) give a probability of .0478. Shaded area exaggerated.]

Slide 69: Example: P(3.8 <= X <= 5)
Standardizing maps this to an area of .0478 under the standardized normal distribution. (Shaded area exaggerated.)

Slide 70: Example: P(2.9 <= X <= 7.1)
Standardizing gives .0832 + .0832 = .1664. (Shaded area exaggerated.)

Slide 71: Example: P(X >= 8)
Standardizing gives .5000 - .1179 = .3821. (Shaded area exaggerated. A sketch reproducing these three examples follows.)
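
The slides give only the table values; the sketch below reproduces them, assuming (as the worked numbers imply) a normal distribution with mean μ = 5 and standard deviation σ = 10, and using the standard-normal CDF Φ built from math.erf.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, sigma = 5.0, 10.0             # assumed from the worked numbers on the slides
z = lambda x: (x - mu) / sigma    # standardization: z = (x - mu) / sigma

print(round(phi(z(5)) - phi(z(3.8)), 4))    # P(3.8 <= X <= 5)   -> 0.0478
print(round(phi(z(7.1)) - phi(z(2.9)), 4))  # P(2.9 <= X <= 7.1) -> 0.1663 (table: .0832 + .0832 = .1664)
print(round(1 - phi(z(8)), 4))              # P(X >= 8)          -> 0.3821
```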

Slide 72: Finding Z Values for Known Probabilities
What is z given P(0 <= Z <= z) = .1217? From the table portion shown (row 0.3, column .01), z = 0.31. (Shaded area exaggerated.)

Slide 73: Finding X Values for Known Probabilities
Convert the z value back to the original scale: x = μ + zσ. (Shaded areas exaggerated.)

Slide 74: Why Care? Answer: Statistics
- Sampling: take a sample. How do we sample? E.g., infer a country's GDP by sampling a small fraction of the data; it is important to sample right. Randomness gives you power in sampling.
- Inferring properties (e.g., means, variances) of the population (a.k.a. "parameters") from the sample properties (a.k.a. "statistics").
- These inferences depend upon theorems like the central limit theorem and upon properties of the normal distribution.
- The distribution of the sample mean (or sample standard deviation, etc.) is called a "sampling distribution".
- Powerful idea: the sampling distribution of the sample mean is a normal distribution (under mild restrictions): the Central Limit Theorem (CLT)!

Slide 75: Estimation Process
The population mean μ is unknown. Take a random sample; the sample mean is x̄ = 50; conclude, e.g., "I am 95% confident that μ is between 40 and 60."

Slide 76: Inferential Statistics
1. Involves estimation and hypothesis testing.
2. Purpose: make decisions about population characteristics.

Slide 77: Inference Process
Population -> random sample -> sample statistic (x̄) -> estimates and tests about the population.

Slide 78: Key Terms
1. Population (universe): all items of interest.
2. Sample: a portion of the population.
3. Parameter: a summary measure about the population.
4. Statistic: a summary measure about the sample.

Slide 79: Normal Distribution: Why?
A uniform distribution looks nothing like a bell-shaped (Gaussian) curve, and it has a large spread (σ)! Yet the sum of r.v.s drawn from a uniform distribution looks remarkably normal after very few samples. BONUS: the resulting sample mean has decreasing σ! This is the central limit tendency.

Slide 80: Gaussian/Normal Distribution: Why?
Goal: estimate the mean of a uniform distribution (distributed between 0 and 6). Sample mean = (sum of samples)/N. Since the sum is distributed (approximately) normally, so is the sample mean (i.e., the sampling distribution is NORMAL), with decreasing dispersion! (A simulation sketch follows.)
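
A simulation sketch of the estimate described above (sample sizes chosen for illustration): sample means of Uniform(0, 6) draws cluster around the true mean 3, with spread shrinking like σ/√n.

```python
import random
import statistics

def sample_mean(n):
    return sum(random.uniform(0, 6) for _ in range(n)) / n

true_sigma = 6 / 12 ** 0.5                       # stdev of Uniform(0, 6) ≈ 1.732
for n in (1, 4, 16, 64):
    means = [sample_mean(n) for _ in range(5000)]
    print(f"n={n:3d}  mean of sample means={statistics.mean(means):.3f}  "
          f"stdev={statistics.stdev(means):.3f}  sigma/sqrt(n)={true_sigma / n ** 0.5:.3f}")
```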

Slide 81: Numerical Example: Population Characteristics
[Figure: a small population distribution and its summary measures.]

Slide 82: Summary Measures of All Sample Means
[Figure: the distribution of all possible sample means and its summary measures.]

Slide 83: Comparison
[Figure: the population distribution vs. the sampling distribution of the mean.]

Slide 84: Recall: Rapidly Dropping Tail Probability!
The sample mean is a Gaussian r.v. with mean x̄ = μ and standard deviation s = σ/√n. With a larger number of samples, the average of the sample means is an excellent estimate of the true mean. If the (original) σ is known, invalid mean estimates can be rejected with HIGH confidence!

Slide 85: Many Experiments: Coin Tosses
[Figure: sample mean = total heads / number of tosses, plotted for 0 to 125 tosses; the estimate converges to 0.5.] The spread of the estimate shrinks by a factor of √n over the trials. Corollary: more experiments (larger n) is good, but not great. Why? Because √n doesn't grow fast.

Slide 86: Standard Error of the Mean
1. The standard deviation of all possible sample means, σ_x̄, measures the scatter in the sample means.
2. It is less than the population standard deviation.
3. Formula (sampling with replacement): σ_x̄ = σ/√n.

Slide 87: Properties of the Sampling Distribution of the Mean
1. Unbiasedness: the mean of the sampling distribution equals the population mean.
2. Efficiency: the sample mean comes closer to the population mean than any other unbiased estimator.
3. Consistency: as the sample size increases, the variation of the sample mean around the population mean decreases.

Slide 88: Unbiasedness
[Figure: an unbiased vs. a biased sampling distribution around μ.]

Slide 89: Efficiency: Sample Mean vs. Sample Median
[Figure: the sampling distribution of the median is wider than the sampling distribution of the mean around μ.]

Slide 90: Consistency: Sample Size
[Figure: a larger sample size yields a narrower sampling distribution around μ than a smaller sample size.]

Slide 91: Sampling from Non-Normal Populations
Central tendency: μ_x̄ = μ. Dispersion (sampling with replacement): σ_x̄ = σ/√n. [Figure: a non-normal population distribution and the resulting sampling distributions; with σ = 10, n = 4 gives σ_x̄ = 5 and n = 30 gives σ_x̄ = 1.8.]

Slide 92: Central Limit Theorem (CLT)
As the sample size gets large enough (n >= 30)...

Slide 93: Central Limit Theorem (CLT)
As the sample size gets large enough (n >= 30)... the sampling distribution becomes almost normal.

Slide 94: Aside: A Caveat About the CLT
- The central limit theorem works if the original distributions are not heavy-tailed, so that the moments converge to limits; there is trouble with aggregates of "heavy-tailed" distribution samples.
- The rate of convergence to the normal also varies with the distributional skew and with dependence among the samples.
- There is a non-classical version of the CLT for some (heavy-tailed) cases: the sum converges to stable Levy noise (heavy-tailed, with long-range-dependent autocorrelations).

Slide 95: Other Interesting Points Regarding the Gaussian
- Uncorrelated r.v.s + Gaussian => INDEPENDENT! This is important in random processes (i.e., sequences of random variables).
- Random variables that are independent and have exactly the same distribution are called IID (independent and identically distributed).
- IID and normal with zero mean and variance σ² is written IIDN(0, σ²).

Slide 96: Recall: Why Care About Probability?
(Recap of Slide 4, as above: build "probabilistic" models of reality from a minimum number of experiments; statistics is how we infer models about reality, the "population", given a SMALL set of experimental results, the "sample".)

Slide 97: Point Estimation ...
1. Provides a single value, based on observations from one sample.
2. Gives no information about how close that value is to the unknown population parameter.
3. Example: the sample mean x̄ = 3 is a point estimate of the unknown population mean.

Slide 98: ... vs. Interval Estimation
1. Provides a range of values, based on observations from one sample.
2. Gives information about closeness to the unknown population parameter, stated in terms of probability (knowing the exact closeness would require knowing the unknown parameter itself).
3. Example: the unknown population mean lies between 50 and 70 with 95% confidence.

Slide 99: Key Elements of Interval Estimation
A confidence interval extends from a lower confidence limit to an upper confidence limit around the sample statistic (the point estimate), together with a probability that the population parameter falls somewhere within the interval.

Slide 100: Confidence Interval
- The probability that a measurement will fall within a closed interval [a, b] (the MathWorld definition) = (1 - α).
- Jain: the interval [a, b] is the "confidence interval"; the probability level 100(1 - α) is the "confidence level"; α is the "significance level".
- The sampling distribution for means leads to high confidence levels, i.e., small confidence intervals. (A sketch follows.)
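
A sketch (the data are made up) of the normal-approximation confidence interval for a mean, x̄ ± z·s/√n, which the next few slides unpack:

```python
import statistics

data = [48, 52, 55, 47, 51, 49, 53, 50, 54, 46]   # illustrative sample
n = len(data)
xbar = statistics.mean(data)
s = statistics.stdev(data)                        # sample standard deviation

z = 1.96                                          # ~95% confidence level (standard normal)
half_width = z * s / n ** 0.5
print(f"95% CI for the mean: {xbar - half_width:.2f} .. {xbar + half_width:.2f}")
```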

Slide 101: Confidence Limits for the Population Mean
Parameter = statistic ± error: μ is estimated by x̄ ± zσ_x̄.

Slide 102: Many Samples Have the Same Interval
x̄ = μ ± zσ_x̄. [Figure: the sampling distribution of x̄; 90% of samples fall within μ ± 1.65σ_x̄ and 95% within μ ± 1.96σ_x̄.]

Slide 103: Many Samples Have the Same Interval (continued)
[Figure: as above, adding the 99% band at μ ± 2.58σ_x̄ to the 90% and 95% bands.]

Slide 104: Confidence Level
1. The probability that the unknown population parameter falls within the interval.
2. Denoted (1 - α), where α is the probability that the parameter is NOT within the interval.
3. Typical values are 99%, 95%, and 90%.

Slide 105: Intervals & Confidence Level
[Figure: the sampling distribution of the mean and a large number of intervals, each extending from x̄ - zσ_x̄ to x̄ + zσ_x̄; (1 - α)% of the intervals contain μ, and α% do not.]

Slide 106: Factors Affecting Interval Width
1. Data dispersion, measured by σ.
2. Sample size: σ_x̄ = σ/√n.
3. Level of confidence (1 - α), which affects z.
Intervals extend from x̄ - zσ_x̄ to x̄ + zσ_x̄.

Slide 107: Meaning of Confidence Interval
[Figure.]

Slide 108: Intervals & Confidence Level
(Repeat of Slide 105.)

Slide 109: Statistical Inference: Is μ_A = μ_B?
Note: the sample mean ȳ_A is not μ_A, only its estimate! Is the observed difference statistically significant? Is the null hypothesis μ_A = μ_B false?

Slide 110: Step 1: Plot the Samples
[Figure: dot plots of the two samples.]

Slide 111: Compare to an (External) Reference Distribution (If Available)
Since the observed difference of 1.30 is not far out in the tail of the reference distribution, the difference between the means is NOT statistically significant.

Slide 112: Random Sampling Assumption!
Under the random sampling assumption, and the null hypothesis μ_A = μ_B, we can view the 20 samples as coming from a common population and construct a reference distribution from the samples themselves!

Slide 113: t-Distribution: Create a Reference Distribution from the Samples Themselves!

Slide 114: t-Distribution
[Figure.]

Slide 115: Student's t Distribution
Bell-shaped and symmetric, with "fatter" tails than the standard normal Z. [Figure: the standard normal, t with df = 13, and the wider t with df = 5.]

Slide 116: Student's t Table
The table gives t values with an upper-tail area of α/2. Assume n = 3, so df = n - 1 = 2, α = .10, and α/2 = .05.

Slide 117: Student's t Table (continued)
Locate the column for an upper-tail area of .05.

Slide 118: Student's t Table (continued)
For df = 2 and an upper-tail area of .05, t = 2.920. (A sketch confirming this follows.)
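
A sketch confirming the table lookup (this assumes SciPy is available; t.ppf returns the quantile of Student's t):

```python
from scipy import stats

df = 2                                     # n = 3 observations -> df = n - 1 = 2
alpha = 0.10
t_crit = stats.t.ppf(1 - alpha / 2, df)    # upper-tail area alpha/2 = 0.05
print(round(t_crit, 3))                    # 2.92, matching the table entry
```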

Slide 119: Degrees of Freedom (df)
1. The number of observations that are free to vary after the sample statistic has been calculated.
2. Example: the sum of 3 numbers is 6: X1 = 1 (or any number), X2 = 2 (or any number), X3 = 3 (cannot vary). Degrees of freedom = n - 1 = 3 - 1 = 2.

Slide 120: Statistical Significance with Various Inference Techniques
- The normal-population assumption is not required.
- The random-sampling assumption IS required.
- The standard deviation is estimated from the samples themselves!
- The t-distribution is an approximation for the Gaussian reference distribution!

Slide 121: Normal, χ², and t-Distributions: Useful for Statistical Inference
Note: not all sampling distributions are normal.

Slide 122: Relationship Between Confidence Intervals and Comparisons of Means

Slide 123: Application: Internet Measurement and Modeling: A Snapshot

Slide 124: Internet Measurement/Modeling: Why Is It Hard? There Is No Such Thing as "Typical"
- Heterogeneity in:
  - the traffic mix
  - the range of network capabilities: bottleneck bandwidth (orders of magnitude), round-trip time (orders of magnitude)
  - the dynamic range of network conditions: congestion / degree of multiplexing / available bandwidth, and the proportion of traffic that is adaptive/rigid/attack
- Immense size and growth: rare events will occur; new applications explode onto the scene.

Slide 125: There Is No Such Thing as "Typical", continued
- New applications explode onto the scene: not just the Web, but Mbone, Napster, KaZaA, IM, etc.
- Even robust statistics fail. E.g., the median size of an FTP data transfer at LBL:
  Oct. 1992: 4.5 KB (60,000 samples); Mar. 1993: 2.1 KB; Mar. 1998: 10.9 KB; Dec. 1998: 5.6 KB; Dec. 1999: 10.9 KB; Jun. 2000: 62 KB; Nov. 2000: 10 KB.
- Danger: if you wrongly assume that something is "typical", nothing tells you that you are wrong!

Slide 126: The Search for Invariants
- In the face of such diversity, identifying things that don't change (a.k.a. "invariants") has immense utility.
- Some Internet traffic invariants: daily and weekly patterns; self-similarity on time scales of 100s of msec and above; heavy tails, both in activity periods and elsewhere (e.g., topology); Poisson user-session arrivals; log-normal sizes (excluding the tails); keystrokes have a Pareto distribution.

Slide 127: Web Traffic
... is streamed onto the Internet ... creating "bursty-looking" link traffic over time. X = heavy-tailed HTTP requests/responses; Y = "colored" noise {self-similar} (TCP-type transport).

Slide 128: [figure only]

Slide 129: So, Why Does This Burstiness Matter? Case: the Failure of Poisson Modeling
- A long-established framework: Poisson modeling. Central idea: network events (packet arrivals, connection arrivals) are well modeled as independent.
- In its simplest form there is just a rate parameter λ; it then follows that the time between "calls" (events) is exponentially distributed, and the number of calls is ~ Poisson.
- Implications or properties (if the assumptions are correct): aggregated traffic will smooth out quickly; correlations are fleeting and bursts are limited.

Slide 130: Burstiness: Theory vs. Measurement
For Internet traffic, Poisson models have a fundamental problem: they greatly underestimate burstiness. Consider an arrival process where X_k gives the number of packets arriving during the k-th interval of length T. Take a 1-hour trace of Internet traffic (1995) and generate (batch) Poisson arrivals with the same mean and variance, i.e., "fit" a (batch) Poisson process to the trace... (A small simulation of the Poisson smoothing property follows.)
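
A simulation sketch (the parameters are illustrative) of the "aggregated traffic smooths out quickly" property claimed for Poisson arrivals: as the aggregation interval T grows, the relative variability of the per-interval counts shrinks, which is exactly what measured Internet traces fail to do.

```python
import random
import statistics

rate, horizon = 100.0, 600.0                  # arrivals/sec and a 10-minute synthetic "trace"

# Generate Poisson arrival times via exponential inter-arrival gaps
arrivals, t = [], 0.0
while t < horizon:
    t += random.expovariate(rate)
    arrivals.append(t)

for T in (0.01, 0.1, 1.0, 10.0):              # aggregation interval lengths (seconds)
    bins = int(horizon / T)
    counts = [0] * bins
    for a in arrivals:
        idx = int(a / T)
        if idx < bins:
            counts[idx] += 1
    m, s = statistics.mean(counts), statistics.stdev(counts)
    print(f"T={T:6.2f}s  mean={m:8.1f}  stdev/mean={s / m:.4f}")   # ratio falls like 1/sqrt(rate*T)
```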

Slide 131: [figure only]

Slide 132: [Figure: the trace at aggregation scale 10, with the previous plot's region marked.]

Slide 133: [Figure: the trace at aggregation scale 100.]

Slide 134: [Figure: the trace at aggregation scale 600.]

Slide 135: [figure only]

Slide 136: Amen!

