Probability distribution functions


Probability distribution functions
- Normal distribution
- Lognormal distribution
- Mean, median and mode
- Tails
- Extreme value distributions

Normal (Gaussian) distribution

Probability density function (PDF):
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
What does the figure tell us about the cumulative distribution function (CDF)?

The most commonly used distribution is the normal distribution. This may reflect the fact that if the response of an engineering system is random because of a large number of random properties, none of them dominating, its distribution is likely to be close to normal. The slide shows the equation for the probability density function (PDF), known as the bell-shaped distribution, with a figure taken from Wikipedia. In the figure the mean is denoted by $\bar{x}$, while we use the notation μ; the standard deviation is denoted by σ in both the equation and the figure. The area under the PDF over a given region is equal to the probability of the random variable being in that region, so the total area is 1. The figure thus shows that X has about a 68% chance of being within one standard deviation of the mean; that is, the area of the central region under the curve is about 0.68.

The cumulative distribution function is the probability of X being smaller than a given value x, so it is the integral of the PDF from minus infinity to x. Since the distribution is symmetric, the CDF at the mean is 0.5, and the CDF at the mean plus one standard deviation is about 0.5 + 0.68/2 = 0.84. A more accurate answer from Matlab is normcdf(1) = 0.8413. The well-known six-sigma standard in industry is actually (for historical reasons) a 4.5-sigma standard: 1 - normcdf(4.5) = 3.4e-6, or 3.4 defects per million. The general form is normcdf(X,mu,sigma), where the three arguments can be vectors or matrices of the same size.
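
These numbers are easy to reproduce; the following minimal Matlab sketch (assuming the Statistics Toolbox, which provides normcdf) checks each value quoted above:

normcdf(1) - normcdf(-1)  % probability within one sigma of the mean, about 0.6827
normcdf(0)                % CDF at the mean, 0.5
normcdf(1)                % CDF at mean + 1 sigma, 0.8413
1 - normcdf(4.5)          % "six sigma" defect rate, about 3.4e-6
normcdf(10.5, 10, 0.5)    % general form: same as normcdf(1) for N(10, 0.5^2)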

More on the normal distribution

The normal distribution is denoted N(μ, σ²), with the square giving the variance. If X is normal, Y = aX + b is also normal. What would be the mean and standard deviation of Y? Similarly, if X and Y are normal variables, any linear combination aX + bY is also normal. We can often treat a function of a normal random variable as approximately normal by using a linear Taylor expansion. Example: X = N(10, 0.5²) and Y = X². Then $X^2 \approx 100 + 20(X - 10)$, so Y ≈ N(100, 10²).

The notation for a normal distribution is N(μ, σ²), with the square of the standard deviation being the variance. The fact that we specify the variance rather than the standard deviation may be related to the fact that it is easier to estimate the former than the latter, as we will see on the next slide. One of the attractions of the normal distribution is that a linear function of a normal variable is normal, as can be checked from the PDF. For any random variable, adding a constant changes the mean without changing the standard deviation; so if X has mean μ, X + b will have mean μ + b. Similarly, if we multiply a random variable by a constant a, both the mean and the standard deviation are multiplied by that constant. Just as useful is that any linear combination of normal variables is a normal variable. This extends to any function of normal variables if the randomness induced in the function is small enough that a linear Taylor series of the function is a good approximation. For example, if X = N(10, 0.5²) and Y = X², we can use the Taylor series expansion $X^2 \approx 100 + 20(X - 10)$ to approximate X² as N(100, 10²). In fact, Y has a mean of about 100.25 and a standard deviation of about 10.006.
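
The quality of the Taylor approximation can be checked by Monte Carlo sampling; a minimal sketch (not from the slides; the sample size is an arbitrary choice):

% Check the linear approximation Y = X^2 ~ N(100, 10^2) for X ~ N(10, 0.5^2)
x = 10 + 0.5*randn(1e6, 1);  % sample X
y = x.^2;                    % exact Y
mean(y)                      % about 100.25, close to the predicted 100
std(y)                       % about 10.006, close to the predicted 10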

Estimating mean and standard deviation

Given a sample from a normally distributed variable, the sample mean is the best linear unbiased estimator (BLUE) of the true mean. For the variance, the equation below gives the best unbiased estimator, but its square root is not an unbiased estimate of the standard deviation. For example, for a sample of 5 from a standard normal distribution, the standard deviation will be estimated on average as 0.94 (with a standard deviation of 0.34).

Given a sample from a random variable, the mean of the sample,
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$,
is the best linear unbiased estimator (BLUE) of the true mean. For a normal variable the standard estimate of the variance,
$\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$,
is also BLUE. Note that the estimator has n - 1 instead of n in the denominator; this is because the mean is itself estimated from the sample. If the mean is known (e.g., if we estimate from a sample the standard deviation of $x^9 + \sin x$, where x is N(0,1), whose mean is zero by symmetry), we do use n rather than n - 1. However, taking the square root does not provide an unbiased estimate of the standard deviation. The following Matlab sequence

x = randn(5, 9000000);
s = std(x);
s2 = s.^2;
mean(s)   % 0.9400
mean(s2)  % 0.9999
std(s)    % 0.3411

shows that for a sample of 5 numbers from the standard normal distribution, the estimate of the standard deviation averages only 0.94, with a substantial standard deviation of 0.34. Look in Wikipedia under "unbiased estimation of standard deviation" for more accurate formulas.
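
The Wikipedia article cited above gives the exact bias factor for the sample standard deviation of a normal variable, $c_4(n) = \sqrt{2/(n-1)}\,\Gamma(n/2)/\Gamma((n-1)/2)$; a one-line Matlab check reproduces the 0.94 observed in the simulation above:

n = 5;
c4 = sqrt(2/(n-1)) * gamma(n/2) / gamma((n-1)/2)
% c4 = 0.9400, so s/c4 is an unbiased estimate of sigma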

Lognormal distribution

If ln(X) has a normal distribution, X has a lognormal distribution. That is, if X is normally distributed, exp(X) is lognormally distributed. Notation: lnN(μ, σ²).

PDF:
$f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$

Mean and variance:
$\mu_X = \exp\left(\mu + \frac{\sigma^2}{2}\right), \qquad \sigma_X^2 = \mathrm{Var}(X) = \left(e^{\sigma^2} - 1\right) e^{2\mu + \sigma^2}$

The normal distribution is not appropriate for variables that have to be positive, like density or length. The lognormal distribution is one of the popular distributions for such random variables. It is defined such that ln(X) is normally distributed, and is therefore often denoted lnN(μ, σ²). The figure in the slide is taken from a Matlab publication. Suppose the income of a family of four in the United States follows a lognormal distribution with μ = log(20,000) and σ² = 1.0 (μ_X = 32,974, σ_X = 43,224). Then the figure is produced with the following sequence:

x = (10:1000:125010)';
y = lognpdf(x, log(20000), 1.0);
plot(x, y)
set(gca, 'xtick', [0 30000 60000 90000 120000])
set(gca, 'xticklabel', {'0','$30,000','$60,000','$90,000','$120,000'})
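
The quoted μ_X and σ_X of the income example can be verified with the Statistics Toolbox function lognstat, which returns the mean and variance of a lognormal variable from the parameters of the underlying normal (a minimal sketch):

[m, v] = lognstat(log(20000), 1);  % second argument is sigma of ln(X), here 1
m        % 3.2974e4, the mean income
sqrt(v)  % 4.3224e4, the standard deviation of income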

Question

Suppose the income of a family of four in the United States follows a lognormal distribution with μ = log(20,000) and σ² = 1.0 (μ_X = 32,974, σ_X = 43,224). See figure: What is your estimate of the mode (that is, the most common income)? The median?

Mean, mode and median

Mode (highest point): $\exp(\mu - \sigma^2)$. Median (50% of samples below): $e^{\mu}$. Figure for μ = 0.

The lognormal distribution also allows us to introduce the concepts of mode and median. The mode is the point with the highest PDF; for the lognormal distribution it is at $\exp(\mu - \sigma^2)$. The median is the point where 50 percent of the samples will be below and 50% above; that is, the area of the PDF on each side is 0.5, or the value of the CDF there is 0.5. For the lognormal distribution the median is at $e^{\mu}$. For the income distribution shown on the previous slide, the equations indicate that the mode is $7,357. That is, if we sample many families, the largest concentration (highest point on a histogram) would be near $7,357. The median is $20,000; that is, half of the families would have income below $20,000 and half above. Finally, the mean was $32,974.

The figure on this slide shows the lognormal distribution for μ = 0 and two values of σ. For the lower value of σ, the distribution is not strongly skewed, so the mode, median and mean are close. For σ = 1, on the other hand, the distribution is highly skewed, and the three parameters are very different, as they are for the income figure on the previous slide.
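
For the income example, the three measures are easy to evaluate directly from these formulas (a minimal sketch):

mu = log(20000); sigma = 1;
exp(mu - sigma^2)    % mode, about 7357
exp(mu)              % median, 20000
exp(mu + sigma^2/2)  % mean, about 32974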

Light and heavy tails

The normal distribution has a light tail: 4.5 sigma is equivalent to a 3.4e-6 failure or defect probability. The lognormal can have a heavy tail: at 4.5 standard deviations above the mean, the probability is 7.5e-4 for μ = 0, σ = 0.25, and 0.0075 for μ = 0, σ = 1.

For many safety problems, the probability of failure must be very low, which means that we are interested not in the center of the distribution but in its tails. The normal distribution is light tailed: being more than 3-4 standard deviations from the mean is very unlikely. For example, in Slide 2 (see notes) we saw that the so-called six-sigma standard, which corresponds to 4.5 standard deviations from the mean, reflects a probability of failure of 3.4 per million. This applies to any normal distribution, regardless of the mean and standard deviation. Many distributions, such as income or strength, are heavier tailed, and the lognormal distribution may fit them better. For example, the almost symmetric case in the figure with μ = 0, σ = 0.25 has a probability of 7.5e-4 of exceeding 4.5 standard deviations, and the case with μ = 0, σ = 1 has a probability of 0.0075. This latter case was calculated with the following Matlab sequence:

m = exp(0.5);             % mean of lnN(0,1)
v = exp(1)*(exp(1) - 1);  % variance of lnN(0,1)
sig = sqrt(v);
sig6 = m + 4.5*sig        % sig6 = 11.3741
logncdf(sig6, 0, 1)       % 0.9925, so the tail probability is 0.0075

Fitting a distribution to data

We usually fit the CDF to minimize the maximum distance from the experimental CDF (the Kolmogorov-Smirnov test). Here, 20 points were generated from N(3, 1²). The normal fit is N(3.48, 0.93²); the lognormal fit is lnN(1.24, 0.26). The two fits have almost the same mean and standard deviation.

Given sampling data, we fit a distribution by finding a CDF that is close to the experimental CDF. Usually we use the Kolmogorov-Smirnov (K-S) criterion, which is the maximum difference between the two CDFs. Here this is illustrated by first generating a sample of twenty points from N(3, 1²):

3.4263 4.0990 3.6194 2.2412 3.0901 2.5178 3.1540 5.3013 4.0712 5.5182
2.6944 2.9772 3.8018 2.6601 3.1646 3.7553 3.2361 3.5960 2.0353 4.6775

The figure shows in blue the experimental CDF and its lower and upper 90% confidence bounds, the normal fit in red, and the lognormal fit in green. The normal fit, N(3.48, 0.93²), has 16% error in the mean and 7% in the standard deviation compared to the distribution used to generate the data. The lognormal fit has almost the same mean and standard deviation, but it is substantially different in the tail. Surprisingly, it is a better fit to the data by the K-S criterion than the normal; however, in view of the large uncertainty bounds on the experimental CDF, this is quite plausible. The Matlab sequence used to generate the fit and plot is:

x = randn(20,1) + 3;
[ecd, xe, elo, eup] = ecdf(x);
pd = fitdist(x, 'normal')     % mu = 3.481862, sigma = 0.927932
pd = fitdist(x, 'lognormal')  % mu = 1.21473, sigma = 0.262613
xd = linspace(1, 8, 1000);
cdfnorm = normcdf(xd, 3.4819, 0.92793);
cdflogn = logncdf(xd, 1.2147, 0.26261);
plot(xe, ecd, 'LineWidth', 2); hold on;
plot(xd, cdflogn, 'g', 'LineWidth', 2)
plot(xd, cdfnorm, 'r', 'LineWidth', 2);
xlabel('x'); ylabel('CDF')
legend('experimental', 'lognormal', 'normal', 'Location', 'SouthEast')
plot(xe, elo, 'LineWidth', 1); plot(xe, eup, 'LineWidth', 1)
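
The K-S distance of each fit can be extracted with kstest, whose third output is the maximum CDF difference; a sketch continuing from the sequence above (the variable names pdn and pdl are assumptions):

pdn = fitdist(x, 'normal');
pdl = fitdist(x, 'lognormal');
[~, ~, ksn] = kstest(x, 'CDF', pdn);  % K-S distance of the normal fit
[~, ~, ksl] = kstest(x, 'CDF', pdl);  % K-S distance of the lognormal fit
[ksn ksl]                             % the smaller distance is the better fit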

Extreme value distributions

No matter what distribution you sample from, the mean of the sample tends to be normally distributed as the sample size increases (with what mean and standard deviation? See the sketch below). Similarly, the distributions of the minimum (or maximum) of samples converge to limiting distributions of their own. Even though there is an infinite number of distributions, there are only three extreme value distributions:
- Type I (Gumbel), derived from the normal distribution
- Type II (Frechet), e.g., maximum daily rainfall
- Type III (Weibull), weakest-link failure
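
A minimal sketch answering the parenthetical question: the sample mean retains the parent mean μ and has standard deviation σ/√n. The choice of a uniform parent distribution and the sample sizes here are assumptions for illustration:

% Means of n = 50 uniform samples; parent has mu = 0.5, sigma = sqrt(1/12)
xbar = mean(rand(50, 100000));  % 100000 sample means
mean(xbar)                      % about 0.5 (= mu)
std(xbar)                       % about 0.0408 (= sigma/sqrt(50))
hist(xbar, 50)                  % histogram is close to bell-shaped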

Maximum of normal samples

With the normal distribution, the maximum of a sample is more narrowly distributed than the original distribution. Max of 10 standard normal samples: 1.54 mean, 0.59 standard deviation. Max of 100 standard normal samples: 2.50 mean, 0.43 standard deviation.

The normal distribution decays exponentially; that is, it has a light tail. Therefore, when you take the maximum of a set of samples, its distribution is narrower than the original distribution. This is illustrated here for the cases of 10 samples and 100 samples drawn from the standard normal distribution. The left histogram and the values of the mean and standard deviation are obtained with the Matlab sequence:

x = randn(10, 100000);
maxx = max(x);
hist(maxx, 50)
mean(maxx)   % 1.54
std(maxx)    % 0.59

We see that by the time we use 100 samples, the maximum has a standard deviation of only 0.43, compared to 1 for the original distribution.

Gumbel distribution

Mean, median, mode and variance.

For a large number of samples, the minimum of normal samples converges to a distribution called the Type I extreme value distribution, or the Gumbel distribution. The slide provides its PDF and CDF, and its mean, median, mode and variance. Note that the distribution is defined for the minimum of a sample. If we want the distribution of the maximum of a sample, we fit the negatives of the maxima, since max(X) = -min(-X). This was done in fitting a distribution to the maxima of samples of size 10 and 100 drawn from the standard normal distribution: 100,000 such sets of samples were drawn, and the negatives of their maxima were fitted to the Gumbel distribution. The left figure shows that the CDF for samples of 10 is markedly different from the Gumbel, but for 100 they agree quite well. The Matlab sequence for the 100 samples was as follows (the fitted mu and sigma were output by fitdist and then entered as inputs to define them):

x = randn(100, 100000);
maxx = max(x);
fitdist(-maxx', 'ev')            % extreme value fit: mu = -2.30676, sigma = 0.36862
mu = -2.30676; sigma = 0.36862;  % fitted values entered manually
[F, X] = ecdf(-maxx);
plot(X, F, 'r'); hold on
xd = linspace(-5.3, -1, 1000);
evcd = evcdf(xd, mu, sigma);
plot(xd, evcd)
legend('fitted ev1', '-max100 data')
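
The equations referred to on the slide did not survive the transcript. For reference, the standard forms for the minimum-type Gumbel distribution, with location μ and scale β (the slide's exact symbols are not recoverable, so these are assumptions):

$F(x) = 1 - \exp\left(-e^{(x-\mu)/\beta}\right), \qquad f(x) = \frac{1}{\beta}\, e^{(x-\mu)/\beta} \exp\left(-e^{(x-\mu)/\beta}\right)$

Mean = $\mu - \gamma\beta$ (with Euler's constant γ ≈ 0.5772), median = $\mu + \beta \ln(\ln 2)$, mode = μ, variance = $\pi^2\beta^2/6$. As a consistency check, the fitted values above give a mean of -2.30676 - 0.5772 × 0.36862 ≈ -2.52 for the negative of the maximum, matching the 2.50 observed on the previous slide.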

Weibull distribution

The standard two-parameter Weibull PDF, with shape k and scale λ, is
$f(x) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k}, \quad x \ge 0$
Its logarithm has a Gumbel distribution. It is used to describe the distribution of strength or fatigue life in brittle materials. If it describes time to failure, then k < 1 indicates that the failure rate decreases with time, k = 1 indicates a constant rate, and k > 1 an increasing rate. A third parameter can be added by replacing x with x - c.

The Gumbel distribution, being the limiting case of the normal, has a fairly light tail. The Weibull is a heavier-tailed limiting distribution; this can also be seen from the fact that its logarithm obeys the Gumbel distribution. It is also called the Type III extreme value distribution. The specific equation for the PDF and the figure are taken from Wikipedia. The figure on the right shows the variety of shapes the Weibull PDF can take. The figure on the left compares the experimental CDF of the logarithm of a sample generated from the Weibull distribution (Matlab wblrnd) with the Gumbel distribution (Matlab ev) fitted to that sample. The excellent agreement confirms the relation between the Weibull and the Gumbel.
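
The Weibull-Gumbel relation can be checked numerically. If X is Weibull with scale λ and shape k, then ln(X) is minimum-type Gumbel with location ln(λ) and scale 1/k. A minimal sketch (the shape, scale and sample size are assumptions):

x = wblrnd(1, 2, 100000, 1);  % Weibull sample: scale lambda = 1, shape k = 2
y = log(x);
pd = fitdist(y, 'ev')         % expect mu = log(1) = 0 and sigma = 1/2
[F, Y] = ecdf(y);
plot(Y, F, 'r'); hold on
plot(Y, evcdf(Y, pd.mu, pd.sigma))  % the two CDFs should nearly coincide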

Exercises

1. Estimate how much rain Gainesville will have in 2014, as well as the aleatory and epistemic uncertainty in your estimate.

2. Find how many samples of normally distributed numbers you need in order to estimate the mean with an error that will be less than 5% of the true standard deviation 90% of the time. Use the fact that the mean of a sample of a normal variable has the same mean as the variable and a standard deviation that is reduced by the square root of the number of samples.

3. Both the lognormal and Weibull distributions are used to model strength. Fit 100 data points generated from a standard lognormal distribution with both lognormal and Weibull distributions. Repeat with 5 randomly generated samples. In each case measure the distance using the K-S distance, and translate the result into a sentence of the following format: "The maximum difference between the two CDFs is at x = 2, where the true probability of x < 2 is 60%, the probability from the experimental CDF is 61%, the probability from the lognormal fit is 62%, and the probability from the Weibull fit is 64%" (these numbers are invented for the purpose of illustrating the format).

4. Generate a histogram of word lengths in this assignment, including hyphens and the math (e.g., x=2 is a 3-letter word) but not punctuation marks. Select an appropriate number of boxes for the histogram and explain your selection. Then fit the distribution of word lengths with five standard distributions, including the normal, lognormal, and Weibull, using the K-S criterion. Which distribution fits best? Compare the graphs of the CDFs.