1
Choosing a Probability Distribution
Institute for Water Resources, 2010. Choosing a Probability Distribution. Charles Yoe, Ph.D.
2
Probability x Consequence
Quantitative risk assessment requires you to use probability. Sometimes you will estimate the probability of an event. Sometimes you will use distributions to describe data, model variability, or represent our uncertainty. What distribution do you use?
3
Probability—Language of Random Variables
Constants. Variables: some things vary predictably, some things vary unpredictably. Random variables: the value can be something known, but not known by us.
A constant is a numerical characteristic that does not change. There are many important constants in life, numerical and otherwise: pi, Avogadro’s number, inches in a foot, pounds in a ton, and so on. Ask: Tell me something that is constant. Your name, number of eyes, number of children. Is there anything relative about a constant? Time? Place? What varies predictably? Crowds at stadiums: if the actual number is not important, we know there will be many people at a game. Maybe who will be home for dinner at your house. Our level of resolution is important in determining what is predictable. Big or little may be predictable, but the exact number is not.
4
Checklist for Choosing a Distribution From Some Data
Can you use your data? Understand your variable: source of data; continuous/discrete; bounded/unbounded; meaningful parameters; do you know them (1st or 2nd order)?; univariate/multivariate. Look at your data: plot it. Use theory. Calculate statistics. Use previous experience. Distribution fitting. Expert opinion. Sensitivity analysis.
How do we choose a distribution to represent our data? What do your data look like? Do you know from previous experience of your own or others? Do we have theory that suggests what the distribution should be (sampling theory and statistics)? Use distribution fitting tests. We’ll do this later. Expert opinion: estimate the parameters of a distribution or estimate the CDF. Sensitivity analysis: do the analysis making different assumptions about the distribution.
5
First! Do you have data? If so, do you need a distribution or can you just use your data? The answer depends on the question(s) you’re trying to answer as well as on your data.
6
Use Data. If your data are representative of the population germane to your problem, use them. One problem could be bounding the data: what are the true min and max? Any dataset can be converted into a cumulative distribution function or a general density function.
7
Fitting Empirical Distribution to Data
Applies if the data are continuous and reasonably extensive. You may have to estimate the minimum and maximum. Rank the data x(i) in ascending order. Calculate the percentile for each value. Use the data and percentiles to create a cumulative distribution function.
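The ranking and percentile steps can be sketched in Python. This is a minimal illustration under stated assumptions, not the presenter's procedure: the names `empirical_cdf` and `cdf_at` are hypothetical, the percentile for the i-th ranked value is taken as i/(n+1) (one common plotting position among several), and the analyst supplies the estimated minimum and maximum.

```python
from bisect import bisect_right


def empirical_cdf(data, minimum, maximum):
    """Return (xs, ps): ranked values and their cumulative probabilities.

    minimum and maximum are the analyst's estimates of the true bounds
    (an assumption of this sketch); they anchor the CDF at 0 and 1.
    """
    ranked = sorted(data)
    n = len(ranked)
    if minimum > ranked[0] or maximum < ranked[-1]:
        raise ValueError("estimated bounds must enclose the data")
    xs = [minimum] + ranked + [maximum]
    # i / (n + 1) is one common plotting position; others exist.
    ps = [0.0] + [i / (n + 1) for i in range(1, n + 1)] + [1.0]
    return xs, ps


def cdf_at(x, xs, ps):
    """Linearly interpolate the cumulative probability at x."""
    if x <= xs[0]:
        return 0.0
    if x >= xs[-1]:
        return 1.0
    j = bisect_right(xs, x)  # guarantees xs[j - 1] <= x < xs[j]
    x0, x1, p0, p1 = xs[j - 1], xs[j], ps[j - 1], ps[j]
    return p0 + (p1 - p0) * (x - x0) / (x1 - x0)
```

For example, `xs, ps = empirical_cdf([4, 1, 3, 2], minimum=0, maximum=5)` followed by `cdf_at(2.5, xs, ps)` interpolates between the second and third ranked values.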
8
When You Can’t Use Your Data
Given the wide variety of distributions, it is not always easy to select the most appropriate one. Results can be very sensitive to the distribution choice. Using a wrong assumption in a model can produce incorrect results => poor decisions => undesirable outcomes.
9
Understand Your Data: what is the source of the data?
Experiments, observation, surveys, computer databases, literature searches, simulations, test cases. The source of the data may affect your decision to use it or not. Understand your variable.
10
Type of Variable? Is your variable discrete or continuous?
Discrete examples: barges in a tow, houses in a floodplain, people at a meeting, results of a diagnostic test, casualties per year, relocations and acquisitions. Continuous examples: average number of barges per tow, weight of an adult striped bass, sensitivity or specificity of a diagnostic test, transit time, expected annual damages, duration of a storm, shoreline eroded, sediment loads. Do not overlook this! A discrete variable takes one of a set of identifiable values, each of which has a calculable probability of occurrence. A continuous variable can take any value within a defined range.
11
What Values Are Possible?
Is your variable bounded or unbounded? Bounded: the value is confined to lie between two determined values. Unbounded: the value theoretically extends from minus infinity to plus infinity. Partially bounded: the value is constrained at one end (truncated distributions). Use a distribution that matches.
12
Continuous Distribution Examples
Unbounded: Normal, t, Logistic. Left bounded: Chi-square, Exponential, Gamma, Lognormal, Weibull. Bounded: Beta, Cumulative, General/histogram, Pert, Uniform, Triangle.
13
Discrete Distribution Examples
Unbounded: none. Left bounded: Poisson, Negative binomial, Geometric. Bounded: Binomial, Hypergeometric, Discrete, Discrete Uniform.
14
Are There Parameters? Does your variable have parameters that are meaningful? Parametric: the shape is determined by the mathematics describing a conceptual probability model; these require a greater knowledge of the underlying ... Non-parametric: empirical distributions for which the mathematics is defined by the shape required; intuitively easy to understand; flexible and therefore useful.
15
Choose Parametric Distribution If
Theory supports the choice. The distribution has proven accurate for modelling your specific variable (without theory). The distribution matches any observed data well. You need a distribution with a tail extending beyond the observed minimum or maximum.
16
Choose Non-Parametric Distribution If
Theory is lacking. There is no commonly used model. Data are severely limited. Knowledge is limited to general beliefs and some evidence.
17
Parametric and Non-Parametric
Parametric: Normal, Lognormal, Exponential, Poisson, Binomial, Gamma. Non-parametric: Uniform, Pert, Triangular, Cumulative.
18
Do You Know the Parameters?
A probability distribution with precisely known parameters, such as N(100,10), is called a 1st order distribution. A probability distribution with some uncertainty about its parameters, N(m,s), is called a 2nd order distribution. Example: Risknormal(risktriang(90,100,103),riskuniform(8,11)).
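The 2nd order example can be mimicked with Python's standard library. This is a sketch under assumptions, not the spreadsheet add-in's behavior: the function name `sample_second_order` and the loop sizes are hypothetical, and the choice to separate parameter uncertainty (outer loop) from variability (inner loop) is one possible design. The spreadsheet formula's arguments are read as (minimum, most likely, maximum) for the triangular and (minimum, maximum) for the uniform.

```python
import random


def sample_second_order(rng, n_outer=200, n_inner=500):
    """Two-stage Monte Carlo for N(m, s) with uncertain m and s.

    Outer loop: draw the uncertain parameters.
    Inner loop: draw values from the normal those parameters define.
    """
    results = []
    for _ in range(n_outer):
        # random.triangular takes (low, high, mode), so the most likely
        # value 100 goes last.
        m = rng.triangular(90, 103, 100)
        s = rng.uniform(8, 11)
        inner = [rng.gauss(m, s) for _ in range(n_inner)]
        results.append((m, s, inner))
    return results


if __name__ == "__main__":
    runs = sample_second_order(random.Random(2010), n_outer=5, n_inner=1000)
    for m, s, values in runs:
        print(f"m={m:6.2f}  s={s:5.2f}  sample mean={sum(values) / len(values):6.2f}")
```

A 1st order distribution is the special case where the outer loop collapses to fixed values, for example `rng.gauss(100, 10)`.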
19
Is It Dependent on Other Variables
Univariate and multivariate distributions. Univariate: describes a single parameter or variable that is not probabilistically linked to any other in the model. Multivariate: describes several parameters that are probabilistically linked in some way. Engineering relationships are often multivariate.
20
Continuing Checklist for Choosing a Distribution
Look at your data: plot it. Use theory. Calculate statistics. Use previous experience. Distribution fitting. Expert opinion. Sensitivity analysis.
21
Plot--Old Faithful Eruptions
What do your data look like? You could calculate the mean and SD and assume the data are normal. Beware, danger lurks. Always plot your data.
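A plot does not need special software. The sketch below draws a rough text histogram; the function name `text_histogram`, the bin count, and the two-cluster sample are all assumptions made for illustration. The sample is synthetic and is not the Old Faithful data.

```python
import random


def text_histogram(data, n_bins=10, width=40):
    """Return (counts, lines) for a simple equal-width text histogram."""
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("all values are identical; nothing to plot")
    step = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum value falls in the last bin rather than past it.
        k = min(int((x - lo) / step), n_bins - 1)
        counts[k] += 1
    peak = max(counts)
    lines = [
        f"{lo + i * step:8.2f} | " + "#" * round(width * c / peak)
        for i, c in enumerate(counts)
    ]
    return counts, lines


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic two-cluster sample (illustrative only).
    sample = [rng.gauss(2.0, 0.3) for _ in range(100)]
    sample += [rng.gauss(4.5, 0.4) for _ in range(170)]
    for line in text_histogram(sample, n_bins=12)[1]:
        print(line)
```

A sample like this has a perfectly computable mean and SD, yet the histogram shows two separate clusters that a single normal curve would hide.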
22
Which Distribution? Examine your plot
Look for the distinctive shapes of specific distributions: single peaks, symmetry, positive skew, negative values. Gamma, Weibull, and beta are useful and flexible forms.
23
Theory-Based Choice: the most compelling reason for a choice.
Formal theory: central limit theorem; theoretical knowledge of the variable (behavior; math, range). Informal theory: sums normal, products lognormal; study specific; your best documented thoughts on the subject.
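The "sums normal, products lognormal" rule of thumb can be checked by simulation. This is a sketch under assumptions: the helper names `skewness` and `simulate`, the Uniform(0.5, 1.5) factors, and the sample sizes are all illustrative choices. The idea is that a sum of many independent terms is close to symmetric, a product of many positive terms is strongly right-skewed, and the logarithm of that product is itself a sum and so is again close to symmetric.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: the mean cubed z-score."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def simulate(rng, n_terms=30, n_trials=5000):
    """Return sums and products of n_terms independent positive factors."""
    sums, products = [], []
    for _ in range(n_trials):
        factors = [rng.uniform(0.5, 1.5) for _ in range(n_terms)]
        sums.append(sum(factors))
        products.append(math.prod(factors))
    return sums, products


if __name__ == "__main__":
    sums, products = simulate(random.Random(0))
    print("skew of sums:        ", round(skewness(sums), 3))
    print("skew of products:    ", round(skewness(products), 3))
    print("skew of log products:", round(skewness([math.log(p) for p in products]), 3))
```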
24
Calculate Statistics: summary statistics may provide clues.
Normal: low coefficient of variation; equal mean and median. Exponential: positive skew; equal mean and standard deviation. Consider outliers.
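These clues are quick to compute. The sketch below is illustrative only: `summary_clues` is a hypothetical helper, the skewness formula is a simple moment estimate, and the two samples are simulated rather than real data.

```python
import random
import statistics


def summary_clues(data):
    """Summary statistics that hint at a distribution family."""
    mean = statistics.fmean(data)
    median = statistics.median(data)
    sd = statistics.stdev(data)
    skew = sum(((x - mean) / sd) ** 3 for x in data) / len(data)
    return {
        "mean": mean,
        "median": median,
        "sd": sd,
        "cv": sd / mean,  # coefficient of variation; undefined if mean is 0
        "skew": skew,
    }


if __name__ == "__main__":
    rng = random.Random(0)
    normal_like = [rng.gauss(100, 10) for _ in range(1000)]
    exponential_like = [rng.expovariate(0.1) for _ in range(1000)]
    print(summary_clues(normal_like))       # expect mean near median, low cv
    print(summary_clues(exponential_like))  # expect mean near sd, positive skew
```

These statistics are clues, not proof: many distributions share a low coefficient of variation or a positive skew.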
25
Outliers. Extreme observations can drastically influence a probability model. There is no prescriptive method for addressing them. If the observation is an error, remove it. If not, what is the data point telling you? What about your world-view is inconsistent with this result? Should you reconsider your perspective? What possible explanations have you not yet considered?
26
Outliers (cont). Your explanation must be correct, not merely plausible.
Consensus is a poor measure of truth. If you must keep the observation and can't explain it: use conventional practices and live with the skewed consequences, or choose methods less sensitive to such extreme observations (Gumbel, Weibull).
27
Previous Experience. Have you dealt with this issue successfully before? Have others? What did other analyses or risk assessments use? What does the literature reveal?
28
Goodness of Fit. Provides statistical evidence to test the hypothesis that your data could have come from a specific distribution. H0: these data come from an “x” distribution. A small test statistic and a large p-value mean you fail to reject H0 (informally, you "accept" H0). It is another piece of evidence, not a determining factor.
29
GOF Tests. Chi-Square Test: the most common; works for discrete and continuous data.
Data are divided into a number of cells, each cell with at least five observations. Usually needs 50 observations or more. Kolmogorov-Smirnov Test: more suitable for small samples than Chi-Square; gives a better fit for means than tails. Anderson-Darling Test: weights differences between the theoretical and empirical distributions at their tails more heavily than at their midranges; desirable when a better fit at the extreme tails of the distribution is desired.
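As a rough sketch of the cell-based idea behind the chi-square test, the code below counts observations in equal-probability cells of a hypothesized normal distribution and sums the scaled squared differences. The function name `chi_square_statistic`, the use of equal-probability cells, and the normal hypothesis are assumptions for illustration. Turning the statistic into a p-value requires the chi-square distribution with the appropriate degrees of freedom, which is not done here.

```python
from statistics import NormalDist


def chi_square_statistic(data, dist, n_cells):
    """Chi-square statistic using equal-probability cells of `dist`.

    `dist` must provide inv_cdf (statistics.NormalDist does).
    Choose n_cells so that len(data) / n_cells is at least five.
    """
    n = len(data)
    expected = n / n_cells  # same expected count in every cell
    # Interior cell edges sit at the k/n_cells quantiles.
    edges = [dist.inv_cdf(k / n_cells) for k in range(1, n_cells)]
    observed = [0] * n_cells
    for x in data:
        cell = sum(1 for edge in edges if x > edge)
        observed[cell] += 1
    stat = sum((o - expected) ** 2 / expected for o in observed)
    return stat, observed


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    sample = [rng.gauss(100, 10) for _ in range(200)]
    stat, observed = chi_square_statistic(sample, NormalDist(100, 10), n_cells=10)
    print(observed, round(stat, 2))
```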
30
Kolmogorov-Smirnov Statistic
Blue = data (empirical CDF); red = true/hypothetical CDF. Find the biggest difference between the two. The K-S statistic is that largest difference; compare it with the critical value for your n and α.
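The largest-difference idea can be written directly. This is a sketch under assumptions: `ks_statistic` and `ks_critical_value` are hypothetical names, and the critical value uses the standard large-sample approximation sqrt(-ln(α/2)/2)/sqrt(n), which is only appropriate for reasonably large n and when the hypothesized distribution's parameters were not estimated from the same data.

```python
import math
from statistics import NormalDist


def ks_statistic(data, cdf):
    """Largest vertical gap between the empirical CDF and a hypothesized CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_critical_value(n, alpha=0.05):
    """Large-sample approximation to the K-S critical value."""
    return math.sqrt(-0.5 * math.log(alpha / 2)) / math.sqrt(n)


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    sample = [rng.gauss(100, 10) for _ in range(100)]
    d = ks_statistic(sample, NormalDist(100, 10).cdf)
    print(round(d, 4), round(ks_critical_value(len(sample)), 4))
```

If the statistic exceeds the critical value, the data are unlikely to have come from the hypothesized distribution at that α.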
31
Defining Distributions w/ Expert Opinion
Data were never collected. Data are too expensive or impossible to obtain. Past data are irrelevant. Opinion is needed to fill holes in sparse data. New area of inquiry, or a unique situation that never existed before.
32
What Experts Estimate. The distribution itself:
a judgment about the distribution of a value in the population, e.g. the population is normal. The parameters of the distribution: e.g. the mean is x and the standard deviation is y.
33
Parametric or Non-parametric distributions
Modeling techniques: disaggregation (reduction); subjective probability elicitation; PDF or CDF; parametric or non-parametric distributions.
34
Elicitation Techniques Needed
The literature shows we do not assess subjective probabilities well, in part due to the heuristics we use: representativeness, availability, and anchoring and adjustment. There are methods to counteract our heuristics and to elicit our expert knowledge.
35
Sensitivity Analysis Unsure which is the best distribution?
Try several. If there is no difference, you are free to use any one. Significant differences mean doing more work.
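A sensitivity check of this kind can be sketched as follows. Everything here is hypothetical: the toy model (the amount by which an input exceeds a threshold of 110), the three candidate input distributions, and the names `run_model`, `CANDIDATES`, and `compare`. The point is only the pattern: hold the model fixed, swap the input distribution, and compare outputs.

```python
import random
import statistics


def run_model(draw, rng, n=20000):
    """Toy model: output is the input's exceedance above 110.

    Returns the mean output and its 95th percentile.
    """
    outputs = sorted(max(0.0, draw(rng) - 110.0) for _ in range(n))
    return statistics.fmean(outputs), outputs[int(0.95 * n)]


# Candidate input distributions, all centered on 100 (illustrative choices).
CANDIDATES = {
    "normal": lambda rng: rng.gauss(100, 10),
    "triangular": lambda rng: rng.triangular(75, 125, 100),  # low, high, mode
    "uniform": lambda rng: rng.uniform(75, 125),
}


def compare(seed=1):
    """Run the same model under each candidate input distribution."""
    return {
        name: run_model(draw, random.Random(seed))
        for name, draw in CANDIDATES.items()
    }


if __name__ == "__main__":
    for name, (mean, p95) in compare().items():
        print(f"{name:<11} mean={mean:6.3f}  95th percentile={p95:6.3f}")
```

In this toy case the uniform input produces a noticeably larger mean exceedance than the other two, which is the kind of difference that signals more work is needed before settling on a distribution.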
36
Take Away Points. Choosing the best distribution is where most new risk assessors feel least comfortable. Choice of distribution matters. Distributions come from data and expert opinion. Distribution fitting should never be the basis for distribution choice.
37
Questions? Charles Yoe, Ph.D.