Normalizing Transformations and fitting a marginal distribution


Much statistical theory relies on the central limit theorem and therefore applies to normally distributed data. Where the data are not normally distributed, normalizing transformations are used:
- Log
- Box-Cox (log is a special case of Box-Cox)
- A specific PDF, e.g. Gamma
- A nonparametric PDF

Approach
1. Select the class of distributions you want to fit.
2. Estimate parameters using an appropriate goodness-of-fit measure:
- Likelihood
- PPCC (Filliben's statistic)
- Kolmogorov-Smirnov p-value
- Shapiro-Wilk W

Normalizing transformation for arbitrary distribution
[Figure: an arbitrary distribution F(x) of x alongside a normal distribution Fn(y) of y. The normalizing transformation maps x to y; the back transformation maps y to x.]
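The mapping in the figure amounts to matching cumulative probabilities: y = Fn^{-1}(F(x)) for the normalizing transformation and x = F^{-1}(Fn(y)) for the back transformation. The sketch below is one possible illustration, not the deck's own code. It assumes F is estimated from ranks with the plotting position i/(n+1), uses synthetic gamma data, and the variable names are illustrative.

```r
set.seed(1)
x <- rgamma(200, shape = 2, rate = 0.5)   # synthetic skewed sample (illustrative only)
n <- length(x)

# Normalizing transformation: y = Fn^{-1}(F(x)), with F estimated from ranks
p <- rank(x) / (n + 1)    # plotting position, strictly inside (0, 1)
y <- qnorm(p)             # normal scores

# Back transformation: x = F^{-1}(Fn(y)), by interpolating between
# the ordered normal scores and the ordered data
x_back <- approx(x = sort(y), y = sort(x), xout = y)$y
```

Here `x_back` reproduces `x`, because every `y` sits exactly on an interpolation knot. For new values of y between knots the back transformation interpolates linearly.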

Kernel Density Estimate (KDE)
- Place "kernels" at each data point
- Sum up the kernels
- Width of kernel determines level of smoothing
- Determining how to choose the width of the kernel could be a full-day lecture!
[Figure: individual kernels and the sum of kernels, shown for a narrow, a medium and a wide kernel.]
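The "place a kernel at each point and sum" recipe can be written out directly. The following is a minimal sketch on synthetic data, assuming a Gaussian kernel and the `nrd0` rule-of-thumb bandwidth. It builds the estimate by hand and compares it with R's `density()`, which uses a binned approximation, so the two agree closely but not exactly.

```r
set.seed(1)
x <- rnorm(50)                      # synthetic sample (illustrative only)
h <- bw.nrd0(x)                     # kernel width (bandwidth)
grid <- seq(min(x) - 3 * h, max(x) + 3 * h, length.out = 200)

# One Gaussian kernel per data point, each scaled by 1/n.
# Rows index grid points, columns index data points.
kernels <- sapply(x, function(xi) dnorm(grid, mean = xi, sd = h) / length(x))

# The KDE is the sum of the individual kernels
kde <- rowSums(kernels)

# Same estimate from density(), evaluated on the same grid
d <- density(x, bw = h, from = min(grid), to = max(grid), n = 200)
max_diff <- max(abs(kde - d$y))
```

Increasing `h` widens every kernel and smooths the sum; decreasing it produces a spikier estimate.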

1-d KDE of Log-transformed Flow
[Figure: three panels with level of smoothing 0.2, 0.5 and 0.8. The rug plot shows the location of the data points.]

Nonparametric PDF in R

    # Read in Willamette R. flow data (columns: year, month, flow)
    q = matrix(scan("willamette_data.txt"), ncol=3, byrow=TRUE)
    # Assign variables
    yr = q[,1]
    mo = q[,2]
    flow = q[,3]

    # Format flows into a matrix with one column per month
    fmat = matrix(flow, ncol=12, byrow=TRUE)

    # Focus on January and February
    # Marginal distributions
    # Create a histogram for each month, with actual streamflow data on the x-axis
    # and a KDE of the marginal distribution using a Gaussian kernel and nrd0 bandwidth
    par(mfrow=c(1,2))
    for (i in 1:2) {
      x = fmat[,i]
      hist(x, nclass=15, main=month.name[i], xlab="cfs", probability=TRUE)
      lines(density(x, bw="nrd0", na.rm=TRUE), col=2)
      rug(x, col=2)   # rug plot marks the location of each data point
      box()
    }

Nonparametric CDF in R

    # Build a CDF from a density() result by cumulative summation
    cdf.r = function(density) {
      x = density$x
      yt = cumsum(density$y)
      n = length(yt)
      y = (yt-yt[1])/(yt[n]-yt[1])  # force onto the range 0,1 without checking for significant error
      list(x=x, y=y)
    }

    dd = density(x, bw="nrd0", na.rm=TRUE)
    cdf = cdf.r(dd)
    plot(cdf, type="l")

    # Look up the cumulative probability y for a given x by linear interpolation
    ylookup.r = function(x, cdf) {
      int = sum(cdf$x < x)  # This identifies the interval for interpolation
      n = length(cdf$x)
      if (int < 1) {
        y = cdf$y[1]
      } else if (int > n-1) {
        y = cdf$y[n]
      } else {
        y = ((x-cdf$x[int])*cdf$y[int+1] + (cdf$x[int+1]-x)*cdf$y[int]) / (cdf$x[int+1]-cdf$x[int])
      }
      return(y)
    }

    # Look up the quantile x for a given cumulative probability y (inverse CDF)
    xlookup.r = function(y, cdf) {
      int = sum(cdf$y < y)  # This identifies the interval for interpolation
      n = length(cdf$y)
      if (int < 1) {
        x = cdf$x[1]
      } else if (int > n-1) {
        x = cdf$x[n]
      } else {
        x = ((y-cdf$y[int])*cdf$x[int+1] + (cdf$y[int+1]-y)*cdf$x[int]) / (cdf$y[int+1]-cdf$y[int])
      }
      return(x)
    }

Gamma
Estimate parameters using the method of moments or maximum likelihood.
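Both estimation routes can be sketched in base R. The example below is illustrative only: the data are synthetic, the shape/rate parameterization is an assumption (the slide does not state one), and the likelihood is maximized numerically with `optim` on the log scale so both parameters stay positive.

```r
set.seed(1)
x <- rgamma(500, shape = 3, rate = 2)   # synthetic sample (illustrative only)

# Method of moments: mean = shape/rate, variance = shape/rate^2
m <- mean(x)
v <- var(x)
mom <- c(shape = m^2 / v, rate = m / v)

# Maximum likelihood: minimize the negative log-likelihood.
# theta holds log(shape) and log(rate).
nll <- function(theta) {
  -sum(dgamma(x, shape = exp(theta[1]), rate = exp(theta[2]), log = TRUE))
}
fit <- optim(log(mom), nll)             # start the search from the moment estimates
mle <- setNames(exp(fit$par), c("shape", "rate"))
```

Starting the optimizer at the moment estimates is a common convenience: they are cheap to compute and usually close to the likelihood maximum.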

Box-Cox Normalization
The Box-Cox family of transformations includes the logarithmic transformation as a special case ($\lambda = 0$). It is defined as:
$$z = \begin{cases} (x^{\lambda} - 1)/\lambda, & \lambda \neq 0 \\ \ln(x), & \lambda = 0 \end{cases}$$
where $z$ is the transformed data, $x$ is the original data and $\lambda$ is the transformation parameter.
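The definition translates into a few lines of R. This is a minimal sketch; the function names `boxcox_transform` and `boxcox_inverse` are illustrative rather than taken from any package, and the data are required to be strictly positive.

```r
# Box-Cox transform: (x^lambda - 1)/lambda, or log(x) when lambda is 0
boxcox_transform <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Back transformation from z to the original scale
boxcox_inverse <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

x <- c(0.5, 1, 2, 10)
z_half <- boxcox_transform(x, 0.5)      # square-root-like transformation
z_log  <- boxcox_transform(x, 0)        # identical to log(x)
x_back <- boxcox_inverse(z_half, 0.5)   # recovers x
```

As the parameter approaches 0 the power form converges to the logarithm, which is why the log transformation is treated as the special case.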

Log normalization with lower bound
$z = \ln(x - \zeta)$, where $\zeta$ is the lower bound.

Determining Transformation Parameters ($\lambda$, $\zeta$)
- PPCC (Filliben's statistic): $R^2$ of the best-fit line of the Q-Q plot
- Kolmogorov-Smirnov (KS) test (any distribution): p-value
- Shapiro-Wilk test for normality: p-value

Quantiles
Rank the data $x_1, x_2, x_3, \ldots, x_n$ and assign each ranked value a probability $p_i$. $q_i$ is the distribution-specific theoretical quantile (e.g. from the standard normal) associated with ranked data value $x_i$.
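The ranking and quantile steps can be sketched as follows. The plotting position is an assumption, since the slide does not say which one was used: the example takes R's default `ppoints()`, which gives (i - 0.5)/n for n greater than 10. The data are synthetic.

```r
set.seed(1)
x <- rlnorm(30)          # synthetic positive sample (illustrative only)
n <- length(x)

xs <- sort(x)            # ranked data x_(1) <= ... <= x_(n)
p  <- ppoints(n)         # probability p_i assigned to each rank
q  <- qnorm(p)           # theoretical standard normal quantile q_i

# Same quantiles as computed internally by qqnorm()
q_ref <- sort(qqnorm(x, plot.it = FALSE)$x)
```

Plotting `xs` against `q` gives the Q-Q plot for the raw data, and plotting `log(xs)` against `q` gives the one for the log-transformed data.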

Quantile-Quantile Plots
[Figure: QQ-plot for raw flows ($x_i$ against $q_i$) and QQ-plot for log-transformed flows ($\ln(x_i)$ against $q_i$).]
A transformation is needed to make the raw flows normally distributed.

Box-Cox Normality Plot for Monthly September Flows on Alafia R. Using PPCC
The optimal $\lambda = -0.14$ is close to 0.
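A Box-Cox normality plot of this kind can be produced by scanning candidate values of the transformation parameter and recording the probability plot correlation at each. The sketch below uses synthetic lognormal data rather than the Alafia flows. It assumes `ppoints()` plotting positions in place of Filliben's median positions, and the helper names `bc` and `ppcc` are illustrative.

```r
set.seed(1)
x <- rlnorm(100, meanlog = 5, sdlog = 1)   # synthetic positive "flows"

# Box-Cox transform
bc <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# PPCC: correlation between ordered data and standard normal quantiles
ppcc <- function(z) cor(sort(z), qnorm(ppoints(length(z))))

# Evaluate the PPCC over a grid of candidate parameters
lambdas <- seq(-1, 1, by = 0.01)
r <- sapply(lambdas, function(l) ppcc(bc(x, l)))

lambda_best <- lambdas[which.max(r)]       # parameter with the straightest Q-Q plot
r_raw <- ppcc(x)                           # PPCC of the untransformed data
```

`plot(lambdas, r, type = "l")` then draws the normality plot. For lognormal data the maximum should fall near 0, which corresponds to the log transformation.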

Kolmogorov-Smirnov Test
The test computes the largest difference between the target CDF $F_X(x)$ and the observed CDF $F^*(X)$. The test statistic $D_2$ is:
$$D_2 = \max_{1 \le i \le n} \left| F^*(X_{(i)}) - F_X(X_{(i)}) \right|$$
where $X_{(i)}$ is the $i$th largest observed value in the random sample of size $n$.
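The statistic can be computed by hand and checked against R's `ks.test`. This is a sketch on synthetic data with a standard normal target CDF. Because the observed CDF is a step function, the difference is checked on both sides of each jump, which is the two-sided form `ks.test` uses.

```r
set.seed(1)
x <- rnorm(40)                 # synthetic sample (illustrative only)
n <- length(x)

xs <- sort(x)                  # ordered observations
Fx <- pnorm(xs)                # target CDF at each ordered observation

# The observed CDF jumps from (i-1)/n to i/n at the i-th ordered value
D <- max(c((1:n) / n - Fx, Fx - (0:(n - 1)) / n))

ks <- ks.test(x, "pnorm")      # built-in test against the same target
D_builtin <- unname(ks$statistic)
p_value <- ks$p.value
```

Note that the p-value from `ks.test` assumes the target distribution is fully specified in advance. If its parameters are estimated from the same data, the reported p-value is too large.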

Box-Cox Normality Plot for Monthly September Flows on Alafia R. Using Kolmogorov-Smirnov (KS) Statistic
The optimal $\lambda = -0.39$ is not as close to 0.

Shapiro-Wilk test: shapiro.test(x) in R
Reference: http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/wilkshap.htm

Box-Cox Normality Plot for Monthly September Flows on Alafia R. Using Shapiro-Wilk Statistic
The optimal $\lambda = -0.14$ is close to 0, the same value as found with PPCC.
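The Shapiro-Wilk version of the search can use a one-dimensional optimizer instead of a grid. The sketch below is illustrative only: it uses synthetic lognormal data rather than the Alafia flows, maximizes W rather than the p-value, and the helper names are assumptions.

```r
set.seed(1)
x <- rlnorm(100, meanlog = 5, sdlog = 1)   # synthetic positive "flows"

# Box-Cox transform
bc <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Shapiro-Wilk W of the transformed data as a function of lambda
w_of_lambda <- function(lambda) unname(shapiro.test(bc(x, lambda))$statistic)

# Search the interval [-1, 1] for the lambda that maximizes W
opt <- optimize(w_of_lambda, interval = c(-1, 1), maximum = TRUE)
lambda_sw <- opt$maximum
w_best <- opt$objective
w_raw <- w_of_lambda(1)                    # lambda = 1 leaves the shape of the data unchanged
```

`optimize` assumes a single peak within the interval, so plotting W over a grid first is a sensible check when the shape of the curve is unknown.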