Normalizing Transformations and Fitting a Marginal Distribution
- Much theory relies on the central limit theorem, so it applies to normal distributions.
- Where the data are not normally distributed, normalizing transformations are used:
  - Log
  - Box-Cox (log is a special case of Box-Cox)
  - A specific PDF, e.g. Gamma
  - A non-parametric PDF
Approach
- Select the class of distributions you want to fit.
- Estimate parameters using an appropriate goodness-of-fit measure:
  - Likelihood
  - PPCC (Filliben’s statistic)
  - Kolmogorov-Smirnov p-value
  - Shapiro-Wilk W
Normalizing transformation for arbitrary distribution
[Figure: CDF of the arbitrary distribution F(x) against x, beside the CDF of the normal distribution Fn(y) against y. The normalizing transformation maps x to y; the back transformation maps y to x.]
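The transformation between the two CDFs can be sketched in R, assuming it pairs values of equal cumulative probability, i.e. y = Fn^{-1}(F(x)) and x = F^{-1}(Fn(y)). The Gamma distribution, its parameter values and the synthetic data are assumptions chosen only for illustration.

```r
# Normalizing transformation through a known CDF, and its back transformation.
# Assumption: F is a Gamma(shape = 2, rate = 0.5) CDF; the data are synthetic.
set.seed(1)
x <- rgamma(200, shape = 2, rate = 0.5)

# Forward: cumulative probability under F, then the standard normal quantile
y <- qnorm(pgamma(x, shape = 2, rate = 0.5))

# Back: cumulative probability under Fn, then the Gamma quantile
x_back <- qgamma(pnorm(y), shape = 2, rate = 0.5)

# Round-trip error should be negligible
max(abs(x_back - x))
```

If F describes the data well, y should look approximately standard normal, which can be checked with a QQ-plot.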
Kernel Density Estimate (KDE)
- Place “kernels” at each data point.
- Sum up the kernels.
- The width of the kernel determines the level of smoothing.
- Determining how to choose the width of the kernel could be a full-day lecture!
[Figure: three panels (narrow, medium and wide kernel), each showing the individual kernels and the sum of kernels.]
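The "place kernels and sum them" idea can be written out by hand and compared with R's density(). This is a minimal sketch on synthetic data; the Gaussian kernel and the bw.nrd0 width are assumptions made for the example, and the variable names are illustrative.

```r
# KDE built explicitly as a sum of Gaussian kernels (synthetic data)
set.seed(42)
x <- rnorm(30)
h <- bw.nrd0(x)                      # kernel width (sd of each Gaussian kernel)
grid <- seq(min(x) - 3 * h, max(x) + 3 * h, length.out = 200)

# One kernel per data point, each carrying weight 1/n (one column per point)
kernels <- sapply(x, function(xi) dnorm(grid, mean = xi, sd = h) / length(x))

# The KDE is the sum of the individual kernels at each grid location
kde <- rowSums(kernels)

plot(grid, kde, type = "l", xlab = "x", ylab = "density")
matlines(grid, kernels, lty = 2, col = "grey")   # individual kernels
rug(x)                                           # data locations

# density() with the same width and grid should agree closely
d <- density(x, bw = h, from = min(grid), to = max(grid), n = 200)
max(abs(d$y - kde))
```

Changing h (for example h / 2 or 2 * h) shows how a narrower or wider kernel changes the level of smoothing.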
1-d KDE of Log-transformed Flow
[Figure: KDE of log-transformed flow at levels of smoothing 0.2, 0.5 and 0.8. The rug plot shows the location of the data points.]
Non-parametric PDF in R
# Read in Willamette R. flow data
q=matrix(scan("willamette_data.txt"),ncol=3,byrow=TRUE)
# Assign variables
yr=q[,1]
mo=q[,2]
flow=q[,3]
# Format flows into a matrix with 12 columns, one per month
fmat=matrix(flow,ncol=12,byrow=TRUE)
# Focus on January and February
# Marginal distributions
# Create a histogram for each month, with actual streamflow data on the x-axis,
# and overlay the KDE of the marginal distribution using a Gaussian kernel and nrd0 bandwidth
par(mfrow=c(1,2))
for(i in 1:2){
  x=fmat[,i]
  hist(x,nclass=15,main=month.name[i],xlab="cfs",probability=TRUE)
  lines(density(x,bw="nrd0",na.rm=TRUE),col=2)
  rug(x,col=2)   # tick marks at the data locations
  box()
}
Non-parametric CDF in R
# Build a CDF from a density() object by cumulative summation
cdf.r=function(density)
{
  x=density$x
  yt=cumsum(density$y)
  n=length(yt)
  y=(yt-yt[1])/(yt[n]-yt[1]) # force onto the range 0,1 without checking for significant error
  list(x=x,y=y)
}
dd=density(x,bw="nrd0",na.rm=TRUE)
cdf=cdf.r(dd)
plot(cdf,type="l")

# Look up the cumulative probability y for a given x by linear interpolation
ylookup.r=function(x,cdf)
{
  int=sum(cdf$x<x) # This identifies the interval for interpolation
  n=length(cdf$x)
  if(int < 1){
    y=cdf$y[1]
  }else if(int > n-1)
  {
    y=cdf$y[n]
  }else
  {
    y=((x-cdf$x[int])*cdf$y[int+1]+(cdf$x[int+1]-x)*cdf$y[int])/(cdf$x[int+1]-cdf$x[int])
  }
  return(y)
}

# Look up the x value for a given cumulative probability y (inverse CDF)
xlookup.r=function(y,cdf)
{
  int=sum(cdf$y<y) # This identifies the interval for interpolation
  n=length(cdf$y)
  if(int < 1){
    x=cdf$x[1]
  }else if(int > n-1)
  {
    x=cdf$x[n]
  }else
  {
    x=((y-cdf$y[int])*cdf$x[int+1]+(cdf$y[int+1]-y)*cdf$x[int])/(cdf$y[int+1]-cdf$y[int])
  }
  return(x)
}
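A vectorized alternative to the scalar lookups can be sketched with approxfun(), and used to carry data to normal scores and back through the KDE-based CDF. The synthetic lognormal data stand in for monthly flows, and the names cdf_fun and inv_fun are hypothetical helpers introduced for this example only.

```r
# KDE-based CDF with vectorized forward and inverse lookups (synthetic data)
set.seed(7)
x <- rlnorm(100, meanlog = 5, sdlog = 0.6)

dd  <- density(x, bw = "nrd0")
cum <- cumsum(dd$y)
Fx  <- (cum - cum[1]) / (cum[length(cum)] - cum[1])   # rescale onto [0, 1]

# Linear interpolators; rule = 2 holds the end values outside the grid
cdf_fun <- approxfun(dd$x, Fx, rule = 2)              # x -> cumulative probability
inv_fun <- approxfun(Fx, dd$x, rule = 2, ties = mean) # cumulative probability -> x

# Normalizing transformation and back transformation
y      <- qnorm(cdf_fun(x))
x_back <- inv_fun(pnorm(y))

max(abs(x_back - x) / x)   # relative round-trip error
```

Because density() extends its grid beyond the data range, the cumulative probabilities of the observed values stay strictly between 0 and 1, so qnorm() returns finite values here.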
Gamma
- Estimate parameters using moments or maximum likelihood.
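Both estimation routes can be sketched in base R. The data are synthetic with known parameters, the shape/rate parameterization is an assumption, and the use of optim() on log-transformed parameters is one possible design choice rather than a prescribed method.

```r
# Gamma parameter estimation by moments and by maximum likelihood (synthetic data)
set.seed(123)
x <- rgamma(500, shape = 3, rate = 0.02)

# Method of moments: mean = shape / rate, variance = shape / rate^2
m <- mean(x)
v <- var(x)
mom <- c(shape = m^2 / v, rate = m / v)

# Maximum likelihood: minimize the negative log-likelihood.
# Optimizing log(parameters) keeps shape and rate positive.
nll <- function(logpar) {
  -sum(dgamma(x, shape = exp(logpar[1]), rate = exp(logpar[2]), log = TRUE))
}
fit <- optim(log(mom), nll)   # start the search at the moment estimates
mle <- exp(fit$par)

rbind(moments = mom, likelihood = mle)
```

Starting the likelihood search at the moment estimates usually gives quick convergence, and the two sets of estimates can then be compared directly.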
Box-Cox Normalization
The Box-Cox family of transformations includes the logarithmic transformation as a special case ($\lambda = 0$). It is defined as:
$$ z = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(x), & \lambda = 0 \end{cases} $$
where $z$ is the transformed data, $x$ is the original data and $\lambda$ is the transformation parameter.
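The definition translates directly into a pair of R functions. The names boxcox_transform and boxcox_inverse are hypothetical, and the tolerance used to treat the parameter as zero is an assumption of this sketch.

```r
# Box-Cox transformation and its inverse for positive data
boxcox_transform <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

boxcox_inverse <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

x <- c(0.5, 1, 2, 10, 100)

boxcox_transform(x, 0)        # identical to log(x)
boxcox_transform(x, 1e-4)     # approaches log(x) as lambda goes to 0
boxcox_transform(x, 1)        # x - 1: a shift only, no change in shape

# Round trip with a negative lambda
boxcox_inverse(boxcox_transform(x, -0.14), -0.14)
```

The small-lambda case illustrates why the logarithm is regarded as the limiting member of the family.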
Log normalization with lower bound
$$ z = \ln(x - \zeta) $$
where $\zeta$ is the lower bound of $x$.
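Determining Transformation Parameters ($\lambda$, lower bound $\zeta$)
- PPCC (Filliben’s statistic): correlation coefficient r between the ranked data and their theoretical quantiles (r² is the R² of the best-fit line of the QQ-plot)
- Kolmogorov-Smirnov (KS) test (any distribution): p-value
- Shapiro-Wilk test for normality: p-value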
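As an illustration of the PPCC criterion, the Box-Cox parameter can be chosen by a simple grid search that maximizes the probability plot correlation. The data are synthetic, the grid limits are arbitrary, and ppoints() is used for the plotting positions as an assumption (Filliben's original statistic uses median plotting positions).

```r
# Grid search for the Box-Cox parameter that maximizes the PPCC (synthetic data)
set.seed(2024)
x <- rlnorm(60, meanlog = 5, sdlog = 1)

boxcox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# PPCC: correlation between ranked data and standard normal quantiles
ppcc <- function(z) cor(sort(z), qnorm(ppoints(length(z))))

lambdas <- seq(-1, 1, by = 0.01)
r <- sapply(lambdas, function(l) ppcc(boxcox(x, l)))
best <- lambdas[which.max(r)]

plot(lambdas, r, type = "l", xlab = "lambda", ylab = "PPCC")
abline(v = best, lty = 2)
best
```

A plot of the statistic against the parameter, with the maximizing value marked, is the kind of display usually called a Box-Cox normality plot.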
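Quantiles
- Rank the data: $x_1, x_2, x_3, \ldots, x_n$
- Each ranked value $x_i$ is assigned a cumulative probability $p_i$.
- $q_i$ is the distribution-specific theoretical quantile associated with ranked data value $x_i$, taken from a theoretical distribution, e.g. the standard normal.
[Figure: ranked data $x_1, \ldots, x_n$ mapped through $p_i$ to the quantiles $q_i$ of the theoretical distribution.]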
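The ranking and quantile steps can be sketched as follows. The sample values are hypothetical, and the plotting position p_i = (i - 0.5)/n is an assumption, since several conventions for p_i are in use.

```r
# Ranked data, plotting positions and standard normal quantiles
x <- c(12.1, 3.4, 8.9, 25.0, 5.6, 15.3, 7.2)   # hypothetical sample
n <- length(x)

xs <- sort(x)             # ranked data, smallest to largest
p  <- (1:n - 0.5) / n     # assumed plotting position for each rank
q  <- qnorm(p)            # theoretical quantile for each ranked value

data.frame(rank = 1:n, x = xs, p = p, q = q)
```

Plotting xs against q gives the QQ-plot; replacing qnorm() with another quantile function gives the quantiles for a different theoretical distribution.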
Quantile-Quantile Plots
[Figure: two panels. QQ-plot for raw flows ($x_i$ against $q_i$) and QQ-plot for log-transformed flows ($\ln(x_i)$ against $q_i$).]
A transformation is needed to make the raw flows normally distributed.
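A comparable pair of plots can be produced with qqnorm(). The flows here are synthetic lognormal values, not the data shown on the slide, and the correlation computed at the end is only a rough numerical summary of how straight each plot is.

```r
# QQ-plots against the standard normal for raw and log-transformed data
set.seed(11)
flow <- rlnorm(80, meanlog = 6, sdlog = 0.8)   # synthetic skewed flows

par(mfrow = c(1, 2))
qqnorm(flow, main = "Raw flows")
qqline(flow)
qqnorm(log(flow), main = "Log-transformed flows")
qqline(log(flow))

# Straightness of each plot, measured by the probability plot correlation
q <- qnorm(ppoints(length(flow)))
r_raw <- cor(sort(flow), q)
r_log <- cor(sort(log(flow)), q)
c(raw = r_raw, log = r_log)
```

Curvature in the raw-flow panel and a nearly straight line in the log panel indicate that the log transformation brings the data closer to normal.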
Box-Cox Normality Plot for Monthly September Flows on Alafia R. Using PPCC
The selected value, $\lambda = -0.14$, is close to 0.
Kolmogorov-Smirnov Test
The test computes the largest difference between the target CDF $F_X(x)$ and the observed CDF $F^*(x)$. The test statistic $D_2$ is:
$$ D_2 = \max_{1 \le i \le n} \left| F^*(X_{(i)}) - F_X(X_{(i)}) \right| $$
where $X_{(i)}$ is the $i$th largest observed value in the random sample of size $n$.
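The statistic can be computed by hand and checked against ks.test(). This sketch evaluates the empirical CDF on both sides of each jump, using i/n and (i-1)/n at the sorted values, which is the convention assumed here and the one ks.test() follows. The data are synthetic, and the target distribution's parameters are specified in advance rather than estimated from the sample.

```r
# Kolmogorov-Smirnov statistic by hand, compared with ks.test() (synthetic data)
set.seed(5)
x <- rnorm(50, mean = 10, sd = 2)
n <- length(x)

# Target CDF evaluated at the sorted observations
Fx <- pnorm(sort(x), mean = 10, sd = 2)
i  <- 1:n

# Largest gap between the target CDF and the empirical CDF,
# checked just after (i/n) and just before ((i-1)/n) each jump
D <- max(pmax(i / n - Fx, Fx - (i - 1) / n))

ks <- ks.test(x, "pnorm", mean = 10, sd = 2)
c(manual = D, ks.test = unname(ks$statistic), p.value = ks$p.value)
```

A small statistic, and correspondingly large p-value, indicates that the sample is consistent with the target distribution.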
Box-Cox Normality Plot for Monthly September Flows on Alafia R. Using Kolmogorov-Smirnov (KS) Statistic
The selected value, $\lambda = -0.39$, is not as close to 0.
shapiro.test(x) in R http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/wilkshap.htm
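A short usage sketch of shapiro.test(), applied across a grid of Box-Cox parameters so that the W statistic can serve as the selection criterion. The data are synthetic and the grid is arbitrary; the helper name boxcox is hypothetical.

```r
# Shapiro-Wilk W as a criterion for choosing the Box-Cox parameter (synthetic data)
set.seed(99)
x <- rlnorm(60, meanlog = 4, sdlog = 0.9)

boxcox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

lambdas <- seq(-1, 1, by = 0.01)
W <- sapply(lambdas, function(l) unname(shapiro.test(boxcox(x, l))$statistic))
best <- lambdas[which.max(W)]

W_raw  <- unname(shapiro.test(x)$statistic)
W_best <- unname(shapiro.test(boxcox(x, best))$statistic)

plot(lambdas, W, type = "l", xlab = "lambda", ylab = "Shapiro-Wilk W")
abline(v = best, lty = 2)
c(lambda = best, W_raw = W_raw, W_best = W_best)
```

W lies between 0 and 1, with values near 1 indicating close agreement with normality, so the parameter with the largest W is the one selected.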
Box-Cox Normality Plot for Monthly September Flows on Alafia R. Using Shapiro-Wilk Statistic
The selected value, $\lambda = -0.14$, is close to 0 and is the same as with PPCC.