Project Plan Task 8 and VERSUS2 Installation Problems
Anatoly Myravyev and Anastasia Bundel, Hydrometcenter of Russia
March 2010
Task 8: Statistical features like confidence intervals and the Bootstrap method
Formal definition of confidence intervals (CIs)
Estimation of an unknown value $\theta \in \Theta$ that defines a distribution $P_\theta$ corresponding to a random sample $X$ from the population $\mathcal{P} = \{P_\theta\}$.
If for a given $\alpha > 0$ there exist random variables $\theta^{\pm} = \theta^{\pm}(\alpha, X)$ such that $P_\theta(\theta^- < \theta < \theta^+) \ge 1 - \alpha$, then the interval $(\theta^-, \theta^+)$ is called the confidence interval for $\theta$ of level $1 - \alpha$.
The interval is random; the unknown value $\theta$ that it is meant to contain is not random.
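The definition can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the data are synthetic normal samples, the true mean is known because we generate it ourselves, and the Student-t interval from `t.test()` is used as one example of a level-0.95 interval. All variable names are hypothetical.

```r
# Coverage simulation: how often does a level-0.95 interval contain the fixed true mean?
set.seed(1)                 # reproducible random numbers
true_mean <- 10             # plays the role of theta (known only because we simulate)
alpha     <- 0.05
n_rep     <- 2000           # number of independent samples
n         <- 30             # size of each sample

covered <- replicate(n_rep, {
  x  <- rnorm(n, mean = true_mean, sd = 2)          # iid normal sample (assumption)
  ci <- t.test(x, conf.level = 1 - alpha)$conf.int  # Student-t interval for the mean
  ci[1] < true_mean && true_mean < ci[2]            # TRUE if this interval contains theta
})

coverage <- mean(covered)   # fraction of the random intervals that contain theta
print(coverage)
```

The interval endpoints change from sample to sample while `true_mean` stays fixed; the observed fraction should be close to 1 − α.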
The statistical problem lies in the construction of CIs
– Cases with a known probability distribution function (pdf) of the population: parametric CIs
– Cases where the pdf is not known: non-parametric CIs
Parametric CIs
The normal distribution assumption is the most frequent. The underlying sample must be an iid sample (independent and identically distributed).
Pluses:
– Easy and not computer-intensive
Minuses:
– Cannot be used for scores with non-normal distributions (proportions, odds ratio, correlation coefficients, …) without some normalization, or require complicated calculation formulas
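As an example of the normalization mentioned for correlation coefficients, the sketch below applies the Fisher z-transformation, which is one common choice and not necessarily the one intended on the slide. The forecast and observation vectors are synthetic and the variable names are hypothetical.

```r
# Parametric CI for a correlation coefficient via the Fisher z-transformation
set.seed(42)
n   <- 40
obs <- rnorm(n)                          # synthetic observations (assumption)
fcs <- 0.7 * obs + rnorm(n, sd = 0.7)    # synthetic forecasts correlated with obs

r      <- cor(fcs, obs)                  # sample correlation, not normally distributed
z      <- atanh(r)                       # Fisher z: approximately normal
se_z   <- 1 / sqrt(n - 3)                # approximate standard error of z
z_crit <- qnorm(1 - 0.05 / 2)            # normal quantile for level 0.95

# Build the interval on the z scale, then transform back to the correlation scale
ci_r <- tanh(z + c(-1, 1) * z_crit * se_z)
print(c(r = r, lower = ci_r[1], upper = ci_r[2]))
```

The back-transformed interval is asymmetric around `r` and always stays within [-1, 1], which a direct normal interval on `r` would not guarantee.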
Non-parametric CIs
Construction of artificial datasets from a given collection of real data by resampling the observations.
Pluses:
– Highly adaptable to different testing situations because no assumptions regarding an underlying theoretical distribution of data are required
– Computational ease
Minuses:
– The assumptions for sample statistics must not be overlooked: representativeness, iid
Bootstrapping
– Operates by constructing the artificial data using sampling with replacement from the original data (Efron 1979, Wasserman 2006)
– Well-developed computational technique with mature implementations (R project)
– The most common and popular resampling method in verification (Wilks 1995)
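The resampling step can be written out by hand in a few lines of base R. This is a minimal sketch, assuming a synthetic skewed sample and the median as the statistic of interest; neither choice comes from the slides, and the names are hypothetical.

```r
# Hand-written bootstrap: resample with replacement, recompute the statistic each time
set.seed(123)
obs    <- rgamma(60, shape = 2, scale = 3)   # synthetic skewed sample (assumption)
n_boot <- 1000                               # number of bootstrap replicates

boot_medians <- replicate(n_boot, {
  resample <- sample(obs, size = length(obs), replace = TRUE)  # artificial dataset
  median(resample)                                             # statistic on the resample
})

# Percentile interval: empirical 2.5% and 97.5% quantiles of the replicates
ci_perc <- quantile(boot_medians, probs = c(0.025, 0.975))
print(ci_perc)
```

Each artificial dataset has the same size as the original; only the multiplicity of each observation changes.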
Different bootstrap methods: how to construct CIs from the samples obtained
– Percentile CIs
– Bias-corrected and accelerated CIs (BCa)
– Normal approximation CIs
– Basic bootstrap CIs
– Bootstrap-t CIs
– Approximate bootstrap CIs (ABC), etc.
A compromise between their accuracy and computational burden must be made.
(Slide annotation: used at present in the MET package)
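Implementation of CIs using the R package boot
– boot is one of the packages required by the R verification package
– The intention is to introduce commands analogous to the MySQL v_index table, in a form like:
index_booted <- boot(fcs_obs, index, R = 1000)
index_ci <- boot.ci(index_booted, conf = c(0.95, 0.99), type = c("perc", "bca"))
– Here fcs_obs and index are placeholders: fcs_obs stands for a data frame of paired forecasts and observations, and index for a score function of the form function(data, i), which is what boot() expects as its statistic argument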
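A runnable version of this idea might look like the following sketch. It assumes synthetic forecast/observation pairs and uses the mean error as a stand-in for the score; the data frame `verif` and the function `mean_error` are hypothetical names, not part of VERSUS2 or the verification package. The number of replicates is raised to 2000 here (an assumption) because the 99% BCa interval relies on fairly extreme quantiles.

```r
library(boot)   # recommended package shipped with standard R installations

set.seed(2010)
n <- 50
# Synthetic paired data (assumption): observations and slightly biased forecasts
verif <- data.frame(obs = rnorm(n, mean = 15, sd = 4))
verif$fcs <- verif$obs + rnorm(n, mean = 0.5, sd = 1.5)

# boot() calls the statistic with the data and a vector of resampled row indices
mean_error <- function(d, i) mean(d$fcs[i] - d$obs[i])

me_booted <- boot(data = verif, statistic = mean_error, R = 2000)

# Percentile and BCa intervals at two confidence levels
me_ci <- boot.ci(me_booted, conf = c(0.95, 0.99), type = c("perc", "bca"))
print(me_ci)
```

Resampling row indices keeps each forecast paired with its observation, which matters for any score computed from matched pairs.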
Conclusions
The accuracy of statistical scores depends, among other things, on the following:
– Sampling uncertainty
– Validity of assumptions about representativeness and iid of the sample
– Observational uncertainty
– Uncertainty in the physical processes (Gilleland, 2008)
Different α can be used (e.g. CIs of level 0.95, 0.99, even 0.70, etc.) depending on the scope of analysis
Bayesian prediction intervals?
Conclusions (2)
– Since it is unclear which method of CI construction is the "most precise", we should try several procedures on the real forecast (frc) and observation (obs) data available.
– Both parametric and non-parametric statistics are legitimate (MET experience!)
– The decision making (what is good, what is bad) should be performed on a multi-criteria basis
Problems with VERSUS2 functioning at the Hydrometcenter of Russia
Problems with VERSUS2 functioning
– Installation in the RedHat environment completes without errors
– The new data leave traces in the MySQL tables, and the test (Pirmin-) files are acquired
– However, the data information gets lost somewhere around the Data Availability tab (Model? Date Intervals?...)
– A tutorial version of the package with valid obs and frc data is urgently needed
Thank you for your attention!