Two topics in R: Simulation and goodness-of-fit HWU - GS.



Some useful distributions
Used with insurance and financial data:
- Exponential: Exp(λ)
- Gamma(α, β)
- Log-normal: LN(μ, σ²)
- Weibull(ν, λ)
- etc.

Exponential: Exp(λ)
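The following slides work with a sample called y1 from Exp(2). A minimal sketch of how such a sample could be generated, assuming base R's rexp() is used; the sample size of 200 and the seed are illustrative assumptions, not values taken from the slides:

```r
# Assumption: y1 is drawn with base R's rexp(); n and the seed are illustrative.
set.seed(1)
n <- 200
lambda <- 2                    # rate parameter, so E[Y] = 1/lambda
y1 <- rexp(n, rate = lambda)   # n independent draws from Exp(lambda)
mean(y1)                       # should be close to 1/lambda = 0.5
var(y1)                        # should be close to 1/lambda^2 = 0.25
```

Note that rexp() takes the rate λ, not the mean 1/λ.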

Exp(λ) (cont.)
Distribution of values can then be plotted in R:
    par(mfrow=c(1,2))
    hist(y1, col="cyan", main="Histogram of Y1 ~ Exp(2)")
    boxplot(y1, horizontal=TRUE, col="cyan", main="Boxplot of Y1")

Exp(λ) (cont.)
And summary statistics can be computed:
    descriptives <- list(summary(y1), var(y1))

Gamma(α, β)
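A minimal sketch for the Gamma case, assuming that β is a rate parameter (the slides do not say whether β is a rate or a scale) and that the sample is called y2 by analogy with y1, y3 and y4. The parameter values are illustrative:

```r
# Assumptions: beta is a rate; the name y2 and the values of n, alpha, beta
# are illustrative and not taken from the slides.
set.seed(1)
n <- 200
alpha <- 2     # shape
beta <- 0.5    # rate, so E[Y] = alpha/beta and Var[Y] = alpha/beta^2
y2 <- rgamma(n, shape = alpha, rate = beta)
descriptives <- list(summary(y2), var(y2))
descriptives
```

If β were instead a scale parameter, the call would be rgamma(n, shape = alpha, scale = beta).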

Gamma(α, β) (cont.)
To obtain:
    > descriptives
    [[1]]
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    [[2]]
    [1]

Log-normal: LN(μ, σ²)

Log-normal: LN(μ, σ²) (cont.)
    simulate.ln.f <- function(n, mu, sigma2){
      # Exponentiate Normal(mu, sigma2) draws to get log-normal values
      y3 = exp(rnorm(n, mean=mu, sd=sqrt(sigma2)))
      # Histogram and boxplot side by side
      par(mfrow=c(1,2))
      hist(y3, col="cyan", main=paste("Histogram of Y3 ~ LN(", mu, ",", sigma2, ")"))
      boxplot(y3, horizontal=TRUE, col="cyan", main="Boxplot of Y3")
      # Summary statistics and sample variance
      descriptives <- list(summary(y3), var(y3))
      return(descriptives)
    }

Log-normal: LN(μ, σ²) (cont.)
    > simulate.ln.f(n=200, mu=0, sigma2=0.1)
    [[1]]
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    [[2]]
    [1]

Weibull(ν, λ)

Weibull(ν, λ) (cont.)
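The function simulate.weib.f on the next slide calls a helper weib.r whose definition is not shown in this transcript. A minimal sketch by inverse-transform sampling, assuming the parameterisation F(q) = 1 - exp(-λ q^ν) that weib.cdf uses in the goodness-of-fit part; the body of weib.r here is an assumption, not the author's code:

```r
# Assumption: weib.r inverts F(q) = 1 - exp(-lambda * q^nu).
weib.r <- function(n, nu, lambda) {
  u <- runif(n)                      # U ~ Uniform(0, 1)
  (-log(1 - u) / lambda)^(1 / nu)    # solve u = F(q) for q
}

set.seed(1)
y4 <- weib.r(200, nu = 2, lambda = 0.5)
summary(y4)
```

Under this parameterisation the same distribution is available in base R as rweibull(n, shape = nu, scale = lambda^(-1/nu)).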

Weibull(ν, λ) (cont.)
And put all of this in a function:
    simulate.weib.f <- function(n, nu, lambda){
      # Simulate n Weibull(nu, lambda) values
      y4 = weib.r(n, nu, lambda)
      # Histogram and boxplot side by side
      par(mfrow=c(1,2))
      hist(y4, col="cyan", main=paste("Histogram of Y4 ~ Weib(", nu, ",", lambda, ")"))
      boxplot(y4, horizontal=TRUE, col="cyan", main="Boxplot of Y4")
      # Summary statistics and sample variance
      descriptives <- list(summary(y4), var(y4))
      # Return the simulated data together with the descriptives
      return(list(y4, descriptives))
    }

Weibull(ν, λ) (cont.)
    > simulate.weib.f(n=200, nu=2, lambda=0.5)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    [1]

Goodness of fit

Empirical v theoretical CDF plot
- Consider the Weibull(2, 0.5) example from before.
- If the data are truly from this distribution, then their empirical CDF should be close to the theoretical CDF of the Weibull(2, 0.5).
- Plot these two in R and compare visually.

Empirical v theoretical CDF plot (cont.)
We will need the cdf of the Weibull distribution:
    weib.cdf <- function(q, nu, lambda){
      cdf = 1 - exp(-lambda*q^nu)
      return(cdf)
    }
Then generate some data:
    weib.data = simulate.weib.f(n=200, nu=2, lambda=0.5)[[1]]

Empirical v theoretical CDF plot (cont.)
Then produce the plot:
    nu = 2; lambda = 0.5   # parameters of the theoretical Weibull cdf
    grid.x = seq(min(weib.data), max(weib.data), length.out=100)
    plot(grid.x, weib.cdf(grid.x, nu, lambda), type="l", col="red", ylim=c(0,1))
    s = c(1:length(weib.data))
    lines(sort(weib.data), s/length(weib.data), type="s")
    legend("bottomright", legend=c("cdf","ecdf"), col=c("red","black"), lty=c(1,1))
    title(main="Empirical v theoretical CDF")
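Base R also provides ecdf(), which returns the empirical CDF as a step function. A hedged alternative sketch of the same visual comparison; so that the block runs on its own it draws stand-in data with rweibull(), using shape = ν and scale = λ^(-1/ν), a mapping that follows from F(q) = 1 - exp(-λ q^ν). The names w, Fn and max_gap are illustrative:

```r
# Assumption: stand-in data from rweibull(); names w, Fn, max_gap are illustrative.
set.seed(1)
nu <- 2
lambda <- 0.5
w <- rweibull(200, shape = nu, scale = lambda^(-1 / nu))

Fn <- ecdf(w)    # empirical CDF, returned as a function
plot(Fn, do.points = FALSE, main = "Empirical v theoretical CDF")
curve(1 - exp(-lambda * x^nu), add = TRUE, col = "red")   # theoretical cdf
legend("bottomright", legend = c("cdf", "ecdf"),
       col = c("red", "black"), lty = c(1, 1))

# Largest gap between ecdf and cdf, evaluated at the data points only
max_gap <- max(abs(Fn(w) - (1 - exp(-lambda * w^nu))))
max_gap
```

A small max_gap is what the visual check is looking for; the Kolmogorov-Smirnov test formalises this idea.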

Kolmogorov-Smirnov g-o-f test
We can quantify the significance of the difference between the cdf and the ecdf using the KS test.
- H0: the data follow a specified (continuous) distribution, v. H1: they don't follow the specified distribution.
- Use test statistic: D_n = sup_x |F_n(x) - F_0(x)|, where F_n is the ecdf of the data and F_0 is the cdf specified under H0.
- Reject H0 at significance level α if D_n > critical value associated with the sampling distribution of D_n (obtained from tables), or use the p-value provided in R.
More details in: Daniel, W.W. (1990) Applied nonparametric statistics, 2nd ed., PWS-Kent.
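The KS statistic is the largest vertical distance between the ecdf and the hypothesised cdf F_0. For sorted data x_(1) <= ... <= x_(n) it can be computed as the maximum over i of max(i/n - F_0(x_(i)), F_0(x_(i)) - (i-1)/n). A minimal sketch on illustrative Exp(2) data, checked against base R's ks.test(); the sample and the names D_manual and D_ks are assumptions made for the example:

```r
# Assumption: illustrative Exp(2) sample; D_manual and D_ks are example names.
set.seed(1)
n <- 200
x <- sort(rexp(n, rate = 2))     # sorted sample
F0 <- pexp(x, rate = 2)          # hypothesised cdf at the sorted data
i <- seq_len(n)

# The ecdf jumps from (i-1)/n to i/n at x_(i), so check both sides of each jump
D_manual <- max(pmax(i / n - F0, F0 - (i - 1) / n))

# Same statistic as reported by ks.test()
D_ks <- unname(ks.test(x, "pexp", rate = 2)$statistic)
all.equal(D_manual, D_ks)
```

Checking both sides of each jump matters because the supremum can be approached just before a data point as well as at it.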

Kolmogorov-Smirnov g-o-f test (cont.)
Put KS test and cdf/ecdf plot in a single R function:
    ks.weib.f <- function(data, nu, lambda){
      # Perform test
      ks <- ks.test(data, weib.cdf, nu=nu, lambda=lambda)
      # Plot ecdf and cdf
      grid.x = seq(min(data), max(data), length.out=100)
      par(mfrow=c(1,1))
      plot(grid.x, weib.cdf(grid.x, nu, lambda), type="l", col="red", ylim=c(0,1))
      s = c(1:length(data))
      lines(sort(data), s/length(data), type="s")
      title(main="Empirical v theoretical CDF")
      legend("bottomright", legend=c("cdf","ecdf"), col=c("red","black"), lty=c(1,1))
      # Return the test result
      return(ks)
    }

Kolmogorov-Smirnov g-o-f test (cont.)
Run it for some data:
    > weib.data = simulate.weib.f(n=200, nu=2, lambda=0.5)[[1]]
    > ks.weib.f(weib.data, nu=2, lambda=0.5)
    One-sample Kolmogorov-Smirnov test
    D = , p-value =

Kolmogorov-Smirnov g-o-f test (cont.)
Run it against a different hypothesised distribution, Weibull(2, 0.4):
    > weib.data = simulate.weib.f(n=200, nu=2, lambda=0.5)[[1]]
    > ks.weib.f(weib.data, nu=2, lambda=0.4)
    One-sample Kolmogorov-Smirnov test
    D = , p-value =