STT520-420: BIOSTATISTICS ANALYSIS. Dr. Cuixian Chen. Chapter 4: Data Plot



Review: STT315 for Population Quantile
- Eg1: For Y ~ Binom(2, 0.5), find a) the 25th percentile of Y; b) the 30th percentile; c) the 60th percentile; d) the 90th percentile.
- Eg2: Find the 25th percentile of [distribution not preserved in this transcript].
- Discussion: In the definition of a percentile, why do we require $P(Y \le \varphi_p) \ge p$?

Sample quantile function
- Suppose we have obtained data on the survival r.v. Y. How can we tell whether the data follow a certain probability model? See Eg. 2.8, page 32.
- In general, a QQ-plot and a hazard plot will help.
- Def 4.1: Let $S_n(y)$ be the empirical survival function. For $0 \le u \le 1$, the sample quantile function $Q_n(u)$ is defined as $Q_n(u) = \inf\{y : S_n(y) \le 1 - u\}$.
- Eg: For the ranked data 3, 5, 7, 8:
  a) find $S_n(y)$ and plot its graph;
  b) find $Q_n(0.2)$, $Q_n(0.6)$, and $Q_n(0.9)$ from the definition.
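The exercise above can be checked numerically. The following is a minimal sketch of Def 4.1 for the ranked data 3, 5, 7, 8; the helper names `Sn` and `Qn` are illustrative and not from the slides, and the search for the infimum is restricted to the observed values, which is valid for 0 < u ≤ 1 because $S_n$ only changes at data points.

```r
# Ranked data from the slide
y <- c(3, 5, 7, 8)

# Empirical survival function: proportion of observations strictly greater than t
Sn <- function(t) mean(y > t)

# Sample quantile function Qn(u) = inf{ y : Sn(y) <= 1 - u }, for 0 < u <= 1.
# Sn is a right-continuous step function that jumps only at the data,
# so the infimum is attained at one of the observed values.
Qn <- function(u) {
  cand <- sort(unique(y))
  s <- sapply(cand, Sn)
  min(cand[s <= 1 - u + 1e-12])   # small tolerance guards against rounding
}

sapply(c(3, 5, 7, 8), Sn)   # 0.75 0.50 0.25 0.00
Qn(0.2)                     # 3
Qn(0.6)                     # 7
Qn(0.9)                     # 8
```

For example, $Q_n(0.6)$ needs $S_n(y) \le 0.4$; $S_n(5) = 0.5$ is still too large, while $S_n(7) = 0.25$ qualifies, so $Q_n(0.6) = 7$.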

Review: Empirical survival function with confidence bands for Example 1.3
- Drawback: this graph does NOT show the right-continuity property of the empirical survival function.
- Question: How can we improve it?

Review: Plotting the empirical survivor function (section 4.1, pages 55-56)
- Assume we have sorted data $y_{(1)} \le y_{(2)} \le \dots \le y_{(n)}$, starting at $S_n(0) = 1$.
- However, this graph is difficult to draw using software in a way that shows both right-continuity and piecewise linearity.
- Convention: plot points located in the middle of the "jumps".

Plotting Positions
- Convention: Plot points located in the middle of the "jumps" on the vertical axis and at the observed data points on the horizontal axis.

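Sample quantile function
- We call $p_i = (i - 0.5)/n$ the plotting position of the $i$-th ranked observation $y_{(i)}$.
- The empirical survivor plot for ranked data is then the plot of the points $(y_{(i)},\, 1 - p_i)$, $i = 1, \dots, n$.
- That means $y_{(i)}$ is the $100\,p_i$-th sample percentile.
- Or, equivalently, $y_{(i)}$ is the $p_i$-th sample quantile.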

[Figure: Example 4.1, page 58: empirical survival plot using plotting positions]

    ## Example 4.1, page 58: review of the empirical survivor function introduced previously
    par(mfrow = c(1, 2))
    x <- c(33,8,5,12,41,8,11,39,16,14,89,81,28,88,1,29,3,46,1,54,194,24,2,39,22,36,89,106,52)
    x <- sort(x)
    a <- rle(x)
    values <- a$values; values    # distinct values among the observations
    reps <- a$lengths; reps       # number of replications of each distinct value
    (f <- table(x))
    # We need the fractions to plot the curve, so get the sample size n first
    (n <- length(x))
    # We want S(0) = 1
    surv1 <- 1 - cumsum(f) / n
    surv2 <- c(1, surv1); surv2
    # Now plot this curve; type = "s" gives a step function
    t <- c(0, values)             # t is the vector of x's and surv2 is the vector of y's
    plot(t, surv2, type = "s", xlab = "Failure age", ylab = "Relative Frequencies",
         col = "orange", pch = 15, lwd = 3)   # or use type = "p" for points
    title(main = "Empirical Survivor function")
    ###########################################################
    ## Empirical survivor function using plotting positions (sample quantiles)
    x <- c(33,8,5,12,41,8,11,39,16,14,89,81,28,88,1,29,3,46,1,54,194,24,2,39,22,36,89,106,52)
    x <- sort(x); n <- length(x)
    p <- 1 - ppoints(n, a = 0.5)  # or equivalently: p <- 1 - (1:n - .5) / n
    plot(x, p, type = "n", xlab = "Failure age", ylab = "quantile")
    points(x, p, col = "blue", pch = 15)
    title(main = "Empirical Survivor plot using plotting points")

Probability Plots (QQ-plot)
- If a probability model fits the data well, then the p-th sample quantile and the p-th population quantile should be in close agreement.
- Def 4.4: A probability plot, quantile-quantile plot, or QQ-plot, is a plot of the points $(\varphi_{p_i},\, y_{(i)})$, $i = 1, \dots, n$, in the usual rectangular coordinate system. Recall that $p_i = (i - 0.5)/n$ and that $\varphi_p$ denotes the p-th population quantile of the specified distribution.
- Note: In general, the ranked data are plotted on the vertical axis, and the theoretical (population) percentiles from the specified distribution on the horizontal axis.

Probability Plots (QQ-plot) to check Normality
For data: 3, 5, 7, 8.
1. Find the sample quantile for each observation. For example, $y_{(1)} = 3$ is the $(1 - 0.5)/4 = 0.125$-th sample quantile.
2. Now assume the data come from a Normal distribution and find the population quantile from R or your calculator. For example, in R: qnorm(0.125) = ?; on a calculator: invNorm(0.125) = ?
3. Plot the QQ plot manually.

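Manual QQ plot for Normality
    par(mfrow = c(1, 2))
    pop.quant = c(-1.15, -0.319, 0.319, 1.15)   # qnorm((1:4 - 0.5)/4), rounded
    sample.quant = c(3, 5, 7, 8)
    qqplot(pop.quant, sample.quant)   # manual QQ plot
    qqnorm(sample.quant)              # built-in normal QQ plot, for comparison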
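The hard-coded population quantiles in the manual plot can be reproduced rather than typed in. This is a small sketch using only base R (`ppoints`, `qnorm`); the variable names are illustrative.

```r
# Ranked data from the slide
sample.quant <- c(3, 5, 7, 8)
n <- length(sample.quant)

# Plotting positions p_i = (i - 0.5)/n; ppoints(n, a = 0.5) gives the same values
p <- (1:n - 0.5) / n
all.equal(p, ppoints(n, a = 0.5))   # TRUE

# Standard normal population quantiles at the plotting positions
pop.quant <- qnorm(p)
round(pop.quant, 3)   # approximately -1.150, -0.319, 0.319, 1.150

# Same manual QQ plot: theoretical quantiles on the horizontal axis,
# ranked data on the vertical axis
plot(pop.quant, sort(sample.quant),
     xlab = "Normal quantiles", ylab = "Ranked data")
```

Because the standard normal is symmetric about 0 and the plotting positions are symmetric about 0.5, the quantiles come in plus/minus pairs.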

QQ-Norm for Normal and t distribution
    ###########################################################
    # The 100-quantiles are called percentiles. The 4-quantiles are called quartiles.
    # We have learned Q1, the median, and Q3, i.e., the 25th, 50th, and 75th percentiles.
    ###########################################################
    ## QQ-Normal for Normal distribution ##
    set.seed(100)
    par(mfrow = c(1, 2))   # 1 x 2 pictures on one plot
    x = rnorm(1000, mean = 0, sd = 4)
    qqnorm(x); qqline(x, col = 4)
    ## QQ-Norm for t-distribution ##
    y <- rt(1000, df = 4)
    qqnorm(y); qqline(y, col = 2)
    dev.off()   ## clear the setting of par
    # Another example
    y <- rt(1000, df = 4)
    qqplot(y, rt(300, df = 5)); qqline(y)


Probability Plots
- A probability plot provides evidence as to whether (continuous) data have come from a population having a stated probability model S.
- In survival analysis, the probability models most often checked are the Exponential and the Weibull.
- Survival times are often transformed by logarithms to create a natural measurement scale.
- Recall Example 2.1 and Example 2.2, where we used a stem-and-leaf plot and calculated the empirical hazard function to judge whether an Exponential or a Weibull model is plausible.

Probability Plots (QQ-plot)
- If the points of the QQ-plot fall close to a 45-degree line through the origin (0, 0) of the plot, then the probability model fits the data well.
- Points lying above/below the 45-degree line represent data points that are larger/smaller than the probability model predicts they should be.
- If the probability plot is S-shaped, then only the tails are shorter or longer than they should be.
- A QQ-plot of Weibull data is most suitably presented on the logarithmic scale (for each axis). The interpretation of a straight-line fit is then relative to the scale involved.

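Review: Constant Hazard model
- Consider a simple hazard function, the constant hazard $h(y) = \lambda$ for all $y \ge 0$. Here we assume $\lambda = 1/\beta$, where $\beta > 0$.
- We have seen that $S(y) = \exp\{-\int_0^y h(t)\,dt\}$, so if we evaluate this for $h(y) = \lambda$, we get $S(y) = e^{-\lambda y} = e^{-y/\beta}$.
- Since $f(y) = -\,dS(y)/dy$, we have $f(y) = \frac{1}{\beta} e^{-y/\beta}$, the exponential probability density with parameter $\beta$. This means the expected value is $\beta$ and the variance is $\beta^2$.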

Review: Exponential distribution
- From STT315: $f(y) = \frac{1}{\beta} e^{-y/\beta}$ for $y \ge 0$, with mean $\beta$.
- In R: dexp(x, 1/β); pexp(x, 1/β); qexp(per, 1/β); rexp(N, 1/β).
    ## Note that R parameterizes the exponential distribution by its rate (= 1/beta),
    ## not by the mean beta used in the STT315 class.
    set.seed(100)
    y = rexp(10000, 0.1)
    mean(y)   ## close to beta = 1/0.1 = 10, not 0.1!
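To make the rate-versus-mean distinction concrete, here is a short sketch comparing R's `qexp` and `pexp` with the closed forms implied by $S(y) = e^{-y/\beta}$. The value `beta <- 10` and the variable names are chosen only for illustration.

```r
beta <- 10                  # mean of the exponential (illustrative value)
p <- c(0.125, 0.5, 0.9)

# R takes the rate 1/beta as its second argument
q_r <- qexp(p, rate = 1 / beta)

# Closed form: solving 1 - exp(-q/beta) = p gives q = -beta * log(1 - p)
q_formula <- -beta * log(1 - p)
all.equal(q_r, q_formula)   # TRUE

# P(Y <= beta) = 1 - exp(-1) regardless of beta
pexp(beta, rate = 1 / beta)
1 - exp(-1)
```

Passing `beta` itself as the second argument (for example `qexp(p, 10)`) would instead describe an exponential with mean 0.1.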

Probability Plots (QQ-plot) to check Exponential dist
For data: 3, 5, 7, 8.
1. Find the sample quantile for each observation. For example, $y_{(1)} = 3$ is the $(1 - 0.5)/4 = 0.125$-th sample quantile.
2. Now assume the data come from an Exponential distribution and find the population quantile from R. For example, in R: qexp(0.125, 1/mean(y)) = ?
3. Plot the QQ plot manually.
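The three steps above can be sketched in a few lines of R. The mean $\beta$ is estimated by the sample mean, as the `qexp(0.125, 1/mean(y))` call on the slide suggests; the variable names are illustrative.

```r
# Ranked data from the slide
y <- c(3, 5, 7, 8)
n <- length(y)

# Step 1: plotting positions p_i = (i - 0.5)/n
p <- (1:n - 0.5) / n

# Step 2: exponential population quantiles, with beta estimated by mean(y)
pop.quant <- qexp(p, rate = 1 / mean(y))
pop.quant

# Step 3: QQ plot with the 45-degree reference line through the origin
plot(pop.quant, sort(y),
     xlab = "Exponential quantiles", ylab = "Ranked data")
abline(0, 1)
```

With only four observations the plot is a mechanical illustration rather than a serious check of fit.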

Exponential Prob Plot for Constant Hazard model
- How do we decide whether a set of survival data follows the exponential distribution, i.e., that the hazard is constant?
- Look over Example 2.1: 200 randomly generated exponential variables with mean = 100. Characteristic skewed distribution, sample mean = 107.5, sample s.d. = 106.1. (Recall that if Y ~ exp($\beta$) then E(Y) = SD(Y) = $\beta$.) The sample stemplot and the sample mean and s.d. approximate the true shape, center, and spread of the exponential.
- The estimated hazards (rightmost column) are approximately 0.01 (= 1/100), i.e., constant; see the formula on p. 22 for getting these values.
- Another way to check the distribution is to compare the quantiles of the exponential distribution with the sample quantiles in a plot known as a qqplot. See R-Quantile plot for a way to compute the quantiles and do the plot.
- Recall that the p-th quantile of the distribution of a r.v. Y is the value Q such that $P(Y \le Q) = p$. So we must compute the quantiles of the theoretical distribution and compare them (smallest to smallest, next smallest to next smallest, etc.) to the sample quantiles.

Constant Hazard model: Example 2.1, page 21
[Stem-and-leaf plot of the 200 simulated values, stems 0 to 44; the decimal point is 1 digit to the right of the |. Most leaves were lost in this transcript.]
    set.seed(100)
    y = rexp(10000, 0.1)
    mean(y)   ## close to beta = 1/0.1 = 10, not 0.1!
    ############################
    # Random generation for the exponential distribution with rate `rate`
    # (i.e., mean = 1/rate): rexp(n, rate)
    # From Example 2.1, we first generate 200 exponential variables with mean = 100.
    set.seed(100)
    x <- rexp(n = 200, rate = 1/100)
    mean(x)   # mean
    sd(x)     # standard deviation
    # Do a stemplot of the 200 randomly generated exponential variables with beta = 100
    stem(x)
    # Exponential QQ plot with the 45-degree reference line
    plot(qexp(ppoints(200, a = .5), rate = 1/mean(x)), sort(x))
    abline(0, 1)

[Figure: QQ-plot for Example 2.1, page 21]

Check Example 1.3, page 6, to see whether the placebo data come from an Exponential distribution
    placebo <- c(1,1,2,2,3,4,4,5,5,8,8,8,8,11,11,12,12,15,17,22,23)
- Step 1: Fit the placebo data to an exponential with mean = mean(placebo). Note that the theoretical distribution goes on the horizontal axis and the sorted data go on the vertical axis.
- Step 2: Draw in the 45-degree line to give a visual reference for comparing the two distributions. Why this line? The function in R that draws lines on plots is abline(intercept, slope), so we want abline(0, 1).
- Step 3: What do you think? Does the placebo dataset follow an exponential distribution? Try other qqplots by randomly generating data from various distributions and then checking whether they follow various theoretical distributions.

Check Example 1.3, page 6, to see whether the placebo data come from an Exponential distribution
    placebo <- c(1,1,2,2,3,4,4,5,5,8,8,8,8,11,11,12,12,15,17,22,23)
    n = length(placebo); n
    # The first plot we can do is the sample quantile plot:
    plot(ppoints(n, a = .5), sort(placebo))
    # This shows the distribution of placebo; e.g., you can see the median, the quartiles, etc.
    ###########################################################
    # To see whether a set of data follows a particular distribution, plot the quantiles
    # of the data against the quantiles of that distribution.
    # Recall that there are many distributions built in to R: qexp, qnorm, etc.
    # Try help(qexp) to see the general format of these functions.
    ###########################################################
    # We'll fit our placebo distribution to an exponential with mean = mean(placebo)
    plot(qexp(ppoints(n, a = .5), rate = 1/mean(placebo)), sort(placebo))
    # Note that the theoretical distribution goes on the horizontal axis and the sorted
    # data go on the vertical axis.
    # Now draw in the 45-degree line as a visual reference to compare the two
    # distributions. Why this line?
    # The function in R that draws lines on plots is abline(intercept, slope):
    abline(0, 1)
    # What do you think? Does the placebo dataset follow an exponential distribution?
    # Try other qqplots by randomly generating data from various distributions and then
    # checking whether they follow various theoretical distributions.

[Figure: Exponential QQ-plot of the placebo data from Example 1.3, page 6]

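Review: Power Hazard model
- Power hazard: $h(y) = \frac{\alpha}{\beta^{\alpha}}\, y^{\alpha - 1}$ for $y \ge 0$, with $\alpha > 0$ and $\beta > 0$.
- Note this is of the form (constant) $\cdot\, y^{\text{constant}}$, and if $\alpha = 1$ this reduces to the constant hazard we just considered.
- Note that $H(y) = \int_0^y h(t)\,dt = (y/\beta)^{\alpha}$, and so $S(y) = \exp\{-(y/\beta)^{\alpha}\}$.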

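Weibull Prob Plots
- Recall the power hazard model: $S(y) = \exp\{-(y/\beta)^{\alpha}\}$.
- Substitute $y = y_{(i)}$ and $S(y_{(i)}) \approx 1 - p_i$, then take the (natural) logarithm: $-\log(1 - p_i) = (y_{(i)}/\beta)^{\alpha}$.
- Take the logarithm again, now of base 10, to create the log-life variable: $\log_{10}[-\log(1 - p_i)] = \alpha \log_{10} y_{(i)} - \alpha \log_{10} \beta$.
- We now have: $\log_{10} y_{(i)} = \log_{10} \beta + \frac{1}{\alpha} \log_{10}[-\log(1 - p_i)]$.
- Note: base-10 logs are traditionally used for lifetimes, but the natural log is equivalent up to a difference of scale.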

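Weibull Prob Plots
- If the data fit the Weibull model: $\log_{10} y_{(i)} \approx \log_{10} \beta + \frac{1}{\alpha} \log_{10}[-\log(1 - p_i)]$.
- The Weibull probability plot is the plot of the points $(\log_{10}[-\log(1 - p_i)],\ \log_{10} y_{(i)})$, $i = 1, \dots, n$.
- It follows a straight line with slope $1/\alpha$ and intercept $\log_{10} \beta$.
- Recall: A probability plot, quantile-quantile plot, or QQ-plot, is a plot of the points $(\varphi_{p_i},\, y_{(i)})$, with $p_i = (i - 0.5)/n$.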

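Summary
- To check graphically whether a distribution is Weibull, do essentially a qqplot on the log-log scale (see section 4.4). The key formulas are:
- (4.2) $S(y) = \exp\{-(y/\beta)^{\alpha}\}$. Now substitute the ordered data and take natural logs:
- (4.2a) $-\log(1 - p_i) = (y_{(i)}/\beta)^{\alpha}$. Take logs again (base 10):
- (4.2b) $\log_{10}[-\log(1 - p_i)] = \alpha \log_{10} y_{(i)} - \alpha \log_{10} \beta$. Write as a linear equation:
- (4.3) $\log_{10} y_{(i)} = \log_{10} \beta + \frac{1}{\alpha} \log_{10}[-\log(1 - p_i)]$.
- (4.3a) [formula not preserved in this transcript]
- So plot the points in (4.4) and look for a straight line; the slope will equal $1/\alpha$ and the intercept will equal $\log_{10}(\beta)$.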
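One way to build intuition for the linearization is to apply it to simulated Weibull data with known parameters and see whether the fitted line recovers them. The sketch below is illustrative only: the seed, sample size, and the values `alpha <- 2` and `beta <- 50` are assumptions made for the example, not values from the course.

```r
set.seed(1)
alpha <- 2     # shape (illustrative)
beta  <- 50    # scale (illustrative)
life  <- rweibull(500, shape = alpha, scale = beta)

n <- length(life)
p <- (1:n - 0.5) / n             # plotting positions
x <- log10(-log(1 - p))          # horizontal axis of the Weibull probability plot
y <- log10(sort(life))           # log-life of the ranked data

fit <- lm(y ~ x)                 # least-squares line through the plotted points
alpha.hat <- 1 / coef(fit)[["x"]]             # slope = 1/alpha
beta.hat  <- 10^coef(fit)[["(Intercept)"]]    # intercept = log10(beta)
c(alpha.hat = alpha.hat, beta.hat = beta.hat)

plot(x, y, xlab = "log10(-log(1 - p))", ylab = "log10(lifetime)")
abline(fit, col = "blue")
```

If the points bend systematically away from the line, the Weibull model is doubtful; an estimated slope near 1 points toward the exponential special case.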

Example 4.2 on page 62
- Ranked data: 10, 12, 15, 17, 18, 18, 20, 20, 21, 21, 23, 25, 27, 29, 29, 30, 35. [Note: n = 17]
- Step 1: Take $y = \log_{10}$ of the ranked data.
- Step 2: Find the plotting positions $p_i = (i - 0.5)/n$.
- Step 3: Find $x = \log_{10}[-\log(1 - p_i)]$.
- Step 4: Plot the Weibull probability plot of y against x.

Example 4.2, page 62
    # Now let's do the Weibull probability plot as described in Chapter 4.
    # Formula (4.4) shows that the slope is 1/shape and the intercept is log10(scale)
    # when you do a Weibull probability plot. Let's try it with the data in Ex. 4.2.
    #############
    lifetime = c(10, 12, 15, 17, 18, 18, 20, 20, 21, 21, 23, 25, 27, 29, 29, 30, 35)
    # Now take logs base 10 and compute the values needed in formula (4.4)
    y = log10(lifetime)
    n = length(lifetime)
    k = (1:n - .5)/n    ## or k = ppoints(n, a = 0.5)
    x = log10(-log(1 - k))
    # Now plot the x's and y's and put in a vertical line at x = 0 to see the intercept
    plot(x, y)
    abline(v = 0, lty = 2)
    # Similarly, find the slope graphically, or use regression
    lm(y ~ x)
    # From the output we see intercept = 1.38 and slope = .2660
    abline(lm(y ~ x), col = "blue")   ## regression line
    # The intercept of the regression line is 1.38,
    # so log10(beta) = 1.38, which implies beta = 10^1.38, about 24.0,
    # and slope = .2660, which implies alpha = 1/.2660, about 3.76

More about Weibull Prob Plots
- The Weibull probability plot follows a straight line with slope $1/\alpha$ and intercept $\log_{10} \beta$.
- Standard least-squares regression can provide estimates of the slope and intercept.
- If the slope $1/\alpha$ is estimated to be close to 1, then it is reasonable to assume the data come from an Exponential model.

Cumulative Hazard plot
- A hazard plot is a plot of the ranked data against the (cumulative) hazard plot scores [defining formulas not preserved in this transcript].
- For the Exponential model, $h(y) = 1/\beta$, so $H(y) = \int_0^y h(t)\,dt = y/\beta$.
- If the data come from an exponential distribution, then the hazard plot will be a straight line. It is a simpler diagnostic test than the probability plot.
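A rough sketch of a cumulative hazard plot for simulated exponential data follows. Since the slide's definition of the hazard plot scores did not survive, the scores used here, $-\log(1 - p_i)$ with $p_i = (i - 0.5)/n$, are an assumption based on $H(y) = -\log S(y)$ and may differ from the textbook's scores. The seed and $\beta = 100$ mirror the simulation of Example 2.1; the variable names are illustrative.

```r
set.seed(100)
beta <- 100
x <- sort(rexp(200, rate = 1 / beta))   # ranked simulated lifetimes
n <- length(x)

# ASSUMED hazard plot scores: H = -log(S), with S estimated by 1 - p_i
p <- (1:n - 0.5) / n
H.score <- -log(1 - p)

# Under the exponential model H(y) = y / beta, so the points should lie
# near a straight line through the origin with slope 1 / beta
fit <- lm(H.score ~ 0 + x)
slope.hat <- coef(fit)[["x"]]
1 / slope.hat                           # rough estimate of beta

plot(x, H.score, xlab = "Ranked data", ylab = "Cumulative hazard score")
abline(0, slope.hat, col = "blue")
```

Curvature in this plot (for example, scores rising faster than linearly in y) would suggest an increasing hazard and hence a non-exponential model.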