Published by Jocelin Haynes. Modified over 9 years ago.
1
STT520-420: BIOSTATISTICS ANALYSIS
Dr. Cuixian Chen
Chapter 4: Data Plot
2
Review: STT315 for Population Quantile
Eg1: For Y ~ Binom(2, 0.5), find a) the 25th percentile of Y; b) the 30th percentile; c) the 60th percentile; d) the 90th percentile.
Eg2: Find the 25th percentile of [distribution shown on the original slide].
Discussion: In the definition of the percentile φ_p, why do we require P(Y ≤ φ_p) ≥ p?
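Eg1 can be checked numerically. The sketch below assumes the percentile is taken as the smallest y with P(Y ≤ y) ≥ p, which is the rule R's qbinom() follows; the course definition may treat the case P(Y ≤ y) = p differently, so compare with the hand calculation.

```r
# Percentiles of Y ~ Binom(2, 0.5), using the rule
# "smallest y such that P(Y <= y) >= p" (the rule qbinom implements).
p <- c(0.25, 0.30, 0.60, 0.90)

cdf <- pbinom(0:2, size = 2, prob = 0.5)   # P(Y <= 0), P(Y <= 1), P(Y <= 2)
print(cdf)

percentiles <- qbinom(p, size = 2, prob = 0.5)
print(data.frame(p = p, percentile = percentiles))
```

Printing the cdf next to the answers shows why the inequality in the definition matters: for a discrete r.v. there is usually no y with P(Y ≤ y) exactly equal to p.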
3
Sample quantile function
Suppose we have obtained data on the survival r.v. Y. How can we tell whether the data follow a certain probability model? See Eg 2.8, page 32. In general, a QQ-plot and a hazard plot will help.
Def 4.1: Let Sn(y) be the empirical survival function. For 0 ≤ u ≤ 1, the sample quantile function Qn(u) is defined as Qn(u) = inf{y: Sn(y) ≤ 1-u}.
Eg: For the ranked data 3, 5, 7, 8: a) find Sn(y) and plot its graph; b) find Qn(0.2), Qn(0.6), and Qn(0.9) from the definition.
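A minimal sketch of Def 4.1 for the data 3, 5, 7, 8, assuming no censoring so that Sn(t) is simply the fraction of observations greater than t. The function names Sn and Qn mirror the slide's notation; the implementation is only one illustrative way to evaluate the infimum.

```r
y <- c(3, 5, 7, 8)

# Empirical survival function: fraction of observations strictly greater than t.
Sn <- function(t) sapply(t, function(ti) mean(y > ti))

# Sample quantile function for 0 < u <= 1. Sn only changes at data values,
# so the infimum in Def 4.1 is attained at one of the sorted observations.
Qn <- function(u) {
  ys <- sort(y)
  sapply(u, function(ui) ys[which(Sn(ys) <= 1 - ui)[1]])
}

print(Sn(c(0, 3, 5, 7, 8)))   # survival function at 0 and at each data value
print(Qn(c(0.2, 0.6, 0.9)))   # sample quantiles asked for in the example
```

Searching only the data values keeps the code short; evaluating Sn on a fine grid would give the same answers up to the grid resolution.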
4
Review: Empirical survival function with confidence bands for Example 1.3
Drawback: this graph does NOT show the right-continuity property of the empirical survival function!
Question: How can we improve it?
5
Review: Plotting the empirical survivor function (Section 4.1, pages 55-56)
Assume we have sorted data y(1) ≤ y(2) ≤ ... ≤ y(n). Starting at Sn(0) = 1, the function steps down by 1/n at each observation.
However, this graph is difficult to draw using software in a way that shows both right-continuity and piecewise linearity.
Convention: Plot points located in the middle of the “jumps”.
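For comparison with the plotting-position convention, right-continuity can also be marked explicitly with closed and open circles. The sketch below uses base R graphics on the Example 1.3 placebo data and assumes all times are uncensored; the variable names are illustrative.

```r
placebo <- c(1,1,2,2,3,4,4,5,5,8,8,8,8,11,11,12,12,15,17,22,23)  # Example 1.3 data

Fn <- ecdf(placebo)
t_jump <- knots(Fn)                   # distinct observed times
s_at <- 1 - Fn(t_jump)                # Sn at each jump (the right-continuous value)
s_before <- c(1, head(s_at, -1))      # limit of Sn just before each jump

left  <- c(0, t_jump)                 # left end of each flat piece
right <- c(t_jump, max(t_jump) + 2)   # right end (last piece extended for display)
level <- c(1, s_at)                   # height of each flat piece

plot(range(left, right), c(0, 1), type = "n", xlab = "Time", ylab = "Sn(y)")
segments(left, level, right, level)
points(t_jump, s_at, pch = 16)        # closed circle: value attained at the jump
points(t_jump, s_before, pch = 1)     # open circle: value excluded at the jump
```

Drawing the flat pieces with segments() instead of type = "s" leaves out the vertical connectors, so the closed and open circles carry the continuity information.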
6
Plotting Positions
Convention: Plot points located in the middle of the “jumps” on the vertical axis and at the observed data points on the horizontal axis.
7
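Sample quantile function
We call p_i = (i - 0.5)/n the plotting position of the i-th ranked observation y(i). The empirical survivor plot for ranked data is then the plot of the points (y(i), 1 - p_i), i = 1, ..., n.
That means y(i) is the 100·p_i-th sample percentile, or equivalently the p_i-th sample quantile.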
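As a small illustration, the plotting positions for the four observations 3, 5, 7, 8 can be computed with ppoints(), which returns (i - a)/(n + 1 - 2a) and therefore (i - 0.5)/n when a = 0.5. The variable names are illustrative.

```r
y <- sort(c(3, 5, 7, 8))
n <- length(y)

p <- ppoints(n, a = 0.5)   # plotting positions (i - 0.5)/n
surv_pos <- 1 - p          # heights used in the empirical survivor plot

print(cbind(y, p, surv_pos))
```

Each height 1 - p_i sits halfway between the survivor function values just before and just after the jump at y(i), which is the "middle of the jump" convention.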
8
Example 4.1, page 58: empirical survival plot using plotting positions
9
## Example 4.1, page 58: review of the empirical survivor function introduced previously
par(mfrow = c(1, 2))
x <- c(33,8,5,12,41,8,11,39,16,14,89,81,28,88,1,29,3,46,1,54,194,24,2,39,22,36,89,106,52)
x <- sort(x)
a <- rle(x)
values <- a$values; values    # distinct values among the observations
reps <- a$lengths; reps       # number of replications of each distinct value
(f <- table(x))
# We need the fractions to plot the curve, so get the sample size n first
(n <- length(x))
# We want S(0) = 1
surv1 <- 1 - cumsum(f)/n
surv2 <- c(1, surv1); surv2
# Now plot this curve; type = "s" gives a step function (or use type = "p" for points)
times <- c(0, values); times  # times is the vector of x's and surv2 is the vector of y's
plot(times, surv2, type = "s", xlab = "Failure age", ylab = "Relative Frequencies", col = "orange", lwd = 3)
title(main = "Empirical Survivor function")
###########################################################
## Empirical survivor function using plotting positions (sample quantiles)
x <- c(33,8,5,12,41,8,11,39,16,14,89,81,28,88,1,29,3,46,1,54,194,24,2,39,22,36,89,106,52)
x <- sort(x); n <- length(x)
p <- 1 - ppoints(n, a = 0.5)  # or equivalently: p <- 1 - (1:n - .5)/n
plot(x, p, type = "n", xlab = "Failure age", ylab = "quantile")
points(x, p, col = "blue", pch = 15)
title(main = "Empirical Survivor plot using plotting positions")
10
Probability Plots (QQ-plot)
If a probability model fits the data well, then the p-th sample quantile and the p-th population quantile should be in close agreement.
Def 4.4: A probability plot, quantile-quantile plot, or QQ-plot, is a plot of the points (Q(p_i), y(i)), i = 1, ..., n, in the usual rectangular coordinate system, where Q(p) is the p-th population quantile of the specified distribution. Recall that p_i = (i - 0.5)/n is the plotting position, so y(i) is the p_i-th sample quantile.
Note: In general, the ranked data are plotted on the vertical axis and the theoretical (population) quantiles from the specified distribution on the horizontal axis.
11
Probability Plots (QQ-plot) to check Normality
For data: 3, 5, 7, 8.
1. Find the sample quantile level for each observation. For example, y(1) = 3 is the (1 - 0.5)/4 = 0.125-th sample quantile.
2. Now assume the data come from a Normal distribution and find the corresponding population quantiles from R or your calculator. For example, in R: qnorm(0.125) = ?; or on a calculator: invNorm(0.125) = ?
3. Plot the QQ plot manually.
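The three steps can be sketched in R as follows, assuming standard normal population quantiles and the (i - 0.5)/n plotting positions. The variable names are illustrative.

```r
y <- sort(c(3, 5, 7, 8))            # step 1: ranked data
p <- ppoints(length(y), a = 0.5)    #         quantile levels (i - 0.5)/n

pop_quant <- qnorm(p)               # step 2: standard normal population quantiles
print(round(pop_quant, 3))

plot(pop_quant, y,                  # step 3: theoretical quantiles on the horizontal axis
     xlab = "Normal quantiles", ylab = "Ranked data")
```

Note that qqnorm() calls ppoints(n) with its default offset, which is a = 3/8 when n ≤ 10, so its horizontal positions for this small data set differ slightly from the manual ones.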
12
Manual QQ plot for Normality
par(mfrow = c(1, 2))
pop.quant <- c(-1.15, -0.319, 0.319, 1.15)   # normal quantiles at (i - 0.5)/4
sample.quant <- c(3, 5, 7, 8)
qqplot(pop.quant, sample.quant)   # manual QQ plot
qqnorm(sample.quant)              # built-in normal QQ plot for comparison
13
QQ-Norm for Normal and t distribution
###########################################################
# The 100-quantiles are called percentiles. The 4-quantiles are called quartiles.
# We have learned Q1, the median, and Q3, that is, the 25th, 50th, and 75th percentiles.
###########################################################
## QQ-Normal for Normal distribution ##
set.seed(100)
par(mfrow = c(1, 2))   # 1 x 2 pictures on one plot
x <- rnorm(1000, mean = 0, sd = 4)
qqnorm(x); qqline(x, col = 4)
## QQ-Norm for t-distribution ##
y <- rt(1000, df = 4)
qqnorm(y); qqline(y, col = 2)
dev.off()   ## clear the setting of par
# Another example: compare two samples directly
y <- rt(1000, df = 4)
qqplot(y, rt(300, df = 5))
abline(0, 1)   # reference line y = x; qqline() assumes normal quantiles on the horizontal axis, so it does not apply here
15
Probability Plots
A probability plot provides evidence as to whether (continuous) data have come from a population having a stated probability model S. In survival analysis, the probability models most often checked are the Exponential and the Weibull. Survival times are often transformed by logarithms to create a natural measurement scale.
Recall that in Example 2.1 and Example 2.2 we used a stem-and-leaf plot and calculated the empirical hazard function to determine whether an Exponential or a Weibull is a possible model.
16
Probability Plots (QQ-plot)
If the points of the QQ-plot fall close to a 45-degree line through the origin (0, 0), then the probability model fits the data well.
Points lying above (below) the 45-degree line represent data points that are larger (smaller) than the probability model predicts they should be.
If the probability plot is S-shaped, then only the tails are shorter or longer than they should be.
A QQ-plot for Weibull data is most suitably presented on the logarithmic scale (for each axis). The interpretation of a straight-line fit is then relative to the scale involved.
17
Review: Constant Hazard model
Consider a simple hazard function, the constant hazard h(y) = λ for all y ≥ 0. Here we assume λ = 1/β, where β > 0.
We have seen that S(y) = exp(-∫_0^y h(u) du), so if we evaluate this for h(y) = λ, we get S(y) = exp(-λy) = exp(-y/β).
Since f(y) = -d(S(y))/dy, we have f(y) = (1/β) exp(-y/β), the exponential probability density with parameter β. This means the expected value is β and the variance is β².
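The constant-hazard claim can be checked numerically with R's built-in exponential functions, using h(y) = f(y)/S(y). This is a sketch only; beta = 100 and the evaluation points are arbitrary values chosen for illustration.

```r
beta <- 100                      # assumed mean, chosen only for illustration
y <- c(1, 10, 50, 100, 400)

S <- pexp(y, rate = 1/beta, lower.tail = FALSE)   # survival function S(y)
f <- dexp(y, rate = 1/beta)                       # density f(y)
hazard <- f / S                                   # h(y) = f(y) / S(y)

print(hazard)                        # the same value at every y
print(all.equal(S, exp(-y / beta)))  # S(y) = exp(-y/beta)
```

The hazard printed at every y equals 1/beta, which is the constant λ in the derivation.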
18
Review: Exponential distribution
From STT315: f(y) = (1/β) exp(-y/β) for y ≥ 0, with mean β.
In R: dexp(x, 1/β); pexp(x, 1/β); qexp(per, 1/β); rexp(N, 1/β).
## Note that R parameterizes the exponential distribution by the rate 1/β, not by the mean β used in the STT315 class.
set.seed(100)
y <- rexp(10000, 0.1)
mean(y)   ## 10.07184: this estimates beta = 1/rate = 10, not the rate 0.1!
19
Probability Plots (QQ-plot) to check Exponential dist
For data: 3, 5, 7, 8.
1. Find the sample quantile level for each observation. For example, y(1) = 3 is the (1 - 0.5)/4 = 0.125-th sample quantile.
2. Now assume the data come from an Exponential distribution and find the corresponding population quantiles from R. For example, in R: qexp(0.125, 1/mean(y)) = ?
3. Plot the QQ plot manually.
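A sketch of the same three steps for the exponential model, assuming the mean β is estimated by the sample mean as in the qexp() call on the slide. The variable names are illustrative.

```r
y <- sort(c(3, 5, 7, 8))                  # step 1: ranked data
p <- ppoints(length(y), a = 0.5)          #         quantile levels (i - 0.5)/n

pop_quant <- qexp(p, rate = 1 / mean(y))  # step 2: exponential quantiles, mean = mean(y)
print(round(pop_quant, 3))

plot(pop_quant, y,                        # step 3: manual QQ plot
     xlab = "Exponential quantiles", ylab = "Ranked data")
abline(0, 1)                              # 45-degree reference line
```

Because the exponential quantile function has the closed form Q(p) = -β·ln(1 - p), the qexp() values can also be reproduced on a calculator.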
20
Exponential Prob Plot for Constant Hazard model
How do we decide whether a set of survival data follows the exponential distribution, that is, whether the hazard is constant?
Look over Example 2.1: 200 randomly generated exponential variables with mean = 100. The distribution has the characteristic skewed shape, with sample mean = 107.5 and sample s.d. = 106.1. (Recall that if Y ~ exp(β), then E(Y) = SD(Y) = β.) The sample stemplot and the sample mean and s.d. approximate the true shape, center, and spread of the exponential. The estimated hazards (rightmost column) are approximately 0.01 (1/100) and roughly constant; see the formula on p. 22 for getting these values.
Another way to check the distribution is to compare the quantiles of the exponential distribution with the sample quantiles in a plot known as a QQ-plot. See R-Quantile plot for a way to compute the quantiles and do the plot.
Recall that the p-th quantile of the distribution of a r.v. Y is the value Q such that P(Y ≤ Q) = p. So we must compute the quantiles of the theoretical distribution and compare them (smallest to smallest, next smallest to next smallest, etc.) to the sample quantiles.
21
Constant Hazard model: Example 2.1, page 21
The decimal point is 1 digit(s) to the right of the |
   0 | 001234555667799011344455999
   2 | 01133455566777888901222344455677889
   4 | 0122334455566667779013688
   6 | 000012346666679900124689
   8 | 334445883467
  10 | 0082445556779
  12 | 2358880125
  14 | 0144467
  16 | 12457868
  18 | 13493
  20 | 17792369
  22 | 1934446
  24 | 2113
  26 | 05839
  28 | 77
  30 | 66
  32 | 4
  34 | 0
  36 | 6
  38 | 3
  40 |
  42 |
  44 | 78
set.seed(100)
y <- rexp(10000, 0.1)
mean(y)   ## 10.07184: this estimates beta = 1/rate = 10, not the rate 0.1!
############################
# Random generation for the exponential distribution with rate `rate` (i.e., mean = 1/rate):
# rexp(n, rate)
# From Example 2.1, we first generate 200 exponential variables with mean = 100.
set.seed(100)
x <- rexp(n = 200, rate = 1/100)
mean(x)   # mean
sd(x)     # standard deviation
# Stemplot of the 200 randomly generated exponential variables with beta = 100
stem(x)
# Exponential QQ-plot: theoretical quantiles (horizontal) against sorted data (vertical)
plot(qexp(ppoints(200, a = .5), rate = 1/mean(x)), sort(x))
abline(0, 1)
22
QQ-plot: Example 2.1, page 21
23
Check Example 1.3, page 6 to see whether the placebo data come from an Exponential distribution
placebo <- c(1,1,2,2,3,4,4,5,5,8,8,8,8,11,11,12,12,15,17,22,23)
Step 1: Fit the placebo data to an exponential with mean = mean(placebo). Note that the theoretical distribution goes on the horizontal axis and the sorted data go on the vertical axis.
Step 2: Now draw in the 45-degree line to give you a visual reference to compare the two distributions. Why this line? The function in R that draws lines on plots is abline(intercept, slope), so we want abline(0, 1).
Step 3: What do you think? Does the placebo dataset follow an exponential distribution?
Try other QQ-plots by randomly generating data from various distributions and then checking whether they follow various theoretical distributions.
24
Check Example 1.3, page 6 to see whether it comes from an Exponential distribution
placebo <- c(1,1,2,2,3,4,4,5,5,8,8,8,8,11,11,12,12,15,17,22,23)
n <- length(placebo); n
# The first plot we can do is the sample quantile plot:
plot(ppoints(n, a = .5), sort(placebo))
# This shows the distribution of placebo; e.g. you can see the median, the quartiles, etc.
###########################################################
# To see whether a set of data follows a particular distribution, plot the quantiles of the data against the quantiles of the specific distribution.
# Recall that there are many distributions built in to R: qexp, qnorm, etc. Try help(qexp) to see the general format of these functions.
###########################################################
# We'll fit our placebo distribution to an exponential with mean = mean(placebo)
plot(qexp(ppoints(n, a = .5), rate = 1/mean(placebo)), sort(placebo))
# Note that the theoretical distribution goes on the horizontal axis and the sorted data go on the vertical axis.
# Now draw in the 45-degree line to give you a visual reference to compare the two distributions. Why this line?
# The function in R that draws lines on plots is abline(intercept, slope), so we want:
abline(0, 1)
# What do you think? Does the placebo dataset follow an exponential distribution?
# Try other QQ-plots by randomly generating data from various distributions and then checking whether they follow various theoretical distributions.
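The closing suggestion, trying QQ-plots on randomly generated data, could be sketched as follows. The helper exp_qq() is hypothetical (it is not from the slides), and the three generating distributions, their parameters, and the seed are arbitrary choices for illustration.

```r
# Hypothetical helper (not from the slides): theoretical vs. sample quantiles
# for an exponential model fitted with mean = mean(x).
exp_qq <- function(x) {
  n <- length(x)
  data.frame(theoretical = qexp(ppoints(n, a = 0.5), rate = 1 / mean(x)),
             sample = sort(x))
}

set.seed(1)                                  # arbitrary seed for reproducibility
samples <- list(
  exponential = rexp(200, rate = 1/100),     # generated from the model being checked
  weibull     = rweibull(200, shape = 3, scale = 100),
  uniform     = runif(200, min = 0, max = 200)
)

par(mfrow = c(1, 3))
for (name in names(samples)) {
  qq <- exp_qq(samples[[name]])
  plot(qq$theoretical, qq$sample, main = name,
       xlab = "Exponential quantiles", ylab = "Sorted data")
  abline(0, 1)                               # 45-degree reference line
}
```

Comparing the three panels side by side gives a feel for how much scatter around the 45-degree line is ordinary sampling noise and what a systematic departure looks like.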
25
Check Example 1.3, page 6 to see whether it comes from an Exponential distribution
26
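Review: Power Hazard model
Power hazard: h(y) = (α/β^α)·y^(α-1) for y ≥ 0, where α > 0 and β > 0.
Note this is of the form (constant)·y^(constant), and if α = 1 this reduces to the constant hazard we just considered.
Note that H(y) = ∫_0^y h(u) du = (y/β)^α, and so S(y) = exp(-(y/β)^α).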
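A numerical check of the power hazard against R's Weibull functions, assuming the shape/scale parameterization S(y) = exp(-(y/β)^α) with shape = α and scale = β. The parameter values are illustrative only.

```r
alpha <- 3.76; beta <- 24          # illustrative shape and scale (assumed values)
y <- c(5, 10, 20, 30, 40)

S <- pweibull(y, shape = alpha, scale = beta, lower.tail = FALSE)  # S(y)
f <- dweibull(y, shape = alpha, scale = beta)                      # f(y)

power_hazard <- (alpha / beta^alpha) * y^(alpha - 1)

print(all.equal(f / S, power_hazard))         # h(y) = f(y)/S(y) is the power hazard
print(all.equal(S, exp(-(y / beta)^alpha)))   # S(y) = exp(-(y/beta)^alpha)

# With alpha = 1 the model reduces to the exponential (constant hazard 1/beta).
S_alpha1 <- pweibull(y, shape = 1, scale = beta, lower.tail = FALSE)
print(all.equal(S_alpha1, pexp(y, rate = 1 / beta, lower.tail = FALSE)))
```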
27
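Weibull Prob Plots
Recall the power hazard model: S(y) = exp(-(y/β)^α).
Substitute y = y(i) and S(y(i)) = 1 - p_i, then take the logarithm: -ln(1 - p_i) = (y(i)/β)^α.
Take the logarithm, base 10, again to create the log-life variable: log10(-ln(1 - p_i)) = α·log10(y(i)) - α·log10(β).
We now have: log10(y(i)) = log10(β) + (1/α)·log10(-ln(1 - p_i)).
Note: base-10 logs are traditionally used for lifetimes, but natural logs are equivalent up to a change of scale.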
28
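Weibull Prob Plots
If the data fit the Weibull model, then approximately log10(y(i)) = log10(β) + (1/α)·log10(-ln(1 - p_i)).
The Weibull prob plot is the plot of the points (log10(-ln(1 - p_i)), log10(y(i))), i = 1, ..., n. It follows a straight line with slope 1/α and intercept log10(β).
Recall: A probability plot, quantile-quantile plot, or QQ-plot, is a plot of the points (Q(p_i), y(i)), i = 1, ..., n.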
29
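Summary: To check graphically whether a distribution is Weibull, do essentially a QQ-plot on the log-log scale (see Section 4.4, pp. 61-63). The key formulas are:
(4.2) S(y) = exp(-(y/β)^α)
Now substitute the ordered data and take natural logs:
(4.2a) -ln(1 - p_i) = (y(i)/β)^α
Take logs again:
(4.2b) log10(-ln(1 - p_i)) = α·log10(y(i)) - α·log10(β)
Write as a linear equation:
(4.3) log10(y(i)) = log10(β) + (1/α)·log10(-ln(1 - p_i))
(4.3a) y = log10(β) + (1/α)·x, where y = log10(y(i)) and x = log10(-ln(1 - p_i))
So plot the points in (4.4), that is, (x, y) for each i, and look for a straight line; the slope will equal 1/α and the intercept will equal log10(β).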
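The linear relationship can be illustrated in R, assuming the shape/scale parameterization with shape = α and scale = β. Exact Weibull quantiles fall on the line exactly, while a simulated sample scatters around it. The parameter values, sample size, and seed are arbitrary.

```r
alpha <- 2; beta <- 50            # illustrative shape and scale (assumed values)
n <- 200
p <- ppoints(n, a = 0.5)
x <- log10(-log(1 - p))           # horizontal axis of the Weibull plot

# Exact Weibull quantiles lie exactly on the line.
y_exact <- log10(qweibull(p, shape = alpha, scale = beta))
fit_exact <- coef(lm(y_exact ~ x))
print(fit_exact)                  # intercept log10(beta), slope 1/alpha

# A simulated sample scatters around the same line.
set.seed(100)
y_sim <- log10(sort(rweibull(n, shape = alpha, scale = beta)))
plot(x, y_sim, xlab = "log10(-ln(1 - p))", ylab = "log10(lifetime)")
abline(log10(beta), 1 / alpha, col = "blue")   # theoretical line
```

The largest deviations from the line usually appear at the lower left, where the smallest observations are most variable on the log scale.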
30
Example 4.2 on page 62. Ranked data: 10, 12, 15, 17, 18, 18, 20, 20, 21, 21, 23, 25, 27, 29, 29, 30, 35. [Note: n = 17]
Step 1: take y = log10 of the ranked data;
Step 2: find the plotting positions p_i = (i - 0.5)/n;
Step 3: find x = log10(-ln(1 - p_i));
Step 4: plot the Weibull prob plot of y against x.
31
Example 4.2, page 62
# Now let's do the Weibull probability plot as described in Chapter 4 on pp. 61-63.
# Formula (4.4) shows that the slope is 1/shape and the intercept is log10(scale)
# when you do a Weibull probability plot. Let's try it with the data in Ex. 4.2.
#############
lifetime <- c(10, 12, 15, 17, 18, 18, 20, 20, 21, 21, 23, 25, 27, 29, 29, 30, 35)
# Now take logs base 10 and compute the values needed in formula (4.4)
y <- log10(lifetime)
n <- length(lifetime)
k <- (1:n - .5)/n   ## or k <- ppoints(n, a = 0.5)
x <- log10(-log(1 - k))
# Now plot the x's and y's and put in a vertical line at x = 0 to see the intercept
plot(x, y)
abline(v = 0, lty = 2)
# Find the slope graphically, or use regression
lm(y ~ x)
# From the output we see intercept = 1.3813 and slope = 0.2660
abline(lm(y ~ x), col = "blue")   ## regression line
# The intercept of the regression line is 1.38,
# so log10(beta) = 1.38, which implies beta = 10^1.38 = 23.98833,
# and slope = 0.2660 implies alpha = 1/0.2660 = 3.75938
32
More about Weibull Prob Plots
The Weibull prob plot follows a straight line with slope 1/α and intercept log10(β).
Standard least-squares regression can provide estimates of the slope and intercept.
If the slope 1/α is estimated to be close to 1, then it is reasonable to assume the data come from an Exponential model.
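The least-squares step can be wrapped in a small function so that the slope check is easy to repeat on other data. The helper weibull_plot_fit() is hypothetical (not from the slides); it is applied here to the Example 4.2 lifetimes and, as a further illustration, to the Example 1.3 placebo data.

```r
# Hypothetical helper (not from the slides): least-squares fit on the Weibull plot.
weibull_plot_fit <- function(data) {
  n <- length(data)
  x <- log10(-log(1 - ppoints(n, a = 0.5)))   # horizontal axis
  y <- log10(sort(data))                      # vertical axis: log10 of ranked data
  b <- unname(coef(lm(y ~ x)))
  c(intercept = b[1], slope = b[2], alpha = 1 / b[2], beta = 10^b[1])
}

lifetime <- c(10, 12, 15, 17, 18, 18, 20, 20, 21, 21, 23, 25, 27, 29, 29, 30, 35)  # Example 4.2
placebo  <- c(1,1,2,2,3,4,4,5,5,8,8,8,8,11,11,12,12,15,17,22,23)                   # Example 1.3

fit_lifetime <- weibull_plot_fit(lifetime)
fit_placebo  <- weibull_plot_fit(placebo)
print(round(fit_lifetime, 4))
print(round(fit_placebo, 4))
```

For the Example 4.2 data the slope is about 0.27, far from 1, so the exponential model is not supported there. The least-squares line is a descriptive fit: the ordered observations are correlated, so the usual regression standard errors should not be read as formal inference.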
33
Cumulative Hazard plot
A hazard plot is a plot of the ranked data y(i) against the scores H_i, where H_i, an estimate of the cumulative hazard at y(i), is called the (cumulative) hazard plot score.
For the Exponential model, h(y) = 1/β, and then H(y) = ∫_0^y h(u) du = y/β.
If the data come from an exponential distribution, then the hazard plot will be a straight line. It is a simpler diagnostic test than the prob plot.
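A minimal sketch of a hazard plot for exponential data. The score formula is an assumption: the scores are taken as H_i = -ln(1 - p_i), the cumulative hazard implied by S = exp(-H) at the plotting positions, and the textbook's hazard plot scores may be defined differently. The value of beta, the sample size, and the seed are arbitrary.

```r
beta <- 100                           # illustrative mean (assumed value)
n <- 50
p <- ppoints(n, a = 0.5)

# ASSUMPTION: hazard scores taken as -log(1 - p_i), i.e. the cumulative hazard
# implied by S = exp(-H) at survival level 1 - p_i. The course text may use a
# different score definition.
H <- -log(1 - p)

# On exact exponential quantiles the relation H(y) = y/beta holds exactly.
y_exact <- qexp(p, rate = 1 / beta)
print(all.equal(H, y_exact / beta))

# A simulated exponential sample scatters around the line with slope 1/beta.
set.seed(100)
y_sim <- sort(rexp(n, rate = 1 / beta))
plot(y_sim, H, xlab = "Ranked data", ylab = "Cumulative hazard score")
abline(0, 1 / beta, col = "blue")
```

Under this choice of scores the hazard plot needs no fitted parameter on the horizontal axis: the data are plotted as they are, and the slope of the line estimates 1/β.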