Published by Φόβος Μανωλάς. Modified over 6 years ago.
1
Lesson Topics
SAS Procedures for Standard Statistical Tests and Analyses
Programs 19 and 20
LSB 9:4-7; 12-13
Welcome to lesson 10. In this lesson we will look at some of the SAS procedures used to perform standard statistical tests and analyses. These topics are illustrated in programs 19 and 20 in the course notes.
2
STATISTICAL PROCEDURES IN SAS
Here is a list of the SAS procedures commonly used for statistical analyses and hypothesis testing. You may be familiar with some of these tests or analyses.
PROC FREQ performs classical chi-square tests on 2-way tables, which includes comparing the proportion with an outcome among 2 or more groups.
PROC TTEST performs the classical Student's t-test comparing 2 sample means.
PROC ANOVA performs analysis of variance, comparing more than 2 means.
PROC GLM (General Linear Model) can also be used for analysis of variance. It is more general than PROC ANOVA.
PROC NPAR1WAY performs non-parametric one-way analysis of variance. It can be used as an alternative to ANOVA if some of the ANOVA assumptions do not hold.
PROC MEANS, which we have used for simple descriptive analyses, can also perform paired t-tests, for example on the change in test scores between two time periods or conditions.
3
STATISTICAL PROCEDURES IN SAS
PROC LOGISTIC performs logistic regression, where you relate one or more factors to a dichotomous outcome.
PROC LIFETEST performs survival analyses comparing survival curves among 2 or more groups.
PROC PHREG, which is related, performs Cox proportional hazards regression. These types of analyses are used extensively in the health sciences for epidemiological studies and clinical trials.
I also include in this list PROC CORR and PROC REG, which we used briefly earlier in the class. PROC REG performs linear regression, where you relate one or more factors to a continuous outcome. PROC CORR computes correlation coefficients.
Running these procedures is rather simple: you pass SAS the appropriate procedure, statements, and options. However, it is important to understand the output that SAS generates so that you can find the information you are interested in.
4
Treatment Groups in TOMHS
1. Beta Blocker
2. Calcium Channel Blocker
3. Diuretic
4. Alpha Blocker
5. ACE Inhibitor
6. Placebo
Groups 1-5 are blood pressure drugs.
For illustration, we will use the TOMHS data, comparing the treatment groups on various outcomes. As a reminder, there are 6 groups in TOMHS. Each of the first five was assigned one of five classes of BP medication. The sixth group received placebo. All patients received counseling to reduce BP through lifestyle changes.
5
Side Effect Questions
Have you been troubled in the past 3 months with any of the following?
a. Fever
b. Sweating
...
ww. Feeling depressed
Response codes:
1. No, not troubled
2. Yes, mildly
3. Yes, moderately
4. Yes, severely
Responses 2-4 indicate a positive response.
Here is the format of the questions asked of patients about several conditions that may be related to taking medication. A value of 1 was entered if the patient was not troubled with the condition, and values 2-4 were positive responses at three levels of severity. For analyses we will combine responses 2-4.
6
Program 19: DATA Step
* Program 19;
DATA stat;
  INFILE 'C:\SAS_Files\tomhsfull.data' LRECL = 300;
  INPUT @ ptid $10. @ group @ sbpbl @ sbp12 @ ursod12 @ se12_2;
  * Tiredness at 12 months: 1 = troubled (responses 2-4), 2 = not troubled;
  if se12_2 in(2,3,4) then tired12 = 1;
  else if se12_2 = 1 then tired12 = 2;
  * Change in systolic BP from baseline to 12 months;
  sbpchg = sbp12 - sbpbl;
  * 1 = active treatment, 2 = placebo;
  if group = 6 then active = 2; else active = 1;
  * Drug class 1-5, missing for placebo;
  if group IN(1,2,3,4,5) then drug = group;
RUN;
In program 19 we use a DATA step to read in the patient ID, study group, systolic BP at baseline and 12 months, urinary sodium at 12 months, and the variable indicating whether the patient was troubled with tiredness (side effect number 2). In this program we use the full TOMHS dataset, which includes data for all 902 patients.
We create a new variable tired12 with a value of 1 if the patient had the condition at 12 months and 2 if they did not. We also compute systolic BP change at 12 months.
For some analyses we will combine the 5 active treatment groups, so we create a new variable called active with values of 1 and 2, 1 indicating active treatment and 2 indicating placebo.
Finally, for some analyses we will compare only the active drug groups, excluding the placebo group. This can be done in several ways, but here I compute a variable called drug using the IN operator. It will have values 1-5 for the active drug groups and a missing value for the placebo group. Do you see why the value of drug will be missing for placebo patients (group = 6)? The IF condition is not true for the placebo group, so the assignment is never executed, and variables are always initialized to missing.
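Before running any tests it can help to confirm that the derived variables came out as intended. The following is a minimal sketch, not part of program 19, that cross-tabulates each new variable against the variable it was built from. It assumes the dataset stat and the variable names used in the DATA step above.

```sas
* Sketch: check the recodes created in the DATA step;
PROC FREQ DATA=stat;
  * MISSING shows missing values as their own row or column;
  * NOROW NOCOL NOPERCENT leave only the cell counts;
  TABLES se12_2*tired12 group*active group*drug / MISSING NOROW NOCOL NOPERCENT;
  TITLE 'Check of Derived Variables';
RUN;
```

In the group*drug table, all placebo patients (group = 6) should fall in the missing column for drug.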
7
Chi-square Test with PROC FREQ
PROC FREQ DATA=stat;
  TABLES active*tired12 / CHISQ RELRISK;
  TITLE 'Chi-square Test Comparing Active vs Placebo Group for Tiredness';
RUN;
CHISQ: displays the chi-square test
RELRISK: displays the odds ratio and relative risk
TABLES request: independent variable * dependent variable
Now that we have the dataset created, we can start running some statistical procedures. We will first do a chi-square test comparing the proportion of patients reporting tiredness in the active versus the placebo group. For this we add the CHISQ option to the TABLES statement. We also use the RELRISK option to display estimates of the odds ratio and relative risk of tiredness for the active versus placebo group.
8
PROC FREQ Output: Chi-square Test
Table of active by tired12 (Frequency and Row Pct; the Percent and Col Pct entries are not reproduced here)
active        tired12 = 1      tired12 = 2      Total
1 (active)    112 (17.67%)     522 (82.33%)     634
2 (placebo)    55 (24.44%)     170 (75.56%)     225
Total         167              692              859
Frequency Missing = 43
Statistics reported (DF, Value, Prob): Chi-Square, Likelihood Ratio Chi-Square, Continuity Adj. Chi-Square, Mantel-Haenszel Chi-Square.
P-value: tests if the two percentages are significantly different.
Here is the output displayed from PROC FREQ. The 2x2 table is displayed first. The row percents (shown in blue on the slide) are the percentage of patients in each group who report tiredness. We see that 17.67 percent of the active group (112 out of 634 patients) reported tiredness, compared to 24.44 percent in the placebo group.
Under the Statistic table we see a chi-square value of 4.87 and a corresponding p-value below 0.05. This is the test of whether the two percentages are significantly different. Here we would reject the null hypothesis of equal percentages at the 5% significance level, since the p-value is less than 0.05.
You will note there are 4 different chi-square tests. The first is the traditional chi-square test, which is a function of the observed versus expected cell frequencies. The others use different formulas and approaches to test the hypothesis. In most cases they will yield similar results.
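As a rough hand check of the chi-square value, the statistic for a 2x2 table can be computed from the four cell counts. The counts used here are the ones that appear in the odds ratio formula on the next slide (112 and 522 in the active group, 55 and 170 in the placebo group), with n = 859 being the 902 patients minus the 43 with missing values. The shortcut formula is the standard one for the uncorrected Pearson chi-square; it is not printed by SAS.

```latex
% a, b = tired / not tired in the active group
% c, d = tired / not tired in the placebo group
\[
\chi^2 = \frac{n\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}
       = \frac{859\,(112 \cdot 170 - 522 \cdot 55)^2}{634 \cdot 225 \cdot 167 \cdot 692}
       \approx 4.87
\]
```

This agrees with the value of 4.87 in the first row of the Statistic table.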
9
Estimates of the Relative Risk (Row1/Row2)
The table of active by tired12 is displayed again, followed by:
Estimates of the Relative Risk (Row1/Row2)
Columns: Type of Study, Value, 95% Confidence Limits
Rows: Case-Control (Odds Ratio), Cohort (Col1 Risk)
OR = Odds of tiredness (Active v Placebo)
OR = (112/522)/(55/170) = 0.66
RR = Risk of tiredness (Active v Placebo)
RR = 17.67/24.44 = 0.72
The RELRISK option displays the odds ratio and the relative risk. These are computed for our data using the formulas shown. Odds ratios and relative risks are often used in the medical field to summarize the difference in rates of an outcome between two groups. The odds of reporting tiredness in the active group are about two-thirds (0.66) of the odds in the placebo group. The 95% confidence interval excludes one, so we can state that the OR is significantly different from one. Also given is the relative risk, which is the ratio of the two probabilities. The table gives both ratios and their 95% confidence intervals.
One note: the odds ratio displayed depends both on which variable is used as the row variable in the TABLES statement and on the values assigned to the two categories. It is good practice to put your independent variable as the row variable and the outcome as the column variable in the TABLES statement, and to code your response condition as 1 for yes and 2 for no.
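The two formulas can be written out from the cell counts to show how the odds ratio and the relative risk differ. The placebo total of 225 is not printed on the slide; it is taken here as 55 + 170.

```latex
\[
\mathrm{OR} = \frac{112/522}{55/170} = \frac{112 \cdot 170}{522 \cdot 55} \approx 0.66,
\qquad
\mathrm{RR} = \frac{112/634}{55/225} = \frac{0.1767}{0.2444} \approx 0.72
\]
```

The odds use the not-tired count in the denominator (522, 170), while the risk uses the group total (634, 225), which is why the two ratios are close but not equal.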
10
Two-Sample T-Test with PROC TTEST
PROC TTEST DATA=stat;
  VAR sbpchg;
  CLASS active;
  TITLE 'T Test Comparing Active vs Placebo Group for Change in Blood Pressure';
RUN;
Testing if mean SBP change is equal between 2 groups.
Next we will test whether the change in systolic BP differs significantly between the active and placebo groups. For this test we use PROC TTEST. You list your outcome variables in the VAR statement and your group variable in the CLASS statement. The CLASS variable should have just 2 values. Here we tell SAS to do a t-test between the active and placebo groups for the variable sbpchg. A separate t-test is performed for each variable listed under VAR.
11
PROC TTEST Output
Statistics panel. Columns: Variable, active, N, Lower CL Mean, Mean, Upper CL Mean, Lower CL Std Dev, Std Dev, Upper CL Std Dev, Std Err, Minimum, Maximum. Rows for sbpchg: active = 1, active = 2, and Diff (1-2).
T-Tests panel. Columns: Variable, Method, Variances, DF, t Value, Pr > |t|. Rows for sbpchg: Pooled (Equal variances) with Pr > |t| <.0001, and Satterthwaite (Unequal variances) with Pr > |t| <.0001.
t Value: -7.26 = (difference in means)/1.0979
Here is the output from PROC TTEST. The output gives you a lot of information, so you will need to be careful to identify the parts you want. The mean changes in blood pressure for the active and placebo groups are shown in red on the slide. The difference in means is given right below them, in blue. The active group lowered their BP by about 8 mmHg more than the placebo group. This should not be surprising, of course, since the active group was taking real medication. Immediately beside the difference in means are the lower and upper 95% confidence limits for the difference.
The standard error for the difference between the two means is 1.0979, given in the second panel. The t-test value, computed as the difference in means divided by the standard error, is given in the third panel. This value is -7.26, which is strongly significant as shown by the p-value. The pooled t-test assumes the variation in blood pressure change is the same in both groups. If this were not true, you could use the second (Satterthwaite) t-test value, which does not assume equal variation between groups.
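The relationship between the three quantities can be written out explicitly. The difference in means is not legible in this copy of the output, so the value below is back-calculated from the t value and standard error shown on the slide; treat it as an approximation.

```latex
\[
t = \frac{\bar{x}_1 - \bar{x}_2}{\mathrm{SE}(\bar{x}_1 - \bar{x}_2)}
\quad\Longrightarrow\quad
\bar{x}_1 - \bar{x}_2 = t \cdot \mathrm{SE} = (-7.26)(1.0979) \approx -7.97 \text{ mmHg}
\]
```

This is consistent with the statement that the active group lowered their BP by about 8 mmHg more than the placebo group.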
12
Plot Generated from PROC TTEST
With ODS graphics turned ON you will also get a histogram and boxplot of the outcome variable for the two groups you are comparing. Here you can see that the distribution is shifted to the left (greater changes in SBP) for the active treatment group.
13
Paired T-Test with PROC MEANS
PROC MEANS DATA=stat N MEAN STDERR T PRT;
  CLASS group;
  VAR sbpchg;
  TITLE 'Paired T-Test, Are there significant changes in SBP within each group?';
RUN;
Suppose we wanted to determine whether there were significant changes in blood pressure within each group. This pre versus post comparison can be tested using PROC MEANS on the change variable. We add the keywords STDERR, T, and PRT to display the necessary statistics. We include group as a class variable to test separately within each of the 6 groups. This type of analysis can also be used for crossover designs, where each patient gets each of the two treatments in random order.
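The same within-group test can also be requested from PROC TTEST with a PAIRED statement, which works from the two original measurements rather than the computed change. This is a sketch of an alternative, not part of program 19. The dataset name stat_sorted is made up for illustration, and the sort is needed because BY-group processing expects sorted data.

```sas
* Sketch: paired t-test within each group using PROC TTEST;
PROC SORT DATA=stat OUT=stat_sorted;
  BY group;
RUN;

PROC TTEST DATA=stat_sorted;
  BY group;
  * Tests whether the mean of sbp12 - sbpbl is zero;
  PAIRED sbp12*sbpbl;
  TITLE 'Paired T-Test by Group Using PROC TTEST';
RUN;
```

Because PAIRED sbp12*sbpbl analyzes sbp12 minus sbpbl, the t values should match those from PROC MEANS on sbpchg.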
14
T-value = Mean/Std Error
Analysis Variable : sbpchg
Columns: group, N Obs, N, Mean, Std Error, t Value, Pr > |t|. One row per group; Pr > |t| is <.0001 in all six groups.
T-value = Mean/Std Error
Here is the output. Displayed for each group are the mean change, the standard error, the t-value, and the significance level. The t-value is computed as the mean divided by the standard error. Here we see that the changes in BP were strongly significant within each group.
15
Comparing 5 Active Drug Groups with PROC ANOVA
* Compare 5 active drug groups;
* For SBP change;
PROC ANOVA DATA=stat;
  CLASS drug; * Treat as categories;
  MODEL sbpchg = drug;
  MEANS drug / BON;
  TITLE 'ANOVA Comparing 5 Active Treatment Groups for Change in SBP';
RUN;
To perform analysis of variance (ANOVA) we use PROC ANOVA. Here we compare the 5 active treatment groups on blood pressure response. We include drug as the class variable. This tells SAS to treat each level of drug as a separate category in the model, rather than treating drug as a continuous numeric variable. In the MODEL statement you list the response variables first, followed by an equals sign, followed by the independent variables. Here we have the single independent variable drug.
We then add a MEANS statement with the BON option. Besides the means for each group, this option displays which groups are significantly different. BON stands for Bonferroni, a method that corrects for multiple comparisons so that the overall type I error rate is controlled.
16
PROC ANOVA Output: Class Level Information
The ANOVA Procedure
Class Level Information
Class   Levels   Values
drug    5        1 2 3 4 5
Number of observations 902
NOTE: Due to missing values, only 627 observations can be used in this analysis.
The first part of the output from PROC ANOVA tells us that the class variable drug has 5 levels, 1 through 5, and that only 627 of the 902 observations in the dataset were used. This difference is mostly due to the placebo group being excluded, since drug is missing for the placebo group. Patients with missing BP at 12 months are also excluded.
17
Dependent Variable: sbpchg
ANOVA TABLE
Columns: Source, DF, Sum of Squares, Mean Square, F Value, Pr > F. Rows: Model, Error, Corrected Total.
F Value = 668.5/190.7
Second table. Columns: Source, DF, Anova SS, Mean Square, F Value, Pr > F. Row: drug.
The next part displayed is the ANOVA table. The F-value is computed as the ratio of two mean squares. This F-value is used to test whether the 5 means are equal. The p-value here is small, which suggests that not all BP drugs lowered BP by the same amount.
The second section lists test information for each independent variable in the model. Since drug is the only variable in the model, the F-test listed is the same as in the overall ANOVA table. If the F-test is significant, we would usually want to go further and compare the individual means to see which ones differ.
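The ratio shown on the slide can be worked through by hand. The degrees of freedom below are not legible in this copy of the output; they are derived from the 5 drug groups and the 627 observations used, so treat them as an inference.

```latex
\[
F = \frac{\mathrm{MS}_{\text{Model}}}{\mathrm{MS}_{\text{Error}}} = \frac{668.5}{190.7} \approx 3.51,
\qquad
\mathrm{df}_{\text{Model}} = k - 1 = 5 - 1 = 4,
\qquad
\mathrm{df}_{\text{Error}} = N - k = 627 - 5 = 622
\]
```

A large F means the variation between group means is large relative to the variation within groups.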
18
Bonferroni (Dunn) t Tests for sbpchg
Alpha 0.05
Critical Value of t: adjusts for 10 possible pairwise comparisons.
Minimum Significant Difference 4.92: required difference between any 2 groups to be significant.
To see which means differ, we can use the output from the BON option. Bonferroni tests increase the size of the difference between means that you need to observe before you call the difference significant. SAS displays the difference you would need to observe. Here, any difference between groups of 4.92 or greater would be declared significant. Using that criterion, group 4 (the alpha blocker) is significantly different from group 3 (the diuretic).
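The count of 10 comparisons and the size of the minimum significant difference can both be checked by hand. This is a rough sketch under two assumptions that are not in the output shown: that the five groups are about equal in size (roughly 627/5, or about 125 each), and that the Bonferroni critical t value with 622 error degrees of freedom is about 2.82.

```latex
\[
m = \binom{5}{2} = 10, \qquad \alpha_{\text{per comparison}} = \frac{0.05}{10} = 0.005
\]
\[
\mathrm{MSD} = t_{1 - \alpha/(2m),\ \mathrm{df}_{\text{Error}}}\,\sqrt{\frac{2\,\mathrm{MS}_{\text{Error}}}{n}}
\approx 2.82 \sqrt{\frac{2\,(190.7)}{125}} \approx 4.9
\]
```

The result is close to the 4.92 that SAS reports. An unadjusted test at the 0.05 level would use a critical value near 1.96 and so a smaller required difference.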
19
Comparisons significant at the 0.05 level are indicated by ***.
Columns: drug Comparison, Difference Between Means, Simultaneous 95% Confidence Limits. Two comparisons are flagged with ***.
This piece of output gives all pairwise comparisons. The output notes that groups 3 and 4 are significantly different at the 0.05 level.
20
PROC GLM (General Linear Model)
PROC GLM DATA=stat; * GLM (General Linear Model);
  CLASS drug;
  MODEL sbpchg = drug;
  ESTIMATE 'BB vs Diuretic' drug ;
  ESTIMATE 'CCB vs Diuretic' drug ;
  ESTIMATE 'Alpha B vs Diuretic' drug ;
  ESTIMATE 'ACE v Diuretic' drug ;
  MEANS drug;
  TITLE 'GLM Comparing 5 Active Treatment Groups for Change in SBP';
RUN;
The first ESTIMATE statement compares drug 1 with drug 3.
Instead of running PROC ANOVA you can also run PROC GLM to do your analyses. The syntax for the CLASS and MODEL statements is the same. With GLM we can also write ESTIMATE statements to compare certain groups or combinations of groups. Suppose we consider the diuretic group to be the standard drug and we want to compare each of the other drugs with the standard. There would be four such comparisons. We can set these up with four ESTIMATE statements.
The ESTIMATE statement consists of the keyword ESTIMATE followed by a label in quotes, followed by the class variable, in our case drug. What follows is a series of 5 numbers: -1s, 0s, and 1s. The sum of these numbers must equal zero. We enter a 1 and a -1 for the two drugs we want to compare and zeros for the rest. These are sometimes called contrasts. The first contrast compares drug 1 with drug 3, ignoring the other three drugs. The next contrast compares drug 2 with drug 3, and so forth. In our output we will get a test for each of these contrasts.
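The coefficient rule described above can be sketched as follows. The coefficients themselves are an assumption here: they are reconstructed from the description (one 1, one -1, zeros elsewhere, in the order of the class levels 1 through 5), and the choice to put the -1 on the diuretic (drug 3) is a guess at the sign convention, made so that each estimate is the other drug's mean minus the diuretic's mean.

```sas
* Sketch: each drug compared with the diuretic (drug 3);
PROC GLM DATA=stat;
  CLASS drug;
  MODEL sbpchg = drug;
  * Coefficients follow the sorted class levels 1 2 3 4 5;
  ESTIMATE 'BB vs Diuretic'      drug 1 0 -1 0 0;
  ESTIMATE 'CCB vs Diuretic'     drug 0 1 -1 0 0;
  ESTIMATE 'Alpha B vs Diuretic' drug 0 0 -1 1 0;
  ESTIMATE 'ACE v Diuretic'      drug 0 0 -1 0 1;
  TITLE 'GLM Contrasts Against the Diuretic Group';
RUN;
QUIT;
```

Reversing the signs in a row would flip the sign of that estimate and its t value but leave the p-value unchanged.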
21
Output from estimate statements
The GLM Procedure
Type III table. Columns: Source, DF, Type III SS, Mean Square, F Value, Pr > F. Row: drug.
Output from estimate statements. Columns: Parameter, Estimate, Standard Error, t Value, Pr > |t|. Rows: BB vs Diuretic, CCB vs Diuretic, Alpha B vs Diuretic, ACE v Diuretic.
Each group has higher BP than the diuretic group.
Here is the output from the ESTIMATE statements. Note that the label we gave each estimate is displayed in the first column. Next is the estimate, which in our case is the difference between the 2 means being compared. After that come the standard error of the estimate, the t value, and the p-value. Using a P=0.01 level of significance we would declare the alpha blocker group significantly different from the standard diuretic group.
With ESTIMATE statements you can also test other contrasts among the means. If you wanted to compare the average of groups 2 and 3 with group 4, you would use the statement shown at the bottom.
ESTIMATE 'Avg 2-3 v 4' drug ;
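For the average of groups 2 and 3 versus group 4, one way to write the coefficients is sketched below. The values are an assumption, since the coefficients are not legible in this copy of the slide. They follow the same rule as before: weights of one half on each of groups 2 and 3, a -1 on group 4, and a total of zero.

```sas
* Sketch: average of drugs 2 and 3 compared with drug 4;
PROC GLM DATA=stat;
  CLASS drug;
  MODEL sbpchg = drug;
  ESTIMATE 'Avg 2-3 v 4' drug 0 0.5 0.5 -1 0;
  * Equivalent form using whole numbers and a divisor;
  ESTIMATE 'Avg 2-3 v 4 (divisor)' drug 0 1 1 -2 0 / DIVISOR=2;
RUN;
QUIT;
```

The DIVISOR= form is convenient when the fractions do not have exact decimal representations, for example thirds.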
22
PLOT GENERATED FROM PROC GLM
With ODS GRAPHICS turned on you will get side-by-side boxplots by treatment group as shown here.
23
Distribution of Urinary Sodium Excretion
We next want to compare groups for urinary sodium excretion at 12 months. The histogram shown here indicates that the distribution is not normal but rather has a long right tail. Thus, we may want to use a non-parametric method to compare groups.
24
Histogram with PROC UNIVARIATE
PROC UNIVARIATE DATA = stat;
  VAR ursod12;
  HISTOGRAM ursod12 / NORMAL MIDPOINTS=0 to 180 by 2;
  INSET N = 'N' (5.0) MEAN = 'Mean' (5.1) STD = 'Sdev' (5.1) MIN = 'Min' (5.1) MAX = 'Max' (5.1) / POS=NW HEADER='Summary Statistics';
  LABEL ursod12 = 'Urinary Sodium';
  TITLE 'Distribution of Urinary Sodium Excretion';
RUN;
PROC UNIVARIATE can also be used to display high-resolution histograms and normal probability plots. The ODS GRAPHICS ON statement turns on graphics for the procedures that follow. Plots specified in the UNIVARIATE procedure are written to an external file in PNG format. They can also be viewed by clicking the appropriate link in the results window.
To produce a histogram you use the HISTOGRAM statement. The keyword HISTOGRAM is followed by the name of the variable, followed by options, if any, after a slash (/). Here we produce a histogram for ursod12 with a normal curve superimposed on the plot. The MIDPOINTS option sets the bar midpoints on the X-axis, here from 0 to 180 in steps of 2. The INSET statement inserts statistics for ursod12 on the same plot. The POS option tells SAS to put the statistics in the north-west part of the plot area. Look under the documentation for PROC UNIVARIATE for several examples of the INSET statement.
The PROBPLOT statement, which is not used in the code shown, produces a high-resolution normal probability plot. Its MU and SIGMA options tell SAS to estimate the mean and standard deviation from the data.
Needless to say, some of this syntax is difficult to remember. However, once you have an example that works, you can use it as a template the next time you want to make a histogram.
25
Non-parametric Test with PROC NPAR1WAY
PROC NPAR1WAY DATA=stat WILCOXON;
  CLASS drug;
  VAR ursod12; * Skewed distribution;
  TITLE 'Non-parametric Test Comparing Groups in Urinary Sodium';
RUN;
The values of ursod12 are ordered from lowest to highest and given ranks of 1 to N. The analysis is then done on these ranked values.
To do this with SAS you use PROC NPAR1WAY, which stands for non-parametric one-way analysis of variance. The syntax is pretty simple. We put the variable drug in the CLASS statement and the variable ursod12 in the VAR statement. Instead of using the actual values of urinary sodium, SAS ranks the values from 1 to N and does the analysis on the ranked values. We add the WILCOXON option to tell SAS we want to perform the Wilcoxon rank test.
26
PROC NPAR1WAY Output
Wilcoxon Scores (Rank Sums) for Variable ursod12, Classified by Variable drug
Columns: drug, N, Sum of Scores, Expected Under H0, Std Dev Under H0, Mean Score (the mean of ranks with values 1-602). One row per drug group.
Average scores were used for ties.
Kruskal-Wallis Test. Columns: Chi-Square, DF, Pr > Chi-Square.
Added table (red box on the slide). Columns: drug, N, Median.
Here is the output from the procedure. If all you are interested in is the p-value, you can ignore everything else. The p-value is 0.04, suggesting there may be real differences among drug groups. The Mean Score column displays the average rank for each group. The total dataset had 602 observations, so patients in group 5, with an average rank of about 270, tended to have lower ranked sodium levels. Group 3 had the highest ranks on average. This is consistent with the median levels displayed in the red box. The medians are not part of the NPAR1WAY output but can be obtained from PROC MEANS with the MEDIAN option.
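The medians mentioned above can be produced with a short PROC MEANS step. This is a minimal sketch that assumes the same dataset stat and variables drug and ursod12 used in program 19.

```sas
* Sketch: median urinary sodium by drug group;
PROC MEANS DATA=stat N MEDIAN;
  CLASS drug;
  VAR ursod12;
  TITLE 'Median Urinary Sodium by Drug Group';
RUN;
```

Observations with a missing value of drug, which here means the placebo group, are left out of the CLASS groups by default.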
27
Program 20: Chi-square Tests from Summary Counts
* Program 20;
* Chi-square tests from summary counts;
DATA asthma;
  INFILE DATALINES;
  INPUT ses asthma count;
DATALINES;
1 1 40
1 2 100
2 1 30
2 2 130
;
          Asthma
SES     | YES | NO  |
LOW     |  40 | 100 |
HIGH    |  30 | 130 |
Well, that was a long program, especially looking at all the output. You may not use every one of the tests just explained, but for those you do use, you now know how to program them in SAS, at least in their simplest form. You can check the SAS documentation for more details on these procedures.
What if you have the summary counts for a 2x2 table and you want to compute the chi-square test in SAS? You can do it, but it requires a little trick. The 2x2 table here compares asthma rates between low and high SES groups. We want to test whether there is a significant difference in rates. There are 300 persons in the table. We certainly don't want to type in 300 rows of data. Instead we enter just 4 rows of data, one for each cell, and later we tell SAS that these represent multiple observations.
We define one variable to represent the row, one to represent the column, and a third to hold the count. The row variable we name ses and the column variable we name asthma. We could give these two variables any two values to represent the two rows and two columns, but we choose values so that the table SAS displays is laid out the way we want.
28
SAS Log
1 DATA asthma;
2 INFILE DATALINES;
3 INPUT ses asthma count;
4 DATALINES;
NOTE: The data set WORK.ASTHMA has 4 observations and 3 variables.
If we run the program, SAS tells us there are 4 observations and 3 variables in the dataset asthma.
29
PROC FREQ with a WEIGHT Statement
PROC FREQ DATA=asthma;
  TABLES ses*asthma / CHISQ RELRISK;
  WEIGHT count;
  TITLE 'Relationship between Asthma and SES';
RUN;
Table of ses by asthma (Frequency; the Percent, Row Pct, and Col Pct entries are not reproduced here)
ses       asthma = 1    asthma = 2    Total
1          40           100           140
2          30           130           160
Total      70           230           300
Odds Ratio (Relative Odds): (40/100)/(30/130) = 1.73
Note: This is the odds ratio of having asthma (low v high SES).
To calculate the chi-square value we use PROC FREQ with the CHISQ option. The important statement we add is the WEIGHT statement with the variable count. This tells SAS to weight each of the 4 observations by the value of count, that is, to perform the analysis just as if there were "count" copies of each observation. This is the little trick we use.
The output is displayed below the code. Note that we have 300 total observations and that we have duplicated the table we read in. SAS then calculates the correct chi-square test. The odds ratio of having asthma for low versus high SES is 1.73.
30
Statistics for Table of ses by asthma
Statistic table. Columns: Statistic, DF, Value, Prob. Row: Chi-Square.
Estimates of the Relative Risk (Row1/Row2). Columns: Type of Study, Value, 95% Confidence Limits. Row: Case-Control.
Loosely speaking: there is a 73% increased chance of asthma if you are low SES (versus high SES).
Here is the SAS output giving the chi-square value and the odds ratio with its 95% confidence interval. Using P=0.05 we would declare the difference in asthma rates to be significant. The odds ratio is 1.73. Loosely speaking, this means those with low SES have about a 73% higher chance of having asthma than those with high SES. Strictly, it is the odds of asthma that are 73% higher.
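The key numbers for this table can be checked by hand from the four counts entered in program 20. The chi-square shortcut formula and the relative risk below are not taken from the output shown; they are standard 2x2 calculations added for comparison.

```latex
\[
\chi^2 = \frac{300\,(40 \cdot 130 - 100 \cdot 30)^2}{140 \cdot 160 \cdot 70 \cdot 230} \approx 4.03,
\qquad
\mathrm{OR} = \frac{40/100}{30/130} \approx 1.73,
\qquad
\mathrm{RR} = \frac{40/140}{30/160} = \frac{0.286}{0.188} \approx 1.52
\]
```

The chi-square value exceeds 3.84, the 5% critical value with 1 degree of freedom, which agrees with declaring the difference significant. The relative risk shows why "73% increased risk" is only a loose reading: the odds are 73% higher, while the risk itself is about 52% higher.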