Lesson 4 Descriptive Procedures


1 Lesson 4 Descriptive Procedures
Procedures displaying associations between 2 variables
Procedures FREQ, CORR, REG, SGPLOT
Comment and Option Statements
Program 4 in course notes
LSB: See syllabus
LSB: Chapter 11 – Debugging Programs

Welcome to lesson 4. In the previous lesson we covered the MEANS and UNIVARIATE procedures, which are used to summarize continuous variables. In this lesson we will look at some procedures that summarize categorical data, both in tabular and graphical form. We will also look at some procedures used to describe the relationship between 2 variables, whether continuous or categorical. These examples are illustrated in program 4. We will also look at the use of comments in your program and discuss option statements. Lastly, I will give some tips on how to debug your programs.

2 Program 4
DATA weight ;
  INFILE 'C:\SAS_Files\tomhs.dat' ;
  INPUT ptid $10.
        @12 clinic $1.
        @27 age 2.
        @30 sex 1.
        @58 height 4.
        @85 weight 5.
        @140 cholbl 3. ;
  bmi = (weight* )/(height*height);
RUN;

The top of program 4 reads in data from the TOMHS dataset. The variables read in are patient ID, clinical center, age, sex, height, weight and blood cholesterol. These are all baseline data. Again, the data dictionary included in the course notes gives the positions and informats of these variables on the raw data file. As in program 3, we compute body mass index. The name of the SAS dataset we create is weight.

3 PROC FREQ DATA=weight;
  TABLES clinic sex ;
  TITLE 'Frequency Distribution of Clinical Center and Gender';
RUN;

[Output: The FREQ Procedure. One-way tables for clinic (levels A, B, C, D) and for sex, each with the columns Frequency, Percent, Cumulative Frequency and Cumulative Percent.]

PROC FREQ is used to summarize categorical data by displaying frequency distributions. You can also use PROC FREQ to produce 2-way tabulations and perform traditional chi-square tests for contingency tables. Here we tell SAS to display frequency distributions for the variables clinic and sex. Note that the sub-statement used is TABLES, not VAR. We can list as many variables as we want on the TABLES statement, and we will get a frequency distribution for each one. For each category the frequency, percent, cumulative frequency, and cumulative percent are displayed. We see here that there are 73 men and 27 women in the study. There are four clinical sites, A-D; clinic C has the most patients, 36 or 36% of the total.

4 *2-Way Frequency Tables ;
PROC FREQ DATA=weight;
  TABLES sex*clinic ;
  TITLE 'Cross Tabulation of Clinical Center and Sex';
RUN;

PROC FREQ can also be used to display 2-way classifications. You indicate that by inserting an asterisk (*) between the two variables. The TABLES statement here tells SAS to display a 2-way table of sex and clinic. The variable sex, listed first, will be the row variable and the variable clinic, listed second, will be the column variable. Reversing the order gives you the same information with the row and column variables swapped.

5 Cross Tabulation of Clinical Center and Sex
The FREQ Procedure
[Output: Table of sex by clinic. Rows are sex (1, 2) and columns are clinic (A, B, C, D), with row and column totals. Each cell lists Frequency, Percent, Row Pct and Col Pct. Callout on the slide: percent men in clinic A.]

Here is the output from PROC FREQ. When you go from a 1-way to a 2-way table you need to take care in understanding the output, as several counts and percentages are given. We see here that there are 8 cells (2 levels of sex times 4 levels of clinic) plus the row and column totals. Within each cell are 4 numbers: the cell frequency, the cell percent, the row percent, and the column percent. Let's look at the cell sex=1 and clinic=A. The first number, 12, is the number of observations in this cell (12 persons are men and in clinical center A). The 2nd number, 12.00, is the percent of all patients (12/100 x 100%). The 3rd number, 16.44, is the percent of all men who are in clinic A (12/73 x 100%). The 4th number, 66.67, is the percent of clinic A patients who are men (12/18 x 100%). Usually either the row or the column percentage is the most appropriate statistic to report. Here the numbers in blue are the percentages of men in each clinic. Clinic C has the highest percentage of men (83.33%). Also given are the row and column totals (sometimes called marginal totals). These are the results you would get from a 1-way frequency table. If either of the two variables is missing, that observation is excluded from the table.
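The arithmetic behind the four numbers in the sex=1, clinic=A cell can be checked directly. The following is a minimal sketch, not part of Program 4; the variable names (cell, total, row_total, col_total) are made up for illustration, and the counts are the ones quoted in the narration above.

```sas
/* Recompute the cell statistics for sex=1, clinic=A by hand. */
/* Counts come from the narration: 12 men in clinic A,        */
/* 100 patients overall, 73 men, 18 patients in clinic A.     */
DATA _NULL_;
  cell      = 12;
  total     = 100;
  row_total = 73;
  col_total = 18;
  pct     = 100 * cell / total;      /* cell percent   */
  row_pct = 100 * cell / row_total;  /* row percent    */
  col_pct = 100 * cell / col_total;  /* column percent */
  PUT pct= 6.2 row_pct= 6.2 col_pct= 6.2;
RUN;
```

The values written to the log should agree with the 12.00, 16.44 and 66.67 discussed above.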

6 * Getting only the counts ;
PROC FREQ DATA=weight;
  TABLES sex*clinic / nopercent norow nocol;
RUN;

[Output: table of sex (rows 1, 2) by clinic (columns A, B, C, D) showing only the cell frequencies and the totals.]

Sometimes you want only the counts in the cells. To remove all the percentages, add the NOPERCENT, NOROW and NOCOL options to the TABLES statement.

7 *Adding a two-way plot ;
PROC FREQ DATA=weight;
  TABLES sex*clinic /
    PLOTS=FREQPLOT(TWOWAY=GROUPHORIZONTAL);
RUN;

There are several plots you can generate with ODS using the PLOTS option of the TABLES statement. One such plot is a two-way frequency plot, where the distribution of sex is displayed for each clinical center. The code is shown here, followed by the plot generated by the FREQPLOT option. From this plot you can see there are about twice as many men as women enrolled in clinics A, B and D, but in clinic C there are about 4-5 times as many men as women.

8 OTHER USEFUL TABLE OPTIONS
CHISQ – performs chi-square analyses for 2-way tables
MISSING – includes missing data as a separate category
LIST – makes condensed table (useful when looking at 3-way or higher tables)

Other useful TABLES options for PROC FREQ are listed here. The CHISQ option displays statistics from a chi-square test. We will see examples of this in a later session covering statistical tests. The MISSING option treats missing data as a separate category. This is useful when you want to account for all observations in your tabulations. The LIST option gives a condensed table that is useful for looking at 3-way or higher tables. You may want to try some of these options to see the output that is displayed. A list of all options is given in the SAS help. Look under "The FREQ Procedure".
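To try the options listed above, something like the following could be run. It is a sketch only, assuming the weight dataset from Program 4 has already been created; the choice of sex*clinic is just for illustration.

```sas
/* Assumes dataset weight from Program 4 exists. */
PROC FREQ DATA=weight;
  /* Chi-square statistics for the 2-way table */
  TABLES sex*clinic / CHISQ;
  /* Condensed list layout, with missing values kept as a category */
  TABLES sex*clinic / MISSING LIST;
  TITLE 'Trying the CHISQ, MISSING and LIST Options';
RUN;
```

Each TABLES statement produces its own table, so the two layouts can be compared side by side in the output.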

9 * Using PROC SGPLOT for bar charts;
ODS GRAPHICS /WIDTH=300px ;
PROC SGPLOT;
  VBAR clinic;
  TITLE "Vertical Bar Chart of Clinical Center";
  LABEL clinic = "Clinical Center";
RUN;

Plot can be embedded into an HTML document or kept as a separate file.
The file can be inserted in Office documents.

We saw in the last lesson that PROC SGPLOT can be used for various graphics. Here we will look at how to make a bar chart. The main sub-statement is VBAR for vertical bar charts and HBAR for horizontal bar charts. Here we produce a vertical bar chart for clinic. The keyword is VBAR followed by the variable name. The height of each bar will be the number of patients in that clinical center. We add a title and a label for clinic and we are ready to go. The link to the plot will be displayed in the results window. You can click on the link to display the plot, and you can copy and paste the plot into an Office application. A png file will also be written to your default folder, which is displayed at the bottom right of your SAS session. SAS will provide a name for the file. At the top you will note an ODS GRAPHICS statement. This is not needed for the SGPLOT procedure; we use it here only to limit the size of the graph produced. If ODS graphics is turned on before other procedures, then procedure-specific graphs will be generated automatically, just as procedure-specific tabulations are.

10 * Same plot displayed horizontally;
PROC SGPLOT;
  HBAR clinic;
  TITLE "Horizontal Bar Chart of Clinical Center";
  LABEL clinic = "Clinical Center";
RUN;

To change the plot to a horizontal bar chart, just replace the keyword VBAR with HBAR. The result is shown here.

11 * DATALABEL puts values on top of bar;
PROC SGPLOT;
  YAXIS LABEL = "Mean Cholesterol" VALUES = (0 to 300 by 50);
  VBAR clinic/RESPONSE=cholbl STAT=MEAN DATALABEL ;
  TITLE 'Mean Cholesterol by Clinical Center';
  LABEL clinic = "Clinical Center";
RUN;

Let's look at another example of using PROC SGPLOT. Suppose you want the height of the bars to be a statistic of another variable, rather than the count within the group. You can do this by adding the RESPONSE option to the VBAR statement. Here we want the height of the bars to be the mean cholesterol. The statistic is given after the STAT keyword, here MEAN. One could choose a different statistic as well, for example the MEDIAN. The DATALABEL option puts the value of the statistic at the top of the bar. The YAXIS statement is used to give the range of values on the y-axis as well as to provide a label. The output is shown below the code.
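The narration mentions that a different statistic such as the median could be plotted. A minimal sketch of that variation follows; it assumes the weight dataset from Program 4 and a SAS release in which the VBAR statement accepts STAT=MEDIAN (an assumption worth checking against your installed version).

```sas
/* Same bar chart, but bar height is the median cholesterol. */
/* Assumes dataset weight from Program 4 and a release that  */
/* supports STAT=MEDIAN on VBAR.                             */
PROC SGPLOT DATA=weight;
  YAXIS LABEL = "Median Cholesterol" VALUES = (0 to 300 by 50);
  VBAR clinic / RESPONSE=cholbl STAT=MEDIAN DATALABEL;
  TITLE 'Median Cholesterol by Clinical Center';
  LABEL clinic = "Clinical Center";
RUN;
```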

12 * LIMITSTAT adds SE bars;
PROC SGPLOT NOAUTOLEGEND;
  YAXIS LABEL = "Mean Cholesterol" VALUES = (0 to 300 by 50);
  VBAR clinic/RESPONSE=cholbl STAT=MEAN LIMITSTAT=STDERR ;
  TITLE 'Mean Cholesterol by Clinical Center';
  LABEL clinic = "Clinical Center";
RUN;

Statisticians often like to add standard error (SE) bars to the plot. Here is how you can do that: add LIMITSTAT=STDERR to the VBAR statement.

13 * Using SGPLOT to make regression plot;
PROC SGPLOT DATA=weight;
  YAXIS LABEL = "Body Mass Index (BMI)" ;
  XAXIS LABEL = "Age (y)" ;
  REG X=age Y=bmi/CLM;
  WHERE sex = 2;
  TITLE 'Plot of BMI and Age for Women';
RUN;

We saw last time how to create an X-Y plot with a fitted regression line. Here we see another example, plotting BMI versus age for women. You simply tell SAS which variable is your X variable and which is your Y variable. The CLM option adds a confidence band for the mean around the line. The XAXIS and YAXIS statements can be used to modify the axes; here we use them to provide a label for each axis. We are exploring the relationship between body mass index (the Y variable) and age (the X variable), restricted to just women. I do that here mainly to reduce the number of points on the plot. We see there is an inverse relationship between age and BMI for women, which may mean that obese persons don't live long enough to show up at the upper age levels of the plot.

14 * Using SGPANEL to make paneled graphs;
proc format;
  value sex 1='Men' 2='Women';
run;
proc sgpanel noautolegend;
  panelby sex/novarname columns=2 spacing=5;
  rowaxis label = "BMI (kg/m2)" ;
  colaxis label = "Age (y)" ;
  reg x=age y=bmi;
  format sex sex.;
  TITLE 'Plot of BMI Versus Age for Men and Women';
RUN;

Our last plotting example uses the SGPANEL procedure. The SGPANEL procedure is used when we want the same type of plot drawn separately for each level of a grouping variable, with the panels displayed together in one graph. Here we want to plot BMI versus age, showing the results separately for men and women. The PANELBY statement controls the grouping; here the panels for men and women go across (columns=2). You will note that instead of XAXIS and YAXIS statements you now use COLAXIS and ROWAXIS statements. The REG statement is the same as when using PROC SGPLOT. You will note that I defined a format for sex so that the text "Men" and "Women" appears rather than the coded values of 1 and 2. We will cover this in more detail in a later class.

15 PROC CORR DATA=weight;
  VAR bmi age;
  WHERE sex = 2;
  TITLE 'Correlation of BMI and Age for Women';
RUN;

[Output: Pearson Correlation Coefficients, N = 27; Prob > |r| under H0: Rho=0. A 2x2 matrix for bmi and age; each off-diagonal cell holds the correlation coefficient with the p-value (0.0203) beneath it. Callouts on the slide: "Correlation Coefficient" and "P-value testing if correlation is significantly different from zero".]

Another procedure you might use to investigate the relationship between continuous variables is PROC CORR, which displays correlation coefficients for two or more variables. Here we display the correlation between BMI and age for women. The sub-statement is VAR, followed by a list of variables. SAS will display the correlations among all the variables listed. Here we list just two variables, age and bmi, using the WHERE statement again to limit the analysis to women. The output is arranged in a matrix, here a 2x2 matrix. The diagonal values are always equal to one (the correlation of a variable with itself). The off-diagonals contain the correlation between the row and column variables, here age and bmi. Underneath the correlation is the significance level for the test that the true correlation is 0. The p-value here is 0.0203. Correlation matrices are symmetric, so the same correlations are displayed in the upper and lower parts of the matrix. The sample size is given above the matrix, here N=27. If there are different sample sizes for different pairs of variables, which could be the case if you listed 3 or more variables in the VAR statement, the sample size is given in each cell of the matrix.
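The narration notes that listing 3 or more variables can give different sample sizes for different pairs. A hedged sketch of such a call is shown below; adding cholbl as the third variable is an arbitrary choice for illustration, and the code assumes the weight dataset from Program 4, where sex is the name of the gender variable.

```sas
/* Assumes dataset weight from Program 4; cholbl is added   */
/* only to illustrate a 3x3 correlation matrix.             */
PROC CORR DATA=weight;
  VAR bmi age cholbl;
  WHERE sex = 2;
  TITLE 'Correlations of BMI, Age and Cholesterol for Women';
RUN;
```

If any of the three variables has missing values, the number of observations used for each pair is printed inside the matrix rather than once above it.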

16 PROC REG DATA=weight ;
  MODEL bmi=age;
  WHERE sex = 2;
  TITLE 'Simple Linear Regression';
RUN;

[Partial output: Parameter Estimates table with columns Variable, DF, Parameter Estimate, Standard Error, t Value and Pr > |t|, and rows Intercept (Pr > |t| <.0001) and age. The slide also shows the regression equation for bmi as a function of age.]

*Note: many options for plotting within proc reg. ODS graphics on will produce many plots by default.

The last procedure we will look at is PROC REG, which performs simple and multiple linear regression analyses. This would not normally be thought of as a descriptive procedure, but I include it here to complete the analyses you might do when looking at the relationship between 2 continuous variables. I will illustrate its use with simple regression, with BMI as the dependent variable and age as the independent variable. Note that the main sub-statement is the MODEL statement, where the keyword MODEL is followed by the dependent variable, an equals sign, and then a list of independent variables (here only one, age). Partial output is displayed here. Under Parameter Estimates are the estimated intercept and slope for the regression equation. Also included are the standard error of each estimate, the t-value (the estimate divided by its SE) and the p-value. The regression equation is shown here in blue, extracted from the output. The interpretation of the coefficient for age is that for each additional year of age BMI decreases by an average of 0.29 kg/m2. There are many other features in PROC REG. If you turn on ODS graphics you will get several plots, including a matrix of diagnostic plots. These are shown on the next 2 slides.

17 Fit plot from PROC REG
Here is the regression line plot you get by turning on ODS graphics. This is essentially the same plot you can produce using the SGPLOT procedure.

18 Here are several plots used to diagnose your model, including analyses of residuals. A residual is the difference between an actual data point and its estimated value based on the regression model. A big residual means that point is not typical for that value of X. You can learn about these in a class on regression. Remember you can get plots in addition to tabular output by turning ODS graphics on. If you do not want the plots, use ODS GRAPHICS OFF.
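Turning the plots on and off around a procedure can be sketched as follows. This assumes the weight dataset from Program 4, with sex as the gender variable; the QUIT statement is added because PROC REG is interactive and otherwise stays active after RUN.

```sas
/* Request the fit plot and diagnostic panel from PROC REG. */
ODS GRAPHICS ON;
PROC REG DATA=weight;
  MODEL bmi = age;
  WHERE sex = 2;
  TITLE 'Simple Linear Regression';
RUN;
QUIT;
/* Suppress procedure graphs for whatever runs next. */
ODS GRAPHICS OFF;
```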

19 Using Comments in Program
Two Purposes
Documenting your program
Temporarily delete part of a program
See page 3 LSB

I want to spend a little time talking about using comments in your program. In general, comments are text in your program that SAS ignores when it processes your program. There are two purposes for using comments. One is documenting your program. SAS programs can easily get complicated, and comments let you or someone else looking at the program 6 months from now work out what is being done. The second purpose is to temporarily delete part of your program. I will explain why you might want to do that in the examples that follow. Read page 3 of LSB for a discussion on using comments.

20 Examples of Comment Code
* Run proc univariate for variable BMI;

*
* High resolution graphs can also be produced.
* The following makes a plot of a histogram with the
* best fit normal curve and summary statistics.
*;

PROC MEANS DATA = weight N MEAN STDDEV;
  * CLASS sex ;
  VAR bmi;
run;

PROC MEANS DATA = weight /* N MEAN STDDEV*/;
  CLASS sex ;

There are two ways to include comments in your program. The first is to start a statement with an asterisk (*), followed by your comment, followed by a semicolon. This is called a comment statement. Any statement starting with an asterisk is ignored by the SAS processor, although the statement will still appear in the log. The first statement here is a comment because it starts with an asterisk and ends with a semicolon. So is the next statement, which spans 5 lines. Despite its length, the statement starts with an * and ends with a semicolon. There are certain patterns of characters, like those shown in this second comment, that programmers use to make their comments look nice. Another example is in the first PROC MEANS, where the CLASS statement is a comment because it starts with an *. This is an example of "commenting out" code. It is a simple way to delete the statement for one run of the program. Later on you may want to use the statement again, which you can do by simply removing the asterisk. Now suppose you want to comment out the statistics keywords N MEAN STDDEV on the PROC MEANS statement. You don't want to delete them because you may want to use them later and you might not remember where they go. Since the keywords are only part of a statement, you cannot use the comment statement. Instead you use the second method of commenting, where you start with the two characters /*, followed by the comment, followed by */ (this ends the comment). This method allows you to comment within a statement and across several statements.

21 Temporarily Removing Code: Do not want to produce histogram but may want to run it at another time
PROC UNIVARIATE DATA = weight;
  VAR bmi;
  /*
  HISTOGRAM bmi / NORMAL MIDPOINTS=20 to 40 by 2;
  INSET N = 'N' (5.0) MEAN = 'Mean' (5.1) STD = 'Sdev' (5.1)
        MIN = 'Min' (5.1) MAX = 'Max' (5.1)/
        POS=lm HEADER='Summary Statistics';
  */
  LABEL bmi = 'Body Mass Index (kg/m2)';
RUN;

This is another example that uses the /* method. Here we comment out two consecutive statements. You will note that in the enhanced editor in PC SAS, commented code appears in green (or some other color different from SAS code). This is very helpful because it helps you avoid errors when forming comments. You can clearly see what is and what is not a comment.

22 What is wrong with this program?
* This is my first SAS program
DATA bp;
INFILE ...
(more lines)

Let's test your SAS diagnostic skills. What is wrong with this program? I run the program and I get all kinds of errors. Do you see the error? Well, I forgot to end my comment with a semicolon. So the text DATA bp becomes part of the comment and is ignored by the SAS processor. The INFILE statement is then out of place and you will get more error messages than you think you could ever generate. Look out for this kind of error. The color coding in the editor will help you avoid it.
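For comparison, a corrected version of this kind of program might look like the sketch below. The dataset name bp comes from the slide; the variables sbp and dbp and the inline data values are hypothetical, used here only so the example is self-contained instead of relying on the elided INFILE statement.

```sas
* This is my first SAS program;
DATA bp;
  /* sbp and dbp are hypothetical variables for illustration */
  INPUT sbp dbp;
  DATALINES;
120 80
135 90
;
RUN;
```

With the semicolon in place the comment ends on its own line, so the DATA statement is read as a statement again.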

23 Option Statement
OPTION NOCENTER LINESIZE = 78;
OPTION NODATE NONUMBER;

Many, many options (run PROC OPTIONS)
Usually put at top of program
Can put in autoexec.sas so they will always be in effect.

The OPTION statement is used to control various settings for your session, such as the line and page size of your output window. There are many options, all with default values, but you will usually need to modify just a few. There is also a PROC OPTIONS that displays all the options and their current settings. The option statement is usually placed at the top of your program. You can change options partway through your program, but in most cases you will just set them once at the top and leave them at these settings. You can use one statement for all of your options or use multiple option statements. Here are a few examples. The first option statement tells SAS not to center output and to use a line size of 78 characters. The second option statement tells SAS not to place the date or page number at the top of the output page. If you have options that you would like to always start with, you can put them in a special file called autoexec.sas. You need to place this file in your current folder. SAS will run this file when SAS is started, so you don't need to add these statements to all your programs.
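To confirm that a setting took effect, PROC OPTIONS can be asked about a single option. A small sketch, using the options from the slide; the results are written to the log.

```sas
OPTIONS NOCENTER LINESIZE = 78;
OPTIONS NODATE NONUMBER;

/* Report the current value of one option at a time in the log. */
PROC OPTIONS OPTION=LINESIZE;
RUN;
PROC OPTIONS OPTION=CENTER;
RUN;
```

Running PROC OPTIONS with no OPTION= value lists every option, which is a long listing.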

24 Debugging SAS Programs
Finding and Correcting Errors Chapter 11: LSB

25 Checking the Log
It is always a good idea to check the log window.
Start at the beginning of the log file, and correct the first error.
Sometimes one mistake can create many errors.

26 Missing Semicolons
Missing semicolons are the most common mistake to make.
DATA weight
INFILE 'C:\SAS_Files\tomhs.dat' ;
ERROR: No DATALINES or INFILE statement.

27 How to figure out what happened:
The error said that there wasn't a DATALINES or INFILE statement, but you know that there was one.
SAS must not have identified the INFILE statement as an INFILE statement.
Checking the code shows that SAS thought the INFILE statement was part of the DATA statement because a semicolon was missing.

28 Another Missing Semicolon:
PROC FREQ DATA=weight;
  TABLES sex clinic
  TITLE 'Frequency Distribution of Clinical Center and Gender';
RUN;
ERROR: Variable TITLE not found.

29 How to figure out what happened:
SAS says that the variable TITLE wasn't found. You know that TITLE isn't a variable.
SAS must think that TITLE is part of a list of variables.
There is no semicolon separating TITLE from the variables SEX and CLINIC!

30 Invalid Data
If SAS is expecting a number, but gets text instead, you can get invalid data notes.
@12 clinic $1.
Is replaced with a numeric informat for clinic (the $ is dropped). You get:
NOTE: Invalid data for clinic in line

31 Mixing up PROCs
PROC FREQ DATA=tdata;
  VAR clinic group educ ;
  ---
  180
ERROR 180-322: Statement is not valid or it is used out of proper order.

32 Misspelled Variable in a PROC
PROC FREQ DATA=weight;
  TABLES sex clinic ;
Is replaced with:
  TABLES sex clinc ;
You get:
ERROR: Variable CLINC not found.

33 Uninitialized Variables
From Program 4, if:
  bmi = (weight* )/(height*height);
Is replaced with:
  bmi = (wieght* )/(height*height);
You get:
NOTE: Variable wieght is uninitialized.

34 What's an Uninitialized Variable?
An uninitialized variable is a variable that is used before it has been given a value. SAS creates it and sets it to missing.
This usually occurs when a variable name on the right-hand side (RHS) of an equation is misspelled.
In the example, the note was caused by a misspelling: SAS had no variable called wieght.
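The effect can be reproduced without the TOMHS file. The following is a self-contained sketch; the dataset name demo, the variable ratio and the values 150 and 65 are made up for illustration.

```sas
/* wieght is misspelled on purpose; it is never assigned a value. */
DATA demo;
  weight = 150;
  height = 65;
  ratio  = wieght / height;
RUN;

/* ratio and wieght are both missing in the resulting dataset. */
PROC PRINT DATA=demo;
RUN;
```

The log should contain a note that wieght is uninitialized. Because this is a NOTE rather than an ERROR, the step still runs, which is why the log is worth checking even when output is produced.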

