Lesson 5 - Topics Creating new variables in the data step


Lesson 5 - Topics

Creating new variables in the data step
SAS Functions
Programs 5-6 in course notes
LSB 3:1-6,11-12

Welcome to lesson 5. In this session we will be covering how to create new variables in the data step, including using SAS functions. These topics will be illustrated in programs 5 and 6.

Creating New Variables

Direct assignments (formulas):
    c = a + b ;
    d = 2*a + 3*b + 7*c ;
    bmi = weight/(height*height);

Indirect assignments (if/then/else):
    if age < 50 then young = 1;
    else young = 2;

    if income < 15 then tax = 1;
    else if income < 25 then tax = 2;
    else if income >=25 then tax = 3;

Let's look at how you create new variables in the DATA step, variables that would then be included on the SAS dataset created. There are two ways of creating new variables. The first is what I call direct assignments. You put the new variable name followed by an equals sign followed by the formula. Here we see three examples; the last one we have seen before when defining body mass index.

In some cases there is no direct formula, for example when dividing a variable into levels based on cut-points. To do that you will need to use indirect assignments using if statements or if-then-else statements. Here are two examples, one dividing age into two categories and one dividing income into three categories. We will see how SAS processes these statements next.

Direct Assignments (Formulas)

Example:
    c = a + b ;

So if a = 2 and b = 3, then c = 5.
What if a is missing, what is c? What if b is missing? c will be missing.

Using direct statements is rather easy: on the left-hand side of the equation is the new variable you are creating; on the right-hand side is the arithmetic expression using conventional symbols. Here we define c as "a + b", so if a=2 and b=3 then c will be equal to 5. What if a or b (or both a and b) are missing? What does SAS assign the new variable c? Well, SAS sets the variable as perhaps you would expect: it will set c to missing.
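The rule above can be tried out without any course data. The following is a minimal sketch that reads four made-up rows (the dataset name sums and the values are illustrative assumptions, not TOMHS data) and applies the same formula to each.

```sas
/* Illustrative data only: a and b are made-up values */
DATA sums;
   INPUT a b;
   c = a + b;   /* c is missing whenever a or b is missing */
   DATALINES;
2 3
. 3
2 .
. .
;
RUN;

PROC PRINT DATA=sums;
RUN;
```

Only the first row should get a value for c (5); the other three rows should show c as missing. SAS also writes a note to the log saying that missing values were generated as a result of performing an operation on missing values.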

If/then/else Statements

With if-then-else definitions SAS stops executing after the first true statement.

    if income < 15 then tax = 1;
    else if income < 25 then tax = 2;
    else if income >=25 then tax = 3;

What if income is 10? tax = 1
What if income is 23? tax = 2
What if income is 30? tax = 3
What if income is missing?

Here is an example of using if-then-else statements to create a new variable. The important thing to remember with these statements is that SAS will stop executing the if statements after the first true statement is encountered. So what happens when income equals 10? Well, the first if statement will be true, so the variable tax will be set to 1. Notice the second if statement is also true; however, as indicated, SAS stops executing the statements when the first statement is true (here when income is less than 15), so the second statement is not executed.

Going through the logic you will note that when income is 23 then tax will be set to 2 (the second if statement is the first true statement) and when income is 30 then tax will be set to 3 (the last if statement is the first true statement).

What if income is missing, what then is the value of tax? It turns out that the variable tax will be set to 1, because SAS treats a missing numeric value as smaller than any number in a comparison, so the first statement is true when income is missing. This is an important thing to remember when using if-then-else statements to create new variables. So our code here is not exactly what we would want. We will look at other examples and how to deal with missing data when creating new variables in the next program.
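The four questions on this slide can be checked directly. This is a small sketch using the slide's own if-then-else chain on four made-up income values; the dataset name taxdemo is an assumption for illustration.

```sas
/* Illustrative data: the four incomes asked about on the slide */
DATA taxdemo;
   INPUT income;
   IF income < 15 THEN tax = 1;
   ELSE IF income < 25 THEN tax = 2;
   ELSE IF income >= 25 THEN tax = 3;
   DATALINES;
10
23
30
.
;
RUN;

PROC PRINT DATA=taxdemo;
RUN;
```

The listing should show tax values of 1, 2, 3 and 1. The last row is the one to watch: the missing income satisfies income < 15, so it is given tax = 1.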

Creating New Variables

Create a new variable with 2 levels, one for college graduates and one for non-college graduates.

In TOMHS, data on education level were collected as shown here. There are 9 categories of education. Suppose you want to do some analyses combining categories so that there are just 2 levels: college graduate and non-college graduate. To do this you would create a new variable, say called grad, based on the original variable for education. The new variable would have two levels, one value for college graduates and another value for non-college graduates.

A look at the values for education would indicate that values of 7-9 need to be combined to indicate a college graduate and values of 1-6 need to be combined to indicate a non-college graduate. If the data were missing for education we would want the new variable to also be missing. There are a few ways to create such a new variable. We will look at them in Program 5.

Program 5

    DATA tdata;
    INFILE 'C:\SAS_Files\tomhs.data' ;
    INPUT @  1 ptid  $10.
          @ 49 educ   1.
          @123 sbp12  3. ;

    * This way will code missing values to the value 2;
    if educ < 7 then grad1 = 2 ;
    else if educ >=7 then grad1 = 1 ;

    * The next two ways are equivalent and are correct;
    if educ < 7 and educ ne . then grad2 = 2;
    else if educ >=7 then grad2 = 1;

    * IN is a special operator in SAS ;
    if educ IN(1,2,3,4,5,6) then grad3 = 2;
    else if educ IN(7,8,9) then grad3 = 1;

New variable definitions go after the INPUT statement.

The program begins by reading in three variables from the TOMHS dataset: variables for patient ID, education, and systolic BP. New variables are defined after the input statement as seen here, i.e. you cannot create a new variable based on education until education is read in. We are going to create three new variables to represent college degree status to illustrate the different ways to accomplish this and to show the problems that can occur. Each is going to use IF-THEN-ELSE statements.

The first way defines the variable grad1. If the value for education (variable educ) is less than 7 then grad1 is assigned a value of 2; else if the value of educ is 7 or greater then grad1 is assigned a value of 1. (Note: It does not really matter what 2 values we choose for grad1, just that the values are different. I will talk more on this later.) There is only one problem with this syntax: if educ is missing then grad1 is assigned a value of 2. This is because, as noted before, SAS treats a missing numeric value as smaller than any number. So the statement educ < 7 is true if educ is missing.

We do not want this, so we have to do a little more coding. If we replace the first if statement with a compound if statement as seen above (if educ < 7 and educ ne .) we solve the problem. How so? Well, if educ is missing then both IF statements that define grad2 will be false (check this out yourself). Thus grad2 will never be assigned a value through these IF statements, and since new variables start out as missing, grad2 will be (or rather stay) missing.

For the 3rd way we use the IN operator. If educ is "in" any of the listed values the statement will be true. This is a great way to define new variables if the original variable takes on integer values and there are not too many of them. Missing values for educ will not be assigned a value for grad3 because both if statements are false if education is missing.
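To try the three codings without the TOMHS file, here is a small self-contained sketch. The dataset name graddemo and the five education values are made up for illustration; the IF statements are the ones from Program 5.

```sas
/* Illustrative data: five made-up education codes, one missing */
DATA graddemo;
   INPUT educ;

   /* Way 1: a missing educ is coded as 2 */
   IF educ < 7 THEN grad1 = 2;
   ELSE IF educ >= 7 THEN grad1 = 1;

   /* Way 2: compound IF leaves grad2 missing when educ is missing */
   IF educ < 7 AND educ NE . THEN grad2 = 2;
   ELSE IF educ >= 7 THEN grad2 = 1;

   /* Way 3: IN lists the valid values explicitly */
   IF educ IN (1,2,3,4,5,6) THEN grad3 = 2;
   ELSE IF educ IN (7,8,9) THEN grad3 = 1;

   DATALINES;
3
6
7
9
.
;
RUN;

PROC PRINT DATA=graddemo;
RUN;
```

For the row with missing educ, grad1 should be 2 while grad2 and grad3 stay missing. The other four rows should agree across all three variables (2, 2, 1, 1).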

    PROC FREQ DATA=tdata;
    TABLES educ grad1 grad2 grad3 ;

                                     Cumulative    Cumulative
    educ    Frequency     Percent     Frequency      Percent
    ---------------------------------------------------------
       1           3        3.03             3         3.03
       3           4        4.04             7         7.07
       4          23       23.23            30        30.30
       5          14       14.14            44        44.44
       6          12       12.12            56        56.57
       7          16       16.16            72        72.73
       8          10       10.10            82        82.83
       9          17       17.17            99       100.00

    Frequency Missing = 1

                                      Cumulative    Cumulative
    grad1    Frequency     Percent     Frequency      Percent
    -----------------------------------------------------------
        1          43       43.00            43        43.00
        2          57       57.00           100       100.00

    (Coded the missing value for educ to 2)

                                      Cumulative    Cumulative
    grad2    Frequency     Percent     Frequency      Percent
    -----------------------------------------------------------
        1          43       43.43            43        43.43
        2          56       56.57            99       100.00

    grad3    Frequency     Percent     Frequency      Percent

To check that our coding did what we wanted it to do, we can run a PROC FREQ on each of the new variables along with the original education variable. There is one person missing education. For grad1 note there are no missing observations. That is because the observation with education missing got coded as a 2 for grad1 by the way we defined the variable. The coding for grad2 took care of missing data properly, so you get the correct coding. The variable grad3, which was defined using the IN operator, also produces the correct coding.

By displaying the frequency for the original variable and the new variable we can somewhat check our work. We can see that the sum of the frequencies for education levels 1 through 6 (3 + 4 + 23 + 14 + 12 = 56) equals the number for category 2 for the variable grad2.

    PROC FREQ DATA=tdata;
    TABLES educ*grad1 /MISSING NOCUM NOPERCENT NOROW NOCOL;
    TITLE 'Use Crosstabulation to Verify Recoding';
    RUN;

    Table of educ by grad1

    educ      grad1
    Frequency|       1|       2|  Total
    ---------+--------+--------+
           . |      0 |      1 |      1
           1 |      0 |      3 |      3
           3 |      0 |      4 |      4
           4 |      0 |     23 |     23
           5 |      0 |     14 |     14
           6 |      0 |     12 |     12
           7 |     16 |      0 |     16
           8 |     10 |      0 |     10
           9 |     17 |      0 |     17
    Total          43       57      100

This shows that the missing value for educ got assigned a value of 2.

The surest way to verify your recoding is to perform a cross-tabulation of the original and new variable. Displayed here is the distribution of the variable educ cross-tabulated with grad1. The MISSING option was added to the TABLES statement of the FREQ procedure to display missing as a separate category. We see that the missing value for educ was assigned a value of 2 for grad1. This would tell us that our coding was incorrect. If we do the same cross-tabulation with the variable grad2 we will see that the assignment for missing values was correct, i.e. missing values for educ get assigned as missing for variable grad2.

    PROC FREQ DATA=tdata;
    TABLES educ*grad2 /MISSING NOCUM NOPERCENT NOROW NOCOL;
    TITLE 'Use Crosstabulation to Verify Recoding';
    RUN;

    Table of educ by grad2

    educ      grad2
    Frequency|       .|       1|       2|  Total
    ---------+--------+--------+--------+
           . |      1 |      0 |      0 |      1
           1 |      0 |      0 |      3 |      3
           3 |      0 |      0 |      4 |      4
           4 |      0 |      0 |     23 |     23
           5 |      0 |      0 |     14 |     14
           6 |      0 |      0 |     12 |     12
           7 |      0 |     16 |      0 |     16
           8 |      0 |     10 |      0 |     10
           9 |      0 |     17 |      0 |     17
    Total           1       43       56      100

This shows that the missing value for educ got assigned a value of missing.

Here is the cross-tabulation with the variable grad2; we see that the assignment for missing values was correct, i.e. missing values for educ get assigned as missing for variable grad2.

    * Recode sbp12 into 3 levels;
    if sbp12 = .         then sbp12c = . ;
    else if sbp12 < 120  then sbp12c = 1 ;
    else if sbp12 < 140  then sbp12c = 2 ;
    else if sbp12 >=140  then sbp12c = 3 ;

With if-then-else definitions SAS stops executing after the first true statement.

Values < 120 will be assigned value of 1
Values 120-139 will be assigned value of 2
Values >=140 will be assigned value of 3
Missing values will be assigned to missing

Now let's look at another example, this time dividing a variable into more than two categories. We will take our BP variable and divide it into three categories. We will use IF-THEN-ELSE coding to do this. As indicated before, the way these statements work is that after the first IF part is true the statement will stop. If the IF portion is false then the next IF portion will execute.

See if you can follow the logic for the coding of the new blood pressure variable. Keep in mind that once an IF portion is true the new value will be assigned and the statement will stop. If the value for sbp12 is missing (represented as a period) then the new variable sbp12c is set to missing. If the value is less than 120 (but not missing) then a value of 1 is assigned. If the value is less than 140 (but not missing and not less than 120) then a value of 2 is assigned. Lastly, if the value is greater than or equal to 140 (but not missing and not less than 120 and not less than 140) a value of 3 is assigned.

As you can see, you need to be very careful in defining new variables using if-then-else statements. Note also how I lined up the statements. This is for readability and for checking your code. This is a good practice to follow.

    PROC FREQ DATA=tdata;
    TABLES sbp12c sbp12;
    RUN;

OUTPUT

                                       Cumulative    Cumulative
    sbp12c    Frequency     Percent     Frequency      Percent
    -----------------------------------------------------------
         1          36       39.13            36        39.13
         2          43       46.74            79        85.87
         3          13       14.13            92       100.00

    Frequency Missing = 8

                                       Cumulative    Cumulative
    sbp12     Frequency     Percent     Frequency      Percent
    -----------------------------------------------------------
        93           1        1.09             1         1.09
        94           1        1.09             2         2.17
       101           1        1.09             3         3.26
       104           1        1.09             4         4.35
       105           1        1.09             5         5.43
    (more values)
       147           1        1.09            87        94.57
       148           1        1.09            88        95.65
       149           1        1.09            89        96.74
       153           1        1.09            90        97.83
       154           1        1.09            91        98.91
       158           1        1.09            92       100.00

Here are the PROC FREQ results for the new and original variables. You should always have the same number of missing values; as you see here there are 8. If they are different there is reason to think your code to define the new variable is incorrect.

    * Easy but costly error to make;
    if sbp12 = .         then sbp12c = . ;
    else if sbp12 < 120  then sbp12c = 1 ;
    else if sbp12 < 140  then sbp12  = 2 ;
    else if sbp12 >=140  then sbp12c = 3 ;

    PROC FREQ DATA=tdata;
    TABLES sbp12c;
    RUN;

    The FREQ Procedure

                                       Cumulative    Cumulative
    sbp12c    Frequency     Percent     Frequency      Percent
    -----------------------------------------------------------
         1          36       73.47            36        73.47
         3          13       26.53            49       100.00

    Frequency Missing = 51

How come no values of 2 and why so many missing?

Let's look at an example to show how easy it is to make a mistake in your coding. Do you see what happened here? Why are there so many missing values for the new variable and why are there no values of 2? Look up at the code. There is an error in the third line of the statement, where I typed sbp12 = 2 instead of sbp12c = 2. The 43 patients who should have been coded 2 were left missing for sbp12c, which together with the 8 patients truly missing sbp12 gives the 51 missing values.

You would not get an error, warning, or note in the SAS log. You would just have to go back and look at your code. That is why it is good practice to make your code very readable. This error also points out the fact that if/then coding is very flexible; it is not just for creating new variables. In the example above we actually "wrote over" the variable sbp12, which is allowed in SAS.
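The effect of this typo is easy to see on a handful of rows. The sketch below uses four made-up blood pressure values and an extra variable, orig, that keeps a copy of the value as read; both the dataset name typodemo and the orig variable are additions for illustration and are not part of the course program.

```sas
/* Illustrative data: three made-up pressures and one missing value */
DATA typodemo;
   INPUT sbp12;
   orig = sbp12;   /* copy of the value as read, so the overwrite is visible */

   IF sbp12 = . THEN sbp12c = .;
   ELSE IF sbp12 < 120 THEN sbp12c = 1;
   ELSE IF sbp12 < 140 THEN sbp12 = 2;    /* typo: should be sbp12c = 2 */
   ELSE IF sbp12 >= 140 THEN sbp12c = 3;

   DATALINES;
110
130
150
.
;
RUN;

PROC PRINT DATA=typodemo;
RUN;
```

For the row read as 130, sbp12 should now print as 2 and sbp12c as missing, while the rows read as 110 and 150 keep their values and get sbp12c of 1 and 3. Comparing the original and recoded variables side by side like this is one way to catch the problem that the log does not report.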

Important Facts When Creating New Variables

1. New variables are initialized to missing
2. Missing values are < any value
       if var < value   (true if var is missing)
3. Reference missing values for numeric variables as .
4. Reference missing values for character variables as ' '
       if sbp = . then ...   (or if missing(sbp))
       if clinic = ' ' then ...

Here is a summary of important facts when creating new variables, in particular when using IF-THEN-ELSE logic. First, new variables are initialized to missing; so if none of your IF statements are true when defining the new variable then the new variable will be missing. Next, missing values are less than any value; thus every IF statement of the form IF var < value will be true when var is missing. Also, you reference missing numeric variables with a period (.) and missing character variables with a blank (enclosed by quotes). There is also a MISSING function which you can use to reference missing values whether the variable is character or numeric.
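As a quick check of facts 3 and 4, the following sketch builds a two-row dataset by hand and flags missing values both ways. The dataset names missdemo and flags, and the flag variable names, are made up for illustration.

```sas
/* Build two rows by hand: one complete, one with both values missing */
DATA missdemo;
   LENGTH clinic $ 1;
   sbp = 120; clinic = 'A'; OUTPUT;
   sbp = .;   clinic = ' '; OUTPUT;
RUN;

DATA flags;
   SET missdemo;
   sbp_miss_fn    = MISSING(sbp);      /* 1 if missing, 0 otherwise */
   clinic_miss_fn = MISSING(clinic);   /* works for character variables too */
   sbp_miss_eq    = (sbp = .);         /* same test written as a comparison */
   clinic_miss_eq = (clinic = ' ');
RUN;

PROC PRINT DATA=flags;
RUN;
```

All four flags should be 0 on the first row and 1 on the second, showing that MISSING() and the explicit comparisons agree for both variable types.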

SAS Handling of Missing Data When Creating New Variables

Direct assignments (formulas):
    c = a + b ;
    d = 2*a + 3*b + 7*c ;
    bmi = weight/(height*height);

If any variable on the right-hand side is missing then the new variable will be missing.

Indirect assignments:
    if age < 50 then young = 1;
    else young=2;

New variables are initialized to missing but may be given a value if any of the IF statements are true.

Here is a summary of how SAS handles missing data when creating new variables. With direct assignments, i.e. a formula, there is no worry about missing values, since if any variable on the right-hand side of the equation is missing then the new variable will be missing. This is what you would usually want. For indirect statements you need to be careful using the less than (<) or less than or equal to (<=) operators. Avoid this problem by using a compound IF (and variable ne .) or using a separate IF in the beginning to take care of missing values (see the earlier examples to see how this was done).

Checks you can make to be sure new variables are created correctly

Display original and new variables.
    PROC PRINT DATA=tdata (OBS=20);
    VAR educ college ;

Run PROC MEANS on original and new variable. Make sure both variables have the same number of missing values.
    PROC MEANS DATA=tdata;
    VAR educ college;

Run PROC FREQ on original and new variable.
    PROC FREQ DATA=tdata;
    TABLES educ college educ*college;

Here are some suggestions to check if new variables are created correctly. You can use PROC PRINT to compare the original and new variables. This is a bit tedious but is a direct way to know you are coding the new variable correctly. You will want to do this on a subset of cases. Perhaps a better check is using PROC FREQ or PROC MEANS. Check that there are the same number of missing values in the original and new variables. Check also for gaps and missing levels that you would not expect to be in the new variable. A complete cross-tabulation could also be done.

What Value to Set New Variable

First coding (1 and 2):
    if age < 20 then teenager = 1;
    else if age >=20 then teenager = 2;

Second coding (1 and 0):
    if age < 20 then teenager = 1;
    else if age >=20 then teenager = 0;

Third coding (character):
    if age < 20 then teenager = 'YES';
    else if age >=20 then teenager = 'NO';

As mentioned earlier, when you code a new variable into 2 levels it does not matter what 2 values you use. In the example here the first coding assigns for teenager a value of 1 and 2; the second example uses 1 and 0; the third example defines teenager as a character variable and assigns values of YES and NO. There are advantages to each.

Coding as 1 or 2 will display the affirmative level (the YES value) first in a PROC FREQ (coding 1 or 0, or YES and NO, will display the opposite order). Using 0/1 coding has the advantage that running PROC MEANS on the variable will display the fraction of YES's in the MEAN statistic and the number of YES's in the SUM statistic. This can be useful when you have many YES/NO variables. It is also a useful coding when you are using the variable as a dummy or indicator variable in regression analyses. (Note that coding as 0 and 100 would give you the percent rather than the fraction when using PROC MEANS.)

Using character variables to represent the categories is usually not a good practice. For example, you could not use PROC MEANS for this variable or use the variable in a regression equation. It is better to assign them as numeric and use a format, if desired, to clarify the values.
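The point about 0/1 and 0/100 coding can be seen on a few rows. This is a sketch with five made-up ages; the dataset name teendemo, the variable teen100, and the leading check for a missing age are assumptions added for the illustration and are not on the slide.

```sas
/* Illustrative data: four made-up ages and one missing value */
DATA teendemo;
   INPUT age;
   IF age = . THEN teenager = .;        /* added guard for missing age */
   ELSE IF age < 20 THEN teenager = 1;
   ELSE teenager = 0;
   teen100 = 100 * teenager;            /* 0/100 version of the same indicator */
   DATALINES;
15
18
25
40
.
;
RUN;

PROC MEANS DATA=teendemo N MEAN SUM;
   VAR teenager teen100;
RUN;
```

With two teenagers among four non-missing ages, the MEAN for teenager should be 0.5 (the fraction) and the SUM 2 (the count), while the MEAN for teen100 should be 50 (the percent).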

If-then-do statements

    * Conditionally execute several statements;
    * Create indicator variables for race;
    * Make sure race variable not missing;
    if race ne . then do;
       white = 0; black = 0; asian = 0; other = 0;
       if race = 1 then white = 1;
       if race = 2 then black = 1;
       if race = 3 then asian = 1;
       if race = 4 then other = 1;
    end;

You can conditionally execute statements by using if-then-do logic. Here is an example. We want to create indicator variables for four race groups based on the values of variable race. Before assigning values to each indicator variable we want to be sure the variable race is not missing. We do that with the if-then-do statement as the first statement here. Only if race is not missing will the next statements (up to the end statement) be executed. If race is missing then none of the statements before the end will be executed. Thus, the value for "white" will remain missing (same with the other indicator variables). You "end" if-then-do logic with the end statement. After the end statement, execution continues with the next statement after the end.

    proc means n mean sum;
    var white black asian other;
    run;

    The MEANS Procedure

    Variable      N            Mean             Sum
    -----------------------------------------------
    white       100       0.8100000      81.0000000
    black       100       0.1800000      18.0000000
    asian       100       0.0100000       1.0000000
    other       100               0               0

Using proc means we get the number and proportion of each race. These indicator variables could also be entered into a regression model.

DO LOOPS WITH ARRAYS

- Used to shorten code
- Used when repeating same code
- Used with DO/END loop

    ARRAY wtlb(3) wt1 wt2 wt3;
    ARRAY wtkg(3) newwt1 newwt2 newwt3;

    DO index = 1 to 3;
       wtkg(index) = wtlb(index) / 2.2;
    END;

    /* same as the following code
    Newwt1 = wt1 / 2.2 ;
    Newwt2 = wt2 / 2.2;
    Newwt3 = wt3 / 2.2;
    *************************************/

There is another tool that can be used in the DATA step that is useful when you want to apply the same function or formula to several variables. This is the ARRAY statement, which in conjunction with DO loops can substantially reduce the amount of code to write. Use of ARRAYS and a DO loop is best illustrated with an example.

Suppose we have several variables that are weights recorded in pounds and we want to compute a new variable for each weight that is the weight in kilograms. We could have a separate statement for each variable as shown here at the bottom. However, if we had many weight variables this would require many lines of repeated, similar statements. With ARRAYS and DO loops we can replace the multiple assignment statements with a single assignment statement inside a loop.

We first define an array called wtlb that has 3 variables (or elements) with weight in pounds (variables wt1-wt3). We define a similar array called wtkg that will contain three new variables (yet to be defined) that will contain the corresponding weights in kilograms. Our DO loop will loop 3 times, once for each of the weight variables. The word following the DO creates a variable called index that takes on the values 1, 2, and 3, incrementing by one each time the loop executes. The loop ends after the third iteration.

The first time through the loop index will be equal to 1. The next statement will take the first variable in the array wtkg (variable newwt1) and assign it the value of the first variable in the array wtlb (variable wt1) divided by 2.2. When the END statement is reached the loop will start over for index = 2. This continues until the loop has executed 3 times.
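The array code from the slide can be run as it stands once there are some weights to convert. In this sketch the dataset name wtdemo and the two rows of weights are made up, and the DROP statement is an addition so that the loop counter is not kept on the dataset.

```sas
/* Illustrative data: two made-up people with three weights in pounds */
DATA wtdemo;
   INPUT wt1 wt2 wt3;

   ARRAY wtlb(3) wt1 wt2 wt3;              /* existing variables, pounds */
   ARRAY wtkg(3) newwt1 newwt2 newwt3;     /* new variables, kilograms */

   DO index = 1 TO 3;
      wtkg(index) = wtlb(index) / 2.2;
   END;

   DROP index;   /* added here: keep the loop counter off the dataset */

   DATALINES;
220 110 165
154 . 198
;
RUN;

PROC PRINT DATA=wtdemo;
RUN;
```

The first row should convert to roughly 100, 50 and 75 kilograms. In the second row newwt2 stays missing because wt2 is missing, which follows the direct-assignment rule for missing values.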

    * Program 6 SAS Functions ;
    DATA example;
    INFILE 'C:\SAS_Files\tomhs.data' ;
    INPUT @058 height    4.
          @085 weight    5.
          @172 ursod     3.
          @236 (se1-se10) (1.0 + 1);

    bmi    = (weight*703.0768)/(height*height);
    rbmi1  = ROUND(bmi,1);
    rbmi2  = ROUND(bmi,.1);
    lursod = LOG(ursod);
    seavg  = MEAN (OF se1-se10);
    semax  = MAX (OF se1-se10);
    semin  = MIN (OF se1-se10);

We go on to program 6 of the course. This program will illustrate several functions that can be used to create new variables in the DATA step. We read in height, weight, urinary sodium excretion, and the first 10 side-effect variables. We first create a new variable bmi based on height and weight. The ROUND function is then used to round bmi to the nearest integer; it has an optional second parameter that tells SAS to what decimal level to round the variable. The variable rbmi2 will be rounded to the nearest tenth.

We use the LOG function to compute a new variable that is the natural log of the variable ursod. This type of transformation is often used to make the variable more closely follow a normal distribution, which is an assumption for some statistical tests.

The MEAN, MIN, and MAX functions illustrated here can be very useful. The MEAN function computes the average of all non-missing values in the list of variables enclosed in parentheses. If the variables in the list have a common root in their name, followed by a number, you can use the shorthand notation as shown here. If you use this shorthand notation you need to preface the list with the keyword OF. This tells SAS that what follows is a list of variables and not the numerical difference between two variables. The MEAN function is used here to compute an average side-effect score, the average of the 10 side-effect variables. You can then compare treatment groups for this (summary) variable, in place of or in addition to comparing groups for each of the individual side-effects.

The MAX function here is used to compute the "worst" level of side effect in the list (the highest value). You may want, for example, to display patients that have a value of 4 for this variable. The MIN function does not have much utility here but would be useful if, say, the list of variables was the times for a person to complete a race. The MIN function would result in the fastest (lowest) time.
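A scaled-down version of Program 6 can be run without the TOMHS file. In this sketch the dataset name funcdemo, the two rows of values, and the use of only three side-effect variables are assumptions made to keep the example short; the functions are the ones from the program.

```sas
/* Illustrative data: height (in), weight (lb), ursod, three side-effect scores */
DATA funcdemo;
   INPUT height weight ursod se1-se3;

   bmi    = (weight * 703.0768) / (height * height);
   rbmi1  = ROUND(bmi, 1);      /* nearest integer */
   rbmi2  = ROUND(bmi, .1);     /* nearest tenth */
   lursod = LOG(ursod);         /* natural log, missing if ursod is missing */
   seavg  = MEAN(OF se1-se3);   /* average of the non-missing scores */
   semax  = MAX(OF se1-se3);
   semin  = MIN(OF se1-se3);

   DATALINES;
70 180 120 1 2 1
64 150 . 1 . 4
;
RUN;

PROC PRINT DATA=funcdemo;
RUN;
```

In the second row se2 is missing, so seavg should be the average of 1 and 4 (2.5), with semax 4 and semin 1. The missing ursod on that row gives a missing lursod.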

    * Use of dash notation ;
    seavg = MEAN (OF se1-se10);

This is the same as

    seavg = MEAN (se1,se2,se3,se4,se5,se6,se7,se8,se9,se10);

The OF is very important. Otherwise SAS thinks you are subtracting se10 from se1.
To use this notation the ROOT of the name must be the same.

Let's take a closer look at the MEAN function and using the dash notation. Using the dash notation in the list of variables is equivalent to listing each variable separated by a comma. The two statements here will produce the same results. Note again that the OF is very important. Without it SAS will subtract se10 from se1 instead of averaging the ten variables. To use this notation the root of the name must be the same.

    * Two ways of computing average ;
    seavg = MEAN (se1,se2,se3,se4,se5,se6,se7,se8,se9,se10);

Versus

    seavg = (se1+se2+se3+se4+se5+se6+se7+se8+se9+se10)/10;

Using the MEAN function computes the average of non-missing values. The result is missing only if all values are missing.
Using the + formula requires all values to be non-missing; otherwise the result will be missing.

A common question is how the MEAN function compares to using the formula for the mean. The code for each of these using the side-effect data is shown here. At first you may think these two methods are equivalent. However, if any variable in the list has a missing value they will not yield the same result. This is because the MEAN function computes the average of all non-missing variables in the list. The result will be missing only if all the variables in the list are missing. Using the formula for the mean will result in a missing value if any variable in the list is missing. (Remember: if any variable on the right-hand side of a formula is missing the new variable will be missing.)

Sometimes you may want to require a certain number of variables in the list to be non-missing in order to compute the mean. You can accomplish this easily with an IF statement and the N function as shown here:

    if N(of se1-se10) > 5 then seavg = MEAN(of se1-se10);

What does this statement do? It will compute the mean of the 10 items only if more than 5 of the items are non-missing. Otherwise the result will be missing. The N function counts the number of non-missing variables in the list. There is a corresponding NMISS function which counts the number of missing values in a list.
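The difference between the two averages, and the effect of the N requirement, shows up clearly on a few rows. This sketch uses four made-up side-effect variables instead of ten and requires more than 2 non-missing values instead of more than 5; the dataset and new variable names are illustrative assumptions.

```sas
/* Illustrative data: four made-up scores per row, with varying missingness */
DATA avgdemo;
   INPUT se1-se4;

   avg_fn      = MEAN(OF se1-se4);            /* ignores missing values */
   avg_formula = (se1 + se2 + se3 + se4) / 4; /* missing if any value is missing */
   nvalid      = N(OF se1-se4);               /* count of non-missing values */
   nmissing    = NMISS(OF se1-se4);           /* count of missing values */

   /* Average only when more than 2 of the 4 scores are present */
   IF N(OF se1-se4) > 2 THEN avg_min3 = MEAN(OF se1-se4);

   DATALINES;
1 2 3 4
1 . 3 4
1 . . 4
. . . .
;
RUN;

PROC PRINT DATA=avgdemo;
RUN;
```

Only the first row should have a value for avg_formula. The third row has just two non-missing scores, so avg_fn is 2.5 but avg_min3 stays missing. The last row is missing for every average, with nvalid 0 and nmissing 4.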

    * Compute 10 new variables, 100 if se is present and 0 if not;
    ARRAY se (10) se1-se10;
    ARRAY hse(10) hse1-hse10;     (hse1-hse10 are the new variables)

    DO senumber = 1 to 10;
       if se(senumber) = 1 then hse(senumber) = 0;
       else if se(senumber) in(2,3,4) then hse(senumber) = 100;
    END;

For senumber = 1 the code is:

    if se1 = 1 then hse1 = 0;
    else if se1 in(2,3,4) then hse1 = 100;

We will use arrays in program 6 to compute a new variable for each side-effect variable that is a re-coding of the original variable. The original variables are se1-se10. Each of these variables takes on a value from 1 to 4, with a 1 meaning the patient is not bothered by that condition and 2-4 meaning the patient has that condition to different degrees (mild, moderate, or severe). We want to create 10 new variables that take on the value 100 if the patient is bothered at all and 0 if not bothered. Thus, we will assign a value of 1 in the original variable to 0 for the new variable and a value of 2, 3, or 4 to 100.

We set up two arrays, one (SE) to hold the original variables and one (HSE) to contain the new variables. Each array has 10 elements or variables. The DO loop is executed 10 times, once for each variable, assigning the new variables to 0 or 100 depending on the original variables. The DO loop variable senumber can be named anything; it serves as the index into the array, taking on the values 1 to 10 as the loop iterates. The equivalent code not using arrays is given here for the first variable.

PROC PRINT DATA = example (OBS=10);
  VAR bmi rbmi1 rbmi2 seavg semin semax ;
  TITLE 'Listing of Selected Data for 10 Patients ';
RUN;

PROC FREQ DATA = example;
  TABLES semax;
  TITLE 'Distribution of Worse Side Effect Value';
  TITLE2 'Side Effect Scores Range from 1 to 4';
RUN;

PROC MEANS DATA = example;
  VAR hse1-hse10;
  TITLE 'Percent of Patients With Condition by Condition';
RUN;

ods graphics on;
PROC UNIVARIATE DATA = example ;
  VAR ursod lursod;
  QQPLOT ursod lursod;
  TITLE 'Quantile Plots for Urine Sodium Data';
RUN;

Here we run a PROC PRINT to display some of the new variables we created, a PROC FREQ to display the frequency distribution of the worst side-effect value, a PROC MEANS on the new side-effect variables, and a PROC UNIVARIATE to compare the untransformed and log-transformed urinary sodium variables.

Listing of Selected Data for 10 Patients

Obs      bmi   rbmi1   rbmi2   seavg   semin   semax
  1  28.2620      28    28.3     1.1       1       2
  2  35.9963      36    36.0     1.0       1       1
  3  27.0489      27    27.0     1.0       1       1
  4  28.2620      28    28.3     1.1       1       2
  5  33.2008      33    33.2     1.0       1       1
  6  27.7691      28    27.8     1.2       1       2
  7  32.6040      33    32.6     1.0       1       1
  8  22.4057      22    22.4     1.2       1       2
  9  37.2037      37    37.2     1.1       1       2
 10  33.1717      33    33.2     1.7       1       3

Here you see the 3 BMI variables: the original value, the second rounded to the nearest integer, and the third rounded to the nearest tenth. Of these 10 patients, patient 10 has the highest average side-effect score (1.7). Patient 1 has a semax value of 2, meaning he or she had at least one side-effect value of 2 but no values of 3 or 4.

Distribution of Worse Side Effect Value
Side Effect Scores Range from 1 to 4

The FREQ Procedure

                                  Cumulative    Cumulative
semax    Frequency     Percent     Frequency      Percent
----------------------------------------------------------
    1          33       33.00            33        33.00
    2          52       52.00            85        85.00
    3          13       13.00            98        98.00
    4           2        2.00           100       100.00

The frequency distribution for semax shows that 2 patients had at least 1 severe side effect (a value of 4). This distribution could also be compared between the placebo and drug groups.
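One way to make the placebo versus drug comparison mentioned above is a two-way table in PROC FREQ. The sketch below is only an assumption about how this could be set up: the variable name group, the dataset name freq_demo, and the six data lines are hypothetical and are not taken from the lesson data.

```sas
* Hypothetical data with a treatment group and the worst side-effect value;
DATA freq_demo;
  INPUT group $ semax;
  DATALINES;
placebo 1
placebo 2
placebo 1
drug 2
drug 3
drug 4
;
RUN;

PROC FREQ DATA = freq_demo;
  * Rows are groups and columns are semax values, showing counts and row percents;
  TABLES group*semax / NOCOL NOPERCENT;
  TITLE 'Worst Side Effect Value by Treatment Group (hypothetical data)';
RUN;
```

The NOCOL and NOPERCENT options suppress the column and overall percentages, so each cell shows the count and the percentage within its treatment group.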

These means are the percent of patients with each side effect (se)

Percent of Patients With Condition by Condition Type

The MEANS Procedure

Variable      N          Mean       Std Dev    Minimum        Maximum
hse1        100    12.0000000    32.6598632          0    100.0000000
hse2        100    21.0000000    40.9360181          0    100.0000000
hse3        100     8.0000000    27.2659924          0    100.0000000
hse4        100    13.0000000    33.7997669          0    100.0000000
hse5        100    10.0000000    30.1511345          0    100.0000000
hse6        100    30.0000000    46.0566186          0    100.0000000
hse7        100    16.0000000    36.8452949          0    100.0000000
hse8        100    31.0000000    46.4823199          0    100.0000000
hse9        100     7.0000000    25.6432400          0    100.0000000
hse10       100    14.0000000    34.8735088          0    100.0000000

The results of the PROC MEANS for the new side-effect variables are presented here. Coding the variables as 0 and 100 makes the mean equal to the percentage of patients with the side effect. The sum, after dividing by 100, would equal the number of patients with the condition. This coding is useful because a single PROC MEANS then gives the percent with each condition in one compact table, rather than requiring a series of frequency distributions from PROC FREQ.
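The relationship between the 0/100 coding, the mean, and the sum can be checked on a tiny example. This is a sketch with hypothetical data: the dataset name pct_demo and the five data lines are made up for illustration.

```sas
* Hypothetical data with five patients and two recoded side-effect items;
DATA pct_demo;
  INPUT hse1 hse2;
  DATALINES;
100 0
0 0
0 100
0 0
100 0
;
RUN;

PROC MEANS DATA = pct_demo N MEAN SUM MAXDEC=1;
  VAR hse1 hse2;
RUN;
```

For hse1, two of five patients have the condition, so the mean is 40.0 (40 percent) and the sum is 200.0, which is 2 patients after dividing by 100. For hse2 the mean is 20.0 and the sum is 100.0.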

Log-transformed value shows a better linear pattern

This is the normal probability (QQ) plot for the untransformed urinary sodium excretion. The pattern of the points does not follow a straight line, so the data do not follow a normal distribution very well. The log-transformed values fit a line much better. You may want to use the log-transformed values in any analyses.
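The comparison of raw and log-transformed values can be reproduced on a small made-up sample. This is a sketch under assumptions: the dataset name qq_demo and the twelve right-skewed values are hypothetical, and the NORMAL(MU=EST SIGMA=EST) option is an addition not shown on the slides that draws a reference line using the estimated mean and standard deviation.

```sas
* Hypothetical right-skewed values standing in for urinary sodium;
DATA qq_demo;
  INPUT ursod @@;
  * LOG is the natural log and is only defined for positive values;
  if ursod > 0 then lursod = LOG(ursod);
  DATALINES;
12 18 25 31 40 47 55 68 90 130 210 400
;
RUN;

ods graphics on;
PROC UNIVARIATE DATA = qq_demo;
  VAR ursod lursod;
  QQPLOT ursod lursod / NORMAL(MU=EST SIGMA=EST);
  TITLE 'Quantile Plots for Raw and Log-Transformed Values (hypothetical data)';
RUN;
```

Points that fall close to the reference line suggest the variable is approximately normal, so the two plots can be compared directly.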