Lesson 4 - Topics Creating new variables in the data step SAS Functions
Creating New Variables Direct assignments(formulas): c = a + b ; d = 2*a + 3*b + 7*c ; bmi = weight/(height*height); Indirect assignments (if/then/else) if age < 50 then young = 1; else young = 2; if income < 15 then tax = 1; else if income < 25 then tax = 2; else if income >=25 then tax = 3;
Direct Assignments (Formulas) Example c = a + b ; So if a = 2, b =3, c = 5; What if a is missing, what is c? C will be missing What if b is missing?
If/then/else Statements With if-then-else definitions SAS stops executing after the first true statement if income < 15 then tax = 1; else if income < 25 then tax = 2; else if income >=25 then tax = 3; What if income is 10? What if income is 23? What if income is 30? What if income is missing? Tax = 1 Tax = 2 Tax = 3 Tax = 1
Create a new variable with 2 levels, one for college graduates and one for non-college graduates. Creating New Variables
Program 5 DATA tdata; INFILE ‘C:\SAS_Files\tomhs.data' ; 1 ptid 49 educ sbp12 3. ; * This way will code missing values to the value 2; if educ < 7 then grad1 = 2 ; else if educ >=7 then grad1 = 1 ; * The next two ways are equivalent and are correct; if educ < 7 and educ ne. then grad2 = 2; else if educ >=7 then grad2 = 1; * IN is a useful function in SAS ; if educ IN(1,2,3,4,5,6) then grad3 = 2; else if educ IN(7,8,9) then grad3 = 1; New variable defines go after the input statement
PROC FREQ DATA=tdata; TABLES educ grad1 grad2 grad3 ; Cumulative Cumulative educ Frequency Percent Frequency Percent Frequency Missing = 1 Cumulative Cumulative grad1 Frequency Percent Frequency Percent Cumulative Cumulative grad2 Frequency Percent Frequency Percent Frequency Missing = 1 Cumulative Cumulative grad3 Frequency Percent Frequency Percent Frequency Missing = 1 Coded the missing value for educ to 2
PROC FREQ DATA=tdata; TABLES educ*grad1 /MISSING NOCUM NOPERCENT NOROW NOCOL; TITLE 'Use Crosstabulation to Verify Recoding'; RUN; Table of educ by grad1 educ grad1 Frequency‚ 1‚ 2‚ Total ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ. ‚ 0 ‚ 1 ‚ 1 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 1 ‚ 0 ‚ 3 ‚ 3 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 3 ‚ 0 ‚ 4 ‚ 4 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 4 ‚ 0 ‚ 23 ‚ 23 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 5 ‚ 0 ‚ 14 ‚ 14 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 6 ‚ 0 ‚ 12 ‚ 12 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 7 ‚ 16 ‚ 0 ‚ 16 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 8 ‚ 10 ‚ 0 ‚ 10 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 9 ‚ 17 ‚ 0 ‚ 17 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ Total This shows that the missing value for educ got assigned a value of 2
* Recode sbp12 into 3 levels; if sbp12 =. then sbp12c =. ; else if sbp12 < 120 then sbp12c = 1 ; else if sbp12 < 140 then sbp12c = 2 ; else if sbp12 >=140 then sbp12c = 3 ; With if-then-else definitions SAS stops executing after the first true statement Values < 120 will be assigned value of 1 Values will be assigned value of 2 Values >=140 will be assigned value of 3 Missing values will be assigned to missing
PROC FREQ DATA=tdata; TABLES sbp12c sbp12; RUN; OUTPUT Cumulative Cumulative sbp12c Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Frequency Missing = 8 Cumulative Cumulative sbp12 Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ (more values) Frequency Missing = 8
* Easy but costly error to make; if sbp12 =. then sbp12c =. ; else if sbp12 < 120 then sbp12c = 1 ; else if sbp12 < 140 then sbp12 = 2 ; else if sbp12 >=140 then sbp12c = 3 ; PROC FREQ DATA=tdata; TABLES sbp12c; RUN; The FREQ Procedure Cumulative Cumulative sbp12c Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Frequency Missing = 51 How come no values of 2 and why so many missing?
Important Facts When Creating New Variable 1.New variables are initialized to missing 2.Missing values are < any value if var < value (true if var is missing) 3.Reference missing values for numeric variables as. 4.Reference missing values for character variables as ' ' if sbp =. then... (or if missing(sbp)) if clinic = ' ' then...
SAS Handling of Missing Data When Creating New Variables Direct assignments(formulas): c = a + b ; d = 2*a + 3*b + 7*c ; bmi = weight/(height*height); If any variable on the right-hand side is missing then the new variable will be missing Indirect assignments if age < 50 then young = 1; else young=2; New variables are initialized to missing but may be given a value if any of the IF statements are true
What Value to Set New Variable if age < 20 then teenager = 1; else if age >=20 then teenager = 2; if age < 20 then teenager = 1; else if age >=20 then teenager = 0; if age < 20 then teenager = ‘YES’; else if age >=20 then teenager = ‘NO’;
* Program 6 SAS Functions ; DATA example; INFILE ‘C:\SAS_Files\tomhs.data' ; height weight ursod (se1-se10) ( ); bmi = (weight* )/(height*height); rbmi1 = ROUND(bmi,1); lursod = LOG(ursod); seavg = MEAN (OF se1-se10); semax = MAX (OF se1-se10); semin = MIN (OF se1-se10);
* Use of dash notation ; seavg = MEAN (OF se1-se10); This is the same as seavg = MEAN (se1,se2,se3,se4,se5,se6,se7,se8,se9,se10); The OF is very important. Otherwise SAS thinks you are subtracting se10 from se1. To use this notation the ROOT of the name must be the same.
* Two ways of computing average ; seavg = MEAN (se1,se2,se3,se4,se5,se6,se7,se8,se9,se10); Versus seavg = (se1+se2+se3+se4+se5+se6+se7+se8+se9+se10)/10; Using mean function computes the average of non- missing values. Result is missing only if all values all missing. Using + formula requires all values be non-missing otherwise result will be missing if N(of se1-se10) > 5 then seavg = MEAN(of se1-se10); What does this statement do?
PROC PRINT DATA = example (OBS=15); VAR bmi rbmi1 rbmi2 seavg semin semax ; TITLE 'Listing of Selected Data for 15 Patients '; RUN; PROC FREQ DATA = example; TABLES semax; TITLE 'Distribution of Worse Side Effect Value'; TITLE2 'Side Effect Scores Range from 1 to 4'; RUN; ods graphics on; PROC UNIVARIATE DATA = example ; VAR ursod lursod; QQPLOT ursod lursod; TITLE 'Quantile Plots for Urine Sodium Data'; RUN;
Listing of Selected Data for 10 Patients Obs bmi rbmi1 seavg semin semax
Distribution of Worse Side Effect Value Side Effect Scores Ranges from 1 to 4 The FREQ Procedure Cumulative Cumulative semax Frequency Percent Frequency Percent patients had at least 1 severe side effect
Log transformed value shows a better linear pattern