Chapter 3, “Working With Your Data”, concerns programming in the DATA step: the lines of SAS code that go between a DATA statement and the next PROC statement… Creating new variables or modifying existing ones is one of the main tasks in DATA step programming. The syntax is: variable = expression ; The variable on the left side can be new or existing: if it is new, SAS adds it to the dataset; if it already exists, SAS replaces its old values with the values defined by the expression on the right.
SAS uses the usual arithmetic operators: +, -, *, /, ** (exponentiation) – it follows the usual rules of precedence (exponentiation first, then multiplication and division, then addition and subtraction) – it follows the usual rule for parentheses (whatever is in parentheses is done first)… so when in doubt, use parentheses... Example: write a simple SAS program using the 5 arithmetic operations to create new variables… then write some programming statements to show how parentheses can affect those variables…
options ls=80;
data test;
  input x;
  y = x + 10;
  z = x / y;
  w = x ** 2;
  t = x ** .5;
  bd = mdy(1, 15, 1966);
  new = round(log10(150));
  datalines;
;
proc print;
  format bd mmddyy10.;
run;
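The effect of parentheses can be seen in a small sketch like the following. The dataset name `parens` and the values of x and y are made up for illustration; each pair of statements uses the same operands but groups them differently.

```sas
/* Illustrative sketch: parens, x and y are hypothetical, not from the course data */
data parens;
  x = 4;
  y = 2;
  a = x + y * 3;      /* multiplication first: 4 + 6 = 10 */
  b = (x + y) * 3;    /* parentheses first:    6 * 3 = 18 */
  c = x / y + 2;      /* division first:       2 + 2 = 4  */
  d = x / (y + 2);    /* parentheses first:    4 / 4 = 1  */
  e = 2 * x ** 2;     /* exponentiation first: 2 * 16 = 32 */
  f = (2 * x) ** 2;   /* parentheses first:    8 ** 2 = 64 */
run;

proc print data=parens;
run;
```

Comparing a with b, c with d, and e with f in the printed output shows why "when in doubt, use parentheses" is good advice.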
Many new variables are created with the built-in functions of SAS... Sections 3.2 & 3.3 have a sampling of them, put in categories so you can see the variety. The categories are:
–Character, Date, Financial, Macro, Mathematical
–Probability, Random Number, Simple Statistical
–State & Zip Code
Functions operate on their arguments and have the form function(arg1, …, argn), so to create a new variable with a function use new_variable = function(argument_list); Functions can be nested within each other: newvar=sqrt(log10(X));
DATA contest;
   *INFILE 'c:\MyRawData\Pumpkin.dat';
   INPUT Name $16. Age Type $1. +1 Date MMDDYY10.
         (Scr1 Scr2 Scr3 Scr4 Scr5) (4.1);
   AvgScore = MEAN(Scr1, Scr2, Scr3, Scr4, Scr5);
   DayEntered = DAY(Date);
   Type = UPCASE(Type);
   DATALINES;
Alicia Grossman  13 c
Matthew Lee       9 D
Elizabeth Garcia 10 C
Lori Newcombe     6 D
Jose Martinez     7 d
Brian Williams   11 C
;
PROC PRINT DATA = contest;
   TITLE 'Pumpkin Carving Contest';
RUN;
Go over the functions in section 3.3 in detail - general statements on p. 80 and specific examples on p. 81. Note the variety of arguments for the various functions…
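To give a feel for how much the argument lists vary, here is a small sketch using a handful of common functions. The dataset `funcs` and all values in it are invented for illustration and are not taken from the book's examples.

```sas
/* Illustrative sketch: funcs, name, score and the date are hypothetical values */
data funcs;
  name  = 'pumpkin pie';
  score = 7.846;
  upper   = upcase(name);          /* one character argument: 'PUMPKIN PIE'   */
  first3  = substr(name, 1, 3);    /* string, start position, length: 'pum'   */
  rounded = round(score, 0.1);     /* value and rounding unit: 7.8            */
  whole   = int(score);            /* integer part: 7                         */
  biggest = max(3, 9, score);      /* any number of numeric arguments: 9      */
  d       = mdy(10, 31, 2008);     /* month, day, year give a SAS date value  */
  mon     = month(d);              /* a date argument: 10                     */
  nested  = sqrt(log10(10000));    /* nested: log10 gives 4, sqrt gives 2     */
run;

proc print data=funcs;
  format d mmddyy10.;
run;
```

Some functions take a single argument, some take a fixed set in a fixed order (SUBSTR, MDY), and some accept a list of any length (MAX, MEAN).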
Another way to create new variables is with so-called IF-THEN statements. The syntax is IF condition THEN action ; Most conditions are based on comparison operators... EQ, NE, GT, LT, GE, LE, IN (or use the symbols =, ~= or ^=, >, <, >=, <=; IN has no symbol). You can also specify multiple conditions with logical operators... OR, AND, NOT (or use |, &, ~ or ^). You can also specify multiple actions using a DO group, closed by END... (see also next slide)
In the conditional IF condition THEN action ; the condition can also be compound, formed by using the logical connectors AND, OR, NOT or by using the IN operator... see the examples on pages 82 and 83. You can also use the following construction:
IF condition THEN DO;
   action1;
   action2;
END;
This DO group can contain many SAS statements, but they are all treated as a unit and executed one after the other until the END statement is reached.
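A minimal sketch of these pieces together: a mnemonic comparison, a compound condition, the IN operator, and a DO group. The dataset `sports`, its variables, and the three data lines are hypothetical and only serve the illustration.

```sas
/* Illustrative sketch: sports and all of its values are made up */
data sports;
  length class $ 8 note $ 12;      /* set lengths up front to avoid truncation */
  input name $ age sport $;
  /* single comparison, mnemonic form */
  if age ge 18 then class = 'adult';
  /* compound condition with AND, symbolic form */
  if age >= 13 and age < 18 then class = 'teen';
  /* IN tests membership in a list of values */
  if sport in ('golf', 'tennis') then note = 'individual';
  /* DO group: two actions under one condition */
  if age < 13 then do;
    class = 'child';
    note  = 'needs waiver';
  end;
  datalines;
Ann 15 golf
Raj 9 soccer
Lee 22 tennis
;

proc print data=sports;
run;
```

With these lines, Ann should come out as teen / individual, Raj as child / needs waiver, and Lee as adult / individual.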
IF-THEN statements are often used to create new categorical variables from existing variables that take many different values, or to convert numeric variables to character variables - do some examples… IF-THEN-ELSE statements help with this task: once one condition is true, the remaining ELSE branches are skipped, so each observation lands in exactly one category. Sometimes the final ELSE statement has no IF…THEN; it then catches every observation not matched above. See the example on page 85 - note how missing values are handled…
* Group observations by cost;
DATA homeimprovements;
   *INFILE 'c:\MyRawData\Home.dat';
   INPUT Owner $ 1-7 Description $ 9-33 Cost;
   IF Cost = . THEN CostGroup = 'missing';
   ELSE IF Cost < 2000 THEN CostGroup = 'low';
   ELSE IF Cost < THEN CostGroup = 'medium';
   ELSE CostGroup = 'high';
   DATALINES;
Bob     kitchen cabinet face-lift
Shirley bathroom addition
Silvia  paint exterior            .
Al      backyard gazebo
Norm    paint interior
Kathy   second floor addition
;
PROC PRINT DATA = homeimprovements;
   TITLE 'Home Improvement Cost Groups';
RUN;
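The same IF-THEN-ELSE pattern can convert a numeric code into a character variable. The following is only a sketch: the dataset `survey`, the variable names, and the coding (1 = yes, 0 = no) are assumptions made up for illustration.

```sas
/* Illustrative sketch: survey, id, code and the 1/0 coding are hypothetical */
data survey;
  length answer $ 7;               /* long enough for the longest label */
  input id code;
  if code = . then answer = 'unknown';   /* test for missing first */
  else if code = 1 then answer = 'yes';
  else if code = 0 then answer = 'no';
  else answer = 'other';                 /* final ELSE catches everything else */
  datalines;
1 1
2 0
3 .
4 9
;

proc print data=survey;
run;
```

Testing for missing first matters most when the later conditions use < or <=: SAS treats a missing numeric value as smaller than any number, so a missing value would otherwise fall into the lowest group. The LENGTH statement is there because, without it, a character variable takes its length from the first value assigned to it.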
HW: to me by noon on Wednesday: Use the diabetes data to create the following new variables:
–BMI (Body Mass Index = weight in kg divided by the square of height in meters, i.e. weight / height**2)
–A character variable based on BMI that puts the participants into groups: below 18.5 is underweight, 18.5 to 24.9 is normal, 25.0 to 29.9 is overweight, and 30.0 and above is obese.
–Age classes: make classes by decade (teens, 20s, 30s, etc.)