Chapter 3: Working With Your Data **SAS Alert** In this chapter, we will see some examples of SAS at its worst. The “implicit” looping of the DATA step makes several operations that would be easy in R difficult or clumsy in SAS We see good things too © Fall 2011 John Grego and the University of South Carolina
Assigning Variables Creating variables based on other variables is easily done within the data step Assignment is carried out with the = sign input var1 var2 var3; mysum=var1+var2+var3; mycube=var3**3; always6=6; newname=var2;
Assigning Variables II Order of operations is followed, but use parentheses when necessary and for clarity Previously defined variables can be over-written var1=var1-15;
SAS functions In addition to simple math expressions, you can use built-in SAS functions to create variables Section 3.3 (pages 78-81) lists many built-in functions
Working with your data Some of the most useful: LOG, MEAN, ROUND, SUM, TRIM, UPCASE, SUBSTR, CAT, COMPRESS, DAY, MONTH Note: MEAN takes the mean of several variables, not the mean of all values of one variable. Similarly with SUM, etc. They are row operators, not column operators. Open charfunctions.sas
Using IF-THEN Statements Conditional statements in SAS rely on several important keywords like IF, THEN and ELSE and logical keywords like EQ,NE,GT,LT,GE,LE,IN,AND,OR All of these have symbolic equivalents (see page 82) IN: Checks whether a variable value occurs in a specified list
Simple IF-THEN statement An IF-THEN statement is a simple conditional statement, usually resulting in only one action, unless the keywords DO and END are specified (like braces in R) IF X>0 AND X<2 THEN Y=X; ELSE Y=2-X; Open pact.sas
IF THEN/ELSE IF/ELSE Several conditions may be checked using ELSE IF or ELSE statements The last action is carried out if none of the previous conditions are true IF .. THEN ..; ELSE IF .. THEN ..; ELSE ..; Open retain2008.sas
IF-THEN vs. IF-THEN/ELSE Using several ELSE statements is more efficient than using several IF-THEN statements (though errors in logic are more likely) Note: Parentheses may be useful with AND/OR statements.
Missing Values Be careful with missing values when making comparisons! SAS considers missing values to be “less than” practically any value, so if data contains missing values, handle them separately IF weight=. THEN size=‘unknown’; ELSE IF weight<25 THEN size=‘small’; ELSE IF ..
Using IF to select a subset of data We can retain cases from the data using logical operators with a subsetting IF. The syntax is unusual; it seems as though part of the statement is missing DATA B; SET A; IF type=‘Pine’; Open pact.sas. again
Implict vs Explicit subsetting IF Data set B will then include only the cases that match the condition Most people get used to this odd syntax, but you can include a THEN KEEP/THEN DELETE clause if this makes you really uncomfortable
SAS Dates SAS stores dates internally as the number of days since January 1, 1960 Special informats for reading dates (pp. 44-45) When a year is specified by two digits (‘03, ‘45, etc.), you can use YEARCUTOFF to specify the century
SAS Dates-Year cutoff The default is 1920; SAS assumes dates range from 1920 to 2019. It’s better to simply avoid this ambiguity options yearcutoff=1930; options yearcutoff=1800;
SAS Dates-TODAY Handy function: TODAY() is set to the current date Special date form for logical operators if birthdate>’01JAN1988’d then age=‘Under 21; 15
SAS Date functions and formats Printing dates in a conventional format: Use FORMAT command in PROC PRINT; We can also output data using date formats (pp. 90-91) Other useful functions: MONTH(),DAY(),YEAR(),MDY(),QTR() Run date.sas. DATETIME formats not covered here.
cumul_sum+value_added; RETAIN statement The RETAIN statement tells SAS to retain the value of a variable as SAS moves from observation to observation A clumsy solution to a common SAS problem Can be useful for cumulative analyses A sum statement creates a cumulative sum: cumul_sum+value_added; David’s code—open sas_quakes_retain.sas. SAS supports implicit addition too.
Using arrays We have seen how to alter variables that have been read into a SAS data set Sometimes we want to operate on more than one variable in the same way This can be accomplished quickly by creating an array (another example of an awkward solution to a common problem in SAS)
Array properties An array is a group of variables (either all numeric or all character) These could be already-existing variables or new ones The array itself does not become an object in the data set
Defining an array Once an array is defined, you can refer to its variables using “subscripts”. Ex: array_name(2) Using a DO statement is clunky—I like to use DO OVER instead ARRAY array_name (n) $ .. .. ..; Book ex: ARRAY song (5) wj kt tr filp ttr; Do i=1 TO 5; IF song(i)=9 THEN song(i)=.; END; (n) is used for explicit arrays with predefined numbers of variables. Use [] or {} in addition to (). Open array.sas, which changes missing values to 0’s, and counts item response categories.
Shortcuts for lists of variables Suppose variable names begin with a common character string, and end with a number sequence: var1, var2, var3, var4 You can refer to them in shortcut fashion: var1-var4
Lists of variables in functions When specifying abbreviated lists in functions, you must use the keyword OF sum(of var1-var4); mean(of var1-var7);
proc contents data=dsname position; Variable ordering You can abbreviate lists of named variables using a double hyphen: firstvar—lastvar These must follow the creation order of the variables as defined in the SAS data sheet. Check this either in the worksheet or by entering: proc contents data=dsname position; Open array.sas rather than doubeldash.txt. Open variable_order.sas
Special variables _ALL_ is short for “all variables in the data set” _NUMERIC_ is short for “all numeric variables in the data set” _CHARACTER_ is short for “all character variables in the data set” _N_ is short for “current observation index in the data set SUM(OF _NUMERIC_);