Working With Dates: Dates Come in Many Ways

    10/18/04
    18/10/04
    10/18/2004
    18OCT2004
    101804
    October 18, 2004

Need to know how to read in dates and then work with them

The rest of this lesson discusses working with variables that are dates. Dates come in many different formats, several of which are displayed here. These are all possible formats for the same date: October 18, 2004. Sometimes the month is first, sometimes it is second; sometimes the year has 4 digits, sometimes only 2. Sometimes the first 3 characters of the month are spelled out. Dates may or may not have separators. Whatever the format, we want to know how to read dates into SAS and how to work with them. Fortunately, SAS makes reading and working with dates fairly easy.
What do you want to do with dates?
Display them
Compare two dates: find the number of days between 2 dates

    ndays = date2 - date1;

Will this work?

Before we learn how to read in dates, let's consider what you might want to do with dates once they are read in. The first thing we might want to do is simply display them. For example, we might want to list for each patient the date they were enrolled or the date they last came in for a visit. The second thing we usually want to do is compare two dates, to see which is earlier or, more generally, to calculate the number of days between them. Think about computing the age of a person at the time they entered a study when their birth date and enrollment date are available. You would like to subtract the birth date from the enrollment date and somehow come up with the number of years between the dates. However, dates in their raw form do not subtract very well. Even if you could ignore the separators, the result would not be what you want. SAS does something special when dates are read in, so that subtracting two dates gives the number of days between them. Once you have that, you can compute the number of years easily.
Jan 1, 1960

    DATA dates;
      INFILE DATALINES;
      INPUT @1 brthdate mmddyy10.;   * Use informat;
    DATALINES;
    03/03/1971
    02/14/1956
    01/01/1960
    ;
    PROC PRINT;
      VAR brthdate;
    RUN;
    PROC PRINT;
      VAR brthdate;
      FORMAT brthdate mmddyy10.;
    RUN;
    ------------------------------------------------------
    Obs    brthdate
      1        4079
      2       -1417
      3           0

    Obs      brthdate
      1    03/03/1971
      2    02/14/1956
      3    01/01/1960

Let's look at a simple program that illustrates how to read in dates and what SAS does with them. Here we have a short DATA step that reads in the birth dates for three persons. We use the informat mmddyy10., which reads dates that take up 10 positions and have the pattern month/day/year, with a slash separating the three portions of the date. There are similar informats for other date layouts. We then do a PROC PRINT on the variable brthdate. Notice that the values look nothing like dates. They are numbers: one positive, one negative, and one equal to zero. What is going on here? When SAS reads in a date with a date informat, it makes the variable numeric and assigns the value relative to January 1, 1960. Dates before this date will be negative and dates after this date will be positive (and a date equal to this date is assigned a value of zero). This explains the display for the three dates. Each value is the number of days from January 1, 1960. To display dates properly, as dates, you need to apply a date FORMAT to the variable, as shown in the second PROC PRINT. SAS will then display the value as a date.
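To make the remark about similar informats concrete, here is a minimal sketch (not one of the course programs) that reads five of the layouts from the first slide with the INPUT function. The variable names d1 through d5 are made up for illustration, and I am assuming the two-digit years resolve to 2004 under the default YEARCUTOFF= setting.

```sas
* Sketch only: one calendar date in five layouts, each read with a matching informat;
DATA _NULL_;
  d1 = INPUT('10/18/04',   MMDDYY8.);    /* month/day/2-digit year */
  d2 = INPUT('18/10/04',   DDMMYY8.);    /* day/month/2-digit year */
  d3 = INPUT('10/18/2004', MMDDYY10.);   /* month/day/4-digit year */
  d4 = INPUT('18OCT2004',  DATE9.);      /* day, 3-letter month, year */
  d5 = INPUT('101804',     MMDDYY6.);    /* no separators */
  PUT d1= d2= d3= d4= d5=;               /* raw values: days since January 1, 1960 */
  PUT d3= MMDDYY10. d4= DATE9.;          /* the same values displayed as dates */
RUN;
```

If each informat matches its layout, all five variables should hold the same number in the log, because they all represent October 18, 2004.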
When you read in a variable with a date informat:
SAS makes the variable numeric
SAS assigns the numeric value relative to January 1, 1960
This makes it easy to subtract two dates to get the number of days between the dates.

    dayselapsed = date2 - date1;

Here is a summary: when SAS reads in a date with a date informat, it makes the variable numeric and assigns the value relative to January 1, 1960. You might ask, what good does this do? It makes it easy to find the number of days between two dates: you simply subtract the two date variables. The January 1, 1960 date is arbitrary; SAS could have used any date. An important thing to remember is that once the date is read in with a date informat, SAS treats it like any other numeric variable. Remember that variables in SAS are either character or numeric; there is no "date" kind of variable. If you always want SAS to display a date variable in a date format, assign a date format to the variable with a FORMAT statement in the DATA step.
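As a small illustration of the two points in this summary, the sketch below subtracts two dates and attaches a date format in the DATA step so the dates always print as dates. The dataset name visits and the four dates are invented for the example.

```sas
* Sketch only: subtract two dates and store a permanent display format;
DATA visits;
  INPUT @1 date1 MMDDYY10. @12 date2 MMDDYY10.;
  dayselapsed = date2 - date1;        /* number of days between the dates */
  FORMAT date1 date2 MMDDYY10.;       /* format travels with the dataset */
  DATALINES;
01/01/1960 01/31/1960
10/18/2004 10/25/2004
;
RUN;

PROC PRINT DATA=visits;
RUN;
```

The printout should show dayselapsed equal to 30 and 7. Because the FORMAT statement is in the DATA step, the PROC PRINT does not need its own FORMAT statement for date1 and date2 to display as dates.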
    * Program 8 ;
    DATA age;
      INFILE 'C:\SAS_Files\tomhs.data' ;
      INPUT @14 randdate mmddyy10.
            @34 brthdate mmddyy10. ;
      agedays = randdate - brthdate ;
      ageyrs  = (randdate - brthdate)/365.25;
      ageint  = INT( (randdate - brthdate)/365.25);
      yrrand  = YEAR(randdate);

OK, let's look at program 9 from the course notes. The DATA step reads in three date variables using the mmddyy10. informat. The variables are the patient's randomization date, birth date, and 12-month visit date. We will illustrate how to compute ages and some other things involving dates. The first new variable we compute is agedays, which is the simple difference between the randomization date and the birth date. This will contain the number of days between the two dates. If we divide this difference by 365.25 we get age in years (variable ageyrs). Lastly, since we usually report ages as "completed years of life", we use the integer function INT to remove the decimal portion of the years (variable ageint). SAS also has a function that can compute age, the function YRDIF. You provide the first and second date along with a method to compute years. Here we use the "Actual" method. You can look up this function in the SAS help to get details on the different options for calculating years. If you want to compute the age as of a certain date you can use the MDY function, which creates a date value for that date from which you can subtract the birth date. Here we compute the patient's age as of February 28, 1992, the end of the study. There is also the TODAY function if you would like to compute the age as of the date you run the program. Next, suppose you want to know if a patient came in for their 1-year visit within one month (before or after) of their one-year anniversary from randomization. We first compute the number of days between the 1-year visit date and the randomization date (variable daysv12); then we use an IF-THEN-ELSE statement to determine if the patient came in within a month of the expected date.
ABS is the absolute value function. If the visit falls less than 31 days from the expected date we say the visit was in the window (window12 = 1); if it is outside this number of days we say the visit was outside the window (window12 = 2). If the visit was missed then we set window12 to missing. Lastly, we illustrate the YEAR function. This function takes a date and extracts the year from it, so the variable yrrand will contain the year of enrollment. There are similar functions for month and day of the week.
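The statements described above for YRDIF, MDY, TODAY and the visit window are not shown on the slide, so here is a hedged sketch of what they might look like. This is my reconstruction, not the course program: the names date12, ageyrdif, ageend, agetoday, mthrand and dowrand are hypothetical, the 12-month visit dates are invented, and I am assuming the expected visit is 365 days after randomization.

```sas
* Sketch only: age functions and a visit-window check;
DATA agecheck;
  INFILE DATALINES TRUNCOVER;          /* short lines give missing, not an error */
  INPUT @1 brthdate MMDDYY10. @12 randdate MMDDYY10. @23 date12 MMDDYY10.;

  ageyrdif = YRDIF(brthdate, randdate, 'ACTUAL');         /* age at randomization */
  ageend   = INT((MDY(2, 28, 1992) - brthdate) / 365.25); /* age at end of study */
  agetoday = INT((TODAY() - brthdate) / 365.25);          /* age on the run date */

  daysv12 = date12 - randdate;         /* days from randomization to 12-month visit */
  IF daysv12 = . THEN window12 = .;                   /* visit missed */
  ELSE IF ABS(daysv12 - 365) < 31 THEN window12 = 1;  /* inside the window */
  ELSE window12 = 2;                                  /* outside the window */

  mthrand = MONTH(randdate);           /* 1 to 12 */
  dowrand = WEEKDAY(randdate);         /* 1 = Sunday through 7 = Saturday */

  FORMAT brthdate randdate date12 MMDDYY10.;
  DATALINES;
06/26/1936 11/10/1987 11/15/1988
03/01/1941 01/13/1988 03/20/1989
07/31/1925 05/21/1987
;
RUN;

PROC PRINT DATA=agecheck;
RUN;
```

With these made-up visit dates, the first patient comes in 371 days after randomization (window12 = 1), the second 432 days after (window12 = 2), and the third has no 12-month visit (window12 missing).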
    PROC PRINT DATA=age (obs=5);
      VAR brthdate randdate agedays ageyrs ageint ;
      TITLE 'Printing Dates Without a Date Format';
    RUN;

    PROC PRINT DATA=age (obs=5);
      VAR brthdate randdate agedays ageyrs ageint ;
      FORMAT brthdate mmddyy10. randdate mmddyy10.;
      TITLE 'Printing Dates With a Date Format';
    RUN;

Now that we have defined a bunch of new variables, we will use PROC PRINT to display some of the date and age variables we created: first without date formats, and in the second procedure with a format for the two date variables. We restrict the output to 5 observations.
Printing Dates Without a Date Format

    Obs    brthdate    randdate    agedays     ageyrs    ageyrsX    ageint
      1       -8589       10175      18764    51.3730    51.3739        51
      2       -6880       10239      17119    46.8693    46.8711        46
      3      -12572       10002      22574    61.8042    61.8055        61
      4       -9592       10175      19767    54.1191    54.1205        54
      5      -12996       10280      23276    63.7262    63.7268        63

All before 1960

In the first display, note that the variable brthdate displays as a negative integer value; these are all dates before 1960. The variable randdate has values of about 10 thousand; these are the number of days from January 1, 1960. The age variables all display correctly because they are not dates but lengths of time between dates. Note how the integer age in years variable truncates the fractional years from the age in years variable.
Printing Dates With a Date Format

    Obs      brthdate      randdate
      1    06/26/1936    11/10/1987
      2    03/01/1941    01/13/1988
      3    07/31/1925    05/21/1987
      4    09/27/1933    11/10/1987
      5    06/02/1924    02/23/1988

This slide displays partial results for the PROC PRINT when the two date variables have a date format applied. We see that the variables display as dates, here in the month-day-year format with a four-digit year. One note: you can display the dates in any format, not just the format used to read in the variable. Section 9.3 of the LSB lists several of the available date formats.
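To illustrate that the display format is independent of the informat used to read the date, this short sketch writes one date value to the log with several standard SAS date formats. The date is the first randdate from the printout above; the choice of formats is mine.

```sas
* Sketch only: one date value, several display formats;
DATA _NULL_;
  randdate = MDY(11, 10, 1987);
  PUT randdate MMDDYY10.;      /* 11/10/1987 */
  PUT randdate DATE9.;         /* 10NOV1987 */
  PUT randdate YYMMDD10.;      /* 1987-11-10 */
  PUT randdate WORDDATE18.;    /* month name spelled out */
  PUT randdate WEEKDATE29.;    /* day of week and month name spelled out */
RUN;
```

The stored value never changes; only the way it is written out does.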
    PROC FREQ DATA=age;
      TABLES yrrand ;
      TITLE 'Frequency Distribution of Year Randomized';
    RUN;

    The FREQ Procedure

                                        Cumulative    Cumulative
    yrrand    Frequency     Percent      Frequency       Percent
    -----------------------------------------------------------
      1986            9        9.00              9          9.00
      1987           65       65.00             74         74.00
      1988           26       26.00            100        100.00

Here we run a PROC PRINT to display variables related to whether the patient came in within the visit window. We also run a PROC FREQ to display the frequency distribution of the year of randomization.
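As an aside, a year tabulation like the one above can also be obtained without creating yrrand, because PROC FREQ groups observations by their formatted values. The sketch below is an alternative I am suggesting, not the course's method; the dataset rand and its three dates are invented for the example.

```sas
* Sketch only: tabulate a date by year using a format instead of a new variable;
DATA rand;
  INPUT @1 randdate MMDDYY10.;
  DATALINES;
11/10/1987
01/13/1988
05/21/1987
;
RUN;

PROC FREQ DATA=rand;
  TABLES randdate;
  FORMAT randdate YEAR4.;      /* display, and therefore group, by 4-digit year */
RUN;
```

With these three dates the table should show 1987 with a frequency of 2 and 1988 with a frequency of 1.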
Topics
Reading SAS datasets
Sub-setting SAS datasets
Merging SAS datasets

Welcome to Lesson 7. In this lesson we will look at working with SAS datasets. This includes creating and using permanent datasets, sub-setting datasets, and merging datasets. These topics are illustrated in programs 10-13 in the course notes and discussed in the indicated sections of the LSB.
Working With SAS Data Sets
Reading SAS dataset: SET statement
Merging SAS datasets: MERGE statement
Done within a DATA step

There are two key SAS statements when working with SAS datasets. The first is the SET statement, which reads (brings in) a SAS dataset. The second is the MERGE statement, which brings in multiple SAS datasets and merges them into one SAS dataset. Both of these actions take place in the DATA step. That is where new datasets are created, whether from raw data or from an existing SAS dataset.
SET STATEMENT

    DATA new;
      SET old (KEEP = varlist);
      WHERE condition;
    RUN;

Reads SAS data set
Replaces INFILE and INPUT statements used when reading in raw data
KEEP= brings in selected variables (columns)
WHERE brings in selected observations (rows)

The SET statement is a simple but important statement that reads a SAS dataset. The syntax is the keyword SET followed by the name of the dataset to read. An optional KEEP= option in parentheses can be used to restrict the variables brought in. If there is no KEEP= option then all variables are brought in. The WHERE statement limits the observations brought in, based on one or more variables in the dataset being read. So as the KEEP= option makes the new dataset skinnier, the WHERE statement makes the new dataset shorter. The 4 lines of SAS code here create a new dataset called new that contains selected observations from old and the variables from old included in varlist. If there were no KEEP= option or WHERE statement, the dataset new would be identical to old. The SET statement replaces both the INFILE and INPUT statements used when reading in raw data. You can see that it is much easier to read a SAS dataset.
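Here is a small runnable version of the template above, as a sketch. The dataset old, its variables, and its three rows are invented for the example.

```sas
* Sketch only: SET with KEEP= (columns) and WHERE (rows);
DATA old;
  INPUT id $ sex age sbp;
  DATALINES;
A01 1 51 115
A02 2 46 107
A03 1 61 138
;
RUN;

DATA new;
  SET old (KEEP = id sex sbp);   /* age is not brought in */
  WHERE sex = 1;                 /* only rows with sex = 1 are read */
RUN;

PROC PRINT DATA=new;
RUN;
```

The dataset new should have 2 observations and 3 variables. Note that sex is included in the KEEP= list; as far as I know, a variable used in the WHERE statement needs to be among the variables kept on the SET statement.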
Making SAS Datasets from Other SAS Datasets

    * PROGRAM 9 Making SAS Datasets from Other SAS Datasets;
    DATA tdata;
      INFILE 'C:\SAS_Files\tomhs.data' ;
      INPUT @  1 ptid     $10.
            @ 12 clinic   $1.
            @ 25 group     1.
            @ 30 sex       1.
            @123 sbp12     3.
            @ 14 randdate $10. ;
    RUN;

    * Making a new dataset containing only men;
    DATA men;
      SET tdata;                 * reads the existing dataset;
      WHERE sex = 1;             * This does the selection;
      if group in(1,2,3,4,5) then active = 1;
      else if group in(6) then active = 2;
      KEEP ptid clinic group sbp12 randdate active;
    RUN;

We will illustrate working with SAS datasets in program 10. We first create a dataset called tdata, reading in six variables from the raw file tomhs.data. The RUN statement completes the DATA step. We then create a "subset" dataset called men, which will contain the rows from tdata for which sex = 1. We do this by using the SET statement followed by the WHERE statement. We then add a new variable called active based on the variable group. We add a KEEP statement to limit the variables placed on the dataset men. The new dataset will have 6 variables and as many rows as there are men on the dataset.
PROGRAM 10 - Merging SAS Datasets

    DATA clinic;
      INFILE DATALINES;
      INPUT id $ sbp ;
    DATALINES;
    C03615 115
    B00979 107
    B00644 138
    D01348 142
    A01088 117
    B01408 121
    B00025 130
    B00714 144
    A01166 113
    … more data
    ;

    DATA lab;
      INFILE DATALINES;
      INPUT id $ glucose;
    DATALINES;
    C03615 102
    B00644 089
    D01348 111
    A01088 093
    B01408 094
    B00025 077
    B00714 100
    A01166 113
    D00942 103
    … more data
    ;

Frequently you will have multiple SAS datasets that you want to merge together. For example, laboratory data may be in one dataset and clinical data in another. To perform analyses across the datasets (for example, relating laboratory variables and clinical variables) you first need to merge them together. Program 11 illustrates how to merge datasets using a simple example where all the data is within the program. In this example data comes from two sources: the first is data collected at the clinical center; the second is data from the laboratory. In the first DATA step we create the dataset clinic, reading in two variables, the patient ID and systolic BP. The second DATA step creates the dataset lab, reading in the patient ID and glucose. Note that the variable name for patient ID is identical in both datasets (variable id). When you merge datasets you need a common variable to link the data together. This is typically an ID of some sort.
    * Creating merged dataset;
    PROC SORT DATA= clinic; BY id;
    PROC SORT DATA= lab; BY id;

    DATA study;
      MERGE clinic lab;
      BY id ;
    RUN;

Note: The BY statement is very important!

Before we can merge datasets we need to be sure each dataset is sorted by the variable we want to merge by, in this case the variable id. (Think of trying to put two stacks of exams together by student ID; it will be much easier and faster if the piles are sorted by student ID.) To sort a dataset you use PROC SORT. The syntax is PROC SORT followed by the dataset to be sorted. You follow this with the keyword BY followed by the variable you want the dataset sorted by. PROC SORT sorts the data by the variable in BY and, by default, writes the sorted data back to the same dataset. After the two PROC SORTs shown here, the datasets clinic and lab will be sorted by patient ID. We are then ready to create a new dataset that merges the two datasets. We do this in a DATA step. Instead of a SET statement we use a MERGE statement followed by the datasets to be merged. This is immediately followed by a BY statement giving the variable that links the data together. That is all you need! In just a couple of lines of code we have created a merged dataset called study. An important note here: do not forget the BY statement. If you omit the BY statement, SAS does a one-to-one matching, taking the first row of clinic and merging it with the first row of lab, and so on. This is not what you want; the data from one patient may get merged with the data from a different patient (think of the example of merging the two stacks of exams).
Merged Dataset

    Obs    id        sbp    glucose
      1    A00869    110         99
      2    A01088    117         93

Here is the result of the merge, displaying the variables using PROC PRINT. The two observations in red had data only in the clinic dataset and the observation in blue had data only in the lab dataset. Note that their data is missing for all variables from the dataset in which they do not appear. That is what you would expect SAS to do. Here then is an important thing to remember: when merging datasets, if an observation is not in a dataset then all variables from that dataset are set to missing. What if the same variable name is in both datasets and the subject has data in both datasets? This could cause a problem. SAS will take the value from the rightmost dataset on the MERGE statement. If that is what you want then you are OK. In general, though, you want variables to have unique names across the datasets (except for the BY variable used for the merge).
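Since the red and blue highlighting does not survive outside the slide, here is a sketch of one way to see which dataset each merged row came from, using the IN= dataset option. The six data lines are taken from the clinic and lab data shown earlier; the names inclinic, inlab and source are my own and are not part of the course program.

```sas
* Sketch only: flag the source of each row in a merge with IN=;
DATA clinic;
  INPUT id $ sbp;
  DATALINES;
C03615 115
B00979 107
B00644 138
;
RUN;

DATA lab;
  INPUT id $ glucose;
  DATALINES;
C03615 102
B00644 089
D00942 103
;
RUN;

PROC SORT DATA=clinic; BY id; RUN;
PROC SORT DATA=lab;    BY id; RUN;

DATA study;
  MERGE clinic (IN=inclinic) lab (IN=inlab);   /* temporary 0/1 flags */
  BY id;
  IF inclinic AND inlab THEN source = 'both  ';
  ELSE IF inclinic      THEN source = 'clinic';
  ELSE                       source = 'lab   ';
RUN;

PROC PRINT DATA=study;
RUN;
```

In this small example B00979 appears only in clinic, so its glucose is missing, and D00942 appears only in lab, so its sbp is missing. Adding a subsetting statement such as IF inclinic AND inlab; would keep only the patients found in both datasets.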
Things to Remember When Merging Datasets
Need to have a common variable name in each dataset to use as the linking variable
Variables in a dataset with no match will be set to missing
Matched rows that have the same variable names will be assigned the right-most dataset value
Always remember the BY statement in the merge!

Here is a summary of important points to remember when merging datasets. First, you need to have a common variable in each dataset to be merged. This is usually an ID of some sort. Second, if an ID is found in one dataset but not the other, then the variables from the dataset where it was not found are set to missing. Third, if there are common variable names in both datasets then the values from the second dataset will be used. As mentioned, you usually want unique names across datasets, except for the linking variable. Lastly, always remember the BY statement after the MERGE. Otherwise you will usually get the wrong results. If you are database inclined, you may want to explore PROC SQL to merge your data. It is more complicated to use, but it is more flexible, and if you have database experience you will be familiar with the syntax.
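For readers who want to try the PROC SQL route mentioned above, here is a minimal sketch of a join that should give the same rows as the MERGE with BY id. It is an illustration under my own assumptions, not code from the course notes: the table name study_sql and the aliases c and l are made up, and the data lines are a subset of the clinic and lab data shown earlier.

```sas
* Sketch only: the clinic/lab merge written as a PROC SQL full join;
DATA clinic;
  INPUT id $ sbp;
  DATALINES;
C03615 115
B00979 107
B00644 138
;
RUN;

DATA lab;
  INPUT id $ glucose;
  DATALINES;
C03615 102
B00644 089
D00942 103
;
RUN;

PROC SQL;
  CREATE TABLE study_sql AS
  SELECT COALESCE(c.id, l.id) AS id,   /* take id from whichever table has it */
         c.sbp,
         l.glucose
  FROM clinic AS c
       FULL JOIN lab AS l
       ON c.id = l.id
  ORDER BY 1;                          /* sort by the first selected column */
QUIT;

PROC PRINT DATA=study_sql;
RUN;
```

No PROC SORT is needed before the join. A FULL JOIN keeps unmatched rows from both tables, which mirrors what MERGE does; an INNER JOIN would keep only the patients present in both.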
Temporary vs Permanent SAS Datasets
Temporary (or work) SAS dataset - After the SAS session is over the dataset is deleted.

    DATA bp;   * bp is deleted after SAS session;

Permanent SAS dataset - After the program is run the dataset is saved and is available for use in future programs. You need to tell SAS where to store/retrieve the dataset.
Note: For PC SAS the work dataset is available until you end the SAS session.

In the programs we have looked at so far, the SAS datasets created have been temporary SAS datasets. They are temporary in that after your SAS session is over the datasets are deleted. These are also called "work" datasets. In the DATA statement here the dataset bp is temporary. It will be deleted after the SAS session is over. This is usually not a problem because we created the dataset with a DATA step in a program. To recreate the dataset we can just re-run the program. However, there are times when you would like to have the dataset available after your SAS session is over, without having to re-create it using a DATA step. To create a permanent dataset you will need to tell SAS where to store the file containing the dataset. We will look at some reasons why you may want to create a permanent SAS dataset and the syntax for how to do that.
Reasons to Create Permanent SAS Datasets
Read raw data and compute calculated variables only once
All variables have assigned names and labels. Data is ready to be analyzed.
Dataset can be sent to other computers or users.

Listed here are reasons for creating a permanent SAS dataset. One reason is that you then need to read the raw data only once. A second reason is that you can create all your calculated variables once, storing them in the SAS dataset. The calculated variables may involve complicated formulas or logic. You don't want to redo that every time, especially if multiple programmers are using the data. You want to get all the variables defined up front. This will reduce the chance of errors. With the SAS dataset, all the variables have assigned names and labels, and the data is ready to be analyzed. The DATA step that read the raw data and computed new variables is no longer needed once the dataset is created. Often you may still need to create or recode additional variables in a DATA step, but this will usually be simple and straightforward. The last reason given here is that SAS datasets are a good way to send data to other users or computers. You simply send the file to another user and everything is ready to go, with all the variables already defined. SAS will need to be installed on the other computer for the dataset to be accessed.
Creating a Permanent Dataset

    LIBNAME mylib 'C:\My SAS Datasets';

LIBNAME assigns a directory (folder) reference name. In this example the directory 'C:\My SAS Datasets' is assigned a reference name of mylib.

    DATA mylib.sescore;

Tells SAS to create a dataset called sescore in the directory referenced by mylib, which is 'C:\My SAS Datasets'.

To create a permanent SAS dataset you need to use a LIBNAME statement to tell SAS where to store the dataset. LIBNAME stands for library name. After the keyword LIBNAME comes what is called a library reference. This is a name you assign that points to a directory (i.e. a folder) on your system. After the library reference you give the directory that the library reference points to. In the example, the library reference named mylib points to the 'My SAS Datasets' folder. LIBNAME statements are similar to the FILENAME statements that we learned about in program 2, except that FILENAME statements point to a file whereas LIBNAME statements point to a folder. You then use the library reference name in the DATA statement as shown here. The DATA statement tells SAS to create a dataset called sescore in the directory referenced by mylib, which points to the C:\My SAS Datasets directory. Note the form used to specify the dataset: the library reference (where to put the file), a period, then the name of the dataset. One note: temporary datasets have a library reference called WORK. You do not need to specify the WORK library; if no library reference is given, SAS assumes the WORK library, which SAS assigns to some place on your computer. As stated before, these files are deleted by SAS when your SAS session ends.
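To tie the one-part and two-part names together, here is a brief sketch. It assumes the folder 'C:\My SAS Datasets' already exists on your machine; the dataset bp and its two rows are invented for the example.

```sas
* Sketch only: the same data as a work dataset and as a permanent dataset;
LIBNAME mylib 'C:\My SAS Datasets';    /* assumes this folder already exists */

DATA work.bp;                          /* identical to writing DATA bp; */
  INPUT id $ sbp;
  DATALINES;
A01088 117
B00714 144
;
RUN;

DATA mylib.bp;                         /* stored as a file in the mylib folder */
  SET work.bp;
RUN;

PROC CONTENTS DATA=mylib._ALL_ NODS;   /* list the datasets in the library */
RUN;
```

After this runs, work.bp disappears when the session ends, while the copy in the mylib folder remains available to later programs.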
    * PROGRAM 11 ;
    LIBNAME mylib 'C:\SAS_Files';
    DATA mylib.sescore;
      INFILE 'C:\SAS_Files\tomhs.data' LRECL =400;
      INPUT @  1 ptid     $10.
            @ 12 clinic   $1.
            @ 14 randdate mmddyy10.
            @ 25 group    1.
            @ 49 educ     1.
            @ 85 wtbl     5.1
            @ 97 wt12     5.1
            @115 sbpbl    3.0
            @123 sbp12    3.0
            @236 (sebl_1-sebl_20) (1. +1)
            @276 (se12_1-se12_20) (1. +1) ;

In program 12 we will create a permanent SAS dataset called sescore that will be stored in the 'C:\SAS_Files' folder. We first define a library reference called mylib that points to this folder. We could have called it anything; we just need to be consistent and use the same name in the DATA statement. The two occurrences of mylib (shown in red on the slide) need to be the same. We then read in several variables from tomhs.data, including 20 side-effect items at baseline and the same 20 side-effect items at the 12-month visit. We use array-type notation as a shortcut for reading in these variables.
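The shortcut notation (sebl_1-sebl_20) (1. +1) may be unfamiliar, so here is a scaled-down sketch of how it works. The dataset items, the variables se_1 to se_5, and the two data lines are invented; only the pattern mirrors the program above.

```sas
* Sketch only: a variable list paired with a format list in INPUT;
DATA items;
  INPUT @1 ptid $3.
        @5 (se_1-se_5) (1. +1);   /* read 1 column, skip 1 column, repeat */
  DATALINES;
A01 1 2 1 3 4
A02 2 2 1 1 1
;
RUN;

PROC PRINT DATA=items;
RUN;
```

The format list in the second set of parentheses is reused for each variable in the first set, so se_1 is read from column 5, se_2 from column 7, and so on through se_5 in column 13.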
      wtd12    = wt12 - wtbl;
      sbpd12   = sbp12 - sbpbl;
      sescrbl  = MEAN (OF sebl_1 - sebl_20) ;
      sescr12  = MEAN (OF se12_1 - se12_20) ;
      sescrd12 = sescr12 - sescrbl ;

      LABEL educ     = 'Highest Education Level';
      LABEL wt12     = 'Weight (lbs) at 12 Months';
      LABEL wtbl     = 'Weight (lbs) at Baseline';
      LABEL wtd12    = 'Weight Change at 12 Months';
      LABEL sbpbl    = 'Systolic BP (mmHg) at Baseline';
      LABEL sbp12    = 'Systolic BP (mmHg) at 12 Months';
      LABEL sbpd12   = 'Systolic BP Change at 12 Months';
      LABEL group    = 'Treatment Group (1-6)';
      LABEL sescrbl  = 'Side Effect at Baseline';
      LABEL sescr12  = 'Side Effect at 12 Months';
      LABEL sescrd12 = 'Side Effect Change Score';

      FORMAT randdate mmddyy10. ;
      DROP sebl_1-sebl_20 se12_1-se12_20 ;

We then create several new variables: weight and blood pressure change, and the side-effect summary measures. The first summary measure is a variable named sescrbl, which is the average "score" of the 20 items at baseline. The second is a similar variable using the items at 12 months. We then compute a variable called sescrd12, which is the difference between the two. This can be used as a measure of improvement (or worsening) over time in the side-effect profile, which can be compared among the treatment groups. We add labels for the variables to document the dataset. We also include a format for randdate so that this variable will always display as a date. Lastly, we use a DROP statement that tells SAS to "drop" the 40 individual side-effect items, i.e. not to include these variables on the dataset. This will make the dataset smaller.
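One property of the MEAN function worth knowing when scoring items this way is that it averages only the non-missing values, whereas ordinary arithmetic returns missing if any item is missing. The sketch below demonstrates this with four invented items; the names se_1 to se_4, sescr, sescralt and nitems are hypothetical.

```sas
* Sketch only: MEAN(OF ...) versus plain arithmetic when an item is missing;
DATA score;
  INPUT se_1-se_4;
  sescr    = MEAN(OF se_1-se_4);              /* average of non-missing items */
  sescralt = (se_1 + se_2 + se_3 + se_4) / 4; /* missing if any item is missing */
  nitems   = N(OF se_1-se_4);                 /* number of non-missing items */
  DATALINES;
1 2 3 4
1 . 3 2
;
RUN;

PROC PRINT DATA=score;
RUN;
```

For the first row both scores are 2.5. For the second row sescr is 2 (the mean of 1, 3 and 2), sescralt is missing, and nitems is 3. Whether averaging over fewer items is acceptable is a study decision; nitems lets you check how many items went into each score.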
    60   LIBNAME mylib 'C:\SAS_Files';
    NOTE: Libref MYLIB was successfully assigned as follows:
          Engine:        V9
          Physical Name: C:\SAS_Files

         DATA mylib.sescore;

    NOTE: The infile 'C:\SAS_Files\tomhs.data' is:
          File Name=C:\SAS_Files\tomhs.data,
          RECFM=V,LRECL=400

    NOTE: 100 records were read from the infile 'C:\SAS_Files\tomhs.data'.
    NOTE: The data set MYLIB.SESCORE has 100 observations and 14 variables.

This is a partial SAS log from running the program. After the LIBNAME statement we get a note that the library reference MYLIB was successfully assigned. If the referenced directory did not exist, the log would instead tell you that the library does not exist, and the DATA step that tries to write to it would fail. That would most likely mean that you typed the folder path incorrectly. After the DATA step is a note that the dataset MYLIB.SESCORE has 100 observations and 14 variables. This means that the dataset was successfully created. There is one observation for each record read in from the raw data file tomhs.data.
Using PROC COPY to copy work dataset to permanent dataset
Make a work dataset first; then, when you know it is working correctly, copy the work dataset to a permanent dataset.

    LIBNAME mylib 'C:\SAS_Files';
    DATA sescore;
      ….
    RUN;

    PROC COPY IN=work OUT=mylib;
      SELECT sescore;
    RUN;

Instead of using a two-part name to create a permanent dataset, you can use the COPY procedure to copy the work dataset to a permanent library (folder) on your computer. You use the IN= option to specify the WORK library and the OUT= option to specify the library (folder) where you want the dataset stored. This is usually the way I make permanent datasets; that way my program always uses work datasets, and I use PROC COPY to create the permanent dataset when I need to.
Reading Permanent SAS Dataset

    LIBNAME class 'C:\SAS_Files' ;   * Tells SAS where to find the SAS dataset;

    PROC MEANS DATA=class.sescore ;
      TITLE 'Means of All Numeric Variables on SAS Permanent Dataset';
    RUN;

What if the dataset was moved to a different folder? Just need to change the LIBNAME

So you have created a permanent dataset called sescore. Now you or another user want to use it (or maybe the dataset was sent to you and you have stored it on your computer and now want to use it). Just as you needed to tell SAS where to store the dataset when you created it, you need to tell SAS where it is stored on your computer in order to read it. For this you again use the LIBNAME statement. The LIBNAME statement is used for both reading and writing SAS datasets. The library reference that was used to create the dataset does not matter (in fact, you may not even know it). You do need to know where on your computer the file is located. From program 12 we know the dataset sescore is located in the 'C:\SAS_Files' folder. In program 13 we use the LIBNAME statement to assign a library reference to this folder. We will use the name class. Now we can start running SAS procedures immediately, without a DATA step. We need to refer to the dataset on the PROC statement with the library reference, a period, then the dataset name. The PROC MEANS statement here tells SAS to find a dataset called sescore in the directory referenced by class and run a PROC MEANS on it. We then run a PROC CORR on the dataset. You may think that using library references is an odd way to tell SAS where the dataset is located. Perhaps that is true. However, what if the dataset you have been using gets moved to a different folder (perhaps by the data administrator)? If that happens, you just need to change the LIBNAME statement and everything will work.
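When a dataset has been sent to you, a useful first step is to look at what it contains before analyzing it. The sketch below is my suggestion rather than part of program 13: it lists the variables, labels and formats with PROC CONTENTS and then runs a PROC CORR. It assumes sescore is stored in 'C:\SAS_Files' as described above, and the choice of variables on the VAR statement is mine.

```sas
* Sketch only: inspect a permanent dataset, then analyze it without a DATA step;
LIBNAME class 'C:\SAS_Files';      /* assumes sescore is stored in this folder */

PROC CONTENTS DATA=class.sescore;  /* variable names, types, labels, formats */
RUN;

PROC CORR DATA=class.sescore;
  VAR sbpd12 wtd12 sescrd12;       /* the three change variables */
  TITLE 'Correlations Among Change Variables';
RUN;
```

The PROC CONTENTS output is where the labels and the randdate format assigned when the dataset was created show up, which is how a new user learns what each variable means.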
    *---------------------------------------------------------------*
    Often you will read the permanent SAS dataset in a DATA step to
    modify or add variables. Usually these will be put on a new
    work SAS dataset.
    *---------------------------------------------------------------*;

    LIBNAME class 'C:\SAS_Files';

    DATA rxdata;
      SET class.sescore;
      if group in(1,2,3,4,5) then rx = 1;
      else rx = 2;
    RUN;

    PROC MEANS DATA=rxdata N MEAN MAXDEC=2 FW=7;
      CLASS group;
      VAR sbpd12 wtd12 sescrd12;
      TITLE 'Change in SBP, Weight, and Side Effect Score by Treatment';
    RUN;

Finally, here at the end of the program I create a new dataset called rxdata that reads the permanent dataset using the SET statement. I create one new variable called rx and run a PROC MEANS using rx as a class variable. Creating a new dataset from a permanent SAS dataset is no different than creating a SAS dataset from a work dataset. You need to include a LIBNAME statement where you assign the library reference. Then on the SET statement add the library reference to the dataset name. Here we read the SAS dataset sescore from the class library, which points to the folder 'C:\SAS_Files' where the file resides.