Introduction to SAS Essentials Mastering SAS for Data Analytics Alan Elliott and Wayne Woodward
Intro to SAS Chapter 3 Part 2
3.9 GOING DEEPER: UNDERSTANDING HOW THE DATA STEP READS AND STORES DATA

To understand how SAS works, it can be helpful to "look under the hood" and see what is happening in all those bits and bytes as SAS reads and processes data. Knowing how SAS handles data is a little bit like knowing how the motor in your car works. You can usually drive around okay without knowing anything about pistons, but sometimes it is good to know what that knocking sound means.
How SAS Thinks
UNDERSTANDING HOW THE DATA STEP READS AND STORES DATA
  Data Step Processing
  The DATA Step vs The PROC Step
  More about reading data files
  Review of how to read data into SAS
Consider the following program

  DATA NEW;
  INPUT ID $ AGE TEMPC;
  TEMPF=TEMPC*(9/5)+32;
  DATALINES;
  0001 24 37.3
  0002 35 38.2
  ;
  RUN;
  PROC PRINT;RUN;

How does SAS read in this data and create a SAS data set?
This code calculates a value and creates a variable named TEMPF. We'll learn more about calculations later.
Alan Elliott, stattutorials.com
Overview of SAS Data Step
  Compile Phase (Look at Syntax)
  Execution Phase (Read data, Calculate)
  Output Phase (Create Data Set)
Concepts
COMPILE: SAS reads the syntax of the SAS program to see if there are any errors in the code. If no errors are found, SAS "compiles" the code; that is, it transforms the SAS code into a code used internally by SAS. (You don't need to know this internal code.)
EXECUTION: If the code syntax checks out, SAS begins performing the tasks specified by the code. For example, the first line of code is DATA NEW, so during the execution phase SAS creates a "blank" data set (no data in it) named NEW, into which it will put the data as they are read. SAS then reads each data line and interprets it into the values for each variable.
OUTPUT: SAS stores those values in the data set one observation at a time until all data have been written to the specified data set.
Compile Phase
SAS checks the syntax of the program.
  Identifies the type and length of each variable.
  Does any variable need conversion?
If everything is okay, SAS proceeds to the next step.
If errors are discovered, SAS attempts to interpret what you mean. If SAS can't correct the error, it prints an error message to the log.

  DATA NEW;
  INPUT ID $ AGE TEMPC;
  TEMPF=TEMPC*(9/5)+32;
  DATALINES;
  0001 24 37.3
  0002 35 38.2
  ;
  run;
  proc print;run;
Create Input Buffer
SAS creates an input buffer. The INPUT BUFFER contains data as it is read in.

  DATALINES;
  0001 24 37.3
  0002 35 38.2
  ;

INPUT BUFFER (column positions)
  1  2  3  4  5  6  7  8  9  10  11  12  ...
Execution Phase
The PROGRAM DATA VECTOR (PDV) is created and contains information about the variables:
  two automatic variables, _N_ and _ERROR_
  a position for each of the four variables in the DATA step
SAS sets _N_ = 1 and _ERROR_ = 0 (no initial error) and sets the remaining variables to missing.

PDV
  _N_   _ERROR_   ID     AGE   TEMPC   TEMPF
  1     0                .     .       .
Buffer to PDV
Buffer (column positions)
  1  2  3  4  5  6  7  8  9  10  11  12  ...

The PDV reads the 1st record. TEMPF is initially missing:
  _N_   _ERROR_   ID     AGE   TEMPC   TEMPF
  1     0         0001   24    37.3    .

If there is an executable statement, SAS processes the code
  TEMPF=TEMPC*(9/5)+32;
and stores the calculated value in the PDV:
  _N_   _ERROR_   ID     AGE   TEMPC   TEMPF
  1     0         0001   24    37.3    99.14
Output Phase
The values in the PDV are written to the output data set (NEW) as the first observation.

From PDV:
  _N_   _ERROR_   ID     AGE   TEMPC   TEMPF
  1     0         0001   24    37.3    99.14

Write data to data set. This is the first record in the output data set named "NEW." Note that _N_ and _ERROR_ are dropped.
  ID     AGE   TEMPC   TEMPF
  0001   24    37.3    99.14
Exceptions to Missing in PDV
Initial values are usually set to missing in the PDV:
  _N_   _ERROR_   ID     AGE   TEMPC   TEMPF
  1     0                .     .       .

Some data values are not initially set to missing in the PDV:
  variables in a RETAIN statement
  variables created in a SUM statement
  data elements in a _TEMPORARY_ array
  variables created with options in the FILE or INFILE statements
These exceptions are covered later.
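The first two exceptions in the list above can be illustrated with a small sketch. The data set name RUNNING and the variables MAXAGE and COUNT are hypothetical names chosen for this illustration; they do not appear in the slides.

```sas
DATA RUNNING;
   INPUT ID $ AGE;
   RETAIN MAXAGE 0;                     /* RETAIN: MAXAGE is not reset to missing each iteration */
   IF AGE > MAXAGE THEN MAXAGE = AGE;   /* keep the largest AGE seen so far */
   COUNT + 1;                           /* sum statement: COUNT starts at 0 and is retained */
DATALINES;
0001 24
0002 35
0003 30
;
RUN;
PROC PRINT DATA=RUNNING;
RUN;
```

Because MAXAGE and COUNT keep their values from one iteration to the next, the printed data set should show a running maximum and a running count rather than missing values.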
Know the Difference

INPUT BUFFER (column positions)
  1  2  3  4  5  6  7  8  9  10  11  12  ...

PROGRAM DATA VECTOR
  _N_   _ERROR_   ID     AGE   TEMPC   TEMPF
  1     0                .     .       .
Next data record read
Once SAS has finished reading the first data record, it continues the same process and reads the second record, sending the results to the output data set (named NEW in this case), and so on for all records.

  ID     AGE   TEMPC   TEMPF
  0001   24    37.3    99.14
  0002   35    38.2    100.76
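One way to watch the PDV change from record to record is to write it to the SAS log. The sketch below adds two PUT _ALL_ statements to the example program; these PUT statements are an addition for illustration and are not part of the original slides.

```sas
DATA NEW;
   PUT _ALL_;                   /* PDV at the top of the iteration, before a record is read */
   INPUT ID $ AGE TEMPC;
   TEMPF = TEMPC*(9/5) + 32;
   PUT _ALL_;                   /* PDV after the record is read and TEMPF is calculated */
DATALINES;
0001 24 37.3
0002 35 38.2
;
RUN;
```

Each PUT _ALL_ writes every variable in the PDV to the log, including the automatic variables _N_ and _ERROR_, so the log shows the values before and after each record is processed.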
Descriptor Information
SAS creates and maintains descriptor information about each SAS data set:
  data set attributes
  variable attributes
This information includes the name of the data set, the member type, the date and time that the data set was created, and the number, names, and data types (character or numeric) of the variables.
Data Set Description
  proc contents data=new;
  run;
To see all data sets, use
  proc contents data=_ALL_;
  run;
Alternate program:
  proc datasets;
  contents data=new;
  run;

Contents output (abbreviated)
  #   Name   Member Type   File Size   Last Modified
  1   NEW    DATA          5120        20Nov13:08:59:32
Description output continued

  Data Set Name         WORK.NEW                            Observations           2
  Member Type           DATA                                Variables              4
  Engine                V9                                  Indexes
  Created               Wed, Nov 20, 2013 08:59:32 AM       Observation Length     32
  Last Modified                                             Deleted Observations
  Protection                                                Compressed             NO
  Data Set Type                                             Sorted
  Label
  Data Representation   WINDOWS_64
  Encoding              wlatin1 Western (Windows)
Description output continued

Alphabetic List of Variables and Attributes
  #   Variable   Type   Len
  2   AGE        Num    8
  1   ID         Char
  3   TEMPC
  4   TEMPF
Example: How Errors are Found During Compilation
Review this program:

  DATA NEW;
  INPUT ID $ AGE TEMPC;
  TEMPF=TEMPC*(9/5)+32;
  DATALINES;
  0001 24 37.3
  0002 35 38.2
  ;
  run;
  proc print;run;
Original Program

  DATA NEW;
  INPUT ID $ AGE TEMPC;
  TEMPF=TEMPC*(9/5)+32;
  DATALINES;
  0001 24 37.3
  0002 35 38.2
  ;
  run;
  proc print;run;

Program output
  Obs   ID     AGE   TEMPC   TEMPF
  1     0001   24    37.3    99.14
  2     0002   35    38.2    100.76
Example of Error
Review this program. What's the error?

  DATA NEW;
  INPUT ID $ AGE TEMPC;
  TEMPF=TEMPC*(9/5)+32
  DATALINES;
  0001 24 37.3
  0002 35 38.2
  ;
  run;
  proc print;run;

Missing semicolon at the end of the TEMPF= statement.
Error found during compilation

  76   DATA NEW;
  77   INPUT ID $ AGE TEMPC;
  78   TEMPF=TEMPC*(9/5)+32
  79   DATALINES;
       ---------
       22
  80   0001 24 37.3
       ----
       180
  ERROR 22-322: Syntax error, expecting one of the following: !, !!, &, *, **, +, -, /, <, <=, <>, =, >, ><, >=, AND, EQ, GE, GT, IN, LE, LT, MAX, MIN, NE, NG, NL, NOTIN, OR, ^=, |, ||, ~=.
  ERROR 180-322: Statement is not valid or it is used out of proper order.
  81   0002 35 38.2
  82   ;
  83   run;
  ERROR: No DATALINES or INFILE statement.
Summary - Compilation Phase
During compilation, SAS:
  Checks code syntax
  Identifies the type and length of each new variable (is a data type conversion needed?)
  Creates the INPUT BUFFER if there is an INPUT statement for an external file
  Creates the Program Data Vector (PDV)
  Creates descriptor information for data sets and variable attributes
Other options not discussed here: DROP; KEEP; RENAME; RETAIN; WHERE; LABEL; LENGTH; FORMAT; ARRAY; BY; ATTRIB; END=, IN=, FIRST, LAST, POINT=
The compilation phase does not read any data.
Summary - Execution Phase
  1. The DATA step iterates once for each observation being created.
  2. Each time the DATA statement executes, _N_ is incremented by 1.
  3. Newly created variables are set to missing in the PDV.
  4. SAS reads a data record from a raw data file into the input buffer (there are other possibilities not discussed here).
  5. SAS executes any other programming statements for the current record.
  6. At the end of the DATA step statements (RUN;), SAS writes an observation to the SAS data set (OUTPUT PHASE).
  7. SAS returns to the top of the DATA step (Step 3 above).
  8. The DATA step terminates when there is no more data.
Quiz - Find Syntax Errors

  DATA MYDATA;
  INPUT ID $ SBP DBP GENDER $ AGE WT;
  DATALINES;
  1 120 80 M 15 115
  2 130 70 F 25 180
  3 140 100 M 89 170
  4 120 80 F 30 150
  5 125 80 F 20 110;
  PROC PRINT;
  RUN;

Where is the syntax error?
Answer: the semicolon that ends the data is on the same line as the last data record (5 125 80 F 20 110;). The semicolon that closes DATALINES must be on its own line after the last data line.
Find Syntax Errors

  DATA MYDATA;
  INFILE 'C:\SASDATA\EXAMPLE.DAT';
  INPUT ID $ 1-3 GP $ 5 AGE 6-9 TIME1 10-14 TIME2 15-19 TIME3 20-24;
  DATALINES;
  PROC MEANS;
  RUN;

Where is the syntax error?
Answer: the DATALINES; statement does not belong in this program. The data come from the external file named in the INFILE statement, so there are no in-stream data lines to read.
Find Syntax Errors

  DATA MYDATA;
  INFILE 'C:\SASDATA\EXAMPLE.CSV';
  DLM=',' FIRSTOBS=2 OBS=26;
  INPUT GROUP $ AGE TIME2 TIME3 Time4 SOCIO;
  PROC MEANS;
  RUN;

Where is the syntax error?
Answer: the semicolon after the file name ends the INFILE statement too early. DLM=, FIRSTOBS=, and OBS= are options of the INFILE statement, so they must come before the semicolon that ends it.
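A possible corrected version of the CSV program above is sketched here, assuming the intent was for DLM=, FIRSTOBS=, and OBS= to be options of the INFILE statement. It also assumes that the file C:\SASDATA\EXAMPLE.CSV exists and contains the six fields named in the INPUT statement.

```sas
DATA MYDATA;
   /* the options sit inside the INFILE statement, before its closing semicolon */
   INFILE 'C:\SASDATA\EXAMPLE.CSV' DLM=',' FIRSTOBS=2 OBS=26;
   INPUT GROUP $ AGE TIME2 TIME3 TIME4 SOCIO;
RUN;
PROC MEANS;
RUN;
```

With FIRSTOBS=2 the first line of the file (typically a header row) is skipped, and OBS=26 stops reading after line 26 of the file.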
Character Variable LENGTH in SAS
By default, character variables have a length of 8.

  DATA NAMES;
  INPUT FIRST $ LAST $ AGE;
  DATALINES;
  GEORGE WASHINGTON 30
  JAMES ADAMS 34
  BERNIE RUMPELSTILTSKIN 55
  ;
  proc print;
  run;

What's the problem with this code?
Results
  Obs   FIRST    LAST       AGE
  1     GEORGE   WASHINGT   30
  2     JAMES    ADAMS      34
  3     BERNIE   RUMPELST   55
Note the problem: last names longer than 8 characters are truncated.
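Solution: Use LENGTH Statement

  data names;
  LENGTH LAST $15.;
  input FIRST $ LAST $ AGE;
  Etc…

You could also use a FORMAT statement here.

  Obs   LAST              FIRST    AGE
  1     WASHINGTON        GEORGE   30
  2     ADAMS             JAMES    34
  3     RUMPELSTILTSKIN   BERNIE   55

Problem corrected. Note that LAST is now the first variable in the data set, because the LENGTH statement defines it before the INPUT statement does.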
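For reference, a complete runnable version of the LENGTH solution might look like the sketch below. The data lines are taken from the earlier slide; the PROC CONTENTS step is an addition for illustration, included to check the stored length of LAST.

```sas
DATA NAMES;
   LENGTH LAST $ 15;             /* define LAST with room for 15 characters */
   INPUT FIRST $ LAST $ AGE;
DATALINES;
GEORGE WASHINGTON 30
JAMES ADAMS 34
BERNIE RUMPELSTILTSKIN 55
;
RUN;
PROC PRINT DATA=NAMES;
RUN;
PROC CONTENTS DATA=NAMES;        /* the variable list reports the length of each variable */
RUN;
```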
Problem: Missing Data in Freeform

  data names;
  LENGTH LAST $15.;
  input FIRST $ LAST $ AGE;
  DATALINES;
  GEORGE WASHINGTON 30
  JAMES ADAMS
  BERNIE RUMPELSTILTSKIN 55
  ;
  proc print;
  run;

Note: No AGE for JAMES ADAMS
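Did not read all of the data!
Results are

  Obs   LAST         FIRST    AGE
  1     WASHINGTON   GEORGE   30
  2     ADAMS        JAMES    .

Because no AGE is found on the second data line, SAS moves to the next line to look for it. BERNIE is not a valid number, so AGE is set to missing, and the third record is never read as an observation of its own.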
Solution: Indicate missing data

  data names;
  LENGTH LAST $15.;
  input FIRST $ LAST $ AGE;
  DATALINES;
  GEORGE WASHINGTON 30
  JAMES ADAMS .
  BERNIE RUMPELSTILTSKIN 55
  ;
  proc print;
  run;

Note: the missing value is denoted as a dot (.)
Results
  Obs   LAST              FIRST    AGE
  1     WASHINGTON        GEORGE   30
  2     ADAMS             JAMES    .
  3     RUMPELSTILTSKIN   BERNIE   55
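An alternative that is not covered in these slides is the MISSOVER option of the INFILE statement, which tells SAS not to go to the next line when a data line runs out of values. The sketch below assumes the missing value is at the end of the line, as it is for JAMES ADAMS; MISSOVER does not help when a value in the middle of a line is missing.

```sas
DATA NAMES;
   INFILE DATALINES MISSOVER;    /* a short line leaves the remaining variables missing */
   LENGTH LAST $ 15;
   INPUT FIRST $ LAST $ AGE;
DATALINES;
GEORGE WASHINGTON 30
JAMES ADAMS
BERNIE RUMPELSTILTSKIN 55
;
RUN;
PROC PRINT DATA=NAMES;
RUN;
```

With this option, all three records should be read, with AGE missing for the second observation, without having to type a dot in the data.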
Missing Data in Column
  Read in DCOLUMN.SAS.
  Delete the 80 in the first record, the 150 in the 4th record, and the M in record 3 (preserve columns).
  Change PROC MEANS to PROC PRINT.
  Run the program and observe the output.

The edited program (data lines must start in column 1):

DATA MYDATA;
INPUT ID $ 1 SBP 2-4 DBP 5-7 GENDER $ 8 AGE 9-10 WT 11-13;
DATALINES;
1120   M15115
2130 70F25180
3140100 89170
4120 80F30
5125 80F20110
;
RUN;
PROC PRINT;
RUN;
Resulting Output: it's okay!

  Obs   ID   SBP   DBP   GENDER   AGE   WT
  1     1    120   .     M        15    115
  2     2    130   70    F        25    180
  3     3    140   100            89    170
  4     4    120   80    F        30    .
  5     5    125   80    F        20    110

Note that the blanks in the data are read as missing values: numeric missing values are indicated by a dot (.) and text missing values are indicated by a blank. This works if data are read using column or formatted input.
Location of created Data sets In the left window in SAS, click on the Explore tab. Notice the Contents of “SAS Environment.” Click on the Libraries icon. You will see several “Libraries” including Work. Click on Work.
Work Library
The WORK library contains all of the SAS data sets we've created so far. (Your list may be a little different.) Double-click on the one named Employees.
SAS Viewtable SAS Viewtable displays the contents of a data set. Click on “x” to close. Note that the name is work.employees. Close viewer.
3.10 SUMMARY

This chapter defined the difference between temporary and permanent data sets and illustrated several methods for importing data sets into SAS using either the SAS Wizard or PROC IMPORT. Finally, it explained the way SAS "thinks" as it is inputting data.
These slides are based on the book:
Introduction to SAS Essentials: Mastering SAS for Data Analytics, 2nd Edition
By Alan C. Elliott and Wayne A. Woodward
Paperback: 512 pages
Publisher: Wiley; 2 edition (August 3, 2015)
Language: English
ISBN-10: 111904216X
ISBN-13: 978-1119042167
These slides are provided for you to use to teach SAS using this book. Feel free to modify them for your own needs. Please send comments about errors in the slides (or suggestions for improvements) to acelliott@smu.edu. Thanks.