Introduction to SAS
What is SAS?
SAS originally stood for “Statistical Analysis System”. SAS is a computer software system that provides all the tools needed for data analysis:
– reading data: flexible input techniques
– transformations: programming language with statistical and mathematical functions
– manipulation: sorting, subsetting, concatenating, and merging
– maintenance: storing, documenting, updating, and editing
– report writing: printing information using pre-written procedures or customized programs
– graphics: charts, plots, maps, and slides
– data reduction and summarization: descriptive statistics
– statistical analysis: from simple crosstabs to complex multivariate techniques
SAS consists of a data-handling language and a library of procedures that work together as a system. A supervisor program controls the execution of your SAS job. The SAS System comprises numerous SAS products. This course focuses on base SAS software.
Components of Base SAS Software
Base SAS software contains:
– a data management facility
– a programming language
– data analysis and reporting facilities
Learning to use these features of base SAS software prepares you to learn other SAS products, because they all follow the same basic rules.
Data Management Facility
SAS organizes data into a rectangular form called a SAS data set. The example below shows the rectangular form and describes participants in a 16-week weight program at a health and fitness club. Note that a variable contains the same type of data value for all observations.
How to build a SAS data set?
Use the SAS programming language. A DATA step that creates the data set has the following parts:
(1) The DATA statement tells SAS to begin building a SAS data set named WEIGHT_CLUB.
(2) The INPUT statement identifies the fields to be read from the input data and names the SAS variables to be created from them.
(3) An assignment statement calculates the weight each person lost and assigns it to a new variable, Loss.
(4) The DATALINES statement indicates that data lines follow.
(5) The data lines contain the raw data. This way of reading raw data is useful when you don’t have a lot of data.
(6) A semicolon signals the end of the raw data.
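The six parts described above can be sketched as a DATA step like the one below. Only the data set name WEIGHT_CLUB and the variable Loss come from the slide; the other variable names (IdNumber, Name, Team, StartWeight, EndWeight) and all data values are illustrative assumptions, not the original slide's program.

```sas
/* (1) Begin building the SAS data set WEIGHT_CLUB. */
data weight_club;
   /* (2) Name the variables to read; $ marks a character variable.
          These variable names are assumed for illustration. */
   input IdNumber Name $ Team $ StartWeight EndWeight;
   /* (3) Assignment statement: compute the weight lost. */
   Loss = StartWeight - EndWeight;
   /* (4) DATALINES says that raw data lines follow. */
   datalines;
1023 David red 189 165
1049 Amelia yellow 145 124
1219 Alan red 210 192
1246 Ravi yellow 194 177
;
run;
```

The four lines between DATALINES and the lone semicolon are part (5), the raw data (made-up values here), and the semicolon on its own line is part (6).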
Programming Language
Rules for SAS Statements
– Most SAS statements begin with an identifying keyword.
– All SAS statements end with a semicolon.
– You can enter SAS statements in lowercase, uppercase, or a mixture.
– SAS statements are free format:
   – They can begin anywhere on a line and end anywhere on a line.
   – One statement can continue over several lines as long as you do not split a word over two lines.
   – Several statements can be on one line.
   – Words in SAS statements are separated by blanks (as many as you want) or by special characters, e.g. “=”.
Recommended style (not rules, but conventions):
– Start each statement on a new line.
– Start DATA and PROC statements in column 1. Indent the other statements within the DATA or PROC step to indicate the logical structure of the step.
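As a small illustration of the free-format rules, the two DATA steps below should be read by SAS in the same way. The data set and variable names (demo, demo2, x, y) are hypothetical and chosen only for this sketch.

```sas
/* Recommended layout: one statement per line, body indented. */
data demo;
   x = 1;
   y = x + 2;
run;

/* Legal but harder to read: mixed case, several statements on one
   line, extra blanks, and one statement continued over two lines. */
DATA demo2; x=1; y =
      x   +   2; Run;

proc print data=demo2;
run;
```

The second form is accepted because statements are delimited by semicolons rather than by line ends, which is why the recommended style is a convention and not a rule.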
Programming Language
Rules for Most SAS Names
SAS names are used for data sets, variables, and other items.
– A SAS name can contain from 1 to 32 characters.
– The first character must be a letter or underscore ( _ ).
– Subsequent characters can be letters, numbers, or underscores.
– Blanks cannot appear in a SAS name.
A Special Rule for Variable Names
– For variable names only, SAS remembers the combination of uppercase and lowercase letters used when the variable was created. Internally, the case does not matter (‘dog’, ‘DOG’, and ‘Dog’ represent the same variable). But for printing purposes, SAS uses the original case of each letter.
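A minimal sketch of these naming rules follows. The names _demo1 and Start_Weight2 are hypothetical, picked only to exercise the rules.

```sas
/* A data set name may begin with an underscore. */
data _demo1;
   /* Letters, digits, and underscores are allowed after the first character. */
   Start_Weight2 = 189;
   /* Different case, same variable: this updates Start_Weight2. */
   start_weight2 = START_WEIGHT2 - 24;
run;

/* The column heading uses the case from the first reference: Start_Weight2. */
proc print data=_demo1;
run;
```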
Data Analysis and Reporting Utilities
Base SAS includes a library of built-in programs known as SAS procedures. SAS procedures analyze data from SAS data sets and produce preprogrammed reports. The SAS program below uses the PRINT procedure to produce a report that displays the values of the variables in the WEIGHT_CLUB data set.
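A PROC PRINT step of the kind described here might look like the following sketch. It assumes the WEIGHT_CLUB data set already exists in the WORK library; the title text is an illustrative choice, not taken from the slide.

```sas
/* Assumes WORK.WEIGHT_CLUB was created by an earlier DATA step. */
proc print data=weight_club;
   title 'Health Club Data';   /* illustrative title text */
run;
```

With no other statements, PROC PRINT lists every variable and every observation in the data set.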
Data Analysis and Reporting Utilities The following output shows the results:
To produce a table showing the mean starting weight, ending weight, and weight loss for each team, you can use the TABULATE procedure.
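One way to write such a TABULATE step is sketched below. It assumes WEIGHT_CLUB exists in WORK and that its variables are named Team, StartWeight, EndWeight, and Loss; only Loss is named in the slides, so the other names and the title text are assumptions.

```sas
/* Assumes WORK.WEIGHT_CLUB exists with the variable names used below. */
proc tabulate data=weight_club;
   class Team;                          /* one table row per team */
   var StartWeight EndWeight Loss;      /* numeric analysis variables */
   table Team, mean*(StartWeight EndWeight Loss);
   title 'Mean Starting Weight, Ending Weight, and Weight Loss';
run;
```

In the TABLE statement the comma separates the row dimension (Team) from the column dimension, and the asterisk crosses the MEAN statistic with each analysis variable.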
The Structure of a SAS Program
A portion of a SAS program that begins with a PROC (procedure) statement and ends with a RUN statement (or another PROC or DATA statement) is called a PROC step. A DATA step is delimited in the same way, but begins with a DATA statement. Both of the PROC steps above include the following elements:
– A PROC statement, which includes the word PROC, the name of the procedure you want to use, and the name of the SAS data set that contains the values to be analyzed. (A DATA step begins instead with a DATA statement, which includes the word DATA and the name of the data set to create.)
– Additional statements that give SAS more information about what you want to do, for example, the CLASS, VAR, TABLE, and TITLE statements.
– A RUN statement, which indicates that the preceding group of statements is ready to be executed.
SAS Processing
All SAS jobs are a sequence of SAS steps. There are only two kinds of SAS steps:
– DATA steps are usually used to create SAS data sets, but can be used to produce reports.
– PROC steps analyze or process SAS data sets (generate reports and graphs, edit data, sort data) and, in some cases, create SAS data sets.
The DATA Step
In DATA steps, a powerful programming language gives programmers great flexibility in designing applications. DATA step capabilities include:
– Sophisticated record I/O
– Conditional logic
– Iterative DO loops
– Array processing
– Structured programming logic
– A wide range of functions
– Producing customized reports
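A few of these capabilities (an iterative DO loop, array processing, and conditional logic) can be combined in one short DATA step. This is only a sketch; the data set name powers and all variable names are hypothetical.

```sas
data powers;
   array v{3} v1-v3;              /* array over three numeric variables */
   do n = 1 to 5;                 /* iterative DO loop: one observation per n */
      do i = 1 to 3;
         v{i} = n ** i;           /* v1 = n, v2 = n squared, v3 = n cubed */
      end;
      if v3 > 50 then size = 'large';   /* conditional logic */
      else size = 'small';
      output;                     /* write the current observation */
   end;
   drop i;                        /* keep the inner loop index out of the data set */
run;

proc print data=powers;
run;
```

No input data is read here: the explicit OUTPUT statement inside the loop is what creates the observations.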
The PROC Step
In PROC steps, a large library of prewritten procedures enables end users to produce reports easily. You can use PROC steps in base SAS software for:
– List and tabular reports
– Graphics
– Statistical analysis
– Data management
– Ad hoc queries
– Accessing other software files
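Three of these uses are sketched below with the SORT, MEANS, and SQL procedures. The sketch assumes the sample data set SASHELP.CLASS (with variables Name, Age, Height, and Weight) is available in your installation; the output data set name class_sorted is hypothetical.

```sas
/* Data management: sort a copy of the sample data by age. */
proc sort data=sashelp.class out=class_sorted;
   by Age;
run;

/* Statistical analysis: simple descriptive statistics. */
proc means data=class_sorted mean min max;
   var Height Weight;
run;

/* Ad hoc query: list the students older than 14. */
proc sql;
   select Name, Age
      from class_sorted
      where Age > 14;
quit;
```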
The SAS Data Set
Data must be in the form of a SAS data set to be processed by most SAS procedures and some DATA step statements. SAS data sets consist of a descriptor portion that contains information about the data and a data portion that contains the data values. The data values in the data portion are arranged in a rectangular table.
SAS Variable
Rules for variable names:
– Can be 1 to 32 characters in length.
– Start with a letter (A-Z) or the underscore character ( _ ).
– Continue with letters, numbers, and underscores.
Recommendation: Choose names that describe the fields.
There are two kinds of variables:
– Character variables: Values are stored using ASCII representation and can be from 1 to 32,767 characters in length.
– Numeric variables: Values are stored using floating point representation and can be 3 to 8 bytes long (8 is typical).
SAS Variable Any number of variables can be stored in a SAS data set in SAS 9 (limited only by the computer’s capacity). The rows in a data set are called observations (or records). There is no limit to the number of observations.
Missing Values
Most collections of data contain missing values. The rectangular structure of a SAS data set implies that a value must exist for every variable for every observation. In SAS data sets, missing values are represented by:
– a period ( . ) for a numeric variable
– a blank (" ") for a character variable
Missing numeric values are not zero. An arithmetic expression that involves a missing value produces a missing result, and statistical computations generally exclude missing values.
– Each SAS PROC checks variables for missing values and takes appropriate action.
– See the individual PROC descriptions in the User's Guide for details.
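The behavior of missing numeric values can be explored with a sketch like the one below. The data set name scores, the variable names, and the data values are all made up for illustration.

```sas
data scores;
   input Name $ Test1 Test2;
   Total    = Test1 + Test2;       /* missing if either operand is missing */
   TotalSum = sum(Test1, Test2);   /* SUM function skips missing arguments */
   datalines;
Ann 90 85
Bob . 70
Cy 80 .
;
run;

/* N counts nonmissing values, NMISS counts missing ones;
   the mean is computed from the nonmissing values only. */
proc means data=scores n nmiss mean;
   var Test1 Test2;
run;
```

Comparing Total with TotalSum for the second and third observations shows the difference between the + operator and the SUM function.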
Documenting SAS Data Sets
In addition to the data values, a SAS data set contains descriptors about the data set as a whole and descriptors that give the names and attributes of the variables:
– the name of the data set and its member type
– the date and time the data set was created
– the number of observations
– the number of variables
– the engine type
– the attribute information for each variable: name, type, length, position, format for printing, informat for input, and label
The descriptor portion of a SAS data set can be listed with the CONTENTS procedure:
   PROC CONTENTS DATA=sc.class;
   RUN;
SAS Data Libraries
A SAS data library is a collection of SAS files recognized as a unit by the SAS System. In directory-based operating systems, such as Windows or UNIX, a SAS data library is a collection of SAS files of the same engine type stored in a specific directory.
Every SAS file has a two-level name. The first level determines whether the file is temporary or permanent. The general form of a SAS filename is:
   libref.SAS-filename
– libref is a name specified in a LIBNAME statement that is associated with a directory
– SAS-filename refers to a specific SAS file in the library
If you do not specify a libref (first-level name):
– The default libref is WORK.
– The data set is temporary.
The LIBNAME statement associates a libref with a directory containing SAS data files. Once defined, a libref can be used repeatedly throughout a program. You can think of librefs as temporary nicknames that you use to identify SAS data libraries during a SAS session.
LIBNAME statement
   LIBNAME libref <engine-name> 'SAS-data-library' <options>;
Items in angle brackets are optional.
– libref: any valid SAS name (but only up to 8 characters long)
– engine-name: an optional parameter specifying one of the library engines supported by a given operating system
   V8 accesses Version 8 or 9 SAS data sets
   V6 accesses Version 6.10, 6.11, and 6.12 SAS data sets
   XPORT accesses transport format files
– SAS-data-library: a directory
Example:
   LIBNAME classlib 'C:\SOCI6200\SASDATA';
   PROC PRINT DATA=classlib.class;
   RUN;
Temporary/Permanent SAS Libraries
You can store SAS data sets in a temporary SAS data library by omitting the libref or by using the libref WORK (a libref that SAS always assigns for you). For example, DATA one; and DATA work.one; both create the temporary data set WORK.ONE.
You can permanently store SAS data sets by using a libref other than WORK. The directory where you want to store your data sets must exist. For example:
   LIBNAME soci 'Y:\SOCI6200';
   DATA soci.one;
      INFILE xyz;
      INPUT a b c;
   RUN;
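The temporary case described above can be sketched as follows. The data set names one and two and the variables a and b are hypothetical.

```sas
/* No libref: the data set is stored as WORK.ONE. */
data one;
   a = 1;
run;

/* Explicit WORK libref: also temporary. */
data work.two;
   set one;        /* ONE and WORK.ONE name the same data set */
   b = a + 1;
run;

proc print data=work.one;
run;
```

Both data sets are deleted when the SAS session ends, unlike a data set stored under a libref such as soci.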
SAS Files The individual files in the library are considered members of the library. Member types include DATA, VIEW, CATALOG, ACCESS, and PROGRAM. SAS data sets can have one of two member types, DATA or VIEW, depending on the kind of information they contain.
Comments in SAS Code
There are two ways to insert comments in SAS:
   * message ;
   /* message */
Comments can be used anywhere in a SAS program for documentation purposes. SAS ignores comments during processing.
– * message ; must be written as a separate statement and cannot contain internal semicolons.
– /* message */ can be written within statements or anywhere a blank can appear. These comments can contain semicolons.
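Both comment styles are shown in the short sketch below; the data set name demo and the variable x are hypothetical.

```sas
* A statement-style comment ends at the first semicolon;

data demo;   /* a delimited comment; it may contain semicolons */
   x = 1 /* and may appear inside a statement */ + 2;
run;
```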
Outcomes of executing a SAS program
A SAS program consists of a series of DATA steps and PROC steps. When you execute a SAS program, the output generated by SAS is divided into two parts: the SAS Log and the SAS Output.
SAS Log
– Contains information about the processing of the SAS program.
– Prints the statements you entered.
– Prints error and warning messages.
– Prints NOTEs relating to each step. For each DATA step, a NOTE documents the creation of the data set. For each PROC step, a NOTE indicates the page numbers of the output and how much time the procedure spent operating.
SAS Output
– Contains the results of the PROC steps.
Starting and Running SAS Programs
There are three modes of execution, or environments, that you can use to run SAS programs:
– interactive windowing environment
– interactive line mode
– noninteractive or batch mode
This course discusses only the interactive windowing environment and the noninteractive mode, which are the two most common modes of execution.
SAS for Windows
The main windows in SAS for Windows are:
– Editor: where you write SAS programs and submit them for execution.
– Results: a table of contents for the Output window.
– Explorer: shows SAS files and libraries in a Windows Explorer-like display.
– Output: shows the results of SAS executions.
– Log: displays notes about the SAS session, including any errors and warnings produced after you submit a SAS program.