LISA SHORT COURSE SERIES: INTRODUCTION TO SAS UNIVERSITY William DeShong Fall 2015
Outline
1. SAS Overview
2. SAS University Environment
3. Data Step
   1. Importing Data Sets
   2. Merging Data Sets
4. Procedure Step
   1. Manipulate/View Data
      1. Proc Print
      2. Proc Sort
   2. Aggregate Data
      1. Proc Summary
      2. Proc Freq
      3. Proc Means
   3. Model Data
      1. Proc Reg (if time permits)
SAS
SAS (an acronym for Statistical Analysis System) is a data-driven programming language that provides information from data. The functionality of SAS is built around four data-driven tasks:
- Data Access: addresses or locates the data required by the programmer.
- Data Management: shapes the data into a form required by the programmer.
- Data Analysis: summarizes, reduces, or transforms raw data into meaningful and useful information.
- Data Presentation: communicates information in ways that clearly demonstrate its significance.
SAS Program
A SAS program (also called "SAS code") is a series of statements (or "steps") for SAS to execute. There are three types of SAS statements:
- DATA statements
- PROC statements
- global statements
All DATA statements end with a RUN command. All PROC statements end with either:
- a RUN command (for almost all statements), or
- a QUIT command (for very, very few statements)
Global statements (such as LIBNAME or TITLE) take effect as soon as they are submitted and need neither.
Flow of Programming
A DATA statement can be used to:
(1) create a SAS dataset from scratch,
(2) create a SAS dataset from a raw dataset,
(3) check for and correct errors in a dataset, and
(4) create a SAS dataset by merging, subsetting, and updating existing SAS datasets.
[Diagram: Raw Dataset -> DATA Statement -> SAS Dataset -> PROC Statement -> Report; Built-In SAS Dataset(s)]
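As a rough illustration of this flow, here is a minimal sketch. The dataset name work.players and its variables are made up for this example and do not come from the course data. A DATA statement reads raw values typed into the program, and a PROC statement then reports on the resulting SAS dataset:

```sas
/* DATA statement: build a SAS dataset from raw, in-line values */
data work.players ;                 /* hypothetical dataset name */
   input name $ hits atbats ;       /* name is character, the rest numeric */
   avg = hits / atbats ;            /* derived variable */
   datalines ;
Ann 30 100
Bob 45 150
;
run ;

/* PROC statement: turn the SAS dataset into a report */
proc print data = work.players ;
run ;
```

The WORK library is temporary, so work.players disappears when the SAS session ends.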
SAS Pointers
When programming in SAS, keep in mind the following pointers to prevent syntax errors:
- Semicolon Check: Every SAS statement ends with a semicolon ( ; ). A long statement (such as a list of formats or labels) may run over several lines, and only its last line gets the semicolon. One missing semicolon can break an entire SAS program.
- Use Comments: You can make a one-line comment by placing an asterisk ( * ) in front of your comment and a semicolon ( ; ) at the end of it. For a multi-line comment, start with ( /* ) on the first line and end with ( */ ) on the last line. Commented lines of code are ignored by the SAS processor. Comments help the programmer remember what each part of the SAS code does.
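The two comment styles can be sketched as follows. The dataset work.demo is a made-up name used only to give the comments something to annotate:

```sas
* One-line comment: starts with an asterisk and ends with a semicolon ;

/* Multi-line comment:
   everything between the opening and closing
   delimiters is ignored by SAS */

data work.demo ;      /* hypothetical dataset name */
   x = 1 ;            /* every statement ends with a semicolon */
run ;
```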
SAS University Edition Environment Let’s take a look at SAS University now!
Data Step
Importing Datasets Let's use the Data Importing Wizard!
Accessing Permanent SAS Datasets
To access existing SAS datasets, use the following code:

    libname name_of_library 'location_of_file' ;

The name_of_library is a name that you choose to represent the folder in which SAS datasets are stored or from which existing SAS datasets are accessed. The location_of_file is the location where SAS should go to find or save permanent SAS datasets. LIBNAME is a global statement, so it does not need a RUN command.
Accessing Permanent SAS Datasets Note that in giving the location, you do not mention which particular SAS dataset you want to use. Rather, you give the folder (directory path) where the SAS dataset(s) are located. Most SAS programmers put all of their SAS datasets for a project in one folder so that they can access them all at one time.
Accessing Permanent SAS Datasets
The name_of_library must be 1 to 8 characters long, must begin with a letter or underscore, and may contain only letters, numbers, or underscores.

Legal vs. Illegal Names of Libraries
How many of the following seven library names are legal library names?

    clinic1    1_clinic    _%clinic    _clinic1    _1clinic    clinic_1    1clinic_1

Answer: 4 (clinic1, _clinic1, _1clinic, and clinic_1).
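A minimal sketch of assigning and checking a library is shown below. The folder path is an assumption (it follows the shared-folder layout commonly used with SAS University Edition) and must be replaced with a folder that exists on your machine. The libref clinic1 is taken from the list of legal names above.

```sas
/* Assumption: this folder exists and holds one or more SAS datasets */
libname clinic1 '/folders/myfolders/clinic' ;

/* List the SAS datasets stored in the library.
   NODS suppresses the per-dataset variable listings. */
proc contents data = clinic1._all_ nods ;
run ;
```

Once the library is assigned, any dataset in that folder can be referenced as clinic1.datasetname.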
Descriptive Statistics Functions
Below are a few of the descriptive statistics functions. Most of these descriptive statistics can be found using PROC MEANS or PROC UNIVARIATE.

Function  Syntax                          Calculates
SUM       sum(argument, argument, …) ;    sum of values
MEAN      mean(argument, argument, …) ;   average of nonmissing values
MIN       min(argument, argument, …) ;    minimum value
MAX       max(argument, argument, …) ;    maximum value
VAR       var(argument, argument, …) ;    variance of the values
STD       std(argument, argument, …) ;    standard deviation
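To see these functions at work, here is a small sketch. The dataset work.scores and the variables exam1 to exam3 are invented for this example. Each function is applied across the variables of one observation, and missing values are skipped.

```sas
data work.scores ;                          /* hypothetical dataset */
   input exam1 exam2 exam3 ;
   total   = sum(exam1, exam2, exam3) ;     /* sum of nonmissing values */
   average = mean(exam1, exam2, exam3) ;    /* average of nonmissing values */
   lowest  = min(exam1, exam2, exam3) ;
   highest = max(exam1, exam2, exam3) ;
   spread  = std(exam1, exam2, exam3) ;     /* standard deviation */
   datalines ;
8 7 9
6 . 10
;
run ;

proc print data = work.scores ;
run ;
```

For the second observation, exam2 is missing, so total is 16 and average is 8 (computed from the two nonmissing scores).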
Date and Time Functions

Function  Syntax                               Calculates
TODAY     today( ) ;                           gives today's SAS date value, requires no arguments
TIME      time( ) ;                            gives the current time, requires no arguments
MDY       mdy(month_val, day_val, year_val) ;  gives back the numeric SAS date value
DAY       day(date_val) ;                      gives back the day of the month of the SAS date value (1-31)
QTR       qtr(date_val) ;                      gives back the quarter of the year of the SAS date value (1-4)
WEEKDAY   weekday(date_val) ;                  gives back the numeric day of the week of the SAS date value (1-7, where 1 = Sunday)
MONTH     month(date_val) ;                    gives back the month of the SAS date value (1-12)
YEAR      year(date_val) ;                     gives back the year of the SAS date value (4 digits)
Date and Time Functions
Here are some interesting ones, however.

INTCK
    intck('day', SASdate1, SASdate2) ;
    intck('week', SASdate1, SASdate2) ;
    intck('month', SASdate1, SASdate2) ;
    intck('qtr', SASdate1, SASdate2) ;
    intck('year', SASdate1, SASdate2) ;
    Provides the number of {day, week, month, quarter, year} boundaries crossed between two SAS date values.

INTNX
    intnx('interval', SAS_start_date, increment, 'alignment_character') ;
    Gives a SAS_end_date, which is SAS_start_date plus a multiple (increment) of the time interval.
    alignment_characters (shown for a month interval):
    'b' = beginning of the interval (1st of the month)
    'm' = middle of the interval (around the 15th of the month)
    'e' = end of the interval (last day of the month)
    's' = same day as SAS_start_date
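A short sketch of INTCK and INTNX is given below. The two dates are arbitrary values chosen for illustration, and the results are written to the SAS log with PUT statements.

```sas
data _null_ ;                                    /* no output dataset needed */
   start  = mdy(1, 15, 2015) ;                   /* 15JAN2015 */
   finish = mdy(3, 1, 2015) ;                    /* 01MAR2015 */

   days_between   = intck('day', start, finish) ;    /* 45 */
   months_crossed = intck('month', start, finish) ;  /* 2: crosses Feb 1 and Mar 1 */

   next_begin = intnx('month', start, 1, 'b') ;  /* 01FEB2015 */
   next_end   = intnx('month', start, 1, 'e') ;  /* 28FEB2015 */
   next_same  = intnx('month', start, 1, 's') ;  /* 15FEB2015 */

   put days_between= months_crossed= ;
   put next_begin= date9. next_end= date9. next_same= date9. ;
run ;
```

Note that INTCK with 'month' counts month boundaries rather than full 30-day periods, which is why 45 days can correspond to 2 months here.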
Mathematical Functions
Below are a few of the many mathematical functions. There are too many to list them all; you learn them as you learn how to program.

Function  Syntax                   Calculates
ROUND     round(argument, d) ;     rounds to the nearest d, where d = 10 (tens), d = 1 (integer), d = .1 (tenths), d = .01 (hundredths)
LOG       log(argument) ;          takes the natural log
LOG10     log10(argument) ;        takes the log base 10
FLOOR     floor(argument) ;        rounds down to the nearest integer
CEIL      ceil(argument) ;         rounds up to the nearest integer
INT       int(argument) ;          returns the integer part of the value only
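The difference between FLOOR, CEIL, and INT is easiest to see with a negative number. The sketch below uses an arbitrary value and writes the results to the SAS log.

```sas
data _null_ ;
   x = -2.567 ;
   to_tenth = round(x, .1) ;   /* -2.6 */
   to_int   = round(x, 1) ;    /* -3   */
   down     = floor(x) ;       /* -3: next integer below x */
   up       = ceil(x) ;        /* -2: next integer above x */
   whole    = int(x) ;         /* -2: drops the decimal part */
   log_base10 = log10(100) ;   /* 2 */
   put to_tenth= to_int= down= up= whole= log_base10= ;
run ;
```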
Character Functions

Function  Syntax                                Calculates
SCAN      scan(argument, n, delimiters) ;       returns the n-th word from a character value
SUBSTR    substr(argument, position, length) ;  extracts a substring (or, on the left of an assignment, replaces character values)
TRIM      trim(argument) ;                      trims trailing blanks
INDEX     index(source, excerpt) ;              searches a character value for a specific string and returns its position
UPCASE    upcase(argument) ;                    converts to uppercase letters
LOWCASE   lowcase(argument) ;                   converts to lowercase letters
PROPCASE  propcase(argument) ;                  uppercases the first letter of each word and lowercases the rest
TRANWRD   tranwrd(source, target, replace) ;    replaces or removes all occurrences of a pattern of characters
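Here is a brief sketch of several of these functions applied to one made-up character value. The results are written to the SAS log.

```sas
data _null_ ;
   name = 'smith, john' ;                         /* made-up value */
   last_name  = scan(name, 1, ', ') ;             /* smith */
   first_name = scan(name, 2, ', ') ;             /* john  */
   first3     = substr(name, 1, 3) ;              /* smi   */
   comma_pos  = index(name, ',') ;                /* 6     */
   shout      = upcase(name) ;                    /* SMITH, JOHN */
   proper     = propcase(name) ;                  /* Smith, John */
   renamed    = tranwrd(name, 'john', 'jane') ;   /* smith, jane */
   put last_name= first_name= first3= comma_pos= ;
   put shout= proper= renamed= ;
run ;
```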
PROC SORT Statement
The purpose of PROC SORT is to reorganize a SAS dataset by a subset of its variables. The PROC SORT statement can:
- sort by one variable or by more than one variable
- sort in ascending order or descending order
- remove duplicates while sorting (not by default; you must request it, for example with the NODUPKEY option)

    proc sort data = libref.datasetname ;
       by var1 var2 … vark ;
    run ;
PROC SORT Statement
If you specify the out= option, SAS will sort a copy of the original SAS dataset (dataset1) and store the sorted copy in a second SAS dataset (dataset2), leaving dataset1 unchanged. If you do not use the out= option, SAS will sort dataset1 and store the result back into dataset1. Thus, it overwrites the dataset and you lose the original order.

    proc sort data = libref2.dataset1 out = libref2.dataset2 ;
       by var1 var2 … vark ;
    run ;
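The options above can be sketched with sashelp.class, a small sample dataset that ships with SAS. It is used here only as a stand-in, and the output dataset names are made up. Because the SASHELP library is read-only, the out= option is needed in both steps.

```sas
/* Sort by sex (ascending), then by age from oldest to youngest */
proc sort data = sashelp.class out = work.class_sorted ;
   by sex descending age ;
run ;

/* Keep only the first observation for each distinct age */
proc sort data = sashelp.class out = work.one_per_age nodupkey ;
   by age ;
run ;
```

The DESCENDING keyword applies only to the variable that follows it.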
Merging Data Sets with Match-Merging
With simple match-merging, the SAS programmer is trying to link observations together using the values of the variables listed in the BY statement.

    proc sort data = SAS-dataset-1 ;
       by variable_1 variable_2 … variable_n ;
    run ;
       …
    proc sort data = SAS-dataset-k ;
       by variable_1 variable_2 … variable_n ;
    run ;

    data newSASdataset ;
       merge SAS-dataset-1 SAS-dataset-2 … SAS-dataset-k ;
       by variable_1 variable_2 … variable_n ;
       SAS statements ;
    run ;
Match-Merging To perform this technique, all of the original SAS datasets being merged must first be sorted by the variables in the BY statement.
Example #1 Flight attendants for International Airlines need to pass three exams (federal regulations, customer service, and safety procedures) in order to become certified flight attendants. They can take them at any time, but they must pass the federal regulations exam before moving on. Below are three permanent SAS datasets showing the attempts by id number and their scores. A score higher than 6 is needed to pass.
Match-Merging [Step 1: Use a PROC SORT] The PROC SORT steps will sort the three SAS datasets by the idnum variable. This will set us up to begin the simple match-merging procedure.
Match-Merging [Step 2: The DATA Statement] The DATA step will link the observations together by the idnum variable. But how does SAS accomplish this?
Match-Merging [Step 3: The Merging] From all three SAS datasets, SAS searches for the first set of observations with the lowest value for idnum. In this case, it is the missing value in the third dataset. Why? (SAS sorts missing values before all nonmissing values.) Notice, however, that there are no observations in the other SAS datasets with a missing idnum. If an input SAS dataset does not have a matching BY value, then the observation in the output SAS dataset contains missing values for the variables that are unique to that input dataset.
Match-Merging [Step 3: The Merging] SAS now searches for the next lowest value for the idnum variable. Here, the value appears in only two of the three SAS datasets. Again, SAS will put in missing values for the fr_score variable.
Match-Merging [Step 3: The Merging] Fortunately, the next idnum value appears exactly once in all three datasets. SAS simply links them together. So when a BY variable value appears the same number of times in all of the SAS datasets, SAS has no problem at all linking them together in order.
Match-Merging [Step 3: The Merging] SAS is going to do the same for the next idnum value. Since there is an equal number of observations in all three SAS datasets, SAS is going to link them together by the order in which they appear. The first observations in each dataset will link together and the second observations will link together.
Match-Merging [Step 3: The Merging] Now look at this situation. Not only are we missing an observation in the third SAS dataset, but there is an uneven number of observations in the first two. SAS can only pair observations one-to-one when the datasets have the same number of observations for a BY variable value.
Match-Merging [Step 3: The Merging] SAS then links the observations in the order in which they appear, replicating observations in the background where a dataset runs short. This is also what actually happens for SAS datasets that have no observation for a BY variable value. Please note that the replicated observations do not appear in the input SAS datasets.
Match-Merging [Step 3: The Merging] This is similar to the last example (but with more observations). Questions How many observations will SAS create for idnum 4524? What observations will be replicated to perform the match-merging? How will SAS link these records together?
Match-Merging [Step 3: The Merging] I think you get the point now, right? And so you should know what appears next in the SAS dataset. There will be two observations for idnum 5702…
Match-Merging [Step 3: The Merging] There will be three observations for idnum 6256…
Match-Merging [Step 3: The Merging] There will be one observation for 7803…
Match-Merging [Step 3: The Merging] There will be two observations for idnum 8008…
Match-Merging [Step 3: The Merging] And finally, there will be four observations for the last idnum value.
Voila! … the SAS dataset is complete.
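The merging behavior walked through above can be reproduced on a small scale. The sketch below does not use the course datasets: the three datasets, their values, and the variable names cs_score and sp_score are invented for illustration (only idnum and fr_score appear in the slides).

```sas
data work.fr ;                      /* federal regulations attempts */
   input idnum fr_score ;
   datalines ;
2002 9
1001 5
1001 8
;
run ;

data work.cs ;                      /* customer service attempts */
   input idnum cs_score ;
   datalines ;
1001 7
2002 6
2002 8
3003 9
;
run ;

data work.sp ;                      /* safety procedures attempts */
   input idnum sp_score ;
   datalines ;
. 4
2002 7
;
run ;

/* Step 1: sort every input dataset by the BY variable */
proc sort data = work.fr ; by idnum ; run ;
proc sort data = work.cs ; by idnum ; run ;
proc sort data = work.sp ; by idnum ; run ;

/* Step 2: match-merge by idnum */
data work.all_exams ;
   merge work.fr work.cs work.sp ;
   by idnum ;
run ;

proc print data = work.all_exams noobs ;
run ;
```

The result has six observations: one for the missing idnum (only sp_score filled in), two for 1001 (the single cs_score of 7 is repeated on both), two for 2002 (fr_score 9 and sp_score 7 are repeated on both), and one for 3003 (only cs_score filled in).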
Common Variable (Simple Match-Merging)
Keep in mind that all four common variable rules apply for the simple match-merging process:
1. The common variable must have the same variable type (i.e. numeric or character) in each of its original SAS datasets. Otherwise, SAS will return an error message.
2. The values from the last original SAS dataset overwrite the previous values stored for that variable.
3. If a common variable has different formats, SAS will use the first format it sees for that variable.
4. If a common variable has different lengths, SAS will use the first length it sees for that variable.
It is the second rule (overwriting) that we are going to investigate more right now. The last thing that we want to do is overwrite data.
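The overwriting rule can be seen in a small sketch. The datasets work.a and work.b and the common variable score are made up for this example. The RENAME= dataset option shown in the second step is one possible way to keep both sets of values; it is offered as an illustration, not as the approach used in the course.

```sas
data work.a ;
   input idnum score ;
   datalines ;
1 5
2 7
;
run ;

data work.b ;
   input idnum score ;
   datalines ;
1 9
2 4
;
run ;

/* score from work.b (the last dataset listed) overwrites score from work.a */
data work.overwritten ;
   merge work.a work.b ;
   by idnum ;
run ;

/* Renaming the common variable on the way in keeps both values */
data work.kept ;
   merge work.a (rename = (score = score_a))
         work.b (rename = (score = score_b)) ;
   by idnum ;
run ;
```

In work.overwritten, score holds 9 and 4; in work.kept, score_a holds 5 and 7 while score_b holds 9 and 4.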
The PROC PRINT Statement
The PROC PRINT statement is one of the most commonly used procedures in SAS. This statement lets you output a SAS dataset (or a subset of it) in the output window. The most basic format of the PROC PRINT statement is the following:

    proc print data = libref.datasetname ;
    run ;

In this format, SAS will print all of the variables in the SAS dataset into the output window unformatted. Of course, there are ways to enhance the output (some of which we will cover now).
PROC PRINT: Options
If you want SAS to print specific variables, you can adjust the code by including a var statement.

    proc print data = libref.datasetname ;
       var variable1 variable2 … variablek ;
    run ;

You can also produce column totals for numeric variables by using a sum statement.

    proc print data = libref.datasetname ;
       sum num_variable ;
    run ;
PROC PRINT: Options (cont.)
You can suppress the observation number by including the noobs option in the code.

    proc print data = libref.datasetname noobs ;
    run ;

Alternatively, if you have a variable that represents the identity of each observation, you can use the id statement to replace the default observation number.

    proc print data = libref.datasetname ;
       id variable1 ;
    run ;
PROC PRINT: Options (cont.)
Rather than use the variable name, you can substitute a label for the variable by including a label statement. But notice that the label option must also appear in the PROC PRINT statement itself.

    proc print data = libref.datasetname label ;
       label variable1 = 'Variable 1' ;
    run ;

You can also print a subset of observations from the SAS dataset based on a condition or a set of conditions using a where statement in the code.

    proc print data = libref.datasetname ;
       where insert_condition_here ;
    run ;
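Several of these options can be combined in one step. The sketch below uses sashelp.class, a sample dataset that ships with SAS, as a stand-in for your own data. The label text and the age cutoff are arbitrary choices.

```sas
proc print data = sashelp.class noobs label ;
   var name sex age height ;           /* print only these variables */
   where age >= 14 ;                   /* subset of observations */
   label name = 'Student Name' ;       /* shown because of the label option */
   sum height ;                        /* column total for height */
run ;
```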
PROC CONTENTS Statement
The purpose of PROC CONTENTS is to provide a detailed listing of:
- the variables in a SAS dataset
- the SAS datasets located in a SAS library

    proc contents data = libref.datasetname ;
    run ;

    proc contents data = libref._all_ ;
    run ;

The _all_ keyword references all of the SAS datasets in a SAS library.
PROC FREQ Statement
Now, we turn our attention to procedures that will help produce results in the output window. The purpose of PROC FREQ is to create a frequency or relative frequency table over a subset of SAS variables. The code to do this is the following:

    proc freq data = libref.datasetname ;
       tables var1 var2 … vark ;
    run ;

The PROC FREQ statement can not only create a table by one or more variables, but it can also save the results as a SAS dataset.
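As a hedged illustration, the sketch below again uses the sample dataset sashelp.class as a stand-in. The out= option on the tables statement is one way to save a frequency table as a SAS dataset; the output name work.sex_counts is made up.

```sas
proc freq data = sashelp.class ;
   tables sex age ;                          /* one table per variable */
   tables sex / out = work.sex_counts ;      /* also save the counts */
run ;

proc print data = work.sex_counts ;
run ;
```

The saved dataset contains one row per value of sex, with COUNT and PERCENT columns.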
PROC MEANS
Let's start from the basics. The basic form of PROC MEANS is the following:

    proc means data = libref.datasetname ;
    run ;

This basic form:
- produces statistical output for all of the numeric variables in the SAS dataset
- produces the sample size, mean, standard deviation, minimum, and maximum values by default
We will use our baseball SAS dataset to understand how this procedure works.
Scenario Here is a SAS dataset called baseball. It is located in the 'ia' library.
Scenario Here is the breakdown of the variables.
PROC MEANS
Here is the application of PROC MEANS without any options:

    proc means data = ia.baseball ;
    run ;

Again, without any options, SAS calculates the sample mean, sample standard deviation, sample size, minimum, and maximum values for each numeric variable in the SAS dataset. The output is placed in a table and is posted in the output window (i.e. no new output window is created from the MEANS procedure unless specified otherwise).
Let's adjust the code to get better output.
PROC MEANS [var keyword]
Notice that in the last slide, all of the numeric variables were provided in the output. To restrict the output to specific variables in the SAS dataset, include a var statement followed by only the variables that you want. SAS will output the statistics for those variables only (and in that order).

    proc means data = libref.datasetname ;
       var variable_1 variable_2 variable_3 … variable_k ;
    run ;

Note: if you have SAS variables with names that differ only by a number at the end of the variable name (for example: exam1 exam2 exam3 exam4 exam5), you can reference all of them by saying the following:

    var variable_1 - variable_k ;

For our example, we can say:

    var exam1 - exam5 ;

PROC MEANS [statistic keywords]
You can specify which descriptive statistics you want to output by listing them after the name of the dataset. By using this option, you override the default statistics. Now, SAS will only produce the statistics that you specify. There are dozens of statistical keywords to choose from.

    proc means data = libref.datasetname statistic_1 statistic_2 … statistic_k ;
       var variable_1 variable_2 variable_3 … variable_k ;
    run ;
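A minimal sketch of both ideas (choosing variables and choosing statistics) follows. It uses the sample dataset sashelp.class as a stand-in for the course's baseball data, which is not reproduced here. The keywords n, mean, median, and std are standard PROC MEANS statistic keywords, and maxdec= limits the number of decimal places shown.

```sas
proc means data = sashelp.class n mean median std maxdec = 2 ;
   var height weight ;      /* statistics for these two variables only */
run ;
```

Because statistic keywords are listed, the default minimum and maximum no longer appear in the output.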
PROC REG
The basic form of PROC REG is the following:

    proc reg data = libref.datasetname ;
       id id_variable ;
       model responsevar = var1 var2 … vark ;
    run ;
    quit ;

This basic form produces a linear regression model with model fit statistics, parameter estimates, and residual diagnostics. PROC REG is one of the few procedures that stays active after RUN, so it is ended with a QUIT command. We will use our Salary of Major League Baseball Players SAS dataset to understand how this procedure works.
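Since the baseball salary data is not reproduced here, the following sketch fits a regression on the sample dataset sashelp.class instead. The choice of weight as the response and height and age as predictors is arbitrary and only meant to show the syntax.

```sas
proc reg data = sashelp.class ;
   id name ;                        /* identifies observations in the output */
   model weight = height age ;      /* response = predictors */
run ;
quit ;                              /* PROC REG stays active until QUIT */
```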
Questions?
Special Thanks
Dr. Chris Franck, Assistant Director of LISA
Tonya Pruitt, Administrative Specialist, LISA
Dr. Marlow Lemons
Kris Patton
Elaine Perrin
Weibin Xu