USING SAS PROCEDURES

SAS System Options

This chapter is about procedures, but first let's talk about SAS System options. The settings of these options can affect the appearance of the output from procedures.

SAS System options are parameters you can change that affect the SAS System: how it works, what the output looks like, error handling, and so forth. You can set system options in the SAS System Options window, in an OPTIONS statement, or in a couple of other ways. In this class we will focus on the OPTIONS statement, which is submitted as part of a SAS program and affects all steps that follow it.
OPTIONS Statement

The OPTIONS statement is one of the special SAS statements that do not belong to either a PROC or a DATA step. It can appear anywhere in a program, but often the very first line in a program is an OPTIONS statement. That way you can easily see what options are in effect. A later OPTIONS statement overrides earlier settings of the options it names; options it does not mention keep their current values.
OPTIONS Statement: Commonly Used Options

  CENTER | NOCENTER
    This option works as a switch. CENTER centers output on the page, while NOCENTER left-justifies it.
  DATE | NODATE
    This option is also a switch. DATE places the date at the top of each page of output; NODATE suppresses it.
  NUMBER | NONUMBER
    This switch controls whether a page number appears on each page of output.
  LINESIZE=n
    Controls the maximum length of output lines. Takes values from 64 to 256. A linesize of 98 or less tends to work well for SOCI6200 assignments.
  PAGESIZE=n
    Controls the maximum number of lines per page of output. Takes values from 15 to 32767. A pagesize of 55 or less tends to work well for SOCI6200 assignments.
  PAGENO=n
    Starts numbering pages with n.
  FORMDLIM='-'
    Asks SAS to write a line of dashes (-----) where it would normally go to a new page, thus possibly conserving paper.
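As a minimal sketch of how these options combine, the statement below sets several of them at once. The particular values are only illustrations chosen to fit the ranges given above, not required settings.

```sas
/* Left-justify output, suppress the date, restart page numbering at 1, */
/* and limit line length and page length (values are illustrative).     */
OPTIONS NOCENTER NODATE NUMBER PAGENO=1 LINESIZE=98 PAGESIZE=55;

/* Optional: print a row of dashes instead of starting a new page. */
OPTIONS FORMDLIM='-';
```

Because an OPTIONS statement affects all steps that follow it, placing it at the top of the program keeps the settings easy to find.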
The PROC Step

SAS procedures are computer programs that:

  read SAS data sets
  compute statistics
  write results
  create SAS data sets

We will examine the following procedures:

  SORT        Sorts observations by one or more variables.
  CONTENTS    Provides information about a SAS data set.
  PRINT       Prints the observations in a SAS data set using some or all of the variables.
  FREQ        Produces one-way to n-way frequency and crosstabulation tables as well as many measures of association.
  UNIVARIATE  Provides data summarization tools and information on the distribution of numeric variables.
  MEANS       Produces descriptive statistics for numeric variables.
Using SAS Procedures Using a procedure is like filling out a form. Someone else designed the form, and all you have to do is fill in the blanks and choose from a list of options. Most procedures write results to the Output window or to a listing file, depending on how you are running SAS. You can customize the appearance of the output using the system options described above. Many procedures can also write results to an output SAS data set using an OUTPUT statement or an OUT= option. Several other statements can be used with almost every procedure. These optional statements include BY, WHERE, TITLE, FOOTNOTE, LABEL, and FORMAT. All procedures can use the Output Delivery System (ODS) to produce results that are more attractive than the default.
Using SAS Procedures: PROC Statement

Each PROC step begins with a PROC statement. A PROC statement consists of the word PROC followed by the name of the procedure you want to run, followed by any procedure options you want to set for this run. The procedure options should almost always include DATA=<data set name> to specify the data set that you want to analyze. This example asks PROC PRINT to print the weight_club data set double-spaced.

  PROC PRINT DATA=weight_club DOUBLE;
  RUN;
BY Statement

A BY statement tells the procedure to perform a separate analysis for each combination of values of the BY variables rather than treating all observations as one group. All procedures except PROC SORT assume that your data are already sorted by the variables in your BY statement. If they are not, sort them first with PROC SORT.
BY Statement This example sorts the weight_club data set by team and then computes the average weight loss for each team.
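The program itself did not survive on this slide, so the following is only a sketch of what such a step could look like. It assumes that weight_club contains the variables team and loss (both appear in the PROC PRINT examples later in this chapter) and that loss holds each member's weight loss.

```sas
/* Sort in place by team so that BY-group processing is possible. */
PROC SORT DATA=weight_club;
   BY team;
RUN;

/* Compute the mean of loss separately for each team. */
PROC MEANS DATA=weight_club MEAN;
   BY team;
   VAR loss;
   TITLE 'Average Weight Loss by Team';
RUN;
```

Without an OUT= option, PROC SORT replaces the original data set with the sorted version.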
WHERE Statement

The WHERE statement tells the procedure to use a subset of the data. The basic form of a WHERE statement is

  WHERE condition ;

Only observations satisfying the condition will be used by the PROC. The left side of the condition must be a variable name, and the right side of the condition can be a variable name, a constant, or a mathematical expression.
WHERE Statement: operators Here are the most frequently used operators:
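The operator table from this slide is not reproduced here. As a hedged sketch, the comments below list comparison and logical operators that SAS accepts in a WHERE condition, followed by one step that uses several of them. The team values 'red' and 'yellow' and the cutoff of 5 are hypothetical.

```sas
/* Symbol   Mnemonic          Meaning                      */
/*   =      EQ                equal to                     */
/*   ^=     NE                not equal to                 */
/*   >      GT                greater than                 */
/*   <      LT                less than                    */
/*   >=     GE                greater than or equal to     */
/*   <=     LE                less than or equal to        */
/*   &      AND               both conditions true         */
/*   |      OR                either condition true        */
/*          IS NOT MISSING    value is not missing         */
/*          BETWEEN ... AND   value within a range         */
/*          CONTAINS          value contains a substring   */
/*          IN ( list )       value matches a list member  */

PROC PRINT DATA=weight_club;
   WHERE loss >= 5 AND team IN ('red', 'yellow');
RUN;
```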
WHERE Statement For example, this PROC step prints only the members of the yellow team.
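The step referred to above is missing from the slide text; a minimal sketch might look like this. It assumes the team variable is character and stores the value in lowercase as 'yellow'. Character comparisons are case-sensitive, so the constant must match the stored value exactly.

```sas
PROC PRINT DATA=weight_club;
   WHERE team = 'yellow';          /* assumed spelling of the stored value */
   TITLE 'Members of the Yellow Team';
RUN;
```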
TITLE Statement

Use TITLE statements to place descriptive information at the top of every page of output. The basic form of a TITLE statement is TITLEn followed by the text of the title in quotes. You can specify up to ten titles by adding numbers to the keyword TITLE; the keyword TITLE alone is the same as TITLE1. You can use either single or double quotes as long as they are the same on either end of the text. If you want to put an apostrophe in a title, use double quotes.
TITLE Statement

TITLE statements can appear anywhere in a program, not just in a PROC step, but it often makes sense to put them in PROC steps since they affect procedure output. The maximum length of a title is 256 characters or the current linesize, whichever is less. Once you specify a title for a line, it is used for all subsequent output until you cancel the title with a null statement or define another title for that line. A null statement for TITLE2 would be the following:

  TITLE2;

The following cancels all existing titles:

  TITLE;
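A short sketch of how titles persist and are cancelled, using the weight_club data set from earlier examples. The title text is made up for illustration.

```sas
TITLE1 'Weight Club Report';
TITLE2 "Members' Results";        /* double quotes allow the apostrophe */

PROC PRINT DATA=weight_club;      /* prints with both titles */
RUN;

TITLE2;                           /* null statement: cancels TITLE2 */

PROC PRINT DATA=weight_club;      /* prints with TITLE1 only */
RUN;

TITLE;                            /* cancels all titles */
```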
FOOTNOTE Statement

The FOOTNOTE statement works exactly the same way as the TITLE statement. However, footnotes print at the bottom of the page.

  FOOTNOTE 'your comments';
  FOOTNOTE1 'your comments';
  FOOTNOTE2 'your comments';
  FOOTNOTE;
LABEL Statement

By default, SAS uses variable names to label your output, but with the LABEL statement you can create more descriptive labels, up to 256 characters long, for each variable. When a LABEL statement is used in a DATA step, the labels are permanently attached to the variables. When used in a PROC step, the labels stay in effect only for the duration of that step. Labels containing special characters such as " or ; must be enclosed in single quotes. Labels containing single quotes must be enclosed in double quotes. Otherwise, labels can be enclosed in either single or double quotes.
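For illustration only, here is a LABEL statement used inside a PROC step, so the labels last just for that step. The label text is hypothetical, and the LABEL option on the PROC PRINT statement is needed for PROC PRINT to show labels as column headings.

```sas
PROC PRINT DATA=weight_club LABEL;
   VAR name team loss;
   LABEL loss = 'Weight Loss'
         team = "Member's Team";   /* double quotes because of the apostrophe */
RUN;
```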
FORMAT Statement

Formats can be used to change the appearance of printed values: how many decimal places to print, whether to show a $ when a variable contains amounts of money, and so on. Base SAS includes many formats for numeric, character, and date values. You use a FORMAT statement to associate formats with particular variables. This statement is mentioned here because it is commonly used in PROC steps. However, we will save a detailed discussion of formats for a later class.
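As a brief preview, and assuming that price and cost in the sales data set used later in this chapter hold money amounts, a FORMAT statement in a PROC step could look like this:

```sas
PROC PRINT DATA=SOCI6200.sales(OBS=10);
   VAR price cost;
   FORMAT price cost DOLLAR8.2;   /* e.g. 1234.5 prints as $1234.50 or $1,234.50 depending on width */
RUN;
```

As with LABEL, a FORMAT statement in a PROC step applies only to that step.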
The Output Delivery System (ODS) The Output Delivery System (ODS) can be used to make your procedure output more visually appealing and more portable to other software packages than is traditional SAS output. You can make HTML files to display on the web, RTF files to import into Microsoft Word, or PDF files to read with Acrobat Reader. All procedures support ODS. We will save a detailed discussion of this topic for later in this module.
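Until that discussion, the following sketch shows the general shape of an ODS request: open a destination, run a procedure, close the destination. The file name weight_club.pdf is a placeholder.

```sas
ODS PDF FILE='weight_club.pdf';   /* open the PDF destination */

PROC PRINT DATA=weight_club;
RUN;

ODS PDF CLOSE;                    /* close it so the file is written */
```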
An example program Here is a simple program that uses many of the statements described above, followed by the output.
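The original program and its output are not included in this text. The sketch below is a stand-in that combines the statements covered so far; the data set club_sorted, the titles, and the WHERE condition are all assumptions made for illustration.

```sas
OPTIONS NOCENTER NODATE PAGENO=1 LINESIZE=98;

/* Sort a copy of the data so BY-group processing can be used. */
PROC SORT DATA=weight_club OUT=club_sorted;
   BY team;
RUN;

PROC PRINT DATA=club_sorted LABEL NOOBS;
   BY team;                        /* one listing per team            */
   WHERE loss > 0;                 /* keep members who lost weight    */
   VAR name loss;
   LABEL loss = 'Weight Loss';
   TITLE1 'Weight Club Members by Team';
   FOOTNOTE1 'Only members with a positive weight loss are listed';
RUN;

TITLE;                             /* clear titles and footnotes      */
FOOTNOTE;
```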
Output of the example program
The SORT Procedure The SORT procedure sorts observations in a SAS data set by one or more variables, storing the sorted observations in a new SAS data set or replacing the original data set. Windows and UNIX use the ASCII collating sequence. Sorting character variables: Sorting numeric variables Features include: multiple sort fields ascending or descending order
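The SORT Procedure

Basic form of the PROC SORT statement:

  PROC SORT DATA=data_set_name OUT=data_set_name ;

Statement used with PROC SORT:

  BY variable_list ;

Put the keyword DESCENDING before any BY variable that should be sorted from high to low.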
An example program

  PROC SORT DATA=SOCI6200.sales OUT=sales ;
     BY dept DESCENDING clerk ;
  RUN;

  PROC PRINT DATA=sales NOOBS ;
     TITLE 'Sorting a SAS Data Set' ;
  RUN;
BY Group Processing

The BY statement can be used with procedures to perform subgroup processing. The PROC will be executed separately for each of the subsets of observations defined by the values of the BY variables. Syntax:

  BY variable_list ;

The data set must already be sorted by the variables in the BY variable list, in the same order.
An example program and output

  PROC SORT DATA=SOCI6200.sales OUT=sales ;
     BY dept ;
  RUN;

  PROC MEANS DATA=sales ;
     BY dept ;
     TITLE 'PROC MEANS by DEPT' ;
  RUN;
The CONTENTS Procedure

PROC CONTENTS prints the descriptor information from a SAS data set. It lists data set attributes followed by all variables and their attributes in alphabetical order. Variable attributes displayed include type, length, label (if present), and format (if present). PROC CONTENTS is especially useful for documenting a permanent data set since it displays the data set creation date and date of last modification. Basic form of the PROC statement:

  PROC CONTENTS DATA=data_set_name ;
An example program and output PROC CONTENTS DATA=weight_club; RUN;
The PRINT Procedure

The PRINT procedure lists the observations in a SAS data set, using all or some of the variables. Features include:

  Automatic formatting
  Columns labeled with variable names or labels
  Special handling of section and page breaks
  If requested, accumulation and printing of subtotals and totals
  Special BY/ID formatting

Basic form of the PROC statement:

  PROC PRINT DATA=data_set_name options ;
The PRINT Procedure

Selected procedure statement options:

  DOUBLE  Double-spaces the output.
  LABEL   Uses variable labels as column headings. If this option is not specified, variable names are used as column headings.
  N       Prints the number of observations at the end of the printed output.
  NOOBS   Suppresses the observation number column in the printed output.

Special statements supported by PROC PRINT:

  VAR variable_list ;
  ID variable_list ;
  PAGEBY name_of_by_variable ;
  SUM variable_list ;
  SUMBY name_of_by_variable ;
The PRINT Procedure: VAR Statement

The VAR statement selects the variables to appear in the report and determines their order. The basic form of the VAR statement is

  VAR variable-list ;

variable-list can be any of the following:

  List of variable names:                           VAR x1 x2 x3 ;
  Range of variables in positional order:           VAR name--age ;
  Numeric suffix range:                             VAR x1-x3 ;
  Only numeric variables:                           VAR _numeric_ ;
  Only character variables:                         VAR _character_ ;
  All variables starting with the letter A:         VAR A: ;
  All variables starting with the letters AB:       VAR AB: ;
The PRINT Procedure ID Statement If an ID statement is included in a PROC PRINT step, observations in the listing are identified by the value of the ID variable(s) rather than by observation number. The basic form of the ID statement is ID variable-list; The variables in the ID statement list appear before any variables in the VAR statement list. Even without an ID statement, the NOOBS option of the PROC PRINT statement can be used to turn off observation numbers.
The PRINT Procedure Sample results without an ID statement PROC PRINT DATA=weight_club; VAR name team loss; RUN;
The PRINT Procedure Sample results with an ID statement
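The program for this slide is missing from the text. A plausible counterpart to the previous example, offered only as a sketch, moves name from the VAR statement to an ID statement so that each row is identified by the member's name instead of an observation number.

```sas
PROC PRINT DATA=weight_club;
   ID name;            /* name replaces the Obs column */
   VAR team loss;
RUN;
```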
The PRINT Procedure: PAGEBY Statement

The PAGEBY statement can be used to identify a variable appearing in the BY statement in the same PROC PRINT step. If the value of this BY variable changes, or if the value of any BY variable that precedes it in the BY statement changes, PROC PRINT begins printing a new page. The general form of the PAGEBY statement is

  PAGEBY name_of_by_variable ;

The PRINT Procedure: SUM Statement

The SUM statement identifies any numeric variables to total in the report. The general form of the SUM statement is

  SUM variable(s) ;
The PRINT Procedure: SUMBY Statement

The SUMBY statement can be used with the SUM statement to limit the number of sums that appear in the report. It identifies a variable appearing in the BY statement in the same PROC PRINT step. If the value of this BY variable changes, or if the value of any BY variable that precedes it in the BY statement changes, PROC PRINT prints the sums of all variables listed in the SUM statement. The general form of the SUMBY statement is

  SUMBY name_of_by_variable ;
PROC PRINT Examples

  PROC PRINT DATA=weight_club N NOOBS DOUBLE;
     SUM loss;
     TITLE 'Listing of the Weight_Club Data Set';
     TITLE2 'Using the N, NOOBS, and DOUBLE Options';
     TITLE3 'Also, Using a SUM Statement But No VAR Statement';
  RUN;

The listing ends with the sum of loss (from the SUM statement) and the number of observations (from the N option).
Print certain # of observations PROC PRINT DATA=SOCI6200.sales(OBS=20); VAR _CHARACTER_; TITLE1 'Listing of the First 20 Observations of the SALES Data Set'; TITLE2 'Only Character Variables'; RUN;
PROC PRINT example

  PROC SORT DATA=SOCI6200.sales OUT=sales;
     BY dept clerk;
  RUN;

  PROC PRINT DATA=sales(OBS=25) LABEL;
     BY dept clerk;
     WHERE price > 60;
     VAR price cost;
     SUM price;
     SUMBY dept;
     LABEL cost='Cost of Item'
           price='Price of Item';
     TITLE1 'Partial Listing of Sales Data Set';
     TITLE2 'Using Selected Options and Statements';
  RUN;

SUMBY requires a BY statement in the same step, so BY dept clerk is included here.
The FREQ Procedure

The FREQ procedure produces one-way to n-way frequency and crosstabulation tables. Features include:

  distributions of variable values (frequency counts)
  tests for proportions for one-way tables
  combined frequencies for two or more variables
  weighted frequencies
  measures of association and tests (chi-square and exact) for n-way tables
  ability to output frequencies to a SAS data set
  stratified analysis, within and across strata

Basic form of the PROC FREQ statement:

  PROC FREQ DATA=data_set_name options ;
The FREQ Procedure

The TABLES statement:

  TABLES table_requests / options ;

  one-way:    TABLES variable ;
  two-way:    TABLES var_1 * var_2 ;
  three-way:  TABLES var_1 * var_2 * var_3 ;

For two- to n-way tables, the last variable is the column variable and the next-to-last variable is the row variable. Other variables can be called stratum variables. Any number of requests can be given in one TABLES statement, and any number of TABLES statements can be used in one PROC FREQ step. By default, missing levels of each variable are excluded from the table, but the total frequency of missing observations is printed below each table.
The FREQ Procedure: some options

  NOFREQ     suppresses frequencies
  NOCUM      suppresses cumulative frequencies and percentages
  NOPERCENT  suppresses percentages
  NOROW      suppresses row percentages
  NOCOL      suppresses column percentages
  MISSPRINT  prints missing value frequencies in the table but leaves them out of percentages and statistics
  MISSING    interprets missing values as non-missing and includes them in calculations of percentages and other statistics
  LIST       presents two- to n-way tables in list format rather than as crosstabulation tables
  OUT=       names an output data set to contain variable values and frequency counts
  OUTPCT     adds percentages to the OUT= data set
  SPARSE     lists all possible combinations of the variable values for an n-way table even if a combination does not occur in the data; only has an effect with the LIST or OUT= option
  ALL        requests the tests and measures of association produced by CHISQ, MEASURES, and CMH
  CHISQ      requests chi-square tests and measures of association based on chi-square
  MEASURES   requests measures of association
  CMH        computes Cochran-Mantel-Haenszel statistics
The FREQ Procedure: examples

Basic example

  PROC FREQ DATA=sales;
     TABLES dept clerk dept*clerk;
  RUN;

Example suppressing percents

  PROC FREQ DATA=sales;
     TABLES dept*clerk / NOROW NOCOL NOPERCENT;
     TITLE1 "Table with Percents Suppressed";
  RUN;

Missing data example

  PROC FREQ DATA=clin ;
     TABLES evdnk*visit ;
     TABLES evdnk*visit / MISSPRINT ;
     TABLES evdnk*visit / MISSING ;
     TITLE1 "Missing Data Example" ;
  RUN;
The FREQ Procedure: examples

Example using the ALL option

  PROC FREQ DATA=sales2 ;
     TABLES sex*clerk / ALL ;
     TITLE 'Example with the ALL Option';
  RUN;

Using the WEIGHT statement

  WEIGHT var-name-of-counts ;
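The WEIGHT statement is useful when each observation already holds a cell count rather than a single case. The sketch below builds a tiny data set of made-up counts (the data set name, clerk names, and numbers are all invented for illustration) and asks PROC FREQ to treat count as the number of cases in each cell.

```sas
/* Hypothetical pre-tabulated data: one row per sex-by-clerk cell. */
DATA sex_by_clerk;
   INPUT sex $ clerk $ count;
   DATALINES;
F Adams 12
F Baker 8
M Adams 5
M Baker 15
;
RUN;

PROC FREQ DATA=sex_by_clerk;
   TABLES sex*clerk / CHISQ;
   WEIGHT count;                  /* each row stands for count cases */
   TITLE 'Crosstabulation from Pre-Tabulated Counts';
RUN;
```

Without the WEIGHT statement, every cell in this table would show a frequency of 1.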
The UNIVARIATE Procedure

The UNIVARIATE procedure provides data summarization tools, high-resolution graphics displays, and information on the distribution of numeric variables. Features include:

  Descriptive statistics
  Details on extreme values and extreme observations
  Median, mode, range, and quantiles
  Tests for location and normality
  Confidence limits
  Low-resolution plots to picture the distribution
  High-resolution histograms, quantile-quantile plots, and probability plots
  Frequency tables
  Output data sets
The UNIVARIATE Procedure

Basic form of the PROC UNIVARIATE statement:

  PROC UNIVARIATE options DATA=data_set_name ;

Selected options:

  FREQ     produces a frequency table
  NOPRINT  suppresses all tables of descriptive statistics; use when producing an output data set
  NORMAL   computes a test statistic for the hypothesis that the input data come from a normal distribution
  PLOT     produces a stem-and-leaf plot, a box plot, and a normal probability plot (all low-resolution)
  ROUND    specifies units for rounding variable values before computations
  CIBASIC  requests confidence limits for the mean, standard deviation, and variance, assuming normally distributed data
The UNIVARIATE Procedure

Selected statements used with PROC UNIVARIATE:

  BY variable list ;
  VAR variable list ;
  FREQ variable ;
    Specifies a numeric variable whose values represent the frequency of the observation.
  WEIGHT variable ;
    Specifies weights for the analysis variables.
  ID variable list ;
    Identifies the extreme observations in the appropriate table.
  OUTPUT OUT=sas_data_set output-statistic-list
         PCTLPTS=percentiles
         PCTLPRE=prefix-name-list
         PCTLNAMES=suffix-name-list ;
The UNIVARIATE Procedure Basic examples PROC UNIVARIATE DATA=sales2; VAR price; TITLE1 'PROC UNIVARIATE with No Options'; RUN; PROC UNIVARIATE DATA=sales2 FREQ PLOT NORMAL ; VAR COST ; ID CLERK ; LABEL COST='Total Cost' ; TITLE1 'PROC UNIVARIATE with Several Options and Optional Statements'; RUN;
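The UNIVARIATE Procedure: Output data set example

  PROC SORT DATA=sales2 OUT=sales;
     BY sex;
  RUN;

  PROC UNIVARIATE DATA=sales NOPRINT;
     BY sex;
     VAR cost price;
     OUTPUT OUT=stats N=ncost nprice
                      NMISS=nmcost nmprice
                      MEAN=mcost mprice
                      MAX=maxcost maxprice ;
  RUN;

  PROC PRINT DATA=stats;
     TITLE1 'Output Data Set from PROC UNIVARIATE';
  RUN;

The BY sex statement uses the sort order created by PROC SORT and produces one row of statistics per value of sex in the output data set.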
The MEANS Procedure The MEANS procedure produces simple univariate descriptive statistics for numeric variables. Selected univariate statistics Special BY group processing: using a BY or CLASS statement will cause MEANS to calculate descriptive statistics separately for groups of observations Printed output optional Output data set of summary statistics optional
The MEANS Procedure

Basic form of the PROC MEANS statement:

  PROC MEANS DATA=data_set_name options statistic-keywords ;

Selected options:

  FW=      specifies the field width to use in printing each statistic; default is 12
  NOPRINT  suppresses all printed output
  MAXDEC=  specifies the maximum number of decimal places
  MISSING  requests that PROC MEANS treat missing values as valid subgroup values for the CLASS variables
  NWAY     specifies that statistics be output for only the observations with the highest _TYPE_ value
The MEANS Procedure

Selected statistic-keywords (default: N, MEAN, STD, MIN, and MAX):

  N       number of non-missing values
  NMISS   number of missing values
  MEAN    the mean
  SUM     the sum of all values
  STD     standard deviation
  STDERR  standard error of the mean
  VAR     variance
  MAX     maximum
  MIN     minimum
  CLM     confidence limits for the mean
  MEDIAN  the median
  P1      1st percentile
  P5      5th percentile
  P10     10th percentile
  P90     90th percentile
  P95     95th percentile
  P99     99th percentile
  Q1      25th percentile
  Q3      75th percentile
  T       t statistic for testing that the mean is zero
  PROBT   p-value for the t statistic
The MEANS Procedure

Selected statements used with PROC MEANS:

  BY variable list ;
  VAR variable list ;
  CLASS variable list / MISSING ;
    The MISSING option requests that missing values be treated as valid class level values.
  FREQ variable ;
  WEIGHT variable ;
  OUTPUT OUT=sas_data_set output-statistic-list / AUTONAME ;
  ID variable list ;
    Can be used to add extra variables to the output data set.
The MEANS Procedure: examples

Basic examples

  PROC MEANS DATA=sales2;
     TITLE1 'PROC MEANS with No Options';
  RUN;

  PROC MEANS DATA=sales2 MEAN MIN MAX ;
     VAR cost price ;
     TITLE1 'PROC MEANS: Request Specific Statistics for Selected Variables';
  RUN;

BY-group processing / the CLASS statement

  PROC SORT DATA=sales2 OUT=sales ;
     BY sex;
  RUN;

  PROC MEANS DATA=sales N MEAN;
     BY sex ;
  RUN;

  PROC MEANS DATA=sales2 N MEAN;
     CLASS sex ;
  RUN;

The BY version needs the sorted data set; the CLASS version works on unsorted data.
The MEANS Procedure: examples

Example: Creating an output SAS data set using a BY statement (also AUTONAME)

  PROC SORT DATA=sales2;
     BY sex;
  RUN;

  PROC MEANS DATA=sales2 NOPRINT;
     BY sex;
     VAR cost price;
     OUTPUT OUT=summary N= MEAN= / AUTONAME;
  RUN;

  PROC PRINT DATA=summary;
     TITLE1 'Output SAS Data Set from PROC MEANS';
     TITLE2 'with the BY Statement (also AUTONAME)';
  RUN;
The MEANS Procedure: examples Example: Creating an output SAS data set using a CLASS statement with and without the NWAY option. Note how the NWAY option affects the output data set. PROC MEANS DATA=sales2 NOPRINT ; CLASS sex ; VAR cost price ; OUTPUT OUT=summary N=NCost NPrice MEAN=MeanCost MeanPrice ; RUN; PROC MEANS DATA=sales2 NOPRINT NWAY; CLASS sex ; VAR cost price ; OUTPUT OUT=summary N=NCost NPrice MEAN=MeanCost MeanPrice ; RUN;
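To compare the two results side by side, it helps to write them to different data sets and print both. The following self-contained sketch uses a small made-up data set (demo, with invented values) and two hypothetical output names, summary_all and summary_nway.

```sas
/* Tiny illustrative data set; the values are invented. */
DATA demo;
   INPUT sex $ cost price;
   DATALINES;
F 10 15
F 20 30
M 30 40
M 50 70
;
RUN;

/* Without NWAY: an overall row (_TYPE_=0) plus one row per sex (_TYPE_=1). */
PROC MEANS DATA=demo NOPRINT;
   CLASS sex;
   VAR cost price;
   OUTPUT OUT=summary_all N=NCost NPrice MEAN=MeanCost MeanPrice;
RUN;

/* With NWAY: only the rows with the highest _TYPE_ value (one per sex). */
PROC MEANS DATA=demo NOPRINT NWAY;
   CLASS sex;
   VAR cost price;
   OUTPUT OUT=summary_nway N=NCost NPrice MEAN=MeanCost MeanPrice;
RUN;

PROC PRINT DATA=summary_all;
   TITLE 'CLASS Statement without NWAY';
RUN;

PROC PRINT DATA=summary_nway;
   TITLE 'CLASS Statement with NWAY';
RUN;
```

With one CLASS variable that has two levels, the first data set should contain three observations and the second two.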
SAS Descriptive Procedures: Summary

SORT
  Used to rearrange observations by one or more sort fields.
  Must be used before a BY statement in other procedures unless the data are already in BY-variable order.
  Can sort data in ascending or descending order.
  Sorts data in the ASCII collating sequence on Windows and UNIX.
  Can be used to check for and remove duplicate observations.

CONTENTS
  Displays the descriptor part of a SAS data set.
  Can be used to get meta-information on existing or new SAS data sets.
  To see information similar to PROC CONTENTS output, view the properties of a data set in the SAS Explorer window.

PRINT
  Displays the data part of a SAS data set.
  Can be used to check a data set or to produce simple listing reports.
  Another way to visually inspect the data portion of a data set is to use the ViewTable window, e.g., by double-clicking on the data set in the SAS Explorer window.
SAS Descriptive Procedures: Summary

FREQ
  Displays the distribution of variable values in tabular format.
  Displays the combined frequency for two or more variables.
  Displays measures of association and test statistics for two-way tables.
  More appropriate for discrete variables.
  Should be used to print out all of the possible values a discrete variable can have and to get a frequency distribution of each discrete variable.

UNIVARIATE
  Gives detailed information about the distribution of a numeric variable.
  More appropriate for continuous variables.
  Should be run on any variables to be used in more analytic procedures.
  Can be used to output statistics of interest.

MEANS
  Provides simple univariate statistics for numeric variables.
  More condensed output than UNIVARIATE.