Preparing your data for analysis using SAS Landon Sego 24 April 2003 Department of Statistics UW-Madison.


Assumptions: You have used SAS at least a few times. It doesn’t matter whether you run SAS in interactive mode (Windows) or in batch mode (Unix/Linux).

Interactive SAS for Windows

Editing SAS code in EMACS for Batch mode execution

Executing SAS in batch mode at the Linux prompt

Where we’re going… Rarely does data come to you in a form that is analyzable. In the best case, all you need to do is clean your data and check it for consistency. In the worst case, the data needs extensive manipulation before it can be analyzed. We want to get familiar with some tools in SAS used to check, clean, and manipulate data.

The dark toolbox SAS is like a toolbox as big as a garage—with thousands of tools. For many SAS users, this toolbox is dark and there is a monster lurking inside. Let’s turn on the light and meet a few of the tools available in SAS. No guarantees about the monster…

Published resources
– Lots available
– I’ve learned on my own and with the SAS documentation
– Google searches

SAS Online Documentation or

What I use most often in the SAS Online Documentation
Base SAS Software
– SAS Language Reference: Concepts
– SAS Language Reference: Dictionary
– SAS Macro Language Reference
– SAS Procedures Guide
SAS/STAT
– SAS/STAT User’s Guide

SAS Language Reference: Dictionary

SAS Procedures Guide

SAS/STAT User’s Guide

Conventions, terminology, and options

Conventions SAS terminology in red SAS code in blue

Basic terminology: Data in SAS exists as a data set, with variables and observations. Variables are the columns; observations are the rows. There are two types of variables: character and numeric. Character variables can range in length from 1 to 32,767 (2^15 - 1) characters. Numeric variables can hold values of virtually any size (within the limitations of the computer).

My favorite options
On the first line of almost every SAS program I write, I include the following:
    options nodate nocenter nonumber ps=3000 ls=200 mprint mlogic symbolgen;
These options control the format of the output and make macro code easier to debug.

Importing data into SAS

Data can exist in many forms (text file, Excel spreadsheet, permanent SAS data set, etc.). Excel spreadsheets are probably the most common form. You can use DDE (dynamic data exchange), but only in the Windows version of SAS. For Excel files, I like to use the CSV (comma-separated values) file format, which works on any platform.

Excel → CSV text file → SAS data set
A column of ‘j’s provides a buffer at the end of each line of text in the CSV file. If you are running SAS on a Linux or UNIX machine, you need to add the j’s (or use the dos2unix command to convert the CSV file to the text formatting used by UNIX).

Save Excel spreadsheet in the CSV format

How the CSV file looks (when viewed with a text editor)
    Location,Type,Length,j
    Albuquerque,1,1.414,j
    Albuquerque,1,2.000,j
    Albuquerque,1,1.414,j
    Albuquerque,1,2.236,j
    Albuquerque,2,2.000,j
    Albuquerque,2,2.236,j
    Lexington,1,2.000,j

SAS code to import the CSV file
    data fake;
      infile 'c:\mydata\fake.csv' dsd firstobs=2;
      input location :$14. type length;
    proc print;
    run;
Note: if I were to use
    input location $ type length;
it would truncate the “location” variable to 8 characters in length.
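As a variation on the import step above, here is a minimal sketch that declares the character width with a length statement and lets SAS handle Windows line endings itself. The path is hypothetical, and the termstr= option is assumed to be available (it was added in SAS 9, so it would not apply to the V8 engine shown in these slides). This is one possible alternative to the column of j's, not the author's method.

```sas
/* Sketch only: the path below is a placeholder for wherever fake.csv lives */
data fake;
  /* Declare the width first so list input does not fall back to 8 characters */
  length location $ 14;
  infile '/home/me/mydata/fake.csv'
         dsd            /* comma-delimited; consecutive commas mean a missing value */
         firstobs=2     /* skip the header row */
         termstr=crlf;  /* assumption: SAS 9 or later; treats CR+LF as the line end */
  input location $ type length;
run;

proc print data = fake;
run;
```

If the CSV file still contains the trailing j column, the extra field is simply not read, because the input statement names only three variables.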

Results from the proc print
[Output listing with columns Obs, location, type, length: 19 observations (6 Albuquerque, 7 Lexington, 6 Johannesburg)]

Checking and summarizing data

    proc contents data=fake;
    proc freq data=fake;
    run;
This is IMPORTANT! Don’t take for granted that there aren’t mistakes in your data.
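A slightly more targeted check can help when a data set has many variables. The sketch below assumes the fake data set from the import step; the choice of tables and statistics is illustrative only.

```sas
/* Frequency tables for the categorical variables, including their cross-tabulation */
proc freq data = fake;
  tables location type location*type / missing;  /* MISSING counts missing values as a level */
run;

/* Range check for the continuous variable */
proc means data = fake n nmiss min max mean;
  var length;
run;
```

Misspelled location names show up as extra rows in the frequency table, and impossible measurements show up in the min and max columns.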

Result from proc contents
    Data Set Name: WORK.FAKE
    Member Type: DATA
    Engine: V8
    Created: 11:46 Wednesday, April 16, 2003
    Last Modified: 11:46 Wednesday, April 16, 2003
    Observations: 19
    Variables: 3
    Indexes: 0
    Observation Length: 32
    Deleted Observations: 0
    File Name: /tmp/sastmp_sego/SAS_workB6B _ gstat201.stat.wisc.edu/fake.sas7bdat
    Alphabetic List of Variables and Attributes (columns #, Variable, Type, Len, Pos): length (Num), location (Char), type (Num, Len 8, Pos 0)

Results from proc freq
[Output from the FREQ Procedure: one table each for location (Albuquerque, Johannesburg, Lexington), type, and length, with columns Frequency, Percent, Cumulative Frequency, Cumulative Percent]

Selecting subsets of the data

Selecting observations (rows)
– A large contiguous group of observations
– Specific observation numbers
– Using selection criteria, e.g. when the location is “Lexington” or when the length is between 1 and 2

Selecting a group of contiguous observations
    data smallfake;
      set fake (firstobs=10 obs=15);
    proc print;
Selects observations 10 through 15, using data set options.
[Output listing with columns Obs, location, type, length: four Lexington observations followed by two Johannesburg observations]

Selecting specific observation numbers
    data smallfake;
      set fake;
      if _n_ in (7,11,16);
    proc print;
Selects observation numbers 7, 11, and 16.
[Output listing with columns Obs, location, type, length: Lexington, Lexington, Johannesburg]

Selection criteria: where statement
    data smallfake;
      set fake;
      where location = 'Lexington';
or
      where location ne 'Lexington';
or
      where location in ('Lexington', 'Albuquerque');
or
      where (1 le length le 2);
or
      where (length > 2.3);

Selection criteria: if statement
    data smallfake;
      set fake;
      if location in ('Lexington', 'Albuquerque');
or
      if location = 'Lexington' | location = 'Albuquerque';
or
      if location = 'Johannesburg' then delete;
These three if statements produce identical results.
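The where and if statements can also be combined in one step. The sketch below is illustrative: the data set name big and the variable double_len are made up for the example. It relies on the fact that where filters on variables already stored in the input data set, while a subsetting if can test a variable created inside the step.

```sas
data big;
  set fake;
  /* WHERE is applied as observations are read, so it can only use variables in fake */
  where location ne 'Johannesburg';
  /* Hypothetical derived variable, computed after the observation is read */
  double_len = 2 * length;
  /* Subsetting IF runs after the assignment, so it can test the new variable */
  if double_len > 4;
run;

proc print data = big;
run;
```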

Some comparison operators
    ne   or  ^=    not equal to
    eq   or  =     equal to
    ge   or  >=    greater than or equal to
    gt   or  >     greater than
    le   or  <=    less than or equal to
    lt   or  <     less than
    in             if contained in a group
    not in         if not contained in a group
    and  or  &     and logical operator
    or   or  |     or logical operator

Selecting and managing variables

Selecting variables using “keep” (data set options)
    data smallfake (keep = location length);
      set fake;
      where type = 1;
(type is available for processing, but not written to the smallfake data set)
    data smallfake;
      set fake (keep = location length);
(type is not available for processing)

Selecting variables using “drop” (data set options)
    data smallfake (drop = type);
      set fake;
      where type = 1;
(type is available for processing, but not written to the smallfake data set)
    data smallfake;
      set fake (drop = type);
(type is not available for processing)

Renaming variables (location → place, type → trt)
    data fake1;
      set fake (rename=(location=place type=trt));
      where trt = 1;

    data fake2 (rename=(location=place type=trt));
      set fake;
      where type = 1;

    data fake3 (drop = location type);
      set fake;
      where type = 1;
      place = location;
      trt = type;
These three pieces of code achieve the same result. Look closely at the where statements.

Concatenation

Concatenation (stacking): SAS can stack multiple data sets on top of one another. Pay attention to whether the variables and their attributes (length and variable type) match among the different data sets. You can use the set statement or proc append to concatenate data sets.

Using the set statement to concatenate data sets
Suppose you wanted to stack the three data sets fake, faux, and fraud on top of one another:
    data fantastic;
      set fake faux fraud;
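When stacking, it is often useful to record which data set each observation came from and to fix the character length up front. The sketch below assumes location is a character variable in all three data sets; the variable name source and the in= flag names are made up for the example.

```sas
data fantastic;
  /* Declaring lengths before SET avoids inheriting a shorter length from the first data set */
  length location $ 14 source $ 5;
  set fake  (in = from_fake)
      faux  (in = from_faux)
      fraud (in = from_fraud);
  /* The in= flags are temporary 0/1 variables and are not written to fantastic */
  if from_fake then source = 'fake';
  else if from_faux then source = 'faux';
  else source = 'fraud';
run;
```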

Using proc append to concatenate data sets
proc append concatenates only two data sets at a time, and typically these data sets must have the same variable names with the same attributes.
    proc append base=fake data=faux;
Here the observations in faux are tacked onto the end of the fake data set. The combined data set is called fake.
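When the two data sets do not match exactly, proc append accepts a force option. A minimal sketch, assuming faux has a variable that fake lacks or a longer character length:

```sas
/* FORCE appends even when the variables differ: variables found only in   */
/* the data= data set are dropped, and longer character values are         */
/* truncated to the length defined in the base= data set.                  */
proc append base = fake data = faux force;
run;
```

Check the log after running this, since SAS writes a warning for each mismatch it forces through.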

Splitting data into several data sets
Suppose we want all Albuquerque observations to go into a data set called albuq, the Lexington observations to go into the data set lexing, and observations that have lengths larger than 3.0 to go into the data set large.
    data albuq lexing large;
      set fake;
      if location = 'Albuquerque' then output albuq;
      else if location = 'Lexington' then output lexing;
      if length gt 3 then output large;

Merging data

Merging (combining) data
Merging data sets places two or more data sets side by side into a single data set. If you simply want to place two data sets side by side (1 to 1 merging):
    data faux;
      set faux (rename=(location=location1 type=type1));
    data fantastic;
      merge fake faux;
    proc print data = fantastic;

Results of 1 to 1 merge
[Output listing with columns Obs, location, type, length, location1, type1, weight: rows are paired by position, so location and location1 need not agree (the first row pairs Albuquerque with Lexington)]

Merging data with a “by” variable
All data sets that will be merged must first be sorted by the linking variables:
    proc sort data = fake;
      by location type;
    proc sort data = faux;
      by location type;
    data fantastic;
      merge fake faux;
      by location type;
    proc print data = fantastic;

Results of merging with “by” variables
[Output listing with columns Obs, location, type, length, weight: observations ordered by location (Albuquerque, Johannesburg, Lexington)]

More about merging
When you merge with a by statement, you may only want observations that have by-variable “matches” in both data sets.
[Figure: the data sets fake and fraud]

Using the (in =) data set option
Assume both fake and fraud are sorted by location and type.
    data fantastic;
      merge fake (in = tmp1) fraud (in = tmp2);
      by location type;
      from_fake = tmp1;
      from_fraud = tmp2;
    proc print;

Identifying obs from both data sets
[Output listing with columns Obs, location, type, length, thickness, from_fake, from_fraud: observations for Albuquerque, Johannesburg, and Lexington]

Using the (in =) data set option
Now select observations that are common to both data sets:
    data fantastic;
      merge fake (in = tmp1) fraud (in = tmp2);
      by location type;
      if tmp1=1 and tmp2=1;
    proc print;

After selecting for observations in common:
[Output listing with columns Obs, location, type, length, thickness: observations for Albuquerque, Johannesburg, and Lexington]

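Merge mania
    /* Keep every observation from either data set */
    data fantastic;
      merge fake fraud;
      by location type;

    /* Keep every observation that appears in fake */
    data fantastic;
      merge fake (in=tmp1) fraud;
      by location type;
      if tmp1 = 1;

    /* Keep only observations that appear in both data sets */
    data fantastic;
      merge fake (in=tmp1) fraud (in=tmp2);
      by location type;
      if tmp1 = 1 and tmp2 = 1;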
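The same in= flags can also isolate the observations that fail to match, which is a handy consistency check. A minimal sketch, assuming fake and fraud are both sorted by location and type; the output data set name only_fake is made up for the example.

```sas
data only_fake;
  merge fake (in = tmp1) fraud (in = tmp2);
  by location type;
  /* Keep by-groups present in fake but absent from fraud */
  if tmp1 = 1 and tmp2 = 0;
run;

proc print data = only_fake;
run;
```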

Creating new variables

    data fantastic;
      merge fake faux;
      by location type;
      newcode = substr(location,1,1) || '-' || trim(left(type));
      growth_index = length + weight**2;
      if (growth_index gt 15) then large = '*';
      else large = ' ';
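In the step above, type is numeric, so left(type) makes SAS convert it to character automatically and write a note to the log. A sketch of the same idea with an explicit conversion, assuming type holds small whole numbers; the data set name coded and the 8. format width are arbitrary choices for the example.

```sas
data coded;
  set fake;
  /* PUT converts the numeric value to text explicitly, avoiding the conversion note */
  newcode = substr(location, 1, 1) || '-' || trim(left(put(type, 8.)));
run;

proc print data = coded;
run;
```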

Results of new variables
[Output listing with columns Obs, location, type, length, weight, newcode, growth_index, large: newcode begins with A, J, or L according to the location, and some Johannesburg and Lexington observations are flagged with * in the large column]

Common functions used to manipulate text strings
– compress
– index
– left
– scan
– substr
– trim
Refer to SAS Online Docs: Base SAS Software → SAS Language Reference: Dictionary → Dictionary of Language Elements → Functions and Call Routines
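A quick way to get a feel for these functions is to try them on one made-up string and write the results to the log. The string and variable names below are purely illustrative.

```sas
data _null_;
  s = '  Lexington, KY  ';       /* two leading and two trailing blanks */
  a = compress(s);               /* removes every blank: Lexington,KY */
  b = index(s, 'KY');            /* position where KY starts: 14 */
  c = left(s);                   /* moves leading blanks to the end */
  d = scan(s, 2, ',');           /* second comma-delimited piece (KY, with its surrounding blanks) */
  e = substr(left(s), 1, 3);     /* first three characters after left-aligning: Lex */
  f = trim(left(s)) || '!';      /* strips leading and trailing blanks before concatenating */
  put a= b= c= d= e= f=;
run;
```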

by-group processing

Suppose you wanted a subset of the data that contained the observation with the smallest length from each location.
    proc sort data = fake;
      by location length;
    data shortest;
      set fake;
      by location length;
      first = first.location;
      last = last.location;

Output from by-group processing
[Output listing with columns Obs, location, type, length, first, last: observations ordered by location (Albuquerque, Johannesburg, Lexington)]

by-group processing
    proc sort data = fake;
      by location length;
    data shortest;
      set fake;
      by location length;
      if first.location = 1;
[Output listing with columns Obs, location, type, length: one observation each for Albuquerque, Johannesburg, and Lexington]
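The last. flag works the same way, and a sum statement can count observations within each group along the way. A minimal sketch, assuming the same fake data set; the names longest and n_obs are made up for the example.

```sas
proc sort data = fake;
  by location length;
run;

data longest;
  set fake;
  by location;
  /* Reset the counter at the start of each location */
  if first.location then n_obs = 0;
  /* Sum statement: n_obs is retained across observations */
  n_obs + 1;
  /* Keep only the last observation per location, which has the largest length */
  if last.location;
run;

proc print data = longest;
run;
```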

Basic macros

SAS macros allow you to easily program repetitive tasks. On the surface, creating a SAS macro is very similar to creating a function in R or S-Plus. SAS Macro is actually a text generation tool.

Macro example
    /* Name the macro, begin the macro definition, identify the macro variables */
    %macro analyze(dataset=,response=);
      /* Code to be generated by the macro */
      proc mixed data = &dataset;
        class location type;
        model &response = location type;
        lsmeans location type;
        ods output lsmeans=model_means;
      data model_means;
        set model_means;
        variable = "&response";
      /* Note the use of proc append */
      proc append base=results data=model_means;
    /* End the macro definition */
    %mend analyze;
    /* Call the macro */
    %analyze(dataset=fake,response=length)
    %analyze(dataset=faux,response=weight)
    /* Print the results */
    proc print data = results;
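To see the "text generation" idea in a smaller setting, the sketch below loops over a list of data set names and generates one proc print per name. The macro name printall and its list= parameter are hypothetical; with the mprint option on, the generated code appears in the log.

```sas
%macro printall(list=);
  %local i ds;
  %let i = 1;
  %let ds = %scan(&list, &i);
  /* Loop until %scan runs out of names and returns an empty string */
  %do %while (&ds ne );
    proc print data = &ds (obs=5);
    run;
    %let i = %eval(&i + 1);
    %let ds = %scan(&list, &i);
  %end;
%mend printall;

%printall(list=fake faux fraud)
```

The macro itself does no data processing; it only writes three proc print steps that SAS then runs as if they had been typed by hand.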

Results from macro code
[Output listing with columns Obs, Effect, location, type, Estimate, StdErr, DF, tValue, Probt, variable: observations 1 to 5 are the location and type least-squares means for length, observations 6 to 10 are the same for weight, and Probt is below .0001 in every row]

proc transpose

Rearranging data with proc transpose
Consider this output from proc mixed:
[Output listing with columns Obs, year, cultivar, Effect, trt, Estimate, StdErr, DF, tValue, Probt: the effects Intercept, trt, calevel, and calevel*trt repeat for each year and cultivar combination]

Results from proc transpose
    proc transpose data=solution out=tsolution;
      by year cultivar;
      var estimate;
      id effect;
[Output listing with columns Obs, year, cultivar, _NAME_, Intercept, trt, calevel, calevel_trt: one row per year and cultivar combination, with _NAME_ equal to Estimate]

Parting words of advice

Attributes of SAS: SAS is read/write intensive. Every time you create a data set, the data set is written to the disk. Where does it get written? To the SAS Work Library, which is assigned to a directory somewhere. Use proc contents to find out. For CALS HP users and PC users, the SAS Work Library resides on the actual machine.
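Two quick ways to see where the Work Library lives are sketched below. Both are standard Base SAS features, but the exact layout of the proc contents directory listing varies by platform and version.

```sas
/* Write the physical path of the Work Library to the log */
%put Work library path: %sysfunc(pathname(work));

/* List the members of the Work Library; NODS suppresses the per-data-set details */
proc contents data = work._all_ nods;
run;
```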

Attributes of SAS: Users of the AFS system beware! (Stat department, CS department) The SAS Work Library is assigned to your account in AFS, not to the local machine that is running SAS.
[Diagram: your local computer running SAS ↔ network traffic ↔ AFS recording and reading your SAS data sets]

Assigning the SAS Work Library
To assign the SAS work library to a local directory (when running in batch mode on a Linux or Unix system):
    $ sas mysascode.sas -work /scratch

Synthesis Most of what we’ve covered today involves the data step. Many of the techniques shown in this presentation can be applied together in a single data step. Now that you know the names of some of the tools, use the Online Documentation!
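As one illustration of combining techniques, the sketch below uses rename=, in=, a by-variable merge, a subsetting if, a new variable, and keep= in a single data step. It assumes fake and faux are already sorted by location and type and that faux carries the weight variable; the names combined, in_fake, and in_faux are made up for the example.

```sas
data combined (keep = location trt length weight growth_index);
  /* Rename on input so the rest of the step can use trt */
  merge fake (in = in_fake rename = (type = trt))
        faux (in = in_faux rename = (type = trt));
  by location trt;
  /* Keep only by-groups present in both data sets */
  if in_fake and in_faux;
  /* Create a new variable from the merged values */
  growth_index = length + weight**2;
run;

proc print data = combined;
run;
```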