Lesson 2 Topic - Reading raw data into SAS

Programs 1 and 2 in course notes
Chapter 2 (Little SAS Book)

Welcome to lesson 2. In this lesson we will look at how to read your data into SAS. We will look at some of the most common ways that raw data may be stored and the SAS code required to read in the data. This is illustrated in programs 1 and 2. Most of these methods are covered in Chapter 2 of the Little SAS Book (LSB), which is devoted entirely to this topic.

Diagram: Raw Data -> Data Step [Read in Data, Process Data (Create new variables), Output Data (Create SAS Dataset)] -> PROCs [Analyze Data Using Statistical Procedures]

This diagram, which we have seen before, illustrates the processes involved in your SAS program. The first step is to get your raw data into SAS. This is usually done in the portion of the program called the DATA step. As stated before, reading your data into SAS is sometimes your most difficult task. This is because raw data can come in so many different formats. Once you get the data into SAS and create a SAS dataset, the analysis portion of the program using procedures is usually straightforward. Before we look at SAS code for reading in data, let's look at some of the ways your raw data may be stored.

Raw Data Sources
You type it in the SAS program
Text file
Spreadsheet (Excel)
Database (Access, Oracle)
SAS dataset

First, by raw data I mean the data you have to work with and want to read into SAS. It may be the data that was first entered into the computer or it may be a database dump of some sort. Your raw data can come from several sources. In very simple cases we will see that you can type your data right into your SAS program, so that you are entering the data for the first time. In most cases your data will be stored in an external file, either as a text file, a spreadsheet such as Excel, or in a database such as Access or Oracle. In a few cases the raw data may already be a SAS dataset. That will make your life simple, but it is often not the case. We will see how to read and process SAS datasets later in the course.

Data in Text Files
Delimited data: variables are separated by a special character (e.g. a comma)
Fixed position: data is organized into columns

Text files are simple character files that you can create or view in a text editor like Notepad. They can also be created as "dumps" from spreadsheet files like Excel. Data in text files are stored in one of two ways. In the first way, called delimited data, variables are separated by a character such as a comma. This is the most common. The other format is fixed position data, where data is organized into columns. In this format, each variable starts and ends at fixed positions, the same positions for each row of data. This format is less common.

Data delimited with spaces:

C 84 138 93 143
D 89 150 91 140
A 78 116 100 162
A . . 86 155
C 81 145 86 140

Note: Missing data is identified with a period.

Here is an example of delimited data separated by spaces. It is quite easy to see that there are 5 variables. Note that the positions of each variable are not always the same between rows. That is OK because we will be telling SAS to look for spaces to separate the variables. Note also that there is missing data for the second and third variables in row four, represented by a period. The period is a placeholder for these 2 variables and tells SAS that the values are missing. Data delimited by spaces is not common unless the user is typing in the data themselves.

Data delimited with commas

C,84,138,93,143
D,89,150,91,140
A,78,116,100,162
A,.,.,86,155
C,81,145,86,140

Note: Missing data is identified with a period.

This is the same data separated by commas. This is a common format for data. We will need to tell SAS to look for commas to separate our variables.

Data delimited by commas (.csv file)

C,84,138,93,143
D,89,150,91,140
A,78,116,100,162
A,,,86,155
C,81,145,86,140

Note: Missing data is identified by multiple commas.

This is a similarly formatted structure, except that consecutive commas are used to indicate missing data. This is called a CSV file, which stands for Comma Separated Values. We will see how to read this data into SAS in this lecture.

Column Data

C084138093143
D089150091140
A078116100162
A      086155
C081145086140

Note: Missing data values are blank.

This is an example of column position data. Here the data is "squeezed" together. There is no delimiter between the variables. You will need a key to tell you where each variable starts and ends. We sometimes call that a data dictionary. Note that by just looking at the data you would not even know that there are five variables. Note also that with column data, missing data is usually a blank for each position in the variable.

INFILE and INPUT Statements
When you write a SAS program to read in raw data, you'll use two key statements:
The INFILE statement tells SAS where to find the data and how it is organized.
The INPUT statement tells SAS which variables to read in.

There are two statements you will use to read in your data. The first is the INFILE statement, which tells SAS where to find the data and how it is organized. This is followed by the INPUT statement, which tells SAS which variables are read in. As part of this statement you will assign names to the variables, tell SAS whether each variable is character or numeric, and in some cases tell SAS how to read or decode the variable brought in. These two statements will usually follow immediately after your DATA statement, which names the SAS dataset you are creating and starts the DATA step. So when starting your program think "DATA, INFILE, INPUT". These will often be the first three statements of your program. The exact syntax for the INFILE and INPUT statements will vary depending on where and how your data is stored. Let's look at some examples.

Program 1

* List Directed Input: Reading data values separated by spaces;
DATA bp;
INFILE DATALINES;
INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
DATALINES;
C 84 138 93 143
D 89 150 91 140
A 78 116 100 162
A . . 86 155
C 81 145 86 140
;
RUN;
TITLE 'Data Separated by Spaces';
PROC PRINT DATA=bp;
RUN;

Obs  clinic  dbp6  sbp6  dbpbl  sbpbl
  1    C       84   138     93    143
  2    D       89   150     91    140
  3    A       78   116    100    162
  4    A        .     .     86    155
  5    C       81   145     86    140

Program 1 gives examples of how to read in data in different formats. This program consists of several DATA steps that read in and display the data. The data in each case is the same: a clinic code and 4 blood pressure variables on 5 persons. What changes is the format of the data. To show the differences in the formats I have included the data within the program. In each case there is missing data for the 4th observation. The first line of code, "DATA bp;", will be the same each time, creating a dataset called bp. Only the INFILE and INPUT statements will change.

For data separated by a delimiter we will use what is called list directed input. This slide shows how to read in data separated by spaces. The INFILE statement has only the keyword DATALINES, which tells SAS to expect the data to be contained within the program. The INPUT statement lists the names of the variables (that is why it is called list input) in the order the data is stored, from left to right. Here the variable clinic is first, followed by the four BP variables. Since clinic is a character variable we add a $ after the variable name to tell SAS to make clinic a character variable. Note that we could name the variables anything we want, within the rules for variable names. So the INPUT statement is used to assign names to each variable. We follow next with a DATALINES statement. This tells SAS the data is coming and not to treat what follows as statements. The data then follows, with each variable separated by a space.

To tell SAS the data has ended we include a line with only a semicolon. The RUN statement ends the DATA step. We can now run procedures; here we simply run PROC PRINT to display all the variables, so we can see if they were read in properly. The SAS output is listed at the bottom. We see that we have 5 observations and that the data was read in properly. The missing data is displayed as a period, usually pronounced "dot".

PARTIAL SAS LOG

1 DATA bp;
2 INFILE DATALINES;
3 INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
4 DATALINES;

NOTE: The data set WORK.BP has 5 observations and 5 variables.
NOTE: DATA statement used:
      real time 0.39 seconds
      cpu time 0.03 seconds

If we looked at the log we would get the following notes. The first note tells us that the dataset WORK.BP has 5 observations and 5 variables. This is what was expected. The keyword WORK indicates that SAS created a temporary or work dataset.

* List Directed Input: Reading data values separated by commas;
DATA bp;
INFILE DATALINES DLM = ',' ;
INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
DATALINES;
C,84,138,93,143
D,89,150,91,140
A,78,116,100,162
A,.,.,86,155
C,81,145,86,140
;
RUN;
TITLE 'Data separated by a comma';
PROC PRINT DATA=bp;
RUN;

This next DATA step shows how to read in data separated by commas. The only change is to add the DLM (which stands for delimiter) option to the INFILE statement. We specify the delimiter in quotes, here a comma, after an equals sign. A comma is what delimits or separates our variables. We did not need to specify the DLM option when the data was separated by spaces. That is because the default delimiter is a space.

What would happen if you left the DLM option off the statement? Take a second and think about how SAS would process the data. It would be looking for spaces to separate the variables, which it would not find until the end of the line. Thus the whole line would be treated as the value for clinic (of which only the first 8 characters are kept, the default length for a character variable read this way), and there would be no data left on that line for the BP variables. SAS would go to the next line to look for them, find text where it expects numbers, and write invalid-data notes to the log. The data would not be read in properly. Try it if you like.

* List Directed Input: Reading .csv files;
DATA bp;
INFILE DATALINES DLM = ',' DSD ;
INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
DATALINES;
C,84,138,93,143
D,89,150,91,140
A,78,116,100,162
A,,,86,155
C,81,145,86,140
;
TITLE 'Reading in Data using the DSD Option';
PROC PRINT DATA=bp;
RUN;

Note: Consecutive commas indicate missing data.

This example shows how to read in .csv files. What are CSV files? They are comma delimited data with the added feature that missing data is represented by consecutive commas. Note that in the 4th row of data there are three consecutive commas, indicating missing data for variables dbp6 and sbp6. The only change from the previous example is to add the DSD option (DSD stands for delimiter sensitive data). This option tells SAS to treat consecutive delimiters as missing data. The DSD option also changes the default delimiter to a comma, so you could remove the DLM option here if you like. Without the DSD option SAS would treat the multiple commas in the fourth row as a single delimiter, which would cause SAS to incorrectly assign the values 86 and 155 to the variables dbp6 and sbp6. There would then be no data left on that row for the last 2 BP variables, which would cause a problem. CSV files are a common way to dump Excel spreadsheets. You can create one by going to the SAVE AS pulldown within Excel and choosing .csv as the format. CSV stands for Comma Separated Values.
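
The DSD option does one more thing that matters for Excel dumps: a value wrapped in double quotes is read as a single value even if it contains a comma, and the quotes are dropped. The following is only a small sketch under assumptions. The dataset name bpdoc, the variable doctor, and the data rows are made up for illustration, and the colon modifier with $20. is used so that names longer than 8 characters are not cut off.

```sas
/* Hypothetical data: the doctor name contains a comma and is quoted */
DATA bpdoc;
  INFILE DATALINES DSD;
  INPUT clinic $ doctor : $20. dbp6 sbp6;
  DATALINES;
C,"Smith, Jane",84,138
A,"Lee, Sam",,116
;
RUN;

TITLE 'DSD with quoted values';
PROC PRINT DATA=bpdoc;
RUN;
```

In the second row the two consecutive commas should leave dbp6 missing, just as in the example above.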

* List Directed Input: Reading data values separated by tabs (.txt files);
DATA bp;
INFILE DATALINES DLM = '09'x DSD;
INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
DATALINES;
C	84	138	93	143
D	89	150	91	140
A	78	116	100	162
A			86	155
C	81	145	86	140
;
TITLE 'Reading in Data separated by a tab';
PROC PRINT DATA=bp;
RUN;

The last example of list input we will cover is when the data is separated by tabs. Since a tab is a special character we can't type it in the DLM option but have to give the hexadecimal value for a tab. This value is 09, which is placed in quotes followed by the x character, which tells SAS to interpret the value in quotes as a hexadecimal value. The data within the program here is tab delimited. It looks like multiple spaces, but there is a tab between each variable. You can save an Excel file as a tab delimited file by choosing .txt as the file type under the SAVE AS pulldown. If someone sends you a data file with a .txt extension it may well be a tab delimited file. You will want to keep this program in your how-to-do list of programs; it may well come in handy.

* Column Input: Data in fixed columns;
DATA bp;
INFILE DATALINES ;
INPUT clinic $ 1-1 dbp6 2-4 sbp6 5-7 dbpbl 8-10 sbpbl 11-13 ;
DATALINES;
C084138093143
D089150091140
A078116100162
A      086155
C081145086140
;
TITLE 'Reading in Data using Column Input';
PROC PRINT DATA=bp;
RUN;

Note: missing data is blank.

We have to use a different method to read in data when our data is in fixed columns. We will use what is called column input. We can use this method if each variable is in the same columns in every row. For this type of data the DLM option is not appropriate since there is no delimiter separating the data. So our INFILE statement has only the DATALINES option. The INPUT statement takes on a different form. We start with the variable name, followed by a $ if the variable is character (as for clinic), and then followed by the beginning and ending locations for the variable. In this example the variable dbp6 is located in positions 2 through 4 and variable sbp6 is located in positions 5 through 7. For clinic, which takes up one position, you could use 1-1 as we do here or just 1 without a second number. To know the beginning and ending positions you will usually need a data dictionary for the data file, given to you by the person who created the raw file. For column input, missing data can be represented by blanks in the appropriate columns.
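
Because each variable is tied to its column positions, column input does not require you to read every variable or to read them from left to right. As a small sketch (the dataset name bpsub is made up for illustration), this step picks up only the last BP variable and the clinic code from the same data:

```sas
/* Read only two of the five fields, in a different order */
DATA bpsub;
  INFILE DATALINES;
  INPUT sbpbl 11-13 clinic $ 1;
  DATALINES;
C084138093143
D089150091140
A078116100162
A      086155
C081145086140
;
RUN;

TITLE 'Column input: selected columns only';
PROC PRINT DATA=bpsub;
RUN;
```

The data lines must start in column 1, since any indentation would shift every variable out of its expected position.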

* Reading data using Pointers and Informats;
DATA bp;
INFILE DATALINES ;
INPUT @1 clinic $1. @2 dbp6 3. @5 sbp6 3. @8 dbpbl 3. @11 sbpbl 3. ;
DATALINES;
C084138093143
D089150091140
A078116100162
A      086155
C081145086140
;
TITLE 'Reading in Data using Pointers/Informats';
PROC PRINT DATA=bp;
RUN;

Note: The informats used here must end with a period.

An alternative to using column input for data in fixed positions is to use what are called pointers and informats. With this method we give the starting position (using the @ character), followed by the variable name, followed by what is called an informat. Informats tell SAS how to bring in or decode the variable. For numeric data the informat is given by the width of the variable followed by a period. For character data the informat is given by a $, followed by the length of the variable, followed by a period. The last variable read in is sbpbl, for which we tell SAS to go to position 11 and read the next 3 characters.

* Reading data using Informat Lists;
DATA quallife;
INFILE DATALINES ;
INPUT (QL1-QL35) (1.) ;
DATALINES;
31232242414444223544545354455342324
21112353214352552525522662566553533
21122252241333356262662366266551525
;
TITLE 'Reading in Data using Informat Lists';
PROC PRINT DATA=quallife;
VAR QL1-QL35;
RUN;

Obs  QL1 through QL35 (one digit per variable)
  1  3 1 2 3 2 2 4 2 4 1 4 4 4 4 2 2 3 5 4 4 5 4 5 3 5 4 4 5 5 3 4 2 3 2 4
  2  2 1 1 1 2 3 5 3 2 1 4 3 5 2 5 5 2 5 2 5 5 2 2 6 6 2 5 6 6 5 5 3 5 3 3
  3  2 1 1 2 2 2 5 2 2 4 1 3 3 3 3 5 6 2 6 2 6 6 2 3 6 6 2 6 6 5 5 1 5 2 5

If you have a series of similar variables, such as items from a questionnaire, it is sometimes convenient to give them names that have a common root with a numeric suffix. Here we read in 35 variables, naming them QL1 through QL35. When SAS sees this dash notation it assumes you mean "through", i.e. the variables are QL1, QL2, up to QL35. This can save you a lot of typing. Note the parentheses on the INPUT statement around both the variable list and the informat. This tells SAS to apply the informat to each of the variables in the list. This shorthand notation can be used elsewhere in your program, as is done here in specifying the variables to display in the VAR statement for PROC PRINT. If you wanted to display just the first 5 variables you would specify QL1-QL5.
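
The same parenthesis notation also works for a list of variables whose names do not share a numbered root. As a sketch, the fixed-column blood pressure data from the earlier examples could be read by giving a pointer and informat for clinic and then applying one 3. informat to the four BP variables, which sit side by side in 3-column fields:

```sas
/* One informat applied to a list of named variables */
DATA bp;
  INFILE DATALINES;
  INPUT @1 clinic $1. (dbp6 sbp6 dbpbl sbpbl) (3.);
  DATALINES;
C084138093143
D089150091140
A078116100162
A      086155
C081145086140
;
RUN;

TITLE 'Informat list with named variables';
PROC PRINT DATA=bp;
RUN;
```

This relies on the fields being adjacent: after clinic is read from column 1, each BP variable takes the next 3 columns in turn.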

Program 2

* Reading data from an external file;
DATA bp;
INFILE 'C:\SAS_Files\bp.csv' DSD FIRSTOBS = 2;
INPUT clinic $ dbp6 sbp6 dbpbl sbpbl ;
TITLE 'Reading in Data from an External File';
PROC PRINT DATA=bp;
RUN;

Content of bp.csv:

clinic,dbp6,sbp6,dbpbl,sbpbl
C,84,138,93,143
D,89,150,91,140
A,78,116,100,162
A,,,86,155
C,81,145,86,140

In the examples in program 1 the data was contained within the program. Usually, however, your data will be stored in an external file. To tell SAS to read from an external file you replace DATALINES on the INFILE statement with the file path of the file containing the data. The entire file path is placed in quotes (either single or double quotes, but do not mix quote types). Be careful to type the file path correctly, with no extra blanks anywhere within the quotes. Other INFILE options apply as before. The file path must come first on the INFILE statement; the other options can be in any order. Here we use list input to read the data contained in the file bp.csv, the contents of which are displayed here. The first row of the file holds column headings, which we would get from an Excel dump. We do not want to read that row as data, so we can either go into the file and delete the first line or (perhaps better) tell SAS to skip the first row by using the FIRSTOBS option. Here we tell SAS to start with row 2. We use the DSD option as before.

PARTIAL SAS LOG

7 DATA bp;
8 INFILE 'C:\SAS_Files\bp.csv' DSD FIRSTOBS=2 ;
9 INPUT clinic $ dbp6 sbp6 dbpbl sbpbl ;

NOTE: The infile 'C:\SAS_Files\bp.csv' is:
      File Name=C:\SAS_Files\bp.csv,
      RECFM=V,LRECL=256
NOTE: 5 records were read from the infile 'C:\SAS_Files\bp.csv'.
      The minimum record length was 10.
      The maximum record length was 16.
NOTE: The data set WORK.BP has 5 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time 0.10 seconds
      cpu time 0.01 seconds

Running the program would generate the following log. The first note tells us SAS found the external file bp.csv where we told SAS it would be. The second note tells us that 5 records were read from this file. The third note tells us that the SAS dataset created (dataset BP) has 5 observations and 5 variables. This is what we expected.

* Reading data from an external file using a FILENAME statement;
FILENAME bpdata 'C:\SAS_Files\bp.csv';
DATA bp;
INFILE bpdata DSD FIRSTOBS = 2;
INPUT clinic $ dbp6 sbp6 dbpbl sbpbl ;
TITLE 'Reading in Data Using FILENAME';
PROC PRINT DATA=bp;
RUN;

An alternative to including the file path as part of the INFILE statement is to define a file reference using a FILENAME statement. Here we assign the reference bpdata to the entire file path. Then on the INFILE statement we include the file reference (not in quotes). This is a pointer to the data file. Every time SAS sees the reference bpdata it knows you mean the bp.csv file in the SAS_Files directory. FILENAME statements are usually placed at the top of your program, outside of the DATA step. The advantage of using FILENAME statements is somewhat subtle: it keeps the DATA step free of any direct file references. Suppose the data file was moved to another location. You could simply change the path to the file on the FILENAME statement. You would not need to look for the INFILE statement, which in more complicated programs may be buried deep within the program.

* Using PROC IMPORT to read in data ;
* Can skip data step;
* Can also try IMPORT Wizard;
PROC IMPORT DATAFILE='C:\SAS_Files\bp.csv'
OUT = bp
DBMS = csv
REPLACE ;
GETNAMES = yes;
GUESSINGROWS=9999;
RUN;
TITLE 'Reading in Data Using PROC IMPORT';
PROC PRINT DATA=bp;
PROC CONTENTS DATA=bp;
RUN;

Note: GETNAMES = yes uses the first row for variable names.

SAS has a utility procedure that can also be used to read in your data. The procedure is called PROC IMPORT. It will read certain types of raw data files and create SAS datasets from them. Here is an example where the raw data is a CSV file, the same file we just read in using a DATA step. The DATAFILE option gives the path and file name of the raw data file, in OUT you give the name of the SAS dataset you want to create, and the database management system option (DBMS) is set to csv. The REPLACE option tells SAS to write over the SAS dataset if it exists, and GETNAMES, if set to YES, tells SAS to use the first row of the CSV file for the names of the variables. The DBMS keyword can be omitted if the file extension of the CSV file is .csv. You would want to display the data with PROC PRINT and run PROC CONTENTS to help you know if the data was brought in correctly. PROC CONTENTS will also give you the variable names.

Although this is a nice utility because it eliminates the DATA step and all the coding involved in that, caution is needed in using this procedure. SAS has to decide whether each column of data is character or numeric by reading the data, rather than you explicitly telling SAS in the INPUT statement. It will also sometimes make character variables much longer than they need to be. There is an option called GUESSINGROWS in PROC IMPORT that may help (see page 61 of LSB). If you set this to a number then SAS will read that many rows to determine the type of data. Set this to a large number to read the entire dataset. This will, however, increase the computer time needed to bring the data in.

One note: PROC IMPORT actually generates and runs a DATA step, with INFILE and INPUT statements. You can see the generated code in the log. You can take that code, stripped of the log notes, modify it as needed, and use that as your program. Thus, you are using PROC IMPORT to write the DATA step for you. Lastly, SAS has an import wizard that leads you through a series of steps in bringing in the data. There is an option to save the syntax code. This method is more for the novice than for a programmer like the one you are becoming.

* PC SAS can read Excel files directly;
PROC IMPORT DATAFILE='C:\SAS_Files\bp.xls'
OUT = bp
DBMS = xls
REPLACE ;
GETNAMES = yes;
RUN;
TITLE 'Reading in Data from Excel';
PROC PRINT DATA=bp;
PROC CONTENTS;
RUN;

Note: GETNAMES = yes uses the first row for variable names.

Even nicer is that with PC SAS you can create a SAS dataset directly from an Excel file, without first dumping it into a CSV file. Just change the name of the file in the DATAFILE keyword. You can even skip the DBMS option if the extension of the Excel file is .xls. There is a SHEET option as well, in case you have multiple worksheets in your file (see page 63 of LSB). The same cautions apply as when using PROC IMPORT to import a CSV file.
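
As a sketch of the SHEET option mentioned above: the worksheet name Visit2 and the output dataset name bp2 are made up for illustration, so substitute the name shown on the tab in your own workbook.

```sas
/* Import one named worksheet from a workbook with several sheets */
PROC IMPORT DATAFILE='C:\SAS_Files\bp.xls'
            OUT=bp2
            DBMS=xls
            REPLACE;
  SHEET='Visit2';      /* hypothetical worksheet name */
  GETNAMES=yes;
RUN;

PROC CONTENTS DATA=bp2;
RUN;
```

Running PROC CONTENTS afterwards is a quick way to confirm that the sheet you intended was the one brought in.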

The CONTENTS Procedure

Data Set Name   WORK.BP    Observations   5
Member Type     DATA       Variables      5

Alphabetic List of Variables and Attributes

#   Variable   Type   Len   Format    Informat
1   Clinic     Char     1   $1.       $1.
2   DBP6       Num      8   BEST12.   BEST32.
4   DBPBL      Num      8   BEST12.   BEST32.
3   SBP6       Num      8   BEST12.   BEST32.
5   SBPBL      Num      8   BEST12.   BEST32.

Here is the output from PROC CONTENTS on the dataset created with PROC IMPORT. We see we have 5 variables: the character variable clinic, which has length 1, and four BP variables, which are numeric. We will look more closely at the output from PROC CONTENTS when we discuss permanent SAS datasets. For now just know that PROC CONTENTS lists all the variable names on your dataset and tells you whether they are character or numeric.

SOME INFILE OPTIONS
OBS= limits the number of observations read
FIRSTOBS= start reading from this observation
MISSOVER and TRUNCOVER used to read in data with short records
TERMSTR= used when reading PC files on a UNIX machine (or vice versa)
LRECL= needed when you have data with long records (> 256 characters)

Here are a few additional options for the INFILE statement you may need to use. We have already seen the FIRSTOBS option. The OBS option limits the number of observations read in; the number you give is the last record of the file that SAS will read. This is useful for testing programs if you have a large dataset. The MISSOVER and TRUNCOVER options are sometimes needed when you have short records. We will see an example later in the course on using the MISSOVER option. The option TERMSTR is needed when you are reading a text file created in a UNIX environment and running SAS on a PC. UNIX and PC operating systems use different markers to indicate the end of a line. PCs use two characters to end the line, a CR and a LF, whereas UNIX systems use just a LF. Set TERMSTR=LF if you know the file was created on a UNIX system. If this is not set, the last variable in a row of data may not be read in correctly. The LRECL option is needed when your data takes up more than 256 characters on a row (256 is the default in the SAS release used for these notes; SAS 9.4 raised the default to 32767). Without changing this option SAS will not see data that is past that position. We will see an example of this next.
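
To make these options concrete, here is a short sketch built on the bp.csv file from program 2. The dataset names bptest and bpunix and the file bp_unix.csv are made up for illustration. Note that OBS= counts records in the file, not observations kept, so FIRSTOBS=2 with OBS=4 reads records 2 through 4, which is three rows of data.

```sas
/* Test run: skip the header row and stop after record 4 of the file */
DATA bptest;
  INFILE 'C:\SAS_Files\bp.csv' DSD FIRSTOBS=2 OBS=4;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
RUN;

/* Hypothetical file created on a UNIX system and read by SAS on a PC. */
/* TRUNCOVER keeps SAS from moving to the next line on a short record. */
DATA bpunix;
  INFILE 'C:\SAS_Files\bp_unix.csv' DSD FIRSTOBS=2 TERMSTR=LF TRUNCOVER;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
RUN;
```

Once the test run looks right in the log and in PROC PRINT, remove OBS= to read the whole file.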

* Problem when reading past default logical record length;
DATA temp;
INFILE 'C:\SAS_Files\tomhs.data' OBS=6 ;
INPUT @260 jntpain 2. ;
TITLE 'Data not read in correctly because variable is past default LRECL of 256';
PROC PRINT;
RUN;

NOTE: Invalid data for jntpain in line 2
NOTE: SAS went to a new line when INPUT statement reached past the end of a line

Obs  jntpain
  1     .
  2     .
  3     .

Here is an example where we have data beyond position 256. The variable jntpain is located in position 260. By default SAS will only bring in 256 characters. If you run the program you will get a note that the data for jntpain is invalid, and a PROC PRINT will show that the values for the variable were all set to missing. This is because SAS never went out to position 260 to get the data. Instead it went to the next line each time, which is why 6 records produced only 3 observations.

* Add LRECL option to fix problem ;
DATA temp;
INFILE 'C:\…\tomhs.data' OBS=6 LRECL=500;
INPUT @260 jntpain 2. ;
TITLE 'Data read in correctly using LRECL option';
PROC PRINT;
RUN;

Obs  jntpain
  1     1
  2     1
  3     1
  4     1
  5     1
  6     2

To fix this, add the LRECL option and set the value to something at least as large as the last position you read in. Here we set LRECL to 500 and the data is then read in correctly. A value of 261 would have worked as well.

Reading Special Data

Data          Description            Informat
04/11/1982    Date                   mmddyy10.
59,365        Comma in number        comma6.
086-59-9054   Long (>8) characters   $11.

Sometimes the data you are reading in is formatted in special ways. The most common example is dates. Another example is commas embedded within a number. To read this type of data you will have to tell SAS about the formatting. Also, character variables longer than 8 characters require special attention. We can read this type of data using special informats. We have seen some common informats for reading numeric data. This slide lists informats for reading in dates, numbers with embedded commas, and long character variables.

* Reading special data with fixed position data;
DATA info;
  INFILE DATALINES;
  INPUT @1 ssn $11. @13 taxdate mmddyy10. @25 income comma6. ;
  DATALINES;
086-59-9054 04/12/2001  59,365
405-65-0987 03/15/2002  26,925
212-44-9054 04/15/2003  44,999
;
RUN;
TITLE 'Variables with Special Formats';
PROC PRINT DATA=info;
  FORMAT taxdate mmddyy10.;
RUN;

Output:
Obs    ssn            taxdate       income
 1     086-59-9054    04/12/2001     59365
 2     405-65-0987    03/15/2002     26925
 3     212-44-9054    04/15/2003     44999

For data in fixed positions we just supply the special informat after the variable name, as seen in this example. The column pointers must match the data exactly: here income starts in column 25. The comma6. informat tells SAS to read a field 6 characters wide that may contain embedded commas. SAS skips over the commas when it reads the value. Later on we will devote some time to working with dates. For now, just know that you need a special informat to read them in.

* Reading special data with list input using colon modifier;
DATA info;
  INFILE DATALINES DLM=';' DSD;
  INPUT ssn : $11. taxdate : mmddyy10. income : comma6. ;
  DATALINES4;
086-59-9054;04/12/2001;59,365
405-65-0987;03/15/2002;26,925
212-44-9054;04/15/2003;44,999
;;;;
RUN;
TITLE 'Variables with Special Formats';
PROC PRINT DATA=info;
  FORMAT taxdate mmddyy10.;
RUN;

Output:
Obs    ssn            taxdate       income
 1     086-59-9054    04/12/2001     59365
 2     405-65-0987    03/15/2002     26925
 3     212-44-9054    04/15/2003     44999

In this example the data is separated by semicolons and we want to use special informats. To add an informat to a list input variable, put a colon after the variable name and then the informat. Because the in-stream data lines themselves contain semicolons, the data is introduced with DATALINES4 and closed with a line of four semicolons; a plain DATALINES block would end at the first line containing a semicolon.

* Using INFORMAT statement to supply input formats;
DATA info;
  INFILE DATALINES DLM=';' DSD;
  INFORMAT ssn $11. taxdate mmddyy10. income comma6.;
  INPUT ssn taxdate income ;
  DATALINES4;
086-59-9054;04/12/2001;59,365
405-65-0987;03/15/2002;26,925
212-44-9054;04/15/2003;44,999
;;;;
RUN;
TITLE 'Variables with Special Formats';
PROC PRINT DATA=info;
  FORMAT taxdate mmddyy10.;
RUN;

Output:
Obs    ssn            taxdate       income
 1     086-59-9054    04/12/2001     59365
 2     405-65-0987    03/15/2002     26925
 3     212-44-9054    04/15/2003     44999

SAS has an INFORMAT statement where you can give input formats to all your variables in one place. This makes your INPUT statement much cleaner. When SAS reads a variable it uses the informat assigned to it in the INFORMAT statement. Place the INFORMAT statement before the INPUT statement. Since the data lines contain semicolons, the in-stream data is introduced with DATALINES4 and closed with a line of four semicolons.

Summary of Ways of Reading in Data

List input - data is separated by a delimiter; must read in all variables
Column input - data is in fixed columns; must know where each variable starts and ends; can read in selected variables
Pointers and Informats - alternative to column input; most flexible; must be used for special data
PROC IMPORT

Here is a summary of the different ways of reading in data. How you read in your data is dictated by the format of the data.

With list input the data must be separated by a delimiter. A side effect of this method is that you have to read the variables in order; you cannot skip over a variable to reach a later one.

With column input the data must be in fixed positions across all rows. You must know the positions where each variable begins and ends. With column input you can read in selected variables.

Pointers and informats are an alternative to column input. This is the most flexible method, and it is the one to use for special data such as dates and numbers with embedded commas.

You can also use PROC IMPORT for delimited data, which skips the DATA step entirely. As mentioned, this method can sometimes cause problems because SAS has to make assumptions about the data by reading it.
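To make the comparison concrete, here is a minimal sketch that reads the same three records with each of the three DATA step styles. The dataset names and the variables id, age, and weight are invented for this illustration.

```sas
/* Sketch only: dataset and variable names are hypothetical. */

/* List input: values are separated by blanks and read in order. */
DATA list_style;
  INPUT id age weight;
  DATALINES;
101 45 180
102 52 165
103 38 210
;
RUN;

/* Column input: give start-end columns; age (columns 5-6) is skipped. */
DATA column_style;
  INPUT id 1-3 weight 8-10;
  DATALINES;
101 45 180
102 52 165
103 38 210
;
RUN;

/* Pointers and informats: @ moves to a column, the informat sets the width. */
DATA pointer_style;
  INPUT @1 id 3. @8 weight 3.;
  DATALINES;
101 45 180
102 52 165
103 38 210
;
RUN;
```

The list version has to read age in order to reach weight, while the column and pointer versions pick out only the columns they name.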