Chapter 17: Read Raw Data in Fixed Format Using Formatted Input
Objectives
- Distinguish between standard and nonstandard numeric data
- Read standard fixed-field data
- Read nonstandard fixed-field data
Review of Column Input
General syntax of column input:
    INPUT var <$> start_col - end_col ... ;
- var is the variable name.
- $ indicates a character variable.
- start_col and end_col specify the starting and ending columns from which the variable is read.
Ex.
    INPUT L_Name $ 1-15 F_name $ age Choles 30-35;
Important Features and Uses of Column Input
- It can read character values that contain embedded blanks.
- A field that is blank in the defined columns is read as missing (blank for character, '.' for numeric).
- Columns can be re-read. Ex. INPUT supplier $ 5-20 ItemNum amount 22-30;
- Fields can be read in any order, moving forward or backward across the record. Ex. INPUT F_name $ L_Name $ 1-15 age 16-18;
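The features listed above can be tried with a small in-stream data set. This is a minimal sketch: the data set name column_demo, the variable names, and the three records are made up for illustration and are not part of the course data.

```sas
/* Column input: embedded blanks, re-read columns, fields in any order. */
data column_demo;
  /* age is read first even though it sits to the right of name;      */
  /* column 1 is read twice (as part of name and again as initial).   */
  input age 14-15 name $ 1-12 initial $ 1;
  datalines;
Mary Ann Lee 34
Bob          27
Ann Li
;
run;

proc print data=column_demo;
run;
```

The third record has nothing in columns 14-15, so age should come out as a missing value (.) for that observation, while the name with embedded blanks on the first record should be kept whole.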
Raw Data That Cannot Be Read by Column Input
- Data values that are not standard numeric data, e.g., values containing a dollar sign ($), a comma, or a percent sign (%).
- Numeric values with implied decimal places, where the decimal point is not recorded in the data.
- Date, time, and datetime values recorded in a commonly used notation, such as 11/14/2010, rather than as numeric values. Such a date requires a special informat to be read as a date. With column input it must be read as character data.
- Data values that are not recorded in fixed columns for each variable, e.g., values separated by delimiters such as a blank, slash (/), comma, semicolon, or tab.
A Review of Standard vs. Nonstandard Numeric Data
Standard numeric data can contain only:
- Numbers
- Decimal places
- Numbers in scientific or E-notation (e.g., 4.2E3)
- Plus or minus signs
Nonstandard numeric data includes:
- Values that contain special characters, such as %, $, or a comma (,)
- Date and time values
- Data in fraction, integer binary, real binary, or hexadecimal form
Determine Whether Each of the Following Numeric Values Is Standard or Nonstandard
    Value         Answer
                  Standard
    $             Nonstandard
    3,456.12      Nonstandard
    20DEC2010     Nonstandard (date)
    12/20/2010    Nonstandard (date)
A Review of Fixed Format vs. Free Format
- Fixed format means each variable occupies the same range of columns in every observation.
- Free format means the data values are not in a fixed range of columns.
Ex. (fixed format, free format): HIGH F   HIGH F   LOW F   MEDIAN M
A Review of Basic Statements for Reading External Raw Data in a DATA Step
General form of the complete DATA step without a FILENAME statement:
    DATA SAS_data_set_name;
        INFILE 'input-raw-data-file';
        INPUT variable $ start - end ...;
    RUN;
General form of the complete DATA step with a FILENAME statement:
    FILENAME fileref 'input-raw-data-file';
    DATA SAS_data_set_name;
        INFILE fileref;
        INPUT variable $ start - end ...;
    RUN;
Example: Review of Reading External Raw Data Using Column Input
Read the external data salesdata.dat with a FILENAME statement:
    FILENAME sal_dat 'C:\math707\RawData\RawData_dat\salesdata.dat';
    DATA saleslib.sales_sasdata;
        INFILE sal_dat;
        INPUT last_name $ 1-7 sale_date $ 9-11 residential commercial 23-31;
    RUN;
Read the external data salesdata.dat without a FILENAME statement:
    DATA saleslib.sales_sasdata;
        INFILE 'C:\math707\RawData\RawData_dat\salesdata.dat';
        INPUT last_name $ 1-7 sale_date $ 9-11 residential commercial 23-31;
    RUN;
Reading External Raw Data Using Formatted Input
General syntax:
    INPUT pointer-control variable informat. ;
- Pointer-control: positions the column pointer before the variable is read.
- Variable: name of the variable to be created.
- Informat: the informat used to read the variable.
NOTE: Two pointer controls set the column position:
- @n : moves the pointer to column n. This is the absolute column of the data record.
- +n : moves the pointer forward n columns from the current position.
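The two pointer controls can be seen working together in a short in-stream example. This is only a sketch: the data set pointer_demo, its variables, and the two records are invented for illustration.

```sas
/* Formatted input with the @n and +n pointer controls. */
data pointer_demo;
  input @1 last_name $7.   /* columns 1-7                          */
        @9 day       2.    /* jump to column 9, read columns 9-10  */
        +1 amount    6.2;  /* skip column 11, read columns 12-17   */
  datalines;
SMITH   10 250.75
JOHNSON 22  98.50
;
run;

proc print data=pointer_demo;
run;
```

After day is read the pointer rests on column 11, so +1 moves it to column 12, where the six-column amount field begins. Because the data values already contain a decimal point, the .2 in the informat does not rescale them.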
Informat in the INPUT Statement
An informat is an instruction that SAS uses in the INPUT statement to READ data values, hence the name INformat. Many of the SAS formats discussed in Chapter 4 for DISPLAYING data values have a counterpart of the same name that can be used as an informat in fixed formatted input.
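One way to see that an informat is a reading instruction is the numeric w.d informat applied to values recorded without a decimal point. The sketch below uses invented records. The raw variable re-reads the same five columns as text so the input and the result can be compared side by side.

```sas
/* The w.d informat supplies an implied decimal point when none is recorded. */
data implied_decimal;
  input @1 raw   $5.   /* the field exactly as recorded            */
        @1 value 5.2;  /* same columns, read with two decimals     */
  datalines;
12345
12.34
  678
;
run;

proc print data=implied_decimal;
run;
```

For the records without a decimal point the value is divided by 10 to the power d, so 12345 should be read as 123.45 and 678 as 6.78. The record that already contains a decimal point should be read as 12.34, since a recorded decimal point takes precedence over d.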
Recall: SAS Format
A format is an instruction that SAS uses to write data values. SAS formats have the following form:
    <$>format<w>.<d>
- $ : indicates a character format
- format : format name
- w : total width (including decimal places and special characters)
- . : required delimiter
- d : number of decimal places
SAS Informats for Date and Time
Recall that a SAS date is stored as the number of days between 01JAN1960 and the specified date. If the date values in the external data set are recorded as, for example, 10/16/2001 or 16OCT2001, an informat is needed to read them properly:
    Date in external data set    Informat     Data value read
    10/16/2001                   MMDDYY10.    15264
    16OCT2001                    DATE9.       15264
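A short check of the two date informats, using invented in-stream records, might look like the following. The data set name date_demo and the variables d1, d2, and same are illustrative only.

```sas
/* Read the same calendar date in two notations. */
data date_demo;
  input @1  d1 mmddyy10.   /* columns 1-10,  e.g. 10/16/2001 */
        @12 d2 date9.;     /* columns 12-20, e.g. 16OCT2001  */
  same = (d1 = d2);        /* 1 when both informats give the same SAS date */
  datalines;
10/16/2001 16OCT2001
01/01/1960 01JAN1960
;
run;

/* Without a format the stored day counts are printed. */
proc print data=date_demo;
run;

/* With a format the same values are displayed as dates. */
proc print data=date_demo;
  format d1 d2 date9.;
run;
```

The first PROC PRINT shows the stored numbers (01JAN1960 is day 0), and the second shows the same values displayed through the DATE9. format.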
Some Commonly Used Informats
    PERCENTw.d    DATE9.       NENGOw.
    $BINARYw.     DATETIMEw.   PDw.d
    $VARYINGw.    HEXw.        PERCENTw.
    $w.           JULIANw.     TIMEw.
    COMMAw.d      MMDDYYw.     w.d
The Informat COMMAw.d
The COMMAw.d informat reads nonstandard numeric data and removes embedded blanks, commas, dashes, dollar signs, percent signs, and right parentheses. A left parenthesis at the beginning of the field is converted to a minus sign.
    Actual data value    Informat    Data value read
    12,345.67            COMMAw.d    12345.67
    $12,345.67           COMMAw.d    12345.67
    (12,345.67)          COMMAw.d    -12345.67
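The behavior described above can be tried on the three sample values. This is a sketch with an assumed width of 12 columns, chosen only because it is wide enough for the longest value. The data set name comma_demo and the variable names are illustrative.

```sas
/* COMMAw.d strips $ , ( ) and similar characters while reading. */
data comma_demo;
  input @1 raw    $12.       /* the field exactly as recorded        */
        @1 amount comma12.;  /* same columns, read as a number       */
  datalines;
12,345.67
$12,345.67
(12,345.67)
;
run;

proc print data=comma_demo;
run;
```

The value in parentheses should be stored as a negative number, and the other two as the same positive number.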
Exercise
Practice using informats in the INPUT statement. The data set aug99n.dat is posted on the class website. Three observations of the data set are shown below:
    AUG1999 R %
    AUG1999 C %
    AUG1999 T %
Write a SAS program to read these data, paying special attention to the use of informats to read nonstandard numeric values. Print the data set using proper display formats for the nonstandard variables.
    Field Name     Start Column    End Column    Maximum Width    Data Type
    ID             1               3             3                numeric
    Date           5               13            9                character
    Item           15              15            1                character
    Quantity       17              19            3                numeric
    Price          21              24            4                numeric
    Percentsale    26              29            7                numeric
Answer: There are many ways to accomplish the same goal. Here is an example.
    data orders;
        infile 'C:\math707\RawData\RawData_dat\aug99n.dat';
        /* Pointer positions and widths follow the field layout given in the exercise. */
        input @1  ID          3.
              @5  date        date9.
                  item      $ 15
              @17 quantity    3.
              @21 totalcost   4.
              @26 percentsale percent4.;
    run;
    proc print data=orders;
        format date MMDDYY8. percentsale percent6.;
    run;
Salesdata.dat
    SMITH   10JAN
    DAVIS   15JAN
    JOHNSON 20JAN
    SMITH   01FEB
    DAVIS   12FEB
    JOHNSON 22FEB
    SMITH   10MAR
    DAVIS   18MAR
    JOHNSON 26MAR
Read Salesdata.dat Using Formatted Input
    data work.sale;
        infile 'C:\math707\RawData\RawData_dat\Salesdata.dat';
        /* The column and width for month are assumptions based on the sample records (e.g. 10JAN starting in column 9). */
        input     last_name   $7.
              @9  month       $5.
              @19 residential 9.2
              +1  commercial  9.2;
    run;
    proc print data=work.sale;
    run;
NOTE:
- INFILE defines the location of the raw data file.
- $w. is the informat for character values.
- @n moves the pointer to column n.
- +n moves the pointer forward n columns from the current position.
- The pointer starts at column 1. After a variable is read, the pointer rests on the next column, which becomes the current position.
Ex: After last_name is read from 7 columns, the pointer is at column 8. Residential starts at column 19 and is read from 9 columns, that is, columns 19 to 27, so the pointer then rests on column 28. Hence, +1 moves the pointer one column forward, from column 28 to column 29, and commercial is read from the next 9 columns.
Example: Read the Following Data Using Formatted Input
The following are the scores on the quizzes, Test 1, Test 2, and the final exam for a class.
    Name Q1 Q2 Q3 Q4 Q5 T1 T2 Final
    CSA DB QC DC E F GC HD IM WB
Write a SAS program that reads the data, with the data included in the SAS program.
/* Program statements */
DATA scores;
    /* Use only one of the following INPUT statements. */
    /* Column input */
    INPUT Name $ 1-5 Q1 6-7 Q Q Q Q TEST TEST Final 33-36;
    /* Formatted input */
    INPUT NAME $5. Q1 Q2 Q3 Q4 Q5 TEST1 TEST2 FINAL 4.;
    DATALINES;
CSA DB QC DC E F GC HD IM WB
;
RUN;
Different Formatted Inputs to Read the Same Data
    /* Column input */
    INPUT Name $ 1-5 Q1 6-7 Q Q Q Q TEST TEST Final 33-36;
    /* Formatted input */
    INPUT NAME $5. Q1 Q2 Q3 Q4 Q5 TEST1 TEST2 FINAL 4.;
    INPUT NAME $5. Q Q Q Q4 Q5 TEST TEST2 FINAL 4.;
    /* Formatted input with a variable list and an informat list */
    INPUT NAME $5. (Q1-Q5 TEST1 TEST2) (2.) FINAL 4.;
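The grouped form, with a variable list followed by an informat list, can be sketched on a small made-up layout. The records, the three quiz variables, and the column positions below are assumptions for illustration and do not reproduce the class data.

```sas
/* Variable list with an informat list: (2. +1) is reused for q1, q2, q3. */
data scores_demo;
  input     name $5.          /* columns 1-5                               */
            (q1-q3) (2. +1)   /* two columns each, skipping one in between */
        @15 final 4.;         /* columns 15-18                             */
  datalines;
Alice10 09 08   95
Bob  07 10 09  100
;
run;

proc print data=scores_demo;
run;
```

An absolute pointer (@15) is used for final so that the example does not depend on where the pointer rests after the informat list has been applied to the last quiz variable.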
Exercise
Open the program c5_colInp and rewrite its column-input INPUT statement using formatted input.
Fixed Record Length vs. Variable Record Length
- When an external data set is read, the record length is the size of each record. Usually a record holds the variables of one observation.
  NOTE: One record can also hold multiple observations. This will be discussed later.
- The size of each record is usually FIXED, that is, every record has the same length. However, this is not always the case, and record lengths may differ.
- When record lengths differ, formatted input may not read the data values correctly, because formatted input looks for the number of columns specified for each variable.
- When a record is too short, the pointer may move on to the next record in order to read the specified number of columns for a variable (usually the last one in the INPUT statement). The data are then read incorrectly.
Formatted Input When Reading Records with Variable Record Lengths: Using the PAD Option in the INFILE Statement
One way to fix the problem is to add blanks to the records that are too short, so that the record length becomes FIXED. The other way is to tell SAS to PAD the short records with blanks.
Suppose the record length of Salesdata.dat is not fixed.
Example:
    data work.sale;
        infile 'C:\math707\RawData\RawData_dat\Salesdata.dat' pad;
        /* The column and width for sale_date are assumptions based on the sample records. */
        input     last_name   $7.
              @9  sale_date   $5.
              @19 residential 9.2
              +1  commercial  9.2;
    run;
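The effect of PAD can be observed without editing any course file by first writing a small file with records of different lengths. Everything in this sketch is assumed for illustration: the temporary fileref shortrec, the three records, and the data set names no_pad and with_pad.

```sas
/* Write three records of different lengths to a temporary file. */
filename shortrec temp;

data _null_;
  file shortrec;
  put 'SMITH   1250.50';   /* 15 columns: long enough for both fields  */
  put 'DAVIS   98';        /* 10 columns: amount field is cut short    */
  put 'JOHNSON 310.25';    /* 14 columns: one column too short         */
run;

/* Without PAD: the INPUT statement may run past the end of a short record. */
data no_pad;
  infile shortrec;
  input @1 last_name $7. @9 amount 7.;
run;

/* With PAD: short records are extended with blanks before they are read. */
data with_pad;
  infile shortrec pad;
  input @1 last_name $7. @9 amount 7.;
run;

proc print data=no_pad;
  title 'Read without PAD';
run;

proc print data=with_pad;
  title 'Read with PAD';
run;
title;
```

Comparing the two listings and the log is the point of the exercise: the step without PAD is expected to report that SAS went to a new line when the INPUT statement reached past the end of a line, while the step with PAD should produce one observation per record.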
Formatted PUT to Create an External Data Set
Just as formatted input reads an external raw data set, a formatted PUT statement can create one.
    FILENAME fileref 'file-location';
    DATA _NULL_;
        SET SAS_data_set_name;
        FILE fileref;
        PUT var format. ... ;
    RUN;
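Example: Create an External Data Set Using Formatted PUT
Create an external data set from the sales data that contains only the March sales.
    data work.sale;
        infile 'C:\math707\RawData\RawData_dat\Salesdata.dat';
        /* MONTH() needs a SAS date value, so sale_date must be read with a date informat. */
        /* DATE9. at column 9 is an assumption about the file layout.                      */
        input     last_name   $7.
              @9  sale_date   date9.
              @19 residential 9.2
              +1  commercial  9.2;
    run;
    data marchsale;
        set work.sale;
        file 'C:\math707\RawData\RawData_dat\Sales_March.dat';
        if month(sale_date) = 3;
        /* Formats other than 10.2 are assumptions. */
        put last_name $7. +1 sale_date date9. +1 residential 10.2 +1 commercial 10.2;
    run;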
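A self-contained variant of the same idea, which can be run without the course files, is sketched below. The in-stream records, the year in the dates, and the temporary fileref outdemo are invented for illustration.

```sas
/* Build a small SAS data set, then write only the March rows to a file. */
filename outdemo temp;

data sales_demo;
  input     last_name   $7.
        @9  sale_date   date9.
        @19 residential 9.2;
  format sale_date date9.;
  datalines;
SMITH   10JAN2010   1250.50
DAVIS   18MAR2010    980.00
JOHNSON 26MAR2010   3100.25
;
run;

data _null_;
  set sales_demo;
  file outdemo;
  if month(sale_date) = 3;   /* subsetting IF: only March rows reach PUT */
  put last_name $7. +1 sale_date date9. +1 residential 10.2;
run;

/* Read the new file back and echo each record to the log. */
data _null_;
  infile outdemo;
  input;
  put _infile_;
run;
```

DATA _NULL_ is used for the writing step because only the external file is wanted, not another SAS data set. The original example instead names an output data set (marchsale), which works the same way but also keeps the March rows as a SAS data set.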
Exercise
The following are finance data. The variables are SSN, Name, Salary, Nyear, and Birthday.
    Rudelich 55,
    Vincent 65,
    Benito 78,
    Sirignano $5,
    Harbinger 73,
    Phillipon $49,
    Gunter 57,
Write a SAS program to read these data using formatted input. Practice using the PAD option in the INFILE statement, and make sure you see and understand the difference between reading with PAD and without PAD.
An Answer
    /* Program b: read variable-length records. This program produces errors. Examine them carefully. */
    data financeb;
        infile 'C:\math707\RawData\RawData_dat\finance3_recordlength.dat';
        input SSN $ 1-11 Name $ salary comma Nyear birthdate 5.;
    run;
    proc print data=financeb;
        format birthdate date9.;
        title 'Errors in reading variable-length records';
    run;
    /* Program c: use the PAD option in the INFILE statement. */
    data financec;
        infile 'C:\math707\RawData\RawData_dat\finance3_recordlength.dat' pad;
        input SSN $ 1-11 Name $ salary comma Nyear birthdate 5.;
    run;
    proc print data=financec;
        format birthdate date9.;
        title 'Use PAD option to read variable-length records';
    run;