1 Chapter 18 Reading Free-Format Data

2 Objectives Read free-format data that is not arranged in fixed fields. Read free-format data separated by non-blank delimiters, such as commas. Read a raw data file with missing data (at the end, middle, or beginning of a record). Read character values longer than 8 characters. Read nonstandard free-format data. Read character values containing embedded blanks.

3 What Is Free-Format Data? Free-format data has values that are not arranged in fixed fields. Values are separated by blanks or by some other specific delimiter, and numeric values may not be in standard format. Issues that need special attention when reading free-format data: How should missing values be handled? What happens when a character value is longer than the length of its variable? How should values enclosed in quotation marks be handled? Informats also behave differently in list input than they do in formatted input.

4 List Input with the Default Delimiter (Blank Is the Default Delimiter) The data is not in fixed columns. The fields are separated by spaces. There is one nonstandard field (the date).
    50001 4feb1989 132 530
    50002 11nov1989 152 540
    50003 22oct1991 90 530
    50004 4feb1993 172 550
    50005 24jun1993 170 510
    50006 20dec1994 180 520

5 List Input and Its Variations The simplest way to read free-format data is list input. General syntax: INPUT variable <$>; where variable is the name of the variable to be read and $ indicates a character variable. NOTE: List input signals to SAS that fields are separated by delimiters. SAS then reads from non-delimiter to delimiter instead of from a specific location on the raw data record.

6 Important Conditions for List Input All fields must be separated by at least one blank (or other delimiter). Fields must be read sequentially from left to right; you cannot skip or re-read fields. A blank cannot represent a missing value, because blank is the delimiter. A missing value, character or numeric, must be represented by a placeholder such as a period (.) or by a user-defined missing-value code.
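
The conditions above can be tried with a small sketch like the following. The data set name, variable names, and values are illustrative assumptions, not part of the original deck. It relies on the period placeholder standing in for missing values in simple list input.

```sas
/* Minimal list-input sketch: blank-delimited fields read left to right. */
/* A period marks a missing value, because a blank is the delimiter.     */
data flights;
   input ID $ PassCap CargoCap;
   datalines;
50001 132 530
50002 .   540
.     152 520
;
run;

proc print data=flights;
run;
```

Extra blanks between fields are harmless here, since list input treats a run of blanks as one delimiter.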

7 Delimiters Common delimiters are blanks, commas, and tab characters. A space (blank) is the default delimiter.

8 Input Data Involving Dates and Times The second field is a date. How does SAS store dates?
    50001 4feb1989 132 530
    50002 11nov1989 152 540
    50003 22oct1991 90 530
    50004 4feb1993 172 550
    50005 24jun1993 170 510
    50006 20dec1994 180 520
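
SAS stores a date as a number: the count of days since January 1, 1960. The short sketch below (the variable name is chosen for illustration) writes the stored value for the first record's date to the log, then the same value displayed with the DATE9. format. The stored value should be 10627, the number that appears in the PDV on the later slides.

```sas
data _null_;
   InService = '04feb1989'd;   /* SAS date constant */
   put InService=;             /* raw stored value: days since 01JAN1960 */
   put InService= date9.;      /* same value displayed with DATE9. */
run;
```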

9 Standard Data The term standard data refers to character and numeric data that SAS recognizes automatically. Examples of standard numeric data include 35469.93, 3E5 (exponential notation), and -46859. Standard character data is any character you can type on your keyboard. Standard character values are always left-justified by SAS.

10 Nonstandard Data The term nonstandard data refers to character and numeric data that SAS does not recognize automatically. Examples of nonstandard numeric data include: 12/12/2012; 29FEB2000; 4,242; $89,000.

11 Informats To read nonstandard data, you must apply an informat. Informats are instructions that specify how SAS reads raw data. General form of an informat: INFORMAT-NAME. (the trailing period is part of the informat).

12 Informats Examples of informats: COMMAw. reads numeric data such as $4,242 and strips out selected nonnumeric characters, such as dollar signs, commas, dashes, and blanks. MMDDYYw. reads dates in the form 12/31/2012. DATEw. reads dates in the form 29Feb2000.
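
One quick way to see what these informats do, without a raw data file, is to apply them with the INPUT function. This is only a sketch; the variable names are illustrative.

```sas
data _null_;
   amount = input('$4,242', comma6.);        /* dollar sign and comma are stripped */
   d1     = input('12/31/2012', mmddyy10.);  /* stored as a SAS date value */
   d2     = input('29Feb2000', date9.);      /* stored as a SAS date value */
   put amount=;
   put d1= date9.;
   put d2= date9.;
run;
```

The widths (6, 10, 9) match the number of characters in each string, which is what an informat width means when it is applied to a fixed piece of text.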

13 Reading Free-Format Data with Delimiters By default, free-format data values are separated by blanks, and SAS reads a data value until it reaches the next blank. Blank is not the only possible delimiter. SAS allows user-specified delimiters, as long as the delimiter character does not appear within the data values. For example, /, %, or ; can serve as the delimiter in an external free-format data set. The DLM= option in the INFILE statement tells SAS which delimiters are used. Ex: INFILE 'path-to-the-file' DLM=','; tells the INPUT statement to read each data value until a comma (,) is reached.

14 Example The following is an airplane data set consisting of ID, date in service, passenger capacity, and cargo capacity. The data values are separated by commas and spaces. How does SAS read this data set?
    LA50001,4feb1989,132, 530
    PHIL50002, 11nov1989, 152,540
    NEWYORK50003,22oct1991, 90, 530
    CHICAGO50004, 4feb1993,172,550
    DETROIT50005,24jun1993, 170,510
    DALLAS50006, 20dec1994, 180, 520

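15 Reading a Delimited Raw Data File
    data airplanes;
       /* both the comma and the blank act as delimiters */
       infile 'raw-data-file' dlm=', ';
       /* the colon reads InService up to the next delimiter */
       input ID $ InService : date9. PassCap CargoCap;
    run;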

16 Exercise Write a SAS program to read the following data. The variables are Location, date, number of passengers, and amount of cargo for the flight.
    LA50001,4feb1989,132, 530
    PHIL50002,11nov1989, 152,540
    NEWYORK50003,22oct1991, 90, 530
    CHICAGO50004,4feb1993, 172,550
    DETROIT50005,24jun1993, 170,510
    DALLAS50006,20dec1994, 180, 520
Print the data. Save the program as c18_freeform1 in the SASEx folder on your C drive. Observe the results. You should notice that some data values for Location are incomplete. What causes the incomplete values? How can the problem be solved?

17 Answer
    data airplane;
       infile datalines dlm=', ';
       /* the colon reads the date up to the next delimiter */
       input Loc $ date : date9. npas ncargo;
       datalines;
    LA50001,4feb1989,132, 530
    PHIL50002,11nov1989, 152,540
    NEWYORK50003,22oct1991, 90, 530
    CHICAGO50004,4feb1993, 172,550
    DETROIT50005,24jun1993, 170,510
    DALLAS50006,20dec1994, 180, 520
    ;
    run;
    proc print;
       format date date9.;
    run;

18 Results
    Obs  Loc       date       npas  ncargo
    1    LA50001   04FEB1989  132   530
    2    PHIL5000  11NOV1989  152   540
    3    NEWYORK5  22OCT1991  90    530
    4    CHICAGO5  04FEB1993  172   550
    5    DETROIT5  24JUN1993  170   510
    6    DALLAS50  20DEC1994  180   520
What is wrong with this result? NOTE: Some of the Loc values are incomplete. Loc was stored with a length of 8 characters, but some of the values are longer than 8.

19 Lengths of Variables Read Using List Input When you use list input, the default length for character and numeric variables is 8 bytes. You can set the length of character variables with a LENGTH statement or with an informat. General form of a LENGTH statement: LENGTH variable-name length-specification...;

20 Setting the Length of a Variable
    data airplanes;
       /* declare the length before the variable is first read */
       length ID $ 15;
       infile 'raw-data-file' dlm=', ';
       input ID $ InService : date9. PassCap CargoCap;
    run;

21 Exercise Open the program c18_freeform1 and revise it so that the data values for Location are complete.

22 Answer
    data airplane;
       length Loc $ 15;
       infile datalines dlm=', ';
       input Loc $ date : date9. npas ncargo;
       datalines;
    LA50001,4feb1989,132, 530
    PHIL50002,11nov1989, 152,540
    NEWYORK50003,22oct1991, 90, 530
    CHICAGO50004,4feb1993, 172,550
    DETROIT50005,24jun1993, 170,510
    DALLAS50006,20dec1994, 180, 520
    ;
    run;
    proc print;
       format date date9.;
    run;

23 Correct Results
    Obs  Loc           date       npas  ncargo
    1    LA50001       04FEB1989  132   530
    2    PHIL50002     11NOV1989  152   540
    3    NEWYORK50003  22OCT1991  90    530
    4    CHICAGO50004  04FEB1993  172   550
    5    DETROIT50005  24JUN1993  170   510
    6    DALLAS50006   20DEC1994  180   520

24 Compilation The LENGTH statement is compiled first, so ID ($ 5) is the first variable added to the program data vector (PDV).
    data airplanes;
       length ID $ 5;
       infile 'raw-data-file';
       input ID $ InService date9. PassCap CargoCap;
    run;
Raw data file:
    50001 4feb1989 132 530
    50002 11nov1989 152 540
    50003 22oct1991 90 530
    50004 4feb1993 172 550
    50005 24jun1993 170 510
    50006 20dec1994 180 520

25 Compilation (continued) The INPUT statement adds the remaining variables. The PDV now holds ID ($ 5), InService (N 8), PassCap (N 8), and CargoCap (N 8).

26 Execution At the start of execution the input buffer is empty and the PDV variables are set to missing.

27 The first record is loaded into the input buffer: 50001 4feb1989 132 530

28 The INPUT statement assigns values to the PDV: ID=50001, InService=10627, PassCap=132, CargoCap=530.

29 Implicit output: SAS writes the observation to airplanes.

30 Implicit return: control returns to the top of the DATA step to process the next record.

31 Using the DLM= Option in the INFILE Statement The DLM= option sets a character or characters that SAS recognizes as a delimiter in the raw data file. General form of the INFILE statement with the DLM= option: INFILE 'raw-data-file' DLM='delimiter(s)'; Any character you can type on your keyboard can be a delimiter. You can also use hexadecimal characters.
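
A common use of a hexadecimal delimiter is tab-delimited data. The sketch below assumes an ASCII system, where the tab character is '09'x. The fileref tabdata and the data set name are illustrative. It first writes a small tab-delimited file to a temporary location so that the example is self-contained, then reads it back.

```sas
filename tabdata temp;

/* Write two tab-delimited records to a temporary file. */
data _null_;
   file tabdata;
   put '50001' '09'x '4feb1989' '09'x '132' '09'x '530';
   put '50002' '09'x '11nov1989' '09'x '152' '09'x '540';
run;

/* Read the file back, naming the tab as the delimiter. */
data airplanes_tab;
   length ID $ 5;
   infile tabdata dlm='09'x;
   input ID $ InService : date9. PassCap CargoCap;
run;

proc print data=airplanes_tab;
   format InService date9.;
run;
```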

32 Reading Missing Values Two situations may occur when reading free-format data that contains missing values: missing values at the END of a record, and missing values at the BEGINNING or MIDDLE of a record.

33 Missing Data at the End of a Record
    50001, 4feb1989,132
    50002, 11nov1989,152, 540
    50003, 22oct1991,90, 530
    50004, 4feb1993,172
    50005, 24jun1993, 170, 510
    50006, 20dec1994, 180, 520

34 Missing Data at the End of a Row By default, when data is missing at the end of a row, SAS looks for the missing value in the next record: 1. SAS loads the next record to finish the observation. 2. A note is written to the log. 3. SAS loads a new record at the top of the DATA step and continues processing.

35 Execution
    data airplanes3;
       length ID $ 5;
       infile 'raw-data-file' dlm=', ';
       input ID $ InService : date9. PassCap CargoCap;
    run;
Raw data file:
    50001, 4feb1989,132
    50002, 11nov1989,152, 540
    50003, 22oct1991,90, 530
    50004, 4feb1993,172
    50005, 24jun1993, 170, 510
    50006, 20dec1994, 180, 520
At the start of execution the input buffer is empty and the PDV variables ID ($ 5), InService (N 8), PassCap (N 8), and CargoCap (N 8) are missing.

36 The first record is loaded into the input buffer: 50001, 4feb1989,132

37 SAS assigns ID=50001, InService=10627, and PassCap=132. There is no data left in the record for CargoCap.

38 SAS loads the next record (50002, 11nov1989,152, 540) into the input buffer and reads its first field, 50002, into CargoCap.

39 Implicit output: SAS writes the observation (ID=50001, InService=10627, PassCap=132, CargoCap=50002) to airplanes3.

40 Implicit return: control returns to the top of the DATA step.

41 The next record (50003, 22oct1991,90, 530) is loaded into the input buffer.

42 SAS continues processing until the end of the raw data file.

43 Partial Log
    NOTE: 6 records were read from the infile 'aircraft3.dat'.
          The minimum record length was 19.
          The maximum record length was 26.
    NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
    NOTE: The data set WORK.AIRPLANES3 has 4 observations and 4 variables.

44 Missing Data at the End of the Row: PROC PRINT Output
    proc print data=airplanes3 noobs;
    run;
Output:
    ID     InService  PassCap  CargoCap
    50001  10627      132      50002
    50003  11617      90       530
    50004  12088      172      50005
    50006  12772      180      520

45 Use the MISSOVER Option in the INFILE Statement to Handle Missing Values at the End of a Record The MISSOVER option prevents SAS from loading a new record when the end of the current record is reached. If SAS reaches the end of the row without finding values for all fields, variables without values are set to missing. General form of the INFILE statement with the MISSOVER option: INFILE 'raw-data-file' MISSOVER;

46 Using the MISSOVER Option
    data airplanes;
       length ID $ 5;
       infile 'raw-data-file' dlm=',' missover;
       input ID $ InService : date9. PassCap CargoCap;
    run;

47 Using the MISSOVER Option: Partial SAS Log
    NOTE: 6 records were read from the infile 'aircraft3.dat'.
          The minimum record length was 19.
          The maximum record length was 26.
    NOTE: The data set WORK.AIRPLANES3 has 6 observations and 4 variables.

48 Using the MISSOVER Option: PROC PRINT Output
    proc print data=airplanes noobs;
    run;
Output:
    ID     InService  PassCap  CargoCap
    50001  10627      132      .
    50002  10907      152      540
    50003  11617      90       530
    50004  12088      172      .
    50005  12228      170      510
    50006  12772      180      520
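
For readers without the raw data file, the MISSOVER behavior can be tried with in-stream data. This is a sketch under the assumption that the first four records of the deck's file are enough to show the effect. The data set name airplanes_mo is illustrative.

```sas
data airplanes_mo;
   length ID $ 5;
   /* MISSOVER: do not move to the next record when a record runs out of values */
   infile datalines dlm=', ' missover;
   input ID $ InService : date9. PassCap CargoCap;
   datalines;
50001, 4feb1989,132
50002, 11nov1989,152, 540
50003, 22oct1991,90, 530
50004, 4feb1993,172
;
run;

proc print data=airplanes_mo noobs;
   format InService date9.;
run;
```

Each input record should produce one observation, with CargoCap missing for the records that end after the passenger capacity. Removing MISSOVER from the INFILE statement lets you compare against the default behavior described earlier.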

49 Missing Values at the Beginning or Middle of a Record Missing values can also occur at the beginning or in the middle of a record. Because consecutive delimiters, such as ,, are treated as a single delimiter, DLM=',' alone cannot handle these situations.

50 Missing Values without Placeholders The missing data is represented by two consecutive delimiters.
    50001, 4feb1989,, 530
    50002, 11nov1989,132, 540
    50003, 22oct1991,90, 530
    50004, 4feb1993,172, 550
    50005, 24jun1993,, 510
    50006, 20dec1994, 180, 520

51 Missing Values without Placeholders By default, SAS treats two consecutive delimiters as one. A missing value should therefore be represented by a placeholder, such as a period (.) for a numeric missing value, for example: 50001, 4feb1989,., 530. A blank cannot serve as the placeholder. For a character variable, one approach is to choose a string that stands for missing and then write SAS code that converts that string to a missing value. Alternatively, the DSD option in the INFILE statement handles these cases.

52 Missing Values without Placeholders
    data airplanes4;
       length ID $ 5;
       infile 'raw-data-file' dlm=',';
       input ID $ InService : date9. PassCap CargoCap;
    run;

53 Execution
    data airplanes4;
       length ID $ 5;
       infile 'raw-data-file' dlm=', ';
       input ID $ InService : date9. PassCap CargoCap;
    run;
Raw data file:
    50001, 4feb1989,, 530
    50002, 11nov1989,132, 540
    50003, 22oct1991,90, 530
    50004, 4feb1993,172, 550
    50005, 24jun1993,, 510
    50006, 20dec1994, 180, 520
At the start of execution the input buffer is empty and the PDV variables ID ($ 5), InService (N 8), PassCap (N 8), and CargoCap (N 8) are missing.

54 The first record is loaded into the input buffer: 50001, 4feb1989,, 530

55 SAS treats the two consecutive commas as a single delimiter, so it assigns ID=50001, InService=10627, and PassCap=530. There is no data left in the record for CargoCap.

56 The PDV holds ID=50001, InService=10627, and PassCap=530. CargoCap is still missing.

57 SAS loads the next record (50002, 11nov1989,132, 540) into the input buffer.

58 SAS reads the first field of the new record, 50002, into CargoCap.

59 Implicit output: SAS writes the observation (ID=50001, InService=10627, PassCap=530, CargoCap=50002) to airplanes4.

60 Implicit return: control returns to the top of the DATA step.

61 The next record (50003, 22oct1991,90, 530) is loaded into the input buffer.

62 Missing Values without Placeholders: Partial Log
    NOTE: 6 records were read from the infile 'aircraft4.dat'.
          The minimum record length was 21.
          The maximum record length was 26.
    NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
    NOTE: The data set WORK.AIRPLANES4 has 4 observations and 4 variables.
The missing value is not read correctly.

63 Missing Values without Placeholders: PROC PRINT Output
    proc print data=airplanes4 noobs;
    run;
Output:
    ID     InService  PassCap  CargoCap
    50001  10627      530      50002
    50003  11617      90       530
    50004  12088      172      550
    50005  12228      510      50006
This is not correct. Not only are the missing values read incorrectly, but further errors follow from them.

64 Missing Values without Placeholders If your data does not have placeholders (for example, 50001, 4feb1989,, 530), use the DSD option.

65 The DSD Option General form of the DSD option in the INFILE statement: INFILE 'file-name' DSD;

66 The DSD Option The DSD option sets the default delimiter to a comma, treats consecutive delimiters as missing values, and enables SAS to read values with embedded delimiters if the value is surrounded by double quotes.
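
The three effects of DSD can be seen together in a small sketch. The Model column and its values are made up for illustration and are not part of the deck's airplane data. The LENGTH statement is there because the quoted value is longer than 8 characters.

```sas
/* Illustrative only: Model and its values are hypothetical. */
data airplanes_dsd;
   length ID $ 5 Model $ 12;
   infile datalines dsd;   /* comma delimiter, ,, means missing, quotes protect commas */
   input ID $ Model $ PassCap CargoCap;
   datalines;
50001,"Boeing, 737",,530
50002,Airbus,152,540
;
run;

proc print data=airplanes_dsd noobs;
run;
```

In the first record the quoted value should be read as a single field despite its embedded comma, and the two consecutive commas should leave PassCap missing.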

67 Using the DSD Option
    data airplanes4;
       length ID $ 5;
       infile 'raw-data-file' dsd;
       input ID $ InService : date9. PassCap CargoCap;
    run;

68 Missing Values without Placeholders: Partial Log
    NOTE: 6 records were read from the infile 'aircraft4.dat'.
          The minimum record length was 22.
          The maximum record length was 25.
    NOTE: The data set WORK.AIRPLANES4 has 6 observations and 4 variables.

69 Using the DSD Option: PROC PRINT Output
    proc print data=airplanes4 noobs;
    run;
Output:
    ID     InService  PassCap  CargoCap
    50001  10627      .        530
    50002  10907      132      540
    50003  11617      90       530
    50004  12088      172      550
    50005  12228      .        510
    50006  12772      180      520

70 Exercise Open the program c18_freeformat_missing. Run the program and observe the problem. Revise the program so that the missing data are handled properly.

71 Answer
    data carsales;
       infile datalines dlm=',' missover dsd;
       input year country $ type $ sales;
       datalines;
    1998,US,CARS, 194324.12
    1998,US,TRUCKS,142290.30
    1998, CANADA,CARS,10483.44
    1998, CANADA,TRUCKS,
    1998,JAPAN,CARS,15066.43
    1998,JAPAN, TRUCKS,40700.34
    1997,,CARS, 213504.05
    1997,US,TRUCKS,116735.65
    1997,CANADA,CARS,904.89
    1997,CANADA,TRUCKS,76576.12
    1997,JAPAN,CARS,10000.18
    1997,JAPAN,TRUCKS,50458.22
    ;
    run;
    proc print data=carsales;
    run;

72 Exercise Open c18_freeformat2. Run the program, observe the results, and revise the program to read the data correctly.

73 Answer
    data carsales2;
       length type $ 14;
       infile datalines dlm='/' missover dsd;
       /* the colon reads sales up to the end of the field */
       input year (country type) ($) sales : comma10.;
       datalines;
    1998/US/CARS/$194324.12
    1998/US/TRUCKS_GM/ $142290.30
    1998/CANADA/CARS/$10483.44
    1998/CANADA/TRUCKS_FORD/
    1998/JAPAN/CARS/$15066.43
    1998/JAPAN/'TRUCKS_HUNDA'/$40700.34
    1997/US/CARS/$213504.05
    1997//TRUCKS_FORD/ $116735.65
    1997/CANADA/CARS/$904.89
    1997/CANADA/TRUCKS_GM/$76576.12
    /JAPAN/CARS/$10000.18
    1997/JAPAN/TRUCKS_TOYOTA/$50458.22
    ;
    run;
    proc print data=carsales2;
       title ' / as delimiter ';
    run;
    proc contents;
    run;

74 Specifying an Informat To specify an informat in list input, use the colon (:) format modifier in the INPUT statement between the variable name and the informat. General form of a format modifier in an INPUT statement: INPUT variable : informat; NOTE: An informat does not play the same role in list input as in formatted input. In formatted input, the width of the informat specifies how many columns to read from the raw data. In list input with the colon modifier, SAS reads from delimiter to delimiter, and the informat describes how to interpret the value and, for a character variable, the length at which it is stored in the new data set.

75 Modifying List Input When reading free-format data, it is difficult to specify an informat whose width matches the number of columns to be read, because the values are not aligned in columns. Nonstandard data values also cannot be read by simple list input. SAS provides two modifiers to help with these situations.

76 Modifiers used in LIST INPUT The ampersand (&) modifier is used to read character values that contain embedded blanks. The colon ( : ) modifier is used to read nonstandard data values and character values that are longer than 8 characters, but which contain no embedded blanks.

77 Use the & Modifier in List Input The & modifier enables SAS to read character values that contain single embedded blanks, such as NEW YORK. With the blank as the delimiter (DLM=' '), NEW YORK would be read as two values, NEW and YORK; the & modifier reads it as one value. To keep SAS from reading on into the next data value, NEW YORK must be followed by TWO or more blanks: with &, SAS reads a value, single embedded blanks included, until it reaches two or more consecutive blanks.

78 Example of Applying the & Modifier Data set (city, population), with at least two blanks between the city and the population:
    NEW YORK  7,262,700
    LOS ANGLES  3,259,340
    CHICAGO  3,009,530
    HOUSTON  1,728,910
To read this data set:
    data city_pop;
       input city $ & population comma10.;
       datalines;
    NEW YORK  7,262,700
    LOS ANGLES  3,259,340
    CHICAGO  3,009,530
    HOUSTON  1,728,910
    ;
    run;
    proc print;
    run;

79 Results from the Previous Program Using the & Modifier
    The SAS System    13:23 Monday, November 15, 2010
    Obs  city      population
    1    NEW YORK  7262700
    2    LOS ANGL  3259340
    3    CHICAGO   3009530
    4    HOUSTON   1728910
NOTE: The data value LOS ANGLES is not read correctly. It was stored with the default length of 8, not the length of 10 that it needs. One way to handle this problem is the LENGTH statement introduced previously: LENGTH city $ 10; Another way is to use the & modifier together with an informat.

80 Using the & Modifier with an Informat
    data city_pop;
       input city & $10. population comma10.;
       datalines;
    NEW YORK  7,262,700
    LOS ANGLES  3,259,340
    CHICAGO  3,009,530
    HOUSTON  1,728,910
    ;
    run;
    proc print;
    run;
NOTE: Once $10. is used in the list input, a LENGTH statement is not needed, because the informat defines the length at which city is stored.

81 Cautions When Using & NOTE: When used with &, $10. does not specify the number of columns to be read for city; it specifies the length at which city is stored. You MUST use two or more consecutive blanks to mark the end of a value read with the & modifier. No other delimiter can mark the end of that value.

82 Exercise Open the program c18_freeformat_modifier. Run each program to learn how the modifiers work, and review the MISSOVER and DSD options and the LENGTH statement.

83 Reading Nonstandard Values in List Input Informats for nonstandard values, such as DATEw., TIMEw., DATETIMEw., and COMMAw.d, require the user to specify the width w. In formatted input, w defines the number of columns to be read from the data. In list input, which is free-format, nonstandard values rarely line up in a fixed number of columns. SAS therefore provides a list input modifier, the colon (:), which reads a nonstandard value from one delimiter to the next.

84 List Input Without the Colon The colon signals that SAS should read from delimiter to delimiter. If the colon is omitted, SAS reads the length of the informat, which may cause it to read past the end of the field. No error message is printed, but you might see invalid data messages or unexpected data values.
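
To see the difference for yourself, the two DATA steps below read the same comma-delimited records with and without the colon. This is a sketch with illustrative data set names. The first date field is 8 characters wide, so DATE9. without the colon reaches one column past the end of that field.

```sas
/* Formatted input: DATE9. always reads 9 columns, whatever the field width. */
data no_colon;
   infile datalines dlm=',';
   input ID $ InService date9. PassCap;
   datalines;
50001,4feb1989,132
50002,11nov1989,152
;
run;

/* Modified list input: the colon makes SAS stop at the next delimiter. */
data with_colon;
   infile datalines dlm=',';
   input ID $ InService : date9. PassCap;
   datalines;
50001,4feb1989,132
50002,11nov1989,152
;
run;

proc print data=no_colon;
   format InService date9.;
run;

proc print data=with_colon;
   format InService date9.;
run;
```

Compare the two listings and check the log of the first step for invalid data notes.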

85 Use the Colon (:) as a Modifier in List Input The colon (:) modifier enables you to read nonstandard data values and character values that are longer than 8 characters and contain no embedded blanks. It reads a value until a blank (or other delimiter) is reached. If the informat $w. is specified, its width overrides the default length.

86 Example of Using the Colon (:) Modifier
    data city_pop;
       input city & $10. population : comma.;
       datalines;
    NEW YORK  7,262,700
    LOS ANGLES  3,259,340
    CHICAGO  3,009,530
    HOUSTON  1,728,910
    ;
    run;
    proc print;
    run;
NOTE: The informat COMMA. does not specify the w value. List input reads a data value until the next delimiter is reached. Numeric values are stored with the default length of 8, so there is no need to specify the length of a numeric variable.

87 NOTE: If we DO NOT use the colon (:), then we must specify COMMAw.d in order to read the correct number of columns in the data. In that situation, w is the number of columns read from the data set.

88 INFILE Statement Options These options can be used separately or together in the INFILE statement.

89 Creating Free-Format External Data Just as we can read free-format external data, we can also create it, using: FILE 'path-to-external-data-set'; PUT variable <: format>; where format specifies the format used to write the data values. This is particularly useful when writing data values in a nonstandard format such as COMMAw.d, DATE9., or MMDDYY10.

90 An Example That Creates city_pop.dat
    data city;
       input city & $10. population : comma.;
       datalines;
    NEW YORK  7,262,700
    LOS ANGLES  3,259,340
    CHICAGO  3,009,530
    HOUSTON  1,728,910
    ;
    run;
    proc print;
    run;
    data citypop;
       set city;
       file 'c:\math707\rawdata\city_pop.dat' dlm='/';
       /* width 10 leaves room for values such as 7,262,700 */
       put city population : comma10.;
    run;

91 An Example of Creating External Data in Free Format When the Delimiter Is a Comma and a Numeric Variable Is Written with the COMMAw.d Format
    data citypop;
       set city;
       file 'c:\math707\rawdata\city_pop.dat' dsd;
       put city population : comma10.;
    run;
NOTE: Because the delimiter is a comma and population is written with the comma format, each population value must be marked so that it is still recognizable as a single data value. The DSD option in the FILE statement encloses such values in quotation marks. When reading this type of data, you must also use the DSD option in the INFILE statement, and you should be careful about the LENGTH of character variables.

92 The Resulting Data Set
    NEW YORK," 7,262,700"
    LOS ANGLES," 3,259,340"
    CHICAGO," 3,009,530"
    HOUSTON," 1,728,910"
To read this data set, use the DSD option in the INFILE statement:
    data citypop2;
       length city $ 10;
       infile 'c:\math707\rawdata\city_pop3.dat' dsd;
       input city $ population : comma10.;
    run;
    proc print;
    run;

93 The Resulting Data
    Obs  city        population
    1    NEW YORK    7262700
    2    LOS ANGLES  3259340
    3    CHICAGO     3009530
    4    HOUSTON     1728910

94 Writing Character Strings and Variable Values to the External Data Set
    data citypop;
       set city;
       file 'c:\math707\rawdata\city_pop.dat' dsd;
       put '2000 City Census ' city 'Total Population ' population : comma10.;
    run;
This program writes extra strings that describe City and Population into the created data set.

95 Use PROC EXPORT to Create an External Data Set General syntax:
    PROC EXPORT DATA=sas-data-set OUTFILE='filename' DBMS=DLM REPLACE;
       DELIMITER='delimiter';
       PUTNAMES=YES | NO;
    RUN;
You can also export a data set from the SAS pull-down menus: choose File, then Export Data, and follow the step-by-step menu to create the external file.
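
As a hedged, self-contained illustration of this syntax, the sketch below exports the sample data set SASHELP.CLASS, which ships with SAS, to a temporary comma-delimited file and then echoes the file to the log. The fileref outcsv is an arbitrary name chosen for this example.

```sas
filename outcsv temp;

/* Write SASHELP.CLASS as comma-delimited text with a header row. */
proc export data=sashelp.class outfile=outcsv dbms=dlm replace;
   delimiter=',';
   putnames=yes;
run;

/* Echo the exported file to the log to inspect the result. */
data _null_;
   infile outcsv;
   input;
   put _infile_;
run;
```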

96 Exercise Open the program c18_put_freeformat_Export. Run the programs and observe the results to make sure you learn how to write the PUT statement and PROC EXPORT.

97 Exercise Open the program c18_Import to learn how to write a PROC IMPORT step that reads free-format external data.

98 Mixing Input Styles We have introduced column input, formatted input, and list input. All of these input styles can be mixed in one INPUT statement, depending on the situation.
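
A minimal sketch of mixing the three styles in one INPUT statement follows. It assumes the ID always occupies columns 1-5 and the date always starts in column 7; the data set name and the records are illustrative.

```sas
data mixed;
   input ID $ 1-5             /* column input: columns 1 through 5 */
         @7 InService date9.  /* formatted input: 9 columns starting at column 7 */
         PassCap CargoCap;    /* list input: blank-delimited values */
   format InService date9.;
   datalines;
50001 04feb1989 132 530
50002 11nov1989 152 540
;
run;

proc print data=mixed;
run;
```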

99 Additional Materials Useful for Reading Delimited Data The textbook introduces the following INFILE statement options for handling different situations when reading delimited external data: MISSOVER, DSD, and DLM='delimiter'. The following are three additional options that control what happens at the end of a record: STOPOVER, TRUNCOVER, and FLOWOVER.

100 Example Consider the following raw data, to be read into the variable TESTNUM.
    ----+----1----+-
    1
    22
    333
    4444
    55555
We will show the effect of using the FLOWOVER, MISSOVER, and TRUNCOVER options in the INFILE statement.

101 The Value of TESTNUM Using Different INFILE Statement Options
    data numbers;
       infile 'external-file';
       input testnum 5.;
    run;
    OBS  FLOWOVER  MISSOVER  TRUNCOVER
    1    22        .         1
    2    4444      .         22
    3    55555     .         333
    4              .         4444
    5              55555     55555
(FLOWOVER produces only three observations.)

102 Explanation of These Options FLOWOVER is the default behavior. It causes the DATA step to look in the next record if the end of the current record is encountered before all of the variables are assigned values. MISSOVER causes the DATA step to assign missing values to any variables that do not have values when the end of a data record is encountered. The DATA step continues processing. STOPOVER causes the DATA step to stop execution immediately and write a note to the SAS log. TRUNCOVER causes the DATA step to assign values to variables, even if the values are shorter than expected by the INPUT statement, and to assign missing values to any variables that do not have values when the end of a record is encountered.
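
The TESTNUM example can be reproduced without an existing external file by first writing the five records to a temporary file. This is a sketch; the fileref nums is an arbitrary name. In-stream DATALINES records are padded with blanks, so an external file with variable-length records is used here to show the effect.

```sas
filename nums temp;

/* Write the five variable-length records, one value per line. */
data _null_;
   file nums;
   put '1' / '22' / '333' / '4444' / '55555';
run;

/* Read them back with a 5-column informat; try FLOWOVER or MISSOVER here too. */
data numbers;
   infile nums truncover;
   input testnum 5.;
run;

proc print data=numbers;
run;
```

With TRUNCOVER each record should yield one observation, even when the record is shorter than the 5 columns requested. Swapping in the other options lets you compare against the TESTNUM table.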

