Data Transformation Data cleaning
Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access data ODBC for external database data
Importing an Excel Spreadsheet PROC IMPORT OUT= WORK.Fall2007 DATAFILE= "L:\DataWarehousing07f\CourseDatabase\Fall2007.xls" DBMS=EXCEL REPLACE; SHEET="'Fall 07$'"; GETNAMES=YES; MIXED=NO; SCANTEXT=YES; USEDATE=YES; SCANTIME=YES; RUN;
Import an Access Table PROC IMPORT OUT= WORK.OrderLine DATATABLE= "OrderLin" DBMS=ACCESS REPLACE; DATABASE="I:\DataWarehousing07f\WholesaleProduct s.mdb"; SCANMEMO=YES; USEDATE=NO; SCANTIME=YES; RUN;
Good Practice Check the metadata for a dataset PROC CONTENTS DATA= OrderLine; RUN; Print a few records PROC PRINT DATA= OrderLine (OBS= 10); RUN;
Saving SAS Datasets LIBNAME course "L:\DataWarehousing07f\CourseDatabase"; Data course.Spring2008; set spring2008; run; Note: the name associated with the libname command (“course”) must be 8 characters or less.
LIBNAME / INFILE / INPUT for character data LIBNAME identifies the location or folder where the data file is stored INFILE specifies the libname to use for reading external data. INPUT reads text format data SET reads SAS data
INFILE with INPUT for character data files DATA Fitness; INFILE "L:\DataWarehousing07f\TransformationSAS\SAS1.txt"; INPUT NAME $ WEIGHT WAIST PULSE CHINS SITUPS JUMPS; run;
Creating Derived Attributes Generating new attributes for a table. SAS creates attributes when they are referred to in a data step. The metadata depends on the context of the code. LENGTH statements FORMAT statements FORMATS and INFORMATS PUT INPUT
PUT and INPUT Functions TextOutput = PUT(variable, format) Note: the result of a put function is always character Note: there is also a PUT statement that writes the contents of a variable to the SAS log Output = INPUT(CharacterInput, informat) Note: the variable for an input function is always character
Formats Formats always contain a period Formats for character variables always start with a $ The most used format categories are Character, Date and Time, and Numeric Note: use the SAS “search” tab to look for “Formats.” For a list of SAS formats look under: “Formats: Formats by Category”
Good Practice The following code is handy for testing functions and formats in SAS. The _Null_ dataset name tells SAS not to create the datset in the WORK library Data _Null_; InputVal= 123; OutputVal= PUT(InputVal, Roman30.); PUT InputVal OutputVal; run;
Generating Dates Generating a Date dimension Usually done offline in something like Excel SAS has extensive date and datetime functions and formats SAS formats apply to only one of datetime, date or time variable types. Convert from one type to another with SAS functions.
Creating a text variable for Date Data Orders2; Length Date $10.; Set Orders; Date= PUT( Datepart(OrderDate), MDDYY8.); The Length statement assures that the variable will have enough space. It must come before the SET. OrderDate has DateTime format. The DATEPART function produces a date format output. MMDDYYx. is a date format type.
SAS Functions We are especially interested in “Character” and “Date and Time” functions Note: use the SAS “search” tab to look for “Functions.” For a list of SAS functions look under: “Functions and CALL routines: Functions and CALL Routines by Category”
Useful Data Cleaning Functions Text Manipulation: –COMPRESS, STRIP, TRIM, LEFT, RIGHT, UPCASE, LOWCASE Text Extraction –INDEX, SCAN, SUBSTR, TRANSLATE, TRANWRD
Parsing The process of splitting a text field into multiple fields Uses SAS functions to extract parts of a character string. –Fixed position in a string: SUBSTR –Known delimiter: SCAN Note: it is a good idea to strip blanks before you try to parse a string.
Example of Parsing Data Customer2; LENGTH street cust_addr $20.; FORMAT street cust_addr $20.; SET Customer; Cust_Addr= TRIM(Cust_Addr); Number= Scan(Cust_Addr,1,' '); Street= Scan(Cust_Addr,2,' '); run; Note: The LENGTH and FORMAT statements clear trailing blanks for further display.
Parsing Results Obs cust_addr Number street OAK 481 OAK PETE 215 PETE 3 48 COLLEGE 48 COLLEGE CHERRY 914 CHERRY WATSON 519 WATSON 6 16 ELM 16 ELM PINE 108 PINE
Good Practice Always print the before and after images here. Parsing free form text can be quite a problem. For example, apartment addresses ‘110b Elm’ and ‘110 b Elm’ will parse differently. In this case you may have to search the second word for things that look like apartments and correct the data.
=SUBSTR( string, position ) Use this when you have a known position for characters. String: character expression Position: start position (starts with 1) Length: number of characters to take (missing takes all to the end) VAR= ‘ABCDEFG’ NEWVAR= SUBSTR(VAR,2,2) NEWVAR2= SUBSTR(VAR,4) NEWVAR= ‘BC’ NEWVAR2= ‘DEFG’
SUBSTR(variable, position ) = new-characters Replaces character value contents. Use this when you know where the replacement starts. a='KIDNAP'; substr(a,1,3)='CAT'; a: CATNAP substr(a,4)='TY' ; a: KIDTY
INDEX(source, excerpt) Searches a character expression for a string of characters. Returns the location (number) where the string begins. a='ABC.DEF (X=Y)'; b='X=Y'; x=index(a,b); x: 10 x= index(a,’DEF’); x: 5
Alternative INDEX functions INDEXC searches for a single character INDEXW searches for a word: Syntax INDEXW(source, excerpt )
Length Returns the length of a character variable The LENGTH and LENGTHN functions return the same value for non-blank character strings. LENGTH returns a value of 1 for blank character strings, whereas LENGTHN returns a value of 0. The LENGTH function returns the length of a character string, excluding trailing blanks, whereas the LENGTHC function returns the length of a character string, including trailing blanks. LENGTH always returns a value that is less than or equal to the value returned by LENGTHC.
Standardizing Adjusting terms to standard format. Based off of frequency prints. Use functions or IF statements –TRANWRD is easy but can produce unexpected results –IF statements are safer, but less general
Standardization Code Supplier= Tranwrd(supplier, " Incorporated", ""); If Supplier= "Trinkets & Things" then supplier= "Trinkets n' Things"; More complex logic is often needed. See the course examples.
Good Practice It is a good idea to produce a change log for standardized changes: Data Products2 Changed; Set Products; SupplierOld= Supplier; * * Output Products2; If Trim(supplier) ^= Trim(SupplierOld) then output Changed; Proc Print Data= Changed; Var SupplierOld Supplier;
Locating Anomalies Frequency counts are a good way to identify anomalies. It is also helpful to identify standard changes that you do not have to review. Probably the safest way to execute standard changes is with a “Change Table” that lists From and To values. (Advanced SAS exercise – go for it!!)
De Duplicating Reconcile different representations of the same entity Done after standardizing. Usually requires multi-field testing. May use probabilistic logic, depending on the application. Should produce a change log.
Correcting Identifying and correcting values that are wrong Very difficult to do. Usually based off of exception reports or range checks.