Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access.

Presentation transcript:

Data Transformation Data cleaning

Importing Data
Reading data from external formats:
LIBNAME/INFILE/INPUT for text-format data
PROC IMPORT for Excel/Access data
ODBC for external database data

Importing an Excel Spreadsheet

    PROC IMPORT OUT=WORK.Fall2007
        DATAFILE="L:\DataWarehousing07f\CourseDatabase\Fall2007.xls"
        DBMS=EXCEL REPLACE;
        SHEET="'Fall 07$'";
        GETNAMES=YES;
        MIXED=NO;
        SCANTEXT=YES;
        USEDATE=YES;
        SCANTIME=YES;
    RUN;

Import an Access Table

    PROC IMPORT OUT=WORK.OrderLine
        DATATABLE="OrderLin"
        DBMS=ACCESS REPLACE;
        DATABASE="I:\DataWarehousing07f\WholesaleProducts.mdb";
        SCANMEMO=YES;
        USEDATE=NO;
        SCANTIME=YES;
    RUN;

Good Practice
Check the metadata for a dataset:

    PROC CONTENTS DATA=OrderLine;
    RUN;

Print a few records:

    PROC PRINT DATA=OrderLine (OBS=10);
    RUN;

Saving SAS Datasets

    LIBNAME course "L:\DataWarehousing07f\CourseDatabase";

    DATA course.Spring2008;
        SET spring2008;
    RUN;

Note: the name associated with the LIBNAME statement (“course”) must be 8 characters or less.

LIBNAME / INFILE / INPUT for character data
LIBNAME assigns a libref to the location or folder where SAS data sets are stored
INFILE identifies the external file (by path or fileref) to read
INPUT reads text-format data from that file
SET reads SAS data

INFILE with INPUT for character data files

    DATA Fitness;
        INFILE "L:\DataWarehousing07f\TransformationSAS\SAS1.txt";
        INPUT NAME $ WEIGHT WAIST PULSE CHINS SITUPS JUMPS;
    RUN;

Creating Derived Attributes
Generating new attributes for a table.
SAS creates attributes when they are referred to in a data step.
The metadata depends on the context of the code.
LENGTH statements
FORMAT statements
FORMATS and INFORMATS
PUT
INPUT
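The point that metadata depends on context can be sketched with a small self-contained step. The dataset and variable names (Derived, ClassBad, ClassOK) and the weights are hypothetical, chosen only for illustration. Without a LENGTH statement, SAS sizes a new character variable from its first appearance in the code.

```sas
/* Sketch only: names and values are hypothetical, not from the course data */
data Derived;
    length ClassOK $6;                 /* declared before first use: 6 characters */
    do Weight = 150, 250;
        /* ClassBad first appears with 'Hi', so it is created with length 2 */
        if Weight > 200 then ClassBad = 'Hi';
        else ClassBad = 'Medium';      /* stored truncated to 2 characters */

        if Weight > 200 then ClassOK = 'Hi';
        else ClassOK = 'Medium';       /* fits, because of the LENGTH statement */
        output;
    end;
run;

/* Inspect the resulting lengths and values */
proc contents data=Derived;
run;

proc print data=Derived;
run;
```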

PUT and INPUT Functions
TextOutput = PUT(variable, format)
Note: the result of a PUT function is always character
Note: there is also a PUT statement that writes the contents of a variable to the SAS log
Output = INPUT(CharacterInput, informat)
Note: the variable for an INPUT function is always character
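A minimal sketch of both directions of conversion follows. The values and variable names are made up for illustration; only the PUT and INPUT functions and the PUT statement come from the slide.

```sas
/* Sketch only: hypothetical values */
data _null_;
    length NumText $8;
    NumVal  = 1234.5;
    CharVal = '08/15/2007';

    /* PUT function: numeric in, character out */
    NumText = put(NumVal, 8.1);

    /* INPUT function: character in, numeric out (here a SAS date value) */
    DateVal = input(CharVal, mmddyy10.);

    /* PUT statement: write the values to the SAS log */
    put NumText= DateVal= date9.;
run;
```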

Formats
Formats always contain a period
Formats for character variables always start with a $
The most used format categories are Character, Date and Time, and Numeric
Note: use the SAS “search” tab to look for “Formats.” For a list of SAS formats look under: “Formats: Formats by Category”

Good Practice
The following code is handy for testing functions and formats in SAS. The _Null_ dataset name tells SAS not to create a dataset in the WORK library.

    Data _Null_;
        InputVal = 123;
        OutputVal = PUT(InputVal, Roman30.);
        PUT InputVal OutputVal;
    run;

Generating Dates
Generating a Date dimension
Usually done offline in something like Excel
SAS has extensive date and datetime functions and formats
Each SAS format applies to only one of the datetime, date, or time value types.
Convert from one type to another with SAS functions.
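As a rough sketch of these conversions, the first step below splits a datetime into date and time parts and rebuilds it; the second generates a simple date dimension inside SAS rather than offline. The datetime literal, the year 2007, and the names DateDim, DateText, and so on are assumptions made for illustration.

```sas
/* Sketch only: hypothetical values and names */
data _null_;
    OrderDT   = '15AUG2007:14:30:00'dt;            /* datetime: seconds since 01JAN1960 */
    OrderDate = datepart(OrderDT);                 /* date: days since 01JAN1960 */
    OrderTime = timepart(OrderDT);                 /* time: seconds since midnight */
    Rebuilt   = dhms(OrderDate, 0, 0, OrderTime);  /* date + time back to datetime */
    put OrderDT= datetime19. OrderDate= date9. OrderTime= time8. Rebuilt= datetime19.;
run;

/* One row per calendar day of 2007 */
data DateDim;
    format Date date9.;
    do Date = '01JAN2007'd to '31DEC2007'd;
        Year      = year(Date);
        Quarter   = qtr(Date);
        Month     = month(Date);
        DayOfWeek = weekday(Date);                 /* 1 = Sunday, 7 = Saturday */
        DateText  = put(Date, mmddyy10.);          /* character version of the date */
        output;
    end;
run;
```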

Creating a text variable for Date

    Data Orders2;
        Length Date $10;
        Set Orders;
        Date = PUT(Datepart(OrderDate), MMDDYY8.);
    run;

The Length statement assures that the variable will have enough space. It must come before the SET.
OrderDate holds a datetime value. The DATEPART function returns a SAS date value.
MMDDYYx. is a date format type.

SAS Functions We are especially interested in “Character” and “Date and Time” functions Note: use the SAS “search” tab to look for “Functions.” For a list of SAS functions look under: “Functions and CALL routines: Functions and CALL Routines by Category”

Useful Data Cleaning Functions
Text Manipulation:
–COMPRESS, STRIP, TRIM, LEFT, RIGHT, UPCASE, LOWCASE
Text Extraction:
–INDEX, SCAN, SUBSTR, TRANSLATE, TRANWRD
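A few of these functions can be tried together in a _Null_ step. This is a sketch with a made-up input string; the variable names are hypothetical.

```sas
/* Sketch only: hypothetical input value */
data _null_;
    length Raw Stripped Upper NoPunct Swapped $20;
    Raw = '  Acme, Inc.  ';

    Stripped = strip(Raw);                    /* drop leading and trailing blanks */
    Upper    = upcase(Stripped);              /* convert to upper case */
    NoPunct  = compress(Stripped, ',.');      /* remove every comma and period */
    Swapped  = translate(Stripped, '-', ','); /* replace each comma with a hyphen */

    put Stripped= Upper= NoPunct= Swapped=;
run;
```

Note the argument order of TRANSLATE: the replacement characters come before the characters being replaced.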

Parsing
The process of splitting a text field into multiple fields
Uses SAS functions to extract parts of a character string.
–Fixed position in a string: SUBSTR
–Known delimiter: SCAN
Note: it is a good idea to strip blanks before you try to parse a string.
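The two parsing approaches can be sketched side by side. The phone number and name below are invented for illustration, and the variable names are hypothetical. The example also hints at why stripping blanks first matters when the delimiter is not a blank.

```sas
/* Sketch only: hypothetical values */
data _null_;
    length Phone $12 AreaCode $3 FullName $30 LastName FirstName $15;

    /* Fixed position: the area code always occupies columns 1-3 */
    Phone    = '803-555-0123';
    AreaCode = substr(Phone, 1, 3);

    /* Known delimiter: a comma separates last name from first name */
    FullName  = '  Smith, Jane';
    LastName  = scan(strip(FullName), 1, ',');   /* strip first, or the leading blanks stay */
    FirstName = strip(scan(FullName, 2, ','));   /* strip the blank after the comma */

    put AreaCode= LastName= FirstName=;
run;
```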

Example of Parsing

    Data Customer2;
        LENGTH street cust_addr $20;
        FORMAT street cust_addr $20.;
        SET Customer;
        Cust_Addr = TRIM(Cust_Addr);
        Number = Scan(Cust_Addr, 1, ' ');
        Street = Scan(Cust_Addr, 2, ' ');
    run;

Note: The LENGTH and FORMAT statements fix the width of street and cust_addr at 20 characters, so later displays are not padded with long runs of trailing blanks.

Parsing Results

    Obs    cust_addr      Number    street
     1     481 OAK        481       OAK
     2     215 PETE       215       PETE
     3     48 COLLEGE     48        COLLEGE
     4     914 CHERRY     914       CHERRY
     5     519 WATSON     519       WATSON
     6     16 ELM         16        ELM
     7     108 PINE       108       PINE

Good Practice
Always print the before and after images here.
Parsing free-form text can be quite a problem. For example, apartment addresses ‘110b Elm’ and ‘110 b Elm’ will parse differently.
In this case you may have to search the second word for things that look like apartments and correct the data.
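One possible way to look for apartment-like addresses is sketched below. It reuses the Customer table and Cust_Addr column from the parsing example; the names Customer3, HouseNo, Word2, Word3, and SuspectApt, and the two rules themselves, are assumptions for illustration. The rules will also flag legitimate addresses such as '12 N Main', so the output is a review list, not an automatic correction.

```sas
/* Sketch only: the flagging rules are illustrative, not the course's method */
data Customer3;
    set Customer;
    length HouseNo Word2 Word3 $20;
    HouseNo = scan(Cust_Addr, 1, ' ');
    Word2   = scan(Cust_Addr, 2, ' ');
    Word3   = scan(Cust_Addr, 3, ' ');

    /* Rule 1: a letter inside the house number, e.g. '110b Elm'        */
    /* Rule 2: a one-letter second word followed by more text,          */
    /*         e.g. '110 b Elm'                                         */
    SuspectApt = (anyalpha(HouseNo) > 0)
                 or (lengthn(Word2) = 1 and Word3 ne ' ');
run;

/* List the rows that need a manual look */
proc print data=Customer3;
    where SuspectApt = 1;
    var Cust_Addr HouseNo Word2 Word3;
run;
```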

= SUBSTR(string, position <, length>)
Use this when you have a known position for characters.
String: character expression
Position: start position (starts with 1)
Length: number of characters to take (if omitted, takes all to the end)

    VAR = 'ABCDEFG';
    NEWVAR = SUBSTR(VAR, 2, 2);
    NEWVAR2 = SUBSTR(VAR, 4);

Result: NEWVAR = 'BC', NEWVAR2 = 'DEFG'

SUBSTR(variable, position <, length>) = new-characters
Replaces character value contents. Use this when you know where the replacement starts.
Each example below starts from a = 'KIDNAP':

    a = 'KIDNAP';
    substr(a, 1, 3) = 'CAT';

a: CATNAP

    a = 'KIDNAP';
    substr(a, 4) = 'TY';

a: KIDTY

INDEX(source, excerpt)
Searches a character expression for a string of characters. Returns the location (number) where the string begins.

    a = 'ABC.DEF (X=Y)';
    b = 'X=Y';
    x = index(a, b);

x: 10

    x = index(a, 'DEF');

x: 5

Alternative INDEX functions
INDEXC searches for the first occurrence of any one of a set of characters
INDEXW searches for a word
Syntax: INDEXW(source, excerpt <, delimiters>)
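The difference between the three functions can be sketched as follows. The first string is the one from the INDEX slide; the second string and the variable names are made up for illustration.

```sas
/* Sketch only: the second test string is hypothetical */
data _null_;
    a = 'ABC.DEF (X=Y)';
    s = 'concatenate the cat';

    p1 = index(a, 'DEF');      /* 5: start of the substring DEF */
    p2 = indexc(a, '.(=');     /* 4: first of '.', '(' or '=' is the period */
    p3 = index(s, 'cat');      /* 4: matches inside "concatenate" */
    p4 = indexw(s, 'cat');     /* 17: matches only the separate word "cat" */

    put p1= p2= p3= p4=;
run;
```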

Length
Returns the length of a character variable
The LENGTH and LENGTHN functions return the same value for non-blank character strings. LENGTH returns a value of 1 for blank character strings, whereas LENGTHN returns a value of 0.
The LENGTH function returns the length of a character string, excluding trailing blanks, whereas the LENGTHC function returns the length of a character string, including trailing blanks.
LENGTH always returns a value that is less than or equal to the value returned by LENGTHC.
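A short check of these rules, using hypothetical variables declared with a storage length of 10:

```sas
/* Sketch only: hypothetical variables */
data _null_;
    length Name Blank $10;
    Name  = 'Elm';
    Blank = ' ';

    L1 = length(Name);     /* 3: trailing blanks are not counted */
    L2 = lengthc(Name);    /* 10: the full storage length */
    L3 = length(Blank);    /* 1: LENGTH never returns 0 */
    L4 = lengthn(Blank);   /* 0: LENGTHN does */

    put L1= L2= L3= L4=;
run;
```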

Standardizing
Adjusting terms to a standard format.
Based on frequency listings.
Use functions or IF statements
–TRANWRD is easy but can produce unexpected results
–IF statements are safer, but less general

Standardization Code

    Supplier = Tranwrd(supplier, " Incorporated", "");
    If Supplier = "Trinkets & Things" then supplier = "Trinkets n' Things";

More complex logic is often needed. See the course examples.
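The kind of unexpected result TRANWRD can produce is sketched below with an invented supplier name. TRANWRD replaces every occurrence of the target string, including occurrences inside longer words.

```sas
/* Sketch only: hypothetical supplier name */
data _null_;
    length Supplier Fixed $40;
    Supplier = 'Income Tax Inc';

    /* Replaces "Inc" inside "Income" as well as the trailing "Inc" */
    Fixed = tranwrd(Supplier, 'Inc', 'Incorporated');
    put Fixed=;     /* Fixed=Incorporatedome Tax Incorporated */

    /* The IF form changes only the exact value it names */
    if Supplier = 'Income Tax Inc' then Fixed = 'Income Tax Incorporated';
    put Fixed=;     /* Fixed=Income Tax Incorporated */
run;
```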

Good Practice
It is a good idea to produce a change log for standardized changes:

    Data Products2 Changed;
        Set Products;
        SupplierOld = Supplier;
        /* ... standardization statements go here ... */
        Output Products2;
        If Trim(supplier) ^= Trim(SupplierOld) then output Changed;
    run;

    Proc Print Data=Changed;
        Var SupplierOld Supplier;
    run;

Locating Anomalies Frequency counts are a good way to identify anomalies. It is also helpful to identify standard changes that you do not have to review. Probably the safest way to execute standard changes is with a “Change Table” that lists From and To values. (Advanced SAS exercise – go for it!!)
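One way the “Change Table” idea might be approached is sketched here with PROC SQL. The Products table and Supplier column come from the earlier slides; ChangeTable, FromValue, ToValue, SupplierNew, and the second row of sample values are hypothetical. The sketch assumes each FromValue appears only once in the change table, since duplicates would multiply rows in the join.

```sas
/* Sketch only: table and column names other than Products/Supplier are assumed */
data ChangeTable;
    length FromValue ToValue $40;
    FromValue = 'Trinkets & Things'; ToValue = "Trinkets n' Things"; output;
    FromValue = 'Acme Inc';          ToValue = 'Acme Incorporated';  output;
run;

proc sql;
    /* Keep every product; use ToValue when a From match exists */
    create table Products2 as
    select p.*,
           coalescec(c.ToValue, p.Supplier) as SupplierNew length=40
    from Products as p
         left join ChangeTable as c
         on p.Supplier = c.FromValue;

    /* Change log: only the rows whose supplier was changed */
    create table Changed as
    select Supplier as SupplierOld, SupplierNew
    from Products2
    where Supplier ne SupplierNew;
quit;
```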

De Duplicating Reconcile different representations of the same entity Done after standardizing. Usually requires multi-field testing. May use probabilistic logic, depending on the application. Should produce a change log.
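A deterministic (non-probabilistic) version of multi-field de-duplication could look like the sketch below. Customer and Cust_Addr come from the parsing slides; Cust_Name, the match-key variables, and the output dataset names are assumptions. The DUPOUT= dataset serves as the change log of dropped rows.

```sas
/* Sketch only: Cust_Name and all new names are hypothetical */
data CustKeyed;
    set Customer;
    length MatchName MatchAddr $60;
    /* Build comparison keys: upper case, single-spaced, no outer blanks */
    MatchName = upcase(compbl(strip(Cust_Name)));
    MatchAddr = upcase(compbl(strip(Cust_Addr)));
run;

/* Keep the first row for each name/address pair; log the rest */
proc sort data=CustKeyed out=CustDedup dupout=CustDropped nodupkey;
    by MatchName MatchAddr;
run;

proc print data=CustDropped;
run;
```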

Correcting
Identifying and correcting values that are wrong
Very difficult to do.
Usually based on exception reports or range checks.
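A range-check exception report might be sketched as follows, using the Fitness dataset read in earlier. The limits and the names Exceptions and Reason are illustrative assumptions, not values from the course. Missing values are tested first because SAS treats a missing numeric as smaller than any number.

```sas
/* Sketch only: the limits below are made-up examples */
data Exceptions;
    set Fitness;
    length Reason $40;

    if missing(Weight) then Reason = 'Weight is missing';
    else if Weight < 80 or Weight > 400 then Reason = 'Weight outside 80-400';
    else if missing(Pulse) then Reason = 'Pulse is missing';
    else if Pulse < 30 or Pulse > 220 then Reason = 'Pulse outside 30-220';

    /* Keep only the rows that failed a check */
    if Reason ne ' ' then output;
run;

proc print data=Exceptions;
    var Name Weight Pulse Reason;
run;
```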