Data Cleaning 101
Ron Cody, Ed.D.
Robert Wood Johnson Medical School, Piscataway, NJ


Sample Data Set

   Variable Name  Description            Type       Valid Values
   PATNO          Patient Number         Character  Numerals
   GENDER         Gender                 Character  'M' or 'F'
   VISIT          Visit Date             MMDDYY10.  Any valid date
   HR             Heart Rate             Numeric    40 to 100
   SBP            Systolic Blood Pres.   Numeric    80 to 200
   DBP            Diastolic Blood Pres.  Numeric    60 to 120
   DX             Diagnosis Code         Character  1 to 3 digits
   AE             Adverse Event          Character  '0' or '1'

Using PROC FREQ to Look for Invalid Character Values

   PROC FREQ DATA=PATIENTS;
      TITLE "Frequency Counts";
      TABLES GENDER DX AE / NOCUM NOPERCENT;
   RUN;

   The FREQ Procedure

   Gender
   GENDER    Frequency
   F         12
   M         13
   X         1
   f         2
   Frequency Missing = 1
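By default PROC FREQ reports missing values only in the "Frequency Missing" footnote. As a small sketch, the MISSING option on the TABLES statement lists them as a row of their own, which can make a quick scan easier. The data set DEMO and its values are hypothetical and are not the PATIENTS data.

```sas
*Hypothetical values for illustration only;
DATA DEMO;
   INPUT GENDER $;   *With list input, a single period is read as a missing character value;
DATALINES;
F
M
X
f
.
M
;

PROC FREQ DATA=DEMO;
   TITLE "Frequency Counts Including Missing Values";
   *MISSING treats missing values as a valid category in the table;
   TABLES GENDER / MISSING NOCUM NOPERCENT;
RUN;
```

Because character comparisons are case-sensitive, 'f' and 'F' appear as separate rows, just as in the listing above.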

Using PROC PRINT with a WHERE Statement

   PROC PRINT DATA=PATIENTS;
      WHERE GENDER NOT IN ('F','M',' ') OR
            VERIFY(DX,'0123456789 ') NE 0 OR
            AE NOT IN ('0','1',' ');
      TITLE "Listing of Invalid Data";
      ID PATNO;
      VAR GENDER DX AE;
   RUN;

Using PROC PRINT with a WHERE Statement

   Listing of Invalid Data

   PATNO  GENDER  DX  AE
   002 F X X 3 1 XX5 M 1 A 010 f F X f M 1.3 0
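The DX check relies on the VERIFY function, which returns the position of the first character in its first argument that does not appear in the second argument, or 0 if every character is found there. A minimal sketch, using made-up DX values and assuming the intended list of valid characters is the ten digits plus a blank:

```sas
DATA _NULL_;
   LENGTH DX $ 3;
   *Hypothetical test values: two clean, two invalid;
   DO DX = '12', ' ', 'X', '1.3';
      *0 means every character is a digit or a blank;
      POS = VERIFY(DX,'0123456789 ');
      PUT DX= POS=;
   END;
RUN;
```

For these values POS is 0 for '12' and for the blank value, 1 for 'X', and 2 for '1.3' (the period is the first offending character). The blank in the list matters because DX is padded with trailing blanks to its full length.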

Using a Data Step to Identify Invalid Character Values

   DATA _NULL_;
      INFILE "C:PATIENTS.TXT" PAD;
      FILE PRINT; ***Send output to the output window;
      TITLE "Listing of Invalid Data";
      ***Column positions are assumed: fields contiguous, in the order of the variable table;
      INPUT @1  PATNO  $3.
            @4  GENDER $1.
            @24 DX     $3.
            @27 AE     $1.;
      ***Check GENDER;
      IF GENDER NOT IN ('F','M',' ') THEN PUT PATNO= GENDER=;
      ***Check DX;
      IF VERIFY(DX,'0123456789 ') NE 0 THEN PUT PATNO= DX=;
      ***Check AE;
      IF AE NOT IN ('0','1',' ') THEN PUT PATNO= AE=;
   RUN;

   Listing of Invalid Data

   PATNO=002 DX=X
   PATNO=003 GENDER=X
   PATNO=XX5 AE=A
   PATNO=010 GENDER=f
   PATNO=013 GENDER=2
   PATNO=002 DX=X
   PATNO=023 GENDER=f
   PATNO=987 DX=1.3

Using PROC MEANS to Look for Outliers

   PROC MEANS DATA=CLEAN.PATIENTS N NMISS MIN MAX MAXDEC=3;
      TITLE "Checking Numeric Variables";
      VAR HR SBP DBP;
   RUN;

   Checking Numeric Variables

   Variable  Label                     N  Nmiss  Minimum  Maximum
   HR        Heart Rate
   SBP       Systolic Blood Pressure
   DBP       Diastolic Blood Pressure

Using PROC UNIVARIATE with an ODS Select Statement

   ODS SELECT EXTREMEOBS;
   PROC UNIVARIATE DATA=CLEAN.PATIENTS;
      VAR HR SBP DBP;
      ID PATNO;
   RUN;

   The UNIVARIATE Procedure
   Variable: DBP (Diastolic Blood Pressure)

   Extreme Observations
   Lowest  Highest
   Value  PATNO  Obs

Using the NEXTROBS Option with PROC UNIVARIATE

   ODS SELECT EXTREMEOBS;
   PROC UNIVARIATE DATA=CLEAN.PATIENTS NEXTROBS=3;
      VAR HR SBP DBP;
      ID PATNO;
   RUN;

   Variable: DBP (Diastolic Blood Pressure)

   Extreme Observations
   Lowest  Highest
   Value  PATNO  Obs

Using a WHERE statement with PROC PRINT to list out-of-range data

   PROC PRINT DATA=CLEAN.PATIENTS;
      WHERE (HR  NOT BETWEEN 40 AND 100 AND HR  IS NOT MISSING) OR
            (SBP NOT BETWEEN 80 AND 200 AND SBP IS NOT MISSING) OR
            (DBP NOT BETWEEN 60 AND 120 AND DBP IS NOT MISSING);
      TITLE "Out-of-range Values for Numeric Variables";
      ID PATNO;
      VAR HR SBP DBP;
   RUN;

Using a WHERE statement with PROC PRINT to list out-of-range data

   Out-of-range Values for Numeric Variables

   PATNO  HR  SBP  DBP

Using a DATA _NULL_ Data Step to list out-of-range data values

   DATA _NULL_;
      INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
      FILE PRINT; ***Output to the output window;
      TITLE "Listing of Patient Numbers and Invalid Data Values";
      ***Column positions are assumed: fields contiguous, in the order of the variable table;
      INPUT @1  PATNO $3.
            @15 HR    3.
            @18 SBP   3.
            @21 DBP   3.;
      ***Check HR;
      IF (HR LT 40 AND HR NE .) OR HR GT 100 THEN PUT PATNO= HR=;
      ***Check SBP;
      IF (SBP LT 80 AND SBP NE .) OR SBP GT 200 THEN PUT PATNO= SBP=;
      ***Check DBP;
      IF (DBP LT 60 AND DBP NE .) OR DBP GT 120 THEN PUT PATNO= DBP=;
   RUN;

Using a DATA _NULL_ Data Step to list out-of-range data values

   Listing of Patient Numbers and Invalid Data Values

   PATNO=004 HR=101
   PATNO=008 HR=210
   PATNO=009 SBP=240
   PATNO=009 DBP=180
   PATNO=010 SBP=40
   PATNO=011 SBP=300
   PATNO=011 DBP=20
   PATNO=014 HR=22
   PATNO=017 HR=208
   PATNO=321 HR=900
   PATNO=321 SBP=400
   PATNO=321 DBP=200
   PATNO=020 HR=10
   PATNO=020 SBP=20
   PATNO=020 DBP=8
   PATNO=023 HR=22
   PATNO=023 SBP=34
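The range checks above test for missing values explicitly because SAS treats a numeric missing value as smaller than any number, so a bare "less than" comparison would flag every missing value as too low. A minimal sketch with a single hypothetical value:

```sas
DATA _NULL_;
   HR = .;   *Hypothetical missing heart rate;

   *A missing value compares lower than any number, so this test is true;
   IF HR LT 40 THEN PUT "Without the missing check: flagged " HR=;

   *Excluding missing values keeps them out of the error listing;
   IF HR LT 40 AND HR NE . THEN PUT "With the missing check: flagged " HR=;
   ELSE PUT "With the missing check: not flagged " HR=;
RUN;
```

The upper-bound tests (for example HR GT 100) need no such guard, since a missing value is never greater than a number.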

Using User Defined Formats to Detect Invalid Values

   PROC FORMAT;
      VALUE $GENDER 'F','M'       = 'Valid'
                    ' '           = 'Missing'
                    OTHER         = 'Miscoded';
      VALUE $DX     '001' - '999' = 'Valid'
                    ' '           = 'Missing'
                    OTHER         = 'Miscoded';
      VALUE $AE     '0','1'       = 'Valid'
                    ' '           = 'Missing'
                    OTHER         = 'Miscoded';
   RUN;

   PROC FREQ DATA=CLEAN.PATIENTS;
      TITLE "Using FORMATS";
      FORMAT GENDER $GENDER. DX $DX. AE $AE.;
      TABLES GENDER DX AE / NOCUM NOPERCENT;
   RUN;

   Gender
   GENDER      Frequency
   ---------------------
   Miscoded    4
   Valid       25
   Frequency Missing = 1
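One thing to watch with a character range such as '001' - '999': the comparison is made character by character in collating order, not numerically. The sketch below applies a format built the same way as $DX to a few made-up values. The format name $DXCHK and the test values are hypothetical.

```sas
PROC FORMAT;
   *Same structure as the $DX format, under a hypothetical name;
   VALUE $DXCHK '001' - '999' = 'Valid'
                ' '           = 'Missing'
                OTHER         = 'Miscoded';
RUN;

DATA _NULL_;
   LENGTH DX $ 3;
   DO DX = '123', '3', 'X', '1.3', ' ';
      *PUT with a character format returns the formatted label;
      RESULT = PUT(DX,$DXCHK.);
      PUT DX= RESULT=;
   END;
RUN;
```

Here 'X' is labeled Miscoded, but '1.3' is labeled Valid because it sorts between '001' and '999' as a character string. This is consistent with DX=1.3 being caught by the VERIFY check earlier but not by a format-based check.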

Using User-defined Formats and a PUT Function

   DATA _NULL_;
      INFILE "C:PATIENTS.TXT" PAD;
      FILE PRINT; ***Send output to the output window;
      TITLE "Invalid Data Values";
      ***Column positions are assumed: fields contiguous, in the order of the variable table;
      INPUT @1  PATNO  $3.
            @4  GENDER $1.
            @24 DX     $3.
            @27 AE     $1.;
      IF PUT(GENDER,$GENDER.) = 'Miscoded' THEN PUT PATNO= GENDER=;
      IF PUT(DX,$DX.) = 'Miscoded' THEN PUT PATNO= DX=;
      IF PUT(AE,$AE.) = 'Miscoded' THEN PUT PATNO= AE=;
   RUN;

   Invalid Data Values

   PATNO=002 DX=X
   PATNO=003 GENDER=X
   PATNO=004 AE=A
   PATNO=010 GENDER=f
   PATNO=013 GENDER=2
   PATNO=002 DX=X
   PATNO=023 GENDER=f

Using PROC RANK to List the Highest and Lowest n% of the Data

   %MACRO HI_LOW_P(DSN,VAR,PERCENT,IDVAR);
      ***Compute number of groups for PROC RANK;
      %LET GRP = %SYSEVALF(100 / &PERCENT,FLOOR);
      ***Value of the highest GROUP from PROC RANK, equal to the
         number of groups - 1;
      %LET TOP = %EVAL(&GRP - 1);

      PROC FORMAT;
         VALUE RNK 0='Low' &TOP='High';
      RUN;

      PROC RANK DATA=&DSN OUT=NEW GROUPS=&GRP;
         VAR &VAR;
         RANKS RANGE;
      RUN;

      ***Sort and keep top and bottom n%;
      PROC SORT DATA=NEW (WHERE=(RANGE IN (0,&TOP)));
         BY &VAR;
      RUN;

      ***Produce the report;
      PROC PRINT DATA=NEW;
         TITLE "Upper and Lower &PERCENT.% Values for %UPCASE(&VAR)";
         ID &IDVAR;
         VAR RANGE &VAR;
         FORMAT RANGE RNK.;
      RUN;

      PROC DATASETS LIBRARY=WORK NOLIST;
         DELETE NEW;
      RUN;
      QUIT;
   %MEND HI_LOW_P;

Using PROC RANK to List the Highest and Lowest n% of the Data

   %HI_LOW_P(CLEAN.PATIENTS,SBP,10,PATNO)

   Upper and Lower 10% Values for SBP

   PATNO  RANGE  SBP
   020 Low Low High High 400
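The macro depends on how PROC RANK numbers its groups: with GROUPS=n, the group values run from 0 to n-1, which is why the highest group is computed as the number of groups minus 1. A small sketch on made-up data shows the numbering. The data set names DEMO and RANKED and the variable names are hypothetical.

```sas
*Twenty hypothetical values, 1 through 20;
DATA DEMO;
   DO X = 1 TO 20;
      OUTPUT;
   END;
RUN;

*Ten groups correspond to 10% slices, as in a call with PERCENT=10;
PROC RANK DATA=DEMO OUT=RANKED GROUPS=10;
   VAR X;
   RANKS GRP;
RUN;

*Show which group numbers were assigned;
PROC FREQ DATA=RANKED;
   TITLE "Group Numbers Assigned by PROC RANK";
   TABLES GRP / NOCUM NOPERCENT;
RUN;
```

With ten groups, the lowest 10% of values fall in group 0 and the highest 10% in group 9.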

Detecting Outliers Based on the Standard Deviation

   ***Output mean and standard deviations to a data set;
   proc means data=clean.patients noprint;
      var hr;
      output out=means(drop=_type_ _freq_)
             mean=m_hr
             std=s_hr;
   run;

Data set MEANS contains one observation:

   Listing of Data Set MEANS
   m_hr  s_hr

Detecting Outliers Based on the Standard Deviation

   %let n_sd = 2;
   ***With roughly normal data, about 5% of values lie more than
      two standard deviations from the mean;

   data _null_;
      file print;
      title "Statistics for Numeric Variables";
      set clean.patients;
      ***Bring in the mean and standard deviation once; the values are retained;
      if _n_ = 1 then set means;
      if hr lt m_hr - &n_sd*s_hr and hr ne . or
         hr gt m_hr + &n_sd*s_hr then put patno= hr=;
   run;

Detecting Outliers Based on the Standard Deviation

   Statistics for Numeric Variables

   PATNO=321 HR=900

Detecting Outliers Based on Trimmed Data

A trimmed mean is a mean computed by first removing some of the highest and lowest values before doing the calculation.

   proc rank data=clean.patients out=tmp groups=4;
      var hr;
      ranks r_hr;
   run;

   proc means data=tmp noprint;
      where r_hr in (1,2); ***The middle 50%;
      var hr;
      output out=means(drop=_type_ _freq_)
             mean=m_hr
             std=s_hr;
   run;

Detecting Outliers Based on Trimmed Data

   %let n_sd = 2;

   data _null_;
      title "Outliers Based on Trimmed Data";
      file print;
      set clean.patients;
      if _n_ = 1 then set means;
      if hr lt m_hr - &n_sd*2.63*s_hr and hr ne . or
         hr gt m_hr + &n_sd*2.63*s_hr then put patno= hr=;
   run;

Detecting Outliers Based on Trimmed Data

   Outliers Based on Trimmed Data

   PATNO=008 HR=210
   PATNO=014 HR=22
   PATNO=017 HR=208
   PATNO=321 HR=900
   PATNO=020 HR=10
   PATNO=023 HR=22

Detecting Outliers Based on Trimmed Data

Program to Trim the Top and Bottom 5% of the Data

   proc rank data=clean.patients out=tmp groups=20;
      var hr;
      ranks r_hr;
   run;

   proc means data=tmp noprint;
      where r_hr not in (0,19); *The middle 90%;
      var hr;
      output out=means(drop=_type_ _freq_)
             mean=m_hr
             std=s_hr;
   run;
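For a quick look at a trimmed mean without the PROC RANK step, PROC UNIVARIATE offers a TRIMMED= option. The sketch below is an alternative to the programs above, not a replacement for them: it reports a trimmed mean but does not produce the trimmed standard deviation that the outlier check uses. The data set DEMO and its heart rate values are hypothetical.

```sas
*Ten hypothetical heart rates, including two wild values;
DATA DEMO;
   INPUT HR @@;
DATALINES;
68 72 70 74 66 71 69 73 900 10
;

*TRIMMED=0.1 removes 10% of the observations from each tail
 (one value from each end here) before computing the mean;
PROC UNIVARIATE DATA=DEMO TRIMMED=0.1;
   TITLE "Trimmed Mean of HR";
   VAR HR;
RUN;
```

With these values the untrimmed mean is pulled far upward by the 900, while the trimmed mean is based only on the eight middle values.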

Defining Interquartile Range

[Figure: box plot of Diastolic Blood Pressure (DBP) marking the median, Q1 (lower hinge), Q3 (upper hinge), the IQR, and the outliers beyond a multiple of the IQR from the hinges]

Outliers Based on the Interquartile Range

   %MACRO INTER_Q(DSN,VAR,IDVAR,N_IQR);
      PROC MEANS DATA=&DSN NOPRINT;
         VAR &VAR;
         OUTPUT OUT=TMP Q3=UPPER Q1=LOWER QRANGE=IQR;
      RUN;

      DATA _NULL_;
         TITLE "Outliers Based on &N_IQR Interquartile Ranges";
         FILE PRINT;
         SET &DSN;
         IF _N_ = 1 THEN SET TMP;
         IF &VAR LT LOWER - &N_IQR*IQR AND &VAR NE . OR
            &VAR GT UPPER + &N_IQR*IQR THEN PUT &IDVAR= &VAR=;
      RUN;

      PROC DATASETS LIBRARY=WORK NOLIST;
         DELETE TMP;
      RUN;
      QUIT;
   %MEND INTER_Q;

Outliers Based on the Interquartile Range

   %INTER_Q(CLEAN.PATIENTS,SBP,PATNO,2);

   Outliers Based on 2 Interquartile Ranges

   PATNO=011 SBP=300
   PATNO=321 SBP=400

Using Perl Regular Expressions For Data Cleaning

Some Perl Regular Expression Examples

   Regular Expression  Matches
   /cat/               the letters "cat"
   /cat*/              the letters "ca" followed by zero or more "t"s
   /cat+/              the letters "ca" followed by one or more "t"s
   /c[aeiou]t/         a "c" followed by a vowel followed by the letter "t"
   /\d\d/              any two digits
   /\d\d+/             two or more digits

PRXPARSE

Syntax

   RE = PRXPARSE("expression");

   RE            return code
   "expression"  Perl regular expression

Examples

   RETURN = PRXPARSE("/cat/");
   RE = PRXPARSE("/\d\d+/");

PRXMATCH

Syntax

   POS = PRXMATCH(return,string);

   POS     position of the beginning of the pattern; if not found, returns a zero
   return  return code from the PRXPARSE function
   string  text string

Examples

   POS = PRXMATCH(RE,STRING);

   RETURN = PRXPARSE("/cat/");
   P = PRXMATCH(RETURN,"This is a cat");

   Value of P is 11
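To see how the quantifiers and character classes in the table of examples behave, a short data step can run several patterns against a handful of strings and print the match positions. This is only a sketch: the test strings are made up, and the patterns are passed to PRXMATCH directly as constants rather than through PRXPARSE.

```sas
DATA _NULL_;
   LENGTH STRING $ 12;
   *Hypothetical test strings;
   DO STRING = 'cat', 'ca', 'catt', 'cot', 'a1b', 'a12b';
      P1 = PRXMATCH("/cat*/",STRING);       *"ca" plus zero or more "t"s;
      P2 = PRXMATCH("/cat+/",STRING);       *"ca" plus one or more "t"s;
      P3 = PRXMATCH("/c[aeiou]t/",STRING);  *"c", a vowel, then "t";
      P4 = PRXMATCH("/\d\d/",STRING);       *any two adjacent digits;
      PUT STRING= P1= P2= P3= P4=;
   END;
RUN;
```

A zero means no match. For example, 'ca' matches /cat*/ but not /cat+/, 'cot' matches only /c[aeiou]t/, and 'a1b' does not match /\d\d/ because its single digit has no digit next to it.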

A Simple Example: Locating a SS Number

   DATA FIND_SS;
      IF _N_ = 1 THEN RETURN = PRXPARSE("/\d\d\d-\d\d-\d\d\d\d/");
      RETAIN RETURN;
      INPUT STRING $30.;
      POSITION = PRXMATCH(RETURN,STRING);
      IF POSITION GT 0 THEN OUTPUT;
   DATALINES;
   none on this line
   yes! is one
   two
   ;

   RETURN  STRING       POSITION
   1       yes! is one  6
   1       two

Using PRXMATCH without using PRXPARSE (version 9.1)

   DATA FIND_SS;
      INPUT STRING $30.;
      POSITION = PRXMATCH("/\d\d\d-\d\d-\d{4}/",STRING);
      IF POSITION GT 0 THEN OUTPUT;
   DATALINES;
   none on this line
   yes! is one
   two
   ;

   STRING       POSITION
   yes! is one  6
   two

Using Perl Regular Expressions for Data Cleaning

   DATA BAD_DATA;
      SET CLEAN.PATIENTS;
      IF PRXMATCH("/\d |\d\d |\d{3}/",DX) EQ 0 AND NOT MISSING(DX);
   RUN;

   Listing of data set BAD_DATA

   PATNO  DX
   002    X

Some Regular Expression Solutions

   DATA BAD_OBS;
      LENGTH ID $ 5;
      INPUT ID;
      *Valid ID's are X, Y, or Z followed by one or more digits;
      IF NOT PRXMATCH("/^[XYZ]\d+/",ID);
   DATALINES;
   X12
   C334
   Y777
   78Z
   999
   ;

   Listing of data set BAD_OBS

   ID
   C334
   78Z
   999
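When writing a pattern for "X, Y, or Z followed by digits", note that alternation (|) binds more loosely than anchors and concatenation. A pattern written as /^X|Y|Z\d+/ therefore means "X at the start, or a Y anywhere, or a Z followed by digits", whereas the character class form /^[XYZ]\d+/ requires one of the three letters at the start followed by at least one digit. The sketch below compares the two on made-up IDs. The variable names LOOSE and STRICT are hypothetical.

```sas
DATA _NULL_;
   LENGTH ID $ 5;
   *Hypothetical IDs: two valid, three invalid;
   DO ID = 'X12', 'Z9', 'XABC', 'AY', 'C334';
      *1 if the pattern matches, 0 otherwise;
      LOOSE  = PRXMATCH("/^X|Y|Z\d+/",ID) GT 0;
      STRICT = PRXMATCH("/^[XYZ]\d+/",ID) GT 0;
      PUT ID= LOOSE= STRICT=;
   END;
RUN;
```

Both patterns accept 'X12' and 'Z9' and reject 'C334', but only the alternation form accepts the invalid IDs 'XABC' and 'AY'. Neither pattern anchors the end of the string, so an ID with extra characters after the digits would still pass.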

Penn State SAS Users Group

Open to all SAS users in the central PA area. Visit our website for more information and sign up for our listserv. sug/