1
The Missing Link: A Robust Macro for Recoding General Missing Data Values
Hunter Glanz & Josh Horstman
2
Bio
Hunter Glanz is an assistant professor of statistics and data science at Cal Poly in San Luis Obispo. He teaches primarily computing and data science courses using R, SAS, and Python. He loves programming and teaching others to program.
Josh Horstman is an independent statistical programming consultant and trainer based in Indianapolis with 20 years’ experience using SAS in the life sciences industry. He specializes in analyzing clinical trial data, and his clients have included major pharmaceutical corporations, biotech companies, and research organizations. He loves coding and is a SAS Certified Advanced Programmer. Josh enjoys presenting at SAS Global Forum and other SAS User Group meetings.
3
Motivation
A hypothetical missing data situation with a non-SAS dataset…
  Known incomplete variables
  Known missing value code
  Nice missing value code(s)
Alas, data are not always so tidy.
4
The Reality
At least one of the following is often true:
  Location and abundance of missing values are unknown
  Number of variables with missing values is not small
  Missing value code(s) not nice, but still known
The trouble with these:
  Potentially heavy data cleaning/processing
  Potential for lengthy, less reproducible conversions
  Potential for user-determined data conversions
5
A Short Review of Missing Values in SAS
Basics
Missing character value:
  Blank
Missing numeric value:
  Period (.)
  Special missing values .A through .Z
  Underscore (._)
  Blank (except with LIST input)
Unofficially, SAS almost always accepts ASCII code 0 and ASCII code 255 as missing as well.
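The missing-value types listed above can be checked with a short DATA step. This is only a sketch: the variable names are made up for illustration, and it relies on the MISSING function treating blank character values and all numeric missing values (., ._, .A through .Z) as missing.

```sas
/* Sketch only: variable names are hypothetical. */
data _null_;
  length c $8;
  c  = ' ';   /* missing character value: blank           */
  n1 = .;     /* standard numeric missing value: period   */
  n2 = .A;    /* special missing value (.A through .Z)    */
  n3 = ._;    /* special missing value: underscore        */

  /* MISSING() returns 1 for each of the values above */
  all_missing = missing(c) and missing(n1) and missing(n2) and missing(n3);
  put all_missing=;   /* writes all_missing=1 to the log */
run;
```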
6
Back To Reality: SF Salary Data
7
Robust Recoding Macro: %missfix
Goals:
  Recode missing values to SAS missing values in a dataset that contained a non-numeric missing value code
  Create numeric variables with converted values
  Remove original “mistaken” character variables
  Preserve variable names
  Data-driven (user- and application-independent)
  Reproducible
  Robust (not limited to non-numeric missing value codes)
8
Desired Result for SF Salary Data
9
The Macro: %missfix
10
Macro Code Part 1: Input Parameters & Validation
Three macro parameters

    %macro missfix(
       dsetin  = /* Name of input dataset */
      ,dsetout = /* Name of output dataset */
      ,missval = /* String to treat as missing (case insensitive) */
      );

      %IF %INDEX(&dsetin,.) %THEN %DO;
        %LET mylib = %SCAN(&dsetin,1,.);
        %LET myds  = %SCAN(&dsetin,2,.);
      %END;
      %ELSE %DO;
        %LET mylib = WORK;
        %LET myds  = &dsetin;
      %END;

Split library and dataset names into two macro variables for later use. If no library name is supplied, assume WORK.
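To see the library/dataset split in isolation, the same logic can be wrapped in a small stand-alone macro. The macro name %splitname and the %PUT message are hypothetical and not part of %missfix; this is a sketch for experimenting with one-level and two-level names.

```sas
/* Hypothetical helper, for illustration only: not part of %missfix. */
%macro splitname(dsetin=);
  %local mylib myds;
  %if %index(&dsetin,.) %then %do;
    /* Two-level name: text before the period is the library */
    %let mylib = %scan(&dsetin,1,.);
    %let myds  = %scan(&dsetin,2,.);
  %end;
  %else %do;
    /* One-level name: assume the WORK library */
    %let mylib = WORK;
    %let myds  = &dsetin;
  %end;
  %put NOTE: mylib=&mylib myds=&myds;
%mend splitname;

%splitname(dsetin=sashelp.class)   /* two-level name */
%splitname(dsetin=class)           /* one-level name, falls back to WORK */
```

Note that this simple split assumes the dataset reference contains no dataset options, since a period inside an option value would also be picked up by %INDEX and %SCAN.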
11
Macro Code Part 2: Learning About The Data Set
Create macro variables: number of variables, list of original variable names, list of "new" variable names

    proc sql noprint;
      select count(*), name, "new"||name
        into :varcnt trimmed,
             :varlst separated by ' ',
             :newvarlst separated by ' '
        from dictionary.columns
        where memname="%UPCASE(&myds)"
          and libname="%UPCASE(&mylib)"
          and type="char";
    quit;

Access COLUMNS dictionary to get information about variables.
Use the macro variables created in the previous step.
Only want character variables.
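The dictionary query can be tried on its own against a dataset that ships with SAS. The sketch below assumes SASHELP.CLASS is available and only collects the name lists (the count is left out to keep the example short); the macro variable names mirror those used in the macro.

```sas
/* Sketch: query the COLUMNS dictionary for the character variables
   of SASHELP.CLASS (assumed to be installed). */
proc sql noprint;
  select name, "new"||name
    into :varlst    separated by ' ',
         :newvarlst separated by ' '
    from dictionary.columns
    where libname = "SASHELP"     /* LIBNAME and MEMNAME are stored in uppercase */
      and memname = "CLASS"
      and type    = "char";
quit;

/* Show the resulting lists in the log */
%put &=varlst;
%put &=newvarlst;
```

For SASHELP.CLASS the lists should contain the two character variables, Name and Sex, and their "new"-prefixed counterparts.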
12
Examining the Macro Variables
    %put &varcnt;
    %put &varlst;
    %put &newvarlst;

Log output:

    9
    VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR11 VAR12 VAR13
    newVAR2 newVAR3 newVAR4 newVAR5 newVAR6 newVAR7 newVAR11 newVAR12 newVAR13
13
Macro Code Part 3: Main DATA Step - Setup
Create temporary dataset

    data _tmpfixed;
      set &dsetin end=eof;
      length dropvarlst $32767 renamelst $32767;
      array vars (&varcnt) $ &varlst;
      array newvars (&varcnt) &newvarlst;
      array dropflag(&varcnt);
      retain dropflag:;

Read DSETIN and set end-of-file flag.
Create arrays of our original and new variables.
Create an array of flags to track which variables to drop.
14
Macro Code Part 4: Main DATA Step – Variable Loop
Loop through each character variable.

      do i = 1 to dim(dropflag);
        num = input(vars[i], ?? best32.);
        IsNum = (not missing(num)) or strip(vars[i]) in ('','.')
                or (.A le num le .Z) or (num eq ._);
        if upcase(vars[i]) = "%UPCASE(&missval)" then
          call missing(vars[i], newvars[i]);
        else do;
          dropflag[i] = max(ifn(IsNum,1,2), dropflag[i]);
          newvars[i] = num;
        end;
      end;

Test whether an attempted numeric conversion results in a valid number.
Check for our designated missing string.
Set dropflag to 1 if the original character variable should be dropped, 2 if the new variable should be dropped. Because the flag is updated with MAX and retained across records, a single non-numeric value anywhere in the column leaves the flag at 2.
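The conversion test can be tried on a handful of values outside the macro. This is a simplified sketch: the variable name raw and the sample values are made up, and the special-missing checks (.A through .Z, ._) are left out for brevity.

```sas
/* Sketch: apply the numeric-conversion test to a few sample strings. */
data _null_;
  length raw $12;
  do raw = '42', '3.5', 'N/A', '.', ' ', 'abc';
    /* ?? suppresses the invalid-data note and keeps _ERROR_ at 0 */
    num   = input(raw, ?? best32.);
    /* Numeric if it converts, or if it is blank or a lone period */
    IsNum = (not missing(num)) or strip(raw) in ('', '.');
    put raw= num= IsNum=;
  end;
run;
```

Here '42', '3.5', '.', and the blank should be flagged with IsNum=1, while 'N/A' and 'abc' should get IsNum=0.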
15
Macro Code Part 5: Main DATA Step – Set Macro Variables
      if eof then do;
        do i = 1 to dim(dropflag);
          if not missing(dropflag[i]) then
            dropvarlst = catx(' ', dropvarlst,
                              choosec(dropflag[i], vname(vars[i]), vname(newvars[i])));
          if dropflag[i]=1 then
            renamelst = catx(' ', renamelst,
                             catx('=', vname(newvars[i]), vname(vars[i])));
        end;
        call symputx('dropvarlst', dropvarlst);
        call symputx('renamelst', renamelst);
      end;
      drop dropvarlst renamelst dropflag: i num IsNum;
    run;

This block runs only once, when the last record is processed.
The drop and rename lists are constructed based on the values of dropflag.
Values copied to macro variables.
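How CHOOSEC and CATX assemble the two lists can be seen with hard-coded flags. In this sketch the arrays, the variable names VAR1 to VAR3, and the macro variable names demo_drop and demo_rename are all hypothetical; the flags follow the convention above (1 = drop original, 2 = drop new, missing = no decision).

```sas
/* Sketch: build drop and rename lists from hard-coded flags. */
data _null_;
  length dropvarlst renamelst $200;
  array flag[3] _temporary_ (1 2 .);
  array orig[3] $8 _temporary_ ('VAR1' 'VAR2' 'VAR3');
  do i = 1 to dim(flag);
    /* CHOOSEC picks its 1st choice when flag=1, its 2nd when flag=2 */
    if not missing(flag[i]) then
      dropvarlst = catx(' ', dropvarlst,
                        choosec(flag[i], orig[i], cats('new', orig[i])));
    /* Only converted variables are renamed back to the original name */
    if flag[i] = 1 then
      renamelst = catx(' ', renamelst,
                       catx('=', cats('new', orig[i]), orig[i]));
  end;
  call symputx('demo_drop', dropvarlst);
  call symputx('demo_rename', renamelst);
run;

%put &=demo_drop;     /* DEMO_DROP=VAR1 newVAR2     */
%put &=demo_rename;   /* DEMO_RENAME=newVAR1=VAR1   */
```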
16
Examining the Macro Variables
    %put &dropvarlst;
    %put &renamelst;

Log output:

    newVAR2 newVAR3 VAR4 VAR5 VAR6 VAR7 VAR11 newVAR12 newVAR13
    newVAR4=VAR4 newVAR5=VAR5 newVAR6=VAR6 newVAR7=VAR7 newVAR11=VAR11
17
Macro Code Part 6: Create Output Dataset & Clean Up
    data &dsetout;
      set _tmpfixed;
      %IF %bquote(&dropvarlst) ne %THEN drop &dropvarlst;;
      %IF %bquote(&renamelst) ne %THEN rename &renamelst;;
    run;

    proc delete data=_tmpfixed;
    run;

    %mend missfix;

Drop and/or rename variables from the temporary dataset to create an output dataset.
Remove the temporary dataset.
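A call to the macro might look like the following sketch. It assumes %missfix has already been compiled in the session; the dataset names, the variables, and the missing-value string "Not Provided" are hypothetical test data, not taken from the slides.

```sas
/* Hypothetical test data: every variable except ID is read as character
   because of the text code 'Not Provided'. */
data work.salary_raw;
  length id 8 jobtitle $20 basepay $12 overtime $12;
  infile datalines dsd;
  input id jobtitle $ basepay $ overtime $;
  datalines;
1,Clerk,52000.50,Not Provided
2,Engineer,Not Provided,1200
3,Nurse,88000,0
;
run;

/* Assumes the %missfix macro shown above has been compiled */
%missfix(dsetin=work.salary_raw, dsetout=work.salary_fixed, missval=Not Provided)

/* Inspect the variable types in the result */
proc contents data=work.salary_fixed;
run;
```

Under these assumptions, basepay and overtime should come back as numeric variables under their original names with the coded values set to missing, while jobtitle should remain character because it contains values that do not convert to numbers.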
18
Conclusion
  Extremely widely applicable macro for recoding missing values
  Expedites data pre-processing and cleaning
  Easy to implement
  Robust
  Data-driven
19
Thank You! Questions?
20
Contact Information
Name: Hunter Glanz
Company: Cal Poly
City/State: San Luis Obispo, CA
Phone:
21
Contact Information
Name: Josh Horstman
Company: Nested Loop Consulting LLC
City/State: Indianapolis, IN
Phone: