Hunter Glanz & Josh Horstman

The Missing Link: A Robust Macro for Recoding General Missing Data Values
Hunter Glanz & Josh Horstman

Bio
Hunter Glanz is an assistant professor of statistics and data science at Cal Poly in San Luis Obispo. He teaches primarily computing and data science courses using R, SAS, and Python. He loves to program and to teach others how to program as well!
Josh Horstman is an independent statistical programming consultant and trainer based in Indianapolis with 20 years’ experience using SAS in the life sciences industry. He specializes in analyzing clinical trial data, and his clients have included major pharmaceutical corporations, biotech companies, and research organizations. He loves coding and is a SAS Certified Advanced Programmer. Josh enjoys presenting at SAS Global Forum and other SAS User Group meetings.

Motivation
A hypothetical missing-data situation with a non-SAS dataset:
- Known incomplete variables
- Known missing value code
- Nice missing value code(s)
Alas, data are not always so tidy.

The Reality
At least one of the following is often true:
- Location and abundance of missing values are unknown
- Number of variables with missing values is not small
- Missing value code(s) not nice, but still known
The trouble with these:
- Potentially heavy data cleaning/processing
- Potential for lengthy, less reproducible conversions
- Potential for user-determined data conversions

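A Short Review of Missing Values in SAS
Basics:
- Missing character value: blank
- Missing numeric value: period (.)
- Special numeric missing values: .A through .Z, and underscore (._)
- A blank numeric field in raw data is also read as missing (except with LIST input, where blanks act as delimiters)
Unofficially, SAS almost always accepts ASCII code 0 and ASCII code 255 as missing as well.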
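The documented missing values in the review above can be checked with a short DATA step. This is a minimal sketch, not taken from the slides; the variable names are illustrative, and it does not test the unofficial ASCII 0 / ASCII 255 behavior.

```sas
/* Sketch: the documented kinds of missing values in SAS. */
data _null_;
    length c $8;
    c = ' ';        /* missing character value: blank          */
    n = .;          /* ordinary numeric missing: period        */
    a = .A;         /* special missing values run .A to .Z     */
    u = ._;         /* special missing value: underscore       */

    /* MISSING() returns 1 for every one of these */
    m_c = missing(c);
    m_n = missing(n);
    m_a = missing(a);
    m_u = missing(u);
    put m_c= m_n= m_a= m_u=;

    /* Numeric missings sort as ._ < . < .A < ... < .Z */
    ordered = (u < n) and (n < a);
    put ordered=;
run;
```

All five flags should print as 1. The sort order matters later in the macro, where the test `.A le num le .Z` is used to recognize special missing values.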

Back To Reality: SF Salary Data

Robust Recoding Macro: %missfix
Goals:
- Recode missing values to SAS missing values in a dataset that contained a non-numeric missing value code
- Create numeric variables with converted values
- Remove original “mistaken” character variables
- Preserve variable names
- Data-driven (user- and application-independent)
- Reproducible
- Robust (not limited to non-numeric missing value codes)

Desired Result for SF Salary Data

The Macro: %missfix

Macro Code Part 1: Input Parameters & Validation
Three macro parameters:

    %macro missfix(
         dsetin  =   /* Name of input dataset */
        ,dsetout =   /* Name of output dataset */
        ,missval =   /* String to treat as missing (case insensitive) */
        );

        %IF %INDEX(&dsetin,.) %THEN %DO;
            %LET mylib = %SCAN(&dsetin,1,.);
            %LET myds  = %SCAN(&dsetin,2,.);
        %END;
        %ELSE %DO;
            %LET mylib = WORK;
            %LET myds  = &dsetin;
        %END;

Split library and dataset names into two macro variables for later use. If no library name is supplied, assume WORK.
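To see the library/dataset split in isolation, the same %INDEX and %SCAN logic can be wrapped in a small throwaway macro. `%splitname` is a hypothetical helper written for this illustration; it is not part of %missfix.

```sas
/* Hypothetical helper, for illustration only: not part of %missfix. */
%macro splitname(dsetin=);
    %local mylib myds;
    %if %index(&dsetin,.) %then %do;
        /* Two-level name: text before the period is the library */
        %let mylib = %scan(&dsetin,1,.);
        %let myds  = %scan(&dsetin,2,.);
    %end;
    %else %do;
        /* One-level name: assume the WORK library */
        %let mylib = WORK;
        %let myds  = &dsetin;
    %end;
    %put NOTE: &dsetin resolves to library=&mylib dataset=&myds;
%mend splitname;

%splitname(dsetin=sashelp.class)
%splitname(dsetin=class)
```

The first call should report library=sashelp and dataset=class; the second should fall through to the %ELSE branch and report library=WORK.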

Macro Code Part 2: Learning About The Data Set
Create macro variables: number of variables, list of original variable names, list of "new" variable names.

    proc sql noprint;
        select count(*), name, "new"||name
            into :varcnt trimmed,
                 :varlst separated by ' ',
                 :newvarlst separated by ' '
            from dictionary.columns
            where memname="%UPCASE(&myds)"
              and libname="%UPCASE(&mylib)"
              and type="char";
    quit;

Access the COLUMNS dictionary table to get information about the variables. The WHERE clause uses the macro variables created in the previous step and keeps only character variables.
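The dictionary query can be tried on its own against a dataset that ships with SAS. The sketch below assumes SASHELP.CLASS is available, and it takes the count from the automatic macro variable SQLOBS rather than from `count(*)`, which is a substitution made here for simplicity and is not what the slides do.

```sas
/* Sketch: list the character variables of SASHELP.CLASS. */
proc sql noprint;
    select name, cats('new', name)
        into :varlst separated by ' ',
             :newvarlst separated by ' '
        from dictionary.columns
        where libname = 'SASHELP'     /* stored in upper case */
          and memname = 'CLASS'       /* stored in upper case */
          and type    = 'char'
        order by varnum;
quit;

/* SQLOBS holds the number of rows the last SELECT returned */
%let varcnt = &sqlobs;

%put &=varcnt &=varlst &=newvarlst;
```

On a standard installation SASHELP.CLASS has two character variables, Name and Sex, so the log should show a count of 2 and the two matching name lists.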

Examining the Macro Variables

    %put &varcnt;
    %put &varlst;
    %put &newvarlst;

Log output:

    9
    VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR11 VAR12 VAR13
    newVAR2 newVAR3 newVAR4 newVAR5 newVAR6 newVAR7 newVAR11 newVAR12 newVAR13

Macro Code Part 3: Main DATA Step - Setup
Create a temporary dataset.

    data _tmpfixed;
        set &dsetin end=eof;
        length dropvarlst $32767 renamelst $32767;
        array vars (&varcnt) $ &varlst;
        array newvars (&varcnt) &newvarlst;
        array dropflag(&varcnt);
        retain dropflag:;

Read DSETIN and set an end-of-file flag. Create arrays of the original and new variables. Create an array of flags, retained across observations, to track which variables to drop.

Macro Code Part 4: Main DATA Step – Variable Loop
Loop through each character variable.

        do i = 1 to dim(dropflag);
            num = input(vars[i],?? best32.);
            IsNum = (not missing(num)) or strip(vars[i]) in ('','.')
                    or (.A le num le .Z) or (num eq ._);
            if upcase(vars[i]) = "%UPCASE(&missval)"
                then call missing(vars[i],newvars[i]);
            else do;
                dropflag[i] = max(ifn(IsNum,1,2),dropflag[i]);
                newvars[i] = num;
            end;
        end;

Test whether an attempted numeric conversion results in a valid number. Check for the designated missing string and, if found, set both the original and the new variable to missing. Otherwise set dropflag to 1 if the original character variable should be dropped, or 2 if the new variable should be dropped. Because MAX is used, a single non-numeric value is enough to keep the flag at 2 for the rest of the data.
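The conversion test can be exercised on a handful of literal strings. This is a sketch only; `raw` is an illustrative variable name and 'Not Provided' is a hypothetical missing-value code chosen for the example.

```sas
/* Sketch: the IsNum test applied to a few sample strings. */
data _null_;
    length raw $20;
    do raw = '12.5', 'Not Provided', '.', ' ';
        /* ?? suppresses the invalid-data note and keeps _ERROR_ at 0 */
        num   = input(raw, ?? best32.);
        IsNum = (not missing(num)) or strip(raw) in ('', '.')
                or (.A le num le .Z) or (num eq ._);
        put raw= num= IsNum=;
    end;
run;
```

Of these four values, only 'Not Provided' should come back with IsNum=0: it fails the conversion and is neither blank nor a period. In %missfix that string would be caught by the missing-code comparison before it could set the flag to 2.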

Macro Code Part 5: Main DATA Step – Set Macro Variables

        if eof then do;
            do i = 1 to dim(dropflag);
                if not missing(dropflag[i]) then
                    dropvarlst = catx(' ', dropvarlst,
                        choosec(dropflag[i], vname(vars[i]), vname(newvars[i])));
                if dropflag[i]=1 then
                    renamelst = catx(' ', renamelst,
                        catx('=',vname(newvars[i]), vname(vars[i])));
            end;
            call symputx('dropvarlst', dropvarlst);
            call symputx('renamelst', renamelst);
        end;
        drop dropvarlst renamelst dropflag: i num IsNum;
    run;

The IF EOF block runs only once, after the last record is read. The drop and rename lists are constructed based on the values of dropflag, and the results are copied to macro variables with CALL SYMPUTX.
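How CHOOSEC and CATX turn the flags into drop and rename lists is easier to follow with fixed inputs. The sketch below hard-codes three flags and three hypothetical variable names instead of reading them from a dataset with VNAME.

```sas
/* Sketch: build drop and rename lists from hard-coded flags. */
data _null_;
    length dropvarlst renamelst $200;
    /* 1 = all values numeric, 2 = some text seen, . = flag never set */
    array flag[3] _temporary_ (1 2 .);
    array oldname[3] $32 _temporary_ ('VAR1' 'VAR2' 'VAR3');

    do i = 1 to dim(flag);
        newname = cats('new', oldname[i]);
        /* CHOOSEC picks its 1st item for flag 1, its 2nd for flag 2 */
        if not missing(flag[i]) then
            dropvarlst = catx(' ', dropvarlst,
                              choosec(flag[i], oldname[i], newname));
        /* Only fully numeric variables get their new copy renamed back */
        if flag[i] = 1 then
            renamelst = catx(' ', renamelst,
                             catx('=', newname, oldname[i]));
    end;
    put dropvarlst= renamelst=;
run;
```

With these inputs the drop list should be `VAR1 newVAR2` and the rename list `newVAR1=VAR1`. The third variable, whose flag was never set, appears in neither list.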

Examining the Macro Variables

    %put &dropvarlst;
    %put &renamelst;

Log output:

    newVAR2 newVAR3 VAR4 VAR5 VAR6 VAR7 VAR11 newVAR12 newVAR13
    newVAR4=VAR4 newVAR5=VAR5 newVAR6=VAR6 newVAR7=VAR7 newVAR11=VAR11

Macro Code Part 6: Create Output Dataset & Clean Up

    data &dsetout;
        set _tmpfixed;
        %IF %bquote(&dropvarlst) ne %THEN drop &dropvarlst;;
        %IF %bquote(&renamelst) ne %THEN rename &renamelst;;
    run;

    proc delete data=_tmpfixed;
    run;

    %mend missfix;

Drop and/or rename variables from the temporary dataset to create the output dataset. Remove the temporary dataset.
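A call to the finished macro might look like the following sketch. It assumes %missfix has already been compiled in the session; the dataset `work.demo`, its variables, and the missing-value code `N/A` are all hypothetical test inputs, not the SF salary data from the slides.

```sas
/* Hypothetical test data: "N/A" stands in for the missing-value code. */
data work.demo;
    length id $4 salary $12 dept $12;
    input id $ salary $ dept $;
    datalines;
A001 52000 Police
A002 N/A Fire
A003 61000.5 N/A
;
run;

/* Recode "N/A" and convert all-numeric character variables */
%missfix(dsetin=work.demo, dsetout=work.demo_fixed, missval=N/A)

/* Inspect variable types and values in the result */
proc contents data=work.demo_fixed; run;
proc print data=work.demo_fixed; run;
```

If the macro behaves as the slides describe, `salary` should come back as a numeric variable with a missing value in the second row, while `id` and `dept` should stay character, with the `N/A` in `dept` replaced by a blank.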

Conclusion
- Widely applicable macro for recoding missing values
- Expedites data pre-processing and cleaning
- Easy to implement
- Robust
- Data-driven

Thank You! Questions?

Contact Information
Name: Hunter Glanz
Company: Cal Poly
City/State: San Luis Obispo, CA
Phone: 619-246-1439
Email: hglanz@calpoly.edu

Contact Information
Name: Josh Horstman
Company: Nested Loop Consulting LLC
City/State: Indianapolis, IN
Phone: 317-721-1009
Email: josh@nestedloopconsulting.com