Using Multiple SET Statements to Combine and Analyze Data


William O’Brien, MS
VA Boston HCS – Center for Organization, Leadership and Management Research
December 21, 2011
VA SAS Users Group Presentation

What we’ll cover

- What does the SET statement actually do?
- A simple example of using multiple SET statements to combine and analyze data from two sources.
- Efficiency issues and pitfalls.
- More realistic examples of when you might use multiple SET statements.

Single named dataset in single SET statement

DATA A; SET B; RUN;

B (input)
Patient_ID  Age
1           34
2           14
3           62
4           37

A (output)
Patient_ID  Age
1           34
2           14
3           62
4           37

Multiple named datasets in single SET statement

DATA A; SET B C; RUN;

B (input)
Patient_ID  Age
1           34
2           14
3           62
4           37

C (input)
Patient_ID  Age
5           53
6           46
7           42
8           63

A (output)
Patient_ID  Age
1           34
2           14
3           62
4           37
5           53
6           46
7           42
8           63

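Multiple SET statements

DATA A; SET B; SET C; RUN;

B (input)
Patient_ID  Age
1           34
2           14
3           62
4           37

C (input)
Patient_ID  Height
1           70
2           62
7           66

A (output)
Patient_ID  Age  Height
1           34   70
2           14   62
7           62   66

Rows are paired by position, not by Patient_ID. The third output row combines the third row of B (Age 62) with the third row of C (Patient_ID 7, Height 66), and the shared variable Patient_ID takes the value from C. The step stops when SET C reaches end of file, so the fourth row of B is never output.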

Quick review of the PDV (Program Data Vector)

_N_: TYPE=NUMERIC, LENGTH=8, DROP=NO, RETAIN=NO
_ERROR_: TYPE=NUMERIC, LENGTH=8, DROP=NO, RETAIN=NO
Inpt_ID, Age: TYPE=NUMERIC, LENGTH=3, DROP=NO, RETAIN=NO
Example values shown on the slide: 1, 35, 67

- Vector -> one row.
- Stores the current value and metadata about each variable.
- Variables created in the step by assignment are reset to missing at the start of each DATA step iteration, unless a RETAIN statement names them. Variables read by a SET statement are not reset; they keep their values until the next read overwrites them.
- The PDV is created at compile time and updated throughout execution.
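To watch the PDV change, a minimal sketch along these lines may help. The dataset B and the variable DOUBLED are made up for illustration; PUT _ALL_ writes every PDV variable, including _N_ and _ERROR_, to the log.

```sas
/* Hypothetical two-row sample dataset */
DATA B;
  INPUT Patient_ID Age;
  DATALINES;
1 34
2 14
;
RUN;

/* Print the PDV at the top of each iteration and again after the read */
DATA _NULL_;
  PUT "Top of iteration:";
  PUT _ALL_;
  SET B;
  DOUBLED = Age * 2;   /* DOUBLED is created by assignment, not read from B */
  PUT "After SET and assignment:";
  PUT _ALL_;
RUN;
```

At the top of the second iteration the log should show Patient_ID and Age still holding the first row's values, while DOUBLED is back to missing. The step ends when SET B tries to read past the last row.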

The SET Statement

DATA A; SET B; OUTPUT; RETURN; RUN;
Read the next row from B into the PDV. Output PDV to A. Repeat until EOF of B is reached.

DATA A; SET B C; RUN;
Read the next row from B into the PDV. Output PDV to A. Repeat until EOF of B is reached. Then read the next row from C into the PDV. Output PDV to A. Repeat until EOF of C is reached.

DATA A; SET B; SET C; RUN;
Read the next row from B into the PDV. Read the next row from C into the PDV. Output PDV to A. Repeat until EOF of B or C is reached.
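The difference between one SET statement naming two datasets and two separate SET statements can be checked with a small run. This is a sketch using made-up data in the shape of the earlier slides; the output dataset names A_CONCAT and A_PAIRED are illustrative.

```sas
/* Hypothetical sample data */
DATA B;
  INPUT Patient_ID Age;
  DATALINES;
1 34
2 14
3 62
4 37
;
RUN;

DATA C;
  INPUT Patient_ID Height;
  DATALINES;
1 70
2 62
7 66
;
RUN;

/* One SET, two datasets: rows of B followed by rows of C */
DATA A_CONCAT;
  SET B C;
RUN;

/* Two SETs: one row from each per iteration, stops at the shorter dataset */
DATA A_PAIRED;
  SET B;
  SET C;
RUN;

PROC PRINT DATA=A_CONCAT; RUN;
PROC PRINT DATA=A_PAIRED; RUN;
```

A_CONCAT should have seven rows, with Height missing on the rows from B and Age missing on the rows from C. A_PAIRED should have three rows, one per row of C, because the step ends as soon as either SET statement runs out of data.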

Using multiple SET statements to solve a realistic problem

For each admission in an inpatient administrative dataset, how many outpatient encounters did the patient have in the prior year?

Datasets

INPATIENT DATASET
INPT_ID  ADMITDAY
1        19FEB2008
1        07JUL2008
2        16JAN2008
3        31MAR2008

OUTPATIENT DATASET
OUTPT_ID  VIZDAY
1         30MAR2007
1         12OCT2007
3         10FEB2007
3         20JUL2007
3         05DEC2007
3         15APR2008

Patient identifier variable has a different name between datasets. This is intentional. We’ll see why...

Desired outcome

NUMBER OF OUTPT VISITS IN 1-YEAR PRIOR TO ADMITDAY
INPT_ID  ADMITDAY   NVISITS
1        19FEB2008  2
1        07JUL2008  1
2        16JAN2008  0
3        31MAR2008  2

Simple solution

Algorithm:
1) Load next inpatient record.
2) Load every outpatient record one after another, and increment NVISITS if the outpatient ID matches the current inpatient ID, and the visit occurred within 0-365 days of the current admit day.

    * HOW MANY ROWS ARE IN THE OUTPATIENT DATASET?;
    DATA _NULL_;
      /* The SET never executes; NOBS= is filled in at compile time */
      IF 0 THEN SET OUTPATIENT NOBS=NOBS;
      /* SYMPUTX converts the number to text without padding or a log note */
      CALL SYMPUTX("NOBS", NOBS);
      STOP;
    RUN;
    %PUT &NOBS;

    * COUNT NUMBER OF OUTPATIENT VISITS THE PATIENT HAD IN THE YEAR PRIOR TO ADMITDAY;
    DATA OUTPT_VISITS;
      LENGTH NVISITS K 3;
      SET INPATIENT;
      NVISITS = 0;
      /* Direct-access read of every outpatient row for this admission */
      DO K = 1 TO &NOBS;
        SET OUTPATIENT POINT=K;
        IF OUTPT_ID = INPT_ID AND (0 LE ADMITDAY - VIZDAY LE 365) THEN NVISITS = NVISITS + 1;
      END;
      KEEP INPT_ID ADMITDAY NVISITS;
      OUTPUT;
      RETURN;
    RUN;

PDV after each SET OUTPATIENT statement

_N_  INPT_ID  ADMITDAY   OUTPT_ID  VIZDAY     NVISITS
1    1        19FEB2008  1         30MAR2007  1
1    1        19FEB2008  1         12OCT2007  2
1    1        19FEB2008  3         10FEB2007  2
1    1        19FEB2008  3         20JUL2007  2
1    1        19FEB2008  3         05DEC2007  2
1    1        19FEB2008  3         15APR2008  2
2    1        07JUL2008  1         30MAR2007  0
2    1        07JUL2008  1         12OCT2007  1
2    1        07JUL2008  3         10FEB2007  1
2    1        07JUL2008  3         20JUL2007  1
2    1        07JUL2008  3         05DEC2007  1
2    1        07JUL2008  3         15APR2008  1
3    2        16JAN2008  1         30MAR2007  0
3    2        16JAN2008  1         12OCT2007  0
3    2        16JAN2008  3         10FEB2007  0
3    2        16JAN2008  3         20JUL2007  0
3    2        16JAN2008  3         05DEC2007  0
3    2        16JAN2008  3         15APR2008  0
4    3        31MAR2008  1         30MAR2007  0
4    3        31MAR2008  1         12OCT2007  0
4    3        31MAR2008  3         10FEB2007  0
4    3        31MAR2008  3         20JUL2007  1
4    3        31MAR2008  3         05DEC2007  2
4    3        31MAR2008  3         15APR2008  2

Efficiency

Two problems with this simplified approach:

#1 - Always starts searching on the first outpatient record.
Solution: in each data step iteration, keep track of the row number in the outpatient file where you first found a match for the patient identifier you’re looking for. On the next iteration of the data step, use this retained row number as the starting row for the search.

#2 - Always keeps searching until end of file.
Solution: instead of a do loop that searches through every row of the outpatient data, use a do while loop that terminates once the outpatient ID is greater than the inpatient ID.

Both datasets must be sorted by patient ID and date for these fixes to work.

These problems are of great concern in large datasets. The simple solution reads every outpatient row once for every inpatient row, and iterating through records needlessly takes a long time (longer than you might think).
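The two fixes can be sketched as follows. This is one possible arrangement, not the presenter's code: START and K are illustrative names, the sample data is typed in from the Datasets slide, and the retained starting row is advanced past rows with smaller IDs, which leaves it at the first row for the current patient.

```sas
/* Sample data from the Datasets slide, already sorted by ID and date */
DATA INPATIENT;
  INPUT INPT_ID ADMITDAY :DATE9.;
  FORMAT ADMITDAY DATE9.;
  DATALINES;
1 19FEB2008
1 07JUL2008
2 16JAN2008
3 31MAR2008
;
RUN;

DATA OUTPATIENT;
  INPUT OUTPT_ID VIZDAY :DATE9.;
  FORMAT VIZDAY DATE9.;
  DATALINES;
1 30MAR2007
1 12OCT2007
3 10FEB2007
3 20JUL2007
3 05DEC2007
3 15APR2008
;
RUN;

DATA OUTPT_VISITS;
  SET INPATIENT;
  RETAIN START 1;        /* first outpatient row worth searching */
  NVISITS = 0;
  K = START;
  DO WHILE (K LE NOBS);
    SET OUTPATIENT POINT=K NOBS=NOBS;
    /* Fix #2: sorted data, so nothing past this row can match */
    IF OUTPT_ID GT INPT_ID THEN LEAVE;
    /* Fix #1: rows with a smaller ID are never needed again */
    IF OUTPT_ID LT INPT_ID THEN START = K + 1;
    ELSE IF 0 LE ADMITDAY - VIZDAY LE 365 THEN NVISITS = NVISITS + 1;
    K = K + 1;
  END;
  KEEP INPT_ID ADMITDAY NVISITS;
RUN;

PROC PRINT DATA=OUTPT_VISITS; RUN;
```

On the sample data this should reproduce the NVISITS values from the Desired outcome slide (2, 1, 0, 2). No STOP statement is needed here because the step ends when SET INPATIENT reaches end of file.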

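Pitfalls

- Incorrect use of the POINT= option can cause an infinite loop. SAS cannot detect end of file on a dataset read with POINT=, so a step whose only SET statement uses POINT= needs a STOP statement.
- Make sure things are moving along at a smooth pace. Try this: if not mod(_N_,100000) then put _N_=; On every hundred-thousandth iteration of the data step, the current value of _N_ is written to the log. This works in the Windows enhanced editor environment, not sure about Enterprise Guide.
- Watch out for variable names that are common between the two datasets. The value read by the second SET statement overwrites the value read by the first, and attributes such as type and length are fixed by whichever dataset the compiler encounters first. This is why the example datasets use INPT_ID and OUTPT_ID.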
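Two of these pitfalls can be illustrated with a short sketch. The data and the names B_ID, C_ID, ID_MISMATCH, and ROWS_2_3 are made up for illustration; RENAME= is a standard dataset option and is one way, among others, to keep a shared variable from being overwritten.

```sas
/* Hypothetical sample data: both datasets use the name Patient_ID */
DATA B;
  INPUT Patient_ID Age;
  DATALINES;
1 34
2 14
3 62
;
RUN;

DATA C;
  INPUT Patient_ID Height;
  DATALINES;
1 70
2 62
7 66
;
RUN;

/* Renaming on input keeps both identifiers in the PDV */
DATA A;
  SET B (RENAME=(Patient_ID=B_ID));
  SET C (RENAME=(Patient_ID=C_ID));
  ID_MISMATCH = (B_ID NE C_ID);   /* 1 when the paired rows disagree */
  IF NOT MOD(_N_, 100000) THEN PUT _N_=;   /* progress note in the log */
RUN;

/* POINT= as the only SET: read rows 2 and 3, then end the step explicitly */
DATA ROWS_2_3;
  DO K = 2 TO 3;
    SET B POINT=K;
    OUTPUT;
  END;
  STOP;   /* without STOP this step would never reach end of file */
RUN;
```

With only three rows the progress note never fires; it is included to show where such a statement would sit in a long-running step.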

Another application

HSR&D funded grant: “Validating and Classifying VA Readmissions for Quality Assessment and Improvement” – Amy Rosen, PI

Objective 1) Estimate risk-adjusted models to predict 30-day readmissions for patients discharged with HF, AMI, or pneumonia from an acute-care VA facility.

How does the programmer create a vector of risk adjustment variables? For each inpatient index discharge, search through that patient’s inpatient and outpatient history, and flag the risk adjustment variable as YES upon seeing a relevant diagnosis.

Datasets contain 100,000+ index admissions and 20M rows of outpatient data to search through for risk adjustment terms.

One last example

“You have two datasets: one with 425 million rows, each row containing information about one marketing email sent to a potential customer. The second dataset has 10 million records, each row having information about an order placed by a customer. For each order, find out which marketing stimulus, if any, we can attribute the order to, based on certain business rules. Here’s a laptop with PC SAS for you to do it on.”

Thank you! If anyone has questions or requests for sample code, email me at: william.obrien@va.gov