Using Multiple SET Statements to Combine and Analyze Data

William O’Brien, MS
VA Boston HCS – Center for Organization, Leadership and Management Research
December 21, 2011
VA SAS Users Group Presentation

What we’ll cover
What does the SET statement actually do?
A simple example of using multiple SET statements to combine and analyze data from two sources.
Efficiency issues and pitfalls.
More realistic examples of when you might use multiple SET statements.

Single named dataset in single SET statement
DATA A;
  SET B;
RUN;

B (input)
Patient_ID  Age
1           34
2           14
3           62
4           37

A (output)
Patient_ID  Age
1           34
2           14
3           62
4           37

Multiple named datasets in single SET statement
DATA A;
  SET B C;
RUN;

B (input)
Patient_ID  Age
1           34
2           14
3           62
4           37

C (input)
Patient_ID  Age
5           53
6           46
7           42
8           63

A (output: all rows of B followed by all rows of C)
Patient_ID  Age
1           34
2           14
3           62
4           37
5           53
6           46
7           42
8           63

Multiple SET statements
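DATA A;
  SET B;
  SET C;
RUN;

B (input)
Patient_ID  Age
1           34
2           14
3           62
4           37

C (input)
Patient_ID  Height
1           70
2           62
7           66

A (output)
Patient_ID  Age  Height
1           34   70
2           14   62
7           62   66

Rows are paired by position, not by Patient_ID. In the third row, Age 62 comes from the third row of B, while Patient_ID 7 and Height 66 come from the third row of C, which overwrites the Patient_ID read from B. The fourth row of B is never output because C has only three rows.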
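The positional pairing on this slide can be reproduced with a small, self-contained sketch. The DATALINES values are copied from the B and C tables above; the PUT statement is an addition for illustration, so that the log shows what each iteration writes to A.

```sas
* Build the two input datasets shown on the slide;
DATA B;
  INPUT Patient_ID Age;
  DATALINES;
1 34
2 14
3 62
4 37
;
RUN;

DATA C;
  INPUT Patient_ID Height;
  DATALINES;
1 70
2 62
7 66
;
RUN;

* Two SET statements: one row from B, then one row from C, per iteration;
DATA A;
  SET B;   /* loads Patient_ID and Age from the next row of B */
  SET C;   /* overwrites Patient_ID, loads Height from the next row of C */
  PUT _N_= Patient_ID= Age= Height=;
RUN;
```

The step stops on the fourth iteration, when SET C tries to read past its last row, so A ends up with three rows.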

Quick review of the PDV (Program Data Vector)
_N_           TYPE=NUMERIC, LENGTH=8, DROP=NO, RETAIN=NO
_ERROR_       TYPE=NUMERIC, LENGTH=8, DROP=NO, RETAIN=NO
Inpt_ID, Age  TYPE=NUMERIC, LENGTH=3, DROP=NO, RETAIN=NO
Example values: 1 35 67

Vector -> one row.
Stores the current value and metadata for each variable.
Variables created in the DATA step (for example, by an assignment statement) are reset to missing at the start of each iteration unless a RETAIN statement is used. Variables read by a SET statement keep their values until the next read overwrites them.
The PDV is created at compile time and updated throughout execution.
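One way to watch the PDV change is to write it to the log before and after the SET statement. The sketch below is an illustration only: it assumes a hypothetical four-row dataset B and a hypothetical assigned variable named Flag. Variables read by SET still hold the previous row at the top of the next iteration, while Flag is missing again.

```sas
/* Hypothetical four-row input */
DATA B;
  INPUT Patient_ID Age;
  DATALINES;
1 34
2 14
3 62
4 37
;
RUN;

DATA _NULL_;
  /* Top of the iteration, before any row has been read this time around */
  PUT 'Before SET: ' _N_= Patient_ID= Age= Flag=;
  SET B;
  Flag = 1;   /* assigned variable, reset to missing on the next iteration */
  PUT 'After SET:  ' _N_= Patient_ID= Age= Flag=;
RUN;
```

The log shows a fifth "Before SET" line and no matching "After SET" line, because the step ends when SET B reads past the last row.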

The SET Statement

DATA A;
  SET B;
  OUTPUT;
  RETURN;
RUN;
Read the next row from B into the PDV. Output PDV to A. Repeat until EOF of B is reached.

DATA A;
  SET B C;
RUN;
Read the next row from B into the PDV. Output PDV to A. Repeat until EOF of B is reached. Read the next row from C into the PDV. Output PDV to A. Repeat until EOF of C is reached.

DATA A;
  SET B;
  SET C;
RUN;
Read the next row from B into the PDV. Read the next row from C into the PDV. Output PDV to A. Repeat until EOF of B or C is reached.

Using multiple SET statements to solve a realistic problem
For each admission in an inpatient administrative dataset, how many outpatient encounters did the patient have in the prior year?

Datasets

INPATIENT DATASET (variables INPT_ID, ADMITDAY)
ADMITDAY: FEB2008, JUL2008, JAN2008, MAR2008

OUTPATIENT DATASET (variables OUTPT_ID, VIZDAY)
VIZDAY: MAR2007, OCT2007, FEB2007, JUL2007, DEC2007, APR2008

The patient identifier variable has a different name in each dataset. This is intentional. We’ll see why...

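Desired outcome

INPATIENT DATASET (variables INPT_ID, ADMITDAY)
First row: INPT_ID 1, ADMITDAY 19FEB2008. Remaining admissions: JUL2008, JAN2008, MAR2008

OUTPATIENT DATASET (variables OUTPT_ID, VIZDAY)
VIZDAY: MAR2007, OCT2007, FEB2007, JUL2007, DEC2007, APR2008

NUMBER OF OUTPT VISITS IN 1-YEAR PRIOR TO ADMITDAY (variables INPT_ID, ADMITDAY, NVISITS)
One row per admission: FEB, JUL, JAN, MAR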

Simple solution

Algorithm:
1) Load next inpatient record.
2) Load every outpatient record one after another, and increment NVISITS if the outpatient ID matches the current inpatient ID and the visit occurred within 365 days before the current admit day.

* HOW MANY ROWS ARE IN THE OUTPATIENT DATASET?;
DATA _NULL_;
  * NOBS= IS FILLED IN AT COMPILE TIME, SO THE SET NEVER HAS TO EXECUTE;
  IF 0 THEN SET OUTPATIENT NOBS=NOBS;
  * SYMPUTX STRIPS THE LEADING BLANKS THAT SYMPUT WOULD LEAVE IN THE MACRO VARIABLE;
  CALL SYMPUTX("NOBS", NOBS);
  STOP;
RUN;
%PUT &NOBS;

* COUNT NUMBER OF OUTPATIENT VISITS THE PATIENT HAD IN THE YEAR PRIOR TO ADMITDAY;
DATA OUTPT_VISITS;
  LENGTH NVISITS K 3;
  SET INPATIENT;
  NVISITS = 0;
  DO K = 1 TO &NOBS;
    SET OUTPATIENT POINT=K;
    IF OUTPT_ID = INPT_ID AND (0 LE ADMITDAY - VIZDAY LE 365) THEN NVISITS = NVISITS + 1;
  END;
  KEEP INPT_ID ADMITDAY NVISITS;
  OUTPUT;
  RETURN;
RUN;
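A possible variation, not taken from the presentation: because the NOBS= variable is assigned when the step is compiled, the row count can be used in the same DATA step without a separate pass or a macro variable. The sketch assumes the INPATIENT and OUTPATIENT datasets described above; TOTAL_ROWS is a hypothetical variable name.

```sas
DATA OUTPT_VISITS;
  SET INPATIENT;
  NVISITS = 0;
  /* TOTAL_ROWS already holds the row count of OUTPATIENT here,
     because NOBS= is resolved at compile time */
  DO K = 1 TO TOTAL_ROWS;
    SET OUTPATIENT POINT=K NOBS=TOTAL_ROWS;
    IF OUTPT_ID = INPT_ID AND (0 LE ADMITDAY - VIZDAY LE 365) THEN NVISITS = NVISITS + 1;
  END;
  KEEP INPT_ID ADMITDAY NVISITS;
RUN;
```

The step still ends when SET INPATIENT reaches end of file, so no STOP statement is needed. The POINT= and NOBS= variables are temporary and are not written to the output dataset.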

PDV after each SET OUTPATIENT statement
PDV variables: _N_, INPT_ID, ADMITDAY, OUTPT_ID, VIZDAY, NVISITS

Each inpatient row is held in the PDV while all six outpatient rows are read in turn, for 24 reads in total:

ADMITDAY  VIZDAY (in the order read)
FEB2008   MAR2007, OCT2007, FEB2007, JUL2007, DEC2007, APR2008
JUL2008   MAR2007, OCT2007, FEB2007, JUL2007, DEC2007, APR2008
JAN2008   MAR2007, OCT2007, FEB2007, JUL2007, DEC2007, APR2008
MAR2008   MAR2007, OCT2007, FEB2007, JUL2007, DEC2007, APR2008

Efficiency

Two problems with this simplified approach:

#1 - Always starts searching on the first outpatient record.
Solution: in each data step iteration, keep track of the row number in the outpatient file where you first found a match for the patient identifier you’re looking for. On the next iteration of the data step, use this retained row number as the starting row for the search.

#2 - Always keeps searching until end of file.
Solution: instead of a do loop that searches through every row of the outpatient data, use a do while loop that terminates once the outpatient ID is greater than the inpatient ID.

These are of great concern in large datasets. Iterating through records needlessly takes a long time (longer than you might think).
Both datasets must be sorted by patient ID and date for this to work.
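The two fixes can be combined roughly as follows. This is a minimal sketch under stated assumptions, not the presenter's code: it assumes the INPATIENT and OUTPATIENT datasets described earlier, and START, DONE, and TOTAL_ROWS are hypothetical variable names introduced for illustration.

```sas
/* Both inputs must be in patient ID order for the early exit to be valid */
PROC SORT DATA=INPATIENT;  BY INPT_ID ADMITDAY; RUN;
PROC SORT DATA=OUTPATIENT; BY OUTPT_ID VIZDAY;  RUN;

DATA OUTPT_VISITS;
  RETAIN START 1;          /* fix #1: first outpatient row still worth checking */
  SET INPATIENT;
  NVISITS = 0;
  K = START;
  DONE = 0;
  /* fix #2: stop as soon as the outpatient ID passes the inpatient ID */
  DO WHILE (K LE TOTAL_ROWS AND NOT DONE);
    SET OUTPATIENT POINT=K NOBS=TOTAL_ROWS;
    IF OUTPT_ID LT INPT_ID THEN START = K + 1;      /* earlier patient, never needed again */
    ELSE IF OUTPT_ID GT INPT_ID THEN DONE = 1;      /* later patient, nothing left to find */
    ELSE IF 0 LE ADMITDAY - VIZDAY LE 365 THEN NVISITS = NVISITS + 1;
    K = K + 1;
  END;
  KEEP INPT_ID ADMITDAY NVISITS;
RUN;
```

START only moves past rows whose ID is lower than the current inpatient ID, so a patient with several admissions restarts at that patient's first outpatient row each time.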

Pitfalls

Incorrect use of the POINT option can cause an infinite loop.

Make sure things are moving along at a smooth pace. Try this:
if not mod(_N_,100000) then put _N_=;
On every hundred-thousandth iteration of the data step, the current value of _N_ will be written to the log. This works in the Windows enhanced editor environment; not sure about Enterprise Guide.

Watch out for variable names that are common to the two datasets. The value and metadata of the variable in the second named dataset will overwrite that of the first.
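One hedged way to guard against the last two pitfalls together is sketched below. It assumes the INPATIENT and OUTPATIENT datasets described earlier and uses the KEEP= dataset option so that each SET reads only the variables the step needs; TOTAL_ROWS is a hypothetical variable name. If a shared variable is needed from both datasets, the RENAME= dataset option can be used in the same position instead.

```sas
DATA OUTPT_VISITS;
  SET INPATIENT (KEEP=INPT_ID ADMITDAY);
  /* progress message in the log every 100,000 inpatient rows */
  IF NOT MOD(_N_, 100000) THEN PUT _N_=;
  NVISITS = 0;
  DO K = 1 TO TOTAL_ROWS;
    /* KEEP= limits what the second SET can overwrite in the PDV */
    SET OUTPATIENT (KEEP=OUTPT_ID VIZDAY) POINT=K NOBS=TOTAL_ROWS;
    IF OUTPT_ID = INPT_ID AND (0 LE ADMITDAY - VIZDAY LE 365) THEN NVISITS = NVISITS + 1;
  END;
  KEEP INPT_ID ADMITDAY NVISITS;
RUN;
```

Reading fewer variables on each of the many POINT= reads also reduces the work done per row.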

Another application

HSR&D funded grant: “Validating and Classifying VA Readmissions for Quality Assessment and Improvement” – Amy Rosen, PI

Objective 1) Estimate risk-adjusted models to predict 30-day readmissions for patients discharged with HF, AMI, or pneumonia from an acute-care VA facility.

How does the programmer create a vector of risk adjustment variables? For each inpatient index discharge, search through that patient’s inpatient and outpatient history, and flag the risk adjustment variable as YES upon seeing a relevant diagnosis.

Datasets contain 100,000+ index admissions and 20M rows of outpatient data to search through for risk adjustment terms.

One last example

“You have two datasets: one with 425 million rows, each row containing information about one marketing stimulus sent to a potential customer. The second dataset has 10 million records, each row having information about an order placed by a customer. For each order, find out which marketing stimulus, if any, we can attribute the order to, based on certain business rules. Here’s a laptop with PC SAS for you to do it on.”

Thank you!
If anyone has questions or requests for sample code, email me at:
