Parallel Processing in Base SAS
Shaving 6 hours off a 9-hour run time.
Parallel Processing in Base SAS
Problem: A mission-critical SAS job takes 9 hours to run.
Approach: Use 1 supervisor process and up to 15 subordinate processes to perform the work simultaneously across multiple logical cores.
Results: The 9-hour run time was reduced to 3 hours.
Myth 1: Parallel processing is not for I/O-bound jobs.
Truth 1: The primary read went from about 4 hours to less than 45 minutes.
Myth 2: You must license SAS/CONNECT to do parallel processing.
Truth 2: We used only Base SAS.
Processing environment: UNIX (AIX) server with 80 logical cores running SAS 9.4 M4. Results in other environments may vary.
Parallel Processing in Base SAS
We have enormous computational power at our disposal. Our UNIX machine has 80 logical cores. How many cores were we using on our mission-critical 9-hour job? One. This is just nuts.
Big hardware costs big money. Use it. Parallel processing takes full advantage of your hardware.
Parallel Processing in Base SAS
How do we set up parallel processing?
Design – Structuring a job for parallel processing
  Beginner – SAS options. Just about everyone should do this!
  Intermediate – Running whole procedures or DATA steps concurrently
  Advanced – Splitting a single DATA step into multiple processes
Programming “tools” – Implementing our design
  OPTIONS THREADS – Enables multi-threading
  SYSTASK – Launches a subordinate process
  WAITFOR – Creates a synch point in the supervisor process
  SET – Passes parameters to subordinate processes
  SYSGET – Receives parameters from the supervisor process
“Beginner” Parallel Processing
    OPTIONS THREADS CPUCOUNT=ACTUAL;
The code below went from 4 hours to 30 minutes, just by setting the options and letting SAS use its built-in multi-threading capabilities. You’re paying for this. Use it.
    PROC MEANS DATA=&Data._&Period. (OBS = &Means_Obs)
        MAXDEC = 2 QMETHOD = P2
        n nmiss min max mean p1 p5 p25 p50 p75 p90 p95 p99 nway;
    RUN;
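Before relying on the options, it may be worth confirming what the session is actually using. A minimal sketch that only reports the current settings to the log and changes nothing:

```sas
/* Print the current THREADS and CPUCOUNT settings to the SAS log. */
proc options option=(threads cpucount);
run;
```

If the log shows NOTHREADS or a CPUCOUNT of 1, the multi-threaded procedures have little to work with.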
Traditional Design (Serial Processing)
STEP 1 -> STEP 2 -> STEP 3 -> STEP 4 -> STEP 5 -> STEP 6
All steps are run one at a time in series. Total run time for all steps is the sum of the run times of each individual step. If each of the six steps takes 10 minutes, the total run time is 1 hour.
“Intermediate” Parallel Processing Design
STEP 1 | STEP 2 | STEP 3 | STEP 4 | STEP 5 | STEP 6
All steps are run simultaneously in parallel. Total run time for all steps is the run time of the single longest running step. If each step takes 10 minutes, total run time is 10 minutes. We just shaved 50 minutes off our run time.
In “Intermediate” parallel processing, we run whole steps. The trick is to find independent steps (no interdependencies) that can run concurrently.
A Parallel Processing SAS Job
Supervisor process: Set-up Step -> Process Spawner -> Independent Step(s) -> Synch Point -> Check Return Codes -> Dependent Step(s) if RC is good, Error Handling if RC is bad -> End of job
Subordinate processes: Sub Process 1, Sub Process 2, Sub Process 3
The Supervisor process controls the order and timing of all sub processes.
The Supervisor process spawns one to many sub processes and continues executing until it reaches a synch (wait) point. The Supervisor process then “sleeps” until the sub processes complete.
When all sub processes are complete, the Supervisor process wakes up and checks all their return codes.
If the return codes are all good, the Supervisor executes any steps dependent on the results produced by the sub processes.
If the return codes are not good, the Supervisor executes an error handling routine.
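The flow above might be sketched as a supervisor macro along the following lines. This is a sketch under assumptions, not the production job: the program names (step_a.sas, step_b.sas, step_c.sas, dependent_steps.sas) and task names are hypothetical, the commands are launched without the nohup wrapper, and the return-code test treats anything other than 0 as bad, which also rejects runs that ended with warnings.

```sas
%macro supervisor;
  /* Declare the status variables up front so they exist in global scope. */
  %global rc1 rc2 rc3;

  /* Process spawner: launch three independent sub processes (hypothetical programs). */
  systask command "sas -sysin step_a.sas" nowait taskname=sub1 status=rc1;
  systask command "sas -sysin step_b.sas" nowait taskname=sub2 status=rc2;
  systask command "sas -sysin step_c.sas" nowait taskname=sub3 status=rc3;

  /* Independent steps in the supervisor itself could run here. */

  /* Synch point: sleep until all three sub processes have completed. */
  waitfor _all_ sub1 sub2 sub3;

  /* Check return codes. Assumption: 0 means a clean run. */
  %if &rc1 = 0 and &rc2 = 0 and &rc3 = 0 %then %do;
    %put NOTE: All sub processes completed cleanly. Running dependent steps.;
    %include "dependent_steps.sas";   /* hypothetical dependent step(s) */
  %end;
  %else %do;
    /* Error handling: report the codes and stop the submitted job. */
    %put ERROR: Sub process failure. rc1=&rc1 rc2=&rc2 rc3=&rc3;
    %abort cancel;
  %end;
%mend supervisor;

%supervisor
```

The logic is wrapped in a macro because the %IF/%THEN/%DO blocks used for the return-code check are not available in open code in SAS 9.4 M4.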
“Advanced” Parallel Processing
Breaking up a Data Step
If a read of a raw file with 200 million records were broken into 5 sub processes, each sub process on average would read 40 million records. Each sub process would be assigned to read a particular set of records simultaneously from the raw file as shown in this graphic.
Raw File (200 million records total)
  Process 1: 1 - 40M
  Process 2: 40M - 80M
  Process 3: 80M - 120M
  Process 4: 120M - 160M
  Process 5: 160M - 200M
Watch out for RETAIN statements, LAG functions, or any other inter-record processing. A sub process cannot see the records that precede its own range, so any value carried forward from an earlier record is lost at each boundary.
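One way a sub process could restrict itself to its own range is with the FIRSTOBS= and OBS= options on the INFILE statement. The sketch below is an assumption about how such a slice might be read; the slides do not show the actual splitting code. The paths, the output library, the data set name, and the record layout are all hypothetical. Both options are inclusive, so the second of five slices runs from record 40,000,001 through record 80,000,000.

```sas
/* Hypothetical slice boundaries for sub process 2 of 5. */
%let first_rec = 40000001;
%let last_rec  = 80000000;

/* Hypothetical permanent library: each sub process has its own WORK
   library, so results the supervisor needs must be written elsewhere. */
libname out "/hypothetical/path/parts";

data out.trades_part2;
  /* FIRSTOBS= is the first record read; OBS= is the last record read. */
  infile "/hypothetical/path/raw_file.dat"
         firstobs=&first_rec obs=&last_rec truncover;
  /* Hypothetical fixed-column record layout. */
  input account_id $ 1-10
        trade_amt    11-20;
run;
```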
My Nine Hour SAS Job (“Before”)
Traditional Design (Serial Processing)
Overall run time: 9 hours for 230,000,000 records.
Input programs, run in series: Trade Input, Consumer Input, Inquiries Input, Public Records Input
All input programs are run in series (one at a time), each program waiting for the prior program to complete before reading even one record. Trades are the largest. Run time is about 4 hours for 180+ million trades.
Interleave: The interleave combines the four types of data.
Frequencies, run in series: Trade Freq 1, Trade Freq 2, Trade Freq 3, Trade Freq 4, Trade Freq 5, Consumer Freq, Inquiries Freq, Pub Recs Freq
All frequencies are run in series. Total run time for frequencies is the sum of the run times of each individual frequency.
Overall run time is the sum of the run times of each individual step.
My Three Hour (Formerly Nine Hour) SAS Job (“After”)
Parallel Processing Design
Overall run time: 3 hours for 230 million records.
“Supervisor” process: Consumer Input, Inquiries Input, Public Records Input, then Consumer, Inquiries, and Public Records Freq, then Interleave, then End of Job
Additional processes: Trade Input Process 1, Trade Input Process 2, Trade Input Process 3, Trade Input Process …
Instead of serially processing the input of Trades, up to 15 sub processes are run simultaneously. Run time went from 4 hours to less than 45 minutes for 180+ million trades.
While the Trade input processes are running, all processing, including frequencies, is simultaneously completed for Consumers, Inquiries, and Public Records.
Trade frequency processes: Trade Frequency Process 1, Trade Frequency Process 2, Trade Frequency Process 3, Trade Frequency Process 4, Trade Frequency Process 5
Trade frequencies are run in parallel (simultaneously).
Overall run time is the run time of the single longest running process.
Parallel Processing in Base SAS
What SAS tools do we need for parallel processing?
  OPTIONS THREADS – Enables multi-threading
  SYSTASK – Launches a subordinate process
  WAITFOR – Creates a synch point in the supervisor process
  SET – Passes parameters to subordinate processes
  SYSGET – Receives parameters from the supervisor process
SYSTASK – Launches a subordinate process
We use SYSTASK to launch subordinate processes. SYSTASK COMMAND passes a command to UNIX. For parallel processing, the command will execute a SAS job, and we code the command, in quotation marks, just as though we were typing it in at the command line.
    SYSTASK COMMAND "nohup sas amazing_pgm.sas &" NOWAIT TASKNAME=task_name STATUS=return_code_macro_variable;
We append the NOWAIT, TASKNAME, and STATUS parameters to SYSTASK.
  NOWAIT causes the sub process to execute asynchronously.
  TASKNAME gives us a name by which to reference a given sub process.
  STATUS causes a return code to be placed in the specified macro variable.
Note that SYSTASK is not restricted to SAS programs alone but can be used to launch almost any UNIX process.
See the paper corresponding to these slides for additional information.
WAITFOR – A Synch Point in the Supervisor Process
We run our subordinate processes asynchronously; that is, the supervisor process does not wait for the sub processes to finish before resuming execution. At some point, however, we’ll want to bring the results of the sub processes back into the supervisor process.
A WAITFOR instructs the supervisor process to wait for the sub processes to complete. The _ALL_ parameter tells the supervisor to wait until all listed sub processes complete. _ANY_ tells the supervisor to resume execution as soon as any one of them completes.
If we want to run two SAS programs as sub processes and then use the results from both, we might code the following:
    SYSTASK COMMAND "nohup sas amazing_pgm.sas &" NOWAIT TASKNAME=Thrd1 STATUS=RC1;
    SYSTASK COMMAND "nohup sas astounding_pgm.sas &" NOWAIT TASKNAME=Thrd2 STATUS=RC2;
    ... Additional SAS statements in the supervisor process ...
    WAITFOR _ALL_ Thrd1 Thrd2;
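A supervisor that should not sleep forever can add the TIMEOUT= option to WAITFOR. The sketch below is illustrative only: the one-hour limit is an arbitrary assumption, the commands are launched without the nohup wrapper, and the check relies on the automatic macro variable SYSRC being nonzero when the time limit expires before the listed tasks finish.

```sas
%macro wait_with_timeout;
  %global rc1 rc2;

  systask command "sas -sysin amazing_pgm.sas"    nowait taskname=thrd1 status=rc1;
  systask command "sas -sysin astounding_pgm.sas" nowait taskname=thrd2 status=rc2;

  /* Wait for both tasks, but for no more than 3600 seconds (assumed limit). */
  waitfor _all_ thrd1 thrd2 timeout=3600;

  %if &sysrc ne 0 %then %do;
    %put WARNING: At least one sub process did not finish within the time limit.;
    /* Write the state of every task to the log for diagnosis. */
    systask list _all_ state;
  %end;
%mend wait_with_timeout;

%wait_with_timeout
```

From there the supervisor could decide whether to keep waiting or to stop stragglers with SYSTASK KILL.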
SET – Passes parameters to subordinate processes
SET establishes UNIX-level parameters (environment variables). We use SET as part of the command passed by SYSTASK to UNIX. For example, if we wanted to pass a LIBNAME called “SAS_Lib” to a sub process, we might code:
    SYSTASK COMMAND "nohup sas -set lib_name SAS_Lib amazing_pgm.sas &" NOWAIT TASKNAME=Thrd1 STATUS=RC1;
Note that there are other ways to pass parameters. I happen to like SET. YMMV.
SYSGET – Receives supervisor process parameters
Once we have SET a parameter in the supervisor process, we need to get that value into the subordinate process. SYSGET retrieves UNIX-level parameters. To get the LIBNAME we SET in the previous slide, we might code the following in the subordinate process:
    %LET lib_name = %SYSGET(lib_name);
Note that %SYSGET is a macro function.
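Putting SET to work for many sub processes at once, a supervisor might generate the SYSTASK statements in a macro loop. The following is a sketch under assumptions rather than the production code: the program name read_part.sas, the parameter names part, first_rec, and last_rec, and the log file names are all hypothetical, and the commands are launched without the nohup wrapper.

```sas
%macro launch_parts(n_parts=5, total_recs=200000000);
  %local i first last chunk tasklist;

  /* Records per sub process, rounded up so no record is dropped. */
  %let chunk = %sysevalf(&total_recs / &n_parts, ceil);
  %let tasklist = ;

  %do i = 1 %to &n_parts;
    /* One global status variable per sub process: rc1, rc2, ... */
    %global rc&i;

    /* Inclusive, non-overlapping record range for this sub process. */
    %let first = %eval((&i - 1) * &chunk + 1);
    %let last  = %sysfunc(min(%eval(&i * &chunk), &total_recs));

    /* Pass the range to the sub process as UNIX-level parameters. */
    systask command
      "sas -sysin read_part.sas -set part &i -set first_rec &first -set last_rec &last -log read_part_&i..log"
      nowait taskname=part&i status=rc&i;

    %let tasklist = &tasklist part&i;
  %end;

  /* Synch point: wait for every sub process launched above. */
  waitfor _all_ &tasklist;
%mend launch_parts;

%launch_parts(n_parts=5, total_recs=200000000)
```

Inside read_part.sas, each value would be picked up with %SYSGET, for example %LET first_rec = %SYSGET(first_rec);.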
SAS code in the subordinate process
The SAS code in the subordinate process is everyday, regular SAS code, just like you’d submit from the UNIX command line or Enterprise Guide. The only real difference is the use of SYSGET.
SET and SYSGET can be used for any SAS program submitted from the command line. In other words, the use of SET and SYSGET isn’t restricted to just parallel processing.
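A subordinate program might guard against being run without its parameter. The sketch below assumes a hypothetical parameter named lib_path that carries a directory path; the libref mylib and the return code 8 are likewise illustrative choices, not part of the original job.

```sas
%macro run_sub_process;
  %local lib_path;

  /* %SYSGET returns an empty value (with a log warning) if the
     variable was never SET on the command line. */
  %let lib_path = %sysget(lib_path);

  %if %length(&lib_path) = 0 %then %do;
    %put ERROR: Parameter lib_path was not passed by the supervisor.;
    /* End this SAS session with a nonzero return code (assumed value 8)
       so the supervisor sees the failure in its STATUS variable. */
    %abort return 8;
  %end;

  /* From here on it is ordinary SAS code. */
  libname mylib "&lib_path";

  /* List the data sets in the library as a simple stand-in for real work. */
  proc contents data=mylib._all_ nods;
  run;
%mend run_sub_process;

%run_sub_process
```

The same program can be tested on its own from the command line by supplying the parameter with -set, with no supervisor involved.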
Parallel Processing in Base SAS Q & A
Contact Information Name: Jim Barbour Company: Experian City/State: Costa Mesa, CA Email: jim.barbour@gmail.com