SAS Efficiency Techniques and Methods By Kelley Weston Sr. Statistical Programmer Quintiles.



Efficiency can be measured in many ways – e.g.:
- CPU time
- Disk space required
- Memory
- Input / Output
- Original programmer time
- Maintenance programmer time

Outline
- Copying a data set
- Changing attributes
- Appending data sets
- Procedures
- Rename variables
- Alternatives to logical OR constructs
- Formats
- Indexing
- Disk space
- Views
- Proc Sort and disk space
- Hashing – merging data sets

Avoid reading the dataset multiple times

    /* one pass, but substr() is evaluated up to three times per observation */
    data work.data1 work.data2 work.data3;
        set lib1.master;
        if substr(x, 1, 1) EQ 'A' then output work.data1;
        else if substr(x, 1, 1) EQ 'B' then output work.data2;
        else if substr(x, 1, 1) EQ 'C' then output work.data3;
    run;

    /* one pass, and substr() is evaluated only once per observation */
    data work.data1 work.data2 work.data3;
        set lib1.master;
        type = substr(x, 1, 1);
        if type EQ 'A' then output work.data1;
        else if type EQ 'B' then output work.data2;
        else if type EQ 'C' then output work.data3;
        drop type;
    run;

Copying a dataset:

    /* inefficient */
    data work.data;
        set lib1.data;
    run;

    /* efficient */
    proc datasets lib = work nolist;
        copy in = lib1 out = work;
        select data;
    quit;

Changing Attributes:

    /* inefficient */
    /* reads & writes one observation at a time */
    data work.data;
        set lib1.data;
        label age = 'Years';
        format salary dollar10.;
        rename cars = autos;
    run;

Changing Attributes:

    /* efficient */
    proc datasets lib = work nolist;
        copy in = lib1 out = work;
        select data;
        modify data (label = "Demographic Data");
        label age = 'Years';
        format salary dollar10.;
        rename cars = autos;
        change data = demograph;
        contents data = demograph;
    quit;

Appending datasets:

    /* inefficient */
    /* reads and writes one observation at a time */
    data work.data1;
        set work.data1 work.data2;
    run;

Appending datasets:

    /* efficient */
    proc datasets nolist;
        append base = work.data1
               data = work.data2;
    quit;

Give procedures what they need – but no more:

    /* inefficient */
    proc sort data = lib1.data
              out  = work.data;
        by var1 var2 var3;
    run;

Give procedures what they need – but no more:

    /* still inefficient, but less so */
    proc sort data = lib1.data
              out  = work.data (drop = var4);
        by var1 var2 var3;
    run;

Give procedures what they need – but no more:

    /* efficient */
    proc sort data = lib1.data (drop = var4)
              out  = work.data;
        by var1 var2 var3;
    run;

Give procedures what they need – but no more:

Tools to limit datasets:
- Drop / keep
- Subsetting if
- Where
- Firstobs / Obs

Use these on the input dataset as much as possible.

NOTE 1: Where takes effect after the Drop / Keep.
NOTE 2: You can now use Firstobs / Obs along with Where; Where takes effect first, then Firstobs / Obs.
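As a minimal sketch of the two notes above, the step below combines KEEP=, WHERE= and OBS= on the input dataset. It uses sashelp.class rather than the slides' lib1 library so that it runs on its own; the output name work.teens and the age cut-off are illustrative assumptions.

```sas
/* Assumption: sashelp.class stands in for the slides' lib1 data */
data work.teens;
    /* age must be kept, because WHERE= is applied after KEEP= */
    /* WHERE= subsets first; OBS= = 5 then stops after 5 matching obs */
    set sashelp.class (keep  = name age
                       where = (age GE 13)
                       obs   = 5);
run;
```

Dropping age from the KEEP= list would make the WHERE= condition fail, which is the practical consequence of NOTE 1.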

Use rename rather than reassign:

Reassign:
- Creates another variable (probably needlessly)
- Takes up more space in the data set
- Executes needlessly each execution of the data step
- Is slower

Rename:
- Occurs just once, at compile time, not execution time
- Might be able to avoid reading the data one observation at a time

Use rename rather than reassign:

    /* inefficient */
    data work.data;
        set lib1.data;
        var2 = var1;
    run;

Use rename rather than reassign:

    /* still inefficient, but less so */
    data work.data;
        set lib1.data;
        rename var1 = var2;
    run;

Use rename rather than reassign:

    /* efficient */
    proc datasets lib = work nolist;
        copy in = lib1 out = work;
        select data;
        modify data;
        rename var1 = var2;
    quit;

Avoid using logical OR constructs

Can use IN() or Select/when – much faster, easier to code and understand (consider the order of the choices listed)

    data work.data1;
        set lib1.data1;
        length msg $20;
        select (category);
            when ('A') msg = 'Category A';
            when ('B') msg = 'Category B';
            when ('C') msg = 'Category C';
            otherwise  msg = 'Category unknown';
        end;
    run;
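The slide shows Select/when; the IN() alternative it mentions might look like the sketch below. The dataset sashelp.class, the variable name group and the age values are assumptions chosen only to keep the example self-contained.

```sas
/* Assumption: sashelp.class stands in for the slides' lib1.data1 */
data work.class_flag;
    set sashelp.class;
    length group $7;
    /* replaces: if age EQ 11 or age EQ 12 or age EQ 13 then ... */
    if age in (11, 12, 13) then group = 'Younger';
    else group = 'Older';
run;
```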

Use formats to make coding shorter and more to the point:

    proc format;
        value $status 'M', 'SEP'    = 'A'
                      'S', 'D', 'W' = 'B';
    run;

    data work.data;
        set lib1.data;
        where put(status, $status.) = 'A';
    run;

OR

    data work.data;
        set lib1.data (where = (put(status, $status.) = 'A'));
    run;

Consider using an index instead of sorting the dataset

When to consider using an index:
- When you can subset the dataset using a where option/statement
- When the dataset will be involved in a merge / join

Tip: Use “options msglevel = I” (for Info) when using indexes; gives INFO: message in log when used

3 Ways of Creating an Index:
- Data Step
- Proc Datasets
- Proc SQL

Indices and the Data Step

    data work.data1 (index = (var1 = (var1)
                              composite = (var1 var2)));
        set lib1.data1;
    run;

    /* Review index */
    proc contents data = work.data1;
    run;

    /* Delete index – not preferred method */
    data work.data1;
        set lib1.data1;
    run;

Alphabetic List of Indexes and Attributes

                     # of
                     Unique
    #   Index        Values   Variables
    1   composite    50       var1 var2
    2   var1         50

Create index with proc datasets

    proc datasets lib = work nolist;
        modify data1;
        /* do not include missing values in the index */
        index create var1 / nomiss;
        /* ensure that index values are unique */
        index create composite = (var1 var2) / unique;
    quit;

    /* delete index */
    proc datasets lib = work nolist;
        modify data1;
        index delete var1 composite;
    quit;

Create index with proc sql

    proc sql;
        /* general form: create index index-name on table-name (variable(s)); */

        /* simple index */
        create index prodno on work.products(prodno);

        /* composite index */
        /* notice the commas */
        create index ordno on work.orders(custno, prodno, orderno);
    quit;

    /* Delete index */
    proc sql;
        drop index ordno, orderno from work.orders;
        drop index prodno from work.products;
    quit;
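To tie the msglevel tip to the index-creation methods above, here is a small sketch that builds a simple index with the INDEX= data set option and then subsets with WHERE. The datasets work.class and work.age14 are illustrative assumptions. On a table this small SAS may still choose a sequential read, so the INFO: message is not guaranteed to appear.

```sas
options msglevel = i;   /* ask for INFO: messages about index usage */

/* Assumption: a copy of sashelp.class with a simple index on age */
data work.class (index = (age));
    set sashelp.class;
run;

/* A WHERE on the indexed variable is a candidate for index use */
data work.age14;
    set work.class;
    where age EQ 14;
run;
```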

DISK SPACE CONSIDERATIONS
- Consider putting long memo fields into a separate, associated dataset
- Use _NULL_ as the data set name when you do not need to create a dataset (e.g., when creating macro variables)
- Use the KEEP &/or DROP data set options (on input &/or output) or statements to limit the variables.
- Use the WHERE data set option (on input &/or output) or statement to limit the observations.

DISK SPACE CONSIDERATIONS
- Use data set compression (data set options COMPRESS = YES REUSE = YES). This is primarily useful with character data, or numeric data where the numbers are small.

      data work.new2 (compress = yes reuse = yes);
          set work.new;
      run;

- Use a data set view, rather than creating a SAS data set – essentially a “pointer” to the data.
- Use SQL to merge, summarize, sort, etc. instead of a combination of DATA steps and procs.

DISK SPACE CONSIDERATIONS
- Store the data in the order in which it is usually required. This saves the disk space used to re-sort the data.
- Create only the indexes that are needed; delete any unneeded ones.
- Delete old work libraries (use Windows Explorer)
- Delete old autosaved programs (use Windows Explorer)
- Delete datasets in current programs as soon as possible (proc datasets)

DISK SPACE CONSIDERATIONS
Use appropriate length of variables
- Find length of current character values, then make length current + 15% in new dataset to allow for longer lengths
  o Remember that character variables get their length from their first use, if not explicitly defined (e.g., scan, repeat, symget functions have default length of 200)
- For numerics, can make length as low as 3 (default of 8) – good for ages (length of 3), dates (length of 4)
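A minimal sketch of shortening a numeric variable with a LENGTH statement is given below. The dataset names are assumptions, and sashelp.class is used only because it ships with SAS. Short numeric lengths are only safe for integer values of modest size, which is why ages and dates are the examples the slide gives.

```sas
/* Assumption: sashelp.class stands in for a real study dataset */
data work.class_small;
    length age 3;          /* store age in 3 bytes instead of the default 8 */
    set sashelp.class;
run;

/* Review the stored lengths */
proc contents data = work.class_small;
run;
```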

SAS DATA VIEWS
- Can be created with either a data step or proc sql.
- Does not contain any data – only instructions on how to access the data
- When the original data changes, it is immediately reflected in the view.

Data set View

    data work.class / view = class;
        set sashelp.class;
        where weight LE 125;
        drop height;
    run;

    title 'Printing View work.class';
    proc print data = work.class heading = h n;
    run;

Data set View

    /* retrieve the source of a view */
    /* prints the source in the log */
    data view = work.class;
        describe;
    run;

Create view in sql

    title 'View Created with SQL';
    proc sql;
        create view work.subview as
            select *
            from sashelp.class
            where height GT 50
            order by name;

        select * from work.subview;
    quit;

    /* print source statements in log */
    proc sql;
        describe view work.subview;
    quit;

Source stmts of SQL View

    76   proc sql;
    77   describe view work.subview;
    NOTE: SQL view WORK.SUBVIEW is defined as:
         select * from SASHELP.CLASS where height>50 order by name asc;
    78   quit;

Saving Disk Space when using Proc Sort

    proc sort data = sashelp.class
              out  = work.class1
              tagsort;
        by name;
    run;

Tagsort:
- Stores only the BY variables and the obs # in temp. files
- Uses tags to retrieve the records from the input data set in sorted order
- Not supported by the multi-threaded sort.
- Best used when the total length of BY variables is small compared to length of entire observation.
- Can increase processing time.

Using Hashing to Merge Data Sets

    data ;
        * Set up input and lookup datasets;
        * Set up hash info;
        * Read input dataset;
        * Check if key value is stored in hash;
        * Get value of data variable;
        * Write obs to the output data set(s);
    run;

Using Hashing to Merge Data Sets

    data ;
        * Set up input and lookup datasets;
        if 0 then do;
            set ;
            set (keep = );
        end;
        ...
    run;

Using Hashing to Merge Data Sets

    data ;
        * Set up input and lookup datasets;
        * Set up hash info;
        retain rc 0;
        if _n_ EQ 1 then do;
            declare hash h(dataset: " ");
            h.definekey(" ", " ");
            h.definedata(" ", " ");
            h.definedone(); * all definitions are complete;
            * assign missing values to vars;
            call missing(, );
        end;
        ...
    run;

Using Hashing to Merge Data Sets

    data ;
        * Set up input and lookup datasets;
        * Set up hash info;
        * Read input dataset;
        set ;
        ...
    run;

Using Hashing to Merge Data Sets

    data ;
        * Set up input and lookup datasets;
        * Set up hash info;
        * Read input dataset;
        * Check if key value is stored in hash;
        rc = h.check();
        ...
    run;

Using Hashing to Merge Data Sets

    data ;
        * Set up input and lookup datasets;
        * Set up hash info;
        * Read input dataset;
        * Check if key value is stored in hash;
        * Get value of data variable;
        if rc EQ 0 then do;
            rc = h.find();
        end;
        ...
    run;

Using Hashing to Merge Data Sets

    data ;
        * Set up input and lookup datasets;
        * Set up hash info;
        * Read input dataset;
        * Check if key value is stored in hash;
        * Get value of data variable;
        * Write obs to the output data set(s);
        output ;
    run;

Using Hashing to Merge Data Sets

    data ;
        if 0 then do;
            set ;
            set (keep = );
        end;
        retain rc 0;
        if _n_ EQ 1 then do;
            declare hash h(dataset: " ");
            h.definekey(" ", " ");
            h.definedata(" ", " ");
            h.definedone(); * all definitions are complete;
            * assign missing values to vars;
            call missing(, );
        end;
        set ;
        rc = h.check(); * is key stored in hash?;
        if rc EQ 0 then do;
            rc = h.find(); * get value of data variable;
        end;
        output ;
        drop rc;
    run;
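For a concrete, runnable illustration of the hash-merge steps, the sketch below fills the pattern in with hypothetical names: work.teams is a made-up lookup table, name is the key, team is the data variable, and sashelp.class plays the input dataset. None of these names come from the slides. Two simplifications are assumptions of this sketch and not the author's method: it calls find() directly, since find() returns 0 only when the key exists, and it resets team on a miss so that a value from an earlier match is not carried forward.

```sas
/* Hypothetical lookup table: a made-up team for three students */
data work.teams;
    length name $8 team $4;
    input name $ team $;
    datalines;
Alfred Red
Alice Blue
Barbara Red
;
run;

data work.class_teams;
    /* Define the lookup variables at compile time; no data is read here */
    if 0 then set work.teams (keep = name team);

    /* Load the lookup table into the hash once */
    if _n_ EQ 1 then do;
        declare hash h(dataset: "work.teams");
        h.definekey("name");
        h.definedata("team");
        h.definedone();
    end;

    /* Read the input dataset */
    set sashelp.class;

    /* find() returns 0 and fills team when the key is in the hash */
    rc = h.find();
    if rc NE 0 then call missing(team);

    output;
    drop rc;
run;
```

Neither dataset needs to be sorted for this lookup, which is the main attraction over a sort-and-merge approach.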

Questions ?