Working Efficiently with Large SAS® Datasets Vishal Jain Senior Programmer.


Introduction
A ‘large’ SAS dataset is a subjective term: it depends on how the user perceives it and on the available resources. As a rough picture, a “LARGE” SAS dataset has millions of observations and hundreds of variables, well beyond a typical SAS dataset.

Challenges with Large SAS datasets
Storage:
– Disk space
– Memory
Time:
– Real time
– CPU time
There are several ways to handle these challenges. One of the most common approaches is to make LARGE datasets smaller without losing any of their information.

Agenda
– Techniques to reduce or compress LARGE SAS datasets
– Programming tips to make working with LARGE SAS datasets efficient

LENGTH Statement
Can be used to set or control the number of bytes used to store a SAS variable.
Advantages:
– Reduces the storage space required by variables.
Drawbacks:
– Requires extra programming time to reduce the length of variables.
– Might lead to wrong results when used incorrectly: reducing the length of fractional numbers (numbers with decimals) might result in a loss of accuracy due to truncation.
– No warnings or errors are issued when the length specified in the LENGTH statement results in truncation of data.
Note: a reasonable SAS date requires no more than 4 bytes of storage.
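As a minimal sketch of the LENGTH statement in use, the example below shortens two whole-number variables and leaves a fractional one at the default 8 bytes. The dataset and variable names (work.visits_full, subj_id, visit_date, dose_mg) are hypothetical and exist only for this illustration. On Windows and UNIX hosts a 4-byte numeric can hold integers exactly up to 2,097,152, which comfortably covers the dates and IDs generated here.

```sas
/* Hypothetical demo data: all names are illustrative only */
data work.visits_full;
    format visit_date date9.;
    do subj_id = 1 to 1000;
        visit_date = '01JAN2010'd + subj_id;  /* whole-number SAS date */
        dose_mg    = subj_id / 3;             /* fractional value */
        output;
    end;
run;

/* Shorten only the integer-valued variables; dose_mg stays at   */
/* 8 bytes because truncating a fraction would lose accuracy.    */
data work.visits_small;
    length visit_date 4 subj_id 4;
    set work.visits_full;
run;

/* Compare the Len column and file size of the two datasets */
proc contents data=work.visits_full;  run;
proc contents data=work.visits_small; run;
```

Since SAS issues no warning on truncation, comparing the two datasets with PROC COMPARE before discarding the original is a cheap safeguard.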

%SQUEEZE macro
Originally developed by Ross Bettinger*, the %SQUEEZE macro can be used to find the minimum lengths required for both numeric and character variables in a SAS dataset. These minimum lengths can then be assigned to the variables to reduce their storage space and hence the size of the SAS dataset.
* Please see the references slide at the end of the presentation for more details.

COMPRESS Dataset Option
Compression is a process that reduces the number of bytes required to represent each observation. The dataset option COMPRESS=BINARY|CHAR can be used to compress the observations in output SAS datasets*. The resulting compressed dataset requires less storage space and fewer I/O operations to read or write, but additional CPU resources might be needed to access it.
* COMPRESS=NO disables compression.
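A short sketch of the COMPRESS= dataset option follows. The dataset work.wide_text and its variables are hypothetical; it is built with a long, mostly blank character variable because that kind of padding tends to compress well. Actual savings depend on the data, and for some datasets compression can even increase the file size.

```sas
/* Hypothetical demo data: a long character variable that is mostly blanks */
data work.wide_text;
    length comment $200;
    do id = 1 to 10000;
        comment = 'short text';
        output;
    end;
run;

/* Character (run-length) compression on the output dataset */
data work.wide_text_c (compress=char);
    set work.wide_text;
run;

/* Binary compression on the output dataset */
data work.wide_text_b (compress=binary);
    set work.wide_text;
run;
```

After each step the SAS log reports by how much compression changed the size of the output dataset, which is the simplest way to decide whether the option is worth keeping for a given file.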

LENGTH & COMPRESS Results

Results Summary
The LENGTH statement, used in combination with the COMPRESS=BINARY option, yielded the best reduction in data size: the resulting data was around 15% of the original size. SAS programs run against the reduced datasets were faster than the same programs run against the corresponding original datasets. This may not hold for all SAS datasets, as it largely depends on the size and structure of the dataset, the number and lengths of the variables, the SAS job to be performed and the operating environment.

DROP= and KEEP= Dataset options
A SAS dataset may contain many variables that are either completely blank or not required for report generation. Dropping such variables can make a huge difference when the dataset is large. The DROP= and KEEP= dataset options, or the DROP and KEEP DATA step statements, can be used to select variables from a SAS dataset. The DROP= and KEEP= dataset options are generally more flexible and efficient than the DROP and KEEP DATA step statements because:
– They can be applied separately to the variables of the input and/or output dataset
– They can be used in procedures
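The two advantages listed above can be sketched as follows. All names (work.labs, subj_id, trt, aval, comment, ratio_tmp, aval_flag) are hypothetical. KEEP= on the input dataset stops unneeded variables from being read into the step at all, DROP= on the output dataset removes a working variable that is only needed during the step, and the same option works directly inside a procedure.

```sas
/* Hypothetical demo data with a wide, completely blank variable */
data work.labs;
    length subj_id 8 trt $1 aval 8 comment $200;
    do subj_id = 1 to 100;
        trt     = ifc(mod(subj_id, 2), 'A', 'B');
        aval    = subj_id * 1.5;
        comment = ' ';
        output;
    end;
run;

data work.labs_report (drop=ratio_tmp);       /* applied on output */
    set work.labs (keep=subj_id trt aval);    /* applied on input: comment is never read */
    ratio_tmp = aval / 10;                    /* working variable, available during the step */
    aval_flag = (ratio_tmp > 5);
run;

/* Dataset options also work in procedures */
proc means data=work.labs (keep=trt aval) mean;
    class trt;
    var aval;
run;
```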

Summary I
Points discussed:
– Techniques to reduce the size of large SAS datasets.
– Advantages and drawbacks of these techniques.
The next section covers a few programming tips and techniques to make working with LARGE SAS datasets more efficient.

Tip # 1 – Create Variables
When working on analysis datasets, one often creates flag variables based on certain criteria. The default length of a numeric variable is 8 bytes, and its length cannot be reduced below 3 bytes, so it is worth storing such flags as character variables instead. For example, a flag that contains either 0 or 1 can be stored as a character variable with a length of 1 byte.
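A minimal sketch of a 1-byte character flag is shown below. The dataset, the variable names and the threshold of 75 are hypothetical and chosen only for illustration.

```sas
/* Hypothetical demo: store a 0/1 flag in 1 byte instead of 8 */
data work.flags;
    length high_fl $1;               /* declare the length before first use */
    do subj_id = 1 to 100;
        aval = subj_id * 1.5;
        if aval > 75 then high_fl = '1';
        else high_fl = '0';
        output;
    end;
run;
```

The trade-off is that a character flag cannot be summed directly; counting flagged records then needs a WHERE clause or PROC FREQ rather than a SUM of the flag.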

Tip # 2 – PROC DATASETS
To use the available storage space efficiently, one can delete SAS datasets in the WORK library, other temporary space, or permanent space when they are no longer needed. This can be done using the DATASETS procedure, as shown in the example below:
    /* To clear ALL work datasets */
    PROC DATASETS LIBRARY=WORK KILL NOLIST;
    QUIT;
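KILL removes every member of the library, which is not always wanted. As a hedged variation, the DELETE statement of PROC DATASETS removes only the named datasets; the dataset names temp1, temp2 and keep_me below are hypothetical.

```sas
/* Hypothetical demo: create three small work datasets */
data work.temp1 work.temp2 work.keep_me;
    x = 1;
run;

/* Delete only the intermediate datasets; work.keep_me survives */
proc datasets library=work nolist;
    delete temp1 temp2;
quit;
```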

Tip # 3 – _NULL_
_NULL_ can be used as the dataset name in the DATA statement when we need to execute a DATA step but do not want to create a SAS dataset. This is particularly useful when creating macro variables from values obtained from an input SAS dataset. For example:
    /* To count the number of observations in the dataset */
    DATA _NULL_;
      SET raw.testdrug END=eof;
      /* PUT needs a format; TRIM(LEFT()) removes the padding blanks */
      IF eof THEN CALL SYMPUT('nobs', TRIM(LEFT(PUT(_N_, 8.))));
    RUN;
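For a very large dataset, reading every observation just to count them is itself expensive. One alternative sketch, not from the original slides, uses the NOBS= option of the SET statement, which is filled in when the step is compiled so no observations are read. The dataset work.testdrug_demo and the variable n_obs are hypothetical; NOBS= applies to native SAS data files and is not available for views.

```sas
/* Hypothetical demo data */
data work.testdrug_demo;
    do subj_id = 1 to 250;
        output;
    end;
run;

data _null_;
    /* The SET is never executed; NOBS= is populated at compile time */
    if 0 then set work.testdrug_demo nobs=n_obs;
    call symputx('nobs', n_obs);   /* SYMPUTX strips leading/trailing blanks */
    stop;                          /* end the step without reading any data */
run;

%put Observations in work.testdrug_demo: &nobs;
```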

Tip # 4 – SAS Views
SAS views are virtual SAS datasets that can be used as an alternative to SAS datasets.
Advantages:
– Avoid unnecessary reading or writing of temporary datasets.
– Occupy a very small amount of storage space compared to the space required by the original SAS dataset.
Drawbacks:
– Processing the data defined by a SAS view takes additional time compared to processing a regular SAS dataset.
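A DATA step view can be sketched as below; all names (work.labs, work.high_vals, aval) and the cutoff of 100 are hypothetical. The view stores only the instructions, and the subsetting runs each time the view is read.

```sas
/* Hypothetical demo data */
data work.labs;
    do subj_id = 1 to 100;
        aval = subj_id * 1.5;
        output;
    end;
run;

/* Define a view: no observations are written at this point */
data work.high_vals / view=work.high_vals;
    set work.labs;
    where aval > 100;
run;

/* The view is executed here, when a procedure reads it */
proc means data=work.high_vals n mean;
    var aval;
run;
```

If the same view feeds several steps, its logic is re-executed each time, so a view read many times may cost more than a temporary dataset created once.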

Tip # 5 – Merging
Two or more datasets can be combined using either an SQL JOIN or a DATA step MERGE, depending on the data situation. Because a DATA step MERGE requires the inputs to be sorted beforehand, it might result in the creation of temporary sorted datasets. An SQL JOIN avoids this requirement and so saves storage space.
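A minimal sketch of joining unsorted inputs with PROC SQL is given below; the datasets work.demog and work.labs and their variables are hypothetical. No PROC SORT step or sorted copy is needed, although PROC SQL may still use temporary utility space internally while it processes the join.

```sas
/* Hypothetical demo data */
data work.demog;
    do subj_id = 1 to 5;
        age = 30 + subj_id;
        output;
    end;
run;

data work.labs;
    do subj_id = 5 to 1 by -1;      /* deliberately not in sorted order */
        aval = subj_id * 1.5;
        output;
    end;
run;

/* Join directly, without sorting either input first */
proc sql;
    create table work.combined as
    select l.subj_id, l.aval, d.age
    from work.labs as l
         left join work.demog as d
         on l.subj_id = d.subj_id;
quit;
```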

Tip # 6 – Subsetting
Subsetting, or filtering unwanted records out of a dataset, reduces the size of the data and thus saves storage space. Either a WHERE statement or an IF statement can be used for this purpose, depending on the underlying task and data situation. In most cases a WHERE statement is more efficient and faster than an IF statement.
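The difference can be sketched with two steps that produce the same subset; the dataset and variable names are hypothetical. WHERE filters observations before they are brought into the DATA step, while a subsetting IF evaluates each observation after it has been read, which is why WHERE is usually cheaper. IF is still needed when the condition depends on variables created within the step.

```sas
/* Hypothetical demo data */
data work.labs;
    length trt $1;
    do subj_id = 1 to 100;
        trt  = ifc(mod(subj_id, 2), 'A', 'B');
        aval = subj_id * 1.5;
        output;
    end;
run;

/* WHERE: non-matching observations are never loaded into the step */
data work.trt_a_where;
    set work.labs;
    where trt = 'A';
run;

/* Subsetting IF: every observation is read, then tested */
data work.trt_a_if;
    set work.labs;
    if trt = 'A';
run;
```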

Tip # 7 – Sorting
Sorting a large dataset on several key variables can take an enormous amount of time and can cause insufficient disk space or memory errors. Sorting and subsetting can be performed in the same SORT procedure, which:
– Takes less processing time than subsetting the data and then sorting it in two separate steps.
– Creates only one dataset in the WORK library instead of two, thus saving storage space.
The TAGSORT option in the PROC SORT statement can be used when there is insufficient disk space available to sort a large SAS dataset.
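Both ideas can be sketched as follows, using hypothetical dataset and variable names. The first step applies a WHERE= dataset option to the input so that only the wanted records are sorted; the second uses TAGSORT, which sorts only the BY variables and observation numbers and then retrieves the full records, trading less temporary disk space for what may be a longer run time.

```sas
/* Hypothetical demo data */
data work.labs;
    length trt $1;
    do subj_id = 100 to 1 by -1;
        trt  = ifc(mod(subj_id, 2), 'A', 'B');
        aval = subj_id * 1.5;
        output;
    end;
run;

/* Subset and sort in one step: one output dataset instead of two */
proc sort data=work.labs (where=(trt = 'A')) out=work.labs_a_sorted;
    by subj_id;
run;

/* TAGSORT for cases where temporary disk space is the constraint */
proc sort data=work.labs out=work.labs_sorted tagsort;
    by trt subj_id;
run;
```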

Summary II
When working with large datasets, one might come across a few challenges or constraints that are hardly ever encountered with smaller SAS datasets. There are many techniques available to overcome these challenges. It is very important to be aware of the benefits and trade-offs of each available technique, depending on the data situation, project requirements and available resources.

References
– SAS Support
– Paul Gorrell, NESUG 2007, Numeric Length: Concepts and Consequences
– Andrew H. Karp and David Shamlin, SUGI Paper 3-28, Indexing and Compressing SAS® Data Sets: How, Why and Why Not
– Sunil Gupta, SAS Global Forum, Paper , WHERE vs. IF Statements: Knowing the Difference in How and When to Apply
– Selvaratnam Sridharma, NESUG 2006, How to Reduce the Disk Space Required by a SAS® Data Set
– SAS Support, Sample 24804: %SQUEEZE-ing Before Compressing Data, Redux

Questions ?

THANK YOU!! Any comments or suggestions are welcomed on