Chapter 21: Controlling Data Storage Space 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.



2 Outline
Reducing Data Storage Space
Compressing Data Files
Using Views to Conserve Data Storage
There’s a level of detail here that I used to cover in STAT 740 and long ago abandoned. I’ll take the same approach for this material.

3 Reducing Data Storage Space
Character variables
–Reminder: The LENGTH statement should be listed immediately after the DATA statement (even before a SET statement) to take effect
–Use codes rather than lengthy character variables where possible
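The placement rule for LENGTH can be sketched as follows. The data set names work.long and work.short and the variable state are made up for illustration; they are not from the course materials.

```sas
/* Hypothetical input: a state code stored in a 20-byte character variable */
data work.long;
   length state $ 20;
   input state $;
   datalines;
SC
NC
GA
;
run;

/* LENGTH comes right after the DATA statement, before SET,
   so the shorter length is the one SAS uses for the new data set */
data work.short;
   length state $ 2;
   set work.long;
run;

/* Check the stored length of state in the new data set */
proc contents data=work.short;
run;
```

Shortening a character variable this way truncates any value longer than the new length, and SAS may write a warning about multiple lengths to the log, so it is worth checking the longest value first.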

4 Reducing Data Storage Space
Numeric variables
–The LENGTH statement should only be used to shorten integer-valued variables, since a reduced length otherwise truncates significant digits from the stored value (sign, exponent, mantissa)
–DEFAULT= assigns a default length to all subsequent numeric variables, and hence should be used with caution
–PROC COMPARE can summarize rounding errors
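A minimal sketch of the numeric case, using made-up data set names (work.full, work.reduced) and variables (x, n). It stores one non-integer and one integer variable in 4 bytes and then asks PROC COMPARE to report any differences from the 8-byte originals.

```sas
/* Hypothetical data: x is non-integer, n is integer-valued */
data work.full;
   do i = 1 to 5;
      x = i / 3;
      n = i * 100;
      output;
   end;
run;

/* Store both variables in 4 bytes instead of the default 8 */
data work.reduced;
   length x n 4;
   set work.full;
run;

/* Compare the stored values; x is expected to show small
   differences from truncation, while n should match */
proc compare base=work.full compare=work.reduced;
   var x n;
run;
```

PROC COMPARE will also note that the variable lengths differ between the two data sets; the value comparison is the part that shows whether precision was lost.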

5 Compressing Data Files
Uncompressed data files have several inefficiencies
–Column space is constant for each record
–Observation lengths are equal
–Character variables are padded with blanks
–Numeric variables are padded with 0s in the mantissa
–New observations may cause an entire new page to be created

6 Compressing Data Files
Compressed data files have efficiencies that you might anticipate, as well as some that would surprise you
–Observations are treated as a string of bytes
–Blanks are removed
–Consecutive repeated characters and numbers are compressed
–Information on updated observations is not necessarily stored on the same page
Greater overhead is required (e.g., pointers)

7 Compressing Data Files
Rules for when to compress data sets are intuitive:
–Large data sets
–Many long character variables
–Many repeated character/numeric variables
–Many missing values
–Many consecutive repeated character/numeric variables

8 Compressing Data Files
Two options (and accompanying suboptions) for compressing files
OPTIONS COMPRESS=NO|YES|CHAR|BINARY
–System compress (affects every data set in your SAS session)
DATA dsname (COMPRESS=NO|YES|CHAR|BINARY)
–Data set compress
–YES and CHAR are good for simple character repeats
–BINARY is efficient for long observations, and data with large blocks of numeric variables (e.g., testing data)
–BINARY requires more CPU to uncompress
SAS writes a message to LOG summarizing compression
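Both forms of the COMPRESS option can be tried side by side, as in this sketch. The data set work.notes_raw and its mostly blank comment variable are invented here so that there is something worth compressing.

```sas
/* Hypothetical data: a long character field that is mostly blanks */
data work.notes_raw;
   length comment $ 200;
   do id = 1 to 10000;
      comment = 'ok';
      output;
   end;
run;

/* Data set option: compress only this output data set */
data work.notes_char (compress=char);
   set work.notes_raw;
run;

/* System option: applies to data sets created from here on */
options compress=binary;

data work.notes_bin;
   set work.notes_raw;
run;

/* Turn system-level compression back off */
options compress=no;
```

After each compressed data set is created, the log note reports how much the size changed, which is the easiest way to judge whether CHAR or BINARY suits a particular file.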

9 Compressing Data Files
By default, new observations are appended to the end of a data set (implicit OUTPUT). REUSE allows SAS to repurpose accumulated empty space in the compressed data set
OPTIONS REUSE=NO|YES
–System reuse
DATA dsname (COMPRESS=YES REUSE=YES|NO)
–Data set reuse
Once selected, the REUSE option cannot be changed
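A sketch of REUSE=YES in use, with hypothetical data sets work.log_c and work.more. Observations are deleted to free space inside the compressed file, and the appended observations may then be placed in that space rather than at the end.

```sas
/* Hypothetical compressed data set created with REUSE=YES */
data work.log_c (compress=yes reuse=yes);
   length msg $ 100;
   do id = 1 to 1000;
      msg = 'ok';
      output;
   end;
run;

/* Deleting observations leaves free space in existing pages */
proc sql;
   delete from work.log_c where id <= 500;
quit;

/* Hypothetical new observations to add */
data work.more;
   length msg $ 100;
   do id = 1001 to 1200;
      msg = 'new';
      output;
   end;
run;

/* With REUSE=YES, SAS may place these in the freed space
   instead of appending them at the end of the data set */
proc append base=work.log_c data=work.more;
run;
```

One consequence is that the physical order of observations is no longer guaranteed to be the order in which they were added.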

10 Compressing Data Files
Remember the use of POINT= when creating random samples in Chapter 13? This direct access has high overhead for compressed data sets and can be disabled with POINTOBS=NO to prevent this inefficient access technique
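As a rough illustration, POINTOBS=NO is given as a data set option alongside COMPRESS= when the compressed data set is created. The data set work.big_c and its contents are placeholders.

```sas
/* Hypothetical compressed data set with access by observation
   number turned off */
data work.big_c (compress=yes pointobs=no);
   length msg $ 100;
   do id = 1 to 1000;
      msg = 'ok';
      output;
   end;
run;

/* Sequential reads still work as usual */
data work.copy;
   set work.big_c;
run;
```

The trade-off is that techniques relying on observation numbers, such as SET with POINT=, are no longer available for that data set.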

11 DATA step views
We introduced SQL views (partially compiled tables) in Chapter 7 as an important space-saving measure
Views can be created in the DATA step as well (and are distinct from SQL views):
DATA dsname/VIEW=dsname;
DATA VIEW=dsname;
DESCRIBE;
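The two forms above can be sketched with the sashelp.class sample data set that ships with SAS. The view name work.teens and the age filter are chosen only for illustration.

```sas
/* Create a DATA step view: the view name after VIEW= must match
   a name listed on the DATA statement */
data work.teens / view=work.teens;
   set sashelp.class;
   where age >= 13;
run;

/* Write the stored source code of the view to the log */
data view=work.teens;
   describe;
run;

/* The view is executed when it is referenced, here by PROC PRINT */
proc print data=work.teens;
run;
```

No observations are stored when the view is created; the DATA step code runs each time a procedure or another DATA step reads work.teens.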

12 DATA step views
Remember that views are not a panacea: they should not be called multiple times in a program, since they have to read their source data anew each time the view is referenced.
Consider saving the view's output in another data set, then referencing that data set instead of the view
Views must be kept current as underlying data sets change, which creates additional overhead.
Don’t create views that use files whose variable names/lengths/labels often change
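The advice about saving a view into a data set might look like the sketch below. The view work.teens, the copy work.teens_copy, and the two procedures are illustrative choices built on the sashelp.class sample data.

```sas
/* Hypothetical view over the sashelp.class sample data */
data work.teens / view=work.teens;
   set sashelp.class;
   where age >= 13;
run;

/* Execute the view once and store its result */
data work.teens_copy;
   set work.teens;
run;

/* Later steps read the stored copy, so the view's source data
   is not read again for each procedure */
proc means data=work.teens_copy;
   var height weight;
run;

proc freq data=work.teens_copy;
   tables sex;
run;
```

The copy costs storage space, so this is a trade between disk use and repeated processing; it pays off mainly when the view is expensive to run and is referenced several times.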