Former Chapter 23: Selecting Efficient Sorting Strategies

Slides:



Advertisements
Similar presentations
Chapter 5: Creating and Managing Tables using PROC SQL 1 © Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Advertisements

Chapter 9: Introducing Macro Variables 1 © Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
S ORTING WITH SAS L ONG, VERY LONG AND LARGE, VERY LARGE D ATA Aldi Kraja Division of Statistical Genomics SAS seminar series June 02, 2008.
Chapter 11 Indexing and Hashing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
Performing Queries Using PROC SQL Chapter 1 1 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Chapter 18: Modifying SAS Data Sets and Tracking Changes 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Chapter 3: Combining Tables Horizontally using PROC SQL 1 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
DAY 15: ACCESS CHAPTER 2 Larry Reaves October 7,
PHP meets MySQL.
Microsoft Access 2003 Define some key Access terminology: Field – A single characteristic or attribute of a person, place, object, event, or idea. Record.
McGraw-Hill Technology Education © 2004 by the McGraw-Hill Companies, Inc. All rights reserved. Office Access 2003 Lab 3 Analyzing Data and Creating Reports.
©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina Chapter 17 supplement: Review of Formatting Data STAT 541.
Exploring Microsoft Office XP - Microsoft Word 2002 Chapter 71 Exploring Microsoft Word Chapter 7 The Expert User: Workgroups, Forms, Master Documents,
Chapter 15: Combining Data Horizontally 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
SAS Efficiency Techniques and Methods By Kelley Weston Sr. Statistical Programmer Quintiles.
March 16 & 21, Csci 2111: Data and File Structures Week 9, Lectures 1 & 2 Indexed Sequential File Access and Prefix B+ Trees.
Chapter 16: Using Lookup Tables to Match Data 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Chapter 22: Using Best Practices 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Chapter 17: Formatting Data 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Chapter 19: Introduction to Efficient SAS Programming 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
SAUSAG 69 – 20 Feb 2014 Smarter Sorts Jerry Le Breton (Softscape Solutions) & Doug Lean (DHS) Beyond the Obvious.
Chapter 4: Combining Tables Vertically using PROC SQL 1 © Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Chapter 17 Supplement: Alternatives to IF-THEN/ELSE Processing STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South.
Chapter 23: Selecting Efficient Sorting Strategies 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Chapter 21: Controlling Data Storage Space 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Chapter 6: Modifying and Combining Data Sets  The SET statement is a powerful statement in the DATA step DATA newdatasetname; SET olddatasetname;.. run;
Chapter 14: Combining Data Vertically 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
CS4432: Database Systems II
Working Efficiently with Large SAS® Datasets Vishal Jain Senior Programmer.
Emdeon Office Batch Management Services This document provides detailed information on Batch Import Services and other Batch features.
Database Management 7. course. Reminder Disk and RAM RAID Levels Disk space management Buffering Heap files Page formats Record formats.
SAS and Other Packages SAS can interact with other packages in a variety of different ways. We will briefly discuss SPSSX (PASW) SUDAAN IML SQL will be.
Chapter 5: Enhancing Your Output with ODS
Physical Changes That Don’t Change the Logical Design
Chapter 6: Modifying and Combining Data Sets
Decomposition Storage Model (DSM)
Chapter 2: Getting Data into SAS
Chapter 19: Introduction to Efficient SAS Programming
Hash-Based Indexes Chapter 11
Chapter 3: Working With Your Data
Chapter 18: Modifying SAS Data Sets and Tracking Changes
Chapter 1: Introduction to SAS
Chapter 11: File System Implementation
Chapter 4: Sorting, Printing, Summarizing
Tutorial 1 – Introduction To Microsoft Access 2003
Database Implementation Issues
INFO/CSE 100, Spring 2006 Fluency in Information Technology
Tutorial 3 – Querying a Database
Lecture 12 Lecture 12: Indexing.
Chapters 5 and 7 supplement
Chapter 24 (4th ed.): Creating Functions
Alex Chaplin October 30th, 2018
Tutorial 1 – Introduction To Microsoft Access 2003
Programming Logic and Design Fourth Edition, Comprehensive
More about Databases.
Hash-Based Indexes Chapter 10
Chance to make SAS-L History!
PaP Publication strategy TT2019 – Combined PaP
Using Macros to Solve the Collation Problem
DATABASE IMPLEMENTATION ISSUES
Spreadsheets, Modelling & Databases
Chapter 24: Querying Data Efficiently
Indexing 4/11/2019.
Efficient Selective Unduplication Using the MODIFY Statement
Integrating Office 2013 Programs
Chapter 11 Instructor: Xin Zhang
Database Implementation Issues
STAT 515 Statistical Methods I Lecture 1 August 22, 2019 Originally prepared by Brian Habing Department of Statistics University of South Carolina.
Presentation transcript:

Former Chapter 23: Selecting Efficient Sorting Strategies STAT 541 Former Chapter 23: Selecting Efficient Sorting Strategies This chapter was deleted from later editions ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina

Outline Avoiding Unnecessary Sorts Using a Threaded Sort Calculating and Allocating Sort Resources Handling Large Data Sets Removing Duplicate Observations Efficiently Take-away message from first bullet: SORT requires a lot of resources—minimize its use.

Avoiding Unnecessary Sorts Sorts can be avoided in some situations BY groups with an Index If a data set includes an index, you can use a BY statement on the indexed variable without having used PROC SORT The BY statement can be used in a DATA step or PROC step Processing a data set with an index may be less efficient than PROC SORT Does not index if DESCENDING or NOTSORTED are used or data is pre-sorted Cover first two sub-bullets only 3

Avoiding Unnecessary Sorts The NOTSORTED option groups the data on the BY variable, but doesn’t order groups Useful when sorting on nominal groupings Results are interesting when data is not pre-grouped proc freq data=stat541.fall2008; by gender notsorted; table major; run; We can use FIRST. and LAST. with the NOTSORTED option Great idea for non-SAS sorted/grouped data (e.g., Excel worksheets). Odd results otherwise. 4

Avoiding Unnecessary Sorts You can actually group on formatted values rather than the variable itself GROUPFORMAT option can only be used in the DATA step GROUPFORMAT allows you to create groups without creating a new variable The CLASS statement is an under-used resource, especially in PROC MEANS and PROC UNIVARIATE GROUPFORMAT’s big advantage lies in the third bullet. Get in the habit of using CLASS rather than BY. 5

Avoiding Unnecessary Sorts PROC CONTENTS can be used to see whether data is already sorted The SORTEDBY option can then be used to include the sort information as a data set attribute You should always check whether data is sorted 6

PROC SORT dsname THREADS|NOTHREADS; Using a Threaded Sort Threaded sorts can distribute sorting across multiple CPUs PROC SORT dsname THREADS|NOTHREADS; You can modify or query the number of CPUs with system option CPUCOUNT Skim 7

Handling Large Data Sets If a data set is too large to sort (insufficient space for the multiple copies of the data set needed for a sort), the data set can be split into smaller data sets then reassembled, typically with a SET statement/BY statement combination. The BY statement is sometimes unnecessary 8

Handling Large Data Sets Many methods are available FIRSTOBS= OBS= in DATA step IF/OUTPUT in DATA step WHERE in PROC SORT step WHERE in DATA step A DATA step is better than PROC APPEND for reassembling a large data set Most of the tools here are very familiar to us. 9

Handling Large Data Sets The TAGSORT option saves only the BY variables and observation numbers in temporary files This saves on the space set aside for a SORT “tags” are a convenient feature; this should remind you of order() in R. 10

Removing Duplicate Observations Efficiently NODUPKEY NODUPRECS Checks the entire record, not just the BY variables Can be limited to post-DROP and post-KEEP variables FIRST. and LAST. I’m surprised the book included this choice 11