Published by Rolf Hawkins. Modified over 6 years ago.
1
Former Chapter 23: Selecting Efficient Sorting Strategies
STAT 541
This chapter was deleted from later editions.
©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina
2
Outline
- Avoiding Unnecessary Sorts
- Using a Threaded Sort
- Calculating and Allocating Sort Resources
- Handling Large Data Sets
- Removing Duplicate Observations Efficiently
Notes: Take-away message from the first bullet: SORT requires a lot of resources, so minimize its use.
3
Avoiding Unnecessary Sorts
Sorts can be avoided in some situations.
- BY groups with an index
  - If a data set includes an index, you can use a BY statement on the indexed variable without first running PROC SORT.
  - The BY statement can be used in a DATA step or a PROC step.
  - Processing a data set through an index may be less efficient than running PROC SORT.
  - The index is not used if DESCENDING or NOTSORTED is specified, or if the data is already sorted.
Notes: Cover the first two sub-bullets only.
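The index-based BY processing described on this slide might look like the following minimal sketch. It assumes the SASHELP.CLASS sample data set is available; the WORK.CLASS copy and the choice of Sex as the indexed variable are illustrative and not taken from the course materials.

```sas
/* Assumption: SASHELP.CLASS exists; WORK.CLASS is a hypothetical working copy */
data work.class;
    set sashelp.class;
run;

/* Build a simple index on the variable that will be used in BY */
proc datasets library=work nolist;
    modify class;
    index create sex;
quit;

/* No PROC SORT is needed: the index supplies the BY-group order */
proc means data=work.class mean;
    by sex;
    var height;
run;
```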
4
Avoiding Unnecessary Sorts
- The NOTSORTED option groups the data on the BY variable but doesn’t order the groups.
- Useful when sorting on nominal groupings.
- Results are interesting when the data is not pre-grouped: each run of consecutive matching values is treated as a separate BY group.

    proc freq data=stat541.fall2008;
      by gender notsorted;
      table major;
    run;

- We can use FIRST. and LAST. with the NOTSORTED option.
Notes: Great idea for data sorted or grouped outside SAS (e.g., Excel worksheets). Odd results otherwise.
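As a rough illustration of FIRST. and LAST. with NOTSORTED, the sketch below counts the length of each run of consecutive Sex values. It assumes the SASHELP.CLASS sample data set, which is not grouped by Sex; the output data set name and the run_size variable are hypothetical.

```sas
/* Assumption: SASHELP.CLASS is available and is not grouped by Sex */
data work.sex_runs;
    set sashelp.class;
    by sex notsorted;                  /* groups consecutive equal values only */
    if first.sex then run_size = 0;    /* start of a new run */
    run_size + 1;                      /* sum statement: value is retained across rows */
    if last.sex then output;           /* one output row per run */
    keep sex run_size;
run;
```

Because the input is not pre-grouped, the same Sex value appears in several output rows, one per run, which is the "odd result" the notes warn about.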
5
Avoiding Unnecessary Sorts
- You can group on formatted values rather than on the variable itself.
- The GROUPFORMAT option can only be used in the DATA step.
- GROUPFORMAT allows you to create groups without creating a new variable.
- The CLASS statement is an under-used resource, especially in PROC MEANS and PROC UNIVARIATE.
Notes: GROUPFORMAT’s big advantage lies in the third bullet. Get in the habit of using CLASS rather than BY.
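A small sketch of both ideas, assuming the SASHELP.CLASS sample data set; the agegrp format, its cut-points, and the output names are invented for illustration. NOTSORTED is added to the DATA step BY statement because the sample data is not ordered by Age.

```sas
/* Hypothetical format that collapses Age into two groups */
proc format;
    value agegrp low-12  = 'Child'
                 13-high = 'Teen';
run;

/* CLASS groups on the formatted values and needs no prior sort */
proc means data=sashelp.class n mean;
    class age;
    format age agegrp.;
    var height;
run;

/* GROUPFORMAT: FIRST./LAST. follow the formatted values, no new variable needed */
data work.age_runs;
    set sashelp.class;
    format age agegrp.;
    by age groupformat notsorted;
    if first.age then n_in_run = 0;
    n_in_run + 1;
    if last.age then output;
    keep age n_in_run;
run;
```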
6
Avoiding Unnecessary Sorts
- PROC CONTENTS can be used to see whether data is already sorted.
- The SORTEDBY= data set option can then be used to record the sort information as a data set attribute.
- You should always check whether data is sorted before sorting it.
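One way to check and record the sort order is sketched below. It assumes the SASHELP.CLASS sample data set and assumes its rows are already in Name order; SORTEDBY= only asserts the order and does not sort, so a wrong assertion can cause problems later. The output data set name is hypothetical.

```sas
/* Look for the "Sorted" and sort-information entries in the output */
proc contents data=sashelp.class;
run;

/* Assumption: the rows are already in Name order; SORTEDBY= records that claim */
data work.class_by_name (sortedby=name);
    set sashelp.class;
run;

/* The sort information now appears as a data set attribute */
proc contents data=work.class_by_name;
run;
```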
7
Using a Threaded Sort
- Threaded sorts can distribute sorting across multiple CPUs.
    PROC SORT DATA=dsname THREADS|NOTHREADS;
- You can modify or query the number of CPUs with the system option CPUCOUNT.
Notes: Skim.
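For reference, querying CPUCOUNT and requesting a threaded sort might be written as in this sketch. The SASHELP.CLASS input and the output name are assumptions, and whether threading is actually used depends on the SAS installation and the size of the data.

```sas
/* Query the current CPUCOUNT setting (written to the log) */
proc options option=cpucount;
run;

/* Let SAS use the number of CPUs it detects, and allow threading */
options cpucount=actual threads;

/* Request a threaded sort explicitly on the PROC SORT statement */
proc sort data=sashelp.class out=work.class_sorted threads;
    by age;
run;
```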
8
Handling Large Data Sets
If a data set is too large to sort (there is insufficient space for the multiple copies of the data set that a sort needs), it can be split into smaller data sets that are sorted separately and then reassembled, typically with a SET statement/BY statement combination. The BY statement is sometimes unnecessary, for example when the pieces cover non-overlapping ranges of the sort key and are concatenated in order.
9
Handling Large Data Sets
Many methods are available for splitting the data:
- FIRSTOBS= and OBS= in a DATA step
- IF/OUTPUT in a DATA step
- WHERE in a PROC SORT step
- WHERE in a DATA step
A DATA step is better than PROC APPEND for reassembling a large data set.
Notes: Most of the tools here are very familiar to us.
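The split-sort-reassemble approach could be sketched as follows, using WHERE in the PROC SORT step and a SET/BY interleave. SASHELP.CLASS stands in for a large data set here, and the split on Sex and all output names are illustrative assumptions.

```sas
/* Sort each piece separately; WHERE selects the subset to sort */
proc sort data=sashelp.class out=work.females;
    where sex = 'F';
    by name;
run;

proc sort data=sashelp.class out=work.males;
    where sex = 'M';
    by name;
run;

/* Interleave the sorted pieces; BY keeps the combined result in Name order */
data work.class_sorted;
    set work.females work.males;
    by name;
run;
```

Here the BY statement is needed because the two pieces overlap in Name values. If the pieces had been split on ranges of Name instead, a plain SET would already produce sorted output.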
10
Handling Large Data Sets
- The TAGSORT option saves only the BY variables and observation numbers in temporary files.
- This saves on the space set aside for a sort.
Notes: “Tags” are a convenient feature; this should remind you of order() in R.
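A minimal TAGSORT call, assuming the SASHELP.CLASS sample data set and a hypothetical output name, might look like this:

```sas
/* TAGSORT sorts only the BY values plus observation numbers ("tags"),
   then uses the sorted tags to retrieve the full records */
proc sort data=sashelp.class out=work.class_tagsorted tagsort;
    by age name;
run;
```

The space saving comes at the cost of extra processing time, so this is typically worth trying only when a regular sort runs out of work space.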
11
Removing Duplicate Observations Efficiently
- NODUPKEY
- NODUPRECS
  - Checks the entire record, not just the BY variables
  - Can be limited to post-DROP and post-KEEP variables
- FIRST. and LAST.
Notes: I’m surprised the book included this choice.
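The NODUPKEY and FIRST. approaches can be compared with a sketch like the one below. It assumes the SASHELP.CLASS sample data set; the output names are hypothetical, and both approaches keep one observation per Age value.

```sas
/* NODUPKEY: keep one observation for each distinct BY value */
proc sort data=sashelp.class out=work.one_per_age nodupkey;
    by age;
run;

/* FIRST. alternative: sort, then keep the first row of each BY group */
proc sort data=sashelp.class out=work.by_age;
    by age;
run;

data work.first_per_age;
    set work.by_age;
    by age;
    if first.age;    /* subsetting IF keeps only the first row per Age */
run;
```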