Published by Lorraine McGee
1
Smarter Sorts: Beyond the Obvious
Jerry Le Breton (Softscape Solutions) & Doug Lean (DHS)
SAUSAG 69 – 20 Feb 2014
2
Sorting – The Obvious First
Why sort? “Data and information is almost always presented in a sorted or structured way.”
3
Sorting – The Obvious First
    proc sort data=claims;
      by claim client;
    run;
Sort puts your records in order BY the values of the variables you list.
It’s important to know your data: how many variables there are, and how many distinct values each one has.
4
Sorting – Do You Need To?
    proc sort data=claims;   /* an unnecessary sort */
      by claim;
    run;
    proc tabulate ...;
      class claim;
      ...
Some procedures do their own sorting: TABULATE, MEANS, REPORT and SQL (which can run out of memory for really big data sets).
5
Sorting – Do You Need To?
Only use PROC SORT before REPORT, TABULATE or MEANS if there’s another reason for the sort later.
For PROC MEANS, replace BY with CLASS. For example,
    proc means nway;
      class x y z;
    run;
gives similar results to
    proc sort;
      by x y z;
    run;
    proc means;
      by x y z;
    run;
and saves significant time by avoiding the sort.
6
Sort Only What You Need
Sort just the rows you want…
    proc sort data=claims out=Sorted_claims;
      where client =: 'A';   /* =: means "begins with" */
      by claim;
    run;
… and just the columns you want:
    proc sort data=claims(keep = c:) out=Sorted_claims;   /* c: keeps variables whose names start with c */
      by claim;
    run;
Leaving out unwanted rows and columns can produce dramatic performance improvements.
7
Sorting – PROC SORT vs PROC SQL
    /* SORT Procedure */
    proc sort data=claims;
      by client claim;
    run;
    /* SQL Procedure */
    proc sql;
      create table claims as
        select * from claims
        order by client, claim;   /* ORDER BY items are comma-separated */
    quit;
Both will sort your data, with no significant performance difference. Choose according to clarity, functional requirements and efficiency. Make it as clear and simple as possible!
8
Sorted Status of a Data Set
    proc sort data=claims;
      by claim client;
    run;
Sort Information
    Sortedby        CLAIM CLIENT
    Validated       YES
    Character Set   ANSI
Sort status is saved as part of a SAS data set, so SAS won’t waste time re-sorting data that is already in the required order.
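The Sort Information block on this slide has the layout of PROC CONTENTS output. As a minimal sketch, assuming the claims data set from the slide, the stored sort status can be inspected like this:

```sas
/* Sketch: list the data set attributes, including any stored sort information. */
proc contents data=claims;
run;
```

The Sort Information section is only printed when a sort indicator is stored on the data set.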
9
Setting Sorted Status of a Data Set
    data client_claims (sortedby = client);
      merge clients claims;
      by client;
    run;
Sort Information
    Sortedby        CLIENT
    Validated       NO
    Character Set   ANSI
If you know a data set is sorted, say so with the SORTEDBY= option, so SAS won’t waste time re-sorting it later.
10
Presorted or Notsorted
    proc sort data=claims out=sorted presorted;
      by claim;
    run;
Use the PRESORTED option when the data is probably already sorted. SAS will check, and only sort if necessary.
    proc print data=grouped_claims;
      by claim notsorted;
    run;
There is no need to sort if the data is already grouped by the required variable. It doesn’t matter that the groups are not in sorted order; you just have to say so with NOTSORTED.
11
Sorting and Maintaining Order
    proc sort data=claims;
      by claim;
    run;
By default, SAS maintains the original order of records within a BY group.
    proc sort data=claims noequals;
      by claim;
    run;
With the NOEQUALS option, SAS won’t necessarily retain the original ordering. This can be more efficient, but it directly affects the results of NODUPKEY, which keeps the first record in each BY group.
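To make the dependency explicit, EQUALS can be written out alongside NODUPKEY. This is a minimal sketch, assuming the claims data set from the slide; the output name first_per_claim is hypothetical:

```sas
/* Sketch: first_per_claim is a hypothetical output name.            */
/* EQUALS (the default) keeps the input order within each BY group,  */
/* so NODUPKEY keeps the record that came first in claims.           */
proc sort data=claims out=first_per_claim nodupkey equals;
  by claim;
run;
```

With NOEQUALS in place of EQUALS, the record kept for each claim may differ from run to run.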
12
Sorting Duplicates
    proc sort data=claims out=no_duplicates nodupkey;
      by claim;
    run;
NODUPKEY effectively keeps the first record of any duplicates.
    proc sort data=claims out=no_duplicates dupout=dups nodupkey;
      by claim;
    run;
DUPOUT= writes the discarded duplicates to a separate table.
13
Separating Unique & Duplicate Rows
    proc sort data=claims out=sorted;
      by claim;
    run;
    data unique_claims dup_claims;
      set sorted;
      by claim;
      /* a claim that is both first and last in its BY group occurs only once */
      if first.claim and last.claim then output unique_claims;
      else output dup_claims;
    run;
It works, but needs an extra pass of the data.
14
Separating Unique & Duplicate Rows – The Smarter Way
    proc sort data=claims out=duplicates uniqueout=uniques nouniquekey;
      by claim;
    run;
NOUNIQUEKEY ensures no records with a unique key are written to the OUT= table, and the UNIQUEOUT= option directs the unique records to a separate table.
15
Sorting – Case Insensitive
    proc sort data=names out=simply_sorted;
      by name;
    run;
Upper case letters come before lower case letters in the ASCII collating sequence, so a simple sort separates names that differ only in case.
    data names2;
      set names;
      upcase_name = upcase(name);
    run;
    proc sort data=names2 out=upcase_sorted(keep=name);
      by upcase_name;
    run;
Creating an upper (or lower) case copy of the variable is the old solution.
16
Sorting – Case Insensitive – Smarter
    proc sort data=names out=linguistic_sorted sortseq=linguistic;
      by name;
    run;
The SORTSEQ= option specifies the collating sequence (ASCII, EBCDIC or other languages) or, with LINGUISTIC, modifies the current collating sequence. The effect is to make the sort case-insensitive.
17
Sorting – Case Insensitive – with SQL
    proc sql;
      create table sql_sorted as
        select * from names
        order by upcase(name);
    quit;
PROC SQL allows the use of functions in the ORDER BY (and other) clauses. The result is different from PROC SORT using sortseq=linguistic.
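One possible reason for the difference (an assumption, not stated on the slide): by default, linguistic collation still uses case as a tie-breaker between otherwise equal values, whereas UPCASE() removes case entirely. The STRENGTH= sub-option of LINGUISTIC is documented as controlling this. A minimal sketch, with a hypothetical output name:

```sas
/* Sketch: primary_sorted is a hypothetical output name.             */
/* STRENGTH=PRIMARY compares base letters only, so values differing  */
/* only in case are treated as equal for sorting purposes.           */
proc sort data=names out=primary_sorted
          sortseq=linguistic(strength=primary);
  by name;
run;
```

Whether this matches the SQL result exactly depends on the data, for example on how spaces and punctuation appear in name.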
18
Sorting Out Spaces
    proc sort data=names out=simply_sorted;
      by name;
    run;
A standard sort is no use here, because embedded spaces affect the order.
    data names_temp;
      set names;
      temp_name = upcase(compress(name));   /* remove spaces, ignore case */
    run;
    proc sort data=names_temp out=temp_sorted(keep=name);
      by temp_name;
    run;
Creating another variable for sorting, without spaces, is the old solution.
19
Sorting Out Spaces
PROC SQL can do it too:
    proc sql;
      create table sql_sorted as
        select * from names
        order by upcase(compress(name));
    quit;
PROC SORT can too!
    proc sort data=names out=alt_handling_sorted
              sortseq = linguistic(alternate_handling = shifted);
      by name;
    run;
This sub-option of the LINGUISTIC sort sequence effectively ignores spaces, as well as being case-insensitive.
20
Sorting by Numbers
    proc sort data=students out=simply_sorted;
      by student;
    run;
Sorting text with numeric prefixes (e.g. student id and name) results in nothing useful, because the digits are compared character by character rather than as numbers.
21
Sorting by Numbers
An extra data step can create a numeric variable to sort with (as can SQL, of course).
    data students_temp;
      set students;
      /* first word of student, read as a number; the 2. informat reads at most two digits */
      student_num = input(scan(student,1), 2.);
    run;
    proc sort data=students_temp out=temp_sorted(keep=student);
      by student_num;
    run;
    proc sql;
      create table sql_sorted as
        select * from students
        order by input(scan(student,1), 2.);
    quit;
22
Sorting by Numbers
The NUMERIC_COLLATION sub-option of the LINGUISTIC sort sequence sorts by the numeric values that prefix the variable values.
    proc sort data=students out=num_collation_sorted
              sortseq = linguistic(numeric_collation = on);
      by student;
    run;
23
Questions?
Did you learn something new from this presentation?