Derek Morgan, Principal Statistical Programmer, PAREXEL International

Presenter: Derek Morgan, Principal Statistical Programmer, PAREXEL International. Derek has been a SAS user for over 30 years and is the author of the SAS Press book “The Essential Guide to SAS® Dates and Times”. In his spare time, Derek plays bass guitar and has performed with three members of the Rock and Roll Hall of Fame.

PROC SORT (then and) NOW
Derek Morgan, PAREXEL International

IN THE BEGINNING
PROC SORT DATA=sas-dataset;
  BY by-variable(s);
RUN;
- At least one of these in almost every SAS program
- Biggest option was to use something other than IBM’s SYNCSORT as the sorting algorithm
- Limited storage resources made sorting in place desirable

NOW?
PROC SORT DATA=sas-dataset collating-sequence-option (???) other options (e.g., NODUPKEY);
RUN;

WHAT’S A “collating-sequence-option”?
THEN: ASCII or EBCDIC
NOW? THAT:

ASCII
  sorts character variables using the ASCII collating sequence. You need this option only when you want to achieve an ASCII ordering on a system where EBCDIC is the native collating sequence.
DANISH | NORWEGIAN
  sorts characters according to the Danish and Norwegian convention.
EBCDIC
  sorts character variables using the EBCDIC collating sequence. You need this option only when you want to achieve an EBCDIC ordering on a system where ASCII is the native collating sequence.
FINNISH | SWEDISH
  sorts characters according to the Finnish and Swedish convention.
NATIONAL
  sorts character variables using an alternate collating sequence, as defined by your installation, to reflect a country's National Use Differences.
REVERSE
  sorts character variables using a collating sequence that is reversed from the normal collating sequence.
SORTSEQ=collating-sequence
  The collating-sequence can be one of the following:
  collating-sequence-option | translation_table
    specifies one of the PROC SORT statement collating-sequence-options (ASCII, DANISH, EBCDIC, FINNISH, NORWEGIAN, REVERSE, SWEDISH) or a translation table, which can be one that SAS provides or any user-defined translation table. Translation tables provided by SAS are: ASCII, DANISH, EBCDIC, FINNISH, ITALIAN, NORWEGIAN, POLISH, REVERSE, SPANISH, and SWEDISH.
  encoding-value
    specifies an encoding value. The result is the same as a binary collation of the character data represented in the specified encoding.
  LINGUISTIC<(collating-options)>
    specifies linguistic collation, which sorts characters in a culturally sensitive manner according to rules that are associated with a language and locale.

The following options can be used when specifying SORTSEQ=LINGUISTIC. These options modify the linguistic collating sequence:

ALTERNATE_HANDLING=SHIFTED
  controls the handling of variable characters like spaces, punctuation, and symbols. When this option is not specified (using the default value Non-Ignorable), differences among these variable characters are of the same importance as differences among letters. If the ALTERNATE_HANDLING option is specified, these variable characters are of minor importance.
CASE_FIRST=
  specifies the order of uppercase and lowercase letters. This argument is valid for only TERTIARY, QUATERNARY, or IDENTICAL levels.
    UPPER  Sorts uppercase letters first, then the lowercase letters.
    LOWER  Sorts lowercase letters first, then the uppercase letters.
COLLATION=
  specifies character ordering.
    BIG5HAN      Specifies Pinyin ordering for Latin and specifies big5 charset ordering for Chinese, Japanese, and Korean characters.
    DIRECT       Specifies a Hindi variant.
    GB2312HAN    Specifies Pinyin ordering for Latin and specifies gb2312han charset ordering for Chinese, Japanese, and Korean characters.
    PHONEBOOK    Specifies a telephone-book style for ordering of characters. Select PHONEBOOK only with the German language.
    PINYIN       Specifies an ordering for Chinese, Japanese, and Korean characters based on character-by-character transliteration into Pinyin. This ordering is typically used with simplified Chinese.
    POSIX        Portable Operating System Interface. This option specifies a “C” locale ordering of characters.
    STROKE       Specifies a nonalphabetic writing style ordering of characters. Select STROKE with Chinese, Japanese, Korean, or Vietnamese languages. This ordering is typically used with Traditional Chinese.
    TRADITIONAL  Specifies a traditional style for ordering of characters. For example, select TRADITIONAL with the Spanish language.
LOCALE=locale_name
  specifies the locale name in the form of a POSIX name (for example, ja_JP).
NUMERIC_COLLATION=
  orders integer values within the text by the numeric value instead of the characters used to represent the numbers.
    ON   Order numbers by the numeric value. For example, "8 Main St." would sort before "45 Main St.".
    OFF  Order numbers by the character value. For example, "45 Main St." would sort before "8 Main St.".
STRENGTH=
  The value of strength is related to the collation level. There are five collation-level values. The default value for strength is related to the locale.
    PRIMARY or 1
      PRIMARY specifies differences between base characters (for example, "a" < "b"). It is the strongest difference. For example, dictionaries are divided into different sections by base character.
    SECONDARY or 2
      Accents in the characters are considered secondary differences (for example, "as" < "às" < "at"). A secondary difference is ignored when there is a primary difference anywhere in the strings. Other differences between letters can also be considered secondary differences, depending on the language.
    TERTIARY or 3
      Upper and lowercase differences in characters are distinguished at the tertiary level (for example, "ao" < "Ao" < "aò"). For an example, see Linguistic Sorting Using ALTERNATE_HANDLING=. A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. Another example is the difference between large and small Kana.
    QUATERNARY or 4
      When punctuation is ignored at levels 1-3, an additional level can be used to distinguish words with and without punctuation (for example, "a-b" < "ab" < "aB"). For an example, see Linguistic Sorting Using ALTERNATE_HANDLING= and STRENGTH=. The quaternary level should be used if ignoring punctuation is required or when processing Japanese text. This difference is ignored when there is a primary, secondary, or tertiary difference.
    IDENTICAL or 5
      When all other levels are equal, the identical level is used as a tiebreaker. The Unicode code point values of the Normalization Form D (NFD) form of each string are compared at this level, just in case there is no difference at levels 1-4. This level should be used sparingly, because code-point value differences between two strings rarely occur. For example, only Hebrew cantillation marks are distinguished at this level.

Collating Sequence Options
- Recognition that not every SAS user works with data in English
- A part of National Language Support (NLS)
- Only one collating sequence permitted for a given PROC SORT
- Collating done with translation tables or by rules (“LINGUISTIC Collation”)

Translation Tables
- Several included with SAS: ASCII, DANISH, EBCDIC, FINNISH, ITALIAN, NORWEGIAN, POLISH, REVERSE, SPANISH, SWEDISH
- Can create your own translation table
- Consult your OS guide for possible differences in implementation of translation tables

LINGUISTIC only works in PROC SORT
SORTSEQ= option
- Can specify a translation table, NATIONAL, or LINGUISTIC
- Can be specified as a system option
- BUT... LINGUISTIC only works in PROC SORT
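A minimal sketch of the ways to request a collating sequence listed above. The dataset CITIES, the variable NAME, and the output dataset names are hypothetical and chosen only for illustration.

```sas
/* Hypothetical data; dataset and variable names are assumptions. */
data cities;
  length name $20;
  input name $;
  datalines;
Zagreb
Aarhus
malmo
Oslo
;
run;

/* Collating-sequence option given as a keyword on the PROC SORT statement. */
proc sort data=cities out=cities_ebcdic ebcdic;
  by name;
run;

/* The same kind of request through SORTSEQ=, naming a SAS-supplied translation table. */
proc sort data=cities out=cities_swedish sortseq=swedish;
  by name;
run;

/* Rules-based collation is requested with SORTSEQ=LINGUISTIC. */
proc sort data=cities out=cities_ling sortseq=linguistic;
  by name;
run;
```

Each step writes to its own OUT= dataset so the three orderings can be compared side by side.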

Rules-based (LINGUISTIC) Collation
- There are many possibilities and lots of details; you will want to refer to the documentation
- Can now control:
  - Handling of special characters
  - Case priority
  - Character ordering
  - Whether numbers in text are treated as numbers or text
- 5 levels of differentiation between characters available

Five Levels of Differentiation?
- PRIMARY: Differences between base characters
  a≠b, but a=á, and a=A
- SECONDARY: Differences in accents in characters
  a≠b, a≠á, but a=A
- TERTIARY (default sort US English): Differences in case detected; also for detecting numerals in text strings
  a≠b, a≠á, and a≠A
- QUATERNARY: Punctuation differences
  “I see.” ≠ ‘I see,’

Didn’t You Say Five Levels of Differentiation?
- IDENTICAL: Anybody need to distinguish cantillation marks in Hebrew in their data?

How to Make Your Sort Case-Insensitive
PROC SORT DATA=NAMES SORTSEQ=LINGUISTIC(STRENGTH=PRIMARY);
  BY lname;
RUN;
The default STRENGTH for US English is TERTIARY

PROC SORT, no SORTSEQ options    SORTSEQ=LINGUISTIC(STRENGTH=PRIMARY)
MACK                             Macallen
MacCarron                        Macarthur
MacManus                         Maccaray
Macallen                         MacCarron
Macarthur                        Maccarron
Maccaray                         MACK
Maccarron                        MacManus
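The comparison above could be reproduced with a sketch like the following. The dataset NAMES and the variable lname come from the slide; the DATALINES input and the two OUT= dataset names are assumptions added for illustration.

```sas
/* Build the seven surnames shown on the slide. */
data names;
  length lname $20;
  input lname $;
  datalines;
Maccarron
MACK
Macallen
MacManus
Maccaray
MacCarron
Macarthur
;
run;

/* Default sort: on an ASCII system, uppercase letters collate before lowercase. */
proc sort data=names out=names_default;
  by lname;
run;

/* Case-insensitive sort: PRIMARY strength compares base characters only. */
proc sort data=names out=names_nocase sortseq=linguistic(strength=primary);
  by lname;
run;

proc print data=names_default; run;
proc print data=names_nocase;  run;
```

Writing to separate OUT= datasets leaves NAMES untouched, so both orderings can be printed from the same input.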

Rules-based (LINGUISTIC) Sorting Options
- STRENGTH: PRIMARY through IDENTICAL
- ALTERNATE_HANDLING: Special characters in strings
- CASE_FIRST: Upper or lower case comes first?
- COLLATION: Sensitive to language differences, non-alphabetic characters

More Rules-based (LINGUISTIC) Sorting Options
- LOCALE: Specify locale
- NUMERIC_COLLATION: Also known as “The Address Problem”
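Several of these rules can be combined in one SORTSEQ=LINGUISTIC request. The sketch below assumes the rules are separated by spaces inside the parentheses; the dataset WORDS, the variable TERM, and the choice of LOCALE=en_US are hypothetical. CASE_FIRST= is paired with STRENGTH=TERTIARY because it is only valid at the tertiary level or above.

```sas
/* Hypothetical data: the same words in different letter case. */
data words;
  length term $12;
  input term $;
  datalines;
apple
Apple
banana
Banana
APPLE
;
run;

/* Assumption: multiple collating rules are space-separated.          */
/* CASE_FIRST=UPPER asks for uppercase forms ahead of lowercase forms */
/* among strings that otherwise compare equal.                        */
proc sort data=words out=words_upper_first
          sortseq=linguistic(locale=en_US strength=tertiary case_first=upper);
  by term;
run;

proc print data=words_upper_first; run;
```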

The Address Problem
No problem, right?
1801 Somewhere Ave.

Not quite what you wanted...
PROC SORT DATA=address;
  BY mail1;
RUN;

10381 Somewhere Ave.
12425 Somewhere Ave.
1652 Somewhere Ave.
1801 Somewhere Ave.
4177 Somewhere Ave.
4200 Somewhere Ave.
506 Somewhere Ave.
7137 Somewhere Ave.
7262 Somewhere Ave.

PROC SORT DATA=address SORTSEQ=LINGUISTIC(NUMERIC_COLLATION=ON);
  BY mail1;
RUN;

506 Somewhere Ave.
1652 Somewhere Ave.
1801 Somewhere Ave.
4177 Somewhere Ave.
4200 Somewhere Ave.
7137 Somewhere Ave.
7262 Somewhere Ave.
10381 Somewhere Ave.
12425 Somewhere Ave.
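For anyone who wants to try the address example, a self-contained sketch follows. The dataset ADDRESS and the variable mail1 are taken from the slides; the DATALINES input, the variable length, and the OUT= dataset names are assumptions.

```sas
/* Rebuild the address list from the slides. */
data address;
  infile datalines truncover;   /* allow short lines when reading a fixed width */
  input mail1 $40.;             /* read the whole line, embedded blanks included */
  datalines;
1801 Somewhere Ave.
506 Somewhere Ave.
12425 Somewhere Ave.
4177 Somewhere Ave.
7262 Somewhere Ave.
10381 Somewhere Ave.
1652 Somewhere Ave.
4200 Somewhere Ave.
7137 Somewhere Ave.
;
run;

/* Character-by-character comparison: "10381" sorts ahead of "506". */
proc sort data=address out=address_char;
  by mail1;
run;

/* Digits within the text are compared by numeric value. */
proc sort data=address out=address_numeric
          sortseq=linguistic(numeric_collation=on);
  by mail1;
run;
```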

Other Useful PROC SORT Options
- NODUPKEY: Eliminate records with duplicate keys
- DUPOUT=: Output duplicate keyed records that are not kept to a dataset
- EQUALS/NOEQUALS: Controls which of the duplicate keyed records is written to that dataset
- NODUPLICATES: NO LONGER DOCUMENTED. PROC SQL; SELECT DISTINCT * works better anyway
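A short sketch of NODUPKEY together with DUPOUT= and EQUALS. The dataset MYDATA, its variables, and the output dataset names FIRSTS and EXTRAS are hypothetical.

```sas
/* Hypothetical data: key 1 is unique, keys 2 and 3 repeat. */
data mydata;
  input keyvar value;
  datalines;
1 10
2 20
2 21
3 30
3 31
3 32
;
run;

/* NODUPKEY keeps one record per key in OUT=.                        */
/* DUPOUT= receives the records that NODUPKEY discarded.             */
/* EQUALS preserves input order within a key, so the record kept is  */
/* the first one encountered for that key.                           */
proc sort data=mydata out=firsts dupout=extras nodupkey equals;
  by keyvar;
run;
```

With this input, FIRSTS should hold one record for each of the three keys and EXTRAS the remaining three records.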

DATA inspectdups;
  SET mydata;
  BY keyvar;
  /* keep every record whose key value occurs more than once */
  IF NOT (FIRST.keyvar AND LAST.keyvar) THEN OUTPUT;
RUN;
No longer necessary!

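- NOUNIQUEKEY: Eliminate records with unique keys
- OUT=: Output duplicate keyed records to a dataset
- UNIQUEOUT=: Output records with unique keys to a dataset
- WARNING! ALWAYS USE OUT= and/or UNIQUEOUT= ! Without OUT=, PROC SORT replaces the input dataset, and the records with unique keys are gone.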
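A sketch of NOUNIQUEKEY used as the warning recommends, with both OUT= and UNIQUEOUT= so the input dataset is left alone. The dataset MYDATA, its variables, and the output names DUPS and SINGLES are hypothetical.

```sas
/* Hypothetical data: key 1 is unique, keys 2 and 3 repeat. */
data mydata;
  input keyvar value;
  datalines;
1 10
2 20
2 21
3 30
3 31
3 32
;
run;

/* OUT= receives every record whose key occurs more than once.  */
/* UNIQUEOUT= receives the records whose key occurs only once.  */
/* MYDATA itself is not overwritten because OUT= is specified.  */
proc sort data=mydata nouniquekey out=dups uniqueout=singles;
  by keyvar;
run;
```

With this input, DUPS should hold the five records for keys 2 and 3, and SINGLES the one record for key 1.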

Even More PROC SORT Options
- REVERSE: Collate in reverse order
- DATECOPY: Retains the original date and time of the unsorted dataset
- PRESORTED: Think it’s already sorted? Use this option. SAS checks the order first and skips the sort if the data is already in sequence.
  Don’t use with ACCESS engines or DBMS! Accessing records in sorted order is not guaranteed
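A brief sketch of PRESORTED and DATECOPY on a native SAS dataset. The dataset names and variables are hypothetical.

```sas
/* Hypothetical data, deliberately out of order. */
data mydata;
  input keyvar value;
  datalines;
3 30
1 10
2 20
;
run;

/* First sort puts the data in key order. */
proc sort data=mydata out=sorted;
  by keyvar;
run;

/* PRESORTED: SAS verifies the order before sorting; because SORTED */
/* is already in sequence by keyvar, no sort should be needed.      */
proc sort data=sorted presorted;
  by keyvar;
run;

/* DATECOPY: re-sort in place while carrying over the dataset's     */
/* original creation and modification date-times.                   */
proc sort data=sorted datecopy;
  by descending keyvar;
run;
```

The SAS log for the PRESORTED step is the place to confirm whether the sort was actually skipped.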

This is NOT your SAS instructor’s PROC SORT...
Summary
- PROC SORT options have replaced the need for code and macros for frequently encountered situations
- PROC SORT is National Language Support-enabled
- Much more capable of handling international data and foreign languages

Questions?

More Information
“Linguistic Collation: Everyone Can Get What They Expect: Sensible Sorting for Global Business Success” ollation.pdf.
A shorter overview: pdf
Remember: If you don’t use it when you need it, you’re not getting your money’s worth!

Contact Information
Name: Derek Morgan
City/State: St. Louis, MO
