Perl Regular Expression in SAS


1 Perl Regular Expression in SAS
Sander Post Senior Analyst, Statistics Canada

2 Contents What’s PERL? What’s a regular expression?
What are they useful for? Finding Validating Changing

3 What’s PERL? A programming language known for text processing
PERL’s syntax for regular expressions is built into SAS, in functions and call routines like PRXPARSE, PRXMATCH and PRXSUBSTR: anything that starts with PRX.

4 What’s a regular expression?
From Wikipedia: A regular expression, regex or regexp is …. a sequence of characters that define a search pattern. … used by string searching algorithms for “find” or “find and replace” operations on strings.

5 What’s a regular expression?
It isn’t find and replace in the sense of “find ‘Hello’ and replace it with ‘Hi’”. It is more like “here’s a description of what a postal code looks like. Are there any matches for this description in this data?”

6 What’s a regular expression?
Maybe an example helps. Postal codes are of the form A1B 2C3, where A, B and C are letters and 1, 2 and 3 are digits. There are more restrictions: the first letter only has so many valid values, for example. Sometimes the space after the FSA (the first three characters) isn’t there. Suppose I want to find postal codes in free-form text fields.

7 What’s a regular expression?
How would you describe a postal code, generically, if looking for it in free form text? [letter] [digit] [letter] [optional space] [digit] [letter] [digit]

8 What’s a regular expression?
In regular expressions, you generally put possibilities in square brackets. “Any capital letter” can be represented by [A-Z]. Similarly, any single digit is [0-9]. An optional feature is marked by putting a question mark after it.

9 What’s a regular expression?
So: [A-Z][0-9][A-Z][ ]?[0-9][A-Z][0-9] So, to look for a postal code in free form text, we look for the above

10 What’s a regular expression?
data example1(drop=pattern);
   set postalcode;
   * Use PRXMATCH function: syntax prxmatch(regular expression, text);
   * And see if matching text can be found;
   fp=prxmatch("/[A-Z][0-9][A-Z][ ]?[0-9][A-Z][0-9]/" , comment);
run;
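To try the PRXMATCH step without the original data, a minimal sketch might look like the following. The data set and variable names follow the slide, but the comment values are made up for illustration.

```sas
/* Hypothetical test data: the values are invented for this sketch */
data postalcode;
   infile datalines truncover;
   input comment $char60.;
   datalines;
Please send the form to K1A 0T6 by Friday
Moved last year, new code is M5V3L9
No postal code in this comment
;
run;

data example1;
   set postalcode;
   /* fp is the position of the first match, or 0 if nothing matches */
   fp = prxmatch("/[A-Z][0-9][A-Z][ ]?[0-9][A-Z][0-9]/", comment);
run;

proc print data=example1;
run;
```

With this data, fp should be greater than zero for the first two rows and 0 for the third. Because [A-Z] only covers capital letters, lower-case postal codes would be missed; adding the i modifier after the closing slash (/.../i) is one way to make the match case-insensitive.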

11 What are regexes useful for?
Pattern matching. If you know what something should look like, you can find it in free-form text, validate it if there are specific criteria, and change it to meet the criteria.
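As a rough sketch of validating and changing, using the postal code pattern again: PRXMATCH with ^ and $ anchors checks a whole value, and PRXCHANGE rewrites it. The data set, variable names and values here are hypothetical.

```sas
/* Hypothetical values: one valid, one missing the space, one invalid */
data codes;
   input code $char10.;
   datalines;
K1A 0T6
M5V3L9
K1A-0T6
;
run;

data checked;
   set codes;
   length standardized $10;
   /* Validate: the whole value must be a postal code. \s*$ allows for the
      trailing blanks that pad a SAS character variable. */
   is_valid = (prxmatch('/^[A-Z][0-9][A-Z] ?[0-9][A-Z][0-9]\s*$/', code) > 0);
   /* Change: insert the space after the first three characters when it is
      missing. $1 and $2 refer to the two parenthesised groups. */
   standardized = prxchange('s/^([A-Z][0-9][A-Z])([0-9][A-Z][0-9])/$1 $2/', 1, code);
run;
```

Values that do not match the substitution pattern, such as K1A-0T6, are left unchanged by PRXCHANGE and are flagged by is_valid = 0.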

12 What are regexes useful for?
Invalid character issues include: spaces: canada.ca; commas instead of periods; double length character encoding issues: …; “at” instead of “@”: sander.post at canada.ca; typos: @domain.con instead of @domain.com
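Several of these issues can be repaired with PRXCHANGE before validating. The following is only a sketch under assumptions: the addresses are invented, the variable names are hypothetical, and the rules are deliberately blunt (every comma becomes a period, for example), so real data would need more care.

```sas
/* Hypothetical addresses showing the issues listed above */
data emails;
   infile datalines truncover;
   input raw $char40.;
   datalines;
first.last @example.ca
first,last@example.ca
first.last at example.ca
first.last@example.con
;
run;

data cleaned;
   set emails;
   length fixed $40;
   /* " at " written out instead of @ (done before spaces are removed) */
   fixed = prxchange('s/\s+at\s+/@/i', 1, raw);
   /* remove all remaining white space */
   fixed = prxchange('s/\s+//', -1, fixed);
   /* commas typed instead of periods */
   fixed = prxchange('s/,/./', -1, fixed);
   /* .con typo at the end of the address; \s*$ allows for trailing blanks */
   fixed = prxchange('s/\.con\s*$/.com/i', 1, fixed);
run;
```

The second argument of PRXCHANGE is the number of replacements to make; -1 means replace every occurrence.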

13 What are regexes useful for?
If we fix those, is what is left a valid address? What’s a valid address anyways? Not 100% well defined, but there are online documents

14 What are they useful for?
OASUS, Tuesday, May-15-18
data validate;
   set source;
   pattern = "/…\.[A-Z][A-Z][A-Z]?[A-Z]?/";
   * note that . is a special character in regexes and needs to be preceded by a \ to be treated as a .;
   patternID=prxparse(pattern);
   call prxsubstr(patternID, , position, length);
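A self-contained sketch of the PRXPARSE / CALL PRXSUBSTR approach might look like this. The part of the pattern before \. is an assumed, simplified description of an email address and not necessarily the one used in the presentation; the data and the variable names comment and found are also made up.

```sas
/* Hypothetical free-form text, in upper case to suit the [A-Z] classes */
data source;
   infile datalines truncover;
   input comment $char60.;
   datalines;
CONTACT JANE.DOE@EXAMPLE.COM FOR DETAILS
NO ADDRESS HERE
;
run;

data validate;
   set source;
   length found $60;
   retain patternID;
   /* Compile the pattern once and reuse the ID on every row.
      Everything before \. is an assumption made for this sketch. */
   if _n_ = 1 then
      patternID = prxparse('/[A-Z0-9._-]+@[A-Z0-9.-]+\.[A-Z][A-Z][A-Z]?[A-Z]?/');
   /* position and length describe the matched text; position is 0 if no match */
   call prxsubstr(patternID, comment, position, length);
   if position > 0 then found = substr(comment, position, length);
run;
```

CALL PRXSUBSTR only locates the match. SUBSTR is what actually pulls the matched text out of the comment.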

15 What are they useful for?
This method eliminates things that look kind of valid but aren’t. It does allow some things that are invalid, but we can make refinements. The regex ends in \.[A-Z][A-Z][A-Z]?[A-Z]?, which means “period, letter, letter, optional letter, optional letter”. So it finds strings ending in “.CA”, “.COM”, “.NET” and “.INFO”, but also strings ending in “.HXQZ”.

16 What are they useful for?
We can refine that using a list: end the regex with \.(COM|CA|NET|GOV). The remainders reveal potential typos like HOTMAIL.CON, ROGERS.COMN or GMAIL.CM. Analysis of the remainder can be used to expand the domain list, or you can download a list of domains from online and use that.
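One way to separate the accepted addresses from the remainder is sketched below. The addresses are invented, the data set and variable names are hypothetical, and the four-domain list is just the one from the slide.

```sas
/* Hypothetical addresses, including the typo patterns mentioned above */
data addresses;
   infile datalines truncover;
   input email $char40.;
   datalines;
A.PERSON@HOTMAIL.COM
B.PERSON@HOTMAIL.CON
C.PERSON@ROGERS.COMN
D.PERSON@EXAMPLE.CA
;
run;

data accepted remainder;
   set addresses;
   /* The address must end in one of the listed domains.
      \s*$ allows for trailing blanks in the character variable. */
   if prxmatch('/\.(COM|CA|NET|GOV)\s*$/', email) then output accepted;
   else output remainder;
run;
```

The remainder data set is then the place to look for typos and for legitimate domains that should be added to the list.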

17 What are they useful for?

18 What are they useful for?
Example program – matching phone numbers – in different formats
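The example program itself is not part of this transcript. As a stand-in, here is a minimal sketch that assumes ten-digit North American numbers, with optional parentheses around the area code and a hyphen, period or space as separators. The phone numbers, data set names and variable names are all made up.

```sas
/* Hypothetical phone numbers in several formats */
data phones;
   infile datalines truncover;
   input raw $char30.;
   datalines;
613-555-0199
(613) 555-0199
613.555.0199
6135550199
call 555-0199
;
run;

data phones_checked;
   set phones;
   length standardized $30;
   /* \d is shorthand for [0-9] and {3} means "exactly three times" */
   has_phone = (prxmatch('/\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}/', raw) > 0);
   /* Rewrite any matched number as 999-999-9999 using the captured groups */
   standardized = prxchange('s/\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})/$1-$2-$3/', 1, raw);
run;
```

The last row has only seven digits, so it does not match and is left unchanged.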

