Perl Regular Expression in SAS
  Sander Post, Senior Analyst, Statistics Canada
Contents
  What’s PERL?
  What’s a regular expression?
  What are they useful for?
    Finding
    Validating
    Changing
What’s PERL?
  A programming language known for text processing
  PERL’s syntax for regular expressions is built into SAS
  It is available through functions and call routines such as PRXPARSE, PRXMATCH and PRXSUBSTR: anything that starts with PRX
What’s a regular expression?
  From Wikipedia:
  A regular expression, regex or regexp is …. a sequence of characters that define a search pattern. … used by string searching algorithms for “find” or “find and replace” operations on strings.
What’s a regular expression?
  It isn’t find and replace in the sense of: find “Hello” and replace it with “Hi”
  It is more like: “Here’s a description of what a postal code looks like. Are there any matches for this description in this data?”
What’s a regular expression?
  Maybe an example helps
  Postal codes are of the form A1B 2C3, where A, B and C are letters and 1, 2 and 3 are digits
  There are more restrictions: for example, the first letter has only so many valid values
  Sometimes the space after the FSA (Forward Sortation Area, the first three characters) isn’t there
  Suppose I want to find postal codes in free form text fields
What’s a regular expression?
  How would you describe a postal code, generically, if looking for it in free form text?
  [letter] [digit] [letter] [optional space] [digit] [letter] [digit]
What’s a regular expression?
  In regular expressions, you generally put possibilities in square brackets
  “Any capital letter” can be represented by [A-Z]
  Similarly, any single digit is [0-9]
  An optional feature is marked by a question mark placed after it
What’s a regular expression?
  So: [A-Z][0-9][A-Z][ ]?[0-9][A-Z][0-9]
  To look for a postal code in free form text, we look for the above
What’s a regular expression?
    data example1;
        set postalcode;
        * Use PRXMATCH function: syntax prxmatch(regular expression, text);
        * And see if matching text can be found;
        * fp is the position of the first match, or 0 if there is none;
        fp = prxmatch("/[A-Z][0-9][A-Z][ ]?[0-9][A-Z][0-9]/", comment);
    run;
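To pull out the matching text rather than just its position, the same pattern can be combined with CALL PRXSUBSTR and SUBSTR. The following is a minimal sketch, not the presenter's program: the sample rows, the data set name example2 and the variables found, position and len are made up for illustration.

```sas
/* Hypothetical sample data: free form comments, two of them with postal codes */
data postalcode;
    infile datalines truncover;
    input comment $char60.;
    datalines;
Please mail it to K1A 0T6 before Friday
Moved last year, new code is M5V3L9
No postal code in this comment
;
run;

data example2;
    set postalcode;
    length found $7;
    /* Compile the pattern once, on the first row, and reuse the ID */
    retain patternID;
    if _n_ = 1 then patternID = prxparse("/[A-Z][0-9][A-Z][ ]?[0-9][A-Z][0-9]/");
    /* position and len are both 0 when nothing matches */
    call prxsubstr(patternID, comment, position, len);
    if position > 0 then found = substr(comment, position, len);
    drop patternID;
run;
```

With these rows, found holds K1A 0T6, M5V3L9 and a blank value. The pattern only matches capital letters; adding the i modifier after the closing slash (/.../i) would make it case-insensitive.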
What are regexes useful for?
  Pattern matching
  If you know what something should look like, you can find it in free form text
  And validate it if there are specific criteria
  And change it to meet the criteria
What are regexes useful for?
  Invalid character issues in e-mail addresses include:
    spaces: sander.post@ canada.ca
    commas instead of periods: sander.post@canada,ca
    double length character encoding issues: s¥a¥n¥d¥e¥r¥.¥p¥o¥s¥t¥@¥c …
    “at” instead of @: sander.post at canada.ca
    typos: @domain.con instead of @domain.com
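Several of these issues can be repaired with the PRXCHANGE function, which takes a Perl substitution (s/find/replace/), the number of replacements to make (-1 for all) and the source text. This is a rough sketch under assumptions: the data set name cleaned, the variable fixed and the fourth sample row are hypothetical, addresses are uppercased first, and the double length encoding issue is not handled.

```sas
data cleaned;
    length email fixed $60;
    infile datalines truncover;
    input email $char60.;
    fixed = upcase(strip(email));
    /* the word AT with spaces around it becomes @ (must run before spaces are removed) */
    fixed = prxchange("s/\s+AT\s+/@/", 1, strip(fixed));
    /* every comma becomes a period */
    fixed = prxchange("s/,/./", -1, strip(fixed));
    /* remove any remaining embedded spaces */
    fixed = prxchange("s/\s+//", -1, strip(fixed));
    /* a common typo: .CON at the very end becomes .COM */
    fixed = prxchange("s/\.CON$/.COM/", 1, strip(fixed));
    datalines;
sander.post@ canada.ca
sander.post@canada,ca
sander.post at canada.ca
someone@domain.con
;
run;
```

The first three rows all come out as SANDER.POST@CANADA.CA and the last as SOMEONE@DOMAIN.COM. STRIP is applied before each call because SAS pads character variables with trailing blanks, which would otherwise stop $ from matching at the end of the address. Blanket rules like "every comma is a period" are a judgment call and worth checking against real data.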
What are regexes useful for?
  If we fix those, is what is left a valid e-mail address?
  What’s a valid e-mail address anyway?
  Not 100% well defined, but there are online documents
What are they useful for?
    data validate;
        set source;
        pattern="/[A-Z0-9-_][A-Z0-9-_\.]*[A-Z0-9-_]@[A-Z0-9-_][A-Z0-9-_\.]*\.[A-Z][A-Z][A-Z]?[A-Z]?/";
        * note that . is a special character in regexes and needs to be preceded by a \ to be treated as a .;
        patternID=prxparse(pattern);
        * position and length are both 0 when no match is found;
        call prxsubstr(patternID, email, position, length);
    run;
  Slide callouts on the pattern: First & last name (before the @), Company name (after the @)
  OASUS, Tuesday, May-15-18
What are they useful for?
  This method eliminates things that look kind of valid but aren’t: SANDER.POST@.CA
  It does allow some things that are invalid: SANDER.POST@CANADA..........CA
  We can make refinements
  The regex ends in: \.[A-Z][A-Z][A-Z]?[A-Z]?
  Which means “period”, “letter”, “letter”, optional letter, optional letter
  So it finds strings ending in “.CA”, “.COM”, “.NET”, “.INFO”
  But also ones ending in “.HXQZ”
What are they useful for?
  We can refine that using a list
  End with \.(COM|CA|NET|GOV)
  [A-Z0-9-_][A-Z0-9-_\.]*[A-Z0-9-_]@[A-Z0-9-_\.]*[A-Z0-9-_]\.(COM|CA|NET|GOV)
  The remainders reveal potential typos like HOTMAIL.CON or ROGERS.COMN or GMAIL.CM
  Analysis of the remainder can be used to expand the domain list, or you can download a list of domains and use that
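One way to separate the matches from the remainder is to write them to two data sets. The sketch below is an illustration, not the presenter's code: the data set names valid and remainder and the sample rows are hypothetical, the hyphen is moved to the end of each character class so it is unambiguously literal, and the anchors ^ and $ are my addition so that the whole address must match. Without the $, an address ending in .COMN would still count as a match because it contains .COM.

```sas
data valid remainder;
    length email $60;
    infile datalines truncover;
    input email $char60.;
    email = upcase(email);
    /* Compile once; ^ and $ force the pattern to cover the entire address */
    retain patternID;
    if _n_ = 1 then patternID = prxparse("/^[A-Z0-9_-][A-Z0-9_.-]*[A-Z0-9_-]@[A-Z0-9_.-]*[A-Z0-9_-]\.(COM|CA|NET|GOV)$/");
    /* strip() removes the trailing blanks that would defeat the $ anchor */
    if prxmatch(patternID, strip(email)) then output valid;
    else output remainder;
    drop patternID;
    datalines;
sander.post@canada.ca
someone@hotmail.con
someone@rogers.comn
someone@gmail.cm
;
run;
```

Here only the first row lands in valid; the other three go to remainder, where a frequency count of their endings would surface the typos.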
What are they useful for?
  Example program: matching phone numbers in different formats
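The example program itself is not reproduced here, so the following is only a sketch of what such a program might look like. The four input formats, the data set name phones and the variables raw and standard are all assumptions. Parentheses in the pattern capture the three groups of digits, and the PRXPOSN function returns each captured group so the number can be rebuilt in one standard format.

```sas
data phones;
    length raw $40 standard $12;
    infile datalines truncover;
    input raw $char40.;
    retain patternID;
    /* optional ( ), then 3 digits, 3 digits, 4 digits, each optionally
       separated by a space, period or hyphen */
    if _n_ = 1 then patternID = prxparse("/\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})/");
    if prxmatch(patternID, raw) then
        standard = catx("-", prxposn(patternID, 1, raw),
                             prxposn(patternID, 2, raw),
                             prxposn(patternID, 3, raw));
    drop patternID;
    datalines;
613-555-0123
(613) 555-0123
613.555.0123
6135550123
call me tomorrow
;
run;
```

The first four rows all produce 613-555-0123 in standard; the last row has no match, so standard stays blank.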