Perl Regular Expression in SAS


Perl Regular Expression in SAS Sander Post Senior Analyst, Statistics Canada

Contents What’s PERL? What’s a regular expression? What are they useful for? Finding Validating Changing

What’s PERL? A programming language known for text processing. PERL’s syntax for regular expressions is built into SAS, in functions and call routines such as PRXPARSE, PRXMATCH and PRXSUBSTR: anything that starts with PRX.

What’s a regular expression? From Wikipedia: A regular expression, regex or regexp is …. a sequence of characters that define a search pattern. … used by string searching algorithms for “find” or “find and replace” operations on strings.

What’s a regular expression? It isn’t find and replace in the sense of “find ‘Hello’ and replace with ‘Hi’”. It is more like: “Here’s a description of what a postal code looks like. Are there any matches of this description in this data?”

What’s a regular expression? Maybe an example helps. Postal codes are of the form A1B 2C3, where A, B and C are letters and 1, 2 and 3 are digits. There are more restrictions: the first letter only has so many valid values, for example. Sometimes the space after the FSA (the forward sortation area, the first three characters) isn’t there. Suppose I want to find postal codes in free-form text fields.

What’s a regular expression? How would you describe a postal code, generically, if looking for it in free form text? [letter] [digit] [letter] [optional space] [digit] [letter] [digit]

What’s a regular expression? In regular expressions, you generally put possibilities in square brackets. “Any capital letter” can be represented by [A-Z]. Similarly, any single digit is [0-9]. Optional features are marked with a question mark after them.

What’s a regular expression? Putting it together: [A-Z][0-9][A-Z][ ]?[0-9][A-Z][0-9] To look for a postal code in free-form text, we search for that pattern.

What’s a regular expression?

data example1;
    set postalcode;
    * Use PRXMATCH function: syntax prxmatch(regular expression, text);
    * It returns the position where matching text starts, or 0 if none is found;
    fp = prxmatch("/[A-Z][0-9][A-Z][ ]?[0-9][A-Z][0-9]/", comment);
run;
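The slides do not show the POSTALCODE data set, so the following is a minimal sketch that makes the example self-contained. The test rows, the variable names found and len, and the use of CALL PRXSUBSTR to pull out the matched text are assumptions added for illustration, not part of the original program.

```sas
/* Hypothetical test data standing in for the POSTALCODE data set */
data postalcode;
    length comment $80;
    infile datalines truncover;
    input comment $char80.;
    datalines;
Please mail it to K1A 0T6 by Friday
New address: K1A0T6
No postal code in this comment
;
run;

data example1(drop=patternID);
    set postalcode;
    length found $7;
    /* A constant pattern is compiled only once by PRXPARSE */
    patternID = prxparse("/[A-Z][0-9][A-Z][ ]?[0-9][A-Z][0-9]/");
    /* fp = start of the first match (0 if none), len = its length */
    call prxsubstr(patternID, comment, fp, len);
    /* Copy the matched text, with or without the optional space */
    if fp > 0 then found = substr(comment, fp, len);
run;
```

Because the pattern only lists capital letters, a lower-case postal code such as k1a 0t6 would not be found; adding the i option after the closing slash is one way to ignore case.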

What are regexes useful for? Pattern matching. If you know what something should look like, you can find it in free-form text, validate it if there are specific criteria, and change it to meet the criteria.

What are regexes useful for? Invalid character issues include: spaces (sander.post@ canada.ca); commas instead of periods (sander.post@canada,ca); double-length character encoding issues (s¥a¥n¥d¥e¥r¥.¥p¥o¥s¥t¥@¥c …); “at” instead of @ (sander.post at canada.ca); and typos (@domain.con instead of @domain.com).
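As a rough illustration of the “changing” use, the sketch below repairs four of the issues listed above with PRXCHANGE. The data set names raw and cleaned, the variable fixed, and the specific substitution rules are assumptions for this example only; the encoding issue is not handled here.

```sas
/* Hypothetical raw values modelled on the examples in the slide */
data raw;
    length email $60;
    infile datalines truncover;
    input email $char60.;
    datalines;
sander.post@ canada.ca
sander.post@canada,ca
sander.post at canada.ca
sander.post@canada.con
;
run;

data cleaned;
    set raw;
    length fixed $60;
    fixed = strip(email);
    /* the word "at" between blanks becomes @ (plain text swap, no regex needed) */
    fixed = tranwrd(fixed, ' at ', '@');
    /* remove every remaining run of whitespace (-1 = replace all matches) */
    fixed = prxchange('s/\s+//', -1, strip(fixed));
    /* a comma where a period belongs */
    fixed = prxchange('s/,/./', -1, fixed);
    /* known typo at the very end of the value: .con becomes .com */
    fixed = prxchange('s/\.con$/.com/i', 1, strip(fixed));
run;
```

STRIP is applied before the last substitution because SAS pads character variables with trailing blanks, and those blanks would otherwise stop the $ anchor from matching. Blanket rules like “every comma becomes a period” suit a field that should hold only an address; they would be too aggressive on general free-form text.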

What are regexes useful for? If we fix those, is what is left a valid e-mail address? What’s a valid e-mail address anyway? It is not 100% well defined, but there are online documents.

What are they useful for?

data validate;
    set source;
    pattern = "/[A-Z0-9-_][A-Z0-9-_\.]*[A-Z0-9-_]@[A-Z0-9-_][A-Z0-9-_\.]*\.[A-Z][A-Z][A-Z]?[A-Z]?/";
    * note that . is a special character in regexes and needs to be preceded by a \ to be treated as a literal period;
    patternID = prxparse(pattern);
    * position is where the match starts (0 if none) and length is how long it is;
    call prxsubstr(patternID, email, position, length);
run;

Slide footer: OASUS Spring or Fall YYYY, Tuesday, May-15-18

What are they useful for? This method eliminates things that look kind of valid but aren’t, such as SANDER.POST@.CA. It does allow some things that are invalid, such as SANDER.POST@CANADA..........CA. We can make refinements. The regex ends in \.[A-Z][A-Z][A-Z]?[A-Z]?, which means period, letter, letter, optional letter, optional letter. So it finds strings ending in “.CA”, “.COM”, “.NET” or “.INFO”, but also strings ending in “.HXQZ”.
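The behaviour described above can be checked with a small sketch that runs the pattern against the example addresses. The data set name check, the variables len and looks_valid, and the fourth test row are made up for illustration; the pattern is the one from the earlier slide, written without a space before the final \.

```sas
data check(drop=patternID);
    length email $40;
    infile datalines truncover;
    input email $char40.;
    /* Pattern from the slides; constant, so it is compiled once */
    patternID = prxparse("/[A-Z0-9-_][A-Z0-9-_\.]*[A-Z0-9-_]@[A-Z0-9-_][A-Z0-9-_\.]*\.[A-Z][A-Z][A-Z]?[A-Z]?/");
    /* position = start of the match, or 0 when nothing matches */
    call prxsubstr(patternID, email, position, len);
    looks_valid = (position > 0);
    datalines;
SANDER.POST@CANADA.CA
SANDER.POST@.CA
SANDER.POST@CANADA..........CA
SANDER.POST@CANADA.HXQZ
;
run;
```

Only the second row should be rejected: the pattern requires a letter, digit, hyphen or underscore right after the @, and that row has a period there. The run of periods and the made-up .HXQZ ending both get through, which is the weakness the slide points out.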

What are they useful for? We can refine that using a list, so that the pattern ends with \.(COM|CA|NET|GOV): [A-Z0-9-_][A-Z0-9-_\.]*[A-Z0-9-_]@[A-Z0-9-_\.]*[A-Z0-9-_]\.(COM|CA|NET|GOV) The remainders (the values that do not match) reveal potential typos like HOTMAIL.CON, ROGERS.COMN or GMAIL.CM. Analysis of the remainder can be used to expand the domain list, or a list of domains can be downloaded and used instead.
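A minimal sketch of the remainder analysis follows. The data set name remainder, the variable ok and the test rows are hypothetical. The ^ and \s*$ anchors are also an addition that is not on the slide: without an end anchor, ROGERS.COMN would still match because it contains “.COM”, so the sketch assumes the whole value has to match.

```sas
data remainder;
    length email $40;
    infile datalines truncover;
    input email $char40.;
    /* Slide's refined pattern, anchored at both ends (added assumption);
       \s* absorbs the trailing blanks SAS pads onto character variables */
    ok = prxmatch("/^[A-Z0-9-_][A-Z0-9-_\.]*[A-Z0-9-_]@[A-Z0-9-_\.]*[A-Z0-9-_]\.(COM|CA|NET|GOV)\s*$/", email) > 0;
    /* Keep only the values that fail, for manual review */
    if not ok;
    datalines;
SANDER.POST@CANADA.CA
SOMEONE@HOTMAIL.CON
SOMEONE@ROGERS.COMN
SOMEONE@GMAIL.CM
;
run;
```

The rows left in remainder are the candidates for typo correction or for extending the domain list.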


What are they useful for? Example program – matching phone numbers – in different formats
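The transcript does not include the example program itself. The sketch below is a stand-in, not the presenter's code: the pattern, the data set name phones, the variable normalized and the test rows are all assumptions. It looks for ten-digit numbers written with optional parentheses and with a hyphen, period, space or nothing between the groups, then rebuilds them in one format from the capture groups.

```sas
data phones(drop=patternID);
    length text $60 normalized $12;
    infile datalines truncover;
    input text $char60.;
    /* Three capture groups: area code, exchange, line number */
    patternID = prxparse("/\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})/");
    /* After a successful match, PRXPOSN returns the text of each group */
    if prxmatch(patternID, text) > 0 then
        normalized = catx('-', prxposn(patternID, 1, text),
                               prxposn(patternID, 2, text),
                               prxposn(patternID, 3, text));
    datalines;
Call me at (613) 555-0123 tomorrow
613-555-0124
613.555.0125 is my cell
Office: 6135550126
no number here
;
run;
```

This pattern is deliberately loose: it would also pick ten digits out of a longer run of digits, so a real program would likely add boundaries or further checks.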