1 Perl Regular Expressions in SAS 9 Ruth Yuee Zhang, CFE Jan 10, 2005.

2 Outline
- Introduction
- SAS Syntax
- Meta-Characters
- Examples

3 Introduction: Regular Expressions
A powerful tool for manipulating text data, available in, e.g., Perl, Java, PHP and Emacs. They can be used to:
- Locate a pattern in text strings
- Obtain the position of the pattern
- Extract a substring
- Substitute one string for another

4 Introduction
- SAS Regular Expressions (RX functions): RXPARSE, RXMATCH, RXCHANGE, etc.
- Perl Regular Expressions (PRX functions): PRXPARSE, PRXMATCH, PRXCHANGE, PRXPOSN, PRXDEBUG, etc.

5 SAS Syntax: Function PRXPARSE
PRXPARSE (perl-regular-expression);
Compiles a Perl regular expression and returns a pattern identifier to be used later by the other Perl regular expression functions.
- perl-regular-expression: the Perl regular expression to compile.

6 SAS Syntax
    data _null_;
        ** create the regular expression only once **;
        if _N_ = 1 then myregex = PRXPARSE("/cat/");  ** case-sensitive match for the text cat **;
        retain myregex;
        input string $30.;
        ** matching the regular expression **;
        position = PRXMATCH(myregex, string);
        put position=;
        datalines;
    It is a cat
    Does not match a CAT
    cat in the beginning
    ;
    run;
Position: 9, 0, 1

7 Meta-Characters: Position Characters ^ and $
- "^cat": matches "cat" at the beginning of a string. Matches "cat" and "cats" but not "the cat".
- "cat$": matches "cat" at the end of a string. Matches "the cat" and "cat" but not "cat in the hat".
- "^cat$": a string that starts and ends with "cat"; that can only be "cat" itself.
- "cat": a string that has the text "cat" anywhere in it. Matches "cat", "cats", "the cat", "catch".
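The anchor behaviour described above can be checked with a short DATA step. This is a minimal sketch, not part of the original slides: the data set name `anchors`, the variable names and the sample rows are all illustrative assumptions. One SAS-specific point it tries to show is that character values are padded with trailing blanks, so the value is passed through `trim()` before matching; otherwise `$` would look for "cat" followed directly by the end of the padded value.

```sas
data anchors;
    length string $30;
    /* compile each pattern once and keep the ids across iterations */
    if _n_ = 1 then do;
        re_start = prxparse('/^cat/');
        re_end   = prxparse('/cat$/');
        re_whole = prxparse('/^cat$/');
        re_any   = prxparse('/cat/');
    end;
    retain re_start re_end re_whole re_any;
    input string $30.;
    /* trim() removes the trailing blanks that pad the value */
    starts_with = prxmatch(re_start, trim(string)) > 0;
    ends_with   = prxmatch(re_end,   trim(string)) > 0;
    whole       = prxmatch(re_whole, trim(string)) > 0;
    anywhere    = prxmatch(re_any,   trim(string)) > 0;
    drop re_:;
    datalines;
cat
cats
the cat
cat in the hat
catch
;
run;

proc print data=anchors;
run;
```

Each flag is 1 when the corresponding pattern matches and 0 otherwise, so the printed table can be compared row by row with the examples on the slide.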

8 Meta-Characters
- "\d": matches a digit, 0 to 9. "\d\d\d" matches any three-digit number (123, 389).
- "\w": matches a word character: an upper- or lower-case letter, a digit or an underscore (but not a blank). "\w\w\w" matches any three word characters, such as a three-letter word.
- "\s": matches a white-space character such as a blank or a tab. "\d\s\w" matches "1 a", "6 x".
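A quick way to see what these shorthand classes accept is to test a few short values against anchored patterns. The sketch below is illustrative only: the data set `metachars`, the variable names and the sample rows are assumptions, and the `^...$` anchors are added so that the whole (trimmed) value has to fit the pattern. In Perl syntax `\w` covers letters, digits and the underscore, which the rows "123" and "a_1" are meant to demonstrate.

```sas
data metachars;
    length string $10;
    if _n_ = 1 then do;
        re_d = prxparse('/^\d\d\d$/');   /* exactly three digits */
        re_w = prxparse('/^\w\w\w$/');   /* exactly three word characters */
        re_s = prxparse('/^\d\s\w$/');   /* digit, white space, word character */
    end;
    retain re_d re_w re_s;
    input string $10.;
    three_digits     = prxmatch(re_d, trim(string)) > 0;
    three_word_chars = prxmatch(re_w, trim(string)) > 0;
    digit_space_word = prxmatch(re_s, trim(string)) > 0;
    drop re_:;
    datalines;
123
cat
a_1
a b
1 a
;
run;

proc print data=metachars;
run;
```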

9 Meta-Characters: Quantifiers *, + and ?
- "c(at)*": matches a string that has a "c" followed by zero or more "at" ("c", "cat", "catatat").
- "c(at)+": same, but there is at least one "at" ("cat", "catat", etc.).
- "c(at)?": same, but there is zero or one "at" ("c", "cat").
- "c?a+t$": an optional "c" followed by one or more "a", ending with "t" ("cat", "at", "aaat").

10 Meta-Characters: Quantifiers
- "\d{3}": matches any 3-digit number and is equivalent to "\d\d\d".
- "\w{3,}": matches words of 3 or more characters and is equivalent to "\w\w\w+" ("cat", "_NULL_").
- "\w{3,5}": matches words of at least 3 but no more than 5 characters ("cat", "cats", "catch").
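The quantifiers from the last two slides can be compared side by side in one DATA step. As before, this is only a sketch under assumptions: the data set name `quantifiers`, the flag names and the sample rows are made up for illustration, and every pattern is wrapped in `^...$` (my addition) so that a flag is 1 only when the entire trimmed value fits the pattern.

```sas
data quantifiers;
    length string $10;
    if _n_ = 1 then do;
        re_star  = prxparse('/^c(at)*$/');   /* zero or more "at" */
        re_plus  = prxparse('/^c(at)+$/');   /* one or more "at"  */
        re_opt   = prxparse('/^c(at)?$/');   /* zero or one "at"  */
        re_range = prxparse('/^\w{3,5}$/');  /* 3 to 5 word characters */
    end;
    retain re_star re_plus re_opt re_range;
    input string $10.;
    star  = prxmatch(re_star,  trim(string)) > 0;
    plus  = prxmatch(re_plus,  trim(string)) > 0;
    opt   = prxmatch(re_opt,   trim(string)) > 0;
    range = prxmatch(re_range, trim(string)) > 0;
    drop re_:;
    datalines;
c
cat
catatat
aaat
;
run;

proc print data=quantifiers;
run;
```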

11 Meta-Characters
- ".": matches any single character. "c.t" matches "cat", "cut", "cot", "cit".
- "c(a|u)t": matches "cat", "cut".
- "c[auo]t": matches "cat", "cut", "cot".
- "[a-e]": matches any one of the letters "a" to "e". "c[a-e]t" matches "cat", "cbt", "cct".
- "[^abc]": matches any single character except "a", "b" or "c". "c[^abc]t" matches "cut", "cot" but not "cat", "cbt".
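The difference between the wildcard, alternation and the bracketed classes is easiest to see on a handful of three-letter values. The following sketch assumes a data set called `classes` and sample rows chosen for illustration; the `^...$` anchors are again my addition so that only whole values are tested.

```sas
data classes;
    length string $10;
    if _n_ = 1 then do;
        re_dot   = prxparse('/^c.t$/');       /* any single character */
        re_alt   = prxparse('/^c(a|u)t$/');   /* "a" or "u" */
        re_set   = prxparse('/^c[auo]t$/');   /* one of a, u, o */
        re_range = prxparse('/^c[a-e]t$/');   /* one of a through e */
        re_neg   = prxparse('/^c[^abc]t$/');  /* anything except a, b, c */
    end;
    retain re_dot re_alt re_set re_range re_neg;
    input string $10.;
    dot       = prxmatch(re_dot,   trim(string)) > 0;
    alt       = prxmatch(re_alt,   trim(string)) > 0;
    in_set    = prxmatch(re_set,   trim(string)) > 0;
    in_range  = prxmatch(re_range, trim(string)) > 0;
    negated   = prxmatch(re_neg,   trim(string)) > 0;
    drop re_:;
    datalines;
cat
cut
cot
cbt
cit
;
run;

proc print data=classes;
run;
```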

12 Ex #1 A Simple Search
    ** create the regular expression only once **;
    retain myregex;
    if _N_ = 1 then do;
        myregex = PRXPARSE("/m[ea]th[ea][dt]one?/i");
        /* "e?": zero or one "e"
           "i":  ignore case when matching */
    end;
    ** create a flag of whether matching or not **;
    myflag = min(PRXMATCH(myregex, drugname), 1);
Matched: methadone, Metheton, methadon, mathatone, METHEDONE, METHADON

13 Function PRXMATCH
PRXMATCH (pattern-id, string);
Returns the first position in the string where the regular expression match is found. If the pattern is not found, it returns 0.
- pattern-id: the value returned from the PRXPARSE function.
- string: the variable that you are interested in.
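Putting the search pattern from Ex #1 together with the PRXMATCH description gives a complete step along the following lines. It is a sketch under assumptions: the data set name `drugs`, the `position` variable and the sample rows (including the non-matching "morphine" and the embedded "oral methadone") are mine, added to show both the 0 return value and a match that does not start in column 1.

```sas
data drugs;
    length drugname $20;
    retain myregex;
    /* compile the pattern once, on the first iteration */
    if _n_ = 1 then myregex = prxparse('/m[ea]th[ea][dt]one?/i');
    input drugname $20.;
    /* position of the first match, or 0 when there is none */
    position = prxmatch(myregex, drugname);
    /* collapse the position into a 0/1 flag */
    myflag = (position > 0);
    drop myregex;
    datalines;
methadone
Metheton
METHADON
morphine
oral methadone
;
run;

proc print data=drugs;
run;
```

Writing the flag as `(position > 0)` is equivalent to the `min(..., 1)` form on the slide; either way the flag is 1 for a match and 0 otherwise.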

14 Ex #2 Validating the format
A sample of the data:
- Hydro-Chlorothiazide 25.5
- Ziagen 200mg
- Zerit mg
- Insulin 20 cc
- Dapsone 100 g
- Kaletra 3 tabs
Improperly formatted data:
- Hydro-Chlorothiazide 25.5
- Zerit mg

15 Ex #2 Validating the format
    ** create the regular expression **;
    myregex = PRXPARSE("/^\D+\d{1,4}\.?\d{0,4}\s?(tabs?|caps?|cc|m?g)/i");
    /* "^\D+":     starts with a group of non-digits
       "\d{1,4}":  followed by one to four digits
       "\.?":      an optional period
       "\d{0,4}":  may be followed by up to four more digits
       "\s?":      an optional space
       "(tabs?|caps?|cc|m?g)": units of measure: tab, tabs, cap, caps, cc, mg, g
       "/i":       ignore the case */
    ** catch poorly formatted data **;
    if PRXMATCH(myregex, medication) = 0;
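For reference, the validation can be run end to end on the six sample values shown earlier. This is a minimal sketch: the data set name `bad_format` and the `$40` length are assumptions, while the pattern and the sample rows come from the slides. The subsetting IF keeps only the rows that fail the pattern.

```sas
data bad_format;
    length medication $40;
    retain myregex;
    if _n_ = 1 then
        myregex = prxparse('/^\D+\d{1,4}\.?\d{0,4}\s?(tabs?|caps?|cc|m?g)/i');
    input medication $40.;
    /* subsetting IF: keep only rows where the pattern is NOT found */
    if prxmatch(myregex, medication) = 0;
    drop myregex;
    datalines;
Hydro-Chlorothiazide 25.5
Ziagen 200mg
Zerit mg
Insulin 20 cc
Dapsone 100 g
Kaletra 3 tabs
;
run;

proc print data=bad_format;
run;
```

With these rows the output data set should hold the two entries the slide lists as improperly formatted: the one with a dose but no unit and the one with a unit but no dose.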

16 Ex #3 Extracting Text
To extract what the patients are reporting to the investigators:
Patient reported headache and nausea. MD noticed rash. Pt. Rptd. Backache. Patient reported seeing spots. Elevated pulse and labored breathing. Headache.
Extracted field:
- headache and nausea
- Backache
- seeing spots

17 Function CALL PRXPOSN
CALL PRXPOSN (pattern-id, capture-buffer-number, start <, length>);
Returns the position and length for a capture buffer. Used in conjunction with PRXPARSE and PRXMATCH.
- pattern-id: the value returned from the PRXPARSE function.
- capture-buffer-number: a number indicating which capture buffer is to be evaluated.
- start: the value of the first position where the particular capture buffer is found.
- length: the length of the found pattern.

18 Ex #3 Extracting Text
    ** create the regular expression **;
    myregex = PRXPARSE("/(reported|rptd?\.?)(.*?\.)/i");
    /* "(reported|rptd?\.?)": 1st capture buffer. Captures the word
                              "reported", "rpt", "rpt.", "rptd", "rptd."
       "(.*?\.)": 2nd capture buffer. Followed by any characters up to the
                  first period; the "?" after ".*" keeps the match from
                  running on to a later period in the same comment
       "/i": ignore the case */

19 Ex #3 Extracting Text
    ** only call PRXPOSN if matching **;
    if PRXMATCH(myregex, comments) then do;
        /* get the position and length of the match of the 2nd capture buffer */
        call PRXPOSN(myregex, 2, pos, len);
        /* extract the substring excluding the end period */
        pt_comments = substr(comments, pos, len - 1);
    end;
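A complete version of the extraction might look like the sketch below. Several things here are assumptions rather than the author's code: the data set name `reports`, the way the sample text is split into one comment per row, the lazy `.*?` in the second capture buffer (so the capture stops at the first period even when a comment holds two sentences), and the use of `left()` to drop the blank that follows "reported". SUBSTRN is used instead of SUBSTR only because it tolerates a zero length.

```sas
data reports;
    length comments $80 pt_comments $80;
    retain myregex;
    if _n_ = 1 then
        myregex = prxparse('/(reported|rptd?\.?)(.*?\.)/i');
    input comments $80.;
    /* only call PRXPOSN when the pattern is found */
    if prxmatch(myregex, comments) then do;
        /* position and length of the 2nd capture buffer */
        call prxposn(myregex, 2, pos, len);
        /* drop the closing period, then left-align to remove the leading blank */
        pt_comments = left(substrn(comments, pos, len - 1));
    end;
    keep comments pt_comments;
    datalines;
Patient reported headache and nausea. MD noticed rash.
Pt. Rptd. Backache.
Patient reported seeing spots.
Elevated pulse and labored breathing.
Headache.
;
run;

proc print data=reports;
run;
```

Rows without "reported" or an abbreviation of it leave `pt_comments` blank.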

20 Ex #4 Substitute one string for another
Replace all the following by "Multi-vitamin":
- multivitamin
- multi-vitamin
- multi-vita
- multivit
- multi-vit
- multi vitamin

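21 Function CALL PRXCHANGE
CALL PRXCHANGE (pattern-id, times, old-string <, new-string <, result-length <, truncation-value>>>);
To substitute one string for another.
- pattern-id: the value returned from the PRXPARSE function.
- times: the number of times to search for and replace a string.
- old-string: the string in which to search and replace. If new-string is omitted, the changes are made to old-string itself.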

22 Ex #4 Substitute one string for another
    ** create regular expression **;
    myregex = PRXPARSE("s/multi[- ]?vita?(min)?/Multi-vitamin/");
    /* "s/":     indicates that the regular expression will be used in a substitution
       "[- ]?":  optional "-" or space
       "a?":     optional "a"
       "(min)?": optional "min" */

23 Ex #4 Substitute one string for another
    ** using the myregex id created above **;
    call PRXCHANGE(myregex, -1, drugname);
    /* "-1" indicates that the pattern should be changed at every occurrence */
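The two fragments of Ex #4 can be combined into one runnable step. In this sketch the data set name `vitamins` and the `original` variable (kept only so the before and after values can be compared) are assumptions; the pattern and the six spellings are taken from the slides.

```sas
data vitamins;
    length drugname $40;
    retain myregex;
    /* "s/.../.../" marks the expression as a substitution */
    if _n_ = 1 then
        myregex = prxparse('s/multi[- ]?vita?(min)?/Multi-vitamin/');
    input drugname $40.;
    original = drugname;
    /* -1: replace every occurrence; drugname is changed in place */
    call prxchange(myregex, -1, drugname);
    drop myregex;
    datalines;
multivitamin
multi-vitamin
multi-vita
multivit
multi-vit
multi vitamin
;
run;

proc print data=vitamins;
run;
```

Because the pattern has no "i" modifier, a spelling that starts with a capital "M" would be left unchanged; adding the modifier after the final slash would be one way to cover it.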

24 Ex #5 Finding digits in random positions
String                     x1   x2   x3   x4   x5
This 45 lines 98 has 3 s   45   98    3    .    .
None here                   .    .    .    .    .
12 34 78 90                12   34   78   90    .
Weight 60kg 132pound       60  132    .    .    .

25 Function CALL PRXNEXT
CALL PRXNEXT (pattern-id, start, stop, string, position, length);
To locate the nth occurrence of a pattern. Each call identifies the next occurrence of the pattern.
- start: the starting position to begin the search; after a successful call it is moved past the match, so the following call continues from there
- stop: the last position in the string for the search
- string: the string to be searched
- position: the starting position of the nth occurrence of the pattern (0 if no further occurrence is found)
- length: the length of the matched text

26 Ex #5 Finding digits in random positions
    ** create the regular expression **;
    myregex = PRXPARSE("/\d+/");  ** "\d+": look for one or more digits **;
    start = 1;
    stop = length(string);
    ** get the position and length of the first occurrence **;
    call PRXNEXT(myregex, start, stop, string, pos, len);

27 Ex #5 Finding digits in random positions
    array x[5];
    ** continue until no more digits are found (pos = 0) **;
    do i = 1 to 5 while (pos gt 0);
        ** extract the current occurrence **;
        x[i] = input(substr(string, pos, len), 9.);
        ** get the position and length of the next occurrence **;
        call PRXNEXT(myregex, start, stop, string, pos, len);
    end;
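Joined into a single step, the two halves of Ex #5 could be run as sketched below. The data set name `digits` and the `$30` length are assumptions; the sample rows are the four strings from the example table, and the loop is the one shown on the slides.

```sas
data digits;
    length string $30;
    retain myregex;
    if _n_ = 1 then myregex = prxparse('/\d+/');   /* one or more digits */
    array x[5];
    input string $30.;
    start = 1;
    stop  = length(string);
    /* first occurrence; pos is 0 when the string has no digits */
    call prxnext(myregex, start, stop, string, pos, len);
    do i = 1 to 5 while (pos > 0);
        /* convert the matched text to a number */
        x[i] = input(substr(string, pos, len), 9.);
        /* start was moved past the match, so this finds the next one */
        call prxnext(myregex, start, stop, string, pos, len);
    end;
    keep string x1-x5;
    datalines;
This 45 lines 98 has 3 s
None here
12 34 78 90
Weight 60kg 132pound
;
run;

proc print data=digits;
run;
```

The `i = 1 to 5` bound caps the loop at the size of the array, so a string with more than five numbers keeps only the first five.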

28 Ex #6 Locating zip codes
String:
- John Smith 12 Broad street Flemington, NJ 08822
- Philip Judson Apt #1, Building 7 777 Route 730 Kerrville, TX 78028
- Dr. Roger Alan 44 Commonwealth Ave. Boston, MA 02116-7364
Zip_code:
- 08822
- 78028
- 02116-7364

29 Function CALL PRXSUBSTR
CALL PRXSUBSTR (pattern-id, string, start <, length>);
Returns the starting position and the length of the match.
- string: the string to be searched
- start: the starting position of the pattern (0 if the pattern is not found)
- length: the length of the substring

30 Ex #6 Locating zip codes
    ** create the regular expression **;
    myregex = PRXPARSE("/ \d{5}(-\d{4})?/");
    /* match a blank followed by 5 digits followed by either nothing
       or a dash and 4 digits
       "\d{5}": matches 5 digits
       "-":     matches a dash
       "\d{4}": matches 4 digits
       "?":     matches zero or one of the preceding subexpression */

31 Ex #6 Locating zip codes
    call PRXSUBSTR(myregex, string, start, length);
    ** only extract the substring if the pattern is found **;
    ** the match begins with the blank, so the zip code starts one position later **;
    if start gt 0 then
        zip_code = substrn(string, start + 1, length - 1);
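For completeness, here is the zip code example as one step. It is a sketch with a few assumptions: the data set name `zips`, the variable lengths, and the name `len` for the length returned by CALL PRXSUBSTR (used here instead of `length` simply to keep it distinct from the LENGTH statement). The three address strings are those from the example.

```sas
data zips;
    length string $80 zip_code $10;
    retain myregex;
    /* a blank, 5 digits, then optionally a dash and 4 digits */
    if _n_ = 1 then myregex = prxparse('/ \d{5}(-\d{4})?/');
    input string $80.;
    call prxsubstr(myregex, string, start, len);
    /* the match includes the leading blank, so skip one position */
    if start > 0 then zip_code = substrn(string, start + 1, len - 1);
    keep string zip_code;
    datalines;
John Smith 12 Broad street Flemington, NJ 08822
Philip Judson Apt #1, Building 7 777 Route 730 Kerrville, TX 78028
Dr. Roger Alan 44 Commonwealth Ave. Boston, MA 02116-7364
;
run;

proc print data=zips;
run;
```

The pattern picks up the first run of five digits that follows a blank, so an address with a five-digit house number would need a stricter pattern, for instance one anchored to the end of the string.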

