Published by Molly Stuart. Modified over 11 years ago.
PhUSE 2011: Brighton
TS09
Rectifying Irregular Text Data: A Case for Using Regular Expressions in SAS
Jayshree Garade, Manjusha Gode
Outline
- Problems
- Solutions & Introducing Regular Expressions
- Advantages over SAS String Functions
- Points to note while using Regular Expressions
- References
Problem: Physical abnormalities

SUBJID  TRT  ABNORMALITY
01-011  B    ANEMIA
01-036  D    ANAEMIA
01-026  C    ANEMEA
01-014  B    ANEMIC
Problem: Time point variable …

USUBJID  VISIT  VSDT       PRSDTLTM           VNTR_RT  VNTRTUN
1        1      17-Oct-08  Per 1 D01 Predose  47       /min
1        2      3-Nov-08   Per 1 D01.5 hr     58       /min
1        2      3-Nov-08   Per 1 D 01 01 hr   51       /min
1        2      3-Nov-08   Per 1d01 02hr      49       /min
1        3      4-Nov-08   day1               53       /min
1        9      03-Feb-09  Poststudy          56       /min
…Problem: Time point variable

USUBJID  VISIT  VSDT       PRSDTLTM           VNTR_RT  VNTRTUN  Time_desc
1        1      17-Oct-08  Per 1 D01 Predose  47       /min     Predose
1        2      3-Nov-08   Per 1 D01.5 hr     58       /min     Day 1, 0.5 Hour
1        2      3-Nov-08   Per 1 D 01 01 hr   51       /min     Day 1, 1 Hour
1        2      3-Nov-08   Per 1d01 02hr      49       /min     Day 1, 2 Hours
1        3      4-Nov-08   day1               53       /min     Day 1
1        9      03-Feb-09  Poststudy          56       /min     Poststudy
…Problem: Time point variable

PRSDTLTM: D01, d01, day1
Time_desc: Day 1
…Ways to approach the problem

Traditional: Using SAS String Functions
INDEX, TRANWRD, SUBSTR, ANYALNUM, ANYALPHA, ANYDIGIT, ANYSPACE, NOTALNUM, NOTALPHA, NOTUPPER, FIND, FINDC, ANYPUNCT, INDEXC, INDEXW, VERIFY, NOTDIGIT, CALL CATS, CALL CATT, CALL CATX, TRANSLATE, SCAN, SCANQ, CALL SCAN, CALL SCANQ, COMPARE, COMPLEV, CALL COMPCOST, SOUNDEX, COMPGED, SPEDIS, MISSING, RANK, REPEAT, REVERSE, …
Alternative Approach to Problem
Introducing REGULAR EXPRESSIONS!!
Introduction: Regular Expressions
- A powerful technique for searching and manipulating text data.
- A mini programming language for pattern matching.
- Two types of pattern matching functions in SAS:
  - SAS Regular Expressions: SAS Version 6.12
  - PERL Regular Expressions: SAS Version 9
Steps to use Regular Expressions…
1. Problem
2. Required Portion
3. Pattern
4. Regular Expressions
5. Locate Reqd. Portion
6. Process Data
Step 1: Identify the problem …

USUBJID  VISIT  VSDT       PRSDTLTM           VNTR_RT  VNTRTUN  time_desc
1        1      17-Oct-08  Per 1 D01 Predose  47       /min     Predose
1        2      3-Nov-08   Per 1 D01.5 hr     58       /min     Day 1, 0.5 Hour
1        2      3-Nov-08   Per 1 D 01 01 hr   51       /min     Day 1, 1 Hour
1        2      3-Nov-08   Per 1d01 02 hr     49       /min     Day 1, 2 Hours
1        3      4-Nov-08   Day1               53       /min     Day 1
1        9      03-Feb-09  Poststudy          56       /min     Poststudy
Step 2: Visualize the Required Portion within the source text

PRSDTLTM values: Per 1 D01 Predose; Per 1 D01.5 hr; Per 1 D 01 01 hr; Per 1d01 02 hr; Day1; Poststudy
Required Portion: D01, d01, D 01, Day1
Step 3: Identify a pattern

PRSDTLTM values: Per 1 D01 Predose; Per 1 D01.5 hr; Per 1 D 01 01 hr; Per 1d01 02 hr; Day1; Poststudy

Pattern to EXTRACT, in order:
1. Preceding blank
2. D or d
3. 2 non-digits
4. Following blank
5. One/more digits
6. Following blank
Step 3: Identify a pattern (continued)

PRSDTLTM values: Prestudy; Per 1 D01 Predose; D2; Per 1d01 02 hr; Per 1 D 01 01 hr 30 min; Poststudy

Pattern to EXTRACT, in order:
1. Preceding blank
2. D or d
3. Following blank
4. One/more digits
5. Following blank
Step 3: Identify a pattern (continued)

PRSDTLTM values: Per 1 D01 Predose; Per 1 D01; Per 1 D 01 01 hr 30 min; Per 1d01 02 hr; Day2; Poststudy

Pattern to EXTRACT, in order:
1. Preceding blank
2. D or d
3. Following blank
4. One/more digits
5. Following blank
Regular Expressions Syntax...at a glance

Metacharacter  Description
*              Matches the previous subexpression zero or more times
+              Matches the previous subexpression one or more times
?              Matches the previous subexpression zero or one time
\d             Matches a digit (0-9)
\D             Matches a non-digit
\w             Matches a word character (upper or lower case letter, digit, or underscore)
[abc]          Matches any of the characters in the brackets
\(             Matches (
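To make the metacharacters concrete, here is a minimal sketch that applies a few of them to one of the sample PRSDTLTM values using PRXMATCH, which returns the position of the first match (0 if there is none). The variable names are illustrative and not taken from the slides.

```sas
/* Illustrative sketch: variable names are assumptions, not from the slides. */
data _null_;
  length text $20;
  text = 'Per 1 D01 Predose';

  first_digit    = prxmatch('/\d/',   text);  /* 5: the "1" after "Per "            */
  first_nondigit = prxmatch('/\D/',   text);  /* 1: the leading "P"                 */
  first_d        = prxmatch('/[Dd]/', text);  /* 7: the "D" of "D01"                */
  digits_blank   = prxmatch('/\d+ /', text);  /* 5: "1 " is the first digits+blank  */
  open_paren     = prxmatch('/\(/',   text);  /* 0: there is no "(" in the text     */

  put first_digit= first_nondigit= first_d= digits_blank= open_paren=;
run;
```

The values are written to the SAS log by the PUT statement.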
Step 4: Write the Regular Expression for the pattern

PRSDTLTM values: Per 1 D01 Predose; Per 1 D01.5 hr; Per 1 D 01 01 hr; Per 1d01 02 hr; Day1; Poststudy

The expression is built up one pattern element at a time:

Pattern element   Added piece   Expression so far
Preceding blank   " ?"          ("/ ?/")
D or d            "[Dd]"        ("/ ?[Dd]/")
2 non-digits      "(\D\D)?"     ("/ ?[Dd](\D\D)?/")
Following blank   " ?"          ("/ ?[Dd](\D\D)? ?/")
One/more digits   "\d+"         ("/ ?[Dd](\D\D)? ?\d+/")
Following blank   " +"          ("/ ?[Dd](\D\D)? ?\d+ +/")

Final regular expression: ("/ ?[Dd](\D\D)? ?\d+ +/")
Step 4: Write the Regular Expression for the pattern

    /* Extracting the Day Text portion */
    data day_txt;
      set lb.ecg(keep = PRSDTLTM);
      retain day_exp;
      if _n_ = 1 then do;
        * defined to describe the day text pattern;
        day_exp = PRXPARSE("/ ?[Dd](\D\D)? ?\d+ +/");
      end;
    run;

The string passed to PRXPARSE is made up of the metacharacters identified above.
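PRXPARSE returns a missing value when the pattern cannot be compiled, so it can be worth checking the result before using it. The following is a sketch under stated assumptions: the inline DATALINES rows stand in for lb.ecg, and the data set name day_check and the flag has_day are hypothetical.

```sas
/* Sketch: inline rows replace lb.ecg; day_check and has_day are hypothetical names. */
data day_check;
  length PRSDTLTM $40;
  infile datalines truncover;
  input PRSDTLTM $char40.;
  retain day_exp;
  if _n_ = 1 then do;
    day_exp = prxparse("/ ?[Dd](\D\D)? ?\d+ +/");
    /* A missing identifier means the pattern did not compile. */
    if missing(day_exp) then do;
      put "ERROR: the day text pattern did not compile.";
      stop;
    end;
  end;
  /* 1 when the day text pattern is found, 0 otherwise. */
  has_day = (prxmatch(day_exp, PRSDTLTM) > 0);
  drop day_exp;
  datalines;
Per 1 D01 Predose
Per 1 D 01 01 hr
Poststudy
;
run;
```

Because PRSDTLTM is padded with trailing blanks to its declared length, the final " +" in the pattern can match even when the day text ends the value.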
Recap… Steps to use Regular Expressions…
1. Problem
2. Required Portion
3. Pattern
4. Regular Expressions
5. Locate Reqd. Portion
6. Process Data
Step 5: Locate the Required Portion

    /* Extracting the Day Text portion */
    data day_txt;
      set lb.ecg(keep = PRSDTLTM);
      retain day_exp day_nexp;
      if _n_ = 1 then do;
        * defined to describe the day text pattern;
        day_exp = PRXPARSE("/ ?[Dd](\D\D)? ?\d+ +/");
      end;
      * Locating the day text pattern in the PRSDTLTM var;
      CALL PRXSUBSTR(day_exp, PRSDTLTM, dayst, dayln);
    run;

Arguments of CALL PRXSUBSTR:
- day_exp: pattern definition
- PRSDTLTM: source variable
- dayst: stores the start position of the matched string
- dayln: stores the length of the matched string
Step 6: Use other SAS text functions to further process data

    /* Extracting the Day Text portion */
    data day_txt;
      set lb.ecg(keep = PRSDTLTM);
      retain day_exp day_nexp;
      if _n_ = 1 then do;
        * defined to describe the day text pattern;
        day_exp = PRXPARSE("/ ?[Dd](\D\D)? ?\d+ +/");
      end;
      * Locating the day text pattern in the PRSDTLTM var;
      CALL PRXSUBSTR(day_exp, PRSDTLTM, dayst, dayln);
      * Extracting the day text pattern;
      day_txt = substrn(PRSDTLTM, dayst, dayln);
    run;

Arguments of SUBSTRN:
- PRSDTLTM: source variable
- dayst: starting position
- dayln: length of matched pattern
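Steps 5 and 6 can be tried without access to lb.ecg. The sketch below is an assumption-laden stand-in: it reads a few of the sample PRSDTLTM values from inline DATALINES, the data set name day_txt_demo is hypothetical, and STRIP is added (it is not in the slides) to remove the blanks that the pattern captures around the day text.

```sas
/* Sketch: inline rows replace lb.ecg; day_txt_demo is a hypothetical name. */
data day_txt_demo;
  length PRSDTLTM $40 day_txt $10;
  infile datalines truncover;
  input PRSDTLTM $char40.;
  retain day_exp;
  if _n_ = 1 then day_exp = prxparse("/ ?[Dd](\D\D)? ?\d+ +/");

  /* dayst is 0 when the pattern is not found. */
  call prxsubstr(day_exp, PRSDTLTM, dayst, dayln);

  /* STRIP is an addition here: the match includes surrounding blanks. */
  if dayst > 0 then day_txt = strip(substrn(PRSDTLTM, dayst, dayln));
  drop day_exp;
  datalines;
Per 1 D01 Predose
Per 1 D 01 01 hr
Per 1d01 02 hr
Day1
Poststudy
;
run;

proc print data=day_txt_demo noobs;
run;
```

For these five rows day_txt should come out as D01, D 01, d01, Day1 and blank, with dayst equal to 0 for Poststudy.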
…Output

PRSDTLTM values: Per 1 D01 Predose; Per 1 D01.5 hr; Per 1 D 01 01 hr; Per 1d01 02 hr; Day1; Poststudy
Extracted strings (day_txt): D01, Day1, d01, D 01
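The extracted strings still differ in spelling, while the target Time_desc value is "Day 1". One possible follow-up, not shown in the slides, is to rewrite each extracted string with PRXCHANGE. In this sketch the data set name day_std, the variable time_day and the substitution pattern are all assumptions.

```sas
/* Sketch: day_std, time_day and the substitution pattern are assumptions. */
data day_std;
  length day_txt $10 time_day $20;
  infile datalines truncover;
  input day_txt $char10.;

  /* D or d, optional two non-digits, optional blanks, leading zeros dropped, */
  /* then the day number captured in buffer 2 and reused in the replacement.  */
  time_day = prxchange('s/^\s*[Dd](\D\D)?\s*0*(\d+)\s*$/Day $2/', 1, day_txt);
  datalines;
D01
D 01
d01
Day1
;
run;
```

All four inputs should become "Day 1". A value that does not fit the pattern is returned unchanged, so unexpected spellings remain visible for review.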
Advantages…
- Compact solution
- Tremendous flexibility
- Concise description of patterns
- Handles highly unstructured data streams
- Matches multiple patterns in one step
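As an illustration of matching several patterns in one step, the four spellings from the physical abnormalities problem can be covered by a single expression. This is a sketch, not the authors' solution: the data set name abn_std, the variable abn_clean, the pattern, and the choice of ANEMIA as the standard term are all assumptions.

```sas
/* Sketch: abn_std, abn_clean and the choice of ANEMIA as the standard term are assumptions. */
data abn_std;
  length SUBJID $6 TRT $1 ABNORMALITY $20 abn_clean $20;
  input SUBJID $ TRT $ ABNORMALITY $;

  /* One pattern covers ANEMIA, ANAEMIA, ANEMEA and ANEMIC, in any case. */
  if prxmatch('/^\s*ANA?EM(IA|EA|IC)\s*$/i', ABNORMALITY) then abn_clean = 'ANEMIA';
  else abn_clean = ABNORMALITY;
  datalines;
01-011 B ANEMIA
01-036 D ANAEMIA
01-026 C ANEMEA
01-014 B ANEMIC
;
run;
```

Doing the same with string functions would typically need one comparison per spelling.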
Look before you leap
- Document thoroughly.
- Understand patterns.
- Define before use.
- Define only once in a data step.
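One way to follow these points together is to assemble the pattern from commented pieces and compile it once, on the first iteration of the data step. This is a sketch of one possible style, with inline DATALINES rows and the variable found introduced only for illustration.

```sas
/* Sketch: the piece-by-piece layout and the variable found are illustrative. */
data _null_;
  length PRSDTLTM $40;
  infile datalines truncover;
  input PRSDTLTM $char40.;
  retain day_exp;

  /* Define before use, and only once: compile on the first iteration. */
  if _n_ = 1 then do;
    day_exp = prxparse(
         '/'
      || ' ?'        /* optional preceding blank              */
      || '[Dd]'      /* the letter D in either case           */
      || '(\D\D)?'   /* optional two non-digits, as in "Day"  */
      || ' ?'        /* optional blank before the digits      */
      || '\d+'       /* one or more digits                    */
      || ' +'        /* one or more following blanks          */
      || '/'
    );
  end;

  /* Position of the day text, or 0 when it is absent. */
  found = prxmatch(day_exp, PRSDTLTM);
  put PRSDTLTM= found=;
  datalines;
Per 1 D01 Predose
Poststudy
;
run;
```

The concatenated string is the same expression used in Step 4; splitting it up only makes room for the comments.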
…References
- Support.sas.com
- Richard F. Pless (Ovation Research Group, Highland Park, IL). An Introduction to Regular Expressions with Examples from Clinical Data. Paper TU02.
- Ron Cody (Robert Wood Johnson Medical School, Piscataway, NJ). An Introduction to Perl Regular Expressions in SAS 9. SUGI 29 Tutorials, Paper 265-29.
- James J. Van Campen (SRI International, Menlo Park, CA). An Introduction to PERL Regular Expression in SAS®.
Q & A
Contact: jayshree.garade@cytel.com, manjusha.gode@cytel.com
Thank you