PhUSE 2011, Brighton. TS09: Rectifying Irregular Text Data: A Case for Using Regular Expressions in SAS. Jayshree Garade, Manjusha Gode.

Outline
- Problems
- Solutions & Introducing Regular Expressions
- Advantages over SAS String Functions
- Points to note while using Regular Expressions
- References

Problem: Physical abnormalities

SUBJID  TRT  ABNORMALITY
        B    ANEMIA
        D    ANAEMIA
        C    ANEMEA
        B    ANEMIC

Problem: Time point variable …

USUBJID  VISIT  VSDT       PRSDTLTM           VNTR_RT  VNTRTUN
1        1      17-Oct-08  Per 1 D01 Predose  47       /min
1        2      3-Nov-08   Per 1 D01.5 hr     58       /min
1        2      3-Nov-08   Per 1 D 01 01 hr   51       /min
1        2      3-Nov-08   Per 1d01 02hr      49       /min
1        3      4-Nov-08   day1               53       /min
1        9      03-Feb-09  Poststudy          56       /min

…Problem: Time point variable

Each free-text PRSDTLTM value has to be mapped to a standard Time_desc:

PRSDTLTM           Time_desc
Per 1 D01 Predose  Predose
Per 1 D01.5 hr     Day 1, 0.5 Hour
Per 1 D 01 01 hr   Day 1, 1 Hour
Per 1d01 02hr      Day 1, 2 Hours
day1               Day 1
Poststudy          Poststudy

… Problem: Time point variable

The PRSDTLTM day texts D01, d01 and day1 all stand for the same Time_desc: Day 1.

…Ways to approach the problem

Traditional: using SAS string functions such as INDEX, INDEXC, INDEXW, FIND, FINDC, VERIFY, SUBSTR, SCAN, SCANQ, CALL SCAN, CALL SCANQ, TRANWRD, TRANSLATE, ANYALNUM, ANYALPHA, ANYDIGIT, ANYPUNCT, ANYSPACE, NOTALNUM, NOTALPHA, NOTDIGIT, NOTUPPER, CALL CATS, CALL CATT, CALL CATX, COMPARE, COMPLEV, COMPGED, CALL COMPCOST, SOUNDEX, SPEDIS, MISSING, RANK, REPEAT, REVERSE, …

Alternative Approach to the Problem

Introducing REGULAR EXPRESSIONS!

Introduction – Regular Expressions
- Powerful technique for searching and manipulating text data.
- A mini programming language for pattern matching.
- Two types of pattern matching functions in SAS:
  - SAS Regular Expressions (introduced in SAS Version 6.12)
  - Perl Regular Expressions (introduced in SAS Version 9)

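Steps to use Regular Expressions…
1. Problem: identify the problem.
2. Required Portion: visualize the required portion within the source text.
3. Pattern: identify a pattern.
4. Regular Expressions: write the regular expression for the pattern.
5. Locate Reqd. Portion: locate the required portion.
6. Process Data: use other SAS text functions to further process the data.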

Step 1 – Identify the problem …

The problem is the free-text time point variable PRSDTLTM, which has to be turned into a standard time_desc:

PRSDTLTM           time_desc
Per 1 D01 Predose  Predose
Per 1 D01.5 hr     Day 1, 0.5 Hour
Per 1 D 01 01 hr   Day 1, 1 Hour
Per 1d01 02 hr     Day 1, 2 Hours
Day1               Day 1
Poststudy          Poststudy

Step 2 – Visualize the Required Portion within the source text

The required portion is the day text embedded in PRSDTLTM: D01, d01, D 01, Day1.
The rest of each value (Per 1, Predose, .5 hr, 01 hr, 02 hr, Poststudy) is not part of the required portion.

Step 3 – Identify a pattern

The day text to EXTRACT follows this pattern:
- Preceding blank
- D or d
- 2 non-digits
- Following blank
- One/more digits
- Following blank

PRSDTLTM values: Per 1 D01 Predose; Per 1 D01.5 hr; Per 1 D 01 01 hr; Per 1d01 02 hr; Day1; Poststudy

Step 3 – Identify a pattern

Pattern to EXTRACT: preceding blank, D or d, following blank, one/more digits, following blank.

PRSDTLTM values: Prestudy; Per 1 D01 Predose; D2; Per 1d01 02 hr; Per 1 D hr 30 min; Poststudy

Step 3 – Identify a pattern

Pattern to EXTRACT: preceding blank, D or d, following blank, one/more digits, following blank.

PRSDTLTM values: Per 1 D01 Predose; Per 1 D01; Per 1 D hr 30 min; Per 1d01 02 hr; Day2; Poststudy

Regular Expressions Syntax ... at a glance

Metacharacter  Description
*              Matches the previous subexpression zero or more times
+              Matches the previous subexpression one or more times
?              Matches the previous subexpression zero or one time
\d             Matches a digit (0-9)
\D             Matches a non-digit
\w             Matches a word character (upper or lower case letter, digit, or underscore)
[abc]          Matches any one of the characters in the brackets
\(             Matches a literal (
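To make the metacharacters concrete, here is a minimal sketch that tries a few of them with PRXMATCH. The sample strings are hypothetical and are not taken from the slides; PRXMATCH returns the position where the match starts, or 0 when nothing matches.

```sas
/* Sketch only: hypothetical sample strings, not study data */
data _null_;
  length txt $8;
  do txt = 'D1', 'D01', 'Day1', 'd 01', 'Dose';
    /* [Dd]    : D or d                        */
    /* (\D\D)? : two non-digits, zero or once  */
    /*  ?      : a blank, zero or once         */
    /* \d+     : one or more digits            */
    pos = prxmatch('/[Dd](\D\D)? ?\d+/', txt);
    put txt= pos=;   /* writes each string and its match position to the log */
  end;
run;
```

Only 'Dose' has no digits after the leading D, so it is the one string expected to give pos=0.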

Step 4 – Write the Regular Expression for the pattern

The expression is built one pattern component at a time (the double quotes only delimit each piece):

Pattern component              Regular expression piece
Preceding blank (optional)     " ?"
D or d                         "[Dd]"
2 non-digits (optional)        "(\D\D)?"
Following blank (optional)     " ?"
One/more digits                "\d+"
Following blank (one or more)  " +"

Step 4 – Write the Regular Expression for the pattern

Complete regular expression: ("/ ?[Dd](\D\D)? ?\d+ +/")

PRSDTLTM values: Per 1 D01 Predose; Per 1 D01.5 hr; Per 1 D 01 01 hr; Per 1d01 02 hr; Day1; Poststudy

Step 4 – Write the Regular Expression for the pattern

    /* Extracting the Day Text portion */
    data day_txt;
      set lb.ecg(keep = PRSDTLTM);
      retain day_exp;
      if _n_ = 1 then do;
        * defined to describe the day text pattern;
        day_exp = PRXPARSE("/ ?[Dd](\D\D)? ?\d+ +/");
      end;
    run;

The text between the two slashes is the pattern, written with the metacharacters.

Recap… Steps to use Regular Expressions: Problem → Required Portion → Pattern → Regular Expressions → Locate Reqd. Portion → Process Data

Step 5 – Locate the Required Portion

    /* Extracting the Day Text portion */
    data day_txt;
      set lb.ecg(keep = PRSDTLTM);
      retain day_exp day_nexp;
      if _n_ = 1 then do;
        * defined to describe the day text pattern;
        day_exp = PRXPARSE("/ ?[Dd](\D\D)? ?\d+ +/");
      end;
      * Locating the day text pattern in the PRSDTLTM var;
      CALL PRXSUBSTR(day_exp, PRSDTLTM, dayst, dayln);
    run;

Arguments of CALL PRXSUBSTR:
    day_exp    pattern definition
    PRSDTLTM   source variable
    dayst      stores the start position of the matched string
    dayln      stores the length of the matched string

Step 6 – Use other SAS text functions to further process data

    /* Extracting the Day Text portion */
    data day_txt;
      set lb.ecg(keep = PRSDTLTM);
      retain day_exp day_nexp;
      if _n_ = 1 then do;
        * defined to describe the day text pattern;
        day_exp = PRXPARSE("/ ?[Dd](\D\D)? ?\d+ +/");
      end;
      * Locating the day text pattern in the PRSDTLTM var;
      CALL PRXSUBSTR(day_exp, PRSDTLTM, dayst, dayln);
      * Extracting the day text pattern;
      day_txt = substrn(PRSDTLTM, dayst, dayln);
    run;

Arguments of SUBSTRN:
    PRSDTLTM   source variable
    dayst      starting position
    dayln      length of matched pattern

…Output

PRSDTLTM values: Per 1 D01 Predose; Per 1 D01.5 hr; Per 1 D 01 01 hr; Per 1d01 02 hr; Day1; Poststudy

Extracted strings (day_txt): D01, Day1, d01, D 01
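The extracted day text still varies (D01, d01, D 01, Day1), so one further step is needed to reach the standard form "Day 1". The following is a minimal sketch of one way to do that, not the authors' code: it assumes a variant of the pattern with an added capture group around the digits, and reads them with PRXPOSN. The data set name day_std, the variables day_no and day_desc, and the inline sample values are hypothetical.

```sas
/* Sketch only: sample values modeled on the slides, read inline */
data day_std;
  length PRSDTLTM $40 day_desc $10;
  input PRSDTLTM $char40.;
  retain day_exp;
  if _n_ = 1 then
    /* same pattern as above, with the digits in capture buffer 2 */
    day_exp = prxparse('/ ?[Dd](\D\D)? ?(\d+) +/');
  if prxmatch(day_exp, PRSDTLTM) then do;
    /* input() turns '01' into the number 1, dropping the leading zero */
    day_no   = input(prxposn(day_exp, 2, PRSDTLTM), 8.);
    day_desc = catx(' ', 'Day', day_no);
  end;
  drop day_exp;
  datalines;
Per 1 D01 Predose
Per 1 D 01 01 hr
Per 1d01 02 hr
Day1
Poststudy
;
run;
```

With these values the first four rows should come out as "Day 1", while Poststudy has no day text and is left blank. Note that the pattern requires a blank after the digits, so it relies on the trailing blanks that pad a SAS character variable when the day text ends the value.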

Advantages…
- Compact solution.
- Tremendous flexibility.
- Concise description of the pattern.
- Copes with highly unstructured data streams.
- Matches multiple patterns in one step.
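As an illustration of matching multiple patterns in one step, the spelling variants from the physical abnormalities problem can be covered by a single expression. This is a sketch under assumptions rather than the presenters' solution: the data set name abn_clean and the variable ABN_STD are hypothetical, and treating ANEMIC as a variant of ANEMIA is an assumption about the intended standard term.

```sas
/* Sketch only: the four ABNORMALITY spellings shown in the problem slide */
data abn_clean;
  length ABNORMALITY ABN_STD $20;
  input ABNORMALITY $;
  /* AN, optional A, EM, then IA, EA or IC; \s*$ absorbs trailing blanks; */
  /* the i suffix makes the match case-insensitive                        */
  ABN_STD = prxchange('s/^ANA?EM(IA|EA|IC)\s*$/ANEMIA/i', 1, ABNORMALITY);
  datalines;
ANEMIA
ANAEMIA
ANEMEA
ANEMIC
;
run;
```

Doing the same with TRANWRD would need one call per spelling, which is the contrast the "compact solution" point is making.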

Look before you leap
- Document thoroughly.
- Understand patterns.
- Define before use.
- Define only once in a data step.
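The last two points can be sketched as follows: the pattern is compiled once, on the first iteration of the data step, kept with RETAIN, and checked before it is used. This is only an illustration; the data set name checked, the flag has_day and the inline sample values are hypothetical. PRXPARSE returns a missing value when the expression cannot be compiled.

```sas
/* Sketch only: compile once, verify, then use on every row */
data checked;
  length PRSDTLTM $40;
  input PRSDTLTM $char40.;
  retain day_exp;
  if _n_ = 1 then do;
    * defined once to describe the day text pattern;
    day_exp = prxparse('/ ?[Dd](\D\D)? ?\d+ +/');
    if missing(day_exp) then do;
      putlog 'ERROR: the day text pattern did not compile.';
      stop;
    end;
  end;
  /* 1 when the row contains a day text, 0 otherwise */
  has_day = (prxmatch(day_exp, PRSDTLTM) > 0);
  drop day_exp;
  datalines;
Per 1 D01 Predose
Poststudy
;
run;
```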

…References
- support.sas.com
- Richard F. Pless, Ovation Research Group, Highland Park, IL. An Introduction to Regular Expressions with Examples from Clinical Data. Paper TU02.
- Ron Cody, Robert Wood Johnson Medical School, Piscataway, NJ. An Introduction to Perl Regular Expressions in SAS 9. SUGI 29, Tutorials.
- James J. Van Campen, SRI International, Menlo Park, CA. An Introduction to PERL Regular Expression in SAS®.

Q & A

Thank you