Introducing Oracle Regular Expressions
  Session id: 40105
  Jonathan Gennick, O'Reilly & Associates
  Peter Linsley, Oracle Corporation
What are Regular Expressions?
  A language, or syntax, you can use to describe patterns in text
    Example: [0-9]{3}-[0-9]{4}
  That which you can describe, you can find and manipulate
  Unix ed, grep, perl, and now everywhere!
Why Describe Patterns?
  Humans have long worked with patterns:
    Postal and email addresses
    URLs
    Phone numbers
  Often it's not the data that's important, but the pattern:
    Bioinformatics
    Validate format of URLs and email addresses
    Correct formatting of phone numbers
Pre-Oracle Database 10g
  Find parks with acreage in their descriptions:
    SELECT *
    FROM park
    WHERE description LIKE '%acre%';
  Finds '217-acre' and '27 acres', but also 'few acres', 'more acres than all other parks', 'the location of a massacre', etc.
Pre-Oracle Database 10g cont.
  Pattern matching with LIKE
    Limited to only two operators: % and _
  OWA_PATTERN
    No support for alternation, ASCII only, relatively poor performance
  Non-native solutions
    External procedures: difficult to deploy, maintain, and support
    Client-based solutions: pull all that data down across the network
Oracle Database 10g
  Four regular expression functions
    REGEXP_LIKE     does pattern match?
    REGEXP_INSTR    where does it match?
    REGEXP_SUBSTR   what does it match?
    REGEXP_REPLACE  replace what matched.
  POSIX Extended Regular Expressions
  UNIX Regular Expressions
  Backreference support added
  Longest match not supported
REGEXP_LIKE
  Determine whether a pattern exists in a string
  Revisiting the acreage problem:
    SELECT *
    FROM park
    WHERE REGEXP_LIKE(description, '[0-9]+(-| )acre');
  Finds '217-acre' and '27 acres'
  Rejects 'few acres', 'more acres than all other parks', 'the location of a massacre', etc.
Useful for Constraints
  Filter allowable data with a check constraint
  Only allow alphabetical characters:
    CREATE TABLE t1 (
      c1 VARCHAR2(20),
      CHECK (REGEXP_LIKE(c1, '^[[:alpha:]]+$')));
    INSERT INTO t1 VALUES ('newuser');
    1 row created.
    INSERT INTO t1 VALUES ('newuser1');
    ORA-02290: check constraint violated
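The same idea extends to other fixed formats. The following is a minimal sketch, assuming a hypothetical address table with a zip column and assuming that the US ZIP and ZIP+4 formats are the only values to accept; none of these names come from the slides.

```sql
-- Hypothetical table, column, and constraint names, for illustration only.
CREATE TABLE address (
  zip VARCHAR2(10),
  CONSTRAINT zip_format_chk
    CHECK (REGEXP_LIKE(zip, '^[0-9]{5}(-[0-9]{4})?$'))
);

-- Accepted: five digits, optionally followed by a hyphen and four digits.
INSERT INTO address VALUES ('94065');
INSERT INTO address VALUES ('94065-1234');

-- Rejected with ORA-02290: a letter, or a wrong digit count, does not match.
INSERT INTO address VALUES ('9406A');
INSERT INTO address VALUES ('940651');
```

The ^ and $ anchors matter here: without them the constraint would accept any value that merely contains five consecutive digits.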
Metacharacters
  Operator   Description
  .          match any character
  a?         match 'a' zero or one time
  a*         match 'a' zero or more times
  a+         match 'a' one or more times
  a|b        match either 'a' or 'b'
  a{m,n}     match 'a' between m and n times
  [abc]      match either 'a' or 'b' or 'c'
  (abc)      match group 'abc'
  \n         match nth group
  [:cc:]     match character class
  [.ce.]     match collation element
  [=ec=]     match equivalence class
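A few of these operators can be tried directly against dual. This is a small sketch using literal strings chosen for illustration; the column aliases are arbitrary.

```sql
-- Interval quantifier {m}: a phone-number pattern found inside a longer string.
SELECT REGEXP_SUBSTR('Call 555-1234 today', '[0-9]{3}-[0-9]{4}') AS phone
FROM dual;                       -- 555-1234

-- Alternation inside a group, followed by an optional character.
SELECT REGEXP_SUBSTR('27 acres', '[0-9]+(-| )acres?') AS acreage
FROM dual;                       -- 27 acres

-- Backreference: \1 must repeat whatever the first group matched.
SELECT REGEXP_SUBSTR('hello', '(l)\1') AS doubled
FROM dual;                       -- ll
```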
REGEXP_INSTR
  Find out where a match occurs:
    SELECT REGEXP_INSTR(description, '[0-9]+(-| )acre')
    FROM park;

    REGEXP_INSTR(DESCRIPTION,'[0-9]+…
    ---------------------------------
                                    6
                                   20
                                    …
REGEXP_SUBSTR
  Determine what text matched:
    SELECT REGEXP_SUBSTR(description, '[0-9]+(-| )acre')
    FROM park;

    REGEXP_SUBSTR(DESCRIPT
    ----------------------
    217-acre
    27 acre
    …
REGEXP_SUBSTR Cont.
  To extract just the acreage value:
    SELECT REGEXP_SUBSTR(
             REGEXP_SUBSTR(description, '[0-9]+(-| )acre'), '[0-9]+')
    FROM park;

    REGEXP_SUBSTR(REGEXP
    --------------------
    217
    27
REGEXP_REPLACE
  Convert acres to hectares:
    UPDATE park
    SET description = REGEXP_REPLACE(
          description, '([0-9]+)(-| )acre',
          TO_CHAR(0.4047 * TO_NUMBER(
            REGEXP_SUBSTR(
              REGEXP_SUBSTR(description, '[0-9]+(-| )acre'), '[0-9]+')))
          || '\2' || 'hectare');
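REGEXP_REPLACE Cont.
  Step by step, for one row:
    Source text:             This 217-acre park is wonderful.
    Pattern matches:         217-acre
    Inner REGEXP_SUBSTR:     217
    Conversion:              217 * 0.4047 = 87.8199
    Replacement string:      87.8199\2hectare
    With \2 (group 2) filled in: 87.8199-hectare
    Result:                  This 87.8199-hectare park is wonderful.
  In the pattern, group 1 is ([0-9]+) and group 2 is (-| ), so \2 carries the original separator into the replacement.
    UPDATE park
    SET description = REGEXP_REPLACE(
          description, '([0-9]+)(-| )acre',
          TO_CHAR(0.4047 * TO_NUMBER(
            REGEXP_SUBSTR(
              REGEXP_SUBSTR(description, '[0-9]+(-| )acre'), '[0-9]+')))
          || '\2' || 'hectare');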
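Backreferences in the replacement string are easier to see in isolation. The sketch below uses literal strings picked for illustration rather than the park table, and the column aliases are arbitrary.

```sql
-- \1 and \2 in the replacement refer to the first and second groups.
SELECT REGEXP_REPLACE('Gennick, Jonathan', '(.+), (.+)', '\2 \1') AS full_name
FROM dual;                       -- Jonathan Gennick

-- The separator is captured as group 2 and reused, so '-' and ' ' both survive.
-- With no occurrence argument, every match in the string is replaced.
SELECT REGEXP_REPLACE('217-acre and 27 acres', '([0-9]+)(-| )acre', '\1\2ACRE')
         AS marked
FROM dual;                       -- 217-ACRE and 27 ACREs
```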
Oracle Regular Expressions
  Demonstration
Performance
  Pattern matching can be complex
    Need to compile to state machine
    Lex and parse
    Examine all possible branches until match found
  Compiled once per statement
  Can be faster than LIKE for complex scenarios
  Usually faster than PL/SQL equivalent
    ZIP code checking 5 times faster
Performance Cont.
  Some poorly-performing expressions:
    'a{2}' will be slower than 'aa'
    '.*b' on input that doesn't contain a 'b' can also be quite time-consuming
  See Mastering Regular Expressions by Jeffrey Friedl, Chapter 6, Crafting an Efficient Expression
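Using with Indexes
  Use function-based indexes:
    CREATE INDEX acre_ind ON park (
      REGEXP_SUBSTR(
        REGEXP_SUBSTR(description, '[0-9]+(-| )acre'), '[0-9]+'));
  To support regular expression queries:
    SELECT *
    FROM park
    WHERE REGEXP_SUBSTR(
            REGEXP_SUBSTR(description, '[0-9]+(-| )acre'), '[0-9]+') = '217';
  REGEXP_SUBSTR returns a string, so compare against the string '217'. Comparing against the number 217 wraps the expression in an implicit TO_NUMBER, and it then no longer matches the indexed expression.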
Using with Views
  Hide the complexity from users:
    CREATE VIEW park_acreage AS
    SELECT park_name,
           REGEXP_SUBSTR(
             REGEXP_SUBSTR(description, '[0-9]+(-| )acre'), '[0-9]+') acreage
    FROM park;
Using with PL/SQL
  REGEXP_LIKE acts as a Boolean function in PL/SQL:
    IF REGEXP_LIKE(description, '[0-9]+(-| )acre') THEN
      acres := REGEXP_SUBSTR(
                 REGEXP_SUBSTR(description, '[0-9]+(-| )acre'), '[0-9]+');
    ...
  All other functions act identically in PL/SQL and SQL.
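The fragment above can be fleshed out into a complete anonymous block. This is a sketch only: the variable names and the sample description are assumptions made for illustration, and the explicit TO_NUMBER is added here rather than taken from the slide.

```sql
-- In SQL*Plus, run SET SERVEROUTPUT ON first to see the output.
DECLARE
  -- Hypothetical sample value standing in for a row from the park table.
  description VARCHAR2(100) := 'This 217-acre park is wonderful.';
  acres       NUMBER;
BEGIN
  -- REGEXP_LIKE returns a Boolean in PL/SQL, so it can drive an IF directly.
  IF REGEXP_LIKE(description, '[0-9]+(-| )acre') THEN
    -- Inner call isolates '217-acre'; outer call keeps only the digits.
    acres := TO_NUMBER(
               REGEXP_SUBSTR(
                 REGEXP_SUBSTR(description, '[0-9]+(-| )acre'), '[0-9]+'));
    DBMS_OUTPUT.PUT_LINE('Acres: ' || acres);   -- Acres: 217
  END IF;
END;
/
```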
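Longest Match vs Greediness
  Greediness = each element matches as much as possible. For example:
    SELECT REGEXP_SUBSTR('In the beginning', '.+[[:space:]]')
    FROM dual;

    In the
  The result includes the trailing space: .+ gives back only enough characters for [[:space:]] to match the last space in the string.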
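Longest Match vs Greediness
  Longest match = among all the variations that could match, pick the one resulting in the greatest number of matching characters.
  Oracle does not do this. Alternatives are tried left to right and the first one that succeeds wins, so their order matters:
    SELECT REGEXP_SUBSTR('bbb', 'b|bb') FROM dual;

    b

    SELECT REGEXP_SUBSTR('bbb', 'bb|b') FROM dual;

    bb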
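Optional Parameters
  All but REGEXP_LIKE take optional parameters for starting position and occurrence:
    REGEXP_INSTR  (source, pattern, start, occurrence, return_option, match)
    REGEXP_SUBSTR (source, pattern, start, occurrence, match)
    REGEXP_REPLACE(source, pattern, replace, start, occurrence, match)
  REGEXP_INSTR also takes return_option: 0 returns the position where the match begins, 1 the position just after it ends.
  For example, to return the tenth run of non-space characters in the description column:
    REGEXP_SUBSTR(description, '[^[:space:]]+', 1, 10)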
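The start and occurrence arguments behave the same way across the three functions. Here is a brief sketch against a literal string chosen for illustration; the column aliases are arbitrary.

```sql
-- Second run of non-space characters, searching from position 1.
SELECT REGEXP_SUBSTR('one two three', '[^[:space:]]+', 1, 2) AS word
FROM dual;                       -- two

-- Position where that second match starts.
SELECT REGEXP_INSTR('one two three', '[^[:space:]]+', 1, 2) AS start_pos
FROM dual;                       -- 5

-- Replace only the second occurrence; the other matches are left alone.
SELECT REGEXP_REPLACE('one two three', '[^[:space:]]+', 'TWO', 1, 2) AS replaced
FROM dual;                       -- one TWO three
```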
Match Parameter
  All functions take an optional match parameter:
    Is matching case sensitive? ('c' for case-sensitive, 'i' for case-insensitive)
    Does period (.) match newlines? ('n')
    Is the source string one line or many? ('m')
  The match parameter comes last
Case-sensitivity
  Case-insensitive search:
    SELECT *
    FROM park
    WHERE REGEXP_LIKE(description, '[0-9]+(-| )acre', 'i');
Newline matching
    INSERT INTO park VALUES ('Park 6', '640' || CHR(10) || 'ACRE');
    SELECT *
    FROM park
    WHERE REGEXP_LIKE(description, '[0-9]+.acre', 'in');
  'i' ignores case and 'n' lets the period match the newline, so the new row is found.
String anchors
    INSERT INTO employee (surname)
    VALUES ('Ellison' || CHR(10) || 'Gennick');
    SELECT *
    FROM employee
    WHERE REGEXP_LIKE(surname, '^Ellison');
  Yes!
String anchors
    INSERT INTO employee (surname)
    VALUES ('Ellison' || CHR(10) || 'Gennick');
    SELECT *
    FROM employee
    WHERE REGEXP_LIKE(surname, '^Gennick');
  No! By default ^ anchors only to the start of the whole string, not to the start of each line.
String anchors
    INSERT INTO employee (surname)
    VALUES ('Ellison' || CHR(10) || 'Gennick');
    SELECT *
    FROM employee
    WHERE REGEXP_LIKE(surname, '^Gennick', 'm');
  Yes! With 'm' the source is treated as multiple lines, so ^ also matches after the newline.
Locale Support
  Full locale support
    All character sets
    All languages
  Case and accent insensitive searching
  Linguistic range
  Character classes
  Collation elements
  Equivalence classes
Character Sets and Languages
  For example, you can search for Ukrainian names beginning with Ґ and ending with к:
    SELECT *
    FROM employee
    WHERE REGEXP_LIKE(surname, '^Ґ[[:alpha:]]*к$', 'n');
Case- and Accent-Insensitive Searching
  Respect for NLS settings:
    ALTER SESSION SET NLS_SORT = GENERIC_BASELETTER;
  With this sort, case won't matter and an expression such as:
    REGEXP_INSTR(x, 'resume')
  will find "resume", "résumé", "Résume", etc.
Linguistic Range
  Ranges respect NLS_SORT settings:
    NLS_SORT=GERMAN      [a-z] matches a,b,c…z
    NLS_SORT=GERMAN_CI   [a-z] matches a,A,b,B,c,C…z,Z
Character Classes
  Character classes such as [:alpha:] and [:digit:] encompass more than just Latin characters.
  For example, [:digit:] matches:
    Latin 0 through 9
    Arabic-Indic ٠ through ٩
    And more
Collation Elements
    ALTER SESSION SET NLS_SORT=XSPANISH;
    SELECT REGEXP_SUBSTR(
             'El caballo, Chico come la tortilla.',
             '[[:alpha:]]*[ch][[:alpha:]]*', 1, 1, 'i')
    FROM dual;

    caballo
  [ch] is an ordinary bracket expression: it matches a single 'c' or a single 'h'.
Collation Elements
    ALTER SESSION SET NLS_SORT=XSPANISH;
    SELECT REGEXP_SUBSTR(
             'El caballo, Chico come la tortilla.',
             '[[:alpha:]]*[[.ch.]][[:alpha:]]*', 1, 1, 'i')
    FROM dual;

    Chico
  [[.ch.]] matches 'ch' as one collation element, which XSPANISH treats as a single letter.
Equivalence Classes
  Ignore case and accents without changing NLS_SORT:
    REGEXP_INSTR(x, 'r[[=e=]]sum[[=e=]]')
  Finds 'resume', 'résumé', and 'rEsumE'
Conclusion
  String searching and manipulation is at the heart of a great many applications
  Oracle Regular Expressions provide versatile string manipulation in the database instead of externalized in middle-tier logic
  They are locale-sensitive and support character large objects
  Available in both SQL and PL/SQL
Next Steps
  Recommended sessions
    Session #40088 New SQL Capabilities
    Session #40202 Oracle HTML DB
  Recommended demos and/or hands-on labs
    Database Globalization Pod R
  Relevant web sites to visit for more information
    http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html
Shameless Plug
  Oracle Regular Expressions Pocket Reference
  Jonathan Gennick & Peter Linsley
  Free! At the O'Reilly & Associates booth