Advanced Regular Expressions
Or: What’s special about RegEx in MX
CFUNITED – The premier ColdFusion conference
www.cfunited.com

Your Presenter
- Michael Dinowitz
- Head of House of Fusion
- Publisher of Fusion Authority
- Founding member of Team Macromedia
- Doing this since June 95
- Called on for the black magic code
June 28th – July 1st 2006

Disclaimer & Introduction
- If you don’t know the basics – get out
- No real changes from CF 5 or CFMX 6

Basic additions
- Greedy vs. Lazy
  - + is one or more, and as many as it can (greedy)
  - +? is one or more, but only as many as it needs (lazy)
  - ++ is the same as greedy but does not allow backtracking (possessive; not in CFMX)
- Nested sub expressions
  - Numbered in order of execution, from the outside in
  - Then left to right

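A minimal sketch of the greedy/lazy difference and of sub expression numbering, using reFind() with its returnsubexpressions argument set to true. The sample strings and variable names are made up for illustration.

```cfml
<cfscript>
// Hypothetical sample text with two bold tags
sample = "<b>one</b> and <b>two</b>";

// Greedy: .+ takes as much as it can, so the match runs to the last </b>
greedy = reFind("<b>.+</b>", sample, 1, true);
// Lazy: .+? takes only as much as it needs, so the match stops at the first </b>
lazy = reFind("<b>.+?</b>", sample, 1, true);

writeOutput(htmlEditFormat(mid(sample, greedy.pos[1], greedy.len[1])) & "<br>");
writeOutput(htmlEditFormat(mid(sample, lazy.pos[1], lazy.len[1])) & "<br>");

// Nested sub expressions: pos[1]/len[1] describe the whole match,
// then groups are numbered by their opening parenthesis, outer before inner.
phone = "555-867-5309";
parts = reFind("((\d+)-(\d+))-(\d+)", phone, 1, true);
for (i = 2; i lte arrayLen(parts.pos); i = i + 1) {
    writeOutput("group " & (i - 1) & ": " & mid(phone, parts.pos[i], parts.len[i]) & "<br>");
}
</cfscript>
```

The greedy call covers the whole sample string, while the lazy call covers only the first bold tag. The loop lists the outer group (555-867) first, then its two inner groups, then the last group.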
Character vs. POSIX classes
- Non-special characters become special
- Uses a backslash (\) to specify being special
- Shorter than POSIX classes
- Harder to ‘read’ for newbies

Basic Character Classes
- \b – word boundary
  - Any jump between alphanumeric and non-alphanumeric (the start and end of the string count as well)
  - reFindNoCase('\bbig\b', 'big') = 1
- \B – not a word boundary
  - The position between any 2 of the same ‘types’ of characters
  - reFindNoCase('\B', 'big') = 2

More Character Classes
- \A – same as ^ (when not combined with (?m))
- \Z – same as $ (when not combined with (?m))
- \n – newline
- \r – carriage return
- \t – tab
- \d – any digit ([0-9])
- \D – any non digit ([^0-9])

More Character Classes
- \w – any alphanumeric character ([[:alnum:]])
- \W – any non-alphanumeric character ([^[:alnum:]])
- \s – any whitespace character, including tab, space, newline, carriage return, and form feed ([ \t\n\r\f])
- \S – any non-whitespace character ([^ \t\n\r\f])

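As a quick illustration of the backslash classes next to their POSIX-style equivalents, here is a small sketch. The sample strings are invented for the example.

```cfml
<cfscript>
// Hypothetical sample string
sample = "Order 66 shipped";

// The backslash class and its POSIX-style equivalent find the same position
shortForm = reFind("\d+", sample);
posixForm = reFind("[[:digit:]]+", sample);
writeOutput(shortForm & " / " & posixForm & "<br>");

// \s+ matches a run of whitespace; replace each run with a single underscore
writeOutput(reReplace("tab" & chr(9) & "and  spaces", "\s+", "_", "all"));
</cfscript>
```

Both searches report position 7, and the replace turns the tab and the double space into one underscore each.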
Expression Modifiers
- Placed at the beginning of the expression
- (?i)
  - Causes the expression to be case insensitive (same as the NoCase version of the function)
- (?m)
  - Multi-line mode
  - ^ and $ match at each line, not just at the ends of the entire string
  - Carriage return Chr(13) is not treated as a new line

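A short sketch of both modifiers in use. The strings are invented, and the newline is built with chr(10) because that is the character (?m) treats as a line break.

```cfml
<cfscript>
// (?i) inside the pattern behaves like calling the NoCase function
a = reFind("(?i)cold", "ColdFusion");
b = reFindNoCase("cold", "ColdFusion");
writeOutput(a & " / " & b & "<br>");

// Two lines separated by a newline, Chr(10)
twoLines = "first" & chr(10) & "second";
// Without (?m), ^ only matches at the start of the whole string
writeOutput(reFind("^second", twoLines) & "<br>");
// With (?m), ^ also matches at the start of each line
writeOutput(reFind("(?m)^second", twoLines));
</cfscript>
```

The first line prints 1 / 1. The search without (?m) returns 0, and the search with (?m) returns 7, the position where the second line starts.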
Expression Modifiers
- (?x) ignores all white space in the expression
  - Also allows usage of ## for comments
  - ## will comment to end of line

  reFind("(?x)
      one                  ##first option
      |two                 ##second option
      |three\ point\ five  ## note escaped spaces
      ", "three point five")

Group Modifiers
- Affects only the group it’s in
- Must be at the beginning of the group
- (?##...) comment
  - The # must be escaped as ## in ColdFusion
- (?:) does not add the group to the return collection
- (?=) positive lookahead
- (?!) negative lookahead

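To show what "does not add the group to the return collection" means in practice, the following sketch compares a capturing group with (?:). The sample name and the variable names are assumptions made for the example.

```cfml
<cfscript>
person = "Ms. Smith";
// Both groups are captured: pos holds the whole match plus two groups
capturing = reFind("(Mr|Ms)\. (\w+)", person, 1, true);
// (?:...) still groups the alternation but is left out of the returned arrays
nonCapturing = reFind("(?:Mr|Ms)\. (\w+)", person, 1, true);
writeOutput(arrayLen(capturing.pos) & " / " & arrayLen(nonCapturing.pos) & "<br>");
// With the title group skipped, the surname is now the first captured group
writeOutput(mid(person, nonCapturing.pos[2], nonCapturing.len[2]));
</cfscript>
```

The array lengths come out as 3 / 2, and the last line prints Smith.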
Positive Lookahead
- Tests if the text in the parentheses exists
- Does not save the text into the return collection
- Does not ‘consume text’
- <a(?=.+href).+?href="([^"]+).+?>

Negative Lookahead
- Tests if the text in the parentheses does not exist
- Does not save the text into the return collection
- Does not ‘consume text’
- (<a(?!.+?target) [^>]+>)

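The two lookahead forms can be compared on a deliberately simple string. This is only a sketch; the sample text is made up and is much simpler than the anchor-tag patterns on the slides.

```cfml
<cfscript>
sample = "ColdFusion ColdSpring";

// Positive lookahead: "Cold" only when followed by "Fusion".
// The lookahead text is tested but not consumed, so the match length is 4.
ahead = reFind("Cold(?=Fusion)", sample, 1, true);
writeOutput(ahead.pos[1] & " / " & ahead.len[1] & "<br>");

// Negative lookahead: "Cold" only when NOT followed by "Fusion"
writeOutput(reFind("Cold(?!Fusion)", sample));
</cfscript>
```

The positive lookahead matches at position 1 with a length of 4, and the negative lookahead skips the first word and matches at position 12.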
Replace conversion
- Used in REReplace()/REReplaceNoCase()
- Either converts the ‘next’ character or a specific section of characters
- \u – converts the next character to uppercase
- \l – converts the next character to lowercase
- \U…\E – converts block to uppercase
- \L…\E – converts block to lowercase

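A small sketch of the conversion codes inside a replacement string, assuming an invented two-word sample.

```cfml
<cfscript>
sample = "hello world";
// \u uppercases only the next character, here the first letter of each word
writeOutput(reReplace(sample, "(\w+)", "\u\1", "all") & "<br>");
// \U...\E uppercases everything between the two markers
writeOutput(reReplace(sample, "(\w+) (\w+)", "\U\1\E \2"));
</cfscript>
```

The first call is expected to capitalize each word, and the second to uppercase only the first word and leave the second as it was.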
Not Supported
- Positive lookbehinds
- Negative lookbehinds
- Other features
- All accessible through the Java RegEx engine
- Massimo has a CFC pre-built to do this

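One way to reach the Java engine directly is through createObject(). The sketch below is not the pre-built CFC mentioned above, just a minimal illustration that calls java.util.regex.Pattern itself; the sample text and variable names are assumptions.

```cfml
<cfscript>
// java.util.regex.Pattern ships with the JDK that CFMX runs on
patternClass = createObject("java", "java.util.regex.Pattern");
// Positive lookbehind: digits only when preceded by a dollar sign
compiled = patternClass.compile("(?<=\$)\d+");
matcher = compiled.matcher("Total: $42 for 3 items");
if (matcher.find()) {
    // group() with no arguments returns the text of the whole match
    writeOutput(matcher.group());
}
</cfscript>
```

This prints 42: the 3 is skipped because no dollar sign precedes it.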
Resources
- Chapters in most CFMX books
- CF-RegEx mailing list
- This presentation
- Books:
  - Mastering Regular Expressions, 2nd Edition
  - Teach Yourself Regular Expressions in 10 Minutes
  - Java Regular Expressions: Taming the java.util.regex Engine