Satisfy Your Technical Curiosity
Regular Expressions: The hidden power language
Roy Osherove, Methodology & Team System Expert, Sela Group

Tools

The Log File

Developer Problem: Make this log file useful
  Entries from an old *nix system’s log file
  Converted to and from various formats
  Searched by users
  Format may change
  Search fields can be added, removed or renamed at runtime

  Date CPUs|ram|cpu HH:mm:ss action user domain.machine
  25/05/1998 1|00512|x86 21:49:12 [Search] Anakin Antler.Anita1
  25/05/98 1|00512|x86 21:51:15 [Update] Anakin Antler.Anita1
  26/05/1998 1|00256|x86 11:02:45 [Search] Darth Cydot.Uk.Gerry2k
  26/05/98 1|00256|x86 11:12:49 [Update] Darth Cydot.Uk.Gerry2k
  27/05/98 1|00512|x86 15:34:30 [Search] Anakin Anterl.Anita1
  12/08/1998 2|01024|x86 10:14:53 [Search] Obi Monaco.Huarez

About 15 minutes later… Done.
About 45 minutes later… Home early.

You can be home early too!
  Regex is easier than you think

What are Regular Expressions?
  A language to describe a language using “patterns”
  Think SQL or XPath, but for text
  Popularized by Perl and *nix shell scripting
  Many variations and frameworks exist
    Only one for .NET (for now)
  Used in most languages

Common Regex Uses
  Text Validation
    Phones, emails, addresses or any format requirement
  Text Manipulation
    Transform text
  Text Parsing
    Find in files, site scraping, data collection

What .NET brings to the table
  Full object model
  Extended syntax
  Optimization techniques in the framework

.NET Regular Expressions
  Show up in several places:
    In the classes of the System.Text.RegularExpressions namespace
    Via the RegularExpressionValidator control (for ASP.NET)
    Sprinkled in dozens of other places
      Browser capabilities filter
      In the WSDL tag
      And many more

Key Classes within System.Text.RegularExpressions
  Regex
    Contains the pattern and matching options
    Important methods:
      IsMatch() returns a Boolean
      Replace() returns a string
      Split() returns a string array
      …
    Main Use: Validation, Splitting, Replacing text

The Process
  [Diagram: a pattern, input text and options go into Regex, which finds matches, splits text or replaces text]

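The flow above (pattern, options and input in; a yes/no answer, replaced text or split text out) can be sketched in a few lines. This is only an illustration: Python's `re` module stands in for the .NET `Regex` class, and the patterns and sample text are made up for the example.

```python
import re

# Sample input, invented for this sketch.
text = "RAM 512, CPUs 2"

# Pattern plus options, compiled once (roughly what a .NET Regex object holds).
label = re.compile(r"cpus?", re.IGNORECASE)
digits = re.compile(r"\d+")

# Validation: is there at least one match? (compare Regex.IsMatch)
has_cpu_label = label.search(text) is not None

# Replacing: swap every run of digits for a placeholder (compare Regex.Replace)
masked = digits.sub("#", text)

# Splitting: cut the input at every run of digits (compare Regex.Split)
parts = digits.split(text)

print(has_cpu_label)
print(masked)
print(parts)
```

Note that splitting on a match at the very end of the input leaves an empty string as the last piece.
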
Validation

Syntax
  Match exact text as written in the pattern
    ‘a’ will match every ‘a’ in the text
  Except for special symbols:

Enclosing Alternatives with []
  The square brackets let you specify a list of alternative characters.
  Used in conjunction with the - operator, you can even specify character ranges.
    [Cc]          Capital or lowercase c
    [A-Z]         Any capital letter A through Z
    [A-Za-z]      Any capital or lowercase letter
    [0-9]         Any digit 0 through 9
    [A-Za-z0-9]   Any letter or digit
    [0-9.+&=%-]   Any digit or special char listed
  Notice: no escape needed inside the brackets, as long as a literal - comes last (or first); between two characters it defines a range

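A quick way to get a feel for character classes is to try them on single characters. The sketch below uses Python's `re` module as a stand-in for .NET, with sample strings invented for the example. It also shows why the position of a literal hyphen matters: written between `+` and `&`, the hyphen is read as a range that runs backwards, and the pattern is rejected.

```python
import re

# Each class matches exactly one character from its list or range.
lower_or_upper_c = re.fullmatch(r"[Cc]", "c") is not None
letter_or_digit = re.fullmatch(r"[A-Za-z0-9]", "7") is not None
capital_only = re.fullmatch(r"[A-Z]", "q") is not None

# A literal hyphen placed last in the class needs no escape.
special = re.compile(r"[0-9.+&=%-]")
found = special.findall("a=1+2-3%")

# A hyphen between two characters means a range; "+-&" runs backwards,
# so compiling this class raises an error instead of matching a hyphen.
try:
    re.compile(r"[0-9.+-&=%]")
    reversed_range_accepted = True
except re.error:
    reversed_range_accepted = False

print(lower_or_upper_c, letter_or_digit, capital_only)
print(found)
print(reversed_range_accepted)
```
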
Controlling Expression Frequency with {}
  The {} operators let you control how many times the preceding expression repeats.
  The expression takes one of these two forms:
    {occurrences}                    [A-Za-z]{3}
    {MinOccurrences,MaxOccurrences}  [A-Za-z]{1,3}

Basic Frequency Operators
  ?  0 or 1
  *  0 or more
  +  1 or more
  So, 3+ will match 3, 33, 3333 but not 45, 678.

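The frequency operators are easiest to check against whole strings. A minimal sketch, again with Python's `re` standing in for .NET; the helper `whole` and the `colou?r` and `ab*` patterns are hypothetical and exist only for this example.

```python
import re

def whole(pattern, text):
    """Return True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# + means one or more: only the strings made purely of 3s survive.
threes = [s for s in ["3", "33", "3333", "45", "678"] if whole(r"3+", s)]

# {n} and {min,max} count repetitions of the preceding expression.
exactly_three = [whole(r"[A-Za-z]{3}", s) for s in ["abc", "ab"]]
one_to_three = [whole(r"[A-Za-z]{1,3}", s) for s in ["ab", "abcd"]]

# ? means zero or one; * means zero or more.
optional_u = [whole(r"colou?r", s) for s in ["color", "colour", "colouur"]]
any_bs = [whole(r"ab*", s) for s in ["a", "abbb", "b"]]

print(threes, exactly_three, one_to_three, optional_u, any_bs)
```
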
Wildcard Operator: .
  . matches any non-newline character
    Unless single-line mode (RegexOptions.Singleline) has been turned on for the pattern
  Examples:
    A.$ would match a capital A followed by any one character at the end of the input. Will not match Abc
    A.+ would match a capital A followed by one or more non-newline characters
    \.htm.? would match ".htm" followed by an optional non-newline character
  Backslash escapes characters that have reserved meanings in regular expressions

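The wildcard examples can be tried directly. In this sketch Python's `re` stands in for .NET: `re.DOTALL` plays the role of .NET's `RegexOptions.Singleline`, and `re.MULTILINE` corresponds to `RegexOptions.Multiline`, which only changes how `^` and `$` behave. The sample strings are invented.

```python
import re

# A.$ needs exactly one character between the A and the end of the input.
ends_with_a_plus_one = re.search(r"A.$", "xA1") is not None
abc_matches = re.search(r"A.$", "Abc") is not None

# By default the dot stops at a newline...
default_mode = re.search(r"A.+", "A\nB") is not None
# ...and multiline mode does not change that...
multiline_mode = re.search(r"A.+", "A\nB", re.MULTILINE) is not None
# ...but dot-all (single-line) mode does.
dotall_mode = re.search(r"A.+", "A\nB", re.DOTALL) is not None

# The backslash makes the first dot literal; the second dot is still a wildcard.
htm = [re.fullmatch(r"\.htm.?", s) is not None for s in [".htm", ".html", ".htmlx", "xhtm"]]

print(ends_with_a_plus_one, abc_matches)
print(default_mode, multiline_mode, dotall_mode)
print(htm)
```
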
Convenience Expressions
  \d  Any digit
  \D  Any non-digit (it must still match one character, just not a digit)
  \s  Any whitespace character (such as a space or tab)
  \S  Any character other than a whitespace character
  \w  Any letter, digit or underscore
  \W  Any character other than a letter, digit or underscore
  Many more: Unicode, hex values, negative lookarounds…

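A small sketch of the shorthand classes, with Python's `re` standing in for .NET and a made-up sample string. Two details are worth seeing in action: `\w` also accepts the underscore, and a negated class such as `\D` still has to consume a character, so it does not match an empty string.

```python
import re

sample = "user_1 ran 2 jobs"  # invented sample text

digits = re.findall(r"\d", sample)
words = re.findall(r"\w+", sample)      # the underscore counts as a word character
spaces = re.findall(r"\s", sample)
non_space_runs = re.findall(r"\S+", sample)

# \D means "one character that is not a digit", not "no character at all".
non_digit_matches_empty = re.fullmatch(r"\D", "") is not None
non_digit_matches_letter = re.fullmatch(r"\D", "x") is not None

print(digits, words, len(spaces), non_space_runs)
print(non_digit_matches_empty, non_digit_matches_letter)
```
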
Quick Quiz!
  [A-Za-z]{3}
    3 capital or lowercase letters
    Abc, abc, aBC, 1bc
  [A-Z][a-z]{2,4}
    A capital letter followed by at least 2 but not more than 4 lowercase letters
    Abc, Acbde, abcde, ABcde
  \w{3,8}\.\w{3}
    3 to 8 alphanumeric characters, followed by a dot and 3 alphanumerics
    Filename.txt, d0main.com

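One way to check the quiz is to run every candidate through its pattern. The sketch below uses Python's `re` in place of .NET and assumes the quiz asks whether the whole candidate matches, which is what `re.fullmatch` tests. That assumption matters: an unanchored search can still find a match inside a candidate that fails as a whole.

```python
import re

quiz = {
    r"[A-Za-z]{3}": ["Abc", "abc", "aBC", "1bc"],
    r"[A-Z][a-z]{2,4}": ["Abc", "Acbde", "abcde", "ABcde"],
    r"\w{3,8}\.\w{3}": ["Filename.txt", "d0main.com"],
}

# Keep only the candidates that match their pattern from start to end.
answers = {
    pattern: [c for c in candidates if re.fullmatch(pattern, c)]
    for pattern, candidates in quiz.items()
}

for pattern, matched in answers.items():
    print(pattern, "->", matched)

# Without anchoring, "ABcde" still contains a match starting at the B.
inner = re.search(r"[A-Z][a-z]{2,4}", "ABcde").group()
print(inner)
```
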
Splitting and Manipulating

Text Manipulation

The Spammer

(2) Key Classes within System.Text.RegularExpressions
  MatchCollection - Match
    MatchCollection stores all the matches found
  GroupCollection - Group
  CaptureCollection - Capture
  Regex.Match() returns Match
  Regex.Matches() returns MatchCollection
  …
  Main Use: Parsing, searching, collecting data

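The difference between asking for one match and asking for all of them can be sketched as follows. Python's `re` is used as a rough analogue here, not as the .NET API: `search` plays the part of `Regex.Match()` and `finditer` the part of `Regex.Matches()`. Python match objects expose groups but have no counterpart to `CaptureCollection`. The pattern and sample text are invented.

```python
import re

text = "[Search] Anakin, [Update] Anakin, [Search] Darth"
entry = re.compile(r"\[(\w+)\] (\w+)")

# First match only (compare Regex.Match).
first = entry.search(text)
first_text = first.group(0)     # the whole matched text
first_action = first.group(1)   # first parenthesized group
first_start = first.start()     # position of the match in the input

# Every match, in order (compare Regex.Matches).
all_matches = list(entry.finditer(text))
users = [m.group(2) for m in all_matches]

print(first_text, first_action, first_start)
print(len(all_matches), users)
```
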
Simple parsing
  Parsing for emails

Grouping (the coolest part)

Grouping (pay attention!)
  Groups give us object models
  HTML File
  Create a capture hierarchy and use it in code
  [\w\.\-]+\.\w{2,5}

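To see what "groups give us object models" might look like in code, here is a minimal sketch with Python's `re` standing in for .NET. It takes the pattern shown above and wraps its two halves in named groups. The group names `host` and `tld` and the HTML snippet are assumptions added for illustration; they are not part of the original pattern.

```python
import re

# The slide's pattern with two named groups added (names are hypothetical).
site = re.compile(r"(?P<host>[\w.\-]+)\.(?P<tld>\w{2,5})")

# Made-up HTML fragment to search through.
html = (
    '<a href="http://tools.osherove.com">tools</a> '
    '<a href="http://regexlib.com">lib</a>'
)

# Each match carries its groups, so code can read fields by name
# instead of slicing strings by hand.
sites = [m.groupdict() for m in site.finditer(html)]

for entry in sites:
    print(entry["host"], entry["tld"])
```

Because the first group is greedy, it takes everything up to the last dot, which is why `tools.osherove` ends up in `host` and `com` in `tld`.
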
Grouping emails & The Regulator

Getting back to the first problem: Make this log file useful
  Entries from an old *nix system’s log file
  Converted to and from various formats
  Searched by users
  Format may change
  Search fields can be added, removed or renamed at runtime

  Date CPUs|ram|cpu HH:mm:ss action user domain.machine
  25/05/1998 1|00512|x86 21:49:12 [Search] Anakin Antler.Anita1
  25/05/98 1|00512|x86 21:51:15 [Update] Anakin Antler.Anita1
  26/05/1998 1|00256|x86 11:02:45 [Search] Darth Cydot.Uk.Gerry2k
  26/05/98 1|00256|x86 11:12:49 [Update] Darth Cydot.Uk.Gerry2k
  27/05/98 1|00512|x86 15:34:30 [Search] Anakin Anterl.Anita1
  12/08/1998 2|01024|x86 10:14:53 [Search] Obi Monaco.Huarez

How do I start?
  Take a sample of the log file
  Recognize the data pattern for each entry
  Use groups to get each line’s values
  Create a tool that uses this regex to parse a log file
  The tool will use the returned results to generate the log as XML
  Load the XML into a DataSet
  Allow user to run “Select” statements on the DataSet

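The steps above can be sketched end to end. This is not the tool from the talk, only one possible reading of it: Python's `re` and `xml.etree.ElementTree` stand in for the .NET classes, a plain list of dictionaries stands in for the DataSet, and the group names are taken from the log's header line. The split of the last column into `domain` and `machine` at the final dot is an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Step 1: a sample of the log file (the entries shown earlier in the deck).
SAMPLE = """\
25/05/1998 1|00512|x86 21:49:12 [Search] Anakin Antler.Anita1
25/05/98 1|00512|x86 21:51:15 [Update] Anakin Antler.Anita1
26/05/1998 1|00256|x86 11:02:45 [Search] Darth Cydot.Uk.Gerry2k
26/05/98 1|00256|x86 11:12:49 [Update] Darth Cydot.Uk.Gerry2k
27/05/98 1|00512|x86 15:34:30 [Search] Anakin Anterl.Anita1
12/08/1998 2|01024|x86 10:14:53 [Search] Obi Monaco.Huarez
"""

# Steps 2 and 3: one named group per field of an entry.
ENTRY = re.compile(r"""
    (?P<date>\d{2}/\d{2}/(?:\d{4}|\d{2}))\s+
    (?P<cpus>\d+)\|(?P<ram>\d+)\|(?P<cpu>\w+)\s+
    (?P<time>\d{2}:\d{2}:\d{2})\s+
    \[(?P<action>\w+)\]\s+
    (?P<user>\w+)\s+
    (?P<domain>[\w.]+)\.(?P<machine>\w+)
""", re.VERBOSE)

def parse(log_text):
    """Step 4: return one dict per line that matches the entry pattern."""
    entries = []
    for line in log_text.splitlines():
        match = ENTRY.fullmatch(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries

def to_xml(entries):
    """Step 5: write each entry as an <entry> element with one child per field."""
    root = ET.Element("log")
    for entry in entries:
        node = ET.SubElement(root, "entry")
        for field, value in entry.items():
            ET.SubElement(node, field).text = value
    return ET.tostring(root, encoding="unicode")

entries = parse(SAMPLE)
xml_text = to_xml(entries)

# Steps 6 and 7, approximated: filter the parsed rows like a simple query.
searches_by_anakin = [
    e for e in entries if e["user"] == "Anakin" and e["action"] == "Search"
]

print(len(entries), len(searches_by_anakin))
print(xml_text[:80])
```

If the log format changes or a field is renamed, only the pattern needs to change; the XML generation and the filtering pick up whatever group names the pattern defines.
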
Parsing a log file

Regulazy
  Build simple expressions by example
  No syntax knowledge needed
  Free
  tools.osherove.com

When not to use Regex
  When it’s easier and more readable to do it otherwise
  Not just because it’s “cool”
    Hard to read
    Steep learning curve
    Hard to maintain
  “Sometimes, when confronted with a problem, you might decide to solve it with Regular Expressions for the wrong reasons. Now you’ve got two problems.”

Summary
  Amazing parsing flexibility
  Good skill to have anywhere
  Can save you time and nerves
  With power comes responsibility
    Weigh the pros and cons before using

Resources
  The Regulator: tools.osherove.com
  Regulazy: tools.osherove.com
  Regexlib.com: Regex archive (+ Cheat Sheet)
  Roy Osherove

Thank you!
  Questions?
  Roy Osherove