BTANT129 w61 Regular expressions step by step Tamás Váradi
What are they?
Regular expressions (regexps) define a pattern, which may match a whole series of strings
Powerful, compact, fast
Useful for all sorts of text processing tasks
Where can I use them?
In text editors/word processors (even in MS Word to some extent!) like:
– TextPad, EditPad Pro (to name but two)
Special programs to search a set of files:
– grep, egrep, sed (free)
– PowerGREP
– Visual REGEXP
In programming languages:
– Perl, Python and other so-called scripting languages
What about INTEX?
Yes, INTEX has a built-in regexp facility
But it is a little limited and peculiar (INTEX offers graphs as an alternative)
In this lecture, we are going to cover regular expressions as used in the text processing tools mentioned above
Is there a standard variety?
More or less
There are variants that differ in:
– notation
– features (expressive power, elegance, etc.)
Here we'll concentrate on what you can expect regular expressions to do
First things first
Any character will match itself
Except characters with a special meaning (metacharacters): \ | ( ) [ { ^ $ * + ? .
The pattern is applied to the text from top to bottom and from left to right, like a sliding window moving over the text
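A minimal sketch of literal matching, assuming Python's re module (one of the script languages mentioned earlier; other tools may differ in notation). The sample sentence is made up for illustration.

```python
import re

text = "the cat sat on the mat"  # hypothetical sample text

# A pattern made only of ordinary characters matches itself.
# re.search scans from left to right and stops at the first match.
first = re.search(r"at", text)
print(first.start(), first.group())

# re.findall collects every non-overlapping match, left to right.
all_matches = re.findall(r"at", text)
print(all_matches)
```

The first match is found inside "cat"; the window then slides on and finds the same pattern in "sat" and "mat".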
Special characters
.  will match any one character
?  will match the preceding character zero times or once (at most once)
+  will match the preceding character one or more times (at least once)
*  will match the preceding character zero or more times
{n,m}  will match the preceding character at least n and at most m times
Examples
.at matches bat, cat, fat, pat, rat
c*at matches at, cat, ccat, cccat etc.
Guess what c* will match and why?
c+at matches cat, ccat, cccat etc., but not at
c?at matches at and cat
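The examples above can be tried out with a short sketch, assuming Python's re module. The helper full_matches and the word list are hypothetical names introduced here. re.fullmatch is used so that a word counts only if the whole word fits the pattern; a plain search would also find "at" inside "bat".

```python
import re

words = ["at", "cat", "ccat", "bat"]  # hypothetical test words

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

print(full_matches(r".at", words))       # any one character, then "at"
print(full_matches(r"c*at", words))      # zero or more "c", then "at"
print(full_matches(r"c+at", words))      # one or more "c", then "at"
print(full_matches(r"c?at", words))      # at most one "c", then "at"
print(full_matches(r"c{1,2}at", words))  # one or two "c", then "at"
```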
Anchor points
A regexp is matched against the text at any point where the first char of the regexp matches a char in the target text (the sliding window)
Matching is done line by line by default
^ : match at the beginning of the line
$ : match at the end of the line
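A small sketch of the two anchors, assuming Python's re module. To imitate the line-by-line behaviour of grep-style tools, the made-up sample text is split into lines and each line is tested separately.

```python
import re

# Hypothetical two-line sample text.
lines = "cat and dog\ndog and cat".splitlines()

# Without an anchor, "cat" is found anywhere in a line.
unanchored = [line for line in lines if re.search(r"cat", line)]

# ^cat only matches where "cat" starts the line.
starts_with_cat = [line for line in lines if re.search(r"^cat", line)]

# cat$ only matches where "cat" ends the line.
ends_with_cat = [line for line in lines if re.search(r"cat$", line)]

print(unanchored)
print(starts_with_cat)
print(ends_with_cat)
```

When a multi-line string is searched in one go, Python treats ^ and $ as line anchors only if the re.MULTILINE flag is passed.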
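Groups and alternations
(bla)*  parentheses group a sequence so that a quantifier applies to the whole group: zero or more repetitions of bla
Sir|Madam  the bar separates alternatives: matches either Sir or Madam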
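A sketch of the two patterns on this slide, assuming Python's re module. The "Dear ..." letter opening is a made-up example, not part of the lecture.

```python
import re

# (bla)* : the whole group "bla" repeated zero or more times.
empty_ok = re.fullmatch(r"(bla)*", "") is not None
twice_ok = re.fullmatch(r"(bla)*", "blabla") is not None
partial_ok = re.fullmatch(r"(bla)*", "blab") is not None

# Without parentheses, * applies only to the preceding character "a".
no_group_ok = re.fullmatch(r"bla*", "blaaa") is not None

# Sir|Madam : either alternative; parentheses limit the scope of the bar.
match = re.search(r"Dear (Sir|Madam)", "Dear Madam, thank you")
chosen = match.group(1)

print(empty_ok, twice_ok, partial_ok, no_group_ok, chosen)
```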
Character classes
[aeiou] matches any one character of the set
[^aeiou] matches any one character except those in the set
[a-zA-Z0-9] consecutive characters can be referred to with a range
Note: whatever the length of the set, it always represents a single character in the pattern, so it is a single-character alternation (an 'or' relation between characters)
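The three classes can be compared on one made-up string; a sketch assuming Python's re module.

```python
import re

text = "Regexp 101"  # hypothetical sample text

vowels = re.findall(r"[aeiou]", text)       # one character from the set
others = re.findall(r"[^aeiou]", text)      # one character NOT in the set
alnum = re.findall(r"[a-zA-Z0-9]", text)    # ranges: letters and digits

# A class always stands for exactly one character of the text:
two_vowels = re.fullmatch(r"[aeiou]", "ae")  # None, "ae" is two characters

print(vowels)
print(others)
print(alnum)
print(two_vowels)
```

Note that the negated class also matches the space, since a space is "any character except a vowel".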
Extended features
\d  a digit
\D  a non-digit
\s  a space, tab, linefeed, newline
\S  a non-whitespace
\w  a word-character
\W  a non-word-character
\b  a word-boundary
\n  a newline
\t  a tabulator
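A sketch of some of these shorthands, assuming Python's re module; the sample strings are invented. In Python 3 \d and \w also accept non-ASCII digits and letters by default, which other tools may not do.

```python
import re

text = "Room 42,\tfloor 3"  # hypothetical sample; \t here is a real tab

digits = re.findall(r"\d", text)      # single digits
numbers = re.findall(r"\d+", text)    # runs of digits
words = re.findall(r"\w+", text)      # runs of word characters
pieces = re.split(r"\s+", text)       # split on runs of whitespace

# \b marks a word boundary, so only the free-standing "cat" matches.
whole_word = re.findall(r"\bcat\b", "cat concat cats")

print(digits, numbers, words, pieces, whole_word)
```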
Longest vs. shortest match
When using quantifiers with non-literal characters (".", "\w", "\S" etc.) one can easily get unintended matches
.+  longest match (default)
.+?  shortest match
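A typical unintended match, sketched with Python's re module on a made-up snippet of markup: the aim is to find each tag, but the default longest match swallows everything between the first "<" and the last ">".

```python
import re

text = "<b>bold</b> and <i>italic</i>"  # hypothetical sample text

# .+ takes as much as it can, then backs off only as far as needed.
longest = re.findall(r"<.+>", text)

# .+? takes as little as it can, so each tag is matched separately.
shortest = re.findall(r"<.+?>", text)

print(longest)   # ['<b>bold</b> and <i>italic</i>']
print(shortest)  # ['<b>', '</b>', '<i>', '</i>']
```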
The escape character
Problem: what if we want to find characters that are metacharacters for regexps ( \ | ( ) [ { ^ $ * + ? . )?
Solution: they have to be preceded by "\" to strip them of their special meaning, e.g.: \( \$ \[ \? etc.
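A sketch of escaping, assuming Python's re module and an invented sample string. re.escape is a Python convenience that adds the backslashes for you; other tools may not offer an equivalent.

```python
import re

text = "Price: $5.00 (approx.)"  # hypothetical sample text

# Escaped metacharacters are matched literally.
price = re.findall(r"\$5\.00", text)
note = re.findall(r"\(approx\.\)", text)

# Unescaped, the dot still means "any one character":
loose = re.search(r"5.00", "5x00") is not None
strict = re.search(r"5\.00", "5x00") is not None

# re.escape builds a pattern that matches the given string literally.
escaped_hit = re.search(re.escape("$5.00"), text).group()

print(price, note, loose, strict, escaped_hit)
```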
Things to do
Look up the tutorial at
Download one of the tools (Visual REGEXP, PowerGREP, EditPad Pro) and experiment with texts
Follow the tutorial of EditPad Pro, which you can find in its Help