Regular Expression Beihang Open Source Club
A Practical Problem: Doubled Words
  Report and highlight lines that contain doubled words.
  A word at the end of one line may be repeated at the beginning of the next.
  The two copies may differ in capitalization.
  The two copies may be separated by HTML tags: '...it's <B>very</B> very important...'.
#!/usr/bin/perl
$/ = ".\n";    # Read input in chunks that end with a period and a newline.
while (<>) {
    # Highlight doubled words; skip chunks that contain none.
    next if !s/\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)/\e[7m$1\e[m$2\e[7m$3\e[m/ig;
    s/^(?:[^\e]*\n)+//mg;    # Remove any unmarked lines.
    s/^/$ARGV: /mg;          # Ensure lines begin with filename.
    print;
}
Regular expressions are the key to powerful, flexible, and efficient text processing.
Regular Expression as a Language
  In the shell: *.txt
  More powerful, more general ==> a generalized pattern language
  Regular expression: a complete language
  Two types of characters:
    Metacharacter => grammar
    Literal => word
  Filename pattern: limited metacharacters
  s!<emphasis>([0-9]+(\.[0-9]+){3})</emphasis>!<inet>$1</inet>!
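To see the substitution on this slide in action, here is a minimal sketch in Python's re module, assuming the intent is to retag a dotted IP address. The function name retag_ip and the sample text are made up for illustration, and Python writes the captured text as \1 where Perl writes $1.

```python
import re

# Same pattern as the Perl substitution on the slide: digits, then three
# groups of ".digits", all wrapped in <emphasis> tags.
IP_IN_EMPHASIS = re.compile(r"<emphasis>([0-9]+(\.[0-9]+){3})</emphasis>")


def retag_ip(text: str) -> str:
    """Replace <emphasis>a.b.c.d</emphasis> with <inet>a.b.c.d</inet>."""
    # \1 is the text captured by the first (outer) pair of parentheses.
    return IP_IN_EMPHASIS.sub(r"<inet>\1</inet>", text)


# Hypothetical sample: only the emphasized IP address is retagged.
print(retag_ip("host <emphasis>192.168.0.1</emphasis> is <emphasis>down</emphasis>"))
```

Only the first emphasized item changes, because "down" does not fit the digits-and-dots pattern.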
Egrep
  egrep: extended grep
  grep: Global Regular Expression Print
  % egrep '^(From|Subject): ' mailbox-file
  Egrep metacharacters
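The egrep command above prints every line that the regex matches. A rough Python sketch of the same filtering follows; the mailbox contents and the name header_lines are invented for the example, and Python's re flavor is only a stand-in for egrep here.

```python
import re

# Same regex as the egrep command on the slide.
HEADER = re.compile(r"^(From|Subject): ")


def header_lines(lines):
    """Return the lines that egrep would print for this regex."""
    return [line for line in lines if HEADER.search(line)]


# Hypothetical mailbox contents.
mailbox = [
    "From: alice@example.com",
    "To: bob@example.com",
    "Subject: lunch",
    "Reply From: nobody",
]
print(header_lines(mailbox))
```

The last line is skipped because its "From: " is not at the beginning of the line.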
Start/End of Line
  ^cat matches a line with cat at the beginning.
  Good habit: interpret a regex in a rather literal way.
    ^cat matches if you have the beginning of a line, followed immediately by c, followed immediately by a, followed immediately by t.
  cat$ matches cat at the end of a line.
  ^$ matches an empty line.
  ^cat$ matches a line that consists of only cat.
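The four anchored patterns can be tried side by side. This is a small sketch using Python's re module in place of egrep; the helper name matching and the sample lines are assumptions made for the demonstration.

```python
import re


def matching(pattern, lines):
    """Return the lines in which the pattern matches somewhere."""
    return [line for line in lines if re.search(pattern, line)]


# Hypothetical input lines (no trailing newlines).
lines = ["cat", "catalog", "bobcat", "", "a cat here"]

starts_with_cat = matching(r"^cat", lines)   # cat at the start of the line
ends_with_cat = matching(r"cat$", lines)     # cat at the end of the line
empty_lines = matching(r"^$", lines)         # start immediately followed by end
only_cat = matching(r"^cat$", lines)         # nothing on the line but cat

print(starts_with_cat, ends_with_cat, empty_lines, only_cat)
```

Note that "a cat here" is matched by none of the four, since its cat is at neither end of the line.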
Character Class
  Matches any one of several characters.
  gr[ea]y
  sep[ae]r[ae]te
  Outside a class, literals imply "and then"; inside a class, they imply "or".
  <H[123456]> is the same as <H[1-6]>
  [0-9a-zA-Z]
  Mini language: the minus (dash) indicates a range only inside a class.
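A short sketch of the class examples, written with Python's re module as a stand-in for egrep. The word lists and the helper name matching are invented for illustration.

```python
import re


def matching(pattern, items):
    """Return the items in which the pattern matches somewhere."""
    return [item for item in items if re.search(pattern, item)]


# gr[ea]y: g, then r, then e OR a, then y.
greys = matching(r"gr[ea]y", ["grey", "gray", "groy", "greay"])

# sep[ae]r[ae]te accepts the correct spelling and common misspellings.
separates = matching(r"sep[ae]r[ae]te", ["separate", "seperate", "seporate"])

# A range inside a class is shorthand for listing every character.
tags = ["<H1>", "<H3>", "<H7>", "<HR>"]
listed = matching(r"<H[123456]>", tags)
ranged = matching(r"<H[1-6]>", tags)

print(greys, separates, listed, ranged)
```

The listed and ranged forms select exactly the same tags.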
Negated Character Class
  Matches any character not listed: ^ (caret) as the first character in the class
  % egrep 'q[^u]' word.list
  It means "match a character that's not listed", not "don't match what's listed".
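The difference between the two readings shows up when q is the last character of a word. The sketch below uses Python's re module in place of egrep, and the word list is made up for the demonstration.

```python
import re

# q followed by one character that is not u.
Q_NOT_U = re.compile(r"q[^u]")

# Hypothetical word list.
words = ["Iraq", "Qantas", "quit", "Iraqi", "qwerty"]
hits = [word for word in words if Q_NOT_U.search(word)]

print(hits)
```

"Iraq" is not reported: the negated class still has to match some character, and nothing follows the q. "Qantas" is not reported either, because the pattern asks for a lowercase q.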
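Match Any Character
  Dates to match: 03/19/76, 03-19-76, 03.19.76
  03.19.76: each dot is a metacharacter that matches any character.
  03[-./]19[-./]76: inside a class the dot is literal, and a leading dash is literal.
  03[.-/]19[.-/]76: here the dash forms a range from . to /, so a dash in the text no longer matches.
  03.19.76 also matches inside unrelated text such as: 19 203319 7639
  Know your data!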
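The three date patterns behave differently on the same input. Here is a hedged sketch using Python's re module as a stand-in for egrep; the helper name matching is an assumption, and the samples are the strings from the slide.

```python
import re


def matching(pattern, items):
    """Return the items in which the pattern matches somewhere."""
    return [item for item in items if re.search(pattern, item)]


samples = ["03/19/76", "03-19-76", "03.19.76", "19 203319 7639"]

# Dot as a metacharacter: any character may sit between the numbers.
with_dot = matching(r"03.19.76", samples)

# Class with a leading dash: only -, . or / may separate the numbers.
with_class = matching(r"03[-./]19[-./]76", samples)

# Dash in the middle: a range from "." to "/", which excludes "-".
with_range = matching(r"03[.-/]19[.-/]76", samples)

print(with_dot, with_class, with_range)
```

The plain-dot version is the shortest to write but also matches the string of digits, which is why the slide ends with "Know your data!".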
Alternation
  Matches any one of several subexpressions.
  grey|gray
  (Geoff|Jeff)(rey|ery)
  Alternation is constrained by parentheses.
    '^(From|Subject|Date): ' versus '^From|Subject|Date: '
  Alternation: each alternative can be a full-fledged regex.
  Character class: matches a single character.
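The effect of the parentheses is easiest to see by running both versions over the same lines. This is a sketch with Python's re module standing in for egrep; the sample lines and the names constrained and unconstrained are invented.

```python
import re


def matching(pattern, lines):
    """Return the lines in which the pattern matches somewhere."""
    return [line for line in lines if re.search(pattern, line)]


# Hypothetical input lines.
lines = [
    "From: a",
    "Subject: b",
    "Date: c",
    "My Subject is x",
    "Due Date: y",
    "From here",
    "To: z",
]

# With parentheses: ^ and ": " apply to all three alternatives.
constrained = matching(r"^(From|Subject|Date): ", lines)

# Without parentheses the alternatives are "^From", "Subject" and "Date: ".
unconstrained = matching(r"^From|Subject|Date: ", lines)

print(constrained)
print(unconstrained)
```

Without the parentheses, ^ belongs only to From and ": " only to Date, so lines that merely contain Subject are reported as well.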
Word Boundary
  \< and \> anchor the regular expression to a position: the start and the end of a word.
  They don't actually consume any characters during a match.
  \<cat\> means: match if we can find a start-of-word position, followed immediately by c, a, t, followed immediately by an end-of-word position.
  In short: find the word cat.
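Python's re module does not support \< and \>; it uses \b for both the start and the end of a word. With that flavor difference in mind, here is a sketch of "find the word cat" (the sample lines are made up).

```python
import re

# \b is Python's word boundary; it plays the role of both \< and \> in egrep.
WORD_CAT = re.compile(r"\bcat\b")

# Hypothetical input lines.
lines = ["the cat sat", "concatenate", "cat", "cats", "bobcat"]
hits = [line for line in lines if WORD_CAT.search(line)]

# The anchors match positions, not characters: the match is just "cat".
found = WORD_CAT.search("the cat sat")
print(hits, found.group(), found.span())
```

The matched text is three characters long, which shows that the boundary anchors themselves consume nothing.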
Optional Items
  colou?r
  Question mark: makes an item optional.
    Applies only to the immediately preceding item.
    The optional item is always successful, whether or not it is present.
  Example, each version shorter than the last:
    (July|Jul) (fourth|4th|4)
    July? (fourth|4th|4)
    July? (fourth|4(th)?)
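The three date patterns on this slide should accept the same strings. The sketch below checks that with Python's re module (used here in place of egrep); fullmatch is used so that the whole string has to fit, and the sample dates and the name accepted are assumptions.

```python
import re

PATTERNS = [
    r"(July|Jul) (fourth|4th|4)",
    r"July? (fourth|4th|4)",
    r"July? (fourth|4(th)?)",
]

# Hypothetical sample dates; the last two should be rejected.
dates = ["July fourth", "Jul 4th", "July 4", "Jul fourth", "Ju 4", "July 4t"]


def accepted(pattern):
    """Return the sample dates that the pattern matches in full."""
    return [date for date in dates if re.fullmatch(pattern, date)]


results = [accepted(pattern) for pattern in PATTERNS]
colours = [word for word in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", word)]

print(results[0], colours)
```

All three patterns give the same list, so the shorter forms lose nothing.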
Other Quantifiers: Repetition
  + (plus): one or more of the immediately preceding item
  * (star): any number, including none, of the item
  Quantifier examples:
    <H3 *>
    <HR +SIZE *= *14 *>
    <HR +SIZE *= *[0-9]+ *>
  Space matters
    <HR( +SIZE *= *[0-9]+)? *>
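A sketch of the last pattern on this slide, with Python's re module standing in for egrep. The sample tags are invented; fullmatch is used so that each tag must fit the pattern from start to end.

```python
import re

# <HR>, optionally with a SIZE attribute; " +" needs at least one space,
# " *" allows any number of spaces, including none.
HR_TAG = re.compile(r"<HR( +SIZE *= *[0-9]+)? *>")

# Hypothetical sample tags.
tags = ["<HR>", "<HR >", "<HR SIZE=14>", "<HR  SIZE = 3  >", "<HRSIZE=14>", "<HR SIZE=>"]
valid = [tag for tag in tags if HR_TAG.fullmatch(tag)]

print(valid)
```

"<HRSIZE=14>" fails because " +" demands a space before SIZE, and "<HR SIZE=>" fails because [0-9]+ demands at least one digit.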
Other Quantifiers: Repetition
  Intervals: a defined range of matches
  ...{min,max}
  Examples: ...{3,12}, [a-zA-Z]{1,5}
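A brief sketch of the interval quantifier in Python's re module, used here as a stand-in for egrep. The sample strings are made up.

```python
import re

# One to five letters.
SHORT_WORD = re.compile(r"[a-zA-Z]{1,5}")

# fullmatch: the whole string must consist of one to five letters.
samples = ["a", "hello", "", "python"]
short = [text for text in samples if SHORT_WORD.fullmatch(text)]

# search: the interval stops after max repetitions, even if more letters follow.
prefix = SHORT_WORD.search("python").group()

print(short, prefix)
```

With search, the six-letter word still produces a match, but the match is cut off at the maximum of five letters.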
Parentheses and Backreferences
  Parentheses can “remember” the text matched by the subexpression they enclose.
  Doubled-word problem: \<([a-zA-Z]+) +\1\>
  \1: a metasequence that matches the remembered text
  Groups are numbered by their opening parentheses: ([a-z])([0-9])\1\2
  % egrep -in '\<([a-z]+) +\1\>' files
  Egrep considers each line in isolation, so a word doubled across a line break is not found.
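The egrep command above can be approximated in Python as follows. This is a sketch under a few assumptions: \b replaces \< and \> (Python has no \< or \>), re.IGNORECASE plays the role of -i, the line numbers imitate -n, and the function name doubled_words and the sample text are invented.

```python
import re

# A word, one or more spaces, then the same word again (ignoring case).
DOUBLED = re.compile(r"\b([a-z]+) +\1\b", re.IGNORECASE)


def doubled_words(lines):
    """Return (line number, line) pairs for lines containing a doubled word."""
    return [(number, line) for number, line in enumerate(lines, start=1)
            if DOUBLED.search(line)]


# Hypothetical input; lines 5 and 6 double "the" across a line break.
text = [
    "this is the the end",
    "Paris in the spring",
    "The the cat",
    "the theory",
    "at the end of the",
    "the next line",
]
print(doubled_words(text))

# Groups are numbered by opening parenthesis: \1 is the letter, \2 the digit.
numbered = re.compile(r"([a-z])([0-9])\1\2")
print(bool(numbered.search("a1a1")), bool(numbered.search("a1a2")))
```

Like egrep, this checks each line in isolation, so the "the" that ends line 5 and begins line 6 goes unreported; the Perl script at the start of the deck handles that case by reading larger chunks.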
Escape
  Match a metacharacter literally by putting a backslash before it.
  \. is called escaped period or escaped dot.
  Escaping this way does not apply inside a character class.
  \([a-zA-Z]+\) matches a word within parentheses.
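A small sketch of escaping, using Python's re module in place of egrep. The sample strings are made up. One flavor difference to keep in mind: Python does treat a backslash as an escape inside a character class, so the class example below uses a bare dot, which is literal inside a class in both flavors.

```python
import re

# Unescaped dot: any character. Escaped dot: only a literal period.
any_char = bool(re.search(r"a.c", "abc"))
literal_miss = bool(re.search(r"a\.c", "abc"))
literal_hit = bool(re.search(r"a\.c", "a.c"))

# Inside a character class the dot needs no escape to be literal.
class_dot = [text for text in ["a.c", "abc"] if re.search(r"a[.]c", text)]

# Escaped parentheses match real parentheses instead of grouping.
in_parens = re.search(r"\([a-zA-Z]+\)", "a (word) here").group()

print(any_char, literal_miss, literal_hit, class_dot, in_parens)
```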
More
  Flavors
    Different tools: egrep, Perl, Java, awk...
    Different versions
  Goal of a regular expression
    Line
    Character sequence
  Terminology:
    Regex
    Matching
    Metacharacter/metasequence
Even More
  Terminology:
    Subexpression
    Character
  Understanding how the regex engine really works is the key to really understanding regular expressions.
At Last
  Not all egrep programs are the same.
  Three reasons for using parentheses:
    Constraining alternation: (fourth|4(th)?)
    Grouping
    Capturing: \<([a-zA-Z]+) +\1\>
  Character classes are special: they have a totally distinct set of metacharacters.
More
  Alternation | and character classes [] are fundamentally different.
  Negated character class: still a positive assertion; it must match a character.
  Three types of escaped items:
    \ and a metacharacter ==> matches the literal character
    \ and selected non-metacharacters ==> metasequence
    \ and others: backslash ignored
  Question mark and star: don't need to actually match any character to “match successfully”.
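The last point can be observed directly: an item governed by ? or * succeeds even when it matches nothing. A minimal sketch with Python's re module (standing in for egrep) follows; the sample string is made up.

```python
import re

text = "abc"

# x* and x? succeed at the very start of the string without consuming anything.
star = re.search(r"x*", text)
question = re.search(r"x?", text)

# x+ needs at least one x, so it fails on the same text.
plus = re.search(r"x+", text)

print(star.span(), repr(question.group()), plus)
```

Both successful matches are empty: they start and end at position 0.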
THE END