Regular Expression Beihang Open Source Club.


1 Regular Expression Beihang Open Source Club

2 A Practical Problem
Doubled words
Report and highlight lines with doubled words.
A word at the end of one line may be repeated at the beginning of the next.
The two words may differ in capitalization.
The two words may be separated by HTML tags: '...it's <B>very</B> very important...'.

3
    #!/usr/bin/perl
    $/ = ".\n";    # Read input in chunks that end with a period and newline.
    while (<>) {
        # Highlight doubled words, even across whitespace or HTML tags; skip chunks with none.
        next if !s/\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)/\e[7m$1\e[m$2\e[7m$3\e[m/ig;
        s/^(?:[^\e]*\n)+//mg;    # Remove any unmarked lines.
        s/^/$ARGV: /mg;          # Ensure lines begin with filename.
        print;
    }

4
Regular expressions are the key to powerful, flexible, and efficient text processing.

5 Regular Expression as a Language
In the shell: *.txt
More powerful, more general ==> a generalized pattern language
Regular expression: a complete language
Two types of characters:
  Metacharacter => grammar
  Literal => word
Filename pattern: a limited set of metacharacters
  s!<emphasis>([0-9]+(\.[0-9]+){3})</emphasis>!<inet>$1</inet>!
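
A minimal sketch of the slide's substitution, with its parentheses balanced as ([0-9]+(\.[0-9]+){3}). It uses sed in place of Perl, an assumption made so the example runs in a plain shell; sed writes the captured text as \1 where Perl writes $1. The input line and its address are hypothetical.

```shell
# Hypothetical input line; 192.0.2.1 is a documentation-only address.
line='<emphasis>192.0.2.1</emphasis>'

# ! is the delimiter, as on the slide, so the / in </emphasis> needs no escaping.
# Group 1 captures the whole dotted address: digits, then three ".digits" parts.
converted=$(printf '%s\n' "$line" |
  sed -E 's!<emphasis>([0-9]+(\.[0-9]+){3})</emphasis>!<inet>\1</inet>!')

printf '%s\n' "$converted"
```

Only the tags change; the address itself is carried over by the back-reference.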

6 Egrep
Extended grep
Grep: Global Regular Expression Print
  % egrep '^(From|Subject): ' mailbox-file
Egrep metacharacters

7 Start/End Of Line
^cat matches a line with cat at the beginning.
Good habit: interpret a regex in a rather literal way.
^cat matches if you have the beginning of a line, followed immediately by c, followed immediately by a, followed immediately by t.
  cat$
  ^$
  ^cat$
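
The four anchored patterns above can be tried with a short sketch. The sample lines are hypothetical, and `grep -E` stands in for `egrep`.

```shell
# Hypothetical sample input; the fourth line is empty.
sample='cat
catalog
bobcat

a cat nap'

# ^cat : the line begins with c-a-t
starts=$(printf '%s\n' "$sample" | grep -E '^cat')
# cat$ : the line ends with c-a-t
ends=$(printf '%s\n' "$sample" | grep -E 'cat$')
# ^cat$ : the line is exactly c-a-t
only=$(printf '%s\n' "$sample" | grep -E '^cat$')
# ^$ : the line is empty; -c prints the number of matching lines
empty_count=$(printf '%s\n' "$sample" | grep -c -E '^$')

printf 'begins with cat:\n%s\n' "$starts"
printf 'ends with cat:\n%s\n' "$ends"
printf 'exactly cat:\n%s\n' "$only"
printf 'empty lines: %s\n' "$empty_count"
```

Note that "a cat nap" matches none of them: it contains cat, but not at either end of the line.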

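8 Character Class
Any one of several characters.
  gr[ea]y
  sep[ae]r[ae]te
"And then" versus "or": outside a class, characters follow one another; inside a class, any one of them may match.
  <H[123456]>
  <H[1-6]>
  [0-9a-zA-Z]
A class is its own mini language: the minus sign (dash) indicates a range.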
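
A small sketch of the class patterns above, run over a hypothetical word list. `grep -E` stands in for `egrep`, and `LC_ALL=C` is set on the range pattern as an assumption so that 1-6 is read in plain ASCII order.

```shell
# Hypothetical sample input.
words='grey
gray
griy
separate
seperate
<H3>
<H7>'

# g, r, then either e or a, then y
greys=$(printf '%s\n' "$words" | grep -E 'gr[ea]y')
# Accepts the correct spelling and a common misspelling
seps=$(printf '%s\n' "$words" | grep -E 'sep[ae]r[ae]te')
# The dash inside the class makes a range: any digit from 1 to 6
heads=$(printf '%s\n' "$words" | LC_ALL=C grep -E '<H[1-6]>')

printf '%s\n' "$greys" "$seps" "$heads"
```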

9 Negated Character Class
Any character not listed: ^ (caret)
  % egrep 'q[^u]' word.list
It means "match a character that's not listed", not "don't match what's listed".
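
The difference shows up on a word that ends in q. A sketch with a hypothetical four-word list (`grep -E` stands in for `egrep`):

```shell
# Hypothetical word list.
words='Iraqi
Iraq
quit
Qantas'

# q followed by some character that is not u
hits=$(printf '%s\n' "$words" | grep -E 'q[^u]')

printf '%s\n' "$hits"
```

Only Iraqi is printed. Iraq has no u after its q, but it also has no character at all after the q, so the class has nothing to match. Qantas fails because its Q is uppercase.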

10 Match Any Character
The dot (.) matches any one character.
  03/19/76, 03-19-76, 03.19.76
  03[-./]19[-./]76
  03[.-/]19[.-/]76 (here the dash forms a range from . to /, so a literal dash no longer matches)
Know your data!
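
A sketch comparing a bare dot, the class [-./], and the class [.-/] on hypothetical lines. `grep -E` stands in for `egrep`; `LC_ALL=C` is an assumption so that the range . to / is read in ASCII order.

```shell
# Hypothetical sample input; the last line is not a date at all.
dates='03/19/76
03-19-76
03.19.76
lottery numbers: 19 203319 7639'

# Dot matches any character, so the lottery line matches too ("03319 76").
dot_count=$(printf '%s\n' "$dates" | grep -c -E '03.19.76')
# Dash first in the class is a literal dash: only the three dates match.
class_count=$(printf '%s\n' "$dates" | LC_ALL=C grep -c -E '03[-./]19[-./]76')
# Dash between . and / is a range covering only . and /: 03-19-76 is missed.
range_count=$(printf '%s\n' "$dates" | LC_ALL=C grep -c -E '03[.-/]19[.-/]76')

printf 'dot: %s  class: %s  range: %s\n' "$dot_count" "$class_count" "$range_count"
```

Whether the looser dot pattern is acceptable depends on what else the data can contain.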

11 Alternation
Any one of several subexpressions
  grey|gray
  (Geoff|Jeff)(rey|ery)
Alternation is constrained by parentheses.
  '^(From|Subject|Date): '
  '^From|Subject|Date: '
Alternation: each alternative can be a full-fledged regex
Character class: matches a single character
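
The two header patterns above differ only in their parentheses. A sketch over a hypothetical mailbox fragment shows the effect (`grep -E` stands in for `egrep`):

```shell
# Hypothetical mailbox lines.
mail='From: elvis@example.com
Subject: be seen around town
Date: Thu, 31 Oct 2002
X-Original-Date: Wed, 30 Oct 2002
Regarding the Subject of the talk'

# Grouped: ^ and ": " apply to every alternative.
grouped=$(printf '%s\n' "$mail" | grep -c -E '^(From|Subject|Date): ')
# Ungrouped: three separate alternatives, "^From", "Subject", and "Date: ".
ungrouped=$(printf '%s\n' "$mail" | grep -c -E '^From|Subject|Date: ')

printf 'grouped: %s  ungrouped: %s\n' "$grouped" "$ungrouped"
```

The grouped pattern finds the three real header lines; the ungrouped one also accepts the last two lines, because Subject and "Date: " may then appear anywhere.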

12 Word Boundary
Anchors a position in a regular expression.
Doesn't actually consume any characters during a match.
\<cat\> means: match if we can find a start-of-word position, followed immediately by c-a-t, followed immediately by an end-of-word position.
In short: find the word cat.
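
A sketch of \<cat\> on hypothetical lines. It assumes a grep that supports \< and \> (GNU grep does, as an extension); `grep -E` stands in for `egrep`.

```shell
# Hypothetical sample input.
lines='the cat sat
concatenate
catalog
bobcat
cat'

# Plain cat matches anywhere inside a word.
anywhere=$(printf '%s\n' "$lines" | grep -c -E 'cat')
# \<cat\> requires a word start before c and a word end after t.
whole=$(printf '%s\n' "$lines" | grep -E '\<cat\>')

printf 'lines containing cat: %s\n' "$anywhere"
printf 'lines with the word cat:\n%s\n' "$whole"
```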

13 Optional Items
  colou?r
Question mark: the preceding item is optional
Applies only to the immediately preceding item
Always successful: an optional item may match nothing
Example:
  (July|Jul) (fourth|4th|4)
  July? (fourth|4th|4)
  July? (fourth|4(th)?)
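
A sketch of the optional-item patterns on hypothetical input. The ^ and $ anchors are added here only so that each whole line must match; `grep -E` stands in for `egrep`.

```shell
# Hypothetical sample input.
spellings='color
colour
colouur'
dates='July fourth
Jul 4th
July 4
Jul fourth
June 4'

# u? makes only the u optional: zero or one u, never two.
spelling_count=$(printf '%s\n' "$spellings" | grep -c -E '^colou?r$')
# y? applies to the y alone; (th)? applies to the whole group th.
date_count=$(printf '%s\n' "$dates" | grep -c -E '^July? (fourth|4(th)?)$')

printf 'spellings: %s  dates: %s\n' "$spelling_count" "$date_count"
```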

14 Other Quantifiers: Repetition
+ (plus): one or more of the immediately preceding item
* (star): any number, including none, of the item
Quantifier examples:
  <H3 *>
  <HR +SIZE *= *14 *>
  <HR +SIZE *= *[0-9]+ *>
Space matters
  <HR( +SIZE *= *[0-9]+)? *>
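
A sketch of the two <HR> patterns on hypothetical tags, showing where a space is required (+) and where it is merely allowed (*). `grep -E` stands in for `egrep`.

```shell
# Hypothetical sample input.
tags='<HR SIZE=14>
<HR   SIZE = 7 >
<HRSIZE=14>
<HR SIZE=>
<HR>
<HR >'

# At least one space before SIZE, optional spaces around =, at least one digit.
with_size=$(printf '%s\n' "$tags" | grep -c -E '<HR +SIZE *= *[0-9]+ *>')
# The whole SIZE attribute is optional, so a bare <HR> is accepted as well.
size_optional=$(printf '%s\n' "$tags" | grep -c -E '<HR( +SIZE *= *[0-9]+)? *>')

printf 'with SIZE: %s  SIZE optional: %s\n' "$with_size" "$size_optional"
```

The third and fourth tags fail both patterns: one lacks the required space, the other lacks the required digit.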

15 Other Quantifiers: Repetition
Intervals: a defined range of matches
  ...{min,max}
Example:
  ...{3,12}
  [a-zA-Z]{1,5}
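
A sketch of [a-zA-Z]{1,5} on hypothetical input, with and without line anchors. The anchors are an addition for illustration; `LC_ALL=C` is an assumption so the letter ranges are read in ASCII order.

```shell
# Hypothetical sample input.
tokens='a
hello
worlds
ab3'

# Anchored: the entire line must be one to five letters.
anchored=$(printf '%s\n' "$tokens" | LC_ALL=C grep -E '^[a-zA-Z]{1,5}$')
# Unanchored: any run of one to five letters anywhere in the line is enough.
unanchored_count=$(printf '%s\n' "$tokens" | LC_ALL=C grep -c -E '[a-zA-Z]{1,5}')

printf 'anchored:\n%s\n' "$anchored"
printf 'unanchored count: %s\n' "$unanchored_count"
```

The interval limits how much the item may match, not how long the line may be, which is why the unanchored form accepts every line here.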

16 Parentheses and Backreferences
Parentheses can “remember” text matched by the subexpression they enclose
Doubled-word problem: \<([a-zA-Z]+) +\1\>
\1: metasequence
Numbered by opening parentheses: ([a-z])([0-9])\1\2
  % egrep -in '\<([a-z]+) +\1\>' files
Egrep considers each line in isolation
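
A sketch of the doubled-word search on hypothetical text. It assumes a grep that supports \<, \>, and back-references under -E (GNU grep does; they are extensions). `grep -E` stands in for `egrep`.

```shell
# Hypothetical sample text; "the" is doubled across lines 3 and 4.
text='this is is a test
the theory holds
Paris in the
the spring'

# -i ignores case, -n prefixes each hit with its line number.
# \1 must repeat whatever group 1 matched; \> stops "the theory" from counting.
doubled=$(printf '%s\n' "$text" | grep -E -i -n '\<([a-z]+) +\1\>')

printf '%s\n' "$doubled"
```

Only line 1 is reported. The double that spans lines 3 and 4 is missed because each line is examined in isolation.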

17 Escape
Match a metacharacter literally
  \. : escaped period / escaped dot
Except inside a character class
  \([a-zA-Z]+\) : matches a word within parentheses
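
A sketch of escaped versus unescaped metacharacters on hypothetical lines. `grep -E` stands in for `egrep`; `LC_ALL=C` is an assumption so the letter ranges are read in ASCII order.

```shell
# Hypothetical sample input.
lines='ega.att.com
megawatt computing
a (very) good idea
a very good idea'

# Unescaped dots match any character, so "megawatt computing" matches too.
loose=$(printf '%s\n' "$lines" | grep -c -E 'ega.att.com')
# Escaped dots match only a literal period.
strict=$(printf '%s\n' "$lines" | grep -c -E 'ega\.att\.com')
# Inside a class the dot is already literal and needs no backslash.
in_class=$(printf '%s\n' "$lines" | grep -c -E 'ega[.]att[.]com')
# Escaped parentheses match literal ( and ) instead of grouping.
parens=$(printf '%s\n' "$lines" | LC_ALL=C grep -E '\([a-zA-Z]+\)')

printf 'loose: %s  strict: %s  in class: %s\n' "$loose" "$strict" "$in_class"
printf '%s\n' "$parens"
```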

18 More Flavors
Different tools: egrep, Perl, Java, awk...
Different versions
Goal of a regular expression
  Line
  Character sequence
Terminology:
  Regex
  Matching
  Metacharacter/metasequence

19 Even More
Terminology:
  Subexpression
  Character
Understanding how the regex engine really works is the key to really understanding regular expressions.

20 At Last
Not all egrep programs are the same.
Three reasons for using parentheses:
  Constraining alternation
    (fourth|4(th)?)
  Grouping
  Capturing
    \<([a-zA-Z]+) +\1\>
Character classes are special: they have a totally distinct set of metacharacters.

21 More
Alternation | and character classes [] are fundamentally different.
Negated character class: a positive assertion (it must still match a character)
Three types of escaped items:
  \ and a metacharacter
  \ and selected non-metacharacters ==> metasequence
  \ and others: backslash ignored
Question mark and star: don't need to actually match any character to “match successfully”

22 THE END

