Download presentation
Presentation is loading. Please wait.
Published byPercival Leonard Modified over 9 years ago
1
Introduction to Regular Expressions Ben Brumfield THATCamp Texas 2011
2
What are Regular Expressions? Very small language for describing text. Not a programming language. Incredibly powerful tool for search/replace operations. Arcane art. Ubiquitous.
3
Why Use Regular Expressions? Finding every instance of a string in a file – i.e. every mention of “chickens” in a farm diary How many times does “sing” appear in a text in all tenses and conjugations? Reformatting dirty data Validating input. Command line work – listing files, grepping log files
4
The Basics A regex is a pattern enclosed within delimiters. Most characters match themselves. /THATCamp/ is a regular expression that matches “THATCamp”. –Slash is the delimiter enclosing the expression. –“THATCamp” is the pattern.
5
/at/ Matches strings with “a” followed by “t”. athat thatatlas aftAthens
6
/at/ Matches strings with “a” followed by “t”. athat thatatlas aftAthens
7
Some Theory Finite State Machine for the regex /at/
8
Characters Matching is case sensitive. Special characters: ( ) ^ $ { } [ ] \ |. + ? * To match a special character in your text, precede it with \ in your pattern: –/ironic [sic]/ does not match “ironic [sic]” –/ironic \[sic\]/ matches “ironic [sic]” Regular expressions can support Unicode.
9
Character Classes Characters within [ ] are choices for a single-character match. Think of a set operation, or a type of or. Order within the set is unimportant. /x[01]/ matches “x0” and “x1”. /[10][23]/ matches “02”, “03”, “12” and “13”. Initial^ negates the class: –/[^45]/ matches all characters except 4 or 5.
10
/[ch]at/ Matches strings with “c” or “h”, followed by “a”, followed by “t”. thatat chatcat fatphat
11
/[ch]at/ Matches strings with “c” or “h”, followed by “a”, followed by “t”. thatat chatcat fatphat
12
Ranges Ranges define sets of characters within a class. –/[1-9]/ matches any non-zero digit. –/[a-zA-Z]/ matches any letter. –/[12][0-9]/ matches numbers between 10 and 29.
13
Shortcuts ShortcutNameEquivalent Class \ddigit[0-9] \Dnot digit[^0-9] \wword[a-zA-Z0-9_] \Wnot word[^a-zA-Z0-9_] \sspace[\t\n\r\f\v ] \Snot space[^\t\n\r\f\v ].everything[^\n] (depends on mode)
14
/\d\d\d[- ]\d\d\d\d/ Matches strings with: –Three digits –Space or dash –Four digits 501-1234234 1252 652.2648713-342-7452 PE6-5000653-6464x256
15
/\d\d\d[- ]\d\d\d\d/ Matches strings with: –Three digits –Space or dash –Four digits 501-1234234 1252 652.2648713-342-7452 PE6-5000653-6464x256
16
Repeaters Symbols indicating that the preceding element of the pattern can repeat. /runs?/ matches runs or run /1\d*/ matches any number beginning with “1”. RepeaterCount ?zero or one +one or more *zero or more {n}{n}exactly n {n,m}{n,m}between n and m times {,m}no more than m times {n,}at least n times
17
Repeaters Strings: 1: “at”2: “art” 3: “arrrrt”4: “aft” Patterns: A: /ar?t/B: /a[fr]?t/ C: /ar*t/ D: /ar+t/ E: /a.*t/F: /a.+t/ RepeaterCount ?zero or one +one or more *zero or more {n}{n}exactly n {n,m}{n,m}between n and m times {,m}no more than m times {n,}at least n times
18
Repeaters /ar?t/ matches “at” and “art” but not “arrrt”. /a[fr]?t/ matches “at”, “art”, and “aft”. /ar*t/ matches “at”, “art”, and “arrrrt” /ar+t/ matches “art” and “arrrt” but not “at”. /a.*t/ matches anything with an ‘a’ eventually followed by a ‘t’.
19
Lab Session I http://gskinner.com/RegExr/ https://gist.github.com/922838 Match the titles “Mr.” and “Ms.”. Find all conjugations and tenses of “sing”. Find all places where more than one space follows punctuation.
20
Lab Reference RepeaterCount ?zero or one +one or more *zero or more {n}{n}exactly n {n,m}{n,m}between n and m times {,m}no more than m times {n,}at least n times ShortcutName \ddigit \Dnot digit \wword \Wnot word \sspace \Snot space.everything
21
Anchors Anchors match between characters. Used to assert that the characters you’re matching must appear in a certain place. /\bat\b/ matches “at work” but not “batch”. AnchorMatches ^start of line $end of line \bword boundary \Bnot boundary \Astart of string \Zend of string \zraw end of string (rare)
22
Alternation In Regex, | means “or”. You can put a full expression on the left and another full expression on the right. Either can match. /seeks?|sought/ matches “seek”, “seeks”, or “sought”.
23
Grouping Everything within ( … ) is grouped into a single element for the purposes of repetition and alternation. The expression /(la)+/ matches “la”, “lala”, “lalalala” but not “all”. /schema(ta)?/ matches “schema” and “schemata” but not “schematic”.
24
Grouping Example What regular expression matches “eat”, “eats”, “ate” and “eaten”?
25
Grouping Example What regular expression matches “eat”, “eats”, “ate” and “eaten”? /eat(s|en)?|ate/ Add word boundary anchors to exclude “sate” and “eating”: /\b(eat(s|en)?|ate)\b/
26
Replacement Regex most often used for search/replace Syntax varies; most scripting languages and CLI tools use s/pattern/replacement/. s/dog/hound/ converts “slobbery dogs” to “slobbery hounds”. s/\bsheeps\b/sheep/ converts –“sheepskin is made from sheeps” to –“sheepskin is made from sheep”
27
Capture During searches, ( … ) groups capture patterns for use in replacement. Special variables $1, $2, $3 etc. contain the capture. /(\d\d\d)-(\d\d\d\d)/“123-4567” –$1 contains “123” –$2 contains “4567”
28
Capture How do you convert –“Smith, James” and “Jones, Sally” to –“James Smith” and “Sally Jones”?
29
Capture How do you convert –“Smith, James” and “Jones, Sally” to –“James Smith” and “Sally Jones”? s/(\w+), (\w+)/$2 $1/
30
Capture Given a file containing URLs, create a script that wgets each URL: –http://bit.ly/DHapiTRANSCRIBEhttp://bit.ly/DHapiTRANSCRIBE becomes: –wget “http://bit.ly/DHapiTRANSCRIBE”
31
Capture Given a file containing URLs, create a script that wgets each URL: –http://bit.ly/DHapiTRANSCRIBEhttp://bit.ly/DHapiTRANSCRIBE becomes –wget “http://bit.ly/DHapiTRANSCRIBE”http://bit.ly/DHapiTRANSCRIBE s/^(.*)$/wget “$1”/
32
Lab Session II Convert all Miss and Mrs. to Ms. Convert infinitives to gerunds –“to sing” -> “singing” Extract last name, first name from (title first name last name) –Dr. Thelma Dunn –Mr. Clay Shirky –Dana Gray
33
Caveats Do not use regular expressions to parse (complicated) XML! Check the language/application-specific documentation: some common shortcuts are not universal.
34
Acknowledgments James Edward Gray II and Dana Gray –Much of the structure and some of the wording of this presentation comes from –http://www.slideshare.net/JamesEdwardGrayII /regular-expressions-7337223http://www.slideshare.net/JamesEdwardGrayII /regular-expressions-7337223
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.