Regular Expressions Used for pattern matching against strings

A regex describes a pattern
  Use literal characters, such as Frank
  Use metacharacters to describe variability, such as [Ff]
  Some metacharacters overlap with Bash wildcards, so we have to be careful in our usage
Regexes are used in software such as
  Spam filters
  Natural language understanding programs
Available in numerous programming languages
In Linux, we often use them with programs like grep/egrep, sed and awk

Regex: Example
Assume we want to match a string like Viagra as found in spam
  We could enumerate all of the possible variations
  Clever spammers find ways to express Viagra that are not based solely on the 6 letters
  A spam filter that only enumerates upper and lower case spellings may not catch these, yet a human would still understand them
Here are several variants of Viagra that a regex could potentially catch:
  v.i.a.g.r.a, v1agra, vi_ag_ra, ViAgRa, Viagr
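As a rough sketch of where this is heading, the pattern below uses metacharacters that the following slides explain. It is only one possible pattern, not one taken from the slides, and it assumes grep -E (the same as egrep). The test strings are four of the variants listed above plus one made-up non-match.

```shell
# One possible pattern (an assumption, not from the slides):
# each letter may be followed by at most one arbitrary character,
# and the i may be written as the digit 1.
pattern='v.?[i1].?a.?g.?r.?a'

# -i ignores case, -c counts the matching lines.
hits=$(printf '%s\n' v.i.a.g.r.a v1agra vi_ag_ra ViAgRa vinegar |
       grep -ciE "$pattern")

echo "$hits of 5 test strings matched"
```

A pattern this loose can also catch innocent text, so a real filter would likely need more context than a single regex.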

Metacharacters
Metacharacter   Explanation
*               Match the preceding character if it appears 0 or more times
+               Match the preceding character if it appears 1 or more times
?               Match the preceding character if it appears 0 or 1 times
.               Match any one character
^               Match if this expression begins a string
$               Match if this expression ends a string
[chars]         Match if the next character in the string is any one of the chars in [ ]
[chari-charj]   Match if the next character in the string is any character in the range from chari to charj
[[:class:]]     Match if the next character in the string is a character that is part of the :class: specified. The :class: is a category like upper case letters (:upper:), digits (:digit:) or punctuation marks (:punct:)
[^chars]        Match if the next character in the string is not one of the characters listed in [ ]
\               The next character should be interpreted literally; used to escape the meaning of a metacharacter
{n}             Match if the string contains n consecutive occurrences of the preceding character
{n,m}           Match if the string contains between n and m consecutive occurrences of the preceding character
{n,}            Match if the string contains at least n consecutive occurrences of the preceding character
|               Match any of these strings (an "OR")
(…)             The items in … are treated as a group; match the entire sequence

Controlling Repetition with *, +, ?
These control the number of occurrences of the preceding character that can match
  *    0 or more
  +    1 or more
  ?    0 or 1
0*1 matches any sequence of 0 or more 0s followed by a 1
  000001, 0001, 01, 1 (no 0s will still match)
0+1 matches any sequence of 1 or more 0s followed by a 1
  000001, 0001, 01, but not 1 by itself
0?1 matches zero or one 0 followed by a 1
  01 and 1 are the only matches
Using * or +, your regex can match an infinite number of strings, whereas with ? it can match only a finite number of strings
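The three operators can be compared side by side with a small sketch. It assumes grep -E (equivalent to egrep) and uses the -x option so that a line counts only when the whole line matches; the sample strings are made up.

```shell
# Sample strings, one per line (made up for illustration).
samples=$(printf '%s\n' 1 01 0001 001x)

# -x requires the whole line to match; -c counts matching lines.
star=$(printf '%s\n' "$samples" | grep -cxE '0*1')   # 1, 01 and 0001
plus=$(printf '%s\n' "$samples" | grep -cxE '0+1')   # 01 and 0001
opt=$(printf '%s\n' "$samples" | grep -cxE '0?1')    # 1 and 01

echo "0*1: $star   0+1: $plus   0?1: $opt"
```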

The . Metacharacter
The period matches any single character
b.t matches a b followed by any one character followed by a t
  such as bat, bet, bit, bot, but, byt, b2t, b t (b, space, t) and bbt
  it does not match bt or beat
The *, + and ? metacharacters can modify the .
  b.*t   a b followed by 0 or more of any characters followed by a t (bat, bet, bit, bot, bt, beat, bbbbbbt)
  b.+t   same as b.*t except that it won't match bt
  b.?t   same as b.t except that it will also match bt

Where Does the Regex Match?
Consider the regex b?.t+ and the string bbbbbbbattttttt
Will it match?
  Read literally, it shouldn't: the whole string is not 0 or 1 b, one character, and 1 or more ts
  But the string contains a substring that matches
The regex attempts to match any substring of the string
  In this case, it matches the substring that starts at the last b: that b, the a, and the ts that follow

Controlling Where to Match
Two additional metacharacters force the regex to match at a given position in the string
  ^regex    match only starting at the beginning of the string
  regex$    match only at the end of the string
  ^regex$   match the entire string
Consider ^b?.t+ matched against the string from the previous slide
  It does not match because the ^ forces the regex to match at the beginning of the string
  The b? matches the first b and the . matches the second b, but the next character needs to be a t for the regex to match and it is another b
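The difference between the unanchored and the anchored regex can be checked with a short sketch. It assumes grep -E and uses two standard grep options that the slides have not introduced: -o, which prints only the matched part of a line, and -q, which prints nothing and only sets the exit status.

```shell
s=bbbbbbbattttttt

# Unanchored: grep may match any substring; -o shows which one it found.
found=$(printf '%s\n' "$s" | grep -oE 'b?.t+')

# Anchored with ^: the match must begin at the first character of the line.
if printf '%s\n' "$s" | grep -qE '^b?.t+'; then anchored=yes; else anchored=no; fi

echo "unanchored match: $found"
echo "anchored match:   $anchored"
```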

Examples
Note that the * includes 0 matches
^0*1*$ will match an infinite number of potential strings
  any string that consists of some (0 or more) 0s followed by some (0 or more) 1s
  this includes the empty string, the string that consists of nothing
  the regex ^0+1*$ does not match the empty string, nor does ^0*1+$ or ^0+1+$
^0?1?$ will match exactly 4 strings: the empty string, 0, 1 and 01

Matching from a List of Options
The [ ] are used to enumerate a list of options
  Any character from the list can match one character from the string
  This allows us to express variable matching of a single character
We can list three types of things in [ ]
  [list] as in [abcde]
  [range] as in [a-e] or [0-9]
  [[:class:]] where the class is a legal POSIX class, as in [[:alpha:]] for any alphabetic character

POSIX Classes
Class      Meaning
:alnum:    Any letter or digit
:alpha:    Any letter
:blank:    Space and tab
:cntrl:    Any control character
:digit:    Any digit
:graph:    Any visible character (printable, but not a space)
:lower:    Any lower case letter
:print:    Any printable character, including the space
:punct:    Any punctuation mark including [ and ]
:space:    Any whitespace (space, tab, new line, vertical tab, form feed, carriage return)
:upper:    Any upper case letter
:xdigit:   Any hexadecimal digit (0-9, A-F, a-f)

Examples
[0-9]+
  match any sequence of digits
^[0-9]+$
  match any string that consists solely of digits
^[0-9]*$
  same but also matches the empty string
[abc][abc][abc]
  match any sequence of three a's, b's and c's in any combination, such as aaa, abc, cba, cbc, bbc or cab
^[[:upper:]][[:lower:]]+ [[:upper:]][[:lower:]]+$
  could be used to match any name, assuming a name consists of a first and last name only and each part is one upper case letter followed by lower case letters

Characters That Must NOT Appear
To indicate that the regex should match only if the given character(s) do not appear, use [^char(s)]
^[^0-9]
  match any string that does not start with a digit (the string must have at least one character)
^[^0-9]+$
  match any non-empty string that contains no digits
^[^0-9].*[^0-9]$
  match any string that does not start or end with a digit but may have digits in between
^[^0-9].*[0-9].*[^0-9]$
  match any string that contains at least one digit but not at the beginning or end
Using [^…] is tricky; grep has an easier approach

Matching Literal Characters
What if you want to match a period?
  b.t can match b.t but will also match bat
To match a period exactly, we have two approaches
  \. says "treat the next character literally, not as a metacharacter", that is, "match a period"
  [.] says "match a period"
The brackets [ and ] themselves need care; outside of [ ] use \
  \[ matches a [
  \] matches a ]
Inside [ ], grep and sed treat \ as an ordinary character, so [\[\]] does not mean "either bracket"
  [][] matches either a ] or a [ as the next character (the ] must be listed first)

Example
We want to express an arithmetic equation
  some number, some operation, some number = some number
  the operations are *, /, +, -, % (mod)
  notice that * and + are metacharacters and that - has a special meaning inside [ ]
[0-9]+ [*/+-%] [0-9]+ = [0-9]+
  by placing * and + in [ ], we do not have to worry about them being interpreted as metacharacters
  but we have a problem with - as it is used to express ranges (here +-% would be read as a range)
  to use - literally, place it at the start or the end of the bracketed list (\- is not reliable inside [ ] with grep)
[0-9]+ [*/+%-] [0-9]+ = [0-9]+
NOTE: with a regex, we cannot actually require that the equation be true; for instance, this will match an equation ending in = 43 whether or not the arithmetic is correct
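A quick check of the corrected pattern, sketched with grep -E. The three test lines are made up: a true equation, a false one, and one with an operator that is not in the list.

```shell
# Hyphen placed last inside [ ] so that it is taken literally.
equation='[0-9]+ [*/+%-] [0-9]+ = [0-9]+'

matched=$(printf '%s\n' '12 - 5 = 7' '3 * 4 = 43' '3 ^ 4 = 81' |
          grep -cE "$equation")

echo "$matched of 3 lines look like equations"
```

The false equation is counted as well, which illustrates the note that a regex checks only the form of the equation and not its truth.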

Controlling Repetition
The * and + allow for any number of repeats
What if we want to limit the amount of repetition?
  {n} indicates "exactly n matches"
  {n,m} indicates "between n and m matches"
  {n,} indicates "at least n matches"
  there is no standard {,m} as we do not need it (we can get this using {0,m})
  n and m must be non-negative integers and n <= m
[0-9]{3}-[0-9]{2}-[0-9]{4}
  matches a social security number

Modifying Groups of Characters
What if you want a metacharacter to modify a sequence rather than a single character?
We want 2 or more words; can we use {2,}?
  a word is defined as letters followed by a blank space, as shown below (the quote marks are only there to show that the blank space exists)
  '[[:alpha:]]+ '
To modify this regex with {2,} we enclose the entire thing in ( )
  ([[:alpha:]]+ ){2,}

Selecting Between Sequences
The [ ] allow you to select one of several characters to match the next character
What if you want to express one of several sequences to match?
  Ex: OH, IN, KY
  [OIK][HNY] doesn't work because it also matches other patterns such as ON and IY
The | means "or", as in OH|IN|KY

Examples
5 digit zip code
  [0-9]{5}
9 digit zip code
  [0-9]{5}-[0-9]{4}
Either
  ([0-9]{5})|([0-9]{5}-[0-9]{4})
City/state/zip code combination
  ([A-Z][a-z]+, [A-Z]{2} [0-9]{5})|([A-Z][a-z]+, [A-Z]{2} [0-9]{5}-[0-9]{4})
What's wrong with our city description?
  Could it match San Antonio or McAllen?
  How would you fix it?

Challenging Example
We want to match any string that has a capital letter followed by lower case letters
You might think [A-Z][a-z]+
  But this will match ABCabc as well as Cabc
We want to ensure that the string starts with only one capital letter
  ^[A-Z][a-z]+
  But this will also match AabcC
So we anchor both ends
  ^[A-Z][a-z]+$

Putting it All Together: Examples
^[0-9]+
  match if the string starts with at least one digit
[b-df-hj-np-tv-z]+[aeiou][b-df-hj-np-tv-z]+
  match a string of consonants, one vowel and more consonants
[^A-Z][A-Z]{4}[^A-Z]
  match any string that has exactly four upper case letters surrounded by characters that are not upper case letters, such as abcABCDabc
  what if we want to match four upper case letters by themselves, such as ABCD?
[^A-Z]*[A-Z]{4}[^A-Z]*
  this matches ABCD, but because * allows 0 occurrences it also matches inside longer runs such as ABCDE; (^|[^A-Z])[A-Z]{4}([^A-Z]|$) avoids that
...
  match any three characters no matter what they are
  use ^...$ to match only three-character strings

^$
  the empty string
match the various forms of viagra, including those that have 0 or 1 characters between the letters, such as V!i!a!g!r!a or v*iag*ra
([A-Z][[:alpha:]]+ )?[A-Z][[:alpha:]]+, [A-Z]{2} [0-9]{5}$
  possible solution to the city/state/zip code in which the city name can be multiple words and have multiple capitalized letters, like McAllen
  notice this only matches a 5 digit zip code; can you enhance this for a 5 or 9 digit zip code?

([(][0-9]{3}[)] )?[0-9]{3}-[0-9]{4}
  a US phone number with optional area code
[0-9]+(\.[0-9]+)?
  a number that may but does not have to include a decimal point
  notice this will not match 10. as a whole, but it will match 10 or 10.0
\$[0-9]+\.[0-9]{2}
  match a dollars and cents amount
  with basic grep an unescaped $ is literal unless it ends the pattern, but with egrep $ is always the end-of-line anchor, so use \$ to play safe
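These three patterns can be tried out with a small sketch. It assumes grep -E; the helper name matches and the sample values are made up for illustration. The patterns are wrapped in ^ and $ so that the whole string has to match, the period in the number pattern is escaped so that it matches only a decimal point, and the $ is escaped to play safe.

```shell
# matches REGEX STRING prints yes or no (hypothetical helper for this sketch).
matches() {
  if printf '%s\n' "$2" | grep -qE "$1"; then echo yes; else echo no; fi
}

phone='^([(][0-9]{3}[)] )?[0-9]{3}-[0-9]{4}$'
number='^[0-9]+(\.[0-9]+)?$'
dollars='^\$[0-9]+\.[0-9]{2}$'

matches "$phone"   '(859) 555-1234'   # area code present
matches "$phone"   '555-1234'         # area code omitted
matches "$number"  '10.0'
matches "$number"  '10.'              # trailing period with no digits after it
matches "$dollars" '$5.00'
```

Only the fourth call should print no, since the optional group requires at least one digit after the period.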

grep/egrep
Program to match a regex against the lines of one or more files
  grep = global regular expression print
  egrep is the same as grep but permits the extended regular expression set, which includes { }, ( ) and | (grep -E is equivalent)
  format: egrep [options] regex file(s)
  we will want to put our regex in ' ' as we explain later
For every line that matches the regex, return that entire line of that file
  a string is considered the full line, and grep/egrep attempts to match any substring of that string/line
  by default, if more than one file is searched, the filename is prepended to each line of output

grep/egrep: Examples
egrep '\$[0-9]+\.[0-9]{2}' *
  search all files for dollar and cents entries
egrep '41099' *
  search all files for the zip code, but note that this will match any line that contains this exact 5-digit sequence whether it is a zip code or not
egrep 'Highland Heights, KY ' *
  match the given city, state, zip code exactly
egrep '(KY |KY|KY\.) 41099' *
  match different possible state/zip codes given that the state might end with . or have an extra blank space

grep/egrep: Search for IP Addresses
An IP address is #.#.#.# where # is any number from 0-255
The first attempt of most students is
  egrep '[0-255].[0-255].[0-255].[0-255]' /etc/*
Why is this wrong?
  [0-255] matches 1 character, so it will match a 0, 1, 2 or 5, not any number from 0 to 255
  the period matches any character
  so this regex would match a string such as 0a1b2c5d, which is not an IP address, and would fail to match many valid addresses

First fix: change . to either \. or [.]
  egrep '[0-255]\.[0-255]\.[0-255]\.[0-255]' /etc/*
Second fix: change [0-255] to a regex that matches any three digits
  egrep '[0-9]{3}\.[0-9]{3}\.[0-9]{3}\.[0-9]{3}' /etc/*
  this unfortunately will not match addresses whose numbers have fewer than three digits, because we are insisting that each number contain exactly 3 digits
Third fix: change [0-9]{3} to [0-9]{1,3}
  egrep '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' /etc/*
  better? yes, but this will also match numbers larger than 255, which do not form a valid IP address

To match any number between 0 and 255 we break the sequences down as follows:
  0-9       [0-9]
  10-99     [1-9][0-9]
  100-199   1[0-9][0-9]
  200-249   2[0-4][0-9]
  250-255   25[0-5]
So we have to include 5 different options:
  [0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]
Our final command is
  egrep '(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])' /etc/*
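The five-option pattern can be exercised on a few sample strings. This is a sketch under two assumptions that go beyond the slide: the pattern is wrapped in ^ and $ so that the whole string must be an address, and is_ip is a made-up helper name. The addresses are arbitrary test values.

```shell
octet='([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])'

# Three octets each followed by a literal period, then a fourth octet.
# The anchors are an addition for this sketch.
ip="^($octet\.){3}$octet\$"

is_ip() {
  if printf '%s\n' "$1" | grep -qE "$ip"; then echo yes; else echo no; fi
}

is_ip 192.168.0.1
is_ip 255.255.255.255
is_ip 256.1.1.1        # 256 is out of range
is_ip 1.2.3            # only three numbers
```

Without the anchors, as in the command on the slide, a line matches as soon as any part of it looks like an address, which is the appropriate behavior when searching configuration files.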

Partial result of our previous egrep command
  networks:loopback
  networks:link-local
  ntp.conf:restrict
  ntp.conf:#restrict mask nomodify notrap
  ntp.conf:#broadcast autokey # broadcast server
  ntp.conf:#broadcast autokey # multicast server
  ntp.conf:#multicastclient # multicast client
  ntp.conf:#manycastserver # manycast server
  ntp.conf:#server # local clock
  openct.conf: # >=linux
  openct.conf: # >=linux
  pam_ldap.conf:host
  pam_ldap.conf:#uri ldap:// /
  pam_ldap.conf:#uri ldaps:// /
  Binary file prelink.cache matches
  resolv.conf:nameserver
  resolv.conf:nameserver
Notice this entry

grep/egrep: Options
-c   do not return matching lines, just output the number of matching lines
-i   ignore case (this allows you to avoid using upper and lower case options in your regex, as in [Yy][Ee][Ss])
-I   ignore binary files
-n   insert the line number before each matching line
-h   discard file names (used by default if only searching 1 file)
-H   include file name (used by default when searching multiple files)
-v   invert the match (return all lines that do not match the regex)
  this is very useful because it allows you to avoid using [^…] in your regex
  egrep -v '\.' file
  return all lines with no periods
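A few of these options can be seen in action on a small made-up input. The sketch assumes grep -E and reads from a pipe rather than a file, so no file names appear in the output.

```shell
# Four made-up input lines.
text=$(printf '%s\n' 'Yes' 'no.' 'YES' 'maybe')

ci=$(printf '%s\n' "$text" | grep -ciE 'yes')         # -c and -i: count lines, ignoring case
noperiod=$(printf '%s\n' "$text" | grep -cvE '\.')    # -v: count lines without a period
numbered=$(printf '%s\n' "$text" | grep -nE 'maybe')  # -n: prefix the line number

echo "case-insensitive yes: $ci"
echo "lines without a period: $noperiod"
echo "numbered match: $numbered"
```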

grep/egrep: Examples
egrep '[[:alpha:]]{30}' /usr/share/dict/words
  find all words of 30 (or more) letters in the Linux dictionary
egrep '[A-Z][[:alpha:]]{19}' /usr/share/dict/words
  find all 20 letter words in the dictionary that start with a capital letter
  because we are not using ^ and $ in these two examples, longer words can be returned
egrep '[[:punct:]]' /usr/share/dict/words
  find all words that contain punctuation marks
egrep '[[:punct:]].*[[:punct:]]' /usr/share/dict/words
  find all words that contain two punctuation marks
egrep '[[:punct:]][^[:punct:]]+[[:punct:]]' /usr/share/dict/words
  find all words that contain two non-consecutive punctuation marks

grep/egrep: Using ' '
Why do we use ' ' and what if we don't?
  In some cases it won't matter, for instance egrep someword *
  But what if we use egrep someletter* *
  Here's where we get into trouble, as the Bash interpreter works on this command before egrep does
  So the two * are treated as filename expansion
Imagine that we do egrep ab* somefile
  Here we are asking to match any line in somefile that has an a followed by zero or more bs
  but the Bash interpreter performs filename expansion instead
  imagine that the directory contains the files abort and abrupt.txt
  this egrep command becomes egrep abort abrupt.txt somefile
  which means that egrep looks to match the string "abort" in the files abrupt.txt and somefile
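The filename expansion problem can be reproduced safely in a throwaway directory. This sketch assumes grep -E and the mktemp -d command; the file contents are made up, while the file names follow the scenario described above.

```shell
# Work in a temporary directory so the glob only sees our three files.
dir=$(mktemp -d)
: > "$dir/abort"                      # empty file named abort
printf 'abort\n' > "$dir/abrupt.txt"  # contains the word abort
printf 'abbb\n'  > "$dir/somefile"    # the line we actually want to find

# Unquoted: the shell expands ab* to "abort abrupt.txt" before grep runs,
# so grep searches for the regex abort in abrupt.txt and somefile.
unquoted=$(cd "$dir" && grep -E ab* somefile)

# Quoted: grep receives the regex ab* itself and searches only somefile.
quoted=$(cd "$dir" && grep -E 'ab*' somefile)

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
rm -r "$dir"
```

The unquoted command reports a line from abrupt.txt and misses the line in somefile, which is why the regex is placed in single quotes.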

sed
A stream editor program
What is a stream?
  It is a sequence of characters taken from one source
  sed allows you to manipulate the stream and output an adjusted version
Thus, sed searches for and replaces sequences of characters in the stream
  sed can actually do a lot more, but we restrict our look to just this one type of operation: search and replace
Form: sed 's/pattern/replacement/' filename
  pattern is a regex
  sed will send its output to STDOUT, the terminal window, so you might want to redirect the output to a new file
  sed 's/pattern/replacement/' filename > filename2

sed: Example
We want to replace apt in a file of addresses with the word Apartment
  sed 's/[Aa]pt/Apartment/' addresses.txt > revised_addresses.txt
  we might expect apt to also appear as Apt
What if apt might be written as Number or #?
  sed 's/[Aa]pt\|[Nn]umber\|#/Apartment/' addresses.txt > revised_addresses.txt
  now we are replacing any of Apt, apt, Number, number or # with Apartment
  notice though that we might see #8, which would now appear as Apartment8

sed: Multiple Patterns
We want to insert a blank space after Apartment if the pattern is #; how do we specify different replacements for different patterns?
  we use multiple pattern/replacement pairs and add -e as an option before each one
  sed -e 's/[Aa]pt\|[Nn]umber/Apartment/' -e 's/#/Apartment /' addresses.txt > revised_addresses.txt
Here we replace Apt, apt, Number and number with Apartment, and # with Apartment followed by a space

sed: Deleting Patterns
One easy replacement string is the empty string, indicated as //
We can find a pattern and replace it with nothing
  sed 's/-[0-9][0-9][0-9][0-9]//' addresses.txt > revised_addresses.txt
  find the 4-digit zip code extension on each line and drop it
The above regex is cumbersome, but we can't use { } as it's part of the extended regular expression set unless we add the option -r
  sed -r 's/-[0-9]{4}//' addresses.txt > revised_addresses.txt
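The address edits from the last few slides can be combined into one command. This is a sketch on two made-up address lines; it uses -E, which GNU sed accepts as an equivalent of -r, so that | and { } need no backslashes.

```shell
# Two made-up address lines.
addresses='12 Main St apt 4, Highland Heights, KY 41099-1234
7 Oak Ave #8, Highland Heights, KY 41099'

# Each -e is applied in turn to every line:
#   1. apt/Apt/number/Number becomes Apartment
#   2. # becomes Apartment followed by a space
#   3. a 4-digit zip code extension is deleted
revised=$(printf '%s\n' "$addresses" |
  sed -E -e 's/[Aa]pt|[Nn]umber/Apartment/' \
         -e 's/#/Apartment /' \
         -e 's/-[0-9]{4}//')

printf '%s\n' "$revised"
```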

sed: Global Replacements
sed will search line by line for the pattern and, if found, replace the pattern with the string
  it then moves on to the next line
  what if we might have multiple matches on one line?
We need to indicate a global replacement by adding g after the last / in our sed command
  sed 's/pattern/replacement/g' file(s)
  for instance, to replace all of the \t (tab) characters in a file with a blank space, use
  sed 's/\t/ /g' file

sed: Global Replacement Example
Replace the words "equals", "times", "plus", "minus", "divide" in the file math.txt with the actual operator symbols (=, *, +, -, /)
  sed -e 's/equals/=/' -e 's/times/*/' -e 's/plus/+/' -e 's/minus/-/' math.txt
  this version will only replace the first occurrence of each word on a line
  sed -e 's/equals/=/g' -e 's/times/*/g' -e 's/plus/+/g' -e 's/minus/-/g' math.txt
  this version replaces all instances throughout the file
  divide is not handled by the commands above; its replacement / would have to be written as \/ because / is the delimiter
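The effect of the g flag is easiest to see on a line that repeats a word. This sketch uses a made-up line of text and also handles divide by switching the delimiter to |, which sed allows, so that the replacement / needs no escaping.

```shell
line='a plus b plus c equals d divide e'

# Without g, only the first match of each pattern on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed -e 's/plus/+/' -e 's/equals/=/')

# With g, every match on the line is replaced.
all=$(printf '%s\n' "$line" |
  sed -e 's/plus/+/g' -e 's/equals/=/g' -e 's|divide|/|g')

echo "$first_only"
echo "$all"
```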

sed: The & Placeholder
The regex's pattern could potentially match many different strings
Our replacement specifies only one replacement string
What if, instead of replacing each string, we want to manipulate the string (e.g., capitalize it or place a ! after it)?
We need a mechanism to refer back to the text that matched: we use & as a placeholder to mean "the matched pattern"

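sed: & Examples
Find any fully capitalized word and place it in parentheses
  sed -r 's/ [A-Z]+ /(&)/g' filename
  we are looking for a space, upper case letters, space (-r is needed for +)
  what would happen if we remove the blank spaces?
Add a ! to the end of each line
  sed 's/\n/!&/' filename
  insert a ! before each \n, that is, add ! to the end of each line
  why didn't we use &! instead?
  note that sed strips the trailing \n from each line before matching, so in practice this pattern does not match; sed 's/$/!/' filename has the intended effect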

sed: Other Uses of &
We can also use any of these to modify the string that matched the pattern (these are GNU sed extensions)
  \u   upper case the first letter of the string
  \l   lower case the first letter of the string
  \U   upper case the entire string
  \L   lower case the entire string
sed -r 's/[A-Za-z]+ /\L&/g' file
  replace every word that contains any upper case letters with solely lower case letters
sed -e 's/^[a-z]/\u&/g' -e 's/ [a-z]/\u&/g' file
  upper case the first letter of every line and the first letter of every word

sed: More Examples
Capitalize all vowels
  sed 's/[aeiou]/\u&/g' names.txt
Fully upper case all names (assuming names will appear as one capital letter followed by lower case letters)
  sed -r 's/[A-Z][a-z]+/\U&/g' names.txt
Remove all initials
  sed 's/[A-Z]\.//g' names.txt
Replace blank spaces with tabs
  sed 's/ /\t/g' names.txt
Take each line and duplicate it onto a second line (insert \n between the two instances of the line)
  sed -r 's/[A-Za-z. ]+/&\n&/' names.txt

sed: Other Placeholders
With &, the replacement impacts the fully matched string
  what if you want to impact only part of a matched string?
We can divide our search regex into parts by placing each part inside \( and \) marks (plain ( and ) when sed is run with -r)
  we then refer to the various matched parts as \1, \2, \3, etc. in the replacement string

sed: Placeholder Example
Let's illustrate the use of placeholders by deleting the middle initial of a person's name
  sed -r 's/([A-Z][a-z]+)( [A-Z]\. )([A-Z][a-z]+)/\1 \3/' file
  we look for a string that is a capital letter followed by lower case letters; we call this \1
  we next look for a space, capital letter, a period and another space, and call this \2
  finally we look for a capital letter followed by lower case letters and call this \3
  we output \1 \3 (we have to insert the blank space since we are eliminating the two spaces by not outputting \2)
  -r lets us write the groups as ( ) and use +; without it, the groups are written \( and \)

sed: More Placeholder Examples
Rewriting a name from First Middle. Last to Last, First
  sed -r 's/([A-Z][a-z]+) ([A-Z]\.) ([A-Z][a-z]+)/\3, \1/' names.txt
Revise the above in case there is no middle initial
  sed -r -e 's/([A-Z][a-z]+) ([A-Z]\.) ([A-Z][a-z]+)/\3, \1/' -e 's/([A-Z][a-z]+) ([A-Z][a-z]+)/\2, \1/' names.txt
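The name rewrite can be tried on two made-up names, one with a middle initial and one without. This sketch assumes sed with -E (equivalent to -r in GNU sed), where groups are written with plain parentheses, and adds ^ and $ so that each expression must account for the whole line. LC_ALL=C is set so that [A-Z] and [a-z] mean exactly the ASCII letters.

```shell
# First expression: First M. Last  ->  Last, First
# Second expression: First Last    ->  Last, First
swapped=$(printf '%s\n' 'Frank V. Zappa' 'George Duke' |
  LC_ALL=C sed -E \
    -e 's/^([A-Z][a-z]+) ([A-Z]\.) ([A-Z][a-z]+)$/\3, \1/' \
    -e 's/^([A-Z][a-z]+) ([A-Z][a-z]+)$/\2, \1/')

printf '%s\n' "$swapped"
```

A line already rewritten by the first expression contains a comma, so the second expression leaves it alone.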

awk
This program is more of a programming language, with the ability to search a file for various patterns or conditions
Upon a matching line, it executes some action on that line
  the action might include assignment statements and output statements
The file being operated on is expected to be segmented, where each row consists of fields of data
awk stands for the last names of its authors (Aho, Weinberger and Kernighan)

awk: Syntax
awk '/pattern/ { action(s) }' filename
awk '/pattern1/ {action(s) 1} /pattern2/ {action(s) 2} /pattern3/ {action(s) 3} … /patternn/ {action(s) n}' filename
Each pattern can be a literal string, a regular expression or a comparison using the format field operator value, such as $3 > 100
  field is the column of the datum being tested in the file, using the notation $1 for the first field, $2 for the second field, etc.
  $0 indicates the full line
Actions can include {print …} as in {print $0}

awk: Examples
Assume a file of names; print the first, second and third entries of a line that contains an initial (assume $1 is first name, $2 is middle initial and $3 is last name)
  awk '/[A-Z]\./ {print $1,$2,$3}' names.txt
Same except we do not print middle initials
  awk '/[A-Z]\./ {print $1,$3}' names.txt
Count number of entries that have middle initials
  awk '/[A-Z]\./ {count++}' names.txt
  notice that we are never printing the result of count; we'll explore how to do this shortly

awk: Computation Example
We have a file of employee data (empl.dat)
  firstname lastname hours wages
We want to compute each employee's pay (hours is $3 and wages is $4)
  awk '{print $1, $2 ": " $3*$4}' empl.dat
  notice that we have no condition, so we are outputting the results for every employee
  awk '$3<=40 {print $1, $2 ": " $3*$4}' empl.dat
  here we are only printing the pay for employees who did not work overtime
  awk '/Zappa/ {print $3*$4}' empl.dat
  only compute and output pay for employee Zappa
  this could also be awk '$2=="Zappa" {print $3*$4}' empl.dat
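The pay computation can be sketched on two made-up rows in the layout described above, with hours in field $3 and wages in field $4. The first names and the numbers are invented for illustration, and the data is piped in rather than read from empl.dat.

```shell
# Made-up rows: firstname lastname hours wages
empl='Frank Zappa 45 20
George Duke 40 15'

# No condition: every row is printed.
everyone=$(printf '%s\n' "$empl" | awk '{print $1, $2 ": " $3*$4}')

# Condition on a field: only rows with 40 hours or fewer.
no_overtime=$(printf '%s\n' "$empl" | awk '$3<=40 {print $1, $2 ": " $3*$4}')

# String comparison on the last name; the string must be quoted inside awk.
zappa=$(printf '%s\n' "$empl" | awk '$2=="Zappa" {print $3*$4}')

printf '%s\n' "$everyone"
echo "no overtime: $no_overtime"
echo "Zappa: $zappa"
```

Fields are referenced as $3 and $4 outside of any quotes; text placed inside double quotes in an awk program is printed literally.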

awk: BEGIN and END Clauses
A couple of slides ago we saw {count++}
  How do we output the result?
  If we just used {count++; print count} we would print the value of count every time we have a match
  We only want to print it out once, at the END
This leads us to having BEGIN and END statements
  BEGIN: perform the actions once before awk works on the file
    we will mostly use BEGIN to initialize any variables that do not default to 0 and to output a "header"
  END: perform the actions once after awk finishes working on the file (even if no lines match)
    we will mostly use END to compute any final computations (like an average) and output totals and averages
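The count of names with middle initials can then be printed once from an END clause. A minimal sketch on made-up names (the initials are invented); count+0 is used so that a 0 is printed even when no line matches and count was never set.

```shell
names='Frank V. Zappa
George Duke
Mike R. Keneally'

# count is incremented on each matching line and printed once at the end.
with_initial=$(printf '%s\n' "$names" |
  awk '/[A-Z]\./ {count++} END {print count+0}')

# END still runs when nothing matches.
none=$(printf '%s\n' 'George Duke' |
  awk '/[A-Z]\./ {count++} END {print count+0}')

echo "with initial: $with_initial, in a file with no initials: $none"
```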

awk: BEGIN/END Examples
awk 'BEGIN {print "Pay results for Zappa"; total = 0} /Zappa/ {temp=$3*$4; print $1 "\t" temp; total = total + temp} END {print "Total pay for Zappa is $" total}' sales.txt
  note that an apostrophe inside a message would end the single-quoted awk program, so the message is worded without one
If no lines match Zappa, the output is simply
  Pay results for Zappa
  Total pay for Zappa is $0
Example with multiple employees:
awk 'BEGIN {total1=0;total2=0;total3=0} /Zappa/ {total1=total1+$3*$4} /Duke/ {total2=total2+$3*$4} /Keneally/ {total3=total3+$3*$4} END {print "Zappa $" total1 "\n" "Duke $" total2 "\n" "Keneally $" total3}' sales.txt

awk: More Complex Conditions
We can test for multiple conditions in one pattern by using || for "or", && for "and" and !/.../ for "not"
  awk '/OH/||/KY/ {counter=counter+1} END {print "Total number of employees who serve OH or KY: " counter}' sales.txt
  Count the total number of employees in a file that match either OH or KY

awk: Payroll Example
awk 'BEGIN {total_pay=0.0;count=0}
  $2>40 {current_pay = ($2-40) * $3 * 1.5 + 40 * $3; total_pay += current_pay; count++; print $1 "\t $" current_pay}
  $2<=40 {current_pay = $2*$3; total_pay += current_pay; count++; print $1 "\t $" current_pay}
  END {print "Average pay is $" total_pay/count}' payroll.dat
The numbers in the overtime formula did not survive in this transcript; 1.5 (time and a half) and 40 (regular hours) are assumed here

awk: Using | to awk
Compute the average size of the regular files in a directory (the lines of ls -l output that start with -)
  ls -l | awk 'BEGIN {total=0;count=0} /^-/ {total+=$5;count++} END {print total/count}'
  notice no file name because of the pipe
ps aux | awk 'BEGIN {total=0} /foxr/ {total+=$6} END {print total}'
  output the total amount of RSS (resident set size, the physical memory in use) used by foxr processes
