INFO 320 Server Technology I Week 7 Regular expressions 1INFO 320 week 7
Overview
One of the most powerful tools in UNIX/Linux is the ability to match text against regular expressions
–Regular expressions overview
–grep
–Character classes
–Applications

Regular expressions overview
Mostly from Regular-Expressions.info and the man pages cited

Regular expressions?
“A regular expression (regex or regexp for short) is a special text string for describing a search pattern”
–While developed in UNIX, regular expressions can also be used with little modification in Windows, Perl, PHP, Java, or a .NET language
–“Little modification?” Yes, you have to be careful which set of regex rules you’re using

Regular expressions
The down side?
–They look like complete and utter gibberish
The good news?
–There are zillions of cookbook recipes for common uses of them
–And with commands such as grep, ed, and sed, they can be used in scripts

Fancy wildcards?
The basic idea is that regexes are wildcards on steroids
We saw that, in bash scripting
–A star ‘*’ can substitute for zero or more of any character (except a line break)
–A question mark ‘?’ can substitute for exactly one of any character
In a regex, ‘*’ and ‘?’ do NOT mean this; they control how often the preceding item may repeat
–We’ll refine our use of brackets [ ] to include or exclude any specific one character

Regex syntax
Within UNIX, there are variations on regex syntax
–GNU grep (our main tool) uses GNU Basic Regular Expressions syntax (BRE)
–GNU egrep uses GNU Extended Regular Expressions syntax (ERE)
–POSIX-compliant systems use POSIX Basic Regular Expressions for grep, or POSIX Extended Regular Expressions for egrep

BRE (grep) vs ERE (egrep)
The only difference is that BREs use backslashes to give certain characters (such as ?, +, {, |, ( and )) a special meaning, while EREs use backslashes to take away the special meaning of the same characters
egrep has the same functions as grep; in GNU grep it is simply another way to run grep -E
–grep -E is the same as egrep

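The backslash swap can be seen with parentheses, which behave this way in both the GNU and POSIX flavors. This is a minimal sketch; the two sample lines and the variable names are made up for illustration.

```shell
# Two sample lines: one made of repeated "ab", one containing literal parentheses.
sample=$(printf '%s\n' 'abab' '(ab)')

# BRE (plain grep): bare parentheses are literal, \( \) form a group.
bre_literal=$(printf '%s\n' "$sample" | grep '(ab)')
bre_group=$(printf '%s\n' "$sample" | grep '^\(ab\)*$')

# ERE (grep -E): bare parentheses form a group, \( \) are literal.
ere_group=$(printf '%s\n' "$sample" | grep -E '^(ab)*$')
ere_literal=$(printf '%s\n' "$sample" | grep -E '\(ab\)')

printf 'BRE literal: %s, BRE group: %s\n' "$bre_literal" "$bre_group"
printf 'ERE group: %s, ERE literal: %s\n' "$ere_group" "$ere_literal"
```
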
Ed and sed
Similar regex rules are used by grep, ed, and sed
–ed is a line-oriented text editor
–sed is used to perform basic transformations on an input text stream

grep

Regular expressions and grep
Regular expressions were first implemented in the 1970’s in UNIX for the ‘grep’ command
–grep = global regular expression print, from the ed command g/re/p
–egrep = extended grep
We’ll focus on grep
–grep matches BREs, which were defined by IEEE Std , Section 9.3, Basic Regular Expressions (now dated 2008)

grep syntax
The basic form is
–grep -options pattern file
The normal output from grep is a text list of all the lines which matched the pattern in the file
–Notice that text which is split across lines, like ‘re-’ at the end of one line and ‘member’ at the start of the next, is not found!
Regex matches cannot span multiple lines

grep options
Like most UNIX commands, grep has many options (see handout), including
– -c shows the count of lines matched, instead of the lines themselves
– -i ignores case when matching (!)
– -n gives the line number of each line matched
– -v gives lines which don’t match the pattern(s) as output

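The four options above can be tried on a few lines of text. This is a small sketch; the sample text and variable names are invented for the example.

```shell
# Sample text, one phrase per line.
text=$(printf '%s\n' 'Four score and seven years ago' \
                     'our fathers brought forth' \
                     'a new nation' \
                     'Our cause')

count=$(printf '%s\n' "$text" | grep -c 'our')        # number of lines containing "our"
count_nocase=$(printf '%s\n' "$text" | grep -ci 'our') # same, ignoring case
numbered=$(printf '%s\n' "$text" | grep -n 'nation')   # line number prefixed to the match
inverted=$(printf '%s\n' "$text" | grep -v 'our')      # lines that do NOT contain "our"

printf 'count=%s count_nocase=%s\n' "$count" "$count_nocase"
printf 'numbered: %s\n' "$numbered"
printf 'inverted:\n%s\n' "$inverted"
```

Note that -c counts matching lines, not individual matches within a line.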
grep options
You can also give a list of patterns, each introduced by the -e option
Or use a file of patterns (one per line) with the -f option
You can match only lines where the whole line matches the pattern, with the -x option

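A short sketch of these three options follows. The word list is made up, and the pattern file is a temporary file created just for the demonstration.

```shell
words=$(printf '%s\n' 'cat' 'category' 'dog' 'bird')

# -e may be repeated; a line is printed if it matches any of the patterns.
either=$(printf '%s\n' "$words" | grep -e 'cat' -e 'dog')

# -x requires the pattern to match the whole line, so "category" is excluded.
whole=$(printf '%s\n' "$words" | grep -x 'cat')

# -f reads one pattern per line from a file.
patterns=$(mktemp)
printf '%s\n' 'bird' 'dog' > "$patterns"
from_file=$(printf '%s\n' "$words" | grep -f "$patterns")
rm -f "$patterns"

printf 'either:\n%s\nwhole: %s\nfrom_file:\n%s\n' "$either" "$whole" "$from_file"
```

Output lines always appear in input order, regardless of the order of the patterns.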
Search patterns
As a good habit, put the search pattern in quotes; single quotes are safest, because the shell leaves everything inside them untouched
–The pattern is a regular expression
If you give an empty pattern all lines will be matched
–So what does grep -c '' filename do?

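One way to explore the question is to run it on a known input. In this sketch the three-line sample (including one empty line) is invented, and input is piped in rather than read from a file.

```shell
# Three lines of input; the middle one is empty.
text=$(printf '%s\n' 'alpha' '' 'gamma')

# The empty pattern matches every line, so -c reports the total number of lines.
line_count=$(printf '%s\n' "$text" | grep -c '')

printf 'lines: %s\n' "$line_count"
```
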
Metacharacters
Regex metacharacters are characters that have special meaning in this context
We’ll look at them in groups
–We already mentioned the star ‘*’; in a regex it matches zero or more of the preceding item, so ‘.*’ matches zero or more of any character (except newline)
–To match exactly one of any character, use a period ‘.’
Notice a ‘?’ did this in the context of scripting

Metacharacters
We can identify words at the start or end of a line
‘^’ (the caret) marks the start of the line
–‘^Four’
‘$’ (dollar) marks the end of the line
–‘ago$’
–Again, different meaning than in scripting

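The two anchors can be compared with an unanchored search. This is a sketch on invented sample lines.

```shell
text=$(printf '%s\n' 'Four score and seven years ago' \
                     'Twenty Four hours' \
                     'It happened long ago' \
                     'ago is a short word')

starts=$(printf '%s\n' "$text" | grep '^Four')   # "Four" only at the start of a line
ends=$(printf '%s\n' "$text" | grep 'ago$')      # "ago" only at the end of a line
anywhere=$(printf '%s\n' "$text" | grep -c 'ago') # unanchored: "ago" anywhere

printf 'starts:\n%s\nends:\n%s\nanywhere count: %s\n' "$starts" "$ends" "$anywhere"
```
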
Metacharacters
We can identify the start or end of a word
‘\<’ marks the start of a word
–‘\<eat’ would match eats or eating, not feat
‘\>’ marks the end of a word
–‘ing\>’ would match loving but not sings

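The word anchors can be checked against the slide's own examples. This sketch assumes a grep that supports \< and \> (GNU grep does; they are an extension, not part of POSIX).

```shell
words=$(printf '%s\n' 'eats' 'eating' 'feat' 'loving' 'sings')

# \< anchors at the start of a word: "feat" is excluded.
word_start=$(printf '%s\n' "$words" | grep '\<eat')

# \> anchors at the end of a word: "sings" is excluded.
word_end=$(printf '%s\n' "$words" | grep 'ing\>')

printf 'start of word:\n%s\nend of word:\n%s\n' "$word_start" "$word_end"
```
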
Character classes

Character classes
With a "character class" (or set) you can tell the regex engine to match only one out of several characters
–Simply place the possible characters you want to match between square brackets
If you want to match an a or an e, use [ae]
–You could use this in gr[ae]y to match either gray or grey
Very useful if you do not know whether the document you are searching through is written in American or British English

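Character classes
The order of the characters inside a character class does not matter
–The results are identical for [ae] or [ea]
The characters don’t have to be sequential
–[dptjgm583;] is fine
–Outside a character class, if you want to match the special characters [\^$.|?*+(){} literally, you need to add a backslash before them
–Inside the brackets the rules depend on the flavor: in Perl-style flavors [abc\\\?] matches a, b, c, \ or ?, while in grep the backslash is itself literal inside brackets, so [abc\?] is enough
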
Character classes
More generally in character classes
–‘[abc]’ matches any one character specified between the brackets
–‘[^abc]’ matches any one character NOT specified between the brackets
That example matches one character that is not a, b or c; the line as a whole may still contain those letters elsewhere
Notice the ^ has a very different meaning inside a character class than as its own metacharacter

Character classes
Within character classes, ranges of possible characters can be given
–[a-z] means any lower case letter
–[a-zA-Z] means any upper or lower case letter
–[a-zA-Z0-9] means any letter or digit, so [^a-zA-Z0-9] could be any character that isn’t a letter or number

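Sets, ranges and negation can be tried together. This is a sketch on made-up input; it sets LC_ALL=C because in some locales a range such as [a-z] can also match upper case letters.

```shell
export LC_ALL=C   # make ranges like [a-z] follow plain ASCII order

# A simple set: either spelling of gray/grey, but not "groy".
spelling=$(printf '%s\n' 'gray' 'grey' 'groy' | grep 'gr[ae]y')

items=$(printf '%s\n' 'abc' 'ABC' '123' '---')
lower=$(printf '%s\n' "$items" | grep '[a-z]')            # at least one lower case letter
alnum=$(printf '%s\n' "$items" | grep '[a-zA-Z0-9]')      # at least one letter or digit
other=$(printf '%s\n' "$items" | grep '[^a-zA-Z0-9]')     # at least one character that is neither

printf 'spelling:\n%s\nlower: %s\nalnum:\n%s\nother: %s\n' \
       "$spelling" "$lower" "$alnum" "$other"
```
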
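Metacharacters
The pipe means logical OR in an expression, here called alternation
–abc(def|xyz) matches abcdef or abcxyz
Multiple alternations are allowed
–s(i|a|o)ng matches sing, sang or song (inside square brackets the pipe is just a literal character, so write s[iao]ng rather than s[i|a|o]ng)
Notice the parentheses group a string of characters to be treated as one
This is ERE syntax, so use grep -E
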
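The difference between alternation in parentheses and a pipe inside square brackets shows up when the input contains a literal pipe. This is a sketch with an invented word list.

```shell
words=$(printf '%s\n' 'abcdef' 'abcxyz' 'abcabc' 'sing' 'sang' 'song' 'sung' 's|ng')

# Alternation inside a group (ERE).
grouped=$(printf '%s\n' "$words" | grep -E 'abc(def|xyz)')

# Three alternatives; -x makes the pattern match the whole line.
vowels=$(printf '%s\n' "$words" | grep -Ex 's(i|a|o)ng')

# Inside [ ] the pipe is an ordinary character, so "s|ng" matches too.
bracket=$(printf '%s\n' "$words" | grep -x 's[i|a|o]ng')

printf 'grouped:\n%s\nvowels:\n%s\nbracket:\n%s\n' "$grouped" "$vowels" "$bracket"
```
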
Bracket expressions
POSIX has bracket expressions to provide abbreviations for common search terms; each name goes inside the square brackets of a character class
–For example instead of [a-z] you can use [[:lower:]]
–[a-zA-Z] becomes [[:alpha:]]
–[a-zA-Z0-9] becomes [[:alnum:]]
–What does [A-Fa-f0-9] = [[:xdigit:]] mean?
So [^x-z[:digit:]] matches a single character that is not x, y, z or a digit [0-9]

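The named classes can be tested as follows. This is a sketch on made-up input, with LC_ALL=C set so that the x-z range follows ASCII order.

```shell
export LC_ALL=C

items=$(printf '%s\n' 'abc' 'x1' '7' 'DEADbeef' 'xyz')

# Whole line consists only of hexadecimal digits (0-9, A-F, a-f).
hex=$(printf '%s\n' "$items" | grep -Ex '[[:xdigit:]]+')

# Line contains at least one character that is not x, y, z or a digit.
not_xyz_digit=$(printf '%s\n' "$items" | grep '[^x-z[:digit:]]')

printf 'hex:\n%s\nnot x-z or digit:\n%s\n' "$hex" "$not_xyz_digit"
```
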
Optional
The question mark will attempt to match the preceding token zero times or once, in effect making it optional
–colou?r matches both colour and color
–Nov(ember)? will match Nov and November
This is ERE syntax, so use grep -E

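A sketch of the optional token, on invented input. The last command shows that plain grep (BRE) treats the question mark as an ordinary character, so it finds nothing here.

```shell
words=$(printf '%s\n' 'color' 'colour' 'colouur' 'Nov 5' 'November 5')

# ERE: the "u" is optional; -x requires a whole-line match.
spellings=$(printf '%s\n' "$words" | grep -Ex 'colou?r')

# ERE: the whole group "ember" is optional.
months=$(printf '%s\n' "$words" | grep -Ex 'Nov(ember)? 5')

# BRE: "?" is literal, so this looks for the text "colou?r"; "|| true" ignores
# grep's non-zero exit status when nothing matches.
bre_plain=$(printf '%s\n' "$words" | grep 'colou?r' || true)

printf 'spellings:\n%s\nmonths:\n%s\nBRE result: [%s]\n' "$spellings" "$months" "$bre_plain"
```
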
Repetition
The asterisk or star tells the engine to attempt to match the preceding token zero or more times.
–‘<[A-Za-z][A-Za-z0-9]*>’ matches an HTML tag without any attributes
The plus tells the engine to attempt to match the preceding token once or more.
–‘<[A-Za-z0-9]+>’ will match a tag with any one or more alphanumeric characters

Limiting repetition
As a further refinement, it’s possible to specify how many times a token will be repeated, by adding {min,max} instead of a star or plus
Max is infinite if not specified, so
–* = {0,}
–+ = {1,}
–? = {0,1}
–But {0,3} would limit the previous character to appear zero to three times

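The interval syntax can be checked with grep -E. This is a sketch on made-up strings; -x is used so that the whole line must match.

```shell
items=$(printf '%s\n' 'ac' 'abc' 'abbbc' 'abbbbc')

up_to_three=$(printf '%s\n' "$items" | grep -Ex 'ab{0,3}c')   # zero to three b's
at_least_two=$(printf '%s\n' "$items" | grep -Ex 'ab{2,}c')   # two or more b's

# {0,} and * should select exactly the same lines.
with_interval=$(printf '%s\n' "$items" | grep -Ex 'ab{0,}c')
with_star=$(printf '%s\n' "$items" | grep -Ex 'ab*c')

printf 'up to three:\n%s\nat least two:\n%s\n' "$up_to_three" "$at_least_two"
```
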
() [] [::]?
So in the context of a regex
–Parentheses ( ) are used for grouping, to treat a series of characters as one for repetition
–Square brackets [ ] define a character class, which matches any one character in that class
–Square brackets with colons [: :] name a POSIX bracket expression, written inside a character class as in [[:digit:]]

?*+{}?
And following any kind of grouping, character class, or bracket expression
–? makes a group repeated zero or one time (optional)
–+ makes a group repeated one or more times
–* makes a group repeated zero or more times
–Curly brackets { } are used for controlling repetition by giving min and max limits

Searching for special characters
To match a ], put it as the first character after the opening [ or the negating ^
To match a -, put it right before the closing ]
To match a ^, put it before the final literal - or the closing ]
Put together, []\d^-] matches ], \, d, ^ or -

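The combined class can be verified with grep, where the backslash is an ordinary character inside brackets. This is a sketch; each sample line holds a single character.

```shell
export LC_ALL=C

# One character per line: the five the class should match, plus "x" as a control.
chars=$(printf '%s\n' ']' '\' 'd' '^' '-' 'x')

matched_count=$(printf '%s\n' "$chars" | grep -c '[]\d^-]')
unmatched=$(printf '%s\n' "$chars" | grep -v '[]\d^-]')

printf 'matched %s lines; not matched: %s\n' "$matched_count" "$unmatched"
```
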
Applications

Ok, now what?
Given this terribly complex set of rules for defining a regular expression … so what?
Regexes are very handy for searching for specific terms, or validating inputs
Here we’ll review a few cookbook examples

Trimming Whitespace
A mundane example is to use regular expressions to get rid of spaces at the start and end of lines
–Search for ^[ \t]+ and replace with nothing to delete leading whitespace
–Search for [ \t]+$ and replace with nothing to trim trailing whitespace
–[ \t] matches a space or tab; in POSIX tools such as grep, write [[:blank:]] instead, because \t is not a tab inside their brackets

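The search-and-replace can be done with sed, which the slides introduced earlier. This is a sketch assuming a sed that accepts -E for extended syntax (GNU and BSD sed do); it uses [[:blank:]] in place of [ \t], and the sample string is made up.

```shell
# A line with leading spaces and a tab, and trailing space, tab, space.
raw=$(printf '  \tserver01 online \t \n')

# First expression removes leading blanks, second removes trailing blanks.
trimmed=$(printf '%s\n' "$raw" | sed -E 's/^[[:blank:]]+//; s/[[:blank:]]+$//')

printf 'before: [%s]\nafter:  [%s]\n' "$raw" "$trimmed"
```

The space inside the text is kept, because each pattern is anchored to one end of the line.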
Match IP addresses
A simplified version is \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b
But that will also accept illegal IP addresses with numbers above 255; to fix that use
–\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
–Ok, matching numbers is tough in a text world
\d (a digit) and \b (a word boundary) are Perl-style shorthand; with grep -E, write [0-9] in place of \d

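A sketch of the stricter pattern with grep -E follows. The function name is_ipv4 is hypothetical, and the word boundaries \b are replaced by the -x option, so the function validates a whole string rather than finding addresses inside longer text.

```shell
# One octet: 250-255, 200-249, or 0-199 (with optional leading digits).
octet='(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'

# Succeeds (exit status 0) only if the whole argument is four octets joined by dots.
is_ipv4() {
    printf '%s\n' "$1" | grep -Eqx "$octet\.$octet\.$octet\.$octet"
}

for candidate in '192.168.1.1' '255.255.255.0' '256.1.1.1' '1.2.3'; do
    if is_ipv4 "$candidate"; then
        printf '%s looks valid\n' "$candidate"
    else
        printf '%s rejected\n' "$candidate"
    fi
done
```
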
Numbers are challenging
To match a real (floating point) number
–[-+]?[0-9]*\.?[0-9]+
But if you might need exponential notation
–[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?

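The second pattern can be wrapped in a small check. In this sketch is_number is a hypothetical helper name, and -x makes the pattern cover the whole string.

```shell
# Succeeds only if the whole argument is a number, optionally with an exponent.
is_number() {
    printf '%s\n' "$1" | grep -Eqx '[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?'
}

for candidate in '3.14' '-.5' '1e10' '6.02E+23' '1.' '1e' 'abc'; do
    if is_number "$candidate"; then
        printf '%s is a number\n' "$candidate"
    else
        printf '%s is not a number\n' "$candidate"
    fi
done
```

Because the pattern requires at least one digit after an optional decimal point, a string such as 1. is rejected.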
Validate addresses
If you get a string and want to see if it’s an address, you could try
–What assumption is made here about case?

Validate a date
(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])
Matches a date in yyyy-mm-dd format with a year from 1900 through 2099; the separator may be a dash, space, slash or dot

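The date pattern can be exercised with grep -E by writing [0-9] for \d. In this sketch is_date is a hypothetical helper name and the sample dates are made up.

```shell
# Succeeds only if the whole argument fits the yyyy-mm-dd pattern.
is_date() {
    printf '%s\n' "$1" |
        grep -Eqx '(19|20)[0-9][0-9][- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])'
}

for candidate in '2011-02-28' '1999/12/31' '1899-01-01' '2011-13-01' '2011-02-31'; do
    if is_date "$candidate"; then
        printf '%s accepted\n' "$candidate"
    else
        printf '%s rejected\n' "$candidate"
    fi
done
```

The pattern checks only the shape of the date: 2011-02-31 is accepted even though February has no 31st day.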
Validate credit cards
To validate a credit card number, you need to know each card’s format, and first strip out spaces & dashes
Visa: ^4[0-9]{12}(?:[0-9]{3})?$
–All Visa card numbers start with a 4; new cards have 16 digits, old cards have 13
–(?: ) is a Perl-style non-capturing group; with grep -E use plain parentheses
MasterCard: ^5[1-5][0-9]{14}$
–All MasterCard numbers start with the numbers 51 through 55; all have 16 digits

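Both steps, stripping separators and then matching the format, can be sketched as one function. The name card_type is hypothetical, the numbers are dummy values, and the check covers only the digit pattern, not whether the card really exists.

```shell
# Prints "visa", "mastercard" or "unknown" for the given card number.
card_type() {
    # Step 1: remove spaces and dashes.
    digits=$(printf '%s' "$1" | tr -d ' -')

    # Step 2: compare against each format; -x anchors the match like ^...$.
    if printf '%s\n' "$digits" | grep -Eqx '4[0-9]{12}([0-9]{3})?'; then
        echo 'visa'
    elif printf '%s\n' "$digits" | grep -Eqx '5[1-5][0-9]{14}'; then
        echo 'mastercard'
    else
        echo 'unknown'
    fi
}

card_type '4111 1111 1111 1111'
card_type '5500-0000-0000-0004'
card_type '1234'
```
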
References
Regular-expressions.info
Grep man page unty/en/man1/grep.1posix.html
Lots of books are also available on regular expressions