Regular Expressions
A regular expression defines a pattern of characters to be found in a string.
Regular expressions are made up of:
– Literal characters to match in the string, like “abc”
– Metacharacters, which are characters that specify how to interpret a sequence of literal characters
Example:
– [abc]+[def]* finds any sequence of one or more of the letters a, b, c followed by any sequence of zero or more of the letters d, e, f; for instance abacabdddd, aaaaaab or adefdef, but not ddddeeefff (why not?)
Regular expressions are a powerful tool that a Linux user can use to search files for particular types of information.

Metacharacters

*, +
* matches the preceding character 0 or more times, so it can also match the empty string
+ matches the preceding character 1 or more times, so it cannot match the empty string
– 0* matches any number of 0s (including no 0s)
– 0*1* matches any number of 0s followed by any number of 1s; it will match the empty string but not 0000a1111
– will match , , but not , 0000a11111, (no 0s)
We can combine the use of * and + in one expression
– 0*1+
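The behaviour of * and + can be checked with a small sketch. It uses Python's re module as a stand-in for the Linux tools (an assumption; the slides themselves use grep), and the helper name `matches` and the sample strings are made up for illustration.

```python
import re

def matches(pattern, text):
    # fullmatch() succeeds only if the whole string fits the pattern,
    # which mirrors how the examples on this slide are read.
    return re.fullmatch(pattern, text) is not None

print(matches(r"0*1*", ""))           # True: zero 0s followed by zero 1s
print(matches(r"0*1*", "000111"))     # True
print(matches(r"0*1*", "0000a1111"))  # False: the 'a' is not allowed
print(matches(r"0*1+", "0001"))       # True
print(matches(r"0*1+", "000"))        # False: + needs at least one 1
```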

?, .
? matches the preceding character if it occurs 0 or 1 times
– With ?, we limit the number of occurrences
– 0?1? will match only the empty string, 0, 1 and 01
– 0?1+ will match strings such as 1, but not 001, 0 or the empty string
. (period) matches any single character
– b.t will match a ‘b’ followed by any one character followed by a ‘t’, such as bat, bet, bit, bot, but, bbt, bct, btt, bzt, b0t, b#t, etc.
We can use *, + and ? to modify the .
– b.+t will match any string that has a b followed by 1 or more of any characters followed by t, as in bat, baat, bbt, bcdet, b t, but will not match bt
– b.*t will match everything that b.+t matches and will also match bt
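A short sketch of ? and . in action, again assuming Python's re module in place of grep. The helper `whole` and the candidate lists are illustrative names, not part of the slides.

```python
import re

def whole(pattern, text):
    # True only when the entire string matches the pattern.
    return re.fullmatch(pattern, text) is not None

candidates = ["", "0", "1", "01", "001", "011"]
zero_or_one = [s for s in candidates if whole(r"0?1?", s)]
print(zero_or_one)   # ['', '0', '1', '01']

words = ["bat", "baat", "b t", "bt"]
dot_plus = [s for s in words if whole(r"b.+t", s)]
dot_star = [s for s in words if whole(r"b.*t", s)]
print(dot_plus)      # ['bat', 'baat', 'b t']
print(dot_star)      # ['bat', 'baat', 'b t', 'bt']
```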

[…]
Match any character that appears in the [ ]
– The list of characters in [ ] can be an enumerated list or a range
– [aeiou] – enumerated list
– [a-z] – range
– [b-df-hj-np-tv-z] – several ranges combined (here, the lower case consonants)
*, + and ? can modify the [ ]
– [a-z]+ will match any sequence of 1 or more lower case letters
– [A-Z][a-z]+ will match any sequence of an upper case letter followed by 1 or more lower case letters
To match tif/tiff, use:
– [tT][iI][fF][fF]?
but not
– [tT][iI][fF]+
– [tT][iI][fF][fF]*
since these would also match tifff
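The tif/tiff case can be tried out as follows. This is a minimal sketch in Python's re module (assumed here as a substitute for grep), and the list of sample names is invented for the demonstration.

```python
import re

names = ["tif", "TIFF", "Tiff", "tifff", "ti"]

strict = re.compile(r"[tT][iI][fF][fF]?")  # one f, optionally a second
loose = re.compile(r"[tT][iI][fF]+")       # any number of f's

strict_hits = [n for n in names if strict.fullmatch(n)]
loose_hits = [n for n in names if loose.fullmatch(n)]

print(strict_hits)  # ['tif', 'TIFF', 'Tiff']
print(loose_hits)   # ['tif', 'TIFF', 'Tiff', 'tifff']
```

The looser pattern accepts tifff, which is why the ? form is the better fit here.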

[[…]]
In some cases, the range or list of characters is already represented using POSIX classes
– POSIX (portable operating system interface) is a standard that has defined, among other things, these classes
Each class is denoted using [:classname:] and is placed inside [ ]
– [:alpha:], [:digit:]
– [:alnum:] – alphabetic character or digit
– [:upper:], [:lower:]
– [:punct:], [:cntrl:]
– [:space:] – white space (blank, tab, enter)
– [:print:] – any printable character (including the space)
[A-Z][A-Za-z]+ is the same as [[:upper:]][[:alpha:]]+

{n,m}
Match the preceding character between n and m times (n and m are integers where n ≤ m)
– {n} – exactly n times
– {n,} – at least n times
– {,m} – no more than m times (including 0)
{ } can modify [ ] and .
– [a-z]{3,4} – 3 or 4 lower case letters
– 0.{5}1 – 0 followed by any 5 characters followed by 1
A social security number
– [0-9]{3}-[0-9]{2}-[0-9]{4}
A phone number
– [0-9]{3}-[0-9]{4}
– ([0-9]{3}) [0-9]{3}-[0-9]{4}
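The counted-repetition patterns can be exercised with a sketch like the one below. It assumes Python's re module rather than grep; the function names `is_ssn` and `is_phone` and all sample numbers are hypothetical.

```python
import re

SSN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")
PHONE = re.compile(r"[0-9]{3}-[0-9]{4}")

def is_ssn(text):
    # The whole string must have the 3-2-4 digit layout.
    return SSN.fullmatch(text) is not None

def is_phone(text):
    # The whole string must have the 3-4 digit layout.
    return PHONE.fullmatch(text) is not None

print(is_ssn("123-45-6789"))   # True
print(is_ssn("123-456-789"))   # False: wrong group sizes
print(is_phone("555-1234"))    # True
print(is_phone("5551234"))     # False: the hyphen is required
print(re.fullmatch(r"[a-z]{3,4}", "abcde") is None)  # True: 5 letters is too many
```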

\
Recall that in the previous examples we used ( ) for an area code
– ( ) are reserved for another purpose
– We used . in our regular expression for IP addresses (but . can match any character)
\ preceding a metacharacter is used to “escape” the meaning of the metacharacter
– Without \, . matches any character, but \. matches only the period
– We would have to revise our previous example of an area code to read \([0-9]{3}\) so that we match the ( ) exactly
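The effect of escaping can be seen by comparing the two area-code patterns. The sketch assumes Python's re module (where, as in extended regex, unescaped parentheses form a group), and the phone number is made up.

```python
import re

unescaped = r"([0-9]{3}) [0-9]{3}-[0-9]{4}"   # ( ) act as a group
escaped = r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}"   # \( \) are literal parentheses

number = "(859) 555-1234"  # hypothetical phone number

print(re.fullmatch(unescaped, number) is not None)          # False
print(re.fullmatch(unescaped, "859 555-1234") is not None)  # True: no parentheses expected
print(re.fullmatch(escaped, number) is not None)            # True

# The same idea applies to the period.
print(re.fullmatch(r"1.5", "1x5") is not None)    # True: . matches any character
print(re.fullmatch(r"1\.5", "1x5") is not None)   # False: \. matches only a period
```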

[^…]
The ^ has two uses; here we focus on its use inside [ ]
– Inside [ ], we use ^ to indicate “do not match” or “match anything except”
– [^a] will match a character that is not “a”
– [^0-9]+ will match a sequence of 1 or more characters, none of which is a digit
The use of [^…] can be challenging though
– Assume we have the string abCDefg
– Unfortunately, the regex [^A-Z]+ will still match this string! Why?

Matching Substrings
A regular expression matches a substring of a string
– It will try to match any substring of the string, not necessarily the first substring or the entire string
Consider the regex 0{1,2}[a-zA-Z0-9]+
– This will match the string 0000abcd0000 because the substring 0abcd appears in the string and matches the regex (in fact, even the shorter substring 0a matches)
Returning to the previous slide
– abCDefg contains the substring “a”, which matches the expression [^A-Z]+ (at least one character that is not an upper case letter)
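Substring matching can be observed directly with Python's re.search, used here as an assumed stand-in for the way grep scans a line. Note that Python reports the leftmost, longest-by-greediness match, so the substring it returns may be longer than the minimal one discussed above.

```python
import re

# search() looks for a matching substring anywhere in the string.
m = re.search(r"0{1,2}[a-zA-Z0-9]+", "0000abcd0000")
print(m.group())   # 0000abcd0000 (greedy: 00, then everything that follows)

# [^A-Z]+ finds a run of non-upper-case characters inside abCDefg.
n = re.search(r"[^A-Z]+", "abCDefg")
print(n.group())   # ab

# Requiring the whole string to match gives a different answer.
print(re.fullmatch(r"[^A-Z]+", "abCDefg") is None)   # True
```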

^ and $
We will return to the use of [^…] in a bit
What if we want to match a substring of a string such that it begins or ends the string?
– The ^ (outside of [ ]) indicates that the regex will only match a substring if the match starts at the beginning of the string
– The $ indicates that the regex will match only at the end of the string
– Using both ^ and $ means that the regex will only match the entire string (not substrings)
For instance, ^[0-9]+$ will match any string that contains only digits

Examples
^[A-Z][a-z]+ [0-9]{1,2}, [12][0-9]{3}
– Match any string that starts with a date, as in March 21, 2004
[A-Z]{2} [0-9]{5}$
– Match any string that ends with 2 upper case letters, a space, and 5 digits (the end of an address)
– Note that this does not ensure that the 2 letter state abbreviation is a legal state; it could for instance match AB or ZZ
^[A-Z][a-z]* [A-Z]\. [A-Z][a-z]+$
– Match any string that consists entirely of a capitalized word, an initial and a capitalized word (presumably a person’s full name with middle initial)
^$
– Match the empty string
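Two of these anchored patterns can be tried on sample text. The sketch assumes Python's re module in place of grep, and the addresses and names are invented for illustration.

```python
import re

ENDS_WITH_STATE_ZIP = re.compile(r"[A-Z]{2} [0-9]{5}$")
FULL_NAME = re.compile(r"^[A-Z][a-z]* [A-Z]\. [A-Z][a-z]+$")

def ends_with_state_zip(text):
    return ENDS_WITH_STATE_ZIP.search(text) is not None

def is_full_name(text):
    return FULL_NAME.search(text) is not None

print(ends_with_state_zip("Highland Heights, KY 41099"))  # True
print(ends_with_state_zip("Nowhere, ZZ 00000"))           # True: the state is not validated
print(ends_with_state_zip("KY 41099 is the zip"))         # False: not at the end
print(is_full_name("Mary J. Smith"))                      # True
print(is_full_name("Mary Smith"))                         # False: middle initial missing
print(re.search(r"^$", "") is not None)                   # True: the empty string
```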

Using [^…]
To make sure that a string contains no digits
– We could use ^[^0-9]+$, which matches any non-empty string as long as there is no digit anywhere in the string
– Without the use of ^ and $ it is hard to control the [^…]
– Notice that without the + (^[^0-9]$), we are saying “match a string that starts with a non-digit and then ends”, that is, a string of 1 character which is not a digit
– ^[^0-9] – does not start with a digit
– [^A-Z]{2}$ – ends with 2 characters that are not upper case letters (so it does not end with a 2 letter abbreviation)
– ^[^$]+$ – does not contain a dollar sign
– Notice that when used in [ ], the metacharacter being evaluated ($ in this case) does not need to be preceded by \
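The interaction between [^…] and the anchors is easy to get wrong, so a quick check may help. This sketch assumes Python's re module instead of grep; `no_digits` and the sample strings are hypothetical.

```python
import re

def no_digits(text):
    # Anchored at both ends: every character must be a non-digit.
    return re.search(r"^[^0-9]+$", text) is not None

print(no_digits("hello world"))  # True
print(no_digits("room 101"))     # False
print(no_digits(""))             # False: + needs at least one character

# Without anchors, the pattern is satisfied by the substring "room ".
print(re.search(r"[^0-9]+", "room 101") is not None)   # True

# Without the +, only a single non-digit character is accepted.
print(re.search(r"^[^0-9]$", "ab") is not None)        # False

# Inside [ ], $ is an ordinary character and needs no backslash.
print(re.search(r"^[^$]+$", "cost: $5") is not None)   # False
```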

( )
To apply a metacharacter to a group of characters (rather than just the preceding character), place the group in ( )
Example: match a list of words
– A word will be any lower case letters followed by a space
– A word will be [a-z]+ followed by a space
A list of words would not be: [a-z]+ +
– The second + would apply to only the space, not the entire regex
We will use
– ([a-z]+ )+
The second + applies to the entire group of characters ([a-z]+ and the space)
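The difference between the grouped and ungrouped forms shows up as soon as there is more than one word. A minimal sketch, assuming Python's re module and a made-up sample string:

```python
import re

ungrouped = r"[a-z]+ +"    # one word followed by one or more spaces
grouped = r"([a-z]+ )+"    # one or more (word, space) pairs

text = "the quick brown "  # three words, each followed by a space

print(re.fullmatch(ungrouped, text) is not None)      # False
print(re.fullmatch(grouped, text) is not None)        # True
print(re.fullmatch(ungrouped, "the   ") is not None)  # True: + repeats only the space
```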

| for OR
We use […] to match any single character in a list of characters
– What if we want to match any one of several sequences of characters?
– Use | to separate each sequence
For instance, we want to match any of IN, KY or OH
– [IKO][NYH] does not do this because it would also match IY, IH, KN, KH, ON and OY
Use IN|KY|OH
– Or use (IN|KY|OH), which is preferred
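The sketch below compares the character-class attempt with alternation, and hints at one reason the parenthesized form tends to be safer once anchors are involved. It assumes Python's re module; the list of codes is invented.

```python
import re

codes = ["IN", "KY", "OH", "IY", "KH", "ON"]

by_class = [c for c in codes if re.fullmatch(r"[IKO][NYH]", c)]
by_alternation = [c for c in codes if re.fullmatch(r"(IN|KY|OH)", c)]

print(by_class)        # ['IN', 'KY', 'OH', 'IY', 'KH', 'ON']
print(by_alternation)  # ['IN', 'KY', 'OH']

# Without parentheses, ^ and $ attach only to the first and last alternatives.
print(re.search(r"^IN|KY|OH$", "xKYx") is not None)    # True: plain KY matches anywhere
print(re.search(r"^(IN|KY|OH)$", "xKYx") is not None)  # False
```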

Examples
Phone numbers with and without area codes
– \([0-9]{3}\) [0-9]{3}-[0-9]{4} | [0-9]{3}-[0-9]{4}
– Note: the blank spaces around the | should not be there; they are shown here only to make the regex readable
5 and 9 digit zip codes
– [0-9]{5} | [0-9]{5}-[0-9]{4}
A name with and without a middle initial
– [A-Z][a-z]* [A-Z]\. [A-Z][a-z]+ | [A-Z][a-z]* [A-Z][a-z]+
IP address: [0-255].[0-255].[0-255].[0-255]
– What’s wrong with this?
How about:
– [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}

IP Addresses
[0-9]{1,2} covers 0-99
Now we need to also cover 100-255
Note that [0-9]{1,2} is not correct either, because we would not normally use 00 or 09, but just 0 or 9
How can we fix this?
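One possible answer to the octet question is to spell out the numeric ranges with alternation. The pattern below is a common approach rather than necessarily the one the slides had in mind, and it is sketched in Python's re module (an assumption); `OCTET`, `IPV4` and `is_ipv4` are illustrative names.

```python
import re

# 250-255 | 200-249 | 100-199 | 0-99 without a leading zero
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
IPV4 = re.compile(OCTET + r"(\." + OCTET + r"){3}")

def is_ipv4(text):
    # fullmatch() plays the role of ^...$ here.
    return IPV4.fullmatch(text) is not None

print(is_ipv4("192.168.0.1"))      # True
print(is_ipv4("255.255.255.255"))  # True
print(is_ipv4("256.1.1.1"))        # False: 256 is out of range
print(is_ipv4("09.1.1.1"))         # False: leading zero
print(is_ipv4("1.2.3"))            # False: only three octets
```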

Spam Filters
One common use of regex is to build spam filters that search not just for keywords, but for variations
– Consider that we want a regex to spot “viagra”, but clever spammers will try to hide the word by using non-standard characters or by altering the spelling
– v!agra v_i_a_g_r_a
We might try any number of regexes to spot this
– would catch the first two but not the third
– would catch all three
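As a rough illustration of the idea, the sketch below allows filler characters between the letters and a few look-alike substitutions for the letter i. The pattern is hypothetical and is not the regex from the slides; it assumes Python's re module, and `looks_like_spam` is a made-up name.

```python
import re

# Hypothetical pattern: any run of non-letters may sit between the letters,
# and 'i' may be replaced by a few look-alike characters.
GAP = r"[^a-z]*"
SPAM_WORD = re.compile(
    "v" + GAP + "[i1!|]" + GAP + "a" + GAP + "g" + GAP + "r" + GAP + "a"
)

def looks_like_spam(text):
    # Lower-case first so the pattern only has to deal with a-z.
    return SPAM_WORD.search(text.lower()) is not None

print(looks_like_spam("viagra"))           # True
print(looks_like_spam("Buy V!AGRA now"))   # True
print(looks_like_spam("v_i_a_g_r_a"))      # True
print(looks_like_spam("vinegar"))          # False
```

A real filter would need a much broader set of substitutions; this only shows how character classes and * combine to absorb simple disguises.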

Wildcards in Linux
Recall from chapter 9 that we use *, ?, [ ] as wildcards when specifying filenames
– The bash interpreter performs filename expansion by attempting to match all files in the current directory to the name listed
– This process is referred to as globbing
– But we saw that *, ?, and [ ] are also used in regex
– This is confusing! We have to differentiate when we use these characters in commands such as ls, rm, mv, etc. from when we use them in regular expressions
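The two meanings of * and ? can be put side by side. This sketch uses Python's fnmatch module to imitate shell-style wildcards and the re module for regex (both are stand-ins for bash and grep, which is an assumption); the file names are invented.

```python
import fnmatch
import re

names = ["a.txt", "ab.txt", "b.txt", "notes"]

# As a wildcard, a* means "a followed by anything".
glob_star = fnmatch.filter(names, "a*")
# As a regex, a* means "zero or more a's"; no file name consists only of a's.
regex_star = [n for n in names if re.fullmatch(r"a*", n)]

# As a wildcard, ? stands for exactly one character.
glob_q = fnmatch.filter(names, "?.txt")
# As a regex, ? makes the preceding character optional.
regex_q = [n for n in names if re.fullmatch(r"a?\.txt", n)]

print(glob_star)   # ['a.txt', 'ab.txt']
print(regex_star)  # []
print(glob_q)      # ['a.txt', 'b.txt']
print(regex_q)     # ['a.txt']
```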

Examples The contents of our current directory are :

grep
The most common usage of regex in Linux is through the program grep
– grep stands for global regular expression print
Usage: grep pattern file(s)
– grep will return every line in the file(s) listed that contains a substring that matches the pattern
Very useful for finding content of file(s) that you are interested in
– e.g., searching all files in a directory for IP addresses
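The line-by-line behaviour described above can be imitated in a few lines. This is only a rough Python model of what grep does, not its implementation; the function name `grep` and the sample lines are made up.

```python
import re

def grep(pattern, lines):
    """Return the lines that contain a substring matching pattern."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

sample = [
    "server listens on 10.0.0.5",
    "no address on this line",
    "gateway 192.168.1.1 is up",
]

hits = grep(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}", sample)
for line in hits:
    print(line)
# server listens on 10.0.0.5
# gateway 192.168.1.1 is up
```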

Applying grep When the IP address pattern is used in grep for all files in /etc, we get the following (partial) output

Continued
You might notice that grep matches lines that contain “mutt ” thinking this is an IP address when it is actually a version name
– Our regex was not specific enough although in reality could be an IP address
We also see the entry “Binary file /etc/prelink.cache matches”, indicating that there was a match of our pattern in a binary file
– We generally want to ignore binary files, since we cannot view their contents
The output also tells us the file(s) that matched
– We can add options that eliminate file names or include the line number(s) that matched

Useful grep Options

More on grep
By default, grep uses only the basic regular expression set, in which some metacharacters such as { } and ( ) are treated as literal characters unless preceded by \
To use the full set of metacharacters, use the extended version of grep, either:
– egrep
– grep -E
Also, be aware that if you try the IP address search on /etc as a normal user, you will get some “permission denied” errors since you do not have read access to all of /etc

Piping to grep/egrep
Imagine that you want to find all files whose permissions start with rwx; you could not do
– ls ^rwx
– because ls applies wildcards, not regex
But you could do this
– ls -l | grep '^.rwx' (the . skips the file type character that begins each line of ls -l output)
Similarly, you could pipe ps aux to grep
– ps aux | grep foxr – to find all processes owned by foxr
– ps aux | grep 0:00 – to find all processes that have used no CPU time