CST8177 Regular Expressions. What is a "Regular Expression"? The term “Regular Expression” is used to describe a pattern-matching technique that is used.

Slides:



Advertisements
Similar presentations
CST8177 awk. The awk program is not named after the sea-bird (that's auk), nor is it a cry from a parrot (awwwk!). It's the initials of the authors, Aho,
Advertisements

CST8177 sed The Stream Editor. The original editor for Unix was called ed, short for editor. By today's standards, ed was very primitive. Soon, sed was.
CSCI 330 T HE UNIX S YSTEM Regular Expressions. R EGULAR E XPRESSION A pattern of special characters used to match strings in a search Typically made.
Regular Expressions grep
7 Searching and Regular Expressions (Regex) Mauro Jaskelioff.
ISBN Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl –(on reserve.
CS 898N – Advanced World Wide Web Technologies Lecture 8: PERL Chin-Chih Chang
Chin-Chih Chang CS 497C – Introduction to UNIX Lecture 28: - Filters Using Regular Expressions – grep and sed Chin-Chih Chang
CS 497C – Introduction to UNIX Lecture 31: - Filters Using Regular Expressions – grep and sed Chin-Chih Chang
Quotes: single vs. double vs. grave accent % set day = date % echo day day % echo $day date % echo '$day' $day % echo "$day" date % echo `$day` Mon Jul.
Regular Expressions. u A regular expression is a pattern which matches some regular (predictable) text. u Regular expressions are used in many Unix utilities.
Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl Linux editors and commands (e.g.
Scripting Languages Chapter 8 More About Regular Expressions.
Unix Files, IO Plumbing and Filters The file system and pathnames Files with more than one link Shell wildcards Characters special to the shell Pipes and.
UNIX Filters.
Filters using Regular Expressions grep: Searching a Pattern.
Shell Script Examples.
Regular Expressions. String Matching The problem of finding a string that “looks kind of like …” is common  e.g. finding useful delimiters in a file,
Regular Expressions A regular expression defines a pattern of characters to be found in a string Regular expressions are made up of – Literal characters.
System Programming Regular Expressions Regular Expressions
Pattern matching with regular expressions A common file processing requirement is to match strings within the file to a standard form, e.g. address.
 Text Manipulation and Data Collection. General Programming Practice Find a string within a text Find a string ‘man’ from a ‘A successful man’
INFO 320 Server Technology I Week 7 Regular expressions 1INFO 320 week 7.
Unix Talk #2 (sed). 2 You have learned…  Regular expressions, grep, & egrep  grep & egrep are tools used to search for text in a file  AWK -- powerful.
Perl and Regular Expressions Regular Expressions are available as part of the programming languages Java, JScript, Visual Basic and VBScript, JavaScript,
UNIX Shell Script (1) Dr. Tran, Van Hoai Faculty of Computer Science and Engineering HCMC Uni. of Technology
Module 6 – Redirections, Pipes and Power Tools.. STDin 0 STDout 1 STDerr 2 Redirections.
(Stream Editor) By: Ross Mills.  Sed is an acronym for stream editor  Instead of altering the original file, sed is used to scan the input file line.
Agenda Regular Expressions (Appendix A in Text) –Definition / Purpose –Commands that Use Regular Expressions –Using Regular Expressions –Using the Replacement.
CSC 352– Unix Programming, Spring 2015 April 28 A few final commands.
Regular Expression Dr. Tran, Van Hoai Faculty of Computer Science and Engineering HCMC Uni. of Technology
I/O Redirection and Regular Expressions February 9 th, 2004 Class Meeting 4.
Regular Expression - Intro Patterns that define a set of strings (or, pieces of a string) Not wildcards (similar notion, but different thing) Used by utilities.
Review Please hand in your practicals and homework Regular Expressions with grep.
When you read a sentence, your mind breaks it into tokens—individual words and punctuation marks that convey meaning. Compilers also perform tokenization.
Regular Expressions for PHP Adding magic to your programming. Geoffrey Dunn
ECA 225 Applied Interactive Programming1 ECA 225 Applied Online Programming regular expressions.
Pattern Matching CSCI N321 – System and Network Administration.
Appendix A: Regular Expressions It’s All Greek to Me.
GREP. Whats Grep? Grep is a popular unix program that supports a special programming language for doing regular expressions The grammar in use for software.
Introduction to sed. Sed : a “S tream ED itor ” What is Sed ?  A “non-interactive” text editor that is called from the unix command line.  Input text.
I/O Redirection & Regular Expressions CS 2204 Class meeting 4 *Notes by Doug Bowman and other members of the CS faculty at Virginia Tech. Copyright
Perl Day 4. Fuzzy Matches We know about eq and ne, but they only match things exactly We know about eq and ne, but they only match things exactly –Sometimes.
Copyright © Curt Hill Regular Expressions Providing a Search Pattern.
Unix Programming Environment Part 3-4 Regular Expression and Pattern Matching Prepared by Xu Zhenya( Draft – Xu Zhenya(
Regular Expressions CS 2204 Class meeting 6 Created by Doug Bowman, 2001 Modified by Mir Farooq Ali, 2002.
1 Lecture 9 Shell Programming – Command substitution Regular expressions and grep Use of exit, for loop and expr commands COP 3353 Introduction to UNIX.
CIT 383: Administrative ScriptingSlide #1 CIT 383: Administrative Scripting Regular Expressions.
UNIX Commands RTFM: grep(1), egrep(1) & fgrep(1) Gilbert Detillieux April 13, 2010 MUUG Meeting.
CSE 374 Programming Concepts & Tools Hal Perkins Fall 2015 Lecture 6 – sed, command-line tools wrapup.
What is grep ?  % man grep  DESCRIPTION  The grep utility searches text files for a pattern and prints all lines that contain that pattern. It uses.
ORAFACT Text Processing. ORAFACT Searching Inside Files grep - searches for patterns within files grep [options] [[-e] pattern] filename [...] -n shows.
-Joseph Beberman *Some slides are inspired by a PowerPoint presentation used by professor Seikyung Jung, which was derived from Charlie Wiseman.
Pattern Matching: Simple Patterns. Introduction Programmers often need to scan a file, directory, etc. for a specific substring. –Find all files that.
CSC 352– Unix Programming, Fall 2011 November 8, 2011, Week 11, a useful subset of regular expressions, grep and sed, parts of Chapter 11.
Filters and Utilities. Notes: This is a simple overview of the filtering capability Some of these commands are very powerful ▫Only showing some of the.
PROGRAMMING THE BASH SHELL PART III by İlker Korkmaz and Kaya Oğuz
Regular Expressions Copyright Doug Maxwell (
CSE 374 Programming Concepts & Tools
CSC 352– Unix Programming, Spring 2016
Looking for Patterns - Finding them with Regular Expressions
CST8177 sed The Stream Editor.
BASIC AND EXTENDED REGULAR EXPRESSIONS
Lecture 9 Shell Programming – Command substitution
CSC 352– Unix Programming, Fall 2012
Folks Carelli, Instructor Kutztown University
Unix Talk #2 grep/egrep/fgrep (maybe add more to this one….)
Unix Talk #2 (sed).
CSCI The UNIX System Regular Expressions
Presentation transcript:

CST8177 Regular Expressions

What is a "Regular Expression"? The term “Regular Expression” is used to describe a pattern-matching technique that is used into many different environments. A regular expression (commonly called regex, reg exp, or RE, often pronounced rej-exp or rej-ex) can use a simple set of characters with special meanings (called metacharacters) to test for matches quickly and easily.

Regular Expressions (RE) At its most basic, a regex pattern is a sequence of characters that matches the item being compared: Pattern:flower Match:flower And nothing else! A key thing to remember is that a Regular Expression will try to match the first and the longest string of characters that match the pattern. This will sometimes give you a surprising result, one you didn’t expect!

Once Again! A Regular Expression will try to match the first and the longest string of characters that match the pattern. Sometimes this will surprise you.

You may see Regular Expressions with forward slashes around them: /flower/ These slashes are a form of quoting, are not part of the Regular Expression, and may be omitted in most cases (some commands may require them, though, or something similar). Do not confuse Regular Expressions with filespec globbing. Even though some of the forms appear similar, they are not the same thing at all: /a*/ matches any number of a's (even 0) ls a* matches all files in the PWD that begin with a single a (at least 1) And watch out for Extended Regular Expressions, or subsets for special purposes, PCRE, different languages, and other confusions.

Using Regular Expressions Q:So what if we want to match either ‘flower’ or ‘Flower’? A:One way is to provide a simple choice: Pattern:[Ff]lower Match:flower or Flower Unfortunately for the confusion factor, this closely resembles [ ] in filespecs. Remember that it's different, however.

Using Regular Expressions Q:So the [square brackets] indicate that either character may be found in that position? A:Even better, any single character listed in [ ] will match, or any sequence in a valid ASCII range like 0-9 or A-Z. Just like file globs, unfortunately.

Using Regular Expressions Q:Does that mean I can match (say) product codes? A:You bet. Suppose a valid product code is 2 letters, 3 numbers, and another letter, all upper-case, something like this (C for a character, 9 for a number): CC999C

Using Regular Expressions Pattern segments: [A-Z][A-Z] Two Letters (Uppercase) [0-9][0-9][0-9] Three Numbers [A-Z] One Letter (Uppercase) Giving a Pattern: [A-Z][A-Z][0-9][0-9][0-9][A-Z] Good match: BX120R Bad match: BX1204

Using Regular Expressions Q:Good grief! Is there an easier way? A:There are usually several ways to write a Regular Expression, all of them correct. Of course, some ways will be more correct than others (see also Animal Farm by George Orwell) It's always possible to write them incorrectly in even more ways!

Matching Regular Expressions The program used for Regular Expression searches is often some form of grep: grep itself, egrep, even fgrep (fixed strings) and rgrep (recursive), which are also grep options, etc. The general form is: grep [options] regex [filename list] You will also see regexes in sed, awk, vi, and less among other places

General Linux Commands This is indeed the general form for all Linux commands, although there are (as always) some exceptions: command [flags & keywords] [filename list] That is, the command name (which might include a path) is followed by an optional (usually) set of command modifiers (usually) in any order or combination, and ends with an optional (often) list of filenames (or filespecs) In a pipe chain (cmd1 files | cmd2 | cmd3) of filters, the filename list is commonly found only on the first command.

Matching Regular Expressions grep is a filter, and can easily be used with stdin: echo | grep [options] Some useful options (there are more) include: -ccount occurrences only -Eextended patterns (egrep) -iignore case distinctions -ninclude line numbers -qquiet; no output to stdout or stderr -rrecursive; search all subfolders -vselect only lines not matching -w match "words" only (be careful: what constitutes a word?)

Regular Expression Examples Count the number of "robert"s or "Robert"s (or any case combination) in the password file: grep –ic robert /etc/passwd List all lines with "Robert" in all files with "name" as part of the file name, showing the line numbers of the matches in front of each matching line: grep -n "Robert" *name*

Metacharacters.Any single character except newline […]Any character in the list [^…]Any character not in the list *Zero or more of the preceding item ^Start of the string or line $End of the string or line \<Start of word boundary \>End of word boundary \(…\)Form a group of items for tags \nTag number n

\{n,m\}n to m of preceding item (plus others) \The following character is unchanged, or escaped. Note its use in [a\-z], changing it from a to z into a, -, or z. Metacharacters Note that repeating items can also use the forms \{n\}for exactly n items; and \{n,\}for at least n items. Ranges must be in ascending collating sequence. That is [a- z] and [0-9] are valid but [9-0] is not. Note that you may have to set the correct locale. We will use LOCALE=C for our collating sequence.

Extended metacharacters (used, for example, with egrep) +One or more of the preceding item ?None or one of the preceding item |Separates a list of choices (logical OR) (…)Form a group of items for lists or tags \nTag number n {n,m}Between n and m of preceding item Many of the extended metacharacters also exist in regex- intensive languages like Perl (see PCRE). Be sure to check your environment and tools before using any unusual extended expressions.

Tags sed is the stream editor, handy for mass modifications of a file. Tags are often used in sed to keep some part of the regex being searched for in the result. Imagine you have a file of the part numbers above and you need to replace an extra digit with the letter 'x'. The regex for this kind of bad entry is [A-Z]\{2\}[0-9]\{3\}[0-9] so the full command is sed 's+\([A-Z]\{2\}[0-9]\{3\}\)[0-9]+\1X+' \ data > data.1 The 's' operator for sed means substitute, but there are more available. See man sed for details.

Tags Tags in grep may not be as obvious to use. However, here is an example. echo abc123abc | grep "\(abc\)123\1“ Think of tags as the STR or M+ and RCL keys on your calculator STR/M+ as a regex \(...\) \(...\) RCL as a regex \1 \2 You can have up to 9 "memories" or tags in any one regex.

Tags Tags can be used with grep and its variants, but they are often used with tools like sed: sed 's/\([0-9][0-9]*\)/\1\.0/g' \ raw.grades > float.grades will insert.0 after every string of digits in the raw.grades file. There are, as usual, other ways to do this in sed, including: sed 's/[0-9][0-9]*/&\.0/g' \ raw.grades > float.grades I recommend you use the first style for now.

Note on sed sed 's/\([0-9][0-9]*\)/\1\.0/g' raw.grades Note that sed's delimiter is the character that immediately follows the command option s; it could be any character that doesn't appear in the rest of the operand, such as sed 'sX\([0-9][0-9]*\)X&\.0Xg' raw.grades or as first used in the example sed 's+\([0-9][0-9]*\)+&\.0xg' raw.grades The g after the last delimiter is for global, to examine all matches in each line; otherwise, only the first match is used.

[:alnum:]a – z, A - Z, and [:alpha:]a - z and A - Z [:cntrl:]control characters (0x00 - 0x1F) [:digit:]0 - 9 [:graph:]Non-blanks (0x21 - 0x7E) [:lower:]a - z [:print:][:graph:] plus [:space:] [:punct:]Punctuation characters [:space:]White space (newline, space, tab) [:upper:]A - Z [:xdigit:]Hex digits: 0 - 9, a - f, and A - F Bracketed Classes These POSIX classes are often enclosed in [ ] again. Check for a char at the end of a line: /[[:print:]]$/

The previous items are part of the extended set, not the basic set. A few more in the extended set that can be useful: \w[0-9a-zA-Z] \W[^0-9a-zA-Z] \bAny word boundary: \

Examples Area code:([0-9]\{3\}) Spaces: * Exchange:[0-9]\{3\} Dash:- Number:[0-9]\{4\} Basic pattern for a phone number: (xxx) xxx-xxxx ^([0-9]\{3\})_*[0-9]\{3\}-[0-9]\{4\}$ (The underscore _ is used to represent a blank)

Another Example Extended pattern for an address: Personal ID:\w+ Host:\w{2,} Dot:\. TLD:[a-zA-Z]{2,4}

One Last Example Extended pattern for a web page URL (regex folded) (\w{2,}\.)+[a-zA-Z]{2,4}(/\w+)*/\w+\.html?$ Prefix: Host:(\w{2,}\.)+ TLD:[a-zA-Z]{2,4} Path:(/\w+)* Filename:/\w+ Dot:\. Extension:html?