Regular Expressions Copyright 2005-2007 Doug Maxwell (http://www.unixlore.net)

Regular Expressions
Regular expressions (regexes) allow you to describe and parse patterns in text. They are extremely useful as implemented in programming languages and other tools, including editors. Examples of such tools are grep, find, sed, awk, Perl, Python, Vim, and Emacs.

What They Can Do
They can help you search for complex text patterns in one or many files. This Emacs Lisp regex finds duplicate words:
    \\b\\([^ \n\t]+\\)[ \n\t]+\\1\\b
They can alter text on the fly. In Vim, this uppercases the first word in a sentence, if it was lowercase:
    s/\([.!?]\)\(\s\+\)\([a-z]\)/\1\2\u\3/g

Terminology I
Metacharacters vs. literals: metacharacters have special meaning, while literals just represent themselves. Examples of metacharacters include ^, $, ., *, +, ?
Quantifiers (how much of something): *, +, ?, {1,3}
Character class (matching any one of several): [a-zA-Z] or [^!.?]
Alternation: (a|b) (read the vertical bar as “or”)

Terminology II
Anchors: match a specific, fixed place in the text, such as ^ (beginning of line) and $ (end of line).
Escape: a backslash '\' can be used to remove the special meaning from a metacharacter, or add meaning to a literal. Examples: \s, \\, \*

Basics
The simplest regexes just specify literal text:
    egrep 'hack' resume
This finds and prints all the lines in the file resume that contain the text hack. Note that this will print lines containing hacker, hacking and shack.
The metacharacter . matches any single character, except a newline:
    egrep 'h.ck' resume
This will match both hick and shack, among others.
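The two searches above can be tried out with a minimal sketch in Python's re module. This is a stand-in for egrep, not egrep itself; the grep helper and the sample lines are made up for illustration.

```python
import re

# Hypothetical stand-in for the lines of the file "resume".
lines = [
    "I hack on kernels",
    "hacker for hire",
    "built a shack",
    "a hick town",
    "no match here",
]

def grep(pattern, lines):
    """Return the lines in which the pattern matches anywhere, as egrep would print them."""
    return [line for line in lines if re.search(pattern, line)]

# Literal text: matches anywhere in the line, so hacker and shack count too.
literal_hits = grep(r"hack", lines)

# The dot matches any single character, so hick matches as well.
dot_hits = grep(r"h.ck", lines)

print(literal_hits)
print(dot_hits)
```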

Quantifiers and Grouping I
We can specify the number of times a character or group of characters must match by using the quantifiers:
    *                    zero or more
    +                    one or more
    ?                    zero or one
    {m,n}, {m,} or {m}   m to n inclusive, at least m, or exactly m, respectively
    egrep 'hack*' resume
would print lines containing hac, hack, and hackk (any number of k's, including none).

Quantifiers and Grouping II
We can constrain the quantifier to a group of characters by using parentheses.
    egrep 'ha(ck)?' resume
prints lines containing ha followed by zero or one occurrences of ck, so ha and hack are the only two valid matches here.
    egrep 'h(ack)+' resume
would match an h, followed by one or more occurrences of ack (hack, hackack, etc).
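A small sketch of the quantifiers in Python's re module, offered as an illustration rather than a reproduction of egrep. It uses re.fullmatch so that the whole candidate string has to match, which isolates what each quantifier allows; the candidate words and the {1,3} pattern are made up.

```python
import re

def matches_exactly(pattern, text):
    """True if the pattern matches the entire string, not just a part of it."""
    return re.fullmatch(pattern, text) is not None

# * applies only to the k: zero or more k's after "hac".
star = [w for w in ["ha", "hac", "hack", "hackk"] if matches_exactly(r"hack*", w)]

# ? applies to the whole group (ck): zero or one occurrence.
optional_group = [w for w in ["ha", "hack", "hackck"] if matches_exactly(r"ha(ck)?", w)]

# + applies to the whole group (ack): one or more occurrences.
plus_group = [w for w in ["h", "hack", "hackack"] if matches_exactly(r"h(ack)+", w)]

# {1,3} allows between one and three c's, inclusive.
braces = [w for w in ["hak", "hack", "hacck", "haccck", "hacccck"]
          if matches_exactly(r"hac{1,3}k", w)]

print(star, optional_group, plus_group, braces)
```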

Anchors
Anchors match a specific position in the text, but don't consume a character. Two of the most commonly used anchors are ^ and $, for start and end of line, respectively. There are also \< and \>, for start and end of word.
    egrep '^hack' resume
now matches hack, but only at the start of a line.
    egrep 'hack$' resume
matches hack, but only at the end of a line.
    egrep '^hack$' resume
matches lines that consist of only the word hack.
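The anchored searches can be sketched in Python as follows. The sample lines are invented, and Python's re module has no start-of-word and end-of-word anchors, so \b (word boundary) is used here as an approximation of them.

```python
import re

# Hypothetical lines standing in for the file "resume".
lines = ["hack", "hack the planet", "life hack", "shack"]

starts = [l for l in lines if re.search(r"^hack", l)]      # hack at start of line
ends = [l for l in lines if re.search(r"hack$", l)]        # hack at end of line
whole = [l for l in lines if re.search(r"^hack$", l)]      # line is exactly hack
as_word = [l for l in lines if re.search(r"\bhack\b", l)]  # hack as a separate word

print(starts, ends, whole, as_word)
```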

Character Classes I
You can specify that one of several characters be matched by placing them in brackets.
    [?!.]    matches any one of ?, !, or .
Note the metacharacters ? and . have lost their special meaning inside the brackets.
    [^?!.]   matches any character except ?, !, or .
In this case, the ^ has a different meaning, logical not. It has that meaning only at the front of a character class; anywhere else inside the brackets it is just a literal.

Character Classes II
    [a-z]    The dash specifies a range of characters, so this matches a through z, lowercase.
    [-!?.]   Put the dash in front if you don't want it to mean “range”, and be just another literal.
You can quantify character classes, just like groupings:
    [a-z]*   matches zero or more lowercase letters
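A brief sketch of these character classes using Python's re.findall; the sample strings are invented. The last line uses + rather than * only so that findall does not report empty matches.

```python
import re

text = "Wait! Really? Yes. a-b"

# Inside brackets, ? and . are literals: this finds sentence-ending punctuation.
enders = re.findall(r"[?!.]", text)

# A leading ^ negates the class: runs of anything except ?, ! or .
not_enders = re.findall(r"[^?!.]+", "Hi! Ok?")

# A leading dash is a literal dash, not a range.
with_dash = re.findall(r"[-!?.]", text)

# A range, quantified like any group.
lower_runs = re.findall(r"[a-z]+", "abc DEF ghi")

print(enders, not_enders, with_dash, lower_runs)
```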

More Special Characters
    \w   matches a word character (alphanumeric and underscore)
    \s   matches a whitespace character
    \d   matches a digit character
    \b   matches a word boundary
These are all complemented by \W, \S, \D, and \B. Some tools (like grep, sed, awk, and Emacs) use \< and \> anchors for the start and end of a word, respectively.
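These shorthand classes behave as follows in Python's re module, shown here as a small sketch on a made-up string. Other tools may support only some of them.

```python
import re

text = "call 555-1234 now_ok"

words = re.findall(r"\w+", text)      # runs of letters, digits and underscore
digits = re.findall(r"\d+", text)     # runs of digits
spaces = re.findall(r"\s", text)      # individual whitespace characters
non_words = re.findall(r"\W", text)   # complement of \w

# \b is a boundary between a word character and a non-word character.
# The underscore is a word character, so "now" in "now_ok" is not a whole word.
whole_word = re.search(r"\bnow\b", text)

print(words, digits, spaces, non_words, whole_word)
```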

Greediness
One thing to be aware of: the quantifiers *, +, ? and {} will eat as much text as possible during a match. They are called “greedy” for this reason. Given the string “Just Another Perl Hacker”, the pattern /^J.*e/ matches Just Another Perl Hacke. We can make these quantifiers non-greedy in some implementations by adding a ?. So the pattern /^J.*?e/ now matches Just Anothe.
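The difference is easy to check in Python, which is one of the implementations that supports non-greedy quantifiers. This sketch assumes the intended patterns are ^J.*e and ^J.*?e, with a dot before the star.

```python
import re

text = "Just Another Perl Hacker"

# Greedy: .* grabs as much as it can, then backs off to the last e.
greedy = re.search(r"^J.*e", text).group(0)

# Non-greedy: .*? grabs as little as it can, stopping at the first e.
lazy = re.search(r"^J.*?e", text).group(0)

print(greedy)
print(lazy)
```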

Remembering Matches I
Enclosing portions of the pattern in parentheses will force the regex engine to “remember” the text actually matched, and store it for later use. “Later use” can mean later in the same pattern, or after the match is complete. You can have more than one parenthesized expression; they are stored in order. Later in the same pattern, use \1, \2, etc. After the regex has completed, use $1, $2, etc. (Perl).

Remembering Matches II
    s/(\d{3})-(\d{3})-(\d{4})/\2-\1-\3/
will swap phone number area code and exchange (first and second expressions).
    egrep -i '\<([a-z]+) +\1\>' resume
This will find all the doubled words in your resume. But it does this one line at a time, and so can't find doubled words that cross line boundaries. More sophisticated regex engines, like Perl's, can match across lines.
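Both uses of remembered matches can be sketched with Python's re module. The phone number and sentences are invented, \b stands in for the word anchors that Python lacks, and the cross-line variant is one possible way to do it, using \s+ so the gap between the words may include a newline.

```python
import re

# Backreferences in the replacement: swap the first and second groups.
swapped = re.sub(r"(\d{3})-(\d{3})-(\d{4})", r"\2-\1-\3", "Call 212-555-0187 today")

# Backreference inside the pattern: a word followed by spaces and the same word.
doubled = re.compile(r"\b([a-z]+) +\1\b", re.IGNORECASE)
same_line = doubled.search("This is is a test")

# Searching the whole text at once, with \s+ between the words,
# also finds a doubled word split across a line break.
doubled_any_space = re.compile(r"\b([a-z]+)\s+\1\b", re.IGNORECASE)
cross_line = doubled_any_space.search("end of the\nthe next line")

print(swapped)
print(same_line.group(0), "/", cross_line.group(1))
```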

Examples
Those are the basics of regular expressions. Let's see some real-world examples.

Perl Example
Look through a directory of report files and extract the report name. Assume that the filenames are of the form "author-title-date.pdf".
    find . -name "*.pdf" | perl -pe 's/.\/\w+-(\w+)-.*/$1/' | sort | uniq
find passes its results to perl, each with a leading ./
The -p argument loops over every line of input (from standard input, or from files named as arguments), runs the code on it, and prints the result ($_).
We can use $1 instead of \1 here because the pattern half of the s/// has already matched by the time the replacement is evaluated.
sort sorts the extracted titles alphabetically.
uniq removes duplicate lines.
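For comparison, here is a rough Python sketch of the same extraction. The paths are hypothetical stand-ins for what find would print, title_of is a made-up helper name, and the leading dot is escaped here, which the Perl one-liner does not do.

```python
import re

# Hypothetical find output for files named author-title-date.pdf.
paths = [
    "./smith-budget-2006.pdf",
    "./jones-audit-2007.pdf",
    "./smith-budget-2007.pdf",
]

def title_of(path):
    """Return the title part of ./author-title-date.pdf, or None if the name does not fit."""
    m = re.match(r"\./\w+-(\w+)-.*", path)
    return m.group(1) if m else None

# A set removes duplicates (like uniq); sorted orders them (like sort).
titles = sorted({t for t in map(title_of, paths) if t is not None})

print(titles)
```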

Yet Another Perl Example
In-place edit of shell scripts, changing all occurrences of doug@unixlore.net to dmaxwell@acm.org, making backup files as you go:
    perl -p -i.bak -e 's!\bdoug\@unixlore\.net\b!dmaxwell\@acm.org!g' *.sh
We can use almost anything as a substitution delimiter, in place of /. Here, we use !. Note that we escape the dot in the pattern so it matches only a literal dot, not any character. The @ signs are escaped as well, because Perl would otherwise try to interpolate @unixlore and @acm as arrays.
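A comparable edit can be sketched in Python. The function names are invented for illustration, re.escape takes care of the literal dot, and the backup is written next to the original with a .bak suffix, mirroring what -i.bak does in spirit only.

```python
import re
import shutil
from pathlib import Path

OLD, NEW = "doug@unixlore.net", "dmaxwell@acm.org"

# re.escape makes every character in OLD literal, including the dot.
PATTERN = re.compile(r"\b" + re.escape(OLD) + r"\b")

def replace_address(text):
    """Replace every occurrence of the old address in a string."""
    return PATTERN.sub(NEW, text)

def edit_in_place(path):
    """Copy path to path.bak, then rewrite path with the address replaced."""
    path = Path(path)
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_text(replace_address(path.read_text()))

print(replace_address("MAILTO=doug@unixlore.net"))
```

To cover a directory of scripts, one could call edit_in_place on each result of Path(".").glob("*.sh").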

Sed
    sed 's/^[ \t]*//'
Delete whitespace from the front of each line. Use it like this:
    cat file | sed 's/^[ \t]*//' > altered_file
OR
    sed 's/^[ \t]*//' < file > altered_file
Sed is a filter, and so by default will accept input on standard input, and output on standard output. It won't alter the input file in-place by default. Do not redirect the output back onto the input file: the shell truncates file when it sets up the redirection, so this will usually destroy its contents:
    cat file | sed 's/^[ \t]*//' > file
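The same substitution can be sketched in Python to see what it does to each line. The function name is made up; re.MULTILINE makes ^ match at the start of every line in the string, which approximates sed's line-by-line processing.

```python
import re

def strip_leading_whitespace(text):
    """Remove spaces and tabs from the front of every line in text."""
    return re.sub(r"^[ \t]*", "", text, flags=re.MULTILINE)

sample = "  two spaces\n\tone tab\nnone\n"
cleaned = strip_leading_whitespace(sample)

print(cleaned, end="")
```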

Awk
Awk is also a stdin-to-stdout filter, like sed. Awk deals well with columnar data.
    awk '/foo/{print $1,$3}' file
Prints the first and third fields of all lines in file that match the regex /foo/.
    awk '$2~/foo/{print $1,$3}' file
Prints the first and third fields of all lines in file whose second field matches /foo/.
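The difference between matching the whole line and matching one field can be sketched in Python. The helper awk_like and the sample rows are invented, and whitespace splitting only approximates awk's default field separation.

```python
import re

# Hypothetical whitespace-separated rows standing in for "file".
rows = ["alice foo 10", "bob bar 20", "foo baz 30"]

def awk_like(lines, pattern, field=None):
    """Print-style selection: return 'field1 field3' for lines where the pattern
    matches the whole line (field=None) or the given 1-based field."""
    out = []
    for line in lines:
        fields = line.split()
        if field is None:
            target = line
        else:
            target = fields[field - 1] if len(fields) >= field else ""
        if re.search(pattern, target):
            first = fields[0] if len(fields) > 0 else ""
            third = fields[2] if len(fields) > 2 else ""
            out.append(f"{first} {third}")
    return out

anywhere = awk_like(rows, r"foo")               # like /foo/{print $1,$3}
second_only = awk_like(rows, r"foo", field=2)   # like $2~/foo/{print $1,$3}

print(anywhere, second_only)
```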

Grep
Grep finds patterns in the lines of files passed as arguments. egrep is just grep -E, and handles “extended” regexes.
    egrep 'CRIT.+FW:' /var/log/messages
Prints all lines in /var/log/messages that are critical firewall entries.
    egrep -v -i 'crit.+fw:' /var/log/messages
Prints all lines in /var/log/messages that do not contain critical firewall entries. Case is ignored here with -i.
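The effect of -v and -i can be sketched in Python with a negated search and the re.IGNORECASE flag. The log lines are invented and only loosely resemble real syslog entries.

```python
import re

# Hypothetical log lines standing in for /var/log/messages.
log = [
    "Jan 1 host CRIT kernel FW: drop",
    "Jan 1 host crit kernel fw: drop",
    "Jan 1 host INFO sshd: login",
]

# Case-sensitive match: only the upper-case entry qualifies.
matching = [l for l in log if re.search(r"CRIT.+FW:", l)]

# Inverted, case-insensitive match: everything that is not a critical firewall entry.
inverted = [l for l in log if not re.search(r"crit.+fw:", l, re.IGNORECASE)]

print(matching)
print(inverted)
```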

Copyright & License
Copyright (c) 2005-2007 Doug Maxwell (http://www.unixlore.net). Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is at http://www.gnu.org/copyleft/fdl.html.