Chapter 11: Regular Expressions and Matching

The match operator has the following form.

    m/pattern/

A pattern can be an ordinary string or a generalized string containing metacharacters. The binding operator, =~, is used to "bind" the matching operator onto a string.

    "yesterday" =~ m/yes/

Here the pattern is an ordinary three-character string. The entire expression evaluates to a Boolean value, true (1) in this case since the pattern yes is a substring of "yesterday".

Since matching expressions result in Boolean values, they are usually used in a conditional.

    $str="yesterday";
    if($str =~ m/yes/) {
      print "The pattern yes was found in $str.\n";
    }

For demonstration, we will usually only show the matching expression. Example:

    $str="yesterday";
    $str =~ m/ester/   #true
    $str =~ m/Ester/   #false
    $str =~ m/yet/     #false
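
The three demonstration matches above can be checked with a short script. This is only a minimal sketch: the variable names $lower, $capital, and $gapped are assumptions introduced here for illustration and do not come from the slides.

```perl
use strict;
use warnings;

my $str = "yesterday";

# Each match is used as the condition of a ternary,
# so it is evaluated for its true/false value.
my $lower   = ($str =~ m/ester/) ? "true" : "false";  # "ester" is a substring
my $capital = ($str =~ m/Ester/) ? "true" : "false";  # matching is case sensitive
my $gapped  = ($str =~ m/yet/)   ? "true" : "false";  # y, e, t are not adjacent

print "ester: $lower\n";
print "Ester: $capital\n";
print "yet:   $gapped\n";
```

The last case is worth noting: a pattern made of ordinary characters only matches when those characters appear next to each other in the string.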

Some notes:

The !~ operator is the negated form of the binding operator. It returns true if the matching action does not find the pattern in the string. We will more often use =~.

    if($response !~ m/yes/){
      print "yes was not found in your response.\n";
    }

The matching operator can be simplified syntactically. For example, the following two expressions are equivalent.

    $str =~ m/yes/
    $str =~ /yes/

The match operator can be bound not only onto string literals and variables, but also onto expressions that evaluate to strings. Because =~ binds more tightly than the concatenation operator, such an expression needs parentheses.

    $str1="wilde";
    $str2="beest";
    ($str1.$str2) =~ /debe/   #true

Example: A server-side "platform sniff" done by matching against the HTTP_USER_AGENT environment variable. This example features the first pattern which is not merely a sequence of characters. The match

    $info =~ /(Unix|Linux)/

is true if either Unix or Linux is a substring of whatever is stored in the $info variable. See source file os.cgi.
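
Operator precedence matters when binding onto an expression: =~ binds more tightly than the . concatenation operator, so the concatenation has to be parenthesized for the match to see the joined string. The sketch below illustrates that, along with the alternation used in the platform sniff. The user-agent value stored in $info is a made-up sample, not taken from os.cgi.

```perl
use strict;
use warnings;

my $str1 = "wilde";
my $str2 = "beest";

# Parentheses force the concatenation to happen before the match,
# so the pattern is tested against "wildebeest".
my $joined = (($str1 . $str2) =~ /debe/) ? 1 : 0;

# Without parentheses, =~ would apply to $str2 alone,
# and "beest" does not contain "debe".
my $right_only = ($str2 =~ /debe/) ? 1 : 0;

# Alternation: true if either word occurs anywhere in the string.
my $info = "Mozilla/5.0 (X11; Linux x86_64)";   # hypothetical sample value
my $unix_like = ($info =~ /(Unix|Linux)/) ? 1 : 0;

print "joined: $joined, right side only: $right_only, unix-like: $unix_like\n";
```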

A regular expression is a set of rules that defines a generalized string. For simplicity we call regular expressions patterns. The syntax for a pattern is /pattern/. A pattern is like a double-quoted string in that variables are interpolated and escape sequences are interpreted. But a pattern is much more powerful than a string: it can contain wildcards, character classes, and quantifiers, to name just a few of the features that make patterns (regular expressions) much more general than ordinary strings.

Metacharacters

Characters which have special meaning in patterns are called metacharacters.

    [ ] ( ) { } | \ + ? . * ^ $

To be used literally inside a pattern, a metacharacter must be escaped with a backslash.

    if($sentence =~ m/\?/){
      print "Your sentence seems to be a question.\n";
    }

Normal characters

These include ordinary ASCII characters which are not metacharacters. Normal characters include letters, numbers, the underscore, and a few other characters such as % & = ; : which are not reserved metacharacters in patterns. Normal characters need not be escaped when testing for matches.

    if($sentence =~ m/;/){
      print "Your sentence seems to contain an independent clause.\n";
    }

Escaped characters

Escaping in patterns works just like escaping characters in ordinary strings. For example, \* stands for one *, and \( stands for one (. The following tests whether $str contains the three-character string "(b)".

    $str =~ /\(b\)/

Example values for $str which would yield true and false values in the above match:

    true: "(b)", "(a)(b)(c)"
    false: "(ab)", "( b )"

Escape sequences that stand for one character

Some escaped characters stand literally for only one character, like escaped metacharacters. Some stand for one invisible character, such as a whitespace character. Just like with ordinary strings \n stands for one newline character, and \t stands for one tab character. The following tests whether $str contains two consecutive newline characters.

    $str =~ /\n\n/

    true: "a\n\nb", "a\n\n\n\tb"
    false: "\na\n", "a\n \nb"

Escape sequences that stand for a class of characters

These represent only one character in a pattern, but that one character matches any character in the specified group.

    \d   any digit 0 through 9
    \D   any character that is not a digit
    \w   any alphanumeric character: letter, digit, underscore (w comes from word)
    \W   any character that is not alphanumeric (opposite of \w)
    \s   one whitespace character (blank space, tab, or newline)
    \S   one non-whitespace character (opposite of \s)

The following tests whether $str contains a four-character sequence that looks like a year in the 1900s.

    $str =~ /19\d\d/

    true: "1921"
    false: "191a"

The following tests whether $str contains a non-whitespace character (i.e., it is not the empty string or merely a sequence of whitespace characters).

    $str =~ /\S/

    true: "x", "()"
    false: "", " ", "\n"
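
The two class-based tests above can be applied to whole lists of candidate strings. The sketch below uses Perl's built-in grep to keep only the strings that match. The sample values "born in 1921" and "   \t" are made-up additions to the examples from the slide.

```perl
use strict;
use warnings;

# Candidates for the "looks like a year in the 1900s" test.
my @year_candidates = ("1921", "born in 1921", "191a");
my @year_hits = grep { $_ =~ /19\d\d/ } @year_candidates;

# Candidates for the "contains a non-whitespace character" test.
my @blank_candidates = ("x", "()", "", " ", "\n", "   \t");
my @non_blank = grep { $_ =~ /\S/ } @blank_candidates;

print "year-like: ", join(", ", @year_hits), "\n";
print "non-blank strings: ", scalar(@non_blank), "\n";
```

Note that /19\d\d/ only asks whether the four characters occur somewhere in the string, so a longer string containing a year also passes.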

Wildcard

A period (.) stands for any one character, except a newline. The following tests whether $str contains a three-character substring that is c and t with any one character in between, except a newline.

    $str =~ /c.t/

    true: "cat", "arc&tangent"
    false: "ct", "cart", "arc\ntangent"

Escape sequences that match locations

These escape sequences do not actually represent a character in a pattern. Rather, they represent locations within the string being matched.

    \A   beginning of string
    \Z   end of string or before a final newline character
    \z   end of string
    \b   word boundary
    \B   not a word boundary (for example, a location between two \w type characters)

The following tests whether $str begins with T.

    $str =~ /\AT/

    true: "Tom", "The beest"
    false: "tom", "AT&T"

The following tests whether $str begins with The.

    $str =~ /\AThe/

    true: "Thelma", "The beest"
    false: "That", "the beest"

The following tests whether $str contains the word cat but not as part of any bigger word.

    $str =~ /\bcat\b/

    true: "cat", "my cat"
    false: "cats", "concatenate"

Note: When matching locations, the escape sequence does not "use up" a character. That is, an expression such as $str =~ /ing\z/ only tests for the three character string ing at the end of $str.
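
The location escapes can be compared side by side. The sketch below reuses example strings from the slides and adds one made-up string, "matching\n", to show how \Z and \z differ when the string ends in a newline.

```perl
use strict;
use warnings;

# \A fixes the position only; it says nothing about what follows "The".
my $the_beest = ("The beest" =~ /\AThe/) ? 1 : 0;
my $thelma    = ("Thelma"    =~ /\AThe/) ? 1 : 0;

# \b requires a word boundary on both sides of "cat".
my $whole_word  = ("my cat"      =~ /\bcat\b/) ? 1 : 0;
my $inside_word = ("concatenate" =~ /\bcat\b/) ? 1 : 0;

# \Z tolerates one trailing newline; \z demands the very end of the string.
my $line   = "matching\n";   # hypothetical sample with a final newline
my $ends_Z = ($line =~ /ing\Z/) ? 1 : 0;
my $ends_z = ($line =~ /ing\z/) ? 1 : 0;

print "\\A: $the_beest $thelma\n";
print "\\b: $whole_word $inside_word\n";
print "\\Z: $ends_Z  \\z: $ends_z\n";
```

The difference between \Z and \z matters mostly for input read line by line, where each string usually still carries its newline.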

Character Classes

Square brackets [] in a pattern define a class. The whole class matches only one character, and only if the character belongs to the class. The following tests whether $str contains a three-character string beginning with one of r, b, or c, and followed by at.

    $str =~ /[rbc]at/

    true: "rat", "bat", "cat", "concatenate", "battery"
    false: "mat", "at"

The escape sequences \d, \w, and \s and their opposites can be used inside a class. A dash (-) can be used between two characters to denote a range of characters. For example, the class [\dA-F] stands for one character that is either a numeric digit or one of the upper case letters A-F. It is equivalent to [0123456789ABCDEF].

The following tests whether $str contains a two-digit hexadecimal number as formatted in query string encoding.

    $str =~ /%[\dA-F][\dA-F]/

    true: "%0A", "data=Hi,%0A%0Dmy name is..."
    false: "%0a", "%3"
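
To see which substrings the hexadecimal class actually picks out, the match can be run repeatedly over the string. The sketch below uses the g modifier in list context, which goes beyond what this section has introduced: with no capturing groups, it returns every matched substring. The variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $query = "data=Hi,%0A%0Dmy name is...";

# In list context, /g returns all non-overlapping matches of the pattern.
my @escapes = ($query =~ /%[\dA-F][\dA-F]/g);

# Lower-case hex digits are outside the class [\dA-F], so this fails.
my $lower_ok = ("%0a" =~ /%[\dA-F][\dA-F]/) ? 1 : 0;

print "found: @escapes\n";
print "lower-case accepted: $lower_ok\n";
```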

Alternatives

The | character serves like an or by creating alternatives. The following tests whether $str contains any of the three patterns.

    $str =~ /cat|dog|ferret/

    true: "cat", "dog", "ferret", "my cat", "cats and dogs", "doggedly"
    false: "hamster", "dodge the cart"

The alternatives are tested from left to right. The alternatives themselves can be more complicated patterns.

Grouping and Capturing

Parentheses () are used for grouping in patterns. The following tests whether $str contains one of the three alternatives, then a single space, then food.

    $str =~ /(cat|dog|ferret) food/

    true: "cat food", "dog food", "ferret food", "I like cat food and dog food"
    false: "cats food", "rat food", "dogfood"

With several alternatives, it is often desirable to capture which of the alternatives caused the successful match. That is, a mere truth value indicating a match doesn't indicate which match actually occurred.

The special, built-in variables $1, $2, $3, … automatically capture the text matched by each parenthesized group. For a group of alternatives, that is the alternative that provided the successful match.

    $str = "Do you have ferret food?";
    $str =~ /(cat|dog|ferret) food/

Here, $1 is assigned the value "ferret" since that alternative provides the match. The other capture variables are undefined because the pattern has only one group. If more than one match is present, only the left-most match is recorded, since the string is scanned from left to right.

    $str = "Do you have dog food or ferret food?";
    $str =~ /(cat|dog|ferret) food/

Here, "dog" is assigned to $1. The second match is not recorded, and $2 stays undefined because the pattern has only one group.

Multiple groups can populate more of the special variables.

    $str = "Purina cat chow";
    $str =~ /(cat|dog|ferret) (food|chow)/

$1 is assigned the value "cat" and $2 is assigned the value "chow". Captured matches are assigned into the special variables starting from the left-most grouping of alternatives. Groups can be collected into a larger group.

    $str = "Purina cat chow";
    $str =~ /((cat|dog|ferret) (food|chow))/

$1 is assigned "cat chow", $2 is assigned "cat", and $3 is assigned "chow". The left-most behavior is still observed.
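
The numbering of nested groups follows the order of their opening parentheses, counted from the left. The sketch below copies the capture variables into named variables right after each match. The names $both, $animal, $product, and $first are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $str = "Purina cat chow";
my ($both, $animal, $product);

# Group 1 is the outer group, groups 2 and 3 are nested inside it.
if ($str =~ /((cat|dog|ferret) (food|chow))/) {
    ($both, $animal, $product) = ($1, $2, $3);
}

# With two possible matches in the string, only the left-most is captured.
my $first;
if ("Do you have dog food or ferret food?" =~ /(cat|dog|ferret) food/) {
    $first = $1;
}

print "\$1=$both, \$2=$animal, \$3=$product\n";
print "left-most capture: $first\n";
```

Copying $1, $2, ... into ordinary variables immediately after the match is a common habit, because the next successful match overwrites them.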

Note: After a successful match, the special capturing variables remain set and can be used in later statements.

    if ($data =~ /(cat|dog|ferret) (food|chow)/ ) {
      print "The match $1 $2 was found.";
    }

So if $data is "Purina cat chow is now", then the print statement would generate: The match cat chow was found. The variables hold the captured matches until the end of the enclosing block or until a later successful match replaces their values.

Other special variables

There is some degree of "capturing" even when grouping is not used.

    $`   prematch: the part before the match
    $&   match: the matched part
    $'   postmatch: the part after the match

After this is executed

    "I like cats and bats." =~ /[rbc]at/

$& contains "cat", $` contains "I like ", and $' contains "s and bats." (including the final period). In general, the original string is equivalent to the concatenation of the three special variables.

    $` . $& . $'
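
The claim that the three variables reassemble the original string can be checked directly. A minimal sketch, with the variable names $pre, $match, $post, and $rebuilt introduced here for illustration:

```perl
use strict;
use warnings;

my $text = "I like cats and bats.";
my ($pre, $match, $post, $rebuilt) = ("", "", "", "");

if ($text =~ /[rbc]at/) {
    # Copy the special variables before another match can overwrite them.
    ($pre, $match, $post) = ($`, $&, $');
    $rebuilt = $pre . $match . $post;
}

print "prematch:  '$pre'\n";
print "match:     '$match'\n";
print "postmatch: '$post'\n";
print "rebuilt equals original\n" if $rebuilt eq $text;
```

Only the left-most match ("cat" inside "cats") is reported, so "bat" ends up in the postmatch.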

Quantifiers

    +       occurrence one or more times (consecutively)
    ?       occurrence zero or one times (consecutively)
    *       occurrence zero or more times (consecutively)
    {n}     occurrence exactly n times (consecutively)
    {n,}    occurrence at least n times (consecutively)
    {n,m}   occurrence at least n and at most m times (consecutively)

A quantifier is always put after the character (or class of characters) to be quantified.

    /x+/ -- matches one or more x's in a row
    /[aeiou]{3}/ -- matches any three vowels in a row
    /c.*t/ -- matches a c followed by a t with 0 or more of any character in between

The following tests whether $str contains at least one b character in between an a and c.

    $str =~ /ab+c/

    true: "abc", "abbc", "abbbc", "aabcc"
    false: "ac", "aBc"

The following tests whether $str contains a sequence of exactly 3 b characters in between an a and c.

    $str =~ /ab{3}c/

    true: "abbbc", "aabbbcc"
    false: "abbc", "abbbbc"

The following tests whether $str contains a sequence of at least 2 b characters in between an a and c.

    $str =~ /ab{2,}c/

    true: "abbc", "abbbc", "aabbbbcc"
    false: "abc", "aBBc"
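
Running the three quantified patterns over one shared list makes their differences easy to compare. This is a sketch only; the sample list and the array names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my @samples = ("ac", "abc", "abbc", "abbbc", "abbbbc", "aBBc");

# grep keeps the strings for which the match succeeds.
my @one_or_more   = grep { $_ =~ /ab+c/ }    @samples;
my @exactly_three = grep { $_ =~ /ab{3}c/ }  @samples;
my @two_or_more   = grep { $_ =~ /ab{2,}c/ } @samples;

print "ab+c:    @one_or_more\n";
print "ab{3}c:  @exactly_three\n";
print "ab{2,}c: @two_or_more\n";
```

The string "abbbbc" fails /ab{3}c/ because the a must be followed by exactly three b characters and then directly by c.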

It gets interesting when quantifiers are mixed with the special character classes. The following tests to see if $str contains an alphanumeric word (chunk of consecutive alphanumeric characters).

    $str =~ /\w+/

    true: "beest", "1234", "R2D2", "x", "##xyz##"
    false: "####", "", " "

The following tests to see if $str contains one or more consecutive digits (i.e. is there an integer inside).

    $str =~ /\d+/

    true: "1", "121 Elm. St.", "R2D2", "##1##", "3.14"
    false: "a", "####", "", " "

The following tests to see if $str contains a substring that looks like a (possibly negative) integer. That is, does $str contain zero or one - characters, followed by one or more consecutive digits.

    $str =~ /-?\d+/

    true: "2", "-2", "-3.14", "3-21.7"
    false: "xyx", "x-y"

Note that a string such as "4-x" also yields true, because the digit 4 alone satisfies the pattern.

The following tests to see if there is at least one whitespace character in $str.

    $str =~ /\s+/

    true: " ", " xyy", "The End"
    false: "", "xyz", "TheEnd"
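
A true result only says that some substring matched, not which one. The sketch below records the matched text through the special variable $& for several of the strings above. The hash name %matched is an assumption made for illustration.

```perl
use strict;
use warnings;

my @samples = ("2", "-2", "-3.14", "3-21.7", "4-x", "x-y");
my %matched;

for my $s (@samples) {
    if ($s =~ /-?\d+/) {
        my $hit = $&;            # text of the left-most match
        $matched{$s} = $hit;
        print "'$s' matched '$hit'\n";
    } else {
        print "'$s' did not match\n";
    }
}
```

In "3-21.7" the reported match is the leading 3, not -21: the string is scanned from the left, and the optional - is allowed to match zero times.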

The following matches any two-digit hexadecimal number. That is, it matches any occurrence of two consecutive characters from the class [0123456789abcdefABCDEF].

    /[\da-fA-F]{2}/

The quantified pattern is equivalent to the longer pattern /[\da-fA-F][\da-fA-F]/.

For the next example, suppose we have dates that are roughly formatted, but in the general form

    month_name day_number, year

We wish to create a pattern capable of factoring out inconsistent formatting and capturing the three date parts. For example, it should handle both dates below.

    jan 1,2002
    MARCH 22, 02

The following tests whether $date contains (a group of one or more letters, lower or upper-case), followed by one or more spaces, followed by (a group of one or more digits), followed by a comma and then zero or more spaces, followed by (a group of one or more digits).

    $date =~ /([a-zA-Z]+)\s+(\d+),\s*(\d+)/

Since there are three groups, the month is captured into $1, the day into $2, and the year in $3.
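
Applied to the two sample dates, the pattern can be exercised as follows. This is a minimal sketch; the array @parsed and the | separator used to join the parts are assumptions made for illustration.

```perl
use strict;
use warnings;

my @dates = ("jan 1,2002", "MARCH 22, 02");
my @parsed;

for my $date (@dates) {
    # month letters, spaces, day digits, comma, optional spaces, year digits
    if ($date =~ /([a-zA-Z]+)\s+(\d+),\s*(\d+)/) {
        push @parsed, "$1|$2|$3";
        print "month=$1 day=$2 year=$3\n";
    }
}
```

The pattern only separates the parts. Normalizing them, for example expanding a two-digit year, would be a separate step.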

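Quantifiers are greedy by default

That means a quantified pattern will attempt to match as much as possible. ("Matching is greedy.") Consider a pattern that tests for a < character, followed by one or more characters, followed by a > character, such as /<.+>/, applied to a string in which the word Title sits between an opening and a closing HTML tag. The quantifier's greediness passes up the opening tag alone, which would otherwise be a match. So the pattern matches the whole string in this case.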

To overcome the greediness (match as little as possible), an extra ? character is placed after the quantifier. For example, to find HTML tags, a pattern such as /<.+?>/ would be used. It basically says: test for a < character, then match as few characters as possible until a > character is found. Applied to the word Title wrapped in an opening and a closing tag, it would match only the opening tag.
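
Greedy and non-greedy matching can be compared on a single string. In the sketch below, the sample string with h1 tags and the exact patterns /<.+>/ and /<.+?>/ are assumptions chosen for illustration. They follow the description above but are not necessarily the ones on the original slide.

```perl
use strict;
use warnings;

my $html = "<h1>Title</h1>";   # hypothetical sample string
my ($greedy, $lazy) = ("", "");

# Greedy: .+ runs to the last > it can still reach.
if ($html =~ /<.+>/) {
    $greedy = $&;
}

# Non-greedy: .+? stops at the first > that completes a match.
if ($html =~ /<.+?>/) {
    $lazy = $&;
}

print "greedy match:     $greedy\n";
print "non-greedy match: $lazy\n";
```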

Command modifiers

The behavior of the matching operator can be altered by using a command modifier, which is placed after the operator.

    string_expression =~ /pattern/command_modifier

Case insensitive matching

The command modifier i specifies that the matching should be done in a case insensitive fashion.

    if($str =~ /be/i) {
      print "The string contains either be, Be, bE, or BE.";
    }
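
The effect of the i modifier shows up clearly when the same pattern is applied with and without it. A minimal sketch, with a made-up word list:

```perl
use strict;
use warnings;

my @words = ("be", "Be", "bE", "BE", "maybe", "BEEST", "bat");

# Without the modifier, only lower-case "be" counts.
my @sensitive   = grep { $_ =~ /be/ }  @words;

# With /i, any mix of upper and lower case is accepted.
my @insensitive = grep { $_ =~ /be/i } @words;

print "case sensitive:   @sensitive\n";
print "case insensitive: @insensitive\n";
```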