By: Andrew Cory. Grouping Things & Hierarchical Matching Grouping characters – ( and ) Allows parts of a regular expression to be treated as a single.

Presentation transcript:

By: Andrew Cory

Grouping Things & Hierarchical Matching
Grouping characters: ( and )
Allow parts of a regular expression to be treated as a single unit
Useful for matching multiple words or phrases that share base characters or words
Ex. /house(cat|keeper)/ is equivalent to /housecat|housekeeper/
Ex. /(a|[bc])d/ matches 'ad', 'bd', or 'cd'
Ex. /(19|20|)\d\d/ matches 19xx, 20xx, or xx
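The three grouping examples can be tried against a handful of words. This is a minimal sketch: the word list is hypothetical, and the ^ and $ anchors are added here (they are not in the slide) so that each pattern has to match the whole word.

```perl
use strict;
use warnings;

# Hypothetical sample words, chosen only for illustration.
my @words = qw(housecat housekeeper house ad bd cd dd 1999 2024 42);

# Anchors (^ and $) are an addition so each pattern must cover the whole word.
my @house  = grep { /^house(cat|keeper)$/ } @words;   # 'housecat' or 'housekeeper'
my @letter = grep { /^(a|[bc])d$/ } @words;           # 'ad', 'bd', or 'cd'
my @year   = grep { /^(19|20|)\d\d$/ } @words;        # 19xx, 20xx, or a bare xx

print "house:  @house\n";
print "letter: @letter\n";
print "year:   @year\n";
```

The empty alternative in (19|20|) is what lets the bare two-digit form through.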

Continued
Backtracking: the step-by-step process of trying alternatives, checking whether each one matches, and moving on to the next alternative if it does not
A given regular expression usually offers several paths through the pattern, each of which can match a different string
Backtracking is a trial-and-error method that works through the string one character at a time

Continued
Backtracking example: "abcd" =~ /(af|ab)(ce|c|cd)/;
1. Start with the first letter, 'a'
2. Try the 1st alternative in the first group, 'af'
3. 'a' matches, but 'f' doesn't match 'b'; backtrack to 'a' and try the 2nd alternative, 'ab'
4. 'a' and 'b' match the first 2 characters; the first group is satisfied, so move on to the next group
5. 'c' matches, but 'e' doesn't match 'd'; backtrack to 'c' and try the 2nd alternative, 'c'
6. 'c' matches; the second group is satisfied, so the whole expression is satisfied, having matched 'abc' out of "abcd"
Note: the 3rd alternative in the 2nd group, 'cd', would match too, but it is never tried because the regular expression is already satisfied.
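The outcome of this walk-through can be checked directly. The sketch below only reports what the engine ended up matching; the variable names are illustrative.

```perl
use strict;
use warnings;

my $string = "abcd";
my ($matched, $first, $second);

if ($string =~ /(af|ab)(ce|c|cd)/) {
    # $& is the full matched text; $1 and $2 are the two groups.
    ($matched, $first, $second) = ($&, $1, $2);
    print "matched '$matched': group 1 = '$first', group 2 = '$second'\n";
}
```

The second group holds 'c' rather than 'cd' because the engine stops at the first alternative that lets the whole pattern succeed.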

Extracting Matches
Parentheses not only group; they also extract the parts of the string that matched each group
Ex.
    if ($time =~ /(\d\d):(\d\d):(\d\d)/) {
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }
In list context, the match returns the captured groups directly:
    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
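Both extraction styles can be run side by side. A minimal sketch, assuming a hypothetical time string of the form hh:mm:ss:

```perl
use strict;
use warnings;

my $time = "12:34:56";   # hypothetical input

# Scalar context: test the match, then read $1, $2, $3.
my ($hours, $minutes, $seconds);
if ($time =~ /(\d\d):(\d\d):(\d\d)/) {
    ($hours, $minutes, $seconds) = ($1, $2, $3);
}

# List context: the match returns the captured groups directly.
my ($h, $m, $s) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

print "$hours:$minutes:$seconds and $h:$m:$s\n";
```

The list-context form is shorter, but it gives no separate place to handle a failed match; the variables are simply left undefined.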

Continued
Nested groups produce further captures, numbered by the position of their opening parenthesis
Ex. /(ab(cd|ef)((gi)|j))/;
    $1 = ab(cd|ef)((gi)|j)   (the whole outer group)
    $2 = cd|ef
    $3 = (gi)|j
    $4 = gi
Backreferences (\1, \2, etc.) are related to the matching variables $1, $2, etc., but can only be used inside the regular expression
Useful for matching repeated phrases
Ex. /(\w\w\w)\1/ matches 'booboo' or 'murmur'
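A short sketch of how nested groups are numbered and how a backreference behaves. The input strings are hypothetical, and the anchors in the second pattern are an addition so that only exact doubles are accepted.

```perl
use strict;
use warnings;

# Groups are numbered by the position of their opening parenthesis.
my @groups = ("abcdgi" =~ /(ab(cd|ef)((gi)|j))/);
print join(", ", @groups), "\n";

# \1 must repeat exactly what group 1 captured.
my @doubled = grep { /^(\w\w\w)\1$/ } qw(booboo murmur banana);
print "@doubled\n";
```

For this input the outermost group captures the whole match, and groups 3 and 4 both capture 'gi' because group 4 sits inside group 3.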

Continued
The positions of the matched portions of the string are also stored, in the arrays @- (start offsets) and @+ (end offsets)
Ex.
    $x = "Mmm...donut";
    $x =~ /^(Mmm)\.\.\.(donut)/;
    foreach $expr (1..$#-) {
        print "$expr: '${$expr}' at ($-[$expr],$+[$expr])\n";
    }
Output:
    1: 'Mmm' at (0,3)
    2: 'donut' at (6,11)

Continued
Even when a regexp has no groupings, the parts of the string around the match are still stored in special variables
$` is the string before the match
$& is the string that matched
$' is the string after the match
Ex.
    $x = "I like chips";
    $x =~ /like/;
    $` = "I "    $& = "like"    $' = " chips"
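The three special variables can be copied into named variables right after a successful match. A minimal sketch; the names $before, $match, and $after are illustrative.

```perl
use strict;
use warnings;

my $x = "I like chips";
my ($before, $match, $after);

if ($x =~ /like/) {
    # Copy the special variables before another match overwrites them.
    ($before, $match, $after) = ($`, $&, $');
    print "[$before][$match][$after]\n";
}
```

The brackets in the output make the surrounding spaces visible.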

Matching Repetitions
The quantifier characters ?, *, +, and {} match words or syllables of any length without massive amounts of repetition in the pattern
Definitions
a? = matches 'a' zero or one times
a* = matches 'a' zero or more times (any number of times)
a+ = matches 'a' one or more times (at least once)
a{n,m} = matches at least n times, but not more than m times
a{n,} = matches at least n times
a{n} = matches exactly n times

Continued
Examples
/[a-z]+\s+\d*/ = a lowercase word, some whitespace, and any number of digits (ajc 93, jgro )
/(\w+)\s\1/ = a doubled word of any length with a single whitespace character in between (jon jon, hidalgo hidalgo)
/y(es)?/i = 'y' or 'yes' in any letter case ('y', 'Y', 'yes', 'Yes', ...)
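These quantifier examples can be exercised with a small table of strings and patterns. This is a sketch: the test strings are hypothetical, the a{2,3} rows are an extra illustration of the {n,m} form, and the anchors are added so each pattern must match the whole string.

```perl
use strict;
use warnings;

# Each entry pairs a hypothetical test string with an anchored pattern.
my @tests = (
    [ 'ajc 93',          qr/^[a-z]+\s+\d*$/ ],
    [ 'hidalgo hidalgo', qr/^(\w+)\s\1$/    ],
    [ 'jon bob',         qr/^(\w+)\s\1$/    ],
    [ 'Yes',             qr/^y(es)?$/i      ],
    [ 'aaa',             qr/^a{2,3}$/       ],
    [ 'aaaa',            qr/^a{2,3}$/       ],
);

my @results;
for my $t (@tests) {
    my ($string, $re) = @$t;
    push @results, ($string =~ $re) ? 1 : 0;
    printf "%-18s %s\n", "'$string'", $results[-1] ? "matches" : "does not match";
}
```

'jon bob' fails because \1 must repeat the first word, and 'aaaa' fails because {2,3} allows at most three repetitions.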

Continued
Perl will always try to match as much of the string as possible, so long as the whole regular expression can still match
E.g. with a?, Perl first tries to match with the 'a' present; only if that fails does it try again without the 'a'
Ex.
    $x = "the cat in the hat";
    $x =~ /^(.*)(at)(.*)$/;
    $1 = 'the cat in the h'    $2 = 'at'    $3 = ''

Continued
Quantifiers that grab as much of the string as possible are known as 'maximal match' or 'greedy' quantifiers
4 important regular expression principles
Principle 1: Taken as a whole, any regexp will be matched at the earliest possible position in the string
Principle 2: In an alternation (a|b|c), the leftmost alternative that allows the whole regexp to match will be the one used
Principle 3: Greedy quantifiers will match as much of the string as possible while still allowing the whole regexp to match
Principle 4: When a regexp has several greedy quantifiers, the leftmost one takes priority and matches as much as it can first

Continued
Examples
    $x = "The programming republic of Perl";
    $x =~ /^(.+)(e|r)(.*)$/;
    $1 = 'The programming republic of Pe'    $2 = 'r'    $3 = 'l'
    $x =~ /.*(m{1,2})(.*)$/;
    $1 = 'm'    $2 = 'ing republic of Perl'
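The matching principles can each be seen in isolation. A minimal sketch; the "cats" and "aXbXc" strings are hypothetical inputs picked to make each effect visible.

```perl
use strict;
use warnings;

my $x = "The programming republic of Perl";

# Earliest position wins: the 'e' in "The" comes before any 'r'.
my ($first) = ($x =~ /(r|e)/);

# The leftmost alternative that allows a match wins, even if a later one is longer.
my ($alt) = ("cats" =~ /(c|ca|cat|cats)/);

# The leftmost greedy quantifier takes as much as it can; the next one gets the rest.
my @split = ("aXbXc" =~ /^(.*)X(.*)$/);

print "first = '$first', alt = '$alt'\n";
print "split = '$split[0]' and '$split[1]'\n";
```

In the last line the first .* runs up to the final X, leaving only 'c' for the second group.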

Continued
Sometimes matching the minimal piece of a string is essential; for this, the 'minimal match' or 'non-greedy' quantifiers ??, *?, +?, and {}? were created.
Definitions
a?? = match 'a' 0 or 1 times; try 0 first, then 1
a*? = match 'a' 0 or more times, as few as possible
a+? = match 'a' 1 or more times, as few as possible
a{n,m}? = match at least n times, no more than m times, as few as possible
a{n,}? = match at least n times, as few as possible
a{n}? = match exactly n times; same thing as a{n}

Continued
Examples: same as above, different operators!
    $x = "The programming republic of Perl";
    $x =~ /^(.+?)(e|r)(.*)$/;
    $1 = 'Th'    $2 = 'e'    $3 = ' programming republic of Perl'
    $x =~ /.*?(m{1,2})(.*)$/;
    $1 = 'mm'    $2 = 'ing republic of Perl'
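Running the greedy and non-greedy forms against the same string makes the difference concrete. A minimal sketch; the variable names are illustrative.

```perl
use strict;
use warnings;

my $x = "The programming republic of Perl";

my @greedy_parts = ($x =~ /^(.+)(e|r)(.*)$/);    # .+ grabs as much as it can
my @lazy_parts   = ($x =~ /^(.+?)(e|r)(.*)$/);   # .+? grabs as little as it can

my ($greedy_m) = ($x =~ /.*(m{1,2})(.*)$/);      # leading .* leaves only one 'm'
my ($lazy_m)   = ($x =~ /.*?(m{1,2})(.*)$/);     # leading .*? lets m{1,2} take both

print "greedy: '$greedy_parts[0]' / lazy: '$lazy_parts[0]'\n";
print "greedy m: '$greedy_m' / lazy m: '$lazy_m'\n";
```

Only the leading quantifier changes between each pair of patterns; the m{1,2} itself stays greedy in both.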

Continued
Note: the principles about greedy quantifiers (Principles 3 and 4) extend to non-greedy quantifiers: the leftmost non-greedy quantifier matches as little of the string as possible while still allowing the whole regexp to match

Continued
Quantifiers are also subject to backtracking
Ex.
    $x = "the cat in the hat";
    $x =~ /^(.*)(at)(.*)$/;
    $1 = 'the cat in the h'    $2 = 'at'    $3 = ''
1. Start with the first letter, 't'
2. The first quantifier, .*, starts out by matching the whole string
3. 'a' cannot match at the end of the string; backtrack one character
4. 'a' does not match the last letter, 't'; backtrack one more character
5. Now 'a' matches, and then 't' matches
6. Move on to the 3rd element, .*; we are already at the end of the string, so it is assigned the empty string

Continued
Error alert! Nested indeterminate quantifiers are dangerous things
Ex. /(a|b+)*/;
In the above example, the inner b+ can match a run of b's of any length, and the outer * can then repeat the group any number of times, so the same string can be split up in a great many ways
If a match is not found early in the process, Perl will try EVERY one of those possibilities before giving up, which can take an extremely long time
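If the captured group itself is not needed, one way to avoid the nesting is to rewrite the pattern with a single character class. The sketch below is an assumption about what the pattern is meant to accept, not something the slides prescribe; it checks on a few short, hypothetical strings that the nested form and the flat form agree.

```perl
use strict;
use warnings;

my @samples = qw(abba bbbb aab abbac);   # hypothetical short inputs
my @agree;

for my $s (@samples) {
    my $nested = ($s =~ /^(a|b+)*$/) ? 1 : 0;   # nested quantifiers
    my $flat   = ($s =~ /^[ab]*$/)   ? 1 : 0;   # same set of strings, no nesting
    push @agree, [$s, $nested, $flat];
    print "$s: nested=$nested flat=$flat\n";
}
```

The flat form gives the engine only one way to consume each character, so a failed match is reported without exploring alternative splits.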

Building a Regexp
Step one: decide what we want to match and what we want to exclude.
Ex. A regexp that matches numbers should reject any non-numeric string, and accept both integers and floating point numbers
Step two: break the problem down into smaller parts
Smaller parts are easier to work with
Ex. Any integer: /[+-]?\d+/
\d+ represents one or more digits
[+-]? represents the number's optional sign (positive/negative)

Continued
Ex. Floating point
Has a sign, an integer part, a decimal point, a fractional part, and an exponent, e.g. 25.4E-72
/^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;
1st part, [+-]?, is the optional sign of the number
2nd part, (\d+\.\d+|\d+\.|\.\d+|\d+), covers the several different forms the number can take (2.54, 346., .395, 500)
3rd part, ([eE][+-]?\d+)?, is the optional exponent, which is represented by e or E followed by an optional sign, then an integer of any size (e-5, E9000)
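The floating point pattern can be checked against a mix of valid and invalid inputs. A minimal sketch, using the anchored form with \d+\.\d+ as the first alternative; the input list is hypothetical.

```perl
use strict;
use warnings;

# Sign, then one of four mantissa forms, then an optional exponent.
my $float_re = qr/^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;

# Hypothetical inputs: the first six should be accepted, the last three rejected.
my @inputs = ('25.4E-72', '2.54', '346.', '.395', '500', '-7e3', 'abc', '1.2.3', 'e5');

my %is_number;
for my $input (@inputs) {
    $is_number{$input} = ($input =~ $float_re) ? 1 : 0;
    printf "%-10s %s\n", $input, $is_number{$input} ? "number" : "not a number";
}
```

The ^ and $ anchors matter here: without them, '1.2.3' would be accepted because part of it looks like a number.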

Continued
The //x modifier in Perl allows one to write complex regexps with as much spacing as the programmer wants
    /^
        [+-]?
        (
            \d+\.\d+
            |\d+\.
            |\.\d+
            |\d+
        )
        ([eE][+-]?\d+)?
    $/x;

Continued
The downside to the //x modifier: certain symbols must be typed differently
Spacing: since //x ignores whitespace in the regexp, a literal space must be typed as '\ ' or '[ ]'
Pound signs: since //x treats # as the start of a comment, a literal pound sign must be typed as '\#' or '[#]'

Continued
Example
    /^
        [+-]?\ *          # any number of spaces may now appear
        (                 # between the sign and the floating point number
            \d+
            (             # the coding for the floating point has been
                \.\d*     # reworked, since most of the alternatives
            )?            # started similarly
            |\.\d+
        )
        ([eE][+-]?\d+)?
    $/x;
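The commented //x pattern can be compiled with qr and tested. This is a sketch: the comments are reworded, the variable name $float_x and the inputs are hypothetical, and the comments deliberately avoid the characters /, $, and @, which Perl would still interpret inside the pattern.

```perl
use strict;
use warnings;

my $float_x = qr/^
    [+-]?\ *            # optional sign, then any number of literal spaces
    (                   # mantissa:
        \d+             #   digits,
        ( \.\d* )?      #   optionally followed by a point and more digits
      | \.\d+           #   or a leading point followed by digits
    )
    ( [eE][+-]?\d+ )?   # optional exponent
$/x;

# Hypothetical inputs, including one with a space after the sign.
my @inputs = ('- 12.5', '346.', '.395', '500', '25.4E-72', '1.2.3', '12 e5');
my %accepted;
for my $input (@inputs) {
    $accepted{$input} = ($input =~ $float_x) ? 1 : 0;
    printf "%-10s %s\n", "'$input'", $accepted{$input} ? "accepted" : "rejected";
}
```

Only the escaped space after the sign is significant; all other whitespace in the pattern is ignored under //x.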