Regular Expressions: Theory and Perl Implementation


Outline:
1. Theoretical Definitions and Examples
2. Acceptance by Finite Automata
3. Perl's Syntax
4. Other pattern matching functionality in Perl
5. Program Example
CSE 341 -- S. Tanimoto Regular Expressions

Alphabets and Sets of Strings
An alphabet Σ = {a1, a2, ..., an} is a set of characters.
A string over Σ is a sequence of zero or more elements of Σ.
Example. If Σ = {0, 1, 2} then 2201 is a string over Σ.
No matter what Σ is, the empty string ε is a string over Σ.
A set of strings over Σ is a set of zero or more strings, each of which is a string over Σ.
Example. If Σ = {0, 1, 2} then {ε, 111, 121, 0} is a set of strings over Σ.
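
The definition of "a string over Σ" can be checked mechanically. A minimal sketch, assuming a non-empty alphabet of single characters; the helper name is_string_over is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character of $string is in the alphabet.
# Assumes a non-empty alphabet whose elements are single characters.
sub is_string_over {
    my ($string, @alphabet) = @_;
    my $class = join '', map { quotemeta($_) } @alphabet;
    return $string =~ /\A[$class]*\z/ ? 1 : 0;
}

print is_string_over('2201', 0, 1, 2), "\n";   # prints 1
print is_string_over('',     0, 1, 2), "\n";   # prints 1 (the empty string)
print is_string_over('2301', 0, 1, 2), "\n";   # prints 0 ('3' is not in the alphabet)
```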

A Recursive Definition for Regular Expressions
A regular expression for an alphabet Σ is a certain kind of pattern that describes a set of strings over Σ.
Any character c in Σ is a regular expression representing {c}.
If E, E1 and E2 are regular expressions over Σ, then so are:
  E1 E2    -- representing the concatenation of E1 and E2 (each string of E1 followed by each string of E2).
  E1 | E2  -- representing alternation of E1 and E2 (any string of E1 or of E2).
  ( E )    -- representing E grouped with parentheses.
  E+       -- representing one or more instances of E concatenated.
  E*       -- representing zero or more instances of E concatenated.

Regular Expression Examples
Let Σ = {a, b, c, d}.
  a = {a}
  ab = {ab}
  a | b = {a, b}
  a+ = {a, aa, aaa, ... }
  ab* represents the set of strings having a single a followed by zero or more occurrences of b. That is, it's {a, ab, abb, abbb, ... }
  a (b | c) = {ab, ac}
  (a | b) (c | d) = {ac, ad, bc, bd}
  aa* = a+ = {a, aa, aaa, ... }
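
Each operator of the recursive definition has a direct counterpart in Perl's pattern syntax. The sketch below tests a few candidate strings against each kind of expression. The helper in_language and the sample strings are assumptions made for illustration. \A and \z are used so that the pattern must describe the whole string, not a substring of it.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $string belongs to the set described by $regex.
# \A and \z anchor the match so the entire string must be described.
sub in_language {
    my ($regex, $string) = @_;
    return $string =~ /\A(?:$regex)\z/ ? 1 : 0;
}

my %examples = (
    'ab'     => 'concatenation',
    'a|b'    => 'alternation',
    '(a|b)b' => 'grouping',
    'a+'     => 'one or more',
    'a*'     => 'zero or more',
);

for my $regex (sort keys %examples) {
    my @members = grep { in_language($regex, $_) } ('', 'a', 'b', 'aa', 'ab', 'bb');
    printf "%-8s %-14s %s\n", $regex, $examples{$regex},
        join(', ', map { $_ eq '' ? '(empty)' : $_ } @members);
}
```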

Extended Regular Expressions
Let letters = a | b | c | d
Let digits = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Let identifiers = letters ( letters | digits )*
Thus we can use a name to represent a set of strings and use that name in a regular expression.
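
Perl's qr// operator gives a similar way to name a sub-pattern and reuse it inside a larger one. A minimal sketch of the three definitions above; the variable names and test strings are chosen here for illustration:

```perl
use strict;
use warnings;

my $letter     = qr/[a-d]/;                       # letters = a | b | c | d
my $digit      = qr/[0-9]/;                       # digits  = 0 | 1 | ... | 9
my $identifier = qr/$letter(?:$letter|$digit)*/;  # letters ( letters | digits )*

for my $candidate ('abc1', 'd42', '9ab', 'a_b') {
    my $verdict = $candidate =~ /\A$identifier\z/ ? 'identifier' : 'not an identifier';
    print "$candidate: $verdict\n";
}
```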

Finite Automaton
[Figure: state diagram with a start state and an accepting state; edges labeled a, b, and a.]
Corresponding regular expression: ab*a
Example: process the string abba.
Now try abbb.
Finite number of states, but the number of strings is not necessarily finite.
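
Processing a string with such an automaton amounts to following one transition per input character. The sketch below hand-codes a transition table that accepts ab*a. The state names and the helper accepts are assumptions chosen for this example, not taken from the slide's diagram.

```perl
use strict;
use warnings;

# Transition table for ab*a; the state names are arbitrary labels chosen here.
my %next_state = (
    start  => { a => 'middle' },
    middle => { b => 'middle', a => 'accept' },
    accept => {},
);

# Hypothetical helper: run the automaton over $string, one character at a time.
sub accepts {
    my ($string) = @_;
    my $state = 'start';
    for my $char (split //, $string) {
        $state = $next_state{$state}{$char};
        return 0 unless defined $state;   # no transition available: reject
    }
    return $state eq 'accept' ? 1 : 0;
}

for my $input ('abba', 'abbb', 'aa') {
    print "$input: ", (accepts($input) ? 'accepted' : 'rejected'), "\n";
}
```

The table has three states, yet it accepts infinitely many strings because the b loop can be taken any number of times.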

Equivalence of Finite Automata and Regular Expressions
[Figure: automata for the regular expressions ab, a | b, and a*; edges labeled a and b.]

Regular Expressions in Perl
In Perl, regular expressions are used to specify patterns for pattern matching.
    $sentence = "Winter weather has arrived.";
    if ($sentence =~ /weather/) {
      print "Never bet on the weather." ;
    }
    # General form: $string =~ /Pattern/
The result of this kind of pattern matching is a true or false value.

A Perl Regular Expression for Identifier
    $identifier = "[a-z][a-z0-9]*";
    $sentence = "012,cse341 341,ABC]*";
    if ($sentence =~ /$identifier/) {
      print "Seems to be an identifier here." ;
    }
    $ident2 = "[a-zA-Z][a-zA-Z0-9]*";
    $reservedWord = "begin|end";
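
Because the pattern is not anchored, the test above succeeds if an identifier occurs anywhere in the string. A short sketch of the difference, using the same pattern and sentence; the /g match in list context and the \A ... \z anchors are additions made here for illustration:

```perl
use strict;
use warnings;

my $identifier = "[a-z][a-z0-9]*";
my $sentence   = "012,cse341 341,ABC]*";

# In list context, a /g match returns every captured occurrence.
my @found = $sentence =~ /($identifier)/g;
print "@found\n";    # cse341

# Anchoring with \A and \z asks whether the whole string is one identifier.
print "the whole string is not an identifier\n"
    unless $sentence =~ /\A$identifier\z/;
```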

Specifying Patterns
    /Pattern/     # Literal text; true if it occurs anywhere in the string.
    /^Pattern/    # Must occur at the beginning.
    "Pattern recognition is alive" =~ /^Pattern/
    "The end" =~ /end$/
    \s  whitespace                     \S  non-whitespace
    \w  a word char. [a-zA-Z_0-9]      \W  a non-word char.
    \d  a digit                        \D  a non-digit
    \b  word boundary                  \B  not word boundary
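
A few of these anchors and character classes in use. This is a small sketch and the sample sentence is made up for illustration:

```perl
use strict;
use warnings;

my $text = "The end of 2003 was cold";

print "starts with 'The'\n"      if     $text =~ /^The/;      # anchored at the beginning
print "does not end with 'end'\n" unless $text =~ /end$/;     # 'end' is not at the end
print "contains the word 'end'\n" if     $text =~ /\bend\b/;  # whole word only
print "contains a run of digits\n" if    $text =~ /\d+/;
print "no word 'old'\n"          unless $text =~ /\bold\b/;   # 'old' occurs only inside 'cold'
```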

Specifying Patterns (Cont.)
    $test = "You have new mail -- 5-21-03";
    if ($test =~ /^You\s.+\d+-\d+-\d+/ ) {
      print "The mail has arrived.";
    }
    # The same pattern with the x modifier, which lets whitespace in the pattern be used for layout:
    if ($test =~ m( ^ You \s .+ \d+ - \d+ - \d+ )x ) {
      print "The mail has arrived.";
    }

Extracting Information
    $test = "You have new mail -- 5-21-03";
    if ($test =~ /^You\s.+(\d+)-(\d+)-(\d+)/ ) {
      print "The mail has arrived on ";
      print "day $2 of month $1 in year $3.\n";
    }
    # Parentheses in the pattern establish
    # variables $1, $2, $3, etc. to hold
    # corresponding matched fragments.
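
One caveat with this pattern is that .+ is greedy, so with a two-digit month it leaves only the last digit for the first group. A sketch of the effect and of one possible remedy, the non-greedy .+?. The date 12-21-03 is a made-up variation on the example above.

```perl
use strict;
use warnings;

my $test = "You have new mail -- 12-21-03";

# Greedy .+ consumes as much as it can, leaving one digit for the first group.
my ($greedy_month) = $test =~ /^You\s.+(\d+)-\d+-\d+/;

# Non-greedy .+? stops as early as possible, so the month is captured whole.
# In list context the match returns the captured groups directly.
my ($month, $day, $year) = $test =~ /^You\s.+?(\d+)-(\d+)-(\d+)/;

print "greedy month: $greedy_month\n";          # greedy month: 2
print "month $month, day $day, year $year\n";   # month 12, day 21, year 03
```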

Search and Replace
    $sntc = "We surfed the waves the whole day.";
    $sntc =~ s/surfed/sailed/;
    print $sntc;
    # We sailed the waves the whole day.
    $sntc =~ s/the//g;
    # We sailed  waves  whole day.
    # (the space on each side of a removed "the" stays behind)
    # g makes the replacement “global”.
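
Two variations on the substitution above, offered as a sketch. The /r modifier (assuming Perl 5.14 or later) returns a modified copy instead of editing in place. Matching the trailing whitespace along with the word avoids leaving doubled spaces.

```perl
use strict;
use warnings;

my $sntc = "We surfed the waves the whole day.";

# /r (Perl 5.14+) returns the changed copy and leaves $sntc untouched.
my $sailed = $sntc =~ s/surfed/sailed/r;

# Without /r, s///g edits in place and returns the number of replacements.
# Removing the whitespace after each "the" avoids doubled spaces.
my $count = ($sntc =~ s/\bthe\s+//g);

print "$sailed\n";   # We sailed the waves the whole day.
print "$sntc\n";     # We surfed waves whole day.
print "$count\n";    # 2
```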

Interpolation of Variables in Replacements
    $exclamation = "yeah";
    $sntc = "We had fun.";
    $sntc =~ s/\w+/$exclamation/g;
    print $sntc;
    # yeah yeah yeah.
    # The replacement, like the pattern, can contain a Perl variable.
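
When a variable is interpolated into the pattern side, its contents are treated as a regular expression. If the text should be matched literally, \Q ... \E quotes any metacharacters. A small sketch with made-up sample data:

```perl
use strict;
use warnings;

my $needle = '3.5';
my $text   = 'versions 3.5 and 345';

# Interpolated as a pattern, the '.' matches any character.
(my $as_pattern = $text) =~ s/$needle/X/g;

# With \Q...\E the '.' matches only a literal dot.
(my $as_literal = $text) =~ s/\Q$needle\E/X/g;

print "$as_pattern\n";   # versions X and X
print "$as_literal\n";   # versions X and 345
```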

Example of (Crude) Lexical Analysis
    $ident = "[a-zA-Z][a-zA-Z0-9]*";
    $int = "[\-]?[0-9]+";
    $op = "[\-\+\*\/\=]|mod";
    $exp = "begin x = 5; print sqrt(x); end";
    $exp =~ s/$ident/ID/g;
    $exp =~ s/$int/N/g;
    $exp =~ s/$op/OP/g;
    print $exp;
    # ID ID OP N; ID ID(ID); ID
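
The substitution approach is crude partly because the order of the three replacements matters: reserved words and mod are rewritten to ID before any other rule sees them. A different technique, shown here only as a sketch, scans the input once with \G and the /gc modifiers and tries the token patterns in priority order. The function tokenize, the KIND:text output format, and the KEYWORD and PUNCT categories are assumptions made for this example. A leading minus sign is treated as an operator here and is not folded into the integer.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: returns a list of "KIND:text" strings.
sub tokenize {
    my ($source) = @_;
    my @tokens;
    pos($source) = 0;
    while (pos($source) < length($source)) {
        # \G anchors each attempt at the current position; /c keeps that
        # position when an attempt fails, so the next rule can be tried.
        if    ($source =~ /\G\s+/gc)                    { next }
        elsif ($source =~ /\G(begin|end)\b/gc)          { push @tokens, "KEYWORD:$1" }
        elsif ($source =~ /\G(mod)\b/gc)                { push @tokens, "OP:$1" }
        elsif ($source =~ /\G([a-zA-Z][a-zA-Z0-9]*)/gc) { push @tokens, "ID:$1" }
        elsif ($source =~ /\G([0-9]+)/gc)               { push @tokens, "N:$1" }
        elsif ($source =~ /\G([-+*\/=])/gc)             { push @tokens, "OP:$1" }
        elsif ($source =~ /\G([;()])/gc)                { push @tokens, "PUNCT:$1" }
        else  { die "Unexpected character at offset " . pos($source) . "\n" }
    }
    return @tokens;
}

print join(' ', tokenize("begin x = 5; print sqrt(x); end")), "\n";
```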

Processing Assignment Submissions Using Forms and Files
1. Form file
2. Perl script to process data from form.
3. Perl script to “compile” data into an index page.

The HTML Form
<html><head>
<title>Submission for CSE 341 Miniproject Topic Proposals</title>
</head><body>
<h1>CSE 341 Miniproject Topic Proposal Submission Form</h1>
Write a topic-proposal web page, and then fill out this form and submit it by Thursday, February 24 at 5:00 PM.
(The web page should follow these
<a href="http://www.cs.washington.edu/education/courses/341/00wi/MP-topic-proposal-guidelines.html">
guidelines</a>.)
<br><form method=post action="http://cubist.cs.washington.edu/~tanimoto/341-student/process-topic-proposal.pl">

The HTML Form (2 of 2)
<br>Possible name of project:
<input type=text name=projectname value="" size=40>
<br>Name of Possible partner (optional):
<input type=text name=partner value="">
<br>URL of a web page that describes your proposal:
<input type=text name=proposalurl value="" size=40>
<br>If you plan to submit another topic proposal because you are very uncertain about whether to stick with this one, check this box:
<input type=checkbox name=uncertain value="No">
<br><input type=submit name=submit value="Submit">
</form>
</body></html>

Perl Script to Process Data From Form
#! /usr/bin/perl
# Process the miniproject topic proposal form inputs
# S. Tanimoto, 20 Feb 2000
use CGI qw/:standard/;
use strict;
print header;
# Read the submitted form fields.
my $projectname = param("projectname");
my $uncertain = param("uncertain");
my $partner = param("partner");
my $proposal_url = param("proposalurl");
my $student_username = $ENV{"REMOTE_USER"};
my $now = localtime();
# Keep only letters, digits, '-' and '~' in each text field.
# (For the URL this also removes '.', '/' and ':'.)
$projectname =~ s/[^a-zA-Z0-9\-\~]//g;
$partner =~ s/[^a-zA-Z0-9\-\~]//g;
$proposal_url =~ s/[^a-zA-Z0-9\-\~]//g;

Perl Script to Process the Data (2 of 2)
# Build one "KEY=value; " record per submission.
my $output_line = "STUDENT_USERNAME=$student_username; " .
  "PROPOSAL_URL=$proposal_url; " .
  "PROJECT_NAME=$projectname; " .
  "PARTNER=$partner; " .
  "UNCERTAIN=$uncertain; " .
  "DATE=$now; ";
# Append the record to the data file.
if (! (open(OUT, ">>MP-topic-proposal-data.txt"))) {
  print("Error: could not open topic file for output.");
  print("Please notify instructor and/or try again later.");
  print end_html;
  exit 0;
}
print OUT $output_line, "\n";
close OUT;
print h1("Your miniproject topic proposal has been received. Thanks!");

Perl Script to “Compile” the Data
#!/usr/bin/perl
# make-MP-index-of-proposed-topics.pl
use strict;
use CGI qw/:standard/;
open(INFILE, "<MP-topic-proposal-data-sorted.txt") ||
  die("Could not open the file MP-topic-proposal-data-sorted.txt.\n");
print <<"EOT";
<html><head><title>CSE 341 MP Topic Proposal Index</title>
</head><body>
<h1>CSE 341 MP Topic Proposal Index</h1>
EOT
print "<table><tr><td>Student username</td><td>Proposal Page</td><td>Partner</td><td>Certainty</td><td>When</td></tr>\n";
my $projectname;
my $uncertain;
my $partner;
my $proposal_url;
my $student_username;
my $date;

Perl Script to “Compile” the Data (2 of 3)
while (<INFILE>) {
  # Pull each field out of the current record; a missing field becomes "".
  if ( /STUDENT_USERNAME=([^\;]+);\s/) { $student_username = $1; }
  else { $student_username = ""; }
  if ( /PROJECT_NAME=([^\;]+);\s/) { $projectname = $1; }
  else { $projectname = ""; }
  if ( /PROPOSAL_URL=([^\;]+);\s/) { $proposal_url = $1; }
  else { $proposal_url = ""; }
  if ( /PARTNER=([^\;]+);\s/) { $partner = $1; }
  else { $partner = ""; }
  if ( /UNCERTAIN=([^\;]+);\s/) { $uncertain = $1; }
  else { $uncertain = ""; }
  if ( /DATE=([^\;]+);/) { $date = $1; }
  else { $date = ""; }
  # Prefix the URL with "http://" unless it already contains "http".
  if ($proposal_url =~ /http/ ) {}
  else { $proposal_url = "http://" . $proposal_url; }
  if ($uncertain eq "No") { $uncertain = ""; }
  else { $uncertain = "Uncertain"; }

Perl Script to “Compile” the Data (3 of 3)
  # Emit one table row per record.
  my $link = "<a href=\"$proposal_url\">$projectname</a>";
  print "<tr><td>$student_username</td><td>$link</td><td>$partner</td><td>$uncertain</td><td>$date</td></tr>\n";
}
print "</table>\n";
print "</body></html>\n";
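
The six field-by-field matches in the loop could also be written as a single global match that collects every KEY=value; pair into a hash. This is a sketch of that alternative and not the script's actual approach. The helper parse_record and the sample record (username, URL, project name, date) are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one "KEY=value; KEY=value; " record into a hash.
# In list context, a /g match with two groups returns a flat list of pairs.
sub parse_record {
    my ($line) = @_;
    my %field = $line =~ /([A-Z_]+)=([^;]*);/g;
    return %field;
}

# Made-up sample record in the format written by the form-processing script.
my %record = parse_record(
    "STUDENT_USERNAME=jdoe; PROPOSAL_URL=example.org; PROJECT_NAME=Lexer; "
  . "PARTNER=; UNCERTAIN=No; DATE=Thu Feb 24 16:00:00 2000; "
);

print "$record{STUDENT_USERNAME} proposes $record{PROJECT_NAME}\n";
print "partner: ", ($record{PARTNER} eq '' ? '(none)' : $record{PARTNER}), "\n";
```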