Scripting Languages Course 5 Diana Trandab ă ț Master in Computational Linguistics - 1 st year 2015-2016.

Slides:



Advertisements
Similar presentations
Session 3BBK P1 ModuleApril 2010 : [#] Regular Expressions.
Advertisements

Perl & Regular Expressions (RegEx)
Regular Expression Original Notes by Song Guo. What Regular Expressions Are Exactly - Terminology a regular expression is a pattern describing a certain.
CS 898N – Advanced World Wide Web Technologies Lecture 8: PERL Chin-Chih Chang
PERL Part 3 1.Subroutines 2.Pattern matching and regular expressions.
Regular Expressions.
7.1 Some Eclipse Tips Try Ctrl+Shift+L Quick help (keyboard shortcuts) Try Ctrl+SPACE Auto-complete Source→Format ( Ctrl+Shift+F ) Correct indentation.
More Regular Expressions. List/Scalar Context for m// Last week, we said that m// returns ‘true’ or ‘false’ in scalar context. (really, 1 or 0). In list.
Regular Expressions. What are regular expressions? A means of searching, matching, and replacing substrings within strings. Very powerful (Potentially)
Scripting Languages Chapter 8 More About Regular Expressions.
Regular Expression A regular expression is a template that either matches or doesn’t match a given string.
Regular Language & Expressions. Regular Language A regular language is one that a finite state machine (fsm) will accept. ‘Alphabet’: {a, b} ‘Rules’:
Last Updated March 2006 Slide 1 Regular Expressions.
Lecture 7: Perl pattern handling features. Pattern Matching Recall =~ is the pattern matching operator A first simple match example print “An methionine.
1 Lab Session-III CSIT-120 Fall 2000 Revising Previous session Data input and output While loop Exercise Limits and Bounds Session III-B (starts on slide.
Language Recognizer Connecting Type 3 languages and Finite State Automata Copyright © – Curt Hill.
Regular Expression Darby Tien-Hao Chang (a.k.a. dirty) Department of Electrical Engineering, National Cheng Kung University.
Programming Perl in UNIX Course Number : CIT 370 Week 4 Prof. Daniel Chen.
 Text Manipulation and Data Collection. General Programming Practice Find a string within a text Find a string ‘man’ from a ‘A successful man’
PHP Workshop ‹#› Data Manipulation & Regex. PHP Workshop ‹#› What..? Often in PHP we have to get data from files, or maybe through forms from a user.
Computer Programming for Biologists Class 5 Nov 20 st, 2014 Karsten Hokamp
Input Validation with Regular Expressions COEN 351.
Perl and Regular Expressions Regular Expressions are available as part of the programming languages Java, JScript, Visual Basic and VBScript, JavaScript,
Matching in list context (Chapter 11 = ($str =~ /pattern/); This stores the list of the special ($1, $2,…) capturing variables into the.
Scripting Languages Diana Trandab ă ț Master in Computational Linguistics - 1 st year
CPTG286K Programming - Perl Chapter 7: Regular Expressions.
Overview A regular expression defines a search pattern for strings. Regular expressions can be used to search, edit and manipulate text. The pattern defined.
When you read a sentence, your mind breaks it into tokens—individual words and punctuation marks that convey meaning. Compilers also perform tokenization.
Regular Expressions. Overview Regular expressions allow you to do complex searches within text documents. Examples: Search 8-K filings for restatements.
Module 6 – Generics Module 7 – Regular Expressions.
Regular Expressions for PHP Adding magic to your programming. Geoffrey Dunn
Regular Expressions in Perl CS/BIO 271 – Introduction to Bioinformatics.
Introduction to Perl Pawel Sirotkin , Riga.
©Brooks/Cole, 2001 Chapter 9 Regular Expressions ( 정규수식 )
CS346 Regular Expressions1 Pattern Matching Regular Expression.
May 2008CLINT-LIN Regular Expressions1 Introduction to Computational Linguistics Regular Expressions (Tutorial derived from NLTK)
CS 330 Programming Languages 10 / 02 / 2007 Instructor: Michael Eckmann.
Copyright © Curt Hill Regular Expressions Providing a Search Pattern.
1 Lecture 9 Shell Programming – Command substitution Regular expressions and grep Use of exit, for loop and expr commands COP 3353 Introduction to UNIX.
Standard Types and Regular Expressions CS 480/680 – Comparative Languages.
Introduction to Programming the WWW I CMSC Winter 2004 Lecture 13.
Introduction to Programming the WWW I CMSC Winter 2003 Lecture 17.
Pattern Matching: Simple Patterns. Introduction Programmers often need to scan a file, directory, etc. for a specific substring. –Find all files that.
CS 330 Programming Languages 09 / 30 / 2008 Instructor: Michael Eckmann.
OOP Tirgul 11. What We’ll Be Seeing Today  Regular Expressions Basics  Doing it in Java  Advanced Regular Expressions  Summary 2.
Lesson 4 String Manipulation. Lesson 4 In many applications you will need to do some kind of manipulation or parsing of strings, whether you are Attempting.
Regular Expressions Copyright Doug Maxwell (
Scripting Languages Perl – course 3
RE Tutorial.
Regular Expressions.
Looking for Patterns - Finding them with Regular Expressions
Lecture 19 Strings and Regular Expressions
Regular Expressions in Perl
Regular Expressions and perl
Lecture 9 Shell Programming – Command substitution
Intro to PHP & Variables
Sentinel logic, flags, break Taken from notes by Dr. Neil Moore
Variables In programming, we often need to have places to store data. These receptacles are called variables. They are called that because they can change.
Sentinel logic, flags, break Taken from notes by Dr. Neil Moore
Number and String Operations
Advanced Find and Replace with Regular Expressions
PHP.
CSCI 431 Programming Languages Fall 2003
CSE 1020:Software Development
Building Java Programs
Data Manipulation & Regex
Chapter 3: Selection Structures: Making Decisions
Boolean Expressions to Make Comparisons
Chapter 3: Selection Structures: Making Decisions
ADVANCE FIND & REPLACE WITH REGULAR EXPRESSIONS
Presentation transcript:

Scripting Languages Course 5 Diana Trandab ă ț Master in Computational Linguistics - 1 st year

2 What is a regular expression? A regular expression (regex) is simply a way of describing text. Regular expressions are built up of small units which can represent the type and number of characters in the text Regular expressions can be very broad (describing everything), or very narrow (describing only one pattern). Very useful for text processing!!!

Perl regular expressions 3  Perl matching character: =~  Perl non-matching character: !~  The regular expression must be in backslashes: /regex/  The program below accepts any password that contains the characters „42“ anywhere # A "while" loop with regular expressions while ($password !~ /42/) {# While the entered line doesn’t contain 42 print "Access denied.\n"; print "Please enter password: "; $password = <>; chomp($password); } print "Correct password! Welcome.";

Perl regular expressions 4 if you want to match a “/”, you need to escape it \/ (backslash, forward slash) you can change the delimiter to some other character, but you probably won’t need to m|pattern| any ‘modifiers’ to the pattern go after the last / i : case insensitive /[a-z]/i o : compile once g : match in list context (global) m or s : match over multiple lines

Perl regular expressions 5  Simple string: some text  One of a number of symbols: [aA]  Matches a or A  Also possible: [tT]he, matching the or The  One of a continuous string of symbols: [a-h][1-8]  Matches any two-character string from a1 to h8  Special characters  ^ matches the beginning of a line  $ matches the end of a line

Exercise Change the password program to allow only passwords which contain uppercase letters, e.g. password # rejected passWORD # accepted

Quantifiers 7  More special characters  Wildcard: the dot. Matches any single character  b.d matches bad, bed, bid, bud…  Don‘t forget: it also matches forbid, badly…  + matches one or more of the previous character  re+d matches red and reed (and also reeed and so on!)  * matches zero or more occurrences of the previous character  bel* matches be, bel and bell (and belll…)  ? matches zero or one occurrences of the previous character  soo?n Matches son or soon

Character classes 8  \d: digits  Rule \d+ matches Rule 1, Rule 2,..., Rule  \w: “word characters” – letters, digits, _  \w \w – any two “words” separated by a blank  \s: any whitespace (blanks, tabs)  ^\s+\d – any line where the first character is a digit  Capitalize the symbols to get the opposite  \S is anything but whitespace, \D are non-digits…

9 Anchors Anchors force a pattern match to a certain location ^ : start matching at beginning of string $ : start matching at end of string \b : match at word boundary (between \w and \W) Example: /^\d\d\d\d-\d\d\d-\d\d\d$/ : matches only complete phone numbers of the form or The same as /^\d{4}-\d{3}-\d{3}$/

10 Example if your pattern is /\d\d\d-\d\d\d\d/ – You could match – But not: j-5555

11 Grouping You can group atoms together with parentheses /cat+/ matches cat, catt, cattt /(cat)+/ matches cat, catcat, catcatcat Use as many sets of parentheses as you need

12 Alternation You can specify patterns which match either one thing or another. – /cat|dog/ matches either ‘cat’ or ‘dog’ – /ca(t|d)og/ matches either ‘catog’ or ‘cadog’

13 Precedence Just like with mathematical operations, regular expressions have an order of precedence – Highest : Parentheses and grouping – Next : Repetition (+,*, {4}) – Next : Sequence (/abc/) – Lowest : Alternation ( | )

Exercise Write a program which asks the user for his address. Check if the address is syntactically correct. Possible rules: – Must contain character – At least one symbol before it – Must contain a dot – At least two symbols and. – At least two symbols after. – No fancy symbols like {§* Do you accept addresses with more than one dot?

15 Remembering Stuff Being able to match patterns is good, but limited. We want to be able to keep portions of the regular expression for later. – Example: $string = ‘phone: ’ We want to keep the phone number only Just figuring out that the string contains a phone number is insufficient, we need to keep the number as well.

16 Memory Parentheses (pattern memory) Since we almost always want to keep portions of the string we have matched, there is a mechanism built into perl. Anything in parentheses within the regular expression is kept in memory. $string = 'phone: '; if ($string =~ /^phone\:(.+)$/){ print "$1“; } Perl knows we want to keep everything that matches ‘.+’ in the above pattern the : is preceded by an escape character

17 Getting at pattern memory Perl stores the matches in a series of default variables. The first parentheses set goes into $1, second into $2, etc. – This is why we can’t name variables ${digit} – Memory variables are created only in the amounts needed. If you have three sets of parentheses, you have ($1,$2,$3). – Memory variables are created for each matched set of parentheses. If you have one set contained within another set, you get two variables (inner set gets lowest number) – Memory variables are only valid in the current scope

18 An example of pattern memory $string = shift; if ($string =~ /^phone\:(\d{3}-\d{4})$/){ $phone_number = $1; } else { print “Enter a phone number!\n” }

19 Some tricky bits You can assign pattern memory directly to your own variable names: – ($phone) = $value =~ /^phone\:(.+)$/; Read from right to left. Bind (apply) this pattern to the value in $value, and assign the results to the list on the left – ($front,$back) = $value =~ /^phone\:(\d{3})-(\d{4})/;

20 List or scalar context? A pattern match returns 1 or 0 (true or false) in a scalar context, and a list of matches in array context. There are a lot of functions that do different things depending on whether they are used in scalar or list context. $count # returns the number of elements $revString = reverse $string # returns a reversed = # returns a reversed list

21 Practical Example of Context $phone = $string =~ /^.+\:(.+)$/; – $phone contains 1 if pattern matches, 0 if not – scalar context!!! ($phone) = $string =~ /^.+\:(.+)$/; – $phone contains the matched string – list context!!!

22 Finding all instances of a match Use the ‘g’ modifier to the regular expression = $sequence =~ /(TATTA)/g; – think g for global – Returns a list of all the matches (in order), and stores them in the array – If you have more than one pair of parentheses, your array gets values in sets ($1,$2,$3,$1,$2,$3...)

23 Perl is Greedy In addition to taking all your time, perl regular expressions also try to match the largest possible string which fits your pattern – /ga+t/ matches gat, gaat, gaaat – ‘Doh! No doughnuts left!’ =~ /(d.+t)/ $1 contains ‘doughnuts left’ If this is not what you wanted to do, use the ‘?’ modifier – /(d.+?t)/ # match as few characters between d and t as you can

24 Making parenthesis forgetful Sometimes you need parenthesis to make your regex work, but you don’t actually want to keep the results. You can still use parentheses for grouping. /(?:group)/ – yet another instance of character reuse. \d? means 0 or 1 instances \d+? means the fewest non zero number of digits (don’t be greedy) (?:group) means look for the group of atoms in the string, but don’t remember it.

25 Substitute function s/pattern1/pattern2/; Looks kind of like a regular expression – Patterns constructed the same way Inherited from previous languages, so it can be a bit different. – Changes the variable it is bound to!

26 Using s Substituting one word for another – $string =~ s/dogs/cats/; If $string was “I love dogs”, it is now “I love cats” Removing trailing white space – $string =~ s/\s+$//; If $string was ‘ATG ‘, it is now ‘ATG’ Adding 10 to every number in a string – $string =~ s/(\d+)/$1+10/ge; If $string was “I bought 5 dogs at 2 bucks each”, it is now: – “I bought 15 dogs at 12 bucks each” Note pattern memory!! g means global (just like a regex) e is special to s, evaluate the expression on the right

27 tr function translate or transliterate tr/characterlist1/characterlist2/; Even less like a regular expression than s substitutes characters in the first list with characters in the second list $string =~ tr/a/A/; # changes every ‘a’ to an ‘A’ – No need for the g modifier when using tr.

28 Using tr Creating complimentary DNA sequence – $sequence =~ tr/atgc/TACG/; Sneaky Perl trick for the day – tr does two things. 1. changes characters in the bound variable 2. Counts the number of times it does this – Super-fast character counter $a_count = $sequence =~ tr/a/a/; replaces an ‘a’ with an ‘a’ (no net change), and assigns the result (number of substitutions) to $a_count

Exercise 29 Make a text file and fill it with a Wikipedia article – Count the number of definite and indefinite articles – Count the number of numbers and digits (using regular expresions) – Insert a tag before every number

Great! See you next time!