

7.1 Last time on: Pattern Matching

7.2 Pattern matching

Finding a substring (match) anywhere in a string:
    if ($line =~ m/he/) ...
Remember to use a slash ( / ) and not a backslash. The test is true for "hello" and for "the cat" but not for "good bye" or "Hercules".

You can ignore letter case by adding an "i" after the pattern:
    m/he/i
matches "hello", "Hello" and "hEHD".

There is also a negated form of the match operator:
    if ($line !~ m/he/) ...
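The three forms above can be tried side by side. This is a minimal sketch using the slide's own sample strings; the array names are made up for illustration.

```perl
use strict;
use warnings;

# Sample strings taken from the slide.
my @samples = ("hello", "the cat", "good bye", "Hercules");

my (@plain, @nocase, @negated);
for my $line (@samples) {
    push @plain,   $line if $line =~ m/he/;    # case-sensitive match
    push @nocase,  $line if $line =~ m/he/i;   # same pattern, ignoring case
    push @negated, $line if $line !~ m/he/;    # true when there is NO match
}

print "m/he/  : ", join(", ", @plain),   "\n";
print "m/he/i : ", join(", ", @nocase),  "\n";
print "!~     : ", join(", ", @negated), "\n";
```

Note that "Hercules" only matches once the i modifier is added, because its "He" starts with a capital letter.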

7.3 Pattern matching

Replacing a substring (substitute):
    $line = "the cat on the tree";
    $line =~ s/he/hat/;
$line becomes "that cat on the tree".

To replace all occurrences of a substring, add a "g" (for "globally"):
    $line = "the cat on the tree";
    $line =~ s/he/hat/g;
$line becomes "that cat on that tree".
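A small sketch comparing the two substitutions on the same input. Copying the string before substituting (the `(my $copy = $original) =~ s/.../.../` idiom) is an addition here, used only so that both results can be shown together.

```perl
use strict;
use warnings;

my $original = "the cat on the tree";

(my $once = $original) =~ s/he/hat/;    # replaces only the first "he"
(my $all  = $original) =~ s/he/hat/g;   # replaces every "he"

print "$once\n";   # that cat on the tree
print "$all\n";    # that cat on that tree
```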

7.4 Single-character patterns

    m/./               matches any character except "\n"

You can also ask for one of a group of characters:
    m/[abc]/           matches "a" or "b" or "c"
    m/[a-z]/           matches any lowercase letter
    m/[a-zA-Z]/        matches any letter
    m/[a-zA-Z0-9]/     matches any letter or digit
    m/[a-zA-Z0-9_]/    matches any letter or digit or an underscore
    m/[^abc]/          matches any character except "a" or "b" or "c"
    m/[^0-9]/          matches any character except a digit

7.5 Single-character patterns

Perl provides predefined character classes:
    \d    a digit (same as [0-9])
    \w    a "word" character (same as [a-zA-Z0-9_])
    \s    a space character (same as [ \t\n\r\f])
And their negatives:
    \D    anything but a digit
    \W    anything but a word character
    \S    anything but a space character

To force the pattern to be at the beginning of the string, add a "^":
    m/^>/       matches only strings that begin with ">"
"$" forces the end of the string:
    m/\.pl$/    matches only strings that end with ".pl"
And together:
    m/^\s*$/    matches all lines that contain no non-space characters
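The character classes and anchors can be checked against a handful of strings. This is only a sketch: the sample lines and the hash keys are hypothetical, chosen to exercise each pattern once or twice.

```perl
use strict;
use warnings;

# Hypothetical sample lines, not taken from any real file.
my @lines = (">seq1 adenovirus", "script.pl", "   \t", "ACGT", "x9_y");

my %hits;
for my $line (@lines) {
    push @{ $hits{fasta_header} }, $line if $line =~ m/^>/;      # begins with ">"
    push @{ $hits{perl_file} },    $line if $line =~ m/\.pl$/;   # ends with ".pl"
    push @{ $hits{blank} },        $line if $line =~ m/^\s*$/;   # nothing but whitespace
    push @{ $hits{has_digit} },    $line if $line =~ m/\d/;      # contains a digit
    push @{ $hits{all_word} },     $line if $line =~ m/^\w+$/;   # only letters, digits, "_"
}

for my $name (sort keys %hits) {
    print "$name: ", join(" | ", map { "\"$_\"" } @{ $hits{$name} }), "\n";
}
```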

7.6 Repetitive patterns

A pattern followed by * means zero or more repetitions of that pattern:
    m/ab*c/        matches "abc", "ac", "abbbbc"
+ means one or more repetitions:
    m/ab+c/        matches "abc", "abbbbc" but not "ac"
? means zero or one repetitions:
    m/ab?c/        matches "ac" or "abc"
Generally, use {} for a certain number of repetitions, or a range:
    m/ab{3}c/      matches "abbbc"
    m/ab{3,6}c/    matches "a", then "b" 3 to 6 times, then "c"
Use parentheses to mark more than one character for repetition:
    m/h(el)*lo/    matches "hello", "hlo", "helelello"
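The repetition operators can be tested in a small table-driven loop. This is a sketch, and one change from the slide is an assumption of the example: each pattern is wrapped in ^ and $ so that the whole string has to match, which makes the "no match" case visible.

```perl
use strict;
use warnings;

# Each entry pairs a pattern from the slide (anchored with ^ and $) with a test string.
my @checks = (
    [ qr/^ab{3}c$/,   "abbbc"     ],
    [ qr/^ab{3,6}c$/, "abbbbbc"   ],
    [ qr/^ab?c$/,     "ac"        ],
    [ qr/^ab+c$/,     "ac"        ],   # + needs at least one "b", so this fails
    [ qr/^ab*c$/,     "abbbbc"    ],
    [ qr/^h(el)*lo$/, "helelello" ],
);

my @results;
for my $check (@checks) {
    my ($pattern, $string) = @$check;
    my $matched = ($string =~ $pattern) ? 1 : 0;
    push @results, $matched;
    printf "%-22s %-10s %s\n", $pattern, $string, $matched ? "match" : "no match";
}
```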

7.7 Your favorite GenBank examples

Let's take a look at the adeno12.gb GenBank record.

Matching the annotation of a coding sequence (CDS) in a GenBank DNA/RNA record:
    m/^\s*CDS\s+\d+\.\.\d+/
Allowing also a CDS on the minus strand of the DNA, written as complement( ):
    m/^\s*CDS\s+(complement\()?\d+\.\.\d+\)?/

Note: we could just use m/^\s*CDS/. It is a question of how strict we want to be about the format; sometimes we want to make sure.

7.8 Extracting part of a pattern

We can extract the parts of the string that matched the parts of the pattern marked by parentheses:
    my $line = " CDS 87..1109";
    if ($line =~ m/CDS\s+(\d+)\.\.(\d+)/) {
        print "regexp:$1,$2\n";
        my $start = $1;
        my $end = $2;
    }
Output:
    regexp:87,1109
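A runnable version of the extraction, as a sketch. The exact input line is an assumption: only "CDS" followed by "87..1109" is implied by the printed result, and the spacing is a guess.

```perl
use strict;
use warnings;

# Assumed input: the spacing is a guess; only "CDS" and "87..1109" matter here.
my $line = "     CDS             87..1109";

my ($start, $end);
if ($line =~ m/CDS\s+(\d+)\.\.(\d+)/) {
    # Copy $1 and $2 right away: the next successful match overwrites them.
    ($start, $end) = ($1, $2);
    print "regexp:$start,$end\n";   # regexp:87,1109
} else {
    print "no CDS range found\n";
}
```

Testing the match inside an if matters: when the match fails, $1 and $2 keep whatever values an earlier successful match left in them.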

7.9 More RegEx Coach

Use the i and g tick boxes for m//i and m//g.
Use the numbered buttons (1, ...) to see what will be captured in $1, $2, ... $10.
In selection mode you can see the match for your selection.

7.10 This week on: More Pattern Matching

7.11 Enforce word start/end

We can enforce a word boundary, similar to enforcing line start/end with ^ and $:
    m/\bJovi/    will match "Jovi" and "bon Jovi" but not "bonJovi"
    m/fred\b/    will match "fred", "fred." and "milfred" but not "fredrick"
\B is the reverse:
    m/fred\B/    will match "fredrick" but not "fred"
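The boundary examples can be checked with grep, which keeps the strings for which the match is true. A minimal sketch using the slide's own strings; the array names are made up.

```perl
use strict;
use warnings;

my @names = ("Jovi", "bon Jovi", "bonJovi");
my @starts_word = grep { m/\bJovi/ } @names;    # "Jovi" must begin a word

my @words = ("fred", "fred.", "milfred", "fredrick");
my @ends_word = grep { m/fred\b/ } @words;      # "fred" must end a word
my @inside    = grep { m/fred\B/ } @words;      # "fred" must NOT end a word

print join(", ", @starts_word), "\n";
print join(", ", @ends_word), "\n";
print join(", ", @inside), "\n";
```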

7.12 Patterns are greedy

If a pattern can match a string in several ways, it takes the maximal substring. For example, in
    $line = "fred xxxxxxxxxx john";
a substitution on a run of x's replaces the whole run, not just part of it.

You can make a pattern minimal by adding a ? after any of * / + / ? / {}. With the minimal form, only one x is replaced.
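The substitution used on this slide did not survive in the transcript, so the sketch below assumes it was something like s/x+// and s/x+?//. Both patterns are assumptions, picked only to show greedy against minimal matching on the slide's string.

```perl
use strict;
use warnings;

my $line = "fred xxxxxxxxxx john";

# Assumed patterns: the slide's own substitutions are not recoverable.
(my $greedy  = $line) =~ s/x+//;    # x+ takes the whole run of x's
(my $minimal = $line) =~ s/x+?//;   # x+? takes as few as possible: a single x

print "greedy:  '$greedy'\n";
print "minimal: '$minimal'\n";
```

The greedy version removes all ten x's and leaves two spaces between the names; the minimal version removes exactly one x.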

7.13 Patterns are greedy

If a pattern can match a string in several ways, it takes the maximal substring:
    $line = " JOURNAL J. Virol. 68 (1), (1994)";
    $line =~ m/^\s*JOURNAL.*\((\d+)\)/;
$1 is "1994".

Using the minimal pattern, by adding a ?:
    $line = " JOURNAL J. Virol. 68 (1), (1994)";
    $line =~ m/^\s*JOURNAL.*?\((\d+)\)/;
$1 is "1".

7.14 Multiple choice (or)

If one of several patterns may be acceptable in a pattern, we can write:
    m/CDS\s(\d+\.\.\d+|\d+-\d+|\d+,\d+)/
This will match a CDS range written with "..", with "-" or with ",", for example "CDS 231,345".
Note: here $1 will be the whole range in each case, for example "231,345".
Note: this is similar to
    m/CDS\s\d+(\.\.|-|,)\d+/
where only the separator is inside the parentheses.
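A sketch comparing the two alternation patterns. The CDS lines are hypothetical (the last one uses a separator that neither pattern accepts), and the array names are made up.

```perl
use strict;
use warnings;

# Hypothetical CDS lines, one per separator style, plus one that should not match.
my @lines = ("CDS 231..345", "CDS 231-345", "CDS 231,345", "CDS 231:345");

my (@whole_range, @separator);
for my $line (@lines) {
    # Alternation over three complete range formats: $1 is the whole range.
    push @whole_range, $1 if $line =~ m/CDS\s(\d+\.\.\d+|\d+-\d+|\d+,\d+)/;
    # Alternation over the separator only: $1 is just the separator.
    push @separator, $1 if $line =~ m/CDS\s\d+(\.\.|-|,)\d+/;
}

print "ranges:     ", join("  ", @whole_range), "\n";
print "separators: ", join("  ", @separator), "\n";
```

Both patterns accept the same three lines; they differ only in what ends up in $1.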

7.15 Variables in patterns

Variables can be interpolated into regular expressions, as in double-quoted strings:
    $name = "Yossi";
    $line =~ m/^$name\d+/
This pattern will match "Yossi25" and "Yossi45".

Special patterns can also be given in a variable: if $name was "Yos+i", then the pattern could match "Yosi5" and "Yossssi5".
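A sketch of interpolation with a special character in the variable. The sample strings are hypothetical. The second pattern uses \Q...\E, a standard Perl escape that is not covered in the slides, to make the variable's content match literally.

```perl
use strict;
use warnings;

# Hypothetical sample strings.
my @lines = ("Yossi25", "Yosi5", "Yossssi5", "Yos+i7");

my $name = "Yos+i";   # the "+" acts as a quantifier once interpolated

# Interpolated as a pattern: "s+" means one or more "s".
my @as_pattern = grep { m/^$name\d+/ } @lines;

# \Q...\E quotes the special characters, so "+" must appear literally.
my @as_literal = grep { m/^\Q$name\E\d+/ } @lines;

print "as pattern: @as_pattern\n";
print "as literal: @as_literal\n";
```

Quoting with \Q...\E is worth considering whenever the variable comes from user input and is meant as plain text rather than as a pattern.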

7.16 Variables in patterns

Say we need to search some BLAST output:
    ref|NT_ |Mm15_39661_34 Mus musculus chromosome 15 genomic e-45
    ref|NT_ |Mm6_39393_34 Mus musculus chromosome 6 genomic c
    ref|NT_ |Mm9_39517_34 Mus musculus chromosome 9 genomic c
    ref|NT_ |Mm8_39502_34 Mus musculus chromosome 8 genomic c
for the score of a hit that is named by the user. We can write:
    m/^ref\|$hitName.*\s(\d+)\s+\S+\s*$/
The | after "ref" has to be escaped, because an unescaped | means "or". The \s before (\d+) stops the greedy .* from swallowing the leading digits of the score.
If $hitName was "NT_039353", we get $1 = 38.
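A sketch of the score lookup on made-up data. The accession numbers, scores and E-values below are invented for illustration and are not the values from the slide. The pattern escapes the pipe after "ref" and requires whitespace before the score; without that \s, a greedy .* would leave only the last digit of the score for (\d+).

```perl
use strict;
use warnings;

# Made-up BLAST-style hit lines: accession, score and E-value are illustrative only.
my @hits = (
    "ref|NT_000001.1|Mm15_example Mus musculus chromosome 15 genomic    186   2e-45",
    "ref|NT_000002.1|Mm6_example Mus musculus chromosome 6 genomic       38   0.71",
);
my $hitName = "NT_000002";   # hypothetical name, as if typed by the user

my $score;
for my $hit (@hits) {
    # \| is a literal pipe; \s(\d+) captures the whole score, not just its last digit.
    if ($hit =~ m/^ref\|$hitName.*\s(\d+)\s+\S+\s*$/) {
        $score = $1;
        print "score of $hitName: $score\n";
    }
}
```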

7.17 split (revisited)

The split function actually treats its first parameter as a regular expression, so a line can be split on any run of whitespace:
    split(/\s+/, $line)
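Because the separator is a regular expression, the same call can split on several kinds of delimiter. This is a sketch: the input string is an assumption, loosely based on the fragment shown on the slide.

```perl
use strict;
use warnings;

# Assumed input: fields separated by runs of whitespace, with one ";" in between.
my $line = "13 5;3   -23\t8";

my @by_space = split(/\s+/, $line);      # split on whitespace only
my @by_both  = split(/[\s;]+/, $line);   # split on whitespace or semicolons

print join("|", @by_space), "\n";   # 13|5;3|-23|8
print join("|", @by_both), "\n";    # 13|5|3|-23|8
```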

7.18 More RegEx Coach

Choose the split window in the Regex Coach to see how the string will be split.
Each split is marked by |

7.19 Assignment of matching into an array

All the matches from $1, $2, ... can be saved in an array by assigning the result of the match to a list. For a line holding the range 4815-5781, matching m/(\d+)-(\d+)/ gives the list ("4815", "5781"):
    my ($start, $end) = ($line =~ m/(\d+)-(\d+)/);
$start is 4815 and $end is 5781.

7.20 Global matching for repetitive patterns

Global matching: all instances in the line will be matched. All the matches from $1, $2, ... can again be saved in an array; for a line holding two ranges, a global match returns all the extracted parts: ("4815", "5781", "5825", "6153").
This can be very useful for finding repetitive patterns in a sequence.
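A sketch of matching in list context, with and without the g modifier. The input line is an assumption, rebuilt from the numbers that appear in the slide's results; the pattern m/(\d+)-(\d+)/ is the one from the previous slide.

```perl
use strict;
use warnings;

# Assumed input, rebuilt from the numbers shown in the slide's results.
my $line = "4815-5781, 5825-6153";

my @first = ($line =~ m/(\d+)-(\d+)/);     # first match only
my @all   = ($line =~ m/(\d+)-(\d+)/g);    # every match in the line
my ($start, $end) = ($line =~ m/(\d+)-(\d+)/);

# In a while condition, m//g finds one match per iteration.
my @pairs;
while ($line =~ m/(\d+)-(\d+)/g) {
    push @pairs, "$1..$2";
}

print "first: @first\n";
print "all:   @all\n";
print "start=$start end=$end\n";
print "pairs: @pairs\n";
```

The while form is convenient when each match needs its own processing, since $1 and $2 refer to the current hit inside the loop body.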

7.21 Using memories in substitution

The extracted parts of the pattern can be used inside a substitution. For a CDS line in $line,
    $line =~ s/(\d+)\.\.(\d+)/$1-$2/;
rewrites the range from the "start..end" form to the "start-end" form.

    $line = "I'm John Lennon";
    $line =~ s/([A-Z][a-z]+)\s+([A-Z][a-z]+)/$1_$2/;
$line becomes "I'm John_Lennon".

7.22 Using memories in substitution

The extracted parts can be used in the substitution in any order. For a CDS line in $line,
    $line =~ s/(\d+)\.\.(\d+)/$2..$1/;
swaps the two numbers of the range. For a "CDS join( , )" line that holds several ranges, adding g swaps the numbers in every range:
    $line =~ s/(\d+)\.\.(\d+)/$2..$1/g;
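A sketch of both uses of memories in a substitution. The CDS coordinates are made up for illustration, and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical CDS line: the coordinates are made up for illustration.
my $line = "CDS join(100..250,300..420)";

(my $dashed  = $line) =~ s/(\d+)\.\.(\d+)/$1-$2/;     # first range only, ".." becomes "-"
(my $swapped = $line) =~ s/(\d+)\.\.(\d+)/$2..$1/g;   # every range, numbers swapped

my $name = "I'm John Lennon";
$name =~ s/([A-Z][a-z]+)\s+([A-Z][a-z]+)/$1_$2/;      # join two capitalized words

print "$dashed\n";    # CDS join(100-250,300..420)
print "$swapped\n";   # CDS join(250..100,420..300)
print "$name\n";      # I'm John_Lennon
```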

7.23 Using memories in matching

The extracted parts can also be used inside the same match:
    m/(\d+)-(\d+),\2-\d+/
will match two ranges only when the second range starts with the number that ends the first.
    m/(.)\1+/
will match any character that is repeated at least twice.

    $line = "kasjfjjjjsja";
    if ($line =~ m/((.)\2+)/) {
        print "regexp: $1, $2\n";
    }
Output:
    regexp: jjjj, j

Inside the pattern, only \2 (not $2) gets the currently extracted part. ($2 refers to the previous match.)
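A sketch of backreferences in action. The first part repeats the slide's example; the two range strings in the second part are hypothetical, chosen so that one satisfies the \2 condition and one does not.

```perl
use strict;
use warnings;

my $line = "kasjfjjjjsja";
my ($run, $char);
if ($line =~ m/((.)\2+)/) {
    ($run, $char) = ($1, $2);   # $1 = the whole run, $2 = the repeated character
    print "regexp: $run, $char\n";
}

# Hypothetical ranges: the second range must start where the first one ends.
my @candidates = ("10-20,20-30", "10-20,25-30");
my @adjacent = grep { m/(\d+)-(\d+),\2-\d+/ } @candidates;
print "adjacent: @adjacent\n";
```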

7.24 Position of match

Perl saves the positions of matches in the special arrays @- and @+.
The variables $-[0] and $+[0] are the start and end of the entire match.
The rest hold the starts and ends of the memories (brackets). For a CDS line in $line, after
    $line =~ m/CDS\s+(\d+)\.\.(\d+)/;
printing the two arrays shows the starts and the ends of the whole match and of each memory.
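A sketch that prints the offsets stored in @- and @+. The input line and its spacing are assumptions; offsets count characters from 0.

```perl
use strict;
use warnings;

# Assumed input line: the spacing is illustrative.
my $line = "  CDS   87..1109";

my (@starts, @ends, $first_group);
if ($line =~ m/CDS\s+(\d+)\.\.(\d+)/) {
    @starts = @-;   # [0] = start of the whole match, [1] and [2] = starts of the groups
    @ends   = @+;   # [0] = end of the whole match,   [1] and [2] = ends of the groups
    print "starts: @starts\n";   # starts: 2 8 12
    print "ends:   @ends\n";     # ends:   16 10 16

    # The offsets can be fed to substr to cut out the first group again.
    $first_group = substr($line, $starts[1], $ends[1] - $starts[1]);
    print "first group: $first_group\n";
}
```

Each end offset points one past the last matched character, so end minus start gives the length of the match.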

7.25 Transliterate

A special type of substitution "transliterates" (i.e. replaces) a set of characters with a different set, tr/from/to/:
    $seq = "AGCATCGA";
    $seq =~ tr/ATGC/TACG/;
$seq is now "TCGTAGCT".
(What is the next step in order to get the reverse complement of the sequence?)

NOTE: each single character in "from" is replaced by its corresponding character in "to".

Guess what this will do:
    $lines =~ tr/A-Z/a-z/;
(It changes all letters to lowercase.)

7.26 Transliterate

You can get the number of changes as the return value of tr///:
    $seq = "AGCATCAG";
    $count = ($seq =~ tr/GC/CG/);
$count is 4 and $seq is "ACGATGAC".

    $count = ($sky =~ tr/*/*/);
counts the stars in $sky.
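A sketch that combines transliteration and counting on one sequence. Using the built-in reverse function to finish the reverse complement is one possible approach, not necessarily the one intended in class; the variable names are made up.

```perl
use strict;
use warnings;

my $seq = "AGCATCGA";

# Complement: each base is swapped for its partner, on a copy of the sequence.
(my $complement = $seq) =~ tr/ATGC/TACG/;

# reverse in scalar context reverses the characters of the string.
my $reverse_complement = reverse $complement;

# Mapping G and C to themselves counts them without changing $seq.
my $gc_count = ($seq =~ tr/GC/GC/);

print "complement:         $complement\n";           # TCGTAGCT
print "reverse complement: $reverse_complement\n";   # TCGATGCT
print "G+C count:          $gc_count\n";             # 4
```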

7.27 Class exercise 7a

1. Get a DNA sequence from the user and change every A and G to U (pUrines) and every C and T to Y (pYrimidines).
2. Like question 1, but in addition print the number of pyrimidines (Cs and Ts).

Continuing with the GenBank record of the adenovirus genome:
3*. Get the year of publication from the user (using <STDIN>), find in the adenovirus record the papers published in that year and print their JOURNAL lines. For example, if the user types "1994", print "J. Virol. 68 (1), (1994)" but not "J. Virol. 67 (2), (1993)".