

6ex.1 Pattern Matching

6ex.2 Pattern matching
We often want to find a certain piece of information within a file:
1. Find all names that end with “man” in the phone book
2. Extract the accession, description and score of every hit in the output of BLAST
3. Extract the coordinates of all open reading frames from the annotation of a genome
All these examples are patterns in the text.
* We will see a wide range of Perl's pattern-matching capabilities, but much more is available. I strongly recommend using documentation, tutorials and Google to expand your horizons.
Phone book: Ariel Beltzman; Eyal Privman; Rakefet Shultzman
BLAST output: Sequences producing significant alignments: Score (bits), E Value
ref|NT_ |Mm15_39661_34 Mus musculus chromosome 15 genomic e-45
ref|NT_ |Mm6_39393_34 Mus musculus chromosome 6 genomic c
ref|NT_ |Mm9_39517_34 Mus musculus chromosome 9 genomic c
ref|NT_ |Mm8_39502_34 Mus musculus chromosome 8 genomic c
Genbank annotation: CDS; CDS complement( )

6ex.3 Pattern matching
Finding a substring (match):
    if ($line =~ m/he/) ...
Remember to use a slash (/) and not a backslash (\).
This will be true for “hello” and for “the cat” but not for “good bye” or “Hercules”.
You can ignore the case of letters by adding an “i” after the pattern:
    m/he/i
(matches “hello”, “Hello” and “hEHD”)
There is a negative form of the match operator:
    if ($line !~ m/he/) ...
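The three forms above can be tried side by side. This is a minimal sketch; the helper name classify and its return strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report how a line relates to the pattern "he".
sub classify {
    my ($line) = @_;
    return "match"             if $line =~ m/he/;    # case-sensitive match
    return "match only with i" if $line =~ m/he/i;   # matches when case is ignored
    return "no match";                               # here $line !~ m/he/i is true
}

foreach my $line ("hello", "the cat", "good bye", "Hercules") {
    print "$line: ", classify($line), "\n";
}
```

“Hercules” is reported as matching only with the “i” modifier, because its “He” starts with a capital letter.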

6ex.4 Pattern matching
Replacing a substring (substitute):
    $line = "the cat on the tree";
    $line =~ s/he/hat/;
$line will be turned to “that cat on the tree”
To replace all occurrences of a substring add a “g” (for “globally”):
    $line = "the cat on the tree";
    $line =~ s/he/hat/g;
$line will be turned to “that cat on that tree”
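A small sketch of the two substitutions as reusable functions. The sub names are hypothetical; each works on a copy of its argument, so the caller's string is left unchanged.

```perl
use strict;
use warnings;

# Replace only the first "he" with "hat".
sub replace_first {
    my ($line) = @_;
    $line =~ s/he/hat/;
    return $line;
}

# Replace every "he" with "hat" (the g modifier).
sub replace_all {
    my ($line) = @_;
    $line =~ s/he/hat/g;
    return $line;
}

my $line = "the cat on the tree";
print replace_first($line), "\n";   # that cat on the tree
print replace_all($line), "\n";     # that cat on that tree
```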

6ex.5 Single-character patterns
m/./ Matches any character except “\n”
You can also ask for one of a group of characters:
m/[abc]/ Matches “a” or “b” or “c”
m/[a-z]/ Matches any lower case letter
m/[a-zA-Z]/ Matches any letter
m/[a-zA-Z0-9]/ Matches any letter or digit
m/[a-zA-Z0-9_]/ Matches any letter or digit or an underscore
m/[^abc]/ Matches any character except “a” or “b” or “c”
m/[^0-9]/ Matches any character except a digit
For example:
    if ($line =~ m/class\.ex[1-9]/)
Will be true for “class.ex3.1.pl”; “my class.ex8.1c” …

6ex.6 Single-character patterns
m/./ Matches any character except “\n”
You can also ask for one of a group of characters:
m/[abc]/ Matches “a” or “b” or “c”
m/[a-z]/ Matches any lower case letter
m/[a-zA-Z]/ Matches any letter
m/[a-zA-Z0-9]/ Matches any letter or digit
m/[a-zA-Z0-9_]/ Matches any letter or digit or an underscore
m/[^abc]/ Matches any character except “a” or “b” or “c”
m/[^0-9]/ Matches any character except a digit
For example:
    if ($line =~ m/class\.ex[1-9]\.[^3]/)
Will be true for “class.ex3.1.pl”; “my class.ex8.1c” … but false for “class.ex3.3”

6ex.7 Single-character patterns
Perl provides predefined character classes:
\d a digit (same as: [0-9])
\w a “word” character (same as: [a-zA-Z0-9_])
\s a space character (same as: [ \t\n\r\f])
And their negatives:
\D anything but a digit
\W anything but a word char
\S anything but a space char
For example:
    if ($line =~ m/class\.ex\d\.\S/)
Will be true for “class.ex3.1” and “class.ex8.(at home)” … but false for “class.ex3. ” (because of the space)
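As a sketch of the predefined classes in use: the first sub repeats the slide's test, and the second uses the negated class \D to strip everything except digits. Both sub names and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# True if the name contains "class.ex<digit>.<non-space character>".
sub is_exercise_name {
    my ($name) = @_;
    return $name =~ m/class\.ex\d\.\S/ ? 1 : 0;
}

# Remove every character that is not a digit, using the negated class \D.
sub digits_only {
    my ($text) = @_;
    $text =~ s/\D//g;
    return $text;
}

print is_exercise_name("class.ex3.1"), "\n";      # 1
print is_exercise_name("class.ex3. pl"), "\n";    # 0, a space follows the dot
print digits_only("ex 3, part 12"), "\n";         # 312
```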

6ex.8 RegEx Coach An easy to use tool for testing regular expressions

6ex.9 Class exercise 7a
1. Write the following regular expressions. Test them with a script that reads a line and prints "yes" if it matches and "no" if not.
a) Match a name beginning with a capital letter followed by three lower case letters
b) Replace every digit in the line with a #, and print the result
c) Match "is" in either small or capital letters
d) Remove all such appearances of "is" from the line, and print it

6ex.10 Repetitive patterns
A pattern followed by * means zero or more repetitions of that pattern:
m/ab*c/ Matches “abc”; “ac”; “abbbbc”
+ means one or more repetitions:
m/ab+c/ Matches “abc”; “abbbbc” but not “ac”
? means zero or one repetitions:
m/ab?c/ Matches “ac” or “abc”
In general, use {} for a specific number of repetitions, or a range:
m/ab{3}c/ Matches “abbbc”
m/ab{3,6}c/ Matches “a”, then “b” 3 to 6 times, and then “c”
Use parentheses to mark more than one character for repetition:
m/h(el)*lo/ Matches “hello”; “hlo”; “helelello”
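The quantifiers can be compared by running several patterns against the same strings. In this sketch the hypothetical helper matches_whole wraps the pattern in ^ and $ so that the whole string has to match, which makes the differences between *, + and ? visible.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text as a whole matches $pattern.
# The pattern is given as a string and interpolated into the match.
sub matches_whole {
    my ($pattern, $text) = @_;
    return $text =~ m/^$pattern$/ ? 1 : 0;
}

foreach my $pattern ('ab*c', 'ab+c', 'ab?c', 'ab{3}c') {
    foreach my $text ('ac', 'abc', 'abbbc') {
        print "$pattern vs $text: ", matches_whole($pattern, $text), "\n";
    }
}
```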

6ex.11 Enforce line start/end
To force the pattern to be at the beginning of the string add a “^”:
m/^>/ Matches only strings that begin with a “>”
“$” forces the end of the string:
m/\.pl$/ Matches only strings that end with “.pl”
And together:
m/^\s*$/ Matches all lines that do not contain any non-space characters
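A minimal sketch that uses the three anchored patterns to sort lines into categories. The sub name describe_line and the category labels are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helper: label a line using the anchored patterns.
sub describe_line {
    my ($line) = @_;
    return "blank"       if $line =~ m/^\s*$/;   # nothing but white space
    return "header"      if $line =~ m/^>/;      # begins with ">"
    return "perl script" if $line =~ m/\.pl$/;   # ends with ".pl"
    return "other";
}

foreach my $line (">seq1 human", "class.ex3.1.pl", "   ", "a.pl file") {
    print "[$line] ", describe_line($line), "\n";
}
```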

6ex.12 Some examples
m/\d+(\.\d+)?/ Matches numbers that may contain a decimal point: “10”; “3.0”; “4.75” …
m/^NM_\d+/ Matches Genbank RefSeq accessions like “ NM_ ”
m/^\s*CDS\s+\d+\.\.\d+/ Matches annotation of a coding sequence in a Genbank DNA/RNA record: “ CDS ”
m/^\s*CDS\s+(complement\()?\d+\.\.\d+\)?/ Also allows a CDS on the minus strand of the DNA: “ CDS complement( ) ”
Note: We could just use m/^\s*CDS/. It is a question of how strictly we check the format; sometimes we want to make sure.

6ex.13 Class exercise 7b
2. Write the following regular expressions. Test them with a script that reads a line and prints "yes" if it matches and "no" if not.
a) Match a name beginning with a capital letter followed by any number of lower case letters
b) Match a date such as: 12/8/2005 and 3/12/1987

6ex.14 Extracting part of a pattern
We can extract the parts of the string that matched parts of the pattern by marking them with parentheses:
    $line = "1.35";
    if ($line =~ m/(\d+)(\.\d+)/) {
        print "$1\n";   # prints 1
        print "$2\n";   # prints .35
    }

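6ex.15 Extracting part of a pattern
We can extract the parts of the string that matched the parts of the pattern that are marked by parentheses:
    $line = " CDS 4815..5888";
    if ($line =~ m/CDS\s+(complement\()?((\d+)\.\.(\d+))\)?/) {
        print "regexp:$1,$2,$3,$4.\n";
        $start = $3;
        $end = $4;
    }
Output:
    Use of uninitialized value in concatenation...
    regexp:,4815..5888,4815,5888.
The warning (shown when warnings are enabled) appears because the optional group (complement\()? did not take part in the match, so $1 is undefined.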
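One way to avoid the uninitialized-value warning is to test the optional group with defined before using it. The sketch below assumes a simplified pattern with only three groups; the sub name parse_cds, the "+"/"-" strand labels and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return (strand, start, end) for a CDS line,
# or an empty list if the line does not match.
sub parse_cds {
    my ($line) = @_;
    if ($line =~ m/CDS\s+(complement\()?(\d+)\.\.(\d+)\)?/) {
        # $1 is undefined when the optional "complement(" part is absent
        my $strand = defined $1 ? "-" : "+";
        return ($strand, $2, $3);
    }
    return;
}

foreach my $line (" CDS 4815..5888", " CDS complement(100..250)", " gene 1..90") {
    my @parts = parse_cds($line);
    print @parts ? "strand=$parts[0] start=$parts[1] end=$parts[2]\n" : "no CDS\n";
}
```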

6ex.16 Finding a pattern in an input file
Usually, we want to scan all lines of a file, and find lines with a specific pattern. E.g.:
    foreach $line (@lines) {   # @lines holds the lines read from the file
        if ($line =~ m/CDS\s+(\d+)\.\.(\d+)/) {
            $start = $1;
            $end = $2;
        }
    }
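A self-contained sketch of the same scan, reading line by line from a filehandle. To keep it runnable without an external file, it opens an in-memory filehandle on a string; the sub name cds_coordinates and the sample record are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collect [start, end] pairs from every CDS line in $text.
sub cds_coordinates {
    my ($text) = @_;
    my @pairs;
    # Open the string as if it were a file (in-memory filehandle).
    open(my $in, '<', \$text) or die "cannot read text: $!";
    while (my $line = <$in>) {
        if ($line =~ m/CDS\s+(\d+)\.\.(\d+)/) {
            push @pairs, [$1, $2];
        }
    }
    close($in);
    return @pairs;
}

my $record = "     gene            1..200\n"
           . "     CDS             10..150\n"
           . "     CDS             300..450\n";
foreach my $pair (cds_coordinates($record)) {
    print "start=$pair->[0] end=$pair->[1]\n";
}
```

With a real file, the open call would take a file name instead of a reference to a string.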

6ex.17 Class exercise 7c
3. Write the following regular expressions. Test them with a script that reads a line and prints "yes" if it matches and "no" if not.
a) Match a first name followed by a last name, and print the last name
b) Match a FASTA header line and print the whole line except for the “>”
c) As in Q3b, but print the header only until the first white space

6ex.18 Class exercise 8a
Write a script that extracts and prints the following features from a Genbank record of a genome (use the example of an adenovirus genome, which is available from the course site):
1. Find the JOURNAL lines and print only the page numbers
2. Find lines of protein_id in that file and extract the ids (add to previous script)
3. Find lines of coding sequence annotation (CDS) and extract the separate coordinates (get each number into a separate variable; add to previous script). Try to match all CDS lines! (This question is in home ex. 4)

6ex.19 Multiple choice
If any one of several alternatives is acceptable at some point in a pattern, we can separate them with “|”:
    m/CDS (\d+\.\.\d+|\d+-\d+|\d+,\d+)/
will match “CDS ” followed by two numbers separated by “..”, by “-” or by “,” (for example “CDS 231,345”).
Note: here $1 will be the coordinate text that matched, e.g. “231,345”.
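A short sketch of the alternation in use. The sub name cds_range and the sample inputs other than “CDS 231,345” are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the coordinate text after "CDS ",
# in any of the three accepted forms, or undef if none matches.
sub cds_range {
    my ($line) = @_;
    return $line =~ m/CDS (\d+\.\.\d+|\d+-\d+|\d+,\d+)/ ? $1 : undef;
}

foreach my $line ("CDS 231..345", "CDS 231-345", "CDS 231,345", "CDS 231:345") {
    my $range = cds_range($line);
    print "$line => ", (defined $range ? $range : "no match"), "\n";
}
```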

6ex.20 Variables in patterns
Variables can be interpolated into regular expressions, as in double-quoted strings:
    $name = "Yossi";
    $line =~ m/^$name\d+/
This pattern will match: “Yossi25”, “Yossi45”
* Special patterns can also be given in a variable: if $name was “Yos+i” then the pattern could match “Yosi5” and “Yossssi5”
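The sketch below shows both behaviours. The first sub interpolates the variable as a pattern, as on the slide. The second is an addition not covered on the slide: it wraps the variable in Perl's \Q...\E escape so that its characters are matched literally. The sub names are hypothetical.

```perl
use strict;
use warnings;

# The content of $name is treated as part of the pattern.
sub starts_with_pattern {
    my ($name, $line) = @_;
    return $line =~ m/^$name\d+/ ? 1 : 0;
}

# \Q...\E quotes special characters, so $name is matched literally.
sub starts_with_literal {
    my ($name, $line) = @_;
    return $line =~ m/^\Q$name\E\d+/ ? 1 : 0;
}

print starts_with_pattern("Yossi", "Yossi25"), "\n";    # 1
print starts_with_pattern("Yos+i", "Yossssi5"), "\n";   # 1, "+" acts as a quantifier
print starts_with_literal("Yos+i", "Yossssi5"), "\n";   # 0, "+" is a plain character
```

The literal form is useful when the variable comes from user input and is not meant to be a pattern.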

6ex.21 Variables in patterns
Say we need to search some BLAST output:
ref|NT_ |Mm15_39661_34 Mus musculus chromosome 15 genomic e-45
ref|NT_ |Mm6_39393_34 Mus musculus chromosome 6 genomic c
ref|NT_ |Mm9_39517_34 Mus musculus chromosome 9 genomic c
ref|NT_ |Mm8_39502_34 Mus musculus chromosome 8 genomic c
for the score of a hit that is named by the user. We can write:
    m/^ref\|$hitName.*\s(\d+)\s+\S+\s*$/
If $hitName was NT_039353, we get 38.
The “|” after “ref” is escaped because an unescaped “|” means “or” in a pattern. The \s before (\d+) stops the greedy .* from swallowing all but the last digit of the score.
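A minimal sketch of the score lookup. The sample BLAST line, including its accession version, score and E-value, is invented for illustration, and the sub name hit_score is hypothetical. The pattern escapes the “|” and requires white space before the score, so that the greedy .* cannot take the leading digits of the number.

```perl
use strict;
use warnings;

# Hypothetical helper: return the score column of the hit named $hitName,
# or undef if the line is about a different hit.
sub hit_score {
    my ($hitName, $line) = @_;
    if ($line =~ m/^ref\|$hitName.*\s(\d+)\s+\S+\s*$/) {
        return $1;
    }
    return;
}

# Made-up line in the layout: name, description, score, E-value
my $line = "ref|NT_039353.6|Mm6_39393_34 Mus musculus chromosome 6 genomic c    38   0.71";
my $score = hit_score("NT_039353", $line);
print "score: $score\n" if defined $score;
```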

6ex.22 split
The split function actually treats its first parameter as a regular expression:
    $line = "13 5;3 -23 8";
    @fields = split(/\s+/, $line);
    print join("#", @fields);
prints: 13#5;3#-23#8
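Because the separator is a pattern, it can be widened with a character class. In this sketch the hypothetical helper split_and_join takes the separator as a precompiled pattern (qr//), so the same line can be split in two different ways.

```perl
use strict;
use warnings;

# Hypothetical helper: split $line on the given pattern and
# join the resulting fields with "#".
sub split_and_join {
    my ($separator, $line) = @_;
    my @fields = split($separator, $line);
    return join("#", @fields);
}

my $line = "13 5;3  -23\t8";
print split_and_join(qr/\s+/, $line), "\n";      # 13#5;3#-23#8
print split_and_join(qr/[\s;]+/, $line), "\n";   # 13#5#3#-23#8
```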

6ex.23 Using memories in substitution
The extracted parts of the pattern can be used inside a substitution:
    $line = " CDS ";
    $line =~ s/(\d+)\.\.(\d+)/$1-$2/;
In the CDS line, the “..” between the two coordinates is replaced by “-”.
    $line = "I'm John Lennon";
    $line =~ s/([A-Z][a-z]+)\s+([A-Z][a-z]+)/$1_$2/;
$line is now: “I'm John_Lennon”

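6ex.24 Using memories in substitution
    $line = " CDS ";
    $line =~ s/(\d+)\.\.(\d+)/$2..$1/;
$line is now the CDS line with its two coordinates swapped.
    $line = " CDS join( , )";
    $line =~ s/(\d+)\.\.(\d+)/$1..$2/g;
With the “g” modifier, the substitution is applied to every coordinate pair inside join( , ).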
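A runnable sketch of memories in the replacement part. The coordinates in the sample line are invented, and the sub names are hypothetical.

```perl
use strict;
use warnings;

# Swap the two numbers of every start..end pair in the line.
sub swap_coordinates {
    my ($line) = @_;
    $line =~ s/(\d+)\.\.(\d+)/$2..$1/g;
    return $line;
}

# Join the first pair of capitalized words with an underscore.
sub join_names {
    my ($line) = @_;
    $line =~ s/([A-Z][a-z]+)\s+([A-Z][a-z]+)/$1_$2/;
    return $line;
}

print swap_coordinates(" CDS join(10..20,30..40)"), "\n";   #  CDS join(20..10,40..30)
print join_names("I'm John Lennon"), "\n";                  # I'm John_Lennon
```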

6ex.25 Using memories in matching
The extracted parts can also be used inside the same match:
m/(\d+)-(\d+),\2-\d+/ will match two ranges only when the second range starts with the same number that ends the first
m/(.)\1+/ will match any character that is repeated at least twice
    $line = "kasjfjjjjsja";
    if ($line =~ m/((.)\2+)/) {
        print "regexp:$1,$2.\n";
    }
prints: regexp:jjjj,j.
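Both backreference patterns as a small sketch. The sub names and the sample ranges are hypothetical.

```perl
use strict;
use warnings;

# Return the first run of a repeated character, or undef if there is none.
sub first_repeat {
    my ($line) = @_;
    return $line =~ m/((.)\2+)/ ? $1 : undef;
}

# True if the line holds two ranges "a-b,b-c" where the second
# range starts with the number that ends the first.
sub ranges_touch {
    my ($line) = @_;
    return $line =~ m/(\d+)-(\d+),\2-\d+/ ? 1 : 0;
}

print first_repeat("kasjfjjjjsja"), "\n";   # jjjj
print ranges_touch("10-20,20-30"), "\n";    # 1
print ranges_touch("10-20,21-30"), "\n";    # 0
```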

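6ex.26 Position of match
Perl saves the positions of matches in the special arrays @- and @+. The variables $-[0] and $+[0] are the start and end of the entire match. The rest hold the starts and ends of the memories (brackets):
    $line = " CDS ";
    $line =~ m/CDS\s+(\d+)\.\.(\d+)/;
    print "starts: @-\nends: @+\n";
This prints a “starts:” line and an “ends:” line, each listing the positions for the whole match followed by those of the memories.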
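A sketch with a made-up CDS line, so the positions can be checked by counting characters (positions start at 0, and each end is the position just after the matched text). The sub name match_positions is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the start and end positions of the whole
# match and of each memory, as two comma-separated strings.
sub match_positions {
    my ($line) = @_;
    if ($line =~ m/CDS\s+(\d+)\.\.(\d+)/) {
        return (join(",", @-), join(",", @+));
    }
    return;
}

# Two spaces, "CDS", three spaces, then the coordinates
my ($starts, $ends) = match_positions("  CDS   10..25");
print "starts: $starts\n";   # starts: 2,8,12
print "ends: $ends\n";       # ends: 14,10,14
```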

6ex.27 Patterns are greedy
If a pattern can match a string in several ways, it will take the maximal substring. For example, in
    $line = "fred xxxxxxxxxx john";
a substitution on the pattern x+ replaces the whole run of x's at once, not just the first x.
You can make a pattern minimal by adding a ? after any of *, +, ? or {}. With x+? in the same substitution, only one x will be replaced.
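A sketch of the difference, assuming the substitution simply deletes what the pattern matched (the empty replacement and the sub names are choices made for this illustration).

```perl
use strict;
use warnings;

# Greedy: x+ takes the whole run of x's.
sub remove_greedy {
    my ($line) = @_;
    $line =~ s/x+//;
    return $line;
}

# Minimal: x+? takes as few x's as possible, which is one.
sub remove_minimal {
    my ($line) = @_;
    $line =~ s/x+?//;
    return $line;
}

my $line = "fred xxxxxxxxxx john";
print remove_greedy($line), "\n";    # fred  john
print remove_minimal($line), "\n";   # fred xxxxxxxxx john
```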

6ex.28 Translate
A special type of substitution allows us to “translate” (i.e. replace) a set of characters to a different set:
    $seq = "AGCATCGA";
    $seq =~ tr/ATGC/TACG/;
$seq is now "TCGTAGCT"
(What is the next step in order to get the reverse complement of the sequence?)
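One possible answer to the question, as a sketch: translate the bases with tr and then reverse the string with Perl's built-in reverse. The sub name reverse_complement is hypothetical, and only upper case A, T, G and C are handled.

```perl
use strict;
use warnings;

# Hypothetical helper: complement each base, then reverse the string.
sub reverse_complement {
    my ($seq) = @_;
    $seq =~ tr/ATGC/TACG/;         # complement: A<->T, G<->C
    return scalar reverse $seq;    # scalar context reverses the characters
}

print reverse_complement("AGCATCGA"), "\n";   # TCGATGCT
```

The word scalar matters here: in list context, reverse would reverse a list of one element and leave the string itself unchanged.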

6ex.29 Enforce word start/end
In ex. 6.1 we wanted to enforce the capital letter to be the beginning of a word. We can enforce a word boundary with \b, similar to enforcing line start/end with ^ and $:
m/\bJovi/ will match “Jovi” and “bon Jovi” but not “bonJovi”
m/fred\b/ will match “fred” and “fred.” but not “fredrick”
\B is the reverse: m/fred\B/ will match “fredrick” but not “fred”
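The three boundary patterns in a short sketch, with hypothetical sub names, checked against the examples above.

```perl
use strict;
use warnings;

# "Jovi" at the start of a word
sub has_jovi_word_start {
    my ($text) = @_;
    return $text =~ m/\bJovi/ ? 1 : 0;
}

# "fred" at the end of a word
sub has_fred_word_end {
    my ($text) = @_;
    return $text =~ m/fred\b/ ? 1 : 0;
}

# "fred" followed by more word characters
sub has_fred_inside_word {
    my ($text) = @_;
    return $text =~ m/fred\B/ ? 1 : 0;
}

foreach my $text ("bon Jovi", "bonJovi", "fred.", "fredrick") {
    print "$text: ", has_jovi_word_start($text), has_fred_word_end($text),
          has_fred_inside_word($text), "\n";
}
```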

6ex.30 Class exercise 8b
Continuing with the record of the adenovirus genome:
4. Get a journal name and the year of publication from the user, find this paper in the adenovirus record and print the pages of this paper in the journal
5*. Get the first and last names of an author from the user, find the paper in the adenovirus record and print the year of publication. Can you find the paper by Kei Fujinaga?