6.1 Before we start (Photo: Eitan Shor) Let’s talk a bit about the last exercise, and Eclipse…


6.2 Comments following the last exercise
Use chomp to remove \n from inputs.
Add remarks and document your code (see nice_code_example.pl).
… as you treat any other array.
Use $! to give the correct error after failing to open a file, e.g. die "failed to open file '$file' $!".
Make sure your outputs are as requested.
Debug, debug and debug!
Let us know if any of the questions causes you trouble.
Make sure you understand the solutions on the course website, and ask if anything remains unclear.
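
The advice about chomp and $! can be sketched as follows. This is only an illustration, not the course's nice_code_example.pl; the helper name read_lines and the file name in the usage lines are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the course): return the lines of a file
# with their trailing "\n" removed.
sub read_lines {
    my ($file) = @_;
    # $! holds the system's reason for the failure
    open(my $in, "<", $file) or die "failed to open file '$file': $!";
    my @lines = <$in>;
    close($in);
    chomp(@lines);    # chomp on an array removes "\n" from every element
    return @lines;
}

# Usage: a missing file (hypothetical name) makes die report the reason.
my $ok = eval { read_lines("no_such_file_for_demo.txt"); 1 };
print "Error: $@" unless $ok;
```

Wrapping the call in eval is used here only so the demonstration can print the message instead of stopping the script.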

6.3 if: the order of conditions
    if ((substr($fastaline,0,1) ne ">") and (defined $fastaline))
What will happen if $fastaline is undefined?
    Use of uninitialized value $fastaline in split…
The solution is to test defined first, so that substr is evaluated only when $fastaline is defined:
    if ((defined $fastaline) and (substr($fastaline,0,1) ne ">"))
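
A minimal sketch of why the order matters: "and" stops evaluating as soon as the left side is false, so substr is never called on an undefined value. The helper name is_sequence_line is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the line is defined and is not a Fasta
# header (does not start with ">"), otherwise 0.
sub is_sequence_line {
    my ($fastaline) = @_;
    # "and" short-circuits: substr runs only if the line is defined
    if ((defined $fastaline) and (substr($fastaline, 0, 1) ne ">")) {
        return 1;
    }
    return 0;
}

print is_sequence_line("ATGCATGC"), "\n";    # a sequence line
print is_sequence_line(">header"), "\n";     # a header line
print is_sequence_line(undef), "\n";         # no warning is issued here
```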

6.4 Loops: foreach
The foreach loop passes through all the elements of an array:
    my @arr = (2,3,4,5,6);
    my $mul = 1;
    foreach my $num (@arr) {
        $mul = $mul * $num;
    }
After the loop, $mul is 720.

6.5 Some Eclipse Tips
Try Ctrl+Shift+L: quick help (keyboard shortcuts).
Try Ctrl+SPACE: auto-complete.
Source→Format (Ctrl+Shift+F): correct indentation.
You can maximize a single view of Eclipse.
Debug, debug and debug! Break points...
The (default) location of your files is:
    At home: D:\eclipse\perl_ex
    Computer class: C:\eclipse\perl_ex
Remove auto-complete of (), {}, "" etc.: Window -> Preferences -> Perl EPIC -> Editor, and make changes in "Smart typing"...

6.6 Pattern matching

6.7 Pattern matching
We often want to find a certain piece of information within the file, for example:
1. Extract GI numbers or accessions from Fasta
2. Extract the coordinates of all open reading frames from the annotation of a genome
3. Extract the accession, description and score of every hit in the output of BLAST
All these examples are patterns in the text. We will see a wide range of the pattern-matching capabilities of Perl, but much more is available. You are welcome to use documentation/tutorials/google.
    >gi| |ref|NP_ | thr operon …
    >gi| |ref|YP_ | hypothetical …
    >gi| |ref|NP_ | citrate …

    Score E
    Sequences producing significant alignments: (bits) Value
    ref|NT_ |Mm15_39661_34 Mus musculus chromosome 15 genomic e-45
    ref|NT_ |Mm6_39393_34 Mus musculus chromosome 6 genomic c
    ref|NT_ |Mm9_39517_34 Mus musculus chromosome 9 genomic c

    CDS
    CDS complement( )

6.8 Regular expression
Finding a sub-string (match) somewhere in a string:
    if ($line =~ m/he/)...
Remember to use slash ( / ) and not back-slash.
Will be true for “hello” and for “the cat” but not for “good bye” or “Hercules”.
You can ignore case of letters by adding an “i” after the pattern:
    m/he/i (matches for “the”, “Hello”, “Hercules” and “hEHD”)
There is a negative form of the match operator:
    if ($line !~ m/he/)...
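
The match operator and the “i” modifier can be tried on the strings from this slide. A minimal sketch; the helper name contains_he is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: does $line contain "he"? A true second argument
# makes the match ignore the case of letters.
sub contains_he {
    my ($line, $ignore_case) = @_;
    if ($ignore_case) {
        return $line =~ m/he/i ? 1 : 0;
    }
    return $line =~ m/he/ ? 1 : 0;
}

foreach my $line ("hello", "the cat", "good bye", "Hercules") {
    print "'$line': m/he/ gives ", contains_he($line),
          ", m/he/i gives ", contains_he($line, 1), "\n";
}
```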

6.9 Single-character patterns
    m/./              Matches any character (except “\n”)
You can also match one of a group of characters:
    m/[atcg]/         Matches “a” or “t” or “c” or “g”
    m/[a-d]/          Matches “a” through “d” (a, b, c or d)
    m/[a-zA-Z]/       Matches any letter
    m/[a-zA-Z0-9]/    Matches any letter or digit
    m/[a-zA-Z0-9_]/   Matches any letter or digit or an underscore
    m/[^atcg]/        Matches any character except “a” or “t” or “c” or “g”
    m/[^0-9]/         Matches any character except a digit
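
As a small illustration of a negated class, the sketch below checks whether a lowercase DNA string contains anything other than a, t, c or g. The helper name has_non_dna_char is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the string contains any character other
# than a, t, c or g, otherwise 0.
sub has_non_dna_char {
    my ($seq) = @_;
    return $seq =~ m/[^atcg]/ ? 1 : 0;
}

print has_non_dna_char("atcgatcg"), "\n";     # only a, t, c and g
print has_non_dna_char("atcgnatcg"), "\n";    # contains "n"
print has_non_dna_char("atcg\n"), "\n";       # "\n" is also not in [atcg]
```

The last call shows why chomp matters: a trailing newline counts as a character outside the class.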

6.10 Single-character patterns
For example: if ($line =~ m/TATAA[AT]/)
Will be true for?
    TATTAA                    ✗
    TATAATA                   ✔
    CTATATAATAGCTAGGCGCATG    ✔
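
The three answers above can be checked with a short script. A minimal sketch; the helper name matches_tataa is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the sequence contains TATAA followed by A or T.
sub matches_tataa {
    my ($seq) = @_;
    return $seq =~ m/TATAA[AT]/ ? 1 : 0;
}

foreach my $seq ("TATTAA", "TATAATA", "CTATATAATAGCTAGGCGCATG") {
    print "$seq: ", (matches_tataa($seq) ? "yes" : "no"), "\n";
}
```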

6.11 Single-character patterns
Perl provides predefined character classes:
    \d    a digit (same as: [0-9])
    \w    a “word” character (same as: [a-zA-Z0-9_])
    \s    a space character (same as: [ \t\n\r\f])
And their negatives:
    \D    anything but a digit
    \W    anything but a word char
    \S    anything but a space char
For example: if ($line =~ m/class\.ex\d\.\S/)
    class.ex3.1.pl     ✔
    class.ex3. my      ✗
    class.ex8.(old)    ✔
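
A short sketch that runs the example pattern on the three strings from the slide; the helper name matches_class_file is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the string contains "class.ex", a digit,
# a dot and then any non-space character.
sub matches_class_file {
    my ($name) = @_;
    return $name =~ m/class\.ex\d\.\S/ ? 1 : 0;
}

foreach my $name ("class.ex3.1.pl", "class.ex3. my", "class.ex8.(old)") {
    print "'$name': ", (matches_class_file($name) ? "yes" : "no"), "\n";
}
```

The second string fails because the character after the second dot is a space, which \S does not match.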

6.12 Repetitive patterns
? means zero or one repetitions of what’s before it:
    m/ab?c/        Matches “ac” or “abc”
+ means one or more repetitions of what’s before it:
    m/ab+c/        Matches “abc”; “abbbbc” but not “ac”
A pattern followed by * means zero or more repetitions of that pattern:
    m/ab*c/        Matches “abc”; “ac”; “abbbbc”
Generally, use { } for a certain number of repetitions, or a range:
    m/ab{3}c/      Matches “abbbc”
    m/ab{3,6}c/    Matches “a”, 3-6 times “b” and then “c”
    m/ab{3,}c/     Matches “a”, “b” 3 times or more and then “c”
Use parentheses to mark more than one character for repetition:
    m/h(el)*lo/    Matches “hello”; “hlo”; “helelello”
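
The repetition operators can be explored with a small table-driven script. This is a sketch: the helper name matches and the extra test strings “abbc” and “abbbbbbbc” are not from the slides. The pattern is kept in a string and interpolated into m/.../.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $string matches the pattern held in $pattern.
sub matches {
    my ($pattern, $string) = @_;
    return $string =~ m/$pattern/ ? 1 : 0;
}

# Pairs of [pattern, string]
my @cases = (
    ['ab?c',      'ac'],
    ['ab+c',      'ac'],
    ['ab+c',      'abbbbc'],
    ['ab{3}c',    'abbc'],
    ['ab{3,6}c',  'abbbbbbbc'],    # seven times "b"
    ['ab{3,}c',   'abbbbbbbc'],
    ['h(el)*lo',  'helelello'],
);

foreach my $case (@cases) {
    my ($pattern, $string) = @$case;
    print "m/$pattern/ on '$string': ",
          (matches($pattern, $string) ? "yes" : "no"), "\n";
}
```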

6.13 We are now ready for some bad humor
Question: What did one regular expression say to the other?
Answer: .*

6.14 Repetitive patterns
For example: if ($line =~ m/TATAA[AT][ATCG]{2,4}ATG/)
Will be true for?
    TATAAAGAATG            ✔
    ACTATAATAAAAATG        ✔
    TATAATGATGTATAATATG    ✗
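
These answers can also be checked by running the pattern. A minimal sketch; the helper name matches_promoter_like is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the sequence contains TATAA, then A or T,
# then 2 to 4 nucleotides, then ATG.
sub matches_promoter_like {
    my ($seq) = @_;
    return $seq =~ m/TATAA[AT][ATCG]{2,4}ATG/ ? 1 : 0;
}

foreach my $seq ("TATAAAGAATG", "ACTATAATAAAAATG", "TATAATGATGTATAATATG") {
    print "$seq: ", (matches_promoter_like($seq) ? "yes" : "no"), "\n";
}
```

In the third sequence neither TATAA[AT] occurrence is followed by 2 to 4 nucleotides and then ATG, so the match fails.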

6.15 Example code
Consider the following code:
    print "please enter a line...\n";
    my $line = <STDIN>;
    chomp($line);
    if ( $line =~ m/-?\d+/ ) {
        print "This line seems to contain a number...\n";
    }
    else {
        print "This is certainly not a number...\n";
    }

6.16 Example code
Consider the following code:
    open(my $in, "<", "numbers.txt") or die "cannot open numbers.txt: $!";
    my $line = <$in>;
    while (defined $line) {
        if ( $line =~ m/-?\d+/ ) {
            print "This line seems to contain a number...\n";
        }
        else {
            print "This is certainly not a number...\n";
        }
        $line = <$in>;
    }
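
The same loop can be tried without a numbers.txt file by reading from a string. This is a sketch under that assumption: the helper name count_number_lines and the sample text are hypothetical, and an in-memory filehandle stands in for the file.

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines read from a filehandle that
# contain a number (an optional minus sign followed by digits).
sub count_number_lines {
    my ($fh) = @_;
    my $count = 0;
    my $line = <$fh>;
    while (defined $line) {
        chomp($line);
        if ($line =~ m/-?\d+/) {
            $count++;
        }
        $line = <$fh>;
    }
    return $count;
}

# Opening a reference to a string gives a filehandle that reads from it.
my $text = "42\nno digits here\nabc-7xyz\n";
open(my $in, "<", \$text) or die "cannot open in-memory file: $!";
print count_number_lines($in), "\n";
close($in);
```

Note that “abc-7xyz” is counted: the pattern only asks for a number somewhere in the line, not for the whole line to be a number.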

6.17 RegEx Coach
An easy-to-use tool for testing regular expressions.
Also in Eclipse: select Window -> Show View -> Other... from the Eclipse menu, then select EPIC -> RegExp view from the list.

6.18 Class exercise 6a
Write the following regular expressions. Test them with a script that reads a line from STDIN and prints "yes" if it matches and "no" if not.
1. Match a name containing a capital letter followed by three lowercase letters.
2. Match an NLS (nuclear localization signal) that starts with K, followed by K or R, followed by any character, followed by either K or R.
3. Match an NLS that starts with K, followed by K or R, followed by any character except D or E, followed by either K or R. Match either lowercase or uppercase letters.
4*. Match a line that contains at least … characters between quotes (without another quote inside the quotes).

6.19

6.20 Pattern matching
Replacing a sub-string (substitute):
    $line = "the cat on the tree";
    $line =~ s/he/hat/;
$line will be turned to “that cat on the tree”.
To replace all occurrences of a sub-string add a “g” (for “globally”):
    $line = "the cat on the tree";
    $line =~ s/he/hat/g;
$line will be turned to “that cat on that tree”.
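
A short sketch comparing the two forms on the sentence from the slide. The helper names replace_first and replace_all are hypothetical; each works on its own copy of the string.

```perl
use strict;
use warnings;

# Hypothetical helper: replace only the first "he" with "hat".
sub replace_first {
    my ($line) = @_;
    $line =~ s/he/hat/;
    return $line;
}

# Hypothetical helper: replace every "he" with "hat".
sub replace_all {
    my ($line) = @_;
    $line =~ s/he/hat/g;
    return $line;
}

my $line = "the cat on the tree";
print replace_first($line), "\n";
print replace_all($line), "\n";
```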

6.21 Single-character patterns
Perl provides predefined character classes:
    \d    a digit (same as: [0-9])
    \w    a “word” character (same as: [a-zA-Z0-9_])
    \s    a space character (same as: [ \t\n\r\f])
And their negatives:
    \D    anything but a digit
    \W    anything but a word char
    \S    anything but a space char
And a substitute example for $line = "class.ex3.1.pl";
    $line =~ s/\W/-/;     gives class-ex3.1.pl
    $line =~ s/\W/-/g;    gives class-ex3-1-pl
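
One detail that may be useful here: s///g returns the number of substitutions it made, so the result can be counted as well as printed. A minimal sketch; the helper name dash_non_word is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every non-word character with "-" and
# return the new string together with the number of replacements.
sub dash_non_word {
    my ($line) = @_;
    my $count = ($line =~ s/\W/-/g);
    $count = 0 unless $count;    # s/// returns a false value if nothing matched
    return ($line, $count);
}

my ($new, $count) = dash_non_word("class.ex3.1.pl");
print "$new ($count replacements)\n";
```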

6.22 Class exercise 6b
1. Write the following regular expression substitutions. For each string, print it before the substitution and after it.
a) Replace every T with U in a DNA sequence.
b) Replace every digit in the line with a #, and print the result.
c) Replace any number of white space characters (new-line, tab or space) by a single space.
d*) Remove all appearances of "is" from the line (both lowercase and uppercase letters), and print it.

6.23 Enforce line start/end
To force the pattern to be at the beginning of the string add a “^”:
    m/^>/        Matches only strings that begin with a “>”
“$” forces the end of string:
    m/\.pl$/     Matches only strings that end with a “.pl”
And together:
    m/^\s*$/     Matches empty lines and all lines that contain only space characters.
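
The three anchored patterns can be tried on a few strings. A sketch with hypothetical helper names and sample strings.

```perl
use strict;
use warnings;

# Hypothetical helpers, one per pattern from the slide.
sub starts_with_gt {
    my ($s) = @_;
    return $s =~ m/^>/ ? 1 : 0;
}

sub ends_with_pl {
    my ($s) = @_;
    return $s =~ m/\.pl$/ ? 1 : 0;
}

sub is_blank {
    my ($s) = @_;
    return $s =~ m/^\s*$/ ? 1 : 0;
}

print starts_with_gt(">seq1"), starts_with_gt("a > b"), "\n";
print ends_with_pl("script.pl"), ends_with_pl("script.pl.bak"), "\n";
print is_blank(" \t "), is_blank(" x "), "\n";
```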

6.24 Some examples
    m/\d+(\.\d+)?/    Matches numbers that may contain a decimal point: “10”; “3.0”; “4.75” …
    m/^NM_\d+/        Matches Genbank RefSeq accessions: “NM_” followed by digits
OK… now let's do something more complex…
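
A sketch that tries both patterns. The helper names are hypothetical, and “NM_12345” is a made-up accession used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the string contains a number, with or
# without a decimal point.
sub contains_number {
    my ($s) = @_;
    return $s =~ m/\d+(\.\d+)?/ ? 1 : 0;
}

# Hypothetical helper: 1 if the string starts with NM_ and digits.
sub starts_with_nm_accession {
    my ($s) = @_;
    return $s =~ m/^NM_\d+/ ? 1 : 0;
}

foreach my $s ("10", "3.0", "4.75", "no digits") {
    print "'$s' contains a number: ", contains_number($s), "\n";
}
foreach my $s ("NM_12345", "see NM_12345") {
    print "'$s' starts with an accession: ", starts_with_nm_accession($s), "\n";
}
```

The second accession test fails because “^” requires NM_ to be at the very beginning of the string.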

6.25 Some GenBank examples
Let's take a look at the adeno12.gb GenBank record…
Matches annotation of a coding sequence in a Genbank DNA/RNA record (a CDS line with start..end coordinates):
    m/^\s*CDS\s+\d+\.\.\d+/
Allows also a CDS on the minus strand of the DNA (a CDS line with the coordinates inside complement( )):
    m/^\s*CDS\s+(complement\()?\d+\.\.\d+\)?/
Note: We could just use m/^\s*CDS/. It is a question of the strictness of the format. Sometimes we want to make sure.
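
The two CDS patterns can be compared on sample lines. This is a sketch: the coordinates are the ones that appear in the outputs on the later slides, the spacing of the lines is an assumption about the GenBank layout, and the helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: the stricter pattern, plus strand only.
sub is_plus_strand_cds {
    my ($line) = @_;
    return $line =~ m/^\s*CDS\s+\d+\.\.\d+/ ? 1 : 0;
}

# Hypothetical helper: the pattern that also allows complement( ).
sub is_cds {
    my ($line) = @_;
    return $line =~ m/^\s*CDS\s+(complement\()?\d+\.\.\d+\)?/ ? 1 : 0;
}

# Sample lines (spacing assumed)
my $plus  = "     CDS             87..1109";
my $minus = "     CDS             complement(4815..5888)";

print is_plus_strand_cds($plus), is_plus_strand_cds($minus), "\n";
print is_cds($plus), is_cds($minus), "\n";
```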

6.26 Extracting part of a pattern
We can extract parts of the pattern by parentheses:
    $line = "1.35";
    if ($line =~ m/(\d+)\.(\d+)/ ) {
        print "$1\n";    # prints 1
        print "$2\n";    # prints 35
    }

6.27 Extracting part of a pattern
We can extract parts of the string that matched parts of the pattern that are marked by parentheses:
    my $line = " CDS 87..1109";
    if ($line =~ m/CDS\s+(\d+)\.\.(\d+)/ ) {
        print "regexp:$1,$2\n";    # prints regexp:87,1109
        my $start = $1;
        my $end = $2;
    }

6.28 Finding a pattern in an input file
Usually, we want to scan all lines of a file, and find lines with a specific pattern. E.g.:
    my ($start,$end);
    # @lines holds the lines of the input file
    foreach my $line (@lines) {
        if ($line =~ m/CDS\s+(\d+)\.\.(\d+)/ ) {
            $start = $1;
            $end = $2;
        }
    }
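
A sketch of scanning a whole input for the pattern. Unlike the loop above, which keeps only the last match, this version collects every match. The helper name find_cds_coordinates and the sample record text are hypothetical, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of "start,end" strings, one for
# each line that matches the plus-strand CDS pattern.
sub find_cds_coordinates {
    my ($fh) = @_;
    my @coords;
    foreach my $line (<$fh>) {
        if ($line =~ m/CDS\s+(\d+)\.\.(\d+)/) {
            push(@coords, "$1,$2");
        }
    }
    return @coords;
}

# Sample text (layout assumed) read through an in-memory filehandle
my $record = "     gene            87..1109\n"
           . "     CDS             87..1109\n"
           . "     CDS             complement(4815..5888)\n"
           . "     CDS             6087..8109\n";
open(my $in, "<", \$record) or die "cannot open in-memory file: $!";
my @found = find_cds_coordinates($in);
close($in);

foreach my $pair (@found) {
    print "$pair\n";
}
```

The complement( ) line is skipped because this pattern expects digits right after the spaces that follow CDS.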

6.29 Extracting a part of a pattern
We can extract parts of the string that matched parts of the pattern that are marked by parentheses. Suppose we want to match both
    $line = " CDS complement(4815..5888)";
and
    $line = " CDS 6087..8109";
with the same pattern:
    if ($line =~ m/CDS\s+(complement\()?((\d+)\.\.(\d+))\)?/ ) {
        print "regexp:$1,$2,$3,$4.\n";
        $start = $3;
        $end = $4;
    }
Output for the first line:
    regexp:complement(,4815..5888,4815,5888.
Output for the second line, where $1 is undefined because “complement(” did not match:
    Use of uninitialized value in concatenation...
    regexp:,6087..8109,6087,8109.
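
One way to avoid the warning is to test whether $1 is defined before using it. The sketch below does that and reports the strand as “+” or “-”. The helper name parse_cds, the strand notation and the spacing of the sample lines are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return (strand, start, end) for a CDS line,
# or an empty list if the line does not match.
sub parse_cds {
    my ($line) = @_;
    if ($line =~ m/CDS\s+(complement\()?((\d+)\.\.(\d+))\)?/) {
        # $1 is undefined when the optional "complement(" did not match
        my $strand = (defined $1) ? "-" : "+";
        return ($strand, $3, $4);
    }
    return ();
}

foreach my $line ("     CDS             complement(4815..5888)",
                  "     CDS             6087..8109") {
    my ($strand, $start, $end) = parse_cds($line);
    print "strand $strand, start $start, end $end\n";
}
```

Groups are numbered by the position of their opening parenthesis, which is why the two coordinates end up in $3 and $4.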

6.30 Class exercise 6c
Write a script that extracts and prints the following features from a Genbank record of a genome (use adeno12.gb):
1. Print all the JOURNAL lines.
2. Print all the JOURNAL lines, without the word JOURNAL, and until the first digit in the line (hint: match whatever is not a digit).
3. Find the JOURNAL lines and print only the page numbers.
4. Find lines of protein_id in that file and extract the ids (add to your script from the previous question).
5. Find lines of coding sequence annotation (CDS) and extract the separate coordinates (get each number into a separate variable). Try to match all CDS lines… (This question is part of home ex. 4).