
1 1 Perl & R Programming workshop Rappaport building, Medicine 26-27 April 2010 By Fabian Glaser and Michael Shmoish Bioinformatics Knowledge Unit, The Lorry I. Lokey Interdisciplinary Center for Life Sciences and Engineering Technion - Israel Institute of Technology

2 2 Day 1 - PERL  Intended for students with no experience in programming  Oriented towards programming tasks for life sciences  Once you finish this course (and do the exercises carefully), you will be able to program small Perl scripts for a range of purposes  PERL syllabus 1. Introduction to Perl, Editors, Scalar Data 2. File Handling, Arrays and Hashes 3. Pattern Matching, Regular expressions

3 3 1) Introduction to Perl, Editors, Scalar Data

4 4 Why do biologists need computers?
- Collecting and managing data: http://www.ncbi.nlm.nih.gov/
- Searching databases: http://www.ncbi.nlm.nih.gov/BLAST/
- Interpreting data:
    - Protein function prediction
    - Gene expression
    - Understanding genomes

5 5 A real life example
Shmulik

    >perl1
    TAGGAAGACTGCGGTAAGTCGTGATCTGAGCGGTTCCGTTACAGCTGCTA
    CCCTCGGCGGGGAGAGGGAAGACGCCCTGCACCCAGTGCTGAATCGCTGC
    AG...
    >perl157

http://www.ncbi.nlm.nih.gov/BLAST/

                                                                         Score    E
    Sequences producing significant alignments:                          (bits)  Value
    ref|NT_039621.4|Mm15_39661_34 Mus musculus chromosome 15 genomic...   186    1e-45
    ref|NT_039353.4|Mm6_39393_34 Mus musculus chromosome 6 genomic c...    38    0.71
    ref|NT_039477.4|Mm9_39517_34 Mus musculus chromosome 9 genomic c...    36    2.8
    ref|NT_039462.4|Mm8_39502_34 Mus musculus chromosome 8 genomic c...    36    2.8
    ref|NT_039234.4|Mm3_39274_34 Mus musculus chromosome 3 genomic c...    36    2.8
    ref|NT_039207.4|Mm2_39247_34 Mus musculus chromosome 2 genomic c...    36    2.8

    >ref|NT_039621.4|Mm15_39661_34 Mus musculus chromosome 15 genomic contig, strain C57BL/6J
    Length = 64849916
    Score = 186 bits (94), Expect = 1e-45
    Identities = 100/102 (98%)
    Strand = Plus / Plus

    Query: 1        taggaagactgcggtaagtcgtgatctgagcggttccgttacagctgctaccctcggcgg 60
                    ||||||||||||||| ||||||||||||||||||||||| ||||||||||||||||||||
    Sbjct: 23209391 taggaagactgcggtgagtcgtgatctgagcggttccgtaacagctgctaccctcggcgg 23209450
    ...

6 6 What is Perl ? Perl = Practical Extraction and Report Language  Perl is a programming language which can be used for a large variety of tasks.  A typical simple use of Perl would be for extracting information from a text file and printing out a report or for converting a text file into another form.  Programs written in Perl are called Perl scripts  Perl is implemented as an interpreted (not compiled) language. Thus, the execution of a Perl script tends to use more CPU time than a corresponding C program, for instance.

7 7 Why Perl?
1. Free and open source
2. Perl is a cross-platform programming language
3. It's easy to learn for non-expert programmers
4. Has strong text manipulation capabilities
5. Has a huge collection of free and reusable Perl modules
6. Can easily handle files and directories
7. Very popular programming language
Why not Perl?
1. Modules need to be installed
2. The syntax is not natural
3. Slow execution

8 8 Basic Perl resources
- Getting Perl: http://www.perl.org
- Editors: gedit - http://projects.gnome.org/gedit/plugins.html
- Documentation: perldoc - http://perldoc.perl.org/perl.html
- Or google your question!

9 9 Running Perl on Windows
- Run Perl scripts from a command prompt (a DOS window).
- Start a DOS window by clicking: Start -> Run -> cmd
Common DOS commands:
    d:          change to another drive (d in this case)
    cd my_dir   change directory
    cd..        move one directory up
    dir         list files (dir /p to view the list page by page)
    help        list all DOS commands
    help dir    get help on a DOS command
Running Perl:
    perl [-w] program_name
    perl -v     (this command checks whether Perl is installed)

10 10 Your first Perl script
FILE hello.pl:
    print 'Hello world';
DOS window:
    C:\Files> perl hello.pl
    Hello world
    C:\Files>_
NOTES:
- A Perl statement always ends with a semicolon ";"
- The print function outputs some information, by default to the terminal screen.

11 11 Adding comments
    # The "hello world" program
    =begin
    Here I can write any text I want without any #'s.
    More comment text
    =cut
    print 'Hello world';   # script line
    # comment one liner
Comments: The # symbol, and anything from it to the end of the line, is ignored (with the exception of the first #! line when running on Unix). To insert a comment that spans several lines, put it between =begin and =cut; each of these two markers must start at the beginning of its own line.

12 12 Class exercise 1.1
1. Open a DOS window using Start -> Run... -> cmd
2. Open the available editor (gedit or even editpad).
3. Type the statement print "hello world"; in a new file and save it as something.pl
4. Go to the directory where you just saved the script, using DOS commands.
5. Run the script with the command perl something.pl
6. What did you get?

13 13 Data types in PERL
    Data type   Description
    scalar      A single number or string value (or a reference)
    array       An ordered list of scalar values
    hash        An unordered list of key-value pairs

14 14 Scalar Data

15 15 Numerical scalar values
- A scalar is either a string or a number.
- Numerical values can be integers, floats or numbers in scientific notation:
    3
    -20
    3.1415
    1.3e4      (= 1.3 * 10^4 = 13000)
    6.35e-14   (= 6.35 * 10^-14)

16 16 String scalar values
Single-quoted strings '':
    print 'hello world';       hello world
Double-quoted strings "":
    print "hello world";       hello world
    print "hello \t world";    hello      world   (a tab character between the words)

17 17 Scalar Variables
- Scalar variables can store only scalar values.
- They start with a $ symbol followed by a name.
- They contain only one single piece of data.
    #Variable declaration
    my $priority;
    #Numerical assignment
    $priority = 1;
    #String assignment
    $priority = 'high';
    #Assign the value of variable $b to $a
    $a = $b;

18 18 Variables - notes and tips  Notes:  Give meaningful names to variables: $name is better than $n  Use an explicit declaration of the variables using the my( ) function  Perl has a long list of scalar special variables ($_, $1, $2,…). So don’t use short names, or check they are not special variables!  Variable names in Perl are case-sensitive. This means that the following variables refer to different values: $varname = 1; $VarName = 2; $VARNAME = 3;

19 19 Interpolating variables into strings
- Interpolation, meaning "introducing or inserting something", is the name given to replacing the name of a variable with the value of that variable.
- In Perl, any string that is built with double quotes (" ") will be interpolated.
    #Single-quoted strings
    $a = 9.5;
    print 'a is $a';     Result = a is $a
    #Double-quoted strings
    $a = 9.5;
    print "a is $a";     Result = a is 9.5
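The difference between the two kinds of quotes can be checked with a minimal sketch like the one below. The variable names and values ($gene, $len) are made up for illustration; the braces in the last string are one way to tell Perl where a variable name ends when it is followed directly by more text.

```perl
use strict;
use warnings;

# Hypothetical values, chosen only for illustration.
my $gene = 'EHD1';
my $len  = 534;

my $single = 'gene $gene has $len aa';   # single quotes: the text stays literal
my $double = "gene $gene has $len aa";   # double quotes: the values are inserted
my $braced = "${gene}_human";            # braces mark where the variable name ends

print "$single\n$double\n$braced\n";
```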

20 20 \ - The escape character
- Backslash is an "escape" character: it gives the next character a special meaning, or neutralizes the special meaning that character already has.
- The constructs below only work inside double quotes.
    Construct   Meaning
    \n          Newline
    \t          Tab
    \\          Backslash
    \"          Double quote
Examples:
    print 'a backslash-t: \t a';      a backslash-t: \t a
    print "a tab character: \t a";    a tab character:      a

21 21 Operators

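22 22 Numerical Operators
- An operator takes some values (operands), operates on them, and produces a new value.
    Operator    Description
    + - * /     addition, subtraction, multiplication and division
    **          exponentiation
    ++ --       increment (add 1) and decrement (subtract 1)
    print 1 + 1;             2
    print ((1+1) ** 3);      8
Note: without the outer parentheses, print (1+1) ** 3; prints 2, because Perl reads print (1+1) as a complete function call.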

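23 23 String Operators
    Operator    Description
    .           concatenate
    x           replicate
    print ('swiss' . 'prot');                 swissprot
    print (('swiss' . 'prot' . '-') x 3);     swissprot-swissprot-swissprot-
Note: the outer parentheses are needed; print ('swiss' . 'prot' . '-') x 3; prints swissprot- only once.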

24 24 Type decision in PERL
- Perl decides the type of a value depending on the operators:
    (9 x 2) + 1   ->  ('9' x 2) + 1  ->  '99' + 1  ->  99 + 1  ->  100
    (9 + 5) . 'a' ->  14 . 'a'       ->  '14' . 'a' ->  '14a'
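The two conversion chains on this slide can be reproduced with a short sketch; the variable names are only illustrative.

```perl
use strict;
use warnings;

my $repeated = 9 x 2;           # x is a string operator: 9 becomes '9', result '99'
my $sum      = $repeated + 1;   # + is numeric: '99' becomes 99, result 100
my $label    = (9 + 5) . 'a';   # . is a string operator: 14 becomes '14', result '14a'

print "$repeated $sum $label\n";
```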

25 25 Class exercise 1.2
- Write a Perl script that does the following:
1. Prints the string "hello world! hello Perl!"
2. Concatenates and prints the words "this", "is" and "ubiquitin" (with spaces between them)
3. As in 2, but using a variable for each word and a tab (\t) between the words.
4. Produces and prints the line: 666:666:666:god help us! without printing any number 6 and with only one : in your script!
5. Makes two numerical variables, then calculates and prints the division, multiplication and sum of these two variables, each on a different line.
- To run it, go to the right directory and use the 'perl -w script.pl' command

26 26 The length function
- The length function returns the length of a string:
    print length ("length");
- Remember to read about functions either on the internet or using perldoc:
    C:\perl> perldoc -f length
    length EXPR
    length  Returns the length in *characters* of the value of EXPR. If EXPR is
            omitted, returns length of $_. Note that this cannot be used on an
            entire array or hash to find out how many elements these have. For
            that, use "scalar @array" and "scalar keys %hash" respectively.
            Note the *characters*: if the EXPR is in Unicode, you will get the
            number of characters, not the number of bytes. To get the length in
            bytes, use "do { use bytes; length(EXPR) }", see bytes.

27 27 The perldoc utility
- You can use the perldoc utility to get help.
Information about a help topic: perldoc topic
    perldoc perlintro    the introduction
    perldoc perlretut    a tutorial on regular expressions
    perldoc perltoc      start at the table of contents
Information about a function: perldoc -f function_name
    perldoc -f print
You can also google your question!

28 28 Reading input
- The filehandle operator <STDIN> allows a script to read input from the user:
    print "What is your name?\n";
    my $name = <STDIN>;
    print "Hello $name!";
This results in the following script run:
    C:\> perl stdin-script.pl
    What is your name?
    Yossi
    Hello Yossi
    !
#Note that the input includes the Enter ("\n")

29 29 Reading input - chomp function
- So use the chomp function, which removes the terminal "\n" (if there is one):
    print "What is your name?\n";
    my $name = <STDIN>;
    chomp $name;
    print "Hello $name!";
Script run:
    What is your name?
    Yossi
    Hello Yossi!
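To see what chomp does without typing anything, the input line can be simulated with a string that ends in "\n". This is only a sketch; it relies on the documented behaviour that chomp returns the number of characters it removed.

```perl
use strict;
use warnings;

my $name    = "Yossi\n";       # what <STDIN> would return after typing Yossi + Enter
my $removed = chomp($name);    # removes the trailing "\n"
my $again   = chomp($name);    # nothing left to remove the second time

print "Hello $name! (removed $removed, then $again)\n";
```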

30 30 The substr function
- The substr function extracts a substring out of a string.
    substr EXPR, OFFSET, LENGTH, REPLACEMENT
    EXPR   - a string value
    OFFSET - a position in the string (counting starts from 0)
    LENGTH - a length
#Example:
    $str = "university";
    $sub = substr ($str, 3, 5);
$sub is now "versi", $str remains unchanged.
Note: If LENGTH is omitted, everything to the end of the string is returned.
Note: You can also use variables as the position and length parameters.
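A few more substr calls, as a hedged sketch on a short amino acid string. The negative offset (counting from the end of the string) and the four-argument form with REPLACEMENT are standard Perl behaviour that the slide does not spell out; the variable names are illustrative.

```perl
use strict;
use warnings;

my $seq = 'SYYTREELEVSD';

my $first5 = substr($seq, 0, 5);   # five characters starting at position 0
my $last5  = substr($seq, -5);     # a negative offset counts from the end
my $rest   = substr($seq, 3);      # no LENGTH: everything from position 3 on

my $masked = $seq;
substr($masked, 0, 3, 'XXX');      # four arguments: replace that part in place

print "$first5 $last5 $rest $masked\n";
```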

31 31 Class exercise 1.3
- Write a script that:
1. Assigns your e-mail address to a variable and prints it
2. Reads the names of several proteins from the command line (<STDIN>) and prints them.
3. Reads a one-line aa sequence (SYYTREELEVSD...) and prints its length and its first and last 5 aa. Use the substr function with its first 3 parameters.
4. Remember to use the chomp function and, if necessary, the perldoc utility to understand the function details.

32 32 2) File Handling, Arrays and Hashes

33 33 Data types
    Data type   Description                                        Example
    scalar      A single number or string value (or a reference)   9, -17, 3.1415, 'hello'
    array       An ordered list of scalar values                   (9, 'hello', 3.5)
    hash        An unordered list of key-value pairs               (9=>'hello', 'bye'=>8)

34 34 Lists and arrays
- A list is an ordered set of scalar values: (1,2,3,"fred")
- An array is a variable that holds a list. An array variable name always starts with a @:
    @a = (10, "ALA", 300, "fred");

35 35 Lists and arrays
    @a = (10, "ALA", 300, "fred");
- You can access or change array elements or the full array:
    print $a[1];     ALA
    $a[0] = "*";
    print @a;        *ALA300fred

36 36 Lists and arrays
- You can easily get a sub-array:
    @a = (1,2,3,"ALA","LYS");
    print @a;            123ALALYS
    print $a[4];         LYS
    @sub_a = @a[2..3];
    print @sub_a;        3ALA
- You can extend an array as much as you like:
    $a[5] = 'GLY';
    print @a;            123ALALYSGLY

37 37 Lists and arrays - the qw() and scalar functions
- The 'quote word' function qw() is used to generate a list of words. It uses embedded whitespace as the word delimiter.
Assigning to arrays:
    my @a = (3..6);                (3,4,5,6)
    my @b = qw (a b cat d);        ("a","b","cat","d")
    my ($a,$b,@c) = (1..5);        $a=1; $b=2; @c=(3,4,5)
- Counting array elements:
    print scalar (@b);             4

38 38 Reading and printing arrays
- You can read lines from the standard input in list context:
    my @a = <STDIN>;
  @a will store all the lines entered until the user hits ctrl-z.
- You can interpolate arrays and array elements into strings:
    @b = (1.3, 4.5, 6.7);
    print @b;        1.34.56.7
    print "@b";      1.3 4.5 6.7
- $" is the list separator: its value is printed between the elements of an array that is interpolated into a double-quoted string. The default is a space.
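The effect of $" can be sketched as follows. Changing it with local inside a block is an illustrative choice, so that the default separator comes back afterwards; the array name is hypothetical.

```perl
use strict;
use warnings;

my @scores = (1.3, 4.5, 6.7);

my $spaced = "@scores";        # default separator: a single space
my $custom;
{
    local $" = ', ';           # change the list separator only inside this block
    $custom = "@scores";
}
my $after = "@scores";         # back to the default separator

print "$spaced\n$custom\n$after\n";
```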

39 39 Class exercise 2.1
Write a script that:
1. Defines an array of 10 elements, including numbers and string scalars
2. Adds a new element to this array in the 12th place
3. Prints the 2nd, 3rd and 11th places. What is the value in the 11th place?
4. Makes a sub-array from elements 4 to 8 and saves it in a new array
5. Prints the two arrays you just created with and without interpolation. What is the difference?
6. Prints the number of elements of each array
7. Reads a third array, with three elements, from the command line (use <STDIN>), then counts the length of each element and prints it!

40 40 Array functions

41 41 push & pop
- push ARRAY, LIST - Pushes the values of LIST onto the end of ARRAY.
    my @a = (1, 2, 3, 4, 5);
    print @a;            12345
    push (@a, 6);
    print @a;            123456
- pop ARRAY - Pops and returns the last value of the array, shortening the array by one element.
    @a = (1,2,3,4,5);
    my $x = pop (@a);
    print $x;            5
    print @a;            1234

42 42 shift & unshift
- shift ARRAY - Shifts the first value of the array off and returns it.
    my @a = (1, 2, 3);
    my $x = shift (@a);
    print $x;            1
    print @a;            23
- unshift ARRAY, LIST - Does the opposite of a "shift". Prepends LIST to the front of the array, and returns the new number of elements in the array.
    my @a = (1, 2, 3);
    print @a;            123
    unshift (@a, 0);
    print @a;            0123
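The four functions work on the two ends of an array: push and pop on the end, unshift and shift on the front. A minimal sketch, with a hypothetical list of residue names:

```perl
use strict;
use warnings;

my @residues = qw(ALA GLY LYS);

push @residues, 'SER';               # add at the end
my $last  = pop @residues;           # take from the end
my $count = unshift @residues, 'MET';   # add at the front, returns the new size
my $first = shift @residues;         # take from the front

print "last=$last first=$first count=$count left=@residues\n";
```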

43 43 split & join
- split /PATTERN/, EXPR, LIMIT - Splits the string EXPR into a list of strings and returns that list.
    my @a = split (/;;/, "score;;4.56;;p-value;;0.004");
    print $a[0];         score
    print "@a";          score 4.56 p-value 0.004
- join EXPR, LIST - Joins the separate strings of LIST into a single string with fields separated by the value of EXPR, and returns it.
    my $str = join (":", @a);
    print "$str\n";      score:4.56:p-value:0.004
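The optional LIMIT argument of split caps the number of fields returned, which is handy when only the first fields matter. A small sketch using the same record as the slide:

```perl
use strict;
use warnings;

my $record = "score;;4.56;;p-value;;0.004";

my @all   = split /;;/, $record;      # four fields
my @first = split /;;/, $record, 3;   # at most three fields; the last keeps the rest

my $rejoined = join ':', @all;

print scalar(@all), " fields, rejoined as $rejoined\n";
print "limited: $first[2]\n";
```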

44 44 Reversing lists
- reverse LIST - In list context, returns a list value consisting of the elements of LIST in the opposite order.
    my @a = ("yossi","bracha","moshe");
    print join (";", reverse(@a));       moshe;bracha;yossi

45 45 Sorting lists
- sort LIST - In list context, this sorts the LIST and returns the sorted list value. Default sorting is alphabetical:
    my @a = sort ("yossi","bracha","moshe");     @a is ("bracha","moshe","yossi")
    my @b = sort (1,3,9,81,243);                 @b is (1,243,3,81,9)
- Numerical sorting requires the <=> operator:
    @s = sort {$a <=> $b} @a;
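The difference between the default (string) order and a numerical order is easiest to see side by side. In this sketch the array names are illustrative; swapping $a and $b in the comparison block gives descending order.

```perl
use strict;
use warnings;

my @numbers = (1, 3, 9, 81, 243);

my @as_text    = sort @numbers;                 # compared as strings
my @ascending  = sort { $a <=> $b } @numbers;   # compared as numbers
my @descending = sort { $b <=> $a } @numbers;   # numbers, largest first

print "text: @as_text\nup:   @ascending\ndown: @descending\n";
```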

46 46 Class exercise 2.2
Write a script that:
1. Reads a number from the first line of standard input, then reads additional lines, prints only the line selected by that number, and stops reading.
2. Reads a list of numbers separated by spaces (a single line of input), splits them into an array, and prints those numbers in reverse order.
3. Then adds two elements to the array from 2, one at the beginning and one at the end, prints the elements of the new array separated by two dashes (--), and then joins the array elements into one string and prints it.

47 47 True and False in PERL  Every expression in PERL has a numerical value, which can be considered true or false.  Generally, PERL functions return a true value if they succeed and false if they fail

48 48 Controls: if
- Controls allow non-sequential execution of commands, and responding to different conditions.
    if (expression is true) {
        __________________;
        __________________;    # do if true
        __________________;
    }
    else {
        __________________;
        __________________;    # do if false
        __________________;
    }

49 49 if, elsif, else
- It's possible to test several conditions in a single if structure:
    print "enter number: ";
    my $n = <STDIN>;
    chomp ($n);
    if ($n > 10) {
        print "large number\n";
    }
    elsif ($n > 0) {
        print "small number\n";
    }
    else {
        print "very small number\n";
    }

50 50 Comparison operators
    String   Numeric   Comparison
    eq       ==        Equal
    ne       !=        Not equal
    lt       <         Less than
    gt       >         Greater than
    le       <=        Less than or equal to
    ge       >=        Greater than or equal to
    if ($age == 18)...
    if ($name eq "Yossi")...
    if ($name ne "Yossi")...
    if ($name lt "n")...

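51 51 Logical operators
- These are often used when you need to check more than one condition:
    if ( (condition1) && (condition2) ) {...}    #checks that both conditions are true
    if ( ! (condition) ) {...}                   #checks that the condition is false
Example:
    $number = <STDIN>;
    if (($number < 10) && ($number > 0)) {    # the upper bound was lost in this transcript; 10 is an assumed value
        print "good number!!";
    }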

52 52 While loop
Commands inside a loop are executed repeatedly (iteratively):
    while (expression is true) {
        _________1_________;
        _________2_________;    # do if expression is true
        _________3_________;    # then go back and recheck
    }
    print "continue with program";

53 53 foreach loop
    foreach $element (@array) {
        __________________;
        __________________;    # iterate over array elements
        __________________;    # assigning them to $element
    }

54 54 Loops
- Examples:
    while ($name ne "Yossi") {
        chomp($name = <STDIN>);
        print "Hello $name!\n";
    }

    @names = qw (fabian jorge romelia jorjina);
    foreach $name (@names) {
        print "Hello $name!\n";
    }

55 55 Class exercise 2.3 Write a script that: 1.Read a list from the STDIN using a while loop and prints only the lines that are a false value (enter some of them of course) 2.Given a list, go over it with foreach, and search for a specific value or string, and stop the search when you found it.

56 56 Files

57 57 Command line parameters
- It is common to give parameters on the command line of a program or a script:
    C:\> perl findProtein.pl input.fasta output.txt
- They will be stored in a special array, @ARGV:
    my $inFile = $ARGV[0];      input.fasta
    my $outFile = $ARGV[1];     output.txt
- Or:
    my ($inFile, $outFile) = @ARGV;

58 58 Reading files - open, close and die functions
- The open function opens the file for reading, and links it to a filehandle. The close function closes this connection.
    open IN, "file-name.txt";    #opens connection with file
    $line = <IN>;                #read first line
    $line = <IN>;                #second line, etc.
    close IN;                    #closes connection
- Lines are then read from the filehandle <IN>, as we did with <STDIN>.
- To check that the open didn't fail (e.g. if the file doesn't exist):
    open IN, "$file" or die "can't open file $file";

59 59 Reading files, 2 additional ways
    open IN, "EHD.fasta";    #opens a connection to file EHD.fasta
a) Read all lines at once and assign them into an array variable:
    @filelines = <IN>;       # read all lines at once!
    chomp @filelines;        #chomp all lines
b) Iterate over lines until end of file, using a while loop:
    while ($line = <IN>) {
        print "$line\n";
    }

60 60 Writing to files
- Open a file for writing and link it to a filehandle, using > to write and >> to append:
    open OUT, "> EHD.analysis";      #write
    open OUT, ">> EHD.analysis";     #append
    print OUT "The mutation is in exon $exonNumber\n";
    close OUT;
- NOTE: When opening with >, if a file by that name already exists it will be overwritten!
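The write, append and read steps can be tried together in one short sketch. Note the assumptions: it uses the three-argument form of open with a lexical filehandle ($out, $in), which is a common alternative to the bareword IN/OUT style used on the slides, and it writes to a scratch file created with the core File::Temp module so that no real data is touched. The value of $exonNumber is made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file, removed automatically when the script ends.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh;

my $exonNumber = 3;   # hypothetical value

# '>' creates the file or overwrites an existing one.
open(my $out, '>', $file) or die "can't open $file for writing: $!";
print $out "The mutation is in exon $exonNumber\n";
close $out;

# '>>' keeps the existing content and adds to the end.
open($out, '>>', $file) or die "can't open $file for appending: $!";
print $out "Second line, appended\n";
close $out;

# '<' opens the file for reading.
open(my $in, '<', $file) or die "can't open $file for reading: $!";
my @lines = <$in>;
chomp @lines;
close $in;

print scalar(@lines), " lines read\n";
```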

61 61 File Test Operators
- You can also ask questions about a file or a directory name. A file test operator takes one argument, either a filename or a filehandle, and tests the associated file to see if something is true about it.
    -e "file"    exists
    -r "file"    is readable
    -w "file"    is writable by you
    -z "file"    has zero size
    -s "file"    has non-zero size (returns the size)
    -f "file"    is a file
    -d "file"    is a directory
    -T "file"    is a text file
For example:
    if (-e "file") {
        print "The file named file exists!\n";
    }

62 62 Working with file paths
- You can use the full path name; it is safer and clearer to read:
    open IN, '<D:\Eyal\PERL\p53.fasta';
- Remember to use \\ in double quotes:
    open IN, "<D:\\Eyal\\PERL\\$name.fasta";
- You can also use / (usually...):
    open IN, "<D:/Eyal/PERL/$name.fasta";

63 63 Reading directories
- Perl allows easy access to the files in a directory by "globbing":
    foreach $fileName (<*.fasta>) {    # the glob pattern was lost in this transcript; *.fasta is an assumed value
        open IN, $fileName or die "can't open file $fileName";
        foreach $line (<IN>) {
            _______________    #do something...
        }
        close IN;
    }
Note: the "glob" gives a list of the file names in the directory.

64 64 Class exercise 2.4 1.Write a script that reads the name of a file (can be itself) through the command line 2.Check that this is a file with size not-zero and readable and if so, print all its content 3.Add to this file a couple of lines (check that you don’t delete it!)

65 65 Hash (associative arrays)

66 66 Hash - an associative array
- An associative array (or simply a hash) is an unordered set of key=>value pairs, in which each key is associated with a value.
- A hash variable name always starts with a % symbol:
    %hash = ("a"=>5, "bob"=>"zzz", 50=>"Johnny");
    key     value
    a       5
    bob     zzz
    50      Johnny

67 67 Working with Hashes
    %h = ("a"=>5, "bob"=>"jones", 50=>"tree");
- You can access a value by its key:
    print $h{50};            tree
    $h{bob} = "aaa";         #changes the value of an existing key, or adds the key if it is new
- You can ask whether a certain key exists in a hash:
    if ( exists ($h{50}) ) {...}
- You can delete a certain key-value pair in a hash:
    delete ($h{50});

68 68 Iterating over hash elements
1. The keys function yields a list of all the current keys in a given hash.
    %AA = ('ARG' => 'R', 'LEU' => 'L', 'ASP' => 'D');
    @names = keys (%AA);
    print "3 letter aa: @names\n";     #3 letter aa: ASP LEU ARG
2. Printing all keys and values of a hash:
    foreach $name (keys (%AA)) {
        print "$name = $AA{$name} ";   #ASP = D ; LEU = L ; ARG = R
    }
3. The elements are given in an arbitrary order, so if you want a certain order use sort:
    foreach $key (sort (keys(%h)))...
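Because the order returned by keys is arbitrary, a report that must look the same on every run is usually built from sorted keys. A minimal sketch with the same %AA hash; the variables @pairs, $report and $size are illustrative.

```perl
use strict;
use warnings;

my %AA = ('ARG' => 'R', 'LEU' => 'L', 'ASP' => 'D');

# Sorting the keys gives the same order on every run.
my @pairs;
foreach my $name (sort keys %AA) {
    push @pairs, "$name = $AA{$name}";
}
my $report = join ' ; ', @pairs;

my $size = scalar(keys %AA);   # number of key-value pairs

print "$report ($size entries)\n";
```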

69 69 Class exercise 2.5
1. Write a script that reads a file with a list of protein names and lengths:
    AP_000081 181
    SE_000174 99
    IO_000138 145
    P0_000118 44
2. Store the names of the sequences as hash keys, with the length of the sequence as the value.
3. Ask if a certain protein exists in the hash and print its length (choose one that does exist).
4. Print the names and lengths of the proteins with a length over 100 aa (or any other length provided at the command line).
5. Print all keys sorted by name

70 70 3) Pattern Matching with Regular Expressions

71 71  While working on text files, we often want to find a certain piece of information within the file: 1.Find all names that end with "man" from the phone book 2.Extract the name, accession and score of all hits from the output of blast 3.Extract the coordinates of all open reading frames from the annotation of a genome 4.Find the line that says "The overall score of the tree is …" in the output of a program that builds phylogenetic trees  We will see some of the pattern-matching capabilities of Perl, but much more is available. Pattern matching

72 72 The matching operators
- The matching operator finds a substring within a larger string. For example:
    $line =~ m/string/;
- The operator =~ m// returns true if the substring "string" occurs within the string in $line.
- For example, the following will return true:
    $line = "Blast score hit AOT-34 3.45";
    print "match" if ($line =~ m/score/);

73 73 Pattern substitution s///
- Replacing a substring (substitute):
    $line = "the cat on the tree";
    $line =~ s/he/hat/;
    print $line;         #"that cat on the tree"
- To replace all occurrences of a substring add a "g" (for "globally"):
    $line = "the cat on the tree";
    $line =~ s/he/hat/g;
  $line will be turned into "that cat on that tree"
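A substitution also returns the number of replacements it made, which is a quick way to check that a pattern matched where expected. A hedged sketch, working on copies so the original string is kept:

```perl
use strict;
use warnings;

my $line = "the cat on the tree";

my $once = $line;
my $n_once = ($once =~ s/he/hat/);    # first occurrence only

my $all = $line;
my $n_all = ($all =~ s/he/hat/g);     # every occurrence

print "$n_once: $once\n$n_all: $all\n";
```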

74 74 Single character classes
- But in many cases we will need to match a more variable pattern, for example "class1", or "class8", etc.
    Character/s       Meaning
    . (usage m/./)    any character except "\n"
    \d                a digit (same as: [0-9])
    \D                not a digit
    \w                a "word" character (same as: [a-zA-Z0-9_])
    \W                not a word char
    \s                a space character (same as: [ \t\n\r\f])
    \S                not a space char
For example: m/class\.ex\d/; will be true for "class.ex3" and "class.ex8" ... but false for "class.exe" (why?)

75 75 Pattern Matching Metacharacters
    Character/s   Meaning
    [abc]         Matches "a" or "b" or "c"
    [a-z]         Matches any lower case letter
    [a-zA-Z]      Matches any letter
    [0-9]         Matches any digit
    [^abc]        Matches any character except "a" or "b" or "c"
    [^0-9]        Matches any character except a digit
For example: $line =~ m/ex[1-9]./; will be true for "ex3a" ; "ex8*" ... but false for "execute3.2", can you say why?
    m/ex[1-9]\.[1-9]/

76 76 Quantitative character classes
    *     means zero or more repetitions of that pattern: m/ab*c/ matches "abc" ; "ac" ; "abbbbc"
    +     means one or more repetitions: m/ab+c/ matches "abc" ; "abbbbc" but not "ac"
    ?     means zero or one repetitions: m/ab?c/ matches "ac" or "abc"
    { }   generally, use {} for a certain number of repetitions, or a range: m/ab{3,6}c/ matches "a", then "b" 3-6 times, and then "c"
    ( )   use parentheses to repeat a group of characters: m/h(el)*lo/ matches "hello" ; "hlo" ; "helelello"

77 77
- For example, to force the pattern to be at the beginning or at the end of the string we use the following characters:
    ^    Matches only at the beginning of the string
    $    Matches only at the end of the string
And there are many more...

78 78 Some examples
    m/\d+(\.\d+)?/
        Matches numbers that may contain a decimal point: "10"; "3.0"; "4.75"
    m/^NM_\d+/
        Matches Genbank RefSeq accessions like "NM_079608"
    m/^\s*CDS\s+\d+\.{2}\d+/
        Matches annotation of a coding sequence in a Genbank DNA/RNA record: " CDS 87..1109"
    m/^\s*CDS\s+(complement\()?\d+\.\.\d+/
        Allows also a CDS on the minus strand of the DNA: " CDS complement(4815..5888)"
    m/^\s*CDS/
        We could just use this - it is a question of the strictness of the format.
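Patterns like these are easiest to trust after trying them on a few lines that should and should not match. The sketch below wraps two of the patterns in small helper subroutines; the names is_cds_line and is_refseq_mrna, and the spacing of the test lines, are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping two of the patterns above.
sub is_cds_line {
    my ($line) = @_;
    return $line =~ m/^\s*CDS\s+(complement\()?\d+\.\.\d+/ ? 1 : 0;
}

sub is_refseq_mrna {
    my ($id) = @_;
    return $id =~ m/^NM_\d+/ ? 1 : 0;
}

foreach my $line ("     CDS             87..1109",
                  "     CDS             complement(4815..5888)",
                  "     gene            87..1109") {
    print is_cds_line($line) ? "CDS:   " : "other: ", $line, "\n";
}
```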

79 79 Extracting and using part of a pattern - ()
- We can extract the parts of the string that matched parts of the pattern by enclosing them in parentheses (); they are saved in the special variables $1, $2, etc.
       $1       $2
    m/(match1) (match2)/;
- For example:
    $line = "1.35";
    if ($line =~ m/(\d+)\.(\d+)/ ) {
        print $1;        1
        print $2;        35
    }
Note: spaces inside a pattern are matched literally, so the pattern must not contain spaces that are not in the string.
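In list context a match returns the captured parts directly, so they can be assigned to named variables instead of being read from $1 and $2. A minimal sketch on a CDS line; the variable names and the length calculation are illustrative.

```perl
use strict;
use warnings;

my $line = "     CDS             87..1109";

# In list context the match returns the captured groups.
my ($start, $end) = $line =~ m/(\d+)\.\.(\d+)/;
my $length = $end - $start + 1;

# When the pattern does not match, the list is empty and the variable stays undefined.
my ($missing) = "no coordinates here" =~ m/(\d+)\.\.(\d+)/;

print "start=$start end=$end length=$length\n";
```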

80 80 Advanced substitution
- The extracted parts of the pattern can be used in a substitution:
    $line = " CDS 4815..5888";
    $line =~ s/(\d+)\.\.(\d+)/$1-$2/;
    print $line;         CDS 4815-5888

81 81 split function revisited
- The split function actually treats its first parameter as a regular expression:
    $line = "13 5.3 -23 8";
    @numbers = split (/\s+/, $line);
    print "@numbers";                13 5.3 -23 8
    print join ('#', @numbers);      13#5.3#-23#8
- Remember to use perldoc -f to understand the details of each function

82 82 Patterns are greedy
- If a pattern can match a string in several ways, it will take the maximal substring:
    $line = "fred xxxxxxxxxx john";
    $line =~ s/x+/@/;
    print $line;
  will print "fred @ john" and not "fred @xxxxx john"
- You can make a minimal pattern by adding a ? to any of */+/?/{}:
    $line = "fred xxxxxxxxxx john";
    $line =~ s/x+?/@/;
  Here, only one x will be replaced: "fred @xxxxxxxxx john"
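Greediness matters most when a delimiter appears more than once on a line. In the sketch below (the input line is made up), the greedy .* runs to the last closing parenthesis, while the minimal .*? stops at the first one.

```perl
use strict;
use warnings;

my $text = "CDS (87..1109) and (4815..5888)";   # hypothetical line

my ($greedy)  = $text =~ m/\((.*)\)/;    # .* takes as much as it can
my ($minimal) = $text =~ m/\((.*?)\)/;   # .*? takes as little as it can

print "greedy:  $greedy\nminimal: $minimal\n";
```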

83 83 Multiple choice
- If one of several patterns may be acceptable in a pattern, we can use |:
    /CDS (\d+\.\.\d+|\d+-\d+|\d+,\d+)/
  This expression will match: "CDS 231..345" or "CDS 231-345" or "CDS 231,345"

84 84 Variables in patterns
- Variables can be interpolated into regular expressions, as in double-quoted strings:
    $name = "Yossi";
    $line =~ m/^$name\d+/;
  This pattern will match: "Yossi5", "Yossi45", etc.
- Special patterns will also be interpolated and used:
    $name = "Yos+i";
    $line =~ m/^$name\d+/;
  Then the pattern could match "Yosi5" and "Yossssi5"

85 85 Variables in patterns, example
- Say we need to extract the hit score from some blast output:
    $GenBank = 'ref|NT_039621.4|Mm15_39661_34 Mus musculus chromosome 15 genomic... 186 1e-45';
    $hitName = "NT_039621";
    $GenBank =~ m/^ref\|$hitName.*\s+(\S+)/;
    print $1;
  This will print 1e-45

86 86 Translate tr///
- A special type of substitution translates one set of characters into another set, character by character:
    tr/SEARCHLIST/REPLACEMENTLIST/;
  So for example:
    $seq = "AGCATCGA";
    $seq =~ tr/ATGC/TACG/;
    print $seq;          "TCGTAGCT"
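Combining tr/// with the reverse function gives the reverse complement of a DNA sequence, and tr/// with an empty replacement list simply counts the matching characters. This is a sketch only; the subroutine name reverse_complement is hypothetical, and it handles just the four plain bases in upper or lower case.

```perl
use strict;
use warnings;

# Hypothetical helper: complement each base, then reverse the string.
sub reverse_complement {
    my ($seq) = @_;
    my $rc = scalar reverse $seq;    # scalar context reverses the characters
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

my $seq = "AGCATCGA";
my $rc  = reverse_complement($seq);

# With an empty replacement list, tr counts characters and changes nothing.
my $gc = ($seq =~ tr/GC//);

print "$seq -> $rc (G+C = $gc)\n";
```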

87 87 Class exercise 3.1 1.Write a script that will read all FASTA files in a given directory, and print the names of all sequences. (names of FASTA files should have a ".fasta" suffix). 2.Change the script so that the names from each file will be written to a file named as the input file with an added extension ".names"

88 88 Class exercise 3.2
- Write the following regular expressions, and a script that tests each of them against a line read from STDIN and prints the result of matching that line with the pattern.
1. Match a name beginning with a capital letter followed by lower case letters only
2. Match a first name followed by a last name, and extract the last name
3. Replace the space between the names with an underscore
4. Match a FASTA header line and extract the whole line except for the ">"
5. Match a date such as: 12/8/2005

89 89 Class exercise 3.3
- Write a script that extracts and prints the following features from a Genbank record of a genome (use the example of an adenovirus genome from the link on this slide, or download it from the course site). The script should do the following:
1. Find the lines describing the references and extract the titles (REFERENCE, AUTHORS, etc.).
2. Find the lines of protein_id in that file and extract the ids
3. Find the lines of coding sequence annotation (CDS) and extract the coordinates. Make sure you get all of them!

90 90 Class exercise 3.4  Continuing with the record of the adenovirus genome: 1.Get a journal name and the year of publication from the user, find this paper in the adenovirus record and print the pages of this paper in the journal 2.Get the first and last names of an author from the user, find the paper in the adenovirus record and print the year of publication. Can you find the paper by Kei Fujinaga?

91 91 Class exercise 3.5  Continuing with the record of the adenovirus genome: 1.Extract the whole translated protein ("translation" lines) and write out the proteins in fasta format.

92 92 Modules and more PERL capabilities

93 93  A module or a package is a collection of subroutines, usually stored in a separate file with a *.pm suffix (Perl Module).  The subroutines of a module should deal with a well-defined task. For example the module Fasta.pm may contain a group of subroutines that read and write and manipulate fasta files.  In order to write a script that uses a module add a use line at the beginning of the script: use MODULE; Using modules

94 94 Installing modules from the internet
- The best place to search for Perl modules that can make your life easier is: http://www.cpan.org/
- Once you have identified the module you want, the easiest way to download and install it is the cpan shell
- In the shell, use the install command with the name of the required module:
    D:\>cpan
    cpan shell -- CPAN exploration and modules installation
    Enter 'h' for help.
    cpan> install module

95 95 BioPerl
- BioPerl modules are called Bio::XXX...
- You can see all available modules at http://bio.perl.org/, with documentation and examples of how to use them, which is the best way to learn this...
- An online course: http://www.pasteur.fr/recherche/unites/sis/formation/bioperl/
- But installation and usage are not so easy; you can read the documentation, and we may try to do an advanced course for you...
- What can we do with BioPerl?
    - Reading and manipulating sequence files
    - Running and parsing BLAST, alignments, etc.
    - Manipulating PDB files, etc.

96 96 Calling system commands
- Generally, you can execute any command of the operating system:
    $systemReturn = system ("del fred.txt");
    $systemReturn = system ("copy fred.txt george.txt");
- When checking the value returned by a system call, usually 0 means no errors:
    if ($systemReturn != 0) {
        die "can't copy fred.txt";
    }
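The value returned by system is a wait status rather than the plain exit code; the exit code of the command is obtained by shifting it right by 8 bits. The sketch below avoids touching any files: as an illustrative choice, the "command" is the Perl interpreter itself ($^X holds its path), asked to exit with a known code.

```perl
use strict;
use warnings;

# Run a second copy of the Perl interpreter as the external command.
my $status    = system($^X, '-e', 'exit 3');
my $exit_code = $status >> 8;    # the upper bits hold the command's exit code

my $ok_status = system($^X, '-e', 'exit 0');

print "failing command: status $status, exit code $exit_code\n";
print "succeeding command: status $ok_status\n";
```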

