An Introduction to Perl Programming
Fernán Agüero
Instituto de Investigaciones Biotecnológicas, UNSAM
fernan@unsam.edu.ar

Where we are today

A bioinformatic experiment
An experiment at the computer is no different from an experiment at the bench:
– You search for an answer to a specific question
– Experiments should be reproducible by someone else using similar methods
Identify the problem
– Which catalytic mechanism does enzyme X use?
Identify the tools to answer the question
– Sequence similarity searches
– Multiple sequence alignments
– Identification of conserved motifs and domains
– Modelling the 3D protein structure
Define criteria for success of the experiment
– Almost all computational methods will provide a hit, or give you an answer
– You need to distinguish noise from signal, discard false positives, etc.
– You need to understand the programs and tools you use, their algorithmic basis, whether they rely on external data for training, etc.

A bioinformatic experiment
Identify the problem
– You have a new genome
– Search for kinases that are conserved in other organisms
– Do a multiple sequence alignment of the best hits
– Search for conserved motifs
Describe the problem in computational terms
– BLAST query sequence vs database X
– Filter BLAST hits (score > 100)
– Filter BLAST hits (annotation has the word 'kinase')
– Store the best hits in FASTA format
– Run ClustalW on the stored sequences
– Run BLOCKS on the multiple sequence alignment
– Manually inspect the conserved blocks
× 5000 sequences

Programming computers
Anyone with experience in the design of wet lab experiments can program a computer
An experiment in the lab begins with a question that leads you to a testable hypothesis
The experiment helps you to test (and possibly discard) that hypothesis
On the computer, the programs you write should be designed to test hypotheses
Learning a programming language might seem like a daunting task, but it is similar to learning to use a new tool, or another natural language (English, French)

First steps: programming a computer to automate tasks
[Flowchart: Sequences, BLAST, decision "Good hit?" (Yes/No), decision "Is this a kinase?" (Yes/No), Save]
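The two decisions in the flowchart can be sketched in a few lines of Perl. This is only an illustration: the hit records, their field names (id, score, annotation) and their values are invented, and the score cut-off of 100 is the one quoted on the previous slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical BLAST hits; field names and values are invented for illustration.
my @hits = (
    { id => 'hit1', score => 250, annotation => 'serine/threonine protein kinase' },
    { id => 'hit2', score => 40,  annotation => 'protein kinase, putative' },
    { id => 'hit3', score => 180, annotation => 'hypothetical protein' },
);

my @saved;
foreach my $hit (@hits) {
    next unless $hit->{score} > 100;               # Good hit?
    next unless $hit->{annotation} =~ /kinase/i;   # Is this a kinase?
    push @saved, $hit->{id};                       # Save
}

print "Saved: @saved\n";
```

Only hit1 passes both tests: hit2 fails the score filter and hit3 fails the annotation filter.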

A bioinformatic experiment (contd)
Select the data sets
– In a wet lab experiment you rely on reagents. Generally you know when they were prepared, who prepared them, how they were prepared, etc.
– In a dry lab experiment, the same type of information is essential. You need to know your data sources
  Database: When was the information compiled? What were the criteria for accepting entries into the dataset?
– Take notes and save the output of programs in your bioinformatic experiments!

What is Perl
Perl is a programming language
– Interpreted
– High level
– Dynamic
Created by Larry Wall
– Perl is a language to get your job done!
In Perl
– There is more than one way to do it

Why Perl
Perl has been designed with text processing in mind
– Filter text, generate reports
Ideally suited for sequence analysis
– All sequences are text
– Convert formats easily with Perl
  GenBank, FASTA, ClustalW
– Analyze and process files
  Find restriction enzyme sites
  Motifs
  Vector
  Trim sequences
  Etc.
© Lincoln Stein

A basic Perl script
    #!/usr/bin/perl
    statement;
    one long statement
        using many lines;
    exit;   # optional
– The first line is the absolute path to the Perl interpreter (elsewhere it may be /usr/local/bin/perl or /opt/bin/perl)
– 'which perl' or 'whereis perl' will give you the path to your Perl
– Statements are ended by semicolons
Source: Cynthia Gibas and Per Jambeck. O'Reilly & Associates (2001), ISBN 1565926641

A basic Perl program/script
A Perl program is
– A plain text file
– Containing statements written in the Perl language
Any text editor can be used
– Vi, vim, emacs, xemacs, nedit, gedit, pico, nano, ee
– Textpad, PSPad (Windows)
– BBEdit, TextWrangler (Mac OS X)
– Remember: a word processor IS NOT a text editor
In Unix a text file can be a program (i.e. it can be executed)
– Giving execution privileges: 'chmod +x program.pl'
– Telling the operating system about the interpreter responsible for understanding the statements in the file: #!/usr/bin/perl

Running our Perl script
Save the file
– Usually we use '.pl' for Perl scripts
– program.pl
Give the file execution privileges
– chmod +x program.pl
Execute it from the command line
– ./program.pl
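As a concrete file to try these steps on, here is a minimal sketch of what program.pl could contain. The sequence is an invented example, and 'use strict' and 'use warnings' are additions the slides do not show, included because they catch many beginner mistakes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $sequence = 'GGATCCGGGACCAAAA';   # invented example sequence
my $message  = "The sequence has " . length($sequence) . " bases";

print $message, "\n";
```

After saving it as program.pl and running 'chmod +x program.pl', './program.pl' should print one line reporting 16 bases.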

Variables
– Allow you to store data in your program
– And perform operations on the data
  Numeric values
  Text values
In Perl there are different types of variables
– To store unidimensional data
  1 value
– To store many values
  Lists
– To store associations between values
  Key: value
  Hashes, associative arrays

Variables
Scalars ($)
– Unidimensional
– Can hold any type of data
  Text
  Integers
  Floating point numbers
– They are prefixed with $
    $var = "GGATCCGGGACCAAAA";              # assign a string
    $val = 42;                              # assign a number
    ($a, $b, $c) = ("me", "my", "mine");    # assign all at once
    print $a;                               # would print "me"
    ($l, $r) = ($r, $l);                    # swap values

Variables (contd)
Arrays (or Lists) (@)
– In Perl, an array/list is an indexed collection of values. Values can be scalar values of any type (text, numbers)
– The first element is at index 0 (zero)
– They are prefixed with @; a single element is a scalar, so it is written $list[0]
    @list = ("juan", "jose", "fred");   # assign 3 elements
    print $list[0];                     # print first element of @list
    push @list, "roberto";              # adds string at the end
    print $list[3];                     # prints "roberto"
    $first = shift @list;               # remove and return leftmost value
    $last = pop @list;                  # remove and return rightmost value

Variables (contd)
Hashes (%)
– Also called associative arrays
– They store values in pairs
  Key => Value
– They are prefixed with %
    %me = (
        name  => "Fernan",
        age   => 37,
        loves => "Perl",
    );                              # create a hash with 3 key/value pairs
    print $me{name};                # print value associated with 'name'
    $me{born} = "Buenos Aires";     # add a new key-value pair
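The three variable types are often used together. The sketch below is one possible illustration rather than part of the original slides: it counts bases in a DNA string, using a scalar for the sequence, an array for the order of the report, and a hash for the counts. The variable names are invented, and 'my' (needed under 'use strict') declares each variable.

```perl
use strict;
use warnings;

my $dna   = 'GGATCCGGGACCAAAA';                    # scalar: one value
my @bases = ('A', 'C', 'G', 'T');                  # array: ordered list
my %count = (A => 0, C => 0, G => 0, T => 0);      # hash: base => count

# split with an empty pattern breaks the string into single characters
foreach my $base (split //, $dna) {
    $count{$base}++;
}

foreach my $base (@bases) {
    print "$base: $count{$base}\n";
}
```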

Using variables
Choice of variable types gives you power
– Use the variable type that best fits your data
Getting complex
– You can create more complex data structures by mixing scalars, arrays and hashes.
– Some examples: a hash of hashes to store sequences
    %sequences = (
        eco0001 => { seq => "ATG...TGA", desc => "hypothetical protein" },
        eco0002 => { seq => "ATG...TAA", desc => "DNA polymerase" },
        ...
    );
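A short sketch of how such a hash of hashes can be filled and read back. The sequences here are invented and shortened stand-ins (the slide elides the real ones), and eco0003 is a hypothetical extra entry added to show assignment.

```perl
use strict;
use warnings;

# Invented, shortened records standing in for the elided sequences.
my %sequences = (
    eco0001 => { seq => 'ATGAAATGA',    desc => 'hypothetical protein' },
    eco0002 => { seq => 'ATGCCCGGGTAA', desc => 'DNA polymerase' },
);

# Add one more record (hypothetical entry)
$sequences{eco0003} = { seq => 'ATGTAG', desc => 'short ORF' };

# Two keys in a row reach the inner value: $sequences{ID}{FIELD}
foreach my $id (sort keys %sequences) {
    my $length = length $sequences{$id}{seq};
    print "$id\t$length\t$sequences{$id}{desc}\n";
}
```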

From strings to lists and back again
Convert a string into a list of values
– Useful when reading files exported from spreadsheets
– E.g. from Excel, in tab- or comma-delimited format
– @values = split( /pattern/, $string )
    $string = "Cel1980.1 ATG Hypothetical 2.54 High";
    @values = split(/ /, $string);
    # @values is now ("Cel1980.1", "ATG", "Hypothetical", "2.54", "High")
    print $values[3];                       # would print 2.54
    ($id, @values) = split(/ /, $string);
    # $id is "Cel1980.1"
    # @values is now ("ATG", "Hypothetical", "2.54", "High")

Convert a list of values into a string
– Useful to generate files that can then be imported into a spreadsheet application (OpenOffice, Excel)
– $string = join( "character(s)", @list )
    @list = ("Cel1980.1", "ATG", "Hypothetical", "2.54", "High");
    $string = join("||", @list);
    # $string is now "Cel1980.1||ATG||Hypothetical||2.54||High"
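split and join are frequently used as a pair: split a line, pick the fields you want, and join them again with a different delimiter. A minimal sketch, assuming a tab-delimited input line built from the same example values:

```perl
use strict;
use warnings;

my $line = "Cel1980.1\tATG\tHypothetical\t2.54\tHigh\n";
chomp $line;                             # remove the trailing newline first

my @values = split /\t/, $line;          # tab-delimited input
my $csv    = join ',', @values[0, 3, 4]; # keep columns 0, 3 and 4 (an array slice)

print "$csv\n";
```

The array slice @values[0, 3, 4] selects several elements at once, so $csv ends up holding the ID, the number and the last column separated by commas.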

Working with files
Declare a handle and associate it with a file
Use the handle to refer to the file
Reading from files
    open(MYHANDLE, "/home/fernan/somefile.txt");
    while ( $line = <MYHANDLE> ) {
        # read the file one line at a time, do some action on $line
    }
    close MYHANDLE;
Writing to files
– This will overwrite the contents of the file!
    open(MYOUTPUTHANDLE, ">/home/fernan/somefile.txt");
    print MYOUTPUTHANDLE "Hello there!";
    close MYOUTPUTHANDLE;

Working with files (contd)
Appending to files
– Appending does not overwrite the contents of the file
    open(MYOUTPUTHANDLE, ">>/home/fernan/somefile.txt");
    print MYOUTPUTHANDLE "Hello there!";
    close MYOUTPUTHANDLE;
Special handles
– They are always open and available
– STDIN, for reading (e.g. from the keyboard or from pipes)
– STDOUT, for writing
– STDERR, for writing
    while ( <STDIN> ) {
        # read from the keyboard or from a pipe
    }
    print STDOUT "MW: 325.08 kDa", "\n", "pI: 9.54", "\n", "Length: 2954 aa";
    print STDERR "Warning: sequence length is zero!";
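The slides use the classic two-argument open with bareword handles. A commonly recommended alternative, sketched here as an addition and not as the author's method, is the three-argument open with a lexical handle and an explicit check for failure. The file name and its contents are invented, and File::Temp (a core module) supplies a scratch directory so the example does not touch real files.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);          # scratch directory, removed at exit
my $file = "$dir/somefile.txt";

# '>' creates the file, or overwrites it if it already exists
open(my $out, '>', $file) or die "Cannot write $file: $!";
print $out "Hello there!\n";
close $out;

# '>>' appends, leaving the existing contents intact
open(my $app, '>>', $file) or die "Cannot append to $file: $!";
print $app "Hello again!\n";
close $app;

# '<' reads
open(my $in, '<', $file) or die "Cannot read $file: $!";
my @lines;
while ( my $line = <$in> ) {
    chomp $line;
    push @lines, $line;
}
close $in;

print scalar(@lines), " lines read\n";
```

Checking the return value of open with 'or die' reports the reason in $! when a file is missing or not writable, instead of silently reading nothing.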

Operators
Assignment operators
– =  +=  .=
    $a = 1;
    $a += 2;              # $a is now 3
    $a *= 2;              # $a is now 6
    $a = "Me";
    $a .= "Myself";       # $a is now "MeMyself"
    $a .= "AndIrene";     # $a is now "MeMyselfAndIrene"
Control operators
– &&  ||  !  (logical AND, OR and NOT)
    if ( $a && $b ) { # do something
    }
    if ( $mw > 100 || $pi < 9 ) { ... }
    if ( ! defined $c ) { ... }
– and  or  not
    if ( $a and $b ) { # do something
    }
    if ( $mw > 100 or $pi < 9 ) { ... }
    if ( not defined $c ) { ... }
Comparison operators
– Numerical: <  >  <=  >=  !=  ==
– String: lt  gt  le  ge  ne  eq  cmp
    if ( $a == 4 ) { # do something
    }
    if ( $b eq "ATG" ) { ... }

Iterations, Loops
    while (expression) { execute block }
    unless (expression) { execute block }    # a conditional, not a loop: runs once if the expression is false
    do { execute block } until ( expression );
    foreach (@list) { execute block }
    for (initial; expression; increment) { execute block }
– for ($i = 0; $i <= 100; $i = $i + 1)
  Start at zero (0), continue while $i <= 100, increment $i by one each time
Expression
– An expression that evaluates to either true or false
Execute block
– List of statements that will be executed in the loop or if the condition is met
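The loop forms above, applied to a small list. This is an illustrative sketch with invented values (a list of sequence lengths), not an example from the original slides.

```perl
use strict;
use warnings;

my @lengths = (120, 85, 300);    # invented sequence lengths

# foreach: visit every element
my $total = 0;
foreach my $len (@lengths) {
    $total += $len;
}

# for: count with an index; @lengths in numeric context is the element count
my $long = 0;
for (my $i = 0; $i < @lengths; $i++) {
    $long++ if $lengths[$i] > 100;
}

# while: repeat as long as the expression is true
my @queue     = @lengths;
my $processed = 0;
while (@queue) {
    my $next = shift @queue;     # take the leftmost value
    $processed++;
}

print "total=$total long=$long processed=$processed\n";
```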

Exercise 1
Read a tab-delimited file
– InterPro results
Produce a new tab-delimited file
– With fewer columns
– With the columns reordered
File: lmajor_interpro.tab
Columns
– 0 Sequence
– 1 Checksum
– 2 Length
– 3 Search Method
– 4 Match Accession
– 5 Match Description
– 6 Match Start
– 7 Match End
– 8 Match Evalue
– 9 Match T/F
– 10 Date
– 11 Interpro Family Accession
– 12 Interpro Family Description
– 13 Gene Ontology Terms

    #!/usr/bin/perl
    open(INTERPRO, "lmajor_interpro.tab");
    # read interpro data, one line at a time
    while ( $line = <INTERPRO> ) {
        chomp $line;                      # remove the trailing newline
        # split the columns
        @values = split(/\t/, $line);
        $sequence = $values[0];
        $checksum = $values[1];
        $length = $values[2];
        $method = $values[3];
        #...
        print $sequence, "\t", $method, "\t", $accession, "\t",
              $start, "\t", $end, "\t", $description, "\n";
    }
    close INTERPRO;
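One possible way to finish the exercise is to select and reorder the columns with an array slice. This is a sketch under assumptions: the helper name reorder_line is hypothetical, the column positions come from the column list on the exercise slide, and the example row is invented rather than taken from lmajor_interpro.tab.

```perl
use strict;
use warnings;

# Hypothetical helper: keep six columns of one InterPro line, in a new order.
sub reorder_line {
    my ($line) = @_;
    chomp $line;
    my @values = split /\t/, $line;
    # sequence(0), method(3), accession(4), start(6), end(7), description(5)
    return join("\t", @values[0, 3, 4, 6, 7, 5]);
}

# Invented example row with the 14 columns listed in the exercise.
my $example = join("\t",
    'LmjF01.0010', 'ABCDEF0123456789', 350, 'HMMPfam', 'PF00069',
    'Protein kinase domain', 12, 270, '1e-50', 'T', '20-Nov-2008',
    'IPR000719', 'Protein kinase', 'GO:0004672') . "\n";

my $result = reorder_line($example);
print "$result\n";
```

Inside the exercise's while loop, each line read from the file could be passed to the same function and the returned string printed to an output handle.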

Regular expressions
They are expressions that describe a family of strings
– They can describe a literal string: GAATTC (think about Find/Replace in Word, Excel, Acrobat)
– But the power does not lie in matching literals
They use a particular syntax
– GAATTC
– G[AT]+C, G[^GC]+C
  Will match GAATTC, GAAATTC, GAAATTTC, GTATATATAC
– GA+T+C
  Will match GATC, GAATC, GAATTC, GAAATC, GAAATTTC
– GA{2}T{2}C
  Will match GAATTC
– GA{1,2}T{1,2}C
  Will match GATC, GAATC, GAATTC, GATTC
– Allow mismatches: GA..TC

Regular expressions
There are some special characters
– [ ]  options for matching at a certain place in the string
– { }  options for matching a certain number of times
– \s  a whitespace character (space, tab, newline)
– \S  anything but whitespace
– \d  a digit
– \D  anything but a digit
– \w  a word character (letter, digit or underscore)
– \W  anything but a word character
– .  match any character (except a newline)
– +  match the preceding character 1 or more times
– *  match the preceding character 0 or more times
– ?  match the preceding character 0 or 1 times
– ( )  store the matched pattern in a variable
  (Hello)\s(Nick)\s(Thomson)
  $1 = Hello, $2 = Nick, $3 = Thomson
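The patterns from these two slides can be tried directly with the match operator =~. In this sketch the candidate strings are the ones listed on the slides plus one invented string (GACGTC); the ^ and $ anchors are an addition that forces each pattern to match the whole string rather than any part of it.

```perl
use strict;
use warnings;

my @candidates = qw(GATC GAATC GAATTC GATTC GAAATTTC GACGTC);

# grep keeps the elements for which the match succeeds
my @exact   = grep { /^GA{2}T{2}C$/ }     @candidates;
my @ranged  = grep { /^GA{1,2}T{1,2}C$/ } @candidates;
my @anybase = grep { /^GA..TC$/ }         @candidates;

print "exact:   @exact\n";
print "ranged:  @ranged\n";
print "anybase: @anybase\n";

# Parentheses capture what matched into $1, $2, $3
my $greeting = 'Hello Nick Thomson';
my ($first, $second, $third);
if ( $greeting =~ /(Hello)\s(Nick)\s(Thomson)/ ) {
    ($first, $second, $third) = ($1, $2, $3);
}
print "$third, $second\n";
```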

BioPerl
BioPerl is
– A collection of Perl modules
– That greatly simplify writing bioinformatics programs
BioPerl allows you to be lazy
– You don't need to care about formats
  BioPerl reads FASTA, GenBank, ClustalW, BLAST...
  $seq->id, $blast->evalue, $blast->score, $blast->next_hit
    use Bio::SeqIO;
    $seqio = Bio::SeqIO->new( -file => "tcruzi.fasta", -format => "fasta" );
    while ( $seqobj = $seqio->next_seq() ) {
        $sequence = $seqobj->seq();
        $id = $seqobj->id();
    }

Using BioPerl
Read the documentation
– Identify the module you need
– Know the objects you will be dealing with
  Sequence objects? Alignments?
– And the methods and functions implemented by the module
  next_seq, next_report, next_hit
  add_seq, get_seq, revcomp, write_aln
    use Bio::AlignIO;
    $io = Bio::AlignIO->new( -file => "267.aln", -format => "clustalw" );
    $aln = $io->next_aln();
    $minialn = $aln->slice(20,30);
    $newaln = $aln->remove_columns(['mismatch']);

Using BioPerl
Fast conversion of formats (a leading '>' in the file name opens the file for writing)
    use Bio::AlignIO;
    $in = Bio::AlignIO->new( -file => "1.aln", -format => "clustalw" );
    $out = Bio::AlignIO->new( -file => ">2.pfam", -format => "pfam" );
    $aln = $in->next_aln;
    $out->write_aln($aln);

    use Bio::SeqIO;
    $in = Bio::SeqIO->new( -file => "1.gbk", -format => "genbank" );
    $out = Bio::SeqIO->new( -file => ">2.fasta", -format => "fasta" );
    $seq = $in->next_seq;
    $out->write_seq($seq);

Exercises
A multiple FASTA file
– Contains many sequences
    >seq1 blah blah blah 1877:2980
    ATGCGGTGACCGGTTAATTACACACAGGTACAGCCATATGCCGATTT
    GACGATACGTTTAGGCTTTCACAAAGGTGAGACT
    >CDS CDS blah blah blah 2856:3245
    cgatcgctatcgcggta
    cggatatcgcgatcg
    cgtatatctcgagtc
    cgagagtatcatatg
    ctatcggagagc
    cgaagctcttatatcg
    cggatatcgcgga

Exercises
Read the following program, and try to explain its purpose, and its expected input and output
    #!/usr/bin/perl

    # In Perl $/ is a special variable
    # $/ is the record separator or $RS
    # by default $/ is a newline, we redefine it here
    $/ = '>';

    while ( $next = <STDIN> ) {
        chomp $next;
        next unless length $next;
        ($header, @seq) = split(/\n/, $next);
        $seq = join('', @seq);
        $seq =~ s/\s//g;
        $seq =~ s/(.{60})/$1\n/g;
        print ">$header\n$seq\n";
    }
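To see what redefining $/ does on its own, the sketch below reads records from a string through an in-memory filehandle. It is a separate illustration and deliberately does less than the exercise program; the two short records are invented.

```perl
use strict;
use warnings;

# Invented two-record FASTA text, read through an in-memory filehandle.
my $text = ">seq1 first\nACGT\nTTGA\n>seq2 second\nGGCC\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = '>';              # read up to each '>' instead of each newline
    while ( my $record = <$fh> ) {
        chomp $record;           # chomp removes the current $/, i.e. the trailing '>'
        next if $record eq '';   # nothing precedes the first '>'
        push @records, $record;
    }
}
close $fh;

print scalar(@records), " records\n";
```

Each element of @records now holds one header line plus its sequence lines, which is the unit the exercise program goes on to split on newlines. Using 'local' restores the default record separator when the block ends.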

Solve a real life problem
Run batch predictions for proteins and load the results in Artemis
The problem
– Predictions are run for each protein in turn
– Coordinates (motifs, domains, similarity) are relative to the protein sequence
– Artemis expects coordinates relative to the DNA sequence (chromosome, contig)
Perl to the rescue

Getting there... step by step
Extract all proteins
– FASTA file
Run your predictions/searches
– Either on the web
– Or at the command line
Have the output of all predictions in tab-delimited files
– Pre-process the files in a spreadsheet (Excel/OpenOffice) so that they're all in the same format
– Same number of columns
– Same order of the columns
– Export as a text file (TXT, comma- or tab-delimited)

Getting there... step by step
What is our plan?
– Open the tab-delimited file
– Convert the coordinates
A more detailed, step by step list of actions
– Open the FASTA file, read the DNA coordinates for each protein (store coordinates for later use)
– Open the tab-delimited file containing the predictions
– For each protein
  Look up the DNA coordinates
  Add the DNA start coordinate to the protein coordinates
  E.g. 8900 + 10, 8900 + 70
[Diagram: DNA from 1 to 10000, with a protein located at 8900 to 9200; a feature spans positions 10 to 70 of the protein]

Writing our script
    #!/usr/bin/perl
    # Wellcome Trust Course Exercise 1, Thursday, Nov 20th, 2008
    # version 1
    open(FASTA, "/path/to/my/fasta/file") or die "Arrggghhh!";
    while ( $line = <FASTA> ) {
        # header lines look like: >id description start:end
        # \s before the first number keeps the greedy .* from eating its digits
        if ( $line =~ /^>(\S+).*\s(\d+):(\d+)/ ) {
            $sequence = $1;
            $start = $2;
            $end = $3;
            $hash{$sequence}{start} = $start;
            $hash{$sequence}{end} = $end;
        }
    }
The resulting data structure is a hash of hashes:
    %hash = (
        CDS1 => { start => 1224, end => 1560 },
        CDS2 => { start => 1689, end => 2553 },
        CDS3...
    );
– Get a value: $var = $hash{CDS1}{end};
– Set a value: $hash{CDS1}{start} = 1224;
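The remaining step of the plan, adding the DNA start to each feature coordinate, might look like the sketch below. It is an illustration under stated assumptions: the helper name to_dna is hypothetical, the headers are the two from the example FASTA file, and the arithmetic is the simple sum used on the plan slide (DNA start + position). Converting amino-acid positions to nucleotides (three bases per residue) and handling genes on the reverse strand are deliberately left out.

```perl
use strict;
use warnings;

# Headers as in the example multiple FASTA file
my @headers = (
    '>seq1 blah blah blah 1877:2980',
    '>CDS CDS blah blah blah 2856:3245',
);

# Store the DNA coordinates of each sequence in a hash of hashes
my %coords;
foreach my $line (@headers) {
    if ( $line =~ /^>(\S+).*\s(\d+):(\d+)/ ) {
        $coords{$1} = { start => $2, end => $3 };
    }
}

# Hypothetical helper: shift a feature's start and end by the DNA start
sub to_dna {
    my ($id, $from, $to) = @_;
    my $offset = $coords{$id}{start};
    return ($offset + $from, $offset + $to);
}

# A feature at positions 10 to 70 of seq1 (invented prediction)
my ($dna_from, $dna_to) = to_dna('seq1', 10, 70);
print "seq1 feature: $dna_from..$dna_to\n";
```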

Getting help
Read the Perl manual
– 'man perl' (overview and links to other parts of the manual)
– 'man perlfunc' (Perl built-in functions, e.g. split, join, chomp)
– 'man perlop' (Perl built-in operators, e.g. + && )
– 'man perlre' (Perl regular expressions)
Perldoc
– 'perldoc -f chomp'
  Get documentation for a built-in function (in this case, chomp)
– This also includes external and/or third-party modules
– 'perldoc Bio::AlignIO'
– 'perldoc Bio::SeqIO'
– 'perldoc Bio::Seq::LargeSeq'

Further Reading
Mastering Perl for Bioinformatics
– James Tisdall
– O'Reilly and Associates, 2003
Learning Perl
– Randal Schwartz, Tom Phoenix, brian d foy
– O'Reilly and Associates, 5th Edition, 2008
Intermediate Perl
– Randal Schwartz, brian d foy, with Tom Phoenix
– O'Reilly Media, 2006
Mastering Regular Expressions
– Jeffrey Friedl
– O'Reilly and Associates, 1997
CPAN, the Comprehensive Perl Archive Network
– Your one stop shop for everything Perl
  Modules, frameworks
  http://search.cpan.org
