Lecture 6.1: An Introduction to Perl for Bioinformatics – Part 2
Will Hsiao
Simon Fraser University, Department of Molecular Biology and Biochemistry
Outline
Session 1
  – Review of the previous day
  – Perl: historical perspective
  – Expand on Regular Expressions
  – General use of Perl
  – Expand on Perl functions and introduce modules
  – Interactive demo on modules
Break
Session 2
  – Use of Perl in bioinformatics
  – Object-oriented Perl
  – Bioperl overview
  – Interactive demo on Bioperl
  – Introduction to the Perl assignment
Perl in Bioinformatics
Case in point 1: Human Genome data exchange
  – "How Perl saved the Human Genome Project", Lincoln Stein (1996)
Different sequencing centres all used different data formats
Perl allowed the various genome centres to exchange and communicate data with each other
The article introduces a project to produce modules that process all known forms of biological data (Bioperl)
Perl in Bioinformatics
Case in point 2: Ensembl
  – Much of Ensembl is written in Perl
  – Ensembl has an extensive Perl API that allows you to access the Ensembl database directly from your Perl code
Case in point 3: GMOD (Generic Model Organism Database)
  – A joint effort by model organism system databases (worm, fly, corn, rat, yeast, E. coli, Arabidopsis, rice) to develop reusable components that can be adapted for other biological databases
  – Written mostly in Java and Perl
Bioinformatics Spectrum
[Figure: a spectrum running through Math, Biology, Computer Science, and Software/data analysis, with Perl, Java, and C/C++ placed along it]
Perl for bioinformatics in your lab
Scripting
  – Automate repetitive analyses
  – Parse results obtained from other programs
Wrapping
  – Access other programs (e.g. BLAST) through Perl
Web CGI
  – Develop an interactive web page for your lab
  – Create web forms
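As a minimal sketch of "wrapping", the script below runs an external program from Perl and parses its output line by line. To stay self-contained it launches a second Perl interpreter ($^X) as a stand-in for a real tool such as BLAST; the tab-separated "hit" lines and the names geneA and geneB are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for an external tool: a second perl process that prints two
# made-up, tab-separated "hit" lines (name, score).
my @command = ($^X, '-e', 'print "geneA\t98\n", "geneB\t75\n";');

# Open a pipe that reads the command's standard output (no shell involved).
open(my $pipe, '-|', @command) or die "Cannot run command: $!";

# Parse each output line into a name and a score.
my %score_for;
while (my $line = <$pipe>) {
    chomp $line;
    my ($name, $score) = split /\t/, $line;
    $score_for{$name} = $score;
}
close($pipe) or die "Command failed";

# Report the parsed results.
for my $name (sort keys %score_for) {
    print "$name scored $score_for{$name}\n";
}
```

With a real tool, only the @command list and the parsing inside the loop would change; the overall run-then-parse pattern stays the same.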
Bioperl Overview
The Bioperl project
  – Comprehensive, well documented set of Perl modules
  – Last stable release (developer 1.5.1)
  – A bioinformatics toolkit for:
      format conversion
      report processing
      data manipulation
      sequence analyses
      and more!
  – Written in object-oriented Perl
What are objects?
Examples of objects in real life:
  – Cars, dogs, dishwashers…
Objects have ATTRIBUTES and ACTIONS
Some attributes of a dog:
  – Color of fur
  – Height
  – Owner's name
  – Weight
  – Tail position
Some actions of a dog:
  – Bark
  – Walk
  – Run
  – Eat
  – Wag tail
What are programming objects?
The idea is borrowed from real-life objects
Attributes are stored as variables
  – $fur_color, $weight, $tail_position
Actions are implemented as functions
  – sub dye_fur{ }, sub eat{ }, sub wag_tail{ }
[Figure: a program Dog object containing the variables and functions above]
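The program Dog object above can be sketched in plain Perl as follows. This is one common way to write it, not the only one: the attributes live as keys of a hash rather than as separate scalar variables, and the call to bless (which ties the hash to the Dog package) relies on references, which are covered later in this lecture. The starting values are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Dog;

# Constructor: stores the attributes in a hash and turns it into a Dog object.
sub new {
    my ($class, %args) = @_;
    my $self = {
        fur_color     => $args{fur_color},
        weight        => $args{weight},
        tail_position => 'down',
    };
    return bless $self, $class;
}

# Actions are ordinary subroutines that receive the object as first argument.
sub dye_fur {
    my ($self, $new_color) = @_;
    $self->{fur_color} = $new_color;
    return $self->{fur_color};
}

sub eat {
    my ($self, $kilograms) = @_;
    $self->{weight} += $kilograms;
    return $self->{weight};
}

sub wag_tail {
    my ($self) = @_;
    $self->{tail_position} = 'wagging';
    return $self->{tail_position};
}

package main;

# Create a dog, then change its attributes by calling its actions.
my $dog = Dog->new(fur_color => 'brown', weight => 20);
$dog->dye_fur('black');
$dog->eat(0.5);
$dog->wag_tail();
print "Fur: $dog->{fur_color}, weight: $dog->{weight} kg, tail: $dog->{tail_position}\n";
```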
Object Exercise
Pair up with your neighbours (2-3 people)
In the next 2-3 minutes, come up with as many attributes and actions (aka methods) of a DNA sequence object as you can
  – E.g. attributes of a DNA sequence object: $length=300, $percent_GC=50%
  – E.g. methods of a DNA sequence object: translate_to_protein, remove_polyA_tail
Share with the class
Objects belong to Classes
If we take all your suggestions and design a generic template, we can then use this template to create objects
This template is called a Class
An "instance of a class" is called an object
[Figure: one DNA Sequence Class producing DNA sequence objects 1, 2, 3 and 4]
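A minimal sketch of the class/instance idea, assuming a hypothetical DNASequence class with two of the attributes suggested in the exercise (length and percent GC). The class is written once; each call to new produces a separate object with its own data. The method names seq_length and percent_gc are invented for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package DNASequence;

# The class is the template: it says what every DNA sequence object holds.
sub new {
    my ($class, $bases) = @_;
    my $self = { bases => uc($bases) };
    return bless $self, $class;
}

# Number of bases in this particular sequence.
sub seq_length {
    my ($self) = @_;
    return length($self->{bases});
}

# Percentage of bases that are G or C (0 for an empty sequence).
sub percent_gc {
    my ($self) = @_;
    my $total = $self->seq_length();
    return 0 if $total == 0;
    my $gc_count = ($self->{bases} =~ tr/GC//);   # tr counts without changing the string
    return 100 * $gc_count / $total;
}

package main;

# Two instances (objects) created from the same class.
my $seq1 = DNASequence->new('ATGC');
my $seq2 = DNASequence->new('ggccat');

for my $seq ($seq1, $seq2) {
    printf "%s: length %d, GC %.1f%%\n",
        $seq->{bases}, $seq->seq_length(), $seq->percent_gc();
}
```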
How do we interact with an object?
Polo is the name of my dog
We have to refer to an object by its name
[Figure: a dog named POLO saying "WOOF"]
Interact with a program object
$Polo is the name of a program dog object
[Figure: $Polo pointing to a program Dog object that holds $fur_color, $weight, $tail_position and sub dye_fur{ }, sub eat{ }, sub wag_tail{ }; the object answers "WOOF"]
A name is a reference
Objects have unique names (labels)
You refer to an object by its unique name
The unique name that you give to an object is called a "reference"
Reference in Perl
A reference is a scalar (simple) variable that refers to a chunk of memory
Stored in that memory can be another variable or an object
[Figure: in "My Program", $array_ref holds the address (1234) of a location in "Memory"]
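A short sketch of a reference to an array, using made-up read lengths. The reference itself is a single scalar; the array it points to is reached through the arrow operator or by dereferencing with @{ }.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @read_lengths = (120, 300, 450);

# The backslash takes a reference: one scalar that points at the whole array.
my $array_ref = \@read_lengths;

# Printing the reference shows its type and a memory address, e.g. ARRAY(0x...).
print "The reference itself: $array_ref\n";

# Follow the reference to reach the data it points to.
print "First element: $array_ref->[0]\n";
print "Number of elements: ", scalar(@{$array_ref}), "\n";

# Changing the data through the reference changes the original array.
push @{$array_ref}, 600;
print "Original array now has ", scalar(@read_lengths), " elements\n";
```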
Reference to an object
A protein object (in memory):
  – Attributes: $var{SwissProt_ID}, $var{name}, $var{length}, $var{source}, $var{%domain_location}
  – Methods: sub new{…}, sub return_ID{…}, sub get_domain{…}
$my_protein (in my program) is called a "reference" to an object (in this case a protein object)
To access the attributes and methods of the protein object, you have to go through its reference (i.e. $my_protein)
Objects have inherent functions that are useful
These inherent functions also have specific names
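The protein object above could be sketched like this. The attribute and method names follow the slide; the Protein package, the ID "P12345", the protein name, and the domain coordinates are all invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Protein;

# Constructor: the attributes are stored in a hash, and bless turns the
# hash reference into a reference to a Protein object.
sub new {
    my ($class, %args) = @_;
    my $self = {
        SwissProt_ID    => $args{SwissProt_ID},
        name            => $args{name},
        length          => $args{length},
        source          => $args{source},
        domain_location => $args{domain_location} || {},
    };
    return bless $self, $class;
}

sub return_ID {
    my ($self) = @_;
    return $self->{SwissProt_ID};
}

# Returns a reference to [start, end] for the named domain, or undef.
sub get_domain {
    my ($self, $domain_name) = @_;
    return $self->{domain_location}{$domain_name};
}

package main;

# $my_protein is a reference: everything goes through it with ->
my $my_protein = Protein->new(
    SwissProt_ID    => 'P12345',               # made-up ID
    name            => 'example kinase',
    length          => 300,
    source          => 'E. coli',
    domain_location => { kinase => [10, 250] },
);

# Copying the reference does not copy the object; both names refer to one object.
my $same_protein = $my_protein;

my ($start, $end) = @{ $my_protein->get_domain('kinase') };
print "ID: ", $same_protein->return_ID(), ", kinase domain: $start-$end\n";
```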
Object Oriented Programming
What is O-O Programming?
  – Simple answer: a way to organize code so that it interacts in certain ways and follows certain rules
  – Long answer: to be found in books on O-O
Why O-O Programming?
  – Provides a well-defined framework
  – Promotes good practices such as code reuse, abstraction, and cleaner design
  – Has trade-offs (e.g. O-O Perl is usually slower than procedural Perl)
  – Designing good object classes requires forethought and skill
To use an object
1. Find out which class you need and learn about the class by reading its documentation
2. Make the class available to your program
3. Create a new object of the class
4. Start using the object by modifying its attributes and calling its methods
Example of using objects
Task:
  – I have a sequence file in GenBank format that I want to convert to EMBL format
How many objects do you think we need to accomplish the task above?
Find the Objects you need
Objects that we need:
1. an object that reads in sequences from a file
2. an object that represents a sequence record
3. an object that writes sequences to a file
[Figure: in memory, a Sequence File Input Object reads the GenBank file into a Sequence Object, and a Sequence File Output Object writes it out as EMBL]
Example of using objects
Solution:
  – I remember that Bioperl provides this functionality, so first I'll take a look at the Bioperl documentation
  – Website:
Bioperl Documentation demo
Go to the webpage and navigate to the SeqIO documentation
Pay attention to:
1) the name of the module
2) the synopsis (code examples)
3) the description
4) the list of methods
Make the object class available
In Perl, classes are implemented as object-oriented modules
To include a class, simply use the module
  – E.g. use Bio::SeqIO;
Note that the name of the module is case sensitive
By using Bio::SeqIO, my program automatically gains access to any modules included in Bio::SeqIO
Create an object
1. Make up a name for my object reference (e.g. $seq_in)
2. Create the object by calling the object class's "new" method
  – Every class has a "constructor" method to create an object of that class
  – The constructor method is often called "new"
  – Use the single arrow operator (->) to call methods
3. Assign the object to the object reference
4. You can give the object you are about to create some initial attributes (e.g. the file name of my sequence record, the format of the record)

    my $seq_in = Bio::SeqIO->new( -file   => "myGBrecord",
                                  -format => "genbank" );
Call an object's methods
We've seen the -> (single arrow) operator for calling a class method (e.g. new)
The same operator is used for calling an object method
  – E.g. to ask the $seq_in object to get a sequence record from your GenBank sequence file:

    my $seq_record = $seq_in->next_seq();
Putting it all together

    #!/usr/bin/perl -w
    use strict;
    # Make the Bio::SeqIO class available to my program
    use Bio::SeqIO;

    # Create a new Bio::SeqIO object and initialize some attributes
    my $seq_in  = Bio::SeqIO->new( -file   => "myGBrecord",
                                   -format => "genbank" );
    # The leading ">" opens the file for writing
    my $seq_out = Bio::SeqIO->new( -file   => ">myEMBLrec",
                                   -format => 'EMBL' );

    # $seq_record is a sequence object
    my $seq_record = $seq_in->next_seq();
    $seq_out->write_seq($seq_record);
More Bioperl modules
Bio::SeqIO: Sequence Input/Output
  – Retrieve sequence records and write them to files
  – Convert sequence records from one format to another
Bio::Seq: Manipulating sequences
  – Get subsequences ( $seq->subseq($start, $end) )
  – Find the length of the object ( $seq->length )
  – Reverse complement a DNA sequence
  – Translate a DNA sequence… etc.
Bio::Annotation: Annotate a sequence
  – Assign journal references to a sequence, etc.
  – Bio::Annotation is associated with an entire sequence record and not just part of a sequence (see also Bio::SeqFeature)
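To make the Bio::Seq operations concrete, here is a plain-Perl sketch of what taking a subsequence and reverse complementing compute. It does not use Bioperl, and it is not how Bioperl implements these methods; the helper names subseq and reverse_complement and the example sequence are made up, and only unambiguous A/C/G/T bases are handled. The helper uses 1-based, inclusive coordinates, which is the convention Bioperl's subseq follows.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the bases from $start to $end (1-based, inclusive).
sub subseq {
    my ($seq, $start, $end) = @_;
    return substr($seq, $start - 1, $end - $start + 1);
}

# Reverse the sequence, then swap each base for its complement.
sub reverse_complement {
    my ($seq) = @_;
    my $revcom = scalar reverse $seq;
    $revcom =~ tr/ACGTacgt/TGCAtgca/;
    return $revcom;
}

my $dna = 'ATGGCCTTTAAG';   # made-up example sequence

print "Length:             ", length($dna), "\n";
print "Bases 1-3:          ", subseq($dna, 1, 3), "\n";
print "Reverse complement: ", reverse_complement($dna), "\n";
```

With Bioperl installed, the equivalent calls are made on a sequence object instead of on a plain string.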
Some more Bioperl modules
Bio::SeqFeature: Associate feature annotation with a sequence
  – "Features" describe specific locations in the sequence
  – E.g. 5' UTR, 3' UTR, CDS, SNP, etc.
  – Using this object, you can add feature annotations to your sequences
  – When you parse a GenBank file using Bioperl, the "features" of a record are stored as SeqFeature objects
Bio::DB::GenBank, GenPept, EMBL and Swissprot: Remote Database Access
  – You can retrieve a sequence from remote databases (through the Internet) using these objects
Even more Bioperl modules
Bio::SearchIO: Parse sequence database search reports
  – Parse BLAST reports (make custom reports)
  – Parse HMMER, FASTA, SIM4, WABA, etc.
  – Custom reports can be output in various formats (HTML, table, etc.)
Bio::Tools::Run::StandAloneBLAST: Run standalone BLAST through Perl
  – By combining this with SearchIO, you can automate and customize BLAST searches
Bio::Graphics: Draw biological entities (e.g. a gene, an exon, BLAST alignments, etc.)
Bioperl Summary
For online documentation:
  – For this workshop:
  – Tutorial:
  – HOWTOs:
  – Modules:
Literature:
  – Stajich et al., The Bioperl toolkit: Perl modules for the life sciences. Genome Res Oct;12(10): PMID:
Bioperl mailing list:
  – Best way to get help using Bioperl
  – Very active list (upwards of 10 messages a day)
Use with caution: things change fast and without warning (unless you are on the mailing list…)
Interactive demo on Bioperl
Open your laptop!
Open a terminal window
Type cd ~/perl_two
Type gedit ./bioperl_demo.pl &
Let's go over the example together
Summary for Session 2
Perl is a popular language in bioinformatics because:
  – It handles text well
  – It has a great user base and support (e.g. Bioperl)
Bioperl is a large collection of object-oriented Perl modules for many biological data analyses
An object is a collection of attributes and methods
You have to access an object through its reference
A reference is a name
Perl Documents
In-line documentation
  – POD = Plain Old Documentation
  – Read POD by typing perldoc
  – E.g. perldoc perl, perldoc Bio::SeqIO
On-line documentation
Books
  – Learning Perl (the best way to learn Perl if you already know a bit about programming)
  – Beginning Perl for Bioinformatics (example-based way to learn Perl for bioinformatics)
  – Programming Perl (THE Perl reference book; not for the faint of heart)
Additional Book References
Perl Cookbook, 2nd edition (quick solutions to 80% of what you want to do)
Learning Perl Objects, References & Modules (for people who want to learn objects, references and modules in Perl)
Perl in a Nutshell (an okay quick reference)
Perl CD Bookshelf, Version 4.0 (electronic version of the above books; best value, searchable, and kills fewer trees)
Mastering Perl for Bioinformatics (more example-based learning)
CGI Programming with Perl (rather outdated treatment of the subject; not really recommended)
Perl Graphics Programming (if you want to generate graphics using Perl; side note: Perl is probably not the best tool for generating graphics)
Introduction to the Assignment Part A
Goals:
  – To convert passive knowledge to active skills
  – To write some simple Perl programs by yourself
Consists of 2 modules
  – Write a program to convert a temperature from F to C
  – Write a program to count the frequencies of bases in a sequence (sequence MAN1.fasta can be downloaded from the Day6 wiki)
Introduction to the Assignment Part B
Goals:
  – To see the power of Perl in bioinformatics
  – To see how some common bioinformatics tasks are done using Perl
Consists of 3 modules
  – Download E. coli O157:H7 proteins in FASTA format
  – Use a Regular Expression to find a protein motif
  – Run BLAST on all proteins in the proteome (>5000 BLAST runs)
Introduction to the Assignment Part B
Most of the code is given to you; you just have to modify it (in total, no more than 15 lines of new code!)
You are not expected to know everything in the scripts. It takes time to learn a new language
TAs and your CS team mates will help you; don't wait until the last minute to ask for help
Remember, you still have to hand in your own version of the assignment! No copying!
Acknowledgements
Thanks to Sohrab Shah and Sanja Rojic (CS, UBC) for a wonderful collaboration on the lecture/lab material
Some ideas in this lecture are borrowed from Lincoln Stein's workshop (