BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu
Introduction Objective of BioPerl: Develop reusable, extensible core Perl modules for use as a standard for manipulating molecular biological data. Background: Started in 1995 One of the oldest open source Bioinformatics Toolkit Project
So what is BioPerl? Higher level of abstraction Re-usable collection of Perl modules that facilitate bioinformatics application development: Accessing databases with different formats Sequence manipulation Execution and Parsing of the results of molecular biology programs Catch? BioPerl does not include programs like Blast, ClustalW, etc Uses system calls to execute external programs
So what is BioPerl? (continued…) 551 modules (incl. 82 interface modules) 37 module groups 79,582 lines of code (223,310 lines total) 144 lines of code per module For More info: BioPerl Module ListingBioPerl Module Listing
Major Areas covered in Bioperl Sequences, features, annotations, Pairwise alignment reports Multiple sequence alignments Bibliographic data Graphical rendering of sequence tracks Database for features and sequences
Additional things Gene prediction parsers Trees, parsing phylogenetic and molecular evolution software output Population genetic data and summary statistics Taxonomy Protein Structure
Downloading modules Modules can be obtained from: (Perl Modules) (BioPerl Modules) Downloading modules from CPAN Interactive mode perl -MCPAN -e shell Batch mode use CPAN; clean, install, make, recompile, test
Directory Structure BioPerl directory structure organization: Bio/ BioPerl modules models/ UML for BioPerl classes t/ Perl built-in tests t/data/ Data files used for the tests scripts/ Reusable scripts that use BioPerl scripts/contributed/ Contributed scripts not necessarily integrated into BioPerl. doc/ "How To" files and the FAQ as XML
Parsing Sequences Bio::SeqIO multiple drivers: genbank, embl, fasta,... Sequence objects Bio::PrimarySeq Bio::Seq Bio::Seq::RichSeq
Sequence Object Creation Sequence Creation : $sequence = Bio::Seq->new( -seq => ‘AATGCAA’ -display_id => ‘my_sequence’); Flat File Format Support : Raw, FASTA, GCG, GenBank, EMBL, PIR Via ReadSeq: IG, NBRF, DnaStrider, Fitch, Phylip, MSF, PAUP
Sequence object Common (Bio::PrimarySeq) methods seq() - get the sequence as a string length() - get the sequence length subseq($s,$e) - get a subseqeunce translate(...) - translate to protein [DNA] revcom() - reverse complement [DNA] display_id() - identifier string description() - description string
Sequence Types Different Sequence Objects: Seq – Some annotations RichSeq – Additional annotations PrimarySeq – Bare minimum annotation ( id, accession number, alphabet) LocatableSeq – Start, stop and gap information also LargeSeq – Very long sequences LiveSeq – Newly sequenced genomes
Using a sequence use Bio::PrimarySeq; my $str = “ATGAATGATGAA”; my $seq = Bio::PrimarySeq->new(-seq => $str, -display_id=>”example”); print “id is “, $seq->display_id,”\n”; print $seq->seq, “\n”; my $revcom = $seq->revcom; print $revcom->seq, “\n”; print “frame1=”,$seq->translate->seq,“\n”; id is example ATGAATGATGAA TTCATCATTCAT trans frame1=MNDE
Accessing remote databases $gb = new Bio::DB::GenBank(); $seq1 = $gb->get_Seq_by_id('MUSIGHBA1'); $seq2 = $gb->get_Seq_by_acc('AF303112'); $seqio = $gb-> get_Stream_by_id(["J00522","AF303112"," "]);
Sequence – Accession numbers # Get a sequence from RefSeq by accession number use Bio::DB::RefSeq; $gb = new Bio::DB::RefSeq; $seq = $gb->get_Seq_by_acc(“NM_007304”); print $seq->seq();
Reading and Writing Sequences Bio::SeqIO fasta, genbank, embl, swissprot,... Takes care of writing out associated features and annotations Two functions next_seq (reading sequences) write_seq (writing sequences)
Writing a Sequence use Bio::SeqIO; # Let’s convert swissprot to fasta format my $in = Bio::SeqIO->new(-format => ‘swiss’, -file => ‘file.sp’); my $out = Bio::SeqIO->new(-format => ‘fasta’, -file => ‘>file.fa’);` while( my $seq = $in->next_seq ) { $out->write_seq($seq); }
Manipulating sequence data with Seq methods Allows the easy manipulation of bioinformatics data Specific parts of various annotated formats can be selected and rearranged. Unwanted information can be voided out of reports Important information can be highlighted, processed, stored in arrays for graphs/charts/etc with relative ease Information can be added and subtracted in a flash
The Code #!/usr/local/bin/perl use Bio::Seq; use Bio::SeqIO; my $seqin = Bio::SeqIO->new('-file' => "genes.fasta", '-format' =>'Fasta'); my $seqobj = $seqin->next_seq(); my $seq = $seqobj->seq(),"\n"; #plain sequence print ">",$seqobj->display_id()," Description: ",$seqobj->desc(), " Alphabet: ",$seqobj->alphabet(),"\n"; $seq =~ s/(.{60})/$1\n/g; # convert to 60 char lines print $seq,"\n";
Before
After
Obtaining basic sequence statistics- molecular weights, residue & codon frequencies (SeqStats, SeqWord) Molecular Weight Monomer Counter Codon Counter DNA weights RNA weights Amino Weights More
The Code #!/usr/local/bin/perl use Bio::PrimarySeq; use Bio::Tools::SeqStats; my $seqobj = new Bio::PrimarySeq(-seq => 'ATCGTAGCTAGCTGA', -display_id => 'example1'); $seq_stats = Bio::Tools::SeqStats->new(-seq=>$seqobj); $hash_ref = $seq_stats->count_monomers(); foreach $base (sort keys %$hash_ref) { print "Number of bases of type ", $base, "= ",%$hash_ref- >{$base},"\n"; }
The Results
More Code use SeqStats; $seq_stats = Bio::Tools::SeqStats->new($seqobj); $weight = $seq_stats->get_mol_wt(); -returns the molecular weight $monomer_ref = $seq_stats->count_monomers(); -counts the number of monomers $codon_ref = $seq_stats->count_codons(); # for nucleic acid sequence -counts the number of codons
Monomer
Why the Large and The Small MW? Note that since sequences may contain ambiguous monomers (eg "M" meaning "A" or "C" in a nucleic acid sequence), the method get_mol_wt returns a two-element array containing the greatest lower bound and least upper bound of the molecule. (For a sequence with no ambiguous monomers, the two elements of the returned array will be equal.)
Identifying restriction enzyme sites (Restriction Enzyme) Bioperl's standard RestrictionEnzyme object comes with data for more than 150 different restriction enzymes. To select all available enzymes with cutting patterns that are six bases long: $re = new = $re->available_list(6); sites for that enzyme on a given nucleic acid sequence can be obtained using $re1 = new Bio::Tools::RestrictionEnzyme(-name=>'EcoRI'); # $seqobj is the Seq object for the dna to be = $re1- >cut_seq($seqobj);
Identifying restriction enzyme sites (Restriction Enzyme) (more) Adding an enzyme not in the default list is easily as this: $re2 = new Bio::Tools::RestrictionEnzyme('-NAME' =>'EcoRV-- GAT^ATC', '-MAKE' =>'custom');
Manipulating sequence alignments Bioperl offers several perl objects to facilitate sequence alignment: pSW (Smith-Waterman) Clustalw.pm TCoffee.pm bl2seq option of StandAloneBlast.
Manipulating Alignments Some of the manipulations possible with SimpleAlign include: slice(): Obtaining an alignment ``slice'', that is, a subalignment inclusive of specified start and end columns. column_from_residue_number(): Finding column in an alignment where a specified residue of a specified sequence is located. consensus_string(): Making a consensus string. This method includes an optional threshold parameter, so that positions in the alignment with lower percent-identity than the threshold are marked by ``?'''s in the consensus percentage_identity(): A fast method for calculating the average percentage identity of the alignment consensus_iupac(): Making a consensus using IUPAC ambiguity codes from DNA and RNA.
The Code use Bio::SimpleAlign; $aln = Bio::SimpleAlign->new('t/data/testaln.fasta'); $threshold_percent = 60; $consensus_with_threshold = $aln- >consensus_string($threshold_percent); $iupac_consensus = $aln->consensus_iupac(); # dna/rna alignments only $percent_ident = $aln->percentage_identity; $seqname = 'AKH_HAEIN'; $pos = $aln- >column_from_residue_number($seqname, 14);
Searching for Sequence Similarity BLAST with BioPerl Parsing Blast and FASTA Reports Search and SearchIO BPLite, BPpsilite, BPbl2seq Parsing HMM Reports Standalone BioPerl BLAST
Remote Execution of BLAST BioPerl has built in capability of running BLAST jobs remotely using RemoteBlast.pm Runs these jobs at NCBI automatically NCBI has dynamic configurations (server side) to “always” be up and ready Automatically updated for new BioPerl Releases Convenient for independent researchers who do not have access to huge computing resources Quick submission of Blast jobs without tying up local resources (especially if working from standalone workstation) Legal Restrictions!!!
Example of Remote Blast $remote_blast = Bio::Tools::Run::RemoteBlast->new( '- prog' => 'blastp','-data' => 'ecoli','-expect' => '1e-10' ); $r = $remote_blast->submit_blast("t/data/ecolitst.fa"); while = $remote_blast->each_rid ) { foreach $rid ) { $rc = $remote_blast->retrieve_blast($rid); }
Sample Script to Read and Parse BLAST Report # Get the report $searchio = new Bio::SearchIO (-format => 'blast', -file => $blast_report); $result = $searchio->next_result; # Get info about the entire report $algorithm_type = $result->algorithm; # get info about the first hit $hit = $result->next_hit; $hit_name = $hit->name ; # get info about the first hsp of the first hit $hsp = $hit->next_hsp; $hsp_start = $hsp->query->start;
Running BLAST Locally StandAloneBlast Bio::Tools::Run::StandAloneBlast Factory = ('program' => 'blastn', 'database' => 'ecoli.nt'); $factory = Bio::Tools::Run::StandAloneBlast- Advantages: Private Use Customized Local Resources Avoid Network Problems
Examples # Setting parameters similar to RemoteBlast $input = Bio::Seq->new(-id =>"test query", -seq =>"ACTAAGTGGGGG"); $blast_report = $factory->blastall($input); # Blast Report Object that directly accesses parser while (my $sbjct = $blast_report->next_hit){ while (my $hsp = $sbjct->next_hsp){ print $hsp->score. " ". $hsp->subject- >seqname. "\n"; } }
Format Conversion – Sequences Example Use: Bio::SeqIO Core Code: $in = Bio::SeqIO->new('-file' => "COG0001", '-format' => 'Fasta'); $out = Bio::SeqIO->new('-file' => ">COG0001.gen", '-format' => 'genbank'); while ( my $seq = $in->next_seq() ) { $out->write_seq($seq); }
Format Conversion – Alignments Alignment formats supported: INPUT: fasta, selex (HMMER), bl2seq, clustalw (.aln), msf (GCG), psi (PSI-BLAST), mase (Seaview), stockholm, prodom, water, phylip (interleaved), nexus, mega, meme OUTPUT: fasta, clustalw, mase, selex, msf/gcg, and phylip (interleaved). Next_aln( ) and write_aln( ) methods of the ‘Bio::AlignIO’ object are used
ClustalW and Profile Align ClustalW using BioPerl Clustalw program should be installed and environment variable ‘CLUSTALDIR’ set Setting Parameters – Build a factory Some parameters: 'ktuple', 'matrix', 'outfile', 'quiet‘ Need reference to sequence array object (See example) Align( ) and Profile_align( ) methods used
ClustalW – Example Use Bio::SeqIO, Bio::Tools::Run::Alignment::Clustalw Core code (Simple = ('ktuple' => 2, 'matrix' => 'BLOSUM', 'outfile' => 'clustalw_out', 'quiet' => 1); $factory = Bio::Tools::Run::Alignment:: $seq_array_ref = $aln= $factory->align($seq_array_ref);
Smith Waterman Search Smith Waterman pairwise alignment Standard method for producing an optimal local alignment of two sequences Auxilliary Bioperl-ext library required SW algorithm implemented in C and incorporated into bioperl Align_and_show() & Pairwise_alignment() in Bio::Tools::pSW module are methods used
Smith Waterman Search – Example Use Bio::Tools::pSW, Bio::SeqIO, Bio::AlignIO Core code: $factory = new Bio::Tools::pSW( '-matrix' => 'BLOSUM62', '-gap' => 12, '-ext' => 2); $aln = $factory->pairwise_alignment($seq_array[0], $seq_array[1]); my $alnout = new Bio::AlignIO(-format => 'msf', -fh => \*STDOUT); $alnout->write_aln($aln);
Smith Waterman Search AlignIO object in previous slide – could also be used to print into a file Use double loop to do all pairwise comparisons More Info: Bio::Tools::pSW mapageBio::Tools::pSW mapage