INTRODUCTION TO BIOPERL
BioPerl is …
- A set of Perl modules for manipulating genomic and other biological data
- An open source toolkit with many contributors
- A flexible and extensible system for bioinformatics data manipulation

Some things we can do
- Read sequence data from a file in standard formats (FASTA, GenBank, EMBL, SwissProt, ...)
- Convert between file formats (sequence and alignment)
- Manipulate sequences: reverse complement, translate a coding DNA sequence to protein
- Parse a BLAST-like report and get access to every bit of data in the report

Sequence file formats
- Simple formats, without features
  - FASTA
- Rich formats, with features and annotations
  - EMBL, GenBank, GFF3
  - SwissProt, GenPept
  - TIGRXML, BSML, InterPro (XML)
Building a sequence

    #!/usr/bin/perl -w
    use strict;
    use Bio::Seq;

    # Create a sequence object from a raw string and an identifier
    my $seq = Bio::Seq->new(
        -seq        => 'ATGGGACCAAGTA',
        -display_id => 'example1',
    );

    print "Sequence name is ",   $seq->display_id,  "\n";
    print "Sequence length is ", $seq->length,      "\n";
    print "Sub-sequence is ",    $seq->subseq(1,3), "\n";

Output:

    % perl
    Sequence name is example1
    Sequence length is 13
    Sub-sequence is ATG
Bio::PrimarySeq: primary information

    Method                      Description
    $seq->seq                   Get/set the sequence string
    $seq->display_id            Get/set the sequence identifier string
    $seq->desc                  Get/set the description string
    $seq->length                Return the length of the sequence
    $seq->subseq(start,end)     Get a sub-sequence as a string
    $seq->trunc(start,end)      Get a sub-sequence as an object
    $seq->revcom                Get the reverse complement (DNA only)
    $seq->translate             Get the protein translation (DNA only)
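The methods in the table can be combined as in the short sketch below. It reuses the example sequence from the previous slide; the -alphabet argument and the choice of coordinates (1 to 9, so that the fragment holds three whole codons) are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;

# Same toy sequence as in "Building a sequence"; alphabet set explicitly
my $seq = Bio::Seq->new(
    -seq        => 'ATGGGACCAAGTA',
    -display_id => 'example1',
    -alphabet   => 'dna',
);

# revcom returns a new sequence object, not a string
my $rc = $seq->revcom;
print "Reverse complement: ", $rc->seq, "\n";    # TACTTGGTCCCAT

# trunc returns an object, so further methods can be called on the result
my $cds = $seq->trunc(1, 9);                     # ATGGGACCA, three codons
print "Fragment: ", $cds->seq, "\n";

# translate also returns an object; its seq is the protein string
my $protein = $cds->translate;
print "Translation: ", $protein->seq, "\n";      # MGP
```

Note the difference between subseq, which returns a plain string, and trunc, which returns an object that still supports revcom and translate.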
Rich formats
- Taxonomic information
- Bibliographic references
- Features (with location) + annotations
- Sequence data
- Primary information
Features & Annotations
- Derived from the GFF format
GFF format
- "Generic Feature Format"
- Tab-delimited format
- 9 columns: sequence_id, source, primary_tag, start, stop, score, strand, frame, description
- Different versions of GFF (GFF1, GFF2 and GFF3)
  - The variation is in how the description column is formatted
- For GFF3, 'primary_tag' column values must be terms from the Sequence Ontology
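To make the nine-column layout concrete, here is a minimal sketch in plain Perl (no BioPerl needed) that splits one GFF3 line into its columns. The sub name parse_gff3_line, the hash keys and the sample line are all invented for illustration, and the sketch does not undo the URL-style escaping that GFF3 permits in the last column.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one GFF3 data line into its nine tab-delimited columns.
# (Hypothetical helper, not part of BioPerl.)
sub parse_gff3_line {
    my ($line) = @_;
    chomp $line;
    my @cols = split /\t/, $line, -1;
    die "expected 9 columns, got " . scalar(@cols) . "\n" unless @cols == 9;

    # First eight columns have a fixed meaning
    my %feature;
    @feature{qw(seq_id source primary_tag start end score strand frame)}
        = @cols[0 .. 7];

    # In GFF3 the ninth column holds semicolon-separated tag=value pairs
    my %attributes;
    for my $pair (split /;/, $cols[8]) {
        my ($tag, $value) = split /=/, $pair, 2;
        next unless defined $value;
        $attributes{$tag} = $value;
    }
    $feature{attributes} = \%attributes;
    return \%feature;
}

# Made-up example line
my $line = join "\t", 'chr1', 'example', 'gene', 1000, 2000,
                      '.', '+', '.', 'ID=gene0001;Name=demo';
my $f = parse_gff3_line($line);
printf "%s %s: %s:%d-%d (%s)\n",
    $f->{primary_tag}, $f->{attributes}{ID},
    $f->{seq_id}, $f->{start}, $f->{end}, $f->{strand};
```

For real files, Bio::Tools::GFF handles version differences and escaping, so a hand-written splitter like this is only useful for understanding the format.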
Features & Annotations
- Derived from the GFF format
- Have a location on a sequence
- start(), end() & strand() for location information
- score(), frame(), primary_tag(), source_tag() for feature information
- tag(): hash reference of tag/value
- See Bio::SeqFeature::Generic for more details
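As a rough illustration of these accessors, the sketch below builds a single Bio::SeqFeature::Generic object by hand and reads its fields back. Every value (coordinates, source, tag names and tag values) is invented for the example; tag values are read with has_tag() and get_tag_values(), the same calls used in the GFF reading script later on.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqFeature::Generic;

# All values below are made up for illustration
my $feature = Bio::SeqFeature::Generic->new(
    -start       => 10,
    -end         => 100,
    -strand      => -1,
    -primary_tag => 'gene',
    -source_tag  => 'example',
    -score       => 1000,
    -tag         => { ID => 'gene0001', Note => 'hypothetical gene' },
);

# Location and feature information
printf "%s from %s at %d..%d on strand %d\n",
    $feature->primary_tag, $feature->source_tag,
    $feature->start, $feature->end, $feature->strand;

# get_tag_values() throws if the tag is absent, so check has_tag() first
for my $tag (qw(ID Note Missing)) {
    if ($feature->has_tag($tag)) {
        print "$tag = ", join(', ', $feature->get_tag_values($tag)), "\n";
    }
    else {
        print "$tag is not set\n";
    }
}
```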
Convert format: Bio::SeqIO
- Reads and writes sequences
- Initialize with:
  - -file: filename for input; prepend '>' to the filename for writing
  - -format: the format to read or write
- Some supported formats:

    Format     Description
    fasta      FASTA
    genbank    GenBank DB
    embl       EMBL DB
    swiss      SwissProt DB
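Read in sequence and write out in different format

    use strict;
    use Bio::SeqIO;

    my $in = Bio::SeqIO->new(
        -format => 'genbank',
        -file   => '',           # name of the GenBank input file
    );
    my $out = Bio::SeqIO->new(
        -format => 'fasta',
        -file   => '>out.fa',    # leading '>' opens the file for writing
    );

    # Copy every sequence from the input to the output stream
    while ( my $seq = $in->next_seq ) {
        $out->write_seq($seq);
    }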
Read GFF

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Bio::Tools::GFF;

    # Arguments: a GFF3 file and the feature type (primary tag) to report
    my $file = shift;
    my $tag  = shift;

    my $in = Bio::Tools::GFF->new(
        -gff_version => 3,
        -file        => $file,
    );

    # Print ID, sequence, start, end and strand of each matching feature
    while ( my $feature = $in->next_feature ) {
        if ( $feature->primary_tag() eq $tag ) {
            my ($id) = $feature->get_tag_values("ID");
            print join("\t", $id, $feature->seq_id, $feature->start,
                             $feature->end, $feature->strand), "\n";
        }
    }
    $in->close;
Bio::SearchIO
- Parses analysis reports
- A report can be split into 3 components:
  - Result: one per query
  - Hit: a sequence which matches the query (component of a Result)
  - HSP: High-scoring Segment Pair (component of a Hit)
- Implemented for BLAST, BLAT, FASTA, HMMER, Exonerate…
Bio::SearchIO: Result / Hit / HSP hierarchy

    Result
      Hit 1
        HSP 1
        HSP 2
      Hit 2
        HSP 1
Bio::SearchIO

    use strict;
    use Bio::SearchIO;

    my $in = Bio::SearchIO->new(
        -format => 'blast',
        -file   => 'report.bls',
    );

    # Report every HSP longer than 50 with at least 75% identity
    while ( my $result = $in->next_result ) {
        while ( my $hit = $result->next_hit ) {
            while ( my $hsp = $hit->next_hsp ) {
                if ( $hsp->length('total') > 50 ) {
                    if ( $hsp->percent_identity >= 75 ) {
                        print "Query=",       $result->query_name,
                              " Hit=",        $hit->name,
                              " Length=",     $hsp->length('total'),
                              " Percent_id=", $hsp->percent_identity, "\n";
                    }
                }
            }
        }
    }
HOWTO
- Parsing with Bio::SearchIO
- Table of methods
Things I'm skipping (here)
- Bio::Tools::SeqStats - base-pair frequencies, dicodon frequencies, etc.
- Bio::Tools::SeqWords - count n-mer words in a sequence
- Bio::SeqUtils - mixed helper functions
- Bio::Restriction - find restriction enzyme sites and cut sequences
- Bio::Graphics - represent information graphically
