BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Introduction
Objective of BioPerl: develop reusable, extensible core Perl modules for use as a standard for manipulating molecular biological data.
Background: started in 1995; one of the oldest open-source bioinformatics toolkit projects.

So what is BioPerl?
- Higher level of abstraction
- Reusable collection of Perl modules that facilitate bioinformatics application development:
  - Accessing databases with different formats
  - Sequence manipulation
  - Execution and parsing of the results of molecular biology programs
- Catch? BioPerl does not include programs like BLAST, ClustalW, etc.
  - Uses system calls to execute external programs
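The "execute an external program, then parse its results" pattern can be sketched in plain Perl. This is only an illustration of the idea, not BioPerl's own wrapper code: the helper name run_external is hypothetical, and the Perl interpreter itself ($^X) stands in for an external tool such as BLAST, printing a made-up two-line, tab-separated report.

```perl
use strict;
use warnings;

# Hypothetical helper: run an external command and return its output lines.
sub run_external {
    my @cmd = @_;
    open(my $pipe, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    my @lines = <$pipe>;
    close($pipe) or die "Command $cmd[0] failed";
    chomp @lines;
    return @lines;
}

# Stand-in "external program": perl printing a tab-separated report.
my @report = run_external($^X, '-e', 'print "hit1\t42\nhit2\t17\n"');

# Parse the report into a name => score table.
my %score;
for my $line (@report) {
    my ($name, $value) = split /\t/, $line;
    $score{$name} = $value;
}

print "$_ => $score{$_}\n" for sort keys %score;
```

A real wrapper would also have to locate the executable, build its command line from parameters, and handle the tool's actual report format.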

So what is BioPerl? (continued…)
- 551 modules (incl. 82 interface modules)
- 37 module groups
- 79,582 lines of code (223,310 lines total)
- 144 lines of code per module
- For more info: BioPerl Module Listing

Major Areas covered in Bioperl
- Sequences, features, annotations
- Pairwise alignment reports
- Multiple sequence alignments
- Bibliographic data
- Graphical rendering of sequence tracks
- Database for features and sequences

Additional things
- Gene prediction parsers
- Trees, parsing phylogenetic and molecular evolution software output
- Population genetic data and summary statistics
- Taxonomy
- Protein structure

Downloading modules
- Modules can be obtained from:
  - (Perl Modules)
  - (BioPerl Modules)
- Downloading modules from CPAN
  - Interactive mode: perl -MCPAN -e shell
  - Batch mode: use CPAN; then clean, install, make, recompile, test

Directory Structure
BioPerl directory structure organization:
    Bio/                  BioPerl modules
    models/               UML for BioPerl classes
    t/                    Perl built-in tests
    t/data/               Data files used for the tests
    scripts/              Reusable scripts that use BioPerl
    scripts/contributed/  Contributed scripts not necessarily integrated into BioPerl
    doc/                  "How To" files and the FAQ as XML

Parsing Sequences
- Bio::SeqIO
  - multiple drivers: genbank, embl, fasta, ...
- Sequence objects
  - Bio::PrimarySeq
  - Bio::Seq
  - Bio::Seq::RichSeq

Sequence Object Creation
Sequence creation:
    $sequence = Bio::Seq->new(-seq        => 'AATGCAA',
                              -display_id => 'my_sequence');
Flat file format support: Raw, FASTA, GCG, GenBank, EMBL, PIR
Via ReadSeq: IG, NBRF, DnaStrider, Fitch, Phylip, MSF, PAUP

Sequence object
Common (Bio::PrimarySeq) methods:
    seq()            - get the sequence as a string
    length()         - get the sequence length
    subseq($s,$e)    - get a subsequence
    translate(...)   - translate to protein [DNA]
    revcom()         - reverse complement [DNA]
    display_id()     - identifier string
    description()    - description string

Sequence Types
Different sequence objects:
- Seq – Some annotations
- RichSeq – Additional annotations
- PrimarySeq – Bare minimum annotation (id, accession number, alphabet)
- LocatableSeq – Start, stop and gap information also
- LargeSeq – Very long sequences
- LiveSeq – Newly sequenced genomes

Using a sequence
    use Bio::PrimarySeq;
    my $str = "ATGAATGATGAA";
    my $seq = Bio::PrimarySeq->new(-seq => $str, -display_id => "example");
    print "id is ", $seq->display_id, "\n";
    print $seq->seq, "\n";
    my $revcom = $seq->revcom;
    print $revcom->seq, "\n";
    print "frame1=", $seq->translate->seq, "\n";
Output:
    id is example
    ATGAATGATGAA
    TTCATCATTCAT
    frame1=MNDE

Accessing remote databases
    use Bio::DB::GenBank;
    $gb = Bio::DB::GenBank->new();
    $seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
    $seq2 = $gb->get_Seq_by_acc('AF303112');
    $seqio = $gb->get_Stream_by_id(["J00522","AF303112"," "]);

Sequence – Accession numbers
    # Get a sequence from RefSeq by accession number
    use Bio::DB::RefSeq;
    $gb = Bio::DB::RefSeq->new();
    $seq = $gb->get_Seq_by_acc("NM_007304");
    print $seq->seq();

Reading and Writing Sequences
- Bio::SeqIO: fasta, genbank, embl, swissprot, ...
- Takes care of writing out associated features and annotations
- Two functions:
  - next_seq (reading sequences)
  - write_seq (writing sequences)

Writing a Sequence
    use Bio::SeqIO;
    # Let's convert swissprot to fasta format
    my $in  = Bio::SeqIO->new(-format => 'swiss', -file => 'file.sp');
    my $out = Bio::SeqIO->new(-format => 'fasta', -file => '>file.fa');
    while ( my $seq = $in->next_seq ) {
        $out->write_seq($seq);
    }

Manipulating sequence data with Seq methods Allows the easy manipulation of bioinformatics data Specific parts of various annotated formats can be selected and rearranged. Unwanted information can be voided out of reports Important information can be highlighted, processed, stored in arrays for graphs/charts/etc with relative ease Information can be added and subtracted in a flash

The Code
    #!/usr/local/bin/perl
    use Bio::Seq;
    use Bio::SeqIO;
    my $seqin  = Bio::SeqIO->new('-file' => "genes.fasta", '-format' => 'Fasta');
    my $seqobj = $seqin->next_seq();
    my $seq    = $seqobj->seq();    # plain sequence
    print ">", $seqobj->display_id(), " Description: ", $seqobj->desc(),
          " Alphabet: ", $seqobj->alphabet(), "\n";
    $seq =~ s/(.{60})/$1\n/g;       # convert to 60 char lines
    print $seq, "\n";
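The line-wrapping substitution in the script above can be tried on its own, without BioPerl. The following is a minimal sketch; the helper name wrap_sequence and the test string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: insert a newline after every $width characters.
sub wrap_sequence {
    my ($seq, $width) = @_;
    $width ||= 60;
    $seq =~ s/(.{$width})/$1\n/g;
    return $seq;
}

my $seq     = 'ACGT' x 35;              # 140 characters
my $wrapped = wrap_sequence($seq, 60);  # two full lines plus a 20-character remainder
my @lines   = split /\n/, $wrapped;

print "$_\n" for @lines;
```

Note that a sequence whose length is an exact multiple of the width ends with a newline after wrapping, so printing it with an extra "\n" produces a blank line.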

Before

After

Obtaining basic sequence statistics- molecular weights, residue & codon frequencies (SeqStats, SeqWord) Molecular Weight Monomer Counter Codon Counter DNA weights RNA weights Amino Weights More

The Code
    #!/usr/local/bin/perl
    use Bio::PrimarySeq;
    use Bio::Tools::SeqStats;
    my $seqobj = Bio::PrimarySeq->new(-seq => 'ATCGTAGCTAGCTGA', -display_id => 'example1');
    my $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
    my $hash_ref  = $seq_stats->count_monomers();
    foreach my $base (sort keys %$hash_ref) {
        print "Number of bases of type ", $base, " = ", $hash_ref->{$base}, "\n";
    }

The Results

More Code
    use Bio::Tools::SeqStats;
    $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
    # returns the molecular weight
    $weight = $seq_stats->get_mol_wt();
    # counts the number of monomers
    $monomer_ref = $seq_stats->count_monomers();
    # counts the number of codons (nucleic acid sequences only)
    $codon_ref = $seq_stats->count_codons();

Monomer

Why the Large and the Small MW?
Since sequences may contain ambiguous monomers (e.g. "M" meaning "A" or "C" in a nucleic acid sequence), the method get_mol_wt returns a reference to a two-element array containing the greatest lower bound and least upper bound of the molecular weight. (For a sequence with no ambiguous monomers, the two elements of the returned array will be equal.)
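A short sketch of the two bounds in practice, assuming BioPerl is installed. The two sequences and their display ids are made up for illustration; the calls are the same Bio::Tools::SeqStats methods used in the slides above.

```perl
use strict;
use warnings;
use Bio::PrimarySeq;
use Bio::Tools::SeqStats;

# Illustrative sequences: the second contains the ambiguity code M (A or C).
my $plain = Bio::PrimarySeq->new(-seq => 'ATCGTAGC', -alphabet => 'dna',
                                 -display_id => 'plain');
my $ambig = Bio::PrimarySeq->new(-seq => 'ATCGMAGC', -alphabet => 'dna',
                                 -display_id => 'ambiguous');

my %bounds;
for my $seqobj ($plain, $ambig) {
    my $stats  = Bio::Tools::SeqStats->new(-seq => $seqobj);
    my $weight = $stats->get_mol_wt();   # array reference: [lower, upper]
    $bounds{ $seqobj->display_id } = $weight;
    print $seqobj->display_id, ": molecular weight between ",
          $weight->[0], " and ", $weight->[1], "\n";
}
```

For the unambiguous sequence the two numbers should coincide; for the ambiguous one the lower bound should be strictly smaller than the upper bound.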

Identifying restriction enzyme sites (RestrictionEnzyme)
Bioperl's standard RestrictionEnzyme object comes with data for more than 150 different restriction enzymes. To select all available enzymes with cutting patterns that are six bases long:
    $re = new Bio::Tools::RestrictionEnzyme('-name' => 'EcoRI');
    @sixcutters = $re->available_list(6);
Sites for a particular enzyme on a given nucleic acid sequence can be obtained using:
    $re1 = new Bio::Tools::RestrictionEnzyme(-name => 'EcoRI');
    # $seqobj is the Seq object for the DNA to be cut
    @fragments = $re1->cut_seq($seqobj);

Identifying restriction enzyme sites (RestrictionEnzyme) (more)
Adding an enzyme not in the default list is as easy as this:
    $re2 = new Bio::Tools::RestrictionEnzyme('-NAME' => 'EcoRV--GAT^ATC',
                                             '-MAKE' => 'custom');
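What cutting a sequence at a recognition site amounts to can be sketched in plain Perl. This is a simplified illustration, not the BioPerl implementation: it handles a single linear strand only, the helper name cut_at_sites and the test DNA string are made up, and the cut offset 1 encodes EcoRI's G^AATTC pattern (cut after the first base of the site).

```perl
use strict;
use warnings;

# Hypothetical helper: split $dna at every occurrence of $site,
# cutting $offset bases into the recognition site.
sub cut_at_sites {
    my ($dna, $site, $offset) = @_;
    my @fragments;
    my $start = 0;   # start of the fragment currently being built
    my $pos   = 0;   # where the next search for $site begins
    while (($pos = index($dna, $site, $pos)) != -1) {
        my $cut = $pos + $offset;
        push @fragments, substr($dna, $start, $cut - $start);
        $start = $cut;
        $pos++;
    }
    push @fragments, substr($dna, $start);   # remainder after the last cut
    return @fragments;
}

# EcoRI recognises GAATTC and cuts between G and A.
my @frags = cut_at_sites('AAGAATTCGGGAATTCTT', 'GAATTC', 1);
print join("\n", @frags), "\n";
```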

Manipulating sequence alignments
Bioperl offers several Perl objects to facilitate sequence alignment:
- pSW (Smith-Waterman)
- Clustalw.pm
- TCoffee.pm
- bl2seq option of StandAloneBlast

Manipulating Alignments
Some of the manipulations possible with SimpleAlign include:
- slice(): Obtaining an alignment "slice", that is, a subalignment inclusive of specified start and end columns.
- column_from_residue_number(): Finding the column in an alignment where a specified residue of a specified sequence is located.
- consensus_string(): Making a consensus string. This method includes an optional threshold parameter, so that positions in the alignment with lower percent-identity than the threshold are marked by "?"s in the consensus.
- percentage_identity(): A fast method for calculating the average percentage identity of the alignment.
- consensus_iupac(): Making a consensus using IUPAC ambiguity codes from DNA and RNA.

The Code
    use Bio::AlignIO;
    # SimpleAlign objects are read from a file through Bio::AlignIO
    $in  = Bio::AlignIO->new(-file => 't/data/testaln.fasta', -format => 'fasta');
    $aln = $in->next_aln();
    $threshold_percent = 60;
    $consensus_with_threshold = $aln->consensus_string($threshold_percent);
    $iupac_consensus = $aln->consensus_iupac();   # dna/rna alignments only
    $percent_ident = $aln->percentage_identity;
    $seqname = 'AKH_HAEIN';
    $pos = $aln->column_from_residue_number($seqname, 14);
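The idea behind a thresholded consensus can be illustrated without BioPerl. The sketch below is an assumption-laden simplification, not the SimpleAlign algorithm: the helper name consensus_with_threshold is hypothetical, all rows are assumed to have equal length, gaps get no special treatment, and ties are broken alphabetically.

```perl
use strict;
use warnings;

# Hypothetical helper: per column, take the most frequent character and
# emit it only if it reaches $threshold percent of the rows; otherwise '?'.
sub consensus_with_threshold {
    my ($threshold, @rows) = @_;
    my $consensus = '';
    for my $col (0 .. length($rows[0]) - 1) {
        my %count;
        $count{ substr($_, $col, 1) }++ for @rows;
        my ($best) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
        my $percent = 100 * $count{$best} / @rows;
        $consensus .= $percent >= $threshold ? $best : '?';
    }
    return $consensus;
}

my @rows = ('ACGT', 'ACGA', 'ATGC');
my $cons = consensus_with_threshold(60, @rows);
print "$cons\n";   # prints ACG? (the last column has no 60% majority)
```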

Searching for Sequence Similarity BLAST with BioPerl Parsing Blast and FASTA Reports Search and SearchIO BPLite, BPpsilite, BPbl2seq Parsing HMM Reports Standalone BioPerl BLAST

Remote Execution of BLAST
- BioPerl has a built-in capability of running BLAST jobs remotely using RemoteBlast.pm
- Runs these jobs at NCBI automatically
  - NCBI has dynamic configurations (server side) to "always" be up and ready
  - Automatically updated for new BioPerl releases
- Convenient for independent researchers who do not have access to huge computing resources
- Quick submission of BLAST jobs without tying up local resources (especially if working from a standalone workstation)
- Legal restrictions!!!

Example of Remote Blast
    $remote_blast = Bio::Tools::Run::RemoteBlast->new(
        '-prog' => 'blastp', '-data' => 'ecoli', '-expect' => '1e-10' );
    $r = $remote_blast->submit_blast("t/data/ecolitst.fa");
    while ( @rids = $remote_blast->each_rid ) {
        foreach $rid ( @rids ) {
            $rc = $remote_blast->retrieve_blast($rid);
        }
    }

Sample Script to Read and Parse BLAST Report
    # Get the report
    $searchio = new Bio::SearchIO(-format => 'blast', -file => $blast_report);
    $result = $searchio->next_result;
    # Get info about the entire report
    $algorithm_type = $result->algorithm;
    # Get info about the first hit
    $hit = $result->next_hit;
    $hit_name = $hit->name;
    # Get info about the first hsp of the first hit
    $hsp = $hit->next_hsp;
    $hsp_start = $hsp->query->start;

Running BLAST Locally
StandAloneBlast: Bio::Tools::Run::StandAloneBlast
Factory:
    @params = ('program' => 'blastn', 'database' => 'ecoli.nt');
    $factory = Bio::Tools::Run::StandAloneBlast->new(@params);
Advantages:
- Private use
- Customized local resources
- Avoid network problems

Examples
    # Setting parameters similar to RemoteBlast
    $input = Bio::Seq->new(-id => "test query", -seq => "ACTAAGTGGGGG");
    # The returned report object gives direct access to the parser
    $blast_report = $factory->blastall($input);
    $result = $blast_report->next_result;
    while (my $sbjct = $result->next_hit) {
        while (my $hsp = $sbjct->next_hsp) {
            print $hsp->score . " " . $hsp->subject->seqname . "\n";
        }
    }

Format Conversion – Sequences
Example
Use: Bio::SeqIO
Core code:
    $in  = Bio::SeqIO->new('-file' => "COG0001", '-format' => 'Fasta');
    $out = Bio::SeqIO->new('-file' => ">COG0001.gen", '-format' => 'genbank');
    while ( my $seq = $in->next_seq() ) {
        $out->write_seq($seq);
    }

Format Conversion – Alignments
Alignment formats supported:
- INPUT: fasta, selex (HMMER), bl2seq, clustalw (.aln), msf (GCG), psi (PSI-BLAST), mase (Seaview), stockholm, prodom, water, phylip (interleaved), nexus, mega, meme
- OUTPUT: fasta, clustalw, mase, selex, msf/gcg, and phylip (interleaved)
The next_aln() and write_aln() methods of the Bio::AlignIO object are used.
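Alignment conversion follows the same read/write loop as sequence conversion. The sketch below assumes BioPerl is installed and, to stay self-contained, reads a made-up three-sequence FASTA alignment from an in-memory filehandle and writes ClustalW format into a string; with real data, -file arguments would replace the -fh ones.

```perl
use strict;
use warnings;
use Bio::AlignIO;

# Illustrative aligned FASTA input (all rows have the same length).
my $fasta = ">seq1\nACGT-ACGT\n>seq2\nACGTTACGT\n>seq3\nACG--ACGT\n";
open(my $in_fh, '<', \$fasta) or die "Cannot open in-memory input: $!";

my $clustal = '';
open(my $out_fh, '>', \$clustal) or die "Cannot open in-memory output: $!";

my $in  = Bio::AlignIO->new(-fh => $in_fh,  -format => 'fasta');
my $out = Bio::AlignIO->new(-fh => $out_fh, -format => 'clustalw');

# Read each alignment with next_aln() and write it with write_aln().
my $count = 0;
while ( my $aln = $in->next_aln() ) {
    $out->write_aln($aln);
    $count++;
}
$out->close();

print $clustal;
```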

ClustalW and Profile Align
ClustalW using BioPerl
- The Clustalw program should be installed and the environment variable 'CLUSTALDIR' set
- Setting parameters: build a factory
  - Some parameters: 'ktuple', 'matrix', 'outfile', 'quiet'
- Needs a reference to an array of sequence objects (see example)
- The align() and profile_align() methods are used

ClustalW – Example
Use Bio::SeqIO, Bio::Tools::Run::Alignment::Clustalw
Core code (simple align):
    @params = ('ktuple' => 2, 'matrix' => 'BLOSUM', 'outfile' => 'clustalw_out', 'quiet' => 1);
    $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
    # @seq_array is an array of Bio::Seq objects
    $seq_array_ref = \@seq_array;
    $aln = $factory->align($seq_array_ref);

Smith Waterman Search
Smith-Waterman pairwise alignment
- Standard method for producing an optimal local alignment of two sequences
- Auxiliary Bioperl-ext library required
- SW algorithm implemented in C and incorporated into Bioperl
- The align_and_show() and pairwise_alignment() methods of the Bio::Tools::pSW module are used

Smith Waterman Search – Example
Use Bio::Tools::pSW, Bio::SeqIO, Bio::AlignIO
Core code:
    $factory = new Bio::Tools::pSW('-matrix' => 'BLOSUM62', '-gap' => 12, '-ext' => 2);
    $aln = $factory->pairwise_alignment($seq_array[0], $seq_array[1]);
    my $alnout = new Bio::AlignIO(-format => 'msf', -fh => \*STDOUT);
    $alnout->write_aln($aln);

Smith Waterman Search
- The AlignIO object in the previous slide could also be used to print to a file
- Use a double loop to do all pairwise comparisons
- More info: Bio::Tools::pSW manpage
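The double loop for all pairwise comparisons can be sketched as follows. To keep the example runnable without the Bioperl-ext library, a hypothetical stand-in function (count_matches, a naive position-by-position match count) takes the place of the call to $factory->pairwise_alignment; the helper names and the three test strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real pairwise alignment call:
# counts identical characters at the same positions.
sub count_matches {
    my ($x, $y) = @_;
    my $n = length($x) < length($y) ? length($x) : length($y);
    my $matches = 0;
    for my $i (0 .. $n - 1) {
        $matches++ if substr($x, $i, 1) eq substr($y, $i, 1);
    }
    return $matches;
}

# Double loop: the inner index starts at $i + 1, so each unordered pair
# is visited once and nothing is compared with itself.
sub all_pairs {
    my ($seqs, $compare) = @_;
    my @results;
    for my $i (0 .. $#$seqs - 1) {
        for my $j ($i + 1 .. $#$seqs) {
            push @results, [ $i, $j, $compare->($seqs->[$i], $seqs->[$j]) ];
        }
    }
    return @results;
}

my @seqs  = ('ACGT', 'ACGA', 'TCGA');
my @pairs = all_pairs(\@seqs, \&count_matches);
print "seq $_->[0] vs seq $_->[1]: $_->[2] matches\n" for @pairs;
```

With n sequences this performs n(n-1)/2 comparisons; in the BioPerl setting the callback would wrap pairwise_alignment and return the alignment object instead of a count.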