Lane Medical Library & Knowledge Management Center
Perl Programming for Biologists
PART 2: Tue Aug 28th 2007
Yannick Pouliot, PhD
Bioresearch Informationist
Lane Medical Library & Knowledge Management Center

To Dos
- Close all programs other than IE on your laptop
- Log into virtual room
- YP: log into Safari

To Do - 2
- Please download all class materials from

Class Focus for Session #2
1. Converting file contents
2. Introducing BioPerl
3. Perl and relational databases
And remember: Ask LOTS OF QUESTIONS

Cautions - Reminder
- All examples pertain to MS Office 2003
  - Unclear what is to be expected for MS Office 2007
- All contents pertain to Perl 5.x, not 6.x
  - V.5 and 6 are NOT compatible
  - V.5 is far, far more common, so not much of an issue

Questions from last session?

Part 1: Converting file contents

Converting Data Stored in Flatfiles
- Input: ExampleOutputExcel3.csv
  - File generated last week by Excel3.pl
- Let’s look at and run Convert1.pl → Convert5.pl
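
To give a feel for what a flatfile conversion involves, here is a minimal sketch that turns comma-separated lines into tab-delimited ones. It is not the content of Convert1.pl through Convert5.pl. The sample rows and the sub name csv_to_tab are made up for illustration, and it assumes that no field contains a comma or quotes (real-world CSV with quoted fields needs a proper parser).

```perl
use strict;
use warnings;

# Convert one comma-separated line into a tab-delimited line.
# ASSUMPTION: no field contains an embedded comma or quote characters.
sub csv_to_tab {
    my ($line) = @_;
    chomp $line;
    my @fields = split /,/, $line, -1;   # -1 keeps trailing empty fields
    s/^\s+|\s+$//g for @fields;          # trim whitespace around each field
    return join "\t", @fields;
}

# Hypothetical rows standing in for lines read from a CSV file
my @csv_lines = (
    "GeneSymbol,Chromosome,Score\n",
    "CDH15, 16 ,0.93\n",
    "TP53,17,\n",
);

print csv_to_tab($_), "\n" for @csv_lines;
```

In a real script the lines would come from a filehandle opened on the CSV file rather than from an in-memory list.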

Part 2: BioPerl

BioPerl: Overview
- BioPerl = >1,000 modules divided into 7 packages
  - Not all in 1.4
  - 1.4 = stable release

Other, Non-BioPerl Modules

BioPerl: You Have A Friend In High Places
- The big deal: BioPerl provides “objects” for various types of sequence data and their associated features and annotations.
- These objects provide interfaces for:
  - analysis of these sequences with a wide variety of external programs (BLAST, FASTA, clustalw and EMBOSS, to name just a few)
  - various types of databases for storage and retrieval of sequences
    - remote (GenBank, EMBL, etc.)
    - local (MySQL, Flat_databases flat files, GFF, etc.)

So What Is This Object Business?

What A Biology-Related Program Looks Like When Coded According To The Object Paradigm
[Diagram of object boxes labeled: Protein, DNA, RNA, Gene, Organism, Species, LivingObject, Sequence]

Objects Inherit From A Class Or Prior Object
- Class = prototype for all objects of this type
- Create an object (“new”)
- Derive an object from an existing object
[Diagram labels: Class, Object 1 (ancestor), Object2; example: Sequence, RNA, Protein, DNA]
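
The idea of a class as a prototype, and of objects derived from an ancestor, can be sketched in plain Perl. The packages below (Sequence, DNA, RNA) and their methods are hypothetical teaching examples, not BioPerl's own classes.

```perl
use strict;
use warnings;

# Ancestor class: the prototype for every kind of sequence (hypothetical)
package Sequence;

sub new {
    my ($class, %args) = @_;
    my $self = { id => $args{id}, residues => uc $args{residues} };
    return bless $self, $class;          # tie the data to its class
}
sub id         { $_[0]{id} }
sub residues   { $_[0]{residues} }
sub seq_length { length $_[0]{residues} }
sub type       { 'Sequence' }

# DNA inherits new(), id(), residues() and seq_length() from Sequence
package DNA;
use parent -norequire, 'Sequence';

sub type { 'DNA' }                       # overrides the ancestor's method

# Behaviour that only DNA objects have
sub revcom {
    my ($self) = @_;
    my $rc = scalar reverse $self->residues;
    $rc =~ tr/ACGT/TGCA/;
    return $rc;
}

package RNA;
use parent -norequire, 'Sequence';

sub type { 'RNA' }

package main;

# Create objects ("new") from the derived classes
my $dna = DNA->new(id => 'demo1', residues => 'atgc');
my $rna = RNA->new(id => 'demo2', residues => 'augc');

for my $seq ($dna, $rna) {
    printf "%s is a %s of length %d\n", $seq->id, $seq->type, $seq->seq_length;
}
print "reverse complement of ", $dna->id, ": ", $dna->revcom, "\n";
```

Both objects answer to id and seq_length because those methods live in the ancestor, while only the DNA object knows how to compute a reverse complement.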

An example: Class inheritance for shape concepts

Key BioPerl Links
- BioPerl 1.4 installed as part of Perl (what you downloaded)
- BioPerl home:
- Lots of examples

BioPerl Example: Querying GenBank To Retrieve Sequence Properties
- Seq7.pl
- Seq8.pl
- Seq9.pl → after exercise (next slide)
- Seq11.pl → after exercise (next slide)
- Related docs:
  - GenBank search: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/GenBank.html
  - SeqIO: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SeqIO/genbank.html
  - See also
  - And most importantly: current/bioperl-live/Bio/Seq.html

Exercise: Print An Additional Sequence Feature
- Add an additional sequence feature to Seq8.pl
- What to print: see Methods for Seq object at current/bioperl-live/Bio/Seq.html

Quiz Questions based on Seq11.pl

    use warnings;
    use strict;
    use Bio::DB::GenBank;
    #
    # main
    $| = 1; # Force unbuffered STDOUT and STDIN.
    my $gb = Bio::DB::GenBank->new( -format     => 'GenBank',
                                    -seq_start  => 0,
                                    -seq_end    => 1000,
                                    -strand     => 1,
                                    -complexity => 0);
    # put in some restrictions as to what is retrieved and stored into GenBank object...

    # get a stream via a query string
    my $query = Bio::DB::Query::GenBank->new(
                    -query => 'Homo sapiens[Organism] AND M-cadherin',
                    -db    => 'nucleotide');
    my $seqio = $gb->get_Stream_by_query($query);
    my $i=0; # count total number of sequences
    while (my $seq = $seqio->next_seq) {
        print "seq id =", $seq->id,
              "\t version = ", $seq->version,
              "\t seq acc number = ", $seq->accession_number,
              "\t seq length = ", $seq->length, "\n";
        $i++;
    }
    print "retrieved $i sequences from GenBank \n";
    #

More Quizzing: Seq10.pl
- Run Seq10.pl
  - Why the warning messages?
- Specifying strands
  - 1 for plus
  - 2 for minus
- Complexity: a GenBank nucleotide entry is often a part of a larger biological blob that contains other GI numbers (e.g., translated protein)
- Complexity regulates the display:
  - 0 - get the whole blob
  - 1 - get the bioseq for gi of interest (default in Entrez)
  - 2 - get the minimal bioseq-set containing the gi of interest
  - 3 - get the minimal nuc-prot containing the gi of interest
  - 4 - get the minimal pub-set containing the gi of interest

Some Cautions
- Be careful when querying databases → have an idea of how many sequences you may be downloading/processing
- Know that Perl might eat up all of your CPU cycles

Part 3: Interacting With A Database

Preliminaries: Updating ODBC Manager
- First we need to add directions to the “GenesToEvaluate” DB to the ODBC Manager
- More at

Example Perl Programs That Interact With A Database
- Ancillary files:
  - ExampleOutputExcel3.csv needed as input to Access1.pl
  - Access2.pl and Access3.pl don’t need this file
- All programs rely on GenesToEvaluate.mdb (Access DB)
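
For orientation, here is a minimal sketch of how a Perl script might query an ODBC data source through the DBI module. It is not the code of Access1.pl through Access3.pl. It assumes DBI and DBD::ODBC are installed and that an ODBC data source named GenesToEvaluate has been registered. The table name Genes, the columns GeneSymbol and Score, and the --run switch are hypothetical.

```perl
use strict;
use warnings;

# Build the DBI data source string for a named ODBC data source.
sub odbc_dsn {
    my ($dsn_name) = @_;
    return "dbi:ODBC:$dsn_name";
}

# Run a SELECT and return all rows as a reference to an array of hash refs.
sub fetch_rows {
    my ($dsn_name, $sql, @bind) = @_;
    require DBI;    # loaded at run time; needs DBI and DBD::ODBC installed
    my $dbh = DBI->connect(odbc_dsn($dsn_name), '', '',
                           { RaiseError => 1, PrintError => 0 });
    my $sth = $dbh->prepare($sql);
    $sth->execute(@bind);                      # @bind fills the ? placeholders
    my $rows = $sth->fetchall_arrayref({});    # {} = one hash ref per row
    $dbh->disconnect;
    return $rows;
}

# Only touch the database when asked to:  perl sketch.pl --run
if (@ARGV && $ARGV[0] eq '--run') {
    # HYPOTHETICAL table and column names; adjust to the real schema
    my $rows = fetch_rows('GenesToEvaluate',
                          'SELECT GeneSymbol FROM Genes WHERE Score > ?', 0.5);
    print "$_->{GeneSymbol}\n" for @$rows;
}
```

Using a ? placeholder with a bound value, rather than pasting the value into the SQL string, avoids quoting problems when the value comes from a file or from user input.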

In Closing: Suggestions
- Modify the programs provided here
  - Baby steps…
  - Save often
  - Keep lots of prior versions so you can recover from your mistakes
- SU provides lots of documentation → use it!
- Google is invaluable