Lane Medical Library & Knowledge Management Center
Perl Programming for Biologists
PART 2: Tue Feb 12th 2008
Yannick Pouliot, PhD
Bioresearch Informationist, Lane Medical Library & Knowledge Management Center
© 2008 The Board of Trustees of The Leland Stanford Junior University
Prep
Log into WebEx session (stanford.webex.com/Meetings)
Please download all class materials for 2nd class from FAQ at
Class Focus for Session #2
1. Altering file contents from text files
2. Altering file contents from Excel files
And remember: Ask LOTS OF QUESTIONS
Reminder: Cautions
All examples pertain to MS Office 2003
  Unclear what is to be expected for MS Office 2007
All contents pertain to Perl 5.x, not 6.x
  V.5 and V.6 are NOT compatible
  V.5 is far more common, so not much of an issue
Questions from last session?
→ stump the teacher!
Preliminaries: A Biologically Useful Perl Program
… to produce the data to be used in this class
Let’s run Excel3.pl (briefly described last week)
Excel3.pl: A “Real” Program
What it does:
1. Reads input from an Excel worksheet containing public identifiers for DNA sequences associated with genes
2. Uses Entrez Utilities provided by NCBI to retrieve:
  UniGene cluster ID
  UniGene Gene symbol
  NCBI Gene ID
3. Writes the result into another Excel worksheet
Features a mix of procedural and object programming
Relevant links: gene, Entrez Utilities
What Excel3.pl does:
Part 1: Altering file contents
Converting Data Stored in Flatfiles
Input: ConvertOuput.csv = renamed file generated by Excel3.pl
Let’s look at and run Convert1.pl → Convert5.pl
Convert1.pl
Structure of program
Run program
Exercise: what is chomp?
Understanding file handles
What is $_ ?
Create an error: uncomment line 22 and run
Introducing the escape character: “\”
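The ideas on this slide (file handles, chomp, $_ and the escape character) can be sketched in a few lines. This is not the code of Convert1.pl; it is a minimal illustration that reads from an in-memory string instead of a file so that it runs anywhere, and the sample IDs are made up.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the contents of a .csv file.
my $data = "GenBank ID,Cluster ID,Gene symbol\nAA000001,Hs.1,GENE1\n";

# Open a file handle. Here it reads from a string held in memory;
# to read a real file, replace \$data with the file name.
open(my $in, '<', \$data) or die "Cannot open input: $!";

my @lines;
while (<$in>) {               # each line read is placed in $_
    chomp;                    # with no argument, chomp strips the newline from $_
    push @lines, $_;
    print "Read: \"$_\"\n";   # \" escapes the quote; \n is a newline
}
close($in);
```

Without chomp, each stored line would still end with its newline character, which usually causes trouble when the line is later split or compared.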
Convert2.pl: Like Convert1.pl, but Prints Only First Item
Using arrays to process contents of a line
Introducing split
Changing directories
  Useful to segregate data files
  Need to change the path to make this work in your environment
  Note difference between Mac and Windows syntax for path names
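As a rough sketch of how split turns one line into an array (the line below is made up, and this is not the actual Convert2.pl code):

```perl
use strict;
use warnings;

# A made-up comma-separated line, as it would look after chomp.
my $line = 'AA000001,Hs.1,GENE1';

# split cuts the line at every comma and returns the pieces as a list.
my @fields = split(/,/, $line);

# Array indexes start at 0, so the first item is $fields[0].
my $first = $fields[0];
print "$first\n";
```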
Convert3.pl: Like Convert2.pl, but Prints Changed Order of Columns
Run program
Q: how would you avoid printing the title line in the input file?
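One possible answer to the question above, as a minimal sketch: count the lines and skip the first one. The column names, their order and the data are assumptions made for illustration, not the actual layout of the class files.

```perl
use strict;
use warnings;

# Made-up input: a title line followed by two data lines.
my @input = (
    'GenBank ID,Cluster ID,Gene symbol',
    'AA000001,Hs.1,GENE1',
    'AA000002,Hs.2,GENE2',
);

my @output;
my $line_number = 0;
foreach my $line (@input) {
    $line_number++;
    next if $line_number == 1;    # skip the title line

    # Split into named columns, then join them back in a different order.
    my ($genbank, $cluster, $symbol) = split(/,/, $line);
    push @output, join(',', $symbol, $cluster, $genbank);
}

print "$_\n" for @output;
```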
Convert4.pl: Like Convert3.pl, but Removes “.” in Cluster IDs
Run program
Introducing the match and substitute operators:
  Matching: ‘/something/’
  Substituting: ‘s/something1/something2/’
  Used in regular expressions for text matching (more later)
Introducing the tab escape sequence: “\t”
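A small sketch of matching, substituting and the tab character, using a made-up cluster ID (this is an illustration, not the Convert4.pl source):

```perl
use strict;
use warnings;

my $cluster = 'Hs.12345';    # made-up cluster ID

# Matching: does the ID contain a period? The period must be escaped,
# because an unescaped "." matches any character.
my $has_dot = ($cluster =~ /\./) ? 1 : 0;

# Substituting: copy the ID, then replace the first period with nothing.
my $clean = $cluster;
$clean =~ s/\.//;

# "\t" inside double quotes is a tab character, handy as a column separator.
my $row = join("\t", 'GENE1', $clean);
print "$row\n";
```

To remove every period rather than only the first, the substitution would take the g modifier: s/\.//g.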
Convert5.pl: Like Convert3.pl, but with Smarts + Prints More Elements
Run program
Introducing “regular expressions”
Q: how would you modify this code to print only when a “Gene: Gene Symbol” was found?
→ tip: use the matching operator: if (not($var =~ /something/)) { do something }
→ Try doing it: 10 min
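The tip can be illustrated with a minimal sketch. The list of gene symbols is made up, and treating an empty string as “no symbol found” is an assumption; the real Convert5.pl data may look different.

```perl
use strict;
use warnings;

# Made-up gene symbols; the empty string stands for "nothing was found".
my @symbols = ('GENE1', '', 'GENE2');

my @found;
foreach my $symbol (@symbols) {
    # /\w/ matches if the string holds at least one letter, digit or underscore.
    if (not($symbol =~ /\w/)) {
        next;                 # no symbol: skip this entry
    }
    push @found, $symbol;
    print "Gene: $symbol\n";
}
```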
More on Regular Expressions
Very powerful, i.e., flexible, fast
Complicated topic
  Can require lots of trial and error to get it right
  Quick reference card essential
Best comprehensive resource (Friedl, 2006)
  Covers more than Perl
BREAK
Part 2: Practical examples of programs that alter file contents using regular expressions
Regular Expressions: More Examples
The example we’ll use: Extracting clone IDs for CDH5 by…
1. Importing SOURCE results directly into Excel
2. Parsing the .csv version of that file (CDH5Clones.csv)
Processing EST IDs from SOURCE
Input: CDH5Clones.csv or CDH5Clones.xls
Clone1.pl: Filtering of Results
What it does:
  Reads .csv file of SOURCE results
  Finds all clones from PLACE library
  Returns list in single column form
Run the program
Why the error?
Clone2.pl: Numerical Filtering of Results
Problem: Suppose you only want clones with IDs >= because you already have clones with ID < ?
Solution: Check numerical value of clone ID and decide whether to retain it or not.
→ Run program!
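A minimal sketch of this kind of numerical filter follows. The cutoff value, the clone IDs and the assumed ID format (the library name followed by digits) are all hypothetical, chosen only to show the technique; they are not taken from Clone2.pl or from the SOURCE results.

```perl
use strict;
use warnings;

my $min_id = 2000000;    # hypothetical cutoff, not the value used in class

# Made-up clone IDs in an assumed "PLACE" + digits format.
my @clones = ('PLACE1000001', 'PLACE2000005', 'PLACE3000010');

my @kept;
foreach my $clone (@clones) {
    # The parentheses capture the numeric part of the ID into $1.
    if ($clone =~ /^PLACE(\d+)$/) {
        my $number = $1;
        # >= compares as numbers; ge would compare as text.
        push @kept, $clone if $number >= $min_id;
    }
}

print "$_\n" for @kept;
```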
Part 3: Back to “Object Programming”
Understanding Enough Object Programming to be Dangerous
Three concepts:
1. Objects
2. Methods
3. Classes
Tisdall, 2003
“The key idea of OO programming is that all data is stored and modified with special data structures called objects, and each kind of object can be accessed only by its defined subroutines called methods. The user of an OO class is typically spared the effort of directly manipulating data, and can use class methods for this instead”, Tisdall, 2003.
Understanding Objects
Object = Collection of data that logically belongs together.
E.g., a “genome” object has parts (“attributes”) such as…
  Name of the species
  Genomic sequence
  List of genes, associated with their list of exons
  Start and end points for each exon
A type of object (e.g., genome object) is called a class
  All objects derive from a class
Understanding Methods
A method is just like a subroutine, but these subroutines are associated specifically with a class
Each type of object has one or more methods that it can call, and only those methods
→ The only way to access the data in an object is via the methods defined for that class.
E.g., a genome object might have …
  A compare method, for whole-genome comparisons
  A list-gene-families method, for listing all gene families known to exist in a genome
  A GC-percent function, for calculating %GC in specific areas of the genome, or all of it.
Understanding Classes
Class = object definition + the collection of methods defined for that object.
A specific object (e.g. a genome object for H. sapiens) is called an instance of a class.
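The three concepts can be put together in a toy example. The Genome class below is hypothetical and written only for illustration (it is not part of BioPerl or of the class materials); it has two attributes and a GC-percent method, and the sequence is made up.

```perl
use strict;
use warnings;

# The class: a package that defines how objects are built and what they can do.
package Genome;

# Constructor: builds one object (an instance) holding its own data.
sub new {
    my ($class, %args) = @_;
    my $self = {
        species  => $args{species},
        sequence => $args{sequence},
    };
    return bless $self, $class;
}

# Method: returns the species name stored in the object.
sub species {
    my ($self) = @_;
    return $self->{species};
}

# Method: percentage of G and C bases over the whole sequence.
sub gc_percent {
    my ($self) = @_;
    my $seq = $self->{sequence};
    return 0 if length($seq) == 0;
    my $gc = ($seq =~ tr/GCgc//);    # tr counts matching characters
    return 100 * $gc / length($seq);
}

package main;

# An instance of the class, with a made-up six-base sequence.
my $genome = Genome->new(species => 'H. sapiens', sequence => 'ATGCGC');
printf "%s: %.1f%% GC\n", $genome->species, $genome->gc_percent;
```

The calling code never touches the stored sequence directly; it only asks the object for results through its methods, which is the encapsulation described in the Tisdall quote.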
ExcelClone2.pl: Doing the Same Thing as Clone2.pl, But Using Data From an Excel File and with OO
Uses the Spreadsheet::BasicRead module
Program structure: A loop within a loop
  Iterates over every worksheet cell that contains data
  Prints the content of cells only if it meets our conditions
How ExcelClone2.pl Uses Object Functionality
Creates an object of type Spreadsheet
Accesses the getNextRow function associated with this object
Accesses the cellValue function associated with this object
Q: So Why Object Programming?
A: Because it encapsulates functionality
→ fastest way to develop with minimal coding
You just need to know:
1. That the functionality exists
2. How to call it
BioPerl: An Example of OO Perl Code Valuable for Biological Research
BioPerl: Overview
BioPerl = >1,000 modules divided into 7 packages
Not all packages in v1.4…
→ but v1.4 = latest stable release
BioPerl: You Have A Friend In High Places
The big deal: BioPerl provides “objects” for various types of sequence data and their associated features and annotations.
These objects provide interfaces for
  analysis of these sequences with a wide variety of external programs (BLAST, FASTA, clustalw and EMBOSS to name just a few).
  various types of databases for storage and retrieval of sequences
    remote (GenBank, EMBL etc.)
    local (MySQL, flat files, GFF etc.).
Other, Non-BioPerl Modules
Key BioPerl Links
BioPerl 1.4 installed as part of Perl (what you downloaded)
BioPerl home:
  Lots of examples
In Closing: Suggestions
Modify the programs provided here
  Baby steps…
Save often
  Keep lots of prior versions so you can recover from your mistakes
SU provides lots of documentation → use it!
Get a quick reference card if you value your neurons
Google is invaluable
Class Survey