Lane Medical Library & Knowledge Management Center
Perl Programming for Biologists
SESSION 2: Tue Feb 10th 2009
Yannick Pouliot, PhD
Bioresearch Informationist
Lane Medical Library & Knowledge Management Center
© 2008 The Board of Trustees of The Leland Stanford Junior University
Prep
  Log into WebEx session (stanford.webex.com/Meetings)
  Please download all class materials for 2nd class from FAQ at in a directory
  Open a command window and cd to that directory
  Start Open Perl IDE or Mac equivalent
Reminder: Cautions
  All examples pertain to MS Office 2003
  From MS Office 2007, save files in 2003 format to use the Perl code described here.
  All contents pertain to Perl 5.x, not 6.x
Session #2 Focus
  1. Understanding key Perl language elements
       Scrutinizing several variant programs
  2. Altering the contents of text files
  And remember: Ask QUESTIONS
Recap from Session 1
Recap
  Questions from last session? → Stomp the teacher!
Reviewing Simple1.pl
  Understanding what each element does

    #!C:\Perl\bin\perl
    #
    # Simple1
    #
    use strict;
    use warnings;
    #
    sub Multiply {
        my $f1 = shift;    # first argument passed to the subroutine
        my $f2 = shift;    # second argument
        return ($f1 * $f2);
    }
    #
    # main
    print "Let's test Perl \n";
    my $TempVar = 0;
    # The declaration of @InputNumbers is not shown on the slide.
    # The values below are placeholders so the program runs under "use strict".
    my @InputNumbers = (2, 3);
    print "The two numbers are: $InputNumbers[0] and $InputNumbers[1] \n";
    my $Result = Multiply($InputNumbers[0],$InputNumbers[1]);
    print "Here's the value of both numbers multiplied: $Result \n";
    print "I'm done! \n";
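A minimal sketch of how the arguments reach Multiply, since that is the least obvious element of Simple1.pl. Inside a subroutine the arguments arrive in the array @_, and shift with no argument removes and returns the first element of @_. The second subroutine, MultiplyList, is a hypothetical variant added here for comparison only; it is not part of the class materials.

```perl
use strict;
use warnings;

# Same approach as Simple1.pl: take the arguments off @_ one at a time.
sub Multiply {
    my $f1 = shift;    # removes and returns the first element of @_
    my $f2 = shift;    # now removes what was the second element
    return ($f1 * $f2);
}

# Hypothetical variant: copy both arguments from @_ in one list assignment.
sub MultiplyList {
    my ($f1, $f2) = @_;
    return ($f1 * $f2);
}

print Multiply(6, 7), "\n";        # prints 42
print MultiplyList(6, 7), "\n";    # prints 42
```

Both forms give the same result; the list assignment is often easier to read when a subroutine takes several arguments.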
Simple2.pl: Introducing New Language Elements
  → let’s look at it using Open Perl IDE and XXX
A Final Example: Biologically Useful Perl Program
  What it does:
  1. Reads input from an Excel worksheet containing public identifiers for DNA sequences associated with genes
  2. Uses Entrez Utilities provided by NCBI to retrieve:
       UniGene cluster ID
       UniGene Gene symbol
       NCBI Gene ID
  3. Writes the result into another Excel worksheet
  Features a mix of procedural and object programming
  Relevant links: Entrez Utilities
What Excel3.pl does:
Let’s Run Excel3.pl
  Type “perl -f Excel3.pl” in the directory where you installed the demonstration programs
Polling Time: How’s the speed?
  1. Too fast
  2. Too slow
  3. More or less OK
  4. I feel nauseous
Moving On: Altering file contents
Converting Data Stored in Flatfiles
  Input: ConvertOuput.csv = renamed file generated by Excel3.pl, converted to csv format
  Let’s look at and run Convert1.pl → Convert5.pl
Convert1.pl
  Structure of program
  Run program
  Exercise: what is chomp?
  Understanding file handles
  What is $_ ?
  Create an error: uncomment line 22 and run
  Introducing the escape character: “\”
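The elements listed for Convert1.pl can be sketched together in a few lines. This is not the Convert1.pl from the class materials: the sample data, the column names, and the subroutine read_lines are all assumptions made for illustration, and the file handle is opened on an in-memory string so the sketch runs without any input file.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for a .csv file on disk.
my $data = "Accession,Cluster ID,Gene Symbol\nAA000001,Hs.1234,CDH5\n";

# Read every line from an open file handle; return the lines without newlines.
sub read_lines {
    my ($fh) = @_;
    my @lines;
    while (<$fh>) {         # each line read from the handle is placed in $_
        chomp;              # with no argument, chomp removes the newline from $_
        push @lines, $_;
    }
    return @lines;
}

# Open a file handle on the in-memory string.
# A real program would give a file name instead of \$data.
open(my $in, '<', \$data) or die "Cannot open sample data: $!";
my @lines = read_lines($in);
close($in);

# The backslash is the escape character: \" prints a literal quote, \n a newline.
print "Line: \"$_\"\n" for @lines;
```

Without chomp, each stored line would still end in a newline, so the closing quote would print on the following line.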
Convert2.pl: Like Convert1.pl, but Prints Only First Item
  Using arrays to process contents of a line
  Introducing split
  Changing directories
    Useful to segregate data files
    Need to change the path to make this work in your environment
    Note difference between Mac and Windows syntax for path names
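A short sketch of split and chdir, assuming a comma-separated line. The data line, the subroutine fields_of, and the directory name "data" are placeholders, not taken from Convert2.pl.

```perl
use strict;
use warnings;

# Split one comma-separated line into a list of fields.
# Caution: a plain split does not cope with quoted fields that contain commas.
sub fields_of {
    my ($line) = @_;
    return split /,/, $line;
}

my @fields = fields_of('AA000001,Hs.1234,CDH5');    # hypothetical data line
print "First item: $fields[0]\n";                    # array indexes start at 0

# Changing directories. 'data' is a placeholder; adjust it to your environment.
# Perl accepts forward slashes in paths on Windows as well as on Mac.
# A Windows path written with backslashes inside double quotes needs each
# backslash doubled, e.g. "C:\\data".
my $dir = 'data';
if (-d $dir) {
    chdir($dir) or die "Cannot change to $dir: $!";
}
```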
Convert3.pl: Like Convert2.pl, but Prints Changed Order of Columns
  Run program
  Q: how would you avoid printing the title line in the input file?
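One possible way to reorder columns and to skip a title line, sketched under assumptions: the three-column layout, the sample rows, and the subroutine reorder are hypothetical, and Convert3.pl may do this differently.

```perl
use strict;
use warnings;

# Return the fields of a comma-separated line in the order 3rd, 1st, 2nd.
sub reorder {
    my ($line) = @_;
    my @f = split /,/, $line;
    return join(',', @f[2, 0, 1]);    # an array slice picks fields by index
}

# Hypothetical input: a title line followed by one data line.
my @input = (
    'Accession,Cluster ID,Gene Symbol',
    'AA000001,Hs.1234,CDH5',
);

my $line_number = 0;
for my $line (@input) {
    $line_number++;
    next if $line_number == 1;    # skip the title line
    print reorder($line), "\n";   # prints CDH5,AA000001,Hs.1234
}
```

When reading from a file handle, the built-in variable $. holds the current line number and can replace the hand-made counter.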
Convert4.pl: Like Convert3.pl, but Removes “.” in Cluster IDs
  Run program
  Introducing the match and substitute operators:
    Matching: ‘/something/’
    Substituting: ‘s/something1/something2/’
    Used in regular expressions for text matching (more later)
  Introducing the tab escape sequence: “\t”
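A minimal sketch of matching, substituting, and "\t", assuming cluster IDs of the form Hs.1234. The ID, the accession, and the subroutine strip_dot are illustrative and not taken from Convert4.pl.

```perl
use strict;
use warnings;

# Remove the first "." from an ID.
sub strip_dot {
    my ($id) = @_;
    $id =~ s/\.//;    # the dot is escaped: an unescaped . matches any character
    return $id;
}

my $cluster = 'Hs.1234';    # hypothetical cluster ID

# Matching: true when the text between the slashes is found in $cluster.
if ($cluster =~ /Hs/) {
    print "Found Hs in $cluster\n";
}

# "\t" inserts a tab between the two columns of output.
print join("\t", 'AA000001', strip_dot($cluster)), "\n";
```

To remove every dot rather than only the first, the substitution takes the g modifier: s/\.//g.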
Convert5.pl: Like Convert3.pl, but with Smarts + Prints More Elements
  Run program
  Introducing “regular expressions”
  Q: how would you modify this code to print only when a “Gene: Gene Symbol” was found
    → tip: use matching operator: if (not($var =~ /something/)) { do something }
    → Try doing it: 10 min
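The matching operator from the tip can be sketched on its own. The lines and the pattern below are made up for illustration, and lines_with is a hypothetical helper; this is not the solution code for Convert5.pl.

```perl
use strict;
use warnings;

# Hypothetical lines; the real layout of the course files may differ.
my @lines = (
    'AA000001,Hs.1234,Gene Symbol: CDH5',
    'AA000002,Hs.5678,no annotation',
);

# Return only the candidates that contain the pattern.
sub lines_with {
    my ($pattern, @candidates) = @_;
    return grep { $_ =~ /$pattern/ } @candidates;
}

for my $line (@lines) {
    # not($line =~ /.../) is true when the pattern is absent.
    # $line !~ /.../ is a shorter way to write the same test.
    if (not($line =~ /Gene Symbol/)) {
        next;    # skip lines that lack the text
    }
    print "$line\n";
}
```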
More on Regular Expressions
  Very powerful, i.e., flexible and fast
  Complicated topic
    Can require lots of trial and error to get it right
    Quick reference card essential
  Best comprehensive resource
    Covers more than Perl
    Friedl, 2006
Polling Time: How’s the speed?
  1. Too fast
  2. Too slow
  3. More or less OK
  4. I feel nauseous
Part 2: Practical examples of programs that alter file contents using regular expressions
Regular Expressions: More Examples
  The example we’ll use: Extracting clone IDs for CDH5 by…
  1. Importing SOURCE results directly into Excel
  2. Parsing the .csv version of that file (CDH5Clones.csv)
Processing EST IDs from SOURCE
  Input: CDH5Clones.csv or CDH5Clones.xls
Clone1.pl: Filtering of Results
  What it does:
    Reads .csv file of SOURCE results
    Finds all clones from PLACE library
    Returns list in single column form
  Run the program
  Why the error?
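A sketch of this kind of filtering, under stated assumptions: it supposes that clone IDs from the PLACE library appear as the word PLACE followed by digits. The sample rows, that ID format, and the subroutine place_clones are hypothetical; the actual layout of CDH5Clones.csv and the logic of Clone1.pl may differ.

```perl
use strict;
use warnings;

# Return the PLACE clone IDs found in a list of lines, one ID per matching line.
sub place_clones {
    my (@lines) = @_;
    my @hits;
    for my $line (@lines) {
        # The parentheses capture the matched ID into $1.
        push @hits, $1 if $line =~ /\b(PLACE\d+)\b/;
    }
    return @hits;
}

# Hypothetical rows standing in for the SOURCE results.
my @rows = (
    'IMAGE:12345,other library',
    'PLACE1001234,PLACE',
    'PLACE2000567,PLACE',
);

# Single-column output: one clone ID per line.
print "$_\n" for place_clones(@rows);
```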
Clone2.pl: Numerical Filtering of Results
  Problem: Suppose you only want clones with IDs >= because you already have clones with ID< ?
  Solution: Check numerical value of clone ID and decide whether to retain it or not.
  → Run program!
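The solution described above can be sketched as follows. The cutoff of 2000000, the sample IDs, and the subroutine clones_at_or_above are placeholders chosen for illustration, not values or code from Clone2.pl.

```perl
use strict;
use warnings;

# Keep the IDs whose trailing number is greater than or equal to $min.
sub clones_at_or_above {
    my ($min, @ids) = @_;
    my @kept;
    for my $id (@ids) {
        # (\d+)$ captures the digits at the end of the ID into $1;
        # >= then compares that text as a number.
        if ($id =~ /(\d+)$/ && $1 >= $min) {
            push @kept, $id;
        }
    }
    return @kept;
}

# Hypothetical clone IDs and cutoff.
my @ids = ('PLACE1001234', 'PLACE2000000', 'PLACE2000567');
print "$_\n" for clones_at_or_above(2000000, @ids);
```

Note that == and >= compare numbers, while eq and ge compare strings; using the string operators on clone numbers of different lengths would give the wrong order.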