1 Lane Medical Library & Knowledge Management Center http://lane.stanford.edu
Perl Programming for Biologists, PART 2: Tue Feb 12th 2008
Yannick Pouliot, PhD, Bioresearch Informationist, Lane Medical Library & Knowledge Management Center
© 2008 The Board of Trustees of The Leland Stanford Junior University

2 Prep
- Log into WebEx session (stanford.webex.com/Meetings)
- Please download all class materials for the 2nd class from the FAQ at http://lane.stanford.edu/howto/index.html?id=_3098

3 Class Focus for Session #2
1. Altering file contents from text files
2. Altering file contents from Excel files
And remember: Ask LOTS OF QUESTIONS

4 Reminder: Cautions
- All examples pertain to MS Office 2003
  - Unclear what is to be expected for MS Office 2007
- All contents pertain to Perl 5.x, not 6.x
  - V.5 and 6 are NOT compatible
  - V.5 is far more common, so not much of an issue

5 Questions from last session? → stump the teacher!

6 Preliminaries: A Biologically Useful Perl Program
… to produce the data to be used in this class
Let’s run Excel3.pl (briefly described last week)

7 Excel3.pl: A “Real” Program
What it does:
1. Reads input from an Excel worksheet containing public identifiers for DNA sequences associated with genes
2. Uses Entrez Utilities provided by NCBI to retrieve:
   - UniGene cluster ID
   - UniGene gene symbol
   - NCBI Gene ID
3. Writes the result into another Excel worksheet
Features a mix of procedural and object programming
Relevant links:
- http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene&orig_db=unigene
- Entrez Utilities

8 What Excel3.pl does (figure not reproduced in this transcript)

9 Part 1: Altering file contents

10 Converting Data Stored in Flatfiles
- Input: ConvertOuput.csv (the renamed file generated by Excel3.pl)
- Let’s look at and run Convert1.pl → Convert5.pl

11 Convert1.pl
- Structure of program
- Run program
- Exercise: what is chomp?
- Understanding file handles
- What is $_ ?
- Create an error: uncomment line 22 and run
- Introducing the escape character: “\”
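The ideas on this slide (file handles, $_, chomp and the escape character) can be seen together in a few lines. This is only a minimal sketch and not the actual Convert1.pl: the column layout and the data values are hypothetical, and the file handle reads from a string so the example runs without the class files.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the contents of the input .csv file.
my $data = "Accession,Cluster ID,Gene Symbol\nAA001,Hs.76206,CDH5\n";

# A file handle is the program's connection to a file.
# With a real file you would write something like:
#   open(my $in, '<', 'ConvertOuput.csv') or die "Cannot open file: $!";
open(my $in, '<', \$data) or die "Cannot open in-memory data: $!";

my @lines;
while (<$in>) {      # each line read is placed in the default variable $_
    chomp;           # with no argument, chomp removes the trailing newline from $_
    push @lines, $_;
    # \" is an escaped double quote; \n is the escape sequence for a newline
    print "Read: \"$_\"\n";
}
close($in);
```

Removing the chomp line makes each stored line keep its newline, which is a quick way to see what chomp does.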

12 Convert2.pl: Like Convert1.pl, but Prints Only First Item
- Using arrays to process contents of a line
  - Introducing split
- Changing directories
  - Useful to segregate data files
  - Need to change the path to make this work in your environment
- Note difference between Mac and Windows syntax for path names
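A rough sketch of split and chdir working together is shown here. It is not the Convert2.pl source: the input line is made up, and the system temporary directory stands in for your own data directory so that the example runs on any machine.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical data directory. In your environment this might look like
#   'C:/PerlClass/data'          (Windows; forward slashes work in Perl)
#   "C:\\PerlClass\\data"        (Windows; backslashes must be escaped in double quotes)
#   '/Users/you/PerlClass/data'  (Mac)
my $data_dir = File::Spec->tmpdir();
chdir($data_dir) or die "Cannot change to $data_dir: $!";

my $line   = "AA001,Hs.76206,CDH5";   # hypothetical input line
my @fields = split(/,/, $line);       # break the line into an array at every comma
my $first  = $fields[0];              # array positions are counted from 0

print "$first\n";
```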

13 Convert3.pl: Like Convert2.pl, but Prints Changed Order of Columns
- Run program
- Q: how would you avoid printing the title line in the input file?
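One possible answer to the question, sketched under the assumption that the title line is always the first line of the input. The column names and values are hypothetical and the real Convert3.pl may order its columns differently.

```perl
use strict;
use warnings;

# Hypothetical input lines; the first one is the title line.
my @input = (
    "Accession,Cluster ID,Gene Symbol",
    "AA001,Hs.76206,CDH5",
    "AA002,Hs.1234,ABC1",
);

my @output;
my $line_number = 0;
foreach my $line (@input) {
    $line_number++;
    next if $line_number == 1;    # skip the title line

    my ($accession, $cluster, $symbol) = split(/,/, $line);
    # Rebuild the line with the columns in a new order.
    push @output, join(",", $symbol, $accession, $cluster);
}

print "$_\n" for @output;
```

Counting lines is only one option; when reading from a file handle, Perl's built-in variable $. holds the current line number and can be tested instead.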

14 Convert4.pl: Like Convert3.pl, but Removes “.” in Cluster IDs
- Run program
- Introducing the match and substitute operators:
  - Matching: ‘/something/’
  - Substituting: ‘s/something1/something2/’
  - Used in regular expressions for text matching (more later)
- Introducing the tab escape sequence: “\t”
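The match and substitute operators can be tried on a single cluster ID. This is a small sketch with a made-up ID, not the Convert4.pl code. Note that the dot must be escaped, because an unescaped “.” in a regular expression matches any character.

```perl
use strict;
use warnings;

my $cluster = "Hs.76206";    # hypothetical UniGene-style cluster ID

# Matching: does the ID contain a literal dot?
my $had_dot = ($cluster =~ /\./) ? 1 : 0;

# Substituting: copy the ID, then replace the first dot with nothing.
(my $clean = $cluster) =~ s/\.//;

# "\t" inserts a tab between the two columns.
my $row = join("\t", "CDH5", $clean);
print "$row\n";
```

Adding the g modifier (s/\.//g) would remove every dot rather than only the first one.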

15 Convert5.pl: Like Convert3.pl, but with Smarts + Prints More Elements
- Run program
- Introducing “regular expressions”
- Q: how would you modify this code to print only when a “Gene: Gene Symbol” was found?
  → tip: use the matching operator: if (not($var =~ /something/)) { do something }
  → Try doing it: 10 min
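One way to approach the exercise is sketched here. It assumes, purely for illustration, that a missing gene symbol shows up as an empty field or a lone dash; the real file may mark missing symbols differently, so the pattern would need adjusting.

```perl
use strict;
use warnings;

# Hypothetical gene symbol fields; "" and "-" stand for "no symbol found".
my @symbols = ("CDH5", "", "TP53", "-");

my @kept;
foreach my $symbol (@symbols) {
    # Skip the field when it does not contain at least one letter or digit.
    if (not($symbol =~ /[A-Za-z0-9]/)) {
        next;
    }
    push @kept, $symbol;
    print "$symbol\n";
}
```

The test not($symbol =~ /.../) can also be written with the negated match operator: $symbol !~ /.../.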

16 More on Regular Expressions
- Very powerful
  - i.e., flexible, fast
- Complicated topic
  - Can require lots of trial and error to get it right
  - Quick reference card essential
  - Best comprehensive resource: Friedl, 2006 (covers more than Perl)

17 BREAK

18 Part 2: Practical examples of programs that alter file contents using regular expressions

19 Regular Expressions: More Examples
The example we’ll use: Extracting clone IDs for CDH5 by…
1. Importing SOURCE results directly into Excel
2. Parsing the .csv version of that file (CDH5Clones.csv)

20 Processing EST IDs from SOURCE
Input: CDH5Clones.csv or CDH5Clones.xls

21 Clone1.pl: Filtering of Results
What it does:
- Reads .csv file of SOURCE results
- Finds all clones from PLACE library
- Returns list in single column form
Run the program
Why the error?

22 Clone2.pl: Numerical Filtering of Results
Problem: Suppose you only want clones with IDs >= 7002000 because you already have the clones with IDs < 7002000.
Solution: Check numerical value of clone ID and decide whether to retain it or not.
→ Run program!
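The library filter from Clone1.pl and the numerical filter from Clone2.pl can be combined in a short sketch. The clone IDs below are invented, and the assumption that a PLACE clone ID is the word PLACE followed directly by digits is a guess about the format, not something taken from CDH5Clones.csv.

```perl
use strict;
use warnings;

# Hypothetical clone IDs.
my @clones = ("IMAGE:1234567", "PLACE7001999", "PLACE7002000", "PLACE7004321");

my @wanted;
foreach my $clone (@clones) {
    # Keep only PLACE clones; the parentheses capture the digits into $1.
    if ($clone =~ /^PLACE(\d+)$/) {
        my $number = $1;
        # Numerical comparison (>=), not string comparison (ge).
        push @wanted, $clone if $number >= 7002000;
    }
}

print "$_\n" for @wanted;    # one clone ID per line, i.e. single column form
```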

23 Part 3: Back to “Object Programming”

24 Understanding Enough Object Programming to be Dangerous
Three concepts:
1. Objects
2. Methods
3. Classes
Tisdall, 2003

25 “The key idea of OO programming is that all data is stored and modified with special data structures called objects, and each kind of object can be accessed only by its defined subroutines called methods. The user of an OO class is typically spared the effort of directly manipulating data, and can use class methods for this instead”, Tisdall, 2003.

26 Understanding Objects
- Object = Collection of data that logically belongs together.
  - E.g., a “genome” object has parts (“attributes”) such as…
    - Name of the species
    - Genomic sequence
    - List of genes, associated with their list of exons
    - Start and end points for each exon
- A type of object (e.g., genome object) is called a class
  - All objects derive from a class

27 Understanding Methods
- A method is just like a subroutine, but one that is associated specifically with a class
- Each type of object has one or more methods that it can call, and only those methods
  → The only way to access the data in an object is via the methods defined for that class.
- E.g., a genome object might have…
  - A compare method, for whole-genome comparisons
  - A list-gene-families method, for listing all gene families known to exist in a genome
  - A GC-percent method, for calculating %GC in specific areas of the genome, or all of it.

28 Understanding Classes
- Class = object definition + the collection of methods defined for that type of object.
- A specific object (e.g., a genome object for H. sapiens) is called an instance of a class.
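To make the three concepts concrete, here is a minimal sketch of a hypothetical Genome class in Perl 5. The class name, its attributes and its methods are invented for illustration (loosely following the genome example on the previous slides) and do not come from any existing module.

```perl
use strict;
use warnings;

package Genome;    # the class: object definition + its methods

# Constructor: builds a new object (an instance) from named arguments.
sub new {
    my ($class, %args) = @_;
    my $self = {
        species  => $args{species},
        sequence => $args{sequence},
    };
    return bless $self, $class;    # bless ties the data to the class
}

# Method: returns the species name stored in the object.
sub species {
    my ($self) = @_;
    return $self->{species};
}

# Method: percentage of G and C bases in the stored sequence.
sub gc_percent {
    my ($self) = @_;
    my $seq = uc $self->{sequence};
    return 0 unless length $seq;       # avoid dividing by zero
    my $gc = ($seq =~ tr/GC//);        # tr returns the number of matching characters
    return 100 * $gc / length($seq);
}

package main;

# An instance of the class: a genome object for H. sapiens (toy sequence).
my $human = Genome->new(species => 'H. sapiens', sequence => 'ATGCGC');
printf "%s: %.1f%% GC\n", $human->species, $human->gc_percent;
```

In Perl 5, going through methods to reach an object's data is a convention rather than something the language enforces, so code can still reach into $human->{sequence} directly; well-behaved programs do not.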

29 ExcelClone2.pl: Doing the Same Thing as Clone2.pl, But Using Data From an Excel File and with OO
- Uses the Spreadsheet::BasicRead module
- Program structure:
  - A loop within a loop
  - Iterates over every worksheet cell that contains data
  - Prints the content of cells only if it meets our conditions

30 How ExcelClone2.pl Uses Object Functionality
- Creates an object of type Spreadsheet
- Calls the getNextRow method associated with this object
- Calls the cellValue method associated with this object
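The calling pattern can be sketched without Excel at all. The MiniSheet class below is a hypothetical stand-in written for this example, NOT Spreadsheet::BasicRead itself: it borrows the method names getNextRow and cellValue from the slide, but the arguments and return values shown are assumptions, and the data is invented.

```perl
use strict;
use warnings;

package MiniSheet;    # hypothetical stand-in for a spreadsheet class

sub new {
    my ($class, @rows) = @_;
    return bless { rows => [@rows], next => 0 }, $class;
}

# Returns the next row as an array reference, or undef when no rows remain.
sub getNextRow {
    my ($self) = @_;
    return undef if $self->{next} >= @{ $self->{rows} };
    return $self->{rows}[ $self->{next}++ ];
}

# Returns the content of one cell, addressed by zero-based row and column.
sub cellValue {
    my ($self, $row, $col) = @_;
    return $self->{rows}[$row][$col];
}

package main;

# Create the object (invented worksheet contents).
my $sheet = MiniSheet->new(
    [ 'Clone ID',     'Library' ],
    [ 'PLACE7001999', 'PLACE'   ],
    [ 'PLACE7004321', 'PLACE'   ],
);

my @hits;
while (my $row = $sheet->getNextRow()) {    # outer loop: every row
    foreach my $cell (@$row) {              # inner loop: every cell in the row
        next unless defined $cell;          # ignore empty cells
        # Keep the cell only if it meets our conditions.
        if ($cell =~ /^PLACE(\d+)$/ and $1 >= 7002000) {
            push @hits, $cell;
        }
    }
}

print "$_\n" for @hits;
print "Header cell: ", $sheet->cellValue(0, 0), "\n";
```

The main program never touches the stored rows directly; it only needs to know that the methods exist and how to call them.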

31 Q: So Why Object Programming?
A: Because it encapsulates functionality → fastest way to develop with minimal coding
You just need to know:
1. That the functionality exists
2. How to call it

32 BioPerl: An Example of OO Perl Code Valuable for Biological Research

33 BioPerl: Overview
- BioPerl = >1,000 modules divided into 7 packages
  - Not all packages in v1.4… → but v1.4 = latest stable release

34 BioPerl: You Have A Friend In High Places
The big deal: BioPerl provides “objects” for various types of sequence data and their associated features and annotations.
These objects provide interfaces for:
- analysis of these sequences with a wide variety of external programs (BLAST, FASTA, clustalw and EMBOSS to name just a few)
- various types of databases for storage and retrieval of sequences
  - remote (GenBank, EMBL etc.)
  - local (MySQL, flat files, GFF etc.)

35 Other, Non-BioPerl Modules

36 Key BioPerl Links
- BioPerl 1.4 installed as part of Perl 5.8.8.822 (what you downloaded)
- BioPerl home: http://www.bioperl.org/wiki/Main_Page
- http://www.bioperl.org/wiki/Getting_Started
  - Lots of examples

37 In Closing: Suggestions
- Modify the programs provided here
  - Baby steps…
- Save often
- Keep lots of prior versions so you can recover from your mistakes
- SU provides lots of documentation → use it!
- Get a quick reference card if you value your neurons
- Google is invaluable

38 Class Survey
http://www.surveymk.com/s.aspx?sm=qw_2f5lcqrZdySrbHk2BnYeg_3d_3d

