Lane Medical Library & Knowledge Management Center: Perl Programming for Biologists, Part 2 (Tue Feb 12th 2008), Yannick Pouliot


Lane Medical Library & Knowledge Management Center
Perl Programming for Biologists, Part 2: Tue Feb 12th 2008
Yannick Pouliot, PhD, Bioresearch Informationist, Lane Medical Library & Knowledge Management Center
© 2008 The Board of Trustees of The Leland Stanford Junior University

Prep
- Log into the WebEx session (stanford.webex.com/Meetings)
- Please download all class materials for the 2nd class from the FAQ at

Class Focus for Session #2
1. Altering file contents from text files
2. Altering file contents from Excel files
And remember: Ask LOTS OF QUESTIONS

Reminder: Cautions
- All examples pertain to MS Office 2003
  - Unclear what to expect for MS Office 2007
- All contents pertain to Perl 5.x, not 6.x
  - Versions 5 and 6 are NOT compatible
  - Version 5 is far more common, so this is not much of an issue

Questions from last session?
→ stomp the teacher!

Preliminaries: A Biologically Useful Perl Program
- … to produce the data to be used in this class
- Let’s run Excel3.pl (briefly described last week)

Excel3.pl: A “Real” Program
What it does:
1. Reads input from an Excel worksheet containing public identifiers for DNA sequences associated with genes
2. Uses Entrez Utilities provided by NCBI to retrieve:
   - UniGene cluster ID
   - UniGene gene symbol
   - NCBI Gene ID
3. Writes the result into another Excel worksheet
Features a mix of procedural and object programming
Relevant links:
- gene
- Entrez Utilities

What Excel3.pl does:

Part 1: Altering file contents

Converting Data Stored in Flatfiles
- Input: ConvertOuput.csv (the renamed file generated by Excel3.pl)
- Let’s look at and run Convert1.pl → Convert5.pl

Convert1.pl
- Structure of program
- Run program
- Exercise: what is chomp?
- Understanding file handles
- What is $_ ?
- Create an error: uncomment line 22 and run
- Introducing the escape character: “\”
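
A minimal sketch of the ideas on this slide: a file handle, $_, chomp and the escape character. This is not the class's Convert1.pl. The sample rows and the helper name read_lines are made up for illustration, and the data is read from an in-memory string so the example runs without any input file.

```perl
use strict;
use warnings;

# Invented sample data standing in for a small CSV file.
my $data = "ID,Cluster,Symbol\nACC0001,Hs.1111,GENEA\nACC0002,Hs.2222,GENEB\n";

# read_lines (hypothetical helper): read every line through a file handle.
sub read_lines {
    my ($text) = @_;
    # Opening a reference to a string gives a file handle onto that string.
    open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
    my @lines;
    while (<$fh>) {      # each line that is read is placed in $_
        chomp;           # with no argument, chomp strips the newline from $_
        push @lines, $_;
    }
    close($fh);
    return @lines;
}

my @lines = read_lines($data);
# \" is an escaped quote: the backslash stops it from ending the string.
print "Line: \"$_\"\n" for @lines;
```

To read a real file instead, the second argument to open would be a file name rather than a reference to a string.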

Convert2.pl: Like Convert1.pl, but Prints Only First Item
- Using arrays to process contents of a line
  - Introducing split
- Changing directories
  - Useful to segregate data files
  - Need to change the path to make this work in your environment
- Note difference between Mac and Windows syntax for path names
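
A short sketch of split and arrays, assuming a simple comma-separated line with no quoted fields. The helper name first_field and the sample values are invented; the class's Convert2.pl may be organized differently.

```perl
use strict;
use warnings;

# first_field (hypothetical helper): split a comma-separated line into an
# array and return its first item.
sub first_field {
    my ($line) = @_;
    my @fields = split(/,/, $line);   # split returns a list of the pieces
    return $fields[0];                # array indexes start at 0
}

print first_field('ACC0001,Hs.1111,GENEA,101'), "\n";
```

A plain split on commas breaks if a field itself contains a comma inside quotes; a dedicated CSV module such as Text::CSV from CPAN is the usual choice for such files.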

Convert3.pl: Like Convert2.pl, but Prints Changed Order of Columns
- Run program
- Q: how would you avoid printing the title line in the input file?
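
One possible way to reorder columns and skip the title line, sketched under the assumption that the title is always the first line of the input. The helper name reorder_columns, the column order and the sample rows are all invented for illustration.

```perl
use strict;
use warnings;

# reorder_columns (hypothetical helper): drop the title line and return each
# remaining line with its columns in a new order.
sub reorder_columns {
    my (@lines) = @_;
    my @out;
    my $line_number = 0;
    for my $line (@lines) {
        $line_number++;
        next if $line_number == 1;          # skip the title line
        my @f = split(/,/, $line);
        push @out, join(',', @f[2, 0, 1]);  # array slice: third column first
    }
    return @out;
}

my @rows = ('ID,Cluster,Symbol', 'ACC0001,Hs.1111,GENEA', 'ACC0002,Hs.2222,GENEB');
print "$_\n" for reorder_columns(@rows);
```

Counting lines is only one option; testing whether the line matches the known title text would work as well.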

Convert4.pl: Like Convert3.pl, but Removes “.” in Cluster IDs
- Run program
- Introducing the match and substitute operators:
  - Matching: ‘/something/’
  - Substituting: ‘s/something1/something2/’
  - Used with regular expressions for text matching (more later)
- Introducing the tab escape sequence: “\t”
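
A small sketch of matching, substituting and “\t”, assuming cluster IDs of the form “Hs.” followed by digits. The helper name strip_dot and the ID shown are invented and not taken from Convert4.pl.

```perl
use strict;
use warnings;

# strip_dot (hypothetical helper): remove the first "." from a cluster ID.
sub strip_dot {
    my ($id) = @_;
    # The dot is escaped (\.) because an unescaped . matches any character.
    $id =~ s/\.//;
    return $id;
}

my $cluster = 'Hs.1111';                                # invented cluster ID
print "has a dot\n" if $cluster =~ /\./;                # matching: /something/
print join("\t", 'GENEA', strip_dot($cluster)), "\n";   # "\t" is a tab
```

Adding the g modifier (s/\.//g) would remove every dot rather than only the first one.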

Convert5.pl: Like Convert3.pl, but with Smarts + Prints More Elements
- Run program
- Introducing “regular expressions”
- Q: how would you modify this code to print only when a “Gene: Gene Symbol” was found?
  → tip: use the matching operator: if (not($var =~ /something/)) { do something }
  → Try doing it: 10 min
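
A sketch of one way to approach the question, not the official answer. It assumes that a row without a gene symbol has an empty third column or the text “not found” there; the real marker depends on what Excel3.pl writes. The helper name rows_with_symbol and the sample rows are invented.

```perl
use strict;
use warnings;

# Assumption: a row with no gene symbol has an empty third column or the
# text "not found" there. The real marker depends on the input file.
sub rows_with_symbol {
    my (@lines) = @_;
    my @kept;
    for my $line (@lines) {
        my @f = split(/,/, $line);
        my $symbol = defined $f[2] ? $f[2] : '';
        # Same shape as the tip on the slide; $symbol !~ /.../ is equivalent.
        if (not($symbol =~ /^\s*$|not found/i)) {
            push @kept, $line;
        }
    }
    return @kept;
}

my @rows = ('ACC0001,Hs.1111,GENEA', 'ACC0002,Hs.2222,not found', 'ACC0003,Hs.3333,');
print "$_\n" for rows_with_symbol(@rows);
```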

More on Regular Expressions
- Very powerful, i.e., flexible and fast
- Complicated topic
  - Can require lots of trial and error to get it right
  - Quick reference card essential
  - Best comprehensive resource: Friedl, 2006 (covers more than Perl)

BREAK

Part 2: Practical examples of programs that alter file contents using regular expressions

Regular Expressions: More Examples
The example we’ll use: extracting clone IDs for CDH5 by…
1. Importing SOURCE results directly into Excel
2. Parsing the .csv version of that file (CDH5Clones.csv)

Processing EST IDs from SOURCE
Input: CDH5Clones.csv or CDH5Clones.xls

Clone1.pl: Filtering of Results
What it does:
- Reads .csv file of SOURCE results
- Finds all clones from PLACE library
- Returns list in single column form
Run the program
Why the error?
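
A rough sketch of this kind of filtering, assuming that PLACE clone IDs look like “PLACE” followed by digits. The IDs below are invented, the helper name place_clones is hypothetical, and the real SOURCE export may be laid out differently.

```perl
use strict;
use warnings;

# Assumption: PLACE clone IDs look like "PLACE" followed by digits.
sub place_clones {
    my (@lines) = @_;
    my @ids;
    for my $line (@lines) {
        # With /g in list context, the match returns every captured ID.
        push @ids, $line =~ /\b(PLACE\d+)\b/g;
    }
    return @ids;
}

my @lines = ('IMAGE:111111,PLACE1000001,PLACE1000002', 'IMAGE:222222,PLACE2000003');
print "$_\n" for place_clones(@lines);   # one ID per line: single column
```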

Clone2.pl: Numerical Filtering of Results
Problem: Suppose you only want clones with IDs >= because you already have clones with ID < ?
Solution: Check the numerical value of the clone ID and decide whether to retain it or not.
→ Run program!
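
A sketch of the numerical check, with the cutoff passed in as a parameter because the actual value is not given here. The helper name clones_at_least, the cutoff used in the call and the clone IDs are all invented for illustration.

```perl
use strict;
use warnings;

# clones_at_least (hypothetical helper): keep IDs whose numeric part is
# greater than or equal to $min.
sub clones_at_least {
    my ($min, @ids) = @_;
    my @kept;
    for my $id (@ids) {
        # Capture the digits, then compare with >= (numeric), not ge (string).
        if ($id =~ /(\d+)/ && $1 >= $min) {
            push @kept, $id;
        }
    }
    return @kept;
}

print "$_\n" for clones_at_least(2000000, 'PLACE1000001', 'PLACE2000003', 'PLACE3000004');
```

The numeric operator matters: compared as strings, '9' sorts after '10', so ge would give the wrong answer for IDs of different lengths.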

Part 3: Back to “Object Programming”

Understanding Enough Object Programming to be Dangerous
Three concepts:
1. Objects
2. Methods
3. Classes
Tisdall, 2003

“The key idea of OO programming is that all data is stored and modified with special data structures called objects, and each kind of object can be accessed only by its defined subroutines called methods. The user of an OO class is typically spared the effort of directly manipulating data, and can use class methods for this instead”, Tisdall, 2003.

Understanding Objects
- Object = collection of data that logically belongs together
  - E.g., a “genome” object has parts (“attributes”) such as…
    - Name of the species
    - Genomic sequence
    - List of genes, associated with their list of exons
    - Start and end points for each exon
- A type of object (e.g., genome object) is called a class
  - All objects derive from a class

Understanding Methods
- A method is just like a subroutine, but it is associated specifically with a class
- Each type of object has one or more methods that it can call, and only those methods
  → The only way to access the data in an object is via the methods defined for that class
- E.g., a genome object might have…
  - A compare method, for whole-genome comparisons
  - A list-gene-families method, for listing all gene families known to exist in a genome
  - A GC-percent method, for calculating %GC in specific areas of the genome, or all of it

Understanding Classes
- Class = object definition + the collection of methods defined for that kind of object
- A specific object (e.g., a genome object for H. sapiens) is called an instance of a class
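
The three concepts can be sketched with a toy class. The class name Genome, its attributes and its gc_percent method are invented to mirror the genome example; they are not part of BioPerl or of the class materials.

```perl
use strict;
use warnings;

package Genome;   # the class: object definition plus its methods

# Constructor: builds one object (an instance) holding its attributes.
sub new {
    my ($class, %args) = @_;
    my $self = { name => $args{name}, sequence => $args{sequence} };
    return bless $self, $class;
}

# Methods: subroutines that are called on an object of this class.
sub name {
    my ($self) = @_;
    return $self->{name};
}

sub gc_percent {
    my ($self) = @_;
    my $seq = $self->{sequence};
    return 0 unless length($seq);
    my $gc = ($seq =~ tr/GCgc//);   # tr with no replacement just counts
    return 100 * $gc / length($seq);
}

package main;

my $genome = Genome->new(name => 'toy genome', sequence => 'ATGCGC');
printf "%s: %.1f%% GC\n", $genome->name, $genome->gc_percent;
```

In Perl, going through methods is a convention rather than something the language enforces: the object here is an ordinary hash, so well-behaved code simply agrees to use name and gc_percent instead of reaching into it.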

ExcelClone2.pl: Doing the Same Thing as Clone2.pl, But Using Data From an Excel File and with OO
- Uses the Spreadsheet::BasicRead module
- Program structure:
  - A loop within a loop
  - Iterates over every worksheet cell that contains data
  - Prints the content of a cell only if it meets our conditions

How ExcelClone2.pl Uses Object Functionality
- Creates an object of type Spreadsheet
- Accesses the getNextRow function associated with this object
- Accesses the cellValue function associated with this object

Q: So Why Object Programming?
A: Because it encapsulates functionality → fastest way to develop with minimal coding
You just need to know:
1. That the functionality exists
2. How to call it

BioPerl: An Example of OO Perl Code Valuable for Biological Research

BioPerl: Overview
- BioPerl = >1,000 modules divided into 7 packages
  - Not all packages in v1.4…
  → but v1.4 = latest stable release

BioPerl: You Have A Friend In High Places
The big deal: BioPerl provides “objects” for various types of sequence data and their associated features and annotations.
- These objects provide interfaces for:
  - analysis of these sequences with a wide variety of external programs (BLAST, FASTA, clustalw and EMBOSS, to name just a few)
  - various types of databases for storage and retrieval of sequences
    - remote (GenBank, EMBL, etc.)
    - local (MySQL, flat files, GFF, etc.)

Other, Non-BioPerl Modules

Key BioPerl Links
- BioPerl 1.4 installed as part of Perl (what you downloaded)
- BioPerl home:
  - Lots of examples

In Closing: Suggestions
- Modify the programs provided here
  - Baby steps…
- Save often
- Keep lots of prior versions so you can recover from your mistakes
- SU provides lots of documentation → use it!
- Get a quick reference card if you value your neurons
- Google is invaluable

Class Survey