An Introduction to Perl Programming
Fernán Agüero
Instituto de Investigaciones Biotecnológicas, UNSAM



Where we are today

A bioinformatic experiment
An experiment at the computer is no different from an experiment at the bench:
  – You search for an answer to a specific question
  – Experiments should be reproducible by someone else using similar methods
Identify the problem
  – Which catalytic mechanism does enzyme X use?
Identify the tools to answer the question
  – Sequence similarity searches
  – Multiple sequence alignments
  – Identification of conserved motifs and domains
  – Modelling the 3D protein structure
Define criteria for success of the experiment
  – Almost all computational methods will provide a hit, or give you an answer
  – You need to distinguish noise from signal, discard false positives, etc.
  – You need to understand the programs and tools you use, their algorithmic basis, whether they rely on external data for training, etc.

A bioinformatic experiment
Identify the problem
  – You have a new genome
  – Search for kinases that are conserved in other organisms
  – Do a multiple sequence alignment of best hits
  – Search for conserved motifs
Describe the problem in computational terms
  – BLAST query sequence vs database X
  – Filter BLAST hits (Score > 100)
  – Filter BLAST hits (Annotation has the word 'kinase')
  – Store best hits in FASTA format
  – Run ClustalW on stored sequences
  – Run BLOCKS on the multiple sequence alignment
  – Manually inspect the conserved blocks
x 5000 sequences

Programming computers
Anyone with experience in the design of wet lab experiments can program a computer
An experiment in the lab begins with a question that leads you to a testable hypothesis
The experiment helps you to test/discard that hypothesis
In the computer, the programs you write should be designed to test hypotheses
Learning a programming language might seem like a daunting task, but it is similar to learning to use a new tool, or another natural language (English, French)

First steps: programming a computer to automate tasks
[Flowchart: Sequences, BLAST, "Good hit?", "Is this a kinase?", Save; the two decision boxes have Yes/No branches]
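The two decisions in the flowchart can be sketched in Perl as a simple filter. This is only an illustration: the hit list, the sub name filter_kinase_hits and the score cutoff passed to it are assumptions made for the example, and real hits would come from parsed BLAST output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical hits, one per row: [ id, score, annotation ].
# In a real run these would be parsed from BLAST output.
my @hits = (
    [ 'hit1', 250, 'serine/threonine protein kinase' ],
    [ 'hit2',  80, 'protein kinase, putative' ],
    [ 'hit3', 310, 'hypothetical protein' ],
    [ 'hit4', 120, 'Tyrosine Kinase' ],
);

# Return the ids of the hits that pass both questions in the flowchart.
sub filter_kinase_hits {
    my ($min_score, @candidates) = @_;
    my @saved;
    foreach my $hit (@candidates) {
        my ($id, $score, $annotation) = @$hit;
        next unless $score > $min_score;         # Good hit?
        next unless $annotation =~ /kinase/i;    # Is this a kinase?
        push @saved, $id;                        # Save
    }
    return @saved;
}

print join("\n", filter_kinase_hits(100, @hits)), "\n";
```

With the made-up data above, only hit1 and hit4 pass both tests.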

A bioinformatic experiment …
Select the data sets
  – In a wet lab experiment you rely on reagents. Generally you know when they were prepared, who prepared them, how they were prepared, etc.
  – In a dry lab experiment, the same type of information is essential. You need to know your data sources
      Database: When was the information compiled? What were the criteria for accepting entries into the dataset?
  – Take notes and save the output of programs in your bioinformatic experiments!

What is Perl
Perl is a programming language
  – Interpreted
  – High level
  – Dynamic
Created by Larry Wall
  – Perl is a language to get your job done!
In Perl
  – There is more than one way to do it

Why Perl
Perl has been designed with text processing in mind
  – Filter text, generate reports
Ideally suited for sequence analysis
  – All sequences are text
  – Convert formats easily with Perl
      GenBank, FASTA, ClustalW
  – Analyze and process files
      Find restriction enzyme sites, motifs, vector; trim sequences; etc.
© Lincoln Stein

A basic Perl script
    #!/usr/bin/perl
    statement;
    one long statement
        using many lines;
    exit; # optional
The first line is the absolute PATH to the Perl interpreter (on other systems it may be /usr/local/bin/perl or /opt/bin/perl)
Statements are ended by semicolons
'which perl' or 'whereis perl' will give you the PATH to your Perl
Cynthia Gibas and Per Jambeck. O'Reilly & Associates (2001)

A basic Perl program/script
A Perl program is
  – A plain text file
  – Containing statements written in the Perl language
Any text editor can be used
  – vi, vim, emacs, xemacs, nedit, gedit, pico, nano, ee
  – Textpad, PSPad (Windows)
  – BBEdit, TextWrangler (Mac OS X)
  – Remember: a word processor IS NOT a text editor
In Unix a text file can be a program (i.e. it can be executed)
  – Giving execution privileges: 'chmod +x program.pl'
  – Telling the operating system about the interpreter responsible for understanding the statements in the file: #!/usr/bin/perl

Running our Perl script
Save the file
  – Usually we use '.pl' for Perl scripts
  – program.pl
Give the file execution privileges
  – chmod +x program.pl
Execute it from the command line
  – ./program.pl

Variables
  – Allow you to store data in your program
  – And perform operations on the data
      Numeric values
      Text values
In Perl there are different types of variables
  – To store unidimensional data
      1 value
  – To store many values
      Lists
  – To store associations between values
      Key: value
      Hashes, associative arrays

Variables
Scalars ($)
  – Unidimensional
  – Can hold any type of data
      Text
      Integers
      Floating point numbers
  – They are prefixed with $
    $var = "GGATCCGGGACCAAAA";            # assign a string
    $val = 42;                            # assign a number
    ($a, $b, $c) = ("me", "my", "mine");  # assign all at once
    print $a;                             # would print "me"
    ($l, $r) = ($r, $l);                  # swap values

Variables (contd)
Arrays (or Lists)
  – In Perl, an array/list is an indexed collection of values. Values can be scalar values of any type (text, numbers)
  – The first index starts at position 0 (zero).
  – They are prefixed with @
    @list = ("juan", "jose", "fred");  # assign 3 elements
    print $list[0];                    # print first element of @list
    push @list, "roberto";             # adds string at the end
    print $list[3];                    # prints "roberto"
    $first = shift @list;              # get leftmost value
    $last = pop @list;                 # get rightmost value

Variables (contd)
Hashes (%)
  – Also called associative arrays
  – They store values in pairs
      Key => Value
  – They are prefixed with %
    %me = (
        name  => "Fernan",
        age   => 37,
        loves => "Perl",
    );                             # create a hash with 3 key/value pairs
    print $me{name};               # print value associated with 'name'
    $me{born} = "Buenos Aires";    # add a new key-value pair

Using variables
Choice of variable types gives you power
  – Use the variable type that best fits your data
Getting complex
  – You can create more complex data structures by mixing scalars, arrays and hashes.
  – Some examples:
      A hash of hashes to store sequences
    %sequences = (
        eco0001 => {
            seq  => "ATG...TGA",
            desc => "hypothetical protein"
        },
        eco0002 => {
            seq  => "ATG...TAA",
            desc => "DNA polymerase"
        },
        # ...
    );
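A minimal sketch of how such a hash of hashes might be read back. The short sequences here are made up (the slide elides the real ones), and the sub name summarize is an assumption for the example.

```perl
use strict;
use warnings;

# Made-up, shortened sequences standing in for the elided ones on the slide.
my %sequences = (
    eco0001 => { seq => "ATGAAATGA",    desc => "hypothetical protein" },
    eco0002 => { seq => "ATGCCCGGGTAA", desc => "DNA polymerase" },
);

# Build one "id<TAB>length<TAB>description" line per entry, sorted by id.
# The hash is passed by reference so it is not flattened into a list.
sub summarize {
    my ($seqs) = @_;
    my @lines;
    foreach my $id (sort keys %$seqs) {
        my $entry = $seqs->{$id};    # inner hash: { seq => ..., desc => ... }
        push @lines, join("\t", $id, length($entry->{seq}), $entry->{desc});
    }
    return @lines;
}

print "$_\n" for summarize(\%sequences);
```

The inner values are reached with two keys in a row, for example $sequences{eco0001}{desc}.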

From strings to lists and back again
Convert a string into a list of values
  – Useful when reading files exported from spreadsheets
  – E.g. from Excel, in tab- or comma-delimited format
  – @values = split( /pattern/, $string )
    $string = "Cel1980.1 ATG Hypothetical 2.54 High";
    @values = split(/ /, $string);
    # @values is now ("Cel1980.1", "ATG", "Hypothetical", "2.54", "High")
    print $values[3];                     # would print 2.54
    ($id, @values) = split(/ /, $string); # $id is "Cel1980.1"
    # @values is now ("ATG", "Hypothetical", "2.54", "High")

Convert a list of values into a string
  – Useful to generate files that can then be imported into a spreadsheet application (OpenOffice, Excel)
  – $string = join( "separator", @values )
    @values = ("Cel1980.1", "ATG", "Hypothetical", "2.54", "High");
    $string = join("||", @values);
    # $string is now "Cel1980.1||ATG||Hypothetical||2.54||High"
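Putting split and join together, a record can be converted from one delimiter to another. This is a small sketch; the sub name retab and the tab-delimited input are assumptions for the example.

```perl
use strict;
use warnings;

# Convert a tab-delimited record into one joined with another separator.
sub retab {
    my ($line, $separator) = @_;
    chomp $line;                       # drop the trailing newline, if any
    my @values = split /\t/, $line;    # string -> list
    return join($separator, @values);  # list -> string
}

my $record = "Cel1980.1\tATG\tHypothetical\t2.54\tHigh\n";
print retab($record, "||"), "\n";

# split can also peel the first field off into its own scalar.
my ($id, @rest) = split /\t/, "Cel1980.1\tATG\tHypothetical";
print "$id has ", scalar(@rest), " more fields\n";
```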

Working with files
Declare a handle and associate it with a file
Use the handle to refer to the file
Reading from files
    open(MYHANDLE, "/home/fernan/somefile.txt") or die "Cannot open file: $!";
    while ( $line = <MYHANDLE> ) {
        # read the file one line at a time, do some action on $line
    }
    close MYHANDLE;
Writing to files
  – This will overwrite the contents of the file!
    open(MYOUTPUTHANDLE, ">/home/fernan/somefile.txt") or die "Cannot open file: $!";
    print MYOUTPUTHANDLE "Hello there!";
    close MYOUTPUTHANDLE;

Working with files (contd)
Appending to files
  – Appending does not overwrite the contents of the file
    open(MYOUTPUTHANDLE, ">>/home/fernan/somefile.txt") or die "Cannot open file: $!";
    print MYOUTPUTHANDLE "Hello there!";
    close MYOUTPUTHANDLE;
Special handles
  – They are always open, and available
  – STDIN, for reading (e.g. from pipes)
  – STDOUT, for writing
  – STDERR, for writing
    while ( <STDIN> ) {
        # read from the keyboard or from a pipe
    }
    print STDOUT "MW: kDa", "\n", "pI: 9.54", "\n", "Length: 2954 aa";
    print STDERR "Warning: sequence length is zero!";
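The same read, write and append operations can be sketched with lexical filehandles and the three-argument form of open, which is the style usually recommended in current Perl. The sub names write_lines and read_lines are assumptions for the example, and a temporary file from the core File::Temp module stands in for a real path.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write lines to a file. Mode '>' overwrites the file, '>>' appends to it.
sub write_lines {
    my ($mode, $path, @lines) = @_;
    open(my $out, $mode, $path) or die "Cannot open $path: $!";
    print {$out} "$_\n" for @lines;
    close $out or die "Cannot close $path: $!";
}

# Read a file back one line at a time, without the trailing newlines.
sub read_lines {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    my @lines;
    while ( my $line = <$in> ) {
        chomp $line;
        push @lines, $line;
    }
    close $in;
    return @lines;
}

# A temporary file that is removed when the program exits.
my ($fh, $path) = tempfile(UNLINK => 1);
close $fh;

write_lines('>',  $path, "Hello there!");   # creates/overwrites
write_lines('>>', $path, "Hello again!");   # appends
print join(" | ", read_lines($path)), "\n";
```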

Operators
Assignment operators
  – = += .=
Control operators
  – && || !    logical AND, OR and NOT
  – and or not
Comparison operators
  – Numerical: < > <= >= != == <=>
  – String: lt gt le ge ne eq cmp
    $a = 1;
    $a += 2;             # $a is now 3
    $a *= 2;             # $a is now 6
    $a = "Me";
    $a .= "Myself";      # $a is now "MeMyself"
    $a .= "AndIrene";    # $a is now "MeMyselfAndIrene"
    if ( $a && $b ) { # do something }
    if ( $mw > 100 || $pi < 9 ) {... }
    if ( ! defined $c ) {... }
    if ( $a == 4 ) { # do something }
    if ( $b eq "ATG" ) {... }
    if ( $a and $b ) { # do something }
    if ( $mw > 100 or $pi < 9 ) {... }
    if ( not defined $c ) {... }

Iterations, Loops
    while (expression) { execute block }
    unless (expression) { execute block }
    do { execute block } until ( expression )
    foreach (@list) { execute block }
    for (initial; expression; increment) { execute block }
  – for ($i = 0; $i <= 100; $i = $i + 1)
      Start at zero (0), continue while $i <= 100, increment $i by one each time
Expression
  – An expression that evaluates to either true or false
Execute block
  – List of statements that will be executed in the loop or if the condition is met
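A short sketch showing three of these loop forms on small sequence tasks. The sub names and the example sequences are made up for illustration.

```perl
use strict;
use warnings;

# foreach: count each base in a DNA string.
sub count_bases {
    my ($dna) = @_;
    my %count = (A => 0, C => 0, G => 0, T => 0);
    foreach my $base (split //, uc $dna) {
        $count{$base}++ if exists $count{$base};   # ignore anything but A, C, G, T
    }
    return %count;
}

# C-style for: add up the integers from 0 to $n.
sub sum_to {
    my ($n) = @_;
    my $total = 0;
    for (my $i = 0; $i <= $n; $i = $i + 1) {
        $total += $i;
    }
    return $total;
}

# while: collect codons until a stop codon or the end of the sequence.
sub codons_until_stop {
    my ($dna) = @_;
    my @codons;
    my $pos = 0;
    while ( $pos + 3 <= length $dna ) {
        my $codon = substr($dna, $pos, 3);
        last if $codon =~ /^(TAA|TAG|TGA)$/;   # leave the loop at a stop codon
        push @codons, $codon;
        $pos += 3;
    }
    return @codons;
}

my %count = count_bases("GGATCC");
print "G: $count{G}, C: $count{C}\n";
print "Sum: ", sum_to(100), "\n";
print join(" ", codons_until_stop("ATGAAATGACCC")), "\n";
```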

Exercise 1
Read a tab-delimited file
  – InterPro results
Produce a new tab-delimited file
  – With fewer columns
  – With the columns reordered
File: lmajor_interpro.tab
Columns
  – 0 Sequence
  – 1 Checksum
  – 2 Length
  – 3 Search Method
  – 4 Match Accession
  – 5 Match Description
  – 6 Match Start
  – 7 Match End
  – 8 Match Evalue
  – 9 Match T/F
  – 10 Date
  – 11 Interpro Family Accession
  – 12 Interpro Family Description
  – 13 Gene Ontology Terms

    #!/usr/bin/perl
    open(INTERPRO, "lmajor_interpro.tab") or die "Cannot open lmajor_interpro.tab: $!";
    # read interpro data, one line at a time
    while ( $line = <INTERPRO> ) {
        chomp $line;
        # split the line into columns
        @values = split(/\t/, $line);
        $sequence = $values[0];
        $checksum = $values[1];
        $length = $values[2];
        $method = $values[3];
        # ... and so on for $accession, $description, $start and $end
        print $sequence, "\t", $method, "\t", $accession, "\t", $start, "\t", $end, "\t", $description, "\n";
    }
    close INTERPRO;
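One possible way to approach the exercise, as a sketch rather than the course's own solution: keep the wanted column indexes in a list and use an array slice. The sub name reorder_columns and the example record are made up; the column numbers follow the table in Exercise 1.

```perl
use strict;
use warnings;

# Columns to keep, in output order (0-based, from the Exercise 1 table):
# Sequence, Search Method, Match Accession, Match Start, Match End, Match Description
my @wanted = (0, 3, 4, 6, 7, 5);

# Return a tab-delimited line holding only the requested columns.
sub reorder_columns {
    my ($line, @columns) = @_;
    chomp $line;
    my @values = split /\t/, $line, -1;   # -1 keeps trailing empty fields
    # Missing columns become empty strings instead of undef.
    return join("\t", map { defined $values[$_] ? $values[$_] : '' } @columns);
}

# A made-up record with the 14 columns listed in the exercise.
my $example = join("\t",
    'seq0001', 'ABC123', 250, 'HMMPfam', 'PF00069', 'Protein kinase domain',
    12, 240, '1e-50', 'T', '20-Nov-2008', 'IPR000719', 'Protein kinase', 'GO:0004672');

print reorder_columns($example, @wanted), "\n";
```

To process the real file, the same sub could be called on each line read from lmajor_interpro.tab inside a while loop.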

Regular expressions
They are expressions that describe a family of strings
  – They can describe a literal string: GAATTC (think about Find/Replace in Word, Excel, Acrobat)
  – But the power does not lie in matching literals
They use a particular syntax
  – GAATTC
  – G[AT]+C, G[^GC]+C
      Will match GAATTC, GAAATTC, GAAATTTC, GTATATATAC
  – GA+T+C
      Will match GATC, GAATC, GAATTC, GAAATC, GAAATTTC
  – GA{2}T{2}C
      Will match GAATTC
  – GA{1,2}T{1,2}C
      Will match GATC, GAATC, GAATTC, GATTC
  – Allow mismatches: GA..TC

Regular expressions
There are some special characters
  – [ ], options for matching at a certain place in the string
  – { }, options for matching a certain number of times
  – \s a whitespace character (space, tab, newline)
  – \S anything but whitespace
  – \d a digit
  – \D anything but a digit
  – \w a word character (letter, digit or underscore)
  – \W anything but a word character
  – . match any character (except a newline)
  – + match the preceding character 1 or more times
  – * match the preceding character 0 or more times
  – ? match the preceding character 0 or 1 times
  – ( ) store the matched pattern in a variable
      (Hello)\s(Nick)\s(Thomson)
      $1 = Hello, $2 = Nick, $3 = Thomson
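A small sketch applying some of these patterns to a sequence. The sub name find_sites and the DNA string are made up for the example; @- is Perl's built-in array of match start offsets.

```perl
use strict;
use warnings;

# Return the 0-based start position of every match of $pattern in $sequence.
sub find_sites {
    my ($sequence, $pattern) = @_;
    my @positions;
    while ( $sequence =~ /$pattern/g ) {
        push @positions, $-[0];   # $-[0] is the start offset of the last match
    }
    return @positions;
}

my $dna = "TTGAATTCAAGATCGGGAATTCTT";

# A literal pattern: only exact GAATTC sites.
print join(",", find_sites($dna, 'GAATTC')), "\n";

# Quantifiers: one or two As followed by one or two Ts, so GATC also matches.
print join(",", find_sites($dna, 'GA{1,2}T{1,2}C')), "\n";

# Parentheses capture the matched text into $1, $2, $3.
if ( "Hello Nick Thomson" =~ /(Hello)\s(Nick)\s(Thomson)/ ) {
    print "$1, $2, $3\n";
}
```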

BioPerl
BioPerl is
  – A collection of Perl modules
  – That greatly simplify writing bioinformatics programs
BioPerl allows you to be lazy
  – You don't need to care about formats
      BioPerl reads FASTA, GenBank, ClustalW, BLAST...
      $seq->id, $blast->evalue, $blast->score, $blast->next_hit
    use Bio::SeqIO;
    $seqio = Bio::SeqIO->new( -file => "tcruzi.fasta", -format => "fasta" );
    while ( $seqobj = $seqio->next_seq() ) {
        $sequence = $seqobj->seq();
        $id = $seqobj->id();
    }

Using BioPerl
Read the documentation
  – Identify the module you need
  – Know the objects you will be dealing with
      Sequence objects? Alignments?
  – And the methods and functions implemented by the module
      next_seq, next_report, next_hit
      add_seq, get_seq, revcomp, write_aln
    use Bio::AlignIO;
    $io = Bio::AlignIO->new( -file => "267.aln", -format => "clustalw" );
    $aln = $io->next_aln();
    $minialn = $aln->slice(20,30);
    $newaln = $aln->remove_columns(['mismatch']);

Using BioPerl
Fast conversion of formats
    use Bio::AlignIO;
    $in = Bio::AlignIO->new( -file => "1.aln", -format => "clustalw");
    $out = Bio::AlignIO->new( -file => ">2.pfam", -format => "pfam");
    $aln = $in->next_aln;
    $out->write_aln($aln);

    use Bio::SeqIO;
    $in = Bio::SeqIO->new( -file => "1.gbk", -format => "genbank");
    $out = Bio::SeqIO->new( -file => ">2.fasta", -format => "fasta");
    $seq = $in->next_seq;
    $out->write_seq($seq);

Exercises
A multiple FASTA file
  – Contains many sequences
    >seq1 blah blah blah 1877:2980
    ATGCGGTGACCGGTTAATTACACACAGGTACAGCCATATGCCGATTT
    GACGATACGTTTAGGCTTTCACAAAGGTGAGACT
    >CDS CDS blah blah blah 2856:3245
    cgatcgctatcgcggta
    cggatatcgcgatcg
    cgtatatctcgagtc
    cgagagtatcatatg
    ctatcggagagc
    cgaagctcttatatcg
    cggatatcgcgga

Exercises
Read the following program, and try to explain its purpose, and expected input and output
    #!/usr/bin/perl

    # In Perl $/ is a special variable
    # $/ is the record separator or $RS
    # by default $/ is a newline, we redefine it here
    $/ = '>';

    while ( $next = <> ) {
        chomp $next;
        next if $next eq "";
        ($header, @lines) = split(/\n/, $next);
        $seq = join("", @lines);
        $seq =~ s/\s//g;
        $seq =~ s/(.{60})/$1\n/g;
        print ">$header\n$seq\n";
    }

Solve a real life problem
Run batch predictions for proteins and load the results in Artemis
The problem
  – Predictions are run for each protein in turn
  – Coordinates (motifs, domains, similarity) are with respect to the protein sequence
  – Artemis expects coordinates with respect to the DNA sequence (chromosome, contig)
Perl to the rescue

Getting there... step by step
Extract all proteins
  – FASTA file
Run your predictions/searches
  – Either on the web
  – Or at the command line
Have the output of all predictions in tab-delimited files
  – Pre-process the files in a spreadsheet (Excel/OpenOffice) so that they're all in the same format
  – Same number of columns
  – Same order of the columns
  – Export as a text file (TXT, comma- or tab-delimited)

Getting there... step by step
What is our plan?
  – Open the tab-delimited file
  – Convert the coordinates
A more detailed, step by step list of actions
  – Open the FASTA file, read the DNA coordinates for each protein (store coordinates for later use)
  – Open the tab-delimited file containing the predictions
  – For each protein
      Look up the DNA coordinates
      Add the DNA start coordinate to the protein coordinates
      [Figure: example showing DNA, protein and feature coordinates]

Writing our script
    #!/usr/bin/perl
    # Wellcome Trust Course Exercise 1, Thursday, Nov 20th, 2008
    # version 1
    open(FASTA, "/path/to/my/fasta/file") or die "Arrggghhh!";
    while ( $line = <FASTA> ) {
        # \s before the first number stops the greedy .* from eating its leading digits
        if ( $line =~ /^>(\S+).*\s(\d+):(\d+)/ ) {
            $sequence = $1;
            $start = $2;
            $end = $3;
            $hash{$sequence}{start} = $start;
            $hash{$sequence}{end} = $end;
        }
    }
The resulting data structure is a hash of hashes
    %hash = (
        CDS1 => { start => 1224, end => 1560 },
        CDS2 => { start => 1689, end => 2553 },
        # CDS3 ...
    );
Get a value
    $var = $hash{CDS1}{end};
Set a value
    $hash{CDS1}{start} = 1224;
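The remaining step, converting a feature's protein coordinates to DNA coordinates, might look like the sketch below. It goes beyond what the slides spell out, and its assumptions should be checked against your own data: positions are 1-based and inclusive, the gene is on the forward strand with no introns, and protein positions are counted in amino acids, so each residue spans 3 nucleotides. The sub names read_coordinates and protein_to_dna are made up for the example.

```perl
use strict;
use warnings;

# Collect start/end DNA coordinates from FASTA header lines of the form
# ">id description start:end". Other lines are ignored.
sub read_coordinates {
    my (@lines) = @_;
    my %coords;
    foreach my $line (@lines) {
        if ( $line =~ /^>(\S+).*\s(\d+):(\d+)\s*$/ ) {
            $coords{$1} = { start => $2, end => $3 };
        }
    }
    return %coords;
}

# Convert a protein feature (amino acid positions) to DNA coordinates.
# Assumes: forward strand, no introns, 1-based inclusive positions,
# and 3 nucleotides per amino acid.
sub protein_to_dna {
    my ($dna_start, $aa_start, $aa_end) = @_;
    my $feature_start = $dna_start + 3 * ($aa_start - 1);   # first base of first codon
    my $feature_end   = $dna_start + 3 * $aa_end - 1;       # last base of last codon
    return ($feature_start, $feature_end);
}

my %coords = read_coordinates(
    ">seq1 blah blah blah 1877:2980",
    "ATGCGGTGACCGG",
    ">CDS2 blah blah blah 2856:3245",
);

# A hypothetical domain spanning residues 10 to 20 of seq1.
my ($start, $end) = protein_to_dna($coords{seq1}{start}, 10, 20);
print "seq1 feature: $start..$end\n";
```

If the predictions already report positions in nucleotides, the factor of 3 does not apply and adding the DNA start (minus one, for 1-based coordinates) is enough.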

Getting help
Read the Perl manual
  – 'man perl' (Overview and links to other parts of the manual)
  – 'man perlfunc' (Perl built-in functions, e.g. split, join, chomp)
  – 'man perlop' (Perl built-in operators, e.g. + &&)
  – 'man perlre' (Perl regular expressions)
Perldoc
  – 'perldoc -f chomp'
      Get documentation for a built-in function (in this case, chomp)
  – This also includes external and/or third-party modules
  – 'perldoc Bio::AlignIO'
  – 'perldoc Bio::SeqIO'
  – 'perldoc Bio::Seq::LargeSeq'

Further Reading
Mastering Perl for Bioinformatics
  – James Tisdall
  – O'Reilly and Associates, 2003
Learning Perl
  – Randal Schwartz, Tom Phoenix, Brian D Foy
  – O'Reilly and Associates, 5th Edition, 2008
Intermediate Perl
  – Randal Schwartz, Brian D Foy, with Tom Phoenix
  – O'Reilly Media, 2006
Mastering Regular Expressions
  – Jeffrey Friedl
  – O'Reilly and Associates, 1997
CPAN, the Comprehensive Perl Archive Network
  – Your one stop shop for everything Perl
      Modules, Frameworks