Bioinformatics is … - the use of computers and information technology to assist biological studies - a multi-dimensional and multi-lingual discipline Chapters 1-4, Tisdall
Multiple platforms, multiple languages Windows, Mac, UNIX, Linux –UNIX remains the standard for bioinformatics software development, while PCs and Macs are typically end-user machines. Java, Python, CORBA, C++, Ruby, Perl –There’s more than one way of doing things. –Lack of uniformity continues to be one of the biggest problems faced in bioinformatics
Why Perl? Ease of use by novice programmers Fast software prototyping –Flexible language –Compact code (sometimes) Powerful pattern matching via “regular expressions” Availability of programs and modules (BioPerl) Portability Open Source – easy to extend and customize No licensing fees
Perl is easy to get… Many computers come with Perl already installed –Check by typing perl -v in a Unix, Linux, or Mac OS X shell, or a Windows MS-DOS shell If not, simply go to or to download a recent version of Perl (download a binary whenever possible; source code requires compiling) ActiveState provides several tools for Perl developers Although some think Perl is an “old” language, it is constantly undergoing revision and improvement
What is Perl? Practical Extraction and Report Language An interpreted programming language optimized for scanning text files, extracting information, and printing reports DNA and protein sequence data are strings of text, which makes Perl an obvious choice
What is a Perl program? A program consists of a text file containing a series of Perl statements –Perl programs can be written in a variety of text editors including WordPad, NotePad, MS Word (saved as plain text), or, as you will use, Komodo from ActiveState Perl statements are separated by semicolons (;) Extra spaces, tabs, and blank lines between statements are ignored Anything following a # on a line is ignored (comment) Perl is case sensitive
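The rules on this slide can be seen in a very small program. This is only a sketch: the variable names are made up for illustration, and `use strict`, `use warnings`, and `my` are additions that the slides do not use but that help catch typos.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Anything after a '#' is a comment and is ignored by Perl.
my $dna = 'ACGT';      # each statement ends with a semicolon
my $DNA = 'TTGA';      # a different variable: Perl is case sensitive

# Extra spaces and blank lines between tokens do not change the meaning.
my    $joined   =   $dna . $DNA;

print "$joined\n";     # prints ACGTTTGA
```

Save the text in a file (for example `first.pl`, a hypothetical name) and run it with `perl first.pl`.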
Perl has three data types
  $ - Scalar: holds a single value, which can be a number or string, $EcoRI =
  @ - Array: stores multiple scalar values, indexed [0, 1, 2, etc.]
  % - Hash: an associative array with keys and values
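A short sketch of the three data types side by side. The values are illustrative assumptions rather than taken from the slides (GAATTC and GGATCC are the recognition sites of EcoRI and BamHI), and `my` is added so the code runs under `use strict`.

```perl
use strict;
use warnings;

# Scalar: a single value (here a string).
my $EcoRI = 'GAATTC';

# Array: an ordered list of scalars, indexed from 0.
my @bases = ('A', 'C', 'G', 'T');

# Hash: key/value pairs (enzyme name => recognition site).
my %sites = (
    EcoRI => 'GAATTC',
    BamHI => 'GGATCC',
);

print "Scalar: $EcoRI\n";                          # Scalar: GAATTC
print "First base: $bases[0]\n";                   # First base: A
print "Number of bases: ", scalar(@bases), "\n";   # Number of bases: 4
print "BamHI site: $sites{BamHI}\n";               # BamHI site: GGATCC
```

Note that a single element of an array or hash is itself a scalar, which is why it is written with `$` (`$bases[0]`, `$sites{BamHI}`).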
Using Scalar Variables Example 4-1 in Tisdall provides a simple example; a thorough description of this exercise is supplied in the text
Some additional comments regarding strings: Quotes:
  –'XYZ' Text between a pair of single quotes is interpreted literally
  –To get a single quote into a single-quoted string, precede it with a backslash
  –To get a backslash into a single-quoted string, precede the backslash with another backslash
  'hello' #hello
  'can\'t' #can't
  ' #
Double quotes interpolate variables
  –"" Variable names within a double-quoted string are replaced by their current values
  –$x = 1;
  print '$x'; # will print out $x
  print "$x"; # will print out 1
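The difference between the two kinds of quotes is easiest to see by printing the same text both ways. A minimal sketch, with variable names chosen only for illustration:

```perl
use strict;
use warnings;

my $x = 1;

my $literal      = '$x';        # single quotes: no interpolation
my $interpolated = "$x";        # double quotes: $x is replaced by its value
my $escaped      = 'can\'t';    # backslash-escaped single quote
my $backslash    = 'a\\b';      # two backslashes give one literal backslash

print $literal, "\n";           # $x
print $interpolated, "\n";      # 1
print $escaped, "\n";           # can't
print $backslash, "\n";         # a\b
```

Code must use plain straight quotes (' and "); the curly quotes that word processors insert are not recognized by Perl.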
Arithmetic operators + Addition - Subtraction * Multiplication ** Exponentiation / Division % Modulus
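Each arithmetic operator applied to the same pair of numbers, as a quick sketch (the numbers 7 and 2 are arbitrary):

```perl
use strict;
use warnings;

my $sum        = 7 + 2;    # 9
my $difference = 7 - 2;    # 5
my $product    = 7 * 2;    # 14
my $power      = 7 ** 2;   # 49
my $quotient   = 7 / 2;    # 3.5 (division is not truncated to an integer)
my $remainder  = 7 % 2;    # 1

print "$sum $difference $product $power $quotient $remainder\n";
# prints: 9 5 14 49 3.5 1
```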
Other important operators
  = is the assignment operator
  == tests whether two numbers are equal; eq tests whether two strings are equal
  += and -= are assignment operators that add or subtract: $a += 2; # means $a = $a + 2;
  ++ and -- are the autoincrement and autodecrement operators; they add or subtract one from a variable when following the variable ($a++ means $a = $a + 1)
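A brief sketch of these operators in action, including a case where `==` and `eq` disagree. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $count = 5;
$count += 2;    # 7
$count -= 3;    # 4
$count++;       # 5
$count--;       # 4

# == compares values as numbers, eq compares them as strings.
my $numeric_equal = (1.0 == 1)     ? 'equal' : 'not equal';
my $string_equal  = ('1.0' eq '1') ? 'equal' : 'not equal';

print "count = $count\n";                 # count = 4
print "as numbers: $numeric_equal\n";     # as numbers: equal
print "as strings: $string_equal\n";      # as strings: not equal
```

For sequence data, which is text, `eq` is usually the comparison you want.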
\n = newline
  Often you would like to introduce line breaks into your output
  \n ends the current line, so whatever is printed next starts on a new line
  print "apple"; print "grape";
  Output looks like: applegrape
  print "apple\n"; print "grape\n";
  Output looks like:
  apple
  grape
Chomp and Chop
  chop removes the last character from a string
  –$a = "Dr. Barber is hip";
  –chop($a); # $a is now "Dr. Barber is hi"
  chomp removes a trailing newline from the end of a string
  –$a = "Dr. Barber is hip\n";
  –chomp($a); # $a is now "Dr. Barber is hip"
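The practical difference is that chomp is safe to call on a string that has no trailing newline, while chop always removes a character. A small sketch with made-up sequence strings:

```perl
use strict;
use warnings;

my $with_newline = "ATGC\n";
chomp($with_newline);        # trailing newline removed: "ATGC"

my $no_newline = "ATGC";
chomp($no_newline);          # nothing to remove: still "ATGC"

my $chopped = "ATGC";
chop($chopped);              # last character removed regardless: "ATG"

print "$with_newline $no_newline $chopped\n";   # ATGC ATGC ATG
```

This is why chomp is the usual choice for cleaning lines read from a file or the keyboard.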
Do examples 4-2, 4-3, 4-4
Working with Files Biological data can come in a variety of file formats and our job is to utilize these files and extract what we want One such file format is FASTA
Scalar vs. Array Example 4-5 provides a simple distinction between the use of a scalar variable and an array; read it, but don’t necessarily do it It also shows how to use a filehandle in association with your file The angle brackets < > around a filehandle are the input operator; you will become better acquainted with this later
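The scalar versus array distinction can be sketched with a filehandle as follows. This is not the textbook example: the FASTA record is hypothetical and is held in a string (an in-memory filehandle) so that the sketch runs without any file on disk, and it uses a lexical filehandle `$fh` where the book may use a different style.

```perl
use strict;
use warnings;

# Hypothetical FASTA record: a header line starting with '>' and sequence lines.
my $fasta = ">example_peptide hypothetical record\nMKVLAAGIVG\nLLAQESTRWW\n";

# For a real file, pass the file name instead of \$fasta.
open(my $fh, '<', \$fasta) or die "Cannot open data: $!";

my $header = <$fh>;    # scalar context: reads one line
my @lines  = <$fh>;    # list context: reads all remaining lines
close($fh);

chomp($header);        # remove the trailing newline
chomp(@lines);         # chomp works on every element of an array

my $sequence = join('', @lines);
print "Header:   $header\n";      # Header:   >example_peptide hypothetical record
print "Sequence: $sequence\n";    # Sequence: MKVLAAGIVGLLAQESTRWW
```

Assigning the input operator to a scalar reads a single line; assigning it to an array reads every line that is left.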
adhI.pep Replace NM_021964fragment.pep with adhI.pep, which can be downloaded from the web site to a folder you need to create on your computer called “BIOS482” Do Example 4-7; if time permits, write code analogous to the code that follows this example to test out arrays
The Power of Perl Regular Expressions
What is a regular expression (regex)? It is a description for a group of characters you want to search for in a string, a file, a website, etc. Think of the group of characters as a pattern that you want to find within a string Use regular expressions to search text quickly and accurately
Pattern Matching Syntax
  $variable_name =~ /pattern/;
  –$variable_name – the variable containing the string you want to search
  –=~ – the binding operator, used for testing regular expressions
  –Letters before the first / are operators and letters after the last / are modifiers; both affect the regular expression search
Matching operator
  You have already been introduced to the substitution and translation operators
  m// or just // is used to find patterns in a string
  Test if a string contains the sequence ATG
  –$dnastr = 'TTCGATGCCAC';
  –if ($dnastr =~ /ATG/) {
  –    print("ATG found.\n");
  –}
  –else {
  –    print("ATG not found.\n");
  –}
Case modifier
  /atg/ would not find a match in the previous example
  However, /atg/i would
  i is the case-insensitive modifier
  We will introduce additional modifiers when necessary
Global modifier
  If there were more than one ATG in the sequence, the previous examples would only find the first one they run into
  /ATG/g
  g is the modifier for a global search: it searches a string for ALL instances of the pattern, not just the first one
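A sketch of the i and g modifiers on a sequence containing two ATGs. The sequence is made up for illustration; `pos` is the built-in function that reports where the last /g match ended.

```perl
use strict;
use warnings;

my $dna = 'TTCGATGCCACATGG';

# Without /i the lowercase pattern does not match the uppercase sequence.
my $case_sensitive   = ($dna =~ /atg/)  ? 'found' : 'not found';
my $case_insensitive = ($dna =~ /atg/i) ? 'found' : 'not found';

# In list context, /g returns every match, so the array size is the count.
my @hits  = $dna =~ /ATG/g;
my $count = scalar @hits;

# In a while loop, /g steps through the matches one at a time.
my @offsets;
while ($dna =~ /ATG/g) {
    push @offsets, pos($dna) - length('ATG');   # start of this match
}

print "/atg/:  $case_sensitive\n";      # /atg/:  not found
print "/atg/i: $case_insensitive\n";    # /atg/i: found
print "ATG count: $count\n";            # ATG count: 2
print "Offsets: @offsets\n";            # Offsets: 4 11
```

Offsets are counted from 0, so the first ATG starts at the fifth base.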
Other operators for regex
  s/// - the substitution operator is used to change strings: put the old string between the first and second /, and the new string between the second and third
  tr/// - the translation operator is used to change individual characters: put the old characters between the first and second /, and the new characters between the second and third
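Two common sequence tasks illustrate the difference between the operators: s/// rewrites matched text, while tr/// maps characters one for one. A minimal sketch on a made-up six-base sequence:

```perl
use strict;
use warnings;

my $dna = 'ATGCGT';

# s///: replace every T with U (the g modifier means all occurrences).
(my $rna = $dna) =~ s/T/U/g;

# tr///: map each base to its complement, character by character.
(my $complement = $dna) =~ tr/ACGT/TGCA/;

# Reversing the complement gives the reverse complement.
my $revcom = reverse $complement;

# tr with an empty replacement list changes nothing and returns a count.
my $gc_count = ($dna =~ tr/GC//);

print "RNA:                $rna\n";          # AUGCGU
print "Complement:         $complement\n";   # TACGCA
print "Reverse complement: $revcom\n";       # ACGCAT
print "G+C count:          $gc_count\n";     # 3
```

The `(my $copy = $original) =~ ...` form copies the string first, so the original sequence is left unchanged.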
Metacharacters help search for complicated patterns
  \d or [0-9] – match any digit
  \w or [a-zA-Z_0-9] – match a word character
  \D – match a non-digit character
  \W – match a non-word character
  \s or [ \t\n\r\f] – match a whitespace character
  \S – match a non-whitespace character
  \n – match a newline character
  \r – match a carriage return
  \t – match a tab
  \f – match a formfeed
  . – match any SINGLE character (except a newline)
  There are more!
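A few of these metacharacters tested against one tab-separated line. The line itself is a hypothetical annotation record invented for this sketch.

```perl
use strict;
use warnings;

# Three fields separated by tabs.
my $line = "chr7\t1523\tGATA1 site";

my $has_digit   = ($line =~ /\d/)           ? 'yes' : 'no';   # yes
my $has_tab     = ($line =~ /\t/)           ? 'yes' : 'no';   # yes
my $has_newline = ($line =~ /\n/)           ? 'yes' : 'no';   # no
my $dot_and_ws  = ($line =~ /GATA.\ssite/)  ? 'yes' : 'no';   # yes: . matches 1, \s matches the space

# A metacharacter can also serve as the separator for split.
my @fields = split /\t/, $line;

print "digit: $has_digit, tab: $has_tab, newline: $has_newline\n";
print "GATA.\\ssite: $dot_and_ws\n";
print "fields: ", scalar(@fields), "\n";    # fields: 3
```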
Regex quantifiers
  These syntax structures allow you to specify how many times part of a regular expression pattern should match
  –* match 0 or more times
  –+ match 1 or more times
  –? match 0 or 1 times
  –{n} match exactly n times
  –{n,} match at least n times
  –{n,m} match at least n, but not more than m times
Examples of quantifier use
  /A+CGC?A/ # match one or more A's followed by CG, followed by an optional C, followed by an A
  /A{3}/ # match exactly 3 A's
  /A{3,}/ # match 3 or more A's
  /A{3,8}/ # match 3 to 8 A's
  The transcription factor binding site for SSP protein is GGCGGCGGCTGGCTAGGG
  –/(GGC){3}TG{2}CTAG{3}/
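The quantifier patterns can be checked against a few test strings. This is a sketch with made-up test sequences; it also shows that an unanchored /A{3}/ still matches inside a longer run of A's, which is worth keeping in mind when "exactly" matters. The ^ and $ anchors used in one line are described on a later slide.

```perl
use strict;
use warnings;

# One or more A's, then CG, an optional C, then an A.
my %result;
for my $seq ('AAACGCA', 'ACGA', 'CGA') {
    $result{$seq} = ($seq =~ /A+CGC?A/) ? 'match' : 'no match';
    print "$seq: $result{$seq}\n";
}
# prints: AAACGCA: match / ACGA: match / CGA: no match

# {3} means three repeats of A, but the match may sit inside a longer run.
my $three_inside_four = ('AAAA' =~ /A{3}/)   ? 1 : 0;   # 1
my $only_three        = ('AAAA' =~ /^A{3}$/) ? 1 : 0;   # 0: the whole string must be AAA

# The SSP site written with quantifiers matches the full sequence.
my $ssp_site    = 'GGCGGCGGCTGGCTAGGG';
my $ssp_matches = ($ssp_site =~ /(GGC){3}TG{2}CTAG{3}/) ? 1 : 0;   # 1

print "$three_inside_four $only_three $ssp_matches\n";   # 1 0 1
```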
Alternation Vertical bar (|) allows you to match one of several alternatives /song|blue/ # match either ‘song’ or ‘blue’ /a|b|c/ # match a, b, or c, same as [abc] The GATA-1 TF binding site is defined by a T or an A, followed by GATA followed by an A or G. In regex that would be: /(T|A)GATA(A|G)/
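For single-character alternatives, alternation and a character class accept the same strings, which can be checked directly. A small sketch using three made-up candidate sites:

```perl
use strict;
use warnings;

my (%by_alternation, %by_class);
for my $site ('TGATAA', 'AGATAG', 'CGATAA') {
    $by_alternation{$site} = ($site =~ /(T|A)GATA(A|G)/) ? 1 : 0;
    $by_class{$site}       = ($site =~ /[TA]GATA[AG]/)   ? 1 : 0;
    print "$site: alternation=$by_alternation{$site} class=$by_class{$site}\n";
}
# prints:
# TGATAA: alternation=1 class=1
# AGATAG: alternation=1 class=1
# CGATAA: alternation=0 class=0
```

Alternation becomes necessary once the alternatives are longer than one character, as in /song|blue/.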
Anchoring patterns ^ matches the beginning of a string, while $ matches the end of a string /^this/ #matches ‘this one’ but not ‘watch this’ /this$/ #matches ‘watch this’ but not ‘this one’
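The two anchors can be tried on the strings from this slide, and then on a sequence. The coding sequence below is a hypothetical example; the check only tests for a start codon at the beginning and a stop codon at the end.

```perl
use strict;
use warnings;

my (%starts, %ends);
for my $text ('this one', 'watch this') {
    $starts{$text} = ($text =~ /^this/) ? 'yes' : 'no';
    $ends{$text}   = ($text =~ /this$/) ? 'yes' : 'no';
    print "'$text': starts=$starts{$text} ends=$ends{$text}\n";
}
# prints:
# 'this one': starts=yes ends=no
# 'watch this': starts=no ends=yes

# Hypothetical coding sequence: begins with ATG and ends with a stop codon.
my $cds = 'ATGGCCTAA';
my $start_and_stop = ($cds =~ /^ATG/ && $cds =~ /(TAA|TAG|TGA)$/) ? 1 : 0;
print "$start_and_stop\n";   # 1
```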
Pattern memory Now that you know how to match characters, you need a way to find out what was matched by storing or saving the matching portions Putting parentheses around any part of a pattern causes the part of the string it matched to be remembered and stored in a special variable called $1 If there are multiple parenthesized groups, the later ones are stored in $2, $3, …
Finding and storing GATA-1 binding site
  $seq = "AAAGAGAGGGATAGAATAGAGATGATAAGAAA";
  $seq =~ /((T|A)GATA(A|G))/; # the outer parentheses store the whole site in $1
  print "$1\n";
  Output: TGATAA
Other special variables
  $& the part of the string that actually matched
  $` everything before the match
  $' everything after the match
  –Modify the previous program to:
  print "$`\n"; print "$&\n"; print "$'\n";
  Output:
  AAAGAGAGGGATAGAATAGAGA
  TGATAA
  GAAA
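A sketch that pulls the pieces of the GATA-1 example together. Parenthesized groups are numbered by their opening parenthesis, so wrapping the whole pattern in an outer pair makes $1 the full site, with $2 and $3 holding the first and last base. As an alternative to $`, $&, and $', it uses the built-in arrays @- and @+, which hold the start and end offsets of the last match; this is a design choice for the sketch, not something the slides prescribe.

```perl
use strict;
use warnings;

my $seq = 'AAAGAGAGGGATAGAATAGAGATGATAAGAAA';

my ($site, $first, $last, $before, $after, $start);
if ($seq =~ /((T|A)GATA(A|G))/) {
    ($site, $first, $last) = ($1, $2, $3);   # whole site, first base, last base
    $start  = $-[0];                         # offset where the match begins
    $before = substr($seq, 0, $-[0]);        # same text as $`
    $after  = substr($seq, $+[0]);           # same text as $'
}

print "Site:   $site ($first...$last)\n";    # Site:   TGATAA (T...A)
print "Start:  $start\n";                    # Start:  22
print "Before: $before\n";                   # Before: AAAGAGAGGGATAGAATAGAGA
print "After:  $after\n";                    # After:  GAAA
```

Always test that the match succeeded (as the `if` does here) before using $1 and friends; after a failed match they still hold values from an earlier successful one.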
Websites on RegEx
  /pod/perlre.html
  ttperl/perlreg.htm
  nistration/RegExp/page2.html
  /jw-0713-regex.html
Exercises Try some regular expressions with your motif.pl program pg Read pages 70-75, work through Example 5-4 (pick your own nucleotide file from NCBI) Next, do Example 5-7 to learn how to write to files
Homework