Part 4
Arrays: stacks, the foreach command
Regular expressions: string structure analysis, substring extraction and substitution
Command line arguments array
Modules in Perl: how to use and share libraries of functions
Functions/subroutines: repeated use of functional blocks
Error messages: how to interrupt a program on a mistake, the die statement
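
Arrays as a "FIRST-COME … LAST-SERVED" stack

    @numbers = (7,-1,2,4,5);    # 5 numbers array

    # zero array
    @numbers = ();

    # store numbers
    push @numbers, 7;
    push @numbers, -1;
    push @numbers, 2;
    push @numbers, 4;
    push @numbers, 5;

    # take back the number that was stored last
    $lastNumber = pop @numbers;
    print "last number stored was $lastNumber\n";

Picture the array as a jar of 5 numbers: push drops a number on top (5 went in last), and pop takes the top number back out.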
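
To make the "first-come, last-served" order visible, here is a minimal sketch that pushes the same five numbers and then pops until the array is empty. The variable names `@stack` and `@popped` are illustrative only and do not come from the slides.

```perl
use strict;
use warnings;

# Start with an empty stack and push five numbers onto it.
my @stack = ();
push @stack, $_ for (7, -1, 2, 4, 5);

# pop removes and returns the most recently pushed element,
# so the numbers come back in reverse order.
my @popped;
while (@stack) {
    push @popped, pop @stack;
}

print "Popped order: @popped\n";   # prints "Popped order: 5 4 2 -1 7"
```

The loop stops when `@stack` is empty, because an empty array is false in a condition.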

When are the push/pop commands useful?

    #!/usr/local/bin/perl

    # storing file
    @fileLines = ();
    open (INP, "<data.txt");
    while ($line = <INP>) {
        chomp($line);
        push @fileLines, $line;
    }
    close(INP);

    # calculating number of lines in the file
    $nLines = $#fileLines + 1;
    print "There are $nLines lines in data.txt file\n";

    # printing out data.txt file content
    foreach $line (@fileLines) {
        print "$line\n";
    }

data.txt: Finding potential regulatory elements in noncoding regions of the human genome is a challenging problem. Analyzing novel sequences for the presence of known transcription factor binding sites or their weight matrices produces a huge number

The foreach command visits every element of an array in turn:

    @numbers = (1..6);
    foreach $d (@numbers) {
        print "$d ";
    }
    print "\n";
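
The same read-and-store pattern can be sketched in a self-contained form. This version makes three assumptions that are not in the slides: it reads from an in-memory string instead of data.txt so it runs anywhere, it uses a lexical filehandle with three-argument open, and it checks the result of open.

```perl
use strict;
use warnings;

# Stand-in for the contents of data.txt (hypothetical sample text).
my $text = "first line\nsecond line\nthird line\n";

# Opening a reference to a string gives an in-memory file.
open(my $in, '<', \$text) or die "Cannot open in-memory file: $!\n";

my @fileLines;
while (my $line = <$in>) {
    chomp($line);               # drop the trailing newline
    push @fileLines, $line;     # append the line to the end of the array
}
close($in);

# scalar(@fileLines) gives the same value as $#fileLines + 1.
my $nLines = scalar(@fileLines);
print "There are $nLines lines\n";

foreach my $line (@fileLines) {
    print "$line\n";
}
```

To read a real file, replace `\$text` with the file name.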

Command line arguments

printFile.pl -- a program which prints out the contents of a file

    #!/usr/local/bin/perl

    # determine file name
    $fName = $ARGV[0];

    # open, read and print out file
    open (INP, "<$fName");
    while ($line = <INP>) {
        print $line;
    }
    close(INP);

Example files: numbers.txt, words.txt

Sample file content: Finding potential regulatory elements in noncoding regions of the human genome is a challenging problem. Analyzing novel sequences for the presence of known transcription factor binding sites or their weight matrices produces a huge number of

    printFile.pl numbers.txt

@ARGV -- array of arguments following the program name. For the command above:

    @ARGV = ("numbers.txt");

Example: print out the N-th line of a file

    #!/usr/local/bin/perl

    # determine file name, and line index
    $fName = $ARGV[0];
    $lineNo = $ARGV[1];

    # open and read file
    open (INP, "<$fName");
    while ($line = <INP>) {
        push @fileLines, $line;
    }
    close(INP);

    # print out N-th line (array indices start at 0, line numbers at 1)
    print $fileLines[ $lineNo-1 ];

words.txt: Finding potential regulatory elements in noncoding regions of the human genome is a challenging problem. Analyzing novel sequences for the presence of known transcription factor binding sites or their weight matrices produces a huge number of

    printFile.pl words.txt 3

prints the third line of words.txt:

    a challenging problem. Analyzing novel
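
As an alternative to storing the whole file, the N-th line can be picked up while reading, using Perl's built-in line counter `$.`. This is a sketch of a different technique from the one on the slide. The sample text and the fixed `$lineNo` stand in for a real file and for `$ARGV[1]`.

```perl
use strict;
use warnings;

# Hypothetical file contents and line index (stand-ins for a file and $ARGV[1]).
my $text   = "line one\nline two\nline three\nline four\n";
my $lineNo = 3;

open(my $in, '<', \$text) or die "Cannot open in-memory file: $!\n";

my $found;
while (my $line = <$in>) {
    # $. holds the number of the line that was just read (starting at 1).
    if ($. == $lineNo) {
        $found = $line;
        last;                   # no need to read the rest of the file
    }
}
close($in);

print defined($found) ? $found : "There is no line $lineNo\n";   # prints "line three"
```

This form avoids keeping every line in memory, which can matter for large files.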

Error messages

How to stop a program correctly, with an indication of a run problem?

Example problem: the program should be executed with 2 arguments,

    printFile.pl words.txt 3

but the user specifies only 1:

    printFile.pl 3

The program should stop and report the error. The die statement prints out a message and stops the program:

    #!/usr/local/bin/perl

    # check whether we've got 2 arguments or not
    # ($#ARGV is the index of the last argument, so 1 means 2 arguments)
    if ($#ARGV != 1) {
        die "Error. Incorrect number of arguments\n";
    }
    ...

Stop on an incorrect line number:

    ...
    if ($ARGV[1] <= 0) {
        die "Error. Incorrect line number: $ARGV[1]\n";
    }
    ...
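
The two checks can be combined and tried out without stopping the demo by wrapping each call in `eval`, which catches the die and leaves its message in `$@`. This is only a sketch: `check_arguments` is a hypothetical helper, and the extra test that the line number consists of digits is an addition that is not on the slide.

```perl
use strict;
use warnings;

# Hypothetical helper: validates the two arguments expected by printFile.pl.
sub check_arguments {
    my @args = @_;
    if (@args != 2) {
        die "Error. Incorrect number of arguments\n";
    }
    # Added assumption: the line number must be a positive whole number.
    if ($args[1] !~ /^\d+$/ || $args[1] <= 0) {
        die "Error. Incorrect line number: $args[1]\n";
    }
    return 1;
}

# A valid call returns 1.
my $ok = eval { check_arguments("words.txt", 3) };
print "Valid call accepted\n" if $ok;

# Only one argument: die is triggered and caught by eval.
eval { check_arguments("3") };
my $countError = $@;
print "Caught: $countError";

# Second argument is not a number.
eval { check_arguments("words.txt", "abc") };
my $lineError = $@;
print "Caught: $lineError";
```

Because each message ends with `\n`, Perl prints it as written. Without the trailing newline, die appends the script name and line number to the message.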

Defining novel functions and commands

A function is a "mini computer" inside a program: it gets input data (parameters) and produces output results.

    INPUT:    2 Hello Everybody
    FUNCTION  (filtering out numbers)
    OUTPUT:   Hello Everybody

Defining the min function, which returns the minimum of 2 numbers:

    $x = min(5,3);
    print "Smallest of 5 and 3 is: $x\n";

    # Function min
    sub min {
        # INPUT parameters arrive in the @_ array;
        # "my" keeps the variables local to the function
        my ($first, $second) = @_;
        my $small;
        if ($first < $second) {
            $small = $first;
        }
        else {
            $small = $second;
        }
        return $small;
    }
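
Since all parameters arrive in the `@_` array, the same idea extends to any number of inputs. The sketch below is an illustrative generalization, not part of the slides. `min_of_list` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical generalization of min: returns the smallest of any number of values.
sub min_of_list {
    my @values = @_;
    return unless @values;          # empty input: return undef

    my $small = shift @values;      # start with the first value
    foreach my $value (@values) {
        $small = $value if $value < $small;
    }
    return $small;
}

my $x = min_of_list(5, 3, 8, -2);
print "Smallest of 5, 3, 8 and -2 is: $x\n";   # prints "... is: -2"

my $y = min_of_list(5, 3);
print "Smallest of 5 and 3 is: $y\n";          # prints "... is: 3"
```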

Regular expressions

    $string1 = "Total: 576 genes, 2763 exons, some introns";
    $string2 = "human -G-ACT---TTGC------AA----A---A----";

How to extract the 2 numbers? How to extract just the DNA sequence?

Special symbols that stand for groups of characters of a common type (called patterns):

    \s    Match a whitespace character
    \S    Match a non-whitespace character
    \d    Match a digit character
    \D    Match a non-digit character
    ^     Match the beginning of the line
    .     Match any character (except newline)
    $     Match the end of the line
    \t    Tabulation symbol (HT, TAB)
    \n    Newline (LF, NL)
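
One possible way to answer the two questions with these patterns is sketched below. It is an assumption that "just the DNA sequence" means the second whitespace-separated field. The final step that strips the gap characters is an optional extra.

```perl
use strict;
use warnings;

my $string1 = "Total: 576 genes, 2763 exons, some introns";
my $string2 = "human -G-ACT---TTGC------AA----A---A----";

# In list context, a /g match returns every captured group: here, each run of digits.
my @numbers = $string1 =~ /(\d+)/g;
print "Numbers: @numbers\n";            # prints "Numbers: 576 2763"

# Skip the first word and the whitespace after it, capture the rest of the line.
my ($aligned) = $string2 =~ /^\S+\s+(\S+)$/;
print "Aligned sequence: $aligned\n";   # prints "-G-ACT---TTGC------AA----A---A----"

# Optional: remove the gap characters to keep only the letters.
(my $dna = $aligned) =~ s/-//g;
print "Ungapped sequence: $dna\n";      # prints "GACTTTGCAAAA"
```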

Grouping options:

    *     Match 0 or more times
    +     Match 1 or more times
    []    Character class

Patterns management (each substitution below is applied to the original string):

    $string = "Total: 576 genes, 2763 exons, some introns";

    $string =~ s/\d+/some/g;    --> "Total: some genes, some exons, some introns"
    $string =~ s/\s+/#/g;       --> "Total:#576#genes,#2763#exons,#some#introns"
    $string =~ s/\D+/\*/g;      --> "*576*2763*"

In the last example \D+ also matches the spaces around the numbers, so every run of non-digit characters collapses into a single *.
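
The substitutions can be checked side by side by applying each one to a fresh copy of the string. The sketch below also adds one character class example, `[:,]`, which is not on the slide. The copy variable names are illustrative.

```perl
use strict;
use warnings;

my $string = "Total: 576 genes, 2763 exons, some introns";

# (my $copy = $string) =~ s/.../.../ copies first, then edits the copy,
# so $string itself stays unchanged.
(my $noDigits   = $string) =~ s/\d+/some/g;
(my $noSpaces   = $string) =~ s/\s+/#/g;
(my $onlyDigits = $string) =~ s/\D+/*/g;
(my $noPunct    = $string) =~ s/[:,]//g;   # character class: a colon or a comma

print "$noDigits\n";     # prints "Total: some genes, some exons, some introns"
print "$noSpaces\n";     # prints "Total:#576#genes,#2763#exons,#some#introns"
print "$onlyDigits\n";   # prints "*576*2763*"
print "$noPunct\n";      # prints "Total 576 genes 2763 exons some introns"
```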

Localizing substrings

alignment.blast:

    human -G-ACT---TTGC------AA----A---A-----CG-----G-AT TGGG---
    | ||| ||| || | | || | || ||||
    mouse TGAACTCAAGTGCTATTTTAATTCCATTCATTCTCCGTGGCTGCATCAGGGCCTGGGGCT

    human C----GG------GA TG-AG--AGG
    | || || || || |||
    mouse CTACCTCCTGACAAACATTTGGTCTCTAGAAGGCTTCTGAAGTTAGGCAAGTCTGAAAAT

How to extract only the lines starting with 'mouse'?

    while ($line = <INP>) {
        if ($line =~ /^mouse/) {
            print $line;
        }
    }

Output:

    mouse TGAACTCAAGTGCTATTTTAATTCCATTCATTCTCCGTGGCTGCATCAGGGCCTGGGGCT
    mouse CTACCTCCTGACAAACATTTGGTCTCTAGAAGGCTTCTGAAGTTAGGCAAGTCTGAAAAT

Obtaining substrings after localization

alignment.blast:

    human -G-ACT---TTGC------AA----A---A-----CG-----G-AT TGGG---
    | ||| ||| || | | || | || ||||
    mouse TGAACTCAAGTGCTATTTTAATTCCATTCATTCTCCGTGGCTGCATCAGGGCCTGGGGCT

    human C----GG------GA TG-AG--AGG
    | || || || || |||
    mouse CTACCTCCTGACAAACATTTGGTCTCTAGAAGGCTTCTGAAGTTAGGCAAGTCTGAAAAT

How to extract the human and mouse sequences?

    $humanSeq = "";
    $mouseSeq = "";
    while ($line = <INP>) {
        if ($line =~ /^mouse (\S+)$/) {
            $mouseSeq .= $1;
        }
        elsif ($line =~ /^human (\S+)$/) {
            $humanSeq .= $1;
        }
    }
    print "Human sequence: $humanSeq\n";
    print "Mouse sequence: $mouseSeq\n";

/...(xxx)...(xxx)../ -- substrings enclosed in parentheses are available after a successful search in the variables $1, $2, ...
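
The capture mechanism can be tried without an input file. The sketch below uses a shortened, made-up alignment held in an array. The sample lines are hypothetical and are not the contents of alignment.blast. It also uses `\s+` instead of a single space, which is an assumption that tolerates more than one blank after the species name.

```perl
use strict;
use warnings;

# Hypothetical, shortened alignment (not the real alignment.blast).
my @lines = (
    "human -G-ACT---TTGC",
    "       | |||   | ||",
    "mouse TGAACTCAAGTGC",
    "human C----GG",
    "mouse CTACCTC",
);

my $humanSeq = "";
my $mouseSeq = "";
foreach my $line (@lines) {
    # After a successful match, $1 holds the text matched inside the parentheses.
    if ($line =~ /^mouse\s+(\S+)$/) {
        $mouseSeq .= $1;
    }
    elsif ($line =~ /^human\s+(\S+)$/) {
        $humanSeq .= $1;
    }
    # Lines with only match marks start with neither word and are skipped.
}

print "Human sequence: $humanSeq\n";   # prints "-G-ACT---TTGCC----GG"
print "Mouse sequence: $mouseSeq\n";   # prints "TGAACTCAAGTGCCTACCTC"
```

Only read `$1` right after a match that succeeded. After a failed match it still holds the value from the previous successful one.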

Modules

Perl does not have built-in functions for every case, but most of the functions you are likely to need have already been programmed by other people. They share their libraries of functions, which are called modules.

    Perl does not know how to create pictures    use GD;    -- now it knows
    How to communicate with databases?           use DBI;
    How to do DNA sequence analysis?             use Bio::Seq;    (part of BioPerl)
    How to extract command line options?         use Getopt::Long;

CPAN -- storage of Perl modules

The use X; command indicates that functions from module X should be used.
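
As a small illustration of borrowing ready-made functions, the sketch below uses two modules that ship with standard Perl, so nothing needs to be installed: `List::Util` and `Getopt::Long`. The option names `--file` and `--line` and the sample argument list are made up for this example. In a real script the options would normally come from `@ARGV`.

```perl
use strict;
use warnings;

# Import only the functions that are needed from each module.
use List::Util qw(min max sum);
use Getopt::Long qw(GetOptionsFromArray);

# Ready-made list functions instead of hand-written loops.
my @numbers  = (7, -1, 2, 4, 5);
my $smallest = min(@numbers);
my $largest  = max(@numbers);
my $total    = sum(@numbers);
print "min=$smallest max=$largest sum=$total\n";   # prints "min=-1 max=7 sum=17"

# Hypothetical command line, parsed from an array instead of @ARGV.
my @args = ('--file', 'words.txt', '--line', '3');
my ($fName, $lineNo);
GetOptionsFromArray(
    \@args,
    'file=s' => \$fName,     # =s : the option takes a string value
    'line=i' => \$lineNo,    # =i : the option takes an integer value
) or die "Error. Incorrect options\n";

print "File: $fName, line: $lineNo\n";             # prints "File: words.txt, line: 3"
```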