Bioinformatics Lecture 8: Perl pattern matching features

Questions to think about

- Create a hash table that performs the codon to amino acid (AA) conversion, and use it to convert codons entered from the keyboard into their corresponding amino acids.
- Write a script that extracts the gene ID and gene name from the descriptor header of a DNA FASTA file.
- Write a script that reads in the DNA sequences from two FASTA files (assume the sequence length is the same for both) and determines the number of alignment matches to non-matches.

Introduction

- Pattern matching
- Pattern extraction
- Pattern substitution
- Split and join functions
- Unpack function

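Pattern Matching

Recall that =~ is the pattern matching operator. A first simple match example:

    print "EcoRI site found!" if $dna =~ /gat/;

This means: if the string $dna contains the pattern gat, print "EcoRI site found!". The pattern is whatever sits between the two / characters, and =~ is the pattern matching symbol. (The pattern gat is only a toy example; the EcoRI recognition sequence is GAATTC, as used in the last pattern below.)

More patterns:

    if ($dna =~ /[GATCgatc]/)        # $dna contains at least one base, in either case
    if (/^[GATC]/i)                  # $_ starts with a base; /i ignores case
    if ($dna =~ /GAATTC|AAGCTT/) {   # | is the Boolean "or" symbol
        print "EcoRI or HindIII site found!!!";
    }
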
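The matching operator and the | alternation can be tried out with a short script. This is only a sketch: the helper find_sites and the test sequence are made up for illustration, and it assumes the two recognition sequences GAATTC (EcoRI) and AAGCTT (HindIII).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the names of the sites found in a DNA string.
sub find_sites {
    my ($dna) = @_;
    my @found;
    push @found, 'EcoRI'   if $dna =~ /GAATTC/i;   # /i ignores case
    push @found, 'HindIII' if $dna =~ /AAGCTT/i;
    return @found;
}

my $dna = 'ccgGAATTCtta';                          # made-up test sequence
print "Sites: @{[ find_sites($dna) ]}\n";

# The same test in one pattern, using | as "or".
print "Either site present\n" if $dna =~ /GAATTC|AAGCTT/i;
```

Keeping one pattern per enzyme makes it possible to report which site matched; the single alternation pattern only says that at least one of them did.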
Pattern Matching: character classes

A more flexible pattern:

    print "EcoRI site found!" if $dna =~ /GAA[GATC]TTC/;

This is a pattern where the 4th letter can be any one of the letters within the square brackets.

    [GATC]                    any one of G, A, T or C
    [^GATC]                   any character other than G, A, T or C
    [0-9] or \d               a digit
    [a-z]                     a lowercase letter
    [-A-Z]                    a hyphen or an uppercase letter
    /[AT][GC][TG]/            A or T, then G or C, then T or G
    /[a-zA-Z0-9_]/ or /\w/    a word character
    /\s/                      white space; to invert \s, uppercase the letter: \S is non-white space

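A small sketch of character classes in use. The helper names is_plain_dna and has_gaa_n_ttc and the test strings are invented for this example; they are not part of the lecture scripts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the string consists only of G, A, T, C (any case).
sub is_plain_dna {
    my ($seq) = @_;
    return $seq =~ /^[GATC]+$/i ? 1 : 0;
}

# Hypothetical helper: true if GAA, then any one base, then TTC occurs.
sub has_gaa_n_ttc {
    my ($seq) = @_;
    return $seq =~ /GAA[GATC]TTC/ ? 1 : 0;
}

# Made-up test strings.
for my $seq ('GAAGTTC', 'GAAXTTC', 'gattaca', 'GATN') {
    printf "%-8s plain DNA: %d  GAA[GATC]TTC: %d\n",
        $seq, is_plain_dna($seq), has_gaa_n_ttc($seq);
}
```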
Pattern matching: metacharacters

    Metacharacter   Description
    .               Any character except newline
    \.              Full stop character
    ^               The beginning of a line
    $               The end of a line
    \w              Any word character (letter, digit or underscore)
    \W              Any non-word character
    \s              White space (spaces, tabs, carriage returns)
    \S              Non-white space
    \d              Any digit
    \D              Any non-digit

You can also specify the number of times a character must occur: single, multiple or a specific multiple.

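Pattern matching: Quantifiers

    Quantifier   Description
    ?            0 or 1 occurrence
    +            1 or more occurrences
    *            0 or more occurrences
    {N}          Exactly N occurrences
    {N,M}        Between N and M occurrences
    {N,}         At least N occurrences
    {0,M}        No more than M occurrences

Write the braces without spaces. {0,M} is the portable way to say "no more than M"; the {,M} form is not accepted by older Perl versions.
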
Pattern matching: Quantifiers (examples)

A pattern to match the following format, M58200.2:

    =~ /\w+\.\d+/

If the sequence is Pu-C-X(40-80)-Pu-C, where Pu is [AG] and X is [ATGC]:

    $sequence =~ /[AG]C[GATC]{40,80}[AG]C/;

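The two quantifier patterns can be checked with a few test strings. This is a minimal sketch: the helper names and the spacer made of repeated T bases are assumptions for illustration, and the accession pattern is anchored with ^ and $ here so that the whole string must fit the format.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Accession format from the slide: word characters, a literal dot, then digits.
sub looks_like_accession {
    my ($id) = @_;
    return $id =~ /^\w+\.\d+$/ ? 1 : 0;
}

# Pu-C-X(40-80)-Pu-C, with Pu = [AG] and X = [GATC].
sub has_motif {
    my ($seq) = @_;
    return $seq =~ /[AG]C[GATC]{40,80}[AG]C/ ? 1 : 0;
}

my $spacer = 'T' x 40;                         # made-up 40-base spacer
print "accession ok\n"      if looks_like_accession('M58200.2');
print "motif found\n"       if has_motif("AC${spacer}GC");
print "spacer too short\n"  unless has_motif('AC' . ('T' x 10) . 'GC');
```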
Extracting patterns to variables: anchors

E.g. matching a word exactly: /\bword\b/
\b is a word boundary: the pattern matches only the whole word "word", not the letters w, o, r and d inside a longer word.
The start of line anchor ^: /^>/ matches only lines beginning with >.
The end of line anchor $: />$/ matches only lines where the last character is >.
/^$/ : what does this mean?

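The anchors can be explored with a short script. This is a sketch only: classify_line is a hypothetical helper and the three input lines are made up. It also shows one way to read /^$/, namely a line whose start is immediately followed by its end.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: classify one line of a FASTA file using anchors.
sub classify_line {
    my ($line) = @_;
    chomp $line;
    return 'blank'  if $line =~ /^$/;    # start immediately followed by end
    return 'header' if $line =~ /^>/;    # line begins with >
    return 'sequence';
}

my @lines = (">M185580 clone 333a\n", "GATTACA\n", "\n");
print classify_line($_), "\n" for @lines;

# \b marks a word boundary, so "word" matches only as a whole word.
print "whole word\n"       if     'a word here' =~ /\bword\b/;
print "not a whole word\n" unless 'swordfish'   =~ /\bword\b/;
```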
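Further examples: file_size_bases_only.pl

    #!/usr/bin/perl
    # file size2.pl: count the bases and the lines read from a file or standard input
    use strict;
    use warnings;

    my $length = 0;
    my $lines  = 0;
    while (<>) {
        chomp;
        # Add the length of any line that contains at least one base letter.
        $length = $length + length $_ if $_ =~ /[GATCgatc]/;
        # Alternative: $length += length if /^[GATCN]/i;
        $lines = $lines + 1;
    }
    print "LENGTH = $length\n";
    print "LINES = $lines\n";
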
FASTA files

Write and test file_size_bases_only.pl using a FASTA file as input.

FASTADNA1.txt, an example of a FASTA file:

    >2L52.1 CE20433 Zinc finger, C2H2 type (CAMBRIDGE) protein id:CAA21776.1
    GCAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC

Sample of a file in EMBL format:

    gccacagatt acaggaagtc atatttttag acctaaatca ctatcctcta tctttcagca 60
    agaaaagaac atctacttgg tttcgttccc tatccaagat tcagatggtg aaacgagtga 120
    tcatgcacct gatgaacgtg caaaaccaca gtcaagccat gacaaccccg atctacagtt 180
    tgatgttgaa actgccgatt ggtacgccta cagtgaaaac tatggcacaa gtgaagaaaa 240

Sample of an NCBI record format:

      1 atgaacccca acctgtgggt cgacgcgcag agcacttgca agagggaatg cgacgctgac
     61 ctggagtgcg agacctttga gaagtgctgc cccaatgtct gtggaaccaa gagctgtgtg
    121 gctgctcggt acatggacat caaggggaag aaggggcctg tggggatgcc caaagaggca
    181 acctgtgacc gcttcatgtg catccagcaa ggctcagagt gcgacatctg ggacgggcag
    241 cctgtctgca agtgcaagga caggtgtgag aaggagccga gctttacctg cgcctcggac

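The three formats differ mainly in the spaces and position numbers that surround the bases. One possible way to get at the bases alone is sketched below; bases_only is a hypothetical helper, and the sample lines are shortened versions of the ones shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return only the bases from one line of sequence text.
# FASTA descriptor lines (starting with >) yield an empty string.
sub bases_only {
    my ($line) = @_;
    return '' if $line =~ /^>/;
    $line =~ s/[\s\d]+//g;      # drop spaces, newlines and position numbers
    return $line;
}

my @samples = (
    ">2L52.1 CE20433 Zinc finger\n",     # FASTA descriptor (shortened)
    "gccacagatt acaggaagtc 60\n",        # EMBL-style line (shortened)
    "  1 atgaacccca acctgtgggt\n",       # NCBI-style line (shortened)
);
print length(bases_only($_)), "\n" for @samples;
```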
Extracting Patterns

Consider a descriptor line like:

    >M185580 clone 333a, complete sequence

">M18…" is the sequence ID; "clone 333a, com…" is an optional comment.
We need to store some of the elements of the descriptor line. With =~ /(\S+)/, the part of the match inside the parentheses is extracted and put into the variable $1.

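As a sketch of how the descriptor line could be taken apart with two pairs of parentheses: parse_header is a hypothetical helper, and the pattern assumes the ID is the first run of non-space characters after the > with everything after it treated as the comment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a FASTA descriptor line into ID and comment.
# Returns an empty list if the line is not a descriptor line.
sub parse_header {
    my ($line) = @_;
    chomp $line;
    if ($line =~ /^>(\S+)\s*(.*)$/) {
        return ($1, $2);        # $1 = ID, $2 = optional comment
    }
    return;
}

my ($id, $comment) = parse_header(">M185580 clone 333a, complete sequence\n");
print "ID: $id\n";
print "Comment: $comment\n";
```

Testing the match in an if before using $1 and $2 avoids reading capture variables left over from an earlier match.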
Extracting patterns

    #!/usr/bin/perl -w
    # demonstrates the effect of parentheses.
    while ( my $line = <> ) {
        $line =~ /\w+ (\w+) \w+ (\w+)/;
        print "Second word: '$1' on line $..\n" if defined $1;
        print "Fourth word: '$2' on line $..\n" if defined $2;
    }

Exercise: change it to capture the first and the third word of a sentence.

Search and replace

    s/t/u/      replace t (thymine) with u (uracil), once only
    s/t/u/g     g = global, so scan the whole string
    s/t/u/gi    global and case insensitive

What about the following?

    s/^\s+//
    s/\s+$//
    s/\s+$/ /g    (where g stands for global)

Exercise: write a Perl script that reads in the DNA sequences from FastaDNA1file.txt and replaces all the thymine bases with the corresponding uracil bases.

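The substitution operator can be tried on a string before applying it to a file. The sketch below uses two hypothetical helpers, dna_to_rna and trim; the first runs two case-specific substitutions so that the case of each base is kept, which is a design choice of this example and not something the slide prescribes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace every thymine with uracil, keeping the case.
sub dna_to_rna {
    my ($seq) = @_;
    $seq =~ s/t/u/g;
    $seq =~ s/T/U/g;
    return $seq;
}

# Hypothetical helper: strip leading and trailing white space.
sub trim {
    my ($text) = @_;
    $text =~ s/^\s+//;      # remove white space at the start
    $text =~ s/\s+$//;      # remove white space at the end
    return $text;
}

print dna_to_rna('gattaca'), "\n";
print '[', trim("   GATTACA \n"), "]\n";
```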
Splits and joins

To transform strings into arrays: split. Line 1 looks like:

    192a8,The Sanger Centre,GGGTTCCGATTTCCAA,CCTTAGGCCAAATTAAGGCC

Consider the following code:

    chomp($line = <>);    # read the line into $line
    @fields = split ',', $line;
    ($clone, $laboratory, $left_oligo, $right_oligo) = split ',', $line;

This reads in line 1 and puts each part before a delimiter (e.g. 192a8) into an element of the array.

To transform arrays (lists) into strings: join

    $tab = join "\t", @fields;

    192a8 The Sanger Centre GGGTTCCGATTTCCAA CCTTAGGCCAAATTAAGGCC

    # initialize an array
    my @perlFunc = ("substr", "grep", "defined", "undef");
    my $perlFunc = join " ", @perlFunc;
    print "Perl Functions: $perlFunc\n";

See example split_file.pl

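A self-contained round trip through split and join, as a sketch. The input line is hard-coded here in place of reading it with <>, which is an assumption made so the example runs without an input file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hard-coded stand-in for a line read from a file.
my $line = "192a8,The Sanger Centre,GGGTTCCGATTTCCAA,CCTTAGGCCAAATTAAGGCC\n";
chomp $line;

# split takes the delimiter first, then the string to cut up.
my ($clone, $laboratory, $left_oligo, $right_oligo) = split /,/, $line;
print "Clone $clone from $laboratory\n";

# join glues the fields back together with a new delimiter.
my $tab = join "\t", $clone, $laboratory, $left_oligo, $right_oligo;
print "$tab\n";
```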
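Other useful functions

Unpack syntax:

    @triplets = unpack("a3" x (length($line)/3), $line);

Frame shift (1 position to the right):

    @triplets = unpack('a' . "a3" x (length($line)/3), $line);

Here the leading 'a' takes the first base as an element of its own, so the triplets start at the second base.

See Unpack_codons.pl
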
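A sketch of reading codons in different frames with unpack. The helper codons is hypothetical and differs slightly from the slide: it skips the offset bases with the "x" template code so they do not appear in the result, and it drops an incomplete trailing triplet.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: cut a sequence into triplets, starting $offset
# bases in (0, 1 or 2). Incomplete trailing bases are dropped.
sub codons {
    my ($seq, $offset) = @_;
    $offset ||= 0;
    my $count = int((length($seq) - $offset) / 3);
    return () if $count < 1;
    # "x$offset" skips that many bytes; "a3" x $count takes $count triplets.
    my $template = ($offset ? "x$offset" : '') . 'a3' x $count;
    return unpack $template, $seq;
}

my $dna = 'atgaaccccaac';                 # made-up sequence
print join(' ', codons($dna)),    "\n";   # frame starting at base 1
print join(' ', codons($dna, 1)), "\n";   # frame shifted 1 to the right
```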
Questions

- Modify file_size_bases_only.pl to count the number of bases for a file in EMBL format and one in NCBI format.
- Using FASTADNA1.txt, extract the sections of the descriptor line into appropriate scalar variables.
- Assuming the DNA sequence of FastaDNA1file.txt is the complementary (anti-sense) strand, print the mRNA produced when the primary strand (sequence) is transcribed.

Exam Questions

- Perl is an important bioinformatics language. Explain the main features of Perl that make it appealing to the field of bioinformatics.
- Write a script that extracts the gene ID and gene name from the descriptor header of a DNA FASTA file.
- Write a Perl script that only reads and prints DNA sequences from a FASTA file.
- Write a script that reads in the DNA sequences from two FASTA files (assume the sequence length is the same for both) and determines the number of alignment matches to non-matches.

FastaDNA1file.txt

Write a script that reads in the DNA sequences from two FASTA files (assume the sequence length is the same for both) and illustrates the number of alignment matches to non-matches.