Published by Sydney Eaton. Modified over 9 years ago.
1
Lecture 7: Perl pattern handling features
2
Pattern Matching
– Recall that =~ is the pattern matching operator.
– A first simple match example: print "A methionine amino acid is found" if $AA =~ /m/;
– It means: if the string $AA contains the letter m, print the message.
– What is inside the / / is the pattern, and =~ is the pattern matching symbol.
– It could also be written as: if ($AA =~ /m/) { print "A methionine amino acid is found"; }
– See Met.pl
3
Pattern Matching
– If we want to check for the start codon we could use: if ($seq =~ /ATG/) { print "a start codon was found on line number\n"; }
– To ignore case, write /ATG/i (the i modifier makes the match case-insensitive).
– If we want to see whether there is an A or T or G or C in the sequence, use a character class: $seq =~ /[ATGC]/
– The main way to express a Boolean OR between whole patterns is the | symbol: if ($dna =~ /GAATTC|AAGCTT/) { print "EcoRI or HindIII site found!!!"; }
– (Note: GAATTC is the recognition site of the restriction enzyme EcoRI and AAGCTT is that of HindIII; both are important DNA sequences.)
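The three ideas on this slide (the i modifier, a character class, and alternation with |) can be tried together in a short sketch. The helper name describe_sequence and the two test strings are made up for illustration; only the patterns come from the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: list which of the slide's patterns match a sequence.
sub describe_sequence {
    my ($seq) = @_;
    my @found;
    push @found, 'start codon'      if $seq =~ /ATG/i;           # i = ignore case
    push @found, 'nucleotide'       if $seq =~ /[ATGC]/;         # any ONE of A, T, G, C
    push @found, 'restriction site' if $seq =~ /GAATTC|AAGCTT/;  # | = either pattern
    return @found;
}

for my $seq ('ccatgGAATTCtt', 'xyz') {
    my @found = describe_sequence($seq);
    print "$seq: ", (@found ? join(', ', @found) : 'nothing found'), "\n";
}
```

Note that /[ATGC]/ is case-sensitive here, so the lower-case string xyz (and any all-lower-case sequence) does not match it unless the i modifier is added.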
4
Sequence size example: File_size_2
    #!/usr/bin/perl
    # file size2.pl
    $length = 0;
    $lines = 0;
    while (<>) {
        chomp;
        # N refers to any nucleotide (refer to http://blast.ncbi.nlm.nih.gov/blastcgihelp.shtml)
        $length = $length + length $_ if $_ =~ /[GATCNgatcn]/;
        $lines = $lines + 1;
    }
    print "LENGTH = $length\n";
    print "LINES = $lines\n";
The above is a modification of the file length example so that it counts only input lines that contain a G, A, T, C or N. However, this will lead to problems for FASTA files, as the descriptor line will be included: why?
5
Pattern Matching
A Boolean NOT, for example to see whether a string contains letters that are not vowels, is expressed by putting the ^ symbol at the start of a set of characters:
– if ($seq =~ /[^aeiou]/) { print "contains a character that is not a vowel"; }
More flexible pattern syntax: it is quite common to check for words or numbers, so Perl provides shorthand:
– /[0-9]/ or /\d/ is a digit
– A word character is /[a-zA-Z0-9_]/ and is represented by /\w/ (word)
– /\s/ represents a white space character
– Inverting the case of the letter reverses the meaning, e.g. /\S/ (non white space)
A more complete list of what are referred to as "metacharacters" is shown in the next slide (each must of course be used with =~ in an expression).
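A point that is easy to miss is that /[^aeiou]/ asks "is there at least one non-vowel?", which is not the same as "is there no vowel at all?". The sketch below contrasts the two and also exercises \d, \s and \W; the helper name classify and the sample strings are illustrative assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the character classes from the slide to one string.
sub classify {
    my ($text) = @_;
    my %result = (
        has_non_vowel => ($text =~ /[^aeiou]/ ? 1 : 0),  # at least one character that is not a vowel
        no_vowel      => ($text !~ /[aeiou]/  ? 1 : 0),  # no vowel anywhere (note !~ and no ^)
        has_digit     => ($text =~ /\d/       ? 1 : 0),
        has_space     => ($text =~ /\s/       ? 1 : 0),
        has_non_word  => ($text =~ /\W/       ? 1 : 0),  # upper case inverts \w
    );
    return \%result;
}

for my $text ('gattaca', 'rhythm', 'dt249 4') {
    my $c = classify($text);
    print "$text => ", join(', ', map { "$_=$c->{$_}" } sort keys %$c), "\n";
}
```

For gattaca the first test is true and the second is false, which is why the negated class alone cannot be used to report "no vowel".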
6
Pattern matching: metacharacters
Metacharacter   Description
.               Any character except newline
\.              Full stop character
^               The beginning of a line
$               The end of a line
\w              Any word character (non-punctuation, non-white space)
\W              Any non-word character
\s              White space (spaces, tabs, carriage returns)
\S              Non-white space
\d              Any digit
\D              Any non-digit
You can also specify the number of times a pattern occurs (single, multiple or a specific multiple). More information is available under metacharacters and other regular expressions (note: groups such as (abc) and the backreferences \1 and \2 are important for comparing sets of characters).
7
Pattern matching: Quantifiers
Quantifier   Description
?            0 or 1 occurrence
+            1 or more occurrences
*            0 or more occurrences
{N}          Exactly N occurrences
{N,M}        Between N and M occurrences
{N,}         At least N occurrences
{0,M}        No more than M occurrences (the short form {,M} is only accepted from Perl 5.34 onwards)
8
Pattern matching: Quantifiers
Consider the following pattern:
– DT249 4 (your class code) consists of one or more word characters, then a space, and then a digit, so the match is: =~ /\w+\s\d/
If the sequence has the following format:
– Pu-C-X(40-80)-Pu-C, where Pu (a purine) is [AG] and X is any of [ATGC]
– $sequence =~ /[AG]C[GATC]{40,80}[AG]C/;
See Quantify.pl
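The two patterns above can be checked against a few hand-made strings. This is only a sketch: the helper names looks_like_class_code and has_motif are invented here, and the test sequences are built with the x repetition operator rather than taken from real data.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two patterns from the slide.
sub looks_like_class_code {
    my ($text) = @_;
    return $text =~ /\w+\s\d/ ? 1 : 0;      # word characters, one white space, one digit
}

sub has_motif {
    my ($sequence) = @_;
    return $sequence =~ /[AG]C[GATC]{40,80}[AG]C/ ? 1 : 0;   # Pu-C-X(40-80)-Pu-C
}

# 'T' x 50 builds a string of fifty T characters.
my %samples = (
    'spacer of 50'  => 'AC' . ('T' x 50)  . 'GC',
    'spacer of 10'  => 'AC' . ('T' x 10)  . 'GC',
    'spacer of 100' => 'AC' . ('T' x 100) . 'GC',
);

print "DT249 4 is a class code: ", looks_like_class_code('DT249 4'), "\n";
for my $name (sort keys %samples) {
    print "$name: ", has_motif($samples{$name}), "\n";
}
```

Only the 50-nucleotide spacer falls inside the {40,80} range; the other two strings offer no other purine-C pair to start or end on, so they fail.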
9
Pattern Matching
Anchors determine where in a sequence to look for a pattern:
– The start of line anchor ^ (note: the same symbol acts as the Boolean NOT only when it is the first character inside a set, as in [^aeiou]). /^>/ matches only lines beginning with >
– The end of line anchor $. />$/ matches only lines whose last character is >
– /^$/ : what does this mean?
– The boundary anchor \b, e.g. matching a word exactly: /\bword\b/ finds "word" only as a whole word and not as a run of letters inside a longer word
– The non-boundary anchor is \B: /\Bwor\B/ finds the letters wor inside words like unworthy and trustworthy, but not in worthy or word, where wor sits at a word boundary
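A small sketch makes the difference between \b and \B concrete. The helper names and the word list are assumptions for illustration; /\Bwor\B/ is used because the letters wor (not word) are what occur inside unworthy and trustworthy.

```perl
use strict;
use warnings;

# Hypothetical helpers, one per anchor.
sub is_descriptor { my ($line) = @_; return $line =~ /^>/       ? 1 : 0; }  # starts with >
sub ends_with_gt  { my ($line) = @_; return $line =~ />$/       ? 1 : 0; }  # ends with >
sub whole_word    { my ($text) = @_; return $text =~ /\bword\b/ ? 1 : 0; }  # "word" on its own
sub inside_word   { my ($text) = @_; return $text =~ /\Bwor\B/  ? 1 : 0; }  # "wor" with letters on both sides

for my $text ('a word here', 'swords', 'worthy', 'unworthy', 'trustworthy') {
    printf "%-12s whole_word=%d inside_word=%d\n",
        $text, whole_word($text), inside_word($text);
}
print "descriptor line: ", is_descriptor('>gi|171361'), "\n";
print "sequence line:   ", is_descriptor('GCAGCG'), "\n";
```

In worthy the w is the first character, so there is a word boundary in front of it and \B fails; in unworthy the preceding n removes that boundary.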
10
Sequence size example: modified File_size_2
    #!/usr/bin/perl
    # file size2.pl
    $length = 0;
    $lines = 0;
    while (<>) {
        chomp;
        # ^ and $ anchor the pattern so that the whole line must consist of nucleotide codes
        $length = $length + length $_ if $_ =~ /^[GATCNgatcn]+$/;
        # Alternative: $length += length if /^[GATCN]+$/i;
        $lines = $lines + 1;
    }
    print "LENGTH = $length\n";
    print "LINES = $lines\n";
Refer to the DNA sequence codes to see the meaning of A…N.
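To see the effect of the anchored pattern without preparing an input file, the same counting logic can be applied to a list of lines held in memory. This is a sketch under that assumption; the subroutine name sequence_length and the three sample lines are not part of the lecture scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (total sequence length, number of lines seen).
sub sequence_length {
    my @lines  = @_;
    my $length = 0;
    my $count  = 0;
    for my $line (@lines) {
        chomp $line;
        # Count a line only if every character is a nucleotide code (either case).
        $length += length $line if $line =~ /^[GATCN]+$/i;
        $count++;
    }
    return ($length, $count);
}

my @record = (
    ">gi|171361, Saccharomyces cerevisiae, cystathionine gamma-lyase\n",
    "GCAGCG\n",
    "acgtn\n",
);
my ($length, $count) = sequence_length(@record);
print "LENGTH = $length\n";
print "LINES = $count\n";
```

The descriptor line is still counted as a line but contributes nothing to the length, because it contains characters outside the set.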
11
Extracting Patterns
The second aspect of Perl pattern handling is pattern extraction. Consider a sequence descriptor like:
> M185580, clone 333a, complete sequence
– M18… is the sequence ID
– Clone 33a, com…. : optional comments
We need to store some elements of the descriptor line:
– $seq =~ /(\S+)/; the part of the match inside the parentheses is extracted and put into the variable $1
12
Extracting patterns
    #!/usr/bin/perl -w
    # Demonstrates the effect of parentheses.
    while ( my $line = <> ) {
        # Test the match itself: after a failed match, $1 and $2 still hold
        # the values captured by the last successful match.
        if ( $line =~ /\w+ (\w+) \w+ (\w+)/ ) {
            print "Second word: '$1' on line $..\n";
            print "Fourth word: '$2' on line $..\n";
        }
    }
– Change it to capture the first and the third word of a sentence.
More examples in ExtractExample1.pl
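Applied to the FASTA-style descriptor from the previous slide, extraction might look like the sketch below. The pattern here is one possible choice rather than the one used in the lecture scripts: it skips the leading > and optional space first, because a bare /(\S+)/ would capture the > itself. The subroutine name parse_descriptor is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split a descriptor line into (ID, comment).
# Returns an empty list if the line is not a descriptor.
sub parse_descriptor {
    my ($line) = @_;
    chomp $line;
    # >, optional space, ID up to the first comma or space, then the rest of the line
    if ($line =~ /^>\s*([^\s,]+),?\s*(.*)$/) {
        return ($1, $2);
    }
    return;
}

my ($id, $comment) = parse_descriptor("> M185580, clone 333a, complete sequence\n");
print "ID: $id\n";
print "Comment: $comment\n";
```

Returning the captures from inside the if block avoids relying on $1 and $2 after a match that may have failed.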
13
Search/replace and transliteration
– s/t/u/ replaces t (thymine) with u (uracil), once only
– s/t/u/g (g = global) scans the whole string
– s/t/u/gi (global and case insensitive)
– What about the following: s/^\s+// and s/\s+$// and s/\s+/ /g (where g stands for global)?
The transliteration search and replace function:
– $seq =~ tr/ATGC/TACG/; gets the complement of a string of characters (the normal search and replace works in a different way to the tr function)
Refer to SearchReplace.pl
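The difference between the three substitution forms and tr is easiest to see on one short string. The sketch below wraps each operation in a small subroutine (the names are invented for illustration); each works on a copy of its argument, so the caller's string is left unchanged.

```perl
use strict;
use warnings;

# Hypothetical helpers: $s is a copy of the argument, so it is safe to modify.
sub replace_first { my ($s) = @_; $s =~ s/t/u/;          return $s; }  # first t only
sub replace_all   { my ($s) = @_; $s =~ s/t/u/g;         return $s; }  # every lower-case t
sub replace_all_i { my ($s) = @_; $s =~ s/t/u/gi;        return $s; }  # every t or T, always written as u
sub complement    { my ($s) = @_; $s =~ tr/ATGC/TACG/;   return $s; }  # character-by-character mapping

my $dna = 'attTga';
print "s/t/u/   : ", replace_first($dna), "\n";
print "s/t/u/g  : ", replace_all($dna),   "\n";
print "s/t/u/gi : ", replace_all_i($dna), "\n";
print "tr       : ", complement('GATTACA'), "\n";
```

With the i modifier the upper-case T is also replaced, but by a lower-case u, so the original case is lost. tr maps all four letters in a single pass, which is why it can swap A with T and G with C without one replacement undoing another.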
14
Search/replace/extract
Write a program that removes the > from the FASTA descriptor line and assigns each element to appropriate variables.
Example: Fastafile_replace.txt
>gi|171361, Saccharomyces cerevisiae, cystathionine gamma-lyase
GCAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC
GCTACAGAGCCAACCCGGTGGACAAACTCGAAGTCATTGTGGACCGAATGAGGCTCAATAACGAGATTAGCG
ACCTCGAAGGCCTGCGCAAATATTTCCACTCCTTCCCGGGTGCTCCTGAGTTGAACCCGCTTAGAGACTCCG
AAATCAACGACGACTTCCACCAGTGGGCCCAGTGTGACCGCCACACTGGACCCCATACCACTTCTTTTTGTT
ATTCTTAAATATGTTGTAACGCTATGTAATTCCACCCTTCATTACTAATAATTAGCCATTCACGTGATCTCA
GCCAGTTGTGGCGCCACACTTTTTTTTCCATAAAAATCCTCGAGGAAAAGAAAAGAAAAAAATATTTCAGTT
ATTTAAAGCATAAGATGCCAGGTAGATGGAACTTGTGCCGTGCCAGATTGAATTTTGAAAGTACAATTGAGG
CCTATACACATAGACATTTGCACCTTATACATATAC
15
Exercises
Write a script that:
1. Confirms whether the user has input the code in the following format: Classcode_yearcode(papercode), e.g. dt249 4(w203c)
2. Many important DNA sequences have specific patterns, e.g. TATA. Write a script to find the position of this sequence in a FASTA file sequence.
16
Exercises
3. Write a script that can find the reverse complement of a DNA sequence without using the tr function. (Hint: a global search and replace will give an incorrect answer.)
4. Coding regions begin with the AUG (ATG) codon and end with a stop codon. Write a Perl script that extracts a coding sequence from a FASTA file.
17
Exercise
5. Modify the sequence size example from earlier to allow the user to input a file name and determine its length.
18
Exam Questions
– Perl is an important bioinformatics language. Explain the main features of Perl that make it suitable for bioinformatics. (10 marks)
– Write a Perl script that illustrates its pattern matching, extraction and substitution abilities. (6 marks)
(Refer to the assignment and previous papers' Perl scripts.)