Patterns, Patterns and More Patterns

Exploiting Perl's built-in regular expression technology.

Pattern Basics

What is a regular expression?

    /even/

    eleven          # matches at end of word
    eventually      # matches at start of word
    even Stevens    # matches twice: an entire word and within a word
    heaven          # 'a' breaks the pattern
    Even            # uppercase 'E' breaks the pattern
    EVEN            # all uppercase breaks the pattern
    eveN            # uppercase 'N' breaks the pattern
    leave           # not even close!
    Steve not here  # space between 'Steve' and 'not' breaks the pattern

What makes regular expressions so special?

    my $pattern = "even";
    my $string  = "do the words heaven and eleven match?";

    if ( find_it( $pattern, $string ) ) {
        print "A match was found.\n";
    } else {
        print "No match was found.\n";
    }

find_it the Perl way

    my $string = "do the words heaven and eleven match?";

    if ( $string =~ /even/ ) {
        print "A match was found.\n";
    } else {
        print "No match was found.\n";
    }
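The slides call find_it but never define it. As a rough sketch only, a hand-written version might look like the code below; the body of find_it is an assumption made for illustration, not the authors' code. It spells out how to search (try every starting position), whereas the pattern match only states what to find.

```perl
use strict;
use warnings;

# Hypothetical find_it: the slides call it but do not show its body.
# This version checks every possible starting position by hand.
sub find_it {
    my ( $pattern, $string ) = @_;
    my $last_start = length($string) - length($pattern);
    for my $start ( 0 .. $last_start ) {
        return 1 if substr( $string, $start, length $pattern ) eq $pattern;
    }
    return 0;
}

my $string = "do the words heaven and eleven match?";

# The hand-written search and the pattern match should agree.
print find_it( "even", $string ) ? "A match was found.\n" : "No match was found.\n";
print $string =~ /even/          ? "A match was found.\n" : "No match was found.\n";
```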

Maxim 7.1 Use a regular expression to specify what you want to find, not how to find it

Introducing The Pattern Metacharacters

The + repetition metacharacter

    /T+/

    T
    TTTTTT
    TT

    t
    this and that
    hello
    tttttttttt

More repetition

    /ela+/

    elation
    elaaaaaaaa

    /(ela)+/

    elaelaelaela
    ela

    /\(ela\)+/

    (ela))))))
    (ela(ela(ela
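To see how grouping and escaping change what + repeats, the example strings from this slide can be run against each pattern. This is a small sketch; the helper name matches is an assumption chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the text matches the compiled pattern.
sub matches {
    my ( $regex, $text ) = @_;
    return $text =~ $regex ? 1 : 0;
}

my @checks = (
    [ qr/ela+/,     'elaaaaaaaa' ],      # + repeats only the 'a'
    [ qr/(ela)+/,   'elaelaelaela' ],    # + repeats the whole group
    [ qr/\(ela\)+/, '(ela))))))' ],      # escaped parentheses are literal; + repeats ')'
    [ qr/\(ela\)+/, '(ela(ela(ela' ],    # no ')' in this string at all
);

for my $check (@checks) {
    my ( $regex, $text ) = @$check;
    printf "%-16s %-14s %s\n", $regex, $text,
        matches( $regex, $text ) ? 'match' : 'no match';
}
```

Note that printing a compiled pattern shows Perl's internal form of it, which wraps the pattern text in extra parentheses.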

The | alternation metacharacter

    /0|1|2|3|4|5|6|7|8|9/

    there's a 0 in here somewhere
    My telephone number is:

    /a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z/
    /A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z/

Metacharacter shorthand and character classes

    /0|1|2|3|4|5|6|7|8|9/
    /[ ]/

    /[aeiou]/
    /a|e|i|o|u/
    /[^aeiou]/

    /[ ]/
    /[0-9]/
    /[a-z]/
    /[A-Z]/
    /[-A-Z]/

    /[BCFHST][aeiou][mty]/

    Bat    Hog
    Hit    Can
    Tot    May
    Cut    bat
    Say

    /[BbCcFfHhSsTt][aeiou][mty]/
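The three-character class pattern can be checked against the slide's sample words with a short loop. This is a sketch for illustration; the helper name matches_class is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the text contains one character from each
# of the three classes, in order.
sub matches_class {
    my ($text) = @_;
    return $text =~ /[BCFHST][aeiou][mty]/ ? 1 : 0;
}

for my $word (qw( Bat Hog Hit Can Tot May Cut bat Say )) {
    printf "%-4s %s\n", $word,
        matches_class($word) ? 'matches' : 'does not match';
}
```

A word fails as soon as any one position falls outside its class: 'Hog' ends in 'g', 'May' starts with 'M', and 'bat' starts with a lowercase letter.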

More metacharacter shorthand

    /[0-9]/
    /\d/

    /[a-zA-Z0-9_]/
    /\w/

    /\s/
    /[^ \t\n\r\f]/
    /\D/

    /[0-9][^ \t\n\r\f][a-zA-Z0-9_][a-zA-Z0-9_][^0-9]/
    /\d\s\w\w\D/

Maxim 7.2 Use regular expression shorthand to reduce the risk of error

More repetition

    /\w+/
    /\d\s\w+\D/
    /\d\s\w{2}\D/
    /\d\s\w{2,4}\D/
    /\d\s\w{2,}\D/
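The counted forms {2}, {2,4} and {2,} set lower and upper limits on repetition. The sketch below tries /\d\s\w{2,4}\D/ on a few strings; the sample strings and the helper name matches_counted are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping one of the slide's patterns:
# a digit, a whitespace character, two to four word characters, a non-digit.
sub matches_counted {
    my ($text) = @_;
    return $text =~ /\d\s\w{2,4}\D/ ? 1 : 0;
}

for my $text ( '7 ab!', '7 a!', '7 abcdef!', '7 ab9' ) {
    printf "%-10s %s\n", $text,
        matches_counted($text) ? 'matches' : 'does not match';
}
```

The third string is worth a second look: the pattern is not anchored, so four of the six letters satisfy \w{2,4} and the fifth letter satisfies \D. An upper limit restricts what the quantifier consumes, not how long the surrounding text may be.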

The ? and * optional metacharacters

    /[Bb]art?/

    bar
    Bar
    bart
    Bart

    /[Bb]art*/

    bar
    Bart
    barttt
    Bartttttttttttttttttttt!!!

    /p*/
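Because * allows zero occurrences, a pattern such as /p*/ succeeds on any string, including the empty string. A minimal sketch, with helper names that are assumptions for illustration:

```perl
use strict;
use warnings;

# Hypothetical helpers, one per pattern from the slide.
sub matches_bart_star {
    my ($text) = @_;
    return $text =~ /[Bb]art*/ ? 1 : 0;
}

sub matches_p_star {
    my ($text) = @_;
    return $text =~ /p*/ ? 1 : 0;
}

# 'bar' matches /[Bb]art*/ because zero 't' characters are allowed.
print "bar: ", matches_bart_star('bar') ? "match" : "no match", "\n";

# /p*/ matches even when there is no 'p' anywhere, or no text at all.
for my $text ( 'apple', 'hello', '' ) {
    print "'$text': ", matches_p_star($text) ? "match" : "no match", "\n";
}
```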

The any character metacharacter

    /[Bb]ar./

    barb
    bark
    barking
    embarking
    barn
    Bart
    Barry

    /[Bb]ar.?/

Anchors

The \b word boundary metacharacter

    /\bbark\b/

    That dog sure has a loud bark, doesn't it?
    That dog's barking is driving me crazy!

    /\Bbark\B/
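The difference between \b and \B can be checked directly. In this sketch the helper names and the extra test word 'embarking' (borrowed from the earlier any-character slide) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: 'bark' as a whole word.
sub has_word_bark {
    my ($text) = @_;
    return $text =~ /\bbark\b/ ? 1 : 0;
}

# Hypothetical helper: 'bark' with word characters on both sides.
sub has_embedded_bark {
    my ($text) = @_;
    return $text =~ /\Bbark\B/ ? 1 : 0;
}

my @samples = (
    "That dog sure has a loud bark, doesn't it?",
    "That dog's barking is driving me crazy!",
    "embarking",
);

for my $text (@samples) {
    printf "%-45s word: %d  embedded: %d\n", $text,
        has_word_bark($text), has_embedded_bark($text);
}
```

In 'barking' the 'b' sits at a word boundary, so neither pattern matches the second sentence: \bbark\b fails at the trailing 'i' and \Bbark\B fails at the leading space.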

The ^ start-of-line metacharacter

    /^Bioinformatics/

    Bioinformatics, Biocomputing and Perl is a great book.
    For a great introduction to Bioinformatics, see Moorhouse, Barry (2004).

The $ end-of-line metacharacter

    /Perl$/

    My favourite programming language is Perl
    Is Perl your favourite programming language?

    /^$/

The Binding Operators

    #! /usr/bin/perl -w

    # The 'simplepat' program - simple regular expression example.

    while ( <> )
    {
        print "Got a blank line.\n" if /^$/;
        print "Line has a curly brace.\n" if /[}{]/;
        print "Line contains 'program'.\n" if /\bprogram\b/;
    }

Results from simplepat...

    $ perl simplepat simplepat
    Got a blank line.
    Line contains 'program'.
    Got a blank line.
    Line has a curly brace.
    Line contains 'program'.
    Line has a curly brace.

To Match or Not To Match...

    if ( $line =~ /^$/ )

    if ( $line !~ /^$/ )
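The two binding operators are opposites: =~ is true when the pattern matches and !~ is true when it does not. A small sketch, assuming a hypothetical helper named count_blank_and_nonblank:

```perl
use strict;
use warnings;

# Hypothetical helper: counts lines that do and do not match /^$/.
sub count_blank_and_nonblank {
    my @lines = @_;
    my ( $blank, $nonblank ) = ( 0, 0 );
    for my $line (@lines) {
        $blank++    if $line =~ /^$/;    # pattern matches: a blank line
        $nonblank++ if $line !~ /^$/;    # pattern does not match
    }
    return ( $blank, $nonblank );
}

my ( $blank, $nonblank ) =
    count_blank_and_nonblank( "first\n", "\n", "second\n", "\n" );
print "Blank: $blank, non-blank: $nonblank\n";
```

Since $ can match just before a trailing newline, /^$/ treats a line holding only "\n" as blank.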

Remembering What Was Matched

    /(ela)+/

    #! /usr/bin/perl -w

    # The 'grouping' program - demonstrates the effect
    #                          of parentheses.

    while ( my $line = <> )
    {
        $line =~ /\w+ (\w+) \w+ (\w+)/;

        print "Second word: '$1' on line $..\n" if defined $1;
        print "Fourth word: '$2' on line $..\n" if defined $2;
    }

Results from grouping...

    This is a sample file for use with the grouping program that is included with the Patterns Patterns and More Patterns chapter from Bioinformatics, Biocomputing and Perl.

    $ perl grouping test.group.data
    Second word: 'is' on line 1.
    Fourth word: 'sample' on line 1.
    Second word: 'grouping' on line 2.
    Fourth word: 'that' on line 2.
    Second word: 'and' on line 4.
    Fourth word: 'Patterns' on line 4.

The grouping2 program

    #! /usr/bin/perl -w

    # The 'grouping2' program - demonstrates the effect of
    #                           more parentheses.

    while ( my $line = <> )
    {
        $line =~ /\w+ ((\w+) \w+ (\w+))/;

        print "Three words: '$1' on line $..\n" if defined $1;
        print "Second word: '$2' on line $..\n" if defined $2;
        print "Fourth word: '$3' on line $..\n" if defined $3;
    }

Results from grouping2...

    Three words: 'is a sample' on line 1.
    Second word: 'is' on line 1.
    Fourth word: 'sample' on line 1.
    Three words: 'grouping program that' on line 2.
    Second word: 'grouping' on line 2.
    Fourth word: 'that' on line 2.
    Three words: 'and More Patterns' on line 4.
    Second word: 'and' on line 4.
    Fourth word: 'Patterns' on line 4.

Maxim 7.3 When working with nested parentheses, count the opening parentheses, starting with the leftmost, to determine which parts of the pattern are assigned to which after-match variables
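Counting opening parentheses from the left can be confirmed by collecting the captures in a list. In list context a successful match returns the captured groups in the same order as $1, $2, $3, and a failed match returns an empty list. The sketch below reuses the grouping2 pattern; the helper name nested_captures is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured groups of the grouping2 pattern,
# or an empty list when the line does not match.
sub nested_captures {
    my ($line) = @_;
    return $line =~ /\w+ ((\w+) \w+ (\w+))/;
}

my @groups = nested_captures("This is a sample file");
if (@groups) {
    print "Group 1 (first opening parenthesis):  '$groups[0]'\n";
    print "Group 2 (second opening parenthesis): '$groups[1]'\n";
    print "Group 3 (third opening parenthesis):  '$groups[2]'\n";
}
```

Testing the returned list also avoids printing stale or undefined values when a line does not match.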

Greedy By Default

    /(.+), Bart/

    Get over here, now, Bart! Do you hear me, Bart?

    Get over here, now, Bart! Do you hear me

    /(.+?), Bart/

    Get over here, now
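The greedy and non-greedy captures can be compared on the slide's own sentence. A minimal sketch; the helper names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: .+ takes as much as possible before ', Bart'.
sub greedy_capture {
    my ($text) = @_;
    my ($captured) = $text =~ /(.+), Bart/;
    return $captured;
}

# Hypothetical helper: .+? takes as little as possible before ', Bart'.
sub lazy_capture {
    my ($text) = @_;
    my ($captured) = $text =~ /(.+?), Bart/;
    return $captured;
}

my $text = "Get over here, now, Bart! Do you hear me, Bart?";
print "Greedy:     ", greedy_capture($text), "\n";
print "Non-greedy: ", lazy_capture($text),   "\n";
```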

Alternative Pattern Delimiters

    /usr/bin/perl

    //\w+/\w+/\w+/

    /\/\w+\/\w+\/\w+/
    /\/(\w+)\/(\w+)\/(\w+)/

    m#/\w+/\w+/\w+#
    m#/(\w+)/(\w+)/(\w+)#

    m{ }
    m< >
    m[ ]
    m( )

    /even/
    m/even/
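Choosing a delimiter other than the slash removes the need to escape every slash in a path. The sketch below applies the slide's m#...# pattern to /usr/bin/perl; the helper name split_path is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: captures the three path components.
# With '#' as the delimiter the slashes need no backslashes.
sub split_path {
    my ($path) = @_;
    return $path =~ m#/(\w+)/(\w+)/(\w+)#;
}

my @parts = split_path('/usr/bin/perl');
print join( ', ', @parts ), "\n";

# The escaped form with slash delimiters captures the same parts.
my @escaped = '/usr/bin/perl' =~ /\/(\w+)\/(\w+)\/(\w+)/;
print join( ', ', @escaped ), "\n";
```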

Another Useful Utility

    sub biodb2mysql {
        #
        # Given:  a date in DD-MMM-YYYY format.
        # Return: a date in YYYY-MM-DD format.
        #

        my $original = shift;

        $original =~ /(\d\d)-(\w\w\w)-(\d\d\d\d)/;

        my ( $day, $month, $year ) = ( $1, $2, $3 );

biodb2mysql subroutine, cont.

        $month = '01' if $month eq 'JAN';
        $month = '02' if $month eq 'FEB';
        $month = '03' if $month eq 'MAR';
        $month = '04' if $month eq 'APR';
        $month = '05' if $month eq 'MAY';
        $month = '06' if $month eq 'JUN';
        $month = '07' if $month eq 'JUL';
        $month = '08' if $month eq 'AUG';
        $month = '09' if $month eq 'SEP';
        $month = '10' if $month eq 'OCT';
        $month = '11' if $month eq 'NOV';
        $month = '12' if $month eq 'DEC';

        return $year . '-' . $month . '-' . $day;
    }
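One possible variation, not taken from the slides, replaces the twelve if statements with a hash lookup and returns nothing when the input does not look like a date. The names biodb2mysql_lookup and %month_number are hypothetical, and accepting lowercase month names through uc is an added assumption.

```perl
use strict;
use warnings;

# Hypothetical lookup table: month abbreviation to two-digit number.
my %month_number = (
    JAN => '01', FEB => '02', MAR => '03', APR => '04',
    MAY => '05', JUN => '06', JUL => '07', AUG => '08',
    SEP => '09', OCT => '10', NOV => '11', DEC => '12',
);

# Hypothetical variant of biodb2mysql.
# Given:  a date in DD-MMM-YYYY format.
# Return: a date in YYYY-MM-DD format, or undef if the input is not recognised.
sub biodb2mysql_lookup {
    my ($original) = @_;

    # Only use the captures if the match succeeded.
    my ( $day, $month, $year ) = $original =~ /(\d{2})-(\w{3})-(\d{4})/
        or return;

    # Assumption: accept 'jul' as well as 'JUL'.
    my $number = $month_number{ uc $month } or return;

    return "$year-$number-$day";
}

print biodb2mysql_lookup('03-JUL-2004'), "\n";
```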

Alternate biodb2mysql patterns

    /(\d{2})-(\w{3})-(\d{4})/

    /(\d+)-(\w+)-(\d+)/

Substitutions: Search And Replace

    s/these/those/

    Give me some of these, these, these and these. Thanks.

    Give me some of those, these, these and these. Thanks.

    s/these/those/g

    Give me some of those, those, those and those. Thanks.

    s/these/those/gi

Substituting for whitespace

    s/^\s+//

    s/\s+$//

    s/\s+/ /g
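Applied in sequence, the three substitutions strip leading whitespace, strip trailing whitespace, and collapse each remaining run of whitespace to a single space. A minimal sketch, assuming a hypothetical helper named tidy_whitespace:

```perl
use strict;
use warnings;

# Hypothetical helper applying the three substitutions in order.
sub tidy_whitespace {
    my ($text) = @_;
    $text =~ s/^\s+//;      # remove leading whitespace
    $text =~ s/\s+$//;      # remove trailing whitespace
    $text =~ s/\s+/ /g;     # collapse inner runs to one space
    return $text;
}

my $messy = "   Patterns,\t Patterns   and More   Patterns \n";
print "'", tidy_whitespace($messy), "'\n";
```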

Finding A Sequence

    gccacagatt acaggaagtc atatttttag acctaaatca ctatcctcta tctttcagca 60
    agaaaagaac atctacttgg tttcgttccc tatccaagat tcagatggtg aaacgagtga 120
    tcatgcacct gatgaacgtg caaaaccaca gtcaagccat gacaaccccg atctacagtt 180
    .
    gcatctgtct gtatccgcaa cctaaaatca gtgctttaga agccgtggac attgatttag 6660
    gtacgtgtag agcaagactt aaatttgtac gtgaaactaa aagccagttg tatgcattag 6720
    ctttttcaat ttgtataacg tataacgtat ataatgttaa ttttagattt tcttacaact 6780
    tgatttaaaa gtttaagatt catgtattta tattttatgg ggggacatga atagatct 6838

    if ( $sequence =~ /acttaaatttgtacgtg/ )

    s/\s*\d+$//

    s/\s*//g

The prepare_embl program

    #! /usr/bin/perl -w

    # The 'prepare_embl' program - getting embl.data
    #                              ready for use.

    while ( <> )
    {
        s/\s*\d+$//;
        s/\s*//g;

        print;
    }

    $ perl prepare_embl embl.data > embl.data.out
    $ wc embl.data.out
    embl.data.out

The match_embl program

    #! /usr/bin/perl -w

    # The 'match_embl' program - check a sequence against
    #                            the EMBL database entry stored in the
    #                            embl.data.out data-file.

    use constant TRUE => 1;

    open EMBLENTRY, "embl.data.out"
        or die "No data-file: have you executed prepare_embl?\n";

    my $sequence = <EMBLENTRY>;

    close EMBLENTRY;

    print "Length of sequence is: ", length $sequence, " characters.\n";

    while ( TRUE )
    {

The match_embl program, cont.

        print "\nPlease enter a sequence to check.\n Type 'quit' to end: ";

        my $to_check = <>;

        chomp( $to_check );

        $to_check = lc $to_check;

        if ( $to_check =~ /^quit$/ )
        {
            last;
        }

        if ( $sequence =~ /$to_check/ )
        {
            print "The EMBL data extract contains: $to_check.\n";
        }
        else
        {
            print "No match found for: $to_check.\n";
        }
    }
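match_embl interpolates whatever the user types straight into the pattern, so any regular expression metacharacters in the input are treated as pattern syntax rather than as text. If a purely literal comparison is wanted, one option is to wrap the variable in \Q...\E, which escapes metacharacters. This is a sketch of that option, not part of the original program; the helper name contains_literal is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $to_check appears literally in $sequence.
# \Q...\E quotes any metacharacters in the interpolated text.
sub contains_literal {
    my ( $sequence, $to_check ) = @_;
    return $sequence =~ /\Q$to_check\E/ ? 1 : 0;
}

my $sequence = "gccacagattacaggaagtc";

for my $to_check ( 'acagg', 'a.agg' ) {
    my $as_pattern = $sequence =~ /$to_check/ ? 'match' : 'no match';
    my $as_literal = contains_literal( $sequence, $to_check ) ? 'match' : 'no match';
    print "$to_check: as pattern = $as_pattern, as literal text = $as_literal\n";
}
```

Leaving the input unquoted is also a reasonable design choice when users are expected to type patterns on purpose.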

Results from match_embl...

    $ perl match_embl
    Length of sequence is: 6838 characters.

    Please enter a sequence to check.
     Type 'quit' to end: aaatttgggccc
    No match found for: aaatttgggccc.

    Please enter a sequence to check.
     Type 'quit' to end: caGGGGGgg
    No match found for: caggggggg.

    Please enter a sequence to check.
     Type 'quit' to end: tcatgcacctgatgaacgtgcaaaaccacagtcaagccatga
    The EMBL data extract contains: tcatgcacctgatgaacgtgcaaaaccacagtcaagccatga.

    Please enter a sequence to check.
     Type 'quit' to end: quit

Where To From Here