BINF 634 FALL 2013 LECTURE 8 Modules and Maps1 Thanks to John Grefenstette for Many of These Slides !! Topics Midterm Discussions Program 2 Discussions.

BINF 634 FALL 2013 LECTURE 8 Modules and Maps1 Thanks to John Grefenstette for Many of These Slides !! Topics Midterm Discussions Program 2 Discussions Perl Modules Module Getopt::Std Range Operator Restriction Maps Parsing Rebase File Quiz 4

Midterm Discussions Here are my solutions BINF 634 FALL 2013 LECTURE 8 Modules and Maps2

Program Discussions Here are my solutions BINF 634 FALL 2013 LECTURE 8 Modules and Maps3

4 Perl Modules - I You can put useful subroutines into modules for future use Only write and debug subroutines once Creating a module: Put subroutines into a file with extension.pm, e.g. MySubs.pm Include " 1; " as last line in file

BINF 634 FALL 2013 LECTURE 8 Modules and Maps5 Perl Modules - II Using a module: Include statement: use Mysubs; note: leave off the.pm from the module name in use statement Then use any subroutine defined in MySubs.pm Modules are included at compile time Example: BeginPerlBioinformatics.pm contains subroutines from Tisdall textbook (see web page) http://oreilly.com/catalog/9780596000806/

BINF 634 FALL 2013 LECTURE 8 Modules and Maps6 Perl Modules Where does Perl find the modules? Perl looks in the list of directories in built-in array @INC includes paths to standard modules such as strict.pm and warnings.pm @INC automatically includes the current directory You can prepend directories to @INC by using the -I switch % perl -I"/mypath/libdir" prog.pl You also can tell Perl where to look as follows: use lib "/mypath"; use MyFASTASubs; # file is: "/mypath/MyFASTASubs.pm" use MyRandomSubs;

% cat RAND.pm # # subroutines for using random numbers # sub random_uniform { my ($lower, $upper) = @_; return $lower + rand ($upper - $lower); }; sub random_int { my ($lower, $upper) = @_; return int($lower + rand ($upper - $lower + 1)); } sub shuffle_array { my (@a) = @_; my $length = scalar @a; for (my $i = 0; $i < $length; $i++) { my $j = random_int($i, $length-1); ($a[$i], $a[$j]) = ($a[$j], $a[$i]); } return @a; } sub shuffle_string { my ($s) = @_; my @a = split "", $s; return join "", shuffle_array(@a); } 1; 7BINF 634 FALL 2013 LECTURE 8 Modules and Maps

#!/usr/bin/perl use strict; use warnings; # file: test_myrand.pl use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples'; use RAND; my $dna = "AAAAAAAATTTTTTTTTTGGGGGGGGCCCCCCCCC"; print "$dna\n"; $dna = shuffle_string($dna); print "$dna\n"; % test_myrand.pl AAAAAAAATTTTTTTTTTGGGGGGGGCCCCCCCCC GCTACGCAGAGGTTAGTCATCTCGACATACTGCTT % test_myrand.pl AAAAAAAATTTTTTTTTTGGGGGGGGCCCCCCCCC TCTTCGGAACCTACGACGTTCTATACAATGCTGGG 8BINF 634 FALL 2013 LECTURE 8 Modules and Maps

9 use Getopt::Std; Suppose prog.pl contains the statements: use Getopt::Std; my %opts = (); # a hash to hold the options getopts('bnf:', \%opts); # -b & -n are boolean flags # -f takes an argument (indicated by ":") Hash keys will be x (where x is the switch name) with key values the value of the argument or 1 if no argument is specified. The following set %opts = (n=>1, f=>"infile") % prog.pl -f infile -n % prog.pl -nf infile % prog.pl -n -f infile

#!/usr/bin/perl use strict; # file: opt.pl use Getopt::Std; my %options=(); # empty hash of options # "d:" means d takes an argument getopts("od:fF",\%options); print "-o $options{o}\n" if defined $options{o}; print "-d $options{d}\n" if defined $options{d}; print "-f $options{f}\n" if defined $options{f}; print "-F $options{F}\n" if defined $options{F}; print "Unprocessed by Getopt::Std:\n" if @ARGV; foreach (@ARGV) { print "$_\n"; } % opt.pl -Fd5 -o 123 foo -o 1 -d 5 -F 1 Unprocessed by Getopt::Std: 123 foo 10BINF 634 FALL 2013 LECTURE 8 Modules and Maps processing of options stopped when it saw 123

BINF 634 FALL 2013 LECTURE 8 Modules and Maps11 Restriction Maps (Tisdall Ch. 9) Restriction Enzymes are “chemical scissors” that cut DNA in specific places specified by short sequence patterns HindII binds and cuts at GTY^RAC Y = C or T (pyrimidines) R = A or G (purines) ^ indicates cleave site A Restriction Map of a DNA sequence shows all the positions at which a given Restriction Enzyme will cut Several hundred restriction enzymes are known Database REBASE: http://www.neb.com

REBASE version 104 bionet.104 =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= REBASE, The Restriction Enzyme Database http://rebase.neb.com Copyright (c) Dr. Richard J. Roberts, 2001. All rights reserved. =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Rich Roberts Mar 30 2001 AaaI (XmaIII) C^GGCCG AacI (BamHI) GGATCC AaeI (BamHI) GGATCC AagI (ClaI) AT^CGAT AaqI (ApaLI) GTGCAC AarI CACCTGCNNNN^ AarI ^NNNNNNNNGCAGGTG AatI (StuI) AGG^CCT AatII GACGT^C AauI (Bsp1407I) T^GTACA AbaI (BclI) T^GATCA AbeI (BbvCI) CC^TCAGC AbeI (BbvCI) GC^TGAGG AbrI (XhoI) C^TCGAG AcaI (AsuII) TTCGAA AcaII (BamHI) GGATCC AcaIII (MstI) TGCGCA AcaIV (HaeIII) GGCC AccI GT^MKAC # The IUB ambiguity codes R = G or A Y = C or T M = A or C K = G or T S = G or C W = A or T B = not A (C or G or T) D = not C (A or G or T) H = not G (A or C or T) V = not T (A or C or G) N = A or C or G or T 12 BINF 634 FALL 2013 LECTURE 8 Modules and Maps

13 Restriction Maps Problem: Given a DNA sequence and the name of a restriction enzyme in REBASE Find all location of the restriction sites on the DNA Pseudocode: read in rebase data from file “bionet” create hash with key = enzyme name and value = regular expression for each user query (in the form of an enzyme name) if query is defined in the hash get positions that match regular expr in DNA report positions, if any }

BINF 634 FALL 2013 LECTURE 8 Modules and Maps14 Range Operator start.. end In list context, returns a slice of an array: print "@A[10.. 20]\n"; # print items 10 thru 20 In a scalar context, return true if $. (line number) is currently between start and end values: # print out first 20 lines of file: while ( ) { print if (1.. 20); } # print out up to first line containing "foo": while ( ) { print if (1.. /foo/); }

BINF 634 FALL 2013 LECTURE 8 Modules and Maps15 Finding matches with regular expressions Reminder: If we use the global modifier g, then pos($string) returns position after the match: $string = "ATCGCATGGAA"; $string =~ /T.G/g; print "$& ends at position ", pos($string)-1, "\n"; $string =~ /T.G/g; print "$& ends at position ", pos($string)-1, "\n"; Output: TCG ends at position 3 TGG ends at position 8

# Example 9-2 Subroutine to parse a REBASE datafile # parseREBASE-Parse REBASE bionet file # # A subroutine to return a hash where # key = restriction enzyme name # value = blank-separated recognition site and regular expression sub parseREBASE { my($rebasefile) = @_; use BeginPerlBioinfo; # see Chapter 6 about this module # Declare variables my @rebasefile = ( ); my %rebase_hash = ( ); my $name; my $site; my $regexp; # Read in the REBASE file # note: incorrect on p. 191 my $rebase_filehandle = open_file($rebasefile); 16BINF 634 FALL 2013 LECTURE 8 Modules and Maps

while( ) { # note: incorrect on p. 191 # Discard header lines next if ( 1.. /Rich Roberts/ ); # Discard blank lines next if /^\s*$/; # Split the fields my @fields = split( " ", $_); # Get and store the name and the recognition site # grab just the first and last fields $name = shift @fields; $site = pop @fields; # Translate the recognition sites to regular expressions $regexp = IUB_to_regexp($site); # Store the data into the hash $rebase_hash{$name} = "$site $regexp"; } # Return the hash containing the reformatted REBASE data return %rebase_hash; } 17BINF 634 FALL 2013 LECTURE 8 Modules and Maps

############################################################# # Subroutine IUB_to_regexp # # Translate IUB ambiguity codes to regular expressions # # The IUB ambiguity codes # R = G or A Y = C or T M = A or C K = G or T # S = G or C W = A or T B = not A (C or G or T) # D = not C (A or G or T) H = not G (A or C or T) # V = not T (A or C or G) N = A or C or G or T sub IUB_to_regexp { my($iub) = @_; my $regular_expression = ''; my %iub2character_class = ( A => 'A', C => 'C', G => 'G', T => 'T', R => '[GA]', Y => '[CT]', M => '[AC]', K => '[GT]', S => '[GC]', W => '[AT]', B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]', N => '[ACGT]', ); # Remove the ^ signs $iub =~ s/\^//g; # Translate the iub sequence for (my $i = 0; $i < length($iub); ++$i ){ $regular_expression.= $iub2character_class{substr($iub,$i,1)}; } return $regular_expression; } 18BINF 634 FALL 2013 LECTURE 8 Modules and Maps

Sample Main Program to test parseREBASE subroutine: #!/usr/bin/perl use warnings; use strict; my %rebasehash = (); %rebasehash = parseREBASE("bionet"); for my $key (sort keys %rebasehash) { my ($site, $regexp) = split " ", $rebasehash{$key}; print "enzyme = $key site = $site regexp = $regexp\n"; } sub parseREBASE {... } Output: enzyme = AaaI site = C^GGCCG regexp = CGGCCG enzyme = AacI site = GGATCC regexp = GGATCC enzyme = AaeI site = GGATCC regexp = GGATCC enzyme = AagI site = AT^CGAT regexp = ATCGAT enzyme = AaqI site = GTGCAC regexp = GTGCAC enzyme = AarI site = ^NNNNNNNNGCAGGTG regexp = [ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT]GCAGGTG 19BINF 634 FALL 2013 LECTURE 8 Modules and Maps

20 Restriction Maps Problem: Given a DNA sequence and the name of a restriction enzyme in REBASE Find all location of the restriction sites on the DNA Pseudocode: read in rebase data from file “bionet” create hash with key = enzyme name and value = regular expression for each user query (in the form of an enzyme name) if query is defined in the hash get positions that match regular expr in DNA report positions, if any }

##################################################################### # Subroutine match_positions # Find locations of a match of a regular expression in a string # Return an array of positions where the regular expression # appears in the string sub match_positions { my($regexp, $sequence) = @_; my @positions = ( ); # Determine positions of regular expression matches while ( $sequence =~ /$regexp/ig ) { push @positions, pos($sequence) - length($&) + 1; } return @positions; } # sample main program: my $dna = “ATTGATAAGTCA”; my @pos = match_positions(“T[GC]A”, $dna); print “@pos\n”; exit; Output: 3 10 21BINF 634 FALL 2013 LECTURE 8 Modules and Maps N.B. – pos returns the position relative to 0 immediately after the match. So the match gives us the position at the start of the match relative to 1.

BINF 634 FALL 2013 LECTURE 8 Modules and Maps22 Restriction Maps Problem: Given a DNA sequence and the name of a restriction enzyme in REBASE Find all location of the restriction sites on the DNA Pseudocode: read in rebase data from file “bionet” create hash with key = enzyme name and value = regular expression for each user query (in the form of an enzyme name) if query is defined in the hash get positions that match regular expr in DNA report positions, if any }

#!/usr/bin/perl -w # Make restriction map from user queries on names of # restriction enzymes use strict; use warnings; use BeginPerlBioinfo; # see Chapter 6 about this module # Declare and initialize variables my %rebase_hash = ( ); my @file_data = ( ); my $query = ''; my $dna = ''; my $recognition_site = ''; my $regexp = ''; my @locations = ( ); # Read in the file "sample.dna" @file_data = get_file_data("sample.dna"); # Extract the DNA sequence data from the file "sample.dna" $dna = extract_sequence_from_fasta_data(@file_data); # Get the REBASE data into a hash, from file "bionet" %rebase_hash = parseREBASE("bionet"); 23BINF 634 FALL 2013 LECTURE 8 Modules and Maps

# Prompt user for restriction enzyme names, create restriction map do { print "Search for what restriction site (or quit)?: "; $query = ; chomp $query; # Exit if empty query if ($query =~ /^\s*$/ ) { exit;} # Perform the search in the DNA sequence if ( exists $rebase_hash{$query} ) { ($recognition_site, $regexp) = split " ", $rebase_hash{$query}; # Create the restriction map @locations = match_positions($regexp, $dna); # Report the restriction map to the user if (@locations) { print "Searching for $query $recognition_site $regexp\n"; print "A restriction site for $query at locations:\n"; print join(" ", @locations), "\n"; } else { print "A restriction enzyme $query is not in the DNA:\n"; } print "\n"; } until ( $query =~ /quit/ ); exit; 24BINF 634 FALL 2013 LECTURE 8 Modules and Maps

25 % rebase.pl Search for what restriction site (or quit)?: AceI Searching for AceI G^CWGC GC[AT]GC A restriction site for AceI at locations: 54 94 582 660 696 702 840 855 957 Search for what restriction site (or quit)?: AceII A restriction enzyme AceII is not in the DNA: Search for what restriction site (or quit)?: Acc36I Searching for Acc36I ^NNNNNNNNGCAGGT [ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT]GCAGGT A restriction site for Acc36I at locations: 166 Search for what restriction site for (or quit)?: quit % The Program in Action!!

BINF 634 FALL 2013 LECTURE 8 Modules and Maps26 HW Read Tisdall Appendix Chapter 9 and Appendix B

BINF 634 FALL 2013 LECTURE 8 Modules and Maps1 Thanks to John Grefenstette for Many of These Slides !! Topics Midterm Discussions Program 2 Discussions.

Similar presentations

Presentation on theme: "BINF 634 FALL 2013 LECTURE 8 Modules and Maps1 Thanks to John Grefenstette for Many of These Slides !! Topics Midterm Discussions Program 2 Discussions."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

BINF 634 FALL 2013 LECTURE 8 Modules and Maps1 Thanks to John Grefenstette for Many of These Slides !! Topics Midterm Discussions Program 2 Discussions.

Similar presentations

Presentation on theme: "BINF 634 FALL 2013 LECTURE 8 Modules and Maps1 Thanks to John Grefenstette for Many of These Slides !! Topics Midterm Discussions Program 2 Discussions."— Presentation transcript:

Similar presentations

About project

Feedback