Computer Programming for Biologists, Class 5, Nov 20th, 2014, Karsten Hokamp


Overview
- Project
- Program Exit
- Random numbers
- Regular Expressions

Project
Task 1: Report length of a sequence in Fasta format
Understand the problem, consider input/output:
    >Tmsb10
    ATGGCAGACAAGCCGGACATGGGGGAAATCGCCAGCTTCGATAAGGCCAAGCTGAAGAAA
    ACCGAGACGCAGGAGAAGAACACCCTGCCGACCAAAGAGACCATTGAACAGGAAAAGAGG
    AGTGAAATCTCCTAA
-> Sequence length is 135 bp.

Project
Problems:
1. File contains header line
2. Sequence contains line-breaks
    >Tmsb10
    ATGGCAGACAAGCCGGACATGGGGGAAATCGCCAGCTTCGATAAGGCCAAGCTGAAGAAA
    ACCGAGACGCAGGAGAAGAACACCCTGCCGACCAAAGAGACCATTGAACAGGAAAAGAGG
    AGTGAAATCTCCTAA

Project
Steps:
1. Read in file content (line-by-line)
2. Remove line-breaks
3. Skip header line
4. Concatenate sequence into one long string
5. Calculate and report length

Project
Steps:
    # 1. Read in file content (line-by-line)
    while ($input = <>) {
        # 2. Remove line-breaks
        chomp $input;
        # 3. Skip header line
        # 4. Concatenate sequence into one long string
        $sequence .= $input;
    }
    # 5. Calculate and report length
    $length = length($sequence);
    print "Sequence length: $length bp\n";

Project
    # 1. Read in file content (line-by-line)
    while ($input = <>) {
        # 2. Remove line-breaks
        chomp $input;
        # 3. Skip header line (check for '>' in first position)
        # extract first character:
        $first = substr $input, 0, 1;
        # is it a '>'?
        if ($first eq '>') {
            # skip this line
            next;
        }
        $sequence .= $input;
    }


Project (alternative version)
    # 1. Read in file content (line-by-line)
    while ($input = <>) {
        # 2. Remove line-breaks
        chomp $input;
        # 3. Skip header line (check for '>' in first position)
        # extract first character:
        $first = substr $input, 0, 1;
        # is it a '>'?
        unless ($first eq '>') {
            # this must be part of the sequence
            $sequence .= $input;
        }
    }

Project
    # 1. Read in file content (line-by-line)
    while ($input = <>) {
        # 2. Remove line-breaks
        chomp $input;
        # 3. Skip header line (check for '>' in first position)
        # extract first character:
        $first = substr $input, 0, 1;
        # is it a '>'?
        if ($first eq '>') {
            # skip this line
            next;
        }
        # 4. Concatenate sequence into one long string
        $sequence .= $input;
    }
    # 5. Calculate and report length
    $length = length($sequence);
    print "Sequence length: $length bp\n";

Project
    # Suggestions for the start of the script:
    # make sure a file has been provided
    unless (…) {
        die "Please specify file name on command line!";
    }
    # initialise sequence variable
    $sequence = '';
    # 1. Read in file content (line-by-line)
    while ($input = <>) {
        …
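
The condition of the 'unless' test is missing from this transcript. A common Perl idiom for this kind of check is to test @ARGV, the array of command-line arguments; that is an assumption here, not necessarily what the slide showed. A minimal sketch, with the check wrapped in a hypothetical helper (check_file_argument) and in eval so that both cases can be tried without ending the program; the file name is made up:

```perl
use strict;
use warnings;

# Hypothetical helper: run the check against a given list of arguments
# and return the error message (empty string if the check passes).
sub check_file_argument {
    local @ARGV = @_;            # simulate the command line
    eval {
        # an empty array is false in a boolean test
        unless (@ARGV) {
            die "Please specify file name on command line!\n";
        }
    };
    return $@;
}

my $without_file = check_file_argument();
my $with_file    = check_file_argument('Tmsb10.fasta');   # made-up file name

print "No argument:  $without_file";
print "One argument: ", ($with_file eq '' ? "ok\n" : $with_file);
```

In a real script the 'unless (@ARGV) { die … }' lines would stand on their own at the top, without the helper or the eval.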

Exiting a program
1. automatic exit at end of script
2. explicit exit with value:
       exit 0;   # default
   or
       exit 1;   # normally indicates an error
3. exit on failure:
       die "error message";
   (a "\n" at the end of the message suppresses the line number)

Exiting a program
Example:
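
The effect of the trailing "\n" on a die message can be sketched as follows. This is an illustration only: die is called inside eval so that the message can be inspected instead of ending the program, and the helper name message_from_die and the message text are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: capture the message produced by die.
sub message_from_die {
    my ($text) = @_;
    eval { die $text };
    return $@;
}

my $with_newline    = message_from_die("Cannot open file\n");
my $without_newline = message_from_die("Cannot open file");

# With "\n" the message is reported as written;
# without it, Perl appends the script name and line number.
print "With newline:    $with_newline";
print "Without newline: $without_newline";
```

Outside of eval, die prints the message to STDERR and ends the program with a non-zero exit value.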

Project
Practical

Regular Expressions
- constructs that describe patterns
- powerful methods for text processing
- search for patterns in a string
- search and extract patterns
- search and replace patterns
- pattern at which to split a string

Regular Expressions
Examples:
- Look for a motif in a DNA/protein sequence
- Find low complexity repeats and mask with x's
- Find start of sequence string in GenBank record
- Extract addresses from a web-page
- Replace strings, e.g.: … with …

Regular Expressions
Find a pattern in a string (stored in a variable):
    $sequence = 'ataggctagctaga';
    if ( $sequence =~ /ctag/ ) {
        print 'Found!';
    }
Parts of the match:
    $sequence   string in which to search
    =~          binding operator
    ctag        pattern
    / /         delimiters

Regular Expressions
Find a pattern in a string (stored in a variable):
    $_ = 'ataggctagctaga';
    if ( /ctag/ ) {
        print 'Found!';
    }
Without binding the pattern delimiters // to a variable, the regular expression works on $_.

Regular Expressions
Search modifier: i = make search case-insensitive
    $sequence = 'ataggctagctaga';
    if ( $sequence =~ /TAG/i ) {
        print 'Found!';
    }

Regular Expressions
Metacharacters:
    ^ = match at the beginning of a line
    $ = match at the end of the line
    . = match any character (except newline)
    \ = escape the next metacharacter
    $sequence = ">sequence1\natgacctggaataggat";
    if ( $sequence =~ /^>/ ) {
        # line starts with '>'
        print 'Found Fasta header!';
    }
/\.$/ matches a dot at the end of a line
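
A small sketch of the four metacharacters in action. The test strings are made up for illustration; each match result is stored as 1 (matched) or 0 (not matched).

```perl
use strict;
use warnings;

my $header = '>sequence1';
my $text   = 'The end.';

# ^ anchors the match at the start of the string
my $starts_with_gt = $header =~ /^>/   ? 1 : 0;
# \. is a literal dot, $ anchors the match at the end
my $ends_with_dot  = $text   =~ /\.$/  ? 1 : 0;
# an unescaped . matches any character, here the 'c'
my $any_character  = 'acgt'  =~ /a.g/  ? 1 : 0;
# an escaped . only matches a real dot, which 'acgt' lacks
my $literal_dot    = 'acgt'  =~ /a\.g/ ? 1 : 0;

print "starts with '>': $starts_with_gt\n";
print "ends with '.':   $ends_with_dot\n";
print "a.g matches:     $any_character\n";
print "a\\.g matches:    $literal_dot\n";
```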

Project
Exercise: Modify your course project (sequanto.pl) to detect the header line with a regular expression, instead of using 'substr' and 'eq' to check the first character.
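
One possible approach to this exercise, as a sketch rather than the course solution: the 'substr' and 'eq' test is replaced by a match against /^>/. To keep the example self-contained, the Tmsb10 record from Task 1 is read from a string through an in-memory filehandle instead of from the command line; the sequence lines are assumed to be 60, 60 and 15 characters long.

```perl
use strict;
use warnings;

# Example input from Task 1, stored in a string
my $fasta = <<'END_FASTA';
>Tmsb10
ATGGCAGACAAGCCGGACATGGGGGAAATCGCCAGCTTCGATAAGGCCAAGCTGAAGAAA
ACCGAGACGCAGGAGAAGAACACCCTGCCGACCAAAGAGACCATTGAACAGGAAAAGAGG
AGTGAAATCTCCTAA
END_FASTA

# Open the string as if it were a file (in-memory filehandle)
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!\n";

my $sequence = '';
while (my $input = <$fh>) {
    # remove line-break
    chomp $input;
    # skip header line: '>' at the very start of the line
    next if $input =~ /^>/;
    # concatenate sequence into one long string
    $sequence .= $input;
}
close $fh;

my $length = length($sequence);
print "Sequence length: $length bp\n";   # prints "Sequence length: 135 bp"
```

In the actual script the loop would keep reading from <> as before; only the header test changes.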

Regular Expressions
Matching repetition:
    a?     = match 'a' 1 or 0 times
    a*     = match 'a' 0 or more times, i.e., any number of times
    a+     = match 'a' 1 or more times, i.e., at least once
    a{n,m} = match at least "n" times, but not more than "m" times
    a{n,}  = match "n" or more times
    a{n}   = match exactly "n" times
    $sequence =~ /a{5,}/; # finds repeats of 5 or more 'a's
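
A short sketch of the {n,} quantifier on a made-up sequence. Parentheses around the pattern are used here to return the matched text; the run of seven 'a's is found, while the run of three is too short.

```perl
use strict;
use warnings;

# made-up sequence: a run of 7 'a's and a run of 3 'a's
my $sequence = 'gctaaaaaaatcgaaatt';

# in list context the match returns the text captured by ()
my ($long_run) = $sequence =~ /(a{5,})/;

# the shorter run alone does not satisfy a{5,}
my $short_only = 'tcgaaatt' =~ /a{5,}/ ? 1 : 0;

print "Long run: $long_run (", length($long_run), " bases)\n";
print "Short run matches a{5,}: $short_only\n";
```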

Regular Expressions
Search for classes of characters:
    \d = match a digit character
    \w = match a word character (alphanumeric and '_')
    \D = match a non-digit character
    \W = match a non-word character
    \s = match a whitespace character
    \S = match a non-whitespace character
    $date = '30 Jan 2009';
    if ( $date =~ /\d{1,2} \w+ \d{2,4}/ ) {
        print 'Correct date format!';
    }
also matches '1 February 09'
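
To see which strings the date pattern accepts, it can be tried on a few candidates. The last two candidates are made up for illustration and are not from the slides.

```perl
use strict;
use warnings;

my @candidates = (
    '30 Jan 2009',
    '1 February 09',
    'Jan 30 2009',    # month first: no "digits space word space digits" order
    '2009-01-30',     # no spaces at all
);

# keep only the strings that match the pattern
my @matching = grep { /\d{1,2} \w+ \d{2,4}/ } @candidates;

print "Matches: $_\n" for @matching;
```

Only the first two candidates are kept.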

Regular Expressions
Match special characters:
    \t  = matches a tabulator (tab)
    \b  = matches a word boundary
    \r  = matches a carriage return
    \n  = matches UNIX newline
    \cM = matches Control-M (line-ending in Windows)
    while (my $line = <>) {
        if ($line =~ /\cM/) {
            warn "Windows line-ending detected!";
        }
    }

Regular Expressions
Search for range of characters:
    [ ] = match any one of the characters specified within these brackets
    -   = specifies a range, e.g. [a-z], or [0-9]
    ^   = match any character not in the list, e.g. [^A-Z]
    $sequence = 'ataggctapgctaga';
    if ( $sequence =~ /[^acgt]/ ) {
        print "Sequence contains non-DNA character: $&";
    }
$& is a special variable containing the last pattern match
$` and $' contain the strings before and after the match
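
The three special variables can be inspected together on the sequence from this slide. A minimal sketch; the variable names $before, $match and $after are chosen here for illustration.

```perl
use strict;
use warnings;

my $sequence = 'ataggctapgctaga';
my ($before, $match, $after) = ('', '', '');

if ($sequence =~ /[^acgt]/) {
    # copy the special variables right after the match,
    # since the next successful match overwrites them
    ($before, $match, $after) = ($`, $&, $');
}

print "Before match: $before\n";   # ataggcta
print "Match:        $match\n";    # p
print "After match:  $after\n";    # gctaga
```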

Regular Expressions
Search and replace (substitute): s/pattern1/pattern2/
    $sequence = 'ataggctagctaga';
    $rna = $sequence;
    $rna =~ s/t/u/;    # -> 'auaggctagctaga'
Only the first match will be replaced!

Regular Expressions
Modifiers for substitution:
    i = case-insensitive
    g = global
    s = '.' also matches newline
    $sequence = 'ataggctagctaga';
    $rna = $sequence;
    $rna =~ s/t/u/g;    # -> 'auaggcuagcuaga'
replaces all 't' in the line with 'u'
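
A substitution also returns the number of replacements it made, which gives a quick way to compare the plain and the global form. A small sketch using the sequence from the slides; the variable names are chosen for illustration.

```perl
use strict;
use warnings;

my $sequence = 'ataggctagctaga';

# without g: only the first 't' is replaced
my $first_only = $sequence;
my $count_first = ($first_only =~ s/t/u/);

# with g: every 't' is replaced
my $rna = $sequence;
my $count_all = ($rna =~ s/t/u/g);

print "$first_only ($count_first replacement)\n";
print "$rna ($count_all replacements)\n";
```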

Regular Expressions
Example: Clean up a sequence string:
    $sequence = " 1 ataggctagctagat 16 ttagagctagta ";
    $sequence =~ s/[^actg]//g;    # -> 'ataggctagctagatttagagctagta'
Deletes everything that is not a, c, t, or g.

Regular Expressions
Extract matched patterns:
- put patterns in parentheses
- \1, \2, \3, … refer back to ()'s within the pattern match
- $1, $2, $3, … refer back to ()'s after the pattern match
    $sequence = ">test\natgtagagctagta";
    if ($sequence =~ /^>(.*)/) {
        $id = $1;
    }
or
    $… =~ s/…/\1 at \2 dot \3/;
    print "Changed address to $1 at $2 dot $3\n";
changes … to 'kahokamp at tcd dot ie'
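
Part of the second example is missing from this transcript, so the following is a guess at its shape, not the original code. It assumes an input address of the form name@domain.tld (inferred from the result 'kahokamp at tcd dot ie'), a variable called $address and a three-group pattern; all three are assumptions. The replacement uses $1, $2, $3, which is the form Perl recommends on the right-hand side of a substitution.

```perl
use strict;
use warnings;

# Example 1: extract the identifier from a Fasta header
my $sequence = ">test\natgtagagctagta";
my $id = '';
if ($sequence =~ /^>(.*)/) {
    $id = $1;              # '.' stops at the newline, so $id is 'test'
}

# Example 2 (assumed input and pattern): rewrite an address
my $address = 'kahokamp@tcd.ie';
if ($address =~ s/(.+)\@(.+)\.(.+)/$1 at $2 dot $3/) {
    print "Changed address to $1 at $2 dot $3\n";
}

# Example 3: \1 refers back to a group inside the same pattern,
# here to find a character that is immediately repeated
my ($base) = 'ataggctagctaga' =~ /(.)\1/;

print "Identifier: $id\n";
print "Address:    $address\n";
print "Doubled:    $base$base\n";
```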

Project
Practical

Regular Expressions in split
Split a string into its single characters:
    … = split //, $string;
Split input line at tabs:
    … = split /\t/, $input_line;
Default splits $_ on whitespace:
    while (<>) {
        … = split;
        …
    }
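
The array names on this slide did not survive in the transcript. The sketch below uses made-up names (@characters, @columns, @fields) and made-up input to show the three forms of split.

```perl
use strict;
use warnings;

# an empty pattern splits between every character
my $string = 'atgc';
my @characters = split //, $string;

# split a tab-separated line into columns (made-up data)
my $input_line = "Tmsb10\t135\tmouse";
my @columns = split /\t/, $input_line;

# without arguments, split works on $_ and splits on whitespace;
# leading whitespace and trailing empty fields are dropped
$_ = "  ata  ggc tag \n";
my @fields = split;

print "Characters: @characters\n";
print "Columns:    ", scalar(@columns), "\n";
print "Fields:     @fields\n";
```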