Bioinformatics Programming 1 EE, NCKU Tien-Hao Chang (Darby Chang)



Regular Expression


Text patterns and matches
A regular expression, or regex for short, is a pattern describing a certain amount of text
In these slides, regular expressions are highlighted, as in regex
–it is the most basic pattern, simply matching the literal text regex
I will use the term “string” for the text that the regular expression is applied to; strings are highlighted as well

Literal characters
The most basic regular expression consists of a single literal character, e.g. a
–it matches the first occurrence of that character in the string
–on Jack is a boy, it matches the a in Jack, not the later a
In these slides, I’ll sometimes use a shorter notation
–a: Jack is a boy
Eleven characters have special meanings
–[ \ ^ $ . | ? * + ( )
–they are called metacharacters
–escape a metacharacter with a backslash
  use 1\+1=2 to match 1+1=2
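The literal-matching and escaping rules above can be tried out in a few lines. This is only a sketch using Python's standard re module (the slides themselves lean towards Perl); the sample sentence around 1+1=2 is made up for illustration.

```python
import re

text = "Jack is a boy"

# A literal pattern matches the first occurrence only.
first = re.search(r"a", text)
first_pos = first.start()  # index of the 'a' in "Jack"

# Escape a metacharacter by hand with a backslash...
plus_match = re.search(r"1\+1=2", "so 1+1=2 holds")

# ...or let re.escape() build the escaped pattern from a literal string.
auto_match = re.search(re.escape("1+1=2"), "so 1+1=2 holds")

print(first_pos, plus_match.group(), auto_match.group())
```

Without the backslash, 1+1=2 would mean "one or more 1s, then 1=2", which does not match the text 1+1=2.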

Character classes/sets
Match only one out of several characters
–to match an a or an e, use [ae]
–you could use this in gr[ae]y to match gray or grey
–a character class matches only a single character
–gr[ae]y will not match graay or graey
–the order of the characters inside the class does not matter
Use a hyphen to specify a range of characters
–[0-9] matches a single digit between 0 and 9
–combine ranges: [0-9a-fA-F] matches a single hexadecimal digit
–combine ranges and single characters: [0-9a-fxA-FX] matches a hexadecimal digit or the letter X
A caret after the opening square bracket negates the class
–q[^x] matches qu in question but does not match Iraq, since there is no character after the q for the negated character class to match
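A quick check of the character-class examples, sketched with Python's re module. The test strings are the ones from the slide plus the made-up input 0x1F.

```python
import re

# A class matches exactly one character, so graay and graey are skipped.
gray = re.findall(r"gr[ae]y", "gray grey graay graey")

# Ranges can be combined; here every hexadecimal digit in the input is found.
hex_digits = re.findall(r"[0-9a-fA-F]", "0x1F")

# A negated class still has to consume a character.
q_question = re.search(r"q[^x]", "question")  # matches "qu"
q_iraq = re.search(r"q[^x]", "Iraq")          # None: nothing follows the q

print(gray, hex_digits, q_question.group(), q_iraq)
```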

Shorthand character classes
\d matches a single character that is a digit
\w matches a word character
–alphanumeric characters plus underscore
\s matches a whitespace character
–includes tabs and line breaks
–\S matches any character that \s does not
The actual characters matched by the shorthands depend on the software you’re using
–$ man perlre
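As the slide warns, the exact sets behind the shorthands depend on the engine. The sketch below shows how they behave in Python's re module on a made-up sample line; in Python 3, \d and \w also match non-ASCII digits and letters unless the re.ASCII flag is given.

```python
import re

sample = "gene_1\tlength 42\n"

digits = re.findall(r"\d", sample)       # single digits
words = re.findall(r"\w+", sample)       # runs of letters, digits, underscore
spaces = re.findall(r"\s", sample)       # tab, space, line feed
non_spaces = re.findall(r"\S+", sample)  # runs of anything that is not whitespace

print(digits, words, spaces, non_spaces)
```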

Non-printable characters
Use special character sequences to put non-printable characters in a regex
–\t for tab (ASCII 0x09)
–\r for carriage return (0x0D)
–\n for line feed (0x0A)
Remember that Windows text files use \r\n to terminate lines, while UNIX text files use \n
Use \xFF to match a specific character by its hexadecimal index in the character set
–\xA9 matches the copyright symbol (in Latin-1)
Use \uFFFF for a Unicode character (if supported)
–\u20AC matches the euro sign (€)
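The Windows/UNIX line-ending difference is a common source of bugs when parsing sequence files. The sketch below, using Python's re module and made-up sample text, splits on \r?\n so that both conventions give the same result. It also shows the \x and \u escapes (here U+00A9, the copyright sign, and U+20AC, the euro sign).

```python
import re

windows_text = "line one\r\nline two\r\n"
unix_text = "line one\nline two\n"

# \r?\n accepts an optional carriage return before the line feed.
win_lines = re.split(r"\r?\n", windows_text)
unix_lines = re.split(r"\r?\n", unix_text)

# Hexadecimal and Unicode escapes inside a pattern.
copyright_found = re.search(r"\xA9", "\u00a9 2010")
euro_found = re.search(r"\u20AC", "price: \u20ac5")

print(win_lines == unix_lines, copyright_found.group(), euro_found.group())
```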

The dot
The dot, ., matches (almost) any character
The dot matches a single character, except line break characters
–short for [^\n]
–gr.y matches gray, grey, gr%y, etc.
Most regex engines have a “dot matches all” or “single line” mode that makes the dot match any single character, including line breaks
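In Python's re module the “dot matches all” mode is the re.DOTALL flag. The following sketch, on made-up input, shows the dot with and without it.

```python
import re

# By default the dot refuses to match a line feed.
dot_hits = re.findall(r"gr.y", "gray grey gr%y gr\ny")

default = re.search(r"a.b", "a\nb")                  # None
dotall = re.search(r"a.b", "a\nb", flags=re.DOTALL)  # matches across the line break

print(dot_hits, default, dotall.group() == "a\nb")
```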

Anchors
Anchors do not match any characters but match a position
–^ matches at the start of the string
–$ matches at the end of the string
–b$ matches only the last b in bob
Most regex engines have a “multi-line” mode that makes ^ match after any line break, and $ before any line break
\b matches at a word boundary
–a word boundary is a position between a character that can be matched by \w and a character that cannot be matched by \w
–\b also matches at the start and/or end of the string if the first and/or last characters in the string are word characters
–\B matches at every position where \b cannot match
–\bis\b: This island is beautiful (only the standalone word is matches, not the is inside This or island)
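The anchor examples can be checked as follows. This is a sketch in Python, where multi-line mode is the re.MULTILINE flag; the two-line string bob/rob is invented for the demonstration.

```python
import re

text = "This island is beautiful"

# \bis\b only hits the standalone word, not the "is" inside This or island.
is_spans = [m.span() for m in re.finditer(r"\bis\b", text)]

# b$ can only match the b that sits at the end of the string.
last_b = re.search(r"b$", "bob").start()

# Without MULTILINE, ^ and $ anchor to the whole string, not to each line.
lines = "bob\nrob"
single = re.findall(r"^.ob$", lines)
multi = re.findall(r"^.ob$", lines, flags=re.MULTILINE)

print(is_spans, last_b, single, multi)
```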

Alternation
Alternation is the regular expression equivalent of “or”
–cat|dog: About cats and dogs (the first match is the cat in cats)
You can add as many alternatives as you want
–cat|dog|mouse|fish
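A short sketch of alternation in Python. The second sentence is made up; it shows that the leftmost match in the string wins, regardless of the order of the alternatives.

```python
import re

# Every non-overlapping match of either alternative.
pets = re.findall(r"cat|dog", "About cats and dogs")

# "fish" comes first in the text, so it is found first even though it is listed last.
first = re.search(r"cat|dog|mouse|fish", "A fish chased a mouse").group()

print(pets, first)
```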

Repetition
? makes the preceding token in the regular expression optional
–colou?r matches colour or color
* matches the preceding token zero or more times
+ matches the preceding token once or more
–<[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes
–<[A-Za-z0-9]+> is easier to write but matches invalid tags such as <1>
{} specifies a specific amount of repetition
–\b[1-9][0-9]{3}\b matches 1000–9999
–\b[1-9][0-9]{2,4}\b matches 100–99999
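The quantifiers can be exercised like this in Python. The input strings are invented, and the tag pattern <[A-Za-z][A-Za-z0-9]*> is used here as an assumed example of “an HTML tag without attributes”.

```python
import re

# ? makes the u optional; two u's in a row are not allowed.
spellings = re.findall(r"colou?r", "color colour colouur")

# {3} after the leading non-zero digit gives exactly four digits in total.
four_digit = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000 0123")

# {2,4} allows three to five digits in total.
ranged = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")

# * lets the tag name continue with letters or digits after the first letter.
tags = re.findall(r"<[A-Za-z][A-Za-z0-9]*>", "<b>bold</b> <h1> <1>")

print(spellings, four_digit, ranged, tags)
```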

Greedy and lazy repetition
The repetition operators or quantifiers are greedy
They will expand the match as far as they can, and only give back if they must to satisfy the remainder of the regex
–<.+>: in This is a <EM>first</EM> test, it matches <EM>first</EM>
Place a question mark after the quantifier to make it lazy, i.e., stop matching as soon as possible
–<.+?>: in This is a <EM>first</EM> test, it matches <EM>
A better solution is to use <[^<>]+> to quickly match an HTML tag without regard to attributes
–the negated character class is more specific than the dot, which helps the regex engine find matches quickly
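Greedy versus lazy matching is easiest to see side by side. In the sketch below the patterns <.+>, <.+?> and <[^<>]+> and the sample sentence with an EM tag are assumptions chosen to illustrate the slide, run through Python's re module.

```python
import re

html = "This is a <EM>first</EM> test"

# Greedy: .+ runs to the end of the line, then backs off to the last '>'.
greedy = re.search(r"<.+>", html).group()

# Lazy: .+? stops at the first '>' it can reach.
lazy = re.search(r"<.+?>", html).group()

# Negated class: never crosses a '<' or '>', so no backtracking is needed.
negated = re.findall(r"<[^<>]+>", html)

print(greedy, lazy, negated)
```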

Grouping and backreferences
Place round brackets, (), around multiple tokens to group them together
–you can then apply a quantifier to the group
–Set(Value)? matches Set or SetValue
Round brackets create a capturing group
–the above example has one group
–how to access the group’s contents depends on the software or programming language you’re using
Group zero always contains the entire regex match
–Set(Value)? on SetValue: $0 = SetValue, $1 = Value
–Set(Value)? on Set: $0 = Set, $1 is empty
Use the special syntax Set(?:Value)? to group tokens without creating a capturing group
–more efficient if you don’t need the contents
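How groups are accessed differs between languages. As one illustration, Python exposes them through the match object's group() method rather than the $0/$1 notation used on the slide. A minimal sketch:

```python
import re

with_value = re.fullmatch(r"Set(Value)?", "SetValue")
whole = with_value.group(0)   # the entire match
first = with_value.group(1)   # contents of the first capturing group

without_value = re.fullmatch(r"Set(Value)?", "Set")
missing = without_value.group(1)  # the optional group did not participate

# (?:...) groups without capturing, so no numbered group is created.
non_capturing = re.fullmatch(r"Set(?:Value)?", "SetValue")

print(whole, first, missing, non_capturing.groups())
```

In Python an unmatched optional group is reported as None rather than an empty string.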

Look-around
Look-around is a special kind of group
The tokens inside the group are matched normally, but then the regex engine makes the group give up its match and keeps only the result (whether the group matched or not)
Look-around matches a position, just like anchors
–q(?=u) matches the q in question, but not the q in Iraq (positive look-ahead)
  (?=u) matches at each position in the string before a u
  the u is not part of the overall regex match
–q(?!u) matches the q in Iraq but not the q in question (negative look-ahead)
–(?<=a)b matches the b in abc (positive look-behind)
–(?<!a)b fails to match abc (negative look-behind)
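The four look-around forms can be verified with the slide's own examples. A sketch in Python, whose re module supports all four (look-behind patterns must have a fixed width there):

```python
import re

ahead_pos = re.search(r"q(?=u)", "question")   # q followed by u
ahead_pos_fail = re.search(r"q(?=u)", "Iraq")  # None: nothing after the q

ahead_neg = re.search(r"q(?!u)", "Iraq")       # q not followed by u
ahead_neg_fail = re.search(r"q(?!u)", "question")

behind_pos = re.search(r"(?<=a)b", "abc")      # b preceded by a
behind_neg = re.search(r"(?<!a)b", "abc")      # None: the only b follows an a

# The looked-at character is never part of the match itself.
print(ahead_pos.group(), ahead_neg.group(), behind_pos.span())
```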

Reference
Regular Expression Quick Start
–http:// expressions.info/quickstart.html

We have done a lot of exercises

Now, let’s talk about bioinformatics programming in real cases

Sequence alignment
We have learnt/implemented it twice
–dynamic programming
–longest common sub-string/sub-sequence
–sequence alignment
  DNA/protein sequence
  residue substitution
We know that
–time complexity is O(n^2)
–backtracking, alternative alignments…
That’s all?
–No! There are always better algorithms. That’s why we always have new papers to read.
Theoretical
Applicative
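As a reminder of where the O(n^2) comes from, here is a minimal sketch of the dynamic-programming table for the longest common subsequence. The function name lcs_length is hypothetical, and the code is an illustration rather than the version implemented in the course. Every pair of positions (i, j) is visited once, hence the quadratic time; only two rows are kept, so memory stays linear.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (O(len(a) * len(b)) time)."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    for ch_a in a:
        curr = [0]  # column 0: empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best alignment of the two shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip a character in a or in b, whichever is better.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(lcs_length("ABCBDAB", "BDCABA"))
```

Recovering the alignment itself (backtracking) needs the full table, not just two rows.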

Sequence alignment
Some advanced ideas
When is this point never considered?
Band alignment
Arbitrary region

The state-of-the-art solutions: seeding and extension.
Seq = AGATCGAT
3-mer  positions
AAA
AAC
…
AGA    1
…
ATC    3
…
CGA    5
…
GAT    2, 6
…
TCG    4
…
TTT
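The seeding step boils down to a lookup table from every k-mer to the positions where it occurs. A minimal sketch in Python follows. The function name build_kmer_index is hypothetical, positions are 1-based to match the slide, and k-mers that never occur are simply absent instead of being listed with an empty entry.

```python
from collections import defaultdict


def build_kmer_index(seq: str, k: int = 3) -> dict:
    """Map each k-mer of seq to the list of its 1-based start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i + 1)  # +1 converts to 1-based positions
    return dict(index)


index = build_kmer_index("AGATCGAT")
for kmer in sorted(index):
    print(kmer, index[kmer])
```

A query is then cut into k-mers as well; each shared k-mer is a seed, and only the regions around seeds are extended with the expensive alignment.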

This is not Bioinformatics algorithm

Protein clustering 23 Ina FASTA file and an integer k Outk clusters of proteins Requirement - invoke BLAST - complexity/teamwork report - using Perl would be the best Bonus - k-means algorithm - invoke clustering package

Deadline /5/4 23:59
Zip your code, step-by-step README, complexity analyses and anything worthy of extra credit. to

BLAST
Download protein sequence from UniProt
–$ wget -O ytf.fa ' ription+factor+AND+reviewed%3ayes&force=yes&format=fasta'
A Unix tip using grep and regular expression
–$ grep '^>' ytf.fa | wc -l # how many sequences
–$ grep -c '^>' ytf.fa # a better version
Download BLAST from NCBI
–I prefer this version ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/blast ia32-linux.tar.gz
Execution
–$ formatdb -i ytf.fa # building indices
–$ blastall -d ytf.fa -i ytf.fa -p blastp > ytf.bo # default output
–$ blastall -d ytf.fa -i ytf.fa -m 8 -p blastp > ytf.bo # tabular output
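The grep tip counts lines that start with >, i.e. FASTA header lines. The same regular expression works from a script. A minimal sketch in Python is shown below; the function name count_fasta_records and the two sample records are made up for illustration.

```python
import re


def count_fasta_records(fasta_text: str) -> int:
    """Count FASTA records by counting lines that begin with '>'."""
    # MULTILINE makes ^ match at the start of every line, like grep does.
    return len(re.findall(r"^>", fasta_text, flags=re.MULTILINE))


# Hypothetical two-record FASTA content standing in for ytf.fa.
sample = ">sp|P00001|EXAMPLE1\nMKTAYIAK\nQRQISFVK\n>sp|P00002|EXAMPLE2\nMSSHEGGK\n"

print(count_fasta_records(sample))
```

For a real file, the text would come from reading ytf.fa; for very large files, counting line by line avoids loading everything into memory.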