Master Course MSc Bioinformatics for Health Sciences H15: Algorithms on strings and sequences
Xavier Messeguer Peypoch
Dep. de Llenguatges i Sistemes Informàtics, CEPBA-IBM Research Institute, Universitat Politècnica de Catalunya
Contents
1. (Exact) String matching of one pattern
2. (Exact) String matching of many patterns
3. Approximate string matching (Dynamic programming)
4. Pairwise and multiple alignment
5. Suffix trees and MUMs
References:
- G. Navarro and M. Raffinot, Flexible pattern matching in strings, Cambridge University Press, 2002
- D. Gusfield, Algorithms on strings, trees and sequences, Cambridge University Press, 1997
Fourth lecture: Examples
Example 1: Assume that you have a transcription factor with binding pattern atc(g|a)(t|g|a)gt whose occurrences are to be searched for in a text of length 1500 bps:
- What is the best strategy?
- How many random occurrences will appear?
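One possible way to work through Example 1, as a minimal Python sketch. It assumes a random text with independent, equiprobable bases (the slide does not state a background model) and considers one strand only. The names `match_probability` and `find_occurrences`, and the short test string, are illustrative and not part of the course material.

```python
import re

PATTERN = "atc(g|a)(t|g|a)gt"   # pattern from Example 1
TEXT_LENGTH = 1500              # text length from Example 1

def match_probability(choices_per_position):
    """Probability that a random window matches, assuming independent,
    equiprobable bases (an assumption, not stated on the slide)."""
    p = 1.0
    for n in choices_per_position:
        p *= n / 4
    return p

# a, t, c are fixed (1 choice each); (g|a) allows 2; (t|g|a) allows 3; g, t fixed.
choices = [1, 1, 1, 2, 3, 1, 1]
p = match_probability(choices)                 # 3/8192
positions = TEXT_LENGTH - len(choices) + 1     # 1494 possible start positions
expected = positions * p                       # about 0.55 random occurrences

def find_occurrences(pattern, text):
    """0-based start positions of all matches, overlapping ones included."""
    return [m.start() for m in re.finditer("(?=(?:" + pattern + "))", text)]
```

Under these assumptions roughly one random occurrence is expected in every two texts of this length, and for a single short pattern a plain scan such as `find_occurrences(PATTERN, text)` is usually fast enough.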
Example 2: Assume that you have 100 transcription factors with binding patterns such as atc(g|a)(t|g|a)gt whose occurrences are to be searched for in a text of length 1500 bps:
- What is the best strategy?
- How many random occurrences will appear?
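With many patterns, one option is to scan the text once and test every window against all patterns at the same time. The sketch below does this by expanding each pattern into its concrete words and storing them in a dictionary. It assumes all patterns have the same length; the pattern names `TF1` and `TF2`, the second pattern and the test text are hypothetical. If each of the 100 patterns were as degenerate as the one in Example 1, the expected number of random occurrences would be about 100 times larger, that is, about 55.

```python
from itertools import product

def build_index(patterns):
    """Map every concrete word to the names of the patterns it matches.
    Each pattern is a list of per-position strings of allowed bases."""
    index = {}
    for name, sets in patterns.items():
        for word in product(*sets):
            index.setdefault("".join(word), []).append(name)
    return index

def scan(text, index, length):
    """Single pass over the text; each window is looked up in the index."""
    hits = []
    for i in range(len(text) - length + 1):
        names = index.get(text[i:i + length])
        if names:
            hits.append((i, names))
    return hits

# Hypothetical pattern set: TF1 is the pattern of Example 1, TF2 is invented.
patterns = {
    "TF1": ["a", "t", "c", "ga", "tga", "g", "t"],
    "TF2": ["g", "g", "ct", "a", "a", "t", "c"],
}
index = build_index(patterns)
hits = scan("ccatcgtgtaaggcaatc", index, 7)
# hits == [(2, ['TF1']), (11, ['TF2'])]
```

For patterns of different lengths, a multi-pattern automaton (for example Aho-Corasick) would be the more general choice; the dictionary version is only the simplest case.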
Example 3: Assume that you have 100 transcription factors with binding patterns such as atc(g|a)(t|g|a)gt whose occurrences are to be searched for in 50 promoter regions of 1500 bps each:
- What is the best strategy?
Example 4: Assume that you have a transcription factor a c t g whose occurrences are to be searched for in a text of length 1500 bps:
- What is the best strategy?
- How many random occurrences will appear?
Example 5: Assume that you have a transcription factor a c t g whose occurrences are to be searched for in a text of length 1500 bps:
- What is the best strategy?
Example 6: Assume that you have two short DNA sequences and you need to compare them. In each case, what are you doing?
- Using global pairwise alignment.
- Using local pairwise alignment.
- Using suffix trees.
- Using frequency table of l-mers.
Example 7: Assume that you have two genomic DNA sequences and you need to compare them. In each case, what are you doing?
- Using global pairwise alignment.
- Using local pairwise alignment.
- Using suffix trees.
- Using frequency table of l-mers.
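The last option, a frequency table of l-mers, compares composition rather than position and needs only one pass over each sequence. A minimal sketch follows; the distance used here (sum of absolute differences of relative frequencies) is one simple choice among many and is an assumption, as are the function names.

```python
from collections import Counter

def lmer_counts(seq, l):
    """Frequency table of all overlapping l-mers of seq."""
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))

def lmer_distance(a, b, l):
    """Sum of absolute differences between relative l-mer frequencies.
    0.0 means identical composition, 2.0 means no l-mer in common."""
    ca, cb = lmer_counts(a, l), lmer_counts(b, l)
    na, nb = sum(ca.values()), sum(cb.values())
    if na == 0 or nb == 0:
        raise ValueError("both sequences must be at least l characters long")
    return sum(abs(ca[w] / na - cb[w] / nb) for w in set(ca) | set(cb))
```

For example, `lmer_counts("ababa", 2)` counts `ab` twice and `ba` twice. Two sequences with the same l-mer composition get distance 0 even if their l-mers appear in a different order, which is what distinguishes this comparison from an alignment.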
Suffix trees
Given string ababaas:
Suffixes:
1: ababaas
2: babaas
3: abaas
4: baas
5: aas
6: as
7: s
Suffix tree (each leaf is labelled with its edge text and suffix number):
root
  a
    ba
      baas,1
      as,3
    as,5
    s,6
  ba
    baas,2
    as,4
  s,7
What kind of queries?
Applications of Suffix trees
1. Exact string matching
Does the sequence ababaas contain any occurrence of patterns abab, aab, and ab?
(Figure: suffix tree of ababaas.)
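The query above can be tried with a small sketch. For brevity it builds an uncompressed suffix trie with nested dictionaries, which takes quadratic space; a real suffix tree compresses chains of single children and can be built in linear time. The function names and the `"$start"` marker key are illustrative.

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie as nested dicts (a simplification of the
    suffix tree on the slide, not a linear-time construction)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["$start"] = start + 1   # 1-based suffix number, as on the slide
    return root

def occurrences(trie, pattern):
    """Sorted 1-based start positions of pattern, or [] if it is absent."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    # Every suffix number below this node is one occurrence of the pattern.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "$start":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

trie = build_suffix_trie("ababaas")
# occurrences(trie, "abab") == [1]
# occurrences(trie, "aab")  == []
# occurrences(trie, "ab")   == [1, 3]
```

The search walks down from the root for at most as many steps as the pattern has characters, independently of the text length; only reporting all positions depends on the number of occurrences.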
Applications of Suffix trees
2. The substring problem for a database of patterns DB
Does the DB contain any occurrence of patterns abab, aab, and ab?
(Figure: generalized suffix tree of two strings, with terminators α and β.)
Applications of Suffix trees
3. The longest common substring of two strings
(Figure: generalized suffix tree of two strings, with terminators α and β.)
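The idea can be sketched as follows: insert the suffixes of both strings into one structure, record at each node which string reaches it, and report the deepest node reached by both. The sketch uses an uncompressed trie instead of a true generalized suffix tree, so it is quadratic rather than linear; the function name and the example strings are illustrative.

```python
def longest_common_substring(s, t):
    """Deepest path in a generalized suffix trie that is reached by
    suffixes of both strings (quadratic sketch, not a linear-time tree)."""
    root = {"children": {}, "owners": set()}
    for owner, text in enumerate((s, t)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "owners": set()})
                node["owners"].add(owner)   # this path occurs in string `owner`
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        if len(label) > len(best):
            best = label
        for ch, child in node["children"].items():
            if len(child["owners"]) == 2:   # path occurs in both strings
                stack.append((child, label + ch))
    return best
```

For instance, `longest_common_substring("ababaabs", "aabaat")` returns `"abaa"`. When several common substrings share the maximum length, the one returned depends on the traversal order.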
Applications of Suffix trees
4. Finding the maximal repeats.
(Figure: generalized suffix tree of two strings, with terminators α and β.)
Applications of Suffix trees
5. Finding MUMs.

Third lecture: Second part: Alignment of genomes: MUMs
Dynamic programming
Short sequences (up to bps) can be aligned using dynamic programming, with quadratic cost in space and time.
accagt
||||xx
acca--
What about genomes?
accaccacaccacaacgagcata … acctgagcgatat
acc..tacc..t
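A minimal sketch of the dynamic programming table, computing only the alignment score. The scoring scheme (match +1, mismatch -1, gap -1) is an assumption, not taken from the slides, and the two strings follow the small example on the slide, read here as accagt and acca. The same function covers the global (Needleman-Wunsch style) and local (Smith-Waterman style) variants.

```python
def alignment_score(s, t, local=False, match=1, mismatch=-1, gap=-1):
    """Fill the (len(s)+1) x (len(t)+1) table: quadratic time and space."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            value = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                value = max(value, 0)      # a local alignment may restart anywhere
                best = max(best, value)
            score[i][j] = value
    return best if local else score[-1][-1]

# alignment_score("accagt", "acca") == 2              (4 matches, 2 gaps)
# alignment_score("accagt", "acca", local=True) == 4  (the common part acca)
```

The table has one cell per pair of positions, which is where the quadratic cost comes from.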
Genomic sequences
In which cases can Dynamic Programming be applied?
Genomic sequences have millions of base pairs.
The length of sequences is 1000 times longer.
With quadratic cost, the running time is 10^6 times higher! (1 second becomes 11 days) (1 minute becomes 2 years)
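The figures on this slide follow from the quadratic cost: if both sequences become 1000 times longer, the table has 1000 × 1000 times more cells. A short check of the arithmetic, assuming the running time is exactly proportional to the product of the two lengths (the function name is illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

def scaled_seconds(seconds, length_factor):
    """Running time after both sequences grow by length_factor,
    assuming time proportional to the product of the lengths."""
    return seconds * length_factor ** 2

days = scaled_seconds(1, 1000) / SECONDS_PER_DAY      # about 11.6 days
years = scaled_seconds(60, 1000) / SECONDS_PER_YEAR   # about 1.9 years
```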
First assumption
(Figure: Genome A and Genome B.)
Realistic assumption? Unrealistic assumption!
More realistic assumption
(Figure: three panels comparing Genome A and Genome B.)
Realistic assumptions? But is this now a real case?
Preview in a real case
Chlamydia muridarum: bps
Chlamydia trachomatis: bps
Preview in a real case
Pyrococcus abyssi: bps
Pyrococcus horikoshii: bps
MUM
MUM = Maximal Unique Matching: a match that occurs exactly once in each sequence and cannot be extended on either side.
… a a t g….c t g...
… c g t g….c c c...
Search for MUMs
Given strings ababaabs and aabaat:
(Figure: generalized suffix tree of ababaabs and aabaat.)
1st: Bottom-up traversal (through the tree). List of UM: aab, abaa, baa.
2nd: Search for maximals (through the list of UM). MUMs: aab, abaa.
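The two steps can be checked on the slide's strings with a brute-force sketch. It replaces the suffix tree traversal by direct enumeration of substrings, so it is only suitable for very short strings; the function names are illustrative. A unique match (UM) is taken to be a substring that occurs exactly once in each string, and a UM is maximal when it is not contained in a longer UM.

```python
def count_occurrences(text, word):
    """Number of (possibly overlapping) occurrences of word in text."""
    return sum(1 for i in range(len(text) - len(word) + 1)
               if text.startswith(word, i))

def unique_matches(s, t):
    """Step 1 (brute force instead of a tree traversal): substrings that
    occur exactly once in s and exactly once in t."""
    found = set()
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            word = s[i:j]
            if count_occurrences(s, word) == 1 and count_occurrences(t, word) == 1:
                found.add(word)
    return found

def mums(s, t):
    """Step 2: keep the unique matches not contained in a longer one."""
    um = unique_matches(s, t)
    return sorted(w for w in um
                  if not any(w != other and w in other for other in um))

# unique_matches("ababaabs", "aabaat") == {"aab", "abaa", "baa"}
# mums("ababaabs", "aabaat") == ["aab", "abaa"]
```

Here baa is discarded because it lies inside abaa, which reproduces the list on the slide.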
Preview of many genomes
List of works
Image and interface accgc…….cttgc...tccgg……ccaac...
Computational and biological background (3)
Chlamydophila pneumoniae AR39: bps
Chlamydia pneumoniae:
Chlamydia muridarum: bps
Chlamydia trachomatis: bps
Alignment revisited
Pyrococcus abyssi:
Pyrococcus horikoshii: bps