Bioinformatics PhD. Course
1. Biological introduction
2. Comparison of short sequences (up to bps): Dot Matrix, Pairwise align., Multiple align., Hash alg.
3. Comparison of large sequences (more than bps): Data structures, Suffix trees, MUMs
4. String matching: Exact, Extended, Approximate
5. Sequence assembly
6. Projects: PROMO, MREPATT, …
Comparison of large sequences First part: Alignment of large sequences
Dynamic programming
Short sequences (up to bps) can be aligned using dynamic programming. Quadratic cost of space and time.
acc agt
||| |xx
acc a--
What about genomes?
accaccacaccacaacgagcata … acctgagcgatat
Genomic sequences
Genomic sequences have millions of base pairs: the sequences are 1000 times longer.
With quadratic cost, the running time is about 10^6 times higher! (1 second becomes 11 days; 1 minute becomes 2 years)
In which cases can dynamic programming be applied?
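The two figures on this slide follow from the quadratic cost. A back-of-the-envelope check, assuming the running time scales exactly with the product of the two sequence lengths (the helper name `scaled_time` is only illustrative):

```python
def scaled_time(seconds, length_factor):
    """Running time after both sequences grow by length_factor, assuming cost ~ n * m."""
    return seconds * length_factor ** 2

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

one_second = scaled_time(1, 1000)    # 1,000,000 seconds
one_minute = scaled_time(60, 1000)   # 60,000,000 seconds

print(f"1 second -> {one_second / SECONDS_PER_DAY:.1f} days")
print(f"1 minute -> {one_minute / SECONDS_PER_YEAR:.1f} years")
```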
First assumption
[Figure: schematic of Genome A and Genome B]
Realistic assumptions?
The first assumption is unrealistic! A more realistic assumption follows. But now, is it a real case?
[Figure: schematics of Genome A and Genome B under each assumption]
Preview in a real case
Chlamydia muridarum: bps
Chlamydia trachomatis: bps
Preview in a real case
Pyrococcus abyssi: bps
Pyrococcus horikoshii: bps
Methodology of an alignment
1st: Make a preview. (Linear cost)
2nd: Identify the portions that can be aligned. (Linear cost)
3rd: Make the alignment.
Preview-Revisited
… a a t g….c t g...
… c g t g….c c c...
MUM: Maximal Unique Matching
Connect to MALGEN
Methodology of an alignment
1st: Make a preview. How can MUMs be found? Linear cost with suffix trees.
2nd: Identify the portions that can be aligned. How can these portions be determined?
3rd: Make the alignment. With CLUSTALW, TCOFFEE, …
Comparison of large sequences M-GCAT Todd Treangen
Homework
1. Javier
2. Dmitry
3. Ana Iris
4. David
5. Patricia
6. Rogeli
7. Atif
8. Aina
9. Isaac
10. Maria Merce
11. Romina
12. Guillem
13. Raul
14. Alexis
15. Ramon
Bioinformatics PhD. Course Second part: Introducing Suffix trees
Suffix trees
Given string ababaas:
Suffixes:
1: ababaas
2: babaas
3: abaas
4: baas
5: aas
6: as
7: s
Suffix tree (one edge label per line; leaves carry the suffix number):
root
  a
    ba
      baas,1
      as,3
    as,5
    s,6
  ba
    baas,2
    as,4
  s,7
What kind of queries?
Applications of Suffix trees
[Figure: suffix tree of ababaas]
1. Exact string matching
Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
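The question on this slide can be answered by walking each pattern down from the root. A minimal sketch, assuming an uncompressed suffix trie (one character per edge) instead of the compressed suffix tree drawn on the slides; `build_suffix_trie` and `contains` are illustrative names:

```python
def build_suffix_trie(text):
    """Nested dictionaries: one level per character, one path per suffix."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """A pattern occurs in the text iff it can be spelt from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("ababaas")
results = {p: contains(trie, p) for p in ("abab", "aab", "ab")}
print(results)
```

Each query costs time proportional to the pattern length, whatever the length of the text.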
Quadratic insertion algorithm
Given the string …
Invariant properties:
P1: the leaves of suffixes from … have been inserted and the suffix-tree …
Quadratic insertion algorithm
Given the string ababaabbs, the suffixes are inserted one at a time:
1: new leaf ababaabbs,1
2: new leaf babaabbs,2
3: edge ababaabbs split into aba + baabbs,1; new leaf abbs,3
4: edge babaabbs split into ba + baabbs,2; new leaf abbs,4
5: edge aba split into a + ba; new leaf abbs,5
6: edge ba (below a) split into b + a; new leaf bs,6
7: edge ba (below the root) split into b + a; new leaf bs,7
8: new leaf s,8 below b
9: new leaf s,9 below the root
Resulting suffix tree:
root
  a
    abbs,5
    b
      a
        abbs,3
        baabbs,1
      bs,6
  b
    a
      baabbs,2
      abbs,4
    bs,7
    s,8
  s,9
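The insertion of one suffix after another can be sketched as follows. This is a minimal illustration, assuming the last character of the text is a unique terminator (s in the example); the class and function names are hypothetical, and edge labels are stored as plain strings for readability rather than as index pairs.

```python
class Node:
    def __init__(self, suffix_start=None):
        self.children = {}                # first character -> (edge label, child Node)
        self.suffix_start = suffix_start  # 1-based suffix number, set on leaves only

def insert_suffix(root, text, start):
    """Insert text[start:] by walking down from the root (cost proportional to its length)."""
    node, rest = root, text[start:]
    while True:
        first = rest[0]
        if first not in node.children:
            node.children[first] = (rest, Node(start + 1))   # new leaf
            return
        label, child = node.children[first]
        k = 0                                                # common prefix length
        while k < len(label) and k < len(rest) and label[k] == rest[k]:
            k += 1
        if k < len(label):                                   # mismatch inside the edge: split it
            middle = Node()
            middle.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], middle)
            child = middle
        node, rest = child, rest[k:]

def build_suffix_tree(text):
    assert text.count(text[-1]) == 1, "last character must be a unique terminator"
    root = Node()
    for start in range(len(text)):
        insert_suffix(root, text, start)
    return root

def leaves(node, path=""):
    """Map each leaf number to the string spelt from the root to that leaf."""
    if node.suffix_start is not None:
        return {node.suffix_start: path}
    found = {}
    for label, child in node.children.values():
        found.update(leaves(child, path + label))
    return found

text = "ababaabbs"
tree = build_suffix_tree(text)
leaf_paths = leaves(tree)
```

Every suffix is matched character by character from the root, so the total work is the sum of the suffix lengths, which is quadratic in the length of the string.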
Generalized suffix tree
The suffix tree of many strings is called the generalized suffix tree, and it is the suffix tree of the concatenation of the strings, each followed by its own terminator.
For instance, the generalized suffix tree of ababaabb and aabaa is the suffix tree of ababaabbαaabaaβ, with every leaf label cut at its first terminator.
Generalized suffix tree
Construction of the suffix tree of ababaabbαaabaaβ: given the suffix tree of ababaabbα, the suffixes of aabaaβ are inserted one at a time, adding the leaves aaβ,1, β,2, β,3, β,4, β,5 and β,6.
Generalized suffix tree of ababaabbαaabaaβ:
root
  a
    a
      b
        bα,5
        aaβ,1
      β,4
    b
      a
        baabbα,1
        a
          bbα,3
          β,2
      bα,6
    β,5
  b
    a
      baabbα,2
      a
        bbα,4
        β,3
    bα,7
    α,8
  α,9
  β,6
Applications of Suffix trees
1. Exact string matching
Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
2. The substring problem for a database of strings DB
Does the DB contain any occurrence of the patterns abab, aab, and ab?
3. The longest common substring of two strings
4. Finding the maximal repeats
5. Finding MUMs
[Figure: items 2 to 5 are illustrated on the generalized suffix tree of ababaabbαaabaaβ]
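Applications 2 and 3 can be tried on the two example strings with a small sketch. It assumes an uncompressed trie of all suffixes in which every node records the set of strings passing through it; this set (called `owners` here, a hypothetical name) plays the role of the α and β terminators of the generalized suffix tree.

```python
def build_generalized_trie(strings):
    """Uncompressed trie of all suffixes; each node records which strings reach it."""
    root = {"children": {}, "owners": set()}
    for string_id, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "owners": set()})
                node["owners"].add(string_id)
    return root

def strings_containing(trie, pattern):
    """Application 2: identifiers of the database strings that contain the pattern."""
    node = trie
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return set()
    return set(node["owners"])

def longest_common_substring(trie, n_strings):
    """Application 3: deepest node reached by suffixes of every string."""
    best = ""
    stack = [(trie, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["children"].items():
            if len(child["owners"]) == n_strings:
                candidate = path + ch
                if len(candidate) > len(best):
                    best = candidate
                stack.append((child, candidate))
    return best

db = ["ababaabb", "aabaa"]
trie = build_generalized_trie(db)
hits = {p: strings_containing(trie, p) for p in ("abab", "aab", "ab")}
lcs = longest_common_substring(trie, len(db))
print(hits, lcs)
```

The trie uses quadratic space, so this is only suitable for short examples; the compressed generalized suffix tree gives the same answers in linear space.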
Bioinformatics PhD. Course Third part: Suffix links
Suffix links
[Figure: suffix tree of ababaabbα, animated over several slides]
The suffix link of an internal node whose path from the root spells xw, with x a single character, points to the node whose path spells w.
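The suffix links of the example tree can be listed with a brute-force sketch. It relies on a standard characterization: the internal nodes of a suffix tree are the substrings that are followed by at least two different characters, and the suffix link of the node spelling xw points to the node spelling w. The function names are illustrative, and the quadratic enumeration is only meant for small examples.

```python
def internal_node_labels(text):
    """Substrings followed by two or more distinct characters ('' is the root)."""
    followers = {}
    for i in range(len(text)):
        for j in range(i, len(text)):
            followers.setdefault(text[i:j], set()).add(text[j])
    return {w for w, nxt in followers.items() if len(nxt) >= 2}

def suffix_links(text):
    """Map the path label of each internal node to the label its suffix link points to."""
    nodes = internal_node_labels(text)
    links = {w: w[1:] for w in nodes if w}    # the root has no suffix link
    assert all(target in nodes for target in links.values())
    return links

links = suffix_links("ababaabbα")
for source, target in sorted(links.items()):
    print(repr(source), "->", repr(target) if target else "root")
```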
Traversal using Suffix links
[Figure: suffix tree of ababaabbα, traversed while S2 is read]
Given S2 = a a b a a b b a
Unique matchings found while S2 is read:
S1[5..7] in S2[1] (aab)
S1[3..8] in S2[2]
S1[4..8] in S2[3]
S1[5..8] in S2[4]
S1[6..8] in S2[5]
S1[7..8] in S2[6]
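The matchings listed for this traversal can be reproduced by brute force. The sketch below does not use the suffix tree or its suffix links, so it is much slower than the traversal shown on the slides; it only assumes that a unique matching at a position of S2 is the longest prefix of that suffix of S2 that occurs in S1, provided it occurs there exactly once. Function names are illustrative.

```python
def count_occurrences(text, word):
    """Number of (possibly overlapping) occurrences of word in text."""
    return sum(text.startswith(word, i) for i in range(len(text) - len(word) + 1))

def unique_matchings(s1, s2):
    """Triples (s2 position, s1 position, length), 1-based as on the slides."""
    result = []
    for j in range(len(s2)):
        length = 0
        while j + length < len(s2) and s2[j:j + length + 1] in s1:
            length += 1                       # extend the match while it still occurs in s1
        word = s2[j:j + length]
        if length > 0 and count_occurrences(s1, word) == 1:
            result.append((j + 1, s1.find(word) + 1, length))
    return result

ums = unique_matchings("ababaabb", "aabaabba")
for s2_pos, s1_pos, length in ums:
    print(f"S1[{s1_pos}..{s1_pos + length - 1}] in S2[{s2_pos}]")
```

Positions 7 and 8 of S2 produce no entry because their longest matches (ba and a) occur more than once in S1.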
From UMs to MUMs
Given S1 = a b a b a a b b α and S2 = a a b a a b b a
Array of UMs (unique matchings):
S1[3..8] in S2[2]
S1[4..8] in S2[3]
S1[5..8] in S2[4]
S1[6..8] in S2[5]
S1[7..8] in S2[6]
MUM: S1[3..8] in S2[2]
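The MUM of the example can be checked independently of the array of UMs. This is a brute-force sketch, not the linear suffix-tree method of the course: it assumes that a MUM is a common substring that cannot be extended to the left or to the right and that occurs exactly once in each sequence. The `min_length` parameter is an illustrative addition to filter out very short matches.

```python
def count_occurrences(text, word):
    """Number of (possibly overlapping) occurrences of word in text."""
    return sum(text.startswith(word, i) for i in range(len(text) - len(word) + 1))

def find_mums(s1, s2, min_length=1):
    """Maximal unique matchings as (s1 position, s2 position, length), 1-based."""
    mums = []
    for i in range(len(s1)):
        for j in range(len(s2)):
            if s1[i] != s2[j]:
                continue
            if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
                continue                      # extendable to the left: not maximal
            length = 0
            while (i + length < len(s1) and j + length < len(s2)
                   and s1[i + length] == s2[j + length]):
                length += 1                   # extend to the right as far as possible
            word = s1[i:i + length]
            if (length >= min_length
                    and count_occurrences(s1, word) == 1
                    and count_occurrences(s2, word) == 1):
                mums.append((i + 1, j + 1, length))
    return mums

mums = find_mums("ababaabb", "aabaabba", min_length=3)
print(mums)
```

With a minimum length of 3 the only result is the match of length 6 starting at S1[3] and S2[2], that is, abaabb.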
Bioinformatics PhD. Course Third part: Linear insertion algorithm
Quadratic insertion algorithm
Given the string …
Invariant properties:
P1: the leaves of suffixes from … have been inserted and the suffix-tree …
Linear insertion algorithm
Given the string …
Invariant properties:
P1: the leaves of suffixes from … have been inserted and the suffix-tree …
P2: the string … is the longest string that can be spelt through the tree.
Linear insertion algorithm: example
Given the string ababaababb...
[Figure: animation of the suffix tree while the string is read]
With suffixes 1 to 5 already inserted, suffixes 6, 7, 8 and 9 stay pending while the characters abab can still be spelt through the tree. When the next character b is read, the leaves b...,6, b...,7, b...,8 and b...,9 are created in turn, each one at a newly split edge.
Resulting tree:
root
  a
    ababb...,5
    b
      a
        ababb...,3
        b
          aababb...,1
          b...,6
      b...,8
  b
    a
      ababb...,4
      b
        aababb...,2
        b...,7
    b...,9
Index
Suffix arrays
"Suffix arrays: a new method for on-line string searches", G. Myers, U. Manber
Suffix arrays
Given string ababaa#:
Suffixes:
1: ababaa#
2: babaa#
3: abaa#
4: baa#
5: aa#
6: a#
7: #
… but lexicographically sorted (taking # as smaller than any letter):
7: #
6: a#
5: aa#
3: abaa#
1: ababaa#
4: baa#
2: babaa#
What is the cost? O(n log(n))
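The sorted order can be reproduced with a plain sort of the suffix start positions. This is a minimal sketch and not the construction algorithm of the cited paper: a generic sort performs O(n log(n)) comparisons, but each comparison may itself read up to n characters. It also assumes that # compares as smaller than every letter, which holds for the usual character codes.

```python
def build_suffix_array(text):
    """1-based start positions of the suffixes of text, in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda start: text[start - 1:])

text = "ababaa#"
suffix_array = build_suffix_array(text)
for start in suffix_array:
    print(f"{start}: {text[start - 1:]}")
```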
Applications of suffix arrays
1. Exact string matching
Does the sequence ababaa# contain any occurrence of the patterns abab, aab, and ab?
Binary search over the sorted suffixes. What is the cost? O(log(n) |P|)
Can it be improved to O(log(n)+|P|)?
Fast search with cost O(log(n)+|P|)
Query: …
Suffix array: positions 1, 2, …, n, with α < γ < β
Invariant properties:
P1: α < query ≤ β
P2: … matches pref(query)
Algorithm:
If suff(γ) < suff(query) then α = γ else β = γ
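The update rule on α, β and γ can be sketched as an ordinary binary search that maintains invariant P1. This version is the plain O(log(n) |P|) search: it compares up to |P| characters at every step and does not keep the prefix-match information of P2 that the refined O(log(n)+|P|) search needs. Names are illustrative, and the sentinels -1 and n stand for virtual positions before and after the array.

```python
def build_suffix_array(text):
    """0-based start positions of the suffixes, in lexicographic order."""
    return sorted(range(len(text)), key=lambda start: text[start:])

def find_pattern(text, suffix_array, pattern):
    """Return a 1-based start of pattern in text, or None if it does not occur."""
    m = len(pattern)
    alpha, beta = -1, len(suffix_array)       # invariant: suffix(alpha) < pattern <= suffix(beta)
    while beta - alpha > 1:
        gamma = (alpha + beta) // 2
        start = suffix_array[gamma]
        if text[start:start + m] < pattern:   # comparing the first m characters is enough
            alpha = gamma
        else:
            beta = gamma
    if beta < len(suffix_array) and text.startswith(pattern, suffix_array[beta]):
        return suffix_array[beta] + 1
    return None

text = "ababaa#"
sa = build_suffix_array(text)
found = {p: find_pattern(text, sa, p) for p in ("abab", "aab", "ab")}
print(found)
```

When the loop ends, β is the first suffix that is not smaller than the pattern, so the pattern occurs in the text exactly when that suffix starts with it.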