Comparison of large sequences


1 Comparison of large sequences
18/09/2018 First part: Alignment of large sequences

2 Dynamic programming
Short sequences (up to … bps) can be aligned using dynamic programming, as you have seen this morning. Example: accagt aligned against acca-- (four matching columns, marked |, followed by two columns marked x).
Dynamic programming has quadratic cost in both space and time.
What about genomes, such as accaccacaccacaacgagcata … acctgagcgatat?
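To make the quadratic cost concrete, here is a minimal sketch of global alignment by dynamic programming. The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; the slides do not specify a scoring scheme.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of a and b.

    The table has (len(a) + 1) * (len(b) + 1) cells and each cell is
    filled in constant time, hence quadratic space and time.
    Scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,       # gap in b
                table[i][j - 1] + gap,       # gap in a
            )
    return table[-1][-1]


print(global_alignment_score("accagt", "acca"))
```

For two sequences of a few thousand characters the table already holds millions of cells, which is why this approach does not scale to genomes.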

3 Genomic sequences
Genomic sequences have millions of base pairs, so the sequences are about 1000 times longer. With quadratic cost, the running time is then at least one million times higher (1 second becomes about 11 days; 1 minute becomes about 2 years), which makes this algorithm impractical.
In which cases can dynamic programming be applied?
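The figures on this slide can be checked with a few lines of arithmetic. This is only a sketch: the helper name is hypothetical, and it assumes the running time grows exactly with the square of the sequence length.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def quadratic_running_time(base_seconds, length_factor):
    """Running time after the input grows length_factor times, assuming quadratic cost."""
    return base_seconds * length_factor ** 2


# Sequences 1000 times longer: the cost grows 1000 ** 2 = one million times.
one_second_in_days = quadratic_running_time(1, 1000) / SECONDS_PER_DAY
one_minute_in_years = quadratic_running_time(60, 1000) / SECONDS_PER_YEAR

print(round(one_second_in_days, 1), "days")
print(round(one_minute_in_years, 1), "years")
```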

4 First assumption
[Figure: Genome A and Genome B drawn as dotted lines.]

5 Realistic assumption?
[Figure: Genome A and Genome B drawn as dotted lines, first under the label "Unrealistic assumption!" and then under the label "More realistic assumption".]

6 Realistic assumptions?
[Figure: Genome A and Genome B drawn as dotted lines, first under the label "Unrealistic assumption!" and then under the label "More realistic assumption".]
But is it now a real case?

7 Preview in a real case
Chlamydia muridarum: … bps
Chlamydia trachomatis: … bps

8 Preview in a real case
Pyrococcus abyssi: 1,790,334 bps
Pyrococcus horikoshii: … bps

9 Methodology of an alignment
1st: Make a preview.
2nd: Identify the portions that can be aligned.
3rd: Make the alignment.
(Linear cost) (Linear cost)

10 Methodology of an alignment
1st: Make a preview. ?
2nd: Identify the portions that can be aligned.
3rd: Make the alignment.
(Linear cost)

11 Preview revisited
MUM: Maximal Unique Matching, a substring that occurs exactly once in each sequence and cannot be extended in either direction.
Example: … a a t g….c t g... against … c g t g….c c c ...
Connect to MALGEN
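As an illustration of the definition only, the following brute-force sketch lists the MUMs of two short strings. It is not the linear-cost suffix-tree method the slides go on to describe; the function names, the 0-based positions and the `min_len` parameter are assumptions made for this example.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=1):
    """Return (start_in_a, start_in_b, length) for every maximal unique match.

    Brute force, for illustration only: far from linear cost.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            word = a[i:i + length]
            if (length >= min_len
                    and count_occurrences(a, word) == 1
                    and count_occurrences(b, word) == 1):
                mums.append((i, j, length))
    return mums


print(find_mums("AAGTCC", "TTGTCA"))
```

In this toy input the only MUM is GTC: the single-letter matches are maximal but occur more than once in one of the two strings.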

12 Methodology of an alignment
1st: Make a preview. How can MUMs be found? With suffix trees, at linear cost.
2nd: Identify the portions that can be aligned. How can these portions be determined?
3rd: Make the alignment, with CLUSTALW, TCOFFEE, …

13 Bioinformatics PhD. Course
Second part: Introducing Suffix trees

14 Suffix trees
Given the string ababaas, its suffixes are:
1: ababaas
2: babaas
3: abaas
4: baas
5: aas
6: as
7: s
What kind of queries?

15 Applications of Suffix trees
1. Exact string matching: does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
Suffix tree of ababaas (each leaf shows its edge text and the start of its suffix):
root
  a
    ba
      baas,1
      as,3
    as,5
    s,6
  ba
    baas,2
    as,4
  s,7

16 Quadratic insertion algorithm
Invariant properties: given the string … and the suffix tree …,
P1: the leaves of the suffixes from … have been inserted.

17-31 Quadratic insertion algorithm
Given the string ababaabbs, the suffixes are inserted one at a time, from suffix 1 to suffix 9. Each leaf is labelled with its edge text and the starting position of its suffix.
Suffix 1: new leaf ababaabbs,1 under the root.
Suffix 2: new leaf babaabbs,2 under the root.
Suffix 3: the edge ababaabbs is split after aba, giving the leaves baabbs,1 and abbs,3.
Suffix 4: the edge babaabbs is split after ba, giving the leaves baabbs,2 and abbs,4.
Suffix 5: the edge aba is split after a, giving the inner edge ba and the new leaf abbs,5.
Suffix 6: the edge ba below a is split after b, giving the inner edge a and the new leaf bs,6.
Suffix 7: the edge ba below the root is split after b, giving the inner edge a and the new leaf bs,7.
Suffix 8: new leaf s,8 under the node b.
Suffix 9: new leaf s,9 under the root.
Final tree:
root
  a
    abbs,5
    b
      a
        baabbs,1
        abbs,3
      bs,6
  b
    a
      baabbs,2
      abbs,4
    bs,7
    s,8
  s,9
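The quadratic insertion idea, and the exact-matching query from slide 15, can be sketched with an uncompressed suffix trie. This is a simplification assumed for brevity: a real suffix tree merges single-child chains into one edge, as in the slides. The names `build_suffix_trie`, `occurrences` and the `LEAF` marker are hypothetical.

```python
LEAF = "$leaf"  # marker key; never clashes with a one-character edge label


def build_suffix_trie(text):
    """Insert every suffix of text, one character per edge (quadratic cost)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1  # 1-based suffix position, as on the slides
    return root


def occurrences(trie, pattern):
    """Return the sorted 1-based start positions of pattern in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    # Every leaf below the reached node is a suffix that starts with pattern.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == LEAF:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("ababaas")
for pattern in ("abab", "aab", "ab"):
    print(pattern, occurrences(trie, pattern))
```

Once the structure is built, a query walks at most |P| edges before collecting leaves, regardless of the length of the indexed string.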

32 Generalized suffix tree
The suffix tree of many strings is called the generalized suffix tree. It is the suffix tree of the concatenation of the strings, each followed by its own terminator.
For instance, the generalized suffix tree of ababaabb and aabaa is the suffix tree of ababaabbαaabaaβ.

33-46 Generalized suffix tree
Construction of the suffix tree of ababaabbαaabaaβ: starting from the suffix tree of ababaabbα, the suffixes of aabaaβ are inserted one at a time, adding the leaves aaβ,1, β,2, β,3, β,4, β,5 and β,6 in that order.
Generalized suffix tree of ababaabbαaabaaβ:
root
  a
    a
      b
        bα,5
        aaβ,1
      β,4
    b
      a
        baabbα,1
        a
          bbα,3
          β,2
      bα,6
    β,5
  b
    a
      baabbα,2
      a
        bbα,4
        β,3
    bα,7
    α,8
  α,9
  β,6

47 Applications of Suffix trees
1. Exact string matching: does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
Suffix tree of ababaas (each leaf shows its edge text and the start of its suffix):
root
  a
    ba
      baas,1
      as,3
    as,5
    s,6
  ba
    baas,2
    as,4
  s,7

48 Applications of Suffix trees
2. The substring problem for a database of strings DB: does the DB contain any occurrence of the patterns abab, aab, and ab?
[Figure: the generalized suffix tree of ababaabbαaabaaβ.]

49 Applications of Suffix trees
3. The longest common substring of two strings.
[Figure: the generalized suffix tree of ababaabbαaabaaβ.]
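One way to picture this application: in a generalized suffix structure, the longest common substring is the deepest node reached by suffixes of both strings. The sketch below uses an uncompressed generalized suffix trie with quadratic construction, an assumption made to keep it short; the function name and node layout are hypothetical, and the terminators α and β are not needed because each node records which strings pass through it.

```python
def longest_common_substring(s1, s2):
    """Longest substring shared by s1 and s2, via a generalized suffix trie."""

    def new_node():
        return {"children": {}, "owners": set()}

    root = new_node()
    # Insert every suffix of both strings, tagging nodes with the string id.
    for owner, text in enumerate((s1, s2)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                if ch not in node["children"]:
                    node["children"][ch] = new_node()
                node = node["children"][ch]
                node["owners"].add(owner)

    # Depth-first search restricted to nodes visited by both strings.
    best, stack = "", [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node["children"].items():
            if child["owners"] == {0, 1}:
                stack.append((child, path + ch))
    return best


print(longest_common_substring("ababaabb", "aabaa"))
```

When several common substrings share the maximum length, this sketch returns whichever one the search reaches first.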

50 Applications of Suffix trees
5. Finding MUMs.
[Figure: the generalized suffix tree of ababaabbαaabaaβ.]

51 Bioinformatics PhD. Course
Third part: Suffix links

52-62 Suffix links
A suffix link connects the node whose path from the root spells xω, where x is a single character, to the node whose path spells ω.
[Figure sequence: the suffix tree of ababaabbα (the leaves baabbα,1, abbα,3, abbα,5, bα,6, bα,7, α,8 and α,9 are visible), with a question mark placed at a different node in successive frames.]

63-74 Traversal using Suffix links
Given S2 = a a b a a b b a, the suffix tree of S1 = ababaabbα is traversed to find the unique matchings of S2 in S1.
[Figure: the suffix tree of ababaabbα.]
Unique matchings, in the order they appear on the slides:
aa in S2 [1]
aab in S2 [1] = S1[5..6-7] in S2 [1]
S1[3..6-8] in S2 [2]
S1[4..6-8] in S2 [3]
S1[5..8] in S2 [4]
S1[6..8] in S2 [5]
S1[7..8] in S2 [6]

75 From UMs to MUMs
Given S2 = a a b a a b b a and S1 = a b a b a a b b α.
Unique matchings: S1[5..8] in S2 [4], S1[3..6-8] in S2 [2], S1[4..6-8] in S2 [3], S1[6..8] in S2 [5], S1[7..8] in S2 [6].
Array of UMs: 1 2 5 8 6 8 7 8 8 9
MUM: S1[3..6-8] in S2[2]

76 Bioinformatics PhD. Course
Fourth part: Linear insertion algorithm

77 Quadratic insertion algorithm
Invariant properties: given the string … and the suffix tree …,
P1: the leaves of the suffixes from … have been inserted.

78 Linear insertion algorithm
Invariant properties: given the string … and the suffix tree …,
P1: the leaves of the suffixes from … have been inserted.
P2: the string … is the longest string that can be spelt through the tree.

79-101 Linear insertion algorithm: example
Given the string ababaababb...
[Figure sequence: the first frame shows the suffix tree holding the leaves baababb...,1, baababb...,2, ababb...,3, ababb...,4 and ababb...,5. Later frames step through string positions 6, 7, 8 and 9, adding the leaves b...,6, b...,7, b...,8 and b...,9 and splitting edges as each one is inserted.]

102 Index: Suffix arrays
U. Manber and G. Myers, "Suffix arrays: a new method for on-line string searches".

103 Suffix arrays
Given the string ababaa#, its suffixes are:
1: ababaa#
2: babaa#
3: abaa#
4: baa#
5: aa#
6: a#
7: #
… but lexicographically sorted (with # smaller than every letter):
Position:  1  2  3  4  5  6  7
Suffix:    7  6  5  3  1  4  2
that is: 7: #, 6: a#, 5: aa#, 3: abaa#, 1: ababaa#, 4: baa#, 2: babaa#
What is the cost? O(n log(n))
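A minimal sketch of the construction, assuming plain sorting of the suffixes. The function name is hypothetical. Note that this simple version does O(n log(n)) suffix comparisons, and each comparison may itself inspect many characters, so it is not the specialised construction that achieves the cost quoted on the slide.

```python
def build_suffix_array(text):
    """Return the 1-based start positions of the suffixes in sorted order.

    Simplified sketch: sorts the suffixes directly. In ASCII, '#' sorts
    before every letter, matching the order used on the slide.
    """
    return sorted(range(1, len(text) + 1), key=lambda start: text[start - 1:])


suffix_array = build_suffix_array("ababaa#")
for position, start in enumerate(suffix_array, start=1):
    print(position, start, "ababaa#"[start - 1:])
```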

104 Applications of suffix arrays
1. Exact string matching: does the sequence ababaa# contain any occurrence of the patterns abab, aab, and ab?
Suffixes: 1: ababaa#, 2: babaa#, 3: abaa#, 4: baa#, 5: aa#, 6: a#, 7: #
Binary search over the suffix array (positions 1 to 7). What is the cost? O(log(n) |P|)
Can it be improved to O(log(n)+|P|)?
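The binary search with cost O(log(n) |P|) can be sketched as follows. Each of the O(log(n)) steps compares the pattern with a prefix of one suffix, which costs up to |P| character comparisons. The function names are hypothetical, and the suffix array is built by plain sorting to keep the example self-contained.

```python
def build_suffix_array(text):
    """1-based suffix start positions in lexicographic order (simple sort)."""
    return sorted(range(1, len(text) + 1), key=lambda start: text[start - 1:])


def find_occurrence(text, suffix_array, pattern):
    """Return the 1-based start of one occurrence of pattern, or None."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid] - 1
        # Compare only the first |P| characters of the suffix.
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(suffix_array):
        start = suffix_array[lo] - 1
        if text[start:start + len(pattern)] == pattern:
            return suffix_array[lo]
    return None


text = "ababaa#"
suffix_array = build_suffix_array(text)
for pattern in ("abab", "aab", "ab"):
    print(pattern, find_occurrence(text, suffix_array, pattern))
```

The search returns the occurrence whose suffix comes first in sorted order, which is why ab is reported at position 3 rather than position 1.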

105 Fast search with cost O(log(n)+|P|)
Suffix array: positions 1, 2, …, n, with two marked positions α and β. Query: …
Invariant properties:
P1: α < query ≤ β
P2: … matches pref(query)

106 Fast search with cost O(log(n)+|P|)
Suffix array: positions 1, 2, …, n, with three marked positions α, β and γ. Query: …
Invariant properties:
P1: α < query ≤ β
P2: … matches pref(query)
Algorithm: if suff(γ) < suff(query) then α = γ else β = γ

