Download presentation
Presentation is loading. Please wait.
Published byMadeline Jefferson Modified over 8 years ago
1
1 Выравнивание двух последовательностей
2
2 AGC A A A C -2-4-60 -2 -4 -6 -8 1 0 -3 -2 -5-4
3
3 Sequence comparison: Motivation Finding similarity between sequences is important for many biological questions. u Find homologous proteins Allows to predict structure and function u Locate similar subsequences in DNA e.g: allows to identify regulatory elements u Locate DNA sequences that might overlap Helps in sequence assembly
4
4 Dot plots u Not technically an “alignment” u But gives picture of correspondence between pairs of sequences u Dot represents similarity between segments of the two sequences
5
5 Sequence Alignment u Input: two sequences over the same alphabet u Output: an alignment of the two sequences u Two basic variants of sequence alignment: Global – all characters in both sequences participate Needleman-Wunsch, 1970 Needleman-Wunsch, 1970 Local – find related regions within sequences Smith-Waterman, 1981 Smith-Waterman, 1981
6
6 Sequence Alignment - Example Input: GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA Possible output: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A u Three elements: Perfect matches Mismatches Insertions & deletions (indel)
7
7 Scoring Function u Score each position independently: Match: +1 Mismatch: -1 Indel: -2 u Score of an alignment is sum of position scores u Example: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A Score: (+1x13) + (-1x2) + (-2x4) = 3 ------GCGCATGGATTGAGCGA TGCGCC----ATTGATGACCA-- Score: (+1x5) + (-1x6) + (-2x11) = -23
8
8 Homology Example: Evolution of the Globins
9
9 Sequence vs. Structure Similarity Sequence 1 lcl|1A6M:_ MYOGLOBIN Length 151 (1..151) Sequence 2 lcl|1JL7:A MONOMER HEMOGLOBIN COMPONENT III Length 147 (1..147) Score = 31.6 bits (70), Expect = 10 Identities = 33/137 (24%), Positives = 55/137 (40%), Gaps = 17/137 (12%) Query: 2 LSEGEWQLVLHVWAKVEA--DVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE 59 LS + Q+V W + + AG G++ L + +HPE F + Sbjct: 2 LSAAQRQVVASTWKDIAGADNGAGVGKECLSKFISAHPEMAAVFG--------FSGASDP 53 Query: 60 DLKKHGVTVLTALGAI---LKKKGHHEAELKPLAQSH---ATKHKIPIKYLEFISEAIIH 113 + + G VL +G L +G AE+K + H KH I +Y E + +++ Sbjct: 54 GVAELGAKVLAQIGVAVSHLGDEGKMVAEMKAVGVRHKGYGNKH-IKAEYFEPLGASLLS 112 Query: 114 VLHSRHPGDFGADAQGA 130 + R G A A+ A Sbjct: 113 AMEHRIGGKMNAAAKDA 129
10
10 Example Alignment: Globins u figure at right shows prototypical structure of globins u figure below shows part of alignment for 8 globins (-’s indicate gaps)
11
11 Insertions/Deletions and Protein Structure loop structures: insertions/deletions here not so significant u Why is it that two “similar” sequences may have large insertions/deletions? some insertions and deletions may not significantly affect the structure of a protein
12
12 Sequence vs. Structure Similarity 1A6M: Myoglobin1JL7: Hemoglobin u Myoglobin and hemoglobin are similar, but slight differences in structure let them perform different functions.
13
13 Myoglobin & Hemoglobin u http://higheredbcs.wiley.com/legacy/college/boyer/0471661791 /structure/HbMb/hbmb.htm Красивые ролики по структуре миоглобина и гемоглобина
14
14 The Space of Global Alignments some possible global alignments for ELV and VIS ELV VIS -ELV VIS- --ELV VIS-- ELV- -VIS ELV-- --VIS E-LV VIS- EL-V -VIS
15
15 Number of Possible Alignments u given sequences of length m and n u assume we don’t count as distinct and u we can have as few as 0 and as many as min{m, n} aligned pairs u therefore the number of possible alignments is given by C- -G -C G-
16
16 Number of Possible Alignments u there are possible global alignments for 2 sequences of length n e.g. two sequences of length 100 have possible alignments but we can use dynamic programming to find an optimal alignment efficiently
17
17 Dynamic Programming u Algorithmic technique for optimization problems that have two properties: Optimal substructure: Optimal solution can be computed from optimal solutions to subproblems Overlapping subproblems: Subproblems overlap such that the total number of distinct subproblems to be solved is relatively small
18
18 Dynamic Programming u Break problem into overlapping subproblems u use memoization: remember solutions to subproblems that we have already seen 1 3 2 7 6 8 5 4
19
19 Fibonacci example u 1,1,2,3,5,8,13,21,... u fib(n) = fib(n - 2) + fib(n - 1) u Could implement as a simple recursive function u However, complexity of simple recursive function is exponential in n
20
20 Fibonacci dynamic programming u Two approaches 1. Memoization: Store results from previous calls of function in a table (top down approach) 2. Solve subproblems from smallest to largest, storing results in table (bottom up approach) u Both require evaluating all (n-1) subproblems only once: O(n)
21
21 Dynamic Programming Graphs u Dynamic programming algorithms can be represented by a directed acyclic graph Each subproblem is a vertex Direct dependencies between subproblems are edges 321456 graph for fib(6)
22
22 Global Alignment u Input: two sequences over the same alphabet u Output: an alignment of the two sequences in which all characters in both sequences participate u The Needleman-Wunsch algorithm finds an optimal global alignment between two sequences Uses a scoring function A dynamic programming algorithm
23
23 Dynamic Programming Idea consider last step in computing alignment of AAAC with AGC u three possible options; in each we’ll choose a different pairing for end of alignment, and add this to the best alignment of previous characters AAA CAG CAAAC CAG -AAA -AGC C consider best alignment of these prefixes score of aligning this pair +
24
24 u Suppose we have two sequences: s=s 1 …s n and t=t 1 …t m u Construct a matrix V[n+1, m+1] in which V(i, j) contains the score for the best alignment between s 1 …s i and t 1 …t j. The grade for cell V(i, j) is: V(i-1, j)+ d V(i, j) = max V(i, j-1)+ d V(i-1, j-1)+ score (s i, t j ) u d- штраф за открытие разрыва (gap-open) - linear gap penalty u V(n,m) is the score for the best alignment between s and t The Needleman-Wunsch (NW) Algorithm
25
25 NW Algorithm – An Example u Alphabet: DNA, ∑ = {A,C,G,T} u Input: s = AAAC t = AGC u Scoring scheme: Match: score (x, x) = 1 Mismatch: score (x, y) = -1 Gap Opening d = -2 V(i-1, j)+ d V(i, j) = max V(i, j-1)+ d V(i-1, j-1)+ score (s i, t j ) V(i-1, j)+ d V(i, j) = max V(i, j-1)+ d V(i-1, j-1)+ score (s i, t j )
26
26 Initializing Matrix: Global Alignment with Linear Gap Penalty A s A 2s2s CAG A 3s3s C 4s4s 0 3s3ss2s2s
27
27 NW Algorithm – An Example AGC A A A C -2-4-60 -2 -4 -6 -8 1 0 -3 -2 -5-4 AG-C AAAC -AGC AAAC A-GC AAAC V(i-1, j)+ d V(i, j) = max V(i, j-1)+ d V(i-1, j-1)+ score (s i, t j ) V(i-1, j)+ d V(i, j) = max V(i, j-1)+ d V(i-1, j-1)+ score (s i, t j ) Match: score (x, x) = 1 Mismatch: score (x, y) = -1 Gap Opening d = -2 Match: score (x, x) = 1 Mismatch: score (x, y) = -1 Gap Opening d = -2 Обратный проход: движемся обратно по тем ячейкам, из которых было вычислено u Лучший вес по определению
28
28 NW – Time and Space Complexity Time: u Filling the matrix: u Backtracing: u Overall: Space: u Holding the matrix: AGC A A A C -2-4-60 -2 -4 -6 -8 1 0 -3 -2 -5-4 O(n·m) O(n+m) O(n·m)
29
29 Local Alignment Motivation u useful for comparing protein sequences that share a common motif (conserved pattern) or domain (independently folded unit) but differ elsewhere u useful for comparing DNA sequences that share a similar motif but differ elsewhere u useful for comparing protein sequences against genomic DNA sequences (long stretches of uncharacterized sequence) u more sensitive when comparing highly diverged sequences
30
30 Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation mature mRNA protein a gene
31
31 Structure of a genome ABMake DC If B then NOT D If A and B then D Make BD If D then B C gene D gene B short sequences regulate expression of genes lots of “junk” sequence e.g. ~50% repeats selfish DNA
32
32 Cross-species genome similarity u 98% of genes are conserved between any two mammals u ~75% average similarity in protein sequence hum_a : GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @ 57331/400001 mus_a : GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @ 78560/400001 rat_a : GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @ 112658/369938 fug_a : TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174 hum_a : CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001 mus_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001 rat_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938 fug_a : TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174 hum_a : AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT @ 57431/400001 mus_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT @ 78659/400001 rat_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT @ 112757/369938 fug_a : AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC @ 36084/68174 hum_a : AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 57481/400001 mus_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 78708/400001 rat_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 112806/369938 fug_a : CCGAGGACCCTGA------------------------------------- @ 36097/68174 “atoh” enhancer in human, mouse, rat, fugu fish
33
33 The local alignment problem Given two strings x = x 1 ……x M, y = y 1 ……y N Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g.x = aaaacccccgggg y = cccgggaaccaacc
34
34 Smith-Waterman Algorithm u Два отличия от Нидлмана-Вунша Для каждого элемента матрицы дана возможность принять значение, равное нулю, если все другие значения отрицательны Выравнивание может заканчиваться в любом месте таблицы. Лучший вес - наибольшее значение всей матрицы. Оттуда и начинается обратный проход 0 V(i-1, j)+ d V(i, j) = max V(i, j-1)+ d V(i-1, j-1)+ score (s i, t j )
35
35 Local Alignment u Let gap = -2 match = 1 mismatch = -1. GATCACCTGATACCC 000000000 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0000 1 0 2 1 1 0 0 0 0 0 3 2 2 0 0 0 0 1 4 3 0 0 1 0 0 2 3 2 0 1 0 0 0 0 3 1 0 0 0 0 1 2 2 1 1 GATCACCT GAT _ ACCC
36
36 Overlap Alignment Перекрывающиеся выравнивания Consider the following problem: Find the most significant overlap between two sequences S,T ? Possible overlap relations: a. b. Difference from local alignment: Here we require alignment between the endpoints of the two sequences. Мы хотим получить разновидность глобального выравнивания, но в котором нет штрафа за свисающие концы То есть выравнивание начиналось на левой или верхней границе матрицы, а заканчивалось на правой или нижней
37
37 Формально: Исходя из S[ 1..n ], T[ 1..m ] найти i,j такие что d - максимально, где d: d=max{D(S[1..i],T[j..m]), D(S[i..n],T[1..j]), D(S[1..n],T[i..j]), D(S[i..j],T[1..m]) }. Решение: То же самое, что и глобальное выравнивание, за исключением того, что мы не штрафуем за висящие концы. Overlap Alignment
38
38 u Recurrence: as in global alignment u Score: maximum value at the bottom line and rightmost line Overlap Alignment Initialization: V[i,0]=0, V[0,j]=0 globallocaloverlap
39
39 Overlap Alignment (Example) S = PAWHEAE T = HEAGAWGHEE Scoring scheme : u Match: +4 u Mismatch: -1 u Indel: -5
40
40 Overlap Alignment (Example) S = PAWHEAE T = HEAGAWGHEE Scoring scheme : u Match: +4 u Mismatch: -1 u Indel: -5
41
41 Overlap Alignment (Example) S = PAWHEAE T = HEAGAWGHEE Scoring scheme: u Match: +4 u Mismatch: -1 u Indel: -5
42
42 The best overlap is: PAWHEAE------ ---HEAGAWGHEE Pay attention! A different scoring scheme could yield a different result, such as: ---PAW-HEAE HEAGAWGHEE- Overlap Alignment (Example) Scoring scheme : u Match: +4 u Mismatch: -1 u Indel: -5 -2
43
43 Динамическое программирование с более сложными моделями u До сих пор мы рассматривали простейшую модель разрывов, где штраф d - линейно зависел от его длины. Каждый следующий остаток наказывается так же, как и первый. (g)= - nd n - число остатков, d - штраф за открытие разрыва u Введем аффинную функцию. (n)= -d-(n-1)e n - число остатков, d - штраф за открытие разрыва, а e - штраф за его продолжение
44
44 Dynamic Programming for the Affine Gap Penalty Case u to do in time, need 3 matrices instead of 1 best score given that y[j] is aligned to a gap best score given that x[i] is aligned to a gap best score given that x[i] is aligned to y[j] IGAx i LGVy i AIGAx i GVy i -- GAx i -- SLGVy i
45
45 Why Three Matrices Are Needed WFP F W 0-5-6-7 -511-4 -6620 S( F, W ) = 1 S( W, W ) = 11 S( F, F ) = 6 S( W, P ) = -4 S( F, P ) = -4 consider aligning the sequences WFP and FW using d= -4 (gap opening), e = -1 (gap extension) and the following values from the BLOSUM-62 substitution matrix: the matrix shows the highest-scoring partial alignment for each pair of prefixes -WFP FW-- optimal alignment best alignment of these prefixes; to get optimal alignment, need to also remember WF FW -WF FW-
46
46 Global Alignment DP for the Affine Gap Penalty Case d+e e e M Ix Iy M Ix Iy M Ix Iy M Ix Iy M Ix Iy
47
47 Global Alignment DP for the Affine Gap Penalty Case u initialization traceback –start at largest of –stop at any of –note that pointers may traverse all three matrices d+e
48
48 Global Alignment Example M : 0 I x : -3 I y : -3 -∞-∞ -∞ -4 -∞ -5 -∞ -7 -∞ -6 -∞ -8 1 -∞ -∞-∞ -5 -∞ -3 -7 -∞ -5 -4 -∞ -4 -8 -∞ -6 -∞-∞ -4 -∞-∞ -3 -∞-∞ 0 -9 -7 -5 -11 -5 -2 -8 -4 -6 -12 -6 -∞ -5 -∞ -6 -4 -∞ -4 -10 -3 -9 -5 -6 -8 -4 -10 -6 -∞ -6 -∞ A CACT A A T ACACT --AAT ACACT A--AT ACACT AA--T three optimal alignments:
49
49 Local Alignment DP for the Affine Gap Penalty Case d+e e e
50
50 Local Alignment DP for the Affine Gap Penalty Case u initialization traceback –start at largest –stop at
51
51 Computational Complexity and Gap Penalty Functions u linear: affine: general: concave assuming two sequences of length n
52
52 (Global) with General Gap Penalty Function consider every previous element in the row consider every previous element in the column
53
53 Finite State Automation (FSA) Конечный автомат u В теории алгоритмов такая система называеся конечным автоматом u Выравнивание соответствует пути через состояния автомата, а символы в выравнивании переписаны из исходных последовательностей согласно значениям состояний
54
54
55
55 Additional (optional)
56
56 Semi-global Alignment Example: CAGCA-CTTGGATTCTCGG –––CAGCGTGG–––––––– CAGCACTTGGATTCTCGG CAGC––––G––T––––GG We like the first alignment much better. In semiglobal comparison, we score the alignments ignoring some of the end spaces.
57
57 Global Alignment Example: AAACCC A CCC Prefer to see: AAACCC AAACCC ACCC ACCC Do not want to penalize the end spaces empty AAACCC 0-2-4-6-8-10-12 A -21-3-5-7-9 C -40-2 -4-6 C -3-2 -3 C -8-5-4-3000
58
58 SemiGlobal Alignment Example: s = AAACCC t = ACCC empty AAACCC 0000000 A -2111 C -400200 C -6-3-2131 C -8-5-4-3024
59
59 SemiGlobal Alignment Example: s = AAACCCG t = ACCC empty AAACCC 0000000 A -2111 C -400200 C -6-3-2131 C -8-5-4-3024 2 -2 0 G
60
60 SemiGlobal Alignment u Summary of end space charging procedures: Place where spaces are not penalized for Action Beginning of 1 st sequence End of 1 st sequence Beginning of 2 nd sequence End of 2 nd sequence Initialize 1 st row with zeros Look for max in last row Initialize 1 st column with zeros Look for max in last column
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.