1 Pairwise Sequence Alignment
2 [Figure: dynamic programming matrix with AGC across the columns and A, A, A, C down the rows]
3 Sequence comparison: Motivation
Finding similarity between sequences is important for many biological questions.
- Find homologous proteins: this allows prediction of structure and function
- Locate similar subsequences in DNA, e.g. to identify regulatory elements
- Locate DNA sequences that might overlap: this helps in sequence assembly
4 Dot plots
- Not technically an "alignment"
- But a dot plot gives a picture of the correspondence between pairs of sequences
- Each dot represents similarity between segments of the two sequences
5 Sequence Alignment
- Input: two sequences over the same alphabet
- Output: an alignment of the two sequences
- Two basic variants of sequence alignment:
  - Global: all characters in both sequences participate (Needleman-Wunsch, 1970)
  - Local: find related regions within the sequences (Smith-Waterman, 1981)
6 Sequence Alignment - Example
Input: GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA
Possible output:
  -GCGC-ATGGATTGAGCGA
  TGCGCCATTGAT-GACC-A
- Three elements:
  - Perfect matches
  - Mismatches
  - Insertions and deletions (indels)
7 Scoring Function
- Score each position independently: Match: +1, Mismatch: -1, Indel: -2
- The score of an alignment is the sum of its position scores
- Example 1:
  -GCGC-ATGGATTGAGCGA
  TGCGCCATTGAT-GACC-A
  Score: (+1 x 13) + (-1 x 2) + (-2 x 4) = 3
- Example 2:
  GCGCATGGATTGAGCGA
  TGCGCC----ATTGATGACCA--
  Score: (+1 x 5) + (-1 x 6) + (-2 x 11) = -23
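The position-wise scoring above can be sketched in a few lines of Python. This is a minimal illustration, assuming gaps are written as `-` and both rows have equal length; the function name `score_alignment` is hypothetical and not part of the slides.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, indel=-2):
    """Sum the per-column scores of a gapped alignment ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += indel      # insertion or deletion
        elif a == b:
            total += match      # perfect match
        else:
            total += mismatch   # substitution
    return total


# First example alignment from the slide: 13 matches, 2 mismatches, 4 indels.
print(score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # 3
```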
8 Homology Example: Evolution of the Globins
9 Sequence vs. Structure Similarity
Sequence 1: lcl|1A6M:_ MYOGLOBIN, Length 151 (1..151)
Sequence 2: lcl|1JL7:A MONOMER HEMOGLOBIN COMPONENT III, Length 147 (1..147)
Score = 31.6 bits (70), Expect = 10
Identities = 33/137 (24%), Positives = 55/137 (40%), Gaps = 17/137 (12%)

Query: 2   LSEGEWQLVLHVWAKVEA--DVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE 59
Sbjct: 2   LSAAQRQVVASTWKDIAGADNGAGVGKECLSKFISAHPEMAAVFG FSGASDP 53

Query: 60  DLKKHGVTVLTALGAI---LKKKGHHEAELKPLAQSH---ATKHKIPIKYLEFISEAIIH
Sbjct: 54  GVAELGAKVLAQIGVAVSHLGDEGKMVAEMKAVGVRHKGYGNKH-IKAEYFEPLGASLLS 112

Query: 114 VLHSRHPGDFGADAQGA
Sbjct: 113 AMEHRIGGKMNAAAKDA 129
10 Example Alignment: Globins
- [Figure: prototypical structure of globins]
- [Figure: part of an alignment for 8 globins; "-" characters indicate gaps]
11 Insertions/Deletions and Protein Structure
- Why is it that two "similar" sequences may have large insertions/deletions?
  - Some insertions and deletions may not significantly affect the structure of a protein
  - Loop structures: insertions/deletions here are not so significant
12 Sequence vs. Structure Similarity
[Figure: structures of 1A6M (myoglobin) and 1JL7 (hemoglobin)]
- Myoglobin and hemoglobin are similar, but slight differences in structure let them perform different functions.
13 Myoglobin & Hemoglobin
- /structure/HbMb/hbmb.htm
  Nice videos on the structure of myoglobin and hemoglobin
14 The Space of Global Alignments
Some possible global alignments for ELV and VIS:
  ELV     -ELV    --ELV   ELV-    ELV--   E-LV    EL-V
  VIS     VIS-    VIS--   -VIS    --VIS   VIS-    -VIS
15 Number of Possible Alignments
- Given sequences of length m and n
- Assume we do not count the following two alignments as distinct:
    C-      -C
    -G      G-
- We can have as few as 0 and as many as min{m, n} aligned pairs
- Therefore the number of possible alignments is given by
  $$\sum_{k=0}^{\min\{m,n\}} \binom{n}{k}\binom{m}{k} = \binom{n+m}{n}$$
16 Number of Possible Alignments
- There are $\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$ possible global alignments for two sequences of length n
- E.g. two sequences of length 100 have about 9 x 10^58 possible alignments
- But we can use dynamic programming to find an optimal alignment efficiently
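The count can be checked numerically. The sketch below assumes the convention stated on the previous slide (alignments that differ only in the order of adjacent unaligned characters are not counted as distinct), so an alignment with k aligned pairs is fixed by choosing which k characters of each sequence are paired. The function name `count_alignments` is hypothetical.

```python
from math import comb


def count_alignments(n, m):
    """Count global alignments by summing over the number k of aligned pairs."""
    return sum(comb(n, k) * comb(m, k) for k in range(min(n, m) + 1))


# The sum collapses to a single binomial coefficient (Vandermonde's identity).
print(count_alignments(3, 3), comb(6, 3))        # ELV vs VIS
print(f"{count_alignments(100, 100):.2e}")       # two sequences of length 100
```

Even for short sequences the count grows far too quickly for enumeration, which is the motivation for dynamic programming.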
17 Dynamic Programming
- Algorithmic technique for optimization problems that have two properties:
  - Optimal substructure: an optimal solution can be computed from optimal solutions to subproblems
  - Overlapping subproblems: subproblems overlap such that the total number of distinct subproblems to be solved is relatively small
18 Dynamic Programming
- Break the problem into overlapping subproblems
- Use memoization: remember solutions to subproblems that we have already seen
19 Fibonacci example
- 1, 1, 2, 3, 5, 8, 13, 21, ...
- fib(n) = fib(n - 2) + fib(n - 1)
- Could be implemented as a simple recursive function
- However, the complexity of the simple recursive function is exponential in n
20 Fibonacci dynamic programming
- Two approaches:
  1. Memoization: store results from previous calls of the function in a table (top-down approach)
  2. Solve subproblems from smallest to largest, storing results in a table (bottom-up approach)
- Both evaluate each of the (n-1) subproblems only once: O(n)
21 Dynamic Programming Graphs
- Dynamic programming algorithms can be represented by a directed acyclic graph
  - Each subproblem is a vertex
  - Direct dependencies between subproblems are edges
- [Figure: graph for fib(6)]
22 Global Alignment
- Input: two sequences over the same alphabet
- Output: an alignment of the two sequences in which all characters in both sequences participate
- The Needleman-Wunsch algorithm finds an optimal global alignment between two sequences
  - Uses a scoring function
  - Is a dynamic programming algorithm
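Both approaches can be sketched in Python as follows. This is an illustration only, assuming the indexing fib(1) = fib(2) = 1 implied by the sequence above; the names `fib_memo` and `fib_bottom_up` are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib_memo(n):
    """Top-down: plain recursion, but each fib(k) is cached after its first call."""
    if n <= 2:
        return 1
    return fib_memo(n - 2) + fib_memo(n - 1)


def fib_bottom_up(n):
    """Bottom-up: fill a table from the smallest subproblem to the largest."""
    table = [0, 1, 1] + [0] * max(0, n - 2)
    for k in range(3, n + 1):
        table[k] = table[k - 2] + table[k - 1]
    return table[n]


print([fib_memo(k) for k in range(1, 9)])
print([fib_bottom_up(k) for k in range(1, 9)])
```

The top-down version keeps the shape of the recursive definition, while the bottom-up version avoids recursion depth limits for large n.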
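23 Dynamic Programming Idea
Consider the last step in computing the alignment of AAAC with AGC.
- There are three possible options; in each we choose a different pairing for the end of the alignment and add its score to that of the best alignment of the preceding characters:
  1. best alignment of AAA with AG, followed by the pair C/C
  2. best alignment of AAAC with AG, followed by the pair -/C
  3. best alignment of AAA with AGC, followed by the pair C/-
- Total score = score of the best alignment of these prefixes + score of aligning the final pair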
24 The Needleman-Wunsch (NW) Algorithm
- Suppose we have two sequences: s = s_1…s_n and t = t_1…t_m
- Construct a matrix V of size (n+1) x (m+1) in which V(i, j) contains the score of the best alignment between s_1…s_i and t_1…t_j. The value of cell V(i, j) is:
  $$V(i,j) = \max \begin{cases} V(i-1,j) + d \\ V(i,j-1) + d \\ V(i-1,j-1) + \mathrm{score}(s_i, t_j) \end{cases}$$
- d is the gap penalty (labeled "gap-open" on the slide). It is a linear gap penalty: every gap position adds d, so d is negative
- V(n, m) is the score of the best alignment between s and t
25 NW Algorithm - An Example
- Alphabet: DNA, ∑ = {A, C, G, T}
- Input: s = AAAC, t = AGC
- Scoring scheme: Match: score(x, x) = 1; Mismatch: score(x, y) = -1; Gap: d = -2
- Recurrence:
  $$V(i,j) = \max \begin{cases} V(i-1,j) + d \\ V(i,j-1) + d \\ V(i-1,j-1) + \mathrm{score}(s_i, t_j) \end{cases}$$
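26 Initializing Matrix: Global Alignment with Linear Gap Penalty
[Figure: the matrix for AAAC (rows) and AGC (columns) before filling. The first row is initialized to 0, s, 2s, 3s and the first column to 0, s, 2s, 3s, 4s, where s on this slide denotes the penalty for a single gap position.]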
27 NW Algorithm - An Example
[Figure: the filled matrix for AAAC (rows) and AGC (columns) with traceback arrows]
- Scoring scheme: Match: score(x, x) = 1; Mismatch: score(x, y) = -1; Gap: d = -2
- Traceback: starting from V(n, m), move back through the cells from which each value was computed
- V(n, m) is the best score by definition
- The traceback yields three optimal alignments:
  AG-C    -AGC    A-GC
  AAAC    AAAC    AAAC
28 NW - Time and Space Complexity
Time:
- Filling the matrix: O(n·m)
- Backtracing: O(n+m)
- Overall: O(n·m)
Space:
- Holding the matrix: O(n·m)
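A minimal Python sketch of the fill and traceback steps, assuming the scoring scheme of this example (match +1, mismatch -1, d = -2). The function name `needleman_wunsch` and the tie-breaking order in the traceback (diagonal, then up, then left) are illustrative choices, so the sketch reports only one of the optimal alignments.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, d=-2):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    n, m = len(s), len(t)
    # V[i][j] = best score for aligning the prefixes s[:i] and t[:j]
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = i * d                 # s[:i] aligned against gaps only
    for j in range(1, m + 1):
        V[0][j] = j * d                 # t[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + pair,   # align s_i with t_j
                          V[i - 1][j] + d,          # s_i against a gap
                          V[i][j - 1] + d)          # t_j against a gap

    # Traceback: walk from V[n][m] to V[0][0] through a cell that produced each value.
    out_s, out_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if s[i - 1] == t[j - 1] else mismatch
            if V[i][j] == V[i - 1][j - 1] + pair:
                out_s.append(s[i - 1])
                out_t.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and V[i][j] == V[i - 1][j] + d:
            out_s.append(s[i - 1])
            out_t.append("-")
            i -= 1
        else:
            out_s.append("-")
            out_t.append(t[j - 1])
            j -= 1
    return V[n][m], "".join(reversed(out_s)), "".join(reversed(out_t))


score, a, b = needleman_wunsch("AAAC", "AGC")
print(score)
print(a)
print(b)
```

For s = AAAC and t = AGC the optimal score is -1; which of the three optimal alignments is printed depends on the tie-breaking order.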
29 Local Alignment Motivation
- Useful for comparing protein sequences that share a common motif (conserved pattern) or domain (independently folded unit) but differ elsewhere
- Useful for comparing DNA sequences that share a similar motif but differ elsewhere
- Useful for comparing protein sequences against genomic DNA sequences (long stretches of uncharacterized sequence)
- More sensitive when comparing highly diverged sequences
30 Structure of a genome
- Human genome: 3 x 10^9 bp, ~30,000 genes, ~200,000 exons, ~23 Mb coding, ~15 Mb noncoding
- [Figure: a gene is transcribed into pre-mRNA, spliced into mature mRNA, and translated into protein]
31 Structure of a genome
- Short sequences regulate the expression of genes
- Lots of "junk" sequence, e.g. ~50% repeats (selfish DNA)
- [Figure: genes B, C, and D with regulatory rules such as "If B then NOT D", "If A and B then D", and "If D then B"]
32 Cross-species genome similarity
- 98% of genes are conserved between any two mammals
- ~75% average similarity in protein sequence
- [Figure: multiple alignment of the "atoh" enhancer in human (hum_a), mouse (mus_a), rat (rat_a), and fugu fish (fug_a)]
33 The local alignment problem
Given two strings x = x_1…x_M and y = y_1…y_N, find substrings x', y' whose similarity (optimal global alignment value) is maximum.
E.g. x = aaaacccccgggg, y = cccgggaaccaacc
34 Smith-Waterman Algorithm
- Two differences from Needleman-Wunsch:
  - Each matrix element may take the value 0 if all other options are negative
  - The alignment can end anywhere in the table. The best score is the largest value in the whole matrix, and the traceback starts from that cell
  $$V(i,j) = \max \begin{cases} 0 \\ V(i-1,j) + d \\ V(i,j-1) + d \\ V(i-1,j-1) + \mathrm{score}(s_i, t_j) \end{cases}$$
35 Local Alignment
- Let gap = -2, match = 1, mismatch = -1.
- Sequences: GATCACCT and GATACCC
- Alignment:
  GATCACCT
  GAT-ACCC
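The two changes relative to Needleman-Wunsch can be sketched as follows, assuming the scores of this example (match +1, mismatch -1, gap -2). The function name `smith_waterman` is hypothetical, and when several cells share the maximum the sketch simply keeps the first one it meets.

```python
def smith_waterman(s, t, match=1, mismatch=-1, d=-2):
    """Return (score, aligned_s, aligned_t) for one best local alignment."""
    n, m = len(s), len(t)
    # First row and column stay 0: a local alignment may start anywhere.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,                        # restart the alignment here
                          V[i - 1][j - 1] + pair,
                          V[i - 1][j] + d,
                          V[i][j - 1] + d)
            if V[i][j] > best:
                best, best_pos = V[i][j], (i, j)

    # Traceback starts at the largest cell and stops at the first 0.
    out_s, out_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and V[i][j] > 0:
        pair = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + pair:
            out_s.append(s[i - 1])
            out_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j] + d:
            out_s.append(s[i - 1])
            out_t.append("-")
            i -= 1
        else:
            out_s.append("-")
            out_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(out_s)), "".join(reversed(out_t))


print(smith_waterman("GATCACCT", "GATACCC"))
```

For these two sequences the best local score is 4, reached by GATCACC over GAT-ACC; extending the alignment by the final T/C mismatch would lower the score to 3.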
36 Overlap Alignment
Consider the following problem: find the most significant overlap between two sequences S and T.
- [Figure: possible overlap relations (a) and (b)]
- Difference from local alignment: here we require the alignment to reach the endpoints of the two sequences.
- We want a variant of global alignment in which overhanging ends are not penalized. That is, the alignment starts on the left or top border of the matrix and ends on the right or bottom border.
37 Overlap Alignment
Formally: given S[1..n] and T[1..m], find i, j such that d is maximal, where
  d = max{ D(S[1..i], T[j..m]), D(S[i..n], T[1..j]), D(S[1..n], T[i..j]), D(S[i..j], T[1..m]) }.
Solution: the same as global alignment, except that we do not penalize overhanging ends.
38 Overlap Alignment
- Initialization: V[i,0] = 0, V[0,j] = 0
- Recurrence: as in global alignment
- Score: maximum value in the bottom row and the rightmost column
- [Figure: alignment paths through the matrix for global, local, and overlap alignment]
39 Overlap Alignment (Example)
S = PAWHEAE, T = HEAGAWGHEE
Scoring scheme:
- Match: +4
- Mismatch: -1
- Indel: -5
42 Overlap Alignment (Example)
Scoring scheme: Match: +4, Mismatch: -1, Indel: -5
The best overlap is:
  PAWHEAE------
  ---HEAGAWGHEE
Pay attention! A different scoring scheme could yield a different result. For example, with Indel: -2:
  ---PAW-HEAE
  HEAGAWGHEE-
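The overlap variant changes only the initialization and the place where the final score is read. Below is a small score-only sketch, assuming S indexes the rows and T the columns; the function name `overlap_score` is hypothetical.

```python
def overlap_score(s, t, match=4, mismatch=-1, indel=-5):
    """Best overlap score: free overhangs at the start and end of either sequence."""
    n, m = len(s), len(t)
    # Row 0 and column 0 stay 0, so leading overhangs cost nothing.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + pair,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)
    # Trailing overhangs are free: read the maximum of the last row and last column.
    last_row = V[n]
    last_col = [row[m] for row in V]
    return max(max(last_row), max(last_col))


print(overlap_score("PAWHEAE", "HEAGAWGHEE"))             # indel = -5
print(overlap_score("PAWHEAE", "HEAGAWGHEE", indel=-2))   # cheaper indels
```

With indel = -5 the best overlap pairs the suffix HEAE of S with the prefix HEAG of T (three matches and one mismatch, score 11). With indel = -2 a gapped overlap scores higher, which is the effect the slide points out.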
43 Dynamic programming with more complex models
- So far we have considered the simplest gap model, in which the gap penalty depends linearly on the gap length. Every additional residue is penalized the same as the first:
  $\gamma(n) = -nd$
  where n is the number of residues in the gap and d is the gap-open penalty
- We now introduce an affine function:
  $\gamma(n) = -d - (n-1)e$
  where n is the number of residues in the gap, d is the gap-open penalty, and e is the gap-extension penalty
44 Dynamic Programming for the Affine Gap Penalty Case
- To do this in O(nm) time, we need 3 matrices instead of 1:
  - M: best score given that x[i] is aligned to y[j]
  - Ix: best score given that x[i] is aligned to a gap
  - Iy: best score given that y[j] is aligned to a gap
- Examples of the three cases:
  M:  IGAx_i     Ix:  AIGAx_i     Iy:  GAx_i--
      LGVy_j          GVy_j--          SLGVy_j
45 Why Three Matrices Are Needed
- Consider aligning the sequences WFP and FW using d = -4 (gap opening), e = -1 (gap extension), and the following values from the BLOSUM-62 substitution matrix:
  S(F, W) = 1, S(W, W) = 11, S(F, F) = 6, S(W, P) = -4, S(F, P) = -4
- [Figure: matrix showing the highest-scoring partial alignment for each pair of prefixes]
- Optimal alignment:
  -WFP
  FW--
- The best alignment of the prefixes WF and FW is
  WF
  FW
  but to get the optimal alignment we also need to remember
  -WF
  FW-
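46 Global Alignment DP for the Affine Gap Penalty Case
[Figure: recurrences for the three matrices M, Ix, and Iy; the transitions are labeled d+e (opening a gap) and e (extending a gap)]
47 Global Alignment DP for the Affine Gap Penalty Case
- Initialization: [Figure: initial values of M, Ix, and Iy; label d+e]
- Traceback:
  - start at the largest of the three matrix values in the final cell
  - stop at any of the three matrices in the origin cell
  - note that pointers may traverse all three matrices
48 Global Alignment Example
[Figure: the matrices M, Ix, and Iy filled in for the sequences ACACT and AAT; unreachable entries are -∞]
Three optimal alignments:
  ACACT   ACACT   ACACT
  --AAT   A--AT   AA--T
49 Local Alignment DP for the Affine Gap Penalty Case
[Figure: recurrences for the three matrices in the local case; the transitions are labeled d+e (opening a gap) and e (extending a gap)]
50 Local Alignment DP for the Affine Gap Penalty Case
- Initialization: [Figure: initial values of the three matrices]
- Traceback:
  - start at the largest value of M
  - stop at a cell of M whose value is 0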
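A score-only sketch of the three-matrix scheme in Python. It assumes the gap function γ(n) = -d - (n-1)e with d and e given as positive penalties, which may differ in sign and bookkeeping from the d+e labels in the figures. The function name `affine_global_score` and the default scores are illustrative. The sketch also lets a gap in one sequence follow directly after a gap in the other, which some textbook formulations omit.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1, mismatch=-1, d=3, e=1):
    """Optimal global alignment score with gap cost gamma(n) = -d - (n-1)e."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # x[i] aligned to y[j]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x[i] aligned to a gap
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # y[j] aligned to a gap
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -d - (i - 1) * e     # leading gap of length i
    for j in range(1, m + 1):
        Iy[0][j] = -d - (j - 1) * e     # leading gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = pair + max(M[i - 1][j - 1],
                                 Ix[i - 1][j - 1],
                                 Iy[i - 1][j - 1])
            Ix[i][j] = max(M[i - 1][j] - d,     # open a gap after a match column
                           Ix[i - 1][j] - e,    # extend the current gap
                           Iy[i - 1][j] - d)    # open a gap after the other gap type
            Iy[i][j] = max(M[i][j - 1] - d,
                           Iy[i][j - 1] - e,
                           Ix[i][j - 1] - d)
    # The traceback would start at the largest of the three final cells.
    return max(M[n][m], Ix[n][m], Iy[n][m])


print(affine_global_score("ACACT", "AAT"))
print(affine_global_score("AAAC", "AGC", d=2, e=2))   # d == e reduces to a linear gap penalty
```

Setting e equal to d makes every gap position cost the same, so the result must agree with the single-matrix Needleman-Wunsch score; this is a convenient sanity check for an implementation.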
51 Computational Complexity and Gap Penalty Functions
Assuming two sequences of length n:
- linear: O(n^2)
- affine: O(n^2)
- general: O(n^3)
- concave: O(n^2 log n)
52 Global Alignment with General Gap Penalty Function
- For each cell, consider every previous element in the row
- and every previous element in the column
53 Finite State Automaton (FSA)
- In the theory of algorithms, such a system is called a finite state automaton
- An alignment corresponds to a path through the states of the automaton, and the symbols in the alignment are copied from the original sequences according to the states visited
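Written out, the row and column scans correspond to the following recurrence. This is the standard textbook form, given here as a sketch in the V, s, t notation of the earlier slides, with $\gamma(k)$ assumed to be the (negative) score of a gap of length k:

```latex
V(i,j) = \max
\begin{cases}
  V(i-1,\,j-1) + \mathrm{score}(s_i, t_j) \\[4pt]
  \max\limits_{0 \le k < i} \; V(k,\,j) + \gamma(i-k) \\[4pt]
  \max\limits_{0 \le k < j} \; V(i,\,k) + \gamma(j-k)
\end{cases}
```

Each of the roughly n^2 cells inspects O(n) earlier cells in its row and column, which gives O(n^3) time for two sequences of length n.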
55 Additional (optional)
56 Semi-global Alignment
Example:
  CAGCA-CTTGGATTCTCGG
  ---CAGCGTGG--------

  CAGCACTTGGATTCTCGG
  CAGC----G--T----GG
We like the first alignment much better. In semi-global comparison, we score the alignments ignoring some of the end spaces.
57 Global Alignment
Example:
  AAACCC
  A--CCC
We prefer to see:
  AAACCC
  --ACCC
We do not want to penalize the end spaces.
[Figure: DP matrix for AAACCC and ACCC]
58 Semi-global Alignment
Example: s = AAACCC, t = ACCC
[Figure: DP matrix for AAACCC and ACCC]
59 Semi-global Alignment
Example: s = AAACCCG, t = ACCC
[Figure: DP matrix for AAACCCG and ACCC]
60 Semi-global Alignment
Summary of end-space charging procedures:

Place where spaces are not penalized | Action
Beginning of 1st sequence            | Initialize 1st row with zeros
End of 1st sequence                  | Look for max in last row
Beginning of 2nd sequence            | Initialize 1st column with zeros
End of 2nd sequence                  | Look for max in last column
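The four rules in the table can be combined into one function with four switches. This is a sketch assuming the 1st sequence s indexes the rows and the 2nd sequence t the columns, with illustrative scores (match +1, mismatch -1, gap -2); the function and parameter names are hypothetical.

```python
def semiglobal_score(s, t, match=1, mismatch=-1, gap=-2,
                     free_start_s=False, free_end_s=False,
                     free_start_t=False, free_end_t=False):
    """Global alignment score with optional free end spaces in s and/or t."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        # Spaces at the beginning of s: first row is zero when they are free.
        V[0][j] = 0 if free_start_s else j * gap
    for i in range(1, n + 1):
        # Spaces at the beginning of t: first column is zero when they are free.
        V[i][0] = 0 if free_start_t else i * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + pair,
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)
    candidates = [V[n][m]]
    if free_end_s:
        candidates += V[n]                      # max over the last row
    if free_end_t:
        candidates += [row[m] for row in V]     # max over the last column
    return max(candidates)


print(semiglobal_score("AAACCC", "ACCC"))                        # plain global
print(semiglobal_score("AAACCC", "ACCC", free_start_t=True))     # leading spaces in t free
print(semiglobal_score("AAACCCG", "ACCC",
                       free_start_t=True, free_end_t=True))      # both ends of t free
```

Turning on all four switches gives the overlap variant, and turning them all off gives ordinary global alignment.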