Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Выравнивание двух последовательностей. 2 AGC A A A C -2-4-60 -2 -4 -6 -8 1 0 -3 -2 -5-4.

Similar presentations


Presentation on theme: "1 Выравнивание двух последовательностей. 2 AGC A A A C -2-4-60 -2 -4 -6 -8 1 0 -3 -2 -5-4."— Presentation transcript:

1 1 Выравнивание двух последовательностей

2 2 AGC A A A C -2-4-60 -2 -4 -6 -8 1 0 -3 -2 -5-4

3 3 Sequence comparison: Motivation Finding similarity between sequences is important for many biological questions. u Find homologous proteins  Allows to predict structure and function u Locate similar subsequences in DNA  e.g: allows to identify regulatory elements u Locate DNA sequences that might overlap  Helps in sequence assembly

4 4 Dot plots u Not technically an “alignment” u But gives picture of correspondence between pairs of sequences u Dot represents similarity between segments of the two sequences

5 5 Sequence Alignment u Input: two sequences over the same alphabet u Output: an alignment of the two sequences u Two basic variants of sequence alignment:  Global – all characters in both sequences participate  Needleman-Wunsch, 1970 Needleman-Wunsch, 1970  Local – find related regions within sequences  Smith-Waterman, 1981 Smith-Waterman, 1981

6 6 Sequence Alignment - Example Input: GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA Possible output: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A u Three elements:  Perfect matches  Mismatches  Insertions & deletions (indel)

7 7 Scoring Function u Score each position independently:  Match: +1  Mismatch: -1  Indel: -2 u Score of an alignment is sum of position scores u Example: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A Score: (+1x13) + (-1x2) + (-2x4) = 3 ------GCGCATGGATTGAGCGA TGCGCC----ATTGATGACCA-- Score: (+1x5) + (-1x6) + (-2x11) = -23

8 8 Homology Example: Evolution of the Globins

9 9 Sequence vs. Structure Similarity Sequence 1 lcl|1A6M:_ MYOGLOBIN Length 151 (1..151) Sequence 2 lcl|1JL7:A MONOMER HEMOGLOBIN COMPONENT III Length 147 (1..147) Score = 31.6 bits (70), Expect = 10 Identities = 33/137 (24%), Positives = 55/137 (40%), Gaps = 17/137 (12%) Query: 2 LSEGEWQLVLHVWAKVEA--DVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE 59 LS + Q+V W + + AG G++ L + +HPE F + Sbjct: 2 LSAAQRQVVASTWKDIAGADNGAGVGKECLSKFISAHPEMAAVFG--------FSGASDP 53 Query: 60 DLKKHGVTVLTALGAI---LKKKGHHEAELKPLAQSH---ATKHKIPIKYLEFISEAIIH 113 + + G VL +G L +G AE+K + H KH I +Y E + +++ Sbjct: 54 GVAELGAKVLAQIGVAVSHLGDEGKMVAEMKAVGVRHKGYGNKH-IKAEYFEPLGASLLS 112 Query: 114 VLHSRHPGDFGADAQGA 130 + R G A A+ A Sbjct: 113 AMEHRIGGKMNAAAKDA 129

10 10 Example Alignment: Globins u figure at right shows prototypical structure of globins u figure below shows part of alignment for 8 globins (-’s indicate gaps)

11 11 Insertions/Deletions and Protein Structure loop structures: insertions/deletions here not so significant u Why is it that two “similar” sequences may have large insertions/deletions?  some insertions and deletions may not significantly affect the structure of a protein

12 12 Sequence vs. Structure Similarity 1A6M: Myoglobin1JL7: Hemoglobin u Myoglobin and hemoglobin are similar, but slight differences in structure let them perform different functions.

13 13 Myoglobin & Hemoglobin u http://higheredbcs.wiley.com/legacy/college/boyer/0471661791 /structure/HbMb/hbmb.htm Красивые ролики по структуре миоглобина и гемоглобина

14 14 The Space of Global Alignments  some possible global alignments for ELV and VIS ELV VIS -ELV VIS- --ELV VIS-- ELV- -VIS ELV-- --VIS E-LV VIS- EL-V -VIS

15 15 Number of Possible Alignments u given sequences of length m and n u assume we don’t count as distinct and u we can have as few as 0 and as many as min{m, n} aligned pairs u therefore the number of possible alignments is given by C- -G -C G-

16 16 Number of Possible Alignments u there are possible global alignments for 2 sequences of length n e.g. two sequences of length 100 have possible alignments but we can use dynamic programming to find an optimal alignment efficiently

17 17 Dynamic Programming u Algorithmic technique for optimization problems that have two properties:  Optimal substructure: Optimal solution can be computed from optimal solutions to subproblems  Overlapping subproblems: Subproblems overlap such that the total number of distinct subproblems to be solved is relatively small

18 18 Dynamic Programming u Break problem into overlapping subproblems u use memoization: remember solutions to subproblems that we have already seen 1 3 2 7 6 8 5 4

19 19 Fibonacci example u 1,1,2,3,5,8,13,21,... u fib(n) = fib(n - 2) + fib(n - 1) u Could implement as a simple recursive function u However, complexity of simple recursive function is exponential in n

20 20 Fibonacci dynamic programming u Two approaches 1. Memoization: Store results from previous calls of function in a table (top down approach) 2. Solve subproblems from smallest to largest, storing results in table (bottom up approach) u Both require evaluating all (n-1) subproblems only once: O(n)

21 21 Dynamic Programming Graphs u Dynamic programming algorithms can be represented by a directed acyclic graph  Each subproblem is a vertex  Direct dependencies between subproblems are edges 321456 graph for fib(6)

22 22 Global Alignment u Input: two sequences over the same alphabet u Output: an alignment of the two sequences in which all characters in both sequences participate u The Needleman-Wunsch algorithm finds an optimal global alignment between two sequences  Uses a scoring function  A dynamic programming algorithm

23 23 Dynamic Programming Idea  consider last step in computing alignment of AAAC with AGC u three possible options; in each we’ll choose a different pairing for end of alignment, and add this to the best alignment of previous characters AAA CAG CAAAC CAG -AAA -AGC C consider best alignment of these prefixes score of aligning this pair +

24 24 u Suppose we have two sequences:  s=s 1 …s n and t=t 1 …t m u Construct a matrix V[n+1, m+1] in which V(i, j) contains the score for the best alignment between s 1 …s i and t 1 …t j.  The grade for cell V(i, j) is: V(i-1, j)+ d V(i, j) = max V(i, j-1)+ d V(i-1, j-1)+ score (s i, t j ) u d- штраф за открытие разрыва (gap-open) - linear gap penalty u V(n,m) is the score for the best alignment between s and t The Needleman-Wunsch (NW) Algorithm

25 25 NW Algorithm – An Example u Alphabet:  DNA, ∑ = {A,C,G,T} u Input:  s = AAAC  t = AGC u Scoring scheme:  Match: score (x, x) = 1  Mismatch: score (x, y) = -1  Gap Opening d = -2 V(i-1, j)+ d V(i, j) = max V(i, j-1)+ d V(i-1, j-1)+ score (s i, t j ) V(i-1, j)+ d V(i, j) = max V(i, j-1)+ d V(i-1, j-1)+ score (s i, t j )

26 26 Initializing Matrix: Global Alignment with Linear Gap Penalty A s A 2s2s CAG A 3s3s C 4s4s 0 3s3ss2s2s

27 27 NW Algorithm – An Example AGC A A A C -2-4-60 -2 -4 -6 -8 1 0 -3 -2 -5-4 AG-C AAAC -AGC AAAC A-GC AAAC V(i-1, j)+ d V(i, j) = max V(i, j-1)+ d V(i-1, j-1)+ score (s i, t j ) V(i-1, j)+ d V(i, j) = max V(i, j-1)+ d V(i-1, j-1)+ score (s i, t j ) Match: score (x, x) = 1 Mismatch: score (x, y) = -1 Gap Opening d = -2 Match: score (x, x) = 1 Mismatch: score (x, y) = -1 Gap Opening d = -2 Обратный проход: движемся обратно по тем ячейкам, из которых было вычислено u Лучший вес по определению

28 28 NW – Time and Space Complexity Time: u Filling the matrix: u Backtracing: u Overall: Space: u Holding the matrix: AGC A A A C -2-4-60 -2 -4 -6 -8 1 0 -3 -2 -5-4 O(n·m) O(n+m) O(n·m)

29 29 Local Alignment Motivation u useful for comparing protein sequences that share a common motif (conserved pattern) or domain (independently folded unit) but differ elsewhere u useful for comparing DNA sequences that share a similar motif but differ elsewhere u useful for comparing protein sequences against genomic DNA sequences (long stretches of uncharacterized sequence) u more sensitive when comparing highly diverged sequences

30 30 Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation mature mRNA protein a gene

31 31 Structure of a genome ABMake DC If B then NOT D If A and B then D Make BD If D then B C gene D gene B short sequences regulate expression of genes lots of “junk” sequence e.g. ~50% repeats selfish DNA

32 32 Cross-species genome similarity u 98% of genes are conserved between any two mammals u ~75% average similarity in protein sequence hum_a : GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @ 57331/400001 mus_a : GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @ 78560/400001 rat_a : GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @ 112658/369938 fug_a : TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174 hum_a : CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001 mus_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001 rat_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938 fug_a : TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174 hum_a : AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT @ 57431/400001 mus_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT @ 78659/400001 rat_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT @ 112757/369938 fug_a : AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC @ 36084/68174 hum_a : AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 57481/400001 mus_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 78708/400001 rat_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 112806/369938 fug_a : CCGAGGACCCTGA------------------------------------- @ 36097/68174 “atoh” enhancer in human, mouse, rat, fugu fish

33 33 The local alignment problem Given two strings x = x 1 ……x M, y = y 1 ……y N Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g.x = aaaacccccgggg y = cccgggaaccaacc

34 34 Smith-Waterman Algorithm u Два отличия от Нидлмана-Вунша  Для каждого элемента матрицы дана возможность принять значение, равное нулю, если все другие значения отрицательны  Выравнивание может заканчиваться в любом месте таблицы. Лучший вес - наибольшее значение всей матрицы. Оттуда и начинается обратный проход 0 V(i-1, j)+ d V(i, j) = max V(i, j-1)+ d V(i-1, j-1)+ score (s i, t j )

35 35 Local Alignment u Let gap = -2 match = 1 mismatch = -1. GATCACCTGATACCC 000000000 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0000 1 0 2 1 1 0 0 0 0 0 3 2 2 0 0 0 0 1 4 3 0 0 1 0 0 2 3 2 0 1 0 0 0 0 3 1 0 0 0 0 1 2 2 1 1 GATCACCT GAT _ ACCC

36 36 Overlap Alignment Перекрывающиеся выравнивания Consider the following problem: Find the most significant overlap between two sequences S,T ? Possible overlap relations: a. b. Difference from local alignment: Here we require alignment between the endpoints of the two sequences. Мы хотим получить разновидность глобального выравнивания, но в котором нет штрафа за свисающие концы То есть выравнивание начиналось на левой или верхней границе матрицы, а заканчивалось на правой или нижней

37 37 Формально: Исходя из S[ 1..n ], T[ 1..m ] найти i,j такие что d - максимально, где d: d=max{D(S[1..i],T[j..m]), D(S[i..n],T[1..j]), D(S[1..n],T[i..j]), D(S[i..j],T[1..m]) }. Решение: То же самое, что и глобальное выравнивание, за исключением того, что мы не штрафуем за висящие концы. Overlap Alignment

38 38 u Recurrence: as in global alignment u Score: maximum value at the bottom line and rightmost line Overlap Alignment  Initialization: V[i,0]=0, V[0,j]=0 globallocaloverlap

39 39 Overlap Alignment (Example) S = PAWHEAE T = HEAGAWGHEE Scoring scheme : u Match: +4 u Mismatch: -1 u Indel: -5

40 40 Overlap Alignment (Example) S = PAWHEAE T = HEAGAWGHEE Scoring scheme : u Match: +4 u Mismatch: -1 u Indel: -5

41 41 Overlap Alignment (Example) S = PAWHEAE T = HEAGAWGHEE Scoring scheme: u Match: +4 u Mismatch: -1 u Indel: -5

42 42 The best overlap is: PAWHEAE------ ---HEAGAWGHEE Pay attention! A different scoring scheme could yield a different result, such as: ---PAW-HEAE HEAGAWGHEE- Overlap Alignment (Example) Scoring scheme : u Match: +4 u Mismatch: -1 u Indel: -5 -2

43 43 Динамическое программирование с более сложными моделями u До сих пор мы рассматривали простейшую модель разрывов, где штраф d - линейно зависел от его длины. Каждый следующий остаток наказывается так же, как и первый.  (g)= - nd n - число остатков, d - штраф за открытие разрыва u Введем аффинную функцию.  (n)= -d-(n-1)e n - число остатков, d - штраф за открытие разрыва, а e - штраф за его продолжение

44 44 Dynamic Programming for the Affine Gap Penalty Case u to do in time, need 3 matrices instead of 1 best score given that y[j] is aligned to a gap best score given that x[i] is aligned to a gap best score given that x[i] is aligned to y[j] IGAx i LGVy i AIGAx i GVy i -- GAx i -- SLGVy i

45 45 Why Three Matrices Are Needed WFP F W 0-5-6-7 -511-4 -6620 S( F, W ) = 1 S( W, W ) = 11 S( F, F ) = 6 S( W, P ) = -4 S( F, P ) = -4  consider aligning the sequences WFP and FW using d= -4 (gap opening), e = -1 (gap extension) and the following values from the BLOSUM-62 substitution matrix: the matrix shows the highest-scoring partial alignment for each pair of prefixes -WFP FW-- optimal alignment best alignment of these prefixes; to get optimal alignment, need to also remember WF FW -WF FW-

46 46 Global Alignment DP for the Affine Gap Penalty Case d+e e e M Ix Iy M Ix Iy M Ix Iy M Ix Iy M Ix Iy

47 47 Global Alignment DP for the Affine Gap Penalty Case u initialization traceback –start at largest of –stop at any of –note that pointers may traverse all three matrices d+e

48 48 Global Alignment Example M : 0 I x : -3 I y : -3 -∞-∞ -∞ -4 -∞ -5 -∞ -7 -∞ -6 -∞ -8 1 -∞ -∞-∞ -5 -∞ -3 -7 -∞ -5 -4 -∞ -4 -8 -∞ -6 -∞-∞ -4 -∞-∞ -3 -∞-∞ 0 -9 -7 -5 -11 -5 -2 -8 -4 -6 -12 -6 -∞ -5 -∞ -6 -4 -∞ -4 -10 -3 -9 -5 -6 -8 -4 -10 -6 -∞ -6 -∞ A CACT A A T ACACT --AAT ACACT A--AT ACACT AA--T three optimal alignments:

49 49 Local Alignment DP for the Affine Gap Penalty Case d+e e e

50 50 Local Alignment DP for the Affine Gap Penalty Case u initialization traceback –start at largest –stop at

51 51 Computational Complexity and Gap Penalty Functions u linear: affine: general: concave  assuming two sequences of length n

52 52 (Global) with General Gap Penalty Function consider every previous element in the row consider every previous element in the column

53 53 Finite State Automation (FSA) Конечный автомат u В теории алгоритмов такая система называеся конечным автоматом u Выравнивание соответствует пути через состояния автомата, а символы в выравнивании переписаны из исходных последовательностей согласно значениям состояний

54 54

55 55 Additional (optional)

56 56 Semi-global Alignment Example: CAGCA-CTTGGATTCTCGG –––CAGCGTGG–––––––– CAGCACTTGGATTCTCGG CAGC––––G––T––––GG We like the first alignment much better. In semiglobal comparison, we score the alignments ignoring some of the end spaces.

57 57 Global Alignment Example: AAACCC A  CCC Prefer to see: AAACCC AAACCC   ACCC   ACCC Do not want to penalize the end spaces empty AAACCC 0-2-4-6-8-10-12 A -21-3-5-7-9 C -40-2 -4-6 C -3-2 -3 C -8-5-4-3000

58 58 SemiGlobal Alignment Example: s = AAACCC t =   ACCC empty AAACCC 0000000 A -2111 C -400200 C -6-3-2131 C -8-5-4-3024

59 59 SemiGlobal Alignment Example: s = AAACCCG t =   ACCC  empty AAACCC 0000000 A -2111 C -400200 C -6-3-2131 C -8-5-4-3024 2 -2 0 G

60 60 SemiGlobal Alignment u Summary of end space charging procedures: Place where spaces are not penalized for Action Beginning of 1 st sequence End of 1 st sequence Beginning of 2 nd sequence End of 2 nd sequence Initialize 1 st row with zeros Look for max in last row Initialize 1 st column with zeros Look for max in last column


Download ppt "1 Выравнивание двух последовательностей. 2 AGC A A A C -2-4-60 -2 -4 -6 -8 1 0 -3 -2 -5-4."

Similar presentations


Ads by Google