Presentation is loading. Please wait.

Presentation is loading. Please wait.

Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.

Similar presentations


Presentation on theme: "Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E."— Presentation transcript:

1 Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E

2 “Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975)) “Nothing in bioinformatics makes sense except in the light of Biology” Bioinformatics

3 The Genetic Code

4 A protein sequence alignment MSTGAVLIY--TSILIKECHAMPAGNE----- ---GGILLFHRTHELIKESHAMANDEGGSNNS * * * **** *** A DNA sequence alignment attcgttggcaaatcgcccctatccggccttaa att---tggcggatcg-cctctacgggcc---- *** **** **** ** ******

5 Searching for similarities What is the function of the new gene? The “lazy” investigation (i.e., no biologial experiments, just bioinformatics techniques): – Find a set of similar protein sequences to the unknown sequence – Identify similarities and differences – For long proteins: first identify domains

6 Is similarity really interesting?

7 Evolutionary and functional relationships Reconstruct evolutionary relation: Based on sequence -Identity (simplest method) -Similarity Homology (common ancestry: the ultimate goal) Other (e.g., 3D structure) Functional relation: Sequence Structure Function

8 Common ancestry is more interesting: Makes it more likely that genes share the same function Homology: sharing a common ancestor – a binary property (yes/no) – it’s a nice tool: When (an unknown) gene X is homologous to (a known) gene G it means that we gain a lot of information on X: what we know about G can be transferred to X as a good suggestion. Searching for similarities

9 How to go from DNA to protein sequence A piece of double stranded DNA: 5’ attcgttggcaaatcgcccctatccggc 3’ 3’ taagcaaccgtttagcggggataggccg 5’ DNA direction is from 5’ to 3’

10 How to go from DNA to protein sequence 6-frame translation using the codon table (last lecture): 5’ attcgttggcaaatcgcccctatccggc 3’ 3’ taagcaaccgtttagcggggataggccg 5’

11 Bioinformatics tool Data Algorithm Biological Interpretation (model) tool

12 Example today: Pairwise sequence alignment needs sense of evolution Global dynamic programming MDAGSTVILCFVG MDAASTILCGSMDAASTILCGS Amino Acid Exchange Matrix Gap penalties (open,extension) Search matrix MDAGSTVILCFVG- MDAAST-ILC--GS Evolution

13 How to determine similarity? Frequent evolutionary events: 1. Substitution 2. Insertion, deletion 3. Duplication 4. Inversion Evolution at work We’ll only use these Z X Y Common ancestor, usually extinct available

14 Alignment - the problem Operations: –substitution, –Insertion –deletion GTACT--C GGA-TGAC algorithmsforgenomes |||||||||| algorithms4genomes Algorithmsforgenomes ||||||| algorithms4genomes algorithmsforgenomes ||||||||||? ||||||| algorithms4--genomes

15 A protein sequence alignment MSTGAVLIY--TSILIKECHAMPAGNE----- ---GGILLFHRTHELIKESHAMANDEGGSNNS * * * **** *** A DNA sequence alignment attcgttggcaaatcgcccctatccggccttaa att---tggcggatcg-cctctacgggcc---- *** **** **** ** ******

16 – Substitution (or match/mismatch) DNA proteins – Gap penalty Linear: gp(k)=ak Affine: gp(k)=b+ak Concave, e.g.: gp(k)=log(k) The score for an alignment is the sum of the scores of all alignment columns Dynamic programming Scoring alignments

17 – Substitution (or match/mismatch) DNA proteins – Gap penalty Linear: gp(k)=  k Affine: gp(k)=  +  k Concave, e.g.: gp(k)=log(k) The score for an alignment is the sum of the scores over all alignment columns Dynamic programming Scoring alignments / Gap length penalty affine concave linear, General alignment score: S a,b =

18 Substitution Matrices: DNA define a score for match/mismatch of letters Simple: Used in genome alignments: ACGT A1 C 1 G 1 T 1 ACGT A91-114-31-123 C-114100-125-31 G -125100-114 T-123-31-11491

19 Substitution matrices for a.a. Amino acids are not equal: 1.Some are similar and easily substituted: biochemical properties structure 2.Some mutations occur more often due to similar codons The two above give us substitution matrices orange: nonpolar and hydrophobic. green: polar and hydrophilic magenta box are acidic light blue box are basic http://www.cimr.cam.ac.uk/links/codon.htm http://www.people.virginia.edu/~rjh9u/aminacid.html

20 BLOSUM 62 substitution matrix http://www.carverlab.org/testing/epp.html

21 Constant vs. affine gap scoring Seq1G T A - -G - T - A Seq2- - A T G- A T G - Const-2 –2 1 –2 –2 (SUM = -7)-2 –2 1 -2 –2 (SUM = -7) Affine –4 1 -4 (SUM = -7)-3 –3 1 -3 –3 (SUM = -11)‏ -2 extensionintro Gap Scoring -3Affine -2Constant …and +1 for match

22 Dynamic programming Scoring alignments 101 Amino Acid Exchange Matrix Affine gap penalties (open, extension) 20  20 Score: s(T,T)+s(D,D)+s(W,W)+s(V,L)-P o -2P x + +s(L,I)+s(K,K) T D W V T A L K T D W L - - I K

23 Amino acid exchange matrices How do we get one? And how do we get associated gap penalties? First systematic method to derive a.a. exchange matrices by Margaret Dayhoff et al. (1968) – Atlas of Protein Structure 20  20

24 A 2 R -2 6 N 0 0 2 D 0 -1 2 4 C -2 -4 -4 -5 12 Q 0 1 1 2 -5 4 E 0 -1 1 3 -5 2 4 G 1 -3 0 1 -3 -1 0 5 H -1 2 2 1 -3 3 1 -2 6 I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6 K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 F -4 -4 -4 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 P 1 0 -1 -1 -3 0 -1 -1 0 -2 -3 -1 -2 -5 6 S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 A R N D C Q E G H I L K M F P S T W Y V PAM250 matrix amino acid exchange matrix (log odds) Positive exchange values denote mutations that are more likely than randomly expected, while negative numbers correspond to avoided mutations compared to the randomly expected situation

25 Amino acids are not equal: 1. Some are easily substituted because they have similar: physico-chemical properties structure 2. Some mutations between amino acids occur more often due to similar codons The two above observations give us ways to define substitution matrices Amino acid exchange matrices

26 Pair-wise alignment Combinatorial explosion - 1 gap in 1 sequence: n+1 possibilities - 2 gaps in 1 sequence: (n+1)n - 3 gaps in 1 sequence: (n+1)n(n-1), etc. 2n (2n)! 2 2n = ~ n (n!) 2   n 2 sequences of 300 a.a.: ~10 88 alignments 2 sequences of 1000 a.a.: ~10 600 alignments! T D W V T A L K T D W L - - I K

27 Technique to overcome the combinatorial explosion: Dynamic Programming Alignment is simulated as Markov process, all sequence positions are seen as independent Chances of sequence events are independent –Therefore, probabilities per aligned position need to be multiplied –Amino acid matrices contain so-called log-odds values (log 10 of the probabilities), so probabilities can be summed

28 To perform statistical analyses on messages or sequences, we need a reference model. The model: each letter in a sequence is selected from a defined alphabet in an independent and identically distributed (i.i.d.) manner. This choice of model system will allow us to compute the statistical significance of certain characteristics of a sequence, its subsequences, or an alignment. Given a probability distribution, P i, for the letters in a i.i.d. message, the probability of seeing a particular sequence of letters i, j, k,... n is simply P i P j P k ···P n. As an alternative to multiplication of the probabilities, we could sum their logarithms and exponentiate the result. The probability of the same sequence of letters can be computed by exponentiating log P i + log P j + log P k + ··· + log P n. In practice, when aligning sequences we only add log-odds values (residue exchange matrix) but we do not exponentiate the final score. To say the same more statistically…

29 Sequence alignment History of Dynamic Programming algorithm 1970 Needleman-Wunsch global pair-wise alignment Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol. 48(3):443-53. 1981 Smith-Waterman local pair-wise alignment Smith, TF, Waterman, MS (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.

30 Pairwise sequence alignment Global dynamic programming MDAGSTVILCFVG MDAASTILCGSMDAASTILCGS Amino Acid Exchange Matrix Gap penalties (open,extension) Search matrix MDAGSTVILCFVG- MDAAST-ILC--GS Evolution

31 Global dynamic programming i-1 i j-1 j H(i,j) = Max H(i-1,j-1) + S(i,j) H(i-1,j) - g H(i,j-1) - g diagonal vertical horizontal Value from residue exchange matrix This is a recursive formula

32 Global alignment The Needleman-Wunsch algorithm Goal: find the maximal scoring alignment Scores: m match, -s mismatch, -g for insertion/deletion Dynamic programming –Solve smaller subproblem(s)‏ –Iteratively extend the solution The best alignment for X[1…i] and Y[1…j] is called M[i, j] X 1 … X i-1 - - X i Y 1 … - Y j-1 Y j -

33 The algorithm Y[1…j - 1] Y[j] X[1…i - 1] X[i] Y[1…j-1]Y[j] X[1…i] - Y[1…j] - X[1…i-1] X[i] M[i, j-1] - gM[i-1, j] - g M[i-1, j-1] + m - s M[i,j] = Goal: find the maximal scoring alignment Scores: m match, -s mismatch, -g for insertion/deletion The best alignment for X[1…i] and Y[1…j] is called M[i, j] 3 ways to extend the alignment

34 The algorithm – final equation M[i, j] = M[i, j- 1 ] – g M[i- 1, j] – g M[i- 1, j- 1 ] + score(X[i],Y[j])‏ jj-1 i-1 i max * * -g X 1 …X i-1 X i Y 1 …Y j-1 Y j X 1 …X i - Y 1 …Y j-1 Y j X 1 …X i-1 X i Y 1 …Y j - Corresponds to:

35 Example: global alignment of two sequences Align two DNA sequences: –GAGTGA –GAGGCGA (note the length difference)‏ Parameters of the algorithm: –Match: score(A,A) = 1 –Mismatch: score(A,T) = -1 –Gap: g = -2 M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max

36 The algorithm. Step 1: init Create the matrix Initiation –0 at [0,0] –Apply the equation… M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max 6543210 jj 7 6 5 4 3 2 1 0 ii A G C G G A G 0 - AGTGAG-

37 The algorithm. Step 1: init Initiation of the matrix: –0 at pos [0,0] –Fill in the first row using the “  ” rule –Fill in the first column using “  ” M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max j i

38 The algorithm. Step 2: fill in Continue filling in of the matrix, remembering from which cell the result comes (arrows)‏ M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max j i

39 The algorithm. Step 2: fill in We are done… Where’s the result? M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max j i The lowest-rightmost cell

40 The algorithm. Step 3: traceback Start at the last cell of the matrix Go against the direction of arrows Sometimes the value may be obtained from more than one cell (which one?)‏ j i M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max

41 The algorithm. Step 3: traceback Extract the alignments j i GAGT-GA GAGGCGA GA-GTGA GAGGCGA a b a)‏ b)‏ The ‘low’ and the ‘high’ road You can also jump from the high to the low road in the middle (following the arrow): to what alignment does that lead?

42 Affine gaps M[i, j] = I x [i-1, j- 1 ] I y [i- 1, j-1] M[i- 1, j- 1 ] max score(X[i], Y[j]) + I x [i, j] = M[i, j-1] + g o I x [i, j- 1 ] + g e max I y [i, j] = M[i-1, j] + g o I y [i-1, j] + g x max Penalties: g o - gap opening (e.g. -8)‏ g e - gap extension (e.g. -1)‏ X 1 … - Y 1 … Y j X 1 …X i - Y 1 …Y j-1 Y j X 1 … - - Y 1 …Y j-1 Y j X1… XiY1… YjX1… XiY1… Yj @ home: think of boundary values M[*,0], I[*,0] etc.

43 Semi-global pairwise alignment Global alignment: all gaps are penalised Semi-global alignment: N- and C-terminal (5’ and 3’) gaps (end-gaps) are not penalised MSTGAVLIY--TS----- ---GGILLFHRTSGTSNS End-gaps

44 Variation on global alignment Global alignment: previous algorithms is called global alignment, because it uses all letters from both sequences. CAGCACTTGGATTCTCGG CAGC-----G-T----GG Semi-global alignment: don’t penalize for end gaps CAGCA-CTTGGATTCTCGG ---CAGCGTGG--------

45 Semi-global pairwise alignment Applications of semi-global: – Finding a gene in genome – Placing marker onto a chromosome – One sequence much longer than the other Danger: really bad alignments for divergent sequences (particularly if gap penalties are high) Protein sequences have N- and C-terminal amino acids that are often small and hydrophilic, and so these ends match seq X: seq Y:

46 Semi-global alignment Ignore 5’ or N- terminal end gaps –First row/column set to 0 Ignore C-terminal or 3’ end gaps –Read the result from last row/column T G A - GTGAG- 13000 -202 10 1 0 000000

47 Take-home messages Homology Why are we interested in similarity? Pairwise alignment: global alignment and semi-global alignment Three types of gap penalty regimes: linear, affine and concave Make sure you can calculate alignment using the DP algorithm

48 a heuristic –Heuristics: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees. Perkins, DN (1981) The Mind's Best Work

49 Global dynamic programming PAM250, Gap =6 (linear) SPEARE 0-6-12-18-24-30-36 S-62-4-10-16-22-28 H-12 -42-3-9-14-20-24 A-18-10-30-7-13-18 K-24-16-9-32-4-12 E-30-22-15-5-3-26-6 -30-24-18-12-60 SPEARE S210100 H01 21 A1102-20 K00 30 E0 40 4 These values are copied from the PAM250 matrix (see earlier slide) The extra bottom row and rightmost column give the penalties that would need to be applied due to end gaps Higgs & Attwood, p. 124

50 Global dynamic programming Linear, Affine or Concave gap penalties i-1 j-1 S i,j = s i,j + Max Max{S 0<x<i-1, j-1 - Pi - (i-x-1)Px} S i-1,j-1 Max{S i-1, 0<y<j-1 - Pi - (j-y-1)Px} Gap opening penalty Gap extension penalty All penalty schemes are possible because the exact length of the gap is known

51 Global dynamic programming Gap o =10, Gap e =2 DWVTALK 0-12-14-16-18-20-22-24 T-128-9-6-5-9-11-14 D 09223-5-3-34 W-16-1325115490-21 V-18-10-4372119 15-16 L-20-14-223463137261 K-22-12-9173353395014 -34-2917392750 DWVTALK T83811998 D12168848 W12523265 V621288106 L46 96145 K85687513 These values are copied from the PAM250 matrix (see earlier slide), after being made non- negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix) The extra bottom row and rightmost column give the final global alignment scores

52 Easy DP recipe for using affine gap penalties M[i,j] is optimal alignment (highest scoring alignment until [i,j]) Check –preceding row until j-2: apply appropriate gap penalties –preceding row until i-2: apply appropriate gap penalties –and cell[i-1, j-1]: apply score for cell[i-1, j-1] i-1 j-1

53 DP is a two-step process Forward step: calculate scores Trace back: start at highest score and reconstruct the path leading to the highest score –These two steps lead to the highest scoring alignment (the optimal alignment) –This is guaranteed when you use DP!

54 Global dynamic programming

55 Semi-global dynamic programming - two examples with different gap penalties - Global score is 65 – 10 – 1*2 –10 – 2*2 These values are copied from the PAM250 matrix (see earlier slide), after being made non- negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix)

56 Local dynamic programming (Smith & Waterman, 1981) LCFVMLAGSTVIVGTR EDASTILCGSEDASTILCGS Amino Acid Exchange Matrix Gap penalties (open, extension) Search matrix Negative numbers AGSTVIVG A-STILCG

57 Local dynamic programming (Smith & Waterman, 1981) i-1 j-1 S i,j = Max S i,j + Max{S 0<x<i-1,j-1 - Pi - (i-x-1)Px} S i,j + S i-1,j-1 S i,j + Max {S i-1,0<y<j-1 - Pi - (j-y-1)Px} 0 Gap opening penalty Gap extension penalty

58 Local dynamic programming

59 Dot plots Way of representing (visualising) sequence similarity without doing dynamic programming (DP) Make same matrix, but locally represent sequence similarity by averaging using a window

60 Comparing two sequences We want to be able to choose the best alignment between two sequences. A simple method of visualising similarities between two sequences is to use dot plots. The first sequence to be compared is assigned to the horizontal axis and the second is assigned to the vertical axis.

61 Dot plots can be filtered by window approaches (to calculate running averages) and applying a threshold They can identify insertions, deletions, inversions


Download ppt "Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E."

Similar presentations


Ads by Google