Sequence Alignments with Indels


1 Sequence Alignments with Indels
Evolution produces insertions and deletions (indels)
– In addition to substitutions
Good example: aligning MHHNALQRRTVWVNAY with MHHALQRRTVWVNAY
MHHNALQRRTVWVNAY
MHHALQRRTVWVNAY-   BLOSUM score = 2 (end gap = -6)
MHHNALQRRTVWVNAY
MHH-ALQRRTVWVNAY   BLOSUM score = 79 (gap = -6)
An alignment must have aligned sequences of equal length
– For a full alignment we must add gaps at the starts and the ends
Finding the best indel solution is a combinatorially difficult problem

2 Gap
So far we have ignored gaps
A gap corresponds to an insertion or a deletion of a residue
Conventional wisdom dictates that the penalty for a gap must be several times greater than the penalty for a mutation. That is because a gap or extra residue
– Interrupts the entire polymer chain
– In DNA, shifts the reading frame in coding regions

3 Gap Penalties
Gaps are penalised
– Write w_x to indicate the penalty for a gap of length x
– For example, if each gap position scores -6, then w_x = -6*x
One common scheme is affine gap penalties
– Score -12 for opening a gap
– And -2 for every subsequent gap position
– i.e., w_x = -12 - 2*(x-1)
Start and end gap penalties are often set to zero
– But this can leave doubt about evolutionary conclusions unless we have fragments
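The two penalty schemes on this slide can be written as small functions. This is a minimal sketch: the function and parameter names are illustrative assumptions, and the default values are simply the ones quoted on the slide.

```python
def linear_gap_penalty(x, per_position=-6):
    """w_x when every gap position costs the same: w_x = -6 * x."""
    return per_position * x


def affine_gap_penalty(x, opening=-12, extension=-2):
    """w_x with a larger cost for opening: w_x = -12 - 2 * (x - 1)."""
    if x == 0:
        return 0  # no gap, no penalty
    return opening + extension * (x - 1)


# Compare the two schemes for a few gap lengths.
for x in (1, 2, 5):
    print(x, linear_gap_penalty(x), affine_gap_penalty(x))
```

With these values the affine scheme is harsher on short gaps but cheaper for long ones: a gap of length 5 costs -30 under the linear scheme and -20 under the affine scheme.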

4 Dot Matrix Representations (Dotplots)
To help visualise best alignments
Plot where each pair is the same, then draw the best line
[Figure: two dotplots of MNALSQLN (along the top) against NALMSQNH (down the side)]
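A dotplot for the two sequences on this slide can be sketched in a few lines. The function name `dotplot` and the choice of `*` and `.` as marks are assumptions made for illustration; the sketch only marks identical pairs and does not draw the best line.

```python
def dotplot(horizontal, vertical):
    """Return one text row per residue of `vertical`.

    A '*' marks a position where the two residues are identical,
    a '.' marks a position where they differ.
    """
    return [
        "".join("*" if h == v else "." for h in horizontal)
        for v in vertical
    ]


rows = dotplot("MNALSQLN", "NALMSQNH")
print("  " + "MNALSQLN")
for residue, row in zip("NALMSQNH", rows):
    print(residue, row)
```

Diagonal runs of `*` in the printed grid correspond to stretches that align without gaps, such as the N-A-L run near the top left.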

5 Getting Alignments from Dotplot Paths
[Figure: dotplot of MNALSQLN against NALMSQNH with the path drawn; one triangle indicates that M matches with a gap, another indicates that L matches with a gap]
Stage 1:
– Align the middle
– Use triangles to indicate gaps
NAL-SQLN
NALMSQ-N
Stage 2:
– Sort the ends out
MNAL-SQLN-
-NALMSQ-NH

6 Dotplots for Real Proteins Need a way to automatically find the best path(s)

7 Dynamic Programming Approach
BLAST is quick
– But not guaranteed to find the best alignment
– Gapped BLAST has indels, but no guarantee…
Dynamic programming:
– For sequence alignment, known as the Needleman-Wunsch algorithm
Can use it to draw the dotplot paths
– From that we can get the alignment
Mathematically guaranteed
– To find the best scoring alignment
– Given a substitution scheme (scoring scheme, e.g., BLOSUM)
– And given a gap penalty

8 The Needleman-Wunsch algorithm A smart way to reduce the massive number of possibilities that need to be considered, yet still guarantees that the best solution will be found (Saul Needleman and Christian Wunsch, 1970). The basic idea is to build up the best alignment by using optimal alignments of smaller subsequences. The Needleman-Wunsch algorithm is an example of dynamic programming, a discipline invented by Richard Bellman (an American mathematician) in 1953!

9 Dynamic Programming
A divide-and-conquer strategy:
– Break the problem into smaller subproblems.
– Solve the smaller problems optimally.
– Use the subproblem solutions to construct an optimal solution for the original problem.
Dynamic programming can be applied only to problems exhibiting the properties of overlapping subproblems and optimal substructure.
Examples include
– Travelling salesman problem
– Finding the best chess move

10 Overview of Needleman-Wunsch
Four stages
1. Initialise a matrix for the sequences
2. Fill in the entries of that matrix (call these S_{i,j})
– At the same time, draw arrows in the matrix: diagonal, up, left
3. Use the arrows to find the best scoring path(s)
4. Interpret the paths as alignments as before
Illustrate with: MNALQM & NALMSQA

11 Stage 1: Initialising the Matrix
Draw the grid
Put in increasing gap penalties
Then put in BLOSUM scores

12 Stage 2: Putting Scores and Arrows in
Put the score in
Draw the arrow

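13 Mathematically, we are calculating:
$$S_{i,j} = \max\{\, S_{i-1,j-1} + s(a_i, b_j),\; S_{i-1,j} + w_1,\; S_{i,j-1} + w_1 \,\}$$
Where:
– S_{i,j} is the matrix entry at (i,j) [the one we want to fill in]
– S_{i-1,j-1} is above and to the left of this
– s(a_i, b_j) is the BLOSUM score for the i-th residue from the horizontal sequence and the j-th residue from the vertical sequence (i.e., just the scores we have written in brackets)
– w_1 is the penalty for a single gap position (-6 in this example)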

14 This diagram might help:

15 Fill in the next row and column

16 A Close up View

17 Continue filling in the S i,j entries
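The fill and traceback can be sketched as follows for the example pair MNALQM and NALMSQA. Several things here are assumptions rather than part of the slides: the function names, the use of BLOSUM62 (the table holds only the entries needed for these residues), a flat penalty of -6 per gap position including end gaps, and the convention that the first sequence runs down the rows. Because end gaps are penalised inside the matrix, the final score is read from the bottom-right cell.

```python
# Subset of BLOSUM62 covering only the residues A, L, M, N, Q, S.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("A", "L"): -1, ("A", "M"): -1, ("A", "N"): -2,
    ("A", "Q"): -1, ("A", "S"): 1, ("L", "L"): 4, ("L", "M"): 2,
    ("L", "N"): -3, ("L", "Q"): -2, ("L", "S"): -2, ("M", "M"): 5,
    ("M", "N"): -2, ("M", "Q"): 0, ("M", "S"): -1, ("N", "N"): 6,
    ("N", "Q"): 0, ("N", "S"): 1, ("Q", "Q"): 5, ("Q", "S"): 0,
    ("S", "S"): 4,
}
GAP = -6  # penalty per gap position


def pair_score(x, y):
    """Look up a symmetric substitution score."""
    return BLOSUM62_SUBSET.get((x, y), BLOSUM62_SUBSET.get((y, x)))


def needleman_wunsch(a, b, gap=GAP):
    """Return (score, aligned_a, aligned_b) for one best global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0] * cols for _ in range(rows)]
    # Stage 1: increasing gap penalties along the first column and row.
    for i in range(1, rows):
        S[i][0] = gap * i
    for j in range(1, cols):
        S[0][j] = gap * j
    # Stage 2: each entry is the best of diagonal, up and left.
    for i in range(1, rows):
        for j in range(1, cols):
            S[i][j] = max(
                S[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                S[i - 1][j] + gap,
                S[i][j - 1] + gap,
            )
    # Stages 3 and 4: walk back from the bottom-right cell.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and S[i][j] == S[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1])):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return S[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("MNALQM", "NALMSQA")
print(score)
print(top)
print(bottom)
```

The traceback here follows only one path when several tie; a fuller version would store all arrows per cell and enumerate every best path.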

18 Stage 3: Finding the best path
Scores S_{i,j} in the matrix
– Are the BLOSUM scores for alignments
However!
– We must take into account final gap penalties
Look down the final column and along the final row
– Find the highest scoring number
– Remembering to take off the gap penalty the correct number of times

19 Finding the best path

20 So, the best path is:

21 Stage 4: Generating the Alignment Firstly, draw the Dotplot

22 Secondly, Generate the Alignment
Using the technique previously mentioned
– This path gives us an alignment with three gaps
       M   N   A   L   -   -   Q   M
       -   N   A   L   M   S   Q   A
S =   -6   6   4   4  -6  -6   5  -1 = 0
Should check that you get the same score
– As on the diagram

23 Other Alignments
MNALQ-M-
-NALMSQA (score=-4)
MNALQM--
-NALMSQA (score=-5)
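The scores quoted for these alignments can be checked column by column. The sketch below assumes BLOSUM62 values (only the seven pairs that actually occur are listed) and -6 for every gap position, end gaps included; these assumptions reproduce the totals of 0, -4 and -5 given on the slides. The function and table names are illustrative.

```python
# BLOSUM62 entries for the residue pairs that occur in these alignments.
PAIR_SCORES = {
    ("N", "N"): 6, ("A", "A"): 4, ("L", "L"): 4, ("Q", "Q"): 5,
    ("M", "A"): -1, ("M", "Q"): 0, ("M", "S"): -1,
}


def pair_score(x, y):
    """Look up a symmetric substitution score."""
    return PAIR_SCORES.get((x, y), PAIR_SCORES.get((y, x)))


def alignment_score(top, bottom, gap=-6):
    """Sum the column scores of a gapped alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap  # a residue matched with a gap
        else:
            total += pair_score(x, y)
    return total


for top in ("MNAL--QM", "MNALQ-M-", "MNALQM--"):
    print(top, "-NALMSQA", alignment_score(top, "-NALMSQA"))
```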

24 Smith-Waterman Alterations
To make the algorithm find the best local alignment
Adjustments only to the scoring scheme for S_{i,j}:
– The scoring scheme must include some negative scores for mismatches
– When S_{i,j} becomes negative, set it to zero
– So local paths are not penalised for earlier bad routes
To find the best local alignment
– Find the highest scoring matrix position (anywhere)
– And work backwards until a zero is reached
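The alterations listed above can be sketched for the same example pair, MNALQM and NALMSQA. As assumptions for illustration, the code uses a BLOSUM62 subset for the residues involved, a flat penalty of -6 per gap position, and hypothetical function names; when several cells share the highest score it keeps the first one found in row order.

```python
# Subset of BLOSUM62 covering only the residues A, L, M, N, Q, S.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("A", "L"): -1, ("A", "M"): -1, ("A", "N"): -2,
    ("A", "Q"): -1, ("A", "S"): 1, ("L", "L"): 4, ("L", "M"): 2,
    ("L", "N"): -3, ("L", "Q"): -2, ("L", "S"): -2, ("M", "M"): 5,
    ("M", "N"): -2, ("M", "Q"): 0, ("M", "S"): -1, ("N", "N"): 6,
    ("N", "Q"): 0, ("N", "S"): 1, ("Q", "Q"): 5, ("Q", "S"): 0,
    ("S", "S"): 4,
}
GAP = -6  # penalty per gap position


def pair_score(x, y):
    """Look up a symmetric substitution score."""
    return BLOSUM62_SUBSET.get((x, y), BLOSUM62_SUBSET.get((y, x)))


def smith_waterman(a, b, gap=GAP):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # first row and column stay zero
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            # Same three choices as the global fill, but never below zero.
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Work backwards from the highest cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = smith_waterman("MNALQM", "NALMSQA")
print(score)
print(top)
print(bottom)
```

Under these assumptions the best local alignment is the ungapped N-A-L match, which scores 6 + 4 + 4 = 14, whereas the best global alignment of the same pair scores 0.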

25 Local and Global Alignments
Needleman & Wunsch: best global alignments
Smith & Waterman: best local alignments
For illustration purposes only
– Calculations done slightly differently (don't worry)

26 Smith-Waterman-Eggert
To make the algorithm find the n best alignments
Find the best local alignment
Zero all cells used in the current alignment
Find the highest remaining score
– Second-best alignment
Repeat until results are below a threshold

27 An example database search
The Cystic Fibrosis Gene
Found by Lap-Chee Tsui and Francis Collins in 1989
Pure bioinformatics analysis

28 Cystic Fibrosis Gene: No complete mRNA/cDNA clone

29 Cystic Fibrosis Gene: DNA sequence -> predicted protein

30 Cystic Fibrosis Gene: Database search top hits

31 Cystic Fibrosis Gene: Top hit was 19% identity

32 Cystic Fibrosis Gene: Dotplot shows what 19% means

33 Cystic Fibrosis Gene: Protein matches itself

34 Cystic Fibrosis Gene: Hydrophobic features

35 Window size 21 better than 9. Matches membrane size (12 helices marked)

36 Cystic Fibrosis Gene: Predicted structure diagram Remember: All this was derived from a predicted protein sequence. No cDNA = no protein to experiment on!

