Multiple Sequence Alignment. How to score a MSA? Very commonly: Sum of Pairs = SP Compute the pairwise score of all pairs of sequences and sum them. Gap.


1 Multiple Sequence Alignment

2 How to score an MSA? Very commonly: Sum of Pairs (SP). Compute the pairwise score of all pairs of sequences and sum them. The gap penalty can be constant, linear, or nonlinear (the choice only affects the computational complexity). The MSA algorithm uses a constant gap penalty.
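A minimal sketch of sum-of-pairs scoring, assuming toy match/mismatch values and a fixed per-column gap penalty. The numbers and the names `pair_score` and `sum_of_pairs` are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one aligned pair of characters (hypothetical toy values)."""
    if a == "-" and b == "-":
        return 0              # two gaps facing each other cost nothing
    if a == "-" or b == "-":
        return gap            # fixed penalty for a residue facing a gap
    return match if a == b else mismatch

def sum_of_pairs(alignment):
    """Sum pairwise scores over every column and every pair of rows."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total

msa = ["ACG-T", "ACGAT", "A-GAT"]   # hypothetical 3-sequence alignment
print(sum_of_pairs(msa))
```

A real scorer would replace `pair_score` with a lookup in a substitution matrix; the column-by-column, pair-by-pair structure would stay the same.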

3 Methods for MSA
Exhaustive search – Extension of DP to multiple dimensions – E.g. MSA algorithm
Progressive alignment – Compute tree of sequences, based on hierarchical clustering, and then merge closest first, greedily – E.g. ClustalW
Anchor on locally conserved blocks – Find highly conserved regions and then grow alignment around these
Iterative search – Based on genetic algorithm search (see Mount)
Probabilistic/statistical (e.g. Gibbs Sampling)

4 Exhaustive search using Dynamic Programming Why not just use the same technique as for pairwise alignment? Instead of a 2-dimensional matrix, use an N-dimensional one. The size of the matrix grows exponentially with the number of sequences, so only N < 10 and lengths ~ 200 can be accommodated.
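To make the growth concrete, here is a small sketch that counts the cells in the full N-dimensional matrix, assuming one extra row per dimension for the leading gap (as in the pairwise case). The function name `dp_cells` is an illustrative assumption. Each cell also has up to 2^N - 1 predecessor cells to examine, which this count leaves out.

```python
def dp_cells(lengths):
    """Number of cells in the full DP matrix: product of (length + 1)."""
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells

# Sequences of length 200, as on the slide, for increasing N.
for n_seqs in (2, 3, 5, 10):
    print(n_seqs, dp_cells([200] * n_seqs))
```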

5 MSA Algorithm Based on the dynamic programming concept, with some optimizations (see Mount book, Fig 4.3).
1. Compute optimal pairwise alignments to get an upper bound on the score of each pair of sequences within the multiple alignment. The MSA cannot score better than the sum of the optimal pairwise alignment scores.
2. Create a heuristic multiple alignment in ad hoc fashion to get a lower bound on the MSA score.
3. Search the N-dimensional scoring matrix (as in the pairwise case) for the optimal path, where S[i,j,k…] is the best score including the ith element of sequence 1, the jth of sequence 2, the kth of sequence 3, etc.

6 Problem of sequence weights The available sequences are not randomly sampled, but reflect biases in how we collect sequences. If we weight everything equally, closely related sequences will dominate the multiple alignment. As a result, conclusions about (1) conservation, (2) evolutionary distance, and (3) reliability of predictions will be wrong.

7 Sequence weighting example Solutions: don’t weight the two humans equally with the others. Use a measure of similarity to down-weight their influence on the multiple alignment.

8 Computing sequence weights (many competing methods) Generally, we want to make our scoring less biased by the sampling of closely related species. More distant species represent a better sample of evolutionary forces. Identical sequences in the data set should not count extra, and neither should near-identical ones.
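One of the many competing schemes is position-based weighting in the style of Henikoff and Henikoff. It is sketched here only as an illustration; the slides do not say which method they have in mind. In each column a sequence receives 1 / (k * n), where k is the number of distinct residues in the column and n is how many sequences share its residue. This sketch treats a gap as just another residue, which is a simplifying assumption.

```python
from collections import Counter

def position_based_weights(alignment):
    """Return one normalized weight per sequence (weights sum to 1)."""
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        k = len(counts)                      # distinct residues in column
        for i, residue in enumerate(column):
            weights[i] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical data: two identical sequences and one divergent one.
toy = ["AA", "AA", "CC"]
print(position_based_weights(toy))
```

The two identical sequences split the weight that a single copy would have received, so duplicates do not count extra.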

9 Progressive methods Basic idea: start with the most similar sequence pairs and build the alignment by adding sequences (Fig 4.7 in Mount).
Perform pairwise sequence alignments.
Starting with the most similar sequences, merge sequences to find ancestor sequences, i.e. sequences with minimum edit distance to the two child sequences.
Assign weights to each branch of the tree, based on the distance between sequences.
Align sequences (starting from the closest) using the weights in the score function.
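The "merge closest first" order can be sketched as greedy average-linkage clustering over pairwise distances. The sequence names and distances below are hypothetical, and this is not claimed to be ClustalW's exact tree-building procedure.

```python
from itertools import combinations

def merge_order(names, dist):
    """Greedy average-linkage clustering; returns the merges in order.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    """
    clusters = [(name,) for name in names]
    merges = []

    def avg_dist(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the two clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg_dist(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges

names = ["human1", "human2", "mouse", "yeast"]   # hypothetical
dist = {
    frozenset(("human1", "human2")): 0.01,
    frozenset(("human1", "mouse")): 0.20,
    frozenset(("human2", "mouse")): 0.20,
    frozenset(("human1", "yeast")): 0.60,
    frozenset(("human2", "yeast")): 0.60,
    frozenset(("mouse", "yeast")): 0.55,
}
merges = merge_order(names, dist)
for left, right in merges:
    print(left, "+", right)
```

The merge order doubles as the alignment order: each merge corresponds to aligning two sequences, a sequence to a group, or two groups.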

10 Consensus sequence with minimum edit distance 1. If exact match, accept 2. If inexact, place a letter whose sum of score differences (i.e. edit distances) to the two letters (in the two original sequences) is minimized. This is labeled as D below.

11 Calculating tree based weights (Fig 4.8 Mount)

12 Weighting an alignment (Fig 4.8 Mount)

13 Problems with progressive alignments In progressive alignment the ultimate multiple alignment depends on the initial pairwise alignments. The first sequences to be aligned are the most similar. If the initial alignments are good, with very few errors, the ultimate multiple alignment will be good. However, if the sequences aligned are distantly related, more errors can be made, so the final alignment is not reliable.

14 Problems with progressive alignments Another problem with progressive alignment is that the ultimate multiple alignment is dependent on choosing the correct scoring matrices, and correct gap penalty.

15 What else do you need to know about all MSA methods? Almost all programs will align whatever sequences the user gives as input. They will always return an alignment, even if the sequences are completely unrelated. The biological thinking should be done by you. Most programs will insert gaps, and once inserted, gaps are there to stay. You need to check how the program treats end gaps.

16 Localized MSA: Motifs, Profiles and Related Databases

17 Some concepts: Similarity vs. homology
Sequence similarity can be measured in many ways:
% of identical residues in an alignment
% of “conservative” mutations (similar residues) in an alignment
Homology implies a common ancestry of the two sequences.
Similarity may be used as evidence of homology, but does not necessarily imply homology.

18 Specificity and sensitivity Sensitivity: the ability of a method to detect “true positive” matches. The most sensitive search finds all true positive matches, but may also find many false positive matches. Specificity/selectivity: the ability of a method to reject “false positive” matches. The most selective search rejects all false positive matches, but may also reject many true positive matches.
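In terms of counts, the two definitions are usually written as sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). A minimal sketch follows; the counts are hypothetical.

```python
def sensitivity(tp, fn):
    """Fraction of true matches that the method detects."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of non-matches that the method correctly rejects."""
    return tn / (tn + fp)

# Hypothetical search: 100 true family members, 1000 unrelated sequences.
print(sensitivity(tp=90, fn=10))    # 90 of 100 true members found
print(specificity(tn=950, fp=50))   # 950 of 1000 unrelated rejected
```

Loosening the score threshold typically moves sequences from FN to TP (higher sensitivity) and from TN to FP (lower specificity), which is the trade-off the slide describes.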

19 What is a motif? A subsequence (substring) that occurs in multiple sequences and has biological importance. Motifs can be totally constant or have variable elements.

20 The need for local MSA
Proteins with several similar but short regions:
aaa … … bbb … … … ccc
aaa … bbb … … … … … ccc
Proteins with extended motifs: GV(X20)C(X30)C
Proteins with inexact motifs, such as structural, electrostatic, hydrophobic, or hydrophilic motifs

21 Protein motif often results from structural features For example, structural features that are responsible for binding to the heme group in globins. Note: not all amino acids are oriented to affect the binding to the heme group.

22 Motifs in DNA sequence DNA sequences that provide signals for protein binding or nucleic acid folding. TRANSFAC Database holds information about experimentally verified transcription factors.

23 Sequence motifs rely on multiple alignments for definition Use a local alignment method (e.g. Smith-Waterman) to find local areas in protein sequences that are high scoring. Create a multiple alignment of all pairs that share the same local areas (multiple pairwise comparisons). Use this alignment to extract a summary of the key features of the motif.

24 Methods for representing motifs
Consensus sequence: a single string with the most likely sequence (+/- wildcards)
Regular expression (as in Unix grep/egrep): a string with wildcards, constrained selection
Profile: a list of the amino acid frequencies at each position
Sequence Logo: a graphical depiction of a profile

25 Consensus sequence: A simple conserved motif
Two globin sequences:
FLASDFTGAAMTWGKALVALH
FFSTNASGPAMLAGRGVIMPH
After looking at a number of globin motifs, we can build a CONSENSUS, which shows the most likely or most representative amino acid in each position. Alternatively, it can be the sequence that has the highest score to all the members of the family.
FFSDAWAGPTMVIGRGILMPH

26 Regular expression: Globin sequence signature
F-[LF]-X(5)-G-[PA]-X(4)-G-[KRA]-X-[LIVM]-X(3)-H
[ ] = choice
X(N) = wildcard of length N
H = conserved histidine at heme binding location
LIVM = all hydrophobic amino acids
G = conserved glycine
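The signature can be tried out by translating it into a Python regular expression. The converter below is a minimal sketch that handles only the elements used on this slide (single letters, bracketed choices, X wildcards, and repeat counts); the name `prosite_to_regex` is an illustrative assumption, not a library function.

```python
import re

def prosite_to_regex(pattern):
    """Translate the slide's pattern syntax into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"(\[[A-Z]+\]|[A-Z])(?:\((\d+)\))?", element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        symbol, count = m.groups()
        if symbol == "X":
            symbol = "."                      # X is the wildcard
        parts.append(symbol + (f"{{{count}}}" if count else ""))
    return "".join(parts)

signature = "F-[LF]-X(5)-G-[PA]-X(4)-G-[KRA]-X-[LIVM]-X(3)-H"
regex = re.compile(prosite_to_regex(signature))

# The two globin sequences and the consensus from the previous slide.
for seq in ("FLASDFTGAAMTWGKALVALH",
            "FFSTNASGPAMLAGRGVIMPH",
            "FFSDAWAGPTMVIGRGILMPH"):
    print(seq, bool(regex.search(seq)))
```

All three sequences match, while changing the conserved glycine or histidine breaks the match, which shows how a regular expression captures both the fixed and the variable positions.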

27 PROSITE database A manually created collection of regular expressions associated with different protein families/functions. Purpose: have a description of sequence motifs associated with function, for elucidating function of new sequences.

28 BLOCKS database Block: a conserved region lacking insert and delete positions. Collection of protein sequences with high level of similarity and which occur in PROSITE database, aligned in “blocks”. Purpose: understand the sequence variability of a particular motif. Also, can use to create substitution matrices, e.g. BLOSUM matrices.

29 Profile: A sample (Fig 4.11) C* = consensus sequence down first column Columns = scores for using amino acid listed at top of column
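A minimal sketch of the simplest kind of profile, a table of residue frequencies per alignment column, together with the consensus read from it. This is the frequency version, not the scored profile of Fig 4.11, and the toy alignment is hypothetical.

```python
from collections import Counter

def frequency_profile(alignment):
    """One dict per column mapping residue -> relative frequency."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile

def consensus(profile):
    """Most frequent residue per column (ties go to the first one seen)."""
    return "".join(max(col, key=col.get) for col in profile)

toy = ["ACGT", "ACGA", "TCGA"]   # hypothetical aligned sequences
profile = frequency_profile(toy)
print(profile[0])
print(consensus(profile))
```

A scored profile would go one step further and convert each column's frequencies into substitution scores for every possible amino acid.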

30 Sequence Logos Graphical summary of the conservation of elements in a motif. Relative heights of letters reflect their abundance in the alignment. Total height of “stack” = entropy-based measurement of conservation. Highly conserved = low entropy = tall stack. Very variable = high entropy = low stack.

31 Measure of conservation $\mathrm{Entropy}(i) = -\sum_{b} f(b,i)\,\log_2 f(b,i)$, where $f(b,i)$ is the frequency of base $b$ at position $i$. $\mathrm{Conservation}(i) = 2 - \mathrm{Entropy}(i)$. Units of conservation = bits of information. Bit = one binary decision; 4 bases are specified by 2 bits (00, 01, 10, 11). Entropy measures variability/disorder. If there is no variability, Entropy = 0 and Conservation = 2. If very variable (all four bases equally frequent), Entropy = 2 and Conservation = 0.
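A small sketch of this calculation for a single alignment column, assuming base-2 logarithms (so that the result is in bits) and the usual leading minus sign in the entropy sum. The function names are illustrative assumptions.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy of one alignment column, in bits."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def conservation(column, alphabet_size=4):
    """Maximum possible entropy (2 bits for DNA) minus observed entropy."""
    return math.log2(alphabet_size) - column_entropy(column)

for col in ("AAAA", "AACC", "ACGT"):   # hypothetical DNA columns
    print(col, column_entropy(col), conservation(col))
```

For proteins the same code applies with `alphabet_size=20`, where the maximum becomes log2(20), about 4.32 bits.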


34 Pitfalls of creating motifs Motifs depend on the quality of the multiple alignment. Multiple alignments are easier for related sequences and much harder for distantly related sequences. The database of sequences is not a random sample; it is a biased selection. Therefore, motifs will tend to be too specific and not sensitive enough.

35 FASTA (Fig 7.2) Use hash table of short words of the query sequence. Short = 1 to 6 characters. Go through database and look for matches in the query hash table (linear in size of database). Score matching segments based on content of these matches. Extend the good matches empirically.

36 Basic idea of hashing (Table 7.3)
Position:   1 2 3 4 5 6 7 8 9 10 11
Sequence 1: n c s p t a ...
Sequence 2: a c s p r k (at positions 6 to 11, per the table below)

AA   Position in S1   Position in S2   Offset (S1 - S2)
a    6                6                0
c    2                7                -5
k    -                11
n    1                -
p    4                9                -5
r    -                10
s    3                8                -5
t    5                -
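The table can be reproduced with a small lookup structure. This sketch uses single characters as the "words" and takes the starting position of each fragment as a parameter, since only positions 1 to 6 of sequence 1 and 6 to 11 of sequence 2 are shown. The function names are illustrative assumptions.

```python
from collections import Counter, defaultdict

def positions(seq, start=1):
    """Map each character to the list of positions where it occurs."""
    table = defaultdict(list)
    for i, ch in enumerate(seq, start=start):
        table[ch].append(i)
    return table

def offset_counts(seq1, seq2, start1=1, start2=1):
    """Count how often each offset (pos in seq1 - pos in seq2) occurs."""
    table1 = positions(seq1, start1)
    offsets = Counter()
    for j, ch in enumerate(seq2, start=start2):
        for i in table1.get(ch, []):
            offsets[i - j] += 1
    return offsets

counts = offset_counts("ncspta", "acsprk", start1=1, start2=6)
print(counts.most_common())
```

Three characters (c, s, p) share the offset -5, so sliding one sequence by five positions lines up the common run "csp". That shared offset is the signal this style of search looks for.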


41 The significance of matches is evaluated by the z score and the E value. The known statistical distribution of alignment scores is used to calculate the probability that a Z score could be greater than z, P(Z>z), from the extreme value distribution. The expected number E of alignments with scores higher than z that would be observed by chance in a database of D sequences is E(Z>z) ≈ D × P(Z>z).
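A minimal sketch of this calculation, assuming the z score has been standardized to mean 0 and standard deviation 1. Under that assumption the extreme value (Gumbel) tail is P(Z>z) = 1 - exp(-exp(-1.2825 z - 0.5772)), where 1.2825 approximates π/√6 and 0.5772 approximates Euler's constant. A program's reported z score may use a different scaling, so treat the constants as illustrative.

```python
import math

def p_value(z):
    """P(Z > z) for a standardized extreme value (Gumbel) distribution."""
    return 1.0 - math.exp(-math.exp(-1.2825 * z - 0.5772))

def e_value(z, database_size):
    """Expected number of chance scores above z among D sequences."""
    return database_size * p_value(z)

D = 10000   # hypothetical database size
for z in (0.0, 3.0, 5.0, 8.0):
    print(z, p_value(z), e_value(z, D))
```

Because E scales with D, the same z score is less convincing in a larger database.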

42 BLAST Finds inexact, ungapped “seeds” using a hashing technique (like FASTA) and then extends the seed to maximum length possible. Based on strong statistical/significance framework “What is a significantly high score of two segments of length N and M?” Most commonly used for fast searches and alignments. New versions now do gapped segments.
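A simplified sketch of the seed-and-extend idea. It uses exact word matches as seeds (BLAST itself also accepts similar, inexact words), extends to the right only, and stops once the running score has fallen a fixed amount below the best score seen. The scoring values, the drop-off threshold, the sequences, and the function names are all illustrative assumptions.

```python
def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) for every exact shared word."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds

def extend_right(query, subject, i, j, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension from (i, j); returns (best score, its length)."""
    score = best = best_len = 0
    length = 0
    while i + length < len(query) and j + length < len(subject):
        same = query[i + length] == subject[j + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break                       # score dropped too far: stop
    return best, best_len

query, subject = "ACGTTGCA", "TTACGTAGCATT"   # hypothetical sequences
seeds = find_seeds(query, subject)
print(seeds)
print(extend_right(query, subject, *seeds[0]))
```

In this toy case the extension from the first seed runs through one mismatch and absorbs the later seeds on the same diagonal, which is why extending seeds gives longer segments than the word hits alone.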


