1
Pairwise Sequence Alignments
2
Topics to be Covered
• Comparison methods
• Global alignment
• Local alignment
3
Introduction to Alignment
Alignment analyzes the similarities and differences between sequences at the level of individual bases or amino acids.
The aim is to infer structural, functional and evolutionary relationships among sequences.
4
Sequence Alignment
982 TGTTTGCTAAAGCTTCAGCTATCCACAACCCAATTGACCTCTAC 1022
    | ||||||||  |  |  | ||| ||||||||||     |||||
961 TCTTTGCTAAGACCGCCTCCATCTACAACCCAATCA---TCTAC 1001
• Two sequences are written out, one on top of the other
• Identical or similar characters are placed in the same column
• Nonidentical characters are either placed in the same column as a mismatch or opposite a gap in the other sequence
• The overall quality of the alignment is then evaluated with a formula that counts the number of identical (or similar) pairs minus the number of mismatches and gaps
5
Pairwise Sequence Alignments
Why compare? Similarity search is necessary for:
• Family assignment
• Sequence annotation
• Construction of phylogenetic trees
• Learning about evolutionary relationships
• Classifying sequences
• Identifying functions
• Homology modeling
6
Essential Elements of an Alignment Algorithm
• Defining the problem (global or local alignment)
• Scoring scheme (gap penalties)
• Distance matrix (PAM, BLOSUM series)
7
Global and Local Alignments
Global: an attempt is made to align the entire sequence using as many characters as possible, up to both ends of the sequences.
Local: stretches of sequence with the highest density of matches are aligned.

Global alignment:
L G P S S K Q T G K G S - S R I W D N
|         |     | | |       |     |
L N - I T K S A G K G A I M R L G D A

Local alignment:
T G K G
  | | |
A G K G

Sequences that are quite similar and approximately the same length are suitable candidates for global alignment.
Local alignments are more suitable for aligning sequences that are similar along some of their lengths but dissimilar in others, sequences that differ in length, or sequences that share a conserved region or domain.
8
Local vs. Global Alignment (cont’d)
Local alignment is the better choice for finding a conserved segment.

Global alignment:
--T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C

Local alignment:
                  TCCCAGTTATGTCAGGGGACACGAGCATGCAGAGAC
                     ||||||||||||
AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC
9
Global and Local Alignments
Global: used when two sequences are of approximately equal length. The goal is to obtain the maximum score by aligning them completely.
Local: used when one sequence is a substring of the other, or when the goal is to get the maximum local score. Example: protein motif searches in a database.
10
Dynamic programming algorithm
Build up optimal alignment using previous solutions for optimal alignments of subsequences
11
Aligning Sequences without Insertions and Deletions: Hamming Distance
Given two DNA sequences v and w:
v : ATATATAT
w : TATATATA
The Hamming distance dH(v, w) = 8 is large, but the sequences are very similar.
12
Aligning Sequences with Insertions and Deletions
By shifting one sequence over one position:
v : ATATATAT-
w : -TATATATA
The edit distance: d(v, w) = 2.
Hamming distance neglects insertions and deletions in DNA.
13
Edit Distance
Levenshtein (1966) introduced the edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) needed to transform one string into the other.
d(v, w) = MIN number of elementary operations to transform v into w
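The definition above can be turned into a small dynamic programming routine. The following is a minimal sketch, not taken from the slides; the function name `edit_distance` is an illustrative choice, and each insertion, deletion and substitution is assumed to cost 1.

```python
def edit_distance(v: str, w: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning v into w."""
    # prev[j] holds the distance between the current prefix of v and w[:j].
    prev = list(range(len(w) + 1))
    for i, a in enumerate(v, start=1):
        curr = [i]  # turning v[:i] into the empty string takes i deletions
        for j, b in enumerate(w, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from v
                curr[j - 1] + 1,         # insert b into v
                prev[j - 1] + (a != b),  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ATATATAT", "TATATATA"))  # 2
```

Only two rows of the table are kept at a time, which is enough when just the distance (and not the alignment itself) is needed.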
14
Edit Distance vs Hamming Distance
Hamming distance always compares the i-th letter of v with the i-th letter of w.
V = ATATATAT
W = TATATATA
Hamming distance: d(v, w) = 8
Computing Hamming distance is a trivial task.
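As a rough illustration of why this is trivial, the whole computation is one pass over the two strings. The function name `hamming_distance` is an assumption made for this sketch.

```python
def hamming_distance(v: str, w: str) -> int:
    """Count positions where the i-th letter of v differs from the i-th letter of w."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(v, w))


print(hamming_distance("ATATATAT", "TATATATA"))  # 8
```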
15
Edit Distance vs Hamming Distance
Hamming distance always compares the i-th letter of v with the i-th letter of w:
V = ATATATAT
W = TATATATA
Hamming distance: d(v, w) = 8
Computing Hamming distance is a trivial task.

Edit distance may compare the i-th letter of v with the j-th letter of w. Just one shift makes it all line up:
V = -ATATATAT
W = TATATATA
Edit distance: d(v, w) = 2
Computing edit distance is a non-trivial task.
16
Edit Distance vs Hamming Distance
Hamming distance always compares the i-th letter of v with the i-th letter of w:
V = ATATATAT
W = TATATATA
Hamming distance: d(v, w) = 8

Edit distance may compare the i-th letter of v with the j-th letter of w:
V = -ATATATAT
W = TATATATA
Edit distance: d(v, w) = 2 (one insertion and one deletion)

How do we find which j goes with which i?
18
Edit Distance: Example
TGCATAT → ATCCGAT in 5 steps
TGCATAT (delete last T)
TGCATA (delete last A)
TGCAT (insert A at front)
ATGCAT (substitute C for the G in position 3)
ATCCAT (insert G before last A)
ATCCGAT (done)
What is the edit distance? 5?
20
Edit Distance: Example (cont’d)
TGCATAT → ATCCGAT in 4 steps
TGCATAT (insert A at front)
ATGCATAT (delete the T in position 6)
ATGCAAT (substitute G for the A in position 5)
ATGCGAT (substitute C for the G in position 3)
ATCCGAT (done)
Can it be done in 3 steps?
21
The Alignment Grid Every alignment path is from source to sink
22
Alignment as a Path in the Edit Graph
(figure: edit graph for v and w, with positions 0 to 7 on each axis)
v = AT_GTTAT_
w = ATCGT_A_C
Corresponding path: (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
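The correspondence between an alignment and a path can be sketched in a few lines: each column advances the v coordinate, the w coordinate, or both. The function name `alignment_to_path` and the use of "_" as the gap symbol follow the slide's notation but are otherwise assumptions of this sketch.

```python
def alignment_to_path(v_row: str, w_row: str, gap: str = "_"):
    """Convert two aligned rows into the list of edit-graph vertices they visit."""
    if len(v_row) != len(w_row):
        raise ValueError("aligned rows must have the same length")
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(v_row, w_row):
        if a != gap:
            i += 1  # the column consumes a letter of v
        if b != gap:
            j += 1  # the column consumes a letter of w
        path.append((i, j))
    return path


print(alignment_to_path("AT_GTTAT_", "ATCGT_A_C"))
```

For the alignment shown on the slide this reproduces the path from (0,0) to (7,7) listed there.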
23
Alignment as a Path in the Edit Graph
(figure: edit graph for v and w, with positions 0 to 7 on each axis)
Every path in the edit graph corresponds to an alignment.
24
Alignment as a Path in the Edit Graph
(figure: edit graph for v and w, with positions 0 to 7 on each axis)
Old alignment:
v = AT_GTTAT_
w = ATCGT_A_C
New alignment:
v = AT_GTTAT_
w = ATCG_TA_C
25
From LCS to Alignment: Change up the Scoring
The Longest Common Subsequence (LCS) problem, the simplest form of sequence alignment, allows only insertions and deletions (no mismatches).
In the LCS problem, we scored 1 for matches and 0 for indels.
Consider penalizing indels and mismatches with negative scores.
Simplest scoring schema:
+1 : match premium
-μ : mismatch penalty
-σ : indel penalty
26
Simple Scoring
When mismatches are penalized by -μ, indels are penalized by -σ, and matches are rewarded with +1, the resulting score is:
#matches - μ(#mismatches) - σ(#indels)
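A minimal sketch of this scoring formula, applied to an alignment that is already given. The function name `simple_score`, the default values of μ and σ, and the "_" gap symbol are assumptions chosen for illustration.

```python
def simple_score(v_row: str, w_row: str, mu: float = 1.0, sigma: float = 1.0,
                 gap: str = "_") -> float:
    """Score = #matches - mu * #mismatches - sigma * #indels for two aligned rows."""
    matches = mismatches = indels = 0
    for a, b in zip(v_row, w_row):
        if a == gap or b == gap:
            indels += 1        # one letter is aligned against a gap
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mu * mismatches - sigma * indels


# 5 matches, 0 mismatches, 4 indels
print(simple_score("AT_GTTAT_", "ATCGT_A_C", mu=1, sigma=1))  # 1
```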
27
Dynamic programming algorithm
Define a matrix F: F(i,j) is the score of the optimal alignment of the prefixes A1...i and B1...j.
Iterative build-up: F(0,0) = 0.
Define each element (i,j) from its three neighbours:
(i-1, j): Ai aligned to a gap (gap in sequence B)
(i, j-1): Bj aligned to a gap (gap in sequence A)
(i-1, j-1): alignment of Ai to Bj
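The build-up just described can be sketched as a global alignment routine with traceback. This is an illustrative sketch only: the function name `needleman_wunsch`, the linear gap penalty, and the default scores (match 1, mismatch -1, gap -2) are assumptions, not values from the slides.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap          # prefix of a aligned against gaps only
    for j in range(1, m + 1):
        F[0][j] = j * gap          # prefix of b aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # a[i-1] aligned to b[j-1]
                          F[i - 1][j] + gap,       # a[i-1] aligned to a gap
                          F[i][j - 1] + gap)       # b[j-1] aligned to a gap
    # Traceback from the sink (n, m) to the source (0, 0).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

When several alignments share the optimal score, this traceback prefers the diagonal move, so it reports only one of them.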
28
Dynamic programming
29
Sequence Comparison Scoring Matrices
• The choice of a scoring matrix can strongly influence the outcome of sequence analysis
• Scoring matrices implicitly represent a particular theory of evolution
• Elements of the matrices specify the similarity or the distance of replacing one residue (base) by another
• Distance and similarity matrices are inter-convertible by some mathematical transformation
30
Protein Scoring Matrices
The two most popular matrices are the PAM and the BLOSUM matrix
31
Scoring Insertions and Deletions
Without a gap:
A T G T A A T G C A
T A T G T G G A A T G A
With a gap (insertion / deletion):
A T G T - - A A T G C A
T A T G T G G A A T G A
The creation of a gap is penalized with a negative score value.
32
Why Gap Penalties? The optimal alignment of two similar sequences is usually that which maximizes the number of matches and minimizes the number of gaps. Permitting the insertion of arbitrarily many gaps can lead to high scoring alignments of non-homologous sequences. Penalizing gaps forces alignments to have relatively few gaps.
33
Why Gap Penalties?
Gaps not permitted (Match = 5, Mismatch = -4), Score: 10
1 GTGATAGACACAGACCGGTGGCATTGTGG 29
  |||   |  | |||    |   || || |
1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29
Gaps allowed but not penalized (Match = 5, Mismatch = -4), Score: 88
1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29
  ||| || | | | |||  ||  |  |  || || |
1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29
34
Gap Penalties
Linear gap penalty score: γ(g) = -gd
Affine gap penalty score: γ(g) = -d - (g - 1)e
γ(g) = gap penalty score of a gap of length g
d = gap opening penalty
e = gap extension penalty
g = gap length
35
Scoring Indels: Naive Approach
A fixed penalty σ is given to every indel:
-σ for 1 indel
-2σ for 2 consecutive indels
-3σ for 3 consecutive indels, etc.
This can be too severe a penalty for a series of 100 consecutive indels.
36
Affine Gap Penalties
In nature, a series of k indels often comes as a single event rather than a series of k single-nucleotide events.
This is more likely:
ATA__GC
ATATTGC
This is less likely:
ATAG_GC
AT_GTGC
Normal scoring would give the same score for both alignments.
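The difference between the two alignments can be made concrete by scoring them under a linear and an affine gap penalty. This is a sketch under stated assumptions: matches score +1, mismatches score 0, the penalty functions use the γ(g) form with opening penalty d and extension penalty e, and the helper names (`gap_lengths`, `linear_gap`, `affine_gap`, `score`) are hypothetical.

```python
from itertools import groupby


def gap_lengths(row: str, gap: str = "_"):
    """Lengths of the maximal runs of gap characters in one aligned row."""
    return [len(list(run)) for ch, run in groupby(row) if ch == gap]


def linear_gap(g: int, d: float) -> float:
    return -g * d                      # gamma(g) = -g d


def affine_gap(g: int, d: float, e: float) -> float:
    return -d - (g - 1) * e            # gamma(g) = -d - (g - 1) e


def score(v_row: str, w_row: str, gap_penalty, gap: str = "_") -> float:
    """+1 per match, 0 per mismatch, plus gap_penalty(g) for every gap of length g."""
    matches = sum(a == b and a != gap for a, b in zip(v_row, w_row))
    gaps = gap_lengths(v_row, gap) + gap_lengths(w_row, gap)
    return matches + sum(gap_penalty(g) for g in gaps)


one_event = ("ATA__GC", "ATATTGC")     # one gap of length 2
two_events = ("ATAG_GC", "AT_GTGC")    # two gaps of length 1

linear = lambda g: linear_gap(g, d=1)
affine = lambda g: affine_gap(g, d=3, e=0.1)

print(score(*one_event, linear), score(*two_events, linear))   # same score
print(score(*one_event, affine), score(*two_events, affine))   # single event scores higher
```

Under the linear penalty both alignments get the same score, while the affine penalty favours the single two-letter gap.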
37
Accounting for Gaps
Gap: a contiguous sequence of spaces in one of the rows.
The score for a gap of length x is -(ρ + σx), where ρ > 0 is the gap opening penalty (the penalty for introducing a gap) and σ is the gap extension penalty.
ρ will be large relative to σ, because extending an existing gap should not add too much of a penalty.
38
Affine Gap Penalty Recurrences
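Here $s^{\downarrow}_{i,j}$ scores alignments ending in a gap in w (deletion), $s^{\rightarrow}_{i,j}$ scores alignments ending in a gap in v (insertion), and $s_{i,j}$ is the middle level.

$$s^{\downarrow}_{i,j} = \max \begin{cases} s^{\downarrow}_{i-1,j} - \sigma & \text{continue gap in } w \text{ (deletion)} \\ s_{i-1,j} - (\rho + \sigma) & \text{start gap in } w \text{ (deletion): from middle} \end{cases}$$

$$s^{\rightarrow}_{i,j} = \max \begin{cases} s^{\rightarrow}_{i,j-1} - \sigma & \text{continue gap in } v \text{ (insertion)} \\ s_{i,j-1} - (\rho + \sigma) & \text{start gap in } v \text{ (insertion): from middle} \end{cases}$$

$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) & \text{match or mismatch} \\ s^{\downarrow}_{i,j} & \text{end deletion: from top} \\ s^{\rightarrow}_{i,j} & \text{end insertion: from bottom} \end{cases}$$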
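A minimal sketch of the three-level recurrence, computing only the optimal score. The function name `affine_global_score`, the array names `mid`, `dele` and `ins`, and the default parameter values are assumptions made for illustration; a gap of length x costs ρ + σx as stated earlier.

```python
def affine_global_score(v: str, w: str, match: float = 1, mismatch: float = -1,
                        rho: float = 3, sigma: float = 1) -> float:
    """Optimal global alignment score when a gap of length x costs rho + sigma * x."""
    NEG = float("-inf")
    n, m = len(v), len(w)
    mid = [[NEG] * (m + 1) for _ in range(n + 1)]    # ends in a match or mismatch
    dele = [[NEG] * (m + 1) for _ in range(n + 1)]   # ends in a gap in w (deletion)
    ins = [[NEG] * (m + 1) for _ in range(n + 1)]    # ends in a gap in v (insertion)
    mid[0][0] = 0
    for i in range(1, n + 1):                        # first column: one long deletion
        dele[i][0] = -(rho + sigma * i)
        mid[i][0] = dele[i][0]
    for j in range(1, m + 1):                        # first row: one long insertion
        ins[0][j] = -(rho + sigma * j)
        mid[0][j] = ins[0][j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dele[i][j] = max(dele[i - 1][j] - sigma,           # continue gap in w
                             mid[i - 1][j] - (rho + sigma))    # start gap in w
            ins[i][j] = max(ins[i][j - 1] - sigma,             # continue gap in v
                            mid[i][j - 1] - (rho + sigma))     # start gap in v
            delta = match if v[i - 1] == w[j - 1] else mismatch
            mid[i][j] = max(mid[i - 1][j - 1] + delta,         # match or mismatch
                            dele[i][j],                        # end deletion
                            ins[i][j])                         # end insertion
    return mid[n][m]


# Five matches and one gap of length 2: 5 - (3 + 1 * 2) = 0
print(affine_global_score("ATATTGC", "ATAGC"))  # 0
```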
39
Scoring Insertions and Deletions
match = 1, mismatch = 0

Without a gap (Total Score: 4):
  A T G T T A T A C
T A T G T G C G T A T A

With a gap (Total Score: 8 - 3.2 = 4.8):
  A T G T - - - T A T A C
T A T G T G C G T A T A

Gap parameters:
d = 3 (gap opening)
e = 0.1 (gap extension)
g = 3 (gap length)
γ(g) = -3 - (3 - 1) 0.1 = -3.2 (insertion / deletion)
40
Modification of Gap Penalties
Score Matrix: BLOSUM62

gap opening penalty = 3, gap extension penalty = 0.1, score = 6.3
1 ...VLSPADKFLTNV 12
      ||||
1 VFTELSPAKTV

gap opening penalty = 0, gap extension penalty = 0.1, score = 11.3
1 V...LSPADKFLTNV 12
  |   |||| |  | |
1 VFTELSPA.K..T.V 11
41
Pairwise Sequence Alignment Local Alignment Semi-Global Alignment
42
Local Alignment
A local alignment between sequence s and sequence t is an alignment with maximum similarity between a substring of s and a substring of t.
T. F. Smith & M. S. Waterman, “Identification of Common Molecular Subsequences”, J. Mol. Biol., 147:
43
Why choose a local alignment algorithm?
• More meaningful: points out conserved regions between two sequences
• Allows two sequences of different lengths to be matched
• Aligns two partially overlapping sequences
• Aligns two sequences where one is a subsequence of another
44
Dynamic Programming Local Alignment
Si,j = MAXIMUM [
Si-1,j-1 + s(ai,bj) (match/mismatch in the diagonal),
Si,j-1 + w (gap in sequence #1),
Si-1,j + w (gap in sequence #2),
0 ]
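The recurrence above, together with the initialization, matrix fill and traceback steps, can be sketched as follows. The function name `smith_waterman` and the default scores (match 1, mismatch -1, gap w = -2) are assumptions for illustration only.

```python
def smith_waterman(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Local alignment; returns (best_score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # Initialization: first row and first column stay at 0.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    # Matrix fill: the extra 0 lets an alignment restart anywhere.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j - 1] + sub,   # match/mismatch on the diagonal
                          S[i][j - 1] + gap,       # gap in sequence a
                          S[i - 1][j] + gap)       # gap in sequence b
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    # Traceback: start at the highest-scoring cell and stop at the first 0.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTACGTAAA", "CCACGTGG"))  # (4, 'ACGT', 'ACGT')
```

Only the shared segment ACGT is reported; the dissimilar flanks are left out, which is the point of a local alignment.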
45
Initialization Step
46
Matrix Fill Step
47
Traceback Step
48
Traceback Step
49
Traceback Step
50
An Introduction To Multiple Sequence Alignment (MSA)
51
Topics To Be Discussed
• Motivation for MSA
• What is MSA
• Extension of Dynamic Programming
• The STAR Method
• Progressive Alignment
• Scoring Multiple Alignments
54
Multiple Alignment versus Pairwise Alignment
Up until now we have only tried to align two sequences.
What about more than two? And what for?
A faint similarity between two sequences becomes significant if present in many.
Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal.
55
Motivation For MSA
A natural extension of pairwise sequence alignment.
MSA gives biologists the ability to extract biologically important but perhaps widely dispersed sequence similarities that can give hints about the evolutionary history of certain sequences.
In pairwise alignment, when two sequences align, it is concluded that there is probably a functional relationship between them. For MSA, by contrast, if it is known that there is a functional similarity among a number of sequences, we can use MSA to find out where the similarity comes from.
56
What is MSA
MSA is the alignment of N sequences (protein or nucleotide) simultaneously, where N > 2.
Let Si denote a sequence. Then the global multiple sequence alignment of N > 2 sequences S = { S1, …, SN } is obtained by inserting gaps, denoted by "-", at any position, possibly the beginning or end. The new set of N sequences, denoted by S’ = { S1’, …, SN’ }, will all have the same length L.
Ovar STCVLSAYWKD-LNNYH
Bota STCVLSAYWKD-LNNYH
Susc STCVLSAYWRNELNNFH
Hosa STCMLGTY-QD-FNKFH
Rano STCMLGTY-QD-LNKFH
Sasa STCVLGKLSQE-LHKLQ
57
Interpretation of positions
Generally there are two interpretations of a position in a multiple sequence alignment:
• Evolutionary/historical
• Functional/structural
In many cases these are the same, but they may not be.
58
Multiple sequence alignment algorithm
Ideal approach to multiple sequence alignment is to extend dynamic programming. Instead of aligning two sequences (two dimensional grid) we align k sequences (k dimensional grid) Extension is relatively straightforward
59
Dynamic programming for sequence alignment
Three steps: recurrence relation, tabular computation, traceback.
Pairwise recurrence relation:
S(i,j) = max[ S(i-1, j-1) + m(i,j), S(i-1, j) + g, S(i, j-1) + g ]
m(i,j) = similarity matrix, e.g. BLOSUM
g = gap penalty
60
Aligning Three Sequences
Same strategy as aligning two sequences.
Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align.
For global alignments, go from source to sink.
61
2-D versus 3-D Alignment Cell
In 2-D, 3 edges in each unit square In 3-D, 7 edges in each unit cube
62
Architecture of 3-D Alignment Cell
(i-1,j,k-1) (i-1,j-1,k-1) (i-1,j-1,k) (i-1,j,k) (i,j,k-1) (i,j-1,k-1) (i,j,k) (i,j-1,k)
63
Multiple Alignment: Dynamic Programming
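$$s_{i,j,k} = \max \begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no indels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \text{face diagonal: one indel} \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \text{face diagonal: one indel} \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \text{face diagonal: one indel} \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \text{edge diagonal: two indels} \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \text{edge diagonal: two indels} \\
s_{i,j,k-1} + \delta(\_, \_, u_k) & \text{edge diagonal: two indels}
\end{cases}$$

δ(x, y, z) is an entry in the 3-D scoring matrix.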
64
Extending dynamic programming
Based on the extrapolation from two to three sequences, we can define the recurrence relation for any number of sequences in the same way The other steps - tabular computation and traceback - are done in the same way as for pairwise alignment
65
There are seven cases when aligning three sequences
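Each of the three sequences I, J and K contributes either a residue or a gap to a column, and at least one must contribute a residue, so there are 2^3 - 1 = 7 cases from which to choose the maximum similarity.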
66
Three sequence recurrence relation
S(i,j,k) = max[
S(i-1, j-1, k-1) + m(i,j) + m(i,k) + m(j,k),
S(i-1, j-1, k) + m(i,j) + g,
S(i-1, j, k-1) + m(i,k) + g,
S(i, j-1, k-1) + m(j,k) + g,
S(i-1, j, k) + g + g,
S(i, j-1, k) + g + g,
S(i, j, k-1) + g + g ]
m(i,j) = similarity matrix, e.g. BLOSUM
g = gap penalty
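A direct, unoptimized sketch of this three-sequence recurrence, returning only the optimal score. The function name `three_way_score` is hypothetical, and a simple match/mismatch function stands in for a similarity matrix such as BLOSUM.

```python
from itertools import product


def three_way_score(a: str, b: str, c: str, match: int = 1, mismatch: int = -1,
                    g: int = -2) -> float:
    """Optimal score of a global alignment of three sequences (seven-case recurrence)."""
    def m(x: str, y: str) -> int:
        return match if x == y else mismatch

    NEG = float("-inf")
    n1, n2, n3 = len(a), len(b), len(c)
    S = [[[NEG] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    S[0][0][0] = 0
    # Lexicographic order guarantees every predecessor is filled in first.
    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        if i == j == k == 0:
            continue
        cands = []
        if i and j and k:   # no gap in the column
            cands.append(S[i-1][j-1][k-1] + m(a[i-1], b[j-1]) + m(a[i-1], c[k-1]) + m(b[j-1], c[k-1]))
        if i and j:         # gap in c
            cands.append(S[i-1][j-1][k] + m(a[i-1], b[j-1]) + g)
        if i and k:         # gap in b
            cands.append(S[i-1][j][k-1] + m(a[i-1], c[k-1]) + g)
        if j and k:         # gap in a
            cands.append(S[i][j-1][k-1] + m(b[j-1], c[k-1]) + g)
        if i:               # gaps in b and c
            cands.append(S[i-1][j][k] + g + g)
        if j:               # gaps in a and c
            cands.append(S[i][j-1][k] + g + g)
        if k:               # gaps in a and b
            cands.append(S[i][j][k-1] + g + g)
        S[i][j][k] = max(cands)
    return S[n1][n2][n3]


print(three_way_score("ACG", "ACG", "ACG"))  # 9
```

The triple loop over all (i, j, k) makes the cubic cost of the three-sequence case visible.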
67
Dynamic programming time increases exponentially
Time taken for alignment by dynamic programming is O(n * m) for two sequences n, m characters long. Time taken for alignment by dynamic programming is O(n * m * p) for three sequences n, m, p characters long.
68
Dynamic programming time increases exponentially
Clearly, for N sequences, with sequence i being $L_i$ characters long, the time required will be $O\left(\prod_{i=1}^{N} L_i\right)$.
This is exponential in N: $O(L^N)$ when every sequence has length L.
We need to fill out each ‘box’ in the grid.
69
Pairwise Dynamic Programming Comparing Similar Sequences
Faster algorithm for aligning similar sequences. If two sequences are similar, the best alignments have their paths near the main diagonal of the dynamic programming matrix. To compute the optimal score and alignment, it is not necessary to fill in the entire matrix. A narrow band around the main diagonal should suffice
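The band idea can be sketched as follows: only cells with |i - j| ≤ k are filled, and everything outside the band is treated as unreachable. The function name `banded_global_score` is hypothetical; the default scores are Match = 5, Mismatch = -4 and Gap w = -7. For readability this sketch still allocates the full matrix, whereas a real implementation would store only the band.

```python
def banded_global_score(a: str, b: str, k: int, match: int = 5, mismatch: int = -4,
                        gap: int = -7) -> float:
    """Global alignment score restricted to the band |i - j| <= k around the diagonal."""
    n, m = len(a), len(b)
    if abs(n - m) > k:
        raise ValueError("band too narrow: the sink (n, m) lies outside it")
    NEG = float("-inf")
    F = [[NEG] * (m + 1) for _ in range(n + 1)]   # cells outside the band stay at -inf
    F[0][0] = 0
    for i in range(n + 1):
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:
                sub = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, F[i - 1][j - 1] + sub)
            if i > 0:
                best = max(best, F[i - 1][j] + gap)
            if j > 0:
                best = max(best, F[i][j - 1] + gap)
            F[i][j] = best
    return F[n][m]


print(banded_global_score("ACGT", "AGT", k=2))  # 8
```

If the true optimal path leaves the band, the result is only a lower bound on the optimal score, so k has to be chosen large enough for the sequences at hand.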
70
Global Alignment: Comparing Similar Sequences Match = 5, Mismatch = -4, Gap w= -7, K=2
72
Heuristic multiple sequence alignment
Currently, most practical methods are hierarchical methods.
For example: pairwise alignments, then definition of a hierarchy, followed by progressive addition of sequences to the alignment.
73
Multiple Alignment Induces Pairwise Alignments
Every multiple alignment induces pairwise alignments:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C
y: ACGC-GAG

x: AC-GCGG-C
z: GCCGC-GAG

y: AC-GCGAG
z: GCCGCGAG
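Inducing the pairwise alignments amounts to taking two rows and dropping the columns where both have a gap. A minimal sketch, with the function name `induced_pairwise` chosen for illustration:

```python
from itertools import combinations


def induced_pairwise(msa: dict, gap: str = "-") -> dict:
    """Map each pair of sequence names to the pairwise alignment induced by the MSA."""
    result = {}
    for (name1, row1), (name2, row2) in combinations(msa.items(), 2):
        # Keep every column except those where both rows hold a gap.
        cols = [(x, y) for x, y in zip(row1, row2) if not (x == gap and y == gap)]
        result[(name1, name2)] = ("".join(x for x, _ in cols),
                                  "".join(y for _, y in cols))
    return result


msa = {"x": "AC-GCGG-C", "y": "AC-GC-GAG", "z": "GCCGC-GAG"}
for pair, rows in induced_pairwise(msa).items():
    print(pair, rows)
```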
75
Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments
Given 3 arbitrary pairwise alignments:
x: ACGCTGG-C
y: ACGC--GAG

x: AC-GCTGG-C
z: GCCGCA-GAG

y: AC-GC-GAG
z: GCCGCAGAG
can we construct a multiple alignment that induces them?
NOT ALWAYS. Pairwise alignments may be inconsistent.
76
Inferring Multiple Alignment from Pairwise Alignments
From an optimal multiple alignment, we can infer pairwise alignments between all pairs of sequences, but they are not necessarily optimal.
It is difficult to infer a "good" multiple alignment from optimal pairwise alignments between all sequences.
77
Combining Optimal Pairwise Alignments into Multiple Alignment
In some cases the optimal pairwise alignments can be combined into a multiple alignment.
In other cases they cannot be combined into a multiple alignment.
78
The STAR Alignment Method
Using a pairwise alignment method (DP, etc.), find the sequence that is most similar to all the other sequences.
Using this “best” sequence as the center (of a star, hence the name), align the other sequences following the "once a gap, always a gap" rule.
For example, consider the following set of sequences:
S1 A T T G C C A T T
S2 A T G G C C A T T
S3 A T C C A A T T T T
S4 A T C T T C T T
S5 A C T G A C C
79
STAR Alignment - 2
Now consider the following similarity matrix for the pairwise comparison of the sequences.
(table: pairwise similarity scores sim(Si, Sj) for S1 to S5, with the sum over j ≠ i for each sequence in the last column)
For this example S1 is the center of the STAR.
80
STAR Alignment - 3
Next we get the best alignment between S1 and each of the other sequences:
S1 | A T T G C C A T T
S2 | A T G G C C A T T

S1 | A T T G C C A T T - -
S3 | A T C - C A A T T T T

S1 | A T T G C C A T T
S4 | A T C T T C - T T

S1 | A T T G C C A T T
S5 | A C T G A C C - -
81
STAR Alignment - 4
Next, to build the MSA, we start with S1 and S2:
A T T G C C A T T
A T G G C C A T T
Adding S3 using "once a gap, always a gap":
A T T G C C A T T - -
A T G G C C A T T - -
A T C - C A A T T T T
Continuing in this fashion we obtain the MSA of all the sequences.
82
STAR Alignment - 5
A T T G C C A T T - -
A T G G C C A T T - -
A T C - C A A T T T T
A T C T T C - T T - -
A C T G A C C - - - -
Using the STAR method, the time complexity is dominated by computing the pairwise alignments. For N sequences there are O(N^2) pairs, and we take each pairwise alignment to need L^2 time, where L is the length of each sequence.
83
STAR Alignment - 6
Thus the time complexity for computing all pairwise alignments will be O(N^2 L^2).
We still have to consider the time it takes to merge the sequences into an MSA. If Lmax is the upper bound of the alignment length, then it will take N^2 Lmax time to merge the sequences into an MSA.
Thus the time complexity for STAR is O(N^2 L^2 + N^2 Lmax).
For large N and L this is less than the time complexity for SP, which is O((2L)^N N^2).
Recall that SP is optimal whereas STAR is not, so there is a trade-off between optimization and practicality.
85
Profile Representation of Multiple Alignment
- A G G C T A T C A C C T G
T A G - C T A C C A - - - G
C A G - C T A C C A - - - G
C A G - C T A T C A C - G G
C A G - C T A T C G C - G G
(profile table: frequency of A, C, G and T in each column of the alignment)
In the past we were aligning a sequence against a sequence.
Can we align a sequence against a profile?
Can we align a profile against a profile?
86
Aligning alignments
Given two alignments, can we align them?
Alignment 1:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
Alignment 2:
w GGACGTACC--
v GGACCT-----
87
Aligning alignments
Given two alignments, can we align them?
Hint: use the alignment of the corresponding profiles.
Combined alignment:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----
88
Multiple Alignment: Greedy Approach
Choose the most similar pair of strings and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.
This is a heuristic greedy method.
k sequences:
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG
k-1 sequences/profiles:
u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…
89
Greedy Approach: Example
Consider these 4 sequences:
s1 GATTCA
s2 GTCTGA
s3 GATATT
s4 GTCAGC
90
Greedy Approach: Example (cont’d)
With 4 sequences there are 6 possible pairwise alignments:
s2 GTCTGA
s4 GTCAGC (score = 2)

s1 GAT-TCA
s2 G-TCTGA (score = 1)

s1 GAT-TCA
s3 GATAT-T (score = 1)

s1 GATTCA--
s4 G-T-CAGC (score = 0)

s2 G-TCTGA
s3 GATAT-T (score = -1)

s3 GAT-ATT
s4 G-TCAGC (score = -1)
91
Greedy Approach: Example (cont’d)
s2 and s4 are closest; combine:
s2 GTCTGA
s4 GTCAGC
s2,4 GTCt/aGa/c (profile)
New set of 3 sequences:
s1 GATTCA
s3 GATATT
s2,4 GTCt/aGa/c
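Combining two aligned sequences into a profile can be sketched as a per-column frequency count. The function names `build_profile` and `consensus_string` are hypothetical, and the lower-case "t/a" notation simply mirrors the slide.

```python
from collections import Counter


def build_profile(rows):
    """Per-column letter frequencies of a set of equally long aligned rows."""
    if len(set(map(len, rows))) != 1:
        raise ValueError("all rows must have the same length")
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({ch: cnt / len(rows) for ch, cnt in counts.items()})
    return profile


def consensus_string(profile):
    """Render a profile in the slide's notation, e.g. GTCt/aGa/c."""
    parts = []
    for col in profile:
        if len(col) == 1:
            parts.append(next(iter(col)))                     # conserved column
        else:
            parts.append("/".join(ch.lower() for ch in col))  # variable column
    return "".join(parts)


profile = build_profile(["GTCTGA", "GTCAGC"])
print(consensus_string(profile))  # GTCt/aGa/c
```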
92
Progressive Alignment
Progressive alignment is a variation of the greedy algorithm with a somewhat more intelligent strategy for choosing the order of alignments.
• Progressive alignment works well for close sequences, but deteriorates for distant sequences
• Gaps in the consensus string are permanent
• Profiles are used to compare sequences
93
ClustalW
A popular multiple alignment tool today.
‘W’ stands for ‘weighted’ (different parts of the alignment are weighted differently).
Three-step process:
1.) Construct pairwise alignments
2.) Build guide tree
3.) Progressive alignment guided by the tree
94
The CLUSTALW Algorithm
Step 1: Determine all pairwise alignments between sequences and determine the degree of similarity between each pair.
Step 2: Construct a similarity tree *.
Step 3: Combine the alignments, starting from the most closely related groups and moving to the most distantly related groups. As in STAR, we use the "once a gap, always a gap" rule.
* The PILEUP program is similar to CLUSTALW but uses a different method for producing the similarity tree.
95
Heuristic Multiple Alignment Methods
96
Clustal W progressive multiple alignment
• Align two sequences to each other
• Align a sequence to an existing alignment
• Align two alignments to each other
98
Multiple Alignments: Scoring
As in the pairwise case, not all MSAs are equally good. We need a scoring method to determine when one MSA is better than another:
• Number of matches (multiple longest common subsequence score)
• Entropy score
• Sum of pairs (SP-Score)
99
Multiple LCS Score
A column is a “match” if all the letters in the column are the same.
AAA
AAA
AAT
ATC
Only good for very similar sequences.
100
Entropy
Define frequencies for the occurrence of each letter in each column of the multiple alignment:
AAA
AAA
AAT
ATC
pA = 1, pT = pG = pC = 0 (1st column)
pA = 0.75, pT = 0.25, pG = pC = 0 (2nd column)
pA = 0.50, pT = 0.25, pC = 0.25, pG = 0 (3rd column)
Compute the entropy of each column.
101
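Entropy: Example
Best case: all letters in the column are identical, so the entropy is 0.
Worst case: A, C, G and T each occur with frequency 1/4, so the entropy is -4 * (1/4) * log2(1/4) = 2.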
102
Multiple Alignment: Entropy Score
Entropy for a multiple alignment is the sum of the entropies of its columns:
$$-\sum_{\text{columns}} \; \sum_{X \in \{A, T, G, C\}} p_X \log p_X$$
103
Entropy of an Alignment: Example
Column entropy: -( pA log pA + pC log pC + pG log pG + pT log pT ), with logarithms to base 2 and terms in the order A, C, G, T.
Column 1 = -[1*log(1) + 0*log0 + 0*log0 + 0*log0] = 0
Column 2 = -[(1/4)*log(1/4) + (3/4)*log(3/4) + 0*log0 + 0*log0] = -[(1/4)*(-2) + (3/4)*(-.415)] = +0.811
Column 3 = -[(1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4)] = 4 * -[(1/4)*(-2)] = +2.0
Alignment Entropy = 0 + 0.811 + 2.0 = +2.811
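The column-by-column calculation can be checked with a short sketch. The four rows used here (AAA, ACC, ACG, ACT) are an assumption: they are one alignment consistent with the column frequencies in the worked example, not rows given on the slide. The function names are likewise illustrative.

```python
from collections import Counter
from math import log2


def column_entropy(column) -> float:
    """Entropy -sum(p_X * log2 p_X) over the letters present in one column."""
    n = len(column)
    return -sum((cnt / n) * log2(cnt / n) for cnt in Counter(column).values())


def alignment_entropy(rows) -> float:
    """Sum of the column entropies of a multiple alignment."""
    return sum(column_entropy(col) for col in zip(*rows))


rows = ["AAA", "ACC", "ACG", "ACT"]   # assumed rows matching the example frequencies
print([round(column_entropy(col), 3) for col in zip(*rows)])
print(round(alignment_entropy(rows), 3))
```

Letters with frequency 0 are simply skipped, which matches the usual convention that 0 * log 0 counts as 0.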
104
Sum of Pairs Score(SP-Score)
Consider the pairwise alignment of sequences ai and aj imposed by a multiple alignment of k sequences.
Denote the score of this (not necessarily optimal) pairwise alignment as s*(ai, aj).
Sum up the pairwise scores for a multiple alignment:
s(a1,…,ak) = Σ_{i<j} s*(ai, aj)
105
Computing SP-Score Aligning 4 sequences: 6 pairwise alignments
Given a1, a2, a3, a4:
s(a1…a4) = Σ s*(ai,aj) = s*(a1,a2) + s*(a1,a3) + s*(a1,a4) + s*(a2,a3) + s*(a2,a4) + s*(a3,a4)
106
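SP-Score: Example
a1 ATG-C-AAT
.  A-G-CATAT
ak ATCCCATTT
To calculate each column, sum the scores over all pairs of sequences:
Column 1 (A, A, A): three matching pairs, 1 + 1 + 1, Score = 3
Column 3 (G, G, C): one matching pair and two mismatching pairs, 1 - μ - μ, Score = 1 - 2μ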
107
SP-Score: Example
Consider aligning the following 4 protein sequences:
S1 = AQPILLLV
S2 = ALRLL
S3 = AKILLL
S4 = CPPVLILV
Next consider the following MSA matrix M:
A Q P I L L L V
A L R - L L - -
A K - I L L L -
C P P V L I L V
SP-Score: Example
Assume s(match) = 1, s(mismatch) = -1, and s(gap) = -2. Also assume s(-, -) = 0 to prevent the double counting of gaps.
Then the SP score for the 4th column of M would be:
SP(m4) = SP(I, -, I, V) = s(I,-) + s(I,I) + s(I,V) + s(-,I) + s(-,V) + s(I,V) = (-2) + 1 + (-1) + (-2) + (-2) + (-1) = -7
To find SP(M) we would find the score of each column mi and then sum all the SP(mi) scores to get the score of M.
To find the optimal score using this method we need to consider all possible MSA matrices. We say more about this later.
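The column calculation above can be reproduced with a short sketch that sums the pair scores over all pairs in a column. The function names `pair_score`, `sp_column` and `sp_score` are hypothetical; the scores are the ones assumed above (match 1, mismatch -1, gap -2, and 0 for a gap against a gap).

```python
from itertools import combinations


def pair_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2,
               gap_char: str = "-") -> int:
    """Score of one pair of symbols taken from the same alignment column."""
    if a == gap_char and b == gap_char:
        return 0                      # s(-, -) = 0 avoids double counting of gaps
    if a == gap_char or b == gap_char:
        return gap
    return match if a == b else mismatch


def sp_column(column, **scores) -> int:
    """Sum-of-pairs score of a single column."""
    return sum(pair_score(a, b, **scores) for a, b in combinations(column, 2))


def sp_score(rows, **scores) -> int:
    """Sum-of-pairs score of a whole multiple alignment (equal-length rows)."""
    return sum(sp_column(col, **scores) for col in zip(*rows))


print(sp_column("I-IV"))   # -7
print(sp_column("AAAC"))   # 0
print(sp_column("AAAA"))   # 6
```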
109
Some Problems with the SP Score
Consider column 1 of our example, i.e. A, A, A, C. For this column we get SP(m1) = SP(A,A,A,C) = 1 + 1 + 1 + (-1) + (-1) + (-1) = 0, whereas if we had A, A, A, A we would get SP(A,A,A,A) = 1 + 1 + 1 + 1 + 1 + 1 = 6. Thus we get a difference of 6 for what could be explained by a single mutation.
The SP method tends to overweight the influence of mutations.
The major problem with the SP method is that finding the optimal MSA is very time consuming.