Published by Augusta Andrews. Modified over 9 years ago.
1
Sequence Similarity
2
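The Viterbi algorithm for alignment
Compute the following matrices (DP):
M(i, j): most likely alignment of x_1…x_i with y_1…y_j ending in state M
I(i, j): most likely alignment of x_1…x_i with y_1…y_j ending in state I
J(i, j): most likely alignment of x_1…x_i with y_1…y_j ending in state J
$$M(i, j) = \log P(x_i, y_j) + \max\{\, M(i-1, j-1) + \log(1 - 2\delta),\; I(i-1, j-1) + \log(1 - \epsilon),\; J(i-1, j-1) + \log(1 - \epsilon) \,\}$$
$$I(i, j) = \max\{\, M(i-1, j) + \log \delta,\; I(i-1, j) + \log \epsilon \,\}$$
$$J(i, j) = \max\{\, M(i, j-1) + \log \delta,\; J(i, j-1) + \log \epsilon \,\}$$
[Figure: three-state diagram. State M emits P(x_i, y_j), state I emits P(x_i), state J emits P(y_j). Transitions: M to M with log(1 − 2δ); M to I and M to J with log δ; I to I and J to J with log ε; I to M and J to M with log(1 − ε).]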
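The recurrences can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `viterbi_pair_hmm` and the `log_emit` callback are hypothetical, the path is assumed to start in state M before any symbol is emitted, all three predecessors of M(i, j) are taken at (i-1, j-1), and 0 < delta < 0.5 and 0 < eps < 1. It returns only the best log-score, without the traceback.

```python
import math

NEG_INF = float("-inf")

def viterbi_pair_hmm(x, y, log_emit, delta, eps):
    """Best log-score of aligning x with y under the three-state model (M, I, J)."""
    n, m = len(x), len(y)
    log_stay_m = math.log(1 - 2 * delta)   # M -> M
    log_close = math.log(1 - eps)          # I -> M or J -> M
    log_open = math.log(delta)             # M -> I or M -> J
    log_extend = math.log(eps)             # I -> I or J -> J

    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    J = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0  # assumption: start in M with nothing emitted yet

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                # x_i aligned to y_j
                M[i][j] = log_emit(x[i - 1], y[j - 1]) + max(
                    M[i - 1][j - 1] + log_stay_m,
                    I[i - 1][j - 1] + log_close,
                    J[i - 1][j - 1] + log_close,
                )
            if i > 0:
                # x_i aligned to a gap
                I[i][j] = max(M[i - 1][j] + log_open, I[i - 1][j] + log_extend)
            if j > 0:
                # y_j aligned to a gap
                J[i][j] = max(M[i][j - 1] + log_open, J[i][j - 1] + log_extend)

    return max(M[n][m], I[n][m], J[n][m])
```

Working in log space turns the products of probabilities into sums, which avoids underflow on long sequences.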
3
One way to view the state paths: State M
[Figure: DP grid for state M, indexed by x_1 … x_m and y_1 … y_n]
4
State I
[Figure: DP grid for state I, indexed by x_1 … x_m and y_1 … y_n]
5
State J
[Figure: DP grid for state J, indexed by x_1 … x_m and y_1 … y_n]
6
Putting it all together
States I(i, j) are connected with states I and M at (i-1, j)
States J(i, j) are connected with states J and M at (i, j-1)
States M(i, j) are connected with states M, I and J at (i-1, j-1)
Optimal solution is the best scoring path from top-left to bottom-right corner
This gives the likeliest alignment according to our HMM
[Figure: the three DP grids stacked, indexed by x_1 … x_m and y_1 … y_n]
8
Yet another way to represent this model
[Figure: chain of match states M_x1 … M_xm along Sequence X, from BEGIN to END, with insert states I_x and I_y]
We are aligning, or threading, sequence Y through sequence X
Every time y_j lands in state x_i, we get substitution score s(x_i, y_j)
Every time y_j is gapped, or some x_i is skipped, we pay a gap penalty
9
From this model, we can compute additional statistics
P(x_i ~ y_j | x, y): the probability that positions i, j align, given that sequences x and y align
$$P(x_i \sim y_j \mid x, y) = \sum_{\alpha:\ \text{alignment}} P(\alpha \mid x, y)\, \mathbf{1}(x_i \sim y_j \in \alpha)$$
We will not cover the details, but this quantity can also be calculated with DP
[Figure: the three-state diagram of states M, I, J with its log transition scores]
10
Fast database search – BLAST (Basic Local Alignment Search Tool)
Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB
Running time: O(MN); in practice, however, orders of magnitude faster than Smith-Waterman
[Figure: query words matched against the DB]
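The two steps of the main idea can be sketched as follows. This is a minimal illustration using exact word matches only; the function names `build_word_index` and `find_seeds` are hypothetical and are not part of any BLAST implementation.

```python
from collections import defaultdict

def build_word_index(query, k):
    """Step 1: dictionary mapping every length-k word of the query to its positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index

def find_seeds(query, db, k):
    """Step 2 (first half): scan the DB and report every exact word match
    as a (query_position, db_position) pair from which an alignment can start."""
    index = build_word_index(query, k)
    seeds = []
    for j in range(len(db) - k + 1):
        for i in index.get(db[j:j + k], []):
            seeds.append((i, j))
    return seeds
```

The speed-up comes from the scan touching each DB position once and doing alignment work only where a dictionary lookup succeeds.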
11
BLAST Original Version
Dictionary: all words of length k (~11 nucl.; ~4 aa)
  Alignment initiated between words of alignment score at least T (typically T = k)
Alignment: ungapped extensions until the score falls below a statistical threshold
Output: all local alignments with score > statistical threshold
[Figure: query words scanned against the DB]
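An ungapped extension from a word hit might look like the sketch below. The stopping rule here is an assumption chosen for illustration: the extension stops once the running score has dropped more than `x_drop` below the best score seen so far, which stands in for the statistical threshold mentioned on the slide. The function name and the match/mismatch scores are also hypothetical.

```python
def extend_ungapped(query, db, i, j, k, match=1, mismatch=-1, x_drop=2):
    """Extend the word hit query[i:i+k] / db[j:j+k] in both directions, no gaps.

    Returns (score, query_start, db_start, length) of the best extension found.
    """
    def score(a, b):
        return match if a == b else mismatch

    seed = sum(score(query[i + t], db[j + t]) for t in range(k))

    def best_extension(qi, dj, step):
        # Walk away from the seed, remember the best running total, and stop
        # once the total has fallen more than x_drop below that best.
        best, total, best_len, length = 0, 0, 0, 0
        while 0 <= qi < len(query) and 0 <= dj < len(db):
            total += score(query[qi], db[dj])
            length += 1
            if total > best:
                best, best_len = total, length
            elif best - total > x_drop:
                break
            qi += step
            dj += step
        return best, best_len

    right, right_len = best_extension(i + k, j + k, +1)
    left, left_len = best_extension(i - 1, j - 1, -1)
    return (seed + left + right, i - left_len, j - left_len,
            left_len + k + right_len)
```

For example, extending the hit `ACG` shared by `GGACGTAC` and `TTACGTAA` grows the match to `ACGTA` on the right and adds nothing on the left, where both neighboring positions mismatch.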
12
PSI-BLAST
Given a sequence query x, and database D:
1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to y with some minimum significance
3. Construct position specific matrix M (the profile)
   Each sequence y is given a weight so that many similar sequences cannot have much influence on a position (Henikoff & Henikoff 1994)
4. Using the matrix M, search D for more matches
5. Iterate 1–4 until convergence
13
BLAST Variants
BLASTN – genomic sequences
BLASTP – proteins
BLASTX – translated genome versus proteins
TBLASTN – proteins versus translated genomes
TBLASTX – translated genome versus translated genome
PSIBLAST – iterated BLAST search
http://www.ncbi.nlm.nih.gov/BLAST
14
Multiple Sequence Alignments
15
Protein Phylogenies Proteins evolve by both duplication and species divergence
18
Definition
Given N sequences x_1, x_2, …, x_N: insert gaps (-) in each sequence x_i, such that
  all sequences have the same length L
  the score of the global map is maximum
A faint similarity between two sequences becomes significant if present in many
Multiple alignments can help improve the pairwise alignments
19
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
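An induced pairwise alignment is obtained by taking two rows of the multiple alignment and dropping every column in which both rows have a gap. A minimal sketch, with the hypothetical function name `induced_pairwise`:

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Project a multiple alignment onto two of its rows.

    Columns where both rows hold a gap carry no information about the pair
    and are removed; every other column is kept unchanged.
    """
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)
```

Applied to rows x and y of the example, it removes the third column and leaves `ACGCGG-C` over `ACGC-GAG`.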
20
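Sum Of Pairs (cont'd)
Heuristic way to incorporate evolution tree:
[Figure: evolutionary tree with leaves Human, Mouse, Chicken, Duck]
Weighted SOP: $$S(m) = \sum_{k<l} w_{kl}\, s(m^k, m^l)$$
where m^k is the row of sequence k in the alignment m
w_kl: weight decreasing with distance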
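The weighted sum-of-pairs score can be sketched as below. The scoring values (match 1, mismatch -1, gap -2, gap against gap 0) and the function names are assumptions made for illustration; the slides do not fix a particular s or particular weights.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Hypothetical per-column score for one pair of rows."""
    if a == "-" and b == "-":
        return 0  # gap-gap columns vanish in the induced pairwise alignment
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def weighted_sum_of_pairs(rows, weights=None):
    """Sum over all row pairs k < l of w_kl times the induced pairwise score.

    weights maps (k, l) to w_kl; with weights=None every pair counts equally.
    """
    total = 0.0
    for k, l in combinations(range(len(rows)), 2):
        w = 1.0 if weights is None else weights[(k, l)]
        total += w * sum(pair_score(a, b) for a, b in zip(rows[k], rows[l]))
    return total
```

Choosing smaller weights for distant pairs keeps a group of near-identical sequences from dominating the total.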
21
A Profile Representation
Given a multiple alignment M = m_1…m_n
Replace each column m_i with profile entry p_i
  Frequency of each letter in the alphabet
  # gaps
  Optional: # gap openings, extensions, closings
Can think of this as a "likelihood" of each letter in each position

Alignment:
- A G G C T A T C A C C T G
T A G - C T A C C A - - - G
C A G - C T A C C A - - - G
C A G - C T A T C A C - G G
C A G - C T A T C G C - G G

Profile (column by column):
col   1   2   3   4   5   6   7   8   9  10  11  12  13  14
A     0   1   0   0   0   0   1   0   0  .8   0   0   0   0
C    .6   0   0   0   1   0   0  .4   1   0  .6  .2   0   0
G     0   0   1  .2   0   0   0   0   0  .2   0   0  .4   1
T    .2   0   0   0   0   1   0  .6   0   0   0   0  .2   0
-    .2   0   0  .8   0   0   0   0   0   0  .4  .8  .4   0
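Building the frequency part of a profile is a column-wise count. The sketch below assumes a DNA alphabet plus the gap symbol, and the function name `build_profile` is hypothetical; the optional gap-opening, extension and closing counts are left out.

```python
from collections import Counter

def build_profile(rows, alphabet="ACGT-"):
    """Return one dict per alignment column mapping each symbol to its frequency."""
    n = len(rows)
    profile = []
    for column in zip(*rows):      # iterate over columns of the alignment
        counts = Counter(column)
        profile.append({symbol: counts[symbol] / n for symbol in alphabet})
    return profile

alignment = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
profile = build_profile(alignment)
```

Here `profile[0]` holds the entry for the first column, in which C appears in three of the five rows.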
22
Multiple Sequence Alignments Algorithms
23
Multidimensional DP
Generalization of Needleman-Wunsch:
$$S(m) = \sum_i S(m_i)$$ (sum of column scores)
F(i_1, i_2, …, i_N): optimal alignment up to (i_1, …, i_N)
$$F(i_1, i_2, \ldots, i_N) = \max_{\text{all neighbors of cube}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$$
24
Multidimensional DP
Example: in 3D (three sequences): 7 neighbors/cell
$$F(i,j,k) = \max \begin{cases}
F(i-1,j-1,k-1) + S(x_i, x_j, x_k) \\
F(i-1,j-1,k) + S(x_i, x_j, -) \\
F(i-1,j,k-1) + S(x_i, -, x_k) \\
F(i-1,j,k) + S(x_i, -, -) \\
F(i,j-1,k-1) + S(-, x_j, x_k) \\
F(i,j-1,k) + S(-, x_j, -) \\
F(i,j,k-1) + S(-, -, x_k)
\end{cases}$$
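The three-sequence recurrence can be sketched directly, enumerating the 7 neighbors as the non-zero 0/1 move vectors. The column score used here is a sum of pairs with hypothetical values (match 1, mismatch -1, gap -2, gap against gap 0); the function names are likewise assumptions. Only the optimal score is returned.

```python
from itertools import combinations, product

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Hypothetical pairwise score used to build the column score S."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def align3_score(x, y, z):
    """Optimal sum-of-pairs score of a global alignment of three sequences."""
    seqs = (x, y, z)
    # Each move says which sequences consume a letter: 2^3 - 1 = 7 options.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    # Lexicographic order guarantees every predecessor is filled in first.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell == (0, 0, 0):
            continue
        best = float("-inf")
        for move in moves:
            prev = tuple(c - m for c, m in zip(cell, move))
            if min(prev) < 0:
                continue  # this neighbor lies outside the cube
            column = [s[c - 1] if m else "-" for s, c, m in zip(seqs, cell, move)]
            score = sum(pair_score(a, b) for a, b in combinations(column, 2))
            best = max(best, F[prev] + score)
        F[cell] = best
    return F[tuple(len(s) for s in seqs)]
```

The table has one entry per cell of the cube, so memory as well as time grows with the product of the sequence lengths.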
25
Multidimensional DP
Running Time:
1. Size of matrix: L^N, where L = length of each sequence and N = number of sequences
2. Neighbors/cell: 2^N − 1
Therefore: O(2^N L^N)
How do gap states generalize? VERY badly!
Require 2^N states, one per combination of gapped/ungapped sequences
Running time: O(2^N · 2^N · L^N) = O(4^N L^N)
[Figure: states for three sequences X, Y, Z, labeled by the combination of sequences that are ungapped]
27
Progressive Alignment
When evolutionary tree is known:
  Align closest first, in the order of the tree
  In each step, align two sequences x, y, or profiles p_x, p_y, to generate a new alignment with associated profile p_result
Weighted version:
  Tree edges have weights, proportional to the divergence in that edge
  New profile is a weighted average of two old profiles
[Figure: tree with leaves x, y, z, w; x and y merge into p_xy, z and w into p_zw, and both into p_xyzw]
Example
Profile: (A, C, G, T, -)
p_x = (0.8, 0.2, 0, 0, 0)
p_y = (0.6, 0, 0, 0, 0.4)
s(p_x, p_y) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)
Result: p_xy = (0.7, 0.1, 0, 0, 0.2)
s(p_x, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)
Result: p_x- = (0.4, 0.1, 0, 0, 0.5)
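The profile example can be reproduced with a short sketch. The substitution function `s` below uses hypothetical values (match 1, mismatch -1, gap -2), and `profile_score` and `merge_profiles` are illustrative names; equal weights of 0.5 reproduce the unweighted averages shown in the example.

```python
ALPHABET = "ACGT-"

def s(a, b, match=1, mismatch=-1, gap=-2):
    """Hypothetical symbol-level score; any substitution matrix could be used."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def profile_score(p, q):
    """Expected score of aligning two profile columns: sum of p_a * q_b * s(a, b)."""
    return sum(p[i] * q[j] * s(a, b)
               for i, a in enumerate(ALPHABET)
               for j, b in enumerate(ALPHABET))

def merge_profiles(p, q, w_p=0.5, w_q=0.5):
    """New profile column as a weighted average of the two old ones."""
    return tuple(w_p * a + w_q * b for a, b in zip(p, q))

p_x = (0.8, 0.2, 0.0, 0.0, 0.0)
p_y = (0.6, 0.0, 0.0, 0.0, 0.4)
gap_column = (0.0, 0.0, 0.0, 0.0, 1.0)  # aligning p_x against a gap

p_xy = merge_profiles(p_x, p_y)
p_x_gap = merge_profiles(p_x, gap_column)
```

In the weighted version, `w_p` and `w_q` would be derived from the tree edge weights rather than fixed at 0.5.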
29
Progressive Alignment
When evolutionary tree is unknown:
  Perform all pairwise alignments
  Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment
  Construct a tree
  Align on the tree
[Figure: sequences x, y, z, w with an unknown tree]
30
Heuristics to improve alignments
  Iterative refinement schemes
  A*-based search
  Consistency
  Simulated Annealing
  …
31
Iterative Refinement
One problem of progressive alignment: initial alignments are "frozen" even when new evidence comes
Example:
x: GAAGTT
y: GAC-TT
z: GAACTG
w: GTACTG
Frozen! It is now clear that the correct y = GA-CTT
32
Iterative Refinement
Algorithm (Barton-Sternberg):
1. For j = 1 to N, remove x_j, and realign to x_1…x_{j-1} x_{j+1}…x_N
2. Repeat step 1 until convergence
[Figure: sequences x, y, z; the projection of x and z is fixed while y is allowed to vary]
33
Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA
After realigning y:
x: GAAGTTA
y: G-ACTTA    (+ 3 matches)
z: GAACTGA
w: GTACTGA
Variant: refinement on a tree ("tree partitioning")
35
Iterative Refinement
Example not handled well:
x: GAAGTTA
y_1: GAC-TTA
y_2: GAC-TTA
y_3: GAC-TTA
z: GAACTGA
w: GTACTGA
Realigning any single y_i changes nothing
36
Some Resources
BLAST & PSI-BLAST: http://www.ncbi.nlm.nih.gov/BLAST
CLUSTALW (most widely used): http://www.ebi.ac.uk/clustalw/
MUSCLE (most scalable): http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
PROBCONS (most accurate): http://probcons.stanford.edu/
37
MUSCLE at a glance
1. Fast measurement of all pairwise distances between sequences
   D_DRAFT(x, y) defined in terms of # common k-mers (k~3); O(N^2 L log L) time
2. Build tree T_DRAFT based on D_DRAFT, with a hierarchical clustering method (UPGMA)
3. Progressive alignment over T_DRAFT, resulting in multiple alignment M_DRAFT
4. Measure distances D(x, y) based on M_DRAFT
5. Build tree T based on D
6. Progressive alignment over T, to build M
7. Iterative refinement; for many rounds, do:
   Tree partitioning: split M on one branch and realign the two resulting profiles
   If new alignment M' has better sum-of-pairs score than previous one, accept
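Step 1 relies on a distance that needs no alignment at all. The slide says only that D_DRAFT is defined in terms of the number of common k-mers, so the exact formula below is an assumption made for illustration: one minus the fraction of k-mers shared, normalized by the k-mer count of the shorter sequence. The function names are hypothetical, and both sequences are assumed to have at least k letters.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Multiset of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(x, y, k=3):
    """Alignment-free draft distance in [0, 1]; 0 means all k-mers are shared."""
    counts_x, counts_y = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum((counts_x & counts_y).values())  # multiset intersection
    shortest = min(len(x), len(y)) - k + 1        # k-mers in the shorter sequence
    return 1.0 - shared / shortest
```

Because each sequence is reduced to a table of counts, all N(N-1)/2 pairs can be compared far faster than by running pairwise alignments.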
38
PROBCONS: Probabilistic Consistency-based Multiple Alignment of Proteins
[Figure: pair-HMM with three states: MATCH emits the pair (x_i, y_j), INSERT X emits (x_i, -), INSERT Y emits (-, y_j)]
39
A pair-HMM model of pairwise alignment
Parameterizes a probability distribution, P(A), over all possible alignments of all possible pairs of sequences
Transition probabilities ~ gap penalties
Emission probabilities ~ substitution matrix
Example:
x: ABRACA-DABRA
y: AB-ACARDI---
[Figure: pair-HMM with states MATCH (emits x_i, y_j), INSERT X (emits x_i, -), INSERT Y (emits -, y_j)]
40
Computing Pairwise Alignments
The Viterbi algorithm
  The conditional distribution P(α | x, y) reflects the model's uncertainty over the "correct" alignment of x and y
  Viterbi identifies the highest-probability alignment, α_viterbi, in O(L^2) time
Caveat: the most likely alignment is not necessarily the most accurate
Alternative: find the alignment of maximum expected accuracy
[Figure: the distributions P(α) and P(α | x, y), with α_viterbi marked]
41
The Lazy-Teacher Analogy
10 students take a 10-question true-false quiz
How do you make the answer key?
Approach #1: Use the answer sheet of the best student!
Approach #2: Weighted majority vote!
[Figure: the students' answer sheets, marked with letter grades from A to C, showing conflicting answers (T or F) to question 4]
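The two approaches can be contrasted in a small sketch. The data below is made up for illustration (three students, three questions, grades expressed as numeric weights), and the function names are hypothetical.

```python
def best_student_key(answers, grades):
    """Approach #1: copy the answer sheet of the highest-graded student."""
    best = max(range(len(answers)), key=lambda student: grades[student])
    return list(answers[best])

def weighted_vote_key(answers, grades):
    """Approach #2: per question, take the answer backed by the most total grade."""
    key = []
    for q in range(len(answers[0])):
        weight_true = sum(g for a, g in zip(answers, grades) if a[q] == "T")
        weight_false = sum(g for a, g in zip(answers, grades) if a[q] == "F")
        key.append("T" if weight_true >= weight_false else "F")
    return key

# Hypothetical answer sheets and grades (higher is better).
answers = ["TTF", "TFT", "FFT"]
grades = [0.9, 0.8, 0.8]

key_best = best_student_key(answers, grades)
key_vote = weighted_vote_key(answers, grades)
```

On this data the two keys disagree on the last two questions, where the two weaker students together outweigh the best one.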
42
Viterbi vs. Maximum Expected Accuracy (MEA)
Viterbi
  picks the single alignment with the highest chance of being completely correct
  mathematically, finds the alignment α that maximizes $E_{\alpha^*}[\mathbf{1}\{\alpha = \alpha^*\}]$
Maximum Expected Accuracy
  picks the alignment with the highest expected number of correct predictions
  mathematically, finds the alignment α that maximizes $E_{\alpha^*}[\mathrm{accuracy}(\alpha, \alpha^*)]$
[Figure: the students' answer sheets from the lazy-teacher analogy, with answers to question 4]