Published by Phoebe Norton. Modified over 8 years ago.
1
Sequence Similarity
2
PROBCONS: Probabilistic Consistency-based Multiple Alignment of Proteins
[Figure: pair-HMM with three states: MATCH (emits $x_i$ and $y_j$), INSERT X (emits $x_i$ against a gap), INSERT Y (emits $y_j$ against a gap)]
3
A pair-HMM model of pairwise alignment
Parameterizes a probability distribution, P(A), over all possible alignments of all possible pairs of sequences
Transition probabilities ~ gap penalties
Emission probabilities ~ substitution matrix (from BLOSUM)
Example alignment:
x: ABRACA-DABRA
y: AB-ACARDI---
[Figure: pair-HMM states MATCH, INSERT X, INSERT Y]
4
Computing Pairwise Alignments
The conditional distribution $P(\alpha \mid x, y)$ reflects the model's uncertainty over the "correct" alignment of x and y
The Viterbi algorithm identifies the highest-probability alignment, $\alpha_{\mathrm{viterbi}}$, in $O(L^2)$ time
Caveat: the most likely alignment is not necessarily the most accurate
Alternative: find the alignment of maximum expected accuracy
[Figure labels: $P(\alpha)$, $P(\alpha \mid x, y)$, $\alpha_{\mathrm{viterbi}}$]
5
The Lazy-Teacher Analogy
10 students take a 10-question true-false quiz
How do you make the answer key?
Approach #1: Use the answer sheet of the best student!
Approach #2: Weighted majority vote!
[Figure: students' answer sheets labeled with grades from A to C; sample answers to question 4: F, T, F, T]
6
Viterbi vs. Maximum Expected Accuracy (MEA)
Viterbi
  picks the single alignment with the highest chance of being completely correct
  mathematically, finds the alignment $\alpha$ that maximizes $E_{\alpha^*}[\mathbf{1}\{\alpha = \alpha^*\}]$
Maximum Expected Accuracy
  picks the alignment with the highest expected number of correct predictions
  mathematically, finds the alignment $\alpha$ that maximizes $E_{\alpha^*}[\mathrm{accuracy}(\alpha, \alpha^*)]$
[Figure: students' answer sheets with grades, and the resulting answer key entry "4. T"]
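The two approaches from the lazy-teacher analogy can be sketched in a few lines. This is only an illustration: the function names, the three toy answer sheets, and the integer weights are assumptions, not part of the original slides. It shows that the weighted vote (the analogue of MEA) can produce an answer key that matches no single student's sheet.

```python
from collections import Counter

def best_student_key(sheets, weights):
    """Approach #1 (Viterbi analogue): copy the sheet of the top-weighted student."""
    best = max(range(len(sheets)), key=lambda s: weights[s])
    return list(sheets[best])

def weighted_vote_key(sheets, weights):
    """Approach #2 (MEA analogue): per question, take the answer with most total weight."""
    key = []
    for q in range(len(sheets[0])):
        tally = Counter()
        for sheet, w in zip(sheets, weights):
            tally[sheet[q]] += w
        key.append(max(tally, key=tally.get))
    return key

# Hypothetical toy data: three students, three questions, weights as trust levels.
sheets = ["TTF", "TFT", "FTT"]
weights = [4, 3, 3]
print(best_student_key(sheets, weights))
print(weighted_vote_key(sheets, weights))
```

On this toy input, the vote disagrees with the best student on the last question because the two weaker students together outweigh the best one.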
7
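Computing MEA alignments
Define $\mathrm{accuracy}(\alpha, \alpha^*) = \dfrac{\text{\# of correct predicted matches}}{\text{length of shorter sequence}}$
$E_{\alpha^*}[\mathrm{accuracy}(\alpha, \alpha^*) \mid x, y] \propto E_{\alpha^*}\Big[\sum_{(x_i, y_j) \in \alpha} \mathbf{1}\{(x_i, y_j) \in \alpha^*\} \,\Big|\, x, y\Big]$
$= \sum_{\alpha'} P(\alpha' \mid x, y) \sum_{(x_i, y_j) \in \alpha} \mathbf{1}\{(x_i, y_j) \in \alpha'\}$
$= \sum_{(x_i, y_j) \in \alpha} \sum_{\alpha'} P(\alpha' \mid x, y)\, \mathbf{1}\{(x_i, y_j) \in \alpha'\}$
$= \sum_{(x_i, y_j) \in \alpha} P((x_i, y_j) \in \alpha' \mid x, y)$
Define $M[i, j]$ = posterior probability that $x_i$ is aligned to $y_j$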
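One way to obtain the posterior matrix M[i, j] is a forward pass and a backward pass over the pair-HMM. The sketch below is a minimal illustration under stated assumptions: a three-state model (match, insert-x, insert-y) with no explicit end state and no direct transitions between the two insert states, hypothetical gap-open and gap-extend probabilities `delta` and `eps`, and toy emission functions over a 4-letter alphabet. None of these values come from the slides or from the PROBCONS parameters.

```python
def match_posteriors(x, y, p_match, q_insert, delta=0.1, eps=0.3):
    """Return post[i][j] = P(x[i] aligned to y[j] | x, y), using 0-based indices."""
    n, m = len(x), len(y)
    zeros = lambda: [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward pass: f?[i][j] = probability of emitting x[:i], y[:j] and ending in that state.
    fM, fX, fY = zeros(), zeros(), zeros()
    fM[0][0] = 1.0  # assumption: start in the match state before any emission
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = p_match(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta) * fM[i - 1][j - 1]
                    + (1 - eps) * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = q_insert(x[i - 1]) * (delta * fM[i - 1][j] + eps * fX[i - 1][j])
            if j > 0:
                fY[i][j] = q_insert(y[j - 1]) * (delta * fM[i][j - 1] + eps * fY[i][j - 1])
    total = fM[n][m] + fX[n][m] + fY[n][m]

    # Backward pass: b?[i][j] = probability of emitting the remaining suffixes from that state.
    bM, bX, bY = zeros(), zeros(), zeros()
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                bM[i][j] = bX[i][j] = bY[i][j] = 1.0
                continue
            to_m = p_match(x[i], y[j]) * bM[i + 1][j + 1] if i < n and j < m else 0.0
            to_x = q_insert(x[i]) * bX[i + 1][j] if i < n else 0.0
            to_y = q_insert(y[j]) * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = (1 - 2 * delta) * to_m + delta * (to_x + to_y)
            bX[i][j] = (1 - eps) * to_m + eps * to_x
            bY[i][j] = (1 - eps) * to_m + eps * to_y

    # Posterior: mass of all paths that pass through the match state at (i, j).
    return [[fM[i][j] * bM[i][j] / total for j in range(1, m + 1)]
            for i in range(1, n + 1)]

# Hypothetical toy emissions for a 4-letter alphabet (each sums to 1).
toy_match = lambda a, b: 0.19 if a == b else 0.02
toy_insert = lambda a: 0.25

post = match_posteriors("ACGT", "ACGT", toy_match, toy_insert)
```

Both passes fill three (n+1) x (m+1) tables, so the cost is quadratic in the sequence length.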
8
Computing MEA alignments
Define $\mathrm{accuracy}(\alpha, \alpha^*) = \dfrac{\text{\# of correct predicted matches}}{\text{length of shorter sequence}}$
$M[i, j] = P(x_i \text{ is aligned to } y_j \mid x, y)$, the posterior probability that $x_i$ is aligned to $y_j$
Then the MEA alignment is the highest-summing path through the matrix M
M can be computed with forward and backward dynamic programming in $O(L^2)$ time
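Given a posterior matrix M, the highest-summing path can be found with a Needleman-Wunsch-style recursion that has no gap penalties. The following is a minimal sketch; the function name and the small example matrix are illustrative assumptions.

```python
def mea_alignment(M):
    """Return (score, pairs): the maximum sum of M over an order-preserving set of matches."""
    n, m = len(M), len(M[0])
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + M[i - 1][j - 1],  # align x_i with y_j
                          S[i - 1][j],                        # x_i against a gap
                          S[i][j - 1])                        # y_j against a gap
    # Traceback: recover which (i, j) pairs were matched on one optimal path.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if S[i][j] == S[i - 1][j - 1] + M[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return S[n][m], pairs

# Hypothetical 2 x 3 posterior matrix (rows: positions of x, columns: positions of y).
M = [[0.9, 0.05, 0.0],
     [0.1, 0.2, 0.7]]
score, pairs = mea_alignment(M)
```

Dividing the returned score by the length of the shorter sequence gives the expected accuracy of the chosen alignment under the definition above.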
10
The consistency signal
[Figure: three sequences x, y, and z, with positions $x_i$, $y_j$, $y_{j'}$, and $z_k$ marked]
11
To estimate $P(x_i \sim y_j \mid x, y, z)$
Method 1: triplet-HMM
$P(x_i \sim y_j \mid x, y, z) = \sum_k P(x_i \sim y_j \sim z_k \mid x, y, z)$
Parameters trained with unsupervised EM
Running time: $O(N^3 L^3)$
N: # of sequences
L: sequence length
12
Probabilistic consistency
Compute: $P(x_i \text{ is aligned to } y_j \mid x, y) \rightarrow P(x_i \text{ is aligned to } y_j \mid x, y, z)$
2 approaches:
1) Exact: triplet HMM, $O(L^3)$ time
2) Approximate: use independence assumptions
$P(x_i \sim y_j \mid x, y, z) \approx \sum_k P(x_i \sim z_k \text{ and } z_k \sim y_j \mid x, y, z)$
$= \sum_k P(x_i \sim z_k \mid x, z)\, P(z_k \sim y_j \mid x, y, z, x_i \sim z_k)$
$\approx \sum_k P(x_i \sim z_k \mid x, z)\, P(z_k \sim y_j \mid z, y)$ (assume indep.)
13
Probabilistic consistency
Compute $P(x_i \text{ is aligned to } y_j \mid x, y, z)$
$P(x_i \sim y_j \mid x, y, z) \approx \sum_k P(x_i \sim z_k \mid x, z)\, P(z_k \sim y_j \mid z, y)$
Notice that for any given i, the entries are close to 0 for most k and j, so the matrices are sparse
In matrix form: $P_{xy|z} \approx P_{xz} P_{zy}$
Finally, let $P_{xy|S} = \dfrac{1}{|S|} \sum_{z \in S} P_{xz} P_{zy}$
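The final averaging step is a sum of matrix products. Below is a minimal dense sketch in plain Python; the function names and the toy 2 x 2 matrices are assumptions, and a practical implementation would likely exploit the sparsity noted above rather than multiply full matrices. Which sequences belong to S (for example, whether x and y themselves are included) is left to the caller here.

```python
def mat_mul(A, B):
    """Dense matrix product of A (r x k) and B (k x c), as lists of lists."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][c] for k in range(inner)) for c in range(cols)]
            for row in A]

def consistency_transform(pairs):
    """pairs: one (P_xz, P_zy) tuple per z in S. Returns the average of P_xz * P_zy."""
    total = None
    for P_xz, P_zy in pairs:
        prod = mat_mul(P_xz, P_zy)
        if total is None:
            total = prod
        else:
            total = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(total, prod)]
    return [[v / len(pairs) for v in row] for row in total]

# Hypothetical posteriors through two intermediate sequences z1 and z2.
P_xz1 = [[1.0, 0.0], [0.0, 0.5]]
P_z1y = [[0.5, 0.5], [0.0, 1.0]]
P_xz2 = [[0.0, 1.0], [0.0, 0.0]]
P_z2y = [[0.0, 0.0], [1.0, 0.0]]
P_xy_S = consistency_transform([(P_xz1, P_z1y), (P_xz2, P_z2y)])
```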
15
Multiple sequence alignment
A straightforward generalization:
  sum-of-pairs
  tree-based progressive alignment
  iterative refinement
[Figure: progressive alignment of ABRACADABRA, ABACARDI, and ABRADABI, followed by iterative refinement. Example multiple alignment from the figure:
ABRACA-DABRA
AB-ACARDI---
ABRA---DABI-]
16
Summary of PROBCONS Algorithm
Given K sequences to be aligned,
(1) Compute M[i, j] for all pairs of sequences x and y
(2) Use probabilistic consistency to reestimate M[i, j]
(3) Build a tree of the sequences by connecting closest first
    "Closest" defined according to expected accuracy: EA(x, y) = E(accuracy) of the MEA alignment of x and y
(4) Perform progressive alignment along the tree
    Score of a column: sum-of-pairs M[i, j]
(5) Apply iterative refinement
17
Training/testing methodology
3 reference benchmark sets: BAliBASE, PREFAB, SABmark
PROBCONS parameters trained via unsupervised EM on unaligned sequences from BAliBASE
Quality score: $Q(\alpha, \alpha^*) = \dfrac{\text{\# of correct predicted matches}}{\text{total \# of true matches}}$
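The quality score can be computed by comparing the sets of matched position pairs in the predicted and reference alignments. This is a small sketch with assumed function names; the reference rows reuse the ABRACADABRA example from the earlier slides, and the "predicted" rows are a made-up gapless alternative for illustration.

```python
def matched_pairs(row_x, row_y):
    """Return the (i, j) residue index pairs aligned in two gapped rows of equal length."""
    pairs, i, j = [], 0, 0
    for a, b in zip(row_x, row_y):
        if a != "-" and b != "-":
            pairs.append((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs

def quality_score(predicted, reference):
    """Q = (# of correct predicted matches) / (total # of true matches)."""
    pred, ref = set(predicted), set(reference)
    return len(pred & ref) / len(ref)

reference = matched_pairs("ABRACA-DABRA", "AB-ACARDI---")
predicted = matched_pairs("ABRACADABRA", "ABACARDI---")  # hypothetical prediction
q = quality_score(predicted, reference)
```

Note that Q divides by the number of true matches, whereas the accuracy used for MEA divides by the length of the shorter sequence.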
18
Evaluation of Algorithm Components
Algorithm                           Quality (74)   Time (sec)
Viterbi                             0.375          0.72
MEA                                 0.403          1.6
PC (O(L^3))                         0.431          584.2
PC x 1 (O(L^2))                     0.422          1.7
PC x 2 (O(L^2))                     0.427          1.9
Progressive PC x 2 (O(L^2))         0.432          1.9
Progressive PC x 2 (O(L^2)) + IR    0.435          3.3
(PC: probabilistic consistency; IR: iterative refinement. Row-group labels on the slide: all-pairs pairwise; multiple)
19
Performance of different alignment tools
(Q: quality score; t: running time)
Algorithm    BAliBASE (237)      PREFAB (1932)         SABmark (698)
             Q       t           Q       t             Q       t
Align-m      0.804   19:25       -       -             0.352   56:44
DIALIGN      0.832   2:53        0.572   12:25:00      0.410   8:28
CLUSTALW     0.861   1:07        0.589   2:57:00       0.439   2:16
MAFFT        0.882   1:18        0.648   2:36:00       0.442   7:33
T-Coffee     0.883   21:31       0.636   144:51:00     0.456   59:10
MUSCLE       0.896   1:05        0.648   3:11:00       0.464   20:42
PROBCONS     0.910   5:32        0.668   19:41:00      0.505   17:20
20
Resources for alignment
Protein Multiple Aligners
CLUSTALW: most widely used (1994)
  http://www.ebi.ac.uk/clustalw/
MUSCLE: most scalable (2004)
  http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
PROBCONS: most accurate (2004)
  http://probcons.stanford.edu/
Some more protein multiple aligners: MULTALIGN, MSA, DIALIGN, DCA, MACAW, TCOFFEE, MAFFT, DSC, MUSEQUAL, TOPLIGN, SACHMO, MATCHBOX, PRRN, SAM, MAXHOM, STRAP, ALIGN, AMAS, PILEUP, etc.
ProbCons: Chuong (Tom) Do
21
Profile hidden Markov models for sequence families
23
PFAM
Protein FAMilies database of alignments
Profile HMMs describe each family
For each family in Pfam you can:
  Look at multiple alignments
  View protein domain architectures
  Examine species distribution
  Follow links to other databases
  View known protein structures
24
PFAM
Pfam-A: curated multiple alignments
  Grows slowly; quality controlled by experts
Pfam-B: automatic clustering (ProDom derived)
  New sequences instantly incorporated; unchecked
Search by: sequence, keyword, domain, taxonomy
Browsing by family or genome
Evolutionary tree
Source of seed alignments:
  Pfam-B families
  Published articles
  'Domain hunting' studies
31
Profile HMMs
Each M state has a position-specific pre-computed substitution table
Each I state has position-specific gap penalties (and in principle can have its own emission distributions)
Each D state also has position-specific gap penalties
In principle, D-D transitions can also be customized per position
[Figure: profile HMM for protein family F, with BEGIN and END states, match states $M_1 \ldots M_m$, insert states $I_0 \ldots I_m$, and delete states $D_1 \ldots D_m$]
32
Profile HMMs
transition between match states: $\alpha_{M(i)M(i+1)}$
transitions between match and insert states: $\alpha_{M(i)I(i)}$, $\alpha_{I(i)M(i+1)}$
transition within insert state: $\alpha_{I(i)I(i)}$
transitions between match and delete states: $\alpha_{M(i)D(i+1)}$, $\alpha_{D(i)M(i+1)}$
transition within delete states: $\alpha_{D(i)D(i+1)}$
emission of amino acid b at a state S: $\varepsilon_S(b)$
[Figure: profile HMM for protein family F, with BEGIN and END states, match states $M_1 \ldots M_m$, insert states $I_0 \ldots I_m$, and delete states $D_1 \ldots D_m$]
33
Profile HMMs
transition probabilities ~ frequency of a transition in the alignment
emission probabilities ~ frequency of an emission in the alignment
pseudocounts are usually introduced
[Figure: profile HMM for protein family F, with BEGIN and END states, match states $M_1 \ldots M_m$, insert states $I_0 \ldots I_m$, and delete states $D_1 \ldots D_m$]
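As an illustration of frequency-based estimation with pseudocounts, the sketch below estimates the emission distribution of one match state from a single alignment column. The add-one (Laplace) pseudocount, the function name, and the tiny three-letter alphabet are assumptions chosen for brevity; the slides do not specify which pseudocount scheme is used.

```python
from collections import Counter

def emission_probs(column, alphabet, pseudocount=1.0):
    """Estimate emission probabilities from one alignment column, ignoring gaps."""
    counts = Counter(c for c in column if c != "-")
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

# Hypothetical column from four aligned sequences over a toy alphabet.
probs = emission_probs("AAC-", "ACD")
```

With the pseudocount, the residue D gets a small non-zero probability even though it never appears in the column, so a new family member containing D at this position is not ruled out.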
34
Alignment of a protein to a profile HMM
To align sequence $x_1 \ldots x_n$ to a profile HMM, we find the most likely alignment with the Viterbi DP algorithm
Define
  $V_j^M(i)$: score of the best alignment of $x_1 \ldots x_i$ to the HMM ending in $x_i$ being emitted from $M_j$
  $V_j^I(i)$: score of the best alignment of $x_1 \ldots x_i$ to the HMM ending in $x_i$ being emitted from $I_j$
  $V_j^D(i)$: score of the best alignment of $x_1 \ldots x_i$ to the HMM ending in $D_j$ ($x_i$ is the last character emitted before $D_j$)
Denote by $q_a$ the frequency of amino acid a in a 'random' protein
35
Alignment of a protein to a profile HMM
$$V_j^M(i) = \log\frac{\varepsilon_{M(j)}(x_i)}{q_{x_i}} + \max\begin{cases} V_{j-1}^M(i-1) + \log \alpha_{M(j-1)M(j)} \\ V_{j-1}^I(i-1) + \log \alpha_{I(j-1)M(j)} \\ V_{j-1}^D(i-1) + \log \alpha_{D(j-1)M(j)} \end{cases}$$
$$V_j^I(i) = \log\frac{\varepsilon_{I(j)}(x_i)}{q_{x_i}} + \max\begin{cases} V_j^M(i-1) + \log \alpha_{M(j)I(j)} \\ V_j^I(i-1) + \log \alpha_{I(j)I(j)} \\ V_j^D(i-1) + \log \alpha_{D(j)I(j)} \end{cases}$$
$$V_j^D(i) = \max\begin{cases} V_{j-1}^M(i) + \log \alpha_{M(j-1)D(j)} \\ V_{j-1}^I(i) + \log \alpha_{I(j-1)D(j)} \\ V_{j-1}^D(i) + \log \alpha_{D(j-1)D(j)} \end{cases}$$
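The three recurrences can be filled in with a straightforward table-based routine. The sketch below is illustrative only: the parameter layout (a dictionary keyed by `(state, column, target)`), the treatment of BEGIN as match state 0, the explicit "E" target for END, and all numbers in the toy model are assumptions, not the HMMer or SAM data format.

```python
import math

NEG_INF = float("-inf")

def profile_viterbi_score(x, m, log_trans, match_logodds, insert_logodds):
    """Best log-odds score of aligning x to a profile HMM with m match states.

    log_trans[(s, j, t)]: log transition probability from state s in column j
        to state t; targets "M" and "D" lie in column j + 1, target "I" stays
        in column j, and the assumed target "E" is END. Missing keys are forbidden.
    match_logodds[j][a]  = log(e_Mj(a) / q_a) for j = 1..m
    insert_logodds[j][a] = log(e_Ij(a) / q_a) for j = 0..m
    """
    n = len(x)
    t = lambda s, j, to: log_trans.get((s, j, to), NEG_INF)
    table = lambda: [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VM, VI, VD = table(), table(), table()
    VM[0][0] = 0.0  # BEGIN is treated as match state 0 with nothing emitted
    for j in range(m + 1):
        for i in range(n + 1):
            if j >= 1 and i >= 1:
                VM[j][i] = match_logodds[j][x[i - 1]] + max(
                    VM[j - 1][i - 1] + t("M", j - 1, "M"),
                    VI[j - 1][i - 1] + t("I", j - 1, "M"),
                    VD[j - 1][i - 1] + t("D", j - 1, "M"))
            if i >= 1:
                VI[j][i] = insert_logodds[j][x[i - 1]] + max(
                    VM[j][i - 1] + t("M", j, "I"),
                    VI[j][i - 1] + t("I", j, "I"),
                    VD[j][i - 1] + t("D", j, "I"))
            if j >= 1:
                VD[j][i] = max(
                    VM[j - 1][i] + t("M", j - 1, "D"),
                    VI[j - 1][i] + t("I", j - 1, "D"),
                    VD[j - 1][i] + t("D", j - 1, "D"))
    return max(VM[m][n] + t("M", m, "E"),
               VI[m][n] + t("I", m, "E"),
               VD[m][n] + t("D", m, "E"))

# Hypothetical one-column model over the alphabet {A, B}, background q = 0.5 each.
L = math.log
toy_trans = {
    ("M", 0, "M"): L(0.8), ("M", 0, "I"): L(0.1), ("M", 0, "D"): L(0.1),
    ("I", 0, "M"): L(0.5), ("I", 0, "I"): L(0.4), ("I", 0, "D"): L(0.1),
    ("M", 1, "E"): L(0.9), ("M", 1, "I"): L(0.1),
    ("I", 1, "E"): L(0.5), ("I", 1, "I"): L(0.5),
    ("D", 1, "E"): L(0.9), ("D", 1, "I"): L(0.1),
}
toy_match = {1: {"A": L(0.9 / 0.5), "B": L(0.1 / 0.5)}}
toy_insert = {0: {"A": 0.0, "B": 0.0}, 1: {"A": 0.0, "B": 0.0}}
score = profile_viterbi_score("A", 1, toy_trans, toy_match, toy_insert)
```

The routine returns only the score; recovering the alignment itself would require storing back-pointers alongside each table entry.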
36
Weight of each sequence
One simple weighting scheme is to find how much edge length each leaf contributes
Example: edge 1 belongs to a
Example: edge 3 belongs both to a and to b: $e_3 \cdot e_1 / (e_1 + e_2)$ goes to a
$\Delta w_i = e_{\mathrm{current}} \cdot \dfrac{w_i}{\sum_{\text{leaves } k \text{ below } e_{\mathrm{current}}} w_k}$
[Figure: tree with nodes labeled a through i and edges labeled 1, 2, 3]
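The weighting rule can be applied bottom-up over the tree: each leaf starts with its own edge length, and every internal edge is split among the leaves below it in proportion to their current weights. The sketch below assumes a hypothetical tree encoding, `(leaf_name_or_list_of_children, edge_length_above_node)`, and made-up edge lengths.

```python
def leaf_weights(node):
    """Return {leaf: weight} for the subtree, including the edge above `node`."""
    content, edge = node
    if isinstance(content, str):          # leaf: its own edge belongs to it entirely
        return {content: edge}
    weights = {}
    for child in content:                 # collect weights from all subtrees first
        weights.update(leaf_weights(child))
    total = sum(weights.values())         # assumed non-zero in this sketch
    # Split the edge above this node in proportion to the current leaf weights.
    return {leaf: w + edge * w / total for leaf, w in weights.items()}

# Hypothetical tree: leaves a (edge 2) and b (edge 6) join under an edge of
# length 4; that node and leaf c (edge 5) join at the root (edge 0).
tree = ([([("a", 2.0), ("b", 6.0)], 4.0), ("c", 5.0)], 0.0)
weights = leaf_weights(tree)
```

Because every edge is fully distributed, the weights sum to the total edge length of the tree, which gives a quick sanity check on any input.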
37
How to build a profile HMM
38
Resources on the web
HMMer: a free profile HMM software
  http://hmmer.wustl.edu/
SAM: another free profile HMM software
  http://www.cse.ucsc.edu/research/compbio/sam.html
PFAM: database of alignments and HMMs for protein families and domains
  http://www.sanger.ac.uk/Software/Pfam/
SCOP: a structural classification of proteins
  http://scop.berkeley.edu/data/scop.b.html