Published by Louise Knight. Modified over 9 years ago.
1
Practical multiple sequence algorithms
Sushmita Roy
BMI/CS 576
www.biostat.wisc.edu/bmi576/
sroy@biostat.wisc.edu
Sep 23rd, 2014
2
RECAP
Scores for multiple sequence alignment
– Sum of pairs
– Minimum entropy based
Heuristic algorithms for performing multiple sequence alignment
– Progressive
  Star alignment
  Guide tree-based: ClustalW
– Iterative: MUSCLE
3
Goals for today
General description of iterative algorithms
A practical implementation
– MUSCLE
4
Iterative algorithms for multiple sequence alignment
Key idea: revisit alignments that were made earlier
Algorithms differ in exactly how the alignment is changed between iterations
5
Simple iterative algorithm
(Also called the Barton-Sternberg alignment algorithm)
1. Align the two sequences with the highest pairwise alignment score, using standard dynamic programming for pairwise alignment
2. Repeat until all sequences are in the alignment
– Find the sequence most similar to the current alignment
– Add it to the alignment
3. For each sequence x_i
– Remove x_i from the alignment and re-align it to the alignment of the remaining sequences {x_1, ..., x_n} \ {x_i}
Repeat step 3 until the score does not improve OR a fixed number of steps has been executed
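A minimal sketch of the control flow in step 3. It assumes the caller supplies two functions that the slide does not specify: `realign_one(alignment, i)`, which removes row i and re-aligns it to the remaining rows, and `score(alignment)`, where larger is better. Both names are hypothetical, and keeping the best alignment seen so far is a choice made here rather than something the slide states.

```python
from typing import Callable, List, Tuple

Alignment = List[str]  # one gapped string per sequence, all of equal length


def iterative_refine(
    alignment: Alignment,
    realign_one: Callable[[Alignment, int], Alignment],  # hypothetical helper
    score: Callable[[Alignment], float],                 # hypothetical helper
    max_rounds: int = 10,
) -> Tuple[Alignment, float]:
    """Repeat leave-one-out re-alignment until the score stops improving."""
    best, best_score = alignment, score(alignment)
    for _ in range(max_rounds):
        candidate = best
        # One round: take each sequence out in turn and re-align it to the rest.
        for i in range(len(candidate)):
            candidate = realign_one(candidate, i)
        candidate_score = score(candidate)
        if candidate_score <= best_score:
            break  # no improvement in this round, so stop
        best, best_score = candidate, candidate_score
    return best, best_score
```

The two stopping conditions from the slide appear as the `break` (score did not improve) and `max_rounds` (fixed number of steps).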
6
MUSCLE: Multiple Sequence Comparison by Log-Expectation
Progressive + iterative
Has three main stages
Stage 1: Draft progressive
Stage 2: Improved progressive
Stage 3: Refinement
– Select pairs of subtrees and re-align the alignments of the two subtrees
– Keep the new alignment if it improves the score
Each stage returns an alignment
– The algorithm can be terminated after any stage
7
Steps in MUSCLE
Stage 1: Draft progressive
Stage 2: Improved progressive
Stage 3: Refinement
8
MUSCLE Stage 1
1.1 Compute k-mer distance matrix
1.2 Use UPGMA to make tree (TREE1) (We will see this in a bit)
1.3 Use guide tree to make first MSA
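For orientation on step 1.2, here is a minimal sketch of plain UPGMA (average-linkage clustering). It is an illustration only: the clustering used inside MUSCLE may differ in detail, and the function name, the `frozenset`-keyed distance dictionary, and the nested-tuple tree format are all assumptions made for this example.

```python
from itertools import combinations


def upgma(names, dist):
    """Plain UPGMA on leaf labels `names`.

    `dist` maps frozenset({a, b}) -> distance for every pair of leaves.
    Returns the tree as nested 2-tuples of leaf names.
    """
    sizes = {name: 1 for name in names}  # current clusters and their leaf counts
    d = dict(dist)                       # working copy; grows as clusters merge
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        na, nb = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))
```

In Stage 1 the input distances would be the k-mer distances from step 1.1, and the returned tree would play the role of TREE1.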
9
K-mer distance D
K-mer distance is defined from the fractional common k-mer count (F)
For two sequences x and y: D = 1 - F
10
K-mer distance example
Sequence     k=2-mers
x = AKFLA    AK, KF, FL, LA
y = LKFLFL   LK, KF, FL, LF, FL

K-mer (τ)   n_x(τ)   n_y(τ)   min(n_x(τ), n_y(τ))
AK          1        0        0
KF          1        1        1
FL          1        2        1
LA          1        0        0
LK          0        1        0
LF          0        1        0
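The example can be checked with a short sketch. The formula for F is not present in the surviving slide text; the code assumes the definition from the MUSCLE paper (Edgar, 2004), F = Σ_τ min(n_x(τ), n_y(τ)) / (min(L_x, L_y) - k + 1), where L_x and L_y are the sequence lengths. The function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x: str, y: str, k: int) -> float:
    """D = 1 - F, with F normalized by the k-mer count of the shorter sequence.

    The normalization is an assumption taken from Edgar (2004).
    """
    n_kmers_shorter = min(len(x), len(y)) - k + 1
    if n_kmers_shorter <= 0:
        raise ValueError("both sequences must have at least k residues")
    # Counter intersection keeps min(n_x(t), n_y(t)) for each k-mer t.
    shared = sum((kmer_counts(x, k) & kmer_counts(y, k)).values())
    return 1.0 - shared / n_kmers_shorter


print(kmer_distance("AKFLA", "LKFLFL", 2))  # 0.5
```

For the two sequences in the table, the min column sums to 2 and the shorter sequence has 4 two-mers, so F = 2/4 and D = 0.5.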
11
Stage 2: Improved progressive
2.1 Recompute the similarity of each pair of sequences from their mutual alignment in the MSA
2.2 Construct a phylogenetic tree (TREE2) using an alignment-based distance
2.3 Build a new progressive alignment only for subtrees where the branching order has changed between TREE1 and TREE2
2.4 Repeat 2.3 until the number of “reordered nodes” does not decrease
12
Stage 2.1: Recomputing pairwise sequence similarity from a multiple alignment
An MSA
-TGTTAAC
-TGT-AAC
-TGT--AC
ATGT---C
ATGT-GGC

Derived pairwise alignment    Fraction identity
TGTTAAC / TGT-AAC             6/7
TGTTAAC / TGT--AC             5/7
-TGTTAAC / ATGT---C           4/8
-TGTTAAC / ATGT-GGC           …
…

Exclude columns that are gaps in both sequences
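A minimal sketch of this step, assuming that fractional identity is the number of matching columns divided by the number of columns left after double-gap columns are dropped, which is what the 6/7, 5/7 and 4/8 values on the slide imply. The function names are hypothetical.

```python
from fractions import Fraction


def derived_pair(row_a: str, row_b: str, gap: str = "-"):
    """Project two MSA rows to a pairwise alignment, dropping double-gap columns."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)


def fractional_identity(row_a: str, row_b: str, gap: str = "-") -> Fraction:
    """Matching columns over all remaining columns of the derived alignment."""
    a, b = derived_pair(row_a, row_b, gap)
    if not a:
        return Fraction(0)
    # After double-gap columns are removed, equal characters are never gaps.
    matches = sum(1 for p, q in zip(a, b) if p == q)
    return Fraction(matches, len(a))


msa = ["-TGTTAAC", "-TGT-AAC", "-TGT--AC", "ATGT---C", "ATGT-GGC"]
print(derived_pair(msa[0], msa[1]))         # ('TGTTAAC', 'TGT-AAC')
print(fractional_identity(msa[0], msa[1]))  # 6/7
```

`Fraction` is used so the results can be compared exactly with the values on the slide; note that it reduces 4/8 to 1/2 when printed.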
13
Stage 2.2: Phylogenetic tree creation
Construct a phylogenetic tree using a Kimura distance
D: fractional identity of sequences
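The distance formula itself is not present in the surviving slide text. A common form of Kimura's correction for protein sequences is d = -ln(1 - p - p^2/5), where p is the fraction of aligned positions that differ. The sketch below assumes p = 1 - fractional identity; that reading, the function name, and the choice to return infinity outside the formula's valid range are all assumptions for illustration.

```python
import math


def kimura_distance(fractional_identity: float) -> float:
    """Kimura-corrected distance from a fractional identity in [0, 1].

    Assumes p = 1 - fractional_identity is the fraction of differing positions.
    """
    p = 1.0 - fractional_identity
    inner = 1.0 - p - p * p / 5.0
    if inner <= 0.0:
        # The logarithm is undefined for very divergent pairs (p above ~0.85).
        return math.inf
    return -math.log(inner)
```

The correction grows faster than p itself, reflecting multiple substitutions at the same site: identical sequences get distance 0, while a pair with 50% identity gets a distance of about 0.80 rather than 0.5.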
14
Stage 2.3: Re-align only when the branching order has changed
[Figure: TREE1 compared with TREE2. Two cases are shown: branching order same, and branching order different (x branches before v). The alignment is recomputed for the nodes in the second case.]
15
Stage 3: Iterative Refinement
3.1 Delete an edge from the tree, which leaves two subtrees
3.2 Extract profiles from the subtrees
3.3 Re-align the profiles
3.4 Update the MSA if the new score is better than that of the current MSA
16
3.1 Selecting a branch
Select branches in order of decreasing distance from the root
[Figure: guide tree with branches numbered 1 to 6 and the sub-alignment at each node. Branch selection order: 1, 2, 3, 4, 5, 6. Full MSA: MQTIF / LH-IW / LQS-W / L-S-W]
17
3.2 Extracting a profile
Delete branch 2
Re-align the profiles for the two subtrees
Is the score better?
– Yes: keep the new alignment
– No: discard it
[Figure: deleting branch 2 separates the sequence LHIW from the profile MQTIF / LQS-W / L-S-W; the two are re-aligned and the result is compared with the current MSA.]
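A minimal sketch of profile extraction, assuming the MSA is held as a dictionary from sequence name to gapped row and that the deleted edge is described by the set of leaf names on one side of it. The names `s1` to `s4` and both function names are hypothetical. Columns that become all-gap inside a subtree are dropped, which is how LH-IW becomes LHIW in the slide's example.

```python
def extract_profile(msa, members, gap="-"):
    """Return the rows of `members`, minus columns that are gaps in all of them."""
    members = list(members)
    if not members:
        return {}
    rows = [msa[name] for name in members]
    keep = [j for j in range(len(rows[0])) if any(row[j] != gap for row in rows)]
    return {name: "".join(msa[name][j] for j in keep) for name in members}


def split_on_edge(msa, leaves_below_edge, gap="-"):
    """Split an MSA into the two profiles on either side of a deleted tree edge."""
    inside = [name for name in msa if name in leaves_below_edge]
    outside = [name for name in msa if name not in leaves_below_edge]
    return extract_profile(msa, inside, gap), extract_profile(msa, outside, gap)


# Rows of the MSA shown on the slide; the sequence names are made up.
msa = {"s1": "MQTIF", "s2": "LH-IW", "s3": "LQS-W", "s4": "L-S-W"}
print(split_on_edge(msa, {"s2"}))
print(split_on_edge(msa, {"s3", "s4"}))
```

The two returned profiles are what steps 3.3 and 3.4 would re-align and score; that part needs a profile-profile aligner and is not sketched here.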
18
Summary of MUSCLE
Three-stage algorithm
Stage 1: Draft progressive
– k-mer distance
– UPGMA tree (TREE1)
– Guide tree based alignment (MSA1)
Stage 2: Improved progressive
– Distance derived from MSA1
– UPGMA tree (TREE2)
– Redo alignment for nodes with changed orderings
– Repeat until number of re-ordered nodes does not change
Stage 3: Iterative refinement
– Generate subtree profiles
– Realign profiles
– Keep realignment if of higher score
– Repeat until no more improvement or fixed number of steps
MUSCLE-fast: Stage 1
MUSCLE-p: Stage 1 and 2
Note different convergence criteria in Stages 2 and 3
19
Accuracy scores of different MSA algorithms on benchmark datasets
Edgar, 2004, BMC Bioinformatics
Accuracy measures the fraction of residues correctly aligned with respect to the reference alignment
20
Run time of different MSA algorithms
21
Summary of algorithms
ClustalW
– Lots of heuristics for gaps
– One guide tree and then alignment
– Weights sequences
– Dynamically selects scoring matrix depending upon sequence identity
MUSCLE
– Three-stage algorithm: Draft, Improved, Iterative refinement
– Two guide trees
– Uses k-mer distance for first tree
– Selectively re-aligns using second tree
– Refines iteratively by working on subtree-associated alignments
– Fast, and produces alignments of equal or better quality
22
How do MUSCLE and CLUSTALW work in practice?
Consider coding sequences of 15 yeast species
Consider promoter sequences of 15 yeast species
Align with MUSCLE and CLUSTALW
23
Protein sequence alignment
[Figure panels: MUSCLE, CLUSTALW]
24
Promoter sequence alignment
[Figure panels: MUSCLE, CLUSTALW]
25
Comparing alignment of promoters to shuffled sequences in CLUSTALW
[Figure panels: Original sequences, Shuffled sequences]
26
Comparing alignment of promoters to shuffled sequences in MUSCLE
[Figure panels: Original sequences, Shuffled sequences]
27
Conclusion
The two algorithms seemed to give similar alignments for protein/coding sequences
The two algorithms gave different alignments for DNA sequences
– Possibly DNA sequences are harder to align
– DNA sequences in non-coding regions are even harder to align
28
Summary of sequence alignment
Pairwise alignment
– Algorithms
  Global: Needleman-Wunsch
  Local: Smith-Waterman
Heuristic search to align a large number of sequences
– BLAST
Multiple sequence alignment
– Star alignment
– Progressive alignment with guide tree: CLUSTALW
– Progressive + iterative alignment with guide tree: MUSCLE