1 MUSCLE An Attractive MSA Application

2 Overview Some background on the MUSCLE software. The innovations and improvements of MUSCLE. The MUSCLE algorithm. MUSCLE software.

3 Background MUSCLE stands for “multiple sequence comparison by log expectation”. One of the most popular recent MSA programs. Developed by Robert C. Edgar in 2004. Considered one of the most accurate MSA programs available today. The basic idea: progressive alignment.

4 Progressive Alignment – Quick Review Estimation of an evolutionary tree based on the input sequences. The tree is scanned from leaves to root. Construction of pairwise alignment of the subtrees found at each internal node. A subtree is represented by its profile => alignment between two profiles.

5 Progressive Alignment (cont.) [Figure: an example profile, shown as a table of frequencies for L, Q, S, W, F and the gap symbol over alignment columns 1 to 4.]
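
A profile of the kind pictured on this slide can be computed from a set of aligned sequences by counting symbols column by column. The following is a minimal sketch, not MUSCLE's own code; the function name `build_profile` and the two toy sequences are assumptions made for illustration.

```python
from collections import Counter


def build_profile(aligned_seqs):
    """Return one {symbol: frequency} dict per alignment column.

    Gaps ('-') are counted like any other symbol, so each column's
    frequencies sum to 1.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    n = len(aligned_seqs)
    profile = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        profile.append({sym: c / n for sym, c in counts.items()})
    return profile


# Toy alignment of two sequences (hypothetical data).
profile = build_profile(["LQSW", "L-SF"])
```

Each column is stored sparsely: symbols that never occur in a column are simply absent from its dictionary.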

6 Progressive Alignment (cont.) Today, all popular MSA programs use progressive alignment. Examples are CLUSTALW and T-Coffee. MUSCLE aims to achieve higher speed and accuracy. How exactly?

7 MUSCLE Innovations Faster: faster distance estimation between the input sequences; faster construction of an evolutionary tree. More accurate: applying a new score function to the profile alignments; refinement of the initial results.

8 Innovation I: Distance estimation Distances between input sequences are estimated for the construction of the evolutionary tree. Previous methods (CLUSTALW, T-Coffee) align all pairs of sequences. This approach is time consuming ($O(N^2 L^2)$ for N sequences of typical length L). MUSCLE computes the distances more efficiently, using K-mer distance (time complexity of $O(N^2 L)$). K-mer distance is derived from the K-mer similarity.

9 K-mer similarity K-mer: a contiguous subsequence of length k. Related sequences tend to have more K-mers in common than expected by chance. Similarity between two sequences is calculated by the fraction of K-mers they have in common.

10 K-mer similarity (cont.) The formula for K-mer similarity between sequences 1 and 2: $F = \sum_{t} \min[n_1(t), n_2(t)] / [\min(L_1, L_2) - k + 1]$. Here $t$ is a K-mer, $L_1$ and $L_2$ are the sequence lengths, and $n_1(t)$ and $n_2(t)$ are the numbers of times $t$ is observed in sequences 1 and 2. F assumes that common K-mers are always alignable and therefore indicate similarity. The denominator is the maximum number of common K-mers that can be found in sequences 1 and 2. The numerator counts the maximum number of K-mer occurrences that can be aligned.
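
The similarity formula on this slide translates almost directly into code. The sketch below is an illustration under stated assumptions: the function names and the two toy sequences are hypothetical, and the distance is taken as 1 - F, as the next slide defines it.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every contiguous subsequence of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(seq1, seq2, k):
    """F = sum_t min(n1(t), n2(t)) / (min(L1, L2) - k + 1)."""
    shorter = min(len(seq1), len(seq2))
    if shorter < k:
        raise ValueError("both sequences must contain at least k symbols")
    n1, n2 = kmer_counts(seq1, k), kmer_counts(seq2, k)
    # A Counter returns 0 for K-mers it has never seen.
    shared = sum(min(count, n2[t]) for t, count in n1.items())
    return shared / (shorter - k + 1)


def kmer_distance(seq1, seq2, k):
    """K-mer distance, defined as 1 - F."""
    return 1.0 - kmer_similarity(seq1, seq2, k)


# Toy sequences (hypothetical): 2-mers WS and SQ are shared, QL/LF and QM/MF are not.
similarity = kmer_similarity("WSQLF", "WSQMF", k=2)
distance = kmer_distance("WSQLF", "WSQMF", k=2)
```

No alignment is computed at any point, which is where the speed advantage over all-pairs alignment comes from.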

11 K-mer similarity (cont.) Justification for the use of K-mer similarity (F): [Figure: plot of fractional identity against K-mer similarity.] Once the logic behind K-mer similarity is understood, K-mer distance is easily derived: D(kmer) = 1 - F.

12 MUSCLE Innovations Faster distance estimation between the input sequences. Faster construction of an evolutionary tree. Applying new score function to the profile alignments. Refinement of the initial results. √

13 Innovation II: Tree Construction Most progressive alignment applications (e.g. CLUSTALW) construct the evolutionary tree using the neighbor joining algorithm (NJ). NJ takes a distance matrix as input and tries to build the tree that best fits this matrix. NJ is considered to be one of the most reliable methods for phylogenetic reconstruction. But MUSCLE prefers UPGMA, which is much less accurate… why?

14 UPGMA Algorithm Takes a distance matrix between the sequences as input, just like NJ. Unlike NJ, UPGMA does not try to find the tree that best fits the distance matrix. Instead, at each step of the algorithm, it joins the two closest sequences (or clusters) as siblings in the tree. Thus, sequences that are closer in the matrix are more closely related in the resulting tree.

15 NJ Reconstruction - example
Distance matrix:
       i     m     j     n
  i    0    0.3   0.5   0.6
  m   0.3    0    0.6   0.5
  j   0.5   0.6    0    0.9
  n   0.6   0.5   0.9    0
[Figure: NJ tree on leaves i, m, j, n with internal nodes k and l; surviving branch labels: 0.1, 0.4.]
Sequences i and m are the closest according to the matrix, but are not siblings in the resulting tree! The distances in the tree match the ones in the matrix.

16 UPGMA Reconstruction - example
Distance matrix:
       i     m     j     n
  i    0    0.3   0.5   0.6
  m   0.3    0    0.6   0.5
  j   0.5   0.6    0    0.9
  n   0.6   0.5   0.9    0
[Figure: UPGMA tree on leaves i, m, j, n; surviving labels: 0.15, 0.125, 0.275, 0.39.]
Sequences i and m are the closest according to the matrix, and therefore they are siblings in the resulting tree! The distances in the tree don’t match the ones in the matrix!
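
The UPGMA procedure can be sketched on the four-sequence matrix from this example. This is a plain, unoptimised illustration, not MUSCLE's implementation: the function name `upgma`, the nested-tuple tree representation, and the tie-breaking rule (first pair found wins) are all assumptions made here.

```python
from itertools import combinations


def upgma(labels, matrix):
    """Build a UPGMA tree; return (tree, merges).

    tree is a nested tuple of labels; merges lists each join as
    ((left_subtree, right_subtree), node_height) in the order performed.
    """
    # Cluster id -> (subtree, number of leaves in it).
    clusters = {i: (label, 1) for i, label in enumerate(labels)}
    dist = {frozenset((i, j)): matrix[i][j]
            for i, j in combinations(range(len(labels)), 2)}
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        # Join the two closest clusters as siblings.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: dist[frozenset(pair)])
        height = dist[frozenset((a, b))] / 2
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the new cluster to every other cluster is the
        # size-weighted average of the distances from its two parts.
        for c in clusters:
            d_ac = dist[frozenset((a, c))]
            d_bc = dist[frozenset((b, c))]
            dist[frozenset((next_id, c))] = (
                (size_a * d_ac + size_b * d_bc) / (size_a + size_b))
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        merges.append(((tree_a, tree_b), height))
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, merges


labels = ["i", "m", "j", "n"]
matrix = [
    [0.0, 0.3, 0.5, 0.6],
    [0.3, 0.0, 0.6, 0.5],
    [0.5, 0.6, 0.0, 0.9],
    [0.6, 0.5, 0.9, 0.0],
]
tree, merges = upgma(labels, matrix)
```

The first join is i with m at height 0.15, as on the slide. After that join, j and n are equally distant (0.55) from the (i, m) cluster, so which one joins next depends on the tie-breaking rule; this sketch joins j, at height 0.275.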

17 So why use UPGMA? UPGMA is faster than NJ ($O(N^2)$ instead of $O(N^3)$). In progressive alignment, what matters most is aligning the least distant profiles at each node. Thus, UPGMA reconstruction is expected to produce a more accurate MSA than NJ.

18 MUSCLE Innovations Faster distance estimation between the input sequences. Faster construction of an evolutionary tree. Applying new score function to the profile alignments. Refinement of the initial results. √ √

19 Innovation III: New Scoring Function Alignment between profiles is based on the standard pairwise alignment algorithm (PW alignment). Here, instead of aligning two characters from two different sequences, we align two columns from two different profiles. Thus, a score for aligning two columns X and Y needs to be defined. MUSCLE uses the log expectation score.

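20 From PW Alignment To Profile Alignment
PW alignment (aligning characters G and A): $\max\{S_1 + \mathrm{Score}(G, A),\; S_2 + \mathrm{Score}(G, -),\; S_3 + \mathrm{Score}(-, A)\}$.
Profile alignment (aligning columns TG and ACA): $\max\{S_1 + \mathrm{Score}(TG, ACA),\; S_2 + \mathrm{Score}(TG, {-}{-}{-}),\; S_3 + \mathrm{Score}({-}{-}, ACA)\}$.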

21 Log Expectation Score (LE) The formula for the alignment of two columns X and Y: $LE^{XY} = (1 - f^X_G)(1 - f^Y_G) \log \sum_i \sum_j f^X_i f^Y_j \frac{p_{ij}}{p_i p_j}$. $p_i$ - background probability of aa i. $p_{ij}$ - the probability of aa’s i and j being aligned to each other (taken from the PW alignment score matrix). $f^X_i$ - observed frequency of i in column X. $f^X_G$ - observed frequency of gaps in column X.

22 LE Score (cont.) The scoring functions of other progressive alignment applications are similar. The difference: MUSCLE adds the multiplication by $(1 - f^X_G)(1 - f^Y_G)$, where $f_G$ is the observed gap frequency of a column. This encourages more highly occupied columns to align => producing a better MSA.
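
The log expectation score can be illustrated with a small sketch of $(1 - f^X_G)(1 - f^Y_G) \log \sum_i \sum_j f^X_i f^Y_j \, p_{ij} / (p_i p_j)$. Several things here are assumptions rather than facts from the slides: the two-letter alphabet and its probabilities are invented toy values, the function names are hypothetical, and residue frequencies are renormalised over non-gap symbols so that gaps affect the score only through the occupancy factor.

```python
import math

GAP = "-"


def residue_freqs(column):
    """Split a column into (renormalised residue frequencies, occupancy)."""
    occupancy = 1.0 - column.get(GAP, 0.0)
    if occupancy <= 0.0:
        return {}, 0.0
    freqs = {a: f / occupancy for a, f in column.items() if a != GAP}
    return freqs, occupancy


def log_expectation(col_x, col_y, background, joint):
    """LE score for aligning two profile columns.

    col_x, col_y: {symbol: frequency} dicts, gaps included.
    background:   {residue: p_i}
    joint:        {residue: {residue: p_ij}}
    """
    fx, occ_x = residue_freqs(col_x)
    fy, occ_y = residue_freqs(col_y)
    if occ_x == 0.0 or occ_y == 0.0:
        return 0.0  # an all-gap column contributes nothing
    expectation = sum(
        fxi * fyj * joint[i][j] / (background[i] * background[j])
        for i, fxi in fx.items()
        for j, fyj in fy.items())
    return occ_x * occ_y * math.log(expectation)


# Toy two-letter alphabet (hypothetical probabilities).
background = {"A": 0.5, "B": 0.5}
joint = {"A": {"A": 0.4, "B": 0.1}, "B": {"A": 0.1, "B": 0.4}}

full = log_expectation({"A": 1.0}, {"A": 1.0}, background, joint)
half_gapped = log_expectation({"A": 0.5, GAP: 0.5}, {"A": 1.0}, background, joint)
mismatch = log_expectation({"A": 1.0}, {"B": 1.0}, background, joint)
```

With these toy values, the fully occupied matching pair scores twice as much as the pair in which one column is half gaps, which is the effect the occupancy factor is meant to have.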

23 Gap Penalties LE is the score obtained from matching two columns. What about gap penalties? What is the score from aligning a column to a column of gaps? MUSCLE, just like most other progressive alignment applications, uses the position specific gap penalty scheme.

24 Position Specific Gap Penalty Consider two profiles X and Y. Suppose we want to open a gap of length λ in X. The beginning of the gap (gap-open) is aligned to position $y_o$ in Y and the end of the gap (gap-close) is aligned to position $y_c$ in Y. Then, in every sequence of X, the gap penalty will be: $b(y_o) + t(y_c) + \lambda e$.

25 Position Specific Gap Penalty $b(y_o)$ - the penalty of gap-open in X, which is aligned to position $y_o$ in Y. $t(y_c)$ - the penalty of gap-close in X, which is aligned to position $y_c$ in Y. $e$ - the gap extension penalty. The principle is position specificity: a high frequency of gap-opens at $y_o$ => low $b(y_o)$; a high frequency of gap-closes at $y_c$ => low $t(y_c)$. Conversely, lower frequencies => higher penalties.

26 Example – which gap is cheaper?
Profile X: WSQL, FSQL. Profile Y: FITQM, WI_HL.
Option 1 (gap in X opposite column 3 of Y):
  FITQM
  WI_HL
  WS_QL
  FS_QL
Option 2 (gap in X opposite column 2 of Y):
  FITQM
  WI_HL
  W_SQL
  F_SQL
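
The example can be worked through numerically under a simple assumed penalty model. The slides state only the principle (the more often gaps open or close at a position, the cheaper a new gap there); the linear form `(g / 2) * (1 - frequency)`, the constants `g` and `e`, and all function names below are illustrative assumptions, not values taken from MUSCLE.

```python
GAP = "_"  # the slide writes gaps as underscores


def gap_open_freqs(aligned):
    """Fraction of sequences in which a gap starts at each column."""
    n, length = len(aligned), len(aligned[0])
    return [sum(1 for s in aligned
                if s[c] == GAP and (c == 0 or s[c - 1] != GAP)) / n
            for c in range(length)]


def gap_close_freqs(aligned):
    """Fraction of sequences in which a gap ends at each column."""
    n, length = len(aligned), len(aligned[0])
    return [sum(1 for s in aligned
                if s[c] == GAP and (c == length - 1 or s[c + 1] != GAP)) / n
            for c in range(length)]


def gap_penalty(gap_length, open_freq, close_freq, g=3.0, e=0.2):
    """Open penalty + close penalty + length * extension penalty.

    Assumed model: each of the open and close penalties is (g / 2)
    scaled down linearly by the matching gap frequency.
    """
    open_penalty = (g / 2) * (1.0 - open_freq)
    close_penalty = (g / 2) * (1.0 - close_freq)
    return open_penalty + close_penalty + gap_length * e


profile_y = ["FITQM", "WI_HL"]
f_open = gap_open_freqs(profile_y)
f_close = gap_close_freqs(profile_y)

# A length-1 gap in X opposite column 3 of Y (index 2) versus column 2 (index 1).
option_col3 = gap_penalty(1, f_open[2], f_close[2])
option_col2 = gap_penalty(1, f_open[1], f_close[1])
```

Under this model the gap opposite column 3 is cheaper, because half of the sequences in Y already open and close a gap there.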

27 MUSCLE Innovations Faster distance estimation between the input sequences. Faster construction of an evolutionary tree. Applying new score function to the profile alignments. Refinement of the initial results. √ √ √

28 Innovation IV: Refinement Step An edge is chosen from the progressive alignment tree. The tree is divided into two subtrees by deleting this edge. The MSA from each subtree is computed by progressive alignment. The two MSAs are aligned, generating a new MSA. If the new MSA achieves higher score than the previous one, then keep it!!!

29 Refinement Step (cont.) [Figure: MSA1 and MSA2, one from each subtree, are aligned to give the NEW MSA, which is compared with the OLD MSA.]

30 MUSCLE Innovations Faster distance estimation between the input sequences. Faster construction of an evolutionary tree. Applying new score function to the profile alignments. Refinement of the initial results. √ √ √ √ How are these improvements combined in the algorithm?

31 MUSCLE Algorithm Consists of three stages: Stage 1 - Draft progressive. Stage 2 - Improved progressive. Stage 3 - Refinement.

32 Stage 1: Draft Progressive Description: produce an MSA, emphasizing speed over accuracy: 1. K-mer distance is computed for each pair of input sequences, giving distance matrix D1. 2. Construction of a UPGMA tree, based on D1. 3. Progressive alignment, which generates MSA1.

33 Stage 1: Flow Chart

34 Stage 2: Improved Progressive Description: improvement of the MSA by re-estimating the tree. In this stage the tree is based on MSA1, and not on the K-mer distance: 1. MSA1 is used to calculate distances between all pairs of sequences, giving distance matrix D2. 2. Construction of the tree T2 by UPGMA, based on D2. 3. Progressive alignment based on T2, which generates MSA2.
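
Step 1 of this stage can be sketched as follows. The slides do not say which distance measure is derived from MSA1, so this sketch assumes the simplest choice, one minus fractional identity over columns where neither sequence has a gap; the function names and the toy alignment are hypothetical.

```python
def identity_distance(row_a, row_b, gap="-"):
    """1 - fractional identity of two aligned rows.

    Only columns where both rows hold a residue are compared.
    """
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        return 1.0  # nothing comparable: treat as maximally distant
    matches = sum(1 for a, b in pairs if a == b)
    return 1.0 - matches / len(pairs)


def distance_matrix(msa, gap="-"):
    """Symmetric matrix of pairwise distances between the rows of an MSA."""
    n = len(msa)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = identity_distance(msa[i], msa[j], gap)
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy stand-in for MSA1 (hypothetical data).
msa1 = ["FITQM", "WI-HL", "WS-QL"]
d2 = distance_matrix(msa1)
```

The resulting matrix plays the role of D2 and can be passed to any UPGMA routine to obtain T2.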

35 Stage 1+2: Flow Chart

36 Stage 3: Refinement Description: conducting several refinement steps. 1. Edges are chosen from T2 by decreasing distance from the root. 2. Each edge is used for a refinement step. 3. If the score is improved, the new alignment is kept. 4. The last alignment that is kept is the final MSA!
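
The control flow of the refinement stage can be sketched as an accept-if-better loop. Everything passed into `refine` below is a hypothetical stand-in: `split`, `align_profiles` and `score` represent the tree bipartition, the profile-profile alignment and the MSA scoring function, none of which are specified in code form by the slides. The toy demonstration uses strings only to exercise the loop.

```python
def refine(msa, edges, split, align_profiles, score):
    """Run one refinement step per edge, keeping only improvements.

    edges is expected in the order the slide describes (decreasing
    distance from the root); this function does not sort it.
    """
    best, best_score = msa, score(msa)
    for edge in edges:
        # Deleting the edge divides the sequences into two groups.
        left, right = split(best, edge)
        # Re-align the two groups' profiles to get a candidate MSA.
        candidate = align_profiles(left, right)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best


# Toy stand-ins (purely illustrative): the "MSA" is a string, an "edge"
# is a cut position, re-alignment swaps the two halves, and the score
# counts adjacent characters that are in alphabetical order.
def toy_split(msa, edge):
    return msa[:edge], msa[edge:]


def toy_align(left, right):
    return right + left


def toy_score(msa):
    return sum(1 for a, b in zip(msa, msa[1:]) if a < b)


final = refine("cab", [1, 2], toy_split, toy_align, toy_score)
```

In the toy run the first edge produces an improvement and is kept, while the second produces a worse candidate and is discarded, so the last kept alignment is returned.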

37 Stage 1+2+3: Flow Chart

38 MUSCLE Achieves Higher Accuracy [Figure: accuracy comparison between T-Coffee and MUSCLE.]

39 MUSCLE Is Faster!!
  Method       CPU Time (Sec.)
  MUSCLE       97
  MUSCLE-p     52
  T-Coffee     1500
  NWNSI        170
  CLUSTAL-W    170
MUSCLE-p: MUSCLE algorithm, not including the refinement stage

40 The MUSCLE software MUSCLE for Windows is available for download at: http://www.drive5.com/muscle/download3.6.html. MUSCLE for Unix is available for use via your Unix account. These versions of MUSCLE are command-line programs: the user gives the commands in the DOS (Windows) or shell (Unix) environment. The command syntax is identical in the Unix and Windows versions. So let’s learn some useful commands…

41 Some Useful Commands Typing the word “muscle” will show you a set of useful commands. Here are the important ones:
Basic usage: muscle -in <inputfile> -out <outputfile>
Common options:
  -in <inputfile>     Input file in FASTA format (default stdin)
  -out <outputfile>   Output alignment in FASTA format (default stdout)
  -maxiters <n>       Maximum number of iterations (integer, default 16)
  -html               Write output in HTML format (default FASTA)
  -clw                Write output in CLUSTALW format (default FASTA)

42 Other interesting options
Applying MUSCLE without the refinement stage (very fast, average accuracy similar to T-Coffee): -maxiters 2.
Applying MUSCLE in the fastest way possible (only stage 1, large k leading to faster distance estimation): -maxiters 1 -diags -sv -distance1 kbit20_3.
Refinement of an existing MSA given in FASTA format: -refine.
All the other options appear in the MUSCLE user guide, which is included in the zip file of the Windows download.
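
For scripted use, the options listed on these two slides can be assembled into a command line programmatically. The sketch below only builds the argument list and does not run anything; it assumes the command-line program is installed under the name `muscle`, and the helper name `muscle_command` and the file names are hypothetical.

```python
def muscle_command(in_path, out_path, maxiters=None,
                   output_format="fasta", refine=False):
    """Build an argument list using only options named on the slides."""
    cmd = ["muscle", "-in", in_path, "-out", out_path]
    if maxiters is not None:
        cmd += ["-maxiters", str(maxiters)]
    if output_format == "clw":
        cmd.append("-clw")
    elif output_format == "html":
        cmd.append("-html")
    elif output_format != "fasta":
        raise ValueError("output_format must be 'fasta', 'clw' or 'html'")
    if refine:
        cmd.append("-refine")
    return cmd


# Stages 1 and 2 only (no refinement), CLUSTALW-format output.
no_refinement = muscle_command("seqs.fasta", "seqs.aln",
                               maxiters=2, output_format="clw")

# Refine an existing alignment.
refine_existing = muscle_command("old.afa", "new.afa", refine=True)
```

A list built this way could be handed to `subprocess.run(cmd, check=True)` on a machine where the program is available.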

43 MUSCLE Web Server You don’t like command lines? No problem… there is also a web server!
Disadvantages:
- The web server does not let you change the default parameters.
- It is restricted to no more than 200 sequences in one MSA.
- If there are more than 100 sequences, it does not perform the refinement stage.
Its advantage is its simplicity. The address: http://phylogenomics.berkeley.edu/cgibin/muscle/input_muscle.py

44 [Screenshot of the web server form, with callouts:] Insert sequences in FASTA format. Insert the name of a FASTA file. Enter your e-mail address. Run and wait for the results by mail.

45 Useful Information Resources
1. Edgar, R.C. (2004), MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research 32(5), 1792-1797.
2. Edgar, R.C. (2004), MUSCLE: a multiple sequence alignment method with reduced time and space complexity, BMC Bioinformatics 5(1), 113.
3. Edgar, R.C. (2004), Local homology recognition and distance measures in linear time using compressed amino acid alphabets, Nucleic Acids Research 32, 380-385.
4. http://www.drive5.com/muscle/.
Thank you!!!

