Multiple sequence alignment (MSA)
Petri Törönen
Help contributed by: Liisa Holm & Ari Löytynoja
What is MSA?
MSA is an alignment generated from three or more sequences.
MSA is usually a global alignment, i.e., the aim is to align homologous residues (nucleotides or amino acids) in columns across the whole length of the sequences.
GA--GTACA
CAC-GTATA
CACGGTAT-
G-CGGTCTA
What is MSA?
[Figure: a protein multiple sequence alignment]
Why MSA?
"MSA emphasises signal observed in the pairwise alignment" (Liisa Holm)
Improved alignments
Alignment of more distant sequences with the help of intermediate sequences
Highlighting of the conserved regions in sequences
Why MSA?
MSA is input to many analysis tasks:
  – Detection of active sites
  – Generation of sequence profiles
  – Detection of protein domains and motifs
  – Phylogenetics
  – …
Remember
First step of MSA: a good selection of sequences for the analysis
Sequences need to be functionally/evolutionarily related
Sometimes it is good to have some variation in the sequences (depends on the analysis task)
Otherwise: rubbish in → rubbish out
MSA methods
Finding the optimal multiple sequence alignment is a computationally hard task
The "correct" answer would always come from extending the dynamic programming algorithm to multiple sequences
In practice, dynamic programming cannot be applied to MSA problems: its cost grows exponentially with the number of sequences
We need approximate solutions (heuristics)
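To get a feeling for why the exact approach fails, the sketch below counts the cells of the dynamic programming table when it is extended to N sequences. It assumes the straightforward extension with one table axis per sequence; the function name `dp_cells` is made up for this illustration.

```python
def dp_cells(lengths):
    """Number of cells in a dynamic programming table with one axis per sequence."""
    cells = 1
    for length in lengths:
        cells *= length + 1  # +1 for the leading gap position on each axis
    return cells


# Assumed example: sequences of length 100, growing number of sequences.
for n in (2, 3, 5, 10):
    print(n, "sequences:", dp_cells([100] * n), "cells")
```

Each extra sequence multiplies the table size by roughly the sequence length, which is why heuristics are needed already for a handful of sequences.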
MSA methods: heuristics
Progressive alignment (not much used)
Iterative alignment (most popular)
Hidden Markov models
Pattern-based methods
Progressive alignment
Divide the unsolvable task into subtasks that can be solved
First align the most similar pairs of sequence sets
  – A sequence set can contain one or many sequences
  – At the start, each set contains only a single sequence
Move progressively to bigger sets and to more difficult pairs of sets
Always align only one pair of sets at a time
Progressive alignment
Produce pairwise alignments between all the sequences you want to align with MSA.
  – Dynamic programming, k-tuple (ktup) methods, ...
Produce a "guide tree" on the basis of the pairwise distances calculated from the pairwise alignments.
  – UPGMA, neighbor joining
Produce an MSA using the "guide tree".
  – Sequences are aligned in the order given by the guide tree.
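The first step, pairwise alignment by dynamic programming, can be sketched as a small global (Needleman-Wunsch style) alignment. This is only an illustration: the scoring values (match = 1, mismatch = -1, linear gap penalty = -1) and the function name are assumptions, not the settings of any particular MSA program.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the score table, including the gap-only first row and column.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("ACGT", "AGT")
print(total)
print(top)
print(bottom)
```

Running this for every pair of sequences gives the all-against-all alignments from which the pairwise distances for the guide tree are calculated.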
Set of sequences
→ All-against-all pairwise alignment (here demonstrated for the first sequence)
→ Get pairwise similarities from the alignments
→ Create a cluster tree from the similarities
→ Join the sequences in the order obtained from the cluster tree
Guide tree construction: UPGMA
Unweighted Pair Group Method with Arithmetic mean
One of the fastest tree construction methods
An example: Pairwise alignments
Pairwise distances, based on pairwise alignments
Number of nucleotide differences
Absolute distances, used in Pileup/Clustal
JC-distance
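Counting nucleotide differences from a pairwise alignment can be sketched as follows. The choice to skip columns that contain a gap is an assumption made here for illustration; programs differ in how they treat gaps. The function name is hypothetical.

```python
def p_distance(aligned_a, aligned_b):
    """Return (number of differences, proportion of differences) for two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differences = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # assumption: columns with a gap are ignored
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return differences, differences / compared


print(p_distance("ACGT", "ACGA"))
```

The proportion of differences is the observed distance D that the Jukes-Cantor correction takes as input.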
UPGMA based on JC-distances*
0,107 / 2
*JC-distances = Jukes-Cantor distances. The observed distances, D, are corrected for multiple substitutions via the correction function d = -(3/4) * ln(1 - (4/3) * D)
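The correction function is easy to evaluate directly. A minimal sketch, assuming D is given as a proportion of differing sites between 0 and 0.75 (the correction is undefined at 0.75 and above); the function name is made up here:

```python
import math


def jukes_cantor(d_observed):
    """Jukes-Cantor corrected distance: -(3/4) * ln(1 - (4/3) * D)."""
    if not 0 <= d_observed < 0.75:
        raise ValueError("observed distance must be in [0, 0.75)")
    return -0.75 * math.log(1 - (4 / 3) * d_observed)


for d in (0.05, 0.1, 0.3, 0.6):
    print(d, round(jukes_cantor(d), 4))
```

The corrected distance is always at least as large as the observed one, and the gap between them widens as D grows, because more substitutions are hidden by multiple hits at the same site.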
UPGMA, distance updates
d((human, chimp), gorilla) = [d(human, gorilla) + d(chimp, gorilla)] / 2 = [0,383 + 0,232] / 2 = 0,3075
d((human & chimp), U) = 0,3923 / 2 = 0,1962
d((gorilla & orangutan), U) = 0,3923 / 2 = 0,1962
0,1962 - 0,0537 ≈ 0,1426
0,1962 - 0,116 ≈ 0,080
UPGMA
[Figure: the resulting tree with its branch lengths, including 0,0537 and 0,116]
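The UPGMA procedure walked through above can be sketched in a few lines. The distance matrix below is a made-up toy example with taxa A to D, not the primate data from the slides, and the function name and data layout are assumptions for illustration. Clusters are joined at height d / 2, and distances to a new cluster are averaged with weights equal to the cluster sizes.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the list of merges as (left, right, height).

    dist maps frozenset({x, y}) to the distance between x and y.
    """
    clusters = {name: 1 for name in names}  # cluster label -> number of members
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2
        merges.append((a, b, height))
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        new = (a, b)  # nested tuples record the tree shape
        # Size-weighted average distance from the new cluster to every other cluster.
        for other in clusters:
            d[frozenset((new, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[new] = size_a + size_b
    return merges


# Hypothetical toy distances, chosen only to keep the arithmetic simple.
toy = {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0,
       ("A", "D"): 10.0, ("B", "D"): 10.0, ("C", "D"): 10.0}
toy_dist = {frozenset(pair): value for pair, value in toy.items()}
for left, right, height in upgma("ABCD", toy_dist):
    print(left, "+", right, "joined at height", height)
```

When both clusters being merged hold a single sequence, the size-weighted average reduces to the plain mean of two distances, as in the (human, chimp) versus gorilla update.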
Alignment score
match = 1, mismatch = 0

Column: 1234
        ACGT
        ACGA
        AGGA

1: A-A + A-A + A-A = 1 + 1 + 1 = 3
2: C-C + C-G + C-G = 1 + 0 + 0 = 1
3: G-G + G-G + G-G = 1 + 1 + 1 = 3
4: T-A + T-A + A-A = 0 + 0 + 1 = 1
S(alignment) = S(1) + S(2) + S(3) + S(4) = 3 + 1 + 3 + 1 = 8
The higher the score, the better the alignment
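The column-by-column score above (a sum-of-pairs score) can be reproduced with a short sketch. It uses the same match = 1 and mismatch = 0 values; gaps are not treated specially here, which is a simplification, and the function names are made up for this example.

```python
from itertools import combinations


def column_score(column, match=1, mismatch=0):
    """Sum the scores of all pairs of residues in one alignment column."""
    return sum(match if x == y else mismatch for x, y in combinations(column, 2))


def sum_of_pairs(alignment, match=1, mismatch=0):
    """Sum the column scores over all columns of the alignment."""
    return sum(column_score(col, match, mismatch) for col in zip(*alignment))


alignment = ["ACGT", "ACGA", "AGGA"]
print([column_score(col) for col in zip(*alignment)])
print(sum_of_pairs(alignment))
```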
Progressive alignment - pros and cons
Pros:
  – Fast
Cons:
  – Once gaps are opened they can never be closed
  – Errors in the alignment of the first few sequences can have catastrophic effects on the whole alignment
  – Not much used (to my knowledge)
Iterative alignment
Create a progressive alignment
After obtaining the alignment, calculate a quality score
Repeat the following steps:
  – Redo the cluster tree
  – Realign the sequences using the new cluster tree
  – Calculate a quality score
The loop can be stopped when a maximum number of iterations is reached or when the quality score no longer improves
Iterative alignment
Allows correction of errors, which was not possible in progressive alignment
Very popular among the MSA methods
Increases the running time of the method
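The control flow of the iteration loop can be sketched generically. Here `realign` (rebuild the cluster tree and realign) and `score` (quality score) are hypothetical callables supplied by the caller; real programs differ in how they implement both. The toy run at the end uses plain numbers as stand-ins for alignments, only to show the stopping rules.

```python
def iterative_refinement(alignment, realign, score, max_iterations=10):
    """Repeat realignment until the score stops improving or the limit is reached."""
    best, best_score = alignment, score(alignment)
    for _ in range(max_iterations):
        candidate = realign(best)          # redo the tree and realign (hypothetical step)
        candidate_score = score(candidate)
        if candidate_score <= best_score:  # no improvement: stop
            break
        best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins: the "alignment" is a number that improves until it reaches 3.
print(iterative_refinement(0, realign=lambda a: min(a + 1, 3), score=lambda a: a))
```

Keeping the best alignment seen so far means a bad realignment can never make the result worse than the initial progressive alignment, at the price of extra running time.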
Iterative alignment
[Figure: diagram of a typical iterative MSA program workflow, with the iteration loop marked. Figure from Do & Katoh]
What MSA program(s) to use?
Depends on the application
  – Phylogenetic studies
  – Structure-based studies
Depends on the size of the data
  – Some programs cannot handle large datasets
Remember to evaluate the alignment by eye
What MSA program(s) to use? Collection of MSA programs at EBI
Summary of MSA
MSA is relevant for many analysis tasks
  – Improved signal from the alignment
Solving MSA requires heuristics
The selection of an MSA method depends on the application
Results should be evaluated by eye
  – And the errors should be corrected with MSA editors
Manual editing of MSAs?
Let's say that you performed an MSA with a computer. However, biologically, it has some faults and needs manual editing.
→ Editors: Jalview and Seaview
Input data can be in any of the most common MSA formats (Mase, Phylip, Clustal, MSF, Fasta, NEXUS, PIR and BLC)