Multiple sequence alignment (MSA) Usean sekvenssin rinnastus Petri Törönen Help contributed by: Liisa Holm & Ari Löytynoja.

1 Multiple sequence alignment (MSA) Usean sekvenssin rinnastus Petri Törönen Help contributed by: Liisa Holm & Ari Löytynoja

2 What is MSA? MSA is an alignment generated from three or more sequences. MSA is usually a more global alignment, i.e., the aim is to align homologous residues (nucleotides or amino acids) in columns across the length of the whole sequences.
GA--GTACA
CAC-GTATA
CACGGTAT-
G-CGGTCTA

3 What is MSA? Picture shows protein multiple sequence alignment http://en.wikipedia.org/wiki/Multiple_sequence_alignment

4 Why MSA “MSA emphasises signal observed in the pairwise alignment” (Liisa Holm) Improved alignments! Alignment of more distant sequences with help from intermediate sequences. Highlights the conserved regions in sequences http://ekhidna.biocenter.helsinki.fi/users/petri/public/opetus_jutut/Bioinf_Per_Lects/urease_output.txt

5 Why MSA MSA is input to many analysis tasks: Detection of active sites, Generation of sequence profiles, Detection of protein domains and motifs, Phylogenetics, …

6 Remember First step of MSA: Good selection of sequences for the analysis. Sequences need to be functionally/evolutionarily related. Sometimes it is good to have some variation in the sequences (depends on the analysis task). Otherwise: Rubbish in → Rubbish out

7 MSA methods Finding the optimal multiple sequence alignment is a computationally hard task. The “correct” answer could in principle always be obtained by extending the dynamic programming algorithm to multiple sequences. In practice dynamic programming cannot be applied to MSA problems of realistic size. We need approximate solutions (heuristics) http://en.wikipedia.org/wiki/Multiple_sequence_alignment#Dynamic_programming_and_computational_complexity
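To make the cost concrete, here is a rough sketch of how the dynamic programming lattice grows with the number of sequences. It assumes the textbook formulation in which aligning N sequences of length L fills (L+1)^N cells and looks at 2^N - 1 predecessor cells per cell; the function names are made up for this illustration.

```python
def dp_cells(seq_lengths):
    """Cells in the DP lattice when all sequences are aligned at once."""
    cells = 1
    for length in seq_lengths:
        cells *= length + 1  # +1 for the empty prefix of each sequence
    return cells


def dp_predecessors(n_sequences):
    """Predecessor cells examined per lattice cell (every gap/residue combination)."""
    return 2 ** n_sequences - 1


if __name__ == "__main__":
    # Sequences of length 100 are an arbitrary choice for the illustration.
    for n in (2, 3, 5, 10):
        print(n, dp_cells([100] * n), dp_predecessors(n))
```

Already at ten sequences of length 100 the lattice has more than 10^20 cells, which is why the heuristics on the following slides are needed.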

8 MSA methods: heuristics Progressive Alignment (not much used) Iterative Alignment (most popular) Hidden Markov Models Pattern Based methods

9 Progressive alignment Divide the unsolvable task into subtasks that can be solved. First align the most similar pairs of sequence sets –Sequence sets can have one or many sequences –At first the sets include only single sequences. Move progressively to bigger sets and to more difficult pairs of sets. Always align only one pair of sets at a time

10 Progressive alignment Produce pairwise alignments between all the sequences you want to align with MSA. –Dynamic programming, k-tuple (ktup) methods, ... Produce a “guide tree” on the basis of the pairwise distances calculated from the pairwise alignments. –UPGMA, neighbor joining Produce an MSA using the “guide tree”. –Sequences are aligned in the order that the guide tree instructs.

11 Set of sequences → All-against-all pairwise alignment (here demonstrated for the first sequence) → Get pairwise similarities from the alignments → Create a cluster tree from the similarities → Join sequences in the order obtained from the cluster tree

12 Guide tree construction: UPGMA Unweighted Pair Group Method with Arithmetic mean One of the fastest tree construction methods

13 An example: Pairwise alignments

14 Pairwise distances, based on pairwise alignments. Number of nucleotide differences: absolute distances, used in Pileup/Clustal. JC-distance
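Counting nucleotide differences can be sketched as below. The sequences are the ones listed on the later "Constructing MSA" slide (with "macaque" spelled out); they appear to be the data behind the distances in this example, but that link is an assumption. The function names are hypothetical, and the sketch only handles ungapped sequences of equal length.

```python
from itertools import combinations

# Sequences copied from the "Constructing MSA" slide (equal length, no gaps).
SEQS = {
    "human":     "ACGTACGTCC",
    "chimp":     "ACCTACGTCC",
    "gorilla":   "ACCACCGTCC",
    "orangutan": "ACCCCCCTCC",
    "macaque":   "CCCCCCCCCC",
}


def n_differences(a, b):
    """Count mismatching positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def p_distance(a, b):
    """Observed distance D: the proportion of differing sites."""
    return n_differences(a, b) / len(a)


if __name__ == "__main__":
    for (name_a, a), (name_b, b) in combinations(SEQS.items(), 2):
        print(f"{name_a:>9} vs {name_b:<9} "
              f"diffs={n_differences(a, b)} D={p_distance(a, b):.1f}")
```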

15 UPGMA based on JC-distances* 0,107 / 2 *JC-distances = Jukes-Cantor distances. The observed distances, D, are corrected for multiple substitutions via the correction function d = -(3/4) * ln(1 - (4/3) * D)
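The correction function can be written directly in code. This is a minimal sketch; the function name and the range check are additions for the illustration. An observed distance of one difference in ten sites (D = 0.1) gives roughly the 0,107 shown on the slide.

```python
import math


def jukes_cantor(d_observed):
    """Jukes-Cantor corrected distance: -(3/4) * ln(1 - (4/3) * D)."""
    if not 0 <= d_observed < 0.75:
        # The logarithm is undefined once D reaches 3/4 (saturation).
        raise ValueError("correction is defined only for 0 <= D < 0.75")
    return -0.75 * math.log(1 - (4 / 3) * d_observed)


if __name__ == "__main__":
    for d in (0.1, 0.2, 0.3):
        print(d, round(jukes_cantor(d), 4))
```

The corrected distance is always at least as large as the observed one, and the gap widens as D grows, because more substitutions are hidden by repeated changes at the same site.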

16 UPGMA, distance updates d((human, chimp), gorilla) = [d(human, gorilla) + d(chimp, gorilla)] / 2 = [0,383 + 0,232] / 2 = 0,3075
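The update rule can be sketched as a small function. The cluster-size weights are not shown on the slide; they are the standard UPGMA generalisation for merging clusters that already contain several sequences, and they reduce to the plain average above when both clusters hold one sequence. The function and parameter names are hypothetical.

```python
def upgma_update(d_ik, d_jk, size_i=1, size_j=1):
    """Distance from the merged cluster (i, j) to another cluster k.

    UPGMA averages over all member sequences, so each old distance is
    weighted by the size of the cluster it came from.
    """
    return (size_i * d_ik + size_j * d_jk) / (size_i + size_j)


if __name__ == "__main__":
    # Slide values: d(human, gorilla) = 0.383, d(chimp, gorilla) = 0.232
    print(round(upgma_update(0.383, 0.232), 4))
```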

17 UPGMA

18

19 UPGMA, node U: d((human & chimp), U) = 0,3923/2 = 0,1962 and d((gorilla & orangutan), U) = 0,3923/2 = 0,1962. Branch lengths below U: 0,1962 - 0,0537 = 0,1426 and 0,1962 - 0,116 = 0,080

20 UPGMA 0,7083 / 2 = 0,3541. Branch length above U: 0,3541 - 0,1426 - 0,0537 or 0,3541 - 0,080 - 0,116
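The whole UPGMA walk-through can be reproduced with a short sketch. It assumes the five sequences from the "Constructing MSA" slide are the input and uses JC-distances. The helper names are made up, and ties are broken by taking the first closest pair, which is an arbitrary choice that happens to give the same join order as the slides. The node heights differ from the slide values in the fourth decimal because the slides work with rounded distances.

```python
import math
from itertools import combinations

SEQS = {
    "human":     "ACGTACGTCC",
    "chimp":     "ACCTACGTCC",
    "gorilla":   "ACCACCGTCC",
    "orangutan": "ACCCCCCTCC",
    "macaque":   "CCCCCCCCCC",
}


def jc_distance(a, b):
    """Jukes-Cantor distance between two ungapped sequences of equal length."""
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    return -0.75 * math.log(1 - (4 / 3) * p)


def upgma(seqs):
    """Return the merges as (cluster_a, cluster_b, node_height) in join order."""
    clusters = {(name,): 1 for name in seqs}  # cluster (tuple of names) -> size
    dist = {
        frozenset([(a,), (b,)]): jc_distance(seqs[a], seqs[b])
        for a, b in combinations(seqs, 2)
    }
    merges = []
    while len(clusters) > 1:
        # Closest pair of clusters; on ties the pair listed first wins.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: dist[frozenset(pair)])
        merges.append((a, b, dist[frozenset((a, b))] / 2))  # node height = d / 2
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = a + b
        for c in clusters:
            # Size-weighted average of the distances to the two merged clusters.
            dist[frozenset((merged, c))] = (
                size_a * dist[frozenset((a, c))] + size_b * dist[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return merges


if __name__ == "__main__":
    for a, b, height in upgma(SEQS):
        print(a, "+", b, "at height", round(height, 4))
```

The join order (human with chimp, gorilla with orangutan, the two pairs together, macaque last) is the order in which a progressive aligner would combine the sequences.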

21 Constructing MSA
human     ACGTACGTCC
chimp     ACCTACGTCC
gorilla   ACCACCGTCC
orangutan ACCCCCCTCC
macaque   CCCCCCCCCC

22 Alignment score
Scoring: match=1, mismatch=0
Column: 1234
        ACGT
        ACGA
        AGGA
1: A-A + A-A + A-A = 1+1+1 = 3
2: C-C + C-G + C-G = 1+0+0 = 1
3: G-G + G-G + G-G = 1+1+1 = 3
4: T-A + T-A + A-A = 0+0+1 = 1
S(alignment) = S(1) + S(2) + S(3) + S(4) = 3+1+3+1 = 8
The higher the score, the better the alignment
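The column-by-column scoring above is a sum-of-pairs score, and it can be sketched in a few lines. The function names are made up for the illustration, and gap characters are simply treated like any other symbol, which a real scoring scheme would not do.

```python
from itertools import combinations


def column_score(column, match=1, mismatch=0):
    """Score one alignment column by comparing every pair of residues in it."""
    return sum(match if x == y else mismatch
               for x, y in combinations(column, 2))


def sum_of_pairs(alignment, match=1, mismatch=0):
    """Total score of an alignment given as a list of equal-length rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    return sum(column_score(col, match, mismatch) for col in zip(*alignment))


if __name__ == "__main__":
    alignment = ["ACGT", "ACGA", "AGGA"]
    print([column_score(col) for col in zip(*alignment)])  # [3, 1, 3, 1]
    print(sum_of_pairs(alignment))                         # 8
```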

23 Progressive alignment - pros and cons Pros: –Fast Cons: –Once gaps are opened they can never be closed –Errors in the alignment of the first few sequences can have catastrophic effects on the whole alignment –Not much used (to my knowledge)

24 Iterative alignment Create a progressive alignment. After obtaining the alignment, calculate a quality score. REPEAT THE FOLLOWING STEPS: –Redo the cluster tree –Realign the sequences using the new cluster tree –Calculate a quality score. The loop above can be stopped when a maximum number of iterations is reached or when the quality score no longer improves
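The control flow of the loop can be sketched as follows. The `realign` and `score` callables are hypothetical placeholders: in a real program `realign` would rebuild the cluster tree and realign the sequences, and `score` would compute the quality score. Only the stopping logic is illustrated here.

```python
def iterative_refinement(alignment, realign, score, max_iterations=10):
    """Repeat realignment until the score stops improving or the limit is hit."""
    best, best_score = alignment, score(alignment)
    for _ in range(max_iterations):
        candidate = realign(best)          # new tree + realignment (placeholder)
        candidate_score = score(candidate)
        if candidate_score <= best_score:  # no improvement: stop early
            break
        best, best_score = candidate, candidate_score
    return best, best_score


if __name__ == "__main__":
    # Toy stand-ins: the "alignment" is just a number that improves up to 3.
    print(iterative_refinement(0, realign=lambda x: min(x + 1, 3), score=lambda x: x))
```

Keeping the best alignment seen so far means the loop never returns a result worse than the initial progressive alignment.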

25 Iterative alignment Allows errors to be corrected, which was not possible in progressive alignment. Very popular among the MSA methods. Increases the running time of the method

26 Iterative alignment Diagram of a typical iterative MSA program workflow, showing the iteration loop. Figure from Do & Katoh 2008 http://ai.stanford.edu/~chuongdo/papers/alignment_review.pdf

27 What MSA program(s) to use? Depends on the application –Phylogenetic studies –Structure based studies Depends on the size of the data –Some programs cannot handle large datasets Remember to evaluate the alignment by eye

28 What MSA program(s) to use? Collection of MSA programs at EBI http://www.ebi.ac.uk/Tools/msa/

29 Summary of MSA MSA is relevant for many analysis tasks –Improved signal from the alignment Solving MSA requires heuristics Selection of MSA methods depends on the application Results should be evaluated by eye –And the errors should be corrected with MSA editors

30 Manual editing of MSAs? Let’s say that you performed an MSA with a computer. However, biologically, it has some faults and needs manual editing -> Editors: Jalview and Seaview http://www.csc.fi/english/research/sciences/bioscience/programs/index_html Input data can be in any of the most common MSA formats (Mase, Phylip, Clustal, MSF, Fasta, NEXUS, PIR and BCL)

