Multiple sequence alignment and their reliability The Bioinformatics Unit G.S. Wise Faculty of Life Science Tel Aviv University, Israel January 2013 By.

1 Multiple sequence alignment and their reliability. The Bioinformatics Unit, G.S. Wise Faculty of Life Science, Tel Aviv University, Israel. January 2013. By Haim Ashkenazy. http://guidance.tau.ac.il/workshop_2013/ (TAU Bioinformatics Workshop)

2 What are alignments good for?
To compare sequences
  o Find homology
  o Similar sequence → similar function
To learn about sequence evolution
  o Mismatch = point mutation
  o Gap = indel (insertion or deletion)
  o Reconstruct phylogenetic tree
  o Infer selection forces, e.g., detecting positive selection, co-evolving sites
For structure prediction
  o Similar regions potentially have similar structure

3 Making an alignment (pairwise)
    ADLGAVFALCDRYFQ
    ||||     |||| |
    ADLGRTQN-CDRYYQ
For 2 sequences: pairwise alignment
  o Local alignment: finds regions of high similarity in parts of the sequences
  o Global alignment: finds the best alignment across the entire two sequences
Use an exact solution
  o Needleman-Wunsch (for global) or Smith-Waterman (for local): http://www.ebi.ac.uk/Tools/psa/
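As an illustration of the exact global method named on this slide, here is a minimal Needleman-Wunsch sketch. The scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for readability; they are not taken from the slides, and real tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of two sequences with a linear gap penalty (toy scoring)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[n][m]


# The two sequences shown on the slide
aligned_a, aligned_b, total = needleman_wunsch("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ")
print(aligned_a)
print(aligned_b)
print(total)
```

The traceback returns only one of possibly several equally scoring alignments, which is worth keeping in mind for the later slides on co-optimal solutions.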

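4 Sequence evolution
[Figure: sequences evolving along a tree over time (30 MYA, 5 MYA, Today) to Human, Chimp and Mouse. Sequence labels as extracted: ATGAAATAA ATGTTTTAAATGCCCAAATAA ATGTTTTCAATGTTTTAA ATGCCCAAA]
Two alternative alignments of the present-day sequences:
    AATTTT---GTA        AATTTT---GTA
    ACTTTT---GTA        ACTTTT---GTA
    ---AAACCCGTA        AAA---CCCGTA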

5 Alignment and phylogeny are mutually dependent
[Figure labels: Unaligned sequences; Sequence alignment; MSA; Phylogeny reconstruction; Inaccurate tree building]

6 Alignment and phylogeny are both challenging: ~25% of residues are wrongly aligned (based on BAliBASE, a large representative set of proteins)

7 Alignment and phylogeny are both challenging: 5% of tree branches are wrong (based on simulations of 100 protein sequences)

8 Making an alignment (MSA)
For more sequences: multiple sequence alignment (MSA)
  o Exact methods are not feasible (too slow)
  o We use heuristic methods
  o Several advanced MSA programs are available
Basically two recommended methods:
  o MAFFT: fastest and one of the most accurate
  o PRANK: distinct from all other MSA programs because of its correct treatment of insertions/deletions

9 Progressive alignment. First step: compute pairwise distances
Compute the pairwise alignments for all against all (for five sequences A to E, 10 pairwise alignments). The similarities are converted to distances and stored in a table.
[Distance table as extracted; column labels E D C B A, row labels on the right]
                 A
     8           B
    17 15        C
    10 14 16     D
    32 31    32  E
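The all-against-all step can be sketched as follows. This is only an illustration: instead of a full pairwise alignment, it uses Python's `difflib.SequenceMatcher` ratio as a stand-in similarity measure and converts it to a distance as 1 - similarity. The five toy sequences and the names A to E are assumptions for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_distances(seqs):
    """seqs: dict of name -> sequence. Returns {(name1, name2): distance}."""
    table = {}
    for (n1, s1), (n2, s2) in combinations(seqs.items(), 2):
        # Stand-in similarity in [0, 1]; a real pipeline would align the pair
        similarity = SequenceMatcher(None, s1, s2, autojunk=False).ratio()
        table[(n1, n2)] = 1.0 - similarity
    return table


# Hypothetical toy sequences
toy = {
    "A": "ATGAAATAA",
    "B": "ATGAAATAG",
    "C": "ATGTTTTAA",
    "D": "ATGCCCAAATAA",
    "E": "GGGCCCTTT",
}
distance_table = pairwise_distances(toy)
for pair, d in sorted(distance_table.items()):
    print(pair, round(d, 2))
```

Five sequences give 5 × 4 / 2 = 10 pairs, matching the count on the slide.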

10 Second step: build a guide tree
Cluster the sequences to create a tree (guide tree):
  o represents the order in which pairs of sequences are to be aligned
  o similar sequences are neighbors in the tree
  o distant sequences are distant from each other in the tree
The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
[Figure: guide tree with leaves A, D, C, B, E, shown with the pairwise distance table]
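One simple way to cluster a distance table into a guide tree is UPGMA (average linkage). The slide does not say which clustering method is used, so UPGMA here is an assumption, and the four-sequence distance matrix is hypothetical. The tree is returned as nested tuples.

```python
def upgma(names, matrix):
    """Build a guide tree by UPGMA.

    names:  list of sequence labels
    matrix: symmetric matrix, matrix[i][j] = distance between names[i] and names[j]
    Returns the tree as nested 2-tuples of labels.
    """
    n = len(names)
    trees = {i: names[i] for i in range(n)}   # cluster id -> subtree
    sizes = {i: 1 for i in range(n)}          # cluster id -> number of leaves
    dist = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(trees) > 1:
        i, j = min(dist, key=dist.get)        # closest pair of clusters
        new = next_id
        next_id += 1
        for k in trees:
            if k in (i, j):
                continue
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Size-weighted average of the distances to the two merged clusters
            dist[(k, new)] = (sizes[i] * d_ik + sizes[j] * d_jk) / (sizes[i] + sizes[j])
        # Drop every distance that involves a merged cluster
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        trees[new] = (trees.pop(i), trees.pop(j))
        sizes[new] = sizes.pop(i) + sizes.pop(j)
    return next(iter(trees.values()))


# Hypothetical distances between four sequences
labels = ["A", "B", "C", "D"]
distances = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
guide_tree = upgma(labels, distances)
print(guide_tree)
```

With these distances, A and B are joined first, then C, then D, which illustrates that the guide tree only encodes a joining order derived from crude distances.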

11 Third step: align sequences in a bottom-up order
1. Align the most similar (neighboring) pairs
2. Align pairs of pairs
3. Align sequences clustered to pairs of pairs deeper in the tree
[Figure: guide tree with leaves A, D, C, B, E above Sequence A to Sequence E]
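The bottom-up order can be read off a guide tree with a post-order traversal: every internal node becomes one alignment step, and it is visited only after both of its children. A minimal sketch, assuming the tree is given as nested tuples; the tree shape below is hypothetical, since the slide only shows the leaf order.

```python
def alignment_order(tree):
    """List the merge steps (left group, right group) in bottom-up order."""
    steps = []

    def visit(node):
        if isinstance(node, str):          # a leaf: a single sequence
            return [node]
        left, right = node
        left_leaves = visit(left)
        right_leaves = visit(right)
        # Both child groups are already aligned, so they can be merged now
        steps.append((left_leaves, right_leaves))
        return left_leaves + right_leaves

    visit(tree)
    return steps


# Hypothetical guide tree over the five sequences
example_tree = (("A", "D"), (("C", "B"), "E"))
merge_steps = alignment_order(example_tree)
for left_group, right_group in merge_steps:
    print("align", left_group, "with", right_group)
```

Post-order is one valid bottom-up order; an implementation may instead process the closest pairs first, as long as children precede their parent.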

12 Multiple sequence alignment (MSA): progressive alignment
[Figure labels: sequences A to E; Pairwise distance table; Guide tree (leaves A, D, C, B, E); MSA; Iterative]

13 Sources of alignment errors
Progressive alignment algorithms are greedy heuristics → Co-optimal solutions → Heads-or-Tails (HoT) scores (Landan & Graur 2007)
Sequences aligned as given:
    GEELTNWPSPVCHNRLASGIDDSTAFRFPRPQKWIISYSLHCVI...
    GEELTLWPSPVCHNRLASGIDASIAFRFPRAQKRFYRYSLHCVI...
    TEELTHWPFPVCRNRLARGIGSAIAFRCPRSQEHI-RNSLPCVI...
    TEELRYWPFPVCQN--ARGNGSVIEARNPGSQ-----KVLPYVI...
The same sequences reversed and aligned:
    ...IVCHLSYSIIWKQPRPFRFATSDDIGSALRNHCVPSPWNTLEEG
    ...IVCHLSYRYFRKQARPFRFAISADIGSALRNHCVPSPWLTLEEG
    ...IVCPLSNRI-HEQSRPCRFAIASGIGRALRNRCVPFPWHTLEET
    ...IVYPLVK-----QSGPNRAEIVSGNGRA--NQCVPFPWYRLEET
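The Heads-or-Tails idea can be sketched in a few lines: align the sequences as given, then align the reversed sequences and flip the result back, and compare the two alignments. The `align` argument stands for any alignment program; `left_justify` below is a deliberately trivial, hypothetical stand-in aligner used only to make the example runnable.

```python
def heads_and_tails(align, seqs):
    """Return the 'heads' alignment and the flipped-back 'tails' alignment.

    align: any callable taking a list of sequences and returning aligned rows.
    """
    heads = align(seqs)
    tails_reversed = align([s[::-1] for s in seqs])
    tails = [row[::-1] for row in tails_reversed]
    return heads, tails


def left_justify(seqs):
    """Toy stand-in aligner: pad shorter sequences with trailing gaps."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]


heads, tails = heads_and_tails(left_justify, ["ACGT", "ACT"])
print(heads)
print(tails)
```

Even this toy aligner places the gap differently in the two directions; positions where the heads and tails alignments disagree are the ones to treat with caution.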

14 GUIDANCE: Guide-tree based alignment confidence scores (Penn, Privman et al. MBE. 2010)
[Figure: Base alignment → Bootstrap sampling of NJ trees (Tree 1, Tree 2, …, Tree 99, Tree 100) → Progressive alignment (MSA 1, MSA 2, …, MSA 99, MSA 100) → GUIDANCE scores]
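The bootstrap step in this pipeline resamples the columns of the base alignment with replacement; each resampled alignment would then be used to build one NJ tree that serves as an alternative guide tree. The sketch below covers only the column resampling (tree building and re-alignment are omitted). The toy base alignment and the fixed random seed are assumptions for the example.

```python
import random


def bootstrap_columns(msa, rng):
    """Resample the columns of an MSA with replacement (same shape as the input)."""
    length = len(msa[0])
    picked = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[c] for c in picked) for row in msa]


# Toy base alignment; 100 replicates as in the figure
base_msa = ["AATTTT---GTA", "ACTTTT---GTA", "---AAACCCGTA"]
rng = random.Random(0)   # fixed seed so the example is repeatable
replicates = [bootstrap_columns(base_msa, rng) for _ in range(100)]
print(replicates[0])
```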

15 Comparing alignments
Common measures to quantify the distance between two MSAs:
1. Column score (CS): each column of the MSA that is identically aligned in the other MSA is given a score of 1; all other columns are given a score of 0.
2. Sum-of-pairs score (SP): each pair of residues in the MSA that is identically aligned in the other MSA is given a score of 1; all other residue pairs are given a score of 0.
3. Sum-of-pairs column score (SPC): the score of each column is the average of the SP scores over all residue pairs in it.
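The CS and SP measures above can be sketched as follows. Each residue is identified by its row and its position in the ungapped sequence, so the two MSAs can be compared even when their columns are shifted. Reporting the mean of the 0/1 scores as a single number is a summary choice made for this example, and the two toy alignments are the alternative alignments of the same three sequences shown earlier in the deck.

```python
from itertools import combinations


def columns(msa):
    """Describe each column by the residue index it holds in every row (None = gap)."""
    counters = [0] * len(msa)
    cols = []
    for c in range(len(msa[0])):
        col = []
        for r, row in enumerate(msa):
            if row[c] == "-":
                col.append(None)
            else:
                col.append(counters[r])
                counters[r] += 1
        cols.append(tuple(col))
    return cols


def residue_pairs(msa):
    """All pairs of residues that share a column, as ((row, idx), (row, idx))."""
    pairs = set()
    for col in columns(msa):
        present = [(r, i) for r, i in enumerate(col) if i is not None]
        pairs.update(combinations(present, 2))
    return pairs


def column_score(msa, other):
    """Fraction of columns of msa that appear identically in other."""
    other_cols = set(columns(other))
    cols = columns(msa)
    return sum(col in other_cols for col in cols) / len(cols)


def sum_of_pairs_score(msa, other):
    """Fraction of aligned residue pairs of msa that are also aligned in other."""
    pairs = residue_pairs(msa)
    if not pairs:
        return 0.0
    return len(pairs & residue_pairs(other)) / len(pairs)


msa_1 = ["AATTTT---GTA", "ACTTTT---GTA", "---AAACCCGTA"]
msa_2 = ["AATTTT---GTA", "ACTTTT---GTA", "AAA---CCCGTA"]
cs = column_score(msa_1, msa_2)
sp = sum_of_pairs_score(msa_1, msa_2)
print(cs, sp)
```

For these two alignments, 6 of the 12 columns and 15 of the 21 residue pairs agree, so CS is stricter than SP: a column counts only if every row agrees.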

16 Accuracy of GUIDANCE scores

17 http://guidance.tau.ac.il As a rule of thumb, use HoT for fewer than 8 sequences

18 http://guidance.tau.ac.il Un-aligned sequences (FASTA format); choose sequence type; choose alignment method

19 GUIDANCE results: MSA colored by confidence score

20 GUIDANCE results: sequence score; column score (color legend: Confident to Uncertain)

21 GUIDANCE outputs: download MSA for downstream analysis; text files with all scores; mask residues by score; remove unreliable sequences

22 GUIDANCE results: sequence score; column score (color legend: Confident to Uncertain)

23 GUIDANCE outputs: remove unreliable sequences; re-align sequences after filtration; sequences left after filtration

24 Filtering sequences with low scores and re-aligning. But always remember not to remove too much data, and consider the biology…

25 GUIDANCE outputs: remove unreliable columns; MSA after filtration

26 Filtering columns with low scores
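Column filtering amounts to keeping only the columns whose confidence score reaches a chosen cutoff. A minimal sketch, assuming the per-column scores are already available as a list; the toy alignment, the scores and the cutoff value are all hypothetical.

```python
def filter_columns(msa, column_scores, cutoff):
    """Keep only the columns whose score is at least the cutoff."""
    keep = [c for c, score in enumerate(column_scores) if score >= cutoff]
    return ["".join(row[c] for c in keep) for row in msa]


# Hypothetical alignment and per-column scores
toy_msa = ["ACG-T", "AC-GT", "ACGGT"]
toy_column_scores = [1.0, 1.0, 0.4, 0.5, 1.0]
filtered_msa = filter_columns(toy_msa, toy_column_scores, cutoff=0.9)
print(filtered_msa)
```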

27 GUIDANCE outputs: masking unreliably aligned residues

28 Filtering residues with low scores
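Masking works at a finer grain than column removal: individual residues with a low score are replaced by a placeholder character while the rest of the column is kept. A minimal sketch, assuming one score per alignment cell; the mask character "X", the scores and the cutoff are hypothetical.

```python
def mask_residues(msa, residue_scores, cutoff, mask_char="X"):
    """Replace residues scoring below the cutoff with mask_char; gaps are left as-is."""
    masked = []
    for row, scores in zip(msa, residue_scores):
        masked.append("".join(
            ch if ch == "-" or score >= cutoff else mask_char
            for ch, score in zip(row, scores)
        ))
    return masked


# Hypothetical alignment and per-residue scores (None where the cell is a gap)
small_msa = ["ACG-T", "AC-GT"]
small_scores = [
    [1.0, 0.9, 0.3, None, 1.0],
    [1.0, 0.2, None, 0.95, 1.0],
]
masked_msa = mask_residues(small_msa, small_scores, cutoff=0.5)
print(masked_msa)
```

Masking keeps the alignment length unchanged, which can be convenient when downstream tools expect the original column coordinates.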

29 Filtering unreliable regions can improve downstream analysis (Mol Biol Evol 2012;29:1-5)

30 Acknowledgments
Prof. Tal Pupko, Dr. Eyal Privman, Dr. Osnat Penn, Pupko’s lab members
1. Penn, O., Privman, E., Ashkenazy, H., Landan, G., Graur, D. and Pupko, T. (2010). GUIDANCE: a web server for assessing alignment confidence scores. Nucleic Acids Research 38 (Web Server issue): W23-W28. doi:10.1093/nar/gkq443
2. Penn, O., Privman, E., Landan, G., Graur, D. and Pupko, T. (2010). An alignment confidence score capturing robustness to guide-tree uncertainty. Molecular Biology and Evolution 27(8): 1759-67. doi:10.1093/molbev/msq066
3. Landan, G. and Graur, D. (2008). Local reliability measures from sets of co-optimal multiple sequence alignments. Pac Symp Biocomput 13: 15-24.

31 Thanks for your attention!

32 Try it!
1. Download and save the sequences file "Seq_For_GUIDANCE.fs" from http://guidance.tau.ac.il/workshop_2013/ (File → "Save as"). This file contains 20 protein sequences in FASTA format.
2. Run the GUIDANCE web server to create a protein alignment:
   a. Use the GUIDANCE algorithm
   b. Select "amino acids" as the sequence type
   c. Select MAFFT as the alignment method
   d. Run (press the "Submit" button)
   e. In case it does not run for you, you can see the results at: http://guidance.tau.ac.il/results/13589321556364/output.php
3. What is the alignment score? What does it mean about the alignment achieved?
4. Which sequences can be removed to improve the alignment? What is the biological justification for that?

33 Appendix – MSA servers

34 MAFFT web server & download: http://mafft.cbrc.jp/alignment/server/

35 Choosing a MAFFT strategy: efficiency-tuned variants → quick & dirty or slow but accurate

36 Choosing a MAFFT strategy (scale from "quick & dirty" to "slow but accurate")

37 Choosing a MAFFT strategy (scale from "quick & dirty" to "slow but accurate")

38 Choosing a MAFFT strategy (scale from "quick & dirty" to "slow but accurate")
In the diagrams, X marks alignable residues, o marks unalignable residues and - marks gaps.
L-INS-i
    ooooooooooooooooooooooooooooooooXXXXXXXXXXX-XXXXXXXXXXXXXXX------------------
    --------------------------------XX-XXXXXXXXXXXXXXX-XXXXXXXXooooooooooo-------
    ------------------ooooooooooooooXXXXX----XXXXXXXX---XXXXXXXooooooooooo-------
    --------ooooooooooooooooooooooooXXXXX-XXXXXXXXXX----XXXXXXXoooooooooooooooooo
    --------------------------------XXXXXXXXXXXXXXXX----XXXXXXX------------------
G-INS-i
    XXXXXXXXXXX-XXXXXXXXXXXXXXX
    XX-XXXXXXXXXXXXXXX-XXXXXXXX
    XXXXX----XXXXXXXX---XXXXXXX
    XXXXX-XXXXXXXXXX----XXXXXXX
    XXXXXXXXXXXXXXXX----XXXXXXX
E-INS-i
    oooooooooXXX------XXXX---------------------------------XXXXXXXXXXX-XXXXXXXXXXXXXXXooooooooooooo
    ---------XXXXXXXXXXXXXooo------------------------------XXXXXXXXXXXXXXXXXX-XXXXXXXX-------------
    -----ooooXXXXXX---XXXXooooooooooo----------------------XXXXX----XXXXXXXXXXXXXXXXXXooooooooooooo
    ---------XXXXX----XXXXoooooooooooooooooooooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX-------------
    ---------XXXXX----XXXX---------------------------------XXXXX---XXXXXXXXXX--XXXXXXXooooo--------

39 MAFFT output
  o A colored view of the alignment
  o Choose a format (Clustal, FASTA) and save as a text file
  o Run GUIDANCE also from here!

40 PRANK

41 Classical alignment errors for HIV env

42 PRANK web server: http://www.ebi.ac.uk/goldman-srv/webPRANK/

43 PRANK output: if you need a different format, copy the results to the READSEQ sequence converter: http://www-bimas.cit.nih.gov/molbio/readseq/

