Alignments and alignment reliability The first critical step in sequence analysis – the know how Eyal Privman and Osnat Penn Tel Aviv University COST Training.


1 Alignments and alignment reliability The first critical step in sequence analysis – the know how Eyal Privman and Osnat Penn Tel Aviv University COST Training School Rehovot, 2010

2 What are alignments good for? To compare sequences: find homology (similar sequence → similar function). To learn about sequence evolution: mismatch = point mutation, gap = indel (insertion or deletion). Reconstruct phylogenetic trees. Infer selection forces, e.g., detecting positive selection.

3 Sequence evolution [Figure: example sequences diverging along a tree, with time points 30 MYA, 5 MYA and Today, ending in the Human, Chimp and Mouse sequences and their gapped alignment.]

4 Alignment and phylogeny are mutually dependent [Figure labels: unaligned sequences, sequence alignment, MSA, phylogeny reconstruction, inaccurate tree building.]

5 Alignment and phylogeny are both challenging: 25% of residues are aligned incorrectly (based on BAliBASE, a large representative set of proteins).

6 Alignment and phylogeny are both challenging: 5% of tree branches are wrong (based on simulations of 100 protein sequences).

7 Making an alignment For 2 sequences: use exact methods. For more sequences: exact methods are not feasible (too slow), so we use heuristic methods.

8 Progressive alignment First step: compute pairwise distances Compute the pairwise alignments for all against all (10 pairwise alignments for the five sequences A-E). The similarities are converted to distances and stored in a table. [Table: lower-triangular pairwise distance matrix for sequences A-E.]
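A minimal sketch of this first step, under stated assumptions: plain edit distance (computed by an exact dynamic-programming pairwise alignment with unit costs) stands in for the similarity-to-distance conversion, which the slide does not specify. The function names and the toy sequences are hypothetical and only illustrate the all-against-all bookkeeping.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost edit distance via dynamic programming (an exact pairwise method)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # gap in b
                            curr[j - 1] + 1,            # gap in a
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]


def pairwise_distances(seqs):
    """Return {(name_x, name_y): distance} for every unordered pair of sequences."""
    return {(x, y): edit_distance(seqs[x], seqs[y])
            for x, y in combinations(sorted(seqs), 2)}


# Hypothetical toy input: five short sequences named A-E.
toy = {"A": "ATGAAATAA", "B": "ATGTTTTAA", "C": "ATGCCCAAATAA",
       "D": "ATGTTT", "E": "AATTTTGTA"}
table = pairwise_distances(toy)
```

Five sequences give 10 pairs, matching the count on the slide. Real MSA programs use scoring matrices and faster approximate distances rather than unit-cost edit distance.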

9 Second step: build a guide tree Cluster the sequences to create a tree (guide tree), which represents the order in which pairs of sequences are to be aligned: similar sequences are neighbors in the tree, and distant sequences are distant from each other in the tree. The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences! [Figure: pairwise distance table and the resulting guide tree over sequences A, D, C, B, E.]
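The clustering step can be sketched as follows. This is an illustration only: it uses UPGMA (average linkage) as a simple stand-in, since the slide does not say which clustering method builds the guide tree, and the distance values are made up for the example. The tree is returned as a nested, Newick-like string without branch lengths.

```python
def upgma(dist):
    """Cluster leaves by average linkage; dist maps (x, y) name pairs to distances."""
    d = {frozenset(pair): float(v) for pair, v in dist.items()}
    size = {leaf: 1 for pair in dist for leaf in pair}  # cluster label -> leaf count
    while len(size) > 1:
        # Pick the closest pair of clusters (ties broken by label for determinism).
        closest = min(d, key=lambda p: (d[p], sorted(p)))
        a, b = sorted(closest)
        merged = f"({a},{b})"
        # Distance from the merged cluster to each other cluster: size-weighted mean.
        for c in size:
            if c not in closest:
                d[frozenset((merged, c))] = (
                    d[frozenset((a, c))] * size[a] + d[frozenset((b, c))] * size[b]
                ) / (size[a] + size[b])
        # Drop every entry that still refers to the two merged clusters.
        d = {p: v for p, v in d.items() if a not in p and b not in p}
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))


# Hypothetical distances: A and B are close, C and D are close, E is an outlier.
toy_dist = {("A", "B"): 2, ("A", "C"): 8, ("A", "D"): 8, ("A", "E"): 16,
            ("B", "C"): 8, ("B", "D"): 8, ("B", "E"): 16,
            ("C", "D"): 4, ("C", "E"): 16, ("D", "E"): 16}
guide_tree = upgma(toy_dist)
```

Reading the nested string from the innermost brackets outward gives the order in which a progressive aligner would merge sequences and sub-alignments.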

10 Third step: align sequences in a bottom-up order 1. Align the most similar (neighboring) pairs. 2. Align pairs of pairs. 3. Align sequences clustered to pairs of pairs deeper in the tree. [Figure: guide tree over Sequence A, Sequence B, Sequence C, Sequence D and Sequence E.]

11 Multiple sequence alignment (MSA): progressive alignment [Figure labels: sequences A-E, pairwise distance table, guide tree, MSA, iterative.]

12 Multiple sequence alignment (MSA) Several advanced MSA programs are available. Today we will use two: MAFFT – fastest and one of the most accurate PRANK – distinct from all other MSA programs because of its correct treatment of insertions/deletions

13 MAFFT Web server & download: http://align.bmr.kyushu-u.ac.jp/mafft/online/server/ Efficiency-tuned variants: from quick & dirty to slow but accurate. Reference: Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma and Takashi Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 2002, Vol. 30, No. 14, 3059-3066. © 2002 Oxford University Press

14 Choosing a MAFFT strategy quick & dirty slow but accurate

15 Choosing a MAFFT strategy quick & dirty slow but accurate

16 Choosing a MAFFT strategy quick & dirty slow but accurate

17 Choosing a MAFFT strategy
Scale: quick & dirty to slow but accurate
L-INS-i
ooooooooooooooooooooooooooooooooXXXXXXXXXXX-XXXXXXXXXXXXXXX------------------
--------------------------------XX-XXXXXXXXXXXXXXX-XXXXXXXXooooooooooo-------
------------------ooooooooooooooXXXXX----XXXXXXXX---XXXXXXXooooooooooo-------
--------ooooooooooooooooooooooooXXXXX-XXXXXXXXXX----XXXXXXXoooooooooooooooooo
--------------------------------XXXXXXXXXXXXXXXX----XXXXXXX------------------
G-INS-i
XXXXXXXXXXX-XXXXXXXXXXXXXXX
XX-XXXXXXXXXXXXXXX-XXXXXXXX
XXXXX----XXXXXXXX---XXXXXXX
XXXXX-XXXXXXXXXX----XXXXXXX
XXXXXXXXXXXXXXXX----XXXXXXX
E-INS-i
oooooooooXXX------XXXX---------------------------------XXXXXXXXXXX-XXXXXXXXXXXXXXXooooooooooooo
---------XXXXXXXXXXXXXooo------------------------------XXXXXXXXXXXXXXXXXX-XXXXXXXX-------------
-----ooooXXXXXX---XXXXooooooooooo----------------------XXXXX----XXXXXXXXXXXXXXXXXXooooooooooooo
---------XXXXX----XXXXoooooooooooooooooooooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX-------------
---------XXXXX----XXXX---------------------------------XXXXX---XXXXXXXXXX--XXXXXXXooooo--------
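For readers who run MAFFT locally instead of through the web server, the strategies map onto command-line options. The sketch below assumes a `mafft` executable is installed and on the PATH; the helper names `mafft_command` and `run_mafft` are hypothetical, and the option sets follow the MAFFT manual as best recalled, so check them against your installed version.

```python
import subprocess

# Option sets per strategy, ordered roughly from quick & dirty to slow but accurate.
MAFFT_STRATEGIES = {
    "auto": ["--auto"],
    "FFT-NS-2": ["--retree", "2", "--maxiterate", "0"],
    "L-INS-i": ["--localpair", "--maxiterate", "1000"],
    "G-INS-i": ["--globalpair", "--maxiterate", "1000"],
    "E-INS-i": ["--genafpair", "--maxiterate", "1000"],
}


def mafft_command(strategy, fasta_path):
    """Build the argument list for one MAFFT run (does not execute anything)."""
    return ["mafft", *MAFFT_STRATEGIES[strategy], fasta_path]


def run_mafft(strategy, fasta_path, out_path):
    """Run MAFFT and write the aligned FASTA it prints on stdout to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(mafft_command(strategy, fasta_path), stdout=out, check=True)
```

For example, `run_mafft("L-INS-i", "trim5a.AA.fas", "trim5a.AA.mafft.aln")` would mirror the exercise file names used later in this deck.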

18 MAFFT output Saving the output Choose a format: Clustal, Fasta, or click "Reformat" to convert to a selection of other formats Save page as a text file e.g. save as "phylip" file and upload to PhyML for reconstructing the tree A colored view of the alignment
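If the alignment was saved as FASTA, the conversion to PHYLIP can also be done with a few lines of code. This is a sketch that writes a relaxed sequential PHYLIP layout (a header with the number of sequences and the alignment length, then one name and sequence per line). The function names are hypothetical, and whether a given tree program accepts long names in this layout is an assumption to verify.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the sequence name.
            name = (line[1:].split() or ["unnamed"])[0]
            records.append([name, ""])
        elif records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def to_phylip(records):
    """Format aligned records as relaxed sequential PHYLIP text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    width = max(len(name) for name, _ in records) + 2
    lines = [f"{len(records)} {lengths.pop()}"]
    lines += [f"{name.ljust(width)}{seq}" for name, seq in records]
    return "\n".join(lines) + "\n"


example = to_phylip(read_fasta(">a x\nAC-\nGT\n>bb\nACGGT\n"))
```

The length check fails loudly on unaligned input, which is the most common mistake when passing files between tools.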

19 PhyML: tree reconstruction The most widely used maximum likelihood (ML) program. Web server & download: http://www.atgc-montpellier.fr/phyml/

20 PRANK

21 Classical alignment errors for HIV env

22 PRANK Web server: http://www.ebi.ac.uk/goldman-srv/webPRANK/

23 PRANK output If you need a different format, copy the results to the READSEQ sequence converter: http://www-bimas.cit.nih.gov/molbio/readseq/

24
1. Download and save the sequences file from Osnat's homepage (you can google "Osnat Penn" and look for the workshop materials under "Teaching"). Save the file as "trim5a.AA.fas" (File → "Save page as"). This file contains 20 protein sequences in FASTA format.
2. Run the PRANK web server to create a protein alignment:
   a. In the "Default alignment" section browse for "trim5a.AA.fas".
   b. Run (press the "Start alignment" button).
3. While you wait: copy the sequences into the MAFFT web server and run the "automatic" "moderately accurate" strategy. Which strategy did MAFFT choose for you? Click on the "Fasta format" link, save as "trim5a.AA.mafft.aln" (File → "Save page as") and try the "Jalview" button.
4. When PRANK finishes, click on the "Show Fasta file" button and save the MSA under the name "trim5a.AA.prank.aln".

25 Sources of alignment errors Progressive alignment algorithms are greedy heuristics. Co-optimal solutions → Heads-or-Tails (HoT) scores (Landan & Graur 2007). Guide-tree errors → GUIDANCE scores (Penn, Privman et al. MBE 2010).

26 GUIDANCE: Guide-tree based alignment confidence scores [Figure: Base MSA → bootstrap sampling of NJ trees (Tree 1, Tree 2, …, Tree 99, Tree 100) → progressive alignment (MSA 1, MSA 2, …, MSA 99, MSA 100) → GUIDANCE scores on a 0 to 1 color scale, legend: Confident / Uncertain.] Penn, Privman et al. MBE. 2010
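The idea of scoring a base MSA against a set of alternative MSAs can be sketched in simplified form. This is not the GUIDANCE implementation: it assumes the alternative alignments are already available (in GUIDANCE they come from bootstrap guide trees), and it scores each base column by the fraction of its aligned residue pairs that reappear in the alternatives. All names are hypothetical.

```python
from itertools import combinations


def aligned_pairs_by_column(msa):
    """For each column, the set of residue pairs ((seq, index), (seq, index)) it aligns."""
    counters = [0] * len(msa)  # next residue index per sequence
    columns = []
    for col in range(len(msa[0])):
        residues = []
        for s, row in enumerate(msa):
            if row[col] != "-":
                residues.append((s, counters[s]))
                counters[s] += 1
        columns.append(set(combinations(residues, 2)))
    return columns


def column_scores(base_msa, alternative_msas):
    """Score per base column in [0, 1]; None where a column aligns fewer than two residues."""
    alt_pair_sets = [set().union(*aligned_pairs_by_column(m)) for m in alternative_msas]
    scores = []
    for pairs in aligned_pairs_by_column(base_msa):
        if not pairs or not alt_pair_sets:
            scores.append(None)
            continue
        hits = sum(p in alt for p in pairs for alt in alt_pair_sets)
        scores.append(hits / (len(pairs) * len(alt_pair_sets)))
    return scores


# Hypothetical toy input: rows must list the same sequences in the same order.
base = ["AC-T", "A-GT"]
alternatives = [["AC-T", "A-GT"], ["ACT-", "A-GT"]]
scores = column_scores(base, alternatives)
```

In the toy input the first column is reproduced by both alternatives, while the last column is reproduced by only one of them, so it receives a lower score.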

27 http://guidance.tau.ac.il

28 [Figure (a): GUIDANCE score per column for an alignment of HIV1 group M, SIV chimp, HIV1 group O, HIV1 group N, SIV cerco and SIV gorilla sequences, with the extracellular, transmembrane and cytoplasmic domains marked. Color legend for GUIDANCE scores: Confident / Uncertain.]

29 [Figure (b): GUIDANCE score per column for an alignment of HIV1 group M, SIV chimp and HIV1 group O sequences, with the extracellular, transmembrane and cytoplasmic domains marked.]

30
1. Run the GUIDANCE web server to calculate confidence scores for the MAFFT alignment:
   a. In the "Upload your sequence file" window browse for "trim5a.AA.fas".
   b. Choose "Amino Acids" in the "Sequences Type" option.
   c. To speed up the run, change the "Number of bootstrap repeats" in the "Advanced options" section to 30. Note that this is not recommended for real-life analyses.
   d. Run (press the "Submit" button).

31 Detecting selection forces → Positive selection

32 Empirical findings: variation among genes. "Important" proteins evolve slower than "unimportant" ones.

33 Histone 3 protein

34 Empirical findings: variation among sites. Functional sites evolve slower than nonfunctional sites.

35

36 Silent and non-silent mutations Silent: UUU -> UUC (both encode phenylalanine) Non-silent: UUU -> CUU (phenylalanine to leucine)
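The two examples on this slide can be checked with a small sketch that looks both codons up in the standard genetic code. The function name `classify_substitution` is hypothetical; the codon table is built from the usual compact TCAG-ordered amino-acid string, and RNA input (U) is converted to DNA (T) before lookup.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order ("*" marks stop codons).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_substitution(codon_before, codon_after):
    """Return 'identical', 'silent' or 'non-silent' for a codon change."""
    before = codon_before.upper().replace("U", "T")
    after = codon_after.upper().replace("U", "T")
    if before == after:
        return "identical"
    same_amino_acid = CODON_TABLE[before] == CODON_TABLE[after]
    return "silent" if same_amino_acid else "non-silent"


first = classify_substitution("UUU", "UUC")   # both encode phenylalanine (F)
second = classify_substitution("UUU", "CUU")  # phenylalanine (F) to leucine (L)
```

Counting such events per site across an alignment is the raw material for comparing silent and non-silent rates, although programs such as Selecton use codon substitution models rather than simple counts.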

37 For most proteins, the rate of silent substitutions is much higher than the non-silent rate. This is called purifying selection (= conservation).

38 There are rare cases where the non-silent rate is much higher than the silent rate. This is called positive selection.

39 Positive Selection Examples: Pathogen proteins evading the host immune system Proteins of the immune system detecting pathogen proteins Pathogen proteins that are drug targets Proteins that are products of gene duplication Proteins involved in the reproductive system

40 http://selecton.tau.ac.il

41 Selecton results

42 False positive predictions Selecton uses an MSA as input The MSA may contain unreliable regions Errors in Selecton computations Errors in the positive selection inference

43
1. Go to the GUIDANCE results of the last exercise.
2. Which columns are not well aligned? Are these sites also predicted to evolve under positive selection? See the Selecton results at: http://selecton.tau.ac.il/results/1268662868/colors.html

44 Summary Different alignment programs may produce different MSAs. Alignment uncertainty may cause errors in downstream analyses such as positive selection analysis. GUIDANCE can detect alignment errors.

45 Thanks for your attention!

