Alignments and alignment reliability The first critical step in sequence analysis – the know how Eyal Privman and Osnat Penn Tel Aviv University COST Training School Rehovot, 2010
What are alignments good for? To compare sequences: find homology (similar sequence, similar function). To learn about sequence evolution: mismatch = point mutation; gap = indel (insertion or deletion). Reconstruct phylogenetic tree. Infer selection forces, e.g., detecting positive selection.
Sequence evolution [Figure: an ancestral sequence diverging over time (30 MYA, 5 MYA, today) into the Human, Chimp, and Mouse sequences; the present-day sequences are shown aligned as:]
AATTTT---GTA
---TTT---GTA
AATAAACCCGTA
Alignment and phylogeny are mutually dependent [Figure: cycle linking unaligned sequences, sequence alignment, the MSA, phylogeny reconstruction, and inaccurate tree building]
Alignment and phylogeny are both challenging 25% of residues are aligned wrong Based on BAliBASE: a large representative set of proteins
Alignment and phylogeny are both challenging 5% of tree branches are wrong Based on simulations of 100 protein sequences
Making an alignment For 2 sequences: use exact methods. For more sequences: exact methods are not feasible (too slow), so we use heuristic methods.
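As an illustration of an exact method for two sequences, here is a minimal sketch of global alignment by dynamic programming (Needleman-Wunsch). The slides do not name a specific exact method, and the scoring values (match, mismatch, gap) are arbitrary assumptions chosen for readability, not the parameters of any particular program.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (aligned_a, aligned_b, score).

    The scoring scheme is an illustrative assumption (linear gap cost).
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, total = needleman_wunsch("ACGT", "AGT")
print(aligned_a)
print(aligned_b)
print(total)
```

The table has (n+1) x (m+1) cells for two sequences; extending the same idea to k sequences needs a k-dimensional table, which is why exact methods become too slow beyond a few sequences.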
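Progressive alignment First step: compute pairwise distances. Compute the pairwise alignments for all against all (for the 5 sequences A to E: 10 pairwise alignments). The similarities are converted to distances and stored in a table. [Table: triangular pairwise distance table over A, B, C, D, E; the only entries that survive in this text are 8, 17, and 15]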
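The first step can be sketched as follows. This is only an illustration: the similarity measure here is Python's `difflib.SequenceMatcher.ratio()`, used as a stand-in for a real pairwise alignment score, and the five toy sequences are invented for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_distances(seqs):
    """Return {(name_i, name_j): distance} for every unordered pair.

    Assumption: distance = 1 - similarity, where similarity is difflib's
    ratio (a stand-in for a proper pairwise alignment score).
    """
    table = {}
    for (n1, s1), (n2, s2) in combinations(sorted(seqs.items()), 2):
        similarity = SequenceMatcher(None, s1, s2, autojunk=False).ratio()
        table[(n1, n2)] = 1.0 - similarity
    return table


# Hypothetical toy sequences, not taken from the workshop data.
toy = {
    "A": "ATGAAATAA",
    "B": "ATGAAGTAA",
    "C": "ATGCCCAAATAA",
    "D": "ATGTTTTAA",
    "E": "ATGTTTGGGTAA",
}
table = pairwise_distances(toy)
print(len(table), "pairwise comparisons")
for pair, d in sorted(table.items()):
    print(pair, round(d, 3))
```

For n sequences the table holds n(n-1)/2 entries, which gives the 10 comparisons for 5 sequences.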
Second step: build a guide tree. Cluster the sequences to create a tree (guide tree): it represents the order in which pairs of sequences are to be aligned; similar sequences are neighbors in the tree; distant sequences are distant from each other in the tree. The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences! [Figure: guide tree with leaves A, D, C, B, E, drawn next to the pairwise distance table]
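A guide tree can be built from the distance table by simple hierarchical clustering. The sketch below uses UPGMA (average linkage), which is one common choice; the slides do not say which clustering method a given program uses, and the distance values are hypothetical.

```python
def upgma_guide_tree(names, dist):
    """Cluster leaves by average linkage; return the tree as nested tuples.

    dist maps frozenset({a, b}) -> distance between leaves a and b.
    """
    # Each cluster is keyed by its (sub)tree and stores the leaves it contains.
    clusters = {name: [name] for name in names}

    def average_distance(c1, c2):
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        keys = list(clusters)
        # Pick the closest pair of clusters (ties broken by position).
        _, i, j = min(
            (average_distance(keys[i], keys[j]), i, j)
            for i in range(len(keys))
            for j in range(i + 1, len(keys))
        )
        left, right = keys[i], keys[j]
        clusters[(left, right)] = clusters.pop(left) + clusters.pop(right)
    return next(iter(clusters))


# Hypothetical distances for five sequences A to E.
raw = {
    ("A", "B"): 8, ("A", "C"): 17, ("B", "C"): 15,
    ("A", "D"): 20, ("B", "D"): 20, ("C", "D"): 20,
    ("A", "E"): 30, ("B", "E"): 30, ("C", "E"): 30, ("D", "E"): 30,
}
dist = {frozenset(pair): d for pair, d in raw.items()}
guide_tree = upgma_guide_tree("ABCDE", dist)
print(guide_tree)
```

The nested tuples only record the joining order, which is all the progressive aligner needs from a guide tree.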
Third step: align sequences in a bottom-up order. 1. Align the most similar (neighboring) pairs. 2. Align pairs of pairs. 3. Align sequences clustered to pairs of pairs deeper in the tree. [Figure: guide tree with leaves A, D, C, B, E joining Sequence A, Sequence B, Sequence C, Sequence D, and Sequence E]
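The bottom-up order can be read off the guide tree with a post-order traversal, so that every group is aligned before the group that contains it. This is a sketch of the ordering only (no alignment is performed), and the tree topology below is a hypothetical example.

```python
def alignment_order(tree):
    """List the merge steps of a guide tree, children before parents.

    tree is either a leaf name (str) or a pair (left_subtree, right_subtree).
    Each step is (leaves_of_left_group, leaves_of_right_group).
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return [node]
        left, right = node
        left_leaves = visit(left)
        right_leaves = visit(right)
        steps.append((left_leaves, right_leaves))
        return left_leaves + right_leaves

    visit(tree)
    return steps


# Hypothetical topology over the leaves A, D, C, B, E.
tree = ((("A", "D"), "C"), ("B", "E"))
for number, (left, right) in enumerate(alignment_order(tree), start=1):
    print(number, "align", "+".join(left), "with", "+".join(right))
```

A progressive aligner would replace each printed step with a pairwise alignment of two sequences, a sequence and a group, or two groups.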
Multiple sequence alignment (MSA): progressive alignment [Figure: sequences A to E, the pairwise distance table, the guide tree (leaves A, D, C, B, E), and the resulting MSA, with an arrow labeled "Iterative"]
Multiple sequence alignment (MSA) Several advanced MSA programs are available. Today we will use two: MAFFT, the fastest and one of the most accurate; and PRANK, distinct from all other MSA programs because of its correct treatment of insertions/deletions.
MAFFT Web server & download: Efficiency-tuned variants: quick & dirty or slow but accurate. Reference: Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma and Takashi Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 2002, Vol. 30. © 2002 Oxford University Press
Choosing a MAFFT strategy quick & dirty slow but accurate
Choosing a MAFFT strategy [Figure: schematic alignments for three strategies, where X marks alignable residues, o marks unalignable residues, and - marks gaps; scale from quick & dirty to slow but accurate] L-INS-i: one alignable domain flanked by unalignable regions. G-INS-i: sequences alignable over their entire length. E-INS-i: several alignable domains separated by unalignable regions.
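The workshop uses the MAFFT web server, but the same strategies can be selected on the command line of a locally installed MAFFT. The sketch below only builds the command; it assumes a `mafft` executable is on the PATH, and the flag sets are the ones listed in the MAFFT manual for each strategy. The helper name `mafft_command` is made up for this example.

```python
import shlex

# Flag sets per strategy, as listed in the MAFFT manual.
MAFFT_STRATEGIES = {
    "auto": ["--auto"],                                   # let MAFFT choose
    "FFT-NS-2": ["--retree", "2"],                        # quick & dirty
    "L-INS-i": ["--localpair", "--maxiterate", "1000"],
    "G-INS-i": ["--globalpair", "--maxiterate", "1000"],
    "E-INS-i": ["--ep", "0", "--genafpair", "--maxiterate", "1000"],
}


def mafft_command(strategy, fasta_path):
    """Return the argument list for running MAFFT with the given strategy."""
    return ["mafft", *MAFFT_STRATEGIES[strategy], fasta_path]


print(shlex.join(mafft_command("L-INS-i", "trim5a.AA.fas")))
```

MAFFT writes the alignment to standard output, so an actual run would redirect it to a file, for example with `subprocess.run(cmd, stdout=out_file, check=True)`.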
MAFFT output Saving the output: choose a format (Clustal, Fasta, or click "Reformat" to convert to a selection of other formats) and save the page as a text file, e.g. save as a "phylip" file and upload it to PhyML for reconstructing the tree. [Figure: a colored view of the alignment]
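If the alignment was saved in Fasta format, it can also be converted to a PHYLIP-style file offline. This is a minimal sketch, assuming a sequential layout with names cut to 10 characters and separated from the sequence by a space; check the format notes of the tree program before relying on it, and note that truncation can make long names collide.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) from FASTA text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the name.
            name = (line[1:].split() or ["unnamed"])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip(records):
    """Format aligned (name, sequence) records as sequential PHYLIP text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    header = f"{len(records)} {lengths.pop()}"
    rows = [f"{name[:10]:<10} {seq}" for name, seq in records]
    return "\n".join([header, *rows]) + "\n"


# Hypothetical two-sequence alignment.
example = ">human some description\nATG--\nTAA\n>chimp\nATGCCTAA\n"
print(to_phylip(parse_fasta(example)), end="")
```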
PhyML: tree reconstruction The most widely used maximum likelihood (ML) program Web server & download:
PRANK
Classical alignment errors for HIV env
PRANK Web server:
PRANK output If you need a different format – copy the results to the READSEQ sequence converter:
1. Download and save the sequences file from Osnat's homepage (you can google "Osnat Penn" and look for the workshop materials under "Teaching"). Save the file as "trim5a.AA.fas" (File, "Save page as"). This file contains 20 protein sequences in FASTA format.
2. Run the PRANK web server to create a protein alignment:
   a. In the "Default alignment" section browse for "trim5a.AA.fas".
   b. Run (press the "Start alignment" button).
3. While you wait: copy the sequences into the MAFFT web server and run the "automatic" "moderately accurate" strategy. Which strategy did MAFFT choose for you? Click on the "Fasta format" link, save as "trim5a.AA.mafft.aln" (File, "Save page as"), and try the "Jalview" button.
4. When PRANK finishes, click on the "Show Fasta file" button and save the MSA under the name "trim5a.AA.prank.aln".
Sources of alignment errors Progressive alignment algorithms are greedy heuristics. Co-optimal solutions: Heads-or-Tails (HoT) scores (Landan & Graur 2007). Guide-tree errors: GUIDANCE scores (Penn, Privman et al. MBE 2010).
GUIDANCE: Guide-tree based alignment confidence scores [Figure: bootstrap sampling of NJ trees gives Tree 1 to Tree 100; progressive alignment on each tree gives MSA 1 to MSA 100; comparing these with the base MSA gives GUIDANCE scores between 0 (uncertain) and 1 (confident)] Penn, Privman et al. MBE 2010
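The comparison between the base MSA and the alternative MSAs can be sketched as below. This is a simplified illustration in the spirit of GUIDANCE, not its implementation: it assumes the alternative MSAs have already been produced (in GUIDANCE, by realigning with bootstrap guide trees), and it scores each base column by how often its aligned residue pairs reappear in the alternatives.

```python
from itertools import combinations


def residue_columns(msa):
    """For each column, list the (sequence index, residue index) it contains."""
    counters = [0] * len(msa)
    columns = []
    for col in range(len(msa[0])):
        present = []
        for s, row in enumerate(msa):
            if row[col] != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        columns.append(present)
    return columns


def aligned_pairs(msa):
    """Set of residue pairs that share a column in this MSA."""
    return {pair for column in residue_columns(msa)
            for pair in combinations(column, 2)}


def column_scores(base_msa, alternative_msas):
    """Score each base column in [0, 1]; None if it has fewer than 2 residues."""
    alt_pair_sets = [aligned_pairs(m) for m in alternative_msas]
    scores = []
    for column in residue_columns(base_msa):
        pairs = list(combinations(column, 2))
        if not pairs or not alt_pair_sets:
            scores.append(None)
            continue
        support = sum(pair in alt for pair in pairs for alt in alt_pair_sets)
        scores.append(support / (len(pairs) * len(alt_pair_sets)))
    return scores


# Hypothetical base MSA and two alternative MSAs of the same two sequences.
base = ["A-CG", "ATCG"]
alternatives = [["A-CG", "ATCG"], ["AC-G", "ATCG"]]
scores = column_scores(base, alternatives)
print(scores)
```

Rows must be in the same sequence order in every MSA, since residues are identified by (sequence index, residue index).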
[Figure (a): GUIDANCE score per alignment column for HIV1 group M, SIV chimp, HIV1 group O, HIV1 group N, SIV cerco, and SIV gorilla sequences, annotated with the transmembrane, extracellular, and cytoplasmic domains; color scale from confident to uncertain]
[Figure (b): GUIDANCE score per alignment column for HIV1 group M, SIV chimp, and HIV1 group O sequences, annotated with the transmembrane, extracellular, and cytoplasmic domains]
1. Run the GUIDANCE web server to calculate confidence scores for the MAFFT alignment:
   a. In the "Upload your sequence file" window browse for "trim5a.AA.fas".
   b. Choose "Amino Acids" in the "Sequences Type" option.
   c. To speed up the run, change the "Number of bootstrap repeats" in the "Advanced options" section to 30. Note that this is not recommended for real analyses.
   d. Run (press the "Submit" button).
Detecting selection forces Positive selection
Empirical findings: variation among genes. "Important" proteins evolve slower than "unimportant" ones
Histone 3 protein
Empirical findings: variation among sites. Functional sites evolve slower than nonfunctional sites
Silent and non-silent mutations Silent: UUU -> UUC (both encode phenylalanine) Non-silent: UUU -> CUU (phenylalanine to leucine)
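The two examples can be checked against the standard genetic code. The sketch below builds the standard codon table and classifies a codon change as silent or non-silent; the function names are made up for this example, and only the standard code is considered.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(codon):
    """Translate one DNA or RNA codon to a one-letter amino acid (* = stop)."""
    return CODON_TABLE[codon.upper().replace("U", "T")]


def classify_change(before, after):
    """Return 'silent' if both codons encode the same amino acid."""
    return "silent" if translate(before) == translate(after) else "non-silent"


for before, after in [("UUU", "UUC"), ("UUU", "CUU")]:
    print(before, "->", after, classify_change(before, after),
          translate(before), translate(after))
```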
For most proteins, the rate of silent substitutions is much higher than the non-silent rate. This is called purifying selection (= conservation).
There are rare cases where the non-silent rate is much higher than the silent rate. This is called positive selection.
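The comparison of the two rates can be written as a ratio of the non-silent rate to the silent rate. The sketch below labels a site or gene from that ratio; the tolerance band around 1 and the "approximately neutral" label are assumptions added for illustration, and real tools decide with a statistical test rather than a fixed threshold.

```python
def selection_regime(nonsilent_rate, silent_rate, tolerance=0.1):
    """Return (ratio, label) comparing non-silent and silent rates.

    tolerance is an arbitrary illustrative band around a ratio of 1.
    """
    if silent_rate <= 0:
        raise ValueError("silent rate must be positive")
    ratio = nonsilent_rate / silent_rate
    if ratio < 1 - tolerance:
        return ratio, "purifying selection"
    if ratio > 1 + tolerance:
        return ratio, "positive selection"
    return ratio, "approximately neutral"


# Hypothetical rates.
print(selection_regime(0.2, 1.0))
print(selection_regime(3.0, 1.0))
```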
Positive Selection Examples: Pathogen proteins evading the host immune system Proteins of the immune system detecting pathogen proteins Pathogen proteins that are drug targets Proteins that are products of gene duplication Proteins involved in the reproductive system
Selecton results
False positive predictions Selecton uses an MSA as input. The MSA may contain unreliable regions, which lead to errors in Selecton computations and so to errors in the positive selection inference.
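One way to screen for such false positives is to cross-check the predicted sites against per-column alignment confidence. The sketch below is hypothetical: it assumes both inputs use 1-based alignment column numbers, and the cutoff of 0.93 is an assumed value to adjust, not a recommendation from these slides.

```python
def flag_unreliable_sites(column_scores, positive_sites, cutoff=0.93):
    """Return (unreliable_columns, suspect_positive_sites), both 1-based.

    column_scores: confidence score per alignment column, in column order.
    positive_sites: alignment columns predicted to be under positive selection.
    cutoff: assumed confidence threshold below which a column is unreliable.
    """
    unreliable = {
        column
        for column, score in enumerate(column_scores, start=1)
        if score < cutoff
    }
    suspects = unreliable & set(positive_sites)
    return sorted(unreliable), sorted(suspects)


# Hypothetical scores for a 4-column alignment and two predicted sites.
unreliable, suspects = flag_unreliable_sites([1.0, 0.5, 0.95, 0.2], [2, 3])
print("unreliable columns:", unreliable)
print("positive-selection sites in unreliable columns:", suspects)
```

If the selection results are numbered by position in a reference sequence rather than by alignment column, the positions need to be mapped to columns first.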
1. Go to the GUIDANCE results of the last exercise.
2. Which columns are not well aligned? Are these sites also predicted to evolve under positive selection? See Selecton results in:
Summary Different alignment programs may produce different MSAs. Alignment uncertainty may cause errors in downstream analyses such as positive selection analysis. GUIDANCE can detect alignment errors.
Thanks for your attention!