Alignments and alignment reliability The first critical step in sequence analysis – the know how Eyal Privman and Osnat Penn Tel Aviv University COST Training.

Presentation transcript:

Alignments and alignment reliability. The first critical step in sequence analysis: the know-how. Eyal Privman and Osnat Penn, Tel Aviv University. COST Training School, Rehovot, 2010.

What are alignments good for? To compare sequences: find homology (similar sequence → similar function). To learn about sequence evolution: mismatch = point mutation; gap = indel (insertion or deletion); reconstruct the phylogenetic tree; infer selection forces, e.g., detect positive selection.
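
The reading of an alignment as point mutations and indels can be sketched in a few lines of Python. This is only an illustration: the function name `summarize_pairwise` is hypothetical, and counting each uninterrupted run of gap columns as a single indel event is a simplifying assumption.

```python
def summarize_pairwise(row_a, row_b):
    """Count matches, mismatches, gap columns and indel events in two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gap_columns = indel_events = 0
    in_gap = False
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # column is empty for this pair (can happen in rows taken from an MSA)
        if x == "-" or y == "-":
            gap_columns += 1
            if not in_gap:
                indel_events += 1  # assumption: one run of gap columns = one indel event
            in_gap = True
        else:
            in_gap = False
            if x == y:
                matches += 1
            else:
                mismatches += 1  # read as a point mutation
    return {"matches": matches, "mismatches": mismatches,
            "gap_columns": gap_columns, "indel_events": indel_events}


print(summarize_pairwise("AATTTT---GTA", "AATAAACCCGTA"))
```

The two rows in the usage line are taken from the example alignment shown on the next slide.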

Sequence evolution. [Figure: example sequences diverging along a tree (30 MYA, 5 MYA, today) for human, chimp and mouse, ending in the gapped alignment AATTTT---GTA / ---TTT---GTA / AATAAACCCGTA.]

Alignment and phylogeny are mutually dependent. [Diagram labels: unaligned sequences; sequence alignment; MSA; phylogeny reconstruction; inaccurate tree building.]

Alignment and phylogeny are both challenging: 25% of residues are aligned incorrectly (based on BAliBASE, a large representative set of proteins).

Alignment and phylogeny are both challenging: 5% of tree branches are wrong (based on simulations of 100 protein sequences).

Making an alignment. For 2 sequences: use exact methods. For more sequences: exact methods are not feasible (too slow), so we use heuristic methods.
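
As an illustration of an exact method for two sequences, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch scheme). The slides do not name a specific algorithm or scoring scheme, so the choice of method and the scores used here (match +1, mismatch -1, gap -1) are assumptions made for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back one optimal path (co-optimal paths may exist).
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

The table has (n+1) x (m+1) cells, which is cheap for two sequences. The same idea applied to k sequences needs a k-dimensional table, which is why exact methods stop being feasible for multiple alignment.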

Progressive alignment, first step: compute pairwise distances. Compute the pairwise alignments for all against all (10 pairwise alignments for the five sequences A to E). The similarities are converted to distances and stored in a table. [Table: triangular distance table for sequences A to E; only the entries 8, 17 and 15 survive in this transcript.]
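
The all-against-all step can be sketched as follows. To keep the example self-contained, similarity is taken from Python's `difflib.SequenceMatcher.ratio()`, which is only a stand-in for a real pairwise alignment score; the function name `pairwise_distances` and the five toy sequences are likewise assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_distances(seqs):
    """Map each unordered pair of names to a distance in [0, 1].

    Stand-in similarity: difflib's ratio (1.0 for identical strings).
    A real pipeline would use pairwise alignment scores instead.
    """
    table = {}
    for name_a, name_b in combinations(sorted(seqs), 2):
        similarity = SequenceMatcher(None, seqs[name_a], seqs[name_b],
                                     autojunk=False).ratio()
        table[(name_a, name_b)] = 1.0 - similarity  # convert similarity to distance
    return table


# Toy input (hypothetical): five short sequences named A to E.
toy = {"A": "ATGAAATAA", "B": "ATGAAATAA", "C": "ATGTTTTAA",
       "D": "ATGCCCAAATAA", "E": "ATGTTT"}
table = pairwise_distances(toy)
print(len(table))  # number of pairs: n * (n - 1) / 2
for pair, distance in table.items():
    print(pair, round(distance, 3))
```

For n sequences the table holds n(n-1)/2 entries, which gives the 10 pairwise alignments mentioned for five sequences.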

Progressive alignment, second step: build a guide tree. Cluster the sequences to create a tree (the guide tree), which represents the order in which pairs of sequences are to be aligned: similar sequences are neighbors in the tree, and distant sequences are distant from each other in the tree. The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences! [Figure: guide tree with leaves A, D, C, B, E next to the pairwise distance table.]
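
One simple way to cluster a distance table into a guide tree is UPGMA (average-linkage clustering). The slides do not say which clustering method is used, so UPGMA is a swapped-in choice for this sketch, and the distance values below are made up for illustration.

```python
def upgma(names, dist):
    """Cluster names into a Newick-like string using average linkage.

    dist maps (name_a, name_b) tuples to distances; each pair appears once.
    Assumes names contain no parentheses or commas.
    """
    sizes = {name: 1 for name in names}                  # cluster label -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        pair = min(d, key=d.get)                         # closest pair of clusters
        x, y = sorted(pair)
        merged = "(" + x + "," + y + ")"
        for z in sizes:
            if z in pair:
                continue
            dxz = d.pop(frozenset((x, z)))
            dyz = d.pop(frozenset((y, z)))
            # Average linkage: weight by the number of leaves in each cluster.
            d[frozenset((merged, z))] = (sizes[x] * dxz + sizes[y] * dyz) / (sizes[x] + sizes[y])
        del d[pair]
        sizes[merged] = sizes.pop(x) + sizes.pop(y)
    return next(iter(sizes))


# Hypothetical distances for five sequences A to E.
toy_dist = {("A", "B"): 2, ("C", "D"): 4,
            ("A", "C"): 8, ("A", "D"): 8, ("B", "C"): 8, ("B", "D"): 8,
            ("A", "E"): 12, ("B", "E"): 12, ("C", "E"): 12, ("D", "E"): 12}
print(upgma("ABCDE", toy_dist))
```

The result only fixes an order of joining. As the slide stresses, it should not be read as the evolutionary tree.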

Progressive alignment, third step: align sequences in a bottom-up order.
1. Align the most similar (neighboring) pairs.
2. Align pairs of pairs.
3. Align sequences clustered to pairs of pairs deeper in the tree.
[Figure: guide tree with leaves A, D, C, B, E leading to the aligned sequences A to E.]
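
The bottom-up order can be read off a guide tree with a post-order traversal: a group is aligned only after both of its child groups have been aligned. The sketch below lists the alignment steps for a guide tree written as nested tuples. The topology is hypothetical (the transcript does not preserve the tree's shape), and `alignment_order` is an illustrative name.

```python
def alignment_order(tree):
    """List the (left_group, right_group) alignment steps in bottom-up order."""
    steps = []

    def visit(node):
        if isinstance(node, str):        # a leaf: a single sequence
            return [node]
        left = visit(node[0])
        right = visit(node[1])
        steps.append((left, right))      # both children are done, so align them
        return left + right

    visit(tree)
    return steps


# Hypothetical guide tree over sequences A to E.
guide_tree = ((("A", "D"), "C"), ("B", "E"))
for left, right in alignment_order(guide_tree):
    print("align", left, "with", right)
```

Each step aligns two already-built groups (single sequences or sub-alignments), and the last step joins everything into the MSA.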

Multiple sequence alignment (MSA): progressive alignment. [Summary diagram labels: sequences A to E; pairwise distance table; guide tree; MSA; iterative.]

Multiple sequence alignment (MSA). Several advanced MSA programs are available. Today we will use two. MAFFT: fastest and one of the most accurate. PRANK: distinct from all other MSA programs because of its correct treatment of insertions/deletions.

MAFFT. Efficiency-tuned variants: quick & dirty or slow but accurate. Reference: Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma and Takashi Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 2002, Vol. 30. © 2002 Oxford University Press. Web server & download:

Choosing a MAFFT strategy: from quick & dirty to slow but accurate.


Choosing a MAFFT strategy (from quick & dirty to slow but accurate). [Figure: schematic alignments for three strategies, where X marks alignable residues, o marks unalignable residues and - marks gaps. L-INS-i: one alignable domain flanked by unalignable regions. G-INS-i: sequences alignable over their full length. E-INS-i: several alignable blocks separated by unalignable regions.]
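
The workshop uses the MAFFT web server, but the same strategies can be selected in the downloadable command-line program. The sketch below only builds the command line. The flag sets are the ones I understand the MAFFT manual to give for each strategy, so please check them against `mafft --help` for your installed version; the helper name `mafft_command` is hypothetical.

```python
import shlex

# Flag sets per strategy, as I understand them from the MAFFT manual.
# (The manual may suggest additional options for E-INS-i, e.g. on gap extension.)
MAFFT_STRATEGIES = {
    "auto": ["--auto"],                                   # let MAFFT choose
    "L-INS-i": ["--localpair", "--maxiterate", "1000"],
    "G-INS-i": ["--globalpair", "--maxiterate", "1000"],
    "E-INS-i": ["--genafpair", "--maxiterate", "1000"],
}


def mafft_command(strategy, fasta_path):
    """Build the argument list for one MAFFT run (the alignment goes to stdout)."""
    if strategy not in MAFFT_STRATEGIES:
        raise ValueError("unknown strategy: " + strategy)
    return ["mafft", *MAFFT_STRATEGIES[strategy], fasta_path]


cmd = mafft_command("L-INS-i", "trim5a.AA.fas")
print(shlex.join(cmd))
# To actually run it (requires MAFFT to be installed):
#   import subprocess
#   with open("trim5a.AA.mafft.aln", "w") as out:
#       subprocess.run(cmd, stdout=out, check=True)
```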

MAFFT output. Saving the output: choose a format (Clustal or Fasta), or click "Reformat" to convert to a selection of other formats. Save the page as a text file, e.g. save as a "phylip" file and upload it to PhyML for reconstructing the tree. A colored view of the alignment is also available.
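
If an alignment was saved in Fasta format, the conversion to a PHYLIP-style file can also be done locally. The sketch below writes a simplified sequential layout (a header with the number of sequences and alignment length, then one name and sequence per line). Strict PHYLIP limits names to 10 characters, so treat this relaxed layout as an assumption and check what your tree program accepts. The function names are illustrative.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name = (line[1:].split() or ["unnamed"])[0]   # first word of the header
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip(records):
    """Format aligned records as simplified sequential PHYLIP text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    width = max(len(name) for name, _ in records) + 2     # pad names into one column
    lines = ["{} {}".format(len(records), lengths.pop())]
    for name, seq in records:
        lines.append(name.ljust(width) + seq)
    return "\n".join(lines) + "\n"


fasta_text = ">seq1\nAATTTT---GTA\n>seq2\n---TTT---GTA\n>seq3\nAATAAACCCGTA\n"
print(to_phylip(parse_fasta(fasta_text)))
```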

PhyML: tree reconstruction. One of the most widely used maximum likelihood (ML) programs. Web server & download:

PRANK

Classical alignment errors for HIV env

PRANK Web server:

PRANK output. If you need a different format, copy the results to the READSEQ sequence converter:

1. Download and save the sequences file from Osnat's homepage (you can google "Osnat Penn" and look for the workshop materials under "Teaching"). Save the file as "trim5a.AA.fas" (File → "Save page as"). This file contains 20 protein sequences in FASTA format.
2. Run the PRANK web server to create a protein alignment:
   a. In the "Default alignment" section, browse for "trim5a.AA.fas".
   b. Run (press the "Start alignment" button).
3. While you wait: copy the sequences into the MAFFT web server and run the "automatic" "moderately accurate" strategy. Which strategy did MAFFT choose for you? Click on the "Fasta format" link, save as "trim5a.AA.mafft.aln" (File → "Save page as"), and try the "Jalview" button.
4. When PRANK finishes, click on the "Show Fasta file" button and save the MSA under the name "trim5a.AA.prank.aln".

Sources of alignment errors. Progressive alignment algorithms are greedy heuristics. Co-optimal solutions → Heads-or-Tails (HoT) scores (Landan & Graur 2007). Guide-tree errors → GUIDANCE scores (Penn, Privman et al. MBE 2010).

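GUIDANCE: guide-tree based alignment confidence scores. [Diagram: starting from the base MSA, bootstrap sampling of NJ trees gives Tree 1, Tree 2, …, Tree 99, Tree 100; progressive alignment with each tree gives MSA 1, MSA 2, …, MSA 99, MSA 100; comparing these with the base MSA gives GUIDANCE scores on a scale from 0 (uncertain) to 1 (confident).] Penn, Privman et al. MBE 2010.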
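
The idea of scoring the base MSA against the alternative MSAs can be sketched as follows: a pair of residues aligned in a base column is "supported" by an alternative MSA if that MSA also puts the two residues in the same column, and the column score is the average support. This is a simplified illustration and not the GUIDANCE implementation; the function names and the toy alignments are assumptions.

```python
from itertools import combinations


def column_pairs(msa):
    """For each column, the set of aligned residue pairs ((seq, pos), (seq, pos))."""
    counters = [0] * len(msa)            # next residue index in each sequence
    columns = []
    for c in range(len(msa[0])):
        residues = []
        for s, row in enumerate(msa):
            if row[c] != "-":
                residues.append((s, counters[s]))
                counters[s] += 1
        columns.append(set(combinations(residues, 2)))
    return columns


def column_support(base, alternatives):
    """Score each base column in [0, 1]; None for columns without a residue pair.

    Requires at least one alternative MSA, with rows in the same order as base.
    """
    alt_pairs = [set().union(*column_pairs(alt)) for alt in alternatives]
    scores = []
    for pairs in column_pairs(base):
        if not pairs:
            scores.append(None)
            continue
        supported = sum(len(pairs & alt) for alt in alt_pairs)
        scores.append(supported / (len(pairs) * len(alt_pairs)))
    return scores


base_msa = ["ACG", "A-G"]
alternative_msas = [["ACG", "A-G"], ["ACG", "AG-"]]   # e.g. built with different guide trees
print(column_support(base_msa, alternative_msas))
```

In the toy example the first column is reproduced by both alternatives, the middle column holds a single residue and so has no pair to score, and the last column is reproduced by only one of the two alternatives.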

[Figure (a): GUIDANCE score per alignment column, colored from confident to uncertain, for sequences from HIV1 group M, SIV chimp, HIV1 group O, HIV1 group N, SIV cerco and SIV gorilla, with the extracellular, transmembrane and cytoplasmic domains marked.]

[Figure (b): GUIDANCE score per alignment column for sequences from HIV1 group M, SIV chimp and HIV1 group O, with the extracellular, transmembrane and cytoplasmic domains marked.]

1. Run the GUIDANCE web server to calculate confidence scores for the MAFFT alignment:
   a. In the "Upload your sequence file" window, browse for "trim5a.AA.fas".
   b. Choose "Amino Acids" in the "Sequences Type" option.
   c. To speed up the run, change the "Number of bootstrap repeats" in the "Advanced options" section to 30. Note that this is not recommended for real analyses.
   d. Run (press the "Submit" button).

Detecting selection forces: positive selection

Empirical findings, variation among genes: "important" proteins evolve slower than "unimportant" ones.

Histone 3 protein

Empirical findings, variation among sites: functional sites evolve slower than nonfunctional sites.

Silent and non-silent mutations. Silent: UUU -> UUC (both encode phenylalanine). Non-silent: UUU -> CUU (phenylalanine to leucine).
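
Whether a codon change is silent can be checked mechanically against the standard genetic code. The sketch below builds the standard codon table and classifies a change; the function name `classify_mutation` is illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ("*" = stop).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_mutation(codon_before, codon_after):
    """Return 'silent' if both codons encode the same amino acid, else 'non-silent'."""
    before = CODON_TABLE[codon_before.upper().replace("T", "U")]
    after = CODON_TABLE[codon_after.upper().replace("T", "U")]
    return "silent" if before == after else "non-silent"


print(classify_mutation("UUU", "UUC"), CODON_TABLE["UUU"], CODON_TABLE["UUC"])
print(classify_mutation("UUU", "CUU"), CODON_TABLE["UUU"], CODON_TABLE["CUU"])
```

DNA codons are accepted as well, since T is mapped to U before the lookup.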

For most proteins, the rate of silent substitutions is much higher than the non-silent rate. This is called purifying selection (= conservation).

There are rare cases where the non-silent rate is much higher than the silent rate. This is called positive selection.
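
The comparison of the two rates is commonly summarized as their ratio (non-silent rate divided by silent rate). A minimal sketch of the interpretation follows. The function name, the tolerance band around 1 and the "approximately neutral" label are assumptions added for illustration; real analyses rely on statistical tests rather than a fixed band.

```python
def classify_selection(nonsilent_rate, silent_rate, tolerance=0.1):
    """Return (ratio, label) for the non-silent to silent rate ratio."""
    if silent_rate <= 0:
        raise ValueError("silent rate must be positive")
    ratio = nonsilent_rate / silent_rate
    if ratio > 1 + tolerance:
        return ratio, "positive selection"
    if ratio < 1 - tolerance:
        return ratio, "purifying selection"
    return ratio, "approximately neutral"   # hypothetical label for ratios near 1


print(classify_selection(0.1, 1.0))   # non-silent rate much lower than silent rate
print(classify_selection(2.0, 0.5))   # non-silent rate much higher than silent rate
```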

Positive selection. Examples: pathogen proteins evading the host immune system; proteins of the immune system detecting pathogen proteins; pathogen proteins that are drug targets; proteins that are products of gene duplication; proteins involved in the reproductive system.

Selecton results

False positive predictions. Selecton uses an MSA as input. The MSA may contain unreliable regions → errors in Selecton computations → errors in the positive selection inference.

1. Go to the GUIDANCE results of the last exercise.
2. Which columns are not well aligned? Are these sites also predicted to evolve under positive selection? See Selecton results in:
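
The comparison asked for in this exercise can also be done programmatically once both results are available as lists. The sketch below assumes a list of per-column confidence scores and a set of column numbers predicted to be under positive selection. The cutoff value, the function name and the toy numbers are assumptions for illustration, not values from the exercise.

```python
def unreliable_selected_sites(column_scores, selected_columns, cutoff=0.93):
    """Return the 1-based columns that are predicted as positively selected
    and have a confidence score below the cutoff (illustrative value)."""
    unreliable = {index
                  for index, score in enumerate(column_scores, start=1)
                  if score is not None and score < cutoff}
    return sorted(unreliable & set(selected_columns))


# Toy data (hypothetical): four alignment columns, two predicted sites.
scores = [1.0, 0.5, 0.95, 0.2]
predicted_sites = {2, 3}
print(unreliable_selected_sites(scores, predicted_sites))
```

Sites returned by this check are candidates for false positives, since the prediction rests on a part of the alignment that is not well supported.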

Summary. Different alignment programs may produce different MSAs. Alignment uncertainty may cause errors in downstream analyses such as positive selection analysis. GUIDANCE can detect alignment errors.

Thanks for your attention!