Sequence Alignment

Presentation transcript:

Sequence Alignment. Only things that are homologous should be compared in a phylogenetic analysis. Homologous: sharing a common ancestor. This is true for morphological characters and must also be true for molecular characters, or the entire analysis is meaningless. There are two different types of homology: paralogous sequences are homologous due to duplication; orthologous sequences are homologous due to speciation. Paralogous comparisons can be useful, but in most cases we are interested in orthologous comparisons.

Only things that are homologous should be compared in a phylogenetic analysis. It is relatively easy to determine orthology/paralogy in the case of genes/sequences. We must also determine homology for each and every nucleotide/amino acid position within the sequences. This is accomplished via sequence alignment, THE CRITICAL first step in most phylogenetic analyses. Always remember that the sequence alignment itself is a hypothesis about the homology of multiple positions in a set of protein or nucleotide sequences.

A multiple sequence alignment aims to find homology among as many residues in a group of sequences as possible. Most of the time, gaps must be introduced in order to align the sequences. Gaps represent indels (insertion/deletion events) that have presumably occurred since the sequences diverged in evolutionary history. All of this works on the assumption that the sequences began as the same sequence and diverged over time due to mutations (substitutions, insertions, deletions).

The problem of repeats. Repeated nucleotides, such as simple sequence repeats (SSRs), make alignment difficult.

Substitutions: point changes in sequences over time. Sequence identity: the number of identical residues in an alignment divided by the number of aligned positions. Gaps are not counted, so it can be a misleading number. [Figure: example of an amino acid sequence alignment]
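
The definition of sequence identity above can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function name `sequence_identity` and the use of `-` as the gap character are assumptions, and the example strings are simply a short gapped pair chosen to show how ignoring gaps can inflate the number.

```python
def sequence_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Fraction of identical residues among columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns in which both sequences carry a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return matches / len(columns)


# Four of the ten columns contain a gap and are skipped entirely.
print(sequence_identity("MNAL-SQLN-", "-NALMSQ-NH"))
```

Because gapped columns are dropped from both the numerator and the denominator, a pair of sequences that needs several indels to line up can still report a very high identity.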

Note the indels: they represent an assumption that there has been an insertion and/or deletion in one or both sequences relative to each other (we can rarely know for sure which it is). Note the blocks of identical residues: they likely represent functionally important amino acids. Functional importance can be structural, enzymatic, or both.

Amino acid alignments have an advantage over DNA/RNA alignments. Side chains of amino acids can be grouped according to chemical properties (basic, acidic, polar, nonpolar, charged, uncharged, hydrophobic, hydrophilic). Evolutionary theory suggests that substitutions to similar amino acids will be tolerated more readily than more drastic changes.

We can take advantage of this pattern to inform and aid the process of aligning protein sequences. Dayhoff et al. (1978) developed a matrix to inform alignments based on assigning weights to various substitutions. It is based on 1572 observed changes in closely related protein sequences. Higher weight: less likely change. [Figure: PAM matrix]

Most modern analyses use variations of a BLOSUM (BLOcks SUbstitution Matrix) matrix by Henikoff and Henikoff (1992). High number = likely substitution. The idea is to find the alignment with the highest score. [Figure: BLOSUM62 matrix]

Gaps. Gaps are introduced to help maximize an alignment score. Alignment programs can easily add gaps willy-nilly. Think about it: to obtain the highest score, just keep moving along the sequence to which you are aligning until you find a matching base. Gap penalties are subtractions from the alignment score when a gap is introduced: GP = g + hl, where g = gap opening penalty, h = gap extension penalty, and l = gap length. There is no real biological justification for the formula. In reality the origin of the gap must be taken into account, but no models exist to do this. The best-scoring alignment may not reflect biological reality.
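
To make the gap penalty formula concrete, here is a small hedged sketch that scores an already-aligned pair. Everything except GP = g + hl is an assumption made for illustration: the function names, the match/mismatch values, the default penalties (g = 2.0, h = 0.5), and the decision to penalize gaps at the ends of the alignment like any other gap. A real analysis would draw residue scores from a substitution matrix such as BLOSUM62.

```python
import re


def gap_penalty(length: int, g: float = 2.0, h: float = 0.5) -> float:
    """GP = g + h*l for a single gap of l consecutive positions."""
    return g + h * length


def gap_lengths(aligned: str, gap: str = "-") -> list:
    """Lengths of every run of consecutive gap characters in one aligned row."""
    return [len(m.group()) for m in re.finditer(re.escape(gap) + "+", aligned)]


def alignment_score(aligned_a: str, aligned_b: str,
                    match: float = 1.0, mismatch: float = -1.0,
                    g: float = 2.0, h: float = 0.5) -> float:
    """Residue scores minus one affine penalty per gap run (toy scoring values)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0.0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gapped columns are charged through the gap penalty below
        score += match if a == b else mismatch
    for row in (aligned_a, aligned_b):
        for length in gap_lengths(row):
            score -= gap_penalty(length, g, h)
    return score


print(alignment_score("MNAL-SQLN-", "-NALMSQ-NH"))
```

With these toy values the four single-position gaps cost more than the six matches earn, which is the point of the penalty: gaps have to pay for themselves.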

Gaps mean something; what that is is subject to debate. Most software ignores gaps by default; other programs use them, but with no biological model to support their weight. All of the previous information applies in some ways to DNA/RNA sequence alignments. Nucleic acids form secondary structures and may have blocks of conserved sequence. Some nucleotide changes are more likely than others (for example, transitions are generally more frequent than transversions).

Multiple alignment algorithms. Dot-matrix sequence comparison: a dot-plot is constructed from the two sequences MNALSQLN and NALMSQNH, with one sequence along each axis. [Figure: dot-plot of MNALSQLN against NALMSQNH]
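
The dot-plot for the two example sequences can be reproduced with a short sketch. The function names `dot_matrix` and `render` are hypothetical, and the rule used here (one mark per identical residue pair, no window and no scoring matrix) is the simplest possible variant.

```python
def dot_matrix(seq_a: str, seq_b: str) -> list:
    """Rows follow seq_b, columns follow seq_a; True marks identical residues."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a: str, seq_b: str) -> str:
    """Draw the matrix as text, '*' for a match and '.' otherwise."""
    lines = ["  " + " ".join(seq_a)]
    for residue, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("MNALSQLN", "NALMSQNH"))
```

Runs of marks along a diagonal correspond to stretches that align without gaps; a jump from one diagonal to another is where a gap has to be placed.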

Dot-matrix sequence comparison. Gaps are indicated by deviations from a diagonal. [Figure: dot-plot of MNALSQLN against NALMSQNH, with markers indicating that M matches with a gap and that L matches with a gap]
Stage 1: align the middle, using triangles to indicate gaps:
NAL-SQLN
NALMSQ-N
Stage 2: sort the ends out:
MNAL-SQLN-
-NALMSQ-NH

Dot-matrix sequence comparison. The same approach works for nucleotide alignments.

Dot-matrix sequence comparison. The method is great for getting an overall picture of the quality of the alignment and for identifying features of the sequences. Uses include detecting exons and similar genes in divergent taxa.

Dot-matrix sequence comparison. Detecting repetitive sequences: self-align a sequence using a dot matrix.
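
Self-alignment for repeat detection can be sketched as follows. This is an illustrative assumption about how one might do it, not the method shown on the slide: the sequence is compared with itself using a short exact-match window (the `window` parameter and the function name `repeat_hits` are hypothetical), and only hits off the main diagonal are reported, since every sequence trivially matches itself on the diagonal.

```python
def repeat_hits(seq: str, window: int = 3) -> list:
    """Return (i, j) start positions, i < j, whose windows of equal length are identical."""
    hits = []
    n = len(seq) - window + 1
    for i in range(n):
        for j in range(i + 1, n):
            if seq[i:i + window] == seq[j:j + window]:
                hits.append((i, j))
    return hits


# "ACG" occurs three times in this made-up sequence.
for i, j in repeat_hits("ACGTTTACGAAACG"):
    print(i, j)
```

In a plotted dot matrix these off-diagonal hits appear as short lines parallel to the main diagonal; a tandem repeat such as an SSR produces a dense stack of them.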

Dynamic programming. Keep in mind that until now we have only been talking about TWO sequences. Dynamic programming can be used to find scores for all possible pairs of aligned residues and all possible pairs of sequences. A score for each pair (D_ij) is calculated, and all possible D_ij values are summed to get a score. Sequence pairs can be weighted to give preference to more reliable pairs. Time and memory requirements grow exponentially with the number of sequences, which is prohibitive for more than a few sequences. Some problems with dynamic programming can be overcome by using short subsection alignments (instead of global alignments) via DIALIGN (Morgenstern, 1999).
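
For reference, the two-sequence case of dynamic programming looks like the sketch below. It is a plain Needleman-Wunsch global alignment with a linear gap cost, chosen here because it is the standard textbook form; it is not the multi-sequence sum-of-pairs procedure described above, and the scoring values (match 1, mismatch -1, gap -2) are arbitrary assumptions.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align the two residues
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(total)
print(row_a)
print(row_b)
```

The table has (n+1) x (m+1) cells for two sequences. Extending the same idea to k sequences needs a k-dimensional table, which is why time and memory grow exponentially with the number of sequences.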

Progressive alignments. Typically, we are trying to find the phylogeny given the sequences, but it would be easier to align the sequences if we knew the phylogeny. The solution: build a quick and dirty guide tree and use it as the basis for the alignment. This is fast and reasonably reliable. Align all possible pairs, generate genetic distances, and build a guide tree. Then build the multiple sequence alignment by following the branching order of the tree, from the most similar sequences to the least similar.
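
The guide-tree step can be sketched roughly as below. Two loud assumptions: the pairwise distance is a stand-in (one minus the similarity ratio from Python's `difflib.SequenceMatcher`), not a genetic distance from real pairwise alignments, and the clustering is a simple average-linkage merge rather than the tree-building method of any particular program. The sequence names and the unrelated sequence "D" are made up.

```python
from difflib import SequenceMatcher
from itertools import combinations


def distance(a: str, b: str) -> float:
    """Stand-in distance: 1 - difflib similarity ratio (NOT a genetic distance)."""
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


def guide_tree(seqs: dict):
    """Average-linkage clustering of named sequences; returns nested tuples of names."""
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in seqs]
    dist = {frozenset(pair): distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}

    def cluster_distance(c1, c2) -> float:
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters first, so the most similar sequences join earliest.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


example = {"A": "MNALSQLN", "B": "MNALSQLH", "C": "NALMSQNH", "D": "WWWWPPPP"}
print(guide_tree(example))
```

Reading the nested tuples from the innermost pair outward gives the order in which a progressive aligner would add sequences: the closest pair is aligned first, and the most divergent sequence joins last.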

Progressive alignments: ClustalW and ClustalX. ClustalX is just ClustalW with a built-in GUI. Uses a progressive method. Downweights sequences according to guide tree relatedness. Can vary the weight matrix for protein sequences automatically according to the relatedness of the sequences. Limitation: final results are highly dependent on the initial alignments. Initial alignments are always incorporated into the final result; that is, once a sequence has been aligned into the MSA, its alignment is not considered further. This approximation improves efficiency at the cost of accuracy.

Progressive alignments: T-Coffee. Corrects an inherent problem of progressive alignments: early alignment mistakes cannot be corrected later in the process. Calculates pairwise alignments by combining the direct alignment of the pair with indirect alignments that align each sequence of the pair to a third sequence. Uses the output from other local alignment programs to find multiple regions of local alignment between two sequences. The resulting alignment and phylogenetic tree are used as a guide to produce new and more accurate weighting factors. Slower but more accurate than Clustal.

Iterative alignments. These work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA. Iterative methods can return to previously calculated pairwise alignments or sub-MSAs incorporating subsets of the query sequences as a means of optimizing a general objective function, such as finding a high-quality alignment score.

Iterative alignments. The software package PRRN/PRRP uses a hill-climbing algorithm to optimize its MSA score and iteratively corrects both alignment weights and locally divergent or "gappy" regions of the growing MSA. PRRP performs best when refining an alignment previously constructed by a faster method. The alignment of individual motifs is then achieved with a matrix representation similar to a dot-matrix plot in a pairwise alignment. MUSCLE (multiple sequence alignment by log-expectation) improves on progressive methods with a more accurate distance measure to assess the relatedness of two sequences. The distance measure is updated between iteration stages.

Hidden Markov model alignments. Use probabilistic models of substitution and indel occurrence. Do not always reach the same alignment during multiple runs.