COFFEE: an objective function for multiple sequence alignments

COFFEE: an objective function for multiple sequence alignments Wang Yi Computational Genomics Group Bioinformatics Institute

Why MSA Multiple Sequence Alignments (MSA) are among the most important tools for analyzing biological sequences. Useful for: structure prediction, phylogenetic analysis, function prediction, Polymerase Chain Reaction (PCR) primer design, and more.

What is COFFEE Consistency based Objective Function For alignmEnt Evaluation. The COFFEE score reflects the level of consistency between an MSA and a library containing pairwise alignments of the same group of sequences.

What is consistency? Why study consistency between MSA and pairwise alignment?

Why pairwise alignments MSA, unlike pairwise alignment, cannot yet guarantee optimality. Pairwise alignments use dynamic programming to obtain an optimal result, but the same algorithm is too expensive for MSA. People therefore try to exploit the optimality of pairwise alignments by progressively combining them into an MSA.
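As an illustration of the dynamic programming mentioned above, here is a minimal sketch of a Needleman-Wunsch global alignment score for two sequences. The scoring values (match +1, mismatch -1, linear gap -1) and the function name are assumptions chosen for the example; the slides do not specify them.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b under a simple linear gap cost.

    The scoring parameters are illustrative assumptions, not values from the slides.
    """
    # prev[j] holds the best score for aligning the first i-1 letters of a
    # with the first j letters of b; row 0 is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first i letters of a against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # letter of a aligned to a gap
            left = curr[j - 1] + gap  # letter of b aligned to a gap
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("THEFASTCAT", "THEFATCAT"))
```

The table has len(a) x len(b) cells for two sequences. Extending the same recurrence to N sequences needs one table dimension per sequence, which is why the exact approach becomes too expensive for MSA.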

Pairwise alignments to MSA ClustalW is a widely recognized package among such attempts. ClustalW generates a guide tree according to the distances between each pair of sequences. Then it aligns all these sequences progressively, from the closest branches to the most distant ones.

Problem with ClustalW Mistakes made at the beginning of this procedure are never corrected. This problem stems from not checking whether the alignments of close pairs are consistent with those of more distant pairs.

Two solutions To solve this problem, we can do either of the following. Check the consistency between each pairwise alignment and the rest of the library before the progressive alignment. Or: after obtaining an MSA, check the consistency between each pair of aligned residues and its counterpart in the pairwise alignment library.

Consistency Vs Consistency These two kinds of consistency are closely related: increasing the consistency among the pairwise alignments decreases the chance that a pair in the MSA is inconsistent with its origin in the library. T-COFFEE takes the first approach, while COFFEE calculates the latter.

A simple example Suppose we have four sequences:
SeqA: THE LAST FAT CAT
SeqB: THE FAST CAT
SeqC: THE VERY FAST CAT
SeqD: THE FAT CAT
We make a pairwise alignment library of these sequences:

Compare the consistency
SeqA THE LAST FAT CAT
SeqB THE FAST CAT ---

SeqA THE LAST FA-T CAT
SeqC THE VERY FAST CAT

SeqA THE LAST FAT CAT
SeqD THE ---- FAT CAT

SeqB THE ---- FAST CAT
SeqC THE VERY FAST CAT

SeqB THE FAST CAT
SeqD THE FA-T CAT

SeqC THE VERY FAST CAT
SeqD THE ---- FA-T CAT

SeqA THE LAST FA-T CAT
SeqB THE FAST CA-T ---
SeqC THE VERY FAST CAT
SeqD THE ---- FA-T CAT

Or

SeqA THE LAST FA-T CAT
SeqB THE ---- FAST CAT
SeqC THE VERY FAST CAT
SeqD THE ---- FA-T CAT

How COFFEE works Create a library of pairwise alignments, one for each possible pair of sequences. Compare each pair of aligned residues in the MSA to its counterpart in the library. The overall consistency score equals the number of pairs that occur in both the MSA and the library, divided by the total number of pairs in the MSA.
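The unweighted procedure described on this slide can be sketched as follows, using the four "CAT" sequences from the earlier example. This is an illustrative sketch, not the original implementation: the function names, the dictionary layout of the library, and the choice to treat spaces as word separators and "-" as the gap symbol are all assumptions.

```python
from itertools import combinations


def aligned_pairs(row_a, row_b):
    """Set of (index_in_a, index_in_b) residue pairs aligned by two gapped rows."""
    a, b = row_a.replace(" ", ""), row_b.replace(" ", "")  # spaces only separate words
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same number of columns")
    pairs, ia, ib = set(), 0, 0
    for ca, cb in zip(a, b):
        if ca != "-" and cb != "-":
            pairs.add((ia, ib))
        if ca != "-":
            ia += 1  # advance past a residue of a, never past a gap
        if cb != "-":
            ib += 1
    return pairs


def consistency(msa, library):
    """Pairs shared by the MSA and the library, divided by all pairs in the MSA."""
    shared = total = 0
    for i, j in combinations(sorted(msa), 2):
        msa_pairs = aligned_pairs(msa[i], msa[j])  # pairwise projection of the MSA
        shared += len(msa_pairs & aligned_pairs(*library[(i, j)]))
        total += len(msa_pairs)
    return shared / total if total else 0.0


# Pairwise alignment library from the example slides.
library = {
    ("SeqA", "SeqB"): ("THE LAST FAT CAT", "THE FAST CAT ---"),
    ("SeqA", "SeqC"): ("THE LAST FA-T CAT", "THE VERY FAST CAT"),
    ("SeqA", "SeqD"): ("THE LAST FAT CAT", "THE ---- FAT CAT"),
    ("SeqB", "SeqC"): ("THE ---- FAST CAT", "THE VERY FAST CAT"),
    ("SeqB", "SeqD"): ("THE FAST CAT", "THE FA-T CAT"),
    ("SeqC", "SeqD"): ("THE VERY FAST CAT", "THE ---- FA-T CAT"),
}
# The two candidate MSAs from the example slides.
msa_1 = {"SeqA": "THE LAST FA-T CAT", "SeqB": "THE FAST CA-T ---",
         "SeqC": "THE VERY FAST CAT", "SeqD": "THE ---- FA-T CAT"}
msa_2 = {"SeqA": "THE LAST FA-T CAT", "SeqB": "THE ---- FAST CAT",
         "SeqC": "THE VERY FAST CAT", "SeqD": "THE ---- FA-T CAT"}

print(consistency(msa_1, library), consistency(msa_2, library))
```

Under these assumptions the second candidate shares 53 of its 59 aligned pairs with the library, against 47 of 57 for the first, so the unweighted score prefers the second MSA.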

How COFFEE works To decrease the amount of noise produced by inaccurate pairwise alignments in the library, we set a weight for each of them. The weight equals the percent identity of the two aligned sequences. For example:
SeqA THE LAST FAT CAT
SeqB THE FAST CAT ---
8 of the 13 aligned columns are identical, so the weight is 8/13*100%=61.5%

The idea of weight The lower the weight (the more mismatches in the pairwise alignment), the more distant the two sequences are, and the less important it is to preserve their aligned pairs in the MSA. With the weight taken into account, consistency is enforced only where it matters.
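The weight calculation can be checked with a few lines of code. This is a sketch under two assumptions not stated on the slide: spaces are ignored as word separators, and the denominator is the number of alignment columns (13 in the example). The function name is hypothetical.

```python
def percent_identity(row_a, row_b):
    """Fraction of alignment columns in which both rows carry the same residue."""
    a, b = row_a.replace(" ", ""), row_b.replace(" ", "")  # spaces only separate words
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same number of columns")
    if not a:
        return 0.0
    # A column counts as identical only if it holds two equal residues, not two gaps.
    identical = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return identical / len(a)


weight = percent_identity("THE LAST FAT CAT", "THE FAST CAT ---")
print(f"{weight:.1%}")  # 61.5%
```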

COFFEE Score Aij is the pairwise projection of sequences i and j obtained from an MSA. Len(Aij) is the length of Aij. Wij is the weight of the pairwise alignment of sequences i and j in the library. Score(Aij) is the number of aligned pairs of residues that are shared between Aij and the library.
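The equation that these definitions belong to did not survive in the transcript. Assuming the slide follows the published COFFEE definition, the terms would combine as below; N, the number of sequences, is a symbol introduced here for the summation bounds.

```latex
\[
\text{COFFEE score} \;=\;
\frac{\displaystyle\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} W_{ij}\,\mathrm{Score}(A_{ij})}
     {\displaystyle\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} W_{ij}\,\mathrm{Len}(A_{ij})}
\]
```

Since Score(Aij) can never exceed Len(Aij), the ratio stays between 0 and 1.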

Features of COFFEE There is no gap penalty, since gap information is already contained in the library. The score is normalized by the maximum possible score, so it lies between 0 and 1. The cost of substitution is made position dependent, i.e., we tolerate a mismatch that already occurs in the library.
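A small sketch of the weighted score makes these features concrete. It assumes that Len(Aij) counts the aligned residue pairs in the projection (matching the earlier "total number of pairs in MSA" wording) and that Wij is the percent identity over the columns of the library alignment; both readings, and all names below, are assumptions made for illustration.

```python
from itertools import combinations


def columns(row_a, row_b):
    """Columns of two gapped rows, ignoring spaces and columns that are all gaps."""
    a, b = row_a.replace(" ", ""), row_b.replace(" ", "")
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same number of columns")
    return [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]


def aligned_pairs(cols):
    """Set of (index_in_a, index_in_b) for columns holding two residues."""
    pairs, ia, ib = set(), 0, 0
    for x, y in cols:
        if x != "-" and y != "-":
            pairs.add((ia, ib))
        if x != "-":
            ia += 1
        if y != "-":
            ib += 1
    return pairs


def coffee_score(msa, library):
    """Weighted consistency: sum of W*Score over sum of W*Len (assumed reading)."""
    numerator = denominator = 0.0
    for i, j in combinations(sorted(msa), 2):
        lib_cols = columns(*library[(i, j)])
        # Weight = percent identity of the library alignment for this pair.
        weight = sum(x == y for x, y in lib_cols) / len(lib_cols) if lib_cols else 0.0
        projection = aligned_pairs(columns(msa[i], msa[j]))
        numerator += weight * len(projection & aligned_pairs(lib_cols))
        denominator += weight * len(projection)
    return numerator / denominator if denominator else 0.0


# Toy data (hypothetical): s1 and s2 are close, s3 is distant from both.
library = {("s1", "s2"): ("ACGT", "ACGT"),
           ("s1", "s3"): ("ACGT", "TTTT"),
           ("s2", "s3"): ("ACGT", "TTTT")}
consistent = {"s1": "ACGT", "s2": "ACGT", "s3": "TTTT"}
shifted = {"s1": "ACGT-", "s2": "ACGT-", "s3": "-TTTT"}

print(coffee_score(consistent, library), coffee_score(shifted, library))
```

No gap penalty appears anywhere in the function; gaps only influence which residue pairs exist. In the toy data, misplacing the distant sequence s3 costs little because its library alignments carry a weight of only 0.25.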

Comments on COFFEE

Position-specific issue The current objective function is not position-specific enough. It applies one general weight to a whole pairwise alignment instead of weighting its functional parts. Even a very close alignment has non-functional parts, which contain more mismatches.

Distant and close alignments A close alignment example:
THE -FIRST GULF WAR IS FOR JUSTICE
|||   ||   |||| ||| || |||     |
THE THIRD- GULF WAR IS FOR ---OIL-
A distant alignment example:
GO ATTACK THIS WEAK BUT EVIL IRAQ--
          ||            ||||
DUN TOUCH THE ARMED AND EVIL NKOREA

Position-specific issue The current score function places the same weight on such unimportant sections as on the rest of the alignment. It does reduce the amount of noise produced by inaccurate alignments of distant sequences. However, it fails to do so for close ones. It also gives a lower weight to the functional parts of distant sequences.

Revision of COFFEE Score(Aijl) = 1 when the residue pair at position l of the projection of sequences i and j also occurs in the library; otherwise it is 0. W(Aijl) = 1 when the two residues of that pair in the library are identical; otherwise it is k (0<=k<1).

Features of the revision Drop the idea of a single overall weight per pairwise alignment. Instead, check the identity of each pair of residues. The value of k depends on how we evaluate a mismatch. It could be set according to a substitution matrix.

Alternative alignment Although a pairwise alignment is optimal, it is optimal only under its constraints, such as the gap penalty. Different constraints generate alignments suited to different purposes. Instead of only one alignment for each possible pair of sequences in the library, we could add its alternative alignment(s) so as to include more information.

Alternative alignment When using a library with alternative alignments, we have to apply the revision of COFFEE introduced previously. Otherwise, pairs from different alignments of the same two sequences would have to share a single weight. However, until now scientists have only weighed different alignments produced under the same constraints. How to weigh alignments produced under different constraints is a new challenge.
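The revised, per-position scoring can be sketched as follows. The slide gives only the two definitions, so the way they are combined here is an assumption: the numerator sums W over the MSA pairs found in the library, and the denominator sums W over all MSA pairs so that the result stays between 0 and 1. The default k=0.5 and all names are hypothetical.

```python
from itertools import combinations


def aligned_pairs(row_a, row_b):
    """Map each aligned pair (index_in_a, index_in_b) to its two residues."""
    a, b = row_a.replace(" ", ""), row_b.replace(" ", "")  # spaces only separate words
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same number of columns")
    pairs, ia, ib = {}, 0, 0
    for x, y in zip(a, b):
        if x != "-" and y != "-":
            pairs[(ia, ib)] = (x, y)
        if x != "-":
            ia += 1
        if y != "-":
            ib += 1
    return pairs


def revised_score(msa, library, k=0.5):
    """Per-position weighted consistency (normalization is an assumption)."""
    if not 0 <= k < 1:
        raise ValueError("k must satisfy 0 <= k < 1")
    numerator = denominator = 0.0
    for i, j in combinations(sorted(msa), 2):
        lib_pairs = aligned_pairs(*library[(i, j)])
        for position, (x, y) in aligned_pairs(msa[i], msa[j]).items():
            w = 1.0 if x == y else k  # identical residues get full weight
            denominator += w
            if position in lib_pairs:  # Score = 1 only if the pair is in the library
                numerator += w
    return numerator / denominator if denominator else 0.0


# Toy data (hypothetical): the library aligns C and the first G of s2 to gaps,
# while the MSA pairs them with each other as a mismatch.
library = {("s1", "s2"): ("AC-GT", "A-GGT")}
msa = {"s1": "ACGT", "s2": "AGGT"}
print(revised_score(msa, library, k=0.5))
```

With k=0.5 the single unsupported pair is a mismatch, so it removes only 0.5 from the numerator and the score is 3/3.5 instead of the unweighted 3/4.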

Conclusion COFFEE evaluates the consistency of each pairwise projection of an MSA with the corresponding pairwise alignment in the library. COFFEE can be used as the judging criterion in an iterative MSA algorithm. COFFEE is not position-specific enough to filter the noise caused by inaccurate alignments, which led to the revision proposed by our group. Alternative pairwise alignments could be added to the library to include more information about the sequences.

Thanks for your attention! Wangyi@bii.a-star.edu.sg Feb 20th, 2003