ZORRO : A masking program for incorporating Alignment Accuracy in Phylogenetic Inference Sourav Chatterji Martin Wu.



Probabilistic Masking using pair-HMMs
Probabilistic formulation of the alignment problem.
Can answer additional questions:
– Alignment reliability
– Sub-optimal alignments
Durbin et al., Cambridge University Press (1998)

Probabilistic Masking
What is the probability that residues x_i and y_j are homologous?
Posterior probability that the residues x_i and y_j are homologous (written x_i ~ y_j):
$$\Pr[x_i \sim y_j \mid x, y] = \frac{\Pr[x_i \sim y_j,\; x, y]}{\Pr[x, y]}$$
Can be calculated efficiently for all pairs (and gaps) in quadratic time.
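The posterior above can be computed with the forward and backward recursions of a pair-HMM. The following is only a minimal sketch in the style of the three-state model (match, gap-in-y, gap-in-x) from Durbin et al.; the function name `match_posteriors`, the parameter values `delta`, `eps`, `tau`, `p_same`, and the toy emission model are all assumptions made for illustration and are not taken from the slides or from ZORRO itself.

```python
def match_posteriors(x, y, delta=0.2, eps=0.3, tau=0.1, p_same=0.6, alphabet="ACGT"):
    """Posterior Pr[x_i ~ y_j | x, y] under a toy 3-state pair-HMM (assumed parameters)."""
    n, m, k = len(x), len(y), len(alphabet)
    q = 1.0 / k  # assumed uniform emission for residues aligned to a gap

    def p(a, b):
        # Assumed joint emission for a matched pair; sums to 1 over all pairs.
        return p_same / k if a == b else (1.0 - p_same) / (k * (k - 1))

    mm = 1.0 - 2.0 * delta - tau  # match -> match
    gm = 1.0 - eps - tau          # gap   -> match

    def grid():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward pass: f*[i][j] = probability of emitting x[:i], y[:j] and ending in that state.
    fM, fX, fY = grid(), grid(), grid()
    fM[0][0] = 1.0  # the start state behaves like the match state
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                fM[i][j] = p(x[i - 1], y[j - 1]) * (
                    mm * fM[i - 1][j - 1] + gm * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = q * (delta * fM[i - 1][j] + eps * fX[i - 1][j])
            if j > 0:
                fY[i][j] = q * (delta * fM[i][j - 1] + eps * fY[i][j - 1])
    total = tau * (fM[n][m] + fX[n][m] + fY[n][m])  # Pr[x, y]

    # Backward pass: b*[i][j] = probability of emitting the remaining suffixes.
    bM, bX, bY = grid(), grid(), grid()
    bM[n][m] = bX[n][m] = bY[n][m] = tau
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            diag = p(x[i], y[j]) * bM[i + 1][j + 1] if i < n and j < m else 0.0
            down = q * bX[i + 1][j] if i < n else 0.0
            right = q * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = mm * diag + delta * (down + right)
            bX[i][j] = gm * diag + eps * down
            bY[i][j] = gm * diag + eps * right

    # Posterior that x_i and y_j are emitted together by the match state.
    post = [[fM[i][j] * bM[i][j] / total for j in range(1, m + 1)]
            for i in range(1, n + 1)]
    return post, total, bM[0][0]


post, total, total_from_backward = match_posteriors("ACGT", "AGT")
```

Both passes fill an (n+1) by (m+1) table with constant work per cell, which is where the quadratic running time comes from. The third return value is the same Pr[x, y] obtained from the backward table and can be used as a consistency check.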

An Ideal Weighting Scheme
Accounts for correlations between pairs
– e.g. A-C and A-D
Accounts for distance between the sequences in a pair
– e.g. C-D

The Zorro Weighting Scheme
Calculate N_e, the number of pairs that share an edge e.

The Zorro Weighting Scheme
Normalize the edge weight by N_e.
Weight of a pair is the sum of the normalized weights of the edges on the path.
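The two weighting steps can be sketched on a small tree as follows. This is an illustration only: the slides do not say what the "edge weight" is, so the sketch assumes it is the branch length of a guide tree, and the tree encoding (`parent` and `length` dictionaries keyed by child node) and the function name `pair_weights` are hypothetical.

```python
from itertools import combinations


def pair_weights(parent, length):
    """Weight of each leaf pair = sum over path edges of (edge length / N_e).

    parent: child -> parent node (hypothetical rooted encoding of the tree)
    length: child -> length of the edge above that child (assumed edge weight)
    """
    internal = set(parent.values())
    leaves = sorted(node for node in parent if node not in internal)

    def edges_to_root(node):
        edges = set()
        while node in parent:      # each edge is named after its child node
            edges.add(node)
            node = parent[node]
        return edges

    # Edges on the path between two leaves: those on exactly one root path.
    paths = {(a, b): edges_to_root(a) ^ edges_to_root(b)
             for a, b in combinations(leaves, 2)}

    # N_e: number of leaf pairs whose connecting path uses edge e.
    n_e = {}
    for edges in paths.values():
        for e in edges:
            n_e[e] = n_e.get(e, 0) + 1

    return {pair: sum(length[e] / n_e[e] for e in edges)
            for pair, edges in paths.items()}


# Toy tree ((A:1,B:1):2,(C:1,D:3):2), invented for this example.
parent = {"A": "AB", "B": "AB", "C": "CD", "D": "CD", "AB": "root", "CD": "root"}
length = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 3.0, "AB": 2.0, "CD": 2.0}
weights = pair_weights(parent, length)
```

Because each edge's length is split evenly among the N_e pairs that use it, the pair weights add up to the total tree length (10 in this toy tree), so an edge shared by many pairs is not counted many times over.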

Scoring Multiple Alignment Columns
Calculate the “posterior probability matrix” and weights w_ij for every pair of sequences.
Weighted “sum of pairs” score for column r:
$$\frac{\sum_{i,j} w_{ij}\,\Pr[r_i \sim r_j]}{\sum_{i,j} w_{ij}}$$
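As a small worked instance of the weighted sum-of-pairs score, the sketch below takes the posterior probability of each aligned residue pair in one column together with the pair weights and returns their weighted average. The function name `column_score` and the numbers are made up for illustration; how pairs involving a gap are handled is not stated on the slide, so the sketch simply scores the pairs it is given.

```python
def column_score(pair_posteriors, pair_weights):
    """Weighted average of pairwise posteriors for one alignment column."""
    numerator = sum(pair_weights[pair] * prob for pair, prob in pair_posteriors.items())
    denominator = sum(pair_weights[pair] for pair in pair_posteriors)
    return numerator / denominator


# Hypothetical column with three sequences A, B, C.
posteriors = {("A", "B"): 0.9, ("A", "C"): 0.5, ("B", "C"): 0.1}
weights = {("A", "B"): 1.0, ("A", "C"): 1.0, ("B", "C"): 2.0}
score = column_score(posteriors, weights)  # (0.9 + 0.5 + 0.2) / 4
```

Dividing by the sum of the weights keeps the score between 0 and 1, so it can be read as a confidence value for the column.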

Some Notes
Improve running time
– Sample a subset of pairs
– Performance is nearly the same
Using confidence scores
– Cutoff-based scheme (we use 0.5)
– Weighted sampling of columns according to confidence scores
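The two ways of using the confidence scores might look like this in practice. This is a sketch under assumptions: the function names are hypothetical, the slide does not say whether a column scoring exactly 0.5 is kept (the sketch keeps it), and sampling with replacement is assumed for the weighted scheme.

```python
import random


def mask_by_cutoff(scores, cutoff=0.5):
    """Indices of columns whose confidence is at least the cutoff (assumed >=)."""
    return [i for i, s in enumerate(scores) if s >= cutoff]


def sample_columns(scores, k, seed=None):
    """Draw k column indices with replacement, proportionally to confidence."""
    rng = random.Random(seed)
    return rng.choices(range(len(scores)), weights=scores, k=k)


scores = [0.9, 0.2, 0.5, 0.7]            # made-up column confidences
kept = mask_by_cutoff(scores)            # columns 0, 2 and 3 pass the cutoff
resampled = sample_columns(scores, k=10, seed=1)
```

The cutoff gives a hard mask, while weighted sampling keeps low-confidence columns in play but makes them appear less often.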

Testing
The BAliBASE 3.0 Benchmark Database

Testing
Realign sequences using MSA programs such as Clustalw.
Sensitivity: of all correctly aligned columns, the fraction masked as good.
Specificity: of all incorrectly aligned columns, the fraction masked as bad.
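The two measures defined above can be computed from per-column flags as in the following sketch. The function name `masking_accuracy` and the example flags are hypothetical, and the sketch assumes there is at least one correctly aligned and one incorrectly aligned column.

```python
def masking_accuracy(column_is_correct, column_kept):
    """Return (sensitivity, specificity) of a mask against a reference alignment.

    column_is_correct[i]: column i agrees with the reference alignment
    column_kept[i]:       the masking program marked column i as good
    """
    pairs = list(zip(column_is_correct, column_kept))
    correct_kept = sum(1 for ok, kept in pairs if ok and kept)
    correct_total = sum(1 for ok, _ in pairs if ok)
    wrong_masked = sum(1 for ok, kept in pairs if not ok and not kept)
    wrong_total = sum(1 for ok, _ in pairs if not ok)
    return correct_kept / correct_total, wrong_masked / wrong_total


is_correct = [True, True, True, False, False]
is_kept = [True, True, False, False, True]
sensitivity, specificity = masking_accuracy(is_correct, is_kept)
```

Here two of the three correct columns are kept and one of the two incorrect columns is masked, giving a sensitivity of 2/3 and a specificity of 1/2.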

Performance
Gblocks ZORRO
Sensitivity Specificity
96.3% 95.1%
54.4% 94.7%

Effect on Phylogenetic Inference
Gblocks data-set
– Protein sequences obtained by simulating evolution on known trees
– Diversity in data-set
  Topology (Symmetric/Asymmetric)
  Evolutionary Rates
  Alignment Lengths (not tested yet)

Effect on Phylogenetic Inference
Clustalw alignments, PhyML tree

Protocol        Symmetric Tree Inference Accuracy    Asymmetric Tree Inference Accuracy
No Masking      95.17%                               91.95%
Gblocks         84.14%                               86.44%
Prob. Masking   93.56%                               93.33%

Effect on Phylogenetic Inference
MAFFT alignments, PhyML tree

Protocol        Symmetric Tree Inference Accuracy    Asymmetric Tree Inference Accuracy
                All        High Support              All        High Support
No Masking      94.25%     69.23%                    91.95%     57.44%
Gblocks         89.2%      57.44%                    90.80%     51.88%
Prob. Masking   94.02%     68.21%                    93.79%     62.05%

Effect on Phylogenetic Inference
Clustalw alignments, PhyML tree

Protocol        Symmetric Tree Inference Accuracy    Asymmetric Tree Inference Accuracy
                All        High Support              All        High Support
No Masking      95.17%     62.05%                    91.95%     55.38%
Gblocks         84.14%     41.03%                    86.44%     37.95%
Prob. Masking   93.56%     72.31%                    93.33%     63.59%

Effect on Phylogenetic Inference
Muscle alignments, PhyML tree

Protocol        Symmetric Tree Inference Accuracy    Asymmetric Tree Inference Accuracy
                All        High Support              All        High Support
No Masking      94.71%     71.28%                    93.10%     61.03%
Gblocks         89.43%     57.95%                    90.11%     50.26%
Prob. Masking   93.56%     70.77%                    95.17%     64.62%

Conclusions/Future Work
Technical Issues
– What if a few sequences are “bad”/non-homologous?
– Incorporate reliability in likelihood equation and Bayesian methods. With Dr. Darling in July
Testing
– “Real” Data Sets?