ZORRO: A Masking Program for Incorporating Alignment Accuracy in Phylogenetic Inference
Sourav Chatterji, Martin Wu
Probabilistic Masking using pair-HMMs
– Probabilistic formulation of the alignment problem.
– Can answer additional questions:
  – Alignment reliability
  – Sub-optimal alignments
Durbin et al., Cambridge University Press (1998)
Probabilistic Masking
What is the probability that residues $x_i$ and $y_j$ are homologous?
Posterior probability that the residues $x_i$ and $y_j$ are homologous, written here as $x_i \sim y_j$:
$$\Pr[x_i \sim y_j \mid x, y] = \frac{\Pr[x_i \sim y_j,\, x,\, y]}{\Pr[x, y]}$$
Can be calculated efficiently for all pairs (and gaps) in quadratic time.
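To make the quadratic-time claim concrete, here is a minimal sketch of the forward and backward recursions for a three-state pair-HMM (match, gap in y, gap in x) in the style of Durbin et al. The transition parameters `delta` and `eps`, the toy emission values, and the omission of an explicit end state are all assumptions made for illustration; they are not ZORRO's actual parameters.

```python
def pair_hmm_posteriors(x, y, delta=0.1, eps=0.4):
    """Posterior probability that x[i] and y[j] are aligned (toy pair-HMM)."""
    n, m = len(x), len(y)
    q = 1.0 / 20  # assumed gap-state emission: uniform over 20 residues

    def p(a, b):
        # Assumed match-state emission: identical residues are favoured.
        return 0.04 if a == b else 0.2 / 380

    def grid():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward pass: f*[i][j] covers the prefixes x[:i] and y[:j].
    fM, fX, fY = grid(), grid(), grid()
    fM[0][0] = 1.0  # the begin state behaves like the match state
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = p(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta) * fM[i - 1][j - 1]
                    + (1 - eps) * (fX[i - 1][j - 1] + fY[i - 1][j - 1])
                )
            if i > 0:
                fX[i][j] = q * (delta * fM[i - 1][j] + eps * fX[i - 1][j])
            if j > 0:
                fY[i][j] = q * (delta * fM[i][j - 1] + eps * fY[i][j - 1])
    total = fM[n][m] + fX[n][m] + fY[n][m]  # Pr[x, y] under this toy model

    # Backward pass: b*[i][j] covers the suffixes x[i:] and y[j:].
    bM, bX, bY = grid(), grid(), grid()
    bM[n][m] = bX[n][m] = bY[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            match = p(x[i], y[j]) * bM[i + 1][j + 1] if i < n and j < m else 0.0
            gap_x = q * bX[i + 1][j] if i < n else 0.0
            gap_y = q * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = (1 - 2 * delta) * match + delta * (gap_x + gap_y)
            bX[i][j] = (1 - eps) * match + eps * gap_x
            bY[i][j] = (1 - eps) * match + eps * gap_y

    # post[i][j] is the posterior for residues x[i] and y[j] (0-indexed).
    post = [
        [fM[i][j] * bM[i][j] / total for j in range(1, m + 1)]
        for i in range(1, n + 1)
    ]
    return post, total


post_same, _ = pair_hmm_posteriors("ACDEFG", "ACDEFG")
post_del, _ = pair_hmm_posteriors("ACDEFG", "ACEFG")
```

Both passes fill an (n+1) by (m+1) table with constant work per cell, which is where the quadratic running time comes from.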
An Ideal Weighting Scheme
– Accounts for correlations between pairs, e.g. A-C and A-D
– Accounts for the distance between the sequences in a pair, e.g. C-D
The Zorro Weighting Scheme
Calculate $N_e$, the number of sequence pairs whose connecting path in the tree passes through edge $e$.
The Zorro Weighting Scheme
– Normalize the weight of each edge $e$ by $N_e$.
– The weight of a pair is the sum of the normalized weights of the edges on the path between its two sequences.
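The two steps of the weighting scheme can be sketched on a small four-leaf tree. The tree shape, the branch lengths, the use of branch length as the edge weight, and the function names are assumptions chosen for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def path_edges(adj, src, dst):
    """Return the edges on the unique tree path from src to dst."""
    stack = [(src, None, [])]
    while stack:
        node, parent, edges = stack.pop()
        if node == dst:
            return edges
        for nxt in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, edges + [frozenset((node, nxt))]))
    raise ValueError("no path between %r and %r" % (src, dst))


def pair_weights(edge_lengths, leaves):
    """Weight of each leaf pair: sum over path edges of length / N_e."""
    adj = defaultdict(list)
    for edge in edge_lengths:
        a, b = tuple(edge)
        adj[a].append(b)
        adj[b].append(a)
    paths = {pair: path_edges(adj, *pair) for pair in combinations(leaves, 2)}
    # N_e: number of leaf pairs whose path uses edge e.
    n_e = defaultdict(int)
    for edges in paths.values():
        for e in edges:
            n_e[e] += 1
    return {
        pair: sum(edge_lengths[e] / n_e[e] for e in edges)
        for pair, edges in paths.items()
    }


# Hypothetical tree: A and B join at u, C and D join at v; C and D are close.
edge_lengths = {
    frozenset(("A", "u")): 1.0,
    frozenset(("B", "u")): 1.0,
    frozenset(("u", "v")): 2.0,
    frozenset(("C", "v")): 0.1,
    frozenset(("D", "v")): 0.1,
}
weights = pair_weights(edge_lengths, ["A", "B", "C", "D"])
```

Because each edge's weight is split evenly among the pairs that use it, the pair weights sum to the total edge weight of the tree (4.2 here). The closely related pair C-D receives a small weight, and the pairs A-C and A-D share the cost of the edges they have in common.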
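Scoring Multiple Alignment Columns
Calculate the “posterior probability matrix” and weights $w_{ij}$ for every pair of sequences.
Weighted “sum of pairs” score for column $r$, where $r_i$ is the residue of sequence $i$ in column $r$ and $\Pr[r_i \sim r_j]$ is the posterior probability that $r_i$ and $r_j$ are homologous:
$$\frac{\sum_{i,j} w_{ij} \Pr[r_i \sim r_j]}{\sum_{i,j} w_{ij}}$$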
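As a small worked example of the weighted sum-of-pairs score, the sketch below divides the weighted sum of pairwise posteriors by the sum of the weights. The function name, the dictionary layout, and the numbers are hypothetical, and how gapped positions enter the score is not covered here.

```python
def column_score(posteriors, weights):
    """Weighted sum-of-pairs confidence score for one alignment column.

    posteriors[pair] is the posterior that the two residues of that
    sequence pair in this column are homologous; weights[pair] is the
    pair weight.
    """
    numerator = sum(weights[pair] * posteriors[pair] for pair in weights)
    denominator = sum(weights.values())
    return numerator / denominator


# Hypothetical column from a three-sequence alignment.
example_weights = {("A", "B"): 0.5, ("A", "C"): 1.0, ("B", "C"): 1.0}
example_posteriors = {("A", "B"): 0.9, ("A", "C"): 0.4, ("B", "C"): 0.6}
score = column_score(example_posteriors, example_weights)
# (0.5 * 0.9 + 1.0 * 0.4 + 1.0 * 0.6) / 2.5 = 0.58
```

Since every posterior lies between 0 and 1 and the weights are non-negative, the score is itself between 0 and 1, which is what makes a fixed cutoff meaningful.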
Some Notes
Improve running time
– Sample a subset of pairs
– Performance is almost the same
Using confidence scores
– Cutoff-based scheme (we use 0.5)
– Weighted sampling of columns according to confidence scores
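The two ways of using confidence scores can be sketched as follows. Keeping columns whose score is at least the cutoff (rather than strictly above it) and sampling with replacement are assumptions, as are the function names and the example scores.

```python
import random


def mask_by_cutoff(scores, cutoff=0.5):
    """Indices of columns kept under a cutoff-based scheme (assumes >=)."""
    return [i for i, s in enumerate(scores) if s >= cutoff]


def sample_columns(scores, k, seed=0):
    """Draw k column indices with probability proportional to their score.

    Sampling is with replacement, in the spirit of a weighted bootstrap;
    this is an assumption, not a documented ZORRO behaviour.
    """
    rng = random.Random(seed)
    return rng.choices(range(len(scores)), weights=scores, k=k)


column_scores = [0.9, 0.0, 0.5, 0.7, 0.2]  # hypothetical confidence scores
kept = mask_by_cutoff(column_scores)
sampled = sample_columns(column_scores, k=20)
```

The cutoff gives a hard yes/no mask, while weighted sampling lets low-confidence columns contribute occasionally instead of discarding them outright.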
Testing
The BAliBASE 3.0 benchmark database
Testing
Realign sequences using MSA programs such as Clustalw.
– Sensitivity: of all correctly aligned columns, the fraction masked as good
– Specificity: of all incorrectly aligned columns, the fraction masked as bad
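The two measures can be computed per alignment as in the sketch below, assuming each column carries a flag for whether it is correctly aligned according to the reference and a flag for whether the masking program marked it as good. The function and variable names, and the example flags, are hypothetical.

```python
def sensitivity_specificity(is_correct, masked_good):
    """Return (sensitivity, specificity) for a column mask.

    is_correct[i]  -- column i is correctly aligned per the reference
    masked_good[i] -- the masking program marked column i as good
    """
    pairs = list(zip(is_correct, masked_good))
    n_correct = sum(1 for c, _ in pairs if c)
    n_wrong = len(pairs) - n_correct
    correct_kept = sum(1 for c, g in pairs if c and g)
    wrong_removed = sum(1 for c, g in pairs if not c and not g)
    # Undefined when there are no columns of the relevant kind.
    sens = correct_kept / n_correct if n_correct else float("nan")
    spec = wrong_removed / n_wrong if n_wrong else float("nan")
    return sens, spec


# Hypothetical five-column alignment.
is_correct = [True, True, True, False, False]
masked_good = [True, True, False, False, True]
sens, spec = sensitivity_specificity(is_correct, masked_good)
# Two of three correct columns are kept; one of two incorrect columns is removed.
```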
Performance Gblocks ZORRO Sensitivity Specificity 96.3% 95.1% 54.4% 94.7%
Effect on Phylogenetic Inference
Gblocks data-set
– Protein sequences obtained by simulating evolution on known trees
– Diversity in data-set
  – Topology (Symmetric/Asymmetric)
  – Evolutionary rates
  – Alignment lengths (not tested yet)
Effect on Phylogenetic Inference
Clustalw alignments, PhyML tree

Protocol        Symmetric Tree Inference Accuracy   Asymmetric Tree Inference Accuracy
No Masking      95.17%                              91.95%
Gblocks         84.14%                              86.44%
Prob. Masking   93.56%                              93.33%
Effect on Phylogenetic Inference
MAFFT alignments, PhyML tree

                Symmetric Tree Inference Accuracy   Asymmetric Tree Inference Accuracy
Protocol        All        High Support             All        High Support
No Masking      94.25%     69.23%                   91.95%     57.44%
Gblocks         89.2%      57.44%                   90.80%     51.88%
Prob. Masking   94.02%     68.21%                   93.79%     62.05%
Effect on Phylogenetic Inference
Clustalw alignments, PhyML tree

                Symmetric Tree Inference Accuracy   Asymmetric Tree Inference Accuracy
Protocol        All        High Support             All        High Support
No Masking      95.17%     62.05%                   91.95%     55.38%
Gblocks         84.14%     41.03%                   86.44%     37.95%
Prob. Masking   93.56%     72.31%                   93.33%     63.59%
Effect on Phylogenetic Inference
Muscle alignments, PhyML tree

                Symmetric Tree Inference Accuracy   Asymmetric Tree Inference Accuracy
Protocol        All        High Support             All        High Support
No Masking      94.71%     71.28%                   93.10%     61.03%
Gblocks         89.43%     57.95%                   90.11%     50.26%
Prob. Masking   93.56%     70.77%                   95.17%     64.62%
Conclusions/Future Work
Technical issues
– What if a few sequences are “bad”/non-homologous?
– Incorporate reliability in likelihood equation and Bayesian methods.
  – With Dr. Darling in July
Testing
– “Real” data sets?