ZORRO : A masking program for incorporating Alignment Accuracy in Phylogenetic Inference Sourav Chatterji Martin Wu.

1 ZORRO : A masking program for incorporating Alignment Accuracy in Phylogenetic Inference Sourav Chatterji Martin Wu

2

3 Probabilistic Masking using pair-HMMs Probabilistic formulation of alignment problem. Can answer additional questions – Alignment Reliability – Sub-optimal Alignments Durbin et al., Cambridge University Press (1998)

4 Probabilistic Masking What is the probability that residues $x_i$ and $y_j$ are homologous? The posterior probability that residues $x_i$ and $y_j$ are homologous, $\Pr[x_i \diamond y_j \mid x, y] = \dfrac{\Pr[x_i \diamond y_j,\ x,\ y]}{\Pr[x, y]}$, can be calculated efficiently for all pairs (and gaps) in quadratic time.
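A minimal sketch of how such posteriors might be computed with the forward-backward algorithm on a three-state pair-HMM (match, gap in y, gap in x), in the spirit of Durbin et al. Everything specific here is an assumption made for illustration and is not taken from ZORRO: the function name `match_posteriors`, the gap-open and gap-extend parameters `delta` and `eps`, the toy emission probabilities, and the omission of an explicit end state.

```python
def match_posteriors(x, y, delta=0.1, eps=0.3):
    """Posterior Pr[x_i aligned to y_j | x, y] under a toy three-state pair-HMM."""
    n, m = len(x), len(y)
    q = 1 / 20                        # gap-state emission: uniform over 20 residues

    def p(a, b):                      # toy match-state emission, not a real matrix
        return 0.03 if a == b else 0.4 / 380

    t_mm, t_mg = 1 - 2 * delta, delta     # M->M, and M->X or M->Y
    t_gm, t_gg = 1 - eps, eps             # gap->M, and gap->same gap state

    def zeros():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward pass: f*[i][j] sums every partial alignment of x[:i] and y[:j]
    # that ends in the given state.
    fM, fX, fY = zeros(), zeros(), zeros()
    fM[0][0] = 1.0                    # the begin state is treated like M
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = p(x[i - 1], y[j - 1]) * (
                    t_mm * fM[i - 1][j - 1]
                    + t_gm * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = q * (t_mg * fM[i - 1][j] + t_gg * fX[i - 1][j])
            if j > 0:
                fY[i][j] = q * (t_mg * fM[i][j - 1] + t_gg * fY[i][j - 1])
    total = fM[n][m] + fX[n][m] + fY[n][m]    # Pr[x, y] under the toy model

    # Backward pass: b*[i][j] sums every completion of the alignment
    # given the state at cell (i, j).
    bM, bX, bY = zeros(), zeros(), zeros()
    bM[n][m] = bX[n][m] = bY[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            mm = p(x[i], y[j]) * bM[i + 1][j + 1] if i < n and j < m else 0.0
            xx = q * bX[i + 1][j] if i < n else 0.0
            yy = q * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = t_mm * mm + t_mg * (xx + yy)
            bX[i][j] = t_gm * mm + t_gg * xx
            bY[i][j] = t_gm * mm + t_gg * yy

    # Posterior for each residue pair: joint probability divided by Pr[x, y].
    return [[fM[i][j] * bM[i][j] / total for j in range(1, m + 1)]
            for i in range(1, n + 1)]


post = match_posteriors("ACDEFGHIKL", "ACDEFGHIKL")
```

Both passes visit each of the (n+1)(m+1) cells once, which is where the quadratic running time comes from. A real implementation would work in log space or with scaling to avoid underflow on long sequences.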

5 An Ideal Weighting Scheme Accounts for correlations between pairs – e.g. A-C and A-D Accounts for distance between the sequences in a pair – e.g. C-D

6 The Zorro Weighting Scheme [Figure: tree with edges labeled 4, 3, 3, 3, 3] Calculate N_e, the number of pairs that share an edge e.

7 The Zorro Weighting Scheme [Figure: tree with edges labeled 4, 3, 3, 3, 3] Normalize the edge weight by N_e. The weight of a pair is the sum of the normalized weights of the edges on the path.
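The two steps of the weighting scheme can be sketched as follows, assuming the "edge weight" is a branch length on a guide tree. The tree encoding (a dict of undirected edges), the function names `path_edges` and `pair_weights`, and the hypothetical four-leaf tree with internal nodes `u` and `v` are illustrative choices, not ZORRO's actual data structures.

```python
from itertools import combinations


def path_edges(adj, src, dst):
    """Return the edges on the unique path between two nodes of a tree."""
    stack = [(src, None, [])]
    while stack:
        node, parent, edges = stack.pop()
        if node == dst:
            return edges
        for nxt in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, edges + [frozenset((node, nxt))]))
    raise ValueError("nodes are not connected")


def pair_weights(edge_lengths, leaves):
    """Compute N_e for every edge and the weight w_ij for every leaf pair."""
    adj = {}
    for u, v in edge_lengths:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    length = {frozenset(edge): w for edge, w in edge_lengths.items()}

    paths = {pair: path_edges(adj, *pair) for pair in combinations(leaves, 2)}

    # N_e: the number of leaf pairs whose connecting path uses edge e.
    n_e = {}
    for edges in paths.values():
        for e in edges:
            n_e[e] = n_e.get(e, 0) + 1

    # Pair weight: sum of length(e) / N_e over the edges on the pair's path.
    weights = {pair: sum(length[e] / n_e[e] for e in edges)
               for pair, edges in paths.items()}
    return n_e, weights


# Hypothetical tree ((A,B),(C,D)) with assumed branch lengths.
edge_lengths = {("A", "u"): 1.0, ("B", "u"): 1.0, ("u", "v"): 2.0,
                ("C", "v"): 1.0, ("D", "v"): 1.0}
n_e, weights = pair_weights(edge_lengths, ["A", "B", "C", "D"])
```

On this tree the internal edge is shared by four pairs and each leaf edge by three. Because every edge contributes its length divided by N_e to exactly N_e pairs, the pair weights sum to the total tree length, so closely related pairs that share most of their path do not get counted as independent evidence.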

8 Scoring Multiple Alignment Columns Calculate the “posterior probability matrix” and weights $w_{ij}$ for every pair of sequences. Weighted “sum of pairs” score for column $r$: $\dfrac{\sum_{i,j} w_{ij} \Pr[r_i \diamond r_j]}{\sum_{i,j} w_{ij}}$
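As a small illustration of the weighted sum-of-pairs score, the sketch below assumes the per-pair posteriors for one column have already been looked up. The function name `column_score`, the dict layout keyed by sequence-name pairs, and the numbers are all made up for the example.

```python
def column_score(pair_prob, weights):
    """Weighted average of pairwise posteriors for a single alignment column.

    pair_prob[(a, b)]: posterior that the residues (or gaps) of sequences
                       a and b in this column are correctly paired.
    weights[(a, b)]:   weight w_ij of that sequence pair.
    """
    total_weight = sum(weights[pair] for pair in pair_prob)
    weighted_sum = sum(weights[pair] * prob for pair, prob in pair_prob.items())
    return weighted_sum / total_weight


# Illustrative values only: three sequences, one column.
weights = {("A", "B"): 2 / 3, ("A", "C"): 7 / 6, ("B", "C"): 7 / 6}
pair_prob = {("A", "B"): 0.9, ("A", "C"): 0.5, ("B", "C"): 0.4}
score = column_score(pair_prob, weights)
```

Dividing by the sum of the weights keeps the score between 0 and 1, so it can be read as a per-column confidence.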

9 Some Notes Improve Running Time – Sample a subset of pairs – Performance is nearly the same Using Confidence Scores – Cutoff-Based Scheme (we use 0.5) – Weighted sampling of columns according to confidence scores.
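The two ways of using confidence scores could look like the sketch below. It assumes that columns scoring at or above the cutoff are kept and that weighted sampling is done with replacement, as in a bootstrap; the slides specify neither detail, and the function names are hypothetical.

```python
import random


def keep_by_cutoff(scores, cutoff=0.5):
    """Indices of columns whose confidence score reaches the cutoff."""
    return [i for i, s in enumerate(scores) if s >= cutoff]


def sample_columns(scores, k, seed=0):
    """Draw k column indices with replacement, proportional to confidence."""
    rng = random.Random(seed)
    return rng.choices(range(len(scores)), weights=scores, k=k)


scores = [0.9, 0.2, 0.5, 0.7]           # illustrative column scores
kept = keep_by_cutoff(scores)            # columns 0, 2 and 3 pass the 0.5 cutoff
resampled = sample_columns(scores, k=10)
```

The cutoff discards low-confidence columns outright, while weighted sampling lets them contribute occasionally, in proportion to their score.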

10 Testing The Balibase 3.0 Benchmark Database

11 Testing Realign sequences using MSA programs like Clustalw. Sensitivity: for all correctly aligned columns, the fraction that has been masked as good Specificity: for all incorrectly aligned columns, the fraction that has been masked as bad
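The two measures defined above can be computed as in this minimal sketch. It assumes each column of the realigned alignment has already been labeled as correct or incorrect against the benchmark reference; the function name and the boolean-list encoding are illustrative.

```python
def sensitivity_specificity(correct, kept):
    """correct[c]: column c is correctly aligned according to the reference.
    kept[c]:    column c was masked as good (retained) by the masking program.
    """
    good = [k for c, k in zip(correct, kept) if c]        # correct columns
    bad = [k for c, k in zip(correct, kept) if not c]     # incorrect columns
    if not good or not bad:
        raise ValueError("need at least one correct and one incorrect column")
    sensitivity = sum(good) / len(good)                   # correct and kept
    specificity = sum(not k for k in bad) / len(bad)      # incorrect and masked
    return sensitivity, specificity


sens, spec = sensitivity_specificity(
    correct=[True, True, True, False, False],
    kept=[True, True, False, False, True],
)
```

In this toy input two of three correct columns are kept and one of two incorrect columns is masked.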

12 Performance Gblocks ZORRO Sensitivity Specificity 96.3% 95.1% 54.4% 94.7%

13 Effect on Phylogenetic Inference Gblocks data-set – Protein Sequences obtained by simulating evolution on known trees – Diversity in data-set Topology (Symmetric/Asymmetric) Evolutionary Rates Alignment Lengths (not tested yet)

14 Effect on Phylogenetic Inference (Clustalw alignments, PhyML tree)
Protocol         Symmetric Tree Inference Accuracy   Asymmetric Tree Inference Accuracy
No Masking       95.17%                              91.95%
Gblocks          84.14%                              86.44%
Prob. Masking    93.56%                              93.33%

15 Effect on Phylogenetic Inference (MAFFT alignments, PhyML tree)
Tree inference accuracy, over all trees ("All") and over high-support trees ("High Support"):
Protocol         Symmetric, All   Symmetric, High Support   Asymmetric, All   Asymmetric, High Support
No Masking       94.25%           69.23%                    91.95%            57.44%
Gblocks          89.2%            57.44%                    90.80%            51.88%
Prob. Masking    94.02%           68.21%                    93.79%            62.05%

16 Effect on Phylogenetic Inference (Clustalw alignments, PhyML tree)
Tree inference accuracy, over all trees ("All") and over high-support trees ("High Support"):
Protocol         Symmetric, All   Symmetric, High Support   Asymmetric, All   Asymmetric, High Support
No Masking       95.17%           62.05%                    91.95%            55.38%
Gblocks          84.14%           41.03%                    86.44%            37.95%
Prob. Masking    93.56%           72.31%                    93.33%            63.59%

17 Effect on Phylogenetic Inference (Muscle alignments, PhyML tree)
Tree inference accuracy, over all trees ("All") and over high-support trees ("High Support"):
Protocol         Symmetric, All   Symmetric, High Support   Asymmetric, All   Asymmetric, High Support
No Masking       94.71%           71.28%                    93.10%            61.03%
Gblocks          89.43%           57.95%                    90.11%            50.26%
Prob. Masking    93.56%           70.77%                    95.17%            64.62%

18 Conclusions/Future Work Technical Issues – What if a few sequences are “bad”/non-homologous? – Incorporate reliability in likelihood equation and Bayesian methods. With Dr. Darling in July Testing – “Real” Data Sets?

