Benchmarking Statistical Multiple Sequence Alignment


Benchmarking Statistical Multiple Sequence Alignment
Tandy Warnow
Joint with Mike Nute and Ehsan Saleh
https://www.biorxiv.org/content/early/2018/04/20/304659

Performance Study on bioRxiv
Goal: Evaluate BAli-Phy (Redelings and Suchard) against leading alignment methods on small protein sequence datasets (at most 27 sequences), both biological and simulated.
Metrics: Modeler score (precision), SP-score (recall), expansion ratio (normalized alignment length), and running time.
Datasets: 120 simulated datasets (6 model conditions) and 1192 biological datasets (4 biological benchmarks).
Specific note: For each dataset, BAli-Phy was run independently on 32 processors for 48 hours, the burn-in was discarded, and the posterior decoding (PD) alignment was then computed. These BAli-Phy analyses used 230 CPU years on Blue Waters (the supercomputer at NCSA).
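The three accuracy metrics can be illustrated with a minimal sketch. It assumes the usual pair-based definitions: SP-score is the fraction of aligned residue pairs in the reference that the estimated alignment recovers (recall), and Modeler score is the fraction of aligned residue pairs in the estimated alignment that also appear in the reference (precision). The slide only says the expansion ratio is a "normalized alignment length"; dividing by the reference alignment length is an assumption here. The function names and the toy alignments are hypothetical and are not taken from the study.

```python
from itertools import combinations


def homology_pairs(alignment):
    """Return the set of aligned residue pairs in an alignment.

    alignment: dict mapping sequence name -> gapped string ('-' is a gap),
    all strings of equal length. A residue is identified by
    (sequence name, index among that sequence's ungapped residues).
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    next_index = {name: 0 for name in names}
    pairs = set()
    for col in range(length):
        residues_in_column = []
        for name in names:
            if alignment[name][col] != "-":
                residues_in_column.append((name, next_index[name]))
                next_index[name] += 1
        # Every two residues sharing a column form one homology pair.
        pairs.update(combinations(residues_in_column, 2))
    return pairs


def alignment_metrics(estimated, reference):
    """Compute SP-score, Modeler score, and expansion ratio (assumed definitions)."""
    est_pairs = homology_pairs(estimated)
    ref_pairs = homology_pairs(reference)
    shared = len(est_pairs & ref_pairs)
    est_length = len(next(iter(estimated.values())))
    ref_length = len(next(iter(reference.values())))
    return {
        "sp_score": shared / len(ref_pairs) if ref_pairs else 0.0,
        "modeler_score": shared / len(est_pairs) if est_pairs else 0.0,
        # Assumption: normalize by the reference alignment length.
        "expansion_ratio": est_length / ref_length,
    }


# Toy example: the estimated alignment splits the last reference column in two.
reference = {"a": "ACGT", "b": "ACGT"}
estimated = {"a": "ACGT-", "b": "ACG-T"}
metrics = alignment_metrics(estimated, reference)
print(metrics)
```

In this toy case the estimated alignment is under-aligned: every pair it reports is correct (Modeler score 1.0), it misses one of the four reference pairs (SP-score 0.75), and it is longer than the reference (expansion ratio 1.25).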

Alignment Accuracy on Simulated Datasets (120 datasets)
BAli-Phy is best!

Alignment Accuracy on Protein Biological Benchmarks (1192 datasets)
T-Coffee and PROMALS are best! BAli-Phy is good for Modeler score, but not so good for SP-score (e.g., MAFFT is better).

Alignment Accuracy on 4 Protein Biological Benchmarks (1192 datasets)
BAli-Phy has the best Modeler score.

Alignment Accuracy on 4 Protein Biological Benchmarks (1192 datasets)
BAli-Phy is not competitive for SP-score (but the best method depends on %ID).

Running Time on 4 biological datasets with 17 sequences each
BAli-Phy benefits from a long running time: we used >2 months for each dataset.
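The compute figures quoted in the talk can be cross-checked with a short calculation. This is a sketch under two assumptions the slides do not state explicitly: that ">2 months" means aggregate CPU time summed over the 32 processors, and that every one of the 120 + 1192 datasets received the same budget of 32 processors for 48 hours.

```python
# Figures taken from the Performance Study slide.
PROCESSORS_PER_DATASET = 32
HOURS_PER_RUN = 48
NUM_DATASETS = 120 + 1192  # simulated + biological

HOURS_PER_DAY = 24
HOURS_PER_YEAR = 365 * HOURS_PER_DAY

# CPU time for one dataset, summed over its processors (assumed meaning of ">2 months").
cpu_hours_per_dataset = PROCESSORS_PER_DATASET * HOURS_PER_RUN
cpu_days_per_dataset = cpu_hours_per_dataset / HOURS_PER_DAY

# Total CPU time if every dataset used the same budget (assumption).
total_cpu_years = NUM_DATASETS * cpu_hours_per_dataset / HOURS_PER_YEAR

print(f"CPU days per dataset: {cpu_days_per_dataset:.0f}")
print(f"Total CPU years: {total_cpu_years:.0f}")
```

Under these assumptions each dataset consumes 64 CPU days, a little over two months, and the total comes to roughly 230 CPU years, which matches the figure reported for Blue Waters.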

Observations
BAli-Phy is much more accurate than all other methods on simulated datasets.
BAli-Phy is generally less accurate than the top half of these methods on biological datasets, especially with respect to SP-score (recall).
Average percent pairwise ID affects every accuracy measure for every method, and changes the methods' relative performance.

Conclusions
We do not know why accuracy differs between the simulated and the biological datasets. BAli-Phy's MCMC works well (given enough time), so this is not the issue.
Possible explanations: (1) model misspecification (proteins don't evolve under the BAli-Phy model); (2) structural alignments and evolutionary alignments are different; (3) the structural alignments are not correct (perhaps over-aligned).
All these explanations are likely true, but the relative contributions are unknown.

Acknowledgments
Mike Nute, PhD student
Ehsan Saleh, PhD student
National Science Foundation ABI-1458652, http://tandy.cs.illinois.edu/MSAproject.html
Blue Waters (part of NCSA)
See https://www.biorxiv.org/content/early/2018/04/20/304659

Alignment Accuracy on 4 Protein Biological Benchmarks (1192 datasets)
BAli-Phy under-aligns.

Expansion Ratios on Simulated Datasets (120 datasets)
BAli-Phy is best!