1.Interdisciplinary Graduate Program in Bioinformatics and Computational Biology 2.Department of Statistics, Iowa State University, Ames IA 50010 3.Department.

IMPROVING HIV-1 RAPID GENOTYPING TOOLS USING BAYESIAN ADDITIVE REGRESSION TREES

Misha Rajaram 1,2 and Karin S. Dorman 1,2,3

1. Interdisciplinary Graduate Program in Bioinformatics and Computational Biology
2. Department of Statistics, Iowa State University, Ames IA
3. Department of Genetics, Developmental and Cellular Biology, Iowa State University, IA

Abstract

There are nine non-recombinant subtypes and over 34 Circulating Recombinant Forms (CRFs) recognized by HIV researchers. Accurate identification of the infecting type is an important part of treatment, since viral types vary in fitness, risk of transmission, rate of disease progression, and response to the diagnostics used to identify drug resistance. The standard in HIV genotyping uses phylogenetic-based assignments, but such methods are prohibitively slow. We describe a rapid genotyping tool that uses Bayesian Additive Regression Trees (BART) to type query HIV-1 sequences. BART is a nonparametric Bayesian regression model that uses principles from boosting algorithms to model a response (genotype assignment) affected by many possible covariates (sequence features). BART was used to classify sequences from several summaries of the data: uncorrected distances between the queries and the subtype consensus sequences; easily obtained phylogenetic summaries such as informative site counts; and the genotyping result from the NCBI tool, captured as a count of contiguous window-wise genotype assignments. In a comparison of the classifiers, the NCBI tool had an accuracy of 78.5%, while BART achieved between 82% and 94.5% accuracy depending on how the data were summarized. BART is also amenable to automated genotype assignment of a large number of query sequences.

Methods

Relatedness Measures

We summarized the relatedness of the input query to the reference set in various ways. All statistics were computed in windows of length 300 bp, placed every 100 bp.
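The windowing scheme can be illustrated with a minimal sketch. It assumes the query and the reference are already aligned to the same length, and it computes the uncorrected distance as the proportion of mismatching sites. The function names, the handling of gaps and ambiguous bases, and the decision to drop a trailing partial window are assumptions made for illustration, not details taken from the poster.

```python
from typing import Iterator, List, Tuple

WINDOW = 300  # window length in bp, as stated in the text
STEP = 100    # offset between consecutive window starts in bp, as stated in the text


def windows(length: int, size: int = WINDOW, step: int = STEP) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) half-open intervals for every full-length window."""
    for start in range(0, length - size + 1, step):
        yield start, start + size


def uncorrected_distance(a: str, b: str) -> float:
    """Proportion of mismatching sites between two aligned sequences.

    Assumption: columns containing a gap ('-') or 'N' in either sequence are skipped.
    """
    compared = 0
    mismatches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        mismatches += x != y
    return mismatches / compared if compared else float("nan")


def windowed_distances(query: str, reference: str) -> List[float]:
    """Uncorrected distance between query and reference in each window."""
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return [uncorrected_distance(query[s:e], reference[s:e])
            for s, e in windows(len(query))]
```

For a 500 bp alignment, `windows(500)` yields three windows starting at positions 0, 100 and 200, so each reference contributes three distance features for that query.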
To measure the distance between the query and a reference subtype/CRF, we used either pairwise uncorrected distances (UD300) or BLAST similarity scores from the NCBI tool (NCBISimilarity). The other measures are described below.

Classifiers – CART and BART

Bayesian Additive Regression Trees (BART) [7] is a nonparametric Bayesian regression technique. It splits the regression model into many "weak learners", each constrained by a regularization prior to remain weak. The final regression is an "ensemble" of all the weak learners. The R implementation in package BayesTree was used.

Classification and Regression Trees (CART) [8] is a simple decision tree that partitions the dataset at every node based on a yes/no answer to the question posed at that node. The leaves contain the classification or regression value. The R implementation in package tree was used.

A 5-fold cross-validation was used, and the standard measures of accuracy, specificity, sensitivity, false positive rate, and Matthews correlation coefficient were used to assess the quality of the classification.

Fig 2. Sample NCBI output and data summaries computed from it.

NCBIWindowCount: A 21, B 35, C 9
NCBIContiguous: A 14, B 35, C 9
NCBIDifference: A, B, C

Dataset | Details
NCBIWindowCount | Count of the number of windows for which each genotype was designated parent.
NCBIContiguous | Count of the number of contiguous windows for which each genotype was designated parent.
NCBIDifference | Average difference in similarity scores between the genotype of interest and the next highest genotype, in contiguous windows for which the genotype is assigned parent.

Dataset | Details
NCBISimilarity | Similarity scores for each analyzed window for all genotypes in the reference set.
UD300 | Pairwise uncorrected distances between the query and the genotypes in the reference set.
Infosites | Count of the number of informative sites per window that put the query with the genotype of interest in a quartet.
MatchNuc | Count of the number of sites per window showing a nucleotide match only for the genotype of interest and the query, among four other closest (distance-wise) sequences.

References

1. W. S. Hu and H. M. Temin. Genetic consequences of packaging two RNA genomes in one retroviral particle: pseudodiploidy and high rate of genetic recombination. P Natl Acad Sci USA, 87:1556–1560.
2. M. Peeters. Recombinant HIV Sequences: Their Role in the Global Epidemic. Theoretical Biology and Biophysics Group, Los Alamos National Laboratory.
3. T. Leitner, B. Korber, M. Daniels, C. Calef, and B. Foley. HIV Sequence Compendium, chapter HIV-1 Subtype and Circulating Recombinant Form (CRF) Reference Sequences, 2005, pages 41–48. Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, New Mexico.
4. R. Galetto and M. Negroni. Mechanistic features of recombination in HIV. AIDS Rev, 7:92–102.
5. M. Rozanov, U. Plikat, C. Chappey, A. Kochergin, and T. Tatusova. A web-based genotyping resource for viral sequences. Nucl. Acids Res, 32(Web server issue):W654–W659.
6. T. de Oliveira, K. Deforche, S. Cassol, M. Salminen, D. Paraskevis, C. Seebregts, J. Snoeck, E. J. van Rensburg, A. M. J. Wensing, D. A. van de Vijver, C. A. Boucher, R. Camacho, and A. M. Vandamme. An automated genotyping system for analysis of HIV-1 and other microbial sequences. Bioinformatics, 21(19):3797–3800.
7. H. Chipman, E. I. George, and R. E. McCulloch. BART: Bayesian additive regression trees. Technical report, Department of Mathematics and Statistics, Acadia University, Canada.
8. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and regression trees. Wadsworth, Inc.

Introduction

Recombination is acknowledged as one of the primary drivers of retroviral evolution [1]. HIV-1 viruses are classified into three main phylogenetic groups. Of these, the majority group M (for main) contains 9 subtypes, two of which have sub-subtypes, giving 11 pure subtypes.
Newer and more virulent inter-subtype recombinants are becoming established as CRFs and causing local epidemics [2]. HIV-1 genotyping has become important in the effective design and administration of treatment.

Fig 1. Recombination in HIV [3]
Fig 2. Global distribution of HIV-1 and CRFs [4]

Existing Genotyping Tools

Phylogenetic methods are considered the gold standard, but they are time-intensive and require care in the choice of sequences to include in the analysis. Two popular rapid genotyping tools are NCBI's Viral Genotyping Tool [5] and REGA's Genotyping Tool [6]. Use of machine learning classification methods will not only automate the classification process but also allow different types of relatedness measures to be used in combination to enhance accuracy.

Table 1. Comparison of chief features of current genotyping tools

Feature | NCBI Tool | REGA Tool
Relatedness Measure | BLAST-based similarity score | Bootscan and phylogenetic analysis
Genotyping Process | Sliding window assigned the genotype with the highest similarity score | Decision trees make decisions based on bootscan and phylogenetic tree parameters
Reference Set | Standard or user provided | Fixed reference set
Batch Mode Compatible | No | Yes
Automated | No |

Fig 4. Classification Methodology (example panel: MatchNuc count for Q and G1 = 2)

For each tree, thresholds for determining which class the query belongs to were trained to simultaneously minimize the false positive rate and the false negative rate.

Results

Fig 4. Comparison of NCBI Tool, CART and BART using the Simulated Pure NCBIContiguous and NCBIDifference datasets.
Fig 5. Comparison of CART and BART using the Simulated CRFs NCBISimilarity dataset. NCBI results from Fig 4 are included for comparison.

An automated version of the NCBI tool does worse than CART and BART when the NCBIContiguous set is used together with the NCBIDifference set on the Simulated Pure dataset (Fig 4). CART and BART were then used to classify based on the NCBISimilarity set. The use of similarity scores significantly improves classification efficiency.
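The window-based NCBI summaries used in this comparison can be sketched from a list of per-window genotype calls. This is only an illustration under stated assumptions: the input is taken to be one genotype label per sliding window, in order along the genome, and "contiguous" is read as the longest unbroken run of windows assigned to a genotype. That reading is consistent with the sample output (A has 21 windows but a contiguous count of 14), but the poster does not define it further. The function names are hypothetical.

```python
from collections import Counter
from itertools import groupby
from typing import Dict, List


def window_count(assignments: List[str]) -> Dict[str, int]:
    """NCBIWindowCount-style summary: number of windows assigned to each genotype."""
    return dict(Counter(assignments))


def longest_contiguous(assignments: List[str]) -> Dict[str, int]:
    """NCBIContiguous-style summary.

    Assumption: the value for a genotype is the length of its longest run of
    consecutive windows.
    """
    longest: Dict[str, int] = {}
    for genotype, run in groupby(assignments):
        run_length = sum(1 for _ in run)
        longest[genotype] = max(longest.get(genotype, 0), run_length)
    return longest


# Hypothetical per-window calls for one query sequence.
calls = ["A", "A", "A", "B", "B", "A"]
counts = window_count(calls)            # A appears in 4 windows, B in 2
contiguous = longest_contiguous(calls)  # longest run: A is 3, B is 2
```

The two dictionaries give one feature per genotype, which is the shape a tree-based classifier such as CART or BART would take as covariates.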
Fig 5 shows that BART does better than CART overall with this dataset for the Simulated CRFs. BART does uniformly better than CART, especially in the classification of complex recombinants and with the MatchNuc and UD300 datasets. Infosites and UD300 used in combination achieve 82% accurate classification. Compared with the 85% achieved using UD300 alone, this reduction suggests that Infosites adds noise to the dataset without contributing new information.

Discussion

Use of machine learning tools enables automated genotyping that is as fast as current tools and has accuracy comparable to phylogenetic methods. The biggest advantage is the ability to combine different data summaries to achieve better classification, with one data type able to fill gaps in the information provided by the others. Additionally, trees can be trained on smaller, more specific datasets if a researcher has compiled a more relevant reference set for the queries they intend to genotype.

BART handles sophisticated data summaries better than CART, resulting in a significant increase in classification accuracy and making it the preferred of the two methods.

A current cause for concern is the paucity of data for some genotypes and CRFs. Classification is still possible with a single full-length representative genome, although error rates may increase when one sequence cannot capture the diversity within a genotype.
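The evaluation measures named for the cross-validation (accuracy, sensitivity, specificity, false positive rate, and Matthews correlation coefficient) can be computed from a two-by-two confusion table. The sketch below assumes a one-versus-rest view in which a single genotype is the positive class. The poster does not say how the multi-class results were reduced, so that framing and the function name are assumptions.

```python
import math
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division that returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard classification measures from confusion-table counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives for one genotype treated as the positive class.
    """
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "false_positive_rate": _ratio(fp, fp + tn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts for one genotype in one cross-validation fold.
fold = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these hypothetical counts the accuracy is 0.85. In a 5-fold setting the same function would be applied to each held-out fold and the results averaged.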