IMPROVING HIV-1 RAPID GENOTYPING TOOLS USING BAYESIAN ADDITIVE REGRESSION TREES

Misha Rajaram 1,2 and Karin S. Dorman 1,2,3

1. Interdisciplinary Graduate Program in Bioinformatics and Computational Biology
2. Department of Statistics, Iowa State University, Ames IA 50010
3. Department of Genetics, Developmental and Cellular Biology, Iowa State University, IA 50010

Abstract

There are nine non-recombinant subtypes and over 34 Circulating Recombinant Forms (CRFs) recognized by HIV researchers. Accurate identification of the infecting type is an important part of treatment, since viral types vary in fitness, risk of transmission, rate of disease progression, and response to the diagnostics used to identify drug resistance. The standard in HIV genotyping is phylogenetic assignment, but such methods are forbiddingly slow. We describe a rapid genotyping tool that uses Bayesian Additive Regression Trees (BART) to type query HIV-1 sequences. BART is a nonparametric Bayesian regression model that uses principles from boosting algorithms to model a response (genotype assignment) affected by many possible covariates (sequence features). BART was used to classify sequences from several summaries of the data: uncorrected distances between the queries and the subtype consensus sequences; easily obtained phylogenetic summaries such as informative site counts; and the genotyping result from the NCBI tool, captured as a count of contiguous window-wise genotype assignments. Comparison of the classifiers showed that the NCBI tool had an accuracy of 78.5%, while BART achieved between 82% and 94.5% accuracy depending on how the data were summarized. BART is also amenable to automated genotype assignment of a large number of query sequences.

Methods

Relatedness Measures

We summarized the relatedness of the input query to the reference set in various ways. All statistics were computed in windows of length 300 bp, placed every 100 bp.
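As a rough illustration of the windowing scheme (300 bp windows placed every 100 bp) and of an uncorrected distance between a query and a reference, the following Python sketch may help. It is not the authors' code: the function names are hypothetical, the sequences are assumed to be pre-aligned to the same length, and skipping columns that contain a gap or `N` is an assumption the poster does not state.

```python
from typing import Iterator, List, Tuple

WINDOW = 300  # window length in bp, as stated in the text
STEP = 100    # offset between window starts in bp, as stated in the text


def windows(length: int, size: int = WINDOW, step: int = STEP) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) for every full window over `length` alignment columns."""
    for start in range(0, length - size + 1, step):
        yield start, start + size


def uncorrected_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Assumption: columns with a gap ('-') or 'N' in either sequence are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else float("nan")


def windowed_distances(query: str, reference: str) -> List[float]:
    """One uncorrected distance per window, in window order."""
    return [
        uncorrected_distance(query[start:end], reference[start:end])
        for start, end in windows(len(query))
    ]
```

Each query would then contribute one vector of per-window distances for every reference subtype or CRF, which is one plausible way to arrange the covariates passed to a classifier.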
To measure the distance between the query and a reference subtype/CRF, we used pairwise uncorrected distances (UD300) or BLAST similarity scores from the NCBI tool (NCBISimilarity). Other measures are characterized below.

Classifiers – CART and BART

Bayesian Additive Regression Trees (BART) [7] is a nonparametric Bayesian regression technique. It splits the regression model into many "weak learners", each constrained by a regularization prior to remain weak. The final regression is an "ensemble" of all the weak learners. The R implementation in package BayesTree was used.

Classification and Regression Trees (CART) [8] builds a single decision tree that partitions the dataset at every node based on a yes/no answer to the question posed at that node. The leaves contain the classification or regression value. The R implementation in package tree was used.

A 5-fold cross-validation technique was used, and the standard measures of accuracy, specificity, sensitivity, false positive rate and Matthews correlation coefficient were used to assess the quality of the classification.

Fig 2. Sample NCBI output and data summaries computed from it.
NCBIWindowCount: A 21, B 35, C 9
NCBIContiguous: A 14, B 35, C 9
NCBIDifference: A 33.56, B 58.96, C 40.3
Values shown in the figure: 7, 14, 9, 35

Dataset: Details
NCBIWindowCount: Number of windows for which each genotype was designated parent.
NCBIContiguous: Number of contiguous windows for which each genotype was designated parent.
NCBIDifference: Average difference between the similarity score of the genotype of interest and that of the next highest genotype, over the contiguous windows for which the genotype is assigned parent.
NCBISimilarity: Similarity scores for each analyzed window, for all genotypes in the reference set.
UD300: Pairwise uncorrected distances between the query and the genotypes in the reference set.
Infosites: Number of informative sites per window that place the query with the genotype of interest in a quartet.
MatchNuc: Number of sites per window showing a nucleotide match only between the genotype of interest and the query, among the four other closest (distance-wise) sequences.

References

1. W. S. Hu and H. M. Temin. Genetic consequences of packaging two RNA genomes in one retroviral particle: pseudodiploidy and high rate of genetic recombination. P Natl Acad Sci USA, 87:1556–1560, 1990.
2. M. Peeters. Recombinant HIV Sequences: Their Role in the Global Epidemic. Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, 2000.
3. T. Leitner, B. Korber, M. Daniels, C. Calef, and B. Foley. HIV Sequence Compendium, chapter HIV-1 Subtype and Circulating Recombinant Form (CRF) Reference Sequences, pages 41–48. Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, New Mexico, 2005.
4. R. Galetto and M. Negroni. Mechanistic features of recombination in HIV. AIDS Rev, 7:92–102, 2005.
5. M. Rozanov, U. Plikat, C. Chappey, A. Kochergin, and T. Tatusova. A web-based genotyping resource for viral sequences. Nucl. Acids Res, 32(Web server issue):W654–W659, 2004.
6. T. de Oliveira, K. Deforche, S. Cassol, M. Salminen, D. Paraskevis, C. Seebregts, J. Snoeck, E. J. van Rensburg, A. M. J. Wensing, D. A. van de Vijver, C. A. Boucher, R. Camacho, and A. M. Vandamme. An automated genotyping system for analysis of HIV-1 and other microbial sequences. Bioinformatics, 21(19):3797–3800, 2005.
7. H. Chipman, E. I. George, and R. E. McCulloch. BART: Bayesian additive regression trees. Technical report, Department of Mathematics and Statistics, Acadia University, Canada, 2008.
8. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and regression trees. Wadsworth, Inc, 1984.

Introduction

Recombination is acknowledged as one of the primary drivers of retroviral evolution [1]. HIV-1 viruses are classified into three main phylogenetic groups. Of these, the majority group M (for main) contains 9 subtypes, of which two have sub-subtypes, leading to 11 pure subtypes.
Newer and more virulent inter-subtype recombinants are becoming established as CRFs and causing local epidemics [2]. HIV-1 genotyping has become important in the effective design and administration of treatment.

Fig 1. Recombination in HIV [3]
Fig 2. Global distribution of HIV-1 and CRFs [4]

Existing Genotyping Tools

Phylogenetic methods are considered the gold standard, but they are time-intensive and require care in the choice of sequences to include in the analysis. Two popular rapid genotyping tools are NCBI's Viral Genotyping Tool [5] and REGA's Genotyping Tool [6]. Using machine learning classification methods will not only automate the classification process but also allow different types of relatedness measures to be used in combination to enhance accuracy.

Feature | NCBI Tool | REGA Tool
Relatedness measure | BLAST-based similarity score | Bootscan and phylogenetic analysis
Genotyping process | Sliding window assigned the genotype with the highest similarity score | Decision trees make decisions based on bootscan and phylogenetic tree parameters
Reference set | Standard or user-provided | Fixed reference set
Batch mode compatible | No | Yes
Automated | No

Table 1. Comparison of chief features of current genotyping tools

Fig 4. Classification Methodology
For each tree, thresholds for determining which class the query belongs to were trained to simultaneously minimize the false positive rate and the false negative rate.
Figure label: MatchNuc count for Q and G1 = 2

Fig 4. Comparison of NCBI Tool, CART and BART using the Simulated Pure NCBIContiguous and NCBIDifference datasets.
Fig 5. Comparison of CART and BART using the Simulated CRFs NCBISimilarity dataset. NCBI results from Fig 4 are included for comparison.

Results

An automated version of the NCBI tool does worse than CART and BART when the NCBIContiguous set is used along with the NCBIDifference set for the Simulated Pure dataset (Fig 4). CART and BART were then used to classify based on the NCBISimilarity set. The use of similarity scores significantly improves classification efficiency.
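The NCBIWindowCount and NCBIContiguous summaries used in these comparisons can be sketched in Python from a list of window-wise genotype calls. This is an illustrative guess rather than the authors' implementation: the function names are hypothetical, NCBIContiguous is assumed to be the longest run of consecutive windows assigned to a genotype, and the order of the runs in `calls` is invented so that the run lengths (7, 9, 14, 35) reproduce the sample values reported with the NCBI output figure.

```python
from collections import Counter
from itertools import groupby
from typing import Dict, List


def window_count(assignments: List[str]) -> Dict[str, int]:
    """Number of windows assigned to each genotype (NCBIWindowCount)."""
    return dict(Counter(assignments))


def longest_contiguous(assignments: List[str]) -> Dict[str, int]:
    """Longest run of consecutive windows per genotype (assumed NCBIContiguous)."""
    best: Dict[str, int] = {}
    for genotype, run in groupby(assignments):
        run_length = sum(1 for _ in run)
        best[genotype] = max(best.get(genotype, 0), run_length)
    return best


# Hypothetical window-wise calls; the ordering of the runs is an assumption.
calls = ["A"] * 7 + ["C"] * 9 + ["A"] * 14 + ["B"] * 35

print(window_count(calls))
print(longest_contiguous(calls))
```

Under these assumptions, genotype A receives 21 windows in total but only 14 in its longest run, which is the kind of distinction the two summaries are meant to capture.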
Fig 5 shows that BART does better than CART overall with this dataset for the Simulated CRFs. BART does uniformly better than CART, especially in the classification of complex recombinants and with the MatchNuc and UD300 datasets. Infosites and UD300 used in combination achieve 82% accurate classification. Compared with the 85% achieved using UD300 alone, this reduction indicates that Infosites adds noise to the dataset without contributing new information.

Discussion

Machine learning tools enable automated genotyping that is as fast as current tools, with accuracy comparable to phylogenetic methods. The biggest advantage is the ability to combine different data summaries to achieve better classification, with one data type filling gaps in the information provided by the others. Additionally, trees can be trained on smaller, more specific datasets if a researcher has compiled a more relevant reference set for the queries they intend to genotype. BART handles sophisticated data summaries better than CART, resulting in a significant increase in classification accuracy and making it the preferred of the two methods. A current cause for concern is the paucity of data for some genotypes and CRFs. Classification is still possible with a single full-length representative genome, although error rates may increase when one sequence cannot capture the diversity within a genotype.
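For reference, the evaluation measures named in this poster (accuracy, sensitivity, specificity, false positive rate and Matthews correlation coefficient) follow from the four cells of a binary confusion matrix. The Python sketch below assumes a one-genotype-versus-rest framing, which the poster does not spell out; the function names and the example counts are hypothetical and are not results from the study.

```python
import math
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard measures from true/false positive and negative counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),          # true positive rate
        "specificity": _ratio(tn, tn + fp),          # true negative rate
        "false_positive_rate": _ratio(fp, fp + tn),  # equals 1 - specificity
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Made-up counts for one genotype versus all others.
example = binary_metrics(tp=40, fp=5, tn=50, fn=5)
```

In a 5-fold cross-validation, one reasonable choice would be to sum the four counts over the held-out folds before computing these ratios, although averaging per-fold values is also common.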

