Misha L. Rajaram and Karin S. Dorman Iowa State University

Improving HIV-1 Rapid Genotyping Tools with Bayesian Additive Regression Trees

Genotyping
With 9 pure and at least 34 circulating recombinant forms (CRFs) of HIV-1, genotyping of HIV-1 sequences is important for epidemiology, vaccine development, drug resistance studies and treatment.
Phylogenetic inference methods are the most reliable but are time intensive.
Popular genotyping tools:
NCBI viral genotyping tool (Rozanov et al., 2004)
REGA HIV-1 genotyping tool (de Oliveira et al., 2005)

NCBI viral genotyping tool (http://www.ncbi.nlm.nih
BLAST similarity scores are computed against a database of genotypes, using a sliding window. The genotype with the maximum similarity score in a window is assigned as the genotype for that window. The default uses a 300 bp window with steps of 100.
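The windowing and maximum-score assignment described above can be sketched as follows. This is a minimal illustration, not the NCBI tool's implementation: the similarity scores are taken as given (in the real tool they come from BLAST), and the function names and toy scores are hypothetical.

```python
def window_starts(seq_len, window=300, step=100):
    """Start positions of sliding windows that fit fully inside the sequence."""
    if seq_len < window:
        return [0]  # assumption: a short sequence is treated as a single window
    return list(range(0, seq_len - window + 1, step))


def assign_windows(scores_per_window):
    """For each window, pick the genotype with the maximum similarity score.

    scores_per_window: list of dicts {genotype: score}, one dict per window.
    """
    return [max(scores, key=scores.get) for scores in scores_per_window]


# Toy scores standing in for BLAST output (illustrative values only)
toy_scores = [
    {"A1": 410.0, "B": 520.0, "C": 390.0},
    {"A1": 505.0, "B": 498.0, "C": 402.0},
    {"A1": 512.0, "B": 431.0, "C": 400.0},
]

starts = window_starts(600)
calls = assign_windows(toy_scores)
print(starts)  # [0, 100, 200, 300]
print(calls)   # ['B', 'A1', 'A1']
```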

REGA HIV-1 subtyping tool
From

While these do very well…
Issues with identifying non-B subtypes and complex recombinants. NCBI tool does best in this study, in identifying CRF06_cpx regions in sequences from 5 patients.

Motivation for a new tool
Phylogenetic recombination breakpoint inference models need lists of potential parental genotypes.
Need for an automated tool that is at least as fast as current tools.
Summarize and make use of different kinds of data: phylogenetic, distance information and others.
Current tools train only on a small dataset of known pure genotypes and/or recombinants (104 in NCBI’s pure + CRFs 2005 dataset).
Make use of all the information available in the form of publicly available sequences.
Ultimately be able to classify complex recombinants.

Classification Methods
Supervised learning and classification is an area of extensive research in machine learning.
Two methods are explored in this study:
Classification and Regression Trees (CART) (Breiman et al., 1984)
Bayesian Additive Regression Trees (BART) (Chipman et al., 2006)

Classification and Regression Trees (Breiman et al., 1984)
[Figure: example classification tree for Class A built on per-query window features (Win1 to Win5, ...). Leaves: N=10, Not Class A, P=1; N=5, Class A, P=0.7; N=15, Not Class A, P=0.64; N=10, Class A, P=1]

Bayesian Additive Regression Trees (BART)
BART is a nonparametric Bayesian regression technique (Chipman et al., 2006).
It splits the basic regression model into a sum of smaller additive "weak learners".
The final decision is an "ensemble" of all weak learners.
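For reference, the sum-of-trees form behind the "weak learners" description can be written as below. This is the standard regression formulation from Chipman et al.; the number of trees m is a tuning choice that the slides do not state, and the slides do not say whether a probit link was used for the binary genotype outcome.

```latex
Y = \sum_{j=1}^{m} g(x;\, T_j, M_j) + \varepsilon,
\qquad \varepsilon \sim N(0, \sigma^2)
```

Here T_j is the j-th tree structure, M_j is its set of terminal-node values, and g(x; T_j, M_j) returns the terminal-node value that x falls into. Each tree contributes only a small part of the fit, which is what makes it a weak learner.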

Bayesian Additive Regression Trees
[Figure: BART schematic with per-query window features (Win 1 to Win 4, ...), response Y, and multiple trees each rooted at Node 1]

Data
A dataset containing 150 near full-length HIV-1 sequences. Parents are known from phylogenetic analysis.
7200 bp long segments corresponding to HXB-2 positions 800 to 8000 from each sequence in the dataset were used for analysis.
Genotypes A1, B, C, D, F1, G, H and J were used for the current analysis.

Number of parental genotypes    Number in dataset
1 (non-recombinant)             48
2                               78
>= 3                            24

Methods
Confusion matrix of predicted outcome against actual assignment:

                Actual p          Actual n          Total
Predicted p'    True Positive     False Positive    P'
Predicted n'    False Negative    True Negative     N'
Total           P                 N

Threshold tuning is done for optimal classification.
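A minimal sketch of threshold tuning on classifier scores follows. The slides do not say which criterion defines "optimal", so maximizing accuracy is an assumption made here for illustration; the function names and toy data are hypothetical.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y:
            tp += 1
        elif predicted_positive and not y:
            fp += 1
        elif not predicted_positive and y:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def tune_threshold(y_true, scores):
    """Return the candidate threshold with the highest training accuracy.

    Assumption: accuracy is the tuning criterion; ties keep the lowest threshold.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        tp, fp, fn, tn = confusion_counts(y_true, scores, t)
        acc = (tp + tn) / len(y_true)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Toy labels (1 = query contains the genotype) and classifier scores
y = [1, 1, 0, 1, 0, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
best_t, best_acc = tune_threshold(y, s)
print(best_t, round(best_acc, 3))
```

Another criterion, such as the Matthews correlation coefficient, could be swapped in by changing the single line that computes `acc`.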

Methods- Data from NCBI tool
The HIV-1 “pure” (2005) dataset was used as the BLAST database.
Two sets of parameters were used:
Window size: 300, step: 100
Window size: 100, step: 50
Two summaries of the output were built, with one row per query and one column per genotype (A, B, C, ...):
Dataset of parents with consecutive runs of windows (contwin100, contwin300)
Dataset of parents with total number of windows (numwin100, numwin300)
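The two summaries can be sketched from a list of per-window genotype calls. This is an illustration under assumptions: the slides do not define the "consecutive runs" summary precisely, so the longest run of consecutive windows per genotype is used here, and the function names are hypothetical.

```python
from itertools import groupby


def numwin(assignments, genotypes):
    """Total number of windows assigned to each genotype."""
    return {g: assignments.count(g) for g in genotypes}


def contwin(assignments, genotypes):
    """Longest run of consecutive windows assigned to each genotype.

    Assumption: 'consecutive runs of windows' means the longest such run.
    """
    longest = {g: 0 for g in genotypes}
    for g, run in groupby(assignments):
        if g in longest:
            longest[g] = max(longest[g], sum(1 for _ in run))
    return longest


# Per-window genotype calls for one query (toy values)
calls = ["B", "B", "A1", "A1", "A1", "B", "C"]
genotypes = ["A1", "B", "C"]
print(numwin(calls, genotypes))   # {'A1': 3, 'B': 3, 'C': 1}
print(contwin(calls, genotypes))  # {'A1': 3, 'B': 2, 'C': 1}
```

Each query then contributes one row of such counts, which is the feature vector passed to CART or BART.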

[Table layout: Query | A | mdA | B | mdB | C | mdC]
Dataset: contwin100 + sim

Dataset : infwin300, infwin100
A third type of dataset was produced by counting informative sites for a quartet of sequences. For each window, a quartet was composed of the query sequence, the genotype of interest and two other randomly selected genotypes. The phylogenetically informative sites that place the query sequence with the genotype of interest in a phylogenetic tree were counted and recorded.
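A minimal sketch of the quartet count follows. It assumes the usual parsimony definition for four aligned sequences: a site supports grouping the query with the genotype of interest when those two share one nucleotide and the two other sequences share a different one. The slides do not spell out the definition or the handling of gaps, so both are assumptions, and the function name is hypothetical.

```python
def count_supporting_sites(query, target, other1, other2):
    """Count sites that group the query with the target genotype.

    All four sequences must be aligned and of equal length.
    Assumption: sites with gaps or ambiguity codes are skipped.
    """
    valid = set("ACGT")
    count = 0
    for q, t, a, b in zip(query, target, other1, other2):
        if not {q, t, a, b} <= valid:
            continue
        # Pattern xxyy: query and target agree, the other two agree, x != y
        if q == t and a == b and q != a:
            count += 1
    return count


# Toy aligned window (illustrative sequences only)
query = "ACGTAC"
target = "ACGTTT"
other1 = "AGGAAC"
other2 = "AGGAAC"
n_sites = count_supporting_sites(query, target, other1, other2)
print(n_sites)  # 2
```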

CART/BART methodology
The tree package in R was used to fit CART and build a classification tree. The BayesTree package in R was used to fit the BART model to the data.
A separate classification/additive regression tree was built for each parental genotype, i.e. the tree for genotype A1 classifies a query sequence as having a segment of type A1 or not.
Final genotype assignments are made by listing the classifications of the query across all trees.

Classification methodology
[Figure: Query 1 is passed to each per-genotype classifier. Classifier A1: Yes; Classifier B: No; Classifier C: Yes; Classifier D: No]
Query 1 gets genotype assignments A1 and C.
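The per-genotype decisions can be combined as in the sketch below. The probabilities and the common threshold of 0.5 are toy values chosen to reproduce the Query 1 example; in the study the thresholds are tuned per classifier, and the function name is hypothetical.

```python
def assign_genotypes(probabilities, thresholds):
    """List every genotype whose classifier output reaches its threshold."""
    return sorted(g for g, p in probabilities.items() if p >= thresholds[g])


# Toy classifier outputs for Query 1 (illustrative values only)
probs = {"A1": 0.91, "B": 0.12, "C": 0.67, "D": 0.05}
thresholds = {"A1": 0.5, "B": 0.5, "C": 0.5, "D": 0.5}
assigned = assign_genotypes(probs, thresholds)
print(assigned)  # ['A1', 'C']
```

Because each classifier answers independently, a query can receive zero, one or several parental genotypes, which is what allows recombinants to be reported.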

Analysis Methodology
5-fold cross-validation was used for the analysis. The data are divided into 5 equal sets; 4 are used at a time to fit the model and compute thresholds. Error terms and fit diagnostics are computed by applying the threshold values to the 5th set. Mean values of the errors and diagnostics are reported.
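The cross-validation loop can be sketched as follows. The model fitting itself is left as a caller-supplied function, since the study used R packages for that step; `fit_and_score` and the other names here are hypothetical, and the random shuffling before splitting is an assumption.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(n, fit_and_score, k=5, seed=0):
    """Mean of a diagnostic over k held-out folds.

    fit_and_score(train_idx, test_idx) should fit the model and tune the
    threshold on train_idx, then return one diagnostic computed on test_idx.
    """
    folds = k_fold_indices(n, k, seed)
    results = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        results.append(fit_and_score(train_idx, test_idx))
    return sum(results) / k


# With 150 sequences, each fold holds 30 and each training set 120
folds = k_fold_indices(150)
print([len(f) for f in folds])  # [30, 30, 30, 30, 30]
```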

Diagnostics

                Actual p    Actual n    Total
Predicted p'    TP          FP          P'
Predicted n'    FN          TN          N'
Total           P           N

Accuracy: (TP + TN) / (P + N)
Sensitivity: TP / P
Specificity: TN / N
False positive rate: FP / N
Matthews Correlation Coefficient
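The diagnostics can be computed from the four confusion-matrix counts as sketched below. The formula for the Matthews correlation coefficient did not survive in the slide text, so the standard definition is used here; the function name and the example counts are hypothetical.

```python
import math


def diagnostics(tp, fp, fn, tn):
    """Classification diagnostics from confusion-matrix counts."""
    p = tp + fn  # actual positives (P)
    n = fp + tn  # actual negatives (N)
    # Standard MCC denominator; MCC is reported as 0.0 if it is zero
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (p + n),
        "sensitivity": tp / p,
        "specificity": tn / n,
        "false_positive_rate": fp / n,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Toy counts (illustrative only, not results from the study)
d = diagnostics(tp=40, fp=10, fn=5, tn=95)
for name, value in d.items():
    print(f"{name}: {value:.3f}")
```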

Results

Adding Similarity Score Data
* NCBI results for dataset contwin300

So far…
BART and CART do better with simple summaries of the NCBI output.
Adding similarity score data improves BART.
Informative site counts hold information that provides good classification.
BART with similarity score data does best.

Future Directions
Use of larger training datasets.
Better summaries of phylogenetic and other useful sequence information.
Application of other machine learning techniques to enhance classification efficiency.
Addition of CRFs to the list of potential parents.
Store classification trees, to keep up speed.

Acknowledgements
Dr. Karin Dorman, Iowa State University
Funding: NIH grant R01GM068955

References
L. Breiman, J. H. Friedman, R. A. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, California, 1984.
H. A. Chipman, E. I. George, and R. E. McCulloch. BART: Bayesian additive regression trees. J. Roy. Statistical Society Ser. B, 2006.
A. Holguin, E. Lospitao, M. Lopez, E. R. de Arellano, M. J. Pena, J. Del Romero, C. Martin, and V. Soriano. Genetic characterization of complex inter-recombinant HIV-1 strains circulating in Spain and reliability of distinct rapid subtyping tools. J. Med. Virol. Mar; 80(3):
M. Rozanov, U. Plikat, C. Chappey, A. Kochergin, and T. Tatusova. A web-based genotyping resource for viral sequences. Nuc. Ac. Res., 32 (Web Server Issue): W654–W659, 2004.
T. de Oliveira, K. Deforche, S. Cassol, M. Salminen, D. Paraskevis, C. Seebregts, J. Snoeck, E. J. van Rensburg, A. M. J. Wensing, D. A. van de Vijver, C. A. Boucher, R. Camacho, and A-M Vandamme. An Automated Genotyping System for Analysis of HIV-1 and other Microbial Sequences. Bioinformatics 2005; 21 (19),
