
1 Betty Cheng and Jaime Carbonell, Language Technologies Institute, School of Computer Science, Carnegie Mellon University

2 Outline:
- HIV & Drug Resistance
- Phenotype Prediction Models
- Machine Learning
- Language of Proteins
- Document Classification of HIV Genotypes
- Comparison to state-of-the-art & human experts
- Other Area of Application: GPCR
- Conclusions

3
- Drug resistance is an obstacle in the treatment and control of many infectious diseases
- 33.2 million people were living with AIDS in 2007; 2.1 million died from AIDS in 2007
- The high mutation rate of HIV leads to a quasi-species of virus strains inside each patient
- (Slide figure cites diversity values of 25% and 4%)

4
- Currently ~25 drugs in 4 main drug classes
- Treatments combining 3+ drugs (HAART) are used to cover as many virus strains as possible in the quasi-species
- Personalized medicine: trial and error is not an option because of cross-resistance
- Goal: optimize treatment so that the virus population takes as long as possible to develop resistance
- Current practice: phenotype is predicted from genotype test results to identify the resistance present now

5
- Problem: predict resistance (high / low / none) to each drug from the patient's HIV genotype
- Example: Rega and ANRS systems. Rules of the form: if at least Z mutations from a specified list are present, then predict resistance level Y to drug X.
- Example: HIVdb. Sum the penalty scores contributed by each mutation.
- Advantage: the reason for a prediction is easy to understand
- Disadvantage: impossible to maintain as more data and drugs become available
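
Since these two rule styles are the core of such systems, here is a minimal Python sketch of both; the mutation lists, penalty values, and cutoffs are invented for illustration and are not the real Rega, ANRS, or HIVdb rules.

```python
# Illustrative only: the mutation sets, penalty values, and cutoffs below are
# hypothetical, not the actual Rega/ANRS or HIVdb rules.

def threshold_rule(genotype, rule_mutations, z, level):
    """Rega/ANRS style: if at least z of the listed mutations are present,
    predict the given resistance level, otherwise 'none'."""
    hits = len(set(genotype) & set(rule_mutations))
    return level if hits >= z else "none"

def penalty_rule(genotype, penalties, high_cutoff=60, low_cutoff=30):
    """HIVdb style: sum per-mutation penalty scores and map the total to a level."""
    total = sum(penalties.get(m, 0) for m in genotype)
    if total >= high_cutoff:
        return "high"
    if total >= low_cutoff:
        return "low"
    return "none"

# Hypothetical genotype and rules for one drug
patient = ["M46I", "I54V", "V82A"]
print(threshold_rule(patient, ["M46I", "I54V", "V82A", "L90M"], z=2, level="high"))
print(penalty_rule(patient, {"M46I": 10, "I54V": 15, "V82A": 30, "L90M": 30}))
```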

6
- Find the database sequence most similar to the test sequence at all selected mutation positions
- Does not interpolate between partial matches
- Example: VirtualPhenotype™ (from Virco)
- Advantage: no rules to maintain
- Disadvantages: human experts are still needed to identify mutation positions; a large amount of data is needed to ensure a database match

7
- Systems can "learn" by detecting patterns in training data and by deduction; this enables knowledge discovery
- Approaches vary in the type of features and the learning algorithm
- Features: presence of a mutation, mutation-based, structure-based (slide callouts: "Sufficient for Protease Inhibitors", "Majority of studies")
- Maintenance is just re-running the learning algorithm on new data
- Training takes minutes to hours; testing a sequence takes seconds to minutes

8
- Glass-box algorithms allow knowledge discovery
- Black-box algorithms are more tolerant of extra features
- Existing systems trade off between black-box algorithms and expert-selected mutations
- (Figures: a decision tree for EFV (Beerenwinkel, 2002); a neural network using 27 mutations)

9
The "language of proteins" analogy pairs each level of protein structure with a level of language:
- amino acids ~ letters
- motifs ~ words
- secondary structure ~ phrases
- 3D structure ~ sentences
- protein-protein interactions (PPI) ~ meaning

10
- Classify a document by topic based on its words
- Trade-off between using all English words and using selected keywords
- Chi-square feature selection was found to be best at selecting keywords in text (Yang & Pedersen, 1997)
- (Figure: three sports documents share common words such as "a", "to", "the", "ball" but are distinguished by keywords such as "hoop", "basket", "bat", "glove", "tackle", "touchdown")

11
- View the target virus proteins as documents
- Alphabet size: 20 amino acids
- No word/motif boundaries (as in Thai or Japanese text)
- Features: position-independent n-grams and position-dependent n-grams (mutations)
- Extract n-grams at every reading frame; represent each sequence as a vector of n-gram counts (a sketch follows below)
- Example sequences (figure): GSVERDSVEEVLKAFRLFDDGNSGT…, GSGMRMSREQLLNAWRLFCKDNSHT…, GSGERDSREEILKAFRLFDDDNSGT…
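
A minimal Python sketch of the two feature types. Assumptions are flagged in the comments: position-independent n-grams are simply counted at every offset, and the position-dependent encoding (amino acid plus 1-based position) is illustrative, since the exact numbering scheme used in the talk is not given.

```python
from collections import Counter

def position_independent_ngrams(seq, n_values=(1, 2, 3)):
    """Count every length-n substring, taken at every offset of the sequence."""
    counts = Counter()
    for n in n_values:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts

def position_dependent_features(seq):
    """Record the amino acid at each position, e.g. '3V' for V at position 3.
    (Illustrative encoding; real systems number positions against a reference.)"""
    return {f"{i + 1}{aa}": 1 for i, aa in enumerate(seq)}

seq = "GSVERDSVEEVLKAFRLFDDGNSGT"
print(position_independent_ngrams(seq).most_common(5))
print(list(position_dependent_features(seq).items())[:5])
```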

12
Candidate features are thresholded counts, one binary feature per (n-gram, threshold) pair:
- Unigrams (e.g. A): count ≥ 5, ≥ 10, …, ≥ 100
- Bigrams (e.g. AA) and trigrams (e.g. AAA): count ≥ 1, ≥ 2, …, ≥ 20
- Position-dependent mutations (e.g. 188G): ≥ 0.05, ≥ 0.1, …, ≥ 1

13
Chi-square feature selection scores each feature x against each resistance level c by comparing observed and expected co-occurrence counts:

    χ²(x, c) = Σ (O − E)² / E,   where E = (N_x · N_c) / N

Here O is the observed number of sequences with feature x and resistance level c, E is the expected number under independence, N is the total number of sequences, N_x is the number of sequences with feature x, and N_c is the number of sequences with resistance level c (the sum runs over presence/absence of x and membership/non-membership in c). Chi-square feature selection is the best for document classification (Yang & Pedersen, 1997).
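
A small sketch of the score itself, computed from the 2x2 contingency table for one binary feature and one resistance level; the counts in the example are made up. In practice a library routine (e.g. scikit-learn's chi2) would score every feature at once.

```python
def chi_square(n_total, n_feature, n_class, n_both):
    """Chi-square score for a binary feature vs. a class from a 2x2 table.
    n_total: total sequences, n_feature: sequences with the feature,
    n_class: sequences with the resistance level, n_both: sequences with both."""
    # Observed counts for the four cells of the contingency table
    observed = [
        n_both,                                  # feature present, class c
        n_feature - n_both,                      # feature present, not c
        n_class - n_both,                        # feature absent, class c
        n_total - n_feature - n_class + n_both,  # feature absent, not c
    ]
    # Expected counts under independence: E = row_total * column_total / N
    expected = [
        n_feature * n_class / n_total,
        n_feature * (n_total - n_class) / n_total,
        (n_total - n_feature) * n_class / n_total,
        (n_total - n_feature) * (n_total - n_class) / n_total,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

# Hypothetical counts: 500 sequences, 120 have the n-gram,
# 200 show high resistance, 90 have both.
print(chi_square(500, 120, 200, 90))
```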

14
Each thresholded feature receives a χ² score, e.g. A with count ≥ 5 scores 23.5 and with ≥ 100 scores 12.1; AA scores 23.1 at ≥ 1 and 19.9 at ≥ 2; AAA scores 15.1 at ≥ 1 and 10.2 at ≥ 20 (highlighted scores on the slide: 30.2, 29.9, 45.1). For each n-gram only its most discriminative threshold is kept, giving a ranked feature set such as AAA (≥ 1), …, A (≥ 10), …, AA (≥ 20).

15
Pipeline (figure): protein sequences (e.g. GSGERDSREEILKAFRLFDDDNSGT…, GSVERDSVEEVLKAFRLFDDGNSGT…, GSGMRMSREQLLNAWRLFCKDNSHT…) → n-grams extracted at every reading frame → vector of counts of all n-grams → chi-square feature selection → binary vector marking which selected n-grams occur more frequently than their most discriminative thresholds → classifier. (A rough code sketch of this pipeline follows.)
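
A rough scikit-learn sketch of the pipeline in the figure, under simplifying assumptions: it scores raw character n-gram counts with chi-square rather than building one binary feature per (n-gram, threshold) pair, and the sequences and resistance labels are placeholders.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: protein sequences and resistance labels for one drug.
sequences = [
    "GSVERDSVEEVLKAFRLFDDGNSGT",
    "GSGMRMSREQLLNAWRLFCKDNSHT",
    "GSGERDSREEILKAFRLFDDDNSGT",
]
labels = ["high", "none", "low"]

pipeline = Pipeline([
    # 1-3 character n-grams counted at every offset of each sequence
    ("ngrams", CountVectorizer(analyzer="char", ngram_range=(1, 3), lowercase=False)),
    # Keep the n-grams with the highest chi-square scores (the talk kept ~100-120 features)
    ("select", SelectKBest(chi2, k=20)),
    # Glass-box classifier, as in the decision-tree experiments
    ("clf", DecisionTreeClassifier(random_state=0)),
])

pipeline.fit(sequences, labels)
print(pipeline.predict(["GSGERDSREEILKAFRLFDDDNSGT"]))
```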

16
- A previous study (Rhee et al., 2006) compared the performance of 3 feature sets: expert-selected mutations, treatment-selected mutations (TSM), and mutations occurring more than twice in the dataset
- TSM are derived from an additional database of patients treated with the given drug class but with no drugs targeting the same protein, so they cannot be specific to each drug
- The study found human experts or TSM to perform best

17
Using the same dataset and classifier (decision tree), our χ²-selected features performed comparably to TSM and expert-selected mutations. Accuracy per drug; columns: χ²-selected (position-independent 3-grams, position-dependent), then TSM and Expert from Rhee et al., 2006 (values are listed in column order; where a row shows fewer than four values, the remaining cells were merged or not shown on the slide):

Nucleoside RT Inhibitors (NRTI)
3TC: 0.897, 0.934, 0.92
ABC: 0.680, 0.713, 0.74, 0.75
AZT: 0.733, 0.752, 0.71, 0.72
D4T: 0.778, 0.781, 0.76, 0.77
DDI: 0.723, 0.745, 0.75
TDF: 0.705, 0.69, 0.67
Avg.: 0.753, 0.772, 0.76

Non-Nucleoside RT Inhibitors (NNRTI)
DLV: 0.823, 0.842, 0.84, 0.82
EFV: 0.864, 0.855, 0.85, 0.80
NVP: 0.912, 0.910, 0.91, 0.89
Avg.: 0.866, 0.869, 0.87, 0.84

Protease Inhibitors (PI)
APV: 0.788, 0.786, 0.78, 0.77
ATV: 0.660, 0.678, 0.65
IDV: 0.753, 0.732, 0.75
LPV: 0.764, 0.797, 0.74, 0.73
NFV: 0.761, 0.774, 0.80
RTV: 0.840, 0.837, 0.85, 0.84
SQV: 0.793, 0.812, 0.80, 0.77
Avg.: 0.765, 0.774, 0.77, 0.76

Overall Avg.: 0.780, 0.791, 0.78

18
- Evaluated several learning algorithms
- Glass-box: decision tree, naïve Bayes, random forest; black-box: SVM
- On average 100-120 χ²-selected features
- The choice of classifier did not make much difference (a comparison sketch follows)

Learning Alg.: PI, NRTI, NNRTI, Avg.
Decision Tree: 0.774, 0.772, 0.869, 0.791
Naïve Bayes: 0.781, 0.767, 0.858, 0.790
Random Forest: 0.800, 0.785, 0.875, 0.808
SVM: 0.809, 0.807, 0.880, 0.822
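
A sketch of how such a comparison might be run, assuming X is the matrix of selected binary χ² features and y the resistance labels for one drug; both are random placeholders here, so the printed scores are meaningless and only the structure matters.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 110))            # placeholder: 200 sequences x ~110 binary features
y = rng.choice(["none", "low", "high"], size=200)  # placeholder resistance labels

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```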

19
- Used regression algorithms to predict the resistance factor (IC50 ratio)
- Comparing the best models from each study for each drug, our model matched or outperformed Rhee et al. on 12 of 16 drugs; the average difference in r² is < 0.01

Per-drug r² (our method: regression type and r²; Rhee et al., 2006: r² and regression type):

Protease Inhibitors (PI)
APV: SVM 0.821 | 0.82 SVM
ATV: SVM 0.775 | 0.76 LSR
IDV: SVM 0.826 | 0.83 LSR
LPV: SVM 0.865 | 0.87 SVM
NFV: Linear 0.854 | 0.84 LSR
RTV: SVM 0.900 | 0.89 LSR
SQV: SVM 0.838 | 0.84 LSR

Nucleoside RT Inhibitors (NRTI)
3TC: SVM 0.935 | 0.95 SVM
ABC: Linear 0.788 | 0.79 LARS
AZT: Linear 0.767 | 0.74 SVM
D4T: SVM 0.747 | 0.79 SVM
DDI: SVM 0.729 | 0.75 SVM
TDF: SVM 0.527 | 0.59 SVM

Non-Nucleoside RT Inhibitors (NNRTI)
DLV: SVM 0.815 | 0.79 LARS
EFV: SVM 0.864 | 0.85 LARS
NVP: SVM 0.811 | 0.79 LARS
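
For the regression variant, a corresponding sketch using support-vector regression scored by r², again on placeholder data; in the actual study the target was the IC50 ratio, and any log transformation used there is not specified here.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 110)).astype(float)  # placeholder chi-square feature matrix
y = rng.normal(size=200)                                # placeholder resistance-factor targets

# r^2 is the metric used in the comparison above; on random data it will be near or below zero.
scores = cross_val_score(SVR(), X, y, cv=5, scoring="r2")
print(f"mean r2: {scores.mean():.3f}")
```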

20 53 of 54 expert-selected mutations for PI were ranked 108th or higher by χ²

21
- 20 of 21 expert-selected mutations for NRTI were ranked 120th or higher by χ²
- All 15 expert-selected mutations for NNRTI were ranked 107th or higher by χ²

22
Mutations ranked highly by χ² for three example drugs (read column-wise from the slide):
RTV: V54, V71, M90, A82, -10, I10, I46, V84, -54, -71, -82, -90, R20, F33, -46, I24, I36, S73, T43, T82, I20, L46, -63, -20, F10, I32, L53, -84, D37, V48
3TC: -184, V184, N67, W210, -215, L41, R65, Y215, D69, H228, D44, C181, -67, -41, I118, M75, E43, F215, A190, Y208, K83, R82, I54, G98, L227, E218, -210, A106, I69, A39
EFV: N103, -103, I100, A190, V74, C181, S190, E101, P101, L188, G98, H225, R228, Q190, E190, E179, -190, L227, N219, Y221, E43, S103, L230, M135, Y208, R102, D179, I74, A106, Q102

23
- Phenotype systems predict only the drug resistance that the detected genotype currently confers
- Resistance to a drug combination is not simply the sum of resistance to the individual drugs: mutations can cause resistance to one drug while increasing sensitivity to another
- Minor strains are not detected by genotype testing
- Treatment history matters
- Variation in the human host affects response: adherence (Ying et al., 2007); haplotype? gender? state of health? lifestyle habits?

24
- Model the impact of interactions among all these factors by using a feature for each combination
- χ² reduces these to a manageable number of important features before they are given to a glass-box model
- Amortized optimization of HAART requires both a short-term and a long-term response model
- (Diagram: virus genotype, potency / drug resistance, patient treatment history, future drug options)

25
- Given a new protein sequence, classify it into the correct category at each level of the hierarchy
- Subfamily classification is based on function
- G-protein coupled receptors (GPCRs) are the target of 60% of current drugs

26
- Previous classification studies rely on alignment-based features
- Karchin et al. (2002) evaluated classifiers at varying levels of complexity and concluded that SVMs were necessary to attain 85%+ accuracy
- Our document classification approach uses χ² features with naïve Bayes or a decision tree
- (Figure: classifiers ordered from complex to simple: SVMs, neural nets, clustering; decision trees, naïve Bayes; hidden Markov models (HMM); k-nearest neighbours)

27
Classifier | # of Features | Type of Features | Accuracy
Naïve Bayes | 7400 | chi-square n-gram features | 93.2%
SVM | 9 per match state in the HMM | gradient of the log-likelihood that the sequence is generated by the given HMM model | 88.4%
BLAST | | local sequence alignment | 83.3%
Decision Tree | 2700 | chi-square n-gram features | 78.0%
SAM-T2K HMM | | an HMM model built for each protein subfamily | 69.9%
kernNN | 9 per match state in the HMM | gradient of the log-likelihood that the sequence is generated by the given HMM model | 64.0%

Naïve Bayes with chi-square attained a 39.7% reduction in residual error. Position-independent n-grams outperformed position-specific ones because the diversity of GPCR sequences made sequence alignment difficult.

28
Classifier | # of Features | Type of Features | Accuracy
Naïve Bayes | 8100 | chi-square n-gram features | 92.4%
SVM | 9 per match state in the HMM | gradient of the log-likelihood that the sequence is generated by the given HMM model | 86.3%
BLAST | | local sequence alignment | 74.5%
Decision Tree | 2300 | chi-square n-gram features | 70.2%
SAM-T2K HMM | | an HMM model built for each protein subfamily | 70.0%
kernNN | 9 per match state in the HMM | gradient of the log-likelihood that the sequence is generated by the given HMM model | 51.0%

Naïve Bayes with chi-square attained a 44.5% reduction in residual error.

29 N-grams selected by chi-square, when joined together, form motifs reported in the literature.

30
- Current phenotype prediction systems require human experts to maintain them, whether through rules or through lists of resistance-associated mutations
- A text document classification approach led to a fully automatic prediction model with results comparable to the state of the art while requiring no human expertise
- The mutations identified by χ² overlap strongly with those chosen by human experts
- A similar approach had already found success in previous work on GPCR proteins
- Aim: an automatic prediction model for short-term and long-term viral load response to HAART, so that amortized treatment optimization becomes possible

31 Betty Cheng (ymcheng@cs.cmu.edu) Jaime Carbonell (jgc@cs.cmu.edu)

