Canadian Bioinformatics Workshops www.bioinformatics.ca
Module 6: Classification
Module 6: Classification bioinformatics.ca
Classification
What is classification?
– Supervised learning, also known as discriminant analysis
– Work from a set of objects with predefined classes, i.e. basal vs luminal, or good responder vs poor responder
– Task: learn from the features of the objects what the basis for discrimination is
– Statistically and mathematically heavy
Classification
[Diagram: samples labeled "poor response" and "good response" are used to learn a classifier; given a new patient, the classifier predicts the most likely response.]
Classification
Input: a set of measurements, variables or features
Output: a discrete label for the class the set of features most closely resembles
– Classification can be probabilistic or deterministic
How is classification different from clustering?
– We know the groups or classes a priori
– Classification predicts what an object is, not which other objects it most closely resembles
– Clustering is finding patterns in data; classification is using known patterns to predict an object's type
Example: DLBCL subtypes
Wright et al, PNAS (2003)
DLBCL subtypes
Wright et al, PNAS (2003)
Classification approaches
Wright et al, PNAS (2003): weighted features are combined in a linear predictor score (LPS):
LPS(X) = Σ_j a_j X_j
– a_j: weight of gene j, determined by its t-test statistic
– X_j: expression value of gene j
Assume there are 2 distinct distributions of LPS: 1 for ABC, 1 for GCB
Wright et al, DLBCL, cont'd
Use Bayes' rule to determine the probability that a sample comes from group 1, given its LPS:
P(group 1 | LPS) = φ(LPS; μ₁, σ₁) / [φ(LPS; μ₁, σ₁) + φ(LPS; μ₂, σ₂)]
– φ(·; μ₁, σ₁): the probability density function (here a normal density) that represents group 1
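A minimal sketch of this two-group rule in Python (not Wright et al's code; the gene weights and the two fitted normal densities are assumed to be given, and the priors are taken as equal):

```python
import math

def normal_pdf(x, mu, sd):
    # Normal density phi(x; mu, sd)
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def lps(expression, weights):
    # Linear predictor score: LPS(X) = sum_j a_j * X_j
    return sum(a * x for a, x in zip(weights, expression))

def p_group1(score, mu1, sd1, mu2, sd2):
    # Bayes' rule with equal priors: P(group 1 | LPS)
    d1 = normal_pdf(score, mu1, sd1)
    d2 = normal_pdf(score, mu2, sd2)
    return d1 / (d1 + d2)
```

A score exactly between the two group means gives probability 0.5, and moving toward group 1's mean pushes the probability above 0.5.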
Learning the classifier, Wright et al
Choosing the genes (feature selection):
– use cross-validation, specifically leave-one-out cross-validation (LOOCV):
– Pick a set of samples
– Use all but one of the samples for training, leaving one out as the test case
– Fit the model using the training data
– Check whether the classifier correctly picks the class of the held-out case
– Repeat exhaustively, leaving out each sample in turn
– Repeat using different sets and numbers of genes, ranked by t-statistic
– Pick the set of genes that gives the highest accuracy
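The LOOCV loop can be sketched as follows; a toy nearest-centroid classifier stands in for the real LPS model, and all names are illustrative:

```python
import numpy as np

def nearest_centroid_predict(train_X, train_y, x):
    # Predict the class whose training-set centroid is closest to x
    classes = sorted(set(train_y))
    dists = {c: np.linalg.norm(x - train_X[train_y == c].mean(axis=0))
             for c in classes}
    return min(dists, key=dists.get)

def loocv_accuracy(X, y):
    # Leave-one-out cross-validation: hold each sample out once,
    # fit on the rest, and check the prediction for the held-out case
    correct = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        pred = nearest_centroid_predict(X[mask], y[mask], X[i])
        correct += (pred == y[i])
    return correct / len(X)
```

In Wright et al's setting, this loop would be wrapped in a second loop over candidate gene sets ranked by t-statistic, keeping the set with the best LOOCV accuracy.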
Overfitting
In many cases in biology, the number of features is much larger than the number of samples, so important features may not be represented in the training data. This can result in overfitting: the classifier discriminates well on its training data but does not generalise to independently derived data sets.
Validation in at least one external cohort is required before the results can be believed.
– Example: the expression subtypes for breast cancer have been repeatedly validated in numerous data sets
Overfitting
To reduce the problem of overfitting, one can use Bayesian priors to 'regularize' the parameter estimates of the model. Some methods integrate feature selection and classification in a unified analytical framework:
– see Law et al, IEEE (2005): Sparse Multinomial Logistic Regression (SMLR): http://www.cs.duke.edu/~amink/software/smlr/
Cross-validation should always be used when training a classifier.
Evaluating a classifier
The receiver operating characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR). Given ground truth and a probabilistic classifier, for some number of probability thresholds:
– compute the TPR: the proportion of true positives that are predicted positive
– compute the FPR: the proportion of true negatives that are (wrongly) predicted positive
Evaluating a classifier
Important terms:
– Prediction: the classifier says the object is a 'hit'
– Rejection: the classifier says the object is a 'miss'
– True Positive (TP): a prediction that is truly a hit
– True Negative (TN): a rejection that is truly a miss
– False Positive (FP): a prediction that is truly a miss
– False Negative (FN): a rejection that is truly a hit
False Positive Rate: FPR = FP / (FP + TN); specificity = 1 − FPR
True Positive Rate: TPR = TP / (TP + FN); sensitivity = TPR
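These counts and rates are mechanical to compute; a small sketch, assuming binary 0/1 labels and predictions:

```python
def classification_rates(y_true, y_pred):
    # y_true, y_pred: sequences of 0 (miss) / 1 (hit)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tpr = tp / (tp + fn)  # sensitivity
    fpr = fp / (fp + tn)  # 1 - specificity
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn, "TPR": tpr, "FPR": fpr}
```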
Evaluating a classifier
Use the area under the ROC curve (AUC) as a single measure:
– encodes the trade-off between FPR and TPR
– only possible for probabilistic (or otherwise rankable) outputs
– requires ground truth and probabilities as inputs
– at a fixed number of ordered probability thresholds, calculate FPR and TPR and plot
– for deterministic methods, it is only possible to calculate a single point in FPR:TPR space
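The threshold sweep and AUC can be sketched directly from their definitions (an illustrative implementation assuming binary 0/1 ground truth, with AUC by the trapezoidal rule):

```python
def roc_points(y_true, scores):
    # Sweep a threshold over each distinct score, from high to low,
    # and record one (FPR, TPR) point per threshold
    P = sum(y_true)
    N = len(y_true) - P
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / N, tp / P))
    points.append((1.0, 1.0))
    return points

def auc(points):
    # Area under the ROC curve by the trapezoidal rule
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A classifier that ranks every positive above every negative traces the curve through (0, 1) and scores AUC = 1.0; random ranking hovers around 0.5.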
Evaluating a classifier
ROC curves are useful for comparing classifiers.
[Figure: ROC curves using depth thresholds of 0-7 and 10; breast cancer data using Affy SNP 6.0 genotypes as truth]
All you need to know about ROC analysis
Tutorial by Tom Fawcett at HP (2003): http://www.hpl.hp.com/techreports/2003/HPL-2003-4.pdf
Practical example: detecting single nucleotide variants (SNVs) from next-generation sequencing data
Aligning billions of short reads to the genome
Reads:
aattcaggaccca-----------------------------
aattcaggacccacacga------------------------
aattcaggacccacacgacgggaagacaa-------------
-attcaggacaaacacgaagggaagacaagttcatgtacttt
----caggacccacacgacgggtagacaagttcatgtacttt
--------acccacacgacgggtagacaagttcatgtacttt
----------------gacgggaagacaagttcatgtacttt
---------------------------------atgtacttt
Reference sequence:
aattcaggaccaacacgacgggaagacaagttcatgtacttt
Tools: MAQ, BWA, SOAP, SHRiMP, Mosaik, Bowtie, RMAP, ELAND
Approach: chopping, hashing and indexing the genome, plus string matching with mismatch tolerance
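The hash-index-and-match idea can be illustrated with a toy seed-and-verify aligner. This is a deliberately simplified sketch, nothing like the optimized tools listed above: it seeds only on the read's first k-mer, so a mismatch inside that seed would miss the hit.

```python
from collections import defaultdict

def build_index(reference, k):
    # Hash every k-mer ("seed") in the reference to its positions
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def align_read(read, reference, index, k, max_mismatches=2):
    # Look up candidate positions via the read's first k-mer,
    # then verify the full read there, tolerating a few mismatches
    hits = []
    for pos in index.get(read[:k], []):
        candidate = reference[pos:pos + len(read)]
        if len(candidate) < len(read):
            continue
        mismatches = sum(a != b for a, b in zip(read, candidate))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits
```

Real aligners add spaced or multiple seeds, quality-aware scoring, and gapped extension on top of this basic scheme.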
SNVMix1: modeling allelic counts
The model's variables (the original symbols appeared in a figure; one conventional choice of notation is shown here):
– G_i: the genotype at position i
– a_i: the number of reference-allele reads at position i
– N_i: the total number of reads at position i
– π: the prior over genotypes
– μ_k: the parameter of the genotype-specific Binomial distribution for genotype k
Model: G_i ~ Multinomial(π); a_i | G_i = k ~ Binomial(N_i, μ_k)
Querying SNVMix1
Given the model parameters, what is the probability that genotype k gave rise to the observed data at each position?
P(G_i = k | a_i, N_i) = π_k Binom(a_i | N_i, μ_k) / Σ_k' π_k' Binom(a_i | N_i, μ_k')
SNVMix1 is a mixture of Binomial distributions with component weights π over the genotypes aa, ab and bb.
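The posterior query is a short computation once the Binomial pmf is available; a sketch with illustrative names, not the SNVMix implementation:

```python
import math

def binom_pmf(a, n, mu):
    # Binomial probability of a reference reads out of n total
    return math.comb(n, a) * mu ** a * (1 - mu) ** (n - a)

def genotype_posterior(a, n, pi, mu):
    # P(G = k | a, N) proportional to pi_k * Binom(a | N, mu_k),
    # normalized over the genotypes (e.g. aa, ab, bb)
    likes = [p * binom_pmf(a, n, m) for p, m in zip(pi, mu)]
    total = sum(likes)
    return [l / total for l in likes]
```

With μ roughly (0.99, 0.5, 0.01) for aa/ab/bb, observing 10 reference reads out of 20 puts almost all the posterior mass on the heterozygous genotype.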
SNVMix1 obviates the need for depth-based thresholding
[Figures: output of SNVMix1 on simulated data with increasing depth; ROC curves using depth thresholds of 0-7 and 10, breast cancer data using Affy SNP 6.0 genotypes as truth]
Learning parameters by model fitting is important for cancer genomes
Li et al (2008), MAQ, and Li et al (2009), SOAP, use parameters of the Binomial that assume normal diploid genomes. Cancer genomes:
– are often not diploid
– are mixed in with normal cells
– exhibit intra-tumoral heterogeneity
We therefore need to fit the model to data in order to learn more representative parameters.
Fitting the model to data using EM
Recall: the genotypes G_i are unobserved, so use Expectation Maximization to fit the model to data. The slide's original equations were in a figure; for a Binomial mixture the standard updates are:
E-step: γ_ik = π_k Binom(a_i | N_i, μ_k) / Σ_k' π_k' Binom(a_i | N_i, μ_k')
M-step: μ_k = Σ_i γ_ik a_i / Σ_i γ_ik N_i ;  π_k = (1/M) Σ_i γ_ik, where M is the number of positions
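The E- and M-steps can be sketched for a plain Binomial mixture. This is a didactic re-implementation, not the SNVMix code; it omits the priors/regularization a production fitter would use:

```python
import math

def binom_pmf(a, n, mu):
    return math.comb(n, a) * mu ** a * (1 - mu) ** (n - a)

def em_binomial_mixture(data, mu, pi, iters=50):
    # data: list of (a_i, N_i) count pairs; mu, pi: initial parameters
    K = len(mu)
    for _ in range(iters):
        # E-step: responsibilities gamma[i][k] = P(G_i = k | a_i, N_i)
        gamma = []
        for a, n in data:
            likes = [pi[k] * binom_pmf(a, n, mu[k]) for k in range(K)]
            total = sum(likes)
            gamma.append([l / total for l in likes])
        # M-step: re-estimate mu_k and pi_k from the responsibilities
        mu = [sum(g[k] * a for g, (a, n) in zip(gamma, data)) /
              sum(g[k] * n for g, (a, n) in zip(gamma, data))
              for k in range(K)]
        pi = [sum(g[k] for g in gamma) / len(data) for k in range(K)]
    return mu, pi
```

On counts drawn from two well-separated allelic ratios, the fitted μ values move to those ratios and π recovers the mixing proportions.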
Fitting the model confers increased accuracy
– 16 ovarian transcriptomes
– 144,271 coding positions with matched Affy SNP 6.0 data
– 10 repeats of 4-fold cross-validation to estimate parameters:
– run EM on ¾ of the data to estimate parameters
– evaluate on the remaining positions
p < 0.00001
Other methods for classification
– Support vector machines
– Linear discriminant analysis
– Logistic regression
– Random forests
See:
– Ma and Huang, Briefings in Bioinformatics (2008)
– Saeys et al, Bioinformatics (2007)
Lab 6: Classification