Classification by Machine Learning Approaches Michael J. Kerner – Center for Biological Sequence Analysis Technical University of Denmark
Outline Introduction to Machine Learning Datasets, Features Feature Selection Machine Learning Approaches (Classifiers) Model Evaluation and Interpretation Examples, Exercise
Machine Learning – Data Driven Prediction To Learn: “to gain knowledge or understanding of or skill in by study, instruction, or experience” (Merriam Webster English Dictionary, 2005) Machine Learning: Learning the theory automatically from the data, through a process of inference, model fitting, or learning from examples: Automated extraction of useful information from a body of data by building good probabilistic models. Ideally suited for areas with lots of data in the absence of a general theory.
Why do we need Machine Learning? Some tasks cannot be defined well, except by examples (e.g. recognition of faces or people). Large amounts of data may have hidden relationships and correlations. Only automated approaches may be able to detect these. The amount of knowledge about a certain problem / task may be too large for explicit encoding by humans (e.g. in medical diagnostics) Environments change over time, and new knowledge is constantly being discovered. A continuous redesign of the systems “by hand” may be difficult.
The Machine Learning Approach Input data (e.g. gene expression profiles, …) -> machine learning classifier (ML) -> prediction: Yes / No
Machine Learning Learning Task: –What do we want to learn or predict? Data and assumptions: –What data do we have available? –What is their quality? –What can we assume about the given problem? Representation: –What is a suitable representation of the examples to be classified? Method and Estimation: –Are there possible hypotheses? –Can we adjust our predictions based on the given results? Evaluation: –How well does the method perform? –Might another approach/model perform better?
Learning Tasks Classification: –Prediction of an item class. Forecasting: –Prediction of a parameter value. Characterization: –Find hypotheses that describe groups of items. Clustering: –Partitioning of the (unassigned) data set into clusters with common properties. (Unsupervised learning)
Emergence of Large Datasets Dataset examples: Image processing Spam detection Text mining DNA micro-array data Protein function Protein localization Protein-protein interaction …
Dataset Examples Edible or poisonous ?
mRNA Splicing
mRNA Splice Site Prediction
Protein Function Prediction: ProtFun Predict as many biologically relevant features as we can from the sequence Train artificial neural networks for each category Assign a probability for each category from the NN outputs
############## ProtFun 2.2 predictions ########
>KCNA1_HUMAN
# Functional category Prob Odds
  Amino_acid_biosynthesis
  Biosynthesis_of_cofactors
  Cell_envelope
  Cellular_processes
  Central_intermediary_metabolism
  Energy_metabolism
  Fatty_acid_metabolism
  Purines_and_pyrimidines
  Regulatory_functions
  Replication_and_transcription
  Translation
  Transport_and_binding =>
# Enzyme/nonenzyme Prob Odds
  Enzyme
  Nonenzyme =>
# Enzyme class Prob Odds
  Oxidoreductase (EC )
  Transferase (EC )
  Hydrolase (EC )
  Lyase (EC )
  Isomerase (EC )
  Ligase (EC )
# Gene Ontology category Prob Odds
  Signal_transducer
  Receptor
  Hormone
  Structural_protein
  Transporter
  Ion_channel
  Voltage-gated_ion_channel =>
  Cation_channel
  Transcription
  Transcription_regulation
  Stress_response
  Immune_response
  Growth_factor
  Metal_ion_transport
Emergence of Large Datasets Complexity of datasets: –Many instances (examples) –Instances with multiple features (properties / characteristics) –Dependencies between the features (correlations)
Data Preprocessing Instance selection: –Remove identical / inconsistent / incomplete instances (e.g. reduction of homologous genes, removal of wrongly annotated genes) Feature transformation / selection: –Projection techniques (e.g. principal components analysis) –Compression techniques (e.g. minimum description length) –Feature selection techniques
Benefits of Feature Selection Attain good, and often even better, classification performance using a small subset of features –Less noise in the data Provide more cost-effective classifiers –Fewer features to take into account -> smaller datasets -> faster classifiers Identification of (biologically) relevant features for the given problem
Feature Selection [Figure: filter approach vs. wrapper approach]
Filter approach: All Features -> Feature Subset Selection -> Optimal Features -> Learning Algorithm
Wrapper approach: All Features -> Feature Subset Search Algorithm, which passes Selected Features to the Learning Algorithm and uses its Evaluation as Selection Criterion -> Optimal Features -> Learning Algorithm
Filter Approach –Independent of the classification model –A relevance measure for each feature is calculated –Features with a value lower than a selected threshold t will be removed Example: Feature-class entropy –Measures the “uncertainty” about the class when observing feature i (illustrated on a table with columns f1, f2, f3, f4 and class)
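A minimal sketch of an entropy-based filter, under stated assumptions: the relevance of a feature is taken here as the reduction in class entropy after observing it (information gain), and the data for f1 to f4 is a hypothetical toy example, not the table from the slide. The function names and the threshold value are illustrative.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def class_entropy_given_feature(values, labels):
    """H(class | feature): remaining class uncertainty after observing the feature."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    total = len(labels)
    return sum(len(group) / total * entropy(group) for group in groups.values())

def filter_features(columns, labels, t):
    """Keep the features whose relevance is at least the threshold t."""
    h_class = entropy(labels)
    relevance = {name: h_class - class_entropy_given_feature(values, labels)
                 for name, values in columns.items()}
    kept = [name for name, score in relevance.items() if score >= t]
    return kept, relevance

# Hypothetical toy data: four binary features and a binary class.
columns = {
    "f1": [0, 0, 1, 1, 0, 1],
    "f2": [0, 1, 0, 1, 0, 1],
    "f3": [1, 1, 0, 0, 1, 0],
    "f4": [0, 0, 0, 1, 1, 1],
}
labels = ["no", "no", "yes", "yes", "no", "yes"]

kept, relevance = filter_features(columns, labels, t=0.5)
print(kept)
print(relevance)
```

In this toy data f1 and f3 determine the class completely, so they pass the threshold, while f2 and f4 carry almost no information about the class and are removed. No classifier is trained at any point, which is what makes this a filter.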
Wrapper approach Specific to a classification algorithm The search for a good feature subset is guided by a search algorithm The algorithm uses the evaluation of the classifier as a guide to find good feature subsets Search algorithm examples: sequential forward or backward search, genetic algorithms Sequential backward elimination –Starts with the set of all features –Iteratively discards the feature whose removal results in the best classification performance
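Sequential backward elimination can be sketched as follows. This is an illustrative outline, not the implementation used in the exercise: `evaluate` stands for "train the classifier on this feature subset and return its performance", and `toy_score` is a hypothetical stand-in with made-up scores (in percent) so that the example runs without a classifier.

```python
def backward_elimination(features, evaluate):
    """Greedy sequential backward elimination.

    `evaluate` maps a tuple of feature names to a performance score
    (higher is better). Returns the best subset seen and its score.
    """
    current = tuple(features)
    best_subset, best_score = current, evaluate(current)
    while len(current) > 1:
        # Score every subset obtained by dropping exactly one feature.
        scored = [(evaluate(tuple(f for f in current if f != drop)),
                   tuple(f for f in current if f != drop))
                  for drop in current]
        # Continue with the subset whose removal step performs best.
        score, current = max(scored, key=lambda pair: pair[0])
        if score >= best_score:  # on ties, prefer the smaller subset
            best_subset, best_score = current, score
    return best_subset, best_score

def toy_score(subset):
    """Hypothetical scores: f1 and f3 help, every other feature adds noise."""
    useful = {"f1": 30, "f3": 50}
    return 10 + sum(useful.get(f, -2) for f in subset)

subset, score = backward_elimination(["f1", "f2", "f3", "f4"], toy_score)
print(subset, score)
```

The search evaluates the classifier once per candidate subset, so a wrapper is considerably more expensive than a filter, but the selected features are tuned to the classifier that will actually be used.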
Wrapper approach Full feature set : f1,f2,f3,f4 f2,f3,f4 0.7 f1,f3,f4 0.8 f1,f2,f4 0.1 f1,f2,f f3,f f1,f4 0.1 f1,f3 0.8 f4 0.2 f3 0.7
Classification Methods -Decision trees -Hidden Markov Models (HMMs) -Support vector machines -Artificial Neural Networks -Bayesian methods -…
Decision Trees Simple, practical and easy to interpret Given a set of instances (with a set of features), a tree is constructed with internal nodes as the features and the leaves as the classes
Example Dataset: Shall we play golf?
Instance  Attributes / Features                      Class
day       outlook   temperature  humidity  windy     Play Golf ?
1         sunny     hot          high      FALSE     no
2         sunny     hot          high      TRUE      no
3         overcast  hot          high      FALSE     yes
4         rainy     mild         high      FALSE     yes
5         rainy     cool         normal    FALSE     yes
6         rainy     cool         normal    TRUE      no
7         overcast  cool         normal    TRUE      yes
8         sunny     mild         high      FALSE     no
9         sunny     cool         normal    FALSE     yes
10        rainy     mild         normal    FALSE     yes
11        sunny     mild         normal    TRUE      yes
12        overcast  mild         high      TRUE      yes
13        overcast  hot          normal    FALSE     yes
14        rainy     mild         high      TRUE      no
today     sunny     cool         high      TRUE      ?
Example: Shall we play golf today? WEKA data file (arff format)
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
Each data row is one instance (days 1 to 14): the first four columns are the independent features (attributes) Outlook, Temperature, Humidity and Windy, the last column is the class (Play Golf?).
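Feature compositions [Figure: class distribution (YES / NO) for each feature value: outlook (sunny, overcast, rainy), temperature (hot, cool, mild), humidity (high, normal), windy (True, False)]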
Decision Trees J48 pruned tree
outlook = sunny
|   humidity = high: no (3.0)
|   humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)
Number of Leaves : 5
Size of the tree : 8
Internal nodes: attributes / features. Branches: attribute values. Leaves: classes.
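To see how the J48 tree is applied, it can be transcribed by hand into nested conditions. The sketch below is such a transcription (the function name `play_golf` is illustrative); it checks the tree against the 14 training days and then classifies the "today" instance from the example dataset.

```python
def play_golf(outlook, humidity, windy):
    """Hand transcription of the J48 tree; temperature is never tested."""
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        return "yes"
    if outlook == "rainy":
        return "no" if windy else "yes"
    raise ValueError(f"unknown outlook: {outlook!r}")

# (outlook, temperature, humidity, windy, class) for days 1 to 14
DAYS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

# Number of training days the tree classifies correctly.
correct = sum(play_golf(o, h, w) == c for o, _temp, h, w, c in DAYS)
print(f"{correct} of {len(DAYS)} training days classified correctly")

# Today: sunny, cool, high humidity, windy.
today = play_golf("sunny", "high", True)
print("Play golf today?", today)
```

The tree reproduces all 14 training labels and answers "no" for today (sunny with high humidity). A perfect score on the training data says nothing yet about performance on new days.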
Artificial Neural Networks (ANNs) Artificial Neuron Neural Network
Overfitting Overfitting: A classifier that performs well on the training examples, but poorly on new examples. Training and testing on the same data will generally yield a classifier that looks good on this dataset but is heavily overfitted, so the measured performance is too optimistic. To avoid overfitting: –Use separate training and testing data –Use cross-validation –Use the simplest model possible
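Performance Evaluation Cross-Validation (10 fold): The data is split into a training set (9/10) and a test set (1/10). The classifier (ML) is trained on the training set and its performance is evaluated on the test set. This is repeated 10x, so that each tenth of the data serves as test set once.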
Performance Evaluation Confusion Matrix
TP: True Positives   TN: True Negatives   FP: False Positives   FN: False Negatives
                       Predicted label
                       positive   negative
Known     positive     TP         FN
label     negative     FP         TN
Performance Evaluation
Precision (PPV): TP / (TP + FP) –Percentage of correct positive predictions
Recall / Sensitivity: TP / (TP + FN) –Percentage of positively labeled instances, also predicted as positive
Specificity: TN / (TN + FP) –Percentage of negatively labeled instances, also predicted as negative
Accuracy: (TP + TN) / (TP + TN + FP + FN) –Percentage of correct predictions
Correlation Coefficient: cc = (TP * TN – FP * FN) / sqrt((TP+FP) * (FP+TN) * (TN+FN) * (FN+TP))
-1 ≤ cc ≤ 1   cc = 1: no FP or FN   cc = 0: random   cc = -1: only FP and FN
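The measures can be computed directly from the four confusion-matrix counts. The sketch below does so; the correlation coefficient is the Matthews correlation coefficient, with the square root of the four-factor product in the denominator. The example counts are hypothetical, and the function assumes that none of the denominators is zero except for the correlation coefficient, where 0 is returned in that case.

```python
import math

def evaluate(tp, tn, fp, fn):
    """Performance measures from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (fp + tn) * (tn + fn) * (fn + tp))
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # Convention chosen here: cc = 0 when the denominator vanishes.
        "cc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }

# Hypothetical counts for 100 test instances.
measures = evaluate(tp=40, tn=45, fp=5, fn=10)
for name, value in measures.items():
    print(f"{name:12s} {value:.3f}")
```

Unlike accuracy, the correlation coefficient stays informative when one class is much larger than the other, which is the usual situation in splice site prediction.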
ROC – Receiver Operating Characteristic x-axis: False Positive Rate (1 - Specificity) = FP / (FP + TN) y-axis: True Positive Rate (Sensitivity) = TP / (TP + FN)
ROC – Receiver Operating Characteristic [Figure: ROC curve, Sensitivity plotted against 1 - Specificity]
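A ROC curve is obtained by sweeping the decision threshold over the classifier scores and recording one (false positive rate, true positive rate) point per threshold. The following is a minimal sketch with hypothetical scores and labels; it assumes that a higher score means "more likely positive" and that both classes are present.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points, one per distinct threshold, starting at (0, 0)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical classifier outputs; label 1 = positive, 0 = negative.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points)
print(round(auc, 3))
```

A random classifier gives points along the diagonal (area 0.5); a perfect one reaches the top-left corner (area 1).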
Case Study - Splice Site Prediction
Splice site prediction: Correctly identify the borders of introns and exons in genes (splice sites) Important for gene prediction Split up into 2 tasks: –Donor prediction (exon -> intron) –Acceptor prediction (intron -> exon)
Case Study - Splice Site Prediction Splice sites are characterized by a conserved dinucleotide in the intron part of the sequence –Donor sites: GT –Acceptor sites: AG Classification problem: –Distinguish between true GT, AG and false GT, AG.
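Every GT or AG in a sequence is a candidate site, and only a few of them are real splice sites. The sketch below lists the candidates that a classifier would then have to label as true or false; the function name is illustrative and the short sequence is only a toy example.

```python
def candidate_sites(sequence, dinucleotide):
    """0-based start positions of every occurrence of the dinucleotide."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide]

sequence = "atcgatcagtatcgatGTctgagctatgag"

donor_candidates = candidate_sites(sequence, "GT")
acceptor_candidates = candidate_sites(sequence, "AG")
print("GT at", donor_candidates)
print("AG at", acceptor_candidates)
```

Even this 30-nucleotide toy sequence contains two GT and three AG dinucleotides, which illustrates why false candidates greatly outnumber true sites in real genes.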
Case Study - Splice Site Prediction Features: –Position dependent features, e.g. an A on position 1, a C on position 17, … –Position independent features, e.g. subsequence “TCG” occurs, “GAG” occurs, … Example sequence: atcgatcagtatcgat GT ctgagctatgag
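Both kinds of features can be extracted from the context around a candidate dinucleotide. The sketch below is illustrative only: the position naming (`pos-1` for the base directly upstream of the site, `pos+1` for the base directly downstream) is an assumption made for this example, not the numbering used in the exercise data.

```python
def position_dependent(upstream, downstream):
    """One feature per position: the base found there, relative to the site."""
    features = {}
    for offset, base in enumerate(reversed(upstream.upper()), start=1):
        features[f"pos-{offset}"] = base
    for offset, base in enumerate(downstream.upper(), start=1):
        features[f"pos+{offset}"] = base
    return features

def position_independent(sequence, k=3):
    """Set of all subsequences of length k that occur anywhere in the sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

upstream, site, downstream = "atcgatcagtatcgat", "GT", "ctgagctatgag"

dependent = position_dependent(upstream, downstream)
kmers = position_independent(upstream + site + downstream)

print(dependent["pos-1"], dependent["pos+1"])
print("TCG" in kmers, "GAG" in kmers)
```

Position dependent features capture the conserved pattern right around the site, while position independent features capture motifs that may occur anywhere in the context.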
Arff Data File
The original sequence files in FASTA format have been converted to represent the four DNA bases in a binary fashion: A: 1000 T: 0100 C: 0010 G: 0001
Attributes (each with values {0,1}): -68_A -68_T -68_C -68_G -67_A -67_T -67_C -67_G [...] 20_A 20_T 20_C 20_G class
Data:
0,0,0,1,0,0,0,1, [...],1,0,0,0,true
0,0,0,1,1,0,0,0, [...],1,0,0,0,true
0,1,0,0,0,0,0,1, [...],1,0,0,0,true
0,1,0,0,0,0,0,1, [...],0,0,0,1,true
[...]
1,0,0,0,0,1,0,0, [...],0,1,0,0,true
0,0,0,1,0,0,1,0, [...],0,0,1,0,true
0,0,1,0,0,0,1,0, [...],0,0,0,1,true
0,0,1,0,0,0,1,0, [...],0,0,1,0,true
Case Study - Splice Site Prediction Local context of 88 nucleotides around the splice site -> 88 position dependent features. Each nucleotide is encoded with 4 bits (A=1000, T=0100, C=0010, G=0001) -> 88 * 4 = 352 binary features. Goal: reduce the dataset to contain fewer but relevant features.
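The binary encoding can be sketched as follows. The attribute naming (`-68_A` to `20_G`) follows the Arff data file; that the window covers positions -68 to -1 and 1 to 20 with no position 0 is an assumption inferred from those names and from the total of 88 positions.

```python
# 4-bit code per base, in the attribute order A, T, C, G.
CODE = {"A": (1, 0, 0, 0), "T": (0, 1, 0, 0), "C": (0, 0, 1, 0), "G": (0, 0, 0, 1)}

def encode(window):
    """Turn a nucleotide window into a flat list of binary features."""
    bits = []
    for base in window.upper():
        bits.extend(CODE[base])
    return bits

def attribute_names(upstream=68, downstream=20):
    """Names like '-68_A': one binary attribute per position and base."""
    positions = list(range(-upstream, 0)) + list(range(1, downstream + 1))
    return [f"{position}_{base}" for position in positions for base in "ATCG"]

names = attribute_names()
print(len(names), names[:4], names[-4:])
print(encode("GT"))
```

With this encoding exactly one of the four attributes of each position is 1, so the 352 binary features are strongly redundant. That redundancy is one reason why feature selection can shrink the dataset to a much smaller set of relevant features.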
352 Binary features
15 Binary features
Case Study – Splice Site Sequence Logos Acceptor Sites: Donor Sites:
Exercise: Building a prediction tool for human mRNA splice sites Feature selection for classification of splice sites Tool: The WEKA machine learning toolkit. Go to and follow the instructions
Acknowledgements Slides and Exercises Adapted from and inspired by: Søren Brunak David Gilbert, Aik Choon Tan Yvan Saeys