1
Classification by Machine Learning Approaches Michael J. Kerner – kerner@cbs.dtu.dk Center for Biological Sequence Analysis Technical University of Denmark
2
Outline Introduction to Machine Learning Datasets, Features Feature Selection Machine Learning Approaches (Classifiers) Model Evaluation and Interpretation Examples, Exercise
3
Machine Learning – Data Driven Prediction To Learn: “to gain knowledge or understanding of or skill in by study, instruction, or experience” (Merriam Webster English Dictionary, 2005) Machine Learning: Learning the theory automatically from the data, through a process of inference, model fitting, or learning from examples: Automated extraction of useful information from a body of data by building good probabilistic models. Ideally suited for areas with lots of data in the absence of a general theory.
4
Why do we need Machine Learning? Some tasks cannot be defined well except by examples (e.g. recognition of faces or people). Large amounts of data may contain hidden relationships and correlations that only automated approaches can detect. The amount of knowledge about a certain problem or task may be too large for explicit encoding by humans (e.g. in medical diagnostics). Environments change over time, and new knowledge is constantly being discovered; a continuous redesign of the systems “by hand” may be difficult.
5
The Machine Learning Approach: Input Data (e.g. gene expression profiles, …) → Machine Learning → Classifier → Prediction: Yes / No
6
Machine Learning
Learning Task:
–What do we want to learn or predict?
Data and assumptions:
–What data do we have available?
–What is their quality?
–What can we assume about the given problem?
Representation:
–What is a suitable representation of the examples to be classified?
Method and Estimation:
–Are there possible hypotheses?
–Can we adjust our predictions based on the given results?
Evaluation:
–How well does the method perform?
–Might another approach/model perform better?
7
Learning Tasks
Classification:
–Prediction of an item class.
Forecasting:
–Prediction of a parameter value.
Characterization:
–Find hypotheses that describe groups of items.
Clustering:
–Partitioning of the (unassigned) data set into clusters with common properties. (Unsupervised learning)
8
Emergence of Large Datasets Dataset examples: Image processing Spam email detection Text mining DNA micro-array data Protein function Protein localization Protein-protein interaction …
9
Dataset Examples Edible or poisonous ?
10
Dataset Examples
11
mRNA Splicing
12
mRNA Splice Site Prediction
13
Protein Function Prediction: ProtFun Predict as many biologically relevant features as we can from the sequence Train artificial neural networks for each category Assign a probability for each category from the NN outputs
14
############## ProtFun 2.2 predictions ##############
>KCNA1_HUMAN
# Functional category                  Prob    Odds
  Amino_acid_biosynthesis              0.042   1.893
  Biosynthesis_of_cofactors            0.119   1.654
  Cell_envelope                        0.031   0.507
  Cellular_processes                   0.027   0.373
  Central_intermediary_metabolism      0.046   0.731
  Energy_metabolism                    0.036   0.395
  Fatty_acid_metabolism                0.019   1.485
  Purines_and_pyrimidines              0.214   0.879
  Regulatory_functions                 0.013   0.083
  Replication_and_transcription        0.019   0.073
  Translation                          0.129   2.925
  Transport_and_binding             => 0.717   1.748
# Enzyme/nonenzyme                     Prob    Odds
  Enzyme                               0.231   0.807
  Nonenzyme                         => 0.769   1.078
# Enzyme class                         Prob    Odds
  Oxidoreductase (EC 1.-.-.-)          0.040   0.193
  Transferase (EC 2.-.-.-)             0.056   0.163
  Hydrolase (EC 3.-.-.-)               0.062   0.195
  Lyase (EC 4.-.-.-)                   0.020   0.430
  Isomerase (EC 5.-.-.-)               0.010   0.321
  Ligase (EC 6.-.-.-)                  0.017   0.326
# Gene Ontology category               Prob    Odds
  Signal_transducer                    0.061   0.284
  Receptor                             0.055   0.323
  Hormone                              0.001   0.206
  Structural_protein                   0.002   0.086
  Transporter                          0.469   4.299
  Ion_channel                          0.207   3.633
  Voltage-gated_ion_channel         => 0.280  12.736
  Cation_channel                       0.348   7.560
  Transcription                        0.163   1.270
  Transcription_regulation             0.166   1.331
  Stress_response                      0.011   0.125
  Immune_response                      0.031   0.370
  Growth_factor                        0.005   0.372
  Metal_ion_transport                  0.159   0.345
15
Complexity of datasets: Many instances (examples) Instances with multiple features (properties / characteristics) Dependencies between the features (correlations) Emergence of Large Datasets
16
Data Preprocessing Instance selection: –Remove identical / inconsistent / incomplete instances (e.g. reduction of homologous genes, removal of wrongly annotated genes) Feature transformation / selection: –Projection techniques (e.g. principal components analysis) –Compression techniques (e.g. minimum description length) –Feature selection techniques
17
Benefits of Feature Selection
Attain good, often even better, classification performance using a small subset of features
–Less noise in the data
Provide more cost-effective classifiers
–Fewer features to take into account → smaller datasets → faster classifiers
Identification of (biologically) relevant features for the given problem
18
Feature Selection
Filter approach: All Features → Feature Subset Selection (using a selection criterion) → Selected Features → Learning Algorithm → Evaluation
Wrapper approach: All Features → Feature Subset Search Algorithm ↔ Learning Algorithm (classifier evaluation guides the search) → Optimal Features
19
Filter Approach
Independent of the classification model
A relevance measure for each feature is calculated
Features with a value lower than a selected threshold t will be removed
Example: Feature-class entropy
Measures the “uncertainty” about the class when observing feature i

f1 f2 f3 f4 | class
 1  0  1  1 |  1
 1  0  0  0 |  0
 0  1  1  0 |  1
 0  0  1  0 |  0
 1  0  1  0 |  1
 1  1  0  1 |  0
 0  1  0  1 |  1
 0  1  0  1 |  0
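The feature-class entropy filter can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the toy table with features f1–f4 is the one on this slide:

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(feature_values, labels):
    """H(class | feature): expected remaining uncertainty about the
    class after observing the feature value. Lower values mean the
    feature is more informative about the class."""
    n = len(labels)
    groups = defaultdict(list)
    for v, y in zip(feature_values, labels):
        groups[v].append(y)
    h = 0.0
    for ys in groups.values():
        p_v = len(ys) / n
        for c in Counter(ys).values():
            p = c / len(ys)
            h -= p_v * p * math.log2(p)
    return h

# The toy dataset from the slide: rows are (f1, f2, f3, f4, class).
rows = [(1,0,1,1,1), (1,0,0,0,0), (0,1,1,0,1), (0,0,1,0,0),
        (1,0,1,0,1), (1,1,0,1,0), (0,1,0,1,1), (0,1,0,1,0)]
labels = [r[4] for r in rows]
for i in range(4):
    col = [r[i] for r in rows]
    print(f"H(class | f{i+1}) = {conditional_entropy(col, labels):.3f}")
# f3 comes out lowest (about 0.811); f1, f2 and f4 leave the full
# 1.0 bit of uncertainty, so a filter would keep f3 and drop the rest.
```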
20
Wrapper approach Specific to a classification algorithm The search for a good feature subset is guided by a search algorithm The algorithm uses the evaluation of the classifier as a guide to find good feature subsets Search algorithm examples: sequential forward or backward search, genetic algorithms Sequential backward elimination –Starts with the set of all features –Iteratively discards the feature whose removal results in the best classification performance
21
Wrapper approach
Sequential backward elimination example (classification performance of each feature subset):
Full feature set: f1,f2,f3,f4
Remove one feature:  f2,f3,f4 → 0.7   f1,f3,f4 → 0.8   f1,f2,f4 → 0.1   f1,f2,f3 → 0.75
Remove another:      f3,f4 → 0.85   f1,f4 → 0.1   f1,f3 → 0.8
Remove another:      f4 → 0.2   f3 → 0.7
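The greedy search illustrated above can be sketched as follows. The subset scores are the ones from this slide; the score for the full feature set is not shown on the slide and the value 0.72 used here is made up for illustration:

```python
def backward_elimination(all_features, evaluate):
    """Sequential backward elimination: start from the set of all
    features and repeatedly drop the feature whose removal yields
    the best classifier performance; return the best subset seen."""
    current = frozenset(all_features)
    best, best_score = current, evaluate(current)
    while len(current) > 1:
        current = max((current - {f} for f in current), key=evaluate)
        if evaluate(current) > best_score:
            best, best_score = current, evaluate(current)
    return best, best_score

# Subset scores from the slide; the full-set score (0.72) is assumed.
scores = {
    frozenset("1234"): 0.72,
    frozenset("234"): 0.7, frozenset("134"): 0.8,
    frozenset("124"): 0.1, frozenset("123"): 0.75,
    frozenset("34"): 0.85, frozenset("14"): 0.1, frozenset("13"): 0.8,
    frozenset("4"): 0.2, frozenset("3"): 0.7,
}
best, score = backward_elimination("1234", lambda s: scores.get(s, 0.0))
print(sorted(best), score)  # ['3', '4'] 0.85
```

The search visits the same subsets as the slide and settles on {f3, f4}, the best-scoring subset encountered along the greedy path.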
22
Classification Methods
-Decision trees
-Hidden Markov Models (HMMs)
-Support vector machines
-Artificial Neural Networks
-Bayesian methods
-…
23
Decision Trees Simple, practical and easy to interpret Given a set of instances (with a set of features), a tree is constructed with internal nodes as the features and the leaves as the classes
24
Example Dataset: Shall we play golf?

day    outlook   temperature  humidity  windy  Play Golf?
1      sunny     hot          high      FALSE  no
2      sunny     hot          high      TRUE   no
3      overcast  hot          high      FALSE  yes
4      rainy     mild         high      FALSE  yes
5      rainy     cool         normal    FALSE  yes
6      rainy     cool         normal    TRUE   no
7      overcast  cool         normal    TRUE   yes
8      sunny     mild         high      FALSE  no
9      sunny     cool         normal    FALSE  yes
10     rainy     mild         normal    FALSE  yes
11     sunny     mild         normal    TRUE   yes
12     overcast  mild         high      TRUE   yes
13     overcast  hot          normal    FALSE  yes
14     rainy     mild         high      TRUE   no
today  sunny     cool         high      TRUE   ?
25
Example: Shall we play golf today? WEKA data file (arff format):

@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
26
Feature compositions
(Decision tree figure: internal nodes split on outlook {sunny, overcast, rainy}, temperature {hot, cool, mild}, humidity {high, normal} and windy {True, False}; the leaves carry the class labels YES / NO)
27
Decision Trees
J48 pruned tree
------------------
outlook = sunny
|   humidity = high: no (3.0)
|   humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves : 5
Size of the tree : 8

(The tree is built from attributes / features, their attribute values, and the classes at the leaves.)
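Trees like the J48 output above are grown by choosing, at each node, the attribute that reduces class uncertainty the most (highest information gain). A minimal sketch on the golf dataset, in plain Python rather than the actual J48/WEKA code:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Class entropy in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Reduction in class entropy obtained by splitting on an attribute."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys)
                    for ys in groups.values())
    return entropy(labels) - remainder

# The 14 golf instances: (outlook, temperature, humidity, windy, play).
data = [
    ("sunny","hot","high","FALSE","no"), ("sunny","hot","high","TRUE","no"),
    ("overcast","hot","high","FALSE","yes"), ("rainy","mild","high","FALSE","yes"),
    ("rainy","cool","normal","FALSE","yes"), ("rainy","cool","normal","TRUE","no"),
    ("overcast","cool","normal","TRUE","yes"), ("sunny","mild","high","FALSE","no"),
    ("sunny","cool","normal","FALSE","yes"), ("rainy","mild","normal","FALSE","yes"),
    ("sunny","mild","normal","TRUE","yes"), ("overcast","mild","high","TRUE","yes"),
    ("overcast","hot","normal","FALSE","yes"), ("rainy","mild","high","TRUE","no"),
]
attrs = ["outlook", "temperature", "humidity", "windy"]
labels = [row[4] for row in data]
gains = {a: info_gain([row[i] for row in data], labels)
         for i, a in enumerate(attrs)}
print(max(gains, key=gains.get))  # outlook: it becomes the root node
```

This reproduces why the J48 tree above tests outlook first: it has the highest information gain of the four attributes.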
28
Artificial Neural Networks (ANNs) Artificial Neuron Neural Network
29
Overfitting
Overfitting: a classifier that performs well on the training examples, but poorly on new examples.
Training and testing on the same data will generally produce a classifier that looks good on this dataset but is heavily overfitted.
To avoid overfitting:
Use separate training and testing data
Use cross-validation
Use the simplest model possible
30
Performance Evaluation
Cross-Validation (10-fold): the data are split into 10 parts; in each round, 9/10 form the training set for the ML classifier and the remaining 1/10 serves as the test set for performance evaluation. This is repeated 10 times.
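A minimal sketch of how 10-fold cross-validation partitions the data; this is index bookkeeping only, with the classifier itself omitted:

```python
import random

def k_fold_splits(n_instances, k=10, seed=42):
    """Shuffle the instance indices and yield k (train, test) splits:
    each fold is used once as the 1/10 test set while the remaining
    9/10 of the instances train the classifier."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# With 100 instances: 10 rounds, each with 90 training / 10 test items.
for train, test in k_fold_splits(100):
    assert len(train) == 90 and len(test) == 10
    assert not set(train) & set(test)  # train and test never overlap
```

Because every instance appears in exactly one test fold, each prediction used for evaluation comes from a model that never saw that instance during training.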
31
Performance Evaluation
Confusion Matrix
TP: True Positives
TN: True Negatives
FP: False Positives
FN: False Negatives

                    Predicted label
                    positive  negative
Known    positive      TP        FN
label    negative      FP        TN
32
Performance Evaluation
Precision (PPV) = TP / (TP + FP)
–Percentage of correct positive predictions
Recall / Sensitivity = TP / (TP + FN)
–Percentage of positively labeled instances also predicted as positive
Specificity = TN / (TN + FP)
–Percentage of negatively labeled instances also predicted as negative
Accuracy = (TP + TN) / (TP + TN + FP + FN)
–Percentage of correct predictions
Correlation Coefficient = (TP·TN − FP·FN) / √((TP+FP)·(FP+TN)·(TN+FN)·(FN+TP))
–−1 ≤ cc ≤ 1; cc = 1: no FP or FN; cc = 0: random; cc = −1: only FP and FN
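The measures above as a small Python helper, a direct transcription of the formulas (the toy confusion-matrix counts are made up for illustration):

```python
import math

def performance(tp, tn, fp, fn):
    """Standard performance measures from confusion-matrix counts."""
    precision   = tp / (tp + fp)
    sensitivity = tp / (tp + fn)              # recall
    specificity = tn / (tn + fp)
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    # Matthews correlation coefficient, in [-1, 1]
    cc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (fp + tn) * (tn + fn) * (fn + tp))
    return precision, sensitivity, specificity, accuracy, cc

# Toy counts: 6 TP, 5 TN, 2 FP, 1 FN
p, se, sp, acc, cc = performance(6, 5, 2, 1)
print(f"precision={p:.3f} sensitivity={se:.3f} "
      f"specificity={sp:.3f} accuracy={acc:.3f} cc={cc:.3f}")
```

Note that accuracy alone can be misleading on unbalanced datasets, which is one reason the correlation coefficient is reported alongside it.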
33
ROC – Receiver Operating Characteristic
x-axis: False Positive Rate = FP / (FP + TN) = 1 − Specificity
y-axis: True Positive Rate = Sensitivity = TP / (TP + FN)
34
ROC – Receiver Operating Characteristic
(ROC curve plot: x-axis 1 − Specificity, y-axis Sensitivity)
35
Case Study - Splice Site Prediction
36
Splice site prediction: Correctly identify the borders of introns and exons in genes (splice sites) Important for gene prediction Split up into 2 tasks: –Donor prediction (exon -> intron) –Acceptor prediction (intron -> exon)
37
Case Study - Splice Site Prediction
Splice sites are characterized by a conserved dinucleotide in the intron part of the sequence
–Donor sites: GT
–Acceptor sites: AG
Classification problem:
–Distinguish between true GT, AG and false GT, AG.
38
Case Study - Splice Site Prediction
Features:
Position dependent features: e.g. an A on position 1, a C on position 17, …
Position independent features: e.g. subsequence “TCG” occurs, “GAG” occurs, …
(Figure: example donor-site sequence atcgatcagtatcgat GT ctgagctatgag with positions 1, 2, 3, 17, 28 marked)
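The two feature types can be sketched as a toy extractor. This is illustrative only, not code from an actual splice-site tool; the default k-mer list simply reuses the two subsequences mentioned above:

```python
def splice_features(seq, kmers=("TCG", "GAG")):
    """Toy feature extractor for a sequence window.
    Position dependent features: the base observed at each position.
    Position independent features: whether each k-mer occurs anywhere."""
    feats = {f"{i + 1}_{base}": 1 for i, base in enumerate(seq)}
    for k in kmers:
        feats[f"has_{k}"] = int(k in seq)
    return feats

f = splice_features("ATCGATCAGT")
print(f["1_A"], f["has_TCG"], f["has_GAG"])  # 1 1 0
```

A real encoding would also emit explicit zeros for the absent bases at each position, as the binary arff representation below does.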
39
Original Data – Human Acceptor Splice Site Sites >HUMGLUT4B_3535 GGGCCCCTAGCGGAAGGAAAAAAATCATGGTTCCATGTGACATGCTGTGTCTTTGTGTCTGCCTGTTCAGGATGGGGAACCCCCTCAGCA >HUMGLUT4B_3763 GAGGACAGGTGTCTCGGGGGTGGTGGAAAGGGGACGGTCTGCAGGAAATCTGTCCTCTGCTGTCCCCCAGGTGATTGAACAGAGCTACAA >HUMGLUT4B_4028 TGGGGGAAACAGGAAGGGAGCCACTGCTGGGTGCCCTCACCCTCACAGCCTCACTCTGTCTGCCTGCCAGGAAAAGGGCCATGCTGGTCA >HUMGLUT4B_4276 TGGGCTTTCAGATGGGAATGGACACCTGCCCTCAGCCCTCTCTTCTTCCCTCGCCCAGGGCTGACATCAGGGCTGGTGCCCATGTACGTG >HUMGLUT4B_4507 ATATGGTGGGCTTCCAAGGTAAGGCAGAAGGGCTGAGTGACCTGCCTTCTTTCCCAACCTTCTCCCACAGGTGCTGGGCTTGGAGTCCCT >HUMGLUT4B_4775 GCCTCCGCCTCATCTTGCTAGCACCTGGCTTCCTCTCAGGTCCCCTCAGGCCTGACCTTCCCTTCTCCAGGTCTGAAGCGCCTGACAGGC >HUMGLUT4B_5125 CCAGCCTGTTGTGGCTGGAGTAGAGGAAGGGGCATTCCTGCCATCACTTCTTCTTCTCCCCCACCTCTAGGTTTTCTATTATTCGACCAG >HUMGLUT4B_5378 CCTCACCCACGCGGCCCCTCCTACTTCCCGTGCCCAAAAGGCTGGGGTCAAGCTCCGACTCTCCCCGCAGGTGTTGTTGGTGGAGCGGGC >HUMGLUT4B_5995 CTGAGTTGAGGGCAAGGGAAGATCAGAAAGGCCTCAACTGGATTCTCCACCCTCCCTGTCTGGCCCCTAGGAGCGAGTTCCAGCCATGAG >HUMGLUT4B_6716 CTGGTTGCCTGAAACTACCCCTTCCCTCCCCACCTCACTCCGTCAACACCTCTTTCTCCACCTGTCCCAGGAGGCTATGGGGCCCTACGT >HSRPS6G_1493 CTTTGTAGATGGCTCTACAATTACCTGTATAGATAGTTTCGTAAACTATTTCCCCCCTTTTAATCCTTAGCTGAACATCTCCTTCCCAGC [...]
40
Arff Data File - WEKA

@RELATION splice-train
@ATTRIBUTE -68_A {0,1}
@ATTRIBUTE -68_T {0,1}
@ATTRIBUTE -68_C {0,1}
@ATTRIBUTE -68_G {0,1}
@ATTRIBUTE -67_A {0,1}
@ATTRIBUTE -67_T {0,1}
@ATTRIBUTE -67_C {0,1}
@ATTRIBUTE -67_G {0,1}
[...]
@ATTRIBUTE 20_A {0,1}
@ATTRIBUTE 20_T {0,1}
@ATTRIBUTE 20_C {0,1}
@ATTRIBUTE 20_G {0,1}
@ATTRIBUTE class {true,false}
@DATA
0,0,0,1,0,0,0,1, [...],1,0,0,0,true
0,0,0,1,1,0,0,0, [...],1,0,0,0,true
0,1,0,0,0,0,0,1, [...],1,0,0,0,true
0,1,0,0,0,0,0,1, [...],0,0,0,1,true
[...]
1,0,0,0,0,1,0,0, [...],0,1,0,0,true
0,0,0,1,0,0,1,0, [...],0,0,1,0,true
0,0,1,0,0,0,1,0, [...],0,0,0,1,true
0,0,1,0,0,0,1,0, [...],0,0,1,0,true

The original sequence files in FASTA format have been converted to represent the four DNA bases in a binary fashion:
A: 1 0 0 0
T: 0 1 0 0
C: 0 0 1 0
G: 0 0 0 1
41
Case Study - Splice Site Prediction
Local context of 88 nucleotides around the splice site
88 position dependent features
A=1000, T=0100, C=0010, G=0001
88 positions × 4 bits = 352 binary features
Reduce the dataset to contain fewer but relevant features
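The binary encoding described above, as a short sketch (an assumed helper, not the actual conversion script used for the course data):

```python
# One 4-bit indicator per base, as in the arff file: A=1000, T=0100,
# C=0010, G=0001.
ENCODING = {"A": (1,0,0,0), "T": (0,1,0,0), "C": (0,0,1,0), "G": (0,0,0,1)}

def encode(seq):
    """Map a DNA sequence to the flat binary feature vector:
    4 bits per position, so an 88-nucleotide window gives 352 features."""
    return [bit for base in seq.upper() for bit in ENCODING[base]]

print(encode("AT"))            # [1, 0, 0, 0, 0, 1, 0, 0]
print(len(encode("A" * 88)))   # 352
```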
42
352 Binary features
43
15 Binary features
44
Case Study – Splice Site Sequence Logos
Acceptor Sites: (sequence logo figure)
Donor Sites: (sequence logo figure)
45
Exercise: Building a prediction tool for human mRNA splice sites
Feature selection for classification of splice sites
Tool: the WEKA machine learning toolkit.
Go to http://www.cbs.dtu.dk/~kerner/GeneDisc_Course_2007_MJK/ and follow the instructions
46
Acknowledgements Slides and Exercises Adapted from and inspired by: Søren Brunak David Gilbert, Aik Choon Tan Yvan Saeys