Dr John Mitchell (Chemistry, St Andrews, 2019)

Presentation transcript:

Random Forest Dr John Mitchell (Chemistry, St Andrews, 2019)

Random Forest A Machine Learning Method

Random Forest A Machine Learning Method This is a decision tree.

Random Forest A decision tree is like a flow chart

Random Forest A Machine Learning Method Let’s visualise the decision tree ...

Random Forest A Machine Learning Method ... as a flow chart.

Random Forest A Machine Learning Method ... as a flow chart. In detail, it looks like this.

Random Forest A Machine Learning Method I came across Random Forest in the context of its application to chemical problems, that is, chemoinformatics (or cheminformatics; the variant spellings are equivalent).

Encoding structure as features Mapping features to property

I refer to the entities about which predictions are to be made as items. In the context of chemistry, they are usually molecules. Each row of this matrix represents an item.

Each item is actually encoded by its descriptors. The terms feature and descriptor are synonymous. Each column of the matrix contains the values of one descriptor for each of the different items.

And each row of the matrix contains all the descriptors for one item.

Mapping features to property The thing being predicted for each item is the property (output property). In this picture, it's aqueous solubility.

Classification is when the possible outputs of the prediction are discrete classes, so we are trying to put items into the correct pigeonhole, like [TRUE or FALSE], or like [RED, GREEN, or BLUE].

Regression is when the possible outputs of the prediction are continuous numerical values, so we are trying to predict the value as accurately as possible. We normally measure this with the root mean squared error (RMSE), and also look at the correlation coefficient.
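To make these two measures concrete, here is a minimal numpy sketch; the measured and predicted logS values are invented purely for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between measured and predicted values."""
    return np.corrcoef(y_true, y_pred)[0, 1]

# Invented example values (logS units)
measured  = [-2.1, -3.4, -1.0, -4.2]
predicted = [-2.4, -3.0, -1.3, -3.8]
print(rmse(measured, predicted), pearson_r(measured, predicted))
```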

Random Forest A Machine Learning Method A single decision tree can indeed make a decent classifier, but there’s an easy way to improve upon this …

Video explanation of wisdom of crowds

Wisdom of Crowds Francis Galton (1907) described a competition at a country fair, where participants were asked to estimate the mass of a cow. Individual entries were not particularly reliable, but Galton realised that by combining these guesses a much more reliable estimate could be obtained.

Wisdom of Crowds Guess the mass of the cow: Median of individual guesses is a good estimator: Francis Galton, Vox populi, Nature, 75, 450-451 (1907).

Wisdom of Crowds This is an ensemble predictor which works by aggregating individual independent estimates, and generates a result that is more reliable than the individual guesses and more accurate than the large majority of them.

Random Forest A Machine Learning Method Rather than having just one decision tree, we use lots of them to make a forest.

Random Forest Multiple trees are only useful if they are not identical! So make them randomly different. (For regression, the predictions of the trees are averaged.)

Random Forest So we randomly choose the data items for each tree. Unlike a cup draw, this is done with replacement: we choose N items out of N for each tree, but an item can be repeated. The resulting set of N non-unique items is known as a bootstrap sample.

Randomly choose data for each tree.

Random Forest This kind of with-replacement selection gives what's known as a bootstrap sample. On average, a fraction e⁻¹, or about 37%, of the items are not sampled by a given single tree. These form the "out of bag" set for that tree, and the "out of bag" data are useful for validation.
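A minimal numpy sketch of drawing such a bootstrap sample, and of checking that the out-of-bag fraction comes out near e⁻¹; the dataset size of 1000 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000                      # arbitrary number of items in the training set
items = np.arange(N)

# Bootstrap sample: N items drawn with replacement, so repeats are allowed.
bootstrap = rng.choice(items, size=N, replace=True)

# Items never drawn form the "out of bag" set for this tree.
out_of_bag = np.setdiff1d(items, bootstrap)
print(len(out_of_bag) / N)    # typically close to 1/e ≈ 0.37
```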

Random Forest We also randomly choose the questions to ask of the data. At each node, this is based on a fresh random sample of mtry of the descriptors. The split value (e.g. MW > 315) is first optimised for each of the mtry available descriptors, and the descriptor whose split best optimises the partition, or minimises the training error, is then used at that node.

Random Forest Here’s an example of such a tree. LogP SMR VSA MW nHbond Node 1 <=3.05 >3.05 >255 >7.4 <=255 <=7.4 >3 <=3 >315 <=315 Node 2 Node 3 Node 4 Node 5 Node 6 Here’s an example of such a tree.

Random Forest Here’s an example of such a tree. LogP SMR VSA MW nHbond Node 1 <=3.05 >3.05 >255 >7.4 <=255 <=7.4 >3 <=3 >315 <=315 Node 2 Node 3 Node 4 Node 5 Node 6 Here’s an example of such a tree. Question at Node 1: Is the molecular weight > 315? If true go to Node 4; if false go to Node 3.

Random Forest The building of the decision trees is the training phase of the Random Forest algorithm. Once the trees are built, the query items are passed through each decision tree. Which node they end up at depends on their descriptor values, and this node determines the tree's individual prediction.

[Figure with panels A-E.] John Mitchell, Machine learning methods in chemoinformatics, WIREs Comput. Mol. Sci., 4, 468-481 (2014).

Random Forest: Consensus For a classification problem, the trees vote for the class to assign the object to.

Random Forest: Consensus For a regression problem, the trees each predict a numerical value, and these are averaged.
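A minimal sketch of the two aggregation rules; the class labels and numerical values below are invented for illustration.

```python
from collections import Counter
import numpy as np

def forest_classify(tree_votes):
    """Classification: each tree votes for a class and the majority wins."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_values):
    """Regression: the numerical predictions of the trees are averaged."""
    return float(np.mean(tree_values))

print(forest_classify(["soluble", "insoluble", "soluble"]))   # 'soluble'
print(forest_regress([-3.1, -2.8, -3.4]))                     # about -3.1
```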

Random Forest So let’s summarise what we’ve said about Random Forest.

Random Forest Introduced by Leo Breiman and Adele Cutler (early 2000s) as a development of decision trees (recursive partitioning). Random Forest can be used for either classification or regression; in the latter case the trees are regression trees. Leo Breiman, Random Forests, Machine Learning, 45, 5-32 (2001). Vladimir Svetnik, Andy Liaw, et al., Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling, J Chem Inf Comput Sci, 43, 1947-1958 (2003).

Random Forest Introduced by Breiman and Cutler (2001) as a development of decision trees (recursive partitioning): the dataset is partitioned into consecutively smaller subsets (of similar property value); each partition is based upon the value of one descriptor; the descriptor used at each split is selected so as to minimise the error; the tree is not pruned. Leo Breiman, Random Forests, Machine Learning, 45, 5-32 (2001). Vladimir Svetnik, Andy Liaw, et al., Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling, J Chem Inf Comput Sci, 43, 1947-1958 (2003).

Random Forest (Classification) A coupled ensemble of decision trees. Each tree is trained: from a bootstrap sample of the data (giving in situ out-of-bag cross-validation); without pruning back (for classification, typically nodesize = 1); from a subset of the descriptors at each split (for classification, typically mtry = sqrt(no. of descriptors)); typically ntree = 500. Advantages: improved accuracy; a method for descriptor selection; no overfitting; easy to train; human interpretable, not a black box.
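The typical settings above map naturally onto, for example, scikit-learn's RandomForestClassifier; this is only a sketch, and the descriptor matrix is random stand-in data rather than a real chemical dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))             # 200 items x 20 descriptors (stand-in data)
y = (X[:, 0] + X[:, 3] > 0).astype(int)    # toy class labels

clf = RandomForestClassifier(
    n_estimators=500,       # ntree = 500
    max_features="sqrt",    # mtry = sqrt(no. of descriptors)
    min_samples_leaf=1,     # nodesize = 1 for classification
    oob_score=True,         # in situ out-of-bag validation
    random_state=0,
)
clf.fit(X, y)
print(clf.oob_score_)       # out-of-bag accuracy
```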

Random Forest (Regression) A coupled ensemble of regression trees. Each tree is trained: from a bootstrap sample of the data (giving in situ out-of-bag cross-validation); without pruning back (for regression, typically nodesize = 5); from a subset of the descriptors at each split (mtry = (no. of descriptors)/3); typically ntree = 500. Advantages: improved accuracy; a method for descriptor selection; no overfitting; easy to train; human interpretable, not a black box.
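The corresponding regression sketch, again with invented data; in scikit-learn a fractional max_features approximates mtry = (no. of descriptors)/3.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))                                 # stand-in descriptor matrix
y = X[:, 0] - 2 * X[:, 5] + rng.normal(scale=0.1, size=200)    # toy property values

reg = RandomForestRegressor(
    n_estimators=500,       # ntree = 500
    max_features=1 / 3,     # mtry ≈ (no. of descriptors) / 3
    min_samples_leaf=5,     # nodesize = 5 for regression
    oob_score=True,         # in situ out-of-bag validation
    random_state=0,
)
reg.fit(X, y)
print(reg.oob_score_)       # out-of-bag R^2
```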

Random Forest (Summary) Random Forest is a collection of decision trees grown with the CART algorithm. Standard parameters: it needs a moderately large number of trees (I'd suggest at least 100; generally 500 trees is plenty); no pruning back; a minimum node size of 5 (for regression); mtry descriptors tried at each split. It can quantify descriptor importance, incorporates descriptor selection, and incorporates "out-of-bag" validation.

Random Forest (variants) Bagging If we allow each split to use any of the available descriptors, rather than a randomly chosen subset, then Random Forest is equivalent to Bagging.

Random Forest (variants) ExtraTrees The ExtraTrees variant ("extremely randomized trees") uses all N items for each tree, with no bootstrap sampling. At each node it also chooses the candidate split for a given descriptor at random (RF, in contrast, makes each split as good as possible); ExtraTrees does, however, then pick the best of these candidate descriptors to carry out the split.
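In scikit-learn terms both variants can be sketched as below; this is illustrative only, not a statement about the implementations used in the work described here.

```python
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

# Bagging limit (previous slide): every split may use any descriptor,
# i.e. mtry = all descriptors.
bagging_like = RandomForestRegressor(n_estimators=500, max_features=None, random_state=0)

# ExtraTrees: no bootstrap sampling (each tree sees all N items), and the split
# thresholds for the candidate descriptors are drawn at random rather than optimised;
# the best of those candidate splits is still chosen.
extra_trees = ExtraTreesRegressor(n_estimators=500, bootstrap=False, random_state=0)
```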

Research Application: Computing Solubility

Which would you Prefer ... or ?

Which would you prefer ... or ? Solubility in water (and other biological fluids) is highly desirable for pharmaceuticals!

Solubility is an important issue in drug discovery and a major cause of failure of drug development projects. It is expensive for the pharma industry, and patients suffer from a lack of available treatments. A good computational model for predicting the solubility of druglike molecules would be very valuable.

Solubility You might think that "How much solid compound dissolves in 1 litre of water" is a simple question to answer. However, experiments are prone to large errors. The solution takes time to reach equilibrium, and the results depend on pH, temperature, ionic strength, solid form, impurities, etc.

Humankind vs The Machines Sam Boobier, Anne Osborn & John Mitchell, Can human experts predict solubility better than computers? J Cheminformatics, 9:63 (2017) Image: scmp.com

Humankind vs The Machines The challenge is to predict the solubilities of 25 molecules, given 75 as training data.

Humankind vs The Machines We sent 229 email invitations to subject experts and students, and obtained 22 anonymous responses; of those, 17 made full sets of predictions.

Humankind vs The Machines 10 machine learning algorithms were given the same training & test sets as the human panel.

[Chart comparing the best individual predictors: 0.99 vs 0.94; the difference is not significant.] Sam Boobier, Anne Osborn & John Mitchell, Can human experts predict solubility better than computers? J Cheminformatics, 9:63 (2017).

Machine Learning Algorithms Ranked [chart highlighting the 1st and 2nd placed algorithms]

Another Layer of Wisdom of Crowds We don’t know in advance which predictors will be good and which will be poor. However, we can make an algorithm that will allow us to generate a good (consensus) prediction without prior knowledge of results.

Wisdom of Crowds: Human Consensus Predictor Guess for the solubility of the molecule: Median of all (between 17 & 21) individual human guesses of logS0 for a given compound.

Wisdom of Crowds: Machine Consensus Predictor Guess for the solubility of the molecule: Median of all 10 individual machine guesses of logS for a given compound.
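Both consensus predictors reduce to taking a median across the individual predictions; a minimal sketch with invented guesses:

```python
import numpy as np

def consensus_prediction(individual_guesses):
    """Wisdom-of-crowds consensus: the median of the individual logS guesses."""
    return float(np.median(individual_guesses))

# Invented guesses of logS for one compound from several predictors
print(consensus_prediction([-3.2, -2.9, -3.5, -4.0, -3.1]))
```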

[Chart: consensus predictor, 1.09.] Sam Boobier, Anne Osborn & John Mitchell, Can human experts predict solubility better than computers? J Cheminformatics, 9:63 (2017).

[Chart comparing the two consensus predictors: 1.14 vs 1.09; the difference is not significant.] Sam Boobier, Anne Osborn & John Mitchell, Can human experts predict solubility better than computers? J Cheminformatics, 9:63 (2017).

Conclusions: Humans v ML The best humans and the best algorithms perform almost equally; the consensus of humans and the consensus of algorithms perform almost equally; less effective individual human predictors are notably weaker. Both humans and ML are numerically clearly better than a physics-based first-principles theory approach.* (* On a similar but non-identical dataset; David Palmer, James McDonagh, John Mitchell, Tanja van Mourik & Maxim Fedorov, First-Principles Calculation of the Intrinsic Aqueous Solubility of Crystalline Druglike Molecules, J Chem Theory Comput, 8, 3322-3337 (2012).)

RF & other ML Methods for Solubility Experimental data: the errors are unknown (perhaps 0.5-0.7 logS0 units?) but they limit the possible accuracy of models. Differences in dataset size and composition often hinder comparisons of methods. ML is numerically better than first principles (though FP is not widely validated), at the cost of less insight.

Descriptor Importance Replace each descriptor in turn with random noise, and measure how much randomising that descriptor worsens the prediction error. The more damaging the loss of the descriptor's information, the higher its importance. We can also measure the same effect by looking instead at node purity.
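A sketch of both routes to descriptor importance, using scikit-learn's permutation_importance alongside the fitted forest's node-purity-based feature_importances_; the data are invented, with only two informative descriptors.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))                     # toy descriptor matrix
y = 3 * X[:, 0] + X[:, 4] + rng.normal(size=300)   # only descriptors 0 and 4 matter

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Permutation importance: scramble one descriptor at a time and measure how much
# the prediction error worsens.
perm = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
print(perm.importances_mean.round(2))

# Node-purity-based importance comes for free from the fitted forest.
print(forest.feature_importances_.round(2))
```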

Platforms Photo by Richard Webb

Platforms I use the randomForest package in R. Random Forest implementations in Python are also widely available. You’ll probably find available implementations for your own favourite language and platform.

Thanks Tanja van Mourik (St Andrews); Neetika Nath; James McDonagh (now IBM); Rachael Skyner (now Diamond, Oxford); Sam Boobier (now Leeds); Will Kew (now Edinburgh); Maxim Fedorov and Dave Palmer (Strathclyde); Laura Hughes (now Stanford); Toni Llinas (AZ); Anne Osbourn (JIC, Norwich).