1 Module #: Title of Module

2 Machine Learning 101

3 Overview 1. Machine learning: a brief introduction; 2. Unsupervised machine learning: i. Clustering methods; 3. Supervised machine learning: i. Why predict? ii. Thinking like a data scientist; iii. Study design: defining error, choosing an internal validation strategy, picking features and choosing the model, estimating generalization error.

4 Types of Machine Learning

5 Descriptive (Unsupervised) Goal: find 'interesting' patterns in the data; the input is unlabeled.

6 Predictive (Supervised): Dataset → Sampling Strategy → Training Set → Model → Prediction Function

7 Classification: Dataset → Sampling Strategy → Training Set → Model → Prediction Function

8 Regression: Dataset → Sampling Strategy → Training Set → Model → Prediction Function

9 Overview 1. Machine learning: a brief introduction; 2. Unsupervised machine learning: i. Why cluster? ii. Hierarchical clustering; iii. Partitioning methods; 3. Supervised machine learning.

10 Unsupervised methods for discovering biomarkers Example in breast cancer: good predictors can aid medical decision making. A prognostic gene expression signature measured in women who have breast cancer can be used to predict survival.

11 Intrinsic Subtypes of Breast Cancer (Sorlie et al. 2001; Parker et al. 2009)

12 Introduction to unsupervised ML (clustering) Clustering is an example of unsupervised learning, is useful for the analysis of patterns in data, and can lead to class discovery. Clustering is the partitioning of a data set into groups of elements that are more similar to each other than to elements in other groups. Clustering is a completely general method that can be applied to genes, samples, or both.

13 How is clustering done (simple)? [Figure: genes plotted by Stimulus #1 vs. Stimulus #2, showing clusters and outliers, with intra-cluster and inter-cluster distances marked.] Clustering aims to MINIMIZE intra-cluster distance and MAXIMIZE inter-cluster distance.

14 Overview 1. Machine learning: a brief introduction; 2. Unsupervised machine learning: i. Why cluster? ii. Hierarchical clustering; iii. Partitioning methods; 3. Supervised machine learning.

15 Hierarchical clustering Given N items and a distance metric... 1. Assign each item to its own "cluster". Initialize the distance matrix between clusters as the distance between items. 2. Find the closest pair of clusters and merge them into a single cluster. 3. Compute new distances between clusters. 4. Repeat 2-3 until all clusters have been merged into a single cluster.

16 "Given N items and a distance metric..." What is a metric? A metric has to fulfill three conditions: "identity" "symmetry" "triangle inequality" Hierarchical clustering

17 Distance metrics Common metrics include: Manhattan distance: d(x, y) = sum_i |x_i - y_i|. Euclidean distance: d(x, y) = sqrt( sum_i (x_i - y_i)^2 ). 1-correlation: d(x, y) = 1 - cor(x, y) (proportional to Euclidean distance, but invariant to range of measurement from one sample to the next).

18 Distance metrics compared [Figure: the same data clustered under Euclidean, Manhattan, and 1-correlation distances.] Distance matters!

19 Other distance metrics Hamming distance for ordinal, binary or categorical data: the number of positions at which two vectors differ.
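Not from the original slides: a minimal R sketch of how these distances might be computed with base R's dist(). The small matrix x is made up for illustration, and hamming() is a hypothetical helper, not a standard function.

set.seed(1)
x <- matrix(rnorm(20), nrow = 4)           # 4 toy profiles, 5 measurements each
d.euc <- dist(x, method = "euclidean")     # Euclidean distance between rows
d.man <- dist(x, method = "manhattan")     # Manhattan (city-block) distance
d.cor <- as.dist(1 - cor(t(x)))            # 1 - Pearson correlation between rows
hamming <- function(a, b) sum(a != b)      # hypothetical helper: positions that differ
hamming(c(0, 1, 1, 0), c(0, 1, 0, 1))      # returns 2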

20 Agglomerative hierarchical clustering

21 Anatomy of a clustergram: a heatmap with an attached dendrogram.

22 Hierarchical clustering Anatomy of hierarchical clustering. Input: a distance matrix and a linkage method. Output: a dendrogram, a tree that defines the relationships between objects and the distance between clusters (a nested sequence of clusters).

23 Linkage methods: single, complete, average, distance between centroids.

24 Example: cell cycle data
# first 50 genes, 17 time points
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])
# Euclidean distance matrix between genes
D.cho <- dist(cho.data, method = "euclidean")
# single-linkage hierarchical clustering
hc.single <- hclust(D.cho, method = "single", members = NULL)

25 plot(hc.single) Single linkage Example: cell cycle data

26 Careful with the interpretation of dendrograms: adjacency of leaves implies a proximity between elements that does not necessarily reflect the actual distance between those elements! cf. genes #1 and #47. Example: cell cycle data

27 Single linkage, k=2 rect.hclust(hc.single,k=2) Example: cell cycle data

28 Single linkage, k=3 rect.hclust(hc.single,k=3) Example: cell cycle data

29 Single linkage, k=4 rect.hclust(hc.single,k=4) Example: cell cycle data

30 Single linkage, k=5 rect.hclust(hc.single,k=5) Example: cell cycle data

31 Single linkage, k=25 rect.hclust(hc.single,k=25) Example: cell cycle data

32 Properties of cluster members, single linkage, k=4. Example: cell cycle data
class.single <- cutree(hc.single, k = 4)   # cluster label for each gene
par(mfrow = c(2, 2))                       # 2 x 2 grid of panels
matplot(t(cho.data[class.single == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
# likely why as.matrix() rather than t() is used here: if the cluster has only one member,
# the subset drops to a vector, and as.matrix() keeps it plottable as a single profile
matplot(as.matrix(cho.data[class.single == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

33 Single linkage, k=4. [Figure: expression profiles of the four clusters, panels 1-4.] Example: cell cycle data

34 Complete linkage, k=4 vs. single linkage, k=4. [Figure: cluster profiles under each linkage, panels 1-4.] Example: cell cycle data
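Not part of the original deck: a small sketch of how the complete-linkage panels might be reproduced, reusing D.cho and class.single defined on the earlier slides.

hc.complete <- hclust(D.cho, method = "complete")          # complete linkage on the same distances
class.complete <- cutree(hc.complete, k = 4)               # cut into 4 clusters
table(single = class.single, complete = class.complete)    # compare cluster memberships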

35 Hierarchical clustering analyzed. Advantages: there may be small clusters nested inside large ones; no need to specify the number of groups ahead of time; flexible linkage methods. Disadvantages: clusters might not be naturally represented by a hierarchical structure; it's necessary to 'cut' the dendrogram in order to produce clusters; bottom-up clustering can result in poor structure at the top of the tree, and early joins cannot be 'undone'.

36 Overview 1. Machine learning: a brief introduction; 2. Unsupervised machine learning: i. Why cluster? ii. Hierarchical clustering; iii. Partitioning methods; 3. Supervised machine learning.

37 Partitioning methods Anatomy of a partitioning-based method. Input: a data matrix, a distance function, and the number of groups. Output: a group assignment for every object.

38 Partitioning based methods 1. Choose K groups. 2. Initialise group centres (a.k.a. centroids or medoids). 3. Assign each object to the nearest centroid according to the distance metric. 4. Reassign (or recompute) the centroids. 5. Repeat steps 3-4 until the assignment stabilizes.

39 K-means vs. K-medoids. K-means: centroids are the 'mean' of the clusters; centroids need to be recomputed every iteration; initialisation is difficult, as the notion of a centroid may be unclear before beginning; R function: kmeans. K-medoids: the centroid is an actual object that minimizes the total within-cluster distance; the centroid can be determined from a quick look-up into the distance matrix; initialisation is simply K randomly selected objects; R function: pam.
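The slide names the R functions kmeans and pam; here is a minimal sketch (assuming the cluster package, which provides pam) comparing the two on the cell cycle data loaded earlier. The choice of 4 clusters mirrors the hierarchical example, not a prescription from the slides.

library(cluster)                                   # provides pam()
set.seed(1)
km <- kmeans(cho.data, centers = 4)                # centroids are cluster means
pm <- pam(cho.data, k = 4)                         # medoids are actual observations
table(kmeans = km$cluster, pam = pm$clustering)    # compare the two assignments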

40 Partitioning based methods. Advantages: the number of groups is well defined; a clear, deterministic assignment of an object to a group; simple algorithms for inference. Disadvantages: have to choose the number of groups; sometimes objects do not fit well into any cluster; can converge on locally optimal solutions and often require multiple restarts with random initializations.

41 K-means. N items, assume K clusters. The goal is to minimize the within-cluster sum of squares, sum_k sum_{i in cluster k} || x_i - mu_k ||^2, over the possible assignments and centroids; mu_k represents the location of cluster k.

42 K-means 1. Divide the data into K clusters; initialize the centroids with the mean of the clusters. 2. Assign each item to the cluster with the closest centroid. 3. When all objects have been assigned, recalculate the centroids (mean). 4. Repeat 2-3 until the centroids no longer move.

43 K-means
set.seed(100)
# two Gaussian blobs in 2-D
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
# same random starting centres each time, but stop after 1, 2, then 3 iterations
for (iters in 1:3) {
  set.seed(100)
  cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = iters)
  plot(x, col = cl$cluster)
  points(cl$centers, col = 1:5, pch = 8, cex = 2)
}


45 K-means, k=4. Example: cell cycle data
set.seed(100)
km.cho <- kmeans(cho.data, 4)    # K-means with 4 clusters on the cell cycle data
par(mfrow = c(2, 2))             # one panel per cluster
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

46 K-means, k=4 vs. single linkage, k=4. [Figure: cluster profiles under each method, panels 1-4.] Example: cell cycle data

47 Summary K-means and hierarchical clustering are simple, fast and useful techniques. Beware of the memory requirements for hierarchical clustering. Both are a bit "ad hoc": how many clusters? which distance metric? what counts as a good clustering?

48 Overview 1. Machine learning: a brief introduction; 2. Unsupervised machine learning: i. Clustering methods; 3. Supervised machine learning: i. Why predict? ii. Thinking like a data scientist; iii. Study design: defining error, splitting data, picking features and choosing the model, estimating generalization error.

49 Why predict: Oncotype DX, a breast cancer 22-gene signature (Paik et al. 2004).

50 Overview 1. Machine learning: a brief introduction; 2. Unsupervised machine learning: i. Clustering methods; 3. Supervised machine learning: i. Why predict? ii. Thinking like a data scientist; iii. Study design: defining error, splitting data, picking features and choosing the model, estimating generalization error.

51 Thinking like a Data Scientist: Question → Data → Features → Algorithm → Parameter choice → Evaluation. Bringing a data scientist in early in the design of a project that will make use of machine learning is essential: they are trained to formalize the issues above.

52 The Question John Wilder Tukey: "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise."

53 The Question Must be specific and well defined. What are you trying to predict? Why is this an important question? How will the end user benefit from the answer? The question is the most important part of any ML project.

54 The data: posing the question determines how predictions will be made. Publicly available data: what can and can't be answered with it? Is an experiment needed? What is the cost of getting the data? Very important: garbage in, garbage out.

55 The features: making the data useful. A feature is a measurable property of a phenomenon being observed: numeric values, strings, or more complicated structures such as graphs. ML algorithms need good features: informative, discriminating, independent.

56 Learning a model: Dataset → Sampling Strategy → Training Set → Model → Prediction Function.

57 Model parameterization: two types of parameters. Model parameters are tuned on the training set. Hyperparameters are tuned during evaluation (next step).

58 Algorithm Evaluation: assessing model performance. In-sample error (training set) vs. out-of-sample error (test set); overfitting shows up as in-sample error << out-of-sample error. Tools: cross-validation, ROC curves.

59 Overview 1. Machine learning: a brief introduction; 2. Unsupervised machine learning: i. Clustering methods; 3. Supervised machine learning: i. Why predict? ii. Thinking like a data scientist; iii. Study design: defining error, internal validation strategies, picking features and choosing the model, estimating generalization error.

60 Study Design Define the error rate → choose an internal validation strategy → pick features → choose the model → apply to the test set and refine → apply to the validation set once.

61 An example: the iris dataset, used as a classic classification example throughout computer science. 'iris' has 150 cases (rows) and 4 features (columns); the goal is to predict the Species.
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
5.1           3.5          1.4           0.2          setosa
4.9           3.0          1.4           0.2          versi..
4.7           3.2          1.3           0.2          verg..

62 A practical example:
library(caret)
data(iris)
features <- iris[, 1:4]
classes  <- iris[, 5]
model <- knn3(features, classes, k = 1)                              # for now, just a model
predictions <- predict(model, as.matrix(features), type = "class")   # predictions of the model
confusionMatrix(predictions, classes)                                # get model performance

63 A practical example:
> confusionMatrix(predictions, classes)
Overall Statistics:
    Accuracy : 0.9667
Statistics by Class:
                    setosa  versicolor  virginica
Sensitivity         1       0.94        0.96
Specificity         1       0.98        0.97
Pos_Pred_Value      1       0.9592      0.9412
Neg_Pred_Value      1       0.9703      0.9798
Balanced_Accuracy   1       0.96        0.965

64 ROC curves Binary classification outcomes are categorical (alive/dead, good prognosis/poor prognosis), but predictions are often quantitative, e.g. a probability of being alive. A score threshold is used to decide the prediction, and the cutoff chosen gives different results.

65 ROC curves:

66 Comparing models with ROC curves What makes a good model is domain specific: in some disciplines AUC = 0.7 is very good, while in others AUC = 0.9 is essential. http://www.sprawls.org/ppmi2/IMGCHAR/1IMCHAR12.gif
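A hedged sketch, not from the slides, of how an ROC curve and its AUC might be computed in R. The pROC package and the simulated labels and scores are assumptions made purely for illustration.

library(pROC)                                                  # assumption: pROC is installed
set.seed(1)
labels <- rep(c("good", "poor"), each = 50)                    # made-up binary outcomes
scores <- rnorm(100, mean = ifelse(labels == "poor", 1, 0))    # made-up risk scores
roc.obj <- roc(labels, scores)                                 # ROC across all possible cutoffs
auc(roc.obj)                                                   # area under the curve
plot(roc.obj)                                                  # sensitivity vs. specificity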

67 Overview 1. Machine learning: a brief introduction; 2. Unsupervised machine learning: i. Clustering methods; 3. Supervised machine learning: i. Why predict? ii. Thinking like a data scientist; iii. Study design: defining error, internal validation strategies, picking features and choosing the model, estimating generalization error.

68 Internal Validation Strategy Define the error rate → choose an internal validation strategy → pick features → choose the model → apply to the test set and refine → apply to the validation set once.

69 In-sample error vs. out-of-sample error. In-sample error: the error rate you get on the same data set you used to build your predictor (training error). Out-of-sample error: the error rate you get on a new data set (generalization error). Generalization error is what we care about; a low in-sample error often just reflects overfitting to the sample you trained your model on.
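A minimal illustration (not from the slides) of the gap between in-sample and out-of-sample error, reusing the caret/knn3 iris example from earlier. createDataPartition is a real caret helper; the 60/40 split proportion is an arbitrary illustrative choice.

library(caret)
data(iris)
set.seed(1)
in.train  <- createDataPartition(iris$Species, p = 0.6, list = FALSE)
train.set <- iris[in.train, ]
test.set  <- iris[-in.train, ]
model <- knn3(train.set[, 1:4], train.set$Species, k = 1)
in.sample  <- mean(predict(model, as.matrix(train.set[, 1:4]), type = "class") == train.set$Species)
out.sample <- mean(predict(model, as.matrix(test.set[, 1:4]),  type = "class") == test.set$Species)
c(in.sample = in.sample, out.of.sample = out.sample)   # training accuracy is optimistic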

70 Internal Validation: The goal is to obtain a good generalization error, not a good training error. A model with good training error but poor generalization error is suffering from 'over-fitting'. Completely independent datasets are not easy to come by, so an internal validation strategy can help approximate the generalization error.

71 Internal Validation Strategies: Split-sample: training set / test set / validation set. Cross-validation: alternating training and validation. Bootstrap validation: a bootstrap sample for model development (n patients drawn with replacement) and the original sample for validation.
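A sketch of what these strategies might look like with caret's trainControl()/train(); the fold count, bootstrap replicate count, and choice of KNN are illustrative assumptions, not prescribed by the slides.

library(caret)
data(iris)
cv.ctrl   <- trainControl(method = "cv",   number = 10)   # 10-fold cross-validation
boot.ctrl <- trainControl(method = "boot", number = 25)   # bootstrap validation instead
fit <- train(iris[, 1:4], iris$Species, method = "knn", trControl = cv.ctrl)
fit$results                                                # resampled accuracy estimates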

72 Understanding model error: Imagine you could repeat the whole model building process: gather new data, run a new analysis, create a new model. Due to randomness in the underlying data sets, the resulting models will have a range of predictions.

73 Overview 1. Machine learning: a brief introduction; 2. Unsupervised machine learning: i. Clustering methods; 3. Supervised machine learning: i. Why predict? ii. Thinking like a data scientist; iii. Study design: defining error, internal validation strategies, picking features and choosing the model, estimating generalization error.

74 Feature selection/extraction strategies Define the error rate → choose an internal validation strategy → pick features → choose the model → apply to the test set and refine → apply to the validation set once.

75 The curse of dimensionality: Example: an RNA-Seq experiment carries information about the expression of thousands of genes and how that expression varies between patients with differing outcomes.

76 The curse of dimensionality: The curse of dimensionality refers to problems that arise when analyzing and organizing data in high-dimensional spaces. As the dimensionality of the dataset increases, the volume of the space increases, the available data become sparse, and the amount of data needed to support a result grows exponentially.

77 The curse of dimensionality: Problems: Redundancy in feature set: Several genes may be indicative of the same underlying change. Many uninformative features: Many genes have nothing to do with outcome.

78 The curse of dimensionality: Unsupervised methods cluster objects with similar properties. With many dimensions, data points appear sparse and dissimilar in many ways, and common data organization strategies can be confused.

79 Features: Good features lead to data compression, retain relevant information, and incorporate expert domain knowledge. Example: summarizing RNA expression as pathways (Reactome).

80 Types of Feature Selection: Filter methods select features first and then use them in the model; they are more computationally efficient than wrapper methods, but the selection criterion is not directly related to model performance, and because features are evaluated separately, important interactions between variables will not be identified. Wrapper methods evaluate multiple models, using procedures that add and/or remove features to find the optimal combination that maximizes model performance.
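A minimal filter-style sketch, not from the slides: each iris feature is scored separately against the class labels with a one-way ANOVA F-statistic, and the top-ranked features would be kept. The scoring rule is an illustrative choice; caret also offers wrapper-style selection (e.g. rfe()), not shown here.

data(iris)
# Filter: score each feature independently against the class labels
f.stat <- sapply(iris[, 1:4],
                 function(f) summary(aov(f ~ iris$Species))[[1]][1, "F value"])
sort(f.stat, decreasing = TRUE)   # keep, say, the two highest-scoring features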

81 Feature Extraction: Feature extraction starts from an initial set of measured data and builds derived values intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps and, in some cases, leading to better human interpretation. Feature extraction is related to dimensionality reduction.

82 Principal component analysis (PCA): a transformation that converts the features into a set of values of linearly uncorrelated variables called principal components. Each component has the highest variance possible under the constraint that it is orthogonal to the preceding components. Usually fewer PCs are kept than there are features. [Figure by Nicoguaro, own work, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=46871195]
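A short sketch of PCA in base R (prcomp), added for illustration; centering and scaling the iris features are assumptions about preprocessing, not choices stated on the slide.

data(iris)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)   # PCA on standardized features
summary(pca)                                               # proportion of variance explained per component
head(pca$x[, 1:2])                                         # first two principal components as new features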

83 Overview 1. Machine learning: a brief introduction; 2. Unsupervised machine learning: i. Clustering methods; 3. Supervised machine learning: i. Why predict? ii. Thinking like a data scientist; iii. Study design: defining error, internal validation strategies, picking features and choosing the model, estimating generalization error.

84 ML models Define the error rate → choose an internal validation strategy → pick features → choose the model → apply to the test set and refine → apply to the validation set once.

85 Good ML models: Choosing the right model is often a tradeoff against accuracy. The best models are interpretable, simple, fast to train and test, scalable to big data, and accurate.

86 Machine learning Algorithms Support Vector Machines (SVMs) Random Forests (RF) Prediction Analysis of Microarrays (PAM) Naïve Bayes Linear Discriminant Analysis (LDA) Decision Trees Classification and Regression Trees (CART) Gaussian Mixtures Boltzmann Learning Neural Networks K-Nearest Neighbours (KNN) Maximum Likelihood Estimation (MLE) Multiple Discriminant Analysis (MDA) Logistic Regression Multivariate Adaptive Regression Splines (MARS) Flexible Discriminant Analysis A full list of models available in caret: https://topepo.github.io/caret/modelList.html

87 Supervised ML Example: K-Nearest Neighbours (KNN). Parameters to tune: K = the number of neighbours that vote. model <- knn3(features, classes, k = 1)
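A hedged sketch of how K might be tuned with caret's train(); the grid of candidate values and the 5-fold cross-validation are arbitrary illustrative choices.

library(caret)
data(iris)
grid <- data.frame(k = seq(1, 15, by = 2))        # candidate neighbourhood sizes
fit <- train(iris[, 1:4], iris$Species, method = "knn",
             tuneGrid = grid,
             trControl = trainControl(method = "cv", number = 5))
fit$bestTune                                      # cross-validated choice of k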

88 Decision Tree: Parameters to tune: minimal size for a split, minimal leaf size, minimal gain, pruning method. train(x, ...) is used to tune parameters for models in caret (more on this later). J48 and rpart are good tree implementations for classification and regression.

89 Random forest: An ensemble method that combines many decision trees. Parameters to tune: number of trees, splitting criteria, voting threshold, and the decision-tree parameters. Image source: http://bigdataexaminer.com/data-science/i-thought-of-sharing-these-7-machine-learning-concepts-with-you/
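A minimal random-forest sketch, assuming the randomForest package (the slides do not name an implementation); the tree count is one of the tunable parameters mentioned above.

library(randomForest)
data(iris)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)   # number of trees is a tunable parameter
rf$confusion                                                # out-of-bag confusion matrix
importance(rf)                                              # per-feature importance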

90 Choosing an ML strategy: use cross-validation over different models, feature selection methods, and parameter settings to select the best candidate model(s).

91 Refining the selected ML strategy (apply to the test set and refine, then apply to the validation set once). If a validation set exists: refine the best candidates; else: test the best model one and only one time.

92 Estimating generalization error (apply to the validation set once). Test the best model once to get an estimate of generalization error.

93 Concluding Summary Define the error rate → choose an internal validation strategy → pick features → choose the model → apply to the test set and refine → apply to the validation set once.

94 Goal: Generalization error Generalization error is a measure of how well predictions made by a model perform on external datasets. Accuracy on the training set is optimistic; the test set provides an estimate of generalization error. But if we use the test set to inform which model we choose, it effectively becomes part of the training data, and generalization error must then be estimated from the training set (e.g. by internal validation).

95 Google flu trends: Problems in data science

96 Arguably not as important as dataset or feature selection

