Machine Learning 101
Overview
1. Machine learning: a brief introduction
2. Unsupervised machine learning
   i. Clustering methods
3. Supervised machine learning
   i. Why predict?
   ii. Thinking like a data scientist
   iii. Study design: defining error, choosing an internal validation strategy, picking features and choosing the model, estimating generalization error
Types of Machine Learning
Descriptive (Unsupervised): the goal is to find 'interesting' patterns in unlabeled input data.
Predictive (Supervised): a sampling strategy splits the dataset into a training set, which is used to fit a model that yields a prediction function.
Classification: the same pipeline (dataset, sampling strategy, training set, model, prediction function) applied to predicting a categorical outcome.
Regression: the same pipeline applied to predicting a continuous outcome.
Overview
1. Machine learning: a brief introduction
2. Unsupervised machine learning
   i. Why cluster?
   ii. Hierarchical clustering
   iii. Partitioning methods
3. Supervised machine learning
Unsupervised methods for discovering biomarkers. Example in breast cancer: good predictors can aid medical decision making, e.g. a prognostic gene expression signature that can be measured in women who have breast cancer and used to predict survival.
Intrinsic subtypes of breast cancer (Sorlie et al. 2001; Parker et al. 2009).
Introduction to unsupervised ML (clustering). Clustering is an example of unsupervised learning, is useful for the analysis of patterns in data, and can lead to class discovery. Clustering is the partitioning of a data set into groups of elements that are more similar to each other than to elements in other groups. Clustering is a completely general method that can be applied to genes, samples, or both.
How is clustering done (simple)? [Figure: genes plotted against two stimuli, with a cluster, outliers, intra-cluster distance and inter-cluster distance marked.] Clustering aims to MINIMIZE intra-cluster distance and MAXIMIZE inter-cluster distance.
Overview
1. Machine learning: a brief introduction
2. Unsupervised machine learning
   i. Why cluster?
   ii. Hierarchical clustering
   iii. Partitioning methods
3. Supervised machine learning
Hierarchical clustering. Given N items and a distance metric:
1. Assign each item to its own "cluster". Initialize the distance matrix between clusters as the distance between items.
2. Find the closest pair of clusters and merge them into a single cluster.
3. Compute new distances between clusters.
4. Repeat 2-3 until all clusters have been merged into a single cluster.
Hierarchical clustering. "Given N items and a distance metric..." What is a metric? A metric d has to fulfill three conditions: "identity" (d(x, y) = 0 exactly when x = y), "symmetry" (d(x, y) = d(y, x)), and the "triangle inequality" (d(x, z) <= d(x, y) + d(y, z)).
Distance metrics. Common metrics include: Manhattan distance, d(x, y) = sum_i |x_i - y_i|; Euclidean distance, d(x, y) = sqrt(sum_i (x_i - y_i)^2); and 1 - correlation, d(x, y) = 1 - r(x, y) (closely related to the Euclidean distance between standardized profiles, and invariant to the range of measurement from one sample to the next). [Figure: colour scale from similar to dissimilar.]
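A minimal sketch of computing these three distances in R on a toy matrix (dist() covers Euclidean and Manhattan; 1 - correlation is built by hand):
# toy matrix: 4 objects (rows) measured under 10 conditions (columns)
x <- matrix(rnorm(40), nrow = 4)
d.euc <- dist(x, method = "euclidean")   # straight-line distance between rows
d.man <- dist(x, method = "manhattan")   # sum of absolute differences
d.cor <- as.dist(1 - cor(t(x)))          # 1 - Pearson correlation between rows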
Distance metrics compared. [Figure: the same data clustered under Euclidean, Manhattan and 1-Correlation distances.] Distance matters!
Other distance metrics. Hamming distance for ordinal, binary or categorical data: the number of positions at which two vectors differ.
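A minimal sketch: a Hamming distance can be computed as the number of mismatching positions between two equal-length vectors:
# count positions where the two vectors disagree
hamming <- function(a, b) sum(a != b)
hamming(c("A", "B", "A", "C"), c("A", "B", "C", "C"))   # 1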
Agglomerative hierarchical clustering
Anatomy of a clustergram: a heatmap annotated with dendrograms.
Hierarchical clustering. Anatomy of hierarchical clustering. Inputs: a distance matrix and a linkage method. Output: a dendrogram, a tree that defines the relationships between objects and the distance between clusters, i.e. a nested sequence of clusters.
Linkage methods: single (minimum pairwise distance between clusters), complete (maximum pairwise distance), average (mean pairwise distance), and distance between centroids.
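A minimal sketch of choosing the linkage via the method argument of hclust(), using a toy distance matrix:
d <- dist(matrix(rnorm(40), nrow = 10))   # toy distance matrix for 10 objects
hc.s <- hclust(d, method = "single")      # minimum pairwise distance
hc.c <- hclust(d, method = "complete")    # maximum pairwise distance
hc.a <- hclust(d, method = "average")     # mean pairwise distance
hc.m <- hclust(d, method = "centroid")    # distance between cluster centroids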
Example: cell cycle data
# first 50 genes, expression columns 3:19 (time points)
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])
# Euclidean distance matrix between genes
D.cho <- dist(cho.data, method = "euclidean")
# agglomerative clustering with single linkage
hc.single <- hclust(D.cho, method = "single", members = NULL)
Example: cell cycle data. Single linkage dendrogram: plot(hc.single)
Example: cell cycle data. Be careful with the interpretation of dendrograms: adjacency of leaves suggests a proximity between elements that does not necessarily correlate with the distance between those elements! Compare genes #1 and #47.
Single linkage, k=2 rect.hclust(hc.single,k=2) Example: cell cycle data
Single linkage, k=3 rect.hclust(hc.single,k=3) Example: cell cycle data
Single linkage, k=4 rect.hclust(hc.single,k=4) Example: cell cycle data
Single linkage, k=5 rect.hclust(hc.single,k=5) Example: cell cycle data
Single linkage, k=25 rect.hclust(hc.single,k=25) Example: cell cycle data
Properties of cluster members, single linkage, k=4. Example: cell cycle data
# cut the single-linkage dendrogram into 4 clusters
class.single <- cutree(hc.single, k = 4)
# plot the expression profiles of each cluster in a 2x2 grid
par(mfrow = c(2, 2))
matplot(t(cho.data[class.single == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
# as.matrix() handles the case where a cluster contains a single gene
matplot(as.matrix(cho.data[class.single == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 4, ]), type = "l", xlab = "time", ylab = "log expression value")
Single linkage, k=4. [Figure: expression profiles of the four clusters.] Example: cell cycle data.
Complete linkage, k=4 vs. single linkage, k=4. [Figure: cluster expression profiles under each linkage.] Example: cell cycle data.
Hierarchical clustering analyzed.
Advantages: there may be small clusters nested inside large ones; no need to specify the number of groups ahead of time; flexible linkage methods.
Disadvantages: clusters might not be naturally represented by a hierarchical structure; it is necessary to 'cut' the dendrogram in order to produce clusters; bottom-up clustering can result in poor structure at the top of the tree, since early joins cannot be 'undone'.
Overview
1. Machine learning: a brief introduction
2. Unsupervised machine learning
   i. Why cluster?
   ii. Hierarchical clustering
   iii. Partitioning methods
3. Supervised machine learning
Partitioning methods. Anatomy of a partitioning-based method. Inputs: a data matrix, a distance function, and the number of groups. Output: a group assignment for every object.
Partitioning-based methods:
1. Choose K groups and initialise the group centers (a.k.a. centroids or medoids).
2. Assign each object to the nearest centroid according to the distance metric.
3. Reassign (or recompute) the centroids.
4. Repeat steps 2-3 until the assignment stabilizes.
K-means vs. K-medoids.
K-means: centroids are the 'mean' of the clusters; centroids need to be recomputed at every iteration; initialisation is difficult, as the notion of a centroid may be unclear before beginning. R function: kmeans.
K-medoids: centroids are actual objects that minimize the total within-cluster distance; a centroid can be determined from a quick look-up in the distance matrix; initialisation is simply K randomly selected objects. R function: pam.
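A minimal sketch contrasting the two calls, assuming the 'cluster' package for pam():
library(cluster)
# two simulated 2D clusters
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
km <- kmeans(x, centers = 2)   # centroids are means, not necessarily data points
pm <- pam(x, k = 2)            # medoids are actual observations
km$centers                     # coordinates of the two cluster means
pm$medoids                     # rows of x chosen as medoids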
Partitioning-based methods.
Advantages: the number of groups is well defined; a clear, deterministic assignment of every object to a group; simple algorithms for inference.
Disadvantages: you have to choose the number of groups; sometimes objects do not fit well in any cluster; they can converge on locally optimal solutions and often require multiple restarts with random initializations.
K-means. With N items and K assumed clusters, the goal is to minimize the total within-cluster sum of squares, sum over k = 1..K of sum over items i in cluster k of ||x_i - m_k||^2, over the possible assignments and centroids, where m_k represents the location of cluster k.
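A minimal sketch checking this objective against the tot.withinss value reported by kmeans(), on simulated data:
set.seed(1)
x  <- matrix(rnorm(100), ncol = 2)     # 50 points in 2D
km <- kmeans(x, centers = 3)
# recompute the objective by hand from the assignments and centroids
obj <- sum(sapply(1:3, function(k) {
  xk <- x[km$cluster == k, , drop = FALSE]
  sum(sweep(xk, 2, km$centers[k, ])^2)  # squared distances to centroid k
}))
all.equal(obj, km$tot.withinss)         # TRUE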
K-means:
1. Divide the data into K clusters and initialize the centroids with the mean of each cluster.
2. Assign each item to the cluster with the closest centroid.
3. When all objects have been assigned, recalculate the centroids (means).
4. Repeat steps 2-3 until the centroids no longer move.
K-means
# simulate two Gaussian clusters in 2D
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
# run k-means from the same random start, stopping after 1, 2 and 3 iterations
# to watch the centroids move
for (iters in 1:3) {
  set.seed(100)
  cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = iters)
  plot(x, col = cl$cluster)
  points(cl$centers, col = 1:5, pch = 8, cex = 2)
}
K-means, k=4
# k-means with 4 clusters on the cell cycle data
set.seed(100)
km.cho <- kmeans(cho.data, 4)
# plot the expression profiles of each cluster in a 2x2 grid
par(mfrow = c(2, 2))
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l", xlab = "time", ylab = "log expression value")
K-means, k=4 vs. single linkage, k=4. [Figure: cluster expression profiles under each method.]
Summary: K-means and hierarchical clustering are simple, fast and useful techniques. Beware of the memory requirements of hierarchical clustering. Both are a bit 'ad hoc': how many clusters? which distance metric? what counts as a good clustering?
Overview
1. Machine learning: a brief introduction
2. Unsupervised machine learning
   i. Clustering methods
3. Supervised machine learning
   i. Why predict?
   ii. Thinking like a data scientist
   iii. Study design: defining error, splitting data, picking features and choosing the model, estimating generalization error
Why predict? Oncotype DX, a 21-gene signature for breast cancer (Paik et al. 2004).
Overview
1. Machine learning: a brief introduction
2. Unsupervised machine learning
   i. Clustering methods
3. Supervised machine learning
   i. Why predict?
   ii. Thinking like a data scientist
   iii. Study design: defining error, splitting data, picking features and choosing the model, estimating generalization error
Thinking like a data scientist: Question, Data, Features, Algorithm, Parameter choice, Evaluation. Bringing a data scientist in early in the design of a project that will make use of machine learning is essential; they are trained to formalize the issues above.
The Question. John Wilder Tukey: "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise."
The Question. It must be specific and well defined: What are you trying to predict? Why is this an important question? How will the end user benefit from the answer? This is the most important part of any ML project.
The dataset: posing the question. How will predictions be made? With publicly available data, what can and can't be answered? Is an experiment needed, and what will it cost to get the data? Very important: garbage in, garbage out.
The features: making the data useful. A feature is a measurable property of a phenomenon being observed; features can be numeric, strings, or more complicated structures such as graphs. ML algorithms need good features: informative, discriminating and independent.
Learning a model. [Figure: a sampling strategy draws a training set (features 1 and 2) from the dataset; the model fit to it yields a prediction function.]
Model parameterization. The prediction function has two types of parameters: model parameters, which are tuned on the training set, and hyperparameters, which are tuned during evaluation (next step).
Algorithm evaluation. Assessing model performance: in-sample error (training set) vs. out-of-sample error (test set). Overfitting: in-sample error << out-of-sample error. Tools: cross-validation, ROC curves.
Overview
1. Machine learning: a brief introduction
2. Unsupervised machine learning
   i. Clustering methods
3. Supervised machine learning
   i. Why predict?
   ii. Thinking like a data scientist
   iii. Study design: defining error, internal validation strategies, picking features and choosing the model, estimating generalization error
Study design:
1. Define the error rate.
2. Choose an internal validation strategy.
3. Pick features.
4. Choose the model.
5. Apply to the test set and refine.
6. Apply to the validation set once.
An example: the iris dataset, used as a classic classification example throughout computer science. 'iris' has 150 cases (rows) and 4 features (columns); the goal is to predict Species.
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
5.1           3.5          1.4           0.2          setosa
4.9           3.0          1.4           0.2          setosa
4.7           3.2          1.3           0.2          setosa
A practical example:
library(caret)
data(iris)
features <- iris[, 1:4]
classes  <- iris[, 5]
model <- knn3(features, classes, k = 1)                            # for now, just one model
prediction <- predict(model, as.matrix(features), type = 'class')  # predictions of the model
confusionMatrix(prediction, classes)                               # get model performance
A practical example:
> confusionMatrix(prediction, classes)
Overall Statistics:
  Accuracy : 0.9667
Statistics by Class:
                    setosa  versicolor  virginica
Sensitivity         1       0.94        0.96
Specificity         1       0.98        0.97
Pos_Pred_Value      1       0.9592      0.9412
Neg_Pred_Value      1       0.9703      0.9798
Balanced_Accuracy   1       0.96        0.965
ROC curves. Binary classification outcomes are categorical (alive/dead, good prognosis/poor prognosis), but predictions are often quantitative (e.g. the probability of being alive), and a score threshold is used to decide the prediction. The cutoff chosen gives different results.
ROC curves: [Figure.] An ROC curve plots the true positive rate against the false positive rate as the score threshold is varied; the area under the curve (AUC) summarizes performance across all cutoffs.
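A minimal sketch of building an ROC curve, assuming the pROC package and a made-up quantitative score for a binary outcome:
library(pROC)
set.seed(1)
outcome <- factor(rep(c("poor", "good"), each = 50))        # hypothetical binary outcome
score   <- c(rnorm(50, mean = 0.4), rnorm(50, mean = 0.6))  # hypothetical prediction scores
r <- roc(outcome, score)   # sensitivity/specificity at every cutoff
plot(r)
auc(r)                     # area under the curve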
Comparing models with ROC curves. What makes a good model is domain specific: in some disciplines an AUC of 0.7 is very good, in others an AUC of 0.9 is essential. http://www.sprawls.org/ppmi2/IMGCHAR/1IMCHAR12.gif
Overview
1. Machine learning: a brief introduction
2. Unsupervised machine learning
   i. Clustering methods
3. Supervised machine learning
   i. Why predict?
   ii. Thinking like a data scientist
   iii. Study design: defining error, internal validation strategies, picking features and choosing the model, estimating generalization error
Internal validation strategy. [Study design roadmap: define the error rate, choose internal validation strategy, pick features, choose model, apply to test set and refine, apply to the validation set once.]
In-sample error vs. out-of-sample error. In-sample error: the error rate you get on the same data set you used to build your predictor (training error). Out-of-sample error: the error rate you get on a new data set (generalization error). Generalization error is what we care about; optimizing in-sample error alone leads to overfitting to the sample you trained your model on.
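A minimal sketch of the gap between the two errors, assuming a simple random split of the iris data and the knn3 model used earlier:
library(caret)
set.seed(1)
idx   <- createDataPartition(iris$Species, p = 0.6, list = FALSE)  # 60% for training
train <- iris[idx, ]
test  <- iris[-idx, ]
fit <- knn3(train[, 1:4], train$Species, k = 1)
# in-sample accuracy (optimistic) vs. out-of-sample accuracy
mean(predict(fit, as.matrix(train[, 1:4]), type = "class") == train$Species)
mean(predict(fit, as.matrix(test[, 1:4]),  type = "class") == test$Species)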
Internal validation. The goal is to obtain a low generalization error, not just a low training error; a model with good training error but poor generalization error is suffering from 'over-fitting'. Completely independent datasets are not easy to come by, so an internal validation strategy can help approximate the generalization error.
Internal validation strategies:
- Split-sample: training set / test set / validation set.
- Cross-validation: alternating training and validation folds.
- Bootstrap validation: a bootstrap sample (n patients drawn with replacement) for model development, the original sample for validation.
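A minimal sketch of requesting cross-validation or bootstrap resampling through caret's trainControl(), assuming the iris data and a KNN model:
library(caret)
set.seed(1)
cv.fit   <- train(Species ~ ., data = iris, method = "knn",
                  trControl = trainControl(method = "cv", number = 10))   # 10-fold CV
boot.fit <- train(Species ~ ., data = iris, method = "knn",
                  trControl = trainControl(method = "boot", number = 25)) # 25 bootstraps
cv.fit$results    # resampled accuracy for each candidate value of k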
Understanding model error: Imagine you could repeat the whole model building process: gather new data, run a new analysis, create a new model. Due to randomness in the underlying data sets, the resulting models will have a range of predictions.
Overview
1. Machine learning: a brief introduction
2. Unsupervised machine learning
   i. Clustering methods
3. Supervised machine learning
   i. Why predict?
   ii. Thinking like a data scientist
   iii. Study design: defining error, internal validation strategies, picking features and choosing the model, estimating generalization error
Feature selection/extraction strategies. [Study design roadmap: define the error rate, choose internal validation strategy, pick features, choose model, apply to test set and refine, apply to the validation set once.]
The curse of dimensionality. Example: an RNA-Seq experiment has information about gene expression and how it varies between patients with differing outcomes.
The curse of dimensionality refers to problems that arise when analyzing and organizing data in high-dimensional spaces. As the dimensionality of the dataset increases: the volume of the space increases, the available data become sparse, and the amount of data needed to support a result grows exponentially.
The curse of dimensionality. Problems: redundancy in the feature set (several genes may be indicative of the same underlying change) and many uninformative features (many genes have nothing to do with the outcome).
The curse of dimensionality. Unsupervised methods cluster objects with similar properties, but with many dimensions the data points appear sparse and dissimilar in many ways, and common data organization strategies can be confused.
Features. Good features lead to data compression, retain relevant information, and incorporate expert domain knowledge. Example: summarizing RNA expression as pathways (Reactome).
Types of feature selection.
Filter methods select features first and then use them in the model: more computationally efficient than wrapper methods, but the selection criterion is not directly related to model performance, and because features are evaluated separately, important interactions between variables will not be identified.
Wrapper methods evaluate multiple models, using procedures that add and/or remove features to find the optimal combination that maximizes model performance.
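A minimal sketch of a filter method, assuming a univariate ANOVA F statistic as the selection criterion on the iris features (wrapper methods such as caret::rfe instead search feature subsets by repeatedly fitting models):
# rank each feature by how strongly it separates the classes, before any model is fit
f.stat <- sapply(iris[, 1:4], function(f) summary(aov(f ~ iris$Species))[[1]]$`F value`[1])
sort(f.stat, decreasing = TRUE)   # Petal.Length and Petal.Width rank highest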
Feature extraction. Feature extraction starts from an initial set of measured data and builds derived values intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps and, in some cases, leading to better human interpretation. Feature extraction is related to dimensionality reduction.
Principal component analysis (PCA): a transformation that converts the features into a set of values of linearly uncorrelated variables called principal components. Each component has the highest variance possible under the constraint that it is orthogonal to the preceding components. There are usually fewer useful PCs than original features. (Image: Nicoguaro, own work, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=46871195)
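A minimal sketch of PCA on the four iris features with prcomp():
pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pc)          # proportion of variance explained by each component
head(pc$x[, 1:2])    # data projected onto the first two components
plot(pc$x[, 1:2], col = iris$Species)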
Overview
1. Machine learning: a brief introduction
2. Unsupervised machine learning
   i. Clustering methods
3. Supervised machine learning
   i. Why predict?
   ii. Thinking like a data scientist
   iii. Study design: defining error, internal validation strategies, picking features and choosing the model, estimating generalization error
ML models. [Study design roadmap: define the error rate, choose internal validation strategy, pick features, choose model, apply to test set and refine, apply to the validation set once.]
Good ML models. Choosing the right model is often an accuracy tradeoff: the best models are interpretable, simple, fast to train and test, scalable to big data, and accurate.
Machine learning algorithms: Support Vector Machines (SVMs), Random Forests (RF), Prediction Analysis of Microarrays (PAM), Naïve Bayes, Linear Discriminant Analysis (LDA), Decision Trees, Classification and Regression Trees (CART), Gaussian Mixtures, Boltzmann Learning, Neural Networks, K-Nearest Neighbours (KNN), Maximum Likelihood Estimation (MLE), Multiple Discriminant Analysis (MDA), Logistic Regression, Multivariate Adaptive Regression Splines (MARS), Flexible Discriminant Analysis. A full list of models available in caret: https://topepo.github.io/caret/modelList.html
Supervised ML example: K-Nearest Neighbours (KNN). Parameter to tune: k, the number of neighbours that vote. model <- knn3(features, classes, k = 1)
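A minimal sketch of tuning k with caret's train(), assuming cross-validated accuracy as the selection criterion:
library(caret)
set.seed(1)
knn.fit <- train(Species ~ ., data = iris, method = "knn",
                 tuneGrid = data.frame(k = c(1, 3, 5, 7, 9)),
                 trControl = trainControl(method = "cv", number = 5))
knn.fit$bestTune     # value of k with the best resampled accuracy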
Decision tree. Parameters to tune: minimal size for a split, minimal leaf size, minimal gain, and the pruning method. train(x, ...) is used to tune parameters for models in caret (more on this later). J48 and rpart are good tree implementations for classification and regression.
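A minimal sketch with rpart, assuming iris as the example data; the control arguments roughly correspond to the parameters listed above:
library(rpart)
tree <- rpart(Species ~ ., data = iris,
              control = rpart.control(minsplit = 10,   # minimal size for a split
                                      minbucket = 5,   # minimal leaf size
                                      cp = 0.01))      # complexity threshold used for pruning
printcp(tree)            # cross-validated error at each value of cp
plot(tree); text(tree)   # draw the fitted tree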
Random forest: an ensemble method that combines many decision trees. Parameters to tune: the number of trees, the splitting criteria, the voting threshold, and the individual decision tree parameters. Image source: http://bigdataexaminer.com/data-science/i-thought-of-sharing-these-7-machine-learning-concepts-with-you/
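A minimal sketch assuming the randomForest package (caret's method = "rf" wraps the same code); ntree and mtry are the usual knobs:
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
rf                 # out-of-bag confusion matrix and error rate
importance(rf)     # variable importance across the ensemble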
Choosing an ML strategy: use cross-validation to compare different models, feature selection methods and parameter settings, and select the best candidate model(s).
Refining the selected ML strategy. If a validation set exists: apply the best candidates to the test set and refine, then apply to the validation set once. Otherwise: test the best model one, and only one, time.
Estimating generalization error (apply to the validation set once): test the best model once to get an estimate of the generalization error.
Concluding summary: define the error rate, choose an internal validation strategy, pick features, choose the model, apply to the test set and refine, apply to the validation set once.
Goal: generalization error. Generalization error is a measure of how well predictions made by a model perform on external datasets. Accuracy on the training set is optimistic; the test set provides an estimate of generalization error, but if we use the test set to inform which model we choose, it effectively becomes part of the training set, and the final estimate of generalization error must come from data never used in model building.
Google Flu Trends: problems in data science.
The choice of algorithm is arguably not as important as the dataset or feature selection.