Presentation on theme: "regression Can BMI explain Y? Can BMI predict Y?" — Presentation transcript:

1 regression Can BMI explain Y? Can BMI predict Y? How does Y vary with BMI? Regression is a general term for modeling the relationship between a label (a.k.a. the dependent variable or target) and one or more features (a.k.a. attributes, independent variables or explanatory variables).

2 linear regression
We need to make assumptions about the model that generated the data: a linear relationship calls for a linear model.

3 linear regression a and b are model parameters
a is the slope of the line (its direction), b is the intercept or bias (its position)

4 linear regression: prediction
BMI(70 kg, 1.80 m) = 70 / 1.80² = 21.6. The model works on scaled data: scaled BMI = -1.08, predicted value = f(-1.08), predicted Y = 125.9
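A minimal Python sketch of this prediction step. The scaler statistics (bmi_mean, bmi_std) and the coefficients a and b are hypothetical values, chosen only so the numbers roughly match the slide (scaled BMI ≈ -1.08, predicted Y ≈ 125.9):

```python
# Hypothetical values: the training-set scaler statistics and the fitted
# coefficients below are assumptions for illustration only.
bmi_mean, bmi_std = 25.0, 3.15   # assumed mean / std of BMI in the train set
a, b = -13.0, 111.9              # assumed fitted slope and intercept

def predict(weight_kg, height_m):
    bmi = weight_kg / height_m ** 2          # BMI(70 kg, 1.80 m) ≈ 21.6
    bmi_scaled = (bmi - bmi_mean) / bmi_std  # standardize with train-set stats
    return a * bmi_scaled + b                # linear model f(x) = a * x + b

print(predict(70, 1.80))  # ≈ 125.9 with the assumed values above
```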

5 linear regression: fitting
What values for a and b fit the data best? How can we evaluate them?

6 linear regression: fitting

7 linear regression [Diagram: Dataset → Learning Algorithm → linear regression model]

8 linear regression: cost

9 linear regression: cost

10 linear regression Fit a linear model to the data set such that the cost function is minimal.
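The usual choice of cost function for linear regression, assumed here, is the (halved) mean squared error over the m training samples:

J(a, b) = \frac{1}{2m} \sum_{i=1}^{m} \bigl( a\,x^{(i)} + b - y^{(i)} \bigr)^{2}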

11 linear regression

12 gradient descent

13 gradient descent correct: update all parameters simultaneously (compute every partial derivative first, then update); incorrect: update one parameter and then compute the next gradient with its already-updated value.
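Assuming the squared-error cost J(a, b) above and a learning rate \alpha, the contrast being drawn is:

\text{correct (simultaneous):}\quad t_a = a - \alpha\,\frac{\partial J(a,b)}{\partial a},\;\; t_b = b - \alpha\,\frac{\partial J(a,b)}{\partial b},\;\; a := t_a,\; b := t_b

\text{incorrect (sequential):}\quad a := a - \alpha\,\frac{\partial J(a,b)}{\partial a},\;\; b := b - \alpha\,\frac{\partial J(a,b)}{\partial b}\ \ \text{(this second gradient is evaluated at the already-updated } a\text{)}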

14 gradient descent

15 linear regression
model: f(x) = a·x + b; cost function: J(a, b); goal: minimize J(a, b).
Learning: 1. start with some (a, b); 2. change (a, b) to reduce J(a, b); 3. repeat 2. until convergence.
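A minimal Python sketch of this learning loop for the one-feature linear model, assuming the squared-error cost above; the toy data, learning rate and iteration count are arbitrary choices for illustration:

```python
import numpy as np

def gradient_descent(x, y, lr=0.1, n_iter=1000):
    """Batch gradient descent for f(x) = a*x + b with squared-error cost."""
    a, b = 0.0, 0.0                      # 1. start with some parameters
    for _ in range(n_iter):              # 3. repeat until convergence
        residual = a * x + b - y         # f(x) - y for every sample
        grad_a = (residual * x).mean()   # dJ/da
        grad_b = residual.mean()         # dJ/db
        a, b = a - lr * grad_a, b - lr * grad_b  # 2. simultaneous update
    return a, b

# Toy data: y ≈ 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2 * x + 1 + 0.1 * rng.normal(size=100)
print(gradient_descent(x, y))            # roughly (2.0, 1.0)
```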

16 non-linear regression
Does y vary linearly with X? Can we fit a non-linear model? Yes. Or we could add polynomial transformations of the features.

17 non-linear regression
Does y vary linearly with X? Can we fit a non-linear model? Yes. Or we could add polynomial transformations of the features.

18 non-linear regression

19 non-linear regression
dose-response relationship: a sigmoid function (a.k.a. a logistic function). θ₁ is the slope at the steepest part of the curve; θ₀ is the dosage at which 50% of the subjects are expected to show the desired response.
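One common parameterization consistent with this description (an assumption about the exact form used on the slide) is

f(x) = \frac{1}{1 + e^{-\theta_1 (x - \theta_0)}}

where the predicted response is 50% at x = \theta_0 and the steepness at that midpoint grows with \theta_1.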

20 model assumptions logistic regression for classification
We need to make assumptions about the model that generated the data: a linear relationship calls for a linear model.

21 model assumptions logistic regression for classification
We need to make assumptions about the model that generated the data: linearly separable classes call for a logistic model.

22 logistic regression: logistic model
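The standard logistic model, assumed here, is

h_\theta(x) = \frac{1}{1 + e^{-\theta^{\top} x}}

interpreted as the probability P(y = 1 \mid x).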

23 logistic regression: cost function

24 logistic regression: cost function

25 logistic regression: cost function
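The standard cross-entropy cost for logistic regression, assumed to be the one behind slides 23-25, is

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Bigl[ y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr]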

26 logistic regression Fit a logistic model to the data set such that the cost function is minimal, using gradient descent.

27 non-linear logistic regression

28 multiclass classification

29 one against all
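A minimal Python sketch of the one-against-all idea, using scikit-learn's LogisticRegression as the binary learner; the data set X, y is a placeholder:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_all_fit(X, y):
    """Train one binary logistic model per class (one-against-all)."""
    models = {}
    for c in np.unique(y):
        models[c] = LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
    return models

def one_vs_all_predict(models, X):
    # Pick the class whose binary model assigns the highest P(y = 1 | x).
    probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models.values()])
    classes = np.array(list(models.keys()))
    return classes[np.argmax(probs, axis=1)]
```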

30 one against one

31 model selection and validation
The goal is to perform well on unseen external data (data not seen during training). For instance, when we augment the features in a data set with polynomial features of a certain degree d, we need to set d such that the trained model performs best on unseen external data.

32 train-test split

33 model selection and validation
d > 8: overfitting; d = 1: underfitting. How can we estimate the generalization performance? Use a validation set.

34 k-fold cross-validation (CV)
A single split leaves fewer data points for training, and performance results can depend strongly on a particular random choice of the data set splits. k-fold CV: partition the data set into k smaller sets (folds). For each fold: train on the remaining k - 1 folds, evaluate on the held-out fold, and average the k results (see the sketch below).
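A minimal scikit-learn sketch of k-fold CV; the data set and classifier are placeholders:

```python
from sklearn.datasets import load_iris               # placeholder data set
from sklearn.linear_model import LogisticRegression  # placeholder model
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, evaluate on the held-out fold, repeat, average.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```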

35 3-fold cross-validation (CV)

36 3-fold cross-validation (CV)

37 regularization
[Figure: noisy data, label vs. feature; feature relevance]
With relatively small train sets: don't try to fit the data perfectly, and don't try to use all features. Notice how R² on the train set increases with d.

38 regularization
[Figure: noisy data, label vs. feature; feature relevance]
With relatively small train sets: don't try to fit the data perfectly, and don't try to use all features. Notice how the accuracy on the train set increases with d: d = 1 gives accuracy = 84%, d = 7 gives accuracy = 97%.

39 regularized linear regression
cost function vs. regularized cost function
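A common form of the regularized cost, assuming an L2 (ridge) penalty of strength \lambda on the n coefficients (the intercept is usually left unpenalized), is

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)^{2} + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^{2}

The regularized logistic cost on the next slide adds the same penalty term to its J(\theta).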

40 regularized logistic regression
cost function vs. regularized cost function

41 regularized linear regression

42 regularized linear regression

43 regularized linear regression
high complexity low complexity

44 regularized logistic regression

45 support vector machines
logistic regression:

46 support vector machines
logistic regression:

47 support vector machines
replace the cost function by a piecewise linear function: if y = 1 then the contribution to the cost is cost₁(θᵀx).

48 support vector machines
replace the cost function by a piecewise linear function: if y = 1 then the contribution to the cost is cost₁(θᵀx); if y = 0 then the contribution to the cost is cost₀(θᵀx) (see the sketch below).
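The two piecewise linear costs are the standard hinge-style costs, assumed here with z = \theta^{\top} x:

\mathrm{cost}_1(z) = \max(0,\, 1 - z) \quad \text{(used when } y = 1\text{)}, \qquad \mathrm{cost}_0(z) = \max(0,\, 1 + z) \quad \text{(used when } y = 0\text{)}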

49 support vector machines
Fit a linear model to the data set such that the piecewise linear cost, together with the regularization term, is minimized.

50 support vector machines
In this case the contribution to the cost needs to be small when the model predicts high values (> 0) and large when the model predicts low values (< 0). For SVMs we see that the contribution to the cost decreases linearly and becomes zero once θᵀx ≥ 1.

51 support vector machines
In this case the contribution to the cost needs to be large when the model predicts high values (> 0) and small when the model predicts low values (< 0). For SVMs we see that the contribution to the cost is zero for θᵀx ≤ -1 and then increases linearly.

52 support vector machines
Consider a train set with two classes that are perfectly linearly separable. The piecewise cost function can then be made zero, and the SVM objective can be written as minimizing ½ Σⱼ θⱼ² subject to θᵀx⁽ⁱ⁾ ≥ 1 when y⁽ⁱ⁾ = 1 and θᵀx⁽ⁱ⁾ ≤ -1 when y⁽ⁱ⁾ = 0.

53 support vector machines
Consider a train set with two classes that are perfectly linearly separable: there are many possible decision boundaries.

54 support vector machines
Consider a train set with two classes that are perfectly linearly separable: there are many possible decision boundaries, and the SVM picks the one that maximizes the margin between the classes.

55 support vector machines
red points (y = 1) satisfy θᵀx ≥ 1, blue points (y = 0) satisfy θᵀx ≤ -1; the distance between the two dashed lines is 2 / ‖θ‖. The SVM objective was to minimize ½ Σⱼ θⱼ² subject to these constraints.

56 support vector machines
red points (y = 1) satisfy θᵀx ≥ 1, blue points (y = 0) satisfy θᵀx ≤ -1; the distance between the two dashed lines, 2 / ‖θ‖, is the margin.

57 support vector machines
When the classes are not perfectly linearly separable there are no model parameters that satisfy all of the margin constraints.

58 support vector machines
For misclassified red points (y=1) the contribution to the cost increases linearly with the distance from the upper dashed line.

59 support vector machines
For misclassified blue points (y=0) the contribution to the cost increases linearly with the distance from the lower dashed line.

60 support vector machines

61 kernel support vector machines

62 kernel support vector machines
SVMs can also be formulated as a linear function of the samples (the dual form) instead of the features, as f(x) = Σᵢ αᵢ ⟨x⁽ⁱ⁾, x⟩ + b. This can be reformulated as a non-linear function using what is known as a kernel function, to become f(x) = Σᵢ αᵢ K(x⁽ⁱ⁾, x) + b. The data points for which αᵢ ≠ 0 are called the support vectors.
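A minimal scikit-learn sketch of a kernel SVM; the data set is a placeholder and the RBF kernel and hyperparameters are arbitrary choices for illustration:

```python
from sklearn.datasets import make_moons  # placeholder non-linear data
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# RBF kernel: K(x, x') = exp(-gamma * ||x - x'||^2)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print(clf.n_support_)        # number of support vectors per class
print(clf.support_vectors_)  # training points with non-zero dual coefficients
```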

63 kernel support vector machines

64 kernel support vector machines

65 splice site prediction
CGTGTTGTCGCAACATCGTGCGTGACGGACTTGCGTAGCCTCCGACGTGTCAACGCGTACCACGTGCGTGGT Degroeve, S. et al. (2005) Predicting splice sites from high-dimensional local context representations. Bioinformatics, 21, 1332–1340

66 splice site prediction
ACTTCGGTAGCCTCC

67 splice site prediction
ACTTCGGTAGCCTCC

68 splice site prediction
ACTTCGGTAGCCTCC

69 Sequence alignment kernel for recognition of promoter regions. Leo Gordon, Alexey Ya. Chervonenkis, Alex J. Gammerman, Ilham A. Shahmuradov and Victor V. Solovyev

70 random forests

71 random forests
advantages: ease of interpretation; handles continuous and discrete features; invariant to monotone transformations of features; automated variable selection; robust; scalable. disadvantages: unstable; high variance; overfitting. How to reduce model variance?

72 random forests: bagging, out-of-bag error
We can train T (a hyperparameter) different trees on random bootstrap subsets of the data (sampled with replacement, covering roughly 2/3 of the distinct samples) and then average their predictions as f(x) = (1/T) Σₜ fₜ(x), where fₜ is the t-th decision tree. For each of the T trees we can compute its performance on the data points not used for its training (about 1/3) and average this performance over the T trees; this is called the out-of-bag (oob) error. Random Forests bag both the samples and the features when training each decision tree in the forest, and the number of features to sample is considered an important hyperparameter.
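A minimal scikit-learn sketch of bagging with an out-of-bag estimate; the data set is a placeholder, and note that scikit-learn bootstraps n samples with replacement (roughly 63% unique points per tree) and subsamples features per split rather than per tree:

```python
from sklearn.datasets import load_breast_cancer      # placeholder data set
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=500,     # T: the number of trees
    max_features="sqrt",  # number of features sampled at each split
    oob_score=True,       # score each tree on its out-of-bag samples
    random_state=0,
).fit(X, y)

print(forest.oob_score_)  # out-of-bag estimate of generalization accuracy
```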

73 k-means clustering
supervised learning vs. unsupervised learning: unsupervised methods look for hidden structure (groups) in the data.

74 k-means clustering

75 k-means clustering

76 k-means++ clustering

77 k-means++ clustering
The first centroid is chosen uniformly at random from the data points being clustered; each subsequent centroid is chosen from the remaining data points with probability proportional to its squared distance from the point's closest existing centroid. This gives a better spread of the initial centroids (see the sketch below).
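A minimal scikit-learn sketch; the blob data is a placeholder and init="k-means++" applies the seeding strategy described above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # placeholder data

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# init="k-means++" seeds the centroids with the spread-out strategy above.
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)  # the k centroids
print(km.labels_[:10])      # cluster assignment of the first 10 points
```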

78 k-means++ clustering

79 k-means++ clustering: finding k
Cohesion: measures how closely related the objects in a cluster are. Separation: measures how well separated a cluster is from the other clusters.

80 k-means++ clustering: finding k
Cohesion a(x): the mean distance between the data point x and all other points in the same cluster. Separation b(x): the mean distance between the data point x and all other points in the next nearest cluster.

81 k-means++ clustering: finding k
Silhouette score s(x) = (b(x) - a(x)) / max(a(x), b(x)); silhouette coefficient SC = the mean of s(x) over all data points (see the sketch below).
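A minimal scikit-learn sketch for choosing k with the silhouette coefficient; the blob data is a placeholder:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs        # placeholder data
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Pick the k with the highest mean silhouette score (silhouette coefficient).
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
```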

82 Image segmentation: Each image is represented in the RGB color space.
An image pixel is represented as a 3D color vector. Pixels are clustered to find the segments.

83 hierarchical clustering
agglomerative clustering: no need to set k in advance; start from singleton clusters; clusters are iteratively merged until one single cluster remains; the result is a cluster tree or dendrogram; works on a distance matrix.

84 hierarchical clustering
1. represent each data point as a singleton cluster; 2. merge the two closest clusters; 3. repeat step 2. until one single cluster remains.

85 “Visualization of gene expression profiles. Expression of 320 transcripts from S. cerevisiae, collected over 18 time points throughout the cell cycle. Colors indicate cluster membership based on a k-means clustering (k = 4).” Gehlenborg, N. et al. (2010) Visualization of omics data for systems biology. Nat Methods 7: S56–68

86 principal components analysis
unsupervised dimensionality reduction; feature extraction: the principal components are orthogonal directions of largest variance in the centered data.

87 principal components analysis
eigenvalues and explained variance

88 principal components analysis
Eigenvalue 1 = 38.81; Eigenvalue 2 = 3.48. Explained variance by principal component 1 = 91.78%; explained variance by principal component 2 = 8.2%.
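A minimal scikit-learn sketch showing how such eigenvalues and explained-variance fractions are obtained; the data set is a placeholder, not the one behind the numbers above:

```python
from sklearn.datasets import load_iris       # placeholder data set
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_centered = StandardScaler(with_std=False).fit_transform(X)  # center the data

pca = PCA(n_components=2).fit(X_centered)
print(pca.explained_variance_)        # eigenvalues of the covariance matrix
print(pca.explained_variance_ratio_)  # fraction of variance per component

X_projected = pca.transform(X_centered)  # data expressed in the first 2 PCs
```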

89 principal components analysis
preserves 91.78% of the original variance

90 multidimensional scaling
unsupervised dimensionality reduction; a data transformation that starts from a distance matrix. The goal of MDS is to find low-dimensional vectors z⁽ⁱ⁾ such that ‖z⁽ⁱ⁾ - z⁽ʲ⁾‖ ≈ d_ij, the given pairwise distance between samples i and j.

91 multidimensional scaling

92 “The data set contains 3,192 individuals who were genotyped at 500,568 SNPs using the Affymetrix 500K SNP chip.” Novembre et al. (2008) “Genes mirror geography within Europe” Nature 456,

93 feature selection The features in a data set can be
relevant: these are required by the model to obtain optimal generalization performance; irrelevant: these provide no useful information for reducing the generalization error; redundant: these provide no more information than other features in the data set (correlated). Feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting the subset of relevant features for use in model construction.

94 feature selection
Benefits: improved model interpretability; shorter training times; enhanced generalization by reducing overfitting. Note that feature selection ≠ feature extraction.

95 feature selection: subset scoring
define a score that estimates the relevance of a feature subset for the task and then find the subset that optimizes it; with many features this requires a greedy search. sequential forward selection: start from the empty set and repeatedly add the feature that improves the score the most.

96 feature selection: subset scoring
define a score that estimates the relevance of a feature subset for the task and then find the subset that optimizes it; with many features this requires a greedy search. sequential forward selection.

97 feature selection: subset scoring
define a score that estimates the relevance of a feature subset for the task and then find the subset that optimizes it; with many features this requires a greedy search. sequential backward selection: start from the full feature set and repeatedly remove the feature whose removal hurts the score the least.

98 feature selection define a score that estimates the relevance of a feature subset for the task and then find the subset that optimizes it; with many features this requires a greedy search (sequential selection). Approaches: filter, wrapper and embedded methods.

99 feature selection: filters
relevance is computed directly from the data set, typically independently of the other features, e.g. the Pearson product-moment correlation coefficient or information gain; this yields a feature ranking rather than an explicit best feature subset; use cross-validation to find a good ranking threshold. Example: the ANOVA F-test (Student's t-test for two classes).

100 feature selection: filters
ANOVA F-test (Student's t-test for two classes): not computationally intensive and not tuned to a specific type of predictive model.

101 feature selection: wrappers
use a predictive model to score candidate feature subsets on a validation set / with cross-validation, or use the model parameters to estimate feature relevance. recursive feature elimination (RFE): repeatedly train the model, drop the least relevant feature(s), and retrain (see the sketch below).
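A minimal scikit-learn sketch of RFE wrapped around a linear SVM; the data set and the number of features to keep are placeholders:

```python
from sklearn.datasets import load_breast_cancer  # placeholder data set
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# RFE: repeatedly fit the model, drop the feature with the smallest weight,
# and refit until the requested number of features remains.
selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=10, step=1)
selector.fit(X, y)

print(selector.support_)  # boolean mask of the selected features
print(selector.ranking_)  # 1 = selected, larger = eliminated earlier
```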

102 feature selection: wrappers

103 feature selection: ANOVA

104 feature selection: embedded
feature selection as part of the model construction process, e.g. the relative rank (i.e. depth) of a feature used as a decision node in a decision tree, or the expected fraction of the samples a feature contributes to; in the Random Forests algorithm these expected activity rates are averaged over the trees to rank the features.

105 70-gene “Amsterdam” signature (MammaPrint™, Agendia)
76-gene “Rotterdam” signature (Veridex); 21-gene assay (Oncotype DX™); 97-gene “genomic grade” (MapQuant Dx™); and others…

