Presentation on theme: "regression Can BMI explain Y? Can BMI predict Y?" — Presentation transcript:

1 regression Can BMI explain Y? Can BMI predict Y? How does Y vary with BMI? Regression is a general term for modeling the relationship between a label (a.k.a. the dependent variable or target) and one or more features (a.k.a. attributes, independent variables or explanatory variables).

2 linear regression
We need to make assumptions about the model that generated the data: a linear relationship calls for a linear model.

3 linear regression a and b are model parameters
a is the slope of the line (its direction), b is the intercept or bias (its position)

4 linear regression: prediction
BMI(70 kg, 1.80 m) = 70 / 1.80² = 21.6. The model works on scaled data: scaled BMI = -1.08, predicted value = f(-1.08), predicted Y = 125.9
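A minimal Python sketch of this prediction step. The scaler statistics (bmi_mean, bmi_std) and the coefficients a and b are hypothetical values, chosen only so the numbers roughly match the slide (scaled BMI ≈ -1.08, predicted Y ≈ 125.9):

```python
# Hypothetical values: the training-set scaler statistics and the fitted
# coefficients below are assumptions for illustration only.
bmi_mean, bmi_std = 25.0, 3.15   # assumed mean / std of BMI in the train set
a, b = -13.0, 111.9              # assumed fitted slope and intercept

def predict(weight_kg, height_m):
    bmi = weight_kg / height_m ** 2          # BMI(70 kg, 1.80 m) ≈ 21.6
    bmi_scaled = (bmi - bmi_mean) / bmi_std  # standardize with train-set stats
    return a * bmi_scaled + b                # linear model f(x) = a * x + b

print(predict(70, 1.80))  # ≈ 125.9 with the assumed values above
```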

5 linear regression: fitting
What values for a and b fit the data best? How can we evaluate them?

6 linear regression: fitting

7 linear regression [Diagram: Dataset → Learning Algorithm → linear regression model]

8 linear regression: cost

9 linear regression: cost

10 linear regression Fit a linear model to the data set such that the cost function is minimal.
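The usual choice of cost function for linear regression, assumed here, is the (halved) mean squared error over the m training samples:

J(a, b) = \frac{1}{2m} \sum_{i=1}^{m} \bigl( a\,x^{(i)} + b - y^{(i)} \bigr)^{2}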

11 linear regression

12 gradient descent

13 gradient descent correct: update all parameters simultaneously (compute every partial derivative first, then update); incorrect: update one parameter and then compute the next gradient with its already-updated value.
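Assuming the squared-error cost J(a, b) above and a learning rate \alpha, the contrast being drawn is:

\text{correct (simultaneous):}\quad t_a = a - \alpha\,\frac{\partial J(a,b)}{\partial a},\;\; t_b = b - \alpha\,\frac{\partial J(a,b)}{\partial b},\;\; a := t_a,\; b := t_b

\text{incorrect (sequential):}\quad a := a - \alpha\,\frac{\partial J(a,b)}{\partial a},\;\; b := b - \alpha\,\frac{\partial J(a,b)}{\partial b}\ \ \text{(this second gradient is evaluated at the already-updated } a\text{)}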

14 gradient descent

15 linear regression
model: f(x) = a·x + b; cost function: J(a, b); goal: minimize J(a, b).
Learning: 1. start with some (a, b); 2. change (a, b) to reduce J(a, b); 3. repeat 2. until convergence.
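A minimal Python sketch of this learning loop for the one-feature linear model, assuming the squared-error cost above; the toy data, learning rate and iteration count are arbitrary choices for illustration:

```python
import numpy as np

def gradient_descent(x, y, lr=0.1, n_iter=1000):
    """Batch gradient descent for f(x) = a*x + b with squared-error cost."""
    a, b = 0.0, 0.0                      # 1. start with some parameters
    for _ in range(n_iter):              # 3. repeat until convergence
        residual = a * x + b - y         # f(x) - y for every sample
        grad_a = (residual * x).mean()   # dJ/da
        grad_b = residual.mean()         # dJ/db
        a, b = a - lr * grad_a, b - lr * grad_b  # 2. simultaneous update
    return a, b

# Toy data: y ≈ 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2 * x + 1 + 0.1 * rng.normal(size=100)
print(gradient_descent(x, y))            # roughly (2.0, 1.0)
```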

16 non-linear regression
Does y vary linearly with X? Can we fit a non-linear model? Yes. Or we could add polynomial transformations of the features.

17 non-linear regression
Does y vary linearly with X? Can we fit a non-linear model? Yes. Or we could add polynomial transformations of the features.

18 non-linear regression

19 non-linear regression
dose-response relationship: a sigmoid function (a.k.a. a logistic function). θ₁ is the slope at the steepest part of the curve; θ₀ is the dosage at which 50% of the subjects are expected to show the desired response.
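One common parameterization consistent with this description (an assumption about the exact form used on the slide) is

f(x) = \frac{1}{1 + e^{-\theta_1 (x - \theta_0)}}

where the predicted response is 50% at x = \theta_0 and the steepness at that midpoint grows with \theta_1.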

20 model assumptions logistic regression for classification
We need to make assumptions about the model that generated the data: a linear relationship calls for a linear model.

21 model assumptions logistic regression for classification
We need to make assumptions about the model that generated the data: linearly separable classes call for a logistic model.

22 logistic regression: logistic model
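The standard logistic model, assumed here, is

h_\theta(x) = \frac{1}{1 + e^{-\theta^{\top} x}}

interpreted as the probability P(y = 1 \mid x).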

23 logistic regression: cost function

24 logistic regression: cost function

25 logistic regression: cost function
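The standard cross-entropy cost for logistic regression, assumed to be the one behind slides 23-25, is

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Bigl[ y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr]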

26 logistic regression Fit a logistic model to the data set such that the cost function is minimal, using gradient descent.

27 non-linear logistic regression

28 multiclass classification

29 one against all
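A minimal Python sketch of the one-against-all idea, using scikit-learn's LogisticRegression as the binary learner; the data set X, y is a placeholder:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_all_fit(X, y):
    """Train one binary logistic model per class (one-against-all)."""
    models = {}
    for c in np.unique(y):
        models[c] = LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
    return models

def one_vs_all_predict(models, X):
    # Pick the class whose binary model assigns the highest P(y = 1 | x).
    probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models.values()])
    classes = np.array(list(models.keys()))
    return classes[np.argmax(probs, axis=1)]
```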

30 one against one

31 model selection and validation
The goal is to perform well on unseen external data (data not seen during training). For instance, when we augment the features in a data set with polynomial features of a certain degree d, we need to set d such that the trained model performs best on unseen external data.

32 train-test split

33 model selection and validation
d > 8: overfitting; d = 1: underfitting. How can we estimate the generalization performance? Use a validation set.

34 k-fold cross-validation (CV)
A single split leaves fewer data points for training, and performance results can depend strongly on a particular random choice of the data set splits. k-fold CV: partition the data set into k smaller sets (folds). For each fold: train on the remaining k - 1 folds, evaluate on the held-out fold, and average the k results (see the sketch below).
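A minimal scikit-learn sketch of k-fold CV; the data set and classifier are placeholders:

```python
from sklearn.datasets import load_iris               # placeholder data set
from sklearn.linear_model import LogisticRegression  # placeholder model
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, evaluate on the held-out fold, repeat, average.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```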

35 3-fold cross-validation (CV)

36 3-fold cross-validation (CV)

37 regularization
[Figure: noisy data, label vs. feature; feature relevance]
With relatively small train sets: don't try to fit the data perfectly, and don't try to use all features. Notice how R² on the train set increases with d.

38 regularization
[Figure: noisy data, label vs. feature; feature relevance]
With relatively small train sets: don't try to fit the data perfectly, and don't try to use all features. Notice how the accuracy on the train set increases with d: d = 1 gives accuracy = 84%, d = 7 gives accuracy = 97%.

39 regularized linear regression
cost function vs. regularized cost function
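A common form of the regularized cost, assuming an L2 (ridge) penalty of strength \lambda on the n coefficients (the intercept is usually left unpenalized), is

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)^{2} + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^{2}

The regularized logistic cost on the next slide adds the same penalty term to its J(\theta).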

40 regularized logistic regression
cost function vs. regularized cost function

41 regularized linear regression

42 regularized linear regression

43 regularized linear regression
high complexity low complexity

44 regularized logistic regression

45 support vector machines
logistic regression:

46 support vector machines
logistic regression:

47 support vector machines
replace the cost function by a piecewise linear function: if y = 1 then the contribution to the cost is cost₁(θᵀx).

48 support vector machines
replace the cost function by a piecewise linear function: if y = 1 then the contribution to the cost is cost₁(θᵀx); if y = 0 then the contribution to the cost is cost₀(θᵀx) (see the sketch below).
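The two piecewise linear costs are the standard hinge-style costs, assumed here with z = \theta^{\top} x:

\mathrm{cost}_1(z) = \max(0,\, 1 - z) \quad \text{(used when } y = 1\text{)}, \qquad \mathrm{cost}_0(z) = \max(0,\, 1 + z) \quad \text{(used when } y = 0\text{)}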

49 support vector machines
Fit a linear model to the data set such that the piecewise linear cost, together with the regularization term, is minimized.

50 support vector machines
In this case the contribution to the cost needs to be small when the model predicts high values (> 0) and large when the model predicts low values (< 0). For SVMs we see that the contribution to the cost decreases linearly and becomes zero once θᵀx ≥ 1.

51 support vector machines
In this case the contribution to the cost needs to be large when the model predicts high values (> 0) and small when the model predicts low values (< 0). For SVMs we see that the contribution to the cost is zero for θᵀx ≤ -1 and then increases linearly.

52 support vector machines
Consider a train set with two classes that are perfectly linearly separable. The piecewise cost function can then be made zero, and the SVM objective can be written as minimizing ½ Σⱼ θⱼ² subject to θᵀx⁽ⁱ⁾ ≥ 1 when y⁽ⁱ⁾ = 1 and θᵀx⁽ⁱ⁾ ≤ -1 when y⁽ⁱ⁾ = 0.

53 support vector machines
Consider a train set with two classes that are perfectly linearly separable: there are many possible decision boundaries.

54 support vector machines
Consider a train set with two classes that are perfectly linearly separable: there are many possible decision boundaries, and the SVM picks the one that maximizes the margin between the classes.

55 support vector machines
red points (y = 1) satisfy θᵀx ≥ 1, blue points (y = 0) satisfy θᵀx ≤ -1; the distance between the two dashed lines is 2 / ‖θ‖. The SVM objective was to minimize ½ Σⱼ θⱼ² subject to these constraints.

56 support vector machines
red points (y = 1) satisfy θᵀx ≥ 1, blue points (y = 0) satisfy θᵀx ≤ -1; the distance between the two dashed lines, 2 / ‖θ‖, is the margin.

57 support vector machines
When the classes are not perfectly linearly separable there are no model parameters that satisfy all of the margin constraints.

58 support vector machines
For misclassified red points (y=1) the contribution to the cost increases linearly with the distance from the upper dashed line.

59 support vector machines
For misclassified blue points (y=0) the contribution to the cost increases linearly with the distance from the lower dashed line.

60 support vector machines

61 kernel support vector machines

62 kernel support vector machines
SVMs can also be formulated as a linear function of the samples (the dual form) instead of the features, as f(x) = Σᵢ αᵢ ⟨x⁽ⁱ⁾, x⟩ + b. This can be reformulated as a non-linear function using what is known as a kernel function, to become f(x) = Σᵢ αᵢ K(x⁽ⁱ⁾, x) + b. The data points for which αᵢ ≠ 0 are called the support vectors.
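A minimal scikit-learn sketch of a kernel SVM; the data set is a placeholder and the RBF kernel and hyperparameters are arbitrary choices for illustration:

```python
from sklearn.datasets import make_moons  # placeholder non-linear data
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# RBF kernel: K(x, x') = exp(-gamma * ||x - x'||^2)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print(clf.n_support_)        # number of support vectors per class
print(clf.support_vectors_)  # training points with non-zero dual coefficients
```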

63 kernel support vector machines

64 kernel support vector machines

65 splice site prediction
CGTGTTGTCGCAACATCGTGCGTGACGGACTTGCGTAGCCTCCGACGTGTCAACGCGTACCACGTGCGTGGT Degroeve, S. et al. (2005) Predicting splice sites from high-dimensional local context representations. Bioinformatics, 21, 1332–1340

66 splice site prediction
ACTTCGGTAGCCTCC

67 splice site prediction
ACTTCGGTAGCCTCC

68 splice site prediction
ACTTCGGTAGCCTCC

69 Sequence alignment kernel for recognition of promoter regions. Leo Gordon, Alexey Ya. Chervonenkis, Alex J. Gammerman, Ilham A. Shahmuradov and Victor V. Solovyev

70 random forests

71 random forests
advantages: ease of interpretation; handles continuous and discrete features; invariant to monotone transformations of features; automated variable selection; robust; scalable. disadvantages: unstable; high variance; overfitting. How to reduce model variance?

72 random forests: bagging, out-of-bag error
We can train T (a hyperparameter) different trees on random bootstrap subsets of the data (sampled with replacement, covering roughly 2/3 of the distinct samples) and then average their predictions as f(x) = (1/T) Σₜ fₜ(x), where fₜ is the t-th decision tree. For each of the T trees we can compute its performance on the data points not used for its training (about 1/3) and average this performance over the T trees; this is called the out-of-bag (oob) error. Random Forests bag both the samples and the features when training each decision tree in the forest, and the number of features to sample is considered an important hyperparameter.
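A minimal scikit-learn sketch of bagging with an out-of-bag estimate; the data set is a placeholder, and note that scikit-learn bootstraps n samples with replacement (roughly 63% unique points per tree) and subsamples features per split rather than per tree:

```python
from sklearn.datasets import load_breast_cancer      # placeholder data set
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=500,     # T: the number of trees
    max_features="sqrt",  # number of features sampled at each split
    oob_score=True,       # score each tree on its out-of-bag samples
    random_state=0,
).fit(X, y)

print(forest.oob_score_)  # out-of-bag estimate of generalization accuracy
```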

73 k-means clustering
supervised learning vs. unsupervised learning: unsupervised methods look for hidden structure (groups) in the data.

74 k-means clustering

75 k-means clustering

76 k-means++ clustering

77 k-means++ clustering
The first centroid is chosen uniformly at random from the data points being clustered; each subsequent centroid is chosen from the remaining data points with probability proportional to its squared distance from the point's closest existing centroid. This gives a better spread of the initial centroids (see the sketch below).
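A minimal scikit-learn sketch; the blob data is a placeholder and init="k-means++" applies the seeding strategy described above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # placeholder data

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# init="k-means++" seeds the centroids with the spread-out strategy above.
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)  # the k centroids
print(km.labels_[:10])      # cluster assignment of the first 10 points
```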

78 k-means++ clustering

79 k-means++ clustering: finding k
Cohesion: measures how closely related the objects in a cluster are. Separation: measures how well separated a cluster is from the other clusters.

80 k-means++ clustering: finding k
Cohesion a(x): the mean distance between the data point x and all other points in the same cluster. Separation b(x): the mean distance between the data point x and all other points in the next nearest cluster.

81 k-means++ clustering: finding k
Silhouette score s(x) = (b(x) - a(x)) / max(a(x), b(x)); silhouette coefficient SC = the mean of s(x) over all data points (see the sketch below).
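A minimal scikit-learn sketch for choosing k with the silhouette coefficient; the blob data is a placeholder:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs        # placeholder data
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Pick the k with the highest mean silhouette score (silhouette coefficient).
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
```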

82 Image segmentation: Each image is represented in the RGB color space.
An image pixel is represented as a 3D color vector. Pixels are clustered to find the segments.

83 hierarchical clustering
agglomerative clustering: no need to set k in advance; start from singleton clusters; clusters are iteratively merged until one single cluster remains; the result is a cluster tree or dendrogram; works on a distance matrix.

84 hierarchical clustering
1. represent each data point as a singleton cluster; 2. merge the two closest clusters; 3. repeat step 2. until one single cluster remains.

85 “Visualization of gene expression profiles. Expression of 320 transcripts from S. cerevisiae, collected over 18 time points throughout the cell cycle. Colors indicate cluster membership based on a k-means clustering (k = 4).” Gehlenborg, N. et al. (2010) Visualization of omics data for systems biology. Nat Methods 7: S56–68

86 principal components analysis
unsupervised dimensionality reduction; feature extraction: the principal components are orthogonal directions of largest variance in the centered data.

87 principal components analysis
eigenvalues and explained variance

88 principal components analysis
Eigenvalue 1 = 38.81; Eigenvalue 2 = 3.48. Explained variance by principal component 1 = 91.78%; explained variance by principal component 2 = 8.2%.
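A minimal scikit-learn sketch showing how such eigenvalues and explained-variance fractions are obtained; the data set is a placeholder, not the one behind the numbers above:

```python
from sklearn.datasets import load_iris       # placeholder data set
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_centered = StandardScaler(with_std=False).fit_transform(X)  # center the data

pca = PCA(n_components=2).fit(X_centered)
print(pca.explained_variance_)        # eigenvalues of the covariance matrix
print(pca.explained_variance_ratio_)  # fraction of variance per component

X_projected = pca.transform(X_centered)  # data expressed in the first 2 PCs
```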

89 principal components analysis
preserves 91.78% of the original variance

90 multidimensional scaling
unsupervised dimensionality reduction; a data transformation that starts from a distance matrix. The goal of MDS is to find low-dimensional vectors z⁽ⁱ⁾ such that ‖z⁽ⁱ⁾ - z⁽ʲ⁾‖ ≈ d_ij, the given pairwise distance between samples i and j.

91 multidimensional scaling

92 “The data set contains 3,192 individuals who were genotyped at 500,568 SNPs using the Affymetrix 500K SNP chip.” Novembre et al. (2008) “Genes mirror geography within Europe” Nature 456,

93 feature selection The features in a data set can be
relevant: these are required by the model to obtain optimal generalization performance; irrelevant: these provide no useful information for reducing the generalization error; redundant: these provide no more information than other features in the data set (correlated). Feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting the subset of relevant features for use in model construction.

94 feature selection
Benefits: improved model interpretability; shorter training times; enhanced generalization by reducing overfitting. Note that feature selection ≠ feature extraction.

95 feature selection: subset scoring
define a score that estimates the relevance of a feature subset for the task and then find the subset that optimizes it; with many features this requires a greedy search. sequential forward selection: start from the empty set and repeatedly add the feature that improves the score the most.

96 feature selection: subset scoring
define a score that estimates the relevance of a feature subset for the task and then find the subset that optimizes it; with many features this requires a greedy search. sequential forward selection.

97 feature selection: subset scoring
define a score that estimates the relevance of a feature subset for the task and then find the subset that optimizes it; with many features this requires a greedy search. sequential backward selection: start from the full feature set and repeatedly remove the feature whose removal hurts the score the least.

98 feature selection define a score that estimates the relevance of a feature subset for the task and then find the subset that optimizes it; with many features this requires a greedy search (sequential selection). Approaches: filter, wrapper and embedded methods.

99 feature selection: filters
relevance is computed directly from the data set, typically independently of the other features, e.g. the Pearson product-moment correlation coefficient or information gain; this yields a feature ranking rather than an explicit best feature subset; use cross-validation to find a good ranking threshold. Example: the ANOVA F-test (Student's t-test for two classes).

100 feature selection: filters
ANOVA F-test (Student's t-test for two classes): not computationally intensive and not tuned to a specific type of predictive model.

101 feature selection: wrappers
use a predictive model to score candidate feature subsets on a validation set / with cross-validation, or use the model parameters to estimate feature relevance. recursive feature elimination (RFE): repeatedly train the model, drop the least relevant feature(s), and retrain (see the sketch below).
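A minimal scikit-learn sketch of RFE wrapped around a linear SVM; the data set and the number of features to keep are placeholders:

```python
from sklearn.datasets import load_breast_cancer  # placeholder data set
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# RFE: repeatedly fit the model, drop the feature with the smallest weight,
# and refit until the requested number of features remains.
selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=10, step=1)
selector.fit(X, y)

print(selector.support_)  # boolean mask of the selected features
print(selector.ranking_)  # 1 = selected, larger = eliminated earlier
```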

102 feature selection: wrappers

103 feature selection: ANOVA

104 feature selection: embedded
feature selection as part of the model construction process, e.g. the relative rank (i.e. depth) of a feature used as a decision node in a decision tree, or the expected fraction of the samples a feature contributes to; in the Random Forests algorithm these expected activity rates are averaged over the trees to rank the features.

105 70-gene “Amsterdam” signature (MammaPrint™, Agendia)
76-gene “Rotterdam” signature (Veridex); 21-gene assay (Oncotype DX™); 97-gene “genomic grade” (MapQuant Dx™); and others…

