
1 Machine Learning (BE Computer 2015 PAT)
A.Y. 2018-19 SEM-II. Prepared by Mr. Dhomse G.P.

2 Unit-3 Regression Syllabus
Linear regression - Linear models, A bi-dimensional example, Linear regression and higher dimensionality, Ridge, Lasso and Elastic Net, Robust regression with random sample consensus, Polynomial regression, Isotonic regression. Logistic regression - Linear classification, Logistic regression, Implementation and optimizations, Stochastic gradient descent algorithms, Finding the optimal hyper-parameters through grid search, Classification metrics, ROC curve.

3 Linear Regression Linear regression is a linear model, i.e. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, it assumes that y can be calculated from a linear combination of the input variables (x). When there is a single input variable (x), the method is referred to as simple linear regression. When there are multiple input variables, the statistics literature often refers to the method as multiple linear regression.

4 Simple Linear Regression
For a simple regression problem (a single x and a single y), the form of the model would be: y = b0 + b1 * x1, where b0 is the constant (intercept), b1 is the coefficient, y is the dependent variable (DV), and x1 is the independent variable (IV).

5 Simple Linear Regression
EQUATION PLOTTING: y = b0 + b1 * x1 becomes SALARY (₹) = b0 + b1 * EXPERIENCE. Plotting salary against years of experience shows how much salary increases (e.g., +10K) for each additional year (+1 Yr) of experience.

6 Simple Linear Regression
ANALYZING THE DATASET: salary is the dependent variable (DV) and years of experience is the independent variable (IV).

7 Simple Linear Regression
LET'S CODE! Prepare your data with the preprocessing template: import the dataset (no handling of missing data is needed), split it into training and test sets, and keep feature scaling in mind even though it is least preferred here. Then correlate salaries with experience, carry out prediction on the test set, and verify the predicted values, as in the sketch below.
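A minimal sketch of this workflow, assuming a hypothetical file Salary_Data.csv with columns YearsExperience and Salary (the file and column names are illustrative, not taken from the slides):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

dataset = pd.read_csv('Salary_Data.csv')
X = dataset[['YearsExperience']].values   # independent variable (IV)
y = dataset['Salary'].values              # dependent variable (DV)

# Split into training and test sets; no missing-data handling or feature
# scaling is required for simple linear regression.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)           # correlate salaries with experience

y_pred = regressor.predict(X_test)        # prediction on the TEST SET
print(regressor.intercept_, regressor.coef_)  # b0 and b1 of SALARY = b0 + b1 * EXPERIENCE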

8 Example-2: y = B0 + B1 * x1 or weight = B0 + B1 * height
Let's make this concrete with an example. Imagine we are predicting weight (y) from height (x). Our linear regression model representation for this problem would be: y = B0 + B1 * x1, or weight = B0 + B1 * height.

9 Where B0 is the bias coefficient and B1 is the coefficient for the height column. We use a learning technique to find a good set of coefficient values. Once found, we can plug in different height values to predict the weight. For example, let's use B0 = 0.1 and B1 = 0.5. Let's plug them in and calculate the weight (in kilograms) for a person with a height of 182 centimeters: weight = 0.1 + 0.5 * 182 = 91.1. You can see that the above equation could be plotted as a line in two dimensions. The B0 is our starting point regardless of what height we have. We can run through a bunch of heights from 100 to 250 centimeters, plug them into the equation, and get weight values, creating our line, as sketched below.
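As a quick illustration of that arithmetic (using the same assumed coefficients B0 = 0.1 and B1 = 0.5):

import numpy as np

B0, B1 = 0.1, 0.5                 # bias and height coefficient from the example
heights = np.arange(100, 251)     # heights from 100 to 250 cm
weights = B0 + B1 * heights       # y = B0 + B1 * x1, one point of the line per height

print(B0 + B1 * 182)              # -> 91.1 kg for a person 182 cm tall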

10

11 Multi Linear Regression
y = b0 + b1 * x1 + b2 * x2 + ... + bn * xn, where b0 is the constant, b1 ... bn are the coefficients, x1 ... xn are the independent variables (IVs), and y is the dependent variable (DV).

12 Multiple linear regression analysis makes several key assumptions:
Multivariate normality - Multiple regression assumes that the residuals are normally distributed. No multicollinearity - Multiple regression assumes that the independent variables are not highly correlated with each other; this assumption is tested using Variance Inflation Factor (VIF) values. Homoscedasticity - This assumption states that the variance of the error terms is similar across the values of the independent variables; a plot of standardized residuals versus predicted values can show whether points are equally distributed across all values of the independent variables. Intellectus Statistics automatically includes the assumption tests and plots when conducting a regression.

13 Multiple linear regression requires at least two independent variables, which can be nominal, ordinal, or interval/ratio level variables.  A rule of thumb for the sample size is that regression analysis requires at least 20 cases per independent variable in the analysis. First, multiple linear regression requires the relationship between the independent and dependent variables to be linear.   The linearity assumption can best be tested with scatterplots.  The following two examples depict a curvilinear relationship (left) and a linear relationship (right).

14 curvilinear relationship (left) and a linear relationship (right).

15 Second, the multiple linear regression analysis requires that the errors between observed and predicted values (i.e., the residuals of the regression) should be normally distributed. This assumption may be checked by looking at a histogram or a Q-Q-Plot.  Normality can also be checked with a goodness of fit test (e.g., the Kolmogorov-Smirnov test), though this test must be conducted on the residuals themselves. Third, multiple linear regression assumes that there is no multicollinearity in the data.  Multicollinearity occurs when the independent variables are too highly correlated with each other.

16 Multicollinearity may be checked multiple ways:
1) Correlation matrix - When computing a matrix of Pearson's bivariate correlations among all independent variables, the magnitude of the correlation coefficients should be less than .80. 2) Variance Inflation Factor (VIF) - The VIFs of the linear regression indicate the degree to which the variances in the regression estimates are increased due to multicollinearity. VIF values higher than 10 indicate that multicollinearity is a problem. If multicollinearity is found in the data, one possible solution is to center the data: subtract the mean score from each observation for each independent variable. However, the simplest solution is to identify the variables causing multicollinearity issues (i.e., through correlations or VIF values) and remove those variables from the regression. Both checks are sketched below.
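A short sketch of both checks, assuming the independent variables are held in a pandas DataFrame df (random placeholder data is used here so the snippet runs on its own):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame(np.random.rand(100, 3), columns=['x1', 'x2', 'x3'])  # placeholder IVs

# 1) Correlation matrix: flag any |r| >= .80
print(df.corr())

# 2) VIF values: flag any VIF > 10
X = sm.add_constant(df)
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif)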

17 A scatterplot of residuals versus predicted values is a good way to check for homoscedasticity. There should be no clear pattern in the distribution; if there is a cone-shaped pattern (as shown below), the data is heteroscedastic.

18 Multi Linear Regression
DUMMY VARIABLES - used to encode a categorical variable (see the State example on the next slide).

19 Multi Linear Regression
DUMMY VARIABLES: the categorical State column (New York, California) is replaced by a dummy variable D1 (1 for New York, 0 for California), giving: y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * D1

20 Multi Linear Regression
DUMMY VARIABLE TRAP: since D2 = 1 - D1, including both dummies introduces perfect multicollinearity, so the model y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * D1 + b5 * D2 cannot be estimated reliably. Always OMIT one dummy variable, as in the sketch below.
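A small sketch of how one dummy variable can be omitted automatically with pandas (the State column and spend values are illustrative):

import pandas as pd

df = pd.DataFrame({'State': ['New York', 'California', 'New York'],
                   'R&D Spend': [165349.2, 162597.7, 153441.5]})

# drop_first=True keeps only one dummy (D1) and omits the other,
# avoiding the dummy variable trap.
dummies = pd.get_dummies(df, columns=['State'], drop_first=True)
print(dummies)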

21 Building A Model STEP BY STEP

22 Building A Model METHODS OF BUILDING A MODEL All - in
Backward Elimination, Forward Selection, Bidirectional Elimination, Score Comparison. (Backward elimination, forward selection, and bidirectional elimination are sometimes collectively called stepwise regression.)

23 Building A Model METHODS OF BUILDING A MODEL ALL - IN
Throw in every variable - used when you have prior knowledge about the predictors, when the variables are already known to be required, or when preparing for backward elimination.

24 Building A Model BACKWARD ELIMINATION MODEL (Best Model in All )
Step 1: Select a significance level to stay in the model (e.g., SL = 0.05). Step 2: Fit the full model with all possible predictors. Step 3: Consider the predictor with the highest P-value; if P > SL, go to Step 4, otherwise go to FIN (model built). Step 4: Remove that predictor. Step 5: Refit the model without this variable and return to Step 3. A sketch follows.
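A possible implementation of this loop with statsmodels, using the P-values of an OLS fit (the random data at the end is only there so the sketch runs):

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl=0.05):
    X = sm.add_constant(X)                 # Step 2: fit the full model
    while True:
        model = sm.OLS(y, X).fit()
        p_values = model.pvalues           # one P-value per remaining column
        worst = int(np.argmax(p_values))   # Step 3: predictor with the highest P-value
        if p_values[worst] <= sl:
            return model                   # FIN: model built
        X = np.delete(X, worst, axis=1)    # Steps 4-5: remove it and refit

X_demo = np.random.rand(100, 5)
y_demo = 3.0 * X_demo[:, 0] + np.random.normal(scale=0.1, size=100)
print(backward_elimination(X_demo, y_demo).summary())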

25 Bi-Dimensional Example
Let's consider a small dataset built by adding some uniform noise to the points belonging to a segment bounded between -6 and 6, for example generated as follows:
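One possible way to build such a dataset (the number of samples and the noise amplitude are assumptions; the code only needs to define X, Y and nb_samples for the snippets that follow):

import numpy as np

nb_samples = 200
X = np.arange(-6, 6, 12.0 / nb_samples)                    # segment bounded between -6 and 6
Y = X + 2 + np.random.uniform(-1.5, 1.5, size=nb_samples)  # y = x + 2 plus uniform noise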

26 The original equation is: y = x + 2 + n, where n is a noise term.
The figure shows a plot of the dataset together with a candidate regression function. As we're working on a plane, the regressor we're looking for is a function of only two parameters (the intercept and the slope): ỹ = α + βx. In order to fit our model, we must find the best parameters, and to do that we choose a least squares approach.

27 This task can be easily accomplished by Least Square Method. 
It is the most common method used for fitting a regression line. It calculates the best-fit line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line. Because the deviations are first squared, when added, there is no cancelling out between positive and negative values.

28 The loss function to minimize is:
L(v) = 0.5 * Σ_i (v[0] + v[1]*x_i - y_i)^2. So (for simplicity, the function accepts a vector containing both variables):
import numpy as np

def loss(v):
    e = 0.0
    for i in range(nb_samples):
        e += np.square(v[0] + v[1]*X[i] - Y[i])
    return 0.5 * e

29 In order to find the global minimum, we must impose a null gradient: ∇L = 0.
The gradient can be defined as:
def gradient(v):
    g = np.zeros(shape=2)
    for i in range(nb_samples):
        g[0] += (v[0] + v[1]*X[i] - Y[i])
        g[1] += ((v[0] + v[1]*X[i] - Y[i]) * X[i])
    return g

30 The optimization can now be solved using SciPy:
scipy.optimize.minimize parameters:
fun : callable - The objective function to be minimized, fun(x, *args) -> float, where x is a 1-D array with shape (n,) and args is a tuple of the fixed parameters needed to completely specify the function.
x0 : ndarray, shape (n,) - Initial guess. Array of real elements of size (n,), where n is the number of independent variables.
args : tuple, optional - Extra arguments passed to the objective function and its derivatives (the fun, jac and hess functions).
method : str or callable, optional - Type of solver. Should be one of 'Nelder-Mead', 'Powell', 'CG', 'BFGS', 'Newton-CG', 'L-BFGS-B',

31 'TNC', 'COBYLA', 'SLSQP', 'trust-constr', 'dogleg', 'trust-ncg', 'trust-exact', 'trust-krylov', or custom - a callable object. If not given, the method is chosen to be one of BFGS, L-BFGS-B, or SLSQP, depending on whether the problem has constraints or bounds.
jac : {callable, '2-point', '3-point', 'cs', bool}, optional - Method for computing the gradient vector.
hess : {callable, '2-point', '3-point', 'cs', HessianUpdateStrategy}, optional - Method for computing the Hessian matrix.

32 >>> from scipy.optimize import minimize
>>> minimize(fun=loss, x0=[0.0, 0.0], jac=gradient, method='L-BFGS-B')
      fun: ...
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([ ...e-06, ...e-05])
  message: 'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 8
      nit: 7
   status: 0
  success: True
        x: array([ ..., ... ])
As expected, the regression denoised our dataset, rebuilding the original equation: y = x + 2.

33 Scipy Optimization Example using Python
Optimization deals with selecting the best option among a number of possible choices that are feasible or don't violate constraints. Mathematical optimization problems may include equality constraints (e.g. =), inequality constraints (e.g. <, <=, >, >=), objective functions, algebraic equations, differential equations, continuous variables, discrete or integer variables, etc. 

34 This problem has a nonlinear objective that the optimizer attempts to minimize. The variable values at the optimal solution are subject to (s.t.) both an equality (= 40) and an inequality (> 25) constraint: the product of the four variables must be greater than 25, while the sum of squares of the variables must equal 40. In addition, all variables must be between 1 and 5, and the initial guess is x1 = 1, x2 = 5, x3 = 5, and x4 = 1. A sketch with SciPy follows.
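A sketch of this problem with scipy.optimize.minimize and the SLSQP solver. The slide does not state the objective explicitly, so the standard benchmark objective x1*x4*(x1 + x2 + x3) + x3 (Hock-Schittkowski problem 71) is assumed here:

import numpy as np
from scipy.optimize import minimize

def objective(x):
    return x[0] * x[3] * (x[0] + x[1] + x[2]) + x[2]

def constraint_ineq(x):              # product of the variables must exceed 25
    return x[0] * x[1] * x[2] * x[3] - 25.0

def constraint_eq(x):                # sum of squares must equal 40
    return np.sum(x ** 2) - 40.0

x0 = np.array([1.0, 5.0, 5.0, 1.0])  # initial guess x1=1, x2=5, x3=5, x4=1
bounds = [(1.0, 5.0)] * 4            # all variables between 1 and 5
constraints = [{'type': 'ineq', 'fun': constraint_ineq},
               {'type': 'eq', 'fun': constraint_eq}]

sol = minimize(objective, x0, method='SLSQP', bounds=bounds, constraints=constraints)
print(sol.x, sol.fun)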

35 Linear regression with scikit-learn and higher dimensionality
scikit-learn offers the class LinearRegression, which works with n-dimensional spaces. For this purpose, we're going to use the Boston dataset:
from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> boston.data.shape
(506L, 13L)
>>> boston.target.shape
(506L,)

36 It has 506 samples with 13 input features and one output
It has 506 samples with 13 input features and one output. In the following figure, there's a collection of the plots of the first 12 features:

37 How to Find Accuracy of Model
We ask the model to normalize the data before processing it. Moreover, for testing purposes, we split the original dataset into training (90%) and test (10%) sets:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
>>> X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, test_size=0.1)
>>> lr = LinearRegression(normalize=True)
>>> lr.fit(X_train, Y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)

38 To check the accuracy of a regression, scikit-learn provides the internal method score(X, y)
This method evaluates the model on test data:
>>> lr.score(X_test, Y_test)
So the overall accuracy is about 77%, which is an acceptable result considering the non-linearity of the original dataset, but it can also be influenced by the subdivision made by train_test_split (as in our case).

39 we can use the function cross_val_score(), which works with all the classifiers.
The scoring parameter is very important because it determines which metric will be adopted for tests. As LinearRegression works with ordinary least squares, we preferred the negative mean squared error, which is a cumulative measure that must be evaluated according to the actual values (it's not relative).

40 from sklearn.model_selection import cross_val_score
>>> scores = cross_val_score(lr, boston.data, boston.target, cv=7, scoring='neg_mean_squared_error')
array([ ..., ..., ..., ..., ..., ..., ... ])
>>> scores.mean()
>>> scores.std()

41 Another very important metric used in regression is called the coefficient of determination, or R². It measures the amount of variance in the prediction which is explained by the dataset; a value close to 1 means an almost perfect fit.
>>> cross_val_score(lr, X, Y, cv=10, scoring='r2')
0.75
(CV = cross-validation with 10 folds; R² ≈ 1 is the ideal value.)

42 Big Mart Sales - In this data set, we have product-wise sales for multiple outlets of a retail chain.

43 Ridge & Lasso Ridge and Lasso regression are powerful techniques generally used for creating parsimonious models in the presence of a 'large' number of features. Here 'large' can typically mean either of two things: large enough to enhance the tendency of a model to overfit (as few as 10 variables might cause overfitting), or large enough to cause computational challenges; with modern systems, this situation might arise in the case of millions or billions of features. Though Ridge and Lasso might appear to work towards a common goal, their inherent properties and practical use cases differ substantially. If you've heard of them before, you must know that they work by penalizing the magnitude of the coefficients of features along with minimizing the error between predicted and actual observations. These are called 'regularization' techniques.

44 Why Penalize the Magnitude of Coefficients?
Lets try to understand the impact of model complexity on the magnitude of coefficients. As an example, I have simulated a sine curve (between 60° and 300°)

45 This resembles a sine curve but not exactly because of the noise.
We'll use this as an example to test different scenarios in this article. Let's try to estimate the sine function using polynomial regression with powers of x from 1 to 15. Let's add a column for each power up to 15 in our dataframe, as sketched below.
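A sketch of that setup (the noise level and random seed are assumptions, not slide values):

import numpy as np
import pandas as pd

x = np.array([i * np.pi / 180 for i in range(60, 300, 4)])  # angles from 60° to 300°, in radians
np.random.seed(10)
y = np.sin(x) + np.random.normal(0, 0.15, len(x))           # sine curve plus noise

data = pd.DataFrame({'x': x, 'y': y})
for i in range(2, 16):
    data['x_%d' % i] = data['x'] ** i                       # columns x_2 .. x_15
print(data.head())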

46

47 Now that we have all the 15 powers, let's make 15 different linear regression models, with each model containing variables with powers of x from 1 to the particular model number. For example, the feature set of model 8 will be {x, x_2, x_3, ..., x_8}. RSS refers to 'Residual Sum of Squares', which is nothing but the sum of squares of the errors between the predicted and actual values in the training data set. We would expect the models with increasing complexity to better fit the data and result in lower RSS values. This can be verified by looking at the plots generated for 6 models:

48

49 As the model complexity increases, the models tend to fit even smaller deviations in the training data set. Though this leads to overfitting, let's keep this issue aside for some time and come to our main objective, i.e. the impact on the magnitude of the coefficients. See the output in coef_matrix_simple: it is clearly evident that the size of the coefficients increases exponentially with the increase in model complexity.

50

51 Minimization objective = LS Obj + α * (sum of square of coefficients)
Ridge regression imposes an additional shrinkage penalty on the ordinary least squares loss function to limit the squared L2 norm of the coefficients. Ridge regression performs L2 regularization, i.e. it adds a penalty equivalent to the square of the magnitude of the coefficients: Minimization objective = LS Obj + α * (sum of squares of the coefficients). Note that here 'LS Obj' refers to 'least squares objective', i.e. the linear regression objective without regularization.

52 α can take various values:
α = 0: The objective becomes the same as simple linear regression; we'll get the same coefficients as simple linear regression. α = ∞: The coefficients will be zero, because with infinite weight on the square of the coefficients, anything other than zero makes the objective infinite. 0 < α < ∞: The magnitude of α decides the weight given to the different parts of the objective; the coefficients will be somewhere between 0 and the ones for simple linear regression.

53 as the value of alpha increases, the model complexity reduces.

54 The RSS increases with the increase in alpha as the model complexity reduces. An alpha as small as 1e-15 gives us a significant reduction in the magnitude of the coefficients (compare the coefficients in the first row of this table to the last row of the simple linear regression table). High alpha values can lead to significant underfitting; note the rapid increase in RSS for values of alpha greater than 1. Though the coefficients are very, very small, they are NOT zero. A sketch of this experiment follows.
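A sketch of the alpha sweep, assuming the DataFrame data with columns x, x_2, ..., x_15 built in the earlier sine-curve sketch:

import numpy as np
from sklearn.linear_model import Ridge

predictors = ['x'] + ['x_%d' % i for i in range(2, 16)]
alphas = [1e-15, 1e-10, 1e-4, 1e-3, 1e-2, 1, 5, 10, 20]

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(data[predictors], data['y'])
    pred = ridge.predict(data[predictors])
    rss = np.sum((pred - data['y']) ** 2)           # residual sum of squares
    # coefficients shrink (but never reach zero) as alpha grows
    print(alpha, rss, np.abs(ridge.coef_).max())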

55 Lasso Regression: Performs L1 regularization
Lasso regression performs L1 regularization, i.e. it adds a penalty equivalent to the absolute value of the magnitude of the coefficients: Minimization objective = LS Obj + α * (sum of absolute values of the coefficients). A Lasso regressor imposes a penalty on the L1 norm of w to determine a potentially higher number of null coefficients.

56 Here, α (alpha) works similarly to that of ridge and provides a trade-off between balancing RSS and the magnitude of coefficients. As with ridge, α can take various values. Let's iterate them here briefly: α = 0: same coefficients as simple linear regression. α = ∞: all coefficients zero (same logic as before). 0 < α < ∞: coefficients between 0 and those of simple linear regression.

57 This again tells us that the model complexity decreases with increase in the values of alpha.

58 Apart from the expected inference of higher RSS for higher alphas, we can see the following: For the same values of alpha, the coefficients of lasso regression are much smaller as compared to that of ridge regression (compare row 1 of the 2 tables). For the same alpha, lasso has higher RSS (poorer fit) as compared to ridge regression Many of the coefficients are zero even for very small values of alpha

59 ElasticNet The last alternative is ElasticNet, which combines both Lasso and Ridge into a single model with two penalty factors: one proportional to the L1 norm and the other to the L2 norm. In this way, the resulting model will be sparse like a pure Lasso, but with the same regularization ability as provided by Ridge. The resulting loss function (in the scikit-learn formulation) is: 1/(2 * n_samples) * ||y - Xw||² + α * l1_ratio * ||w||₁ + 0.5 * α * (1 - l1_ratio) * ||w||₂². A sketch follows.
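A minimal sketch of Lasso and ElasticNet on the same assumed data and predictors as the ridge sketch above; l1_ratio balances the L1 (Lasso-like) and L2 (Ridge-like) penalties:

from sklearn.linear_model import Lasso, ElasticNet

lasso = Lasso(alpha=0.001, max_iter=10000).fit(data[predictors], data['y'])
enet = ElasticNet(alpha=0.001, l1_ratio=0.5, max_iter=10000).fit(data[predictors], data['y'])

print('Lasso zero coefficients:', (lasso.coef_ == 0).sum())       # sparse solution
print('ElasticNet zero coefficients:', (enet.coef_ == 0).sum())   # sparse, but also Ridge-regularized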

60 Summary L1 regularization, aka Lasso regularization - this adds regularization terms to the model which are a function of the absolute values of the coefficients of the parameters. The coefficients can be driven to zero during the regularization process, hence this technique can be used for feature selection and for generating a more parsimonious model. L2 regularization, aka Ridge regularization - this adds regularization terms to the model which are a function of the squares of the coefficients of the parameters. Coefficients can approach zero but never become exactly zero, so this technique cannot be used for feature selection. Combination of the above two, such as Elastic Net - this adds regularization terms to the model which are a combination of both L1 and L2 regularization.

61 Robust regression with random sample consensus
A common problem with linear regression is caused by the presence of outliers. An ordinary least squares approach will take them into account and the result (in terms of coefficients) will therefore be biased. In the following figure, there's an example of such behavior: the less sloped line represents an acceptable regression which discards the outliers, while the other one is influenced by them. An interesting approach to avoid this problem is offered by random sample consensus (RANSAC), which works with every regressor by subsequent iterations, after splitting the dataset into inliers and outliers.

62

63 What is an outlier? Outlier is a commonly used term among analysts and data scientists; it needs close attention, or else it can result in wildly wrong estimations. Outliers can be detected via: boxplot, histogram, scatter plot.

64 Types of Outliers: data entry errors, measurement errors,
experimental errors, intentional outliers, data processing errors, and sampling errors. How to handle outliers: deleting observations, or transforming and binning values.

65 Random Sample Consensus (RANSAC),
The model is trained only with valid samples (evaluated internally or through the callable is_data_valid()) and all samples are re-evaluated to verify if they're still inliers or they have become outliers. The process ends after a fixed number of iterations or when the desired score is achieved.

66 from sklearn.linear_model import LinearRegression
Here's an example of simple linear regression applied to the dataset shown in the previous figure on ppt 62:
from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression(normalize=True)
>>> lr.fit(X.reshape((-1, 1)), Y.reshape((-1, 1)))
>>> lr.intercept_
array([ ])
>>> lr.coef_
array([[ ]])
As imagined, the slope is high due to the presence of the outliers; the resulting regressor is slightly less sloped than what was shown in the figure.

67 Now we're going to use RANSAC with the same linear regressor:
RANSAC is an iterative algorithm for the robust estimation of parameters from a subset of inliers from the complete data set.
from sklearn.linear_model import RANSACRegressor
>>> rs = RANSACRegressor(lr)
>>> rs.fit(X.reshape((-1, 1)), Y.reshape((-1, 1)))
>>> rs.estimator_.intercept_
array([ ])
>>> rs.estimator_.coef_
array([[ ]])
In this case, the regressor is about y = 2 + x (which is the original clean dataset without outliers).

68 Polynomial Regression
A regression equation is a polynomial regression equation if the power of the independent variable is more than 1. The equation below represents a polynomial equation: y = a + b*x^2

69 While there might be a temptation to fit a higher degree polynomial to get lower error, this can result in over-fitting. Always plot the relationships to see the fit and focus on making sure that the curve fits the nature of the problem. Here is an example of how plotting can help:

70 Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2]. Number of features in the output array scales polynomially in the number of features of the input array, and exponentially in the degree. High degrees can cause overfitting.

71 >>> X = np.arange(6).reshape(3, 2)
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])
>>> poly = PolynomialFeatures(interaction_only=True)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.],
       [ 1.,  2.,  3.,  6.],
       [ 1.,  4.,  5., 20.]])

72 from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression(normalize=True)
>>> lr.fit(X.reshape((-1, 1)), Y.reshape((-1, 1)))
>>> lr.score(X.reshape((-1, 1)), Y.reshape((-1, 1)))
Performances are poor, as expected. Let's add the polynomial features:
from sklearn.preprocessing import PolynomialFeatures
>>> pf = PolynomialFeatures(degree=2)
>>> Xp = pf.fit_transform(X.reshape(-1, 1))
>>> Xp.shape
(100L, 3L)

73 As expected, the old x1 coordinate has been replaced by a triplet, which contains the bias, the linear, and the quadratic terms. At this point, a linear regression model can be trained:
>>> lr.fit(Xp, Y.reshape((-1, 1)))
>>> lr.score(Xp, Y.reshape((-1, 1)))
The score is quite a bit higher, and the only price we have paid is an increase in the number of features.

74 Isotonic regression Isotonic regression finds a non-decreasing approximation of a function while minimizing the mean squared error on the training data. The benefit of such a model is that it does not assume any form for the target function, such as linearity. It produces a piecewise interpolating function minimizing the functional L = Σ_i w_i (y_i - ŷ_i)², subject to the monotonicity constraint ŷ_0 ≤ ŷ_1 ≤ ... ≤ ŷ_n.

75 An example (with a toy dataset) is provided next:
>>> X = np.arange(-5, 5, 0.1)
>>> Y = X + np.random.uniform(-0.5, 1, size=X.shape)
Following is a plot of the dataset. As you can see, it can be easily modeled by a linear regressor, but without a highly non-linear function, it is very difficult to capture the slight (and local) modifications in the slope:

76

77 Another example

78 from sklearn.isotonic import IsotonicRegression
The class IsotonicRegression needs to know ymin and ymax (which correspond to the variables y0 and yn in the loss function). In this case, we impose -6 and 10:
from sklearn.isotonic import IsotonicRegression
>>> ir = IsotonicRegression(-6, 10)
>>> Yi = ir.fit_transform(X, Y)
The result is provided through three instance variables:
>>> ir.X_min_
-5.0
>>> ir.X_max_
>>> ir.f_
<scipy.interpolate.interpolate.interp1d at 0x126edef8>
The last one, ir.f_, is an interpolating function which can be evaluated in the domain [xmin, xmax]. For example:
>>> ir.f_(2)
array( )

79 A plot of this function (the green line), together with the original data set, is shown in the following figure:

80 Logistic Regression

81 Linear classification

82 Let's consider a generic linear classification problem with two classes. In the following figure,
Our goal is to find an optimal hyperplane which separates the two classes. In multi-class problems, the one-vs-all strategy is normally adopted, so the discussion can be focused only on binary classifications. Suppose we have a dataset of n m-dimensional samples, X = {x1, ..., xn} with xi in R^m, associated with the target set Y = {y1, ..., yn}, where each yi is either 0 or 1. We can now define a weight vector made of m continuous components, w = (w1, ..., wm).

83 We can also define the quantity z
We define z = w · x (the dot product of the weight vector and a sample). If x is a variable, z is the value determined by the hyperplane equation. Therefore, if the set of coefficients w that has been determined is correct, it happens that z has one sign for samples of the first class and the opposite sign for samples of the second. Now we must find a way to optimize w in order to reduce the classification error. If such a combination exists (with a certain error threshold), we say that our problem is linearly separable. On the other hand, when it's impossible to find a linear classifier, the problem is called non-linearly separable. A very simple but famous example is given by the logical operator XOR.

84 Logistic regression Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary).  It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent binary / categorical outcome, we use dummy variables.  Like all regression analyses, the logistic regression is a predictive analysis.  Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

85 Logistic regression can be seen as a special case of linear regression where the outcome variable is categorical and we use the log of odds as the dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function. The fundamental equation of the generalized linear model is: g(E(y)) = α + βx1 + γx2. Here, g() is the link function, E(y) is the expectation of the target variable, and α + βx1 + γx2 is the linear predictor (α, β, γ to be predicted). The role of the link function is to 'link' the expectation of y to the linear predictor.

86 g(y) = βo + β(Age) ---- (a) considered ‘Age’ as independent variable.
We are provided a sample of 1000 customers. We need to predict the probability of whether a customer will buy (y) a particular magazine or not. As you can see, we have a categorical outcome variable, so we'll use logistic regression: g(y) = βo + β(Age) ---- (a), considering 'Age' as the independent variable. In logistic regression, we are only concerned with the probability of the outcome dependent variable (success or failure). As described above, g() is the link function. This function is established using two things: Probability of Success (p) and Probability of Failure (1-p).

87 p = exp(βo + β(Age)) = e^(βo + β(Age)) ------- (b)
p should meet the following criteria: it must always be positive (since p >= 0), and it must always be less than or equal to 1 (since p <= 1). Now, we'll simply satisfy these 2 conditions and get to the core of logistic regression. To establish the link function, we'll denote g() with 'p' initially and eventually end up deriving this function. Since the probability must always be positive, we'll put the linear equation in exponential form. For any value of slope and dependent variable, the exponent of this equation will never be negative: p = exp(βo + β(Age)) = e^(βo + β(Age)) ------- (b)

88 To make the probability less than 1, we must divide p by a number greater than p. This can simply be done by: p = exp(βo + β(Age)) / (exp(βo + β(Age)) + 1) = e^(βo + β(Age)) / (e^(βo + β(Age)) + 1) ----- (c). Using (a), (b) and (c), we can redefine the probability as: p = e^y / (1 + e^y) --- (d), where p is the probability of success. This (d) is the logit (sigmoid) function. If p is the probability of success, 1-p will be the probability of failure, which can be written as: q = 1 - p = 1 - (e^y / (1 + e^y)) --- (e)

89 where q is the probability of failure
On dividing (d) by (e), we get: p / (1 - p) = e^y. After taking the log on both sides, we get: log(p / (1 - p)) = y. log(p/(1-p)) is the link function; the logarithmic transformation on the outcome variable allows us to model a non-linear association in a linear way. After substituting the value of y, we get: log(p / (1 - p)) = βo + β(Age)

90 This is the equation used in Logistic Regression
This is the equation used in logistic regression. Here p/(1-p) is the odds ratio. Whenever the log of the odds ratio is positive, the probability of success is more than 50%. A typical logistic model plot is shown below; you can see the probability never goes below 0 or above 1.
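A small sketch of that curve; the coefficients b0 and b1 below are purely hypothetical:

import numpy as np
import matplotlib.pyplot as plt

b0, b1 = -7.0, 0.2                   # hypothetical intercept and Age coefficient
age = np.linspace(0, 80, 200)
y = b0 + b1 * age                    # linear predictor
p = np.exp(y) / (1 + np.exp(y))      # p = e^y / (1 + e^y), always in (0, 1)

plt.plot(age, p)
plt.xlabel('Age')
plt.ylabel('P(buy)')
plt.show()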

91 Implementation and optimizations
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X, y)
>>> clf.predict(X[:2, :])
array([0, 0])
>>> clf.predict_proba(X[:2, :])
array([[9.8...e-01, ...e-02, ...e-08],
       [9.7...e-01, ...e-02, ...e-08]])
>>> clf.score(X, y)

92 Stochastic Gradient Descent Algorithms
Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing. Given that the data is sparse, the classifiers in this module easily scale to problems with more than 10^5 training examples and more than 10^5 features.

93 The advantages of Stochastic Gradient Descent are:
Efficiency. Ease of implementation (lots of opportunities for code tuning). The disadvantages of Stochastic Gradient Descent include: SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations. SGD is sensitive to feature scaling (see the sketch below).
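Because of that sensitivity to feature scaling, a common pattern (a sketch, not from the slides) is to standardize the features in a pipeline before the SGD step:

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)
clf = make_pipeline(StandardScaler(),                      # zero-mean, unit-variance features
                    SGDClassifier(loss='hinge', max_iter=1000, tol=1e-3))
clf.fit(X, y)
print(clf.score(X, y))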

94 class SGDClassifier implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties for classification.

As with other classifiers, SGD has to be fitted with two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array Y of size [n_samples] holding the target values (class labels) for the training samples:
>>> from sklearn.linear_model import SGDClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
>>> clf.fit(X, y)
SGDClassifier(alpha=0.0001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=5, n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2', power_t=0.5, random_state=None, shuffle=True, tol=None, validation_fraction=0.1, verbose=0, warm_start=False)
After being fitted, the model can then be used to predict new values:

96 >>> clf.predict([[2., 2.]])
array([1])
SGD fits a linear model to the training data. The member coef_ holds the model parameters:
>>> clf.coef_
array([[9.9..., ...]])
Member intercept_ holds the intercept (aka offset or bias):
>>> clf.intercept_
array([ ])
Whether or not the model should use an intercept, i.e. a biased hyperplane, is controlled by the parameter fit_intercept. To get the signed distance to the hyperplane, use SGDClassifier.decision_function:
>>> clf.decision_function([[2., 2.]])
array([ ])

97 The concrete loss function can be set via the loss parameter
The concrete loss function can be set via the loss parameter. SGDClassifier supports the following loss functions: loss="hinge": (soft-margin) linear Support Vector Machine; loss="modified_huber": smoothed hinge loss; loss="log": logistic regression; and all regression losses below. The first two loss functions are lazy: they only update the model parameters if an example violates the margin constraint, which makes training very efficient and may result in sparser models, even when an L2 penalty is used.

98 penalty="l2": L2 norm penalty on coef_.
penalty="elasticnet": Convex combination of L2 and L1; (1 - l1_ratio) * L2 + l1_ratio * L1. In the case of multi-class classification coef_ is a two-dimensionalarray of shape=[n_classes, n_features] and intercept_ is a one-dimensional array of shape=[n_classes]. The i-th row of coef_ holds the weight vector of the OVA classifier for the i-th class; classes are indexed in ascending order (see attribute classes_). Note that, in principle, since they allow to create a probability model, loss="log" and loss="modified_huber" are more suitable for one-vs-all classification.

99

100 Finding the optimal hyperparameters through grid search
GridSearchCV automates the training process of different models and provides the user with optimal values using cross-validation. When creating a machine learning model, you'll be presented with design choices as to how to define your model architecture. Often we don't immediately know what the optimal model architecture should be for a given model, and thus we'd like to be able to explore a range of possibilities.

101 In true machine learning fashion, we'll ideally ask the machine to perform this exploration and select the optimal model architecture automatically. These design choices are called hyperparameters, and thus this process of searching for the ideal model architecture is referred to as hyperparameter tuning. What should I set my learning rate to for gradient descent? What degree of polynomial features should I use for my linear model?

102 We show how to use it to find the best penalty and strength factors for a logistic regression on the Iris toy dataset:
import multiprocessing
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
>>> iris = load_iris()
>>> param_grid = [ { 'penalty': ['l1', 'l2'], 'C': [0.5, 1.0, 1.5, 1.8, 2.0, 2.5] } ]

103 Explanation Comparison of the sparsity (percentage of zero coefficients) of solutions when L1 and L2 penalty are used for different values of C. We can see that large values of C give more freedom to the model. Conversely, smaller values of C constrain the model more. In the L1 penalty case, this leads to sparser solutions. We classify 8x8 images of digits into two classes: 0-4 against 5-9. The visualization shows coefficients of the models for varying C.

104

105 For C = 1.00, 0.10, and 0.01, the figure reports the sparsity (percentage of zero coefficients) and the classification score obtained with the L1 and L2 penalties. The L1 penalty yields much sparser solutions (e.g., sparsity of 28.12% at C = 0.10 and 84.38% at C = 0.01) than the L2 penalty (e.g., 4.69%).

106 Let the code continue from PPT 102:
>>> gs = GridSearchCV(estimator=LogisticRegression(), param_grid=param_grid, scoring='accuracy', cv=10, n_jobs=multiprocessing.cpu_count())
>>> gs.fit(iris.data, iris.target)
>>> gs.best_estimator_
LogisticRegression(C=1.5, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l1', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
>>> cross_val_score(gs.best_estimator_, iris.data, iris.target, scoring='accuracy', cv=10).mean()

107 Basic-Grid-search and cross-validated estimators
By default, GridSearchCV uses 3-fold cross-validation. However, if it detects that a classifier is passed, rather than a regressor, it uses a stratified 3-fold. (The default will change to 5-fold cross-validation in version 0.22.) For example:
>>> clf.score(X_digits[1000:], y_digits[1000:])
Nested cross-validation:
>>> cross_val_score(clf, X_digits, y_digits)
array([ ..., ..., ... ])

108 We now find the best parameters of an SGDClassifier trained with the perceptron loss. The dataset is plotted in the following figure:

109 SGDClassifier This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time. This implementation works with data represented as dense or sparse arrays of floating-point values for the features. The model it fits can be controlled with the loss parameter; by default, it fits a linear support vector machine (SVM).

110 The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared euclidean norm L2 or the absolute norm L1 or a combination of both (Elastic Net). If the parameter update crosses the 0.0 value because of the regularizer, the update is truncated to 0.0 to allow for learning sparse models and achieve online feature selection.

111 from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
>>> param_grid = [ { 'penalty': ['l1', 'l2', 'elasticnet'], 'alpha': [1e-5, 1e-4, 5e-4, 1e-3, 2.3e-3, 5e-3, 1e-2], 'l1_ratio': [0.01, 0.05, 0.1, 0.15, 0.25, 0.35, 0.5, 0.75, 0.8] } ]
>>> sgd = SGDClassifier(loss='perceptron', learning_rate='optimal')
>>> gs = GridSearchCV(estimator=sgd, param_grid=param_grid, scoring='accuracy', cv=10, n_jobs=multiprocessing.cpu_count())
>>> gs.fit(X, Y)

112 >>> gs.best_score_ 0.89400000000000002
>>> gs.best_estimator_ SGDClassifier(alpha=0.001, average=False, class_weight=None, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.1, learning_rate='optimal', loss='perceptron', n_iter=5, n_jobs=1, penalty='elasticnet', power_t=0.5, random_state=None, shuffle=True, verbose=0, warm_start=False)

113 Classification metrics
A classification task can be evaluated in many different ways to achieve specific objectives. Of course, the most important metric is the accuracy, often expressed as: accuracy = (number of correct predictions) / (total number of samples). In scikit-learn, it can be assessed using the built-in accuracy_score() function:
from sklearn.metrics import accuracy_score
>>> accuracy_score(Y_test, lr.predict(X_test))

114 from sklearn.metrics import zero_one_loss 
>>> zero_one_loss(Y_test, lr.predict(X_test))
>>> zero_one_loss(Y_test, lr.predict(X_test), normalize=False)
7L
A similar but opposite metric is the Jaccard similarity coefficient, defined as J(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the sets of true and predicted labels.

115 This index measures the similarity and is bounded between 0 (worst performances) and 1 (best performances). In the former case, the intersection is null, while in the latter, the intersection and union are equal because there are no misclassifications. from sklearn.metrics import jaccard_similarity_score  >>> jaccard_similarity_score(Y_test, lr.predict(X_test)) These measures provide a good insight into our classification algorithms.

116 True positive: A positive sample correctly classified
it's necessary to be able to differentiate between different kinds of misclassifications (we're considering the binary case with the conventional notation: 0- negative, 1-positive), because the relative weight is quite different. For this reason, we introduce the following definitions: True positive: A positive sample correctly classified False positive: A negative sample classified as positive True negative: A negative sample correctly classified False negative: A positive sample classified as negative

117 A false positive and a false negative can be considered as similar errors, but think about a medical prediction: while a false positive can be easily discovered with further tests, a false negative is often neglected, with repercussions following from that choice. For this reason, it's useful to introduce the concept of a confusion matrix:

118 it's possible to build a confusion matrix using a built-in function
It's possible to build a confusion matrix using a built-in function. Let's consider a generic logistic regression on a dataset X with labels Y:
>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)
>>> lr = LogisticRegression()
>>> lr.fit(X_train, Y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
Now we can compute our confusion matrix and immediately see how the classifier is working:

119 we have five false negatives and two false positives.
from sklearn.metrics import confusion_matrix
>>> cm = confusion_matrix(y_true=Y_test, y_pred=lr.predict(X_test))
>>> cm[::-1, ::-1]
[[50  5]
 [     ]]
We have five false negatives and two false positives. Another useful direct measure is the precision:

120 from sklearn.metrics import precision_score
This is directly connected with the ability to capture the features that determine the positiveness of a sample, to avoid misclassification as negative: precision = TP / (TP + FP). In scikit-learn, the implementation is:
from sklearn.metrics import precision_score
>>> precision_score(Y_test, lr.predict(X_test))
The ability to detect true positive samples among all the potential positives can be assessed using another measure, the recall:

121 The scikit-learn implementation is:
from sklearn.metrics import recall_score
>>> recall_score(Y_test, lr.predict(X_test))
It's not surprising that we have a 90 percent recall (recall = TP / (TP + FN)) with 96 percent precision (precision = TP / (TP + FP)), because the number of false negatives (which impact recall) is proportionally higher than the number of false positives (which impact precision). A weighted harmonic mean between precision and recall is provided by the F-beta score: F_beta = (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall).

122 A beta value equal to 1 determines the so-called F1 score, which is a perfect balance between the two measures. A beta less than 1 gives more importance to precision and a value greater than 1 gives more importance to recall. The following snippet shows how to implement it with scikit-learn: from sklearn.metrics import fbeta_score >>> fbeta_score(Y_test, lr.predict(X_test), beta=1) >>> fbeta_score(Y_test, lr.predict(X_test), beta=0.75) >>> fbeta_score(Y_test, lr.predict(X_test), beta=1.25)

123 For the F1 score, scikit-learn provides the function f1_score(), which is equivalent to fbeta_score() with beta=1. The highest score is achieved by giving more importance to precision (which is higher), while the lowest one corresponds to a recall predominance. F-beta is hence useful to get a compact picture of the accuracy as a trade-off between high precision and a limited number of false negatives.

124 ROC curve The ROC curve (or receiver operating characteristics) is a valuable tool to compare different classifiers that can assign a score to their predictions. In general, this score can be interpreted as a probability, so it's bounded between 0 and 1. The plane is structured like in the following figure:

125

126 The x axis represents the increasing false positive rate (equal to 1 - specificity), while the y axis represents the true positive rate (also known as sensitivity). The dashed oblique line represents a perfectly random classifier, so all the curves below this threshold perform worse than a random choice, while the ones above it show better performances. Of course, the best classifier has an ROC curve split into the segments [0, 0] - [0, 1] and [0, 1] - [1, 1], and our goal is to find algorithms whose performances should be as close as possible to this limit. To show how to create a ROC curve with scikit-learn, we're going to train a model to determine the scores for the predictions (this can be achieved using the decision_function() or predict_proba() methods):

127 >>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25) 
>>> lr = LogisticRegression()
>>> lr.fit(X_train, Y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
>>> Y_scores = lr.decision_function(X_test)
Now we can compute the ROC curve:
from sklearn.metrics import roc_curve
>>> fpr, tpr, thresholds = roc_curve(Y_test, Y_scores)

128 it's also useful to compute the area under the curve (AUC), whose value is bounded between 0 (worst performances) and 1 (best performances), with a perfectly random value corresponding to 0.5: from sklearn.metrics import auc >>> auc(fpr, tpr) We already know that our performances are rather good because the AUC is close to 1. Now we can plot the ROC curve using matplotlib. As this book is not dedicated to this powerful framework, I'm going to use a snippet that can be found in several examples:

129 import matplotlib.pyplot as plt 
>>> plt.figure(figsize=(8, 8)) >>> plt.plot(fpr, tpr, color='red', label='Logistic regression (AUC: %.2f)' % auc(fpr, tpr)) >>> plt.plot([0, 1], [0, 1], color='blue', linestyle='--') >>> plt.xlim([0.0, 1.0]) >>> plt.ylim([0.0, 1.01]) >>> plt.title('ROC Curve') >>> plt.xlabel('False Positive Rate') >>> plt.ylabel('True Positive Rate') >>> plt.legend(loc="lower right") >>> plt.show()

130

131 As confirmed by the AUC, our ROC curve shows very good performance
As confirmed by the AUC, our ROC curve shows very good performance. In later chapters, we're going to use the ROC curve to visually compare different algorithms. As an exercise, you can try different parameters of the same model and plot all the ROC curves, to immediately understand which setting is preferable.

132 References

