Scikit-Learn Intro to Data Science Presented by: Vishnu Karnam A
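Linear Regression in Mathematical Form
Given a data set $\{y_i, x_{1i}, x_{2i}, \ldots, x_{pi}\}$, linear regression (LR) assumes that the dependent variable $Y_i$ depends linearly on the vector $x_i$. The relationship is modeled with a random error variable $\varepsilon_i$, so the equation can be written as
$$Y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \varepsilon_i$$
$$Y_i = x_i^T \beta + \varepsilon_i$$
These equations are stacked and written in vector form as
$$Y = X\beta + \varepsilon$$
where $Y$ is the vector of responses $Y_i$, $X$ is the matrix whose $i$-th row is $x_i^T$, and $\varepsilon$ is the vector of errors $\varepsilon_i$.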
Cont…
After the mathematical representation, estimation methods are used to determine the parameters. An estimation method consists of an objective function, and the parameters are chosen to either maximize or minimize it. In machine learning, these parameters are calculated from the given data points.
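
As a minimal sketch of this idea, the example below estimates $\beta$ by minimizing the residual sum of squares, which is the objective function used by ordinary least squares. The synthetic data, the true coefficients `beta_true`, and the helper `rss` are assumptions made for illustration; `fit_intercept=False` is chosen only because the model above has no intercept term.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative assumption): Y = X beta + eps, no intercept.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
eps = rng.normal(scale=0.1, size=n)
Y = X @ beta_true + eps


def rss(beta):
    """Objective function: residual sum of squares for a candidate beta."""
    residual = Y - X @ beta
    return float(residual @ residual)


# Ordinary least squares picks the beta that minimizes rss(beta).
model = LinearRegression(fit_intercept=False).fit(X, Y)
beta_hat = model.coef_

# The same estimate from NumPy's least-squares solver, as a cross-check.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print("estimated beta:", beta_hat)
print("objective at estimate:", rss(beta_hat))
```

Other estimators swap in a different objective function (for example, a likelihood to maximize), but the pattern of choosing parameters from the data stays the same.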
Dimensionality Reduction
Some well-known dimensionality reduction algorithms are:
1) Principal Component Analysis (PCA)
2) Incremental PCA
3) Approximate PCA
4) Kernel PCA
5) Linear Discriminant Analysis
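
The sketch below shows how these five algorithms might be called in a recent scikit-learn release. The data set is random and purely illustrative. Mapping "Approximate PCA" to `PCA(svd_solver="randomized")` is an assumption about current versions of the library, and the choice of two components is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Illustrative data: 90 samples, 5 features, 3 classes shifted apart.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 5))
y = np.repeat([0, 1, 2], 30)
X[y == 1] += 2.0
X[y == 2] -= 2.0

reducers = {
    "PCA": PCA(n_components=2),
    "Incremental PCA": IncrementalPCA(n_components=2, batch_size=30),
    # Assumption: the randomized SVD solver stands in for "Approximate PCA".
    "Approximate PCA": PCA(n_components=2, svd_solver="randomized", random_state=0),
    "Kernel PCA": KernelPCA(n_components=2, kernel="rbf"),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(n_components=2),
}

shapes = {}
for name, reducer in reducers.items():
    # LDA is supervised and uses y; the PCA variants ignore it.
    Z = reducer.fit_transform(X, y)
    shapes[name] = Z.shape
    print(name, Z.shape)
```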
Basic Idea
The general idea behind all these algorithms:
1) Project the data onto a set of orthogonal components.
2) Most of the algorithms use eigenvectors to find these components.
3) Find the minimum number of components that represents the whole data set.
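
The three points above can be sketched with plain NumPy: compute the eigenvectors of the covariance matrix, check that they are orthogonal, and keep the fewest components that explain most of the variance. The synthetic data and the 99% variance threshold are assumptions chosen for illustration, not part of the original slides.

```python
import numpy as np

# Illustrative data: the third feature is almost the sum of the first two,
# so two components should be enough to represent the data.
rng = np.random.default_rng(0)
A = rng.normal(size=(300, 2))
noise = 0.01 * rng.normal(size=300)
X = np.column_stack([A[:, 0], A[:, 1], A[:, 0] + A[:, 1] + noise])

# Center the data and compute the covariance matrix of the features.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigenvectors of a symmetric matrix are orthogonal; sort by variance.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Smallest number of components whose cumulative variance reaches 99%.
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.99) + 1)

# Project the centered data onto the first k orthogonal components.
Z = Xc @ eigvecs[:, :k]
print("components kept:", k, "projected shape:", Z.shape)
```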
PCA on IRIS data
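
A minimal sketch of PCA on the Iris data with scikit-learn follows. Standardizing the features first and keeping two components are assumptions made for this example; the slides do not state which settings were used.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target  # 150 samples, 4 features, 3 species

# Assumption: standardize each feature before PCA.
X_std = StandardScaler().fit_transform(X)

# Reduce the four measurements to two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print("projected shape:", X_2d.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)
```

The two columns of `X_2d` can then be drawn as a scatter plot colored by `y` to see how well the species separate.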
Cross Validation and Metrics
Once the models are trained, they need to be evaluated on test data. The cross-validation techniques available in scikit-learn include:
1) K-fold
2) Stratified k-fold
3) Leave one out
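
The sketch below runs the three techniques with accuracy as the metric. It assumes a recent scikit-learn release, where the splitters live in `sklearn.model_selection`. The Iris data, the logistic regression model, and the choice of five folds are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    StratifiedKFold,
    cross_val_score,
)

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

splitters = {
    "K-fold": KFold(n_splits=5, shuffle=True, random_state=0),
    # Stratified folds keep the class proportions of y in every fold.
    "Stratified k-fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    # Leave one out holds out a single sample per split.
    "Leave one out": LeaveOneOut(),
}

n_splits = {}
mean_accuracy = {}
for name, cv in splitters.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    n_splits[name] = len(scores)
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {len(scores)} splits, mean accuracy {scores.mean():.3f}")
```

Leave one out fits the model once per sample, so it becomes expensive on large data sets; k-fold is the usual compromise.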
Preprocessing
Some algorithms require the data to be preprocessed so that their estimation methods perform better. Ex: PCA works on data that is mean centered and normalized. Some useful functions are:
1) scale
2) MinMaxScaler
3) normalize
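
A short sketch of what each of the three does, using a tiny hand-written array as an illustrative assumption:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize, scale

# Two features on very different scales (illustrative values).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# scale: each column gets zero mean and unit variance.
X_scaled = scale(X)

# MinMaxScaler: each column is mapped linearly onto [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# normalize: each row (sample) is rescaled to unit L2 norm.
X_unit = normalize(X, norm="l2")

print(X_scaled)
print(X_minmax)
print(X_unit)
```

Note that `scale` and `MinMaxScaler` work per feature (column), while `normalize` works per sample (row).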
Binarization, Encoding and Imputation of Missing Values
Binarization converts a quantitative value to a binary value. Encoding converts categorical features to integers. Imputation fills in missing data with the mean, the median or the most frequent value.
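
The three operations can be sketched as follows. The slides do not name specific classes, so using `Binarizer`, `OrdinalEncoder`, and `SimpleImputer` is an assumption based on recent scikit-learn releases, and the small arrays are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import Binarizer, OrdinalEncoder

# Binarization: values above the threshold become 1, the rest become 0.
scores = np.array([[0.2], [0.7], [0.5], [0.9]])
binary = Binarizer(threshold=0.5).fit_transform(scores)

# Encoding: each category is replaced by an integer code
# (categories are sorted, so blue=0, green=1, red=2).
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
codes = OrdinalEncoder().fit_transform(colors)

# Imputation: missing entries (NaN) are filled with the column mean.
# strategy can also be "median" or "most_frequent".
data = np.array([[1.0, 2.0],
                 [np.nan, 4.0],
                 [3.0, np.nan]])
imputed = SimpleImputer(strategy="mean").fit_transform(data)

print(binary.ravel())
print(codes.ravel())
print(imputed)
```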