1 Introduction to Predictive Learning Electrical and Computer Engineering LECTURE SET 5 Statistical Methods.

1 1 Introduction to Predictive Learning Electrical and Computer Engineering LECTURE SET 5 Statistical Methods

2 2 OUTLINE Objectives - introduce statistical terminology/methodology/motivation - taxonomy of methods - describe several representative statistical methods - interpretation of statistical methods under predictive learning Statistical Methodology and Basic Methods Taxonomy of Nonlinear Methods Decision Trees Additive Modeling and Projection Pursuit Greedy Feature Selection Summary and Discussion

3 3 Methodology and Motivation Original motivation: - understand how the inputs affect the output → simple model involving a few variables Regression modeling: Response = model + error, y = f(x) + noise, where f(x) = E(y|x) Linear regression: f(x) = wx + b Model parameters are estimated via least squares

4 4 OLS Linear Regression OLS solution: - first, center x and y-values - then calculate the slope and bias Example: SBP vs. Age The meaning of bias term?
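
The two OLS steps on this slide (center the data, then compute slope and bias) can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the function name `ols_fit` and the (age, SBP) values are made up for the example and are not the data shown in the lecture.

```python
def ols_fit(xs, ys):
    """Fit y = w*x + b by ordinary least squares, centering x and y first."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # After centering, the slope is the ratio of the cross-product sum
    # to the sum of squared x deviations.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / sxx
    # The bias restores the offset removed by centering.
    b = y_mean - w * x_mean
    return w, b


# Hypothetical (age, SBP) pairs, invented for illustration only.
ages = [20.0, 30.0, 40.0, 50.0, 60.0]
sbp = [110.0, 118.0, 126.0, 134.0, 142.0]
w, b = ols_fit(ages, sbp)
print(w, b)
```

With this reading, the bias b is the model output at x = 0, which is one way to approach the question about the meaning of the bias term.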

5 5 Statistical Assumptions Gaussian noise: zero mean, const variance Known (linear) dependency i.i.d. data samples (ensured by the protocol for data collection) – may not hold for observational data Do these assumptions hold for:

6 6 Multivariate Linear Regression Parameterization Matrix form (for centered variables) ERM solution: Analytic solution (when d < n):

7 7 Linear Ridge Regression When d > n, penalize large values of w Regularization parameter estimated via resampling Example: - 10 training samples uniformly sampled in the [0,1] range - additive Gaussian noise with st. deviation 0.5 Apply standard linear least squares: Apply ridge regression using optimal lambda:
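
For reference, the standard closed-form solutions for least squares and ridge regression are written out below. The notation is an assumption, not taken from the slides: X is the n x d matrix of centered inputs, y the vector of centered outputs, and lambda >= 0 the regularization parameter.

```latex
\hat{w}_{\mathrm{OLS}} = \arg\min_{w} \lVert y - Xw \rVert^{2}
                       = (X^{\top} X)^{-1} X^{\top} y
\qquad (\text{requires } X^{\top}X \text{ invertible, e.g. } d < n)

\hat{w}_{\mathrm{ridge}} = \arg\min_{w} \left( \lVert y - Xw \rVert^{2} + \lambda \lVert w \rVert^{2} \right)
                         = (X^{\top} X + \lambda I)^{-1} X^{\top} y
```

For lambda > 0 the matrix being inverted is always nonsingular, which is why the ridge penalty remains usable when d > n.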

8 8 Example cont’d Target function Coefficient shrinkage : how w’s depend on lambda? Can it be used for feature selection?

9 9 Statistical Methodology for classification For classification: output y ~ (binary) class label (0 or 1) Probabilistic modeling starts with known distributions Bayes-optimal decision rule for known distributions: Statistical approach ~ form of ERM (?) - parametric form of class distributions is known/assumed → analytic form of D(x) is known, and its parameters are estimated from available training data Issues: loss function (used for statistical modeling)?

10 10 Gaussian class distributions

11 11 Logistic Regression Terminology: may be confusing (for non-statisticians) For Gaussian class distributions (with equal covariances), the log-odds ln[P(y=1|x)/P(y=0|x)] is a linear function in x Logistic regression estimates the probabilistic model: ln[P(y=1|x)/P(y=0|x)] = wx + b Equivalently, logistic regression estimates P(y=1|x) = s(wx + b), where the sigmoid function is s(t) = 1/(1 + exp(-t))

12 12 Logistic Regression Example: interpretation of logistic regression model for the probability of death from a heart disease during 10-year period, for middle-aged patients, as a function of - Age (years, less 50) ~ x1 - Gender male/female (0/1) ~ x2 - cholesterol level, in mmol/L (less 5) ~ x3 where The probability of binary outcome ~ the risk (of death)* Logistic Regression Model interpretation: - increasing Age is associated with increased risk of death - females have lower risk of death (than males) - increasing Cholesterol level → increased risk of death

13 13 Estimating Logistic Regression Given: training data How to estimate model parameters (w,b)? Maximum Likelihood ~ minimize negative log-likelihood: where → non-linear optimization Solution w*, b* → estimated model: - which can be used for both prediction and interpretation (for prediction, the model should be combined with costs)
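
Maximum-likelihood estimation of (w, b) can be sketched as plain gradient descent on the negative log-likelihood. This is only an illustration under assumptions: a single input variable, a fixed step size, and a tiny made-up training set; the names `sigmoid`, `nll` and `fit_logistic` are hypothetical, and practical software typically uses Newton-type optimizers instead.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def nll(w, b, xs, ys):
    """Negative log-likelihood of labels y in {0, 1} under P(y=1|x) = sigmoid(w*x + b)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Minimize the averaged negative log-likelihood by gradient descent."""
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical, deliberately overlapping classes so the ML estimate is finite.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w_star, b_star = fit_logistic(xs, ys)
print(w_star, b_star, nll(w_star, b_star, xs, ys))
```

The fitted model sigmoid(w_star*x + b_star) gives an estimated probability; turning it into a decision requires a threshold that reflects misclassification costs, as the slide notes.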

14 14 Statistical Modeling Strategy Data-analytic models are used for: understanding the importance of inputs in explaining the output ERM approach: - a statistician selects (manually) a few ‘good’ variables and several models are estimated - the final model selected manually ~ heuristic implementation of Occam’s razor Linear regression and logistic regression - both estimate E(y|x), since for classification with 0/1 labels E(y|x) = P(y=1|x)

15 15 Classification via multiple-response regression How to use nonlinear regression s/w for classification? - classification methods estimate model parameters via minimization of squared-error → can use regression s/w with minor modifications: (1) for J class labels, use 1-of-J encoding, e.g. for J=4 classes: 1000, 0100, 0010, 0001 (4 outputs in regression). (2) estimate 4 regression models from the training data (usually all regression models use the same parameterization)

16 16 Classification via Regression Training ~ regression estimation using 1-of-J encoding Prediction (classification) ~ based on the max response value of estimated outputs
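
The encoding and prediction steps can be sketched as follows. The helper names `one_hot` and `classify` and the response values are hypothetical; the regression models themselves are omitted, so the four responses are simply assumed to be the outputs of already-trained regressors.

```python
def one_hot(label, num_classes):
    """1-of-J encoding: a vector with a single 1 at the position of the class label."""
    return [1 if j == label else 0 for j in range(num_classes)]


def classify(responses):
    """Predict the class whose regression output is largest."""
    return max(range(len(responses)), key=lambda j: responses[j])


# Training targets for J = 4 classes.
codes = [one_hot(j, 4) for j in range(4)]
print(codes)

# Hypothetical outputs of the 4 estimated regression models for one test input.
predicted = classify([0.1, 0.7, 0.3, -0.2])
print(predicted)
```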

17 17 OUTLINE Objectives Statistical Methodology and Basic Methods Taxonomy of Nonlinear Methods - model parameterization (representation) - nonlinear optimization strategies Decision Trees Additive Modeling and Projection Pursuit Greedy Feature Selection Summary and Discussion

18 18 Taxonomy of Nonlinear Methods Main idea: improve flexibility of classical linear methods ~ use flexible (nonlinear) parameterization Dictionary parameterization ~ SRM structure Two interrelated issues: - parameterization (of nonlinear basis functions) - optimization method used These two factors define methods taxonomy

19 19 Taxonomy of nonlinear methods Decision tree methods: - piecewise-constant model - greedy optimization Additive methods: - backfitting method for model estimation Gradient-descent methods: - popular in neural network learning Penalization methods Note: all methods implement SRM structures

20 20 Dictionary representation Two possibilities Linear (non-adaptive) methods ~ predetermined (fixed) basis functions → only parameters have to be estimated via standard optimization methods (linear least squares) Examples: linear regression, polynomial regression, linear classifiers, quadratic classifiers Nonlinear (adaptive) methods ~ basis functions depend on the training data Possibilities: nonlinear b.f. (in parameters), feature selection (e.g. wavelet denoising)

21 21 Example of Nonlinear Parameterization Basis functions of the form i.e. sigmoid aka logistic function - commonly used in artificial neural networks - combination of sigmoids ~ universal approximator

22 22 Example of Nonlinear Parameterization Basis functions of the form i.e. Radial Basis Function(RBF) - RBF adaptive parameters: center, width - commonly used in artificial neural networks - combination of RBF’s ~ universal approximator

23 23 Neural Network Representation MLP or RBF networks - dimensionality reduction - universal approximation property – see example at http://www.mathworks.com/products/demos/nnettlbx/radial/index.html

24 24 Example of Nonlinear Parameterization Adaptive Partitioning (CART) each b.f. is a rectangular region in x-space Each b.f. depends on 2d parameters Since the regions are disjoint, parameters w can be easily estimated (for regression) as Estimating b.f.’s ~ adaptive partitioning

25 25 Example of CART Partitioning CART Partitioning in 2D space - each region ~ basis function - piecewise-constant estimate of y (in each region) - number of regions ~ model complexity

26 26 OUTLINE Objectives Statistical Methodology and Basic Methods Taxonomy of Nonlinear Methods Decision Trees - Regression trees (CART) - Boston Housing example - Classification trees (CART) Additive Modeling and Projection Pursuit Greedy Feature Selection Summary and Discussion

27 27 Greedy Optimization Strategy Minimization of empirical risk for regression problems, where the model is a sum of basis functions. Greedy optimization strategy: basis functions are estimated sequentially, one at a time, i.e., the training data is represented as structure (model fit) + noise (residual): (1) DATA = (model) FIT 1 + RESIDUAL 1 (2) RESIDUAL 1 = FIT 2 + RESIDUAL 2 and so on. The final model for the data will be MODEL = FIT 1 + FIT 2 + ... Advantages: computational speed, interpretability

28 28 Regression Trees (CART) Minimization of empirical risk (squared error) via partitioning of the input space into regions where Example of CART partitioning for a function of 2 inputs

29 29 Growing CART tree Recursive partitioning for estimating regions (via binary splitting) Initial Model ~ Region (the whole input domain) is divided into two regions. A split is defined by one of the inputs (k) and a split point s. Optimal values of (k, s) are chosen so that splitting a region into two daughter regions minimizes empirical risk Issues: - efficient implementation (selection of opt. split point) - optimal tree size ~ model selection (complexity control) Advantages and limitations
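
A single step of this recursive partitioning, the search for the best (k, s), can be sketched as an exhaustive scan. The sketch assumes squared-error risk, the convention that a sample goes left when x_k <= s, and candidate split points taken at observed coordinate values; `sse`, `best_split` and the data are hypothetical names and values.

```python
def sse(values):
    """Sum of squared errors around the mean (the piecewise-constant fit in a region)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Try every input k and every observed value s; return (k, s, risk) with minimal risk."""
    best = None
    for k in range(len(X[0])):
        for s in sorted(set(row[k] for row in X)):
            left = [yi for row, yi in zip(X, y) if row[k] <= s]
            right = [yi for row, yi in zip(X, y) if row[k] > s]
            if not left or not right:
                continue  # a split must produce two non-empty daughter regions
            risk = sse(left) + sse(right)
            if best is None or risk < best[2]:
                best = (k, s, risk)
    return best


# Hypothetical data: the output is a step function of the first input only.
X = [[0.1, 0.9], [0.2, 0.1], [0.3, 0.8], [0.7, 0.2], [0.8, 0.7], [0.9, 0.3]]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
best = best_split(X, y)
print(best)
```

Growing a full tree would apply `best_split` recursively to each daughter region until a stopping rule such as a minimum sample count is reached.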

30 30 Valid Split Points for CART How to choose valid points (for binary splitting)? valid points ~ combinations of the coordinate values of training samples, e.g. for 4 bivariate samples → 16 points used as candidates for splitting:
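
The count of 16 follows from pairing every observed x1 value with every observed x2 value. A short sketch, with four made-up samples that have distinct coordinates:

```python
from itertools import product

# Four hypothetical bivariate training samples (x1, x2).
samples = [(0.1, 0.5), (0.4, 0.2), (0.6, 0.9), (0.8, 0.3)]

x1_values = sorted({s[0] for s in samples})
x2_values = sorted({s[1] for s in samples})

# Every combination of an observed x1 value with an observed x2 value.
candidates = list(product(x1_values, x2_values))
print(len(candidates))
```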

31 31 CART Modeling Strategy Growing CART tree ~ reducing MSE (for regression) Splitting the parent region is allowed only if # of samples exceeds a certain threshold (~ Splitmin, user-defined). Tree pruning ~ reducing tree size by selectively combining adjacent leaf nodes (regions). This pruning implements minimization of the penalized MSE: R_pen = R_emp + lambda*|T|, where R_emp ~ MSE, |T| ~ number of leaf nodes (regions) and parameter lambda is estimated via resampling

32 32 Example: Boston Housing data set Objective: to predict the value of homes in Boston area Data set ~ 506 samples total Output: value of owner-occupied homes (in $1,000’s) Inputs: 13 variables
1. CRIM per capita crime rate by town
2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS proportion of non-retail business acres per town
4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX nitric oxides concentration (parts per 10 million)
6. RM average number of rooms per dwelling
7. AGE proportion of owner-occupied units built prior to 1940
8. DIS weighted distances to five Boston employment centres
9. RAD index of accessibility to radial highways
10. TAX full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT % lower status of the population

33 33 Example CART trees for Boston Housing 1. Training set: 450 samples Splitmin = 100 (user-defined)

34 34 Example CART trees for Boston Housing 2. Training set: 450 samples Splitmin = 50 (user-defined)

35 35 Example CART trees for Boston Housing 3. Training set: 455 samples Splitmin = 100 (user-defined) Note: CART model is sensitive to training samples (vs model 1)

36 36 Classification Trees (CART) Binary classification example(2D input space) Algorithm similar to regression trees (tree growth via binary splitting + model selection), BUT using different empirical loss function

37 37 Loss functions for Classification Trees Misclassification loss: poor practical choice Other loss (cost) functions for splitting nodes: For J-class problem, a cost function is a measure of node impurity where p(i|t) denotes the probability of class i samples at node t. Possible cost functions: misclassification, Gini function, entropy function

38 38 Classification Trees: node splitting Minimizing cost function = maximizing the decrease in node impurity. Assume node t is split into two regions (Left & Right) on variable k at a split point s. Then the decrease in impurity caused by this split is: where Misclassification cost ~ discontinuous (due to max) - may give sub-optimal solutions (poor local min) - does not work well with greedy optimization
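
For reference, the three impurity measures and the decrease in impurity are commonly written as below. The notation is an assumption rather than the slides' own: Q(t) is the impurity of node t, p(i|t) the class proportions at node t, and p_L, p_R the fractions of the samples of node t sent to the left and right daughter nodes.

```latex
Q_{\mathrm{misclass}}(t) = 1 - \max_{i} \, p(i \mid t)

Q_{\mathrm{Gini}}(t) = 1 - \sum_{i=1}^{J} p(i \mid t)^{2}

Q_{\mathrm{entropy}}(t) = - \sum_{i=1}^{J} p(i \mid t) \, \ln p(i \mid t)

\Delta Q(s, k) = Q(t) - p_{L} \, Q(t_{L}) - p_{R} \, Q(t_{R})
```

The max in the misclassification cost is what makes it piecewise linear in the class proportions, while the Gini and entropy functions are strictly concave and therefore reward splits that produce purer daughter nodes.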

39 39 Using different cost fcts for node splitting (a) Decrease in impurity: misclassification = 0.25 gini = 0.13 entropy = 0.13 (b) Decrease in impurity: misclassification = 0.25 gini = 0.17 entropy = 0.22 Split (b) is better as it leads to a smaller final tree

40 40 Details of calculating decrease in impurity Consider split (a) Misclassification Cost Gini Cost
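
The calculation can be sketched in Python. The class counts below are an assumption: the figure with the actual splits is not part of this transcript, and a 400/400 parent node split into (300, 100) + (100, 300) for (a) and (200, 400) + (200, 0) for (b) was chosen because it reproduces the decreases quoted on the previous slide up to rounding. Natural logarithms are assumed for the entropy.

```python
import math


def misclassification(p):
    return 1.0 - max(p)


def gini(p):
    return 1.0 - sum(q * q for q in p)


def entropy(p):
    # Natural log assumed; terms with zero probability contribute nothing.
    return -sum(q * math.log(q) for q in p if q > 0)


def proportions(counts):
    total = sum(counts)
    return [c / total for c in counts]


def decrease(cost, parent, left, right):
    """Impurity of the parent minus the sample-weighted impurity of the daughters."""
    n = sum(parent)
    return (cost(proportions(parent))
            - sum(left) / n * cost(proportions(left))
            - sum(right) / n * cost(proportions(right)))


# Assumed class counts (class 1, class 2); see the note above.
parent = (400, 400)
splits = {"a": ((300, 100), (100, 300)), "b": ((200, 400), (200, 0))}

results = {}
for name, (left, right) in splits.items():
    for cost in (misclassification, gini, entropy):
        results[(name, cost.__name__)] = decrease(cost, parent, left, right)
        print(name, cost.__name__, results[(name, cost.__name__)])
```

Under these assumed counts the misclassification cost cannot distinguish the two splits, while Gini and entropy both prefer split (b), which produces one pure node.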

41 41 IRIS Data Set: A data set with 150 random samples of flowers from the iris species setosa, versicolor, and virginica (3 classes). From each species there are 50 observations for sepal length, sepal width, petal length, and petal width in cm. This dataset is from classical statistics. MATLAB code (splitmin = 10):
    load fisheriris;
    t = treefit(meas, species);
    treedisp(t,'names',{'SL' 'SW' 'PL' 'PW'});

42 42 Sensitivity to random training data: Consider IRIS data set where every other sample is used (total 75 samples, 25 per class). Then the CART tree formed using the same Matlab software (splitmin = 10, Gini loss fct) is

43 43 Decision Trees: summary Advantages - speed - interpretability - different types of input variables Limitations: sensitivity to - correlated inputs - affine transformations (of input variables) - general instability of trees Variations: ID3 (in machine learning), linear CART

44 44 OUTLINE Objectives Statistical Methodology and Basic Methods Taxonomy of Nonlinear Methods Decision Trees Additive Modeling and Projection Pursuit Greedy Feature Selection Summary and discussion

45 45 Additive Modeling Additive model parameterization for regression, where each component is an unknown (smooth) function. Each univariate component is estimated separately Additive model for classification Backfitting is a greedy optimization approach for estimating basis functions sequentially

46 46 By fixing all basis functions but one, the empirical risk (MSE) can be decomposed as → Each basis function is estimated via an iterative backfitting algorithm (until some stopping criterion is met) Note: the partial residual can be interpreted as the response variable for the adaptive method

47 47 Backfitting Algorithm: Example Consider regression estimation of a function of two variables of the form y = g1(x1) + g2(x2) + noise from training data. Backfitting method: (1) estimate g1 for fixed g2 (2) estimate g2 for fixed g1; iterate the above two steps. Estimation via minimization of empirical risk

48 48 Backfitting Algorithm (cont’d) Estimation of g1 via minimization of MSE. This is a univariate regression problem of estimating g1 from n data points (x1, r), where r = y - g2(x2) is the partial residual. It can be estimated by smoothing (kNN regression). Estimation of g2 (second iteration) proceeds in a similar manner, via minimization of MSE on the partial residuals r = y - g1(x1)
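
The two alternating steps can be sketched with a kNN smoother. Everything specific here is an assumption made for illustration: the target function x1^2 + sin(2*pi*x2), the noise level, k = 7, the fixed number of iterations, and the names `knn_smooth` and `backfit`.

```python
import math
import random


def knn_smooth(x, r, k=7):
    """kNN regression: the estimate at each x[i] is the mean of r over the k nearest x-values."""
    fitted = []
    for xi in x:
        nearest = sorted(range(len(x)), key=lambda j: abs(x[j] - xi))[:k]
        fitted.append(sum(r[j] for j in nearest) / k)
    return fitted


def backfit(x1, x2, y, iterations=10, k=7):
    """Alternate univariate smoothing of partial residuals for a two-term additive model."""
    n = len(y)
    y_mean = sum(y) / n
    g1 = [0.0] * n
    g2 = [0.0] * n
    for _ in range(iterations):
        # (1) estimate g1 for fixed g2: the partial residual is the response.
        g1 = knn_smooth(x1, [y[i] - y_mean - g2[i] for i in range(n)], k)
        # (2) estimate g2 for fixed g1.
        g2 = knn_smooth(x2, [y[i] - y_mean - g1[i] for i in range(n)], k)
    return y_mean, g1, g2


# Hypothetical additive target with Gaussian noise, inputs uniform in [0, 1].
rng = random.Random(0)
n = 100
x1 = [rng.random() for _ in range(n)]
x2 = [rng.random() for _ in range(n)]
y = [a ** 2 + math.sin(2 * math.pi * b) + rng.gauss(0, 0.1) for a, b in zip(x1, x2)]

b0, g1, g2 = backfit(x1, x2, y)
mse_before = sum((yi - b0) ** 2 for yi in y) / n
mse_after = sum((y[i] - b0 - g1[i] - g2[i]) ** 2 for i in range(n)) / n
print(mse_before, mse_after)
```

The training MSE reported here is an empirical risk only; choosing k would still require resampling, as with the other complexity parameters in this lecture.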

49 49 Projection Pursuit regression Projection Pursuit is an additive model where basis functions are univariate functions (of projections) Features specify the projection of x onto w A sum of nonlinear functions can approximate any nonlinear function. See example below.

50 50

51 51 Projection Pursuit regression Projection Pursuit is an additive model where basis functions are univariate functions (of projections) Backfitting algorithm is used to estimate iteratively (a) basis functions (parameters) via scatterplot smoothing (b) projection parameters (via gradient descent)

52 52 EXAMPLE: estimation of a two-dimensional fct via projection pursuit (a) Projections are found that minimize unexplained variance. Smoothing is performed to create adaptive basis functions. (b) The final model is a sum of two univariate adaptive basis functions.

53 53 OUTLINE Objectives Statistical Methodology and Basic Methods Taxonomy of Nonlinear Methods Decision Trees Additive Modeling and Projection Pursuit Greedy Feature Selection Summary and Discussion

54 54 Greedy feature selection Recall feature selection structure in SRM: - difficult (nonlinear) optimization problem - simple with orthogonal basis functions - why not use orthogonal b.f.’s for all apps? Consider sparse polynomial estimation (aka best subset regression) as an example of feature selection, i.e. features ~ Compare two approaches: - exhaustive search through all subsets - forward stepwise selection (in statistics)

55 55 Data set used for comparisons 30 noisy training samples generated from where and inputs are uniform in [0,1]

56 56 Feature selection via exhaustive search Exhaustive search for best subset selection - estimate prediction risk (MSE) via leave-one-out cross validation - minimize empirical risk via least squares for all possible subsets of m variables (features) - select the best subset (~ min pred. risk) Based on min prediction risk (via x-validation) the following model was selected Final model estimated via linear regression using features with all data:

57 57 Forward subset selection (greedy method) Forward subset selection - first estimate the model using one feature - then add the second feature if it results in a sufficiently large decrease in RSS, otherwise stop - etc. (sequentially adding one more feature) Step 1: select the first feature (m=1) from a set of candidate models, with RSS = 0.249, 0.270, 0.274, 0.271, so the selected model is the one with RSS(1)=0.249 Step 2: select the second feature (m=2) from a set of candidate models, with RSS = 0.0615, 0.05424, 0.05422 → selected model with RSS(2)=0.05422

58 58 Forward subset selection (greedy method) Step 2 (cont’d) - check whether including the second feature in the model is justified using some statistical criterion, usually the F test: so the (m+1)-st feature is included only if F>90 For adding the second feature: so we keep it in the model Step 3: select the third feature from a set of candidate models: with RSS=0.05362, RSS=0.05363 Test whether adding the third feature is justified via F test: → not justified, so the final model
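
The two F-test decisions can be reproduced from the RSS values quoted on these slides and n = 30 training samples. The exact form of the statistic is an assumption, since the formula is not preserved in this transcript: the sketch uses the common partial F statistic with n - m - 2 residual degrees of freedom (m + 1 feature weights plus a bias term). The threshold of 90 is taken from the slide as stated.

```python
def f_statistic(rss_m, rss_m1, n, m):
    """Partial F statistic for going from m to m+1 features (assumed form, see note above)."""
    return (rss_m - rss_m1) / (rss_m1 / (n - m - 2))


n = 30          # number of training samples in the comparison data set
threshold = 90  # inclusion threshold as stated on the slide

f_second = f_statistic(0.249, 0.05422, n, m=1)
f_third = f_statistic(0.05422, 0.05362, n, m=2)

keep_second = f_second > threshold
keep_third = f_third > threshold
print(f_second, keep_second)
print(f_third, keep_third)
```

Under this assumed form, the large drop in RSS from one to two features passes the threshold, while the negligible drop from two to three features does not, which matches the decisions on the slide.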

59 59 Feature selection in signal processing Regression formulation ~ real-valued function estimation Signal representation: linear combination of orthogonal basis functions (harmonic, wavelets) Differences (from standard formulation) - fixed sampling rate - training data X-values = test data X-values → Computationally efficient orthogonal estimators: Discrete Fourier/Wavelet Transform (DFT / DWT) → Important features ~ large coefficients, easy to select
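
Selecting features by coefficient size can be sketched with a naive discrete Fourier transform. The signal, the noise level and the choice of keeping m = 2 coefficients are assumptions for illustration, and the O(n^2) transform is written out only to keep the sketch self-contained; a real application would use an FFT or DWT routine.

```python
import cmath
import math
import random


def dft(signal):
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(coeffs):
    n = len(coeffs)
    return [(sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def keep_largest(coeffs, m):
    """Feature selection: keep the m largest-magnitude coefficients, zero the rest."""
    order = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    kept = set(order[:m])
    return [c if k in kept else 0 for k, c in enumerate(coeffs)]


def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


# Hypothetical signal: one harmonic sampled at a fixed rate, plus Gaussian noise.
n = 32
clean = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
rng = random.Random(1)
noisy = [c + rng.gauss(0, 0.2) for c in clean]

# A real sinusoid occupies a conjugate pair of DFT coefficients, hence m = 2.
denoised = idft(keep_largest(dft(noisy), 2))
err_noisy = mse(noisy, clean)
err_denoised = mse(denoised, clean)
print(err_noisy, err_denoised)
```

Because the basis is orthogonal, each coefficient is estimated independently of the others, so ranking by magnitude replaces the combinatorial subset search needed for non-orthogonal features.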

60 60 OUTLINE Objectives Statistical Methodology and Basic Methods Taxonomy of Nonlinear Methods Decision Trees Additive Modeling and Projection Pursuit Greedy Feature Selection Summary and Discussion

61 61 Summary and Discussion Evolution of statistical methods - parametric → flexible (adaptive) - fast optimization (favor greedy methods – why?) - interpretable - model complexity ~ number of parameters (basis functions, regions, features …) - batch mode (for training) Probabilistic framework - classical methods assume probabilistic models of observed data - adaptive statistical methods lack probabilistic derivation, but use clever heuristics for controlling model complexity

