Slide 1: Introduction to Predictive Learning
Electrical and Computer Engineering
Lecture Set 5: Statistical Methods
Slide 2: OUTLINE
Objectives
- introduce statistical terminology, methodology, and motivation
- taxonomy of methods
- describe several representative statistical methods
- interpretation of statistical methods under predictive learning
Statistical Methodology and Basic Methods
Taxonomy of Nonlinear Methods
Decision Trees
Additive Modeling and Projection Pursuit
Greedy Feature Selection
Summary and Discussion
Slide 3: Methodology and Motivation
Original motivation: understand how the inputs affect the output, using a simple model involving a few variables.
Regression modeling: Response = model + error, i.e. y = f(x) + noise, where f(x) = E(y|x).
Linear regression: f(x) = wx + b
Model parameters are estimated via ordinary least squares (OLS).
Slide 4: OLS Linear Regression
OLS solution:
- first, center the x- and y-values
- then calculate the slope and bias
Example: SBP vs. Age
The meaning of the bias term?
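A minimal MATLAB sketch of the OLS computation described above; the numbers are made up for illustration, not the actual SBP vs. Age data.
% OLS via centering (illustrative numbers, not the real SBP vs. Age data)
x = [35; 42; 50; 58; 65];            % hypothetical Age values
y = [118; 125; 135; 146; 152];       % hypothetical SBP values
xc = x - mean(x);  yc = y - mean(y); % center both variables
w = (xc' * yc) / (xc' * xc);         % slope estimated from centered data
b = mean(y) - w * mean(x);           % bias = fitted response at x = 0
yhat = w * x + b;                    % fitted values
After centering, the bias is simply mean(y) - w*mean(x), i.e. it places the regression line through the point (mean(x), mean(y)).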
Slide 5: Statistical Assumptions
- Gaussian noise: zero mean, constant variance
- known (linear) dependency
- i.i.d. data samples (ensured by the protocol for data collection) – may not hold for observational data
Do these assumptions hold for:
Slide 6: Multivariate Linear Regression
Parameterization: f(x, w) = w · x (linear in the inputs)
Matrix form (for centered variables): y ≈ Xw
ERM solution: minimize the empirical risk (1/n) ||y - Xw||^2
Analytic solution (when d < n): w = (X'X)^(-1) X'y
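A short MATLAB sketch of the analytic least-squares solution on synthetic data; the dimensions and true coefficients below are assumptions for illustration.
% Analytic OLS solution in matrix form (synthetic, centered data; d < n)
n = 50;  d = 3;
X = randn(n, d);                     % centered inputs
w_true = [1; -2; 0.5];               % hypothetical true coefficients
y = X * w_true + 0.3 * randn(n, 1);  % noisy responses
w_hat = (X' * X) \ (X' * y);         % w = (X'X)^(-1) X'y
% equivalently, MATLAB's backslash: w_hat = X \ y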
Slide 7: Linear Ridge Regression
When d > n, penalize large values of w: minimize (1/n) ||y - Xw||^2 + lambda ||w||^2.
The regularization parameter lambda is estimated via resampling.
Example:
- 10 training samples, uniformly sampled in the [0,1] range
- additive gaussian noise with standard deviation 0.5
Apply standard linear least squares; then apply ridge regression using the optimal lambda.
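A hedged MATLAB sketch of ridge regression for the d > n case; the dimensionality, the true coefficients, and the value of lambda are assumptions for illustration (in practice lambda is chosen by resampling).
% Ridge regression sketch: d > n, so ordinary least squares is ill-posed
n = 10;  d = 30;
X = rand(n, d);                                     % inputs uniform in [0,1]
w_true = zeros(d, 1);  w_true(1:3) = [2; -1; 0.5];  % hypothetical sparse target
y = X * w_true + 0.5 * randn(n, 1);                 % additive gaussian noise, std 0.5
lambda = 1;                                         % regularization parameter (assumed)
w_ridge = (X' * X + lambda * eye(d)) \ (X' * y);
Re-solving for a range of lambda values and plotting the coefficients shows the shrinkage behavior asked about on the next slide.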
8
8 Example cont’d Target function Coefficient shrinkage : how w’s depend on lambda? Can it be used for feature selection?
Slide 9: Statistical Methodology for Classification
For classification, the output y is a binary class label (0 or 1).
Probabilistic modeling starts with known distributions.
The Bayes-optimal decision rule D(x) is defined for known distributions.
Statistical approach ~ a form of ERM (?):
- the parametric form of the class distributions is known/assumed, so the analytic form of D(x) is known, and its parameters are estimated from the available training data
Issues: what loss function is used for statistical modeling?
Slide 10: Gaussian class distributions
Slide 11: Logistic Regression
Terminology: may be confusing (for non-statisticians).
For Gaussian class distributions (with equal covariances), the log-odds is a linear function of x.
Logistic regression estimates the probabilistic model P(y = 1 | x).
Equivalently, logistic regression estimates P(y = 1 | x) = s(w·x + b), where the sigmoid function is s(t) = 1 / (1 + exp(-t)).
Slide 12: Logistic Regression
Example: interpretation of a logistic regression model for the probability of death from heart disease during a 10-year period, for middle-aged patients, as a function of
- Age (years, minus 50) ~ x1
- Gender male/female (0/1) ~ x2
- cholesterol level, in mmol/L (minus 5) ~ x3
The probability of the binary outcome ~ the risk (of death).
Model interpretation:
- increasing Age is associated with increased risk of death
- females have lower risk of death than males
- increasing cholesterol level is associated with increased risk of death
Slide 13: Estimating Logistic Regression
Given: training data (x_i, y_i), i = 1, ..., n.
How to estimate the model parameters (w, b)?
Maximum likelihood ~ minimize the negative log-likelihood, which is a nonlinear optimization problem.
The solution (w*, b*) gives the estimated model, which can be used for both prediction and interpretation (for prediction, the model should be combined with misclassification costs).
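A minimal sketch of maximum-likelihood estimation by gradient descent on the negative log-likelihood; the toy data and learning rate are assumptions, and statistical packages typically use iteratively reweighted least squares instead.
% Logistic regression fit by gradient descent on the negative log-likelihood
X = [randn(50, 2) - 1; randn(50, 2) + 1];   % hypothetical two-class data
y = [zeros(50, 1); ones(50, 1)];            % class labels 0/1
w = zeros(2, 1);  b = 0;  eta = 0.1;        % assumed learning rate
for iter = 1:500
    p = 1 ./ (1 + exp(-(X * w + b)));       % sigmoid of the linear score
    grad_w = X' * (p - y) / numel(y);       % gradient of avg. negative log-likelihood
    grad_b = mean(p - y);
    w = w - eta * grad_w;
    b = b - eta * grad_b;
end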
Slide 14: Statistical Modeling Strategy
Data-analytic models are used for understanding the importance of the inputs in explaining the output.
ERM approach:
- a statistician manually selects a few 'good' variables and estimates several models
- the final model is selected manually ~ a heuristic implementation of Occam's razor
Linear regression and logistic regression both estimate E(y|x), since for classification E(y|x) = P(y = 1 | x).
Slide 15: Classification via Multiple-Response Regression
How to use nonlinear regression software for classification?
Classification methods estimate model parameters via minimization of squared error, so regression software can be used with minor modifications:
(1) for J class labels, use 1-of-J encoding; e.g. for J = 4 classes the labels become 1000, 0100, 0010, 0001 (4 outputs in regression)
(2) estimate 4 regression models from the training data (usually all regression models use the same parameterization)
Slide 16: Classification via Regression
Training ~ regression estimation using 1-of-J encoding.
Prediction (classification) ~ based on the maximum response value among the estimated outputs.
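A sketch of the 1-of-J scheme with a linear least-squares model for each output; the data below are hypothetical.
% Classification via multiple-response regression with 1-of-J encoding
n = 100;  d = 3;  J = 4;
Xtrain = randn(n, d);  labels = randi(J, n, 1);   % hypothetical data
Y = zeros(n, J);
Y(sub2ind(size(Y), (1:n)', labels)) = 1;          % 1-of-J indicator outputs
W = [Xtrain ones(n, 1)] \ Y;                      % one linear model per output
scores = [Xtrain ones(n, 1)] * W;                 % predicted responses (here on training data)
[~, predClass] = max(scores, [], 2);              % predicted class = max response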
Slide 17: OUTLINE
Objectives
Statistical Methodology and Basic Methods
Taxonomy of Nonlinear Methods
- model parameterization (representation)
- nonlinear optimization strategies
Decision Trees
Additive Modeling and Projection Pursuit
Greedy Feature Selection
Summary and Discussion
Slide 18: Taxonomy of Nonlinear Methods
Main idea: improve the flexibility of classical linear methods by using flexible (nonlinear) parameterization.
Dictionary parameterization ~ SRM structure.
Two interrelated issues:
- parameterization (of the nonlinear basis functions)
- the optimization method used
These two factors define the taxonomy of methods.
Slide 19: Taxonomy of Nonlinear Methods
Decision tree methods:
- piecewise-constant model
- greedy optimization
Additive methods:
- backfitting method for model estimation
Gradient-descent methods:
- popular in neural network learning
Penalization methods
Note: all of these methods implement SRM structures.
Slide 20: Dictionary Representation
Two possibilities:
Linear (non-adaptive) methods ~ predetermined (fixed) basis functions; only the linear parameters have to be estimated, via standard optimization methods (linear least squares).
Examples: linear regression, polynomial regression, linear classifiers, quadratic classifiers.
Nonlinear (adaptive) methods ~ basis functions depend on the training data.
Possibilities: basis functions nonlinear in their parameters, or feature selection (e.g. wavelet denoising).
Slide 21: Example of Nonlinear Parameterization
Basis functions of the form g(w·x + b), i.e. the sigmoid (aka logistic) function:
- commonly used in artificial neural networks
- a combination of sigmoids ~ universal approximator
Slide 22: Example of Nonlinear Parameterization
Basis functions of the form g(||x - c|| / width), i.e. a Radial Basis Function (RBF):
- RBF adaptive parameters: center and width
- commonly used in artificial neural networks
- a combination of RBFs ~ universal approximator
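A small sketch of the two adaptive basis functions discussed on these slides; the parameter values are illustrative.
% Sigmoid and RBF basis functions (illustrative parameter values)
sigmoidBF = @(x, w, b) 1 ./ (1 + exp(-(x * w + b)));             % sigmoid of a projection
rbfBF = @(x, c, width) exp(-sum((x - c).^2, 2) / (2 * width^2)); % radial basis function
x = [0.3 0.7];                       % hypothetical 2-dimensional input (row vector)
g1 = sigmoidBF(x, [2; -1], 0.5);     % adaptive parameters: projection w and offset b
g2 = rbfBF(x, [0 0], 1);             % adaptive parameters: center c and width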
Slide 23: Neural Network Representation
MLP or RBF networks:
- dimensionality reduction
- universal approximation property – see the example at http://www.mathworks.com/products/demos/nnettlbx/radial/index.html
Slide 24: Example of Nonlinear Parameterization
Adaptive partitioning (CART): each basis function is the indicator of a rectangular region in x-space.
Each basis function depends on 2d parameters.
Since the regions are disjoint, the parameters w can be easily estimated (for regression) as the average of the y-values of the training samples falling in each region.
Estimating the basis functions ~ adaptive partitioning.
Slide 25: Example of CART Partitioning
CART partitioning in a 2D input space:
- each region ~ basis function
- piecewise-constant estimate of y (in each region)
- number of regions ~ model complexity
Slide 26: OUTLINE
Objectives
Statistical Methodology and Basic Methods
Taxonomy of Nonlinear Methods
Decision Trees
- Regression trees (CART)
- Boston Housing example
- Classification trees (CART)
Additive Modeling and Projection Pursuit
Greedy Feature Selection
Summary and Discussion
Slide 27: Greedy Optimization Strategy
Minimization of the empirical risk for regression problems, where the model is a sum of basis functions.
Greedy optimization strategy: basis functions are estimated sequentially, one at a time; i.e., the training data is represented as structure (model fit) + noise (residual):
(1) DATA = (model) FIT 1 + RESIDUAL 1
(2) RESIDUAL 1 = FIT 2 + RESIDUAL 2
and so on. The final model for the data is MODEL = FIT 1 + FIT 2 + ...
Advantages: computational speed, interpretability.
Slide 28: Regression Trees (CART)
Minimization of the empirical risk (squared error) via partitioning of the input space into regions, where the model is piecewise constant: f(x) = sum_m w_m I(x in region m).
Example of CART partitioning for a function of 2 inputs (see figure).
Slide 29: Growing a CART Tree
Recursive partitioning for estimating the regions (via binary splitting):
- Initial model ~ the region is the whole input domain, which is then divided into two regions.
- A split is defined by one of the inputs (k) and a split point s.
- Optimal values of (k, s) are chosen so that splitting a region into two daughter regions minimizes the empirical risk.
Issues:
- efficient implementation (selection of the optimal split point)
- optimal tree size ~ model selection (complexity control)
Advantages and limitations
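A hedged sketch of the exhaustive search for the best split (k, s) of a single region in regression; the function and variable names are assumptions, and a real implementation would use sorting for efficiency.
% Find the best binary split of one region by minimizing the squared error
% X: n x d inputs in the region, y: n x 1 responses in the region (assumed given)
function [bestK, bestS, bestErr] = bestSplit(X, y)
    [n, d] = size(X);
    bestErr = inf;  bestK = 0;  bestS = 0;
    for k = 1:d
        for s = unique(X(:, k))'                 % candidate split points on input k
            left = X(:, k) <= s;  right = ~left;
            if ~any(left) || ~any(right), continue; end
            err = sum((y(left)  - mean(y(left))).^2) + ...
                  sum((y(right) - mean(y(right))).^2);
            if err < bestErr
                bestErr = err;  bestK = k;  bestS = s;
            end
        end
    end
end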
Slide 30: Valid Split Points for CART
How to choose valid points (for binary splitting)?
Valid points ~ combinations of the coordinate values of the training samples; e.g., for 4 bivariate samples, 16 points are used as candidates for splitting.
Slide 31: CART Modeling Strategy
Growing the CART tree ~ reducing MSE (for regression). Splitting a parent region is allowed only if its number of samples exceeds a certain threshold (~ Splitmin, user-defined).
Tree pruning ~ reducing tree size by selectively combining adjacent leaf nodes (regions). Pruning implements minimization of the penalized MSE:
penalized risk = MSE(tree) + alpha * (number of leaf nodes/regions),
where the parameter alpha is estimated via resampling.
Slide 32: Example – Boston Housing Data Set
Objective: to predict the value of homes in the Boston area.
Data set ~ 506 samples total.
Output: value of owner-occupied homes (in $1,000's).
Inputs: 13 variables
1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
13. LSTAT: % lower status of the population
Slide 33: Example CART Tree for Boston Housing (1)
Training set: 450 samples, Splitmin = 100 (user-defined)
Slide 34: Example CART Tree for Boston Housing (2)
Training set: 450 samples, Splitmin = 50 (user-defined)
Slide 35: Example CART Tree for Boston Housing (3)
Training set: 455 samples, Splitmin = 100 (user-defined)
Note: the CART model is sensitive to the training samples (compare with model 1).
Slide 36: Classification Trees (CART)
Binary classification example (2D input space).
The algorithm is similar to regression trees (tree growth via binary splitting + model selection), BUT uses a different empirical loss function.
Slide 37: Loss Functions for Classification Trees
Misclassification loss: a poor practical choice.
Other loss (cost) functions for splitting nodes: for a J-class problem, a cost function is a measure of node impurity, where p(i|t) denotes the probability of class-i samples at node t.
Possible cost functions:
- Misclassification: 1 - max_i p(i|t)
- Gini function: 1 - sum_i p(i|t)^2
- Entropy function: -sum_i p(i|t) log p(i|t)
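A minimal MATLAB sketch of the three impurity measures for a node with assumed class probabilities p.
% Node impurity measures for a vector p of class probabilities at node t
p = [0.75 0.25];                                % hypothetical 2-class node
misclass = 1 - max(p);                          % misclassification rate
gini     = 1 - sum(p.^2);                       % Gini index
entr     = -sum(p(p > 0) .* log2(p(p > 0)));    % entropy (in bits)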
Slide 38: Classification Trees – Node Splitting
Minimizing the cost function = maximizing the decrease in node impurity.
Assume node t is split into two regions (Left and Right) on variable k at a split point s. Then the decrease in impurity caused by this split is
deltaI = Q(t) - pL*Q(tL) - pR*Q(tR),
where Q is the impurity measure and pL, pR are the fractions of node-t samples sent to the left and right child nodes.
Misclassification cost ~ discontinuous (due to the max):
- may give sub-optimal solutions (poor local minima)
- does not work well with greedy optimization
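A worked numeric sketch of the decrease-in-impurity formula, using a hypothetical parent node and split (not the splits (a)/(b) on the next slide).
% Decrease in impurity for a hypothetical split, using the Gini measure
giniFn   = @(p) 1 - sum(p.^2);
p_parent = [0.5 0.5];                 % class probabilities at the parent node
p_left   = [0.75 0.25];               % class probabilities in the left child
p_right  = [0.25 0.75];               % class probabilities in the right child
fracL = 0.5;  fracR = 0.5;            % fractions of parent samples sent left/right
deltaI = giniFn(p_parent) - fracL*giniFn(p_left) - fracR*giniFn(p_right)
% deltaI = 0.5 - 0.5*0.375 - 0.5*0.375 = 0.125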
Slide 39: Using Different Cost Functions for Node Splitting
(a) Decrease in impurity: misclassification = 0.25, gini = 0.13, entropy = 0.13
(b) Decrease in impurity: misclassification = 0.25, gini = 0.17, entropy = 0.22
Split (b) is better, as it leads to a smaller final tree.
Slide 40: Details of Calculating the Decrease in Impurity
Consider split (a): the misclassification cost and the Gini cost (calculations shown on the slide).
Slide 41: IRIS Data Set
A data set with 150 random samples of flowers from the iris species setosa, versicolor, and virginica (3 classes). From each species there are 50 observations for sepal length, sepal width, petal length, and petal width, in cm. This data set comes from classical statistics.
MATLAB code (splitmin = 10):
load fisheriris;
t = treefit(meas, species);
treedisp(t, 'names', {'SL' 'SW' 'PL' 'PW'});
Slide 42: Sensitivity to Random Training Data
Consider the IRIS data set where every other sample is used (75 samples total, 25 per class). The CART tree formed using the same MATLAB software (splitmin = 10, Gini loss function) is shown on the slide.
Slide 43: Decision Trees – Summary
Advantages:
- speed
- interpretability
- handles different types of input variables
Limitations – sensitivity to:
- correlated inputs
- affine transformations (of the input variables)
- general instability of trees
Variations: ID3 (in machine learning), linear CART
Slide 44: OUTLINE
Objectives
Statistical Methodology and Basic Methods
Taxonomy of Nonlinear Methods
Decision Trees
Additive Modeling and Projection Pursuit
Greedy Feature Selection
Summary and Discussion
Slide 45: Additive Modeling
Additive model parameterization for regression: f(x) = g1(x1) + g2(x2) + ... + gd(xd), where each g_k(x_k) is an unknown (smooth) univariate function.
Each univariate component is estimated separately.
Additive model for classification: the same additive form is used for the decision function.
Backfitting is a greedy optimization approach for estimating the basis functions sequentially.
Slide 46: Additive Modeling (cont'd)
By fixing all basis functions except one, the empirical risk (MSE) can be decomposed so that the held-out component is fit to the partial residual.
Each basis function is estimated via an iterative backfitting algorithm (until some stopping criterion is met).
Note: the partial residual can be interpreted as the response variable for the adaptive method.
Slide 47: Backfitting Algorithm – Example
Consider regression estimation of a function of two variables of the form y = g1(x1) + g2(x2) + noise from training data.
Backfitting method:
(1) estimate g1 for fixed g2
(2) estimate g2 for fixed g1
and iterate the above two steps.
Estimation is via minimization of the empirical risk (MSE).
Slide 48: Backfitting Algorithm (cont'd)
Estimation of g1 via minimization of MSE: this is a univariate regression problem of estimating g1 from n data points (x1_i, r_i), where the response r_i = y_i - g2(x2_i) is the partial residual. It can be estimated by smoothing (e.g. kNN regression).
Estimation of g2 (second iteration) proceeds in a similar manner, via minimization of MSE where the response is the residual y_i - g1(x1_i).
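A hedged sketch of the backfitting loop for two components, with a simple k-nearest-neighbor smoother standing in for the scatterplot smoother; the target function, k, and the number of iterations are assumptions (the local function requires a recent MATLAB release).
% Backfitting for an additive model y = g1(x1) + g2(x2) + noise (toy example)
n = 100;  k = 7;
x1 = rand(n, 1);  x2 = rand(n, 1);
y = sin(2*pi*x1) + x2.^2 + 0.1*randn(n, 1);   % hypothetical target + noise
g1 = zeros(n, 1);  g2 = zeros(n, 1);
for iter = 1:10
    r1 = y - g2;                   % partial residual acts as the response for g1
    g1 = knnSmooth(x1, r1, k);
    g1 = g1 - mean(g1);            % keep components centered
    r2 = y - g1;                   % partial residual acts as the response for g2
    g2 = knnSmooth(x2, r2, k);
    g2 = g2 - mean(g2);
end

function s = knnSmooth(x, r, k)
% average the residuals of the k nearest neighbors at each training point
    n = numel(x);  s = zeros(n, 1);
    for i = 1:n
        [~, idx] = sort(abs(x - x(i)));
        s(i) = mean(r(idx(1:k)));
    end
end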
Slide 49: Projection Pursuit Regression
Projection pursuit is an additive model f(x) = sum_m g_m(w_m · x), where the basis functions g_m are univariate functions of projections.
Features z_m = w_m · x specify the projection of x onto w_m.
A sum of such nonlinear functions of projections can approximate general nonlinear functions; see the example below.
Slide 50: (figure only)
Slide 51: Projection Pursuit Regression
Projection pursuit is an additive model f(x) = sum_m g_m(w_m · x), where the basis functions g_m are univariate functions of projections.
A backfitting algorithm is used to iteratively estimate
(a) the basis functions g_m (their parameters) via scatterplot smoothing
(b) the projection parameters w_m (via gradient descent)
Slide 52: EXAMPLE – Estimation of a Two-Dimensional Function via Projection Pursuit
(a) Projections are found that minimize the unexplained variance; smoothing is performed to create adaptive basis functions.
(b) The final model is the sum of two univariate adaptive basis functions.
Slide 53: OUTLINE
Objectives
Statistical Methodology and Basic Methods
Taxonomy of Nonlinear Methods
Decision Trees
Additive Modeling and Projection Pursuit
Greedy Feature Selection
Summary and Discussion
Slide 54: Greedy Feature Selection
Recall the feature-selection structure in SRM:
- a difficult (nonlinear) optimization problem in general
- simple with orthogonal basis functions
- why not use orthogonal basis functions for all applications?
Consider sparse polynomial estimation (aka best-subset regression) as an example of feature selection, i.e. features ~ polynomial terms of the input.
Compare two approaches:
- exhaustive search through all subsets
- forward stepwise selection (standard in statistics)
Slide 55: Data Set Used for Comparisons
30 noisy training samples generated from a target function (given on the slide), where the inputs are uniformly distributed in [0,1].
Slide 56: Feature Selection via Exhaustive Search
Exhaustive search for best-subset selection:
- estimate the prediction risk (MSE) via leave-one-out cross-validation
- minimize the empirical risk via least squares for every possible subset of m variables (features)
- select the best subset (~ minimum prediction risk)
Based on the minimum prediction risk (via cross-validation), a particular subset was selected; the final model is then estimated via linear regression using the selected features and all of the data.
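A hedged sketch of best-subset selection scored by leave-one-out cross-validation; the data and candidate features below are made up (the slide's actual target function did not survive extraction).
% Exhaustive best-subset search scored by leave-one-out cross-validation
n = 30;  x = rand(n, 1);
y = x.^3 + 0.1*randn(n, 1);                   % hypothetical data, NOT the slide's target
Phi = [x x.^2 x.^3 x.^4 x.^5];                % candidate polynomial features
M = size(Phi, 2);
bestRisk = inf;  bestCols = [];
for code = 1:2^M - 1                          % every nonempty subset of features
    cols = find(bitget(code, 1:M));
    looErr = 0;
    for i = 1:n                               % leave-one-out estimate of prediction risk
        idx = [1:i-1, i+1:n];
        Xtr = [ones(n-1, 1) Phi(idx, cols)];
        w = Xtr \ y(idx);
        looErr = looErr + (y(i) - [1 Phi(i, cols)] * w)^2;
    end
    if looErr / n < bestRisk
        bestRisk = looErr / n;  bestCols = cols;
    end
end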
Slide 57: Forward Subset Selection (Greedy Method)
Forward subset selection:
- first estimate the model using one feature
- then add a second feature if it results in a sufficiently large decrease in RSS, otherwise stop
- etc. (sequentially adding one more feature at a time)
Step 1: select the first feature (m = 1) from a set of candidate models; the candidate RSS values are 0.249, 0.270, 0.274, 0.271, so the selected model is the one with RSS(1) = 0.249.
Step 2: select the second feature (m = 2) from a set of candidate models; the candidate RSS values are 0.0615, 0.05424, 0.05422, so the selected model has RSS(2) = 0.05422.
Slide 58: Forward Subset Selection (Greedy Method, cont'd)
Step 2 (cont'd): check whether including the second feature is justified using a statistical criterion, usually an F-test; the (m+1)-st feature is included only if F > 90. For the second feature the F value exceeds this threshold, so we keep it in the model.
Step 3: select the third feature from a set of candidate models, with RSS = 0.05362 and RSS = 0.05363. Testing via the F-test shows that adding a third feature is not justified, so the final model uses the two features selected earlier.
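A sketch of forward selection with an F-test stopping rule; the threshold F > 90 follows the slide, but the data, candidate features, and the exact F-statistic convention are assumptions.
% Forward stepwise feature selection with an F-test stopping rule
n = 30;  x = rand(n, 1);
y = x.^3 + 0.1*randn(n, 1);                   % hypothetical data, NOT the slide's target
Phi = [x x.^2 x.^3 x.^4 x.^5];                % candidate features
selected = [];  remaining = 1:size(Phi, 2);
RSSold = sum((y - mean(y)).^2);
Fthreshold = 90;                              % threshold used on the slide
while ~isempty(remaining)
    rssBest = inf;
    for j = remaining                         % try adding each remaining feature
        Xtry = [ones(n, 1) Phi(:, [selected j])];
        rss = sum((y - Xtry * (Xtry \ y)).^2);
        if rss < rssBest,  rssBest = rss;  jBest = j;  end
    end
    m = numel(selected) + 1;                  % number of features if jBest is accepted
    F = (RSSold - rssBest) / (rssBest / (n - m - 1));   % one assumed F-statistic form
    if F <= Fthreshold, break; end            % improvement not justified: stop
    selected = [selected jBest];
    remaining = setdiff(remaining, jBest);
    RSSold = rssBest;
end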
Slide 59: Feature Selection in Signal Processing
Regression formulation ~ real-valued function estimation.
Signal representation: a linear combination of orthogonal basis functions (harmonics, wavelets).
Differences from the standard formulation:
- fixed sampling rate
- training-data x-values = test-data x-values
Computationally efficient orthogonal estimators: Discrete Fourier / Wavelet Transform (DFT / DWT).
Important features ~ large coefficients, which are easy to select.
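A small sketch of selecting "important features" as the largest orthogonal (Fourier) coefficients of a sampled signal; the signal and the number of kept coefficients are illustrative.
% Denoising by keeping the largest DFT coefficients (orthogonal feature selection)
n = 128;  t = (0:n-1)' / n;                               % fixed sampling rate
y = sin(2*pi*3*t) + 0.5*cos(2*pi*7*t) + 0.2*randn(n, 1);  % hypothetical noisy signal
c = fft(y) / n;                                           % orthogonal representation
[~, idx] = sort(abs(c), 'descend');
keep = idx(1:6);                                          % important features ~ largest coefficients
cKept = zeros(n, 1);  cKept(keep) = c(keep);
yDenoised = real(ifft(cKept) * n);                        % reconstructed (denoised) signal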
Slide 60: OUTLINE
Objectives
Statistical Methodology and Basic Methods
Taxonomy of Nonlinear Methods
Decision Trees
Additive Modeling and Projection Pursuit
Greedy Feature Selection
Summary and Discussion
Slide 61: Summary and Discussion
Evolution of statistical methods: from parametric to flexible (adaptive)
- fast optimization (favoring greedy methods – why?)
- interpretable
- model complexity ~ number of parameters (basis functions, regions, features, ...)
- batch mode (for training)
Probabilistic framework:
- classical methods assume probabilistic models of the observed data
- adaptive statistical methods lack a probabilistic derivation, but use clever heuristics for controlling model complexity