
Slide 1: Introduction to Predictive Learning. Electrical and Computer Engineering. LECTURE SET 2: Basic Learning Approaches and Complexity Control

Slide 2: OUTLINE
2.0 Objectives
2.1 Terminology and Basic Learning Problems
2.2 Basic Learning Approaches
2.3 Generalization and Complexity Control
2.4 Application Example
2.5 Summary

Slide 3: 2.0 Objectives
1. To quantify the notions of explanation, prediction and model
2. Introduce terminology
3. Describe basic learning methods
Past observations ~ data points
Explanation (model) ~ function
Learning ~ function estimation
Prediction ~ using the estimated model to make predictions

Slide 4: 2.0 Objectives (cont'd)
Example: classification (training samples, model)
Goal 1: explanation of training data
Goal 2: generalization (for future data)
Learning is ill-posed

Slide 5: Learning as Induction
Induction ~ function estimation from data
Deduction ~ prediction for new inputs

Slide 6: 2.1 Terminology and Learning Problems
Input and output variables
Learning ~ estimation of F(X): X → y
Statistical dependency vs. causality

Slide 7: 2.1.1 Types of Input and Output Variables
Real-valued
Categorical (class labels)
Ordinal (or fuzzy) variables
Aside: fuzzy sets and fuzzy logic

Slide 8: Data Preprocessing and Scaling
Preprocessing is required with observational data (step 4 in the general experimental procedure)
Examples: ...
Basic preprocessing includes:
- summary univariate statistics: mean, standard deviation, min and max values, range, boxplot (performed independently for each input/output variable)
- detection (and removal) of outliers
- scaling of input/output variables (may be required by some learning algorithms)
Visual inspection of data is tedious but useful
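A minimal preprocessing sketch of the three steps listed above (summary statistics, a simple outlier rule, and scaling to [0, 1]). The helper names and the 3-sigma outlier rule are illustrative assumptions, not the lecture's own procedure; the sample values are body weights from the animal data set on the next slides.

```python
import numpy as np

def summarize(x):
    """Univariate summary statistics for one variable."""
    return {"mean": x.mean(), "std": x.std(), "min": x.min(),
            "max": x.max(), "range": x.max() - x.min()}

def flag_outliers(x, k=3.0):
    """Flag points more than k standard deviations from the mean (one simple rule)."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > k

def minmax_scale(x):
    """Scale a variable to the [0, 1] range."""
    return (x - x.min()) / (x.max() - x.min())

# Body weights (kg) for a few animals from the data set below
body = np.array([1.35, 465.0, 36.33, 27.66, 1.04, 11700.0])
print(summarize(body))
print(flag_outliers(body))   # boolean mask of candidate outliers
print(minmax_scale(body))    # values rescaled to [0, 1]
```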

Slide 9: Example data set: animal body and brain weight
#   Animal             Body weight (kg)   Brain weight (g)
1   Mountain beaver               1.350              8.100
2   Cow                         465.000            423.000
3   Gray wolf                    36.330            119.500
4   Goat                         27.660            115.000
5   Guinea pig                    1.040              5.500
6   Diplodocus                11700.000             50.000
7   Asian elephant             2547.000           4603.000
8   Donkey                      187.100            419.000
9   Horse                       521.000            655.000
10  Potar monkey                 10.000            115.000
11  Cat                           3.300             25.600
12  Giraffe                     529.000            680.000
13  Gorilla                     207.000            406.000
14  Human                        62.000           1320.000

Slide 10: Example data set (cont'd)
#   Animal             Body weight (kg)   Brain weight (g)
15  African elephant           6654.000           5712.000
16  Triceratops                9400.000             70.000
17  Rhesus monkey                 6.800            179.000
18  Kangaroo                     35.000             56.000
19  Hamster                       0.120              1.000
20  Mouse                         0.023              0.400
21  Rabbit                        2.500             12.100
22  Sheep                        55.500            175.000
23  Jaguar                      100.000            157.000
24  Chimpanzee                   52.160            440.000
25  Brachiosaurus             87000.000            154.500
26  Rat                           0.280              1.900
27  Mole                          0.122              3.000
28  Pig                         192.000            180.000

Slide 11: Original unscaled animal data: which points are outliers?

Slide 12: Animal data with outliers removed and scaled to the [0, 1] range; humans appear in the top left corner

Slide 13: 2.1.2 Supervised Learning: Regression
Data in the form (x, y), where
- x is a multivariate input (i.e., a vector)
- y is a univariate output ('response')
Regression: y is real-valued
⇒ Estimation of a real-valued function x → y

Slide 14: 2.1.2 Supervised Learning: Classification
Data in the form (x, y), where
- x is a multivariate input (i.e., a vector)
- y is a univariate output ('response')
Classification: y is categorical (a class label)
⇒ Estimation of an indicator function x → y

Slide 15: 2.1.2 Unsupervised Learning
Data in the form (x), where
- x is a multivariate input (i.e., a vector)
Goal 1: data reduction or clustering
⇒ Clustering = estimation of a mapping x → c

Slide 16: Unsupervised Learning (cont'd)
Goal 2: dimensionality reduction
Finding a low-dimensional model of the data

Slide 17: 2.1.3 Other (nonstandard) learning problems
Multiple model estimation

Slide 18: OUTLINE
2.0 Objectives
2.1 Terminology and Learning Problems
2.2 Basic Learning Approaches
- Parametric Modeling
- Non-parametric Modeling
- Data Reduction
2.3 Generalization and Complexity Control
2.4 Application Example
2.5 Summary

Slide 19: 2.2.1 Parametric Modeling
Given training data:
(1) Specify a parametric model
(2) Estimate its parameters (via fitting to the data)
Example: linear regression F(x) = (w · x) + b
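A minimal sketch of the two steps above for the linear-regression example: the model form (w · x) + b is specified first, then (w, b) are estimated by least-squares fitting. The synthetic data and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 2))                       # 20 samples, 2 inputs
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 + 0.1 * rng.standard_normal(20)

# Step (1): linear model f(x) = w.x + b; step (2): least-squares parameter estimation
A = np.hstack([X, np.ones((len(X), 1))])                  # append a column of ones for b
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:-1], coef[-1]

def f(x):
    """Estimated parametric model."""
    return x @ w + b

print(w, b)
```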

Slide 20: Parametric Modeling (cont'd)
Given training data:
(1) Specify a parametric model
(2) Estimate its parameters (via fitting to the data)
Example: univariate classification

Slide 21: 2.2.2 Non-Parametric Modeling
Given training data, estimate the model (for a given input x) as a 'local average' of the data.
Note: need to define 'local' and 'average'
Example: k-nearest-neighbors regression
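A sketch of the k-nearest-neighbors regression example: 'local' means the k closest training points to the query input, and 'average' means the mean of their outputs. Function names and the one-dimensional toy data are assumptions.

```python
import numpy as np

def knn_regress(x_query, X_train, y_train, k=4):
    """Estimate f(x_query) as the local average of the k nearest training outputs."""
    dist = np.abs(X_train - x_query)       # 1-d inputs; use a vector norm otherwise
    nearest = np.argsort(dist)[:k]         # indices of the k closest training samples
    return y_train[nearest].mean()         # local average of their outputs

X_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * X_train) + 0.1 * np.random.default_rng(0).standard_normal(10)
print(knn_regress(0.35, X_train, y_train, k=4))
```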

Slide 22: 2.2.3 Data Reduction Approach
Given training data, estimate the model as a 'compact encoding' of the data.
Note: 'compact' ~ number of bits needed to encode the model
Example: piecewise linear regression
How many parameters are needed for a two-linear-component model?

Slide 23: Example: piecewise linear regression vs. linear regression

Slide 24: Data Reduction Approach (cont'd)
Data reduction approaches are commonly used for unsupervised learning tasks.
Example: clustering. The training data are encoded by 3 points (cluster centers).
Issues:
- How to find the centers?
- How to select the number of clusters?
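One common way to find the centers is k-means (Lloyd's algorithm); the sketch below is an illustrative, numpy-only implementation and is not the lecture's own code. It encodes 90 two-dimensional points by 3 cluster centers; selecting the number of clusters is left open, as on the slide.

```python
import numpy as np

def kmeans(X, k=3, n_iter=20, seed=0):
    """Minimal k-means: alternate nearest-center assignment and center update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(n_iter):
        # assign each point to its nearest center (squared Euclidean distance)
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Three synthetic clusters of 30 points each
X = np.random.default_rng(1).normal(size=(90, 2)) + np.repeat(
    np.array([[0.0, 0.0], [4.0, 4.0], [0.0, 4.0]]), 30, axis=0)
centers, labels = kmeans(X, k=3)
print(centers)   # 3 points that compactly encode the whole data set
```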

Slide 25: Inductive Learning Setting
Induction and deduction in philosophy: "All observed swans are white (data samples). Therefore, all swans are white."
Model estimation ~ inductive step, i.e., estimating a function from data samples.
Prediction ~ deductive step
⇒ Inductive learning setting
Discussion: which of the three modeling approaches follow the inductive learning setting? Do humans implement inductive inference?

Slide 26: OUTLINE
2.0 Objectives
2.1 Terminology and Learning Problems
2.2 Modeling Approaches & Learning Methods
2.3 Generalization and Complexity Control
- Prediction Accuracy (generalization)
- Complexity Control: examples
- Resampling
2.4 Application Example
2.5 Summary

Slide 27: 2.3.1 Prediction Accuracy
Inductive learning ~ function estimation
All modeling approaches implement 'data fitting', i.e., explaining the data,
BUT the true goal is prediction.
Two possible goals of learning:
- estimation of the 'true function'
- good generalization for future data
Are these two goals equivalent? If not, which one is more practical?

Slide 28: Explanation vs. Prediction
(a) Classification  (b) Regression

Slide 29: Inductive Learning Setting
The learning machine observes samples (x, y) and returns an estimated response ŷ = f(x, w).
Recall 'first-principles' vs. 'empirical' knowledge
⇒ Two modes of inference: identification vs. imitation
Risk ~ expected loss R(w) = E[L(y, f(x, w))] under the (unknown) distribution P(x, y)

Slide 30: Discussion
The mathematical formulation is useful for quantifying
- explanation ~ fitting error (on training data)
- generalization ~ prediction error
Natural assumptions:
- the future is similar to the past: stationary P(x, y), i.i.d. data
- a fixed discrepancy measure or loss function, e.g., MSE
What if these assumptions do not hold?

Slide 31: Example: Regression
Given: training data (x_i, y_i), i = 1, ..., n
Find a function f(x) that minimizes the squared error over a large number (N) of future samples:
(1/N) * Σ_j (y_j - f(x_j))^2
BUT future data are unknown, i.e., P(x, y) is unknown

Slide 32: 2.3.2 Complexity Control: Parametric Modeling
Consider regression estimation
Ten training samples
Fitting linear and second-order polynomial models
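A sketch of the experiment described above: fit polynomials of degree 1 and 2 to ten training samples and compare their fitting (training) error. The target function and noise level are assumptions, since the slide's exact settings are not reproduced in the transcript.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = x ** 2 + 0.1 * rng.standard_normal(10)           # assumed noisy target function

for degree in (1, 2):
    coeffs = np.polyfit(x, y, deg=degree)            # least-squares polynomial fit
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)  # fitting error on the training data
    print(f"degree {degree}: training MSE = {mse:.4f}")
```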

Slide 33: Complexity Control: Local Estimation
Consider regression estimation
Ten training samples
Using k-nn regression with k = 1 and k = 4

Slide 34: Complexity Control (cont'd)
Complexity (of the admissible models) affects generalization (for future data)
Specific complexity indices:
- parametric models: ~ number of parameters
- local modeling: size of the local region
- data reduction: number of clusters
Complexity control = choosing good complexity (~ good generalization) for the given (training) data

Slide 35: How to Control Complexity?
Two approaches: analytic and resampling
Analytic criteria estimate the prediction error as a function of the fitting error and the model complexity.
Representative analytic criteria for regression:
- Schwartz criterion
- Akaike's final prediction error (FPE)
where p = DoF/n, n ~ sample size, DoF ~ degrees of freedom
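The slide's formulas did not survive the transcript, so the sketch below assumes the multiplicative-penalty forms commonly quoted for these criteria: estimated prediction error = r(p, n) * (training MSE), with FPE penalty r = (1 + p) / (1 - p) and Schwartz penalty r = 1 + (ln n / 2) * p / (1 - p). Treat these as assumptions, not the lecture's exact expressions; the sample size and fitting errors are hypothetical.

```python
import numpy as np

def fpe_penalty(dof, n):
    """Akaike's FPE penalization factor, assuming r = (1 + p) / (1 - p)."""
    p = dof / n
    return (1 + p) / (1 - p)

def schwartz_penalty(dof, n):
    """Schwartz-criterion penalization factor, assuming r = 1 + (ln n / 2) * p / (1 - p)."""
    p = dof / n
    return 1 + (np.log(n) / 2) * p / (1 - p)

n, train_mse = 25, 0.07                    # hypothetical sample size and fitting error
for dof in (2, 3, 6, 11):                  # e.g., polynomial degrees 1, 2, 5, 10
    print(dof,
          train_mse * fpe_penalty(dof, n),
          train_mse * schwartz_penalty(dof, n))
```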

Slide 36: 2.3.3 Resampling
Split the available data into two sets: training + validation
(1) Use the training set for model estimation (via data fitting)
(2) Use the validation data to estimate the prediction error of the model
Change the model complexity index and repeat (1) and (2)
Select the final model providing the lowest (estimated) prediction error
BUT the results are sensitive to the data splitting

Slide 37: K-fold Cross-Validation
1. Divide the training data Z into k randomly selected disjoint subsets {Z_1, Z_2, ..., Z_k} of size n/k
2. For each 'left-out' validation set Z_i:
   - use the remaining data to estimate the model
   - estimate the prediction error r_i on Z_i
3. Estimate the average prediction risk as R_cv = (1/k) * (r_1 + ... + r_k)
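A numpy-only sketch of the three steps above. The function and parameter names (fit_fn, predict_fn) are placeholders for any regression method; the polynomial example at the end is an assumption for illustration.

```python
import numpy as np

def kfold_risk(X, y, fit_fn, predict_fn, k=5, seed=0):
    """Average prediction risk estimated by k-fold cross-validation (squared loss)."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)       # random disjoint folds
    folds = np.array_split(idx, k)
    risks = []
    for i in range(k):
        val = folds[i]                                      # left-out validation set Z_i
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit_fn(X[train], y[train])                  # fit on the remaining data
        pred = predict_fn(model, X[val])
        risks.append(np.mean((pred - y[val]) ** 2))         # prediction error r_i on Z_i
    return np.mean(risks)                                   # R_cv = average of the r_i

# Example with a degree-2 polynomial model:
fit = lambda X, y: np.polyfit(X, y, 2)
pred = lambda m, X: np.polyval(m, X)
X = np.linspace(0, 1, 25)
y = X ** 2 + 0.1 * np.random.default_rng(1).standard_normal(25)
print(kfold_risk(X, y, fit, pred, k=5))
```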

Slide 38: Example of model selection (1)
25 samples are generated with x uniformly sampled in [0, 1] and noise ~ N(0, 1).
Regression is estimated using polynomials of degree m = 1, 2, ..., 10.
Polynomial degree m = 5 is chosen via 5-fold cross-validation.
The figure shows the polynomial model, along with training (*) and validation (*) data points, for one partitioning.

m    Estimated R via cross-validation
1    0.1340
2    0.1356
3    0.1452
4    0.1286
5    0.0699
6    0.1130
7    0.1892
8    0.3528
9    0.3596
10   0.4006
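A scikit-learn sketch of this degree-selection experiment: 25 noisy samples, polynomial degrees m = 1..10 compared by 5-fold cross-validation. The target function and noise level below are assumptions, so the numbers will not reproduce the table above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 25).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + 0.2 * rng.standard_normal(25)   # assumed target

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for m in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree=m), LinearRegression())
    mse = -cross_val_score(model, x, y, cv=cv,
                           scoring="neg_mean_squared_error").mean()
    print(f"m={m:2d}  estimated risk {mse:.4f}")
```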

Slide 39: Example of model selection (2)
Same data set, but estimated using k-nn regression.
The optimal value k = 7 is chosen according to 5-fold cross-validation model selection.
The figure shows the k-nn model, along with training (*) and validation (*) data points, for one partitioning.

k    Estimated R via cross-validation
1    0.1109
2    0.0926
3    0.0950
4    0.1035
5    0.1049
6    0.0874
7    0.0831
8    0.0954
9    0.1120
10   0.1227

Slide 40: More on Resampling
Leave-one-out (LOO) cross-validation:
- extreme case of k-fold with k = n (the number of samples)
- efficient use of the data, but requires n model estimates
The final (selected) model depends on:
- the random data
- the random partitioning of the data into k subsets (folds)
⇒ the same resampling procedure may yield different model selection results
Some applications may use a non-random split of the data into (training + validation).
Model selection via resampling is based on the estimated prediction risk (error). Does this estimated error reflect the true prediction accuracy of the final model?

Slide 41: Resampling for estimating the true risk
The prediction risk (test error) of a method can also be estimated via resampling.
Partition the data into: training / validation / test
The test data should never be used for model estimation.
Double resampling method:
- validation resampling for complexity control
- test resampling for estimating the prediction performance of the method
Estimation of the prediction risk (test error) is critical for comparing different learning methods.
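A sketch of double resampling as nested cross-validation with scikit-learn: an inner loop selects the model complexity (here k of a k-NN classifier) and an outer loop estimates the test error of the whole procedure. The synthetic data below is a placeholder for Ripley's data set; fold counts follow the slides (6 inner, 5 outer).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 2))                               # placeholder data, not Ripley's set
y = (X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(250) > 0).astype(int)

inner = KFold(n_splits=6, shuffle=True, random_state=0)     # selects k on each training fold
outer = KFold(n_splits=5, shuffle=True, random_state=1)     # estimates the test error

search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": list(range(1, 51))}, cv=inner)
test_error = 1 - cross_val_score(search, X, y, cv=outer).mean()
print(f"estimated test error of the k-NN method: {test_error:.3f}")
```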

Slide 42: Example of model selection for a k-NN classifier via 6-fold cross-validation: Ripley's data. Optimal decision boundary for k = 14.

Slide 43: Example of model selection for a k-NN classifier via 6-fold cross-validation: Ripley's data. Optimal decision boundary for k = 50.
Which one is better, k = 14 or k = 50?

Slide 44: Estimating the test error of a method
For the same example (Ripley's data), what is the true test error of the k-NN method?
Use double resampling, i.e., 5-fold cross-validation to estimate the test error and 6-fold cross-validation to estimate the optimal k for each training fold:

Fold #   k    Validation error   Test error
1        20   11.76%             14%
2        9    0%                 8%
3        1    17.65%             10%
4        12   5.88%              18%
5        7    17.65%             14%
mean          10.59%             12.8%

Note: the optimal k-values differ and the errors vary across folds, due to the high variability of the random partitioning of the data.

Slide 45: Estimating the test error of a method (cont'd)
Another realization of double resampling, i.e., 5-fold cross-validation to estimate the test error and 6-fold cross-validation to estimate the optimal k for each training fold:

Fold #   k    Validation error   Test error
1        7    14.71%             14%
2        31   8.82%              14%
3        25   11.76%             10%
4        1    14.71%             18%
5        62   11.76%             4%
mean          12.35%             12%

Note: the predicted average test error (12%) is usually higher than the minimized validation error (11%) used for model selection.

Slide 46: 2.4 Application Example
Why financial applications?
- "the market is always right" ~ loss function
- lots of historical data
- modeling results are easy to understand
Background on mutual funds
Problem specification + experimental setup
Modeling results
Discussion

Slide 47: OUTLINE
2.0 Objectives
2.1 Terminology and Basic Learning Problems
2.2 Basic Learning Approaches
2.3 Generalization and Complexity Control
2.4 Application Example
2.5 Summary

Slide 48: 2.4.1 Background: pricing mutual funds
Mutual fund trivia and recent scandals
Mutual fund pricing:
- priced once a day (after market close)
⇒ NAV is unknown when an order is placed
How to estimate NAV accurately?
Approach 1: estimate the holdings of the fund (~200-400 stocks), then find NAV
Approach 2: estimate NAV via correlations between NAV and major market indices (learning)

Slide 49: 2.4.2 Problem specification and experimental setup
Domestic fund: Fidelity OTC (FOCPX)
Possible inputs: SP500, DJIA, NASDAQ, ENERGY SPDR
Data encoding:
- output ~ % daily price change in NAV
- inputs ~ % daily price changes of market indices
Modeling period: 2003
Issues: modeling method? selection of input variables? experimental setup?

Slide 50: Experimental Design and Modeling Setup
Possible variable selection (all variables represent % daily price changes):

Y        X1       X2       X3
FOCPX    ^IXIC    -        -
FOCPX    ^GSPC    ^IXIC    -
FOCPX    ^GSPC    ^IXIC    XLE

Modeling method: linear regression
Data obtained from Yahoo Finance. Time period for modeling: 2003.

Slide 51: Specification of Training and Test Data
Year 2003, split into two-month periods: (1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)
Two-month training / test setup
⇒ A total of 6 regression models for 2003
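A sketch of one such regression model under this setup. It assumes a model is trained on one two-month window of % daily changes and evaluated on the following window; the exact pairing of windows is not spelled out in the transcript, and the arrays below are synthetic placeholders for data downloaded from Yahoo Finance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def pct_change(prices):
    """Daily % price change from a series of closing prices (would be applied to real data)."""
    prices = np.asarray(prices, dtype=float)
    return 100.0 * (prices[1:] - prices[:-1]) / prices[:-1]

# X: columns = % daily changes of ^GSPC and ^IXIC; y: % daily change of FOCPX NAV (placeholders)
rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(84, 2))                 # roughly two two-month windows of trading days
y = 0.17 * X[:, 0] + 0.77 * X[:, 1] + 0.1 * rng.standard_normal(len(X))

train, test = slice(0, 42), slice(42, 84)          # train on one window, test on the next
model = LinearRegression().fit(X[train], y[train])
mse = np.mean((model.predict(X[test]) - y[test]) ** 2)
print(model.intercept_, model.coef_, mse)
```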

Slide 52: Results for Fidelity OTC Fund (GSPC + IXIC)

Coefficient               w0       w1 (^GSPC)   w2 (^IXIC)
Average                   -0.027   0.173        0.771
Standard deviation (SD)   0.043    0.150        0.165

Average model: Y = -0.027 + 0.173 ^GSPC + 0.771 ^IXIC
^IXIC is the main factor affecting FOCPX's daily price change
Prediction error: MSE (GSPC + IXIC) = 5.95%

Slide 53: Results for Fidelity OTC Fund (GSPC + IXIC)
Daily closing prices for 2003: NAV vs. synthetic model

Slide 54: Results for Fidelity OTC Fund (GSPC + IXIC + XLE)

Coefficient               w0       w1 (^GSPC)   w2 (^IXIC)   w3 (XLE)
Average                   -0.029   0.147        0.784        0.029
Standard deviation (SD)   0.044    0.215        0.191        0.061

Average model: Y = -0.029 + 0.147 ^GSPC + 0.784 ^IXIC + 0.029 XLE
^IXIC is the main factor affecting FOCPX's daily price change
Prediction error: MSE (GSPC + IXIC + XLE) = 6.14%

Slide 55: Results for Fidelity OTC Fund (GSPC + IXIC + XLE)
Daily closing prices for 2003: NAV vs. synthetic model

Slide 56: Effect of Variable Selection
Different linear regression models for FOCPX:
Y = -0.035 + 0.897 ^IXIC
Y = -0.027 + 0.173 ^GSPC + 0.771 ^IXIC
Y = -0.029 + 0.147 ^GSPC + 0.784 ^IXIC + 0.029 XLE
Y = -0.026 + 0.226 ^GSPC + 0.764 ^IXIC + 0.032 XLE - 0.06 ^DJI
have different prediction errors (MSE):
MSE (IXIC) = 6.44%
MSE (GSPC + IXIC) = 5.95%
MSE (GSPC + IXIC + XLE) = 6.14%
MSE (GSPC + IXIC + XLE + DJIA) = 6.43%
(1) Variable selection is a form of complexity control
(2) Good selection can be performed by domain experts
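A sketch of how such a variable-selection comparison can be run: fit the same linear model on each candidate input set and compare test MSE, keeping everything else fixed. The column ordering, candidate names, and synthetic data below are placeholders, not the lecture's actual data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

candidates = {
    "IXIC": [1],
    "GSPC+IXIC": [0, 1],
    "GSPC+IXIC+XLE": [0, 1, 2],
    "GSPC+IXIC+XLE+DJI": [0, 1, 2, 3],
}

rng = np.random.default_rng(0)
X_all = rng.normal(0, 1, size=(252, 4))    # columns: GSPC, IXIC, XLE, DJI (placeholder % changes)
y = 0.17 * X_all[:, 0] + 0.77 * X_all[:, 1] + 0.3 * rng.standard_normal(252)

train, test = slice(0, 168), slice(168, 252)
for name, cols in candidates.items():
    m = LinearRegression().fit(X_all[train][:, cols], y[train])
    mse = np.mean((m.predict(X_all[test][:, cols]) - y[test]) ** 2)
    print(f"{name:20s} test MSE = {mse:.3f}")
```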

Slide 57: Discussion
Many funds simply mimic major indices
⇒ statistical NAV models can be used for ranking/evaluating mutual funds
Statistical models can be used for
- hedging risk, and
- overcoming restrictions on trading (market timing) of domestic funds
Since 70% of funds under-perform their benchmark indices, index funds may be the better choice

Slide 58: Summary
Inductive learning ~ function estimation
Goal of learning (empirical inference): to act/perform well, not system identification
Important concepts:
- training data, test data
- loss function, prediction error (aka risk)
- basic learning problems
- basic learning methods
Complexity control and resampling
Estimating prediction error via resampling

