Lecture 1: Introduction to Machine Learning Methods
Stephen P. Ryan
Olin Business School, Washington University in St. Louis
Structure of the Course
Goals:
- Basic machine learning algorithms
- Heterogeneous treatment effects
- Some recent econometric theory advances
- How to use machine learning in your research projects
Sequence:
- Introduction to various ML techniques
- Causal inference and ML
- Random projection with large choice sets
- Double machine learning
- Using text data
- Moment trees for heterogeneous models
Shout Out to Elements of Statistical Learning
- http://statweb.stanford.edu/~tibs/ElemStatLearn/ (free PDF online)
- A convenient summary of many ML techniques by some of the leaders of the field: Trevor Hastie, Robert Tibshirani, and Jerome Friedman
- Many examples, figures, and algorithms in these notes are drawn from this book
Other web resources:
- Christopher Manning and Richard Socher, Natural Language Processing with Deep Learning: http://web.stanford.edu/class/cs224n/syllabus.html
- Oxford Deep NLP course: https://github.com/oxford-cs-deepnlp-2017/lectures
- Deep learning: http://deeplearning.net/
Machine Learning
- What is machine learning? The key idea is prediction
- Econometric theory has provided a general treatment of nonparametric estimation over the last twenty years
- At the highest level: nothing new under the sun
- The twist: combine model selection with estimation
- These methods fit in the middle ground between fully pre-specified parametric models and completely nonparametric approaches
- They scale to large data sets (e.g., so-called big data)
Two Broad Types of Machine Learning Methods
Frequency domain:
- Pre-specify the right-hand-side variables, then select among them
- Many methods select a finite number of elements from a potentially high-dimensional set
Characteristic space:
- Search over which variables (and their interactions) belong as explanatory variables
- No need to take a stand on the functional form of the elements
Both approaches require some restrictions on function complexity.
The General Problem
- Consider: $y = f(x, \beta, \epsilon)$
- Even when x is univariate, this can be a complex relationship
- What are some approaches for estimating this relationship?
  - Nonparametric: kernel regression
  - Semi-nonparametric: series estimation, e.g., b-splines with increasing knot density
  - Parametric: assume a functional form, e.g., $y = x'\beta + \epsilon$
- Each approach has pros and cons
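A minimal sketch (not from the slides) contrasting the three approaches on simulated data; the bandwidth, knot placement, and linear specification are illustrative choices.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)   # true relationship is nonlinear

# 1) Nonparametric: Nadaraya-Watson kernel regression with a Gaussian kernel
def kernel_reg(x0, x, y, h=0.05):
    w = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

grid = np.linspace(0.05, 0.95, 50)
f_kernel = kernel_reg(grid, x, y)

# 2) Semi-nonparametric: least-squares cubic b-spline with a few interior knots
knots = np.concatenate(([x[0]] * 4, np.quantile(x, [0.25, 0.5, 0.75]), [x[-1]] * 4))
f_spline = make_lsq_spline(x, y, knots, k=3)(grid)

# 3) Parametric: OLS on the (here misspecified) linear model y = x'beta + eps
beta = np.polyfit(x, y, deg=1)
f_ols = np.polyval(beta, grid)
```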
ML Approaches in Characteristic Space
- Alternatively, consider growing the set of RHS variables
- Classification and regression trees do this by recursively partitioning the characteristic space
- One starts with just the data and lets the tree decide which interactions matter
- Nodes split on a decision rule
- The final nodes ("leaves") of the tree return a classification vote or the mean value of the function
ML Approaches in Frequency Domain
- Many ML algorithms approach this problem by saturating the RHS of the model and then assigning most of the variables zero influence
- Variants of penalized regression: $\min_\beta \sum_i (y_i - x_i'\beta)^2 + G(\beta)$, where G is some penalty function
- Ridge regression: penalize the sum of squared betas (with normalized X)
- LASSO: penalize the sum of absolute values of the betas
- Square-root LASSO: take the square root of the sum of squared residuals, keeping the absolute-value penalty on beta
- Support vector machines have a slightly different objective function
- There is also a family of incremental approaches that build up from simple to complex models
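A minimal sketch (assuming scikit-learn) of ridge and LASSO on simulated data with many regressors but few true signals; the penalty levels are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n, p = 100, 50                       # many candidate regressors
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3, -2, 1.5, 1, -1]  # only five variables actually matter
y = X @ beta_true + rng.normal(size=n)

X_std = StandardScaler().fit_transform(X)   # penalties are applied to normalized X

ridge = Ridge(alpha=10.0).fit(X_std, y)     # shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X_std, y)      # sets many coefficients exactly to zero

print("nonzero ridge coefficients:", np.sum(ridge.coef_ != 0))   # typically all 50
print("nonzero lasso coefficients:", np.sum(lasso.coef_ != 0))   # typically close to 5
```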
Comparing LASSO and Ridge Regression
Incremental Forward Stagewise Regression
Coefficient Paths
Basis Functions
Increasing Smoothness
Splines in General
- There are many varieties of splines
- The smoothing spline is the solution to the following problem: $\min_f \sum_{i=1}^N \{y_i - f(x_i)\}^2 + \lambda \int \{f''(t)\}^2 \, dt$
- Amazingly, there is a (natural) cubic spline that minimizes this objective function
- All splines can also be written in terms of basis splines, or b-splines
- B-splines are great, and you should think about them when you need to approximate an unknown function
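A minimal sketch of fitting a smoothing spline, assuming SciPy 1.10 or later for make_smoothing_spline, which solves the penalized criterion above for a given lambda (lam); the penalty values are illustrative.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * np.pi * x) + rng.normal(0, 0.3, 100)

rough = make_smoothing_spline(x, y, lam=1e-8)    # small penalty: wiggly fit, low bias
smooth = make_smoothing_spline(x, y, lam=1e-2)   # large penalty: nearly linear fit

grid = np.linspace(0, 1, 200)
fitted_rough, fitted_smooth = rough(grid), smooth(grid)
```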
B-Spline Construction
- Define a set of knots for a b-spline of degree M
- Place two knots at the endpoints of the data; call these knot 0 and knot K
- Define K-1 knots on the interior between those two endpoints
- Also add M knots to the left and to the right of the outside endpoints
- The b-splines can then be computed recursively (see the book for details)
- Key point: b-splines are defined locally
- Upshot: they are numerically stable
- Approximate a function by least squares: $f(x) = \sum_i \beta_i \, bs_i(x)$
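A minimal sketch of that last step: build the b-spline basis explicitly and recover the $\beta_i$ by ordinary least squares. It assumes SciPy 1.8 or later for BSpline.design_matrix and uses the common "clamped" convention of stacking the extra M knots at the endpoints rather than outside them.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 200))
y = np.exp(-x) * np.sin(6 * x) + rng.normal(0, 0.1, 200)

M = 3                                   # cubic b-splines
interior = np.linspace(0.1, 0.9, 9)     # interior knots
knots = np.r_[[0.0] * (M + 1), interior, [1.0] * (M + 1)]   # repeated boundary knots

B = BSpline.design_matrix(x, knots, M).toarray()   # n-by-(#basis) matrix of bs_i(x)
beta, *_ = np.linalg.lstsq(B, y, rcond=None)       # f(x) ~= sum_i beta_i * bs_i(x)
fitted = B @ beta
```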
Visual Representation of Basis Functions
Bias vs Variance: Smoothing Spline Example
Many Techniques Have Tuning Parameters
- In general, the question is how to determine those tuning parameters
- One approach is cross-validation
- Basic idea: estimate on one sample, predict on another
- Common approach: leave-one-out (LOO)
- Across all subsamples with one observation removed, estimate the model, predict the error for the omitted observation, and sum up
- Balances too much variance (overfitting) against too much bias (oversmoothing)
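A minimal sketch (assuming scikit-learn) of leave-one-out cross-validation used to choose a tuning parameter, here a ridge penalty; the candidate grid is illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=60)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
loo_mse = []
for a in alphas:
    # squared error on each held-out observation, averaged over all n splits
    scores = cross_val_score(Ridge(alpha=a), X, y,
                             cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    loo_mse.append(-scores.mean())

best_alpha = alphas[int(np.argmin(loo_mse))]
print("LOO-CV choice of alpha:", best_alpha)
```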
Example: Smoothing Splines
General Problem: Model Selection and Fit
How to Think About This Problem
- In general, we need to assess fit in-sample and to compare fit across models
- Bias-variance decomposition: $\mathrm{Err}(x_0) = \sigma_\epsilon^2 + \mathrm{Bias}^2[\hat{f}(x_0)] + \mathrm{Var}[\hat{f}(x_0)]$
- The optimal solution is a three-way partition of the data set: training data (fit the model), validation data (choose among models), and test data (report test error)
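A minimal sketch of the three-way partition, assuming scikit-learn; the 50/25/25 split and the candidate models are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 20))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=400)

# training / validation / test split
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# fit candidate models on the training data, choose among them on the validation data
candidates = {a: Lasso(alpha=a).fit(X_train, y_train) for a in [0.01, 0.1, 1.0]}
best_alpha = min(candidates,
                 key=lambda a: mean_squared_error(y_val, candidates[a].predict(X_val)))

# report generalization error once, on the untouched test data
test_mse = mean_squared_error(y_test, candidates[best_alpha].predict(X_test))
print(best_alpha, test_mse)
```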
Related to ML Methods
- Many ML methods seek to balance the bias-variance tradeoff in some fashion
- We will later see honest trees, which build on this principle through sample splitting
- There are also stochastic methods, such as bagging and random forests
Bootstrap and Bagging
- The bootstrap refers to resampling your observations to (hopefully) learn something about the population, typically standard errors or confidence intervals
- Resample with replacement many times and compute the statistic of interest on each resample; this produces a distribution for the statistic
- Bagging (bootstrap aggregation) is a similar idea: replace the estimate from the original sample with the average over bootstrap samples, $\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x)$
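A minimal sketch of the bootstrap using only NumPy; the statistic (the sample median) is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.lognormal(mean=0.0, sigma=1.0, size=200)

B = 2000
boot_stats = np.empty(B)
for b in range(B):
    resample = rng.choice(data, size=data.size, replace=True)  # draw with replacement
    boot_stats[b] = np.median(resample)                        # statistic on this resample

print("bootstrap SE of the median:", boot_stats.std(ddof=1))
print("95% percentile interval:", np.percentile(boot_stats, [2.5, 97.5]))
```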
Trees
- One of the settings where bagging can really reduce variance is trees
- Trees recursively partition the feature space into rectangles
- At each node, the tree splits the sample on the basis of some rule (e.g., x3 > 0.7, or x4 in {BMW, Mercedes-Benz})
- Splits are chosen to optimize some criterion (e.g., minimize mean squared prediction error)
- The tree is grown until some stopping criterion is met
- Each leaf returns a type (classification tree) or an average value (regression tree)
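A minimal sketch of a regression tree, assuming scikit-learn; the simulated interaction and the depth limit are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(500, 3))
# the true function has an interaction the tree can discover on its own
y = 2.0 * (X[:, 0] > 0.7) * (X[:, 2] > 0.3) + rng.normal(0, 0.2, 500)

tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=20).fit(X, y)

# each leaf reports the mean of y among training points falling in that rectangle
print(export_text(tree, feature_names=["x1", "x2", "x3"]))
```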
Example of a Tree
Bagging a Tree
- Bagging a tree can often lead to large reductions in the variance of the estimate
- Why? Bagging replaces a single estimate using all the data with an ensemble of estimates using data resampled with replacement
- Let $\phi(x)$ be the predictor fit on a given sample, and let $\mu(x) = E[\phi(x)]$ be its average over samples; averaging can only lower expected squared error, as the decomposition below shows
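A sketch of the standard argument behind this claim, following ESL (Section 8.7): expand the squared error around $\mu(x)$ and note that the cross term vanishes because $E[\phi(x)] = \mu(x)$.

$$
E\big[(Y - \phi(x))^2\big] = E\big[(Y - \mu(x))^2\big] + E\big[(\phi(x) - \mu(x))^2\big] \;\ge\; E\big[(Y - \mu(x))^2\big]
$$

Since bagging approximates $\mu(x)$ by averaging over bootstrap samples, the aggregated predictor cannot do worse in expected squared error than a single-sample predictor.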
Random Forest
- The idea of resampling can be extended to subsampling
- The random forest is an ensemble version of the regression tree
- Key difference: estimate trees on bootstrap samples, but restrict the set of variables considered at each split to a random subset
- Why? This helps break the correlation across trees
- That is useful because the variance of an average of B identically distributed random variables, each with variance $\sigma^2$ and pairwise correlation $\rho$, is $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$
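A minimal sketch contrasting plain bagging with a random forest in scikit-learn; the only substantive difference is max_features, the size of the random subset of variables considered at each split, and the simulated data are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)
X = rng.normal(size=(1000, 10))
y = X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(0, 0.5, 1000)

# max_features=None considers all 10 variables at each split: plain bagged trees
bagged = RandomForestRegressor(n_estimators=500, max_features=None,
                               oob_score=True, random_state=0).fit(X, y)
# max_features="sqrt" considers a random subset at each split: decorrelated trees
forest = RandomForestRegressor(n_estimators=500, max_features="sqrt",
                               oob_score=True, random_state=0).fit(X, y)

print("out-of-bag R^2, bagging vs. forest:", bagged.oob_score_, forest.oob_score_)
```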
Random Forest Algorithm
Support Vector Machines (linear kernel)
- Support vector machines are a modified penalized regression method: $\min_\beta \sum_i V(y_i - f(x_i)) + \frac{\lambda}{2}\lVert\beta\rVert^2$
- where $V(r) = |r| - \epsilon$ if $|r| > \epsilon$ and 0 otherwise
- Basically, "small" errors are ignored
- This is a nonlinear optimization problem; when $f(x_i)$ is linear, only a subset of the observations (the "support vectors") enter the solution with nonzero weight
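A minimal sketch, assuming scikit-learn's LinearSVR, whose epsilon-insensitive loss matches the V above (its C parameter plays the role of $1/\lambda$); epsilon and the simulated data are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.default_rng(9)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + rng.normal(0, 0.3, 300)

svr = LinearSVR(epsilon=0.5, C=1.0, max_iter=10000).fit(X, y)

# residuals inside the epsilon tube contribute zero loss and do not pin down the fit
inside_tube = np.abs(y - svr.predict(X)) <= 0.5
print("share of observations with |residual| <= epsilon:", inside_tube.mean())
```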
So How Is Any of This Useful?
- Think about machine learning as a combination of model selection and estimation
- Econometric theory has given us high-level tools for thinking about completely nonparametric estimation
- These techniques fit between fully parametric and fully nonparametric estimation
- Key point: we are approximating conditional expectations
- The economics literature is now considering how to take model selection seriously
Counterpoint