Missing at Random (MAR)
θ is an unknown parameter of the distribution for the missing-data mechanism. The probability that some data are missing does not depend on the actual values of the missing data.
Example: two variables, age and income. The data are MAR if the probability that income is missing does not vary according to the value of income that is missing, but may vary according to age.

Missing Completely at Random (MCAR)
The probability that some data are missing does not depend on either the actual values of the missing data or the observed data. The observed values are a random subsample of the sampled values.
Example: the data are MCAR if the probability that income is missing does not vary according to the value of income or age.
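Both mechanisms can be stated formally. Writing R for the indicator of missing entries, z for the complete data, z_obs for its observed portion, and θ for the parameters of the missing-data mechanism (notation assumed for this note, not taken from the slides):

```latex
\text{MAR:}  \quad \Pr(\mathbf{R} \mid \mathbf{z}, \theta) = \Pr(\mathbf{R} \mid \mathbf{z}_{\mathrm{obs}}, \theta)
\qquad\qquad
\text{MCAR:} \quad \Pr(\mathbf{R} \mid \mathbf{z}, \theta) = \Pr(\mathbf{R} \mid \theta)
```

MCAR is the stronger condition: the missingness may depend on neither the missing nor the observed values.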

Dealing with Missing Features
Assume features are missing completely at random.
Discard observations with any missing values
–Useful if the relative amount of missing data is small
–Otherwise should be avoided
Rely on the learning algorithm to deal with missing values in its training phase
–In CART, surrogate splits
–In GAM, omit missing values when smoothing against that feature in backfitting, then set their fitted values to zero (this amounts to assigning the average fitted value to the missing observations)

Dealing with Missing Features (cont'd)
Impute all missing values before training
–Impute each missing value with the mean or median of the nonmissing values for that feature
–Estimate a predictive model for each feature given the other features; then impute each missing value by its prediction from the model
–Use multiple imputations to create different training sets and assess the variation of the fit across training sets (e.g., if using CART as the imputation engine, the multiple imputations could be done by sampling from the values in the corresponding terminal nodes)
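As a concrete sketch of the first two imputation strategies, here is a minimal Python/NumPy illustration; the function names and the fit/predict callables are assumptions made for this example, not part of the slides:

```python
import numpy as np

def mean_impute(X):
    """Replace NaNs in each column with that column's mean over the
    non-missing entries (simple single imputation)."""
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        missing = np.isnan(col)
        if missing.any():
            col[missing] = col[~missing].mean()
    return X

def model_impute(X, j, fit, predict):
    """Impute column j from the other columns with a user-supplied
    predictive model; `fit` and `predict` are assumed callables
    (e.g. wrapping a regression or a tree)."""
    X = X.astype(float).copy()
    missing = np.isnan(X[:, j])
    if not missing.any():
        return X
    others = np.delete(X, j, axis=1)
    # Train only on rows where the target column and the predictors
    # are fully observed.
    train = ~missing & ~np.isnan(others).any(axis=1)
    model = fit(others[train], X[train, j])
    # Fill only rows whose predictors are fully observed.
    fill = missing & ~np.isnan(others).any(axis=1)
    X[fill, j] = predict(model, others[fill])
    return X
```

Restricting the per-feature model to fully observed predictor rows sidesteps, rather than solves, the partial-input issue raised in the questions that follow.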

Questions about Missing Data
–For the missing data, I still want to ask the question I asked last time: can we have a generalization of the missing-data handling for different methods? [Yanjun]
–p. 294: it's not clear to me how you go about doing imputation, since we still face the same problem of partial input when predicting the missing feature value from the available ones, i.e., each time a different set of features is available to predict the missing ones. [Ben]

Regression Models
–Linear Model
–Generalized Linear Model
–Additive Model
–Generalized Additive Model
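The standard forms of these four models, written in the usual notation (response Y, predictors X_1, …, X_p, link function g, smooth functions f_j; a reconstruction, since the slide's own formulas are not in the transcript):

```latex
\begin{aligned}
\text{Linear model:}               &\quad \mathrm{E}(Y \mid X) = \alpha + \beta_1 X_1 + \cdots + \beta_p X_p \\
\text{Generalized linear model:}   &\quad g\big[\mathrm{E}(Y \mid X)\big] = \alpha + \beta_1 X_1 + \cdots + \beta_p X_p \\
\text{Additive model:}             &\quad \mathrm{E}(Y \mid X) = \alpha + f_1(X_1) + \cdots + f_p(X_p) \\
\text{Generalized additive model:} &\quad g\big[\mathrm{E}(Y \mid X)\big] = \alpha + f_1(X_1) + \cdots + f_p(X_p)
\end{aligned}
```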

Smooth Functions f_j(·)
Non-parametric functions (linear smoothers)
–Smoothing splines (basis expansion)
–Simple k-nearest neighbors (raw moving average)
–Locally weighted average using kernel weighting
–Local linear regression, local polynomial regression
Linear functions
Functions of more than one variable (interaction terms)
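To make "linear smoother" concrete, here is a minimal sketch of two of the smoothers listed above — a raw k-nearest-neighbor moving average and a kernel-weighted average. The function names and the Gaussian kernel are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def knn_smoother(x, y, x0, k=5):
    """Raw moving average: average the y-values of the k points
    whose x-values are closest to x0."""
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

def kernel_smoother(x, y, x0, bandwidth=1.0):
    """Locally weighted average with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
    return np.sum(w * y) / np.sum(w)

# Both are linear smoothers: the fitted value at x0 is a linear
# combination of the observed y's with weights depending only on the x's.
```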

Questions about Model
–On p. 258, the authors stress that each f_i(X_i) is a "non-parametric" function. However, in the next few sentences, they give examples of fitting these functions with cubic smoothing splines, which assume underlying models. It seems these two statements are contradictory. The other question is whether parametric or non-parametric functions are more appropriate in generalized additive models. [Wei-Hao]
–From the reading it seems to me that the motivation for using GAM is to allow "non-parametric" functions to be added together. However, these are simply smoothed parametric basis expansions. How is this "non-parametric", and what has been gained? [Ashish]
–I tried but failed to figure out why the minimizer for the generalized linear additive model is an additive cubic spline model. Any clues? Do we not have to assume any functional forms for those f_i(x_i)? [Wei-Hao]
–Since the minimizer of the penalized sum of squares (Eq. 9.7) is an additive cubic spline, what are the justifications for using other smoothing functions like local polynomial regression or kernel methods? [Wei-Hao]

Questions about Model (cont'd)
–I noticed formula 9.7 on p. 259. In order to guarantee smoothness, we simply extend the method from one dimension to N dimensions using \sum f_j''. But this only guarantees smoothness along each single dimension and cannot guarantee smoothness over the whole feature space. It is quite possible that the function is very smooth along every dimension but quite bumpy "between dimensions". How can we solve this? Is this a shortcoming of backfitting? [Fan]
–It appears that extending the generalized additive model, for example by considering basis functions of several variables, would be quite messy. Therefore, even though the additive model is able to introduce nonlinearity into the model, it still lacks a strong ability to introduce nonlinear correlation between inputs into the model. Please comment on this by comparing it to kernel functions. [Rong]

Fitting the Additive Model: the Backfitting Algorithm
1. Initialize: \hat\alpha = \frac{1}{N}\sum_{i=1}^{N} y_i, and \hat f_j \equiv 0 for all j.
2. Cycle over j = 1, 2, …, p, …, 1, 2, …, p, … (m cycles):
   \hat f_j \leftarrow S_j\big[\{\, y_i - \hat\alpha - \sum_{k \ne j} \hat f_k(x_{ik}) \,\}_{i=1}^{N}\big], \qquad \hat f_j \leftarrow \hat f_j - \frac{1}{N}\sum_{i=1}^{N} \hat f_j(x_{ij}),
   until the functions \hat f_j change less than a prespecified threshold.
Intuitive motivation for the backfitting algorithm
–If the additive model is correct, then \mathrm{E}\big[\, Y - \alpha - \sum_{k \ne j} f_k(X_k) \,\big|\, X_j \,\big] = f_j(X_j), so each \hat f_j can be estimated by smoothing the partial residuals against X_j.
–Cost: m cycles require mp applications of a one-dimensional smoother; a cubic smoothing spline fit takes N\log N + N operations.
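The cycle above is straightforward to prototype. The sketch below is an illustrative assumption (neither the function names nor the kernel smoother come from the slides); any one-dimensional linear smoother can play the role of S_j:

```python
import numpy as np

def backfit(X, y, smoother, n_cycles=20, tol=1e-6):
    """Backfitting for the additive model y ~ alpha + sum_j f_j(X_j).
    `smoother(x, r)` is assumed to be any one-dimensional smoother that
    returns fitted values of the partial residuals r against feature x."""
    N, p = X.shape
    alpha = y.mean()
    fitted = np.zeros((N, p))           # column j holds f_j evaluated at x_ij
    for _ in range(n_cycles):
        max_change = 0.0
        for j in range(p):
            # Partial residuals: remove intercept and all other f_k.
            r = y - alpha - fitted.sum(axis=1) + fitted[:, j]
            new_fj = smoother(X[:, j], r)
            new_fj -= new_fj.mean()     # center f_j so it averages to zero
            max_change = max(max_change, np.abs(new_fj - fitted[:, j]).max())
            fitted[:, j] = new_fj
        if max_change < tol:
            break
    return alpha, fitted

# Illustrative one-dimensional smoother (Gaussian kernel average):
def kernel_smooth(x, r, bandwidth=0.3):
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w @ r) / w.sum(axis=1)
```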

Fitting Logistic Regression (p. 99) vs. Fitting Additive Logistic Regression (p. 262)
Fitting logistic regression (p. 99):
1. Start with initial coefficient estimates.
2. Iterate:
   a. Form the working responses z_i = x_i^{T}\hat\beta + (y_i - \hat p(x_i)) / [\hat p(x_i)(1 - \hat p(x_i))].
   b. Form the weights w_i = \hat p(x_i)(1 - \hat p(x_i)).
   c. Use weighted least squares to fit a linear model to z_i with weights w_i, giving new coefficient estimates.
3. Continue step 2 until convergence.
Fitting additive logistic regression (p. 262, local scoring):
1. Compute starting values: \hat\alpha = \log[\bar y / (1 - \bar y)], where \bar y is the sample proportion of ones, and set \hat f_j \equiv 0 for all j.
2. Iterate:
   a. Form \hat\eta(x_i) = \hat\alpha + \sum_j \hat f_j(x_{ij}), \hat p(x_i) = 1/[1 + e^{-\hat\eta(x_i)}], and the working responses z_i = \hat\eta(x_i) + (y_i - \hat p(x_i)) / [\hat p(x_i)(1 - \hat p(x_i))].
   b. Form the weights w_i = \hat p(x_i)(1 - \hat p(x_i)).
   c. Use a weighted backfitting algorithm to fit an additive model to z_i with weights w_i, giving new estimates \hat\alpha, \hat f_j.
3. Continue step 2 until convergence.
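A minimal local-scoring loop for the additive case might look as follows. The routine `backfit_weighted(X, z, w)` and its return signature are assumptions for illustration (a weighted analogue of the backfitting sketch above), not code from the slides:

```python
import numpy as np

def local_scoring(X, y, backfit_weighted, n_iter=10):
    """Fit an additive logistic regression by local scoring: repeatedly
    form IRLS working responses and weights, then fit a weighted
    additive model to them by backfitting."""
    N, p = X.shape
    ybar = y.mean()
    alpha = np.log(ybar / (1.0 - ybar))    # starting value for the intercept
    fitted = np.zeros((N, p))              # f_j(x_ij), initialized to zero
    for _ in range(n_iter):
        eta = alpha + fitted.sum(axis=1)
        prob = 1.0 / (1.0 + np.exp(-eta))
        z = eta + (y - prob) / (prob * (1.0 - prob))   # working responses
        w = prob * (1.0 - prob)                        # working weights
        alpha, fitted = backfit_weighted(X, z, w)      # weighted additive fit
    return alpha, fitted
```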

Model Fitting
Minimize the penalized least squares criterion for the additive model, or maximize the penalized log-likelihood for the generalized additive model.
Convergence is not always guaranteed
–Convergence is guaranteed under certain conditions; see Chapter 5 of Generalized Additive Models
–In practice the algorithm converges in most cases
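For concreteness, the penalized sum of squares referred to here and in the questions as Eq. 9.7 has the standard form (a reconstruction, since the slide's formula is not in the transcript):

```latex
\mathrm{PRSS}(\alpha, f_1, \dots, f_p)
  = \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij}) \Big)^{2}
  + \sum_{j=1}^{p} \lambda_j \int f_j''(t_j)^{2} \, dt_j ,
```

where the λ_j ≥ 0 are tuning parameters; its minimizer is an additive cubic smoothing spline model, each f_j being a cubic spline with knots at the unique observed values of x_{ij}.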

Questions about Fitting Algorithm
–It seems to me that the ideas of "backfitting" and "backpropagation" are quite similar to each other. What makes them different? Under what conditions will Gauss-Seidel converge (for linear and nonlinear systems), and how fast? [Jian]
–For the local scoring algorithm for additive logistic regression, can we make a comparison with page 99, which is about fitting logistic regression models? [Yanjun]
–Could you explain Algorithm 9.2 more clearly? [Rong]
–On p. 260, by stipulating that Sum_{i=1..N} f_j(x_{i,j}) = 0 for all j, do we effectively force the variance over the training sample to be 0? (Note that alpha in this case is avg(y_i).) Why do we want to do that? What is the implication for the bias? [Ben]
–On p. 260, for the penalized sum of squares (9.7), if the matrix of input values is singular (uninvertible), the linear part of the f_j cannot be unique, but the nonlinear part can be... what does this mean, and why does this happen? [Yanjun]

Pros and Cons
PRO:
–No rigid parametric assumptions
–Can identify and characterize nonlinear regression effects
–Avoids the curse of dimensionality by assuming an additive structure
–More flexible than linear models while still retaining much of their interpretability
CON:
–More complicated to fit
–Has limitations for large data-mining applications

Where GAMs are Useful
–The relationship between the variables is expected to be of a complex form, not easily fitted by standard linear or non-linear methods.
–There is no a priori reason for using a particular model.
–We would like the data to suggest the appropriate functional form.
–Exploratory data analysis.

Questions about Link Functions
–For GAM, the book gives us three examples of link functions and points out that they come from exponential family sampling models, which in addition include the gamma and negative-binomial distributions. Can we find out why we use these kinds of functions as link functions? [Yanjun]
–Can you explain the intuition behind the classical link functions, and how each function corresponds to one particular data distribution, as is said on page 258? That is, g(x) = x corresponds to Gaussian response data, the logit to binomial probabilities, and g(x) = log(x) to Poisson count data. [Yan]
–On p. 259 various possibilities for formulating the link function g are mentioned. Could you hint at how we actually come up with a particular formulation? By correlation analysis? Cross-validation? [Ben]
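For reference in this discussion, the three classical link functions mentioned in the questions, written with μ = E(Y|X) (a reconstruction, not a quotation from the slides):

```latex
\begin{aligned}
g(\mu) &= \mu                        &&\text{(identity link; Gaussian response data)} \\
g(\mu) &= \log\!\frac{\mu}{1-\mu}    &&\text{(logit link; binomial probabilities, with probit as an alternative)} \\
g(\mu) &= \log(\mu)                  &&\text{(log link; Poisson count data)}
\end{aligned}
```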