
1 By Matt Bogard, M.S. May 12, 2011

2  Single Variable Regression  Multivariable Regression  Logistic Regression  Data Mining vs. Classical Statistics  Decision Trees  Neural Networks

3

4  Can we describe this relationship with the equation for a line?  Fitting a line to the data gives us the regression equation  How well does this line fit our data? How well does it describe the relationship between the variables (x & y)?  Interpretation & inference

5

6

7

8

9  The goal then is to minimize the sum of squared residuals. That is, minimize:  ∑eᵢ² = ∑(yᵢ − b₀ − b₁Xᵢ)² with respect to b₀ and b₁.  This can be accomplished by taking the partial derivative of ∑eᵢ² with respect to each coefficient and setting it equal to zero.  ∂∑eᵢ²/∂b₀ = 2∑(yᵢ − b₀ − b₁Xᵢ)(−1) = 0  ∂∑eᵢ²/∂b₁ = 2∑(yᵢ − b₀ − b₁Xᵢ)(−Xᵢ) = 0  Solving for b₀ and b₁ yields:  b₀ = ȳ − b₁x̄  b₁ = ∑(Xᵢ − x̄)(yᵢ − ȳ) / ∑(Xᵢ − x̄)² = SS(X,y) / SS(X)
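A quick numerical check of these closed-form estimates. The following is a minimal R sketch with made-up data (the variable names and simulated values are purely illustrative); coef(lm()) should reproduce b0 and b1:

set.seed(1)                      # hypothetical example data
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50)

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # SS(X,y)/SS(X)
b0 <- mean(y) - b1 * mean(x)                                     # ybar - b1*xbar
c(b0 = b0, b1 = b1)

coef(lm(y ~ x))                  # should match the closed-form estimates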

10 Larger R² -> better fit; larger F -> significance of β (model)  R² = SSR/SST  MSR = SSR/df(regression)  MSE = SSE/df(error)  F = MSR/MSE  E(MSE) = σ²  E(MSR) = σ² + β²∑(x − x̄)²  If β = 0 then F ≈ σ²/σ² = 1  If β ≠ 0 then F ≈ [σ² + β²∑(x − x̄)²]/σ² > 1
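The sums-of-squares decomposition is easy to verify numerically. Continuing the hypothetical x and y from the sketch above, a short R check that builds R² and F by hand and compares them with summary(lm()):

fit <- lm(y ~ x)                         # same x, y as in the sketch above
sst <- sum((y - mean(y))^2)
sse <- sum(resid(fit)^2)
ssr <- sst - sse

r2  <- ssr / sst                         # R-squared
msr <- ssr / 1                           # regression df = 1 (one slope)
mse <- sse / (length(y) - 2)             # error df = n - 2
f   <- msr / mse

c(R2 = r2, F = f)
summary(fit)$r.squared                   # should equal r2
summary(fit)$fstatistic[1]               # should equal f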

11  VAR(bⱼ) & SE(bⱼ)  (will discuss later)  Test H₀: βⱼ = β₀  t = (bⱼ − β₀)/SE(bⱼ)  Note: if H₀: βⱼ = β₀ = 0 then t = bⱼ/SE(bⱼ), which gives the same result as the F-test in a single variable regression (t² = F)
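A quick R check of that note, using the fit object from the sketch above: the slope's t-statistic squared should equal the model F-statistic.

tval <- coef(summary(fit))["x", "t value"]   # t = b1 / SE(b1)
c(t_squared = tval^2,
  F_stat    = summary(fit)$fstatistic[1])    # the two should agree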

12

13

14

15

16

17
LIBNAME ADHOC 'file path'; /* GET DATA */

DATA LSD;
INPUT SCORE CONC;
CARDS;
78.93 1.17
58.20 2.97
67.47 3.26
37.47 4.69
45.65 5.83
32.92 6.00
29.97 6.41
;
RUN;

PROC REG;
MODEL SCORE=CONC;
PLOT SCORE*CONC; /* PLOTS REGRESSION LINE FIT TO DATA */
RUN;
QUIT;
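For comparison, a minimal R version of the same fit (the score and concentration values are taken directly from the CARDS block above):

score <- c(78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97)
conc  <- c(1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41)

fit_lsd <- lm(score ~ conc)
summary(fit_lsd)               # coefficients, R-squared, F-test
plot(conc, score)              # scatter plot of the data
abline(fit_lsd)                # regression line fit to the data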

18
PROC GLM DATA = LSD;
MODEL SCORE=CONC;
RUN;
QUIT;

19

20  y = b₀ + b₁X₁ + b₂X₂ + e  Often viewed in the context of matrices  Represented by y = Xb + e  b = (XᵀX)⁻¹Xᵀy, the matrix analogue of SS(X,y)/SS(X) from the single variable case  In R: b <- solve( t(x) %*% x ) %*% ( t(x) %*% y )
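A runnable sketch of that matrix formula with hypothetical data for two predictors; the column of ones in X supplies the intercept, and coef(lm()) should agree:

set.seed(2)                              # hypothetical data
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 2 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                    # design matrix with intercept column
b <- solve(t(X) %*% X) %*% (t(X) %*% y)  # b = (X'X)^-1 X'y
b

coef(lm(y ~ x1 + x2))                    # should match b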

21

22  (1) E(y|X) = Xβ 'we are estimating a linear approximation to the conditional expectation of y'  (2) E(e) = 0 'white noise error terms'  (3) VAR(e) = σ²I 'constant variance: no heteroskedasticity & no serial correlation'  (4) Rank(X) = k 'no perfect multicollinearity'

23  Why are we concerned with the error terms?  Recall b = (X'X)⁻¹X'y; with E(e) = 0 the estimate is unbiased, E(b) = β  VAR(b) = s²(X'X)⁻¹ where s² = MSE = e'e/(n − k) = ∑eᵢ²/(n − k)  Note SE(b) = √VAR(b) and t = (bⱼ − β₀)/SE(bⱼ)  Note F = MSR/MSE and E(MSE) = σ²  If we have σᵢ² instead of a constant σ² then we run into issues with hypothesis testing and making inferences
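A small R check of VAR(b) = s²(X'X)⁻¹, continuing the two-predictor sketch above; vcov(lm()) should return the same matrix:

fit2 <- lm(y ~ x1 + x2)                      # same data as the matrix sketch
e    <- resid(fit2)
k    <- ncol(X)                              # number of estimated coefficients
s2   <- sum(e^2) / (n - k)                   # s^2 = MSE = e'e/(n - k)

Vb <- s2 * solve(t(X) %*% X)                 # VAR(b) = s^2 (X'X)^-1
sqrt(diag(Vb))                               # SE(b)

vcov(fit2)                                   # should match Vb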

24  Maybe rank(X) = k, but there is still some correlation between the X variables in the regression.

25  Blue -> information used to estimate b for x1  Green -> information used to estimate b for x2  Red -> correlation between x1 and x2  As corr(x1,x2) increases, blue and green decrease and red increases (the circles overlap)  Less information is used to estimate the b's, which leads to increased variance in the estimates

26  R²: the b's jointly can still explain variation in y.  Research: inferences about the specific relationship between X₁ and Y rely on SE(b), which is inflated by multicollinearity  Forecasting/Prediction: we are more comfortable with multicollinearity (Greene, 1990; Kennedy, 2003; Studenmund, 2001)

27

28

29

30
DATA REG;
INPUT INTEREST INFLATION INVESTMENT;
CARDS;
5.16 4.4 .161
5.87 5.15 .172
5.95 5.37 .158
4.88 4.99 .173
4.50 4.16 .195
6.44 5.75 .217
7.83 8.82 .199
6.25 9.31 .163
5.5 5.21 .195
5.46 5.83 .231
7.46 7.40 .257
10.28 8.64 .259
11.77 9.31 .225
13.42 9.44 .241
11.02 5.99 .204
;
RUN;

31
PROC REG DATA = REG;
MODEL INVESTMENT = INTEREST INFLATION / VIF;
RUN;
QUIT;
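The VIF reported by SAS can also be computed by hand as 1/(1 − R²) from an auxiliary regression of one predictor on the other(s). A short R sketch using the interest and inflation values keyed in from the data step above:

interest  <- c(5.16, 5.87, 5.95, 4.88, 4.50, 6.44, 7.83, 6.25,
               5.5, 5.46, 7.46, 10.28, 11.77, 13.42, 11.02)
inflation <- c(4.4, 5.15, 5.37, 4.99, 4.16, 5.75, 8.82, 9.31,
               5.21, 5.83, 7.40, 8.64, 9.31, 9.44, 5.99)
investment <- c(.161, .172, .158, .173, .195, .217, .199, .163,
                .195, .231, .257, .259, .225, .241, .204)

summary(lm(investment ~ interest + inflation))           # the regression itself
1 / (1 - summary(lm(interest ~ inflation))$r.squared)    # VIF; with only two
                                                         # predictors both share it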

32  Ex: retention = Y/N  If y = {0 or 1} then E[y|X] = Pᵢ, a probability interpretation  Estimated probabilities can fall outside (0,1)  e ~ binomial rather than normal  var(eᵢ) = Pᵢ(1 − Pᵢ), which varies with X and violates the assumption of uniform variance

33  Note however, despite theoretical concerns, OLS is used quite often without practical implications  Example: Statistical Alternatives for Studying College Student Retention: A Comparative Analysis of Logit, Probit, and Linear Regression. Dey & Astin. Research in Higher Education Vol 34 No 5, 1993.

34 Pᵢ = Prob(y = 1 | x) = e^(Xβ) / (1 + e^(Xβ))

35  Choose β’s to maximize the likelihood of the sample being observed.  Maximizes the likelihood that data comes from a ‘real world’ characterized by one set of β’s vs another.

36  L(β) = ∏(y=1) e^(Xβ)/(1 + e^(Xβ)) · ∏(y=0) 1/(1 + e^(Xβ))  the product of the densities that give p(y = 1) and p(y = 0)  Take the ln of both sides and choose β to maximize → β_MLE
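A hedged R sketch of maximum likelihood estimation: write the log-likelihood above as a function, maximize it numerically with optim(), and compare with glm(..., family = binomial). The simulated data and starting values are made up for illustration.

set.seed(3)                                    # hypothetical data
n  <- 500
x1 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1))    # 'true' betas: -0.5 and 1.2

X <- cbind(1, x1)
negloglik <- function(beta) {
  p <- plogis(X %*% beta)                      # e^(Xb)/(1 + e^(Xb))
  -sum(y * log(p) + (1 - y) * log(1 - p))      # negative log-likelihood
}

opt <- optim(c(0, 0), negloglik)               # beta_MLE by numerical search
opt$par

coef(glm(y ~ x1, family = binomial))           # should be close to opt$par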

37  NOT minimizing sums of squares, not fitting a line to data  NO R²  Changes in log-likelihood are compared for full vs. restricted models to provide measures of 'deviance'  Deviance is used for fit statistics such as AIC, the chi-square test, and pseudo-R²

38 (from Applied Choice Analysis, Hensher, Rose & Greene, 2005)  Based on ratios of deviance for full vs. restricted models. Not directly comparable to R² from OLS

39  % correct predictions  % correct 1’s  % correct 0’s
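One way to obtain these classification rates from the logistic model fit in the sketch above is to compare predicted probabilities against a cutoff (0.5 here, an arbitrary illustrative choice) in a 2x2 table:

fit_logit <- glm(y ~ x1, family = binomial)      # model from the sketch above
pred <- ifelse(fitted(fit_logit) > 0.5, 1, 0)    # classify at a 0.5 cutoff

table(actual = y, predicted = pred)

mean(pred == y)            # % correct predictions overall
mean(pred[y == 1] == 1)    # % correct 1's
mean(pred[y == 0] == 0)    # % correct 0's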

40

41  β = change in the log odds of y given a one-unit change in X  e^β = odds ratio

42 PROC LOGISTIC

43
ODS GRAPHICS ON;
ODS HTML;
PROC LOGISTIC DATA = ADHOC.LOGIT PLOTS = ROC OUTMODEL = MODEL1;
MODEL CLASS = X1 X2 / RSQ LACKFIT;
SCORE OUT = SCORE1 FITSTAT;
RUN;
QUIT;

44  “There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown”  "Approaching problems by looking for a data model imposes an apriori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems." From Statistical Modeling: The Two Cultures. Statistical Science 2001, Vol. 16, No. 3, 199–231. Leo Breiman.

45  Classical Statistics: Focus is on hypothesis testing of causes and effects and interpretability of models. Model choice is based on parameter significance and in-sample goodness-of-fit.  Example: Regression, Logit/Probit, Duration Models, Discriminant Analysis  Machine Learning: Focus is on predictive accuracy even in the face of lack of interpretability of models. Model choice is based on cross validation of predictive accuracy using partitioned data sets.  Example: Classification and Regression Trees, Neural Nets, K-Nearest Neighbors, Association Rules, Cluster Analysis

46  'prediction error over an independent test sample'  A function of the bias and variance a model exhibits across multiple data sets  There is a bias-variance trade-off related to model complexity

47  Partition data into training, validation, and test samples (if data is sufficient); a minimal partitioning sketch follows below  Other methods: k-fold cross validation, random forests, ensemble models  Choose inputs (and model specification) that optimize model performance on the validation and test data
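A hedged R sketch of a simple random partition (the 60/20/20 proportions and the data frame are arbitrary illustrative choices, not from the slides):

set.seed(4)
dat <- data.frame(x = rnorm(300), y = rnorm(300))    # hypothetical modeling data

idx <- sample(c("train", "valid", "test"), nrow(dat), replace = TRUE,
              prob = c(0.6, 0.2, 0.2))               # 60/20/20 split

train <- dat[idx == "train", ]
valid <- dat[idx == "valid", ]
test  <- dat[idx == "test", ]

sapply(list(train = train, valid = valid, test = test), nrow)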

48  “Tree-based methods partition the feature space into a set of rectangles, and then fit a simple model (like a constant) in each one.” (Trevor Hastie, Robert Tibshirani & Jerome Friedman, 2009)

49

50

51  Each split creates a cross tabulation  The split is evaluated with a chi-square test; example R output:

Pearson's Chi-squared test
data: tab1
X-squared = 52.3918, df = 1, p-value = 4.546e-13
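A sketch of how output like that is produced: cross-tabulate the candidate split against the target and run chisq.test(). The counts below are hypothetical, not the tab1 behind the slide's output.

tab1 <- matrix(c(120, 30,                    # rows: left/right side of the split
                  60, 90), nrow = 2, byrow = TRUE,
               dimnames = list(split = c("left", "right"),
                               target = c("0", "1")))

chisq.test(tab1)   # a small p-value suggests the split separates the target classes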

52

53

54

55

56

57  A nonlinear model of complex relationships composed of multiple 'hidden' layers (similar to composite functions)  Y = f(g(h(x))) or  x -> hidden layers -> Y

58  ACTIVATION FUNCTION: formula used for transforming values from inputs and the outputs in a neural network.  COMBINATION FUNCTION: formula used for combining transformed values from activation functions in neural networks.  HIDDEN LAYER: The layer between input and output layers in a neural network.  RADIAL BASIS FUNCTION: A combination function that is based on the Euclidean distance between inputs and weights

59

60  Hidden layer: h₁ = logit(w₁₀ + w₁₁x₁ + w₁₂x₂), h₂ = logit(w₂₀ + w₂₁x₁ + w₂₂x₂), h₃ = logit(w₃₀ + w₃₁x₁ + w₃₂x₂), h₄ = logit(w₄₀ + w₄₁x₁ + w₄₂x₂)  Output layer: Y = W₀ + W₁h₁ + W₂h₂ + W₃h₃ + W₄h₄
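A minimal R sketch of one forward pass through this 2-input, 4-hidden-unit architecture; the weights and inputs are arbitrary made-up numbers, and plogis() plays the role of the logistic activation written as logit(...) on the slide.

set.seed(5)
x <- c(1, 0.5, -1.2)                  # (1, x1, x2): the leading 1 carries the bias w_i0

W_hidden <- matrix(runif(12, -1, 1),  # 4 hidden units x 3 weights (w_i0, w_i1, w_i2)
                   nrow = 4)
h <- plogis(W_hidden %*% x)           # h_i = logistic(w_i0 + w_i1*x1 + w_i2*x2)

W_out <- runif(5, -1, 1)              # (W0, W1, ..., W4)
Y <- W_out[1] + sum(W_out[-1] * h)    # output layer: linear combination of hidden units
Y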

61

62

63  There is no 'theoretically sound' criterion for architecture selection in terms of the # of hidden units & hidden layers  The AutoNeural node 'automates' some of these choices to a limited extent  SAS Global Forum 2011: there was a presentation utilizing genetic algorithms for this  Neural networks don't address model selection; inputs are typically pre-filtered via decision tree & regression nodes  Interpretation is a challenge: finance companies employ them for marketing purposes but don't use them in areas subject to litigation (e.g., loan approvals)

64

65  Berry C, Hannenhalli S, Leipzig J, Bushman FD (2006). Selection of Target Sites for Mobile DNA Integration in the Human Genome. PLoS Comput Biol 2(11): e157. doi:10.1371/journal.pcbi.0020157 (supporting information Text S1)  Greene, William H. Econometric Analysis. 1990  Kennedy. A Guide to Econometrics. 5th Ed. 2003  Dey & Astin. Statistical Alternatives for Studying College Student Retention: A Comparative Analysis of Logit, Probit, and Linear Regression. Research in Higher Education, Vol. 34, No. 5, 1993  Breiman, Leo. Statistical Modeling: The Two Cultures. Statistical Science, 2001, Vol. 16, No. 3, 199–231  Hensher, Rose & Greene. Applied Choice Analysis. 2005  Hastie, Trevor, Robert Tibshirani & Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition. 2009  Goldberger, Arthur S. A Course in Econometrics. 1991  SAS Enterprise Miner  R Statistical Package http://www.r-project.org/

