Statistical Inference and Regression Analysis: GB.3302.30
Professor William Greene, Stern School of Business, IOMS Department and Department of Economics
Statistics and Data Analysis
Part 6 – Regression Model: Conditional Mean
U.S. Gasoline Price: 6-month and 5-year price charts
Impact of Change in Gasoline Price on Consumer Demand?
Elasticity concepts
Long term vs. short term
Income
Demand for gasoline
Demand for food
Movie Success vs. Movie Online Buzz Before Release (2009)
Internet Buzz and Movie Success
Box office sales vs. "Can't wait" votes 3 weeks before release
Is There Really a Relationship?
BoxOffice is obviously not equal to f(Buzz) for some function. But, they do appear to be “related,” perhaps statistically – that is, stochastically. There is a covariance. The linear regression summarizes it. A predictor would be Box Office = a + b Buzz. Is b really > 0? What would be implied by b > 0?
Covariation – Education and Life Expectancy
Causality? Covariation? Does more education make people live longer? Is there a hidden driver of both? (Per capita GDP?)
Using Regression to Predict
The equation would not predict Titanic. Predictor: Overseas box office = a + b Domestic box office. The prediction will not be perfect; we construct a range of "uncertainty."
Conditional Variation and Regression
Conditional distribution of a pair of random variables: f(y|x) or P(y|x)
Mean function: E[y|x] = the regression of y on x
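For reference, the mean function has its standard definition (not spelled out in the slide text):
E[y|x] = Σy y·P(y|x) in the discrete case, or E[y|x] = ∫ y·f(y|x) dy in the continuous case.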
Expected Income Depends on Household Size
y|x ~ Normal[ x, 42 ], x = 1,2,3,4; Poisson
Average Box Office by Internet Buzz Index = Average Box Office for Buzz in Interval
Linear Regression? Fuel Bills vs. Number of Rooms
Independent vs. Dependent Variables
Y in the model: dependent variable; response variable
X in the model: independent variable (note the meaning of 'independent'); regressor; covariate
Conditional vs. joint distribution
Linearity and Functional Form
y = g(x); h(y) = α + β f(x)
y = α + βx
y = exp(α + βx); log y = α + βx
y = α + β(1/x) = α + β f(x)
y = e^α x^β; log y = α + β log x. Etc.
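To make the linearization concrete, here is a minimal Python sketch (synthetic data and made-up parameter values, purely for illustration) that fits the power form y = e^α x^β by running a straight-line fit on log y vs. log x:

    import numpy as np

    # Synthetic data from the power form y = e^alpha * x^beta
    # (alpha = 0.5, beta = 1.3 chosen arbitrarily for the illustration).
    rng = np.random.default_rng(0)
    x = rng.uniform(1.0, 10.0, size=200)
    y = np.exp(0.5) * x**1.3 * np.exp(rng.normal(0.0, 0.1, size=200))

    # The model is linear in the logs: log y = alpha + beta * log x,
    # so an ordinary straight-line fit on the transformed data recovers both parameters.
    beta_hat, alpha_hat = np.polyfit(np.log(x), np.log(y), deg=1)  # returns (slope, intercept)
    print(alpha_hat, beta_hat)  # estimates land near 0.5 and 1.3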
Inference and Regression
Least Squares
Fitting a Line to a Set of Points
Gauss's method of least squares: the predictions are a + bxi, and the residuals are the vertical deviations of the observed yi from the fitted line. Choose a and b to minimize the sum of squared residuals.
Least Squares Regression
Least Squares Algebra
Least Squares
Normal Equations
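Minimizing the sum of squared residuals Σi (yi − a − bxi)² with respect to a and b gives the two first-order conditions known as the normal equations:
Σi (yi − a − bxi) = 0
Σi xi (yi − a − bxi) = 0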
Computing the Least Squares Parameters a and b
(We will use sy² later.)
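Solving the normal equations gives the standard formulas:
b = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)² = sxy / sx²
a = ȳ − b x̄

A minimal numerical sketch of the same computation in Python (the data here are synthetic, for illustration only):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=100)
    y = 1.0 + 2.0 * x + rng.normal(size=100)  # made-up "true" line: a = 1, b = 2

    # Least squares slope and intercept from the formulas above.
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    a = y.mean() - b * x.mean()
    print(a, b)  # estimates should be near 1 and 2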
Least Absolute Deviations
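For reference (a standard definition, not spelled out in the slide text): least absolute deviations chooses a and b to minimize Σi |yi − a − bxi| rather than the sum of squared residuals, which makes the fit less sensitive to outliers.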
Least Squares vs. LAD
Inference and Regression
Regression Model
b Measures Covariation
Predictor: Box Office = a + b Buzz.
Interpreting the Function
a = the life expectancy associated with 0 years of education. No country has zero average years of education; the regression only applies within the range of experience (education).
b = the increase in life expectancy associated with each additional year of average education.
Covariation and Causality
Does more education make you live longer (on average)?
Causality? Correlation = 0.84 (!)
Height (inches) and income ($/mo.) in first post-MBA job (men). WSJ, 12/30/86. Estimated: Income = a + b Height.
Inference and Regression
Analysis of Variance
Regression Fits: regression of salary vs. years of experience; regression of fuel bill vs. number of rooms for a sample of homes.
Regression Arithmetic
Variance Decomposition
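The decomposition itself is the standard identity
Σi (yi − ȳ)² = Σi (ŷi − ȳ)² + Σi ei²
Total SS = Regression SS + Residual SS,
where the cross-product term vanishes because the least squares residuals are uncorrelated with the fitted values.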
Fit of the Equation to the Data
Regression vs. Residual SS
Analysis of Variance Table
Source       Degrees of Freedom   Sum of Squares    Mean Square             F Ratio    P Value
Regression   1                    Σi (ŷi − ȳ)²      Regression SS / 1       MSR / MSE  2P[z > √F]*
Residual     N − 2                Σi ei²            Residual SS / (N − 2)
Total        N − 1                Σi (yi − ȳ)²
Explained Variation
The proportion of variation "explained" by the regression is called R-squared (R²). It is also called the coefficient of determination. (It is the square of something – to be shown later.)
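In terms of the variance decomposition above (a standard identity):
R² = Regression SS / Total SS = 1 − Residual SS / Total SS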
Movie Madness Fit
Regression Fits: four example regressions, with R² = 0.360, 0.522, 0.424, and 0.880.
R-Squared Benchmarks
Aggregate time series: expect .9+.
Cross sections: .5 is good; sometimes we do much better (R² = 0.924 in this cross section).
Large survey data sets: .2 is not bad.
Correlation Coefficient
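For reference, the sample correlation coefficient is
rxy = sxy / (sx sy) = Σi (xi − x̄)(yi − ȳ) / √[ Σi (xi − x̄)² · Σi (yi − ȳ)² ].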
Correlations (four example scatterplots): rxy = 0.6, rxy = 0.723, rxy = 0.961, rxy = −.402.
R-Squared is rxy²
R-squared is the square of the correlation between yi and the predicted value a + bxi. The correlation between yi and (a + bxi) is the same as the correlation between yi and xi. Therefore, a regression with a high R² predicts yi well.
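A quick numerical confirmation of this fact in Python (synthetic data, illustration only):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=500)
    y = 3.0 - 1.5 * x + rng.normal(scale=2.0, size=500)  # arbitrary illustrative line

    # Fit the least squares line.
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    a = y.mean() - b * x.mean()
    yhat = a + b * x

    # R-squared as 1 - Residual SS / Total SS ...
    r_squared = 1.0 - np.sum((y - yhat)**2) / np.sum((y - y.mean())**2)
    # ... equals the squared correlation between y and x.
    r_xy = np.corrcoef(x, y)[0, 1]
    print(r_squared, r_xy**2)  # the two printed numbers match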
Squared Correlations: rxy² = 0.522, rxy² = 0.36, rxy² = .924, rxy² = .161.
Movie Madness: reading the regression output
Estimated equation; estimated coefficients a and b
S = se = the estimated standard deviation of ε; S² = se²
R² = the square of the sample correlation between x and y
N − 2 = degrees of freedom
Σi ei² = the sum of squared residuals
Software
MONET.MPJ
Use File > Open Worksheet to open an Excel .xls or .xlsx file.
Stat > Basic Statistics > Display Descriptive Statistics
Stat > Regression > Regression
Results to Report
Linear Regression: Sample Regression Line
Project > Import > Variables imports .csv files.
Command Typed in Editing Window
Place the cursor in the desired line of text (or highlight more than one line), then press the GO button.
Typing Commands in the Editor
Important Commands:
SAMPLE ; first - last $
  Sample ; 1 - 1000 $
  Sample ; All $
CREATE ; Variable = transformation $
  Create ; LogMilk = Log(Milk) $
  Create ; LMC = .5*Log(Milk)*Log(Cows) $
  Create ; ... any algebraic transformation $
Name Conventions
CREATE ; name = any function desired $
Name is the name of a new variable:
No more than 8 characters in a name
The first character must be a letter
May not contain -, +, *, /. May contain _.
Model Command
Model ; Lhs = dependent variable
      ; Rhs = list of independent variables $
Regress ; Lhs = Milk ; Rhs = ONE,Feed,Labor,Land $
ONE requests the constant term.
The Go Button
“Submitting” Commands
One command: place the cursor on that line and press the "Go" button.
More than one command: highlight all the lines (as in any text editor), then press "Go."
Compute a Regression
Sample ; All $
Regress ; Lhs = YIT
        ; Rhs = One,X1,X2,X3,X4 $
One is the constant term in the model.
Standard Three Window Operation
Commands are typed in the editing window.
The project window shows the variables.
Results appear in the output window.
Inference and Regression
Regression Model
The Linear Regression Statistical Model
The linear regression model
Sample statistics and population quantities
Specifying the regression model
A Linear Regression
Predictor: Box Office = a + b Buzz
Data and Relationship
We suggested the relationship between box office and internet buzz is Box Office = a + b Buzz. Note the obvious inconsistency in the figure: this is not the relationship. How do we reconcile the equation with the data?
Modeling the Underlying Process
A model that explains the process that produces the data that we observe:
Observed outcome = the sum of two parts
(1) Explained: the regression line
(2) Unexplained (noise): the remainder
The "model" is the statement that part (1) is the same process from one observation to the next.
The Population Regression
THE model: a specific statement about the parts of the model.
(1) Explained: Box Office = α + β Buzz
(2) Unexplained: the rest is "noise," ε. Random ε has certain characteristics.
Model statement: Box Office = α + β Buzz + ε
The Data Include the Noise
What Explains the Noise?
Assumptions
(Regression) The equation linking "Box Office" and "Buzz" is stable: E[Box Office | Buzz] = α + β Buzz.
Another sample of movies, say 2012, would obey the same fundamental relationship.
Model Assumptions: yi = α + βxi + εi
α + βxi is the "regression function." It contains the "information" about yi in xi. It is unobserved because α and β are not known for certain.
εi is the "disturbance," the unobserved random component.
The observed yi is the sum of two unobserved parts.
Model Assumptions About εi
εi is a random variable with mean zero: the regression is the mean of yi, and εi is the deviation from the regression.
Variance σ².
εi is unrelated to any values of xi (no covariance) – it's "random noise."
εi is unrelated to any other disturbance εj (not "autocorrelated").
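A minimal simulation sketch in Python (all parameter values hypothetical) of data generated under exactly these assumptions, with least squares recovering α and β:

    import numpy as np

    rng = np.random.default_rng(3)
    alpha, beta, sigma = 2.0, 0.75, 1.5      # hypothetical population values
    x = rng.uniform(0.0, 10.0, size=1000)
    eps = rng.normal(0.0, sigma, size=1000)  # mean 0, constant variance, drawn independently of x
    y = alpha + beta * x + eps

    # Least squares recovers the population line from the noisy sample.
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    a = y.mean() - b * x.mean()
    print(a, b)  # close to alpha = 2.0 and beta = 0.75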
Sample “Estimate” vs. Population
Application: Health Care Data
German Health Care Usage Data: 27,326 observations on German households.
DOCTOR = 1(Number of doctor visits > 0)
HOSPITAL = 1(Number of hospital visits > 0)
HSAT = health satisfaction, coded 0 (low) - 10 (high)
DOCVIS = number of doctor visits in last three months
HOSPVIS = number of hospital visits in last calendar year
PUBLIC = insured in public health insurance = 1; otherwise = 0
ADDON = insured by add-on insurance = 1; otherwise = 0
INCOME = household nominal monthly net income in German marks
HHKIDS = children under age 16 in the household = 1; otherwise = 0
EDUC = years of schooling
AGE = age in years
MARRIED = marital status
Sample vs. Population
For the full 'population' of 27,326 households: Income = α + β Educ + ε.
For a random sample of 52 households, least squares regression produces: Income = a + b Educ + e.
Sample vs. Population
Disturbances vs. Residuals
ε = y − α − β Buzz; e = y − a − b Buzz
Standard Deviation of Residuals
The standard deviation of εi = yi − α − βxi is σ: σ = √E[εi²] (the mean of εi is zero).
The sample a and b estimate α and β.
The residual ei = yi − a − bxi estimates εi.
Use √[(1/N) Σi ei²] to estimate σ? Close, but not quite.
Why N − 2? It relates to the fact that two parameters (α, β) were estimated. Proof to come later.
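The corrected estimator is the standard error of the regression,
se = √[ Σi ei² / (N − 2) ].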
Residuals
Samples and Populations
Population (Theory)                          Sample (Observed)
yi = α + βxi + εi                            yi = a + bxi + ei
Parameters α, β                              Estimates a, b
Regression α + βxi = mean of yi | xi         Fitted regression a + bxi = predicted yi | xi
Disturbance εi: mean 0, std. deviation σ,    Residuals ei: sample mean 0, sample std. dev. se,
  no correlation with xi                       sample Cov[x, e] = 0
Linear Regression: Sample Regression Line
A Cost Model (Electricity.mpj)
Total cost in $Million; output in Million KWH.
N = 123 American electric utilities.
Model: Cost = α + β KWH + ε
Cost Relationship
Sample Regression
Interpreting the Model
Cost = a + b Output + e, where Cost is in $Million and Output is in Million KWH.
Fixed cost = the cost when output = 0: here, Fixed Cost = $2.44 million.
Marginal cost = change in cost / change in output = b $Million per Million KWH = b $/KWH = 100b cents/KWH.
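As a purely hypothetical illustration of the unit conversion (this slope is made up, not the slide's estimate): if b = 0.005 $Million per Million KWH, then marginal cost = 0.005 $/KWH = 0.5 cents per KWH.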