Statistics for Social and Behavioral Sciences Session #9: Linear Regression and Conditional distribution Probabilities (Agresti and Finlay, Chapter 9)


Statistics for Social and Behavioral Sciences Session #9: Linear Regression and Conditional distribution Probabilities (Agresti and Finlay, Chapter 9) Prof. Amine Ouazad

Statistics Course Outline. Part I. Introduction and Research Design (Week 1). Part II. Describing Data (Weeks 2-4). Part III. Drawing Conclusions from Data: Inferential Statistics (Weeks 5-9). Part IV. Correlation and Causation: Regression Analysis (Weeks …). This is where we talk about Zmapp and Ebola! Firenze or Lebanese Express? Where we are right now! Describing associations between two variables

Last session How good are my predictions? How good is my model? – Use the R Squared = ESS/TSS. – TSS = ESS + SSE. – The notations TSS, ESS, SSE are widespread. – The variance is the square of the standard deviation. – The R squared is also the square of the correlation of the predicted value and the actual value.
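The identities recalled on this slide can be checked numerically. The sketch below is only an illustration: the data are made up (they are not the course's Census data) and the helper names `mean`, `least_squares` and `correlation` are hypothetical. It fits a least-squares line and verifies that TSS = ESS + SSE and that the R squared equals the squared correlation between predicted and actual values.

```python
# Made-up illustrative data: x = years of schooling, y = income in $1,000s.
x = [10, 12, 12, 14, 16, 16, 18, 20]
y = [18, 25, 22, 30, 41, 36, 44, 52]


def mean(values):
    return sum(values) / len(values)


def least_squares(x, y):
    """Return intercept a and slope b of the least-squares line y = a + b x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def correlation(u, v):
    """Pearson correlation between two equally long lists."""
    u_bar, v_bar = mean(u), mean(v)
    cov = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    var_u = sum((ui - u_bar) ** 2 for ui in u)
    var_v = sum((vi - v_bar) ** 2 for vi in v)
    return cov / (var_u * var_v) ** 0.5


a, b = least_squares(x, y)
pred = [a + b * xi for xi in x]
y_bar = mean(y)

tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
ess = sum((pi - y_bar) ** 2 for pi in pred)            # explained sum of squares
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))   # sum of squared errors
r_squared = ess / tss

print("TSS:", tss, "ESS + SSE:", ess + sse)
print("R squared:", r_squared, "corr(pred, y)^2:", correlation(pred, y) ** 2)
```

Dividing TSS, ESS and SSE by the number of observations turns the same identity into a statement about variances, which is the form used later in the session.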

Outline 1. Conditional distribution – What wage will I earn after graduation? 2. Probabilities (Chapter 4) After the Break: Probability Distributions, Chapter 4 of A&F

WHEN LaTisha Styles graduated from Kennesaw State University in Georgia in 2006 she had $35,000 of student debt. This obligation would have been easy to discharge if her Spanish degree had helped her land a well-paid job. But there is no shortage of Spanish-speakers in a nation that borders Latin America. So Ms Styles found herself working in a clothes shop and a fast-food restaurant for no more than $11 an hour. Frustrated, she took the gutsy decision to go back to the same college and study something more pragmatic. She majored in finance, and now has a good job at an investment consulting firm. Her debt has swollen to $65,000, but she will have little trouble paying it off.

A Contingency Table (From Previous Session) But can I do a regression analysis here? We will learn how to produce this later in the course. For now, let’s interpret/understand this. Shows the average weekly earnings for each year of education.

What wage will I earn after graduation? Data: Census of Population. The United States Census is a decennial census mandated by Article I, Section 2 of the United States Constitution, which states: "Representatives and direct Taxes shall be apportioned among the several States... according to their respective Numbers.... The actual Enumeration shall be made within three Years after the first Meeting of the Congress of the United States, and within every subsequent Term of ten Years." Variables: – Number of years of education completed. – Wage income. We can only perform regression analysis on quantitative variables.

Linear Relationship anybody? We can postulate that there is a linear relationship between wage income (y) and years of schooling (x). Using Greek letters here for the true relationship: y = α + β x + ε. Notice the importance of residuals (aka errors). Units of measurement matter. Make sure you read the fine print. – y is annual income in dollars. x is in years. Also, with a linear relationship, an additional year of education leads to the same increase in income at any stage of your education. – Makes sense? Check the contingency table. We keep the linear relationship as a convenient model.

Estimation of α and β. We estimate α and β by computing the values of a and b. We only have a sample, not the entire population. So, what earnings can we expect?

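Linear? A contingency table. Here, years of schooling (x) is quantitative discrete, so we can do both regression analysis and a contingency table! (… Continued …) [Figure: wage income (y) against years of schooling (x), with a fitted line of intercept a and slope b, i.e. a rise of b per 1 year of schooling.]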

The unconditional distribution of income and education. We find that the mean and the standard deviation of the variables are as follows: – Annual income y: mean $41,550, SD $48,659. – Years of schooling x: mean … years, SD 1.6 years. Assuming a bell-shaped distribution: – Most earnings will fall within mean ± 3 sd. – 95% of earnings will fall within mean ± 2 sd. – 68% of earnings will fall within mean ± 1 sd. Interesting: we could do a risk analysis with that data: – What is the probability that you earn more than mean + 2 sd? But the unconditional distribution of annual income mixes individuals with high and low levels of education…
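The empirical-rule intervals and the risk question on this slide can be worked out directly. The sketch below takes the income mean and SD quoted on the slide and, as an explicit assumption, treats income as exactly normal (the slide only says "bell shaped"). The helper name `empirical_interval` is hypothetical.

```python
from statistics import NormalDist

mean_income = 41_550   # unconditional mean of annual income, from the slide
sd_income = 48_659     # unconditional SD of annual income, from the slide


def empirical_interval(mean, sd, k):
    """Return the interval mean - k*sd .. mean + k*sd."""
    return mean - k * sd, mean + k * sd


for k in (1, 2, 3):
    low, high = empirical_interval(mean_income, sd_income, k)
    print(f"mean +- {k} sd: {low:,} to {high:,}")

# Assumption: income is exactly normal with the mean and SD above.
income = NormalDist(mu=mean_income, sigma=sd_income)
p_above = 1 - income.cdf(mean_income + 2 * sd_income)
print(f"P(income > mean + 2 sd) under normality: {p_above:.4f}")
```

With an SD larger than the mean, the lower bounds come out negative, which is a sign that the bell-shaped assumption is only a rough approximation for income.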

The conditional distribution of income and education. So instead of using the unconditional distribution of income (aka marginal distribution), we use the conditional distribution of income: “What is the distribution of income given that an individual studied for x years?” [Figure: earnings (y) against education (x).]

Understanding the mean of income given x. After x years of education, the predicted (mean) annual income will be a + b x. With x = 16, we find $42,915! Good or bad?

Understanding the risks: Approach #1. Use the fact that TSS = ESS + SSE. – The ESS measures how education explains the variance of earnings. From this we find that Var(y) = Var(predictions) + Var(error). – How do we go from TSS = ESS + SSE to this? Divide each term by N. But Var(y) is the variance of the unconditional distribution of y. How can we find the variance of earnings given a level of education? Given a level of education, the prediction is fixed, so Var(y given x) = Var(error). And thus the standard deviation of earnings given a level of education is: – SD(residuals) = sqrt(SSE/N) = $18,744. Applying the empirical rule, we find that most annual incomes will lie between $42,915 − 3 × $18,744 and $42,915 + 3 × $18,744, that is, between $0 (the lower bound is negative, so it is set to zero) and $99,147.
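The interval arithmetic of Approach #1 can be sketched as follows. The predicted mean at x = 16 and the residual SD are the figures quoted in the slides. The helper `conditional_interval` and its `floor` parameter are hypothetical, and clipping the lower bound at zero is an assumption based on income not being negative.

```python
predicted_mean = 42_915   # a + b x at x = 16, from the slides
sd_residuals = 18_744     # sqrt(SSE / N), from the slides


def conditional_interval(mean, sd, k, floor=0.0):
    """Empirical-rule interval mean +- k*sd, with the lower bound clipped at `floor`."""
    low = max(floor, mean - k * sd)
    high = mean + k * sd
    return low, high


low3, high3 = conditional_interval(predicted_mean, sd_residuals, 3)  # "most" incomes
low2, high2 = conditional_interval(predicted_mean, sd_residuals, 2)  # about 95%

print(f"+- 3 sd: {low3:,.0f} to {high3:,.0f}")
print(f"+- 2 sd: {low2:,.0f} to {high2:,.0f}")
```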

Understanding the risks: Approach #2. Use our beautiful formula: slope b = r × (s_y / s_x), where r is the correlation, s_x is the standard deviation of x (here 1.6) and s_y is the standard deviation of y (here 19,233.75). The slope is $2,… Hence the correlation between earnings and education is r = b × s_x / s_y. It is lower than 1 because the linear relationship doesn’t hold exactly. The r² is thus the square of that correlation. Notice the variance of the error: Var(error) = (1 − R²) × Var(y). And thus sd(residuals) = sqrt(1 − R²) × SD(y). We find: $18,744! Same as before!
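The relation sd(residuals) = sqrt(1 − R²) × SD(y) can be checked against the two numbers that survive in this transcript. The R squared, the correlation and the full slope are not legible here, so the sketch below back-solves them from SD(y) and sd(residuals). Those back-solved values are derived quantities, not figures taken from the slides, and the function names are hypothetical.

```python
import math

sd_y = 19_233.75        # standard deviation of y, from the slide
sd_x = 1.6              # standard deviation of x, from the slide
sd_residuals = 18_744   # residual SD reported by both approaches


def residual_sd(r_squared, sd_y):
    """sd(residuals) = sqrt(1 - R^2) * SD(y)."""
    return math.sqrt(1 - r_squared) * sd_y


def implied_r_squared(sd_residuals, sd_y):
    """Invert the relation above: R^2 = 1 - (sd(residuals) / SD(y))^2."""
    return 1 - (sd_residuals / sd_y) ** 2


r2 = implied_r_squared(sd_residuals, sd_y)   # back-solved, not from the slides
r = math.sqrt(r2)                            # positive root, assuming a positive slope
slope = r * sd_y / sd_x                      # b = r * s_y / s_x

print(f"implied R^2: {r2:.4f}, implied r: {r:.4f}, implied slope: {slope:,.0f}")
print(f"round trip sd(residuals): {residual_sd(r2, sd_y):,.0f}")
```

The implied slope lands in the low thousands of dollars per year of schooling, which agrees with the truncated "$2," visible on the slide.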

Where will your earnings lie with 95% probability? The Empirical Rule. The conditional distribution has a lower standard deviation… a higher mean than the unconditional distribution. [Figure: frequency against earnings for the unconditional and the conditional distribution.]

Wrap up. With a linear relationship y = a + b x + e: – The unconditional (i.e. marginal) distribution of y has a larger variance than the conditional distribution of y given x. The mean of the conditional distribution of y given x is a + b x. And the standard deviation is the standard deviation of the errors e_i. This standard deviation is equal to sqrt(SSE/N). Again, N in the denominator. Proper discussion of this to follow.

Outline 1. Conditional distribution – What wage will I earn after graduation? 2. Probabilities (Chapter 4) After the Break: Probability Distributions, Chapter 4 of A&F

Probability and Luck We play a game together… – Heads you win 1 dirham. – Tails I win 1 dirham. We play the game a very large number of times. Should you play this game? P(heads) = 0.5, P(tails) = 0.5

Probability and Luck. P(heads) is read as “probability of heads”. P(heads) = 1 – P(not heads). Perform the game for a very large number of draws. Game sequence: – In the long run, with a balanced coin, 0.5 of the trials will lead to heads, 0.5 of the trials will lead to tails. – The probability of heads is the ratio of the number of heads to the number of trials, with an infinite number of draws: the longer the game, the closer the ratio will tend to be to 0.5.
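The long-run behaviour described on this slide can be simulated. The sketch below flips a simulated balanced coin and tracks the share of heads after each flip. The function name `running_share_of_heads` and the fixed seed are illustrative choices.

```python
import random


def running_share_of_heads(n_flips, seed=0):
    """Simulate n_flips of a balanced coin; return the share of heads after each flip."""
    rng = random.Random(seed)   # fixed seed so the run is reproducible
    heads = 0
    shares = []
    for i in range(1, n_flips + 1):
        if rng.random() < 0.5:  # heads with probability 0.5
            heads += 1
        shares.append(heads / i)
    return shares


shares = running_share_of_heads(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>7,} flips: share of heads = {shares[n - 1]:.4f}")
```

The share can wander noticeably in short games. It is only over many flips that it settles near 0.5.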

Probability and Luck. What is the probability that you win twice in a row? – P(heads in the first round) × P(heads in the second round) = 0.5 × 0.5 = 0.25. – Because the draws in the first and the second round are independent events. What is the probability that you win k times in a row? – P(heads in the first round) × P(heads in the second round) × … × P(heads in the kth round) = 0.5^k.
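The multiplication rule for independent rounds can be sketched in a few lines. The function name `prob_streak` is hypothetical. It assumes each round is won with the same probability p (0.5 for the balanced coin) and that rounds are independent.

```python
def prob_streak(k, p=0.5):
    """Probability of winning k independent rounds in a row, each won with probability p."""
    return p ** k


for k in (1, 2, 5, 10):
    print(f"P(win {k} times in a row) = {prob_streak(k)}")
```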

Sometimes we can’t repeat our choices Life is full of random events… but We only draw one job at the end of university. – Hard to know what other incomes/jobs we would have gotten. We only draw one marriage. – Subsequent marriages are not identical to the first one. – What is the probability of divorce? We only die once at a particular age. – What is the probability of death at age 50?

Sometimes we can’t repeat our choices. In such a case we define the probability of an event as the ratio of the number of such events to the number of individuals in identical circumstances. – … for a very large number of such individuals. Example: number of individuals with the same degree, same age as me: … What is the probability of earning more than $45,000 in my first job?
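This relative-frequency definition can be sketched as a simple count. The incomes below are invented for illustration (they are not data from the course) and the helper name `relative_frequency` is hypothetical. A real estimate would need a very large group of comparable individuals, as the slide notes.

```python
# Hypothetical first-job incomes of graduates with the same degree and age.
first_job_incomes = [
    31_000, 38_500, 42_000, 45_500, 47_000,
    52_000, 29_000, 61_000, 44_000, 48_500,
]


def relative_frequency(values, event):
    """Share of values for which event(value) is true."""
    return sum(1 for v in values if event(v)) / len(values)


p_over_45k = relative_frequency(first_job_incomes, lambda income: income > 45_000)
print(f"Estimated P(first-job income > $45,000) = {p_over_45k}")
```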

Wrap Up. What is the conditional distribution of y given x? – Use the relationship y = a + b x + e to find the mean of y given x. We compute a and b using our formulas. – Use the relationship TSS = ESS + SSE: the variance of the error is the variance of y minus the variance of the prediction. – The standard deviation of y given x is the standard deviation of the errors (residuals). – Apply the empirical rule: 95% of the y given x will lie within a + b x ± 2 sd(y given x). Next: beginning probability distributions (Chapter 4).

Coming up: Don’t forget: Break of Statistics for 2 weeks. Only one week break for recitations. For help: Amine Ouazad Office 1135, Social Science building Office hour: Wednesday from 4 to 6pm. GAF: Irene Paneda Sunday recitations. At the Academic Resource Center, Monday from 2 to 4pm.