Statistics for Social and Behavioral Sciences Session #9: Linear Regression and Conditional distribution Probabilities (Agresti and Finlay, Chapter 9)


1 Statistics for Social and Behavioral Sciences Session #9: Linear Regression and Conditional distribution Probabilities (Agresti and Finlay, Chapter 9) Prof. Amine Ouazad

2 Statistics Course Outline Part I. Introduction and Research Design (Week 1) Part II. Describing Data (Weeks 2-4) Part III. Drawing Conclusions from Data: Inferential Statistics (Weeks 5-9) Part IV. Correlation and Causation: Regression Analysis (Weeks 10-14) This is where we talk about Zmapp and Ebola! Firenze or Lebanese Express? Where we are right now! Describing associations between two variables

3 Last session How good are my predictions? How good is my model? – Use the R Squared = ESS/TSS. – TSS = ESS + SSE. – The notations TSS, ESS, SSE are widespread. – The variance is the square of the standard deviation. – The R squared is also the square of the correlation of the predicted value and the actual value.
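The identities recalled on this slide can be checked numerically. The sketch below uses a small made-up data set (the numbers are illustrative assumptions, not the Census data used later) and the usual least-squares formulas for a and b, then verifies that TSS = ESS + SSE and that the R squared equals the squared correlation between predicted and actual values.

```python
import math

# Made-up data for illustration only (not from the course data set).
x = [8, 10, 12, 12, 14, 16, 16, 18]                      # years of schooling
y = [15.0, 22.0, 30.0, 26.0, 41.0, 39.0, 52.0, 55.0]    # income, in thousands

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope b and intercept a.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b = sxy / sxx
a = mean_y - b * mean_x

predictions = [a + b * xi for xi in x]

# The three sums of squares.
tss = sum((yi - mean_y) ** 2 for yi in y)
ess = sum((pi - mean_y) ** 2 for pi in predictions)
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))

r_squared = ess / tss


def correlation(u, v):
    """Pearson correlation between two equally long lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    var_u = sum((ui - mu) ** 2 for ui in u)
    var_v = sum((vi - mv) ** 2 for vi in v)
    return cov / math.sqrt(var_u * var_v)


corr_pred_actual = correlation(predictions, y)

print("TSS:", tss, "ESS + SSE:", ess + sse)
print("R squared:", r_squared, "corr^2:", corr_pred_actual ** 2)
```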

4 Outline 1. Conditional distribution – What wage will I earn after graduation? 2. Probabilities (Chapter 4) After the break: Probability Distributions, Chapter 4 of A&F

5

6 WHEN LaTisha Styles graduated from Kennesaw State University in Georgia in 2006 she had $35,000 of student debt. This obligation would have been easy to discharge if her Spanish degree had helped her land a well-paid job. But there is no shortage of Spanish-speakers in a nation that borders Latin America. So Ms Styles found herself working in a clothes shop and a fast-food restaurant for no more than $11 an hour. Frustrated, she took the gutsy decision to go back to the same college and study something more pragmatic. She majored in finance, and now has a good job at an investment consulting firm. Her debt has swollen to $65,000, but she will have little trouble paying it off.

7 A Contingency Table (From Previous Session) But can I do a regression analysis here? We will learn how to produce this later in the course. For now, let’s interpret/understand this. The table shows the average weekly earnings for each year of education.

8 What wage will I earn after graduation? Data: Census of Population 2010. The United States Census is a decennial census mandated by Article I, Section 2 of the United States Constitution, which states: "Representatives and direct Taxes shall be apportioned among the several States... according to their respective Numbers.... The actual Enumeration shall be made within three Years after the first Meeting of the Congress of the United States, and within every subsequent Term of ten Years.” Variables: – Number of years of education completed. – Wage income. We can only perform regression analysis on quantitative variables.

9 Linear Relationship anybody? We can postulate that there is a linear relationship between wage income (y) and years of schooling (x): y = α + β x + ε. We use Greek letters here because this is the true relationship. Notice the importance of the residuals (aka errors) ε. Units of measurement matter. Make sure you read the fine print. – y is annual income in dollars. x is in years. Also, with a linear relationship, an additional year of education leads to the same increase in income at any stage of your education process. – Makes sense? Check the contingency table. We keep the linear relationship as a convenient model.

10 Estimation of α and β We estimate α and β by computing the values of a and b. We only have a sample, not the entire population. So, what earnings can we expect?

11 Linear? A contingency table Here, years of schooling (x) is quantitative discrete, so we can do both regression analysis and a contingency table! (… Continued …) [Figure: regression line of wage income (y) against years of schooling (x), with intercept a and slope b per 1 year of schooling.]

12 The unconditional distribution of income and education We find that the mean and the standard deviation of the variables are as follows: – Annual income y: mean $41,550, SD $48,659 – Years of schooling x: mean 12.25 years, SD 1.6 years Assuming a bell shaped distribution – Most earnings will fall between: mean +- 3 sd – 95% of earnings will fall between: mean +- 2 sd – 68% of earnings will fall between: mean +- 1 sd Interesting: could do a risk analysis with that data: – What is the probability that you earn more than mean + 2 sd? But the unconditional distribution of annual income mixes both individuals with high and low levels of education…
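The empirical rule and the risk question on this slide can be made concrete under the slide's own assumption of a bell shaped (normal) distribution. The sketch below uses only the standard library; the helper names `normal_cdf` and `share_within` are illustrative and not part of the course material.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return normal_cdf(k) - normal_cdf(-k)


for k in (1, 2, 3):
    print(f"within mean +- {k} sd: {share_within(k):.4f}")

# Risk question from the slide: probability of earning more than mean + 2 sd,
# if earnings really were normally distributed.
p_above_2sd = 1.0 - normal_cdf(2.0)
print(f"P(y > mean + 2 sd): {p_above_2sd:.4f}")
```

Actual income distributions tend to be skewed to the right, so these figures are an approximation that holds only to the extent the bell shape assumption does.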

13 The conditional distribution of income and education So instead of using the unconditional distribution of income (aka marginal distribution), we use the conditional distribution of income. “What is the distribution of income given that an individual studied for x years?” [Figure: earnings (y) plotted against education (x).]

14 Understanding the mean of income given x After x years of education, the predicted (mean) annual income will be: a + b x = -123.610 + 2,689.936 x. With x = 16, we find $42,915! Good or bad?
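The prediction on this slide is a single line of arithmetic. A minimal sketch, using the intercept and slope quoted on the slide (the function name `predicted_income` is just an illustrative choice):

```python
# Estimated intercept and slope, as quoted on the slide.
a = -123.610      # dollars
b = 2689.936      # dollars per additional year of schooling


def predicted_income(years_of_schooling):
    """Predicted (mean) annual income given years of schooling: a + b x."""
    return a + b * years_of_schooling


print(round(predicted_income(16)))  # 16 years of schooling
print(round(predicted_income(12)))  # 12 years, for comparison
```

Comparing two education levels this way shows the linearity assumption at work: each extra year adds the same b dollars to the predicted mean.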

15 Understanding the risks: Approach #1 Use the fact that TSS = ESS + SSE. – The ESS measures how education explains the variance of earnings. From this we find that Var(y) = Var(predictions) + Var(error). – How do we go from TSS = ESS + SSE to this? Divide both sides by N. But Var(y) is the variance of the unconditional distribution of y. How can we find the variance of earnings given a level of education? In such a case Var(y) given a level of education is Var(y given x) = Var(error). And thus the standard deviation of earnings given a level of education is: – SD(residuals) = square root of (SSE/N) = sqrt(513012622113699.1/1460042) = $18,744 Applying the empirical rule… we find that most annual incomes will lie between: $42,915 - 3 x $18,744 and $42,915 + 3 x $18,744, that is, between $0 (the computed lower bound, -$13,317, is negative) and $99,147.
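The arithmetic of Approach #1 can be reproduced directly from the SSE and N quoted on the slide. This is a sketch of the calculation only; the variable names are illustrative, and flooring the lower bound at zero reflects the assumption that annual income cannot be negative.

```python
import math

# Values quoted on the slide.
sse = 513012622113699.1     # sum of squared errors
n = 1460042                 # number of observations
predicted_mean = 42915      # predicted income at x = 16, in dollars

# Standard deviation of earnings given education: sqrt(SSE / N).
sd_residuals = math.sqrt(sse / n)

# Empirical rule: most incomes lie within 3 standard deviations of the mean.
lower = max(0.0, predicted_mean - 3 * sd_residuals)  # income floored at zero
upper = predicted_mean + 3 * sd_residuals

print(f"SD of residuals: {sd_residuals:,.0f}")
print(f"Range: {lower:,.0f} to {upper:,.0f}")
```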

16 Understanding the risks: Approach #2 Use our beautiful formula: correlation = slope x (standard dev. of x) / (standard dev. of y), i.e. r = b x s_x / s_y. Slope: $2,689.936. Standard dev. of x: here 1.6. Standard dev. of y: 19,233.75. Hence the correlation between earnings and education is: 0.2240. It is lower than 1 because the linear relationship doesn’t hold exactly. The r² is thus: 0.050176. Notice the variance of the error: Var(error) = (1 - R²) x Var(y). And thus sd(residuals) = sqrt(1 - R²) x SD(y). We find: $18,744! Same as before!
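Approach #2 can be sketched the same way. The code assumes the formula on the slide is the standard relation r = b x s_x / s_y between the slope and the correlation (the formula itself did not survive in the transcript), and it uses the rounded inputs printed on the slide, so the results agree with the slide's figures only up to rounding.

```python
import math

# Values quoted on the slide (rounded as printed there).
slope = 2689.936       # b, dollars per year of schooling
sd_x = 1.6             # standard deviation of years of schooling
sd_y = 19233.75        # standard deviation of income, as given on this slide

# Assumed relation between slope and correlation: r = b * s_x / s_y.
r = slope * sd_x / sd_y
r_squared = r ** 2

# Standard deviation of the residuals: sqrt(1 - R^2) * SD(y).
sd_residuals = math.sqrt(1.0 - r_squared) * sd_y

print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
print(f"SD of residuals: {sd_residuals:,.0f}")
```

Because R² is small here, sqrt(1 - R²) is close to 1, which is why conditioning on education shrinks the standard deviation of income only slightly.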

17 The Empirical Rule Where will your earnings lie with 95% probability? The conditional distribution has a lower standard deviation and a higher mean than the unconditional distribution. [Figure: frequency against earnings for the unconditional distribution and the conditional distribution.]

18 Wrap up With a linear relationship y = a + b x + e: – The unconditional (i.e. marginal) distribution of y has a larger variance than the conditional distribution of y given x. The mean of the conditional distribution of y given x is a + b x. And the standard deviation is the standard deviation of the errors e_i. Such standard deviation is equal to: sqrt(SSE / N). Again, N in the denominator. Proper discussion of this to follow.

19 Outline 1. Conditional distribution – What wage will I earn after graduation? 2. Probabilities (Chapter 4) After the break: Probability Distributions, Chapter 4 of A&F

20 Probability and Luck We play a game together… – Heads you win 1 dirham. – Tails I win 1 dirham. We play the game a very large number of times. Should you play this game? P(heads) = 0.5, P(tails) = 0.5

21 Probability and Luck P(heads) = 1 – P(not heads) P(heads) is read as “probability of heads”. Game sequence: – In the long run, with a balanced coin, 0.5 of the trials will lead to heads, 0.5 of the trials will lead to tails. – The probability of heads is the ratio of the number of heads to the number of trials, with an infinite number of draws… Perform the game for a very large number of draws. … the longer the game, the closer the ratio will be to 0.5
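The long-run definition of probability can be illustrated with a simulated coin. This is a minimal sketch using Python's standard random number generator; the function name `share_of_heads` and the fixed seed are illustrative choices made so that the run is repeatable.

```python
import random


def share_of_heads(n_flips, seed=0):
    """Simulate n_flips tosses of a balanced coin and return the share of heads."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


# The share of heads settles closer to 0.5 as the game gets longer.
for n_flips in (10, 100, 10_000, 100_000):
    print(n_flips, share_of_heads(n_flips))
```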

22 Probability and Luck What is the probability that you win twice in a row? – P(heads in the first round) x P(heads in the second round) = 0.5 x 0.5 = 0.25 – Because the draws in the first and the second round are independent events. What is the probability that you win k times in a row? – P(heads in the first round) x P(heads in the second round) x … x P(heads in the kth round) = 0.5^k
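Because the rounds are independent, the probability of a winning streak is a product of the per-round probabilities. A short sketch, assuming a balanced coin as on the earlier slide (the function name is illustrative):

```python
def prob_k_heads_in_a_row(k, p_heads=0.5):
    """Probability of k heads in k independent tosses: p_heads multiplied k times."""
    return p_heads ** k


for k in (1, 2, 5, 10):
    print(k, prob_k_heads_in_a_row(k))
```

The probability halves with every additional round, so long winning streaks quickly become very unlikely.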

23 Sometimes we can’t repeat our choices Life is full of random events… but we only draw one job at the end of university. – Hard to know what other incomes/jobs we would have gotten. We only draw one marriage. – Subsequent marriages are not identical to the first one. – What is the probability of divorce? We only die once at a particular age. – What is the probability of death at age 50?

24 Sometimes we can’t repeat our choices In such a case we define the probability of an event as the ratio of the number of such events over the number of individuals in identical circumstances. – … for a very large number of such individuals. Example: number of individuals with the same degree, same age as me: What is the probability of earning more than $45,000 in my first job?
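This ratio definition translates into a simple count. The incomes below are invented purely for illustration (they are not data from the course), and a real estimate would need a very large number of comparable individuals, as the slide stresses.

```python
# Hypothetical first-job incomes of people with the same degree and age.
# These numbers are made up for illustration only.
first_job_incomes = [38000, 52000, 41000, 47000, 60000, 35000, 45500, 44000]
threshold = 45000

# Probability as a ratio: number of such events over number of individuals.
n_above = sum(income > threshold for income in first_job_incomes)
p_above = n_above / len(first_job_incomes)

print(f"{n_above} of {len(first_job_incomes)} earn more than ${threshold:,}")
print(f"Estimated probability: {p_above}")
```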

25 Wrap Up What is the conditional distribution of y given x? – Use the relationship y = a + b x + e to find the mean of y given x. We compute a and b using our formulas. – Use the relationship TSS = ESS + SSE: the variance of the error is the variance of the y minus the variance of the prediction. – The standard deviation of y given x is the standard deviation of the errors (residuals). – Apply the empirical rule. 95% of the y given x will lie between a + b x +- 2 sd(y given x) Beginning probability distributions (chapter 4)

26 Coming up: Don’t forget: Break of Statistics for 2 weeks. Only one week break for recitations. For help: Amine Ouazad Office 1135, Social Science building amine.ouazad@nyu.edu Office hour: Wednesday from 4 to 6pm. GAF: Irene Paneda Irene.paneda@nyu.edu Sunday recitations. At the Academic Resource Center, Monday from 2 to 4pm.

