Stat 31, Section 1, Last Time Inference for Proportions –Hypothesis Tests 2 Sample Proportions Inference –Skipped 2-way Tables –Sliced populations in 2 different ways –Look for independence of factors –Chi Square Hypothesis test
Reading In Textbook Approximate Reading for Today’s Material: Pages , Approximate Reading for Next Class: Pages
Midterm I - Results Preliminary comments: Circled numbers are points taken off Total for each problem in brackets Points evenly divided among parts Page total in lower right corner Check those sum to total on front Overall score out of 100 points
Midterm I - Results Interpretation of Scores: Too early for letter grades These will change a lot: –Some with good grades will relax –Some with bad grades will wake up Don’t believe “A & C” average to “B”
Midterm I - Results Interpretation of Scores: Recall large variation over 2 midterms –No exception this semester
Midterm I - Results
Line of Equal Scores
Midterm I - Results Some have dramatically improved; others have been distracted by other things
Midterm I - Results Interpretation of Scores: Recall large variation over 2 midterms –No exception this semester Get better info from 2 test Total –So will report answers in those terms
Midterm I - Results Histogram of Results:
Midterm I - Results Interpretation of Scores (2 Test total): A 155 – 168; B 131 – 154; C 120 – 129; D; F
Midterm I - Results Where do we go from here? I see 2 rather different groups… Which are you in? What can you do? Most important: It is still early days……
Chapter 9: Two-Way Tables Main idea: Divide up populations in two ways –E.g. 1: Age & Sex –E.g. 2: Education & Income Typical Major Question: How do divisions relate? Are the divisions independent? –Similar idea to independence in probability theory –Statistical Inference?
Two-Way Tables Big Question: Is there a relationship? Note: tallest bars are French Wine with French Music, Italian Wine with Italian Music, Other Wine with No Music. Suggests there is a relationship
Two-Way Tables General Directions: Can we make this precise? Could it happen just by chance? –Really: how likely to be a chance effect? Or is it statistically significant? –I.e. music and wine purchase are related?
Two-Way Tables An alternate view: Replace counts by proportions (or %-ages) Class Example 31 (Wine & Music), Part 2 Advantage: May be more interpretable Drawback: No real difference (just rescaled)
Two-Way Tables Testing for independence: What is it? From probability theory: P{A | B} = P{A} i.e. Chances of A, when B is known, are same as when B is unknown Table version of this idea?
Independence in 2-Way Tables Counts analog of P{A|B}??? Equivalent condition for independence is: P{A & B} = P{A} x P{B} So for counts, look for: Table Proportion = Row Marginal Proportion x Column Marginal Proportion i.e. Entry = Product of Marginals
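The product-of-marginals rule can be sketched in Python. The slides do not list the Class Example 31 counts, so the table below is an assumption for illustration, and `expected_counts` is a hypothetical helper name. In terms of counts, the rule becomes: expected count = row total x column total / grand total.

```python
# Observed counts: rows = wine type, columns = music played.
# ASSUMPTION: these counts are illustrative; the slides do not list the table.
observed = [
    [30, 39, 30],   # French wine:  no music, French music, Italian music
    [11,  1, 19],   # Italian wine: no music, French music, Italian music
    [43, 35, 35],   # Other wine:   no music, French music, Italian music
]

def expected_counts(table):
    """Counts expected under independence: row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
for row in expected:
    print([round(e, 1) for e in row])
```

The expected table keeps the same row and column totals as the observed table, which is why it works as the null hypothesis table.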
Independence in 2-Way Tables Visualize Product of Marginals for: Class Example 31 (Wine & Music), Part 4 Shows same structure as marginals But does not show the match between music & wine Good null hypothesis
Independence in 2-Way Tables Approach: Measure “distance between tables” –Use Chi Square Statistic –Has known probability distribution when table is independent Assess significance using P-value –Set up as: H0: Independent, HA: Dependent –P-value = P{what was seen or more conclusive | Independent}
Independence in 2-Way Tables Chi-square statistic: X^2 = sum over all cells of (Obs – Exp)^2 / Exp Based on: Observed Counts (raw data), Obs; Expected Counts (under independence), Exp Notes: –Small for only random variation –Large for significant departure from independence
Independence in 2-Way Tables Chi-square statistic calculation: Class example 31, Part 5: –Calculate term by term –Then sum –Is X^2 = 18.3 “big” or “small”?
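The term-by-term calculation can be sketched as follows. The counts are an assumption for illustration (the slides do not list the table); with these counts the sum rounds to the 18.3 quoted on the slide.

```python
# ASSUMPTION: illustrative counts; rows = wine type, columns = music played.
observed = [
    [30, 39, 30],   # French wine
    [11,  1, 19],   # Italian wine
    [43, 35, 35],   # Other wine
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# One (Obs - Exp)^2 / Exp term per cell, with Exp = row total x col total / grand total.
terms = []
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / grand_total
        terms.append((obs - exp) ** 2 / exp)

x2 = sum(terms)
print(round(x2, 1))   # 18.3 for these counts
```

The largest single term comes from the cell where the observed count is far below its expected count (Italian wine with French music in this assumed table).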
Independence in 2-Way Tables H0 distribution of the X^2 statistic: “Chi Squared” (another Greek letter, χ) Parameter: “degrees of freedom” (similar to T distribution) Excel Computation: –CHIDIST (given cutoff, find area = prob.) –CHIINV (given prob = area, find cutoff)
Independence in 2-Way Tables For test of independence, use: degrees of freedom = (#rows – 1) x (#cols – 1) E.g. Wine and Music: d.f. = (3 – 1) x (3 – 1) = 4
Independence in 2-Way Tables E.g. Wine and Music: P-value = P{Observed X^2 or more conclusive | Independent} = P{X^2 = 18.3 or more conclusive | Independent} = P{X^2 >= 18.3 | d.f. = 4} Also see Class Example 31, Part 5
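The upper-tail area that CHIDIST returns can be checked with a short Python sketch. It relies on the closed form for the chi-square upper tail that holds when the degrees of freedom are even, which covers d.f. = 4 here; `chi2_upper_tail_even_df` is a hypothetical helper name, not an Excel or library function.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """P{X^2 >= x} for a chi-square distribution with an even number of d.f.

    Uses the closed form exp(-x/2) * sum_{j < df/2} (x/2)^j / j!,
    which is valid only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even d.f.")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

p_value = chi2_upper_tail_even_df(18.3, 4)
print(f"{p_value:.4f}")   # 0.0011
```

For odd degrees of freedom this shortcut does not apply, and a statistics package or the Excel function would be the practical route.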
Independence in 2-Way Tables E.g. Wine and Music: P-value = Yes-No: Very strong evidence against independence, conclude music has a statistically significant effect Gray-Level: Also very strong evidence
Independence in 2-Way Tables Excel shortcut: CHITEST Avoids the (obs-exp)^2 / exp calculation Automatically computes d.f. Returns P-value
Independence in 2-Way Tables HW:
And Now for Something Completely Different A statistics joke, from: GARY C. RAMSEYER'S INTERNET GALLERY OF STATISTICS JOKES
And Now for Something Completely Different A somewhat advanced society has figured how to package basic knowledge in pill form. A student, needing some learning, goes to the pharmacy and asks what kind of knowledge pills are available.
And Now for Something Completely Different The pharmacist says "Here's a pill for English literature." The student takes the pill and swallows it and has new knowledge about English literature!
And Now for Something Completely Different "What else do you have?" asks the student. "Well, I have pills for art history, biology, and world history," replies the pharmacist. The student asks for these, and swallows them and has new knowledge about those subjects!
And Now for Something Completely Different Then the student asks, "Do you have a pill for statistics?" The pharmacist says "Wait just a moment", and goes back into the storeroom and brings back a whopper of a pill that is about twice the size of a jawbreaker and plunks it on the counter. "I have to take that huge pill for statistics?" inquires the student.
And Now for Something Completely Different The pharmacist understandingly nods his head and replies: "Well, you know statistics always was a little hard to swallow."
Caution about 2-Way Tables Simpson’s Paradox: Aggregation into tables can be dangerous E.g. from: Study Admission rates to professional programs, look for sex bias….
Simpson’s Paradox Admissions to Business School: % Males admitted = 480 / (480 + 120) * 100% = 80% % Females admitted = 180 / (180 + 20) * 100% = 90% Better for females??? Table (Admit, Deny): Male 480, 120; Female 180, 20
Simpson’s Paradox Admissions to Law School: % Males admitted = 10 / (10 + 90) * 100% = 10% % Females admitted = 100 / (100 + 200) * 100% = 33.3% Better for females??? Table (Admit, Deny): Male 10, 90; Female 100, 200
Simpson’s Paradox Combined Admissions: % Males admitted = 490 / (490 + 210) * 100% = 70% % Females admitted = 280 / (280 + 220) * 100% = 56% Better for males??? Table (Admit, Deny): Male 490, 210; Female 280, 220
Simpson’s Paradox How can the rate be higher for females in each school, yet higher for males overall? Reason: depends on relative proportions Notes: In Business (male applicants dominant), easier to get in (660 / 800) In Law (female applicants dominant), much harder to get in (110 / 400)
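The reversal can be reproduced with a short Python sketch using the admission counts from the slides. The male deny counts for Business (120) and for the combined table (210) are inferred from the stated 80% and 70% rates, and the names `admissions` and `rate` are illustrative.

```python
# (admit, deny) counts by school and sex.
# ASSUMPTION: male Business deny = 120, inferred from the stated 80% admission rate.
admissions = {
    "Business": {"Male": (480, 120), "Female": (180, 20)},
    "Law":      {"Male": (10, 90),   "Female": (100, 200)},
}

def rate(admit, deny):
    """Admission rate as a percentage."""
    return 100.0 * admit / (admit + deny)

# Within each school, females have the higher rate.
for school, by_sex in admissions.items():
    for sex, (admit, deny) in by_sex.items():
        print(f"{school:8s} {sex:6s} {rate(admit, deny):5.1f}%")

# Aggregating over schools reverses the comparison.
combined = {}
for by_sex in admissions.values():
    for sex, (admit, deny) in by_sex.items():
        a, d = combined.get(sex, (0, 0))
        combined[sex] = (a + admit, d + deny)

for sex, (admit, deny) in combined.items():
    print(f"Combined {sex:6s} {rate(admit, deny):5.1f}%")
```

Most male applicants applied to the school that is easy to get into, and most female applicants applied to the hard one, so the combined rates are weighted very differently.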
Simpson’s Paradox Lesson: Must be very careful about aggregation Worse: may not be aware that aggregation has been done…. Recall terminology: Lurking Variable Can hide in aggregation… Could be used for cheating…
Simpson’s Paradox HW:
Inference for Regression Chapter 10 Recall: Scatterplots Fitting Lines to Data Now study statistical inference associated with fit lines E.g. When is slope statistically significant?
Recall Scatterplot For data (x,y) View by plot: (1,2) (3,1) (-1,0) (2,-1)
Recall Linear Regression Idea: Fit a line to data in a scatterplot To learn about “basic structure” To “model data” To provide “prediction of new values”
Recall Linear Regression Recall some basic geometry: A line is described by an equation: y = mx + b, where m = slope and b = y-intercept Varying m & b gives a “family of lines”, Indexed by “parameters” m & b (or a & b)
Recall Linear Regression Approach: Given a scatterplot of data: Find a & b (i.e. choose a line) to “best fit the data”
Recall Linear Regression Given a line, y = ax + b, “indexed” by a & b Define “residuals” = “data Y” – “Y on line” = Yi – (aXi + b) Now choose a & b to make these “small”
Recall Linear Regression Excellent Demo, by Charles Stanton, CSUSB More JAVA Demos, by David Lane at Rice U.
Recall Linear Regression Make Residuals >= 0, by squaring Least Squares: adjust a & b to Minimize the “Sum of Squared Errors”, SSE = sum of (Yi – (aXi + b))^2
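A minimal sketch of the least squares fit, using the four points from the scatterplot slide and the standard closed-form solution (slope = Sxy / Sxx, intercept = Y-bar minus slope x X-bar). The variable names are illustrative; Excel's SLOPE and INTERCEPT functions perform the same calculation.

```python
# The four (x, y) points from the scatterplot slide.
xs = [1, 3, -1, 2]
ys = [2, 1, 0, -1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# These values minimize the sum of squared residuals.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}, SSE = {sse:.4f}")
```

Any other choice of slope and intercept gives a larger SSE on these points, which is easy to confirm by nudging either value.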
Least Squares in Excel Computation: 1. INTERCEPT (computes y-intercept b) 2. SLOPE (computes slope a) Revisit Class Example 14 HW: 10.17a
Inference for Regression Goal: develop Hypothesis Tests and Confidence Intervals For slope & intercept parameters, a & b Also study prediction
Inference for Regression Idea: do statistical inference on: –Slope a –Intercept b Model: Yi = aXi + b + ei Assume: ei are random, independent and N(0, σe)
Inference for Regression Viewpoint: Data generated as: y = ax + b; Yi chosen from Xi Note: a and b are “parameters”
Inference for Regression Parameters a, b and σe determine the underlying model (distribution) Estimate with the Least Squares Estimates: â and b̂ (Using SLOPE and INTERCEPT in Excel, based on data)
Inference for Regression Distributions of â and b̂? Under the above assumptions, the sampling distributions are: â ~ N(a, SD(â)) and b̂ ~ N(b, SD(b̂)) Centerpoints are right (unbiased) Spreads are more complicated
Inference for Regression Formula for SD of â: SD(â) = σe / sqrt(sum of (Xi – X-bar)^2) Big (small) for σe big (small, resp.) –Accurate data → Accurate estimate of slope Small for x’s more spread out –Data more spread → More accurate Small for more data –More data → More accuracy
Inference for Regression Formula for SD of b̂: SD(b̂) = σe x sqrt(1/n + X-bar^2 / sum of (Xi – X-bar)^2) Big (small) for σe big (small, resp.) –Accurate data → Accurate estimate of intercept Smaller for X-bar near 0 –Centered data → More accurate intercept Smaller for more data –More data → More accuracy
Inference for Regression One more detail: Need to estimate σe using data For this use: se = sqrt(sum of (Yi – (âXi + b̂))^2 / (n – 2)) Similar to earlier sd estimate, Except variation is about fit line n – 2 is similar to n – 1 from before
Inference for Regression Now for Probability Distributions, Since we are estimating σe by se, Use TDIST and TINV With degrees of freedom = n – 2
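A minimal sketch of these calculations in Python, assuming the standard simple-regression formulas: residual standard deviation with n – 2 in the denominator, SD of the slope = se / sqrt(Sxx), and SD of the intercept = se x sqrt(1/n + X-bar^2 / Sxx). It reuses the four points from the scatterplot slide purely for illustration; all variable names are hypothetical.

```python
import math

# Illustrative data: the four points from the scatterplot slide.
xs = [1, 3, -1, 2]
ys = [2, 1, 0, -1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least squares estimates.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual standard deviation: variation about the fit line, n - 2 d.f.
df = n - 2
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
s_e = math.sqrt(sum(r * r for r in residuals) / df)

# Estimated SDs (standard errors) of the two estimates.
se_slope = s_e / math.sqrt(sxx)
se_intercept = s_e * math.sqrt(1 / n + x_bar ** 2 / sxx)

# t statistic for H0: slope = 0; compare against a t distribution with df d.f.
t_slope = slope / se_slope
print(f"df = {df}, s_e = {s_e:.4f}, SE(slope) = {se_slope:.4f}, t = {t_slope:.4f}")
```

The t statistic and the degrees of freedom are the inputs that TDIST needs to return a P-value. With only four scattered points the t statistic is small, so this slope would not be judged statistically significant.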
Inference for Regression Convenient Packaged Analysis in Excel: Tools → Data Analysis → Regression Illustrate application using: Class Example 27, Old Text Problem 8.6 (now 10.12)