Statistics for Business and Economics Module 2: Regression and time series analysis Spring 2010 Lecture 2: Chi–squared tests; goodness–of–fit & independence.


1 Statistics for Business and Economics Module 2: Regression and time series analysis Spring 2010 Lecture 2: Chi–squared tests; goodness–of–fit & independence Priyantha Wijayatunga, Department of Statistics, Umeå University priyantha.wijayatunga@stat.umu.se These materials are adapted from copyrighted lecture slides (© 2009 W.H. Freeman and Company) from the homepage of the book The Practice of Business Statistics: Using Data for Decisions, Second Edition, by Moore, McCabe, Duckworth and Alwan.

2 Goodness–of–fit test and analysis of two–way contingency tables
Reference to the book: Chapters 9.1, 9.2 and 2.5
• Summarizing different data types
• Testing goodness–of–fit of models for multinomial observations
• Chi–squared distribution, chi–squared test and p–values
• Two-way contingency tables and describing relationships in two-way tables
• The hypothesis: no association (or independence)
• Conditional distributions and marginal distributions
• Chi-square test vs. z-test

3 Techniques to summarize data
1. One variable – univariate methods
2. Two variables – bivariate methods
Graphical displays
Two interval variables – scatterplot
Two categorical variables – clustered bar chart
More than two variables – graphical displays are hard

4 Observations can be taken
1. At the same time – cross-sectional data. Market surveys: e.g. brand preferences of 100 people, etc.
2. At successive times repeatedly – time series data. E.g. the price of a certain stock over the last 5 years.
Note: succession can be in space too, but we omit such discussions.

5 Describing the Relationship between Two Nominal/Ordinal Variables
A contingency (cross–classification, cross–tabulation) table is used to describe the relationship between two or more nominal variables.
Ex: Are profession and newspaper reading habits related? A sample of people were asked about their professions and newspaper preferences.

Person  Occupation    Newspaper
1       White-collar  Post
2       White-collar  Sun
3       Professional  Sun
..      ..            ..
354     Blue-collar   Mail

Newspaper \ Occupation   WC    BC    Pro   Total
Globe                    27    29    33     89
Mail                     18    43    51    112
Post                     38    21    22     81
Sun                      37    15    20     72
Total                   120   108   126    354

6 Relative conditional frequencies (newspaper given occupation)

Newspaper \ Occupation   WC              BC              Pro
Globe                    27/120 = 0.23   29/108 = 0.27   33/126 = 0.26
Mail                     18/120 = 0.15   43/108 = 0.40   51/126 = 0.40
Post                     38/120 = 0.32   21/108 = 0.19   22/126 = 0.17
Sun                      37/120 = 0.31   15/108 = 0.14   20/126 = 0.16
Total                    120 (1)         108 (1)         126 (1)
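The conditional relative frequencies above are each cell count divided by its column (occupation) total. A minimal Python sketch of that calculation follows; the variable names `counts`, `col_totals` and `cond_freq` are illustrative and not part of the lecture.

```python
# Counts from the occupation-by-newspaper table
# (rows: newspapers; columns: WC, BC, Pro).
counts = {
    "Globe": [27, 29, 33],
    "Mail":  [18, 43, 51],
    "Post":  [38, 21, 22],
    "Sun":   [37, 15, 20],
}
occupations = ["WC", "BC", "Pro"]

# Column totals: number of respondents in each occupation group.
col_totals = [sum(row[j] for row in counts.values())
              for j in range(len(occupations))]

# Conditional relative frequency: cell count / column total.
cond_freq = {
    paper: [row[j] / col_totals[j] for j in range(len(occupations))]
    for paper, row in counts.items()
}

for paper, freqs in cond_freq.items():
    print(paper, [f"{f:.2f}" for f in freqs])
```

Each column of `cond_freq` sums to 1, which is what the "(1)" in the total row of the table indicates.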

7 Time series data
Observations are repeated at successive times.
Ex: Total amount of tax collected (in billions of US$) in the USA from 1993 to 2002.

Year   Tax
1993    594
1994    625
1995    686
1996    755
1997    848
1998    940
1999   1032
2000   1137
2001   1178
2002   1038

8 Binary and multinary observations
1. Binomial experiment: a nominal variable has two outcomes. E.g.: Do the majority of people like the new economic policies or not?
2. Multinomial experiment: for a nominal variable that has three or more outcomes, we test more than two proportions. E.g.: Do people have equal preferences for five brands of tea?
Note: Multinomial cases can sometimes be reduced to the binomial case!

9 Multinary experiment: example
100 persons took part in a survey about different brands of coffee, say Ellips, Gexus, Luber and Loflia. Each person tasted these four kinds of coffee (in a blind test) and noted which one they liked best. The result of the test is as follows:

Brand                                     Ellips   Gexus   Luber   Loflia
Observed number of persons (frequency)    26       28      16      30

10 General question of interest Does the result of the survey show that some of the brands are more popular than the others, or are they all equally popular? In statistical terms we can formulate the problem as: Null hypothesis: All the coffee brands are equally popular. Alternative hypothesis: Not all the coffee brands are equally popular.

11 If the null hypothesis is true, we could expect the following result of the survey:

Brand                                     Ellips   Gexus   Luber   Loflia
Expected number of persons (frequency)    25       25      25      25

With a significance level of 5%, can we say anything about whether the null hypothesis is true or not? One way of measuring how much the observed table differs from the expected table is to look at the squared differences: $\sum (f_o - f_e)^2 = (26-25)^2 + (28-25)^2 + (16-25)^2 + (30-25)^2 = 116$.

12 However, there is a problem: the difference between 10 and 20 is relatively larger than the difference between 10000 and 10010. How can we take this into account? Divide each squared difference by the expected value and formulate a test statistic (the chi–squared statistic): $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$. For the coffee data, $\chi^2 = \frac{1 + 9 + 81 + 25}{25} = 4.64$. If the null hypothesis is true, χ² ought to be close to zero. Is 4.64 so far away from zero that we can reject the null hypothesis? What is the sampling distribution of χ² if the null hypothesis is true?

13 Chi-squared statistic
The name "chi-squared" refers to two things:
1. A continuous distribution: the χ²-distribution
2. A statistical test in which the sampling distribution of the test statistic is a χ²-distribution

Brand                       Ellips   Gexus   Luber   Loflia
Observed frequency (f_o)    26       28      16      30
Expected frequency (f_e)    25       25      25      25
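As a small illustration, the chi-squared statistic for the coffee survey can be computed from the observed and expected frequencies in the table. This is a sketch only; the variable names are illustrative, and the expected frequencies assume the null hypothesis of equal popularity.

```python
# Observed frequencies for Ellips, Gexus, Luber, Loflia.
observed = [26, 28, 16, 30]

# Under H0 (all brands equally popular) every brand is expected
# to get the same share of the 100 respondents.
n = sum(observed)
expected = [n / len(observed)] * len(observed)

# Raw squared differences ignore the scale of the counts ...
squared_differences = sum((fo - fe) ** 2 for fo, fe in zip(observed, expected))

# ... so each squared difference is divided by its expected frequency.
chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

print(f"sum of squared differences = {squared_differences:.0f}")
print(f"chi-squared statistic      = {chi2:.2f}")
```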

14 Chi-squared distribution The χ² distribution is a parametric distribution with a parameter v, which is called the degrees of freedom. The distribution looks different for different degrees of freedom. The larger v is, the more symmetric the distribution is, and the larger its expected value and standard deviation are.

15 E.g.: reading the chi–squared table for df = 6. If χ² = 16.1 with df = 6, the p-value is between 0.01 and 0.02.

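16 Chi-squared tail probabilities for critical values For our data, χ² = 4.64 with df = 4 − 1 = 3. The critical value at the 0.05 level of significance is 7.81, so we do not reject H0 at the level of significance 0.05. Since 4.64 is also below the 0.10 critical value of 6.25, the p–value is greater than 0.1 (it is about 0.20).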
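Instead of bracketing the p-value with a printed table, the upper-tail probability of a χ² distribution can be evaluated directly. The sketch below uses only Python's standard library; the helper name `chi2_sf` is hypothetical, and it assumes a positive integer number of degrees of freedom. It relies on the closed forms for df = 1 and df = 2 and a standard recurrence that adds two degrees of freedom at a time.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi2 with `df` degrees of freedom >= x).

    Valid for positive integer df. Starts from the closed forms
    for df = 1 (erfc) and df = 2 (exponential) and applies
    Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

# Coffee survey: statistic 4.64 with 4 - 1 = 3 degrees of freedom.
p_coffee = chi2_sf(4.64, 3)

# Table-reading example: statistic 16.1 with 6 degrees of freedom.
p_table_example = chi2_sf(16.1, 6)

print(f"coffee survey p-value: {p_coffee:.3f}")
print(f"df = 6 example p-value: {p_table_example:.4f}")
```

With a statistics library available, the same quantity is usually obtained from its chi-squared survival function; the hand-written helper is used here only to keep the example self-contained.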

17 Chi–squared Goodness–of–fit test
Used to test whether a variable with two or more possible categories has a specific distribution. (Do the observed frequencies in the different categories align with what we can expect from some theory?)
Steps
• Formulate the null and alternative hypotheses
• Compute the expected frequencies if the null hypothesis is true (expected counts)
• Note the observed frequencies (how many are there?)
• Use the differences between the expected and the observed values to compute the observed value of the χ²-statistic
• Compare the observed value with the critical value of the χ²-distribution, or compare the p-value with your level of significance

18 Example 2
A political analyst believes that 45%, 40% and 15% of the voters will vote for political parties A, B and C respectively in the forthcoming election. To test her belief, a statistician did a survey: 200 randomly selected voters were asked for their voting preference, and it was found that 102, 82 and 16 voters were going to vote for parties A, B and C respectively. Can the statistician infer at the 5% level of significance that the political analyst's belief is correct?

Political party             A     B     C    Total
Observed frequency (f_o)    102   82    16   200
Expected frequency (f_e)    90    80    30   200

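19 Example 2 Calculate the chi–squared statistic: $\chi^2 = \frac{(102-90)^2}{90} + \frac{(82-80)^2}{80} + \frac{(16-30)^2}{30} \approx 8.18$. This statistic follows a χ²-distribution with 3 − 1 = 2 degrees of freedom if H0 is true. Look up the tabulated value of the chi–squared distribution with 2 degrees of freedom and level of significance 0.05. It is 5.99. We reject H0 at the level of significance 0.05 since 8.18 > 5.99. One can do the testing with the p–value too: the p–value is about 0.017, which is below 0.05.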
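The calculation for Example 2 can be sketched in Python as follows. The variable names are illustrative. The p-value and critical value use the fact that, for 2 degrees of freedom only, the χ² upper-tail probability has the closed form exp(−x/2); this shortcut does not apply to other degrees of freedom.

```python
import math

# Hypothesised vote shares and the survey results from Example 2.
hypothesised = {"A": 0.45, "B": 0.40, "C": 0.15}
observed = {"A": 102, "B": 82, "C": 16}

n = sum(observed.values())
expected = {party: n * share for party, share in hypothesised.items()}

# Chi-squared statistic: sum of (f_o - f_e)^2 / f_e over the categories.
chi2 = sum((observed[p] - expected[p]) ** 2 / expected[p] for p in observed)
df = len(observed) - 1

# For df = 2: P(chi2 >= x) = exp(-x / 2), so the critical value at
# level alpha is -2 * ln(alpha).
alpha = 0.05
critical_value = -2.0 * math.log(alpha)
p_value = math.exp(-chi2 / 2.0)
reject_h0 = chi2 > critical_value

print(f"chi2 = {chi2:.2f}, df = {df}")
print(f"critical value = {critical_value:.2f}, p-value = {p_value:.3f}")
print("reject H0" if reject_h0 else "do not reject H0")
```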

20 Example 3 During the first 13 weeks of the season, the TV viewers on Saturday evenings were distributed as follows: SVT1 28%, SVT2 25%, TV3 18%, TV4 29%. After a change in the TV program presentation, a sample of 300 households was taken and the following numbers were observed: SVT1 70 households, SVT2 89 households, TV3 46 households, TV4 95 households. Has the change in the TV program presentation changed the pattern of TV viewing?

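21 Example 3

Channel               SVT1             SVT2             TV3              TV4              Total
Observed frequency    70               89               46               95               300
Expected frequency    300x0.28 = 84    300x0.25 = 75    300x0.18 = 54    300x0.29 = 87    300

The chi–squared statistic is $\chi^2 = \frac{(70-84)^2}{84} + \frac{(89-75)^2}{75} + \frac{(46-54)^2}{54} + \frac{(95-87)^2}{87} \approx 6.87$, which is below the critical value 7.81 for 3 degrees of freedom. Therefore we do not reject the null hypothesis at the level of significance 0.05. That is, there is no evidence that the change in the program has affected people's TV watching habits.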
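The goodness–of–fit steps can be wrapped in a small reusable function and applied to the TV data. This is a sketch under stated assumptions: the function name `goodness_of_fit` is hypothetical, and the critical value 7.81 (df = 3, level 0.05) is taken from a standard chi–squared table rather than computed.

```python
def goodness_of_fit(observed, probabilities):
    """Return (statistic, degrees of freedom, expected counts).

    `observed` holds the category counts; `probabilities` holds the
    category probabilities claimed by the null hypothesis.
    """
    n = sum(observed)
    expected = [n * p for p in probabilities]
    statistic = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
    return statistic, len(observed) - 1, expected

# SVT1, SVT2, TV3, TV4: shares before the change and counts after it.
old_shares = [0.28, 0.25, 0.18, 0.29]
households = [70, 89, 46, 95]

chi2, df, expected = goodness_of_fit(households, old_shares)

# Tabulated critical value for df = 3 at the 0.05 level.
CRITICAL_DF3_ALPHA_05 = 7.81
reject_h0 = chi2 > CRITICAL_DF3_ALPHA_05

print(f"chi2 = {chi2:.2f} with df = {df}")
print("reject H0" if reject_h0 else "do not reject H0")
```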

22 Two-way contingency tables
• An experiment has a two-way design if two categorical factors are studied with several levels of each factor.
• Two-way tables organize data about two categorical variables.
• Example: We call Education the row variable and Age group the column variable.
• Each combination of values for these two variables is called a cell.

23 Describing relations
• The cells of a two-way table represent the intersection of a given level of one categorical factor with a given level of the other categorical factor.
• We can also compute each count as a percent of the column total. These percents should add up to 100% and together are the conditional distributions of education level given age group. Here the percents are calculated by age range (columns).

24 Hypothesis: no association Again, we want to know if the differences in sample proportions are likely to have occurred just by chance, because of the random sampling. We use the chi-square (χ²) test to assess the null hypothesis of no relationship between the two categorical variables of a two-way table. H0: there is no relationship between these two categorical variables. Are these conditional probability distributions the same (very close)?

25 Expected counts in two-way tables H0: there is no relationship between these two categorical variables. Ha: there is a relationship between these two categorical variables. To test this hypothesis, we compare actual counts from the sample data with expected counts, given the null hypothesis of no relationship (assuming the null hypothesis is true). The expected count in any cell of a two-way table when H0 is true is: $\text{expected count} = \frac{\text{row total} \times \text{column total}}{n}$, where n is the table total.

26 The chi-square test The chi-square statistic (χ²) is a measure of how much the observed cell counts in a two-way table diverge from the expected cell counts. The formula for the χ² statistic is: $\chi^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$ (summed over all r × c cells in the table). Large values of χ² represent strong deviations from the distribution expected under H0, providing evidence against H0. However, since χ² is a sum, how large a χ² is required for statistical significance will depend on the number of comparisons made.

27 If H0 is true, the chi-square statistic X² has approximately a χ² distribution with (r − 1)(c − 1) degrees of freedom. The P-value for the chi-square test is the area to the right of X² under the χ² distribution with df = (r − 1)(c − 1): P(χ² ≥ X²).

28 Example 1
To see whether people's political beliefs and gender are associated, a survey was conducted on 2771 randomly selected people, and the findings were recorded as follows. Within brackets: the conditional probability of political belief given gender.

Gender \ Political belief   Democratic    Independent   Republican    Total
Female                      573 (0.38)    516 (0.34)    422 (0.28)    1511
Male                        386 (0.31)    475 (0.38)    399 (0.32)    1260
Total                       959           991           821           2771

29 Example 1
H0: ”Political beliefs” and ”Gender” are independent. Ha: They are dependent. Level of significance = 0.05. Under H0, the expected frequencies are f_e = (column total) x (row total) / (total). Within brackets are the expected frequencies.

Gender \ Political belief   Democratic     Independent    Republican     Total
Female                      573 (522.9)    516 (540.4)    422 (447.7)    1511
Male                        386 (436.1)    475 (450.6)    399 (373.3)    1260
Total                       959            991            821            2771

The observed and expected frequencies give χ² ≈ 16.2 with df = (2 − 1)(3 − 1) = 2, far above the critical value 5.99. There is strong evidence for a dependency.
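The expected frequencies and the test statistic for Example 1 can be reproduced with the sketch below. The variable names are illustrative. The p-value line uses the closed form exp(−x/2), which holds only because this 2 x 3 table has 2 degrees of freedom.

```python
import math

# Observed counts: rows Female, Male; columns Democratic,
# Independent, Republican.
observed = [
    [573, 516, 422],
    [386, 475, 399],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-squared statistic summed over all cells.
chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(row_totals))
    for j in range(len(col_totals))
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Exact upper-tail probability for df = 2 only.
p_value = math.exp(-chi2 / 2.0)

for row in expected:
    print([f"{e:.1f}" for e in row])
print(f"chi2 = {chi2:.2f}, df = {df}, p-value = {p_value:.5f}")
```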

30 Cocaine addiction Cocaine produces short-term feelings of physical and mental well-being. To maintain the effect, the drug may have to be taken more frequently and at higher doses. After stopping use, users will feel tired, sleepy, and depressed. The pleasurable high, followed by unpleasant after-effects, encourages repeated compulsive use, which can easily lead to dependency. Desipramine is an antidepressant affecting the brain chemicals that may become unbalanced and cause depression. It was thus tested for recovery from cocaine addiction. Treatment with desipramine was compared to a standard treatment (lithium, with strong anti-manic effects) and a placebo. Is there a relationship between treatment (desipramine, lithium, placebo) and outcome (relapse or not)?

31 Cocaine addiction
Expected relapse counts

               No                   Yes
Desipramine    25*26/74 ≈ 8.78      25*48/74 ≈ 16.22
Lithium        9.14                 16.86
Placebo        8.08                 14.92

Do we have the same percentages for the ”Yes” category in the observed data? If not, there should be some relation between the two variables.

32 Cocaine addiction
Table of counts: “actual & expected,” with three rows and two columns: df = (3−1)*(2−1) = 2

               No relapse    Relapse
Desipramine    15  (8.78)    10 (16.22)
Lithium         7  (9.14)    19 (16.86)
Placebo         4  (8.08)    19 (14.92)

χ² components: (actual − expected)² / expected for each cell.

33 Cocaine addiction: X² = 10.71 and df = 2. From the table, 10.60 < X² < 11.98 ⇒ 0.0025 < p < 0.005 ⇒ reject H0. H0: there is no relationship between treatment (desipramine, lithium, placebo) and outcome (relapse or not).
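The direction of the p-value bracket can be checked numerically: a larger critical value corresponds to a smaller tail probability. The sketch below assumes df = 2, for which the χ² upper-tail probability is exactly exp(−x/2); the function name `tail_df2` is hypothetical.

```python
import math

def tail_df2(x):
    """Upper-tail probability of a chi-squared distribution with df = 2."""
    return math.exp(-x / 2.0)

# Critical values bracketing the statistic, as read from the table.
lower_critical, upper_critical = 10.60, 11.98
x2 = 10.71  # statistic reported on the slide

p_at_lower = tail_df2(lower_critical)   # tail area for the smaller value
p_at_upper = tail_df2(upper_critical)   # tail area for the larger value
p_value = tail_df2(x2)

print(f"P(chi2 >= {lower_critical}) = {p_at_lower:.4f}")
print(f"P(chi2 >= {upper_critical}) = {p_at_upper:.4f}")
print(f"P(chi2 >= {x2}) = {p_value:.4f}")
```

Because the statistic lies between the two critical values, its p-value lies between the two tail areas, with the smaller area belonging to the larger critical value.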

34 Cocaine addiction Minitab statistical software output for the cocaine study: The p-value is 0.005, or half a percent. This is very significant. We reject the null hypothesis of no association and conclude that there is a significant relationship between treatment (desipramine, lithium, placebo) and outcome (relapse or not).

35 Marginal distributions We can look at each categorical variable separately in a two-way table by studying the row totals and the column totals. They represent the marginal distributions, expressed in counts or percentages. (They are written as if in a margin.) 2000 U.S. census

36 Marginal distribution of education Similarly we can do it for column totals to obtain the marginal distribution of age The marginal distributions can then be displayed on separate bar graphs, typically expressed as percents instead of raw counts. Each graph represents only one of the two variables, completely ignoring the second one.

37 Conditional distributions The calculated percents within a two-way table represent the conditional distributions, describing the “relationship” between both variables. For every two-way table, there are two sets of possible conditional distributions (column percents or row percents). For column percents, divide each cell count by the column total. The sum of the percents in each column should be 100, except for possible small round-off errors. When one variable is clearly explanatory, it makes sense to describe the relationship by comparing the conditional distributions of the response variable for each value (level) of the explanatory variable.

38 Conditional Distribution
• In the table below, the 25 to 34 age group occupies the first column. To find the complete distribution of education in this age group, look only at that column. Compute each count as a percent of the column total.
• These percents should add up to 100% because all persons in this age group fall in one of the education categories. These four percents together are the conditional distribution of education, given the 25 to 34 age group.
2000 U.S. census

39 Conditional distributions The percents within the table represent the conditional distributions. Comparing the conditional distributions allows you to describe the “relationship” between both categorical variables. Here the percents are calculated by age range (columns): 11071 / 37785 = 29.30% (cell total / column total).

40 The conditional distributions can be graphically compared using side by side bar graphs of one variable for each value of the other variable. Here the percents are calculated by age range (columns).

41 Music and wine purchase decision What is the relationship between type of music played in supermarkets and type of wine purchased? We want to compare the conditional distributions of the response variable (wine purchased) for each value of the explanatory variable (music played). Therefore, we calculate column percents. Calculations: When no music was played, there were 84 bottles of wine sold. Of these, 30 were French wine. 30/84 = 0.357 (cell total / column total) ⇒ 35.7% of the wine sold was French when no music was played. We calculate the column conditional percents similarly for each of the nine cells in the table.

42 For every two-way table, there are two sets of possible conditional distributions. Wine purchased for each kind of music played (column percents) Music played for each kind of wine purchased (row percents) Does background music in supermarkets influence customer purchasing decisions?

43 Computing expected counts When testing the null hypothesis that there is no relationship between both categorical variables of a two-way table, we compare actual counts from the sample data with expected counts given H0. The expected count in any cell of a two-way table when H0 is true is: $\text{expected count} = \frac{\text{row total} \times \text{column total}}{n}$, where n is the table total. Although in real life counts must be whole numbers, an expected count need not be. The expected count is the mean over many repetitions of the study, assuming no relationship.

44 Music and wine purchase decision The null hypothesis is that there is no relationship between music and wine sales. The alternative is that these two variables are related. What is the expected count in the upper-left cell of the two-way table, under H0? Column total 84: number of bottles sold without music. Row total 99: number of bottles of French wine sold. Table total 243: all bottles sold during the study. This expected cell count is thus (84)(99) / 243 = 34.222. Nine similar calculations produce the table of expected counts.
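For the upper-left cell (French wine, no music), the expected count and the corresponding chi-square component can be checked with a few lines of Python. The observed count of 30 bottles comes from the column-percent calculation earlier in the lecture; the variable names are illustrative.

```python
# Totals for the upper-left cell of the music-by-wine table.
column_total = 84   # bottles sold without music
row_total = 99      # bottles of French wine sold
table_total = 243   # all bottles sold during the study
observed_count = 30 # French wine sold when no music was played

# Expected count under H0: row total * column total / table total.
expected_count = row_total * column_total / table_total

# Contribution of this cell to the chi-square statistic.
component = (observed_count - expected_count) ** 2 / expected_count

print(f"expected count = {expected_count:.3f}")
print(f"chi-square component = {component:.4f}")
```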

45 Computing the chi-square statistic The chi-square statistic (χ²) is a measure of how much the observed cell counts in a two-way table diverge from the expected cell counts. The formula for the χ² statistic is: $\chi^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$ (summed over all r × c cells in the table). Tip: First, calculate the χ² components, (observed − expected)² / expected, for each cell of the table, and then sum them up to arrive at the χ² statistic.

46 Music and wine purchase decision H0: No relationship between music and wine. Ha: Music and wine are related. We calculate nine X² components, one per cell, from the observed counts and the expected counts, and sum them to produce the X² statistic.

47 Music and wine purchase decision H0: No relationship between music and wine. Ha: Music and wine are related. We found that the X² statistic is 18.28. The two-way table has a 3x3 design (3 levels of music and 3 levels of wine). Thus, the degrees of freedom for the χ² distribution for this test are: (r – 1)(c – 1) = (3 – 1)(3 – 1) = 4. From the table, 16.42 < X² = 18.28 < 18.47, so 0.0025 > p-value > 0.001 ⇒ very significant. There is a significant relationship between the type of music played and wine purchases in supermarkets.

48 Music and wine purchase decision
Interpreting the χ² output
X² components:

0.5209   2.3337   0.5209
0.0075   7.6724   6.4038
0.3971   0.0004   0.4223

• The values summed to make up χ² are called the χ² components. When the test is statistically significant, the largest components point to the conditions most different from the expectations based on H0.
Two chi-square components contribute most to the X² total ⇒ the largest effect is for sales of Italian wine, which are strongly affected by Italian and French music. Actual proportions show that Italian music helps sales of Italian wine, but French music hinders it.
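A short sketch of how the components combine: summing the nine values reproduces the X² statistic, and sorting them shows which cells dominate. The list below copies the components in the order they appear on the slide; the variable names are illustrative.

```python
# The nine chi-square components, in slide order.
components = [
    0.5209, 2.3337, 0.5209,
    0.0075, 7.6724, 6.4038,
    0.3971, 0.0004, 0.4223,
]

# The statistic is the sum of the components.
x2 = sum(components)

# The two largest components and their share of the total.
two_largest = sorted(components, reverse=True)[:2]
share = sum(two_largest) / x2

print(f"X2 = {x2:.2f}")
print(f"two largest components: {two_largest}")
print(f"share of the total: {share:.0%}")
```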

49 When is it safe to use a χ² test?
We can safely use the chi-square test when:
• The samples are simple random samples (SRS).
• All individual expected counts are 1 or more (≥ 1).
• No more than 20% of expected counts are less than 5 (< 5).
⇒ For a 2x2 table, this implies that all four expected counts should be 5 or more.
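The two conditions on expected counts can be written as a small check. This is a sketch only: the function name `expected_counts_ok` is hypothetical, and the random-sampling condition cannot be verified from the counts, so it is not part of the check. The first example uses the expected counts from the cocaine study.

```python
def expected_counts_ok(expected):
    """Check the expected-count rules of thumb for a chi-square test.

    Rule 1: every expected count is at least 1.
    Rule 2: no more than 20% of the expected counts are below 5.
    """
    all_at_least_one = min(expected) >= 1
    below_five = sum(1 for e in expected if e < 5)
    # Integer comparison avoids floating-point issues at exactly 20%.
    at_most_20_percent = below_five * 5 <= len(expected)
    return all_at_least_one and at_most_20_percent

# Expected counts from the cocaine study (3 x 2 table).
cocaine_expected = [8.78, 16.22, 9.14, 16.86, 8.08, 14.92]
print(expected_counts_ok(cocaine_expected))

# In a 2x2 table a single expected count below 5 is already 25% of cells.
print(expected_counts_ok([4.0, 6.0, 7.0, 8.0]))
```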

50 Chi-square test vs. z-test for two proportions When comparing only two proportions, such as in a 2x2 table where the columns represent counts of “success” and “failure,” we can test H0: p1 = p2 vs. Ha: p1 ≠ p2 equally with a two-sided z test or with a chi-square test with 1 degree of freedom and get the same p-value. In fact, the two test statistics are related: X² = z².
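The relation X² = z² can be illustrated numerically. The counts below are hypothetical (they are not data from the lecture): 30 successes out of 50 in one group and 20 out of 50 in the other. The z statistic uses the pooled proportion, and both p-values are computed with the standard library's `math.erfc`.

```python
import math

# Hypothetical 2x2 data: successes and sample sizes in two groups.
successes = [30, 20]
sizes = [50, 50]

# Two-sided z test for two proportions with a pooled standard error.
p1, p2 = successes[0] / sizes[0], successes[1] / sizes[1]
pooled = sum(successes) / sum(sizes)
std_err = math.sqrt(pooled * (1 - pooled) * (1 / sizes[0] + 1 / sizes[1]))
z = (p1 - p2) / std_err
p_z = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value

# Chi-square test on the same data, laid out as a 2x2 table.
observed = [[s, n - s] for s, n in zip(successes, sizes)]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)
x2 = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / total) ** 2
    / (row_totals[i] * col_totals[j] / total)
    for i in range(2)
    for j in range(2)
)
p_x2 = math.erfc(math.sqrt(x2 / 2))  # chi-square tail for df = 1

print(f"z = {z:.3f}, z^2 = {z * z:.3f}, X2 = {x2:.3f}")
print(f"p (z test) = {p_z:.4f}, p (chi-square) = {p_x2:.4f}")
```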

51 Successful firms Franchise businesses are sometimes given an exclusive territory by contract. This means that the new outlet will not have to compete with other outlets of the same chain within its own territory. How does the presence of an exclusive-territory clause in the contract relate to the survival of the business? A random sample of 170 new franchises recorded two categorical variables for each firm: (1) whether the firm was successful or not (based on economic criteria) and (2) whether or not the firm had an exclusive-territory contract. This is a 2x2 table (two levels for success, yes/no; two levels for exclusive territory, yes/no). ⇒ df = (2 − 1)(2 − 1) = 1

52 Successful firms How does the presence of an exclusive-territory clause in the contract relate to the survival of the business? To compare firms that have an exclusive territory with those that do not, we start by examining column percents (conditional distribution): The difference between the percent of successes among the two types of firms is quite large. The chi-square test can tell us whether or not these differences can be plausibly attributed to chance (random sampling). Specifically, we will test H 0 : No relationship between exclusive clause and success H a : There is some relationship between the two variables

53 Successful firms Here is the chi-square output from Minitab: The p-value is significant at α = 5% (p = 1.5%), thus we reject H0: we have found a significant relationship between an exclusive territory and the success of a franchised firm.

54 Successful firms Computer output using Crunch It!

55 Computations for two-way tables When analyzing relationships between two categorical variables, follow this procedure: 1. Calculate descriptive statistics that convey the important information in the table—usually column or row percents. 2. Find the expected counts and use them to compute the X 2 statistic. 3. Compare your X 2 statistic to the chi-square critical values from Table F to find the approximate P-value for your test. 4. Draw a conclusion about the association between the row and column variables.

56 Comparing several populations Select independent SRSs from each of c populations, of sizes n 1, n 2,..., n c. Classify each individual in a sample according to a categorical response variable with r possible values. There are c different probability distributions, one for each population. The null hypothesis is that the distributions of the response variable are the same in all c populations. The alternative hypothesis says that these c distributions are not all the same.

57 Cocaine addiction Cocaine produces short-term feelings of physical and mental well-being. To maintain the effect, the drug may have to be taken more frequently and at higher doses. After stopping use, users will feel tired, sleepy, and depressed. The pleasurable high, followed by unpleasant after-effects, encourages repeated compulsive use, which can easily lead to dependency. We compare treatment with an antidepressant (desipramine), a standard treatment (lithium), and a placebo. Population 1: Antidepressant treatment (desipramine) Population 2: Standard treatment (lithium) Population 3: Placebo (“sugar pill”)

58 Cocaine addiction
H0: The proportions of success (no relapse) are the same in all three populations.
Observed proportions of no relapse: Desipramine 15/25 = 0.6, Lithium 7/26 = 0.27, Placebo 4/23 = 0.17
Expected proportion of no relapse under H0: 26/74 = 35% in each population
Expected relapse counts

               No                              Yes
Desipramine    25*26/74 ≈ 8.78 ≈ 25*0.35       16.22 ≈ 25*0.65
Lithium        9.14 ≈ 26*0.35                  16.86 ≈ 26*0.65
Placebo        8.08 ≈ 23*0.35                  14.92 ≈ 23*0.65

59 Cocaine addiction
Table of counts: “actual & expected,” with three rows and two columns: df = (3−1)*(2−1) = 2

               No relapse    Relapse
Desipramine    15  (8.78)    10 (16.22)
Lithium         7  (9.14)    19 (16.86)
Placebo         4  (8.08)    19 (14.92)

χ² components: (actual − expected)² / expected for each cell.

60 Cocaine addiction: X² = 10.71 and df = 2. From the table, 10.60 < X² < 11.98 ⇒ 0.0025 < p < 0.005 ⇒ reject H0. H0: The proportions of success (no relapse) are the same in all three populations. ⇒ The observed proportions of success are not the same in all three populations (Desipramine, Lithium, Placebo). Desipramine is a more successful treatment.

