1 Test of Categorical Data / Proportion
2
- Estimation
  - Estimate population means
  - Estimate population proportion
  - Estimate population variance
- Hypothesis testing
  - Testing population means
  - Testing categorical data / proportion
  - Testing population variances
- Hypothesis about many population means
  - One-way ANOVA
  - Two-way ANOVA
3 Test a proportion of interest in a population, e.g.
- Proportion of defects in production
- Proportion of people travelling by Skytrain in BKK
Test whether the population proportion is as expected, using data collected from a sample.
4
- Binomial proportion
  - Single population
  - Two populations
- Multiple groups proportion: Chi-square test
  - Single population
    - Test of homogeneity
    - Goodness of fit test
  - Two populations
    - Test of homogeneity
    - Test of independence
5 The sample is categorized into two groups.
- Single population: determine whether the proportion in one of the two categories differs from a specified proportion
- Two populations: compare the proportions of two populations
Steps are similar to testing a population mean.
Assumptions
- The sample proportion (or the difference between the two sample proportions, in the case of two populations) is approximately normally distributed
- The sample size (of both samples, in the case of two populations) is sufficiently large (n ≥ 30)
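The single-population test statistic can be sketched as follows, assuming the usual large-sample normal approximation. The symbols $\hat{p}$, $p_0$, $q_0$, $x$ and $n$ are notation chosen here for illustration, not taken from the slide:

```latex
\hat{p} = \frac{x}{n}, \qquad
z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}, \qquad
q_0 = 1 - p_0
```

Here $x$ is the number of sample items in the category of interest, $n$ is the sample size, and $p_0$ is the proportion stated in $H_0$.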
7 A plastic product factory takes a sample of 400 plastic containers, 12 of which are defective. From this data, test whether the proportion of defects is more than 2% at significance level 0.05.
Hypotheses
$H_0: P \le 0.02$
$H_1: P > 0.02$
$\alpha = 0.05$
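8 Calculate test statistic
$\hat{p} = 12/400 = 0.03$
$z = \dfrac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}} = \dfrac{0.03 - 0.02}{\sqrt{(0.02)(0.98)/400}} \approx 1.43$
z-score from table: $z_{0.05} = 1.645$
The calculated z-score is 1.43 < 1.645, so it does not fall in the right-tailed critical region.
Do not reject $H_0$.
There is not enough evidence that the proportion of defects is more than 2% at significance level 0.05.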
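The calculation for the plastic container example can be checked with a minimal sketch like the one below. The function name `one_proportion_z` is hypothetical, and the critical value comes from Python's standard-library `NormalDist` rather than a printed z-table:

```python
import math
from statistics import NormalDist


def one_proportion_z(x, n, p0):
    """Return the z statistic for a single proportion, given x items of interest in n."""
    p_hat = x / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0
    return (p_hat - p0) / standard_error


# Data from the example: 12 defective containers out of 400, H0: P <= 0.02
z = one_proportion_z(x=12, n=400, p0=0.02)

# Right-tailed critical value at alpha = 0.05 (about 1.645)
z_crit = NormalDist().inv_cdf(1 - 0.05)

reject_h0 = z > z_crit
print(f"z = {z:.3f}, critical value = {z_crit:.3f}, reject H0: {reject_h0}")
```

Comparing `z` with `z_crit` mirrors the table lookup on the slide. A left-tailed test would compare `z` with `-z_crit` instead.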
10 In an observation of students wearing and not wearing safety helmets when riding motorcycles, 75 of 500 sampled students wear a helmet. Can we conclude that the proportion of students wearing helmets is less than 20% at significance level 0.01?
Hypotheses
$H_0: P \ge 0.2$
$H_1: P < 0.2$
$\alpha = 0.01$
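11 Calculate test statistic
$\hat{p} = 75/500 = 0.15$
$z = \dfrac{0.15 - 0.20}{\sqrt{(0.2)(0.8)/500}} \approx -2.80$
z-score from table: $z_{0.01} = 2.326$
The calculated z-score is -2.80 < -2.326, falling in the left-tailed critical region.
Reject $H_0$ and accept $H_1$.
The proportion of students wearing helmets is less than 20% at significance level 0.01.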
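The same decision can also be reached with a p-value instead of a critical value. The sketch below uses the helmet data and is an alternative to the table lookup on the slide, not the method shown there; the variable names are illustrative:

```python
import math
from statistics import NormalDist

# Data from the example: 75 of 500 students wear a helmet, H0: P >= 0.20
x, n, p0, alpha = 75, 500, 0.20, 0.01

p_hat = x / n
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Left-tailed test: the p-value is P(Z <= z) under the standard normal
p_value = NormalDist().cdf(z)

reject_h0 = p_value < alpha
print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Rejecting when `p_value < alpha` is equivalent to rejecting when `z` falls below the negative critical value.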
12 A shampoo company expects that after advertising a new product for 2 months, the product will be popular among 60% of consumers. After the advertising period, 300 bottles are given to 300 sampled consumers, 220 of whom respond positively. Test whether the company's expectation holds at significance level 0.05.
13 From a sample of 90 students, 28 students have private cars. Test whether the proportion of students who have private cars is more than 25% at significance level 0.05.
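16 If $d_0 = 0$, e.g. $H_0: P_1 = P_2$ or $P_1 - P_2 = 0$
Pooled estimated proportion
$\hat{p} = \dfrac{x_1 + x_2}{n_1 + n_2}, \quad \hat{q} = 1 - \hat{p}$
$x_1, x_2$: numbers of items of interest in the first and second samples
$n_1, n_2$: sizes of the first and second samples
Additional assumption: $n_1\hat{p}$, $n_1\hat{q}$, $n_2\hat{p}$ and $n_2\hat{q}$ are each sufficiently large (a common rule of thumb is at least 5)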
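17 If $d_0 = 0$, e.g. $H_0: P_1 = P_2$ or $P_1 - P_2 = 0$
Estimated variance of the proportion difference
$\hat{\sigma}^2_{\hat{p}_1 - \hat{p}_2} = \hat{p}\hat{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)$
Z-score calculation adjusted to
$z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$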
18 In a survey, 70 of 100 IT students have a smart phone, and 72 of 150 art students have a smart phone. Test whether the proportion of IT students who have a smart phone is more than 10% higher than that of art students at significance level 0.05.
Hypotheses ($P_1$: IT students, $P_2$: art students)
$H_0: P_1 - P_2 \le 0.10$
$H_1: P_1 - P_2 > 0.10$
$\alpha = 0.05$
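19 Calculate test statistic
$\hat{p}_1 = 70/100 = 0.70, \quad \hat{p}_2 = 72/150 = 0.48$
$z = \dfrac{(\hat{p}_1 - \hat{p}_2) - d_0}{\sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}}} = \dfrac{0.22 - 0.10}{\sqrt{\dfrac{(0.70)(0.30)}{100} + \dfrac{(0.48)(0.52)}{150}}} \approx 1.956$
z-score from table: $z_{0.05} = 1.645$
The calculated z-score is 1.956 > 1.645, falling in the right-tailed critical region.
Reject $H_0$ and accept $H_1$.
The proportion of IT students who have a smart phone is more than 10% higher than that of art students at significance level 0.05.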
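A minimal sketch of the two-population calculation with a non-zero hypothesized difference. It assumes the unpooled standard error, which is the usual choice when $d_0 \ne 0$; the function name `two_proportion_z` is hypothetical:

```python
import math


def two_proportion_z(x1, n1, x2, n2, d0):
    """z statistic for a hypothesized difference d0, using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2 - d0) / standard_error


# Data from the example: 70 of 100 IT students, 72 of 150 art students, d0 = 0.10
z = two_proportion_z(x1=70, n1=100, x2=72, n2=150, d0=0.10)

Z_CRIT_0_05 = 1.645  # right-tailed critical value from the z-table
reject_h0 = z > Z_CRIT_0_05
print(f"z = {z:.3f}, reject H0: {reject_h0}")
```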
20 In a survey, 120 of 200 university students have a notebook computer, and 240 of 500 high school students have a notebook computer. Test whether the proportion of university students who have a notebook computer is higher than that of high school students at significance level 0.025.
Hypotheses ($P_1$: university students, $P_2$: high school students)
$H_0: P_1 \le P_2$
$H_1: P_1 > P_2$
*$d_0 = 0$
$\alpha = 0.025$
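21 Calculate pooled estimated proportion
$\hat{p} = \dfrac{120 + 240}{200 + 500} = \dfrac{360}{700} \approx 0.51, \quad \hat{q} \approx 0.49$
$n_1\hat{p} = 200 \times 0.51 = 102$
$n_1\hat{q} = 200 \times 0.49 = 98$
$n_2\hat{p} = 500 \times 0.51 = 255$
$n_2\hat{q} = 500 \times 0.49 = 245$
Calculate test statistic
$z = \dfrac{0.60 - 0.48}{\sqrt{(0.51)(0.49)\left(\dfrac{1}{200} + \dfrac{1}{500}\right)}} \approx 2.9$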
22 z-score from table: $z_{0.025} = 1.96$
The calculated z-score is 2.9 > 1.96, falling in the right-tailed critical region.
Reject $H_0$ and accept $H_1$.
The proportion of university students who have a notebook computer is higher than that of high school students at significance level 0.025.
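The pooled calculation for the notebook computer example can be sketched as below. The helper name `pooled_two_proportion_z` is hypothetical, and the code keeps full precision rather than rounding the pooled proportion to 0.51, so the result may differ slightly in the second decimal place:

```python
import math


def pooled_two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: P1 = P2 (d0 = 0), using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    q_pool = 1 - p_pool
    standard_error = math.sqrt(p_pool * q_pool * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error, p_pool


# Data from the example: 120 of 200 university students, 240 of 500 high school students
z, p_pool = pooled_two_proportion_z(x1=120, n1=200, x2=240, n2=500)

Z_CRIT_0_025 = 1.96  # right-tailed critical value from the z-table
reject_h0 = z > Z_CRIT_0_025
print(f"pooled p = {p_pool:.4f}, z = {z:.2f}, reject H0: {reject_h0}")
```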
23 From the previous observation of students wearing and not wearing safety helmets when riding motorcycles, the 500 sampled students are grouped by gender as shown in the table. Can we conclude that the proportion of female students wearing helmets is higher than that of male students at significance level 0.05?

| | Male | Female |
|---|---|---|
| Wearing helmet | 40 | 35 |
| Not wearing helmet | 260 | 165 |
| Total | 300 | 200 |
24 Hypotheses
$H_0: P_f \le P_m$
$H_1: P_f > P_m$
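25 Calculate test statistic
$\hat{p}_f = 35/200 = 0.175, \quad \hat{p}_m = 40/300 \approx 0.133, \quad \hat{p} = 75/500 = 0.15$
$z = \dfrac{0.175 - 0.133}{\sqrt{(0.15)(0.85)\left(\dfrac{1}{200} + \dfrac{1}{300}\right)}} \approx 1.28$
z-score from table: $z_{0.05} = 1.645$
The calculated z-score is 1.28 < 1.645, so it does not fall in the right-tailed critical region.
Do not reject $H_0$.
There is not enough evidence that the proportion of female students wearing helmets is higher than that of male students at significance level 0.05.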
27 In a polio vaccination program at a school, 16 of 100 vaccinated female students are infected, and 20 of 200 vaccinated male students are infected. Test whether the proportion of infected female students is 5% higher than the proportion of infected male students at significance level 0.10.
28 Categorical data cannot be measured numerically but can be grouped, e.g. a 5-point rating scale, religion, occupation, and gender.
The data for each group are then frequencies (counts), which can be tested with the chi-square test ($\chi^2$).
The test determines whether the observed proportions of the groups differ from a specified expected ratio.
29 Assumptions
- The sample size must be sufficiently large: 4-5 times the number of groups
- The frequency of each group must not be less than 5. If such a group exists, combine it with an adjacent group (this reduces the degrees of freedom)
- The test cannot be applied to repeated measures designs, such as:
  - Measuring the same sample after a time period, e.g. measuring the effect of a drug after the 1st, 2nd, and 3rd hour
  - Measuring the same variable after changing the treatment, e.g. measuring the blood pressure of the same sample after administering different drug dosages
30 If the sample contains 2 groups (degrees of freedom = 1) and the total frequency is less than 50, Frank Yates suggested using the corrected chi-square.
*If the total frequency is 50 or more, there is no need to use the corrected chi-square.
This topic is not covered further here.
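For reference, the corrected chi-square (Yates's continuity correction) is commonly written as below. This is the standard textbook form, given here as a sketch because the formula itself is not preserved on the slide; $O_i$ and $E_i$ are the observed and expected frequencies of group $i$:

```latex
\chi^2_{c} = \sum_{i=1}^{k} \frac{\left( \lvert O_i - E_i \rvert - 0.5 \right)^2}{E_i}
```

Subtracting 0.5 from each absolute difference makes the statistic slightly smaller, which makes the test more conservative for small samples.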
31
- Single variable
  - Test of homogeneity
  - Goodness of fit test
- Two variables
  - Test of homogeneity
  - Test of independence
df = k - 1 - m
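32 Test of homogeneity (single variable)
Used to determine whether the proportions of two or more groups in a population are similar
Hypotheses
$H_0: p_1 = p_2 = \dots = p_k$
$H_1$: at least one proportion differs
Test statistic
$\chi^2 = \sum_{i=1}^{k} \dfrac{(O_i - E_i)^2}{E_i}$
$O_i$: observed frequency in each group
$E_i$: expected frequency in each group
$k$: number of groups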
33 Reject $H_0$ when the calculated $\chi^2$ is greater than the $\chi^2$ from the table.
[Figure: chi-square distribution showing the acceptance region and the rejection region]
34 In the teaching evaluation of a course, out of a total of 200 students, 72 are very satisfied, 60 are satisfied, 22 are indifferent and 46 are unsatisfied. Are the proportions of the satisfaction levels similar at significance level 0.01?
Hypotheses
$H_0$: The frequencies of the satisfaction levels are not different
$H_1$: The frequencies of the satisfaction levels are different
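35 Calculate test statistic

| Level | O | E | O - E | (O - E)² | (O - E)²/E |
|---|---|---|---|---|---|
| Very satisfied | 72 | 50 | 22 | 484 | 9.68 |
| Satisfied | 60 | 50 | 10 | 100 | 2.00 |
| Indifferent | 22 | 50 | -28 | 784 | 15.68 |
| Unsatisfied | 46 | 50 | -4 | 16 | 0.32 |
| Total | 200 | 200 | 0 | | 27.68 |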
36 Critical chi-square
Degrees of freedom = k - 1 = 4 - 1 = 3, so $\chi^2_{0.01,\,3} = 11.345$
The calculated chi-square is 27.68 > 11.345, falling in the critical region.
Reject $H_0$ and accept $H_1$.
The proportions of the satisfaction levels are not similar at significance level 0.01.
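The satisfaction example can be reproduced with a short sketch like the one below. It assumes equal expected frequencies under $H_0$; the critical value is hard-coded from a chi-square table because Python's standard library has no chi-square quantile function:

```python
# Observed frequencies from the teaching evaluation example
observed = {
    "Very satisfied": 72,
    "Satisfied": 60,
    "Indifferent": 22,
    "Unsatisfied": 46,
}

n = sum(observed.values())
k = len(observed)
expected = n / k  # under H0 every level has the same expected frequency

# Chi-square statistic: sum of (O - E)^2 / E over all groups
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())

CHI2_CRIT_0_01_DF3 = 11.345  # table value for alpha = 0.01, df = k - 1 = 3
reject_h0 = chi_square > CHI2_CRIT_0_01_DF3
print(f"chi-square = {chi_square:.2f}, df = {k - 1}, reject H0: {reject_h0}")
```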
38 A coffee bean reseller assumes that the sale proportions of 4 types of coffee beans are equal. 500 customers are sampled and the number of sales of each type of coffee bean is shown in the table. Test whether the assumption is true at significance level $\alpha$.
[Table: sale count for each type of coffee bean (A, B, C, D)]
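39 Goodness of fit test
Used to determine whether the proportions of two or more groups in a population fit a specified proportion
Hypotheses
$H_0$: the group proportions equal the specified proportions
$H_1$: the group proportions do not equal the specified proportions
Test statistic
$\chi^2 = \sum_{i=1}^{k} \dfrac{(O_i - E_i)^2}{E_i}$
$O_i$: observed frequency in each group
$E_i$: expected frequency in each group, $E_i = np_i$; $n$ = total frequency, $p_i$ = probability of the group under the specified distribution
$k$: number of groups
$m$: number of parameters to be estimated (we only study the non-parametric chi-square, so ignore this)
df = k - 1 - m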
40 A financial institution studies the history of its loan clients. It finds that 80% of the clients repay their loan in 1 year, 10% in 2 years, 6% in 3 years, and 4% in over 3 years. To assess the current situation, 400 recent loan clients are sampled: 287 repay their loan in 1 year, 49 in 2 years, 30 in 3 years, and 34 in over 3 years. Test whether the clients' ability to repay loans has changed.
41 Hypotheses
$H_0: p_1 : p_2 : p_3 : p_4 = 0.8 : 0.1 : 0.06 : 0.04$
$H_1: p_1 : p_2 : p_3 : p_4 \ne 0.8 : 0.1 : 0.06 : 0.04$
OR
$H_0$: Clients' ability to repay loans has not changed
$H_1$: Clients' ability to repay loans has changed
$\alpha = 0.05$
Calculate test statistic
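42

| Time | $O_i$ | $p_i$ | $E_i = np_i$ | $O_i - E_i$ | $(O_i - E_i)^2$ | $(O_i - E_i)^2/E_i$ |
|---|---|---|---|---|---|---|
| 1 year | 287 | 0.80 | 320 | -33 | 1,089 | 3.403 |
| 2 years | 49 | 0.10 | 40 | 9 | 81 | 2.025 |
| 3 years | 30 | 0.06 | 24 | 6 | 36 | 1.500 |
| > 3 years | 34 | 0.04 | 16 | 18 | 324 | 20.250 |
| Total | 400 | 1.00 | 400 | 0 | | 27.178 |

Degrees of freedom = 4 - 1 = 3
The calculated chi-square is 27.178 > 7.81, falling in the critical region.
Reject $H_0$ and accept $H_1$.
Clients' ability to repay loans has changed at significance level 0.05.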
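A minimal sketch of the goodness of fit calculation for the loan example, using $E_i = np_i$. The function name `goodness_of_fit` is hypothetical, and the critical value is taken from a chi-square table rather than computed:

```python
def goodness_of_fit(observed, probabilities):
    """Return the chi-square statistic and the expected frequencies E_i = n * p_i."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi_square, expected


# Loan example: repayment in 1 year, 2 years, 3 years, over 3 years
observed = [287, 49, 30, 34]
historical = [0.80, 0.10, 0.06, 0.04]

chi_square, expected = goodness_of_fit(observed, historical)

CHI2_CRIT_0_05_DF3 = 7.815  # table value for alpha = 0.05, df = 4 - 1 = 3
reject_h0 = chi_square > CHI2_CRIT_0_05_DF3
print(f"expected = {expected}")
print(f"chi-square = {chi_square:.3f}, reject H0: {reject_h0}")
```

Most of the statistic comes from the "over 3 years" group, where 34 clients were observed against 16 expected.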
44 In the exam of a sales training program with 150 participants, the manager expects that the results, which are categorized into 3 groups (very good, good, and fair), will be in the proportion 2:1:2. After the exam, the actual frequencies in the 3 groups are 70, 30, and 50 participants respectively. Are the actual and the expected proportions different at significance level 0.05?
45 Test of Homogeneity
Used to determine whether the proportions of the groups in one variable are similar when grouped by another variable
Two or more groups in each variable
$H_0: p_1 = p_2 = p_3 = \dots = p_n$
$H_1$: at least one proportion differs from the others
E.g. proportions of occupations in three countries
46 Test of Independence
Used to determine whether the value of one variable depends on the value of another variable (2 variables)
$H_0$: Variable x and variable y are independent of each other (are not related)
$H_1$: Variable x and variable y are dependent on each other (are related)
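47 Data are grouped in the rows and columns of a two-way table

| Country | Researcher | Business | Programmer | Sum |
|---|---|---|---|---|
| Thailand | $O_{11}$ | $O_{12}$ | $O_{13}$ | $R_1$ |
| USA | $O_{21}$ | $O_{22}$ | $O_{23}$ | $R_2$ |
| Australia | $O_{31}$ | $O_{32}$ | $O_{33}$ | $R_3$ |
| Sum | $C_1$ | $C_2$ | $C_3$ | $N$ |

The columns are the occupations.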
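48 $r$: number of rows
$c$: number of columns
$O_{ij}$: observed frequency of row i, column j
$E_{ij}$: expected frequency of row i, column j, $E_{ij} = \dfrac{R_i C_j}{N}$
Test statistic
$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$
df = (r - 1)(c - 1)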
50 Reject $H_0$ when the calculated $\chi^2$ is greater than the $\chi^2$ from the table.
[Figure: chi-square distribution showing the acceptance region and the rejection region]
51 In a survey of 1200 sampled individuals grouped by four occupations, the numbers of smokers and non-smokers are listed in the table. Test whether the proportions of smokers and non-smokers differ between occupations.
[Table: occupation (Engineer, Educator, Accountant, Scientist, Total) by Non-smoker, Smoker, and Freq.]
52 Hypotheses
$H_0: p_1 = p_2 = p_3 = p_4$
$H_1$: at least one proportion differs from the others
$\alpha = 0.05$
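53 Calculate expected frequencies
$E_{11} = (300 \times 233)/1200 = 58.25$
$E_{12} = (300 \times 967)/1200 = 241.75$
$E_{21} = (250 \times 233)/1200 = 48.54$
$E_{22} = (250 \times 967)/1200 = 201.46$
$E_{31} = (300 \times 233)/1200 = 58.25$
$E_{32} = (300 \times 967)/1200 = 241.75$
$E_{41} = (350 \times 233)/1200 = 67.96$
$E_{42} = (350 \times 967)/1200 = 282.04$
Row totals: Engineer 300, Educator 250, Accountant 300, Scientist 350. Column totals: Non-smoker 233, Smoker 967. Total: 1200.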
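The expected frequencies depend only on the row and column totals, so they can be computed as in the sketch below. The totals are the ones used in the expressions above, and the dictionary layout is an illustrative choice:

```python
# Row totals (occupations) and column totals (smoking status) from the example
row_totals = {"Engineer": 300, "Educator": 250, "Accountant": 300, "Scientist": 350}
col_totals = {"Non-smoker": 233, "Smoker": 967}

N = sum(row_totals.values())  # grand total, 1200

# E_ij = (row total i) * (column total j) / N
expected = {
    (row, col): row_total * col_total / N
    for row, row_total in row_totals.items()
    for col, col_total in col_totals.items()
}

for (row, col), value in expected.items():
    print(f"E[{row}, {col}] = {value:.2f}")
```

The expected frequencies in each row sum to that row's total, and all of them together sum to N, which is a quick consistency check.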
54 Calculate test statistic
[Table columns: Row-Column, $O_{ij}$, $E_{ij}$, $O_{ij} - E_{ij}$, $(O_{ij} - E_{ij})^2$, $(O_{ij} - E_{ij})^2/E_{ij}$, with a Total row]
55 Degrees of freedom = (r - 1)(c - 1) = 3 × 1 = 3
The calculated chi-square is greater than 7.81, falling in the critical region.
Reject $H_0$ and accept $H_1$.
The proportions of smokers and non-smokers differ between occupations at significance level 0.05.
56 To test whether the achievement score of a training program is related to the achievement score of the actual operation at significance level 0.01, 400 employees are sampled. The scores are listed in the table.
[Table: operation score (Fair, Good, Very good, Total) by training score (Below Average, Average, Above Average, Total)]
57 Hypotheses
$H_0$: The score of the training program and the score of the actual operation are not related
$H_1$: The score of the training program and the score of the actual operation are related
$\alpha = 0.01$
58 Calculate test statistic
Expected frequencies: $E_{ij} = r_i c_j / N$, where the column totals are $c_1 = 60$, $c_2 = 188$, $c_3 = 152$ and $N = 400$
[Table columns: Row-Column, $O_{ij}$, $E_{ij} = r_i c_j / N$, $O_{ij} - E_{ij}$, $(O_{ij} - E_{ij})^2$, $(O_{ij} - E_{ij})^2/E_{ij}$, with a Total row]
59 Degrees of freedom = (r - 1)(c - 1) = 2 × 2 = 4
The calculated chi-square is greater than the critical value $\chi^2_{0.01,\,4} = 13.277$, falling in the critical region.
Reject $H_0$ and accept $H_1$.
The score of the training program and the score of the actual operation are related (are dependent on each other) at significance level 0.01.
61 A factory manager believes that the efficiency of workers depends on how long they have worked in the factory. To test this belief, 100 sample products are inspected. The quality of the samples is listed in the table. Test the belief at significance level $\alpha$.
[Table: product quality (Good, Minor damaged, Major damaged, Total) by employee experience in years (1, 2-5, 6-10, Total)]
62 A toothpaste company wants to know whether the color of the toothpaste is related to the gender of buyers. Samples of 500 males and 500 females are randomly selected and asked for their favored toothpaste color. Test whether the color of the toothpaste is related to gender at significance level $\alpha$.
[Table: gender (Male, Female, Total) by color (White, Other, Total)]
63 Test of Homogeneity and Test of Independence use the same calculation
- Test of Homogeneity tells whether the proportions are the same
  - $H_0$: Proportions are similar for all groups
  - $H_1$: Proportions are not similar for some or all groups
- Test of Independence tells whether two variables are dependent
  - $H_0$: The two variables are independent
  - $H_1$: The two variables are dependent
64 Consider this
- The proportion of each selected major is the same for every gender
- That means that whatever the gender, the proportions remain the same
- That means gender has no effect on the selection of major, and therefore the two are independent