Common Statistical Analyses: Theory behind them


Common Statistical Analyses: Theory behind them
Bandit Thinkhamrop, PhD (Statistics)
Department of Biostatistics and Demography, Faculty of Public Health, Khon Kaen University, THAILAND

Statistical inference revisited
Statistical inference uses data from a sample to draw conclusions about a population, in two complementary ways:
1. Estimate the population parameter, characterized by a confidence interval for the magnitude of the effect of interest.
2. Test a hypothesis formulated before looking at the data, characterized by a p-value.

[Diagram] A sample (n = 25, mean = 52, SD = 5) is drawn from a population; inference runs back from the sample to the population, either as parameter estimation [95% CI] or as hypothesis testing [P-value].

Parameter estimation
Sample: n = 25, mean = 52, SD = 5, so SE = SD/√n = 5/√25 = 1.
Common critical values of Z: 1.64 (90%), 1.96 (95%), 2.58 (99%).
95% CI: 52 − 1.96(1) to 52 + 1.96(1), i.e., 50.04 to 53.96.
We are 95% confident that the population mean lies between 50.04 and 53.96.
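The same interval can be obtained in Stata directly from the summary statistics; a minimal sketch (in versions before Stata 14 the syntax is cii 25 52 5, without the means keyword). Note that cii uses the t rather than the Z critical value, so its limits are slightly wider than the hand calculation above:
. cii means 25 52 5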

Hypothesis testing
Sample: n = 25, mean = 52, SD = 5, SE = 1.
HA: μ ≠ 55
Z = (55 − 52)/1 = 3

Hypothesis testing
H0: μ = 55    HA: μ ≠ 55
[Normal curve centered at 55, spanning −3SE to +3SE, with the observed mean 52 lying 3 SE below the center]
Z = (55 − 52)/1 = 3
If the true population mean is 55, the chance of obtaining a sample mean of 52 or one more extreme (in either direction) is P-value = 1 − 0.9973 = 0.0027.
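A quick check of this two-sided p-value in Stata, using the cumulative standard normal function:
. di 2*(1 - normal(3))
.0026998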

Calculation of the previous example based on the t-distribution
Stata command to find the t critical value for a 95% CI:
. di invttail(24, 0.025)
2.0638986
Stata command to find the two-sided probability:
. di ttail(24, 3)*2
.00620574
Web-based statistical tables: http://vassarstats.net/tabs.html or www.stattrek.com

Revisit the example based on the t-distribution (Stata output)

1. Estimate the population parameter

    Variable |  Obs      Mean   Std. Err.   [95% Conf. Interval]
    ---------+----------------------------------------------------
           x |   25        52           1     49.9361     54.0639

2. Test the hypothesis formulated before looking at the data

One-sample t test
    Variable |  Obs   Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+------------------------------------------------------------
           x |   25     52           1           5     49.9361     54.0639
    mean = mean(x)                                    t = -3.0000
    Ho: mean = 55                       degrees of freedom = 24
    Ha: mean < 55          Ha: mean != 55           Ha: mean > 55
    Pr(T < t) = 0.0031    Pr(|T| > |t|) = 0.0062   Pr(T > t) = 0.9969
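Assuming the 25 observations are stored in a variable x, output of this form could be produced by the following commands (in older Stata versions the first is simply ci x):
. ci means x
. ttest x == 55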

Mean of one group: t-test
1. Hypothesis
H0: μ = 0    Ha: μ ≠ 0
2. Data: the five values of y (n = 5; from the output on the next slide, mean = 3, SD = 1.87, SE = 0.84) [data table partly lost in transcription]
3. Calculate the t-statistic: t = (mean − 0)/SE = 3/0.84 = 3.59
4. Obtain the p-value from the t-distribution with n − 1 = 4 df
Stata command:
. di ttail(4, 3.59)*2
.02296182
P-value = 0.023
5. Make a decision: reject the null hypothesis at the 0.05 significance level. The mean of y is statistically significantly different from zero.

Mean of one group: t-test (cont.)

One-sample t test
    Variable |  Obs   Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+------------------------------------------------------------
           y |    5      3      .83666    1.870829    .6770594    5.322941
    mean = mean(y)                                    t = 3.5857
    Ho: mean = 0                        degrees of freedom = 4
    Ha: mean < 0           Ha: mean != 0            Ha: mean > 0
    Pr(T < t) = 0.9885    Pr(|T| > |t|) = 0.0231   Pr(T > t) = 0.0115

Comparing 2 means: t-test
1. Hypothesis
H0: μA = μB    Ha: μA ≠ μB
2. Data: five values in each group (from the output on the next slide: group a has mean 3, SD 1.87; group b has mean 8, SD 1.73) [data table partly lost in transcription]
3. Calculate the t-statistic: t = (difference in means)/SE(difference) = −5/1.14 = −4.39
4. Obtain the p-value from the t-distribution with n1 + n2 − 2 = 8 df: P-value = 0.002 (http://vassarstats.net/tabs.html)
5. Make a decision: reject the null hypothesis at the 0.05 significance level. The mean of group A is statistically significantly different from that of group B.

t-test

Two-sample t test with equal variances
       Group |  Obs   Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+------------------------------------------------------------
           a |    5      3      .83666    1.870829    .6770594    5.322941
           b |    5      8    .7745967    1.732051    5.849375    10.15063
    combined |   10    5.5    .9916317    3.135815    3.256773    7.743227
        diff |         -5    1.140175               -7.629249   -2.370751
    diff = mean(1) - mean(2)                          t = -4.3853
    Ho: diff = 0                        degrees of freedom = 8
    Ha: diff < 0           Ha: diff != 0            Ha: diff > 0
    Pr(T < t) = 0.0012    Pr(|T| > |t|) = 0.0023   Pr(T > t) = 0.9988
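Assuming the outcome is stored in y with a grouping variable group, this output could come from:
. ttest y, by(group)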

Mann-Whitney U test (Wilcoxon rank-sum test)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test
       group |  obs   rank sum   expected
    ---------+----------------------------
           1 |    5         16       27.5
           2 |    5         39       27.5
    combined |   10         55         55
    unadjusted variance     22.92
    adjustment for ties     -1.25
                          ----------
    adjusted variance       21.67
    Ho: y(group==1) = y(group==2)
             z = -2.471
    Prob > |z| = 0.0135
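The corresponding Stata command, under the same assumed variable names:
. ranksum y, by(group)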

Comparing 2 means: ANOVA
Mathematical model of ANOVA:
X = μ + τ + ε
X = Grand mean + Treatment effect + Error
X = M + T + E
Group means: 3 and 8; grand mean 5.5; treatment effects [3 − 5.5] and [8 − 5.5].
Degrees of freedom: grand mean 1, treatment (between groups) 1, error (within groups) 8.
3. Calculate the F-statistic: F = (SST/df between)/(SSE/df within), the ratio of the between-groups to the within-groups mean square.
4. Obtain the p-value from the F-distribution: P-value = 0.002 (http://vassarstats.net/tabs.html)
5. Make a decision: reject the null hypothesis at the 0.05 significance level. The mean of group A is statistically significantly different from that of group B.

ANOVA, 2 groups

Analysis of Variance
    Source            SS     df      MS            F     Prob > F
    --------------------------------------------------------------
    Between groups    62.5    1    62.5          19.23    0.0023
    Within groups     26      8     3.25
    --------------------------------------------------------------
    Total             88.5    9     9.83333333

Bartlett's test for equal variances: chi2(1) = 0.0211   Prob>chi2 = 0.885

Comparing 3 means: ANOVA
1. Hypothesis
H0: μA = μB = μC    Ha: at least one mean is different
2. Data: five values in each of groups a, b, and c (from the output below: means 3, 8, and 5.2) [data table partly lost in transcription]

ANOVA, 3 groups (cont.)
Mathematical model of ANOVA:
X = μ + τ + ε
X = Grand mean + Treatment effect + Error
X = M + T + E
Group means: 3, 8, and 5.2; grand mean 5.4; treatment effects [3 − 5.4], [8 − 5.4], and [5.2 − 5.4].
Degrees of freedom (n = 15): grand mean 1, treatment (between groups) 2, error (within groups) 12.
3. Calculate the F-statistic: F = (SST/2)/(SSE/12), the ratio of the between-groups to the within-groups mean square.
4. Obtain the p-value from the F-distribution: P-value = 0.003 (http://vassarstats.net/tabs.html)
5. Make a decision: reject the null hypothesis at the 0.05 significance level. At least one mean of the three groups is statistically significantly different from the others.

ANOVA, 3 groups

Analysis of Variance
    Source            SS      df      MS            F     Prob > F
    ---------------------------------------------------------------
    Between groups    62.8     2    31.4           9.71    0.0031
    Within groups     38.8    12     3.23333333
    ---------------------------------------------------------------
    Total            101.6    14     7.25714286

Bartlett's test for equal variances: chi2(2) = 0.0217   Prob>chi2 = 0.989
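Both ANOVA tables above, together with Bartlett's test for equal variances, can be produced in one step, again assuming variables y and group:
. oneway y group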

Kruskal-Wallis test

Kruskal-Wallis equality-of-populations rank test
    +------------------------+
    | group | Obs | Rank Sum |
    |-------+-----+----------|
    |     1 |   5 |    22.00 |
    |     2 |   5 |    61.50 |
    |     3 |   5 |    36.50 |
    +------------------------+
    chi-squared = 7.985 with 2 d.f.            probability = 0.0185
    chi-squared with ties = 8.190 with 2 d.f.  probability = 0.0167
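The corresponding Stata command, under the same assumed variable names:
. kwallis y, by(group)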

Comparing 2 means: Regression
1. Data: code group a as x = 1 and group b as x = 2, with outcome y. [The worked table with columns x, y, (x − x̄), (x − x̄)², (y − ȳ), and (x − x̄)(y − ȳ) was partly lost in transcription.]
Means: x̄ = 1.5, ȳ = 5.5. Sums: Σ(x − x̄)² = 2.5, Σ(x − x̄)(y − ȳ) = 12.5.
Fit y = a + bx, where b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² = 12.5/2.5 = 5.
Since the line passes through the means, 5.5 = a + 5(1.5), thus a = 5.5 − 7.5 = −2.

Comparing 2 means: Regression (cont.)

[Scatter plot: y plotted against group membership, with group a's values at x = 1 and group b's at x = 2; the y-axis runs from −2 to 10.]

Comparing 2 means: Regression (cont.)

[Scatter plot with the fitted line y = a + bx = −2 + 5x: the line passes through the point of means (x̄, ȳ) = (1.5, 5.5), giving y = 3 at x = 1 and y = 8 at x = 2. The slope b is the difference in y between x = 1 and x = 2; the intercept a is the value y = −2 at x = 0.]

Regression model (2 means)

      Source |       SS    df        MS        Number of obs =      10
    ---------+----------------------------    F(  1,     8) =   19.23
       Model |     62.5     1      62.5       Prob > F      =  0.0023
    Residual |       26     8      3.25       R-squared     =  0.7062
    ---------+----------------------------    Adj R-squared =  0.6695
       Total |     88.5     9      9.83333333 Root MSE      =  1.8028

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ---------+------------------------------------------------------------
       group |          5    1.140175    4.39   0.002     2.370751   7.629249
       _cons |         -2    1.802776   -1.11   0.299    -6.157208   2.157208
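A sketch of the command behind this output, under the same assumed variable names. Note that it reproduces the two-sample t-test exactly: the slope (5) is the difference in means, its standard error (1.140175) matches the t-test, and F = 19.23 = t² = 4.3853².
. regress y group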

Regression model (3 means)

i.group   _Igroup_1-3   (naturally coded; _Igroup_1 omitted)

      Source |       SS    df        MS        Number of obs =      15
    ---------+----------------------------    F(  2,    12) =    9.71
       Model |     62.8     2      31.4       Prob > F      =  0.0031
    Residual |     38.8    12      3.23333333 R-squared     =  0.6181
    ---------+----------------------------    Adj R-squared =  0.5545
       Total |    101.6    14      7.25714286 Root MSE      =  1.7981

            y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ----------+------------------------------------------------------------
    _Igroup_2 |          5    1.137248    4.40   0.001     2.522149   7.477851
    _Igroup_3 |        2.2    1.137248    1.93   0.077    -.2778508   4.677851
        _cons |          3    .8041559    3.73   0.003     1.247895   4.752105
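The _Igroup_ dummies indicate the old xi: prefix syntax; current Stata's factor-variable notation does the same thing:
. xi: regress y i.group
. regress y i.group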

Correlation coefficient
Pearson product-moment correlation
Denoted by r (for the sample) or ρ (for the population)
Requires the bivariate normal distribution assumption
Requires a linear relationship
Spearman rank correlation
An alternative for small samples; does not require the bivariate normal distribution assumption

Pearson product-moment correlation
It is, in effect, the mean of the products of the standard scores:
r = Σ(zx · zy)/(n − 1), where zx = (x − x̄)/SDx and zy = (y − ȳ)/SDy.
[Formula display lost in transcription.]
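In Stata, the Pearson and Spearman coefficients (with p-values) come from standard commands; a minimal sketch assuming paired variables x and y:
. pwcorr x y, sig obs
. spearman x y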

Scatter plot

[Scatter plot of the paired observations, b plotted against a; the vertical axis runs to 10, the horizontal axis from 1 to 5.]

Calculation of the correlation coefficient (r)

    [1] x   [2] y   [3] (x − x̄)/SDx   [4] (y − ȳ)/SDy   [3] × [4]
      1       5          -1.07             -1.73            1.85
      2       9          -0.53              0.58           -0.31
    [remaining rows partly lost in transcription]
    Mean: 3 (x)    SD: 1.87 (x), 1.73 (y)

Interpretation of the correlation coefficient

              Negative           Positive
    None      -0.09 to 0.00      0.00 to 0.09
    Small     -0.30 to -0.10     0.10 to 0.30
    Medium    -0.50 to -0.30     0.30 to 0.50
    Strong    -1.00 to -0.50     0.50 to 1.00

These serve as a guide, not a strict rule; in fact, the interpretation of a correlation coefficient depends on the context and purposes. (From Wikipedia, the free encyclopedia)

[Figure: panels of scatter plots with their correlation coefficients, from Wikimedia Commons.] The correlation coefficient reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle row), nor many aspects of nonlinear relationships (bottom row). The figure in the center of the middle row has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero.

Inference on the correlation coefficient
[Formula display lost in transcription; the confidence interval is built on Fisher's z-transformation, z = atanh(r), whose standard error is 1/√(n − 3), and the limits are transformed back to the r scale with tanh.]
Stata commands:
. di tanh(-0.885)
-.70891534
. di tanh(1.887)
.95511058
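A sketch of the full calculation in Stata; the values r = 0.5 and n = 5 below are purely illustrative, not the slide's data:
. di atanh(0.5) - invnormal(0.975)/sqrt(5-3)         // lower limit on the z scale
. di tanh(atanh(0.5) - invnormal(0.975)/sqrt(5-3))   // back-transformed lower limit
. di tanh(atanh(0.5) + invnormal(0.975)/sqrt(5-3))   // back-transformed upper limit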

Stata command (ci2 is a user-written command):
. ci2 x y, corr spearman
Confidence interval for Spearman's rank correlation of x and y, based on Fisher's transformation.
Correlation = 0.354 on 5 observations (95% CI: -0.768 to 0.942)
Warning: This method may not give valid results with small samples (n <= 10) for rank correlations.

Inference on the correlation coefficient
[Formula display lost in transcription; H0: ρ = 0 is tested with t = r√(n − 2)/√(1 − r²) on n − 2 degrees of freedom, here t = 0.9 with 3 df.]
Or use the Stata command:
. di ttail(3, 0.9)*2
.43445103

Inference on proportions
One proportion
Two proportions
Three or more proportions

One proportion: Z-test
1. Hypothesis
H0: π = 0    Ha: π ≠ 0
2. Data: binary outcome y, with n = 50 and p = 0.1
3. Calculate the z-statistic: z = (p − 0)/√(p(1 − p)/n) = 0.1/0.0424 = 2.357
4. Obtain the p-value from the Z-distribution: P-value = 0.018 (http://vassarstats.net/tabs.html)
Stata command to get the p-value:
. di (1-normal(2.357))*2
.01842325
5. Make a decision: reject the null hypothesis at the 0.05 significance level. The proportion of y is statistically significantly different from zero.
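A quick check of the z-statistic itself in Stata; note the standard error uses the observed proportion:
. di (0.1 - 0)/sqrt(0.1*0.9/50)
2.3570226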

Comparing 2 proportions: Z-test
1. Hypothesis
H0: π1 = π0    Ha: π1 ≠ π0
2. Data

           |        y
         x |      0       1 |  Total
    -------+----------------+-------
         0 |     45       5 |     50
         1 |     30      20 |     50
    -------+----------------+-------
     Total |     75      25 |    100

n0 = 50, p0 = 5/50 = 0.1; n1 = 50, p1 = 20/50 = 0.4
3. Calculate the z-statistic: with the pooled proportion p = 25/100 = 0.25, z = (p1 − p0)/√(p(1 − p)(1/n1 + 1/n0)) = 0.3/0.0866 = 3.464
4. Obtain the p-value from the Z-distribution: P-value = 0.0005 (http://vassarstats.net/tabs.html)
5. Make a decision: reject the null hypothesis at the 0.05 significance level. The proportions of y in the two groups of x are statistically significantly different from each other.

Z-test for two proportions

Two-sample test of proportions            0: Number of obs =    50
                                          1: Number of obs =    50
    Variable |     Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
    ---------+--------------------------------------------------------------
           0 |       .1    .0424264                     .0168458    .1831542
           1 |       .4     .069282                     .2642097    .5357903
    ---------+--------------------------------------------------------------
        diff |      -.3    .0812404                    -.4592282   -.1407718
             | under Ho:   .0866025   -3.46   0.001
    diff = prop(0) - prop(1)                             z = -3.4641
    Ho: diff = 0
    Ha: diff < 0           Ha: diff != 0            Ha: diff > 0
    Pr(Z < z) = 0.0003    Pr(|Z| < |z|) = 0.0005   Pr(Z > z) = 0.9997
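This output can be reproduced from the summary figures alone with the immediate command, whose arguments are the sample size and proportion for each group:
. prtesti 50 0.1 50 0.4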

Comparing 2 proportions: Chi-square test
1. Hypothesis
H0: πij = πi+ π+j, where i = 0, 1 and j = 0, 1 (x and y are independent)
Ha: πij ≠ πi+ π+j
2. Data: the same 2 × 2 table as above

      O             E              (O-E)   (O-E)2   (O-E)2/E
     45   (75/100)50 = 37.50        7.50    56.25      1.50
      5   (25/100)50 = 12.50       -7.50    56.25      4.50
     30   (75/100)50 = 37.50       -7.50    56.25      1.50
     20   (25/100)50 = 12.50        7.50    56.25      4.50
                              Chi-square (df = 1) =   12.00

3. Calculate the χ²-statistic: χ² = Σ(O − E)²/E = 12.00
4. Obtain the p-value from the chi-square distribution: P-value = 0.001 (http://vassarstats.net/tabs.html)
5. Make a decision: reject the null hypothesis at the 0.05 significance level. There is a statistically significant association between x and y.

Comparing 2 proportions: Chi-square test

           |        y
         x |      0       1 |  Total
    -------+----------------+-------
         0 |     45       5 |     50
         1 |     30      20 |     50
    -------+----------------+-------
     Total |     75      25 |    100

    Pearson chi2(1) = 12.0000   Pr = 0.001
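The same table and test can be obtained from the cell counts alone with the immediate form of tabulate (adding the exact option would also give Fisher's exact test, as in the csi output below):
. tabi 45 5 \ 30 20, chi2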

. csi 20 5 30 45, or exact

                 |   Exposed   Unexposed |      Total
    -------------+-----------------------+-----------
           Cases |        20           5 |         25
        Noncases |        30          45 |         75
    -------------+-----------------------+-----------
           Total |        50          50 |        100
                 |                       |
            Risk |        .4          .1 |        .25

                 |      Point estimate   |    [95% Conf. Interval]
                 |-----------------------+------------------------
 Risk difference |         .3            |   .1407718    .4592282
      Risk ratio |          4            |    1.62926    9.820408
 Attr. frac. ex. |        .75            |   .3862245    .8981712
 Attr. frac. pop |         .6            |
      Odds ratio |          6            |   2.086602    17.09265   (Cornfield)
                 +-------------------------------------------------
    1-sided Fisher's exact P = 0.0005
    2-sided Fisher's exact P = 0.0010

Binomial regression

. binreg y x, rr

Generalized linear models                  No. of obs      =       100
Optimization     : MQL Fisher scoring      Residual df     =        98
                   (IRLS EIM)              Scale parameter =         1
Deviance         = 99.80946404             (1/df) Deviance =  1.018464
Pearson          = 99.99966753             (1/df) Pearson  =  1.020405
Variance function: V(u) = u*(1-u)          [Bernoulli]
Link function    : g(u) = ln(u)            [Log]
                                           BIC             = -351.4972

             |        EIM
           y | Risk Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    ---------+----------------------------------------------------------------
           x |          4    1.833024    3.03   0.002     1.629265    9.820377
       _cons |         .1    .0424262   -5.43   0.000     .0435379    .2296851

Logistic regression

. logistic y x

Logistic regression                        Number of obs   =       100
                                           LR chi2(1)      =     12.66
                                           Prob > chi2     =    0.0004
Log likelihood = -49.904732                Pseudo R2       =    0.1125

           y | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    ---------+----------------------------------------------------------------
           x |          6    3.316625    3.24   0.001     2.030635    17.72844
       _cons |   .1111111    .0523783   -4.66   0.000      .044106    .2799096

Q & A