Presentation on theme: "Hypothesis testing. Parametric tests"— Presentation transcript:

1 Hypothesis testing. Parametric tests
Georgi Iskrov, MBA, MPH, PhD Department of Social Medicine

2 Outline
Statistical inference
Hypothesis testing
Type I and type II errors
Student t-test
ANOVA
Parametric and non-parametric tests

3 Importance of biostatistics
Diabetes type 2 study:
Experimental group: mean blood sugar level: 103 mg/dl
Control group: mean blood sugar level: 107 mg/dl
Pancreatic cancer study:
Experimental group: 1-year survival rate: 23%
Control group: 1-year survival rate: 20%
Is there a difference?

4 Statistical inference. Role of chance
Formulate hypotheses Collect data to test hypotheses

5 Statistical inference. Role of chance
Formulate hypotheses → collect data to test hypotheses → accept or reject hypothesis. Either decision carries a risk of error.

6 Importance of biostatistics
Diabetes type 2 study:
Experimental group: mean blood sugar level: 103 mg/dl
Control group: mean blood sugar level: 107 mg/dl
Pancreatic cancer study:
Experimental group: 1-year survival rate: 23%
Control group: 1-year survival rate: 20%
Statistics are needed to quantify differences that are too small to recognize through clinical experience alone.

7 Importance of biostatistics
Diabetes type 2 study:
Experimental group: mean blood sugar level: 103 mg/dl
Control group: mean blood sugar level: 107 mg/dl
Increased sample size:
Experimental group: mean blood sugar level: 105 mg/dl
Control group: mean blood sugar level: 106 mg/dl

8 Making the decision
Compare the mean between 2 samples/conditions: a t-test assesses whether the means of two samples are statistically different from each other. This analysis is appropriate whenever you want to compare the means of two samples/conditions.
Mean: the arithmetic average; a hypothetical value that can be calculated for a data set (it does not have to be a value that is actually observed in the data set); calculated by adding up all scores and dividing them by the number of scores.
Assumptions of a t-test: drawn from a parametric population; not (seriously) skewed; no outliers; independent samples.
If 2 samples (with means X̄1 and X̄2) are taken from the same population, then they should have fairly similar means.

9 Importance of biostatistics
Diabetes type 2 study:
Experimental group: mean blood sugar level: 103 mg/dl
Control group: mean blood sugar level: 107 mg/dl
Increased sample size:
Experimental group: mean blood sugar level: 99 mg/dl
Control group: mean blood sugar level: 112 mg/dl

10 Making the decision
Compare the mean between 2 samples/conditions: a t-test assesses whether the means of two samples are statistically different from each other. This analysis is appropriate whenever you want to compare the means of two samples/conditions.
Mean: the arithmetic average; a hypothetical value that can be calculated for a data set (it does not have to be a value that is actually observed in the data set); calculated by adding up all scores and dividing them by the number of scores.
Assumptions of a t-test: drawn from a parametric population; not (seriously) skewed; no outliers; independent samples.
If 2 sample means (X̄1 and X̄2) are statistically different, then the samples are likely to be drawn from 2 different populations (µ1 and µ2), i.e. they really are different.

11 Hypothesis testing The general idea of hypothesis testing involves:
Making an initial assumption; Collecting evidence (data); Based on the available evidence (data), deciding whether to reject or not reject the initial assumption. Every hypothesis test — regardless of the population parameter involved — requires the above three steps.

12 Criminal trial Criminal justice system assumes "the defendant is innocent until proven guilty." That is, our initial assumption is that the defendant is innocent. In the practice of statistics, we make our initial assumption when we state our two competing hypotheses – the null hypothesis (H0) and the alternative hypothesis (HA). Here, our hypotheses are: H0: Defendant is not guilty (innocent) HA: Defendant is guilty In statistics, we always assume the null hypothesis is true. That is, the null hypothesis is always our initial assumption.

13 Null hypothesis – H0 This is the hypothesis under test, denoted as H0.
The null hypothesis is usually stated as the absence of a difference or an effect. The null hypothesis says there is no effect. The null hypothesis is rejected if the significance test shows the data are inconsistent with the null hypothesis.

14 Alternative hypothesis – H1
This is the alternative to the null hypothesis. It is denoted as H', H1, or HA. It is usually the complement of the null hypothesis. If, for example, the null hypothesis says two population means are equal, the alternative says the means are unequal.

15 Criminal trial The prosecution team then collects evidence with the hopes of finding "sufficient evidence" to make the assumption of innocence refutable. In statistics, the data are the evidence. The jury then makes a decision based on the available evidence: If the jury finds sufficient evidence — beyond a reasonable doubt — to make the assumption of innocence refutable, the jury rejects H0 and deems the defendant guilty. We behave as if the defendant is guilty. If there is insufficient evidence, then the jury does not reject H0. We behave as if the defendant is innocent.

16 Making the decision Recall that it is either likely or unlikely that we would observe the evidence we did given our initial assumption. If it is likely, we do not reject the null hypothesis. If it is unlikely, then we reject the null hypothesis in favor of the alternative hypothesis. Effectively, then, making the decision reduces to determining "likely" or "unlikely."

17 Making the decision In statistics, there are two ways to determine whether the evidence is likely or unlikely given the initial assumption: we could take the "critical value approach" (favored in many of the older textbooks), or we could take the "p-value approach" (what is used most often in research, journal articles, and statistical software).
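
To make the two approaches concrete, here is a minimal stdlib-Python sketch that applies both to the same data, using the body-weight figures from slide 33 treated as a z-test with known σ (a simplification, since Python's standard library provides the normal but not the t distribution):

```python
from statistics import NormalDist

# Body-weight example from slide 33: population mu = 70.0, sigma = 4.0,
# sample of n = 28 with mean 67.0. Treated here as a z-test.
sample_mean, mu0, sigma, n = 67.0, 70.0, 4.0, 28
z = (sample_mean - mu0) / (sigma / n ** 0.5)   # test statistic

alpha = 0.05

# Critical value approach: reject H0 if |z| exceeds the value that
# cuts off alpha/2 in each tail of the null distribution.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
reject_by_critical_value = abs(z) > z_crit

# p-value approach: reject H0 if the probability (under H0) of a
# statistic at least this extreme is below alpha.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_by_p_value = p_value < alpha

print(round(z, 2), round(p_value, 5),
      reject_by_critical_value, reject_by_p_value)
```

The two decision rules always agree: |z| exceeds the critical value exactly when the p-value falls below α.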

18 Probability A measure of the likelihood that a particular event will happen. It is expressed by a value between 0 and 1. First, note that we talk about the probability of an event, but what we measure is the rate in a group. If we observe that 5 babies in every 1000 have congenital heart disease, we say that the probability of a (single) baby being affected is 5 in 1000 or 0.005.
(Probability scale: 0.0 = cannot happen; 1.0 = sure to happen.)

19 Making the decision Suppose we find a difference between two groups in survival: patients on a new drug have a survival of 15 months; patients on the old drug have a survival of 18 months. So, the difference is 3 months.

20 Making the decision Suppose we find a difference between two groups in survival: patients on a new drug have a survival of 15 months; patients on the old drug have a survival of 18 months. So, the difference is 3 months. Do we accept or reject the hypothesis of no true difference between the groups (the two drugs)? Is a difference of 3 a lot, statistically speaking – a huge difference that is rarely seen? Or is it not much – the sort of thing that happens all the time?

21 Making the decision A statistical test tells you how often you’d get a difference of 3, simply by chance, if the null hypothesis is correct – no real difference between the two groups. Suppose the test is done and its result is that p = 0.32. This means that you’d get a difference of 3 quite often just by the play of chance – 32 times in 100 – even when there is in reality no true difference between the groups.

22 Making the decision A statistical test tells you how often you’d get a difference of 3, simply by chance, if the null hypothesis is correct – no real difference between the two groups. Suppose the test is done and its result is that p = 0.32. This means that you’d get a difference of 3 quite often just by the play of chance – 32 times in 100 – even when there is in reality no true difference between the groups. On the other hand, if we did the statistical analysis and p = 0.001, then we say that you’d only get a difference as big as 3 by the play of chance 1 time in That’s so rarely that we want to reject our hypothesis of no difference: there is something different about the new therapy.

23 Probability
(Probability scale from 0.0 = cannot happen to 1.0 = sure to happen, with example events marked: winning the lottery at 0.05 and getting hit by a truck at 0.95.)

24 Hypothesis testing
Somewhere between 0.32 and 0.001 we may not be sure whether to reject the null hypothesis or not. Mostly we reject the null hypothesis when, if the null hypothesis were true, the result we got would have happened less than 5 times in 100 by chance. This is the ‘conventional’ cutoff of 5%, or p < 0.05. This cutoff is commonly used, but it is arbitrary, i.e. there is no particular reason why we use 0.05 rather than 0.06 or whatever.

25 Statistical significance
The statistical significance (p-value) of a result is an estimated measure of the degree to which it is "true". P-values are the probability of obtaining an effect at least as extreme as the one in your sample data, assuming the truth of the null hypothesis.
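
This definition can be illustrated directly with a small permutation test, which estimates the p-value as the proportion of chance relabelings that produce an effect at least as extreme as the observed one. The group values below are invented for illustration (not data from the slides):

```python
import random

random.seed(42)
group_a = [103, 99, 105, 101, 98, 104]    # e.g. treated (invented values)
group_b = [107, 110, 106, 108, 111, 105]  # e.g. control (invented values)

# Observed effect: absolute difference between the group means.
observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))

# Under H0 the group labels are arbitrary, so shuffle them repeatedly
# and count how often the shuffled difference is at least as extreme.
pooled = group_a + group_b
n_a = len(group_a)
more_extreme = 0
n_perm = 10_000
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = abs(sum(pooled[:n_a]) / n_a
               - sum(pooled[n_a:]) / (len(pooled) - n_a))
    if diff >= observed:
        more_extreme += 1

p_value = more_extreme / n_perm   # P(effect >= observed | H0 true)
print(round(observed, 2), p_value)
```

The estimated p-value is exactly the quantity the slide describes: the probability, assuming the null hypothesis, of an effect at least as extreme as the one observed.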

26 Power
Power is the probability of rejecting the null hypothesis when it is false.
Power refers to the probability that your test will find a statistically significant difference when such a difference actually exists. In other words, power is the probability that you will reject the null hypothesis when you should (and thus avoid a Type II error). It varies according to the underlying truth: for example, the probability of rejecting the hypothesis of equal population means will vary according to what the actual difference in population means is. All other things being equal, the probability of rejecting the null hypothesis increases with the difference between the population means.
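
Power can be estimated by simulation: generate many experiments in which a real difference exists and count how often the test rejects H0. A stdlib-Python sketch, with made-up values for the true means, σ, and n, and a z-test for simplicity:

```python
import random
from statistics import NormalDist

random.seed(1)

# Invented illustration values: H0 claims mu = 100, but the true
# population mean is 105, with known sigma = 10 and samples of n = 30.
mu0, mu_true, sigma, n, alpha = 100.0, 105.0, 10.0, 30, 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

rejections = 0
trials = 2000
for _ in range(trials):
    sample = [random.gauss(mu_true, sigma) for _ in range(n)]
    mean = sum(sample) / n
    z = (mean - mu0) / (sigma / n ** 0.5)
    if abs(z) > z_crit:          # H0 rejected in this simulated study
        rejections += 1

power = rejections / trials      # fraction of correct rejections
print(round(power, 2))
```

With these values the analytic power is roughly 0.78, and making the true difference larger (or increasing n) pushes the estimate toward 1, exactly as the slide says.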

27 Type I and type II errors
Rejecting the null hypothesis when it is true is called a Type I error. The probability of a Type I error is denoted by the Greek letter α (alpha); α is also the significance level of the test, i.e. the probability of rejecting H0 when H0 is true. A Type II error occurs when we fail to reject H0 when it is false. The probability of a Type II error is denoted by the Greek letter β (beta). By definition, power = 1 - β when H0 is false.

28 Type I and type II errors
Of course we may later find out (e.g. because people do more experiments) that we were actually looking at one of those 5 times in 100, i.e. there really was no difference between the two groups. So we were wrong when we said the result was so rare that it couldn’t be due to chance, and we rejected the null hypothesis in error. This mistake is called a Type I error. It most often occurs if we do lots and lots of tests on the same data: if you do 100 tests, about 5 of them are expected to fall into that category of ‘only happens 5 times in 100’ purely by chance, so it would be a mistake to declare those 5 statistically significant (i.e. to say they ‘would happen by chance, if the null hypothesis is true, less than 5 times in 100’).

29 Choosing a statistical test
Choice of a statistical test depends on:
Level of measurement for the dependent and independent variables
Number of groups or dependent measures
Number of units of observation
Type of distribution
The population parameter of interest (mean, variance, differences between means and/or variances)

30 Choosing a statistical test
Multiple comparison: two or more data sets to be analyzed; either repeated measurements made on the same individuals, or entirely independent samples.
Degrees of freedom: the number of scores, items, or other units in the data set which are free to vary.
One- and two-tailed tests: a one-tailed test of significance is used for a directional hypothesis; two-tailed tests in all other situations.

31 Choosing a statistical test
Sample size: the number of cases on which data have been obtained.
Which of the basic characteristics of a distribution are more sensitive to the sample size?
location (mean, median, mode)
spread (standard deviation, range, IQR)
skewness
kurtosis
Answer: skewness and kurtosis are the most sensitive to sample size; location and spread (mean, standard deviation) are estimated much more stably.
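
A quick simulation illustrates the answer: drawing many small samples from a standard normal distribution and recording each statistic shows that skewness and kurtosis estimates fluctuate far more than the mean or standard deviation at the same sample size. A stdlib-Python sketch:

```python
import random

random.seed(7)

def moments(xs):
    """Return mean, sd, skewness, and excess kurtosis of a sample."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m, m2 ** 0.5, m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

def spread(values):
    """Standard deviation of a list of estimates."""
    mu = sum(values) / len(values)
    return (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5

# 2000 samples of size 20 from N(0, 1); record all four statistics.
stats = [moments([random.gauss(0, 1) for _ in range(20)])
         for _ in range(2000)]
spreads = [spread([s[i] for s in stats]) for i in range(4)]

# spreads = variability of the mean, sd, skewness, kurtosis estimates
print([round(s, 2) for s in spreads])
```

For a true standard normal all four statistics should be near 0, 1, 0, and 0, yet the skewness and kurtosis estimates scatter much more widely from sample to sample than the mean and standard deviation do.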

32 Student t-test
The difference between the means divided by the pooled standard error of the mean.
t-value: the end product of the calculation.
df = degrees of freedom (the number of individual scores that can vary without changing the sample mean).
Standard error: the standard deviation of sample means; it is a measure of how representative a sample is likely to be of the population.
Large SE (relative to the sample mean): lots of variability between means of different samples, so the sample used may not be representative of the population.
Small SE: most sample means are similar to the population mean, so the sample is likely to be an accurate reflection of the population.
Reporting convention: t = 11.456, p < 0.001

33 1-sample t-test
Comparison of a sample mean with a population mean.
It is known that the weight of young adult males has a mean value of 70.0 kg with a standard deviation of 4.0 kg. Thus the population mean µ = 70.0 and the population standard deviation σ = 4.0.
Data from a random sample of 28 males of similar ages, but with a specific enzyme defect: mean body weight of 67.0 kg and sample standard deviation of 4.2 kg.
Question: does the studied group have a significantly lower body weight than the general population?

34 1-sample t-test
population mean, µ = 70.0 kg
population standard deviation, σ = 4.0 kg
sample size, n = 28
sample mean, x̄ = 67.0 kg
sample standard deviation, s = 4.2 kg
Null hypothesis: there is no difference between the sample mean and the population mean.
t-statistic = (67.0 - 70.0) / (4.2 / √28) ≈ -3.78, df = 27, p < 0.001.
The null hypothesis is rejected: the studied group has a significantly lower mean body weight.
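
A quick recomputation of this example from the summary statistics (using the sample standard deviation s = 4.2 kg given on slide 33, and a two-sided critical value taken from a t-table):

```python
# One-sample t-test from summary statistics (slide 33 data).
mu0 = 70.0     # population mean, kg
x_bar = 67.0   # sample mean, kg
s = 4.2        # sample standard deviation, kg
n = 28

se = s / n ** 0.5         # standard error of the mean
t = (x_bar - mu0) / se    # t-statistic, df = n - 1 = 27

# Two-sided critical value for alpha = 0.05, df = 27, from a t-table.
t_crit = 2.052
print(round(t, 2), abs(t) > t_crit)
```

|t| ≈ 3.78 comfortably exceeds the critical value 2.052, so the null hypothesis of equal means is rejected.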

35 2-sample t-test
Comparison of means from two unrelated groups. (For two groups, t-tests and ANOVAs (F-tests) are interchangeable.)
Example: study of the effects of anticonvulsant therapy on bone disease in the elderly.
Study design: a group of treated patients (n = 55); a group of untreated patients (n = 47).
Outcome measure: serum calcium concentration.
Research question: do the groups differ statistically significantly in mean serum calcium concentration?
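
A pooled-variance two-sample t-test can be sketched in a few lines. The calcium values below are invented for illustration, since the slide gives only the study design, not the raw data:

```python
# Independent (pooled-variance) two-sample t-test.
# Serum calcium values (mmol/l) invented for illustration.
treated   = [2.10, 2.05, 1.98, 2.20, 2.08, 1.95, 2.02, 2.11]
untreated = [2.30, 2.25, 2.18, 2.35, 2.22, 2.28, 2.19, 2.31]

def mean(xs):
    return sum(xs) / len(xs)

def ss(xs):
    """Sum of squared deviations from the group mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

n1, n2 = len(treated), len(untreated)
sp2 = (ss(treated) + ss(untreated)) / (n1 + n2 - 2)  # pooled variance
se = (sp2 * (1 / n1 + 1 / n2)) ** 0.5                # SE of the difference
t = (mean(treated) - mean(untreated)) / se           # df = n1 + n2 - 2

# Two-sided critical value for alpha = 0.05, df = 14, from a t-table.
t_crit = 2.145
print(round(t, 2), abs(t) > t_crit)
```

With these invented values |t| far exceeds the critical value, so the two group means would be declared significantly different.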

36 2-matched (paired) samples t-test
Tests the hypothesis that two sample means are equal when the observations are correlated (e.g. pre-/post-test data; data from matched controls).
Example: study of the effects of anticonvulsant therapy on bone disease in the elderly.
Study design: a group of treated patients (n = 40).
Outcome measure: serum calcium concentration before and after treatment.
Research question: does the mean serum calcium concentration differ statistically significantly before and after treatment?
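
The paired test reduces to a one-sample t-test on the within-patient differences. A sketch with invented before/after values:

```python
# Paired-samples t-test: one-sample t-test on the differences.
# Before/after serum calcium values (mmol/l) invented for illustration.
before = [2.40, 2.35, 2.28, 2.45, 2.32, 2.38]
after  = [2.21, 2.24, 2.15, 2.30, 2.19, 2.25]

diffs = [b - a for b, a in zip(before, after)]  # per-patient change
n = len(diffs)
d_bar = sum(diffs) / n
s_d = (sum((d - d_bar) ** 2 for d in diffs) / (n - 1)) ** 0.5

t = d_bar / (s_d / n ** 0.5)   # df = n - 1 = 5

# Two-sided critical value for alpha = 0.05, df = 5, from a t-table.
t_crit = 2.571
print(round(t, 2), t > t_crit)
```

Pairing removes the between-patient variability from the comparison, which is why the same mean change can be significant in a paired design but not in an unpaired one.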

37 Types of t-tests
(Related-samples tests are also called dependent-means tests.)

                         Independent samples           Related samples
Interval/parametric      Independent samples t-test*   Paired samples t-test**
Ordinal/non-parametric   Mann-Whitney U-test           Wilcoxon test

There are lots of different types of t-tests, which need to be used depending on the type of data you have.
(Equal) interval measures: scales in which the difference between consecutive measuring points is of equal value throughout; no arbitrary zero, i.e. positive and negative measures, e.g. temperature.
Ordinal measures: scales on which the items can be ranked in order. There is an order of magnitude, but intervals may vary, i.e. one item on the scale is more or less than another, but it is not clear by how much, as this cannot be measured. Often statements/feelings are attached to numbers which can then be used for rating, e.g. 1 = very good, 2 = good, 3 = neutral, 4 = bad, 5 = very bad; in fact, these data can only be ranked (from highest to lowest), i.e. which score had the highest turn-out.
Different measurement levels have a direct influence on the way the analysis is conducted, because some of them are more amenable to mathematical operations than others.
* 2 experimental conditions and different participants were assigned to each condition
** 2 experimental conditions and the same participants took part in both conditions of the experiment
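
As an example of the non-parametric counterparts in the table, the Mann-Whitney U statistic can be computed by counting pairwise rank "wins"; the ordinal ratings below are invented for illustration:

```python
# Mann-Whitney U: rank-based counterpart of the independent t-test.
# Ordinal ratings (1 = very good ... 5 = very bad), invented values.
group_a = [1, 2, 2, 3, 1, 2]
group_b = [3, 4, 3, 5, 4, 2]

# U for group_a: for every (a, b) pair, count a full win when a < b
# (lower rating) and half a win for a tie.
u_a = sum((a < b) + 0.5 * (a == b) for a in group_a for b in group_b)
u_b = len(group_a) * len(group_b) - u_a
u = min(u_a, u_b)

# Small-sample critical value from a Mann-Whitney table:
# for n1 = n2 = 6 and alpha = 0.05 (two-sided), reject H0 if U <= 5.
print(u_a, u_b, u, u <= 5)
```

Because it uses only rank order, the test needs no assumption that the underlying ratings are normally distributed, which is exactly why it is the ordinal-scale alternative to the t-test.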

38 ANOVA
ANalysis Of VAriance (ANOVA) still compares the differences in means between groups, but it uses the variance of the data to “decide” if the means are different.
Terminology (factors and levels):
Factors: the overall “things” being compared (e.g. age vs task).
Levels: the different elements of a factor (young vs old; naming vs reading aloud).
ANOVA is concerned with differences between the means of groups, not differences between variances; the name analysis of variance comes from the way the analysis uses variances to decide whether the means are different. A better acronym for this model would be ANOVASMAD (analysis of variance to see if means are different)! The way it works is simple: look at the variation (variance) within the groups, then work out how that variation would translate into variation (i.e. differences) between the groups, taking into account how many subjects there are in the groups. If the observed differences are a lot bigger than what you'd expect by chance, you have statistical significance. (If the patterns of data spread are similar in your different samples, then the means won’t be much different, i.e. the samples are probably from the same population; conversely, if the pattern of variance differs between groups, so will the means, and the samples are likely to be drawn from different populations.)
ANOVA produces an F-statistic or F-ratio, which is similar to a t-score in that it compares the amount of systematic variance in the data to the amount of unsystematic variance; as such, it is the ratio of the experimental effect to the individual differences in performance. If the F-ratio's value is less than 1, it must represent a non-significant result (so you always want an F-ratio greater than 1, indicating that the experimental manipulation had some effect above and beyond the effect of individual differences in performance). To test for significance, compare the obtained F-ratio against the maximum value one would expect to get by chance alone in an F-distribution with the same degrees of freedom. The p-value associated with F is the probability that the differences between groups could occur by chance if the null hypothesis is correct.
ANOVA tests for one overall effect only (this makes it an omnibus test), so it can tell us if the experimental manipulation was generally successful, but it does not provide specific information about which specific groups were affected. Hence the need for post-hoc testing: ANOVA can tell you if there is an effect, but not where.
Reporting convention: F = 65.58, df = 4, 45, p < 0.001
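
The F-ratio described above can be computed directly as between-group variance over within-group variance. A sketch with three invented groups of five observations each:

```python
# One-way ANOVA F-ratio: between-group variance / within-group variance.
# The three groups below are invented for illustration.
groups = [
    [23, 25, 21, 24, 22],
    [28, 30, 27, 29, 31],
    [23, 22, 24, 25, 21],
]

n_total = sum(len(g) for g in groups)
k = len(groups)
grand_mean = sum(sum(g) for g in groups) / n_total

# Systematic variation: how far each group mean sits from the grand mean.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in groups)
# Unsystematic variation: scatter of observations within their own group.
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

ms_between = ss_between / (k - 1)       # df1 = k - 1 = 2
ms_within = ss_within / (n_total - k)   # df2 = n - k = 12
f = ms_between / ms_within

# Critical value for alpha = 0.05 with df = (2, 12), from an F-table.
f_crit = 3.885
print(round(f, 2), f > f_crit)
```

Here F is well above both 1 and the critical value, so the omnibus test declares some difference among the three means; as the slide notes, a post-hoc test would still be needed to say which groups differ.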

39 Parametric and non-parametric tests
Parametric test: uses sample statistics to estimate at least one population parameter. Assumption: the variable we have measured in the sample is normally distributed in the population to which we plan to generalize our findings.
Non-parametric test: distribution-free; makes no assumption about the distribution of the variable in the population.

