Probability Distributions and Test of Hypothesis Ka-Lok Ng Dept. of Bioinformatics Asia University.


Normal Distribution

Distribution of a random variable. Statistical parameters: μ (mean) and σ (standard deviation). Normal distribution.

Central Limit Theorem
Consider the following set of measurements for a given population: 55.20, 18.06, 28.16, 44.14, 61.61, 4.88, …, 97.47, 56.89, …, 9.98, … The population mean is … Now consider two samples from this population. These two samples could have means very different from each other and also very different from the true population mean. What happens if we consider not only two samples, but all possible samples of the same size? The answer to this question is one of the most fascinating facts in statistics: the central limit theorem. It turns out that if we calculate the mean of each sample, those mean values tend to follow a normal distribution, independently of the original distribution. The mean of this new distribution of the means is exactly the mean of the original population, and the variance of the new distribution is the population variance reduced by a factor equal to the sample size n.

Central Limit Theorem
When sampling from a population with mean μ and variance σ², the distribution of the sample mean (the sampling distribution of X̄) will have the following properties:
The distribution of X̄ will be approximately normal. The larger the sample is, the more closely the sampling distribution will resemble the normal distribution.
The mean μ_x̄ of the distribution of X̄ will be equal to μ, the mean of the population from which the samples were drawn.
The variance σ_x̄² of the distribution of X̄ will be equal to σ²/n, the variance of the original population divided by the sample size.
The quantity σ_x̄ = σ/√n is called the standard error of the mean.
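The properties above can be checked with a small simulation. The sketch below is not part of the slides: the exponential population, its mean of 50, the sample size n = 30 and the number of samples are all hypothetical choices made only to illustrate the theorem on a clearly non-normal population.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical, clearly non-normal population: exponential with mean 50.
population = [random.expovariate(1 / 50) for _ in range(20_000)]
mu = statistics.fmean(population)
sigma2 = statistics.pvariance(population)

n = 30               # size of each sample (assumed)
num_samples = 5_000  # number of samples drawn (assumed)

# Mean of each random sample of size n drawn from the population.
sample_means = [
    statistics.fmean(random.sample(population, n)) for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)

print(f"population mean      = {mu:.2f}")
print(f"mean of sample means = {mean_of_means:.2f}")
print(f"sigma^2 / n          = {sigma2 / n:.2f}")
print(f"variance of means    = {var_of_means:.2f}")
```

The two pairs of printed numbers should agree closely: the mean of the sample means tracks the population mean, and their variance tracks σ²/n.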

Statistical hypothesis testing
The expression level of a gene in a given condition is measured several times. A mean x̄ of these measurements is calculated. From many previous experiments, it is known that the mean expression level of the given gene in normal conditions is μ. How can you decide which genes are significantly regulated in a microarray experiment? For instance, one can apply an arbitrary cutoff such as a threshold of at least twofold up- or down-regulation. One can formulate the following hypotheses:
1. The gene is up-regulated in the condition under study: x̄ > μ
2. The gene is down-regulated in the condition under study: x̄ < μ
3. The gene is unchanged in the condition under study: x̄ = μ
4. Something has gone awry during the lab experiments and the gene's measurements are completely off; the mean of the measurements may be higher or lower than normal: x̄ ≠ μ

Statistical hypothesis testing
When a hypothesis test is viewed as a decision procedure, two types of error are possible, depending on which hypothesis, H0 or H1, is actually true. If a test rejects H0 (and accepts H1) when H0 is true, it is called a type I error. If a test fails to reject H0 when H1 is true, it is called a type II error. The following table shows the results of the different decisions.

                 Decision
H0               Do not reject H0     Reject H0
True             Correct decision     Type I error
False            Type II error        Correct decision

Statistical hypothesis testing
The next step is to generate two hypotheses. The two hypotheses must be mutually exclusive and all-inclusive.
Mutually exclusive: the two hypotheses cannot both be true at the same time.
All-inclusive: their union has to cover all possibilities.
Expression ratios are converted into probability values to test the hypothesis that particular genes are significantly regulated.
The null hypothesis H0 states that there is no difference in signal intensity across the conditions being tested. The other hypothesis (called the alternate or research hypothesis) is named H1. If we believe that the gene is up-regulated, the research hypothesis will be H1: x̄ > μ. The null hypothesis has to be mutually exclusive and also has to include all other possibilities; therefore, the null hypothesis will be H0: x̄ ≤ μ.
One computes a p-value for testing the hypothesis. The p-value is the probability, assuming the null hypothesis is true, of obtaining just by chance a measurement at least as extreme as the one observed. The probability of rejecting the null hypothesis when it is true is the significance level α, which is typically set at 0.05 (reject when p < 0.05); in other words, we accept that in 1 in 20 cases in which the null hypothesis is true our conclusion will be wrong.
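The "1 in 20" reading of the significance level can be illustrated by simulation. The following is only a sketch under assumed values: the null mean of 100, the known standard deviation of 10, the sample size of 4 and the number of trials are hypothetical, and the test used is an upper one-tail Z test. Data are generated with the null hypothesis true, so every rejection is a type I error.

```python
import random
from statistics import NormalDist, fmean

random.seed(1)  # fixed seed so the run is reproducible

mu0, sigma, n = 100.0, 10.0, 4   # hypothetical null mean, known s.d., sample size
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha)  # upper one-tail critical value, about 1.645

trials = 20_000
rejections = 0
for _ in range(trials):
    # Data generated with H0 true: the real mean equals mu0.
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    z = (fmean(sample) - mu0) / (sigma / n ** 0.5)
    if z > z_crit:
        rejections += 1  # rejecting a true H0 is a type I error

type_i_rate = rejections / trials
print(f"observed type I error rate = {type_i_rate:.3f}")
```

The observed rate should come out close to α = 0.05, which is what the significance level promises when the null hypothesis holds.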

One-tail testing
The alternative hypothesis specifies that the parameter is greater than the value specified under H0, e.g. H1: μ > 15. Such a test is called upper one-tail testing.
Example: The expression level of a gene is measured 4 times in a given condition. The 4 measurements are used to calculate a mean expression level of x̄ = 90. It is known from the literature that the mean expression level of the given gene, measured with the same technology in normal conditions, is μ = 100 and the standard deviation is σ = 10. We expect the gene to be down-regulated in the condition under study and we would like to test whether the data support this assumption.
The alternative hypothesis H1 is "the gene is down-regulated", H1: x̄ < μ; the null hypothesis is therefore H0: x̄ ≥ μ. This is an example of a one-tail hypothesis (left tail), in which we expect the values to be in one particular tail of the distribution.

Statistical hypothesis testing
From the central limit theorem, the means of samples are distributed approximately as a normal distribution.
Sample size n = 4, mean x̄ = 90, μ = 100, standard deviation σ = 10.
Assume a significance level of 5%. The null hypothesis is rejected if the computed p-value is lower than the significance level (0.05).
We can calculate the value of Z as Z = (x̄ − μ)/(σ/√n) = (90 − 100)/(10/√4) = −2.
The probability of having such a value just by chance, i.e. the p-value, is P(Z < −2) = 0.0228.
The computed p-value is lower than our significance threshold (0.0228 < 0.05), therefore we reject the null hypothesis. In other words, we accept the alternate hypothesis. We state that "the gene is down-regulated at the 5% significance level". This will be understood by the knowledgeable reader as a conclusion that is wrong in 5% of the cases or fewer.
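The left-tail Z test of this example can be reproduced in a few lines. This is a minimal sketch using Python's standard-library NormalDist in place of a printed normal table; the variable names are illustrative and not taken from the slides.

```python
from math import sqrt
from statistics import NormalDist

# Values from the example: 4 measurements with mean 90,
# literature mean 100 and standard deviation 10.
x_bar, mu, sigma, n = 90.0, 100.0, 10.0, 4
alpha = 0.05

z = (x_bar - mu) / (sigma / sqrt(n))  # standardised sample mean
p_value = NormalDist().cdf(z)         # left tail: P(Z < z)
reject_h0 = p_value < alpha

print(f"Z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```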

Normal distribution table

NORMDIST (Excel): area under the curve, accumulated from the left-hand side. The figure marks Z = 0 and Z = 2.

Statistical hypothesis testing
Two-tail testing
A novel gene has just been discovered. A large number of expression experiments measured the mean expression level of this gene as 100 with a standard deviation of 10. Subsequently, the same gene is measured 4 times in 4 cancer patients. The mean of these 4 measurements is 109. Can we conclude that this gene is differentially expressed in cancer? We do not know whether the gene will be up-regulated or down-regulated.
Null hypothesis H0: μ = 100; alternative hypothesis H1: μ ≠ 100.
At a significance level of 5%, 2.5% is assigned to the left tail and 2.5% to the right tail.
Z = (109 − 100)/(10/√4) = (9/10) × 2 = 1.8
P-value: P(Z ≥ 1.8) = 1 − P(Z ≤ 1.8) = 1 − 0.9641 = 0.0359 > 0.025; that is, the p-value is higher than the significance level allotted to one tail, so we cannot reject the null hypothesis.
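The two-tail calculation can be sketched the same way. The code below assumes the values stated in the example and uses the standard-library NormalDist instead of a table; comparing the upper-tail probability with α/2 is equivalent to comparing the two-sided p-value with α.

```python
from math import sqrt
from statistics import NormalDist

# Values from the example: population mean 100, s.d. 10,
# 4 measurements in cancer patients with mean 109.
x_bar, mu, sigma, n = 109.0, 100.0, 10.0, 4
alpha = 0.05

z = (x_bar - mu) / (sigma / sqrt(n))
upper_tail = 1 - NormalDist().cdf(abs(z))  # P(Z >= |z|)
p_two_sided = 2 * upper_tail               # both tails together
reject_h0 = p_two_sided < alpha            # same as upper_tail < alpha / 2

print(f"Z = {z:.2f}, upper tail = {upper_tail:.4f}, "
      f"two-sided p = {p_two_sided:.4f}, reject H0: {reject_h0}")
```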

Tests involving the mean – the t distribution
Hypothesis testing
Parametric testing: the data are known or assumed to follow a certain probability distribution (e.g. the normal distribution).
Non-parametric testing: no a priori knowledge is available and no such assumptions are made.
The t-test, or Student's t-test, is a parametric test. It was discovered by William S. Gosset, a 32-year-old research chemist employed by the famous Irish brewery Guinness.

Tests involving the mean – the t distribution
Tests involving a single sample may focus on the mean of the sample (t-test, where the variance of the population is not known) and on the variance (χ²-test). The following hypotheses may be formulated if the testing regards the mean of the sample:
1. H0: μ = c, H1: μ ≠ c
2. H0: μ ≥ c, H1: μ < c
3. H0: μ ≤ c, H1: μ > c
The first pair of hypotheses corresponds to two-tail testing, in which no a priori knowledge is available, while the second and the third correspond to one-tail testing, in which the mean μ is expected to be lower and higher than the reference value c, respectively.

Tests involving the mean – the t distribution
The expression level of a gene is known to have a mean of 18 in the normal human population. The following expression values have been obtained in five measurements: 21, 18, 23, 20, 18. Are these data consistent with the published mean of 18 at a 5% significance level?
The population s.d. σ is not known, so we use a t-test and calculate the sample s.d. s to estimate σ.
H0: μ = 18, H1: μ ≠ 18, so this is a two-tail test.
Calculate the t-test statistic with x̄ = 20 and s = 2.12: $t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{20 - 18}{2.12/\sqrt{5}} \approx 2.11$
Remember to use n − 1 when calculating the standard deviation s.

Tests involving the mean – the t distribution
Degrees of freedom: ν = n − 1 = 5 − 1 = 4. Using a table of the t-distribution with four degrees of freedom, the p-value associated with this test statistic is found to be between 0.05 and 0.1. The 5% two-tail test corresponds to a critical value of 2.776. Since the p-value is greater than 0.05 (t-value = 2.11 < critical value = 2.776), the evidence is not strong enough to reject the null hypothesis of mean 18, so H0 is not rejected. The t-distribution is symmetric.
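The t statistic for the five measurements can be checked with the standard library. This is a sketch only: Python's standard library has no t distribution, so the code compares the statistic with the tabulated two-tail critical value 2.776 quoted above rather than computing a p-value.

```python
from math import sqrt
from statistics import fmean, stdev

values = [21, 18, 23, 20, 18]  # the five measurements from the example
mu0 = 18                       # published mean under H0
t_crit = 2.776                 # two-tail 5% critical value for 4 d.o.f. (from the table)

n = len(values)
x_bar = fmean(values)
s = stdev(values)              # sample s.d., uses the n - 1 denominator
t = (x_bar - mu0) / (s / sqrt(n))
reject_h0 = abs(t) > t_crit

print(f"mean = {x_bar:.2f}, s = {s:.3f}, t = {t:.3f}, reject H0: {reject_h0}")
```

Note that `statistics.stdev` already divides by n − 1, whereas `statistics.pstdev` divides by n and would give the wrong statistic here.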

The t-distribution table: cumulative probability starting from the left-hand side. Two-tail α = 0.10, 0.05.

The t-distribution table in Excel: TINV gives the two-tail critical value.

Tests involving the mean – the t distribution
The expression level of a gene is known to have a mean of 225 in the normal human population. Expression values have been obtained in sixteen measurements, in which the sample mean and s.d. are found to be … and …, respectively. Is the mean of these data higher than the published mean at a 5% significance level?
This is a right-hand one-tail test.
Null hypothesis H0: x̄ ≤ μ = 225; alternative hypothesis H1: x̄ > μ = 225.
t-score = (x̄ − 225)/[s/sqrt(16)] = …
Degrees of freedom = 15.
The 5% level corresponds to a critical value t_0.05(15) of 1.753. The t-score is less than the critical value. Based on the critical value, we cannot reject the null hypothesis: the gene expression data do not show a mean higher than the published mean of 225 at a 5% significance level.

Tests involving the variance – the chi-square distribution
The expression level of a gene is known to have a variance σ² = 5000 in the normal human population. The same gene is measured 26 times and found to have s² = 9200. Is there evidence that the new measurement is different from the population at a 2% significance level?
The population mean is unknown; we use the χ² test.
Null hypothesis H0: s² = σ² = 5000, that is, the newly measured variance is not different from that of the population.
Alternative hypothesis H1: s² ≠ σ² = 5000 (two-tail test).
The new score variable is $\chi^2 = \frac{(n-1)s^2}{\sigma^2}$
This variable has the interesting property that if all possible samples of size n are drawn from a normal population with variance σ², and for each such sample the quantity (n − 1)s²/σ² is computed, these values will always form the same distribution. This distribution is a sampling distribution called the χ² (chi-square) distribution.
Figure: χ² distribution for the two-tail test, with the "accept H0" region between the p = 0.99 and p = 0.01 points and "reject H0" regions in both tails.

Tests involving the variance – the chi-square distribution
If the sample s.d. s is close to the population s.d. σ, the value of χ² will be close to n − 1 (the degrees of freedom).
If the sample s.d. s is very different from the population s.d. σ, the value of χ² will be very different from n − 1.
Let us use the χ² distribution to solve the above problem.
http://commons.bcit.ca/math/faculty/david_sabo/apples/math2441/section8/onevariance/chisqtable/chisqtable.htm
The critical values are χ²_0.99(25) = 11.524 and χ²_0.01(25) = 44.314 (right-hand tail).
The rejection regions are χ² ≤ 11.524 or χ² ≥ 44.314.
Since 46 > 44.314, we reject the null hypothesis. The measurement is different from the population at a 2% significance level.
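The two-tail variance test can be sketched as follows. Two assumptions are worth stating loudly: the sample variance s² = 9200 is the value implied by the χ² score of 46 quoted above with n = 26 and σ² = 5000, and the critical values 11.524 and 44.314 are the standard chi-square table entries for 25 degrees of freedom at right-tail probabilities 0.99 and 0.01.

```python
n = 26            # number of measurements
sigma2 = 5000.0   # population variance under H0
s2 = 9200.0       # assumed sample variance, implied by the chi-square score of 46

# Chi-square score for a single-sample variance test.
chi2 = (n - 1) * s2 / sigma2

# Assumed table values for 25 d.o.f.: right-tail probabilities 0.99 and 0.01.
lower_crit, upper_crit = 11.524, 44.314

# Two-tail test at the 2% level: reject if the score falls in either tail.
reject_h0 = chi2 <= lower_crit or chi2 >= upper_crit

print(f"chi2 = {chi2:.2f}, d.o.f. = {n - 1}, reject H0: {reject_h0}")
```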

The chi-square distribution in Excel: CHIINV uses the right-hand tail.

Tests involving the variance – the chi-square distribution
The expression level of a gene is known to follow a normal distribution and to have a standard deviation (s.d.) of no more than 5 in the normal human population. The same gene is measured 9 times and found to have an s.d. of 7. Does this data set have a sample variance higher than the published variance at a 5% significance level?
This is a right-hand one-tail test.
Null hypothesis H0: s² ≤ 25; alternative hypothesis H1: s² > 25.
χ² = (9 − 1) × 49/25 = 15.68
Degrees of freedom = 8.
The 5% level corresponds to a critical value of 15.507. The χ² value is larger than the critical value. Based on the critical value, we can reject the null hypothesis. The gene does have an s.d. higher than the published value of 5 at a 5% significance level.
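The right-hand one-tail version needs only one critical value. In the sketch below the helper name `chi2_score` is hypothetical, and the critical value 15.507 is the standard chi-square table entry for 8 degrees of freedom at a right-tail probability of 0.05, stated here as an assumption.

```python
def chi2_score(n: int, sample_sd: float, population_sd: float) -> float:
    """Chi-square score (n - 1) * s^2 / sigma^2 for a single-sample variance test."""
    return (n - 1) * sample_sd ** 2 / population_sd ** 2


# Values from the example: 9 measurements, sample s.d. 7, published s.d. 5.
chi2 = chi2_score(n=9, sample_sd=7.0, population_sd=5.0)
crit = 15.507             # assumed table value: 8 d.o.f., right-tail probability 0.05
reject_h0 = chi2 > crit   # right-hand one-tail test

print(f"chi2 = {chi2:.2f}, critical value = {crit}, reject H0: {reject_h0}")
```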

Tests involving two samples – comparing means
The expression level of the gene AC, measured for six patients (P1 to P6) and six controls (C1 to C6), is given in the following tables:

geneID   P1   P2   P3   P4   P5   P6
AC       …

geneID   C1   C2   C3   C4   C5   C6
AC       …

H0: μ_P = μ_C, H1: μ_P ≠ μ_C
Mean gene expression level of the patients: X̄_P = …
Mean gene expression level of the controls: X̄_C = …
s_P² = 0.059, s_C² = 0.097
To test whether the two samples have the same variance or not, we perform the F-test at a 5% level:
F = 0.059/0.097 ≈ 0.61, d.o.f. = 10
F_0.025(6,6) = …, F_0.975(6,6) = …
The F value lies in between the two critical values, so we accept the null hypothesis: the patients and controls have the same variances.

Tests involving two samples – comparing means
t-statistic of two independent samples with equal variances
The t-score is $t = \frac{\bar{X}_P - \bar{X}_C}{\sqrt{s^2 \left( \frac{1}{n_P} + \frac{1}{n_C} \right)}}$ where $s^2 = \frac{(n_P - 1)s_P^2 + (n_C - 1)s_C^2}{n_P + n_C - 2}$ is the pooled variance.
The p-value, or the probability of having such a value by chance, is …
This value is smaller than the significance level 0.05, and therefore we reject the null hypothesis: the gene AC is expressed differently between cancer patients and healthy subjects.
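The equal-variance two-sample t statistic can be sketched in code. Because the measured values for gene AC are not available here, the two lists below are hypothetical numbers invented purely for illustration, and the function name `pooled_t` is likewise not from the slides. The critical value 2.228 is the standard two-tail 5% table entry for 10 degrees of freedom, stated as an assumption.

```python
from math import sqrt
from statistics import fmean, variance


def pooled_t(a: list[float], b: list[float]) -> tuple[float, int]:
    """Return the equal-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Pooled variance: the two sample variances weighted by their d.o.f.
    s2 = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    t = (fmean(a) - fmean(b)) / sqrt(s2 * (1 / n_a + 1 / n_b))
    return t, df


# Hypothetical expression values (NOT the data from the slides).
patients = [0.9, 1.1, 1.3, 0.8, 1.2, 1.0]
controls = [1.6, 1.9, 1.5, 2.0, 1.7, 1.8]

t, df = pooled_t(patients, controls)
t_crit = 2.228                 # assumed table value: two-tail 5%, 10 d.o.f.
reject_h0 = abs(t) > t_crit

print(f"t = {t:.3f}, d.o.f. = {df}, reject H0: {reject_h0}")
```

With six values in each group the statistic has 6 + 6 − 2 = 10 degrees of freedom, matching the d.o.f. quoted for this example.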