The hypergeometric distribution/ Fisher exact test


- The hypergeometric distribution / Fisher exact test
- Using the hypergeometric distribution to ask if there is a lane effect for RNA-seq
- The Poisson distribution
- The Poisson distribution and RNA-seq

The binomial distribution samples with replacement: flipping a coin does not change the probability of the next flip; there are so many pairs of residues in the protein that we (correctly or incorrectly) treat them as independent; the background death rate of the disease is not affected by our study.

The hypergeometric distribution samples without replacement. I have a deck of 60 cards and 20 of them are marked. I draw 7. What is the probability that I will draw X marked cards? Not exactly dbinom(p = 20/60), because if I draw a marked card, the number of remaining marked cards changes. From the Wikipedia article, the probability mass function is P(X = k) = choose(K, k) * choose(N - K, n - k) / choose(N, n), where N is the population size, K the number of marked items, n the number of draws, and k the number of marked items drawn.

This is not hard to understand or implement. I have 60 cards; 20 are marked; I draw 7. What are the odds I have 3 marked? Here N = 60, K = 20, n = 7, k = 3, and the probability is (how many ways can I draw 3 marked cards * how many ways can I draw 4 unmarked cards) / (how many ways can I draw any 7 cards) = choose(20, 3) * choose(40, 4) / choose(60, 7).
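As a sketch, the counting argument above can be checked directly in R (choose(n, k) computes binomial coefficients):

```r
# Deck of N = 60 cards, K = 20 marked; draw n = 7; probability of k = 3 marked.
# Numerator: ways to pick 3 of the 20 marked times ways to pick 4 of the 40 unmarked.
# Denominator: ways to pick any 7 of the 60 cards.
byHand <- choose(20, 3) * choose(40, 4) / choose(60, 7)

# dhyper(k, # marked, # not marked, # drawn) gives the same answer.
builtIn <- dhyper(3, 20, 40, 7)

byHand
builtIn
```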

And of course, we have dhyper, phyper, qhyper and rhyper…

We see that the hypergeometric and binomial distributions in this case have close to (but not exactly) the same PMFs.

The differences between the hypergeometric and the binomial matter more when the sample size is smaller (of course). Here we have 5 marked cards in a deck of 15, from which we draw 7.
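A sketch comparing the two PMFs for this small deck: the binomial uses p = 5/15 as if we replaced each card, while the hypergeometric accounts for the marked cards being depleted as we draw.

```r
k <- 0:7  # possible numbers of marked cards among the 7 draws

hyperProbs <- dhyper(k, 5, 10, 7)   # without replacement (hypergeometric)
binomProbs <- dbinom(k, 7, 5 / 15)  # with replacement (binomial)

# Both are proper distributions, but they differ noticeably at this sample size.
round(rbind(hyperProbs, binomProbs), 4)
max(abs(hyperProbs - binomProbs))
```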

If you use the hypergeometric distribution for inference, this is called the Fisher test… We have a clinical trial:

         On Drug   Not on Drug
Lived       13          3
Died         2         16

For a one-sided test, what are the odds that by chance you could have split the people who lived with at least 13 on the drug living? The people who live are "marked". We drew 13 "marked" people in 15 draws. There are a total of 16 "marked" people out of 34 people… The one-sided p-value is dhyper(13, 16, 18, 15) + dhyper(14, 16, 18, 15) + dhyper(15, 16, 18, 15), where the arguments are the number marked and drawn (13), the total # marked (16), the # not marked (18), and the # drawn (15).

         On Drug   Not on Drug
Lived       13          3
Died         2         16

Alternatively, you can use fisher.test, but you have to input the matrix…
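For the clinical-trial table, the one-sided dhyper tail sum and fisher.test agree; a sketch:

```r
# Margins: 16 lived ("marked"), 18 died, 15 people drawn onto the drug.
# One-sided p: probability of 13 or more of the survivors landing on the drug.
pByHand <- sum(dhyper(13:15, 16, 18, 15))

# Same table handed to fisher.test; rows are Lived/Died, columns On Drug/Not.
m <- matrix(c(13, 2, 3, 16), nrow = 2)
pFisher <- fisher.test(m, alternative = "greater")$p.value

pByHand
pFisher
```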

This documentation is tough going…

Surprisingly, the Fisher exact test can be conservative. Because of its discrete nature, the only "available" p-values may not line up with 0.05. You want to test at 0.05, but the test can't report that; if the attainable p-values nearest 0.05 are 0.045 and 0.16, then a result whose "real" p-value is > 0.045 but < 0.05 will be reported as 0.16.
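The discreteness is easy to see by listing every one-sided p-value attainable for fixed margins; a sketch using the clinical-trial margins above (16 lived, 18 died, 15 on drug):

```r
# All possible counts of survivors on the drug, given the margins.
kValues <- max(0, 15 - 18):min(16, 15)

# The one-sided p-value attained at each possible table.
attainable <- phyper(kValues - 1, 16, 18, 15, lower.tail = FALSE)

# The attainable p-values form a discrete ladder; 0.05 itself is not on it.
sort(attainable)
```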

From the Wiki: You are unlikely to get into trouble with reviewers for using the Fisher exact test, however.

- The hypergeometric distribution / Fisher exact test
- Using the hypergeometric distribution to ask if there is a lane effect for RNA-seq
- The Poisson distribution
- The Poisson distribution and RNA-seq

An example of the hypergeometric distribution from the genomics literature: Is there a lane effect in RNA-seq experiments?

To put this into R: dhyper(x1, x1 + x2, C1 + C2 - (x1 + x2), C1), where x1 = number of marked reads in lane 1, x2 = number of marked reads in lane 2, C1 = number of reads in lane 1, and C2 = number of reads in lane 2. The arguments are the number marked and drawn in lane 1 (x1), the total # marked (x1 + x2), the # not marked (C1 + C2 - (x1 + x2)), and the # drawn in lane 1 (C1).

We can put it in matrix form and then use fisher.test, where x1 = number of marked reads in lane 1, x2 = number of marked reads in lane 2, C1 = number of reads in lane 1, and C2 = number of reads in lane 2:

                    Lane 1   Lane 2
From the gene         x1       x2
Not from the gene   C1-x1    C2-x2

m <- matrix(c(x1, x2, numReadsLane1 - x1, numReadsLane2 - x2), nrow = 2)
pValue <- fisher.test(m)$p.value
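A sketch with made-up lane counts (the numbers here are hypothetical, chosen only to show the mechanics):

```r
# Hypothetical counts: reads from one gene in each lane, and total reads per lane.
x1 <- 50; numReadsLane1 <- 100000   # lane 1
x2 <- 70; numReadsLane2 <- 120000   # lane 2

m <- matrix(c(x1, x2, numReadsLane1 - x1, numReadsLane2 - x2), nrow = 2)
pValue <- fisher.test(m)$p.value
pValue

# The one-sided version matches the phyper tail built from the dhyper arguments:
phyper(x1, x1 + x2, numReadsLane1 + numReadsLane2 - (x1 + x2), numReadsLane1)
```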

We look in the methods section of the paper for more details… The authors add some small # to make up for the discontinuous nature of the test; it is not clear what justifies this #.

We can run this as a simulation (in this code ignoring the correction for discontinuity). “ceiling” just rounds up to an integer. https://github.com/afodor/metagenomicsTools/blob/master/src/classExamples/simDist/hyper.txt
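A minimal version of such a simulation (ignoring the discontinuity correction; the gene frequency and lane sizes here are made up):

```r
set.seed(42)

numSims <- 1000
p <- 0.001                 # true frequency of reads from the gene (same in both lanes)
C1 <- 50000; C2 <- 60000   # total reads per lane

pValues <- replicate(numSims, {
  x1 <- rbinom(1, C1, p)   # reads from the gene in each lane, under the null
  x2 <- rbinom(1, C2, p)
  m <- matrix(c(x1, x2, C1 - x1, C2 - x2), nrow = 2)
  fisher.test(m)$p.value
})

hist(pValues)  # roughly uniform under a true null
```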

This is (of course) uniformly distributed…

It is interesting to compare our simulation to the real lane data. Clearly the real data does have some artifacts that affect the distribution of a few genes…

We can simulate a differential expression experiment by having the true frequency of expression be different (one panel shows the null hypothesis, the other differential expression). (Maybe the hypergeometric model here does not describe “real” data.)

- The hypergeometric distribution / Fisher exact test
- Using the hypergeometric distribution to ask if there is a lane effect for RNA-seq
- The Poisson distribution
- The Poisson distribution and RNA-seq

Consider a rare event: I have a (very large) collection of cards; 1% of them are marked. I draw 1,000 of the cards. How many marked cards can I expect to see? We can show this with dbinom. The expected value = n * p = 1,000 * 0.01 = 10.
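A sketch of that dbinom calculation:

```r
n <- 1000; p <- 0.01
k <- 0:30                   # plausible counts of marked cards

probs <- dbinom(k, n, p)
plot(k, probs, type = "h")  # distribution peaks near n * p = 10

# Expected value computed directly from the full distribution:
sum((0:n) * dbinom(0:n, n, p))
```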

The Poisson distribution is an alternative way of modeling rare events: P(X = k) = lambda^k * e^(-lambda) / k!. Here lambda is the expected value (n * p) that would occur in n trials; lambda can also be thought of as the frequency of an event occurring over some set interval of time. k is the number of successes… http://en.wikipedia.org/wiki/Poisson_distribution

For the binomial: mean = n * p and variance = n * p * (1 - p). For the Poisson, p is small, so (1 - p) approaches 1 and variance = n * p = mean. The variance and the mean for the Poisson distribution are equal!

We see that the Poisson distribution nicely approximates the binomial distribution for a large sample size… Derivation of the Poisson from the binomial for the limiting case of an infinite # of samples: https://probabilityandstats.wordpress.com/2011/08/18/poisson-as-a-limiting-case-of-binomial-distribution/
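A sketch of the approximation in the large-n, small-p regime:

```r
n <- 10000; p <- 0.001   # large n, small p; lambda = n * p = 10
k <- 0:40

binomProbs <- dbinom(k, n, p)
poisProbs  <- dpois(k, n * p)

# The two PMFs agree closely in this regime.
max(abs(binomProbs - poisProbs))
```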

Just as we can use the binomial test for inference, we can use the Poisson test for inference… Consider an RNA-seq experiment (modeled the same way as marked cards): I have a (small) RNA-seq dataset with 100,000 reads and a gene that is expressed 0.1% of the time. Expected number of reads = p * N = 100,000 * 0.001 = 100. What are the odds that I would see X sequences from this gene? This is the same problem as for the cards…

We can do inference in exactly the same way as with the binomial test… What are the odds that I would see 130 reads if the “true” expression of the gene was 0.001?

The Poisson and binomial tests will give (nearly) identical results in the limiting case of an infinitely large sample size and small p.
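A sketch of the one-sided calculation for 130 reads, both as a direct tail sum and via poisson.test, with the binomial test on the original counts for comparison:

```r
lambda <- 100   # expected reads = 100,000 * 0.001

# Probability of 130 or more reads if the true expression were 0.001:
pTail <- ppois(129, lambda, lower.tail = FALSE)

# poisson.test gives the same one-sided answer.
pTest <- poisson.test(130, T = 1, r = lambda, alternative = "greater")$p.value

pTail
pTest

# The binomial test on the original counts is nearly identical here.
binom.test(130, 100000, 0.001, alternative = "greater")$p.value
```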

- The hypergeometric distribution / Fisher exact test
- Using the hypergeometric distribution to ask if there is a lane effect for RNA-seq
- The Poisson distribution
- The Poisson distribution and RNA-seq

We can use the Poisson distribution to simulate an RNA-seq experiment. We call a success (a read that belongs to the gene) “1” and a failure “0”. Then mean = n * p = # of expected successes. https://github.com/afodor/metagenomicsTools/blob/master/src/classExamples/simDist/Poisson.txt
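A minimal version of the simulation, using the same made-up numbers as above (100,000 reads, gene frequency 0.001, so lambda = 100):

```r
set.seed(1)

# Number of reads from one gene across 10,000 replicate experiments.
readCounts <- rpois(10000, lambda = 100)

mean(readCounts)  # close to 100
var(readCounts)   # also close to 100: mean equals variance for the Poisson
```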

The mean does equal the variance, so our analytical calculation of the mean is correct. The p-values generated by the Poisson test are uniform for a true null.

We see this exact Poisson test in use (for example) here… This is just like the Fisher test, except that it samples with replacement (which won't matter at the large sample size of the # of reads in a typical RNA-seq experiment). Set p to the background frequency observed in one lane. What are the odds that you would see as many reads in the other lane if the real value was p?
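A sketch of that two-lane comparison (the counts here are hypothetical). R's two-sample poisson.test conditions on the total count and asks a binomial question about the split between the lanes:

```r
x1 <- 130; C1 <- 100000   # reads from the gene / total reads, lane 1
x2 <- 90;  C2 <- 100000   # lane 2

pPois <- poisson.test(c(x1, x2), c(C1, C2))$p.value

# Equivalent by hand: given x1 + x2 reads from the gene in total, is the split
# between lanes consistent with the lane sizes?
pBinom <- binom.test(x1, x1 + x2, C1 / (C1 + C2))$p.value

pPois
pBinom
```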

However, when we compare our simulated data to real data, the mean-variance relationship predicted by the Poisson does not hold! This presumably reflects a lack of independence between reads.

Next time: the negative binomial distribution, and a “real” example of an MCMC walk. Please look at this paper… http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003457;jsessionid=1542C917D52714E6043BD1567B416164