Transcription factor binding motifs Prof. William Stafford Noble GENOME 541

Outline
- Representing motifs
- Motif discovery
  - Gibbs sampling
  - MEME
- Scanning for motif occurrences
- Multiple testing correction redux

Motif (n): a succession of notes that has some special importance in or is characteristic of a composition

Motif (n): a recurring genomic sequence pattern (e.g., TCCACGGC)

Sequence-specific transcription factors drive gene regulation

Motif discovery problem Given sequences, find the motif they share:
seq. 1  IGRGGFGEVY at position 515
seq. 2  LGEGCFGQVV at position 430
seq. 3  VGSGGFGQVY at position 682

Motif discovery problem (harder version)
Given: a sequence or family of sequences.
Find:
- the number of motifs
- the width of each motif
- the locations of motif occurrences

Why is this hard?
- Input sequences are long (thousands or millions of residues).
- The motif may be subtle: instances are short, and instances are only slightly similar to one another.

The most common model of sequence motifs is the position-specific scoring matrix (PSSM): one row per letter (A, C, G, T), one column per motif position, with each entry giving the probability of that letter at that position. [Slide shows a partial 4 × 5 probability matrix with rows A, C, G, T and columns 1-5.]

Log-odds score Compare two hypotheses for each observed amino acid (say, an observed "A"): it was generated by the foreground model (i.e., the PSSM), or it was generated by the background model (i.e., randomly selected). Estimate the probability of observing the amino acid under the foreground model, divide by the background probability of observing the same amino acid, and take the log so that the scores are additive.
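In symbols (notation introduced here, not on the slide): the score of amino acid $a$ at motif position $i$ is

$$ s_i(a) = \log \frac{p_i(a)}{b(a)}, $$

where $p_i(a)$ is the foreground (PSSM) probability of $a$ at position $i$ and $b(a)$ is its background probability. Because the scores are logs of ratios, the score of a candidate site is the sum of its per-position scores.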

Motif logos scale letters by relative entropy. The total letter height at position $i$ is $\sum_\alpha p_{i,\alpha} \log_2 (p_{i,\alpha} / b_\alpha)$, where $p_{i,\alpha}$ is the probability of letter $\alpha$ in matrix position $i$ and $b_\alpha$ is the background frequency of $\alpha$. [Slide shows a splice-site motif logo and the CTCF binding motif logo.]

Gibbs sampling Lawrence et al. “Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment.” Science 1993

Alternating approach
1. Guess an initial weight matrix.
2. Use the weight matrix to predict instances in the input sequences.
3. Use the instances to predict a PSSM.
4. Repeat 2 & 3 until satisfied.

Initialization Randomly guess an instance si from each of t input sequences {S1, ..., St}. [Slide highlights one guessed instance in each of five sequences: ACAGTGT, TTAGACC, GTGACCA, ACCCAGG, CAGGTTT.]

Gibbs sampler
- Initially: randomly guess an instance si from each of t input sequences {S1, ..., St}.
- Steps 2 & 3 (search):
  - Throw away an instance si; the remaining (t - 1) instances define a PSSM.
  - The PSSM defines an instance probability at each position of input string Si.
  - Pick a new si according to that probability distribution.
- Return the highest-scoring motif seen.

Sampler step illustration: the instances chosen in sequences 1, 2, 3, and 5 (ACAGTGT, TAGGCGT, ACACCGT, CAGGTTT) define a PSSM; every window of sequence 4 then receives a probability under that PSSM (e.g., 11%; ACGCCGT: 20%; ACGGCGT: 52%), and the new instance for sequence 4 is sampled from this distribution.
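The search loop is short enough to sketch in code. This is a toy illustration of the procedure above, not the Lawrence et al. implementation; the pseudocount scheme and all names are assumptions:

import random
from math import prod

def gibbs_sample_motif(sequences, width, n_iters=1000, seed=0):
    rng = random.Random(seed)
    # Initialization: one random instance start per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(n_iters):
        i = rng.randrange(len(sequences))        # hold one sequence out
        # Build a PSSM (+1 pseudocounts) from the remaining t - 1 instances.
        pssm = [{a: 1.0 for a in "ACGT"} for _ in range(width)]
        for j, s in enumerate(sequences):
            if j != i:
                for k, a in enumerate(s[starts[j]:starts[j] + width]):
                    pssm[k][a] += 1
        for col in pssm:
            total = sum(col.values())
            for a in col:
                col[a] /= total
        # Probability of each window of the held-out sequence under the PSSM.
        held = sequences[i]
        weights = [prod(pssm[k][held[pos + k]] for k in range(width))
                   for pos in range(len(held) - width + 1)]
        # Resample the held-out instance in proportion to those probabilities.
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts   # a full implementation would track the best-scoring motif seen

sites = gibbs_sample_motif(
    ["ACAGTGT", "TAGGCGT", "ACACCGT", "ACGCCGT", "CAGGTTT"], width=4)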

MEME Bailey and Elkan. “Fitting a mixture model by expectation-maximization to discover motifs in biopolymers.” ISMB 1994.

MEME solves the same motif discovery problem
Input: a collection of sequences.
Assumption: one TFBS per sequence.
Output: a high-likelihood PSSM.

The MEME Algorithm MEME uses expectation maximization (EM) to discover sequence motifs. The running example is a set of seven yeast promoter sequences:
5'- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5'- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5'- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5'- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5'- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5'- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5'- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3

The MEME Algorithm Step 1: Randomly guess the positions (and strands) of the sites in the sequences above.

The MEME Algorithm Step 2: Build a PSSM from the guessed sites. [Slide shows an alignment of N width-w sites — AAAAGAGTCA, AAATGACTCA, AAGTGAGTCA, AAAAGAGTCA, GGATGAGTCA, ..., AAATGAGTCA — and the 4 × w count matrix (rows A, C, G, T) derived from it.]

The MEME Algorithm Step 3: Scan each sequence with the motif, scoring every position against the PSSM.

The MEME Algorithm Step 4: Construct a new PSSM from the selected sites. If the two PSSMs are the same, stop; otherwise, return to step 2.

How does MEME avoid finding a local minimum?

MEME runs EM from each starting point

best_score = 0
best_pssm = None
for index in range(len(sequence) - width + 1):
    # Initialize the PSSM from the width-long subsequence at this start.
    old_pssm = make_pssm(sequence[index:index + width])
    while True:
        counts = scan(sequence, old_pssm)   # E-step: expected site counts
        new_pssm = make_pssm(counts)        # M-step: re-estimate the PSSM
        if equal(old_pssm, new_pssm):       # stop when the PSSM stops changing
            break
        old_pssm = new_pssm
    if score_pssm(new_pssm) > best_score:
        best_score = score_pssm(new_pssm)
        best_pssm = new_pssm

Running EM many times is expensive.

MEME uses a heuristic to select good candidate starting points: from each subsequence of the motif width, build an initial PSSM in which the observed letter at each position gets probability 0.5 and every other letter gets roughly 0.17. Run one round of EM from each such initialization, choose the highest-likelihood initializations, and run only those to convergence.
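A sketch of that initialization (the function name and details are mine, not MEME's):

def starting_pssm(subseq):
    # The observed base gets probability 0.5; the other three bases
    # split the remainder (0.5 / 3 ≈ 0.17), as in the matrix above.
    return [{a: (0.5 if a == base else 0.5 / 3) for a in "ACGT"}
            for base in subseq]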

The full MEME algorithm is more complex:

do
    for (width = min; width <= max; width *= 2)   # consider various widths
        for each possible starting point
            run 1 iteration of EM                 # heuristic to speed things up
        select candidate starting points
        for each candidate
            run EM to convergence
        select best motif
    erase motif occurrences                       # find multiple motifs in one data set
until (motif score < threshold)

Comparison of EM and Gibbs sampling
Both alternate between the same two steps: after guessing an initial weight matrix, use the weight matrix to predict instances in the input sequences, then use the instances to predict a weight matrix, and repeat until satisfied.
Convergence: EM converges when the PSSM stops changing; Gibbs sampling runs until you ask it to stop.
Solution quality: EM may not find the motif with the highest score; Gibbs sampling will provably find the motif with the highest score, if you let it run long enough.

Scanning for motifs Grant et al. “FIMO: Scanning for occurrences of a given motif.” Bioinformatics 2011.

CTCF One of the most important transcription factors in human cells, responsible both for turning genes on and for maintaining the 3D structure of DNA.

Motivating question: How accurately does a PSSM predict the binding of a given transcription factor?

Scanning for motif occurrences
Given: a long DNA sequence, e.g.
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG
and a DNA motif represented as a PSSM of log-odds scores:
       1      2      3      4      5      6
A   1.32   1.32  -0.15  -3.32  -3.32  -0.15
C  -3.32  -3.32  -1.00  -3.32  -3.32  -3.32
G  -3.32  -1.00  -1.00  -3.32   1.89  -3.32
T   0.38  -0.15   1.07   1.89  -3.32   1.54
Find: occurrences of the motif in the sequence.

Scanning for motif occurrences Score the first window, TAATGT: 0.38 + 1.32 - 0.15 + 1.89 + 1.89 + 1.54 = 6.87.

Scanning for motif occurrences Slide over one position and score the next window, AATGTT: 1.32 + 1.32 + 1.07 - 3.32 - 3.32 + 1.54 = -1.39.
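A minimal scanner for this example (plain Python written for this writeup, not FIMO's implementation):

PSSM = {  # the log-odds scores tabulated above
    "A": [1.32, 1.32, -0.15, -3.32, -3.32, -0.15],
    "C": [-3.32, -3.32, -1.00, -3.32, -3.32, -3.32],
    "G": [-3.32, -1.00, -1.00, -3.32, 1.89, -3.32],
    "T": [0.38, -0.15, 1.07, 1.89, -3.32, 1.54],
}

def scan(sequence, pssm):
    # Score every window of motif width; return (position, score) pairs.
    width = len(pssm["A"])
    return [(pos, sum(pssm[base][k]
                      for k, base in enumerate(sequence[pos:pos + width])))
            for pos in range(len(sequence) - width + 1)]

seq = "TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG"
print(scan(seq, PSSM)[:2])   # ≈ [(0, 6.87), (1, -1.39)], matching the sums above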

Searching human chromosome 21 with the CTCF motif

Significance of scores The scanning algorithm maps each sequence window to a score (e.g., TTGACCAGCAGGGGGCGCCG → 26.30). A low score means the window is not a motif occurrence; a high score means it is. But how high is high enough?

Two ways to assess significance
- Empirical: randomly generate data according to the null hypothesis, and use the resulting score distribution to estimate p-values.
- Exact: mathematically calculate the distribution over all possible scores.

CTCF empirical null distribution

Poor precision in the tail

Converting scores to p-values Linearly rescale the matrix values to the range [0,100] and integerize.

Converting scores to p-values Find the smallest value. Subtract that value from every entry in the matrix. All entries are now non-negative.

Converting scores to p-values Find the largest value and divide 100 by it (e.g., 100 / 7 = 14.2857). Multiply every entry by the result. All entries are now between 0 and 100.

Converting scores to p-values Round to the nearest integer.
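Putting the three steps together (a sketch; the function name is mine, and pssm is a dict of score rows like the scanning example above):

def integerize_pssm(pssm, scale=100):
    # Shift so the smallest entry is 0, rescale so the largest is `scale`,
    # then round every entry to the nearest integer.
    lo = min(min(row) for row in pssm.values())
    shifted = {a: [v - lo for v in row] for a, row in pssm.items()}
    hi = max(max(row) for row in shifted.values())
    return {a: [round(v * scale / hi) for v in row]
            for a, row in shifted.items()}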

Converting scores to p-values Say that your motif has N columns, now holding integers between 0 and 100; for example:
      1    2    3    4
A    10   67   59   44
C    60   39   49   29
G     0   71   50   54
T   100   43   13   64
Create a matrix with N rows and a column for every achievable score, 0 through 100N (the slide shows columns 0-400 for N = 4). The entry in row i, column j is the number of different sequences of length i that have a score of j.

Converting scores to p-values For each value in the first column of your motif, put a 1 in the corresponding entry of the first row of the matrix (here, at columns 0, 10, 60, and 100). There are only 4 possible sequences of length 1.

Converting scores to p-values For each value x in the second column of your motif and each nonzero entry y in column z of the first row, add y to column x + z of the second row. Row 2 thus gets entries at scores 10+67, 10+39, 10+71, 10+43, 60+67, …, 100+43; these 16 sums correspond to all 16 strings of length 2.

Converting scores to p-values In the end, the bottom row counts, for every achievable score, the number of length-N sequences attaining it. To turn an observed score s into a p-value, sum the counts for all scores ≥ s and divide by 4^N.
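In code (a from-scratch sketch assuming a uniform background, not FIMO's implementation):

from collections import defaultdict

def score_pvalues(int_pssm):
    # int_pssm: list of motif columns, each a dict {base: integer score}.
    counts = {0: 1}                        # one empty prefix with score 0
    for col in int_pssm:
        new_counts = defaultdict(int)
        for partial, n in counts.items():
            for s in col.values():         # extend each prefix by one base
                new_counts[partial + s] += n
        counts = new_counts
    total = 4 ** len(int_pssm)             # number of length-N sequences
    pvals, cumulative = {}, 0
    for s in sorted(counts, reverse=True):
        cumulative += counts[s]            # p-value = Pr(score >= s)
        pvals[s] = cumulative / total
    return pvals

motif = [{"A": 10, "C": 60, "G": 0, "T": 100},   # first two example columns
         {"A": 67, "C": 39, "G": 71, "T": 43}]
print(score_pvalues(motif)[171])   # best score (TG, 100 + 71): p = 1/16 = 0.0625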

Dynamic programming for motif p-values [Figure, shown step by step: a small example motif and its DP table. Row 1 marks the scores of all length-1 sequences; row 2 accumulates all length-2 sequences, with a cell reachable in two ways (e.g., as CG or GA) receiving a count of 2; the fill continues row by row until row N tallies all length-N sequences.]

Multiple testing correction Noble. “How does multiple testing correction work?” Nature Biotechnology 2010.

Multiple testing Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. Assuming that all of the observations are explainable by the null hypothesis, what is the chance that at least one of the observations will receive a p-value less than 0.05?
Pr(making a mistake) = 0.05
Pr(not making a mistake) = 0.95
Pr(not making any mistake) = 0.95^20 = 0.358
Pr(making at least one mistake) = 1 - 0.358 = 0.642
There is a 64.2% chance of making at least one mistake.

Bonferroni correction How does it work?

Bonferroni correction Divide the desired p-value threshold by the number of tests performed. For the previous example, 0.05 / 20 = 0.0025.
Pr(making a mistake) = 0.0025
Pr(not making a mistake) = 0.9975
Pr(not making any mistake) = 0.9975^20 = 0.9512
Pr(making at least one mistake) = 1 - 0.9512 = 0.0488
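The two calculations above are easy to check (m = 20 tests at α = 0.05):

m, alpha = 20, 0.05
print(1 - (1 - alpha) ** m)        # ≈ 0.642: chance of >= 1 mistake, uncorrected
print(1 - (1 - alpha / m) ** m)    # ≈ 0.0488: chance of >= 1 mistake, Bonferroni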

Sample problem You have scanned both strands of the human genome with a single PSSM, yielding 6 × 10⁹ scores. You use dynamic programming to assign a p-value of 2.1 × 10⁻¹¹ to the top-scoring match. Is this match significant at a 95% confidence threshold? No: the Bonferroni-adjusted threshold is 0.05 / (6 × 10⁹) = 8.3 × 10⁻¹², and 2.1 × 10⁻¹¹ exceeds it.

Proof: Bonferroni adjustment controls the family-wise error rate Let $m$ be the number of hypotheses, $m_0$ the number of null hypotheses, $\alpha$ the desired control, and $p_i$ the $i$th p-value. Then

$$ \mathrm{FWER} = \Pr\Big(\bigcup_{i=1}^{m_0} \{ p_i \le \alpha/m \}\Big) \;\le\; \sum_{i=1}^{m_0} \Pr(p_i \le \alpha/m) \;\le\; m_0 \cdot \frac{\alpha}{m} \;\le\; \alpha, $$

where the first inequality is Boole's inequality, the second follows from the definition of a p-value, and the third from the definitions of $m$ and $m_0$ (since $m_0 \le m$). Note: Bonferroni adjustment does not require that the tests be independent.

Types of errors False positive: the algorithm indicates that this position is a binding site, but it actually is not. False negative: the site is a binding site, but the algorithm indicates that it is not. Both types of errors are defined relative to some confidence threshold. Typically, researchers are more concerned about false positives.

False discovery proportion The false discovery proportion (FDP) is the percentage of target sequences above the threshold that are false positives; in the context of motif scanning, it is the percentage of sites above the threshold that are not binding sites. The FDR is the expected value of the FDP. [Slide example: 13 TP, 5 FP, 33 TN, 5 FN, giving FDP = FP / (FP + TP) = 5/18 = 27.8%.]

Family-wise error rate vs. false discovery rate Bonferroni controls the family-wise error rate; i.e., the probability of at least one false positive among the sequences that score better than the threshold. With FDR control, you aim to control the percentage of false positives among the sequences that score better than the threshold.

Controlling the FDR Order the unadjusted p-values $p_1 \le p_2 \le \dots \le p_m$. To control the FDR at level $\alpha$, let $j^* = \max\{\, j : p_j \le j\alpha/m \,\}$ and reject the null hypothesis for $j = 1, \dots, j^*$. (Benjamini & Hochberg, 1995)

FDR example Choose the largest rank j whose p-value is less than the corresponding threshold (jα)/m; here j* = 8. Approximately 5% of the examples above that line are expected to be false positives.
Rank   (jα)/m    p-value
1      0.00005   0.0000008
2      0.00010   0.0000012
3      0.00015   0.0000013
4      0.00020   0.0000056
5      0.00025   0.0000078
6      0.00030   0.0000235
7      0.00035   0.0000945
8      0.00040   0.0002450
----------------------------
9      0.00045   0.0004700
10     0.00050   0.0008900
…
1000   0.05000   1.0000000
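A compact implementation of this step-up rule (a sketch written for this example, not taken from the references):

def benjamini_hochberg(pvalues, alpha=0.05):
    # Return the indices of hypotheses rejected at FDR level alpha.
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # ranks 1..m
    j_star = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            j_star = rank                  # largest rank passing its threshold
    return order[:j_star]                  # reject the j_star smallest p-values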

Summary – Multiple testing correction
- Selecting a significance threshold requires evaluating the cost of making a mistake.
- Bonferroni correction divides the desired p-value threshold by the number of statistical tests performed.
- The false discovery proportion is the percentage of false positives among the target sequences that score better than the threshold.
- Use Bonferroni correction when you want to avoid making a single mistake; control the false discovery rate when you can tolerate a certain percentage of mistakes.