CIS: Compound Importance Sampling for Binding Site p-value Estimation
Yoseph Barash, Gal Elidan, Tommy Kaplan, Nir Friedman
The Hebrew University, Jerusalem, Israel
2 Detecting Target Genes
[Figure: promoter and gene, with candidate locations labeled "binding site?"]
Probabilistic framework: log-odds score over a sequence such as ACGTACGT, positions 1, 2, ..., k
p[i,c]: prob. of letter c at position i
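A minimal sketch of a position-independent log-odds score, assuming a uniform background distribution and a made-up 3-position matrix. The names `PSSM`, `BACKGROUND`, and `log_odds_score` and all the numbers are illustrative assumptions, not values from the talk.

```python
import math

ALPHABET = "ACGT"

# Assumed uniform background P0 (illustrative only).
BACKGROUND = {c: 0.25 for c in ALPHABET}

# Hypothetical motif: PSSM[i][c] plays the role of p[i,c].
PSSM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]


def log_odds_score(seq, pssm=PSSM, background=BACKGROUND):
    """Sum over positions i of log(p[i, c] / p0[c]) for the letter c at i."""
    if len(seq) != len(pssm):
        raise ValueError("sequence length must equal the motif length k")
    return sum(math.log(pssm[i][c] / background[c]) for i, c in enumerate(seq))


print(log_odds_score("ACG"))  # best-matching sequence, positive score
print(log_odds_score("TTT"))  # poor match, negative score
```

A positive score means the motif model explains the sequence better than the background does.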
3 Detecting Target Genes (2)
[Figure: candidate sites marked "?"]
4 p-value of Scores
[Figure: score distribution; axes: Score, Prob; a score S is marked]
5 Detecting Target Genes (3)
p-value score:
- Universal
- Interpretable
- Controls the false positive error rate
[Figure: score and p-value plots, with a Bonferroni-corrected p-value threshold of 0.01]
6 p-value Estimation
Problem 1: naïve enumeration is infeasible, #seq = 4^k
Estimate the p-value by sampling from P_0: sample scores s_1 … s_n
[Figure: score distribution; axes: Score, Prob; threshold S* marked]
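A sketch of the naïve estimator: draw sequences from the background P_0 and count the fraction whose score reaches S*. The toy motif has k = 3, so the 4^k = 64 sequences can still be enumerated for comparison. The motif, the uniform background, and the function names are assumptions made for illustration.

```python
import itertools
import math
import random

ALPHABET = "ACGT"
BACKGROUND = {c: 0.25 for c in ALPHABET}  # assumed uniform P0
PSSM = [  # hypothetical 3-position motif
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]


def score(seq):
    """Log-odds score of seq under PSSM versus BACKGROUND."""
    return sum(math.log(PSSM[i][c] / BACKGROUND[c]) for i, c in enumerate(seq))


def sample_background(k, rng):
    """Draw one length-k sequence from the background P0."""
    weights = [BACKGROUND[c] for c in ALPHABET]
    return "".join(rng.choices(ALPHABET, weights=weights, k=k))


def naive_p_value(s_star, n, rng):
    """Fraction of n background samples whose score is at least s_star."""
    k = len(PSSM)
    hits = sum(score(sample_background(k, rng)) >= s_star for _ in range(n))
    return hits / n


def exact_p_value(s_star):
    """Enumerate all 4^k sequences; feasible only for very small k."""
    total = 0.0
    for letters in itertools.product(ALPHABET, repeat=len(PSSM)):
        if score(letters) >= s_star:
            total += math.prod(BACKGROUND[c] for c in letters)
    return total


rng = random.Random(0)
s_star = score("ACG")
print(exact_p_value(s_star), naive_p_value(s_star, 20_000, rng))
```

Enumeration costs 4^k score evaluations, which is why it stops being an option for realistic motif lengths.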
7 p-value Estimation
Problem 2: multiple hypothesis testing requires low p-values (10^-7)
Need ~10^7 attempts to get a sample with p-value < [value lost]
[Figure: score distribution; axes: Score, Prob; threshold S* marked]
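A back-of-the-envelope sketch of why multiple testing pushes the required p-values so low. The corrected level 0.01 is the one quoted earlier in the talk. The number of tested positions (100 000) is an assumption chosen only so that the arithmetic lands on the 10^-7 scale mentioned here.

```python
import math


def bonferroni_threshold(alpha, num_tests):
    """Per-test p-value needed so the corrected level stays at alpha."""
    return alpha / num_tests


def expected_naive_samples(p):
    """Expected number of background draws until one falls in a tail of mass p."""
    return 1.0 / p


alpha = 0.01          # Bonferroni-corrected level
num_tests = 100_000   # assumed number of candidate positions (illustrative)

per_test = bonferroni_threshold(alpha, num_tests)
print(per_test, expected_naive_samples(per_test))
```

Under these assumptions, a single hit in the tail needs on the order of 10^7 background draws, and a stable estimate needs many such hits.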
8 Importance Sampling Approach
1. Cheat: sample from Q(s_1 … s_k) to get high-scoring samples
2. Get absolution: weight each sample
Empirical p-value estimated with N ~ 10^4 samples
[Figure: score distribution; axes: Score, Prob; threshold S* marked]
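The two steps can be sketched as follows: sample from a proposal Q that favors high scores, then weight each sample by P_0(x)/Q(x). In this sketch Q is simply the toy motif itself. That choice, the uniform background, and the function names are assumptions for illustration, not the talk's sampling distribution.

```python
import math
import random

ALPHABET = "ACGT"
BACKGROUND = {c: 0.25 for c in ALPHABET}  # assumed uniform P0
PSSM = [  # hypothetical 3-position motif, also used as the proposal Q
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]


def score(seq):
    """Log-odds score of seq under PSSM versus BACKGROUND."""
    return sum(math.log(PSSM[i][c] / BACKGROUND[c]) for i, c in enumerate(seq))


def sample_from(position_dists, rng):
    """Draw one sequence from a position-independent distribution Q."""
    return "".join(
        rng.choices(ALPHABET, weights=[d[c] for c in ALPHABET], k=1)[0]
        for d in position_dists
    )


def importance_weight(seq, position_dists):
    """P0(x) / Q(x) for a position-independent Q."""
    return math.prod(
        BACKGROUND[c] / position_dists[i][c] for i, c in enumerate(seq)
    )


def importance_p_value(s_star, position_dists, n, rng):
    """Average of weight * indicator(score >= s_star) over n draws from Q."""
    total = 0.0
    for _ in range(n):
        x = sample_from(position_dists, rng)
        if score(x) >= s_star:
            total += importance_weight(x, position_dists)
    return total / n


rng = random.Random(0)
print(importance_p_value(score("ACG"), PSSM, 10_000, rng))
```

For this toy motif the exact tail probability at the top score is 1/64, since only one of the 64 sequences reaches it.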
9 Why is this allowed?
Importance sampling, with x = subsequence:
- Desired estimate: expectation of log-odds
- Sample from P_0(x) and count
- Multiply and divide by Q(x)
- Sample from Q(x) and reweight
How to choose Q?
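The steps named on this slide match the standard importance sampling identity, sketched below. The symbol f(x) is an assumption introduced here for illustration: it stands for the indicator that the log-odds score of x is at least S*. The sketch also assumes Q(x) > 0 wherever P_0(x) f(x) > 0.

```latex
% Desired estimate, and the naive "sample from P_0 and count" estimator
\[
  \text{p-value}(S^*) \;=\; E_{P_0}[f(x)] \;=\; \sum_x P_0(x)\, f(x)
  \;\approx\; \frac{1}{N} \sum_{i=1}^{N} f(x_i), \qquad x_i \sim P_0 .
\]
% Multiply and divide by Q(x)
\[
  \sum_x P_0(x)\, f(x) \;=\; \sum_x Q(x)\, \frac{P_0(x)}{Q(x)}\, f(x)
  \;=\; E_{Q}\!\left[ \frac{P_0(x)}{Q(x)}\, f(x) \right].
\]
% Sample from Q(x) and reweight
\[
  \text{p-value}(S^*) \;\approx\; \frac{1}{N} \sum_{i=1}^{N}
  \frac{P_0(x_i)}{Q(x_i)}\, f(x_i), \qquad x_i \sim Q .
\]
```

Both estimators are unbiased for the same quantity. They differ in variance, which is why the choice of Q matters.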
10 Choosing Sampling Distribution
[Figure: score densities of sampling distributions, from Q_1 = Background through Q_5 to Q_10 = Motif; axes: Score, Density; an under-sampled region is marked]
11 Choosing Sampling Distribution
- Rescale
- Combine
Comprehensive coverage
[Figure: combined sampling distribution; axes: Score, Density; mixing ratio indicated]
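A hedged sketch of combining several sampling distributions into one mixture proposal that covers the whole score range. How the intermediate distributions between background and motif are built is not recoverable from the slide text. Here they are assumed to be per-position geometric interpolations, combined with equal mixing ratios. This illustrates mixture importance sampling in general and does not reproduce the talk's rescaling step or its mixing ratios.

```python
import math
import random

ALPHABET = "ACGT"
BACKGROUND = {c: 0.25 for c in ALPHABET}  # assumed uniform P0
PSSM = [  # hypothetical 3-position motif
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]


def score(seq):
    """Log-odds score of seq under PSSM versus BACKGROUND."""
    return sum(math.log(PSSM[i][c] / BACKGROUND[c]) for i, c in enumerate(seq))


def interpolated_dists(lam):
    """Assumed family: Q[i][c] proportional to P0[c]^(1-lam) * p[i,c]^lam."""
    dists = []
    for col in PSSM:
        raw = {c: BACKGROUND[c] ** (1 - lam) * col[c] ** lam for c in ALPHABET}
        z = sum(raw.values())
        dists.append({c: raw[c] / z for c in ALPHABET})
    return dists


def prob(seq, dists):
    """Probability of seq under a position-independent distribution."""
    return math.prod(dists[i][c] for i, c in enumerate(seq))


def sample_from(dists, rng):
    """Draw one sequence from a position-independent distribution."""
    return "".join(
        rng.choices(ALPHABET, weights=[d[c] for c in ALPHABET], k=1)[0]
        for d in dists
    )


def mixture_p_value(s_star, components, mix, n, rng):
    """Importance sampling where the proposal is the mixture sum_j mix[j] * Q_j."""
    total = 0.0
    for _ in range(n):
        j = rng.choices(range(len(components)), weights=mix, k=1)[0]
        x = sample_from(components[j], rng)
        if score(x) >= s_star:
            q = sum(a * prob(x, d) for a, d in zip(mix, components))
            total += math.prod(BACKGROUND[c] for c in x) / q
    return total / n


# Ten components: the first equals the background, the last equals the motif.
components = [interpolated_dists(j / 9) for j in range(10)]
mix = [0.1] * 10  # assumed equal mixing ratios

rng = random.Random(0)
print(mixture_p_value(score("ACG"), components, mix, 10_000, rng))
```

Each weight divides by the full mixture density, not by the density of the component that produced the sample. This keeps the weights bounded wherever at least one component puts reasonable mass.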
12 PSSM Example
[Figure: p-value vs. score; curves: Naive, MAST (Bailey et al. 98), Normal, CIS; legend sample counts: ( ), (40 000)]
What if we want something else?
13 Beyond PSSM Models
- Main point: capture dependencies between binding site positions and improve site predictions
- Dependency models: many possible variants, such as trees, mixtures of PSSMs, mixtures of trees, etc.
  Tree example: [Figure: tree over X1, X2, X3, X4, X5]
- Suggested by several recent papers: Barash et al. (2003), King & Roth (2003), Zhou & Liu (2004), …
Challenge: compute p-values for general models
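A minimal sketch of a tree-structured dependency model over five positions, scored as log-odds against a uniform background. The tree shape (`PARENT`), the root distribution, and the conditional table are all hypothetical. The tree in the slide's figure is not recoverable from the text.

```python
import itertools
import math

ALPHABET = "ACGT"
BACKGROUND = {c: 0.25 for c in ALPHABET}  # assumed uniform P0

# Hypothetical tree over X1..X5: PARENT[i] is the parent of position i.
PARENT = [None, 0, 0, 1, 1]
ROOT = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}  # hypothetical root distribution


def conditional(child_letter, parent_letter):
    """Hypothetical table: a child repeats its parent's letter with prob 0.7."""
    return 0.7 if child_letter == parent_letter else 0.1


def tree_log_prob(seq):
    """log P(x) = log P(x_root) + sum over i of log P(x_i | x_parent(i))."""
    total = 0.0
    for i, c in enumerate(seq):
        if PARENT[i] is None:
            total += math.log(ROOT[c])
        else:
            total += math.log(conditional(c, seq[PARENT[i]]))
    return total


def tree_log_odds(seq):
    """Log-odds of the tree model against the background."""
    return tree_log_prob(seq) - sum(math.log(BACKGROUND[c]) for c in seq)


# Sanity check: the model's probabilities over all 4^5 sequences sum to 1.
mass = sum(
    math.exp(tree_log_prob(s)) for s in itertools.product(ALPHABET, repeat=5)
)
print(mass, tree_log_odds("AAAAA"), tree_log_odds("ACGTA"))
```

Unlike a PSSM score, this score does not split into independent per-position terms, because each term depends on the letter at the parent position.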
14 Tree Model Example
- Naïve sampling: not efficient
- MAST (Bailey et al. 98): not applicable
- Normal approximation: not accurate
[Figure: p-value vs. score; curves: Naive, Normal, CIS; legend sample counts: ( ), (40 000)]
15 Decreased Estimator Variability
[Figure: p-value vs. score for 10 repeats of sampling; curves: Naive, Normal, CIS; legend annotations: (10x), (10x)]
16 CIS - Summary
- General form: wide range of probabilistic models
- Computationally efficient
- Handles low p-values accurately
- Available online, at:
17 Thank you
Joint work with: Nir Friedman, Gal Elidan, Tommy Kaplan