A Simple Hyper Geometric Approach for Discovering Putative Transcription Factor Binding Sites. Yoseph Barash, Gill Bejerano, Nir Friedman. Hebrew University.

1 A Simple Hyper Geometric Approach for Discovering Putative Transcription Factor Binding Sites
Yoseph Barash, Gill Bejerano, Nir Friedman
Hebrew University, Jerusalem

2 Transcription Factors Rule Enhance/repress/initiate mRNA expression *Essential Cell Biology; p.268

3 The “Biological Hypothesis”
Co-expression (experiments, genes) → co-regulation → binding sites within promoters (genes)
Input: a set of upstream regions
Output: common & unique binding sites (putative)

4 Our Approach: Highlights
Diverse literature: Bailey & Elkan, 94; Buhler & Tompa, 01; Bussemaker et al, 00; Lawrence et al, 93; Pevzner & Sze, 00; Tavazoie et al, 99; van Helden et al, 00; Vilo et al, 00, …
Our approach:
- Expressive binding site model
- Systematic exploration of search space
- Statistical significance evaluation
- Integrate biological knowledge
- Computationally efficient

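5 Basic Motifs
A basic motif is a ball B(l, ) around a subsequence of length l.
[Figure: balls of radius 1 and radius 2 around the subsequence ACT (l=3). Radius 1 adds single-mismatch neighbours such as AAT, GCT, ACC, CCT, AGT, ACA; radius 2 adds two-mismatch neighbours such as ATA, ATC, AGA, TGT, TAT, TTT, AAA.]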
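
The ball construction on this slide can be sketched in a few lines. This is a minimal illustration that assumes the distance is the Hamming distance (number of mismatches) over the alphabet ACGT, which matches the neighbours listed in the figure; the function name `hamming_ball` is hypothetical and not taken from the authors' software.

```python
from itertools import combinations, product

ALPHABET = "ACGT"  # assumption: plain DNA alphabet


def hamming_ball(center, radius):
    """Return the set of strings within `radius` mismatches of `center`."""
    ball = {center}
    for r in range(1, radius + 1):
        # Choose which r positions to change ...
        for positions in combinations(range(len(center)), r):
            # ... and every way of replacing each one with a different letter.
            choices = [[c for c in ALPHABET if c != center[p]] for p in positions]
            for letters in product(*choices):
                s = list(center)
                for p, c in zip(positions, letters):
                    s[p] = c
                ball.add("".join(s))
    return ball


ball1 = hamming_ball("ACT", 1)  # the center plus 9 single-mismatch neighbours
ball2 = hamming_ball("ACT", 2)  # additionally 27 two-mismatch neighbours
print(len(ball1), len(ball2))
```

A gene would then count as containing the motif if any l-mer in its promoter falls inside the ball.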

6 Motif Generalizations
- Richer alphabets, e.g. SCTNNNGTAAR, WATNNNGTCAR
- General distance function
- Random projections (following [Buhler & Tompa, 01]): project l-mers using k < l positions, e.g. SCTATGAGTAR and SCAATGATCAR both project to SC*A*GA**AR
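
The projection idea can be illustrated with a short sketch. The helper names `project` and `bucket_by_projection` are hypothetical, and the bucketing step is a simplified reading of the random-projection approach, not the authors' implementation: l-mers that agree on the k sampled positions land in the same bucket and become candidate motif seeds.

```python
import random
from collections import defaultdict


def project(lmer, positions):
    """Keep the letters at `positions`; mask every other position with '*'."""
    keep = set(positions)
    return "".join(c if i in keep else "*" for i, c in enumerate(lmer))


def bucket_by_projection(lmers, k, seed=0):
    """Group equal-length l-mers that agree on k randomly chosen positions."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(len(lmers[0])), k))
    buckets = defaultdict(list)
    for s in lmers:
        buckets[project(s, positions)].append(s)
    return positions, dict(buckets)


# The slide's example: the two 11-mers agree on these 7 positions (0-based).
slide_positions = [0, 1, 3, 5, 6, 9, 10]
a = project("SCTATGAGTAR", slide_positions)
b = project("SCAATGATCAR", slide_positions)
print(a, b)

chosen, buckets = bucket_by_projection(["SCTATGAGTAR", "SCAATGATCAR"], k=7)
```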

7 Statistical Significance
We found a motif. Is it significant?
- When using a large space of motifs, we expect some artifacts
Two types of null hypothesis:
- Generative: the probability of generating a set of promoters containing this motif
- Discriminative: the probability of selecting a set of genes that contain this motif

8 Discriminative P-value
- Start with the promoter regions of all genes
- Mark genes that contain the motif
P-value:
- The probability of selecting at least as many marked genes if we select n genes at random
[Figure: promoter sequences of all genes, with the selected genes highlighted]
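
Written out, this p-value is the upper tail of a hypergeometric distribution. The formula below uses the symbols from the deck's later "Statistical Significance?" slide (M genes in total, K of them containing the motif, n genes selected, x of the selected genes containing the motif); the random variable X, the number of marked genes in a random selection, is notation introduced here for illustration.

```latex
\[
  p\text{-value} \;=\; \Pr(X \ge x)
  \;=\; \sum_{i=x}^{\min(n,K)} \frac{\binom{K}{i}\,\binom{M-K}{\,n-i\,}}{\binom{M}{n}}
\]
```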

10 Statistical Significance Evaluation
What if we select n genes V times?
- False Discovery Rate (FDR) (Benjamini & Hochberg, 95): the expected ratio of false motifs identified from the whole set of motifs identified is no more than …
- Bonferroni p-value limit (union bound) for a false positive rate: Motif is significant if p-value(Motif) …
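
The thresholds themselves did not survive in this transcript, so the sketch below uses the textbook forms of both criteria rather than the slide's exact notation. The symbols `alpha` (false positive rate) and `q` (FDR level), the function names, and the example p-values are all assumptions made for illustration. Bonferroni accepts a motif when its p-value is at most `alpha` divided by the number of tests; the Benjamini-Hochberg step-up procedure sorts the p-values and accepts the k smallest, where k is the largest rank with p_(k) <= k * q / m.

```python
def bonferroni_significant(p_values, alpha):
    """Flag p-values that pass the union-bound threshold alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_significant(p_values, q):
    """Flag p-values accepted by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * q / m.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            cutoff = rank
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            significant[i] = True
    return significant


# Hypothetical p-values for ten candidate motifs.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bonf = bonferroni_significant(pvals, alpha=0.05)
bh = benjamini_hochberg_significant(pvals, q=0.05)
print(sum(bonf), sum(bh))
```

On these made-up numbers the FDR criterion accepts one more motif than Bonferroni, which reflects the usual trade-off: FDR is less conservative when many motifs are tested.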

11 FDR Example
[Figure: axis labels "P-value index" and "probability"]

12 Algorithm Outline
- Define space of motifs: alphabet, distance function, motif sets
- Evaluate all motifs: using the hyper-geometric null model
- Choose significant motifs: using Bonferroni or FDR criteria

13 From Discrete Motif to PSSM
Position Specific Score Matrix: for a motif of length L, define P_i(A,C,G,T) for i = 1, …, S, where S >= L
Two aims:
- Refine the motif
- Extend the seed to flanking regions

14 Learning a PSSM
Initialize:
- Find all subsequences of length S that contain the motif
- Align them and compute probabilities
Iterative EM-like procedure:
- Score each position in each gene using the PSSM
- Use the score to refine the PSSM representation
- Iterate
- Remove non-informative flanking regions
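
The steps above can be sketched as follows. The slide does not specify the scoring function or the update rule, so this sketch makes several labeled assumptions: a log-odds score against a uniform background, add-one pseudocounts, and a hard assignment in which each promoter contributes only its best-scoring window when that window passes a threshold. All function names are hypothetical, and trimming of non-informative flanking columns is omitted.

```python
import math

ALPHABET = "ACGT"  # assumption: sequences contain only these letters


def build_pssm(aligned, pseudocount=1.0):
    """Column-wise letter probabilities from equal-length aligned sequences."""
    pssm = []
    for i in range(len(aligned[0])):
        counts = {c: pseudocount for c in ALPHABET}
        for seq in aligned:
            counts[seq[i]] += 1
        total = sum(counts.values())
        pssm.append({c: counts[c] / total for c in ALPHABET})
    return pssm


def log_odds(window, pssm, background=0.25):
    """Log2-odds score of one window against a uniform background (assumed)."""
    return sum(math.log2(pssm[i][c] / background) for i, c in enumerate(window))


def best_window(sequence, pssm):
    """Return (start, score) of the highest-scoring window in `sequence`."""
    size = len(pssm)
    scored = [(log_odds(sequence[j:j + size], pssm), j)
              for j in range(len(sequence) - size + 1)]
    score, start = max(scored)
    return start, score


def refine(promoters, pssm, threshold, iterations=5):
    """EM-like loop: score promoters, rebuild the PSSM from the best hits."""
    for _ in range(iterations):
        hits = []
        for seq in promoters:
            start, score = best_window(seq, pssm)
            if score >= threshold:
                hits.append(seq[start:start + len(pssm)])
        if not hits:
            break
        pssm = build_pssm(hits)
    return pssm


seed_hits = ["ACGCGT", "ACGCGT", "ACGCGA"]          # hypothetical aligned seeds
initial = build_pssm(seed_hits)
promoters = ["TTTACGCGTTTT", "GGACGCGTGGGG", "CCCCCCCCCCCC"]  # toy data
refined = refine(promoters, initial, threshold=0.0)
```

A soft (probability-weighted) update would be closer to true EM; the hard assignment is used here only to keep the sketch short.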

15 Algorithm Outline (revisited)
- Define space of motifs: alphabet, distance function, motif sets
- Evaluate all motifs: using the hyper-geometric null model
- Choose significant motifs: using Bonferroni or FDR criteria
- Refine motifs into PSSM: iterative EM-like procedure

16 Yeast Results
Major differences:
- Background discrimination
- Running time (~1 hour vs. ~1 week)
Table columns: MEME < 8 (E-value, Rank); PSSM (p-value, Rank); Seed (p-value, Rank); Cons.; TF; Cluster
Spellman et al.
1e-18 1 3e-42 1 4e-26 1 ACGCGTACGCGT MBFCLN2 *
8e-00 1 1e-12 1 1 CCAGCA SWI5 p SIC1
Tavazoie et al.
1e+06 4 6e-09 5 9e-07 5 GATGAGGATGAG Putative 3
8e+07 23 1e-11 2 4e-07 2 GAAAAatT Putative
1e+08 20 4e-06 3 6e-07 3 aAGGgG STRE8
5e+02 2 2e-11 1 1 gCCACAgT MET3130
Iyer et al.
1e+04 3 3e-18 1 1e-12 1 ACGCGTACGCGT MBF
1e-17 2 1e-37 1 1e-32 1 CGCGAAA SBFSBF **

17 Sample Yeast PSSMs
[Figure: PSSMs for SBF and CLN2]

18 Human Results*: TGF-β
Measuring response to TGF-β in three distinct cell types in lungs
Each cell type/condition repeated at least 4 times
[Figure: expression data for BSMC, NHBE and NHLF cells, each under PBS and TGF-β]
* Research collaboration with:
- Naftali Kaminski, Sheba Medical Center
- Jane Lee, Dean Sheppard, UCSF
- Tommy Kaplan, Hebrew University

19 Human Results (2)
BSMC-down 12:6 1000 ups
Discrete motif search (random projections)
Learn PSSM
Motif suggests LyF-1 binding site: TTTGGGAGR

20 Summary
- Efficiency & modularity
- Systematic coverage of event space
- Extension of initial seeds to PSSM
- Good background discrimination
- Enables usage of biological prior knowledge
- Promising preliminary results on biological data sets
⇒ Fast evaluation tool for possible co-regulated genes

21 Updating θ: Example
θ* = 18.3
[Figure: genes ranked by score: G1 19.5*, G7 19.1, G5 19.1*, G118.5*, G4 18.3*, G2 17.2, G9 16.8, G3 15.2, G8 15.1*, G10 14.8]

22 Statistical Significance?
Two basic approaches for background modeling:
- Random generation
- Random selection
We found a cluster of n genes, x of which contain a motif. The genome has M genes, K of which contain the motif. How significant is this?
This is the same as choosing n balls (x of them red) from a set of M balls with K red ones.

23 Simple Example
ACGTGTATTAAGTGACTCCTGATTGAG
ACGAGTGACTTGCATATCCTGATTGAG
ACGTGTATTAGATGAAAATTATCCCCC
ACGTGTATTAAATTTCCGGTGTAGTAG
AGTGTATTAAACCCCTCTAGTGTTGAG
Here: n = 5, x = 4, M = 100, K = 10
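
For these numbers the significance can be computed directly as a hypergeometric tail, following the ball-drawing analogy from the previous slide. This is a minimal sketch using only the standard library; the function name `hypergeom_tail` is hypothetical.

```python
from math import comb


def hypergeom_tail(M, K, n, x):
    """P(X >= x) when n of M genes are drawn at random and K contain the motif."""
    total = comb(M, n)
    # Sum over every possible count i >= x of motif-containing genes in the draw.
    favorable = sum(comb(K, i) * comb(M - K, n - i)
                    for i in range(x, min(n, K) + 1))
    return favorable / total


p = hypergeom_tail(M=100, K=10, n=5, x=4)
print(f"{p:.3e}")
```

With the slide's values the tail works out to 19152 / 75287520, roughly 2.5e-4, so seeing 4 of 5 selected genes carry a motif that only 10 of 100 genes contain would be quite unlikely under random selection.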

24 Statistical Significance Evaluation
What if we select n genes V times?
- Bonferroni p-value limit (union bound) for a false positive rate: Motif is significant if p-value(Motif) …
- False Discovery Rate (FDR) (Benjamini & Hochberg, 95): the expected ratio of false motifs identified from the whole set of motifs identified is no more than …

