Published by William Hubbard. Modified over 9 years ago.
1
A Simple Hyper-Geometric Approach for Discovering Putative Transcription Factor Binding Sites
Yoseph Barash, Gill Bejerano, Nir Friedman
Hebrew University, Jerusalem
2
Transcription Factors Rule
Enhance/repress/initiate mRNA expression
* Essential Cell Biology; p. 268
3
The “Biological Hypothesis”
[Figure labels: Co-Expression, Experiments, Genes, Co-Regulation, Binding Sites within Promoters]
Input: a set of upstream regions
Output: common & unique binding sites (putative)
4
Our Approach: Highlights
Diverse Literature: Bailey & Elkan, 94; Buhler & Tompa, 01; Bussemaker et al, 00; Lawrence et al, 93; Pevzner & Sze, 00; Tavazoie et al, 99; van Helden et al, 00; Vilo et al, 00, …
Our Approach:
• Expressive binding site model
• Systematic exploration of search space
• Statistical significance evaluation
• Integrate biological knowledge
• Computationally efficient
5
Basic Motifs
[Figure: a subsequence of length l=3 (ACT) and the balls B(l, ·) around it at Hamming distance 1 and 2, containing neighbours such as AAT, GCT, ACC, CCT, AGT, ACA, TGT, TAT, TTT]
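The ball around a basic motif can be enumerated directly. The following is a minimal sketch, assuming the distance is the Hamming distance (number of mismatched positions) over the alphabet A, C, G, T; the function name `hamming_ball` is illustrative and not taken from the slides.

```python
from itertools import combinations, product

ALPHABET = "ACGT"

def hamming_ball(center: str, radius: int) -> set:
    """Return every string over ALPHABET within `radius` mismatches of `center`."""
    ball = {center}
    for r in range(1, radius + 1):
        # Choose which r positions may change
        for positions in combinations(range(len(center)), r):
            # Try every letter at those positions; re-using the original
            # letter only reproduces a string at a smaller distance
            for letters in product(ALPHABET, repeat=r):
                chars = list(center)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                ball.add("".join(chars))
    return ball

ball_1 = hamming_ball("ACT", 1)
ball_2 = hamming_ball("ACT", 2)
print(len(ball_1), len(ball_2))
```

For l=3 the ball grows from 10 strings at distance 1 to 37 at distance 2, out of 64 possible 3-mers, which illustrates why the radius must stay small relative to l for a motif to remain specific.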
6
Motif Generalizations
• Richer alphabets: SCTNNNGTAAR, WATNNNGTCAR
• General distance function
• Random projections (following [Buhler & Tompa, 01]): project l-mers using k < l positions
  SCTATGAGTAR
  SCAATGATCAR
  SC*A*GA**AR
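Both generalizations are easy to illustrate. The sketch below assumes the richer alphabet follows the standard IUPAC nucleotide codes (S = C or G, W = A or T, R = A or G, N = any base), and it shows a projection as masking every position outside a chosen set of k positions. The helper names and the particular positions are illustrative assumptions, chosen so that the slide's two example strings fall into the same bucket.

```python
import re
from collections import defaultdict

# Subset of the standard IUPAC nucleotide codes that appears on the slide
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "S": "[CG]", "W": "[AT]", "R": "[AG]", "N": "[ACGT]"}

def iupac_to_regex(motif):
    """Compile a motif over the richer alphabet into a regular expression."""
    return re.compile("".join(IUPAC[ch] for ch in motif))

def project(lmer, positions):
    """Keep the characters at `positions` and mask all others with '*'."""
    keep = set(positions)
    return "".join(ch if i in keep else "*" for i, ch in enumerate(lmer))

def bucket_by_projection(lmers, positions):
    """Group l-mers that agree on the k projected positions."""
    buckets = defaultdict(list)
    for lmer in lmers:
        buckets[project(lmer, positions)].append(lmer)
    return dict(buckets)

positions = [0, 1, 3, 5, 6, 9, 10]  # k = 7 of l = 11 positions (0-based)
buckets = bucket_by_projection(["SCTATGAGTAR", "SCAATGATCAR"], positions)
print(buckets)
```

In a random-projection search the positions would be drawn at random and the bucketing repeated several times; buckets that collect many l-mers are then candidate seeds.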
7
Statistical Significance
We found a motif… is it significant?
• When using a large space of motifs, we expect some artifacts
Two types of null hypothesis:
• Generative: probability of generating a set of promoters containing this motif
• Discriminative: probability of selecting a set of genes that contain this motif
8
Discriminative P-value
• Start with the promoter regions of all genes
• Mark genes that contain the motif
P-value:
• Probability of selecting at least as many marked genes when n genes are selected at random
[Figure labels: Promoter sequences, Selected genes]
10
Statistical Significance Evaluation
What if we select n genes V times?
False Discovery Rate (FDR) (Benjamini & Hochberg, 95): the expected ratio of false motifs among all motifs identified is no more than a chosen bound.
Bonferroni p-value limit (union bound) for a false positive rate: a motif is significant if p-value(Motif) falls below the corrected limit.
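The two corrections can be sketched as follows. The exact thresholds are not present in the slide text, so this assumes the standard forms: Bonferroni compares each p-value with alpha / V, and the Benjamini-Hochberg step-up procedure finds the largest rank i whose sorted p-value is at most i * q / V. The names `alpha`, `q`, and both function names are assumptions for illustration.

```python
def bonferroni_significant(p_values, alpha):
    """Union bound: flag p-values at or below alpha / V, where V = number of tests."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg(p_values, q):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level q."""
    v = len(p_values)
    order = sorted(range(v), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Remember the largest rank whose p-value passes its own threshold
        if p_values[idx] <= rank * q / v:
            cutoff_rank = rank
    significant = [False] * v
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant

# Toy p-values (made up for illustration, not results from the talk)
p_values = [0.041, 0.001, 0.205, 0.008, 0.039, 0.042, 0.06, 0.074, 0.212, 0.216]
print(bonferroni_significant(p_values, alpha=0.05))
print(benjamini_hochberg(p_values, q=0.05))
```

On these toy values Bonferroni keeps only the smallest p-value while the FDR criterion also keeps the second smallest, which shows the usual trade-off: FDR control is less conservative when many motifs are tested.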
11
FDR Example
[Plot labels: P-value, index, probability]
12
Algorithm Outline
• Define Space of Motifs: alphabet, distance function, motif sets
• Evaluate All Motifs: using hyper-geometric null model
• Choose Significant Motifs: using Bonferroni or FDR criteria
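The three steps can be strung together in a few lines. This is a simplified sketch under stated assumptions, not the authors' implementation: the motif space is restricted to exact l-mers that occur in the selected promoters, every candidate is scored with a hypergeometric tail probability, and significance is decided with a Bonferroni threshold of alpha divided by the number of candidates. The function names and the toy data are hypothetical.

```python
from math import comb

def hypergeom_tail(M, K, n, x):
    """P(at least x marked genes among n drawn from M genes, K of them marked)."""
    upper = min(n, K)
    return sum(comb(K, i) * comb(M - K, n - i) for i in range(x, upper + 1)) / comb(M, n)

def find_significant_lmers(all_promoters, selected_ids, l, alpha=0.05):
    """all_promoters: dict gene id -> sequence; selected_ids: set of gene ids."""
    # Step 1: define the motif space (here: exact l-mers seen in the selected set)
    candidates = set()
    for gene in selected_ids:
        seq = all_promoters[gene]
        for i in range(len(seq) - l + 1):
            candidates.add(seq[i:i + l])
    if not candidates:
        return []
    M, n = len(all_promoters), len(selected_ids)
    # Step 2: evaluate every motif under the hypergeometric null model
    scored = []
    for motif in candidates:
        marked = {g for g, seq in all_promoters.items() if motif in seq}
        K, x = len(marked), len(marked & selected_ids)
        scored.append((hypergeom_tail(M, K, n, x), motif))
    # Step 3: keep motifs that pass the Bonferroni-corrected threshold
    threshold = alpha / len(candidates)
    return sorted((p, m) for p, m in scored if p <= threshold)

# Toy genome: 20 promoters, 4 of which (the selected cluster) carry ACGCGT
promoters = {f"g{i}": "TTTTTTTTTT" for i in range(20)}
for i in range(4):
    promoters[f"g{i}"] = "TTACGCGTTT"
hits = find_significant_lmers(promoters, {"g0", "g1", "g2", "g3"}, l=6)
print(hits)
```

A fuller version would replace the exact l-mer space with balls, richer alphabets, or projections, and could swap the Bonferroni step for an FDR criterion.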
13
From Discrete Motif to PSSM
Position Specific Score Matrix: for a motif of length L, define P_i(A,C,G,T) for i = 1, …, S, where S >= L
2 Aims:
• Refine motif
• Extend seed to flanking regions
14
Learning a PSSM
Initialize:
• Find all subsequences of length S that contain the motif
• Align them & compute probabilities
Iterative EM-like procedure:
• Score each position in each gene using the PSSM
• Use the score to refine the PSSM representation
• Iterate
• Remove non-informative flanking regions
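The initialization and refinement loop can be sketched as below. Several choices here are assumptions made for illustration rather than details from the talk: pseudocounts of 1, a uniform background of 0.25, a log-odds window score, and a hard assignment that keeps only the best-scoring window per promoter in each round. Trimming non-informative flanking columns is not shown.

```python
from math import log

BASES = "ACGT"

def build_pssm(aligned, pseudocount=1.0):
    """Column-wise base probabilities from equal-length aligned subsequences."""
    pssm = []
    for i in range(len(aligned[0])):
        counts = {b: pseudocount for b in BASES}
        for s in aligned:
            counts[s[i]] += 1
        total = sum(counts.values())
        pssm.append({b: counts[b] / total for b in BASES})
    return pssm

def log_odds(window, pssm, background=0.25):
    """Log-odds score of one window against a uniform background."""
    return sum(log(col[ch] / background) for col, ch in zip(pssm, window))

def best_window(seq, pssm):
    """Return (start, score) of the highest-scoring window in seq."""
    width = len(pssm)
    scores = [(log_odds(seq[i:i + width], pssm), i)
              for i in range(len(seq) - width + 1)]
    score, start = max(scores)
    return start, score

def refine(promoters, pssm, threshold=0.0, iterations=5):
    """EM-like loop: score promoters, rebuild the PSSM from the best hits, repeat."""
    for _ in range(iterations):
        hits = []
        for seq in promoters:
            if len(seq) < len(pssm):
                continue  # too short to contain a full window
            start, score = best_window(seq, pssm)
            if score > threshold:
                hits.append(seq[start:start + len(pssm)])
        if not hits:
            break
        pssm = build_pssm(hits)
    return pssm

seed_pssm = build_pssm(["ACGT", "ACGT", "ACGA"])
refined = refine(["TTACGTTT", "GGACGTGG", "CCACGACC"], seed_pssm)
consensus = "".join(max(col, key=col.get) for col in refined)
print(consensus)
```

A soft version would weight every window by its score instead of keeping only the best one, which is closer to a true EM update.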
15
Algorithm Outline (revisited)
• Define Space of Motifs: alphabet, distance function, motif sets
• Evaluate All Motifs: using hyper-geometric null model
• Choose Significant Motifs: using Bonferroni or FDR criteria
• Refine motifs into PSSM: iterative EM-like procedure
16
Yeast Results
Major differences:
• Background discrimination
• Running time (~1 hour vs. ~1 week)

Source | MEME < 8: E-value | Rank | PSSM: p-value | Rank | Seed: p-value | Rank | Cons. / TF / Cluster
Spellman et al. | 1e-18 | 1 | 3e-42 | 1 | 4e-26 | 1 | ACGCGTACGCGT MBFCLN2 *
 | 8e-00 | 1 | 1e-12 | 1 | | 1 | CCAGCA SWI5 p SIC1
Tavazoie et al. | 1e+06 | 4 | 6e-09 | 5 | 9e-07 | 5 | GATGAGGATGAG Putative 3
 | 8e+07 | 23 | 1e-11 | 2 | 4e-07 | 2 | GAAAAatT Putative
 | 1e+08 | 20 | 4e-06 | 3 | 6e-07 | 3 | aAGGgG STRE8
 | 5e+02 | 2 | 2e-11 | 1 | | 1 | gCCACAgT MET3130
Iyer et al. | 1e+04 | 3 | 3e-18 | 1 | 1e-12 | 1 | ACGCGTACGCGT MBF
 | 1e-17 | 2 | 1e-37 | 1 | 1e-32 | 1 | CGCGAAA SBFSBF **
17
Sample Yeast PSSMs
[Figure labels: SBF, CLN2]
18
Human Results* - TGF
[Figure labels: BSMC, NHBE, NHLF; each with PBS and TGF conditions]
Measuring response to TGF in three distinct cell types in lungs
Each cell type/condition repeated at least 4 times
* Research collaboration with
• Naftali Kaminski, Sheba Medical Center
• Jane Lee, Dean Sheppard, UCSF
• Tommy Kaplan, Hebrew University
19
Human Results (2)
BSMC-down 12:6 1000 ups
Discrete Motif Search: (Random Projections)
Learn PSSM:
Motif suggests LyF-1 binding site: TTTGGGAGR
20
Summary
• Efficiency & Modularity
• Systematic coverage of event space
• Extension of initial seeds to PSSM
• Good background discrimination
• Enables usage of biological prior knowledge
• Promising preliminary results on biological data sets
• Fast evaluation tool for possible co-regulated genes
21
Updating - Example
[Figure: gene positions ranked by Score; marked value = 18.3]
Gene | Score
G1 | 19.5*
G7 | 19.1
G5 | 19.1*
G1 | 18.5*
G4 | 18.3*
G2 | 17.2
G9 | 16.8
G3 | 15.2
G8 | 15.1*
G10 | 14.8
22
Statistical Significance?
Two basic approaches for background modeling:
• Random Generation
• Random Selection
We found a cluster of n genes, x of them contain a motif. The genome has M genes, K of them contain the motif. How significant is this?
This is the same as choosing n balls (x of them red) from a set of M balls with K red ones.
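The balls analogy corresponds to the hypergeometric distribution. Written out with the slide's symbols (M genes in the genome, K containing the motif, n genes selected, x of them containing the motif), and assuming the p-value is the upper tail, that is, the probability of drawing at least x marked genes:

```latex
P(X \ge x) \;=\; \sum_{i=x}^{\min(n,K)} \frac{\binom{K}{i}\,\binom{M-K}{n-i}}{\binom{M}{n}}
```

Each term counts the ways to pick i marked genes out of K and n - i unmarked genes out of M - K, divided by the number of ways to pick any n genes out of M.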
23
Simple Example
ACGTGTATTAAGTGACTCCTGATTGAG
ACGAGTGACTTGCATATCCTGATTGAG
ACGTGTATTAGATGAAAATTATCCCCC
ACGTGTATTAAATTTCCGGTGTAGTAG
AGTGTATTAAACCCCTCTAGTGTTGAG
Here: n = 5, x = 4, M = 100, K = 10
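The numbers in this example can be checked with a short script. The slide text does not say which motif is counted, so the sketch assumes GTGTATTA, which occurs in four of the five sequences and therefore reproduces x = 4; M = 100 and K = 10 are taken from the slide as given.

```python
from math import comb

cluster = [
    "ACGTGTATTAAGTGACTCCTGATTGAG",
    "ACGAGTGACTTGCATATCCTGATTGAG",
    "ACGTGTATTAGATGAAAATTATCCCCC",
    "ACGTGTATTAAATTTCCGGTGTAGTAG",
    "AGTGTATTAAACCCCTCTAGTGTTGAG",
]
motif = "GTGTATTA"  # assumed motif, chosen because it matches x = 4

n = len(cluster)                          # genes in the cluster
x = sum(motif in seq for seq in cluster)  # cluster genes containing the motif
M, K = 100, 10                            # genome-wide counts from the slide

def hypergeom_tail(M, K, n, x):
    """P(at least x marked genes among n drawn from M genes, K of them marked)."""
    upper = min(n, K)
    return sum(comb(K, i) * comb(M - K, n - i) for i in range(x, upper + 1)) / comb(M, n)

p_value = hypergeom_tail(M, K, n, x)
print(f"n={n}, x={x}, p-value={p_value:.3e}")
```

The tail has two terms, (210 * 90 + 252) / 75287520, which is about 2.5e-4: finding the motif in 4 of 5 cluster genes is unlikely under random selection when only 10 of 100 genes carry it.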
24
Statistical Significance Evaluation
What if we select n genes V times?
Bonferroni p-value limit (union bound) for a false positive rate: a motif is significant if p-value(Motif) falls below the corrected limit.
False Discovery Rate (FDR) (Benjamini & Hochberg, 95): the expected ratio of false motifs among all motifs identified is no more than a chosen bound.