Cis/TF discovery for Arabidopsis Aristotelis Tsirigos NYU Computer Science
2 Outline: Input data; The proposed model; Results on yeast; Results on Arabidopsis; Unsupervised pattern discovery
3 Input data
4 Input: expression matrix of ~23,000 genes × 25 points; 1,500 bp upstream sequence (gctaagc...)
5 Normalization: ~23,000 genes × 25 points; 1,500 bp upstream sequence (gctaagc...); normalize columns (mean=0)
6 Filtering: ~23,000 genes × 25 points; normalize columns (mean=0, stdev=1); filter out low-variance genes, leaving ~5,000 genes × 25 points; 1,500 bp upstream sequence (gctaagc...) converted to a motif bitmap
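The normalization and filtering slides can be sketched in a few lines. This is a minimal illustration, not the author's code: the function names and the `min_variance` threshold are assumptions (the slides give no cutoff), and the population variance of each gene's row is used as the filtering criterion.

```python
import statistics

def normalize_columns(matrix):
    """Z-score each column (one column per measured point): mean 0, stdev 1."""
    n_cols = len(matrix[0])
    means = [statistics.fmean([row[j] for row in matrix]) for j in range(n_cols)]
    stdevs = [statistics.pstdev([row[j] for row in matrix]) for j in range(n_cols)]
    return [
        # A constant column carries no information; map it to zeros.
        [(row[j] - means[j]) / stdevs[j] if stdevs[j] > 0 else 0.0
         for j in range(n_cols)]
        for row in matrix
    ]

def filter_low_variance(matrix, gene_ids, min_variance):
    """Keep genes whose profile variance is at least min_variance.

    min_variance is a hypothetical parameter; the slides only say
    "filter out low-variance".
    """
    kept = [(g, row) for g, row in zip(gene_ids, matrix)
            if statistics.pvariance(row) >= min_variance]
    return [g for g, _ in kept], [row for _, row in kept]
```

Calling `normalize_columns` first and `filter_low_variance` second mirrors the slide order; the threshold would in practice be tuned so that roughly 5,000 of the 23,000 genes survive.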
7 The proposed model
8 Assumption 1: A single TF binds to a single cis element (motif) Source: U.S. Department of Energy Genomics (
9 Assumption 2: TFs regulate genes sharing a motif only under a subset of conditions
11 Assumption 3 The TF expression correlates with the sum of the partially correlating expression patterns
12 Objective For each cis element (motif): – discover groups of co-regulated genes – compute aggregate motif expression. For each TF: – find best correlating motifs
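The last objective, finding the best correlating motifs for a TF, could look like the sketch below. It assumes Pearson correlation as the correlation measure (the slides do not name one) and assumes each motif already has an aggregate expression profile; `pearson` and `rank_motifs` are illustrative names.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)

def rank_motifs(tf_profile, motif_profiles):
    """Return (motif, correlation) pairs, best correlating motif first.

    motif_profiles maps a motif string to its aggregate expression profile.
    """
    scored = [(motif, pearson(tf_profile, profile))
              for motif, profile in motif_profiles.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For a TF profile of 25 points, `rank_motifs(tf_profile, motif_profiles)[0]` would give the rank-1 motif in the sense used by the result tables later in the talk.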
13 The algorithm – step 1: cluster the ~5,000 genes by their 25-point expression profiles
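Step 1 only says "clustering"; the slides do not name the method. As one possible stand-in, the sketch below uses plain k-means with squared Euclidean distance. The choice of k-means, the function names, and the `iterations` and `seed` parameters are all assumptions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, k, iterations=50, seed=0):
    """Assign each profile a cluster label in range(k) using k-means.

    Assumption: k-means is used here only as an example clustering method.
    """
    rng = random.Random(seed)
    # Start from k distinct profiles chosen at random.
    centroids = [list(p) for p in rng.sample(profiles, k)]
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

The number of clusters `k` is one of the parameters the talk later lists as still to be estimated.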
14 The algorithm – step 2: for any motif, compute its gene set
15 The algorithm – step 3: compute the distribution of the motif's genes into the clusters
16 The algorithm – step 4: determine overrepresented clusters using a t-test
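Steps 2 to 4 can be sketched as follows. Two points are assumptions rather than the author's method: a gene is taken to carry a motif if its upstream sequence contains the motif or its reverse complement, and, because the slides do not say how the t-test is applied, over-representation is scored here with a simple binomial z-score instead. All function names and the `z_cutoff` parameter are illustrative.

```python
import math
from collections import Counter

def reverse_complement(seq):
    """Reverse complement of a lowercase DNA string."""
    return seq.translate(str.maketrans("acgt", "tgca"))[::-1]

def motif_gene_set(motif, upstream):
    """Step 2: genes whose upstream region contains the motif on either strand."""
    rc = reverse_complement(motif)
    return {gene for gene, seq in upstream.items() if motif in seq or rc in seq}

def cluster_distribution(genes, cluster_of):
    """Step 3: how many of the motif's genes fall into each cluster."""
    return Counter(cluster_of[g] for g in genes)

def overrepresented_clusters(genes, cluster_of, z_cutoff=2.0):
    """Step 4 (substitute test): clusters holding more motif genes than expected.

    Uses a binomial z-score, not the t-test named on the slide.
    """
    cluster_sizes = Counter(cluster_of.values())
    total = len(cluster_of)
    n = len(genes)
    counts = cluster_distribution(genes, cluster_of)
    hits = {}
    for cluster, size in cluster_sizes.items():
        p = size / total                      # chance a random gene is in this cluster
        sd = math.sqrt(n * p * (1 - p))
        if sd > 0:
            z = (counts[cluster] - n * p) / sd
            if z >= z_cutoff:
                hits[cluster] = z
    return hits
```

The returned dictionary maps each over-represented cluster to its score, which is the input the final aggregation step needs.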
17 The algorithm – final step: compute the motif aggregate expression, a single 25-point profile derived from the ~5,000 genes × 25 points matrix
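A minimal sketch of the final step, assuming the aggregate is the point-wise sum of the profiles of the motif's genes that fall in the selected (over-represented) clusters. The sum follows the wording of Assumption 3; the function name and argument layout are illustrative.

```python
def aggregate_expression(motif_genes, cluster_of, selected_clusters, expression):
    """Sum the expression profiles of motif genes lying in the selected clusters.

    Returns one profile with as many points as each gene profile,
    or None if no motif gene falls in a selected cluster.
    """
    chosen = sorted(g for g in motif_genes if cluster_of[g] in selected_clusters)
    if not chosen:
        return None
    n_points = len(expression[chosen[0]])
    return [sum(expression[g][t] for g in chosen) for t in range(n_points)]
```

Using the mean instead of the sum would only rescale the profile, which leaves its Pearson correlation with a TF profile unchanged.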
18 Yeast
19 Example TF: BAS1 Using cis/TF version 1: RANK MOTIF OCCUR corr score 1 gactcg cgagtc gactaa ttagtc tcggct gctagt agtcac p-value= gagtca p-value=0.004
20 Example TF: BAS1 Using cis/TF version 2: RANK MOTIF OCCUR signf corr score 1 ctgact agtcag ggttta taaacc gagtca p-value= tgactc atttga tcaaat agtggc gccact
26 Conclusions Advantages of version 2: – gives the ability to focus on the gene cluster that correlates best with a given TF, and thus increases overall correlation and motif rank – offers a measure of motif significance – can be extended to pairs of TFs/motifs
27 Arabidopsis
28 Procedure: – permute gene cluster assignment – compile list of putative motifs – compute significance score of known motifs – repeat 1000 times – compute p-value of the score
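The procedure above is a permutation test, and its skeleton can be sketched as below. The scoring step (compiling putative motifs and scoring the known ones) is passed in as `score_fn`, a hypothetical callable that takes a gene-to-cluster mapping and returns a number where larger means more significant. The add-one correction in the returned p-value is a common convention and an assumption here.

```python
import random

def permutation_p_value(cluster_of, score_fn, n_permutations=1000, seed=0):
    """Empirical p-value of score_fn(cluster_of) under shuffled cluster labels.

    cluster_of maps gene -> cluster label. Each permutation keeps the
    cluster sizes but reassigns genes to clusters at random.
    """
    rng = random.Random(seed)
    observed = score_fn(cluster_of)
    genes = list(cluster_of)
    labels = [cluster_of[g] for g in genes]
    at_least_as_high = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)
        if score_fn(dict(zip(genes, labels))) >= observed:
            at_least_as_high += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_high + 1) / (n_permutations + 1)
```

With 1000 permutations the smallest p-value this estimate can report is 1/1001.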
30 TF discovery? Need data for training! (TFs and their associated binding sites) Parameters to be estimated: – number of clusters – motif size & degeneracy
31 Pattern discovery
32 TF-driven pattern discovery. Unsupervised pattern discovery: – find groups of genes partially correlating with a TF – apply a statistical filter – look for over-represented motifs in the genes’ upstream regions. Data for validation?
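The last bullet, looking for over-represented motifs in a gene group's upstream regions, might be sketched as follows. The slides do not specify the statistical filter, so this example assumes a hypergeometric enrichment test over fixed-length k-mers (k=6, matching the 6-letter motifs in the yeast tables). It ignores reverse complements and degenerate motifs, and the names and the `alpha` cutoff are illustrative.

```python
from math import comb

def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hypergeom_tail(total, with_motif, group_size, group_hits):
    """P(X >= group_hits) when drawing group_size genes out of total,
    of which with_motif carry the motif."""
    upper = min(group_size, with_motif)
    ways = sum(comb(with_motif, i) * comb(total - with_motif, group_size - i)
               for i in range(group_hits, upper + 1))
    return ways / comb(total, group_size)

def enriched_kmers(upstream, group, k=6, alpha=0.01):
    """k-mers found more often in the group's upstream regions than expected.

    upstream maps gene -> sequence for ALL genes (the background);
    group is the subset of genes being examined.
    """
    present = {gene: kmers(seq, k) for gene, seq in upstream.items()}
    candidates = set().union(*(present[g] for g in group))
    results = []
    for motif in candidates:
        with_motif = sum(1 for ks in present.values() if motif in ks)
        group_hits = sum(1 for g in group if motif in present[g])
        p = hypergeom_tail(len(upstream), with_motif, len(group), group_hits)
        if p <= alpha:
            results.append((motif, p))
    return sorted(results, key=lambda pair: (pair[1], pair[0]))
```

Because every candidate k-mer is tested, a real run would also need a multiple-testing correction, which this sketch leaves out.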
34 Pattern discovery example
35 “Predicting Gene Expression from Sequence” Beer & Tavazoie, Cell 2004: – group genes into 49 clusters – predict a gene's cluster using motifs discovered in its upstream region
37 Conclusions
38 Conclusions Two options: Supervised training: – uses background knowledge to construct model – needs more training data. Unsupervised pattern discovery: – minimal model bias (no prior knowledge) – needs more ‘expert’ help to filter results