1
Cis/TF discovery for Arabidopsis
Aristotelis Tsirigos, NYU Computer Science
email: tsirigos@cs.nyu.edu
2
2 Outline
Input data
The proposed model
Results on yeast
Results on Arabidopsis
Unsupervised pattern discovery
3
3 Input data
4
4 ~23,000 genes × 25 points (expression data); 1,500bp upstream sequence per gene (gctaagc...)
5
5 Normalization
~23,000 genes × 25 points; 1,500bp upstream sequence per gene (gctaagc...)
normalize columns (mean=0)
6
6 Filtering
~23,000 genes × 25 points; 1,500bp upstream sequence per gene (gctaagc...)
normalize columns (mean=0, stdev=1)
filter out low-variance genes: ~5,000 genes × 25 points remain
upstream sequence to motif bitmap (001011…)
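The preprocessing on this slide can be sketched as follows. This is a minimal illustration, not the author's code: the function names, the `min_variance` threshold, and the reading of "motif bitmap" as one presence/absence bit per gene are all assumptions.

```python
import statistics

def normalize_columns(matrix):
    """Scale each column (time point) to mean 0 and standard deviation 1."""
    columns = list(zip(*matrix))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has stdev 0; divide by 1.0 instead to avoid a crash.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)]
            for row in matrix]

def filter_low_variance(matrix, gene_ids, min_variance):
    """Keep genes whose profile variance across points is >= min_variance.

    min_variance is a hypothetical cutoff; the slides do not state one.
    """
    kept = [(g, row) for g, row in zip(gene_ids, matrix)
            if statistics.pvariance(row) >= min_variance]
    return [g for g, _ in kept], [row for _, row in kept]

def motif_bitmap(upstreams, motif):
    """One bit per gene: 1 if the motif occurs in its upstream sequence."""
    return [1 if motif in seq else 0 for seq in upstreams]
```

For example, `motif_bitmap(["aagctaagc", "tttt"], "gctaagc")` marks only the first sequence as containing the motif.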
7
7 The proposed model
8
8 Assumption 1
A single TF binds to a single cis element (motif)
Source: U.S. Department of Energy Genomics (http://doegenomestolife.org)
9
9 Assumption 2
TFs regulate genes sharing a motif only under a subset of conditions
10
10 Assumption 2 (cont’d)
TFs regulate genes sharing a motif only under a subset of conditions
11
11 Assumption 3
The TF expression correlates with the sum of the partially correlating expression patterns
12
12 Objective
For each cis element (motif):
–discover groups of co-regulated genes
–compute aggregate motif expression
For each TF:
–find best correlating motifs
13
13 The algorithm – step 1
~5,000 genes × 25 points
step 1: clustering
14
14 The algorithm – step 2
~5,000 genes × 25 points
step 1: clustering
step 2: for any motif, compute its gene set
15
15 The algorithm – step 3
~5,000 genes × 25 points
step 1: clustering
step 2: for any motif, compute its gene set
step 3: compute the distribution of its genes into the clusters
16
16 The algorithm – step 4
~5,000 genes × 25 points
step 1: clustering
step 2: for any motif, compute its gene set
step 3: compute the distribution of its genes into the clusters
step 4: determine overrepresented clusters using t-test
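Steps 2 to 4 might look roughly like the sketch below, which takes the cluster labels from step 1 as given. The slides do not say how the t-test is set up, so the one-sample t-statistic used here (membership indicator of the motif's genes versus the cluster's background fraction) is an assumption, as are the function names and the `t_threshold` cutoff.

```python
import math
from collections import Counter

def motif_gene_set(upstreams, motif):
    """Step 2: indices of genes whose upstream sequence contains the motif."""
    return [i for i, seq in enumerate(upstreams) if motif in seq]

def cluster_distribution(gene_set, labels):
    """Step 3: how many of the motif's genes fall into each cluster."""
    return Counter(labels[i] for i in gene_set)

def overrepresented_clusters(gene_set, labels, t_threshold=2.0):
    """Step 4 (assumed form): one-sample t-statistic per cluster.

    For cluster c, each motif gene contributes 1 if it lies in c, else 0.
    The mean of these indicators is compared with the background fraction
    of all genes in c. t_threshold is a hypothetical cutoff.
    """
    n, total = len(gene_set), len(labels)
    if n < 2:
        return {}
    sizes = Counter(labels)
    hits = {}
    for c, k in cluster_distribution(gene_set, labels).items():
        p, p0 = k / n, sizes[c] / total
        s = math.sqrt(p * (1 - p) * n / (n - 1))  # sample stdev of indicators
        if s == 0:
            t = math.inf if p > p0 else 0.0
        else:
            t = (p - p0) / (s / math.sqrt(n))
        if t > t_threshold:
            hits[c] = t
    return hits
```

With 20 genes split evenly into two clusters and a motif found in 8 genes of cluster 0 and 2 genes of cluster 1, only cluster 0 passes the default threshold.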
17
17 The algorithm – final step
~5,000 genes × 25 points
final step: compute motif aggregate expression (25 points)
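A minimal sketch of the final step, assuming the aggregate is the point-by-point sum of the selected genes' profiles (following Assumption 3) and that motifs are then ranked by Pearson correlation with the TF profile. The function names and the choice of Pearson correlation are illustrative assumptions, not taken from the slides.

```python
import math

def aggregate_expression(matrix, gene_indices):
    """Sum the expression profiles of the selected genes, point by point."""
    return [sum(col) for col in zip(*(matrix[i] for i in gene_indices))]

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def rank_motifs(tf_profile, motif_aggregates):
    """Sort motifs by correlation of their aggregate profile with the TF."""
    scored = [(motif, pearson(tf_profile, agg))
              for motif, agg in motif_aggregates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Here `motif_aggregates` is a hypothetical dictionary mapping each motif to its aggregate profile, built by calling `aggregate_expression` on the genes in that motif's overrepresented clusters.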
18
18 Yeast
19
19 Example TF: BAS1
Using cis/TF version 1:
RANK  MOTIF   OCCUR  corr    score
1     gactcg  46     0.6446  66
2     cgagtc  46     0.6446  16
3     gactaa  163    0.6381  66
4     ttagtc  163    0.6381  33
5     tcggct  87     0.6374  33
...
12    gctagt  110    0.6268  33
13    agtcac  137    0.6262  83    p-value=0.079
...
27    gagtca  136    0.6192  100   p-value=0.004
20
20 Example TF: BAS1
Using cis/TF version 2:
RANK  MOTIF   OCCUR  signf  corr  score
1     ctgact  122    0.62   0.66  33
2     agtcag  122    0.62   0.66  83
3     ggttta  187    0.62   0.63  50
4     taaacc  187    0.62   0.63  33
5     gagtca  136    0.68   0.63  100   p-value=0.002
6     tgactc  136    0.68   0.63  33
7     atttga  378    0.64   0.63  33
8     tcaaat  378    0.64   0.63  50
9     agtggc  126    0.66   0.61  50
10    gccact  126    0.66   0.61  50
26
26 Conclusions
Advantages of version 2:
–gives the ability to focus on the gene cluster that correlates best with a given TF
–thus, increases overall correlation and motif rank
–offers a measure of motif significance
–can be extended to pairs of TFs/motifs
27
27 Arabidopsis
28
28 Procedure
Permute gene cluster assignment
Compile list of putative motifs
Compute significance score of known motifs
Repeat 1000 times
Compute p-value of the score
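The permutation procedure can be sketched generically as below. The scoring function is passed in as a parameter because the slides do not define the significance score. The `score_fn` argument, the fixed seed, and the add-one correction in the p-value are assumptions made for illustration.

```python
import random

def permutation_p_value(labels, score_fn, n_permutations=1000, seed=0):
    """Empirical p-value of score_fn(labels) under shuffled cluster labels.

    score_fn is a hypothetical callable that maps a list of per-gene
    cluster labels to a significance score (higher means more significant).
    """
    rng = random.Random(seed)
    observed = score_fn(labels)
    shuffled = list(labels)
    at_least_as_high = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # permute the gene-to-cluster assignment
        if score_fn(shuffled) >= observed:
            at_least_as_high += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_high + 1) / (n_permutations + 1)
```

As a usage example, `score_fn` could count how many genes carrying a known motif land in one cluster; a small p-value then means the observed clustering concentrates those genes more than random assignments do.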
30
30 TF discovery?
Need data for training! (TFs and their associated binding sites)
Parameters to be estimated:
–number of clusters
–motif size & degeneracy
31
31 Pattern discovery
32
32 TF-driven pattern discovery
Unsupervised pattern discovery
–Find groups of genes partially correlating with TF
–Apply statistical filter
–Look for over-represented motifs in genes’ upstream regions
Data for validation?
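The last step, looking for over-represented motifs in upstream regions, could be approximated by comparing k-mer presence in a gene group against a background set. This is only a sketch under stated assumptions: the 6-mer length mirrors the motifs in the BAS1 tables, while the ratio-based score, the pseudocount, and the `min_ratio` cutoff are illustrative choices, not the method from the slides.

```python
from collections import Counter

def kmers_present(seq, k=6):
    """Set of distinct k-mers occurring in one upstream sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def overrepresented_kmers(group_seqs, background_seqs, k=6, min_ratio=2.0):
    """k-mers found in a larger fraction of group than background sequences."""
    group = Counter(km for s in group_seqs for km in kmers_present(s, k))
    background = Counter(km for s in background_seqs
                         for km in kmers_present(s, k))
    results = []
    for kmer, count in group.items():
        group_frac = count / len(group_seqs)
        # Pseudocount avoids division by zero for k-mers absent from background.
        background_frac = (background[kmer] + 1) / (len(background_seqs) + 1)
        ratio = group_frac / background_frac
        if ratio >= min_ratio:
            results.append((kmer, ratio))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

A statistical filter, such as a permutation test on the ratio, would still be needed before treating any returned k-mer as a candidate motif.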
34
34 Pattern discovery example
35
35 “Predicting Gene Expression from Sequence”
Beer & Tavazoie, Cell 2004
–Group genes in 49 clusters
–Predict gene cluster using motifs discovered in its upstream region
37
37 Conclusions
38
38 Conclusions
Two options:
Supervised training:
–uses background knowledge to construct model
–needs more training data
Unsupervised pattern discovery:
–minimal model bias (no prior knowledge)
–needs more ‘expert’ help to filter results