Cis/TF discovery for Arabidopsis Aristotelis Tsirigos NYU Computer Science.

1 Cis/TF discovery for Arabidopsis. Aristotelis Tsirigos (email: tsirigos@cs.nyu.edu), NYU Computer Science

2 Outline:
– Input data
– The proposed model
– Results on yeast
– Results on Arabidopsis
– Unsupervised pattern discovery

3 Input data

4 Input data: expression matrix of ~23,000 genes × 25 points; 1,500 bp upstream sequence per gene (gctaagc...)

5 Normalization: ~23,000 genes × 25 points; normalize columns (mean=0); 1,500 bp upstream sequence per gene (gctaagc...)

6 Filtering: ~23,000 genes × 25 points; normalize columns (mean=0, stdev=1); filter out low-variance genes, leaving ~5,000 genes × 25 points; 1,500 bp upstream sequence (gctaagc...) → motif bitmap (001011…)
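The preprocessing on slides 4 to 6 can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names, the choice of population standard deviation, the rule of keeping the top `keep` genes by variance, and the reading of the motif bitmap as one presence bit per gene are all assumptions.

```python
from statistics import mean, pstdev, pvariance

def normalize_columns(matrix):
    """Scale every column (one experimental point) to mean 0 and stdev 1."""
    scaled_cols = []
    for col in zip(*matrix):
        mu, sigma = mean(col), pstdev(col)
        # Assumption: a constant column becomes all zeros instead of dividing by 0.
        scaled_cols.append([(x - mu) / sigma if sigma > 0 else 0.0 for x in col])
    return [list(row) for row in zip(*scaled_cols)]

def filter_low_variance(matrix, gene_ids, keep):
    """Keep the `keep` genes whose rows vary most across the points."""
    ranked = sorted(range(len(matrix)),
                    key=lambda i: pvariance(matrix[i]), reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original gene order
    return [matrix[i] for i in kept], [gene_ids[i] for i in kept]

def motif_bitmap(upstream_seqs, motif):
    """One bit per gene: 1 if the motif occurs in its upstream sequence."""
    return [1 if motif in seq else 0 for seq in upstream_seqs]
```

For the data described on the slides, `keep` would be about 5,000 and each upstream sequence 1,500 bp long.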

7 The proposed model

8 Assumption 1: A single TF binds to a single cis element (motif). Source: U.S. Department of Energy Genomics (http://doegenomestolife.org)

9 Assumption 2: TFs regulate genes sharing a motif only under a subset of conditions

10 Assumption 2 (cont’d): TFs regulate genes sharing a motif only under a subset of conditions

11 Assumption 3: The TF expression correlates with the sum of the partially correlating expression patterns

12 Objective
For each cis element (motif):
– discover groups of co-regulated genes
– compute aggregate motif expression
For each TF:
– find best correlating motifs
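The last objective, finding the best-correlating motifs for a TF, can be illustrated with a plain Pearson ranking. The sketch assumes each motif already has an aggregate expression profile of the same length as the TF profile; the function names and the use of Pearson correlation are assumptions, since the slides only say "correlating".

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a flat profile counts as uncorrelated
    return sxy / sqrt(sxx * syy)

def rank_motifs(tf_profile, motif_profiles):
    """Return (motif, correlation) pairs, best-correlating motif first."""
    scored = [(motif, pearson(tf_profile, profile))
              for motif, profile in motif_profiles.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Here `motif_profiles` is a hypothetical dict mapping a motif string to its aggregate expression over the 25 points.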

13 The algorithm – step 1 (~5,000 genes × 25 points)
step 1: clustering

14 The algorithm – step 2 (~5,000 genes × 25 points)
step 1: clustering
step 2: for any motif, compute its gene set

15 The algorithm – step 3 (~5,000 genes × 25 points)
step 1: clustering
step 2: for any motif, compute its gene set
step 3: compute the distribution of its genes into the clusters

16 The algorithm – step 4 (~5,000 genes × 25 points)
step 1: clustering
step 2: for any motif, compute its gene set
step 3: compute the distribution of its genes into the clusters
step 4: determine overrepresented clusters using t-test

17 The algorithm – final step (~5,000 genes × 25 points)
final step: compute motif aggregate expression (25 points)
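Steps 2 to 4 and the final step can be sketched as follows, taking the cluster labels from step 1 as given (the slides do not name a clustering method). The slides say only "t-test"; this sketch assumes a Welch t statistic comparing motif presence inside a cluster with presence outside it, a fixed cutoff `t_min` instead of a p-value, and an aggregate defined as the mean profile of the motif's genes in the selected clusters. All names here are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples (no p-value is computed)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = mean(a) - mean(b)
    if se == 0:  # both samples constant: treat any difference as extreme
        return float("inf") if diff > 0 else float("-inf") if diff < 0 else 0.0
    return diff / se

def overrepresented_clusters(bitmap, labels, t_min=2.0):
    """Clusters in which the motif's genes are enriched (steps 2-4)."""
    found = []
    for c in sorted(set(labels)):
        inside = [b for b, l in zip(bitmap, labels) if l == c]
        outside = [b for b, l in zip(bitmap, labels) if l != c]
        if len(inside) < 2 or len(outside) < 2:
            continue  # sample variance needs at least two values
        if welch_t(inside, outside) >= t_min:
            found.append(c)
    return found

def aggregate_expression(expr, bitmap, labels, clusters):
    """Mean profile of motif-bearing genes in the chosen clusters (final step)."""
    rows = [e for e, b, l in zip(expr, bitmap, labels) if b and l in clusters]
    return [mean(col) for col in zip(*rows)] if rows else None
```

`bitmap` is the per-gene motif bitmap, `labels` the per-gene cluster assignment, and `expr` the filtered expression matrix, all in the same gene order.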

18 Yeast

19 Example TF: BAS1. Using cis/TF version 1:

RANK  MOTIF   OCCUR  corr    score
1     gactcg  46     0.6446  66
2     cgagtc  46     0.6446  16
3     gactaa  163    0.6381  66
4     ttagtc  163    0.6381  33
5     tcggct  87     0.6374  33
...
12    gctagt  110    0.6268  33
13    agtcac  137    0.6262  83    p-value=0.079
...
27    gagtca  136    0.6192  100   p-value=0.004

20 Example TF: BAS1. Using cis/TF version 2:

RANK  MOTIF   OCCUR  signf  corr  score
1     ctgact  122    0.62   0.66  33
2     agtcag  122    0.62   0.66  83
3     ggttta  187    0.62   0.63  50
4     taaacc  187    0.62   0.63  33
5     gagtca  136    0.68   0.63  100   p-value=0.002
6     tgactc  136    0.68   0.63  33
7     atttga  378    0.64   0.63  33
8     tcaaat  378    0.64   0.63  50
9     agtggc  126    0.66   0.61  50
10    gccact  126    0.66   0.61  50

26 Conclusions
Advantages of version 2:
– gives the ability to focus on the gene cluster that correlates best with a given TF
– thus, increases overall correlation and motif rank
– offers a measure of motif significance
– can be extended to pairs of TFs/motifs

27 Arabidopsis

28 Procedure
– Permute gene cluster assignment
– Compile list of putative motifs
– Compute significance score of known motifs
– Repeat 1000 times
– Compute p-value of the score
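The permutation procedure above can be sketched as a generic permutation test. The significance score itself is not specified on the slide, so it is passed in as a hypothetical `score_fn` that maps a cluster assignment to a number; the add-one correction in the p-value is also an assumption.

```python
import random

def permutation_p_value(labels, score_fn, n_perm=1000, seed=0):
    """Empirical p-value of score_fn(labels) under permuted cluster labels."""
    rng = random.Random(seed)
    observed = score_fn(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute the gene-to-cluster assignment
        if score_fn(shuffled) >= observed:
            hits += 1
    # Assumption: add-one correction so the estimate is never exactly 0.
    return (hits + 1) / (n_perm + 1)
```

A toy score might count how many motif-bearing genes fall in one cluster, for example `lambda lab: sum(b for b, l in zip(bitmap, lab) if l == 0)` for a fixed `bitmap`.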

30 TF discovery?
Need data for training! (TFs and their associated binding sites)
Parameters to be estimated:
– number of clusters
– motif size & degeneracy

31 Pattern discovery

32 TF-driven pattern discovery / Unsupervised pattern discovery
– Find groups of genes partially correlating with TF
– Apply statistical filter
– Look for over-represented motifs in genes’ upstream regions
– Data for validation?
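The motif search in upstream regions can be illustrated with a simple k-mer enrichment ranking. This is only a sketch under stated assumptions: motifs are exact k-mers (k=6, matching the 6-letter motifs in the yeast tables), enrichment is the ratio of the fraction of sequences containing the k-mer in the gene group to the same fraction in a background set, and a pseudocount avoids division by zero. The statistical filter mentioned on the slide is not implemented.

```python
from collections import Counter

def kmer_presence(seqs, k=6):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for s in seqs:
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts

def enriched_kmers(group, background, k=6, pseudo=1.0):
    """Rank k-mers seen in `group` by presence ratio against `background`."""
    g, b = kmer_presence(group, k), kmer_presence(background, k)
    ranked = []
    for kmer, n in g.items():
        frac_group = (n + pseudo) / (len(group) + 2 * pseudo)
        frac_bg = (b[kmer] + pseudo) / (len(background) + 2 * pseudo)
        ranked.append((kmer, frac_group / frac_bg))
    # Highest ratio first; ties broken alphabetically for stable output.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))
```

`group` would hold the upstream sequences of the genes that partially correlate with the TF, and `background` the upstream sequences of the remaining genes.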

34 Pattern discovery example

35 “Predicting Gene Expression from Sequence”, Beer & Tavazoie, Cell 2004
– Group genes in 49 clusters
– Predict gene cluster using motifs discovered in its upstream region

37 Conclusions

38 Conclusions
Two options:
Supervised training:
– uses background knowledge to construct model
– needs more training data
Unsupervised pattern discovery:
– minimal model bias (no prior knowledge)
– needs more ‘expert’ help to filter results
