mRNA protein DNA Activation Repression Translation Localization Stability Pol II 3’UTR Transcriptional and post-transcriptional regulation of gene expression.

Presentation transcript:

1 mRNA protein DNA Activation Repression Translation Localization Stability Pol II 3’UTR Transcriptional and post-transcriptional regulation of gene expression

2 Using gene expression to identify regulatory elements 5’ upstream

3 Feb 2007: ~110,000 arrays in NCBI GEO Gene expression Allen Institute Mouse Brain Gene expression Atlas (in situ hybridization, ~23,000 genes) Feb 2008: ~202,000 arrays

4 Roth et al., 1998, Tavazoie et al., 1999: co-expressed genes often share the same regulatory elements Expression 5’ upstream How do you identify regulatory elements from gene expression? Motif finding programs: - AlignACE (Hughes et al, 2000) - MEME - REDUCE (Bussemaker et al, 2001) and many others …

5 Problems with current motif finding approaches Different approaches for different types of expression data Single microarray (e.g. log-ratios) REDUCE C0 C1 C2 C3 C4 Co-expression clusters ALIGNACE

6 Problems with current motif finding approaches Many unrealistic assumptions, e.g., zeroth-order background sequence model in AlignACE 1kb upstream region of Plasmodium falciparum PF11_0108 TTTAAAAAAAAAAAAAAAAAGAGAAAAACCATATTTATATGGATATAATATTTTTAAAGTATAGAAAA AATAATATATATTTATATACATTTATATTAATGAAAAAGCAAACAGCTAAATTACAAAAAAAAAAAAA AATTAGATTATCTCAATTAAAAGAACAATATATAAATAATTAATCCATGCTATTTTTTGATATATATAA GAATTTAATGCCTTATATTATAAATAGAGAAATAAATAAATAAATAAATATATAAACATATATATTAT ATATATATATATATATATAGTTATACATTATGATTTTGAAAAAATAGATATATACTATTAATTGTATAT GTTTATACATAAAGCATATTTTTATTAATTGTAATATATAGATTTTTTATTATAATAATATTATATATA TATATATATATATATATATTTTTTTTTTTTTGTTAAATAGCGAAATAAAAATACCTGACCTTTGTAATCT TTATTTGATTACTTCCTTCTTCATTCCTTCTTTGTTTGTTTGTTTGTTTCCCTTTTTTTTTTTTTTTTTTTT TTTTTTAGTTAATTCTTTTATATGTATAATAATATTATAAGACAATTGGACAATGATTACAAAAAGGTA AAAGTAATAATTTTCTAAAGTATAATATAATATTATAATAATATAATATAAATTTTTAATAAAATTTA ATAAAAAAGTTTATAAATACTTATCGACCATAAGTCGTTTAAGAAAAAAAAAAAAAAGAAAAAAAAAA AAAAATATTACAAAAATATTATAGTGTATTATATTTTATCATATCATCTTTTTTTTATTTTTTTTTATA TTTTTGTTACGGCACATCAAGCAACTATAAATATTTAAGATCAACCACCAAAAAAAAAAAAAAAAAAAA AAAAAAAAAACATTATTTATGGTATTTTAAA

7 Problems with current motif finding approaches Elevated false positive rate k-means clustered gene expression randomly clustered gene expression many motifs AlignACE many motifs

8 Re-thinking motif discovery from expression data One approach for all types of expression data Make as few a priori assumptions as possible Very low false positive rate Scale to complex metazoan and plant genomes Noam Slonim

9 0 1 2 3 4 Cluster Index Microarray Conditions All Genes on array Clusters of co-expressed genes

10 Our approach: look for motifs whose profile of presence/absence is informative about the expression profile. 5’ upstream regions; cluster index: 1 1 1 1 1 2 2 2 2 0 0 0 0 0 2 (these genes belong to cluster 0, these genes belong to cluster 1, these genes belong to cluster 2, ...). Correlation is quantified using the mutual information. Motif vs. expression (rows: Absent, Present; columns: cluster indices 0, 1, 2): 0.27 0.07 0.33 / 0.07 0.27 0.00
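The mutual information named on this slide can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' code: the function name `mutual_information` and the toy presence profile are assumptions, chosen so that the joint frequencies reproduce the table on the slide when its first row is read as "Absent".

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

# Hypothetical toy data: 15 genes, 5 per expression cluster.
clusters = [0] * 5 + [1] * 5 + [2] * 5
# 1 = motif present in the 5' upstream region, 0 = absent (assumed assignment).
present = [0, 0, 0, 0, 1] + [1, 1, 1, 1, 0] + [0, 0, 0, 0, 0]

mi = mutual_information(present, clusters)
print(f"I(motif; expression) = {mi:.3f} bits")
```

A motif whose presence is independent of the cluster labels scores zero; a motif found in exactly one cluster scores high.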

11 0.45 0.12 0.01 -0.08 -0.87 -1.56 -2.32 -2.89 -5.65 1.54 1.98 3.50 4.39 6.45 -8.90 5’ upstream region Log-ratio Continuous expression variables (e.g. microarray log-ratios)
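The slide does not say how continuous log-ratios are fed into a mutual information calculation. One common option, assumed here purely for illustration, is to quantize the values into equally populated bins and treat the bin index like a cluster index. The helper name `equal_population_bins` and the choice of three bins are hypothetical.

```python
from collections import Counter

def equal_population_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 so that bins hold (almost) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

# The 15 log-ratios shown on the slide.
log_ratios = [0.45, 0.12, 0.01, -0.08, -0.87, -1.56, -2.32, -2.89,
              -5.65, 1.54, 1.98, 3.50, 4.39, 6.45, -8.90]
bins = equal_population_bins(log_ratios, 3)
print(Counter(bins))
```

After this step, each gene carries a discrete label and can be handled exactly like a gene with a cluster index.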

12 3’UTRs 1 1 1 1 1 2 2 2 2 0 0 0 0 0 2 5’ upstream regions Expression cluster index 1 1 1 1 1 2 2 2 2 0 0 0 0 0 2 Expression cluster index DNA RNA

13 finding all informative DNA and RNA motifs How do we do it ?

14 Possible motif representations (search space; accuracy): Words (k-mers), e.g. GCGATGAG (small; acceptable). Degenerate code, e.g. [AC]CGATGAG[TC] (large; good). Weight matrices (very large; very good).
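To make the search-space comparison concrete, the following arithmetic counts candidate motifs of a fixed width. It assumes a width of 7 (the k-mer width used in the MI tables later in the deck) and the 15 non-empty subsets of {A, C, G, T} as the degenerate alphabet. Weight matrices have real-valued entries, so their space is not enumerable in this way.

```python
k = 7  # assumed motif width

n_words = 4 ** k        # every position is one of A, C, G, T
n_degenerate = 15 ** k  # every position is one non-empty subset of {A, C, G, T}

print(f"{n_words:,} k-mers vs {n_degenerate:,} degenerate motifs of width {k}")
```

The gap of roughly four orders of magnitude suggests why an exhaustive scan is feasible for words but not for degenerate motifs.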

15 Motif Search Algorithm k-mer MI CTCATCG 0.0618 TCATCGC 0.0485 AAAATTT 0.0438 GATGAGC 0.0434 AAAAATT 0.0383 ATGAGCT 0.0334 TTGCCAC 0.0322 TGCCACC 0.0298 ATCTCAT 0.0265... ACGCGCG 0.0018 CGACGCG 0.0012 TACGCTA 0.0011 ACCCCCT 0.0010 CCACGGC 0.0009 TTCAAAA 0.0005 AGACGCG 0.0004 CGAGAGC 0.0003 CTTATTA 0.0002 Not informative Highly informative... MI=0.081 MI=0.045 MI=0.040
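The ranking on this slide can be sketched as follows: score every k-mer by the mutual information between its presence/absence profile and the cluster labels, then sort. This is a simplified illustration under stated assumptions (single strand only, toy sequences, hypothetical function names), not the FIRE implementation.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def rank_kmers(seqs, clusters, k):
    """Return (MI, k-mer) pairs for every k-mer seen in seqs, most informative first."""
    kmers = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    scored = []
    for kmer in kmers:
        profile = [kmer in s for s in seqs]  # presence/absence per gene
        scored.append((mutual_information(profile, clusters), kmer))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored

# Hypothetical toy promoters: "ACG" occurs only in the cluster-0 genes.
seqs = ["AACGT", "TACGA", "GGACG", "TTTTT", "GGGGG", "CCCCC"]
clusters = [0, 0, 0, 1, 1, 1]
ranking = rank_kmers(seqs, clusters, k=3)
for mi, kmer in ranking[:3]:
    print(kmer, round(mi, 4))
```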

16 Optimizing k-mers into more informative degenerate motifs ATCCGTACA ATCC[C/G]TACA Which character increases the mutual information by the largest amount? 5’ upstream regions Cluster Indices 1 1 1 1 1 2 2 2 2 0 0 0 0 0 2 A/G T/G C/G A/C/G A/T/G C/G/T

17 Optimizing k-mers into more informative degenerate motifs ATCC[C/G]TACA 5’ upstream regions Cluster Indices 1 1 1 1 1 2 2 2 2 0 0 0 0 0 2 A/C T/C C/G A/C/G A/T/C C/G/T......
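The optimization step on these two slides asks, for one position, which character class raises the mutual information most. A minimal sketch of that single step is given below. The toy sequences, the function names, and the rule of accepting only strict improvements are assumptions made for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def motif_mi(motif, seqs, clusters):
    """MI between the motif's presence profile and the cluster labels."""
    pattern = re.compile("".join("[" + "".join(sorted(p)) + "]" for p in motif))
    profile = [pattern.search(s) is not None for s in seqs]
    return mutual_information(profile, clusters)

def optimize_position(motif, pos, seqs, clusters):
    """Try every non-empty subset of ACGT at one position; keep the best strict improvement."""
    best, best_mi = list(motif), motif_mi(motif, seqs, clusters)
    for size in range(1, 5):
        for chars in combinations("ACGT", size):
            candidate = motif[:pos] + [frozenset(chars)] + motif[pos + 1:]
            mi = motif_mi(candidate, seqs, clusters)
            if mi > best_mi + 1e-12:
                best, best_mi = candidate, mi
    return best, best_mi

# Hypothetical toy data: cluster-0 genes carry TCCCTAC or TCCGTAC.
seqs = ["AATCCCTACAA", "GGTCCGTACTT", "TCCCTACGGG", "ATCCGTACA",
        "AAAAAAAAAA", "TCCATACGG", "GGGGGGGG", "CCTTCCTT"]
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
seed = [frozenset(c) for c in "TCCCTAC"]
best, best_mi = optimize_position(seed, 3, seqs, clusters)
print(sorted(best[3]), round(best_mi, 3))
```

In a full search this step would presumably be repeated over positions until no change improves the score; the slides do not spell out the schedule.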

18 change Motif Conservation with S. bayanus Similarity to ChIP-chip RAP1 motif Mutual information RAP1 binding site (ChIP-chip)

19 k-mer MI CTCATCG 0.0618 TCATCGC 0.0485 AAAATTT 0.0438 GCTCATC 0.0434 AAAAATT 0.0383 ATGAGCT 0.0334 TTGCCAC 0.0322 TGCCACC 0.0298 ATCTCAT 0.0265... Highly informative k-mers Only optimize k-mer if I(k-mer;expression | motif) is large enough (for all motifs optimized so far) MI=0.081 MI=0.045 Motifs optimized so far optimize ? Conditional mutual information I(X;Y|Z)
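The conditional mutual information I(X;Y|Z) used as a redundancy filter here can be computed as the average of I(X;Y) within each value of Z, weighted by how often that value occurs. The sketch below uses hypothetical function names and toy profiles. The threshold for "large enough" is not given on the slide, so none is assumed.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def conditional_mi(xs, ys, zs):
    """I(X;Y|Z) = sum over z of p(z) * I(X;Y | Z=z)."""
    n = len(xs)
    total = 0.0
    for z in set(zs):
        idx = [i for i in range(n) if zs[i] == z]
        total += (len(idx) / n) * mutual_information([xs[i] for i in idx],
                                                     [ys[i] for i in idx])
    return total

# Toy profiles: the k-mer is fully redundant with an already optimized motif.
kmer_present = [1, 1, 1, 0, 0, 0, 0, 0]
motif_present = [1, 1, 1, 0, 0, 0, 0, 0]
expression = [0, 0, 0, 0, 1, 1, 1, 1]
print(conditional_mi(kmer_present, expression, motif_present))
```

A k-mer that adds nothing beyond an existing motif scores zero and would be skipped; a k-mer that explains different genes keeps a high score.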

20 Each motif is subjected to a stringent statistical significance test Real mutual information value Maximum of 10,000 expression-shuffled mutual information values
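The significance test described on this slide compares the real mutual information with the maximum obtained over many shuffles of the expression labels. A minimal sketch, assuming a motif passes only when its real value is strictly greater than every shuffled value; the function names and the fixed random seed are illustrative choices.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def passes_shuffle_test(profile, expression, n_shuffles=10000, seed=0):
    """True if the real MI exceeds the maximum MI over expression-shuffled data."""
    rng = random.Random(seed)
    real = mutual_information(profile, expression)
    shuffled = list(expression)
    best_random = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break the gene-to-expression association
        best_random = max(best_random, mutual_information(profile, shuffled))
    return real > best_random

# Toy case: a motif present in exactly the genes of one cluster.
expression = [0] * 15 + [1] * 15
informative = [1] * 15 + [0] * 15
print(passes_shuffle_test(informative, expression, n_shuffles=200))
```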

21 The regulation of gene expression is highly combinatorial DNA Pol II Expression pattern 1 Expression pattern 2 Expression pattern 3

22 Can we group our predicted motifs into modules of combinatorially acting regulatory elements ? The regulation of gene expression is highly combinatorial

23 Predicting combinatorial regulation using mutual information 5’ upstream region 0.43 0.07 0.43 Motif 1 Motif 2 Absent Present Absent Present Is the presence of motif 1 informative about the presence of motif 2?

24 Discovering modules of combinatorially acting motifs modules

25 Yeast P. falciparum Human Results (malaria parasite)

26 Yeast stress gene expression program (Gasch et al, 2000) 173 microarray conditions ~ 5,500 genes 80 co-expression clusters Runtime ~ 1h (standard PC)

27 Predicted Motifs Expression Clusters 17 motifs in 5’ upstream regions 6 motifs in 3’UTRs 0 “motifs” when shuffling the gene labels of the clustering partition 1129 motifs when applying AlignACE (with default parameters) to each cluster independently 880 “motifs” when applying AlignACE to the same shuffled clusters as above

28 Predicted Motifs 13 modules of co-occurring motifs All 23 motifs are highly conserved with S. bayanus PAC RRPE PUF4 PUF3 MSN2/4 RAP1 RPN4 REB1 MBP1 HAP4 XBP1 BAS1 CBF1 SWI4 14 previously known motifs Expression Clusters

29 PAC is under-represented in cluster 13 (p<1e-5) PAC is highly over-represented in cluster 66 (p<1e-20) over-representation under-representation PAC RRPE Puf4 Expression Clusters Motifs 5’ 3’UTR Predicted cooperation between DNA and RNA motifs
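The slide reports over- and under-representation p-values without naming the test. A hypergeometric test is a standard choice for this kind of enrichment question and is assumed here. The function names and the toy counts are hypothetical and do not reproduce the slide's p-values.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(exactly k motif-carrying genes in a cluster of n, given K carriers among N genes)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def over_representation_p(k, N, K, n):
    """P(observing k or more motif-carrying genes in the cluster by chance)."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, min(K, n) + 1))

def under_representation_p(k, N, K, n):
    """P(observing k or fewer motif-carrying genes in the cluster by chance)."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(max(0, n - (N - K)), k + 1))

# Toy numbers: 10 genes, 5 carry the motif, a cluster of 5 genes.
p_over = over_representation_p(5, N=10, K=5, n=5)    # all 5 cluster genes carry it
p_under = under_representation_p(0, N=10, K=5, n=5)  # none of them carries it
print(p_over, p_under)
```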

30 Predicted Motifs Expression Clusters PAC RRPE PUF4 PUF3 MSN2/4 RAP1 RPN4 REB1 MBP1 HAP4 XBP1 BAS1 CBF1 SWI4

31 Mitochondrial ribosome, p<1e-33 Mitochondrial ribosome, p<1e-29 Puf3 Cytosolic ribosome, p<1e-18

32 Functional enrichments Proteasome complex (p<1e-44) DNA replication (p<1e-7) Oxidative phosphorylation (p<1e-17) Rpn4 Mbp1 Novel motif

33 Beer and Tavazoie, 2004; Elemento and Tavazoie, 2005 We also use mutual information to discover … Non-random spatial distribution Orientation preferences Co-localization

34 0 0 0 0 0 1 1 1 1 5’ upstream region Cluster Indices 2 2 2 2 0 0 0 0 0 1 1 1 1 2 2 2 2 Cluster Indices Close Far Very far Distance to TSS is informative about expression Distance between two motifs is informative about expression

35 Predicted Motifs Clusters Y Y Y Y Y Y Y Y Y Y Y Y > 50% of our predicted motifs have a non-random spatial distribution

36 Clusters where the motif is over-represented Clusters where the motif is NOT over-represented ATG -600bp PAC has a non-random spatial distribution

37 RAP1 motif has a different kind of non-random spatial distribution Unique cluster where the motif is over-represented Clusters where the motif is NOT over-represented RAP1 motif also has a strong orientation preference

38 Clusters where the TWO motifs are both over-represented Clusters where the motifs are NOT over-represented -600bp ATG PAC and RRPE tend to co-localize on the DNA

39 -600bp ATG PAC and the Msn2/4 binding site tend to avoid being in the same promoters

40 Single array analysis Down-regulated Up-regulated Cy3/Cy5 expression log-ratios PAC Rpn4 Yap1 Puf3 H2O2 treatment in ΔMsn2/ΔMsn4 background

41 Bozdech, Llinás, et al., PLoS Biol, 2003 P. falciparum intra-erythrocytic developmental cycle ~ 2,700 periodically expressed genes 0h Time 48h Associate a “phase” to each gene, which reflects the timing of maximal expression
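The slide does not describe how the phase is computed. One simple possibility, assumed here for illustration only, is to take the phase of the first Fourier harmonic of a gene's time course over one cycle, so that a gene peaking a quarter of the way through the cycle gets a phase of π/2. The function name and the synthetic 48-point profile are hypothetical.

```python
import cmath
import math

def expression_phase(profile):
    """Phase (radians) of the first Fourier harmonic of one full cycle.

    For a profile proportional to cos(2*pi*t/n - phi) this returns phi,
    i.e. the peak sits at fraction phi / (2*pi) of the cycle.
    """
    n = len(profile)
    coeff = sum(v * cmath.exp(-2j * math.pi * t / n) for t, v in enumerate(profile))
    return -cmath.phase(coeff)

# Synthetic gene sampled hourly over a 48 h cycle, peaking at 12 h.
n = 48
profile = [math.cos(2 * math.pi * (t - 12) / n) for t in range(n)]
phase = expression_phase(profile)
print(round(phase, 4))
```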

42 -0.25 -0.12 0.01 0.08 0.34 0.67 2.32 2.89 3.01 -0.38 -1.68 -2.34 -2.56 -3.14 3.14 5’ upstream region Phase Discovering motifs that are informative about the expression phase

43 21 motifs in 5’ upstream regions 0 motifs in 3’UTRs 0 “motifs” when shuffling the gene labels of the phase profile -π Phase +π 71% highly conserved with P. yoelii DNA replication, p<1e-4 plastid, p<0.01 ribosome, p<0.001

44 Independent biochemical validation - Purified 3/26 predicted TFs in P. falciparum - Identified DNA-binding specificities using protein binding microarrays Bulyk lab, Harvard Llinás lab, Princeton University, submitted

45 Motifs Match Predictions Protein Binding Microarray FIRE Prediction GST AP2 AP2 GST AP2

46 -π Phase +π Independent biochemical validation for 3/21 motifs More TFs being purified... bound by MAL6P1.44 bound by PF11_0404 bound by PF14_0633

47 Human gene expression atlas (Su et al, 2004, PNAS) 79 human tissues >17,000 genes 120 co-expression clusters Runtime ~ 24h (standard PC)

48 73 DNA motifs 42 RNA motifs ELK4 Sp1 AhR bZIP911 NF-Y E2F1 TCF11-MafG Pax2 E2F v-Myb TEAD Dof2 CHOP-C/EBPalpha HAND1-TCF3 GBP Skn-1 HFH-3 Sox17 miR-499/miR-505/miR-200a/miR-141 miR-525/miR-518f*/miR-526c/miR-526a/miR-520a* miR-380-3p/miR-215/miR-485-3p/hsa-let-7g/miR-610/hsa-let-7i/hsa-let-7b/hsa-let-7a miR-30d/miR-30c/miR-30a-5p/miR-30e-5p/miR-30B miR-200b/miR-429/miR-200c miR-663 71 modules

49 NF-Y novel M phase (p<1e-43)

50 TCF11-MafG novel Olfactory receptor activity (p<1e-43)

51 miR-525 miR-518f* miR-526c Sp1

52 (NF-Y binding site)

53 let-7b over-expression in human fibroblasts let-7 microRNAs are up-regulated when fibroblasts enter quiescence Let-7 target genes? A. Lagesse-Miller, O. Elemento, …, and H. Coller, submitted

54 ~14,000 genes C1 C2 C3 C4 C1 C2 C4 C5 A. Lagesse-Miller, O. Elemento, …, and H. Coller, submitted 0h 12h 24h 36h 48h C5 let-7b over-expression in human fibroblasts UACCUC |||||| uugguguguuggaugAUGGAGu-5’ let-7b seed
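The duplex drawn on this slide pairs the 3’UTR hexamer UACCUC with the let-7b seed. The relationship can be checked with a short sketch: reverse the miRNA as written (3’ to 5’) to get the 5’ to 3’ sequence, take nucleotides 2 to 7 as the seed, and reverse-complement it. Using positions 2 to 7 is the usual seed convention and is assumed here; the variable names are illustrative.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement_rna(seq):
    """Reverse complement of an RNA sequence (Watson-Crick pairs only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

# let-7b as printed on the slide, written 3' -> 5'.
let7b_3to5 = "uugguguguuggaugAUGGAGu"
let7b_5to3 = let7b_3to5[::-1].upper()

seed = let7b_5to3[1:7]                      # nucleotides 2-7 from the 5' end
target_site = reverse_complement_rna(seed)  # what a matching 3'UTR site looks like
print(seed, target_site)
```

The computed site is the hexamer shown above the pairing bars on the slide.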

55 -1000bp TSS Arabidopsis thaliana Experimental testing: Ken Birnbaum (NYU) Phil Benfey (Duke) ~22,300 genes on Affy chip

56 Biological insights Importance of RNA motifs in shaping transcriptomes ~30% of yeast, worm, human, arabidopsis motifs are RNA motifs In worm/human/mouse, many RNA motifs match miRNA targets “Cooperation” between DNA and RNA motifs UGUGAU |||||| cgaguaguuucgaccgACACUAu Yeast Puf4 motif Novel worm 3’UTR motif

57 Biological insights Avoidance of joint presence for certain motifs Under-representation of certain motifs

58 works with any type of gene expression data FIRE (Finding Informative Regulatory Elements) Single microarray (e.g. log-ratios) Elemento, Slonim and Tavazoie, 2007, Molecular Cell C0 C1 C2 C3 C4 Clustered microarrays Gene expression phase

59 It looks for both DNA and RNA motifs FIRE (Finding Informative Regulatory Elements) 5’ 3’UTR 5’ 3’UTR 5’ Elemento, Slonim and Tavazoie, 2007, Molecular Cell

60 It is fast and scales well to large metazoan and plant genomes FIRE (Finding Informative Regulatory Elements) Elemento, Slonim and Tavazoie, 2007, Molecular Cell

61 It yields few or no false positives FIRE (Finding Informative Regulatory Elements) Real clustered gene expression randomly clustered gene expression 115 motifs 0 motifs (Human tissue microarray dataset)

62 It automatically evaluates: FIRE (Finding Informative Regulatory Elements) Functional coherence Defense response (p<1e-32) Inter-species conservation Spatial and orientation biases Compare to known motifs (JASPAR) Cooperativity and co-localization FIRE

63 fire --expfile=human_clusters.txt --exptype=discrete --species=human Expression file Expression type Species FIRE (Finding Informative Regulatory Elements) Usage: http://tavazoielab.princeton.edu/FIRE/ NM_000030 0 NM_000040 0 NM_000042 0 NM_000045 0 NM_000046 1 NM_000053 1 NM_000065 1 NM_000066 1... - discrete - continuous - human - mouse - arabidopsis - drosophila - worm - plasmodium - budding yeast - fission yeast - sea squirt ~250 downloads since Nov 2007

64 Bambi Tsui http://tavazoielab.princeton.edu/FIRE/ ~1500 queries since Nov 2007
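The expression file shown on this slide pairs a gene identifier with a cluster index. The sketch below parses that two-column layout and groups genes by cluster. It assumes whitespace-separated columns with no header line, which the slide does not confirm, and `parse_expression_file` is a hypothetical helper, not part of FIRE.

```python
def parse_expression_file(text):
    """Group gene identifiers by cluster index from 'gene<whitespace>index' lines."""
    clusters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        gene, label = fields
        clusters.setdefault(int(label), []).append(gene)
    return clusters

# The eight rows visible on the slide.
example = """NM_000030 0
NM_000040 0
NM_000042 0
NM_000045 0
NM_000046 1
NM_000053 1
NM_000065 1
NM_000066 1
"""
clusters = parse_expression_file(example)
for index, genes in sorted(clusters.items()):
    print(index, genes)
```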

65 Acknowledgements Saeed Tavazoie Noam Slonim Sasan Amini Chang Chan Gordon Freckleton Hany Girgis Yir-Chung Liu Ilias Tagkopoulos Tiffany Vora Scott Breunig Anand Dharan Hani Goodarzi Danny Lieber Yael Marshall Kellen Olszewski Bambi Tsui Eric Wieschaus Manuel Llinás Hilary Coller Aster Lagesse-Miller Xuemin Lu Erandi De Silva Collaborators at Princeton

66

67 Additional slides

68 … Chan*, Elemento*, Tavazoie, PLoS Computational Biology, 2005

69 k-mer MI CTCATCG 0.0618 TCATCGC 0.0485 AAAATTT 0.0438 GATGAGC 0.0434 AAAAATT 0.0383 ATGAGCT 0.0334 TTGCCAC 0.0322 TGCCACC 0.0298 CATCGCA 0.0293 AGATGAG 0.0288 TTTTTCA 0.0280 ATCTCAT 0.0265... ACGCGCG 0.0168 CGACGCG 0.0167 TACGCTA 0.0167 ACCCCCT 0.0167 CCACGGC 0.0164 TTCAAAA 0.0163 AGACGCG 0.0163 CGAGAGC 0.0163 GATAGAG 0.0155 GTAGCTC 0.0143 CTTATTA 0.0142... Test PASS DON’T PASS DON’T PASS DON’T PASS Most informative Less informative 10 consecutive “don’t pass” Optimize “seeds” into more degenerate motifs
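The stopping rule on this slide, walking down the MI-sorted list and halting after 10 consecutive k-mers that fail the test, can be sketched as below. The function name and the stand-in test are hypothetical; in practice the test would be the shuffle-based significance test described earlier in the deck.

```python
def select_seeds(ranked_kmers, passes_test, max_consecutive_failures=10):
    """Collect k-mers that pass, stopping after a run of consecutive failures."""
    seeds = []
    failures = 0
    for kmer in ranked_kmers:
        if passes_test(kmer):
            seeds.append(kmer)
            failures = 0  # a pass resets the run of failures
        else:
            failures += 1
            if failures >= max_consecutive_failures:
                break
    return seeds

# Stand-in data: integers play the role of k-mers sorted by decreasing MI.
ranked = list(range(30))
significant = {0, 1, 3, 20}
seeds = select_seeds(ranked, lambda kmer: kmer in significant)
print(seeds)
```

Item 20 is never reached because ten failures in a row (items 4 to 13) end the scan first, which is the point of the rule: it avoids testing the long tail of uninformative k-mers.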

70 Optimizing seeds into more informative degenerate motifs ATCGATCG S=C/G M=A/C W=A/T R=A/G K=T/G Y=T/C V=A/C/G H=A/C/T D=A/G/T B=C/G/T N=A/C/G/T TCC[C/G]TAC matches TCCCTAC and TCCGTAC
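The degenerate code listed on this slide maps directly onto regular-expression character classes. The sketch below translates a motif written with those one-letter codes into a regex and checks the slide's own example. The dictionary and function names are illustrative.

```python
import re

# One-letter degenerate codes from the slide, mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "S": "CG", "M": "AC", "W": "AT", "R": "AG", "K": "GT", "Y": "CT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

def iupac_to_regex(motif):
    """Translate a degenerate motif such as 'TCCSTAC' into a regex string."""
    parts = []
    for code in motif:
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

pattern = iupac_to_regex("TCCSTAC")
print(pattern)
for site in ("TCCCTAC", "TCCGTAC", "TCCATAC"):
    print(site, bool(re.fullmatch(pattern, site)))
```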

71 Predicted Motifs Clusters PAC RRPE PUF4 PUF3 MSN2/4 RAP1 RPN4 REB1 MBP1 HAP4 XBP1 BAS1 CBF1 SWI4 5’ 3’UTR Another example of predicted cooperation between DNA and RNA motifs RAP1 Novel

72 A gene expression map of Arabidopsis thaliana development (Schmid et al, 2005) 79 different tissue samples 22,300 genes 140 clusters Schmid et al, 2005, Nature Genetics

73 -Log(p) over-rep Log(p) under-rep 114 motifs in 5’ upstream regions 66 motifs in 3’UTRs 0 “motifs” when shuffling the gene labels of the clustering partition

74 -Log(p) over-rep Log(p) under-rep telo-box ABRE-like W-box I-box DRE-core Few of these motifs are known

75 -Log(p) over-rep Log(p) under-rep Y Y Y Y Y Y Y Y Y Y Y Y Y Many have a non-random spatial distribution

76 Clusters where the motif is over-represented Clusters where the motif is NOT over-represented -1000bp TSS Motif has a non-random spatial distribution

77 Functional enrichments Defense response (p<1e-32) Ribosome (p<1e-84) Localized to chloroplast (p<1e-35) 3’UTR 5’ 3’UTR

78 -Log(p) over-rep Log(p) under-rep These two motifs are predicted to co-localize extensively

79 Clusters where the motifs are over-represented Clusters where the motifs are NOT over-represented -1000bp TSS These two motifs co-localize on the DNA

80 Examples of tissue-specific motifs... Collaborations with Phil Benfey (Duke), Ken Birnbaum (NYU)

81 Other data-types strong binding no binding p-values 20 Bicoid-bound vs 100 non-bound enhancers 20 Dorsal-bound vs 100 non-bound enhancers ChIP-chip, e.g. HNF6 (in human islet cells) Enhancers (Drosophila)

82 No 5’ upstream region Tissue-specific expression Yes All other genes Tissue-specific genes Binary expression variables (e.g. tissue-specific expression)

