mRNA protein DNA Activation Repression Translation Localization Stability Pol II 3’UTR Transcriptional and post-transcriptional regulation of gene expression
Using gene expression to identify regulatory elements 5’ upstream
Feb 2007: ~110,000 arrays in NCBI GEO Gene expression Allen Institute Mouse Brain Gene expression Atlas (in situ hybridization, ~23,000 genes) Feb 2008: ~202,000 arrays
Roth et al., 1998, Tavazoie et al., 1999: co-expressed genes often share the same regulatory elements Expression5’ upstream How do you identify regulatory elements from gene expression ? Motif finding programs: - AlignACE (Hughes et al, 2000) - MEME - REDUCE (Bussemaker et al, 2001) and many others …
Problems with current motif finding approaches Different approaches for different types of expression data Single microarray (e.g. log-ratios) REDUCE C0 C1 C2 C3 C4 Co-expression clusters ALIGNACE
Problems with current motif finding approaches Many unrealistic assumptions, e.g., zero th order background sequence model in AlignACE 1kb upstream region of Plasmodium falciparum PF11_0108 TTTAAAAAAAAAAAAAAAAAGAGAAAAACCATATTTATATGGATATAATATTTTTAAAGTATAGAAAA AATAATATATATTTATATACATTTATATTAATGAAAAAGCAAACAGCTAAATTACAAAAAAAAAAAAA AATTAGATTATCTCAATTAAAAGAACAATATATAAATAATTAATCCATGCTATTTTTTGATATATATAA GAATTTAATGCCTTATATTATAAATAGAGAAATAAATAAATAAATAAATATATAAACATATATATTAT ATATATATATATATATATAGTTATACATTATGATTTTGAAAAAATAGATATATACTATTAATTGTATAT GTTTATACATAAAGCATATTTTTATTAATTGTAATATATAGATTTTTTATTATAATAATATTATATATA TATATATATATATATATATTTTTTTTTTTTTGTTAAATAGCGAAATAAAAATACCTGACCTTTGTAATCT TTATTTGATTACTTCCTTCTTCATTCCTTCTTTGTTTGTTTGTTTGTTTCCCTTTTTTTTTTTTTTTTTTTT TTTTTTAGTTAATTCTTTTATATGTATAATAATATTATAAGACAATTGGACAATGATTACAAAAAGGTA AAAGTAATAATTTTCTAAAGTATAATATAATATTATAATAATATAATATAAATTTTTAATAAAATTTA ATAAAAAAGTTTATAAATACTTATCGACCATAAGTCGTTTAAGAAAAAAAAAAAAAAGAAAAAAAAAA AAAAATATTACAAAAATATTATAGTGTATTATATTTTATCATATCATCTTTTTTTTATTTTTTTTTATA TTTTTGTTACGGCACATCAAGCAACTATAAATATTTAAGATCAACCACCAAAAAAAAAAAAAAAAAAAA AAAAAAAAAACATTATTTATGGTATTTTAAA
Problems with current motif finding approaches Elevated false positive rate k-means clustered gene expression randomly clustered gene expression many motifs AlignACE many motifs
Re-thinking motif discovery from expression data One approach for all types of expression data Make as few a priori assumptions as possible Very low false positive rate Scale to complex metazoan and plant genomes Noam Slonim
Cluster Index Microarray Conditions All Genes on array Clusters of co-expressed genes
5’ upstream regions Cluster index correlation is quantified using the mutual information Motif Expression (Cluster Indices) Absent Present Our approach: look for motifs whose profile of presence/absence is informative about the expression profile These genes belong to cluster 0 These genes belong to cluster 1 These genes belong to cluster 2...
’ upstream regionLog-ratio Continuous expression variables (e.g. microarray log-ratios)
3’UTRs ’ upstream regions Expression cluster index Expression cluster index DNA RNA
finding all informative DNA and RNA motifs How do we do it ?
Possible motif representations Search spaceAccuracy very good good acceptable very large large small Words (k-mers) GCGATGAG Weight matrices Degenerate code [AC]CGATGAG[TC]
Motif Search Algorithm k-mer MI CTCATCG TCATCGC AAAATTT GATGAGC AAAAATT ATGAGCT TTGCCAC TGCCACC ATCTCAT ACGCGCG CGACGCG TACGCTA ACCCCCT CCACGGC TTCAAAA AGACGCG CGAGAGC CTTATTA Not informative Highly informative... MI=0.081 MI=0.045 MI=0.040
Optimizing k-mers into more informative degenerate motifs ATCCGTACA ATCC[C/G]TACA which character increases the mutual information by the largest amount ? 5’ upstream regions Cluster Indices A/G T/G C/GA/C/G A/T/G C/G/T
Optimizing k-mers into more informative degenerate motifs ATCC[C/G]TACA 5’ upstream regions Cluster Indices A/C T/C C/GA/C/G A/T/C C/G/T......
change Motif Conservation with S. bayanus Similarity to ChIP-chip RAP1 motif Mutual information RAP1 binding site (ChIP-chip)
k-mer MI CTCATCG TCATCGC AAAATTT GCTCATC AAAAATT ATGAGCT TTGCCAC TGCCACC ATCTCAT Highly informative k-mers Only optimize k-mer if I(k-mer;expression | motif) is large enough (for all motifs optimized so far) MI=0.081 MI=0.045 Motifs optimized so far optimize ? Conditional mutual information I(X;Y|Z)
Each motif is subjected to a stringent statistical significance test Real mutual information value Maximum of 10,000 expression-shuffled mutual information values
The regulation of gene expression is highly combinatorial DNA Pol II Expression pattern 1 Expression pattern 2 Expression pattern 3
Can we group our predicted motifs into modules of combinatorially acting regulatory elements ? The regulation of gene expression is highly combinatorial
Predicting combinatorial regulation using mutual information 5’ upstream region Motif 1 Motif 2 Absent Present Absent Present Is the presence of motif 1 informative about the presence of motif 2 ?
Discovering modules of combinatorially acting motifs modules
Yeast P. falciparum Human Results (malaria parasite)
Yeast stress gene expression program (Gasch et al, 2000) 173 microarray conditions ~ 5,500 genes 80 co-expression clusters Runtime ~ 1h (standard PC)
Predicted Motifs Expression Clusters 17 motifs in 5’ upstream regions 6 motifs in 3’UTRs 0 “motifs” when shuffling the gene labels of the clustering partition 1129 motifs when applying AlignACE (with default parameters) to each cluster independently 880 “motifs” when applying AlignACE to the same shuffled clusters as above
Predicted Motifs 13 modules of co-occurring motifs All 23 motifs are highly conserved with S. bayanus PAC RRPE PUF4 PUF3 MSN2/4 RAP1 RPN4 REB1 MBP1 HAP4 XBP1 BAS1 CBF1 SWI4 14 previously known motifs Expression Clusters
PAC is under-represented in cluster 13 (p<1e-5) PAC is highly over-represented in cluster 66 (p<1e-20) over-represention under-represention PAC RRPE Puf4 Expression Clusters Motifs 5’ 3’UTR Predicted cooperation between DNA and RNA motifs
Predicted Motifs Expression Clusters PAC RRPE PUF4 PUF3 MSN2/4 RAP1 RPN4 REB1 MBP1 HAP4 XBP1 BAS1 CBF1 SWI4
Mitochondrial ribosome, p<1e-33 Mitochondrial ribosome, p<1e-29 Puf3 Cytosolic ribosome, p<1e-18
Functional enrichments Proteasome complex (p<1e-44) DNA replication (p<1e-7) Oxydative phosphorylation (p<1e-17) Rpn4 Mbp1 Novel motif
Beer and Tavazoie, 2004; Elemento and Tavazoie, 2005 We also use mutual information to discover … Non-random spatial distribution Orientation preferences Co-localization
’ upstream region Cluster Indices Cluster Indices CloseFarVery far Distance to TSS is informative about expression Distance between two motifs is informative about expression
Predicted Motifs Clusters Y Y Y Y Y Y Y Y Y Y Y Y > 50% of our predicted motifs have a non-random spatial distribution
Clusters where the motif is over- represented Clusters where the motif is NOT over- represented ATG -600bp PAC has a non-random spatial distribution
RAP1 motif has a different kind of non- random spatial distribution Unique cluster where the motif is over- represented Clusters where the motif is NOT over- represented RAP1 motif also has a strong orientation preference
Clusters where the TWO motifs are both over- represented Clusters where the motifs are NOT over- represented -600bp ATG PAC and RRPE tend be co- localize on the DNA
-600bp ATG PAC and the Msn2/4 binding site tend to avoid being in the same promoters
Single array analysis Down-regulatedUp-regulated Cy3/Cy5 expression log-ratios PAC Rpn4 Yap1 Puf3 H 2 O 2 treatment in ΔMsn2/ΔMsn4 background
Bozdech, Llinás, et al., PLoS Biol, 2003 P. falciparum intra-erythrocytic developmental cycle ~ 2,700 periodically expressed genes 0h Time 48h Associate a “ phase ” to each gene, which reflects the timing of maximal expression
’ upstream regionPhase Discovering motifs that are informative about the expression phase
21 motifs in 5’ upstream regions 0 motifs in 3’UTRs 0 “motifs” when shuffling the gene labels of the phase profile -π Phase +π 71% highly conserved with P. yoelli DNA replication, p<1e-4 plastid, p<0.01 ribosome, p<0.001
Independent biochemical validation - Purified 3/26 predicted TF in P. falciparum -Identified DNA-binding specificities using protein binding microarrays Bulyk lab, Harvard Llinás lab, Princeton University, submitted
Motifs Match Predictions Protein Binding Microarray FIRE Prediction GST AP2 AP2 GST AP2
-π Phase +π Independent biochemical validation for 3/21 motifs More TFs being purified... bound by MAL6P1.44 bound by PF11_0404 bound by PF14_0633
Human gene expression atlas (Su et al, 2004, PNAS) 79 human tissues >17,000 genes 120 co-expression clusters Runtime ~ 24h (standard PC)
73 DNA motifs 42 RNA motifs ELK4 Sp1 AhR bZIP911 NF-Y E2F1 TCF11-MafG Pax2 E2F v-Myb TEAD Dof2 CHOP-C/EBPalpha HAND1-TCF3 GBP Skn-1 HFH-3 Sox17 miR-499/miR-505/miR-200a/miR-141 miR-525/miR-518f*/miR-526c/miR-526a/miR-520a* miR-380-3p/miR-215/miR-485-3p/hsa-let-7g/miR- 610/hsa-let-7i/hsa-let-7b/hsa-let-7a miR-30d/miR-30c/miR-30a-5p/miR-30e-5p/miR-30B miR-200b/miR-429/miR-200c miR modules
NF-Y novel M phase (p<1e-43)
TCF11- MafG novel Olfactory receptor activity (p<1e-43)
miR-525 miR-518f* miR-526c Sp1
(NF-Y binding site)
let-7b over-expression in human fibroblasts let-7 microRNAs are up- regulated when fibroblasts enter quiescence Let-7 target genes ? A. Lagesse-Miller, O. Elemento, …, and H. Coller, submitted
~14,000 genes C1 C2 C3 C4 C1C2C4C5 A. Lagesse-Miller, O. Elemento, …, and H. Coller, submitted 0h12h24h36h48h C5 let-7b over-expression in human fibroblasts UACCUC |||||| uugguguguuggaugAUGGAGu-5’ let-7b seed
-1000bpTSS Arabidopsis thaliana Experimental testing: Ken Birnbaum (NYU) Phil Benfey (Duke) ~22,300 genes on Affy chip
Biological insights Importance of RNA motifs in shaping transcriptomes ~30% of yeast, worm, human, arabidopsis motifs are RNA motifs In worm/human/mouse, many RNA motifs match miRNA targets “Cooperation” between DNA and RNA motifs UGUGAU |||||| cgaguaguuucgaccgACACUAu Yeast Puf4 motif Novel worm 3’UTR motif
Biological insights Avoidance of joint- presence for certain motifs Under-representation of certain motifs
works with any type of gene expression data FIRE (Finding Informative Regulatory Elements) Single microarray (e.g. log-ratios) Elemento, Slonim and Tavazoie, 2007, Molecular Cell C0 C1 C2 C3 C4 Clustered microarrays Gene expression phase
It looks for both DNA and RNA motifs FIRE (Finding Informative Regulatory Elements) 5’ 3’UTR 5’ 3’UTR 5’ Elemento, Slonim and Tavazoie, 2007, Molecular Cell
It is fast and scales well to large metazoan and plant genomes FIRE (Finding Informative Regulatory Elements) Elemento, Slonim and Tavazoie, 2007, Molecular Cell
It yields few or no false positives FIRE (Finding Informative Regulatory Elements) Real clustered gene expression randomly clustered gene expression 115 motifs 0 motifs (Human tissue microarray dataset)
It automatically evaluates: FIRE (Finding Informative Regulatory Elements) Functional coherence Defense response (p<1e-32) Inter-species conservation Spatial and orientation biases Compare to known motifs (JASPAR) Cooperativity and co-localization FIRE
fire --expfile=human_clusters.txt --exptype=discrete --species=human Expression file Expression typeSpecies FIRE (Finding Informative Regulatory Elements) Usage: NM_ NM_ NM_ NM_ NM_ NM_ NM_ NM_ discrete - continuous - human - mouse - arabidopsis - drosophila - worm - plasmodium - budding yeast - fission yeast - sea squirt ~250 downloads since Nov 2007
Bambi Tsui queries since Nov 2007
Acknowledgements Saeed Tavazoie Noam Slonim Sasan Amini Chang Chan Gordon Freckleton Hany Girgis Yir-Chung Liu Ilias Tagkopoulos Tiffany Vora Scott Breunig Anand Dharan Hani Goodarzi Danny Lieber Yael Marshall Kellen Olszewski Bambi Tsui Eric Wieschaus Manuel Llinás Hilary Coller Aster Lagesse-Miller Xuemin Lu Erandi De Silva Collaborators at Princeton
Additional slides
… Chan*, Elemento*, Tavazoie, PLoS Computational Biology, 2005
k-mer MI CTCATCG TCATCGC AAAATTT GATGAGC AAAAATT ATGAGCT TTGCCAC TGCCACC CATCGCA AGATGAG TTTTTCA ATCTCAT ACGCGCG CGACGCG TACGCTA ACCCCCT CCACGGC TTCAAAA AGACGCG CGAGAGC GATAGAG GTAGCTC CTTATTA Test PASS DON’T PASS DON’T PASSDON’T PASS Most informative Less informative 10 consecutive “don’t pass” Optimize “seeds” into more degenerate motifs
Optimizing seeds into more informative degenerate motifs ATCGATCG S=C/G M=A/C W=A/T R=A/G K=T/G Y=T/C V=A/C/G H=A/C/T D=A/G/T B=C/G/T N=A/C/G/T TCC[C/G]TAC matches TCCCTAC and TCCGTAC
Predicted Motifs Clusters PAC RRPE PUF4 PUF3 MSN2/4 RAP1 RPN4 REB1 MBP1 HAP4 XBP1 BAS1 CBF1 SWI4 5’ 3’UTR Another example of predicted cooperation between DNA and RNA motifs RAP1 Novel
A gene expression map of Arabidopsis thaliana development (Schmid et al, 2005) 79 different tissue samples 22,300 genes 140 clusters Schmid et al, 2005, Nature Genetics
-Log(p) over-rep Log(p) under-rep 114 motifs in 5’ upstream regions 66 motifs in 3’UTRs 0 “motifs” when shuffling the gene labels of the clustering partition
-Log(p) over-rep Log(p) under-rep telo-box ABRE-like W-box I-box DRE-core Few of these motifs are known
-Log(p) over-rep Log(p) under-rep Y Y Y Y Y Y Y Y Y Y Y Y Y Many have a non-random spatial distribution
Clusters where the motif is over- represented Clusters where the motif is NOT over- represented -1000bp TSS Motif has a non-random spatial distribution
Functional enrichments Defense response (p<1e-32) Ribosome (p<1e-84) Localized to chloroplast (p<1e-35) 3’UTR 5’ 3’UTR
-Log(p) over-rep Log(p) under-rep These two motifs are predicted to co- localize extensively
Clusters where the motifs are over- represented Clusters where the motifs are NOT over- represented -1000bpTSS These two motifs co- localize on the DNA
Examples of tissue-specific motifs... Collaborations with Phil Benfey (Duke), Ken Birnbaum (NYU)
Other data-types strong bindingno bindingp-values 20 Bicoid-bound vs 100 non-bound enhancers 20 Dorsal-bound vs 100 non-bound enhancers ChiP-chip, e.g. HNF6 (in human islet cells) Enhancers (Drosophila)
No 5’ upstream regionTissue-specific expression Yes All other genes Tissue-specific genes Binary expression variables (e.g. tissue specific expression)