MRNA protein DNA Activation Repression Translation Localization Stability Pol II 3’UTR Transcriptional and post-transcriptional regulation of gene expression.

Slides:



Advertisements
Similar presentations
Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome ECS289A.
Advertisements

Periodic clusters. Non periodic clusters That was only the beginning…
Transcriptional regulation and promoter analysis
Computational discovery of gene modules and regulatory networks Ziv Bar-Joseph et al (2003) Presented By: Dan Baluta.
Inferring Transcriptional Regulation Using Transctiptomics Carsten O. Daub September 1 st, 2014 StratCan Summer School 2014 Vår Gård, Saltsjöbaden.
Combined analysis of ChIP- chip data and sequence data Harbison et al. CS 466 Saurabh Sinha.
The AMADEUS Motif Discovery Platform C. Linhart, Y. Halperin, R. Shamir Tel-Aviv University ApoSys workshop May ‘ 08 Genome Research 2008.
Finding regulatory modules from local alignment - Department of Computer Science & Helsinki Institute of Information Technology HIIT University of Helsinki.
From Sequence to Expression: A Probabilistic Framework Eran Segal (Stanford) Joint work with: Yoseph Barash (Hebrew U.) Itamar Simon (Whitehead Inst.)
March 03 Identification of Transcription Factor Binding Sites Presenting: Mira & Tali.
Gene regulatory network
Regulatory Motifs. Contents Biology of regulatory motifs Experimental discovery Computational discovery PSSM MEME.
Work Process Using Enrich Load biological data Check enrichment of crossed data sets Extract statistically significant results Multiple hypothesis correction.
Genome-wide prediction and characterization of interactions between transcription factors in S. cerevisiae Speaker: Chunhui Cai.
Human: 78 tissues (Su et al, 2004) Stastical significance P. falciparum: intra-erythrocytic development cycle Yeast: 78 co-expression clusters From k-mers.
Comparative Motif Finding
Transcription factor binding motifs (part I) 10/17/07.
Microarrays and Cancer Segal et al. CS 466 Saurabh Sinha.
Computational biology seminar
The Model To model the complex distribution of the data we used the Gaussian Mixture Model (GMM) with a countable infinite number of Gaussian components.
Bio277 Lab 3: Finding Transcription Factor Binding Motifs Adapted from a Lab Written by Prof Terry Speed Jess Mar Department of Biostatistics Quackenbush.
Fuzzy K means.
MRNA protein DNA Activation Repression Translation Localization Stability Pol II 3’UTR Transcriptional and post-transcriptional regulation of gene expression.
1 Predicting Gene Expression from Sequence Michael A. Beer and Saeed Tavazoie Cell 117, (16 April 2004)
Modeling Functional Genomics Datasets CVM Lesson 1 13 June 2007Bindu Nanduri.
Promoter Analysis using Bioinformatics, Putting the Predictions to the Test Amy Creekmore Ansci 490M November 19, 2002.
Computational Approaches for Understanding Biological Significance of Microarray Data Liangjiang (LJ) Wang KSU Bioinformatics Center, Biology.
Identifying conserved promoter motifs and transcription factor binding sites in plant promoters Endre Sebestyén, ARI-HAS, Martonvásár, Hungary 26th, November,
Computational Molecular Biology Biochem 218 – BioMedical Informatics Gene Regulatory.
Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)
Cis-regulatory element study in transcriptome Jin Chen CSE Fall
Fine Structure and Analysis of Eukaryotic Genes
Whole Genome Expression Analysis
MicroRNA Targets Prediction and Analysis. Small RNAs play important roles The Nobel Prize in Physiology or Medicine for 2006 Andrew Z. Fire and Craig.
MRNA protein DNA Activation Repression Translation Localization Stability Pol II 3’UTR Transcriptional and post-transcriptional regulation of gene expression.
A systems biology approach to the identification and analysis of transcriptional regulatory networks in osteocytes Angela K. Dean, Stephen E. Harris, Jianhua.
The Genome is Organized in Chromatin. Nucleosome Breathing, Opening, and Gaping.
Outline Who regulates whom and when? Model Learning algorithm Evaluation Wet lab experiments Perspective: why does it work? Reg. ACGTGC.
* only 17% of SNPs implicated in freshwater adaptation map to coding sequences Many, many mapping studies find prevalent noncoding QTLs.
Finish up array applications Move on to proteomics Protein microarrays.
ChIP-on-Chip and Differential Location Analysis Junguk Hur School of Informatics October 4, 2005.
Vidyadhar Karmarkar Genomics and Bioinformatics 414 Life Sciences Building, Huck Institute of Life Sciences.
RNA Folding. RNA Folding Algorithms Intuitively: given a sequence, find the structure with the maximal number of base pairs For nested structures, four.
Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Motifs BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
Identification of cell cycle-related regulatory motifs using a kernel canonical correlation analysis Presented by Rhee, Je-Keun Graduate Program in Bioinformatics.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
Module networks Sushmita Roy BMI/CS 576 Nov 18 th & 20th, 2014.
Localising regulatory elements using statistical analysis and shortest unique substrings of DNA Nora Pierstorff 1, Rodrigo Nunes de Fonseca 2, Thomas Wiehe.
1 Global expression analysis Monday 10/1: Intro* 1 page Project Overview Due Intro to R lab Wednesday 10/3: Stats & FDR - * read the paper! Monday 10/8:
Starting Monday M Oct 29 –Back to BLAST and Orthology (readings posted) will focus on the BLAST algorithm, different types and applications of BLAST; in.
Conference Report: Recomb Satellite NYC, Nov 2010 DREAM, Systems Biology and Regulatory Genomics.
Data Mining the Yeast Genome Expression and Sequence Data Alvis Brazma European Bioinformatics Institute.
Tools for Comparative Sequence Analysis Ivan Ovcharenko Lawrence Livermore National Laboratory.
Cis-regulatory Modules and Module Discovery
Cluster validation Integration ICES Bioinformatics.
Microarray analysis Quantitation of Gene Expression Expression Data to Networks BIO520 BioinformaticsJim Lund Reading: Ch 16.
Case Study: Characterizing Diseased States from Expression/Regulation Data Tuck et al., BMC Bioinformatics, 2006.
Finding genes in the genome
Pattern Discovery and Recognition for Understanding Genetic Regulation Timothy L. Bailey Institute for Molecular Bioscience University of Queensland.
Network Motifs See some examples of motifs and their functionality Discuss a study that showed how a miRNA also can be integrated into motifs Today’s plan.
Enhancers and 3D genomics Noam Bar RESEARCH METHODS IN COMPUTATIONAL BIOLOGY.
Regulation of Gene Expression
Inferring Models of cis-Regulatory Modules using Information Theory
Dennis Shasha, Courant Institute, New York University With
Revealing Global Regulatory Perturbations across Human Cancers
Mapping Global Histone Acetylation Patterns to Gene Expression
Revealing Global Regulatory Perturbations across Human Cancers
Predicting Gene Expression from Sequence
Presentation transcript:

mRNA protein DNA Activation Repression Translation Localization Stability Pol II 3’UTR Transcriptional and post-transcriptional regulation of gene expression

Using gene expression to identify regulatory elements 5’ upstream

Feb 2007: ~110,000 arrays in NCBI GEO Gene expression Allen Institute Mouse Brain Gene expression Atlas (in situ hybridization, ~23,000 genes) Feb 2008: ~202,000 arrays

Roth et al., 1998, Tavazoie et al., 1999: co-expressed genes often share the same regulatory elements Expression5’ upstream How do you identify regulatory elements from gene expression ? Motif finding programs: - AlignACE (Hughes et al, 2000) - MEME - REDUCE (Bussemaker et al, 2001) and many others …

Problems with current motif finding approaches Different approaches for different types of expression data Single microarray (e.g. log-ratios) REDUCE C0 C1 C2 C3 C4 Co-expression clusters ALIGNACE

Problems with current motif finding approaches Many unrealistic assumptions, e.g., zero th order background sequence model in AlignACE 1kb upstream region of Plasmodium falciparum PF11_0108 TTTAAAAAAAAAAAAAAAAAGAGAAAAACCATATTTATATGGATATAATATTTTTAAAGTATAGAAAA AATAATATATATTTATATACATTTATATTAATGAAAAAGCAAACAGCTAAATTACAAAAAAAAAAAAA AATTAGATTATCTCAATTAAAAGAACAATATATAAATAATTAATCCATGCTATTTTTTGATATATATAA GAATTTAATGCCTTATATTATAAATAGAGAAATAAATAAATAAATAAATATATAAACATATATATTAT ATATATATATATATATATAGTTATACATTATGATTTTGAAAAAATAGATATATACTATTAATTGTATAT GTTTATACATAAAGCATATTTTTATTAATTGTAATATATAGATTTTTTATTATAATAATATTATATATA TATATATATATATATATATTTTTTTTTTTTTGTTAAATAGCGAAATAAAAATACCTGACCTTTGTAATCT TTATTTGATTACTTCCTTCTTCATTCCTTCTTTGTTTGTTTGTTTGTTTCCCTTTTTTTTTTTTTTTTTTTT TTTTTTAGTTAATTCTTTTATATGTATAATAATATTATAAGACAATTGGACAATGATTACAAAAAGGTA AAAGTAATAATTTTCTAAAGTATAATATAATATTATAATAATATAATATAAATTTTTAATAAAATTTA ATAAAAAAGTTTATAAATACTTATCGACCATAAGTCGTTTAAGAAAAAAAAAAAAAAGAAAAAAAAAA AAAAATATTACAAAAATATTATAGTGTATTATATTTTATCATATCATCTTTTTTTTATTTTTTTTTATA TTTTTGTTACGGCACATCAAGCAACTATAAATATTTAAGATCAACCACCAAAAAAAAAAAAAAAAAAAA AAAAAAAAAACATTATTTATGGTATTTTAAA

Problems with current motif finding approaches Elevated false positive rate k-means clustered gene expression randomly clustered gene expression many motifs AlignACE many motifs

Re-thinking motif discovery from expression data One approach for all types of expression data Make as few a priori assumptions as possible Very low false positive rate Scale to complex metazoan and plant genomes Noam Slonim

Cluster Index Microarray Conditions All Genes on array Clusters of co-expressed genes

5’ upstream regions Cluster index correlation is quantified using the mutual information Motif Expression (Cluster Indices) Absent Present Our approach: look for motifs whose profile of presence/absence is informative about the expression profile These genes belong to cluster 0 These genes belong to cluster 1 These genes belong to cluster 2...

’ upstream regionLog-ratio Continuous expression variables (e.g. microarray log-ratios)

3’UTRs ’ upstream regions Expression cluster index Expression cluster index DNA RNA

finding all informative DNA and RNA motifs How do we do it ?

Possible motif representations Search spaceAccuracy very good good acceptable very large large small Words (k-mers) GCGATGAG Weight matrices Degenerate code [AC]CGATGAG[TC]

Motif Search Algorithm k-mer MI CTCATCG TCATCGC AAAATTT GATGAGC AAAAATT ATGAGCT TTGCCAC TGCCACC ATCTCAT ACGCGCG CGACGCG TACGCTA ACCCCCT CCACGGC TTCAAAA AGACGCG CGAGAGC CTTATTA Not informative Highly informative... MI=0.081 MI=0.045 MI=0.040

Optimizing k-mers into more informative degenerate motifs ATCCGTACA ATCC[C/G]TACA which character increases the mutual information by the largest amount ? 5’ upstream regions Cluster Indices A/G T/G C/GA/C/G A/T/G C/G/T

Optimizing k-mers into more informative degenerate motifs ATCC[C/G]TACA 5’ upstream regions Cluster Indices A/C T/C C/GA/C/G A/T/C C/G/T......

change Motif Conservation with S. bayanus Similarity to ChIP-chip RAP1 motif Mutual information RAP1 binding site (ChIP-chip)

k-mer MI CTCATCG TCATCGC AAAATTT GCTCATC AAAAATT ATGAGCT TTGCCAC TGCCACC ATCTCAT Highly informative k-mers Only optimize k-mer if I(k-mer;expression | motif) is large enough (for all motifs optimized so far) MI=0.081 MI=0.045 Motifs optimized so far optimize ? Conditional mutual information I(X;Y|Z)

Each motif is subjected to a stringent statistical significance test Real mutual information value Maximum of 10,000 expression-shuffled mutual information values

The regulation of gene expression is highly combinatorial DNA Pol II Expression pattern 1 Expression pattern 2 Expression pattern 3

Can we group our predicted motifs into modules of combinatorially acting regulatory elements ? The regulation of gene expression is highly combinatorial

Predicting combinatorial regulation using mutual information 5’ upstream region Motif 1 Motif 2 Absent Present Absent Present Is the presence of motif 1 informative about the presence of motif 2 ?

Discovering modules of combinatorially acting motifs modules

Yeast P. falciparum Human Results (malaria parasite)

Yeast stress gene expression program (Gasch et al, 2000) 173 microarray conditions ~ 5,500 genes 80 co-expression clusters Runtime ~ 1h (standard PC)

Predicted Motifs Expression Clusters 17 motifs in 5’ upstream regions 6 motifs in 3’UTRs 0 “motifs” when shuffling the gene labels of the clustering partition 1129 motifs when applying AlignACE (with default parameters) to each cluster independently 880 “motifs” when applying AlignACE to the same shuffled clusters as above

Predicted Motifs 13 modules of co-occurring motifs All 23 motifs are highly conserved with S. bayanus PAC RRPE PUF4 PUF3 MSN2/4 RAP1 RPN4 REB1 MBP1 HAP4 XBP1 BAS1 CBF1 SWI4 14 previously known motifs Expression Clusters

PAC is under-represented in cluster 13 (p<1e-5) PAC is highly over-represented in cluster 66 (p<1e-20) over-represention under-represention PAC RRPE Puf4 Expression Clusters Motifs 5’ 3’UTR Predicted cooperation between DNA and RNA motifs

Predicted Motifs Expression Clusters PAC RRPE PUF4 PUF3 MSN2/4 RAP1 RPN4 REB1 MBP1 HAP4 XBP1 BAS1 CBF1 SWI4

Mitochondrial ribosome, p<1e-33 Mitochondrial ribosome, p<1e-29 Puf3 Cytosolic ribosome, p<1e-18

Functional enrichments Proteasome complex (p<1e-44) DNA replication (p<1e-7) Oxydative phosphorylation (p<1e-17) Rpn4 Mbp1 Novel motif

Beer and Tavazoie, 2004; Elemento and Tavazoie, 2005 We also use mutual information to discover … Non-random spatial distribution Orientation preferences Co-localization

’ upstream region Cluster Indices Cluster Indices CloseFarVery far Distance to TSS is informative about expression Distance between two motifs is informative about expression

Predicted Motifs Clusters Y Y Y Y Y Y Y Y Y Y Y Y > 50% of our predicted motifs have a non-random spatial distribution

Clusters where the motif is over- represented Clusters where the motif is NOT over- represented ATG -600bp PAC has a non-random spatial distribution

RAP1 motif has a different kind of non- random spatial distribution Unique cluster where the motif is over- represented Clusters where the motif is NOT over- represented RAP1 motif also has a strong orientation preference

Clusters where the TWO motifs are both over- represented Clusters where the motifs are NOT over- represented -600bp ATG PAC and RRPE tend be co- localize on the DNA

-600bp ATG PAC and the Msn2/4 binding site tend to avoid being in the same promoters

Single array analysis Down-regulatedUp-regulated Cy3/Cy5 expression log-ratios PAC Rpn4 Yap1 Puf3 H 2 O 2 treatment in ΔMsn2/ΔMsn4 background

Bozdech, Llinás, et al., PLoS Biol, 2003 P. falciparum intra-erythrocytic developmental cycle ~ 2,700 periodically expressed genes 0h Time 48h Associate a “ phase ” to each gene, which reflects the timing of maximal expression

’ upstream regionPhase Discovering motifs that are informative about the expression phase

21 motifs in 5’ upstream regions 0 motifs in 3’UTRs 0 “motifs” when shuffling the gene labels of the phase profile -π Phase +π 71% highly conserved with P. yoelli DNA replication, p<1e-4 plastid, p<0.01 ribosome, p<0.001

Independent biochemical validation - Purified 3/26 predicted TF in P. falciparum -Identified DNA-binding specificities using protein binding microarrays Bulyk lab, Harvard Llinás lab, Princeton University, submitted

Motifs Match Predictions Protein Binding Microarray FIRE Prediction GST AP2 AP2 GST AP2

-π Phase +π Independent biochemical validation for 3/21 motifs More TFs being purified... bound by MAL6P1.44 bound by PF11_0404 bound by PF14_0633

Human gene expression atlas (Su et al, 2004, PNAS) 79 human tissues >17,000 genes 120 co-expression clusters Runtime ~ 24h (standard PC)

73 DNA motifs 42 RNA motifs ELK4 Sp1 AhR bZIP911 NF-Y E2F1 TCF11-MafG Pax2 E2F v-Myb TEAD Dof2 CHOP-C/EBPalpha HAND1-TCF3 GBP Skn-1 HFH-3 Sox17 miR-499/miR-505/miR-200a/miR-141 miR-525/miR-518f*/miR-526c/miR-526a/miR-520a* miR-380-3p/miR-215/miR-485-3p/hsa-let-7g/miR- 610/hsa-let-7i/hsa-let-7b/hsa-let-7a miR-30d/miR-30c/miR-30a-5p/miR-30e-5p/miR-30B miR-200b/miR-429/miR-200c miR modules

NF-Y novel M phase (p<1e-43)

TCF11- MafG novel Olfactory receptor activity (p<1e-43)

miR-525 miR-518f* miR-526c Sp1

(NF-Y binding site)

let-7b over-expression in human fibroblasts let-7 microRNAs are up- regulated when fibroblasts enter quiescence Let-7 target genes ? A. Lagesse-Miller, O. Elemento, …, and H. Coller, submitted

~14,000 genes C1 C2 C3 C4 C1C2C4C5 A. Lagesse-Miller, O. Elemento, …, and H. Coller, submitted 0h12h24h36h48h C5 let-7b over-expression in human fibroblasts UACCUC |||||| uugguguguuggaugAUGGAGu-5’ let-7b seed

-1000bpTSS Arabidopsis thaliana Experimental testing: Ken Birnbaum (NYU) Phil Benfey (Duke) ~22,300 genes on Affy chip

Biological insights Importance of RNA motifs in shaping transcriptomes ~30% of yeast, worm, human, arabidopsis motifs are RNA motifs In worm/human/mouse, many RNA motifs match miRNA targets “Cooperation” between DNA and RNA motifs UGUGAU |||||| cgaguaguuucgaccgACACUAu Yeast Puf4 motif Novel worm 3’UTR motif

Biological insights Avoidance of joint- presence for certain motifs Under-representation of certain motifs

works with any type of gene expression data FIRE (Finding Informative Regulatory Elements) Single microarray (e.g. log-ratios) Elemento, Slonim and Tavazoie, 2007, Molecular Cell C0 C1 C2 C3 C4 Clustered microarrays Gene expression phase

It looks for both DNA and RNA motifs FIRE (Finding Informative Regulatory Elements) 5’ 3’UTR 5’ 3’UTR 5’ Elemento, Slonim and Tavazoie, 2007, Molecular Cell

It is fast and scales well to large metazoan and plant genomes FIRE (Finding Informative Regulatory Elements) Elemento, Slonim and Tavazoie, 2007, Molecular Cell

It yields few or no false positives FIRE (Finding Informative Regulatory Elements) Real clustered gene expression randomly clustered gene expression 115 motifs 0 motifs (Human tissue microarray dataset)

It automatically evaluates: FIRE (Finding Informative Regulatory Elements) Functional coherence Defense response (p<1e-32) Inter-species conservation Spatial and orientation biases Compare to known motifs (JASPAR) Cooperativity and co-localization FIRE

fire --expfile=human_clusters.txt --exptype=discrete --species=human Expression file Expression typeSpecies FIRE (Finding Informative Regulatory Elements) Usage: NM_ NM_ NM_ NM_ NM_ NM_ NM_ NM_ discrete - continuous - human - mouse - arabidopsis - drosophila - worm - plasmodium - budding yeast - fission yeast - sea squirt ~250 downloads since Nov 2007

Bambi Tsui queries since Nov 2007

Acknowledgements Saeed Tavazoie Noam Slonim Sasan Amini Chang Chan Gordon Freckleton Hany Girgis Yir-Chung Liu Ilias Tagkopoulos Tiffany Vora Scott Breunig Anand Dharan Hani Goodarzi Danny Lieber Yael Marshall Kellen Olszewski Bambi Tsui Eric Wieschaus Manuel Llinás Hilary Coller Aster Lagesse-Miller Xuemin Lu Erandi De Silva Collaborators at Princeton

Additional slides

… Chan*, Elemento*, Tavazoie, PLoS Computational Biology, 2005

k-mer MI CTCATCG TCATCGC AAAATTT GATGAGC AAAAATT ATGAGCT TTGCCAC TGCCACC CATCGCA AGATGAG TTTTTCA ATCTCAT ACGCGCG CGACGCG TACGCTA ACCCCCT CCACGGC TTCAAAA AGACGCG CGAGAGC GATAGAG GTAGCTC CTTATTA Test PASS DON’T PASS DON’T PASSDON’T PASS Most informative Less informative 10 consecutive “don’t pass” Optimize “seeds” into more degenerate motifs

Optimizing seeds into more informative degenerate motifs ATCGATCG S=C/G M=A/C W=A/T R=A/G K=T/G Y=T/C V=A/C/G H=A/C/T D=A/G/T B=C/G/T N=A/C/G/T TCC[C/G]TAC matches TCCCTAC and TCCGTAC

Predicted Motifs Clusters PAC RRPE PUF4 PUF3 MSN2/4 RAP1 RPN4 REB1 MBP1 HAP4 XBP1 BAS1 CBF1 SWI4 5’ 3’UTR Another example of predicted cooperation between DNA and RNA motifs RAP1 Novel

A gene expression map of Arabidopsis thaliana development (Schmid et al, 2005) 79 different tissue samples 22,300 genes 140 clusters Schmid et al, 2005, Nature Genetics

-Log(p) over-rep Log(p) under-rep 114 motifs in 5’ upstream regions 66 motifs in 3’UTRs 0 “motifs” when shuffling the gene labels of the clustering partition

-Log(p) over-rep Log(p) under-rep telo-box ABRE-like W-box I-box DRE-core Few of these motifs are known

-Log(p) over-rep Log(p) under-rep Y Y Y Y Y Y Y Y Y Y Y Y Y Many have a non-random spatial distribution

Clusters where the motif is over- represented Clusters where the motif is NOT over- represented -1000bp TSS Motif has a non-random spatial distribution

Functional enrichments Defense response (p<1e-32) Ribosome (p<1e-84) Localized to chloroplast (p<1e-35) 3’UTR 5’ 3’UTR

-Log(p) over-rep Log(p) under-rep These two motifs are predicted to co- localize extensively

Clusters where the motifs are over- represented Clusters where the motifs are NOT over- represented -1000bpTSS These two motifs co- localize on the DNA

Examples of tissue-specific motifs... Collaborations with Phil Benfey (Duke), Ken Birnbaum (NYU)

Other data-types strong bindingno bindingp-values 20 Bicoid-bound vs 100 non-bound enhancers 20 Dorsal-bound vs 100 non-bound enhancers ChiP-chip, e.g. HNF6 (in human islet cells) Enhancers (Drosophila)

No 5’ upstream regionTissue-specific expression Yes All other genes Tissue-specific genes Binary expression variables (e.g. tissue specific expression)