1
Learning the cis regulatory code by predictive modeling of gene regulation (MEDUSA)
Christina Leslie
Center for Computational Learning Systems, Columbia University, NY, USA
http://www.cs.columbia.edu/compbio/medusa
2
Transcriptional Regulation Nuclear membrane
4
Transcriptional Regulation Nuclear membrane Binding site/motif CCG__CCG
5
Transcriptional Regulation Nuclear membrane Binding site/motif CCG__CCG Genome-wide mRNA transcript data (e.g. microarrays)
6
Transcriptional Regulation
Nuclear membrane; binding site/motif CCG__CCG
Learning problems:
–Understand which regulators control which target genes
–Discover motifs representing regulatory elements
7
Previous work: Clustering
Cluster-first motif discovery
–Cluster genes by expression profile, annotation, … to find potentially coregulated genes
–Find overrepresented motifs in promoter sequences of similar genes (algorithms: MEME, Consensus, Gibbs sampler, AlignACE, …)
(Spellman et al. 1998)
8
Previous work: “Structure learning”
Graphical models (and other methods)
–Learn structure of “regulatory network”, “regulatory modules”, etc.
–Fit interpretable model to training data
–Model small number of genes or clusters of genes
–Many computational and statistical challenges; often used for qualitative hypotheses rather than prediction
(Segal et al. 2003, 2004) (Pe’er et al. 2001)
9
Our work: “Predictive modeling”
MEDUSA = Motif Element Discrimination Using Sequence Agglomeration
What is the prediction problem?
–Predict up/down regulation of target genes under different experimental conditions
Key ideas:
–Learn motifs and identify regulators that predict differential expression in different contexts (mechanistic inputs)
–Obtain single model for all genes and all experiments: context-specific, no clusters, no parameter tuning
–Accurate predictions on test data
M. Middendorf, A. Kundaje, M. Shah, Y. Freund, C. Wiggins, C. Leslie. Motif Discovery through Predictive Modeling of Gene Regulation. RECOMB 2005.
10
MEDUSA: Different view of training data
Learn regulatory program that makes genome-wide, context-specific predictions for differential (up/down) expression of target genes
11
MEDUSA – Set up Target gene analysis, important regulators TPK1, USV1, AFR1, XBP1, …
12
Training data – Features label promoter sequence regulator expression feature vector
13
Boosting (Freund & Schapire 1995)
14
distribution over training data
15
Boosting (Freund & Schapire 1995) distribution over training data weak rule Minimize exponential loss function
16
Boosting (Freund & Schapire 1995) distribution over training data weak rule updated weights
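To make the reweighting step on the boosting slides concrete, here is a minimal sketch of one round of standard discrete AdaBoost. It assumes labels and weak-rule predictions in {+1, -1}; the function name `adaboost_round` and the toy data are illustrative and are not taken from the MEDUSA software, which uses weak rules that can also abstain.

```python
import math

def adaboost_round(weights, labels, predictions):
    """One round of discrete AdaBoost (labels and predictions in {+1, -1})."""
    # Weighted error of the weak rule under the current distribution.
    eps = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    # Rule weight that minimizes the exponential loss for this weak rule.
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Misclassified examples gain weight, correctly classified ones lose weight.
    updated = [w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Toy data (hypothetical): five examples, the weak rule gets the second one wrong.
labels = [+1, +1, -1, -1, +1]
predictions = [+1, -1, -1, -1, +1]
weights = [0.2] * 5

alpha, new_weights = adaboost_round(weights, labels, predictions)
```

After the update, the single misclassified example carries half of the total weight, so the next weak rule is pushed to get it right.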
19
MEDUSA’s weak learner …AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT…
20
MEDUSA’s weak learner …AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT… k-mers (k≤7) AGCTATG
21
MEDUSA’s weak learner …AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT… k-mers (k≤7) AGCTATG GCTATGC
22
MEDUSA’s weak learner …AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT… k-mers (k≤7) AGCTATG GCTATGC CTATGCC
24
MEDUSA’s weak learner …AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT… k-mers (k≤7) AGCTATG GCTATGC CTATGCC dimers (gapped elements) TTT_AAA
25
MEDUSA’s weak learner …AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT… k-mers (k≤7) AGCTATG GCTATGC CTATGCC dimers (gapped elements) TTT_AAA GCTA_GCTA
27
MEDUSA’s weak learner …AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT… k-mers (k≤7) AGCTATG GCTATGC CTATGCC dimers (gapped elements) TTT_AAA GCTA_GCTA Regulator expression Is AGCTATG present and USV1 up? Is AGCTATG present and USV1 down? Is GCTATGC present and USV1 up? Is GCTATGC present and TPK1 up? … try all motif-regulator pairs as weak rules …
28
MEDUSA’s weak learner …AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT… k-mers (k≤7) AGCTATG GCTATGC CTATGCC dimers (gapped elements) TTT_AAA GCTA_GCTA Regulator expression Is AGCTATG present and USV1 up? Is AGCTATG present and USV1 down? Is GCTATGC present and USV1 up? Is GCTATGC present and TPK1 up? … try all motif-regulator pairs as weak rules … minimizes boosting loss Is GCTATGC present and USV1 up?
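The search over motif-regulator pairs can be sketched as follows. This is a toy illustration, not the MEDUSA implementation: the promoters, regulator states, and labels are made up, only 7-mers are enumerated (no dimers or PSSMs), and the loss used for an abstaining rule, Z = W0 + 2·sqrt(W+ · W-), is an assumption taken from confidence-rated boosting rather than from the slides.

```python
import math

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def rule_loss(motif, regulator, state, labels, weights, promoters, reg_state):
    """Boosting loss of the rule 'motif present AND regulator in given state'."""
    w_pos = w_neg = w_abstain = 0.0
    for (gene, exp), label in labels.items():
        w = weights[(gene, exp)]
        fires = motif in promoters[gene] and reg_state[exp][regulator] == state
        if not fires:
            w_abstain += w          # rule abstains on this example
        elif label > 0:
            w_pos += w              # rule fires on an up-regulated example
        else:
            w_neg += w              # rule fires on a down-regulated example
    return w_abstain + 2.0 * math.sqrt(w_pos * w_neg)

# Hypothetical toy data: 3 genes, 3 experiments, 2 regulators.
promoters = {"g1": "AGCTATGCC", "g2": "TTTTTTTTT", "g3": "CAGCTATGA"}
reg_state = {
    "e1": {"USV1": 1, "TPK1": 1},
    "e2": {"USV1": 1, "TPK1": -1},
    "e3": {"USV1": -1, "TPK1": 1},
}
labels = {
    ("g1", "e1"): 1, ("g1", "e2"): 1, ("g1", "e3"): -1,
    ("g2", "e1"): -1, ("g2", "e2"): -1, ("g2", "e3"): -1,
    ("g3", "e1"): 1, ("g3", "e2"): 1, ("g3", "e3"): 1,
}
weights = {key: 1.0 / len(labels) for key in labels}

# Try all motif-regulator-state triples as weak rules and keep the best one.
motifs = sorted(set().union(*(kmers(s, 7) for s in promoters.values())))
candidates = [(m, r, s) for m in motifs for r in ("USV1", "TPK1") for s in (1, -1)]
best_rule = min(
    candidates,
    key=lambda c: rule_loss(*c, labels, weights, promoters, reg_state),
)
best_loss = rule_loss(*best_rule, labels, weights, promoters, reg_state)
```

On this toy data the selected rule is "AGCTATG present and USV1 up", because it fires only on up-regulated examples and covers more of them than any other candidate.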
29
Hierarchical sequence agglomeration Is GCTATGC present and USV1 up? Is GCAATGC present and USV1 up? Is TCTATGC present and USV1 up? Is GCTTTGC present and USV1 up? … boosting boosting loss
30
Hierarchical sequence agglomeration GCTATGC GCAATGC GGTATGC CCTAAGC GCTATTT … GGTATGG … PSSMs … Is GCTATGC present and USV1 up? Is GCAATGC present and USV1 up? Is TCTATGC present and USV1 up? Is GCTTTGC present and USV1 up? … boosting boosting loss Agglomerate
31
Hierarchical sequence agglomeration GCTATGC GCAATGC GGTATGC CCTAAGC GCTATTT … GGTATGG … PSSMs … Is GCTATGC present and USV1 up? Is GCAATGC present and USV1 up? Is TCTATGC present and USV1 up? Is GCTTTGC present and USV1 up? … boosting boosting loss Optimize over offsets when merging k-mers/PSSMs: - - GCTATGC GCTATTT - -
32
Hierarchical sequence agglomeration GCTATGC GCAATGC GGTATGC CCTAAGC GCTATTT … GGTATGG … PSSMs … Is GCTATGC present and USV1 up? Is GCAATGC present and USV1 up? Is TCTATGC present and USV1 up? Is GCTTTGC present and USV1 up? … boosting boosting loss
33
Hierarchical sequence agglomeration GCTATGC GCAATGC GGTATGC CCTAAGC GCTATTT … GGTATGG … PSSMs … Is GCTATGC present and USV1 up? Is GCAATGC present and USV1 up? Is TCTATGC present and USV1 up? Is GCTTTGC present and USV1 up? … boosting boosting loss Is [PSSM] present and USV1 up? Is [PSSM] present and USV1 up? …
34
Hierarchical sequence agglomeration GCTATGC GCAATGC GGTATGC CCTAAGC GCTATTT … GGTATGG … PSSMs … Is GCTATGC present and USV1 up? Is GCAATGC present and USV1 up? Is TCTATGC present and USV1 up? Is GCTTTGC present and USV1 up? … boosting boosting loss Is [PSSM] present and USV1 up? Is [PSSM] present and USV1 up? … minimize boosting loss final weak rule
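The idea of optimizing over offsets when merging two k-mers can be illustrated with a small sketch. The scoring used here (number of matching positions in the overlap) and the names `best_offset` and `merge_counts` are assumptions for illustration; the slides do not specify the dissimilarity measure MEDUSA uses when agglomerating k-mers and PSSMs.

```python
from collections import Counter

def best_offset(a, b, max_shift=2):
    """Find the shift of b relative to a that maximizes matching positions."""
    best = None
    for shift in range(-max_shift, max_shift + 1):
        matches = sum(
            1 for i in range(len(a))
            if 0 <= i - shift < len(b) and a[i] == b[i - shift]
        )
        if best is None or matches > best[1]:
            best = (shift, matches)
    return best

def merge_counts(a, b, shift):
    """Stack a and the shifted b into per-column nucleotide counts."""
    start = min(0, shift)
    end = max(len(a), shift + len(b))
    columns = []
    for pos in range(start, end):
        col = Counter()
        if 0 <= pos < len(a):
            col[a[pos]] += 1
        if 0 <= pos - shift < len(b):
            col[b[pos - shift]] += 1
        columns.append(dict(col))
    return columns

# Two overlapping 7-mers from the example promoter sequence.
shift, matches = best_offset("GCTATGC", "CTATGCC")
columns = merge_counts("GCTATGC", "CTATGCC", shift)
```

Normalizing each column of `columns` would give a simple position-specific frequency matrix; repeating the merge on the closest pairs is one way to picture the agglomeration into PSSMs.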
35
MEDUSA strong rule
Combine weak rules into a tree structure
Alternating decision tree = margin-based generalization of decision trees [Freund & Mason 1999]
–Lower nodes are conditionally dependent on higher nodes, which can possibly reveal combinatorial interactions
–Able to reveal motifs specific to subsets of target genes
–Able to learn any boolean function
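How an alternating decision tree turns weak rules into a prediction can be sketched as below. This is a generic ADT evaluator under assumed conventions: every rule node contributes one of two scores, child rules are evaluated only when their parent's outcome matches, and the prediction is the sign of the summed scores. The `RuleNode` class, the scores, and the example are hypothetical and not taken from MEDUSA.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class RuleNode:
    condition: Callable[[dict], bool]      # weak rule evaluated on an example
    score_if_true: float
    score_if_false: float
    children_if_true: List["RuleNode"] = field(default_factory=list)
    children_if_false: List["RuleNode"] = field(default_factory=list)

def adt_score(root_score, nodes, example):
    """Sum the scores of all rule nodes reachable for this example."""
    total = root_score
    stack = list(nodes)
    while stack:
        node = stack.pop()
        if node.condition(example):
            total += node.score_if_true
            stack.extend(node.children_if_true)
        else:
            total += node.score_if_false
            stack.extend(node.children_if_false)
    return total

def has(motif, regulator, state):
    """Build the weak rule 'motif present AND regulator in given state'."""
    return lambda x: motif in x["motifs"] and x["regulators"][regulator] == state

# Hypothetical two-level tree: the second rule is only tested if the first fires.
child = RuleNode(has("TTT_AAA", "TPK1", 1), 0.5, -0.5)
top = RuleNode(has("GCTATGC", "USV1", 1), 1.0, -0.25, children_if_true=[child])

x_up = {"motifs": {"GCTATGC"}, "regulators": {"USV1": 1, "TPK1": -1}}
x_down = {"motifs": {"GCTATGC"}, "regulators": {"USV1": -1, "TPK1": -1}}

score_up = adt_score(0.0, [top], x_up)      # 1.0 (top fires) - 0.5 (child fails)
score_down = adt_score(0.0, [top], x_down)  # -0.25 (top fails, child skipped)
```

The conditional dependence of lower nodes on higher nodes is what lets the tree express combinations such as "motif A with regulator X up, and additionally motif B with regulator Y up".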
36
Yeast Environmental Stress Response
Gasch et al. (2000) dataset: 173 microarrays, 13 environmental stresses
~5500 target genes, 475 regulators (237 TF + 250 SM)
500 bp upstream promoter sequences
Binning into +1/0/-1 expression levels based on wild-type vs. wild-type noise
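The binning into +1/0/-1 can be pictured with a small sketch. It assumes that a threshold is derived from the magnitudes of wild-type vs. wild-type log-ratios and that values within the threshold are treated as unchanged; the percentile rule, the function names, and all numbers are hypothetical, since the slides do not give the exact procedure.

```python
def noise_threshold(noise_log_ratios, percent=80):
    """Pick a threshold from the magnitudes of control (noise) log-ratios."""
    magnitudes = sorted(abs(v) for v in noise_log_ratios)
    index = min(len(magnitudes) - 1, (percent * len(magnitudes)) // 100)
    return magnitudes[index]

def discretize(log_ratio, threshold):
    """Map an expression log-ratio to +1 (up), -1 (down), or 0 (baseline)."""
    if log_ratio > threshold:
        return 1
    if log_ratio < -threshold:
        return -1
    return 0

# Hypothetical wild-type vs. wild-type log-ratios (pure measurement noise).
noise = [0.1, -0.2, 0.05, 0.3, -0.15, 0.25, -0.1, 0.2, -0.05, 0.4]
threshold = noise_threshold(noise)

binned = [discretize(v, threshold) for v in [1.2, -0.8, 0.1, 0.3]]
```

Only the +1 and -1 examples would then serve as up/down training labels, while the same discretization applied to regulator expression gives the "up"/"down" states used in the weak rules.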
37
Statistical validation
10-fold cross-validation (held-out experiments), ~60,000 (gene, experiment) training examples, 700 iterations
(N_k-mers + N_dimers + N_PSSMs) × N_reg × 2 ≈ 10^7 possible weak rules at every node
MEDUSA’s motifs give better prediction accuracy on held-out experiments than database motifs
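Holding out whole experiments, rather than random (gene, experiment) pairs, can be sketched as follows. The fold assignment by striding and the names `experiment_folds` and `split_examples` are illustrative assumptions; only the counts of 173 experiments and 10 folds come from the slides, and the three genes are placeholders.

```python
def experiment_folds(experiments, n_folds=10):
    """Partition experiment ids into n_folds disjoint groups."""
    return [experiments[i::n_folds] for i in range(n_folds)]

def split_examples(examples, held_out):
    """Send every (gene, experiment) pair of a held-out experiment to the test set."""
    held_out = set(held_out)
    train = [(g, e) for g, e in examples if e not in held_out]
    test = [(g, e) for g, e in examples if e in held_out]
    return train, test

experiments = [f"exp{i}" for i in range(173)]   # 173 microarrays
genes = ["g1", "g2", "g3"]                      # placeholder target genes
examples = [(g, e) for g in genes for e in experiments]

folds = experiment_folds(experiments)
train, test = split_examples(examples, folds[0])
```

Because no experiment contributes examples to both sets, test accuracy measures generalization to unseen conditions rather than to unseen genes in familiar conditions.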
38
Yeast ESR: Biological Validation STRE element Universal stress repressor motif
39
Yeast ESR: Biological Validation Important regulators identified by MEDUSA Cellular localization of MSN2/4 Segal et al. 2003 Universal stress repressor
40
Visualizing MEDUSA motifs
[Figure: motif visualization; recoverable labels: AAATTTTAAGGG and numbered labels 1, 2, 3, 5, 8, 14, 16]
41
Biological validation – Context-specific analysis
Restrict regulatory program to particular target genes T and experimental conditions E, giving a smaller model
Further statistical pruning of features using margin-based score
Identify most significant context-specific regulators and motifs for target set
42
Biological validation – Context-specific analysis
Example: oxygen sensing and regulation in yeast (collaborator: Li Zhang)
43
Biological validation – Context-specific analysis
Example: oxygen and heme inducible targets
44
Biological validation – Network inference
Regulator-motif associations in nodes can have different meanings: direct binding, indirect effect, or co-occurrence
Need other data to confirm binding relationship between regulator and target (e.g. ChIP-chip)
Still, can determine statistically significant regulator-target relationships from regulation program
45
Biological validation – Network inference
Example: oxygen sensing and regulatory network
46
Discussion: What does “predictive” mean?
At least 2 usages:
Makes accurate quantitative predictions
–Can assess predictions statistically, i.e. on test data
–Gives us confidence that model contains biologically relevant information
vs.
Generates biological hypotheses
–Without statistical validation, can only evaluate quality of hypotheses through experiments
–Issues: How much of model is correct? How many false positives? Is a network “edge” a meaningful prediction? (Cf. DREAM initiative)
47
Discussion: “Predictive” modeling
“Manifesto”
–We’re interested in hypothesis generation, but still must give statistical validation on test data, i.e. show that you’re not overfitting
–Not enough to show that model is non-random, e.g. good p-values for functional enrichment
Possible goal: move towards making useful predictions for actual wet-lab experiments (e.g. fewer input variables in model)
MEDUSA: statistically predictive model, can still interpret to extract biological hypotheses
48
Ongoing MEDUSA-related projects
Oxygen sensing and regulation in yeast (collaborator: Li Zhang, Public Health @ Columbia)
Regulation of and by microRNAs in humans (collaborators: Sander group, Sloan Kettering)
Sequence information controlling tissue-specific alternative splicing (collaborator: Larry Chasin, Biology @ Columbia)
Integration of phosphorylation (“kinome”) data to reconstruct signaling pathways
New Java MEDUSA software package – soon to be released
http://www.cs.columbia.edu/compbio/medusa
49
Thanks
Manuel Middendorf (Physics), Anshul Kundaje (CS), David Quigley (DBMI), Steve Lianoglou (CS), Xuejing Li (Physics), Mihir Shah (CS), Marta Arias (CCLS), Chris Wiggins (APAM), Yoav Freund (CS@UCSD)
Funding: NIH (MAGNet NCBC grant)
50
Visualizing MEDUSA motifs Pruning based on feature dependence statistic:
51
Biological validation – Binding data
ChIP-chip: genome-wide protein-DNA binding data, i.e. which promoters are bound by a TF?
Investigate regulatory network model: use ChIP-chip data in place of motifs (no motif discovery)
–Features: (regulator, TF-occupancy) pairs
52
Biological validation – Target gene analysis Restrict to target genes = protein chaperones; experiments = heat shock, hypo/hyper-osmolarity –CMK2 with HSF1 occupancy (CaMKII mammalian ortholog interacts with HSF1)
53
Biological validation – Signaling molecules
Find all SMs that associate as regulators with a particular TF’s ChIP occupancy in ADT features
e.g. Hypothesis: Glc7 phosphatase complex interacts with Hsf1 in regulation of Hsf1 targets (interaction supported in literature)
[Figure labels: Hsf1, Gac1, Gip1, Sds22, Glc7 phosphatase complex; legend: TF, SM, mRNA]
54
Update: Protein fold recognition
SVM classifiers with string kernels for remote homology detection, fold recognition
[Figure labels: protein sequence (YPNTDIGDPSYPHIGIDIKSVRSKKTAKWNMQNGK), profile, k-mer based kernel computation, SVM, prediction of structural class]
R. Kuang, E. Ie, K. Wang, K. Wang, M. Siddiqi, Y. Freund, C. Leslie. Profile-based string kernels for remote homology detection and motif extraction. JBCB 2005.
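As a rough illustration of a k-mer based kernel, the sketch below computes the plain spectrum kernel, i.e. the inner product of k-mer count vectors of two sequences. This is a simpler stand-in, not the profile-based kernel of the cited paper, which additionally uses sequence profiles; the function name and the toy sequences are hypothetical.

```python
from collections import Counter

def spectrum_kernel(x, y, k=3):
    """Inner product of the k-mer count vectors of sequences x and y."""
    counts_x = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    counts_y = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    # Only k-mers occurring in both sequences contribute to the sum.
    return sum(c * counts_y[kmer] for kmer, c in counts_x.items())

# Toy amino-acid strings: "IG" and "GI" occur in both with different counts.
value = spectrum_kernel("IGIG", "GIGI", k=2)
```

A kernel matrix built from such values over all pairs of training sequences can be passed to any SVM implementation that accepts precomputed kernels.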
55
SVM-Fold web server (soon to be deployed)