1. Interpreting rich epigenomic datasets


Interpreting chromatin states
[Figure: chromatin states annotated by % genome, conservation, expression, high-CpG and low-CpG TSSs, transcribed regions, TSS, CpG, TES, ZNF, DNase, lamina, and repeats (L1 repeat, Alu repeat).]

How many states are meaningful: agreement between cell types
[Figure: ratio vs. background for the comparisons H1-H9, H9-H1, H1/9-IMR90, IMR90-H1, IMR90-H9, and background.]
Distinctions between cell types remain recoverable even after 40-50 chromatin states (IMR90, H1, H9).

Preferential enhancer-promoter interactions
[Figure panels: IMR90, same-chromosome interactions; IMR90, different chromosomes; H1, same chromosome; H1, different chromosomes. State labels: Transcribed 3', Transcribed 5', Transcribed strong, Transcribed weak, Transcribed enhancer, Enhancer poised, Enhancer active, Strongest enhancer, Strong enhancer, Weak enhancer, Low signal, Heterochromatin, Repressed, Bivalent promoter, Active promoter. State groups: Transcribed, Enhancer, Off, Prom.]
Different enhancer states show different interactions.
Enhancers, transcribed regions, and promoters interact.
Inactive regions show fewer interactions overall (both to active states and to each other).
H3K9me3 states interact between chromosomes in ES cells.

2. Prioritizing experiments

Ever-expanding dimensions of epigenomics
Thousands of whole-genome datasets across chromatin marks and cell types.
Additional dimensions: environment, genotype, disease, gender, stage, age.
Today: cell-type and chromatin-mark dimensions.
Next: personal epigenomes (genotype/phenotype).
Complete matrix of conditions, individuals, alleles.

Prioritize experiments for additional cell types
Two methods (Method 1, Method 2): one based on unique information, one based on chromatin state recovery.
(1) Quantify state recovery using subsets of marks.
(2) Capture additional information from mark intensity.
Beyond marks: trade-offs of more cell types vs. more depth.

Method 1 example: Rank chromatin marks for a new cell type
[Figure: IMR90, using all marks. Axes: mark vs. prediction error^2, ordered from hardest to predict to easiest to predict (redundant).]
Hardest to predict: prioritize these marks?
Hardest marks to predict using all other IMR90 marks: H3K3me3, etc.
These match the marks usually identified as the most useful: a good metric?
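The ranking idea on this slide can be sketched in a few lines: predict each mark from all the others and sort the marks by held-out squared error. This is only an illustration under stated assumptions. The least-squares model, the train/test split by bins, the function names, and the toy marks (mark_a, mark_b, mark_c) are all hypothetical and are not taken from the slides.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def mean_squared_error(rows, y, coef):
    """Mean squared error of the linear prediction on the given bins."""
    preds = [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)


def rank_marks_by_error(signal):
    """signal: dict mark -> per-bin values. Returns (mark, error), hardest first."""
    marks = sorted(signal)
    n = len(signal[marks[0]])
    half = n // 2  # assumption: fit on the first half of bins, test on the rest
    errors = {}
    for target in marks:
        others = [m for m in marks if m != target]
        rows = [[signal[m][i] for m in others] for i in range(n)]
        y = signal[target]
        coef = fit_ols(rows[:half], y[:half])
        errors[target] = mean_squared_error(rows[half:], y[half:], coef)
    return sorted(errors.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (hypothetical): mark_b is nearly redundant with mark_a,
# mark_c is independent of both and therefore hardest to predict.
rng = random.Random(0)
n_bins = 200
mark_a = [rng.gauss(0, 1) for _ in range(n_bins)]
toy_signal = {
    "mark_a": mark_a,
    "mark_b": [2 * v + rng.gauss(0, 0.1) for v in mark_a],
    "mark_c": [rng.gauss(0, 1) for _ in range(n_bins)],
}
ranking = rank_marks_by_error(toy_signal)
for mark, err in ranking:
    print(f"{mark}\t{err:.4f}")
```

In this toy setting the independent mark comes out on top, which is the sense in which a hard-to-predict mark carries information the others do not.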

Method 2 example: Rank additional marks for existing cell type
Extend the IMR90 set beyond the initial 22 marks.
22 marks in common with CD4T data: H2AK5ac, H3K27ac, H3K27me3, H3K9me3, H2BK120ac, H3K4ac, H3K36me3, H4K20me1, H2BK12ac, H3K9ac, H3K4me1, H2BK20ac, H4K5ac, H3K4me2, H3K14ac, H4K8ac, H3K4me3, H3K18ac, H4K91ac, H3K79me1, H3K23ac, H3K79me2.
19 marks only in CD4T data: H2AK9ac, H2BK5me1, H3K9me2, CTCF, H2BK5ac, H3K27me1, H3R2me1, H2AZ, H3K36ac, H3K27me2, H3R2me2, PolII, H4K12ac, H3K36me1, H4K20me3, H4K16ac, H3K79me3, H4R3me2.

3. Completing epigenomes computationally: chromatin mark imputation

Predicting signal for missing marks
Question: Can we predict the signal intensity of one mark given other sets of marks?
Datasets used: H1, IMR90 (+H9, K562, GM12878, HSMM).
Methodological decisions:
  Focus on a common set of marks.
  Downsample one replicate to 10 million reads.
  Split reads equally between training and test data.
  Bin the genome into 2 kb bins.
Model/metrics:
  Use a linear regression model for predictions.
  Use squared-error loss on mark signal as the objective.
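The preprocessing decisions listed above (downsample, split reads in half, count per 2 kb bin) might look roughly like the following. This is a sketch, not the pipeline used for the slides: reads are assumed to be simple (chromosome, position) tuples, the function names are made up for illustration, and the toy run downsamples to 600 reads rather than 10 million.

```python
import random
from collections import Counter

BIN_SIZE = 2000  # 2 kb bins, as stated on the slide


def downsample(reads, target, rng):
    """Randomly keep at most `target` reads."""
    if len(reads) <= target:
        return list(reads)
    return rng.sample(reads, target)


def split_reads(reads, rng):
    """Shuffle reads and split them into two equal halves (training, test)."""
    shuffled = list(reads)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:2 * half]


def bin_counts(reads, bin_size=BIN_SIZE):
    """Count reads per (chromosome, bin index)."""
    return Counter((chrom, pos // bin_size) for chrom, pos in reads)


# Toy run (hypothetical data): 1000 reads on a 20 kb stretch of chr1.
rng = random.Random(42)
toy_reads = [("chr1", rng.randrange(20000)) for _ in range(1000)]
kept = downsample(toy_reads, 600, rng)
train_reads, test_reads = split_reads(kept, rng)
train_bins = bin_counts(train_reads)
test_bins = bin_counts(test_reads)
print(sorted(train_bins.items())[:3])
```

Splitting at the read level means training and test signal are independent samples of the same experiment, so a model cannot score well simply by memorizing sampling noise.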

Eg: Predicting H3K9ac signal
Linear model coefficients (mark, coefficient), in the order extracted; marks listed without a number have no recoverable value:
H3K56ac 0.32, H3K4me3 0.29, H3K4ac 0.22, H3K4me2 0.15, H3K27ac 0.14, H2AK5ac, H4K8ac, H3K23ac 0.13, H3K14ac, H3K79me2 0.12, H4K5ac 0.06, H3K36me3 0.04, H4K91ac 0.01, H3K4me1 -0.01, H3K18ac, H3K27me3 -0.02, H4K20me1 -0.04, H2BK120ac -0.05, H3K9me3, Input -0.07, H2BK15ac -0.1, H3K79me1 -0.15, H2BK12ac, H2BK20ac -0.22, Intercept -0.16
[Figure: H3K9ac predicted vs. H3K9ac true.]
How good is the prediction?
How similar is it to other marks?
How does it compare to a biological replicate?
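One simple way to approach the three questions on this slide is to compare correlations against the true signal: predicted vs. true, replicate vs. true, and each other mark vs. true. The sketch below assumes Pearson correlation as the similarity measure, which the slides do not specify; the function names and all toy values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def evaluate_prediction(predicted, true_signal, replicate, other_marks):
    """Correlate the true signal with the prediction, a replicate, and other marks."""
    report = {
        "predicted": pearson(predicted, true_signal),
        "replicate": pearson(replicate, true_signal),
    }
    for name, signal in other_marks.items():
        report[name] = pearson(signal, true_signal)
    return report


# Toy per-bin values (hypothetical, not taken from the slides).
true_signal = [0.1, 0.9, 2.5, 0.3, 1.8, 0.2]
predicted = [0.2, 1.1, 2.0, 0.4, 1.5, 0.3]
replicate = [0.0, 1.0, 2.6, 0.2, 1.7, 0.4]
other_marks = {"some_other_mark": [1.0, 0.8, 0.9, 1.2, 0.7, 1.1]}

report = evaluate_prediction(predicted, true_signal, replicate, other_marks)
for name, r in report.items():
    print(f"{name}\t{r:.3f}")
```

A prediction whose correlation with the true signal approaches the replicate-to-replicate correlation is about as good as repeating the experiment, which makes the replicate a natural ceiling for the comparison.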

Impute missing datasets / predict new cell types
Predict a missing mark from many others.
Predict many marks in a new cell type.
Prediction of K27ac, K9ac, K4me1… in GM from DNase.
Prediction of H3K4me1 from DNase across cell types.
Use mark correlations to predict missing datasets as matrices become denser.
Applications: (1) Prediction in difficult-to-access conditions. (2) Detecting failed experiments/replicates. (3) Finding unexpected differences between predicted and raw signal.

4. Allele-specific chromatin marks

Known imprinted genes confirm allele-specific methodology
Map reads to phased GM12878 haplotypes.
Count maternal vs. paternal reads.
Validation: known imprinted genes are allelic; X-inactivation: only one chromosome.
Requires sufficient SNPs and sufficient reads for significance.
Discover allelic genes genome-wide -> aggregate by gene / chromatin state.
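The counting step above can be illustrated with an exact two-sided binomial test against a 50/50 null, after summing read counts over the SNPs of each gene. The slides do not say which test was used, so the binomial test, the summing of counts across SNPs (which treats reads as independent), and the gene names below are all assumptions made for this sketch.

```python
from math import comb


def binomial_two_sided_p(maternal, paternal):
    """Exact two-sided binomial p-value for a 50/50 split of allelic reads."""
    n = maternal + paternal
    if n == 0:
        return 1.0
    k = min(maternal, paternal)
    # The null distribution is symmetric at p = 0.5, so double the smaller tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def aggregate_by_gene(snp_counts):
    """snp_counts: iterable of (gene, maternal_reads, paternal_reads) per SNP."""
    totals = {}
    for gene, mat, pat in snp_counts:
        m, p = totals.get(gene, (0, 0))
        totals[gene] = (m + mat, p + pat)
    return totals


# Toy per-SNP counts (hypothetical genes and numbers).
snp_counts = [
    ("geneA", 1, 14),
    ("geneA", 0, 11),
    ("geneB", 9, 8),
    ("geneB", 6, 7),
]
for gene, (mat, pat) in sorted(aggregate_by_gene(snp_counts).items()):
    p_value = binomial_two_sided_p(mat, pat)
    print(f"{gene}\tmaternal={mat}\tpaternal={pat}\tp={p_value:.3g}")
```

The example also shows why the slide asks for sufficient SNPs and reads: with only a handful of reads even a fully one-sided split cannot reach a small p-value.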

Allelic activity supported by many marks, Pol2, TFs
Includes X-inactivated paternal chromosome genes.

Genome-wide correlations for pairs of marks
Aggregate signal across chromatin states.
Active marks are positively correlated.
H3K27me3 is negatively correlated.
Zoom in on individual examples.

Active/repressive marks on paternal/maternal alleles
Active transcription of the paternal chromosome.
Repressive marks on the maternal chromosome.
Pol2 reads on the paternal chromosome.
Strong repressive signal (K27me3): reads mostly maternal.
Strong active signal (K79me2 tx): reads mostly paternal.

Allele-specific chromatin marks: cis-vs-trans effects
Maternal and paternal GM12878 genomes sequenced.
Map reads to the phased genome, handling SNPs and indels.
Correlate activity changes with sequence differences.

5. Linking enhancers to promoters using many cell types

Power should increase with additional cell types
[Figure: chromatin state and gene expression.]
The chance of spurious correlation decreases.

Power to predict links increases with more cell types
True enhancers show an excess of high correlation.
Can estimate the number of non-random links at any FDR.
The number of non-random links increases linearly with the number of cell types.
30 cell types: 15,000 links.
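A hedged sketch of how such an estimate could be made: score each candidate enhancer-gene pair by its correlation across cell types, build a null by shuffling the cell-type order of the enhancer activity, and compare the counts above a threshold. The permutation null, the threshold, the function names, and the toy enhancers and genes are assumptions for illustration; the slides do not describe the actual procedure.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation (0.0 if either vector is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def link_scores(enhancers, genes, pairs):
    """Correlation across cell types for each candidate (enhancer, gene) pair."""
    return {(e, g): pearson(enhancers[e], genes[g]) for e, g in pairs}


def estimate_fdr(enhancers, genes, pairs, threshold, n_perm, rng):
    """Return (observed links, expected null links, estimated FDR) at a threshold."""
    scores = link_scores(enhancers, genes, pairs)
    observed = sum(1 for r in scores.values() if r >= threshold)
    n_cell_types = len(next(iter(genes.values())))
    null_total = 0
    for _ in range(n_perm):
        # Null: break the cell-type correspondence between enhancers and genes.
        order = list(range(n_cell_types))
        rng.shuffle(order)
        shuffled = {e: [v[i] for i in order] for e, v in enhancers.items()}
        null_scores = link_scores(shuffled, genes, pairs)
        null_total += sum(1 for r in null_scores.values() if r >= threshold)
    expected_null = null_total / n_perm
    fdr = min(1.0, expected_null / observed) if observed else 0.0
    return observed, expected_null, fdr


# Toy data (hypothetical): e1 tracks g1 exactly, e2 and g2 are random.
rng = random.Random(1)
n_types = 8
genes = {
    "g1": [float(i) for i in range(n_types)],
    "g2": [rng.random() for _ in range(n_types)],
}
enhancers = {
    "e1": [2.0 * v + 1.0 for v in genes["g1"]],
    "e2": [rng.random() for _ in range(n_types)],
}
pairs = [(e, g) for e in enhancers for g in genes]
scores = link_scores(enhancers, genes, pairs)
observed, expected_null, fdr = estimate_fdr(
    enhancers, genes, pairs, threshold=0.9, n_perm=200, rng=rng
)
print(observed, expected_null, fdr, observed - expected_null)
```

The last printed value, observed minus expected null links, is the kind of "number of non-random links" the slide refers to. With more cell types each correlation is computed over a longer vector, so high correlations become rarer under the null.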

Visualizing 10,000s of predicted enhancer-gene links
Overlapping regulatory units, both few and many.
Both upstream and downstream elements are linked.
Enhancers correlate with sequence constraint.

6. Disease enrichments across 1000s of enhancers

Full T1D association spectrum -> 1000s of causal SNPs
Rank all SNPs by P-value.
Find chromatin states with enrichment in high ranks.
Signal spans 1000s of SNPs.
GM12878 enhancer enrichment now seen.
[Figure legend: GM12878 (lymphoblastoid), K562 (myelogenous leukemia).]
Cell-type specific: GM and K562 enhancers.
Chromatin-state specific: enhancers/promoters.
Could bias in array design contribute to these enrichments? -> Evaluate all 1000 Genomes SNPs by imputing those in LD.
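The rank-based enrichment described above can be sketched as a fold enrichment: the fraction of top-ranked SNPs that fall in a chromatin state, divided by that fraction among all SNPs. The fold-enrichment statistic, the function name, and the toy SNP list are assumptions for illustration; the slides do not state which enrichment statistic was used.

```python
def enrichment_in_top(snps, state, top_n):
    """Fold enrichment of `state` among the top_n SNPs ranked by P-value.

    snps: list of (p_value, chromatin_state) tuples.
    """
    ranked = sorted(snps, key=lambda s: s[0])  # smallest P-value first
    top = ranked[:top_n]
    frac_top = sum(1 for _, s in top if s == state) / len(top)
    frac_all = sum(1 for _, s in ranked if s == state) / len(ranked)
    if frac_all == 0:
        return 0.0
    return frac_top / frac_all


# Toy SNPs (hypothetical P-values and state labels).
toy_snps = [
    (0.001, "enhancer"), (0.002, "enhancer"), (0.01, "other"),
    (0.2, "other"), (0.3, "enhancer"), (0.5, "other"),
    (0.6, "other"), (0.7, "other"), (0.8, "other"), (0.9, "other"),
]
fold_top2 = enrichment_in_top(toy_snps, "enhancer", top_n=2)
fold_all = enrichment_in_top(toy_snps, "enhancer", top_n=len(toy_snps))
print(fold_top2, fold_all)
```

Scanning top_n from small to large shows how far down the ranked list the enrichment persists, which is the sense in which the signal can span thousands of SNPs rather than only the genome-wide significant ones.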

Imputing SNPs in LD -> stronger cell/state separation
[Figure panels: enhancers across cell types; chromatin states in GM12878.]
Enhancers: 2049 (excess 392)
1940 distinct loci (R^2<.8)
Promoters: 462 (excess 81)
Transcribed: 4740 (excess 522)
Repressed: 1351 (excess 76)
Insulator: 240 (excess 23)
Other: 21k (deplete 1093)
Excess of 30,000 SNPs -> 2049 enhancers (excess 392).
Mostly found in independent loci (1730 with R^2<0.2) -> systematically measure their regulatory contributions.
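As a small worked example, the counts on this slide can be turned into expected counts and fold enrichments. This assumes that "excess" means observed minus expected, which the slide does not state explicitly; the "Other" row is left out because its count is only given as a rounded 21k.

```python
# Observed SNP counts and excess per chromatin state, as listed on the slide.
observed_and_excess = {
    "Enhancers": (2049, 392),
    "Promoters": (462, 81),
    "Transcribed": (4740, 522),
    "Repressed": (1351, 76),
    "Insulator": (240, 23),
}

# Assumption: excess = observed - expected, so expected = observed - excess.
summary = {}
for state, (observed, excess) in observed_and_excess.items():
    expected = observed - excess
    summary[state] = (expected, observed / expected)

for state, (expected, fold) in summary.items():
    print(f"{state}\texpected={expected}\tfold={fold:.2f}")

total_excess = sum(excess for _, excess in observed_and_excess.values())
print("total excess in listed states:", total_excess)
```

Under that reading, the total excess in the listed states (1094) is close to the depletion reported for "Other" (1093), as one would expect if the states partition a fixed set of SNPs.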