Practical Guide to the (mod)ENCODE project, February 27, 2013.

Fundamental Goals
- Improve comprehensiveness and accuracy of gene annotation
- Define novel protein-coding and noncoding gene products, including variants
- Define noncoding regulatory elements, including both sequence and epigenetic features
- Begin to measure the extent of tissue-specific deployment of functional elements

Rationale for the Consortium
- Synergistic expertise of large groups
- Coordinated sample and data collection procedures
- Systematic data analysis
- Rapid release of the data to the public
- Common data repository

History and Relationship of ENCODE Projects (U.S. National Human Genome Research Institute)
- pilot human ENCODE (1% of genome)
- modENCODE (100% of genome): C. elegans, Drosophila
- human ENCODE scale-up (100% of genome)
Groups and focus areas:
- Henikoff (histone replacement)
- Waterston/Celniker (transcribed elements)
- Piano/Lai (3’ UTR elements)
- Snyder/White (TF binding sites)
- Lieb/Karpen (chromatin function)
- ??

Model organism advantages…
- Compact, well-annotated “simpler” genome
- Functional elements can be identified in vivo
- Experimental advantages for both generating and interpreting genomic data
…and disadvantages
- Not human
- Most studies performed in whole animals

Publications of the “half-way point” in Science, Dec 2010
- 237 C. elegans datasets and >700 Drosophila datasets
- Verified data available at modENCODE

Defining the transcriptome
Extract total RNA, mRNA, and small RNAs from samples taken at distinct developmental stages and conditions.
Stages and conditions shown: early embryo, L1, L2, L3, L4, adult hermaphrodite, late embryo, L4 male, dauer

C. elegans transcriptome features and alternative splicing
M B Gerstein et al. Science 2010;330:
- stage-specific isoforms
- fractional differences in isoform composition for 12,875 genes in pair-wise comparison across seven developmental stages
- stage-specific pseudogene expression
- increase in splice junction confirmation
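The pair-wise isoform comparison can be illustrated with a minimal sketch. The scoring used here (largest absolute change in any isoform's share of the gene's reads) is an assumption chosen for illustration, not necessarily the measure used in the paper, and the stage names and counts are made up.

```python
from itertools import combinations

def isoform_fractions(counts):
    """Convert per-isoform read counts to fractions of the gene total."""
    total = sum(counts.values())
    if total == 0:
        return {iso: 0.0 for iso in counts}
    return {iso: n / total for iso, n in counts.items()}

def max_fraction_difference(counts_a, counts_b):
    """Largest absolute change in any isoform's fraction between two stages.

    Assumed scoring for illustration only.
    """
    fa, fb = isoform_fractions(counts_a), isoform_fractions(counts_b)
    isoforms = set(fa) | set(fb)
    return max(abs(fa.get(i, 0.0) - fb.get(i, 0.0)) for i in isoforms)

def pairwise_differences(stage_counts):
    """Score every unordered pair of stages for one gene."""
    return {
        (a, b): max_fraction_difference(stage_counts[a], stage_counts[b])
        for a, b in combinations(stage_counts, 2)
    }

# Made-up counts for one hypothetical gene with two isoforms.
example = {
    "embryo": {"iso1": 90, "iso2": 10},
    "L1": {"iso1": 50, "iso2": 50},
    "adult": {"iso1": 10, "iso2": 90},
}
diffs = pairwise_differences(example)

# Seven stages give this many unordered stage pairs per gene.
n_pairs_for_seven_stages = len(list(combinations(range(7), 2)))
```

With seven stages, each gene is compared across 21 stage pairs, which is why a per-gene summary score of this kind is convenient.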

Drosophila coding and noncoding genes and structures
Roy et al. Science 2010;330:
- combine RNA-seq data with conserved structures
- novel miRNA found in protein-coding exon
- male-specific expression

Tagging (worm) vs endogenous (fly) TF-ChIP
Worm (tagging):
- Create GFP-tagged transcription factor fosmids by recombineering
- Generate transgenic lines by microparticle bombardment
- Characterize expression and culture large-scale preps
Fly (endogenous):
- Generate antibodies to proteins of interest
- Characterize sensitivity and specificity
- Culture large-scale preps
Both:
- Perform ChIP-seq, define binding sites, and analyze data

C. elegans Highly Occupied Target (HOT) Regions
M B Gerstein et al. Science 2010;330:
- TFs -> 304 HOT regions with 15+ TFs
- HOT regions tend to be at the promoters of broadly expressed genes
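As a rough sketch of the "15+ TFs" criterion, the snippet below counts distinct factors per region and keeps regions at or above the threshold. It assumes binding calls have already been assigned to predefined regions (the real analysis has to derive regions from overlapping peaks first); the region and factor names are hypothetical.

```python
from collections import defaultdict

HOT_THRESHOLD = 15  # "15+ TFs", taken from the slide

def find_hot_regions(binding_calls, threshold=HOT_THRESHOLD):
    """Return region ids bound by at least `threshold` distinct factors.

    binding_calls: iterable of (region_id, tf_name) pairs.
    """
    tfs_by_region = defaultdict(set)
    for region, tf in binding_calls:
        tfs_by_region[region].add(tf)  # a set ignores repeated calls
    return sorted(r for r, tfs in tfs_by_region.items() if len(tfs) >= threshold)

# Hypothetical input: regionA is bound by 16 factors, regionB by 3.
calls = [("regionA", f"TF{i}") for i in range(16)]
calls += [("regionB", f"TF{i}") for i in range(3)]
calls += [("regionA", "TF0")]  # duplicate call, counted once
hot = find_hot_regions(calls)
```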

Discovery and characterization of chromatin states and their functional enrichments in Drosophila
Roy et al. Science 2010;330:
- discrete -> 9 continuous chromatin states

Statistical models predicting TF-binding and gene expression from chromatin features in C. elegans
M B Gerstein et al. Science 2010;330:
- color represents accuracy of a statistical model in which one or more chromatin features act as predictors for TF binding/HOT regions
- an example
- Spearman correlation coefficient of each chromatin feature with expression levels
- chromatin-based predictions for expression of both coding genes (top) and miRNAs (bottom)
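For readers unfamiliar with the Spearman coefficient mentioned above, here is a small standard-library sketch: it ranks both variables (averaging ranks for ties) and takes the Pearson correlation of the ranks. The signal and expression values are invented for illustration and are not from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented values: one chromatin feature's signal and expression for five genes.
signal = [0.2, 1.5, 3.1, 0.9, 4.8]
expression = [1.0, 8.0, 20.0, 9.0, 300.0]
rho = spearman(signal, expression)
```

Because only ranks matter, the very large expression value for the last gene does not dominate the result the way it would in a plain Pearson correlation.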

Predictive models of regulator, region, and gene activity in Drosophila
Roy et al. Science 2010;330:
- DREM: Dynamic Regulatory Events Miner
- predicting target gene expression from regulator expression
- predicting cell-type-specific regulators of chromatin activity

Human (and mouse) ENCODE PLoS Biol 9:e , 2011

ENCODE methods and organization PLoS Biol 9:e , 2011

Selected cell lines PLoS Biol 9:e , 2011

Standardized data collection and processing
- cell growth conditions
- antibody characterization
- requirements for controls
- requirements for replicates
- assessment of reproducibility
- data submission formats

Caveats
- assays are performed on unsynchronized cell populations
- several of the cell lines are karyotypically unstable
- some Tier 3 lines could be of heterogeneous composition
- mappability in the human genome is variable, and repetitive sequences (~15% of the genome) are not currently included
- confidence in the assigned function varies across the different types of elements
- data types lacking focal enrichment (spread over broad regions) could have variation across the enriched domain

Programs utilized for data analysis PLoS Biol 9:e , 2011

Location of data sources PLoS Biol 9:e , 2011

Exploring the ENCODE analysis

Companion Papers
In the same issue of Nature (6 September 2012):
- Landscape of transcription in human cells. Djebali, S., Davis, C.A. et al.
- The accessible chromatin landscape of the human genome. Thurman, R.E., Rynes, E., Humbert, R. et al.
- An expansive human regulatory lexicon encoded in transcription factor footprints. Neph, S., Vierstra, J., Stergachis, A.B., Reynolds, A.P. et al.
- Architecture of the human regulatory network derived from ENCODE data. Gerstein, M.B., Kundaje, A., Hariharan, M., Landt, S.G., Yan, K.K. et al.
- The long-range interaction landscape of gene promoters. Sanyal, A., Lajoie, B.R. et al.
In Genome Biology (6 September 2012):
- Analysis of variation at transcription factor binding sites in Drosophila and humans. Spivakov, M. et al.
- Cell type-specific binding patterns reveal that TCF7L2 can be tethered to the genome by association with GATA3. Frietze, S. et al.
- Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription related factors. Yip, K.Y. et al.
- Functional analysis of transcription factor binding sites in human promoters. Whitfield, T.W. et al.
- Modeling gene expression using chromatin features in various cellular contexts. Dong, X. et al.
- The GENCODE pseudogene resource. Pei, B. et al.

Companion Papers
In Genome Research (6 September 2012):
- Annotation of functional variation in personal genomes using RegulomeDB. Boyle, A.P. et al.
- ChIP-seq guidelines and practices used by the ENCODE and modENCODE consortia. Landt, S.G. et al.
- Deep sequencing of subcellular RNA fractions shows splicing to be predominantly co-transcriptional in the human genome but inefficient for lncRNAs. Tilgner, H. et al.
- Discovery of hundreds of mirtrons in mouse and human small RNA data. Ladewig, E. et al.
- GENCODE: The reference human genome annotation for the ENCODE project. Harrow, J. et al.
- Linking disease associations with regulatory information in the human genome. Schaub, M.A. et al.
- Long noncoding RNAs are rarely translated in two human cell lines. Bánfai, B. et al.
- Sequence and chromatin determinants of cell-type–specific transcription factor binding. Arvey, A. et al.
- Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Wang, J. et al.
- Combining RT-PCR-seq and RNA-seq to catalog all genic elements encoded in the human genome. Howald, C. et al.
- Personal and population genomics of human regulatory variation. Vernot, B. et al.
- Predicting cell-type–specific gene expression from regions of open chromatin. Natarajan, A. et al.
- RNA editing in the human ENCODE RNA-seq data. Park, E. et al.

GENCODE
- GENCODE is a combined manual and automated curation of genes
- annotation is verified by RT-PCR and RACE experiments
- v7: 20,687 protein-coding genes with, on average, 6.3 alternatively spliced transcripts (3.9 different protein-coding transcripts) per locus
Harrow et al., 2012; Frankish et al., Genome Research 2012

TF mapping by ChIP-seq across 72 cell lines
- data is organized in “Factorbook”
ENCODE Project Consortium, Nature 489: 57-74, 2012

Chromatin accessibility mapping
- DNase-seq: 2.89 million unique, non-overlapping DNase I hypersensitive sites (DHSs) in 125 cell types
- FAIRE: 4.8 million sites across 25 cell types that displayed reduced nucleosomal crosslinking, many of which coincide with DHSs
- RRBS (DNA methylation): an average of 1.2 million CpGs in each of 82 cell lines and tissues (8.6% of non-repetitive genomic CpGs), including CpGs in intergenic regions, proximal promoters and intragenic regions (gene bodies)
ENCODE Project Consortium, Nature 489: 57-74, 2012
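How one might check which FAIRE sites coincide with DHSs can be sketched as a simple interval-overlap test. This is an illustrative sketch, not the consortium's pipeline: it assumes BED-style half-open coordinates, relies on the DHS set being non-overlapping (as stated on the slide), and uses made-up coordinates.

```python
from bisect import bisect_right
from collections import defaultdict

def index_by_chrom(intervals):
    """Group (chrom, start, end) half-open intervals by chromosome, sorted by start."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    for sites in by_chrom.values():
        sites.sort()
    return by_chrom

def overlaps_any(index, chrom, start, end):
    """True if [start, end) overlaps an indexed site.

    Assumes indexed sites on a chromosome do not overlap each other, so only
    the last site starting before `end` needs to be checked.
    """
    sites = index.get(chrom, [])
    pos = bisect_right(sites, (end - 1, float("inf")))
    return pos > 0 and sites[pos - 1][1] > start

# Made-up coordinates for illustration.
dhs = [("chr1", 100, 250), ("chr1", 400, 550), ("chr2", 10, 160)]
faire = [("chr1", 240, 300), ("chr1", 300, 400), ("chr2", 0, 10), ("chr2", 150, 200)]

dhs_index = index_by_chrom(dhs)
coinciding = [site for site in faire if overlaps_any(dhs_index, *site)]
fraction_coinciding = len(coinciding) / len(faire)
```

Sites that merely touch (one ends where the other starts) are not counted as overlapping under the half-open convention.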

Histone modification mapping 12 histone modifications and variants in 46 cell types, including a complete matrix of eight modifications across tier 1 and tier 2.

Modelling transcription levels from histone modification and transcription-factor-binding patterns
Panels: histone modifications; TFs
ENCODE Project Consortium, Nature 489: 57-74, 2012

Patterns and asymmetry of chromatin modification at transcription-factor-binding sites
- histone modifications show asymmetric patterns across TFBS
ENCODE Project Consortium, Nature 489: 57-74, 2012

Co-association between transcription factors
ENCODE Project Consortium, Nature 489: 57-74, 2012

Integration of ENCODE data by genome-wide segmentation
ENCODE Project Consortium, Nature 489: 57-74, 2012

Label  Description
CTCF   CTCF-enriched element
E      Predicted enhancer
PF     Predicted promoter flanking region
R      Predicted repressed or low-activity region
TSS    Predicted promoter region including TSS
T      Predicted transcribed region
WE     Predicted weak enhancer or open chromatin cis-regulatory element
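When working with segmentation tracks it can help to keep the label key in code. The sketch below stores the seven labels from the table and totals the bases assigned to each; the (chrom, start, end, label) tuple layout and the example segments are assumptions for illustration, not the format of any particular ENCODE file.

```python
SEGMENT_LABELS = {
    "CTCF": "CTCF-enriched element",
    "E": "Predicted enhancer",
    "PF": "Predicted promoter flanking region",
    "R": "Predicted repressed or low-activity region",
    "TSS": "Predicted promoter region including TSS",
    "T": "Predicted transcribed region",
    "WE": "Predicted weak enhancer or open chromatin cis-regulatory element",
}

def bases_per_label(segments):
    """Total bases per label for half-open (chrom, start, end, label) segments."""
    totals = {label: 0 for label in SEGMENT_LABELS}
    for chrom, start, end, label in segments:
        if label not in SEGMENT_LABELS:
            raise ValueError(f"unknown segmentation label: {label!r}")
        totals[label] += end - start
    return totals

# Made-up segments for illustration.
segments = [
    ("chr1", 0, 200, "TSS"),
    ("chr1", 200, 1200, "T"),
    ("chr1", 1200, 1500, "E"),
    ("chr2", 50, 450, "T"),
]
totals = bases_per_label(segments)
```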

High-resolution segmentation of ENCODE data by self-organizing maps (SOM)
ENCODE Project Consortium, Nature 489: 57-74, 2012

Allele-specific ENCODE elements
Panels: single genes; ChromHMM segments
ENCODE Project Consortium, Nature 489: 57-74, 2012

Examining ENCODE elements on a per individual basis in the normal and cancer genome

Comparison of genome-wide-association-study-identified loci with ENCODE data

UCSC browser

Browser interface (PLoS Biol 9:e ,): follow the Genome Browser link
- both hg18 and hg19 genome versions are available and worth viewing
- hg18 has the “Integrated Regulation Track” on by default, while hg19 has newer and more datasets

UCSC browser visualization of ENCODE data
- novel independent transcript in the first intron of TP53
- session includes proteogenomics data in conjunction with ENCODE gene, transcriptome and regulatory data sets

Roadmap Epigenomics Project
- next-generation sequencing technologies to map DNA methylation, histone modifications, chromatin accessibility and small RNA transcripts in stem cells and primary ex vivo tissues selected to represent the normal counterparts of tissues and organ systems frequently involved in human disease
- rapid release of raw sequence data, profiles of epigenomics features and higher-level integrated maps to the scientific community
- development, standardization and dissemination of protocols, reagents and analytical tools to enable the research community to utilize, integrate and expand upon this body of data

Epigenomics Data

Databases, data visualization, and access modENCODE: ENCODE: Epigenomics RoadMap: