Monica C. Sleumer ( 苏漠 ) 2012-09-19. Human Genome 3,101,804,739 base pairs 22 chromosomes plus X and Y 21,224 protein-coding genes 15,952 ncRNA genes.


1 Monica C. Sleumer ( 苏漠 ) 2012-09-19

2 Human Genome 3,101,804,739 base pairs 22 chromosomes plus X and Y 21,224 protein-coding genes 15,952 ncRNA genes 3–8% of bases are under selection – From comparative genomic studies Question: What is the genome doing?
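To put the 3–8% figure in absolute terms, the short calculation below converts it into base pairs using the genome size quoted on the slide. It is plain arithmetic, not an analysis step from the talk; the function name is illustrative. The result is roughly 93 to 248 million bases.

```python
GENOME_BP = 3_101_804_739  # total base pairs, as quoted on the slide


def bases_for_percent(percent: int, genome_bp: int = GENOME_BP) -> int:
    """Whole number of bases making up `percent` percent of the genome."""
    return genome_bp * percent // 100


low, high = bases_for_percent(3), bases_for_percent(8)
print(f"{low:,} to {high:,} bases under selection")
```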

3 Objectives Find all functional elements – Bound by specific proteins – Transcribed – Histone modifications – DNA methylation Use this information to annotate functional regions – Genes (coding and non-coding) – Promoters – Enhancers – Specific transcription factor binding sites – Silencers – Insulators – Chromatin states Cross-reference data from other studies – Comparative genomics – 1000 Genomes Project – Genome-wide association studies (GWAS)

4 ENCODE projects ENCODE pilot project: 1% of the genome 2003-2007 modENCODE: Drosophila and C. elegans ENCODE main project 2007-2012 – 1649 dataset-generating experiments – 147 cell types – 235 antibodies and assay protocols – 450 authors – 32 institutes 31 publications 2012-09-06 – 6 in Nature – 18 in Genome Research – 6 in Genome Biology – 1 in BMC Genetics www.nature.com/encode/category/research-papers

5 Materials 147 types of human cell lines, 3 priority levels Tier 1 cell lines: top priority for all experiments Tier 2 cell lines to be done after Tier 1 (next slide) Tier 3: any other cell lines
Name | Description | Lineage | Tissue | Karyotype
GM12878 | B-lymphocyte, lymphoblastoid, Epstein-Barr Virus, 1000 Genomes Project | mesoderm | blood | normal
H1-hESC | embryonic stem cells | inner cell mass | embryonic stem cell | normal
K562 | leukemia, 53-year-old female with chronic myelogenous leukemia | mesoderm | blood | cancer

6 Tier 2 Cell Lines http://encodeproject.org/ENCODE/cellTypes.html

7 Methods
RNA-Seq | Different fractions of RNA -> sequencing
CAGE | 5’ capped RNA sequencing
RNA-PET | Sequencing 5’ cap plus poly-A tail
ChIP-seq | Chromatin immunoprecipitation of a DNA binding protein -> sequencing
DNase-seq | Cut exposed DNA with DNase I -> sequencing
FAIRE-seq | Nucleosome-depleted DNA -> sequencing
RRBS | Bisulphite treatment: unmethylated C->U -> sequencing
3C, 5C, ChIA-PET | Chromatin interactions -> sequencing
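The RRBS entry relies on bisulphite turning unmethylated C into U (read as T after amplification) while methylated C is protected. The toy sketch below simulates that on a single strand and then calls methylation by comparing the read with the reference. The sequence, the methylated position and both function names are made up for illustration; real pipelines also handle the reverse strand, alignment and sequencing errors.

```python
def bisulfite_convert(seq: str, methylated: set[int]) -> str:
    """Simulate bisulphite treatment plus amplification on one strand.

    Unmethylated C is read as T; methylated C (index in `methylated`) stays C.
    """
    out = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def call_methylation(reference: str, read: str) -> dict[int, bool]:
    """For every C in the reference, report True if the read still shows C."""
    return {i: read[i] == "C" for i, base in enumerate(reference) if base == "C"}


reference = "ACGTCCGA"          # hypothetical reference fragment
read = bisulfite_convert(reference, methylated={1})
calls = call_methylation(reference, read)
print(read, calls)
```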

8 Results: RNA Sequencing 62% of the genome is transcribed into sequences >200 bp long – 5.5% of this is exon – 31% is intergenic – no annotated gene – Remaining: intronic CAGE-seq: 62,403 TSS – 44% within 100bp of the 5’ end of a GENCODE gene – Others: exons and 3’ UTRs, significance unknown Lots of short ncRNAs: tRNA, miRNA, snRNA etc. Further description: Wu Dingming, 9:30
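The "within 100bp of the 5’ end of a GENCODE gene" figure is a nearest-neighbour distance check. A minimal sketch of how such a fraction could be computed is shown below, assuming all positions sit on one chromosome and strand. The coordinates and the function name are hypothetical, and this is not the ENCODE pipeline itself.

```python
from bisect import bisect_left


def fraction_near_annotation(cage_tss, gene_starts, max_dist=100):
    """Fraction of CAGE TSS positions within max_dist bp of an annotated 5' end.

    `cage_tss` must be non-empty; both inputs are plain integer positions.
    """
    starts = sorted(gene_starts)
    near = 0
    for pos in cage_tss:
        i = bisect_left(starts, pos)
        # Only the annotated starts on either side of pos can be the nearest.
        neighbours = starts[max(i - 1, 0):i + 1]
        if any(abs(pos - s) <= max_dist for s in neighbours):
            near += 1
    return near / len(cage_tss)


gene_starts = [1_000, 5_000, 9_000]            # hypothetical annotated 5' ends
cage_tss = [950, 1_100, 3_000, 5_050, 8_000]   # hypothetical CAGE TSS calls
frac = fraction_near_annotation(cage_tss, gene_starts)
print(f"{frac:.0%} of CAGE TSS are near an annotated 5' end")
```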

9 Results: Transcribed and protein-coding regions GENCODE reference gene set – 20,687 protein-coding genes: 6.3 alternatively spliced transcripts on average; 3.9 protein isoforms on average; protein-coding exons: 1.22% of the genome; still more to come: unidentified peptides in mass-spec – 18,441 ncRNA genes: 8,801 short ncRNA; 9,640 long ncRNA – 11,224 pseudogenes: 863 transcribed

10 ChIP-Seq
Acronym | Description | Factors analysed
ChromRem | ATP-dependent chromatin complexes | 5
DNARep | DNA repair | 3
HISase | Histone acetylation, deacetylation or methylation complexes | 8
Other | Cyclin kinase associated with transcription | 1
Pol2 | Pol II subunit | 1 (2 forms)
Pol3 | Pol III-associated | 6
TFNS | General Pol II-associated factor, not site-specific | 8
TFSS | Pol II transcription factor with sequence-specific DNA binding | 87
Total | | 119
www.illumina.com/technology/chip_seq_assay.ilmn

11 ChIP-Seq: Histone modifications
Histone modification or variant | Signal characteristics | Association
H2A.Z | Peak | dynamic chromatin
H3K4me1 | Peak/region | enhancers and other distal elements, also downstream of transcription starts
H3K4me2 | Peak | promoters and enhancers
H3K4me3 | Peak | promoters/transcription starts
H3K9ac | Peak | promoters
H3K9me1 | Region | 5′ end of genes
H3K9me3 | Peak/region | Gene repression, constitutive heterochromatin and repetitive elements
H3K27ac | Peak | Gene expression, active enhancers and promoters
H3K27me3 | Region | polycomb complex, repressive domains and silent developmental genes
H3K36me3 | Region | Elongation, transcribed portions of genes, 3′ regions after intron 1
H3K79me2 | Region | Transcription, 5′ end of genes
H4K20me1 | Region | 5′ end of genes

12 Results: ChIP-Seq 636,336 binding regions, covering 8.1% of the genome Sequence-specific TF ChIP-seq: – 86% of the DNA segments occupied by sequence-specific transcription factors contained a strong DNA-binding motif – 55% of cases contained the expected motif Further description: Qin Zhiyi & Ma Xiaopeng, 13:30

13 DNase I hypersensitivity 2,890,000 unique hypersensitive sites (DHSs) 4,800,000 sites across 25 cell types Tier 1 and tier 2 cell types: 205,109 DHSs per cell type 98.5% of ChIP-seq TFBS within DHSs Further description: Guo Weilong 12:30, He Chao 14:30 https://www.nationaldiagnostics.com/electrophoresis/article/dnase-i-footprinting
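A statistic such as "98.5% of ChIP-seq TFBS within DHSs" comes down to an interval containment test. The sketch below shows one possible reading, where a peak counts only if it lies entirely inside a single DHS. The intervals are hypothetical half-open (start, end) pairs on one chromosome, the DHSs are assumed not to overlap each other, and the function name is illustrative.

```python
from bisect import bisect_right


def fraction_within(peaks, dhs):
    """Fraction of peaks fully contained in some DHS interval.

    Intervals are half-open (start, end); `dhs` must be non-overlapping and
    `peaks` must be non-empty.
    """
    dhs = sorted(dhs)
    starts = [s for s, _ in dhs]
    inside = 0
    for p_start, p_end in peaks:
        # Index of the last DHS that starts at or before the peak start.
        i = bisect_right(starts, p_start) - 1
        if i >= 0 and p_end <= dhs[i][1]:
            inside += 1
    return inside / len(peaks)


dhs = [(100, 300), (500, 900)]                               # hypothetical DHSs
peaks = [(120, 180), (280, 320), (600, 650), (950, 1000)]    # hypothetical TFBS peaks
frac_in_dhs = fraction_within(peaks, dhs)
print(f"{frac_in_dhs:.1%} of peaks lie within a DHS")
```

Counting any overlap rather than full containment would give a higher fraction, so the chosen definition should be stated alongside the number.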

14 FAIRE-seq Like the opposite of ChIP-seq Cross-link the nucleosomes to the DNA – But not the sequence-specific TFs Shear the DNA into small pieces Remove the protein-bound DNA Sequence the non-bound DNA Gaulton KJ et al., Nature Genetics 42, 255–259 (2010) doi:10.1038/ng.530

15 DNA methylation CpG methylation: regulates gene expression – In promoters: gene repression – In genes: gene transcription 1,200,000 methylated CpGs in 82 cell lines and tissues – 96% differentially methylated, especially those in genes Unmethylated genic CpG islands associated with P300 binding, an enhancer-related histone acetyltransferase Allele-specific methylation: genomic imprinting Aberrant methylation in cancer cell lines Reproducible methylation outside CpG dinucleotides http://www.diagenode.com/en/applications/bisulfite-conversion.php

16 Chromosome conformation capture Montavon and Duboule, Trends in Cell Biology (2012) 22:7, 347–354

17 Results: Chromosome interactions Chromosome conformation capture (3C): – 5C: 3C-carbon copy – ChIA-PET Identified 127,417 promoter-centred chromatin interactions using ChIA-PET – 98% intra-chromosomal 2,324 promoters involved in ‘single-gene’ enhancer–promoter interactions 19,813 promoters were involved in ‘multi-gene’ interaction complexes spanning up to several megabases 50–60% of long-range interactions occurred in only one of the four cell lines Further discussion: Li Yanjian, 10:40
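The intra-chromosomal share quoted above is a simple tally over interaction pairs. A minimal sketch follows, assuming each interaction is stored as two (chromosome, position) anchors. The data and the function name are invented for illustration and do not reflect the ChIA-PET file format.

```python
from collections import Counter


def interaction_summary(pairs):
    """Fractions of intra- and inter-chromosomal interactions.

    Each pair is ((chrom_a, pos_a), (chrom_b, pos_b)); `pairs` must be non-empty.
    """
    counts = Counter("intra" if a[0] == b[0] else "inter" for a, b in pairs)
    total = sum(counts.values())
    return {kind: counts[kind] / total for kind in ("intra", "inter")}


pairs = [  # hypothetical promoter-centred interactions
    (("chr1", 1_000), ("chr1", 250_000)),
    (("chr1", 5_000), ("chr1", 2_000_000)),
    (("chr2", 7_000), ("chr2", 90_000)),
    (("chr3", 1_500), ("chr9", 40_000)),
]
summary = interaction_summary(pairs)
print(summary)
```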

18 Primary Findings 80.4% of the human genome is doing at least one of the following: – Bound by a transcription factor – Transcribed – Modified histone 99% is within 1.7 kb of at least one of the biochemical events 95% within 8 kb of a DNA–protein interaction or DNase I footprint 7 chromatin states: – 399,124 enhancer-like regions – 70,292 promoter-like regions Correlation between transcription, chromatin marks, and TF binding Functional regions contain lots of SNPs – Disease-associated SNPs in non-coding regions tend to be in functional elements

19 End of Introduction

20 Summary of ENCODE elements 80.4% of the human genome is covered by at least one ENCODE-identified element 62% of the genome is transcribed 56% of the genome associated with histone modifications Excluding RNA elements and broad histone elements, 44.2% of the genome is covered – open chromatin (15.2%) – transcription factor binding (8.1%) – 19.4% DHS or transcription factor ChIP-seq peaks across all cell lines 8.5% of bases are covered by either a transcription-factor-binding-site motif (4.6%) or a DHS footprint (5.7%) – 4.5x the amount of protein-coding exons (1.2%) – 2x the amount of conserved sequence between mammals Estimate: 50% of DHS remain to be found – Based on saturation curves
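Coverage figures like these are unions of intervals, so the parts do not simply add up: 4.6% (motifs) plus 5.7% (footprints) gives 8.5% only if the two sets share 4.6 + 5.7 - 8.5 = 1.8% of bases. The sketch below reproduces that inclusion-exclusion on a hypothetical 1,000-base toy genome. The intervals are invented to match the slide's percentages and the function name is illustrative.

```python
def covered_bases(intervals):
    """Total bases covered by the union of half-open (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the current merged block.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


TOY_GENOME_BP = 1_000
motifs = [(0, 46)]        # hypothetical: 46 bases = 4.6%
footprints = [(28, 85)]   # hypothetical: 57 bases = 5.7%

union = covered_bases(motifs + footprints)
overlap = covered_bases(motifs) + covered_bases(footprints) - union
print(f"union {union / TOY_GENOME_BP:.1%}, overlap {overlap / TOY_GENOME_BP:.1%}")
```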

21 Diversity vs Conservation: Interactive Figure Conservation Diversity A high-resolution map of human evolutionary constraint using 29 mammals Nature 478, 476–482 (2011)

22 Conservation in Bound Motifs vs Unbound Motifs Conservation Diversity http://www.nature.com/encode/interactive-figures/nature11247_F1

23 Model of gene expression – histone marks

24 Model of gene expression – TF binding

25 CTCF peaks vs H3K27me3: six patterns

26 Asymmetry profiles of TFs

27 Transcription factor co-associations

28 Seven major classes of genome states
CTCF | CTCF-enriched element | CTCF signal, no histone modifications, open chromatin, may have insulator function, enriched for cohesin components RAD21 and SMC3
E | Predicted enhancer | Open chromatin, H3K4me1, other enhancer-associated marks, enriched for EP300, FOS, FOSL1, GATA2, HDAC8, JUNB, JUND, NFE2, SMARCA4, SMARCB1, SIRT6 and TAL1 sites, nuclear and whole-cell RNA poly(A) signal
PF | Predicted promoter flanking | Regions that generally surround TSS segments
R | Predicted repressed | H3K27me3 polycomb-enriched regions, REST, BRF2, CEBPB, MAFK, TRIM28, ZNF274 and SETDB1 sites or no signal at all
TSS | Predicted promoter including TSS | H3K4me3, open chromatin, Pol II, Pol III, short RNAs, close to TSS sites
T | Predicted transcribed | H3K36me3 transcriptional elongation signal, overlap with gene bodies, phosphorylated Pol II, cytoplasmic poly(A)+ RNA
WE | Weak enhancer | Similar to the E state, but weaker signals and weaker enrichments

29 Data integration and genome segmentation Transcribed RepressedTSS Enhancer

30 Association between genome states and annotations Transcription factors RNA expression Genome segment

31 Enhancer validation in mouse and fish Enhancer from K562 cell (leukemia) drives basal promoter with reporter gene in embryonic mouse blood cells and medaka fish

32 Genome segment clustering 6 cell types

33 Genome cluster function Genome state is related to gene function

34 Allele-specific expression Pol II Txn Rpn

35 Correlation of allele-specific signal by gene by genomic segment

36 Analysis of SNPs from a single person paternal-haplotype-specific CTCF peak DNase I hypersensitivity variation by cell type

37 Genome-wide association studies Annotated disease-causing SNPs Control SNPs Selected TFBS tracks Diseases Significant overlap No genes, but several TFBS near the disease-causing SNPs

38 Conclusions 80% of human genome annotated with at least one association – Protein-binding – Histone modification – Transcription ENCODE data combination – Model gene expression – Genome segmented into 7 types Different in each cell line ENCODE data combined with other data – 1000 genomes: see influence of parental DNA – Genome-wide association studies

39 Discussion 147 types of cells, and the human body has a few thousand 80% functional: controversial – 80% of the genome is being transcribed and/or has a protein bound to it some of the time – Heterochromatin: tightly packed repeat sequences – most of that activity isn’t particularly specific or interesting and may not have impact – Important not to overstate the findings – Ewan Birney: “cumulative occupation of 8% of the genome by TFs” Reproducibility – In exactly the same cell lines, same conditions, different time or place – Same cell lines, different conditions – Same cell type, different people Cell lines vs tissue Cancer vs normal http://blogs.nature.com/news/2012/09/fighting-about-encode-and-junk.html http://blogs.discovermagazine.com/notrocketscience/2012/09/05/encode-the-rough-guide-to-the-human-genome/

40 Applications Visible as genome tracks in UCSC Mutation from – Cancer sequencing – GWAS – Find out what that part of the genome is doing Compare with your cancer data (RNA-seq) Comparative genome analysis Gene or pathway of interest

41 Online Resources Interactive graphics in online version of paper Interactive app on Nature ENCODE main page www.nature.com/encode/

