1 New perspectives in genomic analysis: implications of the ENCODE project
Rory Johnson, Bioinformatics and Genomics, Centre for Genomic Regulation
AEEH, 21/2/14
2 This talk
- Our view of the human genome today, thanks to ENCODE
- What it means for translational research
3 Epigenetics: the intermediate between genome and phenotype
4 (Hong Kong) Our changing view of the genome
5 Our changing view of the genome
[Figure: genome sequence annotated with chromatin (histones and their modifications), transcription factors, and enhancers]
This organisation is encoded in non-protein-coding genome sequence.
6 The Genome and Epigenome
- Genome sequence: simple, static
- Epigenome: multi-layered, dynamic, cell-specific
=> Hence ENCODE
7 The human genome in numbers
- 3 × 10^9 base pairs
- 20,345 protein-coding genes
- 13,870 long noncoding RNA genes
- 9,013 small noncoding RNA genes
- 3 × 10^6 regulatory regions (enhancers)
- 12,460 known trait-associated SNPs (single nucleotide polymorphisms)
- 88% of trait-associated SNPs lie outside protein-coding sequence
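A couple of derived numbers follow from the figures on this slide. This is a minimal sketch using only the counts quoted above; the variable names are illustrative and the results are rough, since the inputs are themselves rounded.

```python
# Figures quoted on the slide (variable names are illustrative).
protein_coding_genes = 20_345
long_noncoding_genes = 13_870
small_noncoding_genes = 9_013
trait_snps = 12_460
fraction_outside_coding = 0.88

# Approximate number of trait-associated SNPs outside protein-coding sequence.
noncoding_trait_snps = round(trait_snps * fraction_outside_coding)

# Share of annotated genes that are noncoding RNA genes.
total_genes = protein_coding_genes + long_noncoding_genes + small_noncoding_genes
noncoding_gene_share = (long_noncoding_genes + small_noncoding_genes) / total_genes

print(f"Trait SNPs outside coding sequence: ~{noncoding_trait_snps}")
print(f"Noncoding share of annotated genes: {noncoding_gene_share:.1%}")
```

On these numbers, roughly eleven thousand trait-associated SNPs fall outside coding sequence, and just over half of the annotated genes are noncoding.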
8 Next Generation Sequencing
The high-throughput reading of DNA or RNA. The main system now is the Illumina HiSeq.
Statistics:
- Read length: ~150 nt
- Reads per lane: ~150 million
- Lanes per run: 16
- Total nt per run: ~400 billion
- Cost per run: ~16,000 euro
(The Human Genome Project took 13 years and $3 billion to sequence 3 billion nt, ending in 2003.)
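The run statistics can be checked with simple arithmetic. The sketch below multiplies the rounded figures from the slide and expresses the output as "genome equivalents" (raw nucleotides divided by a 3 billion nt genome). That unit is an assumption made for illustration: it is 1x raw coverage, not a finished genome.

```python
# Rounded figures from the slide.
read_length_nt = 150
reads_per_lane = 150_000_000
lanes_per_run = 16
cost_per_run_eur = 16_000
genome_size_nt = 3_000_000_000

# Raw nucleotides produced by one run.
total_nt_per_run = read_length_nt * reads_per_lane * lanes_per_run

# How many times the run covers a human-sized genome at 1x (illustrative unit).
genome_equivalents = total_nt_per_run / genome_size_nt

# Cost of one 1x genome equivalent.
cost_per_genome_equivalent = cost_per_run_eur / genome_equivalents

print(f"Total nt per run: {total_nt_per_run:.2e}")
print(f"1x genome equivalents per run: {genome_equivalents:.0f}")
print(f"Cost per 1x genome equivalent: {cost_per_genome_equivalent:.2f} euro")
```

The product of the rounded inputs is 360 billion nt, in line with the "~400 billion" quoted on the slide.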
9 NGS-based methods for genome analysis: towards the clinic
- ChIP-seq (chromatin immunoprecipitation): transcription factor binding / chromatin state
- DNase-seq: transcription factor binding / chromatin state
- RNAseq: mRNA transcription / splicing
- Ribosome footprinting: translation rate
- Hi-C: genome 3D structure
These methods have been demonstrated to be practical for continuous patient monitoring or diagnostics:
- Rui et al., Cell, Volume 148, Issue 6 (“iPOP”)
- Buenrostro et al., Nature Methods 10, 1213–1218 (2013): “Using ATAC-seq maps of human CD4 + T cells from a proband obtained on consecutive days, we demonstrated the feasibility of analyzing an individual's epigenome on a timescale compatible with clinical decision-making.”
10 The ENCODE Project
ENCODE: Encyclopedia of DNA Elements
- International consortium dedicated to comprehensively mapping the human epigenome
- Created high-quality, ongoing gene annotations: GENCODE
- 32 laboratories, $400 million
- In Spain: Roderic Guigó (CRG) was one of the leaders (with Tom Gingeras, CSHL) of the transcriptomics section
- 147 cell types (mainly transformed cell lines)
- 1,640 genome-wide datasets
11 ENCODE integrates multiple data types across cell types
- RNAseq: gene expression
- ChIP: chromatin
- ChIP: transcription factors
- ChIA-PET: genome structure / folding
- GENCODE: gene annotation catalogue
12 Visualizing ENCODE data at the UCSC Genome Browser
13 ENCODE data of relevance to hepatology
ENCODE Tier 2: HepG2 cell line, hepatocellular carcinoma (other cell types are also available)
Including:
- 8 RNAseq experiments
- 114 transcription factor ChIP experiments (including CEBPB, HNF4A, HNF4G)
[Figure: browser tracks for genes, chromatin, transcription factors, and RNA]
14 Chromatin state is extremely cell type specific
15 Other projects of relevance: Epigenomics Roadmap Project
16 Other projects of relevance: eQTL
GTEx: the Genotype-Tissue Expression project
- Hunting for genetic variants that influence gene expression
- Linking genetic variants to changes in gene expression: regulatory variants, or “expression quantitative trait loci” (eQTL)
- These will differ between tissues
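The idea behind an eQTL test can be sketched as a regression of expression on genotype dosage (0, 1 or 2 copies of the alternative allele). The example below is a toy illustration, not the GTEx pipeline: the data are invented, the function name `eqtl_fit` is hypothetical, and a real analysis would add covariates, significance testing, and multiple-testing correction.

```python
def eqtl_fit(dosages, expression):
    """Least-squares slope and r^2 of expression against allele dosage."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    slope = sxy / sxx
    r2 = 0.0 if syy == 0 else sxy ** 2 / (sxx * syy)
    return slope, r2


# Invented toy data: nine individuals, one SNP, one gene, two tissues.
dosages = [0, 0, 0, 1, 1, 1, 2, 2, 2]
liver_expression = [1.0, 1.2, 0.8, 2.0, 2.1, 1.9, 3.0, 3.2, 2.8]
other_expression = [2.0, 1.9, 2.1, 2.0, 2.1, 1.9, 2.0, 2.1, 1.9]

liver_slope, liver_r2 = eqtl_fit(dosages, liver_expression)
other_slope, other_r2 = eqtl_fit(dosages, other_expression)

print(f"liver: slope={liver_slope:.2f}, r2={liver_r2:.2f}")
print(f"other: slope={other_slope:.2f}, r2={other_r2:.2f}")
```

In this toy data the same SNP tracks expression in one tissue and not in the other, which is the tissue-specific behaviour the slide describes.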
17 What does this mean for translational research?
- Protein-focussed studies will miss the majority of functional disease-causing variants / mutations
- Non-coding variants will usually be regulatory
- Non-coding variants will usually be cell-type specific
- Large projects like ENCODE are producing rich data that can be used to interpret clinical results
18 How can genetic variants (SNPs) in noncoding regions cause phenotype?
By altering the nucleotide sequence recognized by a regulatory protein.
Hawkins et al., Nature Reviews Genetics 11
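One common way to illustrate this mechanism is to score the reference and alternative alleles against a position weight matrix for a transcription factor motif. The sketch below assumes a made-up 4-bp motif (`MOTIF` is hypothetical and does not represent any real factor) and a uniform background; it shows only how a single base change can shift a binding score.

```python
import math

# Hypothetical base probabilities per motif position (not a real factor).
MOTIF = [
    {"A": 0.80, "C": 0.05, "G": 0.10, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.10, "C": 0.05, "G": 0.05, "T": 0.80},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def log_odds_score(site):
    """Sum of per-position log2(motif probability / background)."""
    if len(site) != len(MOTIF):
        raise ValueError("site length must match motif length")
    return sum(math.log2(column[base] / BACKGROUND)
               for column, base in zip(MOTIF, site))


ref_site = "AGCT"  # reference allele matches the motif consensus
alt_site = "ATCT"  # illustrative SNP: G>T at the second position

ref_score = log_odds_score(ref_site)
alt_score = log_odds_score(alt_site)

print(f"ref score: {ref_score:.2f}")
print(f"alt score: {alt_score:.2f}")
print(f"change:    {alt_score - ref_score:.2f}")
```

A large drop in score suggests the variant weakens the site; a gain would suggest the variant creates one.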
19 How can genetic variants (SNPs) in noncoding regions cause phenotype?
[Figure: Genetic variant (SNP) -> Gene expression -> Disease]
20 How does ENCODE affect translational research projects?
- Genome-wide association study (GWAS)
- Exome sequencing
- Gene expression profiling
21 Translational research approaches 1: Genetic approaches
Genomic approaches to identify genetic variants underlying disease:
GWAS: genome-wide association study
Advantages:
- Genome wide
- Not biased towards coding regions
- Good at identifying common variants
Disadvantages:
- Depends on a limited number of marker SNPs
- Low resolution
- Does not yield insights into mechanism
Exome sequencing: targeted genome sequencing
Advantages:
- Proteome wide
- Can identify rare causative variants
- Usually yields a mechanistic hypothesis
- High resolution
Disadvantages:
- No information about noncoding variants
- Likely missing most causative variants
22 Interpretation of GWAS results
GWAS gives an unbiased, genome-wide set of candidate SNPs. The majority of these lie outside protein-coding regions.
Two main challenges:
1. Identifying the causative SNP
2. Understanding the mechanism of action of that SNP
Hepatocellular carcinoma: Li et al., PLoS Genet 8(7)
23 Identifying the causative SNP using ENCODE data
Hunt for the likely functional SNP in LD with the marker SNP.
Schaub et al., Genome Res, Sep;22(9)
24 Understanding the mechanism of a noncoding SNP using ENCODE data
Schaub et al., Genome Res, Sep;22(9)
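The hunt described on this slide can be sketched in two steps: measure linkage disequilibrium (r²) between the marker SNP and nearby candidates, then keep the candidates that also overlap a regulatory annotation. This is an illustrative sketch, not the method of Schaub et al.: the haplotypes, SNP names, annotation flags, and the 0.8 threshold are all assumptions made for the example.

```python
def ld_r2(hap_a, hap_b):
    """r^2 between two biallelic SNPs from phased haplotypes coded 0/1."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else d * d / denom


# Invented phased haplotypes (8 chromosomes) for the GWAS marker SNP.
marker = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical candidates: (haplotypes, overlaps a regulatory element?).
candidates = {
    "snp_a": ([1, 1, 1, 1, 0, 0, 0, 0], True),
    "snp_b": ([1, 1, 1, 0, 0, 0, 0, 0], True),
    "snp_c": ([1, 1, 1, 1, 0, 0, 0, 0], False),
    "snp_d": ([1, 0, 1, 0, 1, 0, 1, 0], True),
}

R2_THRESHOLD = 0.8  # assumed cut-off for "in strong LD"

prioritised = [
    name
    for name, (haplotypes, in_regulatory_element) in candidates.items()
    if in_regulatory_element and ld_r2(marker, haplotypes) >= R2_THRESHOLD
]

print(prioritised)
```

In practice the regulatory flag would come from annotation tracks such as the ENCODE ChIP-seq and DNase-seq data, which is where the project's datasets enter the analysis.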
25 RegulomeDB: A web server for functional prediction of SNPs using ENCODE data
26 Exome sequencing
Exome sequencing: targeted genome sequencing of protein-coding exons. Relies on capturing a selected subset of the genome.
Advantages:
- Lower cost and higher statistical power
- Can detect rare private mutations
Disadvantages:
- Presently ignores the noncoding genome (~99%)
27 Exome sequencing: what's next?
- Whole-genome sequencing is not likely to be practical: no statistical power
- Exome technology is highly customisable and could be adapted to noncoding regions
- The main question: what are the target regions? How to define the target space? Regulatory regions? Noncoding RNAs? Protein binding sites?
- Likely to be organ / disease specific
- Will require bioinformatic analysis to design reagents before the experimental project begins
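One early step in defining a target space is to pool the candidate regions, merge the overlaps, and total the size, since capture cost scales with the number of bases targeted. The sketch below assumes half-open (start, end) coordinates on a single chromosome, as in BED files; the function names and all coordinates are invented for illustration.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(start, end) for start, end in merged]


def target_size(intervals):
    """Total number of bases covered after merging."""
    return sum(end - start for start, end in merge_intervals(intervals))


# Invented candidate regions on one chromosome.
enhancers = [(100, 300), (250, 400)]
lncrna_exons = [(1000, 1500)]
tf_binding_sites = [(380, 420), (1490, 1510)]

all_regions = enhancers + lncrna_exons + tf_binding_sites
merged_regions = merge_intervals(all_regions)
total_bases = target_size(all_regions)

print(merged_regions)
print(total_bases)
```

A real design would repeat this per chromosome and for the tissue or disease of interest, then pass the merged regions to probe design.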
28 Translational research approaches 2: Transcriptomic approaches
ENCODE has made a major contribution to gene expression studies by providing high-quality annotations of novel noncoding genes through GENCODE.
Microarray studies:
- Microarrays are restricted by the catalogue of probes chosen
- Commercial arrays: usually protein-coding genes
- MicroRNA arrays are available
- Long noncoding RNA arrays are available (CRG provides free designs), based on ENCODE annotations
29 Translational research approaches 2: Transcriptomic approaches
RNAseq:
- Unbiased, so it can discover novel RNAs
- Can quantify expression of known and novel genes, and discover RNA from non-“genic” loci
- Requires more bioinformatic analysis
- Still more expensive than arrays
30 Translational research approaches 2: Transcriptomic approaches
Problems:
- It is easy to discover and quantify the expression of novel genes
- It is difficult to understand the function of such genes
- We have no bioinformatic tools to predict the function of most novel ncRNAs
- We have limited experimental tools to investigate them
31 What does ENCODE mean for these studies?
GWAS:
- GWAS study design is unlikely to be affected
- ENCODE will allow better interpretation of discovered SNPs
Exome:
- Whole-genome cohort studies may never be feasible
- The capture sequencing approach can be redesigned to study noncoding variants in the disease of choice
- ENCODE and other public data will aid in the design of these projects
Gene expression:
- New gene annotations can help in both microarray and RNAseq projects to discover novel noncoding gene targets
- RNAseq will eventually replace arrays as costs drop, but right now new array designs are competitive in large experiments and given the bioinformatic requirements of RNAseq
32 Nothing would have been possible without…
- CRG Bioinformatics & Genomics: Roderic Guigó, Bioinformatics and Genomics group
- ENCODE / GENCODE: Jennifer Harrow, Tim Hubbard (GENCODE, Sanger)
- Funding: Ramón y Cajal RYC, Plan Nacional BIO
33 The main message of ENCODE
To understand genotypes and phenotypes, we must look beyond the protein-coding gene.
Further reading: Lucas D Ward & Manolis Kellis, “Interpreting noncoding genetic variation in complex traits and human disease”, Nature Biotechnology 30, 1095–1106 (2012)
34 How could variants in noncoding regions cause phenotype?
- By altering the nucleotide sequence recognized by a regulatory protein
- By altering a noncoding RNA gene, either in expression levels or mature sequence
Hawkins et al., Nature Reviews Genetics 11; Haas et al., RNA Biol, Jun;9(6):924-37
35 Levels of genome regulation
We now appreciate that the genome is regulated at multiple levels:
- “Epigenetically”: chromatin structure
- Transcriptionally: RNA production
- Post-transcriptionally: RNA processing (splicing, transport, stability)
- Translationally: protein production at the ribosome
- Structurally: the folding structure of the genome
=> These sequences all have effects on phenotype and thus may contribute to disease
=> All of these are encoded in noncoding DNA sequence
36 Case study: Studying disease-associated regulatory SNPs incorporating cohort epigenome data
A SNP for breast cancer creates an NFκB binding site.
Karczewski KJ et al., Proc Natl Acad Sci U S A, Jun 4;110(23)