New perspectives in genomic analysis: implications of the ENCODE project. Rory Johnson, Bioinformatics and Genomics, Centre for Genomic Regulation. AEEH.


1 New perspectives in genomic analysis: implications of the ENCODE project
Rory Johnson, Bioinformatics and Genomics, Centre for Genomic Regulation
AEEH, 21/2/14

2 This talk:
- Our view of the human genome today, thanks to ENCODE
- What it means for translational research

3 Epigenetics: the intermediate between genome and phenotype

4 (Hong Kong) Our changing view of the genome

5 Our changing view of the genome
Figure labels: chromatin; histones + modifications; transcription factors; enhancers
Example sequence: CAGGCATTAACCTTAGTCCTAATGGTTAGAGTCGTCCCTGATAATCTTAGTGAGGAAGGGACATTTCCAGAGTCGCCCAG CAGCAAATTCCAGATGTCTAAGGTCCCCAAACAGAACAAAATTGCATAAT
This organisation is encoded in non-protein-coding genome sequence.

6 The Genome and Epigenome
- Genome sequence: simple, static
- Epigenome sequence: multi-layered, dynamic, cell-specific
=> Hence ENCODE

7 The human genome in numbers
- 3 x 10^9 base pairs
- 20,345 protein-coding genes
- 13,870 long noncoding RNA genes
- 9,013 small noncoding RNA genes
- 3 x 10^6 regulatory regions (enhancers)
- 12,460 known trait-associated SNPs (single nucleotide polymorphisms)
- 88% of trait-associated SNPs lie outside protein-coding sequence

8 Next Generation Sequencing
The high-throughput reading of DNA or RNA. The main system now is the Illumina HiSeq.
Statistics:
- Read length: ~150 nt
- Reads per lane: ~150 million
- Lanes per run: 16
- Total nt per run: ~400 billion
- Cost per run: ~16,000 euro
(The Human Genome Project took 13 years and $3 billion to sequence 3 billion nt, ending in 2003.)
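
The run statistics above can be checked with a little arithmetic. The sketch below uses only the approximate figures quoted on the slide; the variable names are illustrative, and the derived values (genome equivalents per run, cost per billion nucleotides) are back-of-the-envelope estimates rather than vendor specifications.

```python
# Approximate figures quoted on the slide.
READ_LENGTH_NT = 150
READS_PER_LANE = 150_000_000
LANES_PER_RUN = 16
COST_PER_RUN_EUR = 16_000
GENOME_SIZE_NT = 3_000_000_000  # 3 x 10^9 base pairs


def run_yield_nt(read_length, reads_per_lane, lanes):
    """Total nucleotides read in one run."""
    return read_length * reads_per_lane * lanes


total_nt = run_yield_nt(READ_LENGTH_NT, READS_PER_LANE, LANES_PER_RUN)

# How many times the raw output of one run covers a 3-billion-nt genome.
genome_equivalents = total_nt / GENOME_SIZE_NT

# Cost per billion nucleotides sequenced.
cost_per_gigabase_eur = COST_PER_RUN_EUR / (total_nt / 1e9)

print(f"Yield per run: {total_nt / 1e9:.0f} billion nt")
print(f"Genome equivalents per run: {genome_equivalents:.0f}")
print(f"Cost per billion nt: {cost_per_gigabase_eur:.2f} euro")
```

The product of the three quoted figures is 360 billion nt, consistent with the rounded "~400 billion" on the slide.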

9 NGS-based methods for genome analysis: towards the clinic
- ChIP-seq (chromatin immunoprecipitation): transcription factor binding / chromatin state
- DNase-seq: transcription factor binding / chromatin state
- RNAseq: mRNA transcription / splicing
- Ribosome footprinting: translation rate
- Hi-C: genome 3D structure
These methods have been demonstrated to be practical for continuous patient monitoring or diagnostics:
- Rui et al., Cell, Volume 148, Issue 6 (“iPOP”)
- Buenrostro et al., Nature Methods 10, 1213–1218 (2013): “Using ATAC-seq maps of human CD4+ T cells from a proband obtained on consecutive days, we demonstrated the feasibility of analyzing an individual's epigenome on a timescale compatible with clinical decision-making.”

10 The ENCODE Project
ENCODE: Encyclopedia of DNA Elements
- International consortium dedicated to comprehensively mapping the human epigenome
- Created high-quality, ongoing gene annotations: GENCODE
- 32 laboratories, $400 million
- In Spain: Roderic Guigó (CRG) was one of the leaders (with Tom Gingeras, CSHL) of the transcriptomics section
- 147 cell types (mainly transformed cell lines)
- 1,640 genome-wide datasets

11 ENCODE integrates multiple data types across cell types
- RNAseq: gene expression
- ChIP: chromatin
- ChIP: transcription factors
- ChIA-PET: genome structure / folding
- GENCODE: gene annotation catalogue

12 Visualizing ENCODE data at the UCSC Genome Browser

13 ENCODE data of relevance to hepatology
ENCODE Tier 2: HepG2 cell line, hepatocellular carcinoma (see for other cell types)
Including:
- 8 RNAseq experiments
- 114 transcription factor ChIP experiments (inc. CEBPB, HNF4A, HNF4G)
Figure labels: genes; chromatin; transcription factors; RNA

14 Chromatin state is extremely cell type specific

15 Other projects of relevance: Epigenomics Roadmap Project

16 Other projects of relevance: eQTL
GTEx – Genotype-Tissue Expression project
Hunting for genetic variants that influence gene expression
- Linking genetic variants to changes in gene expression: regulatory variants or “expression quantitative trait loci” (eQTL)
- These will be different between tissues
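
As a minimal sketch of what "linking genetic variants to changes in gene expression" can look like in practice, the example below regresses expression on genotype dosage (0, 1 or 2 copies of the alternate allele) separately in two tissues. All data are synthetic and the function name `eqtl_slope` is hypothetical; real eQTL studies such as GTEx use far larger cohorts, covariates and multiple-testing correction, none of which is shown here.

```python
from statistics import mean


def eqtl_slope(genotypes, expression):
    """Least-squares slope of expression on genotype dosage (0, 1 or 2)."""
    g_mean = mean(genotypes)
    e_mean = mean(expression)
    cov = sum((g - g_mean) * (e - e_mean) for g, e in zip(genotypes, expression))
    var = sum((g - g_mean) ** 2 for g in genotypes)
    if var == 0:
        raise ValueError("genotype does not vary in this sample")
    return cov / var


# Synthetic data: six individuals, one SNP, one gene, two tissues.
genotypes = [0, 0, 1, 1, 2, 2]
liver_expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]  # rises with allele count
blood_expression = [2.0, 2.1, 2.1, 2.0, 2.0, 2.1]  # no trend

liver_slope = eqtl_slope(genotypes, liver_expression)
blood_slope = eqtl_slope(genotypes, blood_expression)

print(f"liver: {liver_slope:.2f} expression units per allele")
print(f"blood: {blood_slope:.2f} expression units per allele")
```

In this toy data the variant behaves as an eQTL in one tissue and not the other, which is the tissue-specificity point made on the slide.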

17 What does this mean for translational research?
- Protein-focussed studies will miss the majority of functional disease-causing variants / mutations
- Non-coding variants will usually be regulatory
- Non-coding variants will usually be cell-type specific
- Large projects like ENCODE are producing rich data that can be used to interpret clinical results

18 How can genetic variants (SNPs) in noncoding regions cause phenotype?
By altering the nucleotide sequence recognized by a regulatory protein.
Hawkins et al., Nature Reviews Genetics 11
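
To make the idea concrete, the sketch below scores a reference and an alternate allele against a position probability matrix, a common way to model the sequence a regulatory protein recognizes. The matrix is a made-up 4-bp motif, not a model of any real transcription factor, and the uniform background frequency is an assumption.

```python
import math

# Hypothetical position probability matrix for a 4-bp motif (illustration only).
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def motif_score(site):
    """Log-odds score (in bits) of a site against the motif."""
    if len(site) != len(MOTIF):
        raise ValueError("site length must match motif length")
    return sum(math.log2(column[base] / BACKGROUND)
               for column, base in zip(MOTIF, site))


ref_site = "AGGC"  # reference allele: matches the preferred base at every position
alt_site = "AGTC"  # alternate allele: G>T at the third position

ref_score = motif_score(ref_site)
alt_score = motif_score(alt_site)

print(f"reference score: {ref_score:.2f} bits")
print(f"alternate score: {alt_score:.2f} bits")
print(f"change caused by the SNP: {alt_score - ref_score:.2f} bits")
```

A lower score for the alternate allele suggests weaker predicted binding; whether that translates into a change in gene expression has to be tested experimentally.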

19 How can genetic variants (SNPs) in noncoding regions cause phenotype?
Diagram: genetic variant (SNP) -> gene expression -> disease

20 How does ENCODE affect translational research projects?
- Genome wide association study (GWAS)
- Exome sequencing
- Gene expression profiling

21 Translational research approaches 1: Genetic approaches
Genomic approaches to identify genetic variants underlying disease:
GWAS – genome wide association study
Advantages:
- Genome wide
- Not biased towards coding regions
- Good at identifying common variants
Disadvantages:
- Depends on a limited number of marker SNPs
- Low resolution
- Does not yield insights into mechanism
Exome sequencing – targeted genome sequencing
Advantages:
- Proteome wide
- Can identify rare causative variants
- Usually yields a mechanistic hypothesis
- High resolution
Disadvantages:
- No information about noncoding variants
- Likely missing most causative variants

22 Interpretation of GWAS results
GWAS gives an unbiased, genome-wide set of candidate SNPs. The majority of these lie outside protein-coding regions.
Two main challenges:
1. Identifying the causative SNP
2. Understanding the mechanism of action of that SNP
Example: Li et al., PLoS Genet 8(7) (hepatocellular carcinoma)

23 Identifying the causative SNP using ENCODE data
Hunt for the likely functional SNP in LD with the marker SNP.
Schaub et al., Genome Res, Sep;22(9)
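
The hunt for a functional SNP in LD with a marker can be sketched as a two-step filter: keep candidates in strong LD with the GWAS marker, then keep those that fall inside an annotated regulatory region. The example below is a simplified illustration with synthetic genotypes and hypothetical coordinates; it approximates LD by the squared correlation of genotype dosages and is not the procedure used by Schaub et al.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def overlaps(position, regions):
    """True if position lies in any half-open [start, end) region."""
    return any(start <= position < end for start, end in regions)


# Synthetic dosages for the marker SNP in eight individuals.
marker = [0, 1, 2, 0, 1, 2, 0, 1]

# Hypothetical candidate SNPs near the marker.
candidates = {
    "snp_a": {"pos": 1050, "dosage": [0, 1, 2, 0, 1, 2, 0, 1]},  # strong LD, annotated
    "snp_b": {"pos": 5000, "dosage": [0, 1, 2, 0, 1, 2, 0, 1]},  # strong LD, unannotated
    "snp_c": {"pos": 1100, "dosage": [2, 0, 1, 1, 0, 0, 2, 1]},  # weak LD, annotated
}

# Hypothetical regulatory region, e.g. a ChIP-seq peak.
regulatory_regions = [(1000, 1200)]
LD_THRESHOLD = 0.8  # assumed cut-off

likely_functional = [
    name
    for name, snp in candidates.items()
    if r_squared(marker, snp["dosage"]) >= LD_THRESHOLD
    and overlaps(snp["pos"], regulatory_regions)
]

print(likely_functional)
```

Only the candidate that is both in strong LD with the marker and inside the annotated region survives the filter.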

24 Understanding the mechanism of a noncoding SNP using ENCODE data
Schaub et al., Genome Res, Sep;22(9)

25 RegulomeDB: A web server for functional prediction of SNPs using ENCODE data

26 Exome sequencing
Exome sequencing: targeted genome sequencing of protein-coding exons. Relies on capturing a selected subset of the genome.
Advantages:
- Lower cost and higher statistical power
- Can detect rare private mutations
Disadvantages:
- Presently ignores the noncoding genome (~99%)

27 Exome sequencing: what's next?
- Whole genome sequencing is not likely to be practical: no statistical power
- Exome technology is highly customisable and could be adapted to noncoding regions
- The main question: what are the target regions? How to define the target space? Regulatory regions? Noncoding RNAs? Protein binding sites?
- Likely to be organ / disease specific
- Will require bioinformatic analysis to design reagents before the experimental project begins
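
One piece of the bioinformatic design step can be sketched as follows: pool candidate regions from different annotation types, merge any that overlap, and measure the size of the resulting target space. The coordinates below are hypothetical and sit on a single chromosome; a real design would work from genome-wide annotation files and would also account for probe design constraints.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def total_length(intervals):
    """Total number of bases covered by non-overlapping intervals."""
    return sum(end - start for start, end in intervals)


# Hypothetical candidate regions on one chromosome.
enhancers = [(100, 300), (250, 400)]
lncrna_exons = [(1000, 1500)]
tf_binding_sites = [(380, 420), (1490, 1510)]

targets = merge_intervals(enhancers + lncrna_exons + tf_binding_sites)
target_size = total_length(targets)

print(targets)
print(f"target space: {target_size} bp")
```

Merging before counting avoids charging the capture design twice for bases that appear in more than one annotation.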

28 Translational research approaches 2: Transcriptomic approaches
ENCODE has made a major contribution to gene expression studies by providing high-quality annotations of novel noncoding genes through GENCODE.
Microarray studies:
- Microarrays are restricted by the catalogue of probes chosen
- Commercial arrays: usually protein-coding genes
- MicroRNA arrays available
- Long noncoding RNA arrays available (CRG provide free designs), based on ENCODE annotations

29 Translational research approaches 2: Transcriptomic approaches
RNAseq:
- Unbiased, so it can discover novel RNAs
- Can quantify expression of known and novel genes, and discover RNA from non-“genic” loci
- Requires more bioinformatic analysis
- Still more expensive than arrays
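
As a hedged illustration of how RNAseq read counts can be turned into comparable expression values for both known and novel transcripts, the sketch below computes transcripts per million (TPM), one widely used normalization. TPM is not mentioned in the talk; it is used here only as an example, and the counts, lengths and transcript names are invented.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from read counts and transcript lengths."""
    # Reads per base corrects for longer transcripts collecting more reads.
    rates = {name: counts[name] / lengths_bp[name] for name in counts}
    total_rate = sum(rates.values())
    return {name: rate / total_rate * 1_000_000 for name, rate in rates.items()}


# Invented example: two annotated genes and one novel, unannotated locus.
counts = {"known_gene_1": 100, "known_gene_2": 100, "novel_locus": 50}
lengths_bp = {"known_gene_1": 1000, "known_gene_2": 2000, "novel_locus": 500}

expression = tpm(counts, lengths_bp)

for name, value in expression.items():
    print(f"{name}: {value:.0f} TPM")
```

Although the novel locus has the fewest reads, its short length gives it the same TPM as the first known gene, which shows why raw counts alone are not comparable between transcripts.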

30 Translational research approaches 2: Transcriptomic approaches
Problems:
- It is easy to discover and quantify the expression of novel genes
- It is difficult to understand the function of such genes
- We have no bioinformatic tools to predict the function of most novel ncRNAs
- We have limited experimental tools to investigate them

31 What does ENCODE mean for these studies?
GWAS:
- GWAS study design will not likely be affected
- ENCODE will allow better interpretation of discovered SNPs
Exome:
- Whole genome cohort studies may never be feasible
- The capture sequencing approach can be redesigned to study noncoding variants in the disease of choice
- ENCODE and other public data will aid in the design of these projects
Gene expression:
- New gene annotations can help in both microarray and RNAseq projects to discover novel noncoding gene targets
- RNAseq will eventually replace arrays as costs drop, but right now new array designs are competitive in large experiments, and given the bioinformatic requirements of RNAseq

32 Nothing would have been possible without…
CRG Bioinformatics & Genomics: Roderic Guigó, Bioinformatics and Genomics group
ENCODE / GENCODE: Jennifer Harrow, Tim Hubbard (GENCODE, Sanger)
Funding: Ramón y Cajal RYC; Plan Nacional BIO

33 The main message of ENCODE
To understand genotypes and phenotypes, we must look beyond the protein-coding gene.
Further reading: Lucas D Ward & Manolis Kellis, “Interpreting noncoding genetic variation in complex traits and human disease”, Nature Biotechnology 30, 1095–1106 (2012)

34 How could variants in noncoding regions cause phenotype?
- By altering the nucleotide sequence recognized by a regulatory protein
- By altering a noncoding RNA gene, either in expression levels or mature sequence
Hawkins et al., Nature Reviews Genetics 11
Haas et al., RNA Biol, Jun;9(6):924-37

35 Levels of genome regulation
We now appreciate the genome is regulated at multiple levels:
- “Epigenetically” – chromatin structure
- Transcriptionally – RNA production
- Post-transcriptionally – RNA processing (splicing, transport, stability)
- Translationally – protein production at the ribosome
- Structurally – the folding structure of the genome
=> These sequences all have effects on phenotype and thus may contribute to disease
=> All of these are encoded in noncoding DNA sequence

36 Case study: Studying disease-associated regulatory SNPs incorporating cohort epigenome data
A SNP for breast cancer creates an NFκB binding site.
Karczewski KJ et al., Proc Natl Acad Sci U S A, Jun 4;110(23)