March 9, 2007 Bologna, February 2007 1 the complexity of human genes The ENCODE Genes & Transcripts group Roderic Guigó Centre de Regulació Genòmica, Barcelona.

Presentation transcript:

genes and proteins: one gene, one enzyme (Beadle and Tatum); the Central Dogma (Francis Crick)

from DNA to proteins: most of the transcriptional output of the human genome is localized in well-defined genomic loci, which encode mRNAs that, when exported into the cytosol, are translated into proteins

% of the genome, 44 regions
target selection: a committee selected the sequence targets
- manual targets: a lot of information
- random targets: stratified by non-exonic conservation with mouse and by gene density

- Long-range regulatory elements (enhancers, repressors/silencers, insulators)
- Cis-regulatory elements (promoters, transcription factor binding sites)
- DNA Replication
- DNase Hypersensitive Sites
- Genes and Transcripts
- Epigenetic

gencode: encyclopedia of genes and gene variants
Roderic Guigó, IMIM-UPF-CRG; Stylianos Antonarakis, Geneva; Alexandre Reymond; Ewan Birney, EBI; Michael Brent, WashU; Lior Pachter, Berkeley; Manolis Dermitzakis, Sanger; Jennifer Ashurst; Tim Hubbard
identify all protein coding genes in the ENCODE regions:
- identify one complete mRNA sequence for at least one splice isoform of each protein coding gene
- eventually, identify a number of additional alternative splice forms

the gencode pipeline
1. mapping of known transcript sequences (ESTs, cDNAs, proteins) onto the human genome
2. manual curation to resolve conflicting evidence
3. additional computational predictions
4. experimental verification
5. FINAL ANNOTATION

THE GENCODE PIPELINE
- manual curation: HAVANA (Sanger)
- experimental verification: Geneva
- bioinformatics: IMIM
2608 transcripts in 487 loci:
- 137 transcripts in 53 non-coding loci
- 1097 coding transcripts and 1374 non-coding transcripts in 434 protein coding loci
most protein coding loci encode a mixture of protein coding and non-coding transcripts
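The counts on this slide add up, which a quick tally confirms. The sketch below only re-adds the figures quoted above and derives the share of non-coding transcripts within protein coding loci; the variable names are illustrative and not part of the GENCODE pipeline.

```python
# Figures quoted on the slide; variable names are illustrative only.
noncoding_loci = 53
coding_loci = 434
transcripts_in_noncoding_loci = 137
coding_transcripts = 1097
noncoding_transcripts_in_coding_loci = 1374

# Totals that the slide reports as 487 loci and 2608 transcripts.
total_loci = noncoding_loci + coding_loci
total_transcripts = (
    transcripts_in_noncoding_loci
    + coding_transcripts
    + noncoding_transcripts_in_coding_loci
)

# Share of transcripts in protein coding loci that are themselves non-coding.
noncoding_share = noncoding_transcripts_in_coding_loci / (
    coding_transcripts + noncoding_transcripts_in_coding_loci
)

print(total_loci, total_transcripts, f"{100 * noncoding_share:.1f}%")
```

More than half of the transcripts annotated in protein coding loci are non-coding, which is what the closing remark on the slide summarizes.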

one gene - many proteins: very complex transcription units

chimerism: tandem transcription / intergenic splicing

KUA and UEV, Thomson et al., Genome Research 2000

systematic search for functional chimeras in ENCODE:
- 165 tandem pairs in the same orientation
- 126 chimeric predictions obtained
- 96 tested, at least 4 positive
Parra et al., Genome Research 2006; Akiva et al., Genome Research 2006
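The first step of such a search, listing neighbouring genes that lie on the same strand, can be sketched as follows. This is a minimal illustration and not the procedure of Parra et al. or Akiva et al.: the `Gene` record, the `tandem_pairs` function, the optional `max_gap` parameter, and the example genes are all hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    # Hypothetical minimal gene record; coordinates are 1-based, inclusive.
    name: str
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def tandem_pairs(genes, max_gap=None):
    """Return (upstream, downstream) pairs of neighbouring genes on the same strand.

    Only genes adjacent in coordinate order are paired, and overlapping
    genes are skipped; both choices are simplifying assumptions.
    """
    pairs = []
    ordered = sorted(genes, key=lambda g: (g.chrom, g.start))
    for left, right in zip(ordered, ordered[1:]):
        if left.chrom != right.chrom or left.strand != right.strand:
            continue
        if right.start <= left.end:
            continue  # overlapping genes are not treated as tandem here
        if max_gap is not None and right.start - left.end > max_gap:
            continue
        # On the minus strand transcription runs right to left,
        # so the gene with the higher coordinates is the upstream one.
        pairs.append((left, right) if left.strand == "+" else (right, left))
    return pairs


# Made-up example genes, not taken from the ENCODE annotation.
genes = [
    Gene("geneA", "chr1", 100, 500, "+"),
    Gene("geneB", "chr1", 800, 1200, "+"),
    Gene("geneC", "chr1", 1500, 2000, "-"),
    Gene("geneD", "chr1", 2300, 2900, "-"),
    Gene("geneE", "chr2", 100, 400, "+"),
]
pair_names = [(up.name, down.name) for up, down in tandem_pairs(genes)]
print(pair_names)
```

Each resulting pair is a candidate for a chimeric transcript that starts in the upstream gene and continues into the downstream one; whether such a transcript exists still has to be tested experimentally, as on the slide.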

Structural Effects of Pepsinogen C Alternative Splice Variant (Michael Tress & Alfonso Valencia, CNB, Madrid)
Locus RP11-298J23.1 codes for pepsinogen C, whose structure is 1htrA. Isoform -003 is missing 80 residues with respect to pepsinogen C; the missing section is shown in light green. Its loss would remove the core from both subdomains of the structure, so both the N-terminal subdomain (on the left) and the C-terminal subdomain would have to refold. The view is from above, looking down into the active-site cleft of the proteinase. The active-site aspartates are shown in ball-and-stick representation, and one of the two lies in the missing section. The symmetry apparent in this isoform suggests that, although it would have to refold, it may well be able to re-form as a single subdomain.

ITGB4B: 11 supporting ESTs (Adam Frankish, Sanger)

GENCODE vs OTHER GENE SETS (panels: ALL EXONS, CODING EXONS)

from the ENCODE Chromatin and Replication Group, John Stamatoyannopoulos

EGASP’05
- the complete annotation of 13 regions was released on January 30
- the annotation of the remaining 31 regions was still being obtained, and it was withheld
- gene prediction groups were asked to submit predictions for the remaining 31 regions by April 15
- 18 groups participated, submitting 30 prediction sets
- predictions were compared to the annotations in an NHGRI-sponsored workshop at the Wellcome Trust Sanger Institute on May 6 and 7

EGASP’05: two main goals
1. to assess how well automatic methods are able to reproduce the (costly) manual/computational/experimental gencode annotation
2. to assess how complete the gencode annotation is: are there still genes consistently predicted by computational methods?

accuracy measures
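The definitions shown on this slide did not survive in the transcript. As a hedged illustration, the sketch below assumes the convention common in gene-prediction evaluations: at the exon level, sensitivity is the fraction of annotated exons predicted with both boundaries exact, and specificity is the fraction of predicted exons that match an annotated exon exactly (what other fields call precision). The function name and the example coordinates are made up.

```python
def exon_level_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the exon level.

    Exons are (chrom, start, end, strand) tuples; a predicted exon counts
    as correct only if it matches an annotated exon exactly.
    """
    annotated, predicted = set(annotated), set(predicted)
    correct = annotated & predicted
    sensitivity = len(correct) / len(annotated) if annotated else 0.0
    specificity = len(correct) / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Made-up coordinates for illustration.
annotated_exons = [
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr1", 500, 650, "+"),
    ("chr1", 800, 900, "+"),
]
predicted_exons = [
    ("chr1", 100, 200, "+"),  # exact match
    ("chr1", 300, 410, "+"),  # wrong 3' boundary, counted as incorrect
    ("chr1", 500, 650, "+"),  # exact match
]

sn, sp = exon_level_accuracy(annotated_exons, predicted_exons)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Requiring exact boundaries makes the exon-level measure strict: the second prediction overlaps an annotated exon almost entirely yet contributes to neither score.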

accuracy at the coding exon level (program categories: evidence-based, dual genome, “ab initio”)

accuracy at the exon level (program categories: evidence-based, dual genome, “ab initio”)

programs are quite good at calling the protein coding exons (accuracy at 80%; not as good at calling the transcribed exons), but the best of the programs correctly predict only 40% of the complete transcripts (considering only the coding fraction)

many novel exons predicted:
- 8,634 unique exons predicted in intergenic regions
- we ranked the exons according to the accuracy of the predicting programs
- tested 238 exon pairs by RT-PCR in 24 tissues
- only 7 (less than 3%) were confirmed positive

TRANSCRIPTION OF PROCESSED POLY A+ RNA, based on a number of high throughput technologies
Total # of nucleotides: 29,998,060; non repeat masked: 14,707,189
Source | Nb of nucleotides covered | % nucleotides covered
Annotated exons | 1,650,821 | 9.8%
transfrag/tar | 1,278,588 | 9.3%
CAGE Tags* | 151,149 | 0.5%
Ditags* | 24,939 | 0.1%
TOTAL UNIQUE Transcribed Bases | 2,355,238 | 14.7%
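The "TOTAL UNIQUE" figure is smaller than the sum of the individual rows because a base detected by several technologies is counted only once. A minimal sketch of that kind of count, assuming half-open `(start, end)` intervals on a single sequence: the helper names and the toy intervals are illustrative and do not reproduce the ENCODE data.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extends (or sits inside) the previous merged block.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bases(intervals):
    """Number of distinct bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


# Toy evidence tracks on one sequence (made-up coordinates).
evidence = {
    "annotated_exons": [(100, 200), (300, 400)],
    "transfrags": [(150, 250), (600, 650)],
    "cage_tags": [(95, 115)],
}

per_source = {name: covered_bases(ivs) for name, ivs in evidence.items()}
all_intervals = [iv for ivs in evidence.values() for iv in ivs]
unique_total = covered_bases(all_intervals)

print(per_source, sum(per_source.values()), unique_total)
```

In the toy data the per-source counts sum to more than the unique total, the same pattern as in the table above.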

cell specific transcription

Table 1: Summary of Transcriptional Coverage of ENCODE Regions.
Column | INTERROGATED | TOTAL (interrogated and uninterrogated)
1. Total Bases | |
2. % Total Interrogated Bases | |
PROCESSED TRANSCRIPTS (PT)
3. bp in Exons (%)* | 9.8% | 5.9%
4. bp in CAGE tags (%)* | 0.8% | 0.5%
5. bp in PET (%)* | 0.1% | 0.1%
6. bp in TF (%)* | 9.3% | 4.6%
7. Total Bases in PT (%)* | 14.7% | 8.4%
8. Bases in PT (ESTs included) (%)* | 24.1% | 16.1%
PRIMARY TRANSCRIPTS
9. Bases in Exons and Introns (%)* | 64.6% | 59.2%
10. Bases with 5'RACE (%)* | 80.0% | 77.7%
11. Bases between PETs (%)* | 66.4% | 65.5%
12. Total Bases (%)* | 92.6% | 91.1%

tiling arrays reveal many novel sites of transcription. TRANSCRIPTION MAP of HL-60 DEVELOPMENTAL TIME COURSE (data by Tom Gingeras, Affymetrix)

SENSITIVITY OF GENCODE: genome tiling arrays
- more than 6,000 unique transfrags corresponding to unannotated sites of transcription (from 11 cell lines/conditions)
- 4044 unique intronic transfrags
  - 3572 predicted into 1105 alternatively skipped protein coding exons
  - 240 tested by RT-PCR. Results this week!

characteristics of unannotated transfrags
- short: 78 bp on average, compared with 121 bp for exonic transfrags
- very GC-rich: 56% vs 42% in the background of unannotated regions
- lack splice sites
- no matches to protein or domain databases
- lack of selective constraints
HOWEVER:
- reproducible across cell lines
- supported by independent evidence of transcription (mostly unspliced ESTs)
- enriched for RNA structures
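The first two characteristics, length and GC content, are simple per-sequence summaries. A minimal sketch of how they might be computed, assuming the transfrag sequences are available as plain strings; the function names and the toy sequences are illustrative and are not ENCODE data.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    if unambiguous == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / unambiguous


def mean_length(seqs):
    """Average sequence length in bp; 0.0 for an empty collection."""
    return sum(len(s) for s in seqs) / len(seqs) if seqs else 0.0


# Toy sequences for illustration only.
gc_rich = "GCGCATGCGG"
at_rich = "ATGCATATNN"  # the two Ns are ignored in the GC fraction

print(gc_fraction(gc_rich), gc_fraction(at_rich), mean_length([gc_rich, at_rich]))
```

Ignoring ambiguous bases in the denominator is a design choice; dividing by the full length instead would lower the GC fraction of sequences that contain Ns.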

RACEarray experiments
- 5’ RACE on 12 tissues
- primers in internal exons of 399 protein coding loci
- RACE products hybridized onto genome tiling arrays
  - 4573 novel RACE exons detected
  - only 15% corresponding to unannotated transfrags

the RACE/array experiments
Denoeud et al., “Prominent use of distal 5’ transcription start sites and discovery of a large number of additional exons in ENCODE regions”, accepted for publication in Genome Research
- 5’ RACE on 12 tissues
- primers in internal exons of 399 protein coding loci
- RACE products hybridized onto genome tiling arrays
  - 4573 novel RACE exons detected

5’ RACE/array of C6ORF148

5’ RACE from the target gene TMEM15 (region ENr232) identifies several tissue-specific distal 5’ exons.

distal RACEfrags are associated with independently predicted sites of transcription initiation

cloning and sequencing of RACEarray products

cloning and sequencing of RACEarray products: almost 30% of the sequenced products incorporate exons from upstream genes in chimeric structures

RT-PCR/arrays, cloning and sequencing
- 136 novel transcripts (29 chimeric) in 69 loci
- 71 potential new CDS in 37 loci (14 chimeric)
- 225 novel exons

CONCLUSIONS
- there is a substantial amount of transcription which does not appear to be associated with protein coding loci
- only a fraction of the transcript diversity of protein coding loci appears to have been surveyed so far
  - in particular, protein coding loci appear to have tissue-specific distal alternative transcriptional start sites
- ENCODE transcriptional landscape: a network of overlapping coding and non-coding transcripts, resulting in a continuum of transcription (more than 90% of the ENCODE regions are transcribed in at least one strand)

ACKNOWLEDGEMENTS: ENCODE GT GROUP
Stylianos Antonarakis (Geneva), Robert Baertsch (UCSC), Ian Bell (Affx), Ewan Birney (EBI), Robert Castelo (IMIM), Jill Cheng (Affx), Evelyn Cheung (Affx), Hiram Clawson (UCSC), France Denoeud (IMIM), Sujit Dike (Affymetrix), Jorg Drenkow (Affymetrix), Olof Emanuelsson (Yale), Paul Flicek (Sanger), Mark Gerstein (Yale), Srinka Ghosh (Affx), Jenn Harrow (Sanger), Greg Helt (Affx), Ivo Hofacker (U. Vienna), Tim Hubbard (Sanger), Phil Kapranov (Affx), Damian Keefe (EBI), Jan Korbel (Yale), Julien Lagarde (IMIM), Jeff Long (Affx), Todd Lowe (UCSC), G. Madhavan (Affx), Anton Nekrutenko (Penn State), David Nix (Affx), Jakob Pedersen (UCSC), Alex Reymond (Geneva), Joel Rozowsky (Yale), Yijun Ruan (GIS), Albin Sandelin (RIKEN), Mike Snyder (Yale), Peter F. Stadler (U. Vienna), Kevin Struhl (Harvard), Hari Tammana (Affx), Scott Tenenbaum (SUNY, Albany), Chia Lin Wei (GIS), Matt Weirauch (UCSC), Deyou Zheng (Yale), Adam Frankish (Sanger), Tom Gingeras (Affymetrix), Roderic Guigó (CRG)
