1
Gene prediction in ENCODE
Roderic Guigó i Serra, CRG-IMIM-UPF, Barcelona
Advanced Bioinformatics, CHSL, October 2005
2
6/1/2015 Advanced Bioinformatics CHSL, 2005 2
3
1% of the genome, 44 regions.
Target selection: a committee selected the sequence targets.
–manual targets: a lot of information available
–random targets: stratified by non-exonic conservation with mouse and by gene density
5
Long-range regulatory elements (enhancers, repressors/silencers, insulators)
Cis-regulatory elements (promoters, transcription factor binding sites)
DNA Replication
DNase Hypersensitive Sites
Genes and Transcripts
Epigenetic
6
GENCODE: encyclopedia of genes and gene variants
Roderic Guigó, IMIM-UPF-CRG
Stylianos Antonarakis, Geneva
Alexandre Reymond
Ewan Birney, EBI
Michael Brent, WashU
Lior Pachter, Berkeley
Manolis Dermitzakis, Sanger
Jennifer Ashurst, Tim Hubbard
Goal: identify all protein-coding genes in the ENCODE regions:
–identify one complete mRNA sequence for at least one splice isoform of each protein-coding gene
–eventually, identify a number of additional alternative splice forms
7
The GENCODE annotation pipeline
–manual curation: HAVANA (Sanger)
–experimental verification: Geneva
–bioinformatics: IMIM
8
Comparison with other gene sets
[Figure panels: all exons; coding exons]
9
From the ENCODE Chromatin and Replication Group, John Stamatoyannopoulos
10
One gene, many proteins
Very complex transcription units
11
Chimeras: tandem transcription / intergenic splicing
12
KUA and UEV, Thomson et al., Genome Research 2000
13
Systematic search for functional chimeras in ENCODE:
–165 tandem pairs in the same orientation
–126 chimeric predictions obtained
–96 tested, at least 4 positive
Parra et al., Genome Research, in press
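The first step of such a search, listing tandem gene pairs in the same orientation, can be sketched as follows. This is only an illustration and not the procedure of Parra et al.: the `Gene` record, the gene names, and the rule that a tandem pair means two consecutive, non-overlapping genes on the same chromosome and strand are all assumptions.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    # Hypothetical minimal gene record; coordinates are 1-based, inclusive.
    name: str
    chrom: str
    start: int
    end: int
    strand: str


def tandem_pairs(genes):
    """Return names of consecutive, non-overlapping gene pairs that share
    chromosome and strand (an assumed definition of a tandem pair)."""
    ordered = sorted(genes, key=lambda g: (g.chrom, g.start))
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        same_chrom = left.chrom == right.chrom
        same_strand = left.strand == right.strand
        no_overlap = left.end < right.start
        if same_chrom and same_strand and no_overlap:
            pairs.append((left.name, right.name))
    return pairs


# Toy data, invented for the example.
genes = [
    Gene("geneA", "chr1", 100, 500, "+"),
    Gene("geneB", "chr1", 800, 1200, "+"),
    Gene("geneC", "chr1", 1500, 2000, "-"),
    Gene("geneD", "chr2", 50, 300, "-"),
]
pairs = tandem_pairs(genes)
print(pairs)
```

Only geneA/geneB qualifies here: geneB/geneC differ in strand, and geneC/geneD lie on different chromosomes.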
14
EGASP’05
The complete annotation of 13 regions was released on January 30.
–The annotation of the remaining 31 regions was still being produced, and it was withheld.
Gene prediction groups were asked to submit predictions for the remaining 31 regions by April 15.
–18 groups participated, submitting 30 prediction sets.
Predictions were compared to the annotations at an NHGRI-sponsored workshop at the Wellcome Trust Sanger Institute on May 6 and 7.
17
EGASP’05: two main goals
1. To assess how well automatic methods can reproduce the (costly) manual/computational/experimental GENCODE annotation.
2. To assess how complete the GENCODE annotation is: are there still genes consistently predicted by computational methods that are missing from it?
20
Accuracy measures
21
Accuracy at the exon level: coding exons
18 groups participated, submitting 30 prediction sets:
–evidence-based
–dual genome
–“ab initio”
22
Accuracy at the exon level: all exons
18 groups participated, submitting 30 prediction sets:
–evidence-based
–dual genome
–“ab initio”
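The definitions shown on the accuracy slides did not survive in this transcript. As a rough sketch, exon-level accuracy in gene prediction is commonly reported as sensitivity (the fraction of annotated exons predicted with both boundaries exact) and specificity (the fraction of predicted exons that exactly match an annotated exon). The function name and the toy coordinates below are assumptions made for illustration, and EGASP may have used a different formulation.

```python
def exon_accuracy(annotated, predicted):
    """Exon-level sensitivity and specificity under exact-boundary matching.

    Exons are (chrom, start, end, strand) tuples; an exon counts as correct
    only if the identical tuple appears in both sets.
    """
    annotated, predicted = set(annotated), set(predicted)
    correct = annotated & predicted
    sensitivity = len(correct) / len(annotated) if annotated else 0.0
    specificity = len(correct) / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy data, invented for the example.
annotated = [
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr1", 500, 600, "+"),
    ("chr1", 700, 800, "+"),
]
predicted = [
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 410, "+"),  # wrong 3' boundary: not counted as correct
    ("chr1", 500, 600, "+"),
]
sn, sp = exon_accuracy(annotated, predicted)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

Two of the four annotated exons are recovered exactly, and two of the three predictions are correct. A near miss such as the shifted boundary contributes nothing under exact matching.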
23
Programs are quite good at calling protein-coding exons (accuracy around 80%), and not as good at calling all transcribed exons. But the best programs correctly predict only 40% of the complete CDS exonic structures, and in about 30% of the cases they predict none of the CDS exonic structures correctly.
24
Programs are quite good at calling protein-coding exons (accuracy around 80%), and not as good at calling all transcribed exons. But the best programs correctly predict only 40% of the complete transcripts (considering only the coding fraction), and in about 30% of the cases they predict none of the CDS exonic structures correctly.
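The gap between exon-level and transcript-level accuracy follows from how strict the transcript-level criterion is: a single wrong boundary invalidates the whole structure. A minimal sketch, assuming that a CDS exonic structure counts as correct only when its full ordered chain of exon coordinates matches a prediction exactly, and that annotated structures are grouped by gene (both are assumptions, as are all names and coordinates):

```python
def transcript_level(annotated_by_gene, predicted_structures):
    """Return (fraction of annotated CDS structures reproduced exactly,
    fraction of genes with no structure reproduced).

    A structure is a sequence of (start, end) CDS exons on one strand.
    """
    predicted = {tuple(s) for s in predicted_structures}
    total = hits = genes_missed = 0
    for structures in annotated_by_gene.values():
        matches = [tuple(s) in predicted for s in structures]
        total += len(matches)
        hits += sum(matches)
        if not any(matches):
            genes_missed += 1
    n_genes = len(annotated_by_gene)
    return (hits / total if total else 0.0,
            genes_missed / n_genes if n_genes else 0.0)


# Toy data, invented for the example.
annotated_by_gene = {
    "gene1": [[(100, 200), (300, 400)], [(100, 200), (350, 400)]],
    "gene2": [[(1000, 1100), (1200, 1300), (1400, 1500)]],
}
predicted_structures = [
    [(100, 200), (300, 400)],                      # exact match for gene1
    [(1000, 1100), (1200, 1310), (1400, 1500)],    # one boundary off in gene2
]
frac_structures, frac_genes_missed = transcript_level(
    annotated_by_gene, predicted_structures)
print(frac_structures, frac_genes_missed)
```

In the toy data the gene2 prediction gets two of its three exons exactly right, yet the gene still counts as completely missed at the structure level.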
25
The issue of completeness
26
Many novel exons are predicted: we will prioritize a few hundred for experimental verification using RACE + RT-PCR, although our experiment in the 13 regions suggests that only a few of them are likely to be real.
27
Many computational predictions outside of the annotation
In 13 ENCODE regions:
–1255 unique predicted introns (exon pairs) in one or more of the 9 UCSC gene prediction tracks are not annotated
–334 (27%) are outside annotations (could correspond to novel genes)
28
Many computational predictions outside of the annotation
In 13 ENCODE regions:
–1255 unique predicted introns (exon pairs) in one or more of the 9 UCSC gene prediction tracks are not annotated
–334 (27%) are outside annotations (could correspond to novel genes)
All tested by RT-PCR on 24 tissues:
–25 (2.0%) confirmed by RT-PCR in 24 tissues
–16 (1.2%) with correctly predicted intron junctions
–3 (0.2%) outside annotations (1% confirmation)
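The percentages on this slide can be recomputed from the counts it gives. The sketch below uses only those counts; the variable names and the helper `pct` are made up for the example. Note that 16/1255 comes to about 1.27%, so it rounds to 1.3% rather than the 1.2% quoted, which may reflect truncation on the original slide.

```python
# Counts as given on the slide.
unannotated_introns = 1255   # predicted introns not in the annotation
outside_annotation = 334     # of those, falling outside any annotated gene
confirmed = 25               # confirmed by RT-PCR
correct_junctions = 16       # confirmed with correctly predicted junctions
confirmed_outside = 3        # confirmed and outside annotations


def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


print("outside annotations:       ", pct(outside_annotation, unannotated_introns))
print("confirmed:                 ", pct(confirmed, unannotated_introns))
print("correct junctions:         ", pct(correct_junctions, unannotated_introns))
print("confirmed, outside (all):  ", pct(confirmed_outside, unannotated_introns))
print("confirmed, outside (of 334):", pct(confirmed_outside, outside_annotation))
```

The last line is the "1% confirmation" figure: 3 confirmed out of the 334 predictions outside annotations.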
29
Overview of the verification efforts II
AFFX-GENCODE: novel regions
40 intergenic transfrags from the HL60 cell line that overlap GENCODE gene predictions
–20 overlapping gene predictions with no verification attempted by GENCODE
–20 overlapping gene predictions where verification by GENCODE was negative
40 intergenic GENCODE gene predictions that do not overlap HL60 transfrags
–20 where no verification was attempted by GENCODE
–20 where verification by GENCODE was negative
(slide by Phil Kapranov, Affymetrix)
30
Some preliminary stats on the 80 regions: 3’ RACE only
Gene predictions overlapping transfrags: total 39 (1 of the 40 is a duplicated transfrag)
–27 (69%) are positive in HL60 and 31 (80%) in HepG2 in the 3’ RACE assays
Gene predictions not overlapping transfrags: total 38 (2 of the 40 are outside the regions where we have probes on the ENCODE array)
–18 (47%) are positive in HL60 and 25 (66%) in HepG2 in the 3’ RACE assays
(slide by Phil Kapranov, Affymetrix)
31
3’ RACE based on a predicted exon, ENr131_egasp_224555_224677, identifies new major and minor exons (shown by arrows in the figure) of the gene BC042133, in the HepG2 cell line only. There is good correspondence between RACE exons and GenScan exons.
[Figure tracks: HepG2 3’ RACE bottom strand; HepG2 3’ RACE top strand; GenScan]
32
High-throughput, genome-wide, unbiased techniques for interrogating transcription
[Table; cell values lost in extraction. Columns: transcription, strand, connectivity, structure. Rows: transfrags, CAGE tags, ditags, genes]
The ENCODE Genes and Transcripts Group:
–transfrags: Tom Gingeras (Affymetrix) and Mike Snyder (Yale)
–CAGE tags: Albin Sandelin (RIKEN)
–ditags: Yijun Ruan (Genome Institute of Singapore)
33
Proteasome (prosome, macropain) 26S subunit, non-ATPase, 4 (inhibits cholera-induced intestinal fluid secretion) Chrom 2
34
Protein-coding genes are only a fraction of the transcription detected in ENCODE
Total number of nucleotides: 29,409,540

Feature                               Nucleotides covered   % nucleotides covered
Annotated exons                       1,624,326             5.5%
Transfrags/TARs (Affymetrix, Yale)    2,699,256             9.2%
CAGE (RIKEN)                          146,588               0.5%
Ditags (GIS)                          26,528                0.1%
TOTAL UNIQUE                          3,534,868             12.0%
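The "TOTAL UNIQUE" row is smaller than the sum of the rows above it because nucleotides covered by more than one feature type are counted once. One way such a figure can be obtained is by merging overlapping intervals before summing their lengths. This is a sketch under assumptions: the function name and toy intervals are invented, coordinates are taken as half-open (start, end) on a single sequence, and the slides do not say how the original numbers were computed.

```python
def covered_nucleotides(intervals):
    """Count positions covered by at least one half-open (start, end) interval."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the current merged block.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Toy data, invented for the example.
exons = [(0, 100), (200, 300)]
transfrags = [(50, 150), (250, 260)]

separate = covered_nucleotides(exons) + covered_nucleotides(transfrags)
unique = covered_nucleotides(exons + transfrags)
print(separate, unique)

# Percentage for the slide's total row, from the numbers given there.
total_unique_pct = round(100 * 3_534_868 / 29_409_540, 1)
print(total_unique_pct)
```

In the toy data the two feature sets cover 310 positions when summed separately but only 250 unique positions once overlaps are merged.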
35
Transcription (apparently) not associated with protein-coding genes
Transcription map of the HL-60 developmental time course (data by Tom Gingeras, Affymetrix)
36
Threading transfrags into protein-coding genes
Inferring novel protein-coding genes from transfrags
51
http://genome.imim.es/gencode
ENCODE
IMIM: France Denoeud, Julien Lagarde, Josep F. Abril, Robert Castelo, Eduardo Eyras
Geneva: Stylianos Antonarakis, Alexandre Reymond, Catherine Ucla
EBI: Ewan Birney, Damian Keefe, Paul Flicek
WashU: Michael Brent
Berkeley: Lior Pachter
Sanger: Manolis Dermitzakis
HAVANA (Sanger): Jennifer Ashurst, Tim Hubbard, Adam Frankish, David Swarbreck, James Gilbert
Affymetrix: Tom Gingeras, Sujit Dike, Phil Kapranov
EGASP’05: Michael Ashburner, Vladimir Bajic, Suzanne Lewis, Martin Reese, Peter Good, Elise Feingold