gene prediction roderic guigó i serra IMIM/UPF/CRG
number of genes in chromosome 22
initial annotation: 545 (Dunham et al., 1999)
genscan+RT-PCR: 590 (Das et al., 2001)
genscan+microarrays: 730 (Shoemaker et al., 2001)
reviewed annotation: 726 (chr22 team, Sanger, 2001)
mouse shotgun data: +20 (our data)
geneid predictions: 794
genscan predictions: 1128
number of genes in human genome
Consortium
Celera
Consortium+Celera (Hogenesch et al.)
DB searches (Wright et al., 2001)
Human Genome Sciences (Haseltine, 2001)
decoding the genome
ACTCAGCCCCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGAAGCGCAGTCGGGGGCACGGGGATGAGCTCAGGGGCCTCTAGAAAGAT
GTAGCTGGGACCTCGGGAAGCCCTGGCCTCCAGGTAGTCTCAGGAGAGCTACTCAGGGTCGGGCTTGGGGAGAGGAGGAGCGGGGGTGAGGCCAGCA
GCAGGGGACTGGACCTGGGAAGGGCTGGGCAGCAGAGACGACCCGACCCGCTAGAAGGTGGGGTGGGGAGAGCATGTGGACTAGGAGCTAAGCCACA
GCAGGACCCCCACGAGTTGTCACTGTCATTTATCGAGCACCTACTGGGTGTCCCCAGTGTCCTCAGATCTCCATAACTGGGAAGCCAGGGGCAGCGA
CACGGTAGCTAGCCGTCGATTGGAGAACTTTAAAATGAGGACTGAATTAGCTCATAAATGGAAAACGGCGCTTAAATGTGAGGTTAGAGCTTAGAAT
GTGAAGGGAGAATGAGGAATGCGAGACTGGGACTGAGATGGAACCGGCGGTGGGGAGGGGGAGGGGGTGTGGAATTTGAACCCCGGGAGAGAAAGAT
GGAATTTTGGCTATGGAGGCCGACCTGGGGATGGGGAAATAAGAGAAGACCAGGAGGGAGTTAAATAGGGAATGGGTTGGGGGCGGCTTGGTAACTG
TTTGTGCTGGGATTAGGCTGTTGCAGATAATGGAGCAAGGCTTGGAAGGCTAACCTGGGGTGGGGCCGGGTTGGGGTCGGGCTGGGGGCGGGAGGAG
TCCTCACTGGCGGTTGATTGACAGTTTCTCCTTCCCCAGACTGGCCAATCACAGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATT
CCTGGCAGGTATGGGGCGGGGCTTGCTCGGTTTTCCCCGCTTCTCCCCCTCTCATCCTCACCTCAACCTCCTGGCCCCATTCAAGCACACCCTGGGC
CCCCTCTTCTTCTGCTGGTCTGTCCCCTGAGGGGAAAGCCCAGGTCTGAGGCTTCTATGCTGCTTTCTGGCTCAGAACAGCGATTTGACGCTCTGTG
AGCCTCGGTTCCTCCCCCGCTTTTTTTTTTTCAGCCAGAGTCTCACTCTGTCGCCCAGGCTGGAGTGCAGTGGCGCAATCTCAGCTCACTGCAAGCT
CCGCCTCCCGGGTTCACGCTATTCTCCCGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCGCCCGCCACCATGCCCGGCTAATTTTTTGTACTTTG
AGTAGGGAAGGGGTTTCACTGTATTATCCAGGATGGTCTCTATCTCCTGACCTCGTGATCTGCCCGCCTGGCCTCCCAAAGTGCTGGAATTACAGGC
GTGAGCCTCCGCGCCCGGCCTCCCCATCCTTAATATAGGAGTTAGAAGTTTTTGTTTGTTTGTTTTGTTTTGTTTTTGTTTTGTTTTGAGATGAAGT
CCCTCTGTCGCCCAGGCTGGAGTGCAGTGGCTCCCAGGCTGGAGTTCAGTGGCTGGATCTCGGCTCACTGCAAGCTCCGCCTCCCAGGTTCACGCCA
TTCTCCTGCCTCAGCCTCCGGAGTAGCTGGGACTACAGGAACATGCCACCACACCCGACTAACTTTTTTTGTATTTTTAGTAGAGACGGGGTTTCAC
CATGTTGGCCAGGCTGGTCTGGAACTCCTGACCTCAGGTGATCTGCCTGCTTCAACCTCCCAAAGTGCTGGGATTACAGACGTGGGCCACCGCGCCC
GGCTGGGAGTTAAGAGGTTTCTAATGCATTGCATTAGAATACCAGACACGGGACAGCTGTGATCTTTATTCTCCATCACCCCACACAGCCCTGCCTG
GGGCACACAAGGACACTCAATACACGCTTTTCGGGCGCGGTGGCTCAAGCTGTAATCCCAGCACTTTGGGAGGCTGAGGCGGGTGGTACATGAGGTC
AGGAGATCGAGACCATCCTGGCTAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAAACTAGCCCGGGCGTGGTGGCGGGCGCCTGTAGTCCC
AGCTACTCGGAGGCTGAGGCAGGAGAATGGCGTGAACCTGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGTGACA
CAGCGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATACACGCTTTTCCGCTAGGCACGGTGGCTCACCCCTGTAATCCCAGCA
TTTTGGGAGGCCAAGGTGGGAGGATCACTTGAGCCCAGGAGTTCAACACCAGACTCAGCAACATAGTGAGACTCTCTCTACTAAAAATACAAAAATT
AGCCAGGCCTGGTGCCACACACCTGTGGTCCCAGCTACTCAGAAGGCTAAGGCAGGAGGATCGCTTAAGCCCAGAAGGTCAAGGTTGCAGTGAACCA
CGTTCAGGCCACTGCAGTCCAGCCTGGGTGACAGAGCAAGACCCTGTCTGTAAATAAATAACGCTTTTCAAGTGATTAAACAGACTCCCCCCTCACC
CTGCCCACCATGGCTCCAAAGCAGCATTTGTGGAGCACCTTCTGTGTGCCCCTAGGTACTAGCTGCCTGGACGGGGTCAGAAGGAACCTGAACCACC
TTCAACTTGTTCCACACAGGATGCCAGGCCAAGGTGGAGCAACCGGTGGAGCCAGAGACAGAACCCGACGTTCGCCAGCAGGCTGAGTGGCAGAGCG
GCCAGCCCTGGGAGCTGGCACTGGGTCGCTTTTGGGATTACCTGCGCTGGGTGCAGACACTGTCTGAGCAGGTGCAGGAGGAGCTGCTCAGCCCCCA
GGTCACCCAGGAACTGACGTGAGTGTCCCCATCCCGGCCCTTGACCCTCCTGGTGGGCGGCTATACCTCCCCAGGTCCAGGTTTCATTCTGCCCCTG
CCACTAAGTCTTGGGGGCCTGGGTCTCTGCTGGTTCTAGCTTCCTCTTCCCATTTCTGACTCCTGGCTTTAGCTCTCTGGAATTCTCTCTCTCAGTT
CTGTTTCTCCCTCTTCCCTTCTGACTCAGCCTGTCACACTCGTCCTGGCGCTGTCTCTGTCCTTCACTAGCTCTTTTATATAGAGACAGAGAGATGG
GGTCTCACTGTGTTGCCCAGGCTGGTCTTGAACTTCTGGGCTCAAGCGATCCTCCCACCTCGCCTCCCAAAGTGCTGGGAATAGAGACATGAGCCAC
CTTGCTCGGCCTCCTAGCTCTTTCTTCGTCTCTGCCTCTGCTCTCTGCGTCTGTCTTTGTCTCCTCTCTGCCTCTGTCCCGTTCCTTCTCTCTTGGT
TCACTGCCCTTCTGTCTCTCCCTGTTCTCCTTAGGAGACTCTCCTCTCTTCCTTCTCGAGTCTCTCTGGCTGATCCCCATCTCACCCACACCTATCC
the human genome sequence
QIKDLLVSSSTDLDTTLVLVNAIYFKGMW
KTAFNAEDTREMPFHVTKQESKPVQMMCM
NNSFNVATLPAEKMKILELPFASGDLSML
VLLPDEVSDLERIEKTINFEKLTEWTNPN
TMEKRRVKVYLPQMKIEEKYNLTSVLMAL
GMTDLFIPSANLTGISSAESLKISQAVHG
AFMELSEDGIEMAGSTGVIEDIKHSPESE
QFRADHPFLFLIKHNPTNTIVYFGRYWSP
the amino acid sequence of the proteins
gene structure
exons, introns, upstream regulatory element, downstream regulatory element, promoter
from DNA to RNA
from RNA to protein
molecular mechanism
Prediction of splice sites
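The slide gives only the heading, so the following is a minimal sketch of one common way to score candidate donor splice sites: a position weight matrix (PWM) of log-odds scores. It is not claimed to be the method used by geneid or any other program named in this talk. The training sites, the uniform background, the window layout (3 exon bases followed by 6 intron bases) and all function names are assumptions made for illustration.

```python
import math

# Toy aligned donor sites: 3 exon bases + 6 intron bases.
# Invented for illustration; these are not real annotated sites.
DONOR_SITES = [
    "CAGGTAAGT",
    "AAGGTGAGT",
    "CAGGTAAGA",
    "GAGGTGAGC",
    "CAGGTATGT",
]
# Assumed uniform background composition.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def build_pwm(sites, pseudocount=1.0):
    """Return one dict per position holding log2-odds scores for A, C, G, T."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {b: math.log2((counts[b] / total) / BACKGROUND[b]) for b in "ACGT"}
        )
    return pwm


def score_site(pwm, window):
    """Sum the per-position log-odds scores of a window as wide as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan_donors(pwm, sequence, threshold=0.0, exon_bases=3):
    """Return (intron_start, score) for GT-containing windows scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Only consider windows with the canonical GT at the intron start.
        if window[exon_bases:exon_bases + 2] != "GT":
            continue
        # Skip windows containing ambiguous bases such as N.
        if any(base not in "ACGT" for base in window):
            continue
        score = score_site(pwm, window)
        if score >= threshold:
            hits.append((start + exon_bases, score))
    return hits
```

A call such as `scan_donors(build_pwm(DONOR_SITES), some_sequence)` returns candidate intron start positions with their scores; a real predictor would train on many annotated sites and usually models dependencies between positions as well.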
accuracy of gene prediction programs
comparative gene prediction
rosetta (Batzoglou et al., 2000)
cem (Bafna and Huson, 2000)
sgp1 (Wiehe et al., 2000)
twinscan (Korf et al., 2001)
slam (Pachter et al., 2001)
sgp2 (Guigó et al., in preparation)
syntenic gene prediction (sgp2)
query sequence -> tblastx -> HSPs -> HSP projections
query sequence -> geneid -> exons
HSP projections + geneid exons -> SGP exons
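The diagram names an "HSP projection" step that is combined with geneid exons to give SGP exons. The sketch below illustrates one plausible reading of that step: overlapping tblastx HSPs are collapsed onto the query axis by keeping the best score at each position, and each candidate exon then receives a similarity bonus. The slide does not give the actual sgp2 scoring scheme, so the `HSP` and `Exon` classes, the `weight` parameter and the bonus formula are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HSP:
    start: int    # 0-based, inclusive, on the query sequence
    end: int      # 0-based, exclusive
    score: float  # similarity score reported by the search


@dataclass
class Exon:
    start: int    # 0-based, inclusive
    end: int      # 0-based, exclusive
    score: float  # score from the ab initio predictor


def project_hsps(hsps, query_length):
    """Collapse overlapping HSPs: keep the best score at each query position."""
    projection = [0.0] * query_length
    for hsp in hsps:
        for pos in range(max(0, hsp.start), min(query_length, hsp.end)):
            projection[pos] = max(projection[pos], hsp.score)
    return projection


def rescore_exon(exon, projection, weight=0.2):
    """Add a bonus equal to weight times the mean projected score under the exon.

    Hypothetical formula for illustration only, not the sgp2 scoring function.
    """
    covered = projection[exon.start:exon.end]
    length = max(1, exon.end - exon.start)
    bonus = weight * sum(covered) / length
    return Exon(exon.start, exon.end, exon.score + bonus)
```

Under these assumptions, an exon with no similarity support keeps its original score, while an exon fully covered by a strong HSP is promoted, so the downstream gene assembly step would tend to prefer conserved exons.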
benchmarking sgp2 - accuracy scimog mit
predicting “novel” genes in the human genome
golden path annotations
additional blastn matches to ENSEMBL + REFSEQ
tblastx
geneid exons
tblastx
sgp genes
Golden Path Oct 7, 2000 freeze, RepeatMasked
TraceDB, as of February 2001
“novel” genes?
48,890 genic regions (known genes or similar)
15,489 genes longer than 100 aa predicted by sgp
13,302 non-redundant predictions
8,416 supported by tblastx hits to mouse 1.5
3,331 predicted genes with at least two exons supported by tblastx hits
predicted genes supported by tblastx hits covering at least 75% of the prediction
4,050 supported sgp predictions
25% of them not overlapping genscan predictions
validation of predictions
EST identity: 18%
NR similarity: 31%
CDD (NCBI): 24%
Mouse ESTs: 28%
Rat ESTs: 19%
Tetraodon: 15%
at least one of the above: 56%
Experimental validation
human genome vs. mouse TraceDB
chr22, chr21
human genome vs. mouse assemblies
columns: SN, SP, CC, SNe, SPe, SNSP, ME, WE
rows: chr22.assem, chr22.shot
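The column names on this slide match the accuracy measures commonly used to benchmark gene finders, although the slide does not define them. The sketch below computes them under the usual definitions, which are assumed here: at the nucleotide level, SN = TP/(TP+FN), SP = TP/(TP+FP) and CC is the correlation coefficient; at the exon level, SNe and SPe are the fractions of actual and predicted exons that match exactly, SNSP is their average, ME is the fraction of actual exons overlapped by no prediction, and WE is the fraction of predicted exons overlapping no actual exon. Function names and the interval convention (end exclusive) are illustrative choices.

```python
import math


def nucleotide_accuracy(actual, predicted, length):
    """Return (SN, SP, CC) from lists of (start, end) coding intervals, end exclusive."""
    real = set()
    for start, end in actual:
        real.update(range(start, end))
    pred = set()
    for start, end in predicted:
        pred.update(range(start, end))
    tp = len(real & pred)
    fp = len(pred - real)
    fn = len(real - pred)
    tn = length - tp - fp - fn
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else 0.0
    return sn, sp, cc


def overlaps(a, b):
    """True if two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def exon_accuracy(actual, predicted):
    """Return a dict with SNe, SPe, SNSP, ME and WE for exact exon matching."""
    actual_set, pred_set = set(actual), set(predicted)
    correct = len(actual_set & pred_set)
    missed = sum(
        1 for a in actual_set if not any(overlaps(a, p) for p in pred_set)
    )
    wrong = sum(
        1 for p in pred_set if not any(overlaps(p, a) for a in actual_set)
    )
    sne = correct / len(actual_set) if actual_set else 0.0
    spe = correct / len(pred_set) if pred_set else 0.0
    return {
        "SNe": sne,
        "SPe": spe,
        "SNSP": (sne + spe) / 2,
        "ME": missed / len(actual_set) if actual_set else 0.0,
        "WE": wrong / len(pred_set) if pred_set else 0.0,
    }
```

Note that SP in this convention is what other fields call precision rather than TN/(TN+FP); the choice matters because non-coding positions vastly outnumber coding ones in genomic sequence.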
testing novel predictions experimentally
chr22 chr21 776 Predicted known low complexity -5 -26 short intronless
In total 81 predictions. For 40 of them, adjacent exon pairs were selected for RT-PCR.
preliminary results
                              N    Success rate
Positive controls
refseq                        78   96%
Known tissue specific genes   20   25%
Low expressing genes          13   Not ready
Twinscan with EST support          Not ready
Test sets
Twinscan                           Not ready
SGP                           40   28%
acknowledgments
IMIM-UPF-CRG, Barcelona: Josep F. Abril, Genís Parra, Roderic Guigó
GlaxoSmithKline, King of Prussia: Pankaj Agarwal
Max Planck Institute for Chemical Ecology, Jena: Thomas Wiehe
Whitehead Institute/MIT Center for Genome Research, Cambridge: Gwen Acton, Dan Brown, Kerstin
Mouse Sequence Consortium