Published by Mercy Montgomery. Modified over 9 years ago.
1
finding genes by comparing genomes
Roderic Guigó i Serra
IMIM/UPF/CRG, Barcelona
2
number of genes on chromosome 22
initial annotation: 545 (Dunham et al., 1999)
genscan + RT-PCR: 590 (Das et al., 2001)
genscan + microarrays: 730 (Shoemaker et al., 2001)
reviewed annotation: 726 (chr22 team, Sanger, 2001)
mouse shotgun data: +20 (our data)
geneid predictions: 794
genscan predictions: 1128
3
number of genes in the human genome
Consortium: 30,000-40,000 (2001)
Celera: 27,000-38,000 (2001)
Consortium + Celera: 50,000 (Hogenesch et al., 2001)
DB searches: 65,000-75,000 (Wright et al., 2001)
Human Genome Sciences: 90,000-120,000 (Haseltine, 2001)
4
sequence conservation and coding function
6
comparative gene prediction
rosetta (Batzoglou et al., 2000)
cem (Bafna and Huson, 2000)
sgp1 (Wiehe et al., 2000)
twinscan (Korf et al., 2001)
slam (Pachter et al., 2001)
doublescan (Meyer and Durbin, 2002)
sgp2 (Parra et al., 2003)
7
comparative gene prediction
1. THE GENE PREDICTION IS THE RESULT OF THE SEQUENCE ALIGNMENT
Given two homologous genomic sequences, infer the exonic structure in each sequence that maximizes the score of the alignment of the resulting amino acid sequences. This problem is usually solved through a complex extension of the classical dynamic programming algorithm for sequence alignment.
blayo et al., 2002
pedersen and scharl, 2002
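The classical dynamic programming algorithm that this approach extends can be sketched as follows. This is only the textbook global alignment recurrence (Needleman-Wunsch), not the exon-aware extension the slide refers to; the function name and the match, mismatch and gap values are illustrative assumptions.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b (Needleman-Wunsch).

    Scoring parameters are arbitrary illustrative values.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

In the comparative setting the same kind of recurrence would additionally have to choose which stretches of each genomic sequence are exons, which is what makes the extension complex.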
8
comparative gene prediction
2. GENE PREDICTION AND SEQUENCE ALIGNMENT ARE PRODUCED SIMULTANEOUSLY
Given two homologous genomic sequences, pair hidden Markov models (pair HMMs) for sequence alignment and generalized HMMs (GHMMs) for gene prediction are combined into the so-called generalized pair HMMs.
progen: novichkov et al., 2001
slam: pachter et al., 2001
doublescan: meyer and durbin, 2002
9
comparative gene prediction
3. GENE PREDICTION IS SEPARATED FROM SEQUENCE ALIGNMENT
First, an alignment between two homologous genomic sequences is obtained using some generic sequence alignment program, such as tblastx, sim4 or glass. Then, gene structures are predicted that are compatible with this alignment, meaning that predicted exons fall in the aligned regions.
rosetta: batzoglou et al., 2000
cem: bafna and huson, 2000
sgp-1: wiehe et al., 2001
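The compatibility condition (predicted exons fall in the aligned regions) can be illustrated with a minimal filter. This is a sketch under simplifying assumptions, not how rosetta, cem or sgp-1 are implemented: intervals are inclusive coordinates on one sequence, and an exon counts as compatible only if it lies entirely inside a single aligned region. The function name is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive coordinates


def exons_in_aligned_regions(exons: List[Interval],
                             aligned: List[Interval]) -> List[Interval]:
    """Keep candidate exons that lie entirely inside one aligned region."""
    return [
        (start, end)
        for (start, end) in exons
        if any(a_start <= start and end <= a_end for (a_start, a_end) in aligned)
    ]
```

For example, with aligned regions (1, 55) and (190, 300), the candidate exons (10, 50) and (200, 260) are kept and (60, 120) is discarded.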
10
comparative gene prediction
4. GENE PREDICTION IS (EVEN MORE) SEPARATED FROM SEQUENCE ALIGNMENT
This approach does not require the comparison of two homologous genomic sequences. Rather, a query sequence from a target genome is compared against a collection of sequences from a second (informant, reference) genome, and the results of the comparison are used to modify the scores of the exons produced by underlying "ab initio" gene prediction algorithms.
twinscan: korf et al., 2001
sgp-2: parra et al., 2003
11
syntenic gene prediction (sgp2)
[Figure: sgp2 pipeline. Labels: Query Sequence, tblastx, HSPs, HSPs Projections, geneid, Exons, SGP Exons.]
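The idea of projecting HSPs onto the query and using them to adjust exon scores can be sketched as below. The slides do not give the actual sgp2 scoring function, so everything here is an assumption made for illustration: the projection keeps the best HSP score per position, and each exon receives a bonus equal to a hypothetical `weight` times the mean projected score over the exon.

```python
from typing import Dict, List, Tuple

HSP = Tuple[int, int, float]   # (start, end, similarity score), inclusive
Exon = Tuple[int, int, float]  # (start, end, ab initio score), inclusive


def project_hsps(hsps: List[HSP]) -> Dict[int, float]:
    """Collapse overlapping HSPs: each position keeps its best HSP score."""
    projection: Dict[int, float] = {}
    for start, end, score in hsps:
        for pos in range(start, end + 1):
            projection[pos] = max(score, projection.get(pos, 0.0))
    return projection


def rescore_exons(exons: List[Exon], projection: Dict[int, float],
                  weight: float = 0.1) -> List[Exon]:
    """Add weight * (mean projected score over the exon) to each exon score."""
    rescored = []
    for start, end, score in exons:
        covered = sum(projection.get(pos, 0.0) for pos in range(start, end + 1))
        bonus = weight * covered / (end - start + 1)
        rescored.append((start, end, score + bonus))
    return rescored
```

Exons with no similarity support keep their ab initio score, so the comparison can only raise the score of supported exons in this sketch; a real implementation may also penalize unsupported ones.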
12
programs based on mouse-human genome sequence comparisons improve gene predictions
Accuracy on human chromosome 22
            sensitivity   specificity
genscan     0.79          0.46
twinscan    0.80          0.62
SGP         0.79          0.66
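The slides do not define the two measures. A minimal sketch, assuming the convention common in gene-finding evaluations (sensitivity = TP / (TP + FN), specificity = TP / (TP + FP)) and computing both at the nucleotide level from inclusive coding intervals, could look like this; the function name is hypothetical.

```python
from typing import List, Tuple


def nucleotide_accuracy(predicted: List[Tuple[int, int]],
                        annotated: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) over coding nucleotides.

    Assumes specificity means TP / (TP + FP), i.e. the fraction of
    predicted coding nucleotides that are annotated as coding.
    """
    pred = {p for start, end in predicted for p in range(start, end + 1)}
    real = {p for start, end in annotated for p in range(start, end + 1)}
    tp = len(pred & real)
    sensitivity = tp / len(real) if real else 0.0
    specificity = tp / len(pred) if pred else 0.0
    return sensitivity, specificity
```

Under this reading, a program that predicts many extra exons keeps its sensitivity but loses specificity, which is the pattern in the genscan row.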
13
how accurate are the sgp predictions? nucleotide level
14
how accurate are the sgp predictions? exon level
15
gene prediction programs predict a large number of genes
predictions in the mouse genome
                          TWINSCAN   SGP
total                     48462      47055
novel                     17562      21942
multiexonic               10987      12158
long, no low complexity   3171       4543
human ts / human sgp      954        2983
orphans                   2217       1560
final pools:
intron aligned            317        1052
human ts / human sgp      637        1931
orphans                   2217       1560
away from an ensembl:
intron aligned            231        857
human ts / human sgp      482        1706
orphans                   1971       1417
16
and a large number of novel genes...
predictions in the mouse genome, novel: TWINSCAN 17562, SGP 21942
17
...with exons...
predictions in the mouse genome, multiexonic: TWINSCAN 10987, SGP 12158
18
that look like fine proteins
predictions in the mouse genome, long, no low complexity: TWINSCAN 3171, SGP 4543
19
almost every mouse gene has a human orthologue counterpart
predictions in the mouse genome: TWINSCAN 954 human ts, 2217 orphans; SGP 2983 human sgp, 1560 orphans
20
orthologous human mouse genes have conserved exonic structure
|1b
chr1_2213 MSTNICSFKDRCVSILCCKFCKQVLSSRGMKAVLLADTEIDLFSTDIPPTNAVDFTGRCY
          **** *:*******************************:************:*** ****
chr1_1808 MSTNNCTFKDRCVSILCCKFCKQVLSSRGMKAVLLADTDIDLFSTDIPPTNTVDFIGRCY
|1b
|2b |3a
chr1_2213 FTKICKCKLKDIACLKCGNIVGYHVIVPCSSCLLSCNNRHFWMFHSQAVYDINRLDSTGV
          ** *********************************** ***********.*****:***
chr1_1808 FTGICKCKLKDIACLKCGNIVGYHVIVPCSSCLLSCNNGHFWMFHSQAVYGINRLDATGV
|2b |3a
chr1_2213 NVLLRGNLPEIEESTDEDVLNISAEECIR
          *:** ***** **.***:.*:***** **
chr1_1808 NLLLWGNLPETEECTDEETLEISAEEYIR
21
orthologous human mouse genes have conserved exonic structure
85% of the orthologous pairs have an identical number of exons
91% of the orthologous exons have identical length
99.5% of the orthologous exons have identical phase
there are a few cases of intron insertion/deletion (22)
U12 introns appear to be strongly conserved between human and mouse
non-canonical GC-AG introns are less conserved
data on 1506 human/mouse refseq orthologues
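The three comparisons above (number of exons, exon length, exon phase) can be illustrated with a small sketch. It assumes each gene is given as a list of coding exon lengths and defines the phase of an exon as the number of coding bases before it, modulo 3; phase conventions differ between tools, and the function names are hypothetical.

```python
from typing import Dict, List


def exon_phases(lengths: List[int]) -> List[int]:
    """Phase of each coding exon: coding bases before it, modulo 3."""
    phases, total = [], 0
    for length in lengths:
        phases.append(total % 3)
        total += length
    return phases


def compare_structures(a: List[int], b: List[int]) -> Dict[str, object]:
    """Compare two orthologues exon by exon, pairing exons in order."""
    return {
        "same_exon_count": len(a) == len(b),
        "same_length": sum(x == y for x, y in zip(a, b)),
        "same_phase": sum(x == y for x, y in zip(exon_phases(a), exon_phases(b))),
    }
```

A length difference that is a multiple of three (for instance 50 versus 53 coding bases) changes the exon length but leaves the phase of all downstream exons unchanged, which is consistent with phase being conserved more often than length.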
22
we will target genes with conserved intron positions
|2a
chr10_1592 LGSETCCNSHTSLQTSGVPDGSNNNSALIFITALQKMFTGFLLVNKSSCKLNPCWEKVQV
                                               *.     ****:** ** ******
chr19_1200 ------------------------------------MRCSQEPVNKSACKSNPRWEKVQV
|1a
chr10_1592 SSLYKLTDNCVNLQPLKRKEKKATLITLLSFTLHLLSSLAALRWDVNLPVNAVRKWMVQE
           *************************** ***:***************************
chr19_1200 SSLYKLTDNCVNLQPLKRKEKKATLITPLSFALHLLSSLAALRWDVNLPVNAVRKWMVQG
|3b
chr10_1592 GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
           ************************************************************
chr19_1200 GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
|2b |4b
chr10_1592 VCLYGV-LNSKVCQLQKVYILINTPVAWRSEGLADRWLPRKAQQASHLQHLVVGAREQAQ
           .****.       : :********************** ************.**..*
chr19_1200 ACLYGENTAGPGLHSRKVYILINTPVAWRSEGLADRWLLRKAQQASHLQHLSAGATRAVQ
|3c
23
sequence conservation and coding function
24
orthologous splice sites are more conserved than expected solely from their splicing function
26
prediction of splice sites
27
we will target genes with conserved intron positions
28
the final pools
predictions in the mouse genome
                       TWINSCAN   SGP
intron aligned         317        1052
human ts / human sgp   637        1931
orphans                2217       1560
29
rtpcr: targeting conserved intron positions
[Figure: alignment of the predictions chr10_1592 and chr19_1200 with exon boundary marks, the same alignment shown on the slide "we will target genes with conserved intron positions".]
30
rt-pcr on 12 normal mouse adult tissues, and direct sequencing of the amplimers
pool             predictions   tested   positive   success rate
intron aligned   1428          214      133        62%
similar          2125          38       4          11%
orphan           3425          63       2          3%
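Reading each row of the table as pool, predictions, tested, positive and success rate, the success rate is the number of positives divided by the number of predictions tested. The short check below reproduces the percentages on the slide; the variable names are illustrative.

```python
# (tested, positive) per pool, as read from the slide
pools = {
    "intron aligned": (214, 133),
    "similar": (38, 4),
    "orphan": (63, 2),
}

# success rate in percent, rounded to the nearest integer
success_rate = {
    name: round(100 * positive / tested)
    for name, (tested, positive) in pools.items()
}
```

The intron aligned pool is confirmed roughly six times as often as the similar pool, which is the argument for targeting conserved intron positions.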
31
rt-pcr on 12 normal mouse adult tissues, and direct sequencing of the amplimers
32
about 1000 human genes not in ensembl
low support by ESTs: 34% match EST sequences
low representation in other vertebrate genomes: 33% have sequence matches in fish genomes
restricted expression patterns
37
limitations: sensitivity of the procedure
                      twinscan   ensembl            sgp2
initial predictions   48464      23026              48451
multiexonic genes     36831      17565              38979
                      25320      16368    16952     21184
                      69%        94%      97%       54%
ortholog pairs        24743                         30927
                      21099      15355    16757     19831
                      85%        87%      95%       64%
intron aligned        17271                         18056
                      16337      13709    15112     15977
                      94%        78%      86%       88%
38
specificity of the prediction can be improved: Ka/Ks ratio
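The intuition behind a Ka/Ks filter is that real coding sequences tend to accumulate more synonymous than nonsynonymous changes. The sketch below is not a Ka/Ks estimator: a proper estimate also normalizes by the number of synonymous and nonsynonymous sites. It only counts, for two aligned, gap-free, upper-case coding sequences, how many differing codons keep or change the amino acid under the standard genetic code; the function name is hypothetical.

```python
from itertools import product
from typing import Tuple

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, first base slowest
CODE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def count_changes(cds_a: str, cds_b: str) -> Tuple[int, int]:
    """Count (synonymous, nonsynonymous) codon differences between two CDSs."""
    syn = nonsyn = 0
    for i in range(0, min(len(cds_a), len(cds_b)) - 2, 3):
        codon_a, codon_b = cds_a[i:i + 3], cds_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODE[codon_a] == CODE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn
```

A predicted human-mouse gene pair whose differences are mostly nonsynonymous would be a candidate false positive under this kind of filter.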
39
further work
scale the procedure: try to find RT-PCR evidence for (almost) every human gene not yet confirmed
intronless genes
human-specific gene families (if any)
genes with non-canonical splicing
40
selenoproteins
Selenoproteins are proteins that incorporate the amino acid selenocysteine (Sec), the 21st amino acid.
Function: mostly redox enzymes
Distribution: the 3 domains of life
Number: 22 families in mammals
41
selenoproteins
UGA (STOP) is the codon for Sec
There is a tRNA(Sec) whose anticodon pairs with the UGA codon
Recoding:
1. RNA structure: the SECIS element
2. SECIS binding proteins
42
selenoproteins
43
the SECIS element. computational search for selenoproteins
[Figure: dSelG SECIS; SECIS pattern]
44
using geneid to search for selenoproteins
1. Predict SECIS (PatScan)
2. Gene prediction with
   a. TGA in-frame
   b. SECIS
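The two conditions in the steps above (an in-frame TGA and a SECIS element) can be expressed as a simple filter on a predicted gene. This is a sketch, not how geneid or PatScan work internally: it assumes an upper-case coding sequence starting in frame, SECIS predictions given as start coordinates on the same strand, and a hypothetical `max_distance` within which a SECIS downstream of the coding end is accepted.

```python
from typing import List


def in_frame_tga_positions(cds: str) -> List[int]:
    """0-based offsets of in-frame TGA codons, excluding the final codon."""
    last_codon_start = len(cds) - len(cds) % 3 - 3
    return [i for i in range(0, last_codon_start, 3) if cds[i:i + 3] == "TGA"]


def is_selenoprotein_candidate(cds: str, cds_end: int, secis_starts: List[int],
                               max_distance: int = 2000) -> bool:
    """True if the CDS has an in-frame TGA and a SECIS starts downstream of it.

    cds_end is the genomic coordinate of the last coding base; a SECIS is
    accepted if it starts within max_distance bases after that coordinate.
    """
    has_tga = bool(in_frame_tga_positions(cds))
    has_secis = any(0 < start - cds_end <= max_distance for start in secis_starts)
    return has_tga and has_secis
```

The final codon is excluded so that an ordinary TGA stop codon does not make every gene look like a selenoprotein candidate.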
45
genome wide search in drosophila
SECIS predicted: 35876
SECIS thermo assessment: 1220
Genes predicted: 12194
Predicted selenoproteins: (4)
Real selenoproteins: 3
46
dSelG
47
dSelM
48
dSelG and dSelM: experimental verification
49
dSelM has selenoprotein homologues in vertebrates
50
COMPARATIVE GENE PREDICTION
IMIM/UPF/CRG: Genís Parra, Josep F. Abril, Roderic Guigó
University of Geneva: Manolis Dermitzakis, Alexandre Reymond, Robert Lyle, Catherine Ucla, Stylianos Antonarakis
GlaxoSmithKline: Pankaj Agarwal
University of Oxford: Chris Ponting
Washington University: Evan Keibler, Michael Brent
SELENOPROTEINS
Universitat de Barcelona: Montserrat Corominas, Florenci Serras, Marta Morey, Sergi Bertran
University of Lincoln: Vadim Gladishev, Gregory Kruikov
Harvard University: Marla Berry, Nadia Morozova
IMIM/UPF/CRG: Sergi Castellano