1
From Genomes to Genes Rui Alves
2
How to make sense of genome sequences?
How do I know where genes are? …atgattattggcggaatcggcggtgcaaggacacaaacaggactcagattcgaagaacgtacagacttacgaaagttgtttgaagaaattcc…
3
Predicting ORFs is easy, predicting genes is hard
An ORF is a sequence of nucleotides that goes from a start codon (ATG, GTG,…) to a stop codon (TAA, TAG or TGA).
Finding them is as easy as reading the DNA sequence.
How do we know if an ORF is a gene?
4
There are several ways to predict genes
By homology
5
Homology predictions Sequence of known gene Homologue gene
…Sequenced … Genome… Homologue gene
6
How are sequences aligned?
Substitution probability table (rows and columns A, C, …, -; example entries 1 and 0.001)
Alignment with score S1:
…UUACAUUUCCCGUCCGCUCU…
…GGGGUUAAUUUGCCCGUCCA…
Alignment with score S2:
…UUACAUUUCCCGUCCGCUCU…
…GGGGUUAAUUUGCCCGUCCA…
S2 > S1
7
Problems of homology predictions: The genetic code
NO HOMOLOGY!!
…UUAAUUUCCCGUCCG…
…CUUAUAAGUAGACCA…
Yet both sequences code for the same peptide: …LISRP…
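The point of this slide can be checked with a few lines of Python. The sketch below builds the standard genetic code table, translates the two RNA fragments from the slide, and measures how many nucleotide positions they share. The function names `translate` and `identity` are illustrative and not part of the original presentation.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(rna):
    """Translate an RNA (or DNA) string codon by codon; '*' marks a stop."""
    rna = rna.upper().replace("T", "U")
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))

def identity(a, b):
    """Fraction of aligned positions (no gaps) at which two sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / min(len(a), len(b))

seq1 = "UUAAUUUCCCGUCCG"
seq2 = "CUUAUAAGUAGACCA"
print(translate(seq1), translate(seq2), identity(seq1, seq2))
```

Both fragments translate to the same peptide even though fewer than half of their nucleotides match, which is why a nucleotide-level comparison can miss a real homologue.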
8
Solution for redundancy of genetic code:
Use synonymous substitutions when doing the DNA alignment.
The problem of doing this:
…UUAAUUUCCCGUCCG…
…UUAAUUUCCCGUCCA…
…UUAAUUUCCAGACCG…
…
…CUUAUAAGUAGACCA…
Combinatorial explosion!!!
Solutions? Not many: efficient algorithms, more computer power, patience.
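To get a feel for the size of the explosion, the sketch below counts how many distinct RNA sequences encode a given peptide under the standard genetic code. The helper name `count_encodings` is an illustrative choice, not something taken from the slides.

```python
from collections import Counter
from math import prod

# Amino acids of the standard genetic code, one letter per codon
# (codons enumerated with bases in the order U, C, A, G; '*' is a stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Number of synonymous codons available for each amino acid.
DEGENERACY = Counter(AMINO_ACIDS)

def count_encodings(peptide):
    """Number of different nucleotide sequences that encode the peptide."""
    return prod(DEGENERACY[aa] for aa in peptide.upper())

print(count_encodings("LISRP"))
```

For the five-residue peptide LISRP there are already 6 × 3 × 6 × 6 × 4 = 2592 possible coding sequences, and the count keeps multiplying with every additional residue.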
9
Homology predictions most effective for closely related organisms
Thus, homology-based gene prediction works best when the genome of a closely related organism has been fully sequenced and annotated!!!
10
There are other ways to predict if Orfs are genes
By homology
Ab initio methods
  Signal sensors
    ATG sites
    Promoter elements id
    Regulatory elements id
    Shine-Dalgarno sequences id (i.e. ribosome binding sites)
    …
11
Using initiation and termination codons to identify ORFs
ATG is the start codon.
GTG, CTG and TTG are minor start codons.
If a termination codon is too close to the ATG, the ORF is unlikely to be a gene.
atgaatgaatgctgccgaagatctctggcaccaaattttggagcggttgcag…
atgaatgaatgctgccgaagatctctggcaccaaattttggagcggtgacag…
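A minimal sketch of an ORF scanner that applies these rules is given below. It only reads the forward strand, treats the first start codon after a stop as the beginning of the ORF, and discards ORFs shorter than a threshold. The function name `find_orfs` and the parameter `min_codons` are assumptions made for illustration; real gene finders also scan the reverse strand and use organism-specific length cut-offs.

```python
START_CODONS = {"ATG", "GTG", "CTG", "TTG"}  # ATG plus the minor start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=0):
    """Return (start, end, frame) for each complete forward-strand ORF.

    start is the index of the first base of the start codon, end is one past
    the last base of the stop codon, and frame is 0, 1 or 2.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # open a new ORF at the first start codon seen
            elif start is not None and codon in STOP_CODONS:
                # Keep the ORF only if it has enough codons before the stop.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

print(find_orfs("atgaatgaatgctgccgaagatctctggcaccaaattttggagcggtgacag"))
```

Raising `min_codons` (for example to 100, an arbitrary illustrative value) removes short ORFs whose stop codon lies close to the start codon.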
12
Using Promoter sequences to identify ORFs
Many promoters have a known structure.
Identifying promoters close to initiation codons increases the likelihood of an ORF being a gene.
Example: Lac promoter
13
Using response elements to identify ORFs
Regulatory binding sites (RBS) have a known structure.
Identifying RBS close to initiation codons increases the likelihood of an ORF being a gene.
14
Using Ribosomal binding sequences to identify ORFs
Ribosomal binding sites (Shine-Dalgarno sequences, SDS) have a known structure.
Identifying SDS close to initiation codons increases the likelihood of an ORF being a gene.
Consensus Shine-Dalgarno sequence: AGGAGG
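As a rough illustration, the sketch below looks for an approximate match to the AGGAGG consensus in the bases just upstream of a start codon. The 20-base search window and the tolerance of one mismatch are assumptions chosen for the example, not values given in the slides, and the function name `shine_dalgarno_hits` is hypothetical.

```python
SD_CONSENSUS = "AGGAGG"

def shine_dalgarno_hits(dna, start_index, upstream=20, max_mismatches=1):
    """List (position, motif, mismatches) for SD-like motifs before a start codon.

    Only windows that lie entirely within the `upstream` bases preceding
    `start_index` are examined.
    """
    dna = dna.upper()
    k = len(SD_CONSENSUS)
    region_start = max(0, start_index - upstream)
    hits = []
    for i in range(region_start, start_index - k + 1):
        window = dna[i:i + k]
        mismatches = sum(a != b for a, b in zip(window, SD_CONSENSUS))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits

dna = "TTTAGGAGGTTTTTTATGAAA"
print(shine_dalgarno_hits(dna, dna.index("ATG")))
```

An ORF whose start codon has at least one hit would then be ranked as a more likely gene than one without.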
15
There are several ways to predict genes
By homology
Ab initio methods
  Signal sensors
    Promoter elements id
    Regulatory elements id
    Shine-Dalgarno sequences id (i.e. ribosome binding sites)
    ATG sites
    …
  Content sensors
    Codon usage
    GC content
    Position asymmetry
    CpG islands
16
Using codon bias to predict expressed ORFs
The frequencies of synonymous codons in an organism are not uniform.
The frequency of synonymous codons in coding sequences is different from that in non-coding sequences.
This can be used to predict coding open reading frames.
atgaatgcatgctgccgaagatctctggcaccaaattttggagcggttgcag…
Codon usage for Ile:
          ATT   ATC   ATA
Average   0.34  0.46  0.20
RF1       0.34  0.26  0.40
RF2       0.40  0.20  (one value missing)
RF3       0.32  0.42  0.25
The third reading frame is the most likely to be a gene, because its codon usage is the closest to the average.
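One simple way to make this comparison quantitative is to measure the distance between each frame's codon usage and the organism average. The sketch below uses the Ile frequencies from the slide and a Euclidean distance, which is an illustrative choice rather than the presenter's stated method. RF2 is left out because one of its values is missing from the transcript.

```python
from math import sqrt

# Ile codon frequencies taken from the slide.
AVERAGE_ILE = {"ATT": 0.34, "ATC": 0.46, "ATA": 0.20}
FRAME_ILE = {
    "RF1": {"ATT": 0.34, "ATC": 0.26, "ATA": 0.40},
    "RF3": {"ATT": 0.32, "ATC": 0.42, "ATA": 0.25},
}

def usage_distance(observed, reference):
    """Euclidean distance between two codon-frequency tables."""
    return sqrt(sum((observed[c] - reference[c]) ** 2 for c in reference))

def closest_frame(frames, reference):
    """Name of the frame whose codon usage is nearest to the reference."""
    return min(frames, key=lambda name: usage_distance(frames[name], reference))

for name, usage in FRAME_ILE.items():
    print(name, round(usage_distance(usage, AVERAGE_ILE), 3))
print(closest_frame(FRAME_ILE, AVERAGE_ILE))
```

In practice the comparison would run over all amino acids with synonymous codons, not only Ile.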
17
Using GC content to predict expressed ORFs
gtgattagctctgccgaagatctctggcaccaaattttggagcggttgcag…
G+C count at the third codon position: Frame 1: 11, Frame 2: 9, Frame 3: 5
The G+C content of the third position of codons in coding sequences is biased.
Genes have a very high (or very low) G+C content at the third position of the codons in the reading frame.
Frame 1 (or frame 3, if the bias is towards low G+C) is more likely to be expressed.
Not very useful for eukaryotes.
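The counts on this slide can be reproduced by tallying G and C at every third codon position in each of the three forward frames. The function name `gc3_counts` is an illustrative choice.

```python
def gc3_counts(dna):
    """Count G or C at the third codon position for frames 1, 2 and 3."""
    dna = dna.upper()
    counts = []
    for offset in range(3):
        # Third positions of complete codons in this frame: offset+2, offset+5, ...
        third_positions = dna[offset + 2::3]
        counts.append(sum(base in "GC" for base in third_positions))
    return counts

seq = "gtgattagctctgccgaagatctctggcaccaaattttggagcggttgcag"
print(gc3_counts(seq))
```

For the sequence on the slide this gives 11, 9 and 5 for frames 1, 2 and 3. Dividing each count by the number of codons in the frame turns it into a GC3 fraction that can be compared across sequences of different length.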
18
Using position asymmetry to predict expressed ORFs
Coding sequences have a characteristic distribution of nucleotides in each of the three positions of codons.
gtgaatgtatgctctgccgaagatctctggcaccaaattttggagcggttgcag…
Nucleotide frequencies by codon position (columns A, T, C, G; incomplete rows are missing values):
Av Gene: Position 1: 0.20 0.22 0.40; Position 2: 0.38; Position 3: 0.30 0.24
RF1: Position 1: 0.19 0.24 0.38; Position 2: (missing); Position 3: 0.29
RF2: Position 1: 0.38 0.24 0.19; Position 2: (missing); Position 3: 0.25
RF3: Position 1: 0.45 0.15 0.25; Position 2: 0.20 0.18 0.30 0.32; Position 3: 0.11 0.36
19
Using position asymmetry to predict expressed ORFs
Reading frame 1 is the most likely because it has the highest similarity to the position asymmetry of known genes.
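A sketch of how such a comparison could be automated follows. It builds a table of nucleotide frequencies at codon positions 1 to 3 for a chosen frame and picks the frame whose table is nearest to a reference profile. The function names, the sum-of-absolute-differences distance, and the toy reference sequence are all assumptions made for illustration. A real reference profile would be estimated from many annotated genes.

```python
def position_profile(dna, offset=0):
    """Frequencies of A, T, C, G at codon positions 1, 2, 3 for one frame."""
    dna = dna.upper()
    n_codons = (len(dna) - offset) // 3
    counts = [dict.fromkeys("ATCG", 0) for _ in range(3)]
    for codon_index in range(n_codons):
        for pos in range(3):
            base = dna[offset + 3 * codon_index + pos]
            if base in counts[pos]:
                counts[pos][base] += 1
    return [
        {base: (n / n_codons if n_codons else 0.0) for base, n in column.items()}
        for column in counts
    ]

def profile_distance(p, q):
    """Sum of absolute frequency differences over all positions and bases."""
    return sum(abs(p[i][b] - q[i][b]) for i in range(3) for b in "ATCG")

def best_frame(dna, reference):
    """Frame offset (0, 1 or 2) whose profile is closest to the reference."""
    return min(range(3),
               key=lambda f: profile_distance(position_profile(dna, f), reference))

# Toy reference built from a made-up "known gene"; purely hypothetical data.
reference = position_profile("ATGATGATG")
print(best_frame("CATGATGATGA", reference))
```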
20
CpG Islands are signals for transcription initiation
Near the promoters of known genes, the content of CG dinucleotides is higher than away from transcription initiation sites.
Thus, ATGs preceded by a CpG island are more likely to belong to genes.
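A common way to flag a CpG island is to compare the observed number of CG dinucleotides with the number expected from the C and G content. The sketch below computes that ratio for a window of DNA. The thresholds (length of at least 200 bp, G+C above 50%, observed/expected above 0.6) are widely quoted criteria and are an assumption here, since the slides give no cut-offs. The function names are illustrative.

```python
def cpg_stats(dna):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    dna = dna.upper()
    n = len(dna)
    c_count = dna.count("C")
    g_count = dna.count("G")
    cpg_count = dna.count("CG")
    gc_fraction = (c_count + g_count) / n if n else 0.0
    # Expected CpG count if C and G were placed independently: C * G / N.
    if c_count and g_count:
        obs_exp = cpg_count * n / (c_count * g_count)
    else:
        obs_exp = 0.0
    return gc_fraction, obs_exp

def looks_like_cpg_island(dna, min_length=200, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed length, G+C and observed/expected thresholds."""
    gc_fraction, obs_exp = cpg_stats(dna)
    return len(dna) >= min_length and gc_fraction > min_gc and obs_exp > min_ratio

print(cpg_stats("CG" * 100))
```

Sliding this test along the region upstream of each candidate ATG gives one more content sensor to combine with the others.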
21
Other asymmetry measures of gene likelihood
Dinucleotide bias Hexanucleotide bias …
22
Summary
Genes can be predicted by:
  Homology
  Content sensors
  Signal sensors
If you need to annotate a genome, go to e.g. TIGR
23
How are eukaryotic genes different?
DNA mRNA RNA Pol Ryb Protein
24
How are eukaryotic genes different?
DNA mRNA RNA Pol Ryb Spliceosome Protein mRNA
Correctly identifying splicing sites is not a trivial task.
25
How do we predict splicing sites?
By homology
Ab initio
  SS motifs
  Codon usage
  Exonic Splicing Enhancers
  Intronic Splicing Enhancers
  Exonic Splicing Silencers
  Intronic Splicing Silencers
26
Homology Splice Site Prediction
Known spliced gene Predicted spliced gene
27
Splice Site Motifs
28
Exonic Splicing Enhancers
29
Exonic Splicing Silencers
Genes & Development 18:
30
Interaction between SE and SI
31
Rules for Splicing
3’ end likely target for repression
Distance between SE and 3’ end < 100 bp
Splicing efficiency ∝ p(interaction SEC-3’ end)
32
Methods for splicing detection
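Set of known spliced genes, divided into a training set and a test set of known spliced genes
Training set: used to train the algorithm (GA, NN, HMM, Bayes, ME)
Test set: given to the trained algorithm (GA, NN, HMM, Bayesian) to obtain predictions, which are compared with the known splicing of the test set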
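The train-then-test workflow can be sketched with a much simpler model than those named on the slide. The example below splits a set of known splice-site sequences into training and test sets, fits a position weight matrix (a stand-in for the GA, NN, HMM or Bayesian methods, not one of them), and scores held-out sites. The six-base sequences are toy data invented for illustration, and all function names are hypothetical.

```python
import math
import random

def split_examples(examples, test_fraction=0.3, seed=0):
    """Shuffle the known examples and split them into (training, test)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_pwm(sites, pseudocount=1.0):
    """Log-odds position weight matrix against a uniform background."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / 0.25) for b in "ACGT"})
    return pwm

def score(pwm, site):
    """Sum of per-position log-odds; higher means more site-like."""
    return sum(column[base] for column, base in zip(pwm, site))

# Toy donor-site-like sequences; hypothetical data, not from the slides.
KNOWN_SITES = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGG",
               "GTGAGC", "GTAAGT", "GTGAGT", "GTAAGC"]

training_set, test_set = split_examples(KNOWN_SITES)
pwm = train_pwm(training_set)
for site in test_set:
    print(site, round(score(pwm, site), 2))
```

Keeping the test set out of training is what makes the reported accuracy meaningful; the same split applies whichever learning method is plugged in.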
33
A Genetic Algorithm Method
Motif DM1 … AMi … EM AM p(i) IM
Shuffle lines and columns k times and each time calculate the probability of a given combination of motifs getting spliced.
Select the m best combinations and continue to evolve the algorithm until it predicts the training set.
34
A Neural Net Method Sequences Predicted Splicing
Corrected Weight Table for splice elements Weight Table for splice elements Hidden Nodes Predicted Splicing
35
Summary
Eukaryotic genes have exons.
Biological rules combined with mathematical and statistical approaches can be used to predict the boundaries of the exons and to predict the splice variants.
36
How to find what genes a string of DNA contains
Rui Alves
37
Simple steps
Go to a known gene prediction server (or google for one)
Input sequence and wait for prediction
Get prediction(s), either as cDNA or as a translated protein sequence, and do homology searches to identify them in a known database (e.g. NCBI or SWISSPROT)
38
Simple steps a) Go to a known gene prediction server (or google for one) Input sequence and wait for prediction Get prediction(s), either as cDNA or as a translated protein sequence and do homology searches to identify them
39
Paper Presentation
The human genome (Science) vs. The human genome (Nature)
Nature: Pages 875 to 901
Science: Pages
Compare the differences in methods and results for the annotation
DO NOT SPEND TIME TALKING ABOUT THE SEQUENCING OR ASSEMBLY ITSELF
Do not go into the comparative genome analysis