Published by Tabitha Jocelyn Nash. Modified over 8 years ago.
1
1 Gene Finding
2
2 “The Central Dogma”
DNA → (transcription) → RNA → (translation) → Protein
3
3 Gene Finding in Prokaryotes
4
4 Reminder: The Genetic Code
1 start codon, 3 stop codons
5
5 Finding Genes in Prokaryotes
High gene density
– ~85% coding in E. coli => is every ORF a gene?
Gene structure
6
6 Finding ORFs
Many more ORFs than genes
– In E. coli one finds 6500 ORFs, while there are 4290 genes.
In random DNA, a stop codon occurs once every 64/3 ≈ 21 codons on average.
An average protein is ~300 codons long.
=> Search for long ORFs.
Problems:
– Short genes
– Overlapping long ORFs on opposite strands
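The long-ORF search described on this slide can be sketched as follows. This is a minimal illustration, not the method of any particular tool: it assumes an ORF runs from the first ATG in a frame to the next in-frame stop codon, and the function names and the `min_codons` threshold are hypothetical choices made for the example.

```python
# Minimal six-frame ORF scan (illustrative sketch; names are hypothetical).
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Return (strand, frame, start, end, n_codons) for every ORF found.

    Coordinates are 0-based, end-exclusive, and measured on the strand that
    was scanned. n_codons counts the start codon but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3, n_codons))
                    start = None
    return orfs


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
    print(find_orfs(toy, min_codons=3))
```

Raising `min_codons` trades sensitivity for specificity: a high threshold discards most chance ORFs but also misses the short genes listed under "Problems".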
7
7 Codon Frequencies
Coding DNA is not random:
– In random DNA, expect a Leu : Ala : Trp ratio of 6 : 4 : 1 (the number of codons for each amino acid)
– In real proteins, the ratio is 6.9 : 6.5 : 1
Different species have different codon frequencies.
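The 6 : 4 : 1 expectation for random DNA follows from counting how many codons encode each amino acid. The sketch below builds the standard genetic code from its usual compact listing (bases ordered T, C, A, G) and counts codons per amino acid; the variable and function names are illustrative.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G
# (first base varies slowest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_per_amino_acid():
    """Count how many of the 64 codons encode each amino acid (or stop)."""
    return Counter(CODON_TABLE.values())


if __name__ == "__main__":
    counts = codons_per_amino_acid()
    # With equal base frequencies, expected amino acid ratios are
    # proportional to these codon counts.
    print("Leu:", counts["L"], "Ala:", counts["A"], "Trp:", counts["W"])
```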
8
8 Human and Yeast codon usage
[Table: codon usage in human and yeast]
9
9 Using Codon Frequencies/Usage
The probability that the i-th reading frame is the coding region:
Assume each codon is independent.
For codon abc, calculate its frequency f(abc) in coding regions.
Given a coding sequence a_1 b_1 c_1, …, a_{n+1} b_{n+1} c_{n+1}, calculate:
[Equation shown on slide]
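The slide's equation did not survive in this transcript. A common formulation consistent with the setup above (independent codons, frequencies f(abc), a window of n+1 base triplets) multiplies the codon frequencies along each of the three frames and normalizes: p_1 = f(a_1 b_1 c_1)···f(a_n b_n c_n), p_2 = f(b_1 c_1 a_2)···f(b_n c_n a_{n+1}), p_3 = f(c_1 a_2 b_2)···f(c_n a_{n+1} b_{n+1}), and P_i = p_i / (p_1 + p_2 + p_3). The sketch below assumes that formulation; the function name, the pseudo-frequency for unseen codons, and the toy frequency table are all hypothetical.

```python
import math


def frame_probabilities(seq, codon_freq, pseudo=1e-6):
    """Return [P_1, P_2, P_3], the relative probability of each frame.

    codon_freq maps codon -> frequency in known coding regions. Codons that
    are missing from the table get the small value `pseudo` (an assumption
    made here to avoid log(0)). The same number of codons is used per frame.
    """
    seq = seq.upper()
    n = (len(seq) - 2) // 3  # codons available in every frame
    log_p = []
    for frame in range(3):
        total = 0.0
        for k in range(n):
            codon = seq[frame + 3 * k: frame + 3 * k + 3]
            total += math.log(codon_freq.get(codon, pseudo))
        log_p.append(total)
    # Normalize in log space to avoid underflow on long windows.
    top = max(log_p)
    weights = [math.exp(lp - top) for lp in log_p]
    z = sum(weights)
    return [w / z for w in weights]


if __name__ == "__main__":
    # Toy frequencies (not real codon usage data).
    toy_freq = {"GCT": 0.9, "CTG": 0.05, "TGC": 0.05}
    print(frame_probabilities("GCT" * 5, toy_freq))
```

Sliding this calculation along a sequence window by window gives one curve per frame, which is the kind of plot a codon-preference display produces.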
10
10 CodonPreference
[Figure: CodonPreference plot; labels: ORF, the real genes]
11
11 C+G Content
C+G content (“isochore”) has a strong effect on gene density, gene length, etc.
– < 43% C+G: 62% of genome, 34% of genes
– > 57% C+G: 3-5% of genome, 28% of genes
Gene density in C+G-rich regions is 5 times higher than in moderate C+G regions and 10 times higher than in A+T-rich regions.
– The amount of intronic DNA is 3 times higher in A+T-rich regions (both intron length and number).
– Etc.
12
12 CodonPreference: 3rd position GC bias
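Both the regional C+G content and the third-position GC bias mentioned on these slides reduce to simple base counting. The following is a minimal sketch; the function names and the window and step sizes are illustrative assumptions, and `gc3` assumes its input is already in the coding frame.

```python
def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window=100, step=50):
    """Return (start, gc_fraction) for each full window along the sequence."""
    return [
        (i, gc_content(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]


def gc3(coding_seq):
    """GC fraction at the third codon position of an in-frame sequence."""
    return gc_content(coding_seq.upper()[2::3])


if __name__ == "__main__":
    print(windowed_gc("AAAAGGGG", window=4, step=4))
    print(gc3("ATGGCC"))
```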
13
13 RNA Transcription
Not all ORFs are expressed.
Transcription depends on regulatory regions.
A common regulatory region is the promoter.
RNA polymerase binds tightly to a specific DNA sequence in the promoter, called the binding site.
14
14 Prokaryotic Promoter One type of RNA polymerase.
15
15 Positional Weight Matrix For TATA box:
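The matrix from the slide is not preserved in this transcript, so the sketch below builds a positional weight matrix from a small set of made-up aligned sites and uses it to score sequence windows. The sites, the pseudocount, and the uniform background of 0.25 are assumptions chosen for illustration only, and input sequences are assumed to contain only A, C, G, and T.

```python
import math

# Hypothetical aligned binding sites (NOT the data from the slide).
SITES = ["TATAAT", "TATAAT", "TATGAT", "TACAAT", "GATAAT", "TATATT"]


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM: one dict of base -> score per motif position."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total)
                            / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(col[base] for col, base in zip(pwm, window))


def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    return max(
        (score(pwm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )


if __name__ == "__main__":
    pwm = build_pwm(SITES)
    print(best_hit(pwm, "GGGCTATAATGCC"))
```

A positive total score means the window looks more like the motif than like background; in practice a score threshold decides which windows are reported as candidate sites.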
16
16 Gene Finding in Eukaryotes
17
17 Coding density
18
18 Eukaryote gene structure
Gene length: 30kb; coding region: 1-2kb
Binding site: ~6bp, ~30bp upstream of the TSS
Average of 6 exons, 150bp long
Huge variance:
– dystrophin: 2.4Mb long
– Blood coagulation factor: 26 exons, 69bp to 3106bp; intron 22 contains another, unrelated gene
19
19 Splicing
Splicing is the removal of introns.
It is performed by complexes called spliceosomes, which contain both proteins and snRNA.
The snRNA recognizes the splice sites through RNA-RNA base-pairing.
Recognition must be precise: a 1nt error can shift the reading frame, making nonsense of the message.
Many genes undergo alternative splicing, which changes the protein produced.
20
20 Splice Sites
21
21 Gene prediction programs
Scan the sequence in all 6 reading frames for:
1. Start and stop codons
2. Long ORFs
3. Codon usage
4. GC content
5. Gene features: promoter, terminator, poly-A sites, exons and introns, …
[Figure: reading frames +1, +2, +3]
22
22 Gene prediction programs
Genscan: Vertebrates/Maize/Arabidopsis.
Predicts gene locations and gene features.
Can handle a few genes in one sequence.
23
23 Gene prediction programs Results:
24
24 GenScan Output
25
25 GenScan Performance
Correctly predicts 80% of exons
– For genes with multiple exons, the probability of a fully correct prediction declines…
Per-bp prediction accuracy > 90%
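To see why accuracy declines for multi-exon genes, consider a rough back-of-the-envelope model: if each exon is predicted correctly with probability 0.8 and exons are treated as independent (a simplifying assumption, not a claim about how GenScan actually behaves), the chance of getting every exon of a gene right falls off geometrically with the exon count. The function name is illustrative.

```python
def prob_all_exons_correct(per_exon_accuracy, n_exons):
    """Probability that all n exons are correct, assuming independence."""
    return per_exon_accuracy ** n_exons


if __name__ == "__main__":
    for n in (1, 3, 6, 10):
        p = prob_all_exons_correct(0.8, n)
        print(f"{n:2d} exons: {p:.3f}")
```

Under this assumption, a gene with the average of 6 exons would be predicted entirely correctly only about a quarter of the time.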
26
26 Many prediction tools
Dynamic programming to build the highest-scoring model from the available features – e.g. Genefinder
Markov model based on a typical gene model – e.g. GENSCAN or GLIMMER
Neural net trained on confirmed gene models – e.g. GRAIL
27
27 An Unsolved Problem www.hgmp.mrc.ac.uk/NIX
28
28 An end to ab initio prediction
Ab initio gene prediction is inaccurate.
High false positive rates for most predictors.
Rarely used as a final product.
Human annotation runs multiple algorithms and scores exons predicted by multiple predictors.
Used as a starting point for refinement/verification.
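Scoring exons by how many predictors agree on them can be sketched as a simple vote count. This is only an illustration of the idea: the predictor names and coordinates are made up, and exons are matched by exact (start, end) coordinates, whereas a real annotation pipeline would likely tolerate partial overlaps.

```python
from collections import Counter


def score_exons(predictions):
    """Count, for each (start, end) exon, how many predictors reported it.

    predictions maps a predictor name to an iterable of (start, end) tuples.
    """
    votes = Counter()
    for exons in predictions.values():
        votes.update(set(exons))  # each predictor votes once per exon
    return votes


def consensus(predictions, min_votes=2):
    """Return the exons supported by at least min_votes predictors, sorted."""
    votes = score_exons(predictions)
    return sorted(exon for exon, n in votes.items() if n >= min_votes)


if __name__ == "__main__":
    # Hypothetical predictor outputs, for illustration only.
    toy = {
        "predictor_a": [(100, 200), (300, 400)],
        "predictor_b": [(100, 200), (310, 400)],
        "predictor_c": [(100, 200), (300, 400), (900, 950)],
    }
    print(consensus(toy, min_votes=2))
```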
29
29 Comparative Genomics
Use homologous sequences:
1. Annotated genes
2. mRNA sequences
3. Protein sequences
4. ESTs
30
30 ESTs EST – Expressed Sequence Tags. Short sequences which are obtained from cDNA (mRNA).
31
31 Transcript-based prediction
Align transcript data to the genomic sequence using pairwise sequence comparison.
[Figure: EST and cDNA alignments and the resulting gene model]
32
32 Transcript-based prediction
Example: BlastN against a human EST database.
33
33 Annotation of eukaryotic genomes
[Diagram: Genomic DNA → (transcription) → unprocessed RNA → (RNA processing) → mature mRNA (AAAAAAA) → (translation) → nascent polypeptide → (folding) → active enzyme → function (Reactant A → Product B)]
Ab initio gene prediction (without prior knowledge)
Comparative gene prediction (uses other biological data)
Functional identification
34
34 FIN
35
35 Ribosome
36
36 Generalized HMM (Burge & Karlin, J. Mol. Biol. 268:78-94, 1997)
– Semi-Markov model with a different output length at each node
– HMM with a different output length and a different output distribution at each node
37
37 Generalized HMM (Burge & Karlin, J. Mol. Biol. 268:78-94, 1997)
Overview:
– Hidden Markov states q_1, …, q_n
– State q_i has output length distribution f_i
– The output of each state can have a separate probabilistic model (weight matrix model, HMM, …)
– Initial state probability distribution
– State transition probabilities T_ij
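The components listed in this overview can be made concrete with a toy generative sketch: each state draws a segment length from its own length distribution, emits that many symbols from its own output model, and then moves on according to the transition probabilities. Everything numeric below (the two states, their length distributions, base frequencies, and transitions) is invented for illustration and is not the GENSCAN parameterization; the output model here is a simple i.i.d. base distribution.

```python
import random

# Toy generalized HMM. All numbers are made up for illustration.
STATES = {
    "exon": {
        "lengths": {3: 0.2, 6: 0.5, 9: 0.3},          # f_i: length distribution
        "emit": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    },
    "intron": {
        "lengths": {4: 0.5, 8: 0.5},
        "emit": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    },
}
TRANSITIONS = {"exon": {"intron": 1.0}, "intron": {"exon": 1.0}}  # T_ij
INITIAL = {"exon": 1.0}


def _draw(dist, rng):
    """Draw one key from a {value: probability} dictionary."""
    values = list(dist)
    weights = [dist[v] for v in values]
    return rng.choices(values, weights=weights)[0]


def sample(n_segments, rng):
    """Generate a sequence and its hidden parse as (state, length) pairs."""
    state = _draw(INITIAL, rng)
    bases, parse = [], []
    for _ in range(n_segments):
        model = STATES[state]
        length = _draw(model["lengths"], rng)    # explicit duration
        bases.extend(_draw(model["emit"], rng) for _ in range(length))
        parse.append((state, length))
        state = _draw(TRANSITIONS[state], rng)
    return "".join(bases), parse


if __name__ == "__main__":
    seq, parse = sample(5, random.Random(0))
    print(seq)
    print(parse)
```

The explicit per-state length distribution is what distinguishes this from an ordinary HMM, where segment lengths arise only implicitly from self-transitions.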
38
38 Length Distribution
Since an HMM is a memory-less process, the only length distribution that can be modeled is geometric.
[Diagram: a simple HMM for gene structure with two states, exon and intron, and transition probabilities p, 1-p, q, 1-q]
The length of each exon (intron) has a geometric distribution:
[Equation shown on slide]
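The slide's formula is not preserved in this transcript. Assuming p is the probability that the exon state loops back to itself (and 1-p the probability of leaving it), the standard result is P(length = k) = p^(k-1) (1-p) for k ≥ 1, with mean length 1/(1-p); the same holds for introns with q. The sketch below evaluates that distribution; which of p and 1-p labels the self-loop in the original diagram is an assumption, and the function names are illustrative.

```python
def geometric_pmf(k, p):
    """P(segment length == k) for a state with self-transition probability p."""
    if k < 1:
        return 0.0
    return p ** (k - 1) * (1 - p)


def expected_length(p):
    """Mean segment length for self-transition probability p (p < 1)."""
    return 1.0 / (1.0 - p)


if __name__ == "__main__":
    p = 0.99
    print("mean length:", expected_length(p))
    for k in (1, 50, 100, 300):
        print(k, geometric_pmf(k, p))
```

Because this distribution always puts its highest probability on length 1 and decays monotonically, it fits real exon lengths poorly, which is one motivation for models with explicit length distributions.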
39
39 GenScan Model
[Diagram: state model with exon, intron, initial/terminal exon, 5'/3' UTR, and promoter/poly-A states, on the forward and backward strands]
Burge & Karlin, JMB 97