Gene Finding

1 1 Gene Finding

2 2 “The Central Dogma”: DNA → (transcription) → RNA → (translation) → Protein

3 3 Gene Finding in Prokaryotes

4 4 Reminder: The Genetic Code. 1 start codon, 3 stop codons.

5 5 Finding Genes in Prokaryotes: High gene density: ~85% of the E. coli genome is coding. => Is every ORF a gene? [Figure: gene structure]

6 6 Finding ORFs: There are many more ORFs than genes. In E. coli one finds 6500 ORFs, while there are 4290 genes. In random DNA, 3 of the 64 codons are stop codons, so one expects a stop codon every 64/3 ≈ 21 codons on average. The average protein is ~300 codons long. => Search for long ORFs. Problems: short genes; overlapping long ORFs on opposite strands.
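The long-ORF search described on this slide can be sketched roughly as follows. This is a minimal illustration, not the method of any particular tool: the function names, the `min_codons` threshold, and the choice to require an ATG start codon are all assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Scan all 6 reading frames for ATG...stop ORFs.

    Returns tuples (strand, frame, start, end, n_codons). Coordinates are
    0-based and refer to the strand that was scanned; n_codons counts the
    start codon but not the stop codon. min_codons is a hypothetical cutoff.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i  # first ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3, n_codons))
                    start = None  # look for the next ORF in this frame
    return orfs


# Toy example: ATG + 5 alanine codons + stop, with a low cutoff.
toy = "ATG" + "GCT" * 5 + "TAA"
print(find_orfs(toy, min_codons=3))
```

Raising `min_codons` trades sensitivity for specificity: a high cutoff discards most spurious ORFs but also misses the short genes mentioned on the slide.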

7 7 Codon Frequencies: Coding DNA is not random. In random DNA, expect a Leu : Ala : Trp ratio of 6 : 4 : 1. In real proteins, the ratio is 6.9 : 6.5 : 1. Frequencies differ between species.
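The 6 : 4 : 1 expectation follows from the number of codons that encode each amino acid in the standard genetic code (6 for Leu, 4 for Ala, 1 for Trp). A small sketch that reproduces it, assuming independent bases with equal frequencies (the function name and `base_freq` parameter are illustrative):

```python
# Codons for the three amino acids, from the standard genetic code.
CODONS = {
    "Leu": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "Ala": ["GCT", "GCC", "GCA", "GCG"],
    "Trp": ["TGG"],
}


def expected_probabilities(codons, base_freq=None):
    """Probability that a random codon encodes each amino acid,
    assuming independent bases drawn with the given frequencies."""
    if base_freq is None:
        base_freq = {b: 0.25 for b in "ACGT"}  # uniform random DNA
    return {
        aa: sum(base_freq[c[0]] * base_freq[c[1]] * base_freq[c[2]] for c in cs)
        for aa, cs in codons.items()
    }


probs = expected_probabilities(CODONS)
ratio = {aa: p / probs["Trp"] for aa, p in probs.items()}
print(ratio)  # {'Leu': 6.0, 'Ala': 4.0, 'Trp': 1.0}
```

Passing a non-uniform `base_freq` shows how base composition alone shifts the expected ratio, which is one reason the frequencies differ between species.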

8 8 Human and Yeast codon usage

9 9 Using Codon Frequencies/Usage: The probability that the i-th reading frame is the coding region. Assume each codon is independent. For each codon abc, calculate its frequency f(abc) in coding regions. Given the coding sequence a_1 b_1 c_1, …, a_{n+1} b_{n+1} c_{n+1}, calculate: [formula not captured in transcript]
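The formula on this slide did not survive in the transcript. One common formulation consistent with the setup above, offered here as an assumption rather than as the slide's exact content, multiplies the codon frequencies along each frame, p_1 = f(a_1 b_1 c_1) … f(a_n b_n c_n), p_2 = f(b_1 c_1 a_2) … f(b_n c_n a_{n+1}), p_3 = f(c_1 a_2 b_2) … f(c_n a_{n+1} b_{n+1}), and then normalizes: P_i = p_i / (p_1 + p_2 + p_3). A sketch in log space, with a hypothetical `pseudo` value for codons absent from the frequency table:

```python
import math


def frame_probabilities(seq, codon_freq, pseudo=1e-6):
    """Return [P_1, P_2, P_3], the normalized probability that each of the
    three forward reading frames is the coding frame.

    codon_freq maps codon -> frequency in known coding regions.
    pseudo is an assumed small frequency for unseen codons.
    """
    seq = seq.upper()
    n = (len(seq) - 2) // 3  # codons that fit in every one of the 3 frames
    log_p = []
    for frame in range(3):
        total = 0.0
        for k in range(n):
            codon = seq[frame + 3 * k: frame + 3 * k + 3]
            total += math.log(codon_freq.get(codon, pseudo))
        log_p.append(total)
    # Normalize in log space to avoid underflow on long sequences.
    m = max(log_p)
    weights = [math.exp(lp - m) for lp in log_p]
    z = sum(weights)
    return [w / z for w in weights]


# Toy frequency table (made up for illustration only).
toy_freq = {"ATG": 0.5, "GCT": 0.5}
print(frame_probabilities("ATGGCTGCTGCTGCT", toy_freq))
```

In practice this is evaluated in a sliding window along the sequence, so that the preferred frame can be plotted as a function of position.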

10 10 CodonPreference [Figure: CodonPreference plot comparing ORFs with the real genes]

11 11 C+G Content: C+G content (“isochore”) has a strong effect on gene density, gene length, etc. < 43% C+G: 62% of the genome, 34% of genes. > 57% C+G: 3-5% of the genome, 28% of genes. Gene density in C+G-rich regions is 5 times higher than in moderate C+G regions and 10 times higher than in A+T-rich regions. The amount of intronic DNA is 3 times higher in A+T-rich regions (both intron length and number). Etc.
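C+G content along a sequence can be measured with a simple windowed count. The sketch below is illustrative: the function names and window parameters are assumptions, and the class boundaries simply reuse the 43% and 57% cutoffs quoted on the slide.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window, step=None):
    """C+G content in consecutive windows; returns (start, fraction) pairs."""
    step = step or window  # non-overlapping windows by default
    return [
        (i, gc_content(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]


def isochore_class(gc):
    """Bin a C+G fraction using the cutoffs quoted on the slide."""
    if gc < 0.43:
        return "low C+G"
    if gc > 0.57:
        return "high C+G"
    return "moderate C+G"


for start, gc in gc_windows("AAAATTTTGGGGCCCCATGC", window=4):
    print(start, gc, isochore_class(gc))
```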

12 12 CodonPreference: 3rd position GC bias
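Third-position GC bias is usually summarized as the fraction of codons whose third base is G or C. A minimal sketch, assuming the input is already in the correct reading frame (the name `gc3` is illustrative):

```python
def gc3(orf):
    """Fraction of codons in an in-frame sequence whose 3rd base is G or C."""
    third_positions = orf.upper()[2::3]  # bases at indices 2, 5, 8, ...
    if not third_positions:
        return 0.0
    return sum(base in "GC" for base in third_positions) / len(third_positions)


# ATG GCC TAA -> third bases G, C, A
print(gc3("ATGGCCTAA"))
```

Computing the same statistic in all three frames of a window and comparing the values is one way such a bias can indicate the coding frame.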

13 13 RNA Transcription: Not all ORFs are expressed. Transcription depends on regulatory regions. A common regulatory region is the promoter. RNA polymerase binds tightly to a specific DNA sequence in the promoter called the binding site.

14 14 Prokaryotic Promoter One type of RNA polymerase.

15 15 Positional Weight Matrix For TATA box:
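The matrix shown on this slide is not preserved in the transcript. The sketch below shows how a positional weight matrix is typically built and used, as log-odds scores against a uniform background. The aligned sites are hypothetical and made up for illustration, and the pseudocount and background values are assumptions.

```python
import math

# Hypothetical aligned binding sites (illustrative only, not real data).
SITES = ["TATAAT", "TATAAT", "TATGAT", "TACAAT", "TATATT", "GATAAT"]


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds weight matrix: one dict of base -> score per position."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum the per-position scores of a window the same length as the matrix."""
    return sum(col[base] for col, base in zip(pwm, window))


def best_hit(pwm, seq):
    """Return (score, position) of the best-scoring window in seq."""
    w = len(pwm)
    return max((score(pwm, seq[i:i + w]), i) for i in range(len(seq) - w + 1))


pwm = build_pwm(SITES)
print(best_hit(pwm, "GGGCCCTATAATGGG"))
```

The pseudocount keeps bases that were never observed at a position from receiving a score of minus infinity.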

16 16 Gene Finding in Eukaryotes

17 17 Coding density

18 18 Eukaryote gene structure: Gene length: 30 kb; coding region: 1-2 kb. Binding site: ~6 bp, ~30 bp upstream of the TSS. Average of 6 exons, each 150 bp long. Huge variance: dystrophin is 2.4 Mb long; blood coagulation factor has 26 exons ranging from 69 bp to 3106 bp, and its intron 22 contains another, unrelated gene.

19 19 Splicing: Splicing is the removal of the introns. It is performed by complexes called spliceosomes, which contain both proteins and snRNA. The snRNA recognizes the splice sites through RNA-RNA base-pairing. Recognition must be precise: a 1 nt error can shift the reading frame, making nonsense of the message. Many genes have alternative splicing, which changes the protein created.
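The effect of a 1 nt splicing error can be seen in a toy example. The sequences and coordinates below are hypothetical, chosen only to illustrate the frameshift; `splice` and `codons` are illustrative helper names.

```python
def splice(seq, introns):
    """Remove the half-open intervals [start, end) listed in introns."""
    pieces, prev = [], 0
    for start, end in sorted(introns):
        pieces.append(seq[prev:start])
        prev = end
    pieces.append(seq[prev:])
    return "".join(pieces)


def codons(seq, frame=0):
    """Split seq into complete codons starting at the given frame offset."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


# Hypothetical pre-mRNA (written as DNA): exon 1, intron, exon 2.
exon1, intron, exon2 = "ATGGCT", "GTAAGTTTTCAG", "GCTTAA"
pre_mrna = exon1 + intron + exon2

correct = splice(pre_mrna, [(6, 18)])  # intron removed exactly
shifted = splice(pre_mrna, [(6, 19)])  # 3' splice site off by 1 nt

print(codons(correct))  # ['ATG', 'GCT', 'GCT', 'TAA']
print(codons(shifted))  # ['ATG', 'GCT', 'CTT']
```

In the shifted transcript every codon downstream of the junction changes, and here the stop codon is lost from the reading frame.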

20 20 Splice Sites

21 21 Gene prediction programs: Scan the sequence in all 6 reading frames for: 1. Start and stop codons 2. Long ORFs 3. Codon usage 4. GC content 5. Gene features: promoter, terminator, poly-A sites, exons and introns, … [Figure: reading frames +1, +2, +3]

22 22 Gene prediction programs: Genscan: Vertebrates/Maize/Arabidopsis. Predicts gene locations and gene features. Can handle a few genes in one sequence.

23 23 Gene prediction programs Results:

24 24 GenScan Output

25 25 GenScan Performance: Correctly predicts 80% of exons. With multiple exons, the probability of predicting the whole gene correctly declines. Per-base prediction accuracy: > 90%.
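A rough way to see why accuracy declines with exon count: if each exon is predicted correctly with probability 0.8 and, as a simplifying assumption that the slide does not state, exon predictions are independent, then the chance of getting every exon of a gene right falls geometrically with the number of exons.

```python
def prob_all_exons_correct(n_exons, per_exon_accuracy=0.8):
    """Probability that all n exons are correct, assuming independent exons.

    Independence is an assumption made for illustration only.
    """
    return per_exon_accuracy ** n_exons


for n in (1, 3, 6, 10):
    print(n, round(prob_all_exons_correct(n), 3))
```

Under this assumption a gene with 6 exons would be predicted entirely correctly only about a quarter of the time (0.8^6 ≈ 0.26).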

26 26 Many prediction tools: Dynamic programming to build the highest-scoring model from the available features, e.g. Genefinder. Markov model based on a typical gene model, e.g. GENSCAN or GLIMMER. Neural net trained with confirmed gene models, e.g. GRAIL.

27 27 An Unsolved Problem www.hgmp.mrc.ac.uk/NIX

28 28 An end to ab initio prediction: Ab initio gene prediction is inaccurate. High false positive rates for most predictors. Rarely used as a final product. Human annotation runs multiple algorithms and scores exons predicted by multiple predictors. Used as a starting point for refinement/verification.
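Scoring exons by how many predictors agree on them can be sketched as a simple vote. The predictor names, coordinates, and the exact-match criterion below are hypothetical; real annotation pipelines use more elaborate scoring.

```python
from collections import Counter


def consensus_exons(predictions, min_support=2):
    """Return [(exon, support)] for exons predicted identically by at least
    min_support predictors.

    predictions maps a predictor name to a list of (start, end) exons.
    """
    support = Counter()
    for exons in predictions.values():
        for exon in set(exons):  # count each predictor at most once per exon
            support[exon] += 1
    return sorted((exon, n) for exon, n in support.items() if n >= min_support)


# Made-up predictions from three hypothetical predictors.
predictions = {
    "predictor_a": [(100, 250), (400, 520)],
    "predictor_b": [(100, 250), (610, 700)],
    "predictor_c": [(100, 250), (400, 520)],
}
print(consensus_exons(predictions))
# [((100, 250), 3), ((400, 520), 2)]
```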

29 29 Comparative Genomics: Use homologous sequences: 1. Annotated genes 2. mRNA sequences 3. Protein sequences 4. ESTs

30 30 ESTs EST – Expressed Sequence Tags. Short sequences which are obtained from cDNA (mRNA).

31 31 Transcript-based prediction: Align transcript data to the genomic sequence using pairwise sequence comparison. [Figure: EST and cDNA alignments supporting a gene model]

32 32 Transcript-based prediction: Example: BlastN against a human EST database.

33 33 Annotation of eukaryotic genomes [Figure: genomic DNA → (transcription) → unprocessed RNA → (RNA processing) → mature mRNA with poly-A tail → (translation) → nascent polypeptide → (folding) → active enzyme → function (Reactant A → Product B). Annotation stages: ab initio gene prediction (without prior knowledge); comparative gene prediction (using other biological data); functional identification.]

34 34 FIN

35 35 Ribosome

36 36 Generalized HMM (Burge & Karlin, J. Mol. Biol. 1997, 268:78-94): Semi-Markov model with a different output length at each node. HMM with a different output length and a different output distribution at each node.

37 37 Generalized HMM (Burge & Karlin, J. Mol. Biol. 1997, 268:78-94) Overview: Hidden Markov states q_1, …, q_n. State q_i has output length distribution f_i. The output of each state can have a separate probabilistic model (weight matrix model, HMM, …). Initial state probability distribution π. State transition probabilities T_ij.

38 38 Length Distribution: Since an HMM is a memory-less process, the only length distribution that can be modeled is geometric. [Figure: a simple two-state HMM for gene structure, with states exon and intron and transition probabilities p, 1-p, q, 1-q.] The length of each exon (intron) has a geometric distribution: [formula not captured in transcript]
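The formula is missing from the transcript. Assuming p is the probability of staying in the exon state at each step (the figure labels do not make clear which transition is which, so this is an assumption), the standard geometric form is P(length = k) = p^(k-1) (1 - p), with mean length 1 / (1 - p). A short sketch:

```python
def geometric_pmf(k, p_stay):
    """P(length = k) when the state repeats with probability p_stay per step."""
    if k < 1:
        return 0.0
    return p_stay ** (k - 1) * (1.0 - p_stay)


def mean_length(p_stay):
    """Expected length of a run in a state with self-transition p_stay."""
    return 1.0 / (1.0 - p_stay)


p = 0.9  # hypothetical self-transition probability
print([round(geometric_pmf(k, p), 4) for k in range(1, 6)])
print(mean_length(p))
```

The probability is highest at k = 1 and decays from there, which is a poor fit for real exon lengths and motivates the explicit length distributions of the generalized HMM.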

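39 39 GenScan Model [Figure: GenScan state diagram with states for exons (including initial/terminal exons), introns, 5'/3' UTRs, promoter and poly-A signal, on both the forward and backward strands. Source: Burge & Karlin, JMB 97.]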
