1 Gene prediction in bacteria and eukaryotes

2 Gene structure
(Figure: gene structure in bacteria compared with eukaryotes)

3 Outline
1. Identification of genes in prokaryotic genomes
- Genome and gene structure
- Introduction to Hidden Markov models (HMMs)
- Example programs: GLIMMER and FGENESB
2. Identification of genes in eukaryotic genomes
- Intrinsic and extrinsic approaches
- Example programs: FGENESH

4 Prediction of genes
Typically, the first bioinformatic step after sequencing a genome is the identification and annotation of genes, that is, identifying the functional content of the genome:
- protein-encoding genes
- ribosomal RNA genes (rRNA)
- transfer RNA genes (tRNA)
- small RNAs
Gene identification is more difficult in eukaryotes than in prokaryotes.

5 Characteristics of bacterial and archaeal genomes
Gene annotation in prokaryotes (bacteria and archaea) is relatively simple compared to eukaryotes because of:
1. High gene density (roughly one gene per kilobase)
2. Absence of introns
3. Very little repetitive DNA

6 Annotation of genes in bacteria (and archaea)
Genes are most simply identified by the presence of long open reading frames (ORFs).
Prokaryotic genes are often organized in an operon structure.

7 Gene recognition in genomic DNA
1. Open reading frame (ORF) length
An ORF is defined by a start codon and a stop codon.
5’-ATG ... TAA-3’
- Start codon: ATG; alternative start codons in bacteria: GTG, TTG
- Stop codons: TAA, TAG, TGA
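The ORF criterion can be illustrated with a minimal sketch. It scans only the forward strand, treats ATG, GTG and TTG as start codons, and takes the first start in each frame; the function name `find_orfs` and the `min_codons` threshold are hypothetical choices for illustration, not part of any program named in these slides.

```python
START_CODONS = {"ATG", "GTG", "TTG"}   # ATG plus the bacterial alternatives
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.

    `end` is exclusive and includes the stop codon. `min_codons` is an
    illustrative length cutoff (start and stop codons are counted).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i                      # open an ORF at the first start codon
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close the ORF and keep scanning
    return orfs
```

A real gene finder would also scan the reverse complement and use a much larger length cutoff, since short ORFs occur frequently by chance.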

9 Gene recognition in genomic DNA
2. Patterns of codon usage that are consistent with genes
Markov models are very useful in defining the coding potential of putative protein-coding DNA sequences, e.g. in GLIMMER and FGENESB.

11 Markov chains and models
A Markov model (or chain) refers to a series of observations in which the probability of an observation depends on a number of previous observations.
- The number of previous observations defines the order of the chain.
- Fifth-order Markov models are used in gene prediction.
- In coding regions of DNA, the probability of a given base depends on the 5 bases preceding it.
This reflects codon bias and dependencies between codons.

12 Fifth order Markov models
Coding DNA sequence:
5’-ATG (M) GAT (D) ATC (I) GCC (A) ATC (I) CAC (H) TAA (stop)-3’
This reflects codon bias and dependencies between codons.
How well does the local nucleotide sequence conform to the fifth-order dependencies observed in coding regions? The higher the conformity, the higher the probability that the DNA sequence is protein-encoding.
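The idea of scoring conformity to fifth-order dependencies might be sketched as follows. This is a plain fixed-order Markov chain with pseudocounts, compared against a noncoding background model as a log-likelihood ratio; the function names, the pseudocount, and the uniform fallback for unseen contexts are assumptions made for illustration and do not reproduce the internals of GLIMMER or FGENESB.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=5, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: {b: pseudocount for b in "ACGT"})
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            if s[i] in "ACGT":
                counts[s[i - order:i]][s[i]] += 1
    model = {}
    for ctx, c in counts.items():
        total = sum(c.values())
        model[ctx] = {b: c[b] / total for b in "ACGT"}
    return model


def log_likelihood(seq, model, order=5):
    """Sum of log P(base | context); unseen contexts fall back to 0.25."""
    seq = seq.upper()
    ll = 0.0
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        p = probs.get(seq[i], 0.25) if probs else 0.25
        ll += math.log(p)
    return ll


def coding_score(seq, coding_model, noncoding_model, order=5):
    """Positive when `seq` fits the coding model better than the background."""
    return (log_likelihood(seq, coding_model, order)
            - log_likelihood(seq, noncoding_model, order))
```

In practice the coding model would be trained on known genes from the same or a closely related organism, and a separate model is often kept for each codon position.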

15 Hidden Markov models
Hidden Markov models (HMMs) are used to provide a statistical representation of real biological processes.
- States: genetic elements (e.g. coding or noncoding) are referred to as states.
- Transition probabilities: how likely a change of state is as one moves through the sequence.
- Emission probabilities: each state emits a particular nucleotide with some probability.

16 Hidden Markov models
Hidden Markov models (HMMs) are used to provide a statistical representation of real biological processes.
The sequence characteristics provide information on how likely a state is as one moves through the sequence.
The user “sees” the nucleotide sequence being analyzed, but does not actually see the state that each base is in, hence the term “hidden” Markov model.
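To make the hidden-state idea concrete, here is a toy two-state HMM decoded with the Viterbi algorithm, which recovers the most probable state path for an observed sequence. All probabilities below are invented for illustration (the "coding" state is simply made GC-rich); they are not parameters of any real gene finder.

```python
import math

STATES = ("coding", "noncoding")
# Toy parameters, chosen only for illustration.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq):
    """Return the most probable state path for a non-empty ACGT string."""
    # scores[t][s]: best log-probability of any path ending in state s at t
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    backptrs = []
    for base in seq[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[-1][p] + math.log(TRANS[p][s]))
            row[s] = scores[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            ptr[s] = prev
        scores.append(row)
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

The observed bases are the input; the returned labels are the hidden states the model infers for them.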

17 HMMs need training sets
Hidden Markov models (HMMs) are derived from training sets, where the correct structure is already known.
Transition and emission probabilities are derived from training sets.
The objective of training is to define a set of parameters that maximizes prediction accuracy for a new sequence of interest.
Model parameters differ from organism to organism; therefore, the success of an HMM-based method depends on how well the training set represents the sequence of interest.
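When the training set is fully labeled, deriving the parameters amounts to counting. The sketch below assumes each training sequence comes with a label string of the same length (one state symbol per base, here `C` for coding and `N` for noncoding); the function name, the label alphabet, and the pseudocount are hypothetical.

```python
def estimate_hmm(sequences, labels, states=("C", "N"), alphabet="ACGT",
                 pseudocount=1.0):
    """Estimate transition and emission probabilities from labeled sequences."""
    trans = {a: {b: pseudocount for b in states} for a in states}
    emit = {s: {x: pseudocount for x in alphabet} for s in states}
    for seq, lab in zip(sequences, labels):
        for i, (base, state) in enumerate(zip(seq, lab)):
            emit[state][base] += 1               # count what each state emits
            if i + 1 < len(lab):
                trans[state][lab[i + 1]] += 1    # count state-to-state moves

    def normalize(row):
        total = sum(row.values())
        return {k: v / total for k, v in row.items()}

    return ({s: normalize(trans[s]) for s in states},
            {s: normalize(emit[s]) for s in states})
```

Because the counts come directly from the training data, a model trained on one organism can perform poorly on another with different base composition or codon usage.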

18 Gene recognition in genomic DNA
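3. A consensus sequence for a ribosome binding site (RBS) in the vicinity of a start codon.
(Figure: an ORF, 5’-ATG ... TAA-3’ (stop codons TAA, TAG, TGA), with the region from -20 to +13 around the start codon marked.)
In bacteria, the ribosome binding site is called a Shine-Dalgarno sequence.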
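A very simple way to look for a ribosome binding site is to search the bases just upstream of a candidate start codon for the best match to a consensus motif. The sketch below assumes the E. coli Shine-Dalgarno core `AGGAGG` and a 20-base upstream window; both are illustrative defaults, and real programs typically use a weight matrix rather than mismatch counting.

```python
def best_rbs_match(seq, start, motif="AGGAGG", window=20):
    """Find the best motif match upstream of the start codon at index `start`.

    Returns (mismatches, position) for the closest match within `window`
    bases upstream, or None if the upstream region is shorter than the motif.
    """
    lo = max(0, start - window)
    upstream = seq[lo:start].upper()
    best = None
    for i in range(len(upstream) - len(motif) + 1):
        candidate = upstream[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(candidate, motif))
        if best is None or mismatches < best[0]:
            best = (mismatches, lo + i)
    return best
```

A candidate start codon with a low-mismatch hit a few bases upstream is more likely to be a genuine translation start than one without.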

19 A logo of the RBS and start codon in E. coli genes
Fig 10.4

20 Gene recognition in genomic DNA
4. Homology of putative genes to other previously described genes
- Genomic DNA can be searched against protein databases using blastx
- Exons can be matched to cDNA sequences

21 Gene recognition in genomic DNA
Intrinsic approaches (ab initio):
1. Open reading frame (ORF) length
2. Patterns of codon usage that are consistent with genes
3. A consensus sequence for a ribosome binding site in the vicinity of a start codon
Extrinsic approaches:
4. Homology of putative genes to other previously described genes

22 Bacterial gene prediction: GLIMMER
GLIMMER is a bacterial (also archaeal and viral) gene-finding algorithm that uses a fifth-order Markov chain.
Step 1. Build a Markov model from a training set
Step 2. Scan the genomic DNA sequence to predict genes
Criteria for gene finding:
- start and stop codon
- minimal length for an ORF

25 FGENESB: bacterial operon and gene prediction
The FGENESB gene prediction algorithm is based on Markov chain models of coding regions and of translation start and termination sites.

27 FGENESB: step by step description of annotation
1. Finds all potential ribosomal RNA genes using a BLAST search against ribosomal RNA databases.
2. Predicts tRNA genes using the tRNAscan-SE program.
3. Makes an initial prediction of ORFs using fifth- and second-order Markov models.
4. Predicts operons based on the distance between predicted genes.

28 FGENESB: step by step description of annotation
5. Runs BLAST for predicted proteins against the COG database.
6. Uses information about known neighboring gene pairs to improve operon prediction.
7. Runs BLAST for predicted proteins against the NCBI nr database.
8. Predicts promoters and terminators.
9. Refines operon predictions using promoter and terminator evidence.
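Step 4 of the list above, predicting operons from the distance between predicted genes, could be sketched as follows. The sketch groups neighboring genes that lie on the same strand and are separated by at most `max_gap` bases; the 100-base threshold, the tuple layout, and the function name are assumptions for illustration and are not taken from FGENESB.

```python
def predict_operons(genes, max_gap=100):
    """Group genes into candidate operons by strand and intergenic distance.

    `genes` is a list of (name, start, end, strand) tuples. Adjacent genes
    are placed in the same operon when they share a strand and the gap
    between them is at most `max_gap` bases (overlaps count as zero gap).
    """
    operons = []
    for gene in sorted(genes, key=lambda g: g[1]):
        if operons:
            prev = operons[-1][-1]
            same_strand = gene[3] == prev[3]
            if same_strand and gene[1] - prev[2] <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons
```

Later steps in the list refine such distance-based groupings with independent evidence such as conserved gene pairs, promoters, and terminators.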

29 Example of FGENESB output
(Figure: annotated FGENESB output showing the number of operons, the number of genes, genomic features, the location of features, and BLAST results.)

30 Outline
1. Identification of genes in prokaryotic genomes
- Genome and gene structure
- Introduction to Hidden Markov models (HMMs)
- Example programs: GLIMMER and FGENESB
2. Identification of genes in eukaryotic genomes
- Intrinsic and extrinsic approaches
- Example programs: FGENESH

31 From eukaryotic DNA to protein
Fig 10.10

32 Additional difficulties with gene identification in eukaryotes
1. Eukaryotic genes are split into exons and introns.
2. For many eukaryotes, most of the genome does not encode genes.
- e.g. less than 2% of vertebrate genomes code for proteins

33 Annotation of genes in eukaryotes
Intrinsic approaches:
1. Predicting gene structure through computational analysis of genomic DNA sequence
Extrinsic approaches:
1. Aligning ESTs or cDNA to genomic DNA sequences
2. Mapping genes from one organism to conserved regions of a closely related organism

34 Computational gene prediction
Typically, gene prediction from eukaryotic genomes involves the following steps:
1. Identify and score exon-intron splice sites and start and stop signals along the DNA sequence.
2. Predict candidate exons from these signals.
3. Score exons and incorporate any homology-based or comparative genome information.
4. Assemble a subset of exon candidates into a predicted gene structure.

35 Prediction of Exon-Defining Signals
There are four basic signals involved in defining coding exons:
- Translation start site
- 5’ splice site (donor)
- 3’ splice site (acceptor)
- Translation stop codon
These sequence signals can be detected using position weight matrices (PWMs) calculated from known functional signals.
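A position weight matrix can be built from aligned examples of a signal and then slid along a sequence. The sketch below uses log-odds scores against a uniform background; the function names, the pseudocount, and the uniform background are assumptions, and any example sites passed in would need to come from real annotated signals.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned sites (ACGT only)."""
    pwm = []
    for pos in range(len(sites[0])):
        column = {b: pseudocount for b in "ACGT"}
        for site in sites:
            column[site[pos]] += 1
        total = sum(column.values())
        pwm.append({b: math.log2((column[b] / total) / background)
                    for b in "ACGT"})
    return pwm


def score_site(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, seq):
    """Score every window of the PWM's width along `seq`."""
    width = len(pwm)
    return [(i, score_site(pwm, seq[i:i + width]))
            for i in range(len(seq) - width + 1)]
```

Windows with high scores resemble the training signals; a threshold on the score then decides which positions are reported as candidate sites.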

36 Computational gene prediction in eukaryotes
Fig 9.4

37 Assembly of exons into a gene structure
Splicing exons together into a gene structure can eliminate false exons by examining whether the ORF established by the initial exon is preserved.
PROBLEM: the number of possible exon assemblies increases exponentially with the number of predicted exons.
SOLUTION 1: Dynamic programming methods, e.g. GRAIL2, FGENESH, GENEID
SOLUTION 2: HMMs to define highly complex, multi-exonic genes, e.g. GENSCAN, GENIE, HMMgene
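The dynamic programming idea can be shown with a deliberately simplified sketch: given scored candidate exons, pick the highest-scoring set of non-overlapping exons without enumerating every combination. This is the classic weighted interval scheduling recurrence, used here as a stand-in; it ignores reading-frame compatibility and splice-site pairing, which real gene finders enforce, and the function name and tuple layout are hypothetical.

```python
import bisect


def best_exon_chain(exons):
    """Pick the best-scoring chain of non-overlapping exons.

    `exons` is a list of (start, end, score) with `end` exclusive.
    Returns (total_score, chain), with the chain in genomic order.
    """
    exons = sorted(exons, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    # best[i] holds the optimum over the first i exons (sorted by end).
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end at or before this start
        j = bisect.bisect_right(ends, start, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]
```

Each candidate exon is considered once, so the work grows roughly as n log n in the number of candidates rather than exponentially.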

38 HMMs in Eukaryotic Gene Prediction
There are additional “states” for eukaryotic gene models compared to prokaryotic gene models:
- exons, introns, splice donors (5’ splice site) and splice acceptors (3’ splice site)

39 HMMs in Eukaryotic Gene Prediction
Working from 5’ to 3’ along a DNA sequence, a Hidden Markov Model may take into account the unique characteristics of:
- Promoter regions
- Transcriptional start sites (TSSs)
- 5’ UTRs
- Start codons
- Exons and introns (as well as the splice sites)
- Stop codons
- 3’ UTRs
- PolyA tails

40 Sequence Similarity-based Gene Prediction
Expressed sequence tags (ESTs) are extremely valuable for identifying genes and defining exonic structure. Sequences arising from mature mRNA are mapped back onto genomic DNA sequences.
Fig 9.1: Homology search of a DNA sequence that contains three exons against the EST database

41 Gene Prediction Programs
GRAIL: one of the first gene finding algorithms developed

42 Gene Prediction Programs: Annotation pipeline

45 Gene prediction methods have different levels of accuracy and efficiency.
They are scored according to two criteria:
(i) Sensitivity: the proportion of real genes that are correctly predicted.
(ii) Specificity: the proportion of predicted genes that are correct.
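Under these two definitions, both measures can be computed from the overlap between the annotated and the predicted gene sets. The sketch below counts a prediction as correct only if it matches an annotated gene exactly; that matching rule, the function name, and the coordinate-pair representation are assumptions for illustration. Note that specificity as defined here is what other fields call precision.

```python
def gene_level_accuracy(true_genes, predicted_genes):
    """Return (sensitivity, specificity) at the whole-gene level.

    Genes are compared as hashable items, e.g. (start, end) pairs, and a
    prediction counts as correct only on an exact match.
    """
    true_set, pred_set = set(true_genes), set(predicted_genes)
    correct = len(true_set & pred_set)
    sensitivity = correct / len(true_set) if true_set else 0.0
    specificity = correct / len(pred_set) if pred_set else 0.0
    return sensitivity, specificity
```

The same calculation can be applied at the exon or nucleotide level by changing what is placed in the two sets.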

