Gene prediction in bacteria and eukaryotes

Gene structure: bacteria and eukaryotes

Outline
1. Identification of genes in prokaryotic genomes
- Genome and gene structure
- Introduction to Hidden Markov models (HMMs)
- Example programs: GLIMMER and FGENESB
2. Identification of genes in eukaryotic genomes
- Intrinsic and extrinsic approaches
- Example programs: FGENESH

Prediction of genes
Typically, the first bioinformatic step after sequencing a genome is the identification and annotation of genes, that is, identifying the functional content of the genome:
- protein-encoding genes
- ribosomal RNA (rRNA) genes
- transfer RNA (tRNA) genes
- small RNAs
Gene identification is more difficult in eukaryotes than in prokaryotes.

Characteristics of bacterial and archaeal genomes
Gene annotation in prokaryotes (bacteria and archaea) is relatively simple compared to eukaryotes because of:
1. High gene density (roughly one gene per kilobase)
2. Absence of introns
3. Very little repetitive DNA

Annotation of genes in bacteria (and archaea)
Genes are most simply identified by the presence of long open reading frames (ORFs).
Prokaryotic genes are often organized in operons.

Gene recognition in genomic DNA
1. Open reading frame (ORF) length
An ORF is defined by a start codon and a stop codon.
Start codon: ATG (alternative start codons in bacteria: GTG, TTG)
Stop codons: TAA, TAG, TGA
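
The ORF definition above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular gene finder: the function name `find_orfs` and the `min_codons` threshold are hypothetical, only the forward strand is scanned, and the first start codon in each frame is taken as the ORF start.

```python
# Start and stop codons as listed on the slide (bacterial alternatives included).
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    `end` is exclusive and includes the stop codon; `min_codons` is a
    hypothetical length cutoff (real tools use much larger values).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i                      # open an ORF at the first start codon
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close the ORF and keep scanning
    return orfs

print(find_orfs("CCATGGTGTTGTAACC"))
```

A full implementation would also scan the reverse complement and would usually report every candidate start codon rather than only the first.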

Gene recognition in genomic DNA
2. Patterns of codon usage that are consistent with genes
Markov models are very useful in defining the coding potential of putative protein-coding DNA sequences, e.g. in GLIMMER and FGENESB.

Markov chains and models
A Markov model (or chain) refers to a series of observations in which the probability of an observation depends on a number of previous observations. The number of previous observations defines the order of the chain.
- Fifth-order Markov models are used in gene prediction.
- In coding regions of DNA, the probability of a given base depends on the 5 bases preceding it. Together with the current base, these span two adjacent codons.
This reflects codon bias and dependencies between codons.

Fifth order Markov models
Coding DNA sequence:
5’-ATG GAT ATC GCC ATC CAC TAA-3’
    M   D   I   A   I   H   *
Reflects codon bias and dependencies between codons.
How well does the local nucleotide sequence conform to the fifth order dependencies observed in coding regions? The higher the conformity, the higher the probability that the DNA sequence is protein-encoding.
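
To make "conformity to fifth order dependencies" concrete, here is a hedged Python sketch. It estimates P(base | previous 5 bases) from training sequences and scores a candidate as a log-odds ratio against a background model. The function names, the pseudocount, and the toy training data are all assumptions for illustration; this is not how GLIMMER or FGENESB is actually implemented.

```python
import math
from collections import defaultdict

ORDER = 5          # fifth-order model: condition on the 5 preceding bases
BASES = "ACGT"

def train_markov(seqs, order=ORDER, pseudocount=1.0):
    """Count (context -> next base) occurrences and return a probability function."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        c = counts.get(context, {})
        total = sum(c.values()) + pseudocount * len(BASES)
        return (c.get(base, 0.0) + pseudocount) / total   # smoothed estimate

    return prob

def log_odds(seq, coding_prob, background_prob, order=ORDER):
    """Sum of log(P_coding / P_background) over every position with a full context."""
    score = 0.0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        score += math.log(coding_prob(ctx, base) / background_prob(ctx, base))
    return score

# Toy training sets (hypothetical): real models are trained on known genes.
coding = train_markov(["ATGGATATCGCCATCCACTAA" * 5])
background = train_markov(["ACGT" * 20])
print(log_odds("ATGGATATCGCCATCCACTAA", coding, background) > 0)
```

A positive score means the sequence looks more like the coding training data than like the background. Real gene finders typically keep separate models for each codon position, which this sketch omits.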

Hidden Markov models
Hidden Markov models (HMMs) are used to provide a statistical representation of real biological processes.
- Genetic elements (e.g. coding or noncoding) are referred to as states.
- Transition probabilities: how likely a change of state is, as one moves through the sequence.
- Emission probabilities: each state emits a particular nucleotide with some probability.

Hidden Markov models
Hidden Markov models (HMMs) are used to provide a statistical representation of real biological processes.
The sequence characteristics provide information on how likely a state is as one moves through the sequence.
The user “sees” the nucleotide sequence being analyzed, but does not actually see the state that each base is in, hence the term “hidden” Markov model.
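
As an illustration of how the hidden states can be recovered from the visible sequence, the sketch below runs the Viterbi algorithm on a two-state HMM (coding vs. noncoding). All probabilities are made-up values chosen for the example: the "coding" state is simply assumed to be GC-rich, and each state emits single bases, which is far simpler than the models used by real gene finders.

```python
import math

STATES = ("noncoding", "coding")
# Hypothetical parameters, chosen only for illustration.
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {"noncoding": {"noncoding": 0.9, "coding": 0.1},
         "coding": {"noncoding": 0.1, "coding": 0.9}}
EMIT = {"noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
        "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}}

def viterbi(seq):
    """Return the most probable state path for `seq` (log-space Viterbi)."""
    v = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        scores, ptr = {}, {}
        for s in STATES:
            # Best previous state for arriving in state s.
            best = max(STATES, key=lambda p: v[-1][p] + math.log(TRANS[p][s]))
            scores[s] = v[-1][best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
            ptr[s] = best
        v.append(scores)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

path = viterbi("ATATATAT" + "GCGCGCGCGC" + "ATATATAT")
print("".join("C" if s == "coding" else "n" for s in path))
```

The printed string labels each base with its inferred state; the GC-rich stretch in the middle is assigned to the "coding" state because the gain in emission probability outweighs the cost of two state changes.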

HMMs need training sets
Hidden Markov models (HMMs) are derived from training sets, where the correct structure is already known.
Transition and emission probabilities are derived from training sets.
The objective of training is to define a set of parameters that maximize the correct prediction for a new sequence of interest.
Model parameters differ from organism to organism; therefore the success of an HMM-based method depends on how well the training set represents the sequence of interest.
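
When the training set is fully labelled, transition and emission probabilities can be estimated by simple counting. The sketch below assumes a hypothetical input format (pairs of sequence and per-base state labels, with "N" for noncoding and "C" for coding) and adds a pseudocount so that unseen events do not get probability zero.

```python
from collections import Counter

def estimate_parameters(labeled, states=("N", "C"), alphabet="ACGT", pseudocount=1.0):
    """Estimate (transition, emission) tables from (sequence, labels) pairs."""
    trans, emit = Counter(), Counter()
    for seq, labels in labeled:
        for i, (base, state) in enumerate(zip(seq, labels)):
            emit[(state, base)] += 1
            if i > 0:
                trans[(labels[i - 1], state)] += 1

    def normalise(counts, rows, cols):
        # Turn raw counts into smoothed conditional probabilities per row.
        table = {}
        for r in rows:
            total = sum(counts[(r, c)] for c in cols) + pseudocount * len(cols)
            table[r] = {c: (counts[(r, c)] + pseudocount) / total for c in cols}
        return table

    return normalise(trans, states, states), normalise(emit, states, alphabet)

# Toy training example (hypothetical labels).
transitions, emissions = estimate_parameters([("ATGCGCTA", "NNCCCCNN")])
print(transitions["N"], emissions["C"])
```

Because the counts come from one organism's known genes, the resulting tables reflect that organism's composition, which is why a model trained on one genome may perform poorly on another.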

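Gene recognition in genomic DNA
3. A consensus sequence for a ribosome binding site (RBS) in the vicinity of a start codon.
Figure: a gene running from the start codon (ATG) to a stop codon (TAA, TAG or TGA), with the region from -20 to +13 around the start codon marked.
In bacteria, the ribosome binding site is called the Shine-Dalgarno sequence.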
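
A simple way to look for a Shine-Dalgarno-like site is to slide the consensus along the region upstream of a candidate start codon and count matching positions. The sketch below uses the commonly cited E. coli consensus AGGAGG and a 20-base upstream window; the function name and scoring scheme are assumptions for illustration, and real tools use weight matrices rather than match counts.

```python
SD_CONSENSUS = "AGGAGG"   # commonly cited E. coli Shine-Dalgarno consensus

def best_sd_match(seq, start_pos, window=20):
    """Return (matches, position) of the best consensus match upstream of start_pos.

    `start_pos` is the index of the first base of the start codon;
    `window` is the number of upstream bases searched (an assumption).
    """
    offset = max(0, start_pos - window)
    upstream = seq[offset:start_pos]
    best = (0, None)
    for i in range(len(upstream) - len(SD_CONSENSUS) + 1):
        site = upstream[i:i + len(SD_CONSENSUS)]
        matches = sum(a == b for a, b in zip(site, SD_CONSENSUS))
        if matches > best[0]:
            best = (matches, offset + i)
    return best

seq = "TTTTAGGAGGTTTTTTTATGAAA"
print(best_sd_match(seq, seq.index("ATG")))
```

A high match count close to the start codon supports the candidate start; a gene finder would combine this evidence with ORF length and coding potential.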

A logo of the RBS and start codon in E. coli genes (Fig 10.4)

Gene recognition in genomic DNA
4. Homology of putative genes to other previously described genes
- Genomic DNA can be searched against protein databases using blastx
- Exons can be matched to cDNA sequences

Gene recognition in genomic DNA
Intrinsic approaches (ab initio):
1. Open reading frame (ORF) length
2. Patterns of codon usage that are consistent with genes
3. A consensus sequence for a ribosome binding site in the vicinity of a start codon
Extrinsic approaches:
4. Homology of putative genes to other previously described genes

Bacterial gene prediction: GLIMMER
GLIMMER is a bacterial (also archaeal and viral) gene finding algorithm that uses interpolated Markov models, which combine Markov chains of several orders, including the fifth order chain described earlier.
Step 1. Build a Markov model from a training set
Step 2. Scan genomic DNA sequence to predict genes
Criteria for gene finding:
- start and stop codon
- minimal length for an ORF

FGENESB: bacterial operon and gene prediction
The FGENESB gene prediction algorithm is based on Markov chain models of coding regions and translation and termination sites.
http://linux1.softberry.com/berry.phtml

FGENESB: step by step description of annotation
1. Finds all potential ribosomal RNA genes using a BLAST search against ribosomal RNA databases.
2. Predicts tRNA genes using the tRNAscan-SE program.
3. Makes an initial prediction of ORFs using fifth and second order Markov models.
4. Predicts operons based on the distance between predicted genes.

FGENESB: step by step description of annotation
5. Runs BLAST for predicted proteins against the COG database.
6. Uses information about known neighboring gene pairs to improve operon prediction.
7. Runs BLAST for predicted proteins against the NCBI nr database.
8. Predicts promoters and terminators.
9. Refines operon predictions using promoter and terminator evidence.
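
Step 4 of the pipeline (operon prediction from the distance between predicted genes) can be sketched as follows. This is not FGENESB's actual procedure: the gene tuple format, the function name, and the 100 bp gap threshold are assumptions for illustration. Adjacent genes are grouped when they lie on the same strand and the gap between them is small.

```python
def group_operons(genes, max_gap=100):
    """Group genes into putative operons.

    `genes` is a list of (name, start, end, strand) tuples;
    `max_gap` is a hypothetical intergenic distance cutoff in bp.
    """
    operons = []
    for gene in sorted(genes, key=lambda g: g[1]):
        if operons:
            prev = operons[-1][-1]
            same_strand = gene[3] == prev[3]
            close = gene[1] - prev[2] <= max_gap
            if same_strand and close:
                operons[-1].append(gene)   # extend the current operon
                continue
        operons.append([gene])             # start a new operon
    return [[g[0] for g in op] for op in operons]

genes = [("a", 1, 900, "+"), ("b", 950, 1800, "+"),
         ("c", 2500, 3300, "+"), ("d", 3320, 4000, "-")]
print(group_operons(genes))
```

The later steps of the pipeline (known neighboring gene pairs, promoters and terminators) would then be used to merge or split these initial groups.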

Example of FGENESB output
Figure labels: genomic features, location of features, BLAST results, number of operons, number of genes.

Outline
1. Identification of genes in prokaryotic genomes
- Genome and gene structure
- Introduction to Hidden Markov models (HMMs)
- Example programs: GLIMMER and FGENESB
2. Identification of genes in eukaryotic genomes
- Intrinsic and extrinsic approaches
- Example programs: FGENESH

From eukaryotic DNA to protein (Fig 10.10)

Additional difficulties with gene identification in eukaryotes
1. Eukaryotic genes are split into introns and exons.
2. For many eukaryotes, most of the genome does not code for proteins.
- e.g. less than 2% of the human genome codes for proteins

Annotation of genes in eukaryotes
Intrinsic approaches:
1. Predicting gene structure through computational analysis of genomic DNA sequence
Extrinsic approaches:
1. Aligning ESTs or cDNA to genomic DNA sequences
2. Mapping genes from one organism to conserved regions of a closely related organism

Computational gene prediction
Typically, gene prediction from eukaryotic genomes involves the following steps:
1. Identify and score exon-intron splice sites and start and stop signals along the DNA sequence.
2. Predict candidate exons from these signals.
3. Score exons and incorporate any homology-based or comparative genome information.
4. Assemble a subset of exon candidates into a predicted gene structure.

Prediction of Exon-Defining Signals
There are four basic signals involved in defining coding exons:
- translation start site
- 5’ splice site
- 3’ splice site
- translation stop site
These sequence signals can be detected using position weight matrices (PWMs) calculated from known functional signals.
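
The PWM idea can be sketched in Python: build a log-odds matrix from aligned examples of a signal, then score every window of a new sequence. The four training sites below are a toy set that merely resembles a 5’ splice site; the function names, pseudocount, and uniform background are assumptions for illustration.

```python
import math

def build_pwm(sites, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Build a log2-odds position weight matrix from equal-length aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        total = len(column) + pseudocount * len(alphabet)
        pwm.append({b: math.log2(((column.count(b) + pseudocount) / total) / background)
                    for b in alphabet})
    return pwm

def score_site(pwm, site):
    """Sum the per-position log-odds for one candidate site."""
    return sum(col[base] for col, base in zip(pwm, site))

def scan(pwm, seq):
    """Score every window of the PWM's width along seq."""
    w = len(pwm)
    return [(i, score_site(pwm, seq[i:i + w])) for i in range(len(seq) - w + 1)]

# Toy donor-like sites (hypothetical training data).
donor_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTAAGT"]
pwm = build_pwm(donor_sites)
hits = scan(pwm, "TTTCAGGTAAGTTTT")
print(max(hits, key=lambda h: h[1]))
```

Windows scoring above a chosen threshold become candidate signals; pairs of compatible signals then define candidate exons for the assembly step.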

Computational gene prediction in eukaryotes (Fig 9.4)

Assembly of exons into a gene structure
Splicing exons together into a gene structure can eliminate false exons by examining whether the ORF established by the initial exon is preserved.
PROBLEM: the number of possible exon assemblies increases exponentially with the number of predicted exons.
SOLUTION 1: Dynamic programming methods, e.g. GRAIL2, FGENESH, GENEID
SOLUTION 2: HMMs to define highly complex, multi-exonic genes, e.g. GENSCAN, GENIE, HMMgene
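
The dynamic programming idea can be illustrated with a simplified sketch: given scored candidate exons, find the highest-scoring chain of non-overlapping exons without enumerating every combination. This is not the algorithm of GRAIL2, FGENESH or GENEID. The tuple format, the `min_intron` value, and the omission of reading-frame checks are all simplifying assumptions.

```python
def best_exon_chain(exons, min_intron=20):
    """Return (total score, chain) for the best chain of compatible exons.

    `exons` is a list of (start, end, score) with `end` exclusive.
    Two exons are compatible if they are separated by at least
    `min_intron` bases (a hypothetical minimum intron length).
    Reading-frame compatibility is ignored in this sketch.
    """
    exons = sorted(exons, key=lambda e: e[1])
    best = []   # best[i] = (score, chain) of the best chain ending at exons[i]
    for i, (start, end, score) in enumerate(exons):
        total, chain = score, [exons[i]]
        for j in range(i):
            compatible = exons[j][1] + min_intron <= start
            if compatible and best[j][0] + score > total:
                total, chain = best[j][0] + score, best[j][1] + [exons[i]]
        best.append((total, chain))
    return max(best, key=lambda b: b[0]) if best else (0.0, [])

candidates = [(0, 100, 5.0), (50, 150, 6.0), (200, 300, 4.0),
              (280, 400, 1.0), (450, 500, 3.0)]
print(best_exon_chain(candidates))
```

Each exon is compared only with the exons that end before it, so the work grows quadratically with the number of candidates instead of exponentially.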

HMMs in Eukaryotic Gene Prediction
There are additional “states” for eukaryotic gene models compared to prokaryotic gene models.
- exons, introns, splice donors (5’ splice sites) and acceptors (3’ splice sites)

HMMs in Eukaryotic Gene Prediction
Working from 5’ to 3’ along a DNA sequence, a Hidden Markov Model may take into account the unique characteristics of:
- Promoter regions
- Transcriptional start sites (TSSs)
- 5’ UTRs
- Start codons
- Exons and introns (as well as the splice sites)
- Stop codons
- 3’ UTRs
- PolyA tails

Sequence Similarity-based Gene Prediction
Expressed sequence tags (ESTs) are extremely valuable for identifying genes and defining exonic structure.
Sequences arising from mature mRNA are mapped back onto genomic DNA sequences.
Fig 9.1: Homology search of a DNA sequence that contains three exons against the EST database

Gene Prediction Programs
GRAIL: one of the first gene finding algorithms developed
http://compbio.ornl.gov/grailexp/

Gene Prediction Programs: Annotation pipeline
http://compbio.ornl.gov/tools/pipeline//

Gene Prediction Programs
GENSCAN: http://genes.mit.edu/GENSCAN.html

Gene Prediction Programs
Softberry: http://linux1.softberry.com/

Gene prediction methods differ in accuracy and efficiency. Accuracy is scored according to two criteria:
(i) Sensitivity: the proportion of true genes that are correctly predicted.
(ii) Specificity: the proportion of predicted genes that are correct.
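
With the definitions above, both measures can be computed from the set of annotated genes and the set of predicted genes. The sketch below is an illustration under stated assumptions: genes are represented as hypothetical (start, end) pairs, and a prediction counts as correct only if it matches an annotated gene exactly.

```python
def sensitivity_specificity(true_genes, predicted_genes):
    """Return (sensitivity, specificity) as defined on the slide.

    sensitivity = correct predictions / true genes
    specificity = correct predictions / predicted genes
    """
    true_set, pred_set = set(true_genes), set(predicted_genes)
    correct = len(true_set & pred_set)
    sn = correct / len(true_set) if true_set else 0.0
    sp = correct / len(pred_set) if pred_set else 0.0
    return sn, sp

# Toy coordinates (hypothetical).
annotated = [(100, 400), (600, 900), (1200, 1500), (2000, 2600)]
predicted = [(100, 400), (600, 900), (1700, 1900)]
print(sensitivity_specificity(annotated, predicted))
```

In this toy case two of four annotated genes are found (sensitivity 0.5) and two of three predictions are correct (specificity about 0.67). The same calculation can be applied at the exon or nucleotide level by changing what counts as a match.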