Gene Prediction and Annotation techniques Basics

Chuong Huynh, NIH/NLM/NCBI, Sept 30, 2004. This lecture assumes we are looking at eukaryotic organisms, with a bias towards Trypanosoma and especially malaria. The objective is to find genes in your organism of interest. In some cases you have to predict the gene computationally rather than find it directly. We are not going to cover the identification of sequences that regulate the activity of genes, such as promoter prediction. Acknowledgement: Daniel Lawson, Neil Hall

What is gene prediction?
Detecting meaningful signals in uncharacterised DNA sequences. Gaining knowledge of the interesting information in DNA. Sorting the ‘wheat from the chaff’: GATCGGTCGAGCGTAAGCTAGCTAG ATCGATGATCGATCGGCCATATATC ACTAGAGCTAGAATCGATAATCGAT CGATATAGCTATAGCTATAGCCTAT In human, 5% of the genome is coding and 95% is repetitive or of uncertain function, so what is important and what is not? Gene prediction is ‘recognising protein-coding regions in genomic sequence’. (Daniel Lawson slide.)

Basic Gene Prediction Flow Chart
Obtain new genomic DNA sequence. 1. Translate in all six reading frames and compare to protein sequence databases. 2. Perform a database similarity search against the expressed sequence tag (EST) database of the same organism, or cDNA sequences if available. Use a gene prediction program to locate genes. Analyze regulatory sequences in the gene.
With whole-genome sequencing there is more data, and thus greater demand for computer programs that scan genomic DNA sequences to find genes, particularly ones that encode proteins. Gene prediction is usually done through a pipeline approach. You only predict genes that are unknown: no one replaces an experimentally known gene with a computer-predicted one (a hypothetical protein). Once you have the genomic sequence, the most likely protein-encoding regions are identified and the predicted proteins are then subjected to a database similarity search. The genomic DNA sequence is then annotated with information on the exon-intron structure and location of each predicted gene, along with any functional information from the database searches.

ACEDB View
This is ACEDB; note the exons and introns running 3’ to 5’. (Dan Lawson slide.)

Why is gene prediction important?
Increased volume of genome data generated. Paradigm shift from gene-by-gene sequencing (small scale) to large-scale genome sequencing: no more one gene at a time, and a lot of data. Foundation for all further investigation. Knowledge of the protein-coding regions underpins functional genomics. Note: this presentation covers the prediction of protein-coding genes only, not promoter prediction or other sequences that regulate the activity of protein-coding genes.

WormBase view, running 3’ to 5’; more data are added and cross-connections made.

Map Viewer
Figure labels: Genome Scan models, contig, GenBank genes, mouse EST hits, human EST hits.

Ensembl automated gene prediction; see exon structure

Artemis – Free Genome Visualization/ Annotation Workbench
Artemis; visualization work bench. S. pombe; See splice site

Genome WorkBench

Knowing what to look for
What is a gene? For this lecture, not a full transcript with control regions, but the coding sequence (ATG -> STOP): Start, Middle (N), End. It is about recognizing what is out there. We are looking for protein-coding genes, not trying to find upstream control elements. ATG = methionine. (Daniel Lawson slide.)

ORF Finding in Prokaryotes
The simplest method of finding DNA sequences that encode proteins is to search for open reading frames (ORFs). An ORF is a DNA sequence that contains a contiguous set of codons, each of which specifies an amino acid. There are six possible reading frames. Good for prokaryotic systems (no/little post-transcriptional modification). An ORF runs from Met (AUG) on the mRNA to a stop codon TER (UAA, UAG, UGA). Tool: NCBI ORF Finder. ORF finding works for prokaryotes because DNA sequences that encode proteins are transcribed into mRNA, and the mRNA is usually translated directly into protein without significant modification.
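The ORF definition above can be sketched in a few lines of Python. This is only an illustration of scanning six reading frames for ATG ... stop runs; it is not the algorithm used by NCBI ORF Finder, and the function names and the min_codons parameter are assumptions made for the example.

```python
# Minimal six-frame ORF scan (illustrative; not the NCBI ORF Finder algorithm).
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Yield (strand, frame, start, end, orf_sequence) for ATG...stop ORFs.

    start and end are 0-based offsets on the strand being scanned; end is
    exclusive and includes the stop codon. min_codons (a hypothetical
    threshold) counts the ATG but not the stop codon.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG in this frame opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3, s[start:i + 3]
                    start = None  # look for the next ORF in this frame
```

Calling list(find_orfs("CCATGAAATTTGGGTAACC")) reports a single forward-strand ORF. ORFs that run off the end of the sequence without a stop codon are not reported in this sketch.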

ORF Finder (Open Reading Frame Finder)

Annotation of eukaryotic genomes
Figure labels: Genomic DNA -> (transcription) -> unprocessed RNA -> (RNA processing) -> mature mRNA (Gm3 ... AAAAAAA) -> (translation) -> nascent polypeptide -> (folding) -> active enzyme -> function (Reactant A -> Product B). In the figure, red is coding (exons) and blue is introns; the flow follows the central dogma. There are three areas of annotation: ab initio gene prediction (w/o prior knowledge) covers genomic DNA to mature mRNA; comparative gene prediction (use other biological data) covers mature mRNA to polypeptide; functional identification covers what the peptide does.
In eukaryotic organisms, transcription of protein-encoding regions, initiated at specific promoter sequences, is followed by removal of noncoding sequences (introns) from the pre-mRNA by a splicing mechanism, leaving the protein-encoding exons. After some further modification, the mature mRNA is translated in the 5’ to 3’ direction, usually from the first start codon to the first stop codon. Because of the introns in the genomic DNA of eukaryotes, the ORF corresponding to an encoded gene is interrupted by introns, which usually introduce stop codons. Note: ab initio gene prediction is a statistical process that finds protein-coding genes; comparative prediction relies on similarity to experimentally confirmed genes or proteins.

Two Classes of Sequence Information
Signal terms: short sequence motifs (such as splice sites, branch points, polypyrimidine tracts, start codons, and stop codons). Content terms: patterns of codon usage that are unique to a species and allow coding sequences to be distinguished from surrounding noncoding sequences by a statistical detection algorithm. You can find signal terms in almost all eukaryotic genomes. In smaller eukaryotes such as yeast, signal terms are sufficient to characterize a gene.

Problem Using Codon Usage
The program must be taught what the codon usage patterns look like by presenting it with a TRAINING SET of known coding sequences. Different programs search for different patterns. A NEW training set is needed for each species. Untranslated regions (UTRs) at the ends of genes cannot be detected, but most programs can identify polyadenylation sites. Non-protein-coding RNA genes cannot be detected (a few specialized programs attempt it). None of these programs can detect alternatively spliced transcripts.
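As a rough illustration of "teaching" a program codon usage from a training set, the sketch below counts codons in known coding sequences and scores a new sequence by log-odds against a uniform background. The uniform 1/64 background, the pseudocount and all function names are assumptions made for the example; real gene finders use richer statistics.

```python
import math
from collections import Counter


def codon_counts(cds_list):
    """Count in-frame codons across a list of known coding sequences."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            counts[cds[i:i + 3]] += 1
    return counts


def train_codon_model(cds_list, pseudocount=1.0):
    """Return codon -> log2(training-set frequency / uniform 1/64).

    The pseudocount (an assumption of this sketch) keeps unseen codons
    from getting a score of minus infinity.
    """
    counts = codon_counts(cds_list)
    total = sum(counts.values()) + 64 * pseudocount
    model = {}
    for a in "ACGT":
        for b in "ACGT":
            for c in "ACGT":
                codon = a + b + c
                freq = (counts[codon] + pseudocount) / total
                model[codon] = math.log2(freq * 64)
    return model


def coding_score(seq, model, frame=0):
    """Sum the log-odds of the codons of seq read in the given frame."""
    seq = seq.upper()
    return sum(model.get(seq[i:i + 3], 0.0)
               for i in range(frame, len(seq) - 2, 3))
```

A sequence whose codons resemble the training set scores above zero and one that does not scores below zero, which is why a new training set is needed for each species: the same sequence can score very differently under models trained on different organisms.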

Explanation of False Positive/Negative in Gene Prediction Programs
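False positives and false negatives can be made concrete by comparing predicted exons with known exons base by base. The sketch below follows a convention common in gene-finder evaluations, where sensitivity = TP / (TP + FN) and "specificity" = TP / (TP + FP); that convention and the function names are assumptions of the example, not something defined on the slide.

```python
def coding_positions(exons):
    """Expand (start, end) exon intervals (0-based, end exclusive) to a set."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end))
    return positions


def nucleotide_accuracy(true_exons, predicted_exons):
    """Compare predicted and true coding bases.

    TP: bases predicted coding that are coding.
    FP: bases predicted coding that are not coding (over-prediction).
    FN: coding bases that the predictor missed (under-prediction).
    """
    true_pos = coding_positions(true_exons)
    pred_pos = coding_positions(predicted_exons)
    tp = len(true_pos & pred_pos)
    fp = len(pred_pos - true_pos)
    fn = len(true_pos - pred_pos)
    return {
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tp / (tp + fp) if tp + fp else 0.0,
    }
```

For example, with true exons at (0, 100) and (200, 300) and predicted exons at (50, 100) and (200, 350), 150 bases are true positives, 50 are false positives and 50 are false negatives.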

Gene finding: Issues
Issues regarding gene finding in general: Genome size (larger genome ~ more genes, but …). Genome composition. Genome complexity (more complexity -> less coding density; fewer genes per kb). cis-splicing (processing of mRNA in eukaryotes). trans-splicing (in kinetoplastids). Alternate splicing (e.g. in different tissues; higher organisms). Variation of the genetic code from the universal code. A larger genome theoretically has more genes, but gene number relates more to the complexity of the genome: the more complex the genome, the less dense the coding.

Gene finding: genome
Genome composition: long ORFs tend to be coding. More putative ORFs are present in GC-rich genomes (the stop codons UAA, UAG & UGA are AT-rich). Genome complexity: simple repetitive sequences (e.g. dinucleotide repeats) and dispersed repeats tend to be non-coding, so you may need to mask the sequence prior to gene prediction. Plasmodium falciparum is AT-rich, so a long GC-rich stretch is likely to be coding. Check the other reading frames too, and do not rely on length alone. Generally, when something is highly repetitive you tend to avoid it, but in highly repetitive genomes like malaria that does not hold.

Gene finding: coding density
As the coding/non-coding length ratio decreases, exon prediction becomes more complex. Figure: the same orthologue in human, Fugu, worm and E. coli. Introns are shorter in Fugu and worm; E. coli has no introns, so prediction is easier there. (Slide from Daniel Lawson.)

Gene finding: splicing
cis-splicing of genes: finding multiple (short) exons is harder than finding a single (long) exon. trans-splicing of genes: a trans-splice acceptor is no different to a normal splice acceptor. Eukaryotes are harder because of splicing. In trans-splicing, a small piece of RNA is attached to the 5’ end of the transcript (kinetoplastids and worms). There is no difference between a trans-splice acceptor and a normal 3’ splice acceptor, so be aware of this or you may have problems in the functional annotation. (Figure labels: worm, E. coli.)

Gene finding: alternate splicing
Alternate splicing (isoforms) is very difficult to predict. Figure: three isoforms (Human A, Human B, Human C) of one human gene, which are permutations of the same exons. You cannot tell from the genomic sequence whether exons are skipped; you can only find out by looking at the transcripts themselves.

What is ab initio gene prediction?
Prediction from first principles using the raw DNA sequence only. GATCGGTCGAGCGTAAGCTAGCTAG ATCGATGATCGATCGGCCATATATC ACTAGAGCTAGAATCGATAATCGAT CGATATAGCTATAGCTATAGCCTAT Requires ‘training sets’ of known gene structures to generate statistical tests for the likelihood of a prediction being real. Any ab initio gene predictor is only as good as its training set.

Gene finding: ab initio
What features of an ORF can we use? Size: large open reading frames. DNA composition: codon usage / 3rd position codon bias. Kozak sequence CCGCCAUGG. Ribosome binding sites. Termination signals (stops). Splice junction boundaries (acceptor/donor). You usually start with an arbitrary size cut-off threshold, for example 100 bp in yeast, but you may get a lot of putative genes that you will have to look at later. Size alone is not a good guideline, only a first step. You can start scoring initiation methionines based on the Kozak sequence. The third codon position can wobble. The initiation site for translation in eukaryotic mRNAs is usually the AUG codon nearest the 5’ end of the mRNA, but sometimes downstream AUG codons still close to the 5’ end may also be used (Kozak 1999).
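Scoring initiation methionines against the Kozak sequence might look like the sketch below. It simply counts how many of the six flanking bases agree with the CCGCCAUGG consensus quoted above; weighting every position equally is an assumption of the example, as are the function names.

```python
KOZAK = "CCGCCATGG"   # DNA form of the CCGCCAUGG consensus quoted above
ATG_OFFSET = 5        # index of the A of ATG inside KOZAK
FLANK = (0, 1, 2, 3, 4, 8)  # consensus positions outside the ATG itself


def kozak_matches(seq, atg_index):
    """Count flanking bases around an ATG that agree with the consensus.

    Returns 0 to 6, or None if the window runs off the sequence or the
    given index does not point at an ATG.
    """
    seq = seq.upper().replace("U", "T")
    start = atg_index - ATG_OFFSET
    if start < 0:
        return None
    window = seq[start:start + len(KOZAK)]
    if len(window) < len(KOZAK) or window[ATG_OFFSET:ATG_OFFSET + 3] != "ATG":
        return None
    return sum(window[i] == KOZAK[i] for i in FLANK)


def rank_start_codons(seq):
    """Return (atg_index, matches) pairs, best Kozak context first."""
    dna = seq.upper().replace("U", "T")
    scored = []
    i = dna.find("ATG")
    while i != -1:
        score = kozak_matches(dna, i)
        if score is not None:
            scored.append((i, score))
        i = dna.find("ATG", i + 1)
    # Sort by descending score, then by position (nearest the 5' end first).
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Ties are broken in favour of the ATG nearest the 5’ end, in line with the note that the first AUG is usually the one used.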

Gene finding: features
Think of a CDS gene prediction as a linear series of sequence features: Initiation codon; Coding sequence (exon); then N times: Splice donor (5’), Non-coding sequence (intron), Splice acceptor (3’), Coding sequence (exon); and finally the Termination codon.
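The linear series of features can be represented as a small data structure and sanity-checked. In the sketch below the class and method names are hypothetical, only the forward strand is handled, and the check that introns begin with GT and end with AG is the canonical splice-site rule rather than something stated on the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


@dataclass
class GeneModel:
    """Forward-strand CDS model; exons are (start, end), 0-based, end exclusive."""
    exons: List[Tuple[int, int]]

    def introns(self):
        """Introns are the gaps between consecutive exons."""
        return [(a_end, b_start)
                for (_, a_end), (b_start, _) in zip(self.exons, self.exons[1:])]

    def spliced_cds(self, genome):
        """Join the exon sequences, as splicing would."""
        return "".join(genome[s:e] for s, e in self.exons).upper()

    def problems(self, genome):
        """Return a list of ways the model violates the feature series."""
        genome = genome.upper()
        cds = self.spliced_cds(genome)
        issues = []
        if not cds.startswith("ATG"):
            issues.append("no initiation codon")
        if cds[-3:] not in STOP_CODONS:
            issues.append("no termination codon")
        if len(cds) % 3:
            issues.append("CDS length not a multiple of 3")
        internal = [cds[i:i + 3] for i in range(0, len(cds) - 3, 3)]
        if any(codon in STOP_CODONS for codon in internal):
            issues.append("in-frame stop inside CDS")
        for s, e in self.introns():
            # Canonical donor/acceptor dinucleotides (assumed GT...AG rule).
            if genome[s:s + 2] != "GT" or genome[e - 2:e] != "AG":
                issues.append(f"intron {s}-{e} lacks GT...AG")
        return issues
```

An empty list from problems() means the model has a start codon, a stop codon, an open frame in between and canonical intron boundaries; it does not mean the gene is real.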

A model ab initio predictor
Locate and score all sequence features used in gene models, then use dynamic programming to make the highest-scoring model from the available features, e.g. Genefinder (Green). Or run a 5’->3’ pass of the sequence through a Markov model based on a typical gene model, e.g. TBparse (Krogh), GENSCAN (Burge) or GLIMMER (Salzberg). Or run a 5’->3’ pass of the sequence through a neural net trained with confirmed gene models, e.g. GRAIL (Oak Ridge). (May want to skip this step.) In the simplest form: locate all the ATGs and stops and give each a score, then look for small or large open reading frames.

Ab initio Gene finding programs
Most gene finding software packages use some variant of Hidden Markov Models (HMMs). They predict coding, intergenic, and intron sequences. They need to be trained on a specific organism. Never perfect! You need to train for your own organism. How do you train? Basically, provide as many experimentally derived genes from the literature for that organism as you can, so the program has a sense of what the sequence patterns look like. We are not going into the detail of how HMMs work.

What is an HMM? A statistical model that represents a gene.
Similar to a “weight matrix” that can recognise gaps and treat them in a systematic way. Has different “states” that represent introns, exons, and intergenic regions. We will not go into much detail on what an HMM is.
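For readers who do want a feel for the idea of "states", here is a toy two-state HMM decoded with the Viterbi algorithm. Everything numeric in it is invented for illustration: the two states, the transition probabilities and the assumption that exons are GC-rich and intergenic sequence AT-rich. Real gene finders use many more states (introns, splice sites, frames) and trained parameters.

```python
import math

STATES = ("intergenic", "exon")
# All probabilities below are made-up illustrative values, not trained ones.
START = {"intergenic": 0.5, "exon": 0.5}
TRANS = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1},
    "exon": {"intergenic": 0.1, "exon": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most probable state for each base (A, C, G, T only)."""
    seq = seq.upper()
    if not seq:
        return []
    # score[i][s]: best log-probability of any path ending in state s at base i.
    score = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        prev = score[-1]
        column, pointers = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            column[s] = (prev[best] + math.log(TRANS[best][s])
                         + math.log(EMIT[s][base]))
            pointers[s] = best
        score.append(column)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

On a sequence with a GC-rich block between two AT-rich blocks, the decoded path labels the middle block "exon" and the flanks "intergenic". Training an HMM amounts to estimating the transition and emission tables from known genes instead of inventing them as done here.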

Malaria Gene Prediction Tool
Hexamer – ftp://ftp.sanger.ac.uk/pub/pathogens/software/hexamer/ Genefinder – GlimmerM – Phat – All are already trained for malaria. The more experimentally derived genes used for training a gene prediction tool, the more reliable the gene predictor.

GlimmerM Salzberg et al. (1999) Genomics 59 24-31
Adaptation of the prokaryotic gene finder Glimmer (Delcher et al. (1999) NAR). Based on an interpolated HMM (IHMM). Uses only short chains of bases (Markov chains) to generate probabilities. Trained identically to Phat, but GlimmerM uses different criteria than PHAT. You need a good training set: the more genes identified and used to train the gene prediction tool, the more reliable it generally becomes.

An end to ab initio prediction
ab initio gene prediction is inaccurate. Most predictors have high false positive rates but low false negative rates. Incorporating similarity information is meant to reduce the false positive rate, but at the same time it increases the false negative rate. The biggest determinant of false positives/negatives is gene size. Exon prediction sensitivity can be good. Rarely used as a final product: human annotation runs multiple algorithms and scores exons predicted by multiple predictors. Used as a starting point for refinement/verification; predictions need correction and validation. The false positive rate is high in human. So why work from raw sequence alone? Why not just build gene models by comparative means?

PAUSE (continue)

Annotation of eukaryotic genomes
Figure labels: Genomic DNA -> (transcription) -> unprocessed RNA -> (RNA processing) -> mature mRNA (Gm3 ... AAAAAAA) -> (translation) -> nascent polypeptide -> (folding) -> active enzyme -> function (Reactant A -> Product B); ab initio gene prediction (w/o prior knowledge), comparative gene prediction (use other biological data), functional identification. Comparative gene prediction uses the information we have from the biology: we have a mature mRNA or polypeptide and we want to map it back to the genome.

If a cell was human? The cell ‘knows’ how to splice a gene together.
We know some of these signals, but not all and not all of the time, so compare with known examples from the species and others. Central dogma of molecular biology: DNA (genome) -> RNA (transcriptome) -> protein (proteome).

When a human looks at a cell
Compare with the rest of the genome/transcriptome/proteome data. DNA: extract DNA and sequence the genome. RNA: extract RNA, reverse transcribe and sequence the cDNA. Protein: peptide sequence inferred from gene prediction.

Comparative gene prediction
Use knowledge of known coding sequences to identify regions of genomic DNA by similarity: transcriptome (transcribed DNA sequence), proteome (peptide sequence), genome (related genomic sequence).

Transcript-based prediction: datasets
Generation of large numbers of Expressed Sequence Tags (ESTs): quick and cheap but random; subtractive hybridisation to find rare transcripts; use multiple libraries for different life stages/conditions; single-pass sequence is prone to errors. Generation of a small number of full-length cDNA sequences: slow and laborious but focused. Large-scale sequencing of (presumed) full-length cDNAs: systematic, multiplexed cloning/sequencing of CDS; expensive and only viable as part of a bigger project.
Malaria has many life stages, so a transcript found by subtraction in the asexual stage may not be found in another stage; you need to make the library from the correct life stage. Subtractive hybridization to find rare transcripts is time consuming but works well. The genome sequence is assembled from many subclones with multiple reads, so there is higher confidence that it is correct, whereas single-pass ESTs contain insertion/deletion errors. The sequencing of full-length cDNAs can be multiplexed.

Gene Prediction in Eukaryotes – Simplified
For highly conserved proteins: translate the DNA sequence in all 6 reading frames and use BLASTX or FASTX to compare it to a protein sequence database. Or compare a protein against a nucleic acid database, including genomic sequence, that is translated in all six possible reading frames by the TBLASTN or TFASTX/TFASTY programs. Note: this gives an approximation of the gene structure only.
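The six-frame translation that BLASTX-style searches perform conceptually can be sketched as follows. The sketch assumes the standard (universal) genetic code; organisms with a variant code would need a different table. The function names and the "+1" to "-3" frame labels are choices made for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate_dna(seq):
    """Translate an upper-case DNA string in frame 0; unknown codons give X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return a dict mapping '+1'..'+3' and '-1'..'-3' to protein strings."""
    seq = seq.upper()
    rc = seq.translate(COMPLEMENT)[::-1]  # reverse complement strand
    frames = {}
    for k in range(3):
        frames[f"+{k + 1}"] = translate_dna(seq[k:])
        frames[f"-{k + 1}"] = translate_dna(rc[k:])
    return frames
```

Each of the six protein strings would then be compared against a protein database; stop codons appear as "*" and interrupt the match, which is one reason the result only approximates the gene structure when introns are present.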

Transcript-based prediction: How it works
Align transcript data to genomic sequence using a pair-wise sequence comparison. Figure: a gene model with 5’ and 3’ EST sequences and a cDNA aligned to it. If you have enough ESTs you get complete coverage. OSTs (open reading frame sequence tags) require designing oligos to the sequence and are rarely used.
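A very reduced sketch of mapping a transcript back to the genome is shown below: it finds exact-match blocks of a spliced cDNA along the genomic sequence and reports them as exons. Exact matching, the min_block seed length and the function name are all assumptions of the example; real tools tolerate the mismatches and indels found in single-pass ESTs.

```python
def map_transcript(genome, cdna, min_block=8):
    """Greedily place a spliced transcript on the forward strand of a genome.

    Returns a list of (start, end) genomic blocks (0-based, end exclusive)
    that together spell out the cDNA, or None if a seed cannot be found.
    Only exact matches are considered in this sketch.
    """
    genome, cdna = genome.upper(), cdna.upper()
    blocks = []
    g_pos = 0  # genomic position after the previous block
    c_pos = 0  # number of cDNA bases already placed
    while c_pos < len(cdna):
        seed = cdna[c_pos:c_pos + min_block]
        hit = genome.find(seed, g_pos)
        if hit == -1:
            return None
        # Extend the match base by base until the sequences disagree.
        length = 0
        while (c_pos + length < len(cdna)
               and hit + length < len(genome)
               and genome[hit + length] == cdna[c_pos + length]):
            length += 1
        blocks.append((hit, hit + length))
        c_pos += length
        g_pos = hit + length
    return blocks
```

The gaps between consecutive blocks are the inferred introns. Because extension continues for as long as bases happen to agree, a block can overrun the true exon boundary by a base or two when the intron start resembles the next exon, so block edges from a sketch like this would still need checking against splice signals.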

Transcript-based gene prediction: algorithm
BLAST (Altschul) (36 hours): widely used and understood; HSPs often have ‘ragged’ ends, so matches extend past the exon boundary into the introns. EST_GENOME (Mott) (3 days): dynamic programming post-process of BLAST; slow and sometimes cryptic. BLAT (Kent) (1/2 hour): next generation of alignment algorithm; designed for looking at nearly identical sequences; faster and more accurate than BLAST for this task.
The idea is to map transcript (EST) data back to the genome, e.g. using BLAST. BLAST extends matches into the introns, so you cannot use the raw HSPs to build gene models without validating them. EST_GENOME (Richard Mott) combines BLAST with dynamic programming to refine the edges of the HSPs; it knows which are the 5’ reads. BLAT (Jim Kent) is more accurate than BLAST for ESTs because it is designed to look for essentially 100% matches between two sequences; it doesn’t replace BLAST.

Peptide-based gene prediction: algorithm
BLAST (Altschul): widely used and understood. Smith-Waterman: preliminary to further processing. Used in preference to DNA-based similarities for evolutionarily diverged species, as peptide conservation is significantly higher than nucleotide conservation.

Genomic-based gene prediction: algorithm
BLAST (Altschul): can be used in TBLASTX mode. BLAT (Kent): can be used in a translated DNA vs translated DNA mode; significantly faster than BLAST. WABA (Kent): designed to allow for 3rd position codon wobble; slow, with some outstanding problems; only really used in the C. elegans v C. briggsae analysis.

Comparative gene predictors
This can be viewed as an extension of the ab initio prediction tools, where coding exons are defined by similarities and not codon bias. GAZE (Howe) is an extension of Phil Green’s Genefinder in which transcript data is used to define coding exons. Other features are scored as in the original Genefinder implementation. This is being evaluated and used in the C. elegans project. GENEWISE (Birney) is an HMM-based gene predictor which attempts to predict the closest CDS to a supplied peptide sequence. This is the workhorse predictor for the ENSEMBL project.

Comparative gene predictors
A new generation of comparative gene prediction tools is being developed to utilise the large amount of genomic sequence available. Twinscan (WashU) attempts to predict genes using related genomic sequences. Doublescan (Sanger) is an HMM-based gene predictor which attempts to predict two orthologous CDSs from genomic regions pre-defined as matching. Both of these predictors are in development and will be used for the C. elegans v C. briggsae match and the Mouse v Human match later this year.

Summary
Genes are complex structures which are difficult to predict with the required level of accuracy/confidence. We can predict stops better than starts. We can only give gross confidence levels to predictions (i.e. confirmed, partially confirmed or predicted). Gene prediction is only part of the annotation procedure. There is a movement from ab initio to comparative methodology as sequence data becomes available/affordable. Curation of gene models is an active process: the set of gene models for a genome is fluid and WILL change over time.

The Annotation Process
Figure: DNA SEQUENCE -> ANALYSIS SOFTWARE -> Annotator -> Useful Information. The more analysis software you run, the more information is churned out, but we are only interested in the useful information, and the only filter is the (human) annotator. There are automated annotation packages, but they do not yet provide the quality of a trained annotator, so human is better than machine for now.

Annotation Process
Figure labels: DNA sequence; RepeatMasker, Blastn, Halfwise, Blastx, gene finders, tRNA scan; repeats, promoters, pseudogenes, rRNA, genes, tRNA; Fasta, BlastP, Pfam, Prosite, Psort, SignalP, TMHMM. This is an overview of the annotation system at the Sanger Centre: the software used and how it is used. A range of software packages is used to analyze sequences: gene finding tools to identify genes and in some cases pseudogenes, and Blastx to look at conserved protein regions. Once you have a set of genes, further software is run to assign specific functions to the genes.

Artemis
Artemis is a free DNA sequence viewer and annotation tool that allows visualization of sequence features and the results of analyses within the context of the sequence and its six-frame translation. Highly portable, with some support; it can be run off a laptop, for example.

atcttttacttttttcatcatctatacaaaaaatcatagaatattcatcatgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt
tttaattaattcacattttatatctttaagtataatatcatttaacattatgttatcttcctcagtgtttttcattattatttgcatgtacagtttatca tttttatgtaccaaactatatcttatattaaatggatctctacttataaagttaaaatctttttttaattttttcttttcacttccaattttatattccg cagtacatcgaattctaaaaaaaaaaataaataatatataatatataataaataatatataataaataatatataatatataataaataatatataatat ataatatataataaataatatataatatataatatataataaataatatataataaataatatataatatataatatataatactttggaaagattattt atatgaatatatacacctttaataggatacacacatcatatttatatatatacatataaatattccataaatatttatacaacctcaaataaaataaaca tacatatatatatataaatatatacatatatgtatcattacgtaaaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggtattagg agatatatttactgattcctcatttttataaatgttaaaattattatccctagtccaaatatccacatttattaaattcacttgaatattgttttttaaa ttgctagatatattaatttgagatttaaaattctgacctatataaacctttcgagaatttataggtagacttaaacttatttcatttgataaactaatat tatcatttatgtccttatcaaaatttattttctccatttcagttattttaaacatattccaaatattgttattaaacaagggcggacttaaacgaagtaa ttcaatcttaactccctccttcacttcactcattttatatattccttaatttttactatgtttattaaattaacatatatataaacaaatatgtcactaa taatatatatatatatatatatatatatatatattataaatgttttactctattttcacatcttgtccttttttttttaaaaatcccaattcttattcat taaataataatgtattttttttttttttttttttttttattaattattatgttactgttttattatatacactcttaatcatatatatatatttatatat atatatatatatatatatatattattcccttttcatgttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatatttttataacatatgt attattaaaatgtatatataaaaatatatattccatttattattatttttttatatacattgttataagagtatcttctcccttctggtttatattacta ccatttcactttgaacttttcataaaaattaatagaatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaata tatatatatatatatatacatataatatatatttcatctaatcatttaaaattattattatatattttttaaaaaatatatttatgataacataaaaaga atttaattttaattaaatatatataattacatacatctaatattattatatatatataataagttttccaaatagaatacttatatattatatatatata tatatatatatatattcttccataaaaagaataaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacattgaatatatagttgtattt ataaaattaaagaaaaagcataaagttaccatttaatagtggagattagtaacattttcttcattatcaaaaatatttatttcctaattttttttttttg 
taaaatatatttaaaaatgtaatagattatgtattaaataatataaatatagcaaaatgttcaattttagaaatttgcctctttttgacaaggataattc aaaagatacaggtaaaaaaaaaaaaataaagtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatgacatgttataatataatataa taaataaaaattatgtaatatatcataatcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaatacatatataactaacattcata tctttatttttgtagatgatataaaaaattttataaactcttatgaagggatatatttttcatcatccaataaatttataaatgtatttctagacaaaat tctgatcattgatccgtcttccttaaatgttattacaataaatacagatctgtatgtagttgatttcctttttaatgagaaaaataagaatcttattgtt ttagggtaatgaaatatatatagatttatatttttatttatttattatatattattttttaatttttcttttatatatttattttatttagtgtataaaa tgatatcctttatatttatatttacatgggatattcaaataataacaaaaatgagtatacacatatatatatatatatatatatatatgtatattttttt tttttttttatgttcctataggaaagggaagaattcactgatttgtagtgtttacaatattagggaatgcaactttacacttttgaaaaaaattcagtta agcaaaaatattaataacattaaaaagacactgatagcaaaatgtaatgaatatataataacattagaaaataagaaaattactttttatttcttaaata aagattatagtataaatcaaagtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtcaaaaaatcatatcttgttagtaataaaaaa ttcatatgtatatatataccaattagatattaaaaattcccatattagttatacacttattgatagtttcaatttaaatttatcctacctcagagaatct ataaataataaaaaaaagcatataaataaaataaatgatgtatcaaataatgacccaaaaaaggataataatgaaaaaaatacttcatctaataatataa cacataacaattataatgacatatcaaataataataataataataataatattaatggggtgaaagaccatataaataataacactctggaaaataatga tgaaccaatcttatctatatataatgaagatcttaatgttttatatatatgccaaaatatgtataacgtcctttttgttttgaatttaaataacctaagt You turn DNA sequences into something more meaningful

DNA in Artemis
Figure labels: GC content; forward translations; DNA and amino acids; reverse translations. Black bar = stop codon. This is what it looks like in Artemis: the black bars are stop codons. Notice that GC content increases where you have genes.
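The GC-content track can be approximated with a sliding window, as in the sketch below. The window and step sizes and the function names are arbitrary choices for illustration and are not Artemis defaults.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window=120, step=60):
    """Return (window_start, gc_fraction) pairs along the sequence.

    A sequence shorter than the window yields a single, shorter window.
    """
    last_start = max(len(seq) - window, 0)
    return [(i, gc_content(seq[i:i + window]))
            for i in range(0, last_start + 1, step)]
```

In an AT-rich genome such as Plasmodium falciparum, windows whose GC fraction stands well above the genome average are candidates for coding regions, which is the pattern the GC plot makes visible.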

Extra Slides

Gene prediction
What is gene prediction? Why is gene prediction important? Ab initio gene prediction (w/o prior knowledge). Comparative gene prediction (use other biological data). Summary.

Genome annotation is central to functional genomics
Foundation for further investigation: ORFeome-based functional genomics, RNAi phenotypes, gene knockout, expression microarray. (Daniel Lawson slide; very C. elegans-centric.)

Gene finding Artemis genome viewer
Coding sequence vs non-coding sequence. Evidence shown: gene finding software, homology between species, ESTs.

ACEDB output: if you do ab initio gene prediction, you have to mark up everything of interest; yellow bar is the

Pretty Handy Annotation Tool (PHAT)
Based on a generalised hidden Markov model (GHMM). Free, easily installed and run. Good at predicting multi-exon genes, but will in some cases miss genes altogether and will over-predict. Cawley et al. (2001) Mol. Biochem. Parasitol. 118 p167. (The PHAT website no longer works.) Comes ready trained for your organism.

Phat http://linkage.rockefeller.edu/wli/gene/krogh98.pdf
This is an example of the HMM used in PHAT.

GlimmerM
Under-predicts splicing. Hardly ever misses a gene completely. Does over-predict. Free with TIGR license.

Comparison Of Gene Finders
Shows the output of different gene finders side by side in Artemis. Gene models are needed.
