An Analysis of “Gene Finding in Novel Genomes” Michael Sneddon
Basic Reference Information: “Gene Finding in Novel Genomes”, written by Ian Korf. BMC Bioinformatics. Published in May of
Purpose of Gene Finding: Given a genome, we would like to predict which regions code for proteins and which do not. This is important because we can then focus research on the regions that actually code for something. It can also point us to places in the genome to look for unknown genes.
Gene Finding Techniques: Gene finding is very difficult to do accurately. Current methods employ Hidden Markov Models (HMMs) to discover genes. The HMM learns to recognize patterns by being trained on annotated data, where we already know which regions are genes and which are not.
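To make the training step concrete, here is a minimal sketch of supervised HMM parameter estimation by counting. It is not taken from Korf's paper: the two-state labelling (`C` for coding, `N` for non-coding), the toy sequence, the function name `train_hmm`, and the pseudocount are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def train_hmm(sequences, labels, pseudocount=1.0):
    """Estimate HMM transition and emission probabilities by counting.

    sequences: list of DNA strings.
    labels: list of equal-length strings, one state letter per base.
    The pseudocount (an assumed smoothing choice) avoids zero probabilities.
    """
    states = sorted({s for lab in labels for s in lab})
    symbols = sorted({c for seq in sequences for c in seq})
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)

    for seq, lab in zip(sequences, labels):
        for i, (base, state) in enumerate(zip(seq, lab)):
            emit[state][base] += 1              # base observed in this state
            if i + 1 < len(lab):
                trans[state][lab[i + 1]] += 1   # state-to-state transition

    def normalize(counts, keys):
        total = sum(counts[k] for k in keys) + pseudocount * len(keys)
        return {k: (counts[k] + pseudocount) / total for k in keys}

    trans_p = {s: normalize(trans[s], states) for s in states}
    emit_p = {s: normalize(emit[s], symbols) for s in states}
    return trans_p, emit_p


# Hypothetical annotated training example: C = coding, N = non-coding.
toy_seqs = ["ATATGCGCGCATAT"]
toy_labels = ["NNNCCCCCCCNNNN"]
trans_p, emit_p = train_hmm(toy_seqs, toy_labels)
```

With real data, the annotated sequences would come from a curated genome, and it is exactly this annotated material that is scarce for newly sequenced genomes.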
Gene Finding in New Genomes: The problem is that genomes are being sequenced faster than they can be studied, so we lack the training sets needed to build good HMMs. Currently, the best way to find genes in a new genome is to run a program designed for a different genome and hope it gives a good approximation.
SNAP – Korf’s Approach: Korf argues that the current approach does not provide a good approximation for finding genes in new genomes. He designed SNAP, which runs several other gene finding programs and estimates parameters based on their results. SNAP also uses a Hidden Markov Model.
SNAP HMM State Diagram: E = exon state, I = intron state, N = intergenic state.
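As a rough illustration of how a three-state model like this labels a sequence, the sketch below runs Viterbi decoding over the states N, E, and I. Every probability is invented for the example and none comes from SNAP; the only structural assumption is that intergenic and intron states connect only through an exon. Real gene finders use much richer models than this toy.

```python
import math

STATES = ["N", "E", "I"]

# Hypothetical parameters, chosen only for illustration.
START = {"N": 0.8, "E": 0.1, "I": 0.1}
TRANS = {
    "N": {"N": 0.9, "E": 0.1, "I": 0.0},   # intergenic never jumps to intron
    "E": {"N": 0.1, "E": 0.8, "I": 0.1},
    "I": {"N": 0.0, "E": 0.1, "I": 0.9},   # intron never jumps to intergenic
}
EMIT = {
    "N": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "E": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "I": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}


def log(p):
    """Log probability, mapping zero to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    score = [{s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        prev = score[-1]
        row, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for landing in s at this position.
            best = max(STATES, key=lambda p: prev[p] + log(TRANS[p][s]))
            row[s] = prev[best] + log(TRANS[best][s]) + log(EMIT[s][base])
            ptr[s] = best
        score.append(row)
        back.append(ptr)

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


toy_dna = "ATATATGCGCGCGCGCATATATAT"
toy_path = viterbi(toy_dna)
```

The returned string carries one state letter per base, so runs of `E` mark predicted exons. Working in log space keeps long sequences from underflowing.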
Methods of Testing: Korf used genomes from A. thaliana, O. sativa, C. elegans, and D. melanogaster, all relatively simple genomes. He compared SNAP to other leading gene finding software, including Genscan, Genefinder, HMMGene, and Augustus, and measured how well each program performed.
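The slides do not say which measures were used for the comparison. One common way to score a gene finder is nucleotide-level sensitivity and specificity, sketched below as an assumption. Here "specificity" follows the usual gene-finding convention of TP / (TP + FP). The function name, the `E` label for coding bases, and the toy annotations are hypothetical.

```python
def nucleotide_accuracy(true_labels, predicted_labels, coding="E"):
    """Return (sensitivity, specificity) at the nucleotide level.

    Both arguments are equal-length strings with one state letter per base.
    sensitivity = TP / (TP + FN): fraction of true coding bases predicted.
    specificity = TP / (TP + FP): fraction of predicted coding bases correct.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label strings must have the same length")

    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == coding and pred == coding:
            tp += 1
        elif pred == coding:
            fp += 1
        elif truth == coding:
            fn += 1

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, specificity


# Hypothetical annotation versus prediction for an 8-base region.
sn, sp = nucleotide_accuracy("NNEEEENN", "NEEEENNN")
```

In this toy case the prediction is shifted one base to the left, so it gains one false positive and one false negative against three true positives.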
Data Used in Testing
Table 1. Data set characteristics. At: Arabidopsis thaliana; Ce: Caenorhabditis elegans; Dm: Drosophila melanogaster; Os: Oryza sativa.
Genome   Sequence   Genes   GC    Single-exon genes   Mean exon   Mean intron
At       1.89 Mb    --      --%   19.8%               230 bp      157 bp
Ce       3.02 Mb    --      --%   2.2%                220 bp      334 bp
Dm       3.66 Mb    --      --%   24.9%               394 bp      948 bp
Os       1.55 Mb    --      --%   22.9%               237 bp      350 bp
Performance of SNAP
Parameters taken from other species
Analysis of parameters that his program used and demonstration of how they would be better suited for new genomes
Next Steps: Since Korf used relatively simple genomes, the next step is to analyze larger genomes to see whether the results are similar. Gene finding is still very difficult, and additional research is needed on how to better estimate HMM parameters.
My Opinions: The results were very clear and organized. The program is available free online. The paper needed a better explanation of how the program took results from other programs and used that information. The program needs better documentation so that more people can use it and specialize it for specific genomes.