Identification of Coding Sequences
Bert Gold, Ph.D., F.A.C.M.G.
In Vitro Approaches
Transcription
Translation
Linked
Site Specific Mutagenesis
Promoter Fusions
Runoff Protocol and Controls
Ribosome Binding Sequences
In vitro translation
96-well (high-throughput) translation
Linked In Vitro Transcription-Translation
APC Protein Truncation Test
In Vivo Approaches
Prokaryotic expression
–E. coli maxicells
–E. coli minicells
Eukaryotic expression
–Yeast Overexpression
–Baculovirus Expression for Rapid Analysis
–X. laevis oocytes
Expression in Mammalian Cells
–Transient Transfection
–Stable Transfection
–ES Cells
–Transgenic Mice
Knock in
Knock out
Background Definitions
Working Draft – A working draft sequence is a genomic sequence that has not yet been finished. Working draft sequences contain multiple gaps, underrepresented areas, and misassemblies. In addition, their error rate is higher than the 1 in 10,000 error rate that is standard for finished sequences.
FASTA file – A common file format used for the storage and transfer of sequence data. Each record is a single header line beginning with ‘>’ followed by raw DNA or protein sequence; it carries no feature annotation.
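The FASTA layout described above can be illustrated with a minimal reader. This is only a sketch: the function name parse_fasta and the two example records are invented for illustration, and real files are usually better handled with an established library.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' line starts a new record; the rest of the line is its header.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence may span several lines; join them and normalize case.
            records[header] += line.upper()
    return records


# Hypothetical example data, not taken from any real database entry.
example = """>seq1 hypothetical example record
ATGGCC
TTAA
>seq2
ggctaa
"""
records = parse_fasta(example.splitlines())
```

Note that the parser recovers only header text and residues, which matches the point that the format itself stores no annotation.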
Sensor – An algorithm specialized to identify a single feature of a sequence, such as a possible splice site.
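As a rough illustration of what a sensor does, the sketch below flags candidate donor splice sites by looking for the GT dinucleotide with which nearly all introns begin. The function name is hypothetical, and practical sensors score the surrounding sequence context rather than reporting every GT.

```python
from typing import List


def donor_site_candidates(seq: str) -> List[int]:
    """Return 0-based positions where a GT dinucleotide starts."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]


# Toy sequence, invented for illustration.
print(donor_site_candidates("AAGGTAAGT"))  # [3, 7]
```

Almost every hit from such a naive sensor is a false positive, which is why single-feature predictors are normally combined with other evidence.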
Neural Network – Neural networks are analytical techniques modeled after the (proposed) processes of learning in cognitive systems and the neurological functions of the brain. Neural networks use a data ‘training set’ to build rules that can make predictions or classifications on data sets.
Rule-Based System – A type of computer algorithm that uses an explicit set of rules to make decisions.
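A small sketch of the rule-based idea: each test below is an explicit, hand-written rule, and a sequence is accepted only if it passes all of them. The particular rules (ATG start, in-frame stop codon, no internal stop) and the min_codons parameter are assumptions chosen for illustration, not the rules of any particular program.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def looks_like_orf(seq: str, min_codons: int = 3) -> bool:
    """Apply a fixed list of rules to decide whether seq is a complete ORF."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        return False  # rule 1: length must be a whole number of codons
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if len(codons) < min_codons:
        return False  # rule 2: must reach a minimum length
    if codons[0] != "ATG":
        return False  # rule 3: must begin with the start codon
    if codons[-1] not in STOP_CODONS:
        return False  # rule 4: must end with a stop codon
    if any(c in STOP_CODONS for c in codons[1:-1]):
        return False  # rule 5: no stop codon before the end
    return True


print(looks_like_orf("ATGGCCTAA"))     # True
print(looks_like_orf("ATGTAAGCCTAA"))  # False (internal stop)
```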
Hidden Markov Model – A type of probabilistic model that represents a system as a set of discrete states and transitions between those states. Each transition has an associated probability. A Markov model is ‘hidden’ when one or more of the states cannot be directly observed and must instead be inferred from the observed output.
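To make the definition concrete, here is a minimal two-state sketch in which the hidden states are "coding" and "noncoding" and the observed output is the base sequence. All probabilities are made-up illustrative values, not parameters of any published gene finder. The Viterbi algorithm is used to recover the most probable hidden state path.

```python
import math
from typing import Dict, List

# Illustrative, invented parameters: coding regions are assumed GC-rich.
STATES = ["coding", "noncoding"]
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq: str) -> List[str]:
    """Return the most probable hidden state for each base of seq."""
    if not seq:
        return []
    # score[s] = best log-probability of any state path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back: List[Dict[str, str]] = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor for state s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


path = viterbi("GCGCGCGCGCATATATATAT")
```

With these toy parameters the decoded path labels the GC-rich half as coding and the AT-rich half as noncoding, even though the states themselves are never observed directly.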
Ab Initio Gene Prediction – A class of software that attempts to predict genes from sequence data without the use of prior knowledge about similarities to other genes.
In Silico Approaches
Sensors
–Single Feature Predictors: HEXON, MZEF
Neural Networks
–GRAIL
Rule Based Systems
–GeneFinder under construction by
Hidden Markov Models
–GenScan
–Genie
–Fgenes
–GeneMark.hmm
–HMMGene
Ab Initio Methods Comparative Genomics dbEST BLASTX TAP and PASS
Evaluation of In Silico Approaches
Scheme for an Ab Initio Approach
Diagrammatic Evaluation of an In Silico Approach
Hidden Markov Model
A hidden Markov model explicitly models the probabilities for the transition from one part of a gene to another. In this model, used by the GENSCAN algorithm, each circle or diamond represents a functional unit in the gene. For example, Einit is the initial exon and Eterm is the last. Each arrow represents a transition from one part of a gene to another, with an associated probability. The algorithm is ‘trained’ by running a set of known genes through the model and adjusting the weight of each transition to reflect realistic transition probabilities. Thereafter, test sequence data can be run through the model one base position at a time, and the model will read out the probability of a gene being present at that position. The states below the dashed line correspond to a gene on the reverse strand, and thus are symmetric with those above the line. Abbreviations: E, exon; I, intron; UTR, untranslated region; pro, promoter.
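The training step can be sketched in its simplest form as counting. Assuming each known gene has been reduced to a per-base string of state labels (E for exon, I for intron), the transition probabilities are estimated as normalized transition counts. This is a deliberately simplified illustration with invented training paths and a hypothetical function name; it is not GENSCAN's actual training procedure, which uses a richer model.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def estimate_transitions(paths: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next state | current state) from labeled state paths."""
    counts = defaultdict(Counter)
    for path in paths:
        # Count every adjacent pair of labels as one observed transition.
        for cur, nxt in zip(path, path[1:]):
            counts[cur][nxt] += 1
    probs = {}
    for state, row in counts.items():
        total = sum(row.values())
        probs[state] = {nxt: n / total for nxt, n in row.items()}
    return probs


# Invented per-base labelings of two "known genes" (E = exon, I = intron).
training_paths = ["EEEEIIEE", "EEIIIIEE"]
probs = estimate_transitions(training_paths)
print(probs["E"])  # {'E': 0.75, 'I': 0.25}
```

Each row of the result sums to 1, so it can serve directly as the set of outgoing arrow weights for one state in the diagram.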
Evaluating Ab Initio Gene Predictions