What do you do with a whole genome sequence?
Translate it in all 6 reading frames.
Identify all of the stop codons.
Identify the start codons.
You can then identify all Open Reading Frames (ORFs). But are they all real genes?
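The steps above can be sketched in a few lines of Python. This is only an illustration, not how any particular gene caller works: the function names, the `min_codons` cutoff, and the choice of ATG, GTG, and TTG as starts are assumptions made for the example. It scans codons directly rather than translating to protein, keeps the first start codon in each frame (the longest ORF), and ignores ORFs that run off the end of the sequence.

```python
# Sketch: enumerate ORFs in all six reading frames of a DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumption: common prokaryotic start codons


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return a list of (strand, frame, start, end, orf) tuples.

    start and end are 0-based offsets on the strand that was scanned
    (for "-" they refer to the reverse complement); the ORF string
    includes its stop codon. min_codons is a hypothetical length cutoff.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first start codon seen, if any
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATAGCC"):
        print(orf)
```

Every ORF found this way is only a candidate; deciding which candidates are real genes is the job of the gene modelers.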
Three major prokaryotic gene modelers:
Generation: uses predominantly 6-mer statistics to recognize coding regions; it uses a proximity rule-based start call with ATG and GTG as potential starts.
Glimmer: uses interpolated Markov models (IMMs) to identify the coding regions; it uses ATG, GTG, and TTG as potential starts.
Critica: uses blastn to produce alignments from the entire dataset and derives dicodon statistics to recognize coding sequences; it uses a Shine-Dalgarno (SD) sensor with ATG, GTG, and TTG as potential starts.
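To give a feel for what "6-mer statistics" means, here is a generic log-odds sketch. It is not the algorithm used by Generation, Glimmer, or Critica; the function names, the pseudocount, and the toy training sequences are all assumptions made for the example. The idea is simply that 6-mers seen more often in known coding sequence than in background sequence push the score up.

```python
import math
from collections import Counter


def kmer_counts(seqs, k=6):
    """Count every overlapping k-mer in a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def log_odds_score(seq, coding_counts, background_counts, k=6, pseudocount=1.0):
    """Sum of log(P(kmer | coding) / P(kmer | background)) over seq.

    The pseudocount (an assumption) keeps unseen k-mers from giving
    a zero probability.
    """
    n_kmers = 4 ** k
    coding_total = sum(coding_counts.values()) + pseudocount * n_kmers
    background_total = sum(background_counts.values()) + pseudocount * n_kmers
    score = 0.0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        p_coding = (coding_counts[kmer] + pseudocount) / coding_total
        p_background = (background_counts[kmer] + pseudocount) / background_total
        score += math.log(p_coding / p_background)
    return score


if __name__ == "__main__":
    # Toy training data, invented for illustration only.
    coding = kmer_counts(["ATGAAAATGAAA"])
    background = kmer_counts(["CCCCCCCCCCCC"])
    print(log_odds_score("ATGAAA", coding, background))  # positive
    print(log_odds_score("CCCCCC", coding, background))  # negative
```

A positive score means the sequence looks more like the coding training set than the background; real gene modelers train on far more data and take reading frame into account.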
Now what? BLAST the genes to assign functions based on similarity to known genes.
BLAST: Basic Local Alignment Search Tool
BLAST finds regions of local similarity between sequences.
>my favorite gene
Atgtcgctagctagctsctagctag
Database of many gene sequences (GenBank is one example)
Answers the questions: Is there a match? And how good is it?
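BLAST itself is a fast heuristic, but the quantity it approximates, the best local alignment score between two sequences, can be computed exactly with the Smith-Waterman algorithm. The sketch below is that exact method, not BLAST; the function name and the match, mismatch, and gap values are assumptions chosen for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming table, so it returns the score but not the
    alignment itself. The scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("GGACGTGG", "ACGT"))  # shared region ACGT
    print(smith_waterman_score("AAAA", "TTTT"))      # nothing in common
```

A score of zero answers "is there a match?" with no; a higher score answers "how good is it?". BLAST additionally reports statistics such as E-values, which this sketch does not attempt.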
What are the genes doing?
Function is assigned based on the degree of similarity to an already characterized gene in the database.
2 potential problems with this approach
Transitive catastrophe:
Gene A: function assigned based on mutant phenotype or biochemical characterization of the protein product
Gene B: from genome sequence, 70% identity to gene A
Gene C: from genome sequence, 60% identity to gene B
Gene D: from genome sequence, 70% identity to gene C
But gene D has only 20% identity to gene A!
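The chain of annotations above can be replayed in code. The percent identities are the ones from the slide; the 50% cutoff and the function names are assumptions made for the example. Propagating along the chain labels every gene, while comparing each gene directly with the characterized gene A labels only B.

```python
# Pairwise percent identities from the slide example.
IDENTITY = {
    ("A", "B"): 70,
    ("B", "C"): 60,
    ("C", "D"): 70,
    ("A", "D"): 20,
}


def identity(x, y):
    """Look up percent identity in either order; None if not measured."""
    return IDENTITY.get((x, y), IDENTITY.get((y, x)))


def propagate_along_chain(chain, threshold):
    """Annotate each gene from its predecessor in the chain."""
    annotated = [chain[0]]
    for prev, gene in zip(chain, chain[1:]):
        pid = identity(prev, gene)
        if prev in annotated and pid is not None and pid >= threshold:
            annotated.append(gene)
    return annotated


def propagate_from_source(source, candidates, threshold):
    """Annotate only genes that are similar to the characterized gene itself."""
    annotated = [source]
    for gene in candidates:
        pid = identity(source, gene)
        if pid is not None and pid >= threshold:
            annotated.append(gene)
    return annotated


if __name__ == "__main__":
    # threshold=50 is an arbitrary cutoff chosen for illustration.
    print(propagate_along_chain(["A", "B", "C", "D"], threshold=50))
    print(propagate_from_source("A", ["B", "C", "D"], threshold=50))
```

Gene C is skipped in the second call only because the slide gives no A-to-C identity; in practice every candidate would be compared against the experimentally characterized gene.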
Would like to propagate function only to orthologous genes
Homologs: genes sharing a common origin
Note: two genes are either homologs or they are not; there is no such thing as % homology or “more homologous”
Two main kinds of homologs:
Orthologs: genes originating from a single ancestral gene in the last common ancestor of the compared genomes
Paralogs: genes related via duplication
X,Y,Z are genes in the same family A, B, C are three species
Two more complicated cases:
Xenologs: genes originating from the horizontal gene transfer (HGT) of an ortholog from a distant lineage
Pseudoparalogs: homologous genes that appear to be paralogs in a single-genome analysis but have arisen through a combination of vertical and lateral descent
How to identify orthologs
One way: reciprocal BLAST analysis
>Genome A gene 1
AGTGCATGTCCC
>Genome A gene 2
TGTGCGTAGTCCAAA
Database: Genome B
AND
>Genome B gene 1
GGTTTTTACA
>Genome B gene 2
AAACCTCTCTGA
Database: Genome A
ASK: are the two genes each other’s best BLAST hit?
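Once both searches have been run, the "each other's best hit" test is a small bookkeeping step. The sketch below assumes the BLAST results have already been reduced to (query, subject, score) tuples; the function names, gene labels, and scores are invented for the example, and no BLAST program is called.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits is an iterable of (query, subject, score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_in_A, gene_in_B) pairs that are mutual best hits."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    a_vs_b = [("A1", "B1", 200), ("A1", "B2", 50), ("A2", "B1", 120)]
    b_vs_a = [("B1", "A1", 195), ("B1", "A2", 110), ("B2", "A1", 45)]
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here A2's best hit is B1, but B1's best hit is A1, so only the A1/B1 pair is reported as a candidate ortholog pair.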
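This can be confounded by lineage-specific gene loss: if the true ortholog has been lost from one genome, the reciprocal best hit may be a paralog instead.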
What if there is nothing at all similar in the database?
Call it a “hypothetical” gene.
What if it has a match, but only to another hypothetical gene?
Call it “conserved hypothetical”.
[Figure: pie chart of genes by functional category: DNA Replication & Repair, Energy Metabolism, Nucleotide Metabolism, Lipid Metabolism, Transcription, Amino Acid Metabolism, Translation, Carbohydrate Metabolism, Transport, Cofactor Metabolism, Unassigned, Conserved Hypothetical, Hypothetical]
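The naming rule above amounts to a three-way decision, sketched here. The function name and the idea of testing the best hit's description for the word "hypothetical" are assumptions made for the example; real annotation pipelines use more elaborate criteria.

```python
def annotate(best_hit_description):
    """Name a gene from the description of its best database hit.

    best_hit_description is None when nothing similar was found.
    """
    if best_hit_description is None:
        return "hypothetical"
    if "hypothetical" in best_hit_description.lower():
        # The only match is itself an uncharacterized gene.
        return "conserved hypothetical"
    return best_hit_description


if __name__ == "__main__":
    print(annotate(None))
    print(annotate("hypothetical protein"))
    print(annotate("DNA polymerase III subunit alpha"))
```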