Survey of Misannotations and Pseudogenes in the Arabidopsis Genome
Tanmay Prakash
Objectives
Find possible misannotations. Find possible pseudogenes.
Why: Misannotation can hinder research. Pseudogenes can be used to study natural selection.
Misannotations
[Figure: gene model with CDS, intron, and UTR regions labeled]
Many misannotations arise when a gene prediction program labels a stretch of sequence as an intron because it contains a stop codon.
Pseudogenes
Pseudogenes are DNA sequences that no longer function but resemble the functional genes they once were. There are two types: processed and non-processed.
Processed: formed by retrotransposition; these make up most of the pseudogenes in mammals.
Non-processed: products of the duplication of all or part of a segment of genes, followed by mutations. Because polyploidization (gaining additional sets of chromosomes) is common in plants, the majority of pseudogenes in plants are non-processed.
Common properties of pseudogenes: stop codons, frameshift mutations, and lack of selective pressure.
Lack of selective pressure: measured using Ka/Ks, where Ka is the nonsynonymous substitution rate and Ks is the synonymous substitution rate. Functional genes accumulate mostly synonymous changes, so their Ka/Ks is significantly less than one. Pseudogenes are not constrained by selection, so their Ka/Ks is closer to one.
Because pseudogenes contain these stop codons and frameshift mutations, gene prediction programs often misannotate them.
Example (a single-base deletion shifts the reading frame; dots mark stop codons):
agtacatgcataggactcgatcgactc  STCIGLDRL
agtacatgataggactcgatcgactc   ST..DSID
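The two sequences on this slide can be translated directly to see how losing one base introduces stop codons. The sketch below is illustrative only and is not part of the original pipeline; the function name translate is hypothetical, the standard genetic code is assumed, and stop codons are printed as * rather than dots.

```python
# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate DNA starting at its first base; '*' marks a stop codon.

    Trailing bases that do not fill a complete codon are dropped.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


intact = "agtacatgcataggactcgatcgactc"
deleted = "agtacatgataggactcgatcgactc"  # the 'c' after 'agtacatg' is missing

print(translate(intact))   # STCIGLDRL
print(translate(deleted))  # ST**DSID
```

The deletion turns the codons tgc ata into tga tag, both of which are stop codons, and every codon after that point is read in a shifted frame.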
Pipeline
[Flowchart] Protein domains (query) are searched with BLAST against Arabidopsis introns (subject), giving the genes with matches in introns, and with HMMER against the CDS, giving the genes with matches in the CDS. Genes with matches in both are possibly misannotated. These are then checked for stop codons, frameshifts, and Ka/Ks to find possible pseudogenes.
[Diagram: a BLAST search with protein domains as the query and Arabidopsis introns as the subject gives the genes matching in introns; a HMMER search with protein domains as the query and Arabidopsis CDS as the subject gives the genes matching in exons]
Each of the 8296 protein domain families is searched with BLAST against the introns of the 25000 genes of the Arabidopsis genome. This finds any intron that matches a protein domain. The coding sequences of the Arabidopsis genome are searched in the same way, but with HMMER, which finds matches to any domain in the coding sequence. HMMER would also have been used for the intron search, but it would have taken far too long.
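As a rough sketch of how the intron search results might be collected per gene, the code below parses BLAST hits in the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start and end, subject start and end, E-value, bit score). The slides do not say which output format, identifiers, or cutoff were used, so the function name domains_by_gene, the "_intron" naming of subject sequences, the 1e-5 E-value cutoff, and the sample hits are all assumptions made for illustration.

```python
from collections import defaultdict


def domains_by_gene(tabular_lines, max_evalue=1e-5):
    """Map each gene to the set of domains that hit one of its introns.

    Assumes 12-column tab-separated BLAST output and subject IDs of the
    hypothetical form '<gene>_intron<N>'. The cutoff is also an assumption.
    """
    hits = defaultdict(set)
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        domain, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= max_evalue:
            gene = subject.split("_intron")[0]
            hits[gene].add(domain)
    return dict(hits)


# Made-up hits, for illustration only.
example = [
    "domA\tgeneA_intron2\t45.0\t60\t33\t0\t1\t60\t10\t189\t1e-12\t70.1",
    "domA\tgeneA_intron3\t30.0\t40\t28\t0\t5\t44\t7\t126\t0.5\t25.0",
    "domB\tgeneB_intron1\t50.0\t50\t25\t0\t1\t50\t4\t153\t3e-08\t55.3",
]
intron_hits = domains_by_gene(example)
print(intron_hits)
```

The same gene-to-domains mapping could be built from the HMMER results for the coding sequences, which would make the two searches directly comparable.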
[Diagram: genes matching in both introns and CDS become the possibly misannotated genes]
Genes that do not have matches to the same domain in both the introns and the coding sequence are filtered out. The genes that remain are possibly misannotated. These genes were filtered further to keep only those with matches in an intron and its flanking exons. These introns will be checked for stop codons and frameshift mutations, and the Ka/Ks value will also be checked. This information will be used to identify pseudogenes.
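The two filtering steps described here amount to set operations. The sketch below is a minimal illustration, not the code used in the study: the function names and sample data are hypothetical, introns and exons are assumed to be numbered so that intron i lies between exon i and exon i + 1, and "flanking exons" is read as requiring a match in both neighbouring exons.

```python
def possibly_misannotated(intron_hits, cds_hits):
    """Keep genes where the same domain matches in both introns and CDS.

    Both arguments map a gene ID to the set of domains matched there.
    """
    shared = {}
    for gene, intron_domains in intron_hits.items():
        common = intron_domains & cds_hits.get(gene, set())
        if common:
            shared[gene] = common
    return shared


def flanked_introns(intron_idxs, exon_idxs):
    """Return introns whose two neighbouring exons also match the domain.

    Assumes intron i sits between exon i and exon i + 1 (1-based numbering).
    """
    return sorted(i for i in intron_idxs if i in exon_idxs and i + 1 in exon_idxs)


# Made-up data, for illustration only.
intron_hits = {"geneA": {"domA"}, "geneB": {"domB"}, "geneC": {"domC"}}
cds_hits = {"geneA": {"domA", "domD"}, "geneB": {"domC"}}

print(possibly_misannotated(intron_hits, cds_hits))  # {'geneA': {'domA'}}
print(flanked_introns({2, 5}, {2, 3, 6}))            # [2]
```

In the second call, intron 2 is kept because exons 2 and 3 both match, while intron 5 is dropped because exon 5 has no match.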
Results
There were 346 genes (different models not included) that had matches to the same domain in the introns and exons.
There were 299 genes (different models not included) that had matches to the same domain in an intron and its flanking exons. These are most likely misannotations.
4 domains with the most possible misannotations
Future Research
Identify pseudogenes by looking for stop codons and frameshift mutations in the introns and by checking the Ka/Ks value.
Use a more recent database of domains.
Follow the same process for the rice genome.
Acknowledgement
Dr. Shin-Han Shiu, Dr. Kosuke Hanada, Dr. Melissa Lehti-Shiu, Dr. Gail Richmond, HSHSP