EXPLORING DEAD GENES Adrienne Manuel I400
What are they? Dead genes are also called pseudogenes. Pseudogenes are non-functioning copies of genes in DNA. They result from reverse transcription of an mRNA transcript, or from gene duplication and subsequent disablement.
Expression of Pseudogenes: Some pseudogenes are evidently transcribed. Expression of pseudogenes varies. The snail Lymnaea stagnalis is an example of an organism that still has functioning pseudogenes.
Pseudogenes, Good and Bad! - Raised expression in tumor cells. + Useful in studying molecular evolution. + Helpful in determining rates of genomic DNA loss for an organism.
Size and Distribution of Pseudogenes: DEFINING POPULATIONS AND SUBPOPULATIONS. G is the total population of confirmed and predicted protein-encoding genes. ΨG is the estimated population of pseudogenes that correspond to G.
G_E is the set of genes with at least one verifying EST match. G_M is a set of genes deemed to be highly expressed, derived from microarray expression data. The corresponding predicted pool of pseudogenes is denoted ΨG_M.
Data Files: Sanger Sequencing Centre FTP site (ftp://ftp.sanger.ac.uk). This site holds the six complete sequences of the worm chromosomes, and GFF data files with annotations for genes and other genomic features that correspond to wormpep18. The pseudogene population was assembled by means of a pipeline.
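To make the GFF annotation files concrete, here is a minimal sketch of reading one GFF record in Python. It assumes the standard tab-separated GFF column layout (sequence name, source, feature, start, end, score, strand, frame, attributes). The `GffFeature` type, the `parse_gff_line` function and the sample line are hypothetical illustrations and are not taken from the Sanger files.

```python
from typing import NamedTuple, Optional


class GffFeature(NamedTuple):
    """A few fields of one GFF record (hypothetical helper type)."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    strand: str
    attributes: str


def parse_gff_line(line: str) -> Optional[GffFeature]:
    """Parse one tab-separated GFF line; return None for comments and blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) < 8:
        raise ValueError("expected at least 8 tab-separated GFF columns")
    # The ninth (attributes) column is optional in older GFF files.
    attributes = cols[8] if len(cols) > 8 else ""
    return GffFeature(cols[0], cols[1], cols[2],
                      int(cols[3]), int(cols[4]), cols[6], attributes)


# Made-up example record, for illustration only.
sample = "CHROMOSOME_I\tcurated\texon\t100\t250\t.\t+\t.\tSequence \"X\""
print(parse_gff_line(sample))
```

Keeping start and end as integers makes the later overlap checks in the pipeline a simple comparison of coordinates.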
Pipelines Step 1: Sanger Centre pseudogene annotations. Start with a list of 332 pseudogenes. The pseudogene population was derived by looking for gene disablement. Step 2: FASTA matching to find potential pseudogenes.
PIPELINES (continued): Worm genes were masked for low-complexity regions with the program SEG. TFASTX and TFASTY were then used to compare the complete wormpep18 against the worm genome. After the comparison, pseudogene matches were refined in the following steps.
Pipeline (continued) Step 3: Reduction for overlaps on the genomic DNA. Significant matches of protein sequences to the DNA were reduced for redundancy where homologs match the same segment of DNA. Matches are then sorted. Step 4: Prevention of overcounting for adjacent matches. Initial matches may correspond to the same pseudogene; to avoid overcounting, matches were realigned.
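The idea behind Steps 3 and 4 can be sketched as interval merging in Python. This is a simplification under stated assumptions: matches are reduced to (start, end) coordinates on one chromosome, and adjacent matches are joined by a distance cutoff rather than by the realignment the pipeline actually used. The `merge_matches` function and the `max_gap` parameter are hypothetical names.

```python
def merge_matches(matches, max_gap=0):
    """Merge (start, end) matches that overlap or lie within max_gap bases.

    Hypothetical stand-in for Steps 3 and 4: overlapping matches collapse
    into one region, and nearby matches are counted as one candidate.
    """
    merged = []
    for start, end in sorted(matches):
        if merged and start <= merged[-1][1] + max_gap:
            # Extend the previous region instead of counting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy coordinates, for illustration only.
hits = [(200, 260), (10, 50), (40, 80)]
print(merge_matches(hits))               # overlap only
print(merge_matches(hits, max_gap=150))  # also join nearby matches
```

Sorting first means each match only needs to be compared with the most recent merged region.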
Pipeline (continued) Step 5: Masking against Sanger Centre annotation and transposon library. Potential pseudogenes were filtered for overlap with any other annotations in the Sanger Centre GFF files, e.g. exons of genes, tandem or inverted repeats. Step 6: Reduction for possible additional repeat elements. At this point there is a set of 3814 pseudogenic fragments.
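The masking in Step 5 amounts to discarding any candidate region that overlaps an existing annotation. A minimal Python sketch, assuming both candidates and annotations are (start, end) pairs on the same chromosome; `overlaps` and `mask_candidates` are hypothetical helper names and the coordinates are toy values.

```python
def overlaps(a, b):
    """True if closed intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def mask_candidates(candidates, annotations):
    """Keep only candidates that overlap none of the annotated features."""
    return [c for c in candidates
            if not any(overlaps(c, a) for a in annotations)]


# Toy data: the first candidate overlaps an annotated exon and is dropped.
candidates = [(10, 80), (200, 260), (400, 450)]
annotated = [(70, 120), (300, 350)]
print(mask_candidates(candidates, annotated))
```

This all-against-all comparison is fine for a sketch; a genome-scale run would more likely sort both lists or use an interval tree.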
Pipeline (final step) Step 7: Adjusting threshold stringency. The e-value match threshold was lowered from 0.01 to 0.001, a stricter cutoff. Check the web! To find the pseudogene population, the data can be viewed either by searching for a protein name or by viewing a specific range in the chromosome.
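The effect of moving the e-value cutoff from 0.01 to 0.001 can be illustrated with a small Python filter. The match records and the `filter_by_evalue` function are hypothetical; only the two threshold values come from the slide.

```python
def filter_by_evalue(matches, threshold):
    """Keep matches whose e-value is at or below the threshold."""
    return [m for m in matches if m["evalue"] <= threshold]


# Toy matches with made-up names and e-values.
matches = [
    {"name": "match_a", "evalue": 0.005},
    {"name": "match_b", "evalue": 0.0005},
    {"name": "match_c", "evalue": 1e-20},
]

kept_loose = filter_by_evalue(matches, 0.01)
kept_strict = filter_by_evalue(matches, 0.001)
print(len(kept_loose), len(kept_strict))
```

A smaller e-value means a match is less likely to have arisen by chance, so lowering the cutoff keeps fewer, more reliable matches.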
Size of Pseudogene Population: Composed of 2168 sequences, about 12% of the total gene complement. Factors that affect the size estimate: 1. Dead copies of transposable elements. 2. The size of the pseudogene population is underestimated because pseudogenes with less obvious disablements aren't included. 3. Annotated genes might be pseudogenes because the disablement is undetectable. 4. Pseudogenes may still be part of a functioning gene. 5. Some pseudogenes arise due to sequencing errors. 6. Possible genomic repeats.
SUBPOPULATIONS: Highly expressed genes have fewer dead gene copies. The most reliable subset of the pseudogene population is about half the total for ΨG. 39% of pseudogenes are intronic; these kinds of pseudogenes aren't ailing families of proteins.
Chromosomal Distributions: Pseudogenes are more abundant near the ends of chromosomes (the "arms"). For each chromosome, a proportion of dead genes was calculated.
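A per-chromosome proportion of dead genes could be computed along these lines in Python. This sketch assumes the proportion means pseudogenes divided by genes on the same chromosome, which the slide does not spell out. The `pseudogene_proportions` function and all counts are hypothetical toy values, not results from the study.

```python
def pseudogene_proportions(gene_counts, pseudogene_counts):
    """Return pseudogenes-per-gene for each chromosome (assumed definition)."""
    proportions = {}
    for chrom, n_genes in gene_counts.items():
        if n_genes == 0:
            continue  # avoid dividing by zero for an empty chromosome
        proportions[chrom] = pseudogene_counts.get(chrom, 0) / n_genes
    return proportions


# Made-up counts, for illustration only.
genes = {"I": 100, "II": 200, "X": 50}
pseudogenes = {"I": 10, "II": 30}
for chrom, value in pseudogene_proportions(genes, pseudogenes).items():
    print(chrom, f"{value:.2f}")
```

The same function could be applied separately to arm and central regions of each chromosome to compare the two.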
The data plot above compares the overall amino acid composition of the gene set (G) and the pseudogene set (ΨG). The percentage composition for each of the 20 amino acids is graphed in decreasing order of the implied amino acid composition in the pseudogene set. In the bottom part of the figure, the difference from G for each amino acid's composition is indicated by a bar.
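The quantities in this plot can be sketched in Python: percentage composition for each of the 20 amino acids, ordered by the pseudogene set, plus a per-residue difference. The sign of the difference (pseudogene set minus gene set) is an assumption, and the function names and toy sequences are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def composition(sequences):
    """Percentage of each standard amino acid across a set of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


def composition_difference(pseudo_seqs, gene_seqs):
    """Rows of (amino acid, % in pseudogene set, pseudogene % minus gene %),
    sorted by decreasing percentage in the pseudogene set."""
    pseudo = composition(pseudo_seqs)
    gene = composition(gene_seqs)
    order = sorted(AMINO_ACIDS, key=lambda aa: pseudo[aa], reverse=True)
    return [(aa, pseudo[aa], pseudo[aa] - gene[aa]) for aa in order]


# Toy sequences, for illustration only.
for row in composition_difference(["AAAC"], ["AACC"])[:2]:
    print(row)
```

Characters outside the 20 standard codes, such as X or stop markers, are ignored so that they do not distort the percentages.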
Listed are the largest sequence families in the worm, ranked by number of genes and by number of pseudogenes. Each family is named for a particular representative. Four of the top 10 paralog gene families, when ranked by number of genes, are functionally uncharacterized. Three of the top 10 pseudogene families are also among the biggest families when ranked by number of genes.
Pseudofolds: These charts are ranked in terms of implied structural pseudofolds. Proteins encoded by the worm genome have been assigned to globular domain folds from the SCOP database.
Why was this studied again? To provide an initial estimate of the size, distribution and characteristics of the pseudogene population in C. elegans, in an attempt to estimate the total number in humans. Found few pseudogenes in the worm genome that are apparently due to processing. Found a large uncharacterized gene family that makes up 2/3 of dead genes. The arms of chromosomes are an unreliable place for encoding genes but are more likely to spawn new proteins.