Friday July 18th Update

“WesList” proteins

Wes and Mark(?) identified a number of potential targets based upon the presence of patents; the enzyme name, PlasmoDB ID and number of patents found were provided. It was decided that a round of target selection would be carried out to identify P. falciparum targets which might have medical relevance. I have cross-referenced this list against the P. falciparum proteins available in the SGPP Target Selection database: 314 proteins were identified.

Selection was carried out based upon:
- 0 predicted TM regions
- Homology to PDB proteins: 55% identity threshold
- Homology to the human redundant dataset: 55% identity threshold

These proteins were also BLASTed against the T. brucei and L. major proteins currently present in the SGPP Target Selection database, in order to identify a homologous protein dataset for target selection in these species.
Following exclusion of proteins based upon “normal” selection procedures as well as the previously discussed homology exclusions, 135 P. falciparum proteins were identified as possible targets and sent to Chris. 73 of these were identified as having been targeted previously, most likely in the earlier enzyme selection.

In addition to the usual sequence, length, etc., the following data were sent to Chris:
- Matches vs redundant human proteins (>55% identity excluded)
- Matches vs PDB (>55% identity excluded)
- Matches vs Structural Genomics Targets
- Enzyme name; number of patents; priority

This should allow him to select 96 targets from the list of 135. I did not apply the size threshold during selection, so that Chris can obtain 96 proteins, if necessary including a few above the 850 amino acid threshold. The average length in this set is 764; 51 proteins are above the normal size threshold of 850; 15 are longer than 1,000 amino acids. Chris will send me back a list of the proteins selected and will also send this list directly to Frank for inclusion in the web pages and in the DB being maintained there.
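The exclusion rules and length summary described above could be sketched roughly as follows. This is an illustration only: the record field names (`id`, `length`, `tm_regions`, `pdb_identity`, `human_identity`) and the example records are hypothetical and are not taken from the SGPP Target Selection database.

```python
# Minimal sketch of the selection filter; all field names and records
# below are hypothetical, not the actual database schema or data.
IDENTITY_CUTOFF = 55.0   # matches above this % identity are excluded
SIZE_THRESHOLD = 850     # normal size limit; reported here, not enforced


def passes_selection(protein):
    """Keep proteins with no predicted TM regions and no close PDB/human match."""
    return (protein["tm_regions"] == 0
            and protein["pdb_identity"] <= IDENTITY_CUTOFF
            and protein["human_identity"] <= IDENTITY_CUTOFF)


def summarize_lengths(proteins):
    """Report the count, mean length, and how many exceed each size limit."""
    lengths = [p["length"] for p in proteins]
    return {
        "count": len(lengths),
        "mean_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "over_threshold": sum(1 for n in lengths if n > SIZE_THRESHOLD),
        "over_1000": sum(1 for n in lengths if n > 1000),
    }


candidates = [
    {"id": "example_1", "length": 400, "tm_regions": 0,
     "pdb_identity": 30.0, "human_identity": 20.0},
    {"id": "example_2", "length": 900, "tm_regions": 0,
     "pdb_identity": 10.0, "human_identity": 54.0},
    {"id": "example_3", "length": 1200, "tm_regions": 0,
     "pdb_identity": 0.0, "human_identity": 0.0},
    {"id": "example_4", "length": 500, "tm_regions": 2,   # excluded: TM regions
     "pdb_identity": 0.0, "human_identity": 0.0},
    {"id": "example_5", "length": 600, "tm_regions": 0,   # excluded: PDB match
     "pdb_identity": 80.0, "human_identity": 0.0},
]

selected = [p for p in candidates if passes_selection(p)]
summary = summarize_lengths(selected)
```

Keeping the size threshold out of `passes_selection` mirrors the decision above to leave long proteins in the list and only report how many exceed the limit.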
It was also decided that we should attempt to obtain homologous proteins for this set in both L. major and T. brucei; I have primarily been concentrating on T. brucei target selection. 93 T. brucei “WesList” homologues were identified. Unfortunately, the majority of those identifiable from the T. brucei sequences already present in the SGPP Target Selection database were incomplete and therefore unusable. New T. brucei sequences have been downloaded from a variety of sources, and 9,782 proteins are currently being parsed (this list is redundant). It appears that continuing sequencing and reannotation have provided an increased number of proteins with start and stop codons. I am waiting to complete parsing and analysis of this set before I send the “WesList” T. brucei homologue list to Chris. The new set will also provide T. brucei targets that are not homologous to the “WesList” set.
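One possible way to reduce a redundant download to complete, unique sequences is sketched below. The completeness test is an assumption made for illustration: a sequence is treated as complete when it starts with methionine ("M") and ends with "*", the stop marker that some translation tools append. Redundancy is taken to mean exact sequence identity, and the example IDs and sequences are invented.

```python
def complete_nonredundant(sequences):
    """Keep complete protein sequences and drop exact duplicates.

    Assumptions (not from the report): a complete sequence starts with "M"
    and ends with "*"; two entries are redundant if their sequences are
    identical after normalizing case and whitespace.
    """
    seen = set()
    kept = {}
    for seq_id, seq in sequences.items():
        seq = seq.strip().upper()
        if not (seq.startswith("M") and seq.endswith("*")):
            continue  # missing start or stop: incomplete
        if seq in seen:
            continue  # identical sequence already kept under another ID
        seen.add(seq)
        kept[seq_id] = seq
    return kept


# Invented example records, for illustration only.
example = {
    "tb_a": "MKTLLV*",
    "tb_b": "KTLLV*",    # no start
    "tb_c": "MKTLLV",    # no stop
    "tb_d": "mktllv*",   # duplicate of tb_a
    "tb_e": "MAAGH*",
}
result = complete_nonredundant(example)
```

Exact-match deduplication only removes identical entries; near-identical sequences from different sources would need a similarity-based clustering step instead.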
Few suitable homologues for the “WesList” set were identified for L. major. The L. major genome project was frozen as of the beginning of this month, and I am now working (together with Chris Peacock at the Sanger Institute) on the reannotation of previously incomplete L. major chromosome sequences, together with the annotation of a number of L. major chromosomes which were not included in previous L. major SGPP selections. These were omitted primarily due to concerns about the quality of annotation, arising from the identification of large numbers of incomplete protein sequences (no start or stop codon; 5 chromosomes' worth). At SBRI we have been working on redeveloping our annotation database; these changes have been reflected in the SGPP Target Selection database and allow us to populate both databases with information on the newly identified L. major proteins. I will then identify “WesList” homologues. This will also allow target selection for L. major proteins in general. A new set of L. major targets can be sent to Rochester by next week.
I have obtained the progress information for targets, but we have not yet begun any analysis aimed at identifying correlations between sequence features and “clonability”, “expressability”, etc.