Introduction
1. Ordering of P. knowlesi contigs v P. falciparum
   methodology
   progress/status
   towards a synteny map – ‘true’ scaffold
2. Gene prediction
   generating the ‘first’ proteome
   some ‘syntenic breakpoints’ in ACT view
Overview
Contig ordering (knowlesi v falciparum): benefits and a caveat.
Speeds:
Annotation via generation of pseudomolecules
Prefinishing
Dissemination of gene models via GeneDB
Identification of ‘syntenic breakpoints’
Towards ‘species specific genes’
Generation of a predicted proteome
Positive impact on gene models in falciparum by identification of missed genes/exons
CAVEAT: the current methodology assumes synteny; there is no evidence for physical linkage of contigs in the pseudomolecules.
Integration of read pair data is needed to confirm linkage and generate scaffolds.
Read pairs can confirm or deny physical linkage of contigs assumed by ordering
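The read pair check on this slide can be sketched in a few lines. This is an illustration only, not the pipeline in use. The function name `check_linkage`, the input layout (one tuple of contig names per read pair) and the `min_pairs` threshold are all assumptions. Read orientation and insert size are ignored.

```python
from collections import Counter

def check_linkage(order, read_pairs, min_pairs=2):
    """Count read pairs that bridge each adjacency assumed by the ordering.

    order      -- contig names in pseudomolecule order (hypothetical input)
    read_pairs -- iterable of (contig_of_read_1, contig_of_read_2)
    min_pairs  -- assumed threshold for calling an adjacency 'supported'
    Returns {(left, right): (n_bridging_pairs, supported)}.
    """
    links = Counter()
    for a, b in read_pairs:
        if a != b:  # pairs inside one contig say nothing about linkage
            links[frozenset((a, b))] += 1

    report = {}
    for left, right in zip(order, order[1:]):
        n = links[frozenset((left, right))]
        report[(left, right)] = (n, n >= min_pairs)
    return report
```

Adjacencies with no bridging pairs are not disproved by this count alone. They are only unconfirmed, which is the caveat the overview slide raises.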
Ellen Adlam’s contig ordering script – brief methodology
Four stages:
1. The Pk contig set is filtered to remove contigs below 5 kb.
2. TBLASTX is run on sections of Pk contigs against Pf chromosomes. Contigs are split into 14 groups according to the Pf chromosome of their top hit.
3. Coordinates of hits are examined. Pk contigs are ordered relative to the ‘corresponding’ Pf chromosome.
4. Coordinates are re-examined and N’s are inserted to represent gaps, sized as expected by measurement against Pf.
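The four stages can be sketched roughly as below. This is not the script itself. The function name `order_contigs` and the hit tuple layout are hypothetical. As a simplification, each contig is placed using only its single top hit, whereas the script examines the coordinates of all hits.

```python
from collections import defaultdict

MIN_CONTIG_LEN = 5_000  # stage 1 threshold from the slide (5 kb)

def order_contigs(contigs, hits):
    """contigs: {name: sequence}
    hits: list of (contig, pf_chromosome, pf_start, pf_end, score),
          e.g. parsed from TBLASTX output (layout is an assumption).
    Returns {pf_chromosome: (ordered contig names, padded sequence)}.
    """
    # Stage 1: drop contigs below 5 kb.
    kept = {n: s for n, s in contigs.items() if len(s) >= MIN_CONTIG_LEN}

    # Stage 2: assign each contig to the Pf chromosome of its top hit.
    best = {}
    for contig, chrom, start, end, score in hits:
        if contig in kept and (contig not in best or score > best[contig][3]):
            best[contig] = (chrom, min(start, end), max(start, end), score)

    groups = defaultdict(list)
    for contig, (chrom, start, end, _score) in best.items():
        groups[chrom].append((start, end, contig))

    # Stages 3 and 4: order by Pf coordinate, pad expected gaps with N's.
    result = {}
    for chrom, placed in groups.items():
        placed.sort()
        names, parts, prev_end = [], [], None
        for start, end, contig in placed:
            if prev_end is not None and start > prev_end:
                parts.append("N" * (start - prev_end))
            names.append(contig)
            parts.append(kept[contig])
            prev_end = end if prev_end is None else max(prev_end, end)
        result[chrom] = (names, "".join(parts))
    return result
```

The output depends entirely on the assumption of synteny. Nothing in this procedure tests whether neighbouring contigs are physically linked.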
Contigs ordered against Pf Chr7.
Ordering tends to fail in highly variable regions:
   Subtelomeres
   Internal var arrays
Integration of data to inform gene models
BLAST and FASTA
Comparison of regions of synteny with falciparum
Gene prediction algorithms: SNAP, Projector
Integrate into ACT
Manual review
Accurate gene predictions
Proteome data
EST data
ACT visualisation of ‘synteny’ to aid annotation
Contig ordering: results/estimates/next steps
5x coverage: 18.6 Mb ordered, av. 21, 980 gaps, 2300 gene preds
8x coverage: 23 Mb ordered, av. 29 kb, 280 gaps, 5100 gene preds
Manually reviewed models (297) for chr 6 (estimated time scale for manual review of all gene predictions: person days, 2 – 3.5 months). Passed on to aid in prefinishing.
Possible next steps:
1. It may be possible to manually order smaller contigs into the gaps.
2. Analyse using read pair data (sequencing and BAC end reads) to generate scaffolds (IN PROGRESS).
3. Identify BAC clones which may be telomeric/subtelomeric by mapping end reads onto the metachromosomes.
Identification of gene duplication/deletion
[Figure: P. falciparum chr7 v P. knowlesi]
Gene finders: different types
ab initio: bases predictions on a statistical profile calculated from a training set (criteria: consensus sequence start sites, splice junctions, sequence composition on codon and DNA level for coding, introns and non-coding, intron length distribution, exon length distribution)
comparative: bases predictions on sequence similarity to coding sequence in a related organism and uses the statistical profile from a training set to a much lesser extent
Projector
The precise alignment step of the algorithm means that it needs much memory; it cannot go through an entire sequence.
Before we can feed it the reference and query sequence we need to:
   align the corresponding chromosome contigs
   identify which gene plus surrounding sequence in the annotated sequence corresponds to which section in the unannotated sequence (Ellen's script and Gene Modeller can provide some hints for this)
   take the two linked regions in unannotated and reference and give these to Projector as input
It can only predict for regions for which you have told it to.
At the moment it can only be run by the person who wrote it, but it is being calibrated and is under development for wider use.
It can show where it observed conservation on sequence level for both (for untranslated, exon and intron).
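The preparation step described on this slide (cutting out each gene plus surrounding sequence from the reference, together with the linked section of the unannotated sequence) could look something like the sketch below. Everything here is an assumption for illustration: the helper name `paired_regions`, the tuple layout of `links`, the 0-based half-open coordinates and the default `flank` of 1000 bases. Projector's actual input format is not described on the slide and is not modelled here.

```python
def paired_regions(ref_seq, query_seq, links, flank=1000):
    """Cut out linked reference/query windows, each padded by `flank` bases.

    links -- list of (gene_id, ref_start, ref_end, query_start, query_end),
             0-based half-open coordinates (hypothetical layout)
    Returns a list of (gene_id, reference_window, query_window).
    """
    pairs = []
    for gene_id, r_start, r_end, q_start, q_end in links:
        # Clamp the padded windows to the ends of each sequence.
        r_lo, r_hi = max(0, r_start - flank), min(len(ref_seq), r_end + flank)
        q_lo, q_hi = max(0, q_start - flank), min(len(query_seq), q_end + flank)
        pairs.append((gene_id, ref_seq[r_lo:r_hi], query_seq[q_lo:q_hi]))
    return pairs
```

Keeping each window small is what addresses the memory limitation noted above, at the cost of predicting only in regions that were linked beforehand.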
Exploring different gene finding tools for P. knowlesi
Originated from the complex and slow process of manually building a training set for an unannotated organism, making use of an annotated relative (P. falciparum).
SNAP: ab initio
GENE MODELLER: comparative; sensitive BLAST, then tries to find start/stop/splice sites near BLAST hit ends; needs refinement
PROJECTOR: comparative
Gives us a good opportunity to evaluate strengths and weaknesses of each.
Trial on an ordered contig set for knowlesi chr6 which had been annotated.
Sensitivity and specificity performance for single exon and multi exon genes
[Figure panels: single exon; >1 exon]
Sensitivity and specificity measured against a set of 156 manually annotated genes
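The slides do not state how a prediction was scored against the manual set, so the sketch below is one possible reading rather than the method used. It assumes gene-level scoring where a predicted model counts only if every exon coordinate matches an annotated model exactly. It also uses the convention common in gene prediction evaluation, where specificity means TP / (TP + FP). The function name and the exon tuple layout are hypothetical.

```python
def gene_level_accuracy(annotated, predicted):
    """Exact-match gene-level sensitivity and specificity.

    Each gene model is a tuple of (start, end) exon coordinates.
    sensitivity = TP / number of annotated genes
    specificity = TP / number of predicted genes (gene-finder convention)
    """
    annotated = set(annotated)
    predicted = set(predicted)
    true_pos = len(annotated & predicted)
    sensitivity = true_pos / len(annotated) if annotated else 0.0
    specificity = true_pos / len(predicted) if predicted else 0.0
    return sensitivity, specificity
```

A looser criterion, such as any overlap with an annotated gene, would raise both figures, so the matching rule needs to be fixed before tools are compared.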
How well are start and stop codons predicted?
Conclusions on gene prediction performance
Specificity: Projector (26) > SNAP (6) > Gene Modeller (0)
“New” Projector: 20 % (17 %) exact specificity of the gene models made
Sensitivity: SNAP (154) > Gene Modeller (143) > Projector (128)
SNAP and Gene Modeller, although not specific, are sensitive; Gene Modeller due to the BLAST parameters chosen (low penalties for gap opening, extension and mismatch; word size 9)
Can the strengths of Gene Modeller or SNAP be combined with the specificity of Projector?
Future work
New run of the latest contig ordering set using Projector, informed with additional data as “intervals” to improve sensitivity.