Promoter prediction assessment by Vladimir B Bajic ENCODE Workshop 2005 at Sanger Institute
Predictors ENCODE participants (3): (McPromoter1) (McPromoter2) (Fprom) additional predictors Beyond ENCODE participants (4) (out of competition) DBTSS (reference experimental dataset of capped flcDNA) FirstEF Dragon Gene Start Finder Dragon Promoter Finder
Goals How good are promoter predictors? Does performance change on this dataset? Implications for future developments
Data (1) Category “Known genes with CDS” (category = 2) 1061 annotated transcripts > 994 unique starts of transcripts (TSSs) 319 unique TSSs in Encode ‘training’ set (13 regions) 675 unique TSSs in Encode test set Length of ENCODE regions 29,998,060 bp Length of ‘training’ regions 8,538,447 bp Length of testing regions 21,459,613 bp
Data (2) programspredictionsunique McPromoter1694 McPromoter2727 Fprom (combined with gene annotation from Fgenesh) DBTSS (exp) FEF1266 DPF2168 DGSF628
Method for counting TP and FP All hits to ‘orange’ count as FPs Only one hit within A, B, or C counts as TP for unique position of TSS (3 hits within C will count only as 1 TP) Only minimum distance from all TSSs counts
Results Different measures of success Test ENCODE regions Also: comparison with other participants (test + all regions)
Se, ppv, AE (average positional error)
DIP1, DIP2, CC, ASM
Comments Compared to previous whole human genome analysis, now we use a more strict distance constraint: max allowed distance 1000 nt (vs. previous 2000 nt) Previously: Se [0.4 – 0.8], ppv [0.25 – 0.67] Now, for experimental DBTSS data: –Positional error ~100 nt, Se 0.61, ppv 0.93 Computational promoter prediction (CPP) (using single genome, no transcripts): positional error nt (2-3 fold larger than DBTSS) (positive surprise) Se [ ] (negative surprise but expected) –(reason poor G+C content of some of the test regions) CPP: ppv >80 (in some cases >90%) (positive surprise) Having in mind the type of information used for ab initio promoter finding, we see no dramatic difference in 5’ end prediction by methods class 1 and 3, and CPP (positive surprise); however, Se and ppv are better with methods of class 1 and class 3 for obvious reasons.
Future developments Combine TSS predictors and gene finding programs or transcript info (positive effects of this are visible in Fprom, and , since in these cases the TSS search space is effectively restricted) This, however, requires retuning of TSS predictors and some change in their design philosophy Expected performance should be similar or better than in class 1 and class 3 systems as TSS finding systems should be more specialized for the 5’end type of signals More emphasis should be given to positional accuracy of TSS predictors
Thank you for your time You may wake up now