Download presentation
Presentation is loading. Please wait.
Published byVernon Craig Modified over 9 years ago
1
Promoter prediction assessment by Vladimir B Bajic ENCODE Workshop 2005 at Sanger Institute
2
Predictors ENCODE participants (3): 7-80-8 (McPromoter1) 7-81-8 (McPromoter2) 41-108-8 (Fprom) additional predictors Beyond ENCODE participants (4) (out of competition) DBTSS (reference experimental dataset of capped flcDNA) FirstEF Dragon Gene Start Finder Dragon Promoter Finder
3
Goals How good are promoter predictors? Does performance change on this dataset? Implications for future developments
4
Data (1) Category “Known genes with CDS” (category = 2) 1061 annotated transcripts 1009 -> 994 unique starts of transcripts (TSSs) 319 unique TSSs in Encode ‘training’ set (13 regions) 675 unique TSSs in Encode test set Length of ENCODE regions 29,998,060 bp Length of ‘training’ regions 8,538,447 bp Length of testing regions 21,459,613 bp
5
Data (2) programspredictionsunique McPromoter1694 McPromoter2727 Fprom (combined with gene annotation from Fgenesh) 634533 DBTSS (exp)628524 FEF1266 DPF2168 DGSF628
6
Method for counting TP and FP All hits to ‘orange’ count as FPs Only one hit within A, B, or C counts as TP for unique position of TSS (3 hits within C will count only as 1 TP) Only minimum distance from all TSSs counts
7
Results Different measures of success Test ENCODE regions Also: comparison with other participants (test + all regions)
8
Se, ppv, AE (average positional error)
9
DIP1, DIP2, CC, ASM
11
Comments Compared to previous whole human genome analysis, now we use a more strict distance constraint: max allowed distance 1000 nt (vs. previous 2000 nt) Previously: Se [0.4 – 0.8], ppv [0.25 – 0.67] Now, for experimental DBTSS data: –Positional error ~100 nt, Se 0.61, ppv 0.93 Computational promoter prediction (CPP) (using single genome, no transcripts): positional error 200-300 nt (2-3 fold larger than DBTSS) (positive surprise) Se [0.32-0.62] (negative surprise but expected) –(reason poor G+C content of some of the test regions) CPP: ppv >80 (in some cases >90%) (positive surprise) Having in mind the type of information used for ab initio promoter finding, we see no dramatic difference in 5’ end prediction by methods class 1 and 3, and CPP (positive surprise); however, Se and ppv are better with methods of class 1 and class 3 for obvious reasons.
12
Future developments Combine TSS predictors and gene finding programs or transcript info (positive effects of this are visible in Fprom, 20- 76-4 and 20-76-5, since in these cases the TSS search space is effectively restricted) This, however, requires retuning of TSS predictors and some change in their design philosophy Expected performance should be similar or better than in class 1 and class 3 systems as TSS finding systems should be more specialized for the 5’end type of signals More emphasis should be given to positional accuracy of TSS predictors
13
Thank you for your time You may wake up now
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.