JIGSAW: a better way to combine predictions
J. E. Allen, W. H. Majoros, M. Pertea, and S. L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions. Genome Biology 2006, 7(Suppl 1):S9.
J. E. Allen and S. L. Salzberg. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics 21(18).
J. E. Allen, M. Pertea, and S. L. Salzberg. Computational gene prediction using multiple sources of evidence. Genome Research 14(1), 2004.
Collecting gene structure evidence for JIGSAW
Figure 1. Evidence from the UCSC genome browser used as input to JIGSAW. Evidence includes computational gene finders, alignments from gene expression evidence, and evidence of cross-species sequence conservation.
Representing gene structure evidence in JIGSAW
Each evidence source can predict up to six gene features:
– Start codon
– Stop codon
– Intron
– Protein coding nucleotides
– Donor site
– Acceptor site
Figure 3. Four evidence sources mapped to sequence S: gene prediction (GP1) with no confidence score, gene prediction with confidence score 0.65 (GP2), cDNA aligned with 86% identity and an EST aligned with 95% identity. Examples of the different feature vector types are shown: start codon (sta), stop codon (stp), donor site (don), acceptor site (acc), intron (inr) and amino acid codon (cod). Each element in the feature vector is an evidence source’s prediction for that feature type. The possible exon boundaries are k0, k1, …, k6.
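The feature vectors described in Figure 3 can be illustrated with a minimal Python sketch. This is not JIGSAW's code: the `Evidence` class, the `feature_vector` function, the coordinates, and the choice to record 1.0 for a source that predicts a feature without a confidence score are all assumptions made for illustration. Only the four sources and their scores (no score, 0.65, 86%, 95%) come from the figure.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

# Feature type abbreviations as used in Figure 3.
FEATURE_TYPES = ("sta", "stp", "don", "acc", "inr", "cod")


@dataclass(frozen=True)
class Evidence:
    """One evidence source mapped to the sequence (hypothetical structure)."""
    name: str
    score: Optional[float]  # None when the source reports no confidence
    predictions: FrozenSet[Tuple[str, int]]  # (feature type, position) pairs


def feature_vector(sources: List[Evidence], ftype: str, pos: int) -> List[float]:
    """Return one element per source for the given feature type and position.

    Assumption: a source contributes its score when it predicts the feature,
    1.0 when it predicts the feature but has no score, and 0.0 otherwise.
    """
    if ftype not in FEATURE_TYPES:
        raise ValueError(f"unknown feature type: {ftype}")
    vector = []
    for src in sources:
        if (ftype, pos) in src.predictions:
            vector.append(1.0 if src.score is None else src.score)
        else:
            vector.append(0.0)
    return vector


# The four sources of Figure 3, with made-up coordinates.
sources = [
    Evidence("GP1", None, frozenset({("sta", 100), ("don", 250)})),
    Evidence("GP2", 0.65, frozenset({("sta", 100)})),
    Evidence("cDNA", 0.86, frozenset({("don", 250)})),
    Evidence("EST", 0.95, frozenset()),
]

start_vec = feature_vector(sources, "sta", 100)
donor_vec = feature_vector(sources, "don", 250)
```

Each vector has one slot per evidence source in a fixed order, so the same combination of evidence always maps to the same point in feature space, which is what training needs.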
Training
[Figure: training sequences S1, S2, …, Sm with gene prediction, cDNA, and EST alignment evidence mapped to single, initial, internal, and terminal exons, yielding start site, stop site, donor site, acceptor site, coding, and intron feature vectors.]
Schematic of the JIGSAW training procedure. Known genes are used to evaluate the accuracy of the different combinations of evidence. Prediction accuracy for each feature type (start codon, stop codon, acceptor, donor, amino acid codon and intron) is measured separately.
Figure 4a. The plot shows the accuracy of predictions based on alignments to non-human sequences that overlap a gene finder’s predictions. Each point is a pair of alignments observed in training and their percent identity to the genomic sequence. ‘+’ points are labeled ‘accurate’ and ‘x’ points are labeled ‘inaccurate.’ The two lines correspond to the non-leaf nodes in the decision tree.
Figure 4b. Decision tree used to partition the feature vector space from Figure 4a into three sub-regions. This decision tree indicates that non-human cDNA alignments with > 95% identity to the human sequence (region “V1”) are accurate protein coding predictors.
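A two-node decision tree of the kind shown in Figures 4a and 4b can be sketched in a few lines of Python. Only the 95% cDNA identity cut and the region name V1 come from the slides. The second cut (`other_cut=80.0`), the region names V2 and V3, and the training points are hypothetical placeholders. The per-region accuracy is simply the fraction of training points labeled accurate, which is one plausible way to score a leaf rather than JIGSAW's confirmed method.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def region(cdna_identity: float, other_identity: float,
           cdna_cut: float = 95.0, other_cut: float = 80.0) -> str:
    """Assign a point to one of three sub-regions using two non-leaf nodes.

    cdna_cut follows the slides; other_cut is an illustrative assumption.
    """
    if cdna_identity > cdna_cut:      # first non-leaf node
        return "V1"
    if other_identity > other_cut:    # second non-leaf node
        return "V2"
    return "V3"


def leaf_accuracy(points: Iterable[Tuple[float, float, bool]]) -> Dict[str, float]:
    """Fraction of training points labeled accurate within each region."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for cdna_id, other_id, is_accurate in points:
        leaf = region(cdna_id, other_id)
        totals[leaf] += 1
        correct[leaf] += int(is_accurate)
    return {leaf: correct[leaf] / totals[leaf] for leaf in totals}


# Made-up training points: (cDNA % identity, other % identity, accurate?)
training = [
    (97.0, 50.0, True),
    (96.0, 60.0, True),
    (90.0, 85.0, True),
    (90.0, 86.0, False),
    (80.0, 70.0, False),
]
accuracy_by_region = leaf_accuracy(training)
```

At prediction time, a new feature vector would be routed to its leaf and assigned that leaf's accuracy estimate as its score.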
JIGSAW dynamic programming
Interval: assigns a state to the subsequence between two positions.
Dynamic programming algorithm: at the end of each interval (e0, for example), store the score of the best parse ending at that location.
Modification: store scores for every parse “type” ending at e0.
Types are start, stop, coding, intron, donor, acceptor.
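The modification described above, keeping one best score per parse type at each interval end rather than a single best score, can be sketched as a small Viterbi-style recurrence. This is a simplified illustration, not JIGSAW's GHMM: the `ALLOWED` transition table, the assumption that intervals are consecutive and pre-segmented, and the example scores are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical transition table: which type may follow which.
ALLOWED = {
    "start": {"coding"},
    "coding": {"coding", "donor", "stop"},
    "donor": {"intron"},
    "intron": {"intron", "acceptor"},
    "acceptor": {"coding"},
    "stop": set(),
}


def best_parse(interval_scores: List[Dict[str, float]]) -> Tuple[float, List[str]]:
    """Best-scoring labeling of consecutive intervals, from start to stop.

    interval_scores[i] maps a type to the score of labeling interval i with
    that type; missing types are treated as impossible.
    """
    if not interval_scores:
        return -math.inf, []
    types = list(ALLOWED)
    n = len(interval_scores)
    # best[i][t]: score of the best parse of intervals 0..i ending in type t.
    best = [{t: -math.inf for t in types} for _ in range(n)]
    back = [{t: None for t in types} for _ in range(n)]
    best[0]["start"] = interval_scores[0].get("start", -math.inf)
    for i in range(1, n):
        for t in types:
            s = interval_scores[i].get(t, -math.inf)
            for p in types:
                if t in ALLOWED[p] and best[i - 1][p] + s > best[i][t]:
                    best[i][t] = best[i - 1][p] + s
                    back[i][t] = p
    if best[n - 1]["stop"] == -math.inf:
        return -math.inf, []
    # Trace back from the stop type at the last interval.
    path = ["stop"]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return best[n - 1]["stop"], path


scores = [
    {"start": 1.0},
    {"coding": 2.0},
    {"donor": 1.0, "coding": 0.2},
    {"intron": 1.5, "coding": 0.1},
    {"acceptor": 1.0, "coding": 0.3},
    {"coding": 2.0},
    {"stop": 1.0},
]
total, labels = best_parse(scores)
```

Storing a score per type matters because the best parse ending at a position as, say, an intron may not be the best overall parse ending there, yet it may be the only one that can be legally extended by an acceptor.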
JIGSAW GHMM gene model
Evidence types for JIGSAW experiments on human DNA
cDNA from human genes
UniGene transcripts
GenBank cDNAs matching SwissProt proteins with at least 98% identity
RefSeq genes from non-human species
TIGR Gene Index (human and other)
Ab initio gene finders
– Genscan, GeneID, GeneZilla, GlimmerHMM
– NOTE: JIGSAW allows you to use the same gene finder as multiple “lines” of evidence, e.g., GlimmerHMM with different parameter settings
Alignment-based gene finders
– Twinscan
– SGP
Predicted conserved elements from phylogenetic analysis (PhastCons)
Effects of different evidence sources
Figure 6. JIGSAW prediction performance using different combinations of evidence. Gene finders = ab initio gene finders only; non-human EST = gene finders + non-human expression evidence; human mRNA = gene finders + human mRNA; curated cDNA = gene finders + KnownGene; All = all evidence. KnownGene = cDNA evidence from curated proteins (from UCSC) without using JIGSAW.
Comparison of JIGSAW and other methods on human ENCODE regions
Sensitivity (Sn) = % of exons correctly predicted
Specificity (Sp) = % of exon predictions that are correct
F-score = (2 × Sn × Sp) / (Sn + Sp)
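The three exon-level measures can be computed as in the following Python sketch. It assumes that an exon counts as correctly predicted only when both of its boundaries match an annotated exon exactly. The function name `exon_accuracy` and the coordinates are illustrative, and values are returned as fractions rather than percentages.

```python
from typing import Set, Tuple

Exon = Tuple[int, int]  # (start, end) coordinates


def exon_accuracy(predicted: Set[Exon], annotated: Set[Exon]) -> Tuple[float, float, float]:
    """Return (Sn, Sp, F-score) for exact-match exon predictions."""
    correct = len(predicted & annotated)
    sn = correct / len(annotated) if annotated else 0.0  # exons found
    sp = correct / len(predicted) if predicted else 0.0  # predictions correct
    f = (2 * sn * sp) / (sn + sp) if (sn + sp) > 0 else 0.0
    return sn, sp, f


# Made-up example: 4 predicted exons, 5 annotated exons, 2 exact matches.
predicted = {(1, 10), (20, 30), (40, 50), (60, 70)}
annotated = {(1, 10), (20, 30), (45, 50), (80, 90), (100, 110)}
sn, sp, f = exon_accuracy(predicted, annotated)
```

Here Sn = 2/5, Sp = 2/4, and the F-score is their harmonic mean, 2 × 0.4 × 0.5 / 0.9, which is about 0.44.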
Gene Prediction Accuracy at the exon level: Sensitivity versus specificity. Top panel: dotplot for sensitivity versus specificity at the exon level for CDS evaluation. Each dot represents the overall value for each program on the 31 test sequences. Fig. 6 from Guigo et al., Genome Biology 2006, 7(Suppl 1):S2
Gene Prediction Accuracy at the exon level: Sensitivity versus specificity. Bottom panel: boxplots of the average sensitivity and specificity for each program. Each dot corresponds to the average in each of the test sequences for which GENCODE annotation existed. Fig. 6 from Guigo et al., Genome Biology 2006, 7(Suppl 1):S2.
EGASP results: Gene level accuracy
JIGSAW on other species