Geneid: training on S. lycopersicum


1 Geneid: training on S. lycopersicum
Francisco Câmara Ferreira
Genome Bioinformatics Research Lab, Center for Genomic Regulation
Tomato Annotation, Ghent, October 2006

2 Geneid
Geneid follows a hierarchical structure: signal -> exon -> gene.
Exon score: score of exon-defining signals + protein-coding potential (log-likelihood ratios).
Dynamic programming algorithm: maximize the score of assembled exons -> assembled gene.
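The assembly step can be illustrated with a small dynamic-programming sketch. This is not geneid's actual implementation: it assumes each candidate exon is just a hypothetical `(start, end, score)` tuple and only requires that chained exons do not overlap, ignoring reading-frame compatibility, strand, and intron-length constraints that a real gene finder enforces.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical exon representation: (start, end, score), end inclusive.
Exon = Tuple[int, int, float]


def best_exon_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the highest total score and the non-overlapping exon chain achieving it."""
    if not exons:
        return 0.0, []
    ordered = sorted(exons, key=lambda e: e[1])  # sort by end coordinate
    ends = [e[1] for e in ordered]
    n = len(ordered)
    best = [0.0] * (n + 1)   # best[i]: best score using only the first i exons
    take = [False] * (n + 1)  # take[i]: whether exon i is part of that optimum
    prev = [0] * (n + 1)      # prev[i]: number of earlier exons ending before exon i starts
    for i, (start, _end, score) in enumerate(ordered, 1):
        # Count exons among the first i-1 that end strictly before this one starts.
        p = bisect_right(ends, start - 1, 0, i - 1)
        prev[i] = p
        with_exon = best[p] + score
        if with_exon > best[i - 1]:
            best[i], take[i] = with_exon, True
        else:
            best[i] = best[i - 1]
    # Trace back from the end to recover the chosen exons.
    chain: List[Exon] = []
    i = n
    while i > 0:
        if take[i]:
            chain.append(ordered[i - 1])
            i = prev[i]
        else:
            i -= 1
    chain.reverse()
    return best[n], chain


if __name__ == "__main__":
    candidates = [(1, 100, 2.0), (50, 150, 3.5), (120, 200, 2.5), (210, 300, 1.0)]
    print(best_exon_chain(candidates))
```

Sorting by end coordinate lets each exon look up its best compatible predecessor with a binary search, so the whole chain is found in O(n log n) time.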

3 Training geneid
Position weight matrix (nucleotide frequencies per position) for the ten example donor sites:

     1    2    3    4    5    6    7    8    9
A  0.3  0.6  0.1  0.0  0.0  0.6  0.7  0.2  0.1
C  0.2  0.2  0.1  0.0  0.0  0.2  0.1  0.1  0.2
G  0.1  0.1  0.7  1.0  0.0  0.1  0.1  0.5  0.1
T  0.4  0.1  0.1  0.0  1.0  0.1  0.1  0.2  0.6

Example donor sites:
GAGGTAAAC
TCCGTAAGT
CAGGTTGGA
ACAGTCAGT
TAGGTCATT
TAGGTACTG
ATGGTAACT
CAGGTATAC
TGTGTGAGT
AAGGTAAGT

Example sequence:
ATGGCAGGGACCGTGACGGAAGCCTGGGATGTGGCAGTATTTGCTGCCCGACGGCGCAAT
GATGAAGACGACACCACAAGGGATAGCTTGTTCACTTATACCAACAGCAACAATACCCGG
GGCCCCTTTGAAGGTCCAAACTATCACATTGCGCCACGCTGGGTCTACAATATCACTTCT
GTCTGGATGATTTTTGTGGTCATCGCTTCAATCTTCACCAATGGTTTGGTATTGGTGGCC
ACTGCCAAATTCAAGAAGCTACGGCATCCTCTGAACTGGATTCTGGTAAACTTGGCGATA
GCTGATCTGGGTGAGACGGTTATTGCCAGTACCATCAGTGTCATCAACCAGATCTCTGGC

Before training, we extract the CDS, introns and splice sites from the annotations provided in the GFF file. Training geneid essentially means computing Markov models or PWMs for splice sites and start codons, and deriving a model for coding DNA. Given enough coding and non-coding information, as is the case for tomato, the coding model is a Markov model of order 5.
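As a minimal sketch of the PWM step, the per-position nucleotide frequencies can be computed directly from the ten example donor sites shown on this slide. The function names, the pseudocount, and the uniform 0.25 background used for the log-odds score are illustrative assumptions, not geneid's own training code; a real background model would reflect the AT-rich composition of tomato non-coding sequence.

```python
import math
from collections import Counter
from typing import Dict, List

# The ten example donor sites from the slide (positions 1-9, GT at positions 4-5).
DONOR_SITES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]


def position_frequencies(sites: List[str], pseudocount: float = 0.0) -> List[Dict[str, float]]:
    """Frequency of each base at each position, optionally smoothed by a pseudocount."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        matrix.append({base: (counts[base] + pseudocount) / total for base in "ACGT"})
    return matrix


def log_odds_score(candidate: str, matrix: List[Dict[str, float]], background: float = 0.25) -> float:
    """Sum of log2(frequency / background) over positions; assumes a uniform background."""
    return sum(math.log2(matrix[i][base] / background) for i, base in enumerate(candidate))


if __name__ == "__main__":
    freqs = position_frequencies(DONOR_SITES)
    for base in "ACGT":
        print(base, " ".join(f"{col[base]:.1f}" for col in freqs))
    # A pseudocount avoids log(0) for bases never seen at a position.
    smoothed = position_frequencies(DONOR_SITES, pseudocount=0.5)
    print(log_odds_score("AAGGTAAGT", smoothed))
```

Without a pseudocount, the invariant GT at positions 4 and 5 would give any candidate lacking it a score of minus infinity; smoothing turns that into a large but finite penalty.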

4 Optimization geneid
eWF (exon weight parameter): cutoff of scores of predicted exons
oWF (oligo weight parameter): ratio of information between signals and coding statistics

5 Training set for tomato
Used 399 of 428 non-redundant annotated genes (102 BACs)
Excluded: 14 with in-frame stops, 5 with one-nucleotide CDS, 10 redundant/overlapping
Used 1,760 donor sites, 1,783 acceptors and 391 start codons
29 non-standard donors, 6 non-standard acceptors and 8 non-standard start codons

Evaluation set for tomato
Used 362 of the 399 genes used in training
Excluded 37 genes containing non-canonical starts, donors or acceptors
Determined prediction accuracy on this set: sensitivity and specificity at nucleotide, exon and gene level
Employed a "leave-one-out" jackknife training/evaluation method to reduce bias in the accuracy results.
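The slide does not define the accuracy measures, so the following sketch assumes the convention common in gene-finder evaluation: sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP). At the nucleotide level TP counts coding bases shared by annotation and prediction; at the exon level an exon counts as correct only if both boundaries match exactly. The function names and coordinate representation are hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical exon representation: (start, end), 1-based, end inclusive.
Exon = Tuple[int, int]


def coding_positions(exons: List[Exon]) -> Set[int]:
    """Set of all nucleotide positions covered by the given exons."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def sn_sp(true_positive: int, annotated_total: int, predicted_total: int) -> Tuple[float, float]:
    """Sensitivity = TP / annotated total, specificity = TP / predicted total."""
    sn = true_positive / annotated_total if annotated_total else 0.0
    sp = true_positive / predicted_total if predicted_total else 0.0
    return sn, sp


def nucleotide_accuracy(annotated: List[Exon], predicted: List[Exon]) -> Tuple[float, float]:
    """Sensitivity and specificity over coding nucleotides."""
    a, p = coding_positions(annotated), coding_positions(predicted)
    return sn_sp(len(a & p), len(a), len(p))


def exon_accuracy(annotated: List[Exon], predicted: List[Exon]) -> Tuple[float, float]:
    """Sensitivity and specificity over exons with exactly matching boundaries."""
    a, p = set(annotated), set(predicted)
    return sn_sp(len(a & p), len(a), len(p))


if __name__ == "__main__":
    annotated = [(1, 100), (201, 300)]
    predicted = [(1, 100), (251, 350)]
    print(nucleotide_accuracy(annotated, predicted))
    print(exon_accuracy(annotated, predicted))
```

In a leave-one-out setting, these measures would be computed on each held-out gene using parameters trained on the remaining genes, then pooled.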

6 Statistics of training set
intron length: 515 nt ( nt)
exon length: 162 nt (2-1,888 nt)
CDS length: 970 nt (67-2,940 nt)
exons/gene: 5.5
avg. gene size: 4,951 nt
GC (coding): 43%
GC (intron): 33%
# of exons: 2,188
# of single genes: 64
# coding bases: 386,811
# non-coding nts: 922,178

7 Statistics of training set
GC distributions

8 Accuracy of new parameter file
Prediction/evaluation on full BACs
Columns: SN SP CC SNe SPe SNSP SNg SPg SNSPg (each row lists only the values that survive in this transcript)
Geneid tomato: 0.94 0.18 0.37 0.77 0.22 0.55 0.36 0.09
Geneid solanaceae: 0.84 0.16 0.32 0.61 0.17 0.39 0.04 0.11
Geneid arabidopsis: 0.86 0.33 0.14 0.38 0.12 0.03 0.07
Genscan arabidopsis: 0.15 0.58 0.10 0.06

After developing the parameter file, which I won't describe here in detail, we evaluated our predictions either on the full set of annotated BACs or on flanked annotated genes from those BACs.
BACs are not fully annotated
Extract genes with 400 nt of flanking sequence from the annotations

9 Accuracy of new parameter file
Prediction/evaluation on genes
Columns: SN SP CC SNe SPe SNSP SNg SPg SNSPg (each row lists only the values that survive in this transcript)
Geneid tomato: 0.96 0.91 0.81 0.80 0.39 0.37 0.38
Geneid tomato (jackknife): 0.94 0.90 0.89 0.77 0.33 0.30 0.31
Geneid solanaceae: 0.85 0.83 0.62 0.71 0.66 0.18 0.17
Geneid arab: 0.86 0.58 0.61 0.59 0.07 0.06
Genscan arab: 0.87 0.63 0.64 0.11 0.10
Genscan maize: 0.79 0.04 0.21 0.13 0.01
Geneid human: 0.65 0.69 0.29 0.60 0.45

10 Geneid Predictions using new parameter file
102 BACs with annotations
443 fully sequenced BACs (bacs.v72.seq)
Masked versions of the BACs above, using the TIGR tomato/Arabidopsis known repeat sequences library (TIGR_SolAth_repeat.fa, obtained from SGN)

11 Accuracy of new parameter file
Gff2ps plot of one prediction

12 Geneid predictions on chr 9 BACs using tomato parameter file

13 Nucleotide ratios around splice sites/start codon in tomato:
Donor: Acceptor: Start:

14

15 Chromosome 9 has a total of 142 markers
Most markers are in heterochromatin
Most of them did not match any BAC
Gap of 46 cM
But...

16 Construction of a training set for the gene prediction program geneid
Waiting for a more complete set (Shibata’s). A parameter file constructed from 100 sequences from different Solanaceae species (50% tomato).

17 Geneid predictions on 6 chr 9 BACs using sol parameter file
Tomato Sequencing, Madison July 2006

18 Geneid vs GeneSeqer
Geneid predicted 22 genes in the 114,526 bp BAC C09HBa0109D11.1
Most of the predictions are supported by ESTs, as shown by the GeneSeqer results
GeneSeqer is another gene identification tool, based on the "spliced alignment" of ESTs to the genomic sequence contained in the BAC

19 Geneid with a parameter file obtained from solanaceae applied to 6 BACs from Chr9

20 Distribution of GC content in introns and exons in the Solanaceae sequences used to train geneid
To be improved when a large set of FL from tomato is available (Shibata’s)

21 European Commission EU-SOL
Vicky Fernandez Sheila Zuniga Angela Perez Francisco Camara Roderic Guigó Miguel A Botella Antonio Granell

