Geneid: training on S. lycopersicum


Geneid: training on S. lycopersicum Francisco Câmara Ferreira Genome Bioinformatics Research Lab Center for Genomic Regulation Tomato Annotation, Ghent October 2006

Geneid
Geneid follows a hierarchical structure: signal -> exon -> gene
Exon score: score of exon-defining signals + protein-coding potential (log-likelihood ratios)
Dynamic programming algorithm: maximize score of assembled exons -> assembled gene
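The signal -> exon -> gene hierarchy can be illustrated with a minimal Python sketch. It is not geneid's actual implementation: the `Exon` class, the `assemble_gene` function and the candidate scores are hypothetical, and the sketch ignores reading-frame compatibility, exon types and strand, all of which a real gene finder has to respect. It only shows an exon score built as a sum of log-likelihood ratios and a dynamic-programming search for the best-scoring chain of non-overlapping exons.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    start: int            # 1-based, inclusive
    end: int              # 1-based, inclusive
    signal_score: float   # summed log-likelihood ratios of the defining signals
    coding_score: float   # log-likelihood ratio of the coding potential

    @property
    def score(self) -> float:
        # Exon score = signal scores + coding potential
        return self.signal_score + self.coding_score


def assemble_gene(exons):
    """Return (best_score, chain) for the highest-scoring chain of
    non-overlapping exons, found by dynamic programming."""
    ordered = sorted(exons, key=lambda e: e.end)
    best = []  # best[i] = (score, chain) of the best chain ending at ordered[i]
    for i, exon in enumerate(ordered):
        prev_score, prev_chain = 0.0, []
        for j in range(i):
            # Only exons that end before this one starts can precede it
            if ordered[j].end < exon.start and best[j][0] > prev_score:
                prev_score, prev_chain = best[j]
        best.append((prev_score + exon.score, prev_chain + [exon]))
    return max(best, key=lambda b: b[0], default=(0.0, []))


# Hypothetical candidate exons, for illustration only
candidates = [
    Exon(10, 100, 2.0, 3.0),    # score 5.0
    Exon(90, 200, 1.0, 1.5),    # score 2.5, overlaps the first exon
    Exon(150, 300, 2.5, 2.0),   # score 4.5
    Exon(350, 420, -1.0, 0.5),  # score -0.5, lowers any chain it joins
]
total, chain = assemble_gene(candidates)
print(total, [(e.start, e.end) for e in chain])  # 9.5 [(10, 100), (150, 300)]
```

The quadratic inner loop keeps the sketch short; it is enough to show why a low-scoring exon is left out of the assembled gene.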

Training geneid

Position frequency matrix:
     1    2    3    4    5    6    7    8    9
A  0.3  0.6  0.1  0.0  0.0  0.6  0.7  0.2  0.1
C  0.2  0.2  0.1  0.0  0.0  0.2  0.1  0.1  0.2
G  0.1  0.1  0.7  1.0  0.0  0.1  0.1  0.5  0.1
T  0.4  0.1  0.1  0.0  1.0  0.1  0.1  0.2  0.6

Example donor sites:
GAGGTAAAC TCCGTAAGT CAGGTTGGA ACAGTCAGT TAGGTCATT
TAGGTACTG ATGGTAACT CAGGTATAC TGTGTGAGT AAGGTAAGT

Example sequence:
ATGGCAGGGACCGTGACGGAAGCCTGGGATGTGGCAGTATTTGCTGCCCGACGGCGCAAT
GATGAAGACGACACCACAAGGGATAGCTTGTTCACTTATACCAACAGCAACAATACCCGG
GGCCCCTTTGAAGGTCCAAACTATCACATTGCGCCACGCTGGGTCTACAATATCACTTCT
GTCTGGATGATTTTTGTGGTCATCGCTTCAATCTTCACCAATGGTTTGGTATTGGTGGCC
ACTGCCAAATTCAAGAAGCTACGGCATCCTCTGAACTGGATTCTGGTAAACTTGGCGATA
GCTGATCTGGGTGAGACGGTTATTGCCAGTACCATCAGTGTCATCAACCAGATCTCTGGC

Before the training process we have to extract the CDSs, introns and splice sites from the annotations provided in the GFF file. Training geneid basically means computing Markov models or PWMs for splice sites and start codons, and deriving a model for coding DNA which, given enough coding and non-coding information, as is the case with tomato, is a Markov model of order 5.
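Computing a PWM from aligned sites can be sketched as follows, using the ten example donor sites from the slide. The function names, the uniform background of 0.25 and the pseudocount of 0.5 are assumptions made for illustration; a real training run would estimate the background from non-site sequence and would use the full set of training sites.

```python
import math

# The ten example donor sites shown on the slide (GT at positions 4-5)
DONOR_SITES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]


def position_frequencies(sites, pseudocount=0.0):
    """Per-position base frequencies of equal-length aligned sites."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    return {
        base: [
            (sum(site[i] == base for site in sites) + pseudocount) / total
            for i in range(length)
        ]
        for base in "ACGT"
    }


def log_odds_score(candidate, freqs, background=0.25):
    """Sum of per-position log-likelihood ratios (site model vs. background).
    The uniform background is an assumption of this sketch."""
    return sum(
        math.log(freqs[base][i] / background)
        for i, base in enumerate(candidate)
    )


# Raw frequencies: one row per base, one column per site position
freqs = position_frequencies(DONOR_SITES)
for base in "ACGT":
    print(base, " ".join(f"{f:.1f}" for f in freqs[base]))

# A pseudocount avoids log(0) for bases never seen at a position
smoothed = position_frequencies(DONOR_SITES, pseudocount=0.5)
consensus_like = log_odds_score("AAGGTAAGT", smoothed)
no_gt = log_odds_score("AAGCCAAGT", smoothed)
print(consensus_like > no_gt)  # True
```

A candidate site that lacks the invariant GT scores far lower than one matching the observed sites, which is what lets the score separate real donors from background.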

Optimization of geneid
eWF (exon weight parameter): cutoff on the scores of predicted exons
oWF (oligo weight parameter): ratio of information between signals and coding statistics

Training set for tomato
Used 399 of 428 non-redundant annotated genes (102 BACs)
Excluded 29 genes: 14 with in-frame stops, 5 with one-nucleotide CDS, 10 redundant/overlapping
Used 1,760 donor sites, 1,783 acceptors and 391 start codons
Excluded 29 non-standard donors, 6 acceptors and 8 start codons

Evaluation set for tomato
Used 362 of the 399 genes used in training
Excluded 37 genes containing non-canonical starts, donors or acceptors
Determined prediction accuracy on this set: sensitivity and specificity at nucleotide, exon and gene level
Employed a "leave-one-out" jackknife training/evaluation method to reduce bias in the accuracy results
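The leave-one-out idea can be sketched generically: each gene is held out in turn, the model is trained on the remaining genes, and the held-out gene is used only for evaluation. The `leave_one_out` helper and the toy `train`/`evaluate` functions below are hypothetical stand-ins, not geneid's training procedure; the toy "model" is simply a mean length.

```python
def leave_one_out(genes, train, evaluate):
    """Train on all genes but one, evaluate on the held-out gene,
    and return one result per gene."""
    results = []
    for i, held_out in enumerate(genes):
        training_set = genes[:i] + genes[i + 1:]
        model = train(training_set)
        results.append(evaluate(model, held_out))
    return results


# Toy stand-ins (hypothetical): the "model" is the mean CDS length and
# the evaluation is the absolute error on the held-out gene.
def train(lengths):
    return sum(lengths) / len(lengths)


def evaluate(model, length):
    return abs(model - length)


cds_lengths = [900, 1000, 1100]
errors = leave_one_out(cds_lengths, train, evaluate)
print(errors)  # [150.0, 0.0, 150.0]
```

Because no gene is ever scored by a model that saw it during training, the resulting accuracy is less optimistic than evaluating on the training set itself.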

Statistics of training set
intron length: 515 nt (29-6,972 nt)
exon length: 162 nt (2-1,888 nt)
CDS length: 970 nt (67-2,940 nt)
exons/gene: 5.5
avg. gene size: 4,951 nt
GC (coding): 43%
GC (intron): 33%
# of exons: 2,188
# of single genes: 64
# coding bases: 386,811
# non-coding nts: 922,178

Statistics of training set: GC distributions

Accuracy of new parameter file
Prediction/evaluation on full BACs
Program/param: SN SP CC SNe SPe SNSP SNg SPg SNSPg
Geneid tomato: 0.94 0.18 0.37 0.77 0.22 0.55 0.36 0.09
Geneid solanaceae: 0.84 0.16 0.32 0.61 0.17 0.39 0.04 0.11
Geneid arabidopsis: 0.86 0.33 0.14 0.38 0.12 0.03 0.07
Genscan arabidopsis: 0.15 0.58 0.10 0.06

After developing the parameter file, which I won't describe here in great detail, we evaluated our predictions either on the full set of annotated BACs or on flanked annotated genes from those BACs. The BACs are not fully annotated, so genes flanked by 400 nt were extracted from the annotations.
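Extracting a gene together with 400 nt of flanking sequence might look like the sketch below. The function name and the coordinate convention (1-based, inclusive, as in GFF) are assumptions; the slides do not say how the extraction was actually done.

```python
def extract_flanked(sequence, start, end, flank=400):
    """Cut out a gene plus `flank` nt on each side.

    `start` and `end` are 1-based inclusive coordinates (GFF style).
    Returns the extracted piece and the gene's coordinates within it,
    clipping the flanks at the ends of the sequence.
    """
    left = max(1, start - flank)
    right = min(len(sequence), end + flank)
    piece = sequence[left - 1:right]
    return piece, start - left + 1, end - left + 1


# Hypothetical BAC: a 300-nt "gene" of C's inside A/G filler
bac = "A" * 1000 + "C" * 300 + "G" * 1000
piece, new_start, new_end = extract_flanked(bac, 1001, 1300)
print(len(piece), new_start, new_end)  # 1100 401 700
```

Evaluating on such pieces avoids counting predictions in unannotated parts of a BAC as false positives.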

Accuracy of new parameter file
Prediction/evaluation on genes
Program/param: SN SP CC SNe SPe SNSP SNg SPg SNSPg
Geneid tomato: 0.96 0.91 0.81 0.80 0.39 0.37 0.38
Geneid tomato (jackknife): 0.94 0.90 0.89 0.77 0.33 0.30 0.31
Geneid solanaceae: 0.85 0.83 0.62 0.71 0.66 0.18 0.17
Geneid arab: 0.86 0.58 0.61 0.59 0.07 0.06
Genscan arab: 0.87 0.63 0.64 0.11 0.10
Genscan maize: 0.79 0.04 0.21 0.13 0.01
Geneid human: 0.65 0.69 0.29 0.60 0.45
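The slides do not define the column abbreviations. Assuming the convention common in gene-finding evaluations, SN is sensitivity, TP/(TP+FN); SP is "specificity" in the sense of the fraction of predictions that are correct, TP/(TP+FP); CC is the correlation coefficient at nucleotide level; and SNSP is the mean of SN and SP. Under those assumptions, the nucleotide-level measures can be sketched as follows (function names are hypothetical):

```python
import math


def nucleotide_accuracy(annotated, predicted, length):
    """Return (SN, SP, CC) at nucleotide level.

    `annotated` and `predicted` are sets of coding positions;
    `length` is the length of the evaluated sequence.
    """
    tp = len(annotated & predicted)
    fp = len(predicted - annotated)
    fn = len(annotated - predicted)
    tn = length - tp - fp - fn
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else 0.0
    return sn, sp, cc


def snsp(sn, sp):
    """Average of sensitivity and specificity."""
    return (sn + sp) / 2


# Hypothetical example: a 100-nt annotated region and a prediction
# shifted by 20 nt, on a 1,000-nt sequence
annotated = set(range(100, 200))
predicted = set(range(120, 220))
sn, sp, cc = nucleotide_accuracy(annotated, predicted, 1000)
print(round(sn, 2), round(sp, 2), round(cc, 2))  # 0.8 0.8 0.78
```

The exon-level and gene-level columns would use the same ratios, counting an exon or gene as a true positive only when it is predicted exactly.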

Geneid predictions using new parameter file
102 BACs with annotations
443 fully sequenced BACs (bacs.v72.seq)
Masked versions of the BACs above, masked with the TIGR tomato/arabidopsis known repeat sequences library (TIGR_SolAth_repeat.fa, obtained from SGN)

Accuracy of new parameter file: gff2ps plot of one prediction

Geneid predictions on chr 9 BACs using tomato parameter file

Nucleotide ratios around splice sites/start codon in tomato: Donor: Acceptor: Start:

Chromosome 9 has a total of 142 markers. But...
Most markers are in heterochromatin
Most of them did not match any BAC
Gap of 46 cM

Construction of a training set for the gene prediction program geneid
Waiting for a more complete set (Shibata’s).
A parameter file was constructed from 100 sequences from different Solanaceae species (50% tomato).

Geneid predictions on 6 chr 9 BACs using sol parameter file Tomato Sequencing, Madison July 2006

Geneid vs GeneSeqer
Geneid predicted 22 genes in the 114,526 bp BAC C09HBa0109D11.1
Most of the predictions are supported by EST results, as shown by GeneSeqer
GeneSeqer is another gene identification tool, based on the “spliced alignment” of ESTs to the genomic sequence contained in the BAC

Geneid with a parameter file obtained from Solanaceae applied to 6 BACs from chr 9

Distribution of GC content in introns and exons of the Solanaceae sequences used to train geneid
To be improved when a large set of full-length (FL) sequences from tomato is available (Shibata's)

European Commission EU-SOL Vicky Fernandez Sheila Zuniga Angela Perez Francisco Camara Roderic Guigó Miguel A Botella Antonio Granell