Tomato genome annotation pipeline in Cyrille2

Tomato genome annotation pipeline in Cyrille2
Erwin Datema

Contents of the annotation pipeline
- Annotation on the BAC level
  - Gene prediction
  - Repeat identification
  - Other features
- Annotation on the gene level (work in progress)
  - blastx vs NCBI’s nr (sequence similarity)
  - InterProScan (domain identification)

Ab initio gene structure prediction
Ab initio predictors included in the pipeline:
- Genscan
- GlimmerHMM (trained on tomato!)
- GeneId (has been trained on Solanaceae)
- SNAP
- Augustus (predicts alternatively spliced variants)

Alignment-based gene structure prediction (1)
- Transcript alignment (blastn + Sim4)
  - SGN tomato UniGenes (34,829 UniGenes)
  - SGN potato UniGenes (31,072 UniGenes)
  - SGN coffee UniGenes (13,171 UniGenes)
  - SGN pepper UniGenes (9,554 UniGenes)
  - SGN petunia UniGenes (5,135 UniGenes)
  - SGN S. melongena UniGenes (1,841 UniGenes)
  - NCBI full-length tomato cDNAs (678 cDNAs)
- Protein alignment (tblastn + GeneWise)
  - TAIR6 Arabidopsis thaliana proteome (30,690 proteins)
  - TIGR4 Oryza sativa proteome (62,827 proteins)
  - UniProt Plant division (17,831 proteins)

Additional feature prediction
- Repeat identification
  - Tandem Repeats Finder
  - RepeatMasker
    - RepBase + ‘default’ features (low complexity, etc.)
    - TIGR Solanum lycopersicon repeat library V2
    - SGN Solanum lycopersicon UniRepeats
- Feature prediction
  - tRNAscan-SE
  - MarScan
  - GeneSplicer
  - Marker identification (blastn + Sim4)

Preliminary results
- Annotation of chromosome 6 BACs
  - Phase 1, 2 and 3
  - 632 contigs
- Older version of the pipeline
  - GlimmerHMM only trained on Arabidopsis
  - 2 UniGene sets (tomato, potato)
  - 2 protein sets (Arabidopsis, UniProt plant)
  - Protein alignment parameters too strict

The genomic landscape of chromosome 6
- 632 contigs have been annotated
- Contig length ranges from 348 to 148,256 nt
- Average length of 9,061 nt, median length of 5,105 nt
- Total length of 5,726,791 nt
- GC content: 29.9% min, 34.1% avg, 42.2% max (sequences longer than 10,000 nt)
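The summary figures above (contig count, length range, mean, median, total, and GC content for longer sequences) can be reproduced with a few lines of Python. This is a minimal sketch, not the pipeline's own code: the function names, the dictionary input format, and the choice to average GC content per contig rather than over pooled bases are assumptions made for illustration.

```python
from statistics import mean, median


def gc_content(seq: str) -> float:
    """Return the GC percentage of a sequence, counting only A, C, G and T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def summarize_contigs(contigs: dict, gc_min_length: int = 10_000) -> dict:
    """Summarize contig lengths and GC content.

    `contigs` maps a contig name to its nucleotide sequence (hypothetical
    input format). GC statistics use only contigs longer than
    `gc_min_length`, mirroring the 10,000 nt cutoff quoted in the slides.
    """
    lengths = [len(seq) for seq in contigs.values()]
    gc_values = [gc_content(seq) for seq in contigs.values()
                 if len(seq) > gc_min_length]
    summary = {
        "count": len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
        "median_length": median(lengths),
        "total_length": sum(lengths),
    }
    if gc_values:
        # Per-contig GC values: an assumption, the slides do not say how
        # the average was computed.
        summary.update(gc_min=min(gc_values),
                       gc_mean=mean(gc_values),
                       gc_max=max(gc_values))
    return summary
```

Restricting GC statistics to longer contigs avoids extreme values from very short sequences, which is presumably why the slides apply the length cutoff.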

Ab initio gene prediction
- Note: Augustus predictions include up to 3 splice variants per gene
- Estimated gene density is 1 gene per 5 kb
- ~1,200 genes in currently sequenced BACs
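The gene count can be sanity-checked against the total contig length reported for chromosome 6. The sketch below assumes the two figures are related by simple division, which the slides do not state explicitly; the constant and function names are illustrative.

```python
TOTAL_LENGTH_NT = 5_726_791  # total annotated contig length from the slides
NT_PER_GENE = 5_000          # estimated density: 1 gene per 5 kb


def estimate_gene_count(total_nt: int, nt_per_gene: int) -> float:
    """Expected number of genes for a given sequence length and density."""
    return total_nt / nt_per_gene


estimated = estimate_gene_count(TOTAL_LENGTH_NT, NT_PER_GENE)
print(f"Estimated genes: {estimated:.0f}")
```

The result is a little under 1,150, the same order of magnitude as the rounded figure of about 1,200 genes.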

Transcript alignment-based gene prediction
- Tomato
  - 34,829 UniGenes (derived from 239,593 ESTs)
  - 574 hits to the contigs
- Potato
  - 31,072 UniGenes (derived from 133,657 ESTs)
  - 631 hits to the contigs

Protein alignment-based gene prediction
- UniProt Plant proteins
  - 17,378 protein sequences from the plant kingdom
  - 195 hits to the contigs
- Arabidopsis thaliana TAIR6 annotation
  - 30,690 protein sequences
  - 228 hits to the contigs

Repeat density
- TIGR Tomato Repeat Library (95 repeats)
  - 118 regions spanning 53,024 nt
  - Minimum 48 nt, average 449 nt, maximum 7,675 nt
- SGN Tomato UniRepeats (668 repeats)
  - 2,860 regions spanning 1,220,101 nt
  - Minimum 10 nt, average 427 nt, maximum 8,896 nt
- Tandem repeats
  - 1,313 regions spanning 157,921 nt
  - Minimum 24 nt, average 120 nt, maximum 2,526 nt
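The repeat figures can be expressed as a fraction of the total annotated sequence, and the quoted averages can be cross-checked from the region counts. This is a small sketch using only numbers from the slides; the dictionary layout and function names are illustrative. The three categories may overlap on the genome, so the percentages should not be added together.

```python
TOTAL_LENGTH_NT = 5_726_791  # total annotated contig length from the slides

# (number of regions, nucleotides spanned) per repeat source, from the slides
REPEAT_SETS = {
    "TIGR Tomato Repeat Library": (118, 53_024),
    "SGN Tomato UniRepeats": (2_860, 1_220_101),
    "Tandem repeats": (1_313, 157_921),
}


def coverage_percent(spanned_nt: int, total_nt: int = TOTAL_LENGTH_NT) -> float:
    """Percentage of the total sequence spanned by a repeat set."""
    return 100.0 * spanned_nt / total_nt


def mean_region_length(regions: int, spanned_nt: int) -> float:
    """Average region length implied by the region count and span."""
    return spanned_nt / regions


for name, (regions, spanned) in REPEAT_SETS.items():
    print(f"{name}: {coverage_percent(spanned):.2f}% of sequence, "
          f"mean region {mean_region_length(regions, spanned):.0f} nt")
```

The mean region lengths computed this way round to the averages quoted in the slides (449, 427 and 120 nt).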

Additional features
- 74 markers could be aligned (alignment quality unverified)
- 39 predicted tRNA genes
- 1,301 predicted MAR/SAR elements

Generic Genome Browser (1)

Generic Genome Browser (2)

Generic Genome Browser (3)

Recent work
- GeneModelCollector
  - Tries to find ‘full’ open reading frames in aligned UniGenes
  - Automatic generation of gene predictor training set
  - Parameters?
- JIGSAW
  - Appears not to provide a prediction for every region which contains annotations
  - Training?
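The slides do not describe how GeneModelCollector decides that an open reading frame is ‘full’. As a generic illustration of the idea only, the sketch below scans the three forward frames of a transcript sequence for the longest stretch that starts with ATG and ends at an in-frame stop codon. The function name and the forward-strand-only restriction are assumptions, and this is not the tool's actual algorithm.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_full_orf(seq: str):
    """Return (start, end) of the longest ATG-to-stop ORF, or None.

    Coordinates are 0-based and half-open, and include the stop codon.
    Only the three forward frames are scanned (simplifying assumption).
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best
```

An ORF with no in-frame stop codon is not reported, which matches the notion of a complete model: a truncated UniGene would not contribute to a training set built this way.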

Future Work – Tomato Annotation Pipeline
- Gene prediction
  - Combining predictions into a single consensus model
  - Train individual predictors with recently curated tomato gene set
- Automated functional annotation of genes
  - “Giving a biological meaning to the nicely colored bars”
  - blastx
  - InterProScan
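One simple way to picture "combining predictions into a single consensus model" is per-base voting: keep the regions that at least a given number of predictors call coding. The sketch below is a deliberately naive illustration under that assumption. It is not how JIGSAW works, it ignores strand, reading frame and splice sites, and all names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or abutting half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def consensus_intervals(predictions, min_votes):
    """Return regions supported by at least `min_votes` predictors.

    `predictions` holds one list of half-open (start, end) exon intervals
    per predictor. Intervals from the same predictor are merged first, so
    overlapping splice variants from one tool count as a single vote.
    """
    events = []
    for intervals in predictions:
        for start, end in merge_intervals(intervals):
            events.append((start, 1))   # predictor starts supporting here
            events.append((end, -1))    # predictor stops supporting here
    events.sort()  # at equal positions, ends (-1) sort before starts (+1)

    consensus, depth, open_at = [], 0, None
    for pos, delta in events:
        depth += delta
        if depth >= min_votes and open_at is None:
            open_at = pos
        elif depth < min_votes and open_at is not None:
            if pos > open_at:
                consensus.append((open_at, pos))
            open_at = None
    return merge_intervals(consensus)
```

For example, with three predictors and `min_votes=2`, only bases called by at least two of them survive; raising `min_votes` trades sensitivity for specificity.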

Future Work – Tomato Genome Browser
- Annotation of features
  - Meaningful names for features such as genes, marker alignments, blast hits
  - More detailed and more readable data when clicking on a feature
- Links to external data sources
  - NCBI GenBank
  - SGN

Acknowledgements
- Cyrille2 development
  - Mark Fiers
  - Ate van der Burgt
  - Joost de Groot
- Tomato BAC sequencing (chromosome 6)
  - Greenomics
- Supervision
  - Willem Stiekema
  - Roeland van Ham