Annotation of Anopheline Genomes at VectorBase Dan Lawson, VectorBase & The Anopheles Genomes Cluster Consortium EMBL-EBI.


Anopheline species in this study: current status
- Genome sequencing: 9 of 16 species assembled and annotated
- RNAseq: 10 of 12 species sequenced
- Isolate re-sequencing: 12 of 12 species sequenced

Genome annotation
First-pass genome annotation is almost always based on “automatic” computational approaches:
- ab initio
- Similarity based
  - Transcript (ESTs, RNAseq)
  - Protein (nr protein database)

Genome annotation - building a pipeline
- Genome assembly
- Map repeats
- Genefinding
  - Protein-coding genes: map transcripts, map peptides
  - nc-RNAs
- Functional annotation
- Submission to archival databases (release)

Automatic annotation strategies: similarity, ab initio

Genome annotation: resources
- ab initio predictions using SNAP and Augustus
- Mixed whole animal RNAseq datasets generated using Illumina sequencing, assembled using Trinity (Broad Institute)
- Many dipteran proteomes (including 4 mosquitoes & D. melanogaster)
- All arthropod/metazoan proteomes

MAKER annotation with RNAseq and reference proteomes
Aim: gene prediction aggregation for the masses.
- Used for a number of arthropod genome projects
- Touted as the default pipeline for many more (part of the GMOD toolkit)
Overview
- ab initio gene predictions from SNAP, Augustus & FGENESH
- Final gene models from MAKER
- Similarity alignments from both EXONERATE and BLAST
- Repeats from RepeatFinder & RepeatMasker
- Additional data sets integrated via GFF3 files (RNA-Seq)
- Uses MPI for parallelization over a compute farm
Summary
- Iterative runs give acceptable reference gene sets.
- Used for Heliconius, Glossina, sandflies and the first tranche of the 16 Anophelines

Current VectorBase annotation pipeline
Milestones: Start, 1.0 set (automatic), 1.1 set (published)
MAKER based automatic annotation includes:
- SNAP training and ab initio
- RNAseq based transcript similarity prediction
- Taxonomically constrained peptide similarity prediction
- 2 rounds of prediction refinement; the final round includes all peptide similarity
Community annotation phase
- Capture gene structure changes
- Metadata associated with locus (symbol, description, citation)
Submission to INSDC, propagation to UniProt
Presentation through VectorBase
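The iterative MAKER rounds on this slide could be laid out roughly as below. This is only a sketch under assumptions: the option keys (genome, est, protein, est2genome, protein2genome, snaphmm) are entries in MAKER's maker_opts.ctl control file, but the file names, the three-round split and which evidence goes into which round are hypothetical and are not taken from the VectorBase configuration.

```python
# Sketch only: file names and the three-round layout are assumptions, not the
# VectorBase configuration. Option keys follow MAKER's maker_opts.ctl file.
BASE = {"genome": "assembly.fa", "est": "trinity_transcripts.fa"}

ROUNDS = [
    # Round 1: build models directly from evidence so SNAP can be trained.
    {**BASE, "protein": "constrained_proteomes.fa",
     "est2genome": 1, "protein2genome": 1},
    # Round 2: refine predictions with the SNAP model trained on round 1.
    {**BASE, "protein": "constrained_proteomes.fa",
     "snaphmm": "round1.hmm", "est2genome": 0, "protein2genome": 0},
    # Final round: retrained SNAP model plus all peptide similarity evidence.
    {**BASE, "protein": "all_proteomes.fa",
     "snaphmm": "round2.hmm", "est2genome": 0, "protein2genome": 0},
]


def render_ctl(options):
    """Render one round's options as key=value lines for a control file."""
    return "".join(f"{key}={value}\n" for key, value in options.items())


if __name__ == "__main__":
    for number, options in enumerate(ROUNDS, start=1):
        print(f"# round {number}")
        print(render_ctl(options))
```

Keeping each round as a plain dictionary makes it easy to see what changes between rounds: here only the protein set and the SNAP model differ.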

Projection from a reference annotation

Gene prediction based on projection from reference annotation
Local alignment of An. gambiae CDS to the assemblies provides a platform for improving gene predictions.
Example loci: Rps7 (AGAP008916)
Potential for transcript based assembly improvement via seqedits of the genome sequence
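To illustrate what local alignment of a CDS against an assembly means, here is a toy Smith-Waterman scorer. It is a teaching sketch, not the aligner used in the project; the scoring values and the example sequences are made up, and a real projection would also need the alignment coordinates and splice-aware handling of introns.

```python
def local_alignment_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    previous = [0] * (len(target) + 1)
    best = 0
    for q in query:
        current = [0]
        for j, t in enumerate(target, start=1):
            diagonal = previous[j - 1] + (match if q == t else mismatch)
            # Scores never drop below zero, which is what makes the alignment local.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


if __name__ == "__main__":
    cds = "ATGGCC"             # hypothetical reference CDS fragment
    scaffold = "TTATGGCCAA"    # hypothetical scaffold from a new assembly
    print(local_alignment_score(cds, scaffold))  # prints 12
```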

Annotation: Preliminary genesets 10, ,162 predictions no ncRNAs yet predicted

Preliminary comparative analysis
OrthoMCL runs including 17 species
An. gambiae PEST: 12,810 protein-coding genes
Species labelled on the slide: An. darlingi, Glossina morsitans, Lutzomyia longipalpis, Phlebotomus papatasi

Preliminary comparative analysis
OrthoMCL runs including 17 species
No. of clusters containing:
- all 13 mosquitoes: 4961 (≃ 39%)
- all 11 Anophelines: 5463 (≃ 43%)
- 10 Anophelines (minus darlingi): 6606 (≃ 52%)
- 9 Anophelines (minus darlingi & christyi): 7477 (≃ 58%)
- representatives of the gambiae complex (ar/ga/qu): 9089 (≃ 71%)
- 8 Anophelines (minus darlingi & christyi) but not gambiae: 600
Species labelled on the slide: An. darlingi, Glossina morsitans, Lutzomyia longipalpis, Phlebotomus papatasi
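Counts like these can be derived from a cluster file by asking which clusters contain every taxon in a chosen set. The sketch below assumes one cluster per line in the form "group: taxon|gene taxon|gene ..."; the taxon codes and example lines are invented. The slide does not state the denominator for its percentages: using the An. gambiae PEST count of 12,810 protein-coding genes is an assumption, chosen because it reproduces all five rounded figures.

```python
def taxa_in_group(line):
    """Return the set of taxon codes in one 'group: taxon|gene ...' line."""
    _, members = line.split(":", 1)
    return {member.split("|", 1)[0] for member in members.split()}


def count_groups_with_all(lines, required_taxa):
    """Count clusters that contain at least one gene from every required taxon."""
    required = set(required_taxa)
    return sum(1 for line in lines if line.strip() and required <= taxa_in_group(line))


def percent_of_reference(count, reference_total=12810):
    """Express a cluster count as a rounded percentage of the reference gene count."""
    return round(100 * count / reference_total)


if __name__ == "__main__":
    # Invented clusters with hypothetical taxon codes.
    example = [
        "grp1: aga|g1 aar|g2 aqu|g3",
        "grp2: aga|g4 aar|g5",
        "grp3: aar|g6 aqu|g7 aga|g8 aga|g9",
    ]
    print(count_groups_with_all(example, {"aga", "aar", "aqu"}))  # prints 2
    for count in (4961, 5463, 6606, 7477, 9089):
        print(count, percent_of_reference(count))
```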

All genomes deserve a home
- Genome browser
- Similarity searches: BLAST/BLAT
- Query tools: simple keyword, complex queries
- Downloads
[Figure labels: Similarity searches, Query tool, Downloads, Browser, Compara]

VectorBase
The long term home for these genomes is VectorBase, the NIAID-funded Bioinformatics Resource Center focused on arthropod vectors of human pathogens.
- Ensembl genome browser
- Similarity searches
- File downloads

Anopheles Genomes Cluster wiki site

Thematic analysis groups & community annotation Community led annotation of the genomes using the Community Annotation Portal (CAP)

Community annotation decision tree

Community annotation workflow
Steps: identify gene, modify model (ARTEMIS, APOLLO), submit to CAP as GFF3 and FASTA
[Figure: example GFF3 protein match features (scf ptn2genome ptn_match ID=xxxx;Name=tr|Q3UIQ2| ...) and a FASTA record (>MY SUPERCONTIG ATATATGCGTTGAGCTGCG…)]

CAP reporting
- Reports back to the submitter to show status
- If successful, the model is stored in a local database and then presented to the genome browser via DAS
- Failed submissions have (some) information as to why; submitters then need to correct these errors and resubmit
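The idea of telling a submitter why a submission failed can be sketched as a per-line checker for GFF3 features. This is not CAP's validation logic: the checks below are a hypothetical subset, the rule requiring an ID or Parent attribute is an illustrative choice, and the coordinates in the example are invented because the slide's example lines do not show them.

```python
def check_gff3_line(line):
    """Return a list of problems found in one GFF3 feature line (empty if none)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return [f"expected 9 tab-separated columns, found {len(columns)}"]
    _seqid, _source, ftype, start, end, _score, strand, phase, attributes = columns
    problems = []
    if not (start.isdigit() and end.isdigit()):
        problems.append("start and end must be positive integers")
    elif int(start) < 1 or int(start) > int(end):
        problems.append("start must be >= 1 and <= end")
    if strand not in {"+", "-", ".", "?"}:
        problems.append(f"invalid strand {strand!r}")
    if ftype == "CDS" and phase not in {"0", "1", "2"}:
        problems.append("CDS features need a phase of 0, 1 or 2")
    attrs = dict(pair.split("=", 1) for pair in attributes.split(";") if "=" in pair)
    # Hypothetical submission rule: every feature must be identifiable or linked.
    if "ID" not in attrs and "Parent" not in attrs:
        problems.append("attributes need an ID or a Parent")
    return problems


if __name__ == "__main__":
    feature = "scf1\tptn2genome\tptn_match\t300\t250\t.\t+\t.\tID=xxxx"
    for problem in check_gff3_line(feature):
        print("FAILED:", problem)
```

Returning every problem at once, rather than stopping at the first, matches the workflow on the slide where submitters correct errors and resubmit.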

CAP submissions displayed in the genome browser Similarity track for supporting evidence (from previous updates)

Genome annotation metrics
Metrics for quality of a gene set are far from standardised but...
- Simple statistics (length, number of exons, intron size)
- Level of support from transcript data (how many genes have overlapping EST/RNAseq)
- Junction data (confirmation of introns)
- Comparison to public datasets (UniProt)
- Protein domains (InterPro)
- Comparative analysis - orthologs/paralogs
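The "simple statistics" item can be made concrete with a short script that reads exon rows from a GFF3 file and reports exons per transcript and intron sizes. This is a minimal sketch: it assumes exon features carry a Parent attribute naming their transcript, and the function name and output keys are illustrative rather than part of any VectorBase tool.

```python
from collections import defaultdict
from statistics import mean


def exon_metrics(gff3_lines):
    """Summarise exon counts and intron lengths per transcript from GFF3 exon rows."""
    exons = defaultdict(list)
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[parent].append((int(cols[3]), int(cols[4])))
    exon_counts, intron_lengths = [], []
    for spans in exons.values():
        spans.sort()
        exon_counts.append(len(spans))
        # An intron runs from the base after one exon to the base before the next.
        intron_lengths += [nxt[0] - prev[1] - 1 for prev, nxt in zip(spans, spans[1:])]
    return {
        "transcripts": len(exons),
        "mean_exons": mean(exon_counts) if exon_counts else 0,
        "mean_intron_length": mean(intron_lengths) if intron_lengths else 0,
    }


if __name__ == "__main__":
    rows = [
        "scf1\tmaker\texon\t1\t100\t.\t+\t.\tID=e1;Parent=t1",
        "scf1\tmaker\texon\t201\t300\t.\t+\t.\tID=e2;Parent=t1",
        "scf1\tmaker\texon\t10\t50\t.\t-\t.\tID=e3;Parent=t2",
    ]
    print(exon_metrics(rows))
```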

Still to do...
Primary annotation
- Still 7 genomes outstanding from the Broad Institute: de novo repeat finding and MAKER annotation
Analysis
- Whole genome alignments and the 12 Drosophilid analysis pipelines from the Kellis group (Rob Waterhouse)
- Data presentation (Trinity clusters, correlation with legacy Hittinger clusters, velvet assembled 37 bp reads)
- Variation (SNP calls) from each of the 16 species
Other genomes
- New version of the An. darlingi genome (Osvaldo Marinotti, recently published in NAR)
- New version of the Indian strain of An. stephensi (Jake Tu)

Acknowledgements V EMBL-EBI Imperial College Daniel Lawson, Gareth Maslen, Mikkel Christensen, Nick Langridge, Derek Wilson, Gautier Koscielny, Karyn Megy, Martin Hammond, Daniel Hughes, Ewan Birney, Paul Kersey Fotis Kafatos, Bob MacCallum, George Christophides, Seth Redmond, Timo Tiirikka NoTre Dame HaRvard IMBB New MexicO A Sequencers Ensembl GEnomes Maggie Werner-Washburne Phil Baker Bill Gelbart, Susan Russo, Dave Emmert, Pinglei Zhou, Lynn Crosby, Kathy Campbell Kitsos Louis, Pantelis Topalis, Emmanuel Dialynas, Vicky Dritsou TIGR/JCVI WashU Broad Institute, Baylor College Frank Collins, Greg Madey, Rob Bruggner, Nate Konopinski, EO Stinson, Scott Emrich, Andrew Sheehan, Rory Carmichael, Dave Cieslak, Dave Campbell, Ryan Butler, Katie Cybulski, Neil Lobo, Gloria Calderon, Greg Davis Dan Neafsey, Brian Haas Nora Besansky, Michael Fontaine Michael Nuhn Rob Waterhouse Paul Howell

Contact or

Anopheles Genomes Cluster Consortium Steering committee Community liaisons