Published by Russell Ramsey. Modified over 9 years ago.
1
The new VectorBase: our improved resource for invertebrate vectors
Scott Emrich, on behalf of VectorBase
“Bigger, better, faster” or “consolidate, improve and rationalise” (UK)
2
VectorBase has been mostly a collator of genomes
[Figure labels: Full release; Pre-released*; Organism pages; Raw GenBank data from sequencing centers; (our) Annotation]
3
Rapid growth, however, in the past 5 years
4
VectorBase is also:
A service providing tools for browsing and mining vector “-omics” data
A content generator
– Mostly genome annotation (later talks)
Committed to regular releases (5-6 per year)
A help desk that supports our community on genome informatics and is responsible for facilitating data submission
5
In the end, VectorBase is a team. And YOU!
7
Left side: Welcome message; Available data; Tools and Resources
Right side: Past jobs; Organisms (2); Latest news
8
Left side: Community
Right side: Rotating tips; Newsletters; Upcoming meetings
9
This is the new organism page: it collects strain, data, and relevant tools
10
~3,700-8,300 jobs per month
Mostly Anopheles, but other species too
11
Web development goals (2015)
Patching/upgrading WebApollo instances (1)
– multiple genomes in one instance
– reworked framework to improve performance
Integrating subcontractor work with Drupal CMS (2)
– easier releases and better cross-site development
Sitewide authentication for single user accounts
– Drupal
– WebApollo
– Galaxy
12
Modifying WebApollo example
13
Advanced Search
Antelmo (ND) is making Advanced Search more stable and intuitive via Drupal and Solr
-> Also allows looking at saved searches, for advanced analysis of BRC usage
-> Now running Solr 4.x to further support PopBio
14
Current VectorBase variation + PopBio dataflows.
VCF -> Ensembl variation database -> Display of variant data in genomic context
ISA-TAB -> PopBio -> Display of detailed sample metadata, e.g. geodata
Linking the two: Sample + variation set ids
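In this dataflow the sample ids are what tie the VCF side to the ISA-TAB side. A minimal sketch of that join, assuming a standard VCF header line and a simplified tab-separated sample table. The column name "Sample Name", the function names and the sample ids in the usage note are illustrative assumptions, not VectorBase's actual schema or code:

```python
import csv
import io


def vcf_sample_ids(vcf_text):
    """Return the sample ids listed on the #CHROM header line of a VCF."""
    for line in vcf_text.splitlines():
        if line.startswith("#CHROM"):
            # The first 9 columns are fixed VCF fields (CHROM .. FORMAT);
            # every column after them names one sample.
            return line.split("\t")[9:]
    return []


def sample_metadata(table_text, id_column="Sample Name"):
    """Index a tab-separated sample table by its id column (assumed name)."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return {row[id_column]: row for row in reader}


def link_samples(vcf_text, table_text):
    """Pair each VCF sample with its metadata row; report ids with no match."""
    meta = sample_metadata(table_text)
    ids = vcf_sample_ids(vcf_text)
    linked = {s: meta[s] for s in ids if s in meta}
    missing = [s for s in ids if s not in meta]
    return linked, missing
```

Calling `link_samples(vcf_text, table_text)` returns the metadata rows for samples present in both files, plus the list of VCF samples that have no metadata row, which is the case a submission check would want to flag.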
15
Use of Apache Solr to provide unified search (and thus integration) across the BRC
VCF -> Ensembl variation database -> Display of variant data in genomic context
ISA-TAB -> PopBio -> Display of detailed sample metadata, e.g. geodata
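A unified search of this kind comes down to sending one query to a single Solr index and filtering by data source. The sketch below only builds the request URL for Solr's standard select handler, using the common `q`, `fq`, `rows` and `wt` parameters. The host, the core name `vb_search` and the `site` field in the usage note are hypothetical, not the actual VectorBase configuration:

```python
from urllib.parse import urlencode


def build_solr_query(base_url, core, text, site=None, rows=10):
    """Build a URL for Solr's /select handler.

    `core` and the `site` filter field are assumptions made for
    illustration; the query text is passed through unescaped, so a
    real client would need to escape Solr query syntax characters.
    """
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    if site is not None:
        # A filter query restricts results to one data source
        # without affecting relevance scoring.
        params.append(("fq", f'site:"{site}"'))
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"
```

For example, `build_solr_query("http://localhost:8983/solr", "vb_search", "kdr", site="Population Biology")` yields a URL that searches every indexed record for "kdr" but returns only those tagged with that (assumed) site value.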
16
PopBio import
Current size: 121 projects, 57,637 samples, 172,636 assays (of which 4,387 are IR)
At present loading can be done overnight, but this may change
The web interface is not slow, but only because of “pre-loading,” which definitely isn’t scalable
17
PopBio plans
Map interface: delivery for the June release + Kolymbari + ICEMR meetings
Spreadsheet submission wizard: development scheduled for Fall 2015
Year 2: Sample x genotype browser development, including Ensembl (e!) REST and variation Solr work
Year 2: Refactor project pages with scalable (but still flexible) data transfer (probably also Solr-driven) and update graphics
18
Scaling up to millions of SNPs, thousands of samples
Plan to develop or modify something similar to MalariaGEN's Panoptes, with richer/more flexible metadata capabilities
19
Upcoming genome updates
June 2015
– sandflies x 2
– Anopheles assembly updates x 4
Summer 2015 – QC of Glossina workshop data, 16G data
August 2015 – Release of MalariaGEN 1000G data (pending publication plans); we expect ~50 million new malaria mosquito variants by the end of summer
October – Glossinas x 6
20
Updating genes and assemblies
We recently supported the Glossina gene annotation workshop held in Kenya (3/2015). The workshop data will be integrated into the existing Glossina databases for release in late 2015. A new database for the final species (Glossina palpalis) will also be created for release in late 2015.
Assembly updates for An. farauti, An. melas, An. merus and An. sinensis have been examined to assess whether we can project gene information onto the new assemblies. Over 90% of transcripts could be projected, and we intend to schedule the assembly updates for Q3 2015.
New databases have been proposed for Sarcoptes scabiei var. canis and Aedes albopictus.
Emrich, Hahn, Lawniczak and Besansky will submit a new reference genome of An. gambiae (S) for summer 2015.
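To make the idea of projecting gene information onto a new assembly concrete, here is a deliberately simplified sketch, not the pipeline used for these updates. It assumes old and new assemblies are related by ungapped, same-strand alignment blocks on a single sequence, and it counts a transcript as projectable only if it lies entirely within one block. `Block`, `project_interval` and `projected_fraction` are hypothetical names:

```python
from typing import NamedTuple


class Block(NamedTuple):
    """An ungapped, same-strand alignment block (1-based, inclusive)."""
    old_start: int
    old_end: int
    new_start: int  # where old_start lands on the new assembly


def project_interval(start, end, blocks):
    """Map [start, end] to the new assembly, or return None.

    Projection succeeds only when the whole interval sits inside a
    single block; intervals spanning a block boundary are rejected.
    """
    for block in blocks:
        if block.old_start <= start and end <= block.old_end:
            offset = block.new_start - block.old_start
            return (start + offset, end + offset)
    return None


def projected_fraction(transcripts, blocks):
    """Fraction of transcripts (id -> (start, end)) that can be projected."""
    if not transcripts:
        return 0.0
    projected = sum(
        1 for start, end in transcripts.values()
        if project_interval(start, end, blocks) is not None
    )
    return projected / len(transcripts)
```

A summary such as "over 90% of transcripts could be projected" corresponds to `projected_fraction` exceeding 0.9 under whatever projection criteria the real pipeline applies, which are likely more permissive than this single-block rule.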
21
Improved EBI production
Data management systems
WebApollo databases have been set up for 32 organisms and are being actively used by the community for Biomphalaria glabrata (snail), Phlebotomus papatasi and Lutzomyia longipalpis (sandflies), Musca domestica (house fly) and the five current Glossina (tsetse) species.
IT infrastructure
VectorBase production pipelines are being migrated to the EBI eHive system (https://github.com/Ensembl/ensembl-hive). This encourages standardization of our code base and also lets us use EBI parallel computing resources.
Analysis tools
New pipelines for xrefs, search, protein alignment and exonerate-based sequence alignments have been developed using the eHive system. This has allowed us to speed up run times in addition to the advantages above.
22
Future production work at EBI
Search
We had previously experienced scaling problems with the generation of Solr indices for the VectorBase search, and have now rewritten the core gene Solr index generation for eHive.
Updating genome data
Projection of gene descriptions between closely related orthologs will be introduced in an attempt to improve basal gene annotation in some of the new species. First deployment of this code is scheduled for June 2015.
Transcript, genomic sequence and GTF/GFF dumping have been included in the eHive pipeline, but data files are still updated on the VectorBase Drupal site manually.
Adding the UCSC track hub system to facilitate metadata and additional “-omics” data
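The projection of gene descriptions between orthologs can be pictured as follows. This is a sketch under stated assumptions, not the code scheduled for deployment: it assumes a one-to-one ortholog table, copies a description only when the target gene has none, and records where the text came from. The function name and the gene ids in the usage note are made up for illustration:

```python
def project_descriptions(targets, orthologs, sources):
    """Fill in missing gene descriptions from one-to-one orthologs.

    targets   -- target gene id -> description (None or "" if missing)
    orthologs -- target gene id -> source gene id (assumed one-to-one)
    sources   -- source gene id -> description
    Returns a new dict; existing descriptions are never overwritten.
    """
    updated = dict(targets)
    for gene_id, description in targets.items():
        if description:
            continue  # keep curated or previously assigned text
        source_id = orthologs.get(gene_id)
        source_description = sources.get(source_id) if source_id else None
        if source_description:
            # Tag the provenance so projected text is distinguishable.
            updated[gene_id] = f"{source_description} [projected from {source_id}]"
    return updated
```

With `targets = {"TGT1": None}`, `orthologs = {"TGT1": "SRC1"}` and `sources = {"SRC1": "odorant receptor"}`, the call returns a description for `TGT1` that carries the source text plus its provenance tag. Refusing to overwrite existing descriptions is a design choice made here so that projection can only add annotation.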