Variation data in VectorBase NIH/NIAID VectorBase site visit March 2015.



Variation data types
– VectorBase captures both sequence and structural variation (stable chromosomal inversions in An. gambiae).
– The bulk of the data is sequence variants, primarily SNPs, based on high-throughput genomic re-sequencing of isolates.
– Data are formatted using VCF (Variant Call Format).
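To make the VCF point concrete, here is a minimal sketch of reading VCF text in Python. It follows the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one column per sample). The record, the variant ID and the isolate names are invented for illustration and are not VectorBase data, and a production pipeline would more likely use an established VCF library.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    var_id: str
    ref: str
    alts: tuple
    genotypes: dict  # sample name -> GT string, e.g. "0/1"


def parse_vcf(text):
    """Parse minimal VCF text into a list of Variant records."""
    samples = []
    variants = []
    for line in text.splitlines():
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, var_id, ref, alt = fields[:5]
        fmt_keys = fields[8].split(":") if len(fields) > 8 else []
        genotypes = {}
        if "GT" in fmt_keys:
            gt_index = fmt_keys.index("GT")
            for name, column in zip(samples, fields[9:]):
                genotypes[name] = column.split(":")[gt_index]
        variants.append(
            Variant(chrom, int(pos), var_id, ref, tuple(alt.split(",")), genotypes)
        )
    return variants


# Hypothetical two-isolate example; all values are made up.
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "isolate_A", "isolate_B"]),
    "\t".join(["2L", "1204", "snp_demo_1", "A", "G", "50", "PASS",
               ".", "GT", "0/1", "1/1"]),
])

if __name__ == "__main__":
    for variant in parse_vcf(EXAMPLE_VCF):
        print(variant)
```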

Variation data bridges PopBio and the genome browser (Ensembl)
– Variants are stored in a MySQL database using the Ensembl RDB schema.
– Sample metadata is stored in PopBio (Chado schema on PostgreSQL).
– These are large and complex data sets that depend on the accuracy of the genome assembly and on the parameterization of the variant-calling algorithms.

Summary of VectorBase variation datasets ( )
1) Aedes aegypti contains the latest SNP chip data set from Powell et al., 2015 (PMID: ).
2) An. stephensi (Indian strain) was a new database, distinct from the SDA-500 strain.
3) Four further variation datasets, for An. farauti, An. merus, An. sinensis and An. melas, are available and will be loaded after updates to the assemblies for these organisms.

Representation in PopBio
– Genomic (re)sequencing is an assay type in PopBio.
– Sample metadata is stored in the PopBio and BioSamples databases.
– MR4 colony sequencing (VBP )

Variant VectorBase

Querying and using variation data in VectorBase
– Browser tracks
– BioMart datasets
– Sample metadata
– VEP tool (Variant Effect Predictor)

Internal VectorBase variation + PopBio dataflows
– VCF → Ensembl variation database → display of variant data in genomic context
– ISA-TAB → PopBio → display of detailed sample metadata, e.g. geodata
– Linked by sample + variation set IDs
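The link between the two flows can be sketched as a join on a shared sample ID: genotype calls come from the variation side, geodata from the sample metadata side. The function name, field names and values below are hypothetical and are not the actual Ensembl or PopBio schema.

```python
def attach_sample_metadata(genotypes, sample_metadata):
    """Join per-sample genotype calls to sample metadata on the shared sample ID.

    genotypes: dict of sample ID -> genotype string
    sample_metadata: dict of sample ID -> dict of metadata fields
    Samples without a metadata record are skipped.
    """
    joined = []
    for sample_id, genotype in genotypes.items():
        meta = sample_metadata.get(sample_id)
        if meta is None:
            continue  # no metadata record for this sample ID
        joined.append({"sample_id": sample_id, "genotype": genotype, **meta})
    return joined


if __name__ == "__main__":
    # Made-up IDs and coordinates, for illustration only.
    calls = {"sample_demo_1": "0/1", "sample_demo_2": "1/1"}
    metadata = {"sample_demo_1": {"latitude": 0.0, "longitude": 0.0}}
    print(attach_sample_metadata(calls, metadata))
```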

Use of Apache Solr to provide unified variation search across the VectorBase site
– VCF → Ensembl variation database → display of variant data in genomic context
– ISA-TAB → PopBio → display of detailed sample metadata, e.g. geodata
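As a rough illustration of what a unified search request could look like, the sketch below builds a URL for Solr's standard select handler using the common query parameters q, fq, rows and wt. The host, the core name and the species field are assumptions made for this example. The slides do not describe the actual VectorBase Solr configuration.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, core, text, species=None, rows=10):
    """Build a URL for Solr's /select handler.

    base_url, core and the 'species' field name are hypothetical;
    q, fq, rows and wt are standard Solr query parameters.
    """
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    if species:
        # A filter query restricts the results without affecting relevance scoring.
        params.append(("fq", f'species:"{species}"'))
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_solr_query("http://localhost:8983", "variation_demo",
                           "snp_demo_1", species="Anopheles gambiae"))
```

The URL is only constructed here, not requested, so the sketch runs without a Solr server.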

Identification and management of redundant variant records via a MongoDB NoSQL database. Slide courtesy of Christoph Grabmüller – Ensembl Genomes 2014.
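The slide does not say how redundancy is defined or how MongoDB is used. One plausible reading, taken here purely as an assumption, is that records are redundant when they share chromosome, position, reference allele and alternate allele. The sketch below uses a plain Python dictionary as a stand-in for a key-based document store, with invented record IDs.

```python
from collections import defaultdict


def group_redundant(records):
    """Group variant records that share (chrom, pos, ref, alt).

    Returns only the keys that have more than one record ID attached.
    The choice of key is an assumption, not the documented VectorBase rule.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["chrom"], rec["pos"], rec["ref"].upper(), rec["alt"].upper())
        groups[key].append(rec["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    # Made-up records: the first two describe the same change.
    demo = [
        {"id": "sub_a_1", "chrom": "2L", "pos": 1204, "ref": "A", "alt": "G"},
        {"id": "sub_b_7", "chrom": "2L", "pos": 1204, "ref": "a", "alt": "g"},
        {"id": "sub_a_2", "chrom": "3R", "pos": 88, "ref": "C", "alt": "T"},
    ]
    print(group_redundant(demo))
```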

VectorBase interactions with external data sources relevant to variation studies
– External parties: dbSNP, EVA, BioSamples, community
– Initial submission of variation data (multiple formats)
– VectorBase: VCF format data; sample metadata (ontology compliant)
– Long-term variation archive
– Ongoing curation of data, either solely by the community or in collaboration with VectorBase

EVA - long-term storage of variant data
– Processing of variation data for VectorBase species by dbSNP is too slow to be useful (>1 year).
– VectorBase accepts community variation data submissions and processes them rapidly. This involves active collaboration with submitters to convert submissions into suitable data formats and to link entries to ontologies and other metadata tracking systems.
– Store submitted variation data as VCF files in the European Variation Archive for the long term ( ).
– EVA is to broker submission of VCF data to dbSNP, which can then resolve duplicate submissions and allocate persistent IDs that can be reincorporated into VectorBase variation records.
– VectorBase has submitted data for Anopheles coluzzii + Anopheles gambiae to EVA.
– The Anopheles 16 genomes data will be submitted to the archive once the remaining sample tracking issues have been resolved.

BioSample - sample metadata
– Joint EBI/NCBI database that stores submitter-supplied data relating to samples used in other primary NCBI archives, such as:
  – SRA (Sequence Read Archive)
  – dbGaP (database of Genotypes and Phenotypes)
  – GenBank
– VectorBase works with community members to ensure sample metadata is captured and tagged with appropriate ontology terms (e.g. “A Multipurpose High Throughput SNP Chip for the Dengue and Yellow Fever Mosquito, Aedes aegypti.” Evans et al., PMID: ).
– Joint VectorBase/researcher submitters allow samples to be curated by the community.

Future plans
Consolidation:
– Continue to broker and improve sample metadata submissions with BioSamples
– Work with EVA developers to pilot VCF brokerage with dbSNP
– Improve “Sample picker” interface
New data:
– MalariaGEN 1000 Anopheles project (AR2 data release)
– Individuals from 8 countries, min. 30 samples per population, est. million variants
Data queries:
– Increase use of Solr to replace and augment BioMart functionality
– Use of other database solutions for specific queries (e.g. MongoDB)