VectorBase Text Search Engine. Gautier Koscielny, VectorBase Meeting, 08 February 2012, EBI. EBI is an Outstation of the European Molecular Biology Laboratory.



Gautier Koscielny - VectorBase Search Engine, Wednesday, 8 February 2012

Slide 2: History of text search

Up to 2009:
- Notre Dame University maintained the main site text search.
- At the time, there was no text search module available in the installed version of Ensembl.

In 2010:
- The Ensembl installation was updated to reflect the latest Ensembl Genomes installation.
- Text search technology became available.
- At the time, Ensembl search was based on the EB-EYE indices.

Slide 3: Challenges in 2010

- How to integrate the new Lucene EB-EYE indices in the main site?
- Multiple sources of indexing in VectorBase (expression, community annotations, etc.)
- Relied on goodwill from external services to update the EB-EYE indices from VectorBase core databases
- Relied on an XML dump of the core database
- Time-consuming task
- Difficult to index new data types or resources

Slide 4: Requirements

- Framework to generate indices at any time
  - Can reflect new community annotations (CAP)
  - Ontology information
  - New data sources: literature
- Search to serve Lucene indices from different providers:
  - Gene annotation, x-refs, comparative genomics data (EBI)
  - Microarray and gene expression data (Imperial)
  - CAP (Notre Dame)
- Indexing must be fast, easy to use and easy to maintain
- Search can be plugged into different tools:
  - Main VectorBase website
  - Ensembl genome browser

Slide 5: Architecture

[Diagram labels: Data sources (Ensembl, FuncGen, CAP); Lucene indices (one index file each for EBI, Imperial and Notre Dame); VectorBase Search Service Layer; SOAP; Clients]

Slide 6: What is being searched?

- Genomic information (Ensembl databases)
  - Gene models
  - Variation
  - Probes
  - Orthologs
- Expression data (Imperial)
- CAP
- Ontologies (idomal, miro, anatomy)
- Population genomics (Imperial)

Slide 7: Generating Ensembl indices at the EBI

- Based on a direct connection to the database(s)
- Uses a configuration file containing the description of objects and their types:
  - Database connection (staging-1, …)
  - Database type (core, funcgen, variation)
  - Genome (aedes_aegypti)
  - Homologies
- Each object in the configuration file is represented by a Java class
- The configuration loader automatically creates an instance of each type using the class loader
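The configuration-driven loading described on this slide can be sketched as follows. This is a minimal illustration in Python rather than the Java used by the framework; the class names, the entry layout and the `REGISTRY` lookup (standing in for the Java class loader) are all assumptions, since the real configuration format is not given in the transcript.

```python
from dataclasses import dataclass


@dataclass
class DatabaseConnection:
    host: str


@dataclass
class Database:
    db_type: str  # e.g. core, funcgen or variation
    genome: str


# Name -> class lookup, playing the role of the class loader (assumption).
REGISTRY = {cls.__name__: cls for cls in (DatabaseConnection, Database)}

# Hypothetical configuration entries; the real file format is not shown here.
CONFIG = [
    {"class": "DatabaseConnection", "args": {"host": "staging-1"}},
    {"class": "Database", "args": {"db_type": "core", "genome": "aedes_aegypti"}},
]


def load_config(entries, registry=REGISTRY):
    """Create one instance per configuration entry from its class name."""
    return [registry[entry["class"]](**entry["args"]) for entry in entries]


objects = load_config(CONFIG)
```

Adding a new kind of object then only requires a new class and a new configuration entry; the loader itself does not change.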

Slide 8: Example of configuration file

Slide 9: Procedure (for Ensembl indices)

[Diagram labels: core, funcgen, variation, compara]

1. If compara is defined, get all homologies
2. For each genome in turn:
   - Get all gene, transcript, exon, protein and xref information from core
   - Get all reporters from funcgen and their mapping to gene models
   - Get all variations and their relation to gene models
   - Associate all existing homologies with the genes
   - Create a Lucene Document for each gene

The indices are copied to Notre Dame University.
The Tomcat instance is restarted.
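The two-step procedure can be sketched as a small loop. This is an illustration only: plain Python dictionaries stand in for the core, funcgen, variation and compara databases, a flat dict stands in for a Lucene Document, and every identifier below (`GENE1`, `probe_42`, `ORTH9`, the dictionary keys) is made up.

```python
from collections import defaultdict


def build_documents(genomes, homologies=None):
    """Return one flat dict per gene, standing in for a Lucene Document."""
    # Step 1: if compara is defined, group all homologies by gene id up front.
    homologs_by_gene = defaultdict(list)
    for gene_id, homolog_id in (homologies or []):
        homologs_by_gene[gene_id].append(homolog_id)

    documents = []
    # Step 2: process each genome in turn.
    for genome, sources in genomes.items():
        for gene in sources["core"]:
            gid = gene["id"]
            documents.append({
                "id": gid,
                "species": genome,
                "transcripts": gene.get("transcripts", []),
                "funcgen_xrefs": sources.get("funcgen", {}).get(gid, []),
                "variation_xrefs": sources.get("variation", {}).get(gid, []),
                "homologs": homologs_by_gene.get(gid, []),
            })
    return documents


genomes = {
    "aedes_aegypti": {
        "core": [{"id": "GENE1", "transcripts": ["GENE1-RA"]}, {"id": "GENE2"}],
        "funcgen": {"GENE1": ["probe_42"]},
        "variation": {},
    }
}
docs = build_documents(genomes, homologies=[("GENE1", "ORTH9")])
```

Fetching the homologies once, before the per-genome loop, mirrors the order given on the slide and avoids querying compara again for every gene.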

Slide 10: Ensembl object mapping in Java

- Ensembl concepts are mapped to equivalent Java data access objects (DAO)
- All Ensembl concepts are stored in memory and removed when a Lucene Document is created

[Class diagram labels: EnsemblFeature, Gene, Homology, Xref; relations "extends" and "contains"; Gene contains transcripts, translations, exons]
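A rough sketch of such an object mapping is given below, in Python rather than Java. The class diagram did not survive in the transcript, so the hierarchy is an assumption: `Gene` and `Homology` are taken to extend `EnsemblFeature`, and a feature is taken to contain `Xref` objects. The `in_memory` cache and the field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Xref:
    database: str
    accession: str


@dataclass
class EnsemblFeature:
    stable_id: str
    xrefs: list = field(default_factory=list)


@dataclass
class Gene(EnsemblFeature):
    transcripts: list = field(default_factory=list)
    translations: list = field(default_factory=list)
    exons: list = field(default_factory=list)


@dataclass
class Homology(EnsemblFeature):
    target_stable_id: str = ""


in_memory = {}  # stable id -> object loaded from the database


def load(gene):
    in_memory[gene.stable_id] = gene


def to_document(stable_id):
    """Build the document, dropping the object from memory at the same time."""
    gene = in_memory.pop(stable_id)
    return {
        "id": gene.stable_id,
        "transcripts": list(gene.transcripts),
        "xrefs": [f"{x.database}:{x.accession}" for x in gene.xrefs],
    }


load(Gene("GENE1", xrefs=[Xref("ExampleDB", "X123")], transcripts=["GENE1-RA"]))
document = to_document("GENE1")
```

Removing each object as soon as its document exists keeps memory use bounded by what is still waiting to be indexed.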

Slide 11: Creating a Lucene document

- A document is the unit added to the index: a container of fields
- Each document defines one or several fields
- The framework creates one document per gene
- Each field can store its value (or not)
- Each field can be indexed (or not)
- The stored text can be compressed
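The difference between a stored field and an indexed field can be shown with a toy index. This is not the Lucene API: `Field` and `TinyIndex` are hypothetical names for a pure-Python stand-in, with whitespace tokenisation and `zlib` compression chosen only to keep the example short.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    stored: bool = True   # keep the value so it can be returned with a hit
    indexed: bool = True  # make the value searchable


class TinyIndex:
    def __init__(self):
        self.postings = {}  # term -> set of document numbers
        self.stored = []    # document number -> {field name: compressed value}

    def add(self, fields):
        doc_no = len(self.stored)
        kept = {}
        for f in fields:
            if f.indexed:
                for term in f.value.lower().split():
                    self.postings.setdefault(term, set()).add(doc_no)
            if f.stored:
                kept[f.name] = zlib.compress(f.value.encode("utf-8"))
        self.stored.append(kept)
        return doc_no

    def search(self, term):
        hits = sorted(self.postings.get(term.lower(), ()))
        return [
            {name: zlib.decompress(raw).decode("utf-8")
             for name, raw in self.stored[n].items()}
            for n in hits
        ]


index = TinyIndex()
index.add([Field("id", "GENE1"),
           Field("description", "odorant receptor", stored=False)])
hits = index.search("odorant")
```

Here the description is searchable but not returned, while the id is both searchable and returned, so `hits` contains only the `id` field of the matching document.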

Slide 12: Gene Document

Fields:
- Gene id, name, description
- Coordinates: seq region name, start, end
- Species, feature type (gene), source (biotype), genomic unit
- Transcript count, transcript stable ids
- Exon count, exon stable ids
- Peptide count, peptide stable ids, domains
- Core xrefs
- Variation xrefs (if available)
- Funcgen xrefs (if available)
- Compara homologs (if available)
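As an illustration of how such a field list might be assembled, the sketch below flattens a gene record into document fields and adds the optional groups only when data is present. The key names are hypothetical (the slide gives field descriptions, not identifiers), and only a subset of the fields is shown.

```python
def gene_document_fields(gene):
    """Flatten a gene record (a plain dict here) into document fields."""
    transcripts = gene.get("transcripts", [])
    exons = gene.get("exons", [])
    fields = {
        "id": gene["id"],
        "name": gene.get("name", ""),
        "description": gene.get("description", ""),
        "seq_region_name": gene["seq_region_name"],
        "start": gene["start"],
        "end": gene["end"],
        "species": gene["species"],
        "feature_type": "gene",
        "transcript_count": len(transcripts),
        "transcript_stable_ids": transcripts,
        "exon_count": len(exons),
        "exon_stable_ids": exons,
    }
    # Optional groups are written only if available.
    for optional in ("variation_xrefs", "funcgen_xrefs", "compara_homologs"):
        if gene.get(optional):
            fields[optional] = gene[optional]
    return fields


fields = gene_document_fields({
    "id": "GENE1", "seq_region_name": "1", "start": 100, "end": 900,
    "species": "aedes_aegypti", "transcripts": ["GENE1-RA"],
    "funcgen_xrefs": ["probe_42"], "variation_xrefs": [],
})
```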

Slide 13: CAP indices

- A GFF parser extracts gene and transcript models
- Name, description, submitter and chromosome location are indexed
- Very fast
- Could be updated overnight if required
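A minimal sketch of this kind of GFF extraction follows, assuming GFF3 input with nine tab-separated columns and `key=value` attributes. The `description` and `submitter` attribute keys, the use of `mRNA` for transcript models and the sample lines are assumptions, not details of the CAP parser.

```python
from urllib.parse import unquote


def parse_gff3(lines, wanted=("gene", "mRNA")):
    """Extract gene and transcript records from GFF3 lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] not in wanted:
            continue
        attrs = {}
        for pair in cols[8].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                attrs[key.strip()] = unquote(value)  # GFF3 values are percent-encoded
        records.append({
            "type": cols[2],
            "id": attrs.get("ID", ""),
            "name": attrs.get("Name", ""),
            "description": attrs.get("description", ""),
            "submitter": attrs.get("submitter", ""),
            "location": f"{cols[0]}:{cols[3]}-{cols[4]}",
        })
    return records


# Made-up sample data.
SAMPLE = [
    "##gff-version 3",
    "chr1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=Gene1;description=made-up%20gene;submitter=someone",
    "chr1\texample\tmRNA\t100\t900\t.\t+\t.\tID=gene1-RA;Parent=gene1",
    "chr1\texample\texon\t100\t300\t.\t+\t.\tParent=gene1-RA",
]
records = parse_gff3(SAMPLE)
```

A single pass over the file with no database access is consistent with the slide's remark that this index is fast to build.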

Slide 14: Expression data/Population genomics

- Constructed by Bob MacCallum (Imperial)

Slide 15: Ontologies

- Ontology terms are indexed
- An OBO parser extracts each term in turn
- Accession, name and description are parsed by default
- Extra fields are parsed depending on the completeness of each term
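Term-by-term OBO parsing can be sketched with a generator that yields one `[Term]` stanza at a time. This is an illustration, not the VectorBase parser: tags are collected as they appear, so a term with extra tags simply yields more fields, and the sample ontology below is made up.

```python
def parse_obo_terms(lines):
    """Yield each [Term] stanza as a dict of tag -> list of values."""
    term = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts: emit the previous term, if any.
            if term is not None:
                yield term
            term = {} if line == "[Term]" else None
        elif term is not None and ": " in line:
            tag, value = line.split(": ", 1)
            term.setdefault(tag, []).append(value)
    if term is not None:
        yield term


SAMPLE = """\
format-version: 1.2

[Term]
id: EX:0000001
name: example term
def: "A made-up term." [EX:ref]

[Term]
id: EX:0000002
name: child term
is_a: EX:0000001 ! example term

[Typedef]
id: part_of
""".splitlines()

terms = list(parse_obo_terms(SAMPLE))
```

Header lines and non-term stanzas such as `[Typedef]` are skipped, and the second term carries an `is_a` field but no `def`, which matches the idea of parsing extra fields only when a term provides them.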

Slide 16: SOAP interface

- 2 procedures: getNbOfResults, getResults (see wiki)
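The shape of a two-procedure search interface can be sketched in plain Python. Only the two procedure names come from the slide; the parameters (`domain`, `query`, `start`, `size`), the substring matching and the in-memory indices are assumptions, and no SOAP transport is shown. The actual signatures are documented on the wiki the slide refers to.

```python
class SearchService:
    """In-memory stand-in for a search service exposing two procedures."""

    def __init__(self, indices):
        self.indices = indices  # domain name -> list of document dicts

    def _matches(self, domain, query):
        needle = query.lower()
        return [
            doc for doc in self.indices.get(domain, [])
            if any(needle in str(value).lower() for value in doc.values())
        ]

    def get_nb_of_results(self, domain, query):
        """Counterpart of getNbOfResults: how many documents match."""
        return len(self._matches(domain, query))

    def get_results(self, domain, query, start=0, size=10):
        """Counterpart of getResults: one page of matching documents."""
        return self._matches(domain, query)[start:start + size]


service = SearchService({
    "genome": [
        {"id": "GENE1", "description": "odorant receptor"},
        {"id": "GENE2", "description": "odorant binding protein"},
    ],
    "ontology": [{"id": "EX:0000001", "name": "example term"}],
})
total = service.get_nb_of_results("genome", "odorant")
page = service.get_results("genome", "odorant", start=1, size=1)
```

Separating the count from the paged results lets a client show totals without transferring every hit.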

Slide 17: To do list

Front-end:
- All domains should be queried to produce an 'Entrez'-like page
- So, search all domains by default and display the count per domain
- Could be a very simple result page (see next slide for mock-up)

Updates:
- We could update some of the domains more frequently
- CAP is a good candidate

Other technologies:
- Other technologies can be used
- Auto-completion
- SOLR

Slide 18: Result page

Genome (1693)  Expression (3693)  Ontology (70)  Population (30)
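Per-domain counts for an 'Entrez'-like summary line could be produced along these lines. This is a sketch under simple assumptions: each domain is a list of text snippets, matching is a case-insensitive substring test, and the data is made up.

```python
def count_per_domain(indices, query):
    """Search every domain and return the number of hits in each."""
    needle = query.lower()
    return {
        domain: sum(1 for text in texts if needle in text.lower())
        for domain, texts in indices.items()
    }


def summary_line(counts):
    """Format counts as 'Domain (n)' entries on a single line."""
    return "  ".join(f"{domain} ({n})" for domain, n in counts.items())


indices = {
    "Genome": ["odorant receptor 1", "odorant binding protein"],
    "Expression": ["blood meal time course"],
    "Ontology": ["odorant receptor activity"],
}
counts = count_per_domain(indices, "odorant")
line = summary_line(counts)
```

Domains with no hits are kept with a count of zero, so the result page always lists every domain in the same order.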