Managing Big Scientific Data Capturing, Integrating and Presenting Mouse Data at MGI Cynthia Smith Canberra April 2010 www.informatics.jax.org Mouse Genome.

Presentation transcript:

Managing Big Scientific Data: Capturing, Integrating and Presenting Mouse Data at MGI. Cynthia Smith, Canberra, April 2010. Mouse Genome Informatics

Mouse Genome Informatics (MGI) program goal: …to facilitate the use of the mouse as a model for heritable human diseases and normal human biology.
Example: Achondroplasia (image: homozygous achondroplasia mouse mutant and control). Phenotypes: short domed skull; short-limbed dwarfism; malocclusion; bulging abdomen as adults; respiratory problems; shortened lifespan.

…to accomplish MGI’s mission, we provide integrated access to the genetics, genomics, and biology of the laboratory mouse. Information content spans from sequence to phenotype/disease: sequence, natural variation, gene function, genome location, orthologies, strain genealogy, expression, tumors. (Example shown: Hermansky-Pudlak syndrome.)

MGI Data Content, a few numbers (April 2010)
Genes (including uncloned mutants): 36,691
Genes w/ nucleotide sequence: 29,108
Genes annotated to GO / Total mouse GO annotations: 25, ,558
Mouse/human orthologs: 17,846
Mouse/rat orthologs: 16,776
Phenotypic alleles in mice / genes with mutant alleles in mice / mutant alleles in cell lines only / total phenotype annotations (Mammalian Phenotype, MP): 24,007 12, , ,139
QTL: 4,436
Human diseases w/ one or more mouse model: 1005
Gene Expression Assays: 37,584
Integrated mouse nucleotide sequences + ESTs: >9,994,000
refSNPs: >10,089,000
References: 153,161
…plus strains, expression and phenotype images, tumor records, etc.

Integration in MGI Identify objects. Resolve discrepancies. Integration is key to knowledge discovery

Integration is hard…not just a matter of combining data sources…
Data from multiple sources can be of differing quality.
The same data can enter the system via various paths.
Naming conventions may or may not be to standards.
Some data sources don’t maintain unique accession numbers (or allow them to change).
Periodic updates from data sources can cause problems if objects have disappeared (or reappeared), or if objects have split in two.
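The update problem on this slide (objects that disappear or reappear between releases of a source) can be illustrated with a small set comparison. This is only a sketch under assumptions: the function name `diff_releases`, the idea of keeping a list of previously retired IDs, and the `ACC:` accession IDs are all hypothetical and are not taken from MGI's loads.

```python
def diff_releases(previous_ids, current_ids, retired_ids=()):
    """Compare two releases of one data source by accession ID.

    retired_ids are IDs that vanished in some earlier release (hypothetical
    bookkeeping kept by the loader, not part of the source itself).
    """
    previous, current, retired = set(previous_ids), set(current_ids), set(retired_ids)
    arrived = current - previous  # IDs present now but absent from the last release
    return {
        "disappeared": sorted(previous - current),
        "reappeared": sorted(arrived & retired),
        "new": sorted(arrived - retired),
    }


# Made-up accession IDs, for illustration only.
report = diff_releases(
    previous_ids=["ACC:1", "ACC:2", "ACC:3"],
    current_ids=["ACC:0", "ACC:2", "ACC:3", "ACC:4"],
    retired_ids=["ACC:0"],
)
print(report)
```

Detecting a split (one object becoming two) needs more than ID sets, for example coordinates or sequence associations, so it is left out of this sketch.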

Annotation Pipeline
Data Acquisition: Literature & Loads
Object Identity: New Gene, Strain or Sequence?
Standardizations: Controlled Vocabularies
Data Associations: Evidence & Citation
Integration with other bioinformatics resources: Co-curation of shared objects and concepts

Data integration is hard. “Bucketizing” establishes types of correspondence between objects in the input sets. It allows immediate incorporation of 1:1 corresponding data, and sorts conflicting data into bins that allow prioritization for curator resolution.
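One way to picture bucketizing is as grouping linked objects from two input sets into connected components and labelling each component by its shape (1:1, 1:n, n:1, n:m, or unmatched). The sketch below assumes that reading; the function names, bucket labels and the `g`/`v` identifiers are hypothetical, and MGI's actual bucketizer may work differently.

```python
def label(n_a, n_b):
    """Name a bucket from the number of set-A and set-B objects in it."""
    if n_a > 1 and n_b > 1:
        return "n:m"
    left = "n" if n_a > 1 else str(n_a)
    right = "n" if n_b > 1 else str(n_b)
    return f"{left}:{right}"


def bucketize(ids_a, ids_b, links):
    """Group objects from two input sets by how they correspond."""
    # Undirected bipartite graph; nodes are tagged with their input set.
    graph = {("A", a): set() for a in ids_a}
    graph.update({("B", b): set() for b in ids_b})
    for a, b in links:
        graph.setdefault(("A", a), set()).add(("B", b))
        graph.setdefault(("B", b), set()).add(("A", a))

    buckets, seen = {}, set()
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:  # depth-first walk over one connected component
            node = stack.pop()
            members.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        in_a = sorted(name for side, name in members if side == "A")
        in_b = sorted(name for side, name in members if side == "B")
        buckets.setdefault(label(len(in_a), len(in_b)), []).append((in_a, in_b))
    return buckets


# Hypothetical IDs: "g" objects from one provider, "v" objects from another.
result = bucketize(
    ["g1", "g2", "g3", "g4"],
    ["v1", "v2", "v3", "v4"],
    [("g1", "v1"), ("g2", "v2"), ("g2", "v3")],
)
```

Here the `"1:1"` bucket could be loaded directly, while `"1:n"`, `"1:0"` and `"0:1"` would be queued for review, mirroring the prioritization the slide describes.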

VEGA annotated three distinct genes instead of multiple transcripts for a single gene (Mvk) chr5:

Why resolve and integrate data?
1. Allows you to find all the data:
Example: I want all the sequences from GenBank that are from C57BL/6. There are >100 different versions of this strain name in GenBank files, e.g. B6BL/6C57BL76J57BL/6J Black-6JB6C57Black/6black six …..ETC…
Example: You find several papers describing different phenotypes of knockouts of the Fgfr2 gene. The knockout alleles are just called Fgfr2 -/-. Help! There are 14 different targeted alleles of Fgfr2 (knockout/knockin); each has a unique symbol and MGI-ID, different phenotype annotations, and they are models of different human diseases. All are associated with their respective references.
MGI has curated these data. You can ask these questions!
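The strain-name example can be sketched as a lookup from free-text variants to one controlled name. The synonym table below is a tiny hand-made illustration, not MGI's curated list, and the function name `canonical_strain` is an assumption.

```python
# Illustrative synonym table (hypothetical, far from complete).
STRAIN_SYNONYMS = {
    "c57bl/6": "C57BL/6",
    "b6": "C57BL/6",
    "bl/6": "C57BL/6",
    "black six": "C57BL/6",
}


def canonical_strain(raw_name):
    """Return the controlled strain name, or None if a curator must decide."""
    # Collapse runs of whitespace and ignore case before the lookup.
    key = " ".join(raw_name.split()).casefold()
    return STRAIN_SYNONYMS.get(key)


print(canonical_strain("  Black   Six "))
print(canonical_strain("some unrecognised name"))
```

Returning `None` rather than guessing keeps unrecognised names visible, so they can be routed to manual curation instead of being silently mis-assigned.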

Why resolve and integrate data?
2. Allows you to discriminate ambiguous data
Example: I want information for mouse gene Tap. Which gene? There are 5 genes published as Tap. Each of these genes has Tap as a synonym.
Chr 15: Ly6a, lymphocyte antigen 6 complex, locus A
Chr 19: Nxf1, nuclear RNA export factor 1 homolog (S. cerevisiae)
Chr 11: Sec14l2, SEC14-like 2 (S. cerevisiae)
Chr 17: Tap1, transporter 1, ATP-binding cassette, sub-family B (MDR/TAP)
Chr 5: Uso1, USO1 homolog, vesicle docking protein (yeast)
P.S. Gene Gnas has 20 synonyms
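The Tap example amounts to a synonym index in which one name can resolve to several official symbols. The sketch below uses only the five genes and chromosomes listed on the slide; the data layout, the function names and the case-insensitive matching are assumptions made for illustration.

```python
from collections import defaultdict

# (official symbol, chromosome, synonyms) as listed on the slide.
GENES = [
    ("Ly6a", "15", ["Tap"]),
    ("Nxf1", "19", ["Tap"]),
    ("Sec14l2", "11", ["Tap"]),
    ("Tap1", "17", ["Tap"]),
    ("Uso1", "5", ["Tap"]),
]


def build_index(genes):
    """Map every symbol and synonym (case-insensitively) to official symbols."""
    index = defaultdict(set)
    for symbol, _chromosome, synonyms in genes:
        for name in [symbol, *synonyms]:
            index[name.casefold()].add(symbol)
    return index


def lookup(index, name):
    """Return all official symbols a name could refer to, sorted."""
    return sorted(index.get(name.casefold(), ()))


INDEX = build_index(GENES)
print(lookup(INDEX, "Tap"))   # ambiguous: several candidates
print(lookup(INDEX, "Tap1"))  # unambiguous official symbol
```

A query that returns more than one symbol is the signal to ask the user "which gene?" rather than picking one.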

Why resolve and integrate data?
3. In addition to resolving object identification issues, integration allows you to ask complex questions that span data sets and data types from different sources:
Example: What genes on Chromosome 11 have mutant alleles that display phenotypes of hydrocephaly and hypertension?
Example: Provide me with a list of RefSeq IDs where the gene corresponding to the sequences shows expression in embryos at days and is involved in the biological process (GO) of apoptosis.
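The first question above joins three kinds of data: gene location, allele-to-gene association, and allele phenotypes. A minimal sketch with invented records follows; `GeneA`, `GeneB`, `GeneC` and their alleles are placeholders rather than real genes, and pooling phenotypes across all alleles of a gene is one possible reading of the question.

```python
# Invented toy records standing in for three integrated data sets.
GENE_CHROMOSOME = {"GeneA": "11", "GeneB": "11", "GeneC": "2"}
ALLELE_GENE = {
    "GeneA-m1": "GeneA",
    "GeneA-m2": "GeneA",
    "GeneB-m1": "GeneB",
    "GeneC-m1": "GeneC",
}
ALLELE_PHENOTYPES = {
    "GeneA-m1": {"hydrocephaly"},
    "GeneA-m2": {"hypertension"},
    "GeneB-m1": {"hydrocephaly"},
    "GeneC-m1": {"hydrocephaly", "hypertension"},
}


def genes_with_phenotypes(chromosome, required):
    """Genes on a chromosome whose alleles together show all required phenotypes."""
    pooled = {}
    for allele, gene in ALLELE_GENE.items():
        if GENE_CHROMOSOME.get(gene) == chromosome:
            pooled.setdefault(gene, set()).update(ALLELE_PHENOTYPES.get(allele, set()))
    return sorted(gene for gene, terms in pooled.items() if set(required) <= terms)


print(genes_with_phenotypes("11", {"hydrocephaly", "hypertension"}))
```

The join only works because every data set refers to genes and alleles by the same resolved identifiers, which is the point the slide makes about integration.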

Integration requires consistent semantics
Controlled vocabularies/nomenclatures: Strains; Genes; Alleles (phenotypic or variant); Classes of genetic markers; Types of mutations; Types of assays; Developmental stages; Tissues; Clone libraries; ES cell lines; …..
organized as lists or simple hierarchies

Example: Ldb1 (LIM domain binding 1) gene expression in CD-1 mice. Annotated fields shown: Assay Type, Gene nomenclature, Strain, Age, Results.

Semantics plus relationship data
Ontologies/structured vocabularies:
Gene Ontology (GO): Molecular function; Biological process; Cellular component
Mouse Anatomy (MA): Embryonic; Adult
Mammalian Phenotype (MP)
Sequence Ontology (SO)
…..
organized as directed acyclic graphs (DAGs)
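What distinguishes a DAG from a simple hierarchy is that a term may have more than one parent, so a query for a general term has to collect everything reachable through any path. The sketch below uses placeholder term IDs (`T:1` to `T:4`) rather than real GO, MA or MP terms, and the function name is an assumption.

```python
# Child -> parents. T:4 has two parents, which a simple tree could not express.
PARENTS = {
    "T:1": [],
    "T:2": ["T:1"],
    "T:3": ["T:1"],
    "T:4": ["T:2", "T:3"],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:  # a shared ancestor is visited only once
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


print(sorted(ancestors("T:4", PARENTS)))
```

With this closure in hand, an annotation made to `T:4` can also be returned by searches on `T:2`, `T:3` or `T:1`.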

Mammalian Phenotype Ontology
Structured as a DAG
Over 7324 terms covering physiological systems, behavior, development and survival
Available in browser and in OBO file formats from the MGI FTP and OBO Foundry sites

Annotating Gene Products using GO
Each annotation combines four parts: Gene Product (e.g. P05147), GO Term (GO: ), Evidence (e.g. IDA) and Reference (PMID: ).
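The four parts of a GO annotation named on this slide map naturally onto a small record type. This is a sketch only: the class name, field names and validation rules are assumptions, the evidence-code set is a partial list of GO evidence codes, and the GO term and PMID values are left as placeholders because the slide's numbers did not survive.

```python
from dataclasses import dataclass

# Partial list of GO evidence codes (IDA = Inferred from Direct Assay).
KNOWN_EVIDENCE_CODES = {
    "IDA", "IMP", "IGI", "IPI", "IEP", "ISS", "IEA", "TAS", "NAS", "IC", "ND",
}


@dataclass(frozen=True)
class GoAnnotation:
    gene_product: str   # e.g. a protein accession such as P05147
    go_term: str        # GO term ID
    evidence_code: str  # how the assertion is supported
    reference: str      # where it was published

    def __post_init__(self):
        if self.evidence_code not in KNOWN_EVIDENCE_CODES:
            raise ValueError(f"unknown evidence code: {self.evidence_code}")
        if not self.reference:
            raise ValueError("an annotation must cite a reference")


# Placeholder GO and PMID values; the real IDs are not recoverable here.
example = GoAnnotation("P05147", "GO:<term id>", "IDA", "PMID:<id>")
print(example)
```

Requiring both an evidence code and a reference at construction time reflects the slide's point that every annotation carries its evidence and citation.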

Data sources
Primary literature
Centers: mutagenesis, gene trap, etc.
Data Loads: GenBank, SNPs, clone collections, UniProt, RIKEN, IKMC, etc.
Electronic Submissions (individual labs)
Processing, QC, and curation
Gather data from multiple sources
Factor out common objects
Assemble integrated objects
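The three processing steps (gather, factor out common objects, assemble integrated objects) can be sketched as merging per-source records under a shared object ID while recording any disagreements. Everything here is hypothetical: the record layout, the `OBJ:` identifiers, the source names and the conflict bookkeeping are illustrative assumptions.

```python
def assemble(records):
    """Merge (source, object_id, fields) records into one object per ID."""
    integrated = {}
    for source, object_id, fields in records:
        # Factor out the common object: all records with one ID share an entry.
        obj = integrated.setdefault(
            object_id, {"sources": [], "fields": {}, "conflicts": {}}
        )
        obj["sources"].append(source)
        for name, value in fields.items():
            existing = obj["fields"].setdefault(name, value)
            if existing != value:
                # Keep every competing value so a curator can resolve it.
                obj["conflicts"].setdefault(name, {existing}).add(value)
    return integrated


# Invented records gathered from two made-up sources.
gathered = [
    ("sourceA", "OBJ:1", {"symbol": "Abc", "chromosome": "5"}),
    ("sourceB", "OBJ:1", {"symbol": "Abc", "chromosome": "6"}),
    ("sourceB", "OBJ:2", {"symbol": "Xyz"}),
]
merged = assemble(gathered)
print(merged["OBJ:1"]["conflicts"])
```

The first value seen is kept as the working value in this sketch; a real load would more likely apply source priorities or defer to curation.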

Data sourcing for MGI Data from major providers (e.g. Ensembl, UniProt) and from data project Centers (e.g. gene trap, ENU mutagenesis centers) are generally reliably formatted, though data may still have QC issues. Occasional changes in format can be frustrating. Data from individual research labs vary greatly in file formats and adherence to nomenclature & usually are handled on a case-by-case basis. Scientific literature is a reflection of individual labs (largely), & must be treated as using non-standard nomenclatures – but awareness is improving!

Data sourcing for MGI (…wishes)
more user contributions: pre-publication nomenclature assignments; data submissions (data can be held private until publication)
journal permissions for images - have some
in progress (collaborations on raw phenotype data exchange with European and Japanese mouse mutagenesis and knockout groups)

Building a mouse phenotyping data resource
Large scale ENU mutagenesis programs worldwide - continuing
Large scale gene trap programs (International Gene Trap Consortium) - gene trap cell lines loaded, with Lexicon
International Mouse Knockout Consortium: KOMP, Knockout Mouse Project (USA); EUCOMM, European Conditional Mouse Mutagenesis; NorCOMM, North American Conditional Mouse Mutagenesis; Texas Institute for Genomic Medicine Knockouts
Collaborative Cross
Literature and lab submissions
New recombinase (cre, flp, etc.) and reporter database is online and data is being populated

BREADTH: Large scale screen for potential phenotypic outliers DEPTH: Phenotypic description of mutant genotype(s)

SUMMARY: Integration in MGI
is accomplished through a combination of automatic & semi-automatic loads & QC processing, followed by manual curation.
requires applying semantic consistency using standard nomenclatures, ontologies and structured vocabularies.
provides users with the ability to find data that would otherwise not be found or would be ambiguous.
allows complex questions spanning different data sets and data areas to be asked.

SUMMARY: Data Sourcing in MGI
includes data from major genome resources and mouse centers, as well as individual lab submissions and curated information from scientific literature.
requires QC processing for format consistency; for some (individual) labs, case-by-case assistance.
for new large-scale phenotyping activities, integrate data with common curation of MP ontology; connect with raw data (international collaboration).
continue to work with community and journals to allow easier data access.

Bar Harbor, Maine MGI is funded by : NHGRI grants HG000330, HG002273, HG NICHD grant HD NCI grant CA