Core 2: Bioinformatics NCBO-Berkeley

Berkeley Drosophila Genome Project
- Finish the sequence of the euchromatic genome of Drosophila melanogaster
- Annotated biologically important features of this sequence
- Produced gene disruptions using P element-mediated mutagenesis
- Full-length sequencing and expression characterization of a cDNA for every gene
- Developing informatics tools

Who is here from NCBO-Berkeley
- Sima
- Chris
- Mark
- Shu

Chris
- GadFly database schema
- GO database schema
- Chado database schema
- Perl libraries for all
- OBD data architect

Shu
- AmiGO, ImaGO & database
- Compute pipeline
- OBD dev & data flow

Mark
- Apollo Genome Annotation Editor
- Phenote and other OBD interfaces

Sima
- Adh region annotation
- Annotation of entire Drosophila genome
- Project manager and coordinator nonpareil
- Associate Director

OBD Outline
- Core 2 aims, refresher
- Data models for OBD
  - phenotypes
  - clinical trials
  - others
- Modeling frameworks
  - exchange formats
  - database system
  - SQL based vs ‘SemWeb’ dbs
- Progress
- Demo

Core 2 Specific Aims
1. Apply ontologies
   - Software toolkit for describing and classifying data
2. Capture, manage, and view data annotations
   - Database (OBD) and interfaces to store and view annotations
3. Investigate and compare implications
   - Linking human diseases to model systems
4. Maintain
   - Ongoing reconciliation of ontologies with annotations

Core 3 Driving Biological Projects
- DBPs
  - phenotypes: fly and zebrafish to human
  - clinical trials
- Core 2 Aims
  1. Apply ontologies to describe data
  2. Capture, manage, and view data annotations
  3. Link disease genes to model systems
  4. Reconcile annotation and ontology changes

Apply ontologies to describe data
- Requirements
  - Data capture tools
    - Phenote (demo tomorrow)
    - no tool requirements from UCSF
  - Data model
  - Database (OBD)
    - aim 2

data flow

user’s view

Data models
- Common/shared domain-specific models
- Aim 3
  - linking disease genes
  - model must support this
    - granularity
    - comparability

Domain-specific data models
- FB, ZFIN
  - genotype to phenotype
    - ‘EAV’
    - qualities inhere in entities
  - orthologs
  - phenotype to disease
  - Core 2 will help define common model
- UCSF
  - clinical trials
  - existing ontology-friendly schema: Trialbank

Phenotype data model
- Qualities inhere in entities
- Entity term; PATO term
  - brain FBbt: ; fused PATO:
  - gut MA: ; dysplastic PATO:
  - tail fin ZDB: ; ventralized PATO:
  - kidney ZDB: ; hypertrophied PATO:
  - midface ZDB: ; hypoplastic PATO:
- Pre-composed phenotype terms
  - Mammalian Phenotype Ontology
    - “increased activated B-cell number” MPO:
    - “pink fur hue” MPO:
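The entity-plus-quality pairing on this slide can be sketched as a small data structure. This is only an illustration, not the OBD schema: the class names `Term` and `Phenotype` are invented here, and the identifiers are placeholders because the numeric parts of the IDs are missing from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """An ontology term: an identifier plus a human-readable label."""
    term_id: str
    label: str


@dataclass(frozen=True)
class Phenotype:
    """A quality (PATO term) that inheres in an entity (anatomy term)."""
    entity: Term
    quality: Term

    def describe(self) -> str:
        # Render the pair the way the slide reads it, e.g. "fused brain".
        return f"{self.quality.label} {self.entity.label}"


# Placeholder IDs (assumption): the real accession numbers are not on the slide.
brain = Term("FBbt:EXAMPLE", "brain")
fused = Term("PATO:EXAMPLE", "fused")
fused_brain = Phenotype(entity=brain, quality=fused)
print(fused_brain.describe())
```

A pre-composed term such as “pink fur hue” would instead be a single `Term`; relating the two styles needs a logical definition for the pre-composed term.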

Extensions to simple model
- What about
  - Relational attributes
  - Quantitative vs qualitative
  - Post-composing entity and attribute terms
  - Relative states/values
  - Variation in place, space and time
  - A better treatment of absence
- See CSHL Pheno meeting talk
  - also, more detailed formal presentation (available)
- Not to mention genotypes, environments, provenance, etc.

Modeling clinical trials
- Model already described using frame-based schema
- Further modeling required?
  - abstraction
    - to integrate more with other OBD datatypes
  - views
    - to only show parts relevant to OBD/BioPortal

Future DBPs and use cases
- OBD will contain a variety of general types of data
- Modeling is expensive
  - use existing models where appropriate
  - but whole must be cohesive and integrated
- Most of this talk focuses on the pheno DBPs for illustrative purposes

Modeling frameworks
- language
- technology

Modeling data: underlying formalism
- Model is expressed with a modeling language
- Options
  - Relational/SQL
  - Semi-structured, XML
  - Object-centric (UML, frame-based?)
  - Logic based
    - description logic: e.g. OWL
    - first-order logic: e.g. CL
  - Natural language descriptions
- Model should be independent of the language it is expressed in

Data exchange language: XML
- Simple
  - XML is suited for data exchange
- XML can drive software spec
  - constrains programmatic data model
  - XSD can generate UML
  - closed world assumption is useful
    - cf Ruttenberg et al
- Mature technology
  - well understood by developers, MODs
  - standards

How OBD uses XML
- obd-geno-pheno-xml (aka pheno-xml)
  - actually multiple modular components
    - genotype schema
    - phenotype schema: ‘EAV’
    - environment schema
    - provenance schema
  - used as
    - exchange format
    - cf: gene ontology association files
- no need for ClinicalTrials-XML

Example pheno-xml ZFIN:tm84
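The XML on this slide did not survive in the transcript. As a rough illustration of the modular idea (genotype, phenotype as entity plus quality, provenance), the sketch below builds one record with Python's standard library. The element and attribute names are assumptions made for this example and are not the actual pheno-xml schema; the entity and quality IDs are placeholders.

```python
import xml.etree.ElementTree as ET


def build_annotation(genotype_id, entity_id, quality_id, source):
    """Assemble one hypothetical genotype-phenotype record as an XML element."""
    root = ET.Element("annotation")
    # Genotype component
    ET.SubElement(root, "genotype", {"id": genotype_id})
    # Phenotype component: an entity term paired with a quality term
    phenotype = ET.SubElement(root, "phenotype")
    ET.SubElement(phenotype, "entity", {"id": entity_id})
    ET.SubElement(phenotype, "quality", {"id": quality_id})
    # Provenance component
    ET.SubElement(root, "provenance", {"source": source})
    return root


# "ZFIN:tm84" comes from the slide title; the other values are placeholders.
record = build_annotation("ZFIN:tm84", "ZDB:EXAMPLE", "PATO:EXAMPLE", "example-curator")
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```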

SQL Databases
- Data storage, management and querying
  - all MODs use SQL dbs
- Lots of advantages
  - scalable, standard QL, mature, APIs, etc
  - pure relational model is reasonably formal
- XML/SQL more or less compatible
  - low impedance mismatch

Schemas for geno-pheno data
- We already have a schema: Chado
  - Used by many MODs (e.g. FB)
  - others are ‘chado compliant’ (e.g. ZFIN)
  - Modular
    - ontologies
    - genomic
    - genotype
    - phenotype
    - phylogenies
    - …etc
- Phenotype module needs updating
  - will be driven by pheno-xml

Problem solved?
- We have two mature, complementary technologies, and can define schemas for our model in an appropriate formalism for each
- Is this enough to work with?

Issues
- OBD will be much more than geno-pheno
  - clinical trials
  - future DBPs, other NCBCs
  - any data expressed in an ontology language
- Software and schema development is expensive
  - fragility in the face of schema evolution
  - development gets bogged down in data exchange issues

Major issue
- SQL and XPath work great for ‘traditional’ data…
- …but are too low level for ontology-centric data
  - lack of inference
  - no way to directly express ontology constraints

Use cases from previous experience: AmiGO
- GO
  - “find all TF genes” (is_a closure)
  - “find all gene products localised to endoplasmic reticulum” (part_of closure, over is_a)
- Our solution (AmiGO & go-sqldb)
  - pre-compute transitive closure over all relations in db
  - (sort of) works for GO (for now)
    - refresh problem
    - explosive for tangled DAGs
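Pre-computing a transitive closure can be sketched in a few lines. This is a generic illustration, not the go-sqldb implementation; the term labels are made up, and only a single relation (is_a) is handled, whereas combining part_of with is_a would need extra composition rules.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Return every (descendant, ancestor) pair reachable through the edges."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)

    closure = set()
    for start in list(parents):
        stack = list(parents[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            # Keep walking upwards from this ancestor.
            stack.extend(parents.get(node, ()))
    return closure


# Toy is_a chain with invented labels (assumption, not real GO content).
is_a = [
    ("C2H2 zinc finger TF", "transcription factor"),
    ("transcription factor", "nucleic acid binding"),
    ("nucleic acid binding", "molecular function"),
]
closure = transitive_closure(is_a)
print(len(is_a), "asserted links ->", len(closure), "closure rows")
```

Even this three-link chain yields six closure rows, and every ontology edit means recomputing them, which is the refresh and growth problem the slide mentions.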

OBD requires more ontological awareness
- Other relations
  - ontogenic (e.g. derives_from)
  - transitive_over
- Other types of data
- Pre- versus post-composed terms
  - E.g. MPO versus AO+PATO
  - E.g. Entity+Spatial qualifier
  - queries over either should be interchangeable
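One way to make queries over pre- and post-composed terms interchangeable is to expand each pre-composed term into its entity and quality parts before matching. The sketch below assumes a hypothetical lookup table of logical definitions (`LOGICAL_DEFS`); it is an illustration of the idea, not how OBD does it.

```python
# Hypothetical logical definitions for pre-composed terms (assumption):
# each maps a pre-composed label to an (entity, quality) pair.
LOGICAL_DEFS = {
    "pink fur hue": ("fur", "pink"),
}


def normalise(annotation):
    """Return an (entity, quality) pair for either style of annotation."""
    if isinstance(annotation, str):
        # Pre-composed term: expand it using its logical definition.
        return LOGICAL_DEFS[annotation]
    # Post-composed annotation is already an (entity, quality) pair.
    return annotation


def matches(annotation, entity, quality):
    return normalise(annotation) == (entity, quality)


annotations = ["pink fur hue", ("fur", "pink"), ("tail fin", "ventralized")]
hits = [a for a in annotations if matches(a, "fur", "pink")]
print(hits)
```

Both the pre-composed string and the post-composed pair are returned for the same query, which is the behaviour the last bullet asks for.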

Solution: more expressive formalisms
- QLs and APIs should provide and abstract away common ontology operations
  - ease of programming, optimisation
- Choices
  - ‘SemWeb’ databases
    - RDF + RDFS + OWL [lite + DL] + extra
    - lots to choose from, emerging standards
    - compatible with Obo v1.2 spec
  - Deductive databases
    - superset of relational databases
    - from Prolog to full CL

Modeling phenotypes as RDF/OWL or Obo instances
[Figure: classes/terms and instances; entity and quality]

Example query in SeRQL
Find mutations affecting the shape of the wing vein:

    SELECT DISTINCT EI, ET, OrgI, QI, QT, QN
    FROM {EI} rdf:type {ET} rdfs:label {EN},
         {EI} OBO_REL_part_of {OrgI} rdf:type {Tax} rdfs:label {TaxN},
         {EI} OBO_REL_has_quality {QI} rdf:type {QT} rdfs:label {QN}
    WHERE label(EN) = "wing vein"
      AND label(TaxN) = "Arthropoda"
      AND label(QN) = "ShapeValue"

Results of query on OBD-sesame: one annotation to “wing vein L2”, “branched”
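The shape of this query (three path expressions joined on the entity instance) can be mimicked over an in-memory set of triples. The sketch below is a toy in plain Python, not Sesame or SeRQL: all identifiers are invented, and it does no RDFS inference, so the quality instance is typed directly as `ShapeValue`.

```python
# Toy triple store; every identifier below is invented for illustration.
TRIPLES = {
    ("WingVein", "rdfs:label", "wing vein"),
    ("Arthropoda", "rdfs:label", "Arthropoda"),
    ("ShapeValue", "rdfs:label", "ShapeValue"),
    ("ColorValue", "rdfs:label", "ColorValue"),
    ("fly1", "rdf:type", "Arthropoda"),
    ("vein1", "rdf:type", "WingVein"),
    ("vein1", "part_of", "fly1"),
    ("vein1", "has_quality", "q1"),
    ("q1", "rdf:type", "ShapeValue"),
    ("vein2", "rdf:type", "WingVein"),
    ("vein2", "part_of", "fly1"),
    ("vein2", "has_quality", "q2"),
    ("q2", "rdf:type", "ColorValue"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the store."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def has_type_labelled(instance, label):
    """True if some asserted type of the instance carries the given label."""
    return any(label in objects(cls, "rdfs:label")
               for cls in objects(instance, "rdf:type"))


def find(entity_label, taxon_label, quality_label):
    """Join the three path patterns on the shared entity instance."""
    hits = []
    for entity, predicate, _ in sorted(TRIPLES):
        if predicate != "rdf:type" or not has_type_labelled(entity, entity_label):
            continue
        for organism in objects(entity, "part_of"):
            if not has_type_labelled(organism, taxon_label):
                continue
            for quality in objects(entity, "has_quality"):
                if has_type_labelled(quality, quality_label):
                    hits.append((entity, organism, quality))
    return hits


results = find("wing vein", "Arthropoda", "ShapeValue")
print(results)
```

Only `vein1` is returned: `vein2` has a quality, but not one typed as a shape. In a real store the match on “branched” would rely on subsumption between the specific quality and `ShapeValue`, which this toy omits.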

Advantages of ‘SemWeb’ dbs
- Advantages over pure SQL
  - The ontology is the model
    - constraints encoded in ontology
      - e.g. certain quality types only applicable to certain entity types
    - agile development, fast database integration
  - Rich modeling constructs
    - transitivity, subsumption, intersection, etc
    - powerful QLs and APIs
  - More (technical) interoperation ‘for free’
    - URIs
    - proven?
  - Open World Assumption (maybe a hindrance?)

Disadvantages of ‘SemWeb’ dbs
- Disadvantages
  - speed
    - may be slower than SQL... but in-memory execution is fast
  - lack of maturity
    - new technology... but has a LOT of momentum
  - foundations
    - are RDF triples appropriate?
    - inherent difficulties modeling time
    - SQL allows n-ary relations/predicates
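The n-ary point can be made concrete: a fact with more than two participants (for example entity, quality and developmental stage) does not fit in one triple, so it is usually encoded as an extra node with one triple per role. The helper below is a generic sketch of that pattern; the function name `reify`, the role names and the stage value are assumptions for illustration.

```python
from itertools import count

_ids = count(1)


def reify(relation, **roles):
    """Encode one n-ary fact as binary triples hanging off a fresh node."""
    node = f"_:stmt{next(_ids)}"
    triples = [(node, "rdf:type", relation)]
    # One triple per participant in the n-ary relation.
    triples += [(node, role, value) for role, value in roles.items()]
    return triples


# A three-place fact: one SQL row, but four triples once reified.
fact = reify(
    "PhenotypeObservation",
    entity="tail fin",
    quality="ventralized",
    stage="24hpf",  # hypothetical stage label
)
for triple in fact:
    print(triple)
```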

Hybrid model
- SemWeb dbs are commonly layered over SQL DBs
- We can have the best of both worlds
- Data View layers
  - mapping between Obo/OWL model and domain-specific relational schema
  - (optionally) materialized for speed
  - different applications use appropriate layer
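A view layer of this kind can be sketched with an SQL view that exposes relational rows as subject/predicate/object triples. The example uses Python's built-in `sqlite3` with an invented one-table schema and invented predicate names; it is not Chado or the OBD mapping, only a minimal picture of the layering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical domain-specific relational table (assumption).
CREATE TABLE phenotype (
    id INTEGER PRIMARY KEY,
    genotype TEXT,
    entity TEXT,
    quality TEXT
);
INSERT INTO phenotype (genotype, entity, quality)
VALUES ('g1', 'tail fin', 'ventralized');

-- View layer: the same rows presented as triples.
CREATE VIEW triple AS
    SELECT 'pheno' || id AS subject, 'has_genotype' AS predicate, genotype AS object
    FROM phenotype
    UNION ALL
    SELECT 'pheno' || id, 'inheres_in', entity FROM phenotype
    UNION ALL
    SELECT 'pheno' || id, 'has_quality', quality FROM phenotype;
""")

rows = conn.execute(
    "SELECT subject, predicate, object FROM triple ORDER BY predicate"
).fetchall()
for row in rows:
    print(row)
```

Applications that want rows query `phenotype`; applications that want triples query `triple`. Materializing the view into a table would trade freshness for speed.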

Current progress: OBD-Sesame
- Sesame
  - open source ‘triple store’
  - based on Jena
    - also used in Protégé-OWL
  - storage layer options
    - mysql/postgresql generic schema
    - in-memory
    - disk-based

OBD in Sesame: current datasets
- Pheno
  - ZFIN & FB: EAV trial 2003 data
    - Test ortholog set
  - FB ‘simple phenotype’ alleles
  - ZFIN legacy phenotype data, automatically parsed to EAV
  - Ontologies: AOs, PATO, Cell, GO
  - Method
    - excel & flatfiles -> pheno-xml -> owl
    - OWL from
- Trialbank
  - Method: ocelot -> obo-xml -> owl
- Soon
  - human orthologs and OMIM

Technology Evaluation: Sesame
- Use case query set
- Benchmarks
  - preliminary conclusions
    - SQL layering is terrible
    - in-memory is fast
  - optimisations?
  - other triple stores?
  - up-to-date results on wiki
- Need to test OWL-DL entailment
- Bigger dataset required for full evaluations
- Community effort: pub-semweb-lifesci list

Parallel development: an OBD Prototype
- Initiated prior to OBD-Sesame
- Simple deductive database
  - prolog-based
  - chado-like schema
    - can be views on Obo/OWL predicates
  - amigo-clone user interface
- Rapid prototyping
- Current dataset
  - as obd-sesame, plus CT
  - trivial to drop in more

Example logic query
Find mutations affecting the shape of some part of the head capsule:

    inheres(QI,EI) &
    inst(QI,QT) &
    label(QT,shape) &
    inst(EI,ETP) &
    part_of*(ETP,ET) &
    label(ET,'head capsule')

Results of query on OBD-prolog: one annotation to “arista lateral”, “irregular shape”
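The conjunction above can be traced by hand over a handful of facts. The sketch below evaluates the same pattern in plain Python rather than Prolog; the facts, including the part_of chain, are invented, and `part_of*` is assumed to mean the reflexive transitive closure of part_of.

```python
# Invented toy facts (assumption); predicate names follow the query above.
INHERES = {("q1", "e1")}
INST = {("q1", "Shape"), ("e1", "AristaLateral")}
LABEL = {
    "Shape": "shape",
    "AristaLateral": "arista lateral",
    "Antenna": "antenna",
    "HeadCapsule": "head capsule",
}
PART_OF = {("AristaLateral", "Antenna"), ("Antenna", "HeadCapsule")}


def part_of_star(start):
    """Classes reachable from start via zero or more part_of links."""
    reached, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for part, whole in PART_OF:
            if part == node and whole not in reached:
                reached.add(whole)
                frontier.append(whole)
    return reached


def query(quality_label, whole_label):
    """inheres(QI,EI) & inst(QI,QT) & label(QT,..) & inst(EI,ETP) & part_of*(ETP,ET) & label(ET,..)"""
    answers = []
    for qi, ei in INHERES:
        for q, qt in INST:
            if q != qi or LABEL.get(qt) != quality_label:
                continue
            for e, etp in INST:
                if e != ei:
                    continue
                if any(LABEL.get(et) == whole_label for et in part_of_star(etp)):
                    answers.append((qi, ei, LABEL[etp]))
    return answers


answers = query("shape", "head capsule")
print(answers)
```

A deductive database would derive `part_of*` from a rule rather than a hand-written loop, which is the abstraction the prototype is meant to provide.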

OBD TODO
- Pheno-xml
  - finalise release version
  - finalise Obo/OWL mapping
  - logic specification
- Data
  - orthologies
- OBD - BioPortal integration
  - how will it work?
- Versioning and reconciling changes
  - decide on ontology versioning first

OBD dependencies
- PATO development
- UMLS into OBO-site
- Ontologies
  - FMA accessibility?
  - species-centric AO alignments (XSPAN?)
  - Sept meeting on AO development
  - Nov meeting on disease ontologies
- Data
  - MOD pheno annotation
  - OMIM annotation
- BioPortal

Misc
- NLP for Phenote
  - Obol
    - trial on evolutionary phenotype characters
  - Cambridge NLP project
  - can be used to ‘prime’ Phenote
- Decomposing MPO
  - pink fur def= fur, has_quality: pink

Discussion
- Will SemWeb dbs work?
  - experiment
- Ontology-based modeling
  - the ontology is the model
  - importance of
    - relations ontology
    - upper ontology

Demos
- eset.jsp?repository=mem-rdfs-db
- eset.jsp?repository=mem-rdfs-db