Published by Curtis Hopkins. Modified over 9 years ago.
1
Core 2: Bioinformatics NCBO-Berkeley
2
Berkeley Drosophila Genome Project
Finish the sequence of the euchromatic genome of Drosophila melanogaster
Annotate biologically important features of this sequence
Produce gene disruptions using P element-mediated mutagenesis
Full-length sequencing and expression characterization of a cDNA for every gene
Develop informatics tools
3
Who is here from NCBO-Berkeley: Sima, Chris, Mark, Shu
4
Chris: GadFly database schema; GO database schema; Chado database schema; Perl libraries for all; OBD data architect
5
Shu: AmiGO, ImaGO & database; Compute Pipeline; OBD dev & data flow
6
Mark: Apollo Genome Annotation Editor; Phenote and other OBD interfaces
7
Sima: Adh region annotation; annotation of entire Drosophila genome; project manager and coordinator nonpareil; Associate Director
8
OBD Outline
Core 2 aims, refresher
Data models for OBD: phenotypes, clinical trials, others
Modeling frameworks: exchange formats; database system (SQL-based vs ‘SemWeb’ dbs)
Progress
Demo
9
Core 2 Specific Aims
1. Apply ontologies: software toolkit for describing and classifying data
2. Capture, manage, and view data annotations: database (OBD) and interfaces to store and view annotations
3. Investigate and compare implications: linking human diseases to model systems
4. Maintain: ongoing reconciliation of ontologies with annotations
10
Core 3 Driving Biological Projects DBPs phenotypes: Fly and Zebrafish to human clinical trials Core 2 Aims 1. Apply ontologies to describe data 2. Capture, manage, and view data annotations 3. Link disease genes to model systems 4. Reconcile annotation and ontology changes
11
Apply ontologies to describe data Requirements Data capture tools phenote demo tomorrow no tool requirements from UCSF Data model Database (OBD) --aim 2
12
data flow
13
user’s view
14
Data models Common/shared domain specific models Aim 3 linking disease genes model must support this granularity comparability
15
Domain specific data models FB, ZFIN genotype to phenotype ‘EAV’ qualities inhere in entities orthologs phenotype to disease core 2 will help define common model UCSF clinical trials existing ontology-friendly schema - trialbank
16
Phenotype data model
Qualities inhere in entities
Entity term ; PATO term
  brain FBbt:00005095 ; fused PATO:0000642
  gut MA:0000917 ; dysplastic PATO:0000640
  tail fin ZDB:020702-16 ; ventralized PATO:0000636
  kidney ZDB:020702-16 ; hypertrophied PATO:0000636
  midface ZDB:020702-16 ; hypoplastic PATO:0000636
Pre-composed phenotype terms (Mammalian Phenotype Ontology)
  “increased activated B-cell number” MPO:0000319
  “pink fur hue” MPO:0000374
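The "qualities inhere in entities" model above can be sketched as a small data structure. This is only an illustration, not the OBD schema: the class names `Term` and `Phenotype` and the method `describe` are assumptions made for this example; the identifiers are the ones listed on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """An ontology term: an identifier plus a human-readable label."""
    id: str
    label: str


@dataclass(frozen=True)
class Phenotype:
    """A post-composed phenotype: a quality that inheres in an entity.

    Hypothetical class for illustration; not part of any OBD library.
    """
    entity: Term   # e.g. an anatomy ontology term
    quality: Term  # e.g. a PATO term

    def describe(self) -> str:
        return f"{self.quality.label} {self.entity.label}"


# Entity and quality identifiers as given on the slide.
fused_brain = Phenotype(
    entity=Term("FBbt:00005095", "brain"),
    quality=Term("PATO:0000642", "fused"),
)
print(fused_brain.describe())  # fused brain
```

Keeping entity and quality as separate fields is what makes annotations from different organisms comparable: two databases can use different anatomy ontologies while sharing the same PATO quality.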
17
Extensions to simple model
What about
  Relational attributes
  Quantitative vs qualitative
  Post-composing entity and attribute terms
  Relative states/values
  Variation in place, space and time
  A better treatment of absence
See CSHL Pheno meeting talk; also, more detailed formal presentation (available)
Not to mention genotypes, environments, provenance, etc
18
Modeling clinical trials
Model already described using frame-based schema
Further modeling required?
  abstraction to integrate more with other OBD datatypes
  views to only show parts relevant to OBD/BioPortal
19
Future DBPs and use cases OBD will contain a variety of general types of data Modeling is expensive use existing models where appropriate but whole must be cohesive and integrated Most of this talk focuses on the pheno DBPs for illustrative purposes
20
Modeling frameworks language technology
21
Modeling data: underlying formalism
Model is expressed with modeling language
Options
  Relational/SQL
  Semi-structured, XML
  Object-centric (UML, frame-based?)
  Logic based
    description logic: e.g. OWL
    first-order logic: e.g. CL
  Natural language descriptions
Model should be independent of language it is expressed in
22
Data exchange language: XML Simple XML is suited for data exchange XML can drive software spec constrains programmatic data model XSD can generate UML closed world assumption is useful cf Ruttenberg et al Mature technology well understood by developers, MODs standards
23
How OBD uses XML obd-geno-pheno-xml (aka pheno-xml) actually multiple modular components genotype schema phenotype schema: ‘EAV’ environment schema provenance schema used as exchange format cf: gene ontology association files no need for ClinicalTrials-XML
24
Example pheno-xml ZFIN:tm84
25
SQL Databases Data storage, management and querying all MODs use SQL dbs Lots of advantages scalable, standard QL, mature, APIs, etc pure relational model is reasonably formal XML/SQL more or less compatible low impedance mismatch
26
Schemas for geno-pheno data We already have schema: Chado Used by many MODs (eg FB) others are ‘chado compliant’ (eg ZFIN) Modular ontologies genomic genotype phenotype phylogenies …etc Phenotype module needs updating will be driven by pheno-xml
27
Problem solved? We have two mature, complementary technologies, and can define schemas for our model in an appropriate formalism for each Is this enough to work with?
28
Issues OBD will be much more than geno-pheno clinical trials future DBPs, other NCBCs any data expressed in an ontology language Software and schema development expensive fragility in face of schema evolution development gets bogged down in data exchange issues
29
Major issue
SQL and XPath work great for ‘traditional’ data…
…but are too low level for ontology-centric data
  lack of inference
  no way to directly express ontology constraints
30
Use cases from previous experience: AmiGO
GO
  “find all TF genes” (is_a closure)
  “find all gene products localised to endoplasmic reticulum” (part_of closure, over is_a)
Our solution (AmiGO & go-sqldb)
  pre-compute transitive closure over all relations in db
  (sort of) works for GO (for now)
  refresh problem
  explosive for tangled DAGs
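A minimal sketch of the pre-computed closure idea, assuming a simple composition rule: a chain of is_a links stays is_a, and any chain that contains a part_of link becomes part_of. The function names, the toy terms, and the gene names below are invented for illustration and are not taken from GO or go-sqldb.

```python
from collections import defaultdict


def closure(edges):
    """Return the transitive closure of (child, relation, parent) edges.

    Assumed composition rule: is_a followed by is_a gives is_a;
    any chain that includes part_of gives part_of.
    """
    known = set(edges)
    by_child = defaultdict(set)
    for child, rel, parent in known:
        by_child[child].add((rel, parent))
    frontier = list(known)
    while frontier:
        child, rel1, mid = frontier.pop()
        # Extend the path child -> mid with every known edge leaving mid.
        for rel2, parent in list(by_child[mid]):
            rel = "is_a" if rel1 == rel2 == "is_a" else "part_of"
            derived = (child, rel, parent)
            if derived not in known:
                known.add(derived)
                by_child[child].add((rel, parent))
                frontier.append(derived)
    return known


def annotated_within(term, closed, annotations):
    """Genes annotated to `term` or to anything below it in the closure."""
    hits = {child for child, _rel, parent in closed if parent == term} | {term}
    return sorted(gene for gene, t in annotations.items() if t in hits)


# Toy ontology fragment and annotations (hypothetical).
EDGES = [
    ("rough ER", "is_a", "ER"),
    ("rough ER membrane", "part_of", "rough ER"),
    ("ER", "is_a", "organelle"),
]
ANNOTATIONS = {"geneA": "rough ER membrane", "geneB": "rough ER", "geneC": "organelle"}

CLOSED = closure(EDGES)
print(annotated_within("ER", CLOSED, ANNOTATIONS))  # ['geneA', 'geneB']
```

The closure is cheap to query but has to be recomputed whenever the ontology changes, and its size grows quickly when terms have many parents, which is the "refresh problem" and "explosive for tangled DAGs" noted above.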
31
OBD requires more ontological awareness
Other relations
  ontogenic (eg derives_from)
  transitive_over
Other types of data
Pre- versus post-composed terms
  E.g. MPO versus AO+PATO
  E.g. Entity+Spatial qualifier
  queries over either should be interchangeable
32
Solution: more expressive formalisms
QLs and APIs should provide and abstract away common ontology operations
  ease of programming, optimisation
Choices
  ‘SemWeb’ databases
    RDF + RDFS + OWL [Lite + DL] + extra
    lots to choose from, emerging standards
    compatible with Obo v1.2 spec
  Deductive databases
    superset of relational databases
    from Prolog to full CL
33
Modeling phenotypes as RDF/OWL or Obo instances classes/ terms instances entityquality
34
Example query in SeRQL
  SELECT DISTINCT EI, ET, OrgI, QI, QT, QN
  FROM {EI} rdf:type {ET} rdfs:label {EN},
       {EI} OBO_REL_part_of {OrgI} rdf:type {Tax} rdfs:label {TaxN},
       {EI} OBO_REL_has_quality {QI} rdf:type {QT} rdfs:label {QN}
  WHERE label(EN) = "wing vein" AND label(TaxN) = "Arthropoda" AND label(QN) = "ShapeValue"
find mutations affecting the shape of the wing vein
results of query on OBD-sesame: one annotation to “wing vein L2”, “branched”
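To show roughly what a triple store does with a query like this, here is a small in-memory sketch: a conjunctive triple-pattern matcher plus one RDFS-style rule (instances of a subclass are also instances of its superclass). It is not Sesame's implementation. The `ex:` identifiers, the function names `match` and `infer_types`, and the toy data are assumptions for illustration, and the taxon leg of the query is left out for brevity.

```python
def infer_types(triples):
    """Add rdf:type triples implied by rdfs:subClassOf (a small slice of RDFS)."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            if p != "rdf:type":
                continue
            for s2, p2, o2 in list(inferred):
                if p2 == "rdfs:subClassOf" and s2 == o:
                    if (s, "rdf:type", o2) not in inferred:
                        inferred.add((s, "rdf:type", o2))
                        changed = True
    return inferred


def match(patterns, triples, binding=None):
    """Yield variable bindings that satisfy every (s, p, o) pattern.

    Strings starting with '?' are variables; anything else is a constant.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for pat, val in zip(first, triple):
            if pat.startswith("?"):
                if candidate.setdefault(pat, val) != val:
                    ok = False
                    break
            elif pat != val:
                ok = False
                break
        if ok:
            yield from match(rest, triples, candidate)


# Toy data (hypothetical identifiers): one vein instance with a "branched" quality.
DATA = {
    ("ex:vein1", "rdf:type", "ex:WingVeinL2"),
    ("ex:WingVeinL2", "rdfs:subClassOf", "ex:WingVein"),
    ("ex:WingVeinL2", "rdfs:label", "wing vein L2"),
    ("ex:WingVein", "rdfs:label", "wing vein"),
    ("ex:vein1", "OBO_REL_has_quality", "ex:q1"),
    ("ex:q1", "rdf:type", "ex:Branched"),
    ("ex:Branched", "rdfs:subClassOf", "ex:Shape"),
    ("ex:Branched", "rdfs:label", "branched"),
    ("ex:Shape", "rdfs:label", "ShapeValue"),
}

QUERY = [
    ("?EI", "rdf:type", "?ET"),
    ("?ET", "rdfs:label", "wing vein"),
    ("?EI", "OBO_REL_has_quality", "?QI"),
    ("?QI", "rdf:type", "?QT"),
    ("?QT", "rdfs:label", "ShapeValue"),
]

RESULTS = list(match(QUERY, infer_types(DATA)))
print(RESULTS)
```

Without the inference step the query returns nothing, because the annotation is made to the more specific classes ("wing vein L2", "branched") rather than to the classes named in the query.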
35
Advantages of ‘SemWeb’ dbs Advantages over pure SQL The ontology is the model constraints encoded in ontology e.g. certain quality types only applicable to certain entity types agile development - fast database integration Rich modeling constructs transitivity, subsumption, intersection, etc powerful QLs and APIs More (technical) interoperation ‘for free’ URIs proven? Open World Assumption (maybe a hindrance?)
36
Disadvantages of ‘SemWeb’ dbs
speed
  may be slower than SQL
  ...but in-memory execution is fast
lack of maturity
  new technology, but has a LOT of momentum
foundations
  are RDF triples appropriate?
  inherent difficulties modeling time
  SQL allows n-ary relations/predicates
37
Hybrid model SemWeb dbs are commonly layered over SQL DBs We can have the best of both worlds Data View layers mapping between Obo/OWL model and domain-specific relational schema (optionally) materialized for speed different applications use appropriate layer
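A minimal sketch of such a view layer, using SQLite from the Python standard library. The table `phenotype`, the view `triple`, the predicate names, and the genotype identifier are all hypothetical; only the entity and quality identifiers come from the phenotype slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical domain-specific relational table (one row per phenotype).
    CREATE TABLE phenotype (
        id       INTEGER PRIMARY KEY,
        genotype TEXT,
        entity   TEXT,
        quality  TEXT
    );
    INSERT INTO phenotype (genotype, entity, quality)
    VALUES ('ex:genotype1', 'FBbt:00005095', 'PATO:0000642');

    -- View layer: expose the same rows as subject/predicate/object statements.
    CREATE VIEW triple AS
        SELECT genotype AS subject, 'has_phenotype' AS predicate,
               'pheno:' || id AS object
        FROM phenotype
        UNION ALL
        SELECT 'pheno:' || id, 'rdf:type', quality FROM phenotype
        UNION ALL
        SELECT 'pheno:' || id, 'inheres_in', entity FROM phenotype;

    -- Optional materialization for speed, at the cost of a refresh step.
    CREATE TABLE triple_materialized AS SELECT * FROM triple;
    """
)

rows = conn.execute(
    "SELECT subject, predicate, object FROM triple ORDER BY predicate"
).fetchall()
for row in rows:
    print(row)
```

Applications that want the relational shape query `phenotype` directly; ontology-aware tools read `triple` (or its materialized copy) and never need to know the domain-specific schema.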
39
Current progress: OBD-Sesame
Sesame open source ‘triple store’ based on Jena also used in Protégé-OWL
storage layer options: mysql/postgresql generic schema; in-memory; disk-based
40
OBD in Sesame: current datasets
Pheno
  ZFIN & FB: EAV trial 2003 data
  Test ortholog set
  FB ‘simple phenotype’ alleles
  ZFIN legacy phenotype data, automatically parsed to EAV
Ontologies: AOs, PATO, Cell, GO
Method: excel & flatfiles -> pheno-xml -> owl
  OWL from http://www.fruitfly.org/~cjm/obo-download
Trialbank
  Method: ocelot -> obo-xml -> owl
Soon: human orthologs and omim
41
Technology Evaluation: Sesame
Use case query set
Benchmarks
preliminary conclusions
  SQL layering is terrible
  in-memory is fast
  optimisations? other triple stores?
  up to date results on wiki: http://smi.stanford.edu/projects/cbio/mwiki-internal/index.php/RDF_Sesame_Demo_Benchmark
Need to test OWL-DL entailment
Bigger dataset required for full evaluations
Community effort: pub-semweb-lifesci list
42
Parallel development: an OBD Prototype Initiated prior to OBD-Sesame Simple deductive database prolog-based chado-like schema can be views on Obo/OWL predicates amigo-clone user interface Rapid prototyping Current dataset as obd-sesame, plus CT trivial to drop in more
43
Example logic query
  inheres(QI,EI) &
  inst(QI,QT) &
  label(QT,shape) &
  inst(EI,ETP) &
  part_of*(ETP,ET) &
  label(ET,'head capsule')
find mutations affecting the shape of some part of the head capsule
results of query on OBD-prolog: one annotation to “arista lateral”, “irregular shape”
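The same conjunctive query can be sketched in Python to show what the deductive database has to evaluate. This is an illustration only, not the OBD prototype: the facts, the toy part_of chain, and the function names are assumptions, class labels stand in for class identifiers, and `inst` is treated as closed under is_a so that an instance of "irregular shape" also counts as an instance of "shape".

```python
def reachable(start, edges):
    """All nodes reachable from `start` via zero or more (child, parent) edges."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for child, parent in edges:
            if child == node and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy facts (hypothetical).
IS_A = {("irregular shape", "shape")}
PART_OF = {
    ("arista lateral", "arista"),
    ("arista", "antenna"),
    ("antenna", "head capsule"),
}
INST = {("q1", "irregular shape"), ("e1", "arista lateral")}
INHERES = {("q1", "e1")}


def query(quality_class, whole_class):
    """Find (entity class, quality class) pairs where the quality is a kind of
    `quality_class` and the entity is `whole_class` or some part of it."""
    results = []
    for qi, ei in INHERES:                      # inheres(QI, EI)
        for qi2, qt in INST:                    # inst(QI, QT)
            if qi2 != qi or quality_class not in reachable(qt, IS_A):
                continue
            for ei2, etp in INST:               # inst(EI, ETP)
                # part_of*(ETP, ET): reflexive-transitive part_of
                if ei2 == ei and whole_class in reachable(etp, PART_OF):
                    results.append((etp, qt))
    return results


print(query("shape", "head capsule"))  # [('arista lateral', 'irregular shape')]
```

In the Prolog version the closure over part_of is a rule evaluated at query time, so no closure table has to be stored or refreshed.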
44
OBD TODO Pheno-xml finalise release version finalise Obo/OWL mapping logic specification Data orthologies OBD - BioPortal integration how will it work? Versioning and reconciling changes decide on ontology versioning first
45
OBD dependencies PATO development UMLS into OBO-site Ontologies FMA accessibility? species-centric AO alignments (XSPAN?) Sept meeting on AO development Nov meeting on disease ontologies Data MOD pheno annotation OMIM annotation Bioportal
46
Misc NLP for phenote Obol trial on evolutionary phenotype characters cambridge NLP project can be used to ‘prime’ phenote Decomposing MPO pink fur def= fur, has_quality: pink
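Decomposing pre-composed terms is what lets queries treat pre- and post-composed annotations interchangeably. A minimal sketch, assuming a lookup table of logical definitions: the table `LOGICAL_DEFS` and the function `normalize` are hypothetical, and the single entry pairs the "pink fur hue" identifier from the phenotype slide with the decomposition given above.

```python
# Logical definitions of pre-composed terms as (entity, quality) pairs.
# Single entry from the slides: pink fur = fur, has_quality: pink.
LOGICAL_DEFS = {
    "MPO:0000374": ("fur", "pink"),
}


def normalize(annotation):
    """Return an (entity, quality) pair for either annotation style.

    A string is read as a pre-composed term ID and looked up; a pair is
    read as an already post-composed (entity, quality) annotation.
    """
    if isinstance(annotation, str):
        return LOGICAL_DEFS[annotation]
    entity, quality = annotation
    return (entity, quality)


pre_composed = normalize("MPO:0000374")
post_composed = normalize(("fur", "pink"))
print(pre_composed == post_composed)  # True
```

Once both styles normalize to the same pair, a query written against entity and quality finds annotations regardless of how they were originally recorded.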
47
Discussion Will SemWeb dbs work? experiment Ontology-based modeling the ontology is the model importance of relations ontology upper ontology
48
Demos
http://yuri.lbl.gov/amigo/ct
http://yuri.lbl.gov/amigo/obd
http://spade.lbl.gov:8080/sesame/actionFrameset.jsp?repository=mem-rdfs-db