Core 2: Bioinformatics NCBO-Berkeley

1 Core 2: Bioinformatics NCBO-Berkeley

2 Berkeley Drosophila Genome Project
- Finish the sequence of the euchromatic genome of Drosophila melanogaster
- Annotated biologically important features of this sequence
- Produced gene disruptions using P element-mediated mutagenesis
- Full-length sequencing and expression characterization of a cDNA for every gene
- Developing informatics tools

3 Who is here from NCBO-Berkeley
- Sima
- Chris
- Mark
- Shu

4 Chris
- GadFly database schema
- GO database schema
- Chado database schema
- Perl libraries for all
- OBD data architect

5 Shu
- AmiGO, ImaGO & database
- Compute Pipeline
- OBD dev & Data flow

6 Mark
- Apollo Genome Annotation Editor
- Phenote and other OBD interfaces

7 Sima
- Adh region annotation
- Annotation of entire Drosophila Genome
- Project manager and coordinator nonpareil
- Associate Director

8 OBD Outline
- Core 2 aims, refresher
- Data models for OBD
  - phenotypes
  - clinical trials
  - others
- Modeling frameworks
  - exchange formats
  - database system
  - SQL based vs ‘SemWeb’ dbs
- Progress
- Demo

9 Core 2 Specific Aims
1. Apply ontologies
   - Software toolkit for describing and classifying data
2. Capture, manage, and view data annotations
   - Database (OBD) and interfaces to store and view annotations
3. Investigate and compare implications
   - Linking human diseases to model systems
4. Maintain
   - Ongoing reconciliation of ontologies with annotations

10 Core 3 Driving Biological Projects
- DBPs
  - phenotypes: Fly and Zebrafish to human
  - clinical trials
- Core 2 Aims
  1. Apply ontologies to describe data
  2. Capture, manage, and view data annotations
  3. Link disease genes to model systems
  4. Reconcile annotation and ontology changes

11 Apply ontologies to describe data
- Requirements
  - Data capture tools
    - phenote
    - demo tomorrow
    - no tool requirements from UCSF
  - Data model
  - Database (OBD): see aim 2

12 data flow

13 user’s view

14 Data models
- Common/shared domain-specific models
- Aim 3: linking disease genes
  - model must support this
  - granularity
  - comparability

15 Domain-specific data models
- FB, ZFIN
  - genotype to phenotype
  - ‘EAV’: qualities inhere in entities
  - orthologs
  - phenotype to disease
  - core 2 will help define common model
- UCSF
  - clinical trials
  - existing ontology-friendly schema: trialbank

16 Phenotype data model
- Qualities inhere in entities
- Entity term; PATO term
  - brain FBbt:00005095 ; fused PATO:0000642
  - gut MA:0000917 ; dysplastic PATO:0000640
  - tail fin ZDB:020702-16 ; ventralized PATO:0000636
  - kidney ZDB:020702-16 ; hypertrophied PATO:0000636
  - midface ZDB:020702-16 ; hypoplastic PATO:0000636
- Pre-composed phenotype terms
  - Mammalian Phenotype Ontology
  - “increased activated B-cell number” MPO:0000319
  - “pink fur hue” MPO:0000374
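
The entity-plus-quality pairing on this slide can be written down as a small data structure. The following is only a minimal sketch: the class and field names (`Term`, `Phenotype`, `describe`) are assumptions made for illustration and are not part of pheno-xml or any OBD schema. The identifiers are copied from the first example on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """An ontology term: an identifier plus a human-readable label."""
    id: str
    label: str


@dataclass(frozen=True)
class Phenotype:
    """A post-composed phenotype: a quality (PATO term) inhering in an entity."""
    entity: Term
    quality: Term

    def describe(self) -> str:
        # Render in the order a curator would read it, e.g. "fused brain".
        return f"{self.quality.label} {self.entity.label}"


# First example from the slide: brain FBbt:00005095 ; fused PATO:0000642
fused_brain = Phenotype(
    entity=Term("FBbt:00005095", "brain"),
    quality=Term("PATO:0000642", "fused"),
)

print(fused_brain.describe())  # fused brain
```

Keeping entity and quality as separate fields is what lets phenotypes from different species be compared on either axis, which is the comparability requirement raised on slide 14.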

17 Extensions to simple model
- What about
  - Relational attributes
  - Quantitative vs qualitative
  - Post-composing entity and attribute terms
  - Relative states/values
  - Variation in place, space and time
  - A better treatment of absence
- See CSHL Pheno meeting talk
  - also, more detailed formal presentation (available)
- Not to mention genotypes, environments, provenance, etc.

18 Modeling clinical trials
- Model already described using frame-based schema
- Further modeling required?
  - abstraction: to integrate more with other OBD datatypes
  - views: to only show parts relevant to OBD/BioPortal

19 Future DBPs and use cases
- OBD will contain a variety of general types of data
- Modeling is expensive
  - use existing models where appropriate
  - but whole must be cohesive and integrated
- Most of this talk focuses on the pheno DBPs for illustrative purposes

20 Modeling frameworks
- language
- technology

21 Modeling data: underlying formalism
- Model is expressed with modeling language
- Options
  - Relational/SQL
  - Semi-structured, XML
  - Object-centric (UML, frame-based?)
  - Logic based
    - description logic: e.g. OWL
    - first-order logic: e.g. CL
  - Natural language descriptions
- Model should be independent of language it is expressed in

22 Data exchange language: XML
- Simple
  - XML is suited for data exchange
- XML can drive software spec
  - constrains programmatic data model
  - XSD can generate UML
  - closed world assumption is useful (cf. Ruttenberg et al.)
- Mature technology
  - well understood by developers, MODs
  - standards

23 How OBD uses XML
- obd-geno-pheno-xml (aka pheno-xml)
  - actually multiple modular components
    - genotype schema
    - phenotype schema: ‘EAV’
    - environment schema
    - provenance schema
- used as
  - exchange format (cf. gene ontology association files)
- no need for ClinicalTrials-XML

24 Example pheno-xml ZFIN:tm84

25 SQL Databases
- Data storage, management and querying
  - all MODs use SQL dbs
- Lots of advantages
  - scalable, standard QL, mature, APIs, etc.
  - pure relational model is reasonably formal
- XML/SQL more or less compatible
  - low impedance mismatch

26 Schemas for geno-pheno data
- We already have schema: Chado
  - Used by many MODs (e.g. FB)
  - others are ‘chado compliant’ (e.g. ZFIN)
- Modular
  - ontologies
  - genomic
  - genotype
  - phenotype
  - phylogenies
  - …etc
- Phenotype module needs updating
  - will be driven by pheno-xml

27 Problem solved?
- We have two mature, complementary technologies, and can define schemas for our model in an appropriate formalism for each
- Is this enough to work with?

28 Issues
- OBD will be much more than geno-pheno
  - clinical trials
  - future DBPs, other NCBCs
  - any data expressed in an ontology language
- Software and schema development expensive
  - fragility in face of schema evolution
  - development gets bogged down in data exchange issues

29 Major issue
- SQL and XPath work great for ‘traditional’ data…
- …but are too low level for ontology-centric data
  - lack of inference
  - no way to directly express ontology constraints

30 Use cases from previous experience: AmiGO
- GO
  - “find all TF genes” (is_a closure)
  - “find all gene products localised to endoplasmic reticulum” (part_of closure, over is_a)
- Our solution (AmiGO & go-sqldb)
  - pre-compute transitive closure over all relations in db
  - (sort of) works for GO (for now)
  - refresh problem
  - explosive for tangled DAGs
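
The pre-computed closure approach can be illustrated with a small sketch. This is not the go-sqldb implementation: the function names, the toy term labels, and the gene names below are all assumptions chosen for illustration, and only the is_a relation is handled.

```python
from collections import defaultdict


def is_a_closure(edges):
    """Map each term to the set of all its ancestors, given (child, parent) is_a edges.

    Assumes the graph is a DAG, as an ontology's is_a hierarchy should be.
    """
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)

    closure = {}

    def ancestors(term):
        # Memoised depth-first walk up the hierarchy.
        if term not in closure:
            found = set()
            for parent in parents.get(term, ()):
                found.add(parent)
                found |= ancestors(parent)
            closure[term] = found
        return closure[term]

    for term in list(parents):
        ancestors(term)
    return closure


# Toy hierarchy and annotations (hypothetical labels, not real GO terms).
edges = [
    ("specific TF activity", "TF activity"),
    ("TF activity", "molecular function"),
    ("kinase activity", "molecular function"),
]
annotations = {
    "geneA": "specific TF activity",
    "geneB": "TF activity",
    "geneC": "kinase activity",
}

closure = is_a_closure(edges)


def genes_annotated_to(term):
    """Genes annotated to the term itself or to any of its descendants."""
    return sorted(
        gene for gene, annotated in annotations.items()
        if annotated == term or term in closure.get(annotated, set())
    )


print(genes_annotated_to("TF activity"))  # ['geneA', 'geneB']
```

Each entry in `closure` corresponds to rows that would be stored in the database, which is why the table must be rebuilt whenever the ontology changes (the refresh problem) and why it grows quickly when terms have many parents.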

31 OBD requires more ontological awareness
- Other relations
  - ontogenic (e.g. derives_from)
  - transitive_over
- Other types of data
- Pre- versus post-composed terms
  - E.g. MPO versus AO+PATO
  - E.g. Entity+Spatial qualifier
  - queries over either should be interchangeable
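
One way to make pre- and post-composed annotations interchangeable at query time is to expand every pre-composed term into its entity and quality parts before matching. The sketch below assumes a lookup table of logical definitions. Only the MPO identifier and the "pink fur" decomposition come from the slides; the function names, allele names, and the use of plain labels instead of term identifiers are illustrative assumptions.

```python
# Logical definitions of pre-composed terms, keyed by term id.
# "pink fur hue" (MPO:0000374) decomposes to: fur, has_quality: pink.
DEFINITIONS = {
    "MPO:0000374": ("fur", "pink"),
}


def normalize(annotation):
    """Return an (entity, quality) pair for either annotation style."""
    if isinstance(annotation, str):
        # Pre-composed: look up the term's logical definition.
        return DEFINITIONS[annotation]
    # Post-composed: already an (entity, quality) pair.
    return annotation


def matches(annotation, entity, quality):
    return normalize(annotation) == (entity, quality)


# Hypothetical annotations mixing both styles.
annotations = {
    "allele1": "MPO:0000374",     # pre-composed
    "allele2": ("fur", "pink"),   # post-composed
    "allele3": ("fur", "white"),  # post-composed, different quality
}

hits = sorted(a for a, ann in annotations.items() if matches(ann, "fur", "pink"))
print(hits)  # ['allele1', 'allele2']
```

A real system would also need subsumption on both the entity and the quality, so exact tuple equality here is a deliberate simplification.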

32 Solution: more expressive formalisms
- QLs and APIs should provide and abstract away common ontology operations
  - ease of programming, optimisation
- Choices
  - ‘Semweb’ databases
    - RDF + RDFS + OWL [ lite + DL ] + extra
    - lots to choose from, emerging standards
    - compatible with Obo v1.2 spec
  - Deductive databases
    - superset of relational databases
    - from Prolog to full CL

33 Modeling phenotypes as RDF/OWL or Obo instances
[Figure: classes/terms and instances; entity, quality]

34 Example query in SeRQL
    SELECT DISTINCT EI, ET, OrgI, QI, QT, QN
    FROM {EI} rdf:type {ET} rdfs:label {EN},
         {EI} OBO_REL_part_of {OrgI} rdf:type {Tax} rdfs:label {TaxN},
         {EI} OBO_REL_has_quality {QI} rdf:type {QT} rdfs:label {QN}
    WHERE label(EN) = "wing vein"
      AND label(TaxN) = "Arthropoda"
      AND label(QN) = "ShapeValue"
- find mutations affecting the shape of the wing vein
- results of query on OBD-sesame: one annotation to “wing vein L2”, “branched”
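
The result on this slide (an annotation to “wing vein L2” returned for a query on “wing vein”) depends on the store inferring superclass types. The sketch below mimics that behaviour over an in-memory set of triples. It is not Sesame's API: the function names, the `ex:` identifiers, and the toy data are assumptions, and only one RDFS rule is applied (an instance of a class is also an instance of its superclasses).

```python
from collections import defaultdict


def infer_types(triples):
    """Add (x rdf:type D) whenever (x rdf:type C) and (C rdfs:subClassOf D) hold."""
    kb = set(triples)
    superclasses = [(s, o) for s, p, o in kb if p == "rdfs:subClassOf"]
    changed = True
    while changed:  # repeat until no new type triples appear
        changed = False
        for s, p, o in list(kb):
            if p != "rdf:type":
                continue
            for sub, sup in superclasses:
                if sub == o and (s, "rdf:type", sup) not in kb:
                    kb.add((s, "rdf:type", sup))
                    changed = True
    return kb


def find_annotations(triples, entity_label, taxon_label, quality_label, infer=True):
    """Return (entity instance, quality instance) pairs matching the three labels."""
    kb = infer_types(triples) if infer else set(triples)
    label = {s: o for s, p, o in kb if p == "rdfs:label"}
    types = defaultdict(set)
    for s, p, o in kb:
        if p == "rdf:type":
            types[s].add(o)

    def typed_as(instance, wanted_label):
        return any(label.get(t) == wanted_label for t in types[instance])

    hits = set()
    for ei, p, qi in kb:
        if p != "OBO_REL_has_quality":
            continue
        if not (typed_as(ei, entity_label) and typed_as(qi, quality_label)):
            continue
        for s, p2, org in kb:
            if s == ei and p2 == "OBO_REL_part_of" and typed_as(org, taxon_label):
                hits.add((ei, qi))
    return sorted(hits)


# Toy data: identifiers are hypothetical; labels follow the slide.
TRIPLES = [
    ("ex:WingVein", "rdfs:label", "wing vein"),
    ("ex:WingVeinL2", "rdfs:label", "wing vein L2"),
    ("ex:WingVeinL2", "rdfs:subClassOf", "ex:WingVein"),
    ("ex:ShapeValue", "rdfs:label", "ShapeValue"),
    ("ex:Branched", "rdfs:label", "branched"),
    ("ex:Branched", "rdfs:subClassOf", "ex:ShapeValue"),
    ("ex:Arthropoda", "rdfs:label", "Arthropoda"),
    ("ex:Drosophila", "rdfs:subClassOf", "ex:Arthropoda"),
    ("ex:vein1", "rdf:type", "ex:WingVeinL2"),
    ("ex:fly1", "rdf:type", "ex:Drosophila"),
    ("ex:vein1", "OBO_REL_part_of", "ex:fly1"),
    ("ex:q1", "rdf:type", "ex:Branched"),
    ("ex:vein1", "OBO_REL_has_quality", "ex:q1"),
]

with_inference = find_annotations(TRIPLES, "wing vein", "Arthropoda", "ShapeValue")
without_inference = find_annotations(
    TRIPLES, "wing vein", "Arthropoda", "ShapeValue", infer=False
)
print(with_inference)     # [('ex:vein1', 'ex:q1')]
print(without_inference)  # []
```

The empty second result is the "lack of inference" problem from slide 29: without the subclass rule, the query only matches instances typed directly as “wing vein”.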

35 Advantages of ‘SemWeb’ dbs
- Advantages over pure SQL
  - The ontology is the model
    - constraints encoded in ontology
    - e.g. certain quality types only applicable to certain entity types
    - agile development - fast database integration
  - Rich modeling constructs
    - transitivity, subsumption, intersection, etc.
    - powerful QLs and APIs
  - More (technical) interoperation ‘for free’
    - URIs
    - proven?
  - Open World Assumption (maybe a hindrance?)

36 Disadvantages of ‘SemWeb’ dbs
- Disadvantages
  - speed
    - may be slower than SQL... but in-memory execution is fast
  - lack of maturity
    - new technology... but has a LOT of momentum
  - foundations
    - are RDF triples appropriate?
    - inherent difficulties modeling time
    - SQL allows n-ary relations/predicates

37 Hybrid model
- SemWeb dbs are commonly layered over SQL DBs
- We can have the best of both worlds
- Data View layers
  - mapping between Obo/OWL model and domain-specific relational schema
  - (optionally) materialized for speed
  - different applications use appropriate layer
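
A view layer of this kind can be sketched with SQLite from Python's standard library. The table, view, and predicate names below are hypothetical and far simpler than Chado. The sketch only shows the idea of exposing rows of a domain-specific table as subject/predicate/object triples, then optionally materializing the view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Domain-specific relational table (hypothetical, not the Chado schema).
CREATE TABLE phenotype (
    id       INTEGER PRIMARY KEY,
    genotype TEXT,
    entity   TEXT,
    quality  TEXT
);
INSERT INTO phenotype (genotype, entity, quality)
VALUES ('genotype1', 'FBbt:00005095', 'PATO:0000642');

-- View layer: the same data exposed as triples.
CREATE VIEW triple AS
    SELECT 'pheno:' || id AS subject, 'has_entity' AS predicate, entity AS object
    FROM phenotype
    UNION ALL
    SELECT 'pheno:' || id, 'has_quality', quality FROM phenotype
    UNION ALL
    SELECT genotype, 'has_phenotype', 'pheno:' || id FROM phenotype;

-- Optional materialization for speed, at the cost of needing a refresh.
CREATE TABLE triple_materialized AS SELECT * FROM triple;
""")

rows = conn.execute(
    "SELECT subject, predicate, object FROM triple ORDER BY predicate"
).fetchall()
for row in rows:
    print(row)

materialized_count = conn.execute(
    "SELECT COUNT(*) FROM triple_materialized"
).fetchone()[0]
```

Applications that want relational access query `phenotype` directly, while ontology-aware tools read `triple` or its materialized copy.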

38

39 Current progress: OBD-Sesame
- Sesame
  - open source ‘triple store’
  - based on Jena
  - also used in Protégé-OWL
  - storage layer options
    - mysql/postgresql generic schema
    - in-memory
    - disk-based

40 OBD in Sesame: current datasets
- Pheno
  - ZFIN & FB: EAV trial 2003 data
  - Test ortholog set
  - FB ‘simple phenotype’ alleles
  - ZFIN legacy phenotype data, automatically parsed to EAV
  - Ontologies: AOs, PATO, Cell, GO
  - Method: excel & flatfiles -> pheno-xml -> owl
  - OWL from http://www.fruitfly.org/~cjm/obo-download
- Trialbank
  - Method: ocelot -> obo-xml -> owl
- Soon
  - human orthologs and omim

41 Technology Evaluation: Sesame
- Use case query set
- Benchmarks
- preliminary conclusions
  - SQL layering is terrible
  - in-memory is fast
  - optimisations?
  - other triple stores?
- up to date results on wiki
  - http://smi.stanford.edu/projects/cbio/mwiki-internal/index.php/RDF_Sesame_Demo_Benchmark
- Need to test OWL-DL entailment
- Bigger dataset required for full evaluations
- Community effort: pub-semweb-lifesci list

42 Parallel development: an OBD Prototype
- Initiated prior to OBD-Sesame
- Simple deductive database
  - prolog-based
  - chado-like schema
    - can be views on Obo/OWL predicates
  - amigo-clone user interface
- Rapid prototyping
- Current dataset
  - as obd-sesame, plus CT
  - trivial to drop in more

43 Example logic query
    inheres(QI,EI) & inst(QI,QT) & label(QT,shape) &
    inst(EI,ETP) & part_of*(ETP,ET) & label(ET,'head capsule')
- find mutations affecting the shape of some part of the head capsule
- results of query on OBD-prolog: one annotation to “arista lateral”, “irregular shape”
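
The logic query can be approximated in a few lines to show what `part_of*` contributes. This is a sketch under stated assumptions, not the OBD-prolog code: terms are identified by their labels for brevity, `inst` is assumed to be closed over is_a (which is what lets “irregular shape” satisfy a query for “shape”), and the part_of chain from “arista lateral” to “head capsule” is toy data rather than a statement about the fly anatomy ontology.

```python
def reachable(start, edges):
    """Reflexive-transitive closure: all nodes reachable from start, including start."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Toy knowledge base (hypothetical structure; labels follow the slide).
IS_A = {"irregular shape": ["shape"]}
PART_OF = {"arista lateral": ["arista"], "arista": ["head capsule"]}
INST = {"q1": "irregular shape", "e1": "arista lateral"}  # instance -> asserted type
INHERES = [("q1", "e1")]  # quality instance inheres in entity instance


def query(quality_type, whole):
    """Find (entity type, quality type) pairs where the quality is a kind of
    quality_type and the entity is the whole itself or some part of it."""
    hits = []
    for qi, ei in INHERES:
        if quality_type not in reachable(INST[qi], IS_A):
            continue  # inst(QI, QT) with QT under quality_type fails
        if whole in reachable(INST[ei], PART_OF):  # part_of*(ETP, ET)
            hits.append((INST[ei], INST[qi]))
    return hits


print(query("shape", "head capsule"))  # [('arista lateral', 'irregular shape')]
```

Because the closure is computed at query time rather than stored, there is no refresh step when the ontology changes, at the cost of repeating the traversal on each query.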

44 OBD TODO
- Pheno-xml
  - finalise release version
  - finalise Obo/OWL mapping
  - logic specification
- Data
  - orthologies
- OBD - BioPortal integration
  - how will it work?
- Versioning and reconciling changes
  - decide on ontology versioning first

45 OBD dependencies
- PATO development
- UMLS into OBO-site
- Ontologies
  - FMA accessibility?
  - species-centric AO alignments (XSPAN?)
  - Sept meeting on AO development
  - Nov meeting on disease ontologies
- Data
  - MOD pheno annotation
  - OMIM annotation
- Bioportal

46 Misc
- NLP for phenote
  - Obol
    - trial on evolutionary phenotype characters
  - cambridge NLP project
  - can be used to ‘prime’ phenote
- Decomposing MPO
  - pink fur def= fur, has_quality: pink

47 Discussion
- Will SemWeb dbs work?
  - experiment
- Ontology-based modeling
  - the ontology is the model
  - importance of
    - relations ontology
    - upper ontology

48 Demos
- http://yuri.lbl.gov/amigo/ct
- http://yuri.lbl.gov/amigo/obd
- http://spade.lbl.gov:8080/sesame/actionFrameset.jsp?repository=mem-rdfs-db

