Brain Data & Knowledge Grid
(or: Towards Services for Knowledge-Based Mediation of Neuroscience Information Sources)

National Center for Microscopy and Imaging Research (NCMIR)
Mark Ellisman, Maryann Martone, Steve Peltier, Steve Lamont, ...

Data-Intensive Computing Environments, San Diego Supercomputer Center (SDSC)
Reagan Moore, Chaitan Baru, Amarnath Gupta, Bertram Ludäscher, Richard Marciano, Arcot Rajasekar, Ilya Zaslavsky, ...

University of California, San Diego

Infrastructure for Sharing Neuroscience Data

[Diagram labels: CCB, Montana SU; Surface atlas, Van Essen Lab; NCMIR, UCSD; stereotaxic atlas, LONI; MCell, CNL, Salk]

SOURCES:
  – NCMIR, U.C. San Diego
  – Caltech Neuroimaging
  – Center for Imaging Science, Johns Hopkins
  – Center for Computational Biology, Montana State
  – Laboratory of Neuro Imaging (LONI), UCLA
  – Computational Neurobiology Laboratory, Salk Inst.
  – Van Essen Laboratory, Washington University
  – …

Data Management Infrastructure (DICE/NPACI):
  – MIX: Mediation in XML
  – MCAT: information discovery
  – SRB: data handling
  – HPSS: storage
  – ...

Layers:
  – Knowledge-based GRID infrastructure: ? ? ? ?
  – Data Management Infrastructure (“Data Grid”): GTOMO, Telemicroscopy, Globus, SRB/MCAT, HPSS

Sharing Resources on the Brain Data Grid

Scientific groups...
  – create data products (e.g., text data, images, simulation data, …)
  – put them in collections
  – add metadata (who created it, what the data is about, …)
  – make them available for sharing (on the web, in data caches, in HPSS, …)

Technical challenges...
  – size & packaging of data
  – heterogeneity: data types, storage technologies, transport mechanisms, authentication, ...
  – access levels: collection, object, fragment; data-specific functions (“data blades”)

Data Grid technologies can help...
  – distributed data management, e.g., Storage Resource Broker/Metadata Catalog (SRB/MCAT); computing (Globus); ...
  – the focus is on resource sharing (data, networks, cycles)

Integration Issue: Semantic Integration/Mediation

[Layer diagram, top to bottom]

??? SEMANTIC INTEGRATION ???

SYNTACTIC/STRUCTURAL Integration (MIX: Mediation of Information using XML)
  – Integrated Views (Src-XML => Intgr-XML)
  – Schema Integration (DTD => DTD)
  – Wrapping, Data Extraction (Text => XML)

SYSTEM INTEGRATION
  – storage, query capabilities
  – protocols & services
  – Distributed Query Processing
  – components shown: SRB/MCAT, TCP/IP, grid-ftp, HTTP, Globus, JDBC, DOM, CORBA

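Standard Mediator/Wrapper Architecture

[Architecture diagram, top to bottom]
  – Client/User-Query
  – INTEGRATED VIEW
  – Mediator: integration logic; GRID federation services ???
  – XML Q/A between mediator and wrappers (SRB/MCAT, DOM, X(ML)Query)
  – Wrapper (one per source): protocol translation
  – (Neuro)Science (Re)Sources: DB, Files, WWW (Lab1, Lab2, Lab3)

Aspects grouped by a brace in the diagram: structure, transport, syntax, storage
Open question: domain semantics ???
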
The Need for Semantic Integration

Example query: "What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?"

[Diagram: wrapped sources for protein localization, morphometry, neurotransmission, and the Web (CaBP, Expasy), below "??? Mediator ???", "??? Integrated View ???", and "??? Integrated View Definition ???"]

  – Data, relationships, and constraints are modeled (CMs)
  – Cross-source relationships are modeled
  – Semantic (knowledge-based) mediation services
  – Cross-source queries

Hidden Semantics: Protein Localization

[Figure: protein localization record. Recoverable labels: RyR …. spine 0; branchlet 30; Molecular layer of Cerebellar Cortex; Purkinje Cell layer of Cerebellar Cortex; Fragment of dendrite]

Hidden Semantics: Morphometry

[Figure: morphometry record … …, with two annotations]
  – Branch level beyond 4 is a branchlet
  – Must be dendritic because Purkinje cells don’t have somatic spines

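The two annotations above are examples of knowledge that lives in an expert's head rather than in the source schema. A minimal sketch of how such rules could be made explicit follows; the function names, the cell-type string, and the return values are illustrative assumptions and are not taken from the mediator prototype.

```python
from typing import Optional


def is_branchlet(branch_level: int) -> bool:
    """Slide rule: a branch level beyond 4 is a branchlet."""
    return branch_level > 4


def spine_compartment(cell_type: str) -> Optional[str]:
    """Slide rule: Purkinje cells have no somatic spines, so a spine
    recorded for a Purkinje cell must sit on a dendrite.

    Returns None when this rule says nothing about the cell type
    (hypothetical convention for this sketch).
    """
    if cell_type.lower() == "purkinje":
        return "dendrite"
    return None
```

Once rules like these are written down, a mediator can apply them to every record instead of relying on each user to know them.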
Knowledge-Based (Semantic) Mediation

Multiple Worlds Integration Problem:
  – compatible terms are not directly joinable
  – complex, indirect associations among attributes
  – unstated integrity constraints

Approach:
  – a “theory” under which terms can be “semantically joined”
  => lift mediation to the level of conceptual models (CMs)
  => formalize domain knowledge; integrity constraints (ICs) become rules over CMs
  => Knowledge-Based/Model-Based (Semantic) Mediation

XML-Based vs. Model-Based Mediation

XML-based mediation
  – Raw Data => (XML) Objects: XML Elements
  – XML Models: structural constraints (DTDs), e.g. A = (B*|C),D   B = ...; Parent, Child, Sibling, ...
  – No domain constraints
  – Integrated-DTD := XML-QL(Src1-DTD, ...)

Model-based mediation
  – Raw Data => Conceptual Models: Classes, Relations, is-a, has-a, ...
  – DOMAIN MAP (diagram: classes C1, C2, C3 and relation R)
  – Logical domain constraints (IF ... THEN ...)
  – Integrated-CM := CM-QL(Src1-CM, ...)

CM ~ {Descr. Logic, ER, UML, RDF/XML(-Schema), …}
CM-QL ~ {F-Logic, OIL, DAML, …}

Knowledge-Based Mediator Prototype

[Architecture diagram, top to bottom]
  – USER/Client (several): CM Queries & Results (exchanged in XML)
  – Mediator: CM (Integrated View); Integrated View Definition (IVD); Domain Map (DM)
  – Mediator Engine: XSB Engine with FL rule proc., LP rule proc., Graph proc.
  – GCM: CM S1, CM S2, CM S3
  – Logic API (capabilities); CM Plug-In
  – Sources S1, S2, S3, each with an XML-Wrapper and a CM-Wrapper

Mediation Services: Source Registration (System Issues)

For each source, register:
  – Data Type: table, tree, file
  – Access Protocol: SRB, HTTP, JDBC
  – Query Capability: Selections, SPJ, SQL, XML QL, DOOD, ARC
  – Result Delivery: Tuple-at-a-time, Set-at-a-time, Stream, Binary for Viewer

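One way to picture a registration record is as a small typed structure that the mediator can consult when planning a query. The sketch below is an illustration only: the class names, the `supporting` lookup, and the two example sources (`source_a`, `source_b`) are hypothetical; the enumerated values are the ones listed on the slide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, List


class DataType(Enum):
    TABLE = "table"
    TREE = "tree"
    FILE = "file"


class AccessProtocol(Enum):
    SRB = "SRB"
    HTTP = "HTTP"
    JDBC = "JDBC"


class ResultDelivery(Enum):
    TUPLE_AT_A_TIME = "tuple-at-a-time"
    SET_AT_A_TIME = "set-at-a-time"
    STREAM = "stream"
    BINARY_FOR_VIEWER = "binary for viewer"


@dataclass(frozen=True)
class SourceRegistration:
    """System-level description of one registered source (hypothetical shape)."""
    name: str
    data_type: DataType
    protocol: AccessProtocol
    query_capabilities: FrozenSet[str]  # e.g. {"selections", "SPJ", "SQL"}
    delivery: ResultDelivery


class SourceRegistry:
    """Keeps registrations and answers simple capability lookups."""

    def __init__(self) -> None:
        self._sources: Dict[str, SourceRegistration] = {}

    def register(self, source: SourceRegistration) -> None:
        self._sources[source.name] = source

    def supporting(self, capability: str) -> List[str]:
        """Names of sources that declared the given query capability."""
        return sorted(
            name
            for name, src in self._sources.items()
            if capability in src.query_capabilities
        )


# Two made-up sources, only to exercise the registry.
registry = SourceRegistry()
registry.register(SourceRegistration(
    "source_a", DataType.TABLE, AccessProtocol.JDBC,
    frozenset({"selections", "SPJ", "SQL"}), ResultDelivery.SET_AT_A_TIME))
registry.register(SourceRegistration(
    "source_b", DataType.FILE, AccessProtocol.HTTP,
    frozenset({"selections"}), ResultDelivery.STREAM))
```

A planner could call `registry.supporting("SPJ")` to decide which sources can evaluate a join themselves and which need the mediator to do it.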
Mediation Services: Source Registration (Semantics Issues)

Domain Map Registration
  – provide a concept space/ontology
    … as a private object (“myANATOM”)
    … merge it with others (give “semantic bridges”)
    … and check for conflicts

Conceptual Model Registration
  – schema: classes, associations, attributes
  – domain constraints
  – “put data into context” (linking data to the domain map)

ANATOM Domain Map

[Figure: the ANATOM domain map]

Senselab (Yale) and NCMIR (UCSD) “Semantic Bridge”

anatom_dom(X) :-
    ( ucsd_has_a(X,_) ; ucsd_has_a(_,X)
    ; ucsd_isa(X,_)   ; ucsd_isa(_,X) ).

senselab_dom(X) :-
    ( sl_has_a(X,_) ; sl_has_a(_,X)
    ; sl_isa(X,_)   ; sl_isa(_,X) ).

% map Senselab anatom terms to equivalent UCSD ANATOM
sl2ucsd(X,X) :- senselab_dom(X), anatom_dom(X).
sl2ucsd('A',axon).
sl2ucsd('AH',axon).
sl2ucsd('Dad',spiny_branchlet).   % should map to a PATH not just the end of the path
sl2ucsd('Dam',main_branches).     % some of the main_branches based on the branch level
sl2ucsd('Dap',main_branches).
sl2ucsd('Dbd',spiny_branchlet).
sl2ucsd('Dbm',main_branches).
sl2ucsd('Dbp',main_branches).
sl2ucsd('Ded',spiny_branchlet).
sl2ucsd('Dem',main_branches).
sl2ucsd('Dep',main_branches).
sl2ucsd('T',axon).

% keep has_a edge if at least one node is known from UCSD
has_a(X,Y) :- sl2ucsd(_,X), ucsd_has_a(X,Y).
has_a(X,Y) :- sl2ucsd(_,Y), ucsd_has_a(X,Y).

% keep all and only UCSD is_a rels
isa(X,Y) :- ucsd_isa(X,Y).

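For readers less familiar with Prolog, the bridge rules can be paraphrased as set operations. The sketch below is a simplification under stated assumptions: the `UCSD_HAS_A` edges are invented sample data (the real ANATOM facts are not shown on the slide), and the identity rule `sl2ucsd(X,X)` for terms shared by both vocabularies is left out.

```python
from typing import Dict, Set, Tuple

# Explicit Senselab -> UCSD ANATOM term mappings, copied from the slide.
SL2UCSD: Dict[str, str] = {
    "A": "axon", "AH": "axon", "T": "axon",
    "Dad": "spiny_branchlet", "Dbd": "spiny_branchlet", "Ded": "spiny_branchlet",
    "Dam": "main_branches", "Dap": "main_branches",
    "Dbm": "main_branches", "Dbp": "main_branches",
    "Dem": "main_branches", "Dep": "main_branches",
}

# Hypothetical fragment of UCSD has_a edges (parent, part), for illustration only.
UCSD_HAS_A: Set[Tuple[str, str]] = {
    ("neuron", "axon"),
    ("neuron", "soma"),
    ("dendrite", "spiny_branchlet"),
    ("dendrite", "main_branches"),
}


def bridged_has_a(sl2ucsd: Dict[str, str],
                  ucsd_has_a: Set[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Keep a UCSD has_a edge if at least one endpoint is the image of
    some Senselab term, mirroring the two has_a/2 rules on the slide."""
    known = set(sl2ucsd.values())
    return {(x, y) for (x, y) in ucsd_has_a if x in known or y in known}


merged = bridged_has_a(SL2UCSD, UCSD_HAS_A)
```

With the sample edges, `("neuron", "soma")` is dropped because neither endpoint is reachable from a Senselab term, while the other three edges survive.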
Refinement of a Domain Map (Ontology): Putting Data in Context via Registration of new Classes & Relationships

[Figure: domain map fragment. Node labels: Neuron; Spiny Neuron; Medium Spiny Neuron; MyNeuron; Compartment; Axon; Soma; Dendrite; MyDendrite; Neurotransmitter; GABA; Dopamine R; Substance P; Neostriatum; Substantia Nigra Pc; Substantia Nigra Pr; Globus Pallidus Int.; Globus Pallidus Ext. Edge legend: OR; AND; ALL:has; =; exp]

Mediation Services: Integrated View Definition

DERIVE
    protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value)
FROM
    I:protein_label_image[ proteins ->> {Protein};
                           organism -> Organism;
                           anatomical_structures ->> {AS:anatomical_structure[name->Anatom]}],   % from PROLAB
    NAE:neuro_anatomic_entity[name->Anatom;                                                      % from ANATOM
                              located_in->>{Brain_region}],
    AS..segments..features[name->Feature_name; value->Value].

  – provided by the domain expert and the mediation engineer
  – written in a declarative language (here: Frame-logic)

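Read procedurally, the view joins PROLAB image records with ANATOM entities on the anatomical structure name and then flattens the nested segments and features. The following sketch spells that out with nested loops; the dictionary layout and the single sample record are assumptions made for illustration and do not reflect the actual PROLAB or ANATOM schemas.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical PROLAB-style record: one labeled image with nested structures.
IMAGES: List[dict] = [{
    "proteins": ["RyR"],
    "organism": "rat",
    "anatomical_structures": [{
        "name": "spiny_branchlet",
        "segments": [{"features": [{"name": "protein_amount", "value": 30.0}]}],
    }],
}]

# Hypothetical ANATOM-style entities with the regions they are located in.
ANATOM: List[dict] = [
    {"name": "spiny_branchlet", "located_in": ["cerebellar_cortex"]},
]

Row = Tuple[str, str, str, str, str, float]


def protein_distribution(images: List[dict], anatom: List[dict]) -> Iterator[Row]:
    """Yield (Protein, Organism, Brain_region, Feature_name, Anatom, Value)
    rows, joining both inputs on the anatomical structure name."""
    by_name: Dict[str, List[dict]] = {}
    for entity in anatom:
        by_name.setdefault(entity["name"], []).append(entity)

    for image in images:
        for protein in image["proteins"]:
            for structure in image["anatomical_structures"]:
                for entity in by_name.get(structure["name"], []):
                    for region in entity["located_in"]:
                        for segment in structure["segments"]:
                            for feature in segment["features"]:
                                yield (protein, image["organism"], region,
                                       feature["name"], structure["name"],
                                       feature["value"])


rows = list(protein_distribution(IMAGES, ANATOM))
```

The declarative F-logic form leaves the join order to the engine; the loops here fix one order only to make the semantics visible.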
Example Query Evaluation (I)

Example: protein_distribution
  – given: organism, protein, brain_region
  – Use DOMAIN-KNOWLEDGE-BASE:
      recursively traverse the has_a_star paths under brain_region
      collect all anatomical_entities
  – Source PROLAB:
      join with anatomical structures and collect the value of attribute “image.segments.features.feature.protein_amount”
      where “image.segments.features.feature.protein_name” = protein
      and “study_db.study.animal.name” = organism
  – Mediator:
      aggregate over all parents up to brain_region
      report distribution

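The three steps above can be sketched as a graph traversal followed by a roll-up. Everything concrete in this example is assumed for illustration: the `HAS_A` fragment, the measurement tuples, and the function names are hypothetical, and the roll-up assumes the has_a fragment is a tree (a shared part would be counted once per parent).

```python
from typing import Dict, List, Tuple

# Hypothetical has_a fragment of a domain map: parent -> parts.
HAS_A: Dict[str, List[str]] = {
    "cerebellar_cortex": ["molecular_layer", "purkinje_cell_layer"],
    "molecular_layer": ["dendrite"],
    "dendrite": ["spiny_branchlet", "main_branches"],
}

# Hypothetical source rows: (organism, protein, anatomical entity, amount).
MEASUREMENTS: List[Tuple[str, str, str, float]] = [
    ("rat", "RyR", "spiny_branchlet", 30.0),
    ("rat", "RyR", "main_branches", 5.0),
    ("rat", "RyR", "purkinje_cell_layer", 12.0),
    ("mouse", "RyR", "spiny_branchlet", 99.0),
]


def has_a_star(region: str) -> List[str]:
    """Step 1: every entity reachable from region via has_a, region included."""
    seen: List[str] = []
    stack = [region]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(HAS_A.get(node, []))
    return seen


def distribution_under(organism: str, protein: str, region: str) -> Dict[str, float]:
    """Steps 2 and 3: select matching measurements, then aggregate each
    entity's amount into all of its ancestors up to region."""
    scope = set(has_a_star(region))
    direct: Dict[str, float] = {}
    for org, prot, anatom, amount in MEASUREMENTS:
        if org == organism and prot == protein and anatom in scope:
            direct[anatom] = direct.get(anatom, 0.0) + amount

    def rollup(node: str) -> float:
        return direct.get(node, 0.0) + sum(rollup(c) for c in HAS_A.get(node, []))

    return {node: rollup(node) for node in scope}


result = distribution_under("rat", "RyR", "cerebellar_cortex")
```

In the prototype the traversal is done by the deductive engine over the domain map; the explicit stack here only stands in for that recursion.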
Example Query Evaluation

"How does the parallel fiber output (Yale/SENSELAB) relate to the distribution of Ryanodine Receptors (UCSD/NCMIR)?"

X1 := select output from parallel fiber
X2 := “hang off” X1 from Domain
X3 :=
X4 := select PROT-data(X3, Ryanodine
X5 := compute aggregate(X4);

Mediation Services: Client Registration

[Diagram: taxonomy of client types and capabilities. Recoverable labels:]
  – Client: Update Client; Query Client
  – Update: Check Data; Merge Before Insert; Derive Before Insert
  – Query Capability: Navigate/Ad-hoc; Query on Schema
  – Fat Result Viewer: Client-side Buffer; Client-side Processing
  – Thin Result Viewer: Send Full Data; Server-side Buffer; Context Sensitive
  – Server-Push/Client-Pull

Example Client: Query Formulation and Result Display

  – combination of ad hoc and navigational queries
  – client-side visualization (left)
  – results are shown in semantic context (right)

Mediation Services: Semantic Annotation Tools

line drawing ==annotation==> (spatial) database for mediation

Mediator Architecture Blueprint

Mediation Services

Mediator Layer (Optimizer, Model Reasoner, Deductive Engine)
  – Query formulation: user query; integrated view definition
  – Query processing: view unfolding; semantic optimization; capability-based rewriting
  – Source registration: domain knowledge; model & schema; query & computation capabilities
  – Source model lifting: domain knowledge reconciliation; model transformation

Wrapper Layer
  – Query interface (down API): SDLIP, SOAP, ...; (subsets of) SQL, X(ML)-Query, CPL, ...; DOM; SRB-based access
  – Result delivery interface (up API): SDLIP, SOAP, ...; pull (tuple/set-at-a-time, DOM) vs. push (stream); synchronous/asynchronous; direct data/data reference

Sources
  – XML Sources, RDB Sources, File Sources, HTML Sources, Digital Libraries (Collections), Spatial Sources

[Other diagram labels: Boston Univ.; NCMIR UCSD; Yale Univ.; Montana Univ.; SDLIP; ARC; IMS]

Coming up: Knowledge-Based/Semantic Mediation of Brain Data

[Diagram labels: CCB, Montana SU; Surface atlas, Van Essen Lab; NCMIR, UCSD; stereotaxic atlas, LONI; MCell, CNL, Salk; ANATOM; PROTLOC; Knowledge-Based Mediation; Result (VML/SVG); Result (XML/XSLT)]

Some Open Issues

Data/Knowledge Modeling
  – Extensibility: how to handle a source with new data types and operations?
      Temporal data: instrument readings, video microscopy
      Spatial data: integrating with spatial database systems
      Image database systems
  – Conflict management
      Grades of certainty
      Alternative hypotheses

Integrating Services
  – Registration and warping of my image slice to a reference

Integrating into Larger Applications
  – MCell simulation
  – Telemicroscopy
  – Visualization

References

  – Model-Based Mediation with Domain Maps, Bertram Ludäscher, Amarnath Gupta, Maryann Martone, Intl. Conference on Data Engineering (ICDE), Heidelberg, 2001.
  – Knowledge-Based Mediation of Heterogeneous Neuroscience Information Sources, Amarnath Gupta, Bertram Ludäscher, Maryann Martone, Intl. Conference on Scientific and Statistical Databases (SSDBM), Berlin.
  – Model-Based Information Integration in a Neuroscience Mediator System, Bertram Ludäscher, Amarnath Gupta, Maryann Martone, Intl. Conference on Very Large Data Bases (VLDB), Cairo.