Semantic Mediation and Scientific Workflows
Bertram Ludäscher
Data and Knowledge Systems, San Diego Supercomputer Center, University of California, San Diego
SEEK Kansas 11/02

Data Integration Approaches:
– Let’s just share data, e.g., link everything from a web page!
– ... or better, put everything into a relational or XML database
– ... and do remote access using the Grid
– ... or just use Web services!

Nice try. But:

Some BIRNing Data Integration Questions (BIRN: Biomedical Informatics Research Network)
– “Find the files where the amygdala was segmented.”
– “Which other structures were segmented in the same files?”
– “Did the volume of any of those structures differ much from normal?”
– What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
XML-Based (or Relational) vs. Semantic Mediation

XML-based (or relational) mediation:
– Layers: Raw Data, XML Elements, XML Models
– Integrated-DTD: XML-QL(Src1-DTD,...)
– Structural Constraints (DTDs): A = (B*|C),D  B = ...; Parent, Child, Sibling,...
– No Domain Constraints

Semantic mediation:
– Layers: Raw Data, (XML) Objects, Conceptual Models
– Integrated-CM: CM-QL(Src1-CM,...)
– Logical Domain Constraints (IF ... THEN ...)
– Classes, Relations, is-a, has-a,... (diagram labels: C1, C2, C3, R)
– “Glue Maps” = Domain & Process Maps (ontologies)

CM ~ {Descr.Logic, ER, UML, RDF/XML(-Schema), …}
CM-QL ~ {F-Logic, DAML+OIL, …}
Making the SM System “Understand” Your Data: Source Contextualization via Ontology Refinement
– In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map ...
– ... sources can register new concepts at the mediator ...
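The two kinds of registration described above can be sketched in a few lines of Python. This is only a toy model, not the mediator's actual implementation: the `DomainMap` class, its method names, the concept names (`cell`, `neuron`, `purkinje_cell`), and the source and object identifiers are all hypothetical, and the domain map is reduced to plain is-a edges.

```python
from collections import defaultdict


class DomainMap:
    """Toy domain map: concepts linked by is-a edges, with data "hung off" concepts."""

    def __init__(self):
        self.parents = defaultdict(set)      # concept -> direct super-concepts
        self.registered = defaultdict(list)  # concept -> [(source, object_id), ...]

    def add_concept(self, concept, is_a=()):
        """Add a concept; a source calls this to refine the map with a new concept."""
        for parent in is_a:
            if parent not in self.parents:
                raise KeyError(f"unknown parent concept: {parent}")
        self.parents[concept].update(is_a)

    def register(self, source, object_id, concept):
        """Register a source object relative to an existing concept."""
        if concept not in self.parents:
            raise KeyError(f"unknown concept: {concept}")
        self.registered[concept].append((source, object_id))

    def ancestors(self, concept):
        """All direct and indirect super-concepts of a concept."""
        seen, stack = set(), list(self.parents.get(concept, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, ()))
        return seen

    def objects_under(self, concept):
        """Objects registered at the concept or at any of its refinements."""
        hits = []
        for registered_at, objects in self.registered.items():
            if registered_at == concept or concept in self.ancestors(registered_at):
                hits.extend(objects)
        return sorted(hits)


# Hypothetical example: SRC1 registers data at an existing concept,
# SRC2 first refines the map with a new concept and registers data there.
dm = DomainMap()
dm.add_concept("cell")
dm.add_concept("neuron", is_a=["cell"])
dm.register("SRC1", "obj-1", "neuron")
dm.add_concept("purkinje_cell", is_a=["neuron"])
dm.register("SRC2", "obj-2", "purkinje_cell")
print(dm.objects_under("neuron"))
```

In this sketch, a query against the general concept also reaches data registered under the refinement, which is the point of letting sources extend the map rather than forcing them onto existing concepts.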
Query Processing Demo

Query results in context: Contextualization CON(Result) wrt. ANATOM.

Mediator View Definition (provided by the domain expert and mediation engineer; deductive OO language, here: F-logic):

    DERIVE protein_distribution(Protein, Organism, Brain_region,
                                Feature_name, Anatom, Value)
    WHERE
        I: protein_label_image[
            proteins ->> {Protein};
            organism -> Organism;
            anatomical_structures ->>
                {AS: anatomical_structure[ name -> Anatom ]} ],   % from PROLAB
        NAE: neuro_anatomic_entity[
            name -> Anatom;                                       % from ANATOM
            located_in ->> {Brain_region} ],
        AS..segments..features[ name -> Feature_name; value -> Value ].
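To make the join in the view definition concrete, here is a rough Python reading of it over in-memory records. It is a sketch under assumptions, not the mediator's evaluation strategy: the dictionary layout mirrors the attribute names in the F-logic rule, the function `protein_distribution` is a hypothetical stand-in for the derived view, and all sample values (protein, organism, feature name, number) are invented placeholders.

```python
def protein_distribution(images, entities):
    """Yield (protein, organism, brain_region, feature_name, anatom, value) tuples.

    images   ~ protein_label_image objects (PROLAB side)
    entities ~ neuro_anatomic_entity objects (ANATOM side)
    The two sides are joined on the shared structure name (Anatom).
    """
    # Index ANATOM entities by name so the join on Anatom is a dictionary lookup.
    regions_by_name = {}
    for entity in entities:
        regions_by_name.setdefault(entity["name"], set()).update(entity["located_in"])

    for image in images:
        for protein in image["proteins"]:
            for structure in image["anatomical_structures"]:
                anatom = structure["name"]
                for region in sorted(regions_by_name.get(anatom, ())):
                    # AS..segments..features: follow both multi-valued paths.
                    for segment in structure["segments"]:
                        for feature in segment["features"]:
                            yield (protein, image["organism"], region,
                                   feature["name"], anatom, feature["value"])


# Invented placeholder data, only to exercise the join.
images = [{
    "proteins": ["NCS-1"],
    "organism": "rat",
    "anatomical_structures": [{
        "name": "purkinje_cell",
        "segments": [{"features": [{"name": "protein_amount", "value": 0.7}]}],
    }],
}]
entities = [{"name": "purkinje_cell", "located_in": ["cerebellum"]}]

rows = list(protein_distribution(images, entities))
print(rows)
```

Structures that have no matching ANATOM entity simply produce no rows, which matches the conjunctive reading of the WHERE clause.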
A Scientific Workflow: Promoter Identification (Matthew Coleman, LLNL, 2002)

[Workflow diagram; recoverable labels:]
– Inputs: cDNA gi#, Gene name (gi#’s from clusfavor)
– blast (human): Genomic gi#, Chr #, Gene location
– TRANSFAC: TAF’s, Location on Genomic gi#’s, Probabilities of match, Probabilities of random match
– GRAIL: GC Island location, Exon/intron location, Repeats location, Promoter location
– Validates polII promoter location
– blast (other species): Genomic gi#, Chr #, Gene location
– CLUSTAL: Consensus sequences
– Data Consolidation: Shared TAF’s across cluster, Common consensus sequence

Questions:
– Are chr#’s in common? Are chr#’s locations in common? Are there conserved upstream sequences? Are gene locations conserved across species?

Questions:
– RNA POLII promoter? GpC Island present? Are there common TAF’s across genomic gi#?

Questions:
– Are there other common genes?
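A workflow like this can be modeled as a dependency graph whose steps run in topological order. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The step names come from the slide, but the dependency edges are one assumed reading of the diagram, and the step functions are stubs that only record what they were given; none of this calls the real blast, TRANSFAC, GRAIL, or CLUSTAL tools.

```python
from graphlib import TopologicalSorter

# step -> set of steps it depends on (edges are an assumption, not from the slide)
deps = {
    "blast_human": set(),
    "transfac": {"blast_human"},
    "grail": {"blast_human"},
    "blast_other_species": {"blast_human"},
    "clustal": {"blast_other_species"},
    "data_consolidation": {"transfac", "grail", "clustal"},
}


def run_workflow(deps, steps):
    """Run each step after all of its upstream steps; return results by step name."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = {dep: results[dep] for dep in deps[name]}
        results[name] = steps[name](upstream)
    return results


# Stub steps: each returns a string naming itself and its upstream inputs.
steps = {
    name: (lambda upstream, n=name: f"{n}({', '.join(sorted(upstream))})")
    for name in deps
}

order = list(TopologicalSorter(deps).static_order())
results = run_workflow(deps, steps)
print(order)
print(results["data_consolidation"])
```

Several valid orders exist for the middle steps, so only the constraints (blast first, consolidation last) are guaranteed; the independent branches could equally run in parallel.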
SDM Demo & Architecture

Translation Approach: Abstract Workflow (AWF) => Executable Workflow (EWF)
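The slide gives only the direction of the translation, so the following is a guess at its general shape rather than the SDM architecture itself. It assumes an abstract workflow is a list of steps naming what should be done, and that translation looks up a concrete binding for each task. `AbstractStep`, `ExecutableStep`, `translate`, the task names, and the command strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AbstractStep:
    name: str
    task: str  # what should happen, independent of any tool


@dataclass
class ExecutableStep:
    name: str
    command: list  # how it happens: a concrete (here hypothetical) invocation


def translate(awf, bindings):
    """Map each abstract step to an executable step via a task -> command table."""
    ewf = []
    for step in awf:
        if step.task not in bindings:
            raise LookupError(f"no executable binding for task {step.task!r}")
        ewf.append(ExecutableStep(step.name, list(bindings[step.task])))
    return ewf


# Hypothetical abstract workflow and bindings; the command names are placeholders.
awf = [
    AbstractStep("find_genomic_sequence", "sequence_search"),
    AbstractStep("align_sequences", "multiple_alignment"),
]
bindings = {
    "sequence_search": ["hypothetical-search-service"],
    "multiple_alignment": ["hypothetical-alignment-service"],
}

ewf = translate(awf, bindings)
print(ewf)
```

Keeping the bindings in a separate table means the same abstract workflow can be retargeted to different services by swapping the table, which is one plausible motivation for separating AWF from EWF.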