Published by Ada Moody. Modified over 9 years ago.
1
Quete: Ontology-Based Query System for Distributed Sources
Haridimos Kondylakis, Anastasia Analyti, Dimitris Plexousakis
{kondylak, analyti, dp}@ics.forth.gr
2
Computer Science Department, University of Crete & FORTH-ICS
Presentation Outline
1. Motivation
2. Current Integration Approaches
3. Quete Overview
4. Querying in Quete
5. Evaluation
6. Conclusions
7. Future Work
3
1. Motivation
[Figure: architecture diagram. Labels: Clinical IS, mediator, D.B., Genomic IS, Query Engine, Visualization Tools, Regulatory Element Tools, Normalization Tools, Statistical/Clustering/Classification Tools, metadata, findings, Sample name, Normalized data]
4
2. Current Approaches (1/2)
- Warehouse integration: data is downloaded, filtered, integrated, and stored in a warehouse; answers are taken from the warehouse. Example: GUS.
- Navigational integration: explicit links between data. Examples: SRS, Entrez.
- Mediator-wrapper approaches: a global schema is defined over all data sources. Examples: K2/BioKleisli, TAMBIS, BACIIS, DiscoveryLink.
5
2. Current Approaches (2/2)
Mediator-wrapper approach
- GAV (global-as-view) approach: the global schema is defined in terms of the source terminologies.
- LAV (local-as-view) approach: the sources are defined in terms of the global schema.
[Figure: mediator-wrapper diagram. Labels: Query, Results, Mediator, Wrapper, Source 1, Source 2]
6
3. Integration Architecture
[Figure: Quete architecture diagram. Labels: Java Application, Query, Result, QUETE, Ontology, Java DB Engine, JDBC-ODBC, Source 1, Source 2, Source 3]
7
3.1 The Reference Ontology
The ontology is organized as a graph of concepts (plus relationship concepts) related through IS-A and HAS-A relationships.
[Figure: ontology graph with IS-A and HAS-A edges. Concepts and their attributes:]
- TumorSample: TumorIdentifier: String, SurgeryDate: Date
- RiskFactors: YearsOfSmoking: Int, Age: Int
- Hybridization: HybridizationDate: Date
- BreastCancerPatient: Name: String, City: String, SSN: String
- GeneExpression: RatioValue: Decimal
- Reporter: ReporterName: String, HGNCGeneSymbol: String
- GOAnnotation: GOId: String, GOName: String
- GOBiologicalProcess, GOMolecularFunction, GOCellularComponent
8
3.2 Semantic Names
A semantic name (SN) captures the system-independent semantics of a schema element by combining one or more ontology terms:
    Semantic_name = [CN_1; …; CN_m] AN
The semicolon between CN_i and CN_(i+1) means that concept CN_i is a generalization of concept CN_(i+1).

Type   Semantic Name                                     System Name
Table  [BreastCancerPatient]                             BreastCancerPatient
Field  [BreastCancerPatient] Name                        Name
Field  [BreastCancerPatient] City                        City
Table  [BreastCancerPatient;TumorSample]                 SurgicalExcision
Field  [BreastCancerPatient;TumorSample] TumorId         TumorSampleId
Field  [BreastCancerPatient;TumorSample] SurgeryDate     SurgeryDate
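The notation on this slide can be illustrated with a small parser. This is only a sketch: the class `SemanticName` and the function `parse_semantic_name` are hypothetical names, not part of Quete, and a table-level name is assumed to have an empty attribute name.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticName:
    concepts: tuple   # concept names CN_1 .. CN_m, most general first
    attribute: str    # attribute name AN; "" for a table-level semantic name


# One bracketed, semicolon-separated concept path, then an optional attribute.
_PATTERN = re.compile(r"^\[([^\[\]]+)\]\s*(\w*)$")


def parse_semantic_name(text: str) -> SemanticName:
    """Parse strings such as '[BreastCancerPatient;TumorSample] SurgeryDate'."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a semantic name: {text!r}")
    concepts = tuple(part.strip() for part in match.group(1).split(";"))
    if any(not concept for concept in concepts):
        raise ValueError(f"empty concept name in: {text!r}")
    return SemanticName(concepts, match.group(2))


if __name__ == "__main__":
    print(parse_semantic_name("[BreastCancerPatient;TumorSample] SurgeryDate"))
    print(parse_semantic_name("[BreastCancerPatient]"))
```

Keeping the concept path as an ordered tuple makes the later comparisons between semantic names a matter of comparing path suffixes.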
9
3.3 Definitions
A semantic name [CN_1; …; CN_m] AN is subsumed by a semantic name [CN'_1; …; CN'_m'] AN' if:
- m' <= m
- CN_(m-m'+i) coincides with or is a specialization of CN'_i, for i = 1, …, m'
- AN = AN'
Two semantic names are semantically overlapping if:
- their last i concept names are the same or related through the IS-A relationship
- they have the same attribute name AN
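The two definitions can be sketched as follows. Several things here are assumptions rather than part of the slides: semantic names are represented as `(concepts, attribute)` pairs, the IS-A edges in `IS_A` are guessed from the concept names, and "their last i concept names" is read as "for some i >= 1", which reduces to checking the last concept name of each path.

```python
# Assumed IS-A edges (child -> parent); the slides name these concepts but the
# exact edges are a guess made for illustration.
IS_A = {
    "GOBiologicalProcess": "GOAnnotation",
    "GOMolecularFunction": "GOAnnotation",
    "GOCellularComponent": "GOAnnotation",
}


def is_same_or_specialization(concept, other):
    """True if `concept` equals `other` or reaches it by following IS-A edges."""
    while concept is not None:
        if concept == other:
            return True
        concept = IS_A.get(concept)
    return False


def is_subsumed_by(specific, general):
    """Check the three subsumption conditions for (concepts, attribute) pairs."""
    concepts, attribute = specific
    general_concepts, general_attribute = general
    m, m_prime = len(concepts), len(general_concepts)
    if m_prime > m or attribute != general_attribute:
        return False
    # CN_(m-m'+i) for i = 1..m' is the suffix of length m' (0-based slice).
    suffix = concepts[m - m_prime:]
    return all(
        is_same_or_specialization(c, g) for c, g in zip(suffix, general_concepts)
    )


def are_overlapping(a, b):
    """Same attribute name, and last concepts equal or IS-A related either way."""
    (a_concepts, a_attribute), (b_concepts, b_attribute) = a, b
    if a_attribute != b_attribute:
        return False
    last_a, last_b = a_concepts[-1], b_concepts[-1]
    return (is_same_or_specialization(last_a, last_b)
            or is_same_or_specialization(last_b, last_a))


if __name__ == "__main__":
    specific = (("Reporter", "GOMolecularFunction"), "GOName")
    general = (("GOAnnotation",), "GOName")
    print(is_subsumed_by(specific, general), are_overlapping(specific, general))
```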
10
3.4 Integration Steps
Capture process
- Captures the data to be integrated
- Performed independently at each source
- Use the Extractor tool to export the database schemata
- Choose the fields/tables of interest
- Use the ontology to annotate the schemata
- The extracted database schemata are stored in X-Spec files, which are sent to the central site
Integration process
- Central integration of the various data sources
- A global view, called the Context View, is produced in memory
11
4.1 Query Formulation
Attribute-only version of SQL:
    SELECT [BreastCancerPatient]Name, [Reporter]HGNCGeneSymbol, [GeneExpression]RatioValue
    WHERE [RiskFactors]YearsOfSmoking>30
    AND [Hybridization]HybridizationDate=[TumorSample]SurgeryDate
    AND [Reporter;GOMolecularFunction]GOName="celladhesion"
    ORDERBY [BreastCancerPatient]Name
- The SELECT clause contains the concepts to be projected
- The WHERE clause specifies the selection criteria
- The FROM clause is absent, since the integration system automatically identifies the tables to be used
- No explicit join declarations are needed
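Because the query has no FROM clause, the only starting point for finding tables is the set of semantic names it mentions. The sketch below extracts them from the example query with a regular expression; the function names are hypothetical and this is not Quete's parser.

```python
import re

QUERY = """SELECT [BreastCancerPatient]Name, [Reporter]HGNCGeneSymbol, [GeneExpression]RatioValue
WHERE [RiskFactors]YearsOfSmoking>30
AND [Hybridization]HybridizationDate=[TumorSample]SurgeryDate
AND [Reporter;GOMolecularFunction]GOName="celladhesion"
ORDERBY [BreastCancerPatient]Name"""

# A bracketed concept path immediately followed by an attribute name.
SEMANTIC_NAME = re.compile(r"\[([^\[\]]+)\]\s*(\w+)")


def semantic_names(query):
    """Return every (concept path, attribute) pair in order of appearance."""
    return [
        (tuple(part.strip() for part in match.group(1).split(";")), match.group(2))
        for match in SEMANTIC_NAME.finditer(query)
    ]


def concepts_needed(query):
    """Return the sorted set of ontology concepts the query refers to."""
    return sorted({c for concepts, _ in semantic_names(query) for c in concepts})


if __name__ == "__main__":
    for concepts, attribute in semantic_names(QUERY):
        print(";".join(concepts), attribute)
    print(concepts_needed(QUERY))
```

The resulting concept list is what a mediator would match against the annotated source schemata to decide which tables and joins are involved.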
12
4.2 Query Answering
- The semantic query is decomposed into SQL subqueries
- Whenever possible, all operations are pushed into the subqueries
- The subqueries are issued in parallel to the distinct data sources
- Once all results have been returned to the central site, the remaining operations (joins, ordering, etc.) are performed
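The flow on this slide can be sketched with in-memory SQLite databases standing in for the remote sources. Everything concrete here is an assumption made for illustration: the source names, table layouts, sample rows, and the choice of `ssn` as the cross-source join key.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sources: setup SQL that fakes a remote database, plus the
# subquery pushed down to it (the selection is evaluated at the source).
SOURCES = {
    "clinical": (
        "CREATE TABLE patient(ssn TEXT, name TEXT, years_smoking INT);"
        "INSERT INTO patient VALUES ('1', 'Ann', 35), ('2', 'Bob', 10);",
        "SELECT ssn, name FROM patient WHERE years_smoking > 30",
    ),
    "genomic": (
        "CREATE TABLE expr(ssn TEXT, gene TEXT, ratio REAL);"
        "INSERT INTO expr VALUES ('1', 'CDH1', 1.8), ('2', 'TP53', 0.4);",
        "SELECT ssn, gene, ratio FROM expr",
    ),
}


def run_subquery(setup_sql, subquery):
    """Open the (fake) source, run one subquery, and return its rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)
        return conn.execute(subquery).fetchall()
    finally:
        conn.close()


def answer():
    # Issue all subqueries in parallel, one worker per source.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(run_subquery, *spec)
                   for name, spec in SOURCES.items()}
        results = {name: future.result() for name, future in futures.items()}
    # Remaining operations at the central site: join on ssn, then order.
    genes_by_ssn = {}
    for ssn, gene, ratio in results["genomic"]:
        genes_by_ssn.setdefault(ssn, []).append((gene, ratio))
    rows = [(name, gene, ratio)
            for ssn, name in results["clinical"]
            for gene, ratio in genes_by_ssn.get(ssn, [])]
    return sorted(rows)


if __name__ == "__main__":
    print(answer())
```

Pushing the `years_smoking > 30` filter into the clinical subquery means only matching patients travel to the central site, which is the point of the second bullet.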
13
4.3 Requirements in forming local subqueries
1. Identify the table attributes of interest to the user, with semantic name [CN_path] AN, i.e. attributes carrying the same or more specific information, plus the local join keys
2. Since the FROM clause is missing, determine the tables that link the attributes of interest to the user, together with their join conditions
3. The join attributes, called DB link attributes, are needed to link the attributes of interest to the user across sources
14
4.4 Forming the local subqueries
An extension of Unity's algorithm that increases the system's recall without sacrificing precision.
Our algorithm takes into account:
- the user query
- the ontology
- the data source-to-ontology mappings
...and formulates a single subquery (SQ) for each data source.
15
4.5 Algorithm: Result Composition
Input: (i) the user's semantic query, (ii) the local SQs
Output: composition plan
1. Find all minimal subsets of SQs such that:
   1. there is a join tree connecting all subqueries
   2. all the semantic query's fields exist
   3. each SQ has a projection attribute that does not overlap with the projection attribute of another SQ
2. Join the queries in each minimal subset
3. Project the common requested attributes
4. Union the results
5. Apply the group and order operations
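Step 1 can be sketched as a brute-force search over subsets of subqueries. This is a simplification under stated assumptions, not the algorithm from the talk: only conditions 1 and 2 are checked (condition 3, the non-overlapping projection attribute, is left out), "join tree" is approximated by connectivity through shared link attributes, and the names `SQ1` to `SQ3` and their fields are invented.

```python
from itertools import combinations


def is_connected(subset, links):
    """Condition 1, simplified: the subqueries form one connected component
    when an edge joins every pair that shares a link attribute."""
    names = list(subset)
    seen, stack = {names[0]}, [names[0]]
    while stack:
        current = stack.pop()
        for other in names:
            if other not in seen and links[current] & links[other]:
                seen.add(other)
                stack.append(other)
    return len(seen) == len(names)


def minimal_subsets(requested, fields, links):
    """Return the minimal subsets of subqueries that cover all requested
    fields (condition 2) and are connected (condition 1)."""
    found = []
    names = sorted(fields)
    for size in range(1, len(names) + 1):      # smaller subsets first
        for subset in combinations(names, size):
            covered = set().union(*(fields[name] for name in subset))
            if not requested <= covered:
                continue
            if any(set(previous) <= set(subset) for previous in found):
                continue                        # contains a smaller valid subset
            if is_connected(subset, links):
                found.append(subset)
    return found


if __name__ == "__main__":
    fields = {
        "SQ1": {"Name"},
        "SQ2": {"HGNCGeneSymbol", "RatioValue"},
        "SQ3": {"Name", "HGNCGeneSymbol", "RatioValue"},
    }
    links = {"SQ1": {"SSN"}, "SQ2": {"SSN"}, "SQ3": {"SSN"}}
    requested = {"Name", "HGNCGeneSymbol", "RatioValue"}
    print(minimal_subsets(requested, fields, links))
```

Each returned subset corresponds to one join in step 2; the joined results are then projected and unioned in steps 3 and 4.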
16
4.6 Results composition
Done with the help of a central DBMS:
- For every subquery, create a temporary table in the central DB and store the returned results
- Build the global SQL query to be issued to the central DB according to the result composition plan
- Execute the global SQL query
Pros
- The first step is executed in parallel
- Uses DBMS technology to handle join, union, order, and group operators efficiently
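The three steps can be sketched with an in-memory SQLite database standing in for the central DBMS. The table names, columns, rows, and the global query are assumptions made for illustration; the slides do not show the SQL that Quete generates.

```python
import sqlite3


def compose(results, global_sql):
    """results maps a temporary table name to (column names, rows) returned
    by one subquery. Table and column names are trusted identifiers here."""
    conn = sqlite3.connect(":memory:")  # stands in for the central DB
    try:
        for table, (columns, rows) in results.items():
            column_list = ", ".join(columns)
            placeholders = ", ".join("?" for _ in columns)
            # Step 1: one temporary table per subquery, filled with its rows.
            conn.execute(f"CREATE TEMP TABLE {table} ({column_list})")
            conn.executemany(
                f"INSERT INTO {table} VALUES ({placeholders})", rows)
        # Step 3: execute the global query built from the composition plan.
        return conn.execute(global_sql).fetchall()
    finally:
        conn.close()


# Hypothetical subquery results.
RESULTS = {
    "sq1": (["ssn", "name"], [("1", "Ann"), ("2", "Bob")]),
    "sq2": (["ssn", "gene", "ratio"], [("1", "CDH1", 1.8)]),
    "sq3": (["name", "gene", "ratio"], [("Cy", "TP53", 0.4)]),
}

# Step 2: join inside one subset, union across subsets, then order.
GLOBAL_SQL = """
SELECT sq1.name, sq2.gene, sq2.ratio FROM sq1 JOIN sq2 ON sq1.ssn = sq2.ssn
UNION
SELECT name, gene, ratio FROM sq3
ORDER BY 1
"""

if __name__ == "__main__":
    print(compose(RESULTS, GLOBAL_SQL))
```

Delegating joins, unions, and ordering to the DBMS keeps the mediator small, at the cost of copying every subquery result into the central database first.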
17
4.7 Novel features
Horizontal, vertical, and hybrid fragmentation can be declared and used:
- during the formation of the local subqueries
- during the formation of the result composition plan, which rebuilds the fragmented tables before proceeding further down the composition plan
Advantages
- Eliminates unnecessary local subqueries
- Avoids joins that are certain to return empty results
- Increases the system's recall
- Improves performance
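A rough illustration of how declared fragmentation can help. The slides do not say how fragments are declared, so everything below is an assumption: horizontal fragments are described by a value range on one attribute, vertical fragments share a key column, and all names and rows are invented.

```python
def rebuild_horizontal(fragments):
    """Horizontal fragments hold different rows of one table: rebuild by union."""
    rows = []
    for fragment in fragments:
        rows.extend(fragment)
    return rows


def rebuild_vertical(left, right, key):
    """Vertical fragments hold different columns of the same rows:
    rebuild by joining on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


def relevant_fragments(declared_ranges, lower_bound):
    """Keep only the horizontal fragments whose declared (min, max) range
    can contain values greater than lower_bound; the rest need no subquery."""
    return [
        name
        for name, (_minimum, maximum) in declared_ranges.items()
        if maximum > lower_bound
    ]


if __name__ == "__main__":
    # Hypothetical declaration: patients split by years of smoking.
    ranges = {"light_smokers": (0, 30), "heavy_smokers": (31, 80)}
    print(relevant_fragments(ranges, 30))
    names = [{"ssn": "1", "name": "Ann"}]
    cities = [{"ssn": "1", "city": "Heraklion"}]
    print(rebuild_vertical(names, cities, "ssn"))
```

With the selection `YearsOfSmoking>30`, the declared ranges show that only one fragment can contribute rows, so the subquery to the other fragment can be dropped.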
18
Preliminary Evaluation
19
Conclusions
Information integration is a difficult task:
- heterogeneity of sources
- independent evolution
- communication costs
- complicated structures
Our system has good performance.
A LAV system:
- the global schema does not change as sources evolve or new sources are added
- but without LAV's complexity in processing
- a trade-off between complexity and efficiency
20
Future Work
- More query algorithms in memory
- Database cycles
- Non-relational data sources
- Exploit systems for automatic schema matching
- Web service / Grid approach
- Caching
- Updates in sources
21
Thanks !!!