
Efficient Selection & Integration of Data Sources for Answering Semantic Web Queries
Abir Qasem (1), Dimitre Dimitrov (2), Jeff Heflin (1)
(1) Lehigh University, (2) Tech-X Corporation
11/11/07

Outline
Challenges
Desiderata and overview of our approach
OWLII: the subset of OWL that our system supports
OBII: Ontology Based Information Integrator
Evaluation
Wrap up

Challenge 1: Scalability
The (Semantic) Web is too big for any given system, regardless of advances in algorithms and/or smart hacks
We need to somehow identify a suitable subset that is relevant to a query and work on that subset
–Sampling and refinement, as Fensel and van Harmelen (IEEE IC 2007) suggest?
–Get a good enough subset in one shot?

Challenge 2: Heterogeneity
[Figure: a query posed against ontologies O1, O2, O3, O4, ..., ON, whose data sits in different repositories (Sesame, OWLIM, DLDB)]
Need alignments
–Mapping tools
–Third party alignments
Need tools that exploit them

Desiderata
1. Rely as much as possible on existing infrastructure.
2. Answer a query using any ontology, not just a globally accepted query ontology.
3. Identify a good enough subset of data sources that will get useful answers.
4. Be able to discover alignment information even when the ontologies are not directly mapped to one another.
5. Account for the dynamic nature of the Web, where the content of the data sources changes rapidly.

Our approach (get a good enough subset in one shot)
Introduced a concept of source relevance (REL statements)
–Allows data providers to advertise the relevance of their data to a query
–If a source can express that it has relevant information, we can choose to query it as opposed to other sources that do not express this information
Adapted an information integration algorithm to select relevant sources for a query, given relevance metadata and ontology alignments
Implemented and evaluated the system on synthetic data

PDMS
Fast and proven algorithm for query reformulation from the database community (Halevy et al. ICDE 2003)
Uses the LAV (local-as-view) and GAV (global-as-view) information integration formalisms to describe maps and data sources
–GAV: in first order logic, an implication with multiple antecedents and a single consequent. Usually written like: O3:BigMonitor(x) :- O2:LCD(x), screen(x, big)
–LAV: in first order logic, an implication with a single antecedent and multiple consequents. Usually written like: O1:CinemaDisplay(x) ⊆ O2:LCD(x), screen(x, big)
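The GAV rule above can be read as a rewrite: a query atom over O3:BigMonitor is replaced by the rule body over O2. The following is a minimal sketch of that single unfolding step, not the PDMS algorithm itself. The tuple representation of atoms, the `GAV_RULES` table, and the function name `unfold_gav` are assumptions made for illustration.

```python
# Atoms are (predicate, arguments) tuples; this encoding is hypothetical.
# Each GAV rule maps a head predicate to (head variables, body atoms).
GAV_RULES = {
    "O3:BigMonitor": (("x",), [("O2:LCD", ("x",)), ("screen", ("x", "big"))]),
}


def unfold_gav(atom, rules):
    """Replace a query atom by the body of the GAV rule whose head matches it."""
    pred, args = atom
    if pred not in rules:
        return [atom]  # no rule for this predicate: keep the atom unchanged
    head_vars, body = rules[pred]
    # Bind the rule's head variables to the arguments used in the query atom.
    binding = dict(zip(head_vars, args))
    # Constants such as "big" are not in the binding and pass through as-is.
    return [(p, tuple(binding.get(a, a) for a in body_args))
            for p, body_args in body]


if __name__ == "__main__":
    print(unfold_gav(("O3:BigMonitor", ("m",)), GAV_RULES))
```

A LAV statement cannot be unfolded this way, because the source predicate is on the left-hand side; handling it requires the view-based rewriting that the PDMS algorithm provides.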

OWL for information integration (OWLII)
The ontology language our system supports
A subset of OWL DL (therefore, decidable)
To represent LAV and GAV in OWL
–We have extended the DHL language (Grosof et al. 03)
REL statements are modeled as LAV statements
Details in a tech report: http://www3.lehigh.edu/images/userImages/jgs2/Page_7287/LU-CSE-07-007.pdf

REL example

Map example

Not so simple maps!
Maps are not always straightforward
For example: datatype property to object property
–profession is a datatype property in O1
–Profession is a class and hasProfession is an object property in O2 (domain Person and range Profession)
–O1:Person O1:profession.{teacher} O2:Person O2:hasProfession.{teacher}

OBII: Ontology Based Information Integrator
Input
–Domain ontologies (class and property hierarchy only)
–Map ontologies (OWL files that import two ontologies and establish alignments using OWLII)
–REL files (RDF files; a set of RDF triples enclosed by RelStatement describes a source's relevance)
–Data sources (OWL files that contain only individual and property assertions, i.e. an ABox, or Sesame repositories that contain similar data)
–SPARQL query
Output
–Variable bindings in XML

Evaluation
A baseline system: we load all the ontologies, all the maps and all the sources into a DL reasoner and issue a query to get a sound and complete answer
Basic PDMS to select sources (without any taxonomic reasoning)
OBII

Metrics
Response time
–For the baseline system, we add the time to load all the data and the reasoning time to get the answers.
–For the other two systems, load time is calculated as the time to load the ontologies that have been used in the reformulation (map and domain ontologies) and the selected data sources.
–The response time for these two systems is then the sum of load time, reformulation time and reasoning time.
Percentage of complete responses to queries
–In determining the completeness of queries, we consider the baseline system's answers to be the reference set.
–This is reasonable because it has all the data available to it and uses KAON2, a DL reasoner.
–We only consider queries that entail at least one answer.
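The two metrics above can be sketched in a few lines. This is an illustrative reading of the slide, not the evaluation harness that was actually used. The function names, the dictionary layout (query id mapped to a set of answer tuples), and the sample data are assumptions.

```python
def response_time(load_s, reformulation_s, reasoning_s):
    """Sum of the three components; the baseline has no reformulation step (pass 0.0)."""
    return load_s + reformulation_s + reasoning_s


def percent_complete(baseline_answers, system_answers):
    """Percentage of queries for which a system returns every baseline answer.

    Both arguments map a query id to a set of answer tuples. Queries for
    which the baseline entails no answer are skipped, as on the slide.
    """
    considered = [q for q, answers in baseline_answers.items() if answers]
    if not considered:
        return 0.0
    complete = sum(
        1 for q in considered
        if baseline_answers[q] <= system_answers.get(q, set())
    )
    return 100.0 * complete / len(considered)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    baseline = {"q1": {("a",)}, "q2": {("b",), ("c",)}, "q3": set()}
    system = {"q1": {("a",)}, "q2": {("b",)}}
    print(percent_complete(baseline, system))  # q1 complete, q2 not, q3 skipped
    print(response_time(2.0, 0.5, 1.5))
```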

Data
Real world data is limited
–Cannot be used to test the system completely
We decided to use synthetic data
We developed a workload generator, MOST (Maps Ontologies Sources Tester)
We plan to use some real data soon in the ISENS project

Results (1)
Response time for each system as we vary the number of ontologies and the number of sources
Both basic PDMS and OBII are significantly faster than the baseline system (note: the chart is in logarithmic scale)
Additionally, the basic PDMS is typically twice as fast as OBII
Similar trend in other configurations
[Chart x-axis: # of ontologies - diameter - # of sources]

Results (2)
Contribution of load time to response time
The main performance difference between OBII and basic PDMS is due to load time
OBII identifies more sources because it uses taxonomic reasoning
Since basic PDMS fails to identify these sources, it is incomplete for many queries (next chart)

Results (3)
The percentage of complete query responses decreases in basic PDMS as we increase the number of data sources and the number of ontologies.
OBII is 100% complete for all the queries with respect to the baseline system.

Wrap up!
The Semantic Web needs to be connected in order for the semantics to really pay off
We have implemented a fast algorithm for selecting and integrating Semantic Web data sources
Our initial evaluation shows promise, but there is a lot to be done
–Complex ontologies, expressive RELs ...

Backups

OWLII description
Axiom type: subject / object
owl:equivalentClass: named classes, owl:intersectionOf, owl:someValuesFrom, owl:hasValue (subject and object)
rdfs:subClassOf: all of the above + owl:unionOf / all of the above + owl:allValuesFrom
owl:equivalentProperty, rdfs:subPropertyOf: named properties, owl:inverseOf (subject and object)
owl:inverseOf: named properties (subject and object)

Source Selection Algorithm
subPred returns the sub classes (or sub properties) of a predicate, giving an enhanced match that also considers sub classes (sub properties).
This is an improvement over the basic PDMS.
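The idea behind subPred can be illustrated with a small sketch: a source is selected if it advertises the query predicate or any predicate below it in the taxonomy. This is not the algorithm from the slide, which is not reproduced in this transcript. The taxonomy contents (apart from LCD), the REL table, the choice to include the predicate itself in the result, and the function names are assumptions.

```python
# Hypothetical class taxonomy, stored as child -> parent.
SUBCLASS_OF = {
    "O2:LCD": "O2:Monitor",
    "O2:CRT": "O2:Monitor",
    "O2:Monitor": "O2:Device",
}


def sub_pred(pred, taxonomy):
    """Return pred plus all of its transitive sub classes (or sub properties)."""
    result = {pred}
    frontier = [pred]
    while frontier:
        current = frontier.pop()
        for child, parent in taxonomy.items():
            if parent == current and child not in result:
                result.add(child)
                frontier.append(child)
    return result


def select_sources(query_pred, rel, taxonomy):
    """Select sources whose advertised predicates match query_pred or a sub predicate.

    rel maps a source name to the set of predicates it advertises as relevant.
    """
    wanted = sub_pred(query_pred, taxonomy)
    return sorted(source for source, preds in rel.items() if preds & wanted)


if __name__ == "__main__":
    rel = {"src1": {"O2:LCD"}, "src2": {"O2:Monitor"}, "src3": {"O2:Keyboard"}}
    print(select_sources("O2:Monitor", rel, SUBCLASS_OF))
```

With exact matching only, a query on O2:Monitor would pick src2 alone; the taxonomic match also picks src1, which is the kind of source the basic PDMS misses.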

How was MOST used?
OntGenerator
–An average of 20 classes and 20 properties.
–The class and property taxonomies have an average branching factor of 4 and an average depth of 3.
MapGenerator
–An even distribution of the various mapping axioms (this can be controlled).
–Chose to map about 30% of the classes and 30% of the properties of a given domain ontology.
–The resulting map views contain an average of 5 conjuncts, with some maps containing up to 11 conjuncts.
SourceGenerator
–Creates instances of 30% of the classes and 30% of the properties of the domain ontology that a source commits to.
–On average, each data source contains 50 triples.
QueryGenerator
–Generates 200 random queries with 1 to 3 conjuncts (75% of the conjuncts are properties as opposed to classes).
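The QueryGenerator settings can be sketched as follows. This is a hypothetical re-creation from the parameters listed above, not the MOST code. The function name, the (kind, name) conjunct encoding, the uniform choice of 1 to 3 conjuncts, and the fixed seed are assumptions.

```python
import random


def generate_queries(classes, properties, n=200, p_property=0.75, seed=0):
    """Generate n random conjunctive queries with 1 to 3 conjuncts each.

    Each conjunct is a (kind, name) pair; a conjunct is a property with
    probability p_property and a class otherwise.
    """
    rng = random.Random(seed)  # fixed seed so that a workload is repeatable
    queries = []
    for _ in range(n):
        conjuncts = []
        for _ in range(rng.randint(1, 3)):  # randint is inclusive on both ends
            if rng.random() < p_property:
                conjuncts.append(("property", rng.choice(properties)))
            else:
                conjuncts.append(("class", rng.choice(classes)))
        queries.append(conjuncts)
    return queries


if __name__ == "__main__":
    # Placeholder names standing in for a generated domain ontology.
    classes = [f"C{i}" for i in range(20)]
    properties = [f"p{i}" for i in range(20)]
    workload = generate_queries(classes, properties)
    print(len(workload), workload[0])
```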
