Towards Distributed Information Retrieval in the Semantic Web: Query Reformulation Using the oMAP Framework
Wednesday 14th of June, 2006
Raphaël Troncy, Umberto Straccia
Motivation
- Various SW repositories, using different vocabularies, are distributed on the web
- Large amounts of data are already out there: Swoogle has reached 1.5M unique Semantic Web documents (05/06/2006)
- Problem: how to search and retrieve information in such an environment?
Example scenario
- Kim Clijsters (courtesy of AFP); Montenegro independence (courtesy of Euronews)
- NewsML, SportsML, EventsML, iCalendar, TimeML...
- EXIF, EBU P/Meta, MXF, MPEG-7
Distributed Search in the SW
- Resource selection: select a subset of relevant resources
- Query reformulation: reformulate the information need into the vocabulary used by each resource
- Data fusion and rank aggregation: merge and rank all the results together
Resource Selection
- Compute an approximation of the content of each resource
- For some random queries, an approximation consists of:
  - The ontology the resource relies on
  - Some instances (a sample of annotated documents)
Query reformulation
- Transformation rules: from the query vocabulary to the vocabularies used by the resources
- Semantic Web: ontology alignment
  - Establishing the relationships holding between the entities (subsumption, equivalence, disjointness…)
  - With a confidence measure
  - Automatically computed by an ontology alignment framework
oMAP: Ontology Alignment Tool
oMAP: A Formal Framework
- Sources of inspiration:
  - Formal work in data exchange [Fagin et al., 2003]
  - GLUE: combining several specialized components for finding the best set of mappings [Doan et al., 2003]
- Notation:
  - A mapping is a triple: M = (T, S, Σ)
  - S and T are the source and target ontologies
  - S_i is an OWL entity (class, datatype property, object property) of the ontology S
  - Σ is a set of mapping rules: α_ij T_j ← S_i
oMAP: Combining Classifiers
- Weight of a mapping rule: α_ij = w(S_i, T_j, Σ)
- Using different classifiers: w(S_i, T_j, CL_k) is the classifier's approximation of the rule T_j ← S_i
- Combining the approximations:
  - Use of a priority list: CL_1, CL_2, …, CL_n
  - Weighted average of the classifiers' predictions
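
The two combination strategies named on this slide could be sketched as follows. This is a minimal illustration and not oMAP's actual code: the function names, the use of `None` to mean "this classifier gives no prediction", and the reading of the priority list as "take the first classifier that answers" are all assumptions.

```python
from typing import Optional, Sequence


def combine_by_priority(predictions: Sequence[Optional[float]]) -> Optional[float]:
    """Return the first available prediction.

    `predictions` holds w(S_i, T_j, CL_k) for k = 1..n, already sorted by
    priority (assumed interpretation of the priority list).
    """
    for prediction in predictions:
        if prediction is not None:
            return prediction
    return None


def combine_by_weighted_average(
    predictions: Sequence[Optional[float]],
    weights: Sequence[float],
) -> Optional[float]:
    """Weighted average over the classifiers that produced a prediction."""
    pairs = [(p, w) for p, w in zip(predictions, weights) if p is not None]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return None  # no usable prediction
    return sum(p * w for p, w in pairs) / total_weight
```

Either function yields a candidate value for α_ij from the per-classifier approximations; which classifiers abstain, and how weights are chosen, is left open here.
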
Terminological Classifiers
- Same entity names (or URIs)
- Same entity name stems
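
A rough sketch of these two classifiers is given below. Everything in it is an assumption made for illustration: the 1.0/0.0 scores, the case-insensitive comparison, the way the local name is cut out of a URI, and above all `naive_stem`, a toy suffix stripper standing in for a real stemming algorithm.

```python
def local_name(uri: str) -> str:
    """Keep the part after '#' or after the last '/' of a URI."""
    return uri.rstrip("/#").replace("#", "/").rsplit("/", 1)[-1]


def naive_stem(word: str) -> str:
    """Toy stemmer (hypothetical): strip a few English suffixes."""
    word = word.lower()
    for suffix in ("ings", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def same_name(source_uri: str, target_uri: str) -> float:
    """1.0 if both entities have the same local name, else 0.0."""
    same = local_name(source_uri).lower() == local_name(target_uri).lower()
    return 1.0 if same else 0.0


def same_stem(source_uri: str, target_uri: str) -> float:
    """1.0 if the local names share the same stem, else 0.0."""
    same = naive_stem(local_name(source_uri)) == naive_stem(local_name(target_uri))
    return 1.0 if same else 0.0
```

With these definitions, `Book` and `Books` differ by name but agree by stem, which is the kind of near-match the second classifier is meant to catch.
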
Terminological Classifiers
- String distance name
- Iterative substring matching: see [Stoilos et al., ISWC'05]
Terminological Classifiers
- WordNet distance name
- lcs is the longest common substring between S_i and T_j
- sim = [formula not preserved]
Machine Learning-Based Classifiers
- Collecting a bag of words:
  - label for the named individuals
  - data value for the datatype properties
  - type for the anonymous individuals and the range of object properties
  - …
- Recursion on the OWL definition: depth parameter
- Use statistical methods on the collected bag of words
Machine Learning-Based Classifiers
Example:
  Individual (x1 type (Conference)
    value (label "European SW Conf")
    value (location x2))
  Individual (x2 type (Address)
    value (city "Budva")
    value (country "Montenegro"))
  u1 = ("European SW Conf", "Address")
  u2 = ("Address", "Budva", "Montenegro")
- Naïve Bayes text classifier
- kNN text classifier
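
The bag-of-words collection in this example could be sketched as below. The `Individual` class, its `named` flag, and the function names are hypothetical; treating x1 as a named individual and x2 as an anonymous one is an assumption chosen so that the output matches u1 and u2 on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Individual:
    type: str
    named: bool  # assumption: named individuals do not contribute their type
    data_values: Dict[str, str] = field(default_factory=dict)
    object_values: Dict[str, "Individual"] = field(default_factory=dict)


def property_words(ind: Individual, depth: int) -> List[str]:
    """Data values, plus the type of each object-property range."""
    words = list(ind.data_values.values())
    for target in ind.object_values.values():
        words.append(target.type)
        if depth > 0:  # follow the OWL definition one level deeper
            words.extend(property_words(target, depth - 1))
    return words


def bag_of_words(ind: Individual, depth: int = 0) -> Tuple[str, ...]:
    """Collect the words describing one individual."""
    prefix = [] if ind.named else [ind.type]
    return tuple(prefix + property_words(ind, depth))


x2 = Individual("Address", named=False,
                data_values={"city": "Budva", "country": "Montenegro"})
x1 = Individual("Conference", named=True,
                data_values={"label": "European SW Conf"},
                object_values={"location": x2})

u1 = bag_of_words(x1)
u2 = bag_of_words(x2)
```

Raising `depth` pulls the words of linked individuals into the bag; the bounded depth also keeps the recursion finite on cyclic data. The resulting tuples are what a Naïve Bayes or kNN text classifier would then be trained on.
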
Structural and Semantics-Based Classifier
- Σ is a set of mapping rules: α_ij T_j ← S_i
- Σ sets are computed from the OWL definitions of the entities to align
- The computation recurses through the OWL structure, without looping thanks to cycle detection
Structural and Semantics-Based Classifier
- If S_i and T_j are property names: [formula not preserved]
- If S_i and T_j are concept names (1): [formula not preserved]
(1) Where D = D(S_i) × D(T_j); D(S_i) represents the set of concepts that are direct parents of S_i
Structural and Semantics-Based Classifier
- Let C_S = (Q R.C) and D_T = (Q' R'.D), then (1): [formula not preserved]
- Let C_S = (op C_1 … C_n) and D_T = (op' D_1 … D_m), then (2): [formula not preserved]
(1) Where Q, Q' are quantifiers, R, R' are property names, and C, D are concept expressions
(2) Where op, op' are concept constructors and n, m ≥ 1
Evaluation
- OAEI Contests (2004, 2005, 2006):
  - Systematic benchmark tests on bibliographic data
    - Tests 2xx: aligning an ontology with variations of itself, where each OWL construct is discarded or modified one at a time
    - Tests 3xx: four real bibliographic ontologies
  - Web categories alignment
Benchmark Tests
- Precision: dublin (value not shown), Falcon 0.91, FOAM 0.90, oMAP 0.85, CMS 0.81, OLA 0.80, ctxMatch 0.72, edna 0.45
- Recall: Falcon 0.89, OLA 0.74, dublin (value not shown), FOAM 0.69, oMAP 0.68, edna 0.61, ctxMatch 0.20, CMS 0.18
- oMAP: 4th on the global F-measure, 1st on the 3xx tests (real ontologies to align)
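
For reference, an F-measure can be derived from the two figures listed per system on this slide. The sketch below uses the standard harmonic mean F = 2PR / (P + R); whether the slide's "global F-measure" was computed exactly this way is an assumption. dublin is left out because its values are not legible in the slide text, so the resulting ranking is only indicative.

```python
# The two values listed for each system on the slide
# (dublin omitted: values not legible).
precision = {"Falcon": 0.91, "FOAM": 0.90, "oMAP": 0.85, "CMS": 0.81,
             "OLA": 0.80, "ctxMatch": 0.72, "edna": 0.45}
recall = {"Falcon": 0.89, "OLA": 0.74, "FOAM": 0.69, "oMAP": 0.68,
          "edna": 0.61, "ctxMatch": 0.20, "CMS": 0.18}


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


scores = {name: f_measure(precision[name], recall[name]) for name in precision}
ranking = sorted(scores, key=scores.get, reverse=True)

for rank, name in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {scores[name]:.3f}")
```

Because the harmonic mean is symmetric, the result does not depend on which of the two columns is read as precision and which as recall.
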
Aligning Web Categories
- Aligning Google, Looksmart and Yahoo web categories [Avesani et al., ISWC'05]
- Blind tests: only recall results are available
  System    Recall
  ctxMatch   9.4%
  FOAM      11.9%
  CMS       14.1%
  Dublin20  26.5%
  Falcon    31.2%
  OLA       32.0%
  oMAP      34.4%
Distributed Search in the SW
- Q: retrieve course material dealing with the history of the Americas and "Columbus" (example domain: university courses)
  query(d) <- History_Americas(d, "Columbus")
- is rewritten as two queries:
  0.63  query(d) <- Latin_American_History(d, "Columbus")
  0.84  query(d) <- American_History(d, "Columbus")
- Each document score is then multiplied by the confidence score of the rule.
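
The rewriting and score-weighting step on this slide could be sketched as follows. The rule weights 0.63 and 0.84 come from the slide; the document identifiers, their retrieval scores, the function names, and the choice to keep the best weighted score when a document is returned by several queries are all assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Mapping rules: query predicate -> [(target predicate, confidence)]
RULES: Dict[str, List[Tuple[str, float]]] = {
    "History_Americas": [
        ("Latin_American_History", 0.63),
        ("American_History", 0.84),
    ],
}


def reformulate(predicate: str, keyword: str) -> List[Tuple[float, str, str]]:
    """Rewrite one query atom into the target vocabularies."""
    return [(conf, target, keyword) for target, conf in RULES.get(predicate, [])]


def fuse(results: List[Tuple[float, List[Tuple[str, float]]]]) -> List[Tuple[str, float]]:
    """Multiply each document score by its rule confidence, then rank.

    `results` pairs a rule confidence with the (doc_id, score) hits
    returned for the corresponding rewritten query.
    """
    best: Dict[str, float] = {}
    for confidence, hits in results:
        for doc_id, score in hits:
            weighted = confidence * score
            if weighted > best.get(doc_id, 0.0):
                best[doc_id] = weighted  # assumed policy: keep the maximum
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


queries = reformulate("History_Americas", "Columbus")

# Hypothetical hits for the two rewritten queries.
hits_latin = [("d1", 1.0), ("d2", 0.5)]
hits_american = [("d2", 0.5), ("d3", 0.25)]
ranked = fuse([(0.63, hits_latin), (0.84, hits_american)])
```

Other aggregation policies (summing the weighted scores, for instance) would fit the same interface; the slide only fixes the multiplication by the rule confidence.
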
Conclusion
- Distributed search in the SW: resource selection / query reformulation / data fusion and rank aggregation
- oMAP: a formal framework for automatically aligning OWL ontologies
- Combining several specific classifiers:
  - Terminological classifiers
  - Machine learning-based classifiers
  - Structural and semantics-based classifier
Future Work
- Implementing the three steps proposed
  - Keyword-based or structured (SPARQL) queries
  - Ranked list of results
- oMAP
  - Using additional classifiers: KL-distance, other resources, background knowledge, etc. Straightforward theoretically but practically difficult!
  - Finding complex alignments, e.g. name = firstName + lastName
  - OWL and rule-based languages: take this additional expressivity into account
Any questions?
Structural and Semantics-Based Classifier
Possible values for the w_op and w_Q weights
w_op:
        ⊓     ⊔     ¬
  ⊓     1     1/4   0
  ⊔           1     0
  ¬                 1
w_Q: [table not recoverable; surviving fragments: 1 1 n n m 1 1/3 m 1]