Template-based Question Answering over RDF Data
Christina Unger, Lorenz Bühmann, Jens Lehmann, Axel-Cyrille Ngonga Ngomo, Daniel Gerber, Philipp Cimiano
Yanan Zhang
Background
Intuitive ways of accessing RDF data are becoming increasingly important. Question answering approaches have been proposed as a good compromise between intuitiveness and expressivity.
The general approach maps the question to a triple-based representation, e.g. for "Who wrote The Neverending Story?" (PowerAqua):
<[person,organization], wrote, Neverending Story>
<Writer, IS_A, Person>
<Writer, author, The Neverending Story>
1. (a) Which cities have more than three universities?
(b) <[cities], more than, universities three>
(c) SELECT ?y WHERE { ?x rdf:type onto:University . ?x onto:city ?y . } HAVING (COUNT(?x) > 3)
2. (a) Who produced the most films?
(b) <[person,organization], produced, most films>
(c) SELECT ?y WHERE { ?x rdf:type onto:Film . ?x onto:producer ?y . } ORDER BY DESC(COUNT(?x)) OFFSET 0 LIMIT 1
In both cases, the original semantic structure of the question cannot be faithfully captured using triples.
Contribution
A domain-independent question answering approach:
1. Parse the question into a SPARQL template.
2. Identify domain-specific entities to fill the template.
Example template: SELECT ?x WHERE { ?x ?p ?y . ?y rdf:type ?c . } ORDER BY DESC(COUNT(?y)) LIMIT 1 OFFSET 0
Parsing and template generation
Example: Who produced the most films?
(a) POS tagger: who/WP produced/VBD the/DT most/JJS films/NNS
(b) Domain-independent lexicon: 107 entries covering light verbs, question words, determiners, negation words, coordination and the like. Covered tokens: who, the most, the, most
(c) Domain-dependent lexicon: built on the fly, deriving syntactic and semantic properties from the POS tag. Building entries for: produced/VBD, films/NNS
Heuristics from POS tag to syntactic and semantic properties:
Named entities -> resources.
Nouns -> classes or properties.
Verbs -> properties. If the verb contributes nothing to the query structure, the noun is used instead (e.g. "have" in "Which cities have more than 2 million inhabitants?").
Each entry pairs a syntactic representation with a semantic representation.
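The POS-tag heuristics can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `slot_types` and `build_entries` are hypothetical, and the grouping of Penn Treebank tags (NNP/NNPS as named entities, NN/NNS as nouns, VB* as verbs) is an assumption.

```python
from typing import List, Set, Tuple


def slot_types(tag: str) -> List[str]:
    """Return the slot types a token with this POS tag may fill (assumed mapping)."""
    if tag in ("NNP", "NNPS"):   # proper nouns, treated here as named entities
        return ["resource"]
    if tag in ("NN", "NNS"):     # common nouns
        return ["class", "property"]
    if tag.startswith("VB"):     # verbs
        return ["property"]
    return []                    # other tags get no domain-dependent entry


def build_entries(tagged: List[Tuple[str, str]],
                  covered: Set[str]) -> List[Tuple[str, List[str]]]:
    """Build candidate entries for tokens the domain-independent lexicon does not cover."""
    entries = []
    for token, tag in tagged:
        types = slot_types(tag)
        if token not in covered and types:
            entries.append((token, types))
    return entries


tagged = [("who", "WP"), ("produced", "VBD"), ("the", "DT"),
          ("most", "JJS"), ("films", "NNS")]
entries = build_entries(tagged, covered={"who", "the", "most"})
```

For the example question, only "produced" and "films" receive entries, matching the slide's "Building entries for" step.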
SPARQL templates for "Who produced the most films?":
(a) SELECT ?x WHERE { ?x ?p ?y . ?y rdf:type ?c . } ORDER BY DESC(COUNT(?y)) LIMIT 1 OFFSET 0
Slots: <?c, class, films> <?p, property, produced>
(b) SELECT ?x WHERE { ...
Slots: <?p, property, films>
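To make the role of slots concrete, here is a small sketch of how a template might be turned into candidate queries once URIs have been found for each slot. The function `instantiate` and the candidate dictionary are hypothetical; the URIs onto:producer and onto:Film are simply taken from the earlier film example.

```python
import re
from itertools import product
from typing import Dict, Iterator, List

TEMPLATE = ("SELECT ?x WHERE { ?x ?p ?y . ?y rdf:type ?c . } "
            "ORDER BY DESC(COUNT(?y)) LIMIT 1 OFFSET 0")


def instantiate(template: str, candidates: Dict[str, List[str]]) -> Iterator[str]:
    """Yield one query per combination of candidate URIs for the slot variables."""
    slots = sorted(candidates)
    for combo in product(*(candidates[slot] for slot in slots)):
        query = template
        for slot, uri in zip(slots, combo):
            # Replace the whole variable only, e.g. ?p but not ?person.
            pattern = re.escape(slot) + r"\b"
            query = re.sub(pattern, lambda _match, uri=uri: uri, query)
        yield query


queries = list(instantiate(TEMPLATE, {"?p": ["onto:producer"], "?c": ["onto:Film"]}))
```

With several candidates per slot, the generator yields every combination, which is what makes a later ranking step necessary.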
Entity identification
Goal: for a string s, find similar entities in the knowledge base K.
Generic approach: retrieve the synonym set S(s) from WordNet, then retrieve the entities e whose label(e) is similar to an element of S(s).
Property detection: a large number of expressions can be used to denote the same predicate (e.g. "X, the creator of Y" and "Y is a book by X"), so the BOA pattern library is used as well.
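A rough sketch of the generic lookup, under stated assumptions: a plain dictionary stands in for WordNet, `difflib.SequenceMatcher` stands in for whatever string similarity the system actually uses, and the threshold of 0.8 and the name `find_entities` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_entities(s: str,
                  labels: Dict[str, str],
                  synonyms: Dict[str, List[str]],
                  threshold: float = 0.8) -> List[Tuple[str, float]]:
    """Rank entities whose label is similar to s or to a member of S(s)."""
    expanded = {s} | set(synonyms.get(s, ()))   # S(s), including s itself
    scored = []
    for uri, label in labels.items():
        score = max(similarity(term, label) for term in expanded)
        if score >= threshold:
            scored.append((uri, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


labels = {"onto:Film": "film", "onto:City": "city"}   # toy label index
synonyms = {"films": ["movie", "film"]}               # toy synonym source
matches = find_entities("films", labels, synonyms)
```

Expanding s to S(s) before matching lets "films" reach an entity labelled "film" even when the surface forms differ.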
BOA patterns
Pairs: for a property p, I(p) = {(x, y) : (x p y) ∈ K}
Sentences: find sentences containing "label(x) * label(y)" or "label(y) * label(x)".
NLE θ: a natural language expression of the form "?D? representation ?R?" or "?R? representation ?D?".
Pairs (p, θ) are scored by support, typicity and specificity to distinguish the patterns that are specific to property p. The resulting pairs (p, θ) are the BOA patterns.
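The extraction step can be illustrated with a toy sketch. It only covers finding the text between two labels; the support, typicity and specificity scores are not modelled. The function name `extract_patterns`, the sample labels and sentences, and the reading of ?D? as the subject x and ?R? as the object y are assumptions.

```python
import re
from typing import List, Tuple


def extract_patterns(pairs: List[Tuple[str, str]],
                     sentences: List[str]) -> List[str]:
    """Collect the text between label(x) and label(y), in either order."""
    patterns = []
    for x, y in pairs:
        for sentence in sentences:
            # Try "label(x) * label(y)" first, then "label(y) * label(x)".
            for first, second, form in ((x, y, "?D? {} ?R?"),
                                        (y, x, "?R? {} ?D?")):
                regex = re.escape(first) + r"\s+(.*?)\s+" + re.escape(second)
                match = re.search(regex, sentence)
                if match:
                    patterns.append(form.format(match.group(1)))
    return patterns


pairs = [("Alice Author", "Some Book")]   # invented (x, y) labels for one property
sentences = ["Some Book is a book by Alice Author.",
             "Alice Author is the creator of Some Book."]
patterns = extract_patterns(pairs, sentences)
```

Both sentences yield a pattern for the same property, which is the situation the pattern library is meant to handle.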
Query ranking and selection
Queries are scored using string similarity, the prominence of entities and the schema of the knowledge base.
Entity scores: type checks on the query triples of the form (?x p e) or (e p ?x).
Return: the highest-scored query with a non-empty result.
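The selection rule can be sketched as below, assuming the scores have already been computed. The `execute` callback is a hypothetical stand-in for a SPARQL endpoint call, and the query strings and scores are invented for illustration.

```python
from typing import Callable, List, Optional, Tuple


def select_query(scored_queries: List[Tuple[str, float]],
                 execute: Callable[[str], list]) -> Optional[Tuple[str, list]]:
    """Return the highest-scored query with a non-empty result, or None."""
    ranked = sorted(scored_queries, key=lambda item: item[1], reverse=True)
    for query, _score in ranked:
        result = execute(query)
        if result:               # skip queries that return nothing
            return query, result
    return None


# Toy endpoint: only the second-best query returns an answer.
fake_results = {"q_best": [], "q_second": ["answer"], "q_third": ["other"]}
selected = select_query([("q_third", 0.2), ("q_best", 0.9), ("q_second", 0.5)],
                        lambda q: fake_results[q])
```

Trying queries in descending score order means a top-ranked query with no answers does not block a lower-ranked one that has them.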
Evaluation and discussion
Benchmark: QALD on DBpedia, 50 questions annotated with SPARQL queries and answers.
Metrics: precision and recall.
Preliminary remarks:
Erroneous POS tags were manually corrected in seven questions.
11 questions rely on namespaces that were not incorporated for predicate detection: FOAF, YAGO.
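Parse failures: unknown domain-independent expressions.
Results:
19 questions answered with precision 1.0 and recall 1.0
2 questions answered with precision > 0.8 and recall > 0.8
Mean precision: 0.61
Mean recall: 0.63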
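For reference, per-question precision and recall can be computed as in the sketch below. It assumes answers are compared as sets against the gold standard and that an empty answer set scores 0.0; the benchmark's exact conventions may differ, and `precision_recall` is a hypothetical helper name.

```python
from typing import Iterable, Tuple


def precision_recall(returned: Iterable[str],
                     gold: Iterable[str]) -> Tuple[float, float]:
    """Precision and recall of a returned answer set against the gold answers."""
    returned_set, gold_set = set(returned), set(gold)
    correct = len(returned_set & gold_set)
    precision = correct / len(returned_set) if returned_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


exact = precision_recall(["a", "b"], ["a", "b"])
noisy = precision_recall(["a", "b", "c", "d"], ["a"])
```

The mean values on the slide would then be averages of these per-question figures.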
The key advantage:
The semantic structure of the natural language input is faithfully captured, e.g. for complex questions containing quantifiers, comparatives and superlatives.
No user feedback is needed.
Incorrect templates
No sensible template is constructed. Example: "Is there a video game called Battle Chess?" yields a property slot for title or name, whereas the matching property is rdfs:label.
The structure of the templates is sometimes too rigid. Example: "join the EU" corresponds to prop:accessioneudate.
Named entity recognition sporadically fails. Example: Battle of Gettysburg.
Entity identification
The class or property cannot be found on the basis of the slot:
Give me all soccer clubs in the Premier League. (onto:league)
Give me all movies with Tom Cruise. (onto:starring)
Hard to match:
Which cities have more than 2000000 inhabitants? (prop:populationTotal)
Who owns Aldi? (onto:keyPerson)
Which mountains are higher than the Nanga Parbat? (prop:elevation)
Query selection
A query with the wrong entity instantiating the slot is picked. The slot contains too little information to decide among candidates.
Example: "founded" has the candidates prop:foundation, prop:foundingYear, prop:foundingDate, onto:foundationPerson, onto:foundationPlace in:
Which organizations were founded in 1950?
When was Capcom founded?
Which software has been developed by organizations founded in California?
Others
Namespaces overlap, and choosing one over the other often leads to different results of different quality.
Future work
Rigid templates: a preprocessing step, a more flexible fallback strategy.
Provide robust question answering for large-scale heterogeneous knowledge bases.
Thank you for listening!