Automatic Detection of Causal Relations for Question Answering
Roxana Girju, Baylor University, 2003
Contents: Background, Main Task, Result Analysis, Application in QA
Background
Causation relations expressed in English:
Explicit (cause, lead to, kill, dry, ...)
Implicit (no causation keyword)
Previous work: knowledge-based inference
Main Task
A machine-learning classifier:
Input: a sentence matching the pattern <NP1, verb, NP2>, converted into a feature vector (vectorization)
Output: YES (causal) or NO (not causal)
Vectorization
Training example template:
(entity_NP1, psychological-feature_NP1, abstraction_NP1, state_NP1, event_NP1, act_NP1, group_NP1, possession_NP1, phenomenon_NP1; verb; entity_NP2, psychological-feature_NP2, abstraction_NP2, state_NP2, event_NP2, act_NP2, group_NP2, possession_NP2, phenomenon_NP2; target)
Example sentence: "Earthquake generates tsunami"
Vector: <f, f, f, f, f, f, f, f, t, generate, f, f, f, f, f, t, f, f, f>
Complete training example (with target): <f, f, f, f, f, f, f, f, t, generate, f, f, f, f, f, t, f, f, f, YES>
(a code sketch of this encoding follows)
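A minimal sketch of the encoding above (the helper names are hypothetical; the slides do not show the paper's actual code, and the exact flag ordering may differ from the example vector):

```python
# Sketch of the 19-feature encoding: 9 boolean hierarchy flags for NP1,
# the connecting verb as a lexical feature, 9 flags for NP2, plus the
# YES/NO target. Helper names are hypothetical.
HIERARCHIES = ["entity", "psychological-feature", "abstraction", "state",
               "event", "act", "group", "possession", "phenomenon"]

def flags(np_hierarchies):
    """9 t/f flags: membership of the NP in each WordNet noun hierarchy."""
    return ["t" if h in np_hierarchies else "f" for h in HIERARCHIES]

def training_example(np1_hierarchies, verb, np2_hierarchies, target):
    return flags(np1_hierarchies) + [verb] + flags(np2_hierarchies) + [target]

# "Earthquake generates tsunami", assuming earthquake falls under phenomenon
# and tsunami under event (the hierarchy assignments are illustrative only):
print(training_example({"phenomenon"}, "generate", {"event"}, "YES"))
```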
How is the training set built?
Step 1: find the sentences (where does the data come from?)
Step 2: select the features (how is the vectorization done?)
Find the Sentences
Step 1: find NP pairs that stand in a causation relationship. WordNet 1.7 contains 429 such NP pairs, the most frequent domain being medicine (about 58.28%).
Step 2: for each pair of causation nouns found above, search the Internet and retain only the sentences containing the pair. From these sentences, automatically determine all the patterns <NP1 verb/verb_expression NP2>.
Step 3: search the text collection and retain 120 sentences per verb; 60 verbs * 120 sentences = 7200 sentences (corpus A).
Step 4: extract 6523 relationships of the type <NP1 verb NP2> from the 7200 sentences; by manual annotation, 2101 are causal relations and 4422 are not.
(a sketch of the pattern harvesting follows)
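A rough sketch of the pattern harvesting in Steps 2-3 (the noun pair, regex, and variable names are all hypothetical; the paper extracted patterns from parsed noun phrases, not a surface regex):

```python
import re

# Hypothetical sketch: harvest <NP1 verb NP2> triples for a known causation
# noun pair. This toy regex only matches simple "NP1 verb (article) NP2"
# surface patterns; the paper's extraction parsed actual noun phrases.
pair = ("earthquake", "tsunami")
pattern = re.compile(
    rf"\b({pair[0]}s?)\s+(\w+)\s+(?:a|an|the)?\s*({pair[1]}s?)\b",
    re.IGNORECASE,
)

sentence = "The earthquake generated a tsunami that hit the coast."
match = pattern.search(sentence)
if match:
    np1, verb, np2 = match.groups()
    print((np1, verb, np2))  # ('earthquake', 'generated', 'tsunami')
```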
Select Features
Both lexical and semantic features.
Lexical features: the verb/verb_expression.
Semantic features: the 9 top noun hierarchies in WordNet: entity, psychological feature, abstraction, state, event, act, group, possession, and phenomenon (see the sketch below).
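The nine semantic flags can be computed by walking a noun's hypernym closure. A sketch using NLTK's WordNet interface (an approximation: NLTK ships WordNet 3.x, whose noun hierarchy differs from the WordNet 1.7 unique beginners the paper used):

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

# The nine WordNet 1.7 noun "unique beginners" used as semantic features.
# In WordNet 3.x most of these are no longer roots, so we approximate by
# looking for synsets with these names anywhere in the hypernym closure.
TOP_CONCEPTS = ["entity", "psychological_feature", "abstraction", "state",
                "event", "act", "group", "possession", "phenomenon"]

def hierarchy_flags(noun):
    """Return 9 booleans: does any sense of `noun` fall under each concept?"""
    ancestors = set()
    for synset in wn.synsets(noun, pos=wn.NOUN):
        ancestors.add(synset)
        ancestors |= set(synset.closure(lambda s: s.hypernyms()))
    names = {s.name().split(".")[0] for s in ancestors}
    return [concept in names for concept in TOP_CONCEPTS]

print(dict(zip(TOP_CONCEPTS, hierarchy_flags("earthquake"))))
```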
Training Algorithm
C4.5 decision tree.
Inductive bias: a preference for shorter trees that place high-information-gain attributes closer to the root (a stand-in sketch follows).
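C4.5 itself is not in scikit-learn; an entropy-criterion decision tree is a close stand-in. The sketch below mirrors the 19-feature layout, mapping the verb to an integer id (a simplification: C4.5 handles categorical attributes natively, and the slides do not say how the lexical feature was fed to it):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy rows in the paper's layout: 9 NP1 flags, verb, 9 NP2 flags (made up).
rows = [
    ([0,0,0,0,0,0,0,0,1], "generate", [0,0,0,0,1,0,0,0,0], "YES"),
    ([1,0,0,0,0,0,0,0,0], "follow",   [1,0,0,0,0,0,0,0,0], "NO"),
]

# Encode the verb as an integer id (an assumption; C4.5 would treat it as a
# categorical attribute rather than an ordinal one).
verb_ids = {v: i for i, v in enumerate(sorted({r[1] for r in rows}))}
X = np.array([np1 + [verb_ids[verb]] + np2 for np1, verb, np2, _ in rows])
y = [target for *_, target in rows]

# criterion="entropy" splits on information gain, as C4.5 does.
clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(clf.predict(X))  # ['YES' 'NO'] on the training rows
```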
Result Analysis
683 relationships of the type <NP1 verb NP2> in test corpus B.
Reported result: 102/(115+38) = 66.67%.
Reasons for Errors
Mostly the high ambiguity of the causal patterns
Incorrect parsing of noun phrases
Use of rules with lower accuracy (e.g., 63%)
Lack of named entities in WordNet
Application in QA
On 50 test questions: 61% precision for the QA system with the causation module vs. 36% precision without it.
Thanks!