QA and Language Modeling (and Some Challenges)
Eduard Hovy
Information Sciences Institute
University of Southern California

Standard QA architecture (factoids)
Input: Q
Identify keywords from Q
Build (Boolean) query
Retrieve texts using IR
Rank texts/passages
Move window over text and score each position
Rank candidate answers
Return top N candidates
Output: A list
Corpus: 35%; + Web: +10% (Microsoft 01) (Waterloo 01)
Replace this by more-targeted matching

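The pipeline steps above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the stopword list, the AND-only Boolean retrieval, the window width, and the function names (`keywords`, `retrieve`, `score_windows`, `answer`) are all hypothetical and are not taken from any system named on the slide.

```python
import re

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "was",
             "who", "what", "when", "where", "did", "does"}

def keywords(question):
    """Identify keywords from Q: lowercase tokens minus stopwords."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return [t for t in tokens if t not in STOPWORDS]

def retrieve(query_terms, corpus):
    """Boolean AND query: keep only texts that contain every query term."""
    hits = []
    for text in corpus:
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        if all(term in tokens for term in query_terms):
            hits.append(text)
    return hits

def score_windows(text, query_terms, width=5):
    """Move a fixed-width window over the text and score each position
    by the number of query terms it contains."""
    tokens = re.findall(r"\w+", text)
    scored = []
    for start in range(max(1, len(tokens) - width + 1)):
        window = tokens[start:start + width]
        score = sum(1 for t in window if t.lower() in query_terms)
        scored.append((score, " ".join(window)))
    return scored

def answer(question, corpus, top_n=3):
    """Run the whole pipeline and return the top N scored windows."""
    terms = keywords(question)
    candidates = []
    for text in retrieve(terms, corpus):
        candidates.extend(score_windows(text, set(terms)))
    # Stable sort: ties keep their original left-to-right order.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return candidates[:top_n]
```

Calling `answer("Who killed Lee Harvey Oswald?", corpus)` on a small list of strings returns scored text windows rather than exact answers, which is the weakness that more-targeted matching is meant to address.
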
Textmap: Knowledge used for pinpointing
Orthography (rules)
  –ZIP codes, URLs, etc.
Default numerical info (rules)
  –how many people live in a city?
Abbreviations / acronyms (rules)
External sources (WordNet etc.)
  –definitions, instances, etc.
Syntactic constituents (parse tree)
  –delimit answer extent exactly
Syntactic and semantic types & relations (parse tree)
  –pinpoint correct syntactic relation
  –pinpoint correct semantic type
  –QA typology (140 types)
Surface answer patterns (patterns)

(PRED) [2] Jack Ruby
(DUMMY) [6] ,
(MOD) [7] who killed John F. Kennedy assassin Lee Harvey Oswald
    (SUBJ) [8] who
    (PRED) [10] killed
    (OBJ) [11] John F. Kennedy assassin Lee Harvey Oswald
        (MOD) [13] John F. Kennedy
        (MOD) [19] assassin
        (PRED) [20] Lee Harvey Oswald

[1] Lee Harvey Oswald allegedly shot and killed Pres. John Kennedy...
[2] Jack Ruby, who killed John F. Kennedy assassin Lee Harvey Oswald

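As a rough illustration of the "orthography (rules)" knowledge source, the sketch below matches ZIP codes and URLs with regular expressions. The rule table `ORTHOGRAPHY_RULES`, the function `find_by_orthography`, and the two expressions are assumptions made for this example; they are not Textmap's actual rules.

```python
import re

# Hypothetical orthographic rules, keyed by answer type.
ORTHOGRAPHY_RULES = {
    # US ZIP code: five digits, optionally followed by -dddd.
    "ZIP_CODE": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
    # Very loose URL shape: http(s):// followed by non-space characters.
    "URL": re.compile(r"\bhttps?://[^\s<>\"]+"),
}

def find_by_orthography(answer_type, text):
    """Return every substring of text whose surface form fits the rule
    registered for answer_type."""
    return ORTHOGRAPHY_RULES[answer_type].findall(text)
```

Rules like these use only surface form, so they are cheap to apply but say nothing about whether the matched string stands in the right relation to the question; that is the job of the parse-tree knowledge listed above.
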
Language modeling?
IR stage: as for IR
Pinpointing stage: learn to generate Qs from As…?
  –for factoids: very brief Qs, very brief As…hard
  –for longer As (biographies, event descriptions, opinion descriptions…): better outlook
‘Structured’ language model: word sequence patterns
  –Learn patterns for each Qtype; apply to pinpoint answer (Soubbotin & Soubbotin 01)
  –Automated learning from web (Ravichandran & Hovy 02)
  –Eventually create FSMs with semantic and syntactic types

BIRTHDATE
  1.0   ( - )
  0.85  was born on ,
  0.6   was born in
  0.59  was born
  0.53  was born
  0.50  – (
  0.36  ( -
LOCATION
  1.0   ' s
  1.0   regional : :
  1.0   at the in
  0.96  the in ,
  0.92  near in

This is the LM for the semantics of birthdates!

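One way to picture "learn patterns for each Qtype; apply to pinpoint answer" is the sketch below, which applies scored surface patterns to a text. It covers only the application step, not the learning from the web. The `<NAME>` and `<ANSWER>` slot markers, the two patterns and their scores, the date-shaped `ANSWER_SLOT` expression, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical scored patterns for one question type (BIRTHDATE).
# <NAME> stands for the question term, <ANSWER> for the answer slot.
BIRTHDATE_PATTERNS = [
    (0.85, "<NAME> was born on <ANSWER>,"),
    (0.60, "<NAME> was born in <ANSWER>"),
]

# Assumed shape of a birthdate answer: a year, optionally preceded by
# "Month D, " (for example "January 27, 1756").
ANSWER_SLOT = r"(?P<answer>(?:[A-Z][a-z]+ \d{1,2}, )?\d{4})"

def compile_pattern(pattern, name):
    """Turn a slot pattern into a regex for one specific question term."""
    pieces = []
    for part in re.split(r"(<NAME>|<ANSWER>)", pattern):
        if part == "<NAME>":
            pieces.append(re.escape(name))
        elif part == "<ANSWER>":
            pieces.append(ANSWER_SLOT)
        else:
            pieces.append(re.escape(part))  # literal words and punctuation
    return re.compile("".join(pieces))

def pinpoint(name, text, patterns):
    """Apply every pattern and return (score, answer) pairs, best first."""
    candidates = []
    for score, pattern in patterns:
        for match in compile_pattern(pattern, name).finditer(text):
            candidates.append((score, match.group("answer")))
    return sorted(candidates, reverse=True)
```

Each candidate inherits the score of the pattern that found it, so a match from a high-precision pattern outranks one from a looser pattern even when both fire on the same text.
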
Moving beyond factoids
Structured non-factoid answers: biographies, event stories, opinion ‘arguments’, etc.
  –Multi-doc summarization
Answer ‘qualifiers’: tense, hypotheticals, negation… “who is the president?” – when?
  –Linguistics work
Non-structured long answers
  –Text planning?
Inference
  –AI? / KR?
easier → harder

Challenges for QA
Remembering what you learned today; adding that to some (structured) knowledge repository
Complex answers (and extend QA typology)
Answer validity / trustworthiness
Merging answer (pieces) from multiple media sources (speech, databases, etc.)
Learning the LM / structure for any type of non-factoid answer, moving to more complex models:
  –bag-of-words
  –ngram distributions
  –patterns
  –schemas/templates (decomposition & recomposition)
  –?user’s known-fact list

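The first two rungs of the model ladder above can be contrasted in a short sketch. It assumes pre-tokenized input and uses plain relative frequencies with no smoothing; the function names `bag_of_words` and `ngram_distribution` are hypothetical.

```python
from collections import Counter

def bag_of_words(tokens):
    """Bag-of-words: token counts with all ordering discarded."""
    return Counter(tokens)

def ngram_distribution(tokens, n=2):
    """Relative frequency of each n-gram, which keeps local word order."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}
```

The bag-of-words view cannot tell "was born in" from "in was born", while the bigram view can; patterns and schemas/templates extend the same idea to longer and more structured units.
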
Thank you