PLUIE: Probability and Logic Unified for Information Extraction
Stuart Russell, Patrick Gallinari, Patrice Perny
Project Goals
“Open” information extraction
– Construct knowledge bases from the web
– Learn new classes, relations, linguistic patterns
– Learn new predictive regularities
– Integrate facts and entities across multiple documents
– Support question answering
Accuracy, consistency, integration, and utility; not scale for its own sake
Approach
Probabilistic inference with the Web as evidence
Generative models when available
[Figure: World → Web – the world generates the observed web; see the schematic below]
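In outline, the picture above is standard Bayesian inversion: given a prior over possible worlds and a generative model of how a world gives rise to web text, extraction amounts to computing a posterior over worlds from the observed Web. Schematically (notation mine, not from the slides):

```latex
% Posterior over worlds given the observed web text
P(\mathit{World} \mid \mathit{Web}) \;\propto\; P(\mathit{World}) \, P(\mathit{Web} \mid \mathit{World})
```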
Approach, contd.
Open-universe probability models (e.g., BLOG)
– First-order expressive power (objects, relations, functions, quantifiers, equality, etc.)
– Allow for uncertainty about the existence and identity of objects
Generative model consists of (a toy sketch follows this slide):
– What might be true in the world
– Who might choose to say what
– How they might choose to say it
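To make the three bullets under “Generative model consists of” concrete, here is a toy sketch in plain Python (not BLOG syntax; all entity names and probabilities are invented for illustration). It samples worlds with an unknown number of objects, lets noisy sources decide what to mention and how, and recovers a posterior by likelihood weighting:

```python
# Toy, hypothetical sketch (plain Python, not BLOG) of the three-layer
# generative story: what might be true, who might choose to say what,
# and how they might say it, with inference by likelihood weighting.

import random
from collections import Counter

NAMES = ["Alice", "Bob", "Carol", "Dan"]

def sample_world():
    """What might be true: an unknown number of companies, each with a CEO."""
    n_companies = random.randint(1, 4)                  # number uncertainty
    return {f"Company{i}": random.choice(NAMES) for i in range(n_companies)}

def sample_utterances(world):
    """Who says what, and how: each fact is mentioned with prob. 0.7, and the
    reported name is correct with prob. 0.9, otherwise drawn uniformly."""
    mentions = []
    for company, ceo in world.items():
        if random.random() < 0.7:
            reported = ceo if random.random() < 0.9 else random.choice(NAMES)
            mentions.append((company, reported))
    return mentions

def mention_likelihood(world, mentions):
    """P(mentions | world) for the observed mentions only (a simplification:
    unmentioned facts are ignored, so larger worlds are not penalised)."""
    p = 1.0
    for company, reported in mentions:
        if company not in world:
            return 0.0                                  # world cannot explain this mention
        # correct name w.p. 0.9 + 0.1/4; any specific wrong name w.p. 0.1/4
        p *= 0.925 if world[company] == reported else 0.025
    return p

def posterior_ceo(mentions, company, n_samples=20000):
    """Likelihood weighting: sample worlds from the prior, weight by the evidence."""
    weights = Counter()
    for _ in range(n_samples):
        world = sample_world()
        w = mention_likelihood(world, mentions)
        if w > 0 and company in world:
            weights[world[company]] += w
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()} if total else {}

if __name__ == "__main__":
    true_world = sample_world()
    evidence = sample_utterances(true_world)
    print("true world:", true_world)
    print("evidence  :", evidence)
    if evidence:
        company = evidence[0][0]
        print(f"posterior over CEO({company}):", posterior_ceo(evidence, company))
```

BLOG expresses the same structure declaratively (number statements, random functions over objects) and supplies general-purpose inference; the point of the sketch is only the shape of the model, not the project's actual implementation.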
Approach, contd.
Rigorous ontological framework
– Standard taxonomic hierarchy that supports distinctions needed for language, e.g., mass nouns (water) vs. count nouns (lake)
– Proper treatment of events and time (sketched after this slide); avoid deficient “facts” such as:
    Man Utd beat Chelsea; Chelsea beat Man Utd (PowerSet)
    Hank Paulson is the CEO of Goldman Sachs (NELL)
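One way to read the “events and time” bullet, as a hedged sketch rather than the project's actual ontology: reify match results as events with dates, and CEO-ship as a role with a temporal extent. The class names, fields, and all dates below are invented for illustration.

```python
# Hypothetical sketch of event and role reification using Python dataclasses:
# "Man Utd beat Chelsea" becomes a Match event with its own date, and
# "Hank Paulson is the CEO of Goldman Sachs" becomes a role tenure with a
# validity interval.  Class names, fields, and dates are illustrative only.

from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Match:
    winner: str
    loser: str
    played_on: date               # the event carries its own time

@dataclass
class RoleTenure:
    person: str
    role: str
    organization: str
    start: date
    end: Optional[date]           # None means the role is still held

kb = [
    # No contradiction once results are time-indexed events (dates invented):
    Match("Man Utd", "Chelsea", date(2011, 5, 8)),
    Match("Chelsea", "Man Utd", date(2012, 2, 5)),
    # The CEO claim holds only over an interval (dates illustrative):
    RoleTenure("Hank Paulson", "CEO", "Goldman Sachs", date(1999, 5, 1), date(2006, 6, 30)),
]

def ceo_of(org, when, facts):
    """Time-aware query: who held the CEO role of `org` on date `when`?"""
    for f in facts:
        if (isinstance(f, RoleTenure) and f.role == "CEO" and f.organization == org
                and f.start <= when and (f.end is None or when <= f.end)):
            return f.person
    return None

print(ceo_of("Goldman Sachs", date(2005, 1, 1), kb))   # Hank Paulson
print(ceo_of("Goldman Sachs", date(2012, 1, 1), kb))   # None: the "fact" has expired
```

With this representation the two match results coexist without contradiction, and the CEO claim is simply false for queries outside its interval.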
Open questions
Efficient inference
– What is extracted? A posterior over possible worlds?
How to identify new categories and relations
HCI: how to present infinite, heterogeneous posterior distributions, e.g., “Who wrote what when?” when “who,” “what,” and “when” vary across worlds
Making use of partially extracted or unextracted information
– “Data spaces” (Franklin, Halevy)
Adversarial data: game-theoretic analysis?
Plan
Reading group
– Weekly meeting (day and time?)
– Participants take turns presenting
– Reading list at
Formal project (ANR) runs 1/1/13 to 8/31/14
– Will continue indefinitely
– Hiring two postdocs
Possible collaborations
– Tom Mitchell’s NELL project (CMU)
– Andrew McCallum (UMass)
– Kevin Murphy (Google’s Knowledge Graph project)