The Knowledge Acquisition Bottleneck Revisited: How Can We Build Large KBs?
Illustrations of different approaches
Peter Clark and John Thompson
Boeing Research, 2004
Premise
Intelligent machines need lots of knowledge, for:
  question answering
  intelligent search
  information integration
  natural language understanding
  decision support
  modeling
  etc.
Much of this knowledge can be drawn from some general repository of reusable knowledge, e.g., WordNet.
How does one build such a repository?
“No-one considers hand-building a large KB to be a realistic proposition these days” [paraphrase of Daphne Koller, 2004]
1. Build it by Hand
“Let’s roll up our sleeves and get on with it!”
But: it’s a daunting task
Our own work
Cyc
  + Lots in it, (relatively) well-designed ontology
  - 650 person-years of effort so far
  - Still patchy coverage (why?)
  - Difficult to use outside Cycorp
1. Build it by Hand (cont)
WordNet
  + Easy to use
  + Comprehensive
  - Little inference-supporting knowledge in it
  - Ad hoc ontology
1. Build it by Hand (cont)
The Component Library
  Claim: can bound the required knowledge by working at a coarse-grained level
  + Large, more doable
  - Hard to use, still very incomplete
2. Extract from Dictionaries
MindNet
  + Automatically built
  - Unusable?
Extended WordNet
  + Won TREC competition
  - Still somewhat incoherent
  - A lot of manual labor
3. Corpus-based Text/Web Mining
Schubert’s system
  + Automatic
  + Lots of knowledge
  - Noisy
  - No word senses
  - Only grabs certain kinds of knowledge
  30M entries…
3. Corpus-based Text/Web Mining (cont)
KnowItAll (Etzioni)
  + Automatic
  - Only factoids
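To make the "only factoids" limitation concrete, here is a minimal sketch of pattern-based extraction from text. It assumes a single hand-written "CLASS such as X, Y and Z" pattern; the pattern, the function name `extract_instances`, and the triple format are illustrative assumptions, not the implementation of any system named on these slides.

```python
import re

# Hypothetical extraction pattern: "<class> such as <Inst1>, <Inst2> and <Inst3>".
# Instances are assumed to be single capitalized words.
PATTERN = re.compile(r"(\w+) such as ((?:[A-Z]\w+(?:, | and | or )?)+)")

def extract_instances(text):
    """Return (instance, "is-a", class) triples found by the pattern."""
    facts = []
    for match in PATTERN.finditer(text):
        class_name = match.group(1).lower()  # kept as written, e.g. a plural noun
        for inst in re.split(r", | and | or ", match.group(2)):
            if inst:  # skip the empty piece left by a trailing separator
                facts.append((inst, "is-a", class_name))
    return facts

if __name__ == "__main__":
    sentence = "He visited cities such as Paris, London and Berlin last year."
    for fact in extract_instances(sentence):
        print(fact)
```

A pattern like this yields isolated instance-of facts with no word senses and no rules connecting them, which is one way to read the limitation noted above.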
4. Community-Based Acquisition
Knowledge entry by the masses
OpenMind
  + Large
  - Full of junk, unusable (?)
Would this work with better acquisition tools?
(see next slide for illustration)
5. Use Existing Resources
e.g.,
  databases
  CIA World Fact Book
  Web data/services
e.g., SRI/ISI’s ARDA QA system
  + Syntactically simple
  + Available
  - Largely limited to factoids
  - Information integration is a major challenge: different ontologies, contradictory data
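The integration challenge (different ontologies, contradictory data) can be sketched as follows. This is a toy illustration under stated assumptions: the source names, field names, `FIELD_MAPS` table, and the made-up records are all hypothetical, and real integration needs far more than a field-renaming table.

```python
# Hypothetical per-source mappings from local field names to a shared vocabulary.
FIELD_MAPS = {
    "source_a": {"pop": "population", "capital_city": "capital"},
    "source_b": {"population_total": "population", "capital": "capital"},
}

def normalize(source, record):
    """Rename a record's fields into the shared vocabulary; drop unmapped fields."""
    mapping = FIELD_MAPS[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

def integrate(records_by_source):
    """Merge records about one entity; return (agreed values, conflicting values)."""
    values = {}  # shared field -> {source: value}
    for source, record in records_by_source.items():
        for field, value in normalize(source, record).items():
            values.setdefault(field, {})[source] = value
    merged, conflicts = {}, {}
    for field, by_source in values.items():
        if len(set(by_source.values())) == 1:
            merged[field] = next(iter(by_source.values()))
        else:
            conflicts[field] = by_source  # contradictory data: keep all claims
    return merged, conflicts

if __name__ == "__main__":
    # Made-up records for a fictional entity.
    records = {
        "source_a": {"pop": 1000, "capital_city": "Poseidonis"},
        "source_b": {"population_total": 1200, "capital": "Poseidonis"},
    }
    merged, conflicts = integrate(records)
    print("agreed:", merged)
    print("conflicting:", conflicts)
```

The sketch only detects contradictions; deciding which source to trust is left open, as it is on the slide.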
Where to?
Can we bound the knowledge needed
  for a particular application?
  for a useful, sharable, general resource?
Which of these approaches seems most realistic?
  build by hand
  extract from dictionaries
  mine text corpora
  community knowledge entry
  use existing resources