Web-scale Information Extraction in KnowItAll Oren Etzioni et al. U. of Washington WWW’2004 Presented by Zheng Shao, CS591CXZ
Outline: Motivation; System Architecture; Detailed Techniques (Search Engine Interface, Extractor, Probabilistic Assessment); Experimental Results; Future Work; Conclusion
Motivation Why Web-scale Information Extraction? The Web is the largest knowledge base. Extracting information by searching the Web is not easy, e.g., listing the cities in the world whose population is above 400,000, or the humans who have visited space. Unless we find the “right” document, this work can be a tedious, error-prone process of piecemeal search.
Motivation (2) Previous Information Extraction Work. Supervised learning: difficult to scale to the Web because of the diversity of the Web and the prohibitive cost of creating an equally diverse set of hand-tagged documents. Weakly supervised learning and bootstrapping: need domain-specific seeds; learn rules from seeds, and then vice versa. KnowItAll: domain-independent; uses a bootstrapping technique.
System Architecture: 4 Components and Data Flow. Components: Extractor, Search Engine Interface, Assessor, Database.
System Architecture: System Work Flow. Diagram: Extractor, Search Engine Interface, Assessor, Database, Web Pages; labels: Rule, Rule template, Keywords. Rule template: NP1 “such as” NPList2 & head(NP1) = plural(name(Class1)) & properNoun(head(each(NPList2))) => instanceOf(Class1,head(each(NPList2))), where NP1 is a noun phrase and NPList2 is a noun phrase list. Rule for Country: NP1 “such as” NPList2 & head(NP1) = “countries” & properNoun(head(each(NPList2))) => instanceOf(Country,head(each(NPList2))). Keywords: “countries such as”
System Architecture: System Work Flow. Diagram: Extractor, Search Engine Interface, Assessor, Database, Web Pages; labels: Rule, Extracted Information, Knowledge. Extracted information: “the United Kingdom and Canada”; “India”; “North Korea, Iran, India and Pakistan”; “Japan”; “Iraq, Italy and Spain”; … Knowledge: the United Kingdom; Canada; India; North Korea; Iran; … Discriminator phrases: “Country AND X”; “Countries such as X”. Instantiated for one extraction: “Country AND the United Kingdom”; “Countries such as the United Kingdom”. Frequency: the search-engine hit counts for these phrases.
System Architecture: Components (Extractor, Search Engine Interface, Assessor, Database). Search Engine Interface: distributes jobs to different search engines. Extractor: rule instantiation; information extraction. Assessor: discriminator phrase construction; assessment of the extracted information.
Search Engine Interface Metaphor: the information food chain, with the search engine as herbivore and KnowItAll as carnivore. Why build on top of search engines? No need to duplicate existing work; low cost, time, and effort. Query distribution: make sure not to overload any search engine.
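The query-distribution idea can be sketched as a round-robin assignment, so that no single search engine receives every query. This is a minimal illustration, not KnowItAll's actual scheduler: the function name `distribute` and the engine names are hypothetical, and per-engine rate limiting is left out.

```python
from itertools import cycle


def distribute(queries, engines):
    """Assign queries to engines in round-robin order.

    Returns a dict mapping each engine name to the list of queries it gets.
    """
    assignment = {engine: [] for engine in engines}
    # cycle() repeats the engine list forever; zip() stops when queries run out.
    for query, engine in zip(queries, cycle(engines)):
        assignment[engine].append(query)
    return assignment


if __name__ == "__main__":
    jobs = ["countries such as", "cities such as", "countries and other"]
    print(distribute(jobs, ["engine_a", "engine_b"]))
```

With two engines and three queries, the first engine gets the first and third query and the second engine gets the second one.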
Extractor Extraction Template Examples: NP1 {“,”} “such as” NPList2; NP1 {“,”} “and other” NP2; NP1 {“,”} “is a” NP2. All are domain-independent!
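As a rough illustration of how the “such as” template pulls instances out of text, the sketch below uses a regular expression in place of a real noun-phrase chunker. The function name `extract_such_as` and the definition of a proper noun (capitalized words, optionally preceded by “the”) are simplifying assumptions; acronyms and names containing “and” are not handled.

```python
import re

# Assumption: a proper noun is one or more capitalized words, optionally
# preceded by "the". A real system would use a noun-phrase chunker instead.
PROPER_NOUN = r"(?:the\s+)?[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"
SEPARATOR = r"(?:,\s*and\s+|,\s*|\s+and\s+)"


def extract_such_as(text, plural_class):
    """Return the proper nouns listed after '<plural_class> such as'."""
    pattern = re.compile(
        r"(?i:" + re.escape(plural_class) + r")"   # class name, any case
        + r",?\s+such as\s+"                        # optional comma, then cue
        + r"(" + PROPER_NOUN + r"(?:" + SEPARATOR + PROPER_NOUN + r")*)"
    )
    instances = []
    for match in pattern.finditer(text):
        # Split the matched list on ", and", ",", or "and".
        for item in re.split(r",\s*and\s+|\s+and\s+|,\s*", match.group(1)):
            if item.strip():
                instances.append(item.strip())
    return instances


if __name__ == "__main__":
    sentence = "He visited countries such as North Korea, Iran, India and Pakistan."
    print(extract_such_as(sentence, "countries"))
```

Each returned string would then be a candidate instance of the class, still to be checked by the Assessor.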
Extractor (2) Noun phrase analysis A. “China is a country in Asia” B. “Garth Brooks is a country singer” In A, the word “country” is the head of a simple noun phrase. In B, the word “country” is not the head of a simple noun phrase. So, China is indeed a country while Garth Brooks is not a country.
Extractor (3) Rule Template: NP1 “such as” NPList2 & head(NP1) = plural( name( Class1 )) & properNoun( head( each( NPList2 ))) => instanceOf( Class1, head( each( NPList2))) The Extractor generates a rule for “Country” from this template by substituting “Country” for “Class1”.
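The substitution step can be sketched as follows. The `Rule` dataclass, the helper `plural`, and the function `instantiate_such_as` are hypothetical names introduced for illustration, and the pluralization is a naive approximation that happens to cover “country” and “city”.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    class_name: str        # e.g. "Country"
    pattern: str           # the template's surface pattern
    head_constraint: str   # required head noun of NP1
    keywords: str          # query string sent to the search engine


def plural(noun):
    """Naive English pluralization (an assumption, not a full morphology)."""
    lower = noun.lower()
    if len(lower) > 1 and lower.endswith("y") and lower[-2] not in "aeiou":
        return lower[:-1] + "ies"
    if lower.endswith(("s", "x", "ch", "sh")):
        return lower + "es"
    return lower + "s"


def instantiate_such_as(class_name):
    """Fill the 'such as' rule template for one class name."""
    head = plural(class_name)
    return Rule(
        class_name=class_name,
        pattern='NP1 "such as" NPList2',
        head_constraint=head,
        keywords=f"{head} such as",
    )


if __name__ == "__main__":
    print(instantiate_such_as("Country"))
```

For the class “Country”, this yields the head constraint “countries” and the keywords “countries such as”, matching the example rule shown earlier in the deck.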
Assessor: Naïve Bayes Model. Features: hit counts returned by the search engine. Event to predict: whether the extracted information is a fact. Adjusting the acceptance threshold trades off precision against recall.
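A minimal sketch of the Naïve Bayes combination, assuming each feature has already been reduced to a boolean and that the conditional probabilities are given. The function names, the prior of 0.5, and all numeric values are illustrative assumptions, not figures from the paper.

```python
def probability_of_fact(features, p_given_fact, p_given_not_fact, prior_fact=0.5):
    """Posterior probability that an extraction is a fact.

    features          -- list of booleans, one per discriminator
    p_given_fact      -- P(f_i is True | fact) for each feature
    p_given_not_fact  -- P(f_i is True | not a fact) for each feature
    Assumes the features are conditionally independent (the naive assumption)
    and that no probability is exactly 0 or 1.
    """
    fact_score = prior_fact
    not_fact_score = 1.0 - prior_fact
    for f, p_pos, p_neg in zip(features, p_given_fact, p_given_not_fact):
        fact_score *= p_pos if f else (1.0 - p_pos)
        not_fact_score *= p_neg if f else (1.0 - p_neg)
    return fact_score / (fact_score + not_fact_score)


def accept(probability, threshold):
    """Keep an extraction only if its probability clears the threshold."""
    return probability >= threshold


if __name__ == "__main__":
    # Illustrative numbers only.
    p = probability_of_fact([True, True], [0.9, 0.8], [0.1, 0.2])
    print(p, accept(p, threshold=0.9))
```

Raising the threshold keeps fewer extractions, which tends to raise precision and lower recall; lowering it does the opposite.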
Assessor (2) Use bootstrapping to learn P(fi|Φ) and P(fi|¬Φ), where Φ is the event that the extraction is a fact. Define PMI(I,D) = |Hits(D+I)| / |Hits(I)|, where I is the extracted NP and D is the discriminator phrase. Four candidate functions for P(fi|Φ) and P(fi|¬Φ): Hits-Thresh: P(hits > Hits(D+I) | Φ); Hits-Density: p(hits = Hits(D+I) | Φ); PMI-Thresh: P(pmi > PMI(I,D) | Φ); PMI-Density: p(pmi = PMI(I,D) | Φ)
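The PMI score and a PMI-Thresh style estimate can be sketched as below. The hit counts are made-up numbers, the function names are hypothetical, and estimating the threshold probability as an empirical fraction over PMI scores of known facts is one plausible reading of the slide, not necessarily the paper's exact procedure.

```python
def pmi(hits_discriminator_with_instance, hits_instance):
    """PMI(I, D) = |Hits(D + I)| / |Hits(I)|, as defined on the slide."""
    if hits_instance == 0:
        return 0.0  # assumption: an instance with no hits gets a score of 0
    return hits_discriminator_with_instance / hits_instance


def thresh_probability(observed, sample_values):
    """Empirical estimate of P(value > observed) from a sample of values.

    sample_values would hold PMI scores of known facts (for the 'fact' case)
    or of known non-facts (for the 'not a fact' case).
    """
    if not sample_values:
        raise ValueError("need at least one sample value")
    exceeding = sum(1 for value in sample_values if value > observed)
    return exceeding / len(sample_values)


if __name__ == "__main__":
    # Illustrative counts: 20 hits for the discriminator phrase with the
    # instance filled in, 1000 hits for the instance alone.
    score = pmi(20, 1000)
    print(score, thresh_probability(score, [0.01, 0.03, 0.05, 0.02]))
```

Replacing the PMI score with the raw count Hits(D+I) in `thresh_probability` would give the corresponding Hits-Thresh variant.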
Experimental Results Precision vs. recall: Thresh is better than Density; PMI is better than Hits.
Experimental Results (2) Run length: 4 days. Web pages retrieved vs. time: 3000 pages/hour. New facts vs. web pages retrieved: from 1 new fact per 3 pages to 1 new fact per 7 pages.
Conclusion & Future Work Conclusion: domain-independent rule templates; rules generated from rule templates; built on top of search engines; Assessor model: more data, more accurate. Future work: learn domain-specific rules to improve recall; automatically extend the ontology.
Q & A Thanks!