1
Information Extraction from Wikipedia: Moving Down the Long Tail
Fei Wu, Raphael Hoffmann, Daniel S. Weld
Department of Computer Science & Engineering, University of Washington, Seattle, WA, USA
Intelligence in Wikipedia: Fei Wu, Eytan Adar, Saleema Amershi, Oren Etzioni, James Fogarty, Raphael Hoffmann, Kayur Patel, Stef Schoenmackers & Michael Skinner
2
Motivating Vision: Next-Generation Search = Information Extraction + Ontology + Inference. Example query: Which performing artists were born in Chicago? Example snippets: "… Bob was born in Northwestern Memorial Hospital. …", "… Bob Black is an active actor who was selected as this year's …", "Northwestern Memorial Hospital is one of the country's leading hospitals in Chicago …"
3
Next-Generation Search: Information Extraction (…), Ontology (Actor ISA Performing Artist …), and Inference (Born-In(A) ^ PartOf(A,B) => Born-In(B) …), applied to snippets such as "… Bob was born in Northwestern Memorial Hospital. …", "… Bob Black is an active actor who …", "Northwestern Memorial Hospital is one of the country's leading hospitals in Chicago …"
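Read concretely, the slide's rule is just forward chaining over extracted facts, and the ontology fact (Actor ISA Performing Artist) is what lets the query about performing artists match Bob Black. A minimal Python sketch of the rule application; the facts are the slide's own example, and the code is illustrative rather than part of the system described:

```python
# Forward chaining with the slide's rule: Born-In(x, a) ^ PartOf(a, b) => Born-In(x, b).
# Facts come from the slide's example snippets; in the real system they would
# be produced by information extraction and the ontology.
born_in = {("Bob Black", "Northwestern Memorial Hospital")}
part_of = {("Northwestern Memorial Hospital", "Chicago")}

changed = True
while changed:                      # apply the rule until no new facts appear
    changed = False
    for person, place in list(born_in):
        for inner, outer in part_of:
            if place == inner and (person, outer) not in born_in:
                born_in.add((person, outer))
                changed = True

print(born_in)  # now also contains ("Bob Black", "Chicago")
```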
4
Wikipedia – Bootstrap for the Web. Goal: search over the Web; for now: search over Wikipedia, which is comprehensive, high-quality, and (semi-)structured.
5
Infoboxes. Infoboxes are designed to present summary information about an article's subject, so that similar subjects have a uniform look and a common format. An infobox is a generalization of a taxobox (from taxonomy), which summarizes information for an organism or group of organisms.
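For concreteness, this is what the underlying data looks like: an infobox is a template of attribute = value pairs embedded in the article's wikitext. The template and field names below are illustrative, not an exact copy of a real Wikipedia template:

```python
# What an infobox looks like in an article's wikitext: a template holding
# attribute = value pairs. Template and field names are illustrative.
infobox_wikitext = """{{Infobox settlement
| name       = Clearfield County
| seat       = Clearfield
| area_land  = 2,972 km²
| area_water = 17 km²
}}"""
print(infobox_wikitext)
```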
6
Infobox examples: Basic infobox; Taxobox (plant species)
7
More examples: Infobox People - Actor; Infobox - Convention Center
8
Outline › Background: Kylin Extraction › Long-Tailed Challenges: sparse infobox classes, incomplete articles › Moving Down the Long Tails: Shrinkage, Retraining, Extracting from the Web › Problem with Information Extraction › IWP (Intelligence in Wikipedia) › CCC and IE › Virtuous Cycle › IWP (Shrinkage, Retraining and Extracting from the Web) › Multilingual Extraction › Summary
9
Kylin: Autonomously Semantifying Wikipedia. Fully autonomous, requiring no additional human effort: it forms its training dataset from existing infoboxes and extracts semantic relations from Wikipedia articles. Kylin: a mythical hooved Chinese chimerical creature that is said to appear in conjunction with the arrival of a sage. (Wikipedia)
10
Kylin is a prototype self-supervised machine-learning system: it looks for classes of pages with similar infoboxes, determines their common attributes, and creates training examples from them.
11
Infobox Generation
12
Preprocessor: Schema Refinement. Free editing leads to schema drift: duplicate templates (U.S. County (1428), US County (574), Counties (50), County (19)), low usage of attributes, and duplicate attributes ("Census Yr", "Census Estimate Yr", "Census Est.", "Census Year"). Kylin uses strict name matching and keeps only attributes that occur in at least 15% of a class's articles.
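A minimal sketch of the rare-attribute filter described above, assuming strict (exact) name matching and the 15% occurrence threshold from the slide; the toy data and the raised threshold in the example call are illustrative only:

```python
from collections import Counter

def refine_schema(infoboxes, min_frac=0.15):
    """Keep only attribute names (matched strictly, i.e. by exact string
    comparison) that occur in at least `min_frac` of the class's infoboxes."""
    counts = Counter()
    for box in infoboxes:                 # each infobox: dict of attr -> value
        counts.update(box.keys())
    n = len(infoboxes)
    return {attr for attr, c in counts.items() if c / n >= min_frac}

# Toy example; the threshold is raised only because the sample is tiny.
boxes = [{"Census Yr": "2000", "Seat": "Clearfield"},
         {"Census Yr": "2000", "Seat": "X"},
         {"Seat": "Y", "Motto": "..."}]
print(refine_schema(boxes, min_frac=0.5))   # Seat and Census Yr survive; Motto is dropped
```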
13
Preprocessor: Training Dataset Construction. Example article text: "Its county seat is Clearfield. As of 2005, the population density was 28.2/km². Clearfield County was created in 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812. 2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water."
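The training-set construction amounts to a simple heuristic: if a sentence in the article contains an infobox attribute's value, treat it as a (possibly noisy) positive example for that attribute. A minimal sketch under that assumption; the sentence splitter and data are illustrative:

```python
import re

def label_sentences(article_text, infobox):
    """Heuristically build training data: a sentence containing an attribute's
    value becomes a (possibly noisy) positive example for that attribute."""
    sentences = re.split(r"(?<=[.!?])\s+", article_text)
    examples = []
    for attr, value in infobox.items():
        for sent in sentences:
            if value and value in sent:
                examples.append((attr, sent))
    return examples

text = ("Its county seat is Clearfield. "
        "Clearfield County was created in 1804 from parts of Huntingdon "
        "and Lycoming Counties.")
infobox = {"seat": "Clearfield", "founded": "1804"}
for attr, sent in label_sentences(text, infobox):
    print(attr, "->", sent)
# Note the noise: "Clearfield" also matches the second sentence, which is one
# reason the later classifier/extractor stages use bagging and relabeling.
```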
14
Classifier. Document classifier: uses lists and categories; fast, with 98.5% precision and 68.8% recall. Sentence classifier: predicts which attribute values are contained in a given sentence, using a maximum-entropy model; to cope with the noisy and incomplete training dataset, Kylin applies bagging.
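The sentence classifier described above (a maximum-entropy model with bagging to tolerate the noisy training set) maps naturally onto off-the-shelf tools. A minimal sketch using scikit-learn, whose LogisticRegression is a maximum-entropy classifier; the features, toy data, and sampling settings are illustrative and not Kylin's actual configuration:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training set: does the sentence mention a birth place?
sentences = [
    "Ada Cambridge was born in Norfolk, England, in 1844.",
    "Andrew Murray was born in Scotland in 1828.",
    "Bob Black was born in Northwestern Memorial Hospital.",
    "Jerry Seinfeld was born in Brooklyn, New York.",
    "She moved to Australia with her husband in 1870.",
    "The population density was 28.2/km2 in 2005.",
    "He is an active actor and comedian.",
    "Its county seat is Clearfield.",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag-of-words features feeding bagged maximum-entropy (logistic regression)
# models. With this tiny corpus we sample without replacement so every bag
# still contains both classes; with real data plain bootstrap bagging is usual.
# (`estimator` is called `base_estimator` in scikit-learn releases before 1.2.)
clf = make_pipeline(
    CountVectorizer(),
    BaggingClassifier(estimator=LogisticRegression(max_iter=1000),
                      n_estimators=10, max_samples=0.75, bootstrap=False),
)
clf.fit(sentences, labels)
print(clf.predict(["Bob was born in Chicago."]))
```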
15
CRF Extractor. Attribute-value extraction is treated as sequential data labeling, with a Conditional Random Field (CRF) model trained for each attribute independently. A relabeling step filters false-negative training examples: for instance, "2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water." is labeled only Water_area by the preprocessor but Water_area and Land_area by the classifier. Though Kylin is successful on popular classes, its performance drops on sparse classes where there is insufficient training data.
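A minimal sketch of casting attribute-value extraction as sequence labeling, assuming the third-party sklearn_crfsuite package is installed; the token features and tag set are illustrative and far simpler than Kylin's:

```python
import sklearn_crfsuite

def token_features(tokens, i):
    """A deliberately tiny per-token feature set; Kylin's real features are richer."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_number": tok.replace(",", "").replace(".", "").isdigit(),
        "looks_like_area": tok.endswith("km²") or tok.endswith("mi²"),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# One training sentence for a hypothetical 'water_area' extractor:
# tag the tokens that form the attribute value, everything else is 'O'.
tokens = "17 km² of it ( 0.56 % ) is water .".split()
labels = ["VALUE", "VALUE", "O", "O", "O", "O", "O", "O", "O", "O", "O"]

X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))   # the learned tag sequence for the training sentence
```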
16
Outline › Background: Kylin Extraction › Long-Tailed Challenges: sparse infobox classes, incomplete articles › Moving Down the Long Tails: Shrinkage, Retraining, Extracting from the Web › Problem with Information Extraction › IWP (Intelligence in Wikipedia) › CCC and IE › Virtuous Cycle › IWP (Shrinkage, Retraining and Extracting from the Web) › Multilingual Extraction › Summary
17
Long-Tail 1: Sparse Infobox Classes. Kylin performs well on popular classes: precision in the mid-70% to high-90% range and recall in the low-50% to mid-90% range. Kylin flounders on sparse classes with little training data: for the "US County" class it achieves 97.3% precision and 95.9% recall, while many other classes, such as "Irish Newspaper", contain only a very small number of infobox-containing articles.
18
Long-Tail 2: Incomplete Articles. Desired information is often missing from Wikipedia: of the roughly 1.8 million pages in the July 2007 snapshot of Wikipedia, many are short articles, and almost 800,000 (44.2%) are marked as stub pages, indicating that much needed information is missing.
19
Outline › Background: Kylin Extraction › Long-Tailed Challenges: sparse infobox classes, incomplete articles › Moving Down the Long Tails: Shrinkage, Retraining, Extracting from the Web › Problem with Information Extraction › IWP (Intelligence in Wikipedia) › CCC and IE › Virtuous Cycle › IWP (Shrinkage, Retraining and Extracting from the Web) › Multilingual Extraction › Summary
20
Shrinkage. An attempt to improve Kylin's performance on sparse classes: when training an extractor for an instance-sparse infobox class, shrinkage aggregates training data from its parent and child classes.
21
[Class hierarchy diagram: person (1201 articles) with attribute birth_place; subclasses performer (44) with attribute location, actor (8738) with attributes birthplace / birth_place, and comedian (106) with attributes cityofbirth / origin.] [McCallum et al., ICML98]
22
KOG (Kylin Ontology Generator) [Wu & Weld, WWW08] provides the class hierarchy and attribute mappings used for shrinkage: person (1201).birth_place is the parent of performer (44).location, actor (8738).birthplace / birth_place, and comedian (106).cityofbirth / origin.
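A minimal sketch of the shrinkage idea: when training an extractor for a sparse attribute, borrow down-weighted training sentences from the corresponding attributes of related classes. The attribute mapping mirrors the hierarchy on the previous slides, but the weights and data layout are illustrative:

```python
# Shrinkage, sketched: aggregate down-weighted training sentences from the
# corresponding attributes of related classes in the KOG hierarchy.
related = {
    # (class, attribute): [(related class, related attribute, weight), ...]
    ("performer", "location"): [
        ("person",   "birth_place", 0.5),   # parent class
        ("actor",    "birthplace",  0.5),   # child classes
        ("actor",    "birth_place", 0.5),
        ("comedian", "cityofbirth", 0.5),
        ("comedian", "origin",      0.5),
    ],
}

def shrinkage_training_set(training_data, cls, attr):
    """training_data maps (class, attribute) -> list of labeled sentences."""
    examples = [(s, 1.0) for s in training_data.get((cls, attr), [])]
    for other_cls, other_attr, weight in related.get((cls, attr), []):
        examples += [(s, weight) for s in training_data.get((other_cls, other_attr), [])]
    return examples

data = {
    ("performer", "location"):    ["She was born in Boston and began performing at 12."],
    ("person",    "birth_place"): ["Ada Cambridge was born in Norfolk, England."],
    ("actor",     "birth_place"): ["Bob Black was born in Chicago."],
}
print(shrinkage_training_set(data, "performer", "location"))
```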
23
Outline › Background: Kylin Extraction › Long-Tailed Challenges: sparse infobox classes, incomplete articles › Moving Down the Long Tails: Shrinkage, Retraining, Extracting from the Web › Problem with Information Extraction › IWP (Intelligence in Wikipedia) › CCC and IE › Virtuous Cycle › IWP (Shrinkage, Retraining and Extracting from the Web) › Multilingual Extraction › Summary
24
Retraining. Key question: how to identify relevant sentences in the sea of Web data? Complementary to shrinkage: harvest extra training data from the broader Web, e.g. "Andrew Murray was born in Scotland in 1828 …"
25
Retraining. Kylin extraction vs. TextRunner extraction: query TextRunner for relevant sentences, e.g. r1 = "Ada Cambridge was born in England in 1844 and moved to Australia with her curate husband in 1870."; r2 = "Ada Cambridge was born in Norfolk, England, in 1844."; t =
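A minimal sketch of the harvesting step: keep only Web sentences that mention both the subject and a value we already trust from the article's infobox, and feed them back as extra positive training examples. The filtering heuristic is illustrative; the actual TextRunner query is not shown here, since its API is outside this deck:

```python
def harvest_training_sentences(subject, known_value, web_sentences):
    """Keep Web sentences that mention both the subject and a value we already
    trust from the article's infobox; these become extra positive training
    examples for that attribute's extractor."""
    keep = []
    for sent in web_sentences:
        if subject.lower() in sent.lower() and str(known_value) in sent:
            keep.append(sent)
    return keep

# Sentences as they might come back from a TextRunner-style keyword query.
candidates = [
    "Ada Cambridge was born in England in 1844 and moved to Australia in 1870.",
    "Ada Cambridge was born in Norfolk, England, in 1844.",
    "Cambridge is a city in the east of England.",
]
# Known from the Wikipedia infobox: birth year = 1844.
print(harvest_training_sentences("Ada Cambridge", 1844, candidates))
```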
26
Effect of Shrinkage & Retraining
27
1755% improvement for a sparse class; 13.7% improvement for a popular class.
28
Outline › Background: Kylin Extraction › Long-Tailed Challenges: sparse infobox classes, incomplete articles › Moving Down the Long Tails: Shrinkage, Retraining, Extracting from the Web › Problem with Information Extraction › IWP (Intelligence in Wikipedia) › CCC and IE › Virtuous Cycle › IWP (Shrinkage, Retraining and Extracting from the Web) › Multilingual Extraction › Summary
29
Extraction from the Web. Idea: apply Kylin extractors trained on Wikipedia to general Web pages. Challenge: maintaining high precision, since general Web pages are noisy and many describe multiple objects. Key: retrieve relevant sentences. Procedure: generate a set of search-engine queries, retrieve the top-k pages from Google, and weight the extractions from these pages.
30
Choosing Queries. Example: get the birth_date attribute for the article titled "Andrew Murray (minister)". Queries: "andrew murray"; "andrew murray" birth date (using the attribute name); "andrew murray" was born in (using predicates from TextRunner); …
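A minimal sketch of the query-generation step shown above; the disambiguation-stripping rule and the predicate list are illustrative:

```python
def make_queries(title, attribute, predicates):
    """Build the search-engine queries shown on the slide: the bare name,
    name + attribute name, and name + predicate phrases (e.g. phrases mined
    from TextRunner). The disambiguation-stripping rule is illustrative."""
    name = title.split("(")[0].strip().lower()        # drop "(minister)" etc.
    queries = [f'"{name}"', f'"{name}" {attribute.replace("_", " ")}']
    queries += [f'"{name}" {p}' for p in predicates]
    return queries

print(make_queries("Andrew Murray (minister)", "birth_date", ["was born in"]))
# ['"andrew murray"', '"andrew murray" birth date', '"andrew murray" was born in']
```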
31
Weighting Extractions. Which extractions are more relevant? Features: the number of sentences between the extraction's sentence and the closest occurrence of the title ("andrew murray"); the rank of the page in Google's result list; and Kylin's extractor confidence.
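A minimal sketch of combining the three signals into one relevance score; the linear form and the weights are purely illustrative, since the deck only states that a weighted combination of the signals works best:

```python
def extraction_score(sentence_distance, page_rank, extractor_confidence,
                     w_dist=0.3, w_rank=0.2, w_conf=0.5):
    """Combine the slide's three signals into a single relevance score.
    The linear form and weights are illustrative only."""
    dist_score = 1.0 / (1 + sentence_distance)   # closer to a title mention is better
    rank_score = 1.0 / page_rank                 # higher-ranked result page is better
    return w_dist * dist_score + w_rank * rank_score + w_conf * extractor_confidence

# A candidate found on the top-ranked page, two sentences away from the nearest
# "andrew murray" mention, with CRF confidence 0.8:
print(round(extraction_score(sentence_distance=2, page_rank=1, extractor_confidence=0.8), 3))
```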
32
Web Extraction Experiment. Extractor confidence alone performs poorly; the weighted combination of features works best.
33
Combining Wikipedia & Web Recall Benefit from Shrinkage / Retraining…
34
Combining Wikipedia & Web Benefit from Shrinkage + Retraining + Web
35
Outline › Background: Kylin Extraction › Long-Tailed Challenges: sparse infobox classes, incomplete articles › Moving Down the Long Tails: Shrinkage, Retraining, Extracting from the Web › Problem with Information Extraction › IWP (Intelligence in Wikipedia) › CCC and IE › Virtuous Cycle › IWP (Shrinkage, Retraining and Extracting from the Web) › Multilingual Extraction › Summary
36
Problem: Information Extraction is Imprecise › Wikipedians don't want 90% precision. How to improve precision? › People!
37
Outline › Background: Kylin Extraction › Long-Tailed Challenges: sparse infobox classes, incomplete articles › Moving Down the Long Tails: Shrinkage, Retraining, Extracting from the Web › Problem with Information Extraction › IWP (Intelligence in Wikipedia) › CCC and IE › Virtuous Cycle › IWP (Shrinkage, Retraining and Extracting from the Web) › Multilingual Extraction › Summary
38
Intelligence in Wikipedia. What is IWP? › A project/system that aims to combine IE (information extraction) and CCC (communal content creation).
39
Information Extraction. Examples: › ZoomInfo.com › FlipDog.com › CiteSeer › Google. Advantage: autonomy. Disadvantage: expensive.
40
IE system contributors: contributors in this room? › Wikipedia IE systems › CiteSeer › Rexa › DBlife
41
Communal Content Creation. Examples: › Wikipedia › eBay › Netflix. › Advantage: more accurate than IE. › Disadvantage: bootstrapping, incentives, and management.
42
Outline › Background: Kylin Extraction › Long-Tailed Challenges: sparse infobox classes, incomplete articles › Moving Down the Long Tails: Shrinkage, Retraining, Extracting from the Web › Problem with Information Extraction › IWP (Intelligence in Wikipedia) › CCC and IE › Virtuous Cycle › IWP (Shrinkage, Retraining and Extracting from the Web) › Multilingual Extraction › Summary
43
Virtuous Cycle
44
Contributing as a Non-Primary Task. Encourage contributions without annoying or abusing readers. › Compared 5 different interfaces.
46
Results. Contribution rate: from 1.6% up to 13%, depending on the interface; 90% of positive labels were correct.
47
Outline › Background: Kylin Extraction › Long-Tailed Challenges: sparse infobox classes, incomplete articles › Moving Down the Long Tails: Shrinkage, Retraining, Extracting from the Web › Problem with Information Extraction › IWP (Intelligence in Wikipedia) › CCC and IE › Virtuous Cycle › IWP (Shrinkage, Retraining and Extracting from the Web) › Multilingual Extraction › Summary
48
IWP and Shrinkage, Retraining, and Extracting from the Web. Shrinkage improves IWP's precision and recall; retraining improves the robustness of IWP's extractors; Web extraction further helps IWP's performance.
49
Multi-Lingual Extraction. Idea: further leverage the virtuous feedback cycle. Use IE methods to add or update missing information by copying it from one language edition to another, and use CCC to validate and improve the updates. Example: Nombre = "Jerry Seinfeld" and Name = "Jerry Seinfeld"; Cónyuge = "Jessica Sklar" and Spouse = "Jessica Seinfeld".
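A minimal sketch of the cross-language idea: align attributes between two language editions, copy values that are missing, and flag disagreements for contributor (CCC) review. The attribute alignment table is illustrative; the values reuse the slide's example:

```python
# Cross-language reconciliation, sketched. Attribute alignment is illustrative.
attr_map = {"Nombre": "Name", "Cónyuge": "Spouse"}   # Spanish -> English

def reconcile(es_box, en_box):
    to_copy, to_review = {}, {}
    for es_attr, en_attr in attr_map.items():
        es_val, en_val = es_box.get(es_attr), en_box.get(en_attr)
        if es_val and not en_val:
            to_copy[en_attr] = es_val               # fill the gap automatically (IE)
        elif es_val and en_val and es_val != en_val:
            to_review[en_attr] = (es_val, en_val)   # let contributors validate (CCC)
    return to_copy, to_review

es = {"Nombre": "Jerry Seinfeld", "Cónyuge": "Jessica Sklar"}
en = {"Name": "Jerry Seinfeld", "Spouse": "Jessica Seinfeld"}
print(reconcile(es, en))
# ({}, {'Spouse': ('Jessica Sklar', 'Jessica Seinfeld')})
```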
50
Summary. Kylin's initial performance on sparse classes is unacceptable. Methods for increasing recall: shrinkage, retraining, and extraction from the Web.
51
Summary. IWP is developing AI methods to facilitate the growth, operation and use of Wikipedia. Initial goal: extraction of a giant knowledge base of semantic triples, supporting faceted browsing and serving as input to a reasoning-based question-answering system. How: IE + CCC.
52
Questions