Information Extraction from Wikipedia: Moving Down the Long Tail Fei Wu, Raphael Hoffmann, Daniel S. Weld Department of Computer Science & Engineering.


Information Extraction from Wikipedia: Moving Down the Long Tail Fei Wu, Raphael Hoffmann, Daniel S. Weld Department of Computer Science & Engineering University of Washington Seattle, WA, USA Intelligence in Wikipedia: Fei Wu, Eytan Adar, Saleema Amershi, Oren Etzioni, James Fogarty, Raphael Hoffmann, Kayur Patel, Stef Schoenmackers & Michael Skinner

Motivating Vision
Next-Generation Search = Information Extraction + Ontology + Inference
Which performing artists were born in Chicago?
… Bob was born in Northwestern Memorial Hospital. …
… Bob Black is an active actor who was selected as this year’s …
… Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago …

Next-Generation Search
Information Extraction: …
Ontology: Actor ISA Performing Artist …
Inference: Born-In(A) ^ PartOf(A,B) => Born-In(B) …
… Bob was born in Northwestern Memorial Hospital. …
… Bob Black is an active actor who …
… Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago …
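
The three ingredients on this slide can be sketched on the slide's own example. This is a minimal illustration, not the system's implementation: the dictionaries and function names are assumptions, the slide's "Bob" and "Bob Black" are assumed to be the same person, and the inference rule is written with the person as an explicit argument.

```python
# Toy knowledge base built from the slide's example sentences.
# All structures and names here are illustrative assumptions.
ISA = {"Actor": "Performing Artist"}                        # ontology: class -> superclass
OCCUPATION = {"Bob Black": "Actor"}                         # extracted fact
BORN_IN = {"Bob Black": "Northwestern Memorial Hospital"}   # extracted fact
PART_OF = {"Northwestern Memorial Hospital": "Chicago"}     # extracted fact


def is_a(cls, target):
    """Walk up the ISA hierarchy to test whether cls is target or a subclass of it."""
    while cls is not None:
        if cls == target:
            return True
        cls = ISA.get(cls)
    return False


def born_in(entity, place):
    """Born-In(x, A) and PartOf(A, B) imply Born-In(x, B), applied repeatedly."""
    location = BORN_IN.get(entity)
    while location is not None:
        if location == place:
            return True
        location = PART_OF.get(location)
    return False


def performing_artists_born_in(place):
    """Answer the slide's query by combining extraction, ontology and inference."""
    return sorted(
        entity for entity, cls in OCCUPATION.items()
        if is_a(cls, "Performing Artist") and born_in(entity, place)
    )


print(performing_artists_born_in("Chicago"))  # ['Bob Black']
```

No single sentence states the answer; it only appears once the extracted facts are combined with the ISA link and the PartOf rule.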

Wikipedia – Bootstrap for the Web  Goal: search over the Web  Now: search over Wikipedia  Comprehensive  High-quality  (Semi-)Structured data

Infoboxes  Infoboxes are designed to present summary information about an article's subject, so that similar subjects have a uniform look and a common format  An infobox is a generalization of a taxobox (from taxonomy), which summarizes information for an organism or group of organisms.

Infobox examples Basic infobox Taxobox –Plant species

More examples Infobox People - Actor Infobox - Convention Center

Outline
 Background: Kylin Extraction
 Long-Tailed Challenges
› Sparse infobox classes
› Incomplete articles
 Moving Down the Long Tails
› Shrinkage
› Retraining
› Extracting from the Web
 Problem with Information Extraction
 IWP (Intelligence in Wikipedia)
 CCC and IE
 Virtuous Cycle
 IWP (Shrinkage, Retraining and Extracting from Web)
 Multilingual Extraction
 Summary

Kylin: Autonomously Semantifying Wikipedia
 Totally autonomous, with no additional human effort
 Forms a training dataset based on infoboxes
 Extracts semantic relations from Wikipedia articles
Kylin: a mythical hooved Chinese chimerical creature that is said to appear in conjunction with the arrival of a sage (Wikipedia)

Kylin  It is a prototype of a self-supervised machine learning system  It looks for classes of pages with similar infoboxes  It determines common attributes  It creates training examples

Infobox Generation

Preprocessor: Schema Refinement
Free editing -> schema drift
 Duplicate templates: U.S. County (1428), US County (574), Counties (50), County (19)
 Low usage of attributes
 Duplicate attributes: “Census Yr”, “Census Estimate Yr”, “Census Est.”, “Census Year”
Kylin:
 Strict name match
 15% occurrences
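
A minimal sketch of one way to read the two Kylin bullets: attributes are merged only when their names match exactly, and rarely used attributes are dropped. Treating "15% occurrences" as a minimum usage threshold is an assumption, as are the function name `refine_schema` and the sample data.

```python
from collections import Counter


def refine_schema(infoboxes, min_usage=0.15):
    """Keep attributes, matched by exact name, used in at least min_usage of the infoboxes.

    The 0.15 default reflects an assumed reading of the slide's "15% occurrences".
    """
    counts = Counter(attr for box in infoboxes for attr in set(box))
    total = len(infoboxes)
    return sorted(attr for attr, n in counts.items() if n / total >= min_usage)


# Ten hypothetical county infoboxes; one of them uses the variant name "Census Yr".
boxes = [{"county seat": "Clearfield", "Census Yr": "2005"}]
boxes += [{"county seat": "X", "Census Year": "2005"} for _ in range(9)]

kept = refine_schema(boxes)
print(kept)  # ['Census Year', 'county seat']
```

Because matching is strict, "Census Yr" is not merged with "Census Year"; it falls below the threshold and is discarded.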

Preprocessor: Training Dataset Construction
Its county seat is Clearfield. As of 2005, the population density was 28.2/km². Clearfield County was created on 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until … 2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
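
Training dataset construction can be sketched as matching infobox values against the article's sentences. The unique-match heuristic below is a simplifying assumption, not a description of Kylin's actual matching rules, and the helper names and infobox values are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_training_data(article_text, infobox):
    """Pair each infobox attribute with the single sentence that contains its value.

    Values matching no sentence, or more than one, are skipped (assumed heuristic).
    """
    sentences = split_sentences(article_text)
    examples = []
    for attribute, value in infobox.items():
        matches = [s for s in sentences if value in s]
        if len(matches) == 1:
            examples.append((attribute, matches[0]))
    return examples


article = ("Its county seat is Clearfield. "
           "As of 2005, the population density was 28.2/km².")
infobox = {
    "county_seat": "Clearfield",
    "population_density": "28.2/km²",
    "founded": "1799",  # hypothetical value that appears in no sentence
}

for attribute, sentence in build_training_data(article, infobox):
    print(attribute, "->", sentence)
```

Each pair becomes a positive training example for that attribute; the remaining sentences serve as negatives.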

Classifier  Document Classifier: uses lists and categories  Fast  Precision (98.5%)  Recall (68.8%)  Sentence Classifier  Predicts which attribute values are contained in a given sentence.  It uses a maximum entropy model.  To cope with the noisy and incomplete training dataset, Kylin applies bagging.

CRF Extractor
Conditional Random Fields Model
 Attribute value extraction: sequential data labeling
 A CRF model is trained for each attribute independently
 Relabeling: filter false-negative training examples
Example: 2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
Preprocessor: Water_area
Classifier: Water_area; Land_area
 Though Kylin is successful on popular classes, its performance decreases on sparse classes where there is insufficient training data.
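
The relabeling step can be sketched on the slide's example: the preprocessor labeled the sentence only with Water_area, while the sentence classifier also predicts Land_area, so the sentence should not be used as a negative example when training the Land_area extractor. The function below is an assumed simplification of that filter, with hypothetical names and data structures.

```python
def split_training_sentences(sentences, attribute, preprocessor_labels, classifier_labels):
    """Return (positives, negatives) for one attribute's extractor.

    A sentence the preprocessor did not label with the attribute, but which the
    sentence classifier predicts contains it, is treated as a likely false
    negative and dropped from training (assumed reading of the slide).
    """
    positives, negatives = [], []
    for sentence in sentences:
        if attribute in preprocessor_labels.get(sentence, set()):
            positives.append(sentence)
        elif attribute in classifier_labels.get(sentence, set()):
            continue  # likely false negative: filter it out
        else:
            negatives.append(sentence)
    return positives, negatives


area = "2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water."
seat = "Its county seat is Clearfield."

preprocessor = {area: {"Water_area"}}
classifier = {area: {"Water_area", "Land_area"}}

print(split_training_sentences([area, seat], "Land_area", preprocessor, classifier))
print(split_training_sentences([area, seat], "Water_area", preprocessor, classifier))
```

For Land_area the area sentence is removed from the negatives, while for Water_area it remains a positive example.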


Long-Tail 1: Sparse Infobox Class
Kylin performs well on popular classes:
 Precision: mid 70% ~ high 90%
 Recall: low 50% ~ mid 90%
Kylin flounders on sparse classes, which have little training data.
E.g., for the “US County” class Kylin has 97.3% precision and 95.9% recall, while many other classes, such as “Irish Newspaper”, contain only a very small number of articles with infoboxes.

Long-Tail 2: Incomplete Articles
Desired information is missing from Wikipedia.
Among 1.8 million pages (Wikipedia, July 2007), many are short articles, and almost 800,000 (44.2%) are marked as stub pages, indicating that much-needed information is missing.


Shrinkage  An attempt to improve Kylin’s performance on sparse classes.  Shrinkage is used when training an extractor for an instance-sparse infobox class: training data is aggregated from its parent and children classes.

[Diagram: class hierarchy with infobox counts and related attributes: person (1201) .birth_place; performer (44) .location; actor (8738) and comedian (106) with .birthplace, .birth_place, .cityofbirth, .origin] [McCallum et al., ICML98]

Shrinkage
KOG (Kylin Ontology Generator) [Wu & Weld, WWW08]
[Diagram: class hierarchy with infobox counts and related attributes: person (1201) .birth_place; performer (44) .location; actor (8738) and comedian (106) with .birthplace, .birth_place, .cityofbirth, .origin]
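
A minimal sketch of shrinkage for the sparse attribute performer.location: examples from mapped attributes of related classes are added to the training set with a reduced weight. The attribute mapping, the hierarchy distances, the decay factor and the sample sentences are all illustrative assumptions; the slide gives only the class names, counts and attribute names.

```python
# Assumed mapping of related-class attributes onto performer.location.
# Values are assumed distances (in hierarchy steps) from the performer class.
RELATED_ATTRIBUTES = {
    ("person", "birth_place"): 1,   # assumed parent class
    ("actor", "birthplace"): 1,     # assumed child class
    ("comedian", "origin"): 1,      # assumed child class
}


def shrinkage_training_set(target_examples, related_examples, decay=0.5):
    """Return (sentence, weight) pairs for training the target extractor.

    Target-class examples keep weight 1.0; examples borrowed from related
    classes are down-weighted by decay ** distance (decay is a made-up value).
    """
    weighted = [(sentence, 1.0) for sentence in target_examples]
    for key, sentences in related_examples.items():
        weight = decay ** RELATED_ATTRIBUTES[key]
        weighted.extend((sentence, weight) for sentence in sentences)
    return weighted


target = ["He performs mainly in Chicago."]              # hypothetical sentence
related = {
    ("person", "birth_place"): ["She was born in Seattle."],
    ("actor", "birthplace"): ["The actor was born in Boston."],
}

training = shrinkage_training_set(target, related)
for sentence, weight in training:
    print(weight, sentence)
```

The down-weighting lets the 44 performer infoboxes dominate where they exist, while the far larger person and actor classes supply volume.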


Retraining
Complementary to Shrinkage: harvest extra training data from the broader Web
Key: identify relevant sentences in the sea of Web data
Example: Andrew Murray was born in Scotland in 1828 ……

Retraining
Query TextRunner for relevant sentences:
Kylin extraction: t = …
TextRunner extractions:
r1 = Ada Cambridge was born in England in 1844 and moved to Australia with her curate husband in …
r2 = Ada Cambridge was born in Norfolk, England, in …
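
The matching step can be sketched over an in-memory list of TextRunner-style extractions. Everything here is an assumption made for illustration: the Kylin tuple `t` is not legible on the slide, so the one below is hypothetical, the dictionary layout is not TextRunner's actual output format, and the sentences are shortened.

```python
def harvest_sentences(kylin_tuple, textrunner_extractions):
    """Return sentences whose extraction agrees with a known (subject, attribute, value).

    An extraction matches when its first argument equals the subject and its
    second argument mentions the value (case-insensitive; assumed criterion).
    """
    subject, _attribute, value = kylin_tuple
    return [
        r["sentence"] for r in textrunner_extractions
        if r["arg1"].lower() == subject.lower() and value.lower() in r["arg2"].lower()
    ]


# Hypothetical Kylin tuple; the slide's own value of t is truncated.
t = ("Ada Cambridge", "birth_place", "England")

extractions = [
    {"arg1": "Ada Cambridge", "pred": "was born in", "arg2": "England",
     "sentence": "Ada Cambridge was born in England in 1844 ..."},
    {"arg1": "Ada Cambridge", "pred": "was born in", "arg2": "Norfolk, England",
     "sentence": "Ada Cambridge was born in Norfolk, England, in ..."},
    {"arg1": "Ada Cambridge", "pred": "moved to", "arg2": "Australia",
     "sentence": "Ada Cambridge moved to Australia ..."},
]

extra_training = harvest_sentences(t, extractions)
print(len(extra_training), "extra positive sentences")
```

The harvested sentences are added as positive examples when the extractor is retrained.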

Effect of Shrinkage & Retraining

1755% improvement for a sparse class 13.7% improvement for a popular class


Extraction from the Web
Idea: apply Kylin extractors trained on Wikipedia to general Web pages
Challenge: maintain high precision
 General Web pages are noisy
 Many Web pages describe multiple objects
Key: retrieve relevant sentences
Procedure:
 Generate a set of search engine queries
 Retrieve the top-k pages from Google
 Weight extractions from these pages

Choosing Queries
Example: get the birth date attribute for the article titled “Andrew Murray (minister)”
 “andrew murray”
 “andrew murray” birth date (attribute name)
 “andrew murray” was born in (predicates from TextRunner)
 “andrew murray” …
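
The query patterns in this example can be generated mechanically. The sketch below reproduces the three patterns shown on the slide; the function name and the rule for stripping the parenthetical disambiguator from the title are assumptions.

```python
import re


def build_queries(article_title, attribute_name, predicates):
    """Build search queries: quoted title, title + attribute name, title + predicate."""
    # Drop a trailing disambiguator such as "(minister)" and lowercase the name.
    name = re.sub(r"\s*\(.*?\)\s*$", "", article_title).strip().lower()
    quoted = f'"{name}"'
    queries = [quoted, f"{quoted} {attribute_name}"]
    queries += [f"{quoted} {predicate}" for predicate in predicates]
    return queries


for query in build_queries("Andrew Murray (minister)", "birth date", ["was born in"]):
    print(query)
# "andrew murray"
# "andrew murray" birth date
# "andrew murray" was born in
```

In the talk the predicates come from TextRunner; here they are passed in as a plain list.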

Weighting Extractions
Which extractions are more relevant?
Features:
 Number of sentences between the sentence and the closest occurrence of the title (‘andrew murray’)
 Rank of the page in Google’s result list
 Kylin’s extractor confidence
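
One simple way to combine the three features is a weighted sum. The sketch below is only an illustration: the feature transforms, the weights and the candidate data are made-up assumptions, and the slides do not say how the actual weights were chosen.

```python
def score(extraction, weights):
    """Weighted combination of the three features listed on the slide."""
    proximity = 1.0 / (1 + extraction["sentences_from_title"])  # closer to the title is better
    rank = 1.0 / extraction["google_rank"]                       # rank 1 is the top result
    return (weights["proximity"] * proximity
            + weights["rank"] * rank
            + weights["confidence"] * extraction["confidence"])


# Illustrative weights only.
WEIGHTS = {"proximity": 0.4, "rank": 0.2, "confidence": 0.4}

candidates = [
    {"value": "value A", "sentences_from_title": 0, "google_rank": 1, "confidence": 0.6},
    {"value": "value B", "sentences_from_title": 5, "google_rank": 8, "confidence": 0.9},
]

best = max(candidates, key=lambda c: score(c, WEIGHTS))
print(best["value"], round(score(best, WEIGHTS), 2))  # value A 0.84
```

Candidate B has the higher extractor confidence but comes from a low-ranked page, far from any mention of the title, so the combined score prefers candidate A.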

Web Extraction Experiment Extractor confidence alone performs poorly Weighted combination is the best

Combining Wikipedia & Web Recall Benefit from Shrinkage / Retraining…

Combining Wikipedia & Web Benefit from Shrinkage + Retraining + Web


Problem  Information Extraction is Imprecise › Wikipedians Don’t Want 90% Precision  How to Improve Precision? › People!


Intelligence in Wikipedia  What is IWP? › A project/system that aims to combine  IE (Information Extraction)  CCC (communal content creation)

Information Extraction  Examples: › Zoominfo.com › Flipdog.com › Citeseer › Google  Advantage: Autonomy  Disadvantage: Expensive

IE system contributors  Contributors in this room? › Wikipedia IE systems › Citeseer › Rexa › DBLife

Communal Content Creation  Examples › Wikipedia › eBay › Netflix  Advantage: more accurate than IE  Disadvantage: bootstrapping, incentives, and management


Virtuous Cycle

Contributing as a Non-Primary Task  Encourage contributions  Without annoying or abusing readers › Compared 5 different interfaces

Results
 Contribution rate: 1.6% → 13%
 90% of positive labels were correct


IWP and Shrinkage, Retraining, and Extracting from the Web
 Shrinkage: improves IWP’s precision and recall
 Retraining: improves the robustness of IWP’s extractors
 Extraction from the Web: further helps IWP’s performance

Multi-Lingual Extraction
Idea: further leverage the virtuous feedback cycle
 Use IE methods to add or update missing information by copying from one language to another
 Use CCC to validate and improve the updates
Example:
 Nombre = “Jerry Seinfeld” and Name = “Jerry Seinfeld”
 Cónyuge = “Jessica Sklar” and Spouse = “Jessica Seinfeld”
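
The idea can be sketched as a comparison of two aligned infoboxes: values missing in one language are proposed for copying, and conflicting values are queued for human validation. The attribute alignment and the function below are assumptions made for illustration; automatically translating or reconciling values is not covered.

```python
# Assumed alignment between Spanish and English infobox attribute names.
ATTRIBUTE_MAP = {"Nombre": "Name", "Cónyuge": "Spouse"}


def compare_infoboxes(es_box, en_box, attribute_map):
    """Return (to_copy, to_review) for updating the English infobox.

    to_copy:   values present only in the Spanish infobox (IE proposes the update)
    to_review: values that differ between languages (left for CCC to validate)
    """
    to_copy, to_review = {}, {}
    for es_attr, en_attr in attribute_map.items():
        es_value, en_value = es_box.get(es_attr), en_box.get(en_attr)
        if es_value is None:
            continue
        if en_value is None:
            to_copy[en_attr] = es_value
        elif es_value != en_value:
            to_review[en_attr] = (es_value, en_value)
    return to_copy, to_review


es_box = {"Nombre": "Jerry Seinfeld", "Cónyuge": "Jessica Sklar"}
en_box = {"Name": "Jerry Seinfeld", "Spouse": "Jessica Seinfeld"}

print(compare_infoboxes(es_box, en_box, ATTRIBUTE_MAP))
print(compare_infoboxes(es_box, {"Name": "Jerry Seinfeld"}, ATTRIBUTE_MAP))
```

In the slide's example the names agree and the spouse values differ, so the spouse entry is the one a person would be asked to check.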

Summary
Kylin’s initial performance is unacceptable
Methods for increasing recall:
 Shrinkage
 Retraining
 Extraction from the Web

Summary
IWP: developing AI methods to facilitate the growth, operation and use of Wikipedia
Initial goal: extraction of a giant knowledge base of semantic triples
 Faceted browsing
 Input to a reasoning-based question-answering system
How: IE + CCC

Questions