Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University.

Presentation transcript:

Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University Pittsburgh, PA USA

Outline
- Introduction
- System Architecture
  - Fetcher
  - Extractor
  - Ranker
- Evaluation
- Conclusion

What is Set Expansion?
For example,
- Given a query: {“spit”, “boogers”, “ear wax”}
- Answer is: {“puke”, “toe jam”, “sweat”, ...}
More formally,
- Given a small number of seeds: $x_1, x_2, \ldots, x_k$, where each $x_i \in S_t$
- Answer is a listing of other probable elements: $e_1, e_2, \ldots, e_n$, where each $e_i \in S_t$
A well-known example of a web-based set expansion system is Google Sets™.

What is it used for?
Derive features for…
- Named Entity Recognition (Settles, 2004) (Talukdar, 2006)
  - Expand true named entities in training set
  - Utilize expanded names to assign features to words
- Concept Learning (Cohen, 2000)
  - Given a set of instances, look in web pages for tables or lists that contain some of those instances
  - Automatically extract features from those pages
  - Define features over the instances found
- Relation Learning (Cafarella et al, 2005) (Etzioni et al, 2005)
  - Extract items from tables or lists that contain given seeds
  - Utilize extracted items and their contexts for learning relations

Our Set Expander: SEAL (Set Expander for Any Language)
Features
- Independent of human/markup language
  - Supports seeds in English, Chinese, Japanese, Korean, ...
  - Accepts documents in HTML, XML, SGML, TeX, WikiML, …
- Does not require pre-annotated training data
  - Utilizes a readily available corpus: the World Wide Web
  - Learns wrappers on the fly
Based on two research contributions
1. Automatic construction of wrappers
   - Extracts “lists” of entities on semi-structured web pages
2. Use of random graph walk
   - Ranks extracted entities so that those most likely to be in the target set are ranked higher

System Architecture
- Fetcher: downloads web pages from the Web
- Extractor: learns wrappers from web pages
- Ranker: ranks entities extracted by wrappers
Example ranked output: 1. Canon, 2. Nikon, 3. Olympus, 4. Pentax, 5. Sony, 6. Kodak, 7. Minolta, 8. Panasonic, 9. Casio, 10. Leica, 11. Fuji, 12. Samsung, 13. …

The Fetcher
Procedure:
1. Compose a search query using all seeds
2. Use the Google API to request the top N URLs
   - We use N = 100, 200, and 300 for evaluation
3. Fetch the URLs using a crawler
4. Send the fetched documents to the Extractor
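The Fetcher procedure above can be sketched roughly as follows. This is a minimal illustration, not SEAL's actual implementation: `compose_query`, `fetch_documents`, and the injected `search` and `download` callables are hypothetical names, and quoting each seed as a phrase is an assumption about how the query is composed.

```python
from typing import Callable, Dict, Iterable, List, Optional


def compose_query(seeds: Iterable[str]) -> str:
    """Step 1: build one query containing all seeds.

    Assumption: each seed is quoted so it is treated as an exact phrase.
    """
    return " ".join(f'"{seed}"' for seed in seeds)


def fetch_documents(
    seeds: Iterable[str],
    search: Callable[[str, int], List[str]],    # hypothetical: (query, n) -> URLs
    download: Callable[[str], Optional[str]],   # hypothetical: url -> text or None
    top_n: int = 100,
) -> Dict[str, str]:
    """Steps 2 to 4: request the top N URLs, fetch them, and return
    the documents (keyed by URL) that would be handed to the Extractor."""
    query = compose_query(seeds)
    documents: Dict[str, str] = {}
    for url in search(query, top_n)[:top_n]:
        text = download(url)
        if text:  # skip pages that failed to download or are empty
            documents[url] = text
    return documents
```

Passing the search and download functions in as arguments keeps the sketch independent of any particular search API or crawler.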

The Extractor
Learn wrappers from web documents and seeds on the fly
- Utilize semi-structured documents
- Wrappers are defined at the character level
  - No tokenization required; thus language independent
  - However, very specific; thus page-dependent
- Wrappers derived from document d are applied to d only

Extractor E1
Extractor E1 finds maximally long contexts that bracket all instances of every seed.
- It seems to be working… but what if I add one more instance of “toyota”?
- It seems to be working too… but how about a more complex example?
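A rough character-level sketch of the E1 idea: collect the text to the left and right of every occurrence of every seed, then take the longest common suffix of the left sides and the longest common prefix of the right sides. The function names are hypothetical, and exact, case-sensitive string matching is an assumption.

```python
import os
from typing import List, Optional, Sequence, Tuple


def occurrences(doc: str, seed: str) -> List[int]:
    """Start offsets of every occurrence of seed in doc."""
    starts, i = [], doc.find(seed)
    while i != -1:
        starts.append(i)
        i = doc.find(seed, i + 1)
    return starts


def learn_wrapper_e1(doc: str, seeds: Sequence[str]) -> Optional[Tuple[str, str]]:
    """Return the maximally long (left, right) contexts shared by ALL
    instances of EVERY seed, or None if no non-empty pair exists."""
    lefts, rights = [], []
    for seed in seeds:
        starts = occurrences(doc, seed)
        if not starts:
            return None  # a seed that never appears cannot be bracketed
        for i in starts:
            lefts.append(doc[:i][::-1])          # reversed, so suffix becomes prefix
            rights.append(doc[i + len(seed):])
    # os.path.commonprefix compares strings character by character.
    left = os.path.commonprefix(lefts)[::-1]
    right = os.path.commonprefix(rights)
    return (left, right) if left and right else None
```

A single stray instance of a seed in running text is enough to make the shared context empty, which is the failure case the slide's "more complex example" points at.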

Extractor E2
Callouts on the example page: “I am a noisy entity mention.” “Me too!”
- Can you find common contexts that bracket all instances of every seed? I guess not! Let’s try out Extractor E2 and see if it works…
- Extractor E2 finds maximally long contexts that bracket at least one instance of every seed.
- Hooray! It seems like Extractor E2 works! But how do we get rid of those noisy entity mentions?
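One way to illustrate the E2 relaxation is a brute-force sketch: try every combination that picks one instance per seed, compute the longest shared left and right contexts for that combination, and keep only the wrappers that are not contained in a longer one. This is an assumption-laden simplification (hypothetical names, exact string matching, exponential in the worst case), not the algorithm SEAL actually uses.

```python
import os
from itertools import product
from typing import List, Sequence, Set, Tuple


def occurrences(doc: str, seed: str) -> List[int]:
    """Start offsets of every occurrence of seed in doc."""
    starts, i = [], doc.find(seed)
    while i != -1:
        starts.append(i)
        i = doc.find(seed, i + 1)
    return starts


def learn_wrappers_e2(doc: str, seeds: Sequence[str]) -> Set[Tuple[str, str]]:
    """Wrappers (left, right) that bracket AT LEAST ONE instance of every seed."""
    spans = [[(i, i + len(s)) for i in occurrences(doc, s)] for s in seeds]
    if any(not per_seed for per_seed in spans):
        return set()
    found = set()
    for combo in product(*spans):  # one instance chosen per seed
        left = os.path.commonprefix([doc[:a][::-1] for a, _ in combo])[::-1]
        right = os.path.commonprefix([doc[b:] for _, b in combo])
        if left and right:
            found.add((left, right))
    # Keep only maximally long wrappers: drop any pair whose contexts are
    # both contained in the contexts of another pair.
    return {
        w for w in found
        if not any(o != w and o[0].endswith(w[0]) and o[1].startswith(w[1])
                   for o in found)
    }
```

Unlike the all-instances requirement, a stray mention of a seed in running text no longer prevents a wrapper from being learned from the list items.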

Extractor: Summary
A wrapper consists of a pair of left (L) and right (R) context strings
- All strings between (but not containing) L and R are extracted
  - Each is referred to as a “candidate entity mention”
We compared two versions of wrappers:
- Maximally long contextual strings that bracket…
  1. all instances of every seed (Extractor E1)
  2. at least one instance of every seed (Extractor E2)
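Applying a learned wrapper can be sketched as a simple scan: for each occurrence of L, take the text up to the next R and keep it if it is non-empty and does not itself contain L. The function name `apply_wrapper` is hypothetical, and this is only an illustration of the "between but not containing" rule stated above.

```python
from typing import List


def apply_wrapper(doc: str, left: str, right: str) -> List[str]:
    """Extract candidate entity mentions bracketed by left and right."""
    mentions = []
    start = doc.find(left)
    while start != -1:
        begin = start + len(left)
        end = doc.find(right, begin)  # nearest right context after this left
        if end == -1:
            break  # no right context remains anywhere after this point
        candidate = doc[begin:end]
        # The candidate cannot contain `right` (we stopped at its first
        # occurrence); also reject candidates that contain `left`.
        if candidate and left not in candidate:
            mentions.append(candidate)
        start = doc.find(left, start + 1)
    return mentions
```

For example, `apply_wrapper('<td>honda</td><td>acura</td>', '<td>', '</td>')` yields the two mentions `honda` and `acura`.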

The Ranker
Rank candidate entity mentions based on “similarity” to seeds
- Noisy mentions should be ranked lower
We compare two methods for ranking
1. Extracted Frequency (EF)
   - Number of times an entity mention is extracted
2. Random Graph Walk (GW)
   - Probability of an “entity mention” node being reached in a graph (explained in the next slide)
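Extracted Frequency is the simpler of the two rankers and can be sketched as a plain count. The function name and the tie-breaking rule (alphabetical order) are assumptions added for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_by_extracted_frequency(
    extractions: Iterable[Iterable[str]],
) -> List[Tuple[str, int]]:
    """Rank mentions by how many times they were extracted.

    `extractions` holds one list of mentions per wrapper application.
    Ties are broken alphabetically (an assumption, for determinism).
    """
    counts = Counter(m for mentions in extractions for m in mentions)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

A noisy mention picked up by only one wrapper ends up below mentions that several wrappers agree on.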

Building a Graph
A graph consists of a fixed set of…
- Node types: {seeds, document, wrapper, mention}
- Labeled directed edges: {find, derive, extract}
  - Each edge asserts that a binary relation $r$ holds
  - Each edge has an inverse relation $r^{-1}$ (graph is cyclic)
Figure: example graph with a seeds node (“ford”, “nissan”, “toyota”), document nodes curryauto.com and northpointcars.com, wrapper nodes Wrapper #1 to Wrapper #4, and mention nodes “honda” (26.1%), “acura” (34.6%), “chevrolet” (22.5%), “bmw pittsburgh” (8.4%), and “volvo chicago” (8.4%), connected by find, derive, and extract edges.
Reference: Minkov et al. Contextual Search and Name Disambiguation in Email using Graphs. SIGIR 2006

Random Graph Walk
Legend
- Node: $x$, $y$, $z$ (e.g. “curryauto.com”, “wrapper #1”, “honda”, “acura”, ...)
- Edge relation: $r$ (find, find$^{-1}$, derive, derive$^{-1}$, extract, extract$^{-1}$)
- An edge from $x$ to $y$ with relation $r$: $x \xrightarrow{r} y$
- Stop probability: $\lambda$, the probability of staying at a node (0.5)
Annotated quantities in the walk equations
- Probability of picking an edge relation $r$ given a source node $x$
- Probability of picking a target node $y$ given an edge relation $r$ and source node $x$
- Probability of reaching any node $z$ from $x$, made up of:
  - Probability of staying at node $x$
  - Probability of continuing to node $z$ from $x$ (recursive computation of probability)
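The annotations above are consistent with a lazy random walk of the form $P(z \mid x) = \lambda \, [x = z] + (1 - \lambda) \sum_r P(r \mid x) \sum_y P(y \mid r, x) \, P(z \mid y)$. That reading is an assumption, since the equations themselves are not in the transcript. The sketch below approximates it by iterating a fixed number of steps, and further assumes that relations and target nodes are picked uniformly; `add_edge` and `random_walk` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict


def make_graph():
    """graph[x][r] is the set of nodes y with an edge x --r--> y."""
    return defaultdict(lambda: defaultdict(set))


def add_edge(graph, x: str, r: str, y: str) -> None:
    """Add x --r--> y together with its inverse edge y --r^-1--> x."""
    graph[x][r].add(y)
    graph[y][r + "^-1"].add(x)


def random_walk(graph, start: str, stop_prob: float = 0.5,
                steps: int = 10) -> Dict[str, float]:
    """Approximate probability of the walk stopping at each node."""
    reached = defaultdict(float)
    walking = {start: 1.0}  # probability mass that has not stopped yet
    for _ in range(steps):
        moved = defaultdict(float)
        for x, mass in walking.items():
            reached[x] += stop_prob * mass      # stay at x
            relations = graph[x]
            if not relations:                   # dead end: remaining mass stays
                reached[x] += (1 - stop_prob) * mass
                continue
            for targets in relations.values():  # uniform choice of relation
                share = (1 - stop_prob) * mass / len(relations) / len(targets)
                for y in targets:               # uniform choice of target
                    moved[y] += share
        walking = moved
    return dict(reached)
```

Starting the walk at the seeds node and reading off the scores of the mention nodes gives a ranking in which mentions reachable through more wrappers and documents score higher.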

Evaluation Datasets

Evaluation Method
Mean Average Precision
- Commonly used for evaluating ranked lists in IR
- Contains recall- and precision-oriented aspects
- Sensitive to the entire ranking
- Mean of average precisions for each ranked list
In the average-precision formula, L = ranked list of extracted mentions, r = rank, Prec(r) = precision at rank r, and # True Entities = total number of true entities in this dataset. A mention at rank r is counted only if:
(a) the extracted mention at r matches any true mention, and
(b) there exists no other extracted mention at a rank less than r that is of the same entity as the one at r.
Evaluation Procedure (per dataset)
1. Randomly select three true entities and use their first listed mentions as seeds
2. Expand the three seeds obtained from step 1
3. Repeat steps 1 and 2 five times
4. Compute MAP for the five ranked lists
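A hedged sketch of the metric described above, assuming average precision sums Prec(r) over the ranks that satisfy conditions (a) and (b) and divides by the number of true entities. Two further assumptions: Prec(r) is taken as the fraction of the top r mentions that match any true mention, and `entity_of` is a hypothetical mapping from each true mention to its entity.

```python
from typing import Dict, List, Sequence


def average_precision(ranked: Sequence[str], entity_of: Dict[str, str],
                      num_true_entities: int) -> float:
    """Average precision of one ranked list of extracted mentions."""
    seen_entities = set()
    correct = 0   # mentions so far that match any true mention
    total = 0.0
    for r, mention in enumerate(ranked, start=1):
        entity = entity_of.get(mention)
        if entity is None:               # fails condition (a)
            continue
        correct += 1
        if entity not in seen_entities:  # condition (b): first mention of entity
            seen_entities.add(entity)
            total += correct / r         # Prec(r), under the stated assumption
    return total / num_true_entities


def mean_average_precision(ranked_lists: List[Sequence[str]],
                           entity_of: Dict[str, str],
                           num_true_entities: int) -> float:
    """Mean of the average precisions, e.g. over the five trials per dataset."""
    scores = [average_precision(l, entity_of, num_true_entities)
              for l in ranked_lists]
    return sum(scores) / len(scores)
```

Counting each entity only at its first matching mention keeps a system from inflating its score by listing several spellings of the same entity.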

Experimental Results
Legend: [Extractor] + [Ranker] + [Top N URLs]
- Extractor = { E1: Extractor E1, E2: Extractor E2 }
- Ranker = { EF: Extracted Frequency, GW: Graph Walk }
- N = { 100, 200, 300 }

Conclusion & Future Work
Conclusion
- Unsupervised approach for expanding sets of named entities
  - Domain and language independent
- SEAL performs better than Google Sets
  - Higher Mean Average Precision on our datasets
  - Handles not only English, but also Chinese and Japanese
Future Work
- Learn from graphs to re-rank extracted mentions
- Bootstrap named entities by using mentions extracted in a previous expansion as seeds
- Identify possible class names for expanded sets, e.g. car makers, constellations, presidents…

References

Top three mentions are the seeds. Try it out at
