Language-Independent Set Expansion of Named Entities using the Web
Richard C. Wang & William W. Cohen
Language Technologies Institute, Carnegie Mellon University
Pittsburgh, PA USA

Outline
  Introduction
  System Architecture
    Fetcher
    Extractor
    Ranker
  Evaluation
  Conclusion

What is Set Expansion?
For example,
  Given a query: {“spit”, “boogers”, “ear wax”}
  Answer is: {“puke”, “toe jam”, “sweat”, …}
More formally,
  Given a small number of seeds: $x_1, x_2, \dots, x_k$ where each $x_i \in S_t$
  Answer is a listing of other probable elements: $e_1, e_2, \dots, e_n$ where each $e_i \in S_t$
A well-known example of a web-based set expansion system is Google Sets™

What is it used for?
Derive features for…
  Named Entity Recognition (Settles, 2004) (Talukdar, 2006)
    Expand true named entities in training set
    Utilize expanded names to assign features to words
  Concept Learning (Cohen, 2000)
    Given a set of instances, look in web pages for tables or lists that contain some of those instances
    Automatically extract features from those pages
    Define features over the instances found
  Relation Learning (Cafarella et al, 2005) (Etzioni et al, 2005)
    Extract items from tables or lists that contain given seeds
    Utilize extracted items and their contexts for learning relations

Our Set Expander: SEAL (Set Expander for Any Language)
Features
  Independent of human/markup language
    Supports seeds in English, Chinese, Japanese, Korean, …
    Accepts documents in HTML, XML, SGML, TeX, WikiML, …
  Does not require pre-annotated training data
    Utilizes a readily available corpus: the World Wide Web
    Learns wrappers on the fly
Based on two research contributions
  1. Automatic construction of wrappers
     Extracts “lists” of entities on semi-structured web pages
  2. Use of random graph walk
     Ranks extracted entities so that those most likely to be in the target set are ranked higher

System Architecture
  Fetcher: download web pages from the Web
  Extractor: learn wrappers from web pages
  Ranker: rank entities extracted by wrappers
Example ranked output (figure):
  1. Canon  2. Nikon  3. Olympus  4. Pentax  5. Sony  6. Kodak  7. Minolta  8. Panasonic  9. Casio  10. Leica  11. Fuji  12. Samsung  13. …

The Fetcher
Procedure:
  1. Compose a search query using all seeds
  2. Use the Google API to request the top N URLs
     We use N = 100, 200, and 300 for evaluation
  3. Fetch the URLs using a crawler
  4. Send the fetched documents to the Extractor

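The four Fetcher steps can be sketched as follows. This is only an illustration: the slides do not say how the query is composed or which crawler is used, so quoting each seed, and the `search` and `download` callables (hypothetical stand-ins for the search API and the crawler), are assumptions.

```python
from typing import Callable, Iterable, List, Tuple


def compose_query(seeds: Iterable[str]) -> str:
    """Step 1: build one query from all seeds.

    Assumption: each seed is quoted so multi-word seeds stay together.
    """
    return " ".join(f'"{seed}"' for seed in seeds)


def fetch_documents(
    seeds: Iterable[str],
    search: Callable[[str, int], List[str]],   # hypothetical: (query, n) -> URLs
    download: Callable[[str], str],            # hypothetical: url -> page text
    top_n: int = 100,
) -> List[Tuple[str, str]]:
    """Steps 2-4: request the top N URLs, fetch them, return (url, text) pairs."""
    query = compose_query(seeds)
    urls = search(query, top_n)[:top_n]
    documents = []
    for url in urls:
        try:
            documents.append((url, download(url)))
        except Exception:
            # Skip pages that cannot be fetched; the rest still go to the Extractor.
            continue
    return documents
```

Passing the search and download functions in as parameters keeps the sketch independent of any particular search service.
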
The Extractor
  Learns wrappers from web documents and seeds on the fly
    Utilizes semi-structured documents
  Wrappers are defined at the character level
    No tokenization required; thus language independent
    However, very specific; thus page-dependent
  A wrapper derived from document d is applied to d only

Extractor E1 finds maximally-long contexts that bracket all instances of every seed
  It seems to be working… but what if I add one more instance of “toyota”?
  It seems to be working too… but how about a more complex example?

Figure callouts on the example page: “I am a noisy entity mention”, “Me too!”
  Can you find common contexts that bracket all instances of every seed?
  I guess not! Let’s try out Extractor E2 and see if it works…
Extractor E2 finds maximally-long contexts that bracket at least one instance of every seed
  Hooray! It seems like Extractor E2 works! But how do we get rid of those noisy entity mentions?

Extractor: Summary
  A wrapper consists of a pair of left (L) and right (R) context strings
    All strings between (but not containing) L and R are extracted
    Each extracted string is referred to as a “candidate entity mention”
  We compared two versions of wrapper:
    Maximally-long contextual strings that bracket…
    1. all instances of every seed (Extractor E1)
    2. at least one instance of every seed (Extractor E2)

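To make the wrapper idea concrete, here is a minimal brute-force sketch in the spirit of Extractor E2. It is not the SEAL implementation: the function names are hypothetical, and enumerating every combination of seed occurrences is an assumption chosen for clarity rather than efficiency.

```python
import re
from itertools import product


def common_prefix(strings):
    """Longest string that every input starts with."""
    prefix = strings[0]
    for s in strings[1:]:
        i = 0
        while i < min(len(prefix), len(s)) and prefix[i] == s[i]:
            i += 1
        prefix = prefix[:i]
    return prefix


def common_suffix(strings):
    """Longest string that every input ends with."""
    return common_prefix([s[::-1] for s in strings])[::-1]


def learn_wrappers_e2(doc, seeds):
    """Return (L, R) pairs bracketing at least one instance of every seed.

    For each choice of one occurrence per seed, L is the longest shared
    left context and R the longest shared right context (character level).
    """
    positions = [[m.start() for m in re.finditer(re.escape(s), doc)] for s in seeds]
    if any(not p for p in positions):
        return set()  # some seed never appears in this document
    wrappers = set()
    for combo in product(*positions):
        left = common_suffix([doc[:pos] for pos in combo])
        right = common_prefix([doc[pos + len(s):] for pos, s in zip(combo, seeds)])
        if left and right:
            wrappers.add((left, right))
    return wrappers


def apply_wrapper(doc, left, right):
    """Extract every string found between L and R in the same document."""
    # The lookahead leaves R unconsumed so it can overlap the next L.
    pattern = re.escape(left) + r"(.+?)(?=" + re.escape(right) + r")"
    return [m.group(1) for m in re.finditer(pattern, doc, flags=re.DOTALL)]
```

Because the contexts are plain character strings, nothing here depends on tokenization or on the markup language of the page, which mirrors the language-independence argument made for the Extractor.
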
The Ranker
  Rank candidate entity mentions based on “similarity” to seeds
    Noisy mentions should be ranked lower
  We compare two methods for ranking
    1. Extracted Frequency (EF)
       Number of times an entity mention is extracted
    2. Random Graph Walk (GW)
       Probability of an “entity mention” node being reached in a graph (explained on the next slide)

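The Extracted Frequency ranker amounts to counting. A minimal sketch, assuming each wrapper's output is available as a list of mention strings (the function name and input shape are illustrative, not taken from SEAL):

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_by_extracted_frequency(
    extractions: Iterable[Iterable[str]],
) -> List[Tuple[str, int]]:
    """Rank mentions by how many times they were extracted across all wrappers."""
    counts = Counter(mention for mentions in extractions for mention in mentions)
    # most_common() sorts by count, highest first.
    return counts.most_common()
```
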
Building a Graph
  A graph consists of a fixed set of…
    Node Types: {seeds, document, wrapper, mention}
    Labeled Directed Edges: {find, derive, extract}
      Each edge asserts that a binary relation $r$ holds
      Each edge has an inverse relation $r^{-1}$ (graph is cyclic)
  Example graph (figure):
    Seeds: “ford”, “nissan”, “toyota”
    Documents: curryauto.com, northpointcars.com
    Wrappers: Wrapper #1, Wrapper #2, Wrapper #3, Wrapper #4
    Mentions: “honda” 26.1%, “acura” 34.6%, “chevrolet” 22.5%, “bmw pittsburgh” 8.4%, “volvo chicago” 8.4%
    Edge labels: find, derive, extract
  Minkov et al. Contextual Search and Name Disambiguation in Email Using Graphs. SIGIR 2006

Random Graph Walk
Legend
  Node: x, y, z (e.g. “curryauto.com”, “wrapper #1”, “honda”, “acura”, …)
  Edge Relation: r (find, find^-1, derive, derive^-1, extract, extract^-1)
  An edge from x to y with relation r: x --r--> y
  Stop Probability: λ, the probability of staying at a node (0.5)
Quantities used in the walk
  Probability of picking an edge relation r given a source node x
  Probability of picking a target node y given an edge relation r and source node x
  Probability of reaching any node z from x, computed recursively as the sum of
    the probability of staying at node x
    the probability of continuing to node z from x

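The recursive computation can be approximated by pushing probability mass through the graph for a fixed number of steps. The sketch below is illustrative only: the slides do not give the exact distributions, so choosing relations uniformly, choosing targets uniformly, truncating at `max_steps`, and the class and function names are all assumptions. Only the stop probability of 0.5 comes from the slide.

```python
from collections import defaultdict


class Graph:
    """Labeled directed graph; every edge also gets an inverse edge."""

    def __init__(self):
        # node -> relation -> list of target nodes
        self.edges = defaultdict(lambda: defaultdict(list))

    def add_edge(self, source, relation, target):
        self.edges[source][relation].append(target)
        self.edges[target][relation + "^-1"].append(source)


def graph_walk(graph, start, stop_prob=0.5, max_steps=10):
    """Approximate the probability of the walk stopping at each node.

    At every step the walker stays (stops) with probability stop_prob;
    otherwise it picks an outgoing relation uniformly (assumption), then a
    target node under that relation uniformly (assumption).
    """
    reached = defaultdict(float)
    mass = {start: 1.0}
    for _ in range(max_steps):
        next_mass = defaultdict(float)
        for node, m in mass.items():
            reached[node] += stop_prob * m
            relations = graph.edges.get(node)
            if not relations:
                continue  # dead end: the remaining mass is dropped
            moving = (1.0 - stop_prob) * m
            for targets in relations.values():
                share = moving / len(relations) / len(targets)
                for target in targets:
                    next_mass[target] += share
        mass = next_mass
    return dict(reached)
```

A usage example with the node types from the previous slide: add `find` edges from the seeds node to documents, `derive` edges from documents to wrappers, and `extract` edges from wrappers to mentions, then call `graph_walk(graph, "seeds")` and sort the mention nodes by their scores.
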
Evaluation Datasets

Evaluation Method
Mean Average Precision (MAP)
  Commonly used for evaluating ranked lists in IR
  Contains both recall- and precision-oriented aspects
  Sensitive to the entire ranking
  Mean of the average precisions of the ranked lists
  In the average precision of a ranked list:
    L = ranked list of extracted mentions, r = rank
    Prec(r) = precision at rank r
    Conditions on the mention at rank r:
      (a) The extracted mention at r matches any true mention
      (b) There exists no other extracted mention at rank less than r that is of the same entity as the one at r
    # True Entities = total number of true entities in this dataset
Evaluation Procedure (per dataset)
  1. Randomly select three true entities and use their first listed mentions as seeds
  2. Expand the three seeds obtained from step 1
  3. Repeat steps 1 and 2 five times
  4. Compute MAP for the five ranked lists

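One plausible reading of these definitions, sketched in code: sum Prec(r) over the ranks where conditions (a) and (b) both hold, then divide by # True Entities. That combination, the choice to compute Prec(r) as distinct true entities found in the top r divided by r, and the `mention_to_entity` lookup table are assumptions for illustration; the slide text does not spell out the formula.

```python
from typing import Dict, List


def average_precision(
    ranked: List[str],
    mention_to_entity: Dict[str, str],  # hypothetical: true mention -> entity id
    num_true_entities: int,
) -> float:
    """Average precision of one ranked list of extracted mentions."""
    seen_entities = set()
    total = 0.0
    for rank, mention in enumerate(ranked, start=1):
        entity = mention_to_entity.get(mention)   # (a) matches a true mention
        if entity is None or entity in seen_entities:
            continue                              # (b) entity already credited
        seen_entities.add(entity)
        # Assumed Prec(r): distinct true entities in the top r, divided by r.
        total += len(seen_entities) / rank
    return total / num_true_entities


def mean_average_precision(average_precisions: List[float]) -> float:
    """MAP: the mean of the per-list average precisions."""
    return sum(average_precisions) / len(average_precisions)
```

Under this reading, a second mention of an already-found entity earns no extra credit, so a list cannot inflate its score by repeating variants of the same name.
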
Experimental Results
Legend: [Extractor] + [Ranker] + [Top N URLs]
  Extractor = { E1: Extractor E1, E2: Extractor E2 }
  Ranker = { EF: Extracted Frequency, GW: Graph Walk }
  N = { 100, 200, 300 }

Conclusion & Future Work
Conclusion
  Unsupervised approach for expanding sets of named entities
    Domain and language independent
  SEAL performs better than Google Sets
    Higher Mean Average Precision on our datasets
    Handles not only English, but also Chinese and Japanese
Future Work
  Learn from graphs to re-rank extracted mentions
  Bootstrap named entities by using extracted mentions from a previous expansion as seeds
  Identify possible class names for expanded sets
    e.g. car makers, constellations, presidents…

Top three mentions are the seeds
Try it out at