Automatic Discovery of Useful Facet Terms
Wisam Dakka (Columbia University), Rishabh Dayal (Columbia University), Panagiotis G. Ipeirotis (NYU)
Searching the NYT Archive for Book Research
Motivation: News Archive
- Accessing and searching a news archive is not an easy task
  - Researchers and reporters spend a large amount of time going through long query results
  - News archives are huge and span tens of years
  - Many results are relevant
  - Results on the first page are not more relevant than results on the 5th or the 10th page (NYT archive)
- Search engines for news archives mainly follow one paradigm: search, skim through long results, modify the query, and search again
- Goal: Multifaceted Interfaces (MI) over the news archive of Newsblaster
- Newsblaster archive
  - About 6 years of news from 24 news sources
  - Stories are clustered daily into hierarchies of topics and events
  - Events are threaded over time, summarized, and classified
Motivation: MI for Newsblaster Archive
- Our multifaceted interfaces work [CIKM2005] has some limitations:
  - Supervised learning: the algorithm can identify only facets that appear in the training set
  - WordNet hypernyms: WordNet has rather poor coverage of named entities
  - Free text collections: the quality of the hierarchies built on top of news stories was low
Challenge: Automatic Extraction of the Useful Facets from a News Archive
- Automatically discover, in an unsupervised manner, a set of candidate facet terms from free text
- Automatically group together facet terms that belong to the same facet
- Build the appropriate browsing structure for each facet
Intuition: Look for Facet Terms Elsewhere
- Pilot study on stories from The New York Times
  - Common facets: Location, Institutes, History, People, Social Phenomenon, Markets, Nature, and Event
  - Sub-facets: Leaders under People, Corporations under Markets
- Clear phenomenon: the terms for the useful facets do not usually appear in the news stories
  - A journalist writing a story about Jacques Chirac will not necessarily use the terms Political Leader, Europe, or France
  - Such missing terms are tremendously useful for identifying the appropriate facets for the story
- We will look for these terms elsewhere: terms that are infrequent in the original collection but frequent in the expanded documents
Context-Aware Expansion
- Example story text: "Murkowski made the announcement three days after BP said it would shut down a Prudhoe Bay oil field after a small leak was found. Energy officials have said pipeline repairs are likely to take months, curtailing Alaskan production into next year."
- Diagram labels: Wikipedia, WordNet, Google, Named Entities, Yahoo Term Extractor
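
The expansion step on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `TOY_RESOURCE`, `lookup_toy_resource`, and `expand_document` are hypothetical names, and the in-memory dictionary stands in for a real query to an external resource such as Wikipedia or WordNet. Its single entry reuses the Jacques Chirac example from the slides.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-in for an external resource (Wikipedia, WordNet, ...).
# The entry mirrors the Jacques Chirac example from the slides; it is not
# real query output.
TOY_RESOURCE: Dict[str, List[str]] = {
    "Jacques Chirac": ["Political Leader", "Europe", "France"],
}


def lookup_toy_resource(term: str) -> List[str]:
    """Return the context terms the toy resource associates with a term."""
    return TOY_RESOURCE.get(term, [])


def expand_document(
    important_terms: Iterable[str],
    resources: Iterable[Callable[[str], List[str]]],
) -> List[str]:
    """Return the important terms plus the context terms from every resource.

    Order is preserved and duplicates are dropped.
    """
    originals = list(important_terms)
    lookups = list(resources)
    expanded: List[str] = []
    seen = set()
    # Keep the original terms first.
    for term in originals:
        if term not in seen:
            seen.add(term)
            expanded.append(term)
    # Append whatever each resource returns for each original term.
    for term in originals:
        for lookup in lookups:
            for context_term in lookup(term):
                if context_term not in seen:
                    seen.add(context_term)
                    expanded.append(context_term)
    return expanded


story_terms = ["Jacques Chirac"]
expanded_terms = expand_document(story_terms, [lookup_toy_resource])
print(expanded_terms)
# ['Jacques Chirac', 'Political Leader', 'Europe', 'France']
```

Passing the resources as a list of callables is a design choice of this sketch: it lets several lookups (one per external resource) be chained without changing the expansion loop.
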
Useful Facet Terms are Elsewhere
[Figure: infrequent terms t_i shown in the Original Collection and in the Context-aware Collection]
Term Frequency Analysis
- Frequency-based shifting
  - Due to the Zipfian nature of term frequencies, we favor terms that already have high frequencies (inverse problem)
- Rank-shifting
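
The slide names the two shift measures without defining them, so the following is a minimal sketch under assumed definitions: the frequency shift is taken as the document frequency in the expanded collection minus the document frequency in the original collection, and the rank shift as the rank in the original collection minus the rank in the expanded collection (rank 1 is the most frequent term). The rule that a candidate needs both shifts to be positive is also an assumption of this sketch. All function names and the toy documents are illustrative and made up.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def document_frequency(docs: List[Set[str]]) -> Counter:
    """Count in how many documents each term appears."""
    df: Counter = Counter()
    for doc in docs:
        df.update(doc)  # each doc is a set, so a term counts once per doc
    return df


def ranks(df: Counter) -> Dict[str, int]:
    """Rank 1 is the most frequent term; ties are broken alphabetically."""
    ordered = sorted(df, key=lambda t: (-df[t], t))
    return {term: i + 1 for i, term in enumerate(ordered)}


def shifts(
    original: List[Set[str]], expanded: List[Set[str]]
) -> Dict[str, Tuple[int, int]]:
    """Map each term to (frequency shift, rank shift) under the assumed definitions."""
    df_o = document_frequency(original)
    df_e = document_frequency(expanded)
    rank_o, rank_e = ranks(df_o), ranks(df_e)
    worst_rank = len(df_o) + 1  # assumed rank for terms absent from the original
    result = {}
    for term in df_e:
        freq_shift = df_e[term] - df_o.get(term, 0)
        rank_shift = rank_o.get(term, worst_rank) - rank_e[term]
        result[term] = (freq_shift, rank_shift)
    return result


# Made-up toy collections: three original stories and their expanded versions.
original_docs = [{"chirac", "summit"}, {"chirac", "election"}, {"bp", "pipeline"}]
expanded_docs = [
    {"chirac", "summit", "france", "political leader"},
    {"chirac", "election", "france", "political leader"},
    {"bp", "pipeline", "alaska"},
]

term_shifts = shifts(original_docs, expanded_docs)
candidates = sorted(t for t, (f, r) in term_shifts.items() if f > 0 and r > 0)
print(term_shifts["france"])  # (2, 4)
print(candidates)             # ['alaska', 'france', 'political leader']
```

In this toy run, "chirac" is frequent in both collections and gets a shift of (0, 0), while the terms added by expansion move up in both frequency and rank.
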
Summary: Candidate Facet Terms
- For each document in the database, identify the important terms that characterize the contents of the document
- For each term in the original database, query the external resource and retrieve the terms that appear in the results
- Add the retrieved terms to the original document to create an expanded, "context-aware" document
- Analyze the frequency of the terms in both the original and the expanded database to identify the candidate facet terms
Indicative
Research in Progress
- Cleaning and filtering
- Grouping similar facet terms under one facet
- Evaluation
  - The resulting candidate terms
  - The resulting hierarchies