Automatic Discovery of Useful Facet Terms Wisam Dakka – Columbia University Rishabh Dayal – Columbia University Panagiotis G. Ipeirotis – NYU.


Searching the NYT Archive for Book Research

Motivation: News Archive

Accessing and searching a news archive is not an easy task.
- Researchers and reporters spend a large amount of time going through long query results.
- News archives are huge and span decades.
- Many results are relevant: results on the first page are not more relevant than results on the 5th or 10th page (NYT archive).
- Search engines over news archives mainly follow one paradigm: search, skim through long results, modify the query, and search again.

Goal: Multifaceted Interfaces (MI) over the news archive of Newsblaster

Newsblaster archive
- About 6 years of news from 24 news sources
- Stories are clustered daily into hierarchies of topics and events
- Events are threaded over time, summarized, and classified

Motivation: MI for Newsblaster Archive

Our earlier multifaceted interfaces work [CIKM2005] has some limitations:
- Supervised learning: the algorithm can identify only facets that appear in the training set.
- WordNet hypernyms: WordNet has rather poor coverage of named entities.
- Free-text collections: the quality of the hierarchies built on top of news stories was low.

Challenge: Automatic Extraction of the Useful Facets from News Archive
- Automatically discover, in an unsupervised manner, a set of candidate facet terms from free text
- Automatically group together facet terms that belong to the same facet
- Build the appropriate browsing structure for each facet

Intuition: Look for Facet Terms Elsewhere

Pilot study: stories from The NYTimes
- Common facets: Location, Institutes, History, People, Social Phenomenon, Markets, Nature, and Event
- Sub-facets: Leaders under People, Corporations under Markets

Clear phenomenon: the terms for the useful facets do not usually appear in the news stories.
- A journalist writing a story about Jacques Chirac will not necessarily use the terms Political Leader, Europe, or France. Such missing terms are tremendously useful for identifying the appropriate facets for the story.

We will look for these terms elsewhere.
- Look for terms that are infrequent in the original collection but frequent in the expanded documents.

Context-Aware Expansion

Example story text: "Murkowski made the announcement three days after BP said it would shut down a Prudhoe Bay oil field after a small leak was found. Energy officials have said pipeline repairs are likely to take months, curtailing Alaskan production into next year."

[Figure: the example text is expanded step by step with Wiki Text, WordNet Text, and Google Text. Other labels in the figure: Wikipedia, WordNet, Google, Named Entities, Yahoo Term Extractor.]

Useful Facet Terms are Elsewhere

[Figure labels: Infrequent Terms; Original Collection; Context-aware Collection; term t_i]

Term Frequency Analysis
- Frequency-based shifting
  - Due to the Zipfian nature of term frequencies, this measure favors terms that already have high frequencies (inverse problem)
- Rank-shifting
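The slide names the two measures but gives no formulas, so the following is only a minimal sketch under stated assumptions: frequency shift is taken as the expanded-collection document frequency minus the original one, and rank shift as the number of rank positions a term climbs after expansion. The function names and the toy counts are hypothetical and chosen purely for illustration.

```python
from typing import Dict


def frequency_shift(df_orig: Dict[str, int], df_exp: Dict[str, int]) -> Dict[str, int]:
    """Assumed definition: df in the expanded collection minus df in the original."""
    terms = set(df_orig) | set(df_exp)
    return {t: df_exp.get(t, 0) - df_orig.get(t, 0) for t in terms}


def ranks(df: Dict[str, int], terms: set) -> Dict[str, int]:
    """Rank 1 is the most frequent term; ties are broken alphabetically."""
    ordered = sorted(terms, key=lambda t: (-df.get(t, 0), t))
    return {t: i + 1 for i, t in enumerate(ordered)}


def rank_shift(df_orig: Dict[str, int], df_exp: Dict[str, int]) -> Dict[str, int]:
    """Assumed definition: positions climbed after expansion (positive = moved up)."""
    terms = set(df_orig) | set(df_exp)
    r_orig, r_exp = ranks(df_orig, terms), ranks(df_exp, terms)
    return {t: r_orig[t] - r_exp[t] for t in terms}


# Toy document frequencies (made up for illustration only).
df_original = {"said": 900, "oil": 120, "alaska": 40, "political leader": 2}
df_expanded = {"said": 1300, "oil": 150, "alaska": 60, "political leader": 300}

f_shift = frequency_shift(df_original, df_expanded)
r_shift = rank_shift(df_original, df_expanded)

top_by_frequency = max(f_shift, key=f_shift.get)
top_by_rank = max(r_shift, key=r_shift.get)
```

In this toy data the already-frequent term "said" gains the most raw frequency, while "political leader" is the only term that climbs in rank, which is the kind of bias the slide attributes to the Zipfian distribution.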

Summary: Candidate Facet Terms
1. For each document in the database, identify the important terms that are useful to characterize the contents of the document.
2. For each important term in the original database, query the external resource and retrieve the terms that appear in the results. Add the retrieved terms to the original document, in order to create an expanded, "context-aware" document.
3. Analyze the frequency of the terms in both the original and the expanded database, and identify the candidate facet terms.
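The three steps above can be sketched end to end as follows. This is an illustrative sketch, not the authors' implementation: the `CONTEXT` table is a hypothetical stand-in for a real external resource such as Wikipedia, WordNet, or a web search; important terms are found by naive substring matching against a fixed vocabulary; and candidates are selected by a simple document-frequency difference with an assumed threshold `min_shift`.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set

# Hypothetical stand-in for an external resource; the entries are made up.
CONTEXT = {
    "jacques chirac": ["political leader", "france", "europe"],
    "bp": ["corporation", "energy"],
    "prudhoe bay": ["alaska", "oil field"],
    "murkowski": ["political leader", "alaska"],
}


def important_terms(doc: str, vocabulary: Iterable[str]) -> Set[str]:
    """Step 1 (naive): keep vocabulary terms that occur in the document text."""
    text = doc.lower()
    return {t for t in vocabulary if t in text}


def expand(terms: Set[str], lookup: Callable[[str], List[str]]) -> Set[str]:
    """Step 2: add the terms returned by the external resource for each term."""
    expanded = set(terms)
    for t in terms:
        expanded.update(lookup(t))
    return expanded


def document_frequency(docs: Iterable[Set[str]]) -> Counter:
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for terms in docs:
        df.update(terms)  # each document is a set, so a term counts once
    return df


def candidate_facet_terms(docs, vocabulary, lookup, min_shift=2) -> List[str]:
    """Step 3: keep terms whose document frequency grows by at least min_shift."""
    original = [important_terms(d, vocabulary) for d in docs]
    expanded = [expand(terms, lookup) for terms in original]
    df_o, df_e = document_frequency(original), document_frequency(expanded)
    shift = {t: df_e[t] - df_o[t] for t in df_e}
    return sorted((t for t, s in shift.items() if s >= min_shift),
                  key=lambda t: (-shift[t], t))


docs = [
    "Jacques Chirac met reporters in Paris.",
    "Murkowski made the announcement after BP said it would shut down "
    "a Prudhoe Bay oil field.",
    "BP shares fell.",
]
candidates = candidate_facet_terms(docs, CONTEXT.keys(),
                                   lambda t: CONTEXT.get(t, []))
```

None of the selected terms appears in the original stories, which matches the intuition that useful facet terms have to be found outside the collection itself.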

Indicative

Research in Progress
- Cleaning and filtering
- Grouping similar facet terms under one facet
- Evaluation
  - The resulting candidate terms
  - The resulting hierarchies