Automatic Set Expansion for List Question Answering. Richard C. Wang, Nico Schlaefer, William W. Cohen, and Eric Nyberg. Language Technologies Institute, Carnegie Mellon University.

Slide 1 / 30: Automatic Set Expansion for List Question Answering
Richard C. Wang, Nico Schlaefer, William W. Cohen, and Eric Nyberg
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA

Language Technologies Institute, Carnegie Mellon University. Richard C. Wang. Set Expansion for List Question Answering.

Slide 2 / 30: Task
Automatically improve answers generated by Question Answering systems for list questions, by using a Set Expansion system. For example: Name cities that have Starbucks.
- QA Answers: Boston, Seattle, Carnegie-Mellon, Aquafina, Google, Logitech
- Expanded Answers (better!): Seattle, Boston, Chicago, Pittsburgh, Carnegie-Mellon, Google

Slide 3 / 30: Outline
- Introduction: Question Answering; Set Expansion
- Proposed Approach: Aggressive Fetcher; Lenient Extractor; Hinted Expander
- Experimental Results: QA System: Ephyra; Other QA Systems
- Conclusion

Slide 4 / 30: Question Answering (QA)
Question Answering task: retrieve answers to natural language questions.
Different question types:
- Factoid questions
- List questions
- Definitional questions
- Opinion questions
Major QA evaluations:
- Text REtrieval Conference (TREC): English
- NTCIR: Japanese, Chinese
- CLEF: European languages

Slide 5 / 30: Typical QA Pipeline
Stages: Question Analysis -> Query Generation & Search -> Candidate Generation -> Answer Scoring, drawing on external knowledge sources. The question string is turned into an analyzed question, then search results, candidate answers, and finally scored answers.
Example: "Who invented the smiley?" Question analysis yields answer type Person and keywords invented, smiley. A retrieved passage reads: "The two original text smileys were invented on September 19, 1982 by Scott E. Fahlman..." Candidate answers: smileys; September 19, 1982; Scott E. Fahlman. Answer scoring ranks Scott E. Fahlman first (the slide's table shows a score of 0.418 for smileys).
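The four stages above can be sketched as composed functions. Every heuristic below is a toy stand-in invented for illustration; a real system such as Ephyra does far more at each stage.

```python
import re

def analyze(question):
    # Question Analysis: guess an answer type and pull keywords (toy rules)
    return {"type": "Person" if question.lower().startswith("who") else "Other",
            "keywords": [w.strip("?.,") for w in question.split()[1:]]}

def search(analyzed, knowledge):
    # Query Generation & Search: keyword match against the knowledge source
    analyzed["results"] = [
        p for p in knowledge
        if any(k.lower() in p.lower() for k in analyzed["keywords"])]
    return analyzed

def generate(analyzed):
    # Candidate Generation: capitalized token runs found in the results
    analyzed["candidates"] = sorted(
        {m.strip(" .") for p in analyzed["results"]
         for m in re.findall(r"(?:[A-Z][a-z]*\.?\s?)+", p)})
    return analyzed

def score(analyzed):
    # Answer Scoring: toy rule -- prefer multi-word candidates for Person questions
    key = (lambda c: len(c.split())) if analyzed["type"] == "Person" else len
    analyzed["scored"] = sorted(analyzed["candidates"], key=key, reverse=True)
    return analyzed

knowledge = ["The two original text smileys were invented on September 19, "
             "1982 by Scott E. Fahlman."]
out = score(generate(search(analyze("Who invented the smiley?"), knowledge)))
print(out["scored"][0])  # Scott E. Fahlman
```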

Slide 6 / 30: QA System: Ephyra (Schlaefer et al., TREC 2007)
History:
- Developed at the University of Karlsruhe, Germany, and Carnegie Mellon University, USA
- TREC participations in 2006 (13th out of 27 teams) and 2007 (7th out of 21 teams)
- Released as open source in 2008
Different candidate generators:
- Answer type classification
- Regular expression matching
- Semantic parsing
Available for download at:


Slide 8 / 30: Set Expansion (SE)
For example: given the query {"survivor", "amazing race"}, the answer is {"american idol", "big brother", ...}.
More formally: given a small number of seeds x_1, x_2, ..., x_k, where each x_i ∈ S_t, the answer is a listing of other probable elements e_1, e_2, ..., e_n, where each e_i ∈ S_t.
A well-known example of a web-based set expansion system is Google Sets.

Slide 9 / 30: SE System: SEAL (Wang & Cohen, ICDM 2007)
Features:
- Independent of human/markup language: supports seeds in English, Chinese, Japanese, Korean, ...; accepts documents in HTML, XML, SGML, TeX, WikiML, ...
- Does not require pre-annotated training data: utilizes a readily available corpus, the World Wide Web
Based on two research contributions:
- Automatically constructs wrappers for extracting candidate items
- Ranks extracted items using a random graph walk
Try it out for yourself:

Slide 10 / 30: SEAL's SE Pipeline
- Fetcher: downloads web pages from the Web
- Extractor: learns wrappers from web pages
- Ranker: ranks entities extracted by wrappers
Example expansion (camera brands): Canon, Nikon, Olympus, Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, ...
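The three stages can be sketched end to end. The bodies are deliberately simplified stand-ins: the real Fetcher queries a search engine, the Extractor learns character-level wrappers rather than using a fixed `<li>` pattern, and the Ranker performs a random graph walk rather than frequency counting.

```python
import re
from collections import Counter

def fetch(seeds, web):
    # Fetcher (toy): keep documents that mention every seed
    return [d for d in web if all(s in d for s in seeds)]

def extract(pages):
    # Extractor (toy): a fixed wrapper -- anything between <li> and </li>
    return [m for p in pages for m in re.findall(r"<li>(.*?)</li>", p)]

def rank(items, seeds):
    # Ranker (toy): frequency ranking of the non-seed items
    counts = Counter(i for i in items if i not in seeds)
    return [i for i, _ in counts.most_common()]

web = ["<ul><li>Canon</li><li>Nikon</li><li>Olympus</li><li>Sony</li></ul>",
       "<ul><li>Canon</li><li>Nikon</li><li>Pentax</li><li>Sony</li></ul>",
       "an unrelated page"]
seeds = ["Canon", "Nikon"]
out = rank(extract(fetch(seeds, web)), seeds)
print(out)  # Sony ranks first: it appears on both matching pages
```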

Slide 11 / 30: Challenge
SE systems require relevant (non-noisy) seeds, but answers produced by QA systems are often noisy. How can we integrate these two systems?
We propose three extensions to SEAL:
- Aggressive Fetcher
- Lenient Extractor
- Hinted Expander


Slide 13 / 30: Original Fetcher
Procedure:
1. Compose a search query by concatenating all seeds
2. Use Google to request the top 100 web pages
3. Fetch the web pages and send them to the Extractor
Seeds: Boston, Seattle, Carnegie-Mellon. Query: "Boston Seattle Carnegie-Mellon".

Slide 14 / 30: Proposed Fetcher
Aggressive Fetcher (AF): sends a two-seed query for every possible pair of seeds to the search engines, making it more likely to compose queries containing only relevant seeds.
Seeds: Boston, Seattle, Carnegie-Mellon. Queries: "Boston Seattle", "Boston Carnegie-Mellon", "Seattle Carnegie-Mellon".
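The difference between the two fetchers comes down to one line over seed pairs. A minimal sketch:

```python
from itertools import combinations

def original_query(seeds):
    # Original Fetcher: a single query concatenating every seed
    return [" ".join(seeds)]

def aggressive_queries(seeds):
    # Aggressive Fetcher: one two-seed query per pair of seeds, so a single
    # noisy seed can only contaminate the pairs it takes part in
    return [" ".join(pair) for pair in combinations(seeds, 2)]

seeds = ["Boston", "Seattle", "Carnegie-Mellon"]
print(original_query(seeds))      # ['Boston Seattle Carnegie-Mellon']
print(aggressive_queries(seeds))  # ['Boston Seattle', 'Boston Carnegie-Mellon',
                                  #  'Seattle Carnegie-Mellon']
```

For n seeds this issues n(n-1)/2 queries instead of one, trading extra search traffic for robustness to noisy seeds.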


Slide 16 / 30: Original Extractor
A wrapper is a pair of left (L) and right (R) context strings: maximally long contextual strings that bracket at least one instance of every seed. The wrapper extracts the strings that appear between L and R.
Wrappers are learned from web pages and seeds on the fly, exploiting semi-structured documents. They are defined at the character level, so no tokenization is required (language-independent); however, they are very page-specific (page-dependent).


Slide 18 / 30: Proposed Extractor
Lenient Extractor (LE): maximally long contextual strings that bracket at least one instance of a minimum of two seeds, making it more likely to find useful contexts that bracket only relevant seeds.
Text: "... in Boston City Hall in Seattle City Hall at Boston University at Seattle University at Carnegie-Mellon University ..."
Learned wrapper without LE: "at ... University"
Learned wrappers with LE: "at ... University" and "in ... City Hall"
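The wrapper-learning step can be sketched at the character level. This toy version derives candidate (left, right) contexts from pairs of seed occurrences, keeps wrappers that bracket at least `min_seeds` distinct seeds, and among nested candidates prefers the one covering more seeds; on the slide's example text it reproduces the table above. It is a simplification of SEAL's actual algorithm, not the real implementation.

```python
import re
from itertools import combinations

def common_suffix(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return a[len(a) - n:]

def common_prefix(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]

def learn_wrappers(text, seeds, min_seeds=2):
    # Character-level left/right context of every seed occurrence
    ctx = [(text[:m.start()], text[m.end():])
           for s in seeds for m in re.finditer(re.escape(s), text)]
    # Candidate wrappers: longest contexts shared by any two occurrences
    cands = set()
    for (l1, r1), (l2, r2) in combinations(ctx, 2):
        left, right = common_suffix(l1, l2), common_prefix(r1, r2)
        if left.strip() and right.strip():  # skip whitespace-only contexts
            cands.add((left, right))
    # Keep wrappers bracketing at least min_seeds distinct seeds
    def bracketed(w):
        return {s for s in seeds if w[0] + s + w[1] in text}
    kept = {w: bracketed(w) for w in cands if len(bracketed(w)) >= min_seeds}
    # Among nested wrappers, prefer the one bracketing more seeds
    def extends(a, b):
        return a != b and a[0].endswith(b[0]) and a[1].startswith(b[1])
    return {w for w in kept
            if not any(extends(w, o) and len(kept[o]) > len(kept[w])
                       for o in kept)}

text = ("... in Boston City Hall in Seattle City Hall at Boston University "
        "at Seattle University at Carnegie-Mellon University ...")
seeds = ["Boston", "Seattle", "Carnegie-Mellon"]
print(learn_wrappers(text, seeds, min_seeds=3))  # {(' at ', ' University ')}
print(learn_wrappers(text, seeds, min_seeds=2))
# the lenient setting adds the wrapper (' in ', ' City Hall ')
```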


Slide 20 / 30: Hinted Expander (HE)
Utilizes contexts in the question to constrain SEAL's search space on the Web: extract up to three keywords from the question using Ephyra's keyword extractor, and append the keywords to the search query.
Example: Name cities that have Starbucks.
This makes it more likely to find documents containing the desired set of answers.
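As a sketch, the hint step just concatenates question keywords onto the seed query. The stopword list below is a made-up stand-in for Ephyra's keyword extractor, which is more sophisticated.

```python
# Hypothetical stopword list, for illustration only
STOPWORDS = {"name", "that", "have", "the", "which", "who", "what", "of"}

def question_keywords(question, limit=3):
    # Keep up to `limit` content words from the question
    words = [w.strip("?.,").lower() for w in question.split()]
    return [w for w in words if w not in STOPWORDS][:limit]

def hinted_query(seed_query, question):
    # Hinted Expander: append the question keywords to the seed query
    return seed_query + " " + " ".join(question_keywords(question))

print(hinted_query("Boston Seattle", "Name cities that have Starbucks."))
# Boston Seattle cities starbucks
```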


Slide 22 / 30: Experiment #1: Ephyra
Evaluate on the TREC 13, 14, and 15 datasets (55, 93, and 89 list questions, respectively). Use SEAL to expand the top four answers from Ephyra; SEAL outputs a list of answers ranked by confidence scores.
For each dataset, we report:
- Mean Average Precision (MAP): the mean of the average precision of each ranked list
- Average F1 with an optimal per-question threshold: for each question, the ranked list is cut off at the threshold that maximizes the F1 score for that particular question
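Both measures can be written down directly; the ranked list and gold set below are made-up for illustration. MAP is then the mean of `average_precision` over all questions.

```python
def average_precision(ranked, relevant):
    # Precision at each relevant hit, averaged over all relevant answers
    hits, total = 0, 0.0
    for i, a in enumerate(ranked, start=1):
        if a in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def best_f1(ranked, relevant):
    # Cut the ranked list at every possible depth; keep the best F1
    best = 0.0
    for k in range(1, len(ranked) + 1):
        tp = len(set(ranked[:k]) & relevant)
        p, r = tp / k, tp / len(relevant)
        if p + r:
            best = max(best, 2 * p * r / (p + r))
    return best

ranked = ["Seattle", "Boston", "Aquafina", "Chicago"]
relevant = {"Seattle", "Boston", "Chicago", "Pittsburgh"}
print(round(average_precision(ranked, relevant), 4))  # 0.6875
print(round(best_f1(ranked, relevant), 2))            # 0.75 (cut after rank 4)
```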

Slide 23 / 30: Experiment #1: Ephyra (results table)

Slide 24 / 30: Experiment #2: Ephyra
In practice, thresholds are unknown. For each dataset, we perform 5-fold cross-validation: train by finding one optimal threshold on four folds, then test by using that threshold to evaluate the fifth fold.
We introduce a fourth dataset, All: the union of TREC 13, 14, and 15.
We also introduce another system, Hybrid: the intersection of the original answers from Ephyra and the expanded answers from SEAL.
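The threshold protocol and the Hybrid intersection can be sketched as follows; the question data here is synthetic, with each question represented as (confidence-scored answers, gold answer set).

```python
def f1(pred, gold):
    tp = len(set(pred) & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def f1_at(question, t):
    # Keep a question's answers whose confidence is at least t
    scored, gold = question
    return f1([a for a, c in scored if c >= t], gold)

def train_threshold(questions):
    # One threshold maximizing total F1 over the training questions
    cands = sorted({c for scored, _ in questions for _, c in scored})
    return max(cands, key=lambda t: sum(f1_at(q, t) for q in questions))

def cross_validate(questions, k=5):
    # Tune the threshold on k-1 folds, evaluate on the held-out fold
    scores = []
    for i in range(k):
        test = questions[i::k]
        train = [q for j, q in enumerate(questions) if j % k != i]
        t = train_threshold(train)
        scores.extend(f1_at(q, t) for q in test)
    return sum(scores) / len(scores)

def hybrid(qa_answers, expanded):
    # Hybrid system: original QA answers intersected with SEAL's expansion
    return [a for a in expanded if a in set(qa_answers)]

questions = [([("x", 0.9), ("y", 0.5), ("noise", 0.1)], {"x", "y"})] * 10
print(cross_validate(questions))  # 1.0: the tuned threshold 0.5 drops the noise
print(hybrid(["Boston", "Aquafina"], ["Seattle", "Boston", "Chicago"]))  # ['Boston']
```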

Slide 25 / 30: Experiment #2: Ephyra (results table)


Slide 27 / 30: Experiment: Other QA Systems
We use the top five QA systems on list questions in the TREC 15 evaluation:
1. Language Computer Corporation (lccPA06)
2. The Chinese University of Hong Kong (cuhkqaepisto)
3. National University of Singapore (NUSCHUAQA1)
4. Fudan University (FDUQAT15A)
5. National Security Agency (QACTIS06C)
For each QA system, we train thresholds for SEAL and Hybrid on the union of TREC 13 and 14, then expand the system's top four answers on TREC 15 and apply the trained threshold.

Slide 28 / 30: Experiment: Top QA Systems (results table)

Slide 29 / 30: Conclusion
We presented a feasible method for integrating a set expansion approach into any QA system.
The proposed SE approach is effective: it improves QA systems on list questions using only a few top answers as seeds.
The proposed hybrid system is effective: it improves Ephyra and most of the top five QA systems.

Slide 30 / 30: Thank You!