Hang Cui, Min-Yen Kan and Tat-Seng Chua: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Presentation transcript:

1/28 Unsupervised Learning of Soft Patterns for Generating Definitions from Online News
Hang Cui, Min-Yen Kan, Tat-Seng Chua
{cuihang, kanmy, comp.nus.edu.sg
School of Computing, NUS, Singapore

2/28 Problem
To answer “Who is Bob Woodward” and “What is SARS” questions.
–A large portion of queries in search logs (Voorhees 2001).
Where to get definitions
–Dictionaries, encyclopedias, online glossaries ……
–Online news – “new terms” (e.g. Sasser)
In this paper, we
–deal with recently popular terms and people.
–identify definition sentences from online news.
–distill search engine results to definitions.

3/28 “In the News” from Google (Apr 23, 2004)
In the News: Bob Woodward, SARS, Vietnam War, Yasser Arafat, George W. Bush, Marine Corps, Gaza Strip, Kofi Annan, Mitsubishi Motors, Alan Greenspan, First Quarter, Maurice Clarett

4/28 “In the News” from Google (Apr 23, 2004)
A list of relevant documents rather than a direct answer

5/28 Our Solution – DefSearch
Bob Woodward
–Woodward, an Office of Naval Intelligence (ONI) asset, interviewed over 75 Bush Cabal insiders. (CNN)
–Woodward, who had previously endeared himself to the Bush Administration with his pandering portrait of the President in "Bush at War", has launched a blistering assault on White House credibility with his new book, "Plan of Attack". (NY Times)
–People close to Mr. Powell said Sunday that they had no doubt he would weather any criticism from within over his apparent cooperation with Mr. Woodward, an assistant managing editor at The Washington Post. (CNN)
–The book, called Plan of Attack, is written by Bob Woodward, the respected journalist who helped break open the Watergate scandal. The book is based on interviews with 75 people, including Bush, and is due for release Tuesday. (REUTERS)
–Bob Woodward, the famous Watergate reporter has interviewed President Bush and other Whitehouse "insiders". As a result of the interview, Woodward might have done more damage to the Presidents re-election cause than anyone since Richard Clarkes interview on the same program and the recent events in Spain might be an indication as to how the world is beginning to view President Bush. (ABC News)

6/28 Behind DefSearch

7/28 Outline
How Do Current Systems Identify Definitions?
What are Soft Patterns?
Matching Soft Patterns
Unsupervised Learning of Soft Patterns
Evaluations
Conclusion and Future Work

8/28 How Do Current Systems Identify Definitions?
Most current systems use hand-crafted patterns
–Appositive, e.g. Gunter Blobel, a cellular and molecular biologist, …
–Copulas, e.g. Battery is a kind of electronic device …
–Predicates (relations), e.g. TB is usually caused by …
Current work on definition sentence identification
–Domain-specific definition generation systems, e.g. topic-specific definitions on the Web and biographies.
–Definitional QA Task at TREC 2003

9/28 Weaknesses of Current Pattern Matching Methods
Lack of Flexibility – Hard Matching
–Pattern: , also known as
  TB, also known as Tuberculosis, …
  TB ( also known as Tuberculosis ) … (mismatch)
–Variations make hard matching fail
–Introduce Soft Patterns with greater flexibility
Manual labor
–Introduce unsupervised learning by Group Pseudo-Relevance Feedback (GPRF).
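
The failure of hard matching on small variations can be reproduced with a regular expression. This is a minimal sketch rather than any system described in the talk; the function name `hard_match` and the way the pattern is built from the search term are assumptions made for illustration.

```python
import re

def hard_match(term: str, sentence: str) -> bool:
    """Hypothetical hard pattern: the search term followed literally by ', also known as'."""
    pattern = re.escape(term) + r", also known as\b"
    return re.search(pattern, sentence) is not None

matched = hard_match("TB", "TB, also known as Tuberculosis, is an infection.")
missed = hard_match("TB", "TB ( also known as Tuberculosis ) is an infection.")
print(matched, missed)
```

The second sentence carries the same definitional cue, but the parenthesis in place of the comma is enough to defeat the literal pattern.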

10/28 Outline
How Do Current Systems Identify Definitions?
What are Soft Patterns?
Matching Soft Patterns
Unsupervised Learning of Soft Patterns
Evaluations
Conclusion and Future Work

11/28 What are Soft Patterns?
Soft patterns allow partial matching
–TB ( also known as Tuberculosis ) …
–P("(" | Slot 1) = 0.001, P("also" | Slot 2) = 0.21, P("known" | Slot 3) = 0.33, P("as" | Slot 4) = 0.13
–P(Matching) = 0.23: still better than non-definition sentences.
How does it work?
–Training – accumulating pattern instances in a vector.
  Derive pattern instances from labeled definition sentences.
–Matching with a probabilistic model, not regular expressions.
  Using statistical information from all pattern instances, not generalized rules.
  Instance-based learning.

12/28 Preparing Pattern Instances
The channel Iqra is owned by the Arab Radio and Television company and is the brainchild of the Saudi millionaire, Saleh Kamel.
Step 1: POS tagging and noun phrase chunking.
The_DT channel_NN Iqra_NNP is_VBZ owned_VBN by_IN NNP company_NN and_CC is_VBZ the_DT brainchild_NN of_IN NNP.
Step 2: Selective substitution – replace those specific words with more general tags. Other tokens remain unchanged.
DT$ NN BE$ owned by DT$ NNP and BE$ DT$ NN of NNP.

13/28 Preparing Pattern Instances – Cont’d
Step 3: Crop a text window around the tag “ ” (window size = 3 for each side)
Pattern instance: DT$ NN BE$ owned by
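
Steps 2 and 3 can be sketched in a few lines, assuming the sentence has already been POS-tagged. The placeholder `<TERM>` for the search term, the list of "be" forms, and the substitution rules are assumptions chosen for illustration; the talk's own tag set and rules are only partly visible in this transcript.

```python
from typing import List, Tuple

TERM = "<TERM>"  # hypothetical placeholder standing in for the search term
BE_FORMS = {"is", "are", "was", "were", "be", "been", "am"}

def substitute(tagged: List[Tuple[str, str]], term: str) -> List[str]:
    """Step 2: replace specific words with general tags; keep other tokens."""
    out = []
    for word, pos in tagged:
        if word == term:
            out.append(TERM)
        elif word.lower() in BE_FORMS:
            out.append("BE$")
        elif pos == "DT":
            out.append("DT$")
        elif pos in {"NN", "NNS", "NNP", "NNPS"}:
            out.append(pos)
        else:
            out.append(word)
    return out

def crop(tokens: List[str], window: int = 3) -> List[str]:
    """Step 3: keep up to `window` tokens on each side of the search term."""
    i = tokens.index(TERM)
    return tokens[max(0, i - window): i + window + 1]

tagged = [("The", "DT"), ("channel", "NN"), ("Iqra", "NNP"), ("is", "VBZ"),
          ("owned", "VBN"), ("by", "IN"), ("the", "DT"), ("company", "NN")]
instance = crop(substitute(tagged, "Iqra"))
print(instance)
```

With "Iqra" as the search term, the cropped window covers the same tokens as the pattern instance on the slide, plus the placeholder.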

14/28 Illustration of Soft Pattern Generation
Definition sentences:
…… The channel Iqra is owned by the ……
…… severance packages, known as golden parachutes, included ……
A battery is a cell which can provide electricity.
Pattern instances:
DT$ NN BE$ owned by
known as , VB
BE$ DT$ ……
Token probabilities in the soft pattern vector: NN 0.12; NN 0.11; "," 0.40; DT$ 0.2; known 0.09; as 0.20; BE$ 0.2; VB 0.1; DT$ 0.04; owned 0.09
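
One way to read the illustration is that pattern instances are aligned on the search term and each slot keeps a distribution over the tokens seen there. The sketch below estimates P(token | slot) by relative frequency; the `<TERM>` placeholder, the toy instances, and the unsmoothed estimate are assumptions, not details confirmed by the slides.

```python
from collections import Counter, defaultdict
from typing import Dict, List

TERM = "<TERM>"  # hypothetical placeholder for the search term

def build_soft_pattern(instances: List[List[str]]) -> Dict[int, Dict[str, float]]:
    """Align instances on the search term and estimate P(token | slot)."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for inst in instances:
        center = inst.index(TERM)
        for pos, tok in enumerate(inst):
            if pos != center:
                counts[pos - center][tok] += 1  # slot = offset from the term
    return {slot: {tok: n / sum(c.values()) for tok, n in c.items()}
            for slot, c in counts.items()}

# Toy instances made up for this example.
instances = [
    ["DT$", "NN", TERM, "BE$", "owned", "by"],
    [TERM, ",", "known", "as"],
    ["DT$", TERM, "BE$", "DT$", "NN"],
]
pattern = build_soft_pattern(instances)
print(pattern[1])
```

Slot 1 (the token right after the term) ends up with "BE$" in two of three instances and "," in one, which is the kind of per-slot distribution the slide's numbers suggest.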

15/28 Outline
How Do Current Systems Identify Definitions?
What are Soft Patterns?
Matching Soft Patterns – Addressing Flexibility
Unsupervised Learning of Soft Patterns
Evaluations
Conclusion and Future Work

16/28 Matching Soft Patterns
Test sentences are reduced to a vector S using the same strategy.
Matching soft patterns – similarity between the pattern vector Pa and the test vector S.
–Independent slot content similarity.
–Slot sequence fidelity.

17/28 Probabilistic Matching Degree
Individual slot similarity – independence assumption
Sequence fidelity – bigram model
Combined to get the matching degree
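
The slide's formulas did not survive in this transcript, so the following is only a sketch of the two ingredients it names. It assumes slot similarity is a product of per-slot token probabilities, sequence fidelity is a product of bigram probabilities, and the two are combined by a plain product; the floor value for unseen events and the bigram numbers are made up. The slot probabilities reuse the figures from slide 11, but the result is not expected to reproduce that slide's P(Matching).

```python
from typing import Dict, List, Tuple

def slot_similarity(tokens_by_slot: Dict[int, str],
                    pattern: Dict[int, Dict[str, float]],
                    floor: float = 1e-3) -> float:
    """Product of P(token | slot), treating slots as independent."""
    p = 1.0
    for slot, tok in tokens_by_slot.items():
        p *= pattern.get(slot, {}).get(tok, floor)
    return p

def sequence_fidelity(tokens: List[str],
                      bigram: Dict[Tuple[str, str], float],
                      floor: float = 1e-3) -> float:
    """Product of bigram probabilities P(next | previous) along the sequence."""
    p = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        p *= bigram.get((prev, nxt), floor)
    return p

def matching_degree(slot_p: float, seq_p: float) -> float:
    # Assumption: the two factors are combined by a plain product.
    return slot_p * seq_p

pattern = {1: {"(": 0.001}, 2: {"also": 0.21}, 3: {"known": 0.33}, 4: {"as": 0.13}}
bigram = {("(", "also"): 0.1, ("also", "known"): 0.5, ("known", "as"): 0.8}  # made-up values
slots = {1: "(", 2: "also", 3: "known", 4: "as"}
s = slot_similarity(slots, pattern)
q = sequence_fidelity(["(", "also", "known", "as"], bigram)
print(matching_degree(s, q))
```

Because every slot contributes a nonzero probability, an unexpected token such as "(" lowers the score instead of rejecting the sentence outright.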

18/28 Outline
How Do Current Systems Identify Definitions?
What are Soft Patterns?
Matching Soft Patterns – Flexibility
Unsupervised Learning of Soft Patterns – Addressing Manual Labor
Evaluations
Conclusion and Future Work

19/28 Unsupervised Labeling of Definition Sentences using GPRF
Pattern instances are obtained from labeled definition sentences.
–Manual labeling is too expensive.
Pseudo-relevance feedback in document retrieval
–Take the top n ranked documents as relevant.
We employ group pseudo-relevance feedback (GPRF)
–Statistical ranking – centroid based method.
–Perform PRF over a group of questions (top 10 sentences for each question).
–Generate soft patterns from all auto-labeled sentences for all questions.
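
The pooling step of GPRF can be sketched as follows. The function name `gprf_label`, the `score` callback, and the toy word-overlap scorer are assumptions for illustration; the talk uses centroid based ranking as the statistical scorer.

```python
from typing import Callable, Dict, List

def gprf_label(candidates: Dict[str, List[str]],
               score: Callable[[str, str], float],
               top_n: int = 10) -> List[str]:
    """Treat the top_n ranked sentences of every question as definition sentences."""
    pooled: List[str] = []
    for question, sentences in candidates.items():
        ranked = sorted(sentences, key=lambda s: score(question, s), reverse=True)
        pooled.extend(ranked[:top_n])
    return pooled

def overlap(question: str, sentence: str) -> float:
    """Toy scorer (assumption): number of question words found in the sentence."""
    return float(len(set(question.lower().split()) & set(sentence.lower().split())))

candidates = {
    "SARS": ["SARS is a respiratory disease", "Markets fell today"],
    "Bob Woodward": ["Stocks rose", "Bob Woodward is a journalist"],
}
auto_labeled = gprf_label(candidates, overlap, top_n=1)
print(auto_labeled)
```

Soft patterns are then generated from the pooled sentences of all questions, so noise in any single question's top list is diluted by the rest of the group.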

20/28 Analysis of GPRF
Assumption 1 – some definition sentences can be ranked high using a statistical method.
–Word co-occurrence metrics can model descriptive sentences well. Over 33% of top ranked sentences are definitional.
–Noise introduced in each question’s top list can be mitigated by the group strategy.
Assumption 2 – definition patterns are general and can be used across questions.

21/28 Outline
How Do Current Systems Identify Definitions?
What are Soft Patterns?
Matching Soft Patterns – Flexibility
Unsupervised Learning of Soft Patterns
Evaluations
Conclusion and Future Work

22/28 Evaluation Setup
Two experiments
–To evaluate the effectiveness of our method on a community-standard corpus.
  TREC QA corpus – about 1M news articles.
  50 definitional questions with answer nuggets.
–To assess the adaptability of the system to actual online news and recent questions.
  26 questions from Lycos.
  Up to 200 news articles from each of eight news sites (e.g. CNN and BBC) for each question.
Comparison systems
–Baseline system – centroid based ranking (IR).
–A top ranked definitional question answering system at TREC 2003 – HCR
  Hand-crafted definition patterns (a man-month of time to construct).

23/28 Evaluation Metrics
Based on given answer nuggets.
–The most essential information about the target.
–Judged by human assessors.
Nugget Precision (NP)
–Penalty to longer answers.
Nugget Recall (NR)
–Proportion of returned nuggets to vital nuggets.
F5-measure (weighting NR 5 times as heavily as NP)
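
The slide does not spell out the formulas. The sketch below assumes the TREC 2003 definitional QA scoring: recall is the share of vital nuggets returned, and precision is a length penalty with an allowance of 100 non-whitespace characters per returned nugget. The function and parameter names are hypothetical.

```python
def nugget_f(vital_returned: int, vital_total: int, okay_returned: int,
             answer_length: int, beta: float = 5.0) -> float:
    """F-measure over nugget recall (NR) and nugget precision (NP).

    Assumes vital_total > 0 and answer_length counted in non-whitespace characters.
    """
    recall = vital_returned / vital_total
    allowance = 100 * (vital_returned + okay_returned)
    if answer_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (answer_length - allowance) / answer_length
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denom

# 2 of 4 vital nuggets and 1 okay nugget returned in a 600-character answer.
print(nugget_f(2, 4, 1, 600))
```

In this example NR is 0.5 and NP is 0.5 (300 characters of allowance against 600 returned), so the F5 score is also 0.5. Shortening the answer to within the allowance raises NP to 1 but moves F5 only slightly, which shows how strongly the measure favors recall.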

24/28 Evaluations on TREC Corpus
Pattern matching has significant impact on definition sentence identification.
Soft patterns are more effective for news text.

System | F5 measure | % improvement (over baseline) | % improvement (over HCR)
Centroid (Baseline) | [missing] | |
HCR | [missing] | [missing]% |
SP+GPRF (w = 1) | [missing] | [missing]% | 7.29%
SP+GPRF (w = 2) | [missing] | [missing]% | 14.06%
SP+GPRF (w = 3) | [missing] | [missing]% | 12.42%
SP+GPRF (w = 4) | [missing] | [missing]% | 4.88%
SP+GPRF (w = 5) | [missing] | [missing]% | 2.54%

25/28 Evaluations on the Web Corpus
Using two sets of soft patterns.
–More pattern instances lead to better performance (683 from TREC vs. 375 from Lycos).
Soft patterns are general enough to be applied to other corpora.
–Makes offline training possible.

System | F5 measure | % improvement (over baseline)
Centroid (baseline) | 0.492 |
HCR | [missing] | [missing]%
SP+GPRF (Lycos patterns) | [missing] | [missing]%
SP+GPRF (TREC patterns) | [missing] | [missing]%

26/28 Outline
How Do Current Systems Identify Definitions?
What are Soft Patterns?
Matching Soft Patterns – Flexibility
Unsupervised Learning of Soft Patterns
Evaluations
Conclusion and Future Work

27/28 Conclusions and Future Work
Current definition pattern matching has weaknesses
–Lack of flexibility
–Manual labor
We address them by
–Soft patterns
–Unsupervised learning by Group PRF
Soft patterns prove to be effective in Web-based definition generation systems.
Future work
–Soft patterns in information extraction and factoid question answering.

28/28 Q & A
Thanks!
Try our online demo at appn.comp.nus.edu.sg/~cuihang/DefSearch/DefSearch.htm

29/28 Statistical Ranking – Centroid Word Weighting
Words are weighted by their co-occurrences with the search target.
Words whose centrality weights exceed a predefined threshold form a centroid vector.
Cosine similarity with the centroid vector is used to rank the sentences.
The top ranked sentences are deemed definition sentence candidates.
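
The ranking can be sketched as below. The slide does not give the weighting formula, so the weight used here (the share of a word's sentences that also contain the target) is a stand-in assumption, as are the function names and the threshold value.

```python
import math
from collections import Counter
from typing import Dict, List

def centroid_vector(sentences: List[List[str]], target: str,
                    threshold: float = 0.5) -> Dict[str, float]:
    """Keep words whose co-occurrence weight with the target reaches the threshold."""
    total, with_target = Counter(), Counter()
    for sent in sentences:
        words = set(sent)
        for w in words - {target}:
            total[w] += 1
            if target in words:
                with_target[w] += 1
    weights = {w: with_target[w] / total[w] for w in total}
    return {w: x for w, x in weights.items() if x >= threshold}

def cosine(sent: List[str], centroid: Dict[str, float]) -> float:
    """Cosine similarity between a bag-of-words sentence and the centroid."""
    tf = Counter(sent)
    dot = sum(tf[w] * x for w, x in centroid.items())
    sent_norm = math.sqrt(sum(v * v for v in tf.values()))
    cent_norm = math.sqrt(sum(x * x for x in centroid.values()))
    norm = sent_norm * cent_norm
    return dot / norm if norm else 0.0

def rank(sentences: List[List[str]], target: str,
         threshold: float = 0.5) -> List[List[str]]:
    c = centroid_vector(sentences, target, threshold)
    return sorted(sentences, key=lambda s: cosine(s, c), reverse=True)

sentences = [["sars", "is", "a", "respiratory", "disease"],
             ["sars", "virus", "spreads"],
             ["markets", "fell", "is"]]
ranked = rank(sentences, "sars")
print(ranked[0])
```

The sentence sharing the most high-weight words with the centroid ranks first, and the unrelated sentence ranks last.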

30/28 Sentence Selection
We adopt a variation of Maximal Marginal Relevance (MMR) to summarize the definition sentences.
–To ensure relevance and to avoid redundancy.
Examine only the top ranked sentences and stop when the length of the definition is reached.
–Different from MMR, which examines all sentences.
–Due to the noisy input sentences.
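
The selection loop can be sketched as a greedy pass over the top ranked sentences. Note that this is a threshold-based redundancy filter in the spirit of MMR, not MMR's weighted relevance/novelty objective; the exact variation used in the talk is not given in the transcript. The names `select_sentences`, `top_k`, `max_sim`, and the Jaccard similarity are assumptions.

```python
from typing import Callable, List

def select_sentences(ranked: List[str],
                     sim: Callable[[str, str], float],
                     max_chars: int, top_k: int = 20,
                     max_sim: float = 0.6) -> List[str]:
    """Walk the top_k ranked sentences, skip redundant ones, stop at the length limit."""
    chosen: List[str] = []
    length = 0
    for sent in ranked[:top_k]:
        if length >= max_chars:
            break
        if all(sim(sent, c) < max_sim for c in chosen):
            chosen.append(sent)
            length += len(sent)
    return chosen

def jaccard(a: str, b: str) -> float:
    """Word-set overlap, used here as a simple redundancy measure."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 0.0

ranked = ["Bob Woodward is a famous journalist",
          "Bob Woodward is a famous reporter",
          "He wrote Plan of Attack"]
summary = select_sentences(ranked, jaccard, max_chars=200)
print(summary)
```

The second sentence is dropped as a near-duplicate of the first, while the third adds new information and is kept.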

31/28 Compared to HMM
Both address individual slot content and sequence fidelity.
Soft patterns perform instance-based learning – can deal with
–Small training set
–Noisy data from group pseudo-relevance feedback
–Online training
HMM needs
–More training data and time
–Explicit transition paths between states