Published by Erick Boyd. Modified over 8 years ago.
1
Unsupervised Learning of Soft Patterns for Generating Definitions from Online News
Hang Cui, Min-Yen Kan and Tat-Seng Chua
{cuihang, kanmy, chuats}@comp.nus.edu.sg
School of Computing, NUS, Singapore
2
Problem
To answer “Who is Bob Woodward” and “What is SARS” questions.
  – Such questions make up a large portion of queries in search logs (Voorhees 2001).
Where to get definitions
  – Dictionaries, encyclopedias, online glossaries …
  – Online news, for “new terms” (e.g. Sasser)
In this paper, we
  – deal with recently popular terms and people.
  – identify definition sentences from online news.
  – distill search engine results to definitions.
3
“In the News” from Google (Apr 23, 2004)
In the News: Bob Woodward, SARS, Vietnam War, Yasser Arafat, George W. Bush, Marine Corps, Gaza Strip, Kofi Annan, Mitsubishi Motors, Alan Greenspan, First Quarter, Maurice Clarett
4
“In the News” from Google (Apr 23, 2004)
A list of relevant documents rather than a direct answer
5
Our Solution – DefSearch
Bob Woodward
Woodward, an Office of Naval Intelligence (ONI) asset, interviewed over 75 Bush Cabal insiders. (CNN)
Woodward, who had previously endeared himself to the Bush Administration with his pandering portrait of the President in "Bush at War", has launched a blistering assault on White House credibility with his new book, "Plan of Attack". (NY Times)
People close to Mr. Powell said Sunday that they had no doubt he would weather any criticism from within over his apparent cooperation with Mr. Woodward, an assistant managing editor at The Washington Post. (CNN)
The book, called Plan of Attack, is written by Bob Woodward, the respected journalist who helped break open the Watergate scandal. The book is based on interviews with 75 people, including Bush, and is due for release Tuesday. (REUTERS)
Bob Woodward, the famous Watergate reporter has interviewed President Bush and other Whitehouse "insiders". As a result of the interview, Woodward might have done more damage to the Presidents re-election cause than anyone since Richard Clarkes interview on the same program and the recent events in Spain might be an indication as to how the world is beginning to view President Bush. (ABC News)
6
Behind DefSearch
7
Outline
How Do Current Systems Identify Definitions?
What are Soft Patterns?
Matching Soft Patterns
Unsupervised Learning of Soft Patterns
Evaluations
Conclusion and Future Work
8
How Do Current Systems Identify Definitions?
Most current systems use hand-crafted patterns
  – Appositives, e.g. Gunter Blobel, a cellular and molecular biologist, …
  – Copulas, e.g. Battery is a kind of electronic device …
  – Predicates (relations), e.g. TB is usually caused by …
Current work on definition sentence identification
  – Domain-specific definition generation systems, e.g. topic-specific definitions on the Web and biographies.
  – Definitional QA Task at TREC 2003
9
Weaknesses of Current Pattern Matching Methods
Lack of flexibility – hard matching
  – Pattern: , also known as
      TB, also known as Tuberculosis, …
      TB ( also known as Tuberculosis ) … (mismatch)
  – Variations make hard matching fail
  – Introduce Soft Patterns with greater flexibility
Manual labor
  – Introduce unsupervised learning by Group Pseudo-Relevance Feedback (GPRF).
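The hard-matching failure can be reproduced with an ordinary regular expression. This is a minimal sketch, not the rule set of any system named in the talk: the function `hard_match` and the way the pattern is assembled are assumptions made for illustration.

```python
import re

def hard_match(term: str, sentence: str) -> bool:
    """Hypothetical hard pattern: the term followed literally by ', also known as'."""
    pattern = re.escape(term) + r", also known as\b"
    return re.search(pattern, sentence) is not None

print(hard_match("TB", "TB, also known as Tuberculosis, ..."))    # True
print(hard_match("TB", "TB ( also known as Tuberculosis ) ..."))  # False
```

A single punctuation change is enough to defeat the literal pattern, which is the kind of variation soft patterns are meant to tolerate.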
10
Outline
How Do Current Systems Identify Definitions?
What are Soft Patterns?
Matching Soft Patterns
Unsupervised Learning of Soft Patterns
Evaluations
Conclusion and Future Work
11
What are Soft Patterns?
Soft patterns allow partial matching
  TB ( also known as Tuberculosis ) …
  P("(" | Slot 1) = 0.001, P(also | Slot 2) = 0.21, P(known | Slot 3) = 0.33, P(as | Slot 4) = 0.13
  P(Matching) = 0.23: still better than non-definition sentences.
How does it work?
  – Training: accumulating pattern instances in a vector.
      Derive pattern instances from labeled definition sentences.
  – Matching with a probabilistic model, not regular expressions.
      Using statistical information from all pattern instances, not generalized rules.
      Instance-based learning.
12
Preparing Pattern Instances
Example sentence: The channel Iqra is owned by the Arab Radio and Television company and is the brainchild of the Saudi millionaire, Saleh Kamel.
Step 1: POS tagging and noun phrase chunking.
  The_DT channel_NN Iqra_NNP is_VBZ owned_VBN by_IN NNP company_NN and_CC is_VBZ the_DT brainchild_NN of_IN NNP.
Step 2: Selective substitution. Replace specific words with more general tags. Other tokens remain unchanged.
  DT$ NN BE$ owned by DT$ NNP and BE$ DT$ NN of NNP.
13
Preparing Pattern Instances – Cont’d
Step 3: Crop a text window around the tag “ ” (window size = 3 for each side).
  Pattern instance: DT$ NN BE$ owned by
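The substitution and cropping steps can be sketched in a few lines of Python. Several things here are assumptions rather than the authors' implementation: the placeholder `<TERM>` stands in for the search-target tag (the slide's own tag did not survive), the substitution rules (determiners to `DT$`, forms of "be" to `BE$`, nouns to their POS tag) are inferred from the single example, and the target is taken to be one token. POS tags are supplied by hand instead of by a tagger.

```python
from typing import List, Tuple

TERM = "<TERM>"  # hypothetical placeholder for the search target
BE_FORMS = {"am", "is", "are", "was", "were", "be", "been", "being"}

def substitute(tagged: List[Tuple[str, str]], target: str) -> List[str]:
    """Selective substitution: generalize some tokens, keep the rest unchanged."""
    out = []
    for word, tag in tagged:
        if word == target:
            out.append(TERM)      # the search target itself
        elif tag == "DT":
            out.append("DT$")     # determiners
        elif word.lower() in BE_FORMS:
            out.append("BE$")     # forms of "be"
        elif tag.startswith("NN"):
            out.append(tag)       # nouns are reduced to their POS tag
        else:
            out.append(word)      # other tokens remain unchanged
    return out

def crop(tokens: List[str], window: int = 3) -> List[str]:
    """Keep up to `window` tokens on each side of the target placeholder."""
    i = tokens.index(TERM)
    return tokens[max(0, i - window): i + window + 1]

tagged = [("The", "DT"), ("channel", "NN"), ("Iqra", "NNP"), ("is", "VBZ"),
          ("owned", "VBN"), ("by", "IN"), ("the", "DT"), ("company", "NN")]
instance = crop(substitute(tagged, "Iqra"))
print(instance)  # ['DT$', 'NN', '<TERM>', 'BE$', 'owned', 'by']
```

Only two tokens precede the target in this sentence, so the left side of the window is shorter than three.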
14
Illustration of Soft Pattern Generation
Definition sentences:
  … The channel Iqra is owned by the …
  … severance packages, known as golden parachutes, included …
  … A battery is a cell which can provide electricity.
Pattern instances:
  DT$ NN BE$ owned by
  known as, VB
  BE$ DT$ …
Slot probabilities (as listed on the slide): NN 0.12 NN 0.11, 0.40 DT$ 0.2 known 0.09 as 0.20 BE$ 0.2 VB 0.1 DT$ 0.04 owned 0.09
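One way to picture the accumulation step is to count, for every slot position relative to the search target, how often each token occurs across all pattern instances, and then normalize the counts. The sketch below does exactly that. The placeholder `<TERM>`, the position of the target inside each toy instance, and the normalization by per-slot totals are assumptions, so the resulting numbers are not the ones on the slide.

```python
from collections import Counter, defaultdict
from typing import Dict, List

TERM = "<TERM>"  # hypothetical placeholder for the search target

def build_soft_pattern(instances: List[List[str]]) -> Dict[int, Dict[str, float]]:
    """Map each slot offset (-3..-1, +1..+3) to a token probability table."""
    counts = defaultdict(Counter)
    for inst in instances:
        center = inst.index(TERM)
        for pos, tok in enumerate(inst):
            if pos != center:
                counts[pos - center][tok] += 1
    return {
        slot: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for slot, ctr in counts.items()
    }

instances = [
    ["DT$", "NN", TERM, "BE$", "owned", "by"],
    [",", "known", "as", TERM, ",", "VB"],
    [TERM, "BE$", "DT$", "NN"],
]
pattern = build_soft_pattern(instances)
for slot in sorted(pattern):
    print(slot, pattern[slot])
```

Because every instance contributes to the tables, no instance is ever generalized away into a rule, which matches the instance-based character described earlier.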
15
Outline
How Do Current Systems Identify Definitions?
What are Soft Patterns?
Matching Soft Patterns – Addressing Flexibility
Unsupervised Learning of Soft Patterns
Evaluations
Conclusion and Future Work
16
Matching Soft Patterns
Test sentences are reduced to a vector S using the same strategy.
Matching soft patterns: similarity between the pattern vector Pa and the test vector S.
  – Independent slot content similarity.
  – Slot sequence fidelity.
17
Probabilistic Matching Degree
Individual slot similarity – independence assumption
Sequence fidelity – bigram model
Combined to get the matching degree
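The formulas on this slide did not survive, so the following is only a sketch of how the two components could fit together, not the authors' equations. It assumes that slot similarity is a product of per-slot token probabilities, that sequence fidelity is a product of bigram probabilities estimated from the training instances, that the two are combined by multiplication, and that unseen events get a small floor value `FLOOR`. The placeholder `<TERM>` is likewise hypothetical.

```python
import math
from collections import Counter, defaultdict

TERM = "<TERM>"  # hypothetical placeholder for the search target
FLOOR = 1e-3     # assumed smoothing value for unseen tokens and bigrams

def train(instances):
    """Collect per-slot token counts and token bigram counts."""
    slot_counts = defaultdict(Counter)
    bigram_counts = defaultdict(Counter)
    for inst in instances:
        center = inst.index(TERM)
        for pos, tok in enumerate(inst):
            if pos != center:
                slot_counts[pos - center][tok] += 1
        for prev, nxt in zip(inst, inst[1:]):
            bigram_counts[prev][nxt] += 1
    return slot_counts, bigram_counts

def prob(counter, key):
    """Relative frequency with a floor so that one unseen token never gives zero."""
    total = sum(counter.values())
    return max(counter[key] / total, FLOOR) if total else FLOOR

def matching_degree(test, slot_counts, bigram_counts):
    center = test.index(TERM)
    # Individual slot similarity under the independence assumption.
    slot_sim = math.prod(prob(slot_counts[pos - center], tok)
                         for pos, tok in enumerate(test) if pos != center)
    # Sequence fidelity under a bigram model.
    seq = math.prod(prob(bigram_counts[p], n) for p, n in zip(test, test[1:]))
    return slot_sim * seq  # assumed combination: simple product

instances = [
    ["DT$", "NN", TERM, "BE$", "owned", "by"],
    [",", "known", "as", TERM, ",", "VB"],
    [TERM, "BE$", "DT$", "NN"],
]
slots, bigrams = train(instances)
seen = matching_degree([",", "known", "as", TERM, ",", "VB"], slots, bigrams)
variant = matching_degree(["also", "known", "as", TERM, ")"], slots, bigrams)
print(seen > variant > 0)  # True
```

The variant with unseen tokens still receives a non-zero score, so it can be ranked against other sentences instead of being rejected outright.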
18
Outline
How Do Current Systems Identify Definitions?
What are Soft Patterns?
Matching Soft Patterns – Flexibility
Unsupervised Learning of Soft Patterns – Addressing Manual Labor
Evaluations
Conclusion and Future Work
19
Unsupervised Labeling of Definition Sentences using GPRF
Pattern instances are obtained from labeled definition sentences.
  – Manual labeling is too expensive.
Pseudo-relevance feedback in document retrieval
  – Take the top n ranked documents as relevant.
We employ group pseudo-relevance feedback (GPRF)
  – Statistical ranking: centroid-based method.
  – Perform PRF over a group of questions (top 10 sentences for each question).
  – Generate soft patterns from all auto-labeled sentences for all questions.
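The group strategy amounts to a short loop: rank each question's candidate sentences, treat the top few as definitions without human checking, and pool them across all questions. The sketch below assumes a caller-supplied scoring function; `toy_score` is a stand-in for the centroid-based ranking, and the function and variable names are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

def gprf_label(candidates: Dict[str, List[str]],
               score: Callable[[str, str], float],
               top_n: int = 10) -> List[Tuple[str, str]]:
    """Auto-label the top_n ranked sentences of every question as definitions."""
    labeled = []
    for target, sentences in candidates.items():
        ranked = sorted(sentences, key=lambda s: score(target, s), reverse=True)
        labeled.extend((target, s) for s in ranked[:top_n])
    return labeled  # pooled over the whole group of questions

def toy_score(target: str, sentence: str) -> float:
    # Stand-in for statistical ranking: count mentions of the target.
    return sentence.lower().count(target.lower())

candidates = {
    "SARS": ["SARS is a respiratory disease.", "The weather was fine."],
    "TB": ["Nothing here.", "TB, also known as tuberculosis, spreads through the air."],
}
for target, sentence in gprf_label(candidates, toy_score, top_n=1):
    print(target, "->", sentence)
```

Pattern instances would then be derived from every pooled sentence, so a wrongly labeled sentence in one question's list is outweighed by the sentences from the others.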
20
Analysis of GPRF
Assumption 1: some definition sentences can be ranked high using a statistical method.
  – Word co-occurrence metrics can model descriptive sentences well. Over 33% of top ranked sentences are definitional.
  – Noise introduced in each question’s top list can be mitigated by the group strategy.
Assumption 2: definition patterns are general and can be used across questions.
21
Outline
How Do Current Systems Identify Definitions?
What are Soft Patterns?
Matching Soft Patterns – Flexibility
Unsupervised Learning of Soft Patterns
Evaluations
Conclusion and Future Work
22
Evaluation Setup
Two experiments
  – To evaluate the effectiveness of our method on a community-standard corpus.
      TREC QA corpus: about 1M news articles.
      50 definitional questions with answer nuggets.
  – To assess the adaptability of the system to actual online news and recent questions.
      26 questions from Lycos.
      Up to 200 news articles from each of eight news sites (e.g. CNN and BBC) for each question.
Comparison systems
  – Baseline system: centroid-based ranking (IR).
  – HCR: a top-ranked definitional question answering system at TREC 2003.
      Hand-crafted definition patterns (a man-month of time to construct).
23
Evaluation Metrics
Based on given answer nuggets.
  – The most essential information about the target.
  – Judged by human assessors.
Nugget Precision (NP)
  – Penalty to longer answers.
Nugget Recall (NR)
  – Proportion of the vital nuggets that are returned.
F5-measure (weighting NR 5 times as much as NP)
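The three metrics can be written out as small functions. The F-measure is the standard weighted form with beta = 5. The precision function assumes the length-allowance scheme used for TREC 2003 definitional questions (100 non-whitespace characters per returned nugget); the slide only says that longer answers are penalized, so treat that constant and the function names as assumptions.

```python
def nugget_recall(vital_returned: int, vital_total: int) -> float:
    """Share of the vital nuggets that the answer contains."""
    return vital_returned / vital_total

def nugget_precision(answer_length: int, nuggets_returned: int,
                     allowance_per_nugget: int = 100) -> float:
    """Length-based precision: no penalty while the answer fits the allowance."""
    allowance = allowance_per_nugget * nuggets_returned  # assumed allowance scheme
    if answer_length < allowance:
        return 1.0
    return 1.0 - (answer_length - allowance) / answer_length

def f_measure(np_: float, nr: float, beta: float = 5.0) -> float:
    """Weighted F-measure; beta = 5 makes recall count far more than precision."""
    if np_ == 0 and nr == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * np_ * nr / (b2 * np_ + nr)

np_ = nugget_precision(answer_length=500, nuggets_returned=3)  # 0.6
nr = nugget_recall(vital_returned=2, vital_total=4)            # 0.5
print(round(f_measure(np_, nr), 3))  # 0.503
```

With beta = 5 the score stays close to recall: in this example it lands near NR = 0.5 even though NP is 0.6.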
24
Evaluations on TREC Corpus
Pattern matching has a significant impact on definition sentence identification.
Soft patterns are more effective for news text.

System                F5 measure   % improvement (over baseline)   % improvement (over HCR)
Centroid (Baseline)   0.423
HCR                   0.472        11.52%
SP+GPRF (w = 1)       0.507        19.65%                          7.29%
SP+GPRF (w = 2)       0.539        27.20%                          14.06%
SP+GPRF (w = 3)       0.531        25.37%                          12.42%
SP+GPRF (w = 4)       0.495        16.97%                          4.88%
SP+GPRF (w = 5)       0.484        14.35%                          2.54%
25
Evaluations on the Web Corpus
Using two sets of soft patterns.
  – More pattern instances lead to better performance (683 from TREC vs. 375 from Lycos).
Soft patterns are general enough to be applied to other corpora.
  – Makes offline training possible.

System                     F5 measure   % improvement (over baseline)
Centroid (baseline)        0.492
HCR                        0.555        12.82%
SP+GPRF (Lycos patterns)   0.611        24.04%
SP+GPRF (TREC patterns)    0.642        30.33%
26
Outline
How Do Current Systems Identify Definitions?
What are Soft Patterns?
Matching Soft Patterns – Flexibility
Unsupervised Learning of Soft Patterns
Evaluations
Conclusion and Future Work
27
Conclusions and Future Work
Current definition pattern matching has weaknesses
  – Lack of flexibility
  – Manual labor
We address them by
  – Soft patterns
  – Unsupervised learning by Group PRF
Soft patterns prove to be effective in Web-based definition generation systems.
Future work
  – Soft patterns in information extraction and factoid question answering.
28
Q & A
Thanks!
Try our online demo at http://www-appn.comp.nus.edu.sg/~cuihang/DefSearch/DefSearch.htm
29
Statistical Ranking – Centroid Word Weighting
Words are weighted by their co-occurrences with the search target.
Words with centrality weights beyond a predefined threshold form a centroid vector.
Cosine similarity with the centroid vector is used to rank the sentences.
Top-ranked sentences by the centroid vector are deemed definition sentence candidates.
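A compact sketch of this ranking follows. The slide does not give the weighting formula, so the centrality weight used here (the fraction of target-bearing sentences that contain a word) is a simplifying assumption, as are the whitespace tokenization, the threshold value and the function names.

```python
import math
from collections import Counter
from typing import Dict, List

def centroid_vector(sentences: List[str], target: str,
                    threshold: float = 0.6) -> Dict[str, float]:
    """Keep words whose co-occurrence weight with the target passes the threshold."""
    t = target.lower()
    with_target = [set(s.lower().split()) for s in sentences if t in s.lower().split()]
    n = len(with_target)
    counts = Counter(w for words in with_target for w in words if w != t)
    # Assumed centrality weight: share of target sentences containing the word.
    return {w: c / n for w, c in counts.items() if c / n >= threshold}

def cosine(sentence: str, centroid: Dict[str, float]) -> float:
    tf = Counter(sentence.lower().split())
    dot = sum(tf[w] * weight for w, weight in centroid.items())
    norm = (math.sqrt(sum(v * v for v in tf.values()))
            * math.sqrt(sum(v * v for v in centroid.values())))
    return dot / norm if norm else 0.0

def rank(sentences: List[str], target: str, threshold: float = 0.6) -> List[str]:
    centroid = centroid_vector(sentences, target, threshold)
    return sorted(sentences, key=lambda s: cosine(s, centroid), reverse=True)

sentences = [
    "markets fell on friday",
    "sars is a viral respiratory illness",
    "sars is a respiratory disease",
]
for s in rank(sentences, "sars"):
    print(s)
```

The sentence that shares no words with the centroid scores zero and ends up last; the top of the list supplies the definition sentence candidates.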
30
Sentence Selection
We adopt a variation of Maximal Marginal Relevance (MMR) to summarize the definition sentences.
To ensure relevance and to avoid redundancy.
Examine only the top ranked sentences and stop when the length of the definition is reached.
  – Different from MMR, which examines all sentences.
  – Due to the noisy input sentences.
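The selection step can be sketched as a greedy loop over the top-ranked sentences only. The details below are assumptions made for illustration: Jaccard word overlap as the redundancy measure, the trade-off weight `lam`, the cut-off `top_k`, and a length limit counted in characters.

```python
from typing import List, Tuple

def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def select(ranked: List[Tuple[str, float]], max_chars: int,
           top_k: int = 20, lam: float = 0.7) -> List[str]:
    """Greedy MMR-style selection restricted to the top_k ranked sentences.

    `ranked` holds (sentence, relevance) pairs sorted by descending relevance.
    """
    pool = list(ranked[:top_k])  # only the top of the ranking is examined
    chosen: List[str] = []
    length = 0

    def mmr(item: Tuple[str, float]) -> float:
        sentence, relevance = item
        redundancy = max((jaccard(sentence, c) for c in chosen), default=0.0)
        return lam * relevance - (1 - lam) * redundancy

    while pool and length < max_chars:  # stop once the definition is long enough
        best = max(pool, key=mmr)
        pool.remove(best)
        chosen.append(best[0])
        length += len(best[0])
    return chosen

ranked = [
    ("bob woodward is a journalist at the washington post", 0.90),
    ("bob woodward is a journalist at the post", 0.88),
    ("he helped break open the watergate scandal", 0.60),
]
for s in select(ranked, max_chars=90):
    print(s)
```

In this toy run the near-duplicate second sentence is skipped in favor of the less relevant but novel third one.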
31
Compared to HMM
Both address individual slot content and sequence fidelity.
Soft patterns perform instance-based learning and can deal with
  – Small training set
  – Noisy data from group pseudo-relevance feedback
  – Online training
HMM needs
  – More training data and time
  – Explicit transition paths between states