Theme: Information Extraction for World Wide Web. Paper: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News. Author: Cui, H.,


1 Theme: Information Extraction for World Wide Web
  Paper: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News
  Authors: Cui, H., Kan, M-Y. and Chua, T-S
  Presenter: Bei Yu

2 IE approaches
  Traditional IE (from NLP and CL): using syntactic and semantic constraints
  Wrapper (independently developed for WWW): using delimiter-based extraction patterns
  This paper: soft patterns + IR (PRF) + summarization techniques (sentence retrieval/ranking, MMR)

3 Unsupervised Learning of Soft Patterns for Generating Definitions from Online News
  IE from a QA perspective
  Research question: finding definition sentences for terms or person names
  Previous approaches: hand-crafted rules (previous paper) or supervised learning
  Research method: unsupervised soft patterns + IR + summarization
  External tools needed: commercial POS tagger and syntactic chunker (NP, VP)

4 Soft Patterns
  A virtual vector representation (window size 3)
  Slot: a vector of tokens with their probabilities of occurrence
  Token: word, punctuation or syntactic tag (substituted?)
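The slot representation on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `build_soft_pattern`, the placeholder tokens (`NP`, `BE$`, `DT$`, `<pad>`) and the plain relative-frequency estimate are all assumptions made for the example.

```python
from collections import Counter

WINDOW = 3  # window size stated on the slide


def build_soft_pattern(instances, window=WINDOW):
    """Turn training instances into a list of slots.

    Each instance is a list of 2 * window tokens: the `window` tokens to the
    left of the search term followed by the `window` tokens to its right.
    Each returned slot maps a token to its relative frequency at that position.
    """
    n_slots = 2 * window
    counts = [Counter() for _ in range(n_slots)]
    for inst in instances:
        if len(inst) != n_slots:
            raise ValueError("every instance must have 2 * window tokens")
        for position, token in enumerate(inst):
            counts[position][token] += 1
    total = len(instances)
    return [{tok: c / total for tok, c in slot.items()} for slot in counts]


if __name__ == "__main__":
    # Hypothetical instances after tagging, chunking and substitution.
    demo = [
        ["<pad>", "<pad>", "<pad>", ",", "DT$", "NP"],
        ["NP", "VP", ",", "BE$", "DT$", "NP"],
        ["<pad>", "<pad>", "<pad>", "BE$", "DT$", "NP"],
    ]
    for i, slot in enumerate(build_soft_pattern(demo)):
        print(i, slot)
```

Each slot is kept as a plain dictionary so that a token never seen at a position simply has no entry; how unseen tokens are scored is left to the matching step.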

5 Soft Patterns Emerged from Text

6

7 Soft Patterns Matching Process
  Matching:
    1) bag-of-words similarity using Naive Bayes
    2) sequence fidelity using a bigram model
    3) weighting patterns by their overall weight
  Diagram (training): sentences -> tagging, chunking, substitution -> Pa instances -> probability estimate -> soft patterns Pa
  Diagram (test): test sentence -> tagging, chunking, substitution -> S instance

8 Soft Patterns Matching
  1) bag-of-words similarity using Naive Bayes
  2) sequence fidelity using a bigram model
  Manual tuning of alpha? Where is Pa?
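The two matching components can be sketched as below. The slide does not show the formulas, so everything here is an assumption for illustration: the linear interpolation with `alpha`, the `floor` probability for unseen tokens, the add-one smoothing, and all function names are hypothetical and may differ from the paper.

```python
from collections import Counter


def slot_probability(pattern, instance, floor=1e-3):
    """Naive Bayes style score: slots are treated as independent, so the
    score is the product of each token's probability in its own slot."""
    p = 1.0
    for slot, token in zip(pattern, instance):
        p *= slot.get(token, floor)  # floor stands in for unseen tokens
    return p


def train_bigrams(instances):
    """Count token bigrams over the training instances ('<s>' marks the start)."""
    unigrams, bigrams = Counter(), Counter()
    for inst in instances:
        seq = ["<s>"] + list(inst)
        for prev, cur in zip(seq, seq[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    vocab = {tok for inst in instances for tok in inst}
    return unigrams, bigrams, len(vocab)


def sequence_probability(model, instance):
    """Bigram probability of the token sequence with add-one smoothing.
    The extra +1 in the denominator reserves mass for an unseen token."""
    unigrams, bigrams, vocab_size = model
    seq = ["<s>"] + list(instance)
    p = 1.0
    for prev, cur in zip(seq, seq[1:]):
        p *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size + 1)
    return p


def match_score(pattern, model, instance, alpha=0.5):
    """Assumed combination: alpha weights the slot score against the
    sequence score. alpha is a manually chosen parameter."""
    return (alpha * slot_probability(pattern, instance)
            + (1 - alpha) * sequence_probability(model, instance))
```

An instance with the right tokens in the wrong order keeps some slot-independent similarity but is penalised by the bigram term, which is the role the slide assigns to "sequence fidelity".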

9 System Architecture
  Diagram: search term -> IR, anaphora resolution -> input relevant sentences -> centroid-based ranking -> ranked sentences -> top n by PRF -> SP generation -> reranking by pattern matching -> matched candidate sentences as definition -> redundancy removal: MMR -> final sentence selection
  Pseudo-relevance feedback or assumption?
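The redundancy-removal step names MMR (maximal marginal relevance). A minimal sketch of greedy MMR selection follows; the Jaccard word-overlap similarity, the `lam` trade-off value and the function names are assumptions for illustration, not details taken from the paper.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences (an assumed stand-in)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def mmr_select(scored, k, lam=0.7):
    """Greedy MMR over (sentence, relevance) pairs.

    At each step pick the candidate maximising
    lam * relevance - (1 - lam) * max similarity to sentences already chosen.
    """
    candidates = list(scored)
    selected = []

    def mmr(item):
        sentence, relevance = item
        redundancy = max((jaccard(sentence, s) for s in selected), default=0.0)
        return lam * relevance - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=mmr)
        selected.append(best[0])
        candidates.remove(best)
    return selected
```

With a lower `lam`, a near-duplicate of an already selected sentence is skipped in favour of a less relevant but novel one.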

10 Centroid Word Selection
  Which sentences are most likely to contain a definition?
  Local centroid words (summarization techniques)
  For each word, compute its mutual information with the search term
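One way to read "mutual information with the search term" is sentence-level pointwise mutual information. The sketch below assumes that reading; the exact weighting used in the paper is not shown on the slide, and the function name, whitespace tokenisation and single-word search term are simplifications made for the example.

```python
import math
from collections import Counter


def centroid_words(sentences, term, top_n=5):
    """Rank words by sentence-level PMI with a single-word search term."""
    term = term.lower()
    n = len(sentences)
    sf = Counter()  # number of sentences containing each word
    co = Counter()  # number of sentences containing both the word and the term
    for sentence in sentences:
        words = set(sentence.lower().split())
        has_term = term in words
        for w in words:
            sf[w] += 1
            if has_term and w != term:
                co[w] += 1
    if sf[term] == 0:
        return []
    # PMI = log( P(w, term) / (P(w) * P(term)) ) with sentence-level estimates.
    scores = {w: math.log(n * c / (sf[w] * sf[term])) for w, c in co.items()}
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

Plain PMI favours rare words, so a practical system would likely add frequency damping or an idf-style factor; that refinement is left out here.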

11 Summary of the techniques employed
  Core: soft pattern generalization and matching
  Others:
    Heavy use of summarization techniques: MMR for redundancy removal; sentence ranking/retrieval
    Shallow NLP: POS tagging and syntactic chunking

12 Evaluation for Information Extraction

13 Evaluation for Definition Extraction
  Test data: TREC QA corpus; online news (heuristics leaning to news text)
  Experiment: comparison to HCR (hand-crafted rules) and a centroid-based statistical method (baseline)
  Metric: F5-measure
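The F5-measure is the general F-measure with beta = 5, which weights recall much more heavily than precision. The sketch below shows only that final formula; how precision and recall are obtained for definition answers is not covered, and the function name is hypothetical.

```python
def f_measure(precision, recall, beta=5.0):
    """General F-beta: (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # both precision and recall are zero
    return (b2 + 1) * precision * recall / denom
```

Because of the beta squared weighting, a system with low precision but high recall scores far better under F5 than one with the two values swapped.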

14 Evaluation for TREC collection

15 Evaluation for Web Corpus

16 Questions for this paper
  Chunker-variate performance? (NP, VP)
  Manual tuning parameter (alpha, delta)?
  Void PRF? Question selection: seed for pattern generation
  Is it patterns or just one pattern at all?
  Arbitrary window size?
  Is it really unsupervised learning? Part of data used for rule induction
  Can SP+PRF really beat HCR?

17 References
  Line Eikvil. Information Extraction from World Wide Web. Norwegian Computing Center Technical Report, 1999.
  William Cohen and Andrew McCallum. Information Extraction from World Wide Web. KDD tutorial, 2003.
  Stephen Soderland. Learning Information Extraction Rules from Semi-structured and Free-text. Machine Learning (1), 1999.
  Fuchun Peng. Models for Information Extraction. Technical Report (2000 or 2001?).
  Douglas E. Appelt and David J. Israel. Introduction to Information Extraction Technologies. IJCAI-99 Tutorial.

