Theme: Information Extraction for World Wide Web. Paper: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News. Author: Cui, H.,


Theme: Information Extraction for World Wide Web
Paper: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News
Authors: Cui, H., Kan, M-Y. and Chua, T-S.
Presenter: Bei Yu

IE approaches
- Traditional IE (from NLP and CL): uses syntactic and semantic constraints
- Wrappers (independently developed for the WWW): use delimiter-based extraction patterns
- This paper: soft patterns + IR (PRF) + summarization techniques (sentence retrieval/ranking, MMR)

Unsupervised Learning of Soft Patterns for Generating Definitions from Online News
- IE from a QA perspective
- Research question: finding definition sentences for terms or person names
- Previous approaches: hand-crafted rules (previous paper) or supervised learning
- Research method: unsupervised soft patterns + IR + summarization
- External tools needed: commercial POS tagger and syntactic chunker (NP, VP)

Soft Patterns
- A virtual vector representation (window size 3)
- Slot: a vector of tokens with their probabilities of occurrence
- Token: a word, punctuation mark, or syntactic tag (substituted?)
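The slot idea above can be sketched in a few lines. This is only an illustration under assumptions, not the authors' implementation: the function name `build_soft_pattern`, the `(left_tokens, right_tokens)` input shape, and the signed slot numbering are all hypothetical; only the window size of 3 and the "tokens with probabilities per slot" idea come from the slide.

```python
from collections import Counter, defaultdict

WINDOW = 3  # window size given on the slide


def build_soft_pattern(instances, window=WINDOW):
    """Estimate per-slot token probabilities from training instances.

    instances: list of (left_tokens, right_tokens) surrounding the search
    term (hypothetical input shape). Slots are numbered -1, -2, ... to the
    left of the term and 1, 2, ... to the right.
    """
    counts = defaultdict(Counter)
    for left, right in instances:
        # Walk outward from the search term on the left side.
        for offset, token in enumerate(reversed(left[-window:]), start=1):
            counts[-offset][token] += 1
        # Walk outward from the search term on the right side.
        for offset, token in enumerate(right[:window], start=1):
            counts[offset][token] += 1
    # Turn raw counts into relative frequencies per slot.
    pattern = {}
    for slot, counter in counts.items():
        total = sum(counter.values())
        pattern[slot] = {tok: n / total for tok, n in counter.items()}
    return pattern
```

Each slot ends up as a small probability table, so a pattern tolerates variation (for example "is" versus "," right after the term) instead of requiring an exact token match.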

Soft Patterns Emerged from Text

Soft Patterns Matching Process
- Training: sentences -> tagging, chunking, substitution -> instances -> probability estimate -> soft patterns (Pa)
- Testing: test sentence -> tagging, chunking, substitution -> instance (S)
- Matching:
  1) bag-of-words similarity using Naive Bayes
  2) sequence fidelity using a bigram model
  3) weighting patterns by their overall weight

Soft Patterns Matching
1) bag-of-words similarity using Naive Bayes
2) sequence fidelity using a bigram model
- Manual tuning of alpha?
- Where is Pa?
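A rough sketch of how the two matching components might be combined. The slides only say that alpha is a tuned parameter; the linear interpolation below, the smoothing floor, and all function names are assumptions made for illustration, not the formula from the paper.

```python
def slot_score(pattern, instance, floor=1e-3):
    """Naive-Bayes-style score: slots are treated as independent, so the
    score is the product of per-slot token probabilities.
    `floor` is an assumed smoothing value for unseen tokens."""
    score = 1.0
    for slot, token in instance.items():
        score *= pattern.get(slot, {}).get(token, floor)
    return score


def sequence_score(bigram_probs, instance, floor=1e-3):
    """Bigram score: product of P(token_i | token_{i-1}) over the instance
    tokens read in slot order. `bigram_probs` maps (prev, cur) pairs to
    probabilities (hypothetical input shape)."""
    tokens = [instance[slot] for slot in sorted(instance)]
    score = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        score *= bigram_probs.get((prev, cur), floor)
    return score


def match_score(pattern, bigram_probs, instance, alpha=0.5):
    """Assumed combination: linear interpolation weighted by alpha."""
    return (alpha * slot_score(pattern, instance)
            + (1 - alpha) * sequence_score(bigram_probs, instance))
```

With `alpha=1.0` only the bag-of-words component counts and with `alpha=0.0` only token order counts, which shows why the choice of alpha matters for the "manual tuning" question.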

System Architecture
1. Input: search term
2. IR and anaphora resolution yield relevant sentences
3. Centroid-based ranking yields ranked sentences
4. The top n sentences (by PRF) feed SP generation (pseudo-relevance feedback or assumption?)
5. Reranking by pattern matching; matched candidate sentences are taken as the definition
6. Final sentence selection, with redundancy removal by MMR
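The redundancy-removal step can be illustrated with a generic maximal marginal relevance (MMR) loop. This is a minimal sketch under assumptions: Jaccard word overlap stands in for whatever sentence similarity the system actually uses, and `mmr_select`, `lam`, and the `(sentence, relevance)` input shape are hypothetical names.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences (assumed measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def mmr_select(candidates, k, lam=0.7):
    """Greedily pick k sentences, trading relevance against redundancy.

    candidates: list of (sentence, relevance_score) pairs, e.g. the output
    of a reranking step. lam weights relevance; (1 - lam) weights the
    penalty for similarity to sentences already selected.
    """
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr(item):
            sentence, relevance = item
            redundancy = max(
                (jaccard(sentence, chosen) for chosen, _ in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return [sentence for sentence, _ in selected]
```

A near-duplicate of an already selected sentence is pushed down even if its relevance score is high, so the final definition covers more distinct facts.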

Centroid Word Selection
- Which sentences are most likely to contain a definition?
- Local centroid words (a summarization technique)
- For each word, compute its mutual information with the search term
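As an illustration of the mutual-information step, the sketch below scores each word by sentence-level pointwise mutual information with the search term. The exact weighting used in the paper is not given on the slide, so this formula, the whitespace tokenization, and the name `centroid_words` are assumptions.

```python
import math


def centroid_words(sentences, term, top_n=5):
    """Rank words by pointwise mutual information with the search term,
    using sentence-level co-occurrence counts."""
    n = len(sentences)
    token_sets = [set(s.lower().split()) for s in sentences]
    term = term.lower()
    term_count = sum(term in tokens for tokens in token_sets)
    if term_count == 0:
        return []
    vocab = set().union(*token_sets) - {term}
    scores = {}
    for word in vocab:
        word_count = sum(word in tokens for tokens in token_sets)
        co_count = sum(word in tokens and term in tokens for tokens in token_sets)
        if co_count == 0:
            continue  # never co-occurs with the term, so not a centroid candidate
        # PMI = log( P(word, term) / (P(word) * P(term)) )
        scores[word] = math.log(
            (co_count / n) / ((word_count / n) * (term_count / n))
        )
    # Highest PMI first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

Sentences containing many of the returned words would then be ranked higher as likely definition sentences.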

Summary of the techniques employed
- Core: soft pattern generalization and matching
- Others:
  - Heavy use of summarization techniques: MMR for redundancy removal; sentence ranking/retrieval
  - Shallow NLP: POS tagging and syntactic chunking

Evaluation for Information Extraction

Evaluation for Definition Extraction
- Test data: TREC QA corpus; online news (heuristics leaning toward news text)
- Experiment: comparison to HCR and a centroid-based statistical method (baseline)
- Metric: F5-measure
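For reference, the F5-measure is the general F-beta score with beta = 5, which weights recall much more heavily than precision. The helper below is a generic sketch; how precision and recall are computed for definition sentences (for example, over answer nuggets) is not specified on the slide and is left outside the function.

```python
def f_beta(precision, recall, beta=5.0):
    """General F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta = 5, a system with low precision but high recall scores far better than one with the reverse profile, which suits definition generation where missing a key fact is costlier than including an extra sentence.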

Evaluation for TREC collection

Evaluation for Web Corpus

Questions for this paper
- Does performance vary with the chunker? (NP, VP)
- Manually tuned parameters (alpha, delta)?
- Void PRF?
- Question selection: seed for pattern generation
- Is it several patterns, or just one pattern after all?
- Arbitrary window size?
- Is it really unsupervised learning? Part of the data is used for rule induction.
- Can SP+PRF really beat HCR?
