Using term informativeness for named entity detection
Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology
Advisor: Dr. Hsu
Reporter: Chun Kai Chen
Authors: Jason D. M. Rennie and Tommi Jaakkola
Source: SIGIR 2005

Slide 2: Outline
- Motivation
- Objective
- Introduction
- Mixture Models
- Experiment
- Summary

Slide 3: Motivation
- Informal communication (e-mail, bulletin boards) poses a difficult learning environment
  - traditional grammatical and lexical cues are noisy
  - timely information can be difficult to extract
- The authors are interested in the problem of extracting information from informal, written communication.

Slide 4: Objective
- Introduce a new informativeness score, the Mixture score, which directly uses mixture-model likelihood to identify informative words.

Slide 5: Mixture Models
- Informative words are identified by looking at the difference in log-likelihood between a mixture model and a simple unigram model.
- The simplest model (n_i = number of flips in document i, h_i = number of heads, θ = 0.5):
    P_uni(D) = Π_i θ^(h_i) (1 - θ)^(n_i - h_i)
- Mixture of two unigrams:
    P_mix(D) = Π_i [ λ θ_1^(h_i) (1 - θ_1)^(n_i - h_i) + (1 - λ) θ_2^(h_i) (1 - θ_2)^(n_i - h_i) ]
- Mixture score: the log-odds of the two likelihoods,
    Mixture(w) = log P_mix(D_w) - log P_uni(D_w)
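A minimal Python sketch (not from the paper or the slides) of how these two likelihoods and the score can be evaluated on the coin-flip data used in the following examples. The function names are mine, and the mixture parameters (λ = 0.5, θ_1 = 1, θ_2 = 0) are fixed by hand here; the paper fits them with EM.

```python
import math

def unigram_loglik(docs, theta=0.5):
    """Log-likelihood of (n_i, h_i) pairs under a single-theta unigram model."""
    ll = 0.0
    for n, h in docs:
        ll += h * math.log(theta) + (n - h) * math.log(1 - theta)
    return ll

def mixture_loglik(docs, lam, theta1, theta2):
    """Log-likelihood under a two-component unigram mixture with weight lam."""
    ll = 0.0
    for n, h in docs:
        p1 = (theta1 ** h) * ((1 - theta1) ** (n - h))
        p2 = (theta2 ** h) * ((1 - theta2) ** (n - h))
        ll += math.log(lam * p1 + (1 - lam) * p2)
    return ll

# Example 1 from the slides: {HHH, TTT, HHH, TTT} as (n_i, h_i) pairs
docs = [(3, 3), (3, 0), (3, 3), (3, 0)]

uni = unigram_loglik(docs)                     # equals log(2^-12)
mix = mixture_loglik(docs, 0.5, 1.0, 0.0)      # equals log(2^-4) with theta1=1, theta2=0
print("Mixture score (log-odds):", mix - uni)  # log(2^8) ~= 5.545
```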

Slide 6: Mixture Models (example 1)
- Think of a keyword such as "fish" as a biased coin: it occurs repeatedly in a document about fish (D_1 = {fish fish fish}) and not at all elsewhere (D_2 = {I am student}).
- Four short "documents": {HHH}, {TTT}, {HHH}, {TTT}
- Simple unigram model (θ = 0.5):
    P_uni = [0.5^3 (1-0.5)^(3-3)] × [0.5^0 (1-0.5)^(3-0)] × [0.5^3 (1-0.5)^(3-3)] × [0.5^0 (1-0.5)^(3-0)]
          = (1/8) × (1/8) × (1/8) × (1/8) = 2^-12
- Mixture model (λ = 0.5, θ_1 = 1, θ_2 = 0):
    P(HHH) = 0.5 × 1^3 × (1-1)^(3-3) + (1-0.5) × 0^3 × (1-0)^(3-3) = 0.5 + 0 = 0.5
    P(TTT) = 0.5 × 1^0 × (1-1)^(3-0) + (1-0.5) × 0^0 × (1-0)^(3-0) = 0 + 0.5 = 0.5
    P_mix  = 0.5 × 0.5 × 0.5 × 0.5 = 0.0625 = 2^-4

Slide 7: Mixture Models (example 2)
- Four short "documents": {HTT}, {TTT}, {HTT}, {TTT}
- Simple unigram model (θ = 0.5):
    P_uni = [0.5^1 (1-0.5)^(3-1)] × [0.5^0 (1-0.5)^(3-0)] × [0.5^1 (1-0.5)^(3-1)] × [0.5^0 (1-0.5)^(3-0)]
          = (1/8) × (1/8) × (1/8) × (1/8) = 2^-12 ≈ 0.000244
- Mixture model (component parameters 0.33 and 0.66 for the HTT documents, as on the slide):
    P(HTT) = 0.5 × 0.33^1 × (1-0.33)^(3-1) + (1-0.5) × 0.66^1 × (1-0.66)^(3-1)
           ≈ 0.074 + 0.038 ≈ 0.112
    P(TTT) = 0.5 (as in example 1)
    P_mix  ≈ 0.112 × 0.5 × 0.112 × 0.5 ≈ 0.0031

Slide 8: Mixture Models (example 3)
- Four short "documents": {HTTTT}, {TTT}, {HTT}, {TTT}
- Simple unigram model (θ = 0.5):
    P_uni = [0.5^1 (1-0.5)^(5-1)] × [0.5^0 (1-0.5)^(3-0)] × [0.5^1 (1-0.5)^(3-1)] × [0.5^0 (1-0.5)^(3-0)]
          = (1/32) × (1/8) × (1/8) × (1/8) = 2^-14 ≈ 0.000061
- Mixture model (component parameters 0.2 and 0.8 for the HTTTT document, as on the slide):
    P(HTTTT) = 0.5 × 0.2^1 × (1-0.2)^(5-1) + (1-0.5) × 0.8^1 × (1-0.8)^(5-1)
             ≈ 0.041 + 0.00064 ≈ 0.042
    with P(TTT) = 0.5 and P(HTT) ≈ 0.112 as in example 2:
    P_mix    ≈ 0.042 × 0.5 × 0.112 × 0.5 ≈ 0.0012

Slide 9: Mixture Models (Mixture score)
- Likelihood ratio (mixture / unigram) for the three examples:
  - {HHH},{TTT},{HHH},{TTT}:   2^-4 / 2^-12 = 256
  - {HTT},{TTT},{HTT},{TTT}:   ≈ 0.0031 / 2^-12 ≈ 13
  - {HTTTT},{TTT},{HTT},{TTT}: ≈ 0.0012 / 2^-14 ≈ 19
- The cleanly "switching" data of example 1 gains the most from the mixture; this is the peaked behavior expected of informative, topic-oriented words.
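As a check on the arithmetic in the reconstructed examples above, the sketch below recomputes the three likelihood ratios, assuming (as the examples appear to) that each document is scored under an equal-weight mix of components θ = h_i/n_i and 1 - h_i/n_i. An EM-fitted mixture, as used in the paper, would generally give somewhat different values.

```python
import math

def doc_mixture_prob(n, h):
    """Per-document probability under the examples' construction (an assumption):
    an equal-weight mix of components with theta = h/n and 1 - h/n."""
    t = h / n
    p1 = (t ** h) * ((1 - t) ** (n - h))
    p2 = ((1 - t) ** h) * (t ** (n - h))
    return 0.5 * p1 + 0.5 * p2

def unigram_prob(n, h, theta=0.5):
    """Per-document probability under the single theta = 0.5 unigram."""
    return (theta ** h) * ((1 - theta) ** (n - h))

datasets = {
    "example 1": [(3, 3), (3, 0), (3, 3), (3, 0)],   # HHH, TTT, HHH, TTT
    "example 2": [(3, 1), (3, 0), (3, 1), (3, 0)],   # HTT, TTT, HTT, TTT
    "example 3": [(5, 1), (3, 0), (3, 1), (3, 0)],   # HTTTT, TTT, HTT, TTT
}

for name, docs in datasets.items():
    mix = math.prod(doc_mixture_prob(n, h) for n, h in docs)
    uni = math.prod(unigram_prob(n, h) for n, h in docs)
    print(f"{name}: mixture={mix:.6f} unigram={uni:.6f} ratio={mix / uni:.1f}")
# Prints ratios of roughly 256, 13, and 19 for the three examples.
```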

Slide 10: Named Entity Extraction Performance
(performance figure not preserved in the transcript)

Slide 11: Introduction (1/4)
- The web is filled with information, but even more information is available in the informal communications people send and receive on a day-to-day basis.
  - We call this communication informal because its structure is not explicit and the writing is not fully grammatical.
- We are interested in the problem of extracting information from informal, written communication.

Slide 12: Introduction (2/4)
- Newspaper text is not trivial to deal with, but newspaper articles have proper grammar with correct punctuation and capitalization.
  - Part-of-speech taggers show high accuracy on newspaper text.
- In informal communication, even these basic cues are noisy: grammar rules are bent, capitalization may be ignored or used haphazardly, and punctuation use is creative.

Slide 13: Introduction (3/4)
- Restaurant bulletin boards contain information about new restaurants almost immediately after they open, as well as temporary closures, new management, better service, or a drop in food quality.
  - This timely information can be difficult to extract.
- An important sub-task of extracting information from restaurant bulletin boards is identifying restaurant names.

Slide 14: Introduction (4/4)
- If we had a good measure of how topic-oriented, or "informative," a word is, we would be better able to identify named entities.
- It is well known that informative words have "peaked" or "heavy-tailed" frequency distributions.
- Many informativeness scores have been introduced (a rough sketch of the first two follows below):
  - Inverse Document Frequency (IDF)
  - Residual IDF
  - x^I
  - the z-measure
  - Gain
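The slides do not give formulas for these scores. As a rough illustration of the first two only, here is a sketch of IDF and of one common formulation of Residual IDF (observed IDF minus the IDF a Poisson model would predict from the average count). Treat the exact definitions, log base, and the toy counts as assumptions rather than the paper's formulation.

```python
import math

def idf(df_w, num_docs):
    """Inverse Document Frequency: -log of the fraction of documents containing w."""
    return -math.log(df_w / num_docs)

def residual_idf(df_w, cf_w, num_docs):
    """Observed IDF minus Poisson-predicted IDF (one common formulation)."""
    lam = cf_w / num_docs                             # average occurrences per document
    predicted_idf = -math.log(1.0 - math.exp(-lam))   # Poisson P(count = 0) = e^-lam
    return idf(df_w, num_docs) - predicted_idf

# Hypothetical counts: a word occurring 120 times total, in 40 of 600 posts.
print(idf(40, 600))                 # ~2.71
print(residual_idf(40, 120, 600))   # ~1.0, i.e. "peakier" than the Poisson model predicts
```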

Slide 15: Mixture Models
- Informative words exhibit two modes of operation:
  - a high-frequency mode when the document is relevant to the word
  - a low (or zero) frequency mode when the document is irrelevant
- Informative words are identified by looking at the difference in log-likelihood between a mixture model and a simple unigram model.

Slide 16: Mixture Models
- Example: consider the following four short "documents": {HHH}, {TTT}, {HHH}, {TTT}
- The simplest model for sequential binary data is the unigram.
  - n_i is the number of flips in document i, h_i the number of heads, and θ = 0.5.
  - The unigram is a poor model for the above data.
- The unigram has no capability to model the switching nature of the data.
  - The data likelihood is 2^-12.

Slide 17: Mixture Models
- Example (continued): the same four short "documents": {HHH}, {TTT}, {HHH}, {TTT}
- The likelihood for a mixture of two unigrams is:
    P_mix(D) = Π_i [ λ θ_1^(h_i) (1 - θ_1)^(n_i - h_i) + (1 - λ) θ_2^(h_i) (1 - θ_2)^(n_i - h_i) ]
  (in this example each component gets half the weight, λ = 0.5)
  - A mixture is a composite model.
  - The data likelihood is 2^-4.

Slide 18: Mixture Models
- The two extra parameters of the mixture allow for much better modeling of the data.
- We are interested in the comparative improvement of the mixture model over the simple unigram.
- The Mixture score is the log-odds of the two likelihoods:
    Mixture(w) = log P_mix(D_w) - log P_uni(D_w)
- EM is used to maximize the likelihood of the mixture model (a sketch of such a fit follows below).
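The slide only states that EM is used. The following is a hypothetical sketch of such a fit for the two-component unigram mixture over (n_i, h_i) pairs, together with the resulting Mixture score. Initialization, iteration count, and the θ = 0.5 baseline (appropriate for the coin example; a real word would use its maximum-likelihood rate) are my assumptions, not the authors' implementation.

```python
import math

def fit_mixture_em(docs, iters=200, theta_init=(0.9, 0.1), lam_init=0.5):
    """EM for a two-component unigram mixture over (n_i, h_i) pairs (a sketch)."""
    t1, t2 = theta_init
    lam = lam_init
    eps = 1e-9                                     # guards against division by zero
    for _ in range(iters):
        # E-step: responsibility of component 1 for each document
        resp = []
        for n, h in docs:
            p1 = lam * (t1 ** h) * ((1 - t1) ** (n - h))
            p2 = (1 - lam) * (t2 ** h) * ((1 - t2) ** (n - h))
            resp.append(p1 / (p1 + p2 + eps))
        # M-step: re-estimate mixing weight and the two head-probabilities
        lam = sum(resp) / len(docs)
        heads1 = sum(r * h for r, (n, h) in zip(resp, docs))
        flips1 = sum(r * n for r, (n, h) in zip(resp, docs))
        heads2 = sum((1 - r) * h for r, (n, h) in zip(resp, docs))
        flips2 = sum((1 - r) * n for r, (n, h) in zip(resp, docs))
        t1 = heads1 / (flips1 + eps)
        t2 = heads2 / (flips2 + eps)
    return lam, t1, t2

def mixture_score(docs, lam, t1, t2, theta=0.5):
    """Log-odds of the fitted mixture versus a single-theta unigram."""
    mix = uni = 0.0
    for n, h in docs:
        p1 = lam * (t1 ** h) * ((1 - t1) ** (n - h))
        p2 = (1 - lam) * (t2 ** h) * ((1 - t2) ** (n - h))
        mix += math.log(p1 + p2 + 1e-300)
        uni += h * math.log(theta) + (n - h) * math.log(1 - theta)
    return mix - uni

docs = [(3, 3), (3, 0), (3, 3), (3, 0)]            # the HHH/TTT toy data again
lam, t1, t2 = fit_mixture_em(docs)
print(mixture_score(docs, lam, t1, t2))            # close to log(2^8) ~= 5.5
```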

Slide 19: Experimental Evaluation
- The Restaurant Data
  - Task: identifying restaurant names in posts to a restaurant discussion bulletin board.
  - Collected and labeled six sets of threads, of approximately 100 posts each, from a single board.
  - Used Adwait Ratnaparkhi's MXPOST and MXTERMINATOR software to determine sentence boundaries, tokenize the text, and assign part-of-speech tags.
  - Hand-labeled each token as being part of a restaurant name or not:
    - 56,018 tokens in total; 1,968 tokens were labeled as part of a restaurant name
    - 5,956 unique tokens; of those, 325 were used at least once as part of a restaurant name
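Purely as a hypothetical illustration of how the score could be applied to data like this (not the authors' pipeline; the toy posts are invented), the loop below builds the (n_i, h_i) statistics for each word from already-tokenized posts and ranks words by Mixture score, reusing fit_mixture_em and mixture_score from the EM sketch above. Here the single-unigram baseline uses the word's maximum-likelihood rate rather than the 0.5 of the coin example.

```python
from collections import Counter

# Toy, invented posts standing in for tokenized bulletin-board messages.
posts = [
    ["great", "ramen", "at", "oishii", "last", "night"],
    ["oishii", "was", "packed", "so", "book", "ahead"],
    ["anyone", "tried", "the", "new", "taqueria", "yet"],
]

scores = {}
for w in {tok for post in posts for tok in post}:
    docs = [(len(post), Counter(post)[w]) for post in posts]   # (n_i, h_i) per post
    lam, t1, t2 = fit_mixture_em(docs)
    theta_mle = max(sum(h for n, h in docs) / sum(n for n, h in docs), 1e-9)
    scores[w] = mixture_score(docs, lam, t1, t2, theta=theta_mle)

# Words whose counts are "peaked" across posts should rank near the top.
for w, s in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{w:10s} {s:.2f}")
```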

Slides 20-23: Experimental Results
(result figures/tables not preserved in the transcript)

Slide 24: Summary
- Introduced a new informativeness measure, the Mixture score, and compared it against a number of other informativeness criteria.
- Found the Mixture score to be an effective restaurant-word filter.
- IDF × Mixture score is a more effective filter than either score individually.

Slide 25: Personal Opinion
- Advantage
- Disadvantage