Intelligent Database Systems Lab, National Yunlin University of Science and Technology (國立雲林科技大學)
Using Term Informativeness for Named Entity Detection
Advisor: Dr. Hsu
Reporter: Chun-Kai Chen
Authors: Jason D. M. Rennie and Tommi Jaakkola
SIGIR 2005, pp. 353-360
Outline
─ Motivation
─ Objective
─ Introduction
─ Mixture Models
─ Experiment
─ Summary
Motivation
─ Informal communication (e-mail, bulletin boards) poses a difficult learning environment because traditional grammatical and lexical cues are noisy.
─ Timely information can be difficult to extract.
─ We are interested in the problem of extracting information from informal, written communication.
Objective
─ Introduce a new informativeness score that directly uses mixture-model likelihood to identify informative words.
Mixture Models
─ Informative words are identified by looking at the difference in log-likelihood between a mixture model and a simple unigram model.
─ The simplest model is the unigram, with n_i the number of flips in document i, h_i the number of heads, and θ = 0.5:
   P_uni = ∏_i θ^(h_i) (1−θ)^(n_i−h_i)
─ The mixture of two unigrams takes each component with weight one half:
   P_mix = ∏_i [0.5 θ₁^(h_i) (1−θ₁)^(n_i−h_i) + 0.5 θ₂^(h_i) (1−θ₂)^(n_i−h_i)]
─ The Mixture score is the log-odds of the two likelihoods: log(P_mix / P_uni)
Mixture Models (Example 1)
─ Keyword "fish": D1 = {fish fish fish}, D2 = {I am student}
─ Four short "documents": {HHH}, {TTT}, {HHH}, {TTT}
─ Simple unigram model (θ = 0.5):
   P = [0.5^3 (1−0.5)^(3−3)] × [0.5^0 (1−0.5)^(3−0)] × [0.5^3 (1−0.5)^(3−3)] × [0.5^0 (1−0.5)^(3−0)]
     = 0.5^3 × 0.5^3 × 0.5^3 × 0.5^3 = 2^−12 ≈ 0.000244
─ Mixture model (θ₁ = 1, θ₂ = 0):
   P(HHH) = 0.5 × 1^3 × (1−1)^(3−3) + 0.5 × 0^3 × (1−0)^(3−3) = 0.5 + 0 = 0.5
   P(TTT) = 0.5 × 1^0 × (1−1)^(3−0) + 0.5 × 0^0 × (1−0)^(3−0) = 0 + 0.5 = 0.5
   P = 0.5 × 0.5 × 0.5 × 0.5 = 2^−4 = 0.0625
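The arithmetic on this slide can be checked with a short script. This is a minimal sketch: each "document" is reduced to a (heads, flips) pair, and the function names are mine, not the paper's.

```python
import math

# Each "document" is a (heads, flips) pair; keyword occurrences play the role of heads.
docs = [(3, 3), (0, 3), (3, 3), (0, 3)]  # {HHH}, {TTT}, {HHH}, {TTT}

def unigram_likelihood(docs, theta=0.5):
    """Likelihood of all documents under a single binomial parameter theta."""
    return math.prod(theta**h * (1 - theta)**(n - h) for h, n in docs)

def mixture_likelihood(docs, theta1, theta2, w=0.5):
    """Likelihood under an equal-weight mixture of two binomials."""
    return math.prod(
        w * theta1**h * (1 - theta1)**(n - h)
        + (1 - w) * theta2**h * (1 - theta2)**(n - h)
        for h, n in docs
    )

print(unigram_likelihood(docs))            # 2^-12 ≈ 0.000244
print(mixture_likelihood(docs, 1.0, 0.0))  # 0.5^4 = 0.0625
```

Note that Python evaluates 0**0 as 1, which is exactly the convention the slide's (1−1)^(3−3) terms rely on.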
Mixture Models (Example 2)
─ Four short "documents": {HTT}, {TTT}, {HTT}, {TTT}
─ Simple unigram model (θ = 0.5):
   P = [0.5^1 (1−0.5)^(3−1)] × [0.5^0 (1−0.5)^(3−0)] × [0.5^1 (1−0.5)^(3−1)] × [0.5^0 (1−0.5)^(3−0)]
     = 0.5^3 × 0.5^3 × 0.5^3 × 0.5^3 = 2^−12
─ Mixture model (θ₁ = 1/3 ≈ 0.33, θ₂ = 2/3 ≈ 0.66, so 1−θ₂ ≈ 0.33):
   P(HTT) = 0.5 × 0.33^1 × 0.66^(3−1) + 0.5 × 0.66^1 × 0.33^(3−1)
          = (0.5 × 0.33 × 0.66^2) + (0.5 × 0.66 × 0.33^2) = 0.071874 + 0.035937 = 0.107811
   P = 0.107811 × 0.5 × 0.107811 × 0.5 ≈ 0.0029058
Mixture Models (Example 3)
─ Four short "documents": {HTTTT}, {TTT}, {HTT}, {TTT}
─ Simple unigram model (θ = 0.5):
   P = [0.5^1 (1−0.5)^(5−1)] × [0.5^0 (1−0.5)^(3−0)] × [0.5^1 (1−0.5)^(3−1)] × [0.5^0 (1−0.5)^(3−0)]
     = 0.5^5 × 0.5^3 × 0.5^3 × 0.5^3 = 2^−14
─ Mixture model (θ₁ = 0.2, θ₂ = 0.8 for the five-flip document):
   P(HTTTT) = 0.5 × 0.2^1 × (1−0.2)^(5−1) + 0.5 × 0.8^1 × (1−0.8)^(5−1)
            = (0.5 × 0.2 × 0.8^4) + (0.5 × 0.8 × 0.2^4) = 0.04096 + 0.00064 = 0.0416
   P = 0.0416 × 0.5 × 0.107811 × 0.5 ≈ 0.0011212
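The per-document arithmetic in Examples 2 and 3 can be checked inline. The 0.33/0.66 values are the slides' two-digit roundings of 1/3 and 2/3, used here as-is so the totals match the slides:

```python
# Example 2: P(HTT) under a mixture with theta1 ≈ 0.33, theta2 ≈ 0.66
p_htt = 0.5 * 0.33 * 0.66**2 + 0.5 * 0.66 * 0.33**2   # ≈ 0.071874 + 0.035937

# Example 3: P(HTTTT) under a mixture with theta1 = 0.2, theta2 = 0.8
p_htttt = 0.5 * 0.2 * 0.8**4 + 0.5 * 0.8 * 0.2**4     # = 0.04096 + 0.00064

# All-tails documents keep likelihood 0.5, as in Example 1
p_ttt = 0.5

print(p_htt)                             # ≈ 0.107811
print(p_htttt)                           # ≈ 0.0416
print(p_htttt * p_ttt * p_htt * p_ttt)   # ≈ 0.0011212 (Example 3's total)
```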
Mixture Models (Mixture score)
─ {HHH}, {TTT}, {HHH}, {TTT}:   0.0625 / 2^−12 = 256
─ {HTT}, {TTT}, {HTT}, {TTT}:   0.0029058 / 2^−12 ≈ 11.9
─ {HTTTT}, {TTT}, {HTT}, {TTT}: 0.0011212 / 2^−14 ≈ 18.4
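The score is the log of these ratios. As a quick sketch (base-2 logs are my choice for readability; the slide does not fix a base), the strongly bimodal first example scores well above the other two:

```python
import math

# (mixture likelihood, unigram likelihood) for the three examples above
examples = {
    "{HHH},{TTT},{HHH},{TTT}":   (0.0625,       2**-12),
    "{HTT},{TTT},{HTT},{TTT}":   (0.0029058,    2**-12),
    "{HTTTT},{TTT},{HTT},{TTT}": (0.0011212344, 2**-14),
}

for name, (mix, uni) in examples.items():
    score = math.log2(mix / uni)   # Mixture score: log-odds of the two likelihoods
    print(f"{name}: ratio = {mix / uni:.2f}, log2 score = {score:.2f}")
```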
Named Entity Extraction Performance
─ (performance figure from the paper, not reproduced in this transcript)
Introduction (1/4)
─ The web is filled with information, but even more is available in the informal communications people send and receive on a day-to-day basis.
─ We call this communication informal because its structure is not explicit and the writing is not fully grammatical.
─ We are interested in the problem of extracting information from informal, written communication.
Introduction (2/4)
─ Newspaper text is comparatively easy to deal with: articles have proper grammar with correct punctuation and capitalization, and part-of-speech taggers show high accuracy on newspaper text.
─ In informal communication, even these basic cues are noisy: grammar rules are bent, capitalization may be ignored or used haphazardly, and punctuation use is creative.
Introduction (3/4)
─ Restaurant bulletin boards contain information about new restaurants almost immediately after they open, as well as temporary closures, new management, better service, or a drop in food quality.
─ This timely information can be difficult to extract.
─ An important sub-task of extracting information from restaurant bulletin boards is identifying restaurant names.
Introduction (4/4)
─ If we had a good measure of how topic-oriented, or "informative," a word is, we would be better able to identify named entities.
─ It is well known that informative words have "peaked" or "heavy-tailed" frequency distributions.
─ Many informativeness scores have been introduced:
   ─ Inverse Document Frequency (IDF)
   ─ Residual IDF
   ─ x_I
   ─ the z-measure
   ─ Gain
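Two of the listed scores can be sketched briefly. This assumes the standard definitions (IDF as the negative log of the document-frequency fraction; residual IDF as observed IDF minus the IDF a Poisson model would predict); the counts below are hypothetical illustrations, not data from the paper:

```python
import math

def idf(df, n_docs):
    """IDF: -log2 of the fraction of documents containing the word."""
    return -math.log2(df / n_docs)

def residual_idf(df, cf, n_docs):
    """Observed IDF minus the IDF predicted by a Poisson model for a word
    with collection frequency cf (average rate cf / n_docs per document)."""
    p_at_least_one = 1 - math.exp(-cf / n_docs)   # Poisson P(count >= 1)
    return idf(df, n_docs) + math.log2(p_at_least_one)

# A word whose 100 occurrences cluster in 10 documents ("peaked", informative)
# versus one spread thinly over 95 documents (uninformative), with 1000 docs:
print(residual_idf(df=10, cf=100, n_docs=1000))   # clearly positive
print(residual_idf(df=95, cf=100, n_docs=1000))   # near zero
```

The informative word's occurrences are concentrated in far fewer documents than a Poisson process would produce, which is exactly what residual IDF rewards.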
Mixture Models
─ Informative words exhibit two modes of operation:
   ─ a high-frequency mode when the document is relevant to the word
   ─ a low (or zero) frequency mode when the document is irrelevant
─ Informative words are identified by looking at the difference in log-likelihood between a mixture model and a simple unigram model.
Mixture Models
─ Example: consider the following four short "documents": {HHH}, {TTT}, {HHH}, {TTT}
─ The simplest model for sequential binary data is the unigram, with n_i the number of flips in document i, h_i the number of heads, and θ = 0.5.
─ The unigram is a poor model for the above data: it has no capability to model the switching nature of the data.
─ The data likelihood is 2^−12.
Mixture Models
─ Example: the same four short "documents": {HHH}, {TTT}, {HHH}, {TTT}
─ The likelihood for a mixture of two unigrams weights each component equally (one half each).
─ A mixture is a composite model.
─ The data likelihood is 2^−4.
Mixture Models
─ The two extra parameters of the mixture allow a much better modeling of the data.
─ The Mixture score is the log-odds of the two likelihoods; we are interested in the comparative improvement of the mixture model over the simple unigram.
─ EM is used to maximize the likelihood of the mixture model.
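The slide says EM is used to fit the mixture. Below is a minimal EM sketch for an equal-weight mixture of two per-document binomials; it is my own illustration of the technique, not the authors' code:

```python
def em_mixture(docs, iters=200):
    """Fit an equal-weight mixture of two binomials (theta1, theta2) by EM.
    docs is a list of (heads, flips) pairs, one per document."""
    theta = [0.3, 0.7]  # asymmetric start so the components can separate
    for _ in range(iters):
        # E-step: responsibility of component 1 for each document
        resp = []
        for h, n in docs:
            p = [t**h * (1 - t)**(n - h) for t in theta]
            resp.append(p[0] / (p[0] + p[1]))
        # M-step: re-estimate each theta as a responsibility-weighted head rate
        num1 = sum(r * h for r, (h, n) in zip(resp, docs))
        den1 = sum(r * n for r, (h, n) in zip(resp, docs))
        num2 = sum((1 - r) * h for r, (h, n) in zip(resp, docs))
        den2 = sum((1 - r) * n for r, (h, n) in zip(resp, docs))
        theta = [num1 / den1, num2 / den2]
    return theta

# On the switching data from Example 1 the components separate toward 0 and 1:
print(em_mixture([(3, 3), (0, 3), (3, 3), (0, 3)]))
```

On a word like "fish" in Example 1, EM recovers exactly the two modes of operation described earlier: one component for relevant documents, one for irrelevant ones.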
Experimental Evaluation
─ The Restaurant Data
   ─ Task: identifying restaurant names in posts to a restaurant discussion bulletin board.
   ─ Collected and labeled six sets of threads of approximately 100 posts each from a single board.
   ─ Used Adwait Ratnaparkhi's MXPOST and MXTERMINATOR software to determine sentence boundaries, tokenize the text, and determine part-of-speech.
   ─ Hand-labeled each token as being part of a restaurant name or not: of 56,018 tokens, 1,968 were labeled as part of a restaurant name; of 5,956 unique tokens, 325 were used at least once as part of a restaurant name.
Experimental Results
─ (results figures from the paper, not reproduced in this transcript)
Summary
─ Introduced a new informativeness measure, the Mixture score, and compared it against a number of other informativeness criteria.
─ Found the Mixture score to be an effective restaurant-word filter.
─ IDF × Mixture score is a more effective filter than either score individually.
Personal Opinion
─ Advantage
─ Disadvantage