Intelligent Database Systems Lab, National Yunlin University of Science and Technology (國立雲林科技大學)
Using Term Informativeness for Named Entity Detection
Advisor: Dr. Hsu
Reporter: Chun-Kai Chen
Authors: Jason D. M. Rennie and Tommi Jaakkola
SIGIR 2005, pp. 353-360
Outline
─ Motivation
─ Objective
─ Introduction
─ Mixture Models
─ Experiment
─ Summary
Motivation
─ Informal communication (e-mail, bulletin boards) poses a difficult learning environment because traditional grammatical and lexical cues are noisy.
─ Timely information can be difficult to extract.
─ We are interested in the problem of extracting information from informal, written communication.
Objective
─ Introduce a new informativeness score that directly uses mixture-model likelihood to identify informative words.
Mixture Models
─ Informative words are identified by looking at the difference in log-likelihood between a mixture model and a simple unigram model.
─ The simplest model is the unigram, with n_i the number of flips in document i, h_i the number of heads, and θ = 0.5:
   P_uni = ∏_i θ^(h_i) (1−θ)^(n_i−h_i)
─ The mixture of two unigrams takes each component with weight one half:
   P_mix = ∏_i [0.5 θ₁^(h_i) (1−θ₁)^(n_i−h_i) + 0.5 θ₂^(h_i) (1−θ₂)^(n_i−h_i)]
─ The Mixture score is the log-odds of the two likelihoods: log(P_mix / P_uni)
Mixture Models (Example 1)
─ Keyword "fish": D1 = {fish fish fish}, D2 = {I am student}
─ Four short "documents": {HHH}, {TTT}, {HHH}, {TTT}
─ Simple unigram model (θ = 0.5):
   P = [0.5^3 (1−0.5)^(3−3)] × [0.5^0 (1−0.5)^(3−0)] × [0.5^3 (1−0.5)^(3−3)] × [0.5^0 (1−0.5)^(3−0)]
     = 0.5^3 × 0.5^3 × 0.5^3 × 0.5^3 = 2^−12 ≈ 0.000244
─ Mixture model (θ₁ = 1, θ₂ = 0):
   P(HHH) = 0.5 × 1^3 × (1−1)^(3−3) + 0.5 × 0^3 × (1−0)^(3−3) = 0.5 + 0 = 0.5
   P(TTT) = 0.5 × 1^0 × (1−1)^(3−0) + 0.5 × 0^0 × (1−0)^(3−0) = 0 + 0.5 = 0.5
   P = 0.5 × 0.5 × 0.5 × 0.5 = 2^−4 = 0.0625
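The arithmetic on this slide can be checked with a short script. This is a minimal sketch: each "document" is reduced to a (heads, flips) pair, and the function names are mine, not the paper's.

```python
import math

# Each "document" is a (heads, flips) pair; keyword occurrences play the role of heads.
docs = [(3, 3), (0, 3), (3, 3), (0, 3)]  # {HHH}, {TTT}, {HHH}, {TTT}

def unigram_likelihood(docs, theta=0.5):
    """Likelihood of all documents under a single binomial parameter theta."""
    return math.prod(theta**h * (1 - theta)**(n - h) for h, n in docs)

def mixture_likelihood(docs, theta1, theta2, w=0.5):
    """Likelihood under an equal-weight mixture of two binomials."""
    return math.prod(
        w * theta1**h * (1 - theta1)**(n - h)
        + (1 - w) * theta2**h * (1 - theta2)**(n - h)
        for h, n in docs
    )

print(unigram_likelihood(docs))            # 2^-12 ≈ 0.000244
print(mixture_likelihood(docs, 1.0, 0.0))  # 0.5^4 = 0.0625
```

Note that Python evaluates 0**0 as 1, which is exactly the convention the slide's (1−1)^(3−3) terms rely on.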
Mixture Models (Example 2)
─ Four short "documents": {HTT}, {TTT}, {HTT}, {TTT}
─ Simple unigram model (θ = 0.5):
   P = [0.5^1 (1−0.5)^(3−1)] × [0.5^0 (1−0.5)^(3−0)] × [0.5^1 (1−0.5)^(3−1)] × [0.5^0 (1−0.5)^(3−0)]
     = 0.5^3 × 0.5^3 × 0.5^3 × 0.5^3 = 2^−12
─ Mixture model (θ₁ = 1/3 ≈ 0.33, θ₂ = 2/3 ≈ 0.66, so 1−θ₂ ≈ 0.33):
   P(HTT) = 0.5 × 0.33^1 × 0.66^(3−1) + 0.5 × 0.66^1 × 0.33^(3−1)
          = (0.5 × 0.33 × 0.66^2) + (0.5 × 0.66 × 0.33^2) = 0.071874 + 0.035937 = 0.107811
   P = 0.107811 × 0.5 × 0.107811 × 0.5 ≈ 0.0029058
Mixture Models (Example 3)
─ Four short "documents": {HTTTT}, {TTT}, {HTT}, {TTT}
─ Simple unigram model (θ = 0.5):
   P = [0.5^1 (1−0.5)^(5−1)] × [0.5^0 (1−0.5)^(3−0)] × [0.5^1 (1−0.5)^(3−1)] × [0.5^0 (1−0.5)^(3−0)]
     = 0.5^5 × 0.5^3 × 0.5^3 × 0.5^3 = 2^−14
─ Mixture model (θ₁ = 0.2, θ₂ = 0.8 for the five-flip document):
   P(HTTTT) = 0.5 × 0.2^1 × (1−0.2)^(5−1) + 0.5 × 0.8^1 × (1−0.8)^(5−1)
            = (0.5 × 0.2 × 0.8^4) + (0.5 × 0.8 × 0.2^4) = 0.04096 + 0.00064 = 0.0416
   P = 0.0416 × 0.5 × 0.107811 × 0.5 ≈ 0.0011212
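The per-document arithmetic in Examples 2 and 3 can be checked inline. The 0.33/0.66 values are the slides' two-digit roundings of 1/3 and 2/3, used here as-is so the totals match the slides:

```python
# Example 2: P(HTT) under a mixture with theta1 ≈ 0.33, theta2 ≈ 0.66
p_htt = 0.5 * 0.33 * 0.66**2 + 0.5 * 0.66 * 0.33**2   # ≈ 0.071874 + 0.035937

# Example 3: P(HTTTT) under a mixture with theta1 = 0.2, theta2 = 0.8
p_htttt = 0.5 * 0.2 * 0.8**4 + 0.5 * 0.8 * 0.2**4     # = 0.04096 + 0.00064

# All-tails documents keep likelihood 0.5, as in Example 1
p_ttt = 0.5

print(p_htt)                             # ≈ 0.107811
print(p_htttt)                           # ≈ 0.0416
print(p_htttt * p_ttt * p_htt * p_ttt)   # ≈ 0.0011212 (Example 3's total)
```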
Mixture Models (Mixture score)
─ {HHH}, {TTT}, {HHH}, {TTT}:   0.0625 / 2^−12 = 256
─ {HTT}, {TTT}, {HTT}, {TTT}:   0.0029058 / 2^−12 ≈ 11.9
─ {HTTTT}, {TTT}, {HTT}, {TTT}: 0.0011212 / 2^−14 ≈ 18.4
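The score is the log of these ratios. As a quick sketch (base-2 logs are my choice for readability; the slide does not fix a base), the strongly bimodal first example scores well above the other two:

```python
import math

# (mixture likelihood, unigram likelihood) for the three examples above
examples = {
    "{HHH},{TTT},{HHH},{TTT}":   (0.0625,       2**-12),
    "{HTT},{TTT},{HTT},{TTT}":   (0.0029058,    2**-12),
    "{HTTTT},{TTT},{HTT},{TTT}": (0.0011212344, 2**-14),
}

for name, (mix, uni) in examples.items():
    score = math.log2(mix / uni)   # Mixture score: log-odds of the two likelihoods
    print(f"{name}: ratio = {mix / uni:.2f}, log2 score = {score:.2f}")
```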
Named Entity Extraction Performance
─ (performance figure from the paper, not reproduced in this transcript)
Introduction (1/4)
─ The web is filled with information, but even more is available in the informal communications people send and receive on a day-to-day basis.
─ We call this communication informal because its structure is not explicit and the writing is not fully grammatical.
─ We are interested in the problem of extracting information from informal, written communication.
Introduction (2/4)
─ Newspaper text is comparatively easy to deal with: articles have proper grammar with correct punctuation and capitalization, and part-of-speech taggers show high accuracy on newspaper text.
─ In informal communication, even these basic cues are noisy: grammar rules are bent, capitalization may be ignored or used haphazardly, and punctuation use is creative.
Introduction (3/4)
─ Restaurant bulletin boards contain information about new restaurants almost immediately after they open, as well as temporary closures, new management, better service, or a drop in food quality.
─ This timely information can be difficult to extract.
─ An important sub-task of extracting information from restaurant bulletin boards is identifying restaurant names.
Introduction (4/4)
─ If we had a good measure of how topic-oriented, or "informative," a word is, we would be better able to identify named entities.
─ It is well known that informative words have "peaked" or "heavy-tailed" frequency distributions.
─ Many informativeness scores have been introduced:
   ─ Inverse Document Frequency (IDF)
   ─ Residual IDF
   ─ x_I
   ─ the z-measure
   ─ Gain
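Two of the listed scores can be sketched briefly. This assumes the standard definitions (IDF as the negative log of the document-frequency fraction; residual IDF as observed IDF minus the IDF a Poisson model would predict); the counts below are hypothetical illustrations, not data from the paper:

```python
import math

def idf(df, n_docs):
    """IDF: -log2 of the fraction of documents containing the word."""
    return -math.log2(df / n_docs)

def residual_idf(df, cf, n_docs):
    """Observed IDF minus the IDF predicted by a Poisson model for a word
    with collection frequency cf (average rate cf / n_docs per document)."""
    p_at_least_one = 1 - math.exp(-cf / n_docs)   # Poisson P(count >= 1)
    return idf(df, n_docs) + math.log2(p_at_least_one)

# A word whose 100 occurrences cluster in 10 documents ("peaked", informative)
# versus one spread thinly over 95 documents (uninformative), with 1000 docs:
print(residual_idf(df=10, cf=100, n_docs=1000))   # clearly positive
print(residual_idf(df=95, cf=100, n_docs=1000))   # near zero
```

The informative word's occurrences are concentrated in far fewer documents than a Poisson process would produce, which is exactly what residual IDF rewards.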
Mixture Models
─ Informative words exhibit two modes of operation:
   ─ a high-frequency mode when the document is relevant to the word
   ─ a low (or zero) frequency mode when the document is irrelevant
─ Informative words are identified by looking at the difference in log-likelihood between a mixture model and a simple unigram model.
Mixture Models
─ Example: consider the following four short "documents": {HHH}, {TTT}, {HHH}, {TTT}
─ The simplest model for sequential binary data is the unigram, with n_i the number of flips in document i, h_i the number of heads, and θ = 0.5.
─ The unigram is a poor model for the above data: it has no capability to model the switching nature of the data.
─ The data likelihood is 2^−12.
Mixture Models
─ Example: the same four short "documents": {HHH}, {TTT}, {HHH}, {TTT}
─ The likelihood for a mixture of two unigrams weights each component equally (one half each).
─ A mixture is a composite model.
─ The data likelihood is 2^−4.
Mixture Models
─ The two extra parameters of the mixture allow a much better modeling of the data.
─ The Mixture score is the log-odds of the two likelihoods; we are interested in the comparative improvement of the mixture model over the simple unigram.
─ EM is used to maximize the likelihood of the mixture model.
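The slide says EM is used to fit the mixture. Below is a minimal EM sketch for an equal-weight mixture of two per-document binomials; it is my own illustration of the technique, not the authors' code:

```python
def em_mixture(docs, iters=200):
    """Fit an equal-weight mixture of two binomials (theta1, theta2) by EM.
    docs is a list of (heads, flips) pairs, one per document."""
    theta = [0.3, 0.7]  # asymmetric start so the components can separate
    for _ in range(iters):
        # E-step: responsibility of component 1 for each document
        resp = []
        for h, n in docs:
            p = [t**h * (1 - t)**(n - h) for t in theta]
            resp.append(p[0] / (p[0] + p[1]))
        # M-step: re-estimate each theta as a responsibility-weighted head rate
        num1 = sum(r * h for r, (h, n) in zip(resp, docs))
        den1 = sum(r * n for r, (h, n) in zip(resp, docs))
        num2 = sum((1 - r) * h for r, (h, n) in zip(resp, docs))
        den2 = sum((1 - r) * n for r, (h, n) in zip(resp, docs))
        theta = [num1 / den1, num2 / den2]
    return theta

# On the switching data from Example 1 the components separate toward 0 and 1:
print(em_mixture([(3, 3), (0, 3), (3, 3), (0, 3)]))
```

On a word like "fish" in Example 1, EM recovers exactly the two modes of operation described earlier: one component for relevant documents, one for irrelevant ones.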
Experimental Evaluation
─ The Restaurant Data
   ─ Task: identifying restaurant names in posts to a restaurant discussion bulletin board.
   ─ Collected and labeled six sets of threads of approximately 100 posts each from a single board.
   ─ Used Adwait Ratnaparkhi's MXPOST and MXTERMINATOR software to determine sentence boundaries, tokenize the text, and determine part-of-speech.
   ─ Hand-labeled each token as being part of a restaurant name or not: of 56,018 tokens, 1,968 were labeled as part of a restaurant name; of 5,956 unique tokens, 325 were used at least once as part of a restaurant name.
Experimental Results
─ (results figures from the paper, not reproduced in this transcript)
Summary
─ Introduced a new informativeness measure, the Mixture score, and compared it against a number of other informativeness criteria.
─ Found the Mixture score to be an effective restaurant-word filter.
─ IDF × Mixture score is a more effective filter than either score individually.
Personal Opinion
─ Advantage
─ Disadvantage