
1 Word Weighting based on User’s Browsing History. Yutaka Matsuo, National Institute of Advanced Industrial Science and Technology (JPN). Presenter: Junichiro Mori, University of Tokyo (JPN).

2 Outline of the talk
– Introduction
  – Context-based word weighting
– Proposed measure
– System architecture
– Evaluation
– Conclusion

3 Introduction
Many information support systems that use NLP rely on tf·idf to weight words.
– tf·idf is based on statistics of word occurrence in a target document and a corpus.
– It is effective in many practical systems, including summarization and retrieval systems.
However, a word that is important to one user is sometimes not important to others.
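
For reference, one common form of the tf·idf weight mentioned above (a sketch in Python; the slides do not specify which variant the cited systems use):

```python
import math

def tfidf(tf, df, n_docs):
    """One standard tf-idf variant (a sketch; the talk does not say
    which form the cited systems use).
    tf: term frequency in the document; df: number of corpus documents
    containing the term (assumed >= 1); n_docs: corpus size."""
    return tf * math.log(n_docs / df)
```

Nothing in this weight depends on who is reading the document, which is exactly the limitation the talk addresses.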

4 Example
“Suzuki hitting streak ends at 23 games”
– Ichiro Suzuki is a Japanese MLB player, MVP in 2001.
– Those who are greatly interested in MLB would regard “hitting streak ends” as important,
– while a user who has no interest in MLB would note words such as “game” or “Seattle Mariners” as informative, because those words indicate that the subject of the article is baseball.
Our main hypothesis: if a user is not familiar with a topic, he/she may consider general words related to the topic important; on the other hand, if a user is familiar with the topic, he/she may consider more detailed words important.

5 Goal of this research
This research addresses context-based word weighting, focusing on the statistical features of word co-occurrence. To measure the weight of words more accurately, contextual information about the user (which we call “familiar words”) is used.

6 Outline of the talk
– Introduction
  – Context-based word weighting
– Proposed measure
  – Previous work
  – IRM (Interest Relevance Measure)
– System architecture
– Evaluation
– Conclusion

7 IRM
The new measure, IRM, is based on a word-weighting algorithm applied to a single document.
– [Matsuo 03]: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information, FLAIRS 2003.

8 We take a paper as an example:
“COMPUTING MACHINERY AND INTELLIGENCE. A. M. TURING. 1. The Imitation Game. I PROPOSE to consider the question, ‘Can machines think?’ This should begin with definitions of the meaning of the terms ‘machine’ and ‘think’. The definitions might be framed so as to reflect so far as possible the normal use of the words, but this attitude is dangerous. If the meaning of the words ‘machine’ and ‘think’ are to be found by examining how they are commonly used it is difficult to escape the conclusion that the meaning and the answer to the question, ‘Can machines think?’ is to be sought in a statistical survey such as a Gallup poll. But this is absurd. Instead of attempting such a definition I shall replace the question by another, which is closely related to it and is expressed in relatively unambiguous words. The new form of the problem can be described in terms of a game which we call the ‘imitation game’. It is played with three people, a man (A), a woman (B), and an interrogator (C) who may be of either…”

9 Distribution of frequent terms [figure]

10 Next, count co-occurrences
After stemming, stop-word elimination, and phrase extraction, take the sentence “The new form of the problem can be described in terms of a game which we call the ‘imitation game’.” Then:
– “new” and “form” co-occur once.
– “new” and “problem” co-occur once.
– …
– “call” and “imitation game” co-occur once.
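
A minimal sketch of this counting step (our own illustration, assuming two terms co-occur when they appear in the same preprocessed sentence):

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(sentences):
    """sentences: list of lists of preprocessed terms (stemmed,
    stop words removed, phrases extracted)."""
    cooc = Counter()
    for terms in sentences:
        # Count each unordered pair of distinct terms once per sentence.
        for pair in combinations(sorted(set(terms)), 2):
            cooc[pair] += 1
    return cooc

sentence = ["new", "form", "problem", "describe", "term",
            "game", "call", "imitation game"]
cooc = count_cooccurrences([sentence])
print(cooc[("call", "imitation game")])  # -> 1
```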

11 Co-occurrence matrix [figure: co-occurrence counts of “kind” and “make” with the frequent terms]

12 Co-occurrences of “kind” with the frequent terms, and of “make” with the frequent terms: a general term such as “kind” or “make” is used relatively impartially with each frequent term, but…

13 Co-occurrence matrix [figure: co-occurrence counts of “imitation” and “digital computer” with the frequent terms]

14 Co-occurrences of “imitation” with the frequent terms, and of “digital computer” with the frequent terms: a term such as “imitation” or “digital computer” shows co-occurrence especially with particular terms.

15 Biases of co-occurrence
A general term such as “kind” or “make” is used relatively impartially with each frequent term, while a term such as “imitation” or “digital computer” co-occurs especially with particular terms. Therefore, the degree of bias of co-occurrence can be used as a surrogate for term importance.

16 χ²-measure
We use the χ²-test, which is commonly used to evaluate the bias between expected and observed frequencies. A large bias of co-occurrence indicates the importance of a word:

\chi^2(w) = \sum_{g \in G} \frac{(\mathrm{freq}(w,g) - p_g f(w))^2}{p_g f(w)}

where
– G: the set of frequent terms,
– freq(w, g): the frequency of co-occurrence of term w and term g,
– p_g: the unconditional (expected) probability of g,
– f(w): the total number of co-occurrences of term w with the frequent terms G,
so that p_g f(w) is the expected co-occurrence and freq(w, g) the observed co-occurrence.
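
A sketch of this computation, reusing the co-occurrence counter above; the helper names are ours, and p_g is estimated as g’s share of all co-occurrences involving the column terms:

```python
def column_probs(G, cooc):
    """p_g: expected probability of column term g, estimated as g's
    share of all co-occurrences that involve the column terms."""
    totals = {g: sum(c for pair, c in cooc.items() if g in pair) for g in G}
    norm = sum(totals.values())
    return {g: totals[g] / norm for g in G} if norm else {}

def chi_square(w, G, cooc, p_g):
    """chi2(w) = sum_{g in G} (freq(w,g) - p_g*f(w))^2 / (p_g*f(w))."""
    key = lambda g: tuple(sorted((w, g)))
    f_w = sum(cooc.get(key(g), 0) for g in G if g != w)  # f(w)
    score = 0.0
    for g in G:
        if g == w:
            continue
        expected = p_g.get(g, 0.0) * f_w        # expected co-occurrence
        observed = cooc.get(key(g), 0)          # observed co-occurrence
        if expected > 0:
            score += (observed - expected) ** 2 / expected
    return score
```

Sorting candidate terms by this score in descending order yields the kind of ranking shown on the next slide.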

17 Sort by χ²-value
We can extract important words based on co-occurrence information within a document.

Rank  χ²-value  Label                 Freq
   1     593.7  digital computer        31
   2     179.3  imitation game          16
   3     163.1  future                   4
   4     161.3  question                44
   5     152.8  internal                 3
   6     143.5  answer                  39
   7     142.8  input signal             3
   8     137.7  moment                   2
   9     130.7  play                     8
  10     123.0  output                  15
   …         …  …                        …
 551       1.0  slowness                 2
 552       1.0  unemotional channel      2
 553       0.8  Mr.                      2
 554       0.8  sympathetic              2
 555       0.7  leg                      2
 556       0.7  chess                    2
 557       0.6  Pickwick                 2
 558       0.6  scan                     2
 559       0.3  worse                    2
 560       0.1  eye                      2

18 Outline of the talk
– Introduction
  – Context-based word weighting
– Proposed measure
  – Previous work
  – IRM (Interest Relevance Measure)
– System architecture
– Evaluation
– Conclusion

19 Personalize the calculation of word importance
The previous method is useful for extracting reader-independent important words from a document. However, the importance of words depends not only on the document itself but also on the reader.

20 If we change the columns to pick up…
[Table: the co-occurrence matrix with frequent terms a–j as columns and terms a–j, u, v, w, x as rows, together with row totals and an additional column k.]
a: machine, b: computer, c: question, d: digital, e: answer, f: game, g: argument, h: make, i: state, j: number; u: imitation, v: digital computer, w: kind, x: make

21 If we change the columns to pick up…
Words relevant to the selected words get high χ²-values, because they co-occur often.

Frequent words:
196.9 imitation game, 88.9 play, 62.4 digital computer, 60.1 card, 57.1 future, 50.4 logic, 45.1 identification, 44.4 universality, 42.7 state

Frequent terms + “logic”:
196.6 imitation game, 88.5 play, 84.4 logic system, 62.2 digital computer, 60.0 card, 57.0 future, 44.9 identification, 44.2 proposition, 43.9 limitation

Frequent terms + “God”:
196.2 imitation game, 113.8 animal, 88.2 play, 62.0 digital computer, 59.9 card, 56.9 future, 49.8 identification, 44.7 woman, 40.8 book
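
This experiment amounts to re-running the sketch above with an extended column set; a fragment with hypothetical inputs, reusing count_cooccurrences, column_probs, and chi_square from the earlier sketches:

```python
# Hypothetical frequent-term and candidate sets, for illustration only.
G = {"machine", "question", "answer", "game"}
candidates = {"imitation game", "call", "new"}

p_g = column_probs(G, cooc)
base = sorted(candidates, key=lambda w: chi_square(w, G, cooc, p_g),
              reverse=True)

G_logic = G | {"logic"}              # pick up "logic" as an extra column
p_logic = column_probs(G_logic, cooc)
shifted = sorted(candidates,
                 key=lambda w: chi_square(w, G_logic, cooc, p_logic),
                 reverse=True)       # terms related to "logic" move up
```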

22 Familiarity instead of frequency
We focus on words “familiar” to the user instead of words frequent in the document.
[Definition] Familiar words are words that a user has frequently seen in the past.

23 Interest Relevance Measure (IRM)

\mathrm{IRM}_k(w) = \sum_{h \in H_k} \frac{(\mathrm{freq}(w,h) - p_h f(w))^2}{p_h f(w)}

where H_k is the set of familiar words for user k, and p_h and f(w) are defined as before, with H_k in place of the frequent terms G.

24 IRM
If the value of IRM is large, the word w is relevant to the user’s familiar words: it is relevant to the user’s interests, so it is a keyword for that user. Conversely, if the value of IRM is small, the word is not specifically relevant to any of the familiar words.
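
Read this way, IRM is the χ²-style bias computed against the familiar words instead of the frequent terms; a sketch reusing the helpers above (names ours, not from the paper):

```python
def irm(w, familiar, cooc):
    """IRM of word w for a user whose familiar-word set is `familiar`
    (H_k): the chi-square co-occurrence bias of w against H_k."""
    p_h = column_probs(familiar, cooc)
    return chi_square(w, familiar, cooc, p_h)
```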

25 Outline of the talk
– Introduction
  – Context-based word weighting
– Proposed measure
  – Previous work
  – IRM (Interest Relevance Measure)
– System architecture
– Evaluation
– Conclusion

26 Browsing support system
It is difficult to evaluate IRM objectively because the weight of words depends on a user’s familiar words and therefore varies among users. We therefore evaluate IRM by building a Web browsing support system.
– Web pages accessed by a user are monitored by a proxy server.
– The count of each word is stored in a database.

27 System architecture of the browsing support system
– Browser ⇄ Proxy server ⇄ Internet.
– Proxy server: passes non-text data through unchanged; sends the body part of the HTML to the keyword extraction module; receives the result and sends it to the browser.
– Keyword extraction module: morphological analysis; counts word frequencies; queries past word frequencies in the history; computes the IRM of words; selects keywords; modifies the HTML.
– Frequency server: keeps word counts for each user; increments word counts; answers frequency queries.
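
A hedged, runnable sketch of this flow, with an in-memory Counter standing in for the frequency server and naive regex tokenization in place of the real morphological analyzer (the actual system processes Japanese pages through a proxy; all names here are ours):

```python
import re
from collections import Counter

history = Counter()  # per-user word counts, standing in for the frequency server

def process_page(text, n_familiar=50, n_keywords=10):
    """Returns (keywords, familiar words) for one page; the proxy would
    color keywords red and familiar words blue in the returned HTML."""
    sentences = [re.findall(r"[a-z]+", s) for s in text.lower().split(".")]
    terms = [w for s in sentences for w in s]
    history.update(terms)                                       # increment counts
    familiar = {w for w, _ in history.most_common(n_familiar)}  # H_k from history
    cooc = count_cooccurrences(sentences)   # reuse the earlier sketch
    scores = {w: irm(w, familiar, cooc) for w in set(terms)}
    keywords = sorted(scores, key=scores.get, reverse=True)[:n_keywords]
    return keywords, familiar
```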

28 Sample screenshot [figure]


30 Outline of the talk
– Introduction
  – Context-based word weighting
– Proposed measure
  – Previous work
  – IRM (Interest Relevance Measure)
– System architecture
– Evaluation
– Conclusion

31 Evaluation
For evaluation, ten people tried this system for more than one hour. Three methods were implemented for comparison:
– (I) word frequency
– (II) tf·idf
– (III) IRM

32 Evaluation – Result (1)
After using each system (blind), we asked the following questions on a 5-point Likert scale from 1 (not at all) to 5 (very much). Methods: (I) word frequency, (II) tf·idf, (III) IRM.
– Q1: Does this system help you browse the Web? (I) 2.8 (II) 3.2 (III) 3.2
– Q2: Are the red-colored words (= high-IRM words) interesting to you? (I) 3.2 (II) 4.0 (III) 4.1
– Q3: Are the interesting words colored red? (I) 2.9 (II) 3.3 (III) 3.8
– Q4: Are the blue-colored words (= familiar words) interesting to you? (I) 2.7 (II) 2.5 (III) 2.0
– Q5: Are the interesting words colored blue? (I) 2.7 (II) 2.5 (III) 2.4

33 Evaluation – Result (2)
After evaluating all three systems, we asked the following two questions. Methods: (I) word frequency, (II) tf·idf, (III) IRM.
– Q6: Which one helps your browsing the most? (I) 1 person (II) 3 (III) 6
– Q7: Which one detects your interests the most? (I) 0 people (II) 2 (III) 8
Overall, IRM detects words matching the user’s interests best.

34 Outline of the talk
– Introduction
  – Context-based word weighting
– Proposed measure
  – Previous work
  – IRM (Interest Relevance Measure)
– System architecture
– Evaluation
– Conclusion

35 Conclusion
We developed a context-based word weighting measure (IRM) based on the relevance (i.e., the co-occurrence) of words to a user’s familiar words.
– If a user is not familiar with a topic, he/she may consider general words related to the topic important.
– On the other hand, if a user is familiar with the topic, he/she may consider more detailed words important.
We implemented IRM in a browsing support system and demonstrated its effect.

