Download presentation
Presentation is loading. Please wait.
1
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Web-Page Summarization Using Clickthrough Data Advisor : Dr. Hsu Graduate : Jing Wei Lin Authors : Jian-Tao Sun, Dou Shen, Hua-Jun Zeng, Qiang Yang, Yuchang Lu, Zheng Chen 2005 ACM SIGIR.
2
Intelligent Database Systems Lab N.Y.U.S.T. I. M. 2 Outline Motivation Objective Summarize Web Pages Using Clickthrough Data Experimental Results Discussions Conclusions Personal Opinions
3
Intelligent Database Systems Lab N.Y.U.S.T. I. M. 3 Motivate Many of summarization methods do not consider the hidden relationships in the Web. ─ Uncovering the hidden knowledge is important in building good Web-page summarizers.
4
Intelligent Database Systems Lab N.Y.U.S.T. I. M. 4 Objective We extract the extra knowledge from the clickthrough data of a Web search engine to improve Web-page summarization.
5
Intelligent Database Systems Lab N.Y.U.S.T. I. M. 5 Empirical Study on Clickthrough Data Clickthrough data can be represented by a set of triples (user (u) 、 query (q) the user clicks on the pages (p) of interest) Experiment1 : In the clickthrough data among the 260,763web pages accessed by users during one month, 109,694 of them contain "KEYWORD" metadata. Result : 45.5% of the keywords occur in the query words 13.1% of query words appear as keywords. Experiment2 : We collected 90 pages which are covered by the clickthrough data, then we asked three human evaluators to conduct a manual summarization task. Result : 58% of the sentences in the original Web page contain query words each sentence contains 1.48 query words on average (without knowing query word) the percentage of sentences containing queries becomes 71.3% and the average query word length in each sentence becomes 2.0. (with query word) Query Word Keyword
6
Intelligent Database Systems Lab N.Y.U.S.T. I. M. 6 Adapted Web-page Summarization Methods—Adapted Significant Word In order to compute the significance factor of each sentence, a set of significant words are constructed first. In Luhn’s method 在某年某月某日 ( 陳義雄槍殺陳水扁總統 ) (1) Set a limit L for the distance at which any two significant words could be considered as being significantly related. (2) Find out a portion in the sentence that is bracketed by significant words not more than L non-significant words apart. (ex. 陳水扁 總統 / 陳義雄 ) (3) Count the number of significant words contained in the portion and divide the square of this number by the total number of words within the portion. N s /(N^2*N all )= the significance factor of a sentence 6/(6^2*10)=1/60=0.017 Each candidate word is assigned with a significance factor wi given in Equation 1.
7
Intelligent Database Systems Lab N.Y.U.S.T. I. M. 7 Adapted Web-page Summarization Methods--ALSA Gong et al. proposed an summarization algorithm ─ 1.A term-sentence matrix is constructed from the original text document. ─ 2. LSA analysis is conducted on the matrix. Each element in measures the importance factor of this sentence on the corresponding latent concept. ─ 3.A document summary is produced incrementally. We utilize the query-word knowledge by changing the term- sentence matrix: if a term occurs as query words, its weight is increased according to its frequency in query word collection. Sentence
8
Intelligent Database Systems Lab N.Y.U.S.T. I. M. 8 Summarize Web Pages Not Covered by Clickthrough Data TS(c) a set of terms associated with category c Thematic lexicon is a set of TS Step1 : TS corresponding to each category is set empty. Step2 : For each page covered by the clickthrough data, its query words are added into TS of categories and query words frequency is added to its original weight in TS. Step3 : Term weight in each TS is multiplied by its Inverse Category Frequency (ICF). The ICF value of a term is the reciprocal of its frequency occurring in different categories of the hierarchical taxonomy. Look up the lexicon for TS according to the page's category. Use the summarization methods proposed in paper. ─ Weights of the terms in TS can be used to select significant words or to update the term-sentence matrix. Ex. 絕命終結站 3( 驚悚、死神 ……) Arts Movies TVMusic
9
Intelligent Database Systems Lab N.Y.U.S.T. I. M. 9 Experimental Results Ignore clickthrough data Only query word
10
Intelligent Database Systems Lab N.Y.U.S.T. I. M. 10 Experimental Results (cont.) Ignore clickthrough data Using Lexicon
11
Intelligent Database Systems Lab N.Y.U.S.T. I. M. 11 Experimental Results (cont.) Top
12
Intelligent Database Systems Lab N.Y.U.S.T. I. M. 12 Discussions The thematic lexicon built from clickthrough data can discover the topic terms associated with a specific category and the ICF-based approach can effectively assign weights to terms of this category.
13
Intelligent Database Systems Lab N.Y.U.S.T. I. M. 13 Conclusion We leverage extra knowledge from clickthrough data to improve Web-page summarization. For the pages which are not covered by the clickthrough data, we build a thematic lexicon using the clickthrough data in conjunction with an available hierarchical Web directory.
14
Intelligent Database Systems Lab N.Y.U.S.T. I. M. 14 Personal Opinions Advantage Drawback Opinions :利用 clickthough data 來做個人化的 知識 +
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.