Comments-Oriented Blog Summarization by Sentence Extraction. Meishan Hu, Aixin Sun, Ee-Peng Lim (ACM CIKM’07). Advisor: Dr. Koh Jia-Ling.


1 Comments-Oriented Blog Summarization by Sentence Extraction
Meishan Hu, Aixin Sun, Ee-Peng Lim (ACM CIKM’07)
Advisor: Dr. Koh Jia-Ling
Speaker: Tu Yi-Lang
Date: 2008.04.17

2 Introduction
• Entries of blogs, also known as blog posts, often contain comments from blog readers.
• A recent study on blog conversation showed that readers treat the comments associated with a post as an inherent part of the post.
• This work asks whether reading the comments changes a reader’s understanding of the post.

3 Introduction
• The authors conducted a user study in which blog posts were summarized by labeling representative sentences in those posts.
• Significant differences were observed between the sentences labeled before and after reading the comments.

4 Introduction
• In this research, the task is to summarize a blog post by extracting representative sentences from the post using information hidden in its comments.
• The extracted sentences represent the topics presented in the post that are captured by its readers.

5 Introduction
• Given a blog post and its comments, the solution consists of three modules:
  - sentence detection
  - word representativeness measure
  - sentence selection

6 Problem Definition
• Definition 1: Given a blog post P consisting of a set of sentences P = {s_1, s_2, ..., s_n} and the set of comments C = {c_1, c_2, ..., c_m} associated with P, the task of comments-oriented blog summarization is to extract a subset of sentences from P, denoted by S_r (S_r ⊂ P), that best represents the discussion in C.

7 Problem Definition
• One straightforward approach is to compute a representativeness score for each sentence s_i, denoted by Rep(s_i).
• As a sentence consists of a set of words, s_i = {w_1, w_2, ..., w_m}, one can derive Rep(s_i) using the representativeness scores of all words contained in s_i.

8 Problem Definition
• Word representativeness, denoted by Rep(w_k), can be measured by counting the occurrences of a word in the comments.
• Three simple schemes are considered:
  - Binary
  - Comment Frequency (CF)
  - Term Frequency (TF)
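The three counting schemes can be sketched as follows. This is a minimal illustration, assuming the usual readings of the names (Binary: 1 if the word appears in any comment; CF: number of comments containing the word; TF: total occurrences across all comments). The slide does not spell out these definitions, and the function name `word_scores` is hypothetical.

```python
from collections import Counter


def word_scores(comments, scheme="tf"):
    """Score words from tokenized comments (a list of token lists).

    Assumed definitions, not taken verbatim from the slides:
      binary: 1 for every word that occurs in at least one comment
      cf:     number of comments that contain the word
      tf:     total number of occurrences over all comments
    """
    tf = Counter()
    cf = Counter()
    for tokens in comments:
        tf.update(tokens)       # count every occurrence
        cf.update(set(tokens))  # count each comment at most once
    if scheme == "binary":
        return {w: 1 for w in tf}
    if scheme == "cf":
        return dict(cf)
    if scheme == "tf":
        return dict(tf)
    raise ValueError(f"unknown scheme: {scheme}")
```

A comment that repeats one word many times inflates TF directly, which illustrates why such counts are sensitive to spam.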

9 Problem Definition
• All three measures are simple statistics on comment content. Binary captures minimal information; CF and TF capture slightly more. All three suffer from spam comments.
• The goal is therefore a measure that captures more information from comments (beyond their content) and is less sensitive to spam.

10 ReQuT Model
• The authors state three common observations on how comments may link to each other; these provide guidelines for measuring word representativeness.
• Observation 1: A reader often mentions another reader’s name to indicate that the current comment is a reply to previous comment(s) posted by the mentioned reader. A reader may mention multiple readers in one comment.

11 ReQuT Model
• Observation 2: A comment may contain quoted sentences from one or more comments in order to reply to these comments or continue the discussion.
• Observation 3: Discussion in comments often branches into several topics, and a set of comments is linked together by sharing the same topic.

12 Reader, Quotation and Topic Measures
• ReQuT stands for “Reader”, “Quotation” and “Topic”.
• Based on the three observations, a word is considered representative if it is written by authoritative readers, appears in widely quoted comments, and represents hotly discussed topics.

13 Reader, Quotation and Topic Measures
• Reader measure:
  - In a directed reader graph G_R := (V_R, E_R), each node r_a ∈ V_R is a reader, and an edge e_R(r_b, r_a) ∈ E_R exists if r_b mentions r_a in one of r_b’s comments.
  - The edge weight W_R(r_b, r_a) is the ratio of the number of times r_b mentions r_a to the total number of times r_b mentions any reader (including r_a).
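The edge weight W_R defined above can be computed directly from a list of mentions. The sketch below assumes mentions have already been extracted as (mentioning reader, mentioned reader) pairs, one pair per mention; the function name and input format are hypothetical.

```python
from collections import Counter, defaultdict


def reader_edge_weights(mentions):
    """Return {(rb, ra): W_R(rb, ra)} from (rb, ra) mention pairs.

    W_R(rb, ra) = (# times rb mentions ra) / (# times rb mentions anyone).
    """
    counts = defaultdict(Counter)
    for rb, ra in mentions:
        counts[rb][ra] += 1
    weights = {}
    for rb, targets in counts.items():
        total = sum(targets.values())
        for ra, n in targets.items():
            weights[(rb, ra)] = n / total
    return weights
```

By construction, the outgoing weights of each reader who mentions anyone sum to 1.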

14 Reader, Quotation and Topic Measures
(Equations shown as images on the original slide; symbol definitions follow.)
• |R| : the total number of readers of the blog.
• d : damping factor.
• tf(w_k, c_i) : the term frequency of word w_k in comment c_i.
• c_i ← r_a : c_i is authored by reader r_a.
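The equations for this slide did not survive in the transcript. The sketch below assumes a PageRank-style recurrence, A(r_a) = d/|R| + (1 - d) · Σ_b W_R(r_b, r_a) · A(r_b), and a reader measure RM(w_k) that sums tf(w_k, c_i) · A(r_a) over every comment c_i authored by r_a. Both forms are inferred from the symbol definitions above rather than copied from the source, and the function names, the default d = 0.15, and the fixed iteration count are hypothetical.

```python
from collections import defaultdict


def reader_authority(readers, weights, d=0.15, iters=50):
    """Iterate the assumed recurrence A(ra) = d/|R| + (1-d) * sum_b W_R(rb, ra) * A(rb).

    readers: list of reader ids; weights: {(rb, ra): W_R(rb, ra)}.
    """
    n = len(readers)
    auth = {r: 1.0 / n for r in readers}
    for _ in range(iters):
        new = {r: d / n for r in readers}
        for (rb, ra), w in weights.items():
            new[ra] += (1 - d) * w * auth[rb]
        auth = new
    return auth


def reader_measure(comments, auth):
    """Assumed RM(w) = sum over comments of tf(w, c_i) * A(author of c_i).

    comments: list of (author, tokens) pairs.
    """
    rm = defaultdict(float)
    for author, tokens in comments:
        for w in tokens:
            rm[w] += auth[author]  # one term per occurrence, i.e. tf * A
    return dict(rm)
```

Under this assumed form a reader's authority depends on who mentions them, not on how much they write, so a single reader posting many comments does not raise their own authority.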

15 Reader, Quotation and Topic Measures
• Quotation measure:
  - In a directed acyclic quotation graph G_Q := (V_Q, E_Q), each node c_i ∈ V_Q is a comment, and an edge (c_j, c_i) ∈ E_Q indicates that c_j quoted sentences from c_i.
  - The edge weight W_Q(c_j, c_i) is 1 over the number of comments that c_j quoted.

16 Reader, Quotation and Topic Measures
(Equations shown as images on the original slide; symbol definitions follow.)
• |C| : the number of comments associated with the given post.
• w_k ∈ c_i : word w_k appears in comment c_i.
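The quotation equations are also missing from the transcript. As a sketch, assume a quotation degree D(c_i) = 1/|C| + Σ_j W_Q(c_j, c_i) · D(c_j) and a quotation measure QM(w_k) = Σ D(c_i) over the comments c_i that contain w_k. These forms are inferred from the definitions above, not confirmed by the source. The code further assumes comments are indexed in posting order, so a comment can only quote earlier ones; all names are hypothetical.

```python
from collections import defaultdict


def quotation_degree(n_comments, quotes):
    """Assumed D(c_i) = 1/|C| + sum_j W_Q(c_j, c_i) * D(c_j).

    quotes: {j: iterable of earlier comment indices quoted by comment j}.
    Because the graph is acyclic and edges point to earlier comments,
    one pass from the newest comment to the oldest is enough.
    """
    degree = [1.0 / n_comments] * n_comments
    for j in range(n_comments - 1, -1, -1):
        quoted = set(quotes.get(j, ()))
        for i in quoted:
            if i >= j:
                raise ValueError("a comment may only quote earlier comments")
            degree[i] += degree[j] / len(quoted)  # W_Q(c_j, c_i) = 1 / len(quoted)
    return degree


def quotation_measure(comment_tokens, degree):
    """Assumed QM(w) = sum of D(c_i) over comments c_i that contain w."""
    qm = defaultdict(float)
    for i, tokens in enumerate(comment_tokens):
        for w in set(tokens):
            qm[w] += degree[i]
    return dict(qm)
```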

17 Reader, Quotation and Topic Measures
• Topic measure:
  - A hotly discussed topic has a large number of comments, all close to the topic cluster centroid.
  - The topic measure is based on the importance of each topic cluster.

18 Reader, Quotation and Topic Measures
(Equations shown as images on the original slide; symbol definitions follow.)
• |c_i| : the length of comment c_i in number of words.
• C : the set of comments.
• sim(c_i, t_u) : the cosine similarity between comment c_i and the centroid of topic cluster t_u.
• c_i ∈ t_u : comment c_i is clustered into topic cluster t_u.
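One form consistent with these definitions is T(t_u) = Σ_{c_i ∈ t_u} |c_i| · sim(c_i, t_u) / Σ_{c_j ∈ C} |c_j|, with a word's topic measure summing T(t_u) over the comments that contain the word. The sketch below assumes that form (it is not recovered from the slide), takes cluster labels as given, and uses raw term counts as comment vectors; all names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def topic_importance(comments, labels):
    """Assumed T(t_u) = sum_{c_i in t_u} |c_i| * sim(c_i, t_u) / sum_j |c_j|.

    comments: list of token lists; labels[i]: cluster id of comments[i].
    """
    total_len = sum(len(c) for c in comments)
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    importance = {}
    for label, members in clusters.items():
        centroid = Counter()
        for idx in members:
            centroid.update(comments[idx])  # summed vector; cosine ignores scale
        weighted = sum(
            len(comments[i]) * cosine(Counter(comments[i]), centroid)
            for i in members
        )
        importance[label] = weighted / total_len
    return importance


def topic_measure(comments, labels, importance):
    """Assumed TM(w) = sum of T(t_u) over comments in t_u that contain w."""
    tm = defaultdict(float)
    for tokens, label in zip(comments, labels):
        for w in set(tokens):
            tm[w] += importance[label]
    return dict(tm)
```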

19 Word Representativeness Score
• The representativeness score of a word, Rep(w_k), is the combination of the reader, quotation and topic measures in the ReQuT model.

20 Word Representativeness Score
(Equation shown as an image on the original slide.)
• α, β and γ are the coefficients.
• 0 ≤ α, β, γ ≤ 1.0 and α + β + γ = 1.0.
• Because readers and bloggers have no control over the authority measure and very little control over the quotation and topic measures, ReQuT is less sensitive to spam comments.
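Given coefficients that sum to 1, the natural reading is a weighted sum, Rep(w_k) = α · RM(w_k) + β · QM(w_k) + γ · TM(w_k). The sketch below assumes that form and additionally assumes each measure is scaled by its maximum before combining so the three are comparable; the scaling step and the function names are assumptions, not statements from the slide.

```python
def normalize(scores):
    """Scale scores so the largest value is 1 (assumed preprocessing step)."""
    top = max(scores.values(), default=0.0)
    return {w: (s / top if top else 0.0) for w, s in scores.items()}


def combine(rm, qm, tm, alpha=1 / 3, beta=1 / 3, gamma=1 / 3):
    """Assumed Rep(w) = alpha * RM(w) + beta * QM(w) + gamma * TM(w)."""
    if min(alpha, beta, gamma) < 0 or abs(alpha + beta + gamma - 1.0) > 1e-6:
        raise ValueError("coefficients must be non-negative and sum to 1")
    rm, qm, tm = normalize(rm), normalize(qm), normalize(tm)
    words = set(rm) | set(qm) | set(tm)
    return {
        w: alpha * rm.get(w, 0.0) + beta * qm.get(w, 0.0) + gamma * tm.get(w, 0.0)
        for w in words
    }
```

Setting one coefficient to 1 and the others to 0 isolates a single measure, which is a convenient way to compare their individual contributions.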

21 Sentence Selection
• Two sentence selection methods:
  - Density-based selection (DBS). DBS was originally proposed to rank and select sentences in question answering.
  - Summation-based selection (SBS). SBS, proposed in this paper, gives a higher representativeness score to a sentence if it contains more representative words.

22 Sentence Selection
• Density-based selection (DBS):
  - DBS is adapted to this problem by treating words that appear in comments as keywords and all other words as non-keywords.
(Equation shown as an image on the original slide; symbol definitions follow.)
• K : the total number of keywords contained in s_i.
• Score(w_j) : the score of keyword w_j.
• distance(w_j, w_j+1) : the number of non-keywords between two adjacent keywords w_j and w_j+1 in s_i.
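The DBS equation itself is not in the transcript. A common density-based form that matches the symbols above is Rep(s_i) = 1/(K(K+1)) · Σ_{j=1}^{K-1} Score(w_j) · Score(w_j+1) / distance(w_j, w_j+1)², and the sketch below assumes it. One deliberate deviation: the code divides by (gap + 1)², where gap is the number of non-keywords between the two keywords, so that directly adjacent keywords do not cause a division by zero. That offset is an assumption of this sketch.

```python
def dbs_score(sentence, scores):
    """Density-based score of a tokenized sentence (assumed form, see lead-in).

    sentence: list of tokens; scores: {keyword: Score(keyword)}.
    Words absent from `scores` are treated as non-keywords.
    """
    positions = [i for i, w in enumerate(sentence) if w in scores]
    k = len(positions)
    if k < 2:
        return 0.0  # no adjacent keyword pair to measure density on
    total = 0.0
    for p, q in zip(positions, positions[1:]):
        gap = q - p - 1  # number of non-keywords between the two keywords
        # (gap + 1) avoids dividing by zero for adjacent keywords (assumption)
        total += scores[sentence[p]] * scores[sentence[q]] / (gap + 1) ** 2
    return total / (k * (k + 1))
```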

23 Sentence Selection
• Summation-based selection (SBS):
  - SBS avoids favoring long sentences by taking the number of words in a sentence into account.
(Equation shown as an image on the original slide; symbol definitions follow.)
• |s_i| : the length of sentence s_i in number of words.
• τ (τ > 0) : a parameter to flexibly control the contribution of a word’s representativeness score.
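The SBS equation is missing from the transcript as well. The sketch below assumes Rep(s_i) = (1/|s_i|) · Σ_{w_k ∈ s_i} Rep(w_k)^τ, which matches the description of τ as controlling each word's contribution and of |s_i| as a length normalizer. Where exactly τ enters is an assumption, and the helper `select_sentences` is hypothetical.

```python
def sbs_score(sentence, rep, tau=0.2):
    """Assumed SBS form: (1/|s|) * sum over words of Rep(w) ** tau.

    sentence: list of tokens; rep: {word: Rep(word)}; unknown words score 0.
    """
    if not sentence:
        return 0.0
    return sum(rep.get(w, 0.0) ** tau for w in sentence) / len(sentence)


def select_sentences(sentences, rep, k, tau=0.2):
    """Return the indices of the k best-scoring sentences, in document order."""
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sbs_score(sentences[i], rep, tau),
        reverse=True,
    )
    return sorted(ranked[:k])
```

With τ below 1, differences between high and low word scores are compressed, so a sentence is rewarded more for containing many representative words than for containing one very high-scoring word.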

24 User Study and Experiments
• Data were collected from two well-known blogs, Cosmic Variance and IEBlog, both of which have a relatively large readership and are widely commented on.

25 User Study
• Ten posts were randomly picked from each of the two blogs, and 3 human summarizers were recruited from among Computer Engineering students.
• The hypothesis is that one’s understanding of a blog post does not change after he or she reads the comments associated with the post.

26 User Study
• The user study was conducted in two phases. In both, the 3 summarizers were given the 20 blog posts and asked to select approximately 30% of the sentences from each post as its summary.
  - Phase 1: without comments (RefSet-1).
  - Phase 2: with the nearly 1000 comments associated with the 20 posts (RefSet-2).

27 User Study
• For each human summarizer, the authors computed the level of self-agreement.
• The self-agreement level is defined as the percentage of sentences in RefSet-1 that the same summarizer also labeled in RefSet-2.
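The self-agreement definition translates directly into a set computation. The sketch below assumes each reference set is given as a set of sentence identifiers chosen by one summarizer for one post; the function name is hypothetical.

```python
def self_agreement(refset1, refset2):
    """Fraction of RefSet-1 sentences that the same summarizer also chose in RefSet-2."""
    first = set(refset1)
    if not first:
        return 0.0
    return len(first & set(refset2)) / len(first)
```

A value well below 1 means the summarizer changed many selections after reading the comments.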

28 Experimental Results
• Because its sentences were labeled after reading the comments, RefSet-2 was used to evaluate the two sentence selection methods with four word representativeness measures.
• R-Precision and NDCG were adopted as performance metrics.
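Both metrics have standard definitions, sketched below. R-Precision is the precision of the top R ranked sentences, where R is the number of reference sentences. NDCG here uses the common log2 position discount. How gains were assigned to sentences in the experiments is not stated on the slide, so the `gains` mapping is an assumption, as are the function names.

```python
import math


def r_precision(ranked, relevant):
    """Precision among the top R ranked items, where R = len(relevant)."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for s in ranked[:r] if s in relevant) / r


def ndcg(ranked, gains, k=None):
    """Normalized discounted cumulative gain with a log2 position discount.

    ranked: sentence ids in predicted order; gains: {sentence id: gain}.
    """
    k = len(ranked) if k is None else k
    dcg = sum(
        gains.get(s, 0.0) / math.log2(i + 2) for i, s in enumerate(ranked[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```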

29 Experimental Results
• Parameter settings in the experiment:
  - in SBS: τ = 0.2
  - in combining the reader, quotation and topic measures: α = β = γ = 0.33

