Graph-based Text Summarization
Lin Ziheng
NUS WING Group Meeting
Aims
  Build a graph that models the development (for writers) and consumption (for readers) of ideas in text through time
  Use rhetorical relations to help recognize the important sentences in text
Random Walk
  Depends on current state
  Convergence
  Example transition matrix over nodes 1-5 (row = current node, column = next node):
         1     2     3     4     5
    1    0     0.4   0.6   0     0
    2    0     0     1     0     0
    3    0     0     0     0     1
    4    0.1   0     0.5   0     0.4
    5    0     0.2   0     0.8   0
  Google PageRank: 0 < d < 1, usually d = 0.85
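A minimal sketch of the random walk above in plain Python: power iteration on the slide's transition matrix until the distribution stops changing. The damped variant assumes the normalized PageRank form, in which the walker jumps to a uniformly random node with probability 1 - d; the slide text does not spell out the formula, so that form is an assumption. The names `step` and `stationary` are illustrative only.

```python
# Transition matrix from the slide: P[i][j] is the probability of moving
# from node i+1 to node j+1. Every row sums to 1.
P = [
    [0.0, 0.4, 0.6, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.1, 0.0, 0.5, 0.0, 0.4],
    [0.0, 0.2, 0.0, 0.8, 0.0],
]


def step(dist, P, d=1.0):
    """One random-walk step. With d < 1 the walker teleports to a uniformly
    random node with probability 1 - d (assumed PageRank-style damping)."""
    n = len(P)
    return [
        (1.0 - d) / n + d * sum(dist[i] * P[i][j] for i in range(n))
        for j in range(n)
    ]


def stationary(P, d=1.0, tol=1e-12, max_iter=10_000):
    """Iterate from the uniform distribution until successive steps agree."""
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(max_iter):
        new = step(dist, P, d)
        if max(abs(a - b) for a, b in zip(new, dist)) < tol:
            return new
        dist = new
    return dist


plain = stationary(P)            # undamped random walk
damped = stationary(P, d=0.85)   # damped walk with d = 0.85
```

For this matrix the undamped walk converges and places the most probability mass on node 5; with damping, every node keeps at least (1 - d) / 5 of the mass.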
Citation Network
  New papers can cite old papers
  Old papers are not updated
  [Figure: new paper and old papers]
The Internet
  A new page must have at least one incoming link, and may link to existing pages
  Old pages can update their links
  [Figure: new page and old pages]
Graph-based summarization: LexRank
  Nodes = sentences
  Edges = cosine similarity
  Fully connected
  Undirected
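A rough sketch of such a sentence graph, assuming whitespace tokenization and raw term counts (LexRank itself uses idf-weighted cosine, so this is a simplification). The helper names `cosine` and `similarity_graph` and the example sentences are made up for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token lists using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def similarity_graph(sentences):
    """Symmetric weight matrix: W[i][j] is the cosine of sentences i and j."""
    tokens = [s.lower().split() for s in sentences]
    n = len(tokens)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            W[i][j] = W[j][i] = cosine(tokens[i], tokens[j])
    return W


sentences = [
    "graph methods rank sentences",
    "random walks rank nodes in a graph",
    "the weather is nice today",
]
W = similarity_graph(sentences)
```

The matrix is symmetric because the graph is undirected; sentence pairs that share no words get weight 0, so "fully connected" here means every pair has an edge slot, possibly with zero weight.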
Graph-based summarization: TextRank
  Nodes = sentences
  Edges = similarity
  Backward links
  Directed
  [Figure: nodes s1, s2, s3, s4; new sentence and old sentences]
Writing/Reading Process Assumption
  Readers read from the beginning towards the end
  Writers write from the beginning towards the end
Blog Network
Building Graph
  Out-degree: proportional to how long the sentence stays in the graph (e.g., 1st: 3, 2nd: 2, 3rd: 1)
  In-degree: importance
  Edges: cosine similarity, co-occurrence, longest common subsequence, etc.
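One possible reading of this construction, sketched in Python under stated assumptions: at timestep t the t-th sentence of every document enters the graph, and every sentence already in the graph then adds a fixed number of outlinks to the most similar sentences it does not link to yet. The link-selection rule, the use of cosine over raw term counts, and the names `build_graph`, `in_degree`, and `outlinks_per_step` are all assumptions made for illustration, not the method as implemented in the talk.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token lists using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def build_graph(docs, outlinks_per_step=1):
    """docs: list of documents, each a list of sentence strings.

    Returns (nodes, edges): nodes are (doc_idx, sent_idx) pairs and edges is
    a set of directed (source, target) pairs.
    """
    tokens = {
        (d, s): sent.lower().split()
        for d, doc in enumerate(docs)
        for s, sent in enumerate(doc)
    }
    nodes, edges = [], set()
    for t in range(max(len(doc) for doc in docs)):
        # Timestep t: the t-th sentence of every document enters the graph.
        nodes += [(d, t) for d, doc in enumerate(docs) if t < len(doc)]
        # Every sentence now in the graph emits new outlinks to the most
        # similar sentences it does not link to yet (assumed rule).
        for src in nodes:
            candidates = [n for n in nodes if n != src and (src, n) not in edges]
            candidates.sort(
                key=lambda n: cosine(tokens[src], tokens[n]), reverse=True
            )
            for dst in candidates[:outlinks_per_step]:
                edges.add((src, dst))
    return nodes, edges


def in_degree(nodes, edges):
    """Number of incoming links per node, used as the importance score."""
    deg = {n: 0 for n in nodes}
    for _, dst in edges:
        deg[dst] += 1
    return deg


docs = [
    ["a storm hit the coast", "the storm closed roads", "schools reopened later"],
    ["the coast was hit by a storm", "roads were closed", "repairs began"],
    ["officials warned of a storm", "the storm flooded roads", "aid arrived"],
]
nodes, edges = build_graph(docs, outlinks_per_step=1)
deg = in_degree(nodes, edges)
out_deg = Counter(src for src, _ in edges)
```

With three 3-sentence documents and one outlink per timestep, the first sentence of each document ends with out-degree 3, the second with 2, and the third with 1, which matches the example on the slide.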
[Figure: doc1, doc2, doc3]
Sentence Extraction
  In-degree
  Run PageRank
    Unbiased
    Biased towards query
  Example node values (rows = sentences, columns = documents):
          d1   d2   d3
    s1    2    3    3
    s2    1    4    0
    s3    4    1    0
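A hedged sketch of the two PageRank options: the unbiased case teleports uniformly, while the biased case teleports in proportion to a per-sentence weight such as similarity to the query. This is the standard personalized-PageRank idea and is assumed here; the function `pagerank`, its `bias` parameter, the example edges, and the bias weights are all hypothetical.

```python
def pagerank(out_links, d=0.85, bias=None, tol=1e-10, max_iter=1000):
    """out_links: dict mapping each node to the list of nodes it links to.

    bias: optional dict of non-negative weights with a positive sum
    (for example, similarity to the query). None means uniform
    teleportation, i.e. the unbiased case.
    """
    nodes = list(out_links)
    if bias is None:
        bias = {n: 1.0 for n in nodes}
    total = sum(bias[n] for n in nodes)
    jump = {n: bias[n] / total for n in nodes}
    rank = dict(jump)
    for _ in range(max_iter):
        new = {n: (1.0 - d) * jump[n] for n in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = d * rank[src] / len(targets)
                for dst in targets:
                    new[dst] += share
            else:
                # Dangling node: spread its mass by the jump distribution.
                for n in nodes:
                    new[n] += d * rank[src] * jump[n]
        if max(abs(new[n] - rank[n]) for n in nodes) < tol:
            return new
        rank = new
    return rank


# Hypothetical links between a few sentence nodes (not taken from the slides).
out_links = {
    "d1s1": ["d2s1"],
    "d2s1": ["d1s1", "d3s1"],
    "d3s1": ["d2s1"],
    "d1s2": ["d2s1"],
}
unbiased = pagerank(out_links)
# Hypothetical query similarities: the query resembles d3s1 the most.
biased = pagerank(
    out_links, bias={"d1s1": 0.1, "d2s1": 0.1, "d3s1": 0.7, "d1s2": 0.1}
)
top = sorted(biased, key=biased.get, reverse=True)
```

In this toy graph the biased run raises the score of d3s1 relative to the unbiased run, while the link structure still keeps d2s1 on top. Extraction would then take sentences in the order given by `top`.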
Evaluation 1
  Dataset: DUC’04 Task 2
                 in degree                                                                    pagerank, node start  LexRank
                 t = 1      t = 0.9    t = 0.7    t = 0.5    t = 0.3    t = 0.2    t = 0.1    rank=1     rank=cosine
  ROUGE-1 R avg  0.3602626  0.3570972  0.3504308  0.3528222  0.3570688  0.3623468  0.3563554  0.3614034  0.3627054  0.3588200
  ROUGE-1 P avg  0.4002014  0.3963142  0.3972844  0.3951052  0.3946780  0.3926562  0.3903444  0.3915736  0.3881828  0.3914098
  ROUGE-1 F avg  0.3778134  0.3744442  0.3713722  0.3716442  0.3735710  0.3756256  0.3714446  0.3745956  0.3739770  0.3733106
  ROUGE-2 R avg  0.0899096  0.0912164  0.0895932  0.0893618  0.0891864  0.0900720  0.0876280  0.0879926  0.0902010  0.0894858
  ROUGE-2 P avg  0.1008002  0.1017430  0.1019968  0.1007348  0.0994878  0.0988606  0.0968300  0.0963064  0.0967940  0.0978864
  ROUGE-2 F avg  0.0946892  0.0958694  0.0951186  0.0944158  0.0937042  0.0939454  0.0917214  0.0916382  0.0931182  0.0932100
  ROUGE-L R avg  0.3104592  0.3091442  0.3030530  0.3051274  0.3089822  0.3141960  0.3079836  0.3123842  0.3162914  0.3123360
  ROUGE-L P avg  0.3448404  0.3431476  0.3434746  0.3414388  0.3410792  0.3402590  0.3372768  0.3384562  0.3385456  0.3407320
  ROUGE-L F avg  0.3255642  0.3241896  0.3211276  0.3212942  0.3230670  0.3256226  0.3209940  0.3237910  0.3261406  0.3249710
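For readers unfamiliar with the R / P / F rows, here is a simplified sketch of ROUGE-N against a single reference, assuming whitespace tokenization and a balanced F-score. The official ROUGE toolkit adds options such as stemming and multiple references, so this will not reproduce the numbers in the table; `ngrams` and `rouge_n` are illustrative names.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, f1) of n-gram overlap between two strings."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * recall * precision / (recall + precision) if overlap else 0.0
    return recall, precision, f1


r, p, f = rouge_n("the storm closed the roads", "the storm flooded roads")
```

Here three of the four reference unigrams are matched (recall 0.75) by a five-word candidate (precision 0.6); the repeated "the" in the candidate is counted only once because matches are clipped to the reference counts.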
Evaluation 2
  Dataset: DUC’06
  Unbiased / Biased   Rearranging doc length   # outlinks per sent per timestep   ROUGE-2   ROUGE-SU4
  Unbiased            no                       1                                  0.07563   0.12738
                      yes                                                         0.07042   0.12416
                                                                                  0.07513   0.13106
                                                                                  0.07752   0.13290
                                               2                                  0.08181   0.13733
                                               5                                  0.07789   0.13412
                                               10                                 0.07799   0.13153
Conclusion from Evaluation 2
  DUC’06 is query-based, so biased PageRank gives better results
  Rearranging doc length is not necessary if there is no extremely long document in the cluster
  #outlinks is important: different #outlinks give different inlink density
  We need to look at how the dimension of the graph (D * L) is related to the inlink density: F(D, L) => #outlinks