Text Summarization
John Frazier and Jonathan Perrier
Natural Language Problem Statement
Given a piece of text, we want to create an accurate summary using as little time and as few resources as possible.
Formal Language Problem Statement
Let D = (s_1, s_2, …, s_n) be a text document, with s_1, s_2, …, s_n the sentences of the document in the order they appear. Let w_{i,j} ∈ D be the words of the document, where i indexes the i-th sentence and j the j-th word within that sentence. Given a document D, extract a subset S = {s_i, …, s_k} ⊆ D and score summary quality with the Rouge-1 metric.
Algorithm 1: LexRank
Create a graph with a vertex for each sentence.
Create edges between sentences weighted by IDF-modified cosine similarity.
Apply PageRank to the graph; PageRank weights a sentence by the number of references other sentences make to it.
Return sentences according to their PageRank rankings.
O(n^2) complexity.
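The steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes lowercase whitespace tokenization, computes IDF over the document's own sentences, and runs PageRank by power iteration with the usual 0.85 damping factor.

```python
import math
from collections import Counter

def idf_modified_cosine(x, y, idf):
    # x, y are Counter word-frequency vectors for two sentences.
    num = sum(x[w] * y[w] * idf[w] ** 2 for w in x if w in y)
    den_x = math.sqrt(sum((x[w] * idf[w]) ** 2 for w in x))
    den_y = math.sqrt(sum((y[w] * idf[w]) ** 2 for w in y))
    return num / (den_x * den_y) if den_x and den_y else 0.0

def lexrank(sentences, damping=0.85, iters=50):
    n = len(sentences)
    vecs = [Counter(s.lower().split()) for s in sentences]
    # IDF computed over the document's sentences, each treated as a "document".
    idf = {w: math.log(n / sum(1 for v in vecs if w in v)) + 1.0
           for w in {w for v in vecs for w in v}}
    # O(n^2) pairwise similarity graph.
    sim = [[idf_modified_cosine(vecs[i], vecs[j], idf) for j in range(n)]
           for i in range(n)]
    row_sum = [sum(row) for row in sim]
    # PageRank power iteration over the row-normalized similarity matrix.
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n
                  + damping * sum(scores[j] * sim[j][i] / row_sum[j]
                                  for j in range(n) if row_sum[j] > 0)
                  for i in range(n)]
    return scores
```

The returned scores form a probability distribution over sentences; a summary takes the top-scoring sentences.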
Algorithm 2: Luhnβs Algorithm
Give every sentence a weight based on a significance factor:
significance factor = (# significant words in cluster) / (# words in cluster)
Cluster size is determined by taking the first and last use of a significant word as the beginning and end of a window and counting all words in that window.
Significant words are words with < 4 insignificant words in between each repetition of a significant word.
Return the sentences with the highest significance factors.
O(w) complexity.
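The clustering rule above can be sketched as follows. This is an assumed reading of the slide: clusters are runs of significant words separated by fewer than 4 insignificant words, and the factor is the ratio as written here (Luhn's original 1958 formulation squares the numerator).

```python
def luhn_significance(sentence_words, significant, max_gap=4):
    """Score a sentence by its best cluster's significance factor.

    A cluster is bracketed by significant words separated by fewer than
    `max_gap` insignificant words; the factor is the ratio of significant
    words in the cluster to total words in the cluster.
    """
    idx = [i for i, w in enumerate(sentence_words) if w in significant]
    if not idx:
        return 0.0
    best = 0.0
    start = prev = idx[0]
    for i in idx[1:] + [None]:
        # Extend the cluster while the gap of insignificant words is < max_gap.
        if i is not None and i - prev - 1 < max_gap:
            prev = i
            continue
        # Close the current cluster [start, prev] and score it.
        cluster = sentence_words[start:prev + 1]
        sig = sum(1 for w in cluster if w in significant)
        best = max(best, sig / len(cluster))
        if i is not None:
            start = prev = i
    return best
```

Each word position is visited a constant number of times, matching the O(w) complexity claimed on the slide.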
Algorithm 3: Brute Force/NaΓ―ve
Weigh each sentence by summing, over its words, the number of times each word is repeated in the document:
sentence weight = Σ_{j=1}^{m} count_D(w_{i,j})
Return the sentences with the highest weight.
O(w) complexity.
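The naïve weighting above fits in a few lines. A minimal sketch, assuming lowercase whitespace tokenization and no stop-word filtering (the Questions slide below notes why filtering would help):

```python
from collections import Counter

def naive_summarize(sentences, k):
    # Document-level word frequencies.
    tokenized = [s.lower().split() for s in sentences]
    freq = Counter(w for words in tokenized for w in words)
    # A sentence's weight is the sum of its words' document frequencies.
    weights = [sum(freq[w] for w in words) for words in tokenized]
    top = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)[:k]
    # Return the k heaviest sentences in original document order.
    return [sentences[i] for i in sorted(top)]
```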
Evaluation Method: Rouge-1 Score
Run the Rouge-1 metric on each summary against a "gold standard" summary written by us, for recall and precision:
Recall = (# unigrams found in both gold and model summaries) / (# unigrams in gold summary)
Precision = (# unigrams found in both gold and model summaries) / (# unigrams in model summary)
Each individual word is a unigram.
We want to maximize recall and precision.
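The two ratios above can be sketched directly. This simplification treats unigrams as sets; full ROUGE-1 clips repeated-unigram counts instead, so the numbers can differ when words repeat.

```python
def rouge1(gold_summary, model_summary):
    # Unigram sets for the reference and candidate summaries.
    gold = set(gold_summary.lower().split())
    model = set(model_summary.lower().split())
    shared = gold & model
    recall = len(shared) / len(gold)        # shared / gold unigrams
    precision = len(shared) / len(model)    # shared / model unigrams
    return recall, precision
```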
Evaluation: Determining Loop Value
When collecting data, the algorithms were looped to reduce the impact of program overhead and timer inefficiency.
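A timing harness in that spirit might look like this. The loop count and use of `time.perf_counter` are assumptions, not the authors' setup; looping amortizes per-call overhead and timer resolution into the average.

```python
import time

def average_runtime(fn, *args, loops=1000):
    # Loop the call so program overhead and timer granularity average out.
    start = time.perf_counter()
    for _ in range(loops):
        fn(*args)
    return (time.perf_counter() - start) / loops
```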
Evaluation: Algorithm Run Times
[Table: Avg. Recall and Avg. Precision for LexRank, Luhn, and Naïve at 50%, 100%, and 200% summary lengths; most cell values were lost in extraction. Surviving value: Luhn 100%: 0.6.]
[Table: Avg_Recall and Avg_Precision for LexRank, Luhn, and Naïve at 50%, 100%, and 200% summary lengths; most cell values were lost in extraction. Surviving values: Naïve 50%: 0.2488, LexRank 100%: 0.4067, Luhn 200%: 0.3467, Naïve 200%: 0.5933 (column assignment unrecoverable).]
[Table: Avg_Recall and Avg_Precision for LexRank, Luhn, and Naïve at 50%, 100%, and 200% summary lengths; most cell values were lost in extraction. Surviving value: Luhn 50%: 0.2069.]
Questions
Why should we have pre-processed the text for stop-words in our naïve algorithm? Because stop-words are so common yet carry little meaning, they skew sentence weights in favor of stop-words rather than more meaningful words.
What is the difference between extractive and abstractive summarization? Extraction pulls full sentences directly from the text, while abstraction uses machine learning to condense the text in a heuristic manner.
What is the difference between recall and precision? Recall is the ratio of shared unigrams to the unigrams in the gold-standard summary; precision is the ratio of shared unigrams to the unigrams in the model summary.
Questions Continued
What does PageRank do in the LexRank summary? PageRank determines sentence weight by measuring the number of other sentences that reference a given sentence.
Why does Luhn's Algorithm have only O(w) complexity? Because it only counts repetition within each sentence rather than comparing against the document as a whole.