John Frazier and Jonathan Perrier


1 Text Summarization
John Frazier and Jonathan Perrier

2 Natural Language Problem Statement
Given a piece of text, we want to create an accurate summary in as little time and with as few resources as possible.

3 Formal Language Problem Statement
Let $N = \{n_1, n_2, \ldots, n_m\}$ be a text document, with $n_1, n_2, \ldots, n_m$ each being a sentence of the document in the order that it appears.
Let $w_{r,s}$ denote the words of $N$, where $r$ is the index of the sentence in the document and $s$ is the position of the word within that sentence.
Given a document $N$, extract a subset $S = \{n_i, \ldots, n_j\}$ such that $S \subseteq N$, and score summary quality with the Rouge-1 metric.

4 Algorithm 1: LexRank
Create a graph by constructing a vertex at each sentence
Create edges between sentences using IDF-modified cosine similarity
Apply PageRank to the graph
PageRank weights each sentence by the number of references other sentences make to it
Return sentences based on their PageRank rankings
$O(n^2)$ complexity
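The LexRank steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the function names, the similarity `threshold`, the `damping` factor, the fixed iteration count, and the `+ 1.0` IDF smoothing are all assumptions chosen for the example, and each sentence is treated as its own "document" when computing IDF.

```python
import math
from collections import Counter


def idf_modified_cosine(tf_a, tf_b, idf):
    """IDF-modified cosine similarity between two sentences (term-count dicts)."""
    shared = set(tf_a) & set(tf_b)
    numerator = sum(tf_a[w] * tf_b[w] * idf[w] ** 2 for w in shared)
    norm_a = math.sqrt(sum((tf_a[w] * idf[w]) ** 2 for w in tf_a))
    norm_b = math.sqrt(sum((tf_b[w] * idf[w]) ** 2 for w in tf_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return numerator / (norm_a * norm_b)


def lexrank(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Return one PageRank-style score per sentence (each sentence is a list of words)."""
    n = len(sentences)
    if n == 0:
        return []
    counts = [Counter(s) for s in sentences]
    # Assumption: IDF is computed over sentences, with +1.0 smoothing so that
    # words appearing in every sentence still carry some weight.
    doc_freq = Counter(w for c in counts for w in c)
    idf = {w: math.log(n / df) + 1.0 for w, df in doc_freq.items()}

    # Comparing every pair of sentences is the O(n^2) step.
    adjacency = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and idf_modified_cosine(counts[i], counts[j], idf) > threshold:
                adjacency[i][j] = 1.0

    # Plain power iteration; sentences with no edges only keep the base score.
    degree = [sum(row) for row in adjacency]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                scores[j] * adjacency[j][i] / degree[j]
                for j in range(n)
                if degree[j] > 0
            )
            for i in range(n)
        ]
    return scores
```

A summary would then take the highest-scoring sentences and return them in document order. Sentences that share vocabulary with many others end up with higher scores than isolated ones.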

5 Algorithm 2: Luhn's Algorithm
Give all sentences a weight based on a significance factor
$\text{Significance Factor} = \frac{\#\,\text{significant words in a cluster}}{\#\,\text{words in a cluster}}$
Cluster size is determined by placing the first and last use of a significant word as the beginning and end of an array and counting all words in the array
Significant words are grouped into a cluster when fewer than 4 insignificant words lie between consecutive significant words
Return sentences with highest significance factor
$O(w)$ complexity
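As a rough sketch of the significance factor described on this slide, the function below scores one sentence given a set of significant words. The function name, the `max_gap` parameter, and the way the set of significant words is supplied are assumptions for illustration. It uses the slide's ratio as written; Luhn's published formula squares the significant-word count before dividing, which keeps a lone significant word from scoring a perfect 1.0.

```python
def significance_factor(words, significant, max_gap=4):
    """Best cluster score for one sentence: significant words / words in the cluster."""
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0

    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 < max_gap:
            # Fewer than max_gap insignificant words in between: same cluster.
            count += 1
        else:
            # Close the current cluster and start a new one at this word.
            best = max(best, count / (prev - start + 1))
            start = pos
            count = 1
        prev = pos
    return max(best, count / (prev - start + 1))
```

Each sentence is scanned once, so the work grows linearly with the number of words, which matches the stated $O(w)$ complexity.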

6 Algorithm 3: Brute Force/Naïve
Weight each word by the number of times it occurs in the document
$\text{Sentence Weight} = \frac{\sum_{i=1}^{s} w_{r,i}}{s}$
Return sentences with highest weight
$O(w)$ complexity
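A minimal sketch of the naïve approach, assuming the sentence weight is the sum of the document-wide counts of its words divided by the sentence length $s$. The function names and the choice to return selected sentences in document order are assumptions for illustration.

```python
from collections import Counter


def naive_sentence_weights(sentences):
    """Weight each sentence by the average document-wide count of its words."""
    word_counts = Counter(w for s in sentences for w in s)
    return [
        sum(word_counts[w] for w in s) / len(s) if s else 0.0
        for s in sentences
    ]


def top_sentences(sentences, k):
    """Return the k highest-weighted sentences, kept in document order."""
    weights = naive_sentence_weights(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]
```

Because every word contributes its raw count, very common words dominate the weights unless they are filtered out beforehand.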

7 Evaluation Method: Rouge-1 Score
Run the Rouge-1 metric on summaries compared to a "gold standard" summary written by us for Recall and Precision
$\text{Recall} = \frac{\#\,\text{unigrams occurring in both model and gold summaries}}{\#\,\text{unigrams in gold summary}}$
$\text{Precision} = \frac{\#\,\text{unigrams occurring in both model and gold summaries}}{\#\,\text{unigrams in model summary}}$
Each individual word is a unigram
Want to maximize recall and precision
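The two ratios can be computed directly from word counts. The sketch below is an illustration rather than the scoring tool used in the presentation: the function name is hypothetical, tokenization is assumed to have happened already, and repeated words are counted with clipping (a word matches at most as many times as it appears in the other summary).

```python
from collections import Counter


def rouge_1(model_tokens, gold_tokens):
    """Return (recall, precision) of unigram overlap between two token lists."""
    model_counts = Counter(model_tokens)
    gold_counts = Counter(gold_tokens)
    # Counter intersection keeps the minimum count of each shared unigram.
    overlap = sum((model_counts & gold_counts).values())
    recall = overlap / len(gold_tokens) if gold_tokens else 0.0
    precision = overlap / len(model_tokens) if model_tokens else 0.0
    return recall, precision
```

For example, a six-word model summary that shares five unigrams with a seven-word gold summary has recall 5/7 and precision 5/6.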

8 Evaluation: Determining Loop Value
When collecting data, the algorithms were looped to reduce the impact of program overhead and timer inefficiency.
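One way to loop an algorithm for timing is sketched below. The helper name and the loop count in the usage note are assumptions, and `time.perf_counter` stands in for whatever timer the presenters used.

```python
import time


def average_run_time(func, loops, *args):
    """Run func(*args) `loops` times and return the mean seconds per call."""
    start = time.perf_counter()
    for _ in range(loops):
        func(*args)
    elapsed = time.perf_counter() - start
    # Timer start/stop cost is paid once and spread over all iterations.
    return elapsed / loops
```

Calling `average_run_time(sorted, 1000, [3, 1, 2])`, for instance, spreads the fixed cost of reading the clock over 1000 calls, so the per-call figure is less sensitive to timer resolution.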

9 Evaluation: Algorithm Run Times

10 System Name, Avg. Recall, Avg. Precision
LexRank 50%
Luhn 50%
Naïve 50%
LexRank 100%
Luhn 100% 0.6
Naïve 100%
LexRank 200%
Luhn 200%
Naïve 200%

11 System Name, Avg. Recall, Avg. Precision
LexRank 50%
Luhn 50%
Naïve 50% 0.2488
Luhn 100%
LexRank 100% 0.4067
Naïve 100%
LexRank 200%
Luhn 200% 0.3467
Naïve 200% 0.5933

12 System Name, Avg. Recall, Avg. Precision
LexRank 50%
Luhn 50% 0.2069
Naïve 50%
LexRank 100%
Luhn 100%
Naïve 100%
LexRank 200%
Luhn 200%
Naïve 200%

13 Questions
Why should we have pre-processed the text for stop-words in our naïve algorithm?
Because stop-words are so common and lack significant meaning, they skew sentence weight in favor of stop-words rather than more meaningful words.
What is the difference between extractive and abstractive summarization?
Extraction pulls full sentences directly from the text, while abstraction uses machine learning to generate a condensed rewording of the text.
What is the difference between recall and precision?
Recall is the ratio of shared unigrams to the unigrams in the gold-standard summary. Precision is the ratio of shared unigrams to the unigrams in the model summary.
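The stop-word point can be illustrated with a short filter applied before counting. The tiny `STOP_WORDS` set and the function name here are placeholders for illustration only; a real run would use a fuller stop-word list.

```python
from collections import Counter

# Illustrative stop-word list only; not the list any particular tool uses.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def word_weights(words, stop_words=STOP_WORDS):
    """Count word occurrences, ignoring case and skipping stop-words."""
    return Counter(w.lower() for w in words if w.lower() not in stop_words)
```

With the filter in place, a sentence no longer gains weight simply by containing many occurrences of words such as "the".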

14 Questions Continued
What does PageRank do in the LexRank summary?
PageRank determines sentence weight by measuring the number of sentences that reference a given sentence.
Why does Luhn's Algorithm only have $O(w)$ complexity?
Because it only counts repetition within each sentence rather than compared to the document as a whole.

