Text Summarization
John Frazier and Jonathan Perrier

Natural Language Problem Statement
Given a piece of text, we want to create an accurate summary in as little time and with as few resources as possible.

Formal Language Problem Statement
Let $N = (n_1, n_2, \ldots, n_m)$ be a text document, where each $n_i$ is a sentence of the document, in the order in which it appears.
Let $w_{r,s} \in N$ be the words of the document, where $r$ is the index of the sentence and $s$ is the index of the word within that sentence (the $s$th word of the $r$th sentence).
Given a document $N$, extract a subset $S = \{n_i, \ldots, n_j\}$ such that $S \subseteq N$, and score the summary quality with the Rouge-1 metric.

Algorithm 1: LexRank
Create a graph by constructing a vertex at each sentence.
Create edges between sentences using IDF-modified cosine similarity.
Apply PageRank to the graph.
PageRank weights each sentence based on the number of references other sentences make to it.
Return sentences based on their PageRank rankings.
$O(n^2)$ complexity
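The LexRank steps above can be sketched roughly as follows. This is a minimal illustration, not the presenters' implementation: the function names, the `threshold`, `damping`, and `iterations` parameters, the choice to treat each sentence as a "document" when computing IDF, and the use of a fixed number of power-iteration steps are all assumptions made for the example.

```python
import math
from collections import Counter


def idf_modified_cosine(a, b, idf):
    """IDF-modified cosine similarity between two tokenized sentences."""
    tf_a, tf_b = Counter(a), Counter(b)
    shared = tf_a.keys() & tf_b.keys()
    numerator = sum(tf_a[w] * tf_b[w] * idf[w] ** 2 for w in shared)
    norm_a = math.sqrt(sum((tf_a[w] * idf[w]) ** 2 for w in tf_a))
    norm_b = math.sqrt(sum((tf_b[w] * idf[w]) ** 2 for w in tf_b))
    return numerator / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_scores(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Return one PageRank-style score per sentence (list of token lists)."""
    n = len(sentences)
    if n == 0:
        return []
    # Assumption: each sentence counts as one "document" for IDF purposes.
    doc_freq = Counter(w for s in sentences for w in set(s))
    idf = {w: math.log(n / df) for w, df in doc_freq.items()}
    # Comparing every pair of sentences is the O(n^2) step.
    neighbors = [
        [j for j in range(n)
         if j != i
         and idf_modified_cosine(sentences[i], sentences[j], idf) > threshold]
        for i in range(n)
    ]
    # Power iteration: a sentence inherits score from the sentences linked to it.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] / len(neighbors[j]) for j in neighbors[i])
            for i in range(n)
        ]
    return scores
```

Sorting sentence indices by these scores and keeping the top few would give the extracted summary. A sentence that shares vocabulary with many others ends up with the highest score, while an unrelated sentence keeps only the baseline share.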

Algorithm 2: Luhn’s Algorithm
Give all sentences a weight based on a significance factor.
$$\text{Significance Factor} = \frac{\#\,\text{significant words in a cluster}}{\#\,\text{words in a cluster}}$$
Cluster size is determined by placing the first and last use of a significant word as the beginning and end of an array and counting all words in the array.
Significant words are words with fewer than 4 insignificant words between each repetition of a significant word.
Return sentences with the highest significance factor.
$O(w)$ complexity
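The significance factor described above can be sketched as follows. This is a simplified illustration under stated assumptions: the set of significant words is passed in as a hypothetical `significant` argument rather than derived from the text, and the rule about fewer than 4 insignificant words between significant words is left out, so each sentence is treated as containing a single cluster.

```python
def significance_factor(sentence, significant):
    """Ratio of significant words to all words in the sentence's cluster.

    sentence: list of word tokens
    significant: set of words assumed to be significant (hypothetical input)
    """
    positions = [i for i, word in enumerate(sentence) if word in significant]
    if not positions:
        return 0.0
    # The cluster runs from the first to the last significant word, inclusive.
    cluster = sentence[positions[0]:positions[-1] + 1]
    return len(positions) / len(cluster)


example = ["the", "summary", "of", "a", "document", "is", "short"]
factor = significance_factor(example, {"summary", "document"})
```

In the example, the cluster is "summary of a document": 2 significant words out of 4, so `factor` is 0.5. Each word is visited a constant number of times, which is in line with the $O(w)$ bound stated on the slide.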

Algorithm 3: Brute Force/Naïve
Weigh each word in the document by summing the number of times the word is repeated in the document.
$$\text{Sentence Weight} = \frac{\sum_{i=1}^{s} w_{r,i}}{s}$$
Return sentences with the highest weight.
$O(w)$ complexity
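A minimal sketch of the naïve approach above, assuming sentences are already tokenized into lists of words. The function name `naive_summary`, the parameter `k` (number of sentences to return), and the choice to return the selected sentences in document order are assumptions made for illustration.

```python
from collections import Counter


def naive_summary(sentences, k=1):
    """Return the k sentences with the highest average word weight."""
    # Word weight = number of times the word occurs in the whole document.
    word_weight = Counter(word for sentence in sentences for word in sentence)

    def sentence_weight(sentence):
        if not sentence:
            return 0.0
        # Sum of the word weights divided by the number of words.
        return sum(word_weight[word] for word in sentence) / len(sentence)

    ranked = sorted(
        range(len(sentences)),
        key=lambda r: sentence_weight(sentences[r]),
        reverse=True,
    )
    # Assumption: present the chosen sentences in their original order.
    return [sentences[r] for r in sorted(ranked[:k])]


document = [
    ["cats", "like", "fish"],
    ["dogs", "bark"],
    ["cats", "like", "cats"],
]
top_two = naive_summary(document, k=2)
```

Here the word counts are cats = 3, like = 2, and 1 for the rest, so the sentence weights are 2, 1, and 8/3, and the first and third sentences are returned.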

Evaluation Method: Rouge-1 Score
Run the Rouge-1 metric on each summary, compared against a “gold standard” summary written by us, for recall and precision.
$$\text{Recall} = \frac{\#\,\text{unigrams occurring in both model and gold summaries}}{\#\,\text{unigrams in gold summary}}$$
$$\text{Precision} = \frac{\#\,\text{unigrams occurring in both model and gold summaries}}{\#\,\text{unigrams in model summary}}$$
Each individual word is a unigram.
We want to maximize recall and precision.
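The two ratios above can be computed with a short sketch like the one below. The tokenization (lowercasing and splitting on whitespace) and the clipped counting of repeated unigrams are assumptions; the slides do not say how repeated words were handled, and the function name `rouge_1` is only illustrative.

```python
from collections import Counter
from typing import Tuple


def rouge_1(model_summary: str, gold_summary: str) -> Tuple[float, float]:
    """Return (recall, precision) over unigrams."""
    model_counts = Counter(model_summary.lower().split())
    gold_counts = Counter(gold_summary.lower().split())
    # Clipped overlap: a unigram counts only as often as it appears in both.
    overlap = sum((model_counts & gold_counts).values())
    gold_total = sum(gold_counts.values())
    model_total = sum(model_counts.values())
    recall = overlap / gold_total if gold_total else 0.0
    precision = overlap / model_total if model_total else 0.0
    return recall, precision


recall, precision = rouge_1("the cat sat", "the cat sat on the mat")
```

In this example all 3 model unigrams appear in the 6-unigram gold summary, so recall is 3/6 = 0.5 and precision is 3/3 = 1.0.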

Evaluation: Determining Loop Value
When collecting data, the algorithms were looped to reduce the impact of program overhead and timer inefficiency.
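A looped measurement of this kind might look like the sketch below. The helper name `average_run_time` and the default of 1000 loops are hypothetical; the slides do not state the loop value that was actually used.

```python
import time


def average_run_time(func, *args, loops=1000):
    """Average wall-clock seconds per call of func(*args) over `loops` runs."""
    start = time.perf_counter()
    for _ in range(loops):
        func(*args)
    elapsed = time.perf_counter() - start
    # Dividing by the loop count spreads timer and call overhead across runs.
    return elapsed / loops


seconds_per_call = average_run_time(sorted, [3, 1, 2], loops=100)
```

Timing many iterations and dividing by the count makes a single short run less sensitive to timer resolution, which is the motivation given above.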

Evaluation: Algorithm Run Times

System Name Avg. Recall Avg. Precision
LexRank 50%
Luhn 50%
Naïve 50%
LexRank 100%
Luhn 100% 0.6
Naïve 100%
LexRank 200%
Luhn 200%
Naïve 200%

System Name Avg. Recall Avg. Precision
LexRank 50%
Luhn 50%
Naïve 50% 0.2488
Luhn 100%
LexRank 100% 0.4067
Naïve 100%
LexRank 200%
Luhn 200% 0.3467
Naïve 200% 0.5933

System Name Avg. Recall Avg. Precision
LexRank 50%
Luhn 50% 0.2069
Naïve 50%
LexRank 100%
Luhn 100%
Naïve 100%
LexRank 200%
Luhn 200%
Naïve 200%

Questions
Why should we have pre-processed the text for stop-words in our naïve algorithm?
Because stop-words are so common and carry little meaning, they skew sentence weight toward stop-words rather than more meaningful words.
What is the difference between extractive and abstractive summarization?
Extraction pulls full sentences directly from the text, while abstraction uses machine learning to condense text in a heuristic manner.
What is the difference between recall and precision?
Recall is the ratio of shared unigrams to the unigrams in the gold-standard summary. Precision is the ratio of shared unigrams to the unigrams in the model summary.

Questions Continued
What does PageRank do in the LexRank summary?
PageRank determines sentence weight by measuring the number of sentences that reference a given sentence.
Why does Luhn’s Algorithm only have $O(w)$ complexity?
Because it only counts repetition within each sentence rather than comparing against the document as a whole.
