Published by Bambang Makmur. Modified over 6 years ago.
1
The Automatic Summarization of Text Documents
By Team [TL;DR]
Contributors: Jeremy Brown, Bryan Winters, and Austin Ray
2
Abstract: Description of Problem Statement
When online, users:
- Can be overloaded by what they see every day
- May not have enough time to fully engage with the content they encounter
This can cause users to settle for less when they should be getting the most benefit they can.
3
Proposed Solution
The automatic summarization of bodies of text.
- Users can get the same information without having to read a full article, web post, etc.
Automatic Summarization
- Generating a summary of text using an algorithm that reduces the total number of sentences.
- Retains main points and readability.
4
Our Plan
Individual Creation of Three Summarization Algorithms
- Based on work done by others
- Language agnostic
- Requires a preprocessor
Evaluation of Algorithms
- Speed
- Efficiency
- Readability
5
TextRank
Based on work done by Rada Mihalcea and Paul Tarau
How Jeremy’s Implementation Works
- The original text file is run through the preprocessor, which breaks the text up into sentences
- Computes sentence similarity using a similarity equation
- Grades each sentence based on this similarity
- Takes a user-given number of sentences to return
- Outputs a text file of the top-graded sentences in the order they appear in the original text
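The steps above can be sketched roughly as follows. This is a minimal illustration, not Jeremy's actual code: the function names, the `damping` and `iterations` parameters, and the use of unique lowercase words as the sentence length are all assumptions made for the example. It expects sentences that have already been split by a preprocessor.

```python
import math
import re


def words(sentence):
    """Unique lowercase words in a sentence (a simplification for this sketch)."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def similarity(a, b):
    """Shared-word count, normalized by the log of each sentence's length."""
    wa, wb = words(a), words(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # log(1) == 0 could make the denominator zero
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank_summary(sentences, n, damping=0.85, iterations=50):
    """Return the n top-graded sentences in their original order."""
    count = len(sentences)
    # Weighted, undirected graph: sim[i][j] is the edge weight between sentences.
    sim = [[similarity(a, b) if i != j else 0.0 for j, b in enumerate(sentences)]
           for i, a in enumerate(sentences)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0] * count
    for _ in range(iterations):
        # Weighted PageRank-style update: each neighbor j passes on a share
        # of its score proportional to the weight of its edge to i.
        scores = [
            (1 - damping) + damping * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(count) if out_weight[j] > 0
            )
            for i in range(count)
        ]
    top = sorted(range(count), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with any other sentence receives only the base score of `1 - damping`, so it is the last to be selected.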
6
TextRank’s Similarity Equation
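The equation image from this slide did not survive in the transcript. For reference, the sentence similarity defined by Mihalcea and Tarau in the TextRank paper is shown below, where each sentence is treated as a collection of words w_k and |S| is the number of words in sentence S. Whether the team's implementation uses it unchanged is an assumption.

```latex
\[
\mathrm{Similarity}(S_i, S_j) =
  \frac{\left|\{\, w_k \mid w_k \in S_i \text{ and } w_k \in S_j \,\}\right|}
       {\log |S_i| + \log |S_j|}
\]
```

The denominator is the normalization term: it grows with sentence length, so a long sentence is not favored simply because it has more chances to share words.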
7
SumBasic Based on the work of Nenkova and Vanderwende
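The body of this slide did not survive in the transcript. As a rough sketch of the general SumBasic idea (not the team's implementation): score each sentence by the average probability of its words, pick the best one, then square the probability of every word in the chosen sentence before picking again. The function names are hypothetical, and this simplified version omits the original algorithm's requirement that the chosen sentence contain the currently most probable word. Returning the result in original document order is also an assumption.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase word tokens; a stand-in for the real preprocessor."""
    return re.findall(r"[a-z']+", sentence.lower())


def sumbasic(sentences, n):
    """Pick up to n sentences using a simplified SumBasic loop."""
    tokens = [tokenize(s) for s in sentences]
    counts = Counter(w for sent in tokens for w in sent)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    chosen = []
    candidates = [i for i, t in enumerate(tokens) if t]
    while candidates and len(chosen) < n:
        # Score = average word probability under the current distribution.
        best = max(
            candidates,
            key=lambda i: sum(prob[w] for w in tokens[i]) / len(tokens[i]),
        )
        chosen.append(best)
        candidates.remove(best)
        # Squaring shrinks the probability of words already covered,
        # so later picks favor material the summary does not yet contain.
        for w in set(tokens[best]):
            prob[w] **= 2
    return [sentences[i] for i in sorted(chosen)]
```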
8
SMMRY
Based on the SMMRY algorithm developed in 2009
How SMMRY Works
- The original text file is put through the preprocessor, which breaks the text up into sentences
- Assigns tokens (points) to each word
- Counts and sorts word frequency based on total tokens
- Reorders sentences based on total tokens per sentence
- Outputs a text file of the top sentences; the number of sentences is determined by the user
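One way to read the steps above is as plain frequency scoring. The sketch below assumes that a word's points equal its frequency in the whole text and that a sentence's total is the sum of its words' points; the slide does not spell out either detail. The function name is hypothetical, and the input is a list of sentences already split by a preprocessor.

```python
import re
from collections import Counter


def smmry_like_summary(sentences, n):
    """Return up to n sentences, highest total points first."""
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    # Points per word: how often the word occurs across the whole text.
    points = Counter(w for sent in tokenized for w in sent)
    # Total points per sentence: sum over every word occurrence in it.
    totals = [sum(points[w] for w in sent) for sent in tokenized]
    ranked = sorted(range(len(sentences)), key=lambda i: totals[i], reverse=True)
    # Slicing caps the summary at the length of the original text.
    return [sentences[i] for i in ranked[:n]]
```

Under these assumptions, long sentences and very common words dominate the ranking, which may be why filtering out function words appears in the future work.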
9
SMMRY Graphs
10
Running our Algorithms
Ran each algorithm to summarize five texts:
- Cinderella by the Grimm Brothers: originally 156 sentences; summaries of 50 sentences and 25 sentences
- Gettysburg Address by Abraham Lincoln: originally 10 sentences; summaries of 5 sentences and 2 sentences
- Eisenhower’s Farewell Address: originally 84 sentences; summaries of 25 sentences and 10 sentences
- General MacArthur’s ‘Duty, Honor, Country’
- Yellow Submarine by The Beatles: originally 36 sentences; summaries of 12 sentences and 6 sentences
11
Comparing Algorithm Results - Speed
12
Comparing Algorithm Results - Efficiency (CPU)
13
Comparing Algorithm Results - Efficiency (Memory)
14
Readability Rubric
Each criterion is scored from 4 (best) to 1 (worst).

Purpose Statement Clear
  4: Purpose statement is clearly defined within the summary.
  3: The purpose statement is there but is not fully defined.
  2: The purpose statement can be found in bits and pieces.
  1: No purpose statement.

Starts Well
  4: Clearly starts the summary off in the direction the original author intended.
  3: Less clear, but the preceding sentences help get the summary started.
  2: The start of the paragraph is in the summary somewhere.
  1: The start is at the end, or the summary did not include something to start the text off.

Ends Well
  4: Clearly ends the summary in the direction the original author intended.
  3: Less clear, but the preceding sentences help get the summary ended.
  2: The end of the paragraph is in the summary somewhere.
  1: The end is at the start, or the summary just trailed off without a conclusion.

Sentence Order
  4: Sentences are in a correct order such that reading is fluid and understandable.
  3: Majority of sentences are in order such that the reader has little trouble understanding.
  2: Majority of sentences are out of order, but the reader can understand the summary with difficulty.
  1: Every sentence is out of order, presenting an unintelligible summary.

Main Points Clear
  4: Reader can determine all the main points of the original document from the summary.
  3: Reader can determine the majority of main points of the original document from the summary.
  2: Reader cannot determine the majority of main points of the original document from the summary.
  1: Reader cannot determine any of the main points of the original document from the summary.

Tone
  4: Tone is the same as the original document.
  3: Tone is changed slightly (e.g. positive to mostly positive).
  2: Tone is changed vastly (e.g. positive tone to mostly negative tone).
  1: Tone is opposite of original document (e.g. positive to negative).
15
Comparing Algorithm Results - Readability
16
Comparing Algorithm Results - Conclusion
SMMRY
- Pros: Lowest memory usage, can produce good summaries
- Cons: Slowest, highest CPU usage on average, can produce bad summaries
TextRank
- Pros: Highest average readability scores, consistent CPU usage
- Cons: Slower than SumBasic, requires more memory than SMMRY
SumBasic
- Pros: Consistently the fastest, consistent readability scores
- Cons: Highest memory usage, inconsistent CPU usage, average readability scores
17
Future Work
TextRank
- Look into implementing a regular expression for the preprocessor
- Figure out ways to decrease RAM usage
- Have every sentence cast a vote for every other sentence
- Deal with sentence co-occurrence
SMMRY
- Prevent conjunction words from being tokenized
- Do not allow summaries to be longer than the original text
- Reduce NLTK library size
- Handle Unicode encodings other than ASCII
SumBasic
- Better dictionary lookups
- Associate grammar counterparts and contractions (e.g. city and cities, it’s and it is)
- Reduce the implementation’s time complexity
18
Question #1 In the TextRank Similarity Equation, what is the purpose of normalizing?
19
Answer #1 To avoid promoting long sentences.
20
Question #2 What is the purpose of a preprocessor?
21
Answer #2 A preprocessor is needed to separate sentences in order to compare and rank them.
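A minimal sketch of such a preprocessor, assuming a regular-expression approach (the kind of approach the slides list as future work; the team's actual preprocessor is not shown in the slides). The function name is hypothetical.

```python
import re


def split_sentences(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

This naive rule also splits after abbreviations such as "Dr." or "U.S.", so a production preprocessor would need extra handling for those cases.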
22
Question #3 What defines a good summary?
23
Answer #3 A good summary has a clear purpose statement, starts and ends well, maintains sentence order, clearly states the main points, and keeps the tone consistent with the original text.
24
Question #4 What type of approach do our three algorithms utilize?
25
Answer #4 Extraction. An extractive summary is formed by selecting a subset of the existing words, phrases, or sentences of the original text.
26
Question #5 What is the purpose of squaring the probability of each word in a chosen sentence in the SumBasic algorithm?
27
Answer #5 To provide context sensitivity, i.e. to change the notion of “what is most important to include” based on already summarized information.
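A small worked example may help. The numbers are invented for illustration and are not from the project's data: suppose word w has probability 0.30 and a rarer word v has probability 0.10. Once a sentence containing w has been added to the summary, w's probability is squared:

```latex
\[
p_{\text{new}}(w) = p(w)^2 = 0.30^2 = 0.09 < 0.10 = p(v)
\]
```

Because every probability lies between 0 and 1, squaring always shrinks it. After the update, a sentence about v scores higher than another sentence that only repeats w.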
28
Question #6 Why is our project 380?
29
Answer #6 But really… The class is about the design and analysis of algorithms. Obviously, we designed three algorithms with unique approaches. And oh, we also analyzed them ;)
30
The End