Approaches to automatic summarization Lecture 5
Types of summaries
Extracts – Sentences from the original document are displayed together to form a summary
Abstracts – Material is transformed: paraphrased, restructured, shortened
Extractive summarization
Each sentence is assigned a score that reflects how important and contentful it is
Data-driven approaches
– Do not use any domain knowledge or external resources
– Importance "emerges" from the data
– Probabilistic models of word occurrence and sentence similarity
Sentence ranking options
Based on word probability
– S is a sentence of length n
– P_i is the probability of the i-th word in the sentence
– Score(S) = (1/n) * sum_{i=1..n} P_i, i.e. the average word probability
Based on word tf.idf
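A minimal sketch of word-probability scoring, assuming P_i is estimated from word frequencies in the document itself (the function and variable names are my own, not from the lecture):

```python
from collections import Counter

def sentence_scores(sentences):
    """Score each sentence by the average probability of its words,
    where P_i is estimated as the word's relative frequency in the document."""
    words = [w for s in sentences for w in s.lower().split()]
    counts = Counter(words)
    total = len(words)
    scores = []
    for s in sentences:
        toks = s.lower().split()
        # Score(S) = (1/n) * sum of P_i over the n words of the sentence
        scores.append(sum(counts[w] / total for w in toks) / len(toks))
    return scores
```

Sentences made of frequent words score higher; tf.idf weights could be substituted for the raw probabilities in the same loop.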
Centrality measures
How representative is a sentence of the overall content of a document?
– The more similar a sentence is to the document, the more representative it is
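A centrality sketch, assuming "similarity" means cosine similarity between bag-of-words vectors (a common choice; the slides do not fix the measure):

```python
import math
from collections import Counter

def centrality(sentences):
    """Cosine similarity between each sentence's bag-of-words vector
    and the whole document's vector; higher = more representative."""
    doc = Counter(w for s in sentences for w in s.lower().split())

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    return [cosine(Counter(s.lower().split()), doc) for s in sentences]
```

A sentence sharing many words with the rest of the document gets a score near 1; an off-topic sentence scores near 0.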
Data-driven approach
Unsupervised---no information about what constitutes a desirable choice
How can supervised approaches be used?
– For example, the scientific article summarization paper from last week
Rhetorical status
What is the purpose of the sentence? To communicate:
– Background
– Aim
– Basis (related work)
How can we know which sentence serves each aim?
Rhetorical zones
Distribution of categories
Selecting important sentences (relevance)
How well can it be performed by people?
– Rather subjective; depends on prior knowledge and interests
– Even the same person would select 50% different sentences when performing the task at different times
Still, judgments can be solicited from several people to mitigate the problem
For each sentence in an article---say whether it is important and interesting enough to be included in a summary
Annotated data
80 computational linguistics articles
Can be used to train classifiers
– Given a sentence, which rhetorical class does it belong to?
– Given a sentence, should it be included in the summary or not?
Features
Location
– Absolute location of the sentence
– Section structure: first sentence, last sentence, other
– Paragraph structure
What section the sentence appeared in
– Introduction, implementation, example, conclusion, result, evaluation, experiment, etc.
Sentence length
– Very long and very short sentences are unusual
Title word overlap
Tf.idf word content
– Binary feature: "yes" if the sentence contains one of the 18 most important words, "no" otherwise
Presence and type of citation
Formulaic expressions
– "in traditional approaches", "a novel method for"
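The feature set above could be assembled roughly as follows. TOP_WORDS and FORMULAIC are illustrative stand-ins for resources that would actually be derived from the corpus (the 18 top tf.idf words, a cue-phrase list), and the citation pattern is a simplification:

```python
import re

# Assumed stand-ins for corpus-derived resources (not from the lecture)
TOP_WORDS = {"summarization", "sentence", "model"}
FORMULAIC = ["in traditional approaches", "a novel method for"]

def features(sentence, index, n_sentences, title):
    """Map one sentence to a simple feature dictionary."""
    s = sentence.lower()
    toks = s.split()
    return {
        "abs_location": index / n_sentences,   # absolute position in the article
        "length": len(toks),                   # very long/short is unusual
        "title_overlap": len(set(toks) & set(title.lower().split())),
        "has_top_word": any(w in toks for w in TOP_WORDS),
        "has_citation": bool(re.search(r"\(\w+,? \d{4}\)", s)),  # e.g. "(smith, 1999)"
        "formulaic": any(p in s for p in FORMULAIC),
    }
```

Each sentence thus becomes a small feature vector that a classifier can consume, independent of the raw words.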
Important lessons for us
Vector representation of sentences
– Can be words
– But can also be other features!
The probability of a sentence belonging to a class can be computed
Complex distinctions can be accurately predicted using simple features
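As a sketch of computing the probability of a sentence's class from simple features, here is a tiny Naive Bayes with add-one smoothing over binary features (one possible classifier; the paper's actual model may differ, and the helper names are mine):

```python
from collections import defaultdict

def train_nb(examples):
    """Train on (feature_set, label) pairs; return a predict function
    that picks the class with the highest (unnormalized) probability."""
    class_counts = defaultdict(int)
    feat_counts = defaultdict(lambda: defaultdict(int))
    for feats, label in examples:
        class_counts[label] += 1
        for f in feats:
            feat_counts[label][f] += 1

    def predict(feats):
        total = sum(class_counts.values())
        best, best_p = None, 0.0
        for c, n in class_counts.items():
            p = n / total                                   # prior P(c)
            for f in feats:
                p *= (feat_counts[c][f] + 1) / (n + 2)      # add-one smoothing
            if p > best_p:
                best, best_p = c, p
        return best

    return predict
```

The per-class scores, once normalized, are exactly the "probability of a sentence belonging to a class" that the slide refers to.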
Problems with ML for summarization
Annotation is expensive
– Here---relevance and rhetorical status judgments
People don’t agree
– So more annotators are necessary
– And/or more training of the annotators