The use of unlabeled data to improve supervised learning for text summarization
M.-R. Amini, P. Gallinari (SIGIR 2002)
Slides prepared by Jon Elsas for the Semi-supervised NL Learning Reading Group
Presentation Outline
– Overview of Document Summarization
– Major contribution: Semi-Supervised Logistic Classification for Maximum Likelihood summaries
– Evaluation
  – Baseline Systems
  – Results
Document Summarization
Motivation: [text volume] >> [user’s time]
– Single Document Summarization: used for display of search results, automatic ‘abstracting’, browsing, etc.
– Multi-Document Summarization: describes clusters & document collections, QA, etc.
Problem: What is the summary used for? Does a generic summary exist?
Single Document Summarization example
Document Summarization
Generative Summaries:
– Synthetic text produced after analysis of high-level linguistic features: discourse, semantics, etc.
– Hard.
Extract Summaries:
– Text excerpts (usually sentences) composed together to create a summary
– Boils down to a passage classification/ranking problem
Major Contribution
Semi-supervised Logistic Classifying Expectation Maximization (CEM) for passage classification
Advantages over other methods:
– Works on a small set of labeled data + a large set of unlabeled data
– No modeling assumptions for density estimation
Cons:
– (probably) slow; no performance numbers given
Expectation Maximization (EM)
– Finds maximum likelihood estimates of parameters when the underlying distribution depends on unobserved latent variables.
– Maximizes model fit to the data distribution.
Criterion function:
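(The criterion-function image on this slide did not survive transcription. For n data points and a K-component mixture with mixing weights π_k and component parameters θ_k, the standard maximum-likelihood criterion that EM ascends is presumably:)

\[
L(\Theta) \;=\; \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, f(x_i \mid \theta_k)
\]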
Classifying EM (CEM)
– Like EM, with the addition of an indicator variable for component membership.
– Maximizes ‘quality’ of clustering.
Criterion function:
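(This equation image is also missing. CEM’s classification maximum-likelihood criterion, following Celeux & Govaert, replaces EM’s soft sum with hard indicator variables t_ik ∈ {0,1} marking which component each point belongs to:)

\[
L_C(t, \Theta) \;=\; \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} \, \log\bigl[\pi_k \, f(x_i \mid \theta_k)\bigr]
\]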
Semi-supervised generative-CEM
– Fix component membership for the labeled data.
Criterion function (terms annotated labeled/unlabeled below):
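(With the equation image lost, a plausible reconstruction splits the CEM criterion over the labeled set D_l, whose indicators t_ik are fixed by the labels, and the unlabeled set D_u, whose indicators are re-estimated at each classification step; the underbrace labels are the “Labeled Data / Unlabeled Data” annotations from the slide:)

\[
L_C(t, \Theta) \;=\; \underbrace{\sum_{x_i \in D_l} \sum_{k=1}^{K} t_{ik} \log\bigl[\pi_k f(x_i \mid \theta_k)\bigr]}_{\text{Labeled Data}} \;+\; \underbrace{\sum_{x_i \in D_u} \sum_{k=1}^{K} \tilde{t}_{ik} \log\bigl[\pi_k f(x_i \mid \theta_k)\bigr]}_{\text{Unlabeled Data}}
\]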
Semi-supervised logistic-CEM
– Use a discriminative classifier (logistic) instead of a generative one.
– M-step: re-run gradient descent to estimate the β’s.
The criterion function again splits into labeled-data and unlabeled-data terms (equation image not preserved).
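Neither the slides nor this transcription include code, so here is a minimal NumPy sketch of the loop described above: a hard classification step on the unlabeled sentences alternating with a gradient-based logistic M-step. The function names, 0.5 decision threshold, learning rate, and convergence test are my assumptions, not the paper’s.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, beta, lr=0.1, iters=500):
    # M-step: gradient ascent on the Bernoulli log-likelihood of (X, y).
    for _ in range(iters):
        p = sigmoid(X @ beta)
        beta = beta + lr * X.T @ (y - p) / len(y)
    return beta

def logistic_cem(X_lab, y_lab, X_unlab, max_rounds=50):
    # Prepend an intercept column to both feature matrices.
    Xl = np.hstack([np.ones((len(X_lab), 1)), X_lab])
    Xu = np.hstack([np.ones((len(X_unlab), 1)), X_unlab])
    # Initialize beta on the labeled data alone; labeled memberships stay fixed.
    beta = fit_logistic(Xl, y_lab, np.zeros(Xl.shape[1]))
    y_u = (sigmoid(Xu @ beta) >= 0.5).astype(float)  # initial hard assignments
    for _ in range(max_rounds):
        # M-step: re-estimate beta on labeled + pseudo-labeled data.
        beta = fit_logistic(np.vstack([Xl, Xu]), np.concatenate([y_lab, y_u]), beta)
        # C-step: re-assign the unlabeled points with the updated classifier.
        y_new = (sigmoid(Xu @ beta) >= 0.5).astype(float)
        if np.array_equal(y_new, y_u):  # partition unchanged: converged
            break
        y_u = y_new
    return beta, y_u

Each sentence here would be represented by the five features listed later under “Logistic-CEM: Sentence Representation Features”.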
Evaluation
Algorithm evaluated against 3 other single-document summarization algorithms:
– Non-trainable system: passage ranking
– Trainable system: Naïve Bayes sentence classifier
– Generative-CEM (using full Gaussians)
Precision/Recall with regard to gold-standard extract summaries
The fine print:
– All systems used *similar* representation schemes, but not the same…
Baseline System: Sentence Ranking
Rank sentences using a TF-IDF similarity measure with query expansion (Sim₂):
– Blind relevance feedback from the top sentences
– WordNet similarity thesaurus
Generic query created with the most frequent words in the training set.
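The slide does not reproduce the Sim₂ formula. Stripped of the relevance-feedback and WordNet expansion steps, the core ranking operation reduces to TF-IDF cosine similarity against the query, roughly as in this scikit-learn sketch (my construction, not the paper’s code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(sentences, query):
    # Fit TF-IDF on the candidate sentences and embed the query in the same space.
    vec = TfidfVectorizer()
    S = vec.fit_transform(sentences)
    q = vec.transform([query])
    # Cosine similarity of each sentence to the query, highest first.
    scores = cosine_similarity(S, q).ravel()
    order = scores.argsort()[::-1]
    return [(sentences[i], scores[i]) for i in order]

Blind relevance feedback would then add terms from the top-ranked sentences to the query and re-rank.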
Naïve Bayes Model: Sentence Classification
Simple Naïve Bayes classifier trained on 5 features:
1. Sentence length < t_length  {0,1}
2. Sentence contains ‘cue words’  {0,1}
3. Sentence–query similarity (Sim₂) > t_sim  {0,1}
4. Upper-case/acronym features (count?)
5. Sentence/paragraph position in text  {1, 2, 3}
Logistic-CEM: Sentence Representation Features
Features used to train Logistic-CEM:
1. Normalized sentence length  [0, 1]
2. Normalized ‘cue word’ frequency  [0, 1]
3. Sentence–query similarity (Sim₂)  [0, ∞)
4. Normalized acronym frequency  [0, 1]
5. Sentence/paragraph position in text  {1, 2, 3}
(All of the binary features converted to continuous.)
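To make the representation concrete, here is one possible encoding of these five features; the max_len normalizer, cue-word list, and all-caps acronym test below are illustrative guesses, not values from the paper:

import re

CUE_WORDS = {"conclusion", "significantly", "results", "show"}  # illustrative only

def sentence_features(sentence, position_bucket, query_sim, max_len=50):
    words = sentence.split()
    n = max(len(words), 1)
    length = min(len(words) / max_len, 1.0)                # 1. normalized length, [0, 1]
    cue = sum(w.lower() in CUE_WORDS for w in words) / n   # 2. cue-word frequency, [0, 1]
    # 3. query_sim is passed in, [0, inf) -- e.g. the Sim2 score for this sentence
    acro = sum(bool(re.fullmatch(r"[A-Z]{2,}", w)) for w in words) / n  # 4. acronym freq, [0, 1]
    return [length, cue, query_sim, acro, float(position_bucket)]       # 5. position in {1, 2, 3}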
Results on Reuters dataset