The use of unlabeled data to improve supervised learning for text summarization
M.-R. Amini, P. Gallinari (SIGIR 2002)
Slides prepared by Jon Elsas for the Semi-supervised NL Learning Reading Group
Presentation Outline
– Overview of Document Summarization
– Major contribution: Semi-Supervised Logistic Classification for Maximum Likelihood summaries
– Evaluation
  – Baseline Systems
  – Results
Document Summarization
Motivation: [text volume] >> [user’s time]
– Single Document Summarization: used for display of search results, automatic ‘abstracting’, browsing, etc.
– Multi-Document Summarization: describes clusters & document collections, QA, etc.
Problem: What is the summary used for? Does a generic summary exist?
Single Document Summarization example
Document Summarization
Generative Summaries:
– Synthetic text produced after analysis of high-level linguistic features: discourse, semantics, etc.
– Hard.
Extract Summaries:
– Text excerpts (usually sentences) composed together to create a summary
– Boils down to a passage classification/ranking problem
Major Contribution
Semi-supervised Logistic Classifying Expectation Maximization (CEM) for passage classification
Advantages over other methods:
– Works on a small set of labeled data + a large set of unlabeled data
– No modeling assumptions for density estimation
Cons:
– (probably) slow; no performance numbers given
Expectation Maximization (EM)
– Finds maximum likelihood estimates of parameters when the underlying distribution depends on unobserved latent variables.
– Maximizes model fit to the data distribution.
Criterion function:
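(The criterion-function image on this slide did not survive transcription. For n data points and a K-component mixture with mixing weights π_k and component parameters θ_k, the standard maximum-likelihood criterion that EM ascends is presumably:)

\[
L(\Theta) \;=\; \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, f(x_i \mid \theta_k)
\]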
Classifying EM (CEM)
– Like EM, with the addition of an indicator variable for component membership.
– Maximizes ‘quality’ of clustering.
Criterion function:
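(This equation image is also missing. CEM’s classification maximum-likelihood criterion, following Celeux & Govaert, replaces EM’s soft sum with hard indicator variables t_ik ∈ {0,1} marking which component each point belongs to:)

\[
L_C(t, \Theta) \;=\; \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} \, \log\bigl[\pi_k \, f(x_i \mid \theta_k)\bigr]
\]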
Semi-supervised generative-CEM
– Fix component membership for the labeled data.
Criterion function (terms annotated labeled/unlabeled below):
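(With the equation image lost, a plausible reconstruction splits the CEM criterion over the labeled set D_l, whose indicators t_ik are fixed by the labels, and the unlabeled set D_u, whose indicators are re-estimated at each classification step; the underbrace labels are the “Labeled Data / Unlabeled Data” annotations from the slide:)

\[
L_C(t, \Theta) \;=\; \underbrace{\sum_{x_i \in D_l} \sum_{k=1}^{K} t_{ik} \log\bigl[\pi_k f(x_i \mid \theta_k)\bigr]}_{\text{Labeled Data}} \;+\; \underbrace{\sum_{x_i \in D_u} \sum_{k=1}^{K} \tilde{t}_{ik} \log\bigl[\pi_k f(x_i \mid \theta_k)\bigr]}_{\text{Unlabeled Data}}
\]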
Semi-supervised logistic-CEM
– Use a discriminative classifier (logistic) instead of a generative one.
– M-step: re-run gradient descent to estimate the β’s.
The criterion function again splits into labeled-data and unlabeled-data terms (equation image not preserved).
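Neither the slides nor this transcription include code, so here is a minimal NumPy sketch of the loop described above: a hard classification step on the unlabeled sentences alternating with a gradient-based logistic M-step. The function names, 0.5 decision threshold, learning rate, and convergence test are my assumptions, not the paper’s.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, beta, lr=0.1, iters=500):
    # M-step: gradient ascent on the Bernoulli log-likelihood of (X, y).
    for _ in range(iters):
        p = sigmoid(X @ beta)
        beta = beta + lr * X.T @ (y - p) / len(y)
    return beta

def logistic_cem(X_lab, y_lab, X_unlab, max_rounds=50):
    # Prepend an intercept column to both feature matrices.
    Xl = np.hstack([np.ones((len(X_lab), 1)), X_lab])
    Xu = np.hstack([np.ones((len(X_unlab), 1)), X_unlab])
    # Initialize beta on the labeled data alone; labeled memberships stay fixed.
    beta = fit_logistic(Xl, y_lab, np.zeros(Xl.shape[1]))
    y_u = (sigmoid(Xu @ beta) >= 0.5).astype(float)  # initial hard assignments
    for _ in range(max_rounds):
        # M-step: re-estimate beta on labeled + pseudo-labeled data.
        beta = fit_logistic(np.vstack([Xl, Xu]), np.concatenate([y_lab, y_u]), beta)
        # C-step: re-assign the unlabeled points with the updated classifier.
        y_new = (sigmoid(Xu @ beta) >= 0.5).astype(float)
        if np.array_equal(y_new, y_u):  # partition unchanged: converged
            break
        y_u = y_new
    return beta, y_u

Each sentence here would be represented by the five features listed later under “Logistic-CEM: Sentence Representation Features”.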
Evaluation
Algorithm evaluated against 3 other single-document summarization algorithms:
– Non-trainable system: passage ranking
– Trainable system: Naïve Bayes sentence classifier
– Generative-CEM (using full Gaussians)
Precision/Recall with regard to gold-standard extract summaries
The fine print:
– All systems used *similar* representation schemes, but not the same…
Baseline System: Sentence Ranking
Rank sentences using a TF-IDF similarity measure with query expansion (Sim₂):
– Blind relevance feedback from the top sentences
– WordNet similarity thesaurus
Generic query created with the most frequent words in the training set.
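The slide does not reproduce the Sim₂ formula. Stripped of the relevance-feedback and WordNet expansion steps, the core ranking operation reduces to TF-IDF cosine similarity against the query, roughly as in this scikit-learn sketch (my construction, not the paper’s code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(sentences, query):
    # Fit TF-IDF on the candidate sentences and embed the query in the same space.
    vec = TfidfVectorizer()
    S = vec.fit_transform(sentences)
    q = vec.transform([query])
    # Cosine similarity of each sentence to the query, highest first.
    scores = cosine_similarity(S, q).ravel()
    order = scores.argsort()[::-1]
    return [(sentences[i], scores[i]) for i in order]

Blind relevance feedback would then add terms from the top-ranked sentences to the query and re-rank.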
Naïve Bayes Model: Sentence Classification
Simple Naïve Bayes classifier trained on 5 features:
1. Sentence length < t_length  {0,1}
2. Sentence contains ‘cue words’  {0,1}
3. Sentence–query similarity (Sim₂) > t_sim  {0,1}
4. Upper-case/acronym features (count?)
5. Sentence/paragraph position in text  {1, 2, 3}
Logistic-CEM: Sentence Representation Features
Features used to train Logistic-CEM:
1. Normalized sentence length  [0, 1]
2. Normalized ‘cue word’ frequency  [0, 1]
3. Sentence–query similarity (Sim₂)  [0, ∞)
4. Normalized acronym frequency  [0, 1]
5. Sentence/paragraph position in text  {1, 2, 3}
(All of the binary features converted to continuous.)
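To make the representation concrete, here is one possible encoding of these five features; the max_len normalizer, cue-word list, and all-caps acronym test below are illustrative guesses, not values from the paper:

import re

CUE_WORDS = {"conclusion", "significantly", "results", "show"}  # illustrative only

def sentence_features(sentence, position_bucket, query_sim, max_len=50):
    words = sentence.split()
    n = max(len(words), 1)
    length = min(len(words) / max_len, 1.0)                # 1. normalized length, [0, 1]
    cue = sum(w.lower() in CUE_WORDS for w in words) / n   # 2. cue-word frequency, [0, 1]
    # 3. query_sim is passed in, [0, inf) -- e.g. the Sim2 score for this sentence
    acro = sum(bool(re.fullmatch(r"[A-Z]{2,}", w)) for w in words) / n  # 4. acronym freq, [0, 1]
    return [length, cue, query_sim, acro, float(position_bucket)]       # 5. position in {1, 2, 3}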
Results on Reuters dataset