EXTRACTIVE SUMMARIZATION
John Cadigan, David Ellison, and Ethan Roday
Approach
1. Preprocessing and data cleanup
2. Vectorization
3. K-means
4. Information ordering with the experts system
5. CLASSY-style content realization

Pipeline:
- Raw Input Docset (XML)
- /*.txt (plain text representation)
- Preprocessing: sentence splitting, tokenization, “junk” removal
- Sentence Vectorization: compute tf-idf weighted average GloVe vectors; compute LDA topic weights
- Content Selection: k-means clustering on sentence vectors
- Information Ordering: four-expert panel
- Content Realization: CLASSY-style scrubbing, untokenization
- Summaries

Supporting inputs:
- Term weighting: compute document-level tf-idf
- Term weighting: compute idf scores for all terms in ACQUAINT
- ACQUAINT corpus
- GloVe vectors
- LDA: compute LDA topic models over ACQUAINT
- Simple regex substitutions to remove non-content
- Remove things like:
  - “ARVADA, Colo. (AP) –” at beginning of article
  - With photo.
  - By John T. McQuiston
  - QUESTIONS OR RERUNS: …
  - The late-night supervisor is…
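The regex-based cleanup above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the patterns `DATELINE` and `JUNK_LINE`, the function `clean_article`, and the wire-service list are all assumptions modeled only on the examples in the slide.

```python
import re

# Hypothetical patterns modeled on the slide's examples; the expressions
# actually used in the system are not shown in the source.
# Dateline such as "ARVADA, Colo. (AP) –" at the start of a line.
DATELINE = re.compile(r"^[A-Z][A-Za-z.,' ]*\((?:AP|Reuters)\)\s*[-–]+\s*")

# Whole lines that carry no article content.
JUNK_LINE = re.compile(
    r"^(?:With photo\."
    r"|By [A-Z][\w.]*(?: [A-Z][\w.]*)*"
    r"|QUESTIONS OR RERUNS:.*"
    r"|The late-night supervisor is.*)$"
)


def clean_article(text):
    """Drop junk lines and strip a leading dateline from the remaining lines."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or JUNK_LINE.match(line):
            continue
        kept.append(DATELINE.sub("", line))
    return "\n".join(kept)


raw = (
    "By John T. McQuiston\n"
    "ARVADA, Colo. (AP) – Students returned to classes Thursday.\n"
    "With photo.\n"
    "QUESTIONS OR RERUNS: call the desk"
)
print(clean_article(raw))
```

Anchoring every pattern at the start of a line keeps the substitutions from touching ordinary sentences that merely contain words like "By".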
Content Selection
Vectorization Changes
- GloVe vectors are now tf-idf weighted averages
  - Previously: unweighted averages
- tf-idf is computed over the entire ACQUAINT corpus
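A tf-idf weighted average of word vectors could look something like the sketch below. The slides do not state which tf and idf schemes were used, so raw term counts and `log(N / df)` are assumptions here, as are the function names `idf_scores` and `weighted_sentence_vector` and the toy two-dimensional embeddings.

```python
import math
from collections import Counter


def idf_scores(corpus_docs):
    """idf(t) = log(N / df(t)) over a list of tokenized documents (assumed scheme)."""
    n_docs = len(corpus_docs)
    df = Counter(term for doc in corpus_docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def weighted_sentence_vector(tokens, doc_tf, idf, embeddings):
    """Average the word vectors of `tokens`, weighting each token by tf * idf.

    Tokens missing from `embeddings` or `idf` are skipped. If no token
    contributes any weight, a zero vector is returned.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in embeddings or tok not in idf:
            continue
        weight = doc_tf[tok] * idf[tok]
        for i, value in enumerate(embeddings[tok]):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [value / weight_sum for value in total]


# Toy data: stand-ins for GloVe vectors and corpus-level idf scores.
embeddings = {"mud": [1.0, 0.0], "flow": [0.0, 1.0]}
idf = {"mud": 2.0, "flow": 1.0, "unknown": 5.0}
doc_tf = Counter({"mud": 3, "flow": 1, "unknown": 1})
vec = weighted_sentence_vector(["mud", "flow", "unknown"], doc_tf, idf, embeddings)
```

Compared with an unweighted average, frequent but uninformative terms (low idf) pull the sentence vector less, which is presumably the motivation for the change.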
K-means and information ordering
- K-means centroids were not ordered
- Tried:
  - Most similar to other centroids
  - Most contained sentences
- In this release, we used information ordering on top sentences with a cutoff totaling 100+ words
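The two centroid-ordering strategies and the word cutoff mentioned above might be sketched as follows. All function names are hypothetical, and reading "a cutoff totaling 100+ words" as "keep adding top sentences until the running total reaches at least 100 words" is an assumption.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_centroid_similarity(centroids):
    """Order cluster indices by mean cosine similarity to the other centroids."""
    def mean_sim(i):
        others = [cosine(centroids[i], c) for j, c in enumerate(centroids) if j != i]
        return sum(others) / len(others) if others else 0.0
    return sorted(range(len(centroids)), key=mean_sim, reverse=True)


def rank_by_cluster_size(labels):
    """Order cluster ids by how many sentences they contain, largest first."""
    counts = Counter(labels)
    return sorted(counts, key=lambda c: (-counts[c], c))


def take_until_word_cutoff(ranked_sentences, min_words=100):
    """Keep top-ranked sentences until the running word total reaches min_words."""
    chosen, total = [], 0
    for sentence in ranked_sentences:
        if total >= min_words:
            break
        chosen.append(sentence)
        total += len(sentence.split())
    return chosen
```

The sentences returned by `take_until_word_cutoff` would then be handed to the information-ordering step rather than being emitted in cluster order.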
Content Realization
- Implemented CLASSY-style cleanup heuristics (from 2006 paper):
  - Remove bylines, etc. (this is always done in preprocessing)
  - Remove adverbs, limited list of conjunctions at BOS
  - Remove ages (“Bill, 50, ate.” → “Bill ate.”)
  - Remove relative clause attributions (“Bill, who already ate, ate again.” → “Bill ate again.”)
  - Remove attributions, as long as it isn’t a direct quotation (“Bill said he already ate.” → “He already ate.”)
- Untokenize sentences before presentation of summaries
  - did n’t → didn’t, …
  - untokenize punctuation
Content Realization
CLASSY 2006 Sentence Trimming configurations
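A few of the CLASSY-style trimming heuristics and the untokenization step can be approximated with regular expressions, as in the sketch below. The patterns, the short `BOS_WORDS` list, and the function names `scrub` and `untokenize` are illustrative assumptions, not the configuration from the 2006 paper or the authors' implementation. Adverb removal is omitted because it would need part-of-speech information.

```python
import re

# Hypothetical patterns; each covers only the simple cases from the slide.
AGE = re.compile(r"\s*,\s*\d{1,3}\s*,\s*")                 # ", 50, "
REL_CLAUSE = re.compile(r"\s*,\s*who\b[^,]*,\s*")          # ", who already ate, "
ATTRIBUTION = re.compile(r"^[A-Z][\w.]*(?: [A-Z][\w.]*)* said (?:that )?")
BOS_WORDS = re.compile(r"^(?:But|And|However|Meanwhile),?\s+")  # assumed list
QUOTE_MARKS = ('"', "“", "”", "``", "''")

CLITIC = re.compile(r"\s+(n['’]t|['’](?:s|re|ve|ll|d|m))\b")
PUNCT = re.compile(r"\s+([.,;:!?])")


def scrub(sentence):
    """Apply the trimming heuristics in sequence to one sentence."""
    s = BOS_WORDS.sub("", sentence)
    s = AGE.sub(" ", s)
    s = REL_CLAUSE.sub(" ", s)
    # Leave attributions alone when the sentence contains a direct quotation.
    if not any(mark in s for mark in QUOTE_MARKS):
        s = ATTRIBUTION.sub("", s)
    s = s.strip()
    return s[:1].upper() + s[1:]


def untokenize(sentence):
    """Reattach clitics ("did n't") and punctuation split off by the tokenizer."""
    return PUNCT.sub(r"\1", CLITIC.sub(r"\1", sentence))


print(scrub("Bill, 50, ate."))
print(scrub("Bill said he already ate."))
print(untokenize("Bill did n't eat , did he ?"))
```

Patterns this loose will also fire on text they were not meant for (any short number between commas looks like an age), so a real system would need tighter conditions.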
Content Realization: some issues
- Possible quote mangling
- Oddly placed commas
- Too-aggressive adverb removal?
  - “Physically, it’s the same town it was Monday.”
  - “…the Guinean capital of Conakry was unexpectedly closed Monday…”
  - District Attorney Robert Johnson plans to meet with the Diallos shortly before 2 p.m., when the grand jury indictments are scheduled to be unsealed in open court.
  - Police officers have rarely been convicted for killings that occurred while they were on duty.
  - How quickly did he fall?
Quantitative Results
D4: Devtest
Metric    Precision    Recall    F-Score
ROUGE
ROUGE

D4: Evaltest
Metric    Precision    Recall    F-Score
ROUGE
ROUGE
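For readers unfamiliar with the columns in these tables, the sketch below shows how ROUGE-N precision, recall, and F-score are defined for a single candidate and a single reference. It is a simplified illustration (whitespace tokenization, lowercasing, one reference, no stemming), not the official ROUGE toolkit used for evaluation, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f_score) from clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


p1, r1, f1 = rouge_n("the mud flow continued", "the mud flow submerged villages", n=1)
p2, r2, f2 = rouge_n("the mud flow continued", "the mud flow submerged villages", n=2)
```

Recall rewards covering the reference summary's content, while precision penalizes extra material, which is why a long, non-specific sentence can lower a summary's score.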
Game of Qualitative Results
Best, worst and mediocre
MEDIOCRE
ROUGE-1:
ROUGE-2:

But for now, for the next several weeks, people seem able only to get through the worst of it, to handle the realization that some people are not coming back and that yes, things like this do happen here. Students returned to classes Thursday at Chatfield High School, but the bloodbath at rival Columbine High haunted the halls. Investigators, spending the day at the memorial service, were to resume their work this morning, conducting more interviews and eyeing the possibility of additional suspects in Tuesday's massacre. Team members decided they wanted to play out the rest of the season.

- Really long, non-specific first sentence
- Variation of themes
WORST (D1030 ):
ROUGE-1:
ROUGE-2:

the current regulations have created a quagmire of consumer confusion and set up potential health crises that even industry officials say could hurt producers as well as users of herbal products. `The main thing you want is someone who knows enough to keep you out of trouble,'' said Dr. John B. Neeld Jr., president of the American Society of Anesthesiologists. While over-the-counter drugs are subject to Food and Drug Administration regulation, herbal supplements are assumed safe unless proved otherwise. If the products were safe, companies could say what they wished, so long as they did not claim their products could prevent, treat or cure disease.

- News-speak (not newspeak)
  - “quagmire”
  - “hurt producers”
- Not that bad
- It’s about FDA regulations
BEST:
ROUGE-1:
ROUGE-2:

An Indonesian minister, Aburizal Bakrie, claimed last month the flow was a ``natural disaster'' unrelated to the drilling activities of a company, Lapindo Brantas Inc, which belongs to a group controlled by his family. President Susilo Bambang Yudhoyono has ordered Lapindo to pay 3.8 trillion rupiah -LRB- 420 million dollars -RRB- in compensation and costs related to the mud flow. A gas well near Surabaya in East Java has spewed steaming mud since May last year, submerging villages, factories and fields and forcing more than 15,000 people to flee their homes.

- All the themes:
  - Money
  - Disaster
  - Government
- Could improve ordering
The good:
- Content being selected is mostly relevant
- Topicality has improved over time

The bad:
- Lack of thematic cohesion seems to predominate
- Possibly a drawback of k-means
Discussion
- Parameter tuning matters: tf scheme, idf scheme, GloVe weight, LDA weight, k
- Worst and best devtest:

Devtest best
Metric    Precision    Recall    F-Score
ROUGE
ROUGE

Devtest worst
Metric    Precision    Recall    F-Score
ROUGE
ROUGE