Towards an Extractive Summarization System Using Sentence Vectors and Clustering
John Cadigan, David Ellison, Ethan Roday

System Overview
The document summarization system is organized as a multi-step pipeline.

System Overview
Two major components for content selection:
- Feature selection step to generate sentence vectors
- Clustering or classification for sentence selection

Overview: Preprocessing
- Extract raw text
- Split into sentences
- Tokenize each sentence
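The splitting and tokenizing steps can be sketched as follows. This is a minimal illustration, not the tooling the authors used: the regex-based `split_sentences` and `tokenize` helpers are hypothetical, and a real system would more likely rely on a trained sentence splitter and tokenizer.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


raw_text = "Giant pandas eat bamboo. Arrow bamboo is flowering! Experts have a plan."
sentences = split_sentences(raw_text)
tokens = [tokenize(sentence) for sentence in sentences]
```

A splitter this simple will break on abbreviations and quoted speech, which is one way non-sentences can end up in a summary.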

Overview: Feature Selection
- GloVe vector lookup
  - Dense word vector
- Average all word vectors to compute sentence vector
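A minimal sketch of the averaging step, using a tiny hand-made embedding table in place of real GloVe vectors. The `sentence_vector` helper and the three-dimensional `toy_embeddings` are illustrative assumptions, and skipping out-of-vocabulary tokens is a choice made here, since the slides do not say how unknown words are handled.

```python
def sentence_vector(tokens, embeddings, dim):
    """Average the word vectors of all tokens found in the embedding table."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        # Assumption: a sentence with no known words maps to the zero vector.
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy stand-in for a GloVe lookup table (real GloVe vectors have 50+ dimensions).
toy_embeddings = {
    "giant": [1.0, 0.0, 2.0],
    "pandas": [3.0, 2.0, 0.0],
    "bamboo": [2.0, 4.0, 1.0],
}

vec = sentence_vector(["giant", "pandas", "eat", "bamboo"], toy_embeddings, dim=3)
print(vec)  # [2.0, 2.0, 1.0]; "eat" is not in the table and is skipped
```

Because every word gets equal weight, frequent function words pull all sentence vectors toward the same region of the space.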

Overview: Intermediate Representation
- Unique ID for each sentence
- Sentence ID encodes:
  - Which document set (e.g. D0901A)
  - Which document
  - Which sentence
  - Example: D0901A.1.2
- Also keep track of:
  - Sentence embedding (GloVe average)
  - Original text as it appears in the document
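One way to hold this intermediate representation is a small record type. The `SentenceRecord` class and `parse_sentence_id` function below are hypothetical names; only the dotted ID layout (document set, document, sentence) comes from the slide, and whether the indices start at 0 or 1 is not stated.

```python
from typing import List, NamedTuple


class SentenceRecord(NamedTuple):
    doc_set: str            # e.g. "D0901A"
    doc_index: int          # which document within the set
    sent_index: int         # which sentence within the document
    embedding: List[float]  # averaged GloVe vector
    text: str               # original text, kept for content realization

    @property
    def sentence_id(self):
        """Render the dotted ID, e.g. 'D0901A.1.2'."""
        return f"{self.doc_set}.{self.doc_index}.{self.sent_index}"


def parse_sentence_id(sentence_id):
    """Split a dotted ID back into (doc_set, doc_index, sent_index)."""
    doc_set, doc_index, sent_index = sentence_id.split(".")
    return doc_set, int(doc_index), int(sent_index)


record = SentenceRecord("D0901A", 1, 2, [0.1, 0.2], "Example sentence text.")
print(record.sentence_id)               # D0901A.1.2
print(parse_sentence_id("D0901A.1.2"))  # ('D0901A', 1, 2)
```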

Overview: Content Selection
- Cluster sentences (vectors) with k-means
- Idea: most “representative” sentence is closest to center of cluster
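The selection idea can be sketched with a plain-Python k-means (Lloyd's algorithm). This is an illustration under assumptions, not the authors' implementation: the slides do not give the distance measure, the value of k, or the initialization, so squared Euclidean distance, random initial centers, and a fixed iteration count are all choices made here, and every function name is hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def assign(vectors, centers):
    """Group vector indices by their nearest center."""
    clusters = [[] for _ in centers]
    for i, vector in enumerate(vectors):
        nearest = min(range(len(centers)), key=lambda c: sq_dist(vector, centers[c]))
        clusters[nearest].append(i)
    return clusters


def kmeans(vectors, k, iterations=20, seed=0):
    """Lloyd's algorithm with k randomly sampled initial centers."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = assign(vectors, centers)
        # An empty cluster keeps its previous center.
        centers = [
            mean([vectors[i] for i in members]) if members else centers[c]
            for c, members in enumerate(clusters)
        ]
    return centers, assign(vectors, centers)


def representative_sentences(vectors, k):
    """Return, for each cluster, the index of the vector closest to its center."""
    centers, clusters = kmeans(vectors, k)
    picks = [
        min(members, key=lambda i: sq_dist(vectors[i], center))
        for center, members in zip(centers, clusters)
        if members
    ]
    return sorted(picks)


sentence_vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(representative_sentences(sentence_vectors, k=2))  # [0, 3]
```

The returned indices point back into the sentence list, so the original text of each selected sentence can be looked up afterwards.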

Information Ordering
- For each sentence, retrieve and order by original document ordering
- This information is maintained in each sentence ID
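Since the ordering information lives in the sentence ID, ordering can be a sort on the parsed ID. The sketch below assumes the dotted ID layout from the D0901A.1.2 example and interprets "original document ordering" as document number first, then sentence number; `order_key` is a hypothetical helper.

```python
def order_key(sentence_id):
    """Sort key: document set, then document number, then sentence number."""
    doc_set, doc_index, sent_index = sentence_id.split(".")
    # Convert to int so that sentence 2 sorts before sentence 12.
    return (doc_set, int(doc_index), int(sent_index))


selected = ["D0901A.3.1", "D0901A.1.12", "D0901A.1.2"]
ordered = sorted(selected, key=order_key)
print(ordered)  # ['D0901A.1.2', 'D0901A.1.12', 'D0901A.3.1']
```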

Content Realization
- Present each centroid sentence “as-is”
- We’ve kept the original representation from input documents

Quantitative results
ROUGE Scores
[Table: columns P, R, F, and Li et al. (k-means R); four ROUGE rows; the numeric values did not survive extraction.]
Analysis
- Precision consistently higher than recall
- On par with the k-means implementation of Li et al. (2011)
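For readers unfamiliar with the P, R and F columns, here is a simplified sketch of ROUGE-N against a single reference summary. It is not the official ROUGE toolkit, which also supports options such as stemming and multiple references; the `rouge_n` function and the example sentences are illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) from clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # Counter '&' takes the minimum count
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


candidate = "the panda eats bamboo".split()
reference = "the giant panda eats arrow bamboo every day".split()
precision, recall, f1 = rouge_n(candidate, reference, n=1)
print(precision, recall)  # 1.0 0.5
```

In this toy case every candidate word appears in the reference, but the candidate covers only half of the reference's words, so precision exceeds recall.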

Qualitative results
Decent example summary:
Flowering arrow bamboo is threatening a colony of endangered giant pandas in China but experts have come up with a plan to save them, state media said Monday. Giant Panda By the end of 2004, arrow bamboo, the favorite food of giants, had blossomed on 7,420 hectares at the Baishuijiang …
Bad example summary:
“At Littleton, America got a glimpse of the last stop on that train to hell America boarded decades ago when we declared that God is dead and that each of us is his or her own god who can make up the rules as we go along.” As the day progressed, the snow became thick mud and as many as 100 visitors at a time came to the top of the hill to look at the rough wood cross that had been staked into the ground there. I think we have to learn from this and not let those kids die in vain.”
Analysis
- Bias towards long sentences
- Poor prioritization of sentences
- Low variation

Discussion: Preprocessing
- Lots of bad sentences in current summaries:
  - Quotes
  - Non-sentences
  - Article metadata: New York -LRB-
- Stop words contribute significant noise
- Lack of term weighting also results in noisy vectors
  - tf-idf and LLR are low-effort and high-return
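As an illustration of the term weighting mentioned above, the sketch below computes one common tf-idf variant: relative term frequency times unsmoothed log(N/df). Many other variants exist, and the slides do not say which one the authors had in mind; `idf_table` and `tf_idf` are hypothetical helpers.

```python
import math
from collections import Counter


def idf_table(documents):
    """Inverse document frequency, log(N / df), for every term in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(document, idf):
    """Weight each term by its relative frequency times its idf."""
    counts = Counter(document)
    return {
        term: (count / len(document)) * idf.get(term, 0.0)
        for term, count in counts.items()
    }


documents = [
    "the panda eats bamboo".split(),
    "the bamboo is flowering".split(),
    "the experts have a plan".split(),
]
idf = idf_table(documents)
weights = tf_idf(documents[0], idf)
print(weights["the"])  # 0.0, because "the" occurs in every document
```

Weights like these could replace the uniform weights in a word-vector average, so that a word appearing in every document contributes nothing to the sentence vector.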

Discussion: Vector Representation
- Vectors are an effective way to represent similarity:
  - Compare GloVe vectors of verbs, subjects and objects
  - Integrate an LDA topic model
  - NER and co-reference
- Other representations may be very effective components:
  - Probabilistic: KL divergence
  - Graph-based: LexRank
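To make the probabilistic option concrete, the sketch below computes the KL divergence D(P || Q) = sum over terms of P(t) * log(P(t) / Q(t)) between a document's unigram distribution P and a candidate summary's distribution Q. The add-one smoothing, the helper names, and the example sentences are assumptions made for illustration; the slides only name KL divergence as a possible direction.

```python
import math
from collections import Counter


def unigram_distribution(tokens, vocabulary, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {term: (counts[term] + alpha) / total for term in vocabulary}


def kl_divergence(p, q):
    """D(P || Q); smoothing guarantees every probability is non-zero."""
    return sum(p[term] * math.log(p[term] / q[term]) for term in p)


document = "the panda eats bamboo and the bamboo is flowering".split()
summary_a = "the panda eats flowering bamboo".split()
summary_b = "experts have a plan".split()

vocabulary = sorted(set(document) | set(summary_a) | set(summary_b))
p_doc = unigram_distribution(document, vocabulary)
kl_a = kl_divergence(p_doc, unigram_distribution(summary_a, vocabulary))
kl_b = kl_divergence(p_doc, unigram_distribution(summary_b, vocabulary))
print(kl_a < kl_b)  # True: summary_a is closer to the document's word distribution
```

A lower divergence means the summary's word distribution is closer to the document's, which gives a score that could be used to compare candidate summaries.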

Discussion: Selection Methods
- Several drawbacks of k-means as selection algorithm:
  - What is k?
  - How to prioritize clusters?
  - How to select from clusters?
- Could other representations overcome these flaws?

Conclusion
Three major directions for next steps:
1. Preprocessing and noise removal
2. Refining and augmenting vector representations
3. Experimenting with new selection algorithms