Topic Orientation + Information Ordering
Syed Sameer Arshad, Tristan Chong

What we did for Topic Orientation

We wanted to acquire the concept of a "document theme", and we decided to exploit the fact that all of the documents were news articles. In news articles, what we care most about is:
- What was the event that happened?
- Where did the event happen?
- Who was involved with this event?

In first-order-logic formal semantics, an event can be defined by the existence of a verb. The place where an event happened would be a Named Entity, and the people involved with an event would also be Named Entities.

Themes, Verbs and Named Entities

Therefore, to spot the themes in the document set, we could do the following:
- Find the most commonly used verb stems
- Find the most commonly mentioned Named Entities

These pieces of information would be the main themes of the document set we are summarizing.

Theme Example

For example, a document set containing information on Ian Thorpe competing at the Beijing Olympics and winning gold would have the following themes:
- Verb stems: compete, win
- Named Entities: Ian, Thorpe, Beijing, Olympics, gold

Any sentence that contains these Named Entities, or verbs that stem to "compete" or "win", should be given priority over others when choosing summarization content, because it is more in line with the theme of the document set.
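As a quick illustration of the stemming step, here is a minimal sketch using NLTK's Porter stemmer; the outputs noted in the comments are what NLTK's implementation produces:

```python
# Minimal sketch: collapsing inflected verb forms with the Porter stemmer.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for verb in ["compete", "competing", "competed", "win", "winning", "won"]:
    print(verb, "->", stemmer.stem(verb))
# compete/competing/competed all map to the stem "compet", and
# win/winning both map to "win" ("won" is left untouched, since
# stemming does not handle irregular forms). Matching is therefore
# done on stems, not on surface word forms.
```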

Our Approach

For every document set:
- Use NLTK's POS tagger to tag all sentences, extract all verb-related POS tags, and stem the verbs with the Porter stemmer.
- Store the set of stems that were found, along with their counts, in a "verb_counts" dictionary. Normalize these counts with min-max normalization between 0 and 1.
- Also store the average word index of each of these stems, where word index is defined as the percentage of a document that has been traversed, on average, whenever the stem shows up in a document. Call this the "verb_average_indices" dictionary and normalize its values with min-max normalization between 0 and 1. This is used for information ordering later.
- Use NLTK's Named Entity Recognizer to tag all named entities. Store the set of Named Entities that were found, along with their counts, in a "named_entity_counts" dictionary. Normalize these counts with min-max normalization between 0 and 1.
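A minimal sketch of this per-document-set pipeline, assuming NLTK's off-the-shelf tokenizer, POS tagger, and NE chunker; the helper names (build_theme_dictionaries, min_max_normalize) are our own illustration, and the word-index bookkeeping is simplified:

```python
from collections import defaultdict

import nltk
from nltk.stem.porter import PorterStemmer


def min_max_normalize(values):
    """Rescale every value in a dict to the [0, 1] range."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all values match
    return {k: (v - lo) / span for k, v in values.items()}


def build_theme_dictionaries(documents):
    """documents: list of raw-text strings, one per article in the set."""
    stemmer = PorterStemmer()
    verb_counts = defaultdict(int)
    stem_positions = defaultdict(list)  # stem -> per-occurrence doc fractions
    named_entity_counts = defaultdict(int)

    for doc in documents:
        tokens = nltk.word_tokenize(doc)
        tagged = nltk.pos_tag(tokens)
        n = len(tokens)
        for i, (word, tag) in enumerate(tagged):
            if tag.startswith("VB"):  # all verb-related POS tags
                stem = stemmer.stem(word.lower())
                verb_counts[stem] += 1
                stem_positions[stem].append(i / n)  # fraction of doc traversed
        # Named entities are the labeled subtrees of NLTK's NE chunker.
        for subtree in nltk.ne_chunk(tagged).subtrees():
            if subtree.label() != "S":
                for word, _tag in subtree.leaves():
                    named_entity_counts[word] += 1

    verb_average_indices = {s: sum(p) / len(p) for s, p in stem_positions.items()}
    return (min_max_normalize(verb_counts),
            min_max_normalize(verb_average_indices),
            min_max_normalize(named_entity_counts))
```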

Changes to the MEAD relevance score for sentences

Normally, the MEAD score is a weighted sum of a sentence's centroid score, first-sentence-relevance score, and position score. We now add a fourth number to this weighted sum: the topic-orientation score. For each sentence, the topic-orientation score is the sum of the count values of all verb stems that have a match in the verb_counts dictionary, plus the sum of the count values of all named entities that have a match in the named_entity_counts dictionary. In the final score of a sentence, the topic-orientation score is multiplied by a weight of 1.8.
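A sketch of the modified score, assuming the three standard MEAD components have already been computed per sentence; the weights on those three components are illustrative defaults, since the slides only specify the topic-orientation weight:

```python
TOPIC_WEIGHT = 1.8  # chosen by the optimization study on the next slide


def topic_orientation_score(stems, entities, verb_counts, named_entity_counts):
    """Sum the normalized counts of every matching stem and named entity."""
    return (sum(verb_counts.get(s, 0.0) for s in stems)
            + sum(named_entity_counts.get(e, 0.0) for e in entities))


def final_sentence_score(centroid, first_sentence_relevance, position,
                         topic_orientation,
                         w_c=1.0, w_f=1.0, w_p=1.0):  # illustrative weights
    # The three standard MEAD components plus the new weighted fourth term.
    return (w_c * centroid + w_f * first_sentence_relevance
            + w_p * position + TOPIC_WEIGHT * topic_orientation)
```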

Why 1.8?

We ran an optimization study on it. The best combination of ROUGE-1 recall, precision, and F-score occurred when the weight of the topic-orientation score was 1.8.
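The study amounts to a sweep over candidate weights, roughly like the sketch below; summarize_with_weight and rouge_1 are hypothetical stand-ins for the actual pipeline, and the 0-to-3 search range is an assumption:

```python
best_weight, best_f1 = None, -1.0
for w in [x / 10 for x in range(0, 31)]:        # candidate weights 0.0 .. 3.0
    summaries = summarize_with_weight(w)         # hypothetical: rerun selection
    recall, precision, f1 = rouge_1(summaries)   # hypothetical: score vs. models
    if f1 > best_f1:                             # F1 balances recall and precision
        best_weight, best_f1 = w, f1
# In our runs the best ROUGE-1 trade-off landed at a weight of 1.8.
```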

Information Ordering Approach

Our approach is to find factors that could be useful in information ordering, make a numerical representation of those factors, and compute an information-ordering score that is a weighted sum of these numbers. We found two factors that can be calculated for each sentence:
- MEAD position score: we calculated this as a main quest for the content-selection work.
- Average of verb-stem average indices: we calculated this as a side quest for the topic-orientation work.

Position Score and the Verb Index Score

We already have the position score from the MEAD calculations for each sentence. It is a normalized, percentage-style numerical representation of how much of a sentence's parent document has been traversed in order to reach that sentence. We also already have the average verb-stem index from the topic-orientation calculations. The verb-stem index is a normalized, percentage-style numerical representation of how much of a sentence's parent document has been traversed when a particular verb stem shows up. We average this across all instances of the verb stem, and then across all documents; that average is our verb index score.

The information-ordering score is then a weighted sum of these two scores, as sketched below. The weights that worked best were 0.8 for the position score and 0.2 for the verb index score. The weights must add up to 1 because we defined the information-ordering score to lie between 0 and 1. There was a 3% precision swing between the worst weight combination and the best one. Scores near 0 imply that a sentence should appear near the beginning of a summary, scores near 0.5 in the middle, and scores near 1 at the end.
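A minimal sketch of the ordering computation, using the weights from the study above; the function name is ours:

```python
POSITION_WEIGHT = 0.8
VERB_INDEX_WEIGHT = 0.2  # the two weights sum to 1, keeping the score in [0, 1]


def information_ordering_score(position_score, verb_index_score):
    """Both inputs are in [0, 1]; lower scores sort toward the summary's start."""
    return POSITION_WEIGHT * position_score + VERB_INDEX_WEIGHT * verb_index_score

# Ordering the selected sentences is then a simple sort, e.g.:
# summary.sort(key=lambda s: information_ordering_score(s.position, s.verb_index))
```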

Results

[Table: recall, precision, and F-score for ROUGE-1 through ROUGE-4]