SemQuest: University of Houston’s Semantics-based Question Answering System Rakesh Verma University of Houston Team: Txsumm Joint work with Araly Barrera.

Slides:



Advertisements
Similar presentations
Title: The Author-Topic Model for Authors and Documents
Advertisements

Comparing Twitter Summarization Algorithms for Multiple Post Summaries David Inouye and Jugal K. Kalita SocialCom May 10 Hyewon Lim.
1 Multi-topic based Query-oriented Summarization Jie Tang *, Limin Yao #, and Dewei Chen * * Dept. of Computer Science and Technology Tsinghua University.
UNC-CH at DUC2007: Query Expansion, Lexical Simplification, and Sentence Selection Strategies for Multi-Document Summarization Catherine Blake Julia Kampov.
TAC Summarisation System WING Meeting 8 Jul 2011 Ziheng Lin, Praveen Bysani, Jun-Ping Ng.
Predicting Text Quality for Scientific Articles AAAI/SIGART-11 Doctoral Consortium Annie Louis : Louis A. and Nenkova A Automatically.
Reference Collections: Task Characteristics. TREC Collection Text REtrieval Conference (TREC) –sponsored by NIST and DARPA (1992-?) Comparing approaches.
Semantic Video Classification Based on Subtitles and Domain Terminologies Polyxeni Katsiouli, Vassileios Tsetsos, Stathes Hadjiefthymiades P ervasive C.
TopicTrend By: Jovian Lin Discover Emerging and Novel Research Topics.
Using Social Networking Techniques in Text Mining Document Summarization.
Query session guided multi- document summarization THESIS PRESENTATION BY TAL BAUMEL ADVISOR: PROF. MICHAEL ELHADAD.
Overview of the Fourth Recognising Textual Entailment Challenge NIST-Nov. 17, 2008TAC Danilo Giampiccolo (coordinator, CELCT) Hoa Trang Dan (NIST)
Some studies on Vietnamese multi-document summarization and semantic relation extraction Laboratory of Data Mining & Knowledge Science 9/4/20151 Laboratory.
Rui Yan, Yan Zhang Peking University
The Problem Finding information about people in huge text collections or on-line repositories on the Web is a common activity Person names, however, are.
Example 16,000 documents 100 topic Picked those with large p(w|z)
1 A study on automatically extracted keywords in text categorization Authors:Anette Hulth and Be´ata B. Megyesi From:ACL 2006 Reporter: 陳永祥 Date:2007/10/16.
KDD 2012, Beijing, China Community Discovery and Profiling with Social Messages Wenjun Zhou Hongxia Jin Yan Liu.
Mrs. Maninder Kaur 1 Mrs. Maninder Kaur.
A Two Tier Framework for Context-Aware Service Organization & Discovery Wei Zhang 1, Jian Su 2, Bin Chen 2,WentingWang 2, Zhiqiang Toh 2, Yanchuan Sim.
Text summarization MEAD NewsInEssence Cross-document structure Sentence compression Lexrank Political science Discourse dynamics Centrality identification.
The Computational Linguistics Summarization Pilot TAC 2014 Kokil Jaidka †, Muthu Kumar Chandrasekaran* ‡, Min-Yen Kan* ‡, Ankur Khanna ‡ Nanyang.
CHATS IN THE CLASSROOM: EVALUATIONS FROM THE PERSPECTIVES OF STUDENTS AND TUTORS AT CHEMNITZ UNIVERSITY OF TECHNOLOGY, COMMUNICATION ON TECHNOLOGY AND.
Processing of large document collections Part 7 (Text summarization: multi- document summarization, knowledge- rich approaches, current topics) Helena.
Search and Information Extraction Lab IIIT Hyderabad.
1 Linmei HU 1, Juanzi LI 1, Zhihui LI 2, Chao SHAO 1, and Zhixing LI 1 1 Knowledge Engineering Group, Dept. of Computer Science and Technology, Tsinghua.
LexRank: Graph-based Centrality as Salience in Text Summarization
1 Towards Automated Related Work Summarization (ReWoS) HOANG Cong Duy Vu 03/12/2010.
Efficiently Computed Lexical Chains As an Intermediate Representation for Automatic Text Summarization H.G. Silber and K.F. McCoy University of Delaware.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Visualizing Ontology Components through Self-Organizing.
Finding the Hidden Scenes Behind Android Applications Joey Allen Mentor: Xiangyu Niu CURENT REU Program: Final Presentation 7/16/2014.
1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.
LexPageRank: Prestige in Multi- Document Text Summarization Gunes Erkan and Dragomir R. Radev Department of EECS, School of Information University of Michigan.
How to write a scientific article Nikolaos P. Polyzos M.D. PhD.
Effects of overlaying ontologies to TextRank graphs Project Report By Kino Coursey.
1 Sentence Extraction-based Presentation Summarization Techniques and Evaluation Metrics Makoto Hirohata, Yousuke Shinnaka, Koji Iwano and Sadaoki Furui.
Latent Dirichlet Allocation D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3: , January Jonathan Huang
Department of Software and Computing Systems Research Group of Language Processing and Information Systems The DLSIUAES Team’s Participation in the TAC.
Endangered Species A Collaborative Teaching Unit.
For Monday Read chapter 24, sections 1-3 Homework: –Chapter 23, exercise 8.
Automatic Evaluation of Linguistic Quality in Multi- Document Summarization Pitler, Louis, Nenkova 2010 Presented by Dan Feblowitz and Jeremy B. Merrill.
Topic Modeling using Latent Dirichlet Allocation
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
Multi-Abstraction Concern Localization Tien-Duy B. Le, Shaowei Wang, and David Lo School of Information Systems Singapore Management University 1.
Project 2 Latent Dirichlet Allocation 2014/4/29 Beom-Jin Lee.
Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,
1 Evaluating High Accuracy Retrieval Techniques Chirag Shah,W. Bruce Croft Center for Intelligent Information Retrieval Department of Computer Science.
Timestamped Graphs: Evolutionary Models of Text for Multi-document Summarization Ziheng Lin and Min-Yen Kan Department of Computer Science National University.
Project 1: Classification Using Neural Networks Kim, Kwonill Biointelligence laboratory Artificial Intelligence.
LexPageRank: Prestige in Multi-Document Text Summarization Gunes Erkan, Dragomir R. Radev (EMNLP 2004)
The P YTHY Summarization System: Microsoft Research at DUC 2007 Kristina Toutanova, Chris Brockett, Michael Gamon, Jagadeesh Jagarlamudi, Hisami Suzuki,
NTNU Speech Lab 1 Topic Themes for Multi-Document Summarization Sanda Harabagiu and Finley Lacatusu Language Computer Corporation Presented by Yi-Ting.
哈工大信息检索研究室 HITIR ’ s Update Summary at TAC2008 Extractive Content Selection Using Evolutionary Manifold-ranking and Spectral Clustering Reporter: Ph.d.
Link Distribution on Wikipedia [0422]KwangHee Park.
2016/9/301 Exploiting Wikipedia as External Knowledge for Document Clustering Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou Proceeding.
GRAPH BASED MULTI-DOCUMENT SUMMARIZATION Canan BATUR
NUS at DUC 2007: Using Evolutionary Models of Text Ziheng Lin, Tat-Seng Chua, Min-Yen Kan, Wee Sun Lee, Long Qiu and Shiren Ye Department of Computer Science.
Making Sense of Large Volumes of Unstructured Responses K. M. P. N. Jayathilaka Department of Statistics University of Colombo.
SENSEVAL: Evaluating WSD Systems
A Brief Introduction to Distant Supervision
Topic Modeling Nick Jordan.
CSE 635 Multimedia Information Retrieval
Text-to-Text Generation
Big Data Text Summarization Westminster Attack
Summarization for entity annotation Contextual summary
Hierarchical Relational Models for Document Networks
Team 7 → Final Presentation
Text Analytics Solutions with Azure Machine Learning
Information Retrieval
Machine Reading.
Presentation transcript:

SemQuest: University of Houston’s Semantics-based Question Answering System Rakesh Verma University of Houston Team: Txsumm Joint work with Araly Barrera and Ryan Vincent SemQuest: University of Houston’s Semantics-based Question Answering System Rakesh Verma University of Houston Team: Txsumm Joint work with Araly Barrera and Ryan Vincent

Guided Summarization Task Given: Newswire sets of 20 articles, each set belongs to 1 category out of 5 categories Produce: 100-word summaries that answer specific aspects for each category. Part A - A summary of 10 documents Topic* Part B - A summary of 10 documents with knowledge of Part A. * Total of 44 topics in TAC 2011

Aspects Topic CategoryAspects 1) Accidents and Natural Disasters what when where why who affected damages countermeasures 2) Attacks what when where perpetrators who affected damages countermeasures 3) Health and Safety what who affected how why countermeasures 4) Endangered Resources what importance threats countermeasures 5) Investigations and Trials who/who involved what importance threats countermeasures Table 1. Topic categories and required aspects to answer in a summary

SemQuest 2 Major Steps Data Cleaning Sentence Processing Sentence Preprocessing Information Extraction

SemQuest: Data Cleaning Noise Removal – removal of tags, quotes and some fragments. Redundancy Removal – removal of sentence overlap for Update Task (part B articles). Linguistic Preprocessing – named entity, part-of- speech and word sense tagging.

SemQuest: Sentence Processing Figure 1. SemQuest Diagram

SemQuest: Sentence Preprocessing SemQuest

SemQuest: Sentence Preprocessing 1) Problem: “They should be held accountable for that” Our Solution: Pronoun Penalty Score 2) Observation: “Prosecutors alleged Irkus Badillo and Gorka Vidal wanted to “sow panic” in Madrid after being caught in possession of 500 kilograms 1,100 pounds of explosives, and had called on the high court to hand down 29-year sentences.” Our method: Named Entity Score

SemQuest: Sentence Preprocessing 3) Problem: Semantic relationships need to beestablished between sentences and the aspects! Our method: WordNet Score affect, prevention, vaccination, illness, disease, virus, demographic Figure 2. Sample Level 0 words considered to answer aspects from ‘’Health and Safety’’ topics. Five of synonym-of-hyponym levels for each topic were produced using WordNet [4].

SemQuest: Sentence Preprocessing 4) Background: Previous work on single document summarization (SynSem) has demonstrated successful results on past DUC02 and magazine-type scientific articles. Our Method: Convert SynSem into a multi-document acceptor, naming it M- SynSem,and reward sentences with best M-SynSem scores

SynSem – Single Document Extractor Figure 3. SynSem diagram for single document extraction

SynSem Datasets tested: DUC 2002 and non-DUC scientific articles Sample Scientific Article ROUGE 1-gram scores SystemRecallPrecisionF-mea. SynSem Baseline MEAD TextRank DUC02 ROUGE 1-gram scores SystemRecallPrecisionF-mea. S SynSem S Baseline S TextRank Table 2. ROUGE evaluations for SynSem on DUC and nonDUC data (a) (b)

M-SynSem

Two M-SynSem Keyword Score approaches: 1)TextRank [2] 2)LDA [3] M-SynSem version (weights)ROUGE-1ROUGE-2ROUGE-SU4 TextRank (.3) TextRank (.3) LDA (0) LDA (.3) M-SynSem version (weights)ROUGE-1ROUGE-2ROUGE-SU4 TextRank (.3) TextRank (.3) LDA (0) LDA (.3) Table 3. SemQuest evaluations on TAC 2011 using various M-SynSem keyword versions and weights. (a) Part A evaluation results (b) Part B evaluation results

SemQuest: Information Extraction SemQuest

SemQuest: Information Extraction 1.) Named Entity Box Summary: Named Entity Box Figure 4. Sample summary and Named Entity Box

SemQuest: Information Extraction 1.) Named Entity Box Topic CategoryAspects Named Entity Possibilities Named Entity Box 1) Accidents and Natural Disasters what when where why who affected damages countermeasures -- date location -- person/organization -- money 5/7 2) Attacks what when where perpetrators who affected damages countermeasures -- date location person person/organization -- money 5/8 3) Health and Safety what who affected how why countermeasures -- person/organization -- money 3/5 4) Endangered Resources what importance threats countermeasures -- money 1/4 5) Investigations and Trials who/who involved what importance threats countermeasures person/organization -- 2/6 Table 4. TAC 2011 Topics, aspects to answer, and named entity associations

SemQuest: Information Extraction 2) We utilize all linguistic scores and Named Entity Box requirements for the computation of a final sentence score, FinalS for an extract, E:

SemQuest: Information Extraction

Our Results Submission YearROUGE-2ROUGE-1ROUGE-SU4BELinguistic Quality Submission YearROUGE-2ROUGE-1ROUGE-SU4BELinguistic Quality Table 5. Evaluations scores for SemQuest submissions for Average ROUGE-1, ROUGE-2, ROUGE-SU4, BE, and Linguistic Quality for Parts A & B (a) Part A Evaluation results for Submissions 1 and 2 of 2011 and 2010 (b) Part B Evaluation results for Submissions 1 and 2 of 2011 and 2010

Our Results Performance: Higher overall scores for both submissions from participation in TAC 2010 Improved rankings by 17% in Part A and by 7% in Part B. We beat both baselines for the B category in overall responsiveness score and one baseline for the A category. Our best run is better than 70% of participating systems for the linguistic score.

Analysis of NIST Scoring Schemes Evaluation correlations between ROUGE/BE scores to average manual scores for all participating systems of TAC 2011: Evaluation method Average Manual Scores for Part A Modified pyramid Num SCU’s Num Repetitions Modified with 3 models Linguistic Quality Overall responsiveness ROUGE ROUGE ROUGE-SU BE Evaluation method Average Manual Scores for Part B Modified pyramid Num SCU’s Num Repetitions Modified with 3 models Linguistic Quality Overall responsiveness ROUGE ROUGE ROUGE-SU BE Table 6. Evaluation correlations between ROUGE/BE and manual scores.

Future Work Improvements to M-SynSem Sentence compression

Acknowledgments Thanks to all the students: Felix Filozov David Kent Araly Barrera Ryan Vincent Thanks to NIST!

References [1] J.G. Carbonell, Y. Geng, and J. Goldstein. Automated Query-relevant Summarization and Diversity-based Reranking. In 15 th International Joint Conference on Artificial Intelligence, Workshop: AI in Digital Libraries, [2] R. Mihalcea and P. Tarau. TextRank: Bringing Order into Texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). March [3] David M. Blei, Andrew Y. Ng,. And Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 2: , [4] WordNet: An Electronic Lexical Database, Edited by Christiane Fellbaum, MIT Press, 1998.

Questions?