1 Learning Sub-structures of Document Semantic Graphs for Document Summarization
Jure Leskovec (1), Marko Grobelnik (1), Natasa Milic-Frayling (2)
(1) Jozef Stefan Institute, Ljubljana, Slovenia
(2) Microsoft Research, Cambridge, UK
Presenter: Yi-Ting Chen

2 Reference
Jure Leskovec, Marko Grobelnik, Natasa Milic-Frayling, "Learning Sub-structures of Document Semantic Graphs for Document Summarization", LinkKDD 2004, August 2004, Seattle, WA, USA.

3 Outline
Introduction
Semantic Graph Generation
Learning Semantic Sub-Graphs Using Support Vector Machines & Experiments
Conclusions

4 Introduction (1/2)
The paper is primarily concerned with identifying the textual elements to use when extracting summaries
–The underlying assumption is that capturing the semantic structure of a document is essential for summarization
–Semantic representations of the document, visualized as semantic graphs, are created, and a model is learned to extract the sub-structures that should appear in document extracts
The basic elements of the semantic graphs are logical form triples: subject-predicate-object
Experiments show that characteristics of the semantic graphs are among the most prominent attributes in the learnt SVM model

5 Introduction (2/2)
Machine learning algorithms are applied to capture the characteristics of human-extracted summary sentences
Elementary syntactic structures are extracted from individual sentences in the form of logical form triples
Semantic properties of the nodes in the triples are then used to build semantic graphs for both the documents and the corresponding summaries
The paper uses logical form triples as basic features and applies Support Vector Machines to learn the summarization model

6 Semantic Graph Generation (1/7)
Generating semantic representations of documents and summaries involves five phases:
–Deep syntactic analysis
–Co-reference resolution
–Pronominal anaphora resolution
–Semantic normalization
–Semantic graph analysis

7 Semantic Graph Generation (2/7)
An example outlines the main process and the output of the different stages involved in generating the semantic graph
Deep syntactic analysis is first performed on the raw text
The analysis then applies several methods: co-reference resolution, pronominal anaphora resolution, and semantic normalization
The resulting triples are then linked into a semantic graph based on common concepts

8 Semantic Graph Generation (3/7)
Linguistic Analysis
–Linguistic analysis of the text uses Microsoft's NLPWin natural language processing tool
–NLPWin first segments the text into individual sentences, converts each sentence into a parse tree (Figure 2), and then produces a sentence logical form that reflects its meaning (Figure 3)
–The logical form is expressed as triples, e.g. "Jure" ←"send"→ "Marko"
–Each node carries semantic tags assigned by the NLPWin software
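To make the triple representation concrete, here is a minimal sketch of a container for such subject-predicate-object triples; the tag values are made-up placeholders, and NLPWin itself is not reproduced here.

```python
# Illustrative only: a minimal container for logical form triples of the kind
# produced by this stage. The tag values below are placeholders, not NLPWin's
# actual semantic tags.
from dataclasses import dataclass, field

@dataclass
class Triple:
    subject: str
    predicate: str
    object: str
    tags: dict = field(default_factory=dict)   # semantic tags per node

t = Triple("Jure", "send", "Marko",
           tags={"subject": "person", "object": "person"})
print(f'"{t.subject}" <-"{t.predicate}"-> "{t.object}"')
```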

9 Semantic Graph Generation (4/7)
Co-reference Resolution for Named Entities
–Within a document it is common that terms with different surface forms refer to the same entity
–The algorithm, for example, correctly finds that "Hillary Rodham Clinton", "Hillary Clinton", "Hillary Rodham", and "Mrs. Clinton" all refer to the same entity
Pronominal Anaphora Resolution
–Only five basic types of pronouns are resolved, including all their different forms: "he" (his, him, himself), "she", "I", "they", and "who"
–For a given pronoun, the sentences are searched sequentially, backward and forward, and the type of each named entity is checked to find suitable candidates
–Each candidate is scored, and the candidate entity with the best score is chosen
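The candidate-scoring step can be illustrated with a small sketch; the features and weights below are assumptions for illustration, not the authors' actual scoring function.

```python
# A hypothetical sketch of the candidate-scoring idea described above; the
# features and weights are assumptions, not the authors' actual function.

def score_candidate(distance_in_sentences, type_compatible, is_subject):
    """Prefer nearby, type-compatible candidates, mildly favouring subjects."""
    score = 1.0 / (1 + abs(distance_in_sentences))   # closeness to the pronoun
    score += 2.0 if type_compatible else -5.0        # entity-type agreement
    score += 0.5 if is_subject else 0.0              # subject position preference
    return score

def resolve_pronoun(candidates):
    """candidates: list of (entity, distance, type_compatible, is_subject)."""
    return max(candidates, key=lambda c: score_candidate(*c[1:]))[0]

# Resolving "she": a nearby, type-compatible mention wins.
print(resolve_pronoun([("Hillary Clinton", 1, True, True),
                       ("the committee", 2, False, False)]))
```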

10 Semantic Graph Generation (5/7)
Pronominal Anaphora Resolution
–The evaluation set contained 1506 pronouns, of which 1024 (68%) were of the selected pronoun types
–Average accuracy over the 5 selected pronoun types is 82.1%

11 Semantic Graph Generation (6/7)
Semantic Normalization
–To increase the coherence and compactness of the graphs, WordNet is used to establish synonymy relationships among nodes
–Synonymy information about concepts makes it possible to find equivalence relations between distinct triples
–For example, "watcher" ←"follow"→ "moon" and "spectator" ←"watch"→ "moon" are equivalent
Construction of a Semantic Graph
–The logical form triples are merged on subject and object nodes that belong to the same normalized semantic class, which produces the semantic graph
–For each node, the in/out degree, PageRank weight, Hubs and Authorities weights, and several other attributes are calculated
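The graph-construction step maps naturally onto a standard graph library. The sketch below uses networkx (an assumption; the slides do not name a library) to link merged triples into a directed graph and compute the node attributes listed above.

```python
# A sketch, not the authors' code: link normalized triples into a directed
# graph and compute the per-node attributes mentioned above: in/out degree,
# PageRank, and HITS hub/authority weights.
import networkx as nx

# Triples after WordNet normalization; "watcher <-follow-> moon" and
# "spectator <-watch-> moon" have already been mapped onto the same concepts.
triples = [
    ("spectator", "watch", "moon"),
    ("astronomer", "photograph", "moon"),
    ("astronomer", "photograph", "sky"),
]

G = nx.DiGraph()
for subj, pred, obj in triples:
    G.add_edge(subj, obj, predicate=pred)   # subject -> object, labelled by predicate

pagerank = nx.pagerank(G)
hubs, authorities = nx.hits(G)
for node in G.nodes:
    print(node, G.in_degree(node), G.out_degree(node),
          round(pagerank[node], 3), round(hubs[node], 3), round(authorities[node], 3))
```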

12 Semantic Graph Generation (7/7)

13 Learning Semantic Sub-Graphs Using Support Vector Machines & Experiments
DUC Dataset
–Document Understanding Conference (DUC) 2002 data: 300 newspaper articles on 30 different topics
–Each article has a 100-word human-written abstract, and half of them also have human-extracted sentences
–An average article in the data set contains about 1000 words or 50 sentences
–About 7.5 sentences are selected into the summary
–After linguistic processing there are on average 82 logical form triples per document, 15 of which are contained in the extracted summary sentences
Feature Set
–The resulting vector for a logical form triple contains about 466 binary and real-valued attributes
–On average, 72 of these components have non-zero values
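As a sketch of how such a sparse per-triple feature vector might be assembled (the attribute indices and values below are hypothetical; the slides do not give the exact feature layout):

```python
# A hypothetical sketch of assembling the per-triple feature vector: binary
# linguistic flags plus real-valued graph statistics, stored sparsely since
# only about 72 of the ~466 components are non-zero on average.
import numpy as np
from scipy.sparse import csr_matrix

N_FEATURES = 466

def triple_features(binary_flags, graph_stats):
    """binary_flags / graph_stats: {attribute index: value} (indices made up)."""
    vec = np.zeros(N_FEATURES)
    for idx, val in {**binary_flags, **graph_stats}.items():
        vec[idx] = val
    return csr_matrix(vec)          # 1 x 466 sparse row vector

# e.g. flag 3 = "subject is a named entity", 400 = PageRank of the subject node
x = triple_features({3: 1, 17: 1}, {400: 0.042, 401: 0.013})
print(x.shape, x.nnz, "non-zero components")
```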

14 Learning Semantic Sub-Graphs Using Support Vector Machines & Experiments
Experiment Results: Impact of the Training Data Size
–Sentence classifiers are trained and tested through 10-fold cross-validation experiments
–Training data sizes: 10, 20, 50, and 100 documents
–Small training sets have a negative impact on the generalization of the model
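A hedged sketch of the learning setup with scikit-learn follows; the slides do not name the SVM implementation, and the feature matrix below is synthetic, so this only illustrates the classification-with-cross-validation pattern.

```python
# A sketch of the learning step on synthetic data: classify logical form
# triples as in-summary vs. not with an SVM, evaluated by 10-fold
# cross-validation. scikit-learn is an assumption, and the features are
# random placeholders, not the paper's actual attributes.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_triples, n_features = 820, 466                   # ~10 documents x 82 triples
X = rng.random((n_triples, n_features))
y = (rng.random(n_triples) < 15 / 82).astype(int)  # ~15 of 82 triples are positive

clf = LinearSVC(C=1.0, class_weight="balanced", max_iter=5000)
scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
print("10-fold F1: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```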

15 Learning Semantic Sub-Graphs Using Support Vector Machines & Experiments
Experiment Results: Impact of Different Feature Attributes

16 Learning Semantic Sub-Graphs Using Support Vector Machines & Experiments
Experiment Results: Impact of Different Feature Attributes
–Semantic graph attributes are consistently ranked high among all the attributes used in the model
–Syntactic and semantic properties of the text are therefore important for summarization
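One simple way to inspect which attributes a trained linear SVM weights most heavily is to rank them by the magnitude of their learned coefficients; the sketch below does that on synthetic data and is not necessarily the ranking procedure used in the paper.

```python
# A sketch of ranking attributes in a trained linear SVM by the magnitude of
# their learned weights. Synthetic data again; illustration only.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.random((820, 466))
y = (rng.random(820) < 15 / 82).astype(int)

clf = LinearSVC(C=1.0, class_weight="balanced", max_iter=5000).fit(X, y)
ranking = np.argsort(-np.abs(clf.coef_[0]))        # indices, most influential first
print("ten highest-weighted attribute indices:", ranking[:10])
```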

17 Learning Semantic Sub-Graphs Using Support Vector Machines & Experiments
Experiment Results: Topic-Specific Summarization
–The data set available for topic-specific summarization is small and thus does not allow broad generalization

18 Learning Semantic Sub-Graphs Using Support Vector Machines & Experiments
Sentence Extraction

19 Conclusions
The paper presented a novel approach to document summarization: a semantic representation of the document is generated, and machine learning is applied to extract a sub-graph that corresponds to the semantic structure of a human-extracted document summary
Compared to human-extracted summaries, the method achieves on average a recall of 75% and a precision of 30%