Effects of overlaying ontologies to TextRank graphs Project Report By Kino Coursey.

Outline
Introduction & Background
Ontology-based Summarization
Evaluation
Discussion
Future Work
Conclusion

Motivation
An exponentially increasing volume of information requires summarization
  – Humans are finite
  – Text is being generated faster than a reader can read
  – Need to quickly identify the relevance of documents

Central Question: Does knowing more really help?
TextRank and a number of other random-walk NLP algorithms have been applied to areas such as text summarization and keyword extraction.
How would additional information from an ontology like WordNet or Cyc affect such algorithms? Would they perform better or worse?

Evaluation Criteria
The evaluation criterion is the change in TextRank's performance when it is given the extra information.
The evaluation dataset is the Document Understanding Conference 2002 (DUC-2002) summarization test set.
The ROUGE summarization evaluation tool is used to measure the change in performance.
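
As a rough illustration of the measurement described above, the sketch below computes a skip-bigram F-measure in the spirit of ROUGE-S and the relative change between a baseline and a variant score. It is not the ROUGE toolkit itself: the whitespace tokenisation, the unlimited skip distance, and the names `skip_bigrams`, `rouge_s` and `relative_change` are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens):
    """Count every in-order pair of tokens, allowing any gap between them."""
    return Counter(combinations(tokens, 2))


def rouge_s(candidate, reference, beta=1.0):
    """Skip-bigram F-measure between two whitespace-tokenised strings.

    Simplified stand-in for ROUGE-S (assumption: no stemming, no
    stopword removal, no maximum skip distance).
    """
    cand = skip_bigrams(candidate.lower().split())
    ref = skip_bigrams(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped co-occurrence count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


def relative_change(baseline, variant):
    """Relative performance change of the variant over the baseline."""
    return (variant - baseline) / baseline
```

For instance, `rouge_s("police kill the gunman", "police killed the gunman")` evaluates to 0.5, because three of the six skip-bigrams on each side match.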

Project Plan
Implement TextRank
Construct an algorithm to import data from Cyc into TextRank
Construct evaluation dataset preprocessor
Develop a parameter tuning process
Measure performance with optimal parameters
Analyze and report results

Implementation
Implemented the Intelligent Surfer model in Perl
Implemented text-to-Cyc graph extraction
  – Denotation map
  – Using: isa, genls, conceptuallyRelated, mainDomain, definingMt
Explored graph visualization technology (easier to debug what you can see)
  – Nodes3d from BrainMaps.org

Ontology Based Summarization
Augment TextRank with Cyc relationships
  – Perform initial context-free mapping into Cyc terms
  – Perform ranking process
  – Select the highest ranked sentences as extractive summary

Intelligent Surfer Model
The Standard Model vs. the Intelligent Surfer Model: for all nodes use [equation image not preserved], with a constraint on S_i.
S_i is apportioned as a function of query relevancy. Here words in the input text have S_i = 1/N, while all other nodes have S_i = 0.
When you get tired you jump back to the “problem statement”, the input.

Weighted Version
Sum of the outputs
Weighted updates
Summation of the weighted outputs of the currently ranked nodes
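
The slide equations did not survive in this transcript, so the following is only a sketch of how the description might be realised: a weighted PageRank-style iteration in which the random jump lands on node i with probability S_i. The damping value d = 0.85, the reading of N as the number of input-text nodes (so that the S_i sum to 1), the function name and the toy graph in the usage lines are all assumptions, not details from the report (which was implemented in Perl).

```python
def intelligent_surfer_rank(edges, seeds, d=0.85, iterations=50):
    """Weighted rank with a biased jump back to the input text.

    edges: {source: {target: weight}} (assumed representation)
    seeds: set of nodes that correspond to words in the input text
    """
    nodes = set(edges) | set(seeds)
    for targets in edges.values():
        nodes |= set(targets)

    # Jump distribution S: 1/N for input-text nodes, 0 for all others.
    s = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(s)
    out_weight = {n: sum(edges.get(n, {}).values()) for n in nodes}

    for _ in range(iterations):
        # "Tired" surfers jump back to the input with probability 1 - d.
        new_rank = {n: (1.0 - d) * s[n] for n in nodes}
        # Each node passes its rank along its edges in proportion to weight.
        for src, targets in edges.items():
            if out_weight[src] == 0:
                continue
            for dst, w in targets.items():
                new_rank[dst] += d * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank


# Toy graph (hypothetical weights): two input words and one Cyc term.
toy_edges = {
    "hurricane": {"HurricaneAsEvent": 1.0, "gilbert": 1.0},
    "gilbert": {"hurricane": 1.0},
    "HurricaneAsEvent": {"hurricane": 1.0},
}
toy_rank = intelligent_surfer_rank(toy_edges, {"hurricane", "gilbert"})
```

Because every node in the toy graph has outgoing edges and the S_i sum to 1, the ranks keep summing to 1 across iterations; nodes without outgoing edges would leak rank in this simplified version.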

From text to Cyc graph
Text-to-Cyc graph extraction
  – Denotation map
  – Using: isa, genls, conceptuallyRelated, mainDomain, definingMt
  – Each edge has its own weight associated with it
  – Finding the right weight is its own process
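
One way to picture the extraction step is as a function that turns denotation pairs and ontology assertions into a weighted graph, with one weight per relation type. The sketch below is illustrative only: the data shapes, the symmetric edges, the fixed weight for word-to-term links, and the placeholder value of 1.0 for every relation are assumptions, and the assertion in the usage lines is hypothetical rather than taken from Cyc.

```python
# Placeholder per-relation weights; in the report these are tuned separately.
DEFAULT_WEIGHTS = {
    "isa": 1.0,
    "genls": 1.0,
    "conceptuallyRelated": 1.0,
    "mainDomain": 1.0,
    "definingMt": 1.0,
}


def build_graph(denotations, assertions, weights=DEFAULT_WEIGHTS):
    """Build {source: {target: weight}} from words and ontology relations.

    denotations: iterable of (word, term) pairs from the denotation map
    assertions:  iterable of (predicate, term_a, term_b) triples
    """
    edges = {}

    def add(a, b, w):
        edges.setdefault(a, {})
        edges[a][b] = edges[a].get(b, 0.0) + w

    # Link each word to every candidate term (assumed weight 1.0, both ways).
    for word, term in denotations:
        add(word, term, 1.0)
        add(term, word, 1.0)

    # Link terms through the selected relations; skip other predicates.
    for predicate, a, b in assertions:
        w = weights.get(predicate)
        if w is None:
            continue
        add(a, b, w)
        add(b, a, w)
    return edges


graph = build_graph(
    [("Hurricane", "HurricaneAsObject"), ("Hurricane", "HurricaneAsEvent")],
    [("isa", "HurricaneAsEvent", "SomeEventType")],  # hypothetical assertion
)
```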

Finding the right terms
(denotation-mapper "Hurricane Gilbert swept toward the Dominican Republic Sunday")
Results:
(("Hurricane" . HurricaneAsObject)
 ("Hurricane" . HurricaneAsEvent)
 ("Gilbert" . JohnGilbert)
 ("Gilbert" . JodyGilbert)
 ("Gilbert" . MelissaGilbert)
 ("Gilbert" . GilbertStuart-TheArtist)
 ("Gilbert" . GilbertGottfried)
 ("swept" . SweepingAnArea)
 ("swept" . (ThingDescribableAsFn Sweep-TheWord Adjective))
 ("toward" . (HypothesizedPrepositionSenseFn Toward-TheWord Preposition))
 ("the Dominican Republic" . DominicanRepublic)
 ("Sunday" . wikip-Sunday)
 ("Sunday" . (ThingDescribableAsFn Sunday-TheWord Adjective)))

The Big View

Tuning the system with Genetic Algorithms
A steady-state genetic algorithm was used to search for an optimal weighting, with fitness measured by ROUGE-S on a subset of documents.

Genetic Algorithm & Evaluation Function
1. Select k members for tournament (here k=4).
2. For all members in tournament evaluate performance on the task and compute fitness.
3. Perform tournament selection by sorting based on fitness and creating a parent set and a replacement set.
4. Copy parents over replacement set to make children.
5. Do mutation and crossover operations on children.
6. Go to step 1.
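
The six steps above can be sketched as a short steady-state loop. This is a minimal illustration rather than the report's implementation: genomes are assumed to be lists of weights in [0, 1], crossover is one-point, mutation is Gaussian, and the population size, step count and mutation rate are arbitrary. In the project the fitness would come from ROUGE-S; here any callable can be passed in.

```python
import random


def steady_state_ga(fitness, genome_len, pop_size=20, k=4, steps=200,
                    mutation_rate=0.1, seed=0):
    """Steady-state GA with k-way tournaments; returns the fittest genome."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(genome_len)]
                  for _ in range(pop_size)]

    for _ in range(steps):
        # 1. Select k members for the tournament.
        contestants = rng.sample(range(pop_size), k)
        # 2-3. Evaluate fitness, sort, split into parents and replacements.
        contestants.sort(key=lambda i: fitness(population[i]), reverse=True)
        parents, losers = contestants[: k // 2], contestants[k // 2:]
        # 4. Copy parents over the replacement set to make children.
        children = [list(population[p]) for p in parents]
        # 5. One-point crossover between the first two children, then mutation.
        if genome_len > 1 and len(children) >= 2:
            cut = rng.randrange(1, genome_len)
            children[0][cut:], children[1][cut:] = (
                children[1][cut:], children[0][cut:])
        for child in children:
            for g in range(genome_len):
                if rng.random() < mutation_rate:
                    child[g] = min(1.0, max(0.0, child[g] + rng.gauss(0.0, 0.1)))
        for slot, child in zip(losers, children):
            population[slot] = child
        # 6. Loop back to step 1.

    return max(population, key=fitness)


# Toy fitness (stand-in for a ROUGE-S based score): prefer weights near 0.5.
def toy_fitness(genome):
    return -sum((x - 0.5) ** 2 for x in genome)


best = steady_state_ga(toy_fitness, genome_len=5)
```

Since only tournament losers are overwritten, the best genome found so far is never replaced, which is the usual attraction of the steady-state scheme.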

Initial GA Evaluation
[Table: Document | TextRank | OntoRank | Ratio, with an AVG row; values not preserved]
The GA was run on a random subset of documents that scored below average with default settings, and was run until it provided a +5.75% gain over TextRank on the ROUGE-S scores.

Combined Ranking: HurricaneAsObject vs. HurricaneAsEvent
Commonsense distinctions that differ from those in an ontology like WordNet.
HurricaneAsObject: “Hurricane Gilbert moved to the north …”
HurricaneAsEvent: “During Hurricane Gilbert many trees were …”

Combined Ranking: Many Gilberts but one hurricane topic
Gilbert is an ambiguous word for Cyc
Yet the word's primary connections are topic related
Similar to human name association in context

EVALUATIONS
Initial GA scores showed a +5% improvement
Evaluation on the whole dataset
Shocking Revelation
Re-Evaluation

First Full Evaluation
Performed full per-document evaluation on DUC-2002
Carried out detailed per-document review of relative performance using ROUGE-S

Disappointing full dataset performance

Debugging via Histogramming
Sorted the relative performance on a per-document basis
High variance, with average positive effect +15% and average negative effect -14%
Unfortunately more often negative than positive, so a net negative skew
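
The per-document breakdown described above can be reproduced with a few lines. The sketch assumes scores are held in dictionaries keyed by document id; the function name and the scores in the usage lines are made-up placeholders, not DUC-2002 results.

```python
def summarize_relative_performance(baseline, variant):
    """Sort per-document relative changes and split gains from losses.

    baseline, variant: {document_id: score} with the same keys (assumed).
    """
    changes = sorted(
        (variant[doc] - baseline[doc]) / baseline[doc] for doc in baseline
    )
    gains = [c for c in changes if c > 0]
    losses = [c for c in changes if c < 0]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return {
        "sorted_changes": changes,   # input for a histogram
        "n_gains": len(gains),
        "n_losses": len(losses),
        "mean_gain": mean(gains),    # average positive effect
        "mean_loss": mean(losses),   # average negative effect
        "net": mean(changes),        # overall skew
    }


# Placeholder scores for three hypothetical documents.
stats = summarize_relative_performance(
    {"d1": 0.50, "d2": 0.40, "d3": 0.50},
    {"d1": 0.55, "d2": 0.36, "d3": 0.45},
)
```

Keeping the mean gain and mean loss apart is what exposes the pattern on the slide: similar magnitudes in both directions, but more documents on the losing side.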

Revelation
While working on a distributed version of TextRank, discovered the two datasets in DUC-2002
  – The per-document generative summary
  – The multi-document extractive summary
Of course the system was using the generative summary to evaluate an extractive system!
Convert and re-test on the multi-document dataset
No time to re-evolve using the GA for the multi-document data

Multi-document Re-Evaluation

Evaluation Conclusions
Much more encouraging when comparing same data types
Initial weakness prompted analysis of the negative result, leading to the theory covered in the discussion
No breakthrough

Discussion
Adding the commonsense graph produces wide variation in TextRank performance, both positive and negative.
  – TextRank tries to preserve the total information present in a graph
  – Adding commonsense to the graph can identify what a reader should be interested in as well as what they probably already know
  – In the first case there is an improvement: disambiguation and context are selected
  – In the second you transmit redundant information (common sense) and reduce the effective bandwidth of the summary

Discussion
Identification of stopconcepts
  – The ontology version of stopwords
  – Nodes that have so much connectivity that they contain little information
  – Created a stopconcepts list
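
The slides do not say how the stopconcepts list was produced. One plausible approach, shown here purely as an assumption, is to flag nodes whose degree exceeds a threshold and drop them from the graph before ranking. The graph representation, the function names, the threshold and the node name "Thing" are all hypothetical.

```python
def find_stopconcepts(edges, max_degree):
    """Return nodes whose total degree (in + out) exceeds max_degree."""
    degree = {}
    for src, targets in edges.items():
        degree[src] = degree.get(src, 0) + len(targets)
        for dst in targets:
            degree[dst] = degree.get(dst, 0) + 1
    return {node for node, deg in degree.items() if deg > max_degree}


def remove_stopconcepts(edges, stopconcepts):
    """Copy the graph without stopconcept nodes or edges that touch them."""
    return {
        src: {dst: w for dst, w in targets.items() if dst not in stopconcepts}
        for src, targets in edges.items()
        if src not in stopconcepts
    }


# Toy graph: "Thing" is connected to everything, so it carries little signal.
toy_graph = {
    "Thing": {"a": 1.0, "b": 1.0, "c": 1.0},
    "a": {"Thing": 1.0, "b": 1.0},
    "b": {"Thing": 1.0},
    "c": {"Thing": 1.0},
}
stop = find_stopconcepts(toy_graph, max_degree=4)
pruned = remove_stopconcepts(toy_graph, stop)
```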

Future Work
Run the GA on the multi-document data set
Develop the ability to distinguish novel information from redundant information
The ontology ranking process itself is useful
  – Ontological debugging
  – Familiarization with the language of the ontology via a form of parallel text

Conclusions
Adding commonsense graphs to TextRank can affect the performance both positively and negatively
Need to identify how to modulate the effects of commonsense information
Having the right data helps!
Spin-offs for the text-to-ontology graph can be useful
