The More the Better? Assessing the Influence of Wikipedia’s Growth on Semantic Relatedness Measures Torsten Zesch and Iryna Gurevych Ubiquitous Knowledge.

Presentation transcript:

The More the Better? Assessing the Influence of Wikipedia’s Growth on Semantic Relatedness Measures Torsten Zesch and Iryna Gurevych Ubiquitous Knowledge Processing Lab Technische Universität Darmstadt

2 Wikipedia as a Language Resource
NLP applications
- Information Extraction [Ruiz-Casado et al., 2005]
- Information Retrieval [Gurevych et al., 2007]
- Keyphrase Extraction [Medelyan, Milne & Witten, 2008]
- Named Entity Recognition [Bunescu & Pasca, 2006]
- Question Answering [Ahn et al., 2004]
- Semantic Relatedness [Zesch & Gurevych, 2010]
- Text Categorization [Gabrilovich & Markovitch, 2006]
- WSD [Mihalcea, 2007]
See [Medelyan et al., 2008] for an excellent overview.
| Computer Science Department | UKP Lab - Prof. Dr. Iryna Gurevych | Torsten Zesch |

3 Growth of Wikipedia

4 Growth of Wikipedia
[Chart annotation: Categories introduced]

5 Growth of Wikipedia
[Chart annotation: +Coverage]
- Influence of Wikipedia’s growth on task performance is unknown
- Only most recent Wikipedia snapshots are publicly available
- Previous research cannot be reproduced

6 JWPL – TimeMachine
[Diagram: Wikipedia Dump (all revisions) → TimeMachine (one-time effort) → Snapshot 1, Snapshot 2 → Java-based API (JWPL) (run-time) → Application]
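The snapshot idea on this slide can be illustrated with a minimal sketch. It is not JWPL's implementation or API: the `snapshot` function, the `(title, timestamp, text)` revision tuples, and the sample data are all hypothetical. It only shows the principle of reconstructing a past state by keeping, for each page, the latest revision at or before a chosen date.

```python
from datetime import datetime


def snapshot(revisions, as_of):
    """Return {title: text} with each page's latest revision at or before as_of.

    `revisions` is an iterable of (title, timestamp, text) tuples; this layout
    is a hypothetical stand-in for a full-history Wikipedia dump.
    """
    latest = {}
    for title, timestamp, text in revisions:
        if timestamp > as_of:
            continue  # revision did not exist yet at the snapshot date
        if title not in latest or timestamp > latest[title][0]:
            latest[title] = (timestamp, text)
    return {title: text for title, (_, text) in latest.items()}


# Hypothetical revision history of two articles.
REVISIONS = [
    ("Baum", datetime(2004, 1, 10), "Baum v1"),
    ("Baum", datetime(2006, 5, 2), "Baum v2"),
    ("Auto", datetime(2005, 3, 7), "Auto v1"),
]

snapshot_2005 = snapshot(REVISIONS, datetime(2005, 1, 1))
snapshot_2007 = snapshot(REVISIONS, datetime(2007, 1, 1))
```

Pages created after the snapshot date are simply absent from the result, which is why coverage can differ between snapshots.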

9 Semantic Relatedness Measures
- Example word pairs: tree – car, tree – willow
- Quantify the strength of semantic relatedness [0,1]

11 Types of Semantic Relatedness Measures
- Path Based
- Gloss Based
- Concept Vector Based
- Link Vector Based

12 Path based Measures
- Semantic relatedness corresponds e.g. to number of edges of the shortest path between two nodes (articles, categories)
[Figure: example graph with the nodes motor vehicle, car, bike, truck, cab, ..., minivan, garbage truck, tractor]
- cab – minivan: 2
- cab – tractor:
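A minimal sketch of the shortest-path idea, using breadth-first search on an undirected graph. The edge list is an assumption: the transcript preserves only the node names of the slide's figure, so the hierarchy below is a plausible guess, and `build_graph` and `path_length` are hypothetical helper names.

```python
from collections import deque

# Assumed edges; only the node names survive in the slide transcript.
EDGES = [
    ("motor vehicle", "car"),
    ("motor vehicle", "bike"),
    ("motor vehicle", "truck"),
    ("car", "cab"),
    ("car", "minivan"),
    ("truck", "garbage truck"),
    ("truck", "tractor"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_length(graph, source, target):
    """Number of edges on the shortest path, or None if no path exists."""
    if source not in graph or target not in graph:
        return None
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, distance = queue.popleft()
        if node == target:
            return distance
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


GRAPH = build_graph(EDGES)
cab_minivan = path_length(GRAPH, "cab", "minivan")
```

Under the assumed edges, `cab_minivan` agrees with the value of 2 given on the slide. A word that is missing from the graph yields `None`, which is the situation the coverage figures later in the deck measure.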

13 Gloss based measures
WordNet glosses
- tree (plant): “a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown”
- trunk (tree): “the main stem of a tree; usually covered with bark; the bole is usually the part that is commercially useful for lumber”
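A minimal sketch of a Lesk-style gloss overlap, applied to the two WordNet glosses quoted on this slide. The tokenizer and the small stop-word list are assumptions made for illustration; practical gloss-based measures typically use richer preprocessing.

```python
import re

# Hypothetical, deliberately tiny stop-word list.
STOPWORDS = {"a", "an", "the", "of", "and", "is", "that", "for", "with"}

GLOSSES = {
    "tree": "a tall perennial woody plant having a main trunk and branches "
            "forming a distinct elevated crown",
    "trunk": "the main stem of a tree; usually covered with bark; the bole is "
             "usually the part that is commercially useful for lumber",
}


def content_words(gloss):
    """Lower-case the gloss, split it into words, and drop stop words."""
    return {w for w in re.findall(r"[a-z]+", gloss.lower()) if w not in STOPWORDS}


def gloss_overlap(gloss_a, gloss_b):
    """Return the set of content words shared by two glosses."""
    return content_words(gloss_a) & content_words(gloss_b)


shared = gloss_overlap(GLOSSES["tree"], GLOSSES["trunk"])
relatedness_score = len(shared)
```

Neither gloss contains the other headword as its own entry word here, so the score depends entirely on shared vocabulary, which illustrates why short glosses give sparse evidence.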

14 Term – Document Matrix
Rows: documents d1 … dn. Columns: terms t1 … tm.
        t1   t2   t3   …   tm-1   tm
d1      3    1    0    …   0      0
d2      0    5    0    …   1      0
d3      1    0    2    …   3      3
…       …    …    …    …   …      …
dn-1    0    2    3    …   2      1
dn      2    3    0    …   5      0

15 Gloss Based Measures
Rows: articles d1 … dn, labelled with article titles c1 … cn. Columns: terms t1 … tm.
             t1   t2   t3   …   tm-1   tm
d1 (c1)      3    1    0    …   0      0
d2 (c2)      0    5    0    …   1      0
d3 (c3)      1    0    2    …   3      3
…            …    …    …    …   …      …
dn-1 (cn-1)  0    2    3    …   2      1
dn (cn)      2    3    0    …   5      0
Inner Product (usually Lesk) [Lesk, 1986]

16 Concept Vector Based Measure
Rows: articles d1 … dn, labelled with concepts c1 … cn. Columns: terms t1 … tm.
             t1   t2   t3   …   tm-1   tm
d1 (c1)      3    1    0    …   0      0
d2 (c2)      0    5    0    …   1      0
d3 (c3)      1    0    2    …   3      3
…            …    …    …    …   …      …
dn-1 (cn-1)  0    2    3    …   2      1
dn (cn)      2    3    0    …   5      0
Inner Product (usually Cosine)
ESA [Gabrilovich & Markovitch, 2007]
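A minimal sketch of the concept vector comparison, assuming (as in ESA) that a term is represented by its column of weights over the articles, i.e. concepts, and that two terms are compared with the cosine of those columns. The matrix holds only the five rows and five columns visible on the slide; the raw counts stand in for the TF-IDF weights ESA uses, and the names `TERMS`, `MATRIX`, `concept_vector`, and `cosine` are hypothetical.

```python
from math import sqrt

# Visible part of the slide's matrix; elided rows and columns are left out.
TERMS = ["t1", "t2", "t3", "t_m-1", "t_m"]
MATRIX = [
    [3, 1, 0, 0, 0],  # d1 (c1)
    [0, 5, 0, 1, 0],  # d2 (c2)
    [1, 0, 2, 3, 3],  # d3 (c3)
    [0, 2, 3, 2, 1],  # d_n-1 (c_n-1)
    [2, 3, 0, 5, 0],  # d_n (c_n)
]


def concept_vector(term):
    """Return the term's column: one weight per concept (article)."""
    column = TERMS.index(term)
    return [row[column] for row in MATRIX]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


relatedness_t1_t2 = cosine(concept_vector("t1"), concept_vector("t2"))
```

Because the vectors are indexed by concepts rather than by words, two terms can be related without ever co-occurring in a gloss, as long as they are prominent in the same articles.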

17 Link Vector Based Measure
Rows: articles d1 … dn, labelled with article titles c1 … cn. Columns: links l1 … lm.
             l1   l2   l3   …   lm-1   lm
d1 (c1)      3    1    0    …   0      0
d2 (c2)      0    5    0    …   1      0
d3 (c3)      1    0    2    …   3      3
…            …    …    …    …   …      …
dn-1 (cn-1)  0    2    3    …   2      1
dn (cn)      2    3    0    …   5      0
Inner Product (usually Cosine)
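A minimal sketch of a link vector comparison, assuming each article is represented by the row of links it contains and two articles are compared with the cosine of those rows. The article titles and link targets in `LINKS` are invented for illustration, and the sparse `Counter` representation is a design choice of this sketch, not something the slides specify.

```python
from collections import Counter
from math import sqrt

# Hypothetical outgoing links per article title.
LINKS = {
    "willow": ["tree", "plant", "wood"],
    "oak": ["tree", "plant", "acorn"],
    "minivan": ["car", "motor vehicle"],
}

# Sparse link vectors: link target -> number of occurrences in the article.
LINK_VECTORS = {title: Counter(targets) for title, targets in LINKS.items()}


def sparse_cosine(u, v):
    """Cosine between two sparse vectors stored as {dimension: weight}."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = sqrt(sum(weight * weight for weight in u.values()))
    norm_v = sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


willow_oak = sparse_cosine(LINK_VECTORS["willow"], LINK_VECTORS["oak"])
willow_minivan = sparse_cosine(LINK_VECTORS["willow"], LINK_VECTORS["minivan"])
```

Storing only the non-zero entries keeps the vectors small, since a single article links to a tiny fraction of all possible targets.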

18 Types of Semantic Relatedness Measures
- Path Based
- Gloss Based
- Concept Vector Based
- Link Vector Based
[Figures repeated from the preceding slides: path graph, term-document matrices, article-link matrix]

19 Experimental Setup
- Created 6-monthly snapshots of the German Wikipedia
  - Start
  - End
- Accessed the dumps using JWPL Wikipedia API
- Implemented all measure types on top of JWPL
- Two evaluation approaches:
  - Correlation with human judgments on word pair lists
  - Solving word choice problems

21 Evaluation Datasets
[Table: word pairs tree – lake, tree – willow, tree – car; column Ø]
- Spearman rank correlation coefficient ρ

22 Evaluation Datasets
[Table: word pairs tree – lake, tree – willow, tree – car; column Ø]
- Spearman rank correlation coefficient ρ
Gur350 dataset [Gurevych, 2005]
- 350 word pairs
- Nouns, verbs, and adjectives
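A minimal sketch of the Spearman rank correlation used to compare a measure's scores with human judgments: both score lists are converted to ranks (ties share their average rank) and the Pearson correlation of the ranks is computed. The numeric judgments and measure scores are invented for illustration, because the values from the slide's table are not preserved in the transcript.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for: tree – lake, tree – willow, tree – car.
HUMAN_JUDGMENTS = [1.5, 3.8, 0.4]
MEASURE_SCORES = [0.2, 0.9, 0.3]

rho = spearman(HUMAN_JUDGMENTS, MEASURE_SCORES)
```

Only the ordering of the scores matters, so a measure does not need to produce values on the same scale as the human judgments.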

23 Coverage
[Table: word pairs tree – lake, tree – willow, tree – car]

24 Coverage – Gur

25 Coverage – Gur

26 Coverage – Gur
[Chart annotation: Categories introduced]

27 Correlation – Gur

28 Correlation – Gur350 (Fixed Coverage)

30 Dataset
- Datasets
  - 1008 German word choice problems [Mohammad et al., 2007]
- Evaluation metric
  - Coverage / Accuracy / Harmonic Mean
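A minimal sketch of how the three metrics could be computed for word choice problems. The definitions are assumptions, since the slide only names the metrics: coverage is taken as attempted problems over all problems, accuracy as correct answers over attempted problems, and the harmonic mean combines the two. The problems and relatedness scores are invented English examples, not items from the German dataset.

```python
# Hypothetical relatedness scores; pairs that are absent count as not covered.
TOY_SCORES = {
    frozenset(("tree", "willow")): 0.9,
    frozenset(("tree", "car")): 0.1,
    frozenset(("tree", "lake")): 0.3,
    frozenset(("car", "truck")): 0.8,
    frozenset(("car", "willow")): 0.2,
    frozenset(("lake", "car")): 0.6,
}

# Each problem: (target word, candidate words, correct candidate).
PROBLEMS = [
    ("tree", ["willow", "car", "lake"], "willow"),
    ("car", ["truck", "willow"], "truck"),
    ("lake", ["tree", "car"], "tree"),
    ("bole", ["trunk", "crown"], "trunk"),
]


def relatedness(word_a, word_b):
    """Look up a toy score; None means the pair is not covered."""
    return TOY_SCORES.get(frozenset((word_a, word_b)))


def solve(problem):
    """Pick the candidate most related to the target, or None if none is covered."""
    target, candidates, _ = problem
    scored = {c: relatedness(target, c) for c in candidates}
    scored = {c: s for c, s in scored.items() if s is not None}
    if not scored:
        return None
    return max(scored, key=scored.get)


def evaluate(problems):
    """Return (coverage, accuracy, harmonic mean) under the assumed definitions."""
    attempted = correct = 0
    for problem in problems:
        choice = solve(problem)
        if choice is None:
            continue
        attempted += 1
        correct += choice == problem[2]
    coverage = attempted / len(problems)
    accuracy = correct / attempted if attempted else 0.0
    total = coverage + accuracy
    harmonic_mean = 2 * coverage * accuracy / total if total else 0.0
    return coverage, accuracy, harmonic_mean


coverage, accuracy, harmonic_mean = evaluate(PROBLEMS)
```

Reporting the harmonic mean alongside the two components guards against a measure that looks accurate only because it attempts very few problems.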

31 Coverage

32 Accuracy

33 Harmonic Mean

34 Summary
- Wikipedia is a great resource for many NLP tasks
- Wikipedia grows very fast
The more, the better?
→ Growth does not hurt performance of semantic relatedness measures
→ Using more recent Wikipedia dumps does not increase coverage much
JWPL Time Machine
- Create a snapshot reflecting any past state of Wikipedia
- Reproduce previous results obtained using a certain snapshot
- Perform similar studies for other NLP tasks

35 References (I)
Ahn, D., Jijkoun, V., Mishne, G., Müller, K., de Rijke, M., and Schlobach, S. (2004). Using Wikipedia at the TREC QA Track. In Proceedings of the Thirteenth Text REtrieval Conference (TREC), Gaithersburg, Maryland.
Bunescu, R. and Pasca, M. (2006). Using Encyclopedic Knowledge for Named Entity Disambiguation. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 9–16, Trento, Italy.
Gabrilovich, E. and Markovitch, S. (2007). Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 1606–1611, Hyderabad, India.
Gurevych, I. (2005). Using the Structure of a Conceptual Network in Computing Semantic Relatedness. In Proceedings of the 2nd International Joint Conference on Natural Language Processing, pages 767–778, Jeju Island, Republic of Korea.
Gurevych, I., Müller, C., and Zesch, T. (2007). What to be? - Electronic Career Guidance Based on Semantic Relatedness. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 1032–1039, Prague, Czech Republic.
Lesk, M. (1986). Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation, pages 24–26, Toronto, Canada.

36 References (II)
Mihalcea, R. (2007). Using Wikipedia for Automatic Word Sense Disambiguation. In Proceedings of HLT 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, Rochester, NY, April 2007.
Medelyan, O., Legg, C., Milne, D., and Witten, I.H. (2008). Mining Meaning from Wikipedia. International Journal of Human-Computer Studies. 67:9, September 2009, p
Medelyan, O., Witten, I.H., and Milne, D. (2008). Topic Indexing with Wikipedia. In Proceedings of the first AAAI Workshop on Wikipedia and Artificial Intelligence (WIKIAI'08), Chicago, I.L.
Mohammad, S., Gurevych, I., Hirst, G., and Zesch, T. (2007). Cross-lingual Distributional Profiles of Concepts for Measuring Semantic Distance. In Proceedings of EMNLP-CoNLL, pages 571–580, Prague, Czech Republic.
Ruiz-Casado, M., Alfonseca, E., and Castells, P. (2005). Automatic Assignment of Wikipedia Encyclopedic Entries to WordNet Synsets. In Advances in Web Intelligence, pages 380–386.
Zesch, T. and Gurevych, I. (2010). Wisdom of Crowds versus Wisdom of Linguists - Measuring the Semantic Relatedness of Words. In: Journal of Natural Language Engineering, vol. 16, no. 01, pages 25–

Backup Slides

38 Coverage – Gur

39 Correlation – Gur

40 Correlation – Gur

41 Correlation – Gur