CICWSD: configuration guide


CICWSD: configuration guide. Francisco Viveros-Jiménez, Alexander Gelbukh, Grigori Sidorov.

Contents What is CICWSD? Quick Start Configuration file <dict> <testbed> <docs> <xls> <test> <algorithm> <condition> Contact information and citation

What is CICWSD? CICWSD is a Java API and command-line tool for word sense disambiguation (WSD). Its main features are: It includes several state-of-the-art dictionary-based WSD algorithms ready to use. Many parameters are easy to configure, such as window size, number of senses retrieved from the dictionary, back-off method, tie-solving method, and conditions for selecting window words. All configuration is done in a single XML file. Output is generated as a simple XLS file using JExcelApi. The API is licensed under the GNU General Public License (v2 or later); source code is included. The Senseval 2 and Senseval 3 English All-Words test sets are bundled with CICWSD.

Quick Start Download CICWSD from http://fviveros.gelbukh.com/downloads/CICWSD-1.0.zip. Unzip the files, open a command line, change the current directory to the CICWSD directory, and execute java -jar cicwsd.jar.

Configuration file (1) You can configure your own experimental setup by modifying the config.xml file. The XML is organized as follows: <dict> is the section for describing which data will be loaded as the sense inventory. <testbed> is the section for specifying which target files are going to be disambiguated. <xls> is the section for describing which algorithms are going to be tested and for configuring the output file. These three sections must be encapsulated inside a <run> node.

Configuration file (2) The valid structure of the file is the following:
<?xml version="1.0" encoding="UTF-8"?>
<run>
  <dict />
  <testbed>
    <docs />
  </testbed>
  <xls>
    <test>
      <algorithm />
      <condition />
    </test>
  </xls>
</run>
The <docs>, <xls>, <test>, and <condition> nodes (shown in red on the original slide) may appear more than once.

<dict> section The <dict> section specifies the data that will form each sense's bag of words. A bag of words is a set of words representing a sense definition, e.g. paper_4=(medium_N written_J communication_N). Several sources can be specified; sources must be separated by a ";". The valid options are: WNGlosses: definitions extracted from WordNet 3.1. WNSamples: sample sentences extracted from WordNet 3.1. SemCor: the SemCor corpus. Some sample <dict> sections are:
<dict sources="WNGlosses;WNSamples"/>
<dict sources="SemCor"/>
<dict sources="WNGlosses;SemCor"/>

<testbed> section (1) The <testbed> section specifies the test sets (SemCor-formatted XML data). The <testbed> section contains one or more <docs> nodes; each <docs> node describes a SemCor file or folder. The <testbed> section has an attribute called senses, which specifies which senses are going to form the sense inventory. The valid values are: "All": read all senses. "+N": read the first N senses. "*N": read only the Nth sense. "-N": exclude the Nth sense. For example, if you want to disambiguate a text considering only the first two senses of each word in WordNet, use the following: <testbed senses="+2">

<testbed> section (2) For example, the word paper has the following definitions in WordNet:
1: (n) paper (a material made of…)
2: (n) composition, paper, report, theme (an essay …)
3: (n) newspaper, paper (a daily or weekly publication…)
4: (n) paper (a medium for written communication)
5: (n) paper (a scholarly article describing …)
6: (n) newspaper, paper, newspaper publisher (a business firm that …)
7: (n) newspaper, paper (the physical object that …)
The loaded senses will be:
senses value   Loaded sense set
All            (1,2,3,4,5,6,7)
+2             (1,2)
*2             (2)
-2             (1,3,4,5,6,7)

<docs> A <docs> node describes the files that form a test set. You can add as many <docs> nodes as you need. The results for each test set are stored in independent Excel files; this means that if you have two <docs> nodes and one <xls> node, CICWSD will produce two Excel files. For example, <docs src="Resources/senseval2" prefix="S2_"/> indicates that the XML files inside the folder Resources/senseval2 will be used as a test set, and the corresponding results will be stored in an Excel file named "S2_Name.xls", where Name is specified in the <xls> section. <docs src="mydoc.xml" prefix="md_"/> indicates that the mydoc.xml file will be the test set and its corresponding output will be named "md_Name.xls". Remember that the files should be SemCor-compliant XML. CICWSD includes the Senseval 2 and Senseval 3 English All-Words test sets; they are located inside the Resources folder.
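Combining the senses attribute and the <docs> nodes described above, a complete <testbed> section with two test sets might look like the sketch below. The Resources/senseval3 folder name is an assumption about where the bundled Senseval 3 data lives; check the Resources folder of your installation for the actual name.
<testbed senses="All">
  <docs src="Resources/senseval2" prefix="S2_"/>
  <!-- assumed folder name for the bundled Senseval 3 test set -->
  <docs src="Resources/senseval3" prefix="S3_"/>
</testbed>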

<xls> section The <xls> section details two things: The WSD algorithms to be tested and compared. The algorithms and their configurations are specified by adding <test> nodes; you must add at least one <test> node for each <xls> section. The location of the generated Excel files. You can add as many <xls> nodes as you want. Please remember that for each <xls> section, CICWSD will produce N Excel files (where N is the number of test sets specified in the <testbed> section). Let us see an <xls> section example: <xls src="tests/results.xls" detail="false"> This node tells CICWSD that the results are going to be stored in the file(s) tests/PREFIXN_results.xls (PREFIXN_ is specified in each <docs> node as mentioned previously). Also, the detail attribute tells CICWSD that the detail sheet is not going to be included (please read the results interpretation guide).

<test> <test> nodes describe the target algorithms for comparison. Each <test> node must include the following: an <algorithm> node for specifying the WSD algorithm, and one or more <condition> nodes for setting window selection filters. These algorithms will be included in the comparison stored in the specified Excel file. You can add several <test> nodes to a single <xls> section; however, we recommend including only a few <test> nodes per <xls>.
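For orientation, a minimal <test> node might look like the following sketch; the attribute values here are only examples, and their meaning is explained on the next slides.
<test>
  <algorithm disambiguation="SimplifiedLesk" backoff="none" windowSize="4" tie="MFS"/>
  <condition type="none"/>
</test>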

<algorithm> (1) <algorithm> nodes specify a WSD algorithm and its configuration. Each <algorithm> node must have the following attributes: disambiguation specifies the WSD algorithm. The valid options are the following: "SimplifiedLesk": the Simplified Lesk algorithm as proposed by Kilgarriff and Rosenzweig in "Framework and Results for English SENSEVAL". "GraphInDegree": graph-based disambiguation as presented in "Unsupervised Graph-based Word Sense Disambiguation Using Measures of Word Semantic Similarity" by Sinha and Mihalcea. "ConceptualDensity": the conceptual density algorithm as proposed by Agirre and Rigau in "Word Sense Disambiguation using Conceptual Density". "MFS": a simple Most Frequent Sense heuristic. "RandomSense": a simple Random Sense heuristic. backoff specifies the WSD algorithm to be used as a back-off strategy; use "none" for no back-off or use any WSD algorithm. windowSize specifies the maximum number of words contained in the window.

<algorithm> (2) tie specifies the disambiguation algorithm to be used for solving a tie. A tie occurs when the disambiguation algorithm returns more than one sense as an answer; use "none" for leaving the tie unsolved or use any WSD algorithm. Let's see some examples of <algorithm> nodes: <algorithm disambiguation="GraphInDegree" backoff="none" windowSize="4" tie="MFS"/> describes the GraphInDegree algorithm using no back-off, Most Frequent Sense for solving ties, and a window consisting of a maximum of 4 words. <algorithm disambiguation="SimplifiedLesk" backoff="RandomSense" windowSize="6" tie="MFS"/> describes the Simplified Lesk algorithm using Random Sense as the back-off strategy, Most Frequent Sense for solving ties, and a window consisting of a maximum of 6 words.

<condition> (1) A <condition> node sets a filter for choosing the window words. You can specify and combine as many filters as you need. The valid conditions are the following: "none": all words can be part of the window. "IDFThreshold:I": only words with an IDF value >= I will be selected. "IsUseful:WSDAlgorithm": only words that allow the WSD algorithm to return an answer will be selected. "NoDuplicates": generates a window without duplicates. "NoTarget": the target word will be excluded from the window. "VascilescuLexicalChain:J": taken from the paper "Evaluating Variants of the Lesk Approach for Disambiguating Words"; only words that form a lexical chain with the target word will be selected. J is a value in [0.0,1.0] that acts as a threshold for creating lexical chains (a lower value makes it easier for a word to join the lexical chain).

<condition> (2) Some examples of <condition> nodes are the following: <condition type="none"/> includes all words. <condition type="IsUseful:SimplifiedLesk"/> includes only words that allow Simplified Lesk to give an answer. <condition type="VascilescuLexicalChain:0.3"/> includes only words that form a lexical chain with the target word, using a Jaccard score threshold of 0.3 to decide whether a word is part of the lexical chain. Remember, you must add at least one <condition> node.
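Putting all of the sections together, a complete config.xml might look like the following sketch. It compares Simplified Lesk against the Most Frequent Sense heuristic on the bundled Senseval 2 test set; every element and attribute value is taken from the preceding slides, and the output path tests/results.xls is only an illustration.
<?xml version="1.0" encoding="UTF-8"?>
<run>
  <dict sources="WNGlosses;WNSamples"/>
  <testbed senses="All">
    <docs src="Resources/senseval2" prefix="S2_"/>
  </testbed>
  <xls src="tests/results.xls" detail="false">
    <!-- Simplified Lesk with an MFS back-off, excluding the target word from the window -->
    <test>
      <algorithm disambiguation="SimplifiedLesk" backoff="MFS" windowSize="4" tie="MFS"/>
      <condition type="NoTarget"/>
    </test>
    <!-- Most Frequent Sense baseline -->
    <test>
      <algorithm disambiguation="MFS" backoff="none" windowSize="4" tie="none"/>
      <condition type="none"/>
    </test>
  </xls>
</run>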

Contact information For any questions regarding the CICWSD API, please contact Francisco Viveros-Jiménez by email (pacovj@hotmail.com) or Skype (pacovj). Please cite the following paper in your work: Viveros-Jiménez, F., Gelbukh, A., Sidorov, G.: Improving Simplified Lesk Algorithm by using simple window selection practices. Submitted.

References (1) Lesk M (1986) Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proc. of SIGDOC-86: 5th International Conference on Systems Documentation, Toronto, Canada. Rada R, Mili H, Bicknell E, Blettner M (1989) Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics, vol. 19, no. 1, pp. 17-30. Miller G (1995) WordNet: A Lexical Database for English. Communications of the ACM, Vol. 38, No. 11, pp. 39-41. Agirre E, Rigau G (1996) Word Sense Disambiguation using Conceptual Density. Proceedings of COLING'96, pp. 16-22, Copenhagen, Denmark. Kilgarriff A (1997) I don't believe in word senses. Computers and the Humanities, 31(2), pp. 91–113.

References (2) Edmonds P (2000) Designing a task for SENSEVAL-2. Tech. note, University of Brighton, Brighton, U.K. Kilgarriff A, Rosenzweig J (2000) Framework and Results for English SENSEVAL. Computers and the Humanities, 34(1-2), Special Issue on SENSEVAL. Toutanova K, Manning C D (2000) Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70. Cotton S, Edmonds P, Kilgarriff A, Palmer M (2001) SENSEVAL-2: Second International Workshop on Evaluating Word Sense Disambiguation Systems. SIGLEX Workshop, ACL 2001, Toulouse, France. Mihalcea R, Edmonds P (2004) Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text. Association for Computational Linguistics, ACL 2004, Barcelona, Spain.

References (3) Vasilescu F, Langlais P, Lapalme G (2004) Evaluating Variants of the Lesk Approach for Disambiguating Words. LREC, Portugal. Mihalcea R (2006) Knowledge Based Methods for Word Sense Disambiguation, book chapter in Word Sense Disambiguation: Algorithms, Applications, and Trends, Editors Phil Edmonds and Eneko Agirre, Kluwer. Navigli R, Litkowski K, Hargraves O (2007) SemEval-2007 Task 07: Coarse-Grained English All-Words Task. Proc. of Semeval-2007 Workshop (SemEval), in the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), Prague, Czech Republic. Sinha R, Mihalcea R (2007) Unsupervised Graph-based Word Sense Disambiguation Using Measures of Word Semantic Similarity, in Proceedings of the IEEE International Conference on Semantic Computing (ICSC 2007), Irvine, CA. Navigli R (2009) Word Sense Disambiguation: a Survey. ACM Computing Surveys, 41(2), ACM Press, pp. 1-69.