Quality Control for Wordnet Development in BalkaNet Pavel Smrž Faculty of Informatics, Masaryk University in Brno, Czech Republic.

Presentation transcript:

Quality Control for Wordnet Development in BalkaNet Pavel Smrž Faculty of Informatics, Masaryk University in Brno, Czech Republic

Outline
- Introduction, general-purpose language resources
- General considerations
- Case Study of Quality Control in BalkaNet
- Conclusions and Future Directions

Introduction
- BalkaNet shares many fundamental principles with EuroWordNet (expected sharing of procedures, policy, structure and tools).
- The limitations we discovered in the EuroWordNet approach led us to change the data format, to design and implement new applications, and to propose a modified perspective on the future development of lexical semantic databases.

Introduction
- application-specific vs. general-purpose language resources (LR)
- procedures of quality control for general-purpose language resources are much less developed
- this area has been strongly underestimated in many previous projects
- if no quality assurance policy has been applied, the results can differ considerably from what was declared

General Considerations
- the availability of documentation of the development process and the final state of data
- resource documentation should be comprehensive but at the same time concise to allow a quick scan
- project deliverables (longer than necessary, do not describe all aspects, do not reflect the process of development)

The First Commandment!!!
Summarize the description of resources at the end of your project and check the validity of information in all documents that will be part of the documentation!

The Second Commandment!!!
Explicitly define your terminology! (even the meaning of terms that seem to be basic in the context!)
- what kinds of variants (typographic, regional, register…) are contained in synsets?
- example: lake, loch and lough, regional variants of the same concept, form 3 different synsets in PWN; lake is the hypernym of the two others

Other Requirements
- description of the data format in which the resource is provided
- XML as the standard for data interchange
- DTD, XSD and other XML schemas
- quantitative characteristics (empty tags may signal inconsistency)
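The remark that empty tags may signal inconsistency can be illustrated with a small sketch. It is not the BalkaNet tooling: the `WN` root element, the sample records and the `tag_report` function are assumptions made for illustration, while the tag names `SYNSET`, `ID`, `POS` and `DEF` follow those mentioned in the slides.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; the <WN> wrapper and the record contents are assumptions.
SAMPLE = """<WN>
  <SYNSET><ID>ENG20-00000001-n</ID><POS>n</POS><DEF>a body of water</DEF></SYNSET>
  <SYNSET><ID>ENG20-00000002-n</ID><POS></POS><DEF>   </DEF></SYNSET>
</WN>"""


def tag_report(xml_text):
    """Return (tag frequencies, frequencies of empty tags) for one XML document."""
    root = ET.fromstring(xml_text)
    total = Counter()
    empty = Counter()
    for elem in root.iter():
        if elem is root:
            continue  # the wrapper element itself is not counted
        total[elem.tag] += 1
        has_text = bool((elem.text or "").strip())
        has_children = len(elem) > 0
        if not has_text and not has_children:
            empty[elem.tag] += 1  # neither text nor child elements
    return total, empty


total, empty = tag_report(SAMPLE)
print(dict(total))
print(dict(empty))
```

Comparing the two counters per tag gives a quick first indication of where records are incomplete.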

BalkaNet Experience
The most successful procedure to control the quality of linguistic output is to implement a set of validation checks and periodically publish their results. This holds especially for projects with many participants that are not under the same supervision. Validation check reports together with the quantitative assessment can serve as development synchronization points too.

Case Study of Quality Control in BalkaNet
Resource description sheets:
- description of the content of synset records and constraints on data types;
- types of relations included together with examples;
- degree of checking relations borrowed from PWN (related to the expand model);
- numbering scheme of different senses (random, according to their frequency in a balanced corpus, from a particular dictionary, etc.);
- source of definitions and usage examples;
- order of literals in synsets (corpus frequency, familiarity, register or style characteristics)

Quantitative characteristics
- tag frequencies
- ratio of the number of literals in the national wordnet and in PWN
- ID prefix frequencies
- frequency of link types
- frequency of POS
- coverage of BCS
- number-of-senses distribution
- number of “multi-parent” synsets
- number of leaves, inner nodes, roots, free nodes in hyper-hyponymic “trees”
- path-length distribution
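Some of the structural figures listed above can be computed directly from the hypernym links. The following is a minimal sketch under stated assumptions: the function name, the toy data and the working definitions (root = no hypernym but at least one hyponym, leaf = a hypernym but no hyponym, free node = neither) are illustrative choices, not definitions taken from the project.

```python
def hierarchy_stats(synset_ids, hypernyms):
    """Count node types in a hyper-hyponymic graph.

    synset_ids: iterable of all synset IDs
    hypernyms:  dict mapping a synset ID to the set of its hypernym IDs
    """
    has_parent = {child for child, parents in hypernyms.items() if parents}
    has_child = {p for parents in hypernyms.values() for p in parents}
    stats = {"roots": 0, "leaves": 0, "inner": 0, "free": 0, "multi_parent": 0}
    for sid in synset_ids:
        up, down = sid in has_parent, sid in has_child
        if up and down:
            stats["inner"] += 1
        elif up:
            stats["leaves"] += 1
        elif down:
            stats["roots"] += 1
        else:
            stats["free"] += 1
        if len(hypernyms.get(sid, ())) > 1:
            stats["multi_parent"] += 1
    return stats


# Toy data (hypothetical IDs, not real wordnet records).
IDS = ["entity", "object", "lake", "loch", "lough",
       "spoon", "fork", "spork", "orphan"]
HYPERNYMS = {
    "object": {"entity"},
    "lake": {"object"},
    "loch": {"lake"},
    "lough": {"lake"},
    "spork": {"spoon", "fork"},  # a "multi-parent" synset
}

print(hierarchy_stats(IDS, HYPERNYMS))
# {'roots': 3, 'leaves': 3, 'inner': 2, 'free': 1, 'multi_parent': 1}
```

Publishing such counts for every national wordnet at regular intervals makes sudden changes between versions easy to spot.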

Automatic and Semi-automatic Quality Checking
Classification according to:
- the amount of human effort
- applicability for all languages (or language-specific)
- the need for additional resources and/or tools (annotated monolingual or parallel corpora, spell-checkers, explanatory or bilingual dictionaries, encyclopedias, lemmatizers, morphological analyzers)

Inconsistencies regularly examined on all BalkaNet data
- XML validation: empty ID, POS, SYNONYM, SENSE, ...;
- XML tag data types for POS, SENSE, TYPE (of relation), characters from a defined character set in DEF and USAGE;
- duplicate IDs;
- duplicate triplets (POS, literal, sense);
- duplicate literals in one synset;
- mismatch between the POS in the relevant tag and in the ID postfix;
- hypernym and holonym links (uplinks) to a synset with a different POS;
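Several of the checks on this slide reduce to simple counting once the records are loaded. The sketch below assumes each synset is already parsed into a dictionary with `id`, `pos` and `literals` keys, and that the POS postfix is the part of the ID after the last hyphen; both the record layout and the ID pattern are assumptions made for illustration.

```python
from collections import Counter


def check_synsets(synsets):
    """Return a list of (problem type, offending item) pairs.

    Each synset is a dict: {"id": str, "pos": str, "literals": [(literal, sense), ...]}
    """
    problems = []

    # duplicate IDs
    id_counts = Counter(s["id"] for s in synsets)
    problems += [("duplicate-id", sid) for sid, n in id_counts.items() if n > 1]

    # duplicate (POS, literal, sense) triplets across the whole wordnet
    triplets = Counter(
        (s["pos"], lit, sense) for s in synsets for lit, sense in s["literals"]
    )
    problems += [("duplicate-triplet", t) for t, n in triplets.items() if n > 1]

    for s in synsets:
        literals = [lit for lit, _ in s["literals"]]
        if len(literals) != len(set(literals)):
            problems.append(("duplicate-literal", s["id"]))
        # assumed ID pattern: "<prefix>-<number>-<pos>"
        if s["id"].rsplit("-", 1)[-1] != s["pos"]:
            problems.append(("pos-mismatch", s["id"]))
    return problems


# Hypothetical records with deliberately planted errors.
RECORDS = [
    {"id": "X-001-n", "pos": "n", "literals": [("lake", 1)]},
    {"id": "X-002-n", "pos": "v", "literals": [("loch", 1), ("loch", 2)]},
    {"id": "X-001-n", "pos": "n", "literals": [("lake", 1)]},
]

for problem in check_synsets(RECORDS):
    print(problem)
```

The check for uplinks to a synset with a different POS follows the same pattern: look up the POS of both ends of each hypernym or holonym link and report the pairs that differ.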

Inconsistencies regularly examined on all BalkaNet data
- dangling links (dangling uplinks);
- cycles in uplinks (conflicting with PWN, e.g. goalpost:1 is a kind of post:4 is a kind of upright:1; vertical:2 which is a part of goalpost:1);
- cycles in other relations;
- top-most synset not from the defined set (unique beginners): missing hypernym or holonym of a synset (see BCS selecting procedure above);
- non-compatible links to the same synset;
- non-continuous numbering where declared (possibility of automatic renumbering).
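Dangling uplinks and cycles in uplinks are graph properties, so they can be sketched with a plain depth-first search. The example below is an illustration, not the project's implementation: uplinks (hypernym and holonym links) are assumed to be stored as a dictionary from a synset label to its target labels, the labels reuse the goalpost example from the slide, and `lake:1` with its missing target is an invented record.

```python
def dangling_links(uplinks, known_ids):
    """Return sorted (source, target) pairs whose target synset does not exist."""
    return sorted(
        (src, dst)
        for src, targets in uplinks.items()
        for dst in targets
        if dst not in known_ids
    )


def find_cycle(uplinks):
    """Return one cycle as a list of nodes (first == last), or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited / on current path / finished
    color = {}

    def visit(node, path):
        color[node] = GREY
        path.append(node)
        for nxt in uplinks.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:  # back edge: nxt is already on the current path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt, path)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(uplinks):
        if color.get(start, WHITE) == WHITE:
            cycle = visit(start, [])
            if cycle:
                return cycle
    return None


UPLINKS = {
    "goalpost:1": ["post:4"],          # hypernym
    "post:4": ["upright:1"],           # hypernym
    "upright:1": ["goalpost:1"],       # holonym (part of)
    "lake:1": ["body_of_water:1"],     # target missing from the data (invented)
}
KNOWN = {"goalpost:1", "post:4", "upright:1", "lake:1"}

print(dangling_links(UPLINKS, KNOWN))  # [('lake:1', 'body_of_water:1')]
print(find_cycle(UPLINKS))  # ['goalpost:1', 'post:4', 'upright:1', 'goalpost:1']
```

The recursive search is adequate for a sketch; for hierarchies deeper than Python's recursion limit an explicit stack would be needed.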

Semi-automatic checks (additional language resources)
- spell-checking of literals, definitions, usage examples and notes;
- coverage of the most frequent words from monolingual corpora;
- coverage of translations (bilingual dictionaries, parallel corpora);
- incompatibility with relations extracted from corpora, dictionaries, or encyclopedias
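The corpus-coverage check can be sketched as a comparison between a frequency list and the set of wordnet literals. The function name and the sample words are hypothetical, and the case-insensitive comparison is an assumption that may not suit every language.

```python
def coverage(frequent_lemmas, wordnet_literals):
    """Return (share of frequent lemmas found in the wordnet, list of missing lemmas)."""
    literals = {lit.lower() for lit in wordnet_literals}
    missing = [w for w in frequent_lemmas if w.lower() not in literals]
    if not frequent_lemmas:
        return 0.0, missing
    return 1 - len(missing) / len(frequent_lemmas), missing


# Hypothetical frequency list and literal set.
ratio, missing = coverage(["lake", "river", "be", "of"], {"lake", "river", "be"})
print(ratio, missing)  # 0.75 ['of']
```

The list of missing lemmas is the useful output for lexicographers; function words such as "of" will naturally appear in it and can be filtered by POS beforehand.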

Lists of “suspicious” synsets
- nonlexicalized literals;
- literals with many senses;
- multi-parent relations;
- autohyponymy, automeronymy and other relations between synsets containing the same literal;
- longest paths in hyper-hyponymic graphs;
- similar definitions;
- incorrect occurrences of defined literals in definitions;
- presence of literals in usage examples;
- dependencies between relations (e.g. near antonyms differing in their hypernyms);
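Two of the lists above, literals with many senses and autohyponymy, can be sketched as follows. The record layout, the function names, the threshold and the sample data are assumptions for illustration; the output is a list of candidates for manual review, not a list of errors.

```python
from collections import defaultdict


def many_sense_literals(synsets, threshold):
    """Return sorted (POS, literal) pairs that occur with at least `threshold` senses."""
    senses = defaultdict(set)
    for s in synsets:
        for literal, sense in s["literals"]:
            senses[(s["pos"], literal)].add(sense)
    return sorted(key for key, found in senses.items() if len(found) >= threshold)


def autohyponymy(synsets, hypernym_links):
    """Return (hyponym ID, hypernym ID, shared literals) for linked synsets sharing a literal."""
    literals = {s["id"]: {lit for lit, _ in s["literals"]} for s in synsets}
    suspicious = []
    for child, parent in hypernym_links:
        shared = literals.get(child, set()) & literals.get(parent, set())
        if shared:
            suspicious.append((child, parent, sorted(shared)))
    return suspicious


# Hypothetical records.
SYNSETS = [
    {"id": "S1", "pos": "n", "literals": [("man", 1), ("adult male", 1)]},
    {"id": "S2", "pos": "n", "literals": [("man", 2), ("human", 1)]},
    {"id": "S3", "pos": "n", "literals": [("man", 3)]},
    {"id": "S4", "pos": "n", "literals": [("lake", 1)]},
]
LINKS = [("S1", "S2"), ("S4", "S2")]  # (hyponym, hypernym)

print(many_sense_literals(SYNSETS, threshold=3))  # [('n', 'man')]
print(autohyponymy(SYNSETS, LINKS))  # [('S1', 'S2', ['man'])]
```

The same intersection test applies to automeronymy and other relation types by passing a different list of links.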

Validation of quality in applications
- corpus annotation for WSD experiments (missing senses, impossibility to choose between different senses)
- comparison between the semantic classifications from the wordnet and the syntactic patterns based on computational grammar (verb valencies, selectional restrictions)
- information retrieval: augmented user interface for search engines

Conclusions and Future Directions
- Quality control has been one of the priorities of the BalkaNet project.
- As our evaluation shows, even the current data from the second year of the project are more consistent than the results of previous wordnet-development projects.
- XSLT and other XML standards to define validation checks in DEB