Quality Control for Wordnet Development in BalkaNet Pavel Smrž Faculty of Informatics, Masaryk University in Brno, Czech.

1 Quality Control for Wordnet Development in BalkaNet
Pavel Smrž
smrz@fi.muni.cz
Faculty of Informatics, Masaryk University in Brno, Czech Republic

2 Outline
Introduction, general-purpose language resources
General considerations
Case Study of Quality Control in BalkaNet
Conclusions and Future Directions

3 Introduction
BalkaNet shares many fundamental principles with EuroWordNet (expected sharing of procedures, policy, structure and tools).
Limitations discovered in the EuroWordNet approach led us to change the data format, to design and implement new applications, and to propose a modified perspective on the future development of lexical semantic databases.

4 Introduction
application-specific vs. general-purpose language resources (LR)
quality control procedures for general-purpose language resources are much less developed
this area has been strongly underestimated in many previous projects
if a quality assurance policy has not been applied, the results can differ considerably from what was declared

5 General Considerations
the availability of documentation of the development process and the final state of data
resource documentation should be comprehensive but at the same time concise to allow a quick scan
project deliverables

6 General Considerations
the availability of documentation of the development process and the final state of data
resource documentation should be comprehensive but at the same time concise to allow a quick scan
project deliverables (longer than necessary, do not describe all aspects, do not reflect the process of development)

7 The First Commandment!!!
Summarize the description of resources at the end of your project and check the validity of information in all documents that will be part of the documentation!

8 The Second Commandment!!!
Explicitly define your terminology!
(even the meaning of terms that seem to be basic in the context!)
what kinds of variants (typographic, regional, register…) are contained in synsets?
(lake, loch and lough, regional variants of the same concept, form 3 different synsets in PWN; lake is the hypernym of the other two)

9 Other Requirements
description of the data format in which the resource is provided
XML as the standard for data interchange
DTD, XSD and other XML schema languages
quantitative characteristics (empty tags may signal inconsistency)
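The remark that empty tags may signal inconsistency can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the element names (SYNSET, ID, POS, SYNONYM, LITERAL, SENSE, DEF) are modeled on the tags mentioned elsewhere in these slides, while the exact nesting, the sample IDs and the function name `tag_statistics` are hypothetical and not taken from the BalkaNet tools.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical two-synset sample; the layout is an assumption for illustration.
SAMPLE = """<WN>
  <SYNSET><ID>X-00001-n</ID><POS>n</POS>
    <SYNONYM><LITERAL>lake<SENSE>1</SENSE></LITERAL></SYNONYM>
    <DEF>a body of water surrounded by land</DEF></SYNSET>
  <SYNSET><ID>X-00002-n</ID><POS></POS>
    <SYNONYM><LITERAL>loch<SENSE>1</SENSE></LITERAL></SYNONYM>
    <DEF></DEF></SYNSET>
</WN>"""


def tag_statistics(xml_text):
    """Count how often each tag occurs and how often it is empty.

    A tag counts as empty when it has no child elements and no
    non-whitespace text.
    """
    root = ET.fromstring(xml_text)
    total = Counter()
    empty = Counter()
    for elem in root.iter():
        total[elem.tag] += 1
        has_text = bool((elem.text or "").strip())
        if not has_text and len(elem) == 0:
            empty[elem.tag] += 1
    return total, empty


total, empty = tag_statistics(SAMPLE)
for tag in sorted(total):
    print(f"{tag}: {total[tag]} occurrences, {empty[tag]} empty")
```

A report like this, published for each participating wordnet, makes unexpectedly empty POS or DEF elements visible at a glance.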

10 BalkaNet Experience
The most successful procedure to control the quality of linguistic output is to implement a set of validation checks and periodically publish their results. This holds especially for projects with many participants who are not under the same supervision. Validation check reports, together with the quantitative assessment, can also serve as development synchronization points.

11 Case Study of Quality Control in BalkaNet
Resource description sheets:
  description of the content of synset records and constraints on data types;
  types of relations included together with examples;
  degree of checking relations borrowed from PWN (related to the expand model);
  numbering scheme of different senses (random, according to their frequency in a balanced corpus, from a particular dictionary, etc.);
  source of definitions and usage examples;
  order of literals in synsets (corpus frequency, familiarity, register or style characteristics)

12 Quantitative characteristics
tag frequencies
ratio of the number of literals in the national wordnet and in PWN
ID prefix frequencies
frequency of link types
frequency of POS
coverage of BCS
number-of-senses distribution
number of "multi-parent" synsets
number of leaves, inner nodes, roots, free nodes in hyper-hyponymic "trees"
path-length distribution
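Several of these characteristics follow directly from the hypernym links. The Python sketch below shows one possible way to count leaves, inner nodes, roots, free nodes and multi-parent synsets. The input representation (a list of synset IDs plus a mapping from each synset to its hypernyms), the function name `hypernym_tree_stats` and the sample IDs are assumptions made for illustration, not the project's actual data model.

```python
from collections import Counter


def hypernym_tree_stats(synset_ids, hypernyms):
    """Classify synsets by their position in the hyper-hyponymic graph.

    synset_ids: iterable of all synset IDs
    hypernyms:  dict mapping a synset ID to a list of its hypernym IDs
    """
    parents = {s: set(hypernyms.get(s, ())) for s in synset_ids}
    # Every synset that appears as somebody's hypernym has at least one child.
    has_children = set()
    for parent_set in parents.values():
        has_children |= parent_set

    stats = Counter()
    for s in synset_ids:
        has_parent = bool(parents[s])
        has_child = s in has_children
        if has_parent and has_child:
            stats["inner"] += 1
        elif has_parent:
            stats["leaves"] += 1
        elif has_child:
            stats["roots"] += 1
        else:
            stats["free"] += 1
        if len(parents[s]) > 1:
            stats["multi_parent"] += 1
    return dict(stats)


# Hypothetical toy data.
ids = ["entity", "lake", "loch", "pond", "orphan"]
links = {"lake": ["entity"], "loch": ["lake"], "pond": ["lake", "entity"]}
print(hypernym_tree_stats(ids, links))
```

Free nodes and unexpected roots are usually the most interesting figures, since they point at synsets that have not yet been attached to the hierarchy.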

13 Automatic and Semi-automatic Quality Checking
Classification according to:
  the amount of human effort
  applicability for all languages (or language-specific)
  the need for additional resources and/or tools (annotated monolingual or parallel corpora, spell-checkers, explanatory or bilingual dictionaries, encyclopedias, lemmatizers, morphological analyzers)

14 Inconsistencies regularly examined on all BalkaNet data
XML validation: empty ID, POS, SYNONYM, SENSE, ...;
XML tag data types for POS, SENSE, TYPE (of relation), characters from a defined character set in DEF and USAGE;
duplicate IDs;
duplicate triplets (POS, literal, sense);
duplicate literals in one synset;
mismatch between the POS in the relevant tag and in the ID postfix;
hypernym and holonym links (uplinks) to a synset with a different POS;
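A few of the checks listed above can be sketched in Python as follows. This is a minimal illustration, not the validation code used in BalkaNet: it assumes each synset has already been parsed into a dictionary with hypothetical keys `id`, `pos`, `literals` (pairs of literal and sense number) and `uplinks`, and that the POS is the last hyphen-separated part of the ID.

```python
from collections import Counter


def check_synsets(synsets):
    """Return a list of problem tuples found in the synset records."""
    problems = []

    # Duplicate IDs.
    id_counts = Counter(s["id"] for s in synsets)
    problems += [("duplicate-id", sid) for sid, n in id_counts.items() if n > 1]

    pos_by_id = {s["id"]: s["pos"] for s in synsets}
    triplets = Counter()
    for s in synsets:
        # Duplicate literals inside one synset.
        literal_counts = Counter(lit for lit, _sense in s["literals"])
        for lit, n in literal_counts.items():
            if n > 1:
                problems.append(("duplicate-literal", s["id"], lit))
        # Collect (POS, literal, sense) triplets, once per synset.
        for lit, sense in set(s["literals"]):
            triplets[(s["pos"], lit, sense)] += 1
        # POS tag must agree with the ID postfix (assumed to follow "-").
        if not s["id"].endswith("-" + s["pos"]):
            problems.append(("pos-mismatch", s["id"]))
        # Uplinks must point to a synset with the same POS.
        for target in s.get("uplinks", []):
            if target in pos_by_id and pos_by_id[target] != s["pos"]:
                problems.append(("uplink-pos", s["id"], target))

    # Duplicate triplets across synsets.
    problems += [("duplicate-triplet", t) for t, n in triplets.items() if n > 1]
    return problems


# Hypothetical records with deliberately planted errors.
records = [
    {"id": "X-001-n", "pos": "n", "literals": [("lake", 1)], "uplinks": []},
    {"id": "X-002-n", "pos": "n", "literals": [("loch", 1), ("loch", 2)],
     "uplinks": ["X-001-n"]},
    {"id": "X-003-n", "pos": "v", "literals": [("lake", 1)],
     "uplinks": ["X-001-n"]},
    {"id": "X-001-n", "pos": "n", "literals": [("lake", 1)], "uplinks": []},
]
for problem in check_synsets(records):
    print(problem)
```

Returning plain tuples keeps the report easy to publish periodically and to compare between releases.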

15 Inconsistencies regularly examined on all BalkaNet data
dangling links (dangling uplinks);
cycles in uplinks (conflicting with PWN, e.g. goalpost:1 is a kind of post:4 is a kind of upright:1; vertical:2 which is a part of goalpost:1);
cycles in other relations;
top-most synset not from the defined set (unique beginners): missing hypernym or holonym of a synset (see BCS selecting procedure above);
non-compatible links to the same synset;
non-continuous numbering where declared (possibility of automatic renumbering).
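Dangling uplinks and cycles in uplinks are both graph properties, so they can be detected with a standard depth-first search. The Python sketch below is one possible formulation under assumptions: uplinks (hypernym and holonym links together) are given as a dictionary from a synset label to the labels it points to, and the function names are hypothetical. The sample data reproduces the goalpost example from the slide and adds an invented dangling link.

```python
def find_dangling(uplinks):
    """Return (source, target) pairs whose target is not a known synset."""
    known = set(uplinks)
    return sorted(
        (src, dst)
        for src, targets in uplinks.items()
        for dst in targets
        if dst not in known
    )


def find_cycle(uplinks):
    """Return one cycle as a list of nodes (first node repeated at the end),
    or None if the uplink graph is acyclic. Iterative depth-first search."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour = {node: WHITE for node in uplinks}
    for start in uplinks:
        if colour[start] != WHITE:
            continue
        colour[start] = GREY
        path = [start]
        stack = [(start, iter(uplinks[start]))]
        while stack:
            node, targets = stack[-1]
            for nxt in targets:
                if nxt not in colour:
                    continue  # dangling link, reported separately
                if colour[nxt] == GREY:
                    return path[path.index(nxt):] + [nxt]
                if colour[nxt] == WHITE:
                    colour[nxt] = GREY
                    path.append(nxt)
                    stack.append((nxt, iter(uplinks[nxt])))
                    break
            else:
                # All targets of this node explored without finding a cycle.
                colour[node] = BLACK
                path.pop()
                stack.pop()
    return None


uplinks = {
    "goalpost:1": ["post:4"],          # hypernym
    "post:4": ["upright:1"],           # hypernym
    "upright:1": ["goalpost:1"],       # holonym (part of)
    "lake:1": ["body_of_water:1"],     # invented dangling target
}
print(find_dangling(uplinks))
print(find_cycle(uplinks))
```

Treating hypernym and holonym links as one uplink graph is what exposes the goalpost cycle; checked separately, each relation is acyclic.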

16 Semi-automatic checks (additional language resources)
spell-checking of literals, definitions, usage examples and notes;
coverage of the most frequent words from monolingual corpora;
coverage of translations (bilingual dictionaries, parallel corpora);
incompatibility with relations extracted from corpora, dictionaries, or encyclopedias
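The coverage check against a monolingual corpus can be sketched as below. This is an illustration only: it assumes the corpus has already been lemmatized into a list of lemmas, and the function name `frequent_word_coverage` and the sample data are hypothetical.

```python
from collections import Counter


def frequent_word_coverage(corpus_lemmas, wordnet_literals, top_n):
    """Return the share of the top_n most frequent corpus lemmas that occur
    as literals in the wordnet, together with the missing lemmas."""
    frequencies = Counter(corpus_lemmas)
    top = [lemma for lemma, _count in frequencies.most_common(top_n)]
    missing = [lemma for lemma in top if lemma not in wordnet_literals]
    coverage = 1 - len(missing) / len(top) if top else 0.0
    return coverage, missing


# Hypothetical lemmatized corpus and wordnet literal set.
corpus = ["water"] * 4 + ["lake"] * 3 + ["pond"] * 2 + ["loch"]
literals = {"lake", "loch"}
coverage, missing = frequent_word_coverage(corpus, literals, top_n=3)
print(f"coverage: {coverage:.2f}, missing: {missing}")
```

The list of missing lemmas is the part that needs human attention, which is why this check is semi-automatic.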

17 Lists of "suspicious" synsets
nonlexicalized literals;
literals with many senses;
multi-parent relations;
autohyponymy, automeronymy and other relations between synsets containing the same literal;
longest paths in hyper-hyponymic graphs;
similar definitions;
incorrect occurrences of defined literals in definitions;
presence of literals in usage examples;
dependencies between relations (e.g. near antonyms differing in their hypernyms);
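Three of these lists (literals with many senses, autohyponymy, and defined literals occurring in their own definitions) can be produced mechanically. The Python sketch below assumes synsets parsed into dictionaries with hypothetical keys `id`, `literals`, `definition` and `hypernyms`; the threshold `max_senses` and the sample records are invented, and the output is a list of candidates for manual review rather than a list of errors.

```python
from collections import Counter


def suspicious_synsets(synsets, max_senses=3):
    """Collect candidate synsets and literals for manual inspection."""
    by_id = {s["id"]: s for s in synsets}
    # Number of synsets each literal occurs in, taken as its number of senses.
    sense_count = Counter(lit for s in synsets for lit in s["literals"])
    report = {
        "many-senses": sorted(
            lit for lit, n in sense_count.items() if n > max_senses
        ),
        "autohyponymy": [],
        "literal-in-definition": [],
    }
    for s in synsets:
        # A synset sharing a literal with its own hypernym.
        for parent_id in s.get("hypernyms", []):
            parent = by_id.get(parent_id)
            if parent and set(s["literals"]) & set(parent["literals"]):
                report["autohyponymy"].append((s["id"], parent_id))
        # A literal that is used as a word in its own definition.
        words = set(s.get("definition", "").lower().split())
        if any(lit.lower() in words for lit in s["literals"]):
            report["literal-in-definition"].append(s["id"])
    return report


# Hypothetical records.
sample = [
    {"id": "s1", "literals": ["bank"],
     "definition": "a financial institution", "hypernyms": []},
    {"id": "s2", "literals": ["bank"],
     "definition": "sloping land beside water", "hypernyms": ["s1"]},
    {"id": "s3", "literals": ["lake"],
     "definition": "a lake is a body of water", "hypernyms": []},
]
print(suspicious_synsets(sample, max_senses=1))
```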

18 Validation of quality in applications
corpus annotation for WSD experiments (missing senses, impossibility to choose between different senses)
comparison of the semantic classifications from the wordnet with syntactic patterns based on a computational grammar (verb valencies, selectional restrictions)
information retrieval: augmented user interface for search engines

19 Conclusions and Future Directions
Quality control has been one of the priorities of the BalkaNet project.
As our evaluation shows, even the current data from the second year of the project are more consistent than the results of previous wordnet-development projects.
XSLT and other XML standards to define validation checks in DEB

