BalkaNet project overview
Dan Tufiş (RACAI), Dan Cristea (UAIC), Sofia Stamou (DBLAB)


Overview of the talk
- Goals and Design Principles
- Validation Methodologies
  - Well-formedness checking (syntactic validation)
  - Cross-lingual validation of the ILI mapping (semantic validation)
- Current state of the BalkaNet wordnets
- Applications (WSD)
- Standard BalkaNet Tools (VisDic, WMS)

BalkaNet
- An EU-funded project (IST ) for the development of a (core) multilingual semantic lexicon along the principles of EuroWordNet
- Started in September 2001, will end in August 2004
- Languages concerned: Bulgarian, Czech, Greek, Romanian, Serbian, Turkish

Teams
- Bulgarian – DCMB (Sofia) & PU (Plovdiv)
- Czech – FI MU (Brno)
- Greek – DBLAB (Patras, coordinator) & CTI (Athens)
- Romanian – RACAI (Bucharest) & UAIC (Iaşi)
- Turkish – SABANCI University (Istanbul)
- Memodata (Caen) – commercial partner: evaluation studies
Subcontractors
- Serbian – MATF (Belgrade)
- OTE (Athens) – industrial partner: user studies

Goals and Design Principles (1)
Goals:
g1) at least 8000 synsets per partner
g2) maximal interlingual overlap (> 80%)
g3) building tools to efficiently exploit the multilingual wordnet (the sum of the ILI-based aligned monolingual wordnets)
g4) development of free software for the use and management of the BalkaNet wordnets
g5) building various applications (WSD, intelligent document indexing, CLIR, etc.)

Goals and Design Principles (2)
Design Principles:
d1) ensuring as much compatibility as possible with the EuroWordNet approaches (e.g. unstructured ILI based on Princeton WordNet)
d2) synset structuring (relations) inside each wordnet (lots of redundancy, but much more powerful)
d3) keeping up with Princeton WordNet (PWN) developments
d4) ensuring conceptually dense wordnets

Goals and Design Principles (3)
d5) defining a reusable methodology for data acquisition and validation (open for further development)
d6) linguistically motivated development (reference language resources, with human experts actively involved in all decision making and validation)
d7) minimizing the development time and costs

Maximisation of the cross-lingual coverage (1)
- ILI = the set of PWN synsets (labeled by their offsets in the database) taken as interlingual concepts: ( n; v; a; b)
- The consortium selected a common set of ILI codes to be implemented for all languages; this selection took place in three steps:
  - BCS1 (essentially the BC set of EuroWordNet): 1218 concepts
  - BCS2: 3471 concepts
  - BCS3: 3827 concepts

Maximisation of the cross-lingual coverage (2)
Selection criteria for BCS1, 2, 3… (8516 ILI codes):
- number of languages in EuroWordNet linked to an ILI code (imperative)
- conceptual density: once a concept was selected, all its ancestors (nouns and verbs), up to the top level, were also selected (imperative); adjectives were selected so that they would typically be related to nominal concepts in the selection (be_in_state)
- language-specific criteria: each team proposed a set of concepts of interest, and the maximum intersection set among these proposals became imperative
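The conceptual density criterion amounts to closing the selected set under the hypernym relation. A minimal sketch of that closure follows; the concept names and the `toy_hypernyms` map are made up for illustration and are not BalkaNet data.

```python
from typing import Dict, Iterable, Set


def close_under_ancestors(selected: Iterable[str],
                          hypernyms: Dict[str, Set[str]]) -> Set[str]:
    """Return the selection plus every ancestor reachable via hypernym links."""
    closed: Set[str] = set()
    stack = list(selected)
    while stack:
        concept = stack.pop()
        if concept in closed:
            continue
        closed.add(concept)
        # Follow the links up to the top level of the hierarchy.
        stack.extend(hypernyms.get(concept, ()))
    return closed


# Hypothetical toy hierarchy (child -> direct hypernyms).
toy_hypernyms = {
    "dog": {"canine"},
    "canine": {"animal"},
    "animal": {"entity"},
    "lamp": {"device"},
    "device": {"entity"},
}

selection = close_under_ancestors({"dog"}, toy_hypernyms)
```

Selecting only `dog` pulls in `canine`, `animal` and `entity`, so no selected concept is left without its chain of ancestors.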

Synsets structuring (1)
- At the level of each individual wordnet
- Common set of relations (the semantic relations) as used in the PWN
- Language-specific relations (the lexical relations, such as derivative, usage_domain, region_domain)

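Synsets structuring (2)
- Principle of hierarchy preservation: if $M_1^{L1} \; H^{+} \; M_2^{L1}$, $M_1^{L1} = N_1^{L2}$ and $M_2^{L1} = N_2^{L2}$, then $N_1^{L2} \; H^{+} \; N_2^{L2}$. Here $H^{+}$ is read as the transitive closure of the hierarchy (hypernymy) relation and $=$ as equivalence through the ILI. The principle allows for importing taxonomic relations and checking interlingual alignments.
- When taxonomic relations were imported, they were hand validated.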
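The hierarchy preservation principle can be turned into an alignment check. The sketch below is one possible reading, not the project's tool: it assumes each synset maps to at most one ILI code, treats "ancestor" as reachability over hypernym links, and uses invented synset identifiers and ILI codes.

```python
from typing import Dict, List, Set, Tuple


def ancestors(synset: str, hypernyms: Dict[str, Set[str]]) -> Set[str]:
    """All direct and indirect hypernyms of a synset."""
    seen: Set[str] = set()
    stack = list(hypernyms.get(synset, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(hypernyms.get(node, ()))
    return seen


def hierarchy_violations(hyper_l1: Dict[str, Set[str]],
                         hyper_l2: Dict[str, Set[str]],
                         ili_l1: Dict[str, str],
                         ili_l2: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pairs (ancestor, descendant) in L1 whose ILI equivalents in L2
    are not in an ancestor/descendant relation."""
    l2_by_ili = {ili: synset for synset, ili in ili_l2.items()}
    violations = []
    for m2, ili2 in ili_l1.items():
        n2 = l2_by_ili.get(ili2)
        if n2 is None:
            continue  # m2 has no equivalent in L2, nothing to check
        n2_ancestors = ancestors(n2, hyper_l2)
        for m1 in ancestors(m2, hyper_l1):
            n1 = l2_by_ili.get(ili_l1.get(m1))
            if n1 is not None and n1 not in n2_ancestors:
                violations.append((m1, m2))
    return sorted(violations)


# Hypothetical data: synset ids and ILI codes are placeholders.
hyper_en = {"en_dog": {"en_animal"}}
ili_en = {"en_dog": "ILI-2", "en_animal": "ILI-1"}
ili_ro = {"ro_caine": "ILI-2", "ro_animal": "ILI-1"}

ok = hierarchy_violations(hyper_en, {"ro_caine": {"ro_animal"}}, ili_en, ili_ro)
broken = hierarchy_violations(hyper_en, {}, ili_en, ili_ro)
```

An empty result means the aligned hierarchies agree; a reported pair points either to a missing taxonomic link in L2 or to a wrong interlingual alignment, which is why imported relations still need hand validation.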

Keeping up with PWN developments
- When the project started, the ILI was based on PWN1.5 (as EuroWordNet was).
- The BalkaNet ILI was updated following the new releases of PWN: PWN1.5 => PWN1.7.1, then PWN1.7.1 => PWN2.0.
- As the automatic remapping is not always deterministic, the partners manually resolved the remaining ambiguities in their wordnets.

Defining a reusable methodology for data acquisition and validation
- Each partner developed its own specific tools for acquisition and validation, with a commonly agreed set of functionalities.
- These tools were documented for a lay computer user.
- The language-specific tools differ mainly because of the set of language resources available to each partner; depending on the available resources, each partner chose the appropriate balance between d6) and d7) (next issue).

Trading effort and development time for language centricity (1)
- Each partner addressed this issue differently, depending mainly on the available manpower and language resources.
- For instance, if relevant (encoded) electronic dictionaries (bilingual, explanatory, synonym, antonym dictionaries, etc.) were available, the development effort concentrated to a large extent on interlingual equivalence mappings. This approach allowed a more language-centric development (merge model).

Trading effort and development time for language centricity (2)
- If reliable dictionaries other than bilingual dictionaries (which every partner had) were not available (e.g. because the copyright holders were reluctant to release their data or to allow its use), the partners generally translated the literals in the PWN (approximately an expand model); additional effort was then necessary to check the translated synsets as well as their language adequacy.

Validation methodologies
- Syntactic validations (wordnet well-formedness checking)
- Semantic validation (word sense alignment in parallel corpora)

Syntactic validations
Validation of syntactically well-formed wordnets:
- compliance with the DTD for the VISDIC editor
- no duplicate literals in the same synset
- no sense duplications (literal & sense number)
- valid set of semantic relations
- no dangling nodes (conceptual density)
- no loops
- valid synset identifiers
- … and many others
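A few of these checks can be sketched on an in-memory wordnet. This is only an illustration under stated assumptions: the dictionary layout, the relation names in `VALID_RELATIONS`, and the reading of "dangling" as a relation whose target synset is missing are choices made here, not the behaviour of the BalkaNet tools.

```python
from typing import Dict, List, Tuple

# Illustrative subset of relation names; the real inventory is larger.
VALID_RELATIONS = {"hypernym", "be_in_state", "derivative"}


def hypernym_targets(sid: str, synsets: Dict[str, dict]) -> List[str]:
    """Existing synsets that `sid` points to through a hypernym link."""
    return [target for rel, target in synsets[sid]["relations"]
            if rel == "hypernym" and target in synsets]


def on_hypernym_loop(start: str, synsets: Dict[str, dict]) -> bool:
    """True if following hypernym links from `start` leads back to it."""
    seen = set()
    stack = hypernym_targets(start, synsets)
    while stack:
        node = stack.pop()
        if node == start:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(hypernym_targets(node, synsets))
    return False


def check_wordnet(synsets: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Return (synset id, problem) pairs for the checks sketched here."""
    problems = []
    for sid, synset in synsets.items():
        literals = [literal for literal, _sense in synset["literals"]]
        if len(literals) != len(set(literals)):
            problems.append((sid, "duplicate literal"))
        for rel, target in synset["relations"]:
            if rel not in VALID_RELATIONS:
                problems.append((sid, "invalid relation"))
            if target not in synsets:
                problems.append((sid, "dangling target"))
        if on_hypernym_loop(sid, synsets):
            problems.append((sid, "hypernym loop"))
    return problems


# Hypothetical two-synset wordnet seeded with errors.
toy_wordnet = {
    "s1": {"literals": [("lamp", 1), ("lamp", 2)],
           "relations": [("hypernym", "s2")]},
    "s2": {"literals": [("device", 1)],
           "relations": [("hypernym", "s1"), ("sibling_of", "s9")]},
}

problems = check_wordnet(toy_wordnet)
```

Returning plain pairs rather than raising on the first error keeps the check usable as a batch report that a lexicographer can work through.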

Consistency checking
Sense conflicts (a literal & sense label in two or more synsets):
- easy to solve (obvious human errors in sense assignment)
- hard to solve (they provide evidence of WordNet sense distinctions that are hard to make in other languages; hints for ILI soft clustering)
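Detecting a sense conflict is a matter of indexing synsets by their (literal, sense label) pairs. A minimal sketch, with invented synset identifiers and literals:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sense = Tuple[str, int]  # (literal, sense label)


def sense_conflicts(synsets: Dict[str, List[Sense]]) -> Dict[Sense, List[str]]:
    """Map each (literal, sense) pair found in two or more synsets
    to the ids of the synsets that contain it."""
    owners: Dict[Sense, List[str]] = defaultdict(list)
    for sid, senses in synsets.items():
        for sense in set(senses):  # ignore repeats inside one synset
            owners[sense].append(sid)
    return {sense: sorted(ids) for sense, ids in owners.items() if len(ids) > 1}


# Hypothetical data: "lampă" sense 1 was assigned to two synsets.
toy_synsets = {
    "syn-a": [("lampă", 1), ("lampadar", 1)],
    "syn-b": [("lampă", 1)],
    "syn-c": [("felinar", 1)],
}

conflicts = sense_conflicts(toy_synsets)
```

The check only finds the conflicts; deciding whether one is a clerical error or a genuine hint that two PWN senses collapse in the target language remains a human judgement.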

Cross-lingual validation of the ILI mapping
- A bilingual lexicon might say $TR(w^{L1}) = w_1^{L2}, w_2^{L2}, \dots$ (not enough)
- A lexical alignment process might give you contextual translation information: the $m$-th word in language L1 ($w_m^{L1}$) is translated by the $n$-th word in language L2 ($w_n^{L2}$) (step 1): $\text{TR-EQ}(w_m^{L1}) = w_n^{L2}$ (not enough, but better)

Cross-lingual validation of the ILI mapping
- A sense clustering procedure might give you information on similar senses of different occurrences of the same word (step 2):
  $\text{Sense}(\text{Occ}(w_i^{L1}, p), \text{Occ}(w_i^{L1}, q), \dots) = \alpha$
  $\text{Sense}(\text{Occ}(w_j^{L2}, m), \text{Occ}(w_j^{L2}, n), \dots) = \beta$
  $\alpha, \beta = ?$ (sense labeling)
- $\text{synset}(w_i^{L1}) \;\text{TR-EQV}\; \text{synset}(w_j^{L2})$ (step 3); $\alpha$, $\beta$ are ILI codes (ideally $\alpha = \beta$)

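Cross-lingual validation of the ILI mapping (idealistic view)
$\text{Translation}(W_i^{L1}) = W_j^{L2} \Rightarrow \exists\, Syn_1^{L1}, Syn_2^{L2}$ so that $W_i^{L1} \in Syn_1^{L1}$ and $W_j^{L2} \in Syn_2^{L2}$ and $\text{EQ-SYN}(Syn_1^{L1}) = \text{EQ-SYN}(Syn_2^{L2}) = ILI_k$
[Figure: wordnets WN1 and WN2 linked to the ILI by EQ-SYN; $W_i^{L1}$ and $W_j^{L2}$ are linked by TR-EQ and both reach $ILI_k$.]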
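In the idealistic view, validating a translation pair reduces to intersecting the ILI codes of the synsets that contain each word. The sketch below assumes each wordnet is available as a map from literal to the ILI codes of its synsets; the codes `ILI-A`, `ILI-B`, `ILI-C` are placeholders, not real identifiers.

```python
from typing import Dict, Set


def common_ili(word_l1: str, word_l2: str,
               wn1: Dict[str, Set[str]],
               wn2: Dict[str, Set[str]]) -> Set[str]:
    """ILI codes shared by the synsets of a translation-equivalent pair."""
    return wn1.get(word_l1, set()) & wn2.get(word_l2, set())


# Hypothetical literal -> ILI code maps for an English and a Romanian wordnet.
wn_en = {"lamp": {"ILI-A", "ILI-B"}}
wn_ro = {"lampă": {"ILI-B"}, "felinar": {"ILI-C"}}

exact = common_ili("lamp", "lampă", wn_en, wn_ro)
missing = common_ili("lamp", "felinar", wn_en, wn_ro)
```

A single shared code both confirms the mapping and disambiguates the occurrence. An empty intersection is the case where the idealistic view breaks down and a closest conceptual match has to be sought instead.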

Cross-lingual validation of the ILI mapping (more realistic view)
[Figure: wordnets WN1 and WN2 linked to the ILI by EQ-SYN; $W_i^{L1}$ and $W_j^{Lk}$ are linked by TR-EQ.]

Translation equivalents

Checking interlingual mappings by translations in parallel corpora
Sense Assignment Example (I)
is ‘lamp’ is ‘lampă’
- The common sense of lamp and lampă is ENG n, and they correspond to lamp(2) and lampă(1)

Checking interlingual mappings by translations in parallel corpora
Sense Assignment Example (II)
is ‘lamp’ is ‘felinar’
- The closest conceptual match of lamp and felinar is for the pair ENG n and ENG n, and they correspond to lamp(1) and felinar(1)

Current status of the BalkaNet wordnets (1)
Language  Synsets  Nouns   Verbs        Adjectives  Adverbs
BG                 (73%)   3317 (22%)   653 (4%)    0
CZ                 (72%)   4950 (18%)   2128 (8%)   164 (0.6%)
GR                 (79%)   2921 (18%)   352 (2%)    14 (0.1%)
RO                 (73%)   2930 (20%)   844 (6%)    200 (1%)
SR                 (65%)   1471 (30%)   154 (3%)    7 (0.1%)
TR                 (74%)   2538 (23%)   358 (3%)    0

Current status of the BalkaNet wordnets (2)
Language | Synsets | Literals | Senses | Avg. Syn. Lg | Avg. sense/Lit
BG , CZ , GR , RO , SR , TR ,521.35

BalkaNet Common Set Coverage
Languages | BCS1 | BCS2 | BCS3 | BCSs TOTAL
BG CZ GR RO TR

Cross-lingual coverage (2)
Languages | CZ | GR | RO | TR
BG CZ GR RO 9076
BG ∩ CZ ∩ RO =
BG ∩ CZ ∩ GR ∩ RO ∩ TR = 6035 (75.43%)
We hope for 100% by the end of the project!
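Figures of this kind can be computed by intersecting the sets of ILI codes implemented in each wordnet. The sketch below uses toy sets; `reference_size` is an assumed parameter, since the slide does not state the base against which the percentage is taken.

```python
from typing import Dict, Set, Tuple


def shared_coverage(ili_by_language: Dict[str, Set[str]],
                    reference_size: int) -> Tuple[int, float]:
    """Number of ILI codes implemented in every wordnet, and that number
    as a percentage of an externally supplied reference size."""
    shared = set.intersection(*ili_by_language.values())
    return len(shared), round(100.0 * len(shared) / reference_size, 2)


# Hypothetical ILI code sets for three wordnets.
toy_wordnets = {
    "BG": {"c1", "c2", "c3"},
    "CZ": {"c1", "c2", "c4"},
    "RO": {"c1", "c2", "c3", "c4"},
}

count, percent = shared_coverage(toy_wordnets, reference_size=4)
```

Restricting the dictionary to two languages gives the pairwise cells of the coverage table with the same function.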

Applications (WSD)
- WSDtool (presented in the morning session)
  - Initially designed as a tool for semantic validation of the BalkaNet wordnets (the interactive regime)
  - In the autonomous regime, WSDtool works as a word-sense disambiguator based on parallel corpora
  - For the WSD task it was evaluated on the EN-RO bitext (“1984” parallel corpus)

Applications (WSD)
- The sense labels assigned to words in both parts of the bitext are ILI codes
- Very promising results: for a set of 211 target words, with 1411 occurrences in the parallel corpus, the accuracy was > 80%
- User-friendly interface

Standard BalkaNet Tools
- VISDIC (to follow)
- WMS (to follow)