Building an Ontology-based Multilingual Lexicon for Word Sense Disambiguation in Machine Translation Lian-Tze Lim & Tang Enya Kong Unit Terjemahan Melalui.


Building an Ontology-based Multilingual Lexicon for Word Sense Disambiguation in Machine Translation
Lian-Tze Lim & Tang Enya Kong
Unit Terjemahan Melalui Komputer, Pusat Pengajian Sains Komputer, Universiti Sains Malaysia, Penang, Malaysia
{liantze,

Presentation Overview
- Introduction
- Building an Ontology-based Multilingual Lexicon
- Using the Lexicon for Target Word Selection in Machine Translation
- Future Work
- Conclusion

Introduction

Word Sense Disambiguation
- Ambiguous words: words with multiple meanings
- WSD: determine the correct meaning (sense) of an ambiguous word in a particular discourse
- Need for WSD in machine translation (word selection)
  - Input: The computer logs were deleted.
  - Output: *Balak komputer telah dipotong. (wrong senses chosen: "balak" is a timber log, "dipotong" means "cut")
  - Based on the list of meanings of words as defined in a bilingual dictionary

Language Resource for WSD
- (Bilingual) list of words and senses
- WordNet
  - broad coverage, rich lexical information, freely available
  - too fine-grained for practical NLP tasks
- Linking words in target languages to WordNet senses is insufficient
- Propose to construct a multilingual lexicon based on an ontology framework

Combining Lexical Resources
[Diagram: GoiTaikei hierarchies, WordNet (English), Dictionary of Modern Chinese Words (Mandarin) and Kamus Dewan (Malay) feed into the Multilingual Lexicon, built in an ontology framework (Protégé)]

Building an Ontology-based Multilingual Lexicon

Taxonomies and Ontologies
- Ontology: "explicit formal specifications of the terms in the domain and relations among them" (Gruber 1993)
- Concepts organised in a taxonomy structure
- Ontology + Instances = Knowledge Base
  - Ontology → GoiTaikei hierarchies + relations + facets
  - Instance → Lexical Entry
  - Knowledge Base → Lexicon
Reference: Gruber, T.: A Translation Approach to Portable Ontology Specifications (1993)

Existing Lexical Resources using Hierarchical Structures
- Roget’s Thesaurus, WordNet
- Shortcomings: not perfect resources for WSD → build our own

Existing Lexical Resources using Hierarchical Structures (comparison: Roget’s Thesaurus vs. WordNet)

Main aim
  Roget’s Thesaurus: help writers choose the appropriate word
  WordNet: reflect lexical memory using psycholinguistic principles

Word grouping
  Roget’s Thesaurus:
  - groups words with similar or related meanings (not always synonymous) under a category
  - words of different parts-of-speech (POS) under a common heading
  WordNet:
  - synonymous words grouped as “synsets”
  - categorises words with different strategies, according to part-of-speech

Hierarchies
  Roget’s Thesaurus: hierarchical structure of categories (of few levels)
  WordNet: hierarchical structure for nouns and verbs (number of levels varies; deep for nouns and some classes of verbs)

Shortcomings
  Roget’s Thesaurus:
  - no definitions or glosses
  - top-level categories not organised based on POS
  WordNet:
  - sense distinctions too fine for practical NLP tasks
  - hierarchical structure cannot be used if attempting to move from a fine-grained to a coarse-grained approach

References: Miller, G. A. et al.: Introduction to WordNet: An On-line Lexical Database (1990); Kilgarriff, A., Yallop, C.: What's in a Thesaurus? (2000)

Construction of the Lexicon
- Building the hierarchical structures
- Preparing the lexical entries
- Classifying or categorising the lexical entries
- Specifying suitable relations among the lexical entries

The Hierarchies
- Based on GoiTaikei – A Japanese Lexicon
- 3,000 semantic classes in 3 hierarchies
  - General nouns
  - Proper nouns
  - "Phenomenons" (verbs, adjectives, adverbs)
- Each Japanese word tagged with
  - POS
  - semantic class(es)
  - "phenomenons": phrasal patterns with selectional restrictions
- Japanese labels of classes translated to English
- Structure re-created in a Web Ontology Language (OWL) file/database
Source: GoiTaikei – A Japanese Lexicon, Ikehara et al (1999)

The Hierarchies (cont.)
[Figures: General Noun Hierarchy, Proper Noun Hierarchy, Phenomenon Hierarchy. Source: GoiTaikei – A Japanese Lexicon, Ikehara et al (1999)]

The Lexical Entries
- Each lexical entry represents a sense of a word
- Information included:
  - English word-form
  - POS
  - definition keywords
  - equivalent word(s) in other languages
  - definition entries from dictionaries
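The fields listed on this slide could be held in a record like the one below. This is only a sketch: the class name, field names and the sample keywords are assumptions made for illustration, and the slides do not show how the entries are actually stored in the ontology.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LexicalEntry:
    """One sense of one English word (field names are hypothetical)."""
    word_form: str                      # English word-form
    pos: str                            # part of speech
    definition_keywords: List[str] = field(default_factory=list)
    # language code -> equivalent word(s) in that language
    equivalents: Dict[str, List[str]] = field(default_factory=dict)
    # source dictionary name -> definition text
    definitions: Dict[str, str] = field(default_factory=dict)


# "hand" in the sense translated as "pekerja"; the keywords are placeholders.
hand_worker = LexicalEntry(
    word_form="hand",
    pos="noun",
    definition_keywords=["worker", "labourer"],
    equivalents={"ms": ["pekerja"]},
)
```

Keeping one record per sense, rather than per word, matches the slide's statement that each lexical entry represents a single sense.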

The Lexical Entries (cont.)
[Figure: entries drawn from WordNet, Dictionary of Modern Chinese Words and Kamus Dewan]

Classifying the Lexical Entries
- Classifying lexical entries into appropriate classes
- English word → Japanese word, looked up in GoiTaikei to determine the semantic class
[Diagram: Lexical Entry → (translate) → Japanese Equivalent → (GoiTaikei lookup) → GoiTaikei Entry]
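The translate-then-look-up step can be sketched as two dictionary lookups. The function name, both lookup tables and the class label "lodging" are hypothetical stand-ins; the real English-Japanese dictionary and the GoiTaikei class inventory are not reproduced in the slides.

```python
def classify(english_word, en_to_ja, goitaikei_classes):
    """Return the semantic classes reached via the Japanese equivalents.

    en_to_ja: English word -> list of Japanese equivalents (assumed table)
    goitaikei_classes: Japanese word -> list of semantic classes (assumed table)
    """
    classes = []
    for japanese_word in en_to_ja.get(english_word, []):
        for semantic_class in goitaikei_classes.get(japanese_word, []):
            if semantic_class not in classes:   # keep order, drop duplicates
                classes.append(semantic_class)
    return classes


# Toy tables for illustration only.
en_to_ja = {"hotel": ["hoteru"]}
goitaikei_classes = {"hoteru": ["lodging"]}

print(classify("hotel", en_to_ja, goitaikei_classes))
```

A word with no Japanese equivalent, or an equivalent missing from the class table, yields an empty list, which would flag the entry for manual classification.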

The Relations
- GoiTaikei noun hierarchy: hyponymy (“is-a”) and meronymy (“part-of”)
- GoiTaikei: phrasal patterns and selectional restrictions for verbs, adjectives
Source: GoiTaikei – A Japanese Lexicon, Ikehara et al (1999)

Aside: Using GoiTaikei in Machine Translation (Japanese-English)
Example sentence: Watashi ga hoteru wo Odawarashi ni toru
Source: GoiTaikei – A Japanese Lexicon, Ikehara et al (1999)

Aside: Using GoiTaikei in Machine Translation (Japanese-English)
[Figure. Source: GoiTaikei – A Japanese Lexicon, Ikehara et al (1999)]

The Relations (cont.)
- Morphological relations between words
- WordNet: various types of semantic relations
  - Hyponymy and meronymy already present in GoiTaikei noun hierarchies
  - (still considering which types of relations are suitable for inclusion)

Using the Ontology-based Multilingual Lexicon for Word Selection

- Lim et al (2002) calculate Lexical Conceptual Distance Data (LCDD) as a measure of relatedness between word senses, using definition texts
- Extension: compute LCDD between classes of words too
- Apply different heuristics and weights: words of different POS "behave" differently (Miller et al 1990, Ide and Véronis 1998)
References: Lim, B. T., Guo, C. M., Tang, E. K.: Building a Semantic-Primitive-Based Lexical Consultation System (2002); Miller, G. et al: Introduction to WordNet: An On-line Lexical Database (1990); Ide, N., Véronis, J.: Word Sense Disambiguation (1998)

Descriptive Semantic Primitives
- Derived from dictionary (as opposed to prescriptive semantic primitives)
- Identify self-defined cycle: sense_1 → semantic primitive
    sense_1 [def] [sense_2 sense_5 sense_6]
    sense_2 [def] [sense_3 sense_2]
    sense_3 [def] [sense_1 sense_2]
    sense_4 [def] [sense_5]
Reference: Lim, B. T., Guo, C. M. & Tang, E. K. (2002): Building a Semantic Primitive Based Lexical Consultation System
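One way to read "self-defined cycle" is that following definition links from a sense eventually leads back to the same sense. The sketch below tests that with a depth-first walk over the slide's four example senses. The function name is an assumption, and the slide marks only sense_1 as a primitive; how the authors choose among several senses on the same cycle is not stated here.

```python
def is_self_defined(sense, defs):
    """Return True if following definition links from `sense` leads back to it."""
    stack = list(defs.get(sense, []))
    seen = set()
    while stack:
        current = stack.pop()
        if current == sense:
            return True                 # the sense is (indirectly) defined by itself
        if current in seen:
            continue
        seen.add(current)
        stack.extend(defs.get(current, []))
    return False


# Definition links copied from the slide: sense -> senses used in its definition.
defs = {
    "sense_1": ["sense_2", "sense_5", "sense_6"],
    "sense_2": ["sense_3", "sense_2"],
    "sense_3": ["sense_1", "sense_2"],
    "sense_4": ["sense_5"],
}

print(is_self_defined("sense_1", defs))   # True: sense_1 -> sense_2 -> sense_3 -> sense_1
print(is_self_defined("sense_4", defs))   # False: sense_5 has no recorded definition
```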

LCDD Calculation
    forecast#2: predict#1 in advance#3
    fixed#6: specify#1 in advance#3
    predict#1: make#3 a prediction#1 about
    specify#1: be specific#1 about
    advance#3: a change#1 for the better#2 progress#4
Reference: Lim, B. T., Guo, C. M. & Tang, E. K. (2002): Building a Semantic Primitive Based Lexical Consultation System
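The slide does not give the LCDD formula, so the following is not LCDD itself. It is a simpler stand-in that shows why forecast#2 and fixed#6 come out as related: expanding each definition through the sense-tagged words it contains reveals the senses the two share. The function names are assumptions, and only the sense-tagged words from the slide are used as links.

```python
def expand(sense, defs):
    """Collect every sense reachable from `sense` through definitions."""
    reached = set()
    stack = list(defs.get(sense, []))
    while stack:
        current = stack.pop()
        if current not in reached:
            reached.add(current)
            stack.extend(defs.get(current, []))
    return reached


def shared_senses(a, b, defs):
    """Senses that appear in the expanded definitions of both a and b."""
    return sorted(expand(a, defs) & expand(b, defs))


# Sense-tagged words from the slide's definitions (untagged words dropped).
defs = {
    "forecast#2": ["predict#1", "advance#3"],
    "fixed#6": ["specify#1", "advance#3"],
    "predict#1": ["make#3", "prediction#1"],
    "specify#1": ["specific#1"],
    "advance#3": ["change#1", "better#2", "progress#4"],
}

print(shared_senses("forecast#2", "fixed#6", defs))
# ['advance#3', 'better#2', 'change#1', 'progress#4']
```

A real distance measure would presumably weight these shared senses, for example by how deep in the expansion they occur; that weighting is left open here.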

An Example
Input: The ranch hands are going on a strike.
    hand (tangan) def
    hand (pekerja) def
    hand (bantuan) def
    hand (tulisan) def
    ranch def
    strike def
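To make the example concrete, the sketch below picks a sense of "hand" by counting words shared between each sense's definition and the definitions of the context words "ranch" and "strike". This is a simplified Lesk-style overlap, swapped in for illustration; it is not the LCDD-based method of the talk. All gloss words are invented placeholders, since the slide shows only "def" in place of the real definition texts.

```python
def pick_sense(candidates, context_words):
    """Pick the candidate whose gloss shares the most words with the context.

    candidates: target-language equivalent -> set of gloss words
    context_words: set of words from the definitions of the context words
    """
    best, best_score = None, -1
    for equivalent, gloss in candidates.items():
        score = len(gloss & context_words)
        if score > best_score:          # ties keep the earlier candidate
            best, best_score = equivalent, score
    return best


# Invented toy glosses, for illustration only.
hand_senses = {
    "tangan": {"body", "part", "arm", "fingers"},
    "pekerja": {"hired", "worker", "farm", "labour"},
    "bantuan": {"help", "assistance"},
    "tulisan": {"handwriting", "writing", "style"},
}
ranch_gloss = {"large", "farm", "cattle"}
strike_gloss = {"workers", "stop", "labour", "protest"}

print(pick_sense(hand_senses, ranch_gloss | strike_gloss))   # pekerja
```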

Using the Ontology-based Multilingual Lexicon (cont.)
- What if multiple equivalent words are found in the target language?
  - Can use co-occurrence data from parallel corpora for a more "natural", grammatical output, as done by Lee and Kim (2002)
- Miscellaneous
  - speech synthesis: homonyms, e.g. "semak"
Reference: Lee, H. A., Kim, G. C.: Translation Selection through Source Word Sense Disambiguation and Target Word Selection

Future Work and Conclusion

Future Work
- Early stages: still much to be done!
- Some concerns:
  - identifying suitable relations
  - identifying other information for lexical entries
  - extending the LCDD algorithm with structural or relational information
  - determining if and how adjectives and adverbs can be re-categorised

Future Work (cont.)
- Manual preparation → time- and labour-consuming
- Investigate automation of:
  - acquiring lexical information from various sources
  - inserting new lexical entries into the lexicon, given existing entries in the lexicon and definition texts of new entries (bootstrapping)

Conclusion
- Proposed construction of a multilingual lexicon, using an ontology framework, for WSD in machine translation
- Includes definition texts and equivalent translations in other languages
- Uses existing language resources (GoiTaikei, WordNet, etc.)
- Reusable for other NLP tasks

Thank You