Creating a Bilingual Ontology: A Corpus-Based Approach for Aligning WordNet and HowNet Marine Carpuat Grace Ngai Pascale Fung Kenneth W.Church.

Slides:



Advertisements
Similar presentations
Specialized models and ranking for coreference resolution Pascal Denis ALPAGE Project Team INRIA Rocquencourt F Le Chesnay, France Jason Baldridge.
Advertisements

Statistical Machine Translation Part II: Word Alignments and EM Alexander Fraser ICL, U. Heidelberg CIS, LMU München Statistical Machine Translation.
A Robust Approach to Aligning Heterogeneous Lexical Resources Mohammad Taher Pilehvar Roberto Navigli MultiJEDI ERC
Improved TF-IDF Ranker
1 Extended Gloss Overlaps as a Measure of Semantic Relatedness Satanjeev Banerjee Ted Pedersen Carnegie Mellon University University of Minnesota Duluth.
Cross-Language Retrieval INST 734 Module 11 Doug Oard.
Building a Large- Scale Knowledge Base for Machine Translation Kevin Knight and Steve K. Luk Presenter: Cristina Nicolae.
A Maximum Coherence Model for Dictionary-based Cross-language Information Retrieval Yi Liu, Rong Jin, Joyce Y. Chai Dept. of Computer Science and Engineering.
Word Sense Disambiguation Using Semantic Graph (Narayan Unny and Pushpak Bhattacharyya) A presentation by Ranjini Swaminathan University of Arizona.
Inducing Information Extraction Systems for New Languages via Cross-Language Projection Ellen Riloff University of Utah Charles Schafer, David Yarowksy.
EBMT1 Example Based Machine Translation as used in the Pangloss system at Carnegie Mellon University Dave Inman.
1 Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004.
The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001.
Machine Translation Prof. Alexandros Potamianos Dept. of Electrical & Computer Engineering Technical University of Crete, Greece May 2003.
Designing clustering methods for ontology building: The Mo’K workbench Authors: Gilles Bisson, Claire Nédellec and Dolores Cañamero Presenter: Ovidiu Fortu.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
MT Summit VIII, Language Technologies Institute School of Computer Science Carnegie Mellon University Pre-processing of Bilingual Corpora for Mandarin-English.
Comments on Guillaume Pitel: “Using bilingual LSA for FrameNet annotation of French text from generic resources” Gerd Fliedner Computational Linguistics.
1 The Web as a Parallel Corpus  Parallel corpora are useful  Training data for statistical MT  Lexical correspondences for cross-lingual IR  Early.
LEARNING WORD TRANSLATIONS Does syntactic context fare better than positional context? NCLT/CNGL Internal Workshop Ankit Kumar Srivastava 24 July 2008.
Machine translation Context-based approach Lucia Otoyo.
Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.
Comparable Corpora Kashyap Popat( ) Rahul Sharnagat(11305R013)
Exploiting Wikipedia as External Knowledge for Document Clustering Sakyasingha Dasgupta, Pradeep Ghosh Data Mining and Exploration-Presentation School.
Automatic Lexical Annotation Applied to the SCARLET Ontology Matcher Laura Po and Sonia Bergamaschi DII, University of Modena and Reggio Emilia, Italy.
Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval Doctorate Course Web Information Retrieval Speaker Gaia Trecarichi.
Advanced Signal Processing 05/06 Reinisch Bernhard Statistical Machine Translation Phrase Based Model.
Learning Phonetic Similarity for Matching Named Entity Translation and Mining New Translations Wai Lam, Ruizhang Huang, Pik-Shan Cheung ACM SIGIR 2004.
Leveraging Reusability: Cost-effective Lexical Acquisition for Large-scale Ontology Translation G. Craig Murray et al. COLING 2006 Reporter Yong-Xiang.
Jennie Ning Zheng Linda Melchor Ferhat Omur. Contents Introduction WordNet Application – WordNet Data Structure - WordNet FrameNet Application – FrameNet.
Information Retrieval and Web Search Cross Language Information Retrieval Instructor: Rada Mihalcea Class web page:
Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.
Paper Review by Utsav Sinha August, 2015 Part of assignment in CS 671: Natural Language Processing, IIT Kanpur.
WORD SENSE DISAMBIGUATION STUDY ON WORD NET ONTOLOGY Akilan Velmurugan Computer Networks – CS 790G.
W ORD S ENSE D ISAMBIGUATION By Mahmood Soltani Tehran University 2009/12/24 1.
Péter Schönhofen – Ad Hoc Hungarian → English – CLEF Workshop 20 Sep 2007 Performing Cross-Language Retrieval with Wikipedia Participation report for Ad.
Using a Lemmatizer to Support the Development and Validation of the Greek WordNet Harry Kornilakis 1, Maria Grigoriadou 1, Eleni Galiotou 1,2, Evangelos.
1 Co-Training for Cross-Lingual Sentiment Classification Xiaojun Wan ( 萬小軍 ) Associate Professor, Peking University ACL 2009.
Quality Control for Wordnet Development in BalkaNet Pavel Smrž Faculty of Informatics, Masaryk University in Brno, Czech.
Efficiently Computed Lexical Chains As an Intermediate Representation for Automatic Text Summarization H.G. Silber and K.F. McCoy University of Delaware.
11 Chapter 19 Lexical Semantics. 2 Lexical Ambiguity Most words in natural languages have multiple possible meanings. –“pen” (noun) The dog is in the.
Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling Ferhan Ture and Jimmy Lin University of Maryland,
Wikipedia as Sense Inventory to Improve Diversity in Web Search Results Celina SantamariaJulio GonzaloJavier Artiles nlp.uned.es UNED,c/Juan del Rosal,
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.
Iterative Translation Disambiguation for Cross Language Information Retrieval Christof Monz and Bonnie J. Dorr Institute for Advanced Computer Studies.
Word Translation Disambiguation Using Bilingial Bootsrapping Paper written by Hang Li and Cong Li, Microsoft Research Asia Presented by Sarah Hunter.
Alignment of Bilingual Named Entities in Parallel Corpora Using Statistical Model Chun-Jen Lee Jason S. Chang Thomas C. Chuang AMTA 2004.
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
Improving Translation Selection using Conceptual Vectors LIM Lian Tze Computer Aided Translation Unit School of Computer Sciences Universiti Sains Malaysia.
Learning Phonetic Similarity for Matching Named Entity Translations and Mining New Translations Wai Lam Ruizhang Huang Pik-Shan Cheung Department of Systems.
1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual.
Number Sense Disambiguation Stuart Moore Supervised by: Anna Korhonen (Computer Lab)‏ Sabine Buchholz (Toshiba CRL)‏
Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
A New Approach for English- Chinese Named Entity Alignment Donghui Feng Yayuan Lv Ming Zhou USC MSR Asia EMNLP-04.
Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.
From Words to Senses: A Case Study of Subjectivity Recognition Author: Fangzhong Su & Katja Markert (University of Leeds, UK) Source: COLING 2008 Reporter:
Combining Text and Image Queries at ImageCLEF2005: A Corpus-Based Relevance-Feedback Approach Yih-Cheng Chang Department of Computer Science and Information.
Statistical Machine Translation Part II: Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart
2/10/2016Semantic Similarity1 Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web Giannis Varelas Epimenidis.
Multilingual Information Retrieval using GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of Kaohsiung.
A Multilingual Hierarchy Mapping Method Based on GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of.
A Simple English-to-Punjabi Translation System By : Shailendra Singh.
Sentiment Analysis Using Common- Sense and Context Information Basant Agarwal 1,2, Namita Mittal 2, Pooja Bansal 2, and Sonal Garg 2 1 Department of Computer.
Lexicons, Concept Networks, and Ontologies
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Exploiting Wikipedia as External Knowledge for Document Clustering
ArtsSemNet: From Bilingual Dictionary To Bilingual Semantic Network
WordNet: A Lexical Database for English
Presentation transcript:

Creating a Bilingual Ontology: A Corpus-Based Approach for Aligning WordNet and HowNet Marine Carpuat Grace Ngai Pascale Fung Kenneth W.Church

About this paper Creates a bilingual ontology by aligning WordNet with an existing Chinese ontology HowNet Borrows techniques used in information retrieval and machine translation. Wants to show there exists an efficient algorithm that is capable aligning ontologies with two very different language structures Structural information within the ontologies – Not applicable to ontology that have vastly diff. structure

A Bilingual Chinese-English ontology Linking the American English WordNet and Simplified Chinese HowNet together by their most basic concepts – the WordNet synset and the HowNet Definition. Why picked WordNet & HowNet? – Structure – Polysemous words – Excellent test for the portability of the algorithm

WordNet  Electronic lexical database  Differentiate word senses from each other through the use of synsets. Ex: “ address ” -- {address, computer address}, {address, speech}  Synsets are linked to other synsets through hierarchical relations. (ex: hyponyms, hypernyms)  A total of 109,377 synsets are defined.

HowNet  Electronic lexical database  Mostly in Chinese with some English technical terms (ex: ASCII)  Synsets are not explicitly defined  Many words often belongs to the same definitions  1500 basic definitions  A total of 16,788 word concepts are composed of subsets of the definition

Want to know more?  A detailed WordNet – HowNet Structural comparison can be found in Wong & Fong (2002)

Word Sense ambiguation problem Finding the correct translation for Polysemous word in Chinese and English was the biggest problem. – Example: “Crane” One can see the problem of ambiguation by : – Baseline Experiment: Step 1: Pick 2000 HowNet definitions (and associated words) at random Step 2: Translate each of these words to English Step 3: Associate each of the translated English words with one synset in WordNet.

Result of Baseline Experiment For every definition in HowNet, there are on average 5 Chinese words with that definition For every definition in HowNet, there are on average 8 WordNet associated synsets. Average no. of HowNet Entries per Definition5.4 Average no. of WordNet Synsets per Definition8.1

Finer-Mapping Approach … Definition Match Algorithm (Knight & Luk, 1994) o Compare words with their contexts from example sentences and definition found in a dictionary. oUses word contexts from a large bilingual corpus. Fung & Lo ‘s information retrieval-like method oComparison of word contexts across languages and corpora that need not be parallel oEffective at extracting bilingual word trans. pairs

Using Synsets for Word Sense Disambiguation Goal of the algorithm: The alignment of the proper translation pair to the correct word sense The candidate WordNet synsets are ranked according to their similarity with the Chinese HowNet definition. The alignment ‘winner’ is defined as the HIGHEST- RANKING WordNet synset.

Word Sense Alignment Method … 1. Given a HowNet definition d, first extract its associated set of Chinese words and their English translations. 2. For each word from the English translations, find all the WordNet synsets that it belongs to. 3. For each of these candidate WordNet synsets s, a) If s contains only a single word( |s| = 1), expand it by adding words from its direct hyperset *. b) Define:

What is hyperset? The set of hypernyms of the current word which are included to aid in defining the meaning. Why need it?  The algorithm works better with synsets that contains more entries.  More elements in the Synsets, the greater of the value of Similarity (d,s).

Experiment … Bilingual data source: English-Chinese Hong Kong News Corpus which comprises of 18,500 aligned article pairs, from news doc released between * over 6 million words on the English side * use the entire HowNet vocabulary as a lexicon. The word list for the context vector construction was extracted by taking the monosemous (single meaning) word from WordNet Throw out all the words that had more than one translation in Chinese

Overall Result For each HowNet definition, the highest scoring WordNet synset that was aligned to it, and the corresponding alignment score are shown. The reverse mapping of WordNet synsets to HowNet definitions can also demonstrate the capabilities of the method.

Final Analysis 1-to-1 mapping from all HowNet definitions to WordNet synsets does not exists The seed word (a word that can be directly translation from one lang. to the other) coverage Precise translation? ( !! No !!) What about Rare Words? It creates lots of blank fields. Non-compositional compounds (NCC) causes problem Ex: floppy disk, hot dog Implement stemming technique Be able to capture the way a word is used more accurately

Conclusion and Future Work Does not make any assumptions about the structural alignment between both ontologies Expand the work on: – Address the concerns in the analysis section – Produce a full alignment from HowNet to WordNet – Expand the algorithm with more structural info. – Examine the use of the aligned ontology in application ( cross-lingual information retrieval and machine translation)