Presentation transcript:

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Unsupervised Extraction of False Friends from Parallel Bi-Texts Using the Web as a Corpus

Preslav Nakov, Sofia University "St. Kliment Ohridski"
Svetlin Nakov, Sofia University "St. Kliment Ohridski"
Elena Paskaleva, Bulgarian Academy of Sciences

Introduction

Introduction

Cognates and False Friends
- Cognates are pairs of words in different languages that are perceived as similar and are translations of each other
- False friends are pairs of words in two languages that are perceived as similar, but differ in meaning

The problem
- Design an algorithm for extracting all pairs of false friends from a parallel bi-text

Cognates and False Friends

Some cognates
- ден in Bulgarian = день in Russian (day)
- idea in English = идея in Bulgarian (idea)

Some false friends
- майка in Bulgarian (mother) vs. майка in Russian (vest)
- prost in German (cheers) vs. прост in Bulgarian (stupid)
- gift in German (poison) vs. gift in English (present)

Method

Method

False friends extraction from a parallel bi-text works in two steps:
1. Find candidate cognates / false friends
   - Modified orthographic similarity measure
2. Distinguish cognates from false friends
   - Sentence-level co-occurrences
   - Word alignment probabilities
   - Web-based semantic similarity
   - Combined approach

Step 1: Identifying Candidate Cognates

Step 1: Finding Candidate Cognates

- Extract all word pairs (w1, w2) such that w1 ∈ the first language and w2 ∈ the second language
- Calculate a modified minimum edit distance ratio MMEDR(w1, w2)
  - Apply a set of transformation rules and measure a weighted Levenshtein distance
- Candidates for cognates are the pairs (w1, w2) such that MMEDR(w1, w2) > α

Step 1: Finding Candidate Cognates
Orthographic Similarity: MEDR

- Minimum Edit Distance Ratio (MEDR)
- MED(s1, s2) = the minimum number of INSERT / REPLACE / DELETE operations for transforming s1 into s2
- MEDR(s1, s2) = 1 - MED(s1, s2) / max(|s1|, |s2|)
- MEDR is also known as normalized edit distance (NED)
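
To make the measure concrete, here is a minimal Python sketch of MEDR, assuming the standard definition of normalized edit distance (1 minus the Levenshtein distance divided by the longer word's length); the function names are illustrative, not taken from the authors' implementation.

```python
def med(s1: str, s2: str) -> int:
    """Minimum edit distance: number of INSERT / REPLACE / DELETE operations."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # replace (or match)
    return dp[m][n]

def medr(s1: str, s2: str) -> float:
    """Assumed definition: 1 - MED / max word length (a similarity in [0, 1])."""
    if not s1 and not s2:
        return 1.0
    return 1.0 - med(s1, s2) / max(len(s1), len(s2))

print(medr("ден", "день"))  # 1 - 1/4 = 0.75
```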

Step 1: Finding Candidate Cognates
Orthographic Similarity: MMEDR

Modified Minimum Edit Distance Ratio (MMEDR) for Bulgarian / Russian:
1. Transliterate from Russian to Bulgarian
2. Lemmatize
3. Replace some Bulgarian letter-sequences with Russian ones (e.g. strip some endings)
4. Assign weights to the edit operations

Step 1: Finding Candidate Cognates
The MMEDR Algorithm

1. Transliterate from Russian to Bulgarian
   - Strip the Russian letters "ь" and "ъ"
   - Replace "э" with "е", "ы" with "и", …
2. Lemmatize
   - Replace inflected wordforms with their lemmata
   - Optional step: performed or skipped
3. Replace some letter-sequences
   - Hand-crafted rules
   - Example: remove the definite article in Bulgarian words (e.g. "ът", "ят")
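
A toy sketch of the kind of hand-crafted normalization rules listed above; the paper's actual rule set is richer, and the helper names here are hypothetical.

```python
import re

# Illustrative subset of the rules on this slide; the real rule set is larger.
RU_TO_BG = str.maketrans({"э": "е", "ы": "и"})

def normalize_russian(word: str) -> str:
    """Transliterate a Russian word towards Bulgarian orthography."""
    word = word.translate(RU_TO_BG)                  # э -> е, ы -> и
    return word.replace("ь", "").replace("ъ", "")    # strip the soft and hard signs

def normalize_bulgarian(word: str) -> str:
    """Strip the postfixed Bulgarian definite article (only "ът"/"ят" handled here)."""
    return re.sub(r"(ът|ят)$", "", word)

print(normalize_russian("крепость"))   # -> крепост (matches the Bulgarian spelling)
print(normalize_bulgarian("градът"))   # -> град
```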

Step 1: Finding Candidate Cognates
The MMEDR Algorithm (2)

4. Assign weights to the edit operations:
   - 0.5 for vowel-to-vowel substitutions, e.g. е ↔ о
   - a reduced weight for some consonant-consonant substitutions, e.g. с ↔ з
   - 1.0 for all other edit operations

MMEDR example: the Bulgarian първият and the Russian первый (first)
- The previous steps produce първи and перви, thus MMED = 0.5 (weight 0.5 for the vowel substitution ъ ↔ е)
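
The weighted Levenshtein step could look like the sketch below; the 0.5 weight for vowel-vowel substitutions follows the slide, while the concrete letter classes and the choice of reduced-weight consonant pairs are illustrative assumptions.

```python
VOWELS = set("аеиоуъюяыэё")                 # Bulgarian + Russian vowels (illustrative)
CHEAP_CONSONANT_PAIRS = {frozenset("сз")}   # example pair from the slide: с <-> з

def sub_cost(a: str, b: str) -> float:
    """Substitution weight for a single pair of letters."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5
    if frozenset((a, b)) in CHEAP_CONSONANT_PAIRS:
        return 0.5                          # assumed value; the transcript omits it
    return 1.0

def mmed(s1: str, s2: str) -> float:
    """Weighted minimum edit distance (insertions/deletions cost 1.0)."""
    m, n = len(s1), len(s2)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = float(i)
    for j in range(n + 1):
        dp[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + 1.0,
                           dp[i][j - 1] + 1.0,
                           dp[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]))
    return dp[m][n]

def mmedr(s1: str, s2: str) -> float:
    return 1.0 - mmed(s1, s2) / max(len(s1), len(s2), 1)

print(mmedr("първи", "перви"))   # 1 - 0.5/5 = 0.9 on the normalized forms
```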

Step 2: Distinguishing between Cognates and False Friends

Sentence-Level Co-occurrences

- Idea: cognates are likely to co-occur in parallel sentences (unlike false friends)
- Previous work: Nakov & Pacovski (2006) proposed formulas based on the following counts:
  - #(w_bg) – the number of Bulgarian sentences containing the word w_bg
  - #(w_ru) – the number of Russian sentences containing the word w_ru
  - #(w_bg, w_ru) – the number of aligned sentence pairs containing both w_bg and w_ru
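
These counts are straightforward to collect from a sentence-aligned bi-text; the sketch below shows one way to do it, together with a Dice-style ratio that merely stands in for the slide's formulas — it is not the exact formula used in the paper.

```python
def cooccurrence_counts(bitext, w_bg, w_ru):
    """bitext is a list of (bulgarian_tokens, russian_tokens) aligned sentence pairs."""
    n_bg = sum(1 for bg, _ in bitext if w_bg in bg)                      # #(w_bg)
    n_ru = sum(1 for _, ru in bitext if w_ru in ru)                      # #(w_ru)
    n_both = sum(1 for bg, ru in bitext if w_bg in bg and w_ru in ru)    # #(w_bg, w_ru)
    return n_bg, n_ru, n_both

def dice_cooc(bitext, w_bg, w_ru):
    """Illustrative Dice-style ratio over the counts; NOT the paper's formula."""
    n_bg, n_ru, n_both = cooccurrence_counts(bitext, w_bg, w_ru)
    return 2 * n_both / (n_bg + n_ru) if (n_bg + n_ru) else 0.0

bitext = [(["хубав", "ден"], ["хороший", "день"]),
          (["това", "е", "ден"], ["это", "день"])]
print(dice_cooc(bitext, "ден", "день"))   # 1.0: the two words always co-occur
```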

New Formulas for Sentence-Level Co-occurrences

New formulas, E1 and E2, for measuring similarity based on the sentence-level co-occurrence counts #(w_bg), #(w_ru) and #(w_bg, w_ru). (The formulas themselves appeared as images on the slide and are not preserved in this transcript.)

Word Alignments

- Measure the semantic relatedness between words that co-occur in aligned sentences
- Build directed word alignments for the aligned sentences in the bi-text
  - Using IBM Model 4
- Average the translation probabilities Pr(w_bg | w_ru) and Pr(w_ru | w_bg):
  lex(w_bg, w_ru) = [Pr(w_bg | w_ru) + Pr(w_ru | w_bg)] / 2
- Drawback: words that never co-occur in corresponding sentences have lex = 0
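
Assuming the directed alignments yield two lexical translation tables (represented here as nested dictionaries, which is an assumption about the data layout, not the authors' format), the averaging step is tiny:

```python
def lex_score(w_bg, w_ru, p_bg_given_ru, p_ru_given_bg):
    """Average the two directional lexical translation probabilities.

    p_bg_given_ru[w_ru][w_bg] = Pr(w_bg | w_ru) and analogously for the other
    direction.  Word pairs that never co-occur in aligned sentences get 0.0,
    which is exactly the drawback noted above.
    """
    p1 = p_bg_given_ru.get(w_ru, {}).get(w_bg, 0.0)
    p2 = p_ru_given_bg.get(w_bg, {}).get(w_ru, 0.0)
    return (p1 + p2) / 2.0
```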

Web-based Semantic Similarity

What is local context?
- A few words before and after the target word
- The words in the local context of a given word are semantically related to it
- Need to exclude stop words: prepositions, pronouns, conjunctions, etc.
  - Stop words appear in all contexts
- Need for a sufficiently large corpus

Example (local context of "flowers"):
"Same day delivery of fresh flowers, roses, and unique gift baskets from our online boutique. Flower delivery online by local florists for birthday flowers."

Web-based Semantic Similarity (2)

Web as a corpus
- The Web can be used as a corpus from which to extract the local context of a given word
- The Web is the largest available corpus
  - Contains large corpora in many languages
- A query for a word in Google can return up to 1,000 text snippets
  - The target word is returned along with its local context: a few words before and after it
  - The target language can be specified
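
A rough sketch of harvesting a word's local context from the returned snippets: take a window of a few words on each side of every occurrence of the target and drop stop words. The window size, tokenization and stop-word list below are illustrative choices.

```python
import re
from collections import Counter

def local_context(snippets, target, stop_words, window=3):
    """Count the words occurring within `window` words of `target` in the snippets."""
    context = Counter()
    for snippet in snippets:
        tokens = re.findall(r"\w+", snippet.lower())
        for i, tok in enumerate(tokens):
            if tok == target:
                neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                context.update(w for w in neighbours if w not in stop_words)
    return context

snippets = ["Same day delivery of fresh flowers, roses, and unique gift baskets ..."]
print(local_context(snippets, "flowers", stop_words={"of", "and", "the", "same"}))
# Counter({'delivery': 1, 'fresh': 1, 'roses': 1, 'unique': 1})
```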

Web-based Semantic Similarity (3)

Web as a corpus – example: Google query for "flower"

Flowers, Plants, Gift Baskets FLOWERS.COM - Your Florist...
"Flowers, balloons, plants, gift baskets, gourmet food, and teddy bears presented by FLOWERS.COM, Your Florist of Choice for over 30 years."

Margarita Flowers - Delivers in Bulgaria for you! - gifts, flowers, roses...
"Wide selection of BOUQUETS, FLORAL ARRANGEMENTS, CHRISTMAS DECORATIONS, PLANTS, CAKES and GIFTS appropriate for various occasions. CREDIT cards acceptable."

Flowers, plants, roses, & gifts. Flowers delivery with fewer...
"Flowers, roses, plants and gift delivery. Order flowers from ProFlowers once, and you will never use flowers delivery from florists again."

Web-based Semantic Similarity (4)

Measuring semantic similarity
- Given two words, their local contexts are extracted from the Web
  - A set of words and their frequencies
  - Lemmatization is applied
- Semantic similarity is measured using these local contexts
  - Vector-space model: build frequency vectors
  - Cosine: the similarity is the cosine between these vectors
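
The vector-space step then reduces to a cosine between two word-frequency vectors; a minimal sketch, using a few of the counts from the example on the next slide:

```python
import math
from collections import Counter

def cosine(ctx1: Counter, ctx2: Counter) -> float:
    """Cosine between two word-frequency vectors (Counters keyed by word)."""
    common = set(ctx1) & set(ctx2)
    dot = sum(ctx1[w] * ctx2[w] for w in common)
    norm1 = math.sqrt(sum(v * v for v in ctx1.values()))
    norm2 = math.sqrt(sum(v * v for v in ctx2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

flower = Counter({"fresh": 217, "order": 204, "rose": 183, "delivery": 165})
computer = Counter({"internet": 291, "pc": 286, "technology": 252, "order": 185})
print(cosine(flower, computer))   # low: only "order" is shared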

Web-based Semantic Similarity (5)

Example of contextual word frequencies

word: flower
  word        count
  fresh       217
  order       204
  rose        183
  delivery    165
  gift        124
  welcome     98
  red         87
  ...

word: computer
  word        count
  Internet    291
  PC          286
  technology  252
  order       185
  new         174
  Web         159
  site        ...

Web-based Semantic Similarity (6)

Example of frequency vectors; Similarity = cosine(v1, v2)

v1: flower
  #      word       freq
  0      alias      3
  1      alligator  2
  2      amateur    0
  3      apple      …
  …      …          …
  …      zap        0
  5000   zoo        6

v2: computer
  #      word       freq
  0      alias      7
  1      alligator  0
  2      amateur    8
  3      apple      …
  …      …          …
  …      zap        3
  5000   zoo        0

Web-based Semantic Similarity:
Cross-Lingual Semantic Similarity

Given
- two words in different languages L1 and L2
- a bilingual glossary G of known translation pairs {p ∈ L1, q ∈ L2}

Measure cross-lingual similarity as follows:
1. Extract the local contexts of the target words from the Web: C1 ⊂ L1 and C2 ⊂ L2
2. Translate the local context C1 into L2 using the glossary G, obtaining C1*
3. Measure the similarity between C1* and C2
   - vector-space model
   - cosine
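
A sketch of the cross-lingual variant, assuming the glossary is a simple word-to-word dictionary and reusing the cosine() helper from the earlier sketch; dropping context words that have no glossary translation is a simplification on my part.

```python
from collections import Counter

def cross_lingual_similarity(ctx1: Counter, ctx2: Counter, glossary: dict) -> float:
    """Translate the L1 context via the glossary, then compare with cosine.

    `glossary` maps L1 words to L2 words; context words without a known
    translation are simply dropped here.
    """
    translated = Counter()
    for word, freq in ctx1.items():
        if word in glossary:
            translated[glossary[word]] += freq
    return cosine(translated, ctx2)   # cosine() as defined in the previous sketch
```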

Combined Approach

- Sentence-level co-occurrences
  - Problems with infrequent words
- Word alignments
  - Works well only when the statistics for the target words are reliable
  - Problems with infrequent words
- Web-based semantic similarity
  - Quite reliable for unrelated words
  - Sometimes assigns very low scores to highly-related word pairs
  - Works well for infrequent words
- We combine all three approaches by adding up their similarity values
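
Since the combination is just a sum of the individual similarity values, ranking the candidate pairs for false-friend extraction could look like the sketch below; putting the least similar pairs (the likely false friends) first is my reading of how the ranked list is used, not something the slide states.

```python
def rank_false_friend_candidates(pairs, cooc_sim, lex_sim, web_sim):
    """Order candidate (Bulgarian, Russian) pairs by their combined similarity.

    The three evidence sources are simply added up, as described above; pairs
    with the lowest combined similarity surface first, since those are the
    most likely false friends.
    """
    scored = [(cooc_sim(bg, ru) + lex_sim(bg, ru) + web_sim(bg, ru), bg, ru)
              for bg, ru in pairs]
    return sorted(scored)   # lowest combined similarity first
```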

Experiments and Evaluation

Evaluation: Methodology

- We extract all pairs of cognates / false friends from a Bulgarian-Russian bi-text: all pairs with MMEDR(w1, w2) > α
- This yields 612 word pairs: 577 cognates and 35 false friends
- We order the pairs by their similarity score according to 18 different algorithms
- We calculate 11-point interpolated average precision on the ordered pairs
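
For reference, a small sketch of 11-point interpolated average precision over a ranked candidate list, treating the true false friends as the "relevant" items (that interpretation is mine; the slide does not spell it out):

```python
def eleven_point_avg_precision(ranked_is_false_friend):
    """ranked_is_false_friend: booleans in ranked order, True for real false friends."""
    total_relevant = sum(ranked_is_false_friend)
    if total_relevant == 0:
        return 0.0
    precisions, recalls, hits = [], [], 0
    for i, rel in enumerate(ranked_is_false_friend, start=1):
        hits += rel
        precisions.append(hits / i)
        recalls.append(hits / total_relevant)
    # interpolated precision at recall levels 0.0, 0.1, ..., 1.0
    levels = [i / 10 for i in range(11)]
    interpolated = [max([p for p, r in zip(precisions, recalls) if r >= lvl], default=0.0)
                    for lvl in levels]
    return sum(interpolated) / 11

print(round(eleven_point_avg_precision([True, False, True, False]), 4))   # ~0.8485
```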

Resources

Bi-text
- The first seven chapters of the Russian novel "Lord of the World" + its Bulgarian translation
- Sentence-level aligned with MARK ALISTeR (using the Gale-Church algorithm)
- 759 parallel sentences

Morphological dictionaries
- Bulgarian: 1M wordforms (70,000 lemmata)
- Russian: 1.5M wordforms (100,000 lemmata)

Resources (2)

Bilingual glossary
- Bulgarian / Russian glossary
- 3,794 pairs of translation words

Stop words
- A list of 599 Bulgarian stop words
- A list of 508 Russian stop words

Web as a corpus
- Google queries for 557 Bulgarian and 550 Russian words
- Up to 1,000 text snippets for each word

Algorithms

- BASELINE – word pairs in alphabetical order
- COOC – the sentence-level co-occurrence algorithm with formula F6
- COOC+L – COOC with lemmatization
- COOC+E1 – COOC with the formula E1
- COOC+E1+L – COOC with the formula E1 and lemmatization
- COOC+E2 – COOC with the formula E2
- COOC+E2+L – COOC with the formula E2 and lemmatization
- COOC+SMT+L – average of COOC+L and translation probability

Algorithms (2)

- WEB+L – Web-based semantic similarity with lemmatization
- WEB+COOC+L – average of WEB+L and COOC+L
- WEB+E1+L – average of WEB+L and E1+L
- WEB+E2+L – average of WEB+L and E2+L
- WEB+SMT+L – average of WEB+L and translation probability
- E1+SMT+L – average of E1+L and translation probability
- E2+SMT+L – average of E2+L and translation probability
- WEB+COOC+SMT+L – average of WEB+L, COOC+L and translation probability
- WEB+E1+SMT+L – average of WEB+L, E1+L and translation probability
- WEB+E2+SMT+L – average of WEB+L, E2+L and translation probability

Results

  Algorithm          11-pt Avg. Precision
  BASELINE                  4.17%
  E2                       38.60%
  E1                       39.50%
  COOC                     43.81%
  COOC+L                   53.20%
  COOC+SMT+L               56.22%
  WEB+COOC+L               61.28%
  WEB+COOC+SMT+L           61.67%
  WEB+L                    63.68%
  E1+L                     63.98%
  E1+SMT+L                 65.36%
  E2+L                     66.82%
  WEB+SMT+L                69.88%
  E2+SMT+L                 70.62%
  WEB+E2+L                 76.15%
  WEB+E1+SMT+L             76.35%
  WEB+E1+L                 77.50%
  WEB+E2+SMT+L             78.24%

Conclusion and Future Work

Conclusion

- We improved the accuracy of the best known algorithm by nearly 35%
- Lemmatization is a must for highly inflectional languages like Bulgarian and Russian
- Combining multiple information sources works much better than any individual source

Future Work

- Take into account the part of speech
  - e.g. a verb and a noun cannot be cognates
- Improve the formulas for the sentence-level approaches
- Improved Web-based similarity measure
  - e.g. only use context words in certain syntactic relationships with the target word
- New resources
  - Wikipedia, EuroWordNet, etc.
  - Large parallel bi-texts as a source of semantic information

Thank you! Questions?

Unsupervised Extraction of False Friends from Parallel Bi-Texts Using the Web as a Corpus