RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria

Cognate or False Friend? Ask the Web!

Svetlin Nakov, Sofia University "St. Kliment Ohridski"
Preslav Nakov, University of California, Berkeley
Elena Paskaleva, Bulgarian Academy of Sciences

A Workshop on Acquisition and Management of Multilingual Lexicons

Introduction

Cognates and false friends:
- Cognates are pairs of words in different languages that sound similar and are translations of each other
- False friends are pairs of words in two languages that sound similar but differ in their meanings

The problem: design an algorithm that can distinguish between cognates and false friends

Cognates and False Friends

Examples of cognates:
- ден in Bulgarian = день in Russian (day)
- idea in English = идея in Bulgarian (idea)

Examples of false friends:
- майка in Bulgarian (mother) vs. майка in Russian (vest)
- prost in German (cheers) vs. прост in Bulgarian (stupid)
- Gift in German (poison) vs. gift in English (present)

The Paper in One Slide

Measuring semantic similarity:
- Analyze the words' local contexts
- Use the Web as a corpus
- Similar contexts → similar words
- Context translation → cross-lingual similarity

Evaluation:
- 200 pairs of words: 100 cognates and 100 false friends
- 11pt average precision: 95.84%

Contextual Web Similarity

What is local context? A few words before and after the target word:

  Same day delivery of fresh flowers, roses, and unique gift baskets from our online boutique. Flower delivery online by local florists for birthday flowers.

- The words in the local context of a given word are semantically related to it
- Stop words (prepositions, pronouns, conjunctions, etc.) must be excluded, since they appear in all contexts
- A sufficiently big corpus is needed

Contextual Web Similarity

The Web as a corpus:
- The Web can be used as a corpus to extract the local context of a given word
- The Web is the largest possible corpus, and contains big corpora in any language
- Searching for a word in Google returns excerpts of text in which the target word appears along with its local context: a few words before and after it
- The target language can be specified

Contextual Web Similarity

The Web as a corpus. Example: Google query for "flower":

  Flowers, Plants, Gift Baskets FLOWERS.COM - Your Florist... Flowers, balloons, plants, gift baskets, gourmet food, and teddy bears presented by FLOWERS.COM, Your Florist of Choice for over 30 years.

  Margarita Flowers - Delivers in Bulgaria for you! - gifts, flowers, roses... Wide selection of BOUQUETS, FLORAL ARRANGEMENTS, CHRISTMAS DECORATIONS, PLANTS, CAKES and GIFTS appropriate for various occasions. CREDIT cards acceptable.

  Flowers, plants, roses, & gifts. Flowers delivery with fewer... Flowers, roses, plants and gift delivery. Order flowers from ProFlowers once, and you will never use flowers delivery from florists again.

Contextual Web Similarity

Measuring semantic similarity:
- Given two words, their local contexts are extracted from the Web: a set of words and their frequencies
- Semantic similarity is measured as the similarity between these local contexts
- The local contexts are represented as frequency vectors over a given set of words
- The cosine between the frequency vectors in Euclidean space is calculated
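The cosine step above can be sketched in a few lines of Python. This is an illustration, not the authors' actual code; the `ctx` dictionaries stand in for the Web-extracted context word frequencies:

```python
from math import sqrt

def cosine_similarity(ctx1, ctx2):
    """Cosine between two local-context frequency vectors,
    each represented as a {word: count} dictionary."""
    dot = sum(ctx1[w] * ctx2[w] for w in ctx1.keys() & ctx2.keys())
    norm1 = sqrt(sum(c * c for c in ctx1.values()))
    norm2 = sqrt(sum(c * c for c in ctx2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)
```

Identical contexts score 1.0; contexts with no words in common score 0.0.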

Contextual Web Similarity

Example of context word frequencies:

  word: flower          word: computer
  word      count       word        count
  fresh     217         Internet    291
  order     204         PC          286
  rose      183         technology  252
  delivery  165         order       185
  gift      124         new         174
  welcome    98         Web         159
  red        87         site        ...
  ...                   ...

Contextual Web Similarity

Example of frequency vectors. Similarity = cosine(v1, v2):

  v1: flower                 v2: computer
  #     word       freq      #     word       freq
  0     alias      3         0     alias      7
  1     alligator  2         1     alligator  0
  2     amateur    0         2     amateur    8
  3     apple      ...       3     apple      ...
  ...                        ...
  ...   zap        0         ...   zap        3
  5000  zoo        6         5000  zoo        0

Cross-Lingual Similarity

- We are given two words in different languages L1 and L2
- We have a bilingual glossary G of translation pairs {p ∈ L1, q ∈ L2}

Measuring cross-lingual similarity:
1. We extract the local contexts of the target words from the Web: C1 ⊆ L1 and C2 ⊆ L2
2. We translate the context C1 into L2 using the glossary G, obtaining C1*
3. We measure the distance between C1* and C2
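The three steps can be sketched as follows. This is a minimal illustration under the assumption that the glossary maps single L1 words to single L2 words; it is not the authors' implementation:

```python
from math import sqrt

def cosine(ctx1, ctx2):
    """Cosine between {word: count} frequency vectors."""
    dot = sum(ctx1[w] * ctx2[w] for w in ctx1.keys() & ctx2.keys())
    n1 = sqrt(sum(c * c for c in ctx1.values()))
    n2 = sqrt(sum(c * c for c in ctx2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def cross_lingual_similarity(ctx1, ctx2, glossary):
    """Translate the L1 context into L2 word by word via the
    glossary (dropping untranslatable words), then compare."""
    translated = {}
    for word, count in ctx1.items():
        if word in glossary:
            t = glossary[word]
            translated[t] = translated.get(t, 0) + count
    return cosine(translated, ctx2)
```

Words missing from the glossary are simply dropped, which is why a larger glossary (the HUGEDICT experiment later) can help.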

Reverse Context Lookup

- Local contexts extracted from the Web can contain arbitrary parasitic words like "online", "home", "search", "click", etc.
- Internet terms appear in any Web page, but such words are not likely to be genuinely associated with the target word
- Example (for the word flowers): "send flowers online", "flowers here", "order flowers here"
- Will the word "flowers" appear in the local contexts of "send", "online" and "here"?

Reverse Context Lookup

- If two words are semantically related, both should appear in the local contexts of each other
- Let #(x, y) = the number of occurrences of x in the local context of y
- For any word w and a word w_c from its local context, we define their strength of semantic association p(w, w_c) as follows:

  p(w, w_c) = min{ #(w, w_c), #(w_c, w) }

- We use p(w, w_c) as the vector coordinates when measuring semantic similarity
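In code, with the co-occurrence counts stored in a dictionary keyed by ordered word pairs (a sketch; in the real system the counts come from the Web excerpts):

```python
def association_strength(w, wc, count):
    """p(w, wc) = min(#(w, wc), #(wc, w)), where count[(x, y)] is
    the number of occurrences of x in the local context of y."""
    return min(count.get((w, wc), 0), count.get((wc, w), 0))
```

A parasitic word like "online" may occur often in the context of "flowers", but "flowers" rarely occurs in the context of "online", so the min drives that vector coordinate toward zero.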

Web Similarity Using Seed Words

An adaptation of the Fung & Yee '98 algorithm*:
- We have a bilingual glossary G: L1 → L2 of translation pairs, and target words w1 and w2
- We search in Google for the co-occurrences of the target words with the glossary entries
- We compare the co-occurrence vectors: for each {p, q} ∈ G, compare max(google#("w1 p"), google#("p w1")) with max(google#("w2 q"), google#("q w2"))

* P. Fung and L. Y. Yee. An IR approach for translating from nonparallel, comparable texts. In Proceedings of ACL, volume 1, pages 414-420, 1998
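The co-occurrence vectors can be sketched as below, assuming a hypothetical `hit_count(phrase)` function that returns the number of search-engine hits for an exact-phrase query (stubbed out here, since the real counts come from Google):

```python
from math import sqrt

def seed_vector(word, glossary_entries, hit_count):
    """Co-occurrence vector of a target word with each glossary
    entry, taking the max over the two phrase orders."""
    return [max(hit_count(f'"{word} {g}"'), hit_count(f'"{g} {word}"'))
            for g in glossary_entries]

def cosine(v1, v2):
    """Cosine between two plain lists of counts."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = sqrt(sum(a * a for a in v1))
    n2 = sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# For a glossary G = {(p, q)}, the similarity of w1 and w2 is the
# cosine between seed_vector(w1, [p, ...]) and seed_vector(w2, [q, ...]).
```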

Evaluation Data Set

- We use 200 Bulgarian/Russian pairs of words: 100 cognates and 100 false friends
- Manually assembled by a linguist
- Manually checked in several large monolingual and bilingual dictionaries
- Limited to nouns only

Experiments

We tested a few modifications of our contextual Web similarity algorithm:
- Use of TF.IDF weighting
- Preserving the stop words
- Use of lemmatization of the context words
- Use of different context sizes (2, 3, 4 and 5)
- Use of a small and a large bilingual glossary

We compared it with the seed words algorithm, and with traditional orthographic similarity measures: LCSR and MEDR.

Experiments

- BASELINE: random
- MEDR: minimum edit distance ratio
- LCSR: longest common subsequence ratio
- SEED: the "seed words" algorithm
- WEB3: the Web-based similarity algorithm with the default parameters: context size = 3, small glossary, stop-word filtering, no lemmatization, no reverse context lookup, no TF.IDF weighting
- NO-STOP: WEB3 without stop-word removal
- WEB1, WEB2, WEB4 and WEB5: WEB3 with context sizes of 1, 2, 4 and 5
- LEMMA: WEB3 with lemmatization
- HUGEDICT: WEB3 with the huge glossary
- REVERSE: the "reverse context lookup" algorithm
- COMBINED: WEB3 + lemmatization + huge glossary + reverse context lookup

Resources

We used the following resources:
- A bilingual Bulgarian/Russian glossary of translation pairs
- A huge bilingual glossary of word pairs
- A list of 599 Bulgarian stop words
- A list of 508 Russian stop words
- A Bulgarian lemma dictionary (wordforms and lemmata)
- A Russian lemma dictionary (wordforms and lemmata)

Evaluation

- We order the pairs of words from the test dataset by the calculated similarity
- The false friends are expected to appear at the top and the cognates at the bottom
- We evaluate the 11pt average precision of the obtained ordering
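The 11pt average precision used here can be computed as in the standard IR sketch below: interpolated precision at the recall levels 0.0, 0.1, ..., 1.0, with false friends treated as the relevant class:

```python
def eleven_pt_avg_precision(ranked_labels):
    """11pt average precision of a ranking. ranked_labels is a list
    of booleans (True = false friend), ordered by increasing
    similarity, so false friends are expected to come first."""
    total = sum(ranked_labels)
    points = []  # (recall, precision) after each relevant item
    found = 0
    for rank, is_ff in enumerate(ranked_labels, start=1):
        if is_ff:
            found += 1
            points.append((found / total, found / rank))
    # interpolated precision at level r = max precision at recall >= r
    levels = [i / 10 for i in range(11)]
    interp = [max((p for r, p in points if r >= lv), default=0.0)
              for lv in levels]
    return sum(interp) / len(levels)
```

A perfect ordering (all false friends before all cognates) scores 1.0; the random BASELINE with a balanced 100/100 test set scores about 0.5.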

Results (11pt Average Precision)

Comparing BASELINE, LCSR, MEDR, SEED and WEB3 algorithms

Results (11pt Average Precision)

Comparing different context sizes; keeping the stop words

Results (11pt Average Precision)

Comparing different improvements of the WEB3 algorithm

Results (Precision-Recall Graph)

Comparing the recall-precision graphs of the evaluated algorithms

Results: The Ordering for WEB3

  Rank | Candidate           | BG sense  | RU sense | Sim    | Cognate? | Precision | Recall
  1    | муфта               | gratis    | muff     | 0.0085 | no       | 100.00%   | 1.00%
  2    | багрене / багренье  | mottle    | gaff     | 0.0130 | no       | 100.00%   | 2.00%
  3    | добитък / добыток   | livestock | income   | 0.0143 | no       | 100.00%   | 3.00%
  4    | мраз / мразь        | chill     | crud     | 0.0175 | no       | 100.00%   | 4.00%
  5    | плет / плеть        | hedge     | whip     | 0.0182 | no       | 100.00%   | 5.00%
  ...  |                     |           |          |        |          |           |
  99   | вулкан              | volcano   |          | 0.2099 | yes      | 81.82%    | 81.00%
  100  | година              | year      | time     | 0.2101 | no       | 82.00%    |
  101  | бут                 | leg       | rubble   | 0.2130 | no       | 82.18%    | 83.00%
  ...  |                     |           |          |        |          |           |
  196  | финанси / финансы   | finance   |          | 0.8017 | yes      | 51.28%    | 100.00%
  197  | сребро / серебро    | silver    |          | 0.8916 | yes      | 50.76%    | 100.00%
  198  | наука               | science   |          | 0.9028 | yes      | 50.51%    | 100.00%
  199  | флора               | flora     |          | 0.9171 | yes      | 50.25%    | 100.00%
  200  | красота             | beauty    |          | 0.9684 | yes      | 50.00%    | 100.00%

Discussion

Our approach is original because it:
- Introduces a semantic similarity measure, not an orthographic or phonetic one
- Uses the Web as a corpus and does not rely on any preexisting corpora
- Uses reverse-context lookup, which yields a significant improvement in quality
- Is applied to an original problem: classification of almost identically spelled true/false friends

Discussion

Very good accuracy: over 95%. Still, it is not 100% accurate:
- Typical mistakes are synonyms, hyponyms, and words influenced by cultural, historical and geographical differences
- The Web as a corpus introduces noise: Google returns only the first results, and ranks news portals, travel agencies and retail sites higher than books, articles and forum posts
- The local context can contain noise

Conclusion and Future Work

Conclusion:
- An algorithm that can distinguish between cognates and false friends
- It analyzes the words' local contexts, using the Web as a corpus

Future work:
- Better glossaries
- Automatically augmenting the glossary
- Different language pairs

Questions?

Cognate or False Friend? Ask the Web!