Improved Word Alignments Using the Web as a Corpus

Presentation transcript:

Improved Word Alignments Using the Web as a Corpus
International Conference RANLP 2007 (Recent Advances in Natural Language Processing)
Preslav Nakov, University of California, Berkeley
Svetlin Nakov, Sofia University "St. Kliment Ohridski"
Elena Paskaleva, Bulgarian Academy of Sciences
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria

Statistical Machine Translation (SMT)
1988 – IBM models 1, 2, 3, 4 and 5
- Start with a bilingual, parallel, sentence-aligned corpus
- Learn translation probabilities of individual words
2004 – PHARAOH model
- Learn translation probabilities for phrases
- Alignment template approach: extracts translation phrases from word alignments
Improved word alignments in sentences improve translation quality!

Word Alignments
The word alignment problem:
- Given a bilingual, parallel, sentence-aligned corpus, align the words in each sentence with the corresponding words in its translation
Example:
- English sentence: Try our same day delivery of fresh flowers, roses, and unique gift baskets.
- Bulgarian sentence: Опитайте нашите свежи цветя, рози и уникални кошници с подаръци с доставка на същия ден.

Word Alignments – Example
[Alignment diagram linking the Bulgarian words (опитайте, нашите, свежи, цветя, рози, и, уникални, кошници, с, подаръци, доставка, на, същия, ден) to the English words (try, our, same, day, delivery, of, fresh, flowers, roses, and, unique, gift, baskets).]

Our Method
Use a combination of:
- Orthographic similarity measure: a modified weighted minimum edit distance
- Semantic similarity measure: analyzes the words co-occurring in the local contexts of the target words, using the Web as a corpus
- Competitive linking

Orthographic Similarity
Minimum Edit Distance Ratio (MEDR)
- MED(s1, s2) = the minimum number of INSERT / REPLACE / DELETE operations needed to transform s1 into s2
- MEDR(s1, s2) = 1 - MED(s1, s2) / max(|s1|, |s2|)
Longest Common Subsequence Ratio (LCSR)
- LCS(s1, s2) = the longest common subsequence of s1 and s2
- LCSR(s1, s2) = |LCS(s1, s2)| / max(|s1|, |s2|)
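A minimal Python sketch of the two baseline measures, assuming the usual normalization by the length of the longer word; the function names are illustrative:

```python
def med(s1: str, s2: str) -> int:
    """Minimum edit distance with unit-cost INSERT / REPLACE / DELETE."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace / match
    return d[m][n]

def lcs_len(s1: str, s2: str) -> int:
    """Length of the longest common subsequence."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1
            else:
                d[i][j] = max(d[i - 1][j], d[i][j - 1])
    return d[m][n]

def medr(s1: str, s2: str) -> float:
    return 1.0 - med(s1, s2) / max(len(s1), len(s2))

def lcsr(s1: str, s2: str) -> float:
    return lcs_len(s1, s2) / max(len(s1), len(s2))
```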

Orthographic Similarity
Modified Minimum Edit Distance Ratio (MMEDR) for Bulgarian / Russian:
- Normalize the strings
- Assign weights for the edit operations
Normalizing the strings (hand-crafted rules):
- Strip the Russian letters "ь" and "ъ"
- Remove the Russian "й" at word endings
- Remove the Bulgarian definite article (e.g. "ът", "ят" at word endings)

Orthographic Similarity
Assigning weights for the edit operations:
- 0.5-0.9 for vowel-to-vowel substitutions, e.g. 0.5 for е → о
- 0.5-0.9 for some consonant-consonant replacements, e.g. с → з
- 1.0 for all other edit operations
Example: Bulgarian първият and Russian первый (first)
- Normalization produces първи and перви, thus MMED = 0.5 (weight 0.5 for ъ → е)
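A sketch of MMEDR under these rules. The normalization steps and the two 0.5 weights follow the slide; the с → з weight is an illustrative value from the stated 0.5-0.9 range, and the extra ы → и mapping is an assumption added so that the slide's първият / первый example comes out as shown:

```python
import re

def normalize_ru(word: str) -> str:
    word = word.replace("ь", "").replace("ъ", "")  # strip soft and hard signs
    word = re.sub("й$", "", word)                  # drop final "й"
    word = word.replace("ы", "и")                  # assumed extra rule (not on the slide)
    return word

def normalize_bg(word: str) -> str:
    return re.sub("(ът|ят)$", "", word)            # strip the definite article (slide examples)

# weights for letter replacements (default cost 1.0); partial, illustrative table
REPLACE_WEIGHTS = {("е", "о"): 0.5, ("ъ", "е"): 0.5, ("с", "з"): 0.7}

def weight(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return REPLACE_WEIGHTS.get((a, b), REPLACE_WEIGHTS.get((b, a), 1.0))

def mmed(bg: str, ru: str) -> float:
    s1, s2 = normalize_bg(bg), normalize_ru(ru)
    m, n = len(s1), len(s2)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1.0,
                          d[i][j - 1] + 1.0,
                          d[i - 1][j - 1] + weight(s1[i - 1], s2[j - 1]))
    return d[m][n]

def mmedr(bg: str, ru: str) -> float:
    s1, s2 = normalize_bg(bg), normalize_ru(ru)
    return 1.0 - mmed(bg, ru) / max(len(s1), len(s2))

print(mmed("първият", "первый"))   # 0.5, matching the slide's example
```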

Semantic Similarity
What is local context?
- A few words before and after the target word
- The words in the local context of a given word are semantically related to it
- Stop words (prepositions, pronouns, conjunctions, etc.) must be excluded, since they appear in all contexts
- A sufficiently big corpus is needed
Example: Same day delivery of fresh flowers, roses, and unique gift baskets from our online boutique. Flower delivery online by local florists for birthday flowers.

Semantic Similarity
The Web as a corpus:
- The Web can be used as a corpus to extract the local context of a given word
- The Web is the largest possible corpus and contains big corpora in any language
- Searching for a word in Google returns up to 1,000 text excerpts; the target word comes with its local context: a few words before and after it
- The target language can be specified

Semantic Similarity
The Web as a corpus. Example: Google query for "flower"
- Flowers, Plants, Gift Baskets - 1-800-FLOWERS.COM - Your Florist ... Flowers, balloons, plants, gift baskets, gourmet food, and teddy bears presented by 1-800-FLOWERS.COM, Your Florist of Choice for over 30 years.
- Margarita Flowers - Delivers in Bulgaria for you! - gifts, flowers, roses ... Wide selection of BOUQUETS, FLORAL ARRANGEMENTS, CHRISTMAS DECORATIONS, PLANTS, CAKES and GIFTS appropriate for various occasions. CREDIT cards acceptable.
- Flowers, plants, roses, & gifts. Flowers delivery with fewer ... Flowers, roses, plants and gift delivery. Order flowers from ProFlowers once, and you will never use flowers delivery from florists again.
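A sketch of turning such excerpts into a local-context frequency profile. The snippets are assumed to be already fetched (no search API is shown), and the window size and tiny stop-word list are illustrative; lemmatization is left as a comment:

```python
from collections import Counter
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "or", "for", "to", "in", "by", "our", "from"}

def context_counts(target: str, snippets: list[str], window: int = 3) -> Counter:
    counts = Counter()
    for snippet in snippets:
        tokens = re.findall(r"[^\W\d_]+", snippet.lower())
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            # a few words before and after the target word
            for ctx in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
                if ctx not in STOP_WORDS:
                    counts[ctx] += 1   # lemmatization would be applied here
    return counts

snippets = ["Same day delivery of fresh flowers, roses, and unique gift baskets "
            "from our online boutique. Flower delivery online by local florists."]
print(context_counts("flowers", snippets).most_common(5))
```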

Semantic Similarity
Measuring semantic similarity:
- For two given words, their local contexts are extracted from the Web as a set of words and their frequencies
- Lemmatization is applied
- Semantic similarity is measured as the similarity between these local contexts
- The local contexts are represented as frequency vectors over a given set of words
- The cosine between the frequency vectors in Euclidean space is calculated
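A sketch of the context comparison, with tiny hand-made frequency vectors standing in for the real Web-derived contexts:

```python
import math
from collections import Counter

def cosine(c1: Counter, c2: Counter) -> float:
    vocab = set(c1) | set(c2)                      # shared coordinate space
    dot = sum(c1[w] * c2[w] for w in vocab)
    n1 = math.sqrt(sum(c1[w] ** 2 for w in vocab))
    n2 = math.sqrt(sum(c2[w] ** 2 for w in vocab))
    return dot / (n1 * n2) if n1 and n2 else 0.0

flower   = Counter({"fresh": 217, "order": 204, "rose": 183, "delivery": 165, "gift": 124})
computer = Counter({"internet": 291, "pc": 286, "technology": 252, "order": 185, "new": 174})
print(cosine(flower, computer))   # low value: only "order" is shared
```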

Semantic Similarity
Example of context word frequencies:

word: flower           word: computer
word       count       word        count
fresh      217         Internet    291
order      204         PC          286
rose       183         technology  252
delivery   165         order       185
gift       124         new         174
welcome    98          Web         159
red        87          site        146
...                    ...

Semantic Similarity
Example of frequency vectors: v1 (flower) and v2 (computer) are indexed by the same fixed list of 5,000 words (1: alias, 2: alligator, 3: amateur, 4: apple, ..., 4999: zap, 5000: zoo), each position holding that word's frequency in the corresponding local context.
Similarity = cosine(v1, v2)

Cross-Lingual Semantic Similarity
We are given two words in different languages L1 and L2
We have a bilingual glossary G of translation pairs {p ∈ L1, q ∈ L2}
Measuring cross-lingual similarity:
- We extract the local contexts of the target words from the Web: C1 ∈ L1 and C2 ∈ L2
- We translate the context C1 into L2 using the glossary G, obtaining C1*
- We measure the similarity between C1* and C2
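A sketch of the cross-lingual step, reusing cosine() from the previous sketch; the one-translation-per-word glossary format is a simplifying assumption:

```python
from collections import Counter

def translate_context(c1: Counter, glossary: dict[str, str]) -> Counter:
    """Map the L1 context vector into L2 word-by-word through the glossary G."""
    c1_star = Counter()
    for word, freq in c1.items():
        if word in glossary:                # words missing from G are dropped
            c1_star[glossary[word]] += freq
    return c1_star

def cross_lingual_similarity(c1: Counter, c2: Counter,
                             glossary: dict[str, str]) -> float:
    # cosine() is the function from the previous sketch
    return cosine(translate_context(c1, glossary), c2)
```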

Competitive Linking
What is competitive linking?
- A one-to-one, bi-directional word alignment algorithm
- Greedy "best first" approach
- Links the most probable pair first, removes it, and repeats the same for the rest

Applying Competitive Linking
- Make all words lowercase
- Remove punctuation
- Remove the stop words (prepositions, pronouns, conjunctions, etc.): we do not align them
- Align the most similar pair of words, using the orthographic similarity combined with the semantic similarity
- Remove the aligned words
- Align the rest of the sentence in the same way
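A sketch of competitive linking over one sentence pair; combined_sim stands in for the combined orthographic-plus-semantic score and is passed in as a function:

```python
def competitive_linking(src: list[str], tgt: list[str],
                        combined_sim) -> list[tuple[str, str]]:
    # score every candidate pair, then greedily take the best remaining pair
    candidates = [(combined_sim(s, t), i, j)
                  for i, s in enumerate(src) for j, t in enumerate(tgt)]
    candidates.sort(reverse=True)             # most similar pair first
    aligned, used_src, used_tgt = [], set(), set()
    for score, i, j in candidates:
        if i in used_src or j in used_tgt:
            continue                          # one-to-one: each word linked at most once
        aligned.append((src[i], tgt[j]))
        used_src.add(i)
        used_tgt.add(j)
    return aligned
```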

Our Method – Example
Bulgarian sentence: Процесът на създаването на такива рефлекси е по-сложен, но същността им е еднаква.
Russian sentence: Процесс создания таких рефлексов сложнее, но существо то же.

Our Method – Example
Remove the stop words:
- Bulgarian: на, на, такива, е, но, им, е
- Russian: таких, но, то
Align the word pairs, most similar first:
- Align рефлекси and рефлексов (semantic similarity = 0.989)
- Align по-сложен and сложнее (orthographic similarity = 0.750)
- Align процесът and процесс (orthographic similarity = 0.714)
- Align създаването and создания (orthographic similarity = 0.544)
- Align същността and существо (similarity = 0.536)
- Not aligned: еднаква

Our Method – Example
[Alignment diagram linking the Bulgarian words (процесът, на, създаването, такива, рефлекси, е, по-сложен, но, същността, им, еднаква) to the Russian words (процесс, создания, таких, рефлексов, сложнее, но, существо, то, же).]

Evaluation
We evaluated the following algorithms:
- BASELINE: the traditional alignment algorithm (IBM model 4)
- LCSR, MEDR, MMEDR: orthographic similarity algorithms
- WEB-ONLY: semantic similarity algorithm
- WEB-AVG: average of WEB-ONLY and MMEDR
- WEB-MAX: maximum of WEB-ONLY and MMEDR
- WEB-CUT: 1 if MMEDR(s1, s2) >= α (0 < α < 1), or WEB-ONLY(s1, s2) otherwise
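How the combined scores can be expressed, assuming web_only(s1, s2) wraps the cross-lingual semantic measure sketched above and mmedr(s1, s2) is the orthographic one, both returning values in [0, 1]; the default α of 0.62 is the cut-off used in the manual evaluation below:

```python
def web_avg(s1: str, s2: str) -> float:
    return (web_only(s1, s2) + mmedr(s1, s2)) / 2.0

def web_max(s1: str, s2: str) -> float:
    return max(web_only(s1, s2), mmedr(s1, s2))

def web_cut(s1: str, s2: str, alpha: float = 0.62) -> float:
    # trust a strong orthographic match outright, otherwise fall back to the Web
    return 1.0 if mmedr(s1, s2) >= alpha else web_only(s1, s2)
```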

Testing Data and Experiments
Testing data set:
- A corpus of 5,827 parallel sentences
- Training set: 4,827 sentences
- Tuning set: 500 sentences
- Testing set: 500 sentences
Experiments:
- Manual evaluation of WEB-CUT
- AER for competitive linking
- Translation quality: BLEU / NIST

Manual Evaluation of WEB-CUT
- Aligned the texts of the testing data set using competitive linking and WEB-CUT with α = 0.62
- Obtained 14,246 distinct word pairs
- Manually evaluated the aligned pairs as: correct, rough (considered incorrect), or wrong (considered incorrect)
- Calculated precision and recall for the case MMEDR < 0.62

Manual Evaluation of WEB-CUT
[Figure: precision-recall curve]

Evaluation of Alignment Error Rate
Gold standard for alignment:
- Created manually by a linguist for the first 100 sentences
- Stop words and punctuation were removed
Evaluated the alignment error rate (AER) of competitive linking for all the algorithms: LCSR, MEDR, MMEDR, WEB-ONLY, WEB-AVG, WEB-MAX and WEB-CUT
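The slides do not spell out the AER formula; assuming the standard definition over sure gold links S, possible gold links P, and a proposed alignment A:

```latex
\mathrm{AER}(A; S, P) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}
```

With only sure links annotated (P = S), this reduces to 1 - 2|A ∩ S| / (|A| + |S|).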

Evaluation of Alignment Error Rate
[Figure: AER for competitive linking, per algorithm]

Evaluation of Translation Quality
- Built a Russian → Bulgarian statistical machine translation (SMT) system
- Extracted from the training set the distinct word pairs aligned with competitive linking
- Added them twice as additional "sentence" pairs to the training corpus
- Trained a log-linear model for SMT with standard feature functions
- Used minimum error rate training on the tuning set
- Evaluated the BLEU and NIST scores on the testing set
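A sketch of the corpus-augmentation step; the "added twice" detail follows the slide, while the file layout (one sentence per line in parallel source/target files) and the paths are assumptions:

```python
def augment_corpus(word_pairs: set[tuple[str, str]],
                   src_path: str = "train.ru", tgt_path: str = "train.bg") -> None:
    """Append each aligned word pair twice as an extra one-word 'sentence' pair."""
    with open(src_path, "a", encoding="utf-8") as src, \
         open(tgt_path, "a", encoding="utf-8") as tgt:
        for ru_word, bg_word in sorted(word_pairs):
            for _ in range(2):                 # added twice, as on the slide
                src.write(ru_word + "\n")
                tgt.write(bg_word + "\n")
```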

Evaluation of Translation Quality
[Figure: translation quality, BLEU scores per algorithm]

Evaluation of Translation Quality
[Figure: translation quality, NIST scores per algorithm]

Resources
We used the following resources:
- Bulgarian-Russian parallel corpus: 5,827 sentences
- Bilingual Bulgarian / Russian glossary: 3,794 translation pairs
- A list of 599 Bulgarian and 508 Russian stop words
- Bulgarian lemma dictionary: 1,000,000 wordforms and 70,000 lemmata
- Russian lemma dictionary: 1,500,000 wordforms and 100,000 lemmata

Conclusion and Future Work
Conclusions:
- Semantic similarity extracted from the Web can improve statistical machine translation
- For similar languages like Bulgarian and Russian, orthographic similarity is useful
Future work:
- Improve MMED with automatically learned rules
- Improve the semantic similarity algorithm: filter out parasite words like "site", "click", etc.
- Replace competitive linking with maximum-weight bipartite matching

Questions? Improved Word Alignments Using the Web as a Corpus RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria