Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,

Slides:



Advertisements
Similar presentations
Problem Semi supervised sarcasm identification using SASI
Advertisements

Sentiment Lexicon Creation from Lexical Resources BIS 2011 Bas Heerschop Erasmus School of Economics Erasmus University Rotterdam
Word and Phrase Alignment Presenters: Marta Tatu Mithun Balakrishna.
Creating a Bilingual Ontology: A Corpus-Based Approach for Aligning WordNet and HowNet Marine Carpuat Grace Ngai Pascale Fung Kenneth W.Church.
Reference Collections: Task Characteristics. TREC Collection Text REtrieval Conference (TREC) –sponsored by NIST and DARPA (1992-?) Comparing approaches.
Symmetric Probabilistic Alignment Jae Dong Kim Committee: Jaime G. Carbonell Ralf D. Brown Peter J. Jansen.
1 The Web as a Parallel Corpus  Parallel corpora are useful  Training data for statistical MT  Lexical correspondences for cross-lingual IR  Early.
LEARNING WORD TRANSLATIONS Does syntactic context fare better than positional context? NCLT/CNGL Internal Workshop Ankit Kumar Srivastava 24 July 2008.
1 Statistical NLP: Lecture 13 Statistical Alignment and Machine Translation.
A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora Benjamin Arai Computer Science and Engineering Department.
Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.
Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.
Comparable Corpora Kashyap Popat( ) Rahul Sharnagat(11305R013)
C OLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT - Presented by Avinash S Bharadwaj ( )
Carmen Banea, Rada Mihalcea University of North Texas A Bootstrapping Method for Building Subjectivity Lexicons for Languages.
Combining Lexical Semantic Resources with Question & Answer Archives for Translation-Based Answer Finding Delphine Bernhard and Iryna Gurevvch Ubiquitous.
An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.
1 Cross-Lingual Query Suggestion Using Query Logs of Different Languages SIGIR 07.
Concept Unification of Terms in Different Languages for IR Qing Li, Sung-Hyon Myaeng (1), Yun Jin (2),Bo-yeong Kang (3) (1) Information & Communications.
1 Formal Models for Expert Finding on DBLP Bibliography Data Presented by: Hongbo Deng Co-worked with: Irwin King and Michael R. Lyu Department of Computer.
Learning Phonetic Similarity for Matching Named Entity Translation and Mining New Translations Wai Lam, Ruizhang Huang, Pik-Shan Cheung ACM SIGIR 2004.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
PAUL ALEXANDRU CHIRITA STEFANIA COSTACHE SIEGFRIED HANDSCHUH WOLFGANG NEJDL 1* L3S RESEARCH CENTER 2* NATIONAL UNIVERSITY OF IRELAND PROCEEDINGS OF THE.
1 A Unified Relevance Model for Opinion Retrieval (CIKM 09’) Xuanjing Huang, W. Bruce Croft Date: 2010/02/08 Speaker: Yu-Wen, Hsu.
Named Entity Recognition based on Bilingual Co-training Li Yegang School of Computer, BIT.
Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng Computer Science Department, Stanford University, Stanford, CA 94305, USA ImprovingWord.
Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.
1 Co-Training for Cross-Lingual Sentiment Classification Xiaojun Wan ( 萬小軍 ) Associate Professor, Peking University ACL 2009.
Translating Unknown Queries with Web Corpora for Cross- Language Information Retrieval Pu-Jen Cheng, Jei-Wen Teng, Ruei- Cheng Chen, Jenq-Haur Wang, Wen-
Multilingual Relevant Sentence Detection Using Reference Corpus Ming-Hung Hsu, Ming-Feng Tsai, Hsin-Hsi Chen Department of CSIE National Taiwan University.
An Effective Word Sense Disambiguation Model Using Automatic Sense Tagging Based on Dictionary Information Yong-Gu Lee
A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources Author: Carmen Banea, Rada Mihalcea, Janyce Wiebe Source:
Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering Mark A. Greenwood and Robert Gaizauskas Natural Language.
GUIDE : PROF. PUSHPAK BHATTACHARYYA Bilingual Terminology Mining BY: MUNISH MINIA (07D05016) PRIYANK SHARMA (07D05017)
Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling Ferhan Ture and Jimmy Lin University of Maryland,
A Method of Rating the Credibility of News Documents on the Web Ryosuke Nagura, Yohei Seki Toyohashi University of Technology Aichi, , Japan
Extracting bilingual terminologies from comparable corpora By: Ahmet Aker, Monica Paramita, Robert Gaizauskasl CS671: Natural Language Processing Prof.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Learning Phonetic Similarity for Matching Named Entity.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.
Iterative Translation Disambiguation for Cross Language Information Retrieval Christof Monz and Bonnie J. Dorr Institute for Advanced Computer Studies.
Processing of large document collections Part 5 (Text summarization) Helena Ahonen-Myka Spring 2005.
Alignment of Bilingual Named Entities in Parallel Corpora Using Statistical Model Chun-Jen Lee Jason S. Chang Thomas C. Chuang AMTA 2004.
An Iterative Approach to Extract Dictionaries from Wikipedia for Under-resourced Languages G. Rohit Bharadwaj Niket Tandon Vasudeva Varma Search and Information.
1 A Web Search Engine-Based Approach to Measure Semantic Similarity between Words Presenter: Guan-Yu Chen IEEE Trans. on Knowledge & Data Engineering,
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
From Text to Image: Generating Visual Query for Image Retrieval Wen-Cheng Lin, Yih-Chen Chang and Hsin-Hsi Chen Department of Computer Science and Information.
Learning Phonetic Similarity for Matching Named Entity Translations and Mining New Translations Wai Lam Ruizhang Huang Pik-Shan Cheung Department of Systems.
Mutual bilingual terminology extraction Le An Ha*, Gabriela Fernandez**, Ruslan Mitkov*, Gloria Corpas*** * University of Wolverhampton ** Universidad.
August 17, 2005Question Answering Passage Retrieval Using Dependency Parsing 1/28 Question Answering Passage Retrieval Using Dependency Parsing Hang Cui.
Creating Subjective and Objective Sentence Classifier from Unannotated Texts Janyce Wiebe and Ellen Riloff Department of Computer Science University of.
Number Sense Disambiguation Stuart Moore Supervised by: Anna Korhonen (Computer Lab)‏ Sabine Buchholz (Toshiba CRL)‏
UWMS Data Mining Workshop Content Analysis: Automated Summarizing Prof. Marti Hearst SIMS 202, Lecture 16.
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
1 Minimum Error Rate Training in Statistical Machine Translation Franz Josef Och Information Sciences Institute University of Southern California ACL 2003.
Answer Mining by Combining Extraction Techniques with Abductive Reasoning Sanda Harabagiu, Dan Moldovan, Christine Clark, Mitchell Bowden, Jown Williams.
A New Approach for English- Chinese Named Entity Alignment Donghui Feng Yayuan Lv Ming Zhou USC MSR Asia EMNLP-04.
Finding document topics for improving topic segmentation Source: ACL2007 Authors: Olivier Ferret (18 route du Panorama, BP6) Reporter:Yong-Xiang Chen.
Discovering Relations among Named Entities from Large Corpora Takaaki Hasegawa *, Satoshi Sekine 1, Ralph Grishman 1 ACL 2004 * Cyberspace Laboratories.
From Words to Senses: A Case Study of Subjectivity Recognition Author: Fangzhong Su & Katja Markert (University of Leeds, UK) Source: COLING 2008 Reporter:
Extracting and Ranking Product Features in Opinion Documents Lei Zhang #, Bing Liu #, Suk Hwan Lim *, Eamonn O’Brien-Strain * # University of Illinois.
Divided Pretreatment to Targets and Intentions for Query Recommendation Reporter: Yangyang Kang /23.
Event-Based Extractive Summarization E. Filatova and V. Hatzivassiloglou Department of Computer Science Columbia University (ACL 2004)
Multilingual Information Retrieval using GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of Kaohsiung.
Single Document Key phrase Extraction Using Neighborhood Knowledge.
A Multilingual Hierarchy Mapping Method Based on GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of.
Question Answering Passage Retrieval Using Dependency Relations (SIGIR 2005) (National University of Singapore) Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan,
NLP Midterm Solution #1 bilingual corpora –parallel corpus (document-aligned, sentence-aligned, word-aligned) (4) –comparable corpus (4) Source.
Statistical NLP: Lecture 13
Presentation transcript:

Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center, Department of Electrical & Electronic Engineering, HKUST Coling 2004

Abstract A method for mining parallel sentences from quasi-comparable bilingual text. We use multi-level bootstrapping to improve the alignments between documents, sentences and bilingual word pairs iteratively. Our method is the first one that does not rely on any supervised training data, such as a sentence-aligned corpus, or temporal information, such as the publishing date of a news article.

1 Introduction Existing work extract parallel sentences from parallel, noisy parallel or comparable corpora based on the assumption that parallel sentences should be similar in sentence length, sentence order and bilexical context. In our task, many of assumptions in previous work are no longer applicable. Alternatively, we propose an effective, multi-level bootstrapping approach to accomplish this task (Figure 1).

2 Bilingual Sentence Alignment A parallel corpus is a sentence-aligned corpus containing bilingual translations of the same document (ex: The Hong Kong Laws Corpus). A noisy parallel and comparable corpus contains non-aligned sentences that are nevertheless mostly bilingual translations of the same document (ex: The Hong Kong News Corpus).

Another type of comparable corpus is one that contains non-sentence-aligned, non- translated bilingual documents that are topic-aligned. A quasi-comparable corpus is one that contains non-aligned, and non-translated bilingual documents that could either be on the same topic (in-topic) or not (off- topic) (ex: TDT3). 2 Bilingual Sentence Alignment

TDT3  News stories from radio broadcasting or TV news report from in English and Chinese.  7,500 Chinese and 12,400 English documents, covering more than 60 different topics.  1,200 Chinese and 4,500 English documents are manually marked as being in-topic. 2 Bilingual Sentence Alignment

2.1 Comparing bilingual corpora We argue the usability of bilingual corpus depends how well the sentences are aligned. Lexical alignment score is defined as: f(W c,W e ) is the co-occurrence frequency of bilexicon pair (W c, W e ) in the aligned sentence pairs. f(W c ) and f(W e ) are the occurrence frequencies of Chinese word W c and English word W e, in the bilingual corpus.

Table 1 shows the lexical alignment scores of a parallel corpus (Hong Kong Law), a comparable noisy parallel corpus (Hong Kong News), and a quasi-comparable corpus (TDT3). 2.1 Comparing bilingual corpora

2.2 Comparing alignment assumptions For parallel corpus :  1. There are no missing translations in the target document;  2. Sentence lengths: a bilingual sentence pair are similarly long in the two languages;  3. Sentence position: Sentences are assumed to correspond to those roughly at the same position in the other language;  4. Bi-lexical context: A pair of bilingual sentences which contain more words that are translations of each other tend to be translations themselves.

2.2 Comparing alignment assumptions For noisy parallel corpus :  5. Occurrence frequencies of bilingual word pairs are similar;  6. The positions of bilingual word pairs are similar;  7. Words have one sense per corpus;  8. Following 7, words have a single translation per corpus;  9. Following 4, the sentence contexts in two languages of a bilingual word pair are similar.

For comparable corpora, previous sentence or word pair extraction works are based soly on bilexical context assumption. Similarly, for quasi-comparable corpus, we can only rely on bilexical context assumption and one additional assumption:  10. Seed parallel sentences: Documents and passages that are found to contain at least one pair of parallel sentences are likely to contain more parallel sentences. 2.2 Comparing alignment assumptions

3 Our approach: Multi-level Bootstrapping We use cosine similarity and we dispense with using parallel corpora to train an alignment classifier. Algorithm outline  1. Extract comparable documents  2. Extract parallel sentences  3. Update bilingual lexicon  4. Update comparable documents

3.1 Extract comparable documents For all documents in the comparable corpus D:  a. Gloss Chinese documents using the bilingual lexicon (Bilex); Language Data Consortium Ch-Eng dictionary 2.0  b. For every pair of glossed Chinese and English documents, compute document similarity → S(i,j) ; Term weighting: inverse document frequency  c. Obtain all matched bilingual document pairs whose S(i,j) > threshold1 → C

3.2 Extract parallel sentences For each document pair in C:  a. For every pair of glossed Chinese sentence and English sentence, compute sentence similarity → S2(i,j);  b. Obtain all matched bilingual sentence pairs whose S2(i,j) > threshold2 → C2

3.3 Update bilingual lexicon For each bilingual word pair in C2:  a. Compute correlation scores of all bilingual word pairs → S3(i,j);  b. Obtain all bilingual word pairs previously unseen in Bilex and whose S3(i,j) > threshold3 → C3 and update Bilex;  c. Compute alignment score → S4; if (S4 > threshold4 ) return C3 otherwise continue;

Learning unknown words (1) 1. Extract new Chinese name entities (Zhai et al 2004); 2. For each new Chinese name entity:  Extract all sentences that it appears in from the original Chinese corpus, and build a context vector;  For all English words, collect all sentences it appears in from the original corpus, and build the context vectors;

Learning unknown words (2)  Calculate the similarity between the Chinese word and each of the English word vectors where A is the aligned bilexicon pair between the two word vectors;  Rank the English candidate according to the similarity score.

Learning unknown words (3) Below are some example of unknown name entities that have been translated (or transliterated) correctly:

 a. Find all pairs of glossed Chinese and English documents which contain parallel sentences (anchor sentences) from C2 → C4;  b. Expand C4 by finding documents similar to each of the document in C4;  c. C := C4;  d. Goto step 2; 3.4 Update comparable documents

4 Evaluation (1) Baseline method  The baseline method shares the same preprocessing, document matching and sentence matching without iteration.  The precision of the parallel sentences extracted is 43% for the top 2,500 pairs, ranked by sentence similarity scores.

4 Evaluation (2) Multi-level bootstrapping  The precision of parallel sentences extraction is 67% for the top 2,500 paris using our method, which is 24% higher than the baseline.  The precision increases steadily over each iteration, until convergence.

4 Evaluation (3) Using the bilingual lexical score (see Section 2.1) for evaluation:

5 Conclusion The main contributions of our work lie in steps 3 and 4 and in the iterative process. By using the correct alignment assumptions, we have demonstrated that a bootstrapping iterative process is also possible for finding parallel sentences and new word translations from comparable corpus.