1 The Web as a Parallel Corpus  Parallel corpora are useful  Training data for statistical MT  Lexical correspondences for cross-lingual IR  Early.

Slides:

Advertisements

Similar presentations

Statistical Machine Translation Part II: Word Alignments and EM Alexander Fraser ICL, U. Heidelberg CIS, LMU München Statistical Machine Translation.

Advertisements

Statistical Machine Translation Part II – Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart

Automatic Identification of Cognates, False Friends, and Partial Cognates University of Ottawa, Canada University of Ottawa, Canada.

Lecture 11: Statistical/Probabilistic Models for CLIR & Word Alignment Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering,

Enhancing Translation Systems with Bilingual Concordancing Functionalities V. ANTONOPOULOSC. MALAVAZOS I. TRIANTAFYLLOUS. PIPERIDIS Presentation: V. Antonopoulos.

Chinese Word Segmentation Method for Domain-Special Machine Translation Su Chen; Zhang Yujie; Guo Zhen; Xu Jin’an Beijing Jiaotong University.

Hybridity in MT: Experiments on the Europarl Corpus Declan Groves 24 th May, NCLT Seminar Series 2006.

Identifying Translations Philip Resnik, Noah Smith University of Maryland.

1 Duluth Word Alignment System Bridget Thomson McInnes Ted Pedersen University of Minnesota Duluth Computer Science Department 31 May 2003.

Word and Phrase Alignment Presenters: Marta Tatu Mithun Balakrishna.

Creating a Bilingual Ontology: A Corpus-Based Approach for Aligning WordNet and HowNet Marine Carpuat Grace Ngai Pascale Fung Kenneth W.Church.

Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.

Machine Translation A Presentation by: Julie Conlonova, Rob Chase, and Eric Pomerleau.

C SC 620 Advanced Topics in Natural Language Processing Lecture 24 4/22.

Symmetric Probabilistic Alignment Jae Dong Kim Committee: Jaime G. Carbonell Ralf D. Brown Peter J. Jansen.

MT Summit VIII, Language Technologies Institute School of Computer Science Carnegie Mellon University Pre-processing of Bilingual Corpora for Mandarin-English.

Comments on Guillaume Pitel: “Using bilingual LSA for FrameNet annotation of French text from generic resources” Gerd Fliedner Computational Linguistics.

1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization.

1 Statistical NLP: Lecture 13 Statistical Alignment and Machine Translation.

A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora Benjamin Arai Computer Science and Engineering Department.

Michael Mohler, Rada Mihalcea Department of Computer Science University of North Texas BABYLON Parallel Text Builder:

Search is not only about the Web An Overview on Printed Documents Search and Patent Search Walid Magdy Centre for Next Generation Localisation School of.

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.

A New Approach for Cross- Language Plagiarism Analysis Rafael Corezola Pereira, Viviane P. Moreira, and Renata Galante Universidade Federal do Rio Grande.

Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.

Machine translation Context-based approach Lucia Otoyo.

Statistical Alignment and Machine Translation

Comparable Corpora Kashyap Popat( ) Rahul Sharnagat(11305R013)

The use of machine translation tools for cross-lingual text-mining Blaz Fortuna Jozef Stefan Institute, Ljubljana John Shawe-Taylor Southampton University.

Query Rewriting Using Monolingual Statistical Machine Translation Stefan Riezler Yi Liu Google 2010 Association for Computational Linguistics.

Classifying Tags Using Open Content Resources Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol WSDM ‘09.

The Web as a Parallel Corpus A paper by Philip Resnik and Noah A. Smith (2003, Computational Linguistics) My interpretation of their research.

An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.

1 Cross-Lingual Query Suggestion Using Query Logs of Different Languages SIGIR 07.

AnswerBus Question Answering System Zhiping Zheng School of Information, University of Michigan HLT 2002.

Learning Phonetic Similarity for Matching Named Entity Translation and Mining New Translations Wai Lam, Ruizhang Huang, Pik-Shan Cheung ACM SIGIR 2004.

CROSSMARC Web Pages Collection: Crawling and Spidering Components Vangelis Karkaletsis Institute of Informatics & Telecommunications NCSR “Demokritos”

Information Retrieval and Web Search Cross Language Information Retrieval Instructor: Rada Mihalcea Class web page:

Theory and Application of Database Systems A Hybrid Approach for Extending Ontology from Text He Wei.

Péter Schönhofen – Ad Hoc Hungarian → English – CLEF Workshop 20 Sep 2007 Performing Cross-Language Retrieval with Wikipedia Participation report for Ad.

Multi-Layer Filtering algorithm Bilingual Chunk Alignment In Statistical Machine Translation An introduction of Multi-Layer Filtering (MLF) algorithm Dawei.

A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources Author: Carmen Banea, Rada Mihalcea, Janyce Wiebe Source:

GUIDE : PROF. PUSHPAK BHATTACHARYYA Bilingual Terminology Mining BY: MUNISH MINIA (07D05016) PRIYANK SHARMA (07D05017)

Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling Ferhan Ture and Jimmy Lin University of Maryland,

Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.

1 01/10/09 1 INFILE CEA LIST ELDA Univ. Lille 3 - Geriico Overview of the INFILE track at CLEF 2009 multilingual INformation FILtering Evaluation.

Extracting bilingual terminologies from comparable corpora By: Ahmet Aker, Monica Paramita, Robert Gaizauskasl CS671: Natural Language Processing Prof.

A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.

CLEF2003 Forum/ August 2003 / Trondheim / page 1 Report on CLEF-2003 ML4 experiments Extracting multilingual resources from corpora N. Cancedda, H. Dejean,

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.

Iterative Translation Disambiguation for Cross Language Information Retrieval Christof Monz and Bonnie J. Dorr Institute for Advanced Computer Studies.

Alignment of Bilingual Named Entities in Parallel Corpora Using Statistical Model Chun-Jen Lee Jason S. Chang Thomas C. Chuang AMTA 2004.

Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.

From Text to Image: Generating Visual Query for Image Retrieval Wen-Cheng Lin, Yih-Chen Chang and Hsin-Hsi Chen Department of Computer Science and Information.

Learning Phonetic Similarity for Matching Named Entity Translations and Mining New Translations Wai Lam Ruizhang Huang Pik-Shan Cheung Department of Systems.

Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,

1 Minimum Error Rate Training in Statistical Machine Translation Franz Josef Och Information Sciences Institute University of Southern California ACL 2003.

A New Approach for English- Chinese Named Entity Alignment Donghui Feng Yayuan Lv Ming Zhou USC MSR Asia EMNLP-04.

Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.

Statistical Machine Translation Part II: Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart

Multilingual Information Retrieval using GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of Kaohsiung.

Large Vocabulary Data Driven MT: New Developments in the CMU SMT System Stephan Vogel, Alex Waibel Work done in collaboration with: Ying Zhang, Alicia.

A Multilingual Hierarchy Mapping Method Based on GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of.

1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.

The Web as a Parallel Corpus Philip Resnik, Noah A. Smith, Computational Linguistics, 29, 3, pp. 349 – 380, MIT Press,2004. University of Maryland, Johns.

Review: Review: Translating without in-domain corpus: Machine translation post-editing with online learning techniques Antonio L. Lagarda, Daniel Ortiz-Martínez,

General Architecture of Retrieval Systems 1Adrienn Skrop.

Mining The Web for Bilingual Text

Improved Word Alignments Using the Web as a Corpus

Statistical Machine Translation Papers from COLING 2004

Presentation transcript:

1 The Web as a Parallel Corpus  Parallel corpora are useful  Training data for statistical MT  Lexical correspondences for cross-lingual IR  Early work: Hansards  Canadian parliamentary proceedings  French/English only  Still most resources are in formal newspaper style only

2 Harvesting parallel text from web  Strand: use similar structure to find likely translations  Using similar content to find translations  Applying methods to the Internet Archive, dramatically increasing quantity

3 STRAND  Structural Translation Recognition Acquiring Natural Data  Architecture  Location of possible translations  Generation of candidate translations  Filtering of candidates based on structure

4  Search for language in anchors (anchor: “English” OR anchor: “French”)

5 Structural Filtering  Linearize HTML and discard content  Run through transducer to produce:  [START element-label]  [END element-label]  [CHUNK length]

6  Align sequences using dynamic programming

7 Scalar values  Dp: difference in # structural items that have no match  N: number of aligned non-markup chunks of different lengths  R: correlation of chunk lengths  P: significance level of the correlations

8 Evaluation  Human judgments on 326 English- French paired pages  Using manually set thresholds on dp and n  100% precision  68.6% recall  Similar results on English/Chinese; English/Spanish  Typically throws out 1/3 data  Using machine learning: recall: 84% precision: 96%

9 Drawbacks of structural matching  Not all translations have similar structures  Not all texts use HTML markup

10 Content-based matching  Seed: bilingual lexicon  Link: pair x is in L1 and y in L2  Probability that x a translation of y given by bilingual lexicon  Want most probable link sequence that could account for a pair of texts  Product of the probability of links  Best set of links using Maximum Weighted Bipartite Matching

11  Cross-language similarity score: tsim  Computed on first 500 words of a document for efficiency

12 Experiment  Dictionary  English/French dictionary: 34,808 entries  Dictionary of English/French cognates: 35,513 pairs  Additional web pairs: 11,264 from Bible  Final lexicon: 132,155 pairs  Trained threshold for t-sim on 32 pairs from Strand test set  Strand (manual): Fmeasure of.81  Tsim: F-measure of.88  Combined model: F-measure.977