LEARNING WORD TRANSLATIONS Does syntactic context fare better than positional context? NCLT/CNGL Internal Workshop Ankit Kumar Srivastava 24 July 2008.

Slides:



Advertisements
Similar presentations
Statistical Machine Translation Part II: Word Alignments and EM Alexander Fraser ICL, U. Heidelberg CIS, LMU München Statistical Machine Translation.
Advertisements

Intelligent Database Systems Lab Presenter: WU, MIN-CONG Authors: Abdelghani Bellaachia and Mohammed Al-Dhelaan 2012, WIIAT NE-Rank: A Novel Graph-based.
Statistical Machine Translation Part II – Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart
Jing-Shin Chang National Chi Nan University, IJCNLP-2013, Nagoya 2013/10/15 ACLCLP – Activities ( ) & Text Corpora.
Using Percolated Dependencies in PBSMT Ankit K. Srivastava and Andy Way Dublin City University CLUKI XII: April 24, 2009.
NYU ANLP-00 1 Automatic Discovery of Scenario-Level Patterns for Information Extraction Roman Yangarber Ralph Grishman Pasi Tapanainen Silja Huttunen.
10. Lexicalized and Probabilistic Parsing -Speech and Language Processing- 발표자 : 정영임 발표일 :
GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.
Inducing Information Extraction Systems for New Languages via Cross-Language Projection Ellen Riloff University of Utah Charles Schafer, David Yarowksy.
Word and Phrase Alignment Presenters: Marta Tatu Mithun Balakrishna.
1 Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004.
Creating a Bilingual Ontology: A Corpus-Based Approach for Aligning WordNet and HowNet Marine Carpuat Grace Ngai Pascale Fung Kenneth W.Church.
Machine Translation Prof. Alexandros Potamianos Dept. of Electrical & Computer Engineering Technical University of Crete, Greece May 2003.
Bilingual Lexical Acquisition From Comparable Corpora Andrea Mulloni.
INEX 2003, Germany Searching in an XML Corpus Using Content and Structure INEX 2003, Germany Yiftah Ben-Aharon, Sara Cohen, Yael Grumbach, Yaron Kanza,
MT Summit VIII, Language Technologies Institute School of Computer Science Carnegie Mellon University Pre-processing of Bilingual Corpora for Mandarin-English.
Comments on Guillaume Pitel: “Using bilingual LSA for FrameNet annotation of French text from generic resources” Gerd Fliedner Computational Linguistics.
Does Syntactic Knowledge help English- Hindi SMT ? Avinesh. PVS. K. Taraka Rama, Karthik Gali.
1 Statistical NLP: Lecture 13 Statistical Alignment and Machine Translation.
Research methods in corpus linguistics Xiaofei Lu.
A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora Benjamin Arai Computer Science and Engineering Department.
Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.
Webpage Understanding: an Integrated Approach
Natural Language Processing Lab Northeastern University, China Feiliang Ren EBMT Based on Finite Automata State Transfer Generation Feiliang Ren.
Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.
Comparable Corpora Kashyap Popat( ) Rahul Sharnagat(11305R013)
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
COMPARABLE CORPORA AND ITS APPLICATION Presented by Srijit Dutt( ) Janardhan Singh( ) Ashutosh Nirala( ) Brijesh Bhatt( ) 1.
Distributional Part-of-Speech Tagging Hinrich Schütze CSLI, Ventura Hall Stanford, CA , USA NLP Applications.
An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.
Spring /22/071 Beyond PCFGs Chris Brew Ohio State University.
Profile The METIS Approach Future Work Evaluation METIS II Architecture METIS II, the continuation of the successful assessment project METIS I, is an.
Advanced Signal Processing 05/06 Reinisch Bernhard Statistical Machine Translation Phrase Based Model.
CROSSMARC Web Pages Collection: Crawling and Spidering Components Vangelis Karkaletsis Institute of Informatics & Telecommunications NCSR “Demokritos”
1 Statistical NLP: Lecture 9 Word Sense Disambiguation.
Information Retrieval and Web Search Cross Language Information Retrieval Instructor: Rada Mihalcea Class web page:
GoogleDictionary Paul Nepywoda Alla Rozovskaya. Goal Develop a tool for English that, given a word, will illustrate its usage.
Using a Lemmatizer to Support the Development and Validation of the Greek WordNet Harry Kornilakis 1, Maria Grigoriadou 1, Eleni Galiotou 1,2, Evangelos.
The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan.
Tracking Language Development with Learner Corpora Xiaofei Lu CALPER 2010 Summer Workshop July 12, 2010.
Katrin Erk Vector space models of word meaning. Geometric interpretation of lists of feature/value pairs In cognitive science: representation of a concept.
A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources Author: Carmen Banea, Rada Mihalcea, Janyce Wiebe Source:
GUIDE : PROF. PUSHPAK BHATTACHARYYA Bilingual Terminology Mining BY: MUNISH MINIA (07D05016) PRIYANK SHARMA (07D05017)
Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling Ferhan Ture and Jimmy Lin University of Maryland,
Seminar in Applied Corpus Linguistics: Introduction APLNG 597A Xiaofei Lu August 26, 2009.
CLEF2003 Forum/ August 2003 / Trondheim / page 1 Report on CLEF-2003 ML4 experiments Extracting multilingual resources from corpora N. Cancedda, H. Dejean,
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.
Iterative Translation Disambiguation for Cross Language Information Retrieval Christof Monz and Bonnie J. Dorr Institute for Advanced Computer Studies.
An Iterative Approach to Extract Dictionaries from Wikipedia for Under-resourced Languages G. Rohit Bharadwaj Niket Tandon Vasudeva Varma Search and Information.
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
Mutual bilingual terminology extraction Le An Ha*, Gabriela Fernandez**, Ruslan Mitkov*, Gloria Corpas*** * University of Wolverhampton ** Universidad.
Number Sense Disambiguation Stuart Moore Supervised by: Anna Korhonen (Computer Lab)‏ Sabine Buchholz (Toshiba CRL)‏
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Iterative Translation Disambiguation for Cross-Language.
Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
2003 (c) University of Pennsylvania1 Better MT Using Parallel Dependency Trees Yuan Ding University of Pennsylvania.
Finding document topics for improving topic segmentation Source: ACL2007 Authors: Olivier Ferret (18 route du Panorama, BP6) Reporter:Yong-Xiang Chen.
FILTERED RANKING FOR BOOTSTRAPPING IN EVENT EXTRACTION Shasha Liao Ralph York University.
Extracting and Ranking Product Features in Opinion Documents Lei Zhang #, Bing Liu #, Suk Hwan Lim *, Eamonn O’Brien-Strain * # University of Illinois.
Statistical Machine Translation Part II: Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart
NLP Midterm Solution #1 bilingual corpora –parallel corpus (document-aligned, sentence-aligned, word-aligned) (4) –comparable corpus (4) Source.
Approaches to Machine Translation
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
PRESENTED BY: PEAR A BHUIYAN
Computational and Statistical Methods for Corpus Analysis: Overview
Statistical NLP: Lecture 13
--Mengxue Zhang, Qingyang Li
Approaches to Machine Translation
Statistical NLP : Lecture 9 Word Sense Disambiguation
Presentation transcript:

LEARNING WORD TRANSLATIONS Does syntactic context fare better than positional context? NCLT/CNGL Internal Workshop Ankit Kumar Srivastava 24 July 2008 N ON P ARALLEL C ORPORA

July 24, 2008Lexicon Extraction ~ Ankit2 Learning a Translation Lexicon from non Parallel Corpora Motivation Methodology Implementation Experiments Conclusion Master’s Project AT University of Washington Seattle, USA JUNE 2008

July 24, 2008Lexicon Extraction ~ Ankit3 { lexicon } Word – to – word mapping between 2 languages Invaluable resource in multilingual applications like CLIR, CL resource, CALL, etc. MOTIVATION Wahl election0.85 ballot0.10 option0.02 selection0.02 choice0.01 Sheridan & Ballerini 1996 McCarley 1999 Yarowsky & Ngai 2001 Cucerzan & Yarowsky 2002 Nerbonne et al. 1997

July 24, 2008Lexicon Extraction ~ Ankit4 { corpora } Parallel, comparable, non-comparable text More monolingual text than bitext 5 dimensions of nonparallelness Most statistical clues no longer applicable MOTIVATION time period topic domain author language

July 24, 2008Lexicon Extraction ~ Ankit5 { task } Given any two pieces of text in any two languages… …Can we extract word translations? MOTIVATION

July 24, 2008Lexicon Extraction ~ Ankit6 { insight } If two words are mutual translations, then their more frequent collocates (context window) are likely to be mutual translations as well. Counting co occurrences within a window of size N is less precise than counting co occurrences within local syntactic contexts [Harris 1985]. 2 types of context windows – Positional (window size 4) and Syntactic (head, dependent) METHODOLOGY

July 24, 2008Lexicon Extraction ~ Ankit7 { context } METHODOLOGY Vinken will join the board as a nonexecutive director Nov 29. POSITIONAL: SYNTACTIC: Vinken will join the board as a nonexecutive director Nov 29.

July 24, 2008Lexicon Extraction ~ Ankit8 { algorithm } For each unknown word in the SL & TL, define the context in which that word occurs. Using an initial seed lexicon, translate as many source context words into the target language. Use a similarity metric to compute the translation of each unknown source word. It will be the target word with the most similar context. METHODOLOGY Koehn & Knight 2002 Rapp 1995, 1999 Fung & Yee 1998 Otero & Campos 2005

July 24, 2008Lexicon Extraction ~ Ankit9 { system } IMPLEMENTATION CORPUS CLEANING PCFG PARSING 1 2 CORPUS

July 24, 2008Lexicon Extraction ~ Ankit10 { pre-process } EXPERIMENTS ENGLISHGERMAN DATAWall Street Journal (WSJ)Deutsche Presse Agentur (DPA) YEARS1990,1991 and and 1996 COVERAGE446 days of news text530 days of news text Raw Text Corpora: Phrase Structures: 1 2 Stanford Parser (Lexicalized PCFG) for English and German [Klein & Manning 2003]

July 24, 2008Lexicon Extraction ~ Ankit11 { system } IMPLEMENTATION CORPUS CLEANING PCFG PARSING PS TO DS CONVERSION DATA SETS 4 CORPUS

July 24, 2008Lexicon Extraction ~ Ankit12 { pre-process } EXPERIMENTS ENGLISHGERMAN TEXT1,521,998 sentences808,146 sentences TOKENS36,251,168 words14,311,788 words TYPES276,402 words388,291 words 3 Dependency Structures: Head Percolation Table [Magerman 1995; Collins 1997] was used to extract head-dependent relations from each parse tree. 4 Data Sets:

July 24, 2008Lexicon Extraction ~ Ankit13 SEED LEXICON { system } IMPLEMENTATION CORPUS CLEANING PCFG PARSING PS TO DS CONVERSION CONTEXT GENERATOR PARSED TEXT RAW TEXT 5 DATA SETS 4 SYN VECTORS POS VECTORS CORPUS

July 24, 2008Lexicon Extraction ~ Ankit14 { vector } Seed lexicon obtained from a dictionary, identically spelled words, spelling transformation rules. Context vectors have dimension values (co occurrence of word with seed) normalized on seed frequency. EXPERIMENTS ENGLISHGERMAN DIMENSION2,376 words SEED2,350 words2,376 words UNKNOWN74,434 words106,366 words 5 Context Vectors:

July 24, 2008Lexicon Extraction ~ Ankit15 SEED LEXICON { system } IMPLEMENTATION CORPUS CLEANING PCFG PARSING PS TO DS CONVERSION CONTEXT GENERATOR PARSED TEXT RAW TEXT 5 DATA SETS 4 SYN VECTORS POS VECTORS VECTOR SIMILARITY 6 L1 L2 RANK CORPUS TRANS. LIST

July 24, 2008Lexicon Extraction ~ Ankit16 { evaluate } Vector similarity metrics used are city block [Rapp 1999] and cosine. Translations sorted in descending order of scores. Evaluation data extracted from online bilingual dictionaries (364 translations). EXPERIMENTS CITY BLOCKCOSINE POSITIONAL CONTEXT63 out of out of 364 SYNTACTIC CONTEXT301 out of out of Ranked Translations Predictor:

July 24, 2008Lexicon Extraction ~ Ankit17 { endnote } Extraction from non parallel corpora useful for compiling lexicon from new domains. Syntactic context helps in focusing the context window, more impact on longer sentences. Non parallel corpora involves more filtering, search heuristics than in parallel. Future directions include using syntactic only on one side, extending coverage through stemming. CONCLUSION

July 24, 2008Lexicon Extraction ~ Ankit18 { thanks }