Bilingual term extraction revisited: Comparing statistical and linguistic methods for a new pair of languages Špela Vintar Faculty of Arts Dept. of Translation.


Bilingual term extraction revisited: Comparing statistical and linguistic methods for a new pair of languages Špela Vintar Faculty of Arts Dept. of Translation University of Ljubljana

Overview
- Term identification in a monolingual context: some known approaches
- Slovene-English setup: corpora, tools, resources
- Multi-word terms: collocations, nested terms and term variants
- Bilingual lexicon extraction and term equivalence
- Extracting semantic information
- Evaluation & things to improve

Bilingual term extraction: Usual processing sequence
- L1-L2 parallel corpus
- term candidates L1, term candidates L2
- finding translation equivalents
- bilingual term lexicon

Corpus
- Slovene-English parallel corpus of terminological texts (TRANS), ca. 1 million tokens
- created within a student project at our Department
- aligned with DejaVu (hand-validated)
- on-line concordancing available
- here: 2 subcorpora
  - Nuclear Engineering: 25,000 tokens
  - Economic legislation: 166,000 tokens

Linguistic processing: Slovene
- tokenization
- part-of-speech tagging: TnT (Brants 2000)
  - training corpus creation
  - tagger training & error correction
- lemmatization: Amebis (thanks!)
  - proprietary lemmatization tool; non-disambiguated: je → biti, jesti, on
  - lemma disambiguation through self-made rules

Linguistic processing II: English
- using DFKI tools (thanks!)
- POS-tagging (TnT)
- lemmatization (MMorph)
- chunking (Chunkie)

What is a term: “keywordness”
Measures of keywordness:
- subcorpus vs. general language corpus: relative corpus frequency
- document vs. document collection: tf.idf
Applied to single or multi-word units.
weight(i, j) = (1 + log(tf_i,j)) * log(N / df_i)
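The tf.idf weight on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation; it assumes the usual reading of the symbols (tf_i,j is the frequency of term i in document j, N the number of documents, df_i the number of documents containing term i), and the function name `tfidf_weights` is hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Return one {term: weight} dict per document, using
    weight(i, j) = (1 + log(tf_ij)) * log(N / df_i).

    `documents` is a list of token lists (assumed already tokenized).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({
            term: (1 + math.log(freq)) * math.log(n_docs / df[term])
            for term, freq in tf.items()
        })
    return weights
```

A term that occurs in every document gets weight 0, which is why the measure favours words concentrated in few documents.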

Other indicators of termness
- Acronyms (NPP, SG, RBB...)
- Unknown words
  - not found in the reference corpus
  - unknown to the lemmatizer
- Cognates & named entities
  - JE Krško | Krško NPP
  - Konzorcij | Consortium
  - Siemens/Framatome | Siemens/Framatome

Identifying multi-word units
Collocation extraction techniques:
- Mutual Information (Church & Hanks 1990)
- Log-likelihood ratio (Dunning 1993)
- Entropy-based (Shimohata et al. 1997)
- Semantic non-compositionality (Pearce 2001)
According to Daille (1994), LL is the most appropriate measure.
For n > 3: n-gram frequency (+ stopword filtering) also works.
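As an illustration of the log-likelihood ratio named above, the following sketch computes the G2 statistic for a single bigram from a 2x2 contingency table. It is a generic textbook formulation, not necessarily the exact variant used in the talk; the function name and argument layout are assumptions.

```python
import math

def log_likelihood(c12, c1, c2, n):
    """G2 log-likelihood ratio for a bigram (w1, w2).

    c12: count of the bigram w1 w2
    c1:  count of bigrams whose first word is w1
    c2:  count of bigrams whose second word is w2
    n:   total number of bigrams in the corpus
    """
    observed = [c12, c1 - c12, c2 - c12, n - c1 - c2 + c12]
    expected = [
        c1 * c2 / n,
        c1 * (n - c2) / n,
        (n - c1) * c2 / n,
        (n - c1) * (n - c2) / n,
    ]
    # Cells with an observed count of 0 contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
```

When the two words co-occur exactly as often as chance predicts, the score is 0; the more the observed table departs from independence, the higher the score.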

N-gram term weighting
Statistically extracted n-grams are not necessarily terms → need for filtering / weighting
1. Stopword filtering
2. Weighting with tf.idf, ll-rank/core frequency
weight(t_w1,w2,w3) = tf.idf_w1 tf.idf_w2 tf.idf_w3 / n * 1/rank
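The operators between the tf.idf terms in the weighting formula did not survive in the transcript. One plausible reading, given the division by n, is the mean tf.idf of the component words scaled by the inverse of the n-gram's log-likelihood rank. The sketch below implements that reading only; the summation and the name `ngram_weight` are assumptions, not something the slide confirms.

```python
def ngram_weight(component_tfidf, ll_rank):
    """Weight of an n-gram under the ASSUMED reading
    weight = (sum of component tf.idf values) / n * (1 / rank).

    component_tfidf: tf.idf value of each word in the n-gram
    ll_rank:         1-based rank of the n-gram in the log-likelihood list
    """
    n = len(component_tfidf)
    return sum(component_tfidf) / n * (1 / ll_rank)
```

Under this reading a high-ranked n-gram made of distinctive words scores highest, while n-grams built from common words or ranked low by log-likelihood are pushed down.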

Treatment of nested terms
Local Max of bigram LL-scores:
previous steam generator replacements
- previous steam: 34.17
- steam generator: 602.05
- generator replacements: 77.88
steam generator replacement requires
- steam generator: 602.05
- generator replacement: 77.88
- replacement requires: 20.44
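The local-maximum idea can be sketched as follows: within a word sequence, keep the bigram whose LL-score is higher than that of both neighbouring bigrams. This is a simplified illustration using the scores from the slide, not the full procedure used in the talk; the function name is hypothetical.

```python
def local_maxima(tokens, bigram_scores):
    """Return the bigrams whose score exceeds both neighbouring bigram scores.

    tokens:        list of n words
    bigram_scores: list of n - 1 scores, one per adjacent word pair
    """
    peaks = []
    for i, score in enumerate(bigram_scores):
        left = bigram_scores[i - 1] if i > 0 else float("-inf")
        right = bigram_scores[i + 1] if i + 1 < len(bigram_scores) else float("-inf")
        if score > left and score > right:
            peaks.append((tokens[i], tokens[i + 1]))
    return peaks

tokens = "previous steam generator replacements".split()
print(local_maxima(tokens, [34.17, 602.05, 77.88]))  # [('steam', 'generator')]
```

In both example sequences the peak falls on "steam generator", which marks it as the nested term inside the longer candidates.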

Treatment of nested terms
- Local Max of bigram LL-scores
- C-value (Frantzi & Ananiadou 1996)
C-value(a) = (length(a) - 1) * (freq(a) - t(a) / c(a))
n-gram | C-value
compressive force | 10.3
axial compressive | 5.2
axial compressive force | 16.4
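A minimal sketch of the C-value formula, assuming the usual reading of Frantzi & Ananiadou: t(a) is the total frequency of a inside longer candidate terms, c(a) is the number of those longer candidates, and the correction term is dropped when a is never nested. The frequencies in the test are invented for illustration and are not meant to reproduce the values in the table.

```python
def c_value(candidates):
    """Compute C-value(a) = (length(a) - 1) * (freq(a) - t(a) / c(a)).

    candidates: {tuple_of_words: frequency} for all candidate terms.
    Returns {tuple_of_words: C-value}.
    """
    def contains(longer, shorter):
        # True if `shorter` occurs as a contiguous sub-sequence of `longer`.
        k = len(shorter)
        return len(longer) > k and any(
            longer[i:i + k] == shorter for i in range(len(longer) - k + 1)
        )

    scores = {}
    for a, freq in candidates.items():
        nesting = [f for b, f in candidates.items() if contains(b, a)]
        # t(a) / c(a): mean frequency of the longer terms containing a.
        penalty = sum(nesting) / len(nesting) if nesting else 0.0
        scores[a] = (len(a) - 1) * (freq - penalty)
    return scores
```

The effect is that a short candidate which mostly occurs inside a longer term is penalised, while the longer term keeps its full frequency and gains from its length.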

Extracting multi-word terms: Syntactic patterns
- Extraction of terminologically relevant part-of-speech patterns (applied as regular expressions or finite state automata) (Heid 1998, 2001; Bourigault 1996; Jacquemin 2001)
- Patterns enable extraction of single occurrences
- Patterns facilitate treatment of term variation (replacement of steam generator = steam generator replacement): NN1 of NN2 NN3 = NN2 NN3 NN1
- Patterns facilitate treatment of nesting: head of phrase may be easily established
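The pattern-based approach can be sketched with regular expressions over a string of tag codes. The example below is an illustration only: it assumes Penn-style tags (JJ, NN, NNS), a single toy pattern "(adjective|noun)+ noun", and hypothetical function names; the actual pattern grammar from the talk is not given in the slides.

```python
import re

# One character per token so that regex offsets equal token offsets.
# Tags not listed here are mapped to "x" (assumed Penn-style tagset).
TAG_CODES = {"JJ": "A", "NN": "N", "NNS": "N"}

def extract_noun_phrases(tagged):
    """Return maximal (adjective|noun)+ noun sequences from (word, tag) pairs."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in re.finditer(r"[AN]+N", codes)
    ]

def normalize_of_variant(term):
    """Rewrite 'NN1 of NN2 NN3' as 'NN2 NN3 NN1'; leave other terms unchanged."""
    m = re.fullmatch(r"(\w+) of ((?:\w+ )*\w+)", term)
    return f"{m.group(2)} {m.group(1)}" if m else term

print(normalize_of_variant("replacement of steam generator"))
# steam generator replacement
```

Because a pattern matches on tags rather than counts, it fires even for a term that occurs once, and the last noun of a match can be taken as the head of the phrase.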

Bilingual lexicon extraction
- word-alignment tools: Twente (Hiemstra 1998), Egypt/Giza (Och 2000), PLUG (Tiedemann 1999) etc.
- comparison planned; currently using Twente
  - based on the Iterative Proportional Fitting Procedure (IPFP), word-to-word translation model
  - outputs translation candidates + scores for each word in the corpus; both ways
  - using stopword-filtered corpora to improve results

Output of Twente lexicon extraction

Extraction of cognates
- string comparison on the level of types in two parallel segments: Perl module String::Approx (Hietaniemi 2002)
- high precision → cognates override bilingual lexicon
- term relevance!
- Nr. of extracted cognate pairs: NE 364, EL 776
Slovene | English
informatika | informatics
infrastrukture | infrastructure
instrumentacija | instrumentation
integracija | integrating
integrala | integral
iterativen | iterative
karakteristik | characteristics
kaskade | cascade
koeficient | coefficient
komponenta | component
koncentracijo | concentration
koncept | concept
konstanta | constant
konvergenca | convergence
koordinat | coordinates
linearne | linear
logistična | logistic
materiali | materials
matrika | matrix
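The slide uses the Perl module String::Approx for the string comparison. As a stand-in, the sketch below uses a plain Levenshtein distance in Python to compare every source type with every target type in one aligned segment pair. The similarity threshold of 0.7 and the function names are assumptions made for illustration, not settings from the talk.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def is_cognate(src, tgt, threshold=0.7):
    """True if the normalized similarity of the two words reaches the threshold."""
    src, tgt = src.lower(), tgt.lower()
    similarity = 1 - levenshtein(src, tgt) / max(len(src), len(tgt))
    return similarity >= threshold

def find_cognates(src_types, tgt_types, threshold=0.7):
    """Return all (source, target) cognate pairs within one aligned segment pair."""
    return [(s, t) for s in src_types for t in tgt_types
            if is_cognate(s, t, threshold)]
```

Restricting the comparison to words from the same aligned segment pair keeps accidental look-alikes rare, which is consistent with the high precision reported on the slide.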

Term alignment
- for each source term candidate we collect all single-word equivalents from the bilingual lexicon
- among extracted target terms we choose the one with the highest match of words
- scores are added up into an equivalence score
Example: projekt zamenjave uparjalnikov
- projekt: project 1.00
- zamenjave: [null] 0.00
- uparjalnikov: steam 0.49, generator 0.33, generators 0.18
- steam generator replacement project: 1.82
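The scoring step can be reproduced with the numbers from the slide. The sketch below is a minimal illustration under stated assumptions: the lexicon is a plain dictionary, each target word contributes its best score, and the function names are hypothetical.

```python
# Lexicon entries as shown on the slide: source word -> {target word: score}.
LEXICON = {
    "projekt": {"project": 1.00},
    "zamenjave": {},  # aligned to [null] with score 0.00
    "uparjalnikov": {"steam": 0.49, "generator": 0.33, "generators": 0.18},
}

def equivalence_score(source_term, target_term, lexicon):
    """Sum the lexicon scores of all target words that translate a source word."""
    equivalents = {}
    for word in source_term.split():
        for target, score in lexicon.get(word, {}).items():
            # If several source words suggest the same target word, keep the best score.
            equivalents[target] = max(score, equivalents.get(target, 0.0))
    return sum(equivalents.get(word, 0.0) for word in target_term.split())

def best_equivalent(source_term, target_terms, lexicon):
    """Pick the target term candidate with the highest equivalence score."""
    return max(target_terms, key=lambda t: equivalence_score(source_term, t, lexicon))
```

For "projekt zamenjave uparjalnikov" and the target candidate "steam generator replacement project", the matching words are steam (0.49), generator (0.33) and project (1.00), which add up to the 1.82 shown on the slide.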

Bilingual term extraction: Statistical model
- L1-L2 parallel corpus
- single-word terms: tf.idf, cognates, unknown words
- contiguous n-grams (2-4) in each language: log-likelihood, stopword filtering, collapsing nesting, term weighting → multi-word terms
- bilingual lexicon, cognate pairs → term alignment
- bilingual term candidates

Bilingual term extraction: Pattern-based model
- L1-L2 tagged & lemmatized parallel corpus
- single-word terms: nouns only; tf.idf of lemmas, cognates
- multi-word pattern instances: pattern grammar, stopword filtering, term weighting → multi-word terms
- bilingual lexicon, cognate pairs → term alignment
- bilingual term candidates

High precision term pairs

Evaluation & Results
- Evaluation data, Slovene: one hand-tagged document (by a group of translation students, not domain experts!): 181 terms (including nestings)
- Pattern-based term-tagging correctly detects 71 (precision x, recall 39.2%)
- Reasons for missed terms:
  - term length > 4
  - term variation (low tf.idf)
  - automatic filtering of nestings too rigid
  - tagging/lemmatization mistakes (pattern not extracted)
- Fine-tuning of the weighting scheme needed (currently set to highest possible precision)