GUIDE : PROF. PUSHPAK BHATTACHARYYA Bilingual Terminology Mining BY: MUNISH MINIA (07D05016) PRIYANK SHARMA (07D05017)

Slides:

Advertisements

Similar presentations

The Application of Machine Translation in CADAL Huang Chen, Chen Haiying Zhejiang University Libraries, Hangzhou, China

Advertisements

Statistical Machine Translation Part II – Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart

Mining Wiki Resources for Multilingual Named Entity Recognition Alexander E. Richman & Patrick Schone Reporter: Chia-Ying Lee Advisor: Prof. Hsin-Hsi Chen.

Predicting Text Quality for Scientific Articles Annie Louis University of Pennsylvania Advisor: Ani Nenkova.

Predicting Text Quality for Scientific Articles AAAI/SIGART-11 Doctoral Consortium Annie Louis : Louis A. and Nenkova A Automatically.

Measuring Monolinguality Chris Biemann NLP Department, University of Leipzig LREC-06 Workshop on Quality Assurance and Quality Measurement for Language.

A Flexible Workbench for Document Analysis and Text Mining NLDB’2004, Salford, June Gulla, Brasethvik and Kaada A Flexible Workbench for Document.

Inducing Information Extraction Systems for New Languages via Cross-Language Projection Ellen Riloff University of Utah Charles Schafer, David Yarowksy.

1 Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004.

Jumping Off Points Ideas of possible tasks Examples of possible tasks Categories of possible tasks.

Designing clustering methods for ontology building: The Mo’K workbench Authors: Gilles Bisson, Claire Nédellec and Dolores Cañamero Presenter: Ovidiu Fortu.

Evaluating an MT French / English System Widad Mustafa El Hadi Ismaïl Timimi Université de Lille III Marianne Dabbadie LexiQuest - Paris.

CSE 730 Information Retrieval of Biomedical Data The use of medical lexicon in biomedical IR.

MT Summit VIII, Language Technologies Institute School of Computer Science Carnegie Mellon University Pre-processing of Bilingual Corpora for Mandarin-English.

Comments on Guillaume Pitel: “Using bilingual LSA for FrameNet annotation of French text from generic resources” Gerd Fliedner Computational Linguistics.

1 The Web as a Parallel Corpus  Parallel corpora are useful  Training data for statistical MT  Lexical correspondences for cross-lingual IR  Early.

Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Drew DeHaas.

LEARNING WORD TRANSLATIONS Does syntactic context fare better than positional context? NCLT/CNGL Internal Workshop Ankit Kumar Srivastava 24 July 2008.

Research methods in corpus linguistics Xiaofei Lu.

An Information Theoretic Approach to Bilingual Word Clustering Manaal Faruqui & Chris Dyer Language Technologies Institute SCS, CMU.

A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora Benjamin Arai Computer Science and Engineering Department.

Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.

Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.

Natural Language Processing Lab Northeastern University, China Feiliang Ren EBMT Based on Finite Automata State Transfer Generation Feiliang Ren.

Comparable Corpora Kashyap Popat( ) Rahul Sharnagat(11305R013)

Multilingual Synchronization focusing on Wikipedia

The use of machine translation tools for cross-lingual text-mining Blaz Fortuna Jozef Stefan Institute, Ljubljana John Shawe-Taylor Southampton University.

Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.

Bilingual term extraction revisited: Comparing statistical and linguistic methods for a new pair of languages Špela Vintar Faculty of Arts Dept. of Translation.

1 A study on automatically extracted keywords in text categorization Authors:Anette Hulth and Be´ata B. Megyesi From:ACL 2006 Reporter: 陳永祥 Date:2007/10/16.

Learning Object Metadata Mining Masoud Makrehchi Supervisor: Prof. Mohamed Kamel.

A hybrid method for Mining Concepts from text CSCE 566 semester project.

Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval Doctorate Course Web Information Retrieval Speaker Gaia Trecarichi.

National Institute of Informatics Kiyoko Uchiyama 1 A Study for Introductory Terms in Logical Structure of Scientific Papers.

Profile The METIS Approach Future Work Evaluation METIS II Architecture METIS II, the continuation of the successful assessment project METIS I, is an.

1 Cross-Lingual Query Suggestion Using Query Logs of Different Languages SIGIR 07.

Concept Unification of Terms in Different Languages for IR Qing Li, Sung-Hyon Myaeng (1), Yun Jin (2),Bo-yeong Kang (3) (1) Information & Communications.

Finding High-frequent Synonyms of a Domain- specific Verb in English Sub-language of MEDLINE Abstracts Using WordNet Chun Xiao and Dietmar Rösner Institut.

1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester

MIRACLE Multilingual Information RetrievAl for the CLEF campaign DAEDALUS – Data, Decisions and Language, S.A. Universidad Carlos III de.

GoogleDictionary Paul Nepywoda Alla Rozovskaya. Goal Develop a tool for English that, given a word, will illustrate its usage.

Using a Lemmatizer to Support the Development and Validation of the Greek WordNet Harry Kornilakis 1, Maria Grigoriadou 1, Eleni Galiotou 1,2, Evangelos.

Translating Unknown Queries with Web Corpora for Cross- Language Information Retrieval Pu-Jen Cheng, Jei-Wen Teng, Ruei- Cheng Chen, Jenq-Haur Wang, Wen-

An Effective Word Sense Disambiguation Model Using Automatic Sense Tagging Based on Dictionary Information Yong-Gu Lee

Katrin Erk Vector space models of word meaning. Geometric interpretation of lists of feature/value pairs In cognitive science: representation of a concept.

Gerrit Schutte OHIM 9th of December, 2011 Trademark terminology control.

Terminology and documentation*  Object of the study of terminology:  analysis and description of the units representing specialized knowledge in specialized.

Using Surface Syntactic Parser & Deviation from Randomness Jean-Pierre Chevallet IPAL I2R Gilles Sérasset CLIPS IMAG.

1 Statistical NLP: Lecture 7 Collocations. 2 Introduction 4 Collocations are characterized by limited compositionality. 4 Large overlap between the concepts.

CLEF2003 Forum/ August 2003 / Trondheim / page 1 Report on CLEF-2003 ML4 experiments Extracting multilingual resources from corpora N. Cancedda, H. Dejean,

Next Generation Search Engines Ehsun Daroodi 1 Feb, 2003.

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.

Iterative Translation Disambiguation for Cross Language Information Retrieval Christof Monz and Bonnie J. Dorr Institute for Advanced Computer Studies.

Authors: Marius Pasca and Benjamin Van Durme Presented by Bonan Min Weakly-Supervised Acquisition of Open- Domain Classes and Class Attributes from Web.

Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.

Mutual bilingual terminology extraction Le An Ha*, Gabriela Fernandez**, Ruslan Mitkov*, Gloria Corpas*** * University of Wolverhampton ** Universidad.

Collocations and Terminology Vasileios Hatzivassiloglou University of Texas at Dallas.

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Iterative Translation Disambiguation for Cross-Language.

Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,

1 Minimum Error Rate Training in Statistical Machine Translation Franz Josef Och Information Sciences Institute University of Southern California ACL 2003.

Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.

Event-Based Extractive Summarization E. Filatova and V. Hatzivassiloglou Department of Computer Science Columbia University (ACL 2004)

A Simple English-to-Punjabi Translation System By : Shailendra Singh.

Review: Review: Translating without in-domain corpus: Machine translation post-editing with online learning techniques Antonio L. Lagarda, Daniel Ortiz-Martínez,

Language Identification and Part-of-Speech Tagging

EXTRACTING COMPLEX PREDICATES IN HINDI ACROSS PARALLEL CORPORA

Measuring Monolinguality

Korean version of GloVe Applying GloVe & word2vec model to Korean corpus speaker : 양희정 date :

CORPUS LINGUISTICS Corpus linguistics is the study of language as expressed in samples (corpora) or "real world" text. An approach to derive at a set of.

Neural Machine Translation by Jointly Learning to Align and Translate

Presentation transcript:

GUIDE : PROF. PUSHPAK BHATTACHARYYA Bilingual Terminology Mining BY: MUNISH MINIA (07D05016) PRIYANK SHARMA (07D05017)

Content Introduction Multilingual Terminology Mining Chain  Term Extraction  Term Alignment  Direct Context-Vector Method  Translation of Lexical Units Linguistic Resources  Comparable Corpora  Bilingual Dictionary Conclusion

Introduction Text mining research generally adopts “big is beautiful” approach  Justified by the need of large amount of data in order to make use of statistic or stochastic methods [1] Hypothesis : The quality rather than the quantity of the corpus matters more in terminology mining Web is used as a Comparable Corpus The comparability of the corpus should not only be based on the domain or the sub-domain, but also on the type of discourse 1 : Manning and Schütze, 1999

Multi lingual Terminology Mining Chain Architecture : WEB Source language documents Target language documents Terminology Extraction Lexical alignment process Bilingual Dictionary Terms to be translated Translated terms

Architecture A comparable corpora is taken as input Output is a list of single- and multi-word candidate terms along with their candidate translations Processes involved :  Term Extraction  Term Alignment  Direct Context-Vector Method  Translation of lexical units

Term Extraction Terminological units extracted are MW terms whose syntactic patterns, expressed using POS tags, correspond either to a canonical or a variation structure For French main patterns are  N N  N Prep N et N adj For Japanese main patterns are  N N  N Suff  Adj N  Pref N Variants handled are  Morphological for both French and Japanese  Syntactical for French  Compounding for Japanese

Variants handled in French Morphological Variant : Morphological modification of one of the components of base form Syntactical Variant : the insertion of another word into the components of the base form Compounding Variant : the agglutination of another word to one of the components of base form Example : sécrétion d’insuline (insulin secretion)  Base form : N Prep N pattern  Morphological variant : sécrétions d’insuline (insulin secretions)  Syntactic variant : sécrétion pancréatique d’insuline (pancreatic insulin secretion)  Syntactic variant : sécrétions de peptide et d’insuline (insulin and peptide secretion)

Variants handled in Japanese Example the MWT (insulin secretion) appears in following form:  Base form : N N pattern :  Compounding variant : agglutination of a word at the end of the base form : (insulin secretion ability)

Term Alignment It aligns source MWT’s with target SWT’s or MWT’s Direct Context-Vector method:  Collect all lexical units in the context of lexical unit ‘i’ in a window of size ‘n’ words around ‘i’  For each lexical unit of source and target language  Obtain a context-vector Vi, which gathers the set of co-occurrences units j associated with the number of times that j and i occur together  Normalize context vector  Mutual information  Log-likelihood

Term Alignment: Direct Context-Vector Method  Using Bilingual Dictionary, translate the lexical units of the source context-vector  For a word to be translated, compute the similarity between the translated context-vector and all target vectors through vector distance  Candidate translations of a lexical units are the target lexical units closest to the translated context-vector acc. to the vector distance

Term Alignment: Translation Translation of lexical units  Depends on the coverage of bilingual dictionary  If bilingual dictionary provides several translations for a lexical unit, consider all of them but weight the different translations by their frequency in the target language  For a MW, possible translations are generated by using compositional method  If it is not possible to translate all compositions of MW, MWT is not taken into account in the translation process

Composition methods for French and Japanese For Japanese  Fatigue chronique (chronic fatigue) for fatigue four translations are possible : two translations for chronique:  We generate all combinations [2] of translated elements and select those which refer to an existing MWT in the target language For French  For a multi-word of length n [3], produce all the combinations of MW Unit elements of length less than or equal to n  Syndrome de fatigue chronique (chronic fatigue disease) yields the four possible combinations: [Syndrome de fatigue chronique] [Syndrome de fatigue] [chronique] [Syndrome] [fatigue chronique] [Syndrome] [fatigue] [chronique]  A direct translation of subpart of the MW is done if present in bilingual dictionary  90% of the candidate terms provided by term extraction are composed of only two content words, so limiting to the combination 4th 2: Grefenstette, : Robitaille et al.,2006

Comparable Corpora Are sets of texts in different languages, that are not translations of each other Share some characteristics or features : topic, period, media, author, discourse One of the clearest is ICE -- the International Corpus of English=1, ; (Greenbaum1991) Corpora of around one million words in each of many varieties of English around the world

Bilingual Dictionary A bilingual dictionary or translation dictionary is a specialized dictionary used to translate words or phrases from one language to another. Bilingual dictionaries can be  Unidirectional, meaning that they list the meanings of words of one language in another  Bidirectional, allowing translation to and from both languages.

Conclusion More frequent a term and its translation, the better is the quality of alignment  The discourse categorization of documents allows lexical acquisition to increase precision  Including discourse, results in candidate translations of better quality even if the corpus size is reduced by half  Gives rise to data sparsity problem Data sparsity problem can be partially solved by using comparable corpora of high quality

References ml ml Bilingual terminology mining - using brain, not brawn comparable corpora [Emmanuel Morin, B´eatrice Daille, Koichi Takeuchi, and Kyo Kageura 2007] Automatic Extraction of Bilingual Terms from Comparable Corpora in a Popular Science Domain X. Saralegi, I. San Vicente, A. Gurrutxaga

Thank You