Constructing Bilingual Resources for Digital Libraries Rim, Hae-Chang Korea University 2000.8.10.



Contents
  Introduction
  Bilingual resources
    – bilingual dictionary
    – bilingual corpus
    – bilingual thesaurus
  Our experience
    – bilingual dictionary
    – bilingual corpus
    – bilingual thesaurus
  Summary

Introduction
  What is the problem?
    – the language barrier in multilingual digital libraries
  How to solve the problem?
    – machine translation (MT)
    – cross-language information retrieval (CLIR)
  Why bilingual resources?
    – MT and CLIR are based on bilingual resources
  What shall we do?
    – construct a Korean-English bilingual dictionary
    – construct a Korean-English bilingual corpus
    – construct a Korean-English bilingual thesaurus

Overview
  [Diagram: DL; language barrier; CLIR and MT; bilingual resources]

Bilingual Resources
  Bilingual dictionary
  Bilingual corpus
  Bilingual thesaurus

Bilingual Dictionary
  Definition
    – a dictionary that lists words together with their translations
  Application field
    – CLIR [Oard 98], [Fujii et al. 99], [Myaeng et al. 99]
    – MT
  Utilization (diagram)
    word
      “대기”
    bilingual dictionary
      “대기1” – “atmosphere”
      “대기2” – “waiting”
    translated words
      “atmosphere”, “waiting”
    CLIR, MT
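The lookup in the diagram above can be sketched in Python. This is only a minimal illustration under stated assumptions: the dictionary is a toy in-memory table holding the slide's example words, and the names `BILINGUAL_DICT`, `translate_word`, and `translate_query` are hypothetical, not part of the system described in the talk.

```python
# Toy Korean-English bilingual dictionary (hypothetical structure;
# the entries are the example words used in the slides).
BILINGUAL_DICT = {
    "대기": ["atmosphere", "waiting"],  # one word, two senses
    "오염": ["pollution"],
}


def translate_word(word):
    """Return every English translation listed for a Korean word.

    Unknown words yield an empty list instead of raising an error.
    """
    return list(BILINGUAL_DICT.get(word, []))


def translate_query(query):
    """Translate a whitespace-separated query word by word.

    All candidate translations are kept, because the dictionary alone
    cannot tell which sense is intended.
    """
    return [translate_word(word) for word in query.split()]
```

Keeping every candidate makes the ambiguity of “대기” explicit: the dictionary returns both senses and leaves the choice to a later step.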

Bilingual Corpus (1)
  Definition
    – comparable corpus: a collection of similar texts in different languages
    – parallel corpus: a collection of texts which have been translated into one or more other languages
      Ex) Canadian Hansard corpus
  Application field
    – CLIR [Yang et al. 98]
    – MT: Example-Based Machine Translation
      [Brown 96], [Murata et al. 99], [Shirai et al. 97], [Turcato et al. 99]

Bilingual Corpus (2)
  Utilization (diagram)
    translated words
      “대기” – “atmosphere”, “waiting”
      “오염” – “pollution”
    candidate phrases for “대기 오염”
      “atmosphere pollution”? “waiting pollution”?
    bilingual corpus
      “the sources of atmosphere pollution may have a global, regional and local character.”
      “대기 오염의 원인은 전세계적, 국부적, 그리고 지역적인 특징을 가진다.”
    translated phrase
      “대기 오염” – “atmosphere pollution”
    CLIR, MT
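One way to read the diagram above is that the parallel corpus is used to pick among the candidate phrase translations produced by the dictionary. The Python sketch below illustrates that idea under simplifying assumptions: the corpus is a one-pair toy list, matching is plain substring search, and all names (`PARALLEL_CORPUS`, `CANDIDATES`, `candidate_phrases`, `best_translation`) are hypothetical. It is not the method used by the authors, which the slides do not describe.

```python
from itertools import product

# Toy sentence-aligned parallel corpus: (Korean, English) pairs from the slide.
PARALLEL_CORPUS = [
    (
        "대기 오염의 원인은 전세계적, 국부적, 그리고 지역적인 특징을 가진다.",
        "the sources of atmosphere pollution may have a global, "
        "regional and local character.",
    ),
]

# Word-level candidates, as a bilingual dictionary would supply them.
CANDIDATES = {"대기": ["atmosphere", "waiting"], "오염": ["pollution"]}


def candidate_phrases(korean_phrase):
    """Build every word-by-word English rendering of a Korean phrase."""
    options = [CANDIDATES.get(word, []) for word in korean_phrase.split()]
    return [" ".join(combo) for combo in product(*options)]


def best_translation(korean_phrase):
    """Pick the candidate seen most often in English sentences aligned
    with Korean sentences containing the phrase; None if none is attested."""
    counts = {phrase: 0 for phrase in candidate_phrases(korean_phrase)}
    if not counts:
        return None
    for korean, english in PARALLEL_CORPUS:
        if korean_phrase in korean:
            for phrase in counts:
                if phrase in english.lower():
                    counts[phrase] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

With the single aligned pair from the slide, only “atmosphere pollution” is attested, so it is preferred over “waiting pollution”.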

Bilingual Thesaurus (1)
  Definition
    – a collection of words in two languages that are grouped together according to connections between their meanings
    – Ex) EuroWordNet
  Application field
    – CLIR: concept-based CLIR
      [Gonzalo et al. 98], [Gilarranz et al. 97]

Bilingual Thesaurus (2)
  Utilization (diagram)
    word
      “대기”
    bilingual thesaurus
      {region, part}  {atmosphere, 대기}  {air}
      {inactivity}  {wait, waiting, 대기}  {pause}
    word concept
      “region”, “inactivity”
    CLIR
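The concept lookup in the diagram above can be sketched as follows. The data layout is an assumption: each synset containing “대기” is paired with the concept label the slide shows for it (“region” or “inactivity”), and the names `SYNSETS`, `concepts_of`, and `share_concept` are hypothetical.

```python
# Toy bilingual thesaurus. Pairing each synset with a single concept label
# is an assumption read off the slide's diagram, not a documented format.
SYNSETS = [
    {"words": {"atmosphere", "대기"}, "concept": "region"},
    {"words": {"wait", "waiting", "대기"}, "concept": "inactivity"},
]


def concepts_of(word):
    """Return the sorted concept labels of every synset containing the word."""
    return sorted(s["concept"] for s in SYNSETS if word in s["words"])


def share_concept(word_a, word_b):
    """True if two words, possibly in different languages, share a concept."""
    return bool(set(concepts_of(word_a)) & set(concepts_of(word_b)))
```

Matching on concepts rather than surface words is what lets a Korean query term retrieve English documents in concept-based CLIR.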

Our Experience
  Bilingual dictionary
  Bilingual corpus
  Bilingual thesaurus

Bilingual Dictionary
  Korean-English bilingual dictionary
    – size: 2 million entries
    – application (diagram)
        person’s name
          “링컨”
        bilingual biographical dictionary
          “링컨” – “Lincoln”
        translated person’s name
          “Lincoln”
        CLIR, MT

Bilingual Corpus
  Korean-English bilingual corpus
    – parallel corpus containing 250,000 words
    – based on CES (Corpus Encoding Standard)
  Corpus construction tools
    – corpus refining tools
    – corpus annotating tools
    – bilingual concordancer
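As a rough illustration of what a bilingual concordancer does, the Python sketch below returns the aligned sentence pairs in which a keyword occurs on either side. It assumes the corpus is already sentence-aligned and held in memory as (Korean, English) tuples; the function name `concordance` and that layout are hypothetical, and the tool mentioned in the slides may work quite differently.

```python
def concordance(keyword, aligned_pairs):
    """Return the (source, target) pairs whose source or target sentence
    contains the keyword, compared case-insensitively."""
    key = keyword.lower()
    return [
        (source, target)
        for source, target in aligned_pairs
        if key in source.lower() or key in target.lower()
    ]


# Toy aligned pairs built from examples that appear in the slides.
PAIRS = [
    (
        "대기 오염의 원인은 전세계적, 국부적, 그리고 지역적인 특징을 가진다.",
        "the sources of atmosphere pollution may have a global, "
        "regional and local character.",
    ),
    ("링컨", "Lincoln"),
]
```

Calling `concordance("pollution", PAIRS)` or `concordance("대기", PAIRS)` returns the first pair, so a lexicographer can inspect a word and its translation in context.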

Bilingual Thesaurus (1)
  Goal
    – constructing a Korean-English bilingual thesaurus
  Approach
    – assigning Korean words to corresponding English words in WordNet
  Diagram
    Korean word
      “대기”
    WordNet
      {air}  {region, part}  {atmosphere}
    Korean-English bilingual thesaurus
      {air}  {region, part}  {atmosphere, 대기}
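The approach above, attaching Korean words to the WordNet synsets of their English translations, might be sketched like this. Everything here is a simplifying assumption: `WORDNET` is a toy stand-in with made-up synset identifiers rather than the real WordNet database, `KO_EN_DICT` is a toy bilingual dictionary, and the function ignores sense disambiguation, which a real construction effort would have to address.

```python
# Toy stand-in for WordNet: hypothetical synset id -> set of English words.
WORDNET = {
    "syn_air": {"air"},
    "syn_region": {"region", "part"},
    "syn_atmosphere": {"atmosphere"},
}

# Toy Korean-English dictionary supplying the translation links.
KO_EN_DICT = {"대기": ["atmosphere", "waiting"]}


def assign_korean_words(wordnet, ko_en_dict):
    """Return a copy of the synsets in which each Korean word has been added
    to every synset that contains one of its English translations."""
    merged = {synset_id: set(words) for synset_id, words in wordnet.items()}
    for korean_word, translations in ko_en_dict.items():
        for synset_id, words in wordnet.items():
            if words & set(translations):
                merged[synset_id].add(korean_word)
    return merged
```

On the toy data, “대기” joins the {atmosphere} synset and the other synsets are unchanged, mirroring the before and after states in the diagram.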

Bilingual Thesaurus (2)
  Current status of the task
    – under construction
  [Table: Korean thesaurus vs. WordNet; rows: word count, concept count (synset count), word sense count]

Summary
  Surmounting the language barrier
    – using bilingual resources
  Korean-English bilingual resources
    – Korean-English bilingual dictionary
    – Korean-English bilingual corpus
    – Korean-English bilingual thesaurus
  Our experience
    – Korean-English bilingual dictionary
    – Korean-English bilingual corpus
    – Korean-English bilingual thesaurus

References (1)
  [Oard 98] Douglas W. Oard, "A Comparative Study of Query and Document Translation for Cross-Language Information Retrieval", Third Conference of the Association for Machine Translation in the Americas (AMTA), Philadelphia, PA, October 1998.
  [Fujii et al. 99] Atsushi Fujii and Tetsuya Ishikawa, "Cross-Language Information Retrieval for Technical Documents", Proceedings of the Joint ACL SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 29-37, 1999.
  [Myaeng et al. 99] Sung Hyon Myaeng and Myung-gil Jang, "Complementing Dictionary-Based Query Translations with Corpus Statistics for Cross-Language IR", Machine Translation Summit VII, 1999.

References (2)
  [Yang et al. 98] Yiming Yang, Jaime G. Carbonell, Ralf D. Brown, and Robert E. Frederking, "Translingual Information Retrieval: Learning from Bilingual Corpora", Artificial Intelligence (Special issue: Best of IJCAI-97), Vol. 103, 1998.
  [Brown 96] Ralf D. Brown, "Example-Based Machine Translation in the Pangloss System", Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), Copenhagen, Denmark, August 5-9, 1996.
  [Murata et al. 99] M. Murata, Q. Ma, K. Uchimoto, and H. Isahara, "An Example-Based Approach to Japanese-to-English Translation of Tense, Aspect, and Modality", TMI'99, Chester, UK, August 23, 1999.

References (3)
  [Shirai et al. 97] S. Shirai, F. Bond, and Y. Takahashi, "A Hybrid Rule and Example based Method for Machine Translation", Natural Language Processing Pacific Rim Symposium '97 (NLPRS-97).
  [Turcato et al. 99] Davide Turcato, Paul McFetridge, Fred Popowich, and Janine Toole, "A Unified Example-Based and Lexicalist Approach to Machine Translation", 8th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-99).
  [Gonzalo et al. 98] Julio Gonzalo, Felisa Verdejo, Carol Peters, and Nicoletta Calzolari, "Applying EuroWordNet to Cross-Language Text Retrieval", Computers and the Humanities, Vol. 32, Nos. 2-3, 1998.

References (4)
  [Gilarranz et al. 97] Julio Gilarranz, Julio Gonzalo, and Felisa Verdejo, "An Approach to Conceptual Text Retrieval Using the EuroWordNet Multilingual Semantic Database", AAAI 97.