Constructing Bilingual Resources for Digital Libraries Rim, Hae-Chang Korea University
Contents Introduction Bilingual resources –bilingual dictionary –bilingual corpus –bilingual thesaurus Our experience –bilingual dictionary –bilingual corpus –bilingual thesaurus Summary
Introduction What is the problem? –the language barrier in multilingual digital libraries. How to solve the problem? –machine translation (MT) –cross-language information retrieval (CLIR) Why bilingual resources? –MT and CLIR are based on bilingual resources. What shall we do? –construct a Korean-English bilingual dictionary, a Korean-English bilingual corpus, and a Korean-English bilingual thesaurus.
Overview DL language barrier CLIR MT bilingual resources
Bilingual Resources Bilingual dictionary Bilingual corpus Bilingual thesaurus
Bilingual Dictionary Definition –a dictionary containing words and their translations. Application field –CLIR [Oard 98], [Fujii et al. 99], [Myaeng et al. 99] –MT Utilization word “대기” → bilingual dictionary (“대기 1” – “atmosphere”, “대기 2” – “waiting”) → translated words “atmosphere”, “waiting” → CLIR, MT
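The lookup step on this slide can be sketched as follows. This is a minimal illustration, not the authors' system: the dictionary contents beyond the “대기” entry and the function names `translate_word` and `translate_query` are assumptions made for the example.

```python
# Toy bilingual dictionary. The two senses of "대기" come from the slide;
# the "오염" entry and all names below are hypothetical.
BILINGUAL_DICT = {
    "대기": ["atmosphere", "waiting"],
    "오염": ["pollution"],
}


def translate_word(word, dictionary=BILINGUAL_DICT):
    """Return every candidate translation; unknown words pass through unchanged."""
    return dictionary.get(word, [word])


def translate_query(query, dictionary=BILINGUAL_DICT):
    """Translate a whitespace-separated query word by word."""
    return [translate_word(word, dictionary) for word in query.split()]


if __name__ == "__main__":
    print(translate_word("대기"))
    print(translate_query("대기 오염"))
```

A plain lookup returns all senses of an ambiguous word, so choosing between “atmosphere” and “waiting” is left to a later step.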
Bilingual Corpus (1) Definition –comparable corpus: a collection of similar texts in different languages –parallel corpus: a collection of texts that have been translated into one or more other languages. Ex) Canadian Hansard corpus Application field –CLIR [Yang et al. 98] –MT: Example-Based Machine Translation [Brown 96], [Murata et al. 99], [Shirai et al. 97], [Turcato et al. 99]
Bilingual Corpus (2) Utilization translated words (“대기” – “atmosphere”, “waiting”; “오염” – “pollution”) → “대기 오염”: “atmosphere pollution”? “waiting pollution”? → bilingual corpus (“the sources of atmosphere pollution may have a global, regional and local character.” / “대기 오염의 원인은 전세계적, 국부적, 그리고 지역적인 특징을 가진다.”) → translated phrase “대기 오염” – “atmosphere pollution” → CLIR, MT
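One simple way to let a parallel corpus pick among candidate phrase translations is to count how often each candidate appears on the English side of sentence pairs whose Korean side contains the source words. The sketch below assumes that counting scheme, which the slide does not specify; the sentence pair and word candidates are taken from the slide, and the function name `best_phrase` is hypothetical.

```python
from itertools import product

# One aligned sentence pair, taken from the slide.
PARALLEL_CORPUS = [
    (
        "대기 오염의 원인은 전세계적, 국부적, 그리고 지역적인 특징을 가진다.",
        "the sources of atmosphere pollution may have a global, "
        "regional and local character.",
    ),
]

# Word-level candidates from the bilingual dictionary.
CANDIDATES = {"대기": ["atmosphere", "waiting"], "오염": ["pollution"]}


def best_phrase(source_phrase, candidates=CANDIDATES, corpus=PARALLEL_CORPUS):
    """Score each candidate phrase by its count in matching sentence pairs."""
    words = source_phrase.split()
    options = [candidates.get(word, [word]) for word in words]
    scores = {}
    for combo in product(*options):
        phrase = " ".join(combo)
        scores[phrase] = sum(
            1
            for source, target in corpus
            # Substring test, so an inflected form such as "오염의" still matches.
            if all(word in source for word in words) and phrase in target.lower()
        )
    # On a tie, max() keeps the first candidate in dictionary order.
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    print(best_phrase("대기 오염"))
```

With a single sentence pair the counts are only 0 or 1; a real corpus would give graded evidence, and the substring matching would need proper tokenization.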
Bilingual Thesaurus (1) Definition –a collection of words in two languages, grouped according to the connections between their meanings –Ex) EuroWordNet Application field –CLIR: concept-based CLIR [Gonzalo et al. 98], [Gilarranz et al. 97]
Bilingual Thesaurus (2) Utilization word “대기” → bilingual thesaurus ({region, part} {atmosphere, 대기} {air}; {inactivity} {wait, waiting, 대기} {pause}) → word concepts “region”, “inactivity” → CLIR
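The word-to-concept mapping can be sketched with a small in-memory thesaurus. The synset members come from the slide; the synset identifiers, the function names, and the reading that {region, part} and {inactivity} are the parent concepts of the two “대기” synsets are assumptions drawn from the diagram layout.

```python
# Toy bilingual thesaurus. Identifiers such as "atmosphere.n" are hypothetical.
SYNSETS = {
    "atmosphere.n": {"members": ["atmosphere", "대기"], "parent": "region.n"},
    "wait.n": {"members": ["wait", "waiting", "대기"], "parent": "inactivity.n"},
    "region.n": {"members": ["region", "part"], "parent": None},
    "inactivity.n": {"members": ["inactivity"], "parent": None},
}


def concepts_of(word):
    """Return the ids of all synsets that contain the word, in sorted order."""
    return sorted(sid for sid, syn in SYNSETS.items() if word in syn["members"])


def parent_concepts(word):
    """Return the parent concept of each synset that contains the word."""
    return [SYNSETS[sid]["parent"] for sid in concepts_of(word)]


def share_concept(word_a, word_b):
    """True if two words, in either language, belong to a common synset."""
    return bool(set(concepts_of(word_a)) & set(concepts_of(word_b)))


if __name__ == "__main__":
    print(concepts_of("대기"))
    print(parent_concepts("대기"))
    print(share_concept("대기", "atmosphere"))
```

Matching on shared concepts rather than on surface words is what lets a Korean query term retrieve documents indexed under English terms.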
Our Experience Bilingual dictionary Bilingual corpus Bilingual thesaurus
Bilingual Dictionary Korean-English bilingual dictionary –size: 2 million entries –application: person’s name “링컨” → bilingual biographical dictionary (“링컨” – “Lincoln”) → translated person’s name “Lincoln” → CLIR, MT
Bilingual Corpus Korean-English bilingual corpus –parallel corpus containing 250,000 words –based on CES (Corpus Encoding Standard) Corpus construction tools –corpus refining tools –corpus annotating tools –bilingual concordancer
Bilingual Thesaurus (1) Goal –constructing a Korean-English bilingual thesaurus Approach –assigning Korean words to corresponding English words in WordNet WordNet ({air} {region, part} {atmosphere}) + Korean word “대기” → [Korean-English bilingual thesaurus] ({air} {region, part} {atmosphere, 대기})
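The assignment step can be illustrated with a toy WordNet held in a plain dictionary. The slide does not say how Korean words are matched to synsets, so this sketch assumes the link is a Korean-English dictionary entry; the synset identifiers and the function name `attach_korean` are hypothetical.

```python
import copy

# Toy English-only WordNet fragment with the synsets shown on the slide.
TOY_WORDNET = {
    "atmosphere.n": ["atmosphere"],
    "air.n": ["air"],
    "region.n": ["region", "part"],
}

# Assumed link between the languages: one dictionary translation per Korean word.
KO_EN = {"대기": ["atmosphere"]}


def attach_korean(wordnet, ko_en):
    """Return a copy of wordnet with Korean words added to matching synsets."""
    thesaurus = copy.deepcopy(wordnet)  # leave the input untouched
    for korean, translations in ko_en.items():
        for members in thesaurus.values():
            if any(en in members for en in translations) and korean not in members:
                members.append(korean)
    return thesaurus


if __name__ == "__main__":
    print(attach_korean(TOY_WORDNET, KO_EN))
```

An English word can belong to several synsets, so attaching a Korean word to every synset of its translation will over-generate; a real pipeline would need a sense-selection step that this sketch leaves out.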
Bilingual Thesaurus (2) Current status of the task –under construction Comparison of Korean thesaurus and WordNet –word count –concept count (synset count) –word sense count
Summary Surmounting the language barrier –using bilingual resources Korean-English bilingual resources –Korean-English bilingual dictionary –Korean-English bilingual corpus –Korean-English bilingual thesaurus Our experience –Korean-English bilingual dictionary –Korean-English bilingual corpus –Korean-English bilingual thesaurus
References (1) [Oard 98] Douglas W. Oard, "A Comparative Study of Query and Document Translation for Cross-Language Information Retrieval", the Third Conference of the Association for Machine Translation in the Americas (AMTA), Philadelphia, PA, October. [Fujii et al. 99] Atsushi Fujii, Tetsuya Ishikawa, "Cross-Language Information Retrieval for Technical Documents", Proceedings of the joint ACL SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 29-37. [Myaeng et al. 99] Sung Hyon Myaeng and Myung-gil Jang, "Complementing Dictionary-Based Query Translations with Corpus Statistics for Cross-Language IR", Machine Translation Summit VII, 1999.
References (2) [Yang et al. 98] Yiming Yang, Jaime G. Carbonell, Ralf D. Brown, and Robert E. Frederking, "Translingual Information Retrieval: Learning from Bilingual Corpora", in Artificial Intelligence (Special issue: Best of IJCAI-97), Vol. 103 (1998), pp IJCAI [Brown 96] Ralf D. Brown, "Example-Based Machine Translation in the Pangloss System", in Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), pp , Copenhagen, Denmark, August 5-9. [Murata et al. 99] M. Murata, Q. Ma, K. Uchimoto, H. Isahara, "An Example-Based Approach to Japanese-to-English Translation of Tense, Aspect, and Modality", in TMI'99, Chester, UK, August 23, 1999.
References (3) [Shirai et al. 97] S. Shirai, F. Bond, and Y. Takahashi, "A Hybrid Rule and Example based Method for Machine Translation", in Natural Language Processing Pacific Rim Symposium '97: NLPRS-97. [Turcato et al. 99] Davide Turcato, Paul McFetridge, Fred Popowich, Janine Toole, "A Unified Example-Based and Lexicalist Approach to Machine Translation", at the 8th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-99). [Gonzalo et al. 98] Julio Gonzalo, Felisa Verdejo, Carol Peters and Nicoletta Calzolari, "Applying EuroWordNet to Cross-Language Text Retrieval", Computers and the Humanities, Vol 32, Nos. 2-3, pp , 1998.
References (4) [Gilarranz et al. 97] Julio Gilarranz, Julio Gonzalo and Felisa Verdejo, "An Approach to Conceptual Text Retrieval Using the EuroWordNet Multilingual Semantic Database", AAAI 97.