Published by Catherine Bennett. Modified over 9 years ago.
1
Constructing Bilingual Resources for Digital Libraries
Rim, Hae-Chang
Korea University
2000.8.10
2
Contents
Introduction
Bilingual resources
  –bilingual dictionary
  –bilingual corpus
  –bilingual thesaurus
Our experience
  –bilingual dictionary
  –bilingual corpus
  –bilingual thesaurus
Summary
3
Introduction
What is the problem?
  –the language barrier in multilingual digital libraries
How to solve the problem?
  –machine translation (MT)
  –cross-language information retrieval (CLIR)
Why bilingual resources?
  –MT and CLIR are both based on bilingual resources.
What shall we do?
  –construct three Korean-English resources:
    Korean-English bilingual dictionary
    Korean-English bilingual corpus
    Korean-English bilingual thesaurus
4
Overview
[Diagram: DL, language barrier, CLIR, MT, bilingual resources]
5
Bilingual Resources
Bilingual dictionary
Bilingual corpus
Bilingual thesaurus
6
Bilingual Dictionary
Definition
  –a dictionary containing words and their translations
Application field
  –CLIR [Oard 98], [Fujii et al. 99], [Myaeng et al. 99]
  –MT
Utilization
  word: “대기”
  bilingual dictionary: “대기 1” – “atmosphere”, “대기 2” – “waiting”
  translated words (passed to CLIR and MT): “atmosphere”, “waiting”
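The lookup on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the system described in the talk: the two entries come from the slides, while the data structure and the function names `translate_word` and `translate_query` are assumptions made for the example.

```python
# Toy bilingual dictionary. The entries appear on the slides;
# the dict-of-lists layout is an assumption for illustration.
BILINGUAL_DICT = {
    "대기": ["atmosphere", "waiting"],
    "오염": ["pollution"],
}


def translate_word(word, dictionary=BILINGUAL_DICT):
    """Return every translation candidate for a word (empty list if unknown)."""
    return list(dictionary.get(word, []))


def translate_query(query, dictionary=BILINGUAL_DICT):
    """Translate a whitespace-separated query word by word, keeping all candidates."""
    return [translate_word(word, dictionary) for word in query.split()]
```

Note that the dictionary alone returns both senses of “대기”; choosing between them needs additional evidence such as a bilingual corpus.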
7
Bilingual Corpus (1)
Definition
  –comparable corpus: a collection of similar texts in different languages
  –parallel corpus: a collection of texts that have been translated into one or more other languages. Ex) Canadian Hansard corpus
Application field
  –CLIR [Yang et al. 98]
  –MT: Example-Based Machine Translation
    [Brown 96], [Murata et al. 99], [Shirai et al. 97], [Turcato et al. 99]
8
Bilingual Corpus (2)
Utilization
  translated words: “대기” – “atmosphere”, “waiting”; “오염” – “pollution”
  candidate phrases for “대기 오염”: “atmosphere pollution”? “waiting pollution”?
  bilingual corpus:
    “the sources of atmosphere pollution may have a global, regional and local character.”
    “대기 오염의 원인은 전세계적, 국부적, 그리고 지역적인 특징을 가진다.”
  translated phrase (passed to CLIR and MT): “대기 오염” – “atmosphere pollution”
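One simple way to realize this selection step is to generate every candidate phrase from the dictionary and keep the one attested in the parallel corpus. The sketch below assumes sentence-aligned pairs and plain substring matching; the names `candidate_phrases` and `select_translation` are hypothetical, and the slides do not say which selection method was actually used.

```python
from itertools import product

# Translation candidates and the aligned sentence pair are taken from the slide.
CANDIDATES = {"대기": ["atmosphere", "waiting"], "오염": ["pollution"]}
PARALLEL_CORPUS = [
    ("대기 오염의 원인은 전세계적, 국부적, 그리고 지역적인 특징을 가진다.",
     "the sources of atmosphere pollution may have a global, regional and local character."),
]


def candidate_phrases(source_phrase, candidates):
    """Build every word-by-word combination of translation candidates."""
    options = [candidates.get(word, [word]) for word in source_phrase.split()]
    return [" ".join(combo) for combo in product(*options)]


def select_translation(source_phrase, candidates, corpus):
    """Pick the candidate that co-occurs most often with the source phrase."""
    counts = {}
    for phrase in candidate_phrases(source_phrase, candidates):
        counts[phrase] = sum(
            1 for src, tgt in corpus
            if source_phrase in src and phrase in tgt.lower()
        )
    best = max(counts, key=counts.get)
    # Return None when the corpus gives no evidence for any candidate.
    return best if counts[best] > 0 else None
```

Substring counting is the crudest possible evidence; a real system would more likely use word alignment or co-occurrence statistics, but the flow from candidates to a corpus-backed choice is the same.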
9
Bilingual Thesaurus (1)
Definition
  –a collection of words in two languages, grouped according to the connections between their meanings
  –Ex) EuroWordNet
Application field
  –CLIR: concept-based CLIR
    [Gonzalo et al. 98], [Gilarranz et al. 97]
10
Bilingual Thesaurus (2)
Utilization
  word: “대기”
  bilingual thesaurus:
    {region, part}, {atmosphere, 대기}, {air}
    {inactivity}, {wait, waiting, 대기}, {pause}
  word concepts (passed to CLIR): “region”, “inactivity”
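The concept lookup on this slide might be modeled as follows. This is only a sketch: the synset members come from the slide, but treating “region” and “inactivity” as the parent concept of each synset is an interpretation of the diagram, and `concepts_for` and `same_concept` are hypothetical helper names.

```python
# Each synset groups Korean and English words that share a meaning.
# The "concept" label (parent concept) is an assumed reading of the slide's diagram.
SYNSETS = [
    {"words": {"atmosphere", "대기"}, "concept": "region"},
    {"words": {"wait", "waiting", "대기"}, "concept": "inactivity"},
]


def concepts_for(word, synsets=SYNSETS):
    """Return the sorted concept labels of every synset containing the word."""
    return sorted(s["concept"] for s in synsets if word in s["words"])


def same_concept(word_a, word_b, synsets=SYNSETS):
    """True if the two words (in either language) share at least one synset."""
    return any(word_a in s["words"] and word_b in s["words"] for s in synsets)
```

In concept-based CLIR, a Korean query term and an English document term can then be matched through a shared synset rather than through a direct word translation.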
11
Our Experience
Bilingual dictionary
Bilingual corpus
Bilingual thesaurus
12
Bilingual Dictionary
Korean-English bilingual dictionary
  –size: 2 million entries
  –application
    person’s name: “링컨”
    bilingual biographical dictionary: “링컨” – “Lincoln”
    translated person’s name (passed to CLIR and MT): “Lincoln”
13
Bilingual Corpus
Korean-English bilingual corpus
  –parallel corpus containing 250,000 words
  –based on CES (Corpus Encoding Standard)
Corpus construction tools
  –corpus refining tools
  –corpus annotating tools
  –bilingual concordancer
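To illustrate what a bilingual concordancer does, here is a minimal sketch that searches the source side of sentence-aligned pairs and returns each hit with its aligned translation. It is not the tool mentioned on the slide; the function name `concordance`, the `width` parameter, and the in-memory list of pairs are all assumptions.

```python
def concordance(keyword, aligned_pairs, width=20):
    """Return (source context, aligned target sentence) for each source-side hit.

    aligned_pairs is a list of (source_sentence, target_sentence) tuples.
    width is the number of characters of context kept on each side of the keyword.
    """
    hits = []
    for src, tgt in aligned_pairs:
        pos = src.find(keyword)
        if pos == -1:
            continue  # keyword does not occur in this source sentence
        start = max(0, pos - width)
        end = pos + len(keyword) + width
        hits.append((src[start:end], tgt))
    return hits
```

A corpus encoded in CES would first have to be parsed into such pairs; that step is omitted here.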
14
Bilingual Thesaurus (1)
Goal
  –constructing a Korean-English bilingual thesaurus
Approach
  –assigning Korean words to corresponding English words in WordNet
  Korean word: “대기”
  WordNet: {region, part}, {atmosphere}, {air}
  Korean-English bilingual thesaurus: {region, part}, {atmosphere, 대기}, {air}
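The slide does not state how Korean words are matched to WordNet entries. One plausible way, assumed here purely for illustration, is to look up each Korean word's English translations and add the Korean word to every synset containing one of them. The toy `WORDNET` dict stands in for the real WordNet, and the synset ids and the function name `assign_korean_words` are hypothetical.

```python
# Toy stand-in for WordNet: synset id -> English member words.
# The ids are made up; the member words come from the slide.
WORDNET = {
    "region.n": {"region", "part"},
    "atmosphere.n": {"atmosphere"},
    "air.n": {"air"},
}

# Assumed Korean-to-English translations, restricted to the intended sense.
KO_EN = {"대기": ["atmosphere"]}


def assign_korean_words(wordnet, ko_en):
    """Return a new synset map with Korean words added to matching synsets."""
    bilingual = {sid: set(words) for sid, words in wordnet.items()}
    for korean, translations in ko_en.items():
        for sid, words in wordnet.items():
            if words & set(translations):
                bilingual[sid].add(korean)
    return bilingual
```

With an ambiguous translation list the same Korean word would land in several synsets, so in practice the assignment would need sense disambiguation or manual review.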
15
Bilingual Thesaurus (2)
Current status of the task
  –under construction

                                Korean thesaurus    WordNet
  word count                    20149               94473
  concept count (synset count)  13211               68046
  word sense count              23838               116317
16
Summary
Surmounting the language barrier
  –using bilingual resources
Korean-English bilingual resources
  –Korean-English bilingual dictionary
  –Korean-English bilingual corpus
  –Korean-English bilingual thesaurus
Our experience
  –Korean-English bilingual dictionary
  –Korean-English bilingual corpus
  –Korean-English bilingual thesaurus
17
References (1)
[Oard 98] Douglas W. Oard, "A Comparative Study of Query and Document Translation for Cross-Language Information Retrieval", the Third Conference of the Association for Machine Translation in the Americas (AMTA), Philadelphia, PA, October 1998.
[Fujii et al. 99] Atsushi Fujii, Tetsuya Ishikawa, "Cross-Language Information Retrieval for Technical Documents", Proceedings of the Joint ACL SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 29-37, 1999.
[Myaeng et al. 99] Sung Hyon Myaeng and Myung-gil Jang, "Complementing Dictionary-Based Query Translations with Corpus Statistics for Cross-Language IR", Machine Translation Summit VII, 1999.
18
References (2)
[Yang et al. 98] Yiming Yang, Jaime G. Carbonell, Ralf D. Brown, and Robert E. Frederking, "Translingual Information Retrieval: Learning from Bilingual Corpora", Artificial Intelligence (Special issue: Best of IJCAI-97), Vol. 103, pp. 323-345, 1998.
[Brown 96] Ralf D. Brown, "Example-Based Machine Translation in the Pangloss System", Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), pp. 169-174, Copenhagen, Denmark, August 5-9, 1996.
[Murata et al. 99] M. Murata, Q. Ma, K. Uchimoto, H. Isahara, "An Example-Based Approach to Japanese-to-English Translation of Tense, Aspect, and Modality", TMI'99, Chester, UK, August 23, 1999.
19
References (3)
[Shirai et al. 97] S. Shirai, F. Bond, and Y. Takahashi, "A Hybrid Rule and Example based Method for Machine Translation", Natural Language Processing Pacific Rim Symposium '97 (NLPRS-97), 1997.
[Turcato et al. 99] Davide Turcato, Paul McFetridge, Fred Popowich, Janine Toole, "A Unified Example-Based and Lexicalist Approach to Machine Translation", the 8th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-99).
[Gonzalo et al. 98] Julio Gonzalo, Felisa Verdejo, Carol Peters and Nicoletta Calzolari, "Applying EuroWordNet to Cross-Language Text Retrieval", Computers and the Humanities, Vol. 32, Nos. 2-3, pp. 73-89, 1998.
20
References (4)
[Gilarranz et al. 97] Julio Gilarranz, Julio Gonzalo and Felisa Verdejo, "An Approach to Conceptual Text Retrieval Using the EuroWordNet Multilingual Semantic Database", AAAI 97.