The Application of Machine Translation in CADAL Huang Chen, Chen Haiying Zhejiang University Libraries, Hangzhou, China 2006. 11. 20.

Presentation transcript:


About the CADAL The China-America Digital Academic Library (CADAL) Project was launched by Chinese and US scientists with the goal of digitizing one million books for a digital library.

The Aim of the CADAL The ideal information service should provide the knowledge that the user seeks, as well as a solution to the user's problem. CADAL not only provides digitized books, but also processes the digitized resources to extract relevant information and provide more services to the user. Machine translation (MT) is a service that CADAL intends to adopt to provide bilingual or multilingual translations.

Machine Translation (MT) MT is the process of automatically translating text from one natural language into another. The software that performs this task is called a machine translation system.

Category of MT Knowledge-based MT (KBMT) Specialists construct linguistic rules which cover a wider domain than the training corpus. These rules and their resulting systems tend to make more sense to human beings and can be adjusted quickly.

Category of MT Example-based MT (EBMT) Given an input passage S in a source language and a bilingual text archive, in which source-language passages are stored aligned with their translations into a target language, S is compared with the source-language side of the archive. The closest match for passage S is selected, and the translation of this closest match, the passage T, is accepted as the translation of S.
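The closest-match lookup described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in CADAL or in any system named in the presentation: the three-entry archive, the function names, and the choice of a word-level `difflib.SequenceMatcher` ratio as the similarity measure are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical toy bilingual archive: (source passage, aligned translation).
ARCHIVE = [
    ("the library opens at nine", "图书馆九点开门"),
    ("the book is on the shelf", "书在书架上"),
    ("where is the reading room", "阅览室在哪里"),
]

def similarity(a: str, b: str) -> float:
    """Word-level similarity ratio in [0, 1] (an assumed measure)."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

def translate_by_example(source: str, archive=ARCHIVE):
    """Return (closest stored source passage, its translation, similarity)."""
    best_src, best_tgt = max(archive, key=lambda pair: similarity(source, pair[0]))
    return best_src, best_tgt, similarity(source, best_src)

# The input differs from the stored passage in one word, so the stored
# translation is returned unchanged, exactly as the slide describes.
match, translation, score = translate_by_example("the book is on the table")
print(match, translation, round(score, 2))
```

A real EBMT system would usually go further and adapt the retrieved translation to the parts of the input that differ; the sketch stops at retrieval because that is the step the slide describes.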

Category of MT Statistical-based MT (SBMT) The translation is based on the statistical probabilities of words in the same text in two languages (parallel corpora). When such texts exist in two languages, the word probabilities can be estimated by counting, and the translation system can be "taught to translate" by using these probabilities.
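To make the idea of estimating word probabilities from a parallel corpus concrete, here is a minimal sketch that uses the expectation-maximization procedure of IBM Model 1. The slide does not name a specific model, so the choice of Model 1, the three-sentence corpus, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy parallel corpus: (English words, Chinese words).
CORPUS = [
    ("red book".split(), "红 书".split()),
    ("blue book".split(), "蓝 书".split()),
    ("red house".split(), "红 房子".split()),
]

def train_ibm1(corpus, iterations=30):
    """Estimate t[(tgt_word, src_word)] = P(tgt_word | src_word) with EM."""
    tgt_vocab = {w for _, tgt in corpus for w in tgt}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in corpus:
            for w_t in tgt:
                # E-step: share each target word among the source words
                # of the same sentence pair, in proportion to t.
                norm = sum(t[(w_t, w_s)] for w_s in src)
                for w_s in src:
                    frac = t[(w_t, w_s)] / norm
                    count[(w_t, w_s)] += frac
                    total[w_s] += frac
        # M-step: renormalize the expected counts into probabilities.
        for (w_t, w_s), c in count.items():
            t[(w_t, w_s)] = c / total[w_s]
    return dict(t)

def best_translation(t, src_word):
    """Most probable target word for a source word under the table t."""
    candidates = {tgt: p for (tgt, s), p in t.items() if s == src_word}
    return max(candidates, key=candidates.get)

table = train_ibm1(CORPUS)
for word in ("red", "book", "blue", "house"):
    print(word, best_translation(table, word))
```

Even on this tiny corpus the repeated co-occurrence of "book" with "书" and of "red" with "红" lets the remaining words be resolved by elimination, which is the sense in which a system is "taught" from counts.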

Application of MT in CADAL CADAL is making use of MT in a number of ways: important information, such as a book's title or authors, is translated manually, or first translated by MT systems and then verified manually; as the cornerstone of CADAL's system, MT provides instant services such as translation of contents indexed by XML; and MT is integrated with other services, such as multilingual information retrieval and special words retrieval.

Bilingual service engine We applied a bilingual service engine to support metadata retrieval between English and Chinese. This engine provides instant translation of book profiles.

A book profile in both Chinese and English

MT evaluation in CADAL We evaluated a number of existing MT systems. These include systems developed by IBM, Carnegie Mellon University (CMU), USC/ISI, RWTH Aachen University, Microsoft (Redmond), and the Institute of Computing Technology of the Chinese Academy of Sciences.

Results of evaluation Results show that the MT systems created by RWTH Aachen University, CMU and ISI outperform even SYSTRAN. RWTH Aachen University adopted the SBMT model and generalized the traditional noisy-channel paradigm into a maximum entropy model. Their MT system also extended the word-based alignment model to a phrase-based alignment model.
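The step from the noisy-channel paradigm to a maximum entropy model can be written out as follows. The slides give no notation, so the symbols are assumptions following the common convention: f is the source sentence, e a candidate translation, h_m are feature functions and λ_m their weights.

```latex
% Noisy-channel (source-channel) decision rule:
% a language model Pr(e) combined with a translation model Pr(f | e).
\hat{e} = \arg\max_{e} \Pr(e)\,\Pr(f \mid e)

% Maximum entropy (log-linear) model over M feature functions:
\Pr(e \mid f) =
  \frac{\exp\left(\sum_{m=1}^{M} \lambda_m h_m(e, f)\right)}
       {\sum_{e'} \exp\left(\sum_{m=1}^{M} \lambda_m h_m(e', f)\right)},
\qquad
\hat{e} = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f)
```

The noisy-channel rule is recovered as the special case M = 2 with h_1(e, f) = log Pr(e), h_2(e, f) = log Pr(f | e) and λ_1 = λ_2 = 1, which is why the log-linear form counts as a generalization: further features can be added and their weights tuned.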

Results of evaluation Mega2RADD by CMU integrates SBMT with EBMT through a translation engine and provides the optimized translation result. Re2Write by ISI takes the IBM-4 statistical model as its prototype; the translation quality is improved by adding grammar analysis and KBMT. The models used and the quality improvements in those systems show that a single translation strategy, whether rule-based or statistical, is only a partial solution, and that integrating multiple translation strategies is the common feature of those systems.
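One simple way to integrate several strategies is to let each engine propose a candidate and keep the one with the highest score. The sketch below illustrates that idea only; it does not reproduce how Mega2RADD or Re2Write combine their components. The engine functions, the `Candidate` type, the toy lexicons and the scoring rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    engine: str   # which engine produced the candidate
    text: str     # proposed translation
    score: float  # engine's own confidence in [0, 1] (assumed comparable)

Engine = Callable[[str], Candidate]

def example_based(source: str) -> Candidate:
    """Toy EBMT engine: confident only on an exact translation-memory hit."""
    memory = {"digital library": "数字图书馆"}
    if source in memory:
        return Candidate("ebmt", memory[source], 1.0)
    return Candidate("ebmt", "", 0.0)

def statistical(source: str) -> Candidate:
    """Toy word-by-word engine: confidence scales with lexicon coverage."""
    lexicon = {"digital": "数字", "library": "图书馆", "book": "书"}
    words = source.split()
    known = [lexicon[w] for w in words if w in lexicon]
    coverage = len(known) / len(words) if words else 0.0
    return Candidate("sbmt", "".join(known), 0.8 * coverage)

def select_best(source: str, engines: List[Engine]) -> Candidate:
    """Run every engine and return the highest-scoring candidate."""
    return max((engine(source) for engine in engines), key=lambda c: c.score)

engines = [example_based, statistical]
print(select_best("digital library", engines))  # exact memory hit wins
print(select_best("digital book", engines))     # falls back to the lexicon
```

The design choice worth noting is that selection relies on scores being comparable across engines; in practice that usually requires a separate scoring or calibration step rather than trusting each engine's own confidence.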

MT strategy in CADAL In light of the foregoing evaluation and current research in MT, we believe that a hybrid translation strategy is the most appropriate for MT in CADAL. We intend to collaborate with CMU, using their Mega2RADD system as the basic framework and adopting the idea of RWTH Aachen University, which is to generalize the source-channel paradigm into a maximum entropy model.

MT strategy in CADAL Under the framework of multiple engines, CADAL will take mtSDK as the standard to provide translation services at different levels. Between fully automatic machine translation and human translation lie human-assisted machine translation and machine-assisted human translation, to which CADAL will pay more attention. Human intervention is allowed in order to improve translation quality in CADAL.

Conclusions CADAL will adopt multiple translation strategies, including rule-based, example-based and statistics-based strategies; manage the various kinds of information used during translation by employing an object-oriented, multi-type database; and provide a user interface which allows manual intervention in the resulting MT output. In order to obtain the linguistic resources required by KBMT, CADAL will also pay attention to constructing an ontology-based word library, drawing on Semantic Web research.

Thank you!