The LC-STAR project (IST-2001-32216)

Presentation transcript:

Creation of Lexica for Speech Centered Translation
Ute Ziegenhain (Siemens AG), Asuncion Moreno (UPC), Nuria Castell (UPC)

Objectives:

Track I (duration 2 years)
- Specification and creation of large word lists and lexica suited for flexible vocabulary speech recognition and high quality speech synthesis, covering a wide range of domains

Track II (duration 3 years)
- Investigation of speech centered translation technologies, focusing on requirements concerning language resources (LR)
- Specification and creation of corpora and lexica needed for speech centered translation
- Building a demonstrator for speech-to-speech translation
- Demonstration of language transfer in Catalan, Spanish and US-English

Project partners:
- 4 industrial partners
- 2 partners from universities
- 1 external partner

Two approaches:
- Bilingual word-by-word translation lexica with enriched morphological information
  - Advantage: reduction of word error rate (WER)
  - Disadvantage: for more highly inflected languages, lexicon size increases by a factor of at least 7; effort varies greatly between languages -> only provided for Catalan and Spanish, for statistical experiments
- 'Phrasal' lexica consisting of bilingual short phrases typically found in a tourist domain environment
  - Advantages: reduction of out-of-vocabulary (OOV) words, better alignment and lexicon model
  - Disadvantage: selection of adequate corpora

Source Corpora

- US-English corpus from Verbmobil (112,541 tokens): orthographic transcriptions of telephone conversations in US-English for an appointment scheduling domain
- US-English corpus from the TALP corpus (408,452 tokens): US-English sentences translated from orthographic transcriptions of telephone conversations in Spanish and Catalan for a tourist domain
- Web corpus (2,640,562 tokens): text corpora downloaded from tourist web pages in US-English
- Phrasal corpus: 1,500 expressions in US-English selected from tourist phrase books

Creation of Reference Corpus

Procedure:
1. Create text corpora in a reference language (US-English) in a given domain.
2. Select the most frequent content words (i.e. nouns, verbs, adjectives, etc.) to create a representative word list of the domain.
3. For each word in the word list, provide the syntactic context in which the word is embedded.
4. Cut the sentence down to a segment that contains the word. Segments are usually shortened to nominal phrases (for nouns and adjectives) or to subject plus verb plus short complement (for verbs).
5. Manually correct the phrases (e.g. typing and orthographic errors, meaningless or offensive phrases, proper names, etc.).
6. Add a set of typical phrasal expressions commonly used in the semantic domain. The set is manually chosen from several tourist textbooks.

-> Result: 'phrasal' reference lexicon consisting of short phrases
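Steps 2 to 4 of the procedure can be illustrated with a minimal Python sketch. It rests on loud assumptions: the small FUNCTION_WORDS stop list stands in for a real part-of-speech filter, and a fixed token window stands in for the syntactically motivated shortening described in step 4. All names and the toy sentences are hypothetical and are not taken from the LC-STAR tooling.

```python
from collections import Counter
import re

# Hypothetical stop list standing in for a real part-of-speech filter.
FUNCTION_WORDS = {"a", "an", "the", "is", "to", "of", "in", "at", "for", "and", "i", "we"}

def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

def frequent_content_words(sentences, top_n):
    """Step 2: rank non-function words by corpus frequency."""
    counts = Counter(
        tok for s in sentences for tok in tokenize(s) if tok not in FUNCTION_WORDS
    )
    return [word for word, _ in counts.most_common(top_n)]

def segments_for(word, sentences, window=2):
    """Steps 3-4: cut each sentence down to a short span around the word.

    Assumption: a fixed window of tokens, not a parsed nominal or verbal phrase.
    """
    found = []
    for s in sentences:
        toks = tokenize(s)
        for i, tok in enumerate(toks):
            if tok == word:
                found.append(" ".join(toks[max(0, i - window): i + window + 1]))
    return found

# Toy corpus, invented for illustration only.
corpus = [
    "We need a double room for two nights",
    "Is the double room available",
    "The hotel is near the station",
]
word_list = frequent_content_words(corpus, top_n=2)
lexicon_candidates = {w: segments_for(w, corpus) for w in word_list}
```

The candidates produced this way would still need the manual correction of step 5 before entering the reference lexicon.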

Format

Textual format will be used with XML-based mark-up in accordance with a common and language specific Document Type Definition (DTD).

Advantages of using XML:
- Widely known technique
- Many tools supporting it are available
- Supports Unicode (useful for languages with non-Latin writing systems)
- Allows easy and concise representation of one-to-many relations (multiple translations, multiple PoS, etc.)
- Easily definable and flexible syntax
- Easy well-formedness tests are possible using publicly available tools
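As a rough illustration of the last two advantages, the sketch below runs a well-formedness test with Python's standard xml.etree.ElementTree module and reads a one-to-many relation from an entry. The element and attribute names (entry, source, target, lang, pos, lemma) are invented for this example and are not the LC-STAR DTD. Note that this parser checks well-formedness only; it does not validate a document against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: element and attribute names are assumptions made for
# illustration, not the names defined in the LC-STAR DTD.
WELL_FORMED = """<entry>
  <source lang="en-US">double room</source>
  <target lang="es" pos="NOM" lemma="habitación">habitación doble</target>
  <target lang="ca" pos="NOM" lemma="habitació">habitació doble</target>
</entry>
"""

# The <source> element is never closed, so this text is not well-formed.
BROKEN = "<entry><source>double room</entry>"

def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text.encode("utf-8"))
        return True
    except ET.ParseError:
        return False

# One source phrase maps to several target segments (one-to-many relation).
root = ET.fromstring(WELL_FORMED.encode("utf-8"))
targets = [(t.get("lang"), t.text) for t in root.findall("target")]
```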

Content

Set of segments:
- Source language segment: orthography of the source phrase
- Target language segment: target language translation + orthography, one PoS (NOM, VER, ADJ, PRO…) and lemma
- Additional information possible (e.g. tags for foreign words, etc.)

Example: (not reproduced in this transcript)

Partners and Languages

Translation Methodology

1. Translate as literally as possible from the source text, while preserving syntactic correctness, semantic meaning and naturalness.
2. Idiomatic expressions will be translated and marked as such.
3. Ambiguities: select the most plausible translation with respect to the semantic domain; otherwise provide more than one translation.
4. Proper nouns are marked, and translated only when the translated form is used in the target language (e.g. AIDS -> SIDA).
5. Punctuation marks are separated from words and should be kept.
6. Digits should be kept unless a transcription is required in the target language.
7. Abbreviations should be expanded or kept abbreviated, depending on usage in the target language.
8. Foreign words can optionally be labeled with a tag.
9. Parts of words (e.g. due to false starts, etc.): if the reference phrase does not provide enough context to disambiguate, generate the partial target word followed by the + mark.
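Rule 5 (punctuation marks separated from words but kept) can be sketched with a single regular expression. This is an assumption about how such a step might be automated, not the project's actual tooling; the function name is hypothetical. As written, the pattern also splits apostrophes and hyphens away from words, which a real pipeline would probably handle separately.

```python
import re

def separate_punctuation(phrase):
    """Split a phrase into word tokens and stand-alone punctuation tokens.

    Every punctuation mark is kept as its own token rather than dropped.
    """
    # \w+ matches a run of letters or digits (Unicode-aware for str input);
    # [^\w\s] matches any single character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", phrase)

tokens = separate_punctuation("¿Dónde está la estación?")
```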

First Experiments and Preliminary Results

Approach:
- Phrases occurring in all three languages are added to the training corpus
- Training corpus consists of selected dialogues from the Verbmobil and TALP tourism corpora

Preliminary Results:
- Reduced OOV rate (13% relative for Spanish and 23% for Catalan)
- Overall better translation of certain phrases from the tourist domain
- No significant change in translation error rates yet

References:
- Asuncion Moreno et al. (2004): Language Independent Specification of LR for Translation. D5.5 of the LC-STAR project, IST-2001-32216, to be published.
- Nicola Ueffing (2004): Results on Different Structured LR for Speech-to-Speech Translation. D4.5 of the LC-STAR project, IST-2001-32216, to be published.
- Maja Popović, Hermann Ney (2004): Towards the Use of Word Stems & Suffixes for Statistical Machine Translation. LREC 2004, Lisbon.

Contact: Ute Ziegenhain,
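For readers unfamiliar with the measure, the sketch below shows one common way to compute an OOV rate and its relative reduction after adding phrases to the training vocabulary. It assumes the slide's "relative" figures are reductions of this kind. The vocabulary and test tokens are toy data invented for illustration and have no connection to the LC-STAR experiments.

```python
def oov_rate(test_tokens, vocabulary):
    """Fraction of test tokens that do not occur in the training vocabulary."""
    unknown = sum(1 for tok in test_tokens if tok not in vocabulary)
    return unknown / len(test_tokens)

def relative_reduction(before, after):
    """Relative decrease of a rate, e.g. 0.4 -> 0.2 is a 50% relative reduction."""
    return (before - after) / before

# Toy data, invented for illustration only.
test_tokens = "the double room is free".split()
baseline_vocab = {"the", "room", "is"}
# Vocabulary after adding (hypothetical) phrasal lexicon entries to training.
extended_vocab = baseline_vocab | {"double"}

rate_before = oov_rate(test_tokens, baseline_vocab)   # 2 of 5 tokens unknown
rate_after = oov_rate(test_tokens, extended_vocab)    # 1 of 5 tokens unknown
reduction = relative_reduction(rate_before, rate_after)
```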