WP4: Normalization of Transcriptions. From Transcriptions to Subtitles Erik Tjong Kim Sang University of Antwerp.

Subtasks of this work package
Data collection and automatic alignment
Input/output specification
Automatic subtitling: statistical approach
Automatic subtitling: linguistic approach
The third and the fourth task will be combined. The project will produce a hybrid system.

Data Collection and Automatic Alignment
Goal: collecting training data for other parts of this work package and developing, testing and applying a sentence alignment method to this data.
Output: alignment software and parallel corpus.

Example
Over deze man gaat het, Noël Slangen, communicatieadviseur van premier Verhofstadt, en eigenaar van een communicatiebedrijf.
Noel Slangen is communicatie-adviseur van Verhofstadt. Hij heeft een communicatiebedrijf

Why is sentence alignment not trivial?
A single sentence in one file may need to be combined with more than one sentence in the corresponding file.
Sentences in the transcribed text may not be present in the subtitle text for space reasons.
Parts of the text in either of the files may be missing (interviews, foreign language, on-screen lists).

Alignment
The standard method for aligning translated sentences (Gale and Church) does not work because there are many gaps in the text.
We used a method which estimates the probability that sentences belong together based on the words they contain.
The system benefits from making several passes over the data.
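The idea of word-based alignment in several passes can be illustrated with a minimal sketch. The slides do not say how the probability is estimated, so the Dice coefficient over word sets used here is only a stand-in, and the function names, the thresholds and the restriction to one-to-one links are assumptions made for illustration.

```python
import re

def word_set(sentence):
    """Lower-cased alphanumeric tokens of a sentence, as a set."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))

def overlap_score(a, b):
    """Dice coefficient over word sets (assumed stand-in for a probability)."""
    wa, wb = word_set(a), word_set(b)
    if not wa or not wb:
        return 0.0
    return 2 * len(wa & wb) / (len(wa) + len(wb))

def align(transcript, subtitles, thresholds=(0.8, 0.5, 0.3)):
    """Return sorted (transcript_index, subtitle_index) links.

    Each pass lowers the acceptance threshold, but a new link must lie
    between the links accepted so far, so confident links found early
    narrow the search space of later passes. Only 1:1 links are made.
    """
    links = {}  # transcript index -> subtitle index
    for threshold in thresholds:
        for i, sentence in enumerate(transcript):
            if i in links:
                continue
            # Subtitle window allowed by the nearest links before and after i.
            lo = max((j for k, j in links.items() if k < i), default=-1)
            hi = min((j for k, j in links.items() if k > i),
                     default=len(subtitles))
            best, best_score = None, 0.0
            for j in range(lo + 1, hi):
                score = overlap_score(sentence, subtitles[j])
                if score > best_score:
                    best, best_score = j, score
            if best is not None and best_score >= threshold:
                links[i] = best
    return sorted(links.items())

# Invented example: the second transcript sentence has no subtitle.
transcript = [
    "De minister bezoekt vandaag de haven van Antwerpen.",
    "Dit interview is niet ondertiteld.",
    "Wel, morgen wordt het overal zonnig en warm.",
]
subtitles = [
    "De minister bezoekt de haven van Antwerpen.",
    "Morgen wordt het zonnig en warm.",
]
print(align(transcript, subtitles))  # [(0, 0), (2, 1)]
```

Unlike a length-based aligner, this scheme simply leaves a sentence unlinked when nothing in the other file resembles it, which is how it tolerates gaps.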

Collecting data
Teletext subtitles have been stored daily since December 2001 for the main Flemish news broadcast (VRT 19:00) and the Flemish soap Thuis.
Some transcripts of these programmes have been supplied to us by the VRT.
A year of Dutch news (NOS 20:00), both subtitles and autocues, has been donated by the University of Twente in The Netherlands.

Processing the data
All files, except the VRT news transcriptions (HTML), have been converted to XML.
Punctuation signs have been separated from words and sentence boundaries have been marked.
Sentences in available transcript files have been aligned with corresponding subtitles.
All alignment structures have been manually checked.
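The preprocessing step of separating punctuation and marking sentence boundaries could look roughly like the sketch below. The regular expression and the rule that a sentence ends at ".", "!" or "?" are assumptions; the slides do not describe the actual tokenizer.

```python
import re

# A word is a run of letters/digits, optionally joined by hyphens or
# apostrophes (Sankt-Vith, collega's); any other non-space symbol is
# treated as a separate punctuation token.
TOKEN = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN.findall(text)

def split_sentences(tokens, enders=(".", "!", "?")):
    """Group tokens into sentences, closing one at each end marker."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in enders:
            sentences.append(current)
            current = []
    if current:  # trailing material without a final end marker
        sentences.append(current)
    return sentences

text = "Noel Slangen is communicatie-adviseur van Verhofstadt. Hij heeft een communicatiebedrijf"
for sentence in split_sentences(tokenize(text)):
    print(" ".join(sentence))
```

A rule this simple would mis-split abbreviations and numbers with a decimal point, which is one reason manual checking of the result remains useful.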

Alignment software performance
Precision and recall have been computed for pairs of sentences. Fβ=1 is the harmonic mean of these.
Corpus      Precision   Recall   Fβ=1
VRT         93.5%       85.3%    89.2
NOS …       …%          87.3%    87.4
NOS …       …%          86.8%    88.4
Thuis …     …%          45.1%    56.5
Thuis …     …%          95.6%    93.5
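As a worked illustration of these measures, the sketch below scores a set of predicted sentence pairs against manually checked pairs. The function names and the representation of a pair as a tuple of indices are assumptions; the check at the end reuses the VRT figures (93.5% precision, 85.3% recall).

```python
def f_beta1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def score_pairs(predicted, gold):
    """Precision, recall and F for predicted vs. gold sentence pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_beta1(precision, recall)

# Invented pairs: (transcript sentence index, subtitle sentence index).
gold = {(0, 0), (1, 1), (2, 2), (3, 3)}
predicted = {(0, 0), (1, 1), (2, 3)}
print(score_pairs(predicted, gold))

# The VRT precision and recall reproduce the reported F value.
print(round(100 * f_beta1(0.935, 0.853), 1))  # 89.2
```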

Corpus size
Sizes have been measured in number of words. A word has been defined as a string containing at least one of the characters in [A-Za-z0-9].
Corpus       Parallel   Comp. rate   Subtitles
VRT          189,…      …%           993,102
NOS …        …,…        …%           -
NOS …        …,…        …%           -
Thuis …      …,…        …%           -
Thuis 2002   5,…        …%           200,358
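The word definition above translates directly into a counting routine. In the sketch below, the function names are hypothetical, and so is the reading of the compression rate as subtitle words divided by transcript words; the slides do not define it.

```python
import re

# A token counts as a word if it contains at least one of [A-Za-z0-9];
# isolated punctuation tokens are therefore not counted.
WORD_CHAR = re.compile(r"[A-Za-z0-9]")

def count_words(text):
    """Count whitespace-separated tokens that qualify as words."""
    return sum(1 for token in text.split() if WORD_CHAR.search(token))

def length_ratio(transcript, subtitle):
    """Subtitle length relative to transcript length (assumed definition)."""
    return count_words(subtitle) / count_words(transcript)

transcript = "Wel , onze collega's in Sankt-Vith die hebben het onderzoek daar afgesloten ."
subtitle = "Onze collega's in Sankt-Vith hebben het onderzoek daar afgesloten ."
print(count_words(transcript), count_words(subtitle))  # 11 9
print(round(length_ratio(transcript, subtitle), 3))    # 0.818
```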

Automatic Subtitling from Transcriptions: Statistical Approach
A baseline experiment has been performed.
A memory-based learner was trained to predict subtitle words given the words in transcripts.
The learner obtained an accuracy of 71.7% on two files of Thuis, an improvement over the baseline strategy of keeping all words (67.3%).
Problem: the learner required word-aligned texts.
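To make the setup concrete, here is a toy memory-based (nearest-neighbour) learner that labels each transcript word as keep or drop, together with the keep-all baseline. The feature window of previous, current and next word, the overlap metric, the label names and the training sentences are all assumptions for illustration; they are not the features or the software used in the experiment.

```python
from collections import Counter

def features(words, i):
    """Previous, current and next word (lower-cased) around position i."""
    prev = words[i - 1] if i > 0 else "<s>"
    nxt = words[i + 1] if i + 1 < len(words) else "</s>"
    return (prev.lower(), words[i].lower(), nxt.lower())

class MemoryBasedLearner:
    """Stores all training instances; classifies by the closest ones."""

    def __init__(self):
        self.memory = []  # list of (feature tuple, label)

    def train(self, examples):
        for words, labels in examples:
            for i, label in enumerate(labels):
                self.memory.append((features(words, i), label))

    def predict(self, words):
        labels = []
        for i in range(len(words)):
            query = features(words, i)
            # Overlap metric: number of matching feature values.
            scored = [(sum(a == b for a, b in zip(query, feats)), label)
                      for feats, label in self.memory]
            best = max(score for score, _ in scored)
            # Majority vote among all instances at the best overlap.
            votes = Counter(label for score, label in scored if score == best)
            labels.append(votes.most_common(1)[0][0])
        return labels

def accuracy(predicted, gold):
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Invented word-aligned training data: one label per transcript word.
train = [
    ("Wel , dat is goed .".split(),
     ["drop", "drop", "keep", "keep", "keep", "keep"]),
    ("Ja , hij die komt morgen .".split(),
     ["drop", "drop", "keep", "drop", "keep", "keep", "keep"]),
]
learner = MemoryBasedLearner()
learner.train(train)

test_words = "Wel , onze collega's die hebben het afgesloten .".split()
gold = ["drop", "drop", "keep", "keep", "drop", "keep", "keep", "keep", "keep"]
predicted = learner.predict(test_words)
print(" ".join(w for w, l in zip(test_words, predicted) if l == "keep"))
print(accuracy(predicted, gold), accuracy(["keep"] * len(gold), gold))
```

The per-word labels are exactly what makes word-aligned text a requirement: without an alignment there is no way to tell which transcript words survived into the subtitle.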

Output example
Wel, onze collega’s in Sankt-Vith die hebben het onderzoek daar afgesloten.
Onze collega’s in Sankt-Vith hebben het onderzoek daar afgesloten.
onze collega’s in Sankt-Vith die hebben het onderzoek daar afgesloten.

What will the subtitle generator use?
Automatically generated word-class information and phrase boundaries.
Significance scores for words, word classes and phrases, computed from the ATraNoS corpus and other Dutch text.
Antwerp has software available for the first (word classes and phrase boundaries), but most of it is for English.
A module for Dutch named-entity recognition has been developed in this project year. Other modules will follow.

Future work
Expanding the size of the corpora, provided that more transcriptions become available.
Adapting the Antwerp shallow parsing software for Dutch.
Adding shallow parsing information to the corpora.
Developing a significance scoring system for words, word classes and phrases.
Improving the summarization system.

Publications 2002
Memory-Based Shallow Parsing. In Journal of Machine Learning Research, volume 2 (March).
Introduction to the CoNLL-2002 shared task: Language-Independent Named Entity Recognition. In Proceedings of CoNLL-2002, Taipei, Taiwan.
Memory-Based Named Entity Recognition. In Proceedings of CoNLL-2002, Taipei, Taiwan, 2002.