W3C Workshop, Beijing, 2nd of November 2005 An extension to the SSML for diacritics auto-completion R&D Centre Vocal Services Section.

Presentation transcript:


2 Regarding: Diacritics auto-completion
Plan of the presentation
● The nature of the problem
● Similarities among other languages
● Possible solutions
● Discussion

3 Diacritics
A diacritical mark or diacritic, sometimes called an accent mark, is a mark added to a letter to alter a word's pronunciation or to distinguish between similar words.
Example: Polish letters with diacritics: Ą ą Ć ć Ę ę Ł ł Ń ń Ó ó Ś ś Ź ź Ż ż
● The Polish alphabet contains 35 letters = 26 basic + 9 with diacritics
● They are pronounced differently from the letters without diacritics
● Included in ISO , Unicode, CP-1250, DOS 852…
● Not included in the 7-bit US-ASCII code page

4 Why do Polish diacritics sometimes disappear?
● They cannot be entered while typing
  o The application or hardware does not support non-US-ASCII characters
  o Improper regional settings in the OS or firmware
● Code page hell
  o The code pages all differ from one another
  o Unicode (UTF-8) is still not very popular
● Pruned on WWW-to-SMS gateways
● Somewhat hard to type
  o As an Alt Gr + letter combination on a PC keyboard („Polish programmer” variant of the US keyboard)
  o As the 5th or later letter on a key of a mobile phone keypad (key 2 sequence = „ABC2ĄĆ”)
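The loss described on this slide can be reproduced in a few lines. The following is only a minimal sketch, not something from the presentation: the function name strip_diacritics is an assumption made for illustration. It relies on Unicode decomposition, with one explicit mapping because „ł” has no canonical decomposition.

```python
import unicodedata

# "ł"/"Ł" do not decompose into a base letter plus a combining mark,
# so they need an explicit mapping (illustrative helper, not from the slides).
EXTRA = str.maketrans({"ł": "l", "Ł": "L"})

def strip_diacritics(text):
    """Return text with diacritics removed; a lossy, one-way mapping."""
    decomposed = unicodedata.normalize("NFD", text.translate(EXTRA))
    # Drop the combining marks that NFD split off from their base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("słońce"))                            # slonce
print(strip_diacritics("zęby") == strip_diacritics("żeby"))  # True
```

Two different words collapsing to the same string is what makes the reverse direction hard.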

5 Quasi-Polish text (without diacritics)
● Is not orthographically correct
● Does not conform to netiquette
● Is not, in fact, Polish
● Cannot be transformed into Polish with simple substitution rules
● Speech synthesised from this text may be incomprehensible
…but:
● Sometimes it is the only way to represent text
● Is easier to write, so it can be written faster
● Can be read quite easily by a human as if it were written correctly (because of the nature of the „human reading device”)
thus: it is widespread in Polish e-mails, SMSes, news posts and chats

6 Examples
slonce –> słońce (Eng. the Sun) - unambiguous mapping
maki –> maki (Eng. poppies, nominative plural) or mąki (Eng. flour, genitive singular)
Question: add a diacritic or not?
zeby –> zęby (Eng. teeth, nominative plural) or żeby (Eng. in order that)
Question: where to add a diacritic?
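The three examples can be checked mechanically. The sketch below is an illustration under stated assumptions, not the authors' method: it expands every letter into its possible diacritic variants and keeps the spellings found in a toy lexicon. The names VARIANTS, LEXICON and candidates are hypothetical, the lexicon holds only the words from this slide, and only lowercase input is handled.

```python
from itertools import product

# Each base letter maps to itself plus its Polish diacritic variants.
VARIANTS = {
    "a": "aą", "c": "cć", "e": "eę", "l": "lł", "n": "nń",
    "o": "oó", "s": "sś", "z": "zźż",
}

# Toy lexicon (assumption: a real system would use a full dictionary).
LEXICON = {"słońce", "maki", "mąki", "zęby", "żeby"}

def candidates(word):
    """List every lexicon word that the diacritic-free input could stand for."""
    options = [VARIANTS.get(ch, ch) for ch in word]
    spellings = ("".join(combo) for combo in product(*options))
    return sorted(s for s in spellings if s in LEXICON)

print(candidates("slonce"))  # ['słońce']
print(candidates("maki"))    # ['maki', 'mąki']
print(candidates("zeby"))    # ['zęby', 'żeby']
```

More than one candidate means the choice cannot be made at word level; it needs context, which is the point of the two questions on this slide.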

7 Other languages
● Czech, Slovak
  o The problem with diacritics is very similar to Polish
● German
  o Umlauts „ä” „ü” „ö” and the sharp „s” = ß
● Russian
  o Volapuk encoding – informal romanization used in SMSes
  o e.g.: „Ж” = „}” + „|” + „{”
● French
  o Accents strongly affecting pronunciation, e.g.: „è” „é” „ê”
  o Other diacritics: „ë” „ï” „ô” „û”
● … and many others

8 How to classify the problem?
● A new dialect?
● An alternative spelling (context dependent orthography)?
● An erroneous text that requires correction (jargon)?

9 Example: Multi-channel access to Instant Messaging
[Diagram labels: user text; visually impaired user; message; IM Server; SMS gateway; Speech Synthesis; Text Processing; mobile user; home IM user. Sample message header: From: chris, Date: 2nd Nov 05, Time: 10h15, Msg. Legend: correct text; text without diacritics.]

10 Variant 1: correction by IM server
● Do everything on the server side
- The SSML content developer takes care of correct spelling in the text sent to TTS
- Text processing (correction software) is tied to the IM server vendor, which may lead to proprietary solutions
+ TTS is given correct text, so it has no problem rendering it
No need for data exchange format standardization
[Diagram labels: Message in SSML 1.0; IM Server; Speech Synthesis; built-in Text Processing; TTS engine; proprietary Text Processing Rules]

11 Variant 2: correction by TTS engine
● IM does not do anything – it lets the TTS engine render the text
+ No additional work by the SSML content developer is required
- TTS must recognize the scope of the quasi-correct part of the text (no tags for this in current SSML)
- TTS must complete the diacritics to pronounce the text correctly
[Diagram labels: Message in SSML 1.0; IM Server; Speech Synthesis; built-in Text Processing; TTS engine; proprietary Text Processing Rules]

12 Variant 3 – use external lexicons
● Use a special lexicon file to render the text properly:
+ Quite simple and easy for the SSML developer
- The lexicon affects the whole file: correct and quasi-correct parts alike
- No context dependent rules in PLS (req. 7.3)
- No prefix/suffix morphological rules in PLS (req. 7.2)
- The lack of diacritics is not a pronunciation exception but a spelling error
[Diagram labels: Message in SSML 1.0; IM Server; Speech Synthesis; lexicon-based built-in Text Processing; TTS engine; Lexicons in PLS 1.0; Text Processing Lexicons]
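The first two drawbacks can be illustrated with a deliberately naive stand-in for a lexicon. This sketch does not use PLS itself: the dictionary LEXICON and the function apply_lexicon are hypothetical, and they model only a context-free, whole-document word substitution.

```python
import re

# Hypothetical word-level substitutions; one fixed answer per spelling.
LEXICON = {"zeby": "żeby", "maki": "mąki"}

def apply_lexicon(text):
    """Replace every word that has a lexicon entry, regardless of context."""
    return re.sub(r"\w+", lambda m: LEXICON.get(m.group(0), m.group(0)), text)

# Correctly spelled text is damaged: "czerwone maki" (red poppies) was already right.
print(apply_lexicon("czerwone maki"))  # czerwone mąki
# Without context the wrong variant is chosen: "myje zęby" (brushes teeth) was meant.
print(apply_lexicon("myje zeby"))      # myje żeby
```

A single fixed mapping per spelling cannot serve both readings, and applying it to the whole document also touches text that needed no correction.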

13 Recommendation
● Use a separate (external) correction unit for jargon
● Enclose quasi-correct text in tags
+ Still easy for SSML developers
+ The text correction software knows which part of the text should be specifically pre-processed
+ For diacritics completion an external program can be used
+ For simpler cases, just a dedicated lexicon can be used
- SSML needs to be extended
[Diagram labels: Message in enhanced SSML 1.0; IM Server; Speech Synthesis; lexicon-based built-in Text Processing; TTS engine; Lexicons in PLS 1.0; Text Processing Lexicons; Jargon Text Correction]
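A pre-processing step along these lines might look as follows. This is a sketch under loud assumptions: the transcript does not preserve the proposed markup, so the example marks quasi-correct text with the existing say-as element and a made-up interpret-as value, "x-no-diacritics", which SSML 1.0 does not define. The function correct is a toy placeholder for the external correction unit.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MARK_VALUE = "x-no-diacritics"  # hypothetical value, not part of SSML 1.0

def correct(text):
    """Placeholder for the external correction unit (toy word table only)."""
    table = {"slonce": "słońce"}
    return " ".join(table.get(word, word) for word in text.split())

def preprocess(ssml):
    """Correct only the spans marked as quasi-correct; leave the rest untouched."""
    root = ET.fromstring(ssml)
    for element in root.iter("{%s}say-as" % SSML_NS):
        if element.get("interpret-as") == MARK_VALUE and element.text:
            element.text = correct(element.text)
    return root

doc = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="pl">'
    'slonce <say-as interpret-as="x-no-diacritics">slonce</say-as>'
    '</speak>'
)
root = preprocess(doc)
print(root.text)                                   # unmarked text, unchanged
print(root.find("{%s}say-as" % SSML_NS).text)      # marked span, corrected
```

Whether the marker should be an interpret-as value or a new element is left open in the presentation; the sketch picks one only to keep the example short.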

14 Example of SSML document (JSP)
... User writes: The message has been sent : at

15 Another example
... User writes: The message has been sent : at

16 Conclusions
● In modern communication services people use a specific language that frequently does not conform to orthographic rules (e.g. it lacks diacritics)
● Applying standard phonetization rules to erroneous text may result in incomprehensible speech
● For the best rendering results, TTS should have complete information about the text
● One SSML document can contain both correct and erroneous text; there is a need to mark it
● Correcting erroneous text can be context and application dependent

17 Questions and doubts
1. How many types of erroneous input should we consider?
2. How to handle jargon evolution?
3. How does the input device affect the text?
4. A new interpret-as value or a new tag?
5. Scope and structure of the new tag (if applicable)?
6. Will future TTS be software composed of a complex text processor and an acoustic synthesis engine, or will we be able to freely choose these modules from different vendors?

18 Dziękujemy (Thank you)

19 Prepared by:
Name: Przemyslaw Zdroik
Division: Vocal Services Section
Department: TP S.A. Research and Development Centre
Phone#: (+ 48)
Name: Krzysztof Majewski
Division: Vocal Services Section
Department: TP S.A. Research and Development Centre
Phone#: (+ 48)