1 www.ntnu.no 2008-05-29 Rundkast at LREC 2008, Marrakech
RUNDKAST: An Annotated Norwegian Broadcast News Speech Corpus
Ingunn Amdal, Ole Morten Strand, Jørn Almberg, and Torbjørn Svendsen
LREC 2008

2 Overview
Purpose of Rundkast
An overview of the database Rundkast
Structure of annotation
Orthographic transcription
Broad phonetic annotation

3 Purpose of Rundkast
Databases of broadcast news can be used for a number of research topics in speech technology, such as:
Supplement to existing databases of read speech for training and testing automatic speech recognition and speaker adaptation.
Research on recognition of spontaneous speech.
Research on automatic indexing of audio data.
Research on topic and/or speaker segmentation.
Research on speech/non-speech detection (e.g. background music).
International research cooperation involving speech technology for broadcast news applications.
A corpus of this kind is necessary for language technology research, but has not been available for Norwegian.

4 Overview of Rundkast
http://www.iet.ntnu.no/projects/rundkast/
Database of 77 hours of radio broadcast news from the Norwegian Broadcasting Corporation (NRK):
Read and spontaneous speech, as well as spontaneous dialogs and multi-party discussions
Large variation between speakers, speaking styles and topics
Speaker turns may be rapid and several speakers may talk simultaneously
Recording quality ranges from studio to telephone (mobile, satellite, etc.)
Frequent occurrences of background noise, jingles, music and audio illustrations
Funded by the Norwegian University of Science and Technology (NTNU)

5 Structure of annotation
Rundkast is hierarchically organized and orthographically annotated:
Name of programme, type and date
Name of speaker (if known) and dialect (5 regions)
Type of speech: spontaneity, channel, recording quality
Segmented into speaker turns of approx. 2-5 seconds
Orthographic transcription (standard Norwegian)
Labels for noise (speaker noise, background noise, etc.)
Labels for pronunciation mistakes, foreign words, unintelligible speech, etc.
~70 hrs of work per hour of recording
Transcriber used for annotation: a "standard" tool

6 Hierarchy of annotation levels
[Figure: one episode file divided at three annotation levels]
Level 1, section: report, filler, nontrans, report
Level 2, speaker turn: speaker 1, speaker 2, no speaker, speaker 1
Level 3, segment: [i] blah blah ... more blah ... [lp]; [b-] noisy blah [-b] ...
Levels: 1 = section, 2 = speaker turn, 3 = segment
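The three levels can be pictured as nested records. The following is a minimal sketch only: the class names, field names and time stamps are illustrative assumptions, not the corpus's actual file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """Level 3: a short stretch of speech with its orthographic transcription."""
    start: float  # seconds from the start of the episode file (made-up values below)
    end: float
    text: str     # may contain text codes such as [i] or [lp]


@dataclass
class SpeakerTurn:
    """Level 2: consecutive segments from one speaker (None = no speaker)."""
    speaker: Optional[str]
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Section:
    """Level 1: a stretch of the episode with one section type."""
    kind: str  # e.g. "report", "filler" or "nontrans"
    turns: List[SpeakerTurn] = field(default_factory=list)


def iter_segments(sections):
    """Yield (section kind, speaker, segment) for every segment in an episode."""
    for section in sections:
        for turn in section.turns:
            for segment in turn.segments:
                yield section.kind, turn.speaker, segment


# Hypothetical episode loosely modelled on the figure.
episode = [
    Section("report", [
        SpeakerTurn("speaker 1", [Segment(0.0, 3.2, "[i] blah blah"),
                                  Segment(3.2, 6.0, "more blah [lp]")]),
        SpeakerTurn("speaker 2", [Segment(6.0, 9.5, "[b-] noisy blah [-b]")]),
    ]),
    Section("filler", [SpeakerTurn(None, [])]),
]

flat = list(iter_segments(episode))
```

Flattening the hierarchy this way keeps the section type and speaker attached to each segment, which is what training or indexing code would typically need.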

7 Orthographic transcription
Segments, the lowest level in the annotation hierarchy, are transcribed orthographically.
Orthographic transcription of spoken language is a challenge, especially for Norwegian.
Using dialect is increasingly accepted, even in official settings.
The majority of RUNDKAST does not conform to any standard pronunciation.
The aim of the conventions for the orthographic transcription in RUNDKAST is to minimize uncertainty about pronunciations and facilitate consistency.

8 Orthographic transcription: Main conventions
Words are transcribed with the written forms closest to actual pronunciations.
A limited number of interjections are allowed.
Text codes are used to mark mispronunciations, truncations, and unknown words.
Numbers and symbols are written out as words.
Abbreviations are not used.
Punctuation marks are restricted to comma, period, and question mark.
Space is used between spelled letters, also when acronyms have spelled pronunciation.
Capital letters are used in proper names, spellings, and acronyms, but not at the start of sentences.
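Some of these conventions lend themselves to automatic checking. The sketch below is an illustration under stated assumptions, not the project's validation tool: it assumes text codes are bracketed tags like those on the hierarchy slide, and it covers only the rules on numbers, punctuation and spelled letters.

```python
import re

# Assumption: text codes look like the bracketed tags [i], [lp], [b-] ... [-b].
TEXT_CODE = re.compile(r"\[[^\]]*\]")
ALLOWED_PUNCTUATION = {",", ".", "?"}


def check_conventions(segment_text):
    """Return a list of convention violations found in one segment."""
    problems = []
    text = TEXT_CODE.sub(" ", segment_text)  # ignore text codes

    if any(ch.isdigit() for ch in text):
        problems.append("digits found: numbers should be written out as words")

    # Everything that is not a letter, space or allowed mark is reported.
    # Hyphens and apostrophes are flagged too; whether the real conventions
    # allow them inside words is not stated on the slide.
    bad = sorted({ch for ch in text
                  if not ch.isalnum() and not ch.isspace()
                  and ch not in ALLOWED_PUNCTUATION})
    if bad:
        problems.append("disallowed symbols: " + " ".join(bad))

    # Spelled acronyms should have spaces between letters, e.g. "N R K".
    # A run of capitals may still be a legitimate acronym pronounced as a
    # word, so this is reported only as a possible violation.
    if re.search(r"\b[A-ZÆØÅ]{2,}\b", text):
        problems.append("possible unspaced acronym")

    return problems


bad_example = check_conventions("NRK sender 3 programmer!")
good_example = check_conventions("[i] N R K sender tre programmer, ikke sant?")
```

The first example trips all three checks; the second follows the conventions and produces no messages.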

9 Example annotation in Transcriber

10 Broad phonetic annotation
Part of the data was to be phonetically annotated
– Use for low-level experiments in ASR (new methods), a smaller Norwegian counterpart to TIMIT
– Auto-segmentation for e.g. unit selection TTS
Annotation to be based on existing standards, with necessary adjustments
Exploit experience and specifications from development of Norwegian speech synthesis databases
"Suitable" level of detail: acoustic boundaries should be labeled, but more phonemic than phonetic
Consistency of utmost importance!

11 Broad phonetic annotation: Selected data
10 speakers (5 male and 5 female)
Amount of speech per speaker:
– approx. 5 min "planned" speech and 1 min spontaneous speech
– discard noisy parts (as far as possible)
– from more than one programme
– use turn segmentation from orthographic annotation
All in all 1 hour of speech
Approximately 1000 hours of work
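The figures on this slide are consistent with each other, and they can be compared with the ~70 hrs of work per hour of recording quoted for the orthographic annotation. The short calculation below is a rough illustration; the two work figures may not cover comparable tasks, so the ratio is indicative only.

```python
speakers = 10
planned_min = 5       # approx. minutes of planned speech per speaker
spontaneous_min = 1   # approx. minutes of spontaneous speech per speaker

# Total selected speech in hours.
speech_hours = speakers * (planned_min + spontaneous_min) / 60

phonetic_work_hours = 1000  # figure from this slide
phonetic_ratio = phonetic_work_hours / speech_hours

orthographic_ratio = 70     # ~70 hrs of work per hour of recording
relative_cost = phonetic_ratio / orthographic_ratio

print(speech_hours, phonetic_ratio, round(relative_cost, 1))
# prints: 1.0 1000.0 14.3
```

On these numbers, broad phonetic annotation costs roughly 14 times as much per hour of speech as the orthographic annotation.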

12 Broad phonetic annotation: Main principles
The annotation is mainly phonemic, using the phoneme symbols closest to the perceived sound
Acoustic boundaries should be marked; some acoustically motivated symbols are included
A transcription as close as possible to the citation form is preferred
Norwegian standard SAMPA is preferred
– Some English phonemes included as well as dialect variants
– Example: 3 variants of the /r/-sound
/r/ (tap/trill)
/R/ (uvular fricative)
/r\/ (approximant)
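Because one of the /r/ variants is written with two characters (r\), code that reads such transcriptions cannot simply split them character by character. A minimal sketch of a longest-match tokenizer follows; the inventory is a tiny made-up subset for illustration, not the corpus's actual symbol set.

```python
# The three /r/ variants from the slide, with the slide's own descriptions.
R_VARIANTS = {
    "r": "tap/trill",
    "R": "uvular fricative",
    "r\\": "approximant",  # the two-character symbol r\
}
# Hypothetical handful of other symbols, just enough for the examples.
OTHER_SYMBOLS = {"a", "e", "i", "o", "u", "b", "d", "t", "n", "s"}
INVENTORY = set(R_VARIANTS) | OTHER_SYMBOLS


def tokenize(transcription):
    """Split an unspaced SAMPA string into symbols, longest match first."""
    longest = max(len(symbol) for symbol in INVENTORY)
    symbols, pos = [], 0
    while pos < len(transcription):
        for size in range(longest, 0, -1):
            candidate = transcription[pos:pos + size]
            if candidate in INVENTORY:
                symbols.append(candidate)
                pos += len(candidate)
                break
        else:
            # No symbol of any length matched at this position.
            raise ValueError(f"unknown symbol at position {pos}")
    return symbols


tokens = tokenize("bar\\a")  # the string b a r\ a
```

Trying the longest candidates first ensures that r\ is read as the approximant and not as /r/ followed by a stray backslash.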

13 Broad phonetic annotation: Annotation procedure
1. Conversion of orthographic transcription to a format suitable for automatic transcription.
2. Automatic segmentation with a phonotypical transcription using a speech recognizer.
3. Manual correction of both segments and labels by four phonetics students using Praat.
4. Format check.
5. Control of all annotation by one supervisor.
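The slide does not say what the format check in step 4 covers. As one plausible illustration, the sketch below assumes a tier is a list of (start, end, label) intervals and checks that intervals are well-formed, contiguous and labelled with known symbols. The function name, data layout and tolerance are all assumptions.

```python
def format_errors(tier, allowed_labels, tolerance=1e-6):
    """Return human-readable problems in a tier of (start, end, label) intervals."""
    errors = []
    previous_end = None
    for index, (start, end, label) in enumerate(tier):
        # Each interval must have positive duration.
        if end <= start:
            errors.append(f"interval {index}: end {end} is not after start {start}")
        # Each interval must begin where the previous one ended.
        if previous_end is not None and abs(start - previous_end) > tolerance:
            errors.append(f"interval {index}: gap or overlap with previous interval")
        # Each label must come from the agreed symbol set.
        if label not in allowed_labels:
            errors.append(f"interval {index}: unknown label {label!r}")
        previous_end = end
    return errors


# Made-up tiers for illustration.
tier_ok = [(0.0, 0.08, "b"), (0.08, 0.21, "a"), (0.21, 0.30, "R")]
tier_bad = [(0.0, 0.08, "b"), (0.10, 0.21, "q")]

ok_report = format_errors(tier_ok, {"a", "b", "R"})
bad_report = format_errors(tier_bad, {"a", "b", "R"})
```

A check of this kind catches mechanical slips before the supervisor's review in step 5, leaving that review for judgments about the labels themselves.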

14 Broad phonetic annotation: Comments on deviations
There are always cases of uncertainty; a log is needed for these.
Problem: will the log be read?
Solution: codes for deviations!
Additional Praat tier for deviations
Synchronous with the phoneme tier
Easy to utilize automatically
Examples:
– creaky voice
– unexpected voiced/unvoiced
– uncertain boundary or symbol
... in addition, a log file with whatever deviations are left
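A deviation tier that is synchronous with the phoneme tier can be read in parallel with it. The sketch below assumes that "synchronous" means the two tiers share interval boundaries and that an empty label means no deviation; the code names "cv" and "ub" are hypothetical, not the corpus's actual codes.

```python
def flagged_phonemes(phoneme_tier, deviation_tier):
    """Pair each phoneme with its deviation code; tiers must share boundaries."""
    if len(phoneme_tier) != len(deviation_tier):
        raise ValueError("tiers have different numbers of intervals")
    flagged = []
    for (p_start, p_end, phone), (d_start, d_end, code) in zip(phoneme_tier,
                                                               deviation_tier):
        if (p_start, p_end) != (d_start, d_end):
            raise ValueError(f"tiers are not synchronous at {p_start}-{p_end}")
        if code:  # empty string is taken to mean "no deviation"
            flagged.append((phone, code))
    return flagged


# Hypothetical codes: "cv" = creaky voice, "ub" = uncertain boundary.
phonemes = [(0.0, 0.08, "b"), (0.08, 0.21, "a"), (0.21, 0.30, "R")]
deviations = [(0.0, 0.08, ""), (0.08, 0.21, "cv"), (0.21, 0.30, "ub")]

flagged = flagged_phonemes(phonemes, deviations)
```

With the codes attached to individual phonemes, an experiment can, for example, exclude every interval with an uncertain boundary without anyone having to read a free-text log.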

15 Example annotation in Praat

16 Concluding remarks
Availability:
– Planned to be included for non-commercial use in a future Norwegian language bank
– Will complement other corpora also intended to be included
To be validated by Spex
Planned use at NTNU: SIRKUS project
– Investigation of new paradigms for ASR
– Low-level phone recognition experiments, initially multi-linguality aspects
– Spoken information retrieval

