1
A Fully Annotated Corpus of Russian Speech
Pavel Skrelin, Nina Volskaya, Daniil Kocharov, Karina Evgrafova, Olga Glotova, Vera Evdokimova
Department of Phonetics, Saint-Petersburg State University
Ladies and gentlemen, my name is Daniil Kocharov, and I am presenting a new corpus of Russian speech, CORPRES, developed at the Department of Phonetics of Saint-Petersburg State University by my colleagues and me.
2
CORPRES: a fully annotated COrpus of Russian Professionally REad Speech
Developed at the Department of Phonetics, Saint-Petersburg State University.
Developed for: unit-selection TTS.
Possible linguistic use: research in Russian phonetics and in inter- and intra-speaker speech variability.
The corpus name is an acronym for Corpus of Russian Professionally Read Speech. It was originally developed for unit-selection text-to-speech synthesis of Russian, but our department also uses it as material for phonetic research on Russian speech and on inter- and intra-speaker speech variability. This is possible thanks to the comprehensive and very precise segmentation and labeling of the speech.
D. Kocharov, Fully Annotated Corpus of Russian Speech
3
Corpus Description
8 speakers (4 women and 4 men).
60 hours of read speech (7.5 hours from each speaker).
Texts of different styles: fiction narrative texts; a play with emotionally expressive dialogues; informational texts on IT, politics, and economy.
6 levels of annotation.
The corpus consists of recordings of eight professional speakers, four women and four men. Seven and a half hours of speech were recorded from each speaker, for a total of sixty hours. The corpus contains only read speech. Texts of different styles were selected for recording with the specific characteristics of those styles in mind: fiction narrative, a play containing emotionally expressive dialogues, and purely informational neutral texts on IT, politics, and economy. We use six levels of annotation, which allow us to describe the speech data comprehensively. These are…
4
Annotation
Level 1: pitch period boundaries
Level 2: phonetic events
Level 3: real phonetic transcription
Level 4: ideal phonetic transcription
Level 5: orthographic transcription
Level 6: prosodic transcription
The first annotation level contains the boundaries of fundamental frequency periods. The second level contains the boundaries and labels of various phonetic events: epenthetic vowels, voice onsets, voiced plosures, stationary parts of voiceless consonants, laryngalization, and glottalization. The third level contains the real phonetic transcription, which reflects the sounds actually pronounced by the speakers. The fourth level contains the ideal phonetic transcription, which was generated automatically by a linguistic transcriber in accordance with a canonical set of rules. The fifth level contains the orthographic transcription. The sixth level contains the prosodic transcription, which includes labels for different types of pauses, types of tone unit, and non-speech events.
5
Annotated Speech Sample
Annotation file format: 0,1, 0,2,h 0,8,хотелось 0,32,h 0,64,12 286,16,- 968,16, …
This slide shows a sample of an annotated speech signal and an excerpt from an annotation file that illustrates the annotation file format. Here you can see all the annotation levels I introduced earlier.
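The excerpt above looks like a sequence of three-field, comma-separated records. A minimal Python sketch of reading such records follows, under assumptions the slide does not confirm: the first field is taken to be a position in the signal, the second a numeric code identifying the annotation level, and the third a label that may be empty. The names `Label` and `parse_annotation` are hypothetical, and only the records visible in the excerpt are used.

```python
from typing import NamedTuple


class Label(NamedTuple):
    position: int    # first field; ASSUMED to be an offset into the signal
    level_code: int  # second field; ASSUMED to identify the annotation level
    text: str        # third field; the label itself, possibly empty


def parse_annotation(lines):
    """Parse 'position,level_code,text' records, skipping blank lines."""
    labels = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first two commas only, so a label that itself
        # contains a comma is kept in one piece.
        position, level_code, text = line.split(",", 2)
        labels.append(Label(int(position), int(level_code), text))
    return labels


# The records visible in the slide excerpt, one per entry.
sample = ["0,1,", "0,2,h", "0,8,хотелось", "0,32,h", "0,64,12", "286,16,-", "968,16,"]
labels = parse_annotation(sample)

# Group the labels by level code.
by_level = {}
for label in labels:
    by_level.setdefault(label.level_code, []).append(label)
```

Grouping by the second field makes it easy to inspect one annotation level at a time, for example `by_level[16]` for the two records that carry code 16 in the excerpt.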
6
Labeling Periods of Fundamental Frequency and Phonetic Events
F0 periods were detected automatically. The efficiency of automatic F0 detection and F0 period labeling was up to 98%. The results of the automatic procedure were checked and corrected manually.
Phonetic events were detected manually: epenthetic vowels, voice onsets, voiced plosures, stationary parts of voiceless consonants, glottalization.
The fundamental frequency periods were detected automatically. A linear combination of several methods was used for this purpose. The efficiency of automatic pitch detection and pitch period labeling was about 98%. To make the labeling fully correct, the results of the automatic procedure were checked and corrected manually.
7
Phonetic Transcription
A version of SAMPA for Russian was used for transcription.
18 symbols were used to mark positional allophones of the 6 Russian vowel phonemes /a/, /o/, /i/, /u/, /e/, /y/. They contained an indication of the vowel's position relative to stress: 0 for a stressed (accented) vowel, 1 for an unstressed vowel in a pretonic syllable, 4 for an unstressed vowel in a post-tonic syllable.
The set of consonant symbols included 41 symbols: 36 Russian consonant phonemes and 5 voiced allophones of voiceless consonants.
The transcription symbols were a version of SAMPA for the Russian language. To mark positional allophones of the 6 Russian vowel phonemes /a/, /o/, /i/, /u/, /e/, /y/, 18 symbols were used. Each vowel symbol contained an indication of the sound's position relative to stress: 0 was used for a stressed (accented) vowel, 1 for an unstressed vowel in a pretonic syllable, and 4 for an unstressed vowel in a post-tonic syllable. The set of consonant symbols included 41 symbols, covering 36 Russian consonant phonemes and 5 voiced allophones of voiceless consonants, which occur frequently at word junctions.
There are two levels of phonetic transcription in CORPRES. We call them "real phonetic transcription" and "ideal phonetic transcription". The first is "real" because it reflects the sounds actually pronounced by the speakers. It is a fully manual transcription made by expert phoneticians. The "ideal" transcription was generated, in accordance with a set of phonological rules, from the orthographic texts read by the speakers.
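The symbol counts above can be checked with a short Python sketch. It assumes, purely for illustration, that a vowel allophone symbol is formed by appending the stress digit to the vowel letter (for example `a0`); the slide only says that the symbols contain the stress indication, not how they are spelled. The names `VOWELS`, `STRESS_CODES`, and `vowel_symbols` are hypothetical.

```python
# The six vowel phonemes listed on the slide.
VOWELS = ["a", "o", "i", "u", "e", "y"]

# Stress position codes as described on the slide.
STRESS_CODES = {
    "0": "stressed",
    "1": "unstressed, pretonic syllable",
    "4": "unstressed, post-tonic syllable",
}


def vowel_symbols():
    """Build one symbol per (vowel, stress code) pair.

    ASSUMPTION: the digit is appended to the vowel letter; the actual
    spelling used in the corpus is not shown in the source.
    """
    return [vowel + code for vowel in VOWELS for code in STRESS_CODES]


symbols = vowel_symbols()

# 36 consonant phonemes plus 5 voiced allophones of voiceless consonants.
N_CONSONANT_SYMBOLS = 36 + 5
```

Six vowels in three stress positions give the 18 vowel symbols, and 36 + 5 gives the 41 consonant symbols quoted above.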
8
Real Phonetic Transcription
The speech signal was manually segmented, transcribed, and peer-revised.
Huge work! Time efficiency: ~1 sound per minute. => 1 minute of speech per 1-2 hours.
To produce the real phonetic transcription, the speech signal was manually segmented, transcribed, and peer-revised by expert phoneticians. It is a fully manual transcription, made sound by sound by means of acoustic and perceptual analysis. It was a really huge amount of work: transcribing one minute of speech takes one to two hours. One minute of speech contains about 80 sounds.
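To give a sense of scale, the rate quoted above can be turned into a rough effort estimate. The Python sketch below assumes that the rate of 1 to 2 hours per minute of speech applies uniformly to the 24 hours of fully annotated speech reported on the Corpus Data Description slide; the function name `transcription_hours` is hypothetical, and the result is an order-of-magnitude figure, not a number reported by the authors.

```python
def transcription_hours(speech_hours, hours_per_speech_minute):
    """Expert hours needed to transcribe `speech_hours` of speech."""
    return speech_hours * 60 * hours_per_speech_minute


# 24 hours of fully annotated speech at 1 and 2 hours per minute of speech.
low = transcription_hours(24, 1)
high = transcription_hours(24, 2)

# Consistency check on the quoted rate: about 80 sounds per minute of
# speech at roughly 1 sound per minute of work, expressed in hours.
per_minute_hours = 80 * 1 / 60
```

Under these assumptions the manual transcription of the fully annotated part amounts to somewhere between `low` and `high` expert hours, and the per-minute figure derived from the sound count falls inside the quoted range of 1 to 2 hours.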
9
Ideal Phonetic Transcription
Transcription: generated from texts.
Labels: placed automatically to coincide with the label positions produced manually on the real transcription level.
Automatic labeling: not perfect due to the mismatch between the ideal and real phonetic transcriptions. => The results of the automatic procedure were further corrected manually.
The "ideal" transcription was generated from the texts read by the speakers, in accordance with a set of phonological rules and without reference to the actual sound. The labels were placed automatically to coincide with the label positions produced manually on the real transcription level. The automatic labeling procedure is based on calculating the Levenshtein distance. Automatic labeling is not perfect because the ideal and real phonetic transcriptions do not match exactly. Therefore, the results of the automatic procedure were further corrected manually.
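The source says only that the automatic labeling is based on the Levenshtein distance. One way such a procedure could work is sketched below in Python: align the ideal symbol sequence with the real one by edit distance, then give each ideal symbol the position of the real symbol it is aligned with. The functions `align` and `transfer_positions`, the example symbols, and the positions are all hypothetical, and the handling of elided sounds (left without a position) is an assumption, not the authors' documented behavior.

```python
def align(ideal, real):
    """Align two symbol sequences by Levenshtein distance.

    Returns (ideal_symbol, real_symbol) pairs; None marks a gap.
    """
    n, m = len(ideal), len(real)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ideal[i - 1] == real[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # ideal symbol has no real counterpart
                dist[i][j - 1] + 1,         # real symbol has no ideal counterpart
            )
    # Trace back from the bottom-right corner to recover the alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ideal[i - 1] == real[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                pairs.append((ideal[i - 1], real[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((ideal[i - 1], None))
            i -= 1
        else:
            pairs.append((None, real[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def transfer_positions(pairs, real_positions):
    """Give each ideal symbol the position of its aligned real symbol."""
    placed = []
    k = 0  # index of the next unused real position
    for ideal_symbol, real_symbol in pairs:
        position = None  # stays None for an elided sound (ASSUMPTION)
        if real_symbol is not None:
            position = real_positions[k]
            k += 1
        if ideal_symbol is not None:
            placed.append((ideal_symbol, position))
    return placed


# Hypothetical example: the ideal transcription expects five sounds,
# the speaker produced four (the third one is elided).
ideal = ["k", "a1", "g", "d", "a0"]
real = ["k", "a1", "d", "a0"]
real_positions = [0, 55, 120, 170]  # made-up label positions

pairs = align(ideal, real)
placed = transfer_positions(pairs, real_positions)
```

In this toy case the elided sound ends up without a position, which is exactly the kind of spot where the mismatch between the two transcriptions makes manual correction necessary.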
10
Orthographic and Prosodic Transcription
Orthographic transcription (Level 5) contains the boundaries of words and word labels. Prosodically prominent words are labeled with special symbols.
Prosodic transcription (Level 6) contains the boundaries of tone units and pauses and their labels.
The orthographic transcription is stored on Level 5; it contains the boundaries of words and the word labels. In addition, prosodically prominent words are labeled with special symbols. The prosodic information is stored on Level 6; it contains the boundaries of tone units and pauses and their labels. Prosodic information was marked by expert phoneticians, on the basis of perceptual and acoustic analysis of the speech data, in a text file containing the orthographic transcription. The labels were later transferred automatically from the text file to the annotation files so that they coincide with the phonetic transcription levels.
11
Corpus Data Description
Table 1. General corpus statistics
Columns: Fully Annotated Data | Partly Annotated Data | Total Amount
Phonemes 1 – Words Tone Units 64 055 86 546
Hours: 24 | 36 | 60
40% of the corpus: manually segmented and fully annotated on all six levels.
60% of the corpus: partly annotated.
  Labels for pitch periods and phonetic events.
  No manual phonetic transcription.
  Orthographic, prosodic, and ideal phonetic transcriptions are done, but not aligned with the speech signal.
40% of the corpus (24 hours of speech) was manually segmented and fully annotated on all six levels. 60% of the corpus was partly annotated: it has labels for pitch periods and phonetic events. The orthographic and prosodic transcriptions, as well as the ideal phonetic transcription (see Section 3 for detail), were generated for this part and stored as text files, but were not transferred to sound file labels. The fully annotated part of the corpus covers all speakers and all speaking styles included in the corpus. Table 1 shows general corpus statistics.
12
Mismatch Between Ideal and Real Transcriptions
          Total   Correctly   Mispronounced   Elided
Count     1 70 033
Percent   100     84.7        9.05            6.25
The table shows that although 84.7% of the ideal transcription reflects the actual pronunciation, 9.05% of the expected sounds are replaced by other sounds, and 6.25% of the expected sounds are not pronounced at all. SIGNIFICANT MISMATCH! About half of the mismatches are more or less consistent and could be formalized, but the other half is almost unpredictable and depends on the speaker's momentary preference.
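The three categories in the table can be tallied from aligned transcriptions. The Python sketch below assumes the alignment is available as (ideal symbol, real symbol) pairs in which `None` as the real symbol marks an expected sound that was not pronounced; the functions `classify` and `mismatch_percentages` and the toy data are hypothetical. Real sounds with no expected counterpart are left out of the tally, on the assumption that the table counts expected sounds only.

```python
from collections import Counter


def classify(ideal_symbol, real_symbol):
    """Label one expected sound with a category from the table."""
    if real_symbol is None:
        return "elided"
    if ideal_symbol == real_symbol:
        return "correct"
    return "mispronounced"


def mismatch_percentages(pairs):
    """Percentage of expected sounds falling in each category."""
    counts = Counter(
        classify(ideal_symbol, real_symbol)
        for ideal_symbol, real_symbol in pairs
        if ideal_symbol is not None  # skip sounds that were not expected
    )
    total = sum(counts.values())
    return {
        category: 100 * counts[category] / total
        for category in ("correct", "mispronounced", "elided")
    }


# Toy data: 8 expected sounds, one replaced and one elided.
toy_pairs = [
    ("k", "k"), ("a1", "a1"), ("g", None), ("d", "d"),
    ("a0", "a0"), ("t", "d"), ("o0", "o0"), ("m", "m"),
]
toy_result = mismatch_percentages(toy_pairs)

# The percentages reported in the table add up to the total.
reported_sum = 84.7 + 9.05 + 6.25
```

The toy percentages are illustrative only; the corpus figures are those in the table.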
13
Conclusions
It is the only available corpus for Russian TTS.
Precise annotation provides an especially valuable resource for both linguistic research and the development of speech applications.
The corpus is large enough for high-quality TTS and for phonetic research on intra- and inter-speaker pronunciation variability.
Our experience shows that manual transcription is very expensive, but worth doing.
14
Thank you!