Published by Estefania Hanly. Modified over 10 years ago.
1
Markpong Jongtaveesataporn †, Chai Wutiwiwatchai ‡, Koji Iwano †, Sadaoki Furui †
† Tokyo Institute of Technology, Japan
‡ NECTEC, Thailand
2
Background on Thai speech recognition research
[Figure: timeline of Thai speech recognition tasks by year (axis marks: 1987, 1995, 1999, 2003, 2005, 2007), ordered by difficulty]
Tasks, in order of appearance:
- Isolated syllable recognition
- Isolated word recognition
- Connected sub-word recognition
- Small task continuous speech recognition
- LVCSR
- Broadcast news transcription system
Annotation: Thienlikit et al., 2004, newspaper read-speech recognition
3
Development of Thai Broadcast News Transcription System
Research on broadcast news transcription systems for Thai lags behind other languages
- English: 1995 (Stern, 1997)
- Japanese: 1997 (Matsuoka et al., 1997)
- Mandarin: 1998 (Guo et al., 1998)
- Italian: 2000 (Federico et al., 2000)
We need to speed up our research activities to catch up with others
Targets
1. Development of a Thai broadcast news corpus
   - Speech corpus: training and testing data
   - Text corpus: language modeling
2. Development of a prototype system
4
Speech corpus
- Structure information of the broadcast news was annotated
  - Section, speaker’s turn, segments
- Property tags were annotated for each speaker’s turn
  - Speaker’s name, if known
  - Speaker’s gender: male / female
  - Speaking mode: planned / spontaneous
  - Background noise: clean / music / noise
- Only speech from announcers speaking in the studio was transcribed
- Transcription and annotation were created by one transcriber and checked by another
5
Structure of broadcast news
- Episode: one broadcast news session
  - Section: one news topic (Section 1, Section 2, Section 3, ...)
    - Speaker’s turn: speech by one speaker (for example speaker A, then speaker B, then speaker A again)
      - Segment: one sentence or clause
9
Example of structure information
- Episode: one broadcast news session
  - Section 1: Sports
    - Speaker’s turn: Mr. A, male, planned speech, clean speech
      - Segment: sentence A
      - Segment: sentence B
      - Segment: sentence C
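The annotation hierarchy can be pictured as nested records. The following is a minimal sketch only: the class and field names are hypothetical and are not taken from the corpus format or from the annotation tool, and the property values mirror the tags listed on the speech corpus slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    text: str  # one sentence or clause


@dataclass
class SpeakerTurn:
    speaker_name: Optional[str]  # None when the speaker is not known
    gender: str                  # "male" or "female"
    speaking_mode: str           # "planned" or "spontaneous"
    background: str              # "clean", "music", or "noise"
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Section:
    topic: str  # one news topic
    turns: List[SpeakerTurn] = field(default_factory=list)


@dataclass
class Episode:
    sections: List[Section] = field(default_factory=list)


def count_segments(episode: Episode) -> int:
    """Count all segments in an episode by walking the hierarchy."""
    return sum(len(turn.segments)
               for section in episode.sections
               for turn in section.turns)


# The example from the slide: one sports section, one turn, three segments.
example = Episode(sections=[
    Section(topic="Sports", turns=[
        SpeakerTurn(speaker_name="Mr. A", gender="male",
                    speaking_mode="planned", background="clean",
                    segments=[Segment("sentence A"),
                              Segment("sentence B"),
                              Segment("sentence C")]),
    ]),
])
```

Keeping the property tags on the speaker's turn, rather than on each segment, follows the slide's statement that tags were annotated per turn.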
11
Text corpus
- No structure information was annotated
- Additional information
  - Speaking mode: planned / spontaneous
12
Problems of Thai transcription text
- No spaces between words
- The definition of a word is very ambiguous
- No good morphological analyzer
- These cause difficulties in the transcription and checking process
- Manually word-segmented transcription was made
  - Instructions were created for transcribers
- Future target: automatically segmented transcription
13
Broadcast news collection
- News programs from one public TV station in Thailand were recorded
- Total of 105 news episodes
  - Speech corpus: 35 news episodes, 17 hours
  - Text corpus: 70 news episodes
14
Analysis of speech corpus
15
Information on speech and text corpora

Attribute             Speech corpus       Text corpus
No. of sentences      13k                 32k
No. of words          224k                573k
No. of unique words   10k                 14k
No. of phonemes       899k                -
No. of speakers       8 female, 4 male    -
16
Data used in experiments
- Test set data
  - Randomly selected from the speech corpus
  - 3,000 utterances
- Acoustic model training data for the baseline system
  - Phonetically balanced sentence speech corpora: LOTUS (Kasuriya et al., 2003) and a corpus developed internally
  - Read speech corpora
  - 40.3 hours (68 male and 68 female speakers)
- Acoustic model adaptation data
  - Selected from the speech corpus
  - No overlap between adaptation data and test set data
- Language model training data
  - Text corpus plus transcripts from the speech corpus, excluding the test set
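The no-overlap constraint between test and adaptation data can be enforced by drawing both from a single shuffle. This is a minimal sketch, not the authors' procedure: the function name, the seed, and the utterance IDs are hypothetical, and the total of 13,000 utterances is only a placeholder based on the roughly 13k sentences reported for the speech corpus.

```python
import random


def split_utterances(utterance_ids, n_test, seed=0):
    """Randomly pick n_test utterances for testing.

    Everything not chosen for testing forms the pool from which adaptation
    data may be drawn, so the two sets cannot overlap.
    """
    ids = list(utterance_ids)
    if n_test > len(ids):
        raise ValueError("n_test exceeds the number of utterances")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(ids)
    return ids[:n_test], ids[n_test:]


# Hypothetical utterance IDs; the real corpus identifiers are not given.
all_ids = [f"utt{i:05d}" for i in range(13000)]
test_set, adaptation_pool = split_utterances(all_ids, n_test=3000)
```

The same pool can also supply the transcripts added to the language model training text, since it excludes the test set by construction.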
17
Experimental conditions
- Acoustic model
  - Gender-dependent acoustic model
  - 12 MFCCs, delta, and delta energy
  - Triphones, 1,000 tied states, 8 Gaussian mixtures
- Language model
  - Trigrams
  - Dictionary size: about 18k words
- The TITech WFST speech recognition system (Dixon et al., 2007) was used as the speech decoder
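The slide lists delta features but does not say how they were computed. A common choice is the regression formula d_t = sum_n n (c_{t+n} - c_{t-n}) / (2 sum_n n^2). The sketch below assumes that formula, a window of 2 frames, and edge padding by repeating the first and last frame; all three are assumptions, not details from the presentation.

```python
def delta(frames, window=2):
    """Compute delta (first-order regression) coefficients.

    frames: list of equal-length lists of floats, one list per time frame.
    Frames beyond the edges are approximated by repeating the edge frame.
    """
    if not frames:
        return []
    n_frames = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(n * n for n in range(1, window + 1))
    result = []
    for t in range(n_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, n_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        result.append(row)
    return result
```

Applied to a one-dimensional ramp such as `[[0.0], [1.0], [2.0], [3.0], [4.0]]`, the middle frame gets a delta of 1.0, which matches the slope of the ramp.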
18
Acoustic model adaptation
Supervised adaptation using MLLR
- F-condition adaptation
  - F0: clean, planned
  - F1: clean, spontaneous
  - F3: music noise
  - F4: other noise
  - Adaptation data: 200 utterances, randomly selected from the speech corpus regardless of speaker
- Speaker adaptation
  - Adaptation data: 200 utterances, randomly selected from the speech corpus regardless of F-condition
19
WER results
Speaker adaptation yielded better WER

F-condition   Proportion of time   No. of words
F0            35.3%                17160
F1            1.0%                 629
F3            14.0%                7882
F4            49.7%                27542
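For reference, word error rate is conventionally computed as the minimum number of substitutions, deletions, and insertions needed to turn the recognizer output into the reference, divided by the number of reference words. The sketch below assumes both sides are already word-segmented token lists, which matters for Thai because the text has no spaces between words. The function names are illustrative and are not part of the decoder used in the experiments.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER in percent, relative to the number of reference tokens."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)
```

For example, `word_error_rate(["a", "b", "c", "d"], ["a", "x", "c"])` counts one substitution and one deletion against four reference words, giving 50.0.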
20
Discussion
High WER
- Mismatched recording conditions
  - The speech corpus was used only as testing and adaptation data
- Small text corpus
  - Inefficient language model
21
Conclusion
- Construction of the first Thai broadcast news corpus and an overview of the corpus analysis were presented
- The speech corpus was annotated with structure information, which is useful for further research
- An LVCSR system was set up and tested with the corpus
22
Future work
- Applying our Thai language modeling technique (Jongtaveesataporn et al., 2007)
  - Compound pseudo-morpheme (CPM) unit
  - Pseudo-morpheme error rate (F0 condition)
    - Manually segmented word unit system: 20.5%
    - CPM unit system: 19.9%
- Improving the language model by using newspaper text
- Collaboration with NECTEC: additional 50 hours of speech corpus
23
Thank you
26
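Background
[Figure: timeline of Thai speech recognition tasks by year (axis marks: 1987, 1995, 1999, 2003, 2005, 2007), ordered by difficulty]
Tasks, in order of appearance:
- Isolated syllable recognition
- Isolated word recognition
- Connected sub-word recognition
- Small task continuous speech recognition
- LVCSR
- Broadcast news LVCSR
Annotation: Thienlikit, 2004, newspaper read-speech recognition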
27
Development of Thai Broadcast News LVCSR System
Development of an LVCSR system requires speech and text corpora
Existing speech corpora for Thai LVCSR research (newspaper read-speech)
- NECTEC-ATR
- LOTUS (NECTEC)
- GlobalPhone (CMU)
1. Development of a Thai broadcast news corpus
   - Speech corpus: training and testing data
   - Text corpus: language modeling
2. Development of a prototype LVCSR system
28
Experiments & developed corpora
- Speech corpus
  - The size of the speech corpus is still rather small
  - It was used in three ways
    - Test data
    - Adaptation data
    - A part of the transcription text was used for training the LM
- Text corpus
  - It was used for training the LM
29
Perplexity & OOV rates

F-condition   Perplexity          OOV rate
              Male     Female     Male   Female
F0            107.5    106.9      0.9    0.8
F1            126.4    100.1      0.9    0.6
F3            145.2    100.0      0.7    0.9
F4            141.6    157.6      1.5    1.9
Overall       126.9    125.6      1.2    1.3
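Both measures follow standard definitions: perplexity is the exponential of the negative mean log-probability that the language model assigns to the test words, and the OOV rate is the share of test words missing from the dictionary. The sketch below assumes natural-log probabilities as input and reports the OOV rate in percent; the slide does not state the unit, so the percentage is an assumption, and the function names are illustrative.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-word natural-log probabilities."""
    if not log_probs:
        raise ValueError("need at least one word probability")
    return math.exp(-sum(log_probs) / len(log_probs))


def oov_rate(test_words, vocabulary):
    """Percentage of test tokens that are not in the vocabulary."""
    if not test_words:
        raise ValueError("need at least one test word")
    missing = sum(1 for word in test_words if word not in vocabulary)
    return 100.0 * missing / len(test_words)
```

A model that assigns probability 0.25 to every test word has a perplexity of 4, which is a quick sanity check for an implementation.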
30
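Transcription process
[Figure: workflow diagram; both paths follow the same guideline]
- Speech corpus path: speech corpus transcribing (4 persons), speech corpus checking (2 persons), lexical entries checking (1 person), resulting in the speech corpus
- Text corpus path: text corpus transcribing (7 persons), lexical entries checking (1 person), resulting in the text corpus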
31
Speech corpus
- Transcription and annotation of about 17 hours of TV broadcast news
- Tool: “Transcriber” (Barras et al., 2001)
- Additional information
  - Speaker information: name, gender
  - Speaking mode: planned/spontaneous speech
- Speech from announcers speaking in the studio
32
Transcription conventions
Guideline for the transcription process
- Segment segmentation
- Word segmentation
- Repeating word
- Thai/English abbreviation
- Number entity
- Special tags
33
Introduction
Thai speech processing research at TokyoTech
- Dialogue system [Whittiwiwattchai, 2003]
- LVCSR system
  - Dictation system [Tianlikid, 2005]
  - Broadcast news recognition system
34
Overview
- Introduction
- Corpus description
- Recording and transcription processes
- Corpus evaluation
- Conclusion
35
Thai language corpora
Large language corpora are crucial to a state-of-the-art natural language processing system
Thai speech resources for speech processing
- Newspaper read-speech
  - NECTEC-ATR
  - LOTUS (NECTEC)
  - GlobalPhone (CMU)
- Unit-selection speech synthesis
  - TSynC-1 (NECTEC)
36
WER result

F-condition   Time proportion   WER (%)
                                Male    Female
F0            28.1%             44.4    40.8
F1            1.5%              62.4    60.2
F3            11.5%             82.2    72.4
F4            58.9%             54.9    57.5
Overall       100%              56.8    45.5
37
Text corpus
- Text transcribed from 35 hours of TV broadcast news
- Additional information
  - Speaking mode: planned/spontaneous
38
Transcription conventions (1)
Sentence segmentation
- There is no sentence marker in the Thai language, so sentence boundaries are ambiguous
- Grammatically, there are 3 types of sentence
  - Simple sentence
  - Compound sentence
  - Complex sentence
  - Compound and complex sentences are composed of several clauses or simple sentences
- A sentence was defined as a simple sentence or clause, delimited with the help of breaths
39
Transcription conventions (2)
Word segmentation
- There is no word boundary marker in the Thai language
  - This leads to difficulties in the transcription and data checking processes
- Word boundaries are too ambiguous to define all rules
  - A few rules for simple segmentation patterns were defined
  - Undefined patterns were left to the decision of transcribers
40
Transcription conventions (3)
- Repeating word
- Thai/English abbreviation
- Number entity
- Special tags
  - Disfluencies, filled pauses, exclamations
  - Foreign words
  - Some other events: uncertainly transcribed parts, etc.
41
Recorded programs
- News programs from one public TV station in Thailand were recorded
- Total of 105 news episodes
  - Speech corpus: 35 news episodes, about 17 hours of speech data
  - Text corpus: 70 news episodes