Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

1 Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand

2 Background on Thai speech recognition research
[Timeline figure. Axis label: Difficulty. Years shown: 1987, 1995, 1999, 2003, 2005, 2007.]
Milestones, in the order shown:
- Isolated syllable recognition
- Isolated word recognition
- Connected sub-word recognition
- Small task continuous speech recognition
- LVCSR
- Broadcast news transcription system
Annotation: Thienlikit et al., 2004: newspaper read-speech recognition

3 Development of Thai Broadcast News Transcription System
- Research on broadcast news transcription systems for Thai lags behind other languages
  - English: 1995 (Stern, 1997)
  - Japanese: 1997 (Matsuoka et al., 1997)
  - Mandarin: 1998 (Guo et al., 1998)
  - Italian: 2000 (Federico et al., 2000)
- We need to speed up our research activities to catch up with others
Targets
1. Development of a Thai broadcast news corpus
   - Speech corpus: training and testing data
   - Text corpus: language modeling
2. Development of a prototype system

4 Speech corpus
- Structure information of broadcast news was annotated
  - Section, speaker’s turn, segments
- Property tags were annotated for each speaker’s turn
  - Speaker’s name, if known
  - Speaker’s gender: male / female
  - Speaking mode: planned / spontaneous
  - Background noise: clean / music / noise
- Only speech from announcers speaking in the studio was transcribed
- Transcription and annotation were created by one transcriber and checked by another

5 Structure of broadcast news
Episode: one broadcast news session
  Section 1: one news topic
  Section 2
  Section 3

6 Structure of broadcast news
Episode: one broadcast news session
  Section 1: one news topic
    Speaker’s turn: speaker A
    Speaker’s turn: speaker B
    Speaker’s turn: speaker A

7 Structure of broadcast news
Episode: one broadcast news session
  Section 1: one news topic
    Speaker’s turn: speaker A
      Segment: one sentence or clause

9 Example of structure information
Episode: one broadcast news session
  Section 1: Sports
    Speaker’s turn: Mr. A, male, planned speech, clean speech
      Segment: sentence A
      Segment: sentence B
      Segment: sentence C
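
The episode / section / speaker's turn / segment hierarchy can be pictured as nested records. The following is only a minimal sketch of such a representation; the class and field names are hypothetical and are not taken from the corpus format or from the annotation tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    text: str  # one sentence or clause


@dataclass
class SpeakerTurn:
    speaker_name: Optional[str]  # None when the speaker is unknown
    gender: str                  # "male" or "female"
    speaking_mode: str           # "planned" or "spontaneous"
    background: str              # "clean", "music", or "noise"
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Section:
    topic: str  # one news topic
    turns: List[SpeakerTurn] = field(default_factory=list)


@dataclass
class Episode:
    sections: List[Section] = field(default_factory=list)


def count_segments(episode: Episode) -> int:
    """Count all segments in all speaker turns of all sections."""
    return sum(len(turn.segments)
               for section in episode.sections
               for turn in section.turns)


# The example from the slide: one "Sports" section with one turn by Mr. A.
example = Episode(sections=[
    Section(topic="Sports", turns=[
        SpeakerTurn("Mr. A", "male", "planned", "clean", [
            Segment("sentence A"),
            Segment("sentence B"),
            Segment("sentence C"),
        ]),
    ]),
])
```

Attaching the property tags to the speaker's turn, rather than to each segment, mirrors the annotation scheme described on the slides.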

11 Text corpus
- No structure information was annotated
- Additional information
  - Speaking mode: planned / spontaneous

12 Problems of Thai transcription text
- No spaces between words
  - The definition of a word is very ambiguous
  - No good morphological analyzer
  - Difficulties in the transcription and checking processes
- A manually word-segmented transcription was made
  - Instructions were created for transcribers
- Automatically segmented transcription (future target)

13 Broadcast news collection
- News programs from one public TV station in Thailand were recorded
- Total of 105 news episodes
  - Speech corpus: 35 news episodes
    - 17 hours
  - Text corpus: 70 news episodes

14 Analysis of speech corpus

15 Information of speech & text corpora

Attribute             Speech corpus       Text corpus
No. of sentences      13k                 32k
No. of words          224k                573k
No. of unique words   10k                 14k
No. of phonemes       899k                -
No. of speakers       8 female, 4 male    -

16 Data used in experiments
- Test set data
  - Randomly selected from the speech corpus
  - 3,000 utterances
- Acoustic model training data for the baseline system
  - Phonetically balanced sentence speech corpora
    - LOTUS (Kasuriya et al., 2003) and a corpus developed internally
  - Read speech corpora
  - 40.3 hours (68 male and 68 female speakers)
- Acoustic model adaptation data
  - Selected from the speech corpus
  - No overlap between adaptation data and test set data
- Language model training data
  - Text corpus + transcripts from the speech corpus, excluding the test set
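
One simple way to obtain a randomly selected test set that does not overlap with the adaptation data is to shuffle the utterance list once and take disjoint slices. This is a sketch under assumptions: the function name, the fixed seed, and the use of integer utterance IDs are illustrative, and the slides do not describe how the selection was actually implemented.

```python
import random


def split_test_and_adaptation(utterance_ids, n_test, n_adapt, seed=0):
    """Randomly pick disjoint test and adaptation sets from one pool.

    The seed is an assumption, used here only to make the split repeatable.
    """
    ids = list(utterance_ids)
    if n_test + n_adapt > len(ids):
        raise ValueError("not enough utterances for the requested split")
    rng = random.Random(seed)
    rng.shuffle(ids)
    test = ids[:n_test]                     # first slice: test set
    adapt = ids[n_test:n_test + n_adapt]    # next slice: adaptation data
    return test, adapt


# Illustrative pool of 13,000 utterance IDs (roughly the sentence count
# of the speech corpus); 3,000 for testing, 200 for adaptation.
test_ids, adapt_ids = split_test_and_adaptation(range(13000), 3000, 200)
```

Because both sets are slices of the same shuffled list, they cannot share an utterance.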

17 Experimental conditions
- Acoustic model
  - Gender-dependent acoustic models
  - 12 MFCCs, delta, and delta energy
  - Triphones, 1,000 tied states, 8 Gaussian mixtures
- Language model
  - Trigrams
  - Dictionary size: about 18k words
- The TITech WFST speech recognition system (Dixon et al., 2007) was used as the speech decoder
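
Delta features are commonly computed as a regression over neighboring frames. The sketch below shows that common formulation in plain Python; the window of 2 frames and the padding by repeating the first and last frame are assumptions, since the slides do not state how the deltas were computed.

```python
def delta(frames, window=2):
    """Regression-based delta features for a list of feature vectors.

    frames: list of equal-length lists of floats, one list per frame.
    window: frames on each side used in the regression (assumed value).
    Edges are handled by repeating the first or last frame (assumption).
    """
    n_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    result = []
    for t in range(n_frames):
        vec = []
        for k in range(len(frames[t])):
            num = 0.0
            for n in range(1, window + 1):
                nxt = frames[min(t + n, n_frames - 1)][k]
                prv = frames[max(t - n, 0)][k]
                num += n * (nxt - prv)
            vec.append(num / denom)
        result.append(vec)
    return result


# A feature that rises by 1 per frame has a delta of 1 away from the edges.
ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
ramp_delta = delta(ramp)
```

Applying the same function to a per-frame energy track would give a delta energy value under the same assumptions.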

18 Acoustic model adaptation
- Supervised adaptation using MLLR
- F-condition adaptation
  - F0: clean, planned
  - F1: clean, spontaneous
  - F3: music noise
  - F4: other noise
  - Adaptation data: 200 utterances randomly selected from the speech corpus, regardless of speaker
- Speaker adaptation
  - Adaptation data: 200 utterances randomly selected from the speech corpus, regardless of F-condition
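
The F-conditions can be derived from the speaking-mode and background-noise tags attached to each speaker's turn. The mapping below is a sketch: the function name is hypothetical, and it assumes that the "noise" background tag corresponds to F4 and that music and noise conditions are assigned regardless of speaking mode, which the slides imply but do not state outright.

```python
def f_condition(speaking_mode, background):
    """Map annotation tags to an F-condition label (F0, F1, F3, F4)."""
    if background == "music":
        return "F3"  # music noise, any speaking mode (assumption)
    if background == "noise":
        return "F4"  # other noise, any speaking mode (assumption)
    if background == "clean":
        if speaking_mode == "planned":
            return "F0"
        if speaking_mode == "spontaneous":
            return "F1"
    raise ValueError(f"unknown tags: {speaking_mode!r}, {background!r}")


# Group hypothetical turns by F-condition before drawing adaptation data.
turns = [("planned", "clean"), ("spontaneous", "clean"),
         ("planned", "music"), ("spontaneous", "noise")]
labels = [f_condition(mode, bg) for mode, bg in turns]
```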

19 WER results
- Speaker adaptation yielded better WER

F-condition   Proportion (time)   #words
F0            35.3%               17160
F1            1.0%                629
F3            14.0%               7882
F4            49.7%               27542
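
For reference, word error rate is usually computed as the word-level edit distance between the reference and the recognizer output, divided by the number of reference words. The sketch below is a generic implementation of that definition, not the scoring tool used for these results; it takes lists of words because Thai text must be word-segmented before scoring.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent between two lists of words.

    Counts substitutions, deletions, and insertions with a standard
    edit-distance recurrence, keeping only one row at a time.
    """
    if not reference:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, 1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(reference)


# One substitution and one deletion against four reference words.
wer = word_error_rate(["a", "b", "c", "d"], ["a", "x", "c"])  # 50.0
```

The same function applied to pseudo-morpheme sequences instead of words would give a pseudo-morpheme error rate.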

20 Discussion
- High WER
  - Mismatched recording conditions
    - The speech corpus was used only as testing and adaptation data
  - Small text corpus
    - Inefficient language model

21 Conclusion
- The construction of the first Thai broadcast news corpus and an overview of the corpus analysis were presented
- The speech corpus was annotated with structure information, which is useful for further research
- An LVCSR system was set up and tested with the corpus

22 Future work
- Applying our Thai language modeling technique (Jongtaveesataporn et al., 2007)
  - Compound pseudo-morpheme (CPM) unit
  - Pseudo-morpheme error rate (F0 condition)
    - Manually segmented word unit system: 20.5%
    - CPM unit system: 19.9%
- Improving the language model by using newspaper text
- Collaboration with NECTEC: an additional 50 hours of speech corpus

23 Thank you

26 Background
[Timeline figure. Axis label: Difficulty. Years shown: 1987, 1995, 1999, 2003, 2005, 2007.]
Milestones, in the order shown:
- Isolated syllable recognition
- Isolated word recognition
- Connected sub-word recognition
- Small task continuous speech recognition
- LVCSR
- Broadcast news LVCSR
Annotation: Thienlikit, 2004: newspaper read-speech recognition

27 Development of Thai Broadcast News LVCSR System
- Development of an LVCSR system requires speech and text corpora
- Existing speech corpora for Thai LVCSR research (newspaper read-speech)
  - NECTEC-ATR
  - LOTUS (NECTEC)
  - GlobalPhone (CMU)
1. Development of a Thai broadcast news corpus
   - Speech corpus: training and testing data
   - Text corpus: language modeling
2. Development of a prototype LVCSR system

28 Experiments & developed corpora
- Speech corpus
  - The speech corpus is still rather small
  - It was used in three ways
    - Test data
    - Adaptation data
    - Part of the transcription text was used for training the LM
- Text corpus
  - It was used for training the LM

29 Perplexity & OOV rates

              Perplexity          OOV rate
F-condition   Male     Female     Male   Female
F0            107.5    106.9      0.9    0.8
F1            126.4    100.1      0.9    0.6
F3            145.2    100.0      0.7    0.9
F4            141.6    157.6      1.5    1.9
Overall       126.9    125.6      1.2    1.3
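
Both measures in the table have short standard definitions: the OOV rate is the share of test words missing from the recognition dictionary, and perplexity is the exponential of the average negative log-probability the language model assigns to each test word. The sketch below assumes the OOV rates are reported in percent and that per-word natural-log probabilities are already available from some language model; neither detail is stated on the slide.

```python
import math


def oov_rate(test_words, vocabulary):
    """Percentage of test words that are not in the vocabulary."""
    if not test_words:
        raise ValueError("test_words must not be empty")
    vocab = set(vocabulary)
    missing = sum(1 for word in test_words if word not in vocab)
    return 100.0 * missing / len(test_words)


def perplexity(log_probs):
    """Perplexity from natural-log probabilities, one per test word."""
    if not log_probs:
        raise ValueError("log_probs must not be empty")
    return math.exp(-sum(log_probs) / len(log_probs))


# A model that gives every word probability 1/4 has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 4)
example_oov = oov_rate(["a", "b", "c", "d"], {"a", "b", "c"})  # 25.0
```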

30 Transcription process
[Flow diagram]
- Guideline
- Speech corpus: transcribing (4 persons) -> checking (2 persons) -> lexical entries checking (1 person) -> speech corpus
- Text corpus: transcribing (7 persons) -> lexical entries checking (1 person) -> text corpus

31 Speech corpus
- Transcription and annotation of about 17 hours of TV broadcast news
- Tool: “Transcriber” (Barras et al., 2001)
- Additional information
  - Speaker information: name, gender
  - Speaking mode: planned / spontaneous speech
- Speech from announcers speaking in the studio

32 Transcription conventions
- Guideline for the transcription process
  - Segment segmentation
  - Word segmentation
  - Repeated words
  - Thai/English abbreviations
  - Number entities
  - Special tags

33 Introduction
- Thai speech processing research at TokyoTech
  - Dialogue system [Whittiwiwattchai, 2003]
  - LVCSR system
    - Dictation system [Tianlikid, 2005]
    - Broadcast news recognition system

34 Overview
- Introduction
- Corpus description
- Recording and transcription processes
- Corpus evaluation
- Conclusion

35 Thai language corpora
- Large language corpora are crucial to a state-of-the-art natural language processing system
- Thai speech resources for speech processing
  - Newspaper read-speech
    - NECTEC-ATR
    - LOTUS (NECTEC)
    - GlobalPhone (CMU)
  - Unit-selection speech synthesis
    - TSynC-1 (NECTEC)

36 WER Result

                                WER (%)
F-condition   Time proportion   Male    Female
F0            28.1%             44.4    40.8
F1            1.5%              62.4    60.2
F3            11.5%             82.2    72.4
F4            58.9%             54.9    57.5
Overall       100%              56.8    45.5

37 Text corpus
- Text transcribed from 35 hours of TV broadcast news
- Additional information
  - Speaking mode: planned / spontaneous

38 Transcription conventions (1)
- Sentence segmentation
  - No sentence marker in the Thai language
  - Ambiguous
  - Grammatically, there are 3 types of sentence
    - Simple sentence
    - Compound sentence
    - Complex sentence
    - Compound and complex sentences are composed of several clauses or simple sentences
  - A sentence was defined as a simple sentence or clause, with breaths used to help delimit it

39 Transcription conventions (2)
- Word segmentation
  - No word boundary marker in the Thai language
    - Leads to difficulties in the transcription and data checking processes
  - Too ambiguous to define all rules
    - A few rules for simple segmentation patterns were defined
    - Undefined patterns were left to the transcribers’ judgment

40 Transcription conventions (3)
- Repeated words
- Thai/English abbreviations
- Number entities
- Special tags
  - Disfluencies, filled pauses, exclamations
  - Foreign words
  - Some other events: uncertainly transcribed parts, etc.

41 Recorded programs
- News programs from one public TV station in Thailand were recorded
- Total of 105 news episodes
  - Speech corpus
    - 35 news episodes
    - About 17 hours of speech data
  - Text corpus
    - 70 news episodes

