Towards optimal TTS corpora
CADIC Didier, BOIDIN Cedric, D'ALESSANDRO Christophe

2 Towards optimal TTS corpora. CADIC Didier, BOIDIN Cedric, D'ALESSANDRO Christophe

3 France Telecom Group restricted. Towards optimal TTS corpora.
Unit-selection TTS
Example input: "This is an example."
Processing chain: Linguistic modules → Unit selection → Unit concatenation
Resource: Speaker database

4 Unit-selection TTS
Example input: "This is an example."
Processing chain: Linguistic modules → Unit selection → Unit concatenation
Question: how to prepare the recording script?

5 Preparation of the recording script
Classic optimization approach:
- Criterion = diphone and triphone coverage
- Algorithm = greedy, corpus condensation

6 Preparation of the recording script
Classic optimization approach:
- Criterion = diphone and triphone coverage
- Algorithm = greedy, corpus condensation
Limits of the classic approach:
- The link between diphone or triphone coverage and the final TTS quality is not clear
- The process is constrained by the limited combinations encountered in the finite reference corpus
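The classic approach named on this slide can be illustrated with a minimal sketch of greedy corpus condensation. It assumes each sentence is already a list of phone symbols and that coverage is counted over distinct diphones only; the function names `diphones` and `condense` and the toy corpus are hypothetical, not taken from the presentation.

```python
def diphones(phones):
    """Return the set of adjacent phone pairs in one phone sequence."""
    return set(zip(phones, phones[1:]))


def condense(corpus, max_sentences):
    """Greedily pick the sentences that each add the most uncovered diphones."""
    covered, selected = set(), []
    remaining = dict(enumerate(corpus))
    while remaining and len(selected) < max_sentences:
        # Sentence whose diphones add the most to the current coverage.
        best_id = max(remaining, key=lambda i: len(diphones(remaining[i]) - covered))
        gain = diphones(remaining[best_id]) - covered
        if not gain:
            break  # nothing left in the reference corpus improves coverage
        covered |= gain
        selected.append(best_id)
        del remaining[best_id]
    return selected, covered


# Toy reference corpus (hypothetical phone symbols).
corpus = [
    ["a", "b", "c"],
    ["a", "b"],
    ["c", "d", "a", "b"],
    ["b", "c", "d"],
]
selected, covered = condense(corpus, max_sentences=2)
print(selected, sorted(covered))
```

The early exit in the loop reflects the second limit listed on the slide: once every diphone present in the reference corpus is covered, condensation cannot produce anything new.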

7 Preparation of the recording script
Classic optimization approach:
- Criterion = diphone and triphone coverage
- Algorithm = greedy, corpus condensation
Our optimization approach:
- Criterion = vocalic sandwiches coverage
- Algorithm = greedy, sentence construction

8 Vocalic sandwiches (Cadic et al., Interspeech 2009)

9 Sentence construction
Towards optimality: Finite State Transducers compute "optimal" sequences of sandwiches, so that:
- the coverage increment is maximized (greedy approach)
- only sandwich transitions observed in a reference corpus are allowed
→ Neither syntactic nor semantic considerations are applied
→ Generated sequences are likely to be nonsense
Towards readability: development of a semi-automatic tool that lets an operator iteratively correct generated sequences, in order to build an acceptable and almost optimal sentence.
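The two constraints on this slide (maximize the coverage increment, allow only transitions observed in a reference corpus) can be sketched without transducers. The code below is a simplified greedy stand-in, not the FST implementation used by the authors; units are opaque labels, and the names `observed_transitions`, `build_sequence` and the toy data are hypothetical.

```python
def observed_transitions(reference):
    """Map each unit to the set of units seen right after it in the reference corpus."""
    allowed = {}
    for seq in reference:
        for a, b in zip(seq, seq[1:]):
            allowed.setdefault(a, set()).add(b)
    return allowed


def build_sequence(start, allowed, covered, max_len):
    """Extend a sequence one unit at a time, preferring units that add coverage."""
    seq = [start]
    newly = {start} - covered
    while len(seq) < max_len:
        candidates = sorted(allowed.get(seq[-1], ()))
        if not candidates:
            break  # no observed transition leaves the last unit
        # True sorts above False, so a unit that is new to both the script
        # and the current sequence wins; ties go to the first candidate.
        nxt = max(candidates, key=lambda u: u not in covered and u not in newly)
        seq.append(nxt)
        if nxt not in covered:
            newly.add(nxt)
    return seq, newly


# Toy reference corpus of unit labels (hypothetical).
reference = [["s1", "s2", "s3"], ["s2", "s4"], ["s3", "s1"]]
allowed = observed_transitions(reference)
seq, newly = build_sequence("s1", allowed, covered={"s3"}, max_len=3)
print(seq, sorted(newly))
```

Nothing in this procedure looks at syntax or meaning, which is why the generated sequences need the operator-driven correction step described on the slide.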

10 Sentence construction (I don't the week of the six.)

11 Sentence construction (I don't…)

13 Sentence construction (I don't take it out…)

17 Sentence construction (I don't take it out the weeks…)

18 Sentence construction (I don't take it out the weeks like you.)

19 Sentence construction (I don't take it out the black weeks,)

21 Sentence construction
- The procedure is time-consuming (around 3 min, or 50 steps, to build a plausible sentence)
- Most built sentences lack semantic coherence (redundancy is minimized at the price of semantics)
- Built scripts are much denser than with corpus condensation

22 Sentence construction
Density increase of 30 to 40% compared to condensation
[Figure: sandwich coverage rate (%)]

23 Conclusion
For the creation of unit-selection TTS recording scripts:
- We suggested using the Vocalic Sandwiches Coverage Rate as the optimization criterion, since it is a convenient symbolic approximation of the selection cost.
- We presented a novel corpus-building technique based on sentence construction rather than sentence selection. The procedure is time-consuming and built sentences tend to lack semantic coherence, but a density increase of 30 to 40% can be obtained.
Recent work (SSW7 submission):
- Extensive evaluation of the vocalic sandwiches as optimization criterion
- Construction of full recording scripts. Density estimations seem to be confirmed. However, semantic limitations had significant repercussions on the reading stage.

25 Database constitution: two ways
- Rushes from DVD, websites…: the only way to reach otherwise inaccessible voices; expensive process, poor TTS quality
OR
- Dedicated recordings (script read by a speaker): control of the content → best TTS quality

29 Vocalic sandwiches (Cadic et al., Interspeech 2009)
Given an input sentence, the selection module searches the database for units presenting:
- Maximum adequacy to the target sequence (target cost)
- Minimum distortion between consecutive units (concatenation cost)
[Illustration]
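The search described on this slide is commonly implemented as a dynamic-programming (Viterbi) pass over candidate units. The sketch below assumes the two costs are simply summed with equal weight and that a unit is a (label, pitch) pair compared by pitch difference; both choices, and the name `select_units`, are illustrative assumptions rather than details from the presentation.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Pick one candidate per target so that target + concatenation costs are minimal."""
    # best[i][j] = (cheapest total cost of a path ending in candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            # Cheapest way to reach c from any candidate of the previous target.
            prev_cost, prev_j = min(
                (best[i - 1][j][0] + concat_cost(p, c), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(targets[i], c), prev_j))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1], total


# Toy costs (assumption): absolute pitch difference for both terms.
def pitch_gap(a, b):
    return abs(a[1] - b[1])


targets = [("a", 100), ("b", 110)]
candidates = [
    [("a", 100), ("a", 140)],
    [("b", 150), ("b", 105)],
]
path, total = select_units(targets, candidates, pitch_gap, pitch_gap)
print(path, total)
```

A recording script matters here because the search can only choose among units that exist in the database; a candidate list with no good match forces a high target or concatenation cost.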

30 Vocalic sandwiches (Cadic et al., Interspeech 2009)
Correlations of coverage rates with the selection cost:
- Vocalic sandwiches: -0.78
- Diphones: -0.44
- Triphones: -0.64
[Illustration]

31 Sentence construction
Finite State Transducers compute "optimal" sequences of sandwiches, so that:
- the coverage increment is maximized (greedy approach)
- only sandwich transitions observed in a reference corpus are allowed
15 FSTs give 15 optimal sandwich sequences, one for each length ≤ 15: optimal sequence of length 1, of length 2, of length 3, of length 4, …, of length 15.
The coverage increment is averaged over the sequence length.
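The per-length search on this slide can be illustrated by brute force: for every length up to a maximum, keep the sequence with the highest coverage increment averaged over its length. The sketch enumerates paths exhaustively, which is only workable on toy data and stands in for the transducers; the name `best_per_length`, the transition table and the covered set are hypothetical.

```python
def best_per_length(allowed, covered, max_len):
    """For each length 1..max_len, keep the path with the best averaged coverage increment."""
    best = {}

    def extend(path):
        # Number of distinct new units, averaged over the sequence length.
        gain = len(set(path) - covered) / len(path)
        n = len(path)
        if n not in best or gain > best[n][0]:
            best[n] = (gain, list(path))
        if n < max_len:
            for nxt in sorted(allowed.get(path[-1], ())):
                extend(path + [nxt])

    # Any unit seen in the transition table may start a sequence.
    units = set(allowed) | {u for nxts in allowed.values() for u in nxts}
    for start in sorted(units):
        extend([start])
    return best


# Toy transition table: unit -> units observed right after it (hypothetical).
allowed = {"s1": {"s2"}, "s2": {"s3", "s4"}, "s3": {"s1"}}
best = best_per_length(allowed, covered={"s3"}, max_len=3)
for length in sorted(best):
    print(length, best[length])
```

Averaging over the length makes candidates of different lengths comparable, so a long sequence is not preferred merely because it contains more units.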

