Published by Sabrina Johnston. Modified over 8 years ago.
2
1 Current Interests 2007~2008 (Unfinished papers & Premature ideas)
1. Identifying frication & aspiration noise in the frequency domain: The case of Korean alveolar lax fricatives
2. The role of prosody in dialect synthesis and authentication
3. Synthesis & evaluation of prosodically exaggerated utterances
4. Determining the weights of prosodic components in prosody evaluation
5. Difference database of prosodic features for automatic prosody evaluation
6. Transforming Korean alveolar lax fricatives into tense
7. Gender transformation of utterances
3
1. Identifying frication & aspiration noise in the frequency domain: The case of Korean alveolar lax fricatives Kyuchul Yoon School of English Language & Literature Yeungnam University Spring 2008 Joint Conference of KSPS & KASS
4
3 Korean lax alveolar fricatives Two different types of noise
5
4 Algorithm
6
5 Change of energy distribution in the frequency domain over time
Energy distribution on a frame-by-frame basis (e.g. 5 msec)
Sums of band energy across the reference (e.g. low cutoff) frequency
The criterionValue variable determines the boundary
Assumption: Same criterionValue for the same speaker
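The frame-by-frame idea above can be sketched in Python. This is only an illustration, not the Praat script from the talk: the function names, the naive DFT, and the decision rule (the boundary is the first frame whose share of energy above the reference frequency drops below criterion_value) are all assumptions made for the example.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum (bins 0..N/2) of one frame of samples."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def high_band_share(frame, sample_rate, cutoff_hz):
    """Fraction of the frame's spectral energy at or above cutoff_hz."""
    spectrum = power_spectrum(frame)
    total = sum(spectrum)
    if total == 0:
        return 0.0
    n = len(frame)
    high = sum(p for k, p in enumerate(spectrum)
               if k * sample_rate / n >= cutoff_hz)
    return high / total


def find_boundary(samples, sample_rate, frame_sec=0.005,
                  cutoff_hz=4000.0, criterion_value=0.5):
    """Time (s) of the first frame whose high-band share falls below
    criterion_value, or None if no frame does.

    ASSUMPTION: this threshold rule is a stand-in; the slides only say
    that criterionValue determines the boundary.
    """
    frame_len = int(round(frame_sec * sample_rate))
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if high_band_share(frame, sample_rate, cutoff_hz) < criterion_value:
            return start / sample_rate
    return None
```

Under the same-speaker assumption, one criterion_value would be tuned once per speaker and reused for all of that speaker's tokens.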
7
6 How Praat script works See Demo
8
7 How Praat script works
9
8 Experiment The list of words used in the experiment. The words marked with * were also used in the repeated series experiment. The numbers in parentheses represent the number of repetitions during the recording.
10
9 Results & Conclusion The histogram of differences between the manually inserted and automatically inserted boundaries for the repeated series experiment. X-axis in msec. Human 1 vs. Script 1 Repeated
11
10 Results & Conclusion The outlier from. The difference was 6.4 msec. The labels m and a represent manual and automatic, respectively.
12
11 Results & Conclusion The histogram of differences between the manually inserted and automatically inserted boundaries for the non-repeated series experiment with 53 words. X-axis in msec. Human 1 vs. Script 1 Non-repeated Human 2 vs. Script 2 Non-repeated The same-speaker-same-criterionValue assumption holds!
13
12 Results & Conclusion The histogram of differences between the two phoneticians and the two automated scripts for the non-repeated series experiment with 53 words. X-axis in msec. Human 1 vs. Human 2 Non-repeated Script 1 vs. Script 2 Non-repeated
14
13 Results & Conclusion The summary of the means and the standard deviations of the differences from the two experiments. The numbers are given in msec.
15
14 Results & Conclusion The automated identification of the boundary (labeled auto) between /s/ and /h/ in the phrase Miss Henry produced by a female native speaker of English. The f and v represent the beginnings of /s/ and the vowel following /h/.
16
15 References [1] Boersma, Paul. 2001. Praat, a system for doing phonetics by computer. Glot International 5(9/10). pp.341-345. [2] Yoon, Kyuchul. 2002. A production and perception experiment of Korean alveolar fricatives. Speech Sciences. 9(3). pp.169-184. [3] Yoon, Kyuchul. 2005. Durational correlates of prosodic categories: The case of two Korean voiceless coronal fricatives. Speech Sciences. 12(1). pp.89-105.
17
2. The role of prosody in dialect synthesis and authentication Kyuchul Yoon School of English Language & Literature Yeungnam University Spring 2008 Joint Conference of KSPS & KASS
18
17 Goals 1.Synthesize Masan utterances from matching Seoul utterances by prosody cloning 2.Examine the role of prosody in the authentication of synthetic Masan utterances (Listening experiment)
19
18 Background
Differences among dialects
– Segmental differences
    Fricative differences in the time domain (Lee, 2002)
    – Busan fricatives have shorter frication/aspiration intervals than Seoul fricatives
    Fricative differences in the frequency domain (Kim et al., 2002)
    – The low cutoff frequency of Kyungsang fricatives was higher than that of Cholla fricatives (> 1,000 Hz)
– Non-segmental or prosodic differences
    Intonation or fundamental frequency (F0) contour difference
    Intensity contour difference
    Segment durational difference
    Voice quality difference
20
19 Synthesis
Simulating (by prosody cloning) Masan dialect from Seoul dialect
The simulated Masan utterances will have
– the speech segments of Seoul dialect
– the prosody of Masan dialect
    F0 contour
    Intensity contour
    Segmental duration
21
20 Evaluation
Through a listening experiment
Stimuli consist of
– #1. Authentic, but synthetic, Masan utterance
– #2. Seoul utterance with Masan segmental durations (D)
– #3. Seoul utterance with Masan F0 contour (F)
– #4. Seoul utterance with Masan intensity contour (I)
– #5. Seoul utterance with Masan durations and F0 contour (D+F)
– #6. Seoul utterance with Masan durations and intensity contour (D+I)
– #7. Seoul utterance with Masan F0 contour and intensity contour (F+I)
– #8. Seoul utterance with Masan durations, F0 contour and intensity contour (D+F+I)
Sentences
(1) 동대구에 볼 일이 없습니다. (“I have no business in Dongdaegu.”)
(2) 바다에 보물섬이 없다 (“There is no treasure island in the sea.”)
22
21 Prosody transfer (PSOLA algorithm)
Three aspects of the prosody
– Fundamental frequency (F0) contour
– Intensity contour
– Segmental durations
Pitch-Synchronous OverLap and Add (PSOLA) algorithm (Moulines & Charpentier, 1990)
– Implemented in Praat (Boersma, 2005)
– Use of a script for semi-automatic segment-by-segment manipulation (Yoon, 2007)
23
22 Prosody transfer (PSOLA algorithm)
Procedures for full prosody transfer
– Align segments between Masan and Seoul utterances
– Make the segment durations of the two identical
– Make the two F0 contours identical
– Make the two intensity contours identical
24
23 Prosody transfer (PSOLA algorithm)
Align segments between Masan and Seoul utterances
Make the segment durations of the two utterances identical
[Figure: the segments ㅂ ㅏ ㄹ ㅏ ㅁ of “… 바람 …” in the Masan and Seoul utterances; labels: stretch, shrink]
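The duration step can be illustrated with a short Python sketch. It only computes the per-segment scaling factors that a PSOLA-style duration manipulation would need. The function name and the durations are hypothetical, and the sketch assumes the two utterances have already been aligned segment by segment.

```python
def duration_factors(source_durs, target_durs):
    """Per-segment factor by which each source segment must be scaled
    to match the target: > 1 means stretch, < 1 means shrink."""
    if len(source_durs) != len(target_durs):
        raise ValueError("utterances must have the same number of aligned segments")
    factors = []
    for src, tgt in zip(source_durs, target_durs):
        if src <= 0:
            raise ValueError("segment durations must be positive")
        factors.append(tgt / src)
    return factors


# Hypothetical durations (msec) for five aligned segments, e.g. ㅂ ㅏ ㄹ ㅏ ㅁ
seoul = [50.0, 100.0, 40.0, 80.0, 60.0]
masan = [60.0, 90.0, 40.0, 120.0, 60.0]
factors = duration_factors(seoul, masan)
```

Each factor would then be applied to the matching Seoul segment so that its duration equals that of the Masan segment.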
25
24 Prosody transfer (PSOLA algorithm) ㅂㅏㄹㅏㅁ Masan Seoul ㅂㅏㄹㅏㅁ Masan F0 Seoul F0 Make the two F0 contours identical
26
25 Prosody transfer (PSOLA algorithm) Seoul intensity ㅂㅏㄹㅏㅁ Masan Seoul ㅂㅏㄹㅏㅁ Masan intensity Make the two intensity contours identical
27
26 Synthetic (simulated) Masan stimulus
28
27 Synthetic authentic Masan stimulus
29
28 Listening experiment
16 stimuli (8 + 8)
Presented to 13 Masan/Changwon listeners
– On a scale of 1 (worst) to 10 (best)
– Used Praat ExperimentMFC object
– Allowed repetition of stimulus: up to 10 times
30
29 Listening experiment See Demo
31
30 Results & Conclusion Histogram of listener responses
32
31 Results & Conclusion F0 contour transfer 1 … listener responses … 10
33
32 Results & Conclusion Seoul utterances with Masan prosody D F I DF DI FI DFI Masan
34
33 Results & Conclusion
Main effects of
– Segmental durations: F(1,12) = 11.53, p = 0.005
– F0 contour: F(1,12) = 141.12, p = 0.00000005
Regression analysis
35
34 Results & Conclusion
Prosody cloning not sufficient for dialect simulation
– (Sub)Segmental differences may be at work
– Quality of synthetic stimuli
F0 contour transfer (from Masan to Seoul)
– Most influential on shifting perception from Seoul to Masan utterances
36
35 References
[1] Kyung-Hee Lee, “Comparison of acoustic characteristics between Seoul and Busan dialect on fricatives”, Speech Sciences, Vol.9/3, pp.223-235, 2002.
[2] Hyun-Gi Kim, Eun-Young Lee, and Ki-Hwan Hong, “Experimental phonetic study of Kyungsang and Cholla dialect using power spectrum and laryngeal fiberscope”, Speech Sciences, Vol.9/2, pp.25-47, 2002.
[3] Kyuchul Yoon, “Imposing native speakers’ prosody on non-native speakers’ utterances: The technique of cloning prosody”, Journal of the Modern British & American Language & Literature, Vol.25/4, pp.197-215, 2007.
[4] E. Moulines and F. Charpentier, “Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones”, Speech Communication, Vol.9/5-6, pp.453-467, 1990.
[5] P. Boersma, “Praat, a system for doing phonetics by computer”, Glot International, Vol.5/9-10, pp.341-345, 2005.
37
3. Synthesis & evaluation of prosodically exaggerated utterances: A preliminary study Kyuchul Yoon School of English Language & Literature Yeungnam University Spring 2008 Joint Conference of KSPS & KASS
38
37 Contents
Synthesis & evaluation of human utterances with exaggerated prosody
Synthesis of exaggerated prosody
– Useful for presenting native utterances to students
– The definition of prosody “exaggeration”
– The algorithm
Evaluation of exaggerated prosody
– Useful for evaluating learner utterances
– The algorithm & an experiment
39
38 Teaching & evaluating prosody
Teaching language prosody
– The need for “exaggeration” of native utterances
– How to define “exaggeration”
Evaluating language prosody
– Given the native version of an utterance, evaluate learner’s atypical prosody
– How to measure the differences between the native and learner utterances
40
39 Exaggerating native prosody
Exaggeration of the F0 contour
– One way would be to make the pitch peaks/valleys higher/lower
Exaggeration of the intensity contour
– One way would be to manipulate the intensity contour of the pitch peaks (or valleys)
Exaggeration of the segmental durations
– One way would be to manipulate the segmental durations of the pitch peaks (or valleys)
See Demo
41
40 Exaggerating native prosody
F0: The fundamental frequency (F0) contour of the utterance “Marianna!”.
42
41 Exaggerating native prosody
Intensity: The intensity contour of the utterance “Marianna!”.
43
42 Exaggerating native prosody
Duration: The segmental durations of the utterance “Marianna!” before and after the exaggeration.
44
43 Algorithm: prosody exaggeration
Definition of prosody exaggeration
– F0 contour
    Make pitch peaks/valleys higher/lower in Hz values
– Intensity contour
    Make pitch peaks higher in dB values
– Segmental durations
    Make pitch peaks longer in time values
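One possible reading of the F0 part of this definition is sketched below in Python. The slides do not say how much higher or lower the peaks and valleys are made. Scaling each value's deviation from the contour mean by a constant factor is an assumption chosen for illustration, and the function name is hypothetical.

```python
def exaggerate_contour(values, factor):
    """Scale each defined value's deviation from the contour mean by factor.

    ASSUMPTION: mean-centred scaling is one simple way to push peaks up
    and valleys down; it is not taken from the slides.
    None marks undefined (e.g. unvoiced) frames and is passed through.
    """
    defined = [v for v in values if v is not None]
    if not defined:
        return list(values)
    mean = sum(defined) / len(defined)
    return [None if v is None else mean + factor * (v - mean) for v in values]


# Hypothetical F0 contour in Hz with unvoiced frames at both ends
f0 = [None, 180.0, 220.0, 200.0, None]
exaggerated = exaggerate_contour(f0, 1.5)
# The 220 Hz peak rises to 230 Hz and the 180 Hz valley drops to 170 Hz.
```

With a factor of 1.0 the contour is unchanged, and larger factors give stronger exaggeration.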
45
44 Algorithm: prosody exaggeration F0
46
45 Algorithm: prosody exaggeration Intensity
47
46 Algorithm: prosody exaggeration Durations
48
47 How Praat script works
49
48 How Praat script works F0 Intensity Durations
50
49 How Praat script works Original F0 Durations Intensity F0 Durations
51
50 Evaluating learner prosody
Assumes the existence of the native version
Evaluates the learner versions
Evaluation of the F0 & intensity contours
– Is preceded by duration manipulation: the durations of the matching segments of the two utterances are made identical [3]
– Is preceded by F0/intensity normalization & F0 smoothing
    The mean difference is added to or subtracted from the learner utterance
– Is followed by pitch/intensity point-to-point comparison
Evaluation of segmental durations
– Done without any duration manipulation; segment-to-segment comparison
Evaluation measure: Euclidean distance metric
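The normalization and point-to-point comparison steps can be sketched as follows. This is a minimal Python illustration, assuming that the two contours already have the same number of frames (that is, duration manipulation has been done) and that undefined frames are marked with None. The function names are hypothetical and smoothing is left out.

```python
import math


def normalize_to_native(native, learner):
    """Shift the learner contour by the mean native-minus-learner difference,
    computed over frames that are defined in both contours."""
    pairs = [(n, l) for n, l in zip(native, learner)
             if n is not None and l is not None]
    if not pairs:
        raise ValueError("no frames are defined in both contours")
    mean_diff = sum(n - l for n, l in pairs) / len(pairs)
    return [None if l is None else l + mean_diff for l in learner]


def euclidean_distance(native, learner):
    """Point-to-point Euclidean distance over frames defined in both contours."""
    return math.sqrt(sum((n - l) ** 2 for n, l in zip(native, learner)
                         if n is not None and l is not None))


# Hypothetical F0 values in Hz
native = [200.0, 210.0, 190.0]
learner = [150.0, 170.0, 130.0]
shifted = normalize_to_native(native, learner)   # [200.0, 220.0, 180.0]
score = euclidean_distance(native, shifted)      # sqrt(200), about 14.14
```

A smaller distance means the learner contour is closer in shape to the native one once the overall level difference has been removed.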
52
51 Algorithm: prosody evaluation
Before & after duration manipulation
[Figure panels: native, learner before, learner after]
53
52 Algorithm: prosody evaluation
F0 point-to-point comparison between native and learner
[Figure panels: native, learner after]
Normalization & smoothing were performed in prior steps
54
53 Algorithm: prosody evaluation
Intensity point-to-point comparison between native and learner
[Figure panels: native, learner after]
Normalization was performed in prior steps
55
54 Algorithm: prosody evaluation
Duration segment-to-segment comparison between native and learner
[Figure panels: native, learner before]
Euclidean distance metric for evaluation measure: P = (p1, p2, p3, ..., pn) and Q = (q1, q2, q3, ..., qn) in Euclidean n-space
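For reference, the standard Euclidean distance between two such points is written out below. Reading P as the native values and Q as the learner values (F0 or intensity frames, or segment durations) is an interpretation of the slide, not a statement from it.

```latex
d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```

The distance is zero only when every learner value equals the matching native value, which corresponds to the ideal case of a perfect clone.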
56
55 A pilot experiment
[Figure panels: native, learner after D/F/I cloning]
An ideal case: Three Euclidean distances (Ed) should be minimum
– Ed1: F0 contour
– Ed2: Intensity contour
– Ed3: Segment durations
57
56 Creation of Stimuli: F0
F0: -100 Hz to +100 Hz with a 10 Hz interval (21 stimuli)
Evaluation of the stimuli against the F0 contour of the native utterance
[Figure panels: native, learner after D cloning]
58
57 Creation of Stimuli
Intensity: -25 dB to +25 dB with a 5 dB interval (11 stimuli)
Evaluation of the stimuli against the intensity contour of the native utterance
[Figure panels: native, learner after D cloning]
59
58 Creation of Stimuli
Duration: 0.25, 0.50, 0.75, 1.00, 1.50, 2.00, 2.50, 3.00 times the original (8 stimuli)
Evaluation of the stimuli against the segment durations of the native utterance
[Figure panels: native, learner]
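The three stimulus grids can be reproduced with a few lines of Python, which also confirms the stimulus counts given on the slides. The helper names are hypothetical, and applying each offset uniformly to the whole contour is an assumption, since the slides do not say how the shift was applied.

```python
def offset_steps(low, high, step):
    """Inclusive list of integer offsets from low to high."""
    return list(range(low, high + 1, step))


f0_offsets_hz = offset_steps(-100, 100, 10)        # 21 values
intensity_offsets_db = offset_steps(-25, 25, 5)    # 11 values
duration_scales = [0.25, 0.50, 0.75, 1.00, 1.50, 2.00, 2.50, 3.00]  # 8 values


def shift_contour(values, offset):
    """ASSUMPTION: one stimulus = the whole contour shifted by one offset.
    None marks undefined frames and is passed through."""
    return [None if v is None else v + offset for v in values]


# One hypothetical F0 stimulus per offset, from a made-up contour in Hz
contour = [None, 180.0, 220.0, 200.0]
f0_stimuli = [shift_contour(contour, off) for off in f0_offsets_hz]
```

Each stimulus would then be scored against the native utterance with the distance measure, so that the score can be checked against the known size of the manipulation.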
60
59 Results & Conclusion
61
60 Results & Conclusion
62
61 Results & Conclusion
63
62 Results & Conclusion
Prosody exaggeration
– Can be a tool for teaching language prosody
– Can be used to test measures for evaluating prosody
Limitation of the current prosody evaluation
– Native utterances should exist to yield measures
    TTS systems with advanced prosody models could be helpful to process any learner utterances
– “Weights” of the three separate measures (F0/intensity/duration) need to be determined
    Experiments with human evaluators could provide the weights
64
63 References [1] Boersma, Paul. 2001. Praat, a system for doing phonetics by computer. Glot International 5(9/10). pp.341-345. [2] Moulines, E. & F. Charpentier. 1990. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication 9. pp.453-467. [3] Yoon, K. 2007. Imposing native speakers' prosody on non-native speakers' utterances: The technique of cloning prosody. Journal of the Modern British & American Language & Literature 25(4). pp.197-215.
65
64 4. Determining the weights of prosodic components in prosody evaluation
Problem
– Raw components vs. Abstracted concepts
– F0, intensity, duration vs. Rhythm, tempo, etc.
Determine the weights of prosodic components in prosody evaluation
– Use raw units: F0, intensity, duration
– Use cloning of prosody (problem of unequal number of segments)
– Create an “other-things-being-equal” environment
– Evaluation of
    Each raw prosodic component
    Overall prosodic fluency
– Compare & Assess the weights of each component in prosody evaluation
66
65 Stimuli (4) Determining the weights of prosodic components in prosody evaluation
Given (a) model native utterance(s) and its learner version
– Human evaluator evaluates the learner utterance in terms of its prosodic fluency = Overall Prosody Score (from the unmodified learner utterance)
Manipulate the learner utterance to create an “other-things-being-equal” environment so that the learner utterance is the same as its native version except for
– (1) Its F0 contour (learner utterance version 1)
– (2) Its intensity contour (learner utterance version 2)
– (3) Its segmental durations (learner utterance version 3)
Evaluate the manipulated learner utterances
– (1) F0 score (from learner version 1)
– (2) Intensity score (from learner version 2)
– (3) Duration score (from learner version 3)
Hypothesis: Overall prosody score = α × (F0 score) + β × (Intensity score) + γ × (Duration score), where α, β and γ are the coefficients (weights) to be determined
Repeat the evaluation for other utterances from the same learner to solve the equation
Verify the coefficients with unevaluated utterances from the same learner
If the hypothesis holds, make the prosody evaluation process automatic
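Solving the equation for the three weights can be sketched in Python as a least-squares fit over several evaluated utterances. Everything here is illustrative: the scores are made up, the function names are hypothetical, and least squares is one assumed way of solving the equation when more than three utterances are available.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: more varied utterances are needed")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_weights(component_scores, overall_scores):
    """Least-squares weights via the normal equations (A^T A) w = A^T y."""
    k = len(component_scores[0])
    ata = [[sum(row[i] * row[j] for row in component_scores)
            for j in range(k)] for i in range(k)]
    aty = [sum(row[i] * y for row, y in zip(component_scores, overall_scores))
           for i in range(k)]
    return solve_linear(ata, aty)


# Made-up (F0, intensity, duration) scores for four utterances of one learner
scores = [[8.0, 6.0, 7.0], [5.0, 9.0, 4.0], [7.0, 7.0, 9.0], [3.0, 5.0, 6.0]]
true_weights = (0.5, 0.2, 0.3)  # hypothetical weights used to build the data
overall = [sum(w * s for w, s in zip(true_weights, row)) for row in scores]
weights = fit_weights(scores, overall)  # recovers roughly (0.5, 0.2, 0.3)
```

The fitted weights could then be checked on held-out utterances from the same learner, which is the verification step described above.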
67
66 Stimuli “The dancing queen likes only the apple pies” Native (5061_02) Learner (1047_02) Evaluate overall prosody with respect to the native version (Overall Prosody Score)
68
67 Stimuli “The dancing queen likes only the apple pies”
[Figure panels: Native, Learner_DI, Learner_DF]
Learner_DI: Now has the native durations/intensity. Evaluate F0 contour (F0 Score)
Learner_DF: Now has the native durations/F0 contour. Evaluate intensity contour (Intensity Score)
69
68 Stimuli “The dancing queen likes only the apple pies”
[Figure panels: Native, Learner_FI]
Learner_FI: Now has the native F0/intensity. Evaluate segmental durations (Duration Score)
Overall prosody score = α × (F0 score) + β × (Intensity score) + γ × (Duration score), where α, β and γ are the weights to be determined
70
69 5. Difference database of prosodic features for automatic prosody evaluation
Given (a) model native utterance(s) and its learner version, get difference values between the two utterances of
– (1) F0 contour
– (2) intensity contour
– (3) segmental durations
Use techniques & scripts used in
– (3) Synthesis & evaluation of prosodically exaggerated utterances
Store difference values of each prosodic feature for each learner utterance in a database
Use the database to develop algorithms for automatic prosody scoring
Pilot study: labeled sentences from KT_K-SEC corpus
71
70 Pilot data (5) Difference database of prosodic features for automatic prosody evaluation
72
71 Pilot data (5) Difference database of prosodic features for automatic prosody evaluation
Intensity difference
native       learner      numFrames  frameNo  time   nativedB  learnerdB  diffdB
5053_02.wav  1044_02.wav  482        1        0.035  31.86     42.42      -10.56
5053_02.wav  1044_02.wav  482        2        0.043  30.73     42.45      -11.72
5053_02.wav  1044_02.wav  482        3        0.051  29.33     41.94      -12.61
5053_02.wav  1044_02.wav  482        4        0.059  29.03     41.00      -11.97
5053_02.wav  1044_02.wav  482        5        0.067  29.11     40.97      -11.86
5053_02.wav  1044_02.wav  482        6        0.075  29.92     41.97      -12.05
5053_02.wav  1044_02.wav  482        7        0.083  30.27     42.67      -12.40
5053_02.wav  1044_02.wav  482        8        0.091  31.14     42.63      -11.49
5053_02.wav  1044_02.wav  482        9        0.099  30.27     44.10      -13.83
5053_02.wav  1044_02.wav  482        10       0.107  30.35     45.12      -14.77
5053_02.wav  1044_02.wav  482        11       0.115  30.73     43.90      -13.18
5053_02.wav  1044_02.wav  482        12       0.123  30.53     43.15      -12.62
5053_02.wav  1044_02.wav  482        13       0.131  32.44     42.67      -10.22
5053_02.wav  1044_02.wav  482        14       0.139  31.12     40.94      -9.82
5053_02.wav  1044_02.wav  482        15       0.147  30.97     38.88      -7.91
5053_02.wav  1044_02.wav  482        16       0.155  33.92     38.15      -4.24
5053_02.wav  1044_02.wav  482        17       0.163  33.78     37.45      -3.67
5053_02.wav  1044_02.wav  482        18       0.171  32.72     35.75      -3.03
Sum of squares of the diffdB values: 42114
Square root of the sum: 205
73
72 Pilot data (5) Difference database of prosodic features for automatic prosody evaluation
Duration difference
native            learner           numSegs  segNo  nativeSegID  learnerSegID  timeStart  nativeDur  ratio  normNativeDur  learnerDur  normDiffDur
5053_02.TextGrid  1044_02.TextGrid  33       1      SIL          SIL           0          330        1.027  321            328         -7
5053_02.TextGrid  1044_02.TextGrid  33       2      dh           dh            0.330      22         1.027  22             16          5
5053_02.TextGrid  1044_02.TextGrid  33       3      ax           ax            0.353      60         1.027  59             86          -27
5053_02.TextGrid  1044_02.TextGrid  33       4      SIL          SIL           0.413      104        1.027  101            67          34
5053_02.TextGrid  1044_02.TextGrid  33       5      dd           dd            0.517      19         1.027  19             14          5
5053_02.TextGrid  1044_02.TextGrid  33       6      ae           ae            0.536      151        1.027  147            126         21
5053_02.TextGrid  1044_02.TextGrid  33       7      nn           nn            0.686      57         1.027  55             92          -37
5053_02.TextGrid  1044_02.TextGrid  33       8      ss           ss            0.743      92         1.027  89             102         -13
5053_02.TextGrid  1044_02.TextGrid  33       9      ih           ih            0.835      67         1.027  66             111         -45
5053_02.TextGrid  1044_02.TextGrid  33       10     ng           ng            0.902      100        1.027  98             70          28
5053_02.TextGrid  1044_02.TextGrid  33       11     kk           kk            1.002      147
Sum of squares of the diffDur values: 59266
Square root of the sum: 243
74
73 Pilot data (5) Difference database of prosodic features for automatic prosody evaluation
F0 difference
native       learner      numFrames  frameNo  time   nativeF0       learnerF0      diffF0
5053_02.wav  1044_02.wav  388        1        0.024  --undefined--  --undefined--  --undefined--
5053_02.wav  1044_02.wav  388        2        0.034  --undefined--  --undefined--  --undefined--
5053_02.wav  1044_02.wav  388        3        0.044  --undefined--  --undefined--  --undefined--
5053_02.wav  1044_02.wav  388        4        0.054  --undefined--  --undefined--  --undefined--
…
5053_02.wav  1044_02.wav  388        35       0.364  220            198            22
5053_02.wav  1044_02.wav  388        36       0.374  213            197            16
5053_02.wav  1044_02.wav  388        37       0.384  207            197            11
5053_02.wav  1044_02.wav  388        38       0.394  203            196            7
5053_02.wav  1044_02.wav  388        39       0.404  200            195            5
5053_02.wav  1044_02.wav  388        40       0.414  198            194            4
5053_02.wav  1044_02.wav  388        41       0.424  197            194            4
…
Sum of squares of the diffF0 values: 236363
Square root of the sum: 486
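The summary lines under each pilot table follow the same recipe: square the stored differences, sum them, and take the square root. A small Python sketch of that step is given below. The function name is hypothetical, and representing the `--undefined--` frames as None and skipping them is an assumption. The last lines simply check that the reported square roots match the reported sums of squares.

```python
import math


def difference_distance(diffs):
    """Square root of the sum of squared differences.
    None stands for an undefined frame and is skipped (assumption)."""
    return math.sqrt(sum(d * d for d in diffs if d is not None))


# Sums of squares as reported on the slides (over all frames or segments)
reported_sums = {"intensity": 42114, "duration": 59266, "F0": 236363}
rounded_roots = {name: round(math.sqrt(total))
                 for name, total in reported_sums.items()}
# rounded_roots == {"intensity": 205, "duration": 243, "F0": 486}
```

Storing the per-frame and per-segment differences, rather than only these totals, would let other scoring functions be tried later on the same database.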
75
74 6. Transforming Korean alveolar lax fricatives into tense
Goal
– Test factors that distinguish /ㅅ/ from /ㅆ/
Type of factors
– Consonantal: noise durations, center of gravity
– Vocalic: formant/bandwidth switching
– Prosodic: clone F0/intensity/durations, switch source signals
76
75 Pilot data (6) Transforming Korean alveolar lax fricatives into tense 사자 vs. 싸자
77
76 Pilot data (6) Transforming Korean alveolar lax fricatives into tense 사자 싸자 사자 Prosody: Durations F0 Intensity
78
77 Pilot data (6) Transforming Korean alveolar lax fricatives into tense 사자 싸자 사자 Prosody + Formants Bandwidths
79
78 Design (6) Transforming Korean alveolar lax fricatives into tense
Things to do
– Try the reverse: manipulate /ㅆ/ to simulate /ㅅ/
– Try this with other lax/tense pairs of stops: 사 싸, 다 따, 바 빠, 가 까
– Try switching the source signal
Listening experiments
– [1] Render /ssa/ from /sha/: (1) prosody, (2) formant/bandwidth, (3) source
    (1)+(2): shift?
    (1)+(3): shift?
    (1)+(2)+(3): shift?
    (1)+(2)+undo(1): see effect of (2) only
    (1)+(3)+undo(1): see effect of (3) only
    (1)+(2)+(3)+undo(1): see the effects of (2) and (3) only
– [2] Render /sha/ from /ssa/: (1) prosody, (2) formant/bandwidth, (3) source
    (1)+(2): shift?
    (1)+(3): shift?
    (1)+(2)+(3): shift?
    (1)+(2)+undo(1): ?
    (1)+(3)+undo(1): ?
    (1)+(2)+(3)+undo(1): ?
– [3] Statistical analyses of formants/bandwidths
    Examine post-consonantal vowels in terms of their formants/bandwidths for any possible intra/inter-consonantal differences
    Identify the portion of the vowels that contributes to the distinction of lax/tense consonants, e.g. ½, ¼ from the vowel onset
80
79 7. Gender transformation of utterances
Examine male vs. female utterances in terms of prosodic & segmental differences
– Identify factors that differ
– Refer to Praat’s change gender… under Convert button
– Verify by synthesis
Prosody manipulation
– F0/intensity/durations/source
Segment manipulation
– Formant frequencies & bandwidths