
1 Investigation of Prosodic Features for Wake-Up-Word Speech Recognition Task
by Chih-Ti Shih
Instructor: Dr. Veton Z. Këpuska, Dept. of Electrical and Computer Engineering, Florida Institute of Technology
Good morning everyone, my name is Chih-Ti. I am a computer engineering master's student and I have been working with Dr. Këpuska on speech recognition for about two years. Today I am going to give a talk about my thesis work: an investigation of prosodic features for the Wake-Up-Word speech recognition task.

2 Outline
- Introduction to the Wake-Up-Word (WUW) speech recognition problem.
- Prosodic features.
- Investigation of pitch-based features: pitch characteristics and pitch estimation; the pitch estimation algorithm, eSRFD; pitch feature experiment results.
- Investigation of energy-based features: energy characteristics; energy extraction.
- Data collection.
In my presentation today I will start with an introduction to the WUW speech recognition system, followed by a brief introduction to prosodic features. Then I will cover the investigation of two prosodic feature measurements, pitch and energy. Finally, I will talk about the ongoing speech data collection project.

3 Introduction to WUW
Wake-Up-Word (WUW) speech recognition was invented by Dr. Këpuska. Objective: to recognize a certain word used to request or gain the attention of the system (alerting context), the WUW(s), and to reject the same word used in referential context, as well as all other words, sounds, and noise, the non-WUW(s).
Example:
Alerting context: "Operator, please go to the next slide."
Referential context: "We are using the word operator as the WUW."
Let me start with a bit of background. One important purpose of the WUW system is to recognize a certain word used to request or gain the attention of the system; we call such uses of the word, in alerting context, WUWs. The same word in referential context is a non-WUW. In the first example sentence, the word operator is used in an alerting context, requesting attention from the system. In the second example sentence, the word operator is used in a referential context, and no attention is requested from the system. My thesis investigates using prosodic features to distinguish these two contexts.

4 Alerting vs. Referential Context
Empirical evidence suggests that a Wake-Up-Word can be distinguished by its use (alerting or referential) from prosodic features. In my thesis, I attempted to evaluate the validity of this hypothesis, and the results are presented here.

5 Prosodic Features
The word prosody refers to the intonational and rhythmic aspects of a language (Merriam-Webster Dictionary). In modern phonetics, prosody most often refers to those properties of speech that cannot be derived from the segmental sequence of phonemes underlying human utterances (William J. Hardcastle, 1997). From the phonological aspect, prosody may be classified into:
- Structure
- Tune
- Prominence
Before we go into the investigation of prosodic features, let me explain what they are: the intonation and rhythm of a language, and in phonetics the properties of speech that cannot be derived from the phoneme sequence.

6 Prosodic Features: Structure
Prosodic structure refers to the noticeable breaks or disjunctures between words in a sentence, which can be interpreted as the duration of the silence between words as a person speaks. The silence period before the WUW is usually longer than the average silence period before any other word in the sentence. This feature is considered in the original WUW speech recognition system by comparing the duration of the silence just before the WUW with the durations of the silences between non-WUWs (a minimal sketch of this comparison follows below). Let me emphasize again: the non-WUWs are the WUW in referential context and all other words in a sentence.
Example: Word1 S2 Word2 S3 Word3 S4 Word4 … Sn-1 Wordn-1 SWUW WUW
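A minimal sketch of this comparison, assuming the pause durations have already been produced by the silence detector; the function names and the ratio-based decision are illustrative, not the system's exact implementation:

```python
def silence_structure_score(silences_before_words, silence_before_wuw):
    """Compare the pause just before a candidate WUW with the average
    inter-word pause in the rest of the sentence (illustrative sketch).

    silences_before_words: durations (seconds) of pauses S2..Sn-1
    silence_before_wuw:    duration of the pause SWUW just before the WUW
    """
    if not silences_before_words:
        return float("inf")  # no reference pauses: treat as maximally salient
    avg_pause = sum(silences_before_words) / len(silences_before_words)
    if avg_pause == 0.0:
        return float("inf")
    # Ratio > 1 means the pre-WUW pause is longer than the typical pause,
    # which supports the alerting-context hypothesis for this utterance.
    return silence_before_wuw / avg_pause

# Example: 50-80 ms pauses between ordinary words, 400 ms before the WUW.
print(silence_structure_score([0.05, 0.08, 0.06], 0.40))  # ~6.3
```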

7 Prosodic Features: Tune
Tune refers to the intonational melody of an utterance (Jurafsky & Martin). Tune may be quantified by pitch measurement, also known as the fundamental frequency of the sound.
Example: "You talking to me?" (Robert De Niro in "Taxi Driver"). The continuous increase of the pitch over the last three words indicates the request.

8 Pitch
Pitch is the fundamental frequency (F0), or repetition frequency, of a sound. Pitch is determined by the rate of vibration of the vocal cords located in the larynx. The range of pitch an individual can produce:
Male: … Hz
Female: 180 – 400 Hz
Pitch is computed using a fundamental-frequency determination algorithm (FDA).

9 Human Vocal
As I mentioned, the vibration of the vocal cords determines the pitch. So how does pitch vary as we speak? Contraction of the vocal cords increases the vibration frequency, producing a higher pitch, and relaxation of the vocal cords reduces the vibration frequency, producing a lower pitch. The smaller the vocal cords, the higher the pitch; that is why children and women usually have higher pitch: they have smaller vocal cords.

10 Pitch – FDA Chart (Male)
There are many FDAs. According to the evaluation by Dr. Bagshaw, the eSRFD achieves the smallest combined error rate compared with the other FDAs. A similar result showing that eSRFD achieves the best performance can also be found in the paper by Veprek and Scordilis (2002). The evaluation chart includes four types of error: gross error low refers to halving errors; gross error high refers to doubling errors; voiced error refers to unvoiced frames misidentified as voiced; and unvoiced error refers to voiced frames misidentified as unvoiced.
FDA Evaluation Chart: Male Speech. Reproduced from (Bagshaw, 1994)

11 Pitch – FDA Chart (Female)
As we can see from the two evaluation charts, eSRFD achieves the best performance, so we are going to use it as our FDA.
FDA Evaluation Chart: Female Speech. Reproduced from (Bagshaw, 1994)

12 Pitch – eSRFD Algorithm
The concept of eSRFD is to compute a normalized cross-correlation between two sections of the signal. The correlation tells us how similar the two sections are: the more similar they are, the higher the probability that the window size is the fundamental period of that frame. In eSRFD, two cross-correlations are used: one over sections x and y, and a second over sections y and z. The core computation is sketched below.
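The following is a minimal sketch of this computation, not Bagshaw's exact formulation; the buffer layout, threshold default, and pitch search range are illustrative assumptions:

```python
import numpy as np

def normalized_cross_correlation(buf, n):
    """P_xy over two adjacent length-n sections x and y of a sample buffer.

    In eSRFD the sections surround a short (6.5 ms) analysis frame; here
    `buf` is simply assumed long enough to hold both sections (2n samples).
    A value near 1 means n samples is a strong fundamental-period candidate.
    """
    x = np.asarray(buf[:n], dtype=float)
    y = np.asarray(buf[n:2 * n], dtype=float)
    x, y = x - x.mean(), y - y.mean()
    denom = np.sqrt(np.dot(x, x) * np.dot(y, y))
    return float(np.dot(x, y) / denom) if denom > 0 else 0.0

def period_candidates(buf, fs, t_srfd=0.8, f_min=50.0, f_max=500.0):
    """All window lengths n (in samples) whose P_xy exceeds T_srfd."""
    n_min, n_max = int(fs / f_max), int(fs / f_min)
    candidates = []
    for n in range(n_min, min(n_max, len(buf) // 2) + 1):
        p_xy = normalized_cross_correlation(buf, n)
        if p_xy > t_srfd:
            candidates.append((n, p_xy))
    return candidates
```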

13 Pitch – eSRFD Algorithm, cont.
At beginning, the speech signal will be pass through a low pass filter. Then the speech signal is divided into frames, in our case, the each frame is 6.5 ms. The each frame is pass through the silence detector, if the frame is unvoiced, no further process will be performed, if the frame is voiced, cross-correlation of Pxy will be performed based on various window length from 12/4/2018 Chih-Ti Shih

14 Pitch – eSRFD Algorithm, cont.
Window lengths with Pxy > Tsrfd are considered candidates for that frame. The second cross-correlation, Pyz, is then performed on those candidates.

15 Pitch – eSRFD Algorithm, cont.
This part assigns scores and determines the pitch. After both Pxy and Pyz are computed for each candidate, scores are given: if both Pxy and Pyz are above Tsrfd, the candidate scores 2; if only Pxy is above Tsrfd, it scores 1. After scoring there are four cases: (1) exactly one candidate scores 2; (2) exactly one candidate scores 1; (3) multiple candidates score 1; (4) multiple candidates score 2 and 1. In case 1, the candidate is taken as the optimal value of the pitch period of the frame. In case 2, we check whether the previous and next frames are unvoiced: if both are silent, this frame is reclassified as unvoiced; if either is voiced, the candidate is taken as the optimal pitch period. In cases 3 and 4, the candidates are sorted in ascending order and another cross-correlation, Qnm, is performed on each candidate, with window lengths from the largest candidate to the smallest; the optimal candidate is then found from the Qnm measurement. A condensed sketch of this decision logic follows below.
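The sketch below condenses the four cases into one function; an abstract refine() callback stands in for the final Qnm pass, and all names are illustrative:

```python
def pick_period(candidates, prev_voiced, next_voiced, refine):
    """Decide a frame's pitch period from scored candidates (sketch).

    candidates: list of (period, score), where score 2 means both P_xy and
    P_yz exceeded T_srfd and score 1 means only P_xy did. refine(periods)
    stands in for the extra Q_nm cross-correlation pass used when several
    candidates remain. Returns the chosen period, or None if unvoiced.
    """
    twos = sorted(p for p, s in candidates if s == 2)
    ones = sorted(p for p, s in candidates if s == 1)

    if len(twos) == 1 and not ones:           # case 1: single score-2 candidate
        return twos[0]
    if len(ones) == 1 and not twos:           # case 2: single score-1 candidate
        if not (prev_voiced or next_voiced):  # isolated: reclassify as unvoiced
            return None
        return ones[0]
    if twos or ones:                          # cases 3 & 4: several candidates
        return refine(twos + ones)            # Q_nm pass picks the optimal one
    return None
```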

16 Pitch: Median Filter
[Figure: pitch contour with two doubling errors and a halving error marked.]
As I mentioned with the FDA evaluation charts, halving and doubling errors may occur during pitch estimation. We then apply a median filter to fix those errors, as sketched below.
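One common way to implement this step is SciPy's median filter applied over the voiced frames only; the kernel length here is an assumption, not the thesis' exact setting:

```python
import numpy as np
from scipy.signal import medfilt

def smooth_pitch_track(f0, kernel=5):
    """Median-filter the voiced part of an F0 track to suppress isolated
    halving/doubling errors (0 marks unvoiced frames and is left alone)."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    smoothed = f0.copy()
    smoothed[voiced] = medfilt(f0[voiced], kernel_size=kernel)
    return smoothed

# A doubling outlier at 240 Hz between ~120 Hz neighbours is pulled back:
print(smooth_pitch_track([118, 120, 240, 122, 119, 0, 0], kernel=3))
```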

17 Pitch Sample
[Figure: pitch track of a sample utterance; the previous section, the non-WUW, and the WUW are marked.]
Here is one sample from the pitch estimation. In this sentence, Wildfire is used as the WUW. There are two occurrences of wildfire in the sentence: the first, between … seconds, is the word in referential context (a non-WUW); the second, at the end of the sentence, is marked by red lines. I would also like to point out the section just before the WUW; we name it the previous section of the WUW.
"Hi. You know, I have this cool wildfire service and, you know, I'm gonna try to invoke it right now. Wildfire."

18 Pitch: Pitch-Based Feature Definitions
- APW_AP1SBW: relative change of the average pitch of the WUW to the average pitch of the previous section just before the WUW.
- AP1sSW_AP1SBW: relative change of the average pitch of the first section of the WUW to the average pitch of the previous section just before the WUW.
- APW_APALL: relative change of the average pitch of the WUW to the average pitch of the entire speech sample excluding the WUW sections.
- AP1sSW_APALL: relative change of the average pitch of the first section of the WUW to the average pitch of the entire speech sample excluding the WUW sections.
- APW_APALLBW: relative change of the average pitch of the WUW to the average pitch of the entire speech sample before the WUW.
- AP1sSW_APALLBW: relative change of the average pitch of the first section of the WUW to the average pitch of the entire speech sample before the WUW.
After computing the pitch values with eSRFD, we derived a series of features based on the pitch measurement. Each feature is a relative change between two segments. For example, the first feature, APW_AP1SBW, is the relative change of the average pitch of the WUW to the average pitch of the previous section just before the WUW; referring back to the previous image, that is the relative change of the average pitch in the red-line segment to the average pitch in the green-line segment. A generic sketch of this computation follows below.
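The definitions above all share one pattern: a relative change between statistics of two segments. This minimal, generic sketch assumes tracks are arrays with 0 marking unvoiced frames; the segment boundaries and names are illustrative, and the same code applies unchanged to the energy-based features later:

```python
import numpy as np

def relative_change(target, reference):
    """Relative change of a statistic between two segments, e.g.
    APW_AP1SBW = (avg pitch of WUW - avg pitch of section before WUW)
                 / (avg pitch of section before WUW)."""
    return (target - reference) / reference

def segment_stat(track, start, end, stat=np.mean):
    """Average (or max, via stat=np.max) of the nonzero values of a
    pitch/energy track between two frame indices; 0 marks unvoiced frames."""
    seg = np.asarray(track[start:end], dtype=float)
    seg = seg[seg > 0]
    return stat(seg) if seg.size else 0.0

# Illustrative APW_AP1SBW: WUW at frames 80..120, previous section 60..80.
f0 = np.random.default_rng(0).uniform(100, 200, 160)
apw = segment_stat(f0, 80, 120)
ap1sbw = segment_stat(f0, 60, 80)
print(relative_change(apw, ap1sbw))  # > 0 means pitch rose into the WUW
```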

19 Pitch: Pitch-Based Feature Definitions, cont.
- MaxPW_MaxP1SBW: relative change of the maximum pitch in the WUW sections to the maximum pitch in the previous section just before the WUW.
- MaxP1sSW_MaxP1SBW: relative change of the maximum pitch in the first section of the WUW to the maximum pitch of the previous section just before the WUW.
- MaxPW_MaxPAll: relative change of the maximum pitch of the WUW to the maximum pitch of the entire speech sample excluding the WUW sections.
- MaxP1sSW_MaxPAll: relative change of the maximum pitch of the first section of the WUW to the maximum pitch of the entire speech sample excluding the WUW sections.
- MaxP1sSW_MaxPAllBW: relative change of the maximum pitch in the first section of the WUW to the maximum pitch of the entire speech sample before the WUW.
- MaxPW_MaxPAllBW: relative change of the maximum pitch in the WUW sections to the maximum pitch of the entire speech sample before the WUW.

20 WUWII Corpus
- Contains 3410 speech samples.
- 5 different WUWs: "Operator", "Wildfire", "ThinkEngine", "Onword", "Voyager".
- Each speech sample contains at least one WUW.
- Contains time markers for the WUW.
Before I go into the results, let me describe the corpus used in this experiment. Sample index entries:
# Date | Time | Gender | Dialect | Phone type | Call NO | Utt. NO | File Name | Start Time | End Time | Ortho
| | male | native | landlinephone | 00006 | 006 | WUWII00006_006.ulaw | 1.184 | 1.828 | hello Operator give me my main phone number
| | male | native | landlinephone | 00006 | 007 | WUWII00006_007.ulaw | 1.280 | 2.133 | [click] I work for _ThinkEngine_ Networks
| | male | native | landlinephone | 00006 | 008 | WUWII00006_008.ulaw | 1.228 | 1.856 | hey does this _Onword_ thing work

21 Pitch: Results, All WUWs
Table A‑1 Pitch Features Result, All WUWs

Feature | Valid Data | Pt > 0 | % > 0 | Pt = 0 | % = 0 | Pt < 0 | % < 0
APW_AP1SBW | 1415 | 726 | 51 | | | 689 | 49
AP1sSW_AP1SBW | | 735 | 52 | | | 680 | 48
APW_APALL | 2282 | 947 | 41 | | | 1335 | 59
AP1sSW_APALL | | 996 | 44 | 2 | | 1284 | 56
APW_APALLBW | 2188 | 962 | | | | 1226 |
AP1sSW_APALLBW | | 1003 | 46 | | | 1183 | 54
MaxPW_MaxP1SBW | | 948 | 67 | 53 | 4 | 414 | 29
MaxP1sSW_MaxP1SBW | | 719 | | | | 642 | 45
MaxPW_MaxPAll | | 1020 | | 109 | 5 | 1153 |
MaxP1sSW_MaxPAll | | 716 | 31 | 213 | 9 | 1353 |
MaxP1sSW_MaxPAllBW | | 1069 | | 111 | | 1008 |
MaxPW_MaxPAllBW | | | 35 | | 10 | | 55

Here are the results of the experiment combining all five WUWs: "Wildfire", "Operator", "ThinkEngine", "Onword", and "Voyager". Detailed individual results are in Appendix A of my thesis. Our hypothesis was that the pitch of the WUW should be higher than the pitch of the non-WUW. As the table shows, there is no significant pattern; I will explain why, and how we plan to improve on this, in a later section. The best feature is MaxPW_MaxP1SBW, the relative change of the maximum pitch of the WUW to the maximum pitch of the previous section just before the WUW.

22 Pitch: MaxPW_MaxP1SBW
And here is the distribution and cumulative plot of this pitch feature. Once again, I have individual plots for each feature and each WUW.

23 Prosodic Features: Prominence
Prominence refers to the stress and accent in speech.
Object (noun): [ˈɒb.dʒɛkt]
Object (verb): [əbˈdʒɛkt]
We compute an energy measurement to quantify the prominence feature. In English, the meaning of a word can vary with the location of the accent: used as a noun, "object" is stressed on the first syllable; used as a verb, it is stressed on the second. We apply the same idea to the whole sentence: the hypothesis is that the WUW will have more prominence compared with the non-WUW. A typical energy computation is sketched below.
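This sketch shows one common way to compute short-time log energy; the frame length and dB scaling are assumptions for illustration, not the thesis' exact parameters:

```python
import numpy as np

def short_time_energy(signal, fs, frame_ms=10.0):
    """Per-frame log energy of a speech signal (illustrative parameters).

    Returns one value per non-overlapping frame_ms window; higher values
    over a word indicate greater prominence (stress/accent).
    """
    n = int(fs * frame_ms / 1000.0)
    n_frames = len(signal) // n
    frames = np.asarray(signal[:n_frames * n], dtype=float).reshape(n_frames, n)
    energy = np.sum(frames ** 2, axis=1)
    return 10.0 * np.log10(energy + 1e-12)  # dB scale; epsilon avoids log(0)
```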

24 Energy Sample
[Figure: energy track of the same sample utterance; the previous section, the non-WUW, and the WUW are marked.]
Here is the same sample I used for the pitch sample, with the third row changed to the energy reading. As we can see, there is a significant increase in energy in the WUW section.
"Hi. You know, I have this cool wildfire service and, you know, I'm gonna try to invoke it right now. Wildfire."

25 Energy based Features Definitions
AEW_AE1SBW: The relative change of the average energy of the WUW to the average energy of previous section just before the WUW. AE1sSW_AE1SBW: The relative change of the average energy of the first section of the WUW to the average energy of previous section just before the WUW. AEW_AEAll: The relative change of the average energy of the WUW to the average energy of the entire sample speech excluding the WUW sections. AE1sSW_AEAll: The relative change of the average energy of the first section in the WUW to the average energy of the entire utterance excluding the WUW sections. AEW_AEAllBW: The relative change of the average energy of the WUW to the average energy of all speech before the WUW. AE1sSW_AEAllBW: The relative change of the average energy of the first section in the WUW to the average energy of the entire sample speech before the WUW. We derived the energy based features the same way as we derived the pitch based features. 12/4/2018 Chih-Ti Shih

26 Energy-Based Feature Definitions, cont.
- MaxEW_MaxE1SBW: relative change of the maximum energy in the WUW sections to the maximum energy in the previous section just before the WUW.
- MaxE1sSW_MaxE1SBW: relative change of the maximum energy in the first section of the WUW to the maximum energy in the previous section just before the WUW.
- MaxEW_MaxEAll: relative change of the maximum energy in the WUW to the maximum energy of the entire speech sample excluding the WUW sections.
- MaxE1sSW_MaxEAll: relative change of the maximum energy in the first section of the WUW to the maximum energy of the entire speech sample excluding the WUW sections.
- MaxE1sSW_MaxEAllBW: relative change of the maximum energy in the first section of the WUW to the maximum energy of the entire speech sample before the WUW.
- MaxEW_MaxEAllBW: relative change of the maximum energy in the WUW sections to the maximum energy of the entire speech sample before the WUW.

27 Energy: Results, All WUWs
Feature | Valid Data | Pt > 0 | % > 0 | Pt = 0 | % = 0 | Pt < 0 | % < 0
AEW_AE1SBW | 1479 | 1164 | 79 | | | 315 | 21
AE1sSW_AE1SBW | | 1283 | 84 | 1 | | 240 | 16
AEW_AEAll | 2175 | 1059 | 49 | 9 | | 1116 | 51
AE1sSW_AEAll | | 1155 | 53 | 2 | | 1018 | 47
AEW_AEAllBW | 1969 | 1427 | 72 | | | 542 | 28
AE1sSW_AEAllBW | | 1562 | | 3 | | 404 |
MaxEW_MaxE1SBW | | 1244 | | 20 | | 215 | 15
MaxE1sSW_MaxEAllBW | | 1221 | 83 | 13 | | 245 | 17
MaxEW_MaxEAll | | 1373 | 63 | | | |
MaxE1sSW_MaxEAll | | 1336 | 61 | 25 | | 814 | 37
… | | 1209 | | | | 744 | 38
MaxEW_MaxEAllBW | | | 60 | | | | 39

Here is the energy feature result table for all five WUWs.
Table A‑1 Energy Feature Result of All WUWs

28 Energy Result: AE1sSW_AE1SBW
This plot shows that for 94% of samples, the average energy of the first section of the WUW is higher than the average energy in the previous section just before the WUW.

29 Energy Result: MaxEW_MaxE1SBW

30 Energy Result: MaxE1sSW_MaxEAllBW

31 Energy: Results, Wildfire
Feature | Valid Data | Pt > 0 | % > 0 | Pt = 0 | % = 0 | Pt < 0 | % < 0
AEW_AE1SBW | 282 | 253 | 90 | | | 29 | 10
AE1sSW_AE1SBW | | 261 | 93 | | | 21 | 7
AEW_AEAll | 340 | 173 | 51 | | | 167 | 49
AE1sSW_AEAll | | 185 | 54 | | | 155 | 46
AEW_AEAllBW | 298 | 252 | 85 | | | | 15
AE1sSW_AEAllBW | | 265 | 89 | | | 33 | 11
MaxEW_MaxE1SBW | | 258 | 91 | 8 | 3 | 16 | 6
MaxE1sSW_MaxEAllBW | | | | 2 | 1 | 27 |
MaxEW_MaxEAll | | 230 | 68 | 4 | | 106 | 31
MaxE1sSW_MaxEAll | | 219 | 64 | | | 117 | 34
… | | 195 | 65 | | | 99 |
MaxEW_MaxEAllBW | | | 62 | | | | 36

Of the five WUWs, "Wildfire" achieved the best performance: four features score at or above 90%. I will show plots of the feature results in the following slides.
Energy Feature Result of WUW "Wildfire"

32 Energy Result: AE1sSW_AE1SBW
This feature, which scores 94%, is the relative change of the average energy of the first section of the WUW to the average energy of the previous section.

33 Energy Result: MaxE1sSW_MaxEAllBW
The second-best feature, which scores 91%, is the relative change of the maximum energy in the first section of the WUW to the maximum energy of the entire sample before the WUW.

34 Experiment Conclusions
- Pitch-based feature experiments show no significant discriminating patterns. However, further investigation is needed; for example, using pitch as a difference between the WUW in alerting and referential contexts may reveal a pattern.
- Energy-based feature experiments show significant discriminating patterns. These features can be integrated into the current WUW speech recognition system.
Further investigation of the pitch features will require a new corpus with more natural speech samples that includes the WUW in both alerting and referential contexts.

35 WUW Data Collection Objectives
- Allow us to investigate both pitch and energy features from the perspective of the difference between the WUW in alerting context and in referential context.
- Provide more natural speech samples compared with the current WUWII corpus.
After investigating both pitch- and energy-based features, we realized that further identifying the pattern of prosodic features between WUWs and non-WUWs requires a specialized corpus containing both.

36 New WUW Speech Data Collection
Advantages:
- The speech samples are natural.
- The data collection process is less costly.
- A large amount of data can be collected in a short period once the process is fully automated.
- The voice channel data is CD quality.
- No manual labeling is required once the process is fully automated.
Based on Dr. Wallance's idea, we planned to collect speech data from video media such as TV series and movies. First, the speech samples are more natural, since actors tend to think and speak like a particular character in that situation. Second, collecting data from video media costs much less, since we do not need to pay individuals to record their voices. Third, a large amount of data can be collected quickly once the process is fully automated. Next, we get high-quality speech samples. And last, once automation is done, no manual labeling is required.

37 WUW Data Collection
[Figure: top-level program flow chart — movie clips feed audio and video channel extraction; transcription extraction from closed captions with time markers and sentence parsing; forced alignment; RelEx language analysis producing sentence transcriptions with syntactic labels and WUW/non-WUW context; prosodic feature extraction and processing; image sequence processing, segmentation, and image feature extraction; prosodic and image analysis, WUW modeling, and corpus building.]
This is the top-level functional flow diagram of the research. The black boxes show the investigation of prosodic features, the blue boxes indicate the new speech data collection project, and the green boxes indicate a potential video image analysis project.

38 WUW Data Collection
[Figure A‑2: WUW audio data collection system program flow diagram — video sample → subtitle extractor (.srt transcription file) and audio extractor; sentence parser with English name dictionary and name list; RelEx language analysis tool producing sentence transcriptions with syntactic labels and WUW/non-WUW markers; HTK forced alignment producing sentence time markers; audio parser producing name audio samples; corpus building.]
This is the detailed functional flow chart for the speech data collection project. To achieve full automation of the data collection process, the RelEx language analysis tool is used; it helps us distinguish the WUW in alerting context from the WUW in referential context based on English grammar. Another important piece of information for the corpus is the timestamp of the WUW in the sentence; we planned to use the HTK toolkit to perform forced alignment to find the WUW timestamps.

39 WUW Data Collection
Original subtitle file of the TV series "The Office":
# Date | Time | Index | Start Time | End Time | Transcription
| 12:14:41 | 38 | 00:03:39,025 | 00:03:42,659 | - It was fun. - Oh yeah bet it was fun
| 12:14:41 | 39 | 00:03:43,822 | 00:03:47,297 | - Oh hey! This is Oscar. - Martinez.
| 12:14:41 | 40 | 00:03:47,444 | 00:03:50,848 | - See, I didn't even know, first thing basis. - We're all set.
| 12:14:41 | 41 | 00:03:50,905 | 00:03:54,041 | Oh hey, diversity everybody let's do it.
| 12:14:41 | 42 | 00:03:54,670 | 00:03:56,407 | Oscar works in here.
| 12:14:41 | 43 | 00:03:56,465 | 00:03:59,686 | - Jim can you rapid up please? - Yeah.
| 12:14:41 | 44 | 00:04:00,494 | 00:04:02,325 | It's diversity day Jim,
| 12:14:41 | 45 | 00:04:02,390 | 00:04:04,231 | wish everyday was diversity day.

40 WUW Data Collection
Name Sentence Parser output (Pattarapong, Ronald, & Xerxes, 2009):
| 12:14:41 | 23 | 00:02:19,475 | 00:02:23,539 | - Thanks Dwight.
| 12:14:41 | 36 | 00:03:32,794 | 00:03:36,601 | - Hey! Oscar, how you doing man?
| 12:14:41 | 38 | 00:03:39,025 | 00:03:42,659 | - Oh yeah bet it was fun
| 12:14:41 | 39 | 00:03:43,822 | 00:03:47,297 | - Oh hey! This is Oscar.
A sketch of this filtering step follows below.
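This is a small sketch of how the name-sentence filtering could work, assuming the pipe-delimited record layout shown above and a plain-text list of English first names; the field positions are inferred from the sample, not taken from the actual parser:

```python
import re

def load_name_list(path):
    """English first names, one per line, lower-cased for matching."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def name_sentences(subtitle_lines, names):
    """Keep subtitle records whose transcription mentions a known name.

    Each record looks like:
    ' | 12:14:41 | 38 | 00:03:39,025 | 00:03:42,659 | - It was fun. ...'
    """
    kept = []
    for line in subtitle_lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 6:
            continue  # malformed record: skip it
        start, end, text = fields[3], fields[4], fields[5]
        words = re.findall(r"[A-Za-z']+", text.lower())
        if any(w in names for w in words):
            kept.append((start, end, text))
    return kept
```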

41 RelEx Language Tool
RelEx is an English-language semantic relationship extractor based on the Carnegie Mellon link parser. RelEx can provide sentence information on subject, object, and indirect object, and various word tags such as verb, gender, and noun. The current status of the WUW data collection project is developing a rule-based or statistical pattern recognition process, based on the relationship information produced by RelEx, to distinguish the WUWs from the non-WUWs.

42 RelEx Example
Sample sentence: "Computer start the presentation."
Constituent tree: (S [computer] (S (VP start (NP the presentation))) .)
Linkage: LEFT-WALL [computer] start.v the presentation.n .
Links: Wi (LEFT-WALL – start.v), Os (start.v – presentation.n), D*u (the – presentation.n), Xp (LEFT-WALL – .)
Word-tag suffixes: .n : noun; .v : verb; .a : adjective; .e : adverb; .p : preposition or plural; .s : singular; .t : title
Link labels: W : left-wall; Wi : imperative; O : connects transitive verbs to objects; S : singular; D : connects determiners; D*u : determiner with a singular or uncountable noun; X : connects punctuation symbols to words; Xp : period at the end of a sentence
One possible rule over links like these is sketched below.
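The toy rule below is an assumption for illustration, not the project's actual classifier: in the parse above, the Wi link marks an imperative, and a name standing outside the clause before an imperative verb suggests an alerting use.

```python
def classify_context(word, links):
    """Toy rule: tag an occurrence of `word` as alerting or referential.

    links: set of (label, left, right) relations from a link-grammar parse,
    e.g. {("Wi", "LEFT-WALL", "start.v"), ("Os", "start.v", "presentation.n")}.
    Rule (an assumption, not the thesis' final method): the word is alerting
    if the sentence is imperative (a Wi link) and the word itself is not
    the object of the verb.
    """
    imperative = any(label == "Wi" for label, _, _ in links)
    is_object = any(label.startswith("O") and right.startswith(word)
                    for label, _, right in links)
    if imperative and not is_object:
        return "alerting"        # e.g. "Computer, start the presentation."
    return "referential"         # e.g. "We are using the word computer ..."

links = {("Xp", "LEFT-WALL", "."), ("Wi", "LEFT-WALL", "start.v"),
         ("Os", "start.v", "presentation.n"), ("D*u", "the", "presentation.n")}
print(classify_context("computer", links))  # -> alerting
```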

43 Conclusions
- Implemented the pitch estimation algorithm, eSRFD.
- Pitch-based features were investigated, but no significant discriminating patterns were found in the WUWII corpus.
- Energy-based features were investigated; the experiments show significant relative changes for multiple energy-based features.
- A specialized speech data collection project has been designed and partially implemented. It is an ongoing research project and will be continued by the VoiceKey group under Dr. Këpuska.

44 Acknowledgements
Dr. Veton Z. Këpuska
Dr. Samuel P. Kozaitis
Dr. Georgios C. Anagnostopoulos
Dr. Judith B. Strother
VoiceTeam members: Raymond Sastraputera, Pattarapong Rojanasthien, Ronald Ramdham, Xerxes Beharry
All professors and colleagues in this great department

45 Questions?

