Investigation of Prosodic Features for Wake-Up-Word Speech Recognition Task by Chih-Ti Shih


Investigation of Prosodic Features for Wake-Up-Word Speech Recognition Task by Chih-Ti Shih. Good morning everyone, my name is Chih-Ti. I am a computer engineering master's student, and I have been working with Dr. Këpuska on speech recognition for about 2 years. Today I am going to give a talk about my thesis work: the investigation of prosodic features for the Wake-Up-Word speech recognition task. Instructor: Dr. Veton Z. Këpuska, Dept. of Electrical and Computer Engineering, Florida Institute of Technology

Outline Introduction to the Wake-Up-Word speech recognition problem. Prosodic features. Investigation of pitch based features: pitch characteristics and pitch estimation method; the pitch estimation algorithm eSRFD; pitch feature experiment results. Investigation of energy based features: energy characteristics; energy extraction. Data collection. In my presentation today I will start with an introduction to the WUW speech recognition system, followed by a brief introduction to prosodic features. Then I will cover the investigation of two prosodic feature measurements, pitch and energy. Finally, I will talk about the ongoing speech data collection project.

Introduction to WUW The Wake-Up-Word (WUW) speech recognition system was invented by Dr. Këpuska. Objective: to recognize a certain word used to request or to gain the attention of the system (alerting context), the WUW(s), and to reject that same word used in referential context, as well as all other words, sounds and noise, the non-WUW(s). Let me start with a bit of background on the Wake-Up-Word speech recognition system. One of the important purposes of the WUW system is to recognize a certain word used to request or gain the attention of the system; we name this type of word, used in an alerting context, a WUW. The same word used in a referential context is named a non-WUW. Here are two examples. In the first example sentence, "Operator, please go to the next slide," the word operator is used in an alerting context, requesting attention from the system. In the second example sentence, "We are using the word operator as the WUW," the word operator is used in a referential context, in which no attention is requested from the system. My thesis investigates using prosodic features to distinguish these two contexts. Example: Alerting Context: "Operator, please go to the next slide." Referential Context: "We are using the word operator as the WUW."

Alerting vs. Referential context The empirical evidence suggests that a Wake-Up-Word can be distinguished based on its use (alerting or referential) from prosodic features. In this thesis, results that attempt to evaluate the validity of this hypothesis are presented. In my thesis, I attempted to evaluate the validity of this hypothesis.

Prosodic Features The word prosody refers to the intonation and rhythmic aspects of a language (Merriam-Webster Dictionary). In modern phonetics the word prosody most often refers to those properties of speech that cannot be derived from the segmental sequence of phonemes underlying human utterances (William J. Hardcastle, 1997). From the phonological aspect, prosody may be classified into: structure, tune, and prominence. Before we go into the investigation of prosodic features, let me explain what prosodic features are. The word prosody refers to the intonation and rhythm of a language. In phonetics, prosody refers to the properties of speech that cannot be derived from the segmental sequence of phonemes. Prosodic features can be classified into structure, tune and prominence.

Prosodic Features: Structure The prosodic structure refers to the noticeable breaks or disjunctures between words in sentences, which can also be interpreted as the duration of the silence between words as a person speaks. The silence period before the WUW is usually longer than the average silence period before any other word in the sentence. This feature is considered in the original WUW speech recognition system by comparing the duration of the silence just before the WUW with the duration of the silences between non-WUWs. Let me emphasize again: the non-WUWs refer to the WUW in referential context and to the other words in a sentence. Example (a sketch of this comparison follows): Word1 S2 Word2 S3 Word3 S4 Word4 ... Sn-1 Wordn-1 SWUW WUW
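To make the structure feature concrete, here is a minimal Python sketch of the comparison described above: the silence gap just before the WUW versus the average of the other inter-word silences. It assumes word boundaries are available as (start, end) times in seconds; the function names and the toy numbers are mine, not from the thesis.

def silence_durations(word_times):
    """Given word boundaries [(start, end), ...] in time order (seconds),
    return the silence gaps between consecutive words."""
    return [word_times[i + 1][0] - word_times[i][1]
            for i in range(len(word_times) - 1)]

def silence_before_wuw_ratio(word_times, wuw_index):
    """Ratio of the silence just before the WUW to the average of the other
    inter-word silences; a value above 1 supports the hypothesis that the
    pause before the WUW is longer than average."""
    gaps = silence_durations(word_times)
    gap_before_wuw = gaps[wuw_index - 1]          # the gap that ends at the WUW
    other = [g for i, g in enumerate(gaps) if i != wuw_index - 1]
    return gap_before_wuw / (sum(other) / len(other))

# Hypothetical example: four words, the last one being the WUW.
words = [(0.00, 0.30), (0.35, 0.70), (0.75, 1.10), (1.60, 2.00)]
print(silence_before_wuw_ratio(words, wuw_index=3))   # 10.0: the pre-WUW pause is much longer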

Prosodic Features: Tune The tune refers to the intonational melody of an utterance (Jurafsky & Martin). The tune may be quantified by pitch measurement. The tune refers to the intonational melody of an utterance, which can be quantified by pitch measurement, also known as the fundamental frequency of the sound. Here is an example sentence spoken by Robert De Niro in the movie Taxi Driver: "You talking to me?" The continuous increase in pitch over the last three words indicates the request. Example: "You talking to me?" Robert De Niro in "Taxi Driver".

Pitch Pitch is the fundamental frequency (F0), or repetition frequency, of a sound. Pitch is determined by the rate of vibration of the vocal cords located in the larynx. The range of pitch an individual can produce: Male: 50 - 200 Hz; Female: 180 - 400 Hz. Pitch is computed using a fundamental-frequency determination algorithm (FDA). So what is pitch? Pitch is the fundamental frequency or repetition frequency of a sound, and it is computed using a fundamental-frequency determination algorithm.

Human Vocal As I mentioned, the vibration of the vocal cords determines the pitch. So how does pitch vary as we speak? Contraction of the vocal cords increases the vibration frequency, which produces a higher pitch, and relaxation of the vocal cords reduces the vibration frequency, which produces a lower pitch. So the smaller the vocal cords, the higher the pitch. That is why children and females usually have higher pitch: they have smaller vocal cords.

Pitch – FDA chart (Male). There are many FDAs; according to the evaluation by Dr. Bagshaw, the eSRFD has the smallest combined error rate compared to the other FDAs. A similar result, showing that the eSRFD achieves the best performance, can also be found in the 2002 paper by Veprek and Scordilis. In this evaluation chart, 4 types of errors are included: the gross error low, which refers to the halving error; the gross error high, which refers to the doubling error; the voiced error, which refers to unvoiced frames misidentified as voiced frames; and finally the unvoiced error, which refers to voiced frames misidentified as unvoiced frames. FDA Evaluation Chart: Male Speech. Reproduced from (Bagshaw, 1994)

Pitch – FDA chart (Female) As we can see from the two evaluation charts, the eSRFD achieves the best performance, so we are going to use it as our FDA. FDA Evaluation Chart: Female Speech. Reproduced from (Bagshaw, 1994)

Pitch – eSRFD Algorithm The concept of eSRFD is to perform a normalized cross-correlation between two sections of the signal. The correlation tells us how similar the two sections are: the more similar they are, the higher the possibility that the window size is the fundamental period of that frame. In eSRFD, two cross-correlations are used, one over the x,y sections and a second over the y,z sections (a sketch of the correlation measure follows).
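As a minimal Python sketch of the measure just described (not the full eSRFD), here is one common form of the normalized cross-correlation between two equal-length sections; the function name and normalization choice are mine.

import numpy as np

def normalized_cross_correlation(a, b):
    """Normalized cross-correlation of two equal-length sections. Values
    close to 1 mean the sections are nearly identical, so their length is
    a good candidate for the fundamental period of the frame."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0

# For a candidate period of n samples at frame start t, the sections are
#   x = signal[t - n : t], y = signal[t : t + n], z = signal[t + n : t + 2 * n]
# and eSRFD uses both the (x, y) and the (y, z) correlations.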

Pitch – eSRFD Algorithm, cont. At the beginning, the speech signal is passed through a low-pass filter. Then the speech signal is divided into frames; in our case each frame is 6.5 ms. Each frame is passed through the silence detector: if the frame is unvoiced, no further processing is performed; if the frame is voiced, the cross-correlation Pxy is computed for various window lengths from 20 to 160 (a sketch of this candidate search follows).
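The candidate search over window lengths can be sketched in Python as below; this is an illustration of the step on the slide, not the thesis code. The slide's Pxy is written rho_xy here, the lag range 20-160 follows the slide (the unit is assumed to be samples), and the low-pass filtering and silence detection are assumed to have run already.

import numpy as np

def _ncc(a, b):
    # Normalized cross-correlation, the same measure as in the previous sketch.
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0

def period_candidates(signal, t, t_srfd, min_lag=20, max_lag=160):
    """For a voiced frame starting at sample t, compute rho_xy for each
    candidate period n and keep those above the threshold T_srfd."""
    signal = np.asarray(signal, dtype=float)
    candidates = []
    for n in range(min_lag, max_lag + 1):
        if t - n < 0 or t + n > len(signal):
            continue                          # not enough context for sections x and y
        x = signal[t - n: t]                  # section just before the frame
        y = signal[t: t + n]                  # section starting at the frame
        rho_xy = _ncc(x, y)
        if rho_xy > t_srfd:
            candidates.append((n, rho_xy))
    return candidates

# Toy usage: a 100 Hz sine sampled at 8 kHz has a period of 80 samples.
fs = 8000
sig = np.sin(2 * np.pi * 100 * np.arange(fs) / fs)
print(period_candidates(sig, t=400, t_srfd=0.99)[:3])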

Pitch – eSRFD Algorithm, cont. Window lengths with Pxy > Tsrfd are considered the candidates of that frame. For those candidates with Pxy > Tsrfd, the second cross-correlation, Pyz, is then computed.

Pitch – eSRFD Algorithm, cont. Here is the part that assigns scores and determines the pitch. After both Pxy and Pyz are computed for each candidate, scores are given: if both Pxy and Pyz are above Tsrfd, a score of 2 is given to the candidate; if only Pxy is above Tsrfd, a score of 1 is given. After scoring, there are 4 cases: first, only one candidate with score 2; second, only one candidate with score 1; third, multiple candidates with score 1; and finally, multiple candidates with scores 2 and 1. In case one, the candidate is considered the optimal value of the pitch period of the frame. In case two, we check whether the previous frame and the next frame are unvoiced: if both frames are silent, this frame is reclassified as an unvoiced frame; if either frame is voiced, the candidate is considered the optimal value of the pitch period of the frame. For both cases three and four, the candidates are sorted in ascending order and another cross-correlation, Qnm, is performed on each candidate with window lengths from the largest candidate to the smallest candidate; the optimal candidate is then found from that measurement (a simplified sketch of the scoring and the single-candidate cases follows).
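Here is a simplified Python sketch of the scoring and of the two single-candidate cases described above; the slide's Pxy and Pyz are written rho_xy and rho_yz, the multi-candidate tie-break via the extra Qnm correlation is deliberately omitted, and the data layout is my own.

def score_candidates(candidates, t_srfd):
    """candidates: list of (period, rho_xy, rho_yz) triples.
    Returns [(period, score), ...] with score 2 when both correlations
    exceed T_srfd and score 1 when only rho_xy does."""
    scored = []
    for period, rho_xy, rho_yz in candidates:
        if rho_xy > t_srfd and rho_yz > t_srfd:
            scored.append((period, 2))
        elif rho_xy > t_srfd:
            scored.append((period, 1))
    return scored

def pick_period(scored, prev_voiced, next_voiced):
    """Single score-2 candidate: accept it. Single score-1 candidate: accept it
    only if a neighbouring frame is voiced, otherwise reclassify the frame as
    unvoiced (None). Multiple candidates: defer to the Qnm step (not shown)."""
    twos = [p for p, s in scored if s == 2]
    ones = [p for p, s in scored if s == 1]
    if len(twos) == 1 and not ones:
        return twos[0]
    if len(ones) == 1 and not twos:
        return ones[0] if (prev_voiced or next_voiced) else None
    return None   # multiple candidates: handled by the Qnm correlation in eSRFD

scored = score_candidates([(80, 0.95, 0.93)], t_srfd=0.85)
print(pick_period(scored, prev_voiced=True, next_voiced=False))   # 80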

Pitch: Median Filter [Plot annotations: two doubling errors and one halving error.] As I mentioned with the FDA evaluation charts, halving and doubling errors may occur during pitch estimation. We then apply a median filter to fix those errors (a sketch follows).
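A minimal Python sketch of such a median smoother on the pitch track, pulling isolated halving/doubling outliers back toward their neighbours; the 5-point window and the toy values are assumptions, not the thesis settings.

def median_smooth(pitch, window=5):
    """Replace each voiced pitch value (value > 0) by the median of the voiced
    values in its neighbourhood; unvoiced frames (0) are left untouched."""
    half = window // 2
    out = list(pitch)
    for i, p in enumerate(pitch):
        if p == 0:
            continue
        voiced = sorted(v for v in pitch[max(0, i - half): i + half + 1] if v > 0)
        out[i] = voiced[len(voiced) // 2]
    return out

# Hypothetical track (Hz) with one doubling error (224) and one halving error (56):
track = [110, 112, 224, 111, 113, 56, 112, 0, 0, 115]
print(median_smooth(track))   # the 224 and 56 are replaced by values near 112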

Pitch Sample [Plot annotations: non-WUW and WUW segments and the previous section of the WUW.] Here is one sample from the pitch estimation. In this sentence, Wildfire is used as the WUW. There are two occurrences of wildfire in the sentence: the first one, between 2 and 2.6 seconds, is the WUW in referential context; the second one is at the end of the sentence and is marked by the red lines. Here I would like to point out the section just before the WUW, which we name the previous section of the WUW. "Hi. You know, I have this cool wildfire service and, you know, I'm gonna try to invoke it right now. Wildfire."

Pitch: Pitch Based Features Definitions APW_AP1SBW: The relative change of the average pitch of the WUW to the average pitch of the previous section just before the WUW. AP1sSW_AP1SBW: The relative change of the average pitch of the first section of the WUW to the average pitch of the previous section just before the WUW. APW_APALL: The relative change of the average pitch of the WUW to the average pitch of the entire speech sample excluding the WUW sections. AP1sSW_APALL: The relative change of the average pitch of the first section of the WUW to the average pitch of the entire speech sample excluding the WUW sections. APW_APALLBW: The relative change of the average pitch of the WUW to the average pitch of the entire speech sample before the WUW. AP1sSW_APALLBW: The relative change of the average pitch of the first section of the WUW to the average pitch of the entire speech sample before the WUW. After we computed the pitch values using the eSRFD, we derived a series of features based on the pitch measurement. The features we derived are based on the relative change between two segments. So, for example, the first feature, APW_AP1SBW, is the relative change of the average pitch of the WUW to the average pitch of the previous section just before the WUW. Referring back to the previous image, that feature is the relative change of the average pitch in the red-line segment with respect to the average pitch in the green-line segment (a sketch of this computation follows).
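As a minimal Python sketch of how these relative-change features can be computed, assuming "relative change" means (a - b) / b and that unvoiced frames carry a pitch value of 0; the names and numbers below are illustrative only.

def average_pitch(frames):
    """Average pitch over voiced frames only (value 0 = unvoiced)."""
    voiced = [p for p in frames if p > 0]
    return sum(voiced) / len(voiced) if voiced else 0.0

def relative_change(a, b):
    """Relative change of a with respect to b; positive means a is larger."""
    return (a - b) / b if b else 0.0

# Hypothetical pitch tracks (Hz) for the WUW segment and for the previous
# section just before it, giving the APW_AP1SBW feature from the slide:
wuw_frames = [180, 185, 0, 190, 188]
before_frames = [150, 0, 148, 152]
apw_ap1sbw = relative_change(average_pitch(wuw_frames), average_pitch(before_frames))
print(apw_ap1sbw)   # > 0 would support the "higher pitch on the WUW" hypothesis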

Pitch: Pitch Based Features Definitions MaxPW_MaxP1SBW: The relative change of the maximum pitch in the WUW sections to the maximum pitch in the previous section just before the WUW. MaxP1sSW_MaxP1SBW: The relative change of the maximum pitch in the first section of the WUW to the maximum pitch of the previous section just before the WUW. MaxPW_MaxPAll: The relative change of the maximum pitch of the WUW to the maximum pitch of the entire speech sample excluding the WUW sections. MaxP1sSW_MaxPAll: The relative change of the maximum pitch of the first section of the WUW to the maximum pitch of the entire speech sample excluding the WUW sections. MaxP1sSW_MaxPAllBW: The relative change of the maximum pitch in the first section of the WUW to the maximum pitch of the entire speech sample before the WUW. MaxPW_MaxPAllBW: The relative change of the maximum pitch in the WUW sections to the maximum pitch of the entire speech sample before the WUW.

WUWII Corpus Contains 3410 speech samples. 5 different WUWs: "Operator", "Wildfire", "ThinkEngine", "Onword", "Voyager". Each speech sample contains at least one WUW word. Contains time markers for the WUW. Before I go into the results, let me talk about the corpus we used in this experiment (a sketch of parsing one index line follows). # Date | Time | Gender | Dialect | Phone type | File Name | Call NO | Utt. NO | Start Time | End Time | Ortho # --------------+----------+------+-------+------------------+-------+----+-------------------------------+-------+------+----- 12.26.2001|04.14.24|male|native|landlinephone|00006|006|WUWII00006_006.ulaw|1.184|1.828|hello Operator give me my main phone number 12.26.2001|04.14.33|male|native|landlinephone|00006|007|WUWII00006_007.ulaw|1.280|2.133|[click] I work for _ThinkEngine_ Networks 12.26.2001|04.14.42|male|native|landlinephone|00006|008|WUWII00006_008.ulaw|1.228|1.856|hey does this _Onword_ thing work
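A small Python sketch of reading one of these pipe-delimited index lines; the field order follows the sample rows above (the printed header lists the same fields in a slightly different order), and the field names are mine.

def parse_wuwii_line(line):
    """Parse one WUWII index line into a dict, converting the WUW start and
    end time markers to floats (seconds)."""
    fields = ["date", "time", "gender", "dialect", "phone_type",
              "call_no", "utt_no", "file_name", "start_time", "end_time", "ortho"]
    record = dict(zip(fields, (v.strip() for v in line.split("|"))))
    record["start_time"] = float(record["start_time"])
    record["end_time"] = float(record["end_time"])
    return record

line = ("12.26.2001|04.14.24|male|native|landlinephone|00006|006|"
        "WUWII00006_006.ulaw|1.184|1.828|hello Operator give me my main phone number")
print(parse_wuwii_line(line)["ortho"])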

Pitch: Result, All WUWs. Table A‑1 Pitch Features Result, All WUWs. WUW: All. Columns: Valid Data, Pt > 0, % > 0, Pt = 0, % = 0, Pt < 0, % < 0. Rows: APW_AP1SBW 1415 726 51 689 49; AP1sSW_AP1SBW 735 52 680 48; APW_APALL 2282 947 41 1335 59; AP1sSW_APALL 996 44 2 1284 56; APW_APALLBW 2188 962 1226 1003 46 1183 54; MaxPW_MaxP1SBW 948 67 53 4 414 29; MaxP1sSW_MaxP1SBW 719 642 45; MaxPW_MaxPAll 1020 109 5 1153; MaxP1sSW_MaxPAll 716 31 213 9 1353; MaxP1sSW_MaxPAllBW 1069 111 1008; MaxPW_MaxPAllBW 35 10 55. Here is the result of the experiment combining all 5 different WUWs: "Wildfire", "Operator", "ThinkEngine", "Onword" and "Voyager". The detailed individual results are in Appendix A of my thesis. Our hypothesis was that the pitch of the WUW should be higher than the pitch of the non-WUWs. As we can see from the result table, there is no significant pattern here; I will explain why, and how we plan to improve it, in a later section. The best feature is MaxPW_MaxP1SBW, which is the relative change of the maximum pitch of the WUW to the maximum pitch of the previous section just before the WUW.

Pitch: MaxPW_MaxP1SBW Here are the distribution and cumulative plots of the related pitch feature. Once again, I have the individual plots for each feature for each different WUW.

Prosodic Features: Prominence The prominence refers to the stress and accent in speech. Object (noun): [`ab.dzekt]; object (verb): [ab.`dzekt]. We compute an energy measurement to quantify the prominence feature. Here we come to another prosodic feature, prominence. Prominence refers to the stress and accent in speech. In English, the meaning of a word can vary with the location of the accent. For example, when we use the word "object" as a noun, the stress is on the first syllable, and when we use it as a verb, the stress is on the second syllable. We apply the same idea to the whole sentence: the hypothesis is that the WUW will have more prominence compared to non-WUWs. Prominence includes the stress and accent of a word, and we use a measurement of energy to quantify it (a sketch of the energy computation follows).
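As a minimal Python sketch of the energy measurement (short-time energy per frame, i.e. the sum of squared samples), with the frame length and hop size chosen only for illustration, not taken from the thesis:

import numpy as np

def short_time_energy(signal, frame_len=256, hop=128):
    """Short-time energy per frame: the sum of squared samples in each frame."""
    signal = np.asarray(signal, dtype=float)
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start: start + frame_len]
        energies.append(float(np.sum(frame ** 2)))
    return energies

# Toy example: a quiet segment followed by a louder one shows the energy jump
# that the prominence hypothesis expects on the WUW.
rng = np.random.default_rng(0)
quiet = 0.05 * rng.standard_normal(1024)
loud = 0.5 * rng.standard_normal(1024)
print(short_time_energy(np.concatenate([quiet, loud]))[::4])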

Energy Sample [Plot annotations: non-WUW and WUW segments and the previous section of the WUW.] Here is the same sample I used for the pitch sample; I changed the third row to the energy reading. As we can see from this sample, there is a significant increase in energy in the WUW section. "Hi. You know, I have this cool wildfire service and, you know, I'm gonna try to invoke it right now. Wildfire."

Energy based Features Definitions AEW_AE1SBW: The relative change of the average energy of the WUW to the average energy of the previous section just before the WUW. AE1sSW_AE1SBW: The relative change of the average energy of the first section of the WUW to the average energy of the previous section just before the WUW. AEW_AEAll: The relative change of the average energy of the WUW to the average energy of the entire speech sample excluding the WUW sections. AE1sSW_AEAll: The relative change of the average energy of the first section of the WUW to the average energy of the entire speech sample excluding the WUW sections. AEW_AEAllBW: The relative change of the average energy of the WUW to the average energy of the entire speech sample before the WUW. AE1sSW_AEAllBW: The relative change of the average energy of the first section of the WUW to the average energy of the entire speech sample before the WUW. We derived the energy based features the same way as we derived the pitch based features.

Energy based Features Definitions MaxEW_MaxE1SBW: The relative change of the maximum energy in the WUW sections to the maximum energy in the previous section of the WUW. MaxE1sSW_MaxEAllBW: The relative change of the maximum energy in the first section of the WUW to the maximum energy in the entire speech sample before the WUW. MaxEW_MaxEAll: The relative change of the maximum energy in the WUW to the maximum energy of the entire speech sample excluding the WUW sections. MaxE1sSW_MaxEAll: The relative change of the maximum energy in the first section of the WUW to the maximum energy of the entire speech sample excluding the WUW sections. MaxE1sSW_MaxEAllBW: The relative change of the maximum energy in the first section of the WUW to the maximum energy of the entire speech sample before the WUW. MaxEW_MaxEAllBW: The relative change of the maximum energy in the WUW sections to the maximum energy of the entire speech sample before the WUW.

Energy: Result, All WUWs. Table A‑1 Energy Feature Result of All WUWs. WUW: All WUWs. Columns: Valid Data, Pt > 0, % > 0, Pt = 0, % = 0, Pt < 0, % < 0. Rows: AEW_AE1SBW 1479 1164 79 315 21; AE1sSW_AE1SBW 1283 84 1 240 16; AEW_AEAll 2175 1059 49 9 1116 51; AE1sSW_AEAll 1155 53 2 1018 47; AEW_AEAllBW 1969 1427 72 542 28; AE1sSW_AEAllBW 1562 3 404; MaxEW_MaxE1SBW 1244 20 215 15; MaxE1sSW_MaxEAllBW 1221 83 13 245 17; MaxEW_MaxEAll 1373 63; MaxE1sSW_MaxEAll 1336 61 25 814 37 1209 744 38; MaxEW_MaxEAllBW 60 39. Here is the energy feature result table for all 5 WUWs.

Energy Result: AE1sSW_AE1SBW This plot shows that for 94% of the samples, the average energy of the first section of the WUW is higher than the average energy in the previous section just before the WUW.

Energy Result: MaxEW_MaxE1SBW

Energy Result: MaxE1sSW_MaxEAllBW

Energy: Result, Wildfire. Energy Feature Result of WUW "Wildfire". WUW: Wildfire. Columns: Valid Data, Pt > 0, % > 0, Pt = 0, % = 0, Pt < 0, % < 0. Rows: AEW_AE1SBW 282 253 90 29 10; AE1sSW_AE1SBW 261 93 21 7; AEW_AEAll 340 173 51 167 49; AE1sSW_AEAll 185 54 155 46; AEW_AEAllBW 298 252 85 15; AE1sSW_AEAllBW 265 89 33 11; MaxEW_MaxE1SBW 258 91 8 3 16 6; MaxE1sSW_MaxEAllBW 2 1 27; MaxEW_MaxEAll 230 68 4 106 31; MaxE1sSW_MaxEAll 219 64 117 34 195 65 99; MaxEW_MaxEAllBW 62 36. Among the 5 different WUWs, the WUW "Wildfire" achieved the best performance: 4 features score at or above 90%. I will show the plots of these feature results in the following slides.

Energy Result: AE1sSW_AE1SBW The feature scoring 94% is the relative change of the average energy of the first section of the WUW to the average energy of the previous section.

Energy Result: MaxE1sSW_MaxEAllBW The second-best feature, which scores 91%, is the relative change of the maximum energy in the first section of the WUW to the maximum energy of the entire sample before the WUW.

Experiment Conclusions Pitch based features experiments show no significant discriminating patterns; however, further investigation is needed. For example, using pitch as a difference between the WUW in alerting and referential context may reveal a pattern. Energy based features experiments show significant discriminating patterns; these features can be integrated into the current WUW speech recognition system. Here we come to the conclusions of the experiments. For the pitch based features, there are no significant discriminating patterns. For the energy based features, there are significant discriminating patterns. In order to investigate the pitch features further, we will need a new corpus that has more natural speech samples and includes the WUW in both alerting and referential contexts.

WUW Data Collection Objectives: Allow us to investigate both pitch and energy features from the perspective of the difference between the WUW in alerting context and in referential context. Provide more natural speech samples compared to the current WUWII corpus. After the investigation of both pitch and energy based features, we realized that further identifying the pattern of prosodic features between WUWs and non-WUWs requires a specialized corpus which contains both WUWs and non-WUWs.

New WUW Speech Data Collection Advantages: The speech samples are natural. The data collection process will be less costly. A large number of samples can be collected in a short period of time once the process is fully automated. The voice channel data is in CD quality. No manual labeling is required once the process is fully automated. Based on Dr. Wallance's idea, we planned to collect speech data from video media such as TV series and movies. This slide shows the advantages of collecting data from video media. First of all, the speech samples are more natural, since the actors tend to think and speak like a particular character in that situation. Secondly, collecting data from video media costs much less, since we do not need to pay individuals to record their voices. Thirdly, a large number of samples can be collected in a short period of time once the process is fully automated. Next, we will have high-quality speech samples. And last, once automation is done, no manual labeling is required.

WUW Data Collection [Top Level Program Flow Chart: movie clips (video + audio) feed audio channel extraction and video channel extraction; transcriptions are extracted from closed captioning together with time markers and sentence parsing; forced alignment on the waveform of an utterance yields WUW and non-WUW time markers; the RelEx language analysis tool produces sentence transcriptions with syntactic labels and the WUW or non-WUW context; prosodic feature extraction and processing, image sequence processing, image segmentation, image feature extraction and processing, and the analysis of prosodic and image features feed WUW modeling and corpus building.] This is the top-level functional flow diagram of the research. The black boxes show the investigation of prosodic features, the blue boxes indicate the new speech data collection project, and the green boxes indicate the potential video image analysis project.

WUW Data Collection [Figure A‑2 WUW Audio Data Collection System Program Flow Diagram: a video sample passes through a subtitle extractor to produce the video transcription file (.srt); a sentence parser, together with an English name dictionary and English name list, produces name sentence transcriptions and name sentence time markers; the RelEx language analysis tool produces sentence transcriptions with syntactic labels and the WUW or non-WUW marker; an audio extractor and audio parser produce name audio samples; HTK forced alignment supplies the sentence transcription and time markers used for corpus building.] This is the detailed functional flow chart for the speech data collection project. In order to achieve full automation of the data collection process, the RelEx language analysis tool is used; this tool will help us distinguish the WUW in alerting context from the WUW in referential context based on English grammar. Another important piece of information for the corpus is the time stamp of the WUW in the sentence. We planned to use the HTK tool to perform forced alignment to find the time stamps of the WUW.

WUW Data Collection Original subtitle file of the TV series "The Office" # Date | Time |Index| Start Time | End Time |Transcription 11-17-08 | 12:14:41 | 38 | 00:03:39,025 | 00:03:42,659 | - It was fun. - Oh yeah bet it was fun 11-17-08 | 12:14:41 | 39 | 00:03:43,822 | 00:03:47,297 | - Oh hey! This is Oscar. - Martinez. 11-17-08 | 12:14:41 | 40 | 00:03:47,444 | 00:03:50,848 | - See, I didn't even know, first thing basis. - We're all set. 11-17-08 | 12:14:41 | 41 | 00:03:50,905 | 00:03:54,041 | Oh hey, diversity everybody let's do it. 11-17-08 | 12:14:41 | 42 | 00:03:54,670 | 00:03:56,407 | Oscar works in here. 11-17-08 | 12:14:41 | 43 | 00:03:56,465 | 00:03:59,686 | - Jim can you rapid up please? - Yeah. 11-17-08 | 12:14:41 | 44 | 00:04:00,494 | 00:04:02,325 | It's diversity day Jim, 11-17-08 | 12:14:41 | 45 | 00:04:02,390 | 00:04:04,231 | wish everyday was diversity day.
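A small Python sketch of turning one of these extracted subtitle rows into numeric time markers, which is what the forced-alignment step needs; the row layout follows the listing above and the function names are mine.

def srt_time_to_seconds(stamp):
    """Convert an SRT-style time stamp 'HH:MM:SS,mmm' into seconds."""
    hms, millis = stamp.split(",")
    hours, minutes, seconds = (int(x) for x in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000.0

def parse_subtitle_row(row):
    """Parse one pipe-delimited row of the extracted subtitle listing into
    (index, start_seconds, end_seconds, text)."""
    _date, _time, index, start, end, text = (f.strip() for f in row.split("|", 5))
    return int(index), srt_time_to_seconds(start), srt_time_to_seconds(end), text

row = "11-17-08 | 12:14:41 | 42 | 00:03:54,670 | 00:03:56,407 | Oscar works in here."
print(parse_subtitle_row(row))   # (42, 234.67, 236.407, 'Oscar works in here.')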

WUW Data Collection Name Sentence Parser (Pattarapong, Ronald, & Xerxes, 2009) --------------------------------------- 11-17-08 | 12:14:41 | 23 | 00:02:19,475 | 00:02:23,539 | - Thanks Dwight. 11-17-08 | 12:14:41 | 36 | 00:03:32,794 | 00:03:36,601 | - Hey! Oscar, how you doing man? 11-17-08 | 12:14:41 | 38 | 00:03:39,025 | 00:03:42,659 | - Oh yeah bet it was fun 11-17-08 | 12:14:41 | 39 | 00:03:43,822 | 00:03:47,297 | - Oh hey! This is Oscar.

RelEx Language Tool RelEx is an English-language semantic relationship extractor based on the Carnegie Mellon link parser. RelEx is able to provide sentence information on subject, object and indirect object, and various word tags such as verb, gender and noun. The current status of the WUW data collection project is developing a rule-based or statistical pattern recognition process based on the relationship information produced by RelEx. In other words, the current state of the project is finding patterns in the information that RelEx provides to distinguish the WUW from non-WUWs (an illustrative rule sketch follows).
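To illustrate the kind of rule the project is looking for, here is a hypothetical Python sketch that classifies an occurrence of the WUW as alerting or referential from RelEx-style relation triples: if the name participates in a subject, object or indirect-object relation it is treated as referential, otherwise as a vocative (alerting) use. This rule and the relation encoding are my illustration, not the thesis rule set.

def classify_wuw_use(word, relations):
    """relations: list of (relation, head, dependent) triples as a RelEx-style
    extractor might emit. Returns 'referential' if the word is an argument of
    the verb, 'alerting' otherwise. Purely illustrative."""
    for rel, head, dep in relations:
        if word in (head, dep) and rel in ("_subj", "_obj", "_iobj"):
            return "referential"
    return "alerting"

# "Operator, please go to the next slide."  -> the name has no argument relation
alerting_relations = [("_subj", "go", "you"), ("_obj", "go", "slide")]
# "We are using the word operator as the WUW." -> the name is an object of the verb
referential_relations = [("_subj", "using", "we"), ("_obj", "using", "operator")]
print(classify_wuw_use("Operator", alerting_relations))       # alerting
print(classify_wuw_use("operator", referential_relations))    # referential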

RelEx Example Sample Sentence: "Computer start the presentation." (S [computer] (S (VP start (NP the presentation))).) [Link parser diagram: Wi links LEFT-WALL to start.v; Os links start.v to presentation.n; D*u links the to presentation.n; Xp links LEFT-WALL to the final period.] Legend: .n : nouns .v : verbs .a : adjectives .e : adverbs .p : preposition .s : singular .p : plural .t : title "W" : left-wall "I" : imperative "O" : connects transitive verbs "S" : singular "D" : connects determiners "D*u" : relationship can be a singular or uncountable noun "X" : connects punctuation symbols to words "Xp" : periods at ends of sentences

Conclusions Implemented the pitch estimation algorithm, eSRFD. Pitch based features were investigated, but no significant discriminating patterns were found in the WUWII corpus. Energy based features were investigated; the experiments show significant relative changes on multiple energy based features. A specialized speech data collection project was designed and partially implemented; it is an ongoing research project and will be continued by the VoiceKey group under Dr. Këpuska. In my thesis, the pitch estimation algorithm was implemented. The pitch based features were investigated, although no significant patterns were found. The energy based features were also investigated, and the experiments show significant change patterns on multiple energy based features. Finally, a specialized corpus project was designed and planned.

Acknowledgements Dr. Veton Z. Këpuska, Dr. Samuel P. Kozaitis, Dr. Georgios C. Anagnostopoulos, Dr. Judith B. Strother. VoiceTeam members: Raymond Sastraputera, Pattarapong Rojanasthien, Ronald Ramdham, Xerxes Beharry. All professors and colleagues in this great department.

Question?