1 “Smart Room” A Vision for an Integrated Research at FIT
Veton Z. Këpuska, Associate Professor, Florida Institute of Technology

2 What is a “Smart Room”? An application perspective
Cluster Sensor System

3 Research Areas:
- Use of microphone arrays to locate and separate the source of each sound in its coverage area.
- Integration of the microphone array and video (e.g., surveillance) system to overlay/indicate the source of sound in the image of the corresponding monitor.
- Use of the location information to perform multi-modal recognition and analysis:
  - Audio signal for speech recognition,
  - Video/image stream for lip-reading,
  - Combination of both,
  - Video/image stream for face recognition and/or automatic suspect tagging (e.g., height, color of clothes, etc.),
  - Motion detection and tracking, etc.

4 Research Areas:
- Use of key-word recognition technology to identify potential foul intentions.
- Automatic behavioral analysis of tagged individuals.
- Integration of the overall system into a responsive network system:
  - Efficient communication protocols and data compression techniques
  - Wireless communication ⇒ mobility
- Development of other sensors and their integration

5 Microphone Arrays
- Microphone arrays for speech enhancement:
  - Beamforming is an already established methodology on which a number of practical solutions are based.
  - Typically only the most dominant speech/sound source is enhanced; signal separation of all sound sources is required.
- Microphone array(s) for sound localization (see the sketch after this list):
  - Draw from closely related work on infrasound wavelengths for sound separation and localization.
  - Triangulation across several microphone-array clusters to improve the localization estimate; related work exists on determining mobile telephone location in wireless communication.
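A minimal sketch of one common sound-localization building block, assuming a two-microphone pair, GCC-PHAT for time-difference-of-arrival (TDOA) estimation, and a far-field source; the sample rate, microphone spacing, and function names are illustrative assumptions, not the system described in the slides.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at ~20 C


def gcc_phat(sig, ref, fs, max_tau=None):
    """Return the TDOA (seconds) of `sig` relative to `ref` via GCC-PHAT."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12                 # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)


def bearing_from_tdoa(tau, mic_distance):
    """Far-field bearing (radians) of the source relative to the mic pair."""
    ratio = np.clip(SPEED_OF_SOUND * tau / mic_distance, -1.0, 1.0)
    return np.arcsin(ratio)


if __name__ == "__main__":
    fs, d = 16000, 0.10                            # 16 kHz, 10 cm spacing (assumed)
    rng = np.random.default_rng(0)
    src = rng.standard_normal(fs)                  # 1 s of noise as a stand-in source
    mic1, mic2 = src, np.roll(src, 3)              # mic2 lags by 3 samples for the demo
    tau = gcc_phat(mic2, mic1, fs, max_tau=d / SPEED_OF_SOUND)
    print(f"TDOA: {tau * 1e6:.1f} us, bearing: {np.degrees(bearing_from_tdoa(tau, d)):.1f} deg")
```

Triangulation across clusters, as mentioned above, would intersect such bearing estimates from several microphone-array clusters to obtain a position estimate.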

6 Audio Signal Processing
- Key-word spotting/recognition: the automatic speech recognition task of identifying a single word/phrase in continuous free speech.
  - Requires continuous monitoring of speech.
- Even if computer voice recognition were as good as human recognition, a key-word recognition mode would still be needed to:
  - Provide context,
  - Get attention,
  - Resynchronize communication, and
  - Mimic human-to-human interaction and communication.
- Provides a significantly more efficient solution (memory and CPU) than a natural language understanding system, which requires enormous resources (memory and CPU); see the gating sketch after this list.
- It is a mode of communication that would enable more natural interaction between man and machine.
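A minimal sketch of the efficiency argument above: a lightweight key-word spotter continuously monitors audio and only hands control to a heavyweight recognizer once the key word fires. `score_keyword`, `full_recognizer`, the window length, and the 0.8 threshold are hypothetical placeholders, not the actual system's API.

```python
from collections import deque
from typing import Callable, Iterable, Sequence

def monitor(frames: Iterable[Sequence[float]],
            score_keyword: Callable[[list], float],
            full_recognizer: Callable[[list], str],
            window: int = 50,
            threshold: float = 0.8) -> None:
    """Run the cheap spotter on a sliding window; wake the expensive recognizer on a hit."""
    buf: deque = deque(maxlen=window)        # last `window` feature frames
    for frame in frames:
        buf.append(frame)
        if len(buf) == window and score_keyword(list(buf)) >= threshold:
            # Key word detected: context/attention established, so it is now
            # worth spending memory/CPU on the full recognizer.
            print("key word detected ->", full_recognizer(list(buf)))
            buf.clear()                      # resynchronize after a detection

# Usage (hypothetical): monitor(feature_stream, my_spotter_score, my_big_recognizer)
```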

7 Audio Signal Processing
- In the context of surveillance and security monitoring, key-word recognition provides technology for spotting words/phrases that imply foul intentions.
- It can also be used to provide location information for the source of the sound; this information can then be used to trigger processing of the video/image sequences.
- Example of a key-word recognition technology solution: ThinkEngine Networks, Marlborough, Massachusetts.

8 Goals and Specifications
- 84 simultaneous channels of OnWord recognition on each TI TMS320C205 DSP (200 MHz)
- Memory space per DSP:
  - 64 KB program
  - 64 KB data
  - 2 MB external data
- Total of 672 channels with 8 DSPs (a quick arithmetic check follows this list)
- Recognition rate >90% with ~0% false acceptance
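A back-of-the-envelope check of the numbers above, assuming the 2 MB external data memory is shared evenly across the 84 channels on a DSP (the slides do not state how that memory is actually partitioned).

```python
DSPS = 8
CHANNELS_PER_DSP = 84
EXTERNAL_DATA_BYTES = 2 * 1024 * 1024

total_channels = DSPS * CHANNELS_PER_DSP                       # 672, matching the slide
per_channel_kb = EXTERNAL_DATA_BYTES / CHANNELS_PER_DSP / 1024  # ~24.4 KB per channel
print(f"total channels: {total_channels}")
print(f"external data per channel: ~{per_channel_kb:.1f} KB")
```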

9 Solution: 3 Patented Inventions
- Feature-based Voice Activity Detector (VAD)
  - Patent application: “Voice Activity Detection Based on Cepstral Features”
- Fundamental contribution to pattern recognition (a plain DTW sketch follows this list)
  - Patent application: “Dynamic Time Warping (DTW) Matching Using Reverse Ordered Feature Vectors”
- Extended DTW matching
  - Patent application: “Rescoring Using Distribution Distortion Measurements of Dynamic Time Warping Match”
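For reference, a minimal sketch of plain dynamic time warping between a key-word template and a query; the patented variants listed above (reverse-ordered feature vectors, distribution-distortion rescoring) are not implemented here, and the assumption that frames are cepstral vectors compared with Euclidean distance is illustrative.

```python
import numpy as np

def dtw_distance(template: np.ndarray, query: np.ndarray) -> float:
    """Cumulative DTW alignment cost between two [frames x dims] feature arrays."""
    n, m = len(template), len(query)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(template[i - 1] - query[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

# Usage (hypothetical names): lower cost means the query is closer to the key-word template.
# template = np.load("onword_template.npy")
# score = dtw_distance(template, extract_cepstra(audio_frames))
```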

10 How to Measure OnWord Recognition Performance
- The classical recognition-rate measures (correct recognition, correct rejection, false acceptance, false rejection) are inadequate.
- False recognition is instead defined in terms of the number of false acceptances per 1000 minutes of continuous free speech (see the sketch below).
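A minimal sketch of the metric described above, normalizing false acceptances per 1000 minutes of continuous free speech; the example counts are made up.

```python
def false_acceptances_per_1000_min(num_false_accepts: int, speech_minutes: float) -> float:
    """Normalize a raw false-acceptance count to a per-1000-minute rate."""
    return 1000.0 * num_false_accepts / speech_minutes

# e.g. 3 false acceptances observed over 40 hours (2400 min) of monitoring:
print(false_acceptances_per_1000_min(3, 2400.0))   # -> 1.25 FA per 1000 min
```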

11 Evaluation and Testing
- “Wake-up-Word” (WUWII) Corpus – internally collected data, designed for complete recognition evaluation.
- CallHome English Corpus – false recognition
- PhoneBook Corpus – false recognition

12 WUWII Corpus
The WUWII Corpus consists of a set of 317 calls containing:
- First and last name,
- Isolated words (OnWords): “Operator”, “ThinkEngine”, “OnWord”, “Wildfire”, “Voyager”,
- The same words used in the context of a sentence, e.g., “um good morning computer, it's nice to talk to you Operator are you understanding me [bang]”.
- 40% female and 22% non-native speakers.
Results:
- Equal-error-rate operating point: ~4% (i.e., 4% false acceptance and 4% false rejection). The cases that produce false rejections at this point are extremely difficult to recognize correctly.
- The equal-error-rate operating point may not be good enough for practical applications that require ~0% false acceptance, which dictates a more conservative operating point for the OnWord system.
- At this conservative operating point: ~93% correct recognition (~7% false rejection) with ~0% false acceptance.
- If one considers that ~3% of the false-rejection cases are not recoverable or are pathological (e.g., too-noisy conditions, unusual pronunciation, very low signal, etc.), the actual correct performance is ~97%. (A sketch of choosing such an operating point from score distributions follows.)
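A sketch of how an operating point could be chosen from confidence scores of in-vocabulary (INV) and out-of-vocabulary (OOV) trials, as in the distribution plot on the next slide; the score arrays here are synthetic stand-ins, not the WUWII data.

```python
import numpy as np

def rates_at_threshold(inv_scores, oov_scores, threshold):
    """False rejection rate (INV below threshold) and false acceptance rate (OOV at/above it)."""
    fr = np.mean(np.asarray(inv_scores) < threshold)
    fa = np.mean(np.asarray(oov_scores) >= threshold)
    return fr, fa

def equal_error_threshold(inv_scores, oov_scores):
    """Scan thresholds 0..100 and return the one where |FR - FA| is smallest."""
    candidates = np.linspace(0, 100, 1001)
    diffs = [abs(np.subtract(*rates_at_threshold(inv_scores, oov_scores, t))) for t in candidates]
    return candidates[int(np.argmin(diffs))]

# Synthetic illustration: INV scores cluster high, OOV scores cluster low.
rng = np.random.default_rng(0)
inv = np.clip(rng.normal(80, 10, 2000), 0, 100)
oov = np.clip(rng.normal(30, 12, 2000), 0, 100)
t_eer = equal_error_threshold(inv, oov)
print("EER threshold:", t_eer, "FR/FA at EER:", rates_at_threshold(inv, oov, t_eer))
```

A practical deployment requiring ~0% false acceptance would instead pick a threshold well above the EER point, trading more false rejections for fewer false acceptances, as described on the previous slide.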

13 WUWII Corpus Test Results
[Figure: distribution plot of confidence scores (0–100%) for the OnWord “Operator”, showing the INV and OOV score distributions with their cumulative curves, the equal-error-rate point, and the operating threshold.]

14 CallHome English Corpus
- The CallHome English Corpus consists of 120 unscripted telephone conversations between native speakers of English.
- Conversations are up to 30 minutes long.
- All calls originated in North America; 90 of the 120 calls were placed to various locations overseas, while the remaining 30 were placed within North America.
- For further details visit:

15 CallHome Corpus Test Results (1)

16 CallHome Corpus Test Results (4)

17 PhoneBook Corpus
- PhoneBook consists of a total of 93,667 isolated-word utterances, totaling 23 hours of speech. This breaks down to 7979 distinct words, each said by an average of 11.7 talkers, with 1358 talkers each saying up to 75 words.
- The first chart depicts results from PhoneBook CDs 1 and 2, which contain only isolated words.

18 PhoneBook Corpus (cont.)
The second chart shows results from CD 3, which contains spontaneous phrases: long-distance phone numbers, dollar amounts, 7-digit phone numbers, ZIP codes, and some spurious/spontaneous words.

19 PhoneBook Corpus (cont.)
- The number of words is statistically significant, as are the number of speakers and the total duration of the utterances (number of minutes, number of words, number of unique words, number of speakers).
- All utterances are tightly end-pointed and have no initial or trailing silence ⇒ actual error rates for the same words in free speech are significantly lower than presented here.
- For further details consult:

20 PhoneBook Corpus Test Results (1)

21 PhoneBook Corpus Test Results (2)

22 Video System
- Obvious use in surveillance/security applications.
- Focus on integration of the video/image stream with audio (a score-fusion sketch follows this list):
  - The audio component can be used for:
    - Key-word spotting/recognition (real-time)
    - General speech recognition and automatic transcription (close to real-time or offline)
    - Speaker identification/recognition
  - The video/image stream can be used for:
    - Face identification/recognition and tagging of individuals
    - Motion analysis and object tracking
  - Combined video and audio:
    - Speech recognition and lip-reading
    - Behavior analysis: prosody (pitch tracking) from speech, aggressive behavior from motion
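A minimal sketch of audio-visual late fusion for the "combined video and audio" item above: per-stream recognizers produce word hypotheses with confidence scores, and a weighted combination picks the final hypothesis. The weights, score ranges, and example outputs are illustrative assumptions.

```python
from typing import Dict

def fuse_hypotheses(audio_scores: Dict[str, float],
                    video_scores: Dict[str, float],
                    audio_weight: float = 0.7) -> str:
    """Combine audio (speech) and video (lip-reading) scores and return the best word."""
    video_weight = 1.0 - audio_weight
    words = set(audio_scores) | set(video_scores)
    fused = {w: audio_weight * audio_scores.get(w, 0.0) +
                video_weight * video_scores.get(w, 0.0) for w in words}
    return max(fused, key=fused.get)

# Example: the audio channel is noisy, but lip-reading disambiguates the word.
audio = {"operator": 0.45, "alligator": 0.50}
video = {"operator": 0.80, "alligator": 0.10}
print(fuse_hypotheses(audio, video))   # -> "operator"
```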

23 System Integration
- Wireless networking
- Introduction of novel solutions:
  - Protocols
  - Data compression (video and audio)
  - Task allocation and synchronization
- Expansion of the capabilities of the “Smart Room”:
  - Sensor development
  - Sensor fusion

24 Sensor Development Expanded uses of this technology:
- Detection of thermal and other radiation (e.g., body heat, x-rays)
- Detection of other sounds outside the audible range

25 Mode of Collaboration
- Support instrumentation effort
- Support research work
- Joint research funding proposals
- Help WiCE and FIT gain recognition in research
- Attract high-quality students
- Direct benefit from generated intellectual property
- Increasing the quality and size of the graduating student pool

26 Sensor Development
- Missile Defense Agency contract on IR photo-detectors (HgCdTe)
- Materials processing, device physics, microelectronics fabrication ⇒ various types of sensors

