Kinect Player Gender Recognition from Speech Analysis

Presentation transcript:

Kinect Player Gender Recognition from Speech Analysis Radford Parker ECE 6254

Problem
Currently, the Kinect recognizes the gender of players through image-processing techniques.
Can a system be designed that recognizes gender in real time using only speech analysis?
This would drastically improve computational efficiency, since the analysis moves from two-dimensional image data to a one-dimensional audio signal.

Background
Many implementations have been proposed, using a wide variety of features.
Examples include pitch [1], harmonic structure [2], cepstral analysis [3], and spectral coefficients [4].
Only recently has work been done on gender recognition from real-time streaming audio [5].

Hardware
The Microsoft Kinect contains an RGB camera, a calibrated IR-based depth sensor, and a four-microphone array.
It records audio as four separate channels at 16 kHz.

Getting the Spectrum
To achieve real-time performance, a simple approach was taken: an FFT based on the Danielson-Lanczos lemma.
The algorithm detects when a player is speaking and then takes the FFT of overlapping Hamming-windowed frames.
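A minimal NumPy sketch of this windowing-and-FFT step (np.fft.rfft stands in for the radix-2 Danielson-Lanczos FFT described in the slide). The frame length and overlap are taken from the "Determining Gender" slide below; the speech_threshold value and the function name frame_spectra are assumptions for illustration, not the presenter's actual code.

```python
import numpy as np

FS = 16000          # Kinect audio sample rate (Hz)
FRAME = 8192        # samples per analysis frame (~512 ms at 16 kHz)
HOP = FRAME - 1024  # consecutive frames overlap by 1024 samples

def frame_spectra(signal, speech_threshold=0.05):
    """Yield (freqs, magnitude spectrum) for each Hamming-windowed frame
    that is loud enough to plausibly contain speech."""
    window = np.hamming(FRAME)
    for start in range(0, len(signal) - FRAME + 1, HOP):
        frame = signal[start:start + FRAME]
        if np.max(np.abs(frame)) < speech_threshold:
            continue  # too quiet to be speech: skip the FFT
        spectrum = np.abs(np.fft.rfft(frame * window))
        freqs = np.fft.rfftfreq(FRAME, d=1.0 / FS)
        yield freqs, spectrum
```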

Finding the Peaks
Once the spectrum has been computed, the largest peak, fp, is located between 75 Hz and 275 Hz.
This is the typical vocal range for adult humans [6].
Because fp may itself be a harmonic, smaller peaks are also checked for at fp/2 and fp/3.
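A small sketch of the peak search, continuing the assumed NumPy pipeline above; it simply returns the strongest bin in the 75-275 Hz band.

```python
def find_peak(freqs, spectrum, lo=75.0, hi=275.0):
    """Return the frequency of the strongest spectral peak in [lo, hi] Hz,
    or None if no bins fall inside that band."""
    band = (freqs >= lo) & (freqs <= hi)
    if not np.any(band):
        return None
    idx = np.argmax(np.where(band, spectrum, 0.0))  # zero out-of-band bins
    return freqs[idx]
```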

Finding the Pitch
The neighborhoods around fp/2 and fp/3 are searched.
The actual pitch, fa, is found where all frequencies in the local neighborhood have a smaller response and the gain is significant compared to the entire spectrum.
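A possible reading of this sub-harmonic check, again as an assumed sketch: the neighborhood width, the "significant gain" threshold, and the fallback to fp when no sub-harmonic qualifies are all guesses, since the slide does not give exact values.

```python
def find_pitch(freqs, spectrum, f_p, rel_width=0.1, gain_factor=3.0):
    """Search the neighborhoods around f_p/2 and f_p/3 for a sub-harmonic.

    A candidate is accepted as the pitch f_a when it dominates its local
    neighborhood and its gain stands out against the spectrum as a whole;
    otherwise f_p itself is returned (assumed fallback).
    """
    mean_gain = np.mean(spectrum)
    for divisor in (2, 3):
        centre = f_p / divisor
        neighborhood = np.abs(freqs - centre) <= rel_width * centre
        if not np.any(neighborhood):
            continue
        local = np.where(neighborhood, spectrum, 0.0)
        idx = np.argmax(local)
        if local[idx] > gain_factor * mean_gain:
            return freqs[idx]  # significant sub-harmonic found: use it as f_a
    return f_p                 # no convincing sub-harmonic: keep the original peak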

Determining Gender
The algorithm is run every 8192 samples (about 500 ms), with an overlap of 1024 samples between frames.
If the maximum volume of a frame is high enough to be speech, its FFT is computed.
For every FFT, an associated fa is found and classified as male if it lies in (70, 140) Hz or female if it lies in (165, 275) Hz.
These per-frame scores are aggregated, and whenever one gender outnumbers the other by a certain factor, the player is assigned that gender.
Aggregating over many frames allows for greater confidence in the decision.
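A sketch of the vote-aggregation step under the same assumptions; the slide only says "a certain factor", so the factor=2.0 value and the decide_gender/classify_frame names are placeholders.

```python
MALE_RANGE = (70.0, 140.0)     # Hz
FEMALE_RANGE = (165.0, 275.0)  # Hz

def classify_frame(f_a):
    """Vote 'male' or 'female' for one frame's pitch, or None if ambiguous."""
    if MALE_RANGE[0] < f_a < MALE_RANGE[1]:
        return "male"
    if FEMALE_RANGE[0] < f_a < FEMALE_RANGE[1]:
        return "female"
    return None

def decide_gender(frame_pitches, factor=2.0):
    """Aggregate per-frame votes; declare a gender once one side
    outnumbers the other by `factor`, otherwise report 'unknown'."""
    votes = {"male": 0, "female": 0}
    for f_a in frame_pitches:
        label = classify_frame(f_a)
        if label is not None:
            votes[label] += 1
        if votes["male"] > factor * max(votes["female"], 1):
            return "male"
        if votes["female"] > factor * max(votes["male"], 1):
            return "female"
    return "unknown"
```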

Experimentation
To verify the algorithm, tests were performed offline using the same approach.
The samples were taken from the TSP Speech Database; each is a few seconds long and sampled at 48 kHz.
The dataset contains 1,444 speech samples spoken by 14 different female and 11 different male speakers.
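One way the offline evaluation could be wired together, assuming the sketch functions defined above and SciPy for reading and resampling the 48 kHz utterances; this is illustrative only and not the presenter's test harness.

```python
from math import gcd

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

def evaluate_file(path):
    """Run the pipeline offline on one 48 kHz TSP utterance,
    resampled to the Kinect's 16 kHz rate first."""
    rate, signal = wavfile.read(path)
    signal = signal.astype(np.float64)
    signal /= np.max(np.abs(signal))              # normalise to [-1, 1]
    if rate != FS:
        g = gcd(FS, rate)
        signal = resample_poly(signal, FS // g, rate // g)  # e.g. 48 kHz -> 16 kHz
    pitches = []
    for freqs, spectrum in frame_spectra(signal):
        f_p = find_peak(freqs, spectrum)
        if f_p is not None:
            pitches.append(find_pitch(freqs, spectrum, f_p))
    return decide_gender(pitches)
```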

Data Set Results
Contingency table for the TSP data set:

                     Actual Males   Actual Females
Predicted Males          549              28
Predicted Females         51             704
Unknown                   60              52

Removing the unknowns yields a correct classification rate of 92% for males and 96% for females.

Timing Results
For the TSP data set, the algorithm runs in roughly 15 ms per second of recorded audio (48,000 samples at 48 kHz).
Even with the additional Kinect overhead of streaming video, streaming depth, performing skeletal tracking, and recording all four audio channels, the algorithm still runs in real time.

Kinect Male Player Result

Kinect Female Player Result

Kinect Two Player Result

References
[1] Parris, E. S., Carey, M. J.: Language Dependent Gender Identification. Acoustics, Speech, and Signal Processing, ICASSP-96 Conference Proceedings, vol. 2 (1996) 685-688.
[2] Vergin, R., Farhat, A., O'Shaughnessy, D.: Robust Gender-Dependent Acoustic-Phonetic Modelling in Continuous Speech Recognition Based on a New Automatic Male/Female Classification. ICSLP-96 Conference Proceedings, vol. 2, October (1996) 1081-1084.
[3] Harb, H., Chen, L.: Gender Identification Using a General Audio Classifier. ICME '03 Proceedings, vol. 2, July (2003) 733-736.
[4] Slomka, S., Sridharan, S.: Automatic Gender Identification Optimised for Language Independence. TENCON '97 IEEE Region 10 Annual Conference, Speech and Image Technologies for Computing and Telecommunications, Conference Proceedings, vol. 1, December (1997) 685-688.
[5] Scheme, E., Castillo-Guerra, E., Englehart, K., Kizhanatham, A.: Practical Considerations for Real-Time Implementation of Speech-Based Gender Detection. Proceedings of the Iberoamerican Congress in Pattern Recognition, November (2006).
[6] Baken, R. J.: Clinical Measurement of Speech and Voice. Taylor and Francis Ltd., London (1987).