Published by Lorraine Jacobs. Modified over 8 years ago.
1
Study on Deep Learning in Speaker Recognition
Lantian Li
CSLT / RIIT, Tsinghua University
lilt@cslt.riit.tsinghua.edu.cn
May 26, 2016
2
Biometric Recognition
Speaker Recognition
Deep Learning in Speaker Recognition
Prospects
3
Biometrics refers to metrics related to human characteristics [wiki]. The term "biometrics" is derived from the Greek words bio (life) and metric (to measure). Biometric recognition is a family of automated technologies for measuring and analyzing an individual's physiological or behavioral characteristics. It can be used to verify or identify an individual.
4
Physiological characteristics: fingerprint, face, palmprint, iris, retina scan, DNA
Behavioral characteristics: signatures, gait/gesture, keystroke, voiceprint
5
Language Recognition: What language was spoken?
Accent Recognition: Where is he/she from?
Speech Recognition: What was spoken?
Gender Recognition: Male or female?
Emotion Recognition: Positive? Negative? Happy? Sad?
Speaker Recognition: Who spoke?
6
Speaker recognition is the identification of a person from the characteristics of their voice (voice biometrics). It is also called voiceprint recognition [wiki].
7
Speaker Identification: determining which identity in a specified speaker set is speaking during a given speech segment.
Speaker Verification: determining whether a claimed identity is speaking during a speech segment. It is a binary decision task.
Speaker Detection: determining whether a specified target speaker is speaking during a given speech segment.
Speaker Tracking (Speaker Diarization = Who Spoke When): performing speaker detection as a function of time, giving the timing index of the specified speaker.
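The difference between identification and verification can be made concrete with a small sketch. It assumes each utterance has already been reduced to a fixed-length speaker vector and that scores are cosine similarities. The vectors, the names `identify` and `verify`, and the threshold of 0.7 are illustrative assumptions, not taken from the slides.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def identify(test_vec, enrolled):
    """Closed-set identification: pick the best-scoring enrolled speaker."""
    return max(enrolled, key=lambda name: cosine(test_vec, enrolled[name]))

def verify(test_vec, claimed, enrolled, threshold=0.7):
    """Verification: binary accept/reject decision for one claimed identity."""
    return cosine(test_vec, enrolled[claimed]) >= threshold

# Toy speaker vectors, made up for illustration only.
enrolled = {"alice": [1.0, 0.0, 0.2], "bob": [0.0, 1.0, 0.1]}
test_vec = [0.9, 0.1, 0.2]

who = identify(test_vec, enrolled)               # one answer out of N speakers
accept_alice = verify(test_vec, "alice", enrolled)  # yes/no for a claimed identity
accept_bob = verify(test_vec, "bob", enrolled)
```

Identification always returns some member of the enrolled set, whereas verification can reject. In practice the threshold would be tuned on held-out trials rather than fixed by hand.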
8
Advantages of speaker recognition:
  Speech signals are more accessible
  Use is more acceptable to users
  Remote authentication is more convenient
Application scenarios:
  Access control (e.g. voice lock)
  Transaction authentication (e.g. remote payment)
  Forensic analysis (e.g. police criminal investigation)
9
Time: 1930, 1960, 1970, 1980, 1990, 2000, 2010
Feature: speech waveform, spectrogram, LPC, LPCC, PLAR, MFCC, PLP, phone information, deep spk-vector
Model: template matching, DTW, VQ, HMM, GMM-UBM, GMM-SVM, JFA, i-vector, deep learning
Keep pace with the times. Work together to develop.
From small and clean speech data to big and practical speech data
10
Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence [wiki]. Deep learning is a branch of machine learning based on learning representations of data. Neural networks are a biologically inspired programming paradigm that enables a computer to learn from observational data.
11
Lantian Li, Dong Wang, Zhiyong Zhang, Thomas Fang Zheng, “Deep Speaker Vectors for Semi Text-independent Speaker Verification”, arXiv:1505.06427, 2015.
12
Segment pooling and dynamic time warping (DTW)
Lantian Li, Yiye Lin, Zhiyong Zhang, Dong Wang, “Improved Deep Speaker Feature Learning for Text-Dependent Speaker Recognition”, APSIPA ASC 2015, pp. 426-429. IEEE, 2015.
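The slide names DTW without preserving how the cited paper applies it. As a general illustration, the sketch below implements textbook DTW between two sequences of frame-level feature vectors. The Euclidean frame distance and the toy sequences are assumptions made for this example, not details from the paper.

```python
def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    def frame_dist(x, y):
        # Euclidean distance between two frames (an assumed local cost).
        return sum((p - q) ** 2 for p, q in zip(x, y)) ** 0.5

    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(a[i - 1], b[j - 1])
            # Allowed moves: advance in a, advance in b, or advance in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Toy one-dimensional "feature" sequences, made up for illustration.
enroll = [[0.0], [1.0], [2.0], [1.0]]
test_same = [[0.0], [0.0], [1.0], [2.0], [2.0], [1.0]]  # same shape, spoken slower
test_other = [[2.0], [2.0], [0.0], [0.0]]

same_cost = dtw_distance(enroll, test_same)
other_cost = dtw_distance(enroll, test_other)
```

Because DTW absorbs differences in speaking rate, the time-stretched sequence aligns at zero cost while the differently shaped one does not. This is why it suits text-dependent tasks, where the enrollment and test phrases share the same content.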
13
Max-margin metric learning
Metric learning: to learn a projection M.
Distance metric: (formula not preserved in this transcript)
Goal: to discriminate between true speakers and imposters.
Lantian Li, Dong Wang, Chao Xing, Thomas Fang Zheng, “Max-margin Metric Learning for Speaker Recognition”, arXiv:1510.05940, 2015.
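The max-margin idea can be sketched without the slide's own formula, which is not available here. The example assumes the score is the cosine similarity of two speaker vectors after projection by M, and uses a hinge loss that reaches zero once the true-speaker score exceeds the imposter score by a margin. The functions `project` and `hinge_loss`, the margin of 0.5, and both matrices are hypothetical. Training M by gradient descent is not shown.

```python
import math

def project(M, w):
    """Apply the linear projection M (a list of rows) to vector w."""
    return [sum(m_ij * w_j for m_ij, w_j in zip(row, w)) for row in M]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hinge_loss(M, anchor, true_spk, imposter, margin=0.5):
    """Zero once the true-speaker score beats the imposter score by `margin`."""
    p_anchor = project(M, anchor)
    p_true = project(M, true_spk)
    p_imp = project(M, imposter)
    return max(0.0, margin - cosine(p_anchor, p_true) + cosine(p_anchor, p_imp))

# Toy vectors, made up for illustration: the true-speaker pair differs
# only along the second axis, which acts as a nuisance direction here.
anchor = [1.0, 1.0]
true_spk = [1.0, -1.0]
imposter = [-1.0, 1.0]

identity = [[1.0, 0.0], [0.0, 1.0]]     # no learned projection
keep_first = [[1.0, 0.0], [0.0, 0.01]]  # hypothetical M that suppresses axis two

loss_before = hinge_loss(identity, anchor, true_spk, imposter)
loss_after = hinge_loss(keep_first, anchor, true_spk, imposter)
```

With the identity matrix both scores are zero, so the loss equals the margin. The second matrix separates the two trials by more than the margin, so the loss drops to zero, which is the behavior a learned M is meant to achieve on training trials.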
14
Multi-task Recurrent Model for Speech and Speaker Recognition
Zhiyuan Tang +, Lantian Li +, Dong Wang, “Multi-task Recurrent Model for Speech and Speaker Recognition”, arXiv:1603.09643, 2016.
16
Sequence model --> Deep speaker embedding
Encoder-decoder + attention model
17
Thank you http://lilt.cslt.org/