ASR System & LIBDNN Yen-Chen Wu r03942044@ntu.edu.tw NTU Speech Lab, Summer Project Research
Outline DNN in Speech Recognition DNN TIMIT Introduction How to use libdnn
DNN IN SPEECH RECOGNITION
Speech Recognition In speech processing, each word consists of syllables, and each syllable consists of phonemes. In each time frame, an observation (a feature vector) is mapped to a phoneme. “青色” → “青(ㄑㄧㄥ)色(ㄙㄜ、)” → “ㄑ” (syllables) 青: TSI --I --N (phonemes) 色: S--@ (phonemes)
Observation Sequences Sample rate: 16000 Hz. A 25 ms sliding window, advanced 10 ms at a time, turns the waveform into a sequence of feature frames (Frame 1, Frame 2, Frame 3, ...). Source: Digital Speech Processing, Lect. 2.0
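A minimal sketch of the sliding window, assuming the 25 ms value is the window length and the 10 ms value is the shift between frames. The function name `split_into_frames` is illustrative, and real feature extraction (MFCC, FBANK) would run on each row afterwards.

```python
import numpy as np

SAMPLE_RATE = 16000  # samples per second, as on the slide
WINDOW_MS = 25       # assumed window length
SHIFT_MS = 10        # assumed shift between consecutive frames


def split_into_frames(signal, sample_rate=SAMPLE_RATE,
                      window_ms=WINDOW_MS, shift_ms=SHIFT_MS):
    """Cut a 1-D signal into overlapping frames, one row per frame."""
    window = sample_rate * window_ms // 1000  # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000    # 160 samples at 16 kHz
    if len(signal) < window:
        return np.empty((0, window))
    num_frames = 1 + (len(signal) - window) // shift
    starts = shift * np.arange(num_frames)
    return np.stack([signal[s:s + window] for s in starts])


one_second = np.zeros(SAMPLE_RATE)
frames = split_into_frames(one_second)
print(frames.shape)  # (98, 400)
```

One second of audio gives 98 frames here because incomplete trailing windows are dropped; padding the end instead is an equally common choice.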
DNN in Speech Recognition Goal: predict the phoneme from the acoustic features of each time frame (frame-wise prediction). Input: acoustic features (MFCC, FBANK, or ...). Output: pronunciation units (phonemes or ...). To know more about Automatic Speech Recognition (ASR), please refer to http://speech.ee.ntu.edu.tw/DSP2015Spring/
Training Deep Neural Network
Main Problems Model initialize Feedforward Backpropagate Update Predict
Model Initialize DNN training can get stuck in a poor local optimum, so initialization matters. In practice, unsupervised pre-training techniques exist for initialization. However, in this homework, we recommend initializing the parameters randomly for simplicity and efficiency.
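A minimal sketch of random initialization in NumPy. The layer sizes 69, 1024 and 39 mirror the nn-init example in these slides; the uniform range (a Glorot-style limit) and the helper names are assumptions for illustration and are not necessarily what libdnn does internally.

```python
import numpy as np


def init_layer(n_in, n_out, rng):
    """Small uniform random weights and zero biases for one layer."""
    limit = np.sqrt(6.0 / (n_in + n_out))  # assumed Glorot-style range
    weights = rng.uniform(-limit, limit, size=(n_in, n_out))
    biases = np.zeros(n_out)
    return weights, biases


def init_model(layer_sizes, seed=0):
    """Return a list of (weights, biases) pairs, one per layer."""
    rng = np.random.default_rng(seed)
    return [init_layer(n_in, n_out, rng)
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]


model = init_model([69, 1024, 39])
for weights, biases in model:
    print(weights.shape, biases.shape)
```

Fixing the seed makes experiments repeatable, which helps when comparing network structures.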
Feedforward
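A minimal sketch of a feedforward pass, assuming one sigmoid hidden layer and a softmax output over phoneme classes. The layer sizes, the random toy input, and the function names are illustrative assumptions, not libdnn's API.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def feedforward(x, W1, b1, W2, b2):
    """Return hidden activations and output class probabilities."""
    h = sigmoid(x @ W1 + b1)   # hidden layer
    y = softmax(h @ W2 + b2)   # one probability per class, per frame
    return h, y


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 69))               # 4 frames, 69-dim features
W1 = rng.normal(scale=0.1, size=(69, 128))
b1 = np.zeros(128)
W2 = rng.normal(scale=0.1, size=(128, 39))
b2 = np.zeros(39)

h, y = feedforward(x, W1, b1, W2, b2)
print(y.shape)           # (4, 39)
print(y.argmax(axis=1))  # predicted class index for each frame
```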
Backpropagate
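A minimal sketch of backpropagation for a network with one sigmoid hidden layer and a softmax output, trained with mean cross-entropy loss. The architecture, loss, and all names are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def loss_and_grads(x, labels, W1, b1, W2, b2):
    """Mean cross-entropy loss and its gradients for all parameters."""
    n = x.shape[0]
    h = sigmoid(x @ W1 + b1)
    y = softmax(h @ W2 + b2)
    loss = -np.log(y[np.arange(n), labels]).mean()

    # Error at the output pre-activation: (probabilities - one-hot) / n
    d_out = y.copy()
    d_out[np.arange(n), labels] -= 1.0
    d_out /= n
    grad_W2 = h.T @ d_out
    grad_b2 = d_out.sum(axis=0)

    # Propagate the error back through the sigmoid hidden layer
    d_hidden = (d_out @ W2.T) * h * (1.0 - h)
    grad_W1 = x.T @ d_hidden
    grad_b1 = d_hidden.sum(axis=0)
    return loss, (grad_W1, grad_b1, grad_W2, grad_b2)


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 6))           # 5 toy frames, 6 features
labels = np.array([0, 2, 1, 2, 0])    # toy class labels
W1 = rng.normal(scale=0.5, size=(6, 4))
b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 3))
b2 = np.zeros(3)

loss, grads = loss_and_grads(x, labels, W1, b1, W2, b2)
print(loss, [g.shape for g in grads])
```

Comparing one analytic gradient entry against a finite-difference estimate is a cheap way to catch sign or indexing mistakes.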
Update
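A minimal sketch of the parameter update, assuming plain gradient descent with a fixed learning rate. The helper name `sgd_update` and the toy objective are illustrative; in real training the gradients would come from backpropagation on a mini-batch.

```python
import numpy as np


def sgd_update(params, grads, learning_rate):
    """Return new parameters after one gradient-descent step."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


# Toy check: minimise f(w) = ||w||^2 / 2, whose gradient is w itself.
w = [np.array([1.0, -2.0])]
for _ in range(100):
    w = sgd_update(w, grads=[w[0]], learning_rate=0.1)

print(w[0])  # very close to [0, 0]
```

A learning rate that is too large makes this toy loop diverge and one that is too small makes it crawl, which is why the learning rate is a parameter worth tuning.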
Evaluation Framewise phoneme prediction Frame Accuracy
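A minimal sketch of frame accuracy, taken here to mean the fraction of frames whose predicted phoneme equals the reference label. The function name and the toy labels are illustrative.

```python
import numpy as np


def frame_accuracy(predicted, reference):
    """Fraction of frames where the prediction matches the reference."""
    predicted = np.asarray(predicted)
    reference = np.asarray(reference)
    if predicted.shape != reference.shape:
        raise ValueError("predicted and reference must have the same length")
    if predicted.size == 0:
        raise ValueError("need at least one frame")
    return float(np.mean(predicted == reference))


acc = frame_accuracy(["a", "a", "b", "sil"], ["a", "b", "b", "sil"])
print(acc)  # 0.75
```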
WHY DNN? Basic model in deep learning: network structure, feature extraction (representation), variety of structures (CNN, RNN, LSTM, NTM, etc.). Network structure: How many layers? How many neurons in each layer? Training parameters: learning rate, batch size.
Dataset and Format
Dataset TIMIT (Texas Instruments and Massachusetts Institute of Technology): well-transcribed speech of American English speakers of different sexes and dialects. Designed for the development and evaluation of ASR systems.
Dataset Each instance consists of 3 parts: speaker faem0, sentence si1392, the 37th frame
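A minimal sketch of splitting an instance ID into its three parts. It assumes the parts are joined with underscores, as in `faem0_si1392_37`; that separator and the names `InstanceId` and `parse_instance_id` are assumptions, so check the actual data files before relying on them.

```python
from typing import NamedTuple


class InstanceId(NamedTuple):
    speaker: str
    sentence: str
    frame: int


def parse_instance_id(instance_id: str) -> InstanceId:
    """Split 'speaker_sentence_frame' (assumed layout) into its parts."""
    speaker, sentence, frame = instance_id.split("_")
    return InstanceId(speaker, sentence, int(frame))


parsed = parse_instance_id("faem0_si1392_37")
print(parsed)  # InstanceId(speaker='faem0', sentence='si1392', frame=37)
```

Grouping frames by speaker and sentence this way makes it easy to rebuild whole utterances from frame-level data.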
Data Format WAV file: Speaker-Sentence ID + .wav (check by ear). ARK file: Instance ID + features. TODO
HOW TO USE LIBDNN
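LIBDNN libdnn is a lightweight, readable, and user-friendly deep learning library. Written in C++ and CUDA, it aims to let developers, researchers, and anyone interested easily experience and harness the power of deep learning. Ref: 以深層與卷積類神經網路建構聲學模型之大字彙連續語音辨識 (Deep and Convolutional Neural Networks for Acoustic Modeling in Large Vocabulary Continuous Speech Recognition). Already installed on the project students' workstation.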
Data Format (1) Sparse matrix format (LibSVM). Most values in the vector are 0; only a few dimensions have the value 1.
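A minimal sketch of reading one line in the standard LibSVM sparse layout, `label index:value index:value ...`, where indices start at 1 and omitted dimensions are 0. The function name and the sample line are illustrative, and integer class labels are assumed.

```python
import numpy as np


def parse_libsvm_line(line, dim):
    """Turn 'label i:v i:v ...' into (label, dense vector of length dim)."""
    fields = line.split()
    label = int(fields[0])
    vector = np.zeros(dim)
    for item in fields[1:]:
        index, value = item.split(":")
        vector[int(index) - 1] = float(value)  # LibSVM indices are 1-based
    return label, vector


label, vector = parse_libsvm_line("3 2:1 5:1", dim=6)
print(label, vector)  # 3 [0. 1. 0. 0. 1. 0.]
```

Only the non-zero entries are stored on disk, which is why this layout suits mostly-zero vectors.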
Data Format (2) Dense (tightly packed) format. This is the format provided for this exercise.
How to Use There are three main programs: nn-init, nn-train, and nn-predict. The commands are written into a shell script, so you only need to modify the parameters.
nn-init
nn-init [train_set_file] <-o> <--input-dim> <--struct> [options]
EX: nn-init -o init.model --input-dim 69 --struct 1024 --output-dim 39
nn-train
nn-train <training_set_file> <model_in> [valid_set_file] [model_out] <--input-dim> [options]
EX: nn-train train.dat init.model --input-dim 69
nn-predict
nn-predict <testing_set_file> <model_file> [output_file] <--input-dim> [options]
EX: nn-predict test.dat train.model --input-dim 69
WORK STATION Project students who need a workstation account, please contact the lab network administrator: 廖宜修 r03921048@ntu.edu.tw
Log in: ssh -p 2822 your_account@140.112.21.35
After logging in, first check the data location: /home/wyc2010/DNN_practice
Copy run.sh to your own home directory: cp /home/wyc2010/DNN_practice/run.sh ~
Start the experiment! sh run.sh