1 Improving Language-Universal Feature Extraction with Deep Maxout and Convolutional Neural Networks
Yajie Miao, Florian Metze, Carnegie Mellon University. Presenter: 許曜麒. Date: 2015/01/16

2 Abstract Previous work has investigated the utility of multilingual DNNs acting as language-universal feature extractors (LUFEs). We replace the standard sigmoid nonlinearity with the recently proposed maxout units, which have the nice property of generating sparse feature representations. The convolutional neural network (CNN) architecture is applied to obtain a more invariant feature space.

3 Introduction
Deep neural networks (DNNs) have shown significant gains over the traditional GMM/HMM.
[Figure: a hybrid DNN whose hidden layers produce high-level representations, with a softmax output layer over the HMM states S1, S2, S3, ...]

4 Introduction These shared layers are taken as a language-universal feature extractor (LUFE). Given a new language, acoustic models can be built over the outputs from the LUFE, instead of the raw features (e.g., MFCCs).

5 Introduction On complex image and speech signals, sparse features tend to give better classification accuracy than non-sparse feature types. CNNs, with their local filters and max-pooling layers, are able to reduce spectral variation in the speech data. Therefore, more robust and invariant representations are expected from the CNN-LUFE feature extractor.

6 Feature Extraction with DNNs
Fine-tuning of the DNNs can be carried out using SGD (stochastic gradient descent) based on mini-batches. In the multilingual DNN, each language has its own softmax output layer, while the hidden layers are tied (shared) across all the languages.

7 SGD (stochastic gradient descent), BGD, and mini-batch GD
BGD (batch GD) uses the entire dataset for every update of θ. This has two drawbacks: when the dataset is large, each update is computationally expensive, and training cannot start until all the data are available. SGD updates θ once per sample; it is commonly used on large training sets, but it tends to converge to a local optimum. Mini-batch GD strikes a balance between the two, using neither the full batch of BGD nor a single sample as SGD does (see the sketch below).
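The following is a minimal sketch of the three update schemes, in which only the batch size differs; the gradient function grad and the toy linear-regression data are hypothetical stand-ins, not part of the paper.

import numpy as np

def gradient_descent(theta, X, y, grad, lr=0.01, batch_size=None):
    # batch_size=None -> BGD: each update of theta uses the whole dataset
    # batch_size=1    -> SGD: one update per training sample
    # batch_size=k>1  -> mini-batch GD: a compromise between the two
    n = X.shape[0]
    bs = n if batch_size is None else batch_size
    idx = np.random.permutation(n)
    for start in range(0, n, bs):
        batch = idx[start:start + bs]
        theta = theta - lr * grad(theta, X[batch], y[batch])
    return theta

# toy usage: least-squares gradient for linear regression (hypothetical data)
X = np.random.randn(100, 5)
y = X @ np.ones(5)
grad = lambda th, Xb, yb: 2.0 * Xb.T @ (Xb @ th - yb) / len(yb)
theta = gradient_descent(np.zeros(5), X, y, grad, batch_size=10)  # mini-batch GD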

8 Feature Extraction with DNNs
The proposed method differs from the hybrid DNN introduced on the previous two slides in two ways. First, previous work used all of the hidden layers as the feature extractor, but the authors argue that only some of them are needed, e.g., the layers close to the softmax may be the more important ones. Second, for the target language, the authors build a DNN model to replace the single softmax layer.

9 LUFEs with Convolutional Networks
WHY CNN? The CNN hidden activations become invariant (because of max-pooling) to different types of speech variability and provide better feature representations. In the convolution layer, we only consider filters along frequency, assuming that the time variability can be modeled by the HMM.
[Figure: frequency bands v1, v2, v3, v4 of the input feeding the convolution layer]

10 LUFEs with Convolutional Networks
σ: sigmoid activation function. After the two convolution + pooling stages, multiple fully-connected DNN layers and a softmax layer follow. Because the fully-connected layers receive high-level inputs (the invariant features extracted by the CNN), classifying HMM states becomes easier. The max-pooling layer takes the maximum over each group of k consecutive convolution outputs, p_j = max_{1≤i≤k} h_{(j-1)k+i}, where k is the pooling size (2 in this example); a sketch follows below.
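A minimal sketch of the non-overlapping max-pooling just described, assuming pooling size k = 2 applied to the convolution outputs h along the frequency axis; the numbers are made up for illustration.

import numpy as np

def max_pool_1d(h, k=2):
    # h: convolution outputs along frequency; p_j is the max of the j-th group of k values
    num_pools = len(h) // k
    return h[:num_pools * k].reshape(num_pools, k).max(axis=1)

h = np.array([0.1, 0.7, 0.3, 0.9, 0.2, 0.4])
print(max_pool_1d(h, k=2))   # -> [0.7 0.9 0.4]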

11 Sparse Feature Extraction
Our past study presents the deep maxout networks (DMNs) for speech recognition. Compared with standard DNNs, DMNs perform better on both hybrid systems and bottleneck-feature (BNF) tandem systems.
[Figure: a maxout network with the input layer, hidden-layer unit groups, a BNF layer, and the output layer]

12 Sparse Feature Extraction
Sparse representations can be generated from any of the maxout layers via a non-maximum masking operation (see the sketch below). We extend our previous idea in the following three aspects.
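A minimal sketch of a maxout group and the non-maximum masking operation as described here: within each group of hidden units, maxout keeps only the maximum value, while the masking variant keeps that value at its original position and zeroes the rest. The group size of 2 and the example numbers are illustrative assumptions.

import numpy as np

def maxout(z, group_size=2):
    # standard maxout: each group of units is reduced to its maximum value
    return z.reshape(-1, group_size).max(axis=1)

def non_maximum_masking(z, group_size=2):
    # sparse variant: keep the maximum of each group in place, zero the others
    groups = z.reshape(-1, group_size)
    mask = groups == groups.max(axis=1, keepdims=True)
    return (groups * mask).reshape(-1)

z = np.array([0.2, 0.8, 0.5, 0.1])
print(maxout(z))                 # -> [0.8 0.5]
print(non_maximum_masking(z))    # -> [0.  0.8 0.5 0. ]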

13 Sparse Feature Extraction
First, we compute the population sparsity (pSparsity) of each feature type as a quantitative indicator, based on the ℓ1 and ℓ2 norms: ‖v‖₁ = Σᵢ |vᵢ| and ‖v‖₂ = √(Σᵢ vᵢ²).
Example: v₁ = (0,1,0,0,0,1,0,0,0,0) and v₂ = (1,1,0,1,1,1,1,1,0,1) give pSparsity(v₁) = 2 and pSparsity(v₂) = 8. (A lower pSparsity means higher sparsity of the features.)
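A short numeric check of the example above. The exact pSparsity formula is not spelled out on the slide; one plausible reading, assumed here, is the squared ℓ1/ℓ2 ratio, which reproduces the quoted values of 2 and 8 for these binary vectors.

import numpy as np

v1 = np.array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0], dtype=float)
v2 = np.array([1, 1, 0, 1, 1, 1, 1, 1, 0, 1], dtype=float)

def p_sparsity(v):
    l1 = np.abs(v).sum()           # ||v||_1 = sum_i |v_i|
    l2 = np.sqrt((v ** 2).sum())   # ||v||_2 = sqrt(sum_i v_i^2)
    return (l1 / l2) ** 2          # assumed definition; matches the slide's example

print(p_sparsity(v1))   # -> 2.0  (sparser features)
print(p_sparsity(v2))   # -> 8.0  (denser features)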

14 Sparse filtering
Arrange the features in a matrix F = [f_j^(i)], where row j holds feature f_j across all examples and column i holds the feature vector f^(i) of example i. Sparse filtering first normalizes each row, f̃_j = f_j / ‖f_j‖₂, and then normalizes each column, f̂^(i) = f̃^(i) / ‖f̃^(i)‖₂.
In the figure on the right, the blue sample increases only in f₁ relative to the green sample; after normalization, the blue sample not only increases in f₁ but also decreases in f₂, which shows that the features compete with one another.
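A minimal sketch of the two ℓ2 normalization steps reconstructed above (in full sparse filtering, the ℓ1 norm of the normalized examples is then minimized); the tiny 2×2 matrix is a made-up illustration.

import numpy as np

def sparse_filtering_normalize(F, eps=1e-8):
    # F: rows are features f_j, columns are examples f^(i)
    F = F / (np.linalg.norm(F, axis=1, keepdims=True) + eps)  # normalize each feature row
    F = F / (np.linalg.norm(F, axis=0, keepdims=True) + eps)  # normalize each example column
    return F

# after the column normalization, every example has unit l2 norm, so raising
# one of its features necessarily pushes its other features down (competition)
F = np.array([[1.0, 2.0],
              [1.0, 1.0]])
print(sparse_filtering_normalize(F))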

15 Sparse Feature Extraction
Second, to better understand the impact of sparsity, we compare DMNs against rectifier networks. DNNs consisting of rectifier (ReLU) units, referred to as deep rectifier networks (DRNs), have shown competitive accuracy on speech recognition. The rectifier units have the activation function max(0, x). As a result, high-sparsity features are naturally produced by a DRN-based extractor, because many of the hidden outputs are forced to 0.
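A minimal illustration of why a rectifier layer yields sparse outputs: max(0, x) maps every negative pre-activation to exactly 0. The pre-activation values are made up.

import numpy as np

x = np.array([-1.2, 0.3, -0.4, 2.1, -0.7])   # hypothetical pre-activations
relu = np.maximum(0.0, x)                    # rectifier activation max(0, x)
print(relu)                                  # -> [0.  0.3 0.  2.1 0. ]
print((relu == 0).mean())                    # fraction of exact zeros: 0.6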

16 Sparse Feature Extraction
Third, DMNs are combined with CNNs to obtain feature representations that are both sparse and invariant. The CNN-DMN-LUFE extractor is structured in a similar way to the CNN-LUFE; the only difference is that the fully-connected layers in the CNNs are replaced by maxout layers.

17 Experiments We aim at improving speech recognition on the Tagalog limited language pack, which provides only 10 hours of telephone conversation speech for training and 2 hours for testing. The feature extractors are trained on three source languages: Cantonese, Turkish, and Pashto.

18 Experiments On each language, we build a triphone GMM/HMM system with the same recipe. An initial maximum likelihood model is first trained using 39-dimensional PLPs with per-speaker mean normalization. Then 9 frames are spliced together and projected down to 40 dimensions with linear discriminant analysis (LDA); a sketch of this step follows below. Speaker adaptive training (SAT) is performed using feature-space maximum likelihood linear regression (fMLLR).
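A minimal sketch of the splicing + LDA step, assuming a context of ±4 frames (9 frames of 39-dimensional PLPs → 351 dimensions → 40 dimensions). The projection matrix here is a random stand-in; in practice it is estimated from tied-state labels.

import numpy as np

def splice_frames(feats, context=4):
    # feats: (num_frames, 39) PLP features; splice +/- context frames with edge padding
    n = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])

plp = np.random.randn(100, 39)             # hypothetical utterance of 100 frames
spliced = splice_frames(plp)               # (100, 351) = 9 spliced frames x 39 dims
lda = np.random.randn(9 * 39, 40)          # stand-in for the estimated LDA matrix
projected = spliced @ lda                  # (100, 40) features used for SAT/fMLLR training
print(projected.shape)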

19 LDA (linear discriminant analysis)

20 Monolingual DNNs and CNNs
The DNN model has 6 hidden layers and 1024 units at each layer. DNN parameters are initialized with stacked denoising autoencoders (SDAs). Previous work has applied SDAs to LVCSR and shown that SDAs perform comparably with RBMs for DNN pre-training.
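For concreteness, the following is a minimal sketch of a single denoising-autoencoder layer of the kind that is stacked for pre-training: the input is corrupted and the layer is trained to reconstruct the clean input. The input dimension, corruption rate, and masking noise are illustrative assumptions, not values from the paper.

import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, in_dim=440, hidden_dim=1024, corruption=0.2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
        self.decoder = nn.Linear(hidden_dim, in_dim)
        self.corruption = corruption

    def forward(self, x):
        mask = (torch.rand_like(x) > self.corruption).float()  # masking noise
        hidden = self.encoder(x * mask)                        # encode the corrupted input
        return self.decoder(hidden), hidden

dae = DenoisingAutoencoder()
x = torch.randn(8, 440)                                        # a hypothetical mini-batch
reconstruction, hidden = dae(x)
loss = nn.functional.mse_loss(reconstruction, x)               # reconstruct the clean input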

21 Monolingual DNNs and CNNs
For CNNs, the size of the filter vectors r_ji is fixed at 5, and we use a pooling size of 2.
[Figure: convolution1 (100 feature maps) → pooling1 → convolution2 (200 feature maps) → pooling2 → output]
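A minimal sketch of this two-stage convolution front-end in PyTorch, assuming 1-D convolution along frequency, a single input channel, and sigmoid activations; the 40-band input length is a made-up example.

import torch
import torch.nn as nn

cnn_front_end = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=100, kernel_size=5),    # convolution1: 100 feature maps
    nn.Sigmoid(),
    nn.MaxPool1d(kernel_size=2),                                  # pooling1: pooling size 2
    nn.Conv1d(in_channels=100, out_channels=200, kernel_size=5),  # convolution2: 200 feature maps
    nn.Sigmoid(),
    nn.MaxPool1d(kernel_size=2),                                  # pooling2
)

x = torch.randn(8, 1, 40)          # (batch, channels, frequency bands)
print(cnn_front_end(x).shape)      # -> torch.Size([8, 200, 7])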

22 Results of DNN- and CNN-LUFEs
After the DNNs are trained, the lower four layers are employed as the feature extractor, which gives better results than generating features from the last hidden layer. For CNNs, we can take the two convolution stages as the LUFE and generate features from the second max-pooling layer; alternatively, we can extract features from the lowest fully-connected layer, which sits on top of the convolution stages. These two manners of CNN feature extraction are called FeatType1 and FeatType2, respectively.
[Table: dimension of the features from each LUFE]

23 Results of DNN- and CNN-LUFEs
For a fair comparison, an identical DNN topology, 4 hidden layers each with 1024 units, is used for the hybrid systems built over the different feature extractors.
[Figure: FeatType1 and FeatType2 tap points on the CNN: convolution1 (100 feature maps) → pooling1 → convolution2 (200 feature maps) → pooling2 → fully-connected layers → output]

24 Results of Sparse LUFEs
Table 3 compares DRN- and DMN-based LUFEs in terms of pSparsity and target-language WERs. However, although the DRN-based extractor also yields highly sparse features, features produced by DMNs achieve better WERs than features from DRNs. Increasing the group size (from 2 to 3 and finally 4) in DMNs gives sparser features but degrades the recognition results. The best sparse features in Table 3 are generated by DMNs with 512 unit groups and a group size of 2 at each hidden layer.

25 Combining CNNs and DMNs
We replace the sigmoid fully-connected layers with maxout layers. Compared with the CNN-LUFE, the CNN-DMN-LUFE generates sparse features and also reduces the target-language WER.

26 Conclusions and Future Work
This paper has investigated the effectiveness of deep maxout and convolutional networks in performing language-universal feature extraction. In comparison with DNNs, CNNs generate better feature representations and larger WER improvements on cross-language hybrid systems. Maxout networks have the property of generating sparse hidden outputs, which makes the learned representations more interpretable. CNN-DMN-LUFE, a hybrid of CNNs and maxout networks, gives the best recognition performance on the target language.

