Feature fusion and attention scheme

Average/Max A very simple and naive approach is to take the average of the input vectors: V_out = avg(v_1, v_2, ..., v_n). Alternatively, apply the max function on each vector dimension k: V_out,k = max(v_1,k, v_2,k, ..., v_n,k). This is usually a sub-optimal fusion scheme, since in most cases the input vectors should not all receive the same weight.
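
A minimal sketch of these two naive fusion rules, assuming the input vectors are stacked into one tensor (all names and sizes are illustrative):

import torch

def average_fusion(vectors):
    # vectors: [n, dim] stack of n input feature vectors
    return vectors.mean(dim=0)

def max_fusion(vectors):
    # element-wise maximum over the n vectors, per dimension k
    return vectors.max(dim=0).values

v = torch.randn(5, 128)                      # five 128-d feature vectors
v_avg, v_max = average_fusion(v), max_fusion(v)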

Feature concatenation Instead of the simple averaging/max scheme, a fusion layer can be applied to the concatenated features: Y = MLP(concat(x_1, x_2, ..., x_n)). Since the MLP expects an input of fixed dimension, this fusion scheme only accepts a fixed number of inputs. (Diagram: concatenated inputs fed to the MLP, producing the result vector.)
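
A possible implementation of this concatenation-based fusion layer, assuming a fixed number of input vectors (hyper-parameters are illustrative):

import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    def __init__(self, dim=128, n_inputs=5, out_dim=128):
        super().__init__()
        # the MLP input size is fixed to n_inputs * dim, so n_inputs cannot vary
        self.mlp = nn.Sequential(
            nn.Linear(dim * n_inputs, 256),
            nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, xs):                   # xs: list of n_inputs tensors, each [batch, dim]
        return self.mlp(torch.cat(xs, dim=-1))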

LSTM fusion Since an RNN module (or RNN-like module) can capture recurrent information across the inputs, an RNN-like structure can also be used for feature merging. Simple LSTM fusion: the 'Average' block can be replaced by another functional part such as a softmax or Norm-Average block.
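
One way to realize this LSTM fusion, averaging the per-step outputs at the end (the averaging could be swapped for softmax weighting or a norm-average, as noted; sizes are illustrative):

import torch
import torch.nn as nn

class LSTMFusion(nn.Module):
    def __init__(self, dim=128, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=dim, hidden_size=hidden, batch_first=True)

    def forward(self, x):                    # x: [batch, n_vectors, dim]
        out, _ = self.lstm(x)                # out: [batch, n_vectors, hidden]
        return out.mean(dim=1)               # 'Average' block over the sequence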

Scaled dot-product attention Dot-product attention is more straightforward: the weight is obtained from the inner product of the vector with itself. Note that x^T x equals the squared Euclidean length of x, so SDPA here acts as a weighting policy based on the embedded vector's length. This is a simple approach to feature merging.
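
A minimal sketch of the fusion rule implied here, where each vector's weight comes from its own scaled inner product (this is an interpretation of the slide, not the standard Transformer formulation):

import torch

def sdpa_fusion(x):
    # x: [n, dim]; each vector is weighted by its own scaled inner product x_i^T x_i
    d = x.shape[-1]
    scores = (x * x).sum(dim=-1) / d ** 0.5  # squared length of each vector, scaled
    weights = torch.softmax(scores, dim=0)   # [n]
    return (weights.unsqueeze(-1) * x).sum(dim=0)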

Attention mechanism Attention receives three vector inputs: Q (query), K (key) and V (value). Typically, the attention model produces an attention map A (the weights) from Q and K, and A is then applied to V to obtain the final result. The attention map A is usually normalized with a softmax function, although there are other options such as an L1 norm. An attention mask can optionally be applied before normalization, e.g. a top-k selection or another mask-generation function/network. (Diagram: QKV attention block.)
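
A generic QKV attention block along these lines, with softmax normalization and an optional additive mask applied before it (shapes are illustrative):

import torch

def attention(q, k, v, mask=None):
    # q: [nq, d], k: [nk, d], v: [nk, dv]
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # [nq, nk]
    if mask is not None:
        scores = scores + mask               # e.g. -inf outside a top-k selection
    a = torch.softmax(scores, dim=-1)        # attention map A
    return a @ v                             # [nq, dv]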

Self-attention (fully connected) Q, K and V are produced by separate MLPs: Q = MLP1(x), K = MLP2(x), V = MLP3(x), Y = Attention(Q, K, V). Furthermore, a shortcut between Y and x can be added, similar to the shortcut in ResNet blocks. (Diagram: optional residual shortcut from x to the output.)
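
A sketch of this fully connected self-attention, with single linear layers standing in for MLP1–MLP3 and an optional residual shortcut:

import torch
import torch.nn as nn

class SelfAttentionFC(nn.Module):
    def __init__(self, dim=128, residual=True):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.residual = residual

    def forward(self, x):                    # x: [n, dim]
        q, k, v = self.q(x), self.k(x), self.v(x)
        a = torch.softmax(q @ k.T / x.shape[-1] ** 0.5, dim=-1)
        y = a @ v
        return y + x if self.residual else y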

Self-attention (convolutional) Q = conv1(x), K = conv2(x), V = conv3(x). Q has shape [h*w, ch_att], K has shape [h*w, ch_att], V has shape [h*w, ch_x]. Weight = softmax(QK^T, axis=-1), so Weight has shape [h*w, h*w]. (Diagram: optional residual shortcut from x to the output.)
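
A sketch of the convolutional variant following the shapes listed above, using 1x1 convolutions for the projections; ch_x and ch_att are assumed channel sizes, and the residual shortcut is kept:

import torch
import torch.nn as nn

class SelfAttentionConv(nn.Module):
    def __init__(self, ch_x=64, ch_att=16):
        super().__init__()
        self.conv_q = nn.Conv2d(ch_x, ch_att, 1)
        self.conv_k = nn.Conv2d(ch_x, ch_att, 1)
        self.conv_v = nn.Conv2d(ch_x, ch_x, 1)

    def forward(self, x):                            # x: [b, ch_x, h, w]
        b, c, h, w = x.shape
        q = self.conv_q(x).flatten(2).transpose(1, 2)    # [b, h*w, ch_att]
        k = self.conv_k(x).flatten(2).transpose(1, 2)    # [b, h*w, ch_att]
        v = self.conv_v(x).flatten(2).transpose(1, 2)    # [b, h*w, ch_x]
        weight = torch.softmax(q @ k.transpose(1, 2), dim=-1)   # [b, h*w, h*w]
        y = (weight @ v).transpose(1, 2).reshape(b, c, h, w)
        return y + x                                 # optional residual shortcut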

Self-attention: discussion Self-attention can be used as a network component, since it takes a layer output and produces a new feature. This component is useful for single-image structures, since Q, K and V are all derived from the image itself. Moreover, convolutional self-attention uses global information to derive the attention map for convolutional layers, which otherwise contain only local information.

General-query attention The vectors extracted from images have different reliability, depending on image quality, feature variance, etc. Therefore, a general query vector Q can be trained independently, while K and V remain derived from X. The query vector effectively tests the reliability of each input feature vector x. Shapes: X: [dim, #features], Q: [1, dim].
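
One way to realize such a block, with the query as an independently trained parameter while K and V are derived from the input features (a sketch under these assumptions; the features are stored row-wise, i.e. transposed relative to the slide's [dim, #features]):

import torch
import torch.nn as nn

class GeneralQueryAttention(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, dim))   # trained independently of X
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                    # x: [n_features, dim]
        k, v = self.k(x), self.v(x)
        scores = self.query @ k.T / x.shape[-1] ** 0.5   # [1, n_features]
        w = torch.softmax(scores, dim=-1)
        return (w @ v).squeeze(0)            # fused vector, [dim]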

General-query attention The general-query attention can serve as a vector fusion block. Furthermore, two such attention blocks can be stacked: the first extracts a general query vector from the input features, and the second acts as the merging block. (Diagram: input features passed through Block 1 and Block 2, with a tanh activation, producing the fused feature.)
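
A possible stacking of two such blocks, where the first produces a data-dependent query (passed through a tanh) and the second performs the merging; this is only one interpretation of the diagram, and block1/block2 are assumed to be GeneralQueryAttention-style modules as sketched above:

import torch

def stacked_fusion(block1, block2, features):            # features: [n, dim]
    q = torch.tanh(block1(features))                     # first block derives a query, [dim]
    k, v = block2.k(features), block2.v(features)
    w = torch.softmax(q.unsqueeze(0) @ k.T / q.shape[-1] ** 0.5, dim=-1)   # [1, n]
    return (w @ v).squeeze(0)                            # second block merges the features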

Iterative fusion Using the attention block alone may be biased, because a fixed query vector will not always predict the correct weights. If we assume the fused feature reaches a maximum response under an optimal weight set {w_k}, we can iteratively find the feature center:
Initialize: w_k = 1/K
for i = 1 : max_iter do
    result = sum_k(w_k * f_k)
    w_k = softmax(result · f_k)
end for
return {w_k}
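
A direct transcription of this loop, assuming the f_k are stacked row-wise into an [n, dim] matrix:

import torch

def iterative_fusion(f, max_iter=3):
    # f: [n, dim] feature vectors; returns the converged weights and the fused feature
    n = f.shape[0]
    w = torch.full((n,), 1.0 / n)                  # initialize w_k = 1/K (uniform)
    for _ in range(max_iter):
        result = (w.unsqueeze(-1) * f).sum(dim=0)  # sum_k w_k f_k
        w = torch.softmax(f @ result, dim=0)       # w_k = softmax(result · f_k)
    return w, (w.unsqueeze(-1) * f).sum(dim=0)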

Ensemble of the two fusion schemes We can further train an ensemble of the two fusion schemes using a variable α:
w1 = α·w_g + (1−α)·w_i
w2 = β·w_g + (1−β)·w_i
where α is the coefficient for the weight combination and β is a random number in the interval (0, 1). During training, w1 is used to train α and w2 to train the network; at test time, w1 is used.
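
A hedged sketch of this ensemble rule: w_g and w_i stand for the weights from the general-query block and the iterative scheme, α is a learnable scalar, and β is redrawn randomly at each training step; how gradients are routed between w1 and w2 (e.g. via detaching) is glossed over here.

import torch
import torch.nn as nn

class FusionEnsemble(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))      # trained coefficient alpha

    def forward(self, w_g, w_i, training=True):
        w1 = self.alpha * w_g + (1 - self.alpha) * w_i    # used to train alpha and at test time
        if training:
            beta = torch.rand(())                         # random beta in (0, 1)
            w2 = beta * w_g + (1 - beta) * w_i            # used to train the rest of the network
            return w1, w2
        return w1, w1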