Feature fusion and attention scheme


1 Feature fusion and attention scheme

2 Average/Max A very simple and naive approach is to take the average of the input vectors: Vout = avg(v1, v2, ..., vn). Alternatively, the max function can be applied to each vector dimension: Vout,k = max(v1,k, v2,k, ..., vn,k). This is usually a suboptimal fusion scheme, since giving every vector the same weight is rarely appropriate.
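A minimal NumPy sketch of both schemes (the shapes and variable names are illustrative assumptions):

    import numpy as np

    def avg_fusion(vectors):
        # vectors: [n, dim] stack of n input vectors; every vector gets weight 1/n
        return np.mean(vectors, axis=0)

    def max_fusion(vectors):
        # element-wise maximum over the n input vectors, per dimension
        return np.max(vectors, axis=0)

    # Example: fuse three 4-dimensional feature vectors
    v = np.random.randn(3, 4)
    print(avg_fusion(v), max_fusion(v))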

3 Feature concatenation Instead of the simple averaging/max scheme, it is possible to apply a fusion layer to the concatenated features: Y = MLP(concat(x1, x2, ..., xn)). Since the MLP expects a fixed input dimension, this fusion scheme only accepts a fixed number of inputs. (Diagram: concatenated inputs → MLP → result vector.)
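A minimal PyTorch sketch of concatenation fusion (the layer sizes and the two-layer MLP are assumptions for illustration):

    import torch
    import torch.nn as nn

    class ConcatFusion(nn.Module):
        # Assumes a fixed number n of input vectors, each of dimension dim.
        def __init__(self, n, dim, out_dim):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(n * dim, out_dim),
                nn.ReLU(),
                nn.Linear(out_dim, out_dim),
            )

        def forward(self, xs):  # xs: list of n tensors, each [batch, dim]
            return self.mlp(torch.cat(xs, dim=-1))

    # Usage: fuse three 128-d vectors per sample
    y = ConcatFusion(n=3, dim=128, out_dim=128)([torch.randn(2, 128) for _ in range(3)])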

4 LSTM fusion Since an RNN (or RNN-like) module can capture sequential information across its inputs, an RNN-like structure can also be used for feature merging. (Diagram: simple LSTM fusion.) The 'Average' block can be replaced by another functional part such as a softmax or Norm-Average block.
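A minimal sketch of LSTM fusion, assuming the n input vectors are fed as a length-n sequence and the per-step outputs are averaged (the dimensions are illustrative):

    import torch
    import torch.nn as nn

    class LSTMFusion(nn.Module):
        # The final 'Average' pooling could be swapped for softmax-weighted
        # pooling or a Norm-Average block, as the slide notes.
        def __init__(self, dim, hidden):
            super().__init__()
            self.lstm = nn.LSTM(dim, hidden, batch_first=True)

        def forward(self, x):       # x: [batch, n, dim]
            out, _ = self.lstm(x)   # out: [batch, n, hidden]
            return out.mean(dim=1)  # fused feature: [batch, hidden]

    fused = LSTMFusion(dim=128, hidden=128)(torch.randn(2, 5, 128))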

5 Scaled dot product attention
Dot-product attention is more straightforward: the weight for each vector is obtained from the inner product of the vector with itself. Note that xᵀx equals the squared Euclidean length of x, so SDPA here is a weighting policy based on the length of the embedded vector (the 'scaled' variant divides the scores by √d, where d is the vector dimension). This is a simple approach to feature merging.
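A minimal NumPy sketch of this length-based weighting, assuming the scores xᵀx are scaled by √d and normalised with a softmax (names and shapes are illustrative):

    import numpy as np

    def sdpa_fusion(X):
        # X: [n, dim] stack of n feature vectors.
        # Score each vector by its own inner product (squared length),
        # scale by sqrt(dim), normalise with softmax, then weight-sum.
        scores = np.sum(X * X, axis=1) / np.sqrt(X.shape[1])  # [n]
        w = np.exp(scores - scores.max())
        w = w / w.sum()                                        # softmax weights
        return w @ X                                           # fused vector [dim]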

6 Attention mechanism Attention receives 3 vector inputs:
Q (query), K (key), and V (value). Typically, the attention model produces an attention map A (the weights) from Q and K, and A is then applied to V to obtain the final result. The attention map A is usually normalized with a softmax function, although there are other options such as an L1 norm. An attention mask can optionally be applied before normalization, e.g. top-k selection or another mask-generating function/network. (Diagram: QKV attention block.)
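A minimal sketch of such a QKV attention block with an optional mask (the scaling by √d and the shapes are assumptions):

    import torch
    import torch.nn.functional as F

    def attention(Q, K, V, mask=None):
        # Q: [n_q, d], K: [n_k, d], V: [n_k, d_v]
        scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5     # raw logits
        if mask is not None:                                      # optional mask
            scores = scores.masked_fill(mask == 0, float('-inf'))
        A = F.softmax(scores, dim=-1)                             # attention map
        return A @ V                                              # weighted values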

7 Self-attention (fully connected)
Q, K, and V are produced by separate MLPs: Q = MLP1(x), K = MLP2(x), V = MLP3(x), Y = Attention(Q, K, V). Furthermore, a shortcut between Y and x can be added, similar to the shortcut in ResNet blocks. (Diagram: optional residual shortcut from x to the output.)
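A minimal sketch of fully connected self-attention, with single linear layers standing in for the MLPs and an optional residual shortcut (dimensions are illustrative):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAttentionFC(nn.Module):
        def __init__(self, dim, d_att, residual=True):
            super().__init__()
            self.q = nn.Linear(dim, d_att)   # stands in for MLP1
            self.k = nn.Linear(dim, d_att)   # stands in for MLP2
            self.v = nn.Linear(dim, dim)     # stands in for MLP3
            self.residual = residual

        def forward(self, x):                # x: [batch, n, dim]
            Q, K, V = self.q(x), self.k(x), self.v(x)
            A = F.softmax(Q @ K.transpose(-2, -1) / Q.shape[-1] ** 0.5, dim=-1)
            y = A @ V
            return x + y if self.residual else y   # optional ResNet-style shortcut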

8 Self-attention (convolutional)
Q = conv1(x), K = conv2(x), V = conv3(x). Q has shape [h*w, ch_att], K has shape [h*w, ch_att], and V has shape [h*w, ch_x]. Weight = softmax(QKᵀ, axis=-1), so the weight matrix has shape [h*w, h*w]. (Diagram: optional residual shortcut from x to the output.)
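A minimal sketch of the convolutional variant, assuming 1x1 convolutions for the three projections and batched tensors (the channel sizes are illustrative):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAttentionConv(nn.Module):
        def __init__(self, ch_x, ch_att, residual=True):
            super().__init__()
            self.q = nn.Conv2d(ch_x, ch_att, 1)   # conv1
            self.k = nn.Conv2d(ch_x, ch_att, 1)   # conv2
            self.v = nn.Conv2d(ch_x, ch_x, 1)     # conv3
            self.residual = residual

        def forward(self, x):                          # x: [b, ch_x, h, w]
            b, c, h, w = x.shape
            Q = self.q(x).flatten(2).transpose(1, 2)   # [b, h*w, ch_att]
            K = self.k(x).flatten(2)                   # [b, ch_att, h*w]
            V = self.v(x).flatten(2).transpose(1, 2)   # [b, h*w, ch_x]
            A = F.softmax(Q @ K, dim=-1)               # weight map [b, h*w, h*w]
            y = (A @ V).transpose(1, 2).reshape(b, c, h, w)
            return x + y if self.residual else y       # optional shortcut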

9 Self-attention: discussion
Self-attention can be used as a network component, since it takes a layer's output and produces a new feature. This module is useful in single-image architectures, since Q, K, and V are all derived from the image itself. Moreover, convolutional self-attention uses global information to derive the attention map for convolutional layers, which by themselves only capture local information.

10 General-query attention
The vectors extracted from different images have different reliability, depending on factors such as image quality and feature variance. Therefore, a general query vector Q can be trained independently, while K and V remain functions of X. The query vector effectively tests the reliability of each input feature vector x. X has shape [dim, #features]; Q has shape [1, dim]. (Diagram: general-query attention block.)
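A minimal sketch of general-query attention, with the query as a trained parameter and K, V derived from the input features (the linear projections, scaling, and softmax normalisation are assumptions):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GeneralQueryAttention(nn.Module):
        def __init__(self, dim, d_att):
            super().__init__()
            self.query = nn.Parameter(torch.randn(1, d_att))  # trained query, independent of X
            self.k = nn.Linear(dim, d_att)
            self.v = nn.Linear(dim, dim)

        def forward(self, X):                 # X: [n_features, dim]
            K, V = self.k(X), self.v(X)       # [n, d_att], [n, dim]
            w = F.softmax(self.query @ K.t() / K.shape[-1] ** 0.5, dim=-1)  # [1, n]
            return (w @ V).squeeze(0)         # fused vector [dim]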

11 General-query attention
General-query attention can serve as a vector fusion block. Furthermore, we can stack two such attention blocks: the first extracts an adapted query vector from the input features, and the second acts as the merging block, as sketched below. (Diagram: Feature → Block 1 → tanh → Block 2 → fused feature.)
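A minimal sketch of the two-stage stacking, assuming Block 1 pools the features with a trained initial query and a tanh non-linearity produces the adapted query for Block 2 (the projection layer and all shapes are assumptions):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoStageFusion(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.q0 = nn.Parameter(torch.randn(1, dim))   # trained query for Block 1
            self.proj = nn.Linear(dim, dim)               # maps the pooled feature to a new query

        def forward(self, X):                              # X: [n, dim]
            w1 = F.softmax(self.q0 @ X.t(), dim=-1)        # Block 1 weights [1, n]
            q1 = torch.tanh(self.proj(w1 @ X))             # adapted query [1, dim]
            w2 = F.softmax(q1 @ X.t(), dim=-1)             # Block 2 weights [1, n]
            return (w2 @ X).squeeze(0)                     # fused feature [dim]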

12 Iterative fusion Using an attention block alone may be biased, because a fixed query vector will not always predict the correct weights. If we assume the fused feature reaches a maximum response under an optimal weight set {wk}, we can iteratively find the feature center:

Initialize: wk = 1/n (n = number of features)
for i = 1 : maxiter do
    result = sum_k(wk * fk)
    wk = softmax(result · fk)
end for
return {wk}
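A runnable version of this loop (a NumPy sketch; the softmax normalisation of the dot-product responses follows the pseudocode above):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def iterative_fusion(feats, max_iter=10):
        # feats: [n, dim] stack of n feature vectors f_k.
        # Start from uniform weights and repeatedly re-weight each f_k by its
        # agreement (dot product) with the current fused feature.
        n = feats.shape[0]
        w = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            fused = w @ feats             # current weighted sum, shape [dim]
            w = softmax(feats @ fused)    # new weights from responses f_k . fused
        return w, w @ feats               # final weights and fused feature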

13 Ensemble two fusion schemes
We can further train an ensemble of the two fusion schemes using a variable α: w1 = α·wg + (1-α)·wi and w2 = β·wg + (1-β)·wi, where wg and wi are the weights from the two schemes, α is the coefficient for the weight combination, and β is a random number in the interval (0, 1). In the training stage, α is trained using w1 and the rest of the network is trained using w2; at test time, w1 is used.
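A minimal sketch of the weight combination (the function name and treating α as a plain float are assumptions; in practice α would be a trained parameter):

    import numpy as np

    def combine_weights(w_g, w_i, alpha, training=False, rng=np.random):
        # w_g, w_i: fusion weights from the two schemes.
        # Training uses a random beta in (0, 1) for the network path (w2);
        # testing uses the learned coefficient alpha (w1).
        if training:
            beta = rng.uniform(0.0, 1.0)
            return beta * w_g + (1.0 - beta) * w_i    # w2
        return alpha * w_g + (1.0 - alpha) * w_i      # w1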

