This blog post explains how to build a transformer from scratch.
Why do we need transformers?
How far back a neural network can look depends on the depth of the network and on design choices such as the kernel size.
CNNs have a notable disadvantage for time series prediction: to look further back, you either
- increase the kernel size (which also increases the parameter count), or
- skip over some past states/inputs (e.g. with dilated convolutions).
$\rightarrow$ a trade-off between how far back the model can look and its cost.
Transformers
Attention
Attention in deep networks generally refers to any mechanism where individual states are weighted and then combined.
In an RNN or LSTM, the final state depends more on the later elements than on the earlier ones, which is a problem when older context matters.
For attention, rather than using a single state, we combine several of the states:
$$
\overline{h} = \sum_{t=1}^{T} w_t h_t^{(k)} \quad \text{(for example, over the whole time sequence),}
$$
where $w_t$ are the attention weights.
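As a minimal NumPy sketch (the shapes, states, and weights here are just illustrative placeholders), the weighted combination of hidden states looks like:

```python
import numpy as np

T, d = 5, 4                        # sequence length and hidden dimension (illustrative)
h = np.random.randn(T, d)          # hidden states h_1, ..., h_T from some layer k
w = np.random.rand(T)
w = w / w.sum()                    # attention weights, normalized to sum to 1

h_bar = (w[:, None] * h).sum(axis=0)   # weighted combination over all time steps
print(h_bar.shape)                 # (4,)
```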
Self-Attention Mechanism
$K,Q,V\in R^{T\times d}$ (keys, queries, values):
To get $K$, we compute $K = XW_K$, so $k_1$ depends only on $x_1$, and so on; $Q = XW_Q$ and $V = XW_V$ are computed in the same way.
$$
SelfAttention(K,Q,V) = softmax(\frac{KQ^T}{\sqrt{d}}) V
$$
$KQ^T$ is in $R^{T\times T}$; in this matrix,
$$
\text{entry}_{(i,j)} = k_i^T q_j,
$$
the similarity between $k_i$ and $q_j$. The softmax is taken row by row, so each row turns into a set of weights that sum to 1.
$V\in R^{T\times d}$, so the output is also in $R^{T\times d}$. Each row of the output is a weighted combination of the rows of $V$.
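A minimal NumPy sketch of this operation (the input $X$ and the projection matrices are random placeholders):

```python
import numpy as np

def softmax(A, axis=-1):
    A = A - A.max(axis=axis, keepdims=True)       # subtract the max for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=axis, keepdims=True)

def self_attention(K, Q, V):
    d = K.shape[-1]
    W = softmax(K @ Q.T / np.sqrt(d), axis=-1)    # T x T weights, each row sums to 1
    return W @ V                                  # each output row is a weighted sum of rows of V

T, d = 6, 8
X = np.random.randn(T, d)
W_K, W_Q, W_V = [np.random.randn(d, d) for _ in range(3)]
out = self_attention(X @ W_K, X @ W_Q, X @ W_V)
print(out.shape)                                  # (6, 8)
```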
Properties:
- Permutation equivariant: applying the same permutation to $K, Q, V$ leads to the same permutation of the output.
- Allows interaction between $k_t, q_t, v_t$ across all time steps, without increasing the parameter count.
- Compute cost is $O(T^2 d + Td)$, i.e. quadratic in the sequence length $T$.
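A quick numerical check of the equivariance property, reusing `self_attention`, `X`, and the projection matrices from the sketch above (the permutation is just a random example):

```python
K, Q, V = X @ W_K, X @ W_Q, X @ W_V
perm = np.random.permutation(T)                               # a random reordering of the time steps
out_perm = self_attention(K[perm], Q[perm], V[perm])
print(np.allclose(self_attention(K, Q, V)[perm], out_perm))   # True: permuting the inputs permutes the output
```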
Transformer architecture
Transformer Block
$$
z_1 = SelfAttention(z^{(i)}W_K,\, z^{(i)}W_Q,\, z^{(i)}W_V) = softmax\left(\frac{z^{(i)}W_K W_Q^T {z^{(i)}}^T}{\sqrt{d}}\right) z^{(i)}W_V
$$
$$
z_2 = LayerNorm(z^{(i)} + z_1)
$$
$$
z^{(i+1)} = LayerNorm(z_2 + ReLU(z_2 W_1) W_2)
$$
where $ReLU(z_2 W_1) W_2$ is a two-layer feed-forward network (MLP).
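A sketch of one block in NumPy, reusing `self_attention` from above; the layer norm here is simplified (no learnable scale or bias), and all weight matrices are random placeholders:

```python
def layer_norm(Z, eps=1e-5):
    # normalize each time step (row) over the feature dimension
    mean = Z.mean(axis=-1, keepdims=True)
    var = Z.var(axis=-1, keepdims=True)
    return (Z - mean) / np.sqrt(var + eps)

def relu(Z):
    return np.maximum(Z, 0)

def transformer_block(Z, W_K, W_Q, W_V, W_1, W_2):
    Z1 = self_attention(Z @ W_K, Z @ W_Q, Z @ W_V)    # self-attention on the block input
    Z2 = layer_norm(Z + Z1)                           # residual connection + layer norm
    return layer_norm(Z2 + relu(Z2 @ W_1) @ W_2)      # two-layer MLP + residual + layer norm

d_hidden = 32
W_1, W_2 = np.random.randn(d, d_hidden), np.random.randn(d_hidden, d)
Z_next = transformer_block(X, W_K, W_Q, W_V, W_1, W_2)
print(Z_next.shape)                                   # (6, 8)
```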
Masked Self-Attention
Make the upper triangle of the attention matrix all zeros after the softmax (in practice, by setting those entries to $-\infty$ before the softmax), so that position $t$ cannot attend to future positions.
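A sketch of the masked version, reusing `softmax` from above; entries strictly above the diagonal are blocked before the softmax so they come out exactly zero after it:

```python
def masked_self_attention(K, Q, V):
    T, d = K.shape
    scores = K @ Q.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T)), k=1).astype(bool)   # True strictly above the diagonal
    scores[mask] = -np.inf                               # blocked positions
    W = softmax(scores, axis=-1)                         # upper triangle is now exactly 0
    return W @ V
```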
Encode the position into the transformer
Add a positional encoding so that the network knows where each $x_t$ sits in the sequence (self-attention by itself is permutation equivariant).
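The notes don't specify which encoding to use; a common choice (and the one assumed in this sketch) is the sinusoidal positional encoding from the original Transformer paper, added to the input before the first block:

```python
def positional_encoding(T, d):
    # sinusoidal encoding (assumes d is even): even columns use sin, odd columns use cos
    pos = np.arange(T)[:, None]                        # (T, 1) positions
    i = np.arange(d // 2)[None, :]                     # (1, d/2) frequency indices
    angles = pos / (10000 ** (2 * i / d))
    P = np.zeros((T, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

X_pos = X + positional_encoding(*X.shape)              # inject position information into the input
```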