
build_self_transformer

2023/11/3 Transformer

This blog post explains how to build a transformer from scratch.

Why do we need transformers?

How far back a neural network can look depends on its depth and on its architectural choices.

CNNs have a notable disadvantage for time-series prediction: to look further back, you have to either

  • increase the kernel size (which also increases the number of parameters), or
  • skip over some past states/inputs (e.g., with dilated convolutions).

$\rightarrow$ a trade-off either way

Transformers

Attention


Attention in deep networks generally refers to any mechanism where individual states are weighted and then combined.

The final state depends more on the later elements than on the earlier ones (a problem for RNNs and LSTMs).

With attention, we combine several hidden states rather than using only the last one:
$$
\overline{h} = \sum_{t=1}^T w_t h_t^{(k)}
$$
for example, as a weighted sum over the whole time sequence.
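As a minimal sketch (not code from this post), here is what such a weighted combination of hidden states might look like in NumPy; the per-step scores that produce the weights $w_t$ are a made-up placeholder, since the text does not say how they are computed:

```python
import numpy as np

def attention_pool(h, scores):
    """Weighted combination h_bar = sum_t w_t * h_t, with w = softmax(scores)."""
    w = np.exp(scores - scores.max())
    w = w / w.sum()                      # weights w_t sum to 1
    return (w[:, None] * h).sum(axis=0)  # shape (d,)

T, d = 5, 4
rng = np.random.default_rng(0)
h = rng.standard_normal((T, d))    # hidden states h_1 .. h_T
scores = rng.standard_normal(T)    # hypothetical per-step scores
h_bar = attention_pool(h, scores)  # one summary vector of shape (d,)
```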

Self-Attention Mechanism

$K,Q,V\in R^{T\times d}$ (keys, queries, values):

To get $K$, we compute $K = XW_K$, so $k_1$ depends only on $x_1$, and so on (similarly for $Q$ and $V$).
$$
\text{SelfAttention}(K,Q,V) = \text{softmax}\left(\frac{KQ^T}{\sqrt{d}}\right) V
$$
$KQ^T$ is in $R^{T\times T}$; in this matrix,
$$
\text{entry}_{(i,j)} = k_i^Tq_j,
$$
the similarity of $k_i$ and $q_j$.

The softmax is applied row by row, so each row becomes a set of weights that sum to 1.

$V\in R^{T\times d}$, so the output is also in $R^{T\times d}$.

Each row of the output is a weighted combination of the rows of $V$.
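A minimal single-head sketch of this formula in NumPy (assuming square projection matrices $W_K, W_Q, W_V \in R^{d\times d}$, which the post does not specify):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_K, W_Q, W_V):
    """Single-head self-attention following the formula above.
    X: (T, d) inputs; W_K, W_Q, W_V: (d, d) projection matrices."""
    K, Q, V = X @ W_K, X @ W_Q, X @ W_V
    d = K.shape[-1]
    A = softmax(K @ Q.T / np.sqrt(d), axis=-1)  # (T, T), each row sums to 1
    return A @ V                                # (T, d)

T, d = 6, 8
rng = np.random.default_rng(0)
X = rng.standard_normal((T, d))
W_K, W_Q, W_V = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, W_K, W_Q, W_V)  # (T, d)
```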

Properties:

  • Permutation equivariant: applying the same permutation to the rows of $K,Q,V$ applies that permutation to the rows of the output (see the check after this list).
  • Allows interaction between $k_t, q_t, v_t$ across all time steps, without increasing the parameter count.
  • Compute cost is $O(T^2 d + Td)$.
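A quick, self-contained check of the permutation property, under the same assumptions as the sketch above:

```python
import numpy as np

def softmax_rows(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d = 6, 8
X = rng.standard_normal((T, d))
W_K, W_Q, W_V = (rng.standard_normal((d, d)) for _ in range(3))

def attn(X):
    # same formula as above: softmax(K Q^T / sqrt(d)) V, row-wise softmax
    K, Q, V = X @ W_K, X @ W_Q, X @ W_V
    return softmax_rows(K @ Q.T / np.sqrt(d)) @ V

perm = rng.permutation(T)
# permuting the input rows permutes the output rows identically
assert np.allclose(attn(X[perm]), attn(X)[perm])
```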

Transformer architecture


Transformer Block

$z_1 = \text{SelfAttention}(z^{i}w_K,\, z^{i}w_Q,\, z^{i}w_V) = \text{softmax}\left(\frac{z^{i}w_K w_Q^T z^{i\,T}}{\sqrt{d}}\right) z^{i}w_V$

$z_2 = \text{LayerNorm}(z^{i}+z_1)$

$z^{i+1} = \text{LayerNorm}(z_2+\text{ReLU}(z_2w_1)w_2)$

$\text{ReLU}(z_2w_1)w_2$ is the two-layer MLP (feed-forward network) inside the block.
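Putting the three equations together, here is a minimal NumPy sketch of one block (single-head attention, no dropout, a simplified LayerNorm without the learned scale and shift, and hypothetical weight shapes, since the post does not give them):

```python
import numpy as np

def softmax_rows(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(z, eps=1e-5):
    # simplified LayerNorm: normalize each row, no learned scale/shift
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def relu(z):
    return np.maximum(z, 0)

def transformer_block(z, W_K, W_Q, W_V, W_1, W_2):
    """z: (T, d). Self-attention + residual + LayerNorm, then a 2-layer MLP."""
    d = z.shape[-1]
    K, Q, V = z @ W_K, z @ W_Q, z @ W_V
    z1 = softmax_rows(K @ Q.T / np.sqrt(d)) @ V    # self-attention
    z2 = layer_norm(z + z1)                        # residual + LayerNorm
    return layer_norm(z2 + relu(z2 @ W_1) @ W_2)   # MLP + residual + LayerNorm

T, d, d_hidden = 6, 8, 32
rng = np.random.default_rng(0)
z = rng.standard_normal((T, d))
W_K, W_Q, W_V = (rng.standard_normal((d, d)) for _ in range(3))
W_1 = rng.standard_normal((d, d_hidden))
W_2 = rng.standard_normal((d_hidden, d))
z_next = transformer_block(z, W_K, W_Q, W_V, W_1, W_2)  # (T, d)
```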

Masked Self-Attention

For time-series prediction we need two more ingredients: masked self-attention, so that a step cannot look into the future, and a way to encode each position into the transformer.


Mask the attention weights so that, after the softmax, the entries in the upper triangle are all zero: each step can then only attend to the current and earlier steps.
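A minimal sketch of one common way to implement this (not necessarily exactly what the post intends): add $-\infty$ to the strictly upper-triangular scores before the softmax, which makes those entries exactly zero after the softmax:

```python
import numpy as np

def softmax_rows(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(K, Q, V):
    T, d = K.shape
    scores = K @ Q.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # strictly upper triangle
    scores = np.where(mask, -np.inf, scores)          # -inf before the softmax
    A = softmax_rows(scores)          # upper triangle of A is now exactly 0
    return A @ V

T, d = 5, 4
rng = np.random.default_rng(0)
K, Q, V = (rng.standard_normal((T, d)) for _ in range(3))
out = masked_self_attention(K, Q, V)
```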


Add a positional encoding so that the network knows where each $x_t$ sits in the sequence.
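The post does not say which positional encoding it uses; one common choice is the fixed sinusoidal encoding from the original transformer paper, added directly to the inputs. A minimal sketch, assuming an even embedding dimension $d$:

```python
import numpy as np

def sinusoidal_positional_encoding(T, d):
    """P[t, 2i] = sin(t / 10000^(2i/d)), P[t, 2i+1] = cos(t / 10000^(2i/d))."""
    pos = np.arange(T)[:, None]              # (T, 1)
    i = np.arange(0, d, 2)[None, :]          # (1, d/2)
    angles = pos / np.power(10000.0, i / d)  # (T, d/2)
    P = np.zeros((T, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

T, d = 6, 8
rng = np.random.default_rng(0)
X = rng.standard_normal((T, d))
X_pos = X + sinusoidal_positional_encoding(T, d)  # inputs now carry position info
```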