This blog post walks through how to build a Transformer from scratch.
Why do we need Transformers?
How far back a neural network can look in time depends on the depth of the network and on architectural choices such as the kernel size.
CNNs have a notable disadvantage for time-series prediction:
- Increasing the kernel size also increases the number of parameters.
- Keeping the kernel size small skips over some past states/inputs.
$\rightarrow$ a trade-off between how far back the model can see and the parameter count.
Transformers
Attention

Attention in deep networks generally refers to any mechanism
where individual states are weighted and then combined.
In an RNN or LSTM, the state used for prediction depends much more on the later elements of the sequence than on the earlier ones, which is a problem for these models.
With attention, we combine several states rather than relying on a single one, for example over the whole time sequence:
$$
\overline{h} = \sum_{t=1}^{T} w_t h_t
$$
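Below is a minimal NumPy sketch of this weighted combination, assuming the states are stacked in a $T\times d$ matrix and the weights come from a softmax over per-step scores (the scoring itself, and the `attention_pool` helper name, are illustrative assumptions).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(H, scores):
    """Weighted combination of states: h_bar = sum_t w_t * h_t.

    H:      (T, d) matrix of hidden states over the whole sequence.
    scores: (T,) unnormalized relevance scores (how to compute them is
            left open here; self-attention is one concrete choice).
    """
    w = softmax(scores)  # (T,) weights that sum to 1
    return w @ H         # (d,) combined state

# Toy usage: 5 time steps, 4-dimensional hidden states.
H = np.random.randn(5, 4)
scores = np.random.randn(5)
print(attention_pool(H, scores).shape)  # (4,)
```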
Self-Attention Mechanism
$K,Q,V\in R^{T\times d}$ (keys, queries, values):
To get $K$, we compute $K = XW_K$, so $k_1$ depends only on $x_1$, $k_2$ only on $x_2$, and so on; $Q$ and $V$ are obtained the same way from $W_Q$ and $W_V$.
$$
SelfAttention(K,Q,V) = softmax(\frac{KQ^T}{\sqrt{d}}) V
$$
$KQ^T$ is in $R^{T\times T}$. In this matrix:
$$
entry_{(i,j)} = k_i^Tq_j
$$
Entry $(i,j)$ measures the similarity of $k_i$ and $q_j$. The softmax is applied row by row, so each row turns into weights that sum to 1.
Since $V \in R^{T\times d}$, the output is also in $R^{T\times d}$.
Each row of the output is a weighted combination of the rows of $V$.
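Here is a minimal NumPy sketch of the formula above, following the $KQ^T$ convention used in this post. The random $X$ and the projection matrices $W_K, W_Q, W_V$ in the usage example are placeholders for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(K, Q, V):
    """SelfAttention(K, Q, V) = softmax(K Q^T / sqrt(d)) V.

    K, Q, V: (T, d) matrices. Returns a (T, d) matrix whose rows are
    weighted combinations of the rows of V.
    """
    d = K.shape[-1]
    logits = K @ Q.T / np.sqrt(d)       # (T, T), entry (i, j) = k_i^T q_j / sqrt(d)
    weights = softmax(logits, axis=-1)  # each row sums to 1
    return weights @ V                  # (T, d)

# Toy usage: project X with illustrative weight matrices W_K, W_Q, W_V.
T, d = 6, 8
X = np.random.randn(T, d)
W_K, W_Q, W_V = (np.random.randn(d, d) for _ in range(3))
print(self_attention(X @ W_K, X @ W_Q, X @ W_V).shape)  # (6, 8)
```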
Properties:
- Permutation equivariance: applying the same permutation to $K, Q, V$ leads to the same permutation of the output.
- Allows interaction between $k_t, q_t, v_t$ across all time steps, without increasing the parameter count.
- Compute cost is $O(T^2 d + Td)$.
Transformer architecture

Transformer Block
$$
z_1 = SelfAttention(z^{i}W_K, z^{i}W_Q, z^{i}W_V) = softmax\left(\frac{z^{i}W_K W_Q^T (z^{i})^T}{\sqrt{d}}\right) z^{i}W_V
$$
$$
z_2 = LayerNorm(z^{i} + z_1)
$$
$$
z^{i+1} = LayerNorm(z_2 + ReLU(z_2 W_1) W_2)
$$
where $ReLU(z_2 W_1) W_2$ is a two-layer MLP applied to $z_2$.
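Below is a minimal NumPy sketch of one block following these equations, assuming a single head, no dropout, and a LayerNorm without learned scale/shift; the weight shapes and the MLP width `d_ff` are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(z, eps=1e-5):
    # Normalize each row to zero mean, unit variance (learned scale/shift omitted).
    return (z - z.mean(axis=-1, keepdims=True)) / np.sqrt(z.var(axis=-1, keepdims=True) + eps)

def relu(x):
    return np.maximum(x, 0)

def transformer_block(z, W_K, W_Q, W_V, W_1, W_2):
    """z_1 = SelfAttention(z W_K, z W_Q, z W_V); z_2 = LayerNorm(z + z_1);
    output = LayerNorm(z_2 + ReLU(z_2 W_1) W_2)."""
    K, Q, V = z @ W_K, z @ W_Q, z @ W_V
    d = K.shape[-1]
    z1 = softmax(K @ Q.T / np.sqrt(d), axis=-1) @ V
    z2 = layer_norm(z + z1)
    return layer_norm(z2 + relu(z2 @ W_1) @ W_2)

# Toy usage; the hidden size d and the MLP width d_ff are arbitrary choices here.
T, d, d_ff = 6, 8, 32
z = np.random.randn(T, d)
W_K, W_Q, W_V = (np.random.randn(d, d) for _ in range(3))
W_1, W_2 = np.random.randn(d, d_ff), np.random.randn(d_ff, d)
print(transformer_block(z, W_K, W_Q, W_V, W_1, W_2).shape)  # (6, 8)
```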
Masked Self-Attention
To make self-attention causal, so that each time step can only depend on the current and previous inputs, after the softmax the upper triangle of the $T\times T$ weight matrix must be all zeros (equivalently, those logits are set to $-\infty$ before the softmax).
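A minimal sketch of one way to implement this (an assumption, since the post only states the end result): add $-\infty$ to the upper-triangular logits before the softmax, which makes those weights exactly zero afterwards.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(K, Q, V):
    """Causal self-attention: output row i only attends to positions j <= i."""
    T, d = K.shape
    logits = K @ Q.T / np.sqrt(d)                  # (T, T)
    mask = np.triu(np.ones((T, T)), k=1)           # 1s strictly above the diagonal
    logits = np.where(mask == 1, -np.inf, logits)  # -inf -> weight 0 after softmax
    weights = softmax(logits, axis=-1)             # upper triangle is all zeros
    return weights @ V

# Toy usage: each output row only depends on the current and earlier rows of V.
T, d = 4, 3
K, Q, V = (np.random.randn(T, d) for _ in range(3))
print(masked_self_attention(K, Q, V).shape)  # (4, 3)
```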


Encode the position into the transformer

Self-attention on its own is order-agnostic, so we add a positional encoding to the input so that the network knows where each $x_t$ sits in the sequence.
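The post does not specify which encoding to use; as one common choice, here is a sketch of the sinusoidal encoding from the original Transformer paper, simply added to the $T\times d$ input.

```python
import numpy as np

def sinusoidal_positional_encoding(T, d):
    """Return a (T, d) matrix P where P[t] encodes position t with sines and
    cosines of geometrically spaced frequencies (assumes d is even)."""
    positions = np.arange(T)[:, None]                  # (T, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))  # (d/2,)
    angles = positions * freqs                         # (T, d/2)
    P = np.zeros((T, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

# Toy usage: add the encoding to the (T, d) input before the first block,
# so identical x_t at different positions are no longer identical.
T, d = 6, 8
X = np.random.randn(T, d)
X_with_pos = X + sinusoidal_positional_encoding(T, d)
print(X_with_pos.shape)  # (6, 8)
```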