Self-Attention Mechanism: Calculation Process Explained
The calculation process of self-attention is as follows:
- Input: We start with an input sequence of vectors, usually representing words or tokens in a sentence. Let's say we have N tokens, each represented by a d-dimensional vector.
- Queries, Keys, and Values: Each token is transformed into three vectors: a Query, a Key, and a Value. These are linear projections of the input vector, computed with learnable weight matrices (often written W_Q, W_K, and W_V). We therefore end up with N Query vectors, N Key vectors, and N Value vectors.
- Similarity Scores: Next, we compute a similarity score for each pair of Query and Key vectors by taking their dot product; the dot product measures how closely a Query matches a Key.
- Scaling: To keep these scores from growing too large, we divide them by the square root of the Key dimension. This keeps the softmax in a well-behaved range and helps stabilize gradients during training.
- Attention Weights: We then apply the softmax function to the scaled scores, converting them into attention weights that sum to 1 for each Query vector. These weights represent the importance, or relevance, of each Key (and its associated Value) to a given Query.
- Weighted Sum: Using the attention weights, we take a weighted sum of the Value vectors. This weighted sum is the context, the information each Query vector attends to.
- Output: Finally, the attended vectors, one per Query, are stacked into a new sequence of N vectors, which can be processed by further layers or used directly for downstream tasks. A minimal code sketch of the whole pipeline follows this list.
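In matrix form, the steps above amount to the standard scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. Below is a minimal NumPy sketch of a single attention head; the function name, the shapes, and the toy inputs at the end are illustrative assumptions rather than part of the original text.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for a single head (illustrative sketch).

    X:             (N, d)    input token vectors
    W_q, W_k, W_v: (d, d_k)  learnable projection matrices
    Returns:       (N, d_k)  one attended output vector per token.
    """
    # 1. Project inputs into Queries, Keys, and Values
    Q = X @ W_q            # (N, d_k)
    K = X @ W_k            # (N, d_k)
    V = X @ W_v            # (N, d_k)

    # 2. Similarity scores: dot product of every Query with every Key
    scores = Q @ K.T       # (N, N)

    # 3. Scale by sqrt(d_k) to keep the scores in a stable range
    d_k = K.shape[-1]
    scores = scores / np.sqrt(d_k)

    # 4. Softmax over each row -> attention weights that sum to 1 per Query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # 5. Weighted sum of the Value vectors for every Query
    return weights @ V     # (N, d_k)


# Toy usage: 4 tokens of dimension 8, projected to d_k = 8
rng = np.random.default_rng(0)
N, d, d_k = 4, 8, 8
X = rng.normal(size=(N, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```

In a real Transformer these projections are trained end to end, multiple heads run in parallel and their outputs are concatenated, but the per-head computation is exactly the sequence of steps listed above.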
This process of self-attention allows the model to capture dependencies and relationships between different tokens in the input sequence, enabling it to learn contextual information effectively.