Now we’ll see some of the power of multi-headed attention. We’ll consider a simple version of multi-headed attention which is identical to single-headed self-attention as we’ve presented it in this hom

i. To make c approximately equal to 0.5(va + vb), each query should place nearly all of its attention weight on a single key: q1 on ka and q2 on kb. Setting q1 = µa and q2 = µb points the queries in the right directions, but the softmax is only sharp when the gap between scores is large, so we scale them: q1 = βµa and q2 = βµb for a large scalar β > 0. Because the means are mutually orthogonal and each key ki lies close to its mean µi, we have q1⊤ka ≈ β‖µa‖² while q1⊤ki ≈ 0 for every i ≠ a. The softmax therefore puts almost all of its weight on ka, and c1 ≈ va. By the same argument, c2 ≈ vb. Averaging the two heads gives the desired output: c = 0.5(c1 + c2) ≈ 0.5(va + vb).
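The argument in part i can be checked numerically. The following is a minimal sketch, not part of the original solution: the dimensions, the noise level alpha, the scale beta, the choice of standard basis vectors as means, and the value matrix V are all illustrative assumptions. It compares queries pointing along µa and µb at a large scale against the same queries left unscaled.

```python
import numpy as np

def attention(q, K, V):
    """Single-headed dot-product attention for one query vector."""
    scores = K @ q
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
n, d = 4, 8          # assumed: 4 keys in 8 dimensions
alpha = 1e-4         # assumed: small isotropic key variance

mu = np.eye(d)[:n]                                   # orthonormal means, one per row
V = np.arange(n * d, dtype=float).reshape(n, d)      # arbitrary value vectors
K = mu + np.sqrt(alpha) * rng.normal(size=(n, d))    # k_i ~ N(mu_i, alpha * I)

a, b = 0, 1
target = 0.5 * (V[a] + V[b])

def two_head_output(beta):
    """Average of two heads with queries beta * mu_a and beta * mu_b."""
    c1 = attention(beta * mu[a], K, V)
    c2 = attention(beta * mu[b], K, V)
    return 0.5 * (c1 + c2)

err = np.max(np.abs(two_head_output(50.0) - target))          # scaled queries
err_unscaled = np.max(np.abs(two_head_output(1.0) - target))  # q = mu exactly

print("max error with beta = 50:", err)
print("max error with beta = 1: ", err_unscaled)
```

Under these assumptions the scaled queries reproduce 0.5(va + vb) almost exactly, while the unscaled queries leave noticeable weight on the other keys.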

ii. In this case, the covariance matrix of ka has an additional term proportional to µaµ⊤a, while the covariances of the keys for the other means µi are still proportional to the identity matrix. The extra term adds variance only along the direction of µa, so ka keeps roughly the same direction but its length changes from sample to sample: ka ≈ γµa for a random scalar γ centered at 1. It does not make different keys correlated with one another. For the first head, the score q1⊤ka is multiplied by γ. As long as γ is positive and q1 is scaled large enough for the softmax to be sharp, this score still dominates the scores of the other keys (≈ 0), so c1 ≈ va. For the second head, the perturbation of ka lies along µa, which is orthogonal to µb, so q2⊤ka stays near 0 and c2 ≈ vb is unaffected. Therefore c remains approximately equal to 0.5(va + vb) across samples of the key vectors, with little variation caused by the changing norm of ka.
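The behavior in part ii can be explored by resampling the keys many times. The following is a rough sketch under stated assumptions rather than a definitive experiment: the scale s of the µaµ⊤a term, the noise level alpha, the query scale beta, the standard-basis means, and the value matrix V are all hypothetical choices made for illustration. It records γ (the projection of ka onto µa) and the error of c for each sample.

```python
import numpy as np

def attention(q, K, V):
    """Single-headed dot-product attention for one query vector."""
    scores = K @ q
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
n, d = 4, 8
alpha, beta = 1e-4, 200.0   # assumed noise level and query scale
s = 0.5                     # assumed scale of the mu_a mu_a^T covariance term

mu = np.eye(d)[:n]                                # orthonormal means, one per row
V = np.arange(n * d, dtype=float).reshape(n, d)   # arbitrary value vectors
a, b = 0, 1
target = 0.5 * (V[a] + V[b])

gammas, errs = [], []
for _ in range(1000):
    # k_i ~ N(mu_i, alpha * I) for every key ...
    K = mu + np.sqrt(alpha) * rng.normal(size=(n, d))
    # ... plus extra variance s along mu_a for k_a only.
    K[a] += np.sqrt(s) * rng.normal() * mu[a]

    c1 = attention(beta * mu[a], K, V)
    c2 = attention(beta * mu[b], K, V)
    c = 0.5 * (c1 + c2)

    gammas.append(K[a] @ mu[a])                 # length of k_a along mu_a
    errs.append(np.max(np.abs(c - target)))

gammas, errs = np.array(gammas), np.array(errs)
ok = gammas > 0.2                               # samples with clearly positive gamma
worst_err = errs[ok].max()

print("std of gamma:", gammas.std())
print("fraction of samples with gamma > 0.2:", ok.mean())
print("worst error among those samples:", worst_err)
```

In this sketch the norm of ka varies widely, yet c stays essentially equal to 0.5(va + vb) whenever γ is clearly positive. Samples where γ falls near or below zero are excluded from the check, because there the first head no longer singles out ka.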
