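Now we’ll see some of the power of multi-headed attention. We’ll consider a simple version of multi-headed attention which is identical to single-headed self-attention as we’ve presented it in this homework, except that two query vectors, $q_1$ and $q_2$, are used; each yields its own single-headed output, $c_1$ and $c_2$, and the final output is their average, $c = \tfrac{1}{2}(c_1 + c_2)$.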

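i. Take the covariance matrices to be $\Sigma_i = \alpha I$ with small $\alpha$, so that each key is close to its mean, $k_i \approx \mu_i$, and assume the means are mutually orthogonal with unit norm. We can design $q_1$ and $q_2$ as follows:

\[ q_1 = \beta \mu_a, \qquad q_2 = \beta \mu_b, \]

where $\beta$ is a large positive constant. Then

\[ k_i^\top q_1 \approx \beta\, \mu_i^\top \mu_a = \begin{cases} \beta & i = a \\ 0 & i \neq a, \end{cases} \]

so the softmax weights of the first head, $\alpha_i^{(1)}$, concentrate almost entirely on $k_a$. Using the formula for single-headed attention,

\[ c_1 = \sum_i \alpha_i^{(1)} v_i \approx v_a, \qquad c_2 = \sum_i \alpha_i^{(2)} v_i \approx v_b, \]

where the second result follows by the same argument applied to $q_2$. Taking the average of $c_1$ and $c_2$, we get

\[ c = \tfrac{1}{2}(c_1 + c_2) \approx \tfrac{1}{2}(v_a + v_b), \]

as desired.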
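The argument in part i can be checked numerically. The following is a minimal sketch, not the homework's reference solution. It assumes softmax dot-product attention, means equal to the standard basis vectors (one convenient orthonormal choice), queries of the form $\beta\mu_a$ and $\beta\mu_b$, and illustrative values `alpha=1e-6` and `beta=50.0`. The function names are hypothetical.

```python
import math
import random


def attend(query, keys, values):
    """Single-headed dot-product attention with a numerically stable softmax."""
    scores = [sum(k_j * q_j for k_j, q_j in zip(k, query)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[j] for e, v in zip(exps, values)) for j in range(dim)]


def demo_part_i(n=4, a=0, b=1, alpha=1e-6, beta=50.0, seed=0):
    """Return (c, target, max abs error) for one draw of the keys."""
    rng = random.Random(seed)
    # Assumption: the means are the standard basis vectors (orthonormal).
    means = [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]
    # Keys k_i ~ N(mu_i, alpha * I): add independent noise to each coordinate.
    std = math.sqrt(alpha)
    keys = [[m + rng.gauss(0.0, std) for m in mu] for mu in means]
    # Arbitrary value vectors; nothing in the argument depends on them.
    values = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n)]
    # Assumed query design: a large multiple of the mean of the targeted key.
    q1 = [beta * x for x in means[a]]
    q2 = [beta * x for x in means[b]]
    c1 = attend(q1, keys, values)
    c2 = attend(q2, keys, values)
    c = [0.5 * (x + y) for x, y in zip(c1, c2)]
    target = [0.5 * (x + y) for x, y in zip(values[a], values[b])]
    err = max(abs(x - y) for x, y in zip(c, target))
    return c, target, err


if __name__ == "__main__":
    c, target, err = demo_part_i()
    print("c      =", c)
    print("target =", target)
    print("max abs error =", err)
```

With a large `beta`, each head's softmax puts nearly all of its weight on a single key, so the printed error should be negligible. Shrinking `beta` toward 1 shows how the approximation degrades.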

ii. With the covariance matrices as specified, $\Sigma_a = \alpha I + \tfrac{1}{2}\mu_a\mu_a^\top$ and $\Sigma_i = \alpha I$ for $i \neq a$, only $k_a$ behaves differently from part i. Its extra variance lies along $\mu_a$, so

\[ k_a \approx \gamma \mu_a, \qquad \gamma \sim \mathcal{N}\!\left(1, \tfrac{1}{2}\right), \]

while every other key stays close to its mean. Using the same query vectors as in part i, $q_1 = \beta\mu_a$ and $q_2 = \beta\mu_b$ with large $\beta$, we get

\[ k_a^\top q_1 \approx \gamma\beta, \qquad k_i^\top q_1 \approx 0 \ (i \neq a), \qquad k_b^\top q_2 \approx \beta, \qquad k_i^\top q_2 \approx 0 \ (i \neq b). \]

The last relation also holds for $i = a$: since the mean vectors $\mu_a$ and $\mu_b$ are orthogonal, the extra variance of $k_a$ along $\mu_a$ does not change $k_a^\top q_2$.

Ignoring the cases where $k_a^\top q_i < 0$ (that is, taking $\gamma > 0$), the score $\gamma\beta$ still dominates the first head's softmax for sufficiently large $\beta$, so $c_1 \approx v_a$. The second head is unaffected, so $c_2 \approx v_b$. We therefore expect the output to stay close to

\[ c = \tfrac{1}{2}(c_1 + c_2) \approx \tfrac{1}{2}(v_a + v_b) \]

across different samples of the key vectors. The norm of $k_a$ fluctuates from sample to sample, but each head attends to its own key, so this fluctuation only changes how sharply the first head concentrates on $k_a$, not which value it selects. The approximation weakens only when $\gamma$ happens to be close to zero, in which case the first head spreads its weight over the other keys.
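The behaviour across samples in part ii can be explored with a small simulation. This is a sketch under stated assumptions rather than a definitive answer. It assumes softmax dot-product attention, standard-basis means, queries $\beta\mu_a$ and $\beta\mu_b$, and a key $k_a$ drawn with covariance $\alpha I + \tfrac{1}{2}\mu_a\mu_a^\top$ by adding an independent $\mathcal{N}(0, \tfrac{1}{2})$ multiple of $\mu_a$. The names `attend` and `sample_errors` are hypothetical.

```python
import math
import random


def attend(query, keys, values):
    """Single-headed dot-product attention with a numerically stable softmax."""
    scores = [sum(x * y for x, y in zip(k, query)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]


def sample_errors(trials=200, n=4, a=0, b=1, alpha=1e-6, beta=50.0, seed=0):
    """Return a list of (gamma, error) pairs, one per draw of the keys.

    gamma is the component of k_a along mu_a; error is the max abs
    deviation of the two-head output from (v_a + v_b) / 2.
    """
    rng = random.Random(seed)
    # Assumption: the means are the standard basis vectors (orthonormal).
    means = [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]
    values = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n)]
    target = [0.5 * (x + y) for x, y in zip(values[a], values[b])]
    q1 = [beta * x for x in means[a]]
    q2 = [beta * x for x in means[b]]
    std = math.sqrt(alpha)
    results = []
    for _ in range(trials):
        # Isotropic part: every key gets N(0, alpha * I) noise around its mean.
        keys = [[m + rng.gauss(0.0, std) for m in mu] for mu in means]
        # Rank-one part for k_a only: add g * mu_a with g ~ N(0, 1/2).
        g = rng.gauss(0.0, math.sqrt(0.5))
        keys[a] = [k + g * m for k, m in zip(keys[a], means[a])]
        gamma = sum(k * m for k, m in zip(keys[a], means[a]))
        c1 = attend(q1, keys, values)
        c2 = attend(q2, keys, values)
        c = [0.5 * (x + y) for x, y in zip(c1, c2)]
        err = max(abs(x - y) for x, y in zip(c, target))
        results.append((gamma, err))
    return results


if __name__ == "__main__":
    results = sample_errors()
    kept = [(g, e) for g, e in results if g > 0]  # ignore k_a^T q_1 < 0
    print("samples kept:", len(kept))
    print("worst error among kept samples:", max(e for _, e in kept))
```

Sorting the returned pairs by `gamma` makes the qualitative picture visible: draws where `gamma` is comfortably positive give an output very close to the average of the two values, and the error grows only as `gamma` approaches zero.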
