Consider a set of key vectors {k1, …, kn} that are now randomly sampled, ki ∼ N(µi, Σi), where the means µi ∈ R^d are known to you but the covariances Σi are unknown. Further assume that the means µi are all pe

i. One possible query is built from a fixed vector q0 that is orthogonal to all the means µi. Specifically, let q0 be a unit vector with µi⊤q0 = 0 for all i (such a vector exists only when the means do not span R^d). Then define the query q as:

q = q0 + ϵ,

where ϵ is a small random perturbation.

Intuitively, this query works because for small α each key ki stays close to its mean µi, which is fixed and known. The covariance term αI adds a small amount of noise to the key vector, but this noise is isotropic and does not bias the dot product towards or away from the query q. Therefore, the dot product ki⊤q is roughly proportional to the cosine of the angle between ki and q0 (since q ≈ q0 for small ϵ). Averaging over the keys ka and kb, we obtain:

c ≈ 1/2 (v_a + v_b),

where v_a = cos(θ_a) and v_b = cos(θ_b) are the cosine similarities between the query vector q and the key vectors ka and kb, respectively, and θ_a and θ_b are the angles between q0 and ka, and q0 and kb, respectively. Since q0 is orthogonal to µa and µb, and ka and kb concentrate around these means for small α, we have θ_a ≈ θ_b ≈ π/2, so v_a ≈ v_b ≈ 0. Therefore, c ≈ 0, which is the desired result.
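The near-orthogonality claim can be checked numerically. The following is a minimal sketch, not part of the original answer: it assumes the means are standard basis vectors, picks q0 as an unused basis direction, and uses illustrative values for d, n, α, and the scale of ϵ.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 8, 3          # assumed dimensions (n < d so that q0 exists)
alpha = 1e-4         # assumed "small" variance scale
eps_scale = 1e-3     # assumed scale of the random perturbation eps

# Assumption: the means are the first n standard basis vectors.
mu = np.eye(d)[:n]                       # shape (n, d)

# q0 is a unit vector orthogonal to every mean: the last basis vector.
q0 = np.eye(d)[-1]
q = q0 + eps_scale * rng.standard_normal(d)

# Draw many samples k_i ~ N(mu_i, alpha * I) for each key.
num_samples = 10_000
noise = np.sqrt(alpha) * rng.standard_normal((num_samples, n, d))
keys = mu[None, :, :] + noise            # shape (num_samples, n, d)

# Cosine similarity between every sampled key and the query.
dots = keys @ q                          # shape (num_samples, n)
cos_sim = dots / (np.linalg.norm(keys, axis=-1) * np.linalg.norm(q))

mean_abs_cos = np.abs(cos_sim).mean(axis=0)
print("mean |cos(k_i, q)| per key:", mean_abs_cos)
```

Under these assumptions each projection ki⊤q0 is Gaussian with mean µi⊤q0 = 0 and variance α, so the printed values should be small compared with 1.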

ii. In this case, the query q used in part (i) may not be effective, because the noise in the key vector ka is no longer isotropic. Specifically, the term (1/2)µaµa⊤ in the covariance matrix of ka gives the key vector a large variance in the direction of µa. Any query with a component along µa, including the component that the random perturbation ϵ generally contributes, therefore produces a dot product ka⊤q that fluctuates strongly from sample to sample.
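To make the anisotropy concrete, here is a small sketch that samples ka and compares its variance along µa with its variance along an orthogonal direction. The dimension, the value of α, and the choice of µa and q0 as basis vectors are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8
alpha = 1e-4                       # assumed small isotropic variance

# Assumptions: mu_a is the first basis vector, q0 is the last one.
mu_a = np.eye(d)[0]
q0 = np.eye(d)[-1]

# Covariance from the text: alpha * I + (1/2) * mu_a mu_a^T.
sigma_a = alpha * np.eye(d) + 0.5 * np.outer(mu_a, mu_a)

# Sample k_a ~ N(mu_a, sigma_a).
k_a = rng.multivariate_normal(mu_a, sigma_a, size=50_000)

# Empirical variance of the projections onto mu_a and onto q0.
var_along_mu = (k_a @ mu_a).var()   # theory for unit mu_a: alpha + 1/2
var_along_q0 = (k_a @ q0).var()     # theory: alpha

print("variance along mu_a:", var_along_mu)
print("variance along q0:  ", var_along_q0)
```

For a unit-norm µa the projection ka⊤µa has variance α + 1/2, while the projection onto any direction orthogonal to µa keeps variance α.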

To address this issue, we could modify the query q to be proportional to µa + ϵq0, where ϵ is now a small random scalar and q0 is a unit vector that is orthogonal to µa. In other words, the query q points mainly along µa, with a small random component added to avoid biasing the dot product too strongly towards µa.

Intuitively, this query works because it still encourages the key vectors to align with µa, but in a more controlled way that is less sensitive to the variance in the direction of µa. Specifically, the dot product ki⊤q is now proportional to the cosine of the angle between ki and µa + ϵq0. Averaging over the keys ka and kb, we obtain:

c ≈ 1/2 (v_a + v_b),

where v_a = cos(θ_a) and v_b = cos(θ_b) are the cosine similarities between the query vector q and the key vectors ka and kb, respectively. As before, θ_a and θ_b are the angles between µa + ϵq0 and ka and kb, respectively.

Since kb concentrates around µb, we have θ_b ≈ π/2 whenever µb is orthogonal to µa and ϵ is small, so v_b ≈ 0. In contrast, θ_a is small when ka is aligned with µa, so v_a is close to 1. Therefore, c ≈ 1/2 v_a, which is half the cosine similarity between the query and the key ka.
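The two angles can also be estimated by simulation. This is a sketch under stated assumptions rather than a definitive check: µa and µb are taken to be orthogonal unit vectors, q0 is taken orthogonal to both, kb is given the isotropic covariance αI, and the values of d, α, and ϵ are illustrative. The helper `cosine_to_query` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8
alpha = 1e-4        # assumed small isotropic variance
eps = 1e-3          # assumed small scalar weight on q0

# Assumptions: mu_a and mu_b are orthogonal unit vectors and
# q0 is a unit vector orthogonal to both.
mu_a, mu_b, q0 = np.eye(d)[0], np.eye(d)[1], np.eye(d)[-1]

# Query proportional to mu_a + eps * q0, normalised to unit length.
q = mu_a + eps * q0
q = q / np.linalg.norm(q)

sigma_a = alpha * np.eye(d) + 0.5 * np.outer(mu_a, mu_a)
sigma_b = alpha * np.eye(d)

num_samples = 50_000
k_a = rng.multivariate_normal(mu_a, sigma_a, size=num_samples)
k_b = rng.multivariate_normal(mu_b, sigma_b, size=num_samples)


def cosine_to_query(keys, query):
    """Cosine similarity between each row of `keys` and `query`."""
    norms = np.linalg.norm(keys, axis=1) * np.linalg.norm(query)
    return (keys @ query) / norms


cos_a = cosine_to_query(k_a, q)
cos_b = cosine_to_query(k_b, q)

# Fraction of k_a samples nearly parallel to q, and typical |cos| for k_b.
frac_a_aligned = (cos_a > 0.9).mean()
mean_abs_cos_b = np.abs(cos_b).mean()

print("fraction of k_a samples with cos > 0.9:", frac_a_aligned)
print("mean |cos(k_b, q)|:", mean_abs_cos_b)
```

Under these assumptions most samples of ka are nearly parallel to q, while a minority point the opposite way, because the large variance along µa can flip the sign of the key.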
