Query Mask Generation for Multi-Head Attention in TensorFlow
This code segment creates a mask for the query input to the multi-head attention layer.
The first line sums each embedding along its feature axis, takes the absolute value, and applies the sign function, yielding a binary mask in which 0 marks a padded (all-zero) position and 1 marks a real token.
query_masks = tf.sign(tf.abs(tf.reduce_sum(emb, axis=-1))) # (N, T_q)
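To see what this mask looks like, here is a small NumPy sketch of the same operation (NumPy stands in for TensorFlow so the example is self-contained; the toy values are made up):

```python
import numpy as np

# Toy batch: N=2 sequences of length T_q=3, embedding dim C=4.
# The last position of the second sequence is zero-padded.
emb = np.array([
    [[0.5, -1.0, 0.2, 0.4],
     [1.1,  0.4, -0.2, 0.0],
     [0.3,  0.3, 0.1, -0.5]],
    [[0.9, -0.1, 0.4, 0.2],
     [-0.6, 0.7, 0.0, 0.1],
     [0.0,  0.0, 0.0, 0.0]],   # padded position
])

# Same trick as tf.sign(tf.abs(tf.reduce_sum(emb, axis=-1))):
# a position whose embedding sums to zero is treated as padding.
query_masks = np.sign(np.abs(emb.sum(axis=-1)))  # (N, T_q)
print(query_masks)  # [[1. 1. 1.] [1. 1. 0.]]
```

Note that this heuristic assumes only padded positions sum to exactly zero; a real embedding whose components happened to cancel would be masked too.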
The second line stacks one copy of the mask per attention head along the batch axis, matching the layout of the multi-head outputs.
query_masks = tf.tile(query_masks, [num_heads, 1]) # (h*N, T_q)
The third line adds a trailing axis and repeats the mask T_k times, so each query position's mask value spans every key position in the sequence.
query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]]) # (h*N, T_q, T_k)
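The two tiling steps can be traced shape-by-shape with NumPy (again a stand-in for the TF calls; num_heads and T_k values are illustrative):

```python
import numpy as np

num_heads = 2
T_k = 4
query_masks = np.array([[1., 1., 0.]])  # (N=1, T_q=3); last query is padding

# Step 1: stack one copy per head along the batch axis,
# as tf.tile(query_masks, [num_heads, 1]) does -> (h*N, T_q)
query_masks = np.tile(query_masks, (num_heads, 1))

# Step 2: add a trailing axis and repeat across all T_k key positions,
# as tf.tile(tf.expand_dims(query_masks, -1), [1, 1, T_k]) does
query_masks = np.tile(query_masks[..., None], (1, 1, T_k))
print(query_masks.shape)  # (2, 3, 4), i.e. (h*N, T_q, T_k)
```

Each row of the final mask is either all ones (real query) or all zeros (padded query), replicated identically for every head.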
Finally, the outputs tensor is multiplied element-wise by the mask, zeroing every row that corresponds to a padded query position so padding contributes nothing downstream.
outputs *= query_masks # broadcasting. (N, T_q, C)
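Putting the last step together, this NumPy sketch (random data, illustrative shapes) shows the element-wise multiply wiping out the rows for padded queries:

```python
import numpy as np

rng = np.random.default_rng(0)
h_N, T_q, T_k = 2, 3, 4

# Stand-in for the attention outputs at this point in the pipeline.
outputs = rng.normal(size=(h_N, T_q, T_k))

# Mask with the last query position of every sequence marked as padding.
query_masks = np.ones((h_N, T_q, T_k))
query_masks[:, -1, :] = 0.0

# Equivalent of `outputs *= query_masks` in the TF code:
outputs *= query_masks
print(outputs[:, -1, :])  # rows for padded queries are now all zeros
```

Real query rows keep their original values; only the padded rows are zeroed.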