Dynamic Offset Estimator for Deformable Convolution with Non-Local Blocks

To capture similarity at both near and far distances, the offsets should be learned dynamically; that is, they should cover a wide area and actively reach varied and distant positions. We design an offset estimator that learns such offsets and call it the dynamic offset estimator. Because the offsets for deformable convolution should be learned from the similarity between the reference image and the input low-resolution image, a reference feature and an input feature are concatenated and fed to the dynamic offset estimator, as shown in Fig. 3. We follow the multi-scale philosophy commonly adopted in optical flow estimation [6, 10]: the concatenated input is down-sampled three times so that multiple scales are considered when predicting the offsets.
To effectively localize relevant features that may lie far away, we exploit non-local blocks in the dynamic offset estimator. Non-local operations capture the global correlation of intra- or inter-features, which gives the offset prediction an extremely large receptive field and lets it handle both small and large displacements. We use three non-local blocks in the dynamic offset estimator so that the features are amplified with attention at each scale level. Note that applying non-local operations to down-sampled features can be viewed as measuring patch-wise rather than pixel-wise similarity. Given an input $x$ and an output $y$, the non-local block operation is defined as follows:
$$y_i = W_y \left( \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \right) + x_i, \quad (2)$$
where $i$ is the index of the output position and $j$ is the index of all possible positions. $W_y$ denotes the weight matrix and $C(x)$ is the normalization factor. $f(\cdot)$ and $g(\cdot)$ represent the pair-wise operation and the linear embedding function, respectively.
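
The non-local block operation described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes the standard residual form of the non-local block (aggregate, project with the weight matrix, add the input back), assumes the normalization factor is the sum of the pairwise weights for each output position, and treats the feature map as a flat list of per-position vectors. The names `matvec` and `non_local_block` are hypothetical and introduced only for illustration.

```python
import math


def matvec(W, v):
    """Multiply a matrix W (list of rows) by a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def non_local_block(x, f, g, W_y):
    """Sketch of y_i = W_y * (1/C(x) * sum_j f(x_i, x_j) * g(x_j)) + x_i.

    x   : list of N feature vectors, one per (flattened) spatial position
    f   : pairwise function returning a positive scalar weight
    g   : embedding function mapping a feature vector to a vector
    W_y : weight matrix projecting the aggregated embedding back to the
          dimension of x
    """
    embedded = [g(xj) for xj in x]
    out = []
    for xi in x:
        # Pairwise weights between position i and every position j.
        weights = [f(xi, xj) for xj in x]
        # Assumed normalization: C(x) is the sum of the pairwise weights.
        norm = sum(weights)
        # Weighted average of the embedded features over all positions.
        agg = [
            sum(w * e[k] for w, e in zip(weights, embedded)) / norm
            for k in range(len(embedded[0]))
        ]
        projected = matvec(W_y, agg)
        # Residual connection (assumed): add the input feature back.
        out.append([p + v for p, v in zip(projected, xi)])
    return out


if __name__ == "__main__":
    # Three positions with two channels each; identity embedding and projection.
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    dot_exp = lambda a, b: math.exp(sum(p * q for p, q in zip(a, b)))
    for row in non_local_block(feats, dot_exp, lambda v: v, identity):
        print(row)
```

Every output position attends to every input position, which is why the receptive field spans the whole feature map. On a down-sampled feature map, each position summarizes a patch of the original resolution, so the same loop compares patches rather than pixels.
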
Here, we can regard $y$ as an attention-guided feature that highlights the global correlation between the input feature and the reference feature at the pixel or patch level. The function $g(x_j)$ can be expressed as $W_g x_j$, which computes the linear embedding of the input signal $x$ at position $j$. $f(x_i, x_j)$ calculates the pairwise similarity between $x_i$ and $x_j$. We want this similarity to behave like the inner product commonly used in patch matching, so we adopt an embedded Gaussian function [29] for the pairwise operation, defined as follows:
$$f(x_i, x_j) = \exp\left(\theta(x_i)^{T} \phi(x_j)\right), \quad (3)$$
where $\theta(\cdot)$ and $\phi(\cdot)$ are two linear embedding functions. For parameter efficiency, we halve the dimension of the embeddings within the dynamic offset estimator.
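
The role of the normalization factor can be made concrete with a short worked step. This is a sketch under an assumption the text does not state explicitly: that the normalization factor is the sum of the pairwise weights over all positions, as is usual for the embedded Gaussian form. The symbols $W_\theta$, $W_\phi$, and $d$ (the input feature dimension) are introduced here for illustration only.

```latex
% Assumption: C(x) = \sum_{\forall k} f(x_i, x_k)
\begin{align}
\frac{f(x_i, x_j)}{C(x)}
  &= \frac{\exp\left(\theta(x_i)^{T}\phi(x_j)\right)}
          {\sum_{\forall k}\exp\left(\theta(x_i)^{T}\phi(x_k)\right)}
   = \operatorname{softmax}_{j}\left(\theta(x_i)^{T}\phi(x_j)\right), \\
\theta(x_i) &= W_\theta x_i, \qquad \phi(x_j) = W_\phi x_j, \qquad
W_\theta, W_\phi \in \mathbb{R}^{(d/2) \times d}.
\end{align}
```

Under this assumption the attention weights for each output position are positive and sum to one, so the aggregated feature is a weighted average of the embedded features. The halved embedding dimension shortens the inner product from $d$ to $d/2$ terms and halves the parameter count of each embedding matrix.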