Spatiotemporal Fusion for High-Resolution Remote Sensing: A Review of Methods and Applications

In applications such as dynamic monitoring, change detection, and land-cover classification, dense time series of high spatial resolution remote sensing images are needed to capture detailed land surface dynamics [1]–[10]. However, owing to technological limitations and budget constraints, there is usually a tradeoff between the temporal and spatial resolutions of remote sensing images [11]–[14]. Although multiplatform satellites such as Sentinel-2 and the China High-resolution Earth Observation System (CHEOS) have recently made images with high spatial and temporal resolutions available, the usable data are still insufficient in practical applications because of cloud cover and other disturbances [15], [16]. Such data cannot meet the requirements of long-term, detailed land surface dynamics studies, which require dense historical time series of high spatial resolution images [17]–[20]. Spatiotemporal fusion is a feasible and cost-effective way to make better use of current Earth observation data: it integrates multisource satellite images to obtain images with both high spatial and high temporal resolutions. For example, MODIS data have a low spatial resolution and a high temporal resolution (LSHT), whereas Landsat Enhanced Thematic Mapper Plus (ETM+) data have a high spatial resolution and a low temporal resolution (HSLT) [21], [22]. Given one or two Landsat-MODIS image pairs on prior dates and one MODIS image on the prediction date, spatiotemporal fusion models combine the spatial resolution of Landsat imagery with the temporal frequency of MODIS imagery to generate a Landsat-like image on the prediction date.

In recent years, many spatiotemporal fusion methods have been proposed to combine remote sensing data from sensors with different spatial and temporal resolutions. Current spatiotemporal fusion methods can generally be classified into three categories: 1) weight function-based methods; 2) unmixing-based methods; and 3) learning-based methods [23], [24]. In the weight function-based methods, the fine pixel values are estimated by combining the information from all the input images through weight functions [24]. The most representative weight function-based algorithms are the spatial and temporal adaptive reflectance fusion model (STARFM) [25] and its enhanced version (ESTARFM) [26]. The classic STARFM builds a simple approximate relationship between the HSLT and LSHT pixels and searches for similar neighboring pixels according to the spectral difference, the temporal difference, and the location distance. To handle complex heterogeneous areas, Zhu et al. [26] proposed ESTARFM, which assigns different conversion coefficients to homogeneous and heterogeneous areas to modify the weights of the neighboring pixels. However, both algorithms assume that the proportion of each land-cover type does not change during the observation period, and therefore do not account for disturbance events (e.g., forest fires) or human-driven changes such as changes in urban land use. To deal with this problem, Hilker et al. [27] proposed a spatial and temporal adaptive algorithm for mapping reflectance change (STAARCH) for mapping sudden disturbance events. Furthermore, Zhao et al. [28] developed a robust adaptive spatial and temporal image fusion model (RASTFM) for complex land surface changes. The disadvantage of the weight function-based methods is that averaging over neighboring pixels may blur the predicted image, which, in turn, causes a loss of high-frequency details.
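The weight function-based idea described above can be illustrated with a minimal Python sketch. It is not STARFM itself: the function name `weighted_fusion`, the similarity tolerance `tol`, the window size, and the specific way the three differences are multiplied into a cost are all illustrative assumptions, and the coarse images are assumed to be already resampled to the fine grid.

```python
import numpy as np

def weighted_fusion(fine_t0, coarse_t0, coarse_tp, window=3, tol=0.05, eps=1e-6):
    """Simplified weight-function fusion for one band (illustrative, not STARFM).

    fine_t0   : fine image at the prior date
    coarse_t0 : coarse image at the prior date, resampled to the fine grid
    coarse_tp : coarse image at the prediction date, resampled to the fine grid
    window    : odd moving-window size (hypothetical default)
    tol       : hypothetical threshold for "spectrally similar" neighbors
    """
    fine_t0 = np.asarray(fine_t0, dtype=float)
    coarse_t0 = np.asarray(coarse_t0, dtype=float)
    coarse_tp = np.asarray(coarse_tp, dtype=float)
    half = window // 2

    # Per-pixel candidate: fine value plus the temporal change seen by the coarse sensor.
    candidate = fine_t0 + (coarse_tp - coarse_t0)
    spectral = np.abs(fine_t0 - coarse_t0)    # fine/coarse disagreement at t0
    temporal = np.abs(coarse_tp - coarse_t0)  # coarse change between dates

    def pad(a):
        return np.pad(a, half, mode="edge")

    fine_p, cand_p = pad(fine_t0), pad(candidate)
    spec_p, temp_p = pad(spectral), pad(temporal)

    # Relative distance of each window position to the window center.
    yy, xx = np.mgrid[-half:half + 1, -half:half + 1]
    distance = 1.0 + np.hypot(yy, xx) / max(half, 1)

    rows, cols = fine_t0.shape
    prediction = np.empty_like(fine_t0)
    for r in range(rows):
        for c in range(cols):
            win = (slice(r, r + window), slice(c, c + window))
            # Only neighbors whose fine value is close to the center pixel contribute.
            similar = np.abs(fine_p[win] - fine_t0[r, c]) <= tol
            cost = (spec_p[win] + eps) * (temp_p[win] + eps) * distance
            weights = similar / cost
            weights /= weights.sum()  # the center pixel is always similar, so the sum is > 0
            prediction[r, c] = np.sum(weights * cand_p[win])
    return prediction
```

Because each output pixel is a weighted average over a neighborhood, the sketch also makes the blurring drawback visible: wherever dissimilar pixels pass the similarity test, fine detail is smoothed.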

Based on the linear spectral mixing theory, the unmixing-based methods unmix the coarse pixels to estimate the values of the fine pixels. Zhukov et al. [29] first proposed the multisensor multiresolution technique (MMT) to integrate remote sensing images with different spatial resolutions acquired at different times. However, MMT was confronted with two problems: 1) large errors caused by spectral unmixing and 2) the lack of endmember spectral variability. To address these issues, Wu et al. [30] proposed the spatial–temporal data fusion approach (STDFA), which obtains the prediction from the reflectance change estimated by unmixing the endmember reflectance in a moving window. STDFA has also been enhanced by the use of an adaptive moving window size [31]. However, the unmixing-based methods still have difficulty predicting land-cover-type changes because high spatial resolution land-use databases are lacking, which limits their application.
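The unmixing step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not MMT or STDFA: it assumes a fine-resolution class map is available, that each coarse pixel covers an exact `scale` x `scale` block of fine pixels, and that one set of class reflectances holds for the whole window. The function names `class_fractions` and `unmix_endmembers` are hypothetical.

```python
import numpy as np

def class_fractions(class_map, scale, n_classes):
    """Fraction of each class inside every coarse pixel.

    class_map : 2-D integer array of fine-resolution class labels (0..n_classes-1)
    scale     : number of fine pixels per coarse pixel along each axis
    Returns an array of shape (rows // scale, cols // scale, n_classes).
    """
    rows, cols = class_map.shape
    blocks = class_map.reshape(rows // scale, scale, cols // scale, scale)
    return np.stack(
        [(blocks == k).mean(axis=(1, 3)) for k in range(n_classes)], axis=-1
    )

def unmix_endmembers(coarse, fractions):
    """Least-squares solution of coarse = fractions @ endmembers over one window."""
    design = fractions.reshape(-1, fractions.shape[-1])  # one row per coarse pixel
    target = np.asarray(coarse, dtype=float).reshape(-1)
    endmembers, *_ = np.linalg.lstsq(design, target, rcond=None)
    return endmembers

# Toy example: two classes with assumed reflectances 0.1 and 0.5.
class_map = np.zeros((8, 8), dtype=int)
class_map[:, 4:] = 1
class_map[0, 3] = 1                          # creates one mixed coarse pixel
fractions = class_fractions(class_map, scale=2, n_classes=2)
coarse = fractions @ np.array([0.1, 0.5])    # simulated coarse observation
estimated = unmix_endmembers(coarse, fractions)
fine_prediction = estimated[class_map]       # assign class reflectance to fine pixels
```

Every fine pixel of a class receives the same value in this sketch, which is one way to see why within-class (endmember) variability and land-cover-type change are difficult for this family of methods.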

With the development of machine learning, learning-based methods have been proposed in recent years. Huang and Song [32] proposed the sparse representation-based spatiotemporal reflectance fusion model (SPSTFM), which was the first method to bring dictionary-pair learning techniques from natural image super-resolution to spatiotemporal data fusion. Following SPSTFM, to deal with the case of a single prior Landsat-MODIS image pair, Song and Huang [33] developed another sparse representation-based spatiotemporal satellite image fusion (SSIF) model through one-pair image learning. The sparse representation-based methods extract the mapping relationship between the HSLT Landsat and LSHT MODIS images by learning a dictionary pair. They then predict the fused image by weighting the predictions from the two end dates of the observation period. However, the sparse representation-based methods need to relearn the dictionary for the images of each research area, which is inefficient. Compared with dictionary learning, deep learning has a better generalization ability over diverse remote sensing scenes. Song et al. [34] proposed a spatiotemporal image fusion method using a deep convolutional neural network (STFDCNN). The convolutional neural network (CNN) is adopted to model the relationship between the coarse-resolution (CR) image and the fine-resolution (FR) image, and a high-pass fusion model is used for the prediction. Liu et al. [35] improved STFDCNN by taking the temporal dependence and temporal consistency into consideration and proposed a two-stream CNN for spatiotemporal image fusion (StfNet). However, these two CNN-based methods still have several limitations. First, STFDCNN and StfNet are not end-to-end learning models. The prediction stage is divided into two parts (CNN-based mapping and reconstruction), which increases the complexity of the algorithms. Second, each band needs to be trained separately, which increases the number of parameters, memory usage, and training time.
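The cost of band-by-band training can be made concrete with a back-of-the-envelope parameter count. The sketch below assumes a hypothetical three-layer CNN (9x9, 1x1, and 5x5 kernels with 64 and 32 filters) and six spectral bands; these numbers are illustrative assumptions and are not the architectures of STFDCNN or StfNet.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases of one 2-D convolution layer."""
    return kernel * kernel * in_channels * out_channels + out_channels

def network_params(bands_in, bands_out):
    """Parameter count of the assumed three-layer CNN (hypothetical layout)."""
    return (
        conv_params(9, bands_in, 64)
        + conv_params(1, 64, 32)
        + conv_params(5, 32, bands_out)
    )

n_bands = 6  # assumed number of spectral bands

# One single-band network per band versus one network that handles all bands.
per_band_total = n_bands * network_params(1, 1)
joint_total = network_params(n_bands, n_bands)

print(per_band_total, joint_total)
```

Under these assumptions, the six separate networks hold more parameters in total than the single multiband network, and they also require six training runs instead of one.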
