Benchmarking Vision Transformer (ViT) Models for BigEarthNet Land Cover Classification

Abstract:

Remote sensing data plays a crucial role in diverse fields such as agriculture, urban planning, and environmental monitoring. The publicly available BigEarthNet dataset offers multi-spectral satellite images for land cover classification, fueling the development of advanced image analysis techniques. While Convolutional Neural Networks (CNNs) have yielded impressive results on BigEarthNet classification, the emergence of Vision Transformer (ViT) models presents new possibilities for image classification.

This paper investigates the application of ViT models to BigEarthNet classification, comparing their performance with established benchmarks. The study examines whether ViT models are currently employed for BigEarthNet and evaluates their effectiveness. The ViT models selected for this analysis are ViT-B/16, ViT-L/16, and DeiT-B/16.
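The paper does not specify its implementation here, but the following is a minimal sketch of how such models are commonly instantiated for BigEarthNet-style multi-label classification. The use of the timm library, the 19-class BigEarthNet nomenclature, 12 Sentinel-2 bands, and 224x224 inputs are all assumptions not confirmed by the source.

```python
import timm
import torch
import torch.nn as nn

# Assumed setup: 19-class BigEarthNet nomenclature and 12 Sentinel-2
# bands; adjust NUM_CLASSES / IN_CHANNELS to match your data pipeline.
NUM_CLASSES = 19
IN_CHANNELS = 12

# timm adapts the patch-embedding projection when in_chans != 3.
model = timm.create_model(
    "deit_base_patch16_224",  # or "vit_base_patch16_224" / "vit_large_patch16_224"
    pretrained=True,
    num_classes=NUM_CLASSES,
    in_chans=IN_CHANNELS,
)

# BigEarthNet is a multi-label task, so a sigmoid-based loss is used
# instead of softmax cross-entropy.
criterion = nn.BCEWithLogitsLoss()

x = torch.randn(4, IN_CHANNELS, 224, 224)                # dummy batch
logits = model(x)                                        # (4, NUM_CLASSES)
targets = torch.randint(0, 2, (4, NUM_CLASSES)).float()  # dummy labels
loss = criterion(logits, targets)
```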

The findings show that ViT models surpass the baseline results in both accuracy and F1-score. Among the three tested models, DeiT-B/16 performs best, achieving an accuracy of 89.76% and an F1-score of 0.89.
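For reference, accuracy and F1-score for a multi-label task can be computed as in the sketch below. The use of scikit-learn, micro-averaging, and a 0.5 decision threshold are assumptions; the source does not specify its evaluation protocol.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical predictions: binary multi-label matrices of shape
# (num_samples, num_classes); thresholding sigmoid outputs at 0.5
# is an assumed, common choice.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
probs  = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.4]])
y_pred = (probs >= 0.5).astype(int)

# With multi-label inputs, accuracy_score gives subset accuracy:
# a sample counts only if every label is predicted correctly.
acc = accuracy_score(y_true, y_pred)
# Micro-averaged F1 aggregates TP/FP/FN across all labels.
f1 = f1_score(y_true, y_pred, average="micro")
print(f"accuracy={acc:.4f}, micro-F1={f1:.4f}")
```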

This study demonstrates the efficacy of ViT models for BigEarthNet classification, with DeiT-B/16 emerging as the most effective of the models tested. The results offer useful guidance for remote sensing researchers seeking to apply ViT models to BigEarthNet classification tasks.
