Active Learning Sample Selection Strategies: A Comprehensive Review

Active learning is a machine learning paradigm that aims to improve model performance by strategically selecting the most informative unlabeled samples for labeling. A crucial component of active learning is the sample selection strategy, which determines which samples to query for labels. These strategies can be broadly categorized into three types: diversity-based sampling, uncertainty-based sampling, and hybrid sampling.

Diversity-based sampling methods aim to select unlabeled samples that best represent the overall data distribution, ensuring that the labeled set provides a comprehensive view of the feature space. Mahmudul et al. proposed a technique that not only prioritizes the informativeness of individual samples but also considers their contextual information during selection. Sener and Savarese constructed a core-set over latent features to identify a diverse set of samples. Sinha et al. developed a method that learns a latent space using a variational autoencoder (VAE) together with an adversarial network that discriminates between labeled and unlabeled data. These strategies often rely on unsupervised techniques, such as clustering, to achieve their goals.
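To make the core-set idea concrete, the sketch below implements greedy k-center selection over feature embeddings: each round picks the unlabeled point farthest from everything already covered. This is only a minimal illustration in the spirit of the core-set approach described above, not a faithful reproduction of any cited method; the function name and toy features are our own.

```python
import numpy as np

def k_center_greedy(features, labeled_idx, budget):
    """Greedy k-center core-set sketch: repeatedly pick the point
    farthest from the currently covered (labeled + selected) set."""
    features = np.asarray(features, dtype=float)
    # Distance from every point to its nearest already-labeled point.
    dists = np.min(
        np.linalg.norm(features[:, None, :] - features[labeled_idx][None, :, :], axis=2),
        axis=1,
    )
    new_idx = []
    for _ in range(budget):
        i = int(np.argmax(dists))  # farthest point from the covered set
        new_idx.append(i)
        # Newly selected center shrinks coverage distances.
        d_new = np.linalg.norm(features - features[i], axis=1)
        dists = np.minimum(dists, d_new)
    return new_idx
```

Because each selection maximizes the minimum distance to the covered set, the queried batch spreads out across the feature space rather than clustering in one dense region.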

Uncertainty-based sampling methods focus on selecting samples for which the model is least confident about the labels; learning from these uncertain samples can rapidly improve model performance. Raj and Bach proposed an uncertainty-sampling method that converges to the optimal predictor of a linear model under various sampling schemes for binary and multi-class classification. Joshi et al. introduced an uncertainty measure that generalizes margin-based uncertainty to multi-class classification and is computationally efficient. Shen et al. proposed a heuristic for uncertainty-based active learning in sequence tagging. Wang et al. developed an uncertainty-based active labeling method, AL-DL, while Yoo et al. proposed a 'loss prediction module' that predicts the target loss of unlabeled inputs. However, because uncertainty-based methods rely on predicted class probabilities, they often neglect the intrinsic value of the feature representation.
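A minimal sketch of margin-based uncertainty for the multi-class case follows: a sample is considered uncertain when the gap between its top two predicted class probabilities is small. The function names are ours, and this is a generic illustration of margin sampling rather than the specific measure of any cited work.

```python
import numpy as np

def margin_uncertainty(probs):
    """Margin uncertainty for multi-class predictions.
    `probs` is an (n_samples, n_classes) array of predicted probabilities;
    a small gap between the top two classes means low confidence."""
    srt = np.sort(probs, axis=1)
    margin = srt[:, -1] - srt[:, -2]  # top-1 minus top-2 probability
    return -margin                    # higher score = more uncertain

def select_most_uncertain(probs, budget):
    """Return indices of the `budget` samples with the smallest margins."""
    scores = margin_uncertainty(np.asarray(probs, dtype=float))
    return np.argsort(scores)[::-1][:budget].tolist()
```

Note that the score depends only on the predicted probabilities, which illustrates the limitation mentioned above: two samples with identical softmax outputs receive the same score even if their feature representations are very different.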

Hybrid active learning methods combine diversity and uncertainty in their sample selection strategies, seeking to exploit the strengths of both methodologies. Gui et al. proposed an efficient active learning algorithm for large-batch settings that combines uncertainty and diversity. Ash et al. designed a strategy named BADGE that incorporates both predictive uncertainty and sample diversity into each selected batch. While effective, such approaches may not scale well to larger tasks because of their computational cost.
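One simple way to combine the two signals, sketched below, is a two-stage filter: shortlist candidates by predictive entropy, then pick a diverse subset of the shortlist by farthest-point traversal in feature space. This is not BADGE (which clusters gradient embeddings); it is only a hypothetical illustration of the hybrid idea, with names and the `pool_factor` parameter of our own choosing.

```python
import numpy as np

def hybrid_select(features, probs, budget, pool_factor=3):
    """Two-stage hybrid sketch: uncertainty shortlist, then diversity pick."""
    features = np.asarray(features, dtype=float)
    probs = np.asarray(probs, dtype=float)
    # 1) Uncertainty: predictive entropy, higher = less confident.
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    shortlist = np.argsort(entropy)[::-1][: budget * pool_factor]
    # 2) Diversity: farthest-point selection within the shortlist.
    chosen = [int(shortlist[0])]  # seed with the most uncertain sample
    while len(chosen) < min(budget, len(shortlist)):
        d = np.min(
            np.linalg.norm(features[shortlist][:, None] - features[chosen][None], axis=2),
            axis=1,
        )
        chosen.append(int(shortlist[int(np.argmax(d))]))
    return chosen
```

The `pool_factor` knob trades off the two criteria: a larger shortlist admits more diversity at the cost of including less uncertain samples, which mirrors the balance that hybrid methods such as BADGE try to strike in a more principled way.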

In conclusion, this review provides a comprehensive overview of active learning sample selection strategies, emphasizing the strengths and limitations of diversity-based, uncertainty-based, and hybrid methods. Further research is needed to develop scalable and efficient active learning algorithms that effectively leverage both diversity and uncertainty to achieve optimal performance in various machine learning tasks.
