Abstract:
Existing infrared small-target detection methods based on convolutional neural networks (CNNs) suffer from a limited receptive field in the encoder stage, and their decoders lack effective feature interaction when fusing multiscale features. To address these issues, this study proposes a new method based on an encoder–decoder structure. Specifically, a vision transformer is used as the encoder to extract multiscale features from infrared small-target images. The vision transformer is an emerging deep-learning architecture whose self-attention mechanism captures global relationships among all pixels of the input image, allowing it to model long-range dependencies and contextual information effectively. Furthermore, a dual-decoder module, comprising an interactive decoder and an auxiliary decoder, is proposed to strengthen the decoder's ability to reconstruct small infrared targets. The dual-decoder module makes full use of the complementary information between different features, promotes interaction between deep and shallow features, and reconstructs small infrared targets more accurately by combining the outputs of the two decoders. Experimental results on widely used public datasets show that the proposed method outperforms existing methods on two evaluation metrics: F1 score and mIoU.
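For illustration only, the sketch below shows one way the architecture described above could be organized in PyTorch: a ViT-style encoder whose self-attention gives every patch a global receptive field, followed by two decoder heads whose outputs are fused. This is a minimal sketch under stated assumptions, not the authors' implementation; all module names, hyperparameters, and the averaging fusion rule are assumptions.

```python
# Minimal illustrative sketch (not the authors' code): ViT-style encoder
# followed by two decoder heads whose outputs are fused. Sizes, names, and
# the averaging fusion rule are assumptions for illustration.
import torch
import torch.nn as nn


class ViTEncoder(nn.Module):
    """Patch embedding + transformer blocks; self-attention lets every
    patch attend to all others, giving a global receptive field."""
    def __init__(self, in_ch=1, dim=256, depth=4, heads=8, patch=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        tokens = self.patch_embed(x)             # (B, dim, H/p, W/p)
        B, C, H, W = tokens.shape
        seq = tokens.flatten(2).transpose(1, 2)  # (B, N, dim) token sequence
        seq = self.blocks(seq)
        return seq.transpose(1, 2).reshape(B, C, H, W)


class DecoderHead(nn.Module):
    """Simple upsampling head that reconstructs a target mask."""
    def __init__(self, dim=256, patch=8):
        super().__init__()
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=patch, mode="bilinear",
                        align_corners=False),
            nn.Conv2d(dim, 1, kernel_size=3, padding=1),
        )

    def forward(self, feats):
        return self.up(feats)


class DualDecoderNet(nn.Module):
    """Encoder plus two decoder branches, mirroring the abstract's
    interactive/auxiliary split; the fusion-by-averaging is assumed."""
    def __init__(self):
        super().__init__()
        self.encoder = ViTEncoder()
        self.interactive_decoder = DecoderHead()  # main prediction branch
        self.auxiliary_decoder = DecoderHead()    # complementary branch

    def forward(self, x):
        feats = self.encoder(x)
        return 0.5 * (self.interactive_decoder(feats)
                      + self.auxiliary_decoder(feats))


if __name__ == "__main__":
    net = DualDecoderNet()
    mask = net(torch.randn(1, 1, 256, 256))  # single-channel IR image
    print(mask.shape)  # torch.Size([1, 1, 256, 256])
```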