Abstract:
The fusion of infrared and visible images can produce images that contain more information than the source images and better match human visual perception, and it also benefits downstream tasks. Traditional fusion methods based on signal processing suffer from poor generalization and degraded performance on complex images. Deep learning methods can extract features effectively and achieve good results; however, their outputs often lose textural details and appear blurred. To address these problems, this study proposes a fusion network for infrared and visible images based on a multiscale Swin Transformer and an attention mechanism. The Swin Transformer extracts long-range semantic information from a multiscale perspective, and the attention mechanism suppresses insignificant components of the extracted features so that the main information is retained. In addition, this study proposes a new hybrid fusion strategy, with brightness enhancement and detail retention modules designed according to the respective characteristics of infrared and visible images to preserve more textural details and infrared target information. The fusion method has three parts: an encoder, a fusion strategy, and a decoder. First, the source images are fed into the encoder to extract multiscale depth features. Then, the fusion strategy fuses the depth features at each scale. Finally, the fused image is reconstructed by a decoder based on nested connections. Experimental results on public datasets show that the proposed method achieves better fusion performance than other state-of-the-art methods, obtaining the best scores on the EI, AG, QP, EN, and SD objective metrics. From a subjective perspective, the proposed method also preserves more edge details in the fused results.
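Purely for illustration, the minimal PyTorch sketch below shows the three-part pipeline described above (encoder extracting multiscale depth features, per-scale fusion, and a decoder with skip connections). All module names and the simplified fusion rule are our assumptions, not the paper's implementation: plain convolution blocks stand in for the Swin Transformer stages, a softmax channel weighting stands in for the attention-based hybrid fusion strategy, and the skip concatenations only loosely mimic the nested-connection decoder.

```python
# Hypothetical sketch of the encoder -> per-scale fusion -> decoder pipeline.
# Not the paper's implementation; every name and design detail here is a
# placeholder chosen for brevity and runnability.
import torch
import torch.nn as nn


def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """3x3 conv + ReLU; a stand-in for a Swin Transformer stage."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.ReLU(inplace=True))


class Encoder(nn.Module):
    """Extracts depth features at three scales (downsampled by max pooling)."""
    def __init__(self, chans=(16, 32, 64)):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 1  # single-channel (grayscale) infrared or visible input
        for out_ch in chans:
            self.stages.append(conv_block(in_ch, out_ch))
            in_ch = out_ch
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feats = []
        for i, stage in enumerate(self.stages):
            x = stage(x if i == 0 else self.pool(x))
            feats.append(x)
        return feats  # one feature map per scale, shallow -> deep


def fuse(ir_feat, vis_feat):
    """Toy per-scale fusion: softmax channel weights from global pooling.

    The paper's hybrid strategy (brightness enhancement + detail retention)
    is replaced by this simple weighted sum purely for illustration.
    """
    w_ir = ir_feat.mean(dim=(2, 3), keepdim=True)    # (B, C, 1, 1)
    w_vis = vis_feat.mean(dim=(2, 3), keepdim=True)
    w = torch.softmax(torch.stack([w_ir, w_vis], dim=0), dim=0)
    return w[0] * ir_feat + w[1] * vis_feat


class Decoder(nn.Module):
    """Reconstructs the fused image from the per-scale fused features."""
    def __init__(self, chans=(16, 32, 64)):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)
        self.mix2 = conv_block(chans[2] + chans[1], chans[1])
        self.mix1 = conv_block(chans[1] + chans[0], chans[0])
        self.head = nn.Conv2d(chans[0], 1, 1)

    def forward(self, feats):
        f1, f2, f3 = feats  # shallow -> deep
        x = self.mix2(torch.cat([self.up(f3), f2], dim=1))
        x = self.mix1(torch.cat([self.up(x), f1], dim=1))
        return torch.sigmoid(self.head(x))


if __name__ == "__main__":
    encoder, decoder = Encoder(), Decoder()
    ir = torch.rand(1, 1, 64, 64)    # dummy infrared image
    vis = torch.rand(1, 1, 64, 64)   # dummy visible image
    fused = [fuse(a, b) for a, b in zip(encoder(ir), encoder(vis))]
    out = decoder(fused)
    print(out.shape)  # torch.Size([1, 1, 64, 64])
```

In this toy setup the two modalities share one encoder; whether the paper uses shared or modality-specific encoder weights, and how its attention and nested connections are wired in detail, is specified in the method section rather than here.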