Infrared Weak and Small Target Detection Algorithm Based on Deep Learning
Published Online: Sep 30, 2024
Page range: 47 - 55
DOI: https://doi.org/10.2478/ijanmc-2024-0026
© 2024 Lei Wang et al., published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
As an important thermal measurement technology, infrared imaging uses infrared detectors to receive the thermal radiation emitted at different wavelengths by surfaces in a scene and converts it into images. The technique offers a variety of advantages, such as passive imaging, long range, ease of concealment, and the ability to work day and night, which has made it widely used in military, security, medical, industrial inspection and other fields. However, infrared imaging devices also face some challenges in practical applications. First, because the imaging mechanism of infrared images differs from that of visible light, the contrast between the target and the background is usually low, which makes targets difficult to identify and detect. Second, when background interference is strong and the target signal is weak, the signal-to-noise ratio of the infrared image is usually low, so the target often appears as a small object with an incomplete structure [1].
In addition, the noise, scattering, and radiation inhomogeneity that are prevalent in infrared imaging further increase the difficulty of effectively detecting small targets in infrared images. With the development of computer vision technology, accurately and quickly detecting and identifying small targets against complex backgrounds has become a hot and difficult research problem. The purpose of this paper is to explore methods and technologies for improving the detection performance of small objects in infrared imaging, in order to provide new ideas and solutions for this field.
In the early stages of small-target detection, traditional algorithms were mainly based on filters and wavelet transforms for single-frame and multi-frame detection [2]. Commonly used techniques include median filtering, high-pass filtering, the wavelet transform, and threshold segmentation. Yuan Shuai et al. [3] proposed a method that separates the target from the background by comparing the difference between the target area and its inner and outer double-layer neighborhoods, thereby enhancing the local contrast of dim small targets and effectively suppressing complex background noise. Liu Delian et al. [4] proposed the concept of the stagnation point connection, which is used as a reference from which the gray-level difference of each pixel is computed to determine the target position. These single-frame and multi-frame algorithms are computationally simple and have low complexity, but their detection performance is limited in complex and changeable real scenes where the target's contrast is not obvious.
With the development of deep learning, effective, non-traditional solutions have been introduced for complex problems in computer vision. Deep learning object detectors currently fall into two main types: two-stage detection algorithms (represented by the R-CNN series) and single-stage detection algorithms (represented by the SSD and YOLO series). Both rely on deep convolutional neural networks (CNNs) to learn high-level image features, capture semantic information, and use multi-scale strategies to detect targets at different resolutions, thereby improving detection performance [5]. Li Mukai et al. [6] introduced the SE block, based on the idea in SENet of recalibrating features according to learned weights, and improved the accuracy of YOLOv7 for small-target detection to 83.97%. To address the problems inherent in infrared images, Jiang Zhixin et al. [7] combined histogram equalization with the MSR algorithm for image preprocessing and also improved the loss function of the Faster R-CNN network, raising mAP by 6.11% compared with the original network. However, such deep learning algorithms based on prior boxes still have difficulty handling small targets in images, especially scattered small targets that do not overlap or are occluded. Introducing attention mechanisms into the backbone network alleviates this problem to a certain extent, but these detectors still have limitations in feature representation. The DETR network, built on the transformer architecture, can therefore effectively improve the detection of small targets in complex backgrounds through its global modeling capability and encoded location information.
RT-DETR is the first real-time object detector in the DETR series and comes in four model variants with different backbones; among them, RT-DETR-L uses HGNetv2-L as its backbone and offers the best performance under the same conditions with fewer parameters and lower computational cost. The network structure is shown in Figure 1. The RT-DETR-L network consists of three parts: the backbone, the neck, and the decoder. The backbone HGNetv2 consists of four HG Stage modules, each mainly composed of HG Blocks, and extracts feature maps at three scales from different levels as the input of the hybrid encoder. The hybrid encoder is composed of two modules, the AIFI encoder and the CCFM feature fusion module. AIFI is essentially a multi-head transformer that processes only the deepest S5 feature layer and outputs the F5 feature layer, which, together with the S3 and S4 features, serves as the input of the CCFM module, where top-down and bottom-up feature fusion are performed. From the fusion result, IoU-aware Query Selection selects the top 300 candidates as queries, which are fed into the decoder for prediction output. A minimal sketch of the AIFI step is given after Figure 1.

Figure 1. RT-DETR network structure
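As a rough illustration of the AIFI step described above, the sketch below applies a standard multi-head transformer encoder layer only to the flattened S5 feature map; the channel width and layer hyper-parameters are illustrative assumptions rather than the exact RT-DETR-L settings.

```python
import torch
import torch.nn as nn

# Assumed channel width for the deepest feature map S5 (illustrative, not the paper's value)
d_model = 256
aifi = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                  dim_feedforward=1024, batch_first=True)

s5 = torch.randn(1, d_model, 20, 20)        # B, C, H, W from the backbone
tokens = s5.flatten(2).permute(0, 2, 1)     # B, H*W, C: one token per spatial position
f5 = aifi(tokens)                           # intra-scale self-attention on S5 only (AIFI)
f5 = f5.permute(0, 2, 1).reshape_as(s5)     # back to B, C, H, W for CCFM fusion
print(f5.shape)                             # torch.Size([1, 256, 20, 20])
```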
In view of the low contrast, high noise, and unclear details of infrared images, detection performance can be improved by applying an appropriate image enhancement algorithm before the images are fed to the model.
Dynamic Detail Enhancement (DDE) [8] and the MSR algorithm are two commonly used image enhancement techniques for preprocessing infrared images. DDE separates the base layer and the detail layer of the image with a filtering algorithm and enhances the detail layer to achieve a significant improvement in detail; however, its improvement of the brightness and contrast of the base layer is limited, and it easily amplifies noise along with the details. The multi-scale retinex (MSR) algorithm improves the brightness and global contrast of the image through logarithmic transformation and smoothing; however, its effect on fine detail is limited, and in complex scenes the improvement in local contrast is not obvious. Therefore, combining the DDE and MSR algorithms makes it possible to dynamically adjust and enhance the detail and clarity of the image according to the complexity of the image content.
The detailed design of the algorithm is as follows.

Bilateral filtering is used to decompose the original infrared image $I$ into the base component $I_B$:

$$I_B = \mathrm{BF}(I)$$

where $\mathrm{BF}(\cdot)$ denotes the bilateral filtering operation, which smooths the image while preserving edges.

The detail component $I_D$ is the result of subtracting the base component from the original image:

$$I_D = I - I_B$$

The base component contains the large-scale structure and illumination information of the image, and its local contrast is enhanced with the MSR algorithm:

$$I_B' = \mathrm{MSR}(I_B)$$

Finally, the detail component is recombined with the enhanced base component to obtain the final enhanced infrared image:

$$I_{out} = I_B' + I_D$$
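A minimal sketch of this base/detail decomposition with MSR enhancement is given below, assuming 8-bit grayscale input; the bilateral filter parameters and Retinex scales are assumed values, not those used in the paper.

```python
import cv2
import numpy as np

def msr(img, sigmas=(15, 80, 200)):
    """Multi-scale retinex: average of log(I) - log(Gaussian-blurred I) over several scales."""
    img = img.astype(np.float32) + 1.0          # avoid log(0)
    out = np.zeros_like(img)
    for s in sigmas:
        blur = cv2.GaussianBlur(img, (0, 0), s)
        out += np.log(img) - np.log(blur)
    out /= len(sigmas)
    return cv2.normalize(out, None, 0, 255, cv2.NORM_MINMAX)

def enhance(ir_img):
    base = cv2.bilateralFilter(ir_img, 9, 75, 75)                 # I_B: edge-preserving base layer
    detail = ir_img.astype(np.float32) - base.astype(np.float32)  # I_D = I - I_B
    base_enh = msr(base)                                          # MSR on the base component
    return np.clip(base_enh + detail, 0, 255).astype(np.uint8)    # recombine I_B' + I_D

# enhanced = enhance(cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE))
```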
Small and medium-sized targets are difficult to detect in infrared object detection, and the original RT-DETR-L uses HGNetv2 as its backbone, which is essentially still a CNN structure that extracts features through stacked HGBlock convolution operations. Therefore, an improved RT-DETR network model is designed: the EMA attention mechanism module is added to the RT-DETR backbone so that it can extract multi-scale feature information that is rich in global context and discriminative features, especially small-target features.
Figure 2 shows how the EMA module is added; a simplified sketch of the EMA structure follows the figure.

Figure 2. EMA module
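The sketch below follows the published EMA (Efficient Multi-scale Attention) design in PyTorch; the grouping factor and the exact insertion points in the HGNetv2 backbone are assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Efficient Multi-scale Attention: grouped 1x1/3x3 branches with cross-spatial learning."""
    def __init__(self, channels, factor=8):
        super().__init__()
        self.groups = factor
        self.softmax = nn.Softmax(-1)
        self.agp = nn.AdaptiveAvgPool2d((1, 1))
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))      # pool along width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))      # pool along height
        self.gn = nn.GroupNorm(channels // factor, channels // factor)
        self.conv1x1 = nn.Conv2d(channels // factor, channels // factor, 1)
        self.conv3x3 = nn.Conv2d(channels // factor, channels // factor, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.size()
        g = x.reshape(b * self.groups, -1, h, w)
        x_h = self.pool_h(g)
        x_w = self.pool_w(g).permute(0, 1, 3, 2)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))     # joint encoding of both directions
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(g * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        x2 = self.conv3x3(g)                                 # local 3x3 branch
        # cross-spatial interaction between the two branches
        a1 = self.softmax(self.agp(x1).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        a2 = self.softmax(self.agp(x2).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        weights = (torch.matmul(a1, x2.reshape(b * self.groups, c // self.groups, -1)) +
                   torch.matmul(a2, x1.reshape(b * self.groups, c // self.groups, -1)))
        weights = weights.reshape(b * self.groups, 1, h, w)
        return (g * weights.sigmoid()).reshape(b, c, h, w)

print(EMA(64)(torch.randn(1, 64, 40, 40)).shape)   # torch.Size([1, 64, 40, 40])
```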
At the same time, in order to suppress background and noise interference in infrared images and increase the attention paid to small targets, the CAMixing convolution-attention module is introduced. It enables the CCFM to suppress irrelevant information and improve denoising during multi-scale feature fusion, which strengthens the modeling of global and local features and improves the detection rate of small targets.
The CAMixing module is added to the network as shown in Figure 3.

Figure 3. CAMixing module
In RT-DETR, the encoder is applied only to S5; comparative experiments verify that this not only significantly reduces computation and improves speed, but also causes no significant damage to model performance [9]. Therefore, adding CAMixing only to the S5-related paths of the bidirectional feature fusion helps improve detection accuracy while limiting the amount of computation.
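As a generic illustration of the convolution-attention mixing idea (not the paper's exact CAMixing design, which is not reproduced here), the sketch below combines a depthwise-convolution branch for local detail with a channel-attention branch that re-weights features to suppress background and noise responses.

```python
import torch
import torch.nn as nn

class ConvAttnMix(nn.Module):
    """Generic convolution + attention mixing block (illustrative stand-in for CAMixing)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.local = nn.Sequential(                      # local branch: depthwise + pointwise conv
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )
        self.attn = nn.Sequential(                       # channel attention to down-weight clutter
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.SiLU(),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x + self.local(x) * self.attn(x)          # residual fusion of the two branches

print(ConvAttnMix(256)(torch.randn(1, 256, 32, 32)).shape)   # torch.Size([1, 256, 32, 32])
```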
The original IoU-aware Query Selection uses GIoU to optimize the query selection for the prediction boxes; GIoU introduces the smallest enclosing rectangle to reduce the loss between the ground-truth box and the prediction box for large targets. For small-target detection, however, a slight shift of the target can cause a significant change in its position information, and GIoU does not directly consider the aspect-ratio difference between the predicted and ground-truth boxes, so it cannot fully capture subtle changes in position. Therefore, Shape-IoU is used instead of GIoU for small targets; it computes the loss by focusing on the shape and scale of the bounding box itself, thereby improving the detection accuracy of small targets. The regression loss calculated by Shape-IoU is given by the following formulas.
$$L_{\text{Shape-IoU}} = 1 - IoU + distance^{shape} + 0.5 \times \Omega^{shape}$$

where $IoU$ is the intersection-over-union of the predicted and ground-truth boxes, and the shape weights are

$$ww = \frac{2 \times (w^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}, \qquad hh = \frac{2 \times (h^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}$$

where $scale$ is the scale (zoom) factor, whose value is related to the size and number of targets in the dataset. The distance and shape terms are

$$distance^{shape} = hh \times \frac{(x_c - x_c^{gt})^2}{c^2} + ww \times \frac{(y_c - y_c^{gt})^2}{c^2}$$

$$\Omega^{shape} = \sum_{t=w,h} \left(1 - e^{-\omega_t}\right)^{\theta}, \qquad \omega_w = hh \times \frac{|w - w^{gt}|}{\max(w, w^{gt})}, \quad \omega_h = ww \times \frac{|h - h^{gt}|}{\max(h, h^{gt})}$$

where $w$ and $h$ are the width and height of the predicted box, $w^{gt}$ and $h^{gt}$ are those of the ground-truth box, $(x_c, y_c)$ and $(x_c^{gt}, y_c^{gt})$ are the corresponding box centers, $c$ is the diagonal length of the smallest enclosing box, and $\theta$ controls the weight of the shape cost.
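A hedged PyTorch sketch of this loss is given below, assuming boxes in (cx, cy, w, h) format; the eps terms and box format are implementation assumptions, and scale defaults to the value 0.4 chosen later in the experiments.

```python
import torch

def shape_iou_loss(pred, gt, scale=0.4, theta=4.0, eps=1e-7):
    """Shape-IoU regression loss for boxes given as (cx, cy, w, h) tensors of shape (N, 4)."""
    px, py, pw, ph = pred.unbind(-1)
    gx, gy, gw, gh = gt.unbind(-1)

    # plain IoU
    inter_w = (torch.min(px + pw / 2, gx + gw / 2) - torch.max(px - pw / 2, gx - gw / 2)).clamp(min=0)
    inter_h = (torch.min(py + ph / 2, gy + gh / 2) - torch.max(py - ph / 2, gy - gh / 2)).clamp(min=0)
    inter = inter_w * inter_h
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # shape weights computed from the ground-truth box
    ww = 2 * gw.pow(scale) / (gw.pow(scale) + gh.pow(scale) + eps)
    hh = 2 * gh.pow(scale) / (gw.pow(scale) + gh.pow(scale) + eps)

    # squared diagonal of the smallest enclosing box
    cw = torch.max(px + pw / 2, gx + gw / 2) - torch.min(px - pw / 2, gx - gw / 2)
    ch = torch.max(py + ph / 2, gy + gh / 2) - torch.min(py - ph / 2, gy - gh / 2)
    c2 = cw.pow(2) + ch.pow(2) + eps

    dist_shape = hh * (px - gx).pow(2) / c2 + ww * (py - gy).pow(2) / c2
    omega_w = hh * (pw - gw).abs() / torch.max(pw, gw)
    omega_h = ww * (ph - gh).abs() / torch.max(ph, gh)
    shape_cost = (1 - torch.exp(-omega_w)).pow(theta) + (1 - torch.exp(-omega_h)).pow(theta)

    return 1 - iou + dist_shape + 0.5 * shape_cost

pred = torch.tensor([[50.0, 50.0, 10.0, 12.0]])
gt = torch.tensor([[52.0, 51.0, 11.0, 11.0]])
print(shape_iou_loss(pred, gt))
```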
An infrared small-target image is mainly composed of background, and only a small part is occupied by the target, so during training the background features are easier to learn than the target features. Therefore, the ATFL loss function is used to replace the original VFL classification loss: it decouples the target from the background and uses an adaptive mechanism to adjust the loss weights, forcing the model to allocate more attention to small-target features.
Table I lists the experimental environment used in this paper: the Ubuntu 18.04 operating system, an RTX 2080 Ti graphics card, and 16 GB of memory. The experiments basically use the parameters recommended for RT-DETR; the model is built with Python 3 and the PyTorch framework and trained with the standard SGD optimizer, with the batch size set to 8 and the number of epochs set to 100.
Experimental environment | Version |
---|---|
CPU | Intel Core i7-11800H |
GPU | NVIDIA GeForce RTX 2080 Ti |
Language | Python 3.8 |
Deep Learning Framework | PyTorch 1.14.0 |
CUDA | 11.8.0 |
The infrared aircraft small-target dataset [11] used in this experiment includes a total of 22 annotated data folders, covering scenes such as ground backgrounds, sky backgrounds, multiple aircraft, and aircraft moving away from or approaching the sensor. A total of 12,177 infrared images were selected, with a resolution of 256×256, a single channel, and a bit depth of 24. This dataset is widely used in tasks such as object detection and target tracking.
In this experiment, Precision (P), Recall (R) and Average Precision (AP) are mainly used as network evaluation indicators, and their mathematical expressions are as follows:
$$P = \frac{TP}{TP + FP}$$

where TP denotes true positives and FP denotes false positives. Precision measures the proportion of instances predicted as positive that are actually positive, and is used to evaluate the accuracy of the positive predictions.

Recall is the proportion of actual positive samples that are correctly predicted as positive, and its mathematical expression is:

$$R = \frac{TP}{TP + FN}$$

where FN denotes false negatives.

The mathematical expression for the average precision (AP) is:

$$AP = \int_0^1 P(R)\,dR$$

AP evaluates the model's Precision-Recall curve at different thresholds and is obtained by integrating that curve; it measures the average accuracy of the model's predictions across thresholds.
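As a small worked example of these metrics (the trapezoidal integration of the PR curve below is a simplification of standard AP interpolation schemes):

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    """Area under the precision-recall curve via the trapezoidal rule."""
    order = np.argsort(recalls)
    p = np.asarray(precisions, dtype=float)[order]
    r = np.asarray(recalls, dtype=float)[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2))

print(precision_recall(tp=80, fp=20, fn=25))             # (0.8, 0.7619...)
print(average_precision([1.0, 0.9, 0.7], [0.2, 0.5, 0.8]))
```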
In addition, to evaluate the effect of the image preprocessing algorithm, PSNR and SSIM were used to measure the similarity between the images before and after processing, the information entropy (Entropy) was used to reflect how well image information is retained, and the average gradient (AG) and edge intensity (EME) were used to evaluate image clarity.
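A sketch of computing several of these measures with scikit-image is shown below, assuming 8-bit grayscale images; the average-gradient definition used here is the common mean-gradient-magnitude form (an assumption, since the paper does not spell out its exact formula), and EME is omitted for the same reason.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
from skimage.measure import shannon_entropy

def average_gradient(img):
    """Mean gradient magnitude, a common sharpness measure (assumed definition)."""
    gy, gx = np.gradient(img.astype(np.float64))
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2)))

def evaluate(original, enhanced):
    return {
        "PSNR": peak_signal_noise_ratio(original, enhanced, data_range=255),
        "SSIM": structural_similarity(original, enhanced, data_range=255),
        "Entropy": shannon_entropy(enhanced),
        "AG": average_gradient(enhanced),
    }

rng = np.random.default_rng(0)
a = rng.integers(0, 256, (256, 256), dtype=np.uint8)
b = np.clip(a.astype(int) + rng.integers(-5, 6, a.shape), 0, 255).astype(np.uint8)
print(evaluate(a, b))
```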
Objective evaluation metrics are used to assess the quality of the enhanced images, and the effectiveness of the improvement is verified by comparison with other algorithms.
The final enhancement is shown in Figure 4.

Figure 4. Image enhancement effect
Table II lists the verification results.
In this paper, the Shape-IoU loss function [12] is introduced into the network model; it has a scale parameter s whose value is determined by comparing different settings on the dataset, and s is finally set to 0.4. The comparative experiments are shown in Table III.
Table II. Comparison of image enhancement algorithms
Algorithm | PSNR | SSIM | Entropy | AG | EME |
---|---|---|---|---|---|
Original image | – | – | 6.2850 | 41.9135 | 2.6388 |
SSR | 28.2970 | 0.85522 | 5.7150 | 44.2675 | 2.8967 |
MSR | 28.7772 | 0.8676 | 6.2176 | 44.9135 | 2.8932 |
DDE | 36.0989 | 0.9679 | 6.4594 | 44.2206 | 2.6857 |
Bilateral filtering | 34.6621 | 0.8395 | 6.3155 | 21.7407 | 1.4891 |
DDE+MSR | 28.7581 | 0.8436 | 6.2793 | 46.8969 | 2.8882 |
Table III. Comparison of different values of the scale parameter s
s | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 1.0 |
---|---|---|---|---|---|---|---|
mAP (%) | 83.5 | 83.4 | 84.9 | 84.8 | 83.5 | 83.4 | |
In order to verify the effectiveness of each improvement to the network, six sets of comparative experiments were carried out on the dataset based on the RT-DETR-L network, with uniform environment and parameter settings. The experimental results are shown in Table IV, where "√" indicates that the corresponding method was used. As can be seen from Table IV, after adding the EMA and CAMixing modules and improving the loss functions, the algorithm achieves the best detection accuracy, with an AP value 3.2% higher than that of the original RT-DETR, which demonstrates the effectiveness of the improved modules in this paper. Figure 5 shows the AP curves of the improved method more intuitively: detection accuracy fluctuates considerably due to the characteristics of the dataset, but the improved network is noticeably more stable and its accuracy is higher. Figure 6 compares the detection results of the original and improved networks; the prediction scores increase significantly, which verifies the effectiveness of the improvements and shows that the improved method is better at detecting dense targets.

Figure 5. AP change curve

Figure 6. Improved detection results
Table IV. Ablation experiments of the improved modules
| EMA | CAMixing | Shape-IoU | ATFL | P/% | R/% | AP/% | Params/10^6 |
|---|---|---|---|---|---|---|---|
| | | | | 73.2 | 75.2 | 84.6 | 32.81 |
| √ | | | | 72.2 | 76.3 | 85.5 | 33.40 |
| | √ | | | 74.8 | 75.2 | 85.6 | 34.97 |
| √ | √ | | | 74.1 | 75.0 | 86.2 | 35.23 |
| | | √ | √ | 75.5 | 76.2 | 85.9 | 32.81 |
| √ | √ | √ | √ | 75.4 | 77.1 | 87.8 | 35.23 |
To address the challenges of target detection in infrared imaging scenes, this paper effectively enhances the contrast and detail visibility of infrared images by combining the DDE and MSR algorithms, improves the RT-DETR network structure by introducing the EMA attention mechanism and the CAMixing convolution-attention module, and thereby significantly improves the model's ability to detect small targets and its overall detection accuracy. At the same time, the Shape-IoU and ATFL loss functions are combined to improve the regression of small targets under infrared conditions. Experimental results show that the improved algorithm outperforms the original RT-DETR in detecting weak and small infrared targets in complex backgrounds, with the mean average precision (mAP) increased by 3.2%, showing good robustness and adaptability. However, richer training data, especially infrared images covering different weather, time, and terrain conditions, together with further handling of the contrast between target and background, is needed to enhance the generalization ability of the model. In addition, real-time performance is very important in practical applications, and detection speed can be improved by optimizing the model structure and algorithm, making the method better suited to real-time infrared object detection scenarios.