Abstract:
Unmanned aerial vehicle (UAV) aerial image recognition is widely used in areas such as urban planning and road monitoring. However, existing methods have limitations. Traditional hand-crafted feature extraction is insufficiently discriminative. Mainstream convolutional neural networks exhibit high missed-detection rates for small-sized targets, while pure transformer networks rely on large datasets and have poor real-time performance. Furthermore, aerial datasets are often limited, and small-sized targets are easily misclassified because they occupy few pixels and have weak textures. To address the challenge of recognizing small-sized targets in aerial images, an improved algorithm based on YOLOv5 is proposed, which effectively enhances both detection accuracy and real-time performance.
The improvements focus on three aspects. First, a patch-transformer component was introduced into the backbone network. Drawing on the Vision Transformer (ViT) partitioning scheme, the image was divided into 16×16 patches and processed by a transformer encoder to reconstruct the feature maps. Meanwhile, the number of backbone layers was reduced to balance computational load and enhance feature extraction capability for small-sized targets. Second, the path aggregation network (PAN) neck was improved into SAM-PAN by adding a P2 feature layer, reducing the minimum receptive field to 4×4, and by introducing a spatial attention module (SAM) to strengthen the focus on key features of small-sized targets. Third, a class-FL loss function was proposed, incorporating target size information into the focal loss to increase attention to misclassified small-sized targets during training.
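The class-FL idea described above can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so the size factor below (a weight that grows as the box area shrinks relative to the image, with an assumed exponent `beta`) is an illustrative assumption layered on top of the standard focal loss:

```python
import math

def class_fl(p_t, box_area, img_area, alpha=0.25, gamma=2.0, beta=1.0):
    """Sketch of a size-aware focal loss ("class-FL").

    Standard focal loss: FL = -alpha * (1 - p_t)**gamma * log(p_t),
    where p_t is the predicted probability of the true class.
    A size weight up-weights small targets: the smaller the box
    relative to the image, the closer the weight is to 1. The exact
    form used in the paper is not stated; this is an assumption.
    """
    size_weight = (1.0 - box_area / img_area) ** beta  # ~1 for tiny boxes
    return -alpha * size_weight * (1.0 - p_t) ** gamma * math.log(p_t)

# A small target (16 px^2) incurs a larger loss than a large one
# (90,000 px^2) at the same confidence, steering training toward
# small-target errors.
small_loss = class_fl(0.6, box_area=16, img_area=640 * 640)
large_loss = class_fl(0.6, box_area=300 * 300, img_area=640 * 640)
```

With `beta = 0` the size weight is constant and the function reduces to the ordinary focal loss, so the modification only changes how errors are weighted, not the loss family itself.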
Experiments were conducted on the VisDrone2019 dataset (10 target categories, more than 260,000 video frames, and over 10,000 static images) in an environment with an Intel i9-13900H CPU and an RTX 4090D GPU. The results showed that the improved network achieved a mean average precision (mAP) of 0.408, which is 0.071 higher than YOLOv5s and 0.029 higher than YOLOv5m. The total processing time was 18.9 ms, falling between the two comparison models and meeting real-time engineering requirements. Ablation experiments demonstrated that patch-transformer, SAM-PAN, and class-FL improved mAP by 3.4%, 1.5%, and 0.9% respectively, and that their combined use achieved the best performance.
In conclusion, the algorithm effectively addresses small-sized target recognition through structural optimization and loss function improvement, delivering both accuracy and real-time performance and offering significant engineering reference value for aerial image recognition. Future work will further explore CNN-transformer fusion for additional performance gains.