Abstract:
Unmanned aerial vehicle (UAV) aerial image recognition is widely used in areas such as urban planning and road monitoring. However, existing methods have limitations. Traditional hand-crafted feature extraction is insufficiently discriminative. Mainstream convolutional neural networks exhibit high missed-detection rates for small-sized targets, while pure transformer networks rely on large datasets and have poor real-time performance. Furthermore, aerial datasets are often limited, and small-sized targets are easily misclassified because they occupy few pixels and have weak textures. To address the challenge of recognizing small-sized targets in aerial images, an improved algorithm based on YOLOv5 is proposed, which effectively enhances both detection accuracy and real-time performance.
The improvements focus on three aspects. First, a patch-transformer component was introduced into the backbone network. Drawing on the Vision Transformer (ViT) partitioning scheme, the image was divided into 16×16 patches and processed by a transformer encoder to reconstruct the feature maps. Meanwhile, the number of backbone layers was reduced to balance computational load and enhance feature extraction capability for small-sized targets. Second, the path aggregation network (PAN) neck was improved into SAM-PAN by adding a P2 feature layer, reducing the minimum receptive field to 4×4, and by introducing a spatial attention module (SAM) to strengthen the focus on key features of small-sized targets. Third, a class-FL loss function was proposed, incorporating target size information into the focal loss to increase attention to misclassified small-sized targets during training.
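The class-FL idea described above can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so the size factor below (a weight that grows as the box area shrinks relative to the image, with an assumed exponent `beta`) is an illustrative assumption layered on top of the standard focal loss:

```python
import math

def class_fl(p_t, box_area, img_area, alpha=0.25, gamma=2.0, beta=1.0):
    """Sketch of a size-aware focal loss ("class-FL").

    Standard focal loss: FL = -alpha * (1 - p_t)**gamma * log(p_t),
    where p_t is the predicted probability of the true class.
    A size weight up-weights small targets: the smaller the box
    relative to the image, the closer the weight is to 1. The exact
    form used in the paper is not stated; this is an assumption.
    """
    size_weight = (1.0 - box_area / img_area) ** beta  # ~1 for tiny boxes
    return -alpha * size_weight * (1.0 - p_t) ** gamma * math.log(p_t)

# A small target (16 px^2) incurs a larger loss than a large one
# (90,000 px^2) at the same confidence, steering training toward
# small-target errors.
small_loss = class_fl(0.6, box_area=16, img_area=640 * 640)
large_loss = class_fl(0.6, box_area=300 * 300, img_area=640 * 640)
```

With `beta = 0` the size weight is constant and the function reduces to the ordinary focal loss, so the modification only changes how errors are weighted, not the loss family itself.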
Experiments were conducted on the VisDrone2019 dataset (10 target categories, more than 260,000 video frames, and over 10,000 static images) in an environment with an Intel i9-13900H CPU and an RTX 4090D GPU. The results showed that the improved network achieved a mean average precision (mAP) of 0.408, which is 0.071 higher than YOLOv5s and 0.029 higher than YOLOv5m. The total processing time was 18.9 ms, falling between the two comparison models and meeting real-time engineering requirements. Ablation experiments demonstrated that patch-transformer, SAM-PAN, and class-FL improved mAP by 3.4%, 1.5%, and 0.9% respectively, and that their combined use achieved the best performance.
In conclusion, the algorithm effectively addresses small-sized target recognition through structural optimization and loss function improvement, delivering both accuracy and real-time performance and offering significant engineering reference value for aerial image recognition. Future work will further explore CNN-transformer fusion for additional performance gains.