Abstract:
To address the weak feature representation and inaccurate candidate boxes that arise when existing methods ignore 3-D structural context, this study proposes a point-cloud 3-D object detection method based on the voxel-keypoint feature aggregation network (VKMFANet).
Existing point-cloud 3-D object detection methods have clear limitations. Point-based methods achieve high accuracy but incur high computational cost and poor real-time performance. Voxel-based methods are efficient but lose 3-D structural context during feature conversion, which degrades accuracy. Point-voxel aggregation methods improve accuracy yet still suffer from feature-conversion loss and low sampling efficiency. VKMFANet therefore adopted a two-stage architecture that combines voxel-keypoint encoding with multi-level feature aggregation to enhance detection performance.
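As a rough illustration of this two-stage flow, the following minimal PyTorch-style skeleton is a sketch only, written under assumptions: the class and attribute names (TwoStageDetector, backbone, rpn, refine_head) are hypothetical stand-ins and do not reflect the authors' implementation.

import torch.nn as nn

class TwoStageDetector(nn.Module):
    """Hypothetical skeleton of the two-stage flow described in the abstract."""

    def __init__(self, backbone, rpn, refine_head):
        super().__init__()
        self.backbone = backbone        # stage 1: 3-D sparse conv features + BEV projection
        self.rpn = rpn                  # stage 1: candidate boxes from the BEV map
        self.refine_head = refine_head  # stage 2: multi-level feature aggregation + refinement

    def forward(self, points):
        voxel_feats, bev_map = self.backbone(points)  # voxelize and extract features
        proposals = self.rpn(bev_map)                 # generate candidate boxes
        # stage 2 pools voxel, BEV, and keypoint features per proposal,
        # then classifies and regresses refined boxes
        return self.refine_head(points, voxel_feats, bev_map, proposals)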
In the first stage, a 3-D sparse convolution module extracted point-cloud features and projected them onto a bird’s-eye view, after which a region proposal network generated candidate boxes. The second stage focused on extracting multi-level features from candidate boxes: (a) A 3-D sparse-convolution feature extraction module aggregated contextual features of neighboring voxels via voxel queries to preserve 3-D structure. (b) A bird’s-eye-view feature pooling module aligned coordinates with an affine transformation to reduce information loss. (c) An internal point-cloud spatial-structure feature extraction module introduced a multi-layer self-attention keypoint sampling method, employed the farthest point sampling (FPS) algorithm to select keypoints, and encoded spatial relationships via multi-scale spherical queries (sketched below). (d) A convolutional attention aggregation module fused the above features through channel- and point-level attention to produce the final features used for classification and bounding-box regression.
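The sketch below illustrates two operations named in stage two: FPS to select keypoints and fixed-radius "spherical" (ball) queries at several scales to gather each keypoint's neighbors. It is a minimal NumPy example; the radii, neighbor counts, and point-cloud shapes are illustrative assumptions, not the paper's settings.

import numpy as np

def farthest_point_sampling(points, num_keypoints):
    """Iteratively pick the point farthest from the already-chosen keypoints."""
    n = points.shape[0]
    chosen = np.zeros(num_keypoints, dtype=np.int64)
    dist = np.full(n, np.inf)
    chosen[0] = 0  # arbitrary seed point
    for i in range(1, num_keypoints):
        # update each point's distance to its nearest chosen keypoint
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        chosen[i] = int(np.argmax(dist))
    return chosen

def multi_scale_ball_query(points, keypoints, radii=(0.4, 0.8, 1.6), max_neighbors=16):
    """For each radius, collect up to max_neighbors point indices inside that sphere."""
    groups = []
    for r in radii:
        per_scale = []
        for kp in keypoints:
            d = np.linalg.norm(points - kp, axis=1)
            per_scale.append(np.flatnonzero(d < r)[:max_neighbors])
        groups.append(per_scale)
    return groups

# toy usage on a random cloud (2048 points, xyz only)
cloud = np.random.rand(2048, 3) * 10.0
kp_idx = farthest_point_sampling(cloud, 64)
neighborhoods = multi_scale_ball_query(cloud, cloud[kp_idx])
print(len(neighborhoods), len(neighborhoods[0]))  # 3 scales, 64 keypoints

In the method as described, the features of each multi-scale neighborhood would then be encoded and passed to the attention-based aggregation module; that step is omitted here.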
Experiments were conducted on the KITTI and Waymo datasets with an Intel Xeon CPU and an NVIDIA RTX 3090 GPU, under Python 3.9 and PyTorch 1.10.1. The results showed that VKMFANet achieved 93.88% 3-D detection average precision (AP) and 96.27% bird’s-eye-view (BEV) detection AP for cyclists at the Easy level on the KITTI dataset, outperforming mainstream methods such as PV-RCNN. On the Waymo dataset at Level 1 difficulty, the mean average precisions (mAPs) for cars, pedestrians, and cyclists were 58.1%, 68.67%, and 63.38%, respectively, and the advantage was maintained at Level 2 difficulty. Ablation experiments verified the effectiveness of each feature module, and the method improved processing speed by 16 Hz over PV-RCNN, balancing accuracy and efficiency.
The proposed method performs well in long-range, sparse scenes, providing an efficient solution for 3-D object detection in autonomous driving and related fields, with clear practical value.