
3-D semantic segmentation for autonomous driving based on a hierarchical superpoint self-attention model

  • Abstract: To address the problems of ambiguous object boundaries and global semantic confusion in semantic segmentation for autonomous vehicles, a hierarchically fused point cloud semantic segmentation method based on a hierarchical superpoint self-attention model is proposed. In this method, a boundary profile enhancement module strengthens boundary features through multi-scale convolutions, a dual attention mechanism models the coordinate differences and long-range dependencies of superpoint features both locally and globally, and a hierarchical fusion mechanism fuses the fine-grained features and topological relationships of superpoints across adjacent layers. Extensive experiments and visual comparisons were conducted on three large-scale point cloud datasets: S3DIS, KITTI-360, and DALES. The results show that the proposed method achieves high mean intersection-over-union scores of 67.6%, 61.7%, and 79.2% on the three datasets, respectively, with an inference time of 1.8 s. The method significantly improves point cloud segmentation accuracy while maintaining good robustness and inference efficiency, providing vehicles with accurate semantic information across multiple driving scenarios and offering essential semantic support for subsequent tasks such as high-definition map localization.
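    The multi-scale convolutional boundary enhancement idea mentioned above can be illustrated with a minimal sketch. The module name BoundaryProfileEnhancement, the 1-D convolution form, the kernel sizes, and the channel width below are illustrative assumptions for exposition, not the paper's actual implementation.

        import torch
        import torch.nn as nn

        class BoundaryProfileEnhancement(nn.Module):
            """Illustrative sketch of a multi-scale boundary enhancement block.

            Superpoint features (channels x num_superpoints) are filtered with
            1-D convolutions of several kernel sizes; the multi-scale responses
            are fused and added back as a residual, emphasizing edge-like
            variations. Kernel sizes and channel width are assumptions made for
            illustration only.
            """

            def __init__(self, channels: int = 64, kernel_sizes=(1, 3, 5)):
                super().__init__()
                self.branches = nn.ModuleList(
                    nn.Conv1d(channels, channels, k, padding=k // 2) for k in kernel_sizes
                )
                self.fuse = nn.Conv1d(channels * len(kernel_sizes), channels, 1)
                self.act = nn.ReLU(inplace=True)

            def forward(self, feats: torch.Tensor) -> torch.Tensor:
                # feats: (batch, channels, num_superpoints)
                multi_scale = torch.cat([self.act(b(feats)) for b in self.branches], dim=1)
                return feats + self.fuse(multi_scale)  # residual keeps the base features

        # Example: enhance features of 1024 superpoints with 64 channels.
        x = torch.randn(2, 64, 1024)
        print(BoundaryProfileEnhancement()(x).shape)  # torch.Size([2, 64, 1024])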

     

    Abstract:
    The reliability of autonomous driving systems fundamentally depends on precise environmental perception, where advanced algorithmic processing of sensor data generates comprehensive and accurate environmental information to ensure robust support for downstream path planning, decision-making, and control modules, thereby guaranteeing operational safety. Current perception systems predominantly employ vision- and LiDAR-based approaches: camera-based methods are inherently constrained to 2-D interpretation and are highly sensitive to light intensity, whereas LiDAR's 3-D point cloud data effectively circumvents these limitations through an inherently illumination-invariant spatial representation. Point clouds encapsulate rich geometric, spatial, and radiometric information, enabling robust large-scale environmental perception; however, existing LiDAR-based segmentation methods fail to simultaneously achieve precise local boundary delineation and globally consistent semantic segmentation in autonomous driving scenarios. To address the challenges of domain adaptation in autonomous driving scenarios and resolve the dual issues of ambiguous object boundaries and global semantic confusion in point cloud segmentation, this work proposes a hierarchical superpoint self-attention-based method with multi-level feature fusion capabilities: a hierarchical superpoint mechanism-based multi-scale fusion framework with self-attention for point cloud semantic segmentation. The hierarchical superpoint partitioning process was implemented within a U-shaped encoder-decoder architecture that interactively combined: (i) a dual partial attention (DPA) module that modeled both local-global coordinate variations and long-range dependencies of superpoint features, (ii) a boundary profile enhancement (BPE) module that utilized multi-scale convolutions to refine edge features, and (iii) a hierarchical feature fusion (HFF) module that integrated fine-grained superpoint characteristics and topological relationships across adjacent layers. The hierarchical partitioning process progressively merged adjacent superpoints with similar features into larger units, systematically reducing the number of superpoints while increasing intra-superpoint semantic purity at each level. This bottom-up aggregation propagated coherent feature representations through the network hierarchy. In the final stage, a classifier transformed the refined superpoint features into semantic labels, generating the final segmentation output.
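    A minimal sketch of the bottom-up superpoint aggregation and final classification described above follows, under the assumption that child-to-parent merging can be expressed as index-based mean pooling. The helper pool_to_parents, the toy hierarchy, the 64-dimensional features, and the linear classifier are hypothetical illustrations rather than the authors' code.

        import torch
        import torch.nn as nn

        # Illustrative sketch only; the pooling rule and sizes are assumptions,
        # not the paper's actual hierarchical partitioning design.
        def pool_to_parents(feats: torch.Tensor, parent: torch.Tensor) -> torch.Tensor:
            """Mean-pool child superpoint features into their parent superpoints.

            feats:  (num_children, C) features at the current level
            parent: (num_children,)   index of each child's parent superpoint
            returns (num_parents, C)  aggregated parent features
            """
            num_parents = int(parent.max().item()) + 1
            summed = torch.zeros(num_parents, feats.size(1)).index_add_(0, parent, feats)
            counts = torch.zeros(num_parents).index_add_(0, parent, torch.ones(len(parent)))
            return summed / counts.clamp(min=1).unsqueeze(1)

        # Toy hierarchy: 6 fine superpoints -> 3 coarser -> 2 coarsest; 13 classes as in S3DIS.
        feats = torch.randn(6, 64)
        level1 = torch.tensor([0, 0, 1, 1, 2, 2])   # assumed grouping of similar neighbours
        level2 = torch.tensor([0, 0, 1])
        coarse = pool_to_parents(pool_to_parents(feats, level1), level2)
        classifier = nn.Linear(64, 13)
        print(classifier(coarse).shape)  # torch.Size([2, 13]) -> per-superpoint class logits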
    Visual comparisons in Figures 7, 8, and 9 between the proposed Partition Demarcation Lift-Superpoint Transformer (PDL-SPT) and the baseline Superpoint Transformer (SPT) on the S3DIS, KITTI-360, and DALES datasets, respectively, demonstrated PDL-SPT's superior performance in both boundary delineation and large-scale semantic segmentation tasks. Tables 1-3 presented the detailed performance improvements of PDL-SPT across different categories, achieving high mean intersection over union (mIoU) scores of 67.6% on S3DIS, 61.7% on KITTI-360, and 79.2% on DALES, respectively, while Table 4 showed a 0.2 s inference time reduction versus SPT. The proposed PDL-SPT method demonstrated significant performance gains across 11 of the 13 categories in the indoor S3DIS dataset, with notable segmentation accuracy improvements of 8.9% for columns, 2.2% for sofas, and 3.4% for walls, particularly enhancing the recognition of critical structural elements such as load-bearing pillars, crash barriers, walls, and other markers. The KITTI-360 dataset is designed for autonomous driving point cloud segmentation in complex urban road scenarios.
    The PDL-SPT method achieved accuracy improvements in 8 of the 15 categories in the urban-focused KITTI-360 dataset, with segmentation gains of 1.2% for fences, 8.8% for traffic lights, 1.4% for traffic signs, 1.5% for pedestrians, and 8.2% for motorcycles. KITTI-360's visualization samples cover common complex urban scenarios including straight streets, street corners, and intersections, where dense environments particularly challenge local boundary delineation. Experimental results confirmed PDL-SPT's effectiveness in handling segmentation tasks under such complex traffic conditions. The DALES dataset, focusing on rural and suburban autonomous driving point cloud segmentation, featured sparse scenes that particularly challenged large-scale semantic segmentation capabilities, as rural roads and suburbs typically contain numerous utility poles and power lines; precise segmentation of these objects enables reliable detection by autonomous vehicles, effectively preventing potential collisions. The PDL-SPT method demonstrated accuracy improvements across all 8 categories in DALES, with segmentation gains of 4.2% for trucks, 4.7% for utility poles, and 0.7% for power lines. In summary, PDL-SPT demonstrated superior performance in both boundary delineation and large-scale semantic segmentation for autonomous driving scenarios. The hierarchical superpoint mechanism-based multi-scale fusion architecture with self-attention effectively addresses boundary ambiguity and global semantic confusion in point cloud segmentation across indoor, urban, rural, and suburban driving scenarios. Experimental results demonstrate that PDL-SPT significantly enhances segmentation accuracy for key object categories in autonomous driving, including columns and walls indoors, fences and traffic signs in urban scenarios, and dynamic moving objects, utility facilities, and large vehicles in rural scenarios. Meanwhile, the model's inference time is reduced, providing vehicles with more reaction time in complex and dynamic traffic environments.
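    For reference, the mean intersection over union (mIoU) figures quoted above are averages of per-class IoU values. The following is a generic sketch of how such a score can be computed from predicted and ground-truth labels; it is a standard metric implementation, not the authors' evaluation code.

        import numpy as np

        # Generic mIoU metric sketch (not the authors' evaluation pipeline).
        def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
            """Mean intersection over union across classes present in pred or gt."""
            ious = []
            for c in range(num_classes):
                inter = np.logical_and(pred == c, gt == c).sum()
                union = np.logical_or(pred == c, gt == c).sum()
                if union > 0:                      # skip classes absent from both
                    ious.append(inter / union)
            return float(np.mean(ious))

        # Toy example with 3 classes.
        gt   = np.array([0, 0, 1, 1, 2, 2])
        pred = np.array([0, 1, 1, 1, 2, 0])
        print(f"mIoU = {mean_iou(pred, gt, 3):.3f}")  # mIoU = 0.500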

     
