Purpose of this paper: to overcome the drawbacks introduced by random sampling, it proposes a local feature aggregation module.
We study the problem of efficient semantic segmentation of large-scale 3D point clouds. By relying on expensive sampling techniques or computationally heavy pre/post-processing steps, most existing approaches can only be trained on and operate over small-scale point clouds. In this paper, we introduce RandLA-Net, an efficient and lightweight neural architecture that directly infers per-point semantics for large-scale point clouds. The key to our approach is to use random point sampling instead of more complex point selection methods. Although remarkably computation- and memory-efficient, random sampling can discard key features by chance (the flaw). To overcome this, we introduce a novel local feature aggregation module to progressively increase the receptive field of each 3D point, thereby effectively preserving geometric details. Extensive experiments show that our RandLA-Net can process 1 million points in a single pass, up to 200× faster than existing approaches. Moreover, our RandLA-Net clearly surpasses state-of-the-art methods on two large-scale benchmarks for semantic segmentation: Semantic3D and SemanticKITTI.
Efficient semantic segmentation of large-scale 3D point clouds is a fundamental and essential capability for real-time intelligent systems, such as autonomous driving and augmented reality. A key challenge is that the raw point clouds acquired by depth sensors are typically irregularly sampled, unstructured and unordered. Although deep convolutional networks show excellent performance in structured 2D computer vision tasks, they cannot be directly applied to this type of unstructured data.
Recently, the pioneering work PointNet [37] has emerged as a promising approach for directly processing 3D point clouds. It learns per-point features using shared multi-layer perceptrons (MLPs). This is computationally efficient but fails to capture wider context information for each point. To learn richer local structures, many dedicated neural modules have been subsequently and rapidly introduced. These modules can be generally categorized as:
Although these approaches achieve impressive results for object recognition and semantic segmentation, almost all of them are limited to extremely small 3D point clouds (e.g., 4k points or 1×1 meter blocks) and cannot be directly extended to larger point clouds (e.g., millions of points and up to 200×200 meters).
The reasons for this limitation are three-fold.
A handful of recent works have started to tackle the task of directly processing large-scale point clouds.
In this paper, we aim to design a memory and computationally efficient neural architecture, which is able to directly process large-scale 3D point clouds in a single pass, without requiring any pre/post-processing steps such as voxelization, block partitioning or graph construction. However, this task is extremely challenging as it requires:
To this end, we first systematically demonstrate that random sampling is a key enabler for deep neural networks to efficiently process large-scale point clouds. However, random sampling can discard key semantic information, especially for objects with low point densities. To counter the potentially detrimental impact of random sampling, we propose a new and efficient local feature aggregation module to capture complex local structures over progressively smaller point-sets.
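The paper's local feature aggregation module preserves geometric detail by explicitly encoding, for every surviving point, the relative positions of its neighbours before learning features from them. The snippet below is a minimal NumPy sketch of that neighbourhood-encoding idea only; the function name, the exact concatenation order, and the brute-force kNN are my own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def local_spatial_encoding(points, features, k=16):
    """Hypothetical sketch: for each point, gather its k nearest neighbours
    and build a geometric descriptor from absolute/relative positions and
    distances, concatenated with the neighbours' input features.

    points:   (N, 3) xyz coordinates
    features: (N, D) per-point features
    returns:  (N, k, 10 + D) augmented neighbour features
    """
    # Brute-force kNN; a real pipeline would use a KD-tree or a GPU op.
    dist2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (N, N)
    idx = np.argsort(dist2, axis=1)[:, :k]                            # (N, k)

    neigh_xyz = points[idx]                                  # (N, k, 3)
    rel_xyz = neigh_xyz - points[:, None, :]                 # relative positions
    rel_dist = np.linalg.norm(rel_xyz, axis=-1, keepdims=True)  # (N, k, 1)

    center = np.broadcast_to(points[:, None, :], neigh_xyz.shape)
    geo = np.concatenate([center, neigh_xyz, rel_xyz, rel_dist], axis=-1)
    return np.concatenate([geo, features[idx]], axis=-1)
```

Because the geometric descriptor is recomputed at every layer over an increasingly subsampled point set, each remaining point aggregates information from an ever-larger spatial neighbourhood, which is how the dropped points' context is recovered.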
Amongst existing sampling methods, farthest point sampling and inverse density sampling are the most frequently used for small-scale point clouds [38, 54, 29, 64, 15]. As point sampling is such a fundamental step within these networks, we investigate the relative merits of different approaches in Section 3.2, both by examining their computational complexity and empirically by measuring their memory consumption and processing time. From this, we see that the commonly used sampling methods limit scaling towards large point clouds, and act as a significant bottleneck to real-time processing. However, we identify random sampling as by far the most suitable component for large-scale point cloud processing as it is fast and scales efficiently. Random sampling is not without cost, because prominent point features may be dropped by chance and it cannot be used directly in existing networks without incurring a performance penalty. To overcome this issue, we design a new local feature aggregation module in Section 3.3, which is capable of effectively learning complex local structures by progressively increasing the receptive field size in each neural layer.
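The complexity gap the paragraph refers to is easy to see in code: random sampling is a single uniform draw, whereas farthest point sampling visits the whole cloud once per selected point. The following is a small self-contained sketch (not the paper's benchmark code) that contrasts the two on a synthetic cloud; the sizes chosen are arbitrary.

```python
import time
import numpy as np

def random_sampling(points, m):
    """Constant work per sampled point: draw m indices uniformly at random."""
    return points[np.random.choice(len(points), m, replace=False)]

def farthest_point_sampling(points, m):
    """O(m*N): iteratively pick the point farthest from the selected set."""
    n = len(points)
    selected = np.zeros(m, dtype=np.int64)
    min_dist = np.full(n, np.inf)
    selected[0] = np.random.randint(n)
    for i in range(1, m):
        d = ((points - points[selected[i - 1]]) ** 2).sum(-1)
        min_dist = np.minimum(min_dist, d)   # distance to nearest selected point
        selected[i] = np.argmax(min_dist)    # farthest remaining point
    return points[selected]

if __name__ == "__main__":
    pts = np.random.rand(10**5, 3).astype(np.float32)   # 100k synthetic points
    for fn in (random_sampling, farthest_point_sampling):
        t0 = time.time()
        fn(pts, len(pts) // 4)                           # keep 25% of the points
        print(f"{fn.__name__}: {time.time() - t0:.3f} s")
```

Even at this modest scale the quadratic behaviour of farthest point sampling dominates, while random sampling stays effectively free, which is why it is the only option that scales to millions of points.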
In particular, for each 3D point, we firstly introduce a local spatial encoding (LocSE) unit to explicitly preserve local geometric structures. Secondly, we leverage attentive pooling to automatically keep the useful local features. Thirdly, we stack multiple LocSE units and attentive poolings as a dilated residual block, greatly increasing the effective receptive field for each point.
Note that all these neural components are implemented as shared MLPs, and are therefore remarkably memory- and computationally efficient.
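A shared MLP is commonly realised as a 1×1 convolution, so one small weight matrix is reused across every point and every neighbour instead of allocating per-point parameters. The sketch below shows this standard idiom in PyTorch; it is a generic illustration of the idea, not the exact layer definition from the released RandLA-Net code, and the channel sizes are made up.

```python
import torch
import torch.nn as nn

class SharedMLP(nn.Module):
    """Apply the same small MLP independently to every point (and every
    neighbour) via a 1x1 convolution over a (B, C, N, K) tensor."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # x: (batch, in_channels, num_points, num_neighbours)
        return self.net(x)

# Example: 4096 points, 16 neighbours each, 10-d features mapped to 32-d.
mlp = SharedMLP(10, 32)
out = mlp(torch.randn(1, 10, 4096, 16))
print(out.shape)  # torch.Size([1, 32, 4096, 16])
```

Because the parameter count depends only on the channel widths, not on the number of points, memory usage stays flat as the cloud grows, which is what makes the whole architecture lightweight.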
Overall, being built on the principles of simple random sampling and an effective local feature aggregator, our efficient neural architecture, named RandLA-Net, not only is up to 200× faster than existing approaches on large-scale point clouds, but also surpasses the state-of-the-art semantic segmentation methods on both the Semantic3D [16] and SemanticKITTI [3] benchmarks. Figure 1 shows qualitative results of our approach. Our key contributions are: