
Localization paper reading series, RoNIN (2): Robust Neural Inertial Navigation in the Wild: Benchmark, Evaluations, and New Methods

姜乐语
2023-12-01

0. Abstract

0.1 Sentence-by-sentence reading

This paper sets a new foundation for data-driven inertial navigation research, where the task is the estimation of positions and orientations of a moving subject from a sequence of IMU sensor measurements.
(Roughly speaking: use inertial sensors to dead-reckon the current position.)

More concretely, the paper presents

  1. a new benchmark containing more than 40 hours of IMU sensor data from 100 human subjects with ground-truth 3D trajectories under natural human motions;
    (The paper provides a new dataset.)

  2. novel neural inertial navigation architectures, making significant improvements for challenging motion cases; and

  3. qualitative and quantitative evaluations of the competing methods over three inertial navigation benchmarks.

We will share the code and data to promote further research.

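0.2 Summary

  • 1. The paper provides a dataset and open-source code to support further research.
  • 2. The paper proposes new neural architectures for inertial navigation. (Data-driven methods such as RIDI and IONet already existed, so the novelty lies in the architectures and their gains on challenging motions, not in applying learning for the first time.)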

1. Introduction

1.1 Sentence-by-sentence reading

Paragraph 1 (IMUs are important and worth studying)

An inertial measurement unit (IMU), often a combination of accelerometers, gyroscopes, and magnetometers, plays an important role in a wide range of navigation applications.

In Virtual Reality, IMU sensor fusion produces real-time orientations of head-mounted displays.
(VR needs IMUs as well.)

In Augmented Reality applications (e.g., Apple ARKit [1], Google ARCore [7], or Microsoft HoloLens [16]), IMU augments SLAM [17, 14, 6] by resolving scale ambiguities and providing motion cues in the absence of visual features.
(IMUs help in many systems. I am less familiar with the others, but in SLAM the IMU usually acts as a refinement term that corrects the estimate to some degree, while the system still relies mainly on image geometry. In a multi-sensor fusion system, if the IMU is the only sensor still working, the system typically treats tracking as lost.)

UAVs, autonomous cars, humanoid robots, and smart vacuum cleaners are other emerging domains, utilizing IMUs for enhanced navigation, control, and beyond.

Paragraph 2 (inertial navigation is a very attractive form of navigation; several advantages are listed)

Inertial navigation is the ultimate form of IMU-based navigation, whose task is to estimate positions and orientations of a moving subject only from a sequence of IMU sensor measurements (See Fig. 1).

Inertial navigation has been a dream technology for academic researchers and industrial engineers, as IMUs

  1. are energy-efficient, capable of running 24 hours a day;

  2. work anywhere even inside pockets; and

  3. are in every smartphone, which everyone carries every day all the time.

Paragraph 3 (existing methods are subject to many constraints)

Most existing inertial navigation algorithms require unrealistic constraints that are incompatible with everyday smartphone usage scenarios.

For example, an IMU must be attached to a foot to enable the zero speed update heuristic (i.e., a device speed becomes 0 every time a foot touches the ground) [11].

Step counting methods assume that the IMU is rigidly attached to a body and a subject must walk forward so that the motion direction becomes a constant in the device coordinate frame [3].

Paragraph 4 (earlier efforts to relax these constraints)

Data-driven approaches have recently made a breakthrough in loosening these constraints [22, 5] where the acquisition of IMU sensor data and ground-truth motion trajectories allows supervised learning of direct motion parameters (e.g., a velocity vector from IMU sensor history).

Paragraph 5 (contributions of this paper)

This paper seeks to take data-driven inertial navigation research to the next level via the following three contributions.

• The paper provides the largest inertial navigation database consisting of more than 42.7 hours of IMU and ground-truth 3D motion data from 100 human subjects. Our data acquisition protocol allows users to handle smartphones naturally as in real day-to-day activities.

• The paper presents novel neural architectures for inertial navigation, making significant improvements for challenging motion cases over the existing best method.

• The paper presents extensive qualitative and quantitative evaluations of existing baselines and state-of-the-art methods on the three benchmarks.

Paragraph 6 (the authors will release the code and the dataset)

We will share the code and data to promote further research in a hope to establish an ultimate anytime anywhere navigation system for everyone’s smartphone.
Open-source code
Dataset
Official website

2. Related Work

We group inertial navigation algorithms into three categories based on their use of priors.

2.1 Physics-based (no priors)

Classical inertial integration faces many limitations

IMU double integration is a simple idea for inertial navigation.

Given the device orientation (e.g., via Kalman filter [12] on IMU signals), one subtracts the gravity from the device acceleration, integrates the residual accelerations once to get velocities, and integrates once more to get positions.

Unfortunately, sensor biases explode quickly in the double integration process, and these systems do not work in practice without additional constraints.

A foot mounted IMU with zero speed update is probably the most successful example, where the sensor bias can be corrected subject to a constraint that the velocity must become zero whenever a foot touches the ground.
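
To make the drift problem concrete, here is a minimal sketch of naive double integration in one dimension. The sampling rate and the bias value are assumptions chosen for illustration and do not come from the paper; the helper name `double_integrate` is hypothetical.

```python
def double_integrate(accels, dt):
    """Integrate 1D accelerations twice (simple Euler steps); return positions."""
    velocity, position = 0.0, 0.0
    positions = []
    for a in accels:
        velocity += a * dt          # first integration: acceleration -> velocity
        position += velocity * dt   # second integration: velocity -> position
        positions.append(position)
    return positions

# A device that is actually stationary, but whose gravity-compensated
# acceleration keeps a small constant bias (assumed value).
rate_hz = 200        # assumed IMU sampling rate
bias = 0.05          # m/s^2, hypothetical residual bias
seconds = 60
samples = [bias] * (rate_hz * seconds)

drift = double_integrate(samples, 1.0 / rate_hz)[-1]
print(f"position drift after {seconds} s: {drift:.1f} m")
```

With a constant bias b the error grows as roughly b·t²/2, so a bias of only 0.05 m/s² already gives about 90 m of drift after one minute. This is why physics-only integration needs extra constraints such as the zero speed update.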

2.2 Heuristic priors

These methods work very well under certain constraints, but that is at odds with robustness

Human motions are highly repetitive.

Most existing inertial navigation research seeks to find heuristics exploiting such motion regularities.

Step counting is a popular approach assuming that

  1. An IMU is rigidly attached to a body;
  2. The motion direction is fixed with respect to the IMU; and
  3. The distance of travel is proportional to the number of foot-steps.

The method produces impressive results in a controlled environment where these assumptions are assured.
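
The three step-counting assumptions can be sketched as a simple dead-reckoning loop. This is only an illustration: the function name `step_counting_track` and the step length of 0.7 m are assumptions, and step detection itself is taken as given.

```python
import math

def step_counting_track(step_headings_rad, step_length_m=0.7):
    """Advance a fixed step length along the heading of each detected step."""
    x, y = 0.0, 0.0
    track = [(x, y)]
    for heading in step_headings_rad:
        # Assumption 2: heading is known because the motion direction is
        # fixed relative to the device. Assumption 3: every step covers
        # the same distance.
        x += step_length_m * math.cos(heading)
        y += step_length_m * math.sin(heading)
        track.append((x, y))
    return track

# Ten steps along +x, then five steps along +y.
track = step_counting_track([0.0] * 10 + [math.pi / 2] * 5)
print(track[-1])
```

If the subject walks backward or sideways, or the phone moves relative to the body, the heading fed into this loop is wrong and the whole track is wrong with it, which is the robustness problem the text describes.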

More sophisticated approaches utilize principal component analysis [10] or frequency domain analysis [13] to infer motion directions.

However, these heuristic-based approaches do not match up with the robustness of emerging data-driven methods [22].

2.3 Data-driven priors

Earlier data-driven methods

Robust IMU double integration (RIDI) was the first data-driven inertial navigation method [22].

RIDI focuses on regressing velocity vectors in a device coordinate frame, while relying on traditional sensor fusion methods to estimate device orientations.

RIDI works for complex motion cases such as backward-walking, significantly expanding the operating ranges of the inertial navigation system.

IONet is a neural network based approach, which regresses the velocity magnitude and the rate of motion-heading change without relying on external device orientation information [4].
(In other words, it regresses speed and heading rate from the sensor data alone.)

2.4 Inertial navigation datasets

RIDI dataset

The RIDI dataset utilized a phone with 3D tracking capability (Lenovo Phab Pro 2) to collect IMU-motion data under four different phone placements (i.e., a hand, a bag, a leg pocket, and a body).

The Visual Inertial SLAM produced the ground-truth motion data.

The data was collected by 10 human subjects, totalling 2.5 hours.

IONet dataset, namely OXIOD

The IONet dataset, namely OXIOD, used a high precision motion capture system (Vicon) under four different phone placements (i.e., a hand, a bag, a pocket, and a trolley) [5].
The data was collected by five human subjects, totalling 14.7 hours.

Shortcomings of the existing datasets

The common issue in these datasets is the reliance on a single device for both IMU data and the ground-truth motion acquisition.

The phone must have a clean line-of-sight for Visual Inertial SLAM or must be clearly visible for the Vicon system all the time, prohibiting natural phone handling, especially for bag and leg-pocket scenarios.

What the authors change

This paper presents a new IMU-motion data acquisition protocol that utilizes two smartphones to overcome these issues.

3. The RoNIN dataset

Scale, diversity and fidelity are the three key factors in building a next-generation inertial navigation database.

In comparison to the current largest database OXIOD [5], our dataset boasts

(The authors go on to describe their dataset. It is not needed for now, so it is skipped here.)

4. Robust Neural Inertial Navigation (RoNIN)

Our neural architecture for inertial navigation, dubbed Robust Neural Inertial Navigation (RoNIN), takes ResNet [8], Long Short Term Memory Network (LSTM) [9], or Temporal Convolutional Network (TCN) [2] as its backbone.

RoNIN seeks to regress a velocity vector given an IMU sensor history with two key design principles:

  1. Coordinate frame normalization defining the input and output feature space; and
  2. Robust velocity losses improving the signal-to-noise-ratio even with noisy regression targets.

We now explain the coordinate frame normalization strategy, three backbone neural architectures, and the robust velocity losses. Lastly, the section presents our neural architecture for the body heading regression.

4.1. Coordinate frame normalization

Paragraph 1 (the body frame and the navigation frame differ, so the choice of coordinate frame matters)

Feature representations, in our case the choice of coordinate frames, have significant impacts on the training.

IMU sensor measurements come from moving device coordinate frames, while ground-truth motion trajectories come from a global coordinate frame.
(Sensor data is expressed in the body frame, the b-frame, while navigation uses the navigation frame, the n-frame.)

RoNIN uses a heading-agnostic coordinate frame to represent both the input IMU and the output velocity data.

Paragraph 2 (the body frame keeps changing, so it cannot serve as the reference frame)

Suppose we pick the local device coordinate frame to encode our data.

The device coordinate frame changes at every time step, resulting in inconsistent motion representation.

For example, target velocities would change depending on how one holds a phone even for exactly the same motions.

Paragraph 3 (alignment by gravity, and its weak point)

RIDI [22] proposed the stabilized IMU coordinate frame, which is obtained from the device coordinate frame by aligning its Y-axis with the negated gravity direction.
(Aligning with gravity gives a relatively stable coordinate frame.)

However, this alignment process has a singularity (ambiguity) when the Y-axis points towards the gravity (e.g., a phone is inside a leg pocket upside-down), making the regression task harder and usually causing it to fail completely due to the randomness.

Paragraph 4 (aligning to one fixed frame is problematic, so the method accepts any gravity-aligned frame)

RoNIN uses a heading-agnostic coordinate frame (HACF), that is, any coordinate frame whose Z axis is aligned with gravity. In other words, we can pick any such coordinate frame as long as we keep it consistent throughout the sequence.

The coordinate transformation into HACF does not suffer from singularities or discontinuities with proper rotation representation, e.g. with quaternion.

Paragraph 5 (training uses a randomly chosen HACF as the target frame; testing uses a fixed one)

During training, we use a random HACF at each step, which is defined by randomly rotating ground-truth trajectories on the horizontal plane.

IMU data is transformed into the same HACF by the device orientation and the same horizontal rotation. The use of device orientations effectively incorporates sensor fusion into our data-driven system.

At test time, we use the coordinate frame defined by system device orientations from Android or iOS, whose Z axis is aligned with gravity.
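
The transformation into a heading-agnostic frame can be sketched as two rotations: the device orientation quaternion takes a device-frame vector into a gravity-aligned frame, and an arbitrary rotation about Z then picks one particular HACF. The code below is a hedged illustration, not the authors' implementation; the function names, the (w, x, y, z) quaternion convention, and the example orientation are all assumptions.

```python
import math
import random

def quat_rotate(q, v):
    """Rotate 3-vector v by unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    vx, vy, vz = v
    # t = 2 * (q_vec x v); then v' = v + w * t + q_vec x t
    tx = 2 * (y * vz - z * vy)
    ty = 2 * (z * vx - x * vz)
    tz = 2 * (x * vy - y * vx)
    return (vx + w * tx + (y * tz - z * ty),
            vy + w * ty + (z * tx - x * tz),
            vz + w * tz + (x * ty - y * tx))

def rotate_about_z(v, theta):
    """Horizontal rotation by theta radians; the Z component is untouched."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1], v[2])

def to_hacf(device_vec, device_to_global_q, theta):
    """Device frame -> gravity-aligned frame -> HACF selected by theta."""
    return rotate_about_z(quat_rotate(device_to_global_q, device_vec), theta)

# Assumed example: phone held upright, so its Y axis points up. The
# orientation is a 90 degree rotation about X, taking device Y to global Z.
q = (math.cos(math.pi / 4), math.sin(math.pi / 4), 0.0, 0.0)
accel_device = (0.0, 9.81, 0.0)          # accelerometer reading at rest

theta = random.uniform(0.0, 2.0 * math.pi)   # random HACF, as in training
accel_hacf = to_hacf(accel_device, q, theta)
print(accel_hacf)   # gravity ends up on Z whatever theta is
```

During training, the same `theta` would also be applied to the ground-truth velocity, so input and target stay in one consistent frame while the network sees every heading.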

4.2. Backbone architectures

4.2.1 Sentence-by-sentence reading

We present three RoNIN variants based on ResNet [8], LSTM [9] or TCN [2].

RoNIN ResNet

RoNIN ResNet: We take the 1D version of the standard ResNet-18 architecture [8] and add one fully connected layer with 512 units at the end to regress a 2D vector.

At frame i, the network takes IMU data from frame i − 200 to i as a 200×6 tensor and produces a velocity vector at frame i. At test time, we make predictions every five frames and integrate them to estimate motion trajectories.
(The data is split with a sliding window, and one velocity is regressed per window.)
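
The test-time procedure (one prediction every five frames, then integration) can be sketched as follows. The predictor is a stand-in callable, since the trained network is not part of these notes; `integrate_predictions`, `predict_velocity`, and the 200 Hz rate used to convert frames to seconds are assumptions for illustration.

```python
def integrate_predictions(imu, predict_velocity, window=200, stride=5, rate_hz=200.0):
    """Slide a window over the IMU frames and integrate predicted 2D velocities."""
    dt = stride / rate_hz            # time covered by one prediction step
    x, y = 0.0, 0.0
    track = [(x, y)]
    for end in range(window, len(imu) + 1, stride):
        # 200 frames x 6 channels (gyro + accel) go into the model.
        vx, vy = predict_velocity(imu[end - window:end])
        x += vx * dt
        y += vy * dt
        track.append((x, y))
    return track

# Stand-in data and model: 6 s of zeros, and a "network" that always
# reports 1 m/s along x.
imu = [[0.0] * 6 for _ in range(1200)]
track = integrate_predictions(imu, lambda window: (1.0, 0.0))
print(len(track) - 1, "predictions, final position", track[-1])
```

Because the first prediction needs a full window, the track only starts moving after the first 200 frames, and in this sketch the final x position is the number of predictions times 0.025 s times 1 m/s.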

RoNIN LSTM

RoNIN LSTM: We use a stacked unidirectional LSTM while enriching its input feature by concatenating the output of a bilinear layer [20].

RoNIN-LSTM has three layers each with 100 units and regresses a 2D vector for each frame, to which we add an additional integration layer to calculate the loss.

See Sect. 4.3 for the details of the integration layer.

RoNIN TCN

RoNIN TCN: TCN is a recently proposed CNN architecture, which approximates many-to-many recurrent architectures with dilated causal convolutions.

RoNIN TCN has six residual blocks with 16, 32, 64, 128, 72, and 36 channels, respectively, where a convolutional kernel of size 3 leads to the receptive field of 253 frames.
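
The figure of 253 frames can be reproduced with a short calculation. The sketch assumes the common TCN layout, which the text above does not spell out: two dilated causal convolutions per residual block, with the dilation doubling from one block to the next (1, 2, 4, ...). The function name `tcn_receptive_field` is hypothetical.

```python
def tcn_receptive_field(num_blocks, kernel_size, convs_per_block=2):
    """Receptive field of stacked dilated causal convolutions."""
    receptive_field = 1
    for block in range(num_blocks):
        dilation = 2 ** block        # assumed: dilation doubles per block
        # Each convolution widens the field by (kernel_size - 1) * dilation.
        receptive_field += convs_per_block * (kernel_size - 1) * dilation
    return receptive_field

print(tcn_receptive_field(num_blocks=6, kernel_size=3))  # 253
```

Written out, this is 1 + 2 · (3 − 1) · (1 + 2 + 4 + 8 + 16 + 32) = 1 + 4 · 63 = 253, which matches the number quoted in the paper under these assumptions.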

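4.2.2 Summary

These backbones show that the paper predicts velocity from a window of recent IMU data: ResNet outputs one velocity per 200-frame sliding window, while LSTM and TCN output one 2D vector per frame.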

4.3. Robust velocity loss

Defining a velocity for each IMU frame amounts to computing the derivative of low-frequency VI-SLAM poses at much higher frame rate. This makes the ground-truth velocity very noisy.
(The VI-SLAM ground truth is produced at a low rate, while the loss needs a velocity at the much higher IMU rate. The paper describes the result as noise. My reading is that it resembles knowledge distillation: VI-SLAM effectively gives an average over an interval, not the true velocity at each instant, so training every instant's velocity toward that average would be unreasonable.)

We propose two robust velocity losses that increase the signal-to-noise-ratio for better motion learning.

4.3.1 Latent velocity loss

Latent velocity loss: RoNIN LSTM/TCN regresses a sequence of two dimensional vectors over time.

We add an integration layer that sums up the vectors (over 400/253 frames for LSTM/TCN), then define a L2 norm against the ground-truth positional difference over the same frame window.

Note that this loss simply enforces that the sum of per-frame 2D vectors must match the position difference.

To our surprise, both LSTM and TCN learn to regress a velocity in this latent layer before the integration, and hence, we name it the latent velocity loss.
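
A minimal sketch of the latent velocity loss for a single window, written in plain Python instead of a deep learning framework. The function name and the toy numbers are assumptions; only the structure (sum the per-frame vectors, then take the L2 norm against the ground-truth displacement) follows the description above.

```python
import math

def latent_velocity_loss(per_frame_vecs, pos_start, pos_end):
    """L2 norm between the summed per-frame vectors and the true displacement."""
    # Integration layer: sum the network's 2D outputs over the window.
    sum_x = sum(v[0] for v in per_frame_vecs)
    sum_y = sum(v[1] for v in per_frame_vecs)
    # Ground-truth positional difference over the same window.
    dx = pos_end[0] - pos_start[0]
    dy = pos_end[1] - pos_start[1]
    return math.hypot(sum_x - dx, sum_y - dy)

# 400 frames (the LSTM window) that each contribute 5 mm along x.
outputs = [(0.005, 0.0)] * 400
loss_match = latent_velocity_loss(outputs, (0.0, 0.0), (2.0, 0.0))
loss_off = latent_velocity_loss(outputs, (0.0, 0.0), (2.0, 1.0))
print(loss_match, loss_off)
```

Only the sum is supervised, so noise in any individual ground-truth frame has little effect. Nothing in the loss forces each per-frame vector to be a velocity, which is why the authors report being surprised that the networks learn one anyway.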

4.3.2 Strided velocity loss

Strided velocity loss: For RoNIN ResNet, the network learns to predict positional difference over a stride of 200 frames (i.e., one second), instead of instantaneous velocities.

More specifically, we compute MSE loss between the 2D network output at frame i and $P_i - P_{i-200}$, where $P_i$ is the ground truth position at frame i in the global frame.
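
The strided loss can be sketched in the same plain-Python style. The function name, the list-based inputs, and the toy trajectory are assumptions for illustration; the target P_i − P_{i−200} and the MSE follow the text.

```python
def strided_velocity_loss(outputs, positions, stride=200):
    """MSE between the output at frame i and positions[i] - positions[i - stride]."""
    total, count = 0.0, 0
    for i in range(stride, len(positions)):
        target_x = positions[i][0] - positions[i - stride][0]
        target_y = positions[i][1] - positions[i - stride][1]
        out_x, out_y = outputs[i]
        total += (out_x - target_x) ** 2 + (out_y - target_y) ** 2
        count += 2                   # two vector components per frame
    return total / count

# Ground truth: walking at 1 m/s along x, sampled at 200 Hz for 2 s.
positions = [(i / 200.0, 0.0) for i in range(400)]
loss_good = strided_velocity_loss([(1.0, 0.0)] * 400, positions)
loss_half = strided_velocity_loss([(0.5, 0.0)] * 400, positions)
print(loss_good, loss_half)
```

At 200 Hz a 200-frame displacement in metres equals the mean velocity over that second in m/s, so the target is a smoothed velocity instead of a noisy per-frame derivative.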

4.4. RoNIN body heading network

4.4.1 Sentence-by-sentence reading

Different from the position regression, the task of heading regression becomes inherently ambiguous when a subject is stationary.
(For velocity, a stationary subject simply regresses to zero, but heading has no such default value when the subject is stationary.)

Suppose one is sitting still on a chair for 30 seconds.

We need to access the IMU sensor data 30 seconds back in time to estimate the body heading, as IMU data have almost zero information after the sitting event.

Therefore, we borrow the RoNIN LSTM architecture for the task, which is capable of keeping a long memory.

More precisely, we take the RoNIN LSTM architecture without the integration layer, and let the network predict a 2D vector (x, y), which are sin and cos values of the body heading angle at each frame.

During training, we unroll the network over 1,000 steps for back-propagation.

Note that the initial state is ambiguous if the subject is stationary, therefore we only update network parameters when the first frame of the unrolled sequence has a velocity magnitude greater than 0.1 m/s.
(The initial heading cannot be inferred when a sequence starts from rest, so sequences whose first frame is judged stationary are skipped in the parameter update.)

We use MSE loss against sin and cos values of ground-truth body heading angles. We also add a normalization loss as $\| 1 - x^2 - y^2 \|$ to guide the network to predict valid trigonometric values.
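
The heading loss and the update rule can be sketched per frame as below. The equal weighting of the two loss terms is an assumption, since the text gives no weights, and the names `heading_loss` and `should_update` are hypothetical.

```python
import math

def heading_loss(pred_xy, true_heading_rad):
    """MSE against (sin, cos) of the true heading, plus the normalization term."""
    x, y = pred_xy
    target_x = math.sin(true_heading_rad)    # x is the sin value
    target_y = math.cos(true_heading_rad)    # y is the cos value
    mse = ((x - target_x) ** 2 + (y - target_y) ** 2) / 2.0
    normalization = abs(1.0 - x * x - y * y)  # zero when (x, y) is on the unit circle
    return mse + normalization                # assumed: both terms weighted equally

def should_update(first_frame_velocity, threshold=0.1):
    """Update parameters only if the unrolled sequence starts in motion (> 0.1 m/s)."""
    return math.hypot(first_frame_velocity[0], first_frame_velocity[1]) > threshold

angle = math.radians(30.0)
loss_exact = heading_loss((math.sin(angle), math.cos(angle)), angle)
loss_zero = heading_loss((0.0, 0.0), angle)   # invalid (sin, cos) pair
print(loss_exact, loss_zero)
print(should_update((0.05, 0.0)), should_update((0.8, 0.3)))
```

Predicting (sin, cos) avoids the wrap-around at ±180° that a raw angle would have, and the normalization term penalizes outputs that do not correspond to any angle.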

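4.4.2 Summary

The authors address two problems here:

  • 1. When a person is stationary the IMU carries almost no information, so the heading cannot be estimated from the current data alone; the network needs the motion from before the person stopped.
    The authors therefore use an LSTM, which can keep a long memory.
  • 2. When a sequence starts with the person stationary, the initial heading is hard to define.
    The authors therefore skip the parameter update of the heading network when the first frame of the unrolled sequence is nearly stationary (speed of 0.1 m/s or less).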