ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst
https://arxiv.org/abs/1812.03079
Abstract
Our goal is to train a policy for autonomous driving via imitation learning that is robust enough to drive a real vehicle. We find that standard behavior cloning is insufficient for handling complex driving scenarios, even when we leverage a perception system for preprocessing the input and a controller for executing the output on the car: 30 million examples are still not enough. We propose exposing the learner to synthesized data in the form of perturbations to the expert’s driving, which creates interesting situations such as collisions and/or going off the road. Rather than purely imitating all data, we augment the imitation loss with additional losses that penalize undesirable events and encourage progress; the perturbations then provide an important signal for these losses and lead to robustness of the learned model. We show that the ChauffeurNet model can handle complex situations in simulation, and present ablation experiments that emphasize the importance of each of our proposed changes and show that the model is responding to the appropriate causal factors. Finally, we demonstrate the model driving a car in the real world.
1. Introduction
In order to drive a car, a driver needs to see and understand the various objects in the environment, predict their possible future behaviors and interactions, and then plan how to control the car in order to safely move closer to their desired destination while obeying the rules of the road. This is a difficult robotics challenge that humans solve well, making imitation learning a promising approach. Our work is about getting imitation learning to the level where it has a shot at driving a real vehicle; although the same insights may apply to other domains, these domains might have different constraints and opportunities, so we do not want to claim contributions there.
We built our system based on leveraging the training data (30 million real-world expert driving examples, corresponding to about 60 days of continual driving) as effectively as possible. There is a lot of excitement for end-to-end learning approaches to driving which typically focus on learning to directly predict raw control outputs such as steering or braking after consuming raw sensor input such as camera or lidar data. But to reduce sample complexity, we opt for mid-level input and output representations that take advantage of perception and control components. We use a perception system that processes raw sensor information and produces our input: a top-down representation of the environment and intended route, where objects such as vehicles are drawn as oriented 2D boxes along with a rendering of the road information and traffic light states. We present this mid-level input to a recurrent neural network (RNN), named ChauffeurNet, which then outputs a driving trajectory that is consumed by a controller which translates it to steering and acceleration. The further advantage of these mid-level representations is that the net can be trained on real or simulated data, and can be easily tested and validated in closed-loop simulations before running on a real car.
Our first finding is that even with 30 million examples, and even with mid-level input and output representations that remove the burden of perception and control, pure imitation learning is not sufficient. As an example, we found that this model would get stuck or collide with another vehicle parked on the side of a narrow street, when a nudging and passing behavior was viable. The key challenge is that we need to run the system closed-loop, where errors accumulate and induce a shift from the training distribution (Ross et al. (2011)). Scientifically, this result is valuable evidence about the limitations of pure imitation in the driving domain, especially in light of recent promising results for high-capacity models (Laskey et al. (2017a)). But practically, we needed ways to address this challenge without exposing demonstrators to new states actively (Ross et al. (2011); Laskey et al. (2017b)) or performing reinforcement learning (Kuefler et al. (2017)).
We find that this challenge is surmountable if we augment the imitation loss with losses that discourage bad behavior and encourage progress, and, importantly, augment our data with synthesized perturbations in the driving trajectory. These expose the model to non-expert behavior such as collisions and off-road driving, and inform the added losses, teaching the model to avoid these behaviors. Note that the opportunity to synthesize this data comes from the mid-level input-output representations, as perturbations would be difficult to generate with either raw sensor input or direct controller outputs.
We evaluate our system, as well as the relative importance of both loss augmentation and data augmentation, first in simulation. We then show how our final model successfully drives a car in the real world and is able to negotiate situations involving other agents, turns, stop signs, and traffic lights. Finally, it is important to note that there are highly interactive situations such as merging which may require a significant degree of exploration within a reinforcement learning (RL) framework. This will demand simulating other (human) traffic participants, a rich area of ongoing research. Our contribution can be viewed as pushing the boundaries of what you can do with purely offline data and no RL.
2. Related Work
Decades-old work on ALVINN (Pomerleau (1989)) showed how a shallow neural network could follow the road by directly consuming camera and laser range data. Learning to drive in an end-to-end manner has seen a resurgence in recent years. Recent work by Chen et al. (2015) demonstrated a convolutional net to estimate affordances such as distance to the preceding car that could be used to program a controller to control the car on the highway. Researchers at NVIDIA (Bojarski et al. (2016, 2017)) showed how to train an end-to-end deep convolutional neural network that steers a car by consuming camera input. Xu et al. (2017) trained a neural network for predicting discrete or continuous actions also based on camera inputs. Codevilla et al. (2018) also train a network using camera inputs and conditioned on high-level commands to output steering and acceleration. Kuefler et al. (2017) use Generative Adversarial Imitation Learning (GAIL) with simple affordance-style features as inputs to overcome cascading errors typically present in behavior cloned policies so that they are more robust to perturbations. Recent work from Hecker et al. (2018) learns a driving model using 360-degree camera inputs and a desired route planner to predict steering and speed. The CARLA simulator (Dosovitskiy et al. (2017)) has enabled recent work such as Sauer et al. (2018), which estimates several affordances from sensor inputs to drive a car in a simulated urban environment. Using mid-level representations in a spirit similar to our own, Müller et al. (2018) train a system in simulation using CARLA by training a driving policy from a scene segmentation network to output high-level control, thereby enabling transfer learning to the real world using a different segmentation network trained on real data. Pan et al. (2017) also describe achieving transfer of an agent trained in simulation to the real world using a learned intermediate scene labeling representation.
Reinforcement learning may also be used in a simulator to train drivers on difficult interactive tasks such as merging which require a lot of exploration, as shown in Shalev-Shwartz et al. (2016). A convolutional network operating on a space-time volume of bird’s eye-view representations is also employed by Luo et al. (2018); Djuric et al. (2018); Lee et al. (2017) for tasks like 3D detection, tracking and motion forecasting. Finally, there exists a large volume of work on vehicle motion planning outside the machine learning context and Paden et al. (2016) present a notable survey.
3. Model Architecture
3.1 Input Output Representation
We begin by describing our top-down input representation that the network will process to output a drivable trajectory. At any time $t$, our agent (or vehicle) may be represented in a top-down coordinate system by $p_t, \theta_t, s_t$, where $p_t = (x_t, y_t)$ denotes the agent’s location or pose, $\theta_t$ denotes the heading or orientation, and $s_t$ denotes the speed. The top-down coordinate system is picked such that our agent’s pose $p_0$ at the current time $t = 0$ is always at a fixed location $(u_0, v_0)$ within the image. For data augmentation purposes during training, the orientation of the coordinate system is randomly picked for each training example to be within an angular range of $\theta_0 \pm \Delta$, where $\theta_0$ denotes the heading or orientation of our agent at time $t = 0$. The top-down view is represented by a set of images of size $W \times H$ pixels, at a ground sampling resolution of $\phi$ meters/pixel. Note that as the agent moves, this view of the environment moves with it, so the agent always sees a fixed forward range, $R_{forward} = (H - v_0)\phi$, of the world, similar to having an agent with sensors that see only up to $R_{forward}$ meters forward.
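As a rough illustration of the coordinate convention described above, the following sketch maps world points into the agent-centered top-down image and computes $R_{forward}$. The function names, the axis convention (forward along increasing $v$, chosen to agree with $R_{forward} = (H - v_0)\phi$), and the numeric values in the usage lines are assumptions made for illustration, not details taken from the text.

```python
import numpy as np

def forward_range(H, v0, phi):
    """R_forward = (H - v0) * phi, in meters."""
    return (H - v0) * phi

def world_to_pixel(points_xy, agent_xy, agent_heading, u0, v0, phi, delta=0.0):
    """Map world points (meters) to top-down pixel coordinates (u, v).

    The agent lands at pixel (u0, v0). `delta` is the random rotation offset
    (radians) that would be drawn from [-Delta, +Delta] during training.
    Assumed convention: the agent's heading points toward increasing v.
    """
    # Rotate so that the (augmented) heading direction maps to the local +y axis.
    angle = np.pi / 2 - (agent_heading + delta)
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    offsets = np.asarray(points_xy, dtype=float) - np.asarray(agent_xy, dtype=float)
    local = offsets @ rot.T              # column 0: lateral, column 1: forward (meters)
    u = u0 + local[:, 0] / phi           # meters -> pixels
    v = v0 + local[:, 1] / phi
    return np.stack([u, v], axis=1)

# Illustrative values only: a 400 x 400 image at 0.2 m/pixel, agent at (200, 320).
r_forward = forward_range(H=400, v0=320, phi=0.2)
pixel = world_to_pixel([[26.0, 5.0]], agent_xy=(10.0, 5.0), agent_heading=0.0,
                       u0=200, v0=320, phi=0.2)
```

With these illustrative numbers, a point exactly $R_{forward}$ meters ahead of the agent lands on the image boundary $v = H$, which is the sense in which the agent "sees" only a fixed forward range.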
As shown in Fig. 1, the input to our model consists of several images of size $W \times H$ pixels rendered into this top-down coordinate system. (a) Roadmap: a color (3-channel) image with a rendering of various map features such as lanes, stop signs, cross-walks, curbs, etc. (b) Traffic lights: a temporal sequence of grayscale images where each frame of the sequence represents the known state of the traffic lights at each past timestep. Within each frame, we color each lane center by a gray level with the brightest level for red lights, intermediate gray level for yellow lights, and a darker level for green or unknown lights. (c) Speed limit: a single channel image with lane centers colored in proportion to their known speed limit. (d) Route: the intended route along which we wish to drive, generated by a router (think of a Google Maps-style route). (e) Current agent box: this shows our agent’s full bounding box at the current timestep $t = 0$. (f) Dynamic objects in the environment: a temporal sequence of images showing all the potential dynamic objects (vehicles, cyclists, pedestrians) rendered as oriented boxes. (g) Past agent poses: the past poses of our agent are rendered into a single grayscale image as a trail of points.
In Fig. 1, (a)-(g) are the inputs and (h) is the output.
We use a fixed-time sampling of $\delta t$ to sample any past or future temporal information, such as the traffic light state or dynamic object states in the above inputs. The traffic lights and dynamic objects are sampled over the past $T_{scene}$ seconds, while the past agent poses are sampled over a potentially longer interval of $T_{pose}$ seconds. This simple input representation, particularly the box representation of other dynamic objects, makes it easy to generate input data from simulation or create it from real-sensor logs using a standard perception system that detects and tracks objects. This enables testing and validation of models in closed-loop simulations before running them on a real car. This also allows the same model to be improved using simulated data to adequately explore rare situations such as collisions for which real-world data might be difficult to obtain. Using a top-down 2D view also means efficient convolutional inputs, and allows flexibility to represent metadata and spatial relationships in a human-readable format. Papers on testing frameworks such as Tian et al. (2018) and Pei et al. (2017) show the brittleness of using raw sensor data (such as camera images or lidar point clouds) for learning to drive, and reinforce the approach of using an intermediate input representation.
If I denotes the set of all the inputs enumerated above, then the ChauffeurNet model recurrently predicts future poses of our agent conditioned on these input images I as shown by the green dots in Fig. 1(h).
In Eq. (1), the current pose $p_0$ is a known part of the input, and then the ChauffeurNet performs $N$ iterations and outputs a future trajectory $\{p_{\delta t}, p_{2\delta t}, \ldots, p_{N\delta t}\}$ along with other properties such as future speeds. This trajectory can be fed to a controls optimizer that computes detailed driving control (such as steering and braking commands) within the specific constraints imposed by the dynamics of the vehicle to be driven. Different types of vehicles may possibly utilize different control outputs to achieve the same driving trajectory, which argues against training a network to directly output low-level steering and acceleration control. Note, however, that having intermediate representations like ours does not preclude end-to-end optimization from sensors to controls.
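Eq. (1) itself is not reproduced in this copy. From the description, it presumably has the recurrent form $p_{t+\delta t} = \text{ChauffeurNet}(I, p_t)$, applied $N$ times starting from the known $p_0$. The sketch below shows only that control flow; `rollout`, `toy_step`, and the dictionary of inputs are hypothetical stand-ins, with a trivial constant-speed function in place of the network.

```python
def rollout(step_fn, inputs, p0, n_steps):
    """Apply p_next = step_fn(inputs, p) n_steps times, starting from the known p0.

    Returns the predicted future poses [p_1, ..., p_N]; p0 itself is not included.
    """
    trajectory = []
    p = p0
    for _ in range(n_steps):
        p = step_fn(inputs, p)   # each prediction is fed back as the next input
        trajectory.append(p)
    return trajectory

def toy_step(inputs, p):
    """Stand-in for the network: drive straight ahead at a constant speed."""
    x, y = p
    return (x, y + inputs["speed"] * inputs["dt"])

future = rollout(toy_step, {"speed": 5.0, "dt": 0.2}, p0=(0.0, 0.0), n_steps=3)
```

The point of the sketch is the feedback: every pose after the first is conditioned on the model's own previous output rather than on logged data, which is where closed-loop drift can enter.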
3.2 Model Design
Broadly, the driving model is composed of several parts as shown in Fig. 2. The main ChauffeurNet model shown in part (a) of the figure consists of a convolutional feature network (FeatureNet) that consumes the input data to create a digested contextual feature representation that is shared by the other networks. These features are consumed by a recurrent agent network (AgentRNN) that iteratively predicts successive points in the driving trajectory. Each point at time $t$ in the trajectory is characterized by its location $p_t = (x_t, y_t)$, heading $\theta_t$ and speed $s_t$. The AgentRNN also predicts the bounding box of the vehicle as a spatial heatmap at each future timestep. In part (b) of the figure, we see that two other networks are co-trained using the same feature representation as an input. The Road Mask Network predicts the drivable areas of the field of view (on-road vs. off-road), while the recurrent perception network (PerceptionRNN) iteratively predicts a spatial heatmap for each timestep showing the future location of every other agent in the scene. We believe that doing well on these additional tasks using the same shared features as the main task improves generalization on the main task. Fig. 2(c) shows the various losses used in training the model, which we will discuss in detail below.
Note: a "spatial heatmap" is a way of representing how something is distributed over space as an image. In the context of the recurrent perception network (PerceptionRNN), spatial heatmaps are used to predict the future location of every other agent in the scene (pedestrians, vehicles, and so on). One heatmap is predicted per timestep, iteratively, showing where agents are likely to be and how they may move, which helps the model understand and anticipate changes in the scene. In short, a spatial heatmap turns spatial location information into an image that is easy to analyze and interpret.
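Fig. 2: Training the driving model.
(a) The core ChauffeurNet model, consisting of the FeatureNet and the AgentRNN,
(b) the co-trained road mask prediction network (RoadMaskNet) and the PerceptionRNN,
(c) the training losses, shown in blue, with green labels marking the ground-truth data. The dashed arrows represent the recurrent feedback of predictions from one iteration to the next.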
Fig. 3 illustrates the ChauffeurNet model in more detail. The rendered inputs shown in Fig. 1 are fed to a large-receptive field convolutional FeatureNet with skip connections, which outputs features $F$ that capture the environmental context and the intent. These features are fed to the AgentRNN which predicts the next point $p_k$ on the driving trajectory, and the agent bounding box heatmap $B_k$, conditioned on the features $F$ from the FeatureNet, the iteration number $k \in \{1, \ldots, N\}$, the memory $M_{k-1}$ of past predictions from the AgentRNN, and the agent bounding box heatmap $B_{k-1}$ predicted in the previous iteration.
Fig. 3: (a) Schematic of ChauffeurNet. (b) Memory updates over multiple iterations.
The memory $M_k$ is an additive memory consisting of a single channel image. At iteration $k$ of the AgentRNN, the memory is incremented by 1 at the location $p_k$ predicted by the AgentRNN, and this memory is then fed to the next iteration. The AgentRNN outputs a heatmap image over the next pose of the agent, and we use the arg-max operation to obtain the coarse pose prediction $p_k$ from this heatmap. The AgentRNN then employs a shallow convolutional meta-prediction network with a fully-connected layer that predicts a sub-pixel refinement of the pose $\delta p_k$ and also estimates the heading $\theta_k$ and the speed $s_k$. Note that the AgentRNN is unrolled at training time for a fixed number of iterations, and the losses described below are summed together over the unrolled iterations. This is possible because of the non-traditional RNN design where we employ an explicitly crafted memory model instead of a learned memory.
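The iteration described above (heatmap, arg-max, additive memory update, sub-pixel refinement) can be sketched as a plain loop. This is a simplified illustration, not the paper's implementation: `unroll_agent` and the two callables are hypothetical, the box heatmap $B_{k-1}$ and the heading and speed outputs are left out, and toy functions stand in for the learned layers.

```python
import numpy as np

def unroll_agent(predict_heatmap, predict_offset, features, n_iters, shape):
    """Unroll a fixed number of iterations with a hand-crafted additive memory."""
    memory = np.zeros(shape, dtype=np.float32)     # M_0: single-channel image
    waypoints = []
    for k in range(1, n_iters + 1):
        heatmap = predict_heatmap(features, k, memory)               # conditioned on M_{k-1}
        row, col = np.unravel_index(int(np.argmax(heatmap)), shape)  # coarse p_k
        memory[row, col] += 1.0                                      # M_k: +1 at predicted p_k
        d_row, d_col = predict_offset(features, k, memory)           # sub-pixel refinement
        waypoints.append((float(row + d_row), float(col + d_col)))
    return waypoints, memory

def toy_heatmap(features, k, memory):
    """Stand-in for the learned heatmap head: peak moves one row per iteration."""
    h = np.zeros(memory.shape)
    h[k, 2] = 1.0
    return h

def toy_offset(features, k, memory):
    """Stand-in for the meta-prediction network: constant sub-pixel offset."""
    return 0.25, 0.5

waypoints, memory = unroll_agent(toy_heatmap, toy_offset, None, n_iters=3, shape=(8, 8))
```

Because the memory is a deterministic function of the predicted waypoints rather than a learned hidden state, the loop can be unrolled for a fixed `n_iters` and a loss attached at every iteration.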
3.3 System Architecture
Fig. 4 shows a system level overview of how the neural net is used within the self-driving system. At each time, the updated state of our agent and the environment is obtained via a perception system that processes sensory output from the real-world or from a simulation environment as the case may be. The intended route is obtained from the router, and is updated dynamically conditioned on whether our agent was able to execute past intents or not. The environment information is rendered into the input images described in Fig. 1 and given to the RNN which then outputs a future trajectory. This is fed to a controls optimizer that outputs the low-level control signals that drive the vehicle (in the real world or in simulation).
4. Imitating the Expert
In this section, we first show how to train the model above to imitate the expert.
4.1 Imitation Losses
4.1.1 Agent Position, Heading and Box Prediction
The AgentRNN produces three outputs at each iteration $k$: a probability distribution $P_k(x, y)$ over the spatial coordinates of the predicted waypoint obtained after a spatial softmax, a heatmap of the predicted agent box at that timestep $B_k(x, y)$ obtained after a per-pixel sigmoid activation that represents the probability that the agent occupies a particular pixel, and a regressed box heading output $\theta_k$. Given ground-truth data for the above predicted quantities, we can define the corresponding losses for each iteration as:
where the superscript $gt$ denotes the corresponding ground-truth values, and $H(a, b)$ is the cross-entropy function. Note that $P_k^{gt}$ is a binary image with only the pixel at the ground-truth target coordinate $\lfloor p_k^{gt} \rfloor$ set to one.
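The loss equations referred to above are missing from this copy. Going by the definitions in the text, they are presumably a cross-entropy on the waypoint distribution, $\mathcal{L}_p = H(P_k, P_k^{gt})$, a per-pixel cross-entropy on the box heatmap averaged over the image, $\mathcal{L}_B = \frac{1}{WH}\sum_x \sum_y H(B_k(x, y), B_k^{gt}(x, y))$, and a regression loss on the heading, assumed here to be $\mathcal{L}_\theta = \lVert \theta_k - \theta_k^{gt} \rVert_1$. A minimal NumPy sketch under those assumptions (function names are illustrative):

```python
import numpy as np

EPS = 1e-7  # keeps log() finite for probabilities at exactly 0 or 1

def waypoint_loss(logits, target_rc):
    """Cross-entropy between the spatial softmax of `logits` and a one-hot target pixel."""
    z = logits - logits.max()                 # numerically stable log-softmax
    log_p = z - np.log(np.exp(z).sum())
    return float(-log_p[target_rc])           # only the target pixel contributes

def box_loss(box_prob, box_gt):
    """Mean per-pixel binary cross-entropy between B_k and its ground-truth mask."""
    p = np.clip(box_prob, EPS, 1.0 - EPS)
    return float(np.mean(-(box_gt * np.log(p) + (1.0 - box_gt) * np.log(1.0 - p))))

def heading_loss(theta, theta_gt):
    """L1 distance between predicted and ground-truth heading (no angle wrapping)."""
    return abs(theta - theta_gt)

uniform_wp = waypoint_loss(np.zeros((4, 4)), (1, 2))
uniform_box = box_loss(np.full((4, 4), 0.5), np.ones((4, 4)))
```

Since $P_k^{gt}$ is one-hot, the waypoint cross-entropy reduces to the negative log-probability of the target pixel, which is why a uniform prediction over 16 pixels gives $\log 16$.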
4.1.2 Agent Meta Prediction
The meta prediction network performs regression on the features to generate a sub-pixel refinement $\delta p_k$ of the coarse waypoint prediction as well as a speed estimate $s_k$ at each iteration. We employ an $L_1$ loss for both of these outputs:
where $\delta p_k^{gt} = p_k^{gt} - \lfloor p_k^{gt} \rfloor$ is the fractional part of the ground-truth pose coordinates.
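The two $L_1$ losses are also missing from this copy; they are presumably $\lVert \delta p_k - \delta p_k^{gt} \rVert_1$ for the refinement and $\lVert s_k - s_k^{gt} \rVert_1$ for the speed. The sketch below shows how a continuous ground-truth waypoint splits into the integer pixel used by the heatmap target and the fractional remainder used as the regression target. The helper names are hypothetical.

```python
import numpy as np

def split_waypoint(p_gt):
    """Split a continuous pixel coordinate into floor(p) and its fractional part."""
    p_gt = np.asarray(p_gt, dtype=float)
    coarse = np.floor(p_gt)
    return coarse.astype(int), p_gt - coarse

def l1_loss(pred, target):
    """Sum of absolute differences."""
    return float(np.sum(np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))))

coarse, frac = split_waypoint([12.75, 40.25])      # heatmap target pixel, regression target
subpixel_loss = l1_loss([0.5, 0.25], frac)         # |0.5 - 0.75| + |0.25 - 0.25|
speed_loss = l1_loss([4.0], [5.5])
```

Splitting the target this way lets the heatmap handle the coarse, multi-modal choice of pixel while the regression head only has to recover an offset in $[0, 1)$.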
4.2 Past Motion Dropout
During training, the model is provided the past motion history as one of the inputs (Fig. 1(g)). Since the past motion history during training is from an expert demonstration, the net can learn to "cheat" by just extrapolating from the past rather than finding the underlying causes of the behavior. During closed-loop inference, this breaks down because the past history is from the net’s own past predictions. For example, such a trained net may learn to only stop for a stop sign if it sees a deceleration in the past history, and will therefore never stop for a stop sign during closed-loop inference. To address this, we introduce a dropout on the past pose history, where for 50% of the examples, we keep only the current position $(u_0, v_0)$ of the agent in the past agent poses channel of the input data. This forces the net to look at other cues in the environment to explain the future motion profile in the training example.
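Past motion dropout as described above amounts to a small per-example transform on the past-poses channel. The following is a sketch under the assumption that the channel is a 2D array indexed as `[v, u]` (row, column); the function name and signature are illustrative.

```python
import numpy as np

def past_motion_dropout(past_poses_img, u0, v0, rng, drop_prob=0.5):
    """With probability `drop_prob`, erase the pose trail except the current position."""
    if rng.random() >= drop_prob:
        return past_poses_img                      # keep the full expert history
    dropped = np.zeros_like(past_poses_img)
    dropped[v0, u0] = past_poses_img[v0, u0]       # only the pixel at (u0, v0) survives
    return dropped

rng = np.random.default_rng(0)
trail = np.zeros((6, 6), dtype=np.float32)
trail[1:5, 3] = 1.0                                # toy trail of past poses in column u = 3
always_dropped = past_motion_dropout(trail, u0=3, v0=4, rng=rng, drop_prob=1.0)
never_dropped = past_motion_dropout(trail, u0=3, v0=4, rng=rng, drop_prob=0.0)
```

The default `drop_prob=0.5` mirrors the 50% figure in the text; the two extreme values in the usage lines are only there to make the behavior deterministic.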
5. Beyond Pure Imitation
In this section, we go beyond vanilla cloning of the expert’s demonstrations in order to teach the model to arrest drift and avoid bad behavior such as collisions and off-road driving by synthesizing variations of the expert’s behavior.
5.1 Synthesizing Perturbations
Running the model as a part of a closed-loop system over time can cause the input data to deviate from the training distribution. To prevent this, we train the model by adding some examples with realistic perturbations to the agent trajectories. The start and end of a trajectory are kept constant, while a perturbation is applied around the midpoint and smoothed across the other points. Quantitatively, we jitter the midpoint pose of the agent uniformly at random in the range $[-0.5, 0.5]$ meters in both axes, and perturb the heading by $[-\pi/3, \pi/3]$ radians. We then fit a smooth trajectory to the perturbed point and the original start and end points. Such training examples bring the car back to its original trajectory after a perturbation. Fig. 5 shows an example of perturbing the current agent location (red point) away from the lane center and the fitted trajectory correctly bringing it back to the original target location along the lane center. We filter out some perturbed trajectories that are impractical by thresholding on maximum curvature. But we do allow the perturbed trajectories to collide with other agents or drive off-road, because the network can then experience and avoid such behaviors even though real examples of these cases are not present in the training data. In training, we give perturbed examples a weight of 1/10 relative to the real examples, to avoid learning a propensity for perturbed driving.
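Fig. 5: Trajectory perturbation.
(a) An original logged training example in which the agent drives along the center of the lane.
(b) The perturbed example created by shifting the current agent location (red point) in the original example away from the lane center and then fitting a new smooth trajectory that brings the agent back to the original target location along the lane center.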
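The text does not say how the smooth trajectory is fitted, so the sketch below swaps in a deliberately simple choice: the sampled midpoint shift is blended into the trajectory with a raised-cosine weight that is 0 at both ends and 1 at the midpoint. It handles positions only; the heading offset is sampled but not used in the fit. All names and the blending scheme are assumptions for illustration.

```python
import numpy as np

def perturb_trajectory(xy, rng, max_shift=0.5, max_heading=np.pi / 3):
    """Shift the midpoint by a random offset and blend it smoothly into the trajectory."""
    xy = np.asarray(xy, dtype=float)
    shift = rng.uniform(-max_shift, max_shift, size=2)    # meters, both axes
    d_heading = rng.uniform(-max_heading, max_heading)    # radians, sampled but not fitted here
    # Raised-cosine bump: 0 at the first and last point, 1 at the midpoint.
    t = np.linspace(0.0, 1.0, len(xy))
    weight = 0.5 * (1.0 - np.cos(2.0 * np.pi * t))
    return xy + weight[:, None] * shift, d_heading

def max_curvature(xy):
    """Largest discrete curvature along a polyline, used to reject impractical samples."""
    dx, dy = np.gradient(xy[:, 0]), np.gradient(xy[:, 1])
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    denom = np.maximum((dx ** 2 + dy ** 2) ** 1.5, 1e-9)
    return float(np.max(np.abs(dx * ddy - dy * ddx) / denom))

rng = np.random.default_rng(0)
straight = np.stack([np.zeros(11), np.linspace(0.0, 20.0, 11)], axis=1)
perturbed, d_heading = perturb_trajectory(straight, rng)
```

A sample would then be kept only if `max_curvature(perturbed)` falls below some threshold; the threshold value is not given in the text and would have to be chosen for the vehicle at hand.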
5.2 Beyond the Imitation Loss
5.2.1 Collision Loss
Since our training data does not have any real collisions, the idea of avoiding collisions is implicit and will not generalize well. To alleviate this issue, we add a specialized loss that directly measures the overlap of the predicted agent box Bk with the ground-truth boxes of all the scene objects at each timestep.
where $B_k$ is the likelihood map for the output agent box prediction, and $Obj_k^{gt}$ is a binary mask with ones at all pixels occupied by other dynamic objects (other vehicles, pedestrians, etc.) in the scene at timestep $k$. At any time during training, if the model makes a poor prediction that leads to a collision, the overlap loss would influence the gradients to correct the mistake. However, this loss would be effective only during the initial training rounds when the model hasn’t learned to predict close to the ground-truth locations due to the absence of real collisions in the ground truth data. This issue is alleviated by the addition of trajectory perturbation data, where artificial collisions within those examples allow this loss to be effective throughout training without the need for online exploration like in reinforcement learning settings.
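The collision loss equation is missing from this copy. Given that it "directly measures the overlap" of $B_k$ with the object mask, a plausible form is $\mathcal{L}_{collision} = \frac{1}{WH}\sum_x \sum_y B_k(x, y) \cdot Obj_k^{gt}(x, y)$. A sketch under that assumption (the function name is illustrative):

```python
import numpy as np

def collision_loss(box_prob, objects_gt):
    """Mean over pixels of predicted agent-box probability times the object mask."""
    return float(np.mean(box_prob * objects_gt))

# Toy 4 x 4 scene: the predicted box and another object share exactly one pixel.
box_prob = np.zeros((4, 4))
box_prob[1:3, 1:3] = 1.0
objects_gt = np.zeros((4, 4))
objects_gt[2:4, 2:4] = 1.0
loss_overlap = collision_loss(box_prob, objects_gt)
loss_clear = collision_loss(box_prob, np.zeros((4, 4)))
```

The loss is zero whenever the predicted box stays clear of every object, which illustrates the point made above: without perturbed examples that actually produce overlap, this term contributes no gradient once the model tracks the collision-free ground truth.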
5.2.2 On Road Loss
Trajectory perturbations also create synthetic cases where the car veers off the road or climbs a curb or median because of the perturbation. To train the network to avoid hitting such hard road edges, we add a specialized loss that measures overlap of the predicted agent box $B_k$ in each timestep with a binary mask $Road^{gt}$ denoting the road and non-road regions within the field-of-view.
5.2.3 Geometry Loss
We would like to explicitly constrain the agent to follow the target geometry independent of the speed profile. We model this target geometry by fitting a smooth curve to the target waypoints and rendering this curve as a binary image in the top-down coordinate system. The thickness of this curve is set to be equal to the width of the agent. We express this loss similar to the collision loss by measuring the overlap of the predicted agent box with the binary target geometry image $Geom^{gt}$. Any portion of the box that does not overlap with the target geometry curve is added as a penalty to the loss function.
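The on-road and geometry losses share one pattern: penalize the part of the predicted box that falls outside an allowed region. Their equations are missing here; assuming $Road^{gt}$ and $Geom^{gt}$ are 1 inside the allowed region, plausible forms are $\frac{1}{WH}\sum_x \sum_y B_k(x, y)\,(1 - Road^{gt}(x, y))$ and the same expression with $Geom^{gt}$. A sketch under that assumption, with an illustrative function name:

```python
import numpy as np

def outside_region_loss(box_prob, allowed_gt):
    """Mean predicted box probability falling outside the allowed (mask == 1) region."""
    return float(np.mean(box_prob * (1.0 - allowed_gt)))

box_prob = np.zeros((4, 4))
box_prob[1:3, 1:3] = 1.0                 # predicted agent box covers 4 pixels

road_gt = np.zeros((4, 4))
road_gt[:, :2] = 1.0                     # left half of the image is road
on_road_loss = outside_region_loss(box_prob, road_gt)

geom_gt = np.zeros((4, 4))
geom_gt[1:3, 1:3] = 1.0                  # target corridor exactly under the box
geometry_loss = outside_region_loss(box_prob, geom_gt)
```

Only the mask differs between the two losses: the road mask is fixed by the map, while the geometry mask is rendered from the target waypoints and so constrains the path shape independently of speed.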
5.2.4 Auxiliary Losses
Similar to our own agent’s trajectory, the motion of other agents may also be predicted by a recurrent network. Correspondingly, we add a recurrent perception network PerceptionRNN that uses as input the shared features $F$ created by the FeatureNet and its own predictions $Obj_{k-1}$ from the previous iteration, and predicts a heatmap $Obj_k$ at each iteration. $Obj_k(x, y)$ denotes the probability that location $(x, y)$ is occupied by a dynamic object at time $k$. For iteration $k = 0$, the PerceptionRNN is fed the ground truth objects at the current time.
Co-training a PerceptionRNN to predict the future of other agents, using the same feature representation $F$ that the AgentRNN uses, is likely to induce the feature network to learn better features that are suited to both tasks. Several examples of predicted trajectories from PerceptionRNN on logged data are shown on our website.
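The recurrence described above, $Obj_k$ computed from $F$ and $Obj_{k-1}$ with the ground-truth objects seeding the first iteration, can be sketched as a simple unrolled loop. The learned network is replaced here by a hypothetical `toy_step` that merely shifts occupancy by one row; only the loop structure is meant to reflect the text.

```python
import numpy as np

def rollout_perception(features, obj_gt_now, step_fn, num_iters):
    """Unroll a recurrent predictor: Obj_k = step_fn(F, Obj_{k-1})."""
    obj_prev = obj_gt_now          # the first iteration is fed the ground-truth objects
    heatmaps = []
    for _ in range(num_iters):
        obj_prev = step_fn(features, obj_prev)   # feed back the previous prediction
        heatmaps.append(obj_prev)
    return np.stack(heatmaps)

def toy_step(features, obj_prev):
    # Hypothetical stand-in for the learned network: move occupancy one row up.
    return np.clip(np.roll(obj_prev, -1, axis=0) * features, 0.0, 1.0)

F = np.ones((5, 5))
obj0 = np.zeros((5, 5))
obj0[3, 2] = 1.0                   # one dynamic object at row 3, column 2
heatmaps = rollout_perception(F, obj0, toy_step, num_iters=2)
print(heatmaps.shape)
```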
We also co-train to predict a binary road/non-road mask by adding a small network of convolutional layers on top of the feature net output $F$. We add a cross-entropy loss on the predicted road mask $Road(x, y)$, which compares it to the ground-truth road mask $Road^{gt}$.
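For reference, a per-pixel binary cross-entropy between a predicted road mask and its ground truth can be written as below. Averaging over pixels and clipping with `eps` are choices made for this sketch; the section does not state how the loss is reduced.

```python
import numpy as np

def road_mask_loss(road_pred, road_gt, eps=1e-7):
    """Mean binary cross-entropy between predicted and ground-truth road masks."""
    p = np.clip(np.asarray(road_pred, dtype=float), eps, 1.0 - eps)
    gt = np.asarray(road_gt, dtype=float)
    ce = -(gt * np.log(p) + (1.0 - gt) * np.log(1.0 - p))
    return float(ce.mean())

gt_mask = np.array([[1.0, 0.0], [1.0, 0.0]])
uninformed = np.full((2, 2), 0.5)      # a predictor that knows nothing
confident = np.array([[0.99, 0.01], [0.99, 0.01]])
print(road_mask_loss(uninformed, gt_mask), road_mask_loss(confident, gt_mask))
```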
Fig. 6 shows some of the predictions and losses for a single example processed through the model.
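Figure 6: Visualization of predictions and loss functions on an example input. The top row is at the input resolution, while the bottom row shows a zoomed-in view around the current agent location.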
5.3 Imitation Dropout
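Overall, our losses may be grouped into two sub-groups: the imitation losses $L_{imit}$, which compare the predictions with the expert’s demonstration, and the environment losses $L_{env}$ of Section 5.2, which comprise the collision, on-road, geometry and auxiliary losses.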
The imitation losses cause the model to imitate the expert’s demonstrations, while the environment losses discourage undesirable behavior such as collisions. To further increase the effectiveness of the environment losses, we experimented with randomly dropping out the imitation losses for a random subset of training examples. We refer to this as “imitation dropout”. In the experiments, we show that imitation dropout yields a better driving model than simply under-weighting the imitation losses. During imitation dropout, the weight on the imitation losses, $w_{imit}$, is randomly chosen to be either 0 or 1 with a certain probability for each training example. The overall loss is the weighted sum of the imitation losses, scaled by $w_{imit}$, and the environment losses.
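A minimal sketch of imitation dropout under stated assumptions: each training example draws $w_{imit}$ from {0, 1}, and the total loss is a weighted sum of the two groups. The environment weight `w_env`, the function names and the plain summation within each group are assumptions for this example, since the loss equation itself is not reproduced in this section.

```python
import numpy as np

def sample_imitation_weight(rng, dropout_prob):
    """Return 0.0 (imitation losses dropped) with probability dropout_prob, else 1.0."""
    return 0.0 if rng.random() < dropout_prob else 1.0

def total_loss(imit_losses, env_losses, w_imit, w_env=1.0):
    """Weighted sum of the two loss groups; w_env is an assumed parameter."""
    return w_imit * sum(imit_losses) + w_env * sum(env_losses)

rng = np.random.default_rng(0)
weights = [sample_imitation_weight(rng, dropout_prob=0.5) for _ in range(2000)]
kept = total_loss([1.0, 2.0], [0.5], w_imit=1.0)      # imitation terms active
dropped = total_loss([1.0, 2.0], [0.5], w_imit=0.0)   # only environment terms remain
print(kept, dropped)
```

Unlike a fixed down-weighting, a dropped example contributes gradients from the environment losses alone, so on those examples the network is not pulled toward the demonstrated trajectory at all.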
6. Experiments
6.1 Data
The training data for our model was obtained by randomly sampling segments of real-world expert driving and removing segments where the car was stationary for long periods of time. Our input field of view is 80 m × 80 m ($W\phi = 80$), and with the agent positioned at $(u_0, v_0)$ we get an effective forward sensing range of $R_{forward} = 64$ m. Because highway driving requires a longer sensing range, we also removed all highway segments for the experiments in this work. Our dataset contains approximately 26 million examples, which amount to about 60 days of continuous driving. As discussed in Section 3, the vertical axis of the top-down coordinate system for each training example is randomly oriented within a range of $\Delta = \pm 25°$ of our agent’s current heading, in order to avoid a bias for driving along the vertical axis. The rendering orientation is set to the agent heading ($\Delta = 0$) during inference. Data about the prior map of the environment (roadmap) and the speed limits along the lanes is collected a priori. For dynamic scene entities such as objects and traffic lights, we employ a separate perception system based on laser and camera data, similar to existing works in the literature (Yang et al. (2018); Fairfield and Urmson (2011)). Table 1 lists the parameter values used for all the experiments in this paper. The model runs on an NVIDIA Tesla P100 GPU in 160 ms, with the detailed breakdown in Table 2.
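The field-of-view figures above follow from simple arithmetic, sketched below together with the train/inference difference in rendering orientation. The image size of 400 pixels and the agent row of 320 are assumptions chosen to be consistent with the stated 80 m and 64 m at the 20 cm/pixel resolution mentioned in Section 6.5; the uniform distribution for $\Delta$ is likewise an assumption.

```python
import numpy as np

PIXELS = 400         # assumed image width/height W in pixels (not stated here)
RESOLUTION = 0.2     # metres per pixel, i.e. 20 cm/pixel
V0 = 320             # assumed agent row v0, counted from the top of the image

field_of_view = PIXELS * RESOLUTION     # side length in metres
forward_range = V0 * RESOLUTION         # distance visible ahead of the agent

def sample_heading_offset(rng, training, max_deg=25.0):
    """Rendering rotation in degrees: random while training, zero at inference."""
    return float(rng.uniform(-max_deg, max_deg)) if training else 0.0

rng = np.random.default_rng(0)
train_delta = sample_heading_offset(rng, training=True)
infer_delta = sample_heading_offset(rng, training=False)
print(field_of_view, forward_range, train_delta, infer_delta)
```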
6.2 Models
We train and test not only our final model, but a sequence of models that introduce the ingredients we describe one by one on top of behavior cloning. We start with M0, which does behavior cloning with past motion dropout to prevent using the history to cheat. M1 adds perturbations without modifying the losses. M2 further adds our environment losses $L_{env}$ from Section 5.2. M3 and M4 address the fact that we do not want to imitate bad behavior: M3 is a baseline approach, where we simply decrease the weight on the imitation loss, while M4 uses our imitation dropout approach with a dropout probability of 0.5. Table 3 lists the configuration for each of these models.
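Building and testing these models incrementally helps us understand how each newly introduced component affects overall model performance, and so guides the design of the final model. In this way, the model can be adjusted and refined step by step to improve its effectiveness and safety in real-world use.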
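The ablation sequence can be summarized as a small configuration table. The sketch below encodes only what the prose states; the field names are invented for illustration, and the numeric imitation weight used by M3 is deliberately left out because it is not given in this section.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class ModelConfig:
    name: str
    perturbations: bool          # trained on perturbed trajectories
    env_losses: bool             # environment losses L_env enabled
    imitation_weighting: str     # "full", "down-weighted" or "dropout"

# All five models use past motion dropout, so it is not a distinguishing field.
MODELS = [
    ModelConfig("M0", perturbations=False, env_losses=False, imitation_weighting="full"),
    ModelConfig("M1", perturbations=True, env_losses=False, imitation_weighting="full"),
    ModelConfig("M2", perturbations=True, env_losses=True, imitation_weighting="full"),
    ModelConfig("M3", perturbations=True, env_losses=True, imitation_weighting="down-weighted"),
    ModelConfig("M4", perturbations=True, env_losses=True, imitation_weighting="dropout"),
]

def changed_fields(a, b):
    """Names of the non-name fields that differ between two configurations."""
    return [f.name for f in fields(a)
            if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)]

for prev, cur in zip(MODELS, MODELS[1:]):
    print(prev.name, "->", cur.name, changed_fields(prev, cur))
```

Each consecutive pair differs in exactly one field, which is what lets the closed-loop results attribute a change in behavior to a single ingredient.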
6.3 Closed Loop Evaluation
To evaluate our learned model on a specific scenario, we replay the segment through the simulation until a buffer period of $\max(T_{pose}, T_{scene})$ has passed. This allows us to generate the first rendered snapshot of the model input using all the messages replayed so far. The model is evaluated on this input, and the fitted controls are passed to the vehicle simulator, which emulates the dynamics of the vehicle and moves the simulated agent to its next pose. At this point, the simulated pose might differ from the logged pose, but our input representation allows us to correctly render the new input for the model relative to the new pose. This process is repeated until the end of the segment, and we evaluate scenario-specific metrics, such as stopping for a stop-sign or colliding with another vehicle, during the simulation. Since the model is being used to drive the agent forward, this is a closed-loop evaluation setup.
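The loop described above (render, predict, fit controls, simulate, re-render from the new pose) can be sketched generically. Every component here is a hypothetical callable supplied by the caller; the toy one-dimensional example underneath only demonstrates that the model's output feeds back into its next input.

```python
def closed_loop_rollout(initial_pose, num_steps, render, model, controller, simulate):
    """Drive a simulated agent with the model's own outputs for num_steps steps."""
    pose = initial_pose
    poses = [pose]
    for _ in range(num_steps):
        model_input = render(pose)          # render the scene relative to the current pose
        trajectory = model(model_input)     # predict where the agent should go
        controls = controller(trajectory)   # fit controls to the prediction
        pose = simulate(pose, controls)     # vehicle dynamics produce the next pose
        poses.append(pose)
    return poses

# Toy setup (all hypothetical): the pose is a lateral offset from the lane center,
# and the "model" asks for a correction of half the current offset.
poses = closed_loop_rollout(
    initial_pose=1.0,
    num_steps=3,
    render=lambda pose: pose,
    model=lambda offset: -0.5 * offset,
    controller=lambda correction: correction,
    simulate=lambda pose, control: pose + control,
)
print(poses)
```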
6.3.1 Model Ablation Tests
Here, we present results from experiments using the various models in the closed-loop simulation setup. We first evaluated all the models on simple situations, such as stopping for stop-signs and red traffic lights and lane following along straight and curved roads, by creating 20 scenarios for each situation, and found that all the models worked well in these simple cases. Therefore, we focus below on specific complex situations that highlight the differences between these models.
Nudging around a parked car. To set up this scenario, we place the agent at an arbitrary distance from a stop-sign on an undivided two-way street and then place a parked vehicle on the right shoulder between the agent and the stop-sign. We pick 4 separate locations with both straight and curved roads, then vary the starting speed of the agent between 5 different values to create a total of 20 scenarios. We then observe whether the agent stops and gets stuck behind the parked car, collides with it, or correctly passes around it, and report the aggregate performance in Fig. 7 (row 1). We find that, other than M4, all models cause the agent to collide with the parked vehicle about half the time. The baseline M0 model can also get stuck behind the parked vehicle in some of the scenarios. The model M4 nudges around the parked vehicle and then brings the agent back to the lane center. This can be attributed to the model’s ability to learn to avoid collisions and nudge around objects because of training with the collision loss and the trajectory perturbations. Comparing models M3 and M4, it is apparent that “imitation dropout” was more effective at learning the right behavior than only re-weighting the imitation losses. Note that in this scenario, we generate several variations by changing the starting speed of the agent relative to the parked car. This creates situations of increasing difficulty, up to cases where the agent approaches the parked car at very high relative speed and thus does not have enough time to nudge around the car given the dynamic constraints. A 10% collision rate for M4 is thus not a measure of the absolute performance of the model, since we do not have a perfect driver that could have performed well in all the scenarios here. But in relative terms, this model performs the best.
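The scenario count and the aggregate rates reported per model follow from a simple grid, sketched here. The location labels, the speed values and the outcome list are placeholders invented for illustration; only the 4 × 5 structure and the three outcome classes come from the text.

```python
from collections import Counter
from itertools import product

locations = ["loc_a", "loc_b", "loc_c", "loc_d"]     # 4 placeholder locations
start_speeds = [2.0, 4.0, 6.0, 8.0, 10.0]            # 5 placeholder speeds (m/s)
scenarios = list(product(locations, start_speeds))   # one scenario per combination

def aggregate(outcomes):
    """Share of scenarios ending in each outcome ('pass', 'collide' or 'stuck')."""
    counts = Counter(outcomes)
    return {name: count / len(outcomes) for name, count in counts.items()}

# Made-up outcomes: 2 collisions out of 20 scenarios give a 10% collision rate.
example_rates = aggregate(["pass"] * 18 + ["collide"] * 2)
print(len(scenarios), example_rates)
```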
Recovering from a trajectory perturbation. To set up this scenario, we place the agent approaching a curved road and vary the starting position and the starting speed of the agent to generate a total of 20 scenario variations. Each variation puts the agent at a different offset from the lane center with a different heading error relative to the lane. We then measure how well the various models recover from the lane departure. Fig. 7 (row 2) presents the results aggregated across these scenarios and shows the contrast between the baseline model M0, which is not able to recover in any of the situations, and the models M3 and M4, which handle all deviations well. All models trained with the perturbation data are able to handle the 50% of the scenarios that have a lower starting speed. At a higher starting speed, we believe that M3 and M4 do better than M1 and M2 because they place a higher emphasis on the environment losses.
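The results for this scenario indicate:
- M0: unable to recover from the lane departure in any of the situations, which suggests that training without perturbation data and environment losses is not enough to handle lane departures.
- M3 and M4: handle all deviations well, which suggests that imitation dropout or a reduced imitation-loss weight helps the model learn to recover from lane departures.
- M1 and M2: handle the 50% of scenarios with a lower starting speed but do poorly at higher starting speeds, possibly because they do not emphasize the environment losses as much as M3 and M4 do.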
Slowing down for a slow car. To set up this scenario, we place the agent on a straight road at varying initial speeds and place another car ahead with a varying but slower constant speed, generating a total of 20 scenario variations, to evaluate the ability to slow for and then follow the car ahead. From Fig. 7 (row 3), we see that some models slow down to zero speed and get stuck. For the variation with the largest relative speed, there isn’t enough time for most models to stop the agent, which leads to a collision. For these cases, model M3, which uses imitation loss re-weighting, works better than model M4, which uses imitation dropout. M4 has trouble in two situations: it is overly aggressive in trying to maneuver around the slow car and ends up grazing the left edge of the road. This happens in the two extreme variations where the relative speed between the two cars is the highest.
6.3.2 Input Ablation Tests
With input ablation tests, we want to test the final M4 model’s ability to identify the correct causal factors behind specific behaviors, by testing the model’s behavior in the presence or absence of the correct causal factor while holding other conditions constant. In simulation, we evaluated our model on 20 scenarios with and without stop-signs rendered, and on 20 scenarios with and without other vehicles rendered in the scene. The model exhibits the correct behavior in all scenarios, thus confirming that it has learned to respond to the correct features for a stop-sign and a stopped vehicle.
6.3.3 Logged Data Simulated Driving
For this evaluation, we take logs from our real-driving test data (separate from our training data) and use our trained network to drive the car in the vehicle simulator while keeping everything else, such as the dynamic objects and traffic-light states, the same as in the logs. Example videos on our website illustrate the ability of the model to deal with multiple dynamic objects and road controls.
6.3.4 Real World Driving
We have also evaluated this model on our self-driving car by replacing the existing planner module with the learned model M4, and have replicated the driving behaviors observed in simulation. Videos of several of these runs illustrate not only the smoothness of the network’s driving, but also its ability to deal with stop-signs and turns and to drive for long durations in full closed-loop control without deviating from the trajectory.
6.4 Open Loop Evaluation
In an open-loop evaluation, we take test examples of expert driving data and, for each example, compute the L2 distance error between the predicted and ground-truth waypoints. Unlike the closed-loop setting, the predictions are not used to drive the agent forward, and thus the network never sees its own predictions as input. Fig. 8a shows the L2 distance metric in this open-loop evaluation setting for models M0 and M4 on a test set of 10,000 examples. These results show that model M0 makes fewer errors than the full model M4, but we know from closed-loop testing that M4 is a far better driver than M0. This shows how open-loop evaluations can be misleading, and that closed-loop evaluations are critical when assessing the real performance of such driving models.
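The open-loop metric is straightforward to compute. The sketch below assumes waypoints stored as (examples, timesteps, 2) arrays and reports the mean L2 error per predicted timestep; the array layout and the per-timestep averaging are assumptions for this example.

```python
import numpy as np

def l2_waypoint_error(pred, gt):
    """Mean Euclidean distance between predicted and true waypoints, per timestep.

    pred, gt: arrays of shape (num_examples, num_waypoints, 2).
    """
    distances = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)
    return distances.mean(axis=0)

pred = np.zeros((2, 3, 2))
gt = np.tile(np.array([3.0, 4.0]), (2, 3, 1))   # every true waypoint is 5 units away
errors = l2_waypoint_error(pred, gt)
print(errors)
```

Note that nothing in this computation feeds a prediction back into the next input, which is why a low open-loop error says little about how errors compound when the model actually drives.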
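Figure 8: Open-loop evaluation results. (a) Prediction error for models M0 and M4 on unperturbed evaluation data. (b) Prediction error for models M0 and M1 on perturbed evaluation data.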
We also compare the performance of models M0 and M1 on our perturbed evaluation data with respect to the L2 distance metric in Fig. 8b. Note that the model trained without perturbed data (M0) has larger errors because it cannot bring the agent back from the perturbation onto its original trajectory. Fig. 9 shows the trajectories predicted by these models on a few representative examples, showcasing that the perturbed data is critical to avoiding the veering-off tendency of the model trained without such data.
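Figure 9: Comparison of (a) the ground-truth trajectory with the trajectories predicted by (b) model M0 and (c) model M1 on two perturbed examples. The red point is the reference pose $(u_0, v_0)$, white points are the past poses and green points are the future poses.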
6.5 Failure Modes
At our ground resolution of 20 cm/pixel, the agent currently sees 64 m in front and 40 m on the sides, and this limits the model’s ability to perform merges on T-junctions and turns from a high-speed road. Specific situations like U-turns and cul-de-sacs are also not currently handled, and will require sampling enough training data. The model occasionally gets stuck in some low-speed nudging situations. It sometimes outputs turn geometries that make the specific turn infeasible (e.g. a large turning radius). We also see some cases where the model gets overly aggressive in novel and rare situations, for example by trying to pass a slow-moving vehicle. We believe that adequate simulated exploration may be needed for highly interactive or rare situations.
6.6 Sampling Speed Profiles
The waypoint prediction from the model at timestep $k$ is represented by the probability distribution $P_k(x, y)$ over the spatial domain in the top-down coordinate system. In this paper, we pick the mode of this distribution, $p_k$, to update the memory of the AgentRNN. More generally, we can also sample from this distribution, which allows us to predict trajectories with different speed profiles. Fig. 10 illustrates the predictions $P_1(x, y)$ and $P_5(x, y)$ at the first and the fifth iterations respectively, for a training example where the past motion history has been dropped out. Correspondingly, $P_1(x, y)$ has a high uncertainty along the longitudinal position and allows us to pick from a range of speed samples. Once we pick a specific sample, the ensuing waypoints are constrained in their ability to pick different speeds, and this shows as a centered distribution at $P_5(x, y)$.
Figure 10: Sampling speed profiles. The probability distribution $P_1(x, y)$ predicted by the model at timestep $k = 1$ allows us to sample different speed profiles, which condition the later distribution $P_5(x, y)$ and make it more constrained.
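The difference between picking the mode and sampling can be sketched on a toy heatmap. The array layout (rows are $y$, columns are $x$) and the function names are assumptions for this example.

```python
import numpy as np

def pick_mode(prob):
    """Return the (x, y) location of the highest-probability cell."""
    y, x = np.unravel_index(np.argmax(prob), prob.shape)
    return int(x), int(y)

def sample_waypoint(prob, rng):
    """Draw an (x, y) location with probability proportional to the heatmap."""
    flat = prob.ravel() / prob.sum()
    idx = rng.choice(flat.size, p=flat)
    y, x = np.unravel_index(idx, prob.shape)
    return int(x), int(y)

# Toy P_1: two plausible longitudinal positions along column x = 2,
# standing in for a slower and a faster speed profile.
p1 = np.zeros((5, 5))
p1[1, 2] = 0.7
p1[3, 2] = 0.3
rng = np.random.default_rng(0)
mode = pick_mode(p1)
samples = [sample_waypoint(p1, rng) for _ in range(50)]
print(mode, set(samples))
```

Taking the mode always commits to the single most likely speed, whereas sampling occasionally selects the other longitudinal position and so yields a different speed profile for the rest of the rollout.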
The use of a probability distribution over the next waypoint also presents the interesting possibility of constraining the model predictions at inference time to respect hard constraints. For example, such constrained sampling may provide a way to ensure that any trajectories we generate strictly obey legal restrictions such as speed limits. One could also constrain sampling of trajectories to a designated region, such as a region around a given reference trajectory.
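One way such a hard constraint might look in practice is to zero out every cell that cannot be reached without exceeding the speed limit and renormalize before sampling. This is a sketch of the idea only: the straight-line displacement bound, the parameter names and the units are assumptions, not a method described in the paper.

```python
import numpy as np

def constrain_to_speed_limit(prob, agent_xy, speed_limit, dt, resolution):
    """Zero out cells farther than speed_limit * dt from the agent, then renormalize.

    prob: (H, W) waypoint heatmap; agent_xy: current (x, y) in pixels;
    speed_limit in m/s, dt in seconds, resolution in metres per pixel.
    """
    max_px = speed_limit * dt / resolution
    ys, xs = np.mgrid[0:prob.shape[0], 0:prob.shape[1]]
    dist = np.hypot(xs - agent_xy[0], ys - agent_xy[1])
    masked = np.where(dist <= max_px, prob, 0.0)
    total = masked.sum()
    if total == 0.0:
        raise ValueError("no probability mass satisfies the constraint")
    return masked / total

uniform = np.full((5, 5), 1.0 / 25.0)
constrained = constrain_to_speed_limit(
    uniform, agent_xy=(2, 2), speed_limit=1.0, dt=1.0, resolution=1.0)
print(constrained)
```

The same masking step would work for a region around a reference trajectory by swapping the distance test for membership in that region.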
7. Discussion
In this paper, we presented our experience with what it took to get imitation learning to perform well in real-world driving. We found that the key to its success is synthesizing interesting situations around the expert’s behavior and adding appropriate losses that discourage undesirable behavior. This constrained exploration is what allowed us to avoid collisions and off-road driving even though such examples were not explicitly present in the expert’s demonstrations. To support it, and to best leverage the expert data, we used mid-level input and output representations, which allow easy mixing of real and simulated data and alleviate the burdens of learning perception and control. With these ingredients, we got a model good enough to drive a real car. That said, the model is not yet fully competitive with motion planning approaches, but we feel that this is a good step forward for machine-learned driving models. There is room for improvement: comparing with end-to-end approaches and investigating alternatives to imitation dropout are among the open directions. But most importantly, we believe that augmenting the expert demonstrations with a thorough exploration of rare and difficult scenarios in simulation, perhaps within a reinforcement learning framework, will be the key to improving the performance of these models, especially for highly interactive scenarios.