Learning to Navigate in Complex Environments

2018-06-26  初七123

Introduction

Learning robot navigation with reinforcement learning faces the following problems:

  1. Reward sparsity: there is only a single goal location that yields reward.
  2. The environment contains dynamic elements, which requires the agent to use memory at different timescales: fast one-shot memory of the goal location, short-term memory combining velocity signals and visual observations, and long-term memory of the environment layout.

To improve learning efficiency, auxiliary tasks are added to guide the reinforcement learning procedure.
The first involves reconstruction of a low-dimensional depth map at each time step by predicting one input modality (the depth channel) from the others (the colour channels).

The second task directly invokes loop closure from SLAM: the agent is trained to predict if the current location has been previously visited within a local trajectory.

APPROACH

We rely on an end-to-end learning framework that incorporates multiple objectives. First, it maximises cumulative reward using an actor-critic approach. Second, it minimises an auxiliary loss of inferring the depth map from the RGB observation. Finally, the agent is trained to detect loop closures as an additional auxiliary task that encourages implicit velocity integration.
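The three objectives can be folded into one scalar training loss. A minimal sketch; the auxiliary weights `beta_depth` and `beta_loop` are hypothetical placeholders (the paper treats such weights as tuned hyperparameters), and the function name is illustrative:

```python
def combined_loss(a3c_loss, depth_ce_loss, loop_ce_loss,
                  beta_depth=0.33, beta_loop=1.0):
    """Total objective: actor-critic loss plus weighted auxiliary losses.

    The beta weights are hyperparameters; the values here are
    placeholders, not the ones used in the paper.
    """
    return a3c_loss + beta_depth * depth_ce_loss + beta_loop * loop_ce_loss

total = combined_loss(1.5, 0.6, 0.3)
```

All three terms share the same network trunk, so minimising the auxiliary terms shapes the representation that the policy and value heads consume.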

The encoder (enc) is a 3-layer CNN.
Figure (d): Nav A3C + auxiliary tasks.
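As a rough sanity check on a 3-layer convolutional encoder, one can trace the spatial shape through the stack. The kernel/stride choices and the 84x84 input size below are illustrative assumptions, not taken from the paper:

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a 'valid' (no-padding) convolution layer."""
    return (size - kernel) // stride + 1

# Hypothetical 3-layer stack on an 84x84 input frame.
size = 84
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)
# size now holds the side length of the final feature map
```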

DEPTH PREDICTION
While this formulation, combined with a higher depth resolution, extracts the most information, mean square error imposes a unimodal distribution (van den Oord et al., 2016). To address this possible issue, we also consider a classification loss, where depth at each position is discretised into 8 different bands.
Replacing the regression problem with a classification one avoids the unimodal-distribution issue.
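A sketch of this classification variant, assuming uniformly sized bands over a normalised depth range (the band placement and function names are illustrative assumptions):

```python
import numpy as np

def discretise_depth(depth, n_bands=8, max_depth=1.0):
    """Map each depth value in [0, max_depth] to one of n_bands class labels."""
    edges = np.linspace(0.0, max_depth, n_bands + 1)[1:-1]  # interior band edges
    return np.digitize(depth, edges)  # integer labels in {0, ..., n_bands - 1}

def depth_classification_loss(logits, labels):
    """Per-position softmax cross-entropy against the discretised depth labels."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    n = labels.size
    return -log_probs.reshape(n, -1)[np.arange(n), labels.ravel()].mean()
```

Unlike mean squared error, the softmax output can place mass on several bands at once, so the network is not forced toward a single-mode prediction.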

LOOP CLOSURE PREDICTION
Specifically, in a trajectory denoted {p0, p1, ..., pT}, where pt is the position of the agent at time t, we define a loop closure label lt that equals 1 if the position pt of the agent is close to the position pt' visited at an earlier time t' < t.

To avoid trivial loop closures on consecutive points of the trajectory, we add an extra condition: an intermediary position pt'' (t' < t'' < t) must be far from pt. Thresholds η1 and η2 provide these two limits.
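The labelling rule above can be sketched as follows; the η values are illustrative rather than the paper's, and positions are taken to be 2-D:

```python
import numpy as np

def loop_closure_labels(positions, eta1=0.1, eta2=0.5):
    """Label l_t = 1 iff some earlier position p_{t'} lies within eta1 of p_t
    AND some intermediate position p_{t''} (t' < t'' < t) lies farther than
    eta2 from p_t, ruling out trivial closures on consecutive points.

    positions: (T, 2) array of agent (x, y) positions.
    """
    T = len(positions)
    labels = np.zeros(T, dtype=int)
    for t in range(T):
        d = np.linalg.norm(positions[:t] - positions[t], axis=1)  # dists to past
        for tp in np.nonzero(d < eta1)[0]:      # candidate earlier visits
            if np.any(d[tp + 1:] > eta2):       # some far intermediate point
                labels[t] = 1
                break
    return labels
```

These binary labels supervise the loop-closure prediction head during training.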

RELATED WORK

EXPERIMENTS

There are sparse ‘fruit’ rewards which serve to encourage exploration. Apples are worth 1 point, strawberries 2 points and goals are 10 points.

In the static variant of the maze, the goal and fruit locations are fixed and only the agent’s start location changes. In the dynamic (Random Goal) variant, the goal and fruits are randomly placed on every episode.

Expert human scores, established by a professional game player, are compared to these results. The Nav A3C+D2 agents reach human-level performance on Static 1 and 2, and attain about 91% and 59% of human scores on Random Goal 1 and 2.

ANALYSIS

POSITION DECODING
We train a position decoder that takes that representation as input: a linear classifier producing a multinomial probability distribution over the discretised maze locations.
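A minimal sketch of such a decoder, trained as multinomial logistic regression by gradient descent; the shapes, learning rate, and function names are illustrative, and crucially the gradients do not flow back into the agent:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def train_position_decoder(feats, cells, lr=0.1, epochs=200):
    """Linear classifier from agent hidden states to discretised maze cells.

    feats: (n, d) hidden-state features; cells: (n,) integer cell labels.
    Decoder-only training: the agent's weights are frozen.
    """
    n, d = feats.shape
    W = np.zeros((d, cells.max() + 1))
    onehot = np.eye(W.shape[1])[cells]
    for _ in range(epochs):
        p = softmax(feats @ W)
        W -= lr * feats.T @ (p - onehot) / n   # cross-entropy gradient step
    return W

def decode_accuracy(W, feats, cells):
    return (np.argmax(feats @ W, axis=1) == cells).mean()
```

High decoding accuracy indicates that the agent's internal representation implicitly encodes its location, even though position is never given as an explicit training target.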

In Random Goal 1, it is Nav A3C+D2 that achieves the best position decoding performance (85.5% accuracy), whereas the FF A3C and the LSTM A3C architectures are at approximately 50%.

In the I-maze, the linear position decoder for this agent is only 68.5% accurate, whereas it is 87.8% in the plain LSTM A3C agent.

STACKED LSTM GOAL ANALYSIS

INVESTIGATING DIFFERENT COMBINATIONS OF AUXILIARY TASKS
