Final Report

Improving Exploration for Curious RL Agents

Overview

Despite the overall success of reinforcement learning, even recent approaches perform poorly in certain scenarios due to extremely sparse reward signals. In the absence of external rewards, curiosity-driven exploration [1, 2] can provide an extra source of feedback that allows the agent to explore unseen states and acquire new skills, and it has shown notable success in environments with sparse, or even absent, rewards. There is, however, one failure mode that is consistently observed in curiosity-based methods. If the initial state is the same at the beginning of every episode, a curious agent will start exploring the region around this initial state and will eventually surround itself with a “bubble” of explored space where intrinsic reward is no longer generated, essentially creating yet another sparse-reward scenario. This problem is also known as “detachment” [6], and when it happens, existing exploration methods are usually unable to make further progress.

To mitigate this issue we extend the intrinsic motivation framework with structured navigation through the explored parts of the environment. During exploration we build a topological map of the environment [4] that, in conjunction with a learned locomotion policy, allows us to efficiently traverse to states on the exploration frontier before any further exploration is attempted.



Problem Formulation

Given an environment E and an agent that starts from an initial state s0, we seek a strategy that efficiently explores the state space. We measure the success of an exploration strategy as the portion of the state space that has been explored (qualitatively) and, when a sparse extrinsic reward is available, as the cumulative sum of rewards over the duration of the episode. We consider static environments (i.e., the environment does not change during an episode except in response to the agent's actions) with no other agents present.

Our goal is to formulate an exploration policy \pi(a | s) over an environment \mathbf{E}(s_{t+1} | s_t, a_t) that will combine traversal of previously explored regions and exploration of unseen states.

We chose to develop and train our algorithm on custom maze levels in VizDoom, with randomly generated wall textures to allow for visual localization. The agent has 7 actions: linear movement in 4 directions, rotation left or right, and idle. A goal object is placed on the opposite side of the map from the starting location and provides an extrinsic reward when reached.

We chose VizDoom [5] for its easy level design and low resource footprint. Our performance measure is map coverage; more specifically, how much of the map the agent explores over the course of several consecutive episodes.
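To make this measure concrete, here is a minimal sketch of how coverage can be computed by discretizing the agent's (x, y) positions into a grid. The cell size and map bounds are our own assumptions for illustration, not the values used in the actual experiments.

```python
import numpy as np

# Assumed constants for illustration only (not our experimental settings):
CELL_SIZE = 64.0                # VizDoom units per grid cell
MAP_BOUNDS = (4096.0, 4096.0)   # (width, height) of the maze in VizDoom units

GRID_SHAPE = (int(MAP_BOUNDS[0] // CELL_SIZE), int(MAP_BOUNDS[1] // CELL_SIZE))
visited = np.zeros(GRID_SHAPE, dtype=bool)  # accumulated across consecutive episodes

def update_coverage(agent_positions):
    """Mark every grid cell the agent passed through during an episode."""
    for x, y in agent_positions:
        i = min(int(x // CELL_SIZE), GRID_SHAPE[0] - 1)
        j = min(int(y // CELL_SIZE), GRID_SHAPE[1] - 1)
        visited[i, j] = True

def map_coverage():
    """Fraction of grid cells visited so far."""
    return visited.sum() / visited.size
```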


Related work

Intrinsic Curiosity Module (ICM)

The Intrinsic Curiosity Module generates an intrinsic reward based on how poorly the agent's forward model predicts a particular (s_t, a_t, s_{t+1}) transition. In other words, the agent receives a reward whenever it is “surprised” by a transition. This intrinsic reward is summed with any extrinsic reward, incentivizing the agent to explore novel states.
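As a concrete illustration, here is a minimal sketch of the surprise bonus, not the exact architecture from [1]: the forward model predicts the embedding of the next observation from the current embedding and the action, and the prediction error is the intrinsic reward. The encoder, embedding size, and scaling factor eta are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM, NUM_ACTIONS = 256, 7  # 7 actions as in our VizDoom setup; EMB_DIM is assumed

# Placeholder encoder: the real ICM uses a convolutional network trained
# jointly with an inverse model; a single linear layer stands in for it here.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(4 * 84 * 84, EMB_DIM), nn.ReLU())
forward_model = nn.Sequential(
    nn.Linear(EMB_DIM + NUM_ACTIONS, 256), nn.ReLU(), nn.Linear(256, EMB_DIM)
)

def intrinsic_reward(obs_t, action_t, obs_next, eta=0.01):
    """Surprise bonus: scaled prediction error of the forward model."""
    phi_t = encoder(obs_t)
    phi_next = encoder(obs_next)
    a_onehot = F.one_hot(action_t, NUM_ACTIONS).float()
    phi_pred = forward_model(torch.cat([phi_t, a_onehot], dim=1))
    return eta * 0.5 * (phi_pred - phi_next.detach()).pow(2).sum(dim=1)
```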

One issue with this approach is poor scalability as the search space grows. ICM provides a rich learning signal in the early iterations of training, since the neighborhood around s_0 is largely unexplored. But once the agent has learned to predict environment transitions in the neighborhood of the initial state, it no longer receives a learning signal that would push it to expand its search horizon.

Intrinsic Curiosity Module. Note how the intrinsic reward is derived from the disparity between the next state and the predicted next state.

Episodic Curiosity through Reachability (ECR)

Episodic curiosity [3] improves upon ICM by explicitly keeping a memory buffer of distinct explored states. As in ICM, the agent explores a sparse-reward environment and looks for novel states, but here the notion of novelty comes from an explicit comparison of the observation to external memory. Once a novel state is found, an embedding of the state is stored in the memory buffer and compared against future observations. In this way, the agent builds up a buffer of landmarks (interesting, unique states) during every training episode. The agent is rewarded only when a new landmark is created, so it is encouraged to explore outside its known neighborhood. The method is called “episodic curiosity” because the landmark memory is cleared after every episode.

To distinguish novel observations from observations in memory, this method uses a learned reachability metric parameterized by a deep neural network. Two observations from the same agent trajectory are considered “close” (reachable) when they are separated by at most k environment steps, and “far” when there are at least m*k steps between them (m > 1). The reachability network is trained on this data to predict the close/far label for any pair of observations. An observation is then considered novel when, according to the reachability metric, it is farther than some threshold from every observation in memory.
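For illustration, here is a sketch (our own simplification, not the exact training code) of how close/far pairs can be sampled from a trajectory and how the novelty test against memory might look; `reachability` stands in for the learned network, and the threshold is assumed.

```python
import random

def sample_reachability_pair(trajectory, k=5, m=4):
    """Return (obs_i, obs_j, label): 1 means 'close' (<= k steps apart),
    0 means 'far' (>= m*k steps apart)."""
    assert len(trajectory) > m * k + 1, "trajectory too short for a far pair"
    i = random.randrange(len(trajectory) - m * k)
    if random.random() < 0.5:
        j = min(i + random.randint(1, k), len(trajectory) - 1)  # close pair
        return trajectory[i], trajectory[j], 1
    j = random.randint(i + m * k, len(trajectory) - 1)          # far pair
    return trajectory[i], trajectory[j], 0

def is_novel(obs, memory, reachability, threshold=0.5):
    """Novel if the reachability score to every stored landmark is below threshold."""
    return all(reachability(obs, landmark) < threshold for landmark in memory)
```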

Semi-Parametric Topological Memory for Navigation (SPTM)

SPTM [4] is designed to solve navigation problems. Given a set of expert demonstrations (sample trajectories in state space), the algorithm builds a topological map of the environment and then uses this topological graph, together with a learned locomotion policy, to navigate between desired states (e.g., to reach goals).
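A rough sketch of the idea, not the authors' implementation: nodes are observations from a demonstration trajectory, edges connect temporally adjacent observations plus visually similar pairs, and navigation repeatedly picks the next waypoint on the shortest path toward the goal. The `similarity` callable and `shortcut_threshold` below are assumptions standing in for SPTM's retrieval network.

```python
import networkx as nx

def build_topological_map(trajectory, similarity, shortcut_threshold=0.9):
    """Graph over trajectory indices: temporal edges plus visual shortcuts."""
    graph = nx.Graph()
    graph.add_nodes_from(range(len(trajectory)))
    graph.add_edges_from((i, i + 1) for i in range(len(trajectory) - 1))
    for i in range(len(trajectory)):
        for j in range(i + 2, len(trajectory)):
            if similarity(trajectory[i], trajectory[j]) > shortcut_threshold:
                graph.add_edge(i, j)  # shortcut between visually close timesteps
    return graph

def next_waypoint(graph, current_node, goal_node):
    """Next node on the shortest path; handed to the locomotion policy as a goal."""
    path = nx.shortest_path(graph, current_node, goal_node)
    return path[1] if len(path) > 1 else goal_node
```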


Improvements upon the state of the art

Use a topological representation for remembering landmarks

Insight: Exploration methods tend to converge to a single trajectory

We observed that both surprise-based methods (e.g., ICM) and explicit episodic curiosity (ECR) eventually stop exploring the environment, converging to a single trajectory that maximizes intrinsic reward, although this happens for different reasons in ICM and ECR.

For ICM the main contributor seems to be the “detachment” problem explained above. Sooner or later the forward predictive model in ICM starts to generate very good predictions for all transitions around the initial state, the agent no longer receives any intrinsic reward for most actions, and it eventually forgets how to get to the exploration boundary due to catastrophic forgetting. In our experiments the agent always converged to a very simple behavior: moving back and forth close to a wall, trying to trick the forward model and collect at least some intrinsic reward.

ECR, on the other hand, does not suffer from detachment because the landmark buffer (memory) is cleared before every episode, so a source of intrinsic reward is always present. But in our experiments, as learning progressed, the agent eventually found a trajectory with a very high sum of intrinsic reward (such as a long corridor with no obstacles) and collapsed to a low-entropy policy, essentially repeating this same trajectory over and over again.

Insight: Explicitly encouraging the agent to leave the explored region does not help

To mitigate the tendency of ECR to follow a single simple trajectory, we conducted a series of experiments in which past episodic landmark buffers were gradually added to a single persistent buffer kept across episodes. The idea is to discourage the agent from repeating the same trajectory, because eventually the entire trajectory will be in the persistent memory and the agent will no longer receive any reward for following it. With this modification we observe an effect very similar to ICM “detachment”: in the beginning the algorithm learns well, but as more and more states are added to the persistent memory it struggles to make progress. Eventually the agent forgets how to get to the exploration boundary and collapses to a very simple behavior, unable to obtain any reward at all.
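For concreteness, a minimal sketch of this modification (the function and variable names are our own, not from the implementation): the novelty check during an episode compares against both the episodic memory and the persistent buffer, so states from earlier episodes never pay out intrinsic reward again.

```python
persistent_buffer = []

def merge_episodic_buffer(episodic_buffer, is_novel):
    """At the end of an episode, fold still-novel landmarks into the persistent buffer."""
    for landmark in episodic_buffer:
        if is_novel(landmark, persistent_buffer):
            persistent_buffer.append(landmark)
    episodic_buffer.clear()
```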

A reconstruction of one experiment with the modified ECR is provided below. Note that the agent first explores the lower-left part of the map (left histogram). After a few episodes, that region is added to the persistent buffer and is considered explored. The agent then moves on to explore another region above the lower-left part of the map (central histogram). However, after all such regions near the reset state have been explored, the agent has no incentive to move towards those landmarks. Hence, it gets stuck around the reset state (right histogram).

Thus we can see that having the graph alone is not enough. This motivated us to add a locomotion module to the algorithm, which can be used for structured locomotion in the explored region of the map, allowing the agent to reliably reach the exploration boundary and continue exploration.

Structured locomotion in explored region

We used a PPO-based reinforcement learning algorithm for the locomotion policy. Given a current observation and a goal observation, the agent gets a reward when it is ‘near’ the goal, with nearness defined by a learned distance metric similar to the one in ECR. In this manner, we can learn how to move around the parts of the map we know, based on observed states. However, as the agent explores more of the environment, the locomotion policy struggles to keep learning new transitions.
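A minimal sketch of this goal-conditioned reward, assuming the learned reachability metric is available as a callable and treating the threshold and bonus value as free hyperparameters:

```python
def locomotion_reward(obs, goal_obs, reachability, close_threshold=0.7):
    """Reward the PPO locomotion policy only when the goal is 'near'."""
    return 1.0 if reachability(obs, goal_obs) > close_threshold else 0.0
```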

Locomotion policy in an ever-increasing state space

As we explore, the state space on which we need to learn locomotion grows. In a spatial environment like VizDoom, the agent sees new rooms, lights, textures, and so on. The transition between two neighboring landmarks in the graph can be non-trivial, and as the explored region grows we need to learn transitions between an ever-growing number of different states without forgetting the previously learned ones. The problem of locomotion policy training therefore essentially becomes a continual learning problem, which is known to be very challenging.

Locomotion success rate (fraction of goals we managed to reach) decreases as the size of the explored region increases.

We postulated that learning the locomotion policy on a “sparse” landmark graph was not the right way forward. Our latest effort is to build a dense SPTM-style topological map in which the entire exploration trajectory is stored and used later for locomotion. In this sense, we are essentially hand-holding the policy as it learns locomotion over a growing topological map. This is showing promising results, and our new approach (outlined below) effectively merges SPTM-style locomotion with ECR-style exploration.



Approach

Using the ECR paper [3] as a starting point, we added the following components that need to be learned. First, we learn a distance metric to decide which frames are close by and which are far away. Then we use this metric to generate a topological map of the environment, with the distance metric determining the edges between landmarks. Finally, based on this graph and the distance metric, we learn a locomotion policy that traverses the graph. With these dependencies in mind, here is our approach:

We start with a bootstrapping stage to train the distance metric. We then move on to an exploration stage, where we use this metric to explore with a policy similar to ECR, building a graph of the explored region as we go. Next, a locomotion stage learns a locomotion policy on the graph of the explored region. After this stage, we return to the exploration stage, where we use the locomotion policy to move through the explored region and reach its frontier, from which we start exploring again. This continues ad infinitum, alternating between exploration and locomotion stages (sketched below).
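The alternating loop can be summarized in a short sketch; the stage callables below are placeholders for the components described above, not an actual API.

```python
def alternating_training(bootstrap_metric, exploration_stage, locomotion_stage,
                         num_cycles=100):
    """Alternate ECR-style exploration and graph-based locomotion training."""
    metric = bootstrap_metric()      # bootstrapping stage: learn the distance metric
    graph, policy = None, None       # topological map and locomotion policy
    for _ in range(num_cycles):
        # Exploration stage: the locomotion policy (if any) brings the agent to
        # the frontier of the current graph, then ECR-style curiosity extends the map.
        graph = exploration_stage(metric, graph, policy)
        # Locomotion stage: (re)train the goal-reaching policy on the grown map.
        policy = locomotion_stage(metric, graph)
    return graph, policy
```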

Here are the resulting visualizations we produce. On the left is a histogram of positions the agent has visited (note the logarithmic scale). On the right is the corresponding topological map the agent generates, overlaid on top of the ground-truth map. Note that the agent does not know the (x, y) location of each node.


Results

Our metric of performance, as defined in the problem formulation, is the degree of exploration in maps with sparse rewards. The figures below show the degree of exploration of our method versus several baselines. The environment is a large Doom maze where the agent starts in the bottom left and the goal is in the top right; see the map on the right.

The first baseline, PPO, is a vanilla policy-gradient algorithm that is better suited to environments with dense rewards. The second baseline is the ICM module discussed earlier. The third result is the latest we have from ECR-style exploration. We expect better results once the dense SPTM-style locomotion policy trains robustly in this environment.

For each baseline, we show the explored region and the generated map, where applicable.

Doom environment map. Agent starts at the red location in the bottom left, goal is in the top right, in green.

Since the reward is very far away, the PPO policy effectively takes random actions near the home state. We can see that it does not explore much and gets stuck on the walls near the reset state. ICM explores more than PPO due to its inherent curiosity to see new states; however, the agent can simply move around in circles because it only looks at the current and previous frames, with no memory of all the states it has observed in the past. Our method, which modifies ECR exploration, does store a map of the places we have been to, in the form of a topological graph. As a result, it can recognize when it is close to a known state and move away towards new parts of the map. All of this can be seen in the images above.


References

  1. Pathak, Deepak, et al. “Curiosity-driven exploration by self-supervised prediction.” ICML 2017.
  2. Burda, Yuri, et al. “Large-scale study of curiosity-driven learning.” ICLR 2019.
  3. Savinov, Nikolay, et al. “Episodic curiosity through reachability.” ICLR 2019.
  4. Savinov, Nikolay, et al. “Semi-parametric topological memory for navigation.” ICLR 2018.
  5. Wydmuch, Marek, et al. “ViZDoom Competitions: Playing Doom from Pixels.” IEEE Transactions on Games (in print). arXiv:1809.03470.
  6. Ecoffet, Adrien, et al. “Go-Explore: a New Approach for Hard-Exploration Problems.” arXiv:1901.10995.

Acknowledgements: Thanks to Aleksei Petrenko and Peter Lillian for setting us up with this curiosity-based research project.

