The Go-Explore Algorithm
Intelligent exploration, especially when rewards are sparse or deceptive
A grand challenge in reinforcement learning is producing intelligent exploration, especially when rewards are sparse or deceptive. Go-Explore is a new algorithm for such ‘hard exploration problems.’ See the video by Jeff Clune, Senior Research Scientist & Founding Member at Uber AI Labs.
Video: The Go-Explore algorithm
In deep reinforcement learning (RL), solving the Atari games Montezuma’s Revenge and Pitfall has been a grand challenge. These games represent a broad class of challenging, real-world problems called “hard-exploration problems,” where an agent has to learn complex tasks with very infrequent or deceptive feedback.
Go-Explore builds up an archive of interestingly different game states. By explicitly storing a variety of stepping stones in this archive, Go-Explore remembers promising areas and can return to them for further exploration.
===
Three key insights
Go-Explore performs so well on hard-exploration problems because of three key principles:
- 1 Remember good exploration stepping stones (interestingly different states visited so far)
- 2 First return to a state, then explore
- 3 First solve a problem, then robustify (if necessary)
These principles are absent from most RL algorithms, but it would be interesting to weave them in. As discussed above, contemporary RL algorithms do not do number 1. Number 2 is important because current RL algorithms explore by randomly perturbing the parameters or actions of the current policy in the hope of reaching new areas of the environment. That is ineffective when most changes break or substantially alter the policy, so that it can no longer return to hard-to-reach states before exploring further from them. This problem gets worse the longer, more complex, and more precise the sequence of actions required to reach a state is. Go-Explore solves it by first returning to a state and then exploring from there. Doing so enables deep exploration that can find a solution to the problem, which can then be robustified to produce a reliable policy (principle number 3).
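To make principles 1 and 2 concrete, here is a minimal sketch of the exploration loop on a toy problem. It is an illustration under stated assumptions, not the implementation from Uber AI Labs. The `ChainEnv` corridor, the use of the raw position as the archive key, the uniform random choice of which state to return to, and the names `go_explore` and `explore_steps` are all hypothetical. The sketch also assumes a deterministic environment, so that "returning" can be done by replaying stored actions, and it leaves out principle 3 (robustification) entirely.

```python
import random


class ChainEnv:
    """Hypothetical toy environment: a corridor with a reward only at the far end."""

    def __init__(self, length=30):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action 0 moves left, action 1 moves right; walls clamp the position
        if action == 0:
            self.pos = max(0, self.pos - 1)
        else:
            self.pos = min(self.length, self.pos + 1)
        done = self.pos == self.length
        return self.pos, (1.0 if done else 0.0), done


def go_explore(env, iterations=500, explore_steps=10, seed=0):
    rng = random.Random(seed)
    start = env.reset()
    # Principle 1: the archive remembers, for each distinct state seen so far,
    # the shortest known action sequence that reaches it.
    archive = {start: []}
    solution = None
    for _ in range(iterations):
        # Pick a stepping stone (uniformly at random: an assumption of this sketch).
        cell = rng.choice(list(archive))
        trajectory = list(archive[cell])
        # Principle 2a: first return to the state by replaying its stored actions.
        env.reset()
        for action in trajectory:
            env.step(action)
        # Principle 2b: then explore from there with random actions.
        for _ in range(explore_steps):
            action = rng.randrange(2)
            state, _reward, done = env.step(action)
            trajectory.append(action)
            if done:
                # Keep the shortest solving trajectory; terminal states are not archived.
                if solution is None or len(trajectory) < len(solution):
                    solution = list(trajectory)
                break
            if state not in archive or len(trajectory) < len(archive[state]):
                archive[state] = list(trajectory)
    return archive, solution


if __name__ == "__main__":
    archive, solution = go_explore(ChainEnv(length=30))
    print("states in archive:", len(archive))
    print("solved:", solution is not None)
```

Because every iteration restarts from a state already in the archive rather than from the beginning, progress along the corridor is never lost. A purely random walk from the start would have to cover the whole distance in one uninterrupted run.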
For more information, see the Uber blog post on Go-Explore.