I've been playing with deep reinforcement learning for a while. I originally
started with a simple DQN, added all improvements from the Rainbow paper, and
finally changed C51 for a quantile regression (and plan to swap it for an
Implicit Quantile Network).
After implementing C51 (which was my first time with distributional RL) I
started playing with policies that take advantage of the learned distributions :
By independently taking N samples from each action-value distribution, scoring
actions by averaging the samples, and picking the greedy action with respect to
these scores, I was able to make the agent learn faster than similar agents
using only NoisyNets or an epsilon-greedy policy (I'm still using NoisyNet, this
is done on top of it). In the limiting cases, N=1 is just Thompson Sampling
and N=+Infinity is just a plain greedy policy.
Finding an optimal value for N proved to be a challenge, so I decided to pick
a random value for it at the start of each episode (N = 2**rng.uniform(8,12)
for a QR-DQN with 32 quantiles/action works well in my experiments), which led
to even better results.
I later found out about
DLTV which made the
agent discover new behaviors, but performed worse than previous experiments
overall. Inspired by it, I tried something I did not find in previous works and
got the best results out of all my previous experiments :
At each time step, compute an exploration_score as the ratio of "intra-action
variance" over "inter-action variance"
(rendered latex equation).
I then take N/exploration_score samples from each distribution, and pick an
action as described above. (more details at the end of this post)
For anyone reading this, I have a few questions :
- Are you aware of any previous work I missed that tries similar exploration
policies with distributional RL (interpolating between Thompson sampling and
the greedy policy)
- Most papers I found about learning from multiple exploration policies seem to
be in the context of multi-actor parallelization. Is there any novelty in
randomizing the policy parameters at the start of each episode, especially in
the single-actor case ?
- Is any part of what I'm doing worth the time it would take to quantitatively
evaluate it ? I've been doing it mainly for learning and fun and have only
qualitatively evaluated it so far. However, if there's a chance I can
contribute to the field, I'll gladly make some time to compare it to
published papers on ALE.
A few more details
I actually track a moving average and standard deviation of the exploration
score, which lets me shift/rescale its values to a target average and standard
deviation, and divide N by the shifted/rescaled value. I initially started with
a target average of 1 and standard deviation of 1 as well (which gave good
results), then tried randomizing these parameters at the start of each episode
as well. This led to a lot more diversity in the policies and even better
results.
Since this worked so well, I additionally randomized the noise strength in the
NoisyNet layers.
Overall, this made the agent a lot more robust to deviating from what it
considers to be the optimal trajectory, and allowed it to learn complex
behaviors previous iterations were never able to learn (e.g. taking a few steps
back to gain momentum, waiting for good cycles, or dodging hammer bros)
Watch it learn
For anyone interested, I made a
live stream of the training in progress with graphs
and some more details on the experiments I'm running. The current training run
was started ~2.5 days ago. The agent has finished and unlocked levels up to 5-1, and is currently learning 5-2.
A lot more details
Long text hidden, click to expand
Available actions : The agent does not have access to the up and down
buttons, the available actions only use left, right, A and B.
Adding the down button would double the total number of actions (because down
can be pressed on top of all available actions).
Reward function : It mainly consists of
reward(t) = max(0, x(t) - previous_best_x) + a larger reward for beating a
stage. I had to tweak the scaling of both components.
I initially had penalties for time and death, but one made the agent suicidal in
front of hard-to-overcome obstacles, while the other made it fear them too much
and hug the left side of the screen. Removing both proved to increase the
performance.
One trick that seems to help with most '*-3' levels (which have a lot of void
to fall into) was to hold the reward while the vertical velocity of Mario is
negative (meaning it is falling). Without this trick, the agent would sometimes
get stuck learning to jump the farthest it can into the void.
Stage scheduling : Each episode is one attempt on one level. At the start of
each episode, a stage is randomly picked with probability proportional to
1/(number of times the stage was beaten) among the unlocked stages. Each stage
is unlocked after the previous one has been beaten 30 times, with only 1-1
unlocked at the start of the training.
Available stages : The first iterations of the agent were unable to learn
maze castles (4-3, 7-3 and 8-4), so I removed them all. The reward function will
give rewards for the first path the agent tries, then the agent will be
teleported back by the game and no reward is received until it finds the right
path and gets past the point where the game teleported it back. I plan to test
newer (better) versions of the agent on these stages only and see if mazes can
be re-added to the pool.
I've also removed underwater stages (2-2 and 7-2). The agent can learn them
fine, but the game dynamics are really different from all other stages and
they're really boring to watch. Since I already removed a bunch of stages, I
figured I could remove these as well but I may re-add them with mazes because
beating every level is cooler than beating a cherry-picked selection.
Since 8-4 is the only stage that requires going down a pipe, I considered it was
not worth it to add the down action and will likely never re-add it to the pool,
which would unfortunately be really anti-climactic...
Replay buffer warm-up : After initially using the standard approach of
filling the buffer with transitions sampled from a random policy before training
the neural net, I came-up with a "soft warm-up" scheme in which the first
gradient updates happen after only 2000 transitions, but initially happen every
few thousand transitions and gradually become more frequent until the replay
buffer is full. Together with my custom exploration policy, this works very well
: the agent very quickly starts behaving similar to a "right + random button"
policy before learning to actually jump and run.
Custom n-step bootstrapping : When I initially implemented n-step bootstrap
targets, I initially used n=3 from the Rainbow paper, noting the same
instabilities as the paper did for higher n values. I then found
the Retrace(\lambda) paper which seems to
successfully address this by increasing n until the online network disagrees
with the action choice from a stored transition. This makes n larger where the
replay buffer data is on-policy, and smaller when it becomes off-policy. Since
my GPU is already maxed and the training is already slow (20.8t/s when real-time
is 20t/s) I could not afford the additional computations (building a training
sample (s(t), a(t), sum(r(t+0..n)), s(t+n)) needs up to n_max transitions to
go through the online network).
I'm trying to achieve similar sample efficiency gains by using cheaper
alternatives as proxies for "how off-policy is a given transition" : I'm using
the number of times a transition has been sampled, with
n = int(max(n_min, n_max * k**times_sampled)) ; 0<k<1. The currently running
experiment uses n_max=14, n_min=1 and k=1/1.3. I'm pretty sure it helps
early in the training, and it does not collapse like a constant n=14 does
Stream setup : As I said, this is something I do for my own fun, and I
really wanted to be able to see the agent learn in real time. The code runs a
separate process, to which frames from training episodes are sent in a queue.
The process then sends the frames as raw RGB24 to an local UDP socket, to which
GStreamer connects and encodes the stream.
With a simple MediaMTX configuration, I can manage the
Gstreamer process and have the stream available through WebRTC on my LAN.
Then I figured someone else might have fun watching this, so I added a line to
my MediaMTX config to send the stream to twitch and youtube. The overlay is a
headless browser displaying custom HTML/JS (using d3.js for the graphs) piping
raw frames to ffmpeg. GStreamer handles compositing the
two streams together into the side-by-side view.