rlmodels: Reinforcement Learning Plus Tidy Code
rlmodels is a well-documented Reinforcement Learning library designed to be easy to use, extend and play with. It supports specifying hyper-parameter values as functions of the time-step, which allows for fine-tuning.
Reinforcement Learning (RL) is certainly a hot topic right now, and with good reason: since DeepMind's breakthrough in 2013, in which they achieved superhuman performance in many Atari games, RL models have broken record after record and achieved stunning feats, including some of the best protein shape predictions to date, beating Go's world champion and endowing robotic hands with impressive dexterity. Thanks to RL's exploding success and popularity, it is now possible to find a plethora of algorithms, tutorials and implementations online. I think that on the whole this is great news, since it allows people interested in RL to get started in no time. However, most of these implementations are script-like and lack any documentation, being designed to be copy-pasted into a Python session, which makes it cumbersome to tweak them or to play with the model's hyper-parameters.
RL is one of my favorite topics, and having toyed with it for some time already, I decided to take what I had and try to turn it into an extensible, readable and maintainable RL Python package. This was not a trivial task: as any RL practitioner knows, algorithms tend to differ from one another in fundamental ways even though there are commonalities. Moreover, since at present RL algorithms tend to be very sensitive to their hyper-parameters, the package had to allow enough tuning flexibility so as to avoid slowing down experimentation.
The result was the rlmodels library, which runs on top of PyTorch and is already on PyPI, so it can be installed using pip. At the moment only 4 algorithms are included:
- Double Q-Network (DQN) with prioritized experience replay (PER)
- Deep Deterministic Policy Gradient (DDPG) with PER
- Covariance Matrix Adaptation Evolution Strategy (CMAES)
Together they cover both continuous and discrete action spaces, and in any case, CMAES being an evolutionary algorithm means that it can tackle any kind of problem since it does not place restrictions on the mappings from features to decisions, as long as we map the neural network's (NN) output to the decision space appropriately.
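To make the idea of mapping the network's output to the decision space concrete, here is a minimal sketch of two such mappings: an argmax for discrete actions and a tanh squashing for bounded continuous ones. The function names `to_discrete` and `to_bounded` are hypothetical and not part of rlmodels; plain Python lists stand in for the network's output.

```python
import math


def to_discrete(outputs):
    """Pick the index of the largest output (argmax)."""
    return max(range(len(outputs)), key=lambda i: outputs[i])


def to_bounded(outputs, low, high):
    """Squash each output into the interval [low, high] using tanh."""
    return [low + (high - low) * (math.tanh(x) + 1.0) / 2.0 for x in outputs]


choice = to_discrete([0.1, 2.3, -0.7])                    # index of the largest entry
steer = to_bounded([0.0, 50.0, -50.0], low=-2.0, high=2.0)  # values inside [-2, 2]
```

Neither mapping is differentiable-friendly by design (the argmax has no useful gradient), which is fine for an evolutionary method that never computes derivatives.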
To develop a loosely coupled program, the general RL pipeline was decomposed in the way depicted in the diagram below. This gives us the flexibility to fine-tune each one of the moving parts, while making everything much more manageable.
The package is fully documented and a Quickstart guide can be found in its repo's README, so in this post I'm going to limit the discussion to some of the high-level features that I had in mind when writing it.
For the sake of usability and convenience, the package works with environments like the ones from OpenAI's gym library, so all the environments from said package are immediately available; in the case of custom environments, it is straightforward to put them inside a gym-like wrapper:
    class MyEnvWrapper(object):

        def __init__(self, myenv):
            self.env = myenv

        def step(self, action):
            # some logic that computes s, r, terminated and info
            return s, r, terminated, info  # need to output these 4 things (info can be None)

        def reset(self):
            # some logic
            pass

        def seed(self):
            # some logic
            pass
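As a concrete illustration of the template above, here is a minimal sketch that wraps a made-up environment. `CoinFlipEnv`, its `advance` and `restart` methods and `GymLikeWrapper` are hypothetical names, not part of rlmodels or gym; only the `step`/`reset`/`seed` interface follows the template. Returning the initial state from `reset` is an assumption borrowed from gym's convention.

```python
import random


class CoinFlipEnv:
    """Hypothetical custom environment: guess the outcome of a biased coin."""

    def __init__(self, p_heads=0.7, horizon=10):
        self.p_heads = p_heads
        self.horizon = horizon
        self.rng = random.Random()
        self.t = 0

    def restart(self):
        self.t = 0

    def advance(self, guess):
        """Flip the coin once and return (reward, finished)."""
        outcome = 1 if self.rng.random() < self.p_heads else 0
        self.t += 1
        return (1.0 if guess == outcome else 0.0), self.t >= self.horizon


class GymLikeWrapper:
    """Exposes the step/reset/seed interface described above."""

    def __init__(self, myenv):
        self.env = myenv

    def step(self, action):
        reward, terminated = self.env.advance(action)
        state = [self.env.t / self.env.horizon]  # fraction of the episode elapsed
        return state, reward, terminated, None  # info can be None

    def reset(self):
        self.env.restart()
        return [0.0]  # initial state

    def seed(self, seed=None):
        self.env.rng.seed(seed)


env = GymLikeWrapper(CoinFlipEnv())
env.seed(0)
state = env.reset()
terminated, total_reward = False, 0.0
while not terminated:
    state, reward, terminated, info = env.step(1)  # always guess heads
    total_reward += reward
```

The wrapper keeps all environment-specific logic inside the wrapped object, so the same loop works for any environment exposing these three methods.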
The package includes 3 basic agent classes that make it straightforward to create an appropriate agent for the problem at hand. All of them use ReLU activations:
- FullyConnected: neural network (NN) with a custom deterministic output activation function
- DiscretePolicy: NN with a softmax output layer that simulates actions accordingly
- ContinuousPolicy: NN with an uncorrelated normally distributed output layer that simulates actions accordingly
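To illustrate what "simulating actions" from these two kinds of output layer means, here is a minimal sketch in plain Python. It is not the library's implementation: `sample_discrete` and `sample_continuous` are hypothetical helpers, and plain lists stand in for the network's output.

```python
import math
import random


def sample_discrete(logits, rng):
    """Apply a softmax to the logits, then draw an action index from it."""
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - m) for z in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, probs


def sample_continuous(means, stds, rng):
    """Draw each action coordinate independently from N(mean, std^2)."""
    return [rng.gauss(mu, sd) for mu, sd in zip(means, stds)]


rng = random.Random(0)
action, probs = sample_discrete([2.0, 0.5, -1.0], rng)
torque = sample_continuous([0.0, 1.0], [0.1, 0.2], rng)
```

"Uncorrelated" here means each coordinate has its own mean and standard deviation and is drawn independently of the others.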
The following is a very simple example of a FullyConnected instance for the Cartpole environment and a CMAES model:
    import numpy as np

    # agent output is argmax from a 2-dimensional output vector (values not related to Q function!)
    def binary_output(x):
        # CMAES does not need derivatives, so the action choice can be an argmax function
        return np.argmax(x.detach().numpy())

    agent = FullyConnected([6, 6], 4, 2, binary_output)
This creates an agent with an input size of 4, two 6-node layers and an output of 2 neurons; the action is selected by applying the custom binary_output function to the output layer.
Those classes are mostly for convenience; if a more exotic architecture is needed, all it needs in order to work with the package is a forward method that returns the agent's action.
RL is, most of the time, an iterative process over the hyper-parameter space, and depending on the fitting algorithm being used, this space can be quite large: step sizes, batch sizes, exploration rates, number of steps per update, and so on. Running a sequential script time and time again can make us lose track of what we've actually tried. With this in mind, the other important abstraction in rlmodels is the scheduler object. There is one scheduler class per RL algorithm, and each one takes one function argument per hyper-parameter of its algorithm. Each such function should output the corresponding hyper-parameter value as a function of the step counter; the following is an example for a DQN scheduler:
    dqn_scheduler = DQNScheduler(
        batch_size = lambda t: 200,  # constant
        exploration_rate = lambda t: max(0.01, 0.05 - 0.01*int(t/2500)),  # decrease exploration rate every 2,500 timesteps
        PER_alpha = lambda t: 1,  # PER = prioritized experience replay
        PER_beta = lambda t: 1,  # constant
        tau = lambda t: 100,  # constant
        agent_lr_scheduler_fn = lambda t: 1.25**(-int(t/1000)),  # learning rate shrinkage parameter
        steps_per_update = lambda t: 1)  # constant
Taking the exploration_rate parameter in the code above as an example, it starts at 5% and decreases by one percentage point every 2,500 iterations until it reaches the 1% floor (at step 10,000), after which it stays constant. Using the lr_scheduler module from PyTorch, it is also possible to change the optimization algorithm's learning rate on the fly through the agent_lr_scheduler_fn parameter; at the moment, the package only supports LambdaLR schedulers, but that has been enough so far.
All of this helps us fine-tune all the moving parts of our agent, as it allows modifying hyper-parameters at runtime, and makes it much easier to keep track of them.
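To see what such schedules actually produce, here is a small sketch that evaluates two of the functions from the DQN example at a few step counts. It does not use rlmodels at all; `base_lr` is a hypothetical optimizer learning rate, and the multiplication reflects how a LambdaLR scheduler scales the optimizer's initial learning rate by the value its function returns.

```python
def exploration_rate(t):
    # same expression as in the DQN scheduler example
    return max(0.01, 0.05 - 0.01 * int(t / 2500))


def lr_factor(t):
    # same expression as agent_lr_scheduler_fn in the DQN scheduler example
    return 1.25 ** (-int(t / 1000))


base_lr = 0.001  # hypothetical initial learning rate of the optimizer

# (step, exploration rate, effective learning rate) at a few checkpoints
schedule = [
    (t, exploration_rate(t), base_lr * lr_factor(t))
    for t in (0, 1000, 2500, 5000, 10000, 20000)
]

for t, eps, lr in schedule:
    print(f"t={t:>6}  exploration_rate={eps:.3f}  learning_rate={lr:.6f}")
```

Tabulating the schedules like this before a long run is a cheap way to check that the floor and the decay speed are what we intended.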
Even if we are sure that our code is working as intended, it can still be challenging to find the right hyper-parameter combination for our problem. In this case it is convenient to have a closer look at how the agent is changing iteration after iteration. With this in mind, the package logs some useful information when the logging level is set to DEBUG; by writing:
    import logging

    FORMAT = '%(asctime)-15s: %(message)s'
    logging.basicConfig(
        level=logging.DEBUG,
        format=FORMAT,
        filename="my_log_file.log",
        filemode="a")
we'll be able to see the trace of mean batch loss improvement (i.e., the loss over the current batch before and after the gradient descent steps), which can help us determine whether the step size sequence is the right one and how different hyper-parameter configurations compare. One must be careful not to set the number of iterations too large when running diagnostics, since each iteration has to write to the log file and this can make the program significantly slower.
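The kind of diagnostic described above can be sketched on a toy problem. This is not rlmodels' logging code: the logger name `rl_diagnostics`, the helper functions and the one-parameter linear model are all hypothetical, chosen only to show a mean batch loss being measured before and after a gradient step and reported at DEBUG level.

```python
import logging

logger = logging.getLogger("rl_diagnostics")  # hypothetical logger name


def mean_loss(w, batch):
    """Mean squared error of the linear model y = w * x over a batch."""
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)


def gradient_step(w, batch, lr):
    """One gradient descent step on the mean squared error."""
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    loss_before = mean_loss(w, batch)
    w_new = w - lr * grad
    loss_after = mean_loss(w_new, batch)
    # this record is only emitted when the logging level is DEBUG or lower
    logger.debug("mean batch loss: %.4f -> %.4f", loss_before, loss_after)
    return w_new, loss_before, loss_after


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data generated by y = 2x
w, before, after = gradient_step(0.0, batch, lr=0.05)
```

If the logged loss repeatedly goes up after the step, the step size is probably too large for the current stage of training.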
What is missing
The most important missing feature I can think of is distributed model fitting: at the moment everything runs sequentially, so the package can only be used for relatively small problems. It would also be nice to have other more recent RL algorithms, though I haven't had the time to do any of this.
If you feel any of this can be useful to you please go ahead and have a look at the example scripts at the repository to get started!