rlmodels: Reinforcement Learning Plus Tidy Code

Nestor Sanchez

Posted in Reinforcement Learning

Results from rlmodels' Double Q-Network implementation in the lunar lander environment

rlmodels is a well-documented Reinforcement Learning library designed to be easy to use, extend and play with. It supports specifying hyper-parameter values as functions of the time-step, which allows for fine-tuning.

Reinforcement Learning (RL) is certainly a hot topic right now, and with good reason: since DeepMind's 2013 breakthrough achieving superhuman performance in many Atari games, RL models have broken record after record and achieved stunning feats, including some of the best protein shape predictions to date, beating Go's world champion, and endowing robotic hands with impressive dexterity. Thanks to RL's exploding success and popularity, it is now possible to find a plethora of algorithms, tutorials and implementations online. On the whole this is great news, since it allows people interested in RL to get started in no time, but most of these implementations are script-like and lack any documentation: they are designed to be copy-pasted into a Python session, which makes it cumbersome to tweak them or to play with a model's hyper-parameters.

The rlmodels package

RL is one of my favorite topics, and having toyed with it for some time I decided to take what I had and turn it into an extensible, readable and maintainable RL Python package. This was not a trivial task: as any RL practitioner knows, algorithms tend to differ from one another in fundamental ways even though there are commonalities, and since presently all RL algorithms tend to be very sensitive to their hyper-parameters, the program had to allow enough tuning flexibility to avoid slowing down experimentation. The result was the rlmodels library, which runs on top of PyTorch and is already on PyPI, so it can be installed using pip. At the moment only four algorithms are included:

  1. Double Q-Network (DQN) with prioritized experience replay (PER)
  2. Deep Deterministic Policy Gradient (DDPG) with PER
  3. Covariance Matrix Adaptation Evolution Strategy (CMAES)
  4. Advantage Actor-Critic (A2C)

Together they cover both continuous and discrete action spaces. In any case, CMAES, being an evolutionary algorithm, can tackle any kind of problem, since it places no restrictions on the mapping from features to decisions, as long as we map the neural network's (NN) output to the decision space appropriately.
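As a sketch of what such output mappings can look like (these helper names are illustrative, not part of the rlmodels API): since CMAES needs no gradients through the action choice, any plain function from the raw output vector to the action space will do.

```python
import numpy as np

def discrete_action(raw):
    # discrete action space: pick the index of the largest output
    return int(np.argmax(raw))

def bounded_continuous_action(raw, low=-2.0, high=2.0):
    # continuous action space: squash each output into [low, high]
    return low + (high - low) * (np.tanh(raw) + 1.0) / 2.0

print(discrete_action(np.array([0.1, 0.9, -0.3])))  # -> 1
```

The same network can thus serve different decision spaces just by swapping the mapping function.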

To develop a loosely coupled program, the general RL pipeline was decomposed in the way depicted in the diagram below

rlmodels map

This gives us the flexibility to fine-tune each of the moving parts while keeping everything much more manageable.

The package is fully documented and a Quickstart guide can be found in its repo's README, so in this post I'm going to limit the discussion to some of the high-level features that I had in mind when writing it.

The environment

For the sake of usability and convenience, the package works with environments like those from OpenAI's gym library, so all the environments from that package are immediately available; custom environments are straightforward to put inside a gym-like wrapper:

class MyEnvWrapper(object):
	def __init__(self, myenv):
		self.env = myenv
	def step(self, action):
		## some logic
		return s, r, terminated, info  # must return these 4 things (info can be None)
	def reset(self):
		# some logic; should return the initial state
		return s
	def seed(self, seed=None):
		# some logic
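To make the wrapper idea concrete, here is a minimal, self-contained sketch with a toy stand-in for a custom environment (both `ToyEnv` and its `do_step` method are hypothetical, invented only for this example):

```python
import random

class ToyEnv:
    """A stand-in for a custom environment: runs for 5 steps, reward 1 each."""
    def __init__(self):
        self.t = 0
    def do_step(self, action):
        self.t += 1
        return [self.t], 1.0, self.t >= 5

class MyEnvWrapper:
    def __init__(self, myenv):
        self.env = myenv
    def step(self, action):
        s, r, terminated = self.env.do_step(action)
        return s, r, terminated, None  # gym-style 4-tuple
    def reset(self):
        self.env.t = 0
        return [0]
    def seed(self, seed=None):
        random.seed(seed)

env = MyEnvWrapper(ToyEnv())
s = env.reset()
done, total = False, 0.0
while not done:
    s, r, done, info = env.step(0)
    total += r
print(total)  # -> 5.0
```

Anything exposing this `step`/`reset`/`seed` interface can then be dropped into the same training loop as a gym environment.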

The agent

The package includes 3 basic agent classes that make it straightforward to create an appropriate agent for whatever problem is at hand. All of them use ReLU activations:

  1. FullyConnected: neural network (NN) with a custom deterministic output activation function
  2. DiscretePolicy: NN with a softmax output layer that simulates actions accordingly
  3. ContinuousPolicy: NN with an uncorrelated normally distributed output layer that simulates actions accordingly

The following is a very simple example of a FullyConnected instance for the Cartpole environment and a CMAES model:

import numpy as np

# agent output is the argmax of a 2-dimensional output vector (values not related to the Q function!)
def binary_output(x):
  # CMAES does not need derivatives, so the action choice can be an argmax function
  return np.argmax(x.detach().numpy())

agent = FullyConnected([6,6], 4, 2, binary_output)

This creates an agent with an input size of 4, two 6-node hidden layers and an output of 2 neurons; the action is selected by applying the custom binary_output function to the output layer. These classes are mostly for convenience: if a more exotic architecture is needed, all it requires to work with the package is a forward method that returns the agent's action.
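As a sketch of such a custom architecture (the class name and layer sizes here are hypothetical, not part of the package), any PyTorch module whose forward method returns the action should do:

```python
import torch
import torch.nn as nn

class CustomAgent(nn.Module):
    def __init__(self, input_size=4, hidden=6, n_actions=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(input_size, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, x):
        logits = self.head(self.body(x))
        return torch.argmax(logits, dim=-1)  # returns the action, not raw values

agent = CustomAgent()
action = agent(torch.zeros(1, 4))  # a valid action index: 0 or 1
```

The internals are entirely up to you; only the forward contract matters.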

Hyper-parameter tuning

RL is, most of the time, an iterative process over the hyper-parameter space, and depending on the fitting algorithm being used, this space can be quite large: step sizes, batch sizes, exploration rates, number of steps per update, and so on; running a sequential script time and time again can make us lose track of what we've actually tried. With this in mind, the other important abstraction in rlmodels is the scheduler object. There is one per RL algorithm, and each takes as many functional arguments as its algorithm has hyper-parameters. Each such function should output the corresponding hyper-parameter value as a function of the step counter; the following is an example for a DQN scheduler:

dqn_scheduler = DQNScheduler(
	batch_size = lambda t: 200, #constant
	exploration_rate = lambda t: max(0.01,0.05 - 0.01*int(t/2500)), #decrease exploration rate every 2,500 timesteps
	PER_alpha = lambda t: 1, #PER = prioritized experience replay
	PER_beta = lambda t: 1, #constant
	tau = lambda t: 100, #constant
	agent_lr_scheduler_fn = lambda t: 1.25**(-int(t/1000)), #learning rate shrinkage parameter
	steps_per_update = lambda t: 1) #constant

Taking the exploration_rate parameter in the code above as an example, it starts at 5% and decreases by one percentage point every 2,500 iterations until it reaches 1%, after which it stays constant. Using the lr_scheduler module from PyTorch, it is also possible to change the optimization algorithm's learning rate on the fly through the agent_lr_scheduler_fn parameter; at the moment the package only supports LambdaLR schedulers, but this has been enough so far. All of this helps us fine-tune all the moving parts of our agent, since it allows modifying hyper-parameters at runtime, and makes it much easier to keep track of them.
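Because the schedules are plain callables of the step counter, they can be inspected before any training happens. A quick sanity check of the exploration-rate schedule from the example above:

```python
# same schedule as in the DQNScheduler example: 5% at t=0, minus one
# percentage point every 2,500 steps, floored at 1%
exploration_rate = lambda t: max(0.01, 0.05 - 0.01 * int(t / 2500))

print([round(exploration_rate(t), 2) for t in (0, 2500, 5000, 20000)])
# -> [0.05, 0.04, 0.03, 0.01]
```

Tracing a schedule this way is a cheap way to catch off-by-one or sign mistakes before committing to a long run.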


Even if we are sure that our code is working as intended, it can still be challenging to find the right hyper-parameter combination for our problem. In that case it is convenient to take a closer look at how the agent changes from iteration to iteration. With this in mind, some useful information can be surfaced by setting the logging level to DEBUG:

import logging
FORMAT = '%(asctime)-15s: %(message)s'
logging.basicConfig(format=FORMAT, level=logging.DEBUG)

we'll be able to see the trace of the mean batch loss improvement (i.e., the loss over the current batch before and after the gradient descent steps), which can help us determine whether the step-size sequence is the right one and how different hyper-parameter configurations compare. One must be careful not to set the number of iterations too large when running diagnostics, since each one writes to the log file, and this can make the program significantly slower.

What is missing

The most important missing feature I can think of is distributed model fitting: at the moment everything runs sequentially, so the package is only practical for relatively small problems. It would also be nice to have other, more recent RL algorithms, though I haven't had the time to do any of this yet.

If you think any of this could be useful to you, please go ahead and have a look at the example scripts in the repository to get started!