Unity ML-Agents Gym Wrapper

A common way in which machine learning researchers interact with simulation environments is via a wrapper provided by OpenAI called gym. For more information on the gym interface, see here.

We provide a gym wrapper and instructions for using it with existing machine learning algorithms which utilize gym. Our wrapper provides interfaces on top of our UnityEnvironment class, which is the default way of interfacing with a Unity environment via Python.

Installation

The gym wrapper is part of the mlagents_envs package. Please refer to the mlagents_envs installation instructions.

Using the Gym Wrapper

The gym interface is available from gym_unity.envs. To launch an environment from the root of the project repository use:

from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper

env = UnityToGymWrapper(unity_env, uint8_visual, flatten_branched, allow_multiple_obs)

unity_env refers to the Unity environment to be wrapped.
uint8_visual refers to whether to output visual observations as uint8 values (0-255). Many common Gym environments (e.g. Atari) do this. By default they will be floats (0.0-1.0). Defaults to False.
flatten_branched will flatten a branched discrete action space into a Gym Discrete. Otherwise, it will be converted into a MultiDiscrete. Defaults to False.
allow_multiple_obs will return a list of observations. The first elements contain the visual observations and the last element contains the array of vector observations. If False the environment returns a single array (containing a single visual observations, if present, otherwise the vector observation). Defaults to False.
action_space_seed is the optional seed for action sampling. If non-None, will be used to set the random seed on created gym.Space instances.

The returned environment env will function as a gym.

Limitations

It is only possible to use an environment with a single Agent.
By default, the first visual observation is provided as the observation, if present. Otherwise, vector observations are provided. You can receive all visual and vector observations by using the allow_multiple_obs=True option in the gym parameters. If set to True, you will receive a list of observation instead of only one.
The TerminalSteps or DecisionSteps output from the environment can still be accessed from the info provided by env.step(action).
Stacked vector observations are not supported.
Environment registration for use with gym.make() is currently not supported.
Calling env.render() will not render a new frame of the environment. It will return the latest visual observation if using visual observations.

Running OpenAI Baselines Algorithms

OpenAI provides a set of open-source maintained and tested Reinforcement Learning algorithms called the Baselines.

Using the provided Gym wrapper, it is possible to train ML-Agents environments using these algorithms. This requires the creation of custom training scripts to launch each algorithm. In most cases these scripts can be created by making slight modifications to the ones provided for Atari and Mujoco environments.

These examples were tested with baselines version 0.1.6.

Example - DQN Baseline

In order to train an agent to play the GridWorld environment using the Baselines DQN algorithm, you first need to install the baselines package using pip:

pip install git+git://github.com/openai/baselines

Next, create a file called train_unity.py. Then create an /envs/ directory and build the environment to that directory. For more information on building Unity environments, see here. Note that because of limitations of the DQN baseline, the environment must have a single visual observation, a single discrete action and a single Agent in the scene. Add the following code to the train_unity.py file:

import gym

from baselines import deepq
from baselines import logger

from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper


def main():
  unity_env = UnityEnvironment( < path - to - environment >)
  env = UnityToGymWrapper(unity_env, uint8_visual=True)
  logger.configure('./logs')  # Change to log in a different directory
  act = deepq.learn(
    env,
    "cnn",  # For visual inputs
    lr=2.5e-4,
    total_timesteps=1000000,
    buffer_size=50000,
    exploration_fraction=0.05,
    exploration_final_eps=0.1,
    print_freq=20,
    train_freq=5,
    learning_starts=20000,
    target_network_update_freq=50,
    gamma=0.99,
    prioritized_replay=False,
    checkpoint_freq=1000,
    checkpoint_path='./logs',  # Change to save model in a different directory
    dueling=True
  )
  print("Saving model to unity_model.pkl")
  act.save("unity_model.pkl")


if __name__ == '__main__':
  main()

To start the training process, run the following from the directory containing train_unity.py:

python -m train_unity

Other Algorithms

Other algorithms in the Baselines repository can be run using scripts similar to the examples from the baselines package. In most cases, the primary changes needed to use a Unity environment are to import UnityToGymWrapper, and to replace the environment creation code, typically gym.make(), with a call to UnityToGymWrapper(unity_environment) passing the environment as input.

A typical rule of thumb is that for vision-based environments, modification should be done to Atari training scripts, and for vector observation environments, modification should be done to Mujoco scripts.

Some algorithms will make use of make_env() or make_mujoco_env() functions. You can define a similar function for Unity environments. An example of such a method using the PPO2 baseline:

from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs import UnityToGymWrapper
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.bench import Monitor
from baselines import logger
import baselines.ppo2.ppo2 as ppo2

import os

try:
  from mpi4py import MPI
except ImportError:
  MPI = None


def make_unity_env(env_directory, num_env, visual, start_index=0):
  """
  Create a wrapped, monitored Unity environment.
  """

  def make_env(rank, use_visual=True):  # pylint: disable=C0111
    def _thunk():
      unity_env = UnityEnvironment(env_directory, base_port=5000 + rank)
      env = UnityToGymWrapper(unity_env, uint8_visual=True)
      env = Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
      return env

    return _thunk

  if visual:
    return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])
  else:
    rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
    return DummyVecEnv([make_env(rank, use_visual=False)])


def main():
  env = make_unity_env( < path - to - environment >, 4, True)
  ppo2.learn(
    network="mlp",
    env=env,
    total_timesteps=100000,
    lr=1e-3,
  )


if __name__ == '__main__':
  main()

Run Google Dopamine Algorithms

Google provides a framework Dopamine, and implementations of algorithms, e.g. DQN, Rainbow, and the C51 variant of Rainbow. Using the Gym wrapper, we can run Unity environments using Dopamine.

First, after installing the Gym wrapper, clone the Dopamine repository.

git clone https://github.com/google/dopamine

Then, follow the appropriate install instructions as specified on Dopamine's homepage. Note that the Dopamine guide specifies using a virtualenv. If you choose to do so, make sure your unity_env package is also installed within the same virtualenv as Dopamine.

Adapting Dopamine's Scripts

First, open dopamine/atari/run_experiment.py. Alternatively, copy the entire atari folder, and name it something else (e.g. unity). If you choose the copy approach, be sure to change the package names in the import statements in train.py to your new directory.

Within run_experiment.py, we will need to make changes to which environment is instantiated, just as in the Baselines example. At the top of the file, insert

from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs import UnityToGymWrapper

to import the Gym Wrapper. Navigate to the create_atari_environment method in the same file, and switch to instantiating a Unity environment by replacing the method with the following code.

    game_version = 'v0' if sticky_actions else 'v4'
    full_game_name = '{}NoFrameskip-{}'.format(game_name, game_version)
    unity_env = UnityEnvironment(<path-to-environment>)
    env = UnityToGymWrapper(unity_env, uint8_visual=True)
    return env

<path-to-environment> is the path to your built Unity executable. For more information on building Unity environments, see here, and note the Limitations section below.

Note that we are not using the preprocessor from Dopamine, as it uses many Atari-specific calls. Furthermore, frame-skipping can be done from within Unity, rather than on the Python side.

Limitations

Since Dopamine is designed around variants of DQN, it is only compatible with discrete action spaces, and specifically the Discrete Gym space. For environments that use branched discrete action spaces, you can enable the flatten_branched parameter in UnityToGymWrapper, which treats each combination of branched actions as separate actions.

Furthermore, when building your environments, ensure that your Agent is using visual observations with greyscale enabled, and that the dimensions of the visual observations is 84 by 84 (matches the parameter found in dqn_agent.py and rainbow_agent.py). Dopamine's agents currently do not automatically adapt to the observation dimensions or number of channels.

Hyperparameters

The hyperparameters provided by Dopamine are tailored to the Atari games, and you will likely need to adjust them for ML-Agents environments. Here is a sample dopamine/agents/rainbow/configs/rainbow.gin file that is known to work with a simple GridWorld.

import dopamine.agents.rainbow.rainbow_agent
import dopamine.unity.run_experiment
import dopamine.replay_memory.prioritized_replay_buffer
import gin.tf.external_configurables

RainbowAgent.num_atoms = 51
RainbowAgent.stack_size = 1
RainbowAgent.vmax = 10.
RainbowAgent.gamma = 0.99
RainbowAgent.update_horizon = 3
RainbowAgent.min_replay_history = 20000  # agent steps
RainbowAgent.update_period = 5
RainbowAgent.target_update_period = 50  # agent steps
RainbowAgent.epsilon_train = 0.1
RainbowAgent.epsilon_eval = 0.01
RainbowAgent.epsilon_decay_period = 50000  # agent steps
RainbowAgent.replay_scheme = 'prioritized'
RainbowAgent.tf_device = '/cpu:0'  # use '/cpu:*' for non-GPU version
RainbowAgent.optimizer = @tf.train.AdamOptimizer()

tf.train.AdamOptimizer.learning_rate = 0.00025
tf.train.AdamOptimizer.epsilon = 0.0003125

Runner.game_name = "Unity" # any name can be used here
Runner.sticky_actions = False
Runner.num_iterations = 200
Runner.training_steps = 10000  # agent steps
Runner.evaluation_steps = 500  # agent steps
Runner.max_steps_per_episode = 27000  # agent steps

WrappedPrioritizedReplayBuffer.replay_capacity = 1000000
WrappedPrioritizedReplayBuffer.batch_size = 32

This example assumed you copied atari to a separate folder named unity. Replace unity in import dopamine.unity.run_experiment with the folder you copied your run_experiment.py and trainer.py files to. If you directly modified the existing files, then use atari here.

Starting a Run

You can now run Dopamine as you would normally:

python -um dopamine.unity.train \
  --agent_name=rainbow \
  --base_dir=/tmp/dopamine \
  --gin_files='dopamine/agents/rainbow/configs/rainbow.gin'

Again, we assume that you've copied atari into a separate folder. Remember to replace unity with the directory you copied your files into. If you edited the Atari files directly, this should be atari.

Example: GridWorld

As a baseline, here are rewards over time for the three algorithms provided with Dopamine as run on the GridWorld example environment. All Dopamine (DQN, Rainbow, C51) runs were done with the same epsilon, epsilon decay, replay history, training steps, and buffer settings as specified above. Note that the first 20000 steps are used to pre-fill the training buffer, and no learning happens.

We provide results from our PPO implementation and the DQN from Baselines as reference. Note that all runs used the same greyscale GridWorld as Dopamine. For PPO, num_layers was set to 2, and all other hyperparameters are the default for GridWorld in config/ppo/GridWorld.yaml. For Baselines DQN, the provided hyperparameters in the previous section are used. Note that Baselines implements certain features (e.g. dueling-Q) that are not enabled in Dopamine DQN.

Dopamine on GridWorld