Training Configuration File

Common Trainer Configurations

One of the first decisions you need to make regarding your training run is which trainer to use: PPO, SAC, or POCA. Some training configurations are common to all trainers (reviewed in this section), while others depend on the choice of trainer (reviewed in subsequent sections).

Setting Description
trainer_type (default = ppo) The type of trainer to use: ppo, sac, or poca.
summary_freq (default = 50000) Number of experiences that need to be collected before generating and displaying training statistics. This determines the granularity of the graphs in TensorBoard.
time_horizon (default = 64) How many steps of experience to collect per-agent before adding it to the experience buffer. When this limit is reached before the end of an episode, a value estimate is used to predict the overall expected reward from the agent's current state. As such, this parameter trades off between a less biased, but higher variance estimate (long time horizon) and a more biased, but lower variance estimate (short time horizon). In cases where there are frequent rewards within an episode, or episodes are prohibitively long, a smaller number may be preferable. This number should be large enough to capture all the important behavior within a sequence of an agent's actions.

Typical range: 32 - 2048
max_steps (default = 500000) Total number of steps (i.e., observation collected and action taken) that must be taken in the environment (or across all environments if using multiple in parallel) before ending the training process. If you have multiple agents with the same behavior name within your environment, all steps taken by those agents will contribute to the same max_steps count.

Typical range: 5e5 - 1e7
keep_checkpoints (default = 5) The maximum number of model checkpoints to keep. Checkpoints are saved after the number of steps specified by the checkpoint_interval option. Once the maximum number of checkpoints has been reached, the oldest checkpoint is deleted when saving a new checkpoint.
even_checkpoints (default = false) If set to true, ignores checkpoint_interval and evenly distributes checkpoints throughout training based on keep_checkpoints and max_steps, i.e. checkpoint_interval = max_steps / keep_checkpoints. Useful for cataloging agent behavior throughout training.
checkpoint_interval (default = 500000) The number of experiences collected between each checkpoint by the trainer. A maximum of keep_checkpoints checkpoints are saved before old ones are deleted. Each checkpoint saves the .onnx files in results/ folder.
init_path (default = None) Initialize trainer from a previously saved model. Note that the prior run should have used the same trainer configurations as the current run, and have been saved with the same version of ML-Agents.

You can provide either the file name or the full path to the checkpoint, e.g. {} or ./models/{run-id}/{behavior_name}/{}. This option is provided in case you want to initialize different behaviors from different runs or initialize from an older checkpoint; in most cases, it is sufficient to use the --initialize-from CLI parameter to initialize all models from the same run.
threaded (default = false) Allow environments to step while updating the model. This might result in a training speedup, especially when using SAC. For best performance, leave this setting at false when using self-play.
hyperparameters -> learning_rate (default = 3e-4) Initial learning rate for gradient descent. Corresponds to the strength of each gradient descent update step. This should typically be decreased if training is unstable, and the reward does not consistently increase.

Typical range: 1e-5 - 1e-3
hyperparameters -> batch_size Number of experiences in each iteration of gradient descent. This should always be multiple times smaller than buffer_size. If you are using continuous actions, this value should be large (on the order of 1000s). If you are using only discrete actions, this value should be smaller (on the order of 10s).

Typical range: (Continuous - PPO): 512 - 5120; (Continuous - SAC): 128 - 1024; (Discrete, PPO & SAC): 32 - 512.
hyperparameters -> buffer_size (default = 10240 for PPO and 50000 for SAC)
PPO: Number of experiences to collect before updating the policy model. Corresponds to how many experiences should be collected before we do any learning or updating of the model. This should be multiple times larger than batch_size. Typically a larger buffer_size corresponds to more stable training updates.
SAC: The max size of the experience buffer - on the order of thousands of times longer than your episodes, so that SAC can learn from old as well as new experiences.

Typical range: PPO: 2048 - 409600; SAC: 50000 - 1000000
hyperparameters -> learning_rate_schedule (default = linear for PPO and constant for SAC) Determines how learning rate changes over time. For PPO, we recommend decaying learning rate until max_steps so learning converges more stably. However, for some cases (e.g. training for an unknown amount of time) this feature can be disabled. For SAC, we recommend holding learning rate constant so that the agent can continue to learn until its Q function converges naturally.

linear decays the learning_rate linearly, reaching 0 at max_steps, while constant keeps the learning rate constant for the entire training run.
network_settings -> hidden_units (default = 128) Number of units in the hidden layers of the neural network. Corresponds to how many units are in each fully connected layer of the neural network. For simple problems where the correct action is a straightforward combination of the observation inputs, this should be small. For problems where the action is a very complex interaction between the observation variables, this should be larger.

Typical range: 32 - 512
network_settings -> num_layers (default = 2) The number of hidden layers in the neural network. Corresponds to how many hidden layers are present after the observation input, or after the CNN encoding of the visual observation. For simple problems, fewer layers are likely to train faster and more efficiently. More layers may be necessary for more complex control problems.

Typical range: 1 - 3
network_settings -> normalize (default = false) Whether normalization is applied to the vector observation inputs. This normalization is based on the running average and variance of the vector observation. Normalization can be helpful in cases with complex continuous control problems, but may be harmful with simpler discrete control problems.
network_settings -> vis_encode_type (default = simple) Encoder type for encoding visual observations.

simple (default) uses a simple encoder which consists of two convolutional layers. nature_cnn uses the CNN implementation proposed by Mnih et al., consisting of three convolutional layers. resnet uses the IMPALA Resnet consisting of three stacked layers, each with two residual blocks, making a much larger network than the other two. match3 is a smaller CNN (Gudmundsson et al.) that can capture more granular spatial relationships and is optimized for board games. fully_connected uses a single fully connected dense layer as encoder without any convolutional layers.

Due to the size of the convolution kernel, each encoder type has a minimum observation size that it can handle - simple: 20x20, nature_cnn: 36x36, resnet: 15x15, match3: 5x5. fully_connected doesn't have convolutional layers and thus has no size limit, but since it has less representation power it should be reserved for very small inputs. Note that using the match3 CNN with very large visual input might result in a huge observation encoding and thus potentially slow down training or cause memory issues.
network_settings -> conditioning_type (default = hyper) Conditioning type for the policy using goal observations.

none treats the goal observations as regular observations, while hyper (default) uses a HyperNetwork with goal observations as input to generate some of the weights of the policy. Note that when using hyper the number of parameters of the network increases greatly. Therefore, it is recommended to reduce the number of hidden_units when using this conditioning_type.
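
As a rough illustration of two of the settings described above, the following Python sketch computes the learning rate under a linear schedule and the checkpoint interval implied by even_checkpoints. It only mirrors the descriptions in this section (decay to 0 at max_steps, and checkpoint_interval = max_steps / keep_checkpoints); the function names are hypothetical and this is not the trainer's own implementation, which may differ in details such as a minimum learning rate.

```python
def linear_schedule(initial_value: float, step: int, max_steps: int) -> float:
    """Decay initial_value linearly so that it reaches 0 at max_steps."""
    # Clamp at 0 so that steps past max_steps never yield a negative value.
    fraction_left = max(0.0, 1.0 - step / max_steps)
    return initial_value * fraction_left


def even_checkpoint_interval(max_steps: int, keep_checkpoints: int) -> int:
    """Interval described for even_checkpoints = true."""
    return max_steps // keep_checkpoints


if __name__ == "__main__":
    # Defaults quoted in the table: learning_rate = 3e-4, max_steps = 500000.
    for step in (0, 250_000, 500_000):
        print(step, linear_schedule(3e-4, step, 500_000))
    # keep_checkpoints = 5 spreads checkpoints evenly over the run.
    print(even_checkpoint_interval(500_000, 5))
```

With the default values, the learning rate is halved at the midpoint of training, and a checkpoint would be written every 100000 steps.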

Trainer-specific Configurations

Depending on your choice of a trainer, there are additional trainer-specific configurations. We present them below in separate tables, but keep in mind that you only need to include the configurations for the trainer selected (i.e. the trainer_type setting above).

PPO-specific Configurations

Setting Description
hyperparameters -> beta (default = 5.0e-3) Strength of the entropy regularization, which makes the policy "more random." This ensures that agents properly explore the action space during training. Increasing this will ensure more random actions are taken. This should be adjusted such that the entropy (measurable from TensorBoard) slowly decreases alongside increases in reward. If entropy drops too quickly, increase beta. If entropy drops too slowly, decrease beta.

Typical range: 1e-4 - 1e-2
hyperparameters -> epsilon (default = 0.2) Influences how rapidly the policy can evolve during training. Corresponds to the acceptable threshold of divergence between the old and new policies during gradient descent updating. Setting this value small will result in more stable updates, but will also slow the training process.

Typical range: 0.1 - 0.3
hyperparameters -> beta_schedule (default = learning_rate_schedule) Determines how beta changes over time.

linear decays beta linearly, reaching 0 at max_steps, while constant keeps beta constant for the entire training run. If not explicitly set, the default beta schedule will be set to hyperparameters -> learning_rate_schedule.
hyperparameters -> epsilon_schedule (default = learning_rate_schedule) Determines how epsilon changes over time (PPO only).

linear decays epsilon linearly, reaching 0 at max_steps, while constant keeps the epsilon constant for the entire training run. If not explicitly set, the default epsilon schedule will be set to hyperparameters -> learning_rate_schedule.
hyperparameters -> lambd (default = 0.95) Regularization parameter (lambda) used when calculating the Generalized Advantage Estimate (GAE). This can be thought of as how much the agent relies on its current value estimate when calculating an updated value estimate. Low values correspond to relying more on the current value estimate (which can be high bias), and high values correspond to relying more on the actual rewards received in the environment (which can be high variance). The parameter provides a trade-off between the two, and the right value can lead to a more stable training process.

Typical range: 0.9 - 0.95
hyperparameters -> num_epoch (default = 3) Number of passes to make through the experience buffer when performing gradient descent optimization. The larger the batch_size, the larger it is acceptable to make this. Decreasing this will ensure more stable updates, at the cost of slower learning.

Typical range: 3 - 10
hyperparameters -> shared_critic (default = False) Whether or not the policy and value function networks share a backbone. It may be useful to use a shared backbone when learning from image observations.
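
To make the bias/variance trade-off controlled by lambd more concrete, here is a minimal Python sketch of the standard Generalized Advantage Estimate recursion. It is a textbook formulation, not code taken from the trainer; the function name and the argument layout are assumptions made for illustration, and gamma is the discount factor described under Reward Signals.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    bootstrap_value: float,
    gamma: float = 0.99,
    lambd: float = 0.95,
) -> List[float]:
    """Compute GAE advantages for one trajectory segment.

    values[t] is the value estimate at step t; bootstrap_value is the
    value estimate for the state after the last step (0.0 if the
    episode ended there).
    """
    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: relies entirely on the current value estimate.
        delta = rewards[t] + gamma * next_value - values[t]
        # lambd controls how much of the later TD errors is mixed in.
        running = delta + gamma * lambd * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With lambd = 0 each advantage is just the one-step TD error (more reliance on the value estimate, higher bias), while with lambd = 1 it becomes the discounted sum of actual rewards minus the value estimate (higher variance).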

SAC-specific Configurations

Setting Description
hyperparameters -> buffer_init_steps (default = 0) Number of experiences to collect into the buffer before updating the policy model. As the untrained policy is fairly random, pre-filling the buffer with random actions is useful for exploration. Typically, at least several episodes of experiences should be pre-filled.

Typical range: 1000 - 10000
hyperparameters -> init_entcoef (default = 1.0) How much the agent should explore in the beginning of training. Corresponds to the initial entropy coefficient set at the beginning of training. In SAC, the agent is incentivized to make its actions entropic to facilitate better exploration. The entropy coefficient weighs the true reward with a bonus entropy reward. The entropy coefficient is automatically adjusted to a preset target entropy, so the init_entcoef only corresponds to the starting value of the entropy bonus. Increase init_entcoef to explore more in the beginning, decrease to converge to a solution faster.

Typical range: (Continuous): 0.5 - 1.0; (Discrete): 0.05 - 0.5
hyperparameters -> save_replay_buffer (default = false) Whether to save and load the experience replay buffer as well as the model when quitting and re-starting training. This may help resumes go more smoothly, as the experiences collected won't be wiped. Note that replay buffers can be very large, and will take up a considerable amount of disk space. For that reason, we disable this feature by default.
hyperparameters -> tau (default = 0.005) How aggressively to update the target network used for bootstrapping value estimation in SAC. Corresponds to the magnitude of the target Q update during the SAC model update. In SAC, there are two neural networks: the target and the policy. The target network is used to bootstrap the policy's estimate of the future rewards at a given state, and is fixed while the policy is being updated. This target is then slowly updated according to tau. Typically, this value should be left at 0.005. For simple problems, increasing tau to 0.01 might reduce the time it takes to learn, at the cost of stability.

Typical range: 0.005 - 0.01
hyperparameters -> steps_per_update (default = 1) Average ratio of agent steps (actions) taken to updates made of the agent's policy. In SAC, a single "update" corresponds to grabbing a batch of size batch_size from the experience replay buffer, and using this mini batch to update the models. Note that it is not guaranteed that after exactly steps_per_update steps an update will be made, only that the ratio will hold true over many steps. Typically, steps_per_update should be greater than or equal to 1. Note that setting steps_per_update lower will improve sample efficiency (reduce the number of steps required to train) but increase the CPU time spent performing updates. For most environments where steps are fairly fast (e.g. our example environments) steps_per_update equal to the number of agents in the scene is a good balance. For slow environments (steps take 0.1 seconds or more) reducing steps_per_update may improve training speed. We can also change steps_per_update to lower than 1 to update more often than once per step, though this will usually result in a slowdown unless the environment is very slow.

Typical range: 1 - 20
hyperparameters -> reward_signal_num_update (default = steps_per_update) Number of steps per mini batch sampled and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated. However, to imitate the training procedure in certain imitation learning papers (e.g. Kostrikov et al., Blondé et al.), we may want to update the reward signal (GAIL) M times for every update of the policy. We can change steps_per_update of SAC to N, as well as reward_signal_steps_per_update under reward_signals to N / M to accomplish this. By default, reward_signal_steps_per_update is set to steps_per_update.
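
The role of tau described above can be sketched as a soft (exponential moving average) update of the target network toward the policy network. The averaging form below is an assumption based on how target networks are commonly updated in SAC implementations; the text here only states that the target is "slowly updated according to tau". Weights are represented as plain lists of floats and the function name is hypothetical.

```python
from typing import List


def soft_update(
    target_weights: List[float],
    policy_weights: List[float],
    tau: float = 0.005,
) -> List[float]:
    """Move each target weight a fraction tau of the way toward the policy weight."""
    return [
        (1.0 - tau) * t + tau * p
        for t, p in zip(target_weights, policy_weights)
    ]


if __name__ == "__main__":
    target = [0.0, 0.0]
    policy = [1.0, -2.0]
    # With tau = 0.005 the target moves 0.5% of the gap per update.
    for _ in range(3):
        target = soft_update(target, policy, tau=0.005)
    print(target)
```

A larger tau closes the gap between target and policy faster, which matches the note that raising tau to 0.01 can speed up learning on simple problems at the cost of stability.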

MA-POCA-specific Configurations

MA-POCA uses the same configurations as PPO, and there are no additional POCA-specific parameters.

NOTE: Reward signals other than Extrinsic Rewards have not been extensively tested with MA-POCA, though they can still be added and used for training on a your-mileage-may-vary basis.

Reward Signals

The reward_signals section enables the specification of settings for both extrinsic (i.e. environment-based) and intrinsic reward signals (e.g. curiosity and GAIL). Each reward signal should define at least two parameters, strength and gamma, in addition to any class-specific hyperparameters. Note that to remove a reward signal, you should delete its entry entirely from reward_signals. At least one reward signal should be left defined at all times. Provide the following configurations to design the reward signal for your training run.

Extrinsic Rewards

Enable these settings to ensure that your training run incorporates your environment-based reward signal:

Setting Description
extrinsic -> strength (default = 1.0) Factor by which to multiply the reward given by the environment. Typical ranges will vary depending on the reward signal.

Typical range: 1.00
extrinsic -> gamma (default = 0.99) Discount factor for future rewards coming from the environment. This can be thought of as how far into the future the agent should care about possible rewards. In situations when the agent should be acting in the present in order to prepare for rewards in the distant future, this value should be large. In cases when rewards are more immediate, it can be smaller. Must be strictly smaller than 1.

Typical range: 0.8 - 0.995
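
As a small illustration of what gamma controls, the Python sketch below computes a discounted return and the rough "effective horizon" 1 / (1 - gamma), a common rule of thumb for how many steps ahead rewards still matter noticeably. Both helper names are hypothetical and are not part of the training configuration.

```python
from typing import List


def discounted_return(rewards: List[float], gamma: float = 0.99) -> float:
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb number of future steps that carry significant weight."""
    if not 0.0 <= gamma < 1.0:
        # The setting above requires gamma to be strictly smaller than 1.
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)


if __name__ == "__main__":
    for gamma in (0.8, 0.99, 0.995):
        print(gamma, effective_horizon(gamma))
```

Under this rule of thumb, the typical range of 0.8 to 0.995 corresponds to looking roughly 5 to 200 steps ahead.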

Curiosity Intrinsic Reward

To enable curiosity, provide these settings:

Setting Description
curiosity -> strength (default = 1.0) Magnitude of the curiosity reward generated by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough to not be overwhelmed by extrinsic reward signals in the environment. Likewise it should not be too large to overwhelm the extrinsic reward signal.

Typical range: 0.001 - 0.1
curiosity -> gamma (default = 0.99) Discount factor for future rewards.

Typical range: 0.8 - 0.995
curiosity -> network_settings Please see the documentation for network_settings under Common Trainer Configurations. The network specs used by the intrinsic curiosity model. The value of hidden_units should be small enough to encourage the ICM to compress the original observation, but also not too small to prevent it from learning to differentiate between expected and actual observations.

Typical range: 64 - 256
curiosity -> learning_rate (default = 3e-4) Learning rate used to update the intrinsic curiosity module. This should typically be decreased if training is unstable, and the curiosity loss is unstable.

Typical range: 1e-5 - 1e-3

GAIL Intrinsic Reward

To enable GAIL (assuming you have recorded demonstrations), provide these settings:

Setting Description
gail -> strength (default = 1.0) Factor by which to multiply the raw reward. Note that when using GAIL with an Extrinsic Signal, this value should be set lower if your demonstrations are suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases.

Typical range: 0.01 - 1.0
gail -> gamma (default = 0.99) Discount factor for future rewards.

Typical range: 0.8 - 0.9
gail -> demo_path (Required, no default) The path to your .demo file or directory of .demo files.
gail -> network_settings Please see the documentation for network_settings under Common Trainer Configurations. The network specs for the GAIL discriminator. The value of hidden_units should be small enough to encourage the discriminator to compress the original observation, but also not too small to prevent it from learning to differentiate between demonstrated and actual behavior. Dramatically increasing this size will also negatively affect training times.

Typical range: 64 - 256
gail -> learning_rate (Optional, default = 3e-4) Learning rate used to update the discriminator. This should typically be decreased if training is unstable, and the GAIL loss is unstable.

Typical range: 1e-5 - 1e-3
gail -> use_actions (default = false) Determines whether the discriminator should discriminate based on both observations and actions, or just observations. Set to True if you want the agent to mimic the actions from the demonstrations, and False if you'd rather have the agent visit the same states as in the demonstrations but with possibly different actions. Setting to False is more likely to be stable, especially with imperfect demonstrations, but may learn slower.
gail -> use_vail (default = false) Enables a variational bottleneck within the GAIL discriminator. This forces the discriminator to learn a more general representation and reduces its tendency to be "too good" at discriminating, making learning more stable. However, it does increase training time. Enable this if you notice your imitation learning is unstable, or unable to learn the task at hand.

RND Intrinsic Reward

Random Network Distillation (RND) is only available for the PyTorch trainers. To enable RND, provide these settings:

Setting Description
rnd -> strength (default = 1.0) Magnitude of the curiosity reward generated by the intrinsic RND module. This should be scaled in order to ensure it is large enough to not be overwhelmed by extrinsic reward signals in the environment. Likewise it should not be too large to overwhelm the extrinsic reward signal.

Typical range: 0.001 - 0.01
rnd -> gamma (default = 0.99) Discount factor for future rewards.

Typical range: 0.8 - 0.995
rnd -> network_settings Please see the documentation for network_settings under Common Trainer Configurations. The network specs for the RND model.
rnd -> learning_rate (default = 3e-4) Learning rate used to update the RND module. This should be large enough for the RND module to quickly learn the state representation, but small enough to allow for stable learning.

Typical range: 1e-5 - 1e-3

Behavioral Cloning

To enable Behavioral Cloning as a pre-training option (assuming you have recorded demonstrations), provide the following configurations under the behavioral_cloning section:

Setting Description
demo_path (Required, no default) The path to your .demo file or directory of .demo files.
strength (default = 1.0) Learning rate of the imitation relative to the learning rate of PPO, and roughly corresponds to how strongly we allow BC to influence the policy.

Typical range: 0.1 - 0.5
steps (default = 0) During BC, it is often desirable to stop using demonstrations after the agent has "seen" rewards, and allow it to optimize past the available demonstrations and/or generalize outside of the provided demonstrations. steps corresponds to the training steps over which BC is active. The learning rate of BC will anneal over the steps. Set the steps to 0 for constant imitation over the entire training run.
batch_size (default = batch_size of trainer) Number of demonstration experiences used for one iteration of a gradient descent update. If not specified, it will default to the batch_size of the trainer.

Typical range: (Continuous): 512 - 5120; (Discrete): 32 - 512
num_epoch (default = num_epoch of trainer) Number of passes through the experience buffer during gradient descent. If not specified, it will default to the number of epochs set for PPO.

Typical range: 3 - 10
samples_per_update (default = 0) Maximum number of samples to use during each imitation update. You may want to lower this if your demonstration dataset is very large to avoid overfitting the policy on demonstrations. Set to 0 to train over all of the demonstrations at each update step.

Typical range: buffer_size

Memory-enhanced Agents using Recurrent Neural Networks

You can enable your agents to use memory by adding a memory section under network_settings, and setting memory_size and sequence_length:

Setting Description
network_settings -> memory -> memory_size (default = 128) Size of the memory an agent must keep. In order to use an LSTM, training requires a sequence of experiences instead of single experiences. Corresponds to the size of the array of floating point numbers used to store the hidden state of the recurrent neural network of the policy. This value must be a multiple of 2, and should scale with the amount of information you expect the agent will need to remember in order to successfully complete the task.

Typical range: 32 - 256
network_settings -> memory -> sequence_length (default = 64) Defines how long the sequences of experiences must be while training. Note that if this number is too small, the agent will not be able to remember things over longer periods of time. If this number is too large, the neural network will take longer to train.

Typical range: 4 - 128

A few considerations when deciding to use memory:

  • LSTM does not work well with continuous actions. Please use discrete actions for better results.
  • Adding a recurrent layer increases the complexity of the neural network, so it is recommended to decrease num_layers when using a recurrent layer.
  • It is required that memory_size be divisible by 2.


Self-Play

Training with self-play adds additional confounding factors to the usual issues faced by reinforcement learning. In general, the tradeoff is between the skill level and generality of the final policy and the stability of learning. Training against a set of slowly changing or unchanging adversaries with low diversity results in a more stable learning process than training against a set of quickly changing adversaries with high diversity. With this context, this guide discusses the exposed self-play hyperparameters and intuitions for tuning them.

If your environment contains multiple agents that are divided into teams, you can leverage our self-play training option by providing these configurations for each Behavior:

Setting Description
save_steps (default = 20000) Number of trainer steps between snapshots. For example, if save_steps=10000 then a snapshot of the current policy will be saved every 10000 trainer steps. Note, trainer steps are counted per agent. For more information, please see the migration doc after v0.13.

A larger value of save_steps will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training. As a result, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to more general and robust policy at the end of training. This value is also dependent on how intrinsically difficult the environment is for the agent.

Typical range: 10000 - 100000
team_change (default = 5 * save_steps) Number of trainer steps between switching the learning team. This is the number of trainer steps the teams associated with a specific ghost trainer will train before a different team becomes the new learning team. It is possible that, in asymmetric games, opposing teams require fewer trainer steps to make similar performance gains. This enables users to train a more complicated team of agents for more trainer steps than a simpler team of agents per team switch.

A larger value of team_change will allow the agent to train longer against its opponents. The longer an agent trains against the same set of opponents, the more able it will be to defeat them. However, training against them for too long may result in overfitting to the particular opponent strategies, and so the agent may fail against the next batch of opponents.

The value of team_change will determine how many snapshots of the agent's policy are saved to be used as opponents for the other team. So, we recommend setting this value as a function of the save_steps parameter discussed previously.

Typical range: 4x-10x where x=save_steps
swap_steps (default = 10000) Number of ghost steps (not trainer steps) between swapping the opponent's policy with a different snapshot. A 'ghost step' refers to a step taken by an agent that is following a fixed policy and not learning. The reason for this distinction is that in asymmetric games, we may have teams with an unequal number of agents, e.g. a 2v1 scenario like our Strikers Vs Goalie example environment. The team with two agents collects twice as many agent steps per environment step as the team with one agent. Thus, these two values will need to be distinct to ensure that the same number of trainer steps corresponds to the same number of opponent swaps for each team. The formula for swap_steps if a user desires x swaps of a team with num_agents agents against an opponent team with num_opponent_agents agents during team_change total steps is: (num_agents / num_opponent_agents) * (team_change / x)

Typical range: 10000 - 100000
play_against_latest_model_ratio (default = 0.5) Probability an agent will play against the latest opponent policy. With probability 1 - play_against_latest_model_ratio, the agent will play against a snapshot of its opponent from a past iteration.

A larger value of play_against_latest_model_ratio indicates that an agent will be playing against the current opponent more often. Since the agent is updating its policy, the opponent will be different from iteration to iteration. This can lead to an unstable learning environment, but presents the agent with an auto-curriculum of increasingly challenging situations, which may lead to a stronger final policy.

Typical range: 0.0 - 1.0
window (default = 10) Size of the sliding window of past snapshots from which the agent's opponents are sampled. For example, a window size of 5 will save the last 5 snapshots taken. Each time a new snapshot is taken, the oldest is discarded. A larger value of window means that an agent's pool of opponents will contain a larger diversity of behaviors since it will contain policies from earlier in the training run. Like in the save_steps hyperparameter, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to more general and robust policy at the end of training.

Typical range: 5 - 30
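
The interaction between window and play_against_latest_model_ratio described above can be sketched in a few lines of Python. This is only an illustration of the sampling rule as stated in the table; the class and method names are hypothetical and policies are represented by simple identifiers rather than real models.

```python
import random
from collections import deque
from typing import Optional


class SnapshotPool:
    """Sliding window of past policy snapshots used as opponents."""

    def __init__(
        self,
        window: int = 10,
        play_against_latest_model_ratio: float = 0.5,
        seed: Optional[int] = None,
    ) -> None:
        # A deque with maxlen drops the oldest snapshot automatically.
        self.snapshots = deque(maxlen=window)
        self.ratio = play_against_latest_model_ratio
        self.rng = random.Random(seed)

    def save_snapshot(self, policy_id: str) -> None:
        """Called every save_steps trainer steps in this sketch."""
        self.snapshots.append(policy_id)

    def sample_opponent(self, latest_policy_id: str) -> str:
        """Pick the latest policy with probability ratio, else a past snapshot."""
        if not self.snapshots or self.rng.random() < self.ratio:
            return latest_policy_id
        return self.rng.choice(list(self.snapshots))


if __name__ == "__main__":
    pool = SnapshotPool(window=5, play_against_latest_model_ratio=0.5, seed=0)
    for i in range(8):
        pool.save_snapshot(f"snapshot-{i}")
    print(list(pool.snapshots))
    print([pool.sample_opponent("latest") for _ in range(5)])
```

In this sketch, a ratio of 1.0 means the agent always faces the latest opponent policy, and a ratio of 0.0 means it only ever faces snapshots from the window.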

Note on Reward Signals

We make the assumption that the final reward in a trajectory corresponds to the outcome of an episode. A final reward of +1 indicates winning, -1 indicates losing and 0 indicates a draw. The ELO calculation (discussed below) depends on this final reward being either +1, 0, -1.

The reward signal should still be used as described in the documentation for the other trainers. However, we encourage users to be a bit more conservative when shaping reward functions due to the instability and non-stationarity of learning in adversarial games. Specifically, we encourage users to begin with the simplest possible reward function (+1 winning, -1 losing) and to allow for more iterations of training to compensate for the sparsity of reward.

Note on Swap Steps

As an example, in a 2v1 scenario, if we want the swap to occur x=4 times during team_change=200000 steps, the swap_steps for the team of one agent is:

swap_steps = (1 / 2) * (200000 / 4) = 25000

The swap_steps for the team of two agents is:

swap_steps = (2 / 1) * (200000 / 4) = 100000

Note, with equal team sizes, the first term is equal to 1 and swap_steps can be calculated by just dividing the total steps by the desired number of swaps.

A larger value of swap_steps means that an agent will play against the same fixed opponent for a longer number of training iterations. This results in a more stable training scenario, but leaves the agent open to the risk of overfitting its behavior to this particular opponent. Thus, when a new opponent is swapped in, the agent may lose more often than expected.
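
The swap_steps formula from the self-play settings, (num_agents / num_opponent_agents) * (team_change / x), can be wrapped in a small Python helper to reproduce the 2v1 numbers in this note. The function name and the num_swaps argument (standing in for x) are hypothetical; the result is what you would then write into the configuration by hand.

```python
def compute_swap_steps(
    num_agents: int,
    num_opponent_agents: int,
    team_change: int,
    num_swaps: int,
) -> int:
    """Ghost steps between opponent swaps for a desired number of swaps.

    Implements (num_agents / num_opponent_agents) * (team_change / num_swaps).
    """
    return int((num_agents / num_opponent_agents) * (team_change / num_swaps))


if __name__ == "__main__":
    # 2v1 scenario with team_change = 200000 and 4 desired swaps.
    print(compute_swap_steps(1, 2, 200_000, 4))  # team of one agent
    print(compute_swap_steps(2, 1, 200_000, 4))  # team of two agents
```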