Training ML-Agents

Table of Contents

Training ML-Agents
Training with mlagents-learn
Training Configurations

For a broad overview of reinforcement learning, imitation learning and all the training scenarios, methods and options within the ML-Agents Toolkit, see ML-Agents Toolkit Overview.

Once your learning environment has been created and is ready for training, the next step is to initiate a training run. Training in the ML-Agents Toolkit is powered by a dedicated Python package, mlagents. This package exposes a command mlagents-learn that is the single entry point for all training workflows (e.g. reinforcement leaning, imitation learning, curriculum learning). Its implementation can be found at ml-agents/mlagents/trainers/learn.py.

Training with mlagents-learn

Starting Training

mlagents-learn is the main training utility provided by the ML-Agents Toolkit. It accepts a number of CLI options in addition to a YAML configuration file that contains all the configurations and hyperparameters to be used during training. The set of configurations and hyperparameters to include in this file depend on the agents in your environment and the specific training method you wish to utilize. Keep in mind that the hyperparameter values can have a big impact on the training performance (i.e. your agent's ability to learn a policy that solves the task). In this page, we will review all the hyperparameters for all training methods and provide guidelines and advice on their values.

To view a description of all the CLI options accepted by mlagents-learn, use the --help:

mlagents-learn --help

The basic command for training is:

mlagents-learn <trainer-config-file> --env=<env_name> --run-id=<run-identifier>

where

<trainer-config-file> is the file path of the trainer configuration YAML. This contains all the hyperparameter values. We offer a detailed guide on the structure of this file and the meaning of the hyperparameters (and advice on how to set them) in the dedicated Training Configurations section below.
<env_name>(Optional) is the name (including path) of your Unity executable containing the agents to be trained. If <env_name> is not passed, the training will happen in the Editor. Press the Play button in Unity when the message "Start training by pressing the Play button in the Unity Editor" is displayed on the screen.
<run-identifier> is a unique name you can use to identify the results of your training runs.

See the Getting Started Guide for a sample execution of the mlagents-learn command.

Observing Training

Regardless of which training methods, configurations or hyperparameters you provide, the training process will always generate three artifacts, all found in the results/<run-identifier> folder:

Summaries: these are training metrics that are updated throughout the training process. They are helpful to monitor your training performance and may help inform how to update your hyperparameter values. See Using TensorBoard for more details on how to visualize the training metrics.
Models: these contain the model checkpoints that are updated throughout training and the final model file (.onnx). This final model file is generated once either when training completes or is interrupted.
Timers file (under results/<run-identifier>/run_logs): this contains aggregated metrics on your training process, including time spent on specific code blocks. See Profiling in Python for more information on the timers generated.

These artifacts are updated throughout the training process and finalized when training is completed or is interrupted.

Stopping and Resuming Training

To interrupt training and save the current progress, hit Ctrl+C once and wait for the model(s) to be saved out.

To resume a previously interrupted or completed training run, use the --resume flag and make sure to specify the previously used run ID.

If you would like to re-run a previously interrupted or completed training run and re-use the same run ID (in this case, overwriting the previously generated artifacts), then use the --force flag.

Loading an Existing Model

You can also use this mode to run inference of an already-trained model in Python by using both the --resume and --inference flags. Note that if you want to run inference in Unity, you should use the Sentis.

Additionally, if the network architecture changes, you may still load an existing model, but ML-Agents will only load the parts of the model it can load and ignore all others. For instance, if you add a new reward signal, the existing model will load but the new reward signal will be initialized from scratch. If you have a model with a visual encoder (CNN) but change the hidden_units, the CNN will be loaded but the body of the network will be initialized from scratch.

Alternatively, you might want to start a new training run but initialize it using an already-trained model. You may want to do this, for instance, if your environment changed and you want a new model, but the old behavior is still better than random. You can do this by specifying --initialize-from=<run-identifier>, where <run-identifier> is the old run ID.

Training Configurations

The Unity ML-Agents Toolkit provides a wide range of training scenarios, methods and options. As such, specific training runs may require different training configurations and may generate different artifacts and TensorBoard statistics. This section offers a detailed guide into how to manage the different training set-ups withing the toolkit.

More specifically, this section offers a detailed guide on the command-line flags for mlagents-learn that control the training configurations:

<trainer-config-file>: defines the training hyperparameters for each Behavior in the scene, and the set-ups for the environment parameters (Curriculum Learning and Environment Parameter Randomization)

It is important to highlight that successfully training a Behavior in the ML-Agents Toolkit involves tuning the training hyperparameters and configuration. This guide contains some best practices for tuning the training process when the default parameters don't seem to be giving the level of performance you would like. We provide sample configuration files for our example environments in the config/ directory. The config/ppo/3DBall.yaml was used to train the 3D Balance Ball in the Getting Started guide. That configuration file uses the PPO trainer, but we also have configuration files for SAC and GAIL.

Additionally, the set of configurations you provide depend on the training functionalities you use (see ML-Agents Toolkit Overview for a description of all the training functionalities). Each functionality you add typically has its own training configurations. For instance:

Use PPO or SAC?
Use Recurrent Neural Networks for adding memory to your agents?
Use the intrinsic curiosity module?
Ignore the environment reward signal?
Pre-train using behavioral cloning? (Assuming you have recorded demonstrations.)
Include the GAIL intrinsic reward signals? (Assuming you have recorded demonstrations.)
Use self-play? (Assuming your environment includes multiple agents.)

The trainer config file, <trainer-config-file>, determines the features you will use during training, and the answers to the above questions will dictate its contents. The rest of this guide breaks down the different sub-sections of the trainer config file and explains the possible settings for each. If you need a list of all the trainer configurations, please see Training Configuration File.

NOTE: The configuration file format has been changed between 0.17.0 and 0.18.0 and between 0.18.0 and onwards. To convert an old set of configuration files (trainer config, curriculum, and sampler files) to the new format, a script has been provided. Run python -m mlagents.trainers.upgrade_config -h in your console to see the script's usage.

Adding CLI Arguments to the Training Configuration file

Additionally, within the training configuration YAML file, you can also add the CLI arguments (such as --num-envs).

Reminder that a detailed description of all the CLI arguments can be found by using the help utility:

mlagents-learn --help

These additional CLI arguments are grouped into environment, engine, checkpoint and torch. The available settings and example values are shown below.

Environment settings

env_settings:
  env_path: FoodCollector
  env_args: null
  base_port: 5005
  num_envs: 1
  timeout_wait: 10
  seed: -1
  max_lifetime_restarts: 10
  restarts_rate_limit_n: 1
  restarts_rate_limit_period_s: 60

Engine settings

engine_settings:
  width: 84
  height: 84
  quality_level: 5
  time_scale: 20
  target_frame_rate: -1
  capture_frame_rate: 60
  no_graphics: false

Checkpoint settings

checkpoint_settings:
  run_id: foodtorch
  initialize_from: null
  load_model: false
  resume: false
  force: true
  train_model: false
  inference: false

Torch settings:

torch_settings:
  device: cpu

Behavior Configurations

The primary section of the trainer config file is a set of configurations for each Behavior in your scene. These are defined under the sub-section behaviors in your trainer config file. Some of the configurations are required while others are optional. To help us get started, below is a sample file that includes all the possible settings if we're using a PPO trainer with all the possible training functionalities enabled (memory, behavioral cloning, curiosity, GAIL and self-play). You will notice that curriculum and environment parameter randomization settings are not part of the behaviors configuration, but in their own section called environment_parameters.

behaviors:
  BehaviorPPO:
    trainer_type: ppo

    hyperparameters:
      # Hyperparameters common to PPO and SAC
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 3.0e-4
      learning_rate_schedule: linear

      # PPO-specific hyperparameters
      beta: 5.0e-3
      beta_schedule: constant
      epsilon: 0.2
      epsilon_schedule: linear
      lambd: 0.95
      num_epoch: 3
      shared_critic: False

    # Configuration of the neural network (common to PPO/SAC)
    network_settings:
      vis_encode_type: simple
      normalize: false
      hidden_units: 128
      num_layers: 2
      # memory
      memory:
        sequence_length: 64
        memory_size: 256

    # Trainer configurations common to all trainers
    max_steps: 5.0e5
    time_horizon: 64
    summary_freq: 10000
    keep_checkpoints: 5
    checkpoint_interval: 50000
    threaded: false
    init_path: null

    # behavior cloning
    behavioral_cloning:
      demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
      strength: 0.5
      steps: 150000
      batch_size: 512
      num_epoch: 3
      samples_per_update: 0

    reward_signals:
      # environment reward (default)
      extrinsic:
        strength: 1.0
        gamma: 0.99

      # curiosity module
      curiosity:
        strength: 0.02
        gamma: 0.99
        encoding_size: 256
        learning_rate: 3.0e-4

      # GAIL
      gail:
        strength: 0.01
        gamma: 0.99
        encoding_size: 128
        demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
        learning_rate: 3.0e-4
        use_actions: false
        use_vail: false

    # self-play
    self_play:
      window: 10
      play_against_latest_model_ratio: 0.5
      save_steps: 50000
      swap_steps: 2000
      team_change: 100000

Here is an equivalent file if we use an SAC trainer instead. Notice that the configurations for the additional functionalities (memory, behavioral cloning, curiosity and self-play) remain unchanged.

behaviors:
  BehaviorSAC:
    trainer_type: sac

    # Trainer configs common to PPO/SAC (excluding reward signals)
    # same as PPO config

    # SAC-specific configs (replaces the hyperparameters section above)
    hyperparameters:
      # Hyperparameters common to PPO and SAC
      # Same as PPO config

      # SAC-specific hyperparameters
      # Replaces the "PPO-specific hyperparameters" section above
      buffer_init_steps: 0
      tau: 0.005
      steps_per_update: 10.0
      save_replay_buffer: false
      init_entcoef: 0.5
      reward_signal_steps_per_update: 10.0

    # Configuration of the neural network (common to PPO/SAC)
    network_settings:
      # Same as PPO config

    # Trainer configurations common to all trainers
      # <Same as PPO config>

    # pre-training using behavior cloning
    behavioral_cloning:
      # same as PPO config

    reward_signals:
      # environment reward
      extrinsic:
        # same as PPO config

      # curiosity module
      curiosity:
        # same as PPO config

      # GAIL
      gail:
        # same as PPO config

    # self-play
    self_play:
      # same as PPO config

We now break apart the components of the configuration file and describe what each of these parameters mean and provide guidelines on how to set them. See Training Configuration File for a detailed description of all the configurations listed above, along with their defaults. Unless otherwise specified, omitting a configuration will revert it to its default.

Default Behavior Settings

In some cases, you may want to specify a set of default configurations for your Behaviors. This may be useful, for instance, if your Behavior names are generated procedurally by the environment and not known before runtime, or if you have many Behaviors with very similar settings. To specify a default configuration, insert a default_settings section in your YAML. This section should be formatted exactly like a configuration for a Behavior.

default_settings:
  # < Same as Behavior configuration >
behaviors:
  # < Same as above >

Behaviors found in the environment that aren't specified in the YAML will now use the default_settings, and unspecified settings in behavior configurations will default to the values in default_settings if specified there.

Environment Parameters

In order to control the EnvironmentParameters in the Unity simulation during training, you need to add a section called environment_parameters. For example you can set the value of an EnvironmentParameter called my_environment_parameter to 3.0 with the following code :

behaviors:
  BehaviorY:
    # < Same as above >

# Add this section
environment_parameters:
  my_environment_parameter: 3.0

Inside the Unity simulation, you can access your Environment Parameters by doing :

Academy.Instance.EnvironmentParameters.GetWithDefault("my_environment_parameter", 0.0f);

Environment Parameter Randomization

To enable environment parameter randomization, you need to edit the environment_parameters section of your training configuration yaml file. Instead of providing a single float value for your environment parameter, you can specify a sampler instead. Here is an example with three environment parameters called mass, length and scale:

behaviors:
  BehaviorY:
    # < Same as above >

# Add this section
environment_parameters:
  mass:
    sampler_type: uniform
    sampler_parameters:
        min_value: 0.5
        max_value: 10

  length:
    sampler_type: multirangeuniform
    sampler_parameters:
        intervals: [[7, 10], [15, 20]]

  scale:
    sampler_type: gaussian
    sampler_parameters:
        mean: 2
        st_dev: .3

Setting	Description
`sampler_type`	A string identifier for the type of sampler to use for this `Environment Parameter`.
`sampler_parameters`	The parameters for a given `sampler_type`. Samplers of different types can have different `sampler_parameters`

Supported Sampler Types

Below is a list of the sampler_type values supported by the toolkit.

uniform - Uniform sampler
Uniformly samples a single float value from a range with a given minimum and maximum value (inclusive).
parameters - min_value, max_value
gaussian - Gaussian sampler
Samples a single float value from a normal distribution with a given mean and standard deviation.
parameters - mean, st_dev
multirange_uniform - Multirange uniform sampler
First, samples an interval from a set of intervals in proportion to relative length of the intervals. Then, uniformly samples a single float value from the sampled interval (inclusive). This sampler can take an arbitrary number of intervals in a list in the following format: [[interval_1_min, interval_1_max], [interval_2_min, interval_2_max], ...]
parameters - intervals

The implementation of the samplers can be found in the Samplers.cs file.

Training with Environment Parameter Randomization

After the sampler configuration is defined, we proceed by launching mlagents-learn and specify trainer configuration with parameter randomization enabled. For example, if we wanted to train the 3D ball agent with parameter randomization, we would run

mlagents-learn config/ppo/3DBall_randomize.yaml --run-id=3D-Ball-randomize

We can observe progress and metrics via TensorBoard.

Curriculum

To enable curriculum learning, you need to add a curriculum sub-section to your environment parameter. Here is one example with the environment parameter my_environment_parameter :

behaviors:
  BehaviorY:
    # < Same as above >

# Add this section
environment_parameters:
  my_environment_parameter:
    curriculum:
      - name: MyFirstLesson # The '-' is important as this is a list
        completion_criteria:
          measure: progress
          behavior: my_behavior
          signal_smoothing: true
          min_lesson_length: 100
          threshold: 0.2
        value: 0.0
      - name: MySecondLesson # This is the start of the second lesson
        completion_criteria:
          measure: progress
          behavior: my_behavior
          signal_smoothing: true
          min_lesson_length: 100
          threshold: 0.6
          require_reset: true
        value:
          sampler_type: uniform
          sampler_parameters:
            min_value: 4.0
            max_value: 7.0
      - name: MyLastLesson
        value: 8.0

Note that this curriculum only applies to my_environment_parameter. The curriculum section contains a list of Lessons. In the example, the lessons are named MyFirstLesson, MySecondLesson and MyLastLesson. Each Lesson has 3 fields :

name which is a user defined name for the lesson (The name of the lesson will be displayed in the console when the lesson changes)
completion_criteria which determines what needs to happen in the simulation before the lesson can be considered complete. When that condition is met, the curriculum moves on to the next Lesson. Note that you do not need to specify a completion_criteria for the last Lesson
value which is the value the environment parameter will take during the lesson. Note that this can be a float or a sampler.

There are the different settings of the completion_criteria :

Setting	Description
`measure`	What to measure learning progress, and advancement in lessons by. `reward` uses a measure of received reward, `progress` uses the ratio of steps/max_steps, while `Elo` is available only for self-play situations and uses Elo score as a curriculum completion measure.
`behavior`	Specifies which behavior is being tracked. There can be multiple behaviors with different names, each at different points of training. This setting allows the curriculum to track only one of them.
`threshold`	Determines at what point in value of `measure` the lesson should be increased.
`min_lesson_length`	The minimum number of episodes that should be completed before the lesson can change. If `measure` is set to `reward`, the average cumulative reward of the last `min_lesson_length` episodes will be used to determine if the lesson should change. Must be nonnegative. Important: the average reward that is compared to the thresholds is different than the mean reward that is logged to the console. For example, if `min_lesson_length` is `100`, the lesson will increment after the average cumulative reward of the last `100` episodes exceeds the current threshold. The mean reward logged to the console is dictated by the `summary_freq` parameter defined above.
`signal_smoothing`	Whether to weight the current progress measure by previous values.
`require_reset`	Whether changing lesson requires the environment to reset (default: false)
##### Training with a Curriculum

Once we have specified our metacurriculum and curricula, we can launch mlagents-learn to point to the config file containing our curricula and PPO will train using Curriculum Learning. For example, to train agents in the Wall Jump environment with curriculum learning, we can run:

mlagents-learn config/ppo/WallJump_curriculum.yaml --run-id=wall-jump-curriculum

We can then keep track of the current lessons and progresses via TensorBoard. If you've terminated the run, you can resume it using --resume and lesson progress will start off where it ended.

Training Using Concurrent Unity Instances

In order to run concurrent Unity instances during training, set the number of environment instances using the command line option --num-envs=<n> when you invoke mlagents-learn. Optionally, you can also set the --base-port, which is the starting port used for the concurrent Unity instances.

Some considerations:

Buffer Size - If you are having trouble getting an agent to train, even with multiple concurrent Unity instances, you could increase buffer_size in the trainer config file. A common practice is to multiply buffer_size by num-envs.
Resource Constraints - Invoking concurrent Unity instances is constrained by the resources on the machine. Please use discretion when setting --num-envs=<n>.
Result Variation Using Concurrent Unity Instances - If you keep all the hyperparameters the same, but change --num-envs=<n>, the results and model would likely change.