
feat(scripts): Add TOML linting workflow #26

Merged
agreaves-ms merged 1 commit into main from feat/rsl_train_osmo on Dec 11, 2025

Conversation


@akzaidi akzaidi commented Dec 11, 2025

RSL_RL Backend for OSMO

PR Soundtrack: Little Brother - Altitudes

Summary

This pull request introduces support for selecting the training backend (between skrl and rsl_rl) for the OSMO training workflow scripts, and updates the RSL-RL integration to be compatible with RSL-RL 3.x, including proper handling of the new TensorDict observation format. It also updates dependencies and improves environment variable and argument handling for consistency.

Training backend selection:

  • Added a new --backend/-b CLI argument and TRAINING_BACKEND environment variable to scripts/submit-osmo-training.sh, allowing users to select the training backend (skrl or rsl_rl). This value is passed through the workflow and made available in training jobs.

RSL-RL 3.x compatibility:

  • Updated the RSL-RL dependency to version 3.1.2 and added tensordict as a required dependency.
  • Introduced a RslRl3xCompatWrapper in train.py to ensure all environment observations conform to the TensorDict format required by RSL-RL 3.x, supporting backward compatibility with older wrappers.
  • Updated the runner configuration and instantiation logic to ensure required fields (like obs_groups and class_name) are present and correct for RSL-RL 3.x.
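The coercion the wrapper performs can be sketched in a dependency-free form. Plain dicts and lists stand in for tensordict.TensorDict and torch.Tensor here, and the function name normalize_obs is illustrative, not the repository's actual method:

```python
# Dependency-free sketch of the observation normalization described above.
# Plain dicts/lists stand in for TensorDict/torch.Tensor so the control
# flow can be shown without torch; names are assumptions from the PR text.

def normalize_obs(obs):
    """Coerce the possible observation shapes into a single grouped form."""
    if isinstance(obs, dict):
        # Already grouped observations: pass through unchanged
        return dict(obs)
    if isinstance(obs, tuple):
        # (obs, extras) pairs from older wrappers: keep only the observations
        obs = obs[0]
        return dict(obs) if isinstance(obs, dict) else {"policy": obs}
    # Single "tensor": wrap as the default 'policy' observation group
    return {"policy": obs}
```

In the real wrapper the resulting mapping would additionally be packed into a TensorDict with the environment's batch size.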

Argument and environment improvements:

  • Added support for --checkpoint-mode and --register-checkpoint arguments in launch_rsl_rl.py for improved checkpoint handling.

🤖 - Generated by Copilot

…ackend selection

* add RslRl3xCompatWrapper for TensorDict observation format in RSL-RL 3.x
* add --backend/-b flag to select training backend (skrl, rsl_rl)
* upgrade rsl-rl-lib to 3.1.2 with tensordict dependency
* pass training_backend parameter through OSMO workflow template

🤖 - Generated by Copilot
Copilot AI review requested due to automatic review settings December 11, 2025 00:10
@agreaves-ms agreaves-ms merged commit 4986caa into main Dec 11, 2025
5 checks passed
@agreaves-ms agreaves-ms deleted the feat/rsl_train_osmo branch December 11, 2025 00:11

Copilot AI left a comment


Pull request overview

This pull request introduces support for selecting between skrl and rsl_rl training backends in the OSMO workflow and upgrades RSL-RL to version 3.1.2 with TensorDict observation format compatibility. The changes enable flexible backend selection while maintaining backward compatibility through a new compatibility wrapper that handles the TensorDict requirements of RSL-RL 3.x.

Key changes:

  • Added --backend CLI flag and TRAINING_BACKEND environment variable to select between skrl (default) and rsl_rl backends
  • Upgraded rsl-rl-lib from 3.0.1 to 3.1.2 and added tensordict dependency for RSL-RL 3.x compatibility
  • Implemented RslRl3xCompatWrapper to ensure observations conform to TensorDict format required by RSL-RL 3.x

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Summary per file:

  • workflows/osmo/train.yaml: Added training_backend parameter to workflow template with default value "skrl"
  • src/training/scripts/rsl_rl/train.py: Upgraded to RSL-RL 3.x with TensorDict import, RslRlOnPolicyRunnerCfg, and new RslRl3xCompatWrapper class for observation format compatibility
  • src/training/scripts/launch_rsl_rl.py: Added --checkpoint-mode and --register-checkpoint CLI arguments for improved checkpoint handling
  • src/training/requirements.txt: Upgraded rsl-rl-lib to 3.1.2 and added tensordict>=0.7.0 dependency
  • scripts/submit-osmo-training.sh: Added --backend/-b flag for training backend selection with environment variable support


# Convert config to dict and ensure RSL-RL 3.x required fields are present
agent_cfg_dict = agent_cfg.to_dict()
# RSL-RL 3.x requires 'obs_groups' with a 'policy' key mapping to observation group names

Copilot AI Dec 11, 2025


The comment states "RSL-RL 3.x requires 'obs_groups' with a 'policy' key mapping to observation group names" but the default value sets it to {"policy": ["policy"]}, which maps "policy" to a list containing "policy". The comment could be clearer about what this structure represents - specifically that the outer key is the policy name and the inner list contains the observation group names to use for that policy.

Suggested change
# RSL-RL 3.x requires 'obs_groups' with a 'policy' key mapping to observation group names
# RSL-RL 3.x requires 'obs_groups' to be a dict mapping each policy name (outer key)
# to a list of observation group names (inner list) to use for that policy.
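The structure the suggestion describes can be shown with a small illustrative value (the "critic" entry is a hypothetical asymmetric-critic example, not from this PR):

```python
# Illustrative obs_groups value matching the structure described above:
# each outer key names a network and the inner list names the observation
# groups routed to it. The commented "critic" entry is hypothetical.
obs_groups = {
    "policy": ["policy"],  # the policy network reads the 'policy' obs group
    # "critic": ["policy", "privileged"],  # e.g. an asymmetric critic
}

# Every entry maps a string key to a list of group-name strings.
for net, groups in obs_groups.items():
    assert isinstance(net, str)
    assert isinstance(groups, list) and all(isinstance(g, str) for g in groups)
```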

Comment on lines +172 to +175
-b|--backend)
    BACKEND_VALUE="$2"
    shift 2
    ;;

Copilot AI Dec 11, 2025


The --backend argument accepts any arbitrary string value without validation. Based on the documentation (line 38), only "skrl" and "rsl_rl" are valid backends. Consider adding validation after the argument parsing to ensure BACKEND_VALUE is one of the supported values, similar to how checkpoint_mode is validated via normalize_checkpoint_mode().

-rsl-rl-lib==3.0.1
+rsl-rl-lib==3.1.2
 skrl>=1.4.3
+tensordict>=0.7.0 # Required by RSL-RL 3.x for TensorDict observations

Copilot AI Dec 11, 2025


The tensordict dependency uses a minimum version constraint (>=0.7.0) while most other dependencies in this file use exact version pins (==). For consistency and reproducibility, consider using an exact version pin like tensordict==0.7.0 or tensordict==0.8.0, matching the pinning pattern used for other dependencies such as rsl-rl-lib, azure-core, and pyperclip.

Suggested change
tensordict>=0.7.0 # Required by RSL-RL 3.x for TensorDict observations
tensordict==0.7.0 # Required by RSL-RL 3.x for TensorDict observations

Comment on lines +212 to +221
    return TensorDict(obs, batch_size=[self._env.num_envs])
if isinstance(obs, torch.Tensor):
    # Single tensor - wrap as 'policy' observation group
    return TensorDict({"policy": obs}, batch_size=[self._env.num_envs])
if isinstance(obs, tuple):
    # Tuple (obs_tensor, extras) from older wrappers
    obs_data = obs[0]
    if isinstance(obs_data, dict):
        return TensorDict(obs_data, batch_size=[self._env.num_envs])
    return TensorDict({"policy": obs_data}, batch_size=[self._env.num_envs])

Copilot AI Dec 11, 2025


The TensorDict initialization uses batch_size=[self._env.num_envs] but TensorDict expects batch_size to be a torch.Size or tuple representing the batch dimensions, not a list. While lists are implicitly converted, it would be more explicit and correct to use batch_size=torch.Size([self._env.num_envs]) or batch_size=(self._env.num_envs,). This same issue appears in lines 212, 215, 220, and 221.

Suggested change
    return TensorDict(obs, batch_size=[self._env.num_envs])
if isinstance(obs, torch.Tensor):
    # Single tensor - wrap as 'policy' observation group
    return TensorDict({"policy": obs}, batch_size=[self._env.num_envs])
if isinstance(obs, tuple):
    # Tuple (obs_tensor, extras) from older wrappers
    obs_data = obs[0]
    if isinstance(obs_data, dict):
        return TensorDict(obs_data, batch_size=[self._env.num_envs])
    return TensorDict({"policy": obs_data}, batch_size=[self._env.num_envs])
    return TensorDict(obs, batch_size=(self._env.num_envs,))
if isinstance(obs, torch.Tensor):
    # Single tensor - wrap as 'policy' observation group
    return TensorDict({"policy": obs}, batch_size=(self._env.num_envs,))
if isinstance(obs, tuple):
    # Tuple (obs_tensor, extras) from older wrappers
    obs_data = obs[0]
    if isinstance(obs_data, dict):
        return TensorDict(obs_data, batch_size=(self._env.num_envs,))
    return TensorDict({"policy": obs_data}, batch_size=(self._env.num_envs,))
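The claim underlying this suggestion is easy to verify: torch.Size is a tuple subclass, so the tuple and torch.Size spellings are interchangeable, while a plain list merely gets converted implicitly.

```python
# Quick check that torch.Size is a tuple subclass, which is why
# batch_size=torch.Size([n]) and batch_size=(n,) are interchangeable.
import torch

num_envs = 4
assert torch.Size([num_envs]) == (num_envs,)      # compares equal as tuples
assert isinstance(torch.Size([num_envs]), tuple)  # Size is-a tuple
```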

Comment on lines +62 to +73
parser.add_argument(
    "--checkpoint-mode",
    type=_optional_str,
    default="from-scratch",
    help="Checkpoint handling mode (fresh is treated as from-scratch)",
)
parser.add_argument(
    "--register-checkpoint",
    type=_optional_str,
    default=None,
    help="Register the final checkpoint as this Azure ML model name",
)

Copilot AI Dec 11, 2025


The newly added arguments --checkpoint-mode and --register-checkpoint are parsed but never used in the code. The parsed values (args.checkpoint_mode and args.register_checkpoint) are not referenced anywhere in the function implementations or passed to the training script. Either these arguments should be utilized in the workflow or they should be removed if they're not needed yet.

Suggested change
parser.add_argument(
    "--checkpoint-mode",
    type=_optional_str,
    default="from-scratch",
    help="Checkpoint handling mode (fresh is treated as from-scratch)",
)
parser.add_argument(
    "--register-checkpoint",
    type=_optional_str,
    default=None,
    help="Register the final checkpoint as this Azure ML model name",
)
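If the arguments are kept rather than removed, one way to consume them is to forward them into the launch command. This is an illustrative sketch only: the "train.py" target, the command layout, and the _optional_str helper's behavior are assumptions, not the repository's actual code.

```python
# Illustrative sketch of consuming the parsed values rather than dropping
# them; the command layout and "train.py" target are assumptions.
import argparse


def _optional_str(value):
    # Assumed helper: treat empty strings as None (mirrors the flag types)
    return value or None


parser = argparse.ArgumentParser()
parser.add_argument("--checkpoint-mode", type=_optional_str, default="from-scratch")
parser.add_argument("--register-checkpoint", type=_optional_str, default=None)
args = parser.parse_args(["--checkpoint-mode", "resume"])

# Forward the values into the training command instead of ignoring them
cmd = ["python", "train.py", "--checkpoint-mode", args.checkpoint_mode]
if args.register_checkpoint:
    cmd += ["--register-checkpoint", args.register_checkpoint]
```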

Comment on lines +196 to +202
# Proxy all attributes to the underlying env
for attr in dir(env):
    if not attr.startswith("_") and attr not in ("get_observations", "step", "reset"):
        try:
            setattr(self, attr, getattr(env, attr))
        except AttributeError:
            pass

Copilot AI Dec 11, 2025


The attribute proxying logic in the init method is redundant and potentially problematic. The loop copies attributes to self, but getattr already provides dynamic attribute access to self._env. This dual approach can lead to stale attributes if the wrapped environment's attributes change after initialization. Consider removing the loop (lines 196-202) and relying solely on getattr for attribute delegation, which is the standard pattern for wrapper classes.

Suggested change
# Proxy all attributes to the underlying env
for attr in dir(env):
    if not attr.startswith("_") and attr not in ("get_observations", "step", "reset"):
        try:
            setattr(self, attr, getattr(env, attr))
        except AttributeError:
            pass
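The delegation pattern the review recommends can be sketched as follows; the class and attribute names here are illustrative stand-ins, not the repository's wrapper:

```python
# Standard wrapper delegation via __getattr__: unknown attributes are
# looked up on the wrapped env at access time, so they never go stale.
# EnvWrapper/FakeEnv are illustrative stand-ins for the real classes.
class EnvWrapper:
    def __init__(self, env):
        self._env = env

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for attributes the
        # wrapper does not define itself; delegates to the wrapped env.
        return getattr(self._env, name)


class FakeEnv:
    num_envs = 8


w = EnvWrapper(FakeEnv())
assert w.num_envs == 8
FakeEnv.num_envs = 16
assert w.num_envs == 16  # reflects the live value, not a stale copy
```

Unlike the eager copy loop above, this keeps the wrapper in sync with any attribute the underlying environment gains or changes after construction.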

@WilliamBerryiii WilliamBerryiii changed the title Adds RSL_RL training backed for OSMO workflows feat(scripts): Add TOML linting workflow Feb 5, 2026

3 participants