You can follow along using the code available in our GitHub repository.

1. Key Features

  • Built on Isaac Sim & Isaac Lab for high-fidelity robotics simulation.
  • Train & evaluate locomotion policies for humanoid robots.
  • Supports sim-to-real transfer via domain randomization.

2. Code Structure

The structure of st_gym:

st_gym
├── exts
│   └── legged_robots
│       ├── legged_robots
│       │   ├── assets/              # Robot model configurations (e.g., lumos.py)
│       │   ├── tasks/               # Task configurations (env, agents, AMP, etc.)
│       │   │   ├── config/          
│       │   │   │   └── lus2/        # lus2 task configuration
│       │   │   ├── mdp/             # RL MDP components (actions, observations, rewards, etc.)
│       │   │   └── utils/           # Utilities and wrappers (logger, wrappers)
│       │   └── tests/               # Unit tests
│       └── setup.py                 # Module installation entry
│
├── scripts
│   ├── inves_train.py               # Entry script for inves_train
│   ├── list_envs.py                 # Environment listing tool
│   ├── skrl/                        # skrl algorithm train and play interface
│   └── st_rl/                       # st_rl algorithm interface and utilities
│       ├── conf/                    # Training configuration files
│       ├── train.py                 # Main training script
│       ├── play.py                  # Policy evaluation and replay
│       ├── ros2interface.py         # ROS2 communication interface
│       ├── sim2mujoco.py            # Sim-to-Sim transfer tool (IsaacSim -> MuJoCo)
│       └── sim2sim.py               # General Sim-to-Sim transfer
│
├── third_party
│   └── refmotion_manager/           # Reference motion manager 
│
├── run.sh                           # Quick start script
└── setup.sh                         # Environment initialization script

3. Configuration Details

Configuration files for different environments and algorithms are located in:

st_gym/exts/legged_robots/legged_robots/tasks/config/

└── lus2/
    ├── agents/
    │   └── st_rl_ppo_cfg.py         # PPO training parameters
    ├── __init__.py                  # Registers the train and play environments
    ├── flat_env_cfg.py              # Flat-terrain environment configuration
    ├── rough_env_cfg.py             # Rough-terrain environment configuration
    └── amp_mimic_cfg.py             # Motion data, AMP, and mimic configuration

To modify motion sources or environmental properties, edit the corresponding files above.

Next, we will introduce the parameters in rough_env_cfg.py and amp_mimic_cfg.py in detail.

3.1 InteractiveScene (rough_env_cfg.py)

  • Objects: terrain, light, sky_light
  • Articulation: robot
  • Sensors: height_scanner, contact_forces
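
For orientation, here is a minimal sketch of how such a scene is declared in Isaac Lab's configclass style. The import paths follow recent Isaac Lab releases and may differ in your installation; the real robot articulation lives in assets/lumos.py, so the prim paths and values below are illustrative, not st_gym's actual configuration.

```python
from dataclasses import MISSING

import isaaclab.sim as sim_utils
from isaaclab.assets import ArticulationCfg, AssetBaseCfg
from isaaclab.scene import InteractiveSceneCfg
from isaaclab.sensors import ContactSensorCfg, RayCasterCfg, patterns
from isaaclab.terrains import TerrainImporterCfg
from isaaclab.utils import configclass

@configclass
class LocomotionSceneCfg(InteractiveSceneCfg):
    # static scene objects
    terrain = TerrainImporterCfg(prim_path="/World/ground", terrain_type="plane")
    sky_light = AssetBaseCfg(
        prim_path="/World/skyLight",
        spawn=sim_utils.DomeLightCfg(intensity=750.0),
    )
    # articulation: filled in by the task config (see assets/lumos.py)
    robot: ArticulationCfg = MISSING
    # sensors used by observations and rewards
    height_scanner = RayCasterCfg(
        prim_path="{ENV_REGEX_NS}/Robot/base",  # base link name is illustrative
        pattern_cfg=patterns.GridPatternCfg(resolution=0.1, size=(1.6, 1.0)),
        mesh_prim_paths=["/World/ground"],
    )
    contact_forces = ContactSensorCfg(
        prim_path="{ENV_REGEX_NS}/Robot/.*", history_length=3
    )
```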

3.2 Event (rough_env_cfg.py)

Event terms configured in rough_env_cfg.py (a configuration sketch follows the list):

  • reset_base: Resets the robot base pose and velocity within configured ranges
  • reset_robot_joints: Resets joint positions and velocities
  • Domain randomization
    • physics_material: Randomizes rigid-body material properties (friction, restitution)
    • reset_robot_rigid_body_mass: Randomizes robot rigid-body masses
    • reset_robot_base_com: Randomizes the center of mass of the torso and hip
    • randomize_actuator_gains: Randomizes actuator stiffness and damping gains
    • randomize_joint_parameters: Randomizes joint parameters (friction, armature, limits)
  • External disturbance
    • push_robot: Applies random pushes to the robot at intervals
    • Interval-mode forces: Applies random external forces/torques on the torso
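
Two of these terms might be declared as below with Isaac Lab's event manager. The functions (mdp.randomize_rigid_body_material, mdp.push_by_setting_velocity) are Isaac Lab built-ins; the ranges, and whether st_gym uses exactly these terms, are assumptions for illustration.

```python
import isaaclab.envs.mdp as mdp
from isaaclab.managers import EventTermCfg as EventTerm
from isaaclab.managers import SceneEntityCfg
from isaaclab.utils import configclass

@configclass
class EventCfg:
    # startup-time domain randomization of contact material properties
    physics_material = EventTerm(
        func=mdp.randomize_rigid_body_material,
        mode="startup",
        params={
            "asset_cfg": SceneEntityCfg("robot", body_names=".*"),
            "static_friction_range": (0.6, 1.2),   # illustrative range
            "dynamic_friction_range": (0.4, 1.0),  # illustrative range
            "restitution_range": (0.0, 0.2),
            "num_buckets": 64,
        },
    )
    # interval-mode disturbance: push the robot by perturbing its base velocity
    push_robot = EventTerm(
        func=mdp.push_by_setting_velocity,
        mode="interval",
        interval_range_s=(10.0, 15.0),  # illustrative interval
        params={"velocity_range": {"x": (-0.5, 0.5), "y": (-0.5, 0.5)}},
    )
```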

3.3 Curriculum (rough_env_cfg.py)

| Curriculum Term | Function / Purpose | Weight | Num Steps |
| --- | --- | --- | --- |
| terrain_levels | Adjusts terrain difficulty according to robot velocity | N/A | N/A |
| alive_rew | Modifies the weight of the "alive" reward term | 1 | 500 |
| action_rate_l2 | Penalizes high action rate (L2 norm) | -0.1 | 500 |
| action_smooothness_2 | Penalizes jerkiness in actions | -0.1 | 500 |
| dof_torques_l2 | Penalizes large joint torques (L2 norm) | -0.000001 | 1000 |
| dof_acc_l2 | Penalizes high joint accelerations (L2 norm) | -5e-8 | 1000 |
| contact_forces | Penalizes excessive contact forces | -0.0001 | 1000 |
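
The Weight / Num Steps pattern maps directly onto Isaac Lab's mdp.modify_reward_weight curriculum function, which switches a reward term to a new weight after a given number of steps. A sketch using term names from the table (the repo's actual curriculum config may differ):

```python
import isaaclab.envs.mdp as mdp
from isaaclab.managers import CurriculumTermCfg as CurrTerm
from isaaclab.utils import configclass

@configclass
class CurriculumCfg:
    # promote/demote terrain difficulty based on how well the robot walks
    terrain_levels = CurrTerm(func=mdp.terrain_levels_vel)
    # after 500 steps, set the action_rate_l2 reward weight to -0.1
    action_rate_l2 = CurrTerm(
        func=mdp.modify_reward_weight,
        params={"term_name": "action_rate_l2", "weight": -0.1, "num_steps": 500},
    )
    # after 1000 steps, strengthen the torque penalty
    dof_torques_l2 = CurrTerm(
        func=mdp.modify_reward_weight,
        params={"term_name": "dof_torques_l2", "weight": -0.000001, "num_steps": 1000},
    )
```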

3.4 Reward Keys and Weights

The reward terms are defined in rough_env_cfg.py; the weights that are most often modified are set in amp_mimic_cfg.py.

Task rewards

| Parameter | Description | Example |
| --- | --- | --- |
| termination_penalty | Penalize termination | -450.0 |
| alive | Survival reward | 5.0 |
| track_lin_vel_xy_exp | Reward for tracking target linear velocity in the XY plane | 2.0 |
| track_ang_vel_z_exp | Reward for tracking target angular velocity around the Z axis | 2.0 |
| feet_air_time | Reward for swing-foot air time | 0.8 |
| feet_slide | Penalty for sliding feet | -1.0 |
| joint_deviation_hip | Penalize deviation of hip joints from nominal posture | -0.1 |
| joint_deviation_feet | Penalize deviation of feet joints from nominal posture | -0.1 |
| joint_deviation_arms | Penalize deviation of arm joints from nominal posture | -0.1 |
| joint_deviation_wrist | Penalize deviation of wrist joints from nominal posture | -0.2 |
| joint_deviation_torso | Penalize deviation of torso joint from nominal posture | -0.1 |
| energy_cost | Penalize energy consumption | -2.0e-7 |
| feet_parallel_v1 | Reward for keeping feet parallel | 1.0 |
| undesired_contacts | Penalize undesired body-ground contacts | -20.0 |
| feet_stumble | Penalize stumbling feet | -500.0 |
| action_smooothness_2 | Penalize jerky actions (smoothness regularization) | -5.0e-2 |
| action_rate_l2 | Penalize high action change rate | -5.0e-2 |
| dof_acc_l2 | Penalize large joint accelerations | -1e-8 |
| dof_torques_l2 | Penalize large joint torques | -4.0e-6 |
| dof_pos_limits | Penalize exceeding joint position limits | -10.0 |
| dof_vel_limits | Penalize exceeding joint velocity limits | -5.0 |
| dof_torques_limits | Penalize exceeding torque limits | -1.0 |
| contact_forces | Penalize large contact forces | -2.0e-5 |

AMP style reward

| Parameter | Description | Example |
| --- | --- | --- |
| track_style_goal_exp | Reward for tracking style goals in AMP | 1.47 |

Mimic tracking rewards

| Parameter | Description | Example |
| --- | --- | --- |
| track_upper_joint_pos_exp | Track upper-body joint positions | 20.4 |
| track_upper_joint_vel_exp | Track upper-body joint velocities | 3.5 |
| track_lower_joint_pos_exp | Track lower-body joint positions | 15.0 |
| track_lower_joint_vel_exp | Track lower-body joint velocities | 2.2 |
| track_feet_joint_pos_exp | Track feet joint positions | 2.0 |
| track_feet_joint_vel_exp | Track feet joint velocities | 1.0 |
| track_link_pos_exp | Track link positions | 0.0 |
| track_link_vel_exp | Track link velocities | 0.0 |
| track_root_pos_exp | Track root position | 2.0 |
| track_root_quat_exp | Track root quaternion (absolute orientation) | 0.0 |
| track_root_rotation_exp | Track root rotation | 2.0 |
| track_root_lin_vel_exp | Track root linear velocity | 1.0 |
| track_root_ang_vel_exp | Track root angular velocity | 1.0 |
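
A common Isaac Lab pattern, and presumably what amp_mimic_cfg.py does, is to inherit the rough config and re-weight terms in __post_init__. The class and import names below are hypothetical; only the term names and weights come from the tables above.

```python
from isaaclab.utils import configclass

from .rough_env_cfg import Lus2RoughEnvCfg  # hypothetical class name

@configclass
class Lus2AmpMimicEnvCfg(Lus2RoughEnvCfg):  # hypothetical class name
    def __post_init__(self):
        super().__post_init__()
        # re-weight terms that rough_env_cfg.py defines
        self.rewards.track_lin_vel_xy_exp.weight = 2.0
        self.rewards.track_upper_joint_pos_exp.weight = 20.4
        # a weight of 0.0 effectively disables a tracking term
        self.rewards.track_link_pos_exp.weight = 0.0
```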

3.5 Observation and Action (rough_env_cfg.py)

In our environments, the agent receives the following observations and outputs the following actions (a configuration sketch follows the list):

  • Policy Observation Space:
    • Proprioceptive states
      • Base angular velocity
      • Projected gravity
      • Joint positions
      • Joint velocities
      • Last actions
    • Goal states
      • Velocity commands
      • Optional
        • style_goal_commands (if using AMP)
        • expressive_goal_commands (if using mimic)
  • Action Space:
    • Joint position control: desired joint angles/positions
  • Critic Observation Space:
    • Root states
      • Base position
      • Base orientation
      • Base linear velocity
      • Base angular velocity
      • Projected gravity
    • Joint & action states
      • Joint positions
      • Joint velocities
      • Last actions
    • Body states (for mimic)
      • Expressive link positions (body pos)
      • Expressive link velocities (body lin vel)
    • Goal states from the next time frame (future)
      • Velocity commands
      • Optional
        • style_goal_commands (if using AMP)
        • expressive_goal_commands (if using mimic)
    • Privileged information
      • Masses, contact forces, joint stiffness/damping, friction coefficients
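
Here is a sketch of the policy observation group in Isaac Lab's observation-manager style. The mdp functions shown are Isaac Lab built-ins; the noise magnitudes are illustrative, and the AMP/mimic goal terms (which come from st_gym's own mdp package) are omitted.

```python
import isaaclab.envs.mdp as mdp
from isaaclab.managers import ObservationGroupCfg as ObsGroup
from isaaclab.managers import ObservationTermCfg as ObsTerm
from isaaclab.utils import configclass
from isaaclab.utils.noise import UniformNoiseCfg as Unoise

@configclass
class PolicyCfg(ObsGroup):
    # proprioception
    base_ang_vel = ObsTerm(func=mdp.base_ang_vel, noise=Unoise(n_min=-0.2, n_max=0.2))
    projected_gravity = ObsTerm(
        func=mdp.projected_gravity, noise=Unoise(n_min=-0.05, n_max=0.05)
    )
    joint_pos = ObsTerm(func=mdp.joint_pos_rel, noise=Unoise(n_min=-0.01, n_max=0.01))
    joint_vel = ObsTerm(func=mdp.joint_vel_rel, noise=Unoise(n_min=-1.5, n_max=1.5))
    actions = ObsTerm(func=mdp.last_action)
    # goal states
    velocity_commands = ObsTerm(
        func=mdp.generated_commands, params={"command_name": "base_velocity"}
    )

    def __post_init__(self):
        self.enable_corruption = True   # apply the noise models above
        self.concatenate_terms = True   # flatten into a single observation vector
```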

3.6 Commands (rough_env_cfg.py)

| Command | Type / Class | Resampling Time (s) | Velocity / Heading Ranges | Notes |
| --- | --- | --- | --- | --- |
| base_velocity | BaseVelocityCommand | (5.0, 10.0) | Linear X: (-0.2, 0.2), Linear Y: (-0.1, 0.1), Angular Z: (-0.1, 0.1), Heading: (-π, π) | Debug visualization enabled |
| style_goal_commands | StyleCommand (if using AMP) | (10.0, 10.0) | Linear & angular velocities: (-1.0, 1.0), Heading: (-π, π) | Number of commands = len(style_goal_fields) |
| expressive_goal_commands | ExpressiveCommand (if using mimic) | (0.0, 0.0) | Linear & angular velocities: (-1.0, 1.0), Heading: (-π, π) | Number of commands = len(expressive_goal_fields) |
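
For the base_velocity row, the ranges map naturally onto Isaac Lab's built-in UniformVelocityCommandCfg (the repo's BaseVelocityCommand may subclass or replace it; treat the class choice as an assumption):

```python
import math

import isaaclab.envs.mdp as mdp

base_velocity = mdp.UniformVelocityCommandCfg(
    asset_name="robot",
    resampling_time_range=(5.0, 10.0),  # resample the command every 5-10 s
    heading_command=True,               # derive ang_vel_z from a heading target
    debug_vis=True,                     # "Debug visualization enabled"
    ranges=mdp.UniformVelocityCommandCfg.Ranges(
        lin_vel_x=(-0.2, 0.2),
        lin_vel_y=(-0.1, 0.1),
        ang_vel_z=(-0.1, 0.1),
        heading=(-math.pi, math.pi),
    ),
)
```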

3.7 Terminations (rough_env_cfg.py)

| Termination Condition | Function | Parameters / Notes |
| --- | --- | --- |
| time_out | mdp.time_out | Ends the episode when the maximum episode length is reached |
| base_height | mdp.root_height_below_minimum | Minimum pelvis height = 0.4 m |
| bad_orientation | mdp.bad_orientation | Limit angle = 1.0 rad |
| tracking_lower_dof_error | st_mdp.tracking_error_adaptive_termination | Monitors lower joint position error (min 0.3, max 1.5) |
| tracking_upper_dof_error | st_mdp.tracking_error_adaptive_termination | Monitors upper joint position error (min 0.2, max 1.5) |
| tracking_root_pos_error (optional) | st_mdp.tracking_error_adaptive_termination | Monitors root position error (min 0.4, max 2.0) |
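
The first three terms are standard Isaac Lab termination functions and could be declared as below; the st_mdp.tracking_error_adaptive_termination terms are st_gym-specific, so they are omitted from this sketch.

```python
import isaaclab.envs.mdp as mdp
from isaaclab.managers import TerminationTermCfg as DoneTerm
from isaaclab.utils import configclass

@configclass
class TerminationsCfg:
    # episode reached its maximum length (a time-out, not a failure)
    time_out = DoneTerm(func=mdp.time_out, time_out=True)
    # failure: root/pelvis drops below 0.4 m
    base_height = DoneTerm(
        func=mdp.root_height_below_minimum, params={"minimum_height": 0.4}
    )
    # failure: base tilts more than 1.0 rad from upright
    bad_orientation = DoneTerm(func=mdp.bad_orientation, params={"limit_angle": 1.0})
```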

3.8 Environment Parameters

amp_mimic_cfg.py

| Parameter | Description | Example |
| --- | --- | --- |
| num_envs | Number of parallel environments for training | 4096 |
| using_21_joint | Whether to use the 21-joint robot model (otherwise 27 joints) | True |
| motion_files | Reference motion files used for imitation learning (AMP / mimic tasks) | dance2_subject4_1871_6771_fps25.pkl |
| random_start | Randomize the robot's starting state | True |
| amp_obs_frame_num | Number of consecutive frames in the AMP observation (history length) | 2 |
| INIT_STATE_FIELDS | Initial state variables (root state + joint DOF pos/vel) | root_pos_x, root_rot_w, … |
| style_fields | State features used for the style tracking reward | root_rot_w, joint_dof_pos, joint_dof_vel, … |
| style_goal_fields | Target style features for goal tracking (optional, often None) | None |
| style_reward_coef | Reward coefficient for style tracking | 10.0 |
| expressive_goal_fields | Features used in expressive imitation (joint DOF states + link positions/velocities) | joint_dof_pos, joint_dof_vel, link_pos_x_b, … |
| ref_motion_cfg.ref_length_s | Duration of the reference trajectory segment (seconds) | 2.0 |
| ref_motion_cfg.time_between_frames | Time interval between motion frames (seconds) | 0.02 |
| trajectory_num | Number of trajectories sampled per environment | 4096 |
| specify_init_values | Customized initial posture values (optional) | dict of joint positions (stand pose) |
| episode_length_s | Episode duration (seconds) | 2 |

st_rl_ppo_cfg.py

| Parameter | Description | Example |
| --- | --- | --- |
| save_interval | Checkpoint saving frequency | 500 |
| max_iterations | Maximum number of training iterations | 20000 |
| experiment_name | Name of the experiment | lus2_flat |
| algorithm_name | Algorithm used | PPO |
| policy_name | Policy architecture | ActorCritic |
| runner_name | Runner type | AmpPolicyRunner |
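
To make the relationships concrete, here is a minimal stand-in for the reference-motion settings as plain dataclasses. The field names and example values come from the table above; how st_gym actually groups them into classes is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class RefMotionCfg:
    ref_length_s: float = 2.0          # reference segment duration (s)
    time_between_frames: float = 0.02  # 0.02 s between frames = 50 fps motion data

@dataclass
class AmpMimicSettings:
    num_envs: int = 4096
    using_21_joint: bool = True        # False would select the 27-joint model
    motion_files: list[str] = field(
        default_factory=lambda: ["dance2_subject4_1871_6771_fps25.pkl"]
    )
    random_start: bool = True
    amp_obs_frame_num: int = 2         # AMP observation history length
    style_reward_coef: float = 10.0
    episode_length_s: float = 2.0
    ref_motion_cfg: RefMotionCfg = field(default_factory=RefMotionCfg)
```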

3.9 PPO Parameters (st_rl_ppo_cfg.py)

| Parameter | Description | Example |
| --- | --- | --- |
| clip_param | Clipping range for the PPO probability ratio | 0.2 |
| entropy_coef | Coefficient for entropy regularization | 0.01 |
| value_loss_coef | Weight of the value function loss | 1.0 |
| num_learning_epochs | Number of learning epochs per update | 5 |
| num_mini_batches | Mini-batch splits per epoch | 4 |
| learning_rate | Policy optimization learning rate | 1e-3 |
| gamma | Discount factor | 0.99 |
| lam | GAE lambda | 0.95 |
| desired_kl | Target KL divergence | 0.01 |
| max_grad_norm | Maximum gradient norm for clipping | 1.0 |
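
As a reminder of what clip_param controls: it bounds the policy probability ratio in the standard clipped PPO surrogate objective. A generic PyTorch sketch of that objective (not st_rl's exact implementation):

```python
import torch

def ppo_surrogate_loss(
    log_prob: torch.Tensor,
    old_log_prob: torch.Tensor,
    advantages: torch.Tensor,
    clip_param: float = 0.2,
) -> torch.Tensor:
    """Clipped PPO policy loss (to be minimized)."""
    ratio = torch.exp(log_prob - old_log_prob)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantages
    # pessimistic (clipped) surrogate, negated for gradient descent
    return -torch.min(surr1, surr2).mean()
```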

3.10 Policy Parameters (st_rl_ppo_cfg.py)

| Parameter | Description | Example |
| --- | --- | --- |
| actor_hidden_dims | Hidden layer sizes for the actor network | [512, 256, 128] |
| critic_hidden_dims | Hidden layer sizes for the critic network | [512, 256, 128] |
| activation | Nonlinear activation function | ELU |
| rnn_type | Recurrent module type (for temporal correlation) | LSTM |
| init_noise_std (amp_mimic_cfg.py) | Initial exploration noise standard deviation | 1.2 |
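
actor_hidden_dims, critic_hidden_dims, and activation describe ordinary MLP heads. A generic sketch of how such a network is typically assembled (the input/output sizes are illustrative, not st_gym's actual dimensions):

```python
import torch.nn as nn

def build_mlp(num_in: int, num_out: int,
              hidden_dims=(512, 256, 128), activation=nn.ELU) -> nn.Sequential:
    """Stack Linear + activation layers following the hidden_dims spec."""
    layers: list[nn.Module] = []
    in_dim = num_in
    for h in hidden_dims:
        layers += [nn.Linear(in_dim, h), activation()]
        in_dim = h
    layers.append(nn.Linear(in_dim, num_out))  # linear output head
    return nn.Sequential(*layers)

# e.g. an actor mapping observations to 21 joint-position targets
actor = build_mlp(num_in=66, num_out=21)  # dims illustrative
```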