Sim to Real

ACT Sim to Real

Fig 1. Sim to Real, pick up works, transfer is close
ACT model trossen_ai_stationary_sim_act13 learned from dataset trossen_ai_stationary_sim_transfer_40mm_cube_13.

The goal was to see how well an ACT model trained in a simulated environment works on the physical robot. We used control_sim_robot.py in our lerobot fork to build a dataset in the MuJoCo environment -- adapted from trossen_arm_mujoco -- for the Trossen AI Stationary robot. The heuristic task -- again adapted from trossen_arm_mujoco -- was to pick up a 40mm red cube with one robot arm and transfer it to the other. The following experiments use the baseline ACT algorithm. Note that in all real-robot policy rollouts we set robot.max_relative_target=0.05. This parameter caps the joint angle change per control step, implemented through clipping, and is critical for smoothing the rollouts and getting good results with ACT. (Full lerobot dataset names include the ANRedlich/... prefix.)
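The clipping behavior behind max_relative_target can be sketched as follows. This is a minimal illustration only; the actual implementation lives inside lerobot's robot class, and the function name here is ours:

```python
import numpy as np

def clip_relative_target(current_pos, goal_pos, max_relative_target=0.05):
    """Cap the commanded joint change at max_relative_target radians per step.

    The target sent to the motors is clipped to lie within
    +/- max_relative_target of the current joint position, which smooths
    jerky policy outputs during rollouts.
    """
    delta = np.clip(goal_pos - current_pos, -max_relative_target, max_relative_target)
    return current_pos + delta
```

A large commanded jump, e.g. from 0.0 to 0.2 rad, is therefore executed as a series of 0.05 rad steps over several control cycles.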

  • Conclusion: ACT sim to real results are very sensitive to matching the simulated and real environments. For the best environmental match, the robot was able to pick up the cube, but it was not able to complete the transfer, see Fig 1.
  • Best model: trossen_ai_stationary_sim_act13 with ~75% correct pick up, but no completed transfers. It was trained on the trossen_ai_stationary_sim_transfer_40mm_cube_13 dataset which is the closest match to the real environment. The simulated env is the same one shown in Fig 4.
  • Robustness: adding multiple cube colors and sizes, tabletop textures, backgrounds, and lighting variations to the dataset does not seem to improve performance for the ACT algorithm in this context.
  • Cube color: moderately sensitive. Sim to real performance dropped to ~33% when the cube color was changed from red to a slightly darker red, even though all other environmental parameters were held constant.
  • Cube size: very sensitive.
  • Tabletop: moderately sensitive. The closest match used a simulated tabletop texture derived from a photo of the real tabletop.
  • Background: moderately sensitive. The closest match used photos of the real robot surroundings, although they were very crudely aligned.
  • Lighting: very sensitive to the simulated environment lighting, and also the real robot lighting.
  • Joint angles and arm base positions: unlike for real to sim, see below, adjusting the joint angles and base positions did not help sim to real performance. We are not sure why.
  • Calibration:
    Replay: using the replay option in control_sim_robot.py in our lerobot fork, any of the real robot datasets can be replayed in simulation. Likewise, any of the simulated datasets can be replayed by control_robot.py on the real robot. This allows precise alignment of sim and real for an actual task.
    Joint angles: to get the real and sim replays to match exactly, it was necessary to shift joints 1 and 2 by -0.025 and +0.025 radians, respectively, using the arms_ref option in gym-aloha, which is implemented in sim.py using physics.named.model.qpos0. We believe this compensates for slight sag in the real robot arms due to gravity.
    Arm base position: using the arms_pos option in gym-aloha, implemented with physics.model.body_pos in sim.py, the simulated robot base was moved to the y=0.0 position, which is consistent with physical measurement on the real robot.
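The arms_ref idea can be sketched with a plain array standing in for MuJoCo's physics.named.model.qpos0. The joint ordering below is illustrative only, not the actual model layout:

```python
import numpy as np

# Illustrative joint ordering for one arm; the real ordering comes from the
# MuJoCo model, addressed by name via physics.named.model.qpos0 in sim.py.
JOINT_NAMES = ["joint0", "joint1", "joint2", "joint3", "joint4", "joint5"]

def apply_arms_ref(qpos0, arms_ref):
    """Shift the reference joint angles by per-joint offsets (radians)."""
    shifted = np.asarray(qpos0, dtype=float).copy()
    for name, offset in arms_ref.items():
        shifted[JOINT_NAMES.index(name)] += offset
    return shifted

# Offsets found during replay calibration: compensate for gravity sag.
calibrated = apply_arms_ref(np.zeros(6), {"joint1": -0.025, "joint2": 0.025})
```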

ACT Real to Sim

Fig 2. Real to Sim
ACT model trossen_ai_stationary_real_act2_3 learned from real dataset trossen_ai_stationary_transfer_40mm_cube_02.

In this case, a cube transfer dataset was collected using the real robot, and then tested in the simulated environment, see Fig 2.

  • Conclusion: Real to sim for ACT is extremely sensitive to matching the real and simulated environments, as was the case for sim to real. For real to sim, robot alignment, using the calibration above, is also important.
  • Best model: ANRedlich/trossen_ai_stationary_real_act2_3, which only gets ~20% correct in the simulated environment.
  • Environment: except for lighting, the best environment is the same as for sim to real -- the one that most closely matches the real environment.
  • Lighting: the best simulated lighting differs from that used for sim to real; it is closer to the lighting in the real robot dataset.
  • Joint angles: the arms_ref env option in gym-aloha adds a +/- shift (qpos0) to the simulated robot joint angles. During the calibration, see above, we found this was necessary for joints 1 and 2 to get real and sim to match. We believe this is due to gravity weighing down the real arms.
  • Arm base position: the arms_pos env option was used to place the simulated arm base positions where they should be according to the replay calibration, above, which is consistent with measurements on the real robot.

pi0 Sim to Real

Fig 3. pi0 sim to real
Same policy as Fig 4, zero-shot to real environment.

The pi0 model in openpi was trained on the same simulated dataset that gave the best results for ACT, above. It was then tested on both the real robot, Fig 3, and the simulated robot, Fig 4. For the simulated robot, we linked our gym-aloha fork, which contains the Trossen AI Stationary robot simulator, and created a new test example folder called aloha_sim_trossen_ai. To test on the real robot, we adapted the real robot example in the Trossen fork of openpi to our older lerobot implementation, which uses the older Trossen arm drivers. Our real robot example folder is called trossen_ai.

  • Conclusion: As can be seen by comparing Fig 1 to Fig 3, the sim to real transfer for pi0 is much more robust than it was for ACT. As discussed below, we believe this robustness might be due to the pi0 pre-training.
  • LoRA fine tuning: starting with the base policy, pi0_base, we trained on the trossen_ai_stationary_sim_transfer_40mm_cube_13 dataset from the Trossen AI Stationary robot simulator. On an Ubuntu computer with an Nvidia RTX 5090 GPU, 20K steps of training took about 16 hours. When tested on new examples from the same simulated environment, performance is 95-100%, as long as success is defined as touching with either the left or right finger, not just the default left finger. See Fig 4.
  • Robust to env changes: as seen by comparing Fig 3 to Fig 4, the real robot environment has lighting and cube color that are very different from the sim env, and yet the real robot picks up and transfers the cube successfully ~90% of the time!
  • Out-of-distribution robustness: in both sim to real, Fig 4 -> Fig 3, and sim to sim, Fig 4 -> Fig 5, there is evidence of out-of-distribution robustness. The simulated dataset was created using noise-free waypoint interpolation, see scripted_policy.py in lerobot, so it is very clean. The real robot introduces noise, so the path often diverges from the simulated path. When this happens with the ACT algorithm, the real robot most often fails. pi0, however, seems to pull the robot back onto the correct path. We believe this can be seen in Fig 3 as the robot gets close to picking up the yellow cube: it slows down and chugga-chugga makes its way to the cube, then gets back on path and completes the transfer. This behavior was not learned from the simulated dataset, so we believe it is prior knowledge in pi0 coming from its large-scale pre-training.
  • Calibration: To match simulated model actions to real robot actions, a set of small systematic adjustments was required, based on the 'replay' calibration discussed in ACT Sim to Real, above. In main.py of the trossen_ai example, for the right arm: a[joint1] -= 0.025, a[joint2] += 0.025, and a[base] = 1.05*(a[base] + 0.01). The last adjustment uses the base angle to compensate for the difference in base positions between the sim and real robots.
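The per-action corrections above can be sketched like this. The joint indices are placeholders; the real layout of the right-arm action vector is defined in the trossen_ai example's main.py:

```python
import numpy as np

# Placeholder indices into the right arm's action vector.
BASE, JOINT1, JOINT2 = 0, 1, 2

def sim_to_real_action(a):
    """Apply the small systematic corrections found via replay calibration."""
    a = np.asarray(a, dtype=float).copy()
    a[JOINT1] -= 0.025                  # joint 1: gravity-sag compensation
    a[JOINT2] += 0.025                  # joint 2: gravity-sag compensation
    a[BASE] = 1.05 * (a[BASE] + 0.01)   # base angle: compensates base-position offset
    return a
```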
Fig 4. pi0 LoRA fine tuned policy
pi0 model trossen_ai_stationary_sim_pi013 learned from dataset trossen_ai_stationary_sim_transfer_40mm_cube_13.

pi0 Sim to Sim Generalization

Fig 5. pi0 generalization to new environment!
Same model as Fig 4, completely different env parameters.

The above policy, trained on the simulated dataset trossen_ai_stationary_sim_transfer_40mm_cube_13, was also tested in a simulated environment with very different environmental parameters, Fig 5. In this case, the wood tabletop -> black tabletop, the background -> no background, the lighting goes from medium -> bright, and the red cube -> blue cube (or other colors). Compare Fig 5 to Fig 4. Still, performance is ~75%! We believe this pi0 generalization ability is likely a combination of using the PaliGemma VLM together with large-scale robot pre-training.


pi0 Original Aloha Example

Fig 6. pi0 LoRA fine tuned policy for Aloha sim.

Before adding the aloha_sim_trossen_ai and real robot trossen_ai examples to our openpi fork, we first experimented with the existing aloha_sim simulation example. In case they might be useful, here are a couple of observations from those experiments:

  • action_horizon: Just running this example with the given pi0_aloha_sim model gave ~40% correct performance. However, increasing the default action_horizon: int = 10 in main.py to 50 -- the default during learning -- improved performance to ~85%.
  • LoRA fine tuning: starting with the base policy, pi0_base, we trained on the repo_id=lerobot/aloha_sim_transfer_cube_human dataset from the original Aloha robot simulator. On an Ubuntu computer with an Nvidia RTX 5090 GPU, 100K training steps took about 4 hours. The results were better than those of the pre-trained example policy, above, achieving 95-100% correct performance.
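To illustrate why action_horizon matters: the policy predicts a chunk of actions (50 during training), and action_horizon controls how many of them are executed before the policy is queried again. A minimal sketch with stub policy and environment classes (the names here are ours, not openpi's):

```python
class StubPolicy:
    """Counts queries; returns a fixed-length chunk of zero actions."""
    def __init__(self, chunk_len=50, action_dim=14):
        self.calls = 0
        self.chunk_len = chunk_len
        self.action_dim = action_dim

    def infer(self, obs):
        self.calls += 1
        return {"actions": [[0.0] * self.action_dim] * self.chunk_len}

class StubEnv:
    def reset(self):
        return {}

    def step(self, action):
        return {}

def rollout(policy, env, num_steps, action_horizon):
    """Execute action_horizon actions from each predicted chunk, then re-query."""
    obs = env.reset()
    t = 0
    while t < num_steps:
        chunk = policy.infer(obs)["actions"]
        for action in chunk[:action_horizon]:
            obs = env.step(action)
            t += 1
            if t >= num_steps:
                break
    return obs
```

With num_steps=100, action_horizon=10 queries the policy ten times, while action_horizon=50 queries it only twice, executing each chunk closer to how it was generated during training.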

Pre-training

The question we wanted to answer here is whether pre-training in a simulated environment would reduce the number of additional training steps required in another simulated environment or even in a real environment.

  • Pre-training: To answer this question, we first pre-trained an ACT policy on the dataset ANRedlich/trossen_ai_stationary_sim_transfer_40mm_cube_07. This dataset was built using the environment in Fig 5, although with a red cube.
  • Sim to sim: We then continued training, but on the dataset ANRedlich/trossen_ai_stationary_sim_transfer_40mm_cube_13, which has the environment shown in Fig 4. After only 10K steps, the model was correct 98% of the time on out-of-sample examples, compared to 90% correct when learning from scratch for 100K steps. Training for 10K steps from scratch did not work well.
  • Sim to real with sim pre-training: After only 10K steps, the sim model using pre-training was approximately as good on sim to real as the model trained from scratch for 100K steps, see above.
  • Real to real with sim pre-training: The sim model trained on ANRedlich/trossen_ai_stationary_sim_transfer_40mm_cube_13 was used as the pre-trained model to continue training on the real dataset ANRedlich/trossen_ai_stationary_transfer_40mm_cube_02 for 10K steps. This gave as good a result on real to real as training from scratch for 100K steps. Training for 10K steps from scratch did not work well.
  • Discussion: These pre-training results are very strong, but we are not sure whether they will generalize, since the simulated examples were created using noise-free waypoint trajectories, which may be easy to learn, independent of the environment.
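The warm-start procedure used above amounts to initializing the new training run from the pre-trained checkpoint's weights rather than from random values. A framework-agnostic sketch, with a dict of values standing in for a real checkpoint format:

```python
def warm_start(fresh_params, pretrained_params):
    """Initialize from a pretrained checkpoint where parameter names match;
    keep the freshly initialized values for anything the checkpoint lacks."""
    return {name: pretrained_params.get(name, value)
            for name, value in fresh_params.items()}

# Example: the pretrained model covers the backbone but not a new head.
fresh = {"backbone.w": 0.0, "head.w": 0.1}
pretrained = {"backbone.w": 1.5}
params = warm_start(fresh, pretrained)
```

Training then proceeds exactly as from scratch, just starting from these parameters, which is why only 10K additional steps were needed.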