Sim to Real

ACT Sim to Real

Fig 1. Sim to Real, pick up works, transfer is close
ACT model trossen_ai_stationary_sim_act13 learned from dataset trossen_ai_stationary_sim_transfer_40mm_cube_13.

The goal was to see how well an ACT model trained in a simulated environment works on the physical robot. We used control_sim_robot.py in our lerobot fork to build a dataset in the MuJoCo environment -- adapted from trossen_arm_mujoco -- for the Trossen AI Stationary robot. The heuristic task -- again adapted from trossen_arm_mujoco -- was to pick up a 40mm red cube with one robot arm and transfer it to the other. The following experiments use the baseline ACT algorithm. Note that in all real-robot policy rollouts we set robot.max_relative_target=0.05. This parameter caps the joint angle change per control step, implemented through clipping, and is critical for smoothing the rollouts and getting good results with ACT. (Full lerobot dataset names include the ANRedlich/... prefix.)
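The clipping behavior behind max_relative_target can be sketched as follows. This is a minimal illustration only; the actual implementation lives inside lerobot's robot class, and the function name here is ours:

```python
import numpy as np

def clip_relative_target(current_pos, goal_pos, max_relative_target=0.05):
    """Cap the commanded joint change at max_relative_target radians per step.

    The target sent to the motors is clipped to lie within
    +/- max_relative_target of the current joint position, which smooths
    jerky policy outputs during rollouts.
    """
    delta = np.clip(goal_pos - current_pos, -max_relative_target, max_relative_target)
    return current_pos + delta
```

A large commanded jump, e.g. from 0.0 to 0.2 rad, is therefore executed as a series of 0.05 rad steps over several control cycles.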

  • Conclusion: ACT sim to real results are very sensitive to matching the simulated and real environments. For the best environmental match, the robot was able to pick up the cube, but it was not able to complete the transfer, see Fig 1.
  • Best model: trossen_ai_stationary_sim_act13 with ~75% correct pick up, but no completed transfers. It was trained on the trossen_ai_stationary_sim_transfer_40mm_cube_13 dataset which is the closest match to the real environment. The simulated env is the same one shown in Fig 4.
  • Robustness: adding multiple cube colors and sizes, tabletop textures, backgrounds, and lighting variations to the dataset does not seem to improve performance for the ACT algorithm in this context.
  • Cube color: moderately sensitive. Sim to real performance dropped to ~33% when the cube color was changed from red to a slightly darker red, even though all other environmental parameters were held constant.
  • Cube size: very sensitive.
  • Tabletop: moderately sensitive. The closest match used a simulated tabletop texture derived from a photo of the real tabletop.
  • Background: moderately sensitive. The closest match used photos of the real robot surroundings, although they were very crudely aligned.
  • Lighting: very sensitive to the simulated environment lighting, and also the real robot lighting.
  • Joint angles and arm base positions: unlike for real to sim, see below, adjusting the joint angles and base positions did not help sim to real performance. We are not sure why.
  • Calibration:
    Replay: using the replay option in control_sim_robot.py in our lerobot fork, any of the real robot datasets can be replayed in simulation. Likewise, any of the simulated datasets can be replayed by control_robot.py on the real robot. This allows precise alignment of sim and real for an actual task.
    Joint angles: to get the real and sim replays to match exactly, it was necessary to shift joints 1 and 2 by -0.025 and +0.025 radians, respectively, using the arms_ref option in gym-aloha, which is implemented in sim.py using physics.named.model.qpos0. We believe this compensates for slight sag in the real robot arms due to gravity.
    Arm base position: using the arms_pos option in gym-aloha, implemented with physics.model.body_pos in sim.py, the simulated robot base was moved to the y=0.0 position, which is consistent with physical measurement on the real robot.
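The arms_ref idea can be sketched with a plain array standing in for MuJoCo's physics.named.model.qpos0. The joint ordering below is illustrative only, not the actual model layout:

```python
import numpy as np

# Illustrative joint ordering for one arm; the real ordering comes from the
# MuJoCo model, addressed by name via physics.named.model.qpos0 in sim.py.
JOINT_NAMES = ["joint0", "joint1", "joint2", "joint3", "joint4", "joint5"]

def apply_arms_ref(qpos0, arms_ref):
    """Shift the reference joint angles by per-joint offsets (radians)."""
    shifted = np.asarray(qpos0, dtype=float).copy()
    for name, offset in arms_ref.items():
        shifted[JOINT_NAMES.index(name)] += offset
    return shifted

# Offsets found during replay calibration: compensate for gravity sag.
calibrated = apply_arms_ref(np.zeros(6), {"joint1": -0.025, "joint2": 0.025})
```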

ACT Real to Sim

Fig 2. Real to Sim
ACT model trossen_ai_stationary_real_act2_3 learned from real dataset trossen_ai_stationary_transfer_40mm_cube_02.

In this case, a cube transfer dataset was collected using the real robot, and then tested in the simulated environment, see Fig 2.

  • Conclusion: Real to sim for ACT is extremely sensitive to matching the real and simulated environments, as was the case for sim to real. For real to sim, robot alignment, using the calibration above, is also important.
  • Best model: ANRedlich/trossen_ai_stationary_real_act2_3, which only gets ~20% correct in the simulated environment.
  • Environment: except for lighting, the best environment is the same as for sim to real -- the one that most closely matches the real environment.
  • Lighting: the best simulated lighting differs from that used for sim to real; it is closer to the lighting in the real robot dataset.
  • Joint angles: the arms_ref env option in gym-aloha adds a +/- shift (qpos0) to the simulated robot joint angles. During the calibration, see above, we found this was necessary for joints 1 and 2 to get real and sim to match. We believe this is due to gravity weighing down the real arms.
  • Arm base position: the arms_pos env option was used to place the simulated arm base positions where they should be according to the replay calibration, above, which is consistent with measurements on the real robot.

pi0 Sim to Real

Fig 3. pi0 sim to real
Same policy as Fig 4, zero-shot to real environment.

The pi0 model in openpi was trained on the same simulated dataset that gave the best results for ACT, above. It was then tested on both the real robot, Fig 3, and the simulated robot, Fig 4. For the simulated robot, we linked our gym-aloha fork, which contains the Trossen AI Stationary robot simulator, and created a new test example folder called aloha_sim_trossen_ai. To test on the real robot, we adapted the real robot example in the Trossen fork of openpi to our older lerobot implementation, which uses the older Trossen arm drivers. Our real robot example folder is called trossen_ai.

  • Conclusion: As can be seen by comparing Fig 1 to Fig 3, the sim to real transfer for pi0 is much more robust than it was for ACT. As discussed below, we believe this robustness might be due to the pi0 pre-training.
  • LoRA fine tuning: starting with the base policy, pi0_base, we trained on the trossen_ai_stationary_sim_transfer_40mm_cube_13 dataset from the Trossen AI Stationary robot simulator. On an Ubuntu computer with an Nvidia RTX 5090 GPU, 20K steps of training took about 16 hours. When tested on new examples from the same simulated environment, performance is 95-100%, as long as success is defined as touching with either the left or right finger, not just the default left finger. See Fig 4.
  • Robust to env changes: as seen by comparing Fig 3 to Fig 4, the real robot environment has lighting and cube color that are very different from the sim env, and yet the real robot picks up and transfers the cube successfully ~90% of the time!
  • Out-of-distribution robustness: in both sim to real, Fig 4 -> Fig 3, and sim to sim, Fig 4 -> Fig 5, there is evidence of out-of-distribution robustness. The simulated dataset was created using noise-free waypoint interpolation, see scripted_policy.py in lerobot, so it is very clean. The real robot introduces noise, so the path often diverges from the simulated path. When this happens with the ACT algorithm, the real robot most often fails. pi0, however, seems to pull the robot back onto the correct path. We believe this can be seen in Fig 3 as the robot gets close to picking up the yellow cube: it slows down and chugga-chugga makes its way to the cube, then gets back on path and completes the transfer. This behavior was not learned from the simulated dataset, so we believe it is prior knowledge in pi0 coming from its large-scale pre-training.
  • Calibration: To match simulated model actions to real robot actions, a set of small systematic adjustments was required, based on the 'replay' calibration discussed in ACT Sim to Real, above. In main.py of the trossen_ai example, for the right arm: a[joint1] -= 0.025, a[joint2] += 0.025, and a[base] = 1.05*(a[base] + 0.01). The last adjustment uses the base angle to compensate for the difference in base positions between the sim and real robots.
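The per-action corrections above can be sketched like this. The joint indices are placeholders; the real layout of the right-arm action vector is defined in the trossen_ai example's main.py:

```python
import numpy as np

# Placeholder indices into the right arm's action vector.
BASE, JOINT1, JOINT2 = 0, 1, 2

def sim_to_real_action(a):
    """Apply the small systematic corrections found via replay calibration."""
    a = np.asarray(a, dtype=float).copy()
    a[JOINT1] -= 0.025                  # joint 1: gravity-sag compensation
    a[JOINT2] += 0.025                  # joint 2: gravity-sag compensation
    a[BASE] = 1.05 * (a[BASE] + 0.01)   # base angle: compensates base-position offset
    return a
```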
Fig 4. pi0 LoRA fine tuned policy
pi0 model trossen_ai_stationary_sim_pi013 learned from dataset trossen_ai_stationary_sim_transfer_40mm_cube_13.

pi0 Sim to Sim Generalization

Fig 5. pi0 generalization to new environment!
Same model as Fig 4, completely different env parameters.

The above policy, trained on the simulated dataset trossen_ai_stationary_sim_transfer_40mm_cube_13, was also tested in a simulated environment with very different environmental parameters, Fig 5. In this case, the wood tabletop -> black tabletop, the background -> no background, the lighting goes from medium -> bright, and the red cube -> blue cube (or other colors). Compare Fig 5 to Fig 4. Still, performance is ~75%! We believe this pi0 generalization ability is likely a combination of using the PaliGemma VLM together with large-scale robot pre-training.


pi0 Original Aloha Example

Fig 6. pi0 LoRA fine tuned policy for Aloha sim.

Before adding the aloha_sim_trossen_ai and real robot trossen_ai examples to our openpi fork, we first experimented with the existing aloha_sim simulation example. In case they might be useful, here are a couple of observations from those experiments:

  • action_horizon: Just running this example with the given pi0_aloha_sim model gave ~40% correct performance. However, increasing the default action_horizon: int = 10 in main.py to 50 -- the default during learning -- improved performance to ~85%.
  • LoRA fine tuning: starting with the base policy, pi0_base, we trained on the repo_id=lerobot/aloha_sim_transfer_cube_human dataset from the original Aloha robot simulator. On an Ubuntu computer with an Nvidia RTX 5090 GPU, 100K training steps took about 4 hours. The results were better than those of the pre-trained example policy, above, achieving 95-100% correct performance.
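To illustrate why action_horizon matters: the policy predicts a chunk of actions (50 during training), and action_horizon controls how many of them are executed before the policy is queried again. A minimal sketch with stub policy and environment classes (the names here are ours, not openpi's):

```python
class StubPolicy:
    """Counts queries; returns a fixed-length chunk of zero actions."""
    def __init__(self, chunk_len=50, action_dim=14):
        self.calls = 0
        self.chunk_len = chunk_len
        self.action_dim = action_dim

    def infer(self, obs):
        self.calls += 1
        return {"actions": [[0.0] * self.action_dim] * self.chunk_len}

class StubEnv:
    def reset(self):
        return {}

    def step(self, action):
        return {}

def rollout(policy, env, num_steps, action_horizon):
    """Execute action_horizon actions from each predicted chunk, then re-query."""
    obs = env.reset()
    t = 0
    while t < num_steps:
        chunk = policy.infer(obs)["actions"]
        for action in chunk[:action_horizon]:
            obs = env.step(action)
            t += 1
            if t >= num_steps:
                break
    return obs
```

With num_steps=100, action_horizon=10 queries the policy ten times, while action_horizon=50 queries it only twice, executing each chunk closer to how it was generated during training.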

Pre-training

The question we wanted to answer here is whether pre-training in a simulated environment would reduce the number of additional training steps required in another simulated environment or even in a real environment.

  • Pre-training: To answer this question, we first pre-trained an ACT policy on the dataset ANRedlich/trossen_ai_stationary_sim_transfer_40mm_cube_07. This dataset was built using the environment in Fig 5, although with a red cube.
  • Sim to sim: We then continued training, but on the dataset ANRedlich/trossen_ai_stationary_sim_transfer_40mm_cube_13, which has the environment shown in Fig 4. After only 10K steps, the model was correct 98% of the time on out-of-sample examples, compared to 90% correct when learning from scratch for 100K steps. Training for 10K steps from scratch did not work well.
  • Sim to real with sim pre-training: After only 10K steps, the sim model using pre-training was approximately as good on sim to real as the model trained from scratch for 100K steps, see above.
  • Real to real with sim pre-training: The sim model trained on ANRedlich/trossen_ai_stationary_sim_transfer_40mm_cube_13 was used as the pre-trained model to continue training on the real dataset ANRedlich/trossen_ai_stationary_transfer_40mm_cube_02 for 10K steps. This gave as good a result on real to real as training from scratch for 100K steps. Training for 10K steps from scratch did not work well.
  • Discussion: These pre-training results are very strong, but we are not sure whether they will generalize, since the simulated examples were created using noise-free waypoint trajectories, which may be easy to learn, independent of the environment.
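The warm-start procedure used above amounts to initializing the new training run from the pre-trained checkpoint's weights rather than from random values. A framework-agnostic sketch, with a dict of values standing in for a real checkpoint format:

```python
def warm_start(fresh_params, pretrained_params):
    """Initialize from a pretrained checkpoint where parameter names match;
    keep the freshly initialized values for anything the checkpoint lacks."""
    return {name: pretrained_params.get(name, value)
            for name, value in fresh_params.items()}

# Example: the pretrained model covers the backbone but not a new head.
fresh = {"backbone.w": 0.0, "head.w": 0.1}
pretrained = {"backbone.w": 1.5}
params = warm_start(fresh, pretrained)
```

Training then proceeds exactly as from scratch, just starting from these parameters, which is why only 10K additional steps were needed.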