ACT
Fig 1. Pop lid off container.
ACT model trained on dataset trossen_ai_stationary_pop_lid_06.
One of the goals of the ACT algorithm and the Aloha robot was to solve problems requiring significant dexterity. ACT can learn single-task problems by training from scratch, with no robot pre-training, although it does inherit some prior image understanding from its ResNet18 backbone. Here we use ACT as a baseline, and we find that it often does a pretty good job!
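For context on what ACT is doing under the hood: it predicts short chunks of future actions and, at inference time, can average the overlapping predictions for the current timestep (temporal ensembling, with weights exp(-m*i) as in the ACT paper). A minimal numpy sketch of that weighted average (the function name and the default m are ours, for illustration):

```python
import numpy as np

def ensemble_action(chunk_preds, m=0.01):
    """Average several chunks' predictions of the *same* timestep's action.

    chunk_preds: array of shape (n_preds, action_dim); row 0 is the oldest
    chunk's prediction, which gets the largest weight per the ACT scheme.
    """
    n = len(chunk_preds)
    w = np.exp(-m * np.arange(n))   # w_i = exp(-m * i)
    w /= w.sum()                    # normalize the weights
    return (w[:, None] * chunk_preds).sum(axis=0)
```

With m=0 this reduces to a plain mean; larger m trusts older predictions more, smoothing the executed trajectory at the cost of slower reaction to new observations.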
- Successes:
Pop lid: This task (see Fig 1), learned from trossen_ai_stationary_pop_lid_06, works well, but only if the "takeout" container is positioned carefully on the tabletop! The dataset did not have much position variety, so further experiments are planned. Also, the lid fit very snugly, so some crushing was necessary even for a human using only two fingers; further experiments with better containers are also planned. Nevertheless, the policy does succeed!
Transfer cube: This task (see Fig 2), with either a 20mm or 40mm cube, was easily learned by ACT from, e.g., ANRedlich/trossen_ai_stationary_transfer_20mm_cube_01.
Pour cup to cup: This task (see Fig 3) was easily learned by ACT from ANRedlich/trossen_ai_stationary_pour_box_05. It works over the same ~2-3 inch range of cup placements as in the dataset.
Place lids: The dataset ANRedlich/trossen_ai_stationary_place_lids_04 has many different pot and lid colors and shapes at many locations. ACT did pretty well, but was inconsistent. The dataset, however, is small relative to the object variety. See Figs 6,7 for examples of pi0 performing this task.
- Failures:
Multiple cube colors, sizes, and orientations: The ACT algorithm did not learn the task in dataset ANRedlich/trossen_ai_stationary_transfer_multi_cube_03. See Fig 5 for examples of pi0 performing this task. It may be that the number of examples needs increasing, but we suspect that there is just too much task variety for ACT.
Place lids: As mentioned, ACT was inconsistent on this dataset, although more data might improve performance.
Beads on a string: See Fig 4, which shows pi0.5 succeeding at this task. An ACT model was trained on this task but was not able to place the bead on the string. It was, however, able to pick up the string and move it from one gripper to the other.
- Discussion: ACT seems to work well for tasks with limited task and environmental variety. We are not sure whether this is because our datasets are too small or because it is a fundamental limitation of ACT.
Fig 2. Transfer 20mm cube.
ACT model trained on dataset trossen_ai_stationary_transfer_20mm_cube_01.
Fig 3. Pour little red cube from one cup to another.
ACT model trained on dataset trossen_ai_stationary_pour_box_05.
pi0/pi0.5
Fig 4. pi0.5 policy for placing a bead on the string.
pi0 and pi0.5 were designed to reason about robot tasks and respond to multiple prompts. Here, however, we ask only how precisely they can perform single high-dexterity tasks.
- Successes:
Beads on a string: A fully fine-tuned pi0.5 policy was trained on trossen_ai_stationary_place_bead_on_string_10 for 40K steps using a single H100 rented on runpod.io. Training took about 24 hours. This policy was then able to place a small bead on a string about 20% of the time, Fig 4. This is a difficult task and the dataset was small -- 50 examples, 30 sec each -- so it is encouraging that pi0.5 could accomplish it at least some of the time. However, doubling the size of the dataset and re-training only improved performance a little. An ACT model trained on the same dataset was not able to accomplish the task at all!
Multiple cube colors, sizes, and orientations: We trained a pi0 policy for this small (50 examples, 12 min total) but moderately difficult dataset, ANRedlich/trossen_ai_stationary_transfer_multi_cube_03 (see Figs 5a,b), which ACT had failed to learn. LoRA training was used for 10K steps with batch_size=64, which took about 12 hours on a remote H100 PCIe GPU at runpod.io. (We believe a 20K-step run on our local RTX 5090 with the default batch_size=32 would give a similar result.) The real robot picked up and transferred blue cubes correctly about 80% of the time (Fig 5a), while with yellow cubes it achieved ~50% success (Fig 5b), and with green and red ~30-50% success. These results are very encouraging given the complexity of the problem and the small number of dataset examples, and they are far better than what we achieved with ACT on the same dataset!
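The remote/local equivalence above is just a samples-seen calculation; a quick sanity check (the 30 fps frame rate is an assumption about the dataset):

```python
# Two training budgets are roughly equivalent if steps * batch_size match.
steps_remote, batch_remote = 10_000, 64   # remote H100 run from the text
steps_local, batch_local = 20_000, 32     # hypothetical local RTX 5090 run
samples_remote = steps_remote * batch_remote
samples_local = steps_local * batch_local
assert samples_remote == samples_local == 640_000

# Against a 50-episode dataset of ~12 min, assuming 30 fps:
frames = 12 * 60 * 30                     # ~21,600 frames total
epochs = samples_remote / frames
print(round(epochs, 1))                   # -> 29.6 passes over the data
```

This also shows how small the dataset is relative to the training budget: each frame is revisited roughly 30 times, so more steps alone cannot substitute for more data variety.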
Place lids: The dataset ANRedlich/trossen_ai_stationary_place_lids_04 (see Figs 6,7) has 6 lids and 8 pots of multiple colors and shapes at many locations, but is small: 50 episodes, 12 min total. Our first attempt was LoRA training for 20K steps on our local RTX 5090 for 16 hours, with poor results, so we resumed training for another 20K steps and achieved good results for some of the lid/pot combos, including one small lid that requires high accuracy (see Fig 6a). We then trained a pi0 model from scratch using full fine-tuning for 20K steps on a remote H100 PCIe GPU, which took about 12 hours. The results were somewhat improved and overall very encouraging, again given the dataset size versus its complexity. With the pi0 model, the robot is able to pick up at least 3 of the 6 lids and place and drop them crudely on the pots (see Figs 6b,c,d), and it comes very close to picking up the other 3 lids. We also trained a pi0.5 model for 40K steps, which seems to perform slightly better than pi0, typically picking up 4 of the 6 lids, but sometimes more depending on lid position. For even better results, see our policy improvement experiment!
- Failures or close calls:
Multiple cube colors, sizes, and orientations: As mentioned above, the pi0 LoRA policy does well with this difficult dataset, getting 30-80% correct depending on cube color, and when it fails it gets pretty close, although sometimes it gets confused and rotates the wrist in the wrong direction. Most likely this dataset is simply too small!
Place lids: As mentioned above, even with pi0.5, 2 of the 6 lids are not picked up, but the policy can clearly see the lids and comes very close to picking them up (see Figs 7a,b). However, with policy improvement we were able to get the robot to pick up the lids ~95% of the time!
Beads on a string: On the high-dexterity, small-object dataset ANRedlich/trossen_ai_stationary_place_bead_on_string_10, pi0.5 was able to place a small bead on a string, but it failed about 80% of the time. The dataset is small, but one question is whether the policy can actually 'see' the string, which is necessary for it to learn to self-correct. This will require further investigation.
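The success rates quoted above come from small numbers of real-robot trials, so the error bars are wide. A quick Wilson score interval (a standard binomial confidence interval; the trial counts below are illustrative, not our actual tallies) makes this concrete:

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score confidence interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# "~20% success" observed as, say, 2 successes in 10 trials:
lo, hi = wilson_interval(2, 10)
print(round(lo, 2), round(hi, 2))  # -> 0.06 0.51
```

With only 10 trials, an observed 20% is consistent with anything from roughly 6% to 51%, which is worth keeping in mind when comparing policies on a handful of rollouts.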
Fig 5a. pi0 LoRA policy success!
ACT failed to learn a policy for this dataset. Note colors, orientations, positions.
Fig 5b. pi0 LoRA policy success!
ACT failed to learn a policy for this dataset. Note colors, orientations, positions.
Fig 6a. pi0 LoRA policy pickup success! place is close.
Note multiple shapes, materials, positions.
Fig 6b. pi0 full policy pickup success! place is close.
Note multiple shapes, materials, positions.
Fig 6c. pi0 full policy pickup and place success!
Note multiple shapes, materials, positions.
Fig 6d. pi0 full policy pickup success! place is close.
Note multiple shapes, materials, positions.
Fig 7a. pi0 full policy pickup failure.
Just misses lid pickup.
Fig 7b. pi0 full policy pickup failure.
Just misses lid pickup.
Discussion
- pi0 vs ACT: ACT seems to have more difficulty with tasks that have a variety of object types, orientations, and locations. This is evident from the multi-cube and lids-on-pots datasets. pi0 and pi0.5 seem to have an easier time with such datasets, perhaps showing greater scene and object understanding.
- pi0 vs pi0.5: While both pi0 and pi0.5 seem to show pretty good object understanding, pi0.5 seems to perform better than both pi0 and ACT on tasks requiring precision, such as the bead on a string task in Fig 4.
- LoRA vs full finetune: Full fine-tuning seems able to learn a greater variety of objects, as is evident from the lids-and-pots dataset. For example, although pi0-full does not pick up the lid in Fig 7b, it gets much closer than pi0-LoRA, which doesn't seem to "see" the metal lid at all and gets confused (not shown).
- Dataset size: Although we are seeing very encouraging results with the above datasets, the results are not perfect. We believe this is partly due to the small size of the datasets relative to their complexity: each has 50-100 examples, for a total of 12-24 minutes of data. This compares to 5-100 hours of data for task fine-tuning in pi0.
- Video resolution: One question is whether the video resolution input to pi0 and pi0.5 is high enough to 'see' the string in Fig 4. The required input resolution for the pi0 and pi0.5 models is 224x224, the resolution expected by PaliGemma. Currently the 640x480 video is resized to 224x224 with padding. To effectively zoom in and increase resolution, one solution is to crop the 640x480 images before the resize. See Openpi Experimental Details and also our openpi fork (development branch). Whether this helps is still an open question.
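A minimal numpy sketch of the crop-then-resize idea (the nearest-neighbor resize, zero-padding letterbox, and crop coordinates are our simplifications for illustration, not openpi's actual transform):

```python
import numpy as np

def nn_resize(img, out_h, out_w):
    """Nearest-neighbor resize (stand-in for the real pipeline's resampler)."""
    ys = np.linspace(0, img.shape[0] - 1, out_h).round().astype(int)
    xs = np.linspace(0, img.shape[1] - 1, out_w).round().astype(int)
    return img[ys][:, xs]

def letterbox(img, size=224):
    """Zero-pad to a square, then resize to size x size."""
    h, w = img.shape[:2]
    side = max(h, w)
    pad = np.zeros((side, side) + img.shape[2:], dtype=img.dtype)
    pad[(side - h) // 2:(side - h) // 2 + h,
        (side - w) // 2:(side - w) // 2 + w] = img
    return nn_resize(pad, size, size)

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy camera frame

# Default path: the whole 640x480 frame -> 224x224 (~2.9 source px per
# output px along the wide axis), so a thin string spans very few pixels.
full = letterbox(frame)

# Zoomed path: crop a 320x240 window around the work area first (the
# window coordinates are hypothetical), roughly doubling the effective
# resolution on the string before the same letterbox resize.
crop = frame[120:360, 160:480]
zoomed = letterbox(crop)
assert full.shape == zoomed.shape == (224, 224, 3)
```

Both paths produce the same 224x224 model input; the crop only changes how many source pixels the object of interest occupies before downsampling.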
- Policy improvement: As mentioned above, a big boost in performance can be achieved by adding episodes in which a person intervenes when the robot is about to fail. The current policy is then further trained on the combined imitation-plus-intervention dataset. See our policy improvement experiment for details and videos.
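The recipe is DAgger-like: record intervention episodes, merge them with the original demonstrations, and continue training. A toy sketch of the merge step (the episode dicts and field names are illustrative, not the LeRobot schema):

```python
# Original imitation episodes plus new episodes where a human intervened
# when the policy was about to fail (counts here are hypothetical).
imitation = [{"id": i, "source": "imitation"} for i in range(50)]
interventions = [{"id": 50 + i, "source": "intervention"} for i in range(20)]

# Continue training the current policy on the union of both sets, so the
# model sees corrections for exactly the states where it was failing.
combined = imitation + interventions
n_interv = sum(ep["source"] == "intervention" for ep in combined)
assert len(combined) == 70 and n_interv == 20
```

Because the intervention episodes are collected from the policy's own failure states, even a small number of them targets the distribution-shift errors that plain imitation data misses.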