ACT
Fig 1. Pop lid off container.
ACT model trained on dataset trossen_ai_stationary_pop_lid_06.
One of the goals of the ACT algorithm and the Aloha robot was to solve problems requiring significant dexterity. ACT can learn single-task problems by training from scratch, with no robot pre-training, although it does inherit some prior image understanding from its ResNet18 backbone. Here we use ACT as a baseline, and we find that it often does a pretty good job!
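For context on what ACT is doing under the hood: it predicts short chunks of future actions and, at inference time, can average the overlapping predictions for the current timestep (temporal ensembling, with weights exp(-m*i) as in the ACT paper). A minimal numpy sketch of that weighted average (the function name and the default m are ours, for illustration):

```python
import numpy as np

def ensemble_action(chunk_preds, m=0.01):
    """Average several chunks' predictions of the *same* timestep's action.

    chunk_preds: array of shape (n_preds, action_dim); row 0 is the oldest
    chunk's prediction, which gets the largest weight per the ACT scheme.
    """
    n = len(chunk_preds)
    w = np.exp(-m * np.arange(n))   # w_i = exp(-m * i)
    w /= w.sum()                    # normalize the weights
    return (w[:, None] * chunk_preds).sum(axis=0)
```

With m=0 this reduces to a plain mean; larger m trusts older predictions more, smoothing the executed trajectory at the cost of slower reaction to new observations.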
- Successes:
Pop lid: This task (see Fig 1), learned from trossen_ai_stationary_pop_lid_06, works well, but only if the "takeout" container is positioned carefully on the tabletop! The dataset did not have much position variety, so further experiments are planned. Also, the lid fit very snugly, so some crushing was necessary even for a human using only two fingers; further experiments with better containers are also planned. Nevertheless, the policy does succeed!
Transfer cube: This task (see Fig 2), with either a 20mm or 40mm cube, was easily learned by ACT from, e.g., ANRedlich/trossen_ai_stationary_transfer_20mm_cube_01.
Pour cup to cup: This task (see Fig 3) was easily learned by ACT from ANRedlich/trossen_ai_stationary_pour_box_05. It works over the same ~2-3 inch range of cup placements as in the dataset.
Place lids: The dataset ANRedlich/trossen_ai_stationary_place_lids_04 has many different pot and lid colors and shapes at many locations. ACT did pretty well, but was inconsistent. The dataset, however, is small relative to the object variety. See Figs 6,7 for examples of pi0 performing this task.
- Failures:
Multiple cube colors, sizes, and orientations: The ACT algorithm did not learn the task in dataset ANRedlich/trossen_ai_stationary_transfer_multi_cube_03. See Fig 5 for examples of pi0 performing this task. It may be that the number of examples needs increasing, but we suspect that there is just too much task variety for ACT.
Place lids: As mentioned, ACT was inconsistent on this dataset, although more data might improve performance.
Beads on a string: See Fig 4, which shows pi0.5 succeeding at this task. An ACT model was trained on this task but was not able to place the bead on the string. It was, however, able to pick up the string and move it from one gripper to the other.
- Discussion: ACT seems to work well for tasks with limited task and environmental variety. We are not sure whether this is because our datasets are too small or because it is a fundamental limitation of ACT.
Fig 2. Transfer 20mm cube.
ACT model trained on dataset trossen_ai_stationary_transfer_20mm_cube_01.
Fig 3. Pour little red cube from one cup to another.
ACT model trained on dataset trossen_ai_stationary_pour_box_05.
pi0/pi0.5
Fig 4. pi0.5 policy for placing a bead on the string.
pi0 and pi0.5 were designed to reason about robot tasks and respond to multiple prompts. Here, however, we ask only how precisely they can perform single high-dexterity tasks.
- Successes:
Beads on a string: A fully fine-tuned pi0.5 policy was trained on trossen_ai_stationary_place_bead_on_string_10 for 40K steps using a single H100 rented on runpod.io. Training took about 24 hours. This policy was then able to place a small bead on a string about 20% of the time, Fig 4. This is a difficult task and the dataset was small -- 50 examples, 30 sec each -- so it is encouraging that pi0.5 could accomplish it at least some of the time. However, doubling the size of the dataset and re-training only improved performance a little. An ACT model trained on the same dataset was not able to accomplish the task at all!
Multiple cube colors, sizes, and orientations: We trained a pi0 policy for this small (50 examples, 12 min total) but moderately difficult dataset, ANRedlich/trossen_ai_stationary_transfer_multi_cube_03 (see Figs 5a,b), which ACT had failed to learn. LoRA training was used for 10K steps with batch_size=64, which took about 12 hours on a remote H100 PCIe GPU at runpod.io. (We believe a 20K-step run on our local RTX 5090 with the default batch_size=32 would give a similar result.) The real robot picked up and transferred blue cubes correctly about 80% of the time (Fig 5a), while with yellow cubes it achieved ~50% success (Fig 5b), and with green and red ~30-50% success. These results are very encouraging given the complexity of the problem and the small number of dataset examples, and they are far better than what we achieved with ACT on the same dataset!
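The remote/local equivalence above is just a samples-seen calculation; a quick sanity check (the 30 fps frame rate is an assumption about the dataset):

```python
# Two training budgets are roughly equivalent if steps * batch_size match.
steps_remote, batch_remote = 10_000, 64   # remote H100 run from the text
steps_local, batch_local = 20_000, 32     # hypothetical local RTX 5090 run
samples_remote = steps_remote * batch_remote
samples_local = steps_local * batch_local
assert samples_remote == samples_local == 640_000

# Against a 50-episode dataset of ~12 min, assuming 30 fps:
frames = 12 * 60 * 30                     # ~21,600 frames total
epochs = samples_remote / frames
print(round(epochs, 1))                   # -> 29.6 passes over the data
```

This also shows how small the dataset is relative to the training budget: each frame is revisited roughly 30 times, so more steps alone cannot substitute for more data variety.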
Place lids: The dataset ANRedlich/trossen_ai_stationary_place_lids_04 (see Figs 6,7) has 6 lids and 8 pots of multiple colors and shapes at many locations, but is small: 50 episodes, 12 min total. Our first attempt was LoRA training for 20K steps on our local RTX 5090 for 16 hours, with poor results, so we resumed training for another 20K steps and achieved good results for some of the lid/pot combos, including one small lid that requires high accuracy (see Fig 6a). We then trained a pi0 model from scratch using full fine-tuning for 20K steps on a remote H100 PCIe GPU, which took about 12 hours. The results were somewhat improved and overall very encouraging, again given the dataset size versus its complexity. With the pi0 model, the robot is able to pick up at least 3 of the 6 lids and place and drop them crudely on the pots (see Figs 6b,c,d), and it comes very close to picking up the other 3 lids. We also trained a pi0.5 model for 40K steps, which seems to perform slightly better than pi0, typically picking up 4 of the 6 lids, but sometimes more depending on lid position. For even better results, see our policy improvement experiment!
- Failures or close calls:
Multiple cube colors, sizes, and orientations: As mentioned above, the pi0 LoRA policy does well with this difficult dataset, getting 30-80% correct depending on cube color, and when it fails it gets pretty close, although sometimes it gets confused and rotates the wrist in the wrong direction. Most likely this dataset is simply too small!
Place lids: As mentioned above, even with pi0.5, 2 of the 6 lids are not picked up, but the policy can clearly see the lids and comes very close to picking them up (see Figs 7a,b). However, with policy improvement we were able to get the robot to pick up the lids ~95% of the time!
Beads on a string: On the high-dexterity, small-object dataset ANRedlich/trossen_ai_stationary_place_bead_on_string_10, pi0.5 was able to place a small bead on a string, but it failed about 80% of the time. The dataset is small, but one question is whether the policy can actually 'see' the string, which is necessary for it to learn to self-correct. This will require further investigation.
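The success rates quoted above come from small numbers of real-robot trials, so the error bars are wide. A quick Wilson score interval (a standard binomial confidence interval; the trial counts below are illustrative, not our actual tallies) makes this concrete:

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score confidence interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# "~20% success" observed as, say, 2 successes in 10 trials:
lo, hi = wilson_interval(2, 10)
print(round(lo, 2), round(hi, 2))  # -> 0.06 0.51
```

With only 10 trials, an observed 20% is consistent with anything from roughly 6% to 51%, which is worth keeping in mind when comparing policies on a handful of rollouts.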
Fig 5a. pi0 LoRA policy success!
ACT failed to learn a policy for this dataset. Note colors, orientations, positions.
Fig 5b. pi0 LoRA policy success!
ACT failed to learn a policy for this dataset. Note colors, orientations, positions.
Fig 6a. pi0 LoRA policy pickup success! place is close.
Note multiple shapes, materials, positions.
Fig 6b. pi0 full policy pickup success! place is close.
Note multiple shapes, materials, positions.
Fig 6c. pi0 full policy pickup and place success!
Note multiple shapes, materials, positions.
Fig 6d. pi0 full policy pickup success! place is close.
Note multiple shapes, materials, positions.
Fig 7a. pi0 full policy pickup failure.
Just misses lid pickup.
Fig 7b. pi0 full policy pickup failure.
Just misses lid pickup.
Discussion
- pi0 vs ACT: ACT seems to have more difficulty with tasks that have a variety of object types, orientations, and locations. This is evident from the multi-cube and lids-on-pots datasets. pi0 and pi0.5 seem to have an easier time with such datasets, perhaps showing greater scene and object understanding.
- pi0 vs pi0.5: While both pi0 and pi0.5 seem to show pretty good object understanding, pi0.5 seems to perform better than both pi0 and ACT on tasks requiring precision, such as the bead on a string task in Fig 4.
- LoRA vs full finetune: Full fine-tuning seems able to learn a greater variety of objects, as is evident from the lids-and-pots dataset. For example, although pi0-full does not pick up the lid in Fig 7b, it gets much closer than pi0-LoRA, which doesn't seem to "see" the metal lid at all and gets confused (not shown).
- Dataset size: Although we are seeing very encouraging results with the above datasets, the results are not perfect. We believe this is partly due to the small size of the datasets relative to their complexity: each has 50-100 examples, for a total of 12-24 minutes of data. This compares to 5-100 hours of data for task fine-tuning in pi0.
- Video resolution: One question is whether the video resolution input to pi0 and pi0.5 is high enough to 'see' the string in Fig 4. The required input resolution for the pi0 and pi0.5 models is 224x224, the resolution expected by PaliGemma. Currently the 640x480 video is resized to 224x224 with padding. To effectively zoom in and increase resolution, one solution is to crop the 640x480 images before the resize. See Openpi Experimental Details and also our openpi fork (development branch). Whether this helps is still an open question.
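A minimal numpy sketch of the crop-then-resize idea (the nearest-neighbor resize, zero-padding letterbox, and crop coordinates are our simplifications for illustration, not openpi's actual transform):

```python
import numpy as np

def nn_resize(img, out_h, out_w):
    """Nearest-neighbor resize (stand-in for the real pipeline's resampler)."""
    ys = np.linspace(0, img.shape[0] - 1, out_h).round().astype(int)
    xs = np.linspace(0, img.shape[1] - 1, out_w).round().astype(int)
    return img[ys][:, xs]

def letterbox(img, size=224):
    """Zero-pad to a square, then resize to size x size."""
    h, w = img.shape[:2]
    side = max(h, w)
    pad = np.zeros((side, side) + img.shape[2:], dtype=img.dtype)
    pad[(side - h) // 2:(side - h) // 2 + h,
        (side - w) // 2:(side - w) // 2 + w] = img
    return nn_resize(pad, size, size)

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy camera frame

# Default path: the whole 640x480 frame -> 224x224 (~2.9 source px per
# output px along the wide axis), so a thin string spans very few pixels.
full = letterbox(frame)

# Zoomed path: crop a 320x240 window around the work area first (the
# window coordinates are hypothetical), roughly doubling the effective
# resolution on the string before the same letterbox resize.
crop = frame[120:360, 160:480]
zoomed = letterbox(crop)
assert full.shape == zoomed.shape == (224, 224, 3)
```

Both paths produce the same 224x224 model input; the crop only changes how many source pixels the object of interest occupies before downsampling.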
- Policy improvement: As mentioned above, a big boost in performance can be achieved by adding episodes in which a person intervenes when the robot is about to fail. The current policy is then further trained on the combined imitation-plus-intervention dataset. See our policy improvement experiment for details and videos.
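The recipe is DAgger-like: record intervention episodes, merge them with the original demonstrations, and continue training. A toy sketch of the merge step (the episode dicts and field names are illustrative, not the LeRobot schema):

```python
# Original imitation episodes plus new episodes where a human intervened
# when the policy was about to fail (counts here are hypothetical).
imitation = [{"id": i, "source": "imitation"} for i in range(50)]
interventions = [{"id": 50 + i, "source": "intervention"} for i in range(20)]

# Continue training the current policy on the union of both sets, so the
# model sees corrections for exactly the states where it was failing.
combined = imitation + interventions
n_interv = sum(ep["source"] == "intervention" for ep in combined)
assert len(combined) == 70 and n_interv == 20
```

Because the intervention episodes are collected from the policy's own failure states, even a small number of them targets the distribution-shift errors that plain imitation data misses.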