Multiple tasks/prompts
Fig 1. pi0 learns multiple tasks/prompts plus generalization.
'pick up green cube and place in silver pan'
Note: Green and white cubes NOT in dataset.
- Multiple task prompts: To teach the robot to perform different tasks for different prompts, we used full fine-tuning to train on the ANRedlich/trossen_ai_stationary_pick_and_place_07 dataset, in which the robot picks up one of two cubes and places it in one of two pans, corresponding to the prompt 'pick up {cube_color} cube and place in {pan_color} pan', where cube_color = ['red', 'blue', 'pink', 'yellow', 'brown'] and pan_color = ['silver', 'yellow']. We then tested by giving the robot one of the dataset prompts (Fig 1). It chose the correct cube 100% of the time, although it failed to physically grasp the cube about 25% of the time. After picking up a cube, it placed it in the correct pan 100% of the time.
- Implementation details: To train with different prompts for different tasks, the lerobot dataset must contain a prompt/task for each episode in meta/episodes.jsonl and a list of unique prompts/tasks in meta/tasks.jsonl. For dataset creation, we added prompting to the record option of control_robot.py in our lerobot fork. Also, for openpi to use prompts in training and policy serving, TrainConfig in training/config.py needs a couple of added lines; see Openpi Experimental Details.
- Task generalization: To test task generalization, we used cubes with colors NOT in the training set. For example, we tested a green cube with the prompt 'pick up green cube and place in silver pan' (Figs 1, 2) and a white cube with 'pick up white cube and place in silver pan' (Figs 3, 4). This shows that task learning in pi0 generalizes, most likely due to language understanding in PaliGemma and/or pre-training of the pi0 model.
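For concreteness, the prompt template and the two metadata files described in the implementation details above can be sketched in a few lines of Python. This is a minimal sketch: the field names follow the lerobot v2 dataset layout as we understand it, and the episode count and length here are illustrative, not taken from the actual dataset.

```python
import itertools
import json

# Generate the 10 training prompts from the template in the text.
cube_colors = ["red", "blue", "pink", "yellow", "brown"]
pan_colors = ["silver", "yellow"]
tasks = [
    f"pick up {cube} cube and place in {pan} pan"
    for cube, pan in itertools.product(cube_colors, pan_colors)
]

# meta/tasks.jsonl: one line per unique prompt/task.
tasks_jsonl = "\n".join(
    json.dumps({"task_index": i, "task": t}) for i, t in enumerate(tasks)
)

# meta/episodes.jsonl: one line per episode, each tagged with its prompt/task.
# (20 episodes of 300 frames each is a made-up example.)
episodes_jsonl = "\n".join(
    json.dumps({"episode_index": i, "tasks": [tasks[i % len(tasks)]], "length": 300})
    for i in range(20)
)

print(len(tasks))  # 10 unique prompts
```

Training then simply pairs each episode's frames with the prompt string stored for that episode.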
Fig 2. pi0 learns multiple tasks/prompts plus generalization.
'pick up green cube and place in silver pan'
Compared to Fig 1, shows pi0 is using color, not position.
Fig 3. pi0 learns multiple tasks/prompts plus generalization.
Compared to Fig 1,2, shows pi0 wasn't just lucky in knowing green.
Fig 4. pi0 learns multiple tasks/prompts plus generalization.
Compared to Fig 3, again, shows pi0 is using color, not position.
Sub-task learning
Fig 5. pi0.5 sub-task policy with voice control.
'pick up {color} cube and place in green bucket'
pi0.5 performs sub-tasks smoothly in any order.
- Building a sub-task dataset: The multi-task dataset used above was recorded one high-level task at a time, with the robot returning to its home position before each task. Here, the high-level task is 'put all cubes in the bucket'. Sub-tasks, on the other hand, cannot be recorded one at a time with a return to home each time, because they need to flow into each other naturally. So to create a sub-task dataset, we first record a high-level task containing multiple sub-tasks, and then use our dataset_splitter.py tool to split this high-level episode into multiple sub-task episodes. The dataset used in this section, ANRedlich/trossen_ai_stationary_pick_and_place_09, was built this way, splitting the high-level task 'put all cubes in the bucket' into sub-tasks of the form 'place the red cube in the green bucket'.
- Human voice control: To facilitate human high-level control of the robot, we added a ReSpeaker microphone, together with Silero VAD to detect speech automatically and faster-whisper for speech-to-text conversion; see voice_command.py in our openpi trossen_ai example. When the VAD detects speech such as 'pick up the blue cube', the closest prompt is sent to the pi0 or pi0.5 policy and that sub-task is performed, see Fig 5.
- pi0.5 vs pi0: pi0.5 does much better than pi0 when trained on the same multi-sub-task dataset: compare Fig 5 to Fig 6. Although pi0 does seem to conceptually understand the sub-tasks -- it tries to pick up the correct cube -- it is not as good as pi0.5 at precisely grasping the cubes.
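The episode-splitting idea behind dataset_splitter.py can be sketched as follows. This is a hypothetical, simplified version: the real tool's interface and the lerobot frame format may differ, and the boundary indices and prompts below are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Stand-in for one recorded timestep (observation + action)."""
    index: int
    observation: dict = field(default_factory=dict)
    action: list = field(default_factory=list)


def split_episode(frames, boundaries, sub_prompts):
    """Split one long high-level episode into sub-task episodes.

    boundaries: frame indices where each sub-task starts, e.g. [0, 150].
    sub_prompts: one prompt per sub-task, in order.
    Hypothetical helper; dataset_splitter.py may work differently.
    """
    assert len(boundaries) == len(sub_prompts)
    ends = boundaries[1:] + [len(frames)]
    return [
        {"task": prompt, "frames": frames[start:end]}
        for start, end, prompt in zip(boundaries, ends, sub_prompts)
    ]


# Toy example: a 300-frame 'put all cubes in the bucket' episode
# split into two sub-task episodes.
frames = [Frame(i) for i in range(300)]
subs = split_episode(
    frames,
    [0, 150],
    ["place the red cube in the green bucket",
     "place the blue cube in the green bucket"],
)
```

Because each sub-episode keeps the robot state exactly where the previous one left it, the resulting dataset teaches sub-tasks that flow into each other, which is the whole point of splitting after recording rather than recording sub-tasks separately.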
Fig 6. pi0 sub-task policy with voice control.
pi0/pi0.5 both understand sub-tasks, but pi0.5 picks up objects better.
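The closest-prompt step of the voice pipeline can be sketched with a simple fuzzy match. This is only an illustration of the idea: voice_command.py's actual matching logic may differ, Silero VAD and faster-whisper sit upstream of this step, and the prompt list here is illustrative.

```python
import difflib

# Known sub-task prompts from the training set (colors are illustrative).
PROMPTS = [
    f"place the {c} cube in the green bucket"
    for c in ["red", "blue", "pink", "yellow", "brown"]
]


def closest_prompt(transcript, prompts=PROMPTS, cutoff=0.3):
    """Map a (possibly noisy) speech transcript to the nearest known prompt.

    Returns None when nothing is close enough, so garbled speech is ignored
    rather than sent to the policy.
    """
    matches = difflib.get_close_matches(transcript.lower(), prompts,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None


cmd = closest_prompt("pick up the blue cube")
```

Snapping the transcript to a known training prompt matters because the policy was fine-tuned on exact prompt strings; free-form phrasing is normalized before it reaches pi0/pi0.5.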
High-level Control
Fig 7. pi0.5 sub-task policy with Gemini Robotics ER-1.5 HL control.
GR is served remotely, hence the latency between sub-tasks.
GR replaces human voice control in Figs 5,6. Turn on sound to hear GR.
- Gemini Robotics high-level control: We also replaced human high-level (HL) voice control with Gemini Robotics ER-1.5 (GR), Fig 7. GR is a VLM specifically trained as a HL robot controller. The GR model can be integrated into Python code, but it is served remotely over the web, which introduces latency of at least 3 seconds and intermittently up to 15 seconds. Nevertheless, it can control a pi0 or pi0.5 policy to execute sub-tasks in the desired order to complete a high-level task. It should be noted that, in principle, pi0.5 can learn HL task control itself, but this option does not seem to be available through openpi.
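The planner/policy loop described above can be sketched as follows. To stay runnable offline, the remote GR call is injected as a plain callable (its real API is not shown here and the stub below is hypothetical); the loop structure, and where the multi-second latency lands, is the point.

```python
import time


def run_high_level_control(plan_next_subtask, execute_subtask, max_steps=10):
    """Alternate between a high-level planner and the low-level policy.

    plan_next_subtask: callable(observation) -> prompt string, or "done".
        In our setup this would wrap a remote Gemini Robotics ER-1.5 call,
        which is why there is visible latency between sub-tasks.
    execute_subtask: callable(prompt) -> new observation
        (a pi0/pi0.5 policy rollout for that sub-task).
    """
    obs = None
    history = []
    for _ in range(max_steps):
        t0 = time.monotonic()
        prompt = plan_next_subtask(obs)   # remote call: ~3-15 s in practice
        latency = time.monotonic() - t0
        if prompt == "done":
            break
        history.append((prompt, latency))
        obs = execute_subtask(prompt)     # local policy runs the sub-task
    return history


# Offline stub standing in for Gemini Robotics ER-1.5:
queue = iter(["place the red cube in the green bucket",
              "place the blue cube in the green bucket",
              "done"])
history = run_high_level_control(
    plan_next_subtask=lambda obs: next(queue),
    execute_subtask=lambda prompt: {"last_prompt": prompt},
)
```

Because planning and execution are decoupled, the same loop works with human voice, GR, or any other HL controller: only plan_next_subtask changes.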