Multiple tasks/prompts
Fig 1. pi0 learns multiple tasks/prompts plus generalization.
'pick up green cube and place in silver pan'
Note: Green and white cubes NOT in dataset.
- Multiple task prompts: To teach the robot to perform different tasks for different prompts, we used full fine-tuning to train on the ANRedlich/trossen_ai_stationary_pick_and_place_07 dataset, in which the robot picks up one of two cubes and places it in one of two pans, corresponding to the prompt 'pick up {cube_color} cube and place in {pan_color} pan', where cube_color = ['red', 'blue', 'pink', 'yellow', 'brown'] and pan_color = ['silver', 'yellow']. We then tested by giving the robot one of the dataset prompts (Fig 1). It chose the correct cube 100% of the time, although it failed to physically grasp the cube about 25% of the time. After picking up a cube, it placed it in the correct pan 100% of the time.
- Implementation details: To train with different prompts for different tasks, the lerobot dataset must contain a prompt/task for each episode in meta/episodes.jsonl and a list of unique prompts/tasks in meta/tasks.jsonl. For dataset creation, we added prompting to the record option of control_robot.py in our lerobot fork. Also, for openpi to use prompts in training and policy serving, TrainConfig in training/config.py needs a couple of added lines; see Openpi Experimental Details.
- Task generalization: To test task generalization, we used cubes with colors NOT in the training set. For example, we tested a green cube with the prompt 'pick up green cube and place in silver pan' (Figs 1, 2) and a white cube with 'pick up white cube and place in silver pan' (Figs 3, 4). This shows that task learning in pi0 generalizes, most likely due to language understanding in PaliGemma and/or pre-training of the pi0 model.
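For concreteness, the prompt template and the two metadata files described in the implementation details above can be sketched in a few lines of Python. This is a minimal sketch: the field names follow the lerobot v2 dataset layout as we understand it, and the episode count and length here are illustrative, not taken from the actual dataset.

```python
import itertools
import json

# Generate the 10 training prompts from the template in the text.
cube_colors = ["red", "blue", "pink", "yellow", "brown"]
pan_colors = ["silver", "yellow"]
tasks = [
    f"pick up {cube} cube and place in {pan} pan"
    for cube, pan in itertools.product(cube_colors, pan_colors)
]

# meta/tasks.jsonl: one line per unique prompt/task.
tasks_jsonl = "\n".join(
    json.dumps({"task_index": i, "task": t}) for i, t in enumerate(tasks)
)

# meta/episodes.jsonl: one line per episode, each tagged with its prompt/task.
# (20 episodes of 300 frames each is a made-up example.)
episodes_jsonl = "\n".join(
    json.dumps({"episode_index": i, "tasks": [tasks[i % len(tasks)]], "length": 300})
    for i in range(20)
)

print(len(tasks))  # 10 unique prompts
```

Training then simply pairs each episode's frames with the prompt string stored for that episode.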
Fig 2. pi0 learns multiple tasks/prompts plus generalization.
'pick up green cube and place in silver pan'
Compared to Fig 1, shows pi0 is using color, not position.
Fig 3. pi0 learns multiple tasks/prompts plus generalization.
Compared to Fig 1,2, shows pi0 wasn't just lucky in knowing green.
Fig 4. pi0 learns multiple tasks/prompts plus generalization.
Compared to Fig 3, again, shows pi0 is using color, not position.
Sub-task learning
Fig 5. pi0.5 sub-task policy with voice control.
'pick up {color} cube and place in green bucket'
pi0.5 performs sub-tasks smoothly in any order.
- Building a sub-task dataset: The multi-task dataset used above was recorded one high-level task at a time, with the robot returning to its home position before each task. Here, the high-level task is 'put all cubes in the bucket'. Sub-tasks, on the other hand, cannot be recorded one at a time with a return to home each time, because they need to flow into each other naturally. So to create a sub-task dataset, we first record a high-level task containing multiple sub-tasks, and then use our dataset_splitter.py tool to split this high-level episode into multiple sub-task episodes. The dataset used in this section, ANRedlich/trossen_ai_stationary_pick_and_place_09, was built this way, splitting the high-level task 'put all cubes in the bucket' into sub-tasks of the form 'place the red cube in the green bucket'.
- Human voice control: To facilitate human high-level control of the robot, we added a ReSpeaker microphone, together with Silero VAD to detect speech automatically and faster-whisper for speech-to-text conversion; see voice_command.py in our openpi trossen_ai example. When the VAD detects speech such as 'pick up the blue cube', the closest prompt is sent to the pi0 or pi0.5 policy and that sub-task is performed, see Fig 5.
- pi0.5 vs pi0: pi0.5 does much better than pi0 when trained on the same multi-sub-task dataset: compare Fig 5 to Fig 6. Although pi0 does seem to conceptually understand the sub-tasks -- it tries to pick up the correct cube -- it is not as good as pi0.5 at precisely grasping the cubes.
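The episode-splitting idea behind dataset_splitter.py can be sketched as follows. This is a hypothetical, simplified version: the real tool's interface and the lerobot frame format may differ, and the boundary indices and prompts below are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Stand-in for one recorded timestep (observation + action)."""
    index: int
    observation: dict = field(default_factory=dict)
    action: list = field(default_factory=list)


def split_episode(frames, boundaries, sub_prompts):
    """Split one long high-level episode into sub-task episodes.

    boundaries: frame indices where each sub-task starts, e.g. [0, 150].
    sub_prompts: one prompt per sub-task, in order.
    Hypothetical helper; dataset_splitter.py may work differently.
    """
    assert len(boundaries) == len(sub_prompts)
    ends = boundaries[1:] + [len(frames)]
    return [
        {"task": prompt, "frames": frames[start:end]}
        for start, end, prompt in zip(boundaries, ends, sub_prompts)
    ]


# Toy example: a 300-frame 'put all cubes in the bucket' episode
# split into two sub-task episodes.
frames = [Frame(i) for i in range(300)]
subs = split_episode(
    frames,
    [0, 150],
    ["place the red cube in the green bucket",
     "place the blue cube in the green bucket"],
)
```

Because each sub-episode keeps the robot state exactly where the previous one left it, the resulting dataset teaches sub-tasks that flow into each other, which is the whole point of splitting after recording rather than recording sub-tasks separately.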
Fig 6. pi0 sub-task policy with voice control.
pi0/pi0.5 both understand sub-tasks, but pi0.5 picks up objects better.
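The closest-prompt step of the voice pipeline can be sketched with a simple fuzzy match. This is only an illustration of the idea: voice_command.py's actual matching logic may differ, Silero VAD and faster-whisper sit upstream of this step, and the prompt list here is illustrative.

```python
import difflib

# Known sub-task prompts from the training set (colors are illustrative).
PROMPTS = [
    f"place the {c} cube in the green bucket"
    for c in ["red", "blue", "pink", "yellow", "brown"]
]


def closest_prompt(transcript, prompts=PROMPTS, cutoff=0.3):
    """Map a (possibly noisy) speech transcript to the nearest known prompt.

    Returns None when nothing is close enough, so garbled speech is ignored
    rather than sent to the policy.
    """
    matches = difflib.get_close_matches(transcript.lower(), prompts,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None


cmd = closest_prompt("pick up the blue cube")
```

Snapping the transcript to a known training prompt matters because the policy was fine-tuned on exact prompt strings; free-form phrasing is normalized before it reaches pi0/pi0.5.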
High-level Control
Fig 7. pi0.5 sub-task policy with Gemini Robotics ER-1.5 HL control.
GR is served remotely, hence the latency between sub-tasks.
GR replaces human voice control in Figs 5,6. Turn on sound to hear GR.
- Gemini Robotics high-level control: We also replaced human high-level (HL) voice control with Gemini Robotics ER-1.5 (GR), Fig 7. GR is a VLM specifically trained as a HL robot controller. The GR model can be integrated into Python code, but it is served remotely over the web, which introduces latency of at least 3 seconds and intermittently up to 15 seconds. Nevertheless, it can control a pi0 or pi0.5 policy to execute sub-tasks in the desired order to complete a high-level task. It should be noted that, in principle, pi0.5 can learn HL task control itself, but this option does not seem to be available through openpi.
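The planner/policy loop described above can be sketched as follows. To stay runnable offline, the remote GR call is injected as a plain callable (its real API is not shown here and the stub below is hypothetical); the loop structure, and where the multi-second latency lands, is the point.

```python
import time


def run_high_level_control(plan_next_subtask, execute_subtask, max_steps=10):
    """Alternate between a high-level planner and the low-level policy.

    plan_next_subtask: callable(observation) -> prompt string, or "done".
        In our setup this would wrap a remote Gemini Robotics ER-1.5 call,
        which is why there is visible latency between sub-tasks.
    execute_subtask: callable(prompt) -> new observation
        (a pi0/pi0.5 policy rollout for that sub-task).
    """
    obs = None
    history = []
    for _ in range(max_steps):
        t0 = time.monotonic()
        prompt = plan_next_subtask(obs)   # remote call: ~3-15 s in practice
        latency = time.monotonic() - t0
        if prompt == "done":
            break
        history.append((prompt, latency))
        obs = execute_subtask(prompt)     # local policy runs the sub-task
    return history


# Offline stub standing in for Gemini Robotics ER-1.5:
queue = iter(["place the red cube in the green bucket",
              "place the blue cube in the green bucket",
              "done"])
history = run_high_level_control(
    plan_next_subtask=lambda obs: next(queue),
    execute_subtask=lambda prompt: {"last_prompt": prompt},
)
```

Because planning and execution are decoupled, the same loop works with human voice, GR, or any other HL controller: only plan_next_subtask changes.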