Policy Improvement using Human Interventions

DAgger

Fig 1. pi0.5 after human intervention and DAgger.
Both lid pickup and placement improved in this example.
  • Human Intervention: Imitation learning (behavioral cloning) can work surprisingly well for initial policy training, but often yields fragile policies. The distribution of states encountered on a real robot includes many situations not represented in the imitation learning dataset \(D_0\): a robot following the learned policy \(\pi_0\) may drift onto a trajectory unlike any it was trained on, and then fail to accomplish its task. One remedy is to have a human intervene when the robot, following \(\pi_0\), is about to fail. The human stops the robot mid-trajectory and teleoperates it for the remainder of the episode. Such episodes, policy rollout plus human intervention, can be recorded and saved as a new dataset \(D_1\). Because \(D_1\) contains states/situations not in the original dataset \(D_0\), it can be used to teach the robot how to behave in a greater variety of situations. To improve the initial policy \(\pi_0\), one can combine \(D_0\) and \(D_1\) and retrain to obtain a more robust policy \(\pi_1\). This process can then be iterated.
  • DAgger: In our case, we intervene only once, when the robot is about to fail, and then teleoperate to the end of the episode. We also train the next policy \(\pi_{i+1}\) by initializing it from \(\pi_i\) and training on the union of all datasets collected so far: \(\bigcup_{j=0}^{i+1} D_j\). This is inspired by the DAgger paper, in which the authors prove the benefit of using the current policy to explore the state space while having an expert replace the current policy's actions in that space with expert actions. In DAgger, a large fraction of actions are replaced in this way, which creates a very rich dataset with correct behavior over a wide range of states/situations. The original DAgger approach also trains each new policy \(\pi_{i+1}\) from scratch. In practice, however, it is common to intervene only selectively and to warm-start training from the previous policy, which is what we do here.
  • Place lid on pan: We applied this approach to improve an imitation learning pi0.5 policy trained on ANRedlich/trossen_ai_stationary_place_lids_04, which contains 50 episodes created using teleoperation. We then added another 50 episodes using human intervention; the combined dataset is ANRedlich/trossen_ai_stationary_place_lids_13. The human corrected two types of error. First, as shown in Figs 2b and 3b, the robot failed to pick up the lid; by intervening just before this failure, the lid is picked up and then placed on the pan. Second, the robot picks up the lid correctly but does not place it well on the pan, as shown in Fig 4b, in which case the human intervenes just before the lid is misplaced. Starting from the original pi0.5 policy, which had been trained for 40K steps, training was resumed for another 5K steps on the combined 100-episode dataset (+10K steps was slightly worse). The improved performance is shown in Figs 2a, 3a, and 4a. In one test with multiple lids and pans, the robot went from picking up 66% of lids to picking up 95%. Placement success went from 15% to 30%, and lids were placed more closely, though still not perfectly, about 60% of the time. Only one iteration was performed, but more are planned.
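The iteration described above can be sketched in a few lines. This is a minimal simulation of the aggregate-and-retrain loop, not real openpi or lerobot code: `train` and `collect_with_interventions` are hypothetical stand-ins for fine-tuning and for recording rollouts with human interventions.

```python
# Sketch of the intervention-based DAgger-style loop described above.
# `train` and `collect_with_interventions` are hypothetical placeholders,
# not actual openpi/lerobot APIs.

def train(init_policy, episodes):
    # Placeholder: fine-tune `init_policy` on `episodes`.
    # Here we just record what the policy was trained on.
    return {"init": init_policy, "num_episodes": len(episodes)}

def collect_with_interventions(policy, num_episodes):
    # Placeholder: roll out `policy`, let a human intervene just before
    # failures, and return the recorded episodes.
    return [f"episode_{policy['num_episodes']}_{k}" for k in range(num_episodes)]

# D_0: the initial teleoperated imitation learning dataset (50 episodes here).
datasets = [[f"teleop_{k}" for k in range(50)]]
policy = train({"init": None, "num_episodes": 0}, datasets[0])  # pi_0

for i in range(2):  # a couple of DAgger-style iterations
    # D_{i+1}: rollouts of pi_i with human interventions.
    datasets.append(collect_with_interventions(policy, num_episodes=50))
    # pi_{i+1}: warm-start from pi_i, train on the union of D_0 .. D_{i+1}.
    all_episodes = [ep for d in datasets for ep in d]
    policy = train(policy, all_episodes)

print(policy["num_episodes"])  # -> 150 (the union keeps growing)
```

The key design choice, mirrored from the text, is that each new policy is initialized from the previous one and trained on the union of all datasets rather than on the latest interventions alone.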
Fig 2a. pi0.5 after human intervention and DAgger.
Lid is correctly picked up, compare to Fig 2b.
Fig 2b. pi0.5 initial imitation learning only.
Fails to pick up lid.
Fig 3a. pi0.5 after human intervention and DAgger.
Lid is correctly picked up, compare to Fig 3b.
Fig 3b. pi0.5 initial imitation learning only.
Fails to pick up lid.
Fig 4a. pi0.5 after human intervention and DAgger.
Lid placement is improved compared to Fig 4b.
Fig 4b. pi0.5 initial imitation learning only.
Fails to place lid.

Human Intervention Implementation

Fig 5. Policy rollout with human intervention.

Fig 5 shows our implementation of policy rollout with human intervention. This is done by examples/trossen_ai/record.py in our openpi fork, which also saves episodes in the required lerobot dataset format. The record script runs the current pi0.5 (or another policy) until the down arrow key is pressed, at which point the robot arm is frozen. In Fig 5, this happens just before the robot attempts to pick up the lid, which it would fail to do. Next, the leader arm is sent to the same position as the frozen follower arm. Once the leader arm is in place, pressing the down arrow key again puts the robot arms into teleoperation mode, and the person completes the episode. Notice in Fig 6 that the recorded dataset video smoothly splices together the rollout and teleoperated trajectories. The record.py script also implements 'early exit', 'rerecord episode', and 'stop recording' as in control_robot.py in lerobot.
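The flow just described amounts to a small state machine driven by the down-arrow key. The sketch below simulates that logic on a stream of events; the state names, event strings, and `record_episode` function are illustrative assumptions, not the actual record.py code.

```python
# Simplified state machine behind the recording flow described above:
# ROLLOUT -> (down arrow) -> FROZEN -> (down arrow, once the leader arm
# is positioned) -> TELEOP. Illustrative only; not the actual record.py code.

ROLLOUT, FROZEN, TELEOP = "rollout", "frozen", "teleop"

def record_episode(events):
    """Process a stream of 'step' / 'down_arrow' events and return the
    recorded frames, each tagged with the control source at that moment."""
    state = ROLLOUT
    frames = []
    for event in events:
        if event == "down_arrow":
            if state == ROLLOUT:
                state = FROZEN   # freeze follower arm; move leader to match
            elif state == FROZEN:
                state = TELEOP   # leader arm in place; human takes over
        elif event == "step":
            if state == ROLLOUT:
                frames.append("policy_action")   # policy controls the robot
            elif state == FROZEN:
                frames.append("hold_position")   # arm held still
            elif state == TELEOP:
                frames.append("teleop_action")   # human controls the robot
    return frames

# Policy runs 3 steps, human presses down arrow, one step while the leader
# arm is positioned, second press, then the human finishes in 2 steps.
frames = record_episode(
    ["step", "step", "step", "down_arrow", "step", "down_arrow", "step", "step"]
)
print(frames)
# -> ['policy_action', 'policy_action', 'policy_action', 'hold_position',
#     'teleop_action', 'teleop_action']
```

Because every frame is appended to the same list regardless of which state produced it, the saved episode is a single continuous trajectory, which is why the rollout and teleoperated segments splice together seamlessly in the dataset video.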

Fig 6. Left wrist camera dataset video for above intervention.