Foundation Reinforcement Learning: towards Embodied Generalist Agents with Foundation Prior Assistance.

1Tsinghua University, 2Shanghai Artificial Intelligence Laboratory, 3Shanghai Qi Zhi Institute, 4UC Berkeley

Abstract

Recently, it has been shown that large-scale pre-training on diverse internet-scale data is the key to building generalist models, as witnessed in natural language processing (NLP). To build an embodied generalist agent, we, like many other researchers, hypothesize that such foundation priors are an indispensable component. However, it is unclear in what concrete form these embodied foundation priors should be represented and how they should be used in downstream tasks.

In this paper, we propose an intuitive and effective set of embodied priors consisting of a foundation policy, a foundation value, and a foundation success reward. The proposed priors are based on the goal-conditioned Markov decision process formulation of the task. To verify their effectiveness, we instantiate an actor-critic method assisted by the priors, called Foundation Actor-Critic (FAC). We name our framework Foundation Reinforcement Learning (FRL), since it relies entirely on embodied foundation priors to explore, learn, and reinforce.
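As a rough, non-authoritative illustration of how such priors could plug into an actor-critic learner, below is a minimal PyTorch-style sketch. All networks are random stand-ins, the potential-based shaping from the value prior and the regularization toward the policy prior are plausible choices rather than the exact FAC objective, and every name in the snippet is hypothetical.

```python
# Hedged sketch (not the exact FAC losses): one way the three foundation
# priors could enter a DrQ-v2-style actor-critic update. All modules
# (actor, critic, policy_prior, value_prior, success_detector) are stubs.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim = 39, 4
actor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                      nn.Linear(256, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
                       nn.Linear(256, 1))

# Frozen foundation priors (in FRL these would come from large pre-trained
# models; here they are random stand-ins so the snippet runs).
policy_prior = nn.Sequential(nn.Linear(obs_dim, act_dim), nn.Tanh()).requires_grad_(False)
value_prior = nn.Sequential(nn.Linear(obs_dim, 1)).requires_grad_(False)
success_detector = lambda obs: (obs.sum(-1, keepdim=True) > 0).float()  # binary success reward stub

opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-4)
gamma, prior_weight = 0.99, 0.1

def update(obs, action, next_obs, done):
    # Reward assembled purely from foundation priors: binary success reward
    # plus potential-based shaping from the value prior (one plausible
    # choice, not necessarily the paper's exact formulation).
    with torch.no_grad():
        r = success_detector(next_obs) + (value_prior(next_obs) - value_prior(obs))
        target_q = r + gamma * (1 - done) * critic(torch.cat([next_obs, actor(next_obs)], -1))

    critic_loss = F.mse_loss(critic(torch.cat([obs, action], -1)), target_q)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor: maximize Q while staying close to the (possibly noisy) policy prior.
    new_action = actor(obs)
    actor_loss = (-critic(torch.cat([obs, new_action], -1)).mean()
                  + prior_weight * F.mse_loss(new_action, policy_prior(obs)))
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

# Example call with a random batch:
# update(torch.randn(8, obs_dim), torch.randn(8, act_dim).clamp(-1, 1),
#        torch.randn(8, obs_dim), torch.zeros(8, 1))
```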

The benefits of our framework are threefold. (1) Sample-efficient learning. With the foundation priors, FAC learns significantly faster than traditional RL. Our evaluation on Meta-World shows that FAC achieves 100% success rates on 7/8 tasks within 200k frames, outperforming the baseline method trained with carefully hand-designed rewards for 1M frames. (2) Robustness to noisy priors. Our method tolerates the unavoidable noise in embodied foundation models; we show that FAC works well even under heavy noise or quantization errors. (3) Minimal human intervention. FAC learns entirely from the foundation priors, without human-specified dense rewards or teleoperated demonstrations, and can therefore be easily scaled up. We believe our FRL framework could enable future robots to autonomously explore and learn in the physical world without human intervention. In summary, the proposed FRL framework is a novel and powerful learning paradigm towards achieving an embodied generalist agent.

Video

FAC on Meta-World

The proposed Foundation Actor-Critic (FAC) is built upon DrQ-v2. We evaluate FAC on 8 tasks from the simulated robotics benchmark Meta-World.

(1) Minimal human intervention: FAC learns efficiently through interaction with the environment, without manually designed rewards, under the guidance of the value/policy/success-reward prior knowledge.

(2) Sample-efficient learning: FAC achieves 100% success rates on all tasks, and 7 of the 8 require fewer than 200k frames (the exception is bin-picking-v2). Moreover, it outperforms the baseline method with manually designed rewards in both success rate and sample efficiency.

Result curves
Video demos
  • bin-picking-v2
  • button-press-topdown-v2
  • door-open-v2
  • door-unlock-v2
  • drawer-close-v2
  • drawer-open-v2
  • hammer-v2
  • window-close-v2

FAC is robust to noisy priors

FAC works well even under a much noisier policy prior (purple curve):

(1) We discretize the policy prior into {-1, 0, 1} per action dimension, so that it contains only rough directional information.

(2) On top of the discretized policy prior, we replace the prior action with uniform noise with 20% or 50% probability (see the sketch below).

Even with the discretized policy prior and 50% noise, FAC still reaches 100% success rates in many environments. These results indicate that FAC is robust to the quality of the foundation priors. Moreover, the better the foundation prior, the more sample efficient FAC is.
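For concreteness, here is a small sketch of the corruption applied to the policy prior in this ablation (discretization to {-1, 0, 1} followed by uniform-noise replacement). The dead-zone threshold used for discretization is an illustrative assumption, not a value reported in the paper.

```python
import torch

def corrupt_policy_prior(prior_action: torch.Tensor,
                         noise_prob: float = 0.5,
                         threshold: float = 0.1) -> torch.Tensor:
    """Discretize a continuous prior action into {-1, 0, 1} per dimension,
    keeping only rough directional information, then replace the whole
    action with uniform noise in [-1, 1] with probability `noise_prob`.
    The 0.1 dead-zone threshold is an illustrative assumption."""
    coarse = torch.sign(prior_action) * (prior_action.abs() > threshold).float()
    if torch.rand(()) < noise_prob:
        return torch.empty_like(prior_action).uniform_(-1.0, 1.0)
    return coarse

# Example: a 4-dim Meta-World action from the policy prior.
a = torch.tensor([0.7, -0.05, 0.3, -0.9])
print(corrupt_policy_prior(a, noise_prob=0.5))
```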

Video Generation of Conditioned Diffusion Models

We fine-tune the conditioned video diffusion model with 10 videos per task. Each generated video has 16 frames, shown below. With such limited fine-tuning data, the generated frames are blurry, so the distilled policy prior is imperfect. Nevertheless, FAC still learns efficiently under such a poor policy prior.

  • bin-picking-v2: pick the green object out of the red box and place it on the table.
  • button-press-topdown-v2: press down the red button with the red robotic arm.
  • door-open-v2: open the door by turning the handle.
  • door-unlock-v2: unlock the door with the red robotic arm.
  • drawer-close-v2: close the green drawer with the red robotic arm.
  • drawer-open-v2: open the green drawer with the red robotic arm.
  • hammer-v2: drive the nail into the wall with the hammer.
  • window-close-v2: close the window with the red robotic arm.

Distilled Policy on Meta-World

We distill a policy prior model from the diffusion model together with an inverse dynamics model trained on the DrQ-v2 replay buffer. The steps are as follows (a code sketch follows the list):

(1) We generate 1000 videos from the diffusion model for bin-picking-v2 and 100 videos for each of the other tasks;

(2) use the inverse dynamics model to label actions for the generated videos;

(3) train a policy prior model on the labelled dataset by supervised learning.
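A rough sketch of this distillation pipeline follows (generate videos, label actions with the inverse dynamics model, then behavior-clone the labelled pairs). The model interfaces below are hypothetical placeholders standing in for the fine-tuned video diffusion model and the DrQ-v2-trained inverse dynamics model, not the actual Seer or DrQ-v2 code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder for the fine-tuned video diffusion model: returns a
# (16, C, H, W) generated video conditioned on a task prompt and first frame.
def sample_video(task_prompt: str, first_frame: torch.Tensor) -> torch.Tensor:
    return first_frame.unsqueeze(0).repeat(16, 1, 1, 1) + 0.01 * torch.randn(16, *first_frame.shape)

# Placeholder inverse dynamics model: maps two stacked frames to an action.
inverse_dynamics = nn.Sequential(nn.Flatten(), nn.Linear(2 * 3 * 64 * 64, 4), nn.Tanh())

# (1) Generate videos and (2) label actions between consecutive frames.
dataset = []
for _ in range(100):                      # 1000 videos for bin-picking-v2 in the paper
    video = sample_video("door-open-v2: open the door by turning the handle",
                         torch.randn(3, 64, 64))
    for t in range(video.shape[0] - 1):
        pair = torch.cat([video[t], video[t + 1]], dim=0)            # (6, H, W)
        with torch.no_grad():
            action = inverse_dynamics(pair.unsqueeze(0)).squeeze(0)  # labelled action
        dataset.append((video[t], action))

# (3) Distill a policy prior by supervised learning on the labelled pairs.
policy_prior = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU(),
                             nn.Linear(256, 4), nn.Tanh())
opt = torch.optim.Adam(policy_prior.parameters(), lr=3e-4)
frames = torch.stack([f for f, _ in dataset])
actions = torch.stack([a for _, a in dataset])
for epoch in range(10):
    loss = F.mse_loss(policy_prior(frames), actions)
    opt.zero_grad(); loss.backward(); opt.step()
```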

  • bin-picking-v2
  • button-press-topdown-v2
  • door-open-v2
  • door-unlock-v2
  • drawer-close-v2
  • drawer-open-v2
  • hammer-v2
  • window-close-v2

Related Source

The baseline actor-critic method is DrQ-v2; the value prior model is obtained from VIP; the policy prior model follows UniPi; the conditioned video diffusion model is obtained from Seer.

BibTeX

@misc{ye2023foundation,
      title={Foundation Reinforcement Learning: towards Embodied Generalist Agents with Foundation Prior Assistance},
      author={Weirui Ye and Yunsheng Zhang and Mengchen Wang and Shengjie Wang and Xianfan Gu and Pieter Abbeel and Yang Gao},
      year={2023},
      eprint={2310.02635},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
}