For Pre-Trained Vision Models in Motor Control, Not All Policy Learning Methods are Created Equal

ICML 2023

1Tsinghua University, 2AWS AI, Amazon, 3Shanghai Qi Zhi Institute


(Top) Most prior works that leverage pre-trained vision models as frozen perception modules for motor control only compare a few models using a single fixed policy learning algorithm. (Bottom) We find that using different policy learning algorithms results in significant changes in the rankings of 14 different vision models, i.e., the effectiveness of a vision model is algorithm-dependent.

Abstract

In recent years, increasing attention has been directed to leveraging pre-trained vision models for motor control. While existing works mainly emphasize the importance of this pre-training phase, the arguably equally important role played by downstream policy learning during control-specific fine-tuning is often neglected. It thus remains unclear if pre-trained vision models are consistent in their effectiveness under different control policies. To bridge this gap in understanding, we conduct a comprehensive study on 14 pre-trained vision models using 3 distinct classes of policy learning methods, including reinforcement learning (RL), imitation learning through behavior cloning (BC), and imitation learning with a visual reward function (VRF). Our study yields a series of intriguing results, including the discovery that the effectiveness of pre-training is highly dependent on the choice of the downstream policy learning algorithm. We show that conventionally accepted evaluation based on RL methods is highly variable and therefore unreliable, and further advocate for using more robust methods like VRF and BC. To facilitate more universal evaluations of pre-trained models and their policy learning methods in the future, we also release a benchmark of 21 tasks across 3 different environments alongside our work.

Main Findings

  • Lack of consistently performant models. The effectiveness of a pre-trained vision model is highly dependent on the downstream policy learning method. The vision of a universal pre-trained model with the best performance on all control tasks is yet to be realized.
  • Directions for reliable evaluation. Due to its high variability, RL is not a robust evaluation method. In contrast, the consistent results of VRF and BC make them reliable choices for benchmarking pre-trained vision models for motor control.
  • Predictive metrics from model properties. A deeper dive into the properties of vision models yields metrics, such as linear probing loss and k-NN classification accuracy, that have substantive predictive power for downstream control performance.



3 Policy Learning Methods

We consider 3 policy learning algorithms: (i) reinforcement learning (RL), (ii) imitation learning through behavior cloning (BC), and (iii) imitation learning with a visual reward function (VRF). The first two approaches (RL and BC) are widely used in the existing literature and treat pre-trained features as representations that encode environment-related information. The last approach (VRF) is an inverse reinforcement learning (IRL) paradigm we adopt, which additionally requires that the pre-trained features capture a high-level notion of task progress, an idea that remains largely underexplored.
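All three methods share the same interface: the pre-trained vision model is kept frozen as a perception module, and only a small policy component on top of its features is trained. As a rough illustration of this setup (the ResNet-50 encoder, head sizes, and class name below are placeholders for exposition, not our exact configuration):

# Minimal sketch of the shared setup: a frozen pre-trained vision model acts as the
# perception module, and only a small policy head is trained downstream.
from typing import Optional

import torch
import torch.nn as nn
from torchvision import models


class FrozenEncoderPolicy(nn.Module):
    """Illustrative wrapper: frozen visual encoder + trainable policy head."""

    def __init__(self, action_dim: int, proprio_dim: int = 0):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V2")  # placeholder encoder choice
        backbone.fc = nn.Identity()                # keep the 2048-d pooled features
        self.encoder = backbone.eval()
        for p in self.encoder.parameters():        # frozen: no control-specific fine-tuning
            p.requires_grad = False
        self.head = nn.Sequential(                 # trainable head (BC regressor or RL actor)
            nn.Linear(2048 + proprio_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image: torch.Tensor, proprio: Optional[torch.Tensor] = None) -> torch.Tensor:
        with torch.no_grad():
            feat = self.encoder(image)             # (B, 2048) frozen visual features
        if proprio is not None:
            feat = torch.cat([feat, proprio], dim=-1)
        return self.head(feat)                     # predicted action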



14 Pre-Trained Vision Models

We investigate the efficacy of 14 "off-the-shelf" pre-trained vision models covering different architectures (ResNet and ViT) and prevalent pre-training methods (contrastive learning, self-distillation, language-supervised pre-training, and masked image modeling).

Model highlights:
  • MoCo v2: contrastive learning with a momentum encoder
  • SwAV: contrasts online cluster assignments
  • SimSiam: contrastive-style learning without negative pairs
  • DenseCL: dense contrastive learning that learns local features
  • PixPro: pixel-level pretext task that learns local features
  • VICRegL: learns both global and local features
  • VFS: encodes temporal dynamics
  • R3M: learns visual representations for robotics
  • VIP: learns representations and rewards for robotics
  • MoCo v3: contrastive learning for ViT
  • DINO: self-distillation with no labels
  • MAE: masked image modeling (MIM)
  • iBOT: combines self-distillation with MIM
  • CLIP: language-supervised pre-training



21 Tasks Across 3 Robotic Manipulation Environments

We run extensive experiments on 21 simulated tasks across 3 robot manipulation environments: Meta-World (8 tasks), Robosuite (8 tasks), and Franka-Kitchen (5 tasks).




Reinforcement Learning

We observe significant inconsistency in RL results across different training seeds. This high variability can be attributed to the inherent randomness stemming from several sources, such as exploratory choices made during training, stochasticity in the task, and randomly initialized parameters. RL is therefore not a suitable downstream policy learning method for evaluating different pre-trained vision models.
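One simple way to quantify this instability is to check whether model rankings agree across training seeds, e.g. via pairwise Spearman rank correlation. The sketch below uses a placeholder success-rate matrix; rows are RL seeds, columns are vision models:

import numpy as np
from scipy.stats import spearmanr

# Placeholder success-rate matrix: rows are RL training seeds, columns are vision models.
success = np.random.rand(5, 14)

rhos = []
for i in range(len(success)):
    for j in range(i + 1, len(success)):
        rho, _ = spearmanr(success[i], success[j])   # agreement between two seeds' rankings
        rhos.append(rho)
print(f"mean pairwise rank correlation across seeds: {np.mean(rhos):.2f}")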



Imitation Learning through Behavior Cloning

Even without control-specific adaptation, BC still benefits from the latest benchmark-leading models in the vision community (e.g., VICRegL and iBOT), thanks to their ability to capture more environment-relevant information, such as object locations and joint positions.
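For reference, BC in this frozen-feature setting reduces to supervised regression of expert actions from cached visual features. The sketch below is illustrative only; the dataset shapes, MSE objective, and hyperparameters are stand-ins, not our exact training recipe:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

feat_dim, action_dim = 2048, 7                       # e.g. pooled ResNet-50 features, 7-DoF arm
features = torch.randn(1000, feat_dim)               # placeholder: frozen features of demo frames
actions = torch.randn(1000, action_dim)              # placeholder: corresponding expert actions
loader = DataLoader(TensorDataset(features, actions), batch_size=64, shuffle=True)

head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

for epoch in range(10):
    for feat, act in loader:
        loss = F.mse_loss(head(feat), act)           # regress the expert action
        opt.zero_grad()
        loss.backward()
        opt.step()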



Imitation Learning with Visual Reward Functions

Different vision models yield the most consistent performance when using VRF, which requires the vision model to learn global features and capture a notion of task progress. MAE is a notable underperforming outlier, likely because its representations suffer from the anisotropy problem.
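One common way to instantiate such a visual reward (shown below as an illustration, not necessarily our exact formulation) is to score each frame by the cosine similarity between its frozen embedding and a goal-image embedding, so the reward grows as the task nears completion:

import torch
import torch.nn.functional as F


def visual_reward(encoder: torch.nn.Module, frame: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
    """frame, goal: (B, 3, H, W) image batches; returns (B,) rewards in [-1, 1]."""
    with torch.no_grad():
        z_t = F.normalize(encoder(frame), dim=-1)    # embedding of the current observation
        z_g = F.normalize(encoder(goal), dim=-1)     # embedding of the goal / final demo frame
    return (z_t * z_g).sum(dim=-1)                   # cosine similarity used as the reward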



Understanding Properties of Vision Models

We observe a strong inverse correlation between linear probing loss and BC success rate, suggesting that the linear probing protocol can be a valuable and intuitive alternative for evaluating vision models for motor control. We further find a metric that is highly predictive of VRF performance: ImageNet k-NN classification accuracy.
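Both metrics can be computed offline from frozen features. The sketch below uses placeholder arrays; the linear-probing target (e.g. object and joint positions) and the choice of probe are assumptions for illustration:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsClassifier

# Placeholder frozen features for a probe dataset (train / validation splits).
feats_train, feats_val = np.random.randn(5000, 2048), np.random.randn(1000, 2048)

# Linear probing: regress low-dimensional environment state (e.g. object and joint
# positions) from frozen features; lower validation loss tracks higher BC success.
state_train, state_val = np.random.randn(5000, 10), np.random.randn(1000, 10)
probe = Ridge(alpha=1.0).fit(feats_train, state_train)
probe_loss = np.mean((probe.predict(feats_val) - state_val) ** 2)

# k-NN classification: placeholder ImageNet-style class labels; higher k-NN accuracy
# tracks stronger VRF performance.
y_train, y_val = np.random.randint(0, 1000, 5000), np.random.randint(0, 1000, 1000)
knn = KNeighborsClassifier(n_neighbors=20).fit(feats_train, y_train)
knn_acc = knn.score(feats_val, y_val)
print(f"linear probe MSE: {probe_loss:.3f}  k-NN accuracy: {knn_acc:.3f}")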


BibTeX

@article{hu2023pre,
  title={For Pre-Trained Vision Models in Motor Control, Not All Policy Learning Methods are Created Equal},
  author={Hu, Yingdong and Wang, Renhao and Li, Li Erran and Gao, Yang},
  journal={arXiv preprint arXiv:2304.04591},
  year={2023}
}