We consider 3 policy learning algorithms: (i) reinforcement learning (RL), (ii) imitation learning through behavior cloning (BC), and (iii) imitation learning with a visual reward function (VRF). The first two approaches (RL and BC) are widely used in the existing literature and treat pre-trained features as representations that encode environment-related information. The third approach (VRF) is an inverse reinforcement learning (IRL) paradigm we adopt, which additionally requires the pre-trained features to capture a high-level notion of task progress, an idea that remains largely underexplored.
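To make the BC setting concrete, below is a minimal sketch of training a small policy head on top of a frozen pre-trained encoder. The backbone (a plain torchvision ResNet-50), the 2048-d feature size, the 4-d action space, and the MSE loss are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal BC sketch on top of a frozen pre-trained visual encoder.
# Backbone, feature dim (2048), and action dim (4) are illustrative placeholders.
import torch
import torch.nn as nn
import torchvision.models as models

encoder = models.resnet50(weights=None)   # stand-in for any off-the-shelf backbone
encoder.fc = nn.Identity()                # expose the 2048-d pooled feature
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False               # features stay frozen throughout

action_dim = 4
policy = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bc_step(images, expert_actions):
    """One behavior-cloning update: regress expert actions from frozen features."""
    with torch.no_grad():
        feats = encoder(images)           # (B, 2048)
    loss = nn.functional.mse_loss(policy(feats), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```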
We investigate the efficacy of 14 "off-the-shelf" pre-trained vision models covering different architectures (ResNet and ViT) and prevalent pre-training methods (contrastive learning, self-distillation, language-supervised pre-training, and masked image modeling); a minimal feature-extraction sketch follows the table below.
| Model | Highlights |
| --- | --- |
| MoCo v2 | Contrastive learning, momentum encoder |
| SwAV | Contrast online cluster assignments |
| SimSiam | Without negative pairs |
| DenseCL | Dense contrastive learning, learn local features |
| PixPro | Pixel-level pretext task, learn local features |
| VICRegL | Learn global and local features |
| VFS | Encode temporal dynamics |
| R3M | Learn visual representations for robotics |
| VIP | Learn representations and reward for robotics |
| MoCo v3 | Contrastive learning for ViT |
| DINO | Self-distillation with no labels |
| MAE | Masked image modeling (MIM) |
| iBOT | Combine self-distillation with MIM |
| CLIP | Language-supervised pre-training |
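A uniform frozen feature extractor for these ResNet- and ViT-style backbones can be built roughly as sketched below. The timm model names are generic placeholders; each method in the table (MoCo v2, DINO, MAE, CLIP, ...) ships its own checkpoints and loading code.

```python
# Uniform frozen feature extraction for ResNet- and ViT-style backbones.
# Model names are generic placeholders; method-specific checkpoints must be
# loaded with each project's own code.
import torch
import timm

def load_frozen_backbone(name: str) -> torch.nn.Module:
    # num_classes=0 makes timm return pooled features instead of logits.
    model = timm.create_model(name, pretrained=False, num_classes=0)
    model.eval()
    for p in model.parameters():
        p.requires_grad = False
    return model

resnet = load_frozen_backbone("resnet50")            # ResNet-style backbone
vit = load_frozen_backbone("vit_base_patch16_224")   # ViT-style backbone

images = torch.randn(2, 3, 224, 224)
with torch.no_grad():
    print(resnet(images).shape)   # torch.Size([2, 2048])
    print(vit(images).shape)      # torch.Size([2, 768])
```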
We run extensive experiments on 21 simulated tasks across 3 robot manipulation environments: Meta-World (8 tasks), Robosuite (8 tasks), and Franka-Kitchen (5 tasks).
We observe significant inconsistency in RL results across different training seeds. This high variability can be attributed to inherent randomness from several sources, such as exploration choices made during training, stochasticity in the task, and randomly initialized parameters. RL is therefore not well suited as a downstream policy learning method for evaluating different pre-trained vision models.
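As a small illustration of this evaluation issue, one can aggregate the success rates of a single (model, task) pair over several seeds and inspect the spread; `run_rl_training` below is a hypothetical stand-in for an actual RL run, not a function from the paper's codebase.

```python
# Illustration of per-seed aggregation for one (model, task) pair.
# run_rl_training is a hypothetical stand-in returning a final success rate.
import numpy as np

def evaluate_over_seeds(run_rl_training, seeds=(0, 1, 2, 3, 4)):
    success = np.array([run_rl_training(seed=s) for s in seeds])
    return success.mean(), success.std()

# mean, std = evaluate_over_seeds(run_rl_training)
# A std that is large relative to the mean reflects the seed-level variability
# described above, making cross-model comparisons with RL unreliable.
```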
Without control-specific adaptation, BC still benefits from the latest benchmark-leading models in the vision community (e.g., VICRegL and iBOT), owing to their inherent ability to capture more environment-relevant information such as object locations and joint positions.
Different vision models yield the most consistent performance when used with VRF, which requires the vision model to learn global features and capture a notion of task progress. MAE is a notable underperforming outlier, likely because its features suffer from the anisotropy problem.
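One simple way to realize such a reward, sketched below under our own assumptions, is to score task progress by the similarity between the frozen embeddings of the current observation and a goal image; the cosine-similarity choice is an illustrative assumption rather than necessarily the paper's exact VRF formulation.

```python
# Sketch of a goal-image-based visual reward in the spirit of VRF: reward rises
# as the current observation's frozen embedding approaches the goal embedding.
import torch
import torch.nn.functional as F

def visual_reward(encoder, obs_image, goal_image):
    """Similarity between frozen features of the observation and the goal."""
    with torch.no_grad():
        z_obs = encoder(obs_image)     # (B, D)
        z_goal = encoder(goal_image)   # (B, D)
    # In [-1, 1]; approaches 1 as embeddings align, i.e., as the task progresses.
    return F.cosine_similarity(z_obs, z_goal, dim=-1)
```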
We observe a strong inverse correlation between linear probing loss and BC success rate, suggesting that the linear probing protocol can be a valuable and intuitive alternative to full policy training for evaluating vision models for motor control. We further find that ImageNet k-NN classification accuracy is highly predictive of VRF performance.
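A rough sketch of the linear probing protocol and the cross-model correlation check follows. The probe target (e.g., object position or joint state) and the per-model numbers are hypothetical placeholders, not reported results.

```python
# Sketch of the linear probing protocol plus the cross-model correlation check.
import torch
import torch.nn as nn
from scipy.stats import pearsonr

def linear_probe_loss(features, targets, epochs=200, lr=1e-2):
    """Fit a linear layer on frozen features; return the final MSE loss."""
    probe = nn.Linear(features.shape[1], targets.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(probe(features), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

# Given per-model probe losses and BC success rates collected elsewhere,
# a strongly negative Pearson r would mirror the inverse correlation above:
# r, _ = pearsonr(probe_losses, bc_success)
```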
@article{hu2023pre,
title={For Pre-Trained Vision Models in Motor Control, Not All Policy Learning Methods are Created Equal},
author={Hu, Yingdong and Wang, Renhao and Li, Li Erran and Gao, Yang},
journal={arXiv preprint arXiv:2304.04591},
year={2023}
}