We assemble CortexBench from 7 benchmarks and systematically evaluate existing visual representation models. We then train a single new model,
Visual Cortex-1 (VC-1), compare it to the best prior result on each benchmark (above), and adapt it to specific domains.
We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual ‘foundation models’ for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none are universally dominant. To study the effect of pre-training data scale and diversity, we combine over 4,000 hours of egocentric videos from 7 different sources (over 5.6M images) and ImageNet to train different-sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior work, we find that scaling dataset size and diversity does not improve performance universally (but does so on average). Our largest model, named VC-1, outperforms all prior PVRs on average but does not universally dominate either. Finally, we show that task- or domain-specific adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving performance that is competitive with or superior to the best known results on all of the benchmarks in CortexBench. These models required over 10,000 GPU-hours to train and can be found on our website for the benefit of the research community.
Overview of CortexBench. We assemble relevant datasets and visual representation learning algorithms to produce
candidate Visual Cortex models, which are then evaluated using either reinforcement or imitation learning on a set of highly diverse tasks.
We evaluated several pre-trained visual representations (PVRs) on CortexBench to assess whether any of them performs consistently well across tasks. The models included CLIP, R3M, MVP, and VIP, which represent a range of architectures, pre-training objectives, and datasets. We also examined whether pre-training is necessary at all, and where end-to-end learning falls short. We found that no single model excelled on all tasks: R3M performed best on Adroit, MetaWorld, and DMControl; MVP (ViT-L) on Trifinger, ImageNav, and Mobile Pick; and CLIP on ObjectNav. These results demonstrate the variance in performance of existing PVRs on CortexBench and underscore the lack of a single, strong-performing artificial visual cortex for embodied AI.
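To make the evaluation protocol concrete, here is a minimal sketch of how a frozen PVR is used on a CortexBench task: the encoder stays frozen and only a small policy head is trained on top, with behavior cloning shown as a stand-in for the per-benchmark imitation/reinforcement learning pipelines. The encoder name and helpers are illustrative assumptions (a generic ViT loaded via timm), not the exact code used in our experiments.

import torch
import torch.nn as nn
import timm

class FrozenPVRPolicy(nn.Module):
    """Frozen visual backbone plus a small trainable policy head."""
    def __init__(self, encoder_name="vit_base_patch16_224", action_dim=8):
        super().__init__()
        # Any PVR can be dropped in here; a timm ViT is used purely for illustration.
        self.encoder = timm.create_model(encoder_name, pretrained=True, num_classes=0)
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.policy = nn.Sequential(
            nn.Linear(self.encoder.num_features, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs):
        with torch.no_grad():
            feats = self.encoder(obs)   # (B, feat_dim); the encoder stays frozen
        return self.policy(feats)       # predicted actions

def bc_step(model, optimizer, obs, expert_actions):
    # One behavior-cloning update on a batch of (observation, expert action) pairs.
    loss = nn.functional.mse_loss(model(obs), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()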
Having found that no single artificial visual cortex exists yet, we conduct our own investigation by pre-training a visual encoder ourselves. We fix the pre-training objective (MAE) and vary the composition of the pre-training dataset and the size of the visual backbone (ViT-B with 86M parameters and ViT-L with 307M parameters). We conduct our experiments on CortexBench and aim to answer three main questions (a minimal sketch of the pre-training setup follows this list):
Scaling Dataset Size
Scaling Model Size
Ranking of all Models
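Before turning to the results, the sketch below illustrates the pre-training recipe at the level described above: masked auto-encoding over image frames with a ViT-B or ViT-L backbone. It assumes the interface of the reference MAE implementation (facebookresearch/mae), whose models return the reconstruction loss from the forward pass, and takes a generic frame_dataset argument standing in for the combined egocentric-video and ImageNet frames; it is not our exact training code, and the hyperparameters shown are illustrative.

import torch
from torch.utils.data import DataLoader
import models_mae  # reference MAE codebase (assumed available on the path)

def pretrain_mae(frame_dataset, arch="mae_vit_base_patch16", epochs=1):
    # arch selects the backbone: ViT-B (~86M params) here,
    # or "mae_vit_large_patch16" for ViT-L (~307M params).
    model = getattr(models_mae, arch)().cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4, weight_decay=0.05)
    loader = DataLoader(frame_dataset, batch_size=256, shuffle=True, num_workers=8)

    for _ in range(epochs):
        for frames in loader:  # batches of (B, 3, 224, 224) RGB frames
            loss, _pred, _mask = model(frames.cuda(), mask_ratio=0.75)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model  # the encoder is later frozen and reused as a PVR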
Finally, we study whether adapting VC-1 can lead to improved results on CortexBench. We believe adaptation can be useful for at least two reasons: the pre-training data may not fully cover the visual domains of the downstream tasks, and the task-agnostic MAE objective may not emphasize the features a specific task needs.
We try two adaptation strategies: end-to-end fine-tuning of VC-1 with the downstream task loss, and continued self-supervised (MAE) pre-training on in-domain data; a sketch of both appears after the plot caption below.
Adaptation Plots: Adapting VC-1 with end-to-end fine-tuning or self-supervised learning (MAE) on in-domain data leads to substantial gains in performance.
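The sketch below outlines both adaptation strategies under the same assumptions as the earlier snippets (a generic encoder and policy head for fine-tuning, and the reference MAE interface for self-supervised adaptation); the data loaders and learning rates are illustrative, not our exact settings.

import torch
import torch.nn as nn

def adapt_end_to_end(encoder, policy_head, demo_loader, lr=1e-5, head_lr=1e-3):
    # Strategy 1: end-to-end fine-tuning. Unfreeze the encoder and update it
    # jointly with the policy head using the downstream (behavior-cloning) loss,
    # with a smaller learning rate for the pre-trained backbone.
    for p in encoder.parameters():
        p.requires_grad = True
    optimizer = torch.optim.AdamW([
        {"params": encoder.parameters(), "lr": lr},
        {"params": policy_head.parameters(), "lr": head_lr},
    ])
    for obs, expert_actions in demo_loader:
        loss = nn.functional.mse_loss(policy_head(encoder(obs)), expert_actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return encoder, policy_head

def adapt_mae_in_domain(mae_model, in_domain_loader, lr=1.5e-5):
    # Strategy 2: self-supervised (MAE) adaptation. Continue masked-image
    # pre-training on frames from the target domain, then freeze the encoder
    # and train only a policy head, as in the frozen-PVR protocol above.
    optimizer = torch.optim.AdamW(mae_model.parameters(), lr=lr, weight_decay=0.05)
    for frames in in_domain_loader:
        loss, _pred, _mask = mae_model(frames, mask_ratio=0.75)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return mae_model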
This work introduced CortexBench, which comprises 17 different embodied AI (EAI) tasks spanning locomotion, indoor navigation, and dexterous and mobile manipulation. Enabled by CortexBench, we performed the most comprehensive study to date of visual foundation models for EAI. Specifically, we evaluated state-of-the-art open-source foundation models and found that we do not yet have a strong backbone for all tasks; however, models trained via masked auto-encoders (MAEs) are the most promising. Our study also found that naively scaling model size and pre-training data diversity does not improve performance universally across all tasks, but does so on average. Finally, we found that adapting our largest pre-trained model (VC-1) results in performance that is competitive with or outperforms the best known results on all benchmarks in CortexBench.
One of our primary contentions is that, for the research community to make progress on foundation models for EAI, we need to develop strong benchmarks: for a PVR to be foundational, it must be broadly applicable. Furthermore, as a community we should converge on best practices and strive for a rigorous, reproducible experimental methodology; we hope CortexBench will help the community make progress towards both.
@inproceedings{vc2023,
title = {Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?},
author = {Arjun Majumdar and Karmesh Yadav and Sergio Arnaud and Yecheng Jason Ma and Claire Chen and
Sneha Silwal and Aryan Jain and Vincent-Pierre Berges and Pieter Abbeel and Jitendra Malik and
Dhruv Batra and Yixin Lin and Oleksandr Maksymets and Aravind Rajeswaran and Franziska Meier},
year = {2023},
eprint = {2303.18240},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}