We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual ‘foundation models’ for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous manipulation, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none are universally dominant. To study the effect of pre-training data scale and diversity, we combine over 4,000 hours of egocentric videos from 7 different sources (over 5.6M images) and ImageNet to train different-sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior work, we find that scaling dataset size and diversity does not improve performance universally (but does so on average). Our largest model, named VC-1, outperforms all prior PVRs on average but does not universally dominate either. Finally, we show that task- or domain-specific adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving performance that is competitive with or superior to the best known results on all of the benchmarks in CortexBench. These models required over 10,000 GPU-hours to train and are available on our website for the benefit of the research community.
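As context for how these evaluations work, the sketch below illustrates the typical pattern of using a PVR for an embodied AI task: the pre-trained encoder is kept frozen and only a small task-specific policy head is trained on its features. The timm backbone name, head sizes, and action dimension are illustrative assumptions for this sketch, not CortexBench's exact code.

```python
import torch
import torch.nn as nn
import timm


class FrozenPVRPolicy(nn.Module):
    """Hypothetical wrapper: frozen visual encoder + small trainable policy head."""

    def __init__(self, action_dim: int, embed_dim: int = 768):
        super().__init__()
        # Illustrative backbone; in practice a PVR checkpoint (e.g. VC-1)
        # would be loaded into this encoder before use.
        self.encoder = timm.create_model(
            "vit_base_patch16_224", pretrained=False, num_classes=0
        )
        for p in self.encoder.parameters():
            p.requires_grad = False  # keep the visual representation frozen
        self.policy_head = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(obs)        # (B, embed_dim) pooled ViT features
        return self.policy_head(feats)   # only the head receives gradient updates


policy = FrozenPVRPolicy(action_dim=8)   # action_dim is task-specific
actions = policy(torch.randn(2, 3, 224, 224))  # dummy batch of RGB observations
```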
Having found that no single artificial visual cortex exists yet, we conduct our own investigation by building a pre-trained visual encoder. We run our experiments on CortexBench, aiming to answer three main questions about the scale and diversity of the pre-training data and the size of the model. To do so, we fix the pre-training objective (MAE) and vary the composition of the pre-training dataset and the size of the visual backbone (ViT-B with 86M parameters and ViT-L with 307M parameters); a minimal sketch of this masked-reconstruction setup appears below.
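In the sketch, patches are randomly masked, only the visible patches are encoded, and a light decoder reconstructs the masked ones from mask tokens. The patch size, embedding dimension, mask ratio, and network depths are illustrative assumptions rather than VC-1's actual configuration, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn


def patchify(imgs: torch.Tensor, p: int = 16) -> torch.Tensor:
    """(B, 3, H, W) -> (B, N, p*p*3) non-overlapping patches."""
    B, C, H, W = imgs.shape
    x = imgs.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)


class TinyMAE(nn.Module):
    """Toy MAE: encode visible patches, reconstruct masked ones (no pos. embeddings)."""

    def __init__(self, patch_dim: int = 16 * 16 * 3, dim: int = 256, mask_ratio: float = 0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2
        )
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=1
        )
        self.head = nn.Linear(dim, patch_dim)  # predicts raw pixel values per patch

    def forward(self, imgs: torch.Tensor) -> torch.Tensor:
        patches = patchify(imgs)                                   # (B, N, D)
        B, N, D = patches.shape
        keep = int(N * (1 - self.mask_ratio))
        shuffle = torch.rand(B, N, device=imgs.device).argsort(1)  # random order per image
        restore = shuffle.argsort(1)
        visible = torch.gather(patches, 1, shuffle[:, :keep, None].expand(-1, -1, D))
        latent = self.encoder(self.embed(visible))                 # encode visible patches only
        # Re-insert mask tokens at the masked positions, then decode the full sequence.
        full = torch.cat([latent, self.mask_token.expand(B, N - keep, -1)], dim=1)
        full = torch.gather(full, 1, restore[:, :, None].expand(-1, -1, latent.size(-1)))
        pred = self.head(self.decoder(full))                       # (B, N, D)
        mask = torch.zeros(B, N, device=imgs.device)
        mask.scatter_(1, shuffle[:, keep:], 1.0)                   # 1 = masked patch
        return (((pred - patches) ** 2).mean(-1) * mask).sum() / mask.sum()


loss = TinyMAE()(torch.randn(2, 3, 224, 224))  # dummy batch; backprop this loss to pre-train
```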
Finally, we study whether adapting VC-1 can lead to improved results on CortexBench; we believe adaptation can be useful for at least two reasons.
We try several adaptation strategies; one representative strategy, end-to-end fine-tuning, is sketched below.
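The following is a minimal sketch of that fine-tuning recipe, continuing from the hypothetical FrozenPVRPolicy above: the encoder is unfrozen and optimized jointly with the policy head, typically with a much smaller learning rate for the pre-trained backbone. The optimizer choice and learning rates are illustrative assumptions, not the exact configuration used for VC-1.

```python
import torch


def make_finetuning_optimizer(policy, backbone_lr: float = 1e-5, head_lr: float = 1e-3):
    """Unfreeze the PVR and train it end-to-end alongside the task head."""
    for p in policy.encoder.parameters():
        p.requires_grad = True  # end-to-end adaptation: the PVR is no longer frozen
    return torch.optim.AdamW(
        [
            {"params": policy.encoder.parameters(), "lr": backbone_lr},
            {"params": policy.policy_head.parameters(), "lr": head_lr},
        ]
    )


policy = FrozenPVRPolicy(action_dim=8)         # defined in the earlier sketch
optimizer = make_finetuning_optimizer(policy)  # then train with the task's usual loss
```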
This work introduced CortexBench, which comprises 17 different embodied AI (EAI) tasks spanning locomotion, indoor navigation, and dexterous and mobile manipulation. Enabled by CortexBench, we performed the most comprehensive study to date of visual foundation models for EAI. Specifically, we evaluated state-of-the-art open-source foundation models and found that we do not yet have a strong backbone for all tasks; however, models trained via masked auto-encoders (MAEs) are the most promising. Our study also finds that naively scaling model size and pre-training data diversity does not improve performance universally across all tasks, but does so on average. Finally, we find that adapting our largest pre-trained model (VC-1) results in performance that is competitive with or outperforms the best known results on all benchmarks in CortexBench.
One of our primary contentions is that for the research community to make progress on foundation models for EAI, we need to develop strong benchmarks: for a PVR to be foundational, it must be broadly applicable. Furthermore, as a community we should converge on best practices and strive toward a rigorous, reproducible experimental methodology; we hope CortexBench helps the community make progress toward that goal.
@inproceedings{vc2023,
title = {Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?},
author = {Arjun Majumdar and Karmesh Yadav and Sergio Arnaud and Yecheng Jason Ma and Claire Chen and
Sneha Silwal and Aryan Jain and Vincent-Pierre Berges and Pieter Abbeel and Jitendra Malik and
Dhruv Batra and Yixin Lin and Oleksandr Maksymets and Aravind Rajeswaran and Franziska Meier},
year = {2023},
eprint = {2303.18240},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}