We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual ‘foundation models’ for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous manipulation, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none are universally dominant. To study the effect of pre-training data scale and diversity, we combine over 4,000 hours of egocentric videos from 7 different sources (over 5.6M images) and ImageNet to train different-sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior work, we find that scaling dataset size and diversity does not improve performance universally (but does so on average). Our largest model, named VC-1, outperforms all prior PVRs on average but does not universally dominate either. Finally, we show that task- or domain-specific adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving performance that is competitive with or superior to the best known results on all of the benchmarks in CortexBench. These models required over 10,000 GPU-hours to train and are available on our website for the benefit of the research community.
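As context for how these evaluations work, the sketch below illustrates the typical pattern of using a PVR for an embodied AI task: the pre-trained encoder is kept frozen and only a small task-specific policy head is trained on its features. The timm backbone name, head sizes, and action dimension are illustrative assumptions for this sketch, not CortexBench's exact code.

```python
import torch
import torch.nn as nn
import timm


class FrozenPVRPolicy(nn.Module):
    """Hypothetical wrapper: frozen visual encoder + small trainable policy head."""

    def __init__(self, action_dim: int, embed_dim: int = 768):
        super().__init__()
        # Illustrative backbone; in practice a PVR checkpoint (e.g. VC-1)
        # would be loaded into this encoder before use.
        self.encoder = timm.create_model(
            "vit_base_patch16_224", pretrained=False, num_classes=0
        )
        for p in self.encoder.parameters():
            p.requires_grad = False  # keep the visual representation frozen
        self.policy_head = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(obs)        # (B, embed_dim) pooled ViT features
        return self.policy_head(feats)   # only the head receives gradient updates


policy = FrozenPVRPolicy(action_dim=8)   # action_dim is task-specific
actions = policy(torch.randn(2, 3, 224, 224))  # dummy batch of RGB observations
```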
Having found that no single artificial visual cortex exists yet, we conduct our own investigation by building a pre-trained visual encoder. We run our experiments on CortexBench, aiming to answer three main questions about the scale and diversity of the pre-training data and the size of the model. To do so, we fix the pre-training objective (MAE) and vary the composition of the pre-training dataset and the size of the visual backbone (ViT-B with 86M parameters and ViT-L with 307M parameters); a minimal sketch of this masked-reconstruction setup appears below.
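In the sketch, patches are randomly masked, only the visible patches are encoded, and a light decoder reconstructs the masked ones from mask tokens. The patch size, embedding dimension, mask ratio, and network depths are illustrative assumptions rather than VC-1's actual configuration, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn


def patchify(imgs: torch.Tensor, p: int = 16) -> torch.Tensor:
    """(B, 3, H, W) -> (B, N, p*p*3) non-overlapping patches."""
    B, C, H, W = imgs.shape
    x = imgs.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)


class TinyMAE(nn.Module):
    """Toy MAE: encode visible patches, reconstruct masked ones (no pos. embeddings)."""

    def __init__(self, patch_dim: int = 16 * 16 * 3, dim: int = 256, mask_ratio: float = 0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2
        )
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=1
        )
        self.head = nn.Linear(dim, patch_dim)  # predicts raw pixel values per patch

    def forward(self, imgs: torch.Tensor) -> torch.Tensor:
        patches = patchify(imgs)                                   # (B, N, D)
        B, N, D = patches.shape
        keep = int(N * (1 - self.mask_ratio))
        shuffle = torch.rand(B, N, device=imgs.device).argsort(1)  # random order per image
        restore = shuffle.argsort(1)
        visible = torch.gather(patches, 1, shuffle[:, :keep, None].expand(-1, -1, D))
        latent = self.encoder(self.embed(visible))                 # encode visible patches only
        # Re-insert mask tokens at the masked positions, then decode the full sequence.
        full = torch.cat([latent, self.mask_token.expand(B, N - keep, -1)], dim=1)
        full = torch.gather(full, 1, restore[:, :, None].expand(-1, -1, latent.size(-1)))
        pred = self.head(self.decoder(full))                       # (B, N, D)
        mask = torch.zeros(B, N, device=imgs.device)
        mask.scatter_(1, shuffle[:, keep:], 1.0)                   # 1 = masked patch
        return (((pred - patches) ** 2).mean(-1) * mask).sum() / mask.sum()


loss = TinyMAE()(torch.randn(2, 3, 224, 224))  # dummy batch; backprop this loss to pre-train
```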
Finally, we study whether adapting VC-1 can lead to improved results on CortexBench; we believe adaptation can be useful for at least two reasons.
We try several adaptation strategies; one representative strategy, end-to-end fine-tuning, is sketched below.
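The following is a minimal sketch of that fine-tuning recipe, continuing from the hypothetical FrozenPVRPolicy above: the encoder is unfrozen and optimized jointly with the policy head, typically with a much smaller learning rate for the pre-trained backbone. The optimizer choice and learning rates are illustrative assumptions, not the exact configuration used for VC-1.

```python
import torch


def make_finetuning_optimizer(policy, backbone_lr: float = 1e-5, head_lr: float = 1e-3):
    """Unfreeze the PVR and train it end-to-end alongside the task head."""
    for p in policy.encoder.parameters():
        p.requires_grad = True  # end-to-end adaptation: the PVR is no longer frozen
    return torch.optim.AdamW(
        [
            {"params": policy.encoder.parameters(), "lr": backbone_lr},
            {"params": policy.policy_head.parameters(), "lr": head_lr},
        ]
    )


policy = FrozenPVRPolicy(action_dim=8)         # defined in the earlier sketch
optimizer = make_finetuning_optimizer(policy)  # then train with the task's usual loss
```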
This work introduced CortexBench, which comprises 17 different embodied AI (EAI) tasks spanning locomotion, indoor navigation, and dexterous and mobile manipulation. Enabled by CortexBench, we performed the most comprehensive study to date of visual foundation models for EAI. Specifically, we evaluated state-of-the-art open-source foundation models and found that we do not yet have a strong backbone for all tasks; however, models trained via masked auto-encoders (MAEs) are the most promising. Our study also finds that naively scaling model size and pre-training data diversity does not improve performance universally across all tasks, but does so on average. Finally, we find that adapting our largest pre-trained model (VC-1) results in performance that is competitive with or outperforms the best known results on all benchmarks in CortexBench.
One of our primary contentions is that for the research community to make progress on foundation models for EAI, we need to develop strong benchmarks: for a PVR to be foundational, it must be broadly applicable. Furthermore, as a community we should converge on best practices and strive toward a rigorous, reproducible experimental methodology; we hope CortexBench helps the community make progress toward that goal.
@inproceedings{vc2023,
title = {Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?},
author = {Arjun Majumdar and Karmesh Yadav and Sergio Arnaud and Yecheng Jason Ma and Claire Chen and
Sneha Silwal and Aryan Jain and Vincent-Pierre Berges and Pieter Abbeel and Jitendra Malik and
Dhruv Batra and Yixin Lin and Oleksandr Maksymets and Aravind Rajeswaran and Franziska Meier},
year = {2023},
eprint = {2303.18240},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}