Task Vectors are Cross-Modal

Grace Luo, Trevor Darrell, Amir Bar
UC Berkeley

TLDR: Task representations in VLMs are consistent across modality (text, image) and specification (example, instruction).




Abstract

We investigate the internal representations of vision-and-language models (VLMs) and how they encode task representations. We consider tasks specified through examples or instructions, using either text or image inputs. Surprisingly, we find that conceptually similar tasks are mapped to similar task vector representations, regardless of how they are specified. Our findings suggest that to output answers, tokens in VLMs undergo three distinct phases: input, task, and answer, a process that is consistent across different modalities and specifications. The task vectors we identify in VLMs are general enough to be derived in one modality (e.g., text) and transferred to another (e.g., image). Additionally, we find that ensembling exemplar- and instruction-based task vectors produces better task representations. Taken together, these insights shed light on the underlying mechanisms of VLMs, particularly their ability to represent tasks in a shared manner across different modalities and task specifications.


What is a task vector?



In the in-context learning (ICL) paradigm, given a set of examples, the model has to learn the mapping from inputs to outputs. Prior research has demonstrated that LLMs implicitly compress this mapping into a latent activation, called the task vector (Hendel et al., 2023; Todd et al., 2024). This means one can separately specify and apply a task by patching the task vector. We analyze this phenomenon in VLMs, where we find that different specifications, such as text versus image ICL, induce similar task vectors, thereby enabling cross-modal patching.
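
To make the specify-and-apply steps concrete, below is a minimal sketch of extracting and patching a task vector. It assumes a HuggingFace LLaMA-style language backbone; the checkpoint name, layer index, and prompts are illustrative placeholders rather than the paper's exact setup, and the text-only model stands in for the language tower of a VLM.

    # Minimal sketch: extract a task vector from text ICL examples and patch it
    # into a new query. Assumes a LLaMA-style HuggingFace model whose decoder
    # layers live at model.model.layers; adjust the paths for other backbones.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-2-7b-hf"   # placeholder checkpoint
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    LAYER = 15  # a middle layer; the best layer is chosen empirically

    @torch.no_grad()
    def extract_task_vector(icl_prompt: str) -> torch.Tensor:
        """Run the ICL prompt and keep the hidden state of its final token."""
        ids = tok(icl_prompt, return_tensors="pt").input_ids
        out = model(ids, output_hidden_states=True)
        # hidden_states[LAYER] has shape (batch, seq, hidden); take the last token.
        return out.hidden_states[LAYER][0, -1]

    def patch_and_generate(query_prompt: str, task_vec: torch.Tensor) -> str:
        """Overwrite the query's final-token activation at LAYER, then decode."""
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            if hidden.shape[1] > 1:              # patch only the prefill pass
                hidden[:, -1, :] = task_vec.to(dtype=hidden.dtype, device=hidden.device)
            return output
        handle = model.model.layers[LAYER - 1].register_forward_hook(hook)
        try:
            ids = tok(query_prompt, return_tensors="pt").input_ids
            gen = model.generate(ids, max_new_tokens=5)
        finally:
            handle.remove()
        return tok.decode(gen[0, ids.shape[1]:], skip_special_tokens=True)

    # Specify the task with text examples, then apply it to a fresh query.
    tv = extract_task_vector("Greece : Athens\nFrance : Paris\nJapan :")
    print(patch_and_generate("Peru :", tv))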


We study task vectors in six cross-modal settings.


The capital city of the country: Greece : Athens
The last word of the official currency of the country: Italy : Euro
The scientific name of the animal’s species in Latin: Gray Wolf : Canis lupus
The term for the baby of the animal: Common Dolphin : calf
The color of the food: Persimmon : orange
The flavor descriptor of the food: Strawberry : sweet

We design six tasks inspired by the text ICL examples proposed in prior work, where we add alternative specifications such as instructions and image examples.
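
For illustration, here is how one of these tasks (country to capital) might be written in the different specification formats; the prompt templates and the <image> placeholder convention are assumptions for the sketch, not the paper's verbatim prompts.

    # Hypothetical prompt strings for one task under different specifications.
    # <image> marks where the VLM's processor would insert an image of the query
    # (e.g., a country's flag or landmark).
    text_icl    = "Greece : Athens\nFrance : Paris\nJapan :"
    instruction = "State the capital city of the country.\nJapan :"
    image_icl   = "<image> : Athens\n<image> : Paris\n<image> :"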


How do ICL examples affect the representation across model layers?



We find that, to complete the task, the model processes the token representation in three phases: input, task, and answer. We use logit lens (nostalgebraist, 2020) to decode the layer representation for the token before the model output and visualize the probability that this representation decodes to a pre-defined input, task, or answer token. Each plot showcases a different set of examples, which vary by modality and task.
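
A rough logit-lens sketch of this measurement is shown below. It reuses `tok` and `model` from the earlier sketch and the same LLaMA-style attribute paths (`model.model.norm`, `model.lm_head`), and the token groups are illustrative stand-ins for the paper's pre-defined input, task, and answer vocabularies.

    # For each layer, decode the hidden state of the final prompt token through
    # the model's own unembedding, and record how much probability mass falls on
    # example input / task / answer tokens.
    import torch

    @torch.no_grad()
    def logit_lens_curves(prompt: str, token_groups: dict[str, list[str]]):
        ids = tok(prompt, return_tensors="pt").input_ids
        hidden_states = model(ids, output_hidden_states=True).hidden_states
        group_ids = {
            name: [tok(t, add_special_tokens=False).input_ids[0] for t in toks]
            for name, toks in token_groups.items()
        }
        curves = {name: [] for name in token_groups}
        for h in hidden_states[1:]:                  # skip the embedding layer
            logits = model.lm_head(model.model.norm(h[0, -1]))
            probs = logits.softmax(dim=-1)
            for name, idx in group_ids.items():
                curves[name].append(probs[idx].sum().item())
        return curves

    # Probability of decoding to an input, task, or answer token at each layer.
    curves = logit_lens_curves(
        "Greece : Athens\nFrance : Paris\nPeru :",
        {"input": [" Peru"], "task": [" capital"], "answer": [" Lima"]},
    )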


Cross-modal patching can outperform unimodal few-shot prompting.


We compare the task accuracy across different methods for specifying and applying the task on image queries. We observe that cross-modal examples are more effective than unimodal ones (Text ICL xPatch vs. Image ICL Patch). However, this benefit appears only when the examples are patched in as a task vector, not when they are given in the same context window as a few-shot prompt (Text ICL xPatch vs. Text ICL xBase). We think this is the case because task vectors, regardless of how they are derived, arrive at similar representations, making diverse example types more in-domain for the model.
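
The sketch below shows what this comparison looks like operationally, simplified to text-only queries (in the paper the task is applied to image queries through the VLM): xBase keeps the examples in the query's context window, while xPatch compresses them into a task vector and patches it in. It reuses `extract_task_vector` and `patch_and_generate` from the first sketch, and the evaluation pairs are illustrative.

    # Compare few-shot prompting (xBase) against task-vector patching (xPatch)
    # on a toy evaluation set; in the paper the queries would be images.
    import torch

    eval_pairs = [("Peru", "Lima"), ("Kenya", "Nairobi"), ("Chile", "Santiago")]
    icl_prompt = "Greece : Athens\nFrance : Paris\nJapan : Tokyo\n"

    @torch.no_grad()
    def generate(prompt: str) -> str:
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=5)
        return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

    def accuracy(predict) -> float:
        return sum(predict(q).strip().startswith(a) for q, a in eval_pairs) / len(eval_pairs)

    # xBase: examples share the context window with the query.
    base_acc = accuracy(lambda q: generate(icl_prompt + f"{q} :"))

    # xPatch: examples are compressed into a task vector, which is patched into
    # an otherwise example-free query.
    tv = extract_task_vector(icl_prompt + "Brazil :")
    patch_acc = accuracy(lambda q: patch_and_generate(f"{q} :", tv))
    print(f"xBase: {base_acc:.2f}  xPatch: {patch_acc:.2f}")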


Qualitative Examples



Acknowledgements

We would like to thank Jiahai Feng, Stephanie Fu, Alexander Pan, and Alberto Hojel for helpful discussions and feedback on the paper.

BibTeX


    @article{luo2024tvacm,
      title={Task Vectors are Cross-Modal}, 
      author={Grace Luo and Trevor Darrell and Amir Bar},
      journal={arXiv preprint arXiv:2410.22330},
      year={2024}
    }