TLDR: Task representations in VLMs are consistent across modality (text, image) and specification (example, instruction).
We investigate the internal representations of vision-and-language models (VLMs) and how they encode task representations. We consider tasks specified through examples or instructions, using either text or image inputs. Surprisingly, we find that conceptually similar tasks are mapped to similar task vector representations, regardless of how they are specified. Our findings suggest that to output answers, tokens in VLMs undergo three distinct phases: input, task, and answer, a process that is consistent across different modalities and specifications. The task vectors we identify in VLMs are general enough to be derived in one modality (e.g., text) and transferred to another (e.g., image). Additionally, we find that ensembling exemplar- and instruction-based task vectors produces better task representations. Taken together, these insights shed light on the underlying mechanisms of VLMs, particularly their ability to represent tasks in a shared manner across different modalities and task specifications.
In the in-context learning (ICL) paradigm, given a set of examples, the model must learn the mapping from inputs to outputs. Prior research has demonstrated that LLMs implicitly compress this mapping into a latent activation called the task vector (Hendel et al., 2023; Todd et al., 2024). This means one can separately specify (left) and apply (right) a task by patching the task vector. We analyze this phenomenon in VLMs, where we find that different specifications, such as text versus image ICL, induce similar task vectors, thereby enabling cross-modal patching.
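To make the patching mechanics concrete, here is a minimal sketch of extracting a task vector from a specification prompt and patching it into a query. It assumes a HuggingFace-style causal LM whose decoder layers sit at `model.model.layers` (in a VLM, these live inside the language-model backbone); the layer index, prompt formats, and helper names are illustrative assumptions, not the paper's exact setup.

```python
import torch

LAYER = 15  # hypothetical intermediate layer; the best layer is model-dependent


def get_task_vector(model, tokenizer, spec_prompt, layer=LAYER):
    """Run the task-specification prompt and keep the last token's hidden state."""
    captured = {}

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured["vec"] = hidden[:, -1, :].detach().clone()  # last-token activation

    handle = model.model.layers[layer].register_forward_hook(hook)  # Llama-style path
    enc = tokenizer(spec_prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        model(**enc)
    handle.remove()
    return captured["vec"]


def patch_and_generate(model, tokenizer, query_prompt, task_vec, layer=LAYER, max_new_tokens=5):
    """Overwrite the query's final-token activation with the task vector, then decode."""
    state = {"done": False}

    def hook(module, inputs, output):
        if state["done"]:  # only patch the prefill pass, not later decode steps
            return output
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, -1, :] = task_vec.to(hidden.dtype)  # in-place activation patch
        state["done"] = True
        return output

    handle = model.model.layers[layer].register_forward_hook(hook)
    enc = tokenizer(query_prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**enc, max_new_tokens=max_new_tokens, do_sample=False)
    handle.remove()
    return tokenizer.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True)
```

In this sketch the task vector is read from a text ICL prompt (e.g., a few `country : capital` pairs) and written into the last token of an unrelated query, so the specification and the application never share a context window.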
| Instruction | Text ICL Example | Image ICL Example |
|---|---|---|
| The capital city of the country: | Greece : Athens | [image of Greece] : Athens |
| The last word of the official currency of the country: | Italy : Euro | [image of Italy] : Euro |
| The scientific name of the animal’s species in Latin: | Gray Wolf : Canis lupus | [image of a gray wolf] : Canis lupus |
| The term for the baby of the animal: | Common Dolphin : calf | [image of a common dolphin] : calf |
| The color of the food: | Persimmon : orange | [image of a persimmon] : orange |
| The flavor descriptor of the food: | Strawberry : sweet | [image of a strawberry] : sweet |
We design six tasks inspired by the text ICL examples proposed in prior work, to which we add alternative specifications such as instructions and image examples.
We find that, to complete the task, the model processes the token representation in three phases: input, task, and answer. We use the logit lens (nostalgebraist, 2020) to decode the layer-wise representation of the token immediately before the model output and visualize the probability that this representation decodes to a pre-defined input, task, or answer token. Each plot shows a different set of examples, varying by modality and task.
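For reference, a minimal logit-lens sketch along these lines: it assumes a Llama-style HuggingFace model with the final norm at `model.model.norm` and the unembedding at `model.lm_head`, and the probe tokens and prompt below are illustrative, not the paper's exact evaluation set.

```python
import torch


@torch.no_grad()
def logit_lens_probs(model, tokenizer, prompt, probe_tokens):
    """For each layer, decode the last prompt token and read probe-token probabilities."""
    enc = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model(**enc, output_hidden_states=True)
    # first subword id of each probe token (a simplification for multi-token words)
    probe_ids = [tokenizer(t, add_special_tokens=False).input_ids[0] for t in probe_tokens]
    per_layer = []
    for h in out.hidden_states[1:]:  # skip the embedding layer
        logits = model.lm_head(model.model.norm(h[:, -1]))  # project through final norm + unembedding
        probs = logits.softmax(dim=-1)[0]
        per_layer.append({t: probs[i].item() for t, i in zip(probe_tokens, probe_ids)})
    return per_layer


# e.g. track probability mass on the input ("Greece"), a task word ("capital"),
# and the answer ("Athens") across layers for the last token before the output:
# logit_lens_probs(model, tokenizer, "Greece :", [" Greece", " capital", " Athens"])
```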
We compare task accuracy across different methods for specifying and applying the task on image queries. We observe that cross-modal examples are more effective than unimodal ones (Text ICL xPatch vs. Image ICL Patch). However, this benefit appears only when the examples are patched as a task vector, not when they are given in the same context window as a few-shot prompt (Text ICL xPatch vs. Text ICL xBase). We attribute this to the fact that task vectors, regardless of how they are derived, converge to similar representations, which makes diverse example types effectively in-domain for the model.
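Below is a minimal sketch of how such a comparison could be scored, reusing `get_task_vector` and `patch_and_generate` from the patching sketch above. The substring-match scoring is an illustrative assumption, and real image queries would pass pixels through the VLM's processor, which is abstracted away here; likewise, the simple average shown for ensembling exemplar- and instruction-derived task vectors is one natural choice, not necessarily the paper's exact recipe.

```python
def eval_xpatch(model, tokenizer, spec_prompt, queries, answers):
    """Derive one task vector from `spec_prompt` and patch it into every query."""
    vec = get_task_vector(model, tokenizer, spec_prompt)
    preds = [patch_and_generate(model, tokenizer, q, vec) for q in queries]
    return sum(a.lower() in p.lower() for p, a in zip(preds, answers)) / len(answers)


# Ensembling exemplar- and instruction-derived task vectors as a simple average:
# vec_ens = 0.5 * (get_task_vector(model, tokenizer, text_icl_prompt)
#                  + get_task_vector(model, tokenizer, instruction_prompt))
```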
If you like this work, these other projects might also interest you.
Task Vectors in LLMs and Vision Models: Hendel et al., 2023; Todd et al., 2024; Hojel et al., 2024
Viewing Task Vectors as Example Compression: Huang et al., 2024
Representational Isomorphism and Convergence: Abdou et al., 2021; Patel & Pavlick, 2022; Pavlick, 2023; Huh et al., 2024
We would like to thank Jiahai Feng, Stephanie Fu, Alexander Pan, and Alberto Hojel for helpful discussions and feedback on the paper.
@article{luo2024tvacm,
title={Task Vectors are Cross-Modal},
author={Grace Luo and Trevor Darrell and Amir Bar},
journal={arXiv preprint arXiv:2410.22330},
year={2024}
}