Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining

1Technical University of Munich, 2Ludwig Maximilian University of Munich, 3Munich Center for Machine Learning (MCML), 4Siemens AG
*Equal contribution

Abstract

Contemporary large-scale visual language models (VLMs) exhibit strong representation capabilities, making them ubiquitous for enhancing image and text understanding tasks.

They are often trained in a contrastive manner on a large and diverse corpus of images and corresponding text captions scraped from the internet. Despite this, VLMs often struggle with compositional reasoning tasks, which require a fine-grained understanding of the complex interactions between objects and their attributes.

This failure can be attributed to two main factors: 1) Contrastive approaches have traditionally focused on mining negative examples from existing datasets; however, the mined negative examples might not be difficult for the model to discriminate from the positive ones. An alternative to mining is negative sample generation. 2) Existing generative approaches primarily focus on generating hard negative texts associated with a given image. Generating in the other direction, i.e., producing negative image samples associated with a given text, has been ignored.

To overcome both of these limitations, we propose a framework that not only mines in both directions but also generates challenging negative samples in both modalities, i.e., images and texts. Leveraging these generated hard negative samples, we significantly enhance VLMs' performance on tasks involving multimodal compositional reasoning.
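To illustrate how generated hard negatives in both modalities can enter contrastive training, the following is a minimal PyTorch sketch of an InfoNCE-style objective that appends embeddings of generated negative images and negative captions as extra non-matching candidates. The tensor names (img_emb, txt_emb, neg_img_emb, neg_txt_emb) and the single-temperature formulation are illustrative assumptions, not the exact loss used in the paper.

import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(img_emb, txt_emb, neg_img_emb, neg_txt_emb, temperature=0.07):
    """InfoNCE-style loss in which generated hard negatives of both modalities
    are appended as additional non-matching candidates.

    img_emb:     (B, D) embeddings of the original images
    txt_emb:     (B, D) embeddings of the matching captions
    neg_img_emb: (B, D) embeddings of generated negative images
    neg_txt_emb: (B, D) embeddings of generated negative captions
    All embeddings are assumed to come from the VLM's image/text encoders.
    """
    # Normalize so that dot products are cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    neg_img_emb = F.normalize(neg_img_emb, dim=-1)
    neg_txt_emb = F.normalize(neg_txt_emb, dim=-1)

    batch_size = img_emb.size(0)
    labels = torch.arange(batch_size, device=img_emb.device)

    # Image-to-text: candidates are the in-batch captions plus generated negative captions.
    logits_i2t = img_emb @ torch.cat([txt_emb, neg_txt_emb], dim=0).T / temperature
    # Text-to-image: candidates are the in-batch images plus generated negative images.
    logits_t2i = txt_emb @ torch.cat([img_emb, neg_img_emb], dim=0).T / temperature

    # The correct match for sample i is still at column i in both directions.
    return 0.5 * (F.cross_entropy(logits_i2t, labels) + F.cross_entropy(logits_t2i, labels))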

Framework


Left: The portion to the left of the red dotted line shows the process of determining segmentation masks for all objects in the scene. The Tag2Text model is first used to generate a list of tags for all objects in the scene. Segmentation masks are then created from the source image for each individual tag (masks for the seagull and water tags are shown). Note that the human-annotated source caption may not contain all the identified tags; therefore, the caption generated by Tag2Text for the source image is also used. A code sketch of this tagging-and-masking step follows.
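Below is a hedged Python sketch of the tagging-and-masking step. The generate_tags function is a hypothetical stand-in for Tag2Text (its real interface is not shown here), the file path "source.jpg" is illustrative, and CLIPSeg from Hugging Face Transformers is used only as a readily available text-prompted segmenter; the segmentation model used in the actual framework may differ.

import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

# Hypothetical stand-in for the Tag2Text tagging step; the real model
# returns a list of object tags (and a caption) for the source image.
def generate_tags(image: Image.Image) -> list[str]:
    return ["seagull", "water"]  # e.g., tags that might be produced for a beach photo

# Text-prompted segmentation with CLIPSeg, used here as an accessible example;
# any tag-conditioned segmentation model can fill this role.
processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
segmenter = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

def masks_for_tags(image: Image.Image, tags: list[str], threshold: float = 0.5):
    inputs = processor(text=tags, images=[image] * len(tags), padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = segmenter(**inputs).logits   # (num_tags, H, W) mask logits
    probs = torch.sigmoid(logits)
    return {tag: (probs[i] > threshold) for i, tag in enumerate(tags)}

source_image = Image.open("source.jpg").convert("RGB")
tags = generate_tags(source_image)
masks = masks_for_tags(source_image, tags)    # one binary mask per identified tag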

Right: The portion to the right of the red dotted line corresponds to the process of generating images with subtle variations from the source image. For this, we use Stable Diffusion in inpainting mode, which generates the variations using a segmentation mask and a new item description as the prompt. The item descriptions are produced using ChatGPT.
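A minimal sketch of this inpainting step with the diffusers library is shown below. The checkpoint id "runwayml/stable-diffusion-inpainting", the file names, and the hard-coded prompt list (standing in for ChatGPT-generated item descriptions) are illustrative assumptions, not the exact settings used in the paper.

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Load a Stable Diffusion inpainting checkpoint (an illustrative choice of model id).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

source_image = Image.open("source.jpg").convert("RGB").resize((512, 512))
object_mask = Image.open("seagull_mask.png").convert("L").resize((512, 512))

# Stand-ins for ChatGPT-generated descriptions of replacement objects;
# in the framework these prompts are produced automatically.
replacement_prompts = [
    "a white sailing boat",
    "a red beach umbrella",
    "a wooden ice cream cart",
]

# Inpaint only the masked region, producing subtle variations of the source image.
negative_images = [
    pipe(prompt=p, image=source_image, mask_image=object_mask).images[0]
    for p in replacement_prompts
]
for prompt, img in zip(replacement_prompts, negative_images):
    img.save(f"negative_{prompt.replace(' ', '_')}.png")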

Samples

We showcase a few examples from our generated validation set. Our approach is advantageous in that it can generate a diverse dataset with challenging negative examples. For instance, the images in the first row depict scenarios that are highly unlikely in the real world, since an ice cream cart would never appear at an airport during aircraft maintenance. These examples serve as a true test of the model's understanding of the cart concept.


BibTeX


@inproceedings{sahin2024enhancing,
    title={Enhancing multimodal compositional reasoning of visual language models with generative negative mining},
    author={Sahin, Ugur and Li, Hang and Khan, Qadeer and Cremers, Daniel and Tresp, Volker},
    booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
    pages={5563--5573},
    year={2024}
}