Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, Zuxuan Wu

Introduction

Vision and Language Foundation Models (VLMs) pretrained on large-scale image-text data have consistently demonstrated impressive performance across a wide range of well-established evaluation tasks, e.g., image classification, image captioning, visual question answering, and cross-modal image-text retrieval. Their remarkable performance is gradually convincing the community that currently available VLMs are robust and powerful enough to be transferred to a broad spectrum of downstream tasks, either through finetuning or even in a zero-shot manner.

However, recent research has shattered this captivating illusion, revealing that even state-of-the-art VLMs exhibit significant limitations in understanding visual-linguistic concepts that require fine-grained compositional reasoning, especially in tasks involving object attributes or inter-object relationships. This raises a crucial question: to what extent and in what aspects are VLMs excelling or struggling? To answer this, previous efforts evaluate the fine-grained capabilities of VLMs through the image-to-text matching task, as shown in Fig. 1(a). This involves providing a query image and retrieving the matching text from a set of confusing candidates that differ only subtly from one another. For example, when assessing counting ability, it is crucial to ensure that quantity is the only variable and that all other clues are kept the same. A straightforward way to do so is to modify the quantity words and adjectives used in the texts. While manipulating texts to construct confusing candidate sets has been well studied, owing to the sparsity of the text space and advancements in Large Language Models (LLMs), the visual side remains relatively underexplored, largely due to the complexity of visual signals and the absence of powerful tools. We posit that exploring the visual dimension in a fine-grained manner is also essential for a comprehensive understanding of VLMs.

Motivated by the great progress achieved in generative models, we present an effective framework for generating high-quality image candidates that are suitable for evaluating the performance of VLMs. This framework ensures that images within the same candidate set differ only in the specified property of interest, while all other properties remain consistent. We break down this task into several simple and manageable steps. As illustrated in Fig. 2, we start by utilizing a text-to-image model to generate images featuring a single object. Then, a segmenter is employed to separate the objects from their backgrounds, yielding a library of foreground instances spanning various categories. From there, we select instances and arrange them on a blank canvas (manipulating attributes such as size, position, existence, and quantity of a specific object at this stage is straightforward). Finally, we use an inpainting model to fill the missing background portions, producing a photo-realistic image. It is worth noting that, during the inpainting process, we design a progressive background filling strategy, effectively ensuring consistency in the background across all images in the same candidate set.

Empowered by this data construction pipeline, we carefully develop a new benchmark, named SPEC, to evaluate the proficiency of VLMs in comprehending fine-grained concepts including Size, Position, Existence and Counting. We systematically test four VLMs on this newly created test bed. Surprisingly, even state-of-the-art models perform at chance level, exposing significant performance deficiencies. Following this, we implement a straightforward approach to remedy this by incorporating hard negative examples (i.e., confusing images or texts within the same candidate set) into the same training batch. This encourages the model to discern subtle differences among candidate examples, leading to a significant improvement in performance on SPEC while preserving the original zero-shot capability. Furthermore, to demonstrate the model’s generalization ability, we conduct additional tests on two existing datasets, which also focus on compositional reasoning. The consistent improvement further validates that our method effectively guides the model to acquire essential and transferable comprehension abilities at a finer granularity. Our main contributions are:

A progressive data construction pipeline. We present a progressive data construction pipeline designed for creating a candidate image set. Within each candidate set, images vary exclusively in a specified attribute while remaining consistent in all other aspects. Such data are valuable for conducting text-to-image matching tasks, as depicted in Fig. 1(b), offering a visual perspective for evaluating VLMs.

A carefully curated benchmark: SPEC. We meticulously craft a novel benchmark, SPEC, with a specific focus on evaluating VLMs’ understanding of fine-grained visual-linguistic concepts, encompassing object size, position, existence, and count. The introduction of SPEC enables a symmetrical evaluation of VLMs from both image and text perspectives, addressing the previous lack of image-centric testing data.

A simple and effective remedy. We evaluate four VLMs on SPEC, revealing significant limitations. In response, we propose a method to enhance the understanding of fine-grained visual-linguistic concepts. Experimental results show notable improvement on SPEC and consistent gains on two additional datasets, while preserving zero-shot capability.

Related Work

Vision and Language Models (VLMs). Models such as CLIP, ALIGN, CyCLIP and CoCa have demonstrated impressive performance across a wide range of downstream tasks. These models include two separate unimodal encoders, each designed to extract representations for images and text. To achieve alignment between the two modalities, they typically employ a huge number of image-text pairs for contrastive learning. By pretraining on 400M noisy image-text pairs, CLIP achieves a top-1 accuracy on ImageNet-1K comparable to that of ResNet-50, even though it is never specifically trained on ImageNet and is evaluated in a zero-shot manner. However, as highlighted in recent work, these advancements are primarily attributed to the simplicity of the evaluation tasks, which require no reasoning or compositional capabilities. The performance of these models is limited on tasks that require fine-grained understanding.

Benchmarking VLMs in Finer Granularity. To assess the model’s understanding of nuanced visual-linguistic concepts, several new benchmarks have been proposed. Winoground is curated by experts with a focus on compositional understanding. VALSE and VL-Checklist investigate several linguistic phenomena by transforming real captions into confusing alternatives. ARO diagnoses VLMs’ comprehension of attribution, relation and ordering. Eqben assesses whether the model is sensitive to visual semantic changes. Most existing benchmarks focus solely on subtle textual changes, as crafting confusing text candidates is straightforward and can be achieved through either LLMs or simple rules. Winoground and Eqben are most relevant to our work, since we focus on minimal semantic changes in both images and texts, enabling a more comprehensive evaluation across modalities. However, the scale of Winoground is restricted by its costly curation, and the image diversity of Eqben is limited by virtual engines. In contrast, our data construction pipeline is scalable and can produce diverse images.

Enhancing VLMs for Fine-grained Understanding. To mitigate challenges in fine-grained recognition, various approaches have been explored. Syn-CLIP utilizes data synthesized by 3D simulation engines to enhance the model’s understanding of concepts beyond nouns. EQSIM incorporates an additional regularization loss to generalize VLMs to nuanced multimodal compositions. TSVLC and ViLEM introduce negative texts generated by LLMs to inject fine-grained knowledge. Construct-VL addresses these challenges from a continual learning perspective. These methods compel the model to focus on subtle differences by introducing confusing texts as hard negatives. However, we argue that the absence of visual hard negatives limits their performance. Thus, we introduce hard negatives for both modalities, simultaneously enhancing the visual and textual encoders.

Synthesize: Data Construction Pipeline

Our goal is to build a set of perplexing images, wherein each image differs solely in a specified attribute while remaining consistent in all other aspects. We first emphasize the importance of preserving consistency among candidate images for effective evaluation (Sec. 3.1). To address this, we break down the problem and introduce a progressive data construction pipeline (Sec. 3.2). Then, we carefully devise a benchmark that centers on evaluating VLMs’ grasp of fine-grained visual-linguistic concepts (Sec. 3.3).

As illustrated in Fig. 3, when conducting a matching task with the textual query “a photo of two dogs”, the model might mistakenly choose the image on the right (which actually contains three dogs). However, attributing this error solely to counting difficulty is not convincing. The model might select the right-side image due to its more photo-realistic appearance or better alignment with the word “dog”, as the left-side image has a cartoon style. Conversely, if the model correctly selects the right-side image for the query “a photo of three dogs”, we also cannot assert that the model is proficient in counting. These obscure ambiguities arise because there is no guarantee of the uniqueness of changing factors among candidate images during evaluation. Therefore, ensuring consistency among candidates in all aspects except the one under investigation is crucial. To this end, we propose a progressive data construction method, which will be elaborated in detail as follows.

3.2 Progressive Data Construction

The data construction framework is illustrated in Fig. 2, comprising four progressive steps. Initially, we generate images featuring a single prominent object. Then, we isolate the foreground from the background, resulting in a library of foreground objects spanning various categories. Subsequently, we select objects from this library and arrange them on a blank canvas, adjusting their attributes and relationships. Lastly, we carefully fill in the missing background, ensuring consistency among different candidates.

We initiate the process by utilizing a generation model to obtain a collection of images, each featuring a single and prominent object corresponding to a specific category. Owing to the progress in text-to-image generation, these images are highly photo-realistic and diverse. In practice, we use Stable-Diffusion-XL 1.0 as our generator, and prompt it with “a photo of a single and fully visible [class name]”. The emphasis on “single and fully visible” is crucial, as the model might otherwise generate images with multiple objects or encounter occlusion issues, as has been observed previously. The [class name] represents a specific category from the 80 classes of COCO.

3.2.2 Isolating Objects from the Background

To ease subsequent processing, we need to separate the objects from the backgrounds in which they are embedded. To accomplish this, we first utilize an open-set detector, Grounding-DINO, to outline the regions containing the objects. Subsequently, we prompt SAM with this bounding box to obtain the final segmentation results. Thus far, we have established a library containing instances from various categories. These instances are background-free, allowing for composition on a blank canvas, while their attributes and relations can be controlled.

3.2.3 Arranging Objects on the Canvas

Recall that our goal is to manipulate a specific visual-linguistic concept of an image. This task becomes straightforward once we exclude the background from consideration. We can retrieve instances from the library and arrange them on a blank canvas according to our specifications. For example, we can flexibly control the quantity of an object through duplication, modify its size via resizing, decide whether an object exists, and specify the position of an existing object. This process resembles the concept of copy-paste, with a notable distinction: we paste objects onto a blank background and place emphasis on controlling properties such as the size, position, existence, and quantity of each individual object.
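The resizing and placement operations described above reduce to simple geometry. The following is a minimal sketch rather than the implementation used in this work: `scale_for_area_fraction` and `paste_offset_for_center` are hypothetical helper names, and object area is approximated by the bounding box instead of the segmentation mask, which is an assumption.

```python
import math


def scale_for_area_fraction(obj_w, obj_h, canvas_w, canvas_h, fraction):
    """Uniform scale so the object's bounding box covers `fraction` of the canvas.

    Assumption: area is measured on the bounding box, not on the mask.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return math.sqrt(fraction * canvas_w * canvas_h / (obj_w * obj_h))


def paste_offset_for_center(obj_w, obj_h, center_x, center_y):
    """Top-left paste coordinate that puts the object's center at (center_x, center_y)."""
    return round(center_x - obj_w / 2), round(center_y - obj_h / 2)
```

For instance, a 100 by 100 instance must be scaled by a factor of 5 to cover a quarter of a 1000 by 1000 canvas.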

3.2.4 Infilling the Missing Background

We have constructed images that differ in the specified attributes; however, they currently lack a suitable background. To fill the missing area, we employ an inpainting model which demonstrates proficiency in filling large holes. It is worth noting that generating backgrounds individually for each image would result in significant differences in the backgrounds, posing a challenge to maintaining consistency among candidate images. To overcome this, we introduce a strategy where images within the same candidate set share a common and consistent background during the inpainting process. As depicted in the upper part of Fig. 4, to present the giraffe in different positions, we start by surrounding it with an initial background. Then, we relocate this initial image on the canvas as required, and fill the remaining blank space. Similarly, in the lower part, we first embed the zebra into a reasonable environment. Following that, we expand the scene horizontally, introduce the elephant, and fill the blank areas through inpainting. In summary, we begin the process by generating an initial background, which is then expanded to the surroundings, effectively ensuring consistency among the candidate images. Additional examples, such as adjusting the size or quantity of an object, can be found in the supplementary.

3.3 SPEC Benchmark

Utilizing the data engine outlined in Sec. 3.2, we carefully devise the SPEC benchmark with the goal of assessing the performance of VLMs in comprehending object size, position, existence and count. An overview of the SPEC benchmark is presented in Fig. 5. SPEC contains six subsets, which will be elaborated as follows:

Absolute Size reflects how large an object is in comparison to the entire image. We categorize this into three levels (large, medium, and small) and define them as follows:

where $P$ denotes the proportion of the space occupied by the object relative to the area of the entire image. A safety threshold is deliberately introduced between these three levels to prevent ambiguity.

Relative Size focuses on the size relationship between two objects. It is categorized as follows: object A is [smaller than, equal to, larger than] object B, and we measure this as follows:

where $R=\frac{S_{A}}{S_{B}}$ is the ratio of the areas of object A and object B.

Absolute Position signifies the location of an object relative to the image. We partition the image into a 3 $\times$ 3 grid, defining nine possible positions: top-left, top, top-right, left, center, right, bottom-left, bottom, and bottom-right. The absolute position of an object is determined based on the grid in which its center point resides.
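The grid assignment can be sketched as follows. This is an illustrative sketch, not the benchmark's own code: the function name `absolute_position` is hypothetical, and it assumes image coordinates with the origin at the top-left corner and the y axis growing downward.

```python
def absolute_position(cx, cy, width, height):
    """Map an object's center (cx, cy) to one of the nine cells of a 3x3 grid.

    Assumption: origin at the top-left corner, y growing downward.
    """
    if not (0 <= cx < width and 0 <= cy < height):
        raise ValueError("center point must lie inside the image")
    col = min(int(3 * cx / width), 2)
    row = min(int(3 * cy / height), 2)
    names = [
        ["top-left", "top", "top-right"],
        ["left", "center", "right"],
        ["bottom-left", "bottom", "bottom-right"],
    ]
    return names[row][col]
```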

Relative Position describes the spatial relationship between two objects. We consider four common spatial relationships: A is [to the left of, to the right of, above, below] B. The position relationship between objects is defined based on the relative positions of their center points.
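One way to turn two center points into one of the four relations is sketched below. The text does not state how mixed displacements are resolved, so the rule used here (the axis with the larger displacement decides, with ties going to the horizontal axis) is an assumption, as are the function name `relative_position` and the convention that y grows downward.

```python
def relative_position(center_a, center_b):
    """Describe where A lies relative to B, given (x, y) center points.

    Assumptions: y grows downward; the axis with the larger displacement
    decides the label; ties are resolved in favor of the horizontal axis.
    """
    dx = center_a[0] - center_b[0]
    dy = center_a[1] - center_b[1]
    if dx == 0 and dy == 0:
        raise ValueError("centers coincide; the relation is undefined")
    if abs(dx) >= abs(dy):
        return "to the left of" if dx < 0 else "to the right of"
    return "above" if dy < 0 else "below"
```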

Existence indicates whether an object appears in a given image, expressed using existential quantifiers: There is [no, at least one] object in the image.

Count represents the number of occurrences of an object, providing a metric for the model’s quantitative understanding. Due to potential occlusion issues with a large number of objects, we restrict our consideration to the range of 1 to 9: “there are [one, two, $\cdots$ , nine] object(s) in the image”.

Data Format. The basic unit of SPEC is an individual test case, wherein each test case comprises two components: an image candidate set, which differs only in certain visual aspects, and a text candidate set, which differs only in the corresponding language descriptions. We formally represent a test case as:

where $\mathcal{I}$ and $\mathcal{T}$ represent the image and text candidate sets, respectively. The $i$-th image $I_{i}$ is paired with the $i$-th text $T_{i}$, i.e., they mutually describe each other. $K$ is the semantic cardinality of the test case, determined by the definition of each subset. For instance, if a test case belongs to the absolute size subset, then $K=3$ (representing the three semantics: large, medium, small). In Fig. 5, we present exemplar test cases for each subset, and more examples can be found in the supplementary.
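One form of a test case that is consistent with the symbols defined here is given below. The symbol $\mathcal{C}$ for the test case is an assumed name and is not taken from the text; only $\mathcal{I}$, $\mathcal{T}$, $I_{i}$, $T_{i}$ and $K$ are.

```latex
\mathcal{C} = (\mathcal{I}, \mathcal{T}), \qquad
\mathcal{I} = \{I_{1}, \dots, I_{K}\}, \qquad
\mathcal{T} = \{T_{1}, \dots, T_{K}\}
```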

In Tab. 1, we conduct a comprehensive comparison of SPEC with four similar benchmarks. Visual HN indicates the presence of image hard negatives, which are essential for the text-to-image matching task. Notably, VALSE and ARO concentrate solely on text hard negatives, neglecting their visual counterparts. Scalability assesses whether the data construction method can be scaled up. For instance, Winoground is limited to 400 examples due to high costs for manual collection. Additionally, Text-realistic evaluates the grammatical correctness of the texts, where ARO involves directly swapping the positions of two words without considering grammar. Similarly, Image-realistic indicates the realism of the images. Eqben incorporates images rendered using a virtual engine, compromising their quality. In contrast, images from SPEC, while synthesized, leverage advanced image generation models and our effective data construction pipeline, resulting in photo-realistic images. Finally, we compare the number of Candidates in each test case. In all datasets except SPEC, each example features only two candidates, i.e., identifying the correct item from two candidates, which is relatively straightforward. In contrast, tasks in SPEC are more challenging, and the semantic space covered by the candidate set is more comprehensive. For example, in the case of relative position, the candidates include all four semantic directions (up, down, left, right), and in absolute position and count, it extends to nine candidates. As will be discussed below, more confusing candidates can be readily used as hard negatives to improve current VLMs.

Diagnose: Probing VLMs on SPEC

We conduct a systematic evaluation of four state-of-the-art VLMs, namely CLIP, BLIP, FLAVA and CoCa, using our newly proposed SPEC benchmark, aiming to uncover to what extent and in which aspects they excel or struggle.

Evaluation protocols. We measure the performance on SPEC using two metrics: $\textsc{I}2\textsc{Tacc}$ and $\textsc{T}2\textsc{Iacc}$ , representing the accuracy of image-to-text and text-to-image matching task, respectively:

where $D$ contains $|D|$ test cases, and $h(I_{j},\mathcal{T}_{i})$ equals 1 if and only if $I_{j}$ correctly finds its matched text $T_{j}$ in the candidate set $\mathcal{T}_{i}$; otherwise it is set to 0. Similarly, $g(T_{j},\mathcal{I}_{i})$ equals 1 if and only if $T_{j}$ correctly finds its matched image $I_{j}$ in the candidate set $\mathcal{I}_{i}$; otherwise it is set to 0.
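These definitions suggest averaging the indicators $h$ and $g$ over every query in every test case. The sketch below computes both metrics under stated assumptions: each test case is given as a $K \times K$ similarity matrix `sim`, where `sim[j][k]` is the score between image $j$ and text $k$; a query "finds" its match when the paired candidate has the highest score (ties go to the lowest index); and the result is normalized by the total number of queries, which equals $K|D|$ when all test cases share the same $K$. The function names are hypothetical.

```python
def i2t_hits(sim):
    """Number of images whose best-scoring text is their paired text (h = 1)."""
    return sum(1 for j, row in enumerate(sim) if row.index(max(row)) == j)


def t2i_hits(sim):
    """Number of texts whose best-scoring image is their paired image (g = 1)."""
    k = len(sim)
    columns = [[sim[j][c] for j in range(k)] for c in range(k)]
    return sum(1 for c, col in enumerate(columns) if col.index(max(col)) == c)


def spec_accuracy(test_cases):
    """Return (I2Tacc, T2Iacc) averaged over every query in every test case."""
    total = sum(len(sim) for sim in test_cases)
    i2t = sum(i2t_hits(sim) for sim in test_cases) / total
    t2i = sum(t2i_hits(sim) for sim in test_cases) / total
    return i2t, t2i
```

Under this protocol, a model that guesses uniformly at random is expected to score $1/K$ on a subset with $K$ candidates.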

4.2 Key Insights from SPEC Results

We evaluate four VLMs using the SPEC benchmark, and their results are summarized in Tab. 2. We find that all the models exhibit a limited accuracy close to random chance, from which we gain the following insights:

Even state-of-the-art VLMs perform at chance level. Surprisingly, even the most advanced VLMs achieve only a marginal advantage over random chance, which sharply contrasts with their impressive performance on common tasks. For instance, CLIP reaches a mere 33.4$\%$ T2Iacc for relative size recognition, while the chance-level accuracy is 33.3$\%$. While BLIP performs reasonably well on absolute size, surpassing the random level by around 9.7$\%$, its I2Tacc on relative size lags 0.7$\%$ behind it. CoCa and FLAVA also exhibit significant weaknesses in performance. The last row presents the performance of our improved model, demonstrating significant advancements across all metrics (as will be introduced in Sec. 5).

The challenge arises from the task itself, not the data. One might attribute the poor performance of VLMs to the data quality or distribution. To address this concern, we conduct an additional experiment. Specifically, we perform classification experiments using the SPEC dataset to assess the model’s understanding of nouns or object categories. In this context, the models exhibit impressive performance, achieving approximately 90 $\%$ Top-1 accuracy in the 80-class classification task (denoted as cls in Tab. 2). This aligns well with earlier findings that VLMs struggle in compositional reasoning while excelling in object category recognition. The remarkable accuracy of the models in the object classification task confirms the high quality of SPEC data. This also validates that the challenges faced by VLMs stem from the tasks which require fine-grained recognition rather than issues with the data itself.

4.3 Discussion on Model Limitations

We attribute the poor performance of vision and language models on SPEC to their pretraining methods, specifically, the inherent limitation of the standard contrastive loss. Conventional contrastive learning involves randomly sampling batches of images and texts, requiring the model to identify matching pairs within the batch. This task is intended to facilitate alignment between the text and image spaces. However, as highlighted in prior studies, the substantial differences between items in a randomly sampled batch allow the model to complete this task effortlessly. It can easily achieve this through a shortcut, by focusing solely on nouns in the text and object categories in the images. This biases the model towards noun concepts and leads it to neglect other finer-grained concepts. This is why these models perform poorly on fine-grained tasks that demand an understanding of concepts beyond nouns.

Optimize: A Simple but Effective Remedy

We experiment with CLIP and propose a remedy to enhance its performance in fine-grained understanding.

CLIP consists of a visual encoder to extract the image embedding, $e_{I}=\mathcal{E}_{I}(I)$, and a textual encoder to extract the text embedding, $e_{T}=\mathcal{E}_{T}(T)$. The similarity score between an image $I$ and a text $T$ is computed as follows:
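CLIP conventionally scores an image-text pair by the cosine similarity of the two embeddings, i.e., $e_{I}\cdot e_{T}/(\|e_{I}\|\,\|e_{T}\|)$. A minimal sketch under that assumption is given below; any temperature scaling is left out, and the function name `similarity` is hypothetical.

```python
import math


def similarity(e_i, e_t):
    """Cosine similarity between an image embedding and a text embedding."""
    if len(e_i) != len(e_t):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(e_i, e_t))
    norm_i = math.sqrt(sum(a * a for a in e_i))
    norm_t = math.sqrt(sum(b * b for b in e_t))
    return dot / (norm_i * norm_t)
```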

In order to guide CLIP to focus on fine-grained visual-linguistic concepts, we incorporate confusing images and texts as hard negatives within the same batch. This requires CLIP to pull positives closer and push hard negatives away, thereby enhancing its ability to discern nuanced visual and textual differences. Specifically, we introduce a hard-negative-aware contrastive loss $\mathcal{L}_{hn}=\mathcal{L}_{hn}^{\textsc{i2t}}+\mathcal{L}_{hn}^{\textsc{t2i}}$, which comprises an image-to-text and a text-to-image term:

where $\mathcal{I}$ and $\mathcal{T}$ represent the trivial images and texts, respectively, while $\mathcal{I}^{hn}$ and $\mathcal{T}^{hn}$ denote non-trivial hard negatives. In our implementation, these hard negative examples are constructed using the data pipeline described in Sec. 3.2, and more details are in the supplementary.
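A plausible reading of the image-to-text term is a standard InfoNCE loss whose denominator ranges over both the trivial texts $\mathcal{T}$ and the hard negatives $\mathcal{T}^{hn}$. The sketch below computes that term for a single image from precomputed similarity scores. It is an illustration under assumptions, not the exact loss of this work: the temperature `tau` and its value 0.07 are illustrative, and the function name `hn_i2t_loss` is hypothetical. The text-to-image term would be obtained symmetrically by swapping the roles of images and texts.

```python
import math


def hn_i2t_loss(pos_score, batch_scores, hn_scores, tau=0.07):
    """Image-to-text contrastive term for one image.

    pos_score:    similarity to the paired text
    batch_scores: similarities to the other (trivial) texts in the batch
    hn_scores:    similarities to the hard-negative texts
    tau:          temperature; 0.07 is an illustrative value, not from the text
    """
    logits = [s / tau for s in [pos_score, *batch_scores, *hn_scores]]
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]
```

Because hard negatives score close to the positive, they contribute large terms to the denominator, so the loss stays high until the model separates them.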

To preserve the inherent zero-shot capability of CLIP, we also leverage conventional image-text pairs from LAION-400M. We apply the standard contrastive loss of CLIP to these data, introducing an additional loss term $\mathcal{L}_{clip}$. The overall loss consists of two terms:

where $\lambda$ is a hyperparameter that balances these two terms. Training with this multi-task loss improves the performance of CLIP in fine-grained understanding while maintaining its zero-shot capability.
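Since $\lambda$ is described in the training details as the weight of the hard negative aware loss, the overall objective presumably takes the following form. The symbol $\mathcal{L}$ for the total loss is an assumed name.

```latex
\mathcal{L} = \mathcal{L}_{clip} + \lambda \, \mathcal{L}_{hn}
```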

5.2 Experiments

Training details. We experiment with the ViT-B/32 variant of CLIP, and resume from the OpenAI pretrained checkpoint. We finetune for 1,000 steps using a cosine schedule with an initial learning rate of $1e\text{-}6$ and use 800 steps for warm-up. The batch size of LAION data is set to 2048, and the batch size of hard negative data is set to 768. The weight $\lambda$ of the hard-negative-aware loss is set to 0.2.
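The schedule can be sketched as below. The text gives only the step counts and the learning rate, so the shape is an assumption: a linear warm-up to a peak of $1e\text{-}6$ over the first 800 steps, followed by cosine decay to zero over the remaining 200. The function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(step, total_steps=1000, warmup_steps=800, peak_lr=1e-6):
    """Learning rate at a 0-indexed training step.

    Assumptions: linear warm-up to peak_lr, then cosine decay to zero.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```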

Main results. We utilize the SPEC benchmark to assess the model’s understanding of fine-grained concepts. In Tab. 3, we present the average I2Tacc and T2Iacc on all subsets of SPEC. To assess the zero-shot capability of the model, we also utilize the toolkit from ELEVATER to evaluate the performance on 9 classification datasets, and report the average accuracy. Compared to CLIP, our model demonstrates remarkable advancements, with a 19.8$\%$ boost in I2Tacc and an 18.9$\%$ improvement in T2Iacc on SPEC, and a noteworthy 1.2$\%$ enhancement in zero-shot accuracy. We also conduct an ablation on different training configurations. From the results in Tab. 3, it can be observed that introducing $\mathcal{L}_{hn}$ significantly improves the performance on SPEC. Moreover, $\mathcal{L}_{clip}$ plays a crucial role in preserving zero-shot performance: without it, we observe a decline in accuracy of 5.1$\%$. With the combination of these two losses, we achieve substantial improvement on SPEC while maintaining the original zero-shot capability.

Validation on other fine-grained benchmarks. To assess whether our approach aids the model in acquiring fundamental visual-linguistic understanding or merely leads to overfitting on SPEC, we conduct evaluations on two additional benchmarks which also focus on the assessment of fine-grained concepts. ARO explores three aspects of vision-language understanding: object attributes, inter-object relations, and word ordering. Eqben focuses on minimal visual semantic changes, aiming to diagnose VLMs in understanding fine-grained concepts such as counting and location. In Tab. 4, we present the experimental results, demonstrating a clear improvement over CLIP on both datasets. For example, compared to CLIP, our method shows an average improvement of 2$\%$ on Eqben and respective enhancements of 3.2$\%$, 9.8$\%$, and 7.4$\%$ on the three subsets of ARO. The consistent improvement on these datasets demonstrates that our approach has helped the model acquire transferable fine-grained understanding capabilities.

Conclusion

In this study, we explored the comprehension abilities of Vision and Language Models (VLMs) with respect to fine-grained visual-linguistic concepts. We first established an efficient pipeline to synthesize candidate images that differ exclusively in a particular visual attribute. Leveraging this pipeline, we created the SPEC benchmark to diagnose the comprehension proficiency of VLMs in terms of object size, position, existence, and count. Upon evaluating four leading VLMs using SPEC, we uncovered substantial performance limitations. To address this, we introduced an enhancement strategy that effectively optimizes the model for fine-grained understanding, while maintaining its original zero-shot capability.