InstructDiffusion: A Generalist Modeling Interface for Vision Tasks

Zigang Geng, Binxin Yang, Tiankai Hang, Chen Li, Shuyang Gu, Ting Zhang, Jianmin Bao, Zheng Zhang, Han Hu, Dong Chen, Baining Guo

cs.CV

Introduction

In recent years, the field of artificial intelligence has witnessed remarkable advancements, particularly in natural language processing (NLP). The Generative Pre-trained Transformer (GPT) has successfully unified multiple NLP tasks by providing a single, coherent framework for diverse applications. Building on this success, our research aims to achieve a similar unification in computer vision, i.e., to develop a unifying framework capable of handling multiple vision tasks simultaneously. However, compared with NLP tasks, computer vision tasks are far more diverse, which makes unifying them more challenging.

Diversity of Tasks and Outputs: Computer vision tasks encompass a wide range of applications, such as object recognition, segmentation, image generation, and keypoint detection, among others. Each of these tasks has a different output format, including coordinates, binary masks, images, and categories. This diversity makes it difficult to find a uniform representation for all tasks. In contrast, NLP tasks often have text-based outputs that can be more easily represented in a standard format.

Different Methodologies and Techniques: Computer vision tasks often require distinct methodologies and techniques depending on the specific problem being addressed. For example, image generation tasks are commonly dominated by Generative Adversarial Networks (GANs) and Denoising Diffusion Probabilistic Models (DDPM), which are rarely used for image understanding tasks such as object recognition or image classification. Additionally, the output dimensionality of generative models is relatively higher, adding to the challenge of unifying these tasks. In contrast, NLP tasks tend to rely on a more consistent set of techniques, such as Transformer-based models, which can be applied across various NLP applications.

Continuous Input and Output: Both the input and output of computer vision tasks are usually continuous, like coordinates or images. This continuous nature makes it challenging to develop a unified approach that can accurately handle such data. If the continuous data are discretized using techniques like Vector Quantized-Variational AutoEncoders (VQ-VAE), quantization errors arise, leading to inaccuracies in the results. This issue is less prominent in NLP tasks, where the input and output data can be more easily discretized into text tokens.

In this paper, we take advantage of the DDPM and propose a novel approach to address these challenges by treating all computer vision tasks as image generation, specifically instructional image editing tasks. We specify each task through instructions in a natural and intuitive way that closely aligns with how humans process images. For instance, the instruction for a segmentation task could involve turning the pixels of an object in the image into a specific color, while the remaining pixels remain unchanged. The keypoint detection task can be described as placing an opaque colored circle at a specific position in the image. The instruction for a classification task could change the object to different colors according to its category. Compared with methods that have attempted to formulate vision tasks as inpainting problems, our approach reflects human intentions more accurately, which simplifies the process of handling multiple vision tasks. At the same time, since the input and output of DDPM are continuous, discretization is unnecessary, which avoids the problem of quantization error.

We mainly focus on three types of output formats: 3-channel RGB images, binary masks, and keypoints. These three outputs are sufficient to cover most vision tasks, such as semantic segmentation, referring segmentation, keypoint detection, image manipulation, and so on. Since the output of the denoising diffusion model is a 3-channel image, we propose a unified representation that encodes masks and keypoints into 3-channel images to handle various image understanding tasks. Then we use a post-processing module to extract the commonly used output format for evaluation.

During the training phase, we use a diverse set of tasks to train a single model uniformly. We also collect a new dataset for image editing. The experimental results demonstrate that our approach achieves good performance in each task. Furthermore, we observed that, compared to training individual models for each task, joint training of multiple tasks can enhance the generalization ability.

Remarkably, our model also exhibits the ability of AGI to a certain extent, as it can handle tasks not seen during the training phase, such as image detection and classification. Moreover, it performs better than previous methods on datasets that were not seen during training. This study thus presents a significant step towards the development of a generalist modeling interface for vision tasks, paving the way for future research in the quest for AGI in computer vision.

Related Work

Building a general-purpose model that is capable of solving any arbitrary task has been a longstanding desire for artificial intelligence research. There exists a substantial number of related works in the literature, aiming to unify a broad spectrum of tasks. We present a brief overview of recent efforts in this direction.

Vision Language Foundation Models. The vast amount of easily accessible web-scale image-text pairs has brought about a wave of research innovations in vision language foundation models. The pioneering works, CLIP and ALIGN, are trained with contrastive loss, showing impressive generalization capabilities for downstream tasks by aligning pairs of images and texts in a cross-modal shared embedding space. Subsequent efforts extend the image-text contrastive method to a broader spectrum, such as the image-text-label space proposed in UniCL and a wider range of tasks as well as modalities supported in Florence and INTERN. However, contrastive-based methods lack the ability to generate language, which limits their application in open-ended tasks such as captioning or visual question answering.

On the other hand, the success of large language models such as the GPT series, PaLM, and LLaMA has attracted considerable research interest in augmenting large language models with visual capabilities. Mostly, these models cast a wide range of open-ended vision tasks as text prediction problems, mapping visual input content to language semantics to enable general-purpose visual and language understanding. BEIT3 unifies the pretraining task in a masked data modeling manner. CoCa and BLIP unify contrastive learning and generative learning. Flamingo accepts arbitrarily interleaved visual data and text as input and generates text in an open-ended manner by learning on a broad diversity of vision language tasks. LLaVA exploits visual instruction tuning by converting image-text pairs into an instruction-following format. GLIP v2 and Kosmos v2 leverage grounded image-text pairs to further unlock the grounding capability of multimodal large language models. Our work differs from LLaVA in that, unlike open-ended visual tasks such as visual question answering that can be naturally formulated in an instruction-following format, we attempt to formulate vision tasks, such as segmentation and keypoint detection, into an instruction-following framework. This is challenging due to the unclear instructions and lack of specific guidelines in these tasks.

Vision Generalist Models. Seeking a unified model that, once trained, can be directly used to seamlessly address a wide variety of vision tasks has been an enduring aspiration in the computer vision community. Multi-task learning has become more and more popular. The key challenge lies in the diversity and complexity of the structures of task outputs. Currently, there are two major interfaces for output unification: language-like generation and image-resembling generation. Most existing attempts at vision generalists take inspiration from sequence-to-sequence models in the NLP field and model a sequence of discrete tokens through next token prediction. Pix2Seq v2 unifies object detection, instance segmentation, keypoint detection, and image captioning by quantizing the continuous image coordinates for the first three tasks. Unified IO further unifies dense structure outputs such as images, segmentation masks, and depth maps using a vector quantization variational auto-encoder (VQ-VAE).

As quantization inevitably introduces information loss during discretization, another direction of unification aims to explore the image itself as a natural interface for vision generalists. Painter formulates the dense prediction task as a masked image inpainting problem and demonstrates in-context capability in vision tasks such as depth estimation, semantic segmentation, instance segmentation, keypoint detection, and image restoration. Recently, PromptDiffusion also exploits in-context visual learning with a text-guided diffusion model and integrates the learning of six different tasks, i.e., image-to-depth, image-to-HED, image-to-segmentation and vice versa. Our work also examines image-resembling generation. However, in contrast to in-context learning and to previous works that also explore natural language instructions, our method introduces a more favorable instruction alignment than the implicit task intention deduced from in-context examples. Moreover, with such explicit instructions, we further unify semantic image editing tasks, which are crucial use cases in image-resembling generation.

Method

We present InstructDiffusion, a novel generalist modeling interface designed for a diverse range of vision tasks. By leveraging the Denoising Diffusion Probabilistic Model (DDPM), we treat all computer vision tasks as human-intuitive image manipulation processes with outputs in a flexible and interactive pixel space. Several existing multi-modal models, such as Flamingo and BLIP2, inherently produce natural language as their target output, thereby restricting their capabilities to visual question answering and image captioning. In contrast, our approach posits that formulating various vision tasks, including segmentation, keypoint detection, and image synthesis, as image-resembling generation processes is more intuitive, straightforward, and readily assessable for human evaluation.

Our primary focus is on three output formats: 3-channel RGB images, binary masks, and key points. These outputs adequately encompass a wide range of vision tasks, including keypoint detection, semantic segmentation, referring segmentation, semantic image editing, and several image enhancement tasks such as deblurring, denoising, and watermark removal. We first discuss the essential instructional format design for the vision tasks currently covered in Section 3.1, followed by an in-depth explanation of the training data preparation to ensure optimal model performance in Section 3.2. Lastly, we describe a unified framework with a simple architecture in Section 3.3.

The unified modeling interface for all tasks is referred to as Instructional Image Editing. By denoting the training set as $\{\bm{x}^{i}\}$ , each training data $\bm{x}^{i}$ can be represented in the form of $\{c^{i},s^{i},t^{i}\}$ , where $c^{i}$ signifies the control instruction, while $s^{i}$ and $t^{i}$ represent the source and target images, respectively. Within this context, our method aims to generate a target image $t^{i}$ that adheres to the given instruction $c^{i}$ when provided with an input source image $s^{i}$ .

In the context of semantic image editing tasks, InstructPix2Pix is a recent representative work that demonstrates a natural fit. For other vision tasks, the challenge involves creating appropriate instructions and subsequently establishing a corresponding target image. Although natural language instruction has been utilized extensively in previous approaches, such as Pix2Seq and UnifiedIO, we contend that terms like “semantic segmentation” or “keypoint detection” are better perceived as indicators rather than instructions. In contrast, our approach involves providing highly detailed instructions, enabling the model to comprehend the instructions rather than merely model a fixed bias based on the indicator.

Keypoint detection. Keypoint detection aims to precisely locate key object components within an image, such as the left eye of a face, the right shoulder of an individual, or the nose of a dog. Traditionally, heatmap regression has served as the standard learning approach, where ground truth heatmaps are generated by overlaying 2D Gaussian kernels on all keypoints. In contrast, this work introduces a more natural and easily assessable output driven by detailed instructions. An example instruction is “Please use red to encircle the left shoulder of the man.” In this instance, the output image should exhibit a red circle at the corresponding location (i.e., the left shoulder of the man in the image), while the rest of the image remains unaltered. This formulation makes the keypoint detection output easier to interpret while also improving the model’s capacity to understand the meaning of different object components.
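The construction of such a keypoint target image can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `draw_keypoint_circle`, the radius, and the choice of a filled disk are assumptions made for the example.

```python
import numpy as np

def draw_keypoint_circle(image, center_xy, radius, color):
    """Return a copy of `image` (H, W, 3 uint8) with a filled, opaque circle.

    Hypothetical helper: the radius and the filled-disk rendering are
    assumptions for illustration only.
    """
    h, w = image.shape[:2]
    ys, xs = np.ogrid[:h, :w]          # row and column index grids
    cx, cy = center_xy
    inside = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
    target = image.copy()              # the source image is left untouched
    target[inside] = color             # pixels outside the circle are unchanged
    return target

# Toy source image; in practice this would be the photo containing the person.
source = np.zeros((64, 64, 3), dtype=np.uint8)
target = draw_keypoint_circle(source, center_xy=(20, 30), radius=3, color=(255, 0, 0))
```

Only the pixels inside the circle differ between `source` and `target`, which matches the requirement that the rest of the image remains unaltered.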

Segmentation. For semantic and referring segmentation, the objective is to identify the region of a particular object within the input image. An illustrative example of this instruction would be “apply a blue semi-transparent mask to the rightmost dog while maintaining the remainder unaltered.” The target image is then fully determined: it features a blue mask on the appropriate dog. We require the mask to be semi-transparent instead of opaque, thereby facilitating the human evaluation of the predicted mask’s accuracy. Moreover, our experiments indicate that the semi-transparent mask also improves segmentation performance.
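A semi-transparent mask of this kind amounts to alpha blending a flat color over the object region. The sketch below assumes a boolean object mask and uses the transparency of 0.5 reported in the training details; the function name `apply_color_mask` is hypothetical.

```python
import numpy as np

def apply_color_mask(image, mask, color, alpha=0.5):
    """Blend `color` over `image` where `mask` is True.

    image: (H, W, 3) uint8, mask: (H, W) bool, color: RGB triple.
    Pixels outside the mask are returned unchanged.
    """
    image_f = image.astype(np.float32)
    color_f = np.asarray(color, dtype=np.float32)
    blended = (1.0 - alpha) * image_f + alpha * color_f
    out = np.where(mask[..., None], blended, image_f)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

# Toy example: a uniform gray image with a mask on the top-left 2x2 block.
image = np.full((4, 4, 3), 100, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
masked = apply_color_mask(image, mask, color=(0, 0, 200))
```

Because the blend keeps half of the original pixel intensity, the object remains visible under the mask, which is what allows a human to judge whether the mask covers the right region.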

Image enhancement and image editing. Image enhancement such as deblurring, denoising, and watermark removal inherently yields output images, and the same applies to image editing. Consequently, we only need to construct instructions which shall clearly specify the operation to be performed. Detailed examples include “Make the image much sharper” for image deblurring, “Please remove the watermark on the image” for watermark removal, and “add an apple in the woman’s hand” for image editing.

To enhance the diversity of instructions, we first manually write 10 instructions for each task. Then we use GPT-4 to rewrite and expand the diversity of these instructions, thereby mimicking user input to the system. Subsequently, one instruction is chosen at random during the training process. This approach, which incorporates diverse and intuitive instructions, has been observed to substantially augment the model’s multi-task fusion capabilities.
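Random instruction selection during training can be sketched as below. The first template paraphrases the example given earlier; the second template, the placeholder names, and the helper `sample_instruction` are hypothetical and stand in for the manually written and GPT-4 expanded instruction pool.

```python
import random

# Illustrative templates only; the real pool is larger and is not reproduced here.
KEYPOINT_TEMPLATES = [
    "Please use {color} to encircle the {keypoint} of the {subject}.",
    "Mark the {keypoint} of the {subject} with a {color} circle.",
]

def sample_instruction(templates, rng, **fields):
    """Pick one template uniformly at random and fill in its placeholders."""
    return rng.choice(templates).format(**fields)

rng = random.Random(0)
instruction = sample_instruction(
    KEYPOINT_TEMPLATES, rng, color="red", keypoint="left shoulder", subject="man"
)
```

Drawing a fresh template for every training sample exposes the model to many phrasings of the same intent.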

3.2 Training Data Construction

As a proof-of-concept, we focus on investigating whether different tasks benefit each other under such image-resembling unification, instead of scaling data as much as possible for optimal performance at the extreme limits. We adopt widely used publicly available datasets and construct the ground truth target image according to the instruction template. For example, we use COCO-Stuff for semantic segmentation and use COCO, MPII, CrowdPose, and AIC for keypoint detection. More details will be presented in Sec 4.1.

For image editing, InstructPix2Pix (IP2P) pioneered the use of a synthetic training dataset by leveraging GPT-3 for generating instructions and Prompt2Prompt for creating output images. However, the synthesized source and target images exhibit varying quality and non-negligible artifacts, with most instructions focusing on global style modifications rather than local alterations. Furthermore, MagicBrush introduced a dataset comprising over 10,000 manually annotated triples, but its size is limited when compared to other vision tasks. Consequently, in addition to existing datasets such as IP2P, GIER, GQA, and MagicBrush, we propose a novel dataset called Image Editing in the Wild (IEIW), which encompasses 159,000 image editing pairs that cover a wide range of semantic entities and diverse levels of semantic granularity. To expand the scope of image editing data, we assemble the IEIW dataset by drawing from the following three distinct resources:

Object removal. Object removal is a very common type of image editing. Inspired by Inst-Inpaint, we use the referring segmentation dataset PhraseCut to construct the instructional object removal data. PhraseCut offers images with referring phrases for corresponding regions. We set these regions as a mask and use LAMA to inpaint them, transforming them into instructional inpainting datasets. Notably, we also swap input and output images, and reverse the instructions like “remove the blue bird on top of the tree” to “add a blue bird on top of the tree” to further supplement data from the perspective of adding components.
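The swap-and-reverse augmentation can be sketched as follows. The dictionary layout, the helper name `invert_removal_sample`, and the simple string rewrite are assumptions for illustration; the source does not specify how instructions are reversed in practice.

```python
def invert_removal_sample(sample):
    """Turn a 'remove the X' triplet into the matching 'add a X' triplet.

    `sample` is a hypothetical dict with keys 'instruction', 'source', 'target'.
    Source and target images are swapped and the instruction is reversed.
    """
    instruction = sample["instruction"]
    prefix = "remove the "
    if not instruction.lower().startswith(prefix):
        raise ValueError("expected an instruction of the form 'remove the ...'")
    phrase = instruction[len(prefix):].strip()
    if not phrase:
        raise ValueError("instruction has no object phrase")
    article = "an" if phrase[0].lower() in "aeiou" else "a"
    return {
        "instruction": f"add {article} {phrase}",
        "source": sample["target"],   # the inpainted image becomes the input
        "target": sample["source"],   # the original image becomes the output
    }

removal = {
    "instruction": "remove the blue bird on top of the tree",
    "source": "original.png",
    "target": "inpainted.png",
}
addition = invert_removal_sample(removal)
```

Each removal triplet therefore yields one extra addition triplet at no additional inpainting cost.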

Object replacement. We propose a data construction pipeline for generating training data that targets the scenario of substituting certain specific objects, which is another essential feature for image editing. To automatically generate training triplets, we rely on SA-1B and OpenImages datasets, which provide multiple regions in an image with semantic meaning. Specifically, we first build a gallery database consisting of diverse image patches based on those semantic-aware regions. Given a source image from OpenImages or SA-1B, we randomly select a semantic region, which is used as a query patch to retrieve its nearest neighbors from the gallery database. The retrieved similar patches are regarded as reference images to the source image, both of which are fed to PaintByExample for generating a target image. In this way, we obtain the source image as well as the modified target image. To produce the instruction, we utilize an image captioning tool, such as BLIP2, to yield the source caption as well as the target caption, and then generate a possible instruction through a large language model. For example, given the captions “a running dog” and “a cute cat with black and white stripes”, a possible instruction is “please change the running dog to a cute cat with black and white stripes”. We can generate a large amount of paired data for training using this construction pipeline.
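The retrieval step of this pipeline can be illustrated with a small nearest-neighbor search. The source does not state which patch features or distance are used, so the sketch assumes precomputed feature vectors and cosine similarity; `retrieve_nearest` is a hypothetical helper.

```python
import numpy as np

def retrieve_nearest(query, gallery, k=3):
    """Return indices and similarities of the k gallery rows closest to `query`.

    query: (D,) feature of the selected region, gallery: (N, D) patch features.
    Cosine similarity is an assumption made for this sketch.
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                       # cosine similarity to every gallery patch
    order = np.argsort(-sims)[:k]      # highest similarity first
    return order, sims[order]

# Toy 2-D features for three gallery patches.
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.0])
indices, scores = retrieve_nearest(query, gallery, k=2)
```

The retrieved patches would then serve as the reference images passed to the example-based inpainting model.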

Web crawl. In order to achieve greater alignment with authentic user needs and enhance the overall user experience, we gather genuine user requests along with the corresponding outcomes delivered by seasoned Photoshop professionals, collected from the web. To ensure the accuracy and relevance of the data, we search in Google using the keyword “photoshop request”. This approach enables us to amass a substantial dataset comprising over 23,000 data triplets, which further aids in refining our understanding of user requirements and reduces the domain gap between training and inference.

In order to guarantee the quality of the training data, we further utilize image quality assessment tools to eliminate substandard data. We apply Aesthetics Score and GIQA as image quality evaluation metrics, using LAION-Aesthetics-Predictor for the Aesthetics Score and constructing a KNN-GIQA model on LAION-600M images for calculating GIQA scores. We exclude two categories of data: i) samples whose target images have low quality scores, and ii) samples with a significant discrepancy in quality scores between the source image and its corresponding target image. Our findings indicate that this data-filtering process is of vital importance.
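The two exclusion rules can be expressed as a simple predicate. The thresholds `min_target` and `max_gap` below are hypothetical placeholders, since the source does not report the cutoffs, and the scores are assumed to be precomputed by the quality predictors.

```python
def keep_sample(source_score, target_score, min_target=5.0, max_gap=1.0):
    """Return True if a (source, target) pair passes both quality rules.

    Rule i): drop pairs whose target image scores below `min_target`.
    Rule ii): drop pairs whose source and target scores differ by more
    than `max_gap`. Both thresholds are illustrative assumptions.
    """
    if target_score < min_target:
        return False
    if abs(source_score - target_score) > max_gap:
        return False
    return True

# (source_score, target_score) pairs with made-up values.
pairs = [(6.0, 6.2), (6.0, 4.0), (8.0, 5.5)]
kept = [pair for pair in pairs if keep_sample(*pair)]
```

In this toy run only the first pair survives: the second fails rule i) and the third fails rule ii).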

3.3 Unified Framework

Our framework is based on diffusion, as diffusion models have experienced significant success in modeling complex image distributions. As illustrated in Figure 2, our training procedure comprises three stages: pretraining adaptation, task-specific training, and instruction tuning.

Pretraining adaptation. Stable Diffusion (SD) is recognized as one of the most robust open-source text-to-image models currently accessible, prompting our decision to utilize Stable Diffusion v1.5 as the foundation for our work. Initially, stable diffusion operates as a mapping mechanism that converts textual captions into natural images. However, our desired images might encompass segmentation masks or keypoint indicators, which substantially deviate from typical natural images. Consequently, our preliminary phase involves fine-tuning the stable diffusion model and adjusting the diffusion output distribution.

Since we require diffusion models to be capable of generating images “with a foreground mask” or “with some special mark”, we employ existing segmentation or keypoint detection datasets to produce such data. The remaining challenge lies in the development of suitable captions that accurately depict these images while maintaining the intrinsic text-to-image generation capability. This is achieved by augmenting the original image caption with a suffix, such as “with a few different color patches here and there” or “surrounded with a red circle.” By fine-tuning the diffusion model with these modified image captions, we can theoretically empower the model to generate any images within the desired output domain.
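Caption augmentation of this kind reduces to appending a task-dependent suffix. The two suffixes below are the ones quoted above; the mapping keys, the helper `adapt_caption`, and joining with a single space are assumptions for illustration.

```python
# Suffixes quoted in the text; the task keys are hypothetical labels.
SUFFIXES = {
    "segmentation": "with a few different color patches here and there",
    "keypoint": "surrounded with a red circle",
}

def adapt_caption(caption, task):
    """Append the task-specific suffix to the original image caption."""
    return f"{caption.rstrip(' .')} {SUFFIXES[task]}"

adapted = adapt_caption("a dog running on the beach.", "segmentation")
```

The original caption is kept intact at the front, so the text-to-image prior still sees a natural description of the scene.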

Task-specific training. In the second stage, our goal is to further fine-tune the diffusion model, enhancing its comprehension of various instructions for different tasks. We follow InstructPix2Pix and inject source images by concatenating them with the noise input, subsequently expanding the input channels of the first layer. We train our model using all data across the various tasks. Since the amount of data for each task is quite different, in order to maintain a balance, we manually set different sampling weights for different databases. The number of effective training samples used for different tasks is shown in Table 1. For a data triplet $\{s_{i},c_{i},t_{i}\}$, the diffusion process adds noise to the encoded latent $z=\mathcal{E}(t_{i})$, producing a noisy latent $z_{t}$. We fine-tune the diffusion network $\epsilon_{\theta}$ by minimizing the following latent diffusion objective:
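Under the standard noise-prediction form of latent diffusion with image and text conditioning, as used by InstructPix2Pix, this objective can be written as below. This is a reconstruction from the symbols defined above, given as an assumption rather than a verbatim formula; note that $t$ without a subscript denotes the diffusion timestep, whereas $t_{i}$ is the target image.

$$
\mathcal{L}=\mathbb{E}_{(s_{i},c_{i},t_{i}),\;\epsilon\sim\mathcal{N}(0,1),\;t}\left[\left\lVert\epsilon-\epsilon_{\theta}(z_{t},t,s_{i},c_{i})\right\rVert_{2}^{2}\right]
$$

Here $\epsilon$ is the Gaussian noise added to $z$ to obtain $z_{t}$, and $\epsilon_{\theta}$ is trained to predict it given the source image $s_{i}$ and the instruction $c_{i}$.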

Human alignment. To further improve the quality of editing, we follow the idea of instruction tuning from large language models (LLMs). In the LLM literature, instruction tuning is used to teach the model to solve a task by following the instruction. However, we conduct instruction tuning differently from LLMs. For each sample in the benchmark, we generate different editing results using $20$ different classifier-free guidance settings. Then, we ask subjects to select the best 0-2 edited images to form the instruction-tuning dataset. The whole dataset contains $1,000$ images. We use this dataset to further fine-tune our model for about 10 epochs.

Experiments

Training samples. Our model is trained on samples consisting of {instruction, source image, target image}, encompassing the aforementioned vision tasks, i.e., keypoint detection, semantic segmentation, referring segmentation, image enhancement (denoising, deblurring, and watermark removal), and image editing.

For keypoint detection, we adopt four classical datasets: COCO, containing $149$K images each labeled with $17$ keypoints; CrowdPose, consisting of $35$K images each with $14$ keypoints; MPII, with $22$K images labeled with $16$ keypoints; and AIC, including $378$K images annotated with $14$ keypoints. During training, for each image we randomly select between 1 and 5 keypoints and assign each a random color. The instruction is produced from templates filled with the keypoint class and the chosen color, and the target image is generated by placing small circles on the chosen keypoints, each circle taking the color assigned to its keypoint.

For segmentation, we select COCO-Stuff as the semantic segmentation training dataset, and gRefCOCO and RefCOCO as the referring segmentation training datasets. We collect a series of prompt templates with the help of large language models to serve as text instructions. An example is “place a color mask on object.” During training, we randomly select a color for “color” and replace “object” with the corresponding category name in semantic segmentation or the referring expression in referring segmentation. The target image is generated by placing a mask of the corresponding color, with a transparency of 0.5, over the object.

For image enhancement, we focus on three tasks: deblurring, denoising, and watermark removal. We use the GoPro dataset, containing 2103 images, and the REDS dataset, with 24,000 images, for deblurring; the SIDD dataset, composed of 320 images, for denoising; and the CLWD dataset, containing 60,000 images, for watermark removal.

Lastly, for image editing, as mentioned in Sec. 3.2, we adopt 7 editing datasets: the filtered InstructPix2Pix dataset containing 561K samples, 8K samples from the MagicBrush training dataset, GIER with 5K samples, the GQA inpainting dataset with 131K samples, VGPhraseCut composed of 85K samples, our generated dataset with 51K produced samples, and an internal dataset representing real editing scenarios, which contains 23K training triplets. To ensure balanced sampling across tasks, we use distinct sampling weights, since the number of training images varies considerably across datasets. Table 1 lists the number of effective training samples used in our framework.
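Weighted sampling across datasets can be sketched as follows. The dataset names and weights in `WEIGHTS` are made-up placeholders, since the actual per-dataset weights are set manually and are not listed here; `sample_dataset` is a hypothetical helper.

```python
import random
from collections import Counter

# Illustrative weights only; they are NOT the values used for training.
WEIGHTS = {"keypoint": 2.0, "segmentation": 1.0, "editing": 1.0}

def sample_dataset(weights, rng):
    """Choose the dataset for the next training sample in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)
draws = Counter(sample_dataset(WEIGHTS, rng) for _ in range(1000))
```

Sampling by weight rather than by raw dataset size prevents the largest datasets from dominating each batch.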

Implementation details. We utilize Stable Diffusion v1.5 as initialization to leverage a text-to-image generation prior. Input images are preprocessed to a resolution of $256\times 256$, and the learning rate is fixed to $1\times 10^{-4}$ during training. In addition, we adopt an EMA rate of 0.9999 to stabilize the training. Our model is trained using a batch size of 3072 for a total of 200 epochs, which requires approximately 4 days of computation on 48 NVIDIA V100 GPUs. Once trained, our model is readily applicable and can be directly used for different vision tasks. For each task, we provide a comprehensive comparison and an in-depth analysis of its performance in the subsequent sections. During the human alignment stage, we use an EMA rate of 0.99 to help the model quickly adapt to the instruction-tuning dataset.
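The role of the EMA rate can be seen from the usual exponential moving average update of the weights, sketched below with NumPy arrays standing in for model parameters. The update rule is the standard one and is assumed here; `ema_update` is a hypothetical helper.

```python
import numpy as np

def ema_update(ema_params, params, decay=0.9999):
    """Return new EMA weights: decay * old_ema + (1 - decay) * current."""
    return {
        name: decay * ema_params[name] + (1.0 - decay) * params[name]
        for name in params
    }

ema = {"w": np.array([1.0])}
current = {"w": np.array([0.0])}
slow = ema_update(ema, current, decay=0.9999)  # rate used in the main training
fast = ema_update(ema, current, decay=0.99)    # rate used during human alignment
```

With a decay of 0.99 each step moves the averaged weights one hundred times further toward the current weights than with 0.9999, which is why the lower rate adapts more quickly to a small tuning dataset.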

4.2 Keypoint Detection

We evaluate our model both in the closed-set scenario, using the COCO validation dataset, and for open-set generalization on two unseen datasets: the HumanArt dataset and the AP-10K animal dataset. The HumanArt dataset is an artificial human dataset comprised of various forms such as cartoons, shadow plays, and murals, exhibiting a distinct data distribution compared to the COCO dataset. The AP-10K animal dataset is a collection of annotated animal keypoints, which effectively highlights the ability of our model to handle animal keypoints despite being trained only on human keypoint datasets. To enable a more detailed and thorough evaluation, it is essential to extract accurate pose coordinate information, namely precise horizontal and vertical coordinates, rather than simply marking the location with a distinct symbol. To achieve this, we employ a lightweight U-Net structure that post-processes the output image to generate a multi-channel heatmap. We employ the standard $AP$ (average precision) based on the OKS as our evaluation metric. Additionally, we utilize the ground truth bounding boxes for all results. Notably, for the AP-10K animal dataset, in order to facilitate comparison with other methods, the OKS is calculated exclusively on the keypoints that overlap with the COCO annotated joints. However, it should be noted that our model possesses the capability to detect keypoints beyond the confines of the training dataset.
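For reference, the object keypoint similarity (OKS) underlying the $AP$ metric can be computed as in the sketch below, which follows the COCO definition. The per-keypoint constants `k` and the toy coordinates are placeholders, and `oks` is a hypothetical helper rather than the evaluation code used for the reported numbers.

```python
import numpy as np

def oks(pred, gt, visible, area, k):
    """Object keypoint similarity between predicted and ground-truth keypoints.

    pred, gt: (K, 2) arrays of (x, y); visible: (K,) visibility flags;
    area: object area in pixels; k: (K,) per-keypoint falloff constants.
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)                      # squared distances
    sim = np.exp(-d2 / (2.0 * area * k ** 2 + np.spacing(1)))  # per-keypoint score
    vis = visible > 0
    if not vis.any():
        return 0.0
    return float(sim[vis].mean())                              # average over labeled keypoints

gt = np.array([[10.0, 10.0], [20.0, 20.0]])
pred = gt + np.array([[2.0, 0.0], [2.0, 0.0]])   # each keypoint is off by 2 pixels
k = np.array([0.1, 0.1])
score = oks(pred, gt, visible=np.array([1, 1]), area=100.0, k=k)
```

A prediction is counted as correct when its OKS exceeds a threshold, and $AP$ averages precision over a range of such thresholds.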

The results of keypoint detection are presented in Table 2. Our approach outperforms other generalist models, Unified-IO and Painter, across all evaluated datasets. In particular, we demonstrate a significantly higher level of performance on HumanArt and AP-10K, indicating the powerful generalization ability of our framework. In comparison to methods specifically designed for keypoint detection, our unified model does not exceed their performance due to localization accuracy limitations. However, it showcases exceptional performance on the entirely unseen animal keypoints dataset, AP-10K. Figure 3 (a-c) displays our results for car and animal keypoint detection. Our model can accurately detect the logo of the car and the keypoints of animals that have never appeared in the keypoint detection training dataset. Figure 3 (d) demonstrates our capability for referring keypoint detection, showcasing our versatile detection abilities.

4.3 Segmentation

Our primary focus lies in assessing the open-vocabulary capability of our model, particularly when evaluating images that contain unseen classes not present during the training phase. Therefore, besides the COCO-Stuff, gRefCOCO, and RefCOCO datasets, we conduct evaluation on eight additional datasets, i.e., RefCOCO+, G-Ref, and RefClef for referring segmentation, and ADE20K-150, ADE20K-847, Pascal Context-59, Pascal Context-459, and Pascal VOC for semantic segmentation. Similar to keypoint detection, we employ a lightweight U-Net structure that post-processes the output image to extract the binary mask of each individual object. Adhering to the prevailing convention, we adopt cumulative IoU (cIoU) to measure the performance for referring segmentation. On the other hand, our approach involves predicting a mask for each semantic category individually. As a result, semantic segmentation can also be perceived as referring segmentation based on semantic categories. Thus, we choose to utilize the mean of class-wise cumulative intersection over union (mcIoU) to quantify the performance of semantic segmentation.
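The two metrics can be sketched as follows, assuming cIoU is the total intersection divided by the total union accumulated over all samples, and mcIoU is the average of the cIoU computed separately for each class. The function names and the toy masks are illustrative.

```python
import numpy as np

def cumulative_iou(preds, gts):
    """cIoU: summed intersection over summed union across all mask pairs."""
    inter = sum(int(np.logical_and(p, g).sum()) for p, g in zip(preds, gts))
    union = sum(int(np.logical_or(p, g).sum()) for p, g in zip(preds, gts))
    return inter / union if union > 0 else 0.0

def mean_classwise_ciou(preds_by_class, gts_by_class):
    """mcIoU: mean over classes of the cIoU computed within each class."""
    scores = [
        cumulative_iou(preds_by_class[c], gts_by_class[c]) for c in gts_by_class
    ]
    return sum(scores) / len(scores)

# Two toy 2x2 mask pairs.
p1 = np.array([[1, 1], [0, 0]], dtype=bool)
g1 = np.array([[1, 0], [0, 0]], dtype=bool)
p2 = np.ones((2, 2), dtype=bool)
g2 = np.ones((2, 2), dtype=bool)
ciou = cumulative_iou([p1, p2], [g1, g2])                          # (1 + 4) / (2 + 4)
mciou = mean_classwise_ciou({"dog": [p1], "sky": [p2]}, {"dog": [g1], "sky": [g2]})
```

Unlike a per-image mean IoU, the cumulative form weights each sample by its object size, so large objects contribute more to the score.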

Table 3 reports the results for referring segmentation. To the best of our knowledge, Unified-IO stands as the sole generalist model with the capability to perform referring segmentation. It can be seen that our model largely outperforms Unified-IO across almost all datasets. We also present methods that are specially designed for referring segmentation. Interestingly, our approach achieves an unexpectedly significant improvement on the RefClef dataset. Table 4 presents the quantitative comparison results of semantic segmentation. Both the specialized models and our model have undergone training exclusively on the COCO-Stuff dataset. It is evident that our model not only surpasses specialized models in the closed-set scenario, specifically the COCO-Stuff dataset, but also achieves comparable performance across other datasets that represent open-set scenarios. Notably, in the case of the VOC dataset, we observe a substantial improvement. When compared to generalist models, our approach outperforms other competitors by a considerable margin, except in the case of Painter on the ADE20K-150 dataset. This is largely attributable to Painter being specifically trained on this dataset. Interestingly, we notice that both Painter and PromptDiffusion lack awareness of the colors associated with unseen categories during evaluations in open-set scenarios. This is due to their reliance on example images to instruct the model regarding the color corresponding to each semantic category. In contrast, our approach establishes the color corresponding to each semantic category through text instructions, resulting in significantly superior performance. Figure 4 illustrates several visual examples for referring segmentation to demonstrate our model’s capability.

4 Image Enhancement

We evaluate the low-level vision performance of our model on widely used benchmarks: GoPro, SIDD, and CLWD for deblurring, denoising, and watermark removal, respectively. The standard PSNR metric is adopted to measure the difference between the processed output image and the ground truth image. We evaluate our model’s deblurring capability on the GoPro benchmark at 1280 $\times$ 720 resolution, while for SIDD and CLWD, evaluations are conducted at 256 $\times$ 256 resolution to align with other works. The numerical comparison is reported in Table 5. We make the following observations. Firstly, specialized models trained for image editing tasks tend to generalize poorly when applied to image enhancement tasks. Secondly, the generalist model Painter performs better in the denoising task but struggles to integrate image editing tasks through in-context learning. Lastly, the performance of our model in image enhancement is constrained by the VAE model, which introduces information loss. To quantify this, we feed the ground truth image through the VAE model and calculate the PSNR of its output, which serves as an upper bound for our model and is indicated in parentheses.
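For reference, PSNR can be computed as in the following minimal sketch, which assumes 8-bit images with a peak value of 255. The function name and the `max_value` parameter are illustrative and do not come from our evaluation code.

```python
import numpy as np

def psnr(output, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    output = np.asarray(output, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((output - target) ** 2)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Under the same assumptions, the VAE upper bound described above corresponds to calling `psnr` on the VAE reconstruction of the ground truth image and the ground truth image itself.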

We also present some “in-the-wild” visual results in Figure 5 to qualitatively show our model’s real-world applicability in low-level vision tasks. We can observe that the resulting images have been effectively processed in line with the provided instruction, which includes sharpening, denoising, and watermark removal.

5 Image Editing

To better demonstrate the editing quality of our method, we build a benchmark containing 1,000 samples. Each sample contains the source image, the caption of the source image provided by BLIP2, the editing instruction, and the target caption of the edited image. We manually classify each sample into one of three distinct editing scenarios: replacement, removal, and addition. This categorization aims to provide a more fine-grained view of the model’s editing capabilities. We adopt two commonly used metrics, CLIP similarity (CLIP-Sim) and Aesthetic Predictor’s Score (AP), to evaluate the editing results. CLIP-Sim measures the semantic similarity between an image and a text. We utilize BLIP2 to obtain the caption of the input image and invoke GPT-3.5-turbo to acquire the target caption of the edited image. The CLIP-Sim score is calculated between the edited image and the target caption. The AP score assesses the aesthetic quality of the generated images, following a methodology akin to LAION-5B, which employs the CLIP+MLP Aesthetic Score Predictor. A higher AP score reflects better perceptual quality.
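As a rough sketch of the CLIP-Sim computation, the score can be taken as the cosine similarity between the CLIP embedding of the edited image and the CLIP embedding of the target caption. The code below assumes the two embeddings have already been extracted by a CLIP model, and it assumes a scaling factor of 100, which is a common reporting convention and not something stated in this section. The function name and the `scale` parameter are hypothetical.

```python
import numpy as np

def clip_sim(image_embedding, text_embedding, scale=100.0):
    """Scaled cosine similarity between an image and a text embedding.

    Both inputs are assumed to be 1-D CLIP feature vectors of equal length.
    """
    img = np.asarray(image_embedding, dtype=np.float64)
    txt = np.asarray(text_embedding, dtype=np.float64)
    cos = np.dot(img, txt) / (np.linalg.norm(img) * np.linalg.norm(txt))
    return float(scale * cos)
```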

We report the numerical results in Table 5. It is important to emphasize that none of the existing generalist models have the capability to perform image editing. Compared with specialized models, even with joint training, our model achieves a higher CLIP-Sim than Instruct-Pix2Pix and results on par with MagicBrush. When assessing the editing task, it is crucial to take into account not only semantic similarity and aesthetic quality but also factors such as background consistency and manipulation accuracy. Quantitative metrics are therefore of limited value for comparison. For instance, a model that fails to make substantial edits and produces an image that remains almost unchanged could potentially receive a higher score than models that have successfully carried out meaningful edits.

We further illustrate several visual examples in Figure 6, compared with competitive baseline methods that have shown impressive editing quality, including Instruct-Pix2Pix, MagicBrush, EDICT and Null-text Inversion. It is evident that our model effectively edits the image in line with the given instructions. For instance, by following the directives, our model can successfully eliminate magnets and stickers, convert a truck into a train, and transform a cat into a particular style. We showcase additional editing results of our model in Figure 7, further highlighting the precise editing quality achieved. Given a source image, our model is capable of successfully adding, removing, and replacing elements. The image is edited precisely according to the prompt while maintaining the integrity of the background and preserving intricate details.

6 The Benefit of Highly Detailed Instruction

Our hypothesis is that the ability to generalize comes from understanding the specific meanings of individual elements of an instruction rather than memorizing entire instructions. Unlike previous unified models such as Pix2seq and Unified-IO, which simply treat natural language as task indicators, our approach employs detailed descriptions of each task as instructions. Such detailed instructions enable the model to understand the task comprehensively and then prioritize accurate execution, whereas simple instructions favor mimicking. To show this, we replace the instructions in our framework with simpler task indicators, such as “semantic segmentation” and “keypoint detection,” while assigning fixed colors to each keypoint or object class. As demonstrated in Table 6, the results with simple instructions are extremely poor, especially when handling new types of keypoints or novel object categories. This highlights that our detailed instructions provide enhanced flexibility and adaptability in the open domain.

7 The Benefit of Multi-task Training

Multi-task learning has grown increasingly popular, enabling models to achieve greater generalization by concurrently addressing multiple related tasks during training. This approach often results in improved model generalization performance compared to specialized single-task training. To further provide empirical validation for this observation, we experiment with our model when trained only on the segmentation dataset and report the performance difference in Figure 8. This illustration compares the results of our single-task model and multi-task model evaluated over four unseen test datasets. It is evident that our jointly trained model performs significantly better in open-domain testing scenarios compared to the specific models. Furthermore, we observe that this benefit also extends to image editing. In Figure 9, the visual comparison demonstrates that when integrated with other tasks, the model can more effectively discern which objects require editing, potentially benefiting from the integration of referring segmentation.

8 The Benefit of Human Alignment

Our model undergoes a subsequent fine-tuning phase using a filtered dataset with human alignment. In this evaluation, we examine its effectiveness and present the fine-tuning progress in Figure 10, which shows the relationship between CLIP-Sim performance and the number of epochs. We observe a noticeable enhancement in image-text alignment: the CLIP-Sim score increases from 29.6 to 29.9 over approximately 10 epochs. This improvement is significant considering that the dataset consists of only 1,000 samples.

9 Generalization Capability to Unseen Tasks

We demonstrate that our model exhibits a degree of Artificial General Intelligence (AGI) capabilities by leveraging the wealth of tasks and diverse datasets through this highly detailed instruction-following format. We validate its capacity to handle tasks that were not part of its training repertoire, including image detection, classification, and even intricate fine-grained tasks like face alignment in Figure 11. In the context of detection and classification, we employ a prompt that resembles referring segmentation, enabling us to derive the bounding box coordinates by identifying the top, bottom, left, and right boundaries of the marked region. Moreover, we can also verify the class label using a versatile prompt structure. In the realm of face alignment, our approach involves directly instructing our model to encircle the specific facial region of interest, such as the nose or right ear. Remarkably, we have found that this approach performs admirably even when applied to animal faces. We argue that this underscores its versatility in adapting to new challenges beyond its initial training scope.
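The bounding box derivation described above can be sketched as follows. This minimal example assumes the marked region has already been extracted from the output image as a binary mask. The function name and the `(left, top, right, bottom)` return order are choices made for this illustration only.

```python
import numpy as np

def mask_to_bbox(mask):
    """Return (left, top, right, bottom) pixel indices of the marked region.

    Returns None when the mask contains no foreground pixels.
    """
    mask = np.asarray(mask, dtype=bool)
    rows = np.flatnonzero(mask.any(axis=1))  # rows containing foreground
    cols = np.flatnonzero(mask.any(axis=0))  # columns containing foreground
    if rows.size == 0:
        return None
    top, bottom = int(rows[0]), int(rows[-1])
    left, right = int(cols[0]), int(cols[-1])
    return left, top, right, bottom
```

The returned indices are inclusive, so the box width in pixels is `right - left + 1`.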

Discussion and conclusion

In conclusion, this paper presents InstructDiffusion, a novel and unifying framework for aligning computer vision tasks with human instructions. InstructDiffusion treats all computer vision tasks as image generation, with a focus on three types of output formats: 3-channel RGB images, binary masks, and keypoints. We demonstrated that our approach achieves good performance in individual tasks, and joint training of multiple tasks enhances the generalization ability. Remarkably, InstructDiffusion exhibits AGI capabilities to some extent, handling tasks not seen during training and outperforming previous methods on unseen datasets. This research marks a significant step towards a generalist modeling interface for vision tasks and sets the stage for future advancements in the pursuit of artificial general intelligence in computer vision.

In future work, we plan to focus on the following aspects to further improve the performance and capabilities of InstructDiffusion: 1) Improve the unified representation: We aim to explore alternative encoding schemes and techniques to better represent a more diverse range of outputs associated with various computer vision tasks. 2) Investigate the role of self-supervised and unsupervised learning: To enhance the generalization ability of InstructDiffusion, we will explore the use of self-supervised and unsupervised learning techniques to leverage large-scale unlabeled data for model training and adaptation.