OmniGen: Unified Image Generation

Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, Zheng Liu

cs.CV cs.AI

Introduction

The pursuit of Artificial General Intelligence (AGI) has intensified the demand for generative foundation models capable of handling a wide variety of tasks within a single framework. In the field of Natural Language Processing (NLP), Large Language Models (LLMs) have become exemplary in achieving this goal, demonstrating remarkable versatility across numerous language tasks such as question answering, text summarization, and code generation.

However, the field of visual generation has yet to reveal a counterpart that mirrors the universality of LLMs. Current image generation models have demonstrated proficiency in specialized tasks. For instance, in the text-to-image generation field, state-of-the-art models such as the Stable Diffusion series [56; 52; 13], DALL-E, and Imagen have made significant strides. Meanwhile, many efforts have been made to extend and optimize the capabilities of diffusion models for specific tasks. Models like ControlNet and T2I-Adapter design an additional network plugged into the text-to-image diffusion model to support visual conditions. InstructPix2Pix is trained on a comprehensive dataset tailored for image editing tasks. Despite their strengths, these models are limited by their task-specific nature and do not exhibit the comprehensive perceptual understanding and generative capabilities required for a universal model in visual generation.

Is it possible to address various image generation tasks, such as text-to-image, image editing, controllable generation, and image restoration, within a single diffusion framework, akin to how GPT handles language tasks? If a universal model is available, the need for training additional modules (e.g., ControlNet, IP-Adapter, T2I-Adapter) in practical applications can be eliminated. Motivated by this potential, we explore a unified framework for image generation, named OmniGen.

Unlike popular diffusion models, OmniGen features a very concise structure, comprising only two main components: a VAE and a transformer model, without any additional encoders. OmniGen supports arbitrarily interleaved text and image inputs as conditions to guide image generation, rather than text-only or image-only conditions. To train a robust unified model, we construct the first large-scale unified image generation dataset X2I, which unifies various tasks into one format. Additionally, we incorporate several classic computer vision tasks such as human pose estimation, edge detection, and image deblurring, thereby extending the model’s capability boundaries and enhancing its proficiency in complex image generation tasks. We evaluate our model on multiple benchmarks, demonstrating its competitive text-to-image generation capabilities compared to existing models. Furthermore, our model inherently supports various image generation tasks, such as image editing, visual conditional generation, and subject-driven generation, which are beyond the reach of current diffusion models. Remarkably, the design of OmniGen allows for robust transfer learning across different scenarios, facilitating the handling of previously unseen tasks and domains, as well as giving birth to emerging abilities. Our contributions are summarized below:

We introduce OmniGen, a unified model for image generation that excels in multiple domains. OmniGen demonstrates competitive text-to-image generation capabilities and inherently supports a variety of downstream tasks such as controllable image generation and subject-driven generation. Furthermore, it is capable of performing classic computer vision tasks. To the best of our knowledge, OmniGen is the first image generation model to achieve such a comprehensive level of functionality.

We construct a comprehensive image generation dataset named X2I, which stands for "anything to image". This dataset includes a wide range of image generation tasks, all standardized into a unified format.

By unified training on the multi-task dataset, OmniGen can apply learned knowledge to tackle unseen tasks and domains, as well as exhibit new capabilities. Additionally, OmniGen shows a degree of reasoning capability.

The remainder of this paper is organized as follows: Section 2 details the model architecture, while Section 3 describes the dataset construction. Section 4 presents the model’s performance on various image generation tasks. In Section 5, we analyze the model’s emerging capabilities and reasoning abilities, and also explore the potential applications of CoT in image generation. Section 6 discusses the current limitations of the model. Section 7 reviews related work.

OmniGen

In this section, we present the details of the OmniGen framework, including the model architecture and training method.

Principles. Current diffusion models are typically limited to common text-to-image tasks and cannot perform a broader range of downstream image generation tasks. For real-world applications, users often need to design and integrate additional network structures to extend the capabilities of diffusion models, making the models highly cumbersome. Even worse, these additional networks are usually task-specific and cannot be reused for other tasks unless more networks are designed and trained for different functions. To circumvent these issues, the design principles of OmniGen are as follows: 1) Universality: accepting any form of image and text inputs for various tasks; 2) Conciseness: avoiding overly complex structural designs and numerous additional components.

Network Architecture. As illustrated in Figure 2, the OmniGen framework adopts an architecture composed of a Variational Autoencoder (VAE) and a pre-trained large transformer model. Specifically, the VAE extracts continuous visual features from images, while the transformer model generates images based on input conditions. In this paper, we use the VAE from SDXL and freeze it during training. We use Phi-3 to initialize the transformer model, inheriting its excellent text processing capabilities. Unlike state-of-the-art diffusion models that require additional encoders to pre-process conditional information (such as a CLIP text encoder and an image encoder), OmniGen encodes conditional information by itself, significantly simplifying the pipeline. Furthermore, OmniGen jointly models text and images within a single model, rather than modeling different input conditions independently with separate encoders as in existing works [67; 68; 70; 63; 9], which lack interaction between conditions of different modalities.

Input Format. The input to the model can be free-form multimodal interleaved text and images. We use the Phi-3 tokenizer to process text without any modifications. For images, we first employ a VAE with a simple linear layer to extract latent representations. These are then flattened into a sequence of visual tokens by linearly embedding each patch in the latent space. Following prior work, we apply standard frequency-based positional embeddings to the input visual tokens, and use the same method as SD3 to process images with varying aspect ratios. In addition, we encapsulate each image sequence with two special tokens, “” and “”, before inserting it into the text token sequence. We also add the timestep embedding at the end of the input sequence.
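As a rough illustration of the flattening step, the sketch below splits a latent of shape channels x height x width into non-overlapping patches and emits one token per patch. The patch size of 2, the function name `patchify`, and the nested-list representation are assumptions made for this example; the learned linear embedding and the positional embeddings are omitted.

```python
from typing import List

Latent = List[List[List[float]]]  # indexed as latent[channel][row][col]


def patchify(latent: Latent, patch: int) -> List[List[float]]:
    """Flatten a latent into a row-major sequence of patch tokens.

    Each token concatenates the values of one patch x patch window across
    all channels, so its length is channels * patch * patch.
    """
    channels = len(latent)
    height = len(latent[0])
    width = len(latent[0][0])
    if height % patch or width % patch:
        raise ValueError("latent height and width must be divisible by the patch size")

    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = [
                latent[c][top + dy][left + dx]
                for c in range(channels)
                for dy in range(patch)
                for dx in range(patch)
            ]
            tokens.append(token)
    return tokens


# Toy single-channel 4x4 latent holding the values 0..15 (hypothetical data).
latent = [[[float(row * 4 + col) for col in range(4)] for row in range(4)]]
tokens = patchify(latent, patch=2)
print(len(tokens), tokens[0])
```

In the actual model each token would additionally pass through the linear embedding before entering the transformer; the sketch only shows how a 2D latent becomes a 1D token sequence.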

Attention Mechanism. Text can be decomposed into discrete tokens for modeling, but we argue that images should be modeled as a whole. Therefore, we modify the common causal attention mechanism in LLMs, integrating it with bidirectional attention as illustrated in Figure 2. Specifically, we apply causal attention to each element in the sequence, but apply bidirectional attention within each image sequence. This allows each patch to attend to other patches within the same image, while ensuring that each image can only attend to other images or text sequences that have appeared previously.
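The mask described above can be sketched as follows. This is a minimal illustration rather than the released implementation: the function name `build_attention_mask` and the half-open `(start, end)` span convention are assumptions made for the example.

```python
from typing import List, Tuple


def build_attention_mask(seq_len: int, image_spans: List[Tuple[int, int]]) -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    image_spans holds half-open (start, end) index ranges, one per image.
    """
    # Standard causal mask: every position sees itself and everything before it.
    mask = [[k <= q for k in range(seq_len)] for q in range(seq_len)]
    # Inside each image span, allow full bidirectional attention.
    for start, end in image_spans:
        for q in range(start, end):
            for k in range(start, end):
                mask[q][k] = True
    return mask


# Toy sequence: 2 text tokens, one 3-token image, then 1 text token.
mask = build_attention_mask(6, [(2, 5)])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Rows 2 to 4 (the image tokens) can see each other and the preceding text, but not the trailing text token, while the text rows remain strictly causal.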

Inference. During inference, we sample Gaussian noise and then apply the flow matching method to predict the target velocity, iterating over multiple steps to obtain the final latent representation. Finally, we use the VAE to decode the latent representation into the predicted image. The default number of inference steps is 50. Thanks to the attention mechanism, OmniGen can accelerate inference like LLMs by using a KV cache: storing the previous and current key and value states of the input conditions on the GPU to compute attention without redundant computation.
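The iterative sampling loop can be illustrated with a plain Euler integrator. This sketch makes several assumptions: time runs from 0 (pure noise) to 1 (data), steps are uniform, and `toy_velocity` is a hypothetical stand-in for the trained transformer that transports any point to a fixed target latent. The actual sampler and time schedule may differ.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 noise: Vector,
                 num_steps: int = 50) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with uniform Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical target latent used only to make the example checkable.
TARGET = [0.5, -1.0, 2.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    # Exact straight-line velocity toward TARGET; a real model would predict
    # this from the noised latent, the timestep, and the conditions.
    return [(target_i - xi) / (1.0 - t) for target_i, xi in zip(TARGET, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET]
latent = euler_sample(toy_velocity, noise, num_steps=50)
print(latent)
```

With the exact velocity field the loop lands on the target latent; with a learned model, the result would then be passed to the VAE decoder.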

Training Strategy

Training objective. In this work, we use rectified flow to optimize the model parameters. Unlike DDPM, flow matching conducts the forward process by linearly interpolating between noise and data along a straight line. At step $t$, $\mathbf{x}_{t}$ is defined as

$$\mathbf{x}_{t} = t\,\mathbf{x} + (1-t)\,\boldsymbol{\epsilon},$$

where $\mathbf{x}$ is the original data and $\boldsymbol{\epsilon}\sim\mathcal{N}(0,1)$ is Gaussian noise. The model is trained to directly regress the target velocity given the noised data $\mathbf{x}_{t}$, timestep $t$, and condition information $c$. Specifically, the objective is to minimize the mean squared error loss:

$$\mathcal{L} = \mathbb{E}\left[\left\| (\mathbf{x} - \boldsymbol{\epsilon}) - v_{\theta}(\mathbf{x}_{t}, t, c) \right\|^{2}\right].$$
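A single-sample estimate of this objective can be sketched as below. The sketch assumes the interpolation $\mathbf{x}_{t} = t\mathbf{x} + (1-t)\boldsymbol{\epsilon}$, so the regression target is the velocity $\mathbf{x} - \boldsymbol{\epsilon}$; `oracle_model` and `zero_model` are hypothetical stand-ins for the transformer, and the uniform timestep sampling is an assumption.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
VelocityModel = Callable[[Vector, float, object], Vector]


def rectified_flow_loss(model: VelocityModel,
                        x: Sequence[float],
                        c: object,
                        rng: random.Random) -> float:
    """One-sample Monte Carlo estimate of the velocity-regression MSE."""
    t = rng.random()                                   # timestep in [0, 1)
    eps = [rng.gauss(0.0, 1.0) for _ in x]             # Gaussian noise
    x_t = [t * xi + (1.0 - t) * ei for xi, ei in zip(x, eps)]  # noised data
    target = [xi - ei for xi, ei in zip(x, eps)]       # velocity d x_t / dt
    pred = model(x_t, t, c)
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(target)


DATA = [0.5, -1.0, 2.0]  # hypothetical clean latent


def oracle_model(x_t: Vector, t: float, c: object) -> Vector:
    # Cheats by knowing DATA, so it recovers the exact velocity x - eps.
    return [(d - xi) / (1.0 - t) for d, xi in zip(DATA, x_t)]


def zero_model(x_t: Vector, t: float, c: object) -> Vector:
    return [0.0 for _ in x_t]


oracle_loss = rectified_flow_loss(oracle_model, DATA, None, random.Random(0))
zero_loss = rectified_flow_loss(zero_model, DATA, None, random.Random(0))
print(oracle_loss, zero_loss)
```

The oracle reaches a loss of essentially zero, which confirms that the target in the sketch matches the time derivative of the interpolation.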

For image editing tasks, the objective is to modify specific regions of the input image while keeping other areas unchanged. Therefore, the difference between the input image and the target image is often small, which allows the model to learn an unexpected shortcut: simply copying the input image as the output to make the related training loss very low. To mitigate this phenomenon, we amplify the loss in the regions of the image where changes occur. More specifically, we calculate the loss weight for each region based on the latent representations of the input image $\mathbf{x^{\prime}}$ and the target image $\mathbf{x}$:

$$w_{i,j} = \begin{cases} 1 & \text{if } \mathbf{x}_{i,j} = \mathbf{x^{\prime}}_{i,j}, \\ \dfrac{1}{\left\|\mathbf{x} - \mathbf{x^{\prime}}\right\|^{2}} & \text{if } \mathbf{x}_{i,j} \neq \mathbf{x^{\prime}}_{i,j}. \end{cases}$$

Consequently, regions with alterations are assigned higher weights than those without changes, guiding the model to focus on the areas to be modified.
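The following sketch shows one way such a weighting could be computed. It assumes a weight of 1 for unchanged positions and the reciprocal of the squared distance between the two latents for changed positions; flat lists of scalars stand in for latent regions, and the function name `edit_loss_weights` is hypothetical.

```python
from typing import List, Sequence


def edit_loss_weights(x_src: Sequence[float], x_tgt: Sequence[float]) -> List[float]:
    """Per-position loss weights for an (input, target) latent pair.

    Unchanged positions keep weight 1; changed positions share the weight
    1 / ||x_tgt - x_src||^2 (an assumed form of the amplification).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_src, x_tgt))
    # If the two latents are identical there is nothing to amplify.
    changed_weight = 1.0 / sq_dist if sq_dist > 0.0 else 1.0
    return [1.0 if a == b else changed_weight for a, b in zip(x_src, x_tgt)]


# Toy latents: only the third position is edited.
x_src = [0.25, 0.5, 1.0, 0.75]
x_tgt = [0.25, 0.5, 0.5, 0.75]
weights = edit_loss_weights(x_src, x_tgt)
print(weights)
```

Because edits are usually small, the squared distance tends to be below 1, so edited positions receive weights above 1 and copying the input is no longer a cheap solution.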

Training Pipeline. Following previous work [13; 18; 6], we gradually increase the image resolution during training. Low resolution is data-efficient, while high resolution enhances the aesthetic quality of the generated images. Detailed information for each training stage is presented in Table 1. We adopt AdamW with $\beta=(0.9,0.999)$ as the optimizer. All experiments are conducted on 104 A800 GPUs.

X2I Dataset

To achieve robust multi-task processing capabilities, it is essential to train models on large-scale and diverse datasets. However, in the field of image generation, a readily available large-scale and diverse dataset has yet to emerge. In this work, we have constructed a large-scale unified image generation dataset for the first time, which we refer to as the X2I dataset, meaning "anything to image". We have converted these data into a unified format, and Figure 3 presents some examples from the X2I dataset. The entire dataset comprises approximately 0.1 billion images. We will provide a detailed description of the composition of this dataset in the following sections.

The input for this subset of data is plain text. We have obtained multiple open-source datasets from various sources: Recap-DataComp (a subset of 56M images), SAM-LLaVA, ShareGPT4V, LAION-Aesthetic (a subset of 4M images), ALLaVA-4V, DOCCI, DenseFusion, and JourneyDB. While these datasets are large, their image quality is not always high enough. In the early stages of training, we use them to learn a broad range of image-text matching relationships and diverse knowledge. After stage 3, we use our internal collection of 16 million high-quality images to enhance the aesthetic quality of the generated images. Many studies [13; 6] have demonstrated that synthetic detailed captions can greatly improve text-to-image models trained at scale. Therefore, we use InternVL2 to create synthetic annotations for the internal data and LAION-Aesthetic (the other datasets come with detailed text descriptions and do not require further annotation).

Multi-modal to Image

Different from most existing diffusion models, our model can accept more general and flexible multimodal instruction as conditions to guide the generation of images.

The input of this portion of data is arbitrarily interleaved text and images. We collect the data from multiple tasks and sources: image editing (SEED-Data-Edit, MagicBrush, and InstructPix2Pix), human motion (Something-Something), virtual try-on (HR-VITON and FashionTryon), and style transfer (StyleBooth). We standardize all tasks into the input-output pair format as shown in Figure 3-(b).

The use of additional visual conditions for finer-grained spatial control has garnered widespread attention [73; 33]. We employ the MultiGen dataset to learn this function, and select six representative visual conditions: Canny, HED, Depth, Skeleton, Bounding Box, and Segmentation. These tasks take text prompts and specific visual conditions (such as segmentation maps and human pose maps) as multi-modal inputs, then generate new images that comply with both the text and the image conditions.

Subject-driven Image Generation

We constructed both a large-scale foundational dataset (GRIT-Entity dataset) and a high-quality advanced dataset (Web Images dataset) for subject-driven image generation. For the GRIT-Entity dataset, we leveraged the GRIT dataset, which annotates object names within images. Using these annotations, we applied the Grounding DINO model for text-to-bounding-box grounding. Based on the bounding boxes, we employed SAM to segment the cropped images, obtaining object masks. We further used the MS-Diffusion model to repaint the object images, enhancing data quality. The data construction process and the final instruction format are illustrated in Figure 4-(a). Through this method, we acquired 6 million pairs.

Although the GRIT-based approach provides a substantial amount of data, the input data extracted directly from original images can lead the model to fall into simple copy-paste patterns. To fully unleash the subject-driven image generation capability of OmniGen, we constructed a high-quality web images training dataset using natural images of well-known individuals. First, we sampled 20 million Alt-text entries from the Datacomp dataset and used spaCy (https://github.com/explosion/spaCy) for named entity recognition. We selected the most frequently occurring names and employed GPT-4o to filter out real, notable individuals, resulting in 2,000 names. Furthermore, we expanded the initial 2,000 names by including closely related individuals, resulting in approximately 10,000 name pairs. We then scraped images of these individuals and pairs from search engines. Due to the noise in web images, where scraped images may not contain the specified individuals, we designed a cross-verification strategy using InternVL to filter single and group images, as detailed in Figure 4-(b). The retained single and group images were then captioned with details such as attire and actions. Through additional instruction-based annotations, we successfully constructed a dataset of 533,000 image pairs. We present some examples in Figure 3-(c).

Computer Vision Tasks

We introduce classic computer vision tasks to enhance the image generation capabilities of the model. For low-level vision tasks (low-light image enhancement, deraining, deblurring, inpainting, outpainting, and colorization), where the annotation itself is an image, we only add text instructions, which are randomly sampled from instructions generated by GPT-4o. For high-level tasks, we represent all annotations as images. We use LAION images as the source and their annotations as the target to construct image pairs (such as a source image and its human pose map). The annotations include human pose, depth map, Canny edges, and segmentation. Additionally, we use several datasets for referring image segmentation, including RefCOCO, ADE20k, and ReasonSeg. As shown in Figure 3-(c), the input is the source image and a natural language expression, and the output is an image with the corresponding object highlighted in blue.

The purpose of constructing these datasets is not merely to endow the model with these functionalities. We also aim to transfer the knowledge acquired from these traditional computer vision tasks to image generation tasks, thereby achieving more sophisticated image generation capabilities. Our experiments have also demonstrated that multi-task learning enables the model to exhibit emergent abilities.

Few-shot to Image

We constructed a few-shot to image dataset to stimulate the model’s in-context learning capabilities. Specifically, for each task described in the preceding sections, we randomly selected a few examples and combined the original input with these examples to form new inputs. The specific data format can be referenced in Figure 3-(e). Due to limitations in training resources, we opted to use only one example to enhance training efficiency.
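One way to assemble such a one-shot input is sketched below. The `Sample` structure, the prompt template ("Example: ... Output: ... Now: ..."), and the file names are all assumptions made for illustration; only the `|image_k|` placeholder style is taken from the instructions quoted in this paper.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    instruction: str          # text containing |image_k| placeholders
    input_images: List[str]   # image ids, in placeholder order
    output_image: str         # id of the target image


def shift_placeholders(text: str, offset: int) -> str:
    """Renumber every |image_k| placeholder to |image_{k+offset}|."""
    return re.sub(r"\|image_(\d+)\|",
                  lambda m: f"|image_{int(m.group(1)) + offset}|",
                  text)


def make_one_shot(example: Sample, query: Sample) -> Sample:
    """Prepend one solved example to a query, keeping placeholders consistent."""
    n_example_inputs = len(example.input_images)
    # The example's output image becomes an additional input image.
    demo = f"{example.instruction} Output: |image_{n_example_inputs + 1}|."
    shifted_query = shift_placeholders(query.instruction, n_example_inputs + 1)
    return Sample(
        instruction=f"Example: {demo} Now: {shifted_query}",
        input_images=example.input_images + [example.output_image] + query.input_images,
        output_image=query.output_image,
    )


example = Sample("Detect the edges of |image_1|.", ["a.png"], "a_edges.png")
query = Sample("Detect the edges of |image_1|.", ["b.png"], "b_edges.png")
one_shot = make_one_shot(example, query)
print(one_shot.instruction)
print(one_shot.input_images)
```

For a single-image task, one in-context example already yields three input images, which illustrates how each extra example lengthens the input sequence.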

Experimental Results

In this section, we show the results of OmniGen in image generation tasks and traditional vision tasks.

Figure 5 shows the results of the text-to-image task. It can be observed that OmniGen effectively follows the textual descriptions to generate images with arbitrary aspect ratios.

Figure 6 presents the outcomes of the subject-driven generation task. Our model can extract the required objects from the given reference images and generate new images accordingly. Furthermore, when the reference image contains multiple objects, the model can directly select the needed objects based on textual instructions (e.g., the cat in the figure) without requiring additional preprocessing steps such as image cropping or face recognition.

Figure 7 summarizes the results of other image generation tasks, demonstrating that the model can handle various downstream tasks based on multi-modal instructions.

Text to Image

Following prior work, we evaluate the text-to-image generation capability of OmniGen on the GenEval benchmark. We compare the performance of our model with the reported results of other popular image generation models, as summarized in Table 2. Surprisingly, our model achieves performance similar to that of current state-of-the-art diffusion models such as SD3, which underscores the effectiveness of our framework. Because the GenEval benchmark does not reflect the aesthetic quality of images, we leave this aspect for future evaluation.

Notably, our model has only 3.8 billion parameters, whereas the SD3 model has a total of 12.7 billion parameters (more than three times as many). Current diffusion models typically adopt an encoder-decoder architecture, using an additional encoder to encode textual conditions (this text encoder alone is larger than our entire model). In contrast, our model architecture is significantly simplified, eliminating the cost of an additional text encoder and thereby greatly improving parameter efficiency. Moreover, we used only 0.1 billion images, whereas SD3 used over 1 billion (more than ten times as many), highlighting the role of the multi-task X2I data in enhancing text-to-image capabilities.

Image Edit

We compare OmniGen with other state-of-the-art image editing models on the EMU-Edit dataset, which includes seven different operations: background alteration, comprehensive image changes, style alteration, object removal, object addition, localized modifications, and color/texture alterations. We measure three metrics: 1) CLIP-I: CLIP image similarity between the source image and the output image; 2) DINO: DINO similarity between the source image and the output image; and 3) CLIP-T: CLIP text-image similarity between the edited image and the target caption. The DINO and CLIP-I similarity scores measure the model’s ability to preserve elements from the source image, while CLIP-T measures how well the model follows the instructions. As shown in Table 2, our model significantly outperforms InstructPix2Pix and exhibits performance comparable to the current state-of-the-art model, EMU-Edit.
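All three metrics reduce to a cosine similarity between embeddings. The sketch below shows that computation on toy vectors; the embeddings are made-up placeholders standing in for the outputs of CLIP and DINO encoders, and `edit_metrics` is a hypothetical helper, not part of any benchmark script.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def edit_metrics(src_clip: Sequence[float], out_clip: Sequence[float],
                 src_dino: Sequence[float], out_dino: Sequence[float],
                 caption_clip: Sequence[float]) -> Dict[str, float]:
    """Compute the three editing metrics from precomputed embeddings."""
    return {
        "CLIP-I": cosine_similarity(src_clip, out_clip),     # source vs. output image
        "DINO": cosine_similarity(src_dino, out_dino),       # source vs. output image
        "CLIP-T": cosine_similarity(out_clip, caption_clip), # output image vs. target caption
    }


# Placeholder 2-D embeddings chosen so the results are easy to verify by hand.
scores = edit_metrics(
    src_clip=[1.0, 0.0], out_clip=[1.0, 0.0],
    src_dino=[1.0, 0.0], out_dino=[0.0, 1.0],
    caption_clip=[3.0, 4.0],
)
print(scores)
```

In practice these scores are averaged over the whole evaluation set, and an editing model must balance them: copying the source image maximizes CLIP-I and DINO but hurts CLIP-T.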

DreamBooth

We evaluate the single-entity subject-driven generation capability on DreamBench. DreamBench contains 750 prompts for 30 subjects (e.g., dog and toy). For each prompt, we generate 4 images, resulting in a comprehensive evaluation set of 3,000 images. Following Kosmos-G, we select only one image as input from the 4-7 provided images for each subject. We adopt DINO and CLIP-I from DreamBooth to assess subject fidelity, and CLIP-T to assess text fidelity. All results are summarized in Table 4. Compared to methods based on fine-tuning, our approach maintains a comparable level of text fidelity while better preserving the subject from the source image. Compared with models without fine-tuning, OmniGen significantly outperforms both Re-Imagen and Kosmos-G, and demonstrates superior subject fidelity relative to SuTI.

Visual Conditional Controls

Image-based prompts can provide detailed spatial conditioning controls for diffusion models. To evaluate this ability of OmniGen, we use the dataset and evaluation script from prior work. This benchmark includes the ADE20K test set for the segmentation mask condition, and the evaluation split of MultiGen-20M for the Canny edge map, HED edge map, and depth map conditions. For each condition, controllability is evaluated by measuring the similarity between the input condition and the condition extracted from the generated image. The experimental results are shown in Table 5. Our model achieves the best results on the segmentation mask and HED edge map conditions, and obtains competitive results on the Canny edge map and depth map conditions.

Computer Vision Tasks

We present several qualitative results of computer vision tasks in Figure 8. OmniGen can handle various low-level vision tasks such as deraining, deblurring, and inpainting. In the bottom of Figure 8, we can see that OmniGen is also able to handle high-level tasks, such as human pose recognition and depth estimation.

So far, we have demonstrated that our model can generate images well based on visual conditions while also extracting visual conditions from raw images. This motivates us to ponder: can we directly use the model to generate new images based on a reference image in only one step, instead of first using a processor to extract spatial condition information and then inputting it into the model for generation? Surprisingly, even without having encountered such a task before, OmniGen handles it admirably. As shown in Figure 9, the existing workflow for ControlNet involves using a detector to extract spatial condition information from the reference image, and then loading the corresponding control module to model the spatial condition information for image generation, which requires multiple network components and operations. Now, only based on our model, we can directly input the reference image and text instruction (e.g., Follow the depth mapping of this image to generate new image. The text description for new image is “…”) to generate an image in only one step without any additional intermediate procedures. It can be observed that the model comprehends the instruction well; when tasked with using the human pose from the reference image, it perfectly replicates the human pose, and when using depth mapping, it retains more details, such as the folds in clothing.

Further Analysis

LLMs demonstrate remarkable generalization capabilities, achieving impressive performance in previously unseen tasks and domains. Furthermore, they can boost performance through mechanisms such as in-context learning and chain of thought. We observe similar functionalities in OmniGen as well, and present our findings in this section.

By standardizing all tasks into a unified format and training on X2I dataset, OmniGen can acquire universal knowledge and allow knowledge transfer across different scenarios and tasks, thus enabling the generation capabilities on unseen tasks and domains. We illustrate several emerging capabilities using the following examples.

Task Composition. In real-world applications, user requirements often involve combinations of tasks. As shown in Figure 10-(a), our model is capable of simultaneously processing multiple instructions, including those for different tasks (Image inpainting and change the color of hair to white) as well as multiple instructions for the same task (Add a sunglasses to the man’s face, and change the color of clothes to blue). These results highlight our model’s versatility and potential for widespread adoption in the wild.

Implicit Combination of Tasks. In addition to explicit task combinations, our model is capable of performing multiple tasks implicitly through a single instruction. As demonstrated in Figure 9, upon receiving the input like “follow the human pose/depth mapping to generate the image: …”, our model can extract the relevant conditional information (such as human pose, depth mapping, etc.) from the reference image and generate a new image based on the captured condition. This process is implicit, with all processing completed internally within the model, thus only requiring the user to input a simple command. This negates the need for explicit conditional extraction using other models prior to input into the diffusion process, as is necessary with existing systems like ControlNet.

In-context Learning for Unseen Tasks and Domains. As illustrated in Figure 10-(b), by providing an example, the model is capable of successfully completing a novel task: generating images based on provided scribble data, which is not encountered during training. To explore whether in-context learning can boost existing abilities on new domains, we show several examples from the FSS dataset, which contains objects that have never been seen or annotated in previous datasets. In the left of Figure 10-(c), we can see that OmniGen is not familiar with the concepts of “pencil sharpeners” and “chess queens”, and it cannot identify them from images. However, when provided with an example, the model is capable of making accurate predictions, demonstrating that in-context learning can enhance the model’s generalization ability across different domains.

End-to-end Workflow. Users typically need to load multiple models and perform multi-step processing to ultimately generate a satisfactory image, making the workflow very cumbersome and costly. This complexity has led to the development of open-source tools and pipelines like ComfyUI (https://github.com/comfyanonymous/ComfyUI). Our model possesses both excellent multi-modal understanding and image generation capabilities, enabling it to complete many tasks without relying on external models, thereby significantly simplifying the workflow and reducing cost. For instance, as shown in Figure 6, users can specify particular objects within images containing multiple elements through textual instructions (“the cat is the one in |image_2|”), and generate new images based on these instructions without needing preliminary operations such as image cropping. As shown in Figure 9, OmniGen can directly generate images based on the conditional information contained in the reference image, without loading another model to preprocess the reference image.

Reasoning Ability

We have explored the reasoning capabilities of the model and presented the results in Figure 11. As shown in the left half of Figure 11, when given an instruction without explicitly specifying the object, such as “Where can I wash my hands? Please help me find the right place in |image_1|”, the model can recognize image contents and infer that a sink is needed. Consequently, the model identifies and indicates the area of the sink in the image. This functionality creates potential applications in the field of embodied intelligence, assisting intelligent agents in comprehending multi-modal instructions, locating necessary objects and planning subsequent actions. Moreover, the right half of Figure 11 demonstrates that after inferring the target object, the model can also perform editing operations on it. If no object matches, the model will refrain from editing any unrelated objects.

Chain of Thought

The Chain-of-Thought (CoT) method can significantly boost the performance of LLMs by decomposing the task into multiple steps and sequentially solving each step to obtain an accurate final answer. We consider whether a similar approach can be applied to image generation. Inspired by the basic way humans draw, we hope to mimic the step-by-step drawing process, iteratively refining the image from a blank canvas. We construct an anime image dataset and use the PAINTS-UNDO model (https://github.com/lllyasviel/Paints-UNDO) to simulate each stage of artwork creation. We select 8 representative frames to depict the gradual development of the final image. After filtering out inconsistent sequences, we fine-tune the model on this dataset for 16,000 steps.

The results are visualized in Figure 12, alongside the outputs generated by the original model. For step-by-step generation, the input data consists of the current step’s image and text, and then the model predicts the image for the next step. It can be observed that the fine-tuned model successfully simulates the behavior of a human artist: drawing the basic outline, incrementally adding details, making careful modifications, and applying colors to the image. In this manner, users can modify the previous results to control the current output, thereby participating more actively in the image generation process, rather than passively waiting for the final image with a black-box diffusion model. Unfortunately, the quality of the final generated images does not surpass that of the original model. In the step-by-step generation approach, the model may incorporate erroneous modifications, leading to some disarray in the final image. This does not imply that the approach is unfeasible; currently, we only conduct a preliminary exploration, leaving further optimizations for future research. Based on the findings of previous work on LLMs, which indicate that process supervision significantly outperforms outcome supervision, we posit that supervising the drawing process of images is a promising direction that may assist the model in handling more complex and diverse scenes.

Limitations and Discussions

Figure 13 illustrates several typical failure cases of the current model. We summarize the limitations of the current model as follows:

Similar to existing diffusion models, OmniGen is sensitive to text prompts. Typically, detailed text descriptions result in higher-quality images.

The current model’s text rendering capabilities are limited; it can handle short text segments but fails to accurately generate longer texts. Additionally, due to resource constraints, the number of input images during training is limited to a maximum of three, preventing the model from handling long image sequences.

The generated images may contain erroneous details, especially small and delicate parts. In subject-driven generation tasks, facial features occasionally do not fully align. OmniGen also sometimes generates incorrect depictions of hands.

OmniGen cannot process unseen image types (e.g., image for surface normal estimation).

We believe that most limitations can be addressed by training the model on more related data. Moreover, compared to most models, fine-tuning OmniGen for downstream tasks is simpler, as it inherently supports various image generation tasks without the need for extensive efforts and costs to build additional networks.

Related Work

The generative foundation model serves as the core of many contemporary artificial intelligence systems, revolutionizing the way machines interact with humans. The GPT series [54; 48] have demonstrated that language models can learn numerous tasks via training on a large-scale dataset. Following this trend, the rise of large language models (LLMs) [43; 3; 1] has further showcased their versatility, adeptly performing various tasks such as question answering, text summarization, and code generation within a single framework. Beyond language, multimodal large language models [39; 12] have been proposed to integrate vision and language capabilities. For example, as a typical architecture and popular trend, LLaVA equips the LLM with visual perception and understanding capabilities by linking the vision encoder to the LLM through a connector layer. These models have shown impressive performance in vision-language understanding tasks. However, despite their ability to handle mixed text and image inputs, they lack the capability to generate images. The construction of a universal foundation model for image generation remains unclear and has not been fully explored. In this work, we propose a universal generative model that accepts arbitrary interleaved multimodal inputs and generates images, marking a significant stride towards a general-purpose image generation foundation model.

Recently, some works have explored unified models that support both text and image generation. In Chameleon, images and texts are both tokenized into a token sequence and modeled via discrete autoregressive modeling. Concurrent works such as TransFusion and Show-O unify diffusion and autoregressive methods into a single model, generating text autoregressively and images through diffusion. Nonetheless, like most existing diffusion models, they can only perform text-to-image tasks and cannot handle more complex and diverse visual generation tasks. The unification of tasks in visual generation remains unexplored. Unlike these efforts, our current focus is on the unification of diverse visual generation tasks. OmniGen is capable of performing various tasks, including text-to-image generation, image editing, subject-driven generation, virtual try-on, image deblurring, human pose recognition, and more. To the best of our knowledge, this is the first model capable of unifying such a wide range of visual generation tasks. Building on this foundation, we plan to expand into text generation as the next step.

Diffusion Model

Recent advancements in diffusion models have been remarkable, with notable contributions from the Stable Diffusion series [56; 52; 13], DALL-E, and Imagen. These models are predominantly designed for text-to-image generation tasks. To facilitate visual-conditioned generation, approaches such as ControlNet and T2I-Adapter introduce supplementary networks integrated into existing text-to-image models, thereby enabling them to accommodate image-based conditions. StyleShot incorporates a style-sensitive encoder to manipulate the style features of the output images. InstructPix2Pix addresses image editing by augmenting the model with additional input channels. SEED-X and Kosmos-G employ an MLLM to replace the CLIP encoder in Stable Diffusion (SD), improving performance on specific downstream tasks. However, these methods are task-specific, extending the capabilities of SD by modifying the model architecture. In contrast, OmniGen is a model that natively supports various image generation tasks, unifying all tasks into a single framework. Multi-task learning enhances the model’s capabilities and also leads to the emergence of new abilities. Furthermore, when addressing various real-world tasks, OmniGen no longer requires any preprocessing steps or assistance from other models.

Some works have explored the unification of computer vision (CV) tasks [2; 65; 16; 21]. However, these efforts primarily focus on classic vision tasks and do not support general image generation tasks. Additionally, current models often underperform compared to those specifically designed and trained for the corresponding tasks, limiting their practical applications in real-world scenarios. In our work, the introduction of CV tasks plays a crucial role in enabling the model to learn general knowledge, thereby enhancing its image-generation capabilities and fostering the emergence of new abilities. For instance, incorporating the human pose estimation task has enabled the model to generate new images directly based on the pose of a reference image, without the need for an additional model to extract the human pose. At present, we do not pursue optimal scores on CV tasks and leave the fine-tuning of CV task performance for future research.