Towards Semantic Equivalence of Tokenization in Multimodal LLM

Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan

Introduction

Research on Multimodal Large Language Models (MLLMs) has recently garnered intense interest. By building upon the unprecedented intelligence of language-based LLMs and integrating multimodal encoders at the input side and decoders at the output side, current MLLMs have developed powerful multimodal capabilities. In the visual modality in particular, state-of-the-art (SoTA) MLLMs now cover all four major vision-language task groups, i.e., understanding, generating, segmenting, and editing. Central to this capability is the design of vision tokenization, which concerns how to effectively transform input visual signals into feature representations that are most conducive to being understood by LLMs. Existing vision tokenizers can be mainly categorized into two types: 1) tokenization via image patchifying (cf. Fig. 1(a)), and 2) tokenization via codebooks (cf. Fig. 1(b)).

While SoTA MLLMs have achieved promising performance across various tasks, a significant bottleneck remains in current visual tokenization methods: they yield insufficient semantic alignment between language and vision tokens. On the language side, linguistic tokens (or words) are naturally discrete and represent well-encapsulated semantic units, whereas, on the vision side, visual pixels are inherently continuous data with no physical boundaries. Intuitively, language tokens should correspond to semantically encapsulated objects (or compositional regions) within an image. For instance, when “a dog” is mentioned, the “dog” token should correspond to the exact pixel region of the dog in the image. However, as illustrated in Fig. 1(a&b), both existing tokenization methods, via image patchifying or via codebooks, divide the image into fixed square patches, causing objects within the image to be split across different patches and thus disrupting the integrity of visual semantic units. This semantic fragmentation may in turn lead to the loss of high-frequency visual information, e.g., object edges and lines. Ultimately, this misalignment between vision and language within MLLMs undermines the effective understanding of visual signals, significantly hindering progress in a range of vision-language tasks that require precise, fine-grained semantic alignment between vision and language elements.

To this end, this work proposes a Semantic-Equivalent Tokenizer (SeTok) for enhancing visual MLLMs, where we encourage the vision and language tokens to be semantically congruent. The core idea is to automatically group visual features from input images by applying a clustering algorithm, such that each unique cluster represents an encapsulated semantic unit within the image. As illustrated in Figure 1(c), the red visual area aggregated by SeTok corresponds to a complete semantic concept, “person”, while the yellow area corresponds to the “surface board” concept. Furthermore, we recognize that tokenizing images into a fixed number of patches is impractical. From a semantic perspective, different images contain varying numbers of semantically encapsulated objects, and the granularity of compositional regions also needs to be flexibly determined. For example, sometimes we only need to identify a person in the image, while at other times we may need to delineate the person’s head precisely. This implies that it is more reasonable to determine the division of visual tokenization dynamically. To address this, we propose a dynamic clustering mechanism that iteratively determines cluster centers based on density peaks and assigns visual features to these centers until all features are allocated. This design allows the number of concept visual tokens to be determined dynamically, rather than fixing the ratio or merely merging the top-$k$ visual tokens. After clustering, we devise a cluster merger that aggregates the visual features within each cluster and is dedicated to learning a complete visual semantic unit feature, including both high-frequency and low-frequency information.

We further build an MLLM equipped with our SeTok, named Setokim. Through pre-training and instruction-tuning on multimodal data across diverse tasks (e.g., text-to-image generation, image editing, image QA), our model evolves into a versatile multimodal generalist and, as extensive experiments demonstrate, excels in all task groups (see Figure 1(d)). Compared with other methods that focus on vision tokenization, in-depth analyses reveal that SeTok tokenizes vision input into more interpretable and semantically complete vision tokens, thereby enhancing vision-language alignment. Moreover, SeTok can be easily integrated into established MLLMs with minimal adaptation. Extensive experiments reveal that this integration significantly boosts task performance while maintaining stronger interpretability and reducing training costs and inference time.

Methodology

In this work, we aim to generate semantically complete vision tokens with minimal information loss that are aligned with text tokens, in order to facilitate fine-grained semantic interactions between vision and language and thereby enhance the performance of MLLMs in various multimodal tasks. In pursuit of this goal, we propose constructing a semantic-equivalent tokenizer, which takes visual features extracted from a well-trained vision encoder as input and outputs semantically complete visual tokens, as illustrated in Figure 2. With this vision tokenizer, the vision input can be tokenized into tokens that are semantically equivalent to the text, which are then fed into the LLM together with the text tokens.

where $\text{KNN}(\bm{x}_{i,j},\bm{X})$ denotes the $K$-nearest neighbors of $\bm{x}_{i,j}$ in $\bm{X}$. We then measure the minimal distance $\delta_{i,j}$ between the feature $\bm{x}_{i,j}$ and other features with higher density:

Finally, we compute the score $s_{i,j}$ of the feature by combining the local density $\rho_{i,j}$ and the minimal distance $\delta_{i,j}$ as $s_{i,j}=\rho_{i,j}\times\delta_{i,j}$. Based on the score, we select the location $(i,j)$ of the visual feature that has the highest score $s_{i,j}$ and has not yet been assigned to a cluster, and iteratively assign visual features to clusters until a stopping condition is satisfied, at which point the final scope is added as an additional mask to explain any remaining visual embeddings. The detailed process is described formally in Appendix $\S$C.
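The scoring step can be illustrated with a minimal NumPy sketch. The display equations for $\rho_{i,j}$ and $\delta_{i,j}$ are not reproduced in this text, so the sketch assumes the common density-peaks formulation: the local density is the exponential of the negative mean squared distance to the $K$ nearest neighbors, and the distance term is the smallest squared distance to any feature with higher density (the largest distance for the densest feature). The function name `density_peak_scores` and the default `k` are hypothetical.

```python
import numpy as np


def density_peak_scores(feats, k=5):
    """Return a (h, w) score map s = rho * delta for a (h, w, d) feature map.

    Assumed formulation (not taken verbatim from the paper):
      rho   = exp(-mean squared distance to the K nearest neighbours)
      delta = min squared distance to a feature with higher rho,
              or the max squared distance if no such feature exists.
    """
    h, w, d = feats.shape
    x = feats.reshape(-1, d)

    # Pairwise squared Euclidean distances, shape (n, n).
    diff = x[:, None, :] - x[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)

    # Local density from the K nearest neighbours (column 0 is the feature itself).
    knn = np.sort(dist2, axis=1)[:, 1:k + 1]
    rho = np.exp(-knn.mean(axis=1))

    # higher[a, b] is True when feature b is denser than feature a.
    higher = rho[None, :] > rho[:, None]
    delta = np.where(higher, dist2, np.inf).min(axis=1)

    # Features with no denser neighbour receive their maximal distance.
    no_higher = ~higher.any(axis=1)
    delta[no_higher] = dist2[no_higher].max(axis=1)

    return (rho * delta).reshape(h, w)
```

Features that are both dense and far from any denser feature receive the highest scores, which is why they are natural candidates for cluster seeds.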

Cluster Merger.

Multimodal Training Scheme

In this work, we consider a three-stage training procedure. It can be summarized as follows. The implementation details, like training data, can be found in Table 7 in Appendix $\S$ D.

This stage aims to endow the tokenizer with the capability of tokenizing the input vision into semantically complete and independent tokens that capture both low-frequency semantic features and high-frequency pixel features. To this end, we employ a vision decoder to decode realistic images from the tokenized visual tokens. Specifically, the vision tokens obtained from the tokenizer are fed into a learnable feature mapper (i.e., Q-former), whose outputs serve as the inputs of the U-Net $\epsilon_{\theta}$ to infill the visual details; this learnable module consists of four cross-attention layers that connect the visual tokenizer and the U-Net. Following prior work, the parameters $\theta$ of this U-Net are optimized by $\epsilon$ prediction:

where $\epsilon$ is the unscaled noise, $t$ is the sampling step, and $\bm{z}_{t}$ is the latent noise at step $t$. In addition, an accurate tokenizer should delineate the boundaries between tokens precisely, i.e., the clustering should be accurate. We observe that attention masks correlate with object masks but often fail to align precisely with actual object boundaries. We conjecture that this misalignment stems from the vision encoder’s tendency to spatially dilate information about objects, thus obscuring precise boundary delineation. Consequently, we incorporate a mask decoder that takes the vision tokens as input and decodes the object mask. Following prior work, we use a binary cross-entropy loss $\mathcal{L}_{bce}$ and a dice loss $\mathcal{L}_{dice}$ to compute the loss for masks. Moreover, we include an auxiliary mask consistency loss designed to encourage greater similarity between the cluster results and the object masks $\pi$, thereby enhancing their alignment.

where $\pi\in[0,1]^{h\times w\times k}$ is a set of object masks, and $\sum_{n}\pi_{i,j,n}=1$.
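As an illustration of the two mask losses named above, the following NumPy sketch computes a binary cross-entropy term and a dice term on a predicted soft mask. The smoothing constants, the equal weighting in `mask_loss`, and all function names are assumptions made for this sketch; the auxiliary mask consistency loss is not sketched here because its equation is not reproduced in this text.

```python
import numpy as np


def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy between a soft mask in [0, 1] and a binary mask."""
    p = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)).mean())


def dice_loss(pred, target, smooth=1.0):
    """Dice loss: 1 - (2 * overlap + smooth) / (|pred| + |target| + smooth)."""
    inter = (pred * target).sum()
    return float(1.0 - (2.0 * inter + smooth) / (pred.sum() + target.sum() + smooth))


def mask_loss(pred, target, w_bce=1.0, w_dice=1.0):
    """Weighted sum of both terms; the weights are hypothetical."""
    return w_bce * bce_loss(pred, target) + w_dice * dice_loss(pred, target)
```

The cross-entropy term penalizes per-pixel errors, while the dice term measures region overlap and is less sensitive to the imbalance between object and background pixels.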

Overall, the training of the tokenizer involves two sub-phases: 1) in the first sub-phase, we freeze the vision encoder and focus solely on training the learnable components, which include the cluster merger, the Q-former, and the keys and values within the U-Net. This sub-phase is conducted for 50k training steps using datasets such as CC3M and LAION-AESTHETICS. 2) In the second sub-phase, we integrate a mask decoder to encourage fine-grained object boundary learning. This extended training is performed on the MSCOCO 2017 and SAM-1B datasets.

Stage-II: Multimodal Pretraining.

In this phase, we enhance the LLM to possess interleaved understanding and generation capabilities for vision-language tasks. On the one hand, we adopt a next-token-prediction cross-entropy loss for textual content generation. On the other hand, we employ an embedding regression loss to train the LLM to generate visual tokens, which are trained to reconstruct the features of the pre-trained vision tokenizer with a Mean Squared Error (MSE) loss. We add two special tokens, ‘[Img]’ and ‘[/Img]’, to represent the beginning and the end of the query embeddings, and ‘[Img]’ is trained to predict where an image emerges. We pre-train the LLM, initialized from Vicuna-v1.5, on massive multimodal data, including 558K image-caption pairs from the LLaVA-filtered CC3M dataset and 695K sampled GPT-4V-responded captions from the ALLaVA dataset. In addition, we pre-train our model on an image-editing dataset, i.e., InstructPix2Pix, to enable the model to perform vision editing.
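The combination of the two objectives in this stage can be sketched as follows. This is a minimal NumPy illustration, assuming the text loss is averaged over token positions and the two terms are summed with a hypothetical weight `lam`; the function names are likewise illustrative and not taken from the paper.

```python
import numpy as np


def next_token_ce(logits, targets):
    """Mean cross-entropy over positions. logits: (T, V); targets: (T,) int ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


def visual_mse(pred_tokens, target_tokens):
    """MSE between generated visual embeddings and tokenizer features."""
    return float(((pred_tokens - target_tokens) ** 2).mean())


def stage2_loss(logits, targets, pred_tokens, target_tokens, lam=1.0):
    """Text cross-entropy plus a weighted embedding regression term (lam is assumed)."""
    return next_token_ce(logits, targets) + lam * visual_mse(pred_tokens, target_tokens)
```

In such a setup, the regression term would only be evaluated on positions between ‘[Img]’ and ‘[/Img]’, while the cross-entropy term covers the text positions and the special tokens.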

Stage-III: Instruction Tuning.

We perform multimodal instruction tuning by fine-tuning the LLM with a LoRA module on public datasets covering fine-grained visual QA, image generation and editing, and text-rich grounded data. Following prior work, we utilize 665K single- and multi-turn conversations from LLaVA-150K, ShareGPT4V, academic-task-oriented VQA datasets (e.g., VQAv2, GQA, OK-VQA, A-OKVQA), and region-level perception datasets (e.g., Visual Genome, RefCOCO) to enhance the model’s vision understanding capabilities. Furthermore, we generate prompts that produce or edit images suitable for the conversation context, yielding 15K image generation and editing pairs based on the InstructPix2Pix dataset. Appendix $\S$D provides more training details.

Experiments and Analysis

In the experiments, we aim to quantify the performance of Setokim across various vision-language tasks, comparing primarily against MLLMs that focus on visual comprehension and generation, as well as against widely used MLLMs that specialize in segmentation or editing. Appendix $\S$D details the experimental setups and implementation specifics.

We first evaluate our model across a wide range of academic benchmarks, including image captioning and image QA datasets. As shown in Table 1, our model achieves competitive performance in various vision-understanding tasks. Specifically, in the captioning tasks, our model outperforms both the specialist InstructBLIP and the previous SoTA MLLMs, LaVIT and SEED, by achieving the highest CIDEr score. Moreover, our model demonstrates superior performance on various QA datasets, notably improving GQA by 3.9% and IconQA by 4.2%, suggesting that our approach has a distinct advantage in understanding the relationships between objects and their quantities. Further, across the MLLM benchmarks shown in Table 2, our model consistently exhibits higher performance, underscoring its robust multimodal comprehension capabilities and its ability to mitigate hallucination.

Here, we compare the impact of different clustering mechanisms on model performance with respect to training time, computational overhead, and zero-shot performance on multimodal understanding tasks. As shown in Table 5, tokenizers constructed using dynamic clustering mechanisms achieve superior overall performance compared to those with a fixed setup, while also shortening training time and reducing computational costs during inference. Specifically, we note that using a pre-defined threshold to dynamically determine the number of clusters is sensitive to score inconsistencies: the appropriate threshold varies, resulting in poor generalization across different datasets. Compared with soft clustering, which yields soft attention masks, our findings suggest that hard clustering produces better results. This may be because hard clustering leads to more consistent cluster outcomes and hence more stable visual tokens, enhancing both the stability and the performance of the model. When employing a fixed number of clusters, the key challenge is to determine the optimal number of clusters. As shown in Table 5, different datasets achieve optimal performance at different numbers of clusters, so a uniform count across all datasets results in suboptimal outcomes. Furthermore, we compare our mechanism to the current resampler approach, in which a fixed number of queries are used to group features via a cross-attention mechanism. Our experiments reveal that dynamic clustering remains superior in multimodal understanding, likely because the fixed number of query tokens is not truly aligned with the number of semantic units in the visual content and the resampler is essentially a soft clustering method.

Qualitative Analysis of Visual Understanding and Generation.

As illustrated in Figure 3, our model exhibits proficiency in intricate image understanding tasks, such as deciphering reversed text, exemplified by the word “stop”, and accurately identifying text “A NEW EXPERIENCE COMING YOUR WAY” that is partially covered. In tasks involving detailed image descriptions, our approach prioritizes object-level information within images, which substantially mitigates the incidence of hallucinatory responses commonly observed in MLLMs. Moreover, in text-to-image generation, our model demonstrates remarkable capabilities in synthesizing coherent images, which maintain a high degree of fidelity and relevance to the textual context, such as the “flower”, “fence” and “squirrel”.

Qualitative Analysis of Visual Segmentation.

We present segmentation examples in Figure 4. The attention mask closely aligns with the object mask, and our model achieves more accurate and detailed segmentation results than other LLM-based segmentation methods. Notably, as depicted in the second row of the figure, the visual token generated by our method encompasses all depicted fish, effectively achieving a complete segmentation of the fish in the scene. In contrast, other models produce only partial segmentation. This segmentation quality highlights the precise content representation and improved interpretability of the visual tokens. Such visual tokens, when combined with the text tokens, ultimately enhance vision-language understanding.

Qualitative Analysis of Visual Editing.

Here, we evaluate the efficacy of image manipulation using our model compared to the previous diffusion-based method MagicBrush and various MLLMs, including Emu-2-Gen, MGIE, and Mini-Gemini. As depicted in Figure 5, Setokim displays superior performance by closely adhering to the provided instructions and preserving intricate image details. For instance, our model seamlessly adds ‘tomato slices’ to an image without altering other elements on the pizza, while Emu-2-Gen and MGIE fall short. Furthermore, our model exhibits remarkable precision in changing the color of an umbrella, while visual objects not intended for alteration remain highly consistent before and after editing. Additionally, Setokim is able to precisely follow implicit user instructions to remove unusual elements from an image, i.e., the banana, while preserving the surrounding context, whereas Emu-2-Gen mistakenly removes a telephone cord and MGIE fails to remove the banana properly, altering the cord’s texture. These examples underscore the effectiveness of Setokim for high-precision image manipulation, leveraging semantically equivalent visual tokens to achieve nuanced and context-aware results.

Qualitative Analysis of Visual Tokens.

In Figure 6, we demonstrate how input visual features are assigned to visual tokens after tokenization. Firstly, we observe that the clusters produced by our tokenization method resemble partial segmentations that represent semantically complete units. For instance, in the first and second images, a person is represented as a single token; however, in the third image, the person is segmented into different tokens for the head and body. This segmentation is similar to the fifth image, where a baby is divided into distinct tokens for the hair, face, body, and hands. Furthermore, in scenarios involving multiple instances of the same class, such as the two persons in the final image, identical parts from all instances are amalgamated. This ensures that similar visual features across different entities are recognized and processed collectively, enhancing the coherence and efficiency of our model’s tokenization process.

Currently, benefiting from the emergent phenomenon, LLMs have demonstrated near-human-level intelligence in language processing. Simultaneously, researchers have been attempting to develop MLLMs by integrating multimodal encoders and decoders into LLMs. From the initial MLLMs that could only understand multimodal input signals to later versions supporting the generation of multimodal content, MLLMs have shown powerful capabilities and a broader range of applications. Among all modalities, the integration of vision, known as visual MLLM, has received the most extensive research and application. The latest MLLM research has not only achieved both understanding and generation of visual content, but also developed more refined, pixel-level visual modeling, including segmentation and editing functions.

On the other hand, an increasing body of research indicates that visual tokenization significantly impacts MLLM capabilities in vision tasks. The fundamental approach involves encoding the input visual content into feature representations via a visual encoder (e.g., CLIP-ViT) and mapping these to an LLM, thus enabling a language-based LLM to understand vision. The corresponding method patchifies the original images of various sizes into smaller fixed-size patches, treats these as tokens, and encodes each patch/token to obtain corresponding embeddings, which are then fed into the LLM. Subsequent research, aiming to further unify the training objectives of the language and visual modalities, introduced codebook techniques that represent visual elements as codebook entries. This allows visual training to be treated similarly to language training, i.e., as next-token prediction. Unfortunately, both the visual encoding and the tokenization techniques above share a significant bottleneck for MLLM performance: the integrity of visual semantic units, either visual objects or compositional regions, is compromised during the patchifying process. This results in a less effective semantic alignment between vision and language within the LLM. This paper is the first to propose a solution to this problem, introducing a novel Semantic Equivalent Tokenization for MLLMs.

In addition, this work is related to scene decomposition, which involves segmenting a scene into objects. Typically, these methods use a fixed number of query tokens and apply cross-attention to aggregate visual features implicitly. However, this fixed-token approach may not only fail to correspond to the actual visual content but also require complex network architectures and extensive data for optimization. When combined with LLMs, such complexity significantly increases computational resource demands. In contrast, we learn a dynamic number of semantic objects and do not require complex model structures for optimization, thereby enhancing resource efficiency and providing a more adaptable solution for integrating visual and language modalities.

In this paper, we present a semantic equivalent tokenization (SeTok) to enhance the semantic alignment between vision and text. Technically, we propose a dynamic clustering mechanism to automatically group the visual features into a variable number of semantically complete visual tokens. We then equip an MLLM with SeTok and further propose a three-stage training strategy. After training, our model showcases notable performance on a broad range of comprehension, generation, segmentation, and editing tasks, demonstrating our method’s efficacy.

Appendix A Ethics Statement

This work aims to build a semantic-equivalent tokenization that segments input images into semantically complete tokens, in order to enhance the vision understanding, generation, segmentation, and editing capabilities of MLLMs. Here we discuss the potential impacts of the proposed MLLM, Setokim.

Setokim, limited by the quantity of fine-tuning data and the quality of the base models, may generate some low-quality content. Also, as a generative model, the LLM may produce hallucinated content in multimodal formats that may be harmful to society. We have reminded users to interpret the results with caution. Anyone who uses this LLM should obey the terms of its license. Commercial use of our system is not allowed.

Data Privacy and Security

Our research utilizes datasets that are either publicly available or collected with explicit consent. We adhere to strict data privacy and security protocols to protect the information and ensure it is used solely for this research.

Bias Mitigation

Recognizing the potential for bias in AI models, particularly in vision-language tasks, we rigorously test our tokenizer across diverse datasets. This approach is designed to identify and mitigate biases that may affect the model’s performance or lead to unfair outcomes in its applications.

Appendix B Limitation

While Setokim has achieved further improvements across various language-driven vision tasks, becoming a zero-shot general specialist, it still faces several limitations.

The evaluation of our model is currently constrained to configurations with 7B and 13B parameters. As shown in prior work, the performance of MLLMs is limited by the scale of the core backbone LLM. Despite the impressive results achieved, the potential benefits of employing significantly larger models, such as 65B or 130B, are worth exploring in future studies.

The Resolution of Images.

Regarding image resolution, our model currently handles images up to a maximum resolution of 336x336. While there have been improvements in understanding visually fine-grained content, challenges remain when processing higher-resolution images. Therefore, enhancing the model to support images of any resolution could further improve its visual perception capabilities.

Hallucination.

Although our model has made some progress in mitigating hallucination through fine-grained vision-language alignment, as demonstrated in experiments on the POPE dataset, the issue of hallucinations remains inevitable. This area continues to pose challenges and is a crucial aspect for future exploration and enhancement.

Appendix C Detailed Clustering Algorithm

The formal clustering algorithm is described in Algorithm 1. Specifically, a scope $c\in[0,1]^{h\times w}$ is initialized to a matrix of ones $\mathbf{1}^{h\times w}$ to track the degree to which visual embeddings have been assigned to clusters. In addition, the seed scores are initialized by combining the local density in Eq.(1) and the distance in Eq.(2) to guide the selection of visual embeddings. At each iteration, a single embedding vector $\bm{x}_{i,j}$ is selected at the spatial location $(i,j)$ that corresponds to the argmax of the element-wise product of the seed scores and the current scope. This ensures that cluster seeds are sampled from pixel embeddings that have not yet been assigned to clusters. An alpha mask $\alpha_{i}\in[0,1]^{h\times w}$ is computed from the distance between the cluster seed embedding $\bm{x}_{i,j}$ and all individual pixel embeddings according to a distance kernel $\varphi$. The output of the kernel $\varphi$ is one if two embeddings are identical and decreases to zero as the distance between a pair of embeddings increases. The associated attention mask $\bm{M}_{k}$ is obtained by the element-wise multiplication of the alpha mask with the current scope, which ensures that the final set of attention masks is normalized. The scope is then updated by an element-wise multiplication with the complement of the alpha mask. This process is repeated until a stopping condition is satisfied, at which point the final scope is added as an additional mask to explain any remaining embeddings.
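The iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the kernel $\varphi$ is taken to be a Gaussian with a hypothetical bandwidth `sigma`, and the unspecified stopping condition is approximated by a cap on the number of clusters plus a threshold `eps` on the mean remaining scope. The function name and all parameter defaults are illustrative.

```python
import numpy as np


def cluster_with_scope(feats, seed_scores, max_clusters=8, sigma=1.0, eps=1e-3):
    """Return attention masks of shape (h, w, K + 1) that sum to one per location.

    feats:       (h, w, d) visual embeddings
    seed_scores: (h, w) scores, e.g. density times distance
    """
    h, w, _ = feats.shape
    scope = np.ones((h, w))  # how much of each location is still unexplained
    masks = []
    for _ in range(max_clusters):
        # Assumed stopping condition: almost everything has been assigned.
        if scope.mean() < eps:
            break
        # Seed = best-scoring location among the not-yet-assigned embeddings.
        i, j = np.unravel_index(np.argmax(seed_scores * scope), (h, w))
        # Gaussian kernel (an assumption): 1 for identical embeddings, -> 0 far away.
        dist2 = ((feats - feats[i, j]) ** 2).sum(axis=-1)
        alpha = np.exp(-dist2 / sigma)
        masks.append(scope * alpha)      # attention mask of this cluster
        scope = scope * (1.0 - alpha)    # remove what was just explained
    masks.append(scope)                  # final scope covers the remainder
    return np.stack(masks, axis=-1)
```

Because each mask equals the scope before the update minus the scope after it, the masks telescope and sum to one at every location. A hard assignment can then be obtained with `masks.argmax(axis=-1)`.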

Appendix D Detailed Experiments Settings

Our model utilizes two types of vision encoders: ViT-G/14-336 and ConvNext-XXL. For the language model backbone, we choose Vicuna-v1.5 in both the 7B and 13B versions. The cluster merger within our architecture comprises two blocks, each equipped with a multi-head self-attention mechanism with two heads, a feed-forward layer, and layers for residual connections and normalization. For the feature mapper, we initialize the backbone with the pre-trained BERT-base model. Additionally, SD v1.5 is used to set the initial parameters for the U-Net. This integrated approach provides a robust foundation for processing and interpreting multimodal inputs. We employ LoRA and partial parameter fine-tuning to optimize the model. The training data utilized at each stage are detailed in Table 6.

D.2 Training Recipe

In Table 7, we list the detailed hyper-parameter settings for the three stages. All training of Setokim is conducted on 16 $\times$ A800 (80G) GPUs.

Appendix E Extended Experimental Analysis

Here, we conduct an ablation study to assess the design advantages of our model, with the results presented in Table 9. Initially, removing the position embedding from each cluster significantly impaired performance in visual understanding and editing tasks. Subsequently, replacing the cluster merger with an average visual representation for each cluster also led to a modest decline in fine-grained visual performance.

The Adaption of SeTok.

To investigate whether our proposed semantic-equivalent tokenizer can be transferred to established MLLMs and thereby enhance their performance across various tasks, we select models that cater to varying resolutions: Emu-1 supporting $224\times224$, LLaVA-v1.5 supporting $336\times336$, and Yi-VL supporting $448\times448$. Our semantic-equivalent tokenizer is then inserted after the input connector of these existing models, where each token represents the average of the tokens within a cluster. Furthermore, we sample 10k instruction data from LLaVA-150k to fine-tune the original and modified models. The experimental results are presented in Table 9. Firstly, a marked improvement across various tasks can be observed after the application of SeTok, indicating the effectiveness and necessity of aligning visual tokens semantically with text tokens. Furthermore, while an increase in resolution typically requires a higher number of visual tokens in the original MLLMs, the introduction of clustering results in a more consistent number of tokens being fed into the LLM. This consistency is crucial because the attention computation in LLMs scales quadratically with the token length; thus, this sparsification can accelerate the training process and significantly reduce computational costs during inference.
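The averaging step and the quadratic effect on attention cost can be illustrated with a short NumPy sketch. The function names are hypothetical, the cluster labels are assumed to be given as one integer per patch token, and the cost ratio only accounts for the quadratic self-attention term over visual tokens, ignoring text tokens and all other layers.

```python
import numpy as np


def average_by_cluster(tokens, labels):
    """Average patch tokens per cluster. tokens: (n, d); labels: (n,) int ids.

    Returns a (k, d) array with one averaged token per distinct label,
    ordered by ascending label id.
    """
    ids, inverse = np.unique(labels, return_inverse=True)
    sums = np.zeros((len(ids), tokens.shape[1]))
    np.add.at(sums, inverse, tokens)  # unbuffered scatter-add per cluster
    counts = np.bincount(inverse, minlength=len(ids))
    return sums / counts[:, None]


def attention_cost_ratio(n_before, n_after):
    """Relative cost of the quadratic attention term after reducing the tokens."""
    return (n_after / n_before) ** 2
```

For example, a $336\times336$ input with $14\times14$ patches yields $24\times24=576$ patch tokens; if clustering reduced these to a hypothetical 24 tokens, `attention_cost_ratio(576, 24)` would give $1/576$ for the visual part of the attention computation.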

The Impact of Visual Encoder.

Here, we analyze the impact of using the same clustering mechanism on the effectiveness of vision encoders. Initially, as demonstrated in Figure 7(a), we observe that the transformer-based vision encoder exhibits superior performance in both understanding and generating visual content. Upon investigating the reasons, we found that visual features obtained from the ConvNext model do not easily differentiate focal points, as shown in Figure 7(b). This may be due to the local receptive fields in convolutional networks, resulting in each feature focusing solely on its own region and highly dispersed scores. In contrast, the score distribution of the ViT-G/14 is much more compact, implying a greater consistency or reliability in the scores it produces. Furthermore, employing the same number of clusters for both visual feature groups, further visualization of the clustering outcomes in Figure 7(c) reveals that clusters based on convolution are quite arbitrary and do not discern object-level information. Conversely, clusters derived from transformer-based models delineate object contours, enabling the generation of more precise object-centric visual tokens, and leading to better performance in various multimodal tasks.

The Quality of Reconstruction.

In Figure 8, we visualize some examples reconstructed by the trained denoising U-Net. It can be seen that, given the tokenized visual tokens, the original input images can be successfully recovered. The reconstructed examples indicate that the method achieves a high degree of reconstruction fidelity.