Autoregressive Image Generation without Vector Quantization

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, Kaiming He

Introduction

Autoregressive models are currently the de facto solution to generative models in natural language processing. These models predict the next word or token in a sequence based on the previous words as input. Given the discrete nature of languages, the inputs and outputs of these models are in a categorical, discrete-valued space. This prevailing approach has led to a widespread belief that autoregressive models are inherently linked to discrete representations.

As a result, research on generalizing autoregressive models to continuous-valued domains (most notably, image generation) has focused intensely on discretizing the data. A commonly adopted strategy is to train a discrete-valued tokenizer on images, which involves a finite vocabulary obtained by vector quantization (VQ). Autoregressive models are then operated on the discrete-valued token space, analogous to their language counterparts.

In this work, we aim to address the following question: Is it necessary for autoregressive models to be coupled with vector-quantized representations? We note that the autoregressive nature, i.e., “predicting next tokens based on previous ones”, is independent of whether the values are discrete or continuous. What is needed is to model the per-token probability distribution, which can be measured by a loss function and used to draw samples from. Discrete-valued representations can be conveniently modeled by a categorical distribution, but it is not conceptually necessary. If alternative models for per-token probability distributions are presented, autoregressive models can be approached without vector quantization.

With this observation, we propose to model the per-token probability distribution by a diffusion procedure operating on continuous-valued domains. Our methodology leverages the principles of diffusion models for representing arbitrary probability distributions. Specifically, our method autoregressively predicts a vector $z$ for each token, which serves as a conditioning for a denoising network (e.g., a small MLP). The denoising diffusion procedure enables us to represent an underlying distribution $p(x|z)$ for the output $x$ (Figure 1). This small denoising network is trained jointly with the autoregressive model, with continuous-valued tokens as the input and target. Conceptually, this small prediction head, applied to each token, behaves like a loss function for measuring the quality of $z$. We refer to this loss function as Diffusion Loss.

Our approach eliminates the need for discrete-valued tokenizers. Vector-quantized tokenizers are difficult to train and are sensitive to gradient approximation strategies. Their reconstruction quality often falls short compared to continuous-valued counterparts. Our approach allows autoregressive models to enjoy the benefits of higher-quality, non-quantized tokenizers.

To broaden the scope, we further unify standard autoregressive (AR) models and masked generative models into a generalized autoregressive framework (Figure 3). Conceptually, masked generative models predict multiple output tokens simultaneously in a randomized order, while still maintaining the autoregressive nature of “predicting next tokens based on known ones”. This leads to a masked autoregressive (MAR) model that can be seamlessly used with Diffusion Loss.

We demonstrate by experiments the effectiveness of Diffusion Loss across a wide variety of cases, including AR and MAR models. It eliminates the need for vector-quantized tokenizers and consistently improves generation quality. Our loss function can be flexibly applied with different types of tokenizers. Further, our method retains the speed advantage of sequence models. Our MAR model with Diffusion Loss can generate at a rate of $<$ 0.3 seconds per image while achieving a strong FID of $<$ 2.0 on ImageNet 256 $\times$ 256. Our best model reaches 1.55 FID.

The effectiveness of our method reveals a largely uncharted realm of image generation: modeling the interdependence of tokens by autoregression, jointly with the per-token distribution by diffusion. This is in contrast with typical latent diffusion models in which the diffusion process models the joint distribution of all tokens. Given the effectiveness, speed, and flexibility of our method, we hope that the Diffusion Loss will advance autoregressive image generation and be generalized to other domains in future research.

Related Work

Pioneering efforts on autoregressive image models operate on sequences of pixels. Autoregression can be performed by RNNs, CNNs, and, most recently and popularly, Transformers. Motivated by language models, another series of works model images as discrete-valued tokens. Autoregressive and masked generative models can operate on the discrete-valued token space. But discrete tokenizers are difficult to train, which has recently drawn special focus.

Related to our work, the recent work on GIVT also focuses on continuous-valued tokens in sequence models. GIVT and our work both reveal the significance and potential of this direction. In GIVT, the token distribution is represented by Gaussian mixture models. It uses a pre-defined number of mixtures, which can limit the types of distributions it can represent. In contrast, our method leverages the effectiveness of the diffusion process for modeling arbitrary distributions.

Diffusion for Representation Learning

The denoising diffusion process has been explored as a criterion for visual self-supervised learning. For example, DiffMAE replaces the L2 loss in the original MAE with a denoising diffusion decoder; DARL trains autoregressive models with a denoising diffusion patch decoder. These efforts have been focused on representation learning, rather than image generation. In their scenarios, generating diverse images is not a goal; these methods have not presented the capability of generating new images from scratch.

Diffusion for Policy Learning

Our work is conceptually related to Diffusion Policy in robotics. In those scenarios, the distribution of taking an action is formulated as a denoising process on the robot observations, which can be pixels or latents. In image generation, we can think of generating a token as an “action” to take. Despite this conceptual connection, the diversity of the generated samples in robotics is less of a core consideration than it is for image generation.

Method

In a nutshell, our image generation approach is a sequence model operated on a tokenized latent space. But unlike previous methods that are based on vector-quantized tokenizers (e.g., variants of VQ-VAE), we aim to use continuous-valued tokenizers. We propose Diffusion Loss, which makes sequence models compatible with continuous-valued tokens.

In the context of generative modeling, the per-token probability distribution $p(x|z)$, where $z$ is the vector produced by the autoregressive model at a token position and $x$ is the token to be predicted there, must provide two essential components. (i) A loss function that can measure the difference between the estimated and true distributions. In the case of categorical distribution, this can be simply done by the cross-entropy loss. (ii) A sampler that can draw samples from the distribution $x\sim p(x|z)$ at inference time. In the case of categorical distribution, this is often implemented as drawing a sample from $p(x|z)=\text{softmax}(Wz/\tau)$, in which $W$ projects $z$ to logits over the vocabulary and $\tau$ is a temperature that controls the diversity of the samples. Sampling from a categorical distribution can be approached by the Gumbel-max method or inverse transform sampling.
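The two samplers named above for the categorical case can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the logits stand in for $Wz$.

```python
import math
import random

def softmax_with_temperature(logits, tau=1.0):
    """Return softmax(logits / tau) as a list of probabilities."""
    scaled = [value / tau for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def sample_inverse_transform(probs, rng):
    """Inverse transform sampling: walk the CDF until it exceeds u ~ U[0, 1)."""
    u = rng.random()
    cdf = 0.0
    for index, prob in enumerate(probs):
        cdf += prob
        if u < cdf:
            return index
    return len(probs) - 1  # guard against floating-point round-off

def sample_gumbel_max(logits, tau, rng):
    """Gumbel-max: the argmax of (logits / tau + Gumbel noise) is a softmax sample."""
    def gumbel():
        u = max(rng.random(), 1e-300)  # keep u away from 0 so log(u) is finite
        return -math.log(-math.log(u))
    noisy = [value / tau + gumbel() for value in logits]
    return max(range(len(noisy)), key=noisy.__getitem__)
```

Lowering `tau` concentrates the probability mass on the largest logit, which is the diversity/fidelity control that the diffusion sampler needs a counterpart for.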

This analysis suggests that discrete-valued tokens are not necessary for autoregressive models. Instead, it is the requirement of modeling a distribution that is essential. A discrete-valued token space implies a categorical distribution, whose loss function and sampler are simple to define. What we actually need are a loss function and its corresponding sampler for distribution modeling.

3.2 Diffusion Loss

Denoising diffusion models offer an effective framework to model arbitrary distributions. But unlike common usages of diffusion models for representing the joint distribution of all pixels or all tokens, in our case, the diffusion model is for representing the distribution for each token.

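Following , the loss function of an underlying probability distribution $p(x|z)$ can be formulated as a denoising criterion:

$\mathcal{L}(z,x)=\mathbb{E}_{\varepsilon,t}\left[\left\|\varepsilon-\varepsilon_{\theta}(x_{t}|t,z)\right\|^{2}\right]. \quad (1)$

Here, $\varepsilon$ is a noise vector sampled from $\mathcal{N}(\mathbf{0},\mathbf{I})$, $x_{t}=\sqrt{\bar{\alpha}_{t}}\,x+\sqrt{1-\bar{\alpha}_{t}}\,\varepsilon$ is the noise-corrupted token, $\bar{\alpha}_{t}$ defines the noise schedule, and $t$ is a time step of that schedule. The noise estimator $\varepsilon_{\theta}$ takes $x_{t}$ as input and is conditioned on both $t$ and $z$.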

It is worth noticing that the conditioning vector $z$ is produced by the autoregressive network: $z=f(\cdot)$ , as we will discuss later. The gradient of $z=f(\cdot)$ is propagated from the loss function in Eqn. (1). Conceptually, Eqn. (1) defines a loss function for training the network $f(\cdot)$ .
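As a rough illustration of how such a denoising criterion is evaluated for one token, the sketch below draws a single $(t,\varepsilon)$ pair, forms the noised token under the standard noise-prediction parameterization, and returns the squared error. The names `cosine_alpha_bar` and `diffusion_loss`, the offset `s=0.008`, and the callable `eps_theta` (a stand-in for the denoising network) are assumptions made for this example, not details taken from the paper.

```python
import math
import random

def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal level alpha_bar_t of a cosine-shaped noise schedule."""
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def diffusion_loss(x, z, eps_theta, num_steps=1000, rng=random):
    """One-sample Monte Carlo estimate of E_{eps,t} ||eps - eps_theta(x_t | t, z)||^2."""
    t = rng.randint(1, num_steps)                  # random time step
    eps = [rng.gauss(0.0, 1.0) for _ in x]         # Gaussian noise vector
    a_bar = cosine_alpha_bar(t, num_steps)
    # noised token: x_t = sqrt(a_bar) * x + sqrt(1 - a_bar) * eps
    x_t = [math.sqrt(a_bar) * xi + math.sqrt(1.0 - a_bar) * ei
           for xi, ei in zip(x, eps)]
    pred = eps_theta(x_t, t, z)                    # network output, conditioned on t and z
    return sum((ei - pi) ** 2 for ei, pi in zip(eps, pred))
```

In an actual training setup this value would be averaged over tokens and batches and back-propagated through both the denoising network and $z$, which is how the criterion serves as a loss for the network $f(\cdot)$.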

Sampler

At inference time, it is required to draw samples from the distribution $p(x|z)$. Sampling is done via a reverse diffusion procedure: $x_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left(x_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\varepsilon_{\theta}(x_{t}|t,z)\right)+\sigma_{t}\delta.$ Here $\delta$ is sampled from the Gaussian distribution $\mathcal{N}(\mathbf{0},\mathbf{I})$ and $\sigma_{t}$ is the noise level at time step $t$. Starting with $x_{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$, this procedure produces a sample $x_{0}$ such that $x_{0}\sim p(x|z)$.

When using categorical distributions (Sec. 3.1), autoregressive models can enjoy the benefit of having a temperature $\tau$ for controlling sample diversity. In fact, existing literature, in both languages and images, has shown that temperature plays a critical role in autoregressive generation. It is desired for the diffusion sampler to offer a temperature counterpart. We adopt the temperature sampling presented in . Conceptually, with temperature $\tau$, one may want to sample from the (renormalized) probability of $p(x|z)^{\frac{1}{\tau}}$, whose score function is $\frac{1}{\tau}\nabla_{x}\log p(x|z)$. In practice, that work suggests either dividing $\varepsilon_{\theta}$ by $\tau$ or scaling the noise by $\tau$. We adopt the latter option: we scale $\sigma_{t}\delta$ in the sampler by $\tau$. Intuitively, $\tau$ controls the sample diversity by adjusting the noise variance.
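The reverse procedure with temperature-scaled noise can be sketched for a scalar token as follows. Several choices here are assumptions for illustration: the cosine schedule with clipped $\beta_t$, the choice $\sigma_t^2=1-\alpha_t$, skipping the noise at the final step, and the hypothetical `make_gaussian_oracle`, which replaces a trained MLP with the exact noise predictor for a Gaussian target whose mean and standard deviation are carried by $z$.

```python
import math
import random

def cosine_schedule(num_steps, s=0.008, max_beta=0.999):
    """Return (alphas, alpha_bars), indexed by t = 1..num_steps (index 0 is a placeholder)."""
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    alphas, alpha_bars = [1.0], [1.0]
    for t in range(1, num_steps + 1):
        beta = min(1.0 - f(t) / f(t - 1), max_beta)
        alphas.append(1.0 - beta)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alphas, alpha_bars

def make_gaussian_oracle(num_steps):
    """Exact eps-predictor when p(x|z) = N(mu, std^2) and z = (mu, std)."""
    _, alpha_bars = cosine_schedule(num_steps)
    def eps_theta(x_t, t, z):
        mu, std = z
        a_bar = alpha_bars[t]
        var_t = a_bar * std ** 2 + (1.0 - a_bar)   # variance of x_t
        return math.sqrt(1.0 - a_bar) * (x_t - math.sqrt(a_bar) * mu) / var_t
    return eps_theta

def sample_token(eps_theta, z, num_steps=100, tau=1.0, rng=random):
    """Run the reverse diffusion from x_T ~ N(0, 1), scaling the injected noise by tau."""
    alphas, alpha_bars = cosine_schedule(num_steps)
    x = rng.gauss(0.0, 1.0)
    for t in range(num_steps, 0, -1):
        a, a_bar = alphas[t], alpha_bars[t]
        eps = eps_theta(x, t, z)
        mean = (x - (1.0 - a) / math.sqrt(1.0 - a_bar) * eps) / math.sqrt(a)
        sigma = math.sqrt(1.0 - a)                  # assumed noise level sigma_t
        delta = rng.gauss(0.0, 1.0) if t > 1 else 0.0
        x = mean + tau * sigma * delta              # temperature scales sigma_t * delta
    return x
```

With this oracle, samples drawn at `tau=1.0` and `tau=0.5` share the same mean, while the lower temperature yields a visibly narrower spread.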

3.3 Diffusion Loss for Autoregressive Models

3.4 Unifying Autoregressive and Masked Generative Models

We show that masked generative models, e.g., MaskGIT and MAGE, can be generalized under the broad concept of autoregression, i.e., next token prediction.

The concept of autoregression is orthogonal to network architectures: autoregression can be done by RNNs, CNNs, and Transformers. When using Transformers, although autoregressive models are popularly implemented by causal attention, we show that they can also be done by bidirectional attention. See Figure 2. Note that the goal of autoregression is to predict the next token given the previous tokens; it does not constrain how the previous tokens communicate with the next token.

We can adopt the bidirectional attention implementation as done in Masked Autoencoder (MAE). See Figure 2(b). Specifically, we first apply an MAE-style encoder on the known tokens (with positional embedding). Here, the terminology of encoder/decoder is in the sense of a general Autoencoder, following MAE; it is not related to whether the computation is causal or bidirectional in Transformers. Then we concatenate the encoded sequence with mask tokens (with positional embedding added again), and map this sequence with an MAE-style decoder. The positional embedding on the mask tokens lets the decoder know which positions are to be predicted. Unlike causal attention, here the loss is computed only on the unknown tokens.

With the MAE-style trick, we allow all known tokens to see each other, and also allow all unknown tokens to see all known tokens. This full attention introduces better communication across tokens than causal attention. At inference time, we can generate tokens (one or more per step) using this bidirectional formulation, which is a form of autoregression. As a compromise, we cannot use the key-value (kv) cache of causal attention to speed up inference. But as we can generate multiple tokens together, we can reduce generation steps to speed up inference. Full attention across tokens can significantly improve the quality and offer a better speed/accuracy trade-off.

Autoregressive models in random orders

To connect to masked generative models, we consider an autoregressive variant in random orders. The model is given a randomly permuted sequence. This random permutation is different for each sample. See Figure 3(b). In this case, the position of the next token to be predicted needs to be accessible to the model. We adopt a strategy similar to MAE: we add positional embedding (that corresponds to the unshuffled positions) to the decoder layers, which can tell what positions to predict. This strategy is applicable for both causal and bidirectional versions.

As shown in Figure 3 (b)(c), random-order autoregression behaves like a special form of masked generation, in which one token is generated at a time. We elaborate on this as follows.

Masked autoregressive models

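Predicting multiple tokens at each step, in a random order, is conceptually still an autoregressive procedure. It can be written as the factorization $p(x^{1},...,x^{n})=p(X^{1},...,X^{K})=\prod_{k=1}^{K}p(X^{k}~|~X^{1},...,X^{k-1})$, where $K$ is the number of steps. Here, $X^{k}=\{x^{i},x^{i+1},...,x^{j}\}$ is a set of tokens to be predicted at the $k$-th step, with $\cup_{k}{X^{k}}=\{x^{1},...,x^{n}\}$. In this sense, this is essentially “next set-of-tokens prediction”, and thus is also a general form of autoregression. We refer to this variant as Masked Autoregressive (MAR) models. MAR is a random-order autoregressive model that can predict multiple tokens simultaneously.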

MAR is conceptually related to MAGE. However, MAR samples tokens by a temperature $\tau$ applied on the probability distribution of each token (which is the standard practice in generative language models like GPT). In contrast, MAGE (following MaskGIT) applies a temperature for sampling the locations of the tokens to be predicted: this is not a fully randomized order, which creates a gap between training-time and inference-time behavior.

Implementation

This section describes our implementation. We note that the concepts introduced in this paper are general and not limited to specific implementations. More detailed specifics are in Appendix B.

Our diffusion process follows . Our noise schedule has a cosine shape, with 1000 steps at training time; at inference time, it is resampled with fewer steps (by default, 100). Our denoising network predicts the noise vector $\varepsilon$. The loss can optionally include the variational lower bound term $\mathcal{L}_{\text{vlb}}$. Diffusion Loss naturally supports classifier-free guidance (CFG) (detailed in Appendix B).

Denoising MLP

We use a small MLP consisting of a few residual blocks for denoising. Each block sequentially applies a LayerNorm (LN), a linear layer, SiLU, and another linear layer, merging with a residual connection. By default, we use 3 blocks and a width of 1024 channels. The denoising MLP is conditioned on a vector $z$ produced by the AR/MAR model (see Figure 1). The vector $z$ is added to the time embedding of the noise schedule time-step $t$, which serves as the condition of the MLP in the LN layers via AdaLN.
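One residual block of such a denoising MLP could look like the dependency-free sketch below. The block order (LN, linear, SiLU, linear, residual) and the condition $z$ plus time embedding follow the description above; the specific modulation form (shift, scale, and a gate on the residual branch, in the style of adaLN-Zero) and the parameter names in `params` are assumptions for illustration.

```python
import math

def linear(weight, bias, x):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def silu(x):
    return [xi / (1.0 + math.exp(-xi)) for xi in x]

def layer_norm(x, eps=1e-6):
    """LayerNorm without learned affine parameters; AdaLN supplies them."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def condition_vector(z, time_embedding):
    """The condition is z added to the embedding of the time step t."""
    return [zi + ti for zi, ti in zip(z, time_embedding)]

def res_block(x, cond, params):
    """LN -> AdaLN modulation -> linear -> SiLU -> linear -> gated residual."""
    width = len(x)
    # the condition is mapped to per-channel shift, scale, and gate (assumed form)
    mod = linear(params["w_mod"], params["b_mod"], silu(cond))
    shift, scale, gate = mod[:width], mod[width:2 * width], mod[2 * width:]
    h = [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]
    h = linear(params["w1"], params["b1"], h)
    h = linear(params["w2"], params["b2"], silu(h))
    return [xi + g * hi for xi, g, hi in zip(x, gate, h)]
```

With the modulation weights initialized to zero, the gate is zero and the block reduces to the identity, a common initialization choice for conditioned residual blocks.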

4.2 Autoregressive and Masked Autoregressive Image Generation

We use the publicly available tokenizers provided by LDM. Our experiments will involve their VQ-16 and KL-16 versions. VQ-16 is a VQ-GAN, i.e., VQ-VAE with GAN loss and perceptual loss; KL-16 is its counterpart regularized by Kullback–Leibler (KL) divergence, without vector quantization. Here, 16 denotes the tokenizer stride.

Transformer

Our architecture follows the Transformer implementation in ViT. Given a sequence of tokens from a tokenizer, we add positional embedding and append the class tokens [cls]; then we process the sequence by a Transformer. By default, our Transformer has 32 blocks and a width of 1024, which we refer to as the Large size or -L ( $\scriptstyle\sim$ 400M parameters).

Autoregressive baseline

Causal attention is implemented following the common practice of GPT (Figure 2(a)). The input sequence is shifted by one token (here, [cls]). Triangular masking is applied to the attention matrix. At inference time, temperature ( $\tau$ ) sampling is applied. We use kv-cache for efficient inference.
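The input shift and the triangular mask of the causal baseline can be illustrated with a small sketch. The helper names are hypothetical, and the mask is shown as a boolean matrix rather than as additive attention biases.

```python
def causal_mask(n):
    """mask[i][j] is True when query position i may attend to key position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def shift_inputs(tokens, cls="[cls]"):
    """Shift by one: the input at position i is the token before it, starting from [cls]."""
    return [cls] + list(tokens[:-1])

tokens = ["x1", "x2", "x3", "x4"]
inputs = shift_inputs(tokens)   # ["[cls]", "x1", "x2", "x3"]
targets = tokens                # position i is trained to predict tokens[i]
mask = causal_mask(len(inputs))
```

Because position $i$ only sees `inputs[0..i]`, its output depends solely on [cls] and the preceding tokens, which is what allows the kv-cache at inference time.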

Masked autoregressive models

With bidirectional attention (Figure 2(b)), we can predict any number of unknown tokens given any number of known tokens. At training time, we randomly sample a masking ratio in [0.7, 1.0]: e.g., 0.7 means 70% of the tokens are unknown. Because the sampled sequence can be very short, we always pad 64 [cls] tokens at the start of the encoder sequence, which improves the stability and capacity of our encoding. As in Figure 2, mask tokens [m] are introduced in the decoder, with positional embedding added. For simplicity, unlike , we let the encoder and decoder have the same size: each has half of all blocks (e.g., 16 in MAR-L).
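The training-time masking can be sketched as below. The text only states that the ratio is sampled in [0.7, 1.0]; drawing it uniformly, rounding up, and the function name `sample_training_mask` are assumptions made for this example.

```python
import math
import random

def sample_training_mask(num_tokens, rng, min_ratio=0.7, max_ratio=1.0):
    """Split token positions into known (encoder input) and masked (to be predicted)."""
    ratio = rng.uniform(min_ratio, max_ratio)   # assumed: uniform over the range
    num_masked = min(num_tokens, math.ceil(ratio * num_tokens))
    order = list(range(num_tokens))
    rng.shuffle(order)                          # a fresh random order per sample
    masked = sorted(order[:num_masked])
    known = sorted(order[num_masked:])
    return known, masked

rng = random.Random(0)
known, masked = sample_training_mask(256, rng)
# The encoder sees only `known` (plus the padded [cls] tokens);
# the loss is computed only at the `masked` positions.
```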

At inference, MAR performs “next set-of-tokens prediction”. It progressively reduces the masking ratio from 1.0 to 0 with a cosine schedule. By default, we use 64 steps in this schedule. Temperature ( $\tau$ ) sampling is applied. Unlike MAGE and MaskGIT, MAR always uses fully randomized orders.
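A possible way to turn the cosine schedule into per-step token sets is sketched below. The clamping rule (predict at least one token per step, keep at least one masked until the last step) and the function name are assumptions for illustration; the fully randomized order and the cosine-shaped masking ratio follow the description above.

```python
import math
import random

def mar_generation_schedule(num_tokens, num_steps, rng):
    """Return a list of token-position sets X^1..X^K, one per autoregressive step."""
    order = list(range(num_tokens))
    rng.shuffle(order)                          # fully randomized generation order
    steps, done = [], 0
    for k in range(num_steps):
        remaining = num_tokens - done
        if k == num_steps - 1:
            keep_masked = 0                     # final step predicts everything left
        else:
            ratio = math.cos(math.pi / 2 * (k + 1) / num_steps)
            keep_masked = math.floor(num_tokens * ratio)
            # keep at least one token masked, and predict at least one token
            keep_masked = max(1, min(remaining - 1, keep_masked))
        count = remaining - keep_masked
        steps.append(order[done:done + count])
        done += count
    return steps

schedule = mar_generation_schedule(256, 64, random.Random(0))
```

Under this schedule, early steps predict very few tokens and later steps predict many, and the union of all sets covers every position exactly once.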

Experiments

We experiment on ImageNet at a resolution of 256 $\times$ 256. We evaluate FID and IS, and provide Precision and Recall as references following common practice. We follow the evaluation suite provided by .

We first compare continuous-valued tokens with Diffusion Loss and standard discrete-valued tokens with cross-entropy loss (Table 1). For fair comparisons, the tokenizers (“VQ-16” and “KL-16”) are both downloaded from the LDM codebase. These are popularly used tokenizers.

The comparisons cover four variants of AR/MAR. As shown in Table 1, Diffusion Loss consistently outperforms the cross-entropy counterpart in all cases. Specifically, in MAR (e.g., the default), using Diffusion Loss reduces FID by a relative $\scriptstyle\sim$ 50%-60%. This is because the continuous-valued KL-16 has smaller compression loss than VQ-16 (discussed next in Table 2), and also because a diffusion process models distributions more effectively than categorical ones.

In the following ablations, unless specified, we follow the “default” MAR setting in Table 1.

Flexibility of Diffusion Loss

One significant advantage of Diffusion Loss is its flexibility with various tokenizers. We compare several publicly available tokenizers in Table 2.

Diffusion Loss can be easily used even given a VQ tokenizer. We simply treat the continuous-valued latent before the VQ layer as the tokens. This variant gives us 7.82 FID (w/o CFG), which compares favorably with the 8.79 FID (Table 1) of cross-entropy loss using the same VQ tokenizer. This suggests the better capability of diffusion for modeling distributions.

This variant also enables us to compare the VQ-16 and KL-16 tokenizers using the same loss. As shown in Table 2, VQ-16 has a much worse reconstruction FID (rFID) than KL-16, which consequently leads to a much worse generation FID (e.g., 7.82 vs. 3.50 in Table 2).

Interestingly, Diffusion Loss also enables us to use tokenizers with mismatched strides. In Table 2, we study a KL-8 tokenizer whose stride is 8 and output sequence length is 32 $\times$ 32. Without increasing the sequence length of the generator, we group 2 $\times$ 2 tokens into a new token. Despite the mismatch, we are able to obtain decent results, e.g., KL-8 gives us 2.05 FID, vs. KL-16’s 1.98 FID. Further, this property allows us to investigate other tokenizers, e.g., Consistency Decoder, a non-VQ tokenizer of a different architecture/stride designed for different goals.
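The 2 $\times$ 2 grouping can be illustrated as a simple re-arrangement of the token grid. The function name, the concatenation order within a group, and the 4-channel token dimension used in the usage lines are assumptions for this sketch.

```python
def group_tokens(grid, group=2):
    """Merge each (group x group) block of C-dim tokens into one (group*group*C)-dim token."""
    height, width = len(grid), len(grid[0])
    if height % group or width % group:
        raise ValueError("grid size must be divisible by the group size")
    merged = []
    for i in range(0, height, group):
        row = []
        for j in range(0, width, group):
            token = []
            for di in range(group):
                for dj in range(group):
                    token.extend(grid[i + di][j + dj])  # concatenate along channels
            row.append(token)
        merged.append(row)
    return merged

# A 32x32 grid of 4-dim tokens becomes a 16x16 grid of 16-dim tokens,
# so the generator's sequence length stays at 256.
grid = [[[float(r), float(c), 0.0, 1.0] for c in range(32)] for r in range(32)]
grouped = group_tokens(grid)
```

Since Diffusion Loss operates on continuous vectors of any dimension, the longer grouped token only changes the input/output width of the denoising MLP.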

For comprehensiveness, we also train a KL-16 tokenizer on ImageNet using the code of , noting that the original KL-16 in was trained on OpenImages . The comparison is in the last row of Table 2. We use this tokenizer in the following explorations.

Denoising MLP in Diffusion Loss

We investigate the denoising MLP in Table 4. Even a very small MLP (e.g., 2M) can lead to competitive results. As expected, increasing the MLP width helps improve the generation quality; we have explored increasing the depth and had similar observations. Note that our default MLP size (1024 width, 21M) adds only $\scriptstyle\sim$ 5% extra parameters to the MAR-L model. During inference, the diffusion sampler has a modest cost of $\scriptstyle\sim$ 10% of the overall running time. Increasing the MLP width has negligible extra cost in our implementation (Table 4), partially because the main overhead is not about computation but memory communication.

Sampling Steps of Diffusion Loss. Our diffusion process follows the common practice of DDPM: we train with a 1000-step noise schedule but run inference with fewer steps. Figure 6 shows that using 100 diffusion steps at inference is sufficient to achieve a strong generation quality.

Temperature of Diffusion Loss. In the case of cross-entropy loss, the temperature is of central importance. Diffusion Loss also offers a temperature counterpart for controlling the diversity and fidelity. Figure 6 shows the influence of the temperature $\tau$ in the diffusion sampler (see Sec. 3.2) at inference time. The temperature $\tau$ plays an important role in our models, similar to the observations on cross-entropy-based counterparts (note that the cross-entropy results in Table 1 are with their optimal temperatures).

5.2 Properties of Generalized Autoregressive Models

Table 1 is also a comparison on the AR/MAR variants, which we discuss next. First, replacing the raster order in AR with random order has a significant gain, e.g., reducing FID from 19.23 to 13.07 (w/o CFG). Next, replacing the causal attention with the bidirectional counterpart leads to another massive gain, e.g., reducing FID from 13.07 to 3.43 (w/o CFG).

The random-order, bidirectional AR is essentially a form of MAR that predicts one token at a time. Predicting multiple tokens (‘ $>$ 1’) at each step can effectively reduce the number of autoregressive steps. In Table 1, we show that the MAR variant with 64 steps slightly trades off generation quality. A more comprehensive trade-off comparison is discussed next.

Speed/accuracy Trade-off

Following MaskGIT, our MAR enjoys the flexibility of predicting multiple tokens at a time. This is controlled by the number of autoregressive steps at inference time. Figure 7 plots the speed/accuracy trade-off. MAR has a better trade-off than its AR counterpart, noting that AR is with the efficient kv-cache.

With Diffusion Loss, MAR also shows a favorable trade-off in comparison with the recently popular Diffusion Transformer (DiT). As a latent diffusion model, DiT models the interdependence across all tokens by the diffusion process. The speed/accuracy trade-off of DiT is mainly controlled by its diffusion steps. Unlike our diffusion process on a small MLP, the diffusion process of DiT involves the entire Transformer architecture. Our method is more accurate and faster. Notably, our method can generate at a rate of $<$ 0.3 seconds per image with a strong FID of $<$ 2.0.

5.3 Benchmarking with Previous Systems

We compare with the leading systems in Table 3. We explore various model sizes (see Appendix B) and train for 800 epochs. Similar to autoregressive language models, we observe encouraging scaling behavior. Further investigation into scaling could be promising. Regarding metrics, we report 2.35 FID without CFG, largely outperforming other token-based methods. Our best entry has 1.55 FID and compares favorably with leading systems. Figure 8 shows qualitative results.

Discussion and Conclusion

The effectiveness of Diffusion Loss on various autoregressive models suggests new opportunities: modeling the interdependence of tokens by autoregression, jointly with the per-token distribution by diffusion. This is unlike the common usage of diffusion that models the joint distribution of all tokens. Our strong results on image generation suggest that autoregressive models or their extensions are powerful tools beyond language modeling. These models do not need to be constrained by vector-quantized representations. We hope our work will motivate the research community to explore sequence models with continuous-valued representations in other domains.

References

Appendix A Limitations and Broader Impacts

Limitations. Beyond demonstrating the potential of our method for image generation, this paper acknowledges its limitations.

First of all, our image generation system can produce images with noticeable artifacts (Figure 10). This limitation is commonly observed in existing methods, especially when trained on controlled, academic data (e.g., ImageNet). Research-driven models trained on ImageNet still have a noticeable gap in visual quality in comparison with commercial models trained on massive data.

Second, our image generation system relies on existing pre-trained tokenizers. The quality of our system can be limited by the quality of these tokenizers. Pre-training better tokenizers is beyond the scope of this paper. Nevertheless, we hope our work will make it easier to use continuous-valued tokenizers to be developed in the future.

Last, we note that given the limited computational resources, we have primarily tested our method on the ImageNet benchmark. Further validation is needed to assess the scalability and robustness of our approach in more diverse and real-world scenarios.

Broader Impacts. Our primary aim is to advance the fundamental research on generative models, and we believe it will be beneficial to this field. An immediate application of our method is to extend it to large visual generation models, e.g., text-to-image or text-to-video generation. Our approach has the potential to significantly reduce the training and inference cost of these large models. At the same time, our method may suggest the opportunity to replace traditional loss functions with Diffusion Loss in many applications. On the negative side, our method learns statistics from the training dataset, and as such may reflect the bias in the data; the image generation system may be misused to generate disinformation, which warrants further consideration.

Appendix B Additional Implementation Details

To support CFG, at training time, the class condition is replaced with a dummy class token for 10 $\%$ of the samples. At inference time, the model is run with the given class token and the dummy token, providing two outputs $z_{c}$ and $z_{u}$. The predicted noise $\varepsilon$ is then modified as: $\varepsilon=\varepsilon_{\theta}(x_{t}|t,z_{u})+\omega\cdot(\varepsilon_{\theta}(x_{t}|t,z_{c})-\varepsilon_{\theta}(x_{t}|t,z_{u}))$, where $\omega$ is the guidance scale. At inference time, we use a CFG schedule following . We sweep the optimal guidance scale and temperature combination for each model.
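The guided noise prediction is a per-dimension extrapolation between the unconditional and conditional predictions. The sketch below uses hypothetical function names and a hypothetical dummy-label value; the 10% drop rate and the guidance formula follow the text.

```python
import random

def drop_class_label(label, rng, dummy_label=-1, drop_prob=0.1):
    """Training-time label dropout: replace the class with a dummy token for ~10% of samples."""
    return dummy_label if rng.random() < drop_prob else label

def cfg_noise(eps_cond, eps_uncond, omega):
    """eps = eps_u + omega * (eps_c - eps_u), applied per dimension."""
    return [u + omega * (c - u) for c, u in zip(eps_cond, eps_uncond)]

eps_c = [1.0, -2.0]   # stands in for eps_theta(x_t | t, z_c)
eps_u = [0.5, 0.0]    # stands in for eps_theta(x_t | t, z_u)
guided = cfg_noise(eps_c, eps_u, omega=3.0)
```

Setting `omega=1.0` recovers the conditional prediction and `omega=0.0` the unconditional one; larger values push the sample further toward the class condition.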

Training

By default, the models are trained using the AdamW optimizer for 400 epochs. The weight decay and momenta for AdamW are 0.02 and (0.9, 0.95). We use a batch size of 2048 and a learning rate (lr) of 8e-4. Our models with Diffusion Loss are trained with a 100-epoch linear lr warmup, followed by a constant lr schedule. The cross-entropy counterparts are trained with a cosine lr schedule, which works better for them. Following , we maintain the exponential moving average (EMA) of the model parameters with a momentum of 0.9999.
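The learning-rate schedule and the EMA update for the Diffusion Loss models can be written out directly. This sketch assumes that the warmup starts from zero and that the schedule is evaluated at (possibly fractional) epoch values; the function names are hypothetical.

```python
def learning_rate(epoch, base_lr=8e-4, warmup_epochs=100):
    """Linear warmup over the first 100 epochs, then a constant learning rate."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs   # assumed to start from 0
    return base_lr

def ema_update(ema_params, params, momentum=0.9999):
    """ema <- momentum * ema + (1 - momentum) * params, element-wise."""
    return [momentum * e + (1.0 - momentum) * p for e, p in zip(ema_params, params)]

lrs = [learning_rate(epoch) for epoch in (0, 50, 100, 399)]
ema = ema_update([1.0, 0.0], [0.0, 1.0])
```

With a momentum of 0.9999, each update moves the averaged weights by only 0.01% toward the current weights, so the EMA reflects roughly the last ten thousand updates.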

Implementation Details of Table 3

To explore our method’s scaling behavior, we study three model sizes described as follows. In addition to MAR-L, we explore a smaller model (MAR-B) and a larger model (MAR-H). MAR-B, -L, and -H respectively have 24, 32, 40 Transformer blocks and a width of 768, 1024, and 1280. In Table 3 specifically, the denoising MLP respectively has 6, 8, 12 blocks and a width of 1024, 1280, and 1536. The training length is increased to 800 epochs. At inference time, we run 256 autoregressive steps to achieve the best results.

Pseudo-code of Diffusion Loss

Compute Resources

Our training is mainly done on 16 servers with 8 V100 GPUs each. Training a 400-epoch MAR-L model takes $\sim$ 2.6 days on these GPUs. As a comparison, training a DiT-XL/2 and LDM-4 model for the same number of epochs on this cluster takes 4.6 and 9.5 days, respectively.

Appendix C Comparison between MAR and MAGE

MAR (regardless of the loss used) is conceptually related to MAGE. Besides implementation differences (e.g., architecture specifics, hyper-parameters), a major conceptual difference between MAR and MAGE is in the scanning order at inference time. In MAGE, following MaskGIT, the locations of the next tokens to be predicted are determined on-the-fly by the sample confidence at each location, i.e., the more confident locations are more likely to be selected at each step. In contrast, MAR adopts a fully randomized order, and its temperature sampling is applied to each token. Table 4 compares this difference in controlled settings. The first line is our MAR implementation but using MAGE’s on-the-fly ordering strategy, which gives results similar to those of the simpler random-order counterpart. Fully randomized ordering can make the training and inference process consistent regarding the distribution of orders; it also allows us to adopt token-wise temperature sampling in a way similar to autoregressive language models (e.g., GPT).

Appendix D Additional Comparisons

Following previous works, we also report results on ImageNet at a resolution of 512 $\times$ 512, compared with leading systems (Table 5). For simplicity, we use the KL-16 tokenizer, which gives a sequence length of 32 $\times$ 32 on a 512 $\times$ 512 image. Other settings follow the MAR-L configuration described in Table 3. Our method achieves an FID of 2.74 without CFG and 1.73 with CFG. Our results are competitive with those of previous systems. Due to limited resources, we have not trained the larger MAR-H on ImageNet 512 $\times$ 512, which is expected to have better results.

D.2 L2 Loss vs. Diff Loss

A naïve baseline for continuous-valued tokens is to compute the Mean Squared Error (MSE, i.e., L2) loss directly between the predictions and the target tokens. In the case of a raster-order AR model, using the L2 loss introduces no randomness and thus cannot generate diverse samples. In the case of the MAR models with the L2 loss, the only randomness is the sequence order; the prediction at a location is deterministic for any given order. In our experiment, we have trained an MAR model with the L2 loss, which as expected leads to a disastrous FID score ( $>$ 100).

We thank Congyue Deng and Xinlei Chen for helpful discussion. We thank Google TPU Research Cloud (TRC) for granting us access to TPUs, and Google Cloud Platform for supporting GPU resources.