Multimodal generative models that can understand and generate across multiple modalities are dominated by autoregressive (AR) approaches, which process tokens sequentially from left to right or top to bottom. These models jointly handle images, text, video, and audio for various tasks such as image captioning, question answering, and image generation. While AR models have been highly successful in the text domain, they have been found suboptimal for processing images, videos, and audio due to the high correlation between adjacent tokens, which wastes inference-time compute on predicting each token separately. In this work, we explore discrete diffusion models as a unified generative formulation in the joint text and image domain, building upon their recent success in the text domain alone. Discrete diffusion models offer several advantages over AR models, including improved control over the quality-versus-diversity trade-off of generated samples, the ability to perform joint multimodal inpainting (across both text and image domains), and greater controllability in generation through guidance. Leveraging these benefits, we present the first Unified Multimodal Discrete Diffusion (UniDisc) model, which is capable of jointly processing text and images for a variety of downstream tasks. We compare UniDisc to multimodal AR models of similar capacity, demonstrating that UniDisc outperforms them in both performance and inference-time compute, while offering enhanced controllability, editability, inpainting, and a flexible trade-off between inference time and generation quality.
UniDisc is a unified multimodal discrete diffusion model that can jointly process and generate text and images. First, each modality is converted into a sequence of discrete tokens, and we randomly replace a subset of these tokens with the [MASK] token according to a noise schedule (denoted in the figure with grey boxes). We jointly denoise the image and text, supervising with a weighted cross-entropy loss. At inference time, we begin with a set of [MASK] tokens and iteratively unmask them.
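A minimal sketch of this noising and training step, assuming a shared discrete vocabulary, a placeholder `MASK_ID`, and a 1/t loss weighting; the exact schedule and weighting here are illustrative assumptions rather than the paper's precise recipe:

```python
import torch
import torch.nn.functional as F

MASK_ID = 0          # placeholder id for the shared [MASK] token
VOCAB_SIZE = 66000   # placeholder size of the joint text+image vocabulary


def noise_tokens(tokens: torch.Tensor, t: torch.Tensor):
    """Independently replace each token with [MASK] with probability t (per sample)."""
    mask = torch.rand_like(tokens, dtype=torch.float) < t[:, None]
    noised = tokens.clone()
    noised[mask] = MASK_ID
    return noised, mask


def training_step(model, text_tokens, image_tokens):
    # Concatenate both modalities into one discrete sequence.
    tokens = torch.cat([text_tokens, image_tokens], dim=1)   # (B, L)
    t = torch.rand(tokens.shape[0], device=tokens.device)    # noise level per sample
    noised, mask = noise_tokens(tokens, t)

    logits = model(noised)                                    # (B, L, VOCAB_SIZE)
    # Cross-entropy only on masked positions, weighted by 1/t
    # (a common masked-diffusion weighting; assumed here for illustration).
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    weight = (1.0 / t.clamp_min(1e-3))[:, None]
    loss = (ce * mask * weight).sum() / mask.sum().clamp_min(1)
    return loss
```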
UniDisc can automatically improve a user-provided image and caption. We adopt a best-of-n sampling strategy with n distinct noise masks, unroll each generation to completion, and use the model's own likelihood to select the best generation.
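A sketch of this best-of-n procedure; `generate` and `score` stand in for UniDisc's unmasking roll-out and likelihood scoring, which are not reproduced here:

```python
import torch


@torch.no_grad()
def best_of_n(generate, score, n: int = 8):
    """Draw n generations under distinct random noise masks and keep the one
    the model itself scores as most likely.

    `generate` runs one full unmasking roll-out from a fresh random mask;
    `score` returns the model's log-likelihood of a completed sample.
    Both are placeholders for the corresponding UniDisc routines.
    """
    best_sample, best_score = None, float("-inf")
    for _ in range(n):
        sample = generate()              # new random noise mask each call
        sample_score = score(sample)     # model's own likelihood of the result
        if sample_score > best_score:
            best_sample, best_score = sample, sample_score
    return best_sample
```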
We augment real images by overlaying random objects from the COCO dataset. Similarly, we augment captions by asking an LLM to generate purposely incorrect variations. We then randomly mask the image and text inputs and unmask as described above, automatically removing these undesired image artifacts and generating the correct caption. No human intervention or manual masking is used in any of these examples. In the final row, we fix the text prompt and only allow updates to the image.
To take advantage of this, we design a novel multimodal caching mechanism that allows UniDisc to reuse the same denoising steps for specific modalities, reducing overall inference time.
We maintain different noising schedules for image and text tokens, effectively setting a larger \(dt_{\text{image}}\) and a smaller \(dt_{\text{text}}\).
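A minimal sketch of how such modality-specific step sizes could be scheduled; the function name and the mapping from the dt ratio to update frequencies are illustrative assumptions:

```python
def modality_schedule(num_steps: int, dt_image: float, dt_text: float):
    """Per-iteration update flags for each modality. A larger dt_image means
    image tokens take fewer, larger denoising steps, so their computation can
    be cached and reused on the iterations in between, while text tokens
    (smaller dt_text) are updated every iteration. Illustrative only: the
    actual caching mechanism reuses model computation for whichever modality
    is not being updated at a given step."""
    image_every = max(1, round(dt_image / dt_text))
    return [{"image": step % image_every == 0, "text": True}
            for step in range(num_steps)]


# Example: with dt_image four times dt_text over 8 iterations,
# image tokens are updated at steps 0 and 4, text tokens at every step.
schedule = modality_schedule(8, dt_image=4 / 8, dt_text=1 / 8)
```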
Intermediate Steps during Joint Infilling of Image and Text. UniDisc jointly infills both image and text during generation.
To quantitatively analyze the generation order, we use a language-grounded segmentation model (Grounded SAM 2) to segment the image given the text prompt. We then record the order of token decoding when using confidence-based sampling and plot the progression of each region. We observe that the model generates uniformly over concepts and modalities. In AR models this is not possible, as the model must generate in a fixed order (e.g., text first, then image tokens in raster order), and thus cannot jointly reason over modalities and over multiple parts of the image.
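A minimal sketch of the confidence-based sampling referred to here, in which the masked positions where the model is most confident are committed first; the decoding order is simply the step at which each position is unmasked (function and argument names are illustrative):

```python
import torch


@torch.no_grad()
def confidence_unmask_step(model, tokens, mask_id: int, k: int):
    """Unmask the k masked positions with the highest model confidence and
    return their indices, which give this step's decoding order.
    (k should not exceed the number of currently masked positions.)"""
    logits = model(tokens)                              # (B, L, V)
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)                      # (B, L) confidence and argmax token
    conf = conf.masked_fill(tokens != mask_id, float("-inf"))  # only consider masked slots
    topk = conf.topk(k, dim=-1).indices                 # (B, k) positions to commit
    tokens = tokens.scatter(1, topk, pred.gather(1, topk))
    return tokens, topk
```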
L2 distance between unconditional and conditional logits over the course of generation.
Effect of classifier-free guidance on UniDisc, from left to right, starting with \(w=0\), increasing to \(w=8\).
Caption: Crab meditating, surfboard, orange sun setting, rainbow clouds, zen beach.
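A sketch of classifier-free guidance in this discrete setting, assuming the common parameterization \((1+w)\cdot\text{cond} - w\cdot\text{uncond}\) with the unconditional pass obtained by masking out the text, so that \(w=0\) recovers plain conditional sampling; the exact form used by UniDisc may differ:

```python
import torch

MASK_ID = 0  # placeholder [MASK] token id


@torch.no_grad()
def guided_logits(model, image_tokens, text_tokens, w: float):
    """Classifier-free guidance for text-conditioned image generation."""
    cond_seq = torch.cat([text_tokens, image_tokens], dim=1)
    # Null condition: replace the text with [MASK] tokens.
    uncond_seq = torch.cat([torch.full_like(text_tokens, MASK_ID), image_tokens], dim=1)

    n_text = text_tokens.shape[1]
    logits_cond = model(cond_seq)[:, n_text:]      # logits over image positions
    logits_uncond = model(uncond_seq)[:, n_text:]
    return (1.0 + w) * logits_cond - w * logits_uncond
```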
We analyze generation quality versus time and observe a tradeoff between latency and throughput when comparing UniDisc and AR models. KV caching in AR models results in higher throughput as the batch size increases. This tradeoff can be explained by the number of function evaluations (NFEs) and the cost of each in the two cases. In AR generation with KV caching, the NFE count is fixed, but each forward pass is substantially less expensive than in the NAR case. In contrast, NAR generation can use substantially fewer NFEs, but each is more costly. Modern GPUs reach peak throughput only at larger batch sizes; as the batch size decreases, the difference in computation per function evaluation diminishes, giving NAR favorable latency.
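A back-of-the-envelope sketch of this NFE-times-cost-per-pass tradeoff; all timing numbers below are hypothetical placeholders, not measurements:

```python
def generation_latency(nfe: int, seconds_per_forward: float) -> float:
    """Latency ~= number of function evaluations x wall-clock cost per forward pass."""
    return nfe * seconds_per_forward


# Hypothetical numbers for a 1024-token generation at batch size 1:
# AR with KV caching needs one cheap incremental pass per token,
ar = generation_latency(nfe=1024, seconds_per_forward=2e-3)
# while a NAR diffusion sampler needs far fewer passes, each over the full
# sequence and individually more costly. At small batch sizes the per-pass
# cost gap shrinks (GPU underutilization), which is where NAR comes out ahead.
nar = generation_latency(nfe=32, seconds_per_forward=8e-3)
print(f"AR: {ar:.2f}s  NAR: {nar:.2f}s")
```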
We demonstrate the ability of UniDisc to perform zero-shot flexible-resolution generation due to the use of RoPE embeddings on both text and image tokens. This model was fine-tuned on 512x512 images but is able to generate at 1024x1024 without further training.
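For illustration, a minimal 1-D RoPE sketch: because positions are encoded by rotations computed on the fly rather than looked up in a learned embedding table, the same function applies to sequences longer than those seen in training. UniDisc's actual RoPE configuration over text and image tokens may differ from this simplified form:

```python
import torch


def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (B, L, D), D even.
    In practice this is applied to attention queries and keys; a generic
    feature tensor is used here for brevity."""
    B, L, D = x.shape
    assert D % 2 == 0
    half = D // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = torch.arange(L, dtype=x.dtype, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # (L, half), broadcasts over batch
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


# The same call works for a longer sequence (e.g. a higher-resolution image)
# without any new parameters: rope(torch.randn(1, 4096, 64))
```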
Text | Chameleon Perplexity | GPT2 Perplexity |
---|---|---|
"ICLR is globally renowned for presenting..." (Continued) | 32.836 | 35.780 |
"This is simple. This is simple." (Repeated) | 8.423 | 3.930 |
"Words Words Words Words" (Repeated) | 2.226 | 3.583 |
"AAAAAAAAAAA" (Repeated) | 2.732 | 1.904 |
" " (Spaces Repeated) | 80.240 | 1.095 |