Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers

Jun 1, 2026

Binxu Wang

Jingxuan Fan

Xu Pan


Abstract

We study the mechanistic interpretability of Diffusion Transformers (DiTs) for text-to-image generation, focusing on how these models generate correct spatial relations between objects. We train DiTs of varying sizes, paired with different text encoders, on a task that requires generating images of two objects whose attributes and spatial relation match the text prompt. When the DiT is trained with random text embeddings, we find that spatial-relation information is passed to the image tokens through a two-stage circuit involving separate cross-attention heads. With pretrained T5 encoders, the DiT instead employs a different circuit that leverages information fusion in the text tokens. Notably, although both approaches achieve similar in-domain performance, their robustness to out-of-domain perturbations differs significantly, suggesting that generating correct spatial relations remains a challenge in real-world applications.

Type

Conference paper

Publication

Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Last updated on Jun 1, 2026

Diffusion, Interpretability, Computer Vision, Science of AI

Authors

Binxu Wang

Research Fellow
