Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers

TL;DR

Two DiTs trained on the same task can reach the same accuracy via fundamentally different internal circuits. Swapping the text encoder — random token embeddings (RTE) vs. pretrained T5 — re-routes how spatial relation information reaches image tokens.

RTE-DiT learns a clean two-stage circuit: a relation head tags image positions via QK interactions, and an object head then renders shapes into the tagged regions. T5-DiT bypasses the relation token entirely: spatial layout is decoded from contextualized object-token embeddings (especially the second shape word). The T5 circuit collapses under small prompt perturbations; the RTE circuit does not.

Overview

Text-to-image diffusion models routinely fail to place objects in the correct spatial configuration — "a red square to the left of a blue circle" comes out swapped or merged. We ask mechanistically: when a Diffusion Transformer does get spatial relations right, how does it do it, and why does the same architecture fail under tiny prompt changes?

We train PixArt-style DiTs from scratch on a controlled two-object relational dataset, varying only the text encoder. Using a scalable attention synopsis method, we identify the specific cross-attention heads that carry relational information, trace the circuit end-to-end, and verify each step with weight-space analysis, ablations, and causal embedding manipulations.

Two-object relational dataset and DiT setup

Figure 1. The controlled two-object setup: prompts of the form [color1] [shape1] [relation] [color2] [shape2] over 3 shapes, 2 colors, and 8 spatial relations. We train DiT-B (and smaller variants) from scratch with T5-XXL, random token embeddings (RTE), or RTE without positional encoding.
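
The prompt template in Figure 1 can be enumerated with a few lines of Python. This is a minimal sketch, not the authors' data pipeline: the source fixes only the counts (3 shapes, 2 colors, 8 relations), so the specific words below are assumptions chosen for illustration, and the sketch does not filter out pairs of identical objects, which the actual dataset may or may not do.

```python
from itertools import product

# Hypothetical vocabulary: only the counts (2 colors, 3 shapes, 8 relations)
# come from the source; the exact words are assumptions.
COLORS = ["red", "blue"]
SHAPES = ["square", "circle", "triangle"]
RELATIONS = [
    "above", "below", "to the left of", "to the right of",
    "to the upper left of", "to the upper right of",
    "to the lower left of", "to the lower right of",
]

def build_prompts():
    """Enumerate every [color1] [shape1] [relation] [color2] [shape2] prompt."""
    prompts = []
    for c1, s1, rel, c2, s2 in product(COLORS, SHAPES, RELATIONS, COLORS, SHAPES):
        prompts.append(f"{c1} {s1} {rel} {c2} {s2}")
    return prompts

prompts = build_prompts()
print(len(prompts))  # 2 * 3 * 8 * 2 * 3 = 288
```

Under these assumptions the template yields 288 prompts, which is small enough that every combination can be covered during training and evaluation.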

Key Findings

Same accuracy, different circuits. Both RTE-DiT and T5-DiT reach ~84% spatial-relation accuracy, but their internal information flow is incompatible.
Spatial relations are learned last. Training dynamics show colors, then shapes, then attribute binding, then spatial relations — the relational structure is the slowest to converge.
RTE-DiT: a clean two-stage circuit. A spatial relation head (e.g. L2H8) writes a positional tag into image tokens via QK interaction with relation features; an object generation head reads the tag and renders the shape.
T5-DiT: information fusion in text tokens. Spatial information leaks out of the relation word and into the contextual embedding of the second shape token. The model effectively reads "where + what" from a single token.
Robustness diverges. Inserting an innocuous filler word (e.g. "the") drops T5-DiT spatial accuracy from 81% → 50% while RTE-DiT holds at ~84%.
Vector arithmetic controls layout. Linear edits to factorized T5 embeddings causally swap the generated spatial relation while preserving object identity — direct evidence the relation is encoded in a linear subspace of the contextual embedding.
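
The filler-word perturbation behind the robustness finding above can be sketched as a simple string edit. The source does not say where the filler is inserted, so the `position` argument and the helper name `insert_filler` are hypothetical; the example prompt reuses the wording from the Overview.

```python
def insert_filler(prompt, filler="the", position=0):
    """Return the prompt with one filler word inserted before word `position`.

    `position` counts whitespace-separated words; position == len(words)
    appends the filler at the end. Where the filler goes is an assumption.
    """
    words = prompt.split()
    if not 0 <= position <= len(words):
        raise ValueError("position out of range")
    return " ".join(words[:position] + [filler] + words[position:])

base = "red square to the left of blue circle"
print(insert_filler(base))              # the red square to the left of blue circle
print(insert_filler(base, position=6))  # red square to the left of the blue circle
```

Evaluating the same trained model on `base` and on its perturbed variants isolates the effect of the extra token, since object identity and the requested relation are unchanged.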

Attention Synopsis & Circuit Discovery

Cross-attention in a DiT produces millions of maps over training. Our attention synopsis aggregates these by token category (object words, relation word, color words, filler) and averages over diffusion timesteps, surfacing the small number of heads that consistently route relational information.
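
The aggregation described above can be sketched in NumPy. This is an illustrative reading of the synopsis, not the authors' implementation: it assumes all prompts share one token layout (so a single `token_category` vector applies), that attention is already softmax-normalized over text tokens, and that per-category attention mass is summed rather than averaged. The function name and array layout are hypothetical.

```python
import numpy as np

CATEGORIES = ["color", "shape", "relation", "filler"]

def attention_synopsis(attn, token_category):
    """Collapse cross-attention maps into a per-head category profile.

    attn: array of shape (samples, timesteps, layers, heads,
          image_tokens, text_tokens), normalized over text_tokens.
    token_category: int array of shape (text_tokens,) indexing CATEGORIES.
    Returns an array of shape (layers, heads, len(CATEGORIES)).
    """
    attn = np.asarray(attn, dtype=float)
    token_category = np.asarray(token_category)
    # Average over samples, timesteps, and image tokens.
    mean_attn = attn.mean(axis=(0, 1, 4))  # (layers, heads, text_tokens)
    profile = np.zeros(mean_attn.shape[:2] + (len(CATEGORIES),))
    for c in range(len(CATEGORIES)):
        mask = token_category == c
        if mask.any():
            # Total attention mass this head sends to category c.
            profile[..., c] = mean_attn[..., mask].sum(axis=-1)
    return profile

# Toy input: random attention for the layout [color, shape, relation, color, shape].
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3, 2, 2, 16, 5))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
token_category = np.array([0, 1, 2, 0, 1])
profile = attention_synopsis(attn, token_category)
print(profile.shape)  # (2, 2, 4)
```

Heads whose profile concentrates on the relation or shape entry would be the natural candidates for closer circuit analysis.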

Figure 3a (interactive). Attention-Map Synopsis.

Figure 3a. Attention-map synopsis: cross-attention maps over millions of (sample, timestep) pairs are aggregated by token category (color / shape / relation / filler) and averaged, so each (layer, head) gets a four-vector category profile. A handful of heads cluster on the relation and shape categories — these become the circuit-analysis candidates.

Figure 3 (original). Attention synopsis across all heads and layers. A handful of heads carry the bulk of relation- and object-specific signal; these become the candidates for circuit-level analysis.

Relation Circuits in DiTs

Figure 6. End-to-end circuit comparison. RTE-DiT: a two-stage circuit where a relation head writes a positional tag via QK interaction with position embeddings, and an object head reads that tag to render the correct shape. T5-DiT: contextual fusion collapses the same computation into a single fused token (here "circle"), which can drive the image directly. This explains why ablating the relation word has little effect, while inserting filler words breaks the layout.

Figure 6 (original). The static circuit schematic as it appears in the paper.

RTE-DiT: two-stage relation → object pipeline

Figure 4. The spatial relation head implements a QK circuit that aligns positional embeddings with relation semantics, producing a spatial gradient on image tokens.
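
A toy version of the QK interaction in Figure 4 shows how a bilinear form between positional embeddings and a relation embedding can yield a spatial gradient. Everything here is a simplifying assumption for illustration: the positional embedding is just the (x, y) coordinate, the "left" relation embedding is a hand-picked vector, and `W_q` and `W_k` are identity matrices rather than learned weights.

```python
import numpy as np

def relation_logits(pos_emb, rel_emb, W_q, W_k):
    """Pre-softmax attention logits of each image token toward one relation token."""
    q = pos_emb @ W_q   # queries from image-token positions, (image_tokens, d_head)
    k = rel_emb @ W_k   # key from the relation token, (d_head,)
    return (q @ k) / np.sqrt(W_q.shape[1])

# Toy 8x8 grid whose positional embedding is the (x, y) coordinate in [-1, 1].
side = 8
coords = np.linspace(-1.0, 1.0, side)
xs, ys = np.meshgrid(coords, coords)  # xs varies along columns, ys along rows
pos_emb = np.stack([xs.ravel(), ys.ravel()], axis=1)  # (64, 2)

# Hypothetical embedding for "left": points toward negative x.
rel_left = np.array([-1.0, 0.0])
W_q = np.eye(2)
W_k = np.eye(2)

logit_map = relation_logits(pos_emb, rel_left, W_q, W_k).reshape(side, side)
print(logit_map[0].round(2))  # decreases from the left column to the right column
```

In this toy setting the logits depend only on the x coordinate, so the map is a left-to-right ramp; in a trained head the analogous structure would live in the product of the learned query and key weights.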

Figure 5. The object generation head mediates between shape tokens and the tagged image regions, completing the two-stage circuit.

Figure 5c (animated). Cross-attention maps of the relation head and object head over diffusion denoising timesteps, swept across the 8 spatial relations. The relation head writes a position-dependent gradient on image tokens early in denoising; the object head then reads that gradient and routes shape information to the correct positions. This is the same two-stage circuit summarized in Figure 6, shown over time.

T5-DiT: information fusion in object tokens

T5-DiT vector arithmetic on contextual embeddings

Figure 7. In T5-DiT, ablating the relation token leaves spatial accuracy largely intact — the information has already been fused into the contextual embedding of the second shape token. Linear arithmetic on factorized embeddings causally manipulates the generated spatial relation.
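
The kind of linear edit described in Figure 7 can be sketched on synthetic data. The source does not spell out how the embeddings are factorized, so this sketch assumes a mean-based factorization (per-relation mean minus grand mean) and uses made-up embeddings in which relation and object occupy orthogonal coordinates; all function names are hypothetical, and real T5 embeddings need not be this clean.

```python
import numpy as np

def relation_directions(embs, rel_ids, n_rel):
    """Per-relation mean embedding, centered on the grand mean."""
    grand_mean = embs.mean(axis=0)
    return np.stack([embs[rel_ids == r].mean(axis=0) - grand_mean
                     for r in range(n_rel)])

def swap_relation(emb, directions, src, tgt):
    """Remove the source relation component and add the target one."""
    return emb - directions[src] + directions[tgt]

def nearest_relation(emb, directions, grand_mean):
    """Decode the relation whose direction lies closest to the centered embedding."""
    dists = np.linalg.norm(emb - grand_mean - directions, axis=1)
    return int(np.argmin(dists))

# Synthetic "second shape token" embeddings: one-hot relation part,
# one-hot object part, plus small noise (an assumption, not real T5 data).
rng = np.random.default_rng(0)
n_rel, n_obj, reps = 8, 3, 4
dim = n_rel + n_obj
embs, rel_ids = [], []
for r in range(n_rel):
    for o in range(n_obj):
        for _ in range(reps):
            e = np.zeros(dim)
            e[r] = 1.0
            e[n_rel + o] = 1.0
            embs.append(e + 0.01 * rng.normal(size=dim))
            rel_ids.append(r)
embs, rel_ids = np.array(embs), np.array(rel_ids)

directions = relation_directions(embs, rel_ids, n_rel)
grand_mean = embs.mean(axis=0)
edited = swap_relation(embs[0], directions, src=0, tgt=5)
print(nearest_relation(embs[0], directions, grand_mean))  # 0
print(nearest_relation(edited, directions, grand_mean))   # 5
```

In this synthetic setting the edit changes the decoded relation while leaving the object coordinates nearly untouched, which mirrors the claim that the relation is carried by a linear subspace of the contextual embedding.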

Why this matters

The same downstream task is solvable by qualitatively different mechanisms — and which mechanism a model converges to depends on the representational substrate of the text encoder, not the diffusion backbone. T5's rich contextual fusion is an efficient solution in-distribution but creates a brittle dependence on full-sequence context; RTE's tokens, lacking pretraining priors, force the DiT to construct an explicit, modular relation circuit that generalizes more robustly.

This is a concrete case study in how pretrained encoders shape downstream circuits, and a methodological template — attention synopsis plus targeted causal interventions — for dissecting compositional behavior in diffusion models.

Citation

@inproceedings{wang2026circuit,
  title     = {Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers},
  author    = {Wang, Binxu and Fan, Jingxuan and Pan, Xu},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

Acknowledgements

We thank Martin Wattenberg, Yonatan Belinkov, and Thomas Fel for feedback, and participants of the NEMI Workshop and the Mechanistic Interpretability Workshop at NeurIPS 2025 for helpful discussion. This work was supported by the Kempner Research Fellowship and the Schwartz Fellowship, with compute from the Kempner Institute cluster.