CVPR 2026 ★ Highlight

Circuit Mechanisms for Spatial Relation Generation
in Diffusion Transformers

Kempner Institute & Harvard University · *Equal contribution

TL;DR

Two DiTs trained on the same task can reach the same accuracy via fundamentally different internal circuits. Swapping the text encoder, from random token embeddings (RTE) to pretrained T5, re-routes how spatial relation information reaches image tokens.

RTE-DiT learns a clean two-stage circuit: a relation head tags image positions via QK interactions, and an object head renders shapes into the tagged regions. T5-DiT bypasses the relation token entirely: spatial layout is decoded from contextualized object-token embeddings (especially the second shape word). The T5 circuit collapses under small prompt perturbations; the RTE circuit does not.

Overview

Text-to-image diffusion models routinely fail to place objects in the correct spatial configuration: "a red square to the left of a blue circle" comes out swapped or merged. We ask mechanistically: when a Diffusion Transformer does get spatial relations right, how does it do it, and why does the same architecture fail under tiny prompt changes?

We train PixArt-style DiTs from scratch on a controlled two-object relational dataset, varying only the text encoder. Using a scalable attention-synopsis method, we identify the specific cross-attention heads that carry relational information, trace the circuit end-to-end, and verify each step with weight-space analysis, ablations, and causal embedding manipulations.

Two-object relational dataset and DiT setup
Figure 1. The controlled two-object setup: prompts of the form [color1] [shape1] [relation] [color2] [shape2] over 3 shapes, 2 colors, and 8 spatial relations. We train DiT-B (and smaller variants) from scratch with T5-XXL, random token embeddings (RTE), or RTE without positional encoding.
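As a rough illustration of the prompt grammar in Figure 1, the sketch below enumerates prompts of the form [color1] [shape1] [relation] [color2] [shape2]. Only "red", "blue", "square", "circle", and "left of" appear on this page; the third shape, the other seven relation phrases, and the joining word "is" for every relation are placeholder assumptions. The page also does not say whether any attribute combinations were excluded, so the full Cartesian product here is an assumption as well.

```python
from itertools import product

# Vocabulary sizes follow Figure 1 (2 colors, 3 shapes, 8 relations).
# "triangle" and every relation phrase except "left of" are assumed placeholders.
COLORS = ["red", "blue"]
SHAPES = ["square", "circle", "triangle"]
RELATIONS = [
    "left of", "right of", "above", "below",
    "upper left of", "upper right of", "lower left of", "lower right of",
]


def enumerate_prompts():
    """Yield every prompt '[color1] [shape1] is [relation] [color2] [shape2]'."""
    for c1, s1, rel, c2, s2 in product(COLORS, SHAPES, RELATIONS, COLORS, SHAPES):
        yield f"{c1} {s1} is {rel} {c2} {s2}"


prompts = list(enumerate_prompts())
```

Under these assumptions the grammar yields 2 × 3 × 8 × 2 × 3 = 288 distinct prompts, small enough to analyze exhaustively.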

Key Findings

Attention Synopsis & Circuit Discovery

Cross-attention in a DiT produces millions of maps over training. Our attention synopsis aggregates these by token category (object words, relation word, color words, filler) and averages over diffusion timesteps, surfacing the small number of heads that consistently route relational information.
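The aggregation described above can be sketched in NumPy as follows. This is a minimal illustration for a single sample, not the authors' implementation: the tensor layout, the function name `attention_synopsis`, and the integer category labels are all assumptions made for the example.

```python
import numpy as np


def attention_synopsis(attn, img_types, txt_types, n_img_types, n_txt_types):
    """Average attention within each (image token type, text token type) block.

    attn:      [T, L, H, N_img, N_txt] cross-attention probabilities
               (timestep, layer, head, image token, text token); assumed layout.
    img_types: int array [N_img], category of each image token (e.g. obj1/obj2/bg).
    txt_types: int array [N_txt], category of each text token (e.g. color/shape/rel).
    Returns (tensor, grid): tensor is [T, L, H, n_img_types, n_txt_types];
    grid is the same quantity averaged over timesteps.
    """
    # One-hot membership matrices, column-normalized so that contracting
    # with them takes a mean over the tokens in each category.
    img_onehot = np.eye(n_img_types)[img_types]   # [N_img, n_img_types]
    txt_onehot = np.eye(n_txt_types)[txt_types]   # [N_txt, n_txt_types]
    img_avg = img_onehot / np.maximum(img_onehot.sum(axis=0), 1)
    txt_avg = txt_onehot / np.maximum(txt_onehot.sum(axis=0), 1)
    tensor = np.einsum("tlhij,ia,jb->tlhab", attn, img_avg, txt_avg)
    return tensor, tensor.mean(axis=0)
```

Averaging the returned grid over many samples would then give one small category-by-category profile per (layer, head), which is far easier to scan than the raw maps.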

Figure 3a (interactive). Attention-Map Synopsis. Panels: attention map (image tokens × text tokens) → average by (image token type) × (text token type) → synopsis tensor (layer × head × t) → synopsis grid (layer × head), after averaging over timesteps t. Example prompt: "red square is left of blue circle", with image tokens grouped as obj1 / bg / obj2.
Figure 3a. Attention-map synopsis: cross-attention maps over millions of (sample, timestep) pairs are aggregated by token category (color / shape / relation / filler) and averaged, so each (layer, head) gets a four-dimensional category profile. A handful of heads cluster on the relation and shape categories; these become the circuit-analysis candidates.
Figure 3b (interactive). Weight-Space Head Screening. Panels: for the relation "left", v_rel and E_pos[i] pass through a cross-attention head's W_K and W_Q; the produced gradient is compared with a reference gradient (cos sim ≈ 0.97, avg over 8 relations), and the scores are shown on the synopsis grid (layer × head).
Figure 3b. Weight-space head screening: for each (layer, head) we directly inspect the QK weight interaction W_Q^T W_K and measure its alignment with positional and relation feature directions. This finds the same candidate heads as the attention-map synopsis, without running a single forward pass.
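One way to picture the weight-space screening in Figure 3b is the sketch below. It assumes the attention logit at image position i is E_pos[i]^T W_Q^T W_K v_rel and scores a head by the cosine similarity between that logit map and a reference spatial pattern. The linear-ramp reference, its sign convention, and all function names are assumptions for illustration; the page does not specify the exact feature directions or reference patterns used.

```python
import numpy as np


def qk_logit_map(E_pos, W_Q, W_K, v_rel):
    """Logits E_pos[i]^T W_Q^T W_K v_rel for every image position i.

    E_pos: [N_img, d_img] positional embeddings of image tokens.
    W_Q:   [d_head, d_img] query projection; W_K: [d_head, d_txt] key projection.
    v_rel: [d_txt] embedding (or feature direction) of a relation word.
    """
    return (E_pos @ W_Q.T) @ (W_K @ v_rel)


def reference_ramp(grid_size, direction):
    """Assumed reference pattern: a linear ramp along direction = (dy, dx)."""
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    ramp = direction[0] * ys + direction[1] * xs
    return ramp.reshape(-1).astype(float)


def screening_score(E_pos, W_Q, W_K, v_rel, reference):
    """Cosine similarity between the mean-centered logit map and the reference."""
    logits = qk_logit_map(E_pos, W_Q, W_K, v_rel)
    a = logits - logits.mean()
    b = reference - reference.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

Because the score depends only on weights and embeddings, it can be computed for every (layer, head) without generating any images.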
Figure 3 (original). Attention synopsis across all heads and layers. A handful of heads carry the bulk of relation- and object-specific signal; these become the candidates for circuit-level analysis.

Relation Circuits in DiTs

Figure 6 (interactive). Step-through circuit diagram with an encoder toggle, for the prompt "red square is left of blue circle".
RTE-DiT view: the relation word ("left of") carries the spatial information. Spatial Relation Head: QK forms position-dependent attention, VO writes a positional tag ("becoming obj1" / "becoming obj2"). Object Generation Head: QK matches tagged positions to shape tokens, VO writes shape values (attention to "square" and "circle").
T5-DiT view: context fuses into the last token ("circle"), so the second shape also encodes the relation through a fused component v_rel. Relation Head: Q from image, K from v_rel; gradient-producing attention with scores E_pos Q · K v_rel, from aligned and anti-aligned heads. Object Head: reads the obj1 tag and renders shape values.
Figure 6. End-to-end circuit comparison. RTE-DiT (default): a two-stage circuit where a relation head writes a positional tag via QK with position embeddings, and an object head reads that tag to render the correct shape. T5-DiT: contextual fusion collapses the same computation into a single fused token (here "circle"), which can drive the image directly. This explains why ablating the relation word has little effect, but inserting filler words breaks the layout.
Figure 6 (original). The static circuit schematic as it appears in the paper.

RTE-DiT: two-stage relation β†’ object pipeline

Relation head in RTE-DiT
Figure 4. The spatial relation head implements a QK circuit that aligns positional embeddings with relation semantics, producing a spatial gradient on image tokens.
Object generation head in RTE-DiT
Figure 5. The object generation head mediates between shape tokens and the tagged image regions, completing the two-stage circuit.

T5-DiT: information fusion in object tokens

T5-DiT vector arithmetic on contextual embeddings
Figure 7. In T5-DiT, ablating the relation token leaves spatial accuracy largely intact: the information has already been fused into the contextual embedding of the second shape token. Linear arithmetic on factorized embeddings causally manipulates the generated spatial relation.
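The embedding arithmetic in Figure 7 might look roughly like the sketch below. It assumes a simple mean-difference factorization: the relation component of the second shape token's embedding is estimated as the per-relation mean minus the grand mean, and a relation is swapped by subtracting one component and adding another. The factorization method actually used is not described on this page, and the function names are hypothetical.

```python
import numpy as np


def relation_components(emb):
    """Estimate one relation vector per relation by mean differences.

    emb: [n_rel, n_ctx, d] contextual embeddings of the second shape token,
         grouped by relation (axis 0) and by all other prompt factors (axis 1).
    Returns [n_rel, d]: per-relation mean minus the grand mean.
    """
    return emb.mean(axis=1) - emb.mean(axis=(0, 1))


def swap_relation(e, v_rel, src, tgt):
    """Replace the source relation component of embedding e with the target one."""
    return e - v_rel[src] + v_rel[tgt]
```

If the embeddings were exactly additive in relation and context, the swapped vector would equal the embedding of the same prompt with the target relation; for real T5 embeddings this holds only approximately, and the edited embedding would be fed to the DiT to test whether the generated layout changes.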

Why this matters

The same downstream task is solvable by qualitatively different mechanisms, and which mechanism a model converges to depends on the representational substrate of the text encoder, not the diffusion backbone. T5's rich contextual fusion is an efficient solution in-distribution but creates a brittle dependence on full-sequence context; RTE's tokens, lacking pretraining priors, force the DiT to construct an explicit, modular relation circuit that generalizes more robustly.

This is a concrete case study in how pretrained encoders shape downstream circuits, and a methodological template (attention synopsis plus targeted causal interventions) for dissecting compositional behavior in diffusion models.

Citation

@inproceedings{wang2026circuit,
  title     = {Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers},
  author    = {Wang, Binxu and Fan, Jingxuan and Pan, Xu},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

Acknowledgements

We thank Martin Wattenberg, Yonatan Belinkov, and Thomas Fel for feedback, and participants of the NEMI Workshop and the Mechanistic Interpretability Workshop at NeurIPS 2025 for helpful discussion. This work was supported by the Kempner Research Fellowship and the Schwartz Fellowship, with compute from the Kempner Institute cluster.