NEMI 2025 Workshop Poster

Aug 22, 2025 · Binxu Wang
Abstract

Diffusion models and their flow-matching variants dominate text-to-image (T2I) generation, yet many pre-trained models struggle to interpret spatial relationships between objects in prompts.

We built a synthetic dataset of paired text–image examples depicting two objects in simple spatial relations (e.g., “red square to the right of blue circle”) and trained PixArt-style Diffusion Transformers (DiTs) from scratch with different prompt encodings: T5, random token embeddings (RTE), and random token embeddings with positional encoding (RTEP). Surprisingly, the choice of embedding had a strong effect: RTEP models learned relations far more reliably than T5 or plain RTE models.
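For concreteness, here is a minimal PyTorch sketch of how the RTE and RTEP prompt encoders could be built. All names are hypothetical, and the abstract does not specify the positional-encoding scheme, so a standard sinusoidal one is assumed here:

```python
import math
import torch
import torch.nn as nn

class RandomTokenEmbedding(nn.Module):
    """A fixed random vector per token id; optionally adds sinusoidal positions (RTEP)."""

    def __init__(self, vocab_size: int, d_model: int, add_pos: bool = False, max_len: int = 64):
        super().__init__()
        # Random embedding table, frozen (a buffer, not a trainable parameter).
        self.register_buffer("emb", torch.randn(vocab_size, d_model))
        self.add_pos = add_pos
        if add_pos:
            # Standard sinusoidal positional encoding (assumed; d_model taken as even).
            pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
            div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                            * (-math.log(10000.0) / d_model))
            pe = torch.zeros(max_len, d_model)
            pe[:, 0::2] = torch.sin(pos * div)
            pe[:, 1::2] = torch.cos(pos * div)
            self.register_buffer("pe", pe)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.emb[token_ids]                     # (batch, seq, d_model)
        if self.add_pos:
            x = x + self.pe[: token_ids.shape[1]]   # RTEP: inject word-order information
        return x

# rte  = RandomTokenEmbedding(vocab_size=100, d_model=256, add_pos=False)  # RTE
# rtep = RandomTokenEmbedding(vocab_size=100, d_model=256, add_pos=True)   # RTEP
```

The only difference between the two conditions is the additive positional term, which is what gives RTEP prompts an explicit notion of word order.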

Using a scalable search over millions of attention maps, we identified a consistent “relational head” in RTEP-trained models: a cross-attention head whose query-key (QK) pattern links image-position encodings to relational word embeddings, and whose value-output (VO) pathway tags the canvas location of the first object. A downstream head then reads this tag to retrieve that object’s shape and color attributes (QK) and translates the words into visual appearance (VO). This mechanism underlies strong compositional generalization to novel object-relation combinations. Ablating the relational head degrades relational accuracy. This two-head circuit parallels positional signaling in developmental biology, where molecular gradients specify location before identity.
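The ablation test can be pictured with a short sketch. Below is a hypothetical PyTorch helper (not the poster’s actual code) that zeroes one head’s slice of a cross-attention output before the output projection, removing that head’s value-output contribution; how the hook is placed and how relational accuracy is then re-measured are assumptions:

```python
import torch

def ablate_head(attn_out: torch.Tensor, head_idx: int, num_heads: int) -> torch.Tensor:
    """Zero one head's slice of a (batch, tokens, num_heads * head_dim) attention output.

    Applied (e.g., via a forward hook) to the concatenated head outputs before the
    output projection, this removes that head's VO contribution to the residual stream.
    """
    b, t, d = attn_out.shape
    out = attn_out.view(b, t, num_heads, d // num_heads).clone()
    out[:, :, head_idx] = 0.0                 # knock out the putative relational head
    return out.view(b, t, d)
```

Comparing relational accuracy of generations with and without this intervention is one way to test whether the identified head is causally responsible for placing the first object.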

The relational head emerged robustly across model scales with RTEP, but not with T5, suggesting that the symmetry of random embeddings facilitates its learning, whereas contextual word embeddings could hinder it. Our results reveal a simple, interpretable attention circuit for mapping relational words to spatial layouts, and highlight how text embedding design can affect learning and generalization of object relationships in T2I models.

Date
Aug 22, 2025

Event
2nd New England Mechanistic Interpretability (NEMI) Workshop

Location
Northeastern University, Curry Student Center

360 Huntington Ave, Boston, MA 02115

Presenting a poster on “The attention mechanism underlying relational object generation in text-to-image diffusion transformers” at the 2nd New England Mechanistic Interpretability (NEMI) Workshop.

Session: Poster Session 1 (11:45 AM - 1:00 PM)