Leveraging the Syntactic Structure of the Text Prompt to Enhance Object-Attribute Binding in Image Generation
Date
2024
Publication Type
Conference Paper
ETH Bibliography
yes
Abstract
Current diffusion models can generate photorealistic images from text prompts but often struggle to correctly associate the attributes mentioned in the text with the appropriate objects in the image. To address this issue, we propose focused cross-attention (FCA), which controls visual attention maps using syntactic constraints from the input sentence. Additionally, the syntactic structure of the prompt aids in disentangling the multimodal CLIP embeddings commonly used in text-to-image (T2I) generation. The resulting DisCLIP embeddings and FCA can be easily integrated into state-of-the-art diffusion models without requiring additional training. We demonstrate significant improvements in T2I generation, particularly in the accurate binding of attributes to objects, across multiple datasets.
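The abstract describes focused cross-attention (FCA) as constraining visual attention maps with syntactic relations from the prompt, so that an attribute token attends only where its syntactically bound noun does. The following is a minimal illustrative sketch of that general idea, not the authors' implementation: the function name, the attribute-to-noun `bindings` mapping (which would come from a dependency parse, e.g. "red" bound to "car"), and the mean-threshold focus mask are all assumptions made for illustration.

```python
import numpy as np

def focused_cross_attention(attn, bindings):
    """Illustrative sketch: restrict each attribute token's spatial
    attention to the region where its bound noun attends strongly.

    attn: (num_pixels, num_tokens) array of per-token attention maps.
    bindings: dict mapping attribute token index -> noun token index,
        e.g. {idx("red"): idx("car")} from a dependency parse.
    """
    out = attn.copy()
    for attr_idx, noun_idx in bindings.items():
        noun_map = attn[:, noun_idx]
        # Hypothetical focus mask: keep only pixels where the noun's
        # attention exceeds its spatial mean; zero the attribute's
        # attention elsewhere so it cannot leak onto other objects.
        mask = (noun_map > noun_map.mean()).astype(attn.dtype)
        out[:, attr_idx] = attn[:, attr_idx] * mask
        # Renormalize the attribute's masked attention map.
        total = out[:, attr_idx].sum()
        if total > 0:
            out[:, attr_idx] /= total
    return out
```

With three pixels and two tokens (noun at index 0, attribute at index 1), masking confines the attribute's attention to the pixel where the noun dominates; the paper's actual constraint and normalization may differ.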
Publication status
published
Book title
LGM3A '24: Proceedings of the 2nd Workshop on Large Generative Models Meet Multimodal Applications
Pages / Article No.
6–10
Publisher
Association for Computing Machinery
Event
2nd Workshop on Large Generative Models Meet Multimodal Applications (LGM3A 2024)
Subject
Focused Cross-Attention; Disentangle CLIP; Text-to-Image Generation; Diffusion Models