Leveraging the Syntactic Structure of the Text Prompt to Enhance Object-Attribute Binding in Image Generation


METADATA ONLY

Date

2024

Publication Type

Conference Paper

ETH Bibliography

yes

Abstract

Current diffusion models can generate photorealistic images from text prompts but often struggle to correctly associate the attributes mentioned in the text with the appropriate objects in the image. To address this issue, we propose focused cross-attention (FCA), which controls visual attention maps using syntactic constraints from the input sentence. Additionally, the syntactic structure of the prompt aids in disentangling the multimodal CLIP embeddings commonly used in text-to-image (T2I) generation. The resulting DisCLIP embeddings and FCA can be easily integrated into state-of-the-art diffusion models without requiring additional training. We demonstrate significant improvements in T2I generation, particularly in the accurate binding of attributes to objects, across multiple datasets.
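The record gives only the abstract, not the paper's exact formulation, but the core idea of focused cross-attention — restricting which image regions an attribute token may attend to, based on the syntactic link between the attribute and its head noun — can be sketched roughly as a mask over cross-attention logits. In this illustrative sketch, `token_groups` and `region_groups` are hypothetical inputs (e.g. derived from a dependency parse and a spatial layout); the actual FCA mechanism in the paper may differ.

```python
import numpy as np

def focused_cross_attention(attn_logits, token_groups, region_groups):
    """Illustrative sketch of syntax-constrained cross-attention.

    Each text token is tagged with a group id (attributes share the group
    of their head noun, e.g. from a dependency parse); each image region
    is tagged with the group it depicts. Attention from a region to tokens
    of a different group is masked out before the softmax, so an attribute
    can only influence its own object's regions.

    attn_logits:   (num_regions, num_tokens) raw cross-attention scores
    token_groups:  length-num_tokens list of group ids
    region_groups: length-num_regions list of group ids
    Assumes every region group contains at least one token.
    """
    num_regions, num_tokens = attn_logits.shape
    mask = np.array([[region_groups[r] == token_groups[t]
                      for t in range(num_tokens)]
                     for r in range(num_regions)])
    # Disallowed (region, token) pairs get -inf, i.e. zero weight after softmax.
    masked = np.where(mask, attn_logits, -np.inf)
    # Numerically stable softmax over tokens, per region.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)
```

For a prompt like "a red car and a blue bird", tokens for "red"/"car" would share one group and "blue"/"bird" another, so regions assigned to the car can never receive attention mass from "blue".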

Publication status

published

Book title

LGM3A '24: Proceedings of the 2nd Workshop on Large Generative Models Meet Multimodal Applications

Pages / Article No.

6 - 10

Publisher

Association for Computing Machinery

Event

2nd Workshop on Large Generative Models Meet Multimodal Applications (LGM3A 2024)

Subject

Focused Cross-Attention; Disentangle CLIP; Text-to-Image Generation; Diffusion Models
