Using MLLMs for Diffusion Synthesis That Synergizes Both Sides: How Is This Possible? | HackerNoon
Briefly

This paper explores how multimodal large language models (MLLMs) can enhance image generation through diffusion synthesis, leveraging the modality-specific information MLLMs already encode to improve generation quality.
We propose querying MLLMs with learned embeddings, letting the MLLMs' enriched semantics guide the diffusion model's conditioning and expanding the potential for multimodal comprehension.
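The blurb does not spell out the mechanics, but the idea of steering a diffusion model with learned query embeddings can be sketched roughly as follows. This is a minimal, hypothetical PyTorch sketch, not the authors' code: the class name, dimensions, and the assumption that the MLLM accepts `inputs_embeds` and returns a `last_hidden_state` are all illustrative.

```python
import torch
import torch.nn as nn

class LearnedQueryConditioner(nn.Module):
    """Hypothetical sketch: learnable query tokens are appended to the MLLM's
    input sequence; their output hidden states are projected into the
    conditioning space consumed by the diffusion model's cross-attention."""

    def __init__(self, mllm, num_queries=64, mllm_dim=4096, cond_dim=768):
        super().__init__()
        self.mllm = mllm  # assumed: a frozen MLLM exposing hidden states
        # Learnable query embeddings (the "learned embeddings" in the summary)
        self.queries = nn.Parameter(torch.randn(num_queries, mllm_dim) * 0.02)
        # Project from the MLLM hidden size to the diffusion conditioning size
        self.proj = nn.Linear(mllm_dim, cond_dim)

    def forward(self, token_embeds):
        # token_embeds: (batch, seq_len, mllm_dim) embeddings of the multimodal prompt
        batch = token_embeds.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Append the queries so they can attend to the full multimodal context
        inputs = torch.cat([token_embeds, queries], dim=1)
        hidden = self.mllm(inputs_embeds=inputs).last_hidden_state
        # Keep only the positions corresponding to the query tokens
        query_states = hidden[:, -queries.size(1):]
        # These projected states serve as the diffusion conditioning
        return self.proj(query_states)
```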
While multimodal comprehension has been used to enhance creation in language tasks, the reverse potential, improving comprehension through creation, remains largely unexplored in current literature.
Our approach employs Score Distillation Sampling techniques to model distributions in pixel space, advocating for a unique synergy between image generation and language understanding.
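Score Distillation Sampling itself follows a standard recipe: noise the current pixel-space image at a random timestep, ask a frozen pretrained diffusion model (conditioned here on the MLLM-derived embeddings) to predict that noise, and push the weighted residual back onto the image. The sketch below assumes a diffusers-style UNet (`encoder_hidden_states`, `.sample`) and scheduler (`alphas_cumprod`); it illustrates the generic SDS gradient under those assumptions, not the paper's exact training loop.

```python
import torch

def sds_loss(unet, scheduler, image, cond):
    """Generic Score Distillation Sampling step (hedged sketch).

    `unet` is a frozen noise-predicting diffusion backbone, `scheduler`
    holds the forward-diffusion coefficients, `image` lives in pixel space,
    and `cond` is the conditioning produced by the MLLM queries above."""
    device = image.device
    batch = image.shape[0]
    # Sample a random timestep for each image in the batch
    t = torch.randint(20, 980, (batch,), device=device)
    noise = torch.randn_like(image)
    alpha_bar = scheduler.alphas_cumprod.to(device)[t].view(-1, 1, 1, 1)
    # Forward-diffuse the current image to timestep t
    noisy = alpha_bar.sqrt() * image + (1.0 - alpha_bar).sqrt() * noise
    with torch.no_grad():
        eps_pred = unet(noisy, t, encoder_hidden_states=cond).sample
    # SDS gradient: weighted residual between predicted and injected noise
    grad = (1.0 - alpha_bar) * (eps_pred - noise)
    # Surrogate loss whose gradient with respect to `image` equals `grad`
    return (grad.detach() * image).sum()
```

Calling `sds_loss(...).backward()` deposits the SDS gradient on `image`, or on any generator parameters that produced it.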
Read at Hackernoon