DREAMLLM is a universal learning framework that enhances an MLLM's capabilities in both comprehension and creation, built on a causal decoder-only LLM with integrated visual representations.
The framework uses Vicuna as the foundational language model and OpenAI's CLIP-Large for visual encoding, enabling advanced multimodal comprehension and image synthesis.
By employing Stable Diffusion as the image decoder, DREAMLLM couples text and image generation in a single pipeline, so the two modalities condition and reinforce each other rather than being handled by separate systems.
The architecture is trained with a novel interleaved generative approach that jointly optimizes text comprehension and image creation, improving overall multimodal performance.
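The dataflow described above can be sketched in miniature. The following is an illustrative, self-contained toy (the class names, the `<image>` token, and the stub logic are assumptions for exposition, not DREAMLLM's actual API): input images are encoded into the LLM's embedding space, the causal LLM decodes an interleaved token stream, and whenever an image token is emitted, a diffusion-style decoder is conditioned to synthesize an image in place.

```python
# Hypothetical sketch of DREAMLLM-style interleaved decoding.
# All names and behaviors here are illustrative stand-ins, not the real models.
from dataclasses import dataclass, field
from typing import List

IMG = "<image>"  # special token signalling the LLM wants to emit an image


@dataclass
class StubVisionEncoder:
    """Stands in for CLIP-Large: maps an image to an embedding vector."""
    dim: int = 4

    def encode(self, image: str) -> List[float]:
        # Toy hash-based embedding so the sketch stays self-contained.
        return [float((hash(image) >> (8 * i)) & 0xFF) / 255.0
                for i in range(self.dim)]


@dataclass
class StubLLM:
    """Stands in for the causal decoder-only LLM (e.g. Vicuna)."""
    script: List[str] = field(
        default_factory=lambda: ["A", "cat", IMG, "sits."])

    def generate(self, context_embeddings: List[List[float]]) -> List[str]:
        # A real model would attend over the interleaved embeddings;
        # here we simply replay a fixed token sequence.
        return list(self.script)


@dataclass
class StubImageDecoder:
    """Stands in for Stable Diffusion: turns a conditioning vector into an image."""

    def decode(self, condition: List[float]) -> str:
        return f"image(dim={len(condition)})"


def interleaved_generate(prompt_images: List[str]) -> List[str]:
    enc, llm, dec = StubVisionEncoder(), StubLLM(), StubImageDecoder()
    # 1) Encode any input images into the LLM's embedding space.
    ctx = [enc.encode(im) for im in prompt_images]
    out: List[str] = []
    for tok in llm.generate(ctx):
        if tok == IMG:
            # 2) On an image token, a conditioning vector (stubbed here as the
            #    last context embedding) drives the image decoder in place.
            out.append(dec.decode(ctx[-1] if ctx else [0.0] * enc.dim))
        else:
            out.append(tok)
    return out


print(interleaved_generate(["photo.jpg"]))
```

The key design point this sketch mirrors is that image synthesis is not a separate post-processing step: it is triggered inside the autoregressive decoding loop, which is what lets a single interleaved training objective supervise both modalities.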