MLLMs (multimodal large language models) combine a large language model with a vision foundation model to outperform existing foundation models.
Key design choices for MLLMs include image resolution, the visual encoder loss, and the pre-training data mix, in particular the inclusion of interleaved and text-only training data.