Meta's Chameleon AI Model Outperforms GPT-4 on Mixed Image-Text Tasks
Briefly

Chameleon uses a single token-based representation for both text and images, setting it apart from models that rely on separate modality-specific encoders. Its end-to-end training on mixed sequences of text and image data produced output that human judges preferred in trials.
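The idea of a single token-based representation can be illustrated with a minimal sketch: image patches are quantized to discrete codebook ids, offset into a range disjoint from the text vocabulary, and concatenated with text tokens into one flat sequence for an autoregressive transformer. The vocabulary sizes, tokenizer stand-ins, and codebook values below are illustrative assumptions, not Chameleon's actual tokenizers.

```python
# Sketch of early fusion: both modalities become discrete tokens in one
# shared sequence. All sizes and codes here are hypothetical.

TEXT_VOCAB_SIZE = 65_536      # assumed BPE vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192   # assumed VQ image codebook size

def text_tokens(text):
    # Stand-in for a BPE tokenizer: map characters to ids in the text range.
    return [ord(c) % TEXT_VOCAB_SIZE for c in text]

def image_tokens(patch_codes):
    # Image patches are quantized to codebook ids, then offset so they
    # occupy a disjoint id range after the text vocabulary.
    return [TEXT_VOCAB_SIZE + code for code in patch_codes]

def interleave(*segments):
    # A mixed document is just the concatenation of token segments; the
    # transformer sees one flat sequence, with no separate encoders.
    return [tok for seg in segments for tok in seg]

sequence = interleave(
    text_tokens("A photo of a cat: "),
    image_tokens([17, 512, 4095]),   # made-up VQ codes for image patches
    text_tokens(" It is sleeping."),
)
print(len(sequence))  # 18 text + 3 image + 16 text tokens
```

Because text and image ids share one vocabulary, the same next-token objective covers generating either modality at any position in the sequence.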
Meta's Chameleon models, at 7B and 34B parameters, were pre-trained on roughly four trillion mixed tokens and fine-tuned for alignment, achieving state-of-the-art results in visual question answering and image captioning. The model's early-fusion approach allows joint reasoning over text and images, opening new possibilities for multimodal interaction.
Read at InfoQ