Can AI Explain a Joke? Not Quite - But It's Learning Fast | HackerNoon
Briefly

This segment of the study evaluates how well various AI models perform on explainable visual entailment. The researchers examined both off-the-shelf and fine-tuned models, focusing primarily on LLaVA-1.6, a high-performing vision-language model that pairs a large language model with a vision encoder. Several LLaVA configurations were tested, including zero-shot inference and Compositional Chain-of-Thought prompting, highlighting the model's potential without the need for extensive fine-tuning.
We empirically study how several baseline models perform on the task of explainable visual entailment, investigating both off-the-shelf and fine-tuned model performance.
LLaVA is one of the simplest, yet highest-performing VLM architectures currently available, utilizing a pretrained large language model aligned with vision encoders.
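As an illustrative sketch rather than the paper's exact pipeline, the snippet below shows how zero-shot explainable visual entailment might be run with an off-the-shelf LLaVA-1.6 checkpoint through Hugging Face Transformers. The checkpoint name, the hypothesis, and the Compositional Chain-of-Thought style prompt wording are all assumptions for demonstration.

```python
# Minimal sketch: zero-shot visual entailment with an off-the-shelf LLaVA-1.6
# checkpoint (assumed: llava-hf/llava-v1.6-mistral-7b-hf). The prompt wording
# and CCoT-style instruction are illustrative, not the study's exact setup.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

MODEL_ID = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed checkpoint
processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("premise_image.jpg")  # the visual premise
hypothesis = "A person is riding a bicycle in the rain."

# A compositional chain-of-thought style instruction: ask the model to first
# describe the relevant scene elements, then judge entailment and explain why.
prompt = (
    "[INST] <image>\n"
    "First, list the objects and actions visible in the image. "
    "Then decide whether the image entails, contradicts, or is neutral toward "
    f"the hypothesis: '{hypothesis}'. Explain your reasoning. [/INST]"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Because the explanation is generated alongside the entailment label, this kind of setup requires no task-specific fine-tuning, which is what makes the zero-shot and prompted configurations worth comparing against fine-tuned baselines.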
Read at Hackernoon