Study shows vision-language models can't handle queries with negation words
Briefly

MIT researchers discovered that vision-language models are prone to misinterpreting queries and captions that contain negation, such as those noting the absence of symptoms in a medical image. This flaw can have significant consequences, especially when these models are used to help assess patients' health. In testing, the models performed no better than random guessing when asked to identify negation in image captions. The team proposed a partial fix: retraining the models on a new dataset that includes negation, which improved accuracy. However, they caution that more work is necessary before these models can be relied on in high-stakes applications.
"Those negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences," says Kumail Alhamoud, MIT graduate student.
The researchers showed that retraining a vision-language model with a new dataset led to significant performance improvements, especially in retrieving images that don't contain certain objects.
Initial tests revealed that these models often perform no better than random guessing when asked to identify negation in image captions, highlighting a critical flaw.
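To make that failure concrete, below is a minimal, hypothetical probe of negation handling using the publicly available CLIP checkpoint on Hugging Face (openai/clip-vit-base-patch32). It is not the MIT team's benchmark or model; the image URL and captions are illustrative placeholders.

```python
# Hypothetical probe: does a CLIP-style model score a negated caption
# lower than an affirmative one for the same image? (Illustrative only;
# checkpoint, image URL, and captions are examples, not the study's data.)
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Widely used COCO sample photo showing two cats lying on a couch.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

captions = [
    "a photo of two cats lying on a couch",   # affirmative, matches image
    "a photo of a couch with no cats on it",  # negated, contradicts image
]

with torch.no_grad():
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # image-to-text similarity
    probs = logits.softmax(dim=1).squeeze()

# A model that understands negation should put most probability on the
# affirmative caption; a model that ignores "no" scores both similarly.
for caption, p in zip(captions, probs):
    print(f"{p.item():.3f}  {caption}")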
The study highlights a crucial shortcoming of vision-language models in handling negation, one that could have serious consequences in medical settings.
Read at ScienceDaily