Agoda Builds Multimodal Content System to Bridge Images and Reviews in Travel Discovery
Briefly

"Agoda has built a multimodal content system that unifies hotel images and guest reviews into a shared topic-based structure. The goal is to connect visual content and written feedback so users can better understand hotel attributes in a consistent way across images and reviews. The system operates at a very large scale, processing more than 700 million images along with multilingual reviews in over 40 languages."
"The core redesign introduces a shared topic taxonomy that replaces fragmented pipelines with a unified semantic layer. Previously, images and reviews were processed separately with independent ranking and retrieval logic, which made it difficult to correlate what users saw in photos with what was described in reviews. This led to an inconsistent interpretation of hotel features across modalities. By introducing topics such as Pool, Breakfast, Room Quality, and Location as shared anchors, the system maps both visual and textual signals into a common representation space."
"Images are processed using classification models that generate semantic labels such as pool, beach view, and breakfast area, which are normalized into canonical topics. In parallel, reviews are processed through NLP pipelines that extract key phrases, representative snippets, and sentiment signals, all aligned to the same topic taxonomy. This enables each topic to function as a pre-aggregated multimodal package containing curated images, multilingual review excerpts, and sentiment metadata, avoiding runtime joins by precomputing associations offline and serving them through a low-latency retrieval layer."
Agoda built a multimodal content system that connects hotel images and guest reviews through a shared topic-based structure. The system processes more than 700 million images and multilingual reviews in over 40 languages. A shared topic taxonomy replaces the separate image and review pipelines, making it possible to correlate what appears in photos with what is described in reviews. Topics such as Pool, Breakfast, Room Quality, and Location act as common anchors for both modalities. Image classification produces semantic labels that are normalized into canonical topics, while review NLP extracts key phrases, representative snippets, and sentiment aligned to the same taxonomy. Each topic is precomputed offline as a multimodal package of curated images, multilingual excerpts, and sentiment metadata, and is served through a low-latency retrieval layer without runtime joins.
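The flow described above (normalize signals from both modalities into canonical topics, then pre-aggregate one package per hotel and topic) can be illustrated with a minimal Python sketch. This is not Agoda's implementation: the alias table, the row layouts, and the names `normalize`, `TopicPackage`, and `build_packages` are all hypothetical, and an in-memory dict stands in for the low-latency retrieval layer.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical alias table: raw image labels and review phrases -> canonical topic.
CANONICAL_TOPIC = {
    "pool": "Pool",
    "swimming pool": "Pool",
    "breakfast area": "Breakfast",
    "breakfast buffet": "Breakfast",
}


def normalize(raw: str) -> Optional[str]:
    """Map a raw label or phrase to a canonical topic, or None if unmapped."""
    return CANONICAL_TOPIC.get(raw.strip().lower())


@dataclass
class TopicPackage:
    """Pre-aggregated content for one (hotel, topic) pair."""
    images: list = field(default_factory=list)
    excerpts: dict = field(default_factory=dict)  # language code -> snippets
    sentiments: list = field(default_factory=list)

    def mean_sentiment(self) -> Optional[float]:
        if not self.sentiments:
            return None
        return sum(self.sentiments) / len(self.sentiments)


def build_packages(image_rows, review_rows):
    """Offline step: join both modalities on (hotel_id, topic) ahead of time."""
    packages = defaultdict(TopicPackage)
    for hotel_id, label, image_url in image_rows:
        topic = normalize(label)
        if topic is not None:  # labels outside the taxonomy are skipped
            packages[(hotel_id, topic)].images.append(image_url)
    for hotel_id, phrase, lang, snippet, sentiment in review_rows:
        topic = normalize(phrase)
        if topic is not None:
            package = packages[(hotel_id, topic)]
            package.excerpts.setdefault(lang, []).append(snippet)
            package.sentiments.append(sentiment)
    return dict(packages)


# Illustrative data only; hotel IDs, URLs, and scores are made up.
store = build_packages(
    image_rows=[("h1", "pool", "img/1.jpg"), ("h1", "lobby", "img/2.jpg")],
    review_rows=[
        ("h1", "swimming pool", "en", "Great pool", 1.0),
        ("h1", "Swimming Pool", "de", "Toller Pool", 0.5),
    ],
)

# Serving step: a single key lookup, with no join at request time.
package = store.get(("h1", "Pool"))
```

Because the association between images and review excerpts is computed once in `build_packages`, the request path reduces to a key lookup, which is the property the article attributes to precomputing associations offline.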
Read at InfoQ