ScreenAI, based on the PaLI architecture, achieves state-of-the-art performance on tasks such as answering questions about UI screens and infographics, and summarizing or navigating screens. It outperformed other models on benchmarks such as ChartQA, DocVQA, and InfographicVQA.
Google also released three new evaluation datasets for screen-based question answering (QA) so that the research community can develop and assess similar models against more comprehensive benchmarks.
While highlighting ScreenAI's strong results, Google noted that further research is needed to close the gap with larger models such as GPT-4 and Gemini on some tasks; releasing the datasets is intended to encourage that research and broader benchmarking.
ScreenAI, built on the Pathways Language and Image model (PaLI) architecture, combines a Vision Transformer (ViT) with an encoder-decoder Large Language Model (LLM) such as T5. ScreenAI modifies ViT's patching step so that it can accommodate the wide range of resolutions and aspect ratios found in UIs and infographics, as sketched below.
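To make the patching idea concrete, here is a minimal, illustrative Python sketch of flexible, aspect-ratio-preserving patching of the kind ScreenAI describes. The function name `flexible_patch_grid` and its parameters are hypothetical, not part of any released ScreenAI API; the sketch only shows how a patch grid can be chosen to match an image's shape within a fixed patch budget, rather than forcing a fixed square grid.

```python
import math

def flexible_patch_grid(img_h: int, img_w: int,
                        max_patches: int = 1024) -> tuple[int, int]:
    """Choose a (rows, cols) patch grid that preserves the image's
    aspect ratio while keeping rows * cols <= max_patches.

    This illustrates why flexible patching suits UIs: a tall phone
    screenshot gets more rows than columns, a wide desktop screen
    gets more columns than rows, and neither is distorted.
    """
    aspect = img_w / img_h
    # Solve rows * cols ~= max_patches subject to cols / rows ~= aspect.
    rows = max(1, math.floor(math.sqrt(max_patches / aspect)))
    cols = max(1, math.floor(rows * aspect))
    # Shrink if integer rounding pushed the grid over the budget.
    while rows * cols > max_patches:
        if cols > rows:
            cols -= 1
        else:
            rows -= 1
    return rows, cols

# A tall 1080x2400 phone screen vs. a wide 1920x1080 desktop screen.
print(flexible_patch_grid(2400, 1080))  # (47, 21): more rows than columns
print(flexible_patch_grid(1080, 1920))  # (24, 42): more columns than rows
```

With a fixed square grid, both screenshots above would be resized to the same shape, stretching text and UI elements; the flexible grid instead adapts the patch layout to each input's geometry.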